![Page 1: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/1.jpg)
Feature Extraction for ASR
Spectral (envelope) Analysis
Auditory Model / Normalizations
![Page 2: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/2.jpg)
Deriving the envelope (or the excitation)
excitation e(n) → time-varying filter h_t(n) → y(n) = e(n) * h_t(n)
HOW CAN WE GET e(n) OR h(n) from y(n)?
![Page 3: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/3.jpg)
But first, why?
• Excitation/pitch: for vocoding; for synthesis; for signal transformation; for prosody extraction (emotion, sentence end, ASR for tonal languages …); for voicing category in ASR
• Filter (envelope): for vocoding; for synthesis; for phonetically relevant information for ASR
![Page 4: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/4.jpg)
Spectral Envelope Estimation
• Filters
• Cepstral Deconvolution
(Homomorphic filtering)
• LPC
![Page 5: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/5.jpg)
![Page 6: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/6.jpg)
Channel vocoder (analysis)
e(n)*h(n)
Broad w.r.t harmonics
![Page 7: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/7.jpg)
Bandpass power estimation
Band-pass filter → Rectifier → Low-pass filter
[Figure: waveforms at points A, B, and C of the chain]
![Page 8: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/8.jpg)
Deriving spectral envelope with a filter bank
speech → BP k → rectify → LP k → decimate → magnitude signals (one channel per band, k = 1 … N)
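The chain in this diagram can be sketched numerically. The following is a minimal illustration using only NumPy, not the filters of the original vocoders: the windowed-sinc FIR design, the band edges, the 50 Hz smoothing cutoff, and the decimation factor are all assumptions chosen for the example.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs, num_taps=101):
    """Windowed-sinc low-pass FIR (Hamming window), normalized to unity gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff_hz / fs * n) * np.hamming(num_taps)
    return h / h.sum()

def bandpass_fir(lo_hz, hi_hz, fs, num_taps=101):
    """Band-pass FIR built as the difference of two low-pass filters."""
    return lowpass_fir(hi_hz, fs, num_taps) - lowpass_fir(lo_hz, fs, num_taps)

def filterbank_envelope(x, fs, edges_hz, smooth_hz=50.0, decim=80):
    """Return band magnitudes with shape (num_bands, num_frames)."""
    lp = lowpass_fir(smooth_hz, fs)
    channels = []
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        band = np.convolve(x, bandpass_fir(lo, hi, fs), mode="same")  # BP k
        rectified = np.abs(band)                                      # rectify
        smoothed = np.convolve(rectified, lp, mode="same")            # LP k
        channels.append(smoothed[::decim])                            # decimate
    return np.array(channels)

# Test input: a 1 kHz tone, which should light up the band that contains it.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)
edges = [100, 400, 700, 1300, 2000, 3000]  # hypothetical band edges in Hz
env = filterbank_envelope(x, fs, edges)
```

With these settings each row of `env` is one magnitude signal at a 100 Hz frame rate, and the 700 to 1300 Hz channel dominates for the 1 kHz tone.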
![Page 9: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/9.jpg)
![Page 10: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/10.jpg)
Filterbank properties
• Original Dudley Voder/Vocoder: 10 filters, 300 Hz bandwidth (based on # fingers!)
• A decade later, Vaderson used 30 filters, 100 Hz bandwidth (better)
• Using variable frequency resolution, can use 16 filters with the same quality
![Page 11: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/11.jpg)
Mel filterbank
• Warping function B(f) = 1125 ln (1 + f/700)
• Based on listening experiments with pitch
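As a small worked example of the warping function B(f), the sketch below implements it together with its algebraic inverse and uses both to place filter center frequencies uniformly on the mel axis. The function names, the choice of 16 filters, and the 100 to 4000 Hz range are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Warping function from the slide: B(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * np.log(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of B(f), obtained by solving the formula above for f."""
    return 700.0 * (np.exp(np.asarray(m, dtype=float) / 1125.0) - 1.0)

def mel_center_frequencies(num_filters, f_lo, f_hi):
    """Center frequencies (Hz) spaced uniformly in mel between f_lo and f_hi."""
    mels = np.linspace(hz_to_mel(f_lo), hz_to_mel(f_hi), num_filters + 2)
    return mel_to_hz(mels[1:-1])  # drop the two outer edge points

centers = mel_center_frequencies(16, 100.0, 4000.0)
```

Because the spacing is uniform in mel, the gaps between successive entries of `centers` grow with frequency, which is the variable frequency resolution the filter bank is meant to provide.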
![Page 12: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/12.jpg)
Towards other deconvolution methods
• Filters seem biologically plausible
• Other operations could potentially separate excitation from filter
• Periodic source provides harmonics (close together in frequency)
• Filter provides broad influence (envelope) on harmonic series
• Can we use these facts to separate?
![Page 13: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/13.jpg)
“Homomorphic” processing
• Linear processing is well-behaved
• Some simple nonlinearities also permit simple processing, interpretation
• Logarithm a good example; multiplicative effects become additive
• Sometimes in additive domain, parts more separable
• Famous example: blind deconvolution of Caruso recordings
![Page 14: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/14.jpg)
Oppenheim: Then all speech compression systems and many speech recognition systems are oriented toward doing this deconvolution, then processing things separately, and then going on from there. A very different application of homomorphic deconvolution was something that Tom Stockham did. He started it at Lincoln and continued it at the University of Utah. It has become very famous, actually. It involves using homomorphic deconvolution to restore old Caruso recordings.
Goldstein: I have heard about that.
Oppenheim: Yes. So you know that's become one of the well-known applications of deconvolution for speech.
…
Oppenheim: What happens in a recording like Caruso's is that he was singing into a horn to make the recording. The recording horn has an impulse response, and that distorts the effect of his voice, like my talking like this. [cupping his hands around his mouth]
Goldstein: Okay.
IEEE Oral History Transcripts: Oppenheim on Stockham’s Deconvolution of Caruso Recordings (1)
![Page 15: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/15.jpg)
Oppenheim: So there is a reverberant quality to it. Now what you want to do is deconvolve that out, because what you hear when I do this [cupping his hands around his mouth] is the convolution of what I'm saying and the impulse response of this horn. Now you could say, "Well why don't you go off and measure it. Just get one of those old horns, measure its impulse response, and then you can do the deconvolution." The problem is that the characteristics of those horns changed with temperature, and they changed with the way they were turned up each time. So you've got to estimate that from the music itself. That led to a whole notion which I believe Tom launched, which is the concept of blind deconvolution. In other words, being able to estimate from the signal that you've got the convolutional piece that you want to get rid of. Tom did that using some of the techniques of homomorphic filtering. Tom and a student of his at Utah named Neil Miller did some further work. After the deconvolution, what happens is you apply some high pass filtering to the recording. That's what it ends up doing. What that does is amplify some of the noise that's on the recording. Tom and Neil knew Caruso's singing. You can use the homomorphic vocoder that I developed to analyze the singing and then resynthesize it. When you resynthesize it you can do so without the noise. They did that, and of course what happens is not only do you get rid of the noise but you get rid of the orchestra. That's actually become a very fun demo which I still play in my class. This was done twenty years ago, but it's still pretty dramatic. You hear Caruso singing with the orchestra, then you can hear the enhanced version after the blind deconvolution, and then you can also hear the result after you get rid of the orchestra. Getting rid of the orchestra is something you can't do with linear filtering. It has to be a nonlinear technique.
IEEE Oral History Transcripts (2)
![Page 16: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/16.jpg)
Log processing
• Suppose y(n) = e(n)*h(n)
• Then Y(f) = E(f)H(f)
• And log Y(f) = log E(f) + log H(f)
• In some cases, these pieces are separable by a linear filter
• If all you want is H, processing can smooth Y(f)
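The additivity claimed above is easy to check numerically. The sketch below is an illustration with made-up signals: the noise excitation and the decaying impulse response are assumptions, and magnitudes are used so that the logarithm is real-valued.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(256)          # hypothetical excitation e(n)
h = 0.9 ** np.arange(32)              # hypothetical filter impulse response h(n)
y = np.convolve(e, h)                 # y(n) = e(n) * h(n)

# Zero-pad to at least len(e) + len(h) - 1 so the DFT product equals linear convolution.
nfft = 512
log_Y = np.log(np.abs(np.fft.rfft(y, nfft)))
log_E = np.log(np.abs(np.fft.rfft(e, nfft)))
log_H = np.log(np.abs(np.fft.rfft(h, nfft)))

additive = np.allclose(log_Y, log_E + log_H)
```

Here `additive` comes out true: convolution in time became multiplication in frequency, and the logarithm turned that product into a sum.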
![Page 17: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/17.jpg)
![Page 18: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/18.jpg)
![Page 19: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/19.jpg)
Source-filter separation by cepstral analysis
Windowed speech → FFT → Log magnitude → FFT → Time separation → Spectral function
                                                              → Excitation → Pitch detection
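The pipeline in this diagram can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the 2 ms liftering cutoff, the 60 to 400 Hz pitch search range, and the synthetic test signal (a 100 Hz pulse train through a single 700 Hz resonance) are assumptions. The second transform is written as an inverse FFT, which for a real, even log-magnitude sequence differs from a forward FFT only by a scale factor.

```python
import numpy as np

def real_cepstrum(frame, nfft=1024):
    """Windowed speech -> FFT -> log magnitude -> (inverse) FFT."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), nfft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    return np.fft.irfft(log_mag, nfft)

def separate(frame, fs, nfft=1024, cutoff_ms=2.0, f0_range=(60.0, 400.0)):
    """Return (smoothed log spectrum, pitch estimate in Hz)."""
    c = real_cepstrum(frame, nfft)
    # Time separation: keep only low quefrencies (and their mirror image).
    n_cut = int(cutoff_ms * 1e-3 * fs)
    lifter = np.zeros(nfft)
    lifter[:n_cut] = 1.0
    lifter[nfft - n_cut + 1:] = 1.0
    log_envelope = np.fft.rfft(c * lifter).real
    # Pitch detection: largest cepstral peak in the plausible pitch-period range.
    q_lo, q_hi = int(fs / f0_range[1]), int(fs / f0_range[0])
    period = q_lo + int(np.argmax(c[q_lo:q_hi]))
    return log_envelope, fs / period

# Synthetic voiced frame for illustration only.
fs = 8000
excitation = np.zeros(2000)
excitation[::80] = 1.0                               # 100 Hz pulse train
n = np.arange(200)
r, theta = 0.95, 2 * np.pi * 700 / fs                # hypothetical resonance
h = r ** n * np.sin((n + 1) * theta) / np.sin(theta)  # two-pole impulse response
y = np.convolve(excitation, h)[:2000]
log_env, f0 = separate(y[200:712], fs)
```

The low-quefrency part yields the smooth spectral function, while the peak near a quefrency of 80 samples reflects the 100 Hz excitation.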
![Page 20: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/20.jpg)
Cepstral features
• Typically truncated (smoothing)
• Corresponds to spectral envelope estimation
• Features also are roughly orthogonal
• Common transformation for many spectral features, e.g.,
  - filter bank energies
  - FFT power
  - LPC coefficients
• Used almost universally for ASR (in some form)
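For the filter bank case, one common way to obtain truncated cepstra is a cosine transform of the log channel energies. The slides do not specify the transform, so the unnormalized DCT-II below, the function name, and the choice of 16 channels and 8 coefficients are assumptions.

```python
import numpy as np

def cepstra_from_log_energies(log_energies, num_ceps):
    """Unnormalized DCT-II of log filter-bank energies, truncated to num_ceps terms."""
    log_energies = np.asarray(log_energies, dtype=float)
    n = len(log_energies)
    k = np.arange(num_ceps)[:, None]   # cepstral index
    m = np.arange(n)[None, :]          # filter-bank channel index
    basis = np.cos(np.pi * k * (m + 0.5) / n)
    return basis @ log_energies

# A perfectly flat log spectrum has no envelope shape: only c[0] is nonzero.
flat = cepstra_from_log_energies(np.ones(16), num_ceps=8)
```

Keeping only the first few coefficients discards fast ripple across channels, which is the smoothing role of truncation noted in the list above.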
![Page 21: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/21.jpg)
Key Processing Step for ASR: Cepstral Mean Subtraction
• Imagine a fixed filter h(n), so y(n)=h(n)*x(n)
• Same arguments as before, but
  - let x vary over time
  - let h be fixed over time
• Then average cepstra should represent the fixed component (including fixed part of x)
• (Think about it)
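One way to think about it: a fixed filter adds the same cepstral vector to every frame, so subtracting the per-utterance mean removes it. The sketch below illustrates this with random stand-in data; the array shapes and variable names are assumptions, and real cepstra would come from an analysis front end.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the time-average cepstrum; cepstra has shape (num_frames, num_coeffs)."""
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
speech_cepstra = rng.standard_normal((200, 13))  # stand-in for time-varying x
channel = rng.standard_normal(13)                # stand-in for the fixed h
observed = speech_cepstra + channel              # convolution is additive in cepstra

same = np.allclose(cepstral_mean_subtraction(observed),
                   cepstral_mean_subtraction(speech_cepstra))
```

Here `same` is true: the normalized features no longer depend on the fixed channel. As the slide notes, the fixed part of x is removed along with it.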
![Page 22: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/22.jpg)
An alternative: Incorporate Production
• Assume simple excitation/vocal tract model
• Assume cascaded resonators for vocal tract frequency response (envelope)
• Find resonator parameters for best spectral approximation
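One standard way to write such a cascade is as an all-pole transfer function. The notation below is an assumption, since the original equations are on slides that did not survive as text: G is a gain, each second-order section i has pole radius r_i (setting bandwidth) and pole angle θ_i (setting center frequency), and multiplying out the sections gives a single polynomial with coefficients a_k.

```latex
H(z) \;=\; \frac{G}{\prod_{i=1}^{p/2}\left(1 - 2 r_i \cos\theta_i \, z^{-1} + r_i^{2} z^{-2}\right)}
     \;=\; \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}
```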
![Page 23: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/23.jpg)
![Page 24: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/24.jpg)
![Page 25: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/25.jpg)
![Page 26: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/26.jpg)
![Page 27: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/27.jpg)
![Page 28: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/28.jpg)
![Page 29: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/29.jpg)
![Page 30: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/30.jpg)
Some LPC Issues
• Error criterion
• Model order
![Page 31: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/31.jpg)
![Page 32: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/32.jpg)
LPC Peak Modeling
• Total error constrained to be (at best) gain factor squared
• Error where model spectrum is larger contributes less
• Model spectrum tends to “hug” peaks
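These three points follow from the frequency-domain form of the LPC error. The notation here is an assumption rather than the slides' own: P(ω) is the signal power spectrum, A(z) = 1 − Σ a_k z^{-k} is the inverse filter, and G is the gain.

```latex
\hat{P}(\omega) = \frac{G^{2}}{\left|A(e^{j\omega})\right|^{2}},
\qquad
E = \frac{1}{2\pi}\int_{-\pi}^{\pi} P(\omega)\left|A(e^{j\omega})\right|^{2} d\omega
  = \frac{G^{2}}{2\pi}\int_{-\pi}^{\pi} \frac{P(\omega)}{\hat{P}(\omega)}\, d\omega
```

At the minimum the error equals G², so the ratio P/P̂ averages to one. Frequencies where the model exceeds the signal contribute a ratio below one, while frequencies where the signal exceeds the model contribute more than one, so the fit is pulled toward the spectral peaks.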
![Page 33: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/33.jpg)
LPC Spectrum
![Page 34: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/34.jpg)
More effects of error criterion
• Globally tracks, but worse match in log spectrum for low values
• “Attempts” to model anti-aliasing filter, mic response
• Ill-conditioned for wide-ranging spectral values
![Page 35: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/35.jpg)
Other LPC properties
• Behavior in noise
• Sharpness of peaks
• Speaker dependence
![Page 36: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/36.jpg)
Model Order
• Too few coefficients: can’t represent formants
• Too many: models detail, especially harmonics
• Too many: low error, but ill-conditioned matrices
![Page 37: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/37.jpg)
LPC Model Order
![Page 38: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/38.jpg)
![Page 39: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/39.jpg)
Optimal Model Order
• Akaike Information Criterion (AIC)
• Cross-validation (trial and error)
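For an autoregressive model, the AIC is commonly written in the form below, up to additive constants. The symbols are assumptions: N is the number of samples in the analysis frame and E_p is the mean squared prediction error at order p. The chosen order is the p that minimizes the criterion, trading lower error against the penalty for extra coefficients.

```latex
\mathrm{AIC}(p) = N \ln E_p + 2p
```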
![Page 40: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/40.jpg)
Coefficient Estimation
• Minimize squared error - set derivatives to zero
• Compute in blocks or on-line
• For blocks, use autocorrelation or covariance methods (pertains to windowing, edge effects)
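Written out, the minimization leads to a set of linear equations in the predictor coefficients. The convention below is an assumption (the prediction of s(n) is Σ a_k s(n−k)); the original derivation is on slides that did not survive as text.

```latex
E = \sum_{n}\Big(s(n) - \sum_{k=1}^{p} a_k\, s(n-k)\Big)^{2},
\qquad
\frac{\partial E}{\partial a_i} = 0
\;\Longrightarrow\;
\sum_{k=1}^{p} a_k\, \phi(i,k) = \phi(i,0), \quad 1 \le i \le p,
\qquad
\phi(i,k) = \sum_{n} s(n-i)\, s(n-k)
```

In the autocorrelation method the frame is windowed and the sum runs over all n, so φ(i,k) reduces to the autocorrelation R(|i−k|). In the covariance method the sum runs over a fixed range of n without windowing, so φ(i,k) is symmetric but depends on i and k separately.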
![Page 41: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/41.jpg)
![Page 42: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/42.jpg)
Solving the Equations
• Autocorrelation method: Levinson or Durbin recursions, O(P^2) ops; uses Toeplitz property (constant along left-right diagonals), guaranteed stable
• Covariance method: Cholesky decomposition, O(P^3) ops; just uses symmetry property, not guaranteed stable
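The autocorrelation-method recursion can be sketched as below. This is a minimal illustration under an assumed sign convention (the prediction of s(n) is Σ a[k] s(n−k)); the function names and the synthetic second-order test signal are not from the slides.

```python
import numpy as np

def autocorrelation(x, order):
    """Autocorrelation lags R(0..order) of a (windowed) frame."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations in O(order^2) operations.

    Returns (predictor a[1..p], reflection coefficients k[1..p], final error).
    """
    a = np.zeros(order + 1)
    k = np.zeros(order + 1)
    err = r[0]
    for i in range(1, order + 1):
        # Correlation not yet explained by the order-(i-1) predictor.
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k[i] = acc / err
        a_prev = a.copy()
        a[i] = k[i]
        a[1:i] = a_prev[1:i] - k[i] * a_prev[i - 1:0:-1]
        err *= 1.0 - k[i] ** 2
    return a[1:], k[1:], err

# Synthetic test signal (assumption): noise through a stable two-pole filter.
rng = np.random.default_rng(0)
w = rng.standard_normal(2000)
s = np.zeros(2000)
for n in range(2, 2000):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + w[n]

frame = s[1000:1400] * np.hamming(400)
r = autocorrelation(frame, 4)
a, k, err = levinson_durbin(r, 4)
```

The reflection coefficients in `k` all have magnitude below one, which corresponds to the stability guarantee noted for this method, and `a` matches a direct solution of the same Toeplitz system.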
![Page 43: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/43.jpg)
LPC-based representations
• Predictor polynomial - a_i, 1 <= i <= p, direct computation
• Root pairs - roots of polynomial, complex pairs
• Reflection coefficients - recursion; interpolated values always stable (also called PARCOR coefficients k_i, 1 <= i <= p)
• Log area ratios = ln((1-k)/(1+k)), low spectral sensitivity
• Line spectral frequencies - freq. pts around resonance; low spectral sensitivity, stable
• Cepstra - can be unstable, but useful for recognition
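Two of these conversions are short enough to sketch. The log area ratio follows the formula in the list; the cepstral recursion is the standard one for an all-pole model 1/(1 − Σ a_k z^{-k}), which is an assumed sign convention. Function names are hypothetical and the gain term c_0 is omitted.

```python
import numpy as np

def log_area_ratios(k):
    """Log area ratios from reflection coefficients: ln((1 - k) / (1 + k))."""
    k = np.asarray(k, dtype=float)
    return np.log((1.0 - k) / (1.0 + k))

def lpc_to_cepstrum(a, num_ceps):
    """Cepstra c[1..num_ceps] of the all-pole model 1 / (1 - sum_k a_k z^-k)."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(num_ceps + 1)  # c[0] (log gain) is left unused
    for n in range(1, num_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for j in range(max(1, n - p), n):
            acc += (j / n) * c[j] * a[n - j - 1]
        c[n] = acc
    return c[1:]

# Single real pole at 0.5: the exact cepstrum is 0.5**n / n.
ceps = lpc_to_cepstrum([0.5], 3)
lar = log_area_ratios([0.0, 0.5])
```

The single-pole case gives a quick sanity check because its cepstrum is known in closed form.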
![Page 44: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/44.jpg)
Autocorrelation Analysis
![Page 45: Feature Extraction for ASR](https://reader035.vdocuments.us/reader035/viewer/2022062520/568158d6550346895dc61e5c/html5/thumbnails/45.jpg)
Spectral Estimation
Methods compared (columns): Filter Banks, Cepstral Analysis, LPC
Properties (rows):
• Reduced Pitch Effects
• Excitation Estimate
• Direct Access to Spectra
• Less Resolution at HF
• Orthogonal Outputs
• Peak-hugging Property
• Reduced Computation
[Table: X marks indicate which methods have each property]