Speech Recognition Feature Extraction

Upload: willis-stevens

Post on 05-Jan-2016

Speech recognition simplified block diagram

[Figure: block diagram with the blocks Speech Capture, Feature Extraction, Training, Models, Pattern Matching, Process Results and Text]

Speech capture

• Use a good-quality noise-cancelling microphone

• Use a bandwidth of 4 kHz for phone speech

• Use a bandwidth of 8 kHz for desktop speech

• Sample at 8 kHz or 16 kHz respectively (twice the bandwidth)

• Anti-alias filter the input before sampling

• Avoid background noise

• Speak clearly but naturally

Spectral Features

• Need to extract key frequency components

• Visible in a spectrogram – 2D real-time examples

Feature extraction

• Need to extract frequency content (spectrogram)

• Matching on raw data is inefficient

– Much of the raw data is redundant

– Analyse the signal and extract key features

– The same word spoken by different people looks very different in the time domain

– In the frequency domain, patterns are more evident

• Generally use Mel Frequency Cepstral Coefficients (MFCCs)

The process

• MFCCs are short-term spectral features. They are calculated as follows:

– Divide the signal into frames

– For each frame, obtain the amplitude spectrum

– Convert to the mel spectrum

– Take the natural logarithm

– Take the discrete cosine transform (DCT) to obtain the mel cepstrum

Divide signal into frames

Apply a window function – typically a Hamming window

• Select about 25 ms of speech data and window it to cleanly cut it out of the data stream

• Shift the window by about 10 ms and do the same continuously
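
The framing step can be sketched as follows. This is a minimal illustration, not the Sphinx III front-end: the function names `hamming` and `frame_signal`, the 440 Hz test tone, and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
import math

def hamming(n_samples):
    """Hamming window: 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Cut `signal` into overlapping, Hamming-windowed frames."""
    frame_len = sample_rate * frame_ms // 1000   # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000       # 80 samples at 8 kHz
    window = hamming(frame_len)
    frames = []
    # Assumption: only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(signal) - frame_len + 1, shift):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# One second of a 440 Hz tone sampled at 8 kHz (illustrative test signal).
fs = 8000
tone = [math.sin(2 * math.pi * 440 * n / fs) for n in range(fs)]
frames = frame_signal(tone, fs)
print(len(frames), len(frames[0]))   # 98 frames of 200 samples each
```

Because the shift (10 ms) is shorter than the frame (25 ms), consecutive frames overlap by 15 ms, so each stretch of speech contributes to more than one feature vector.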

• Why Hamming? Why not rectangular?
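
One way to explore this question numerically is to compare the spectral leakage of the two windows. The sketch below evaluates each window's spectrum on a fine frequency grid and reports the highest sidelobe relative to the main-lobe peak. The helper name `peak_sidelobe_db` and the grid density are assumptions for illustration; a rectangular window should come out at roughly -13 dB, while the Hamming window should be around 40 dB down or better.

```python
import cmath
import math

def peak_sidelobe_db(window, points_per_bin=16):
    """Highest sidelobe of the window spectrum, in dB relative to the main-lobe peak."""
    n = len(window)
    steps = n * points_per_bin // 2           # scan from 0 up to half the sample rate
    mags = []
    for k in range(steps):
        omega = 2 * math.pi * k / (n * points_per_bin)
        total = sum(w * cmath.exp(-1j * omega * i) for i, w in enumerate(window))
        mags.append(abs(total))
    # Walk down the main lobe until the magnitude starts rising again.
    i = 1
    while i < len(mags) - 1 and mags[i] <= mags[i - 1]:
        i += 1
    # Everything from index i onwards belongs to the sidelobes.
    return 20 * math.log10(max(mags[i:]) / mags[0])

frame_len = 200                                # 25 ms at 8 kHz
rectangular = [1.0] * frame_len
hamming = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
           for n in range(frame_len)]

rect_db = peak_sidelobe_db(rectangular)
hamming_db = peak_sidelobe_db(hamming)
print(f"rectangular: {rect_db:.1f} dB, Hamming: {hamming_db:.1f} dB")
```

Lower sidelobes mean that a strong frequency component smears less energy into distant bins, at the cost of a wider main lobe.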

• Now have a series of vectors being produced

– If sampling at 8 kHz, then sample period = 125 µs

– Vector size = 25 ms / 125 µs = 25000 / 125 = 200-element array

• Feed the speech frame into an FFT to get the frequency components of that slice

• Calculate the power of the spectrum for each element of the vector

– s[k] = (Re X[k])^2 + (Im X[k])^2, where X[k] is the k-th FFT coefficient

• Use a set of filters to split the spectrum into frequency bands

– Typically use mel-scale filters to match the response of the basilar membrane, and get the energy in each band

– Sphinx III uses 40 filters over an 8 kHz bandwidth
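
The FFT and power-spectrum steps can be sketched as below. The small recursive radix-2 `fft` is included only to keep the example self-contained; a real front-end would use an optimised FFT library. The FFT size of 256 (the next power of two above 200 samples) and the un-windowed 1 kHz test frame are assumptions for illustration.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def power_spectrum(frame, nfft=256):
    """s[k] = Re(X[k])^2 + Im(X[k])^2 for k = 0 .. nfft/2."""
    padded = list(frame) + [0.0] * (nfft - len(frame))   # zero-pad to the FFT size
    spectrum = fft(padded)
    return [c.real ** 2 + c.imag ** 2 for c in spectrum[:nfft // 2 + 1]]

# A 25 ms frame (200 samples) of a 1 kHz cosine sampled at 8 kHz.
fs = 8000
frame = [math.cos(2 * math.pi * 1000 * n / fs) for n in range(200)]
power = power_spectrum(frame)
peak_bin = max(range(len(power)), key=power.__getitem__)
print(peak_bin, peak_bin * fs / 256)   # bin 32, i.e. 1000.0 Hz
```

Only bins 0 to nfft/2 are kept because the spectrum of a real signal is symmetric, so the upper half carries no extra information.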

• The ear's frequency response is non-linear (f in Hz)

– Mel(ody): $m = 1127.01048 \ln(1 + f/700)$

– Inverse: $f = 700\,(e^{m/1127.01048} - 1)$

– Bark: $b = 13 \arctan(0.76 f / 1000) + 3.5 \arctan((f/7500)^2)$
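
These scale conversions translate directly into code. The function names below are assumptions chosen for the example; note that the inverse mel mapping divides by 1127.01048, and the Bark formula divides f by 1000 and by 7500.

```python
import math

def hz_to_mel(f):
    """Mel value for a frequency f in Hz."""
    return 1127.01048 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel: frequency in Hz for a mel value m."""
    return 700.0 * (math.exp(m / 1127.01048) - 1.0)

def hz_to_bark(f):
    """Bark value for a frequency f in Hz."""
    return 13.0 * math.atan(0.76 * f / 1000.0) + 3.5 * math.atan((f / 7500.0) ** 2)

for f in (100, 1000, 4000, 8000):
    print(f, round(hz_to_mel(f), 1), round(hz_to_bark(f), 2))
```

The constants are chosen so that 1000 Hz maps to approximately 1000 mel; above that, equal steps in mel correspond to increasingly wide steps in Hz.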

• Calculate the mel spectrum by multiplying the power spectrum by each of the triangular mel weighting filters and integrating the result.
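
A minimal sketch of this step is shown below. It assumes unit-height triangles whose edges are equally spaced on the mel scale, and it replaces the integration with a sum over FFT bins; the actual Sphinx III filter shapes and normalisation may differ. The names `mel_filterbank` and `mel_spectrum`, and the 16 kHz / 512-point settings in the usage lines, are assumptions for the example.

```python
import math

def hz_to_mel(f):
    return 1127.01048 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1127.01048) - 1.0)

def mel_filterbank(n_filters, nfft, sample_rate, f_low=0.0, f_high=None):
    """Triangular filters with edges equally spaced on the mel scale."""
    f_high = sample_rate / 2 if f_high is None else f_high
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    # n_filters + 2 edge frequencies: filter j spans edges[j] .. edges[j + 2].
    edges = [mel_to_hz(lo + (hi - lo) * j / (n_filters + 1))
             for j in range(n_filters + 2)]
    bin_hz = sample_rate / nfft
    bank = []
    for j in range(n_filters):
        left, centre, right = edges[j], edges[j + 1], edges[j + 2]
        weights = []
        for k in range(nfft // 2 + 1):
            f = k * bin_hz
            if left < f <= centre:
                weights.append((f - left) / (centre - left))    # rising slope
            elif centre < f < right:
                weights.append((right - f) / (right - centre))  # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mel_spectrum(power, bank):
    """S[i] = sum over k of power[k] * filter_i[k]."""
    return [sum(p * w for p, w in zip(power, weights)) for weights in bank]

# 40 filters over an 8 kHz bandwidth, applied to a flat power spectrum.
bank = mel_filterbank(n_filters=40, nfft=512, sample_rate=16000)
flat_power = [1.0] * (512 // 2 + 1)
S = mel_spectrum(flat_power, bank)
print(len(S), round(S[0], 3), round(S[-1], 3))
```

With a flat input, the higher filters collect more energy than the lower ones simply because they are wider in Hz, which is the non-linear frequency resolution the mel scale is meant to provide.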

• Calculate the mel cepstrum

– A DCT is applied to the natural logarithm of the mel spectrum to obtain the mel cepstrum.

– C is the number of cepstral coefficients required (n = 0 to 12 gives 13 for Sphinx III), L is the number of filter banks and S[i] is the mel spectrum coefficient, one for each filter output.

– C is usually less than L, as the DCT has the effect of compressing the spectrum such that the bulk of the information is in the first few coefficients. Sphinx III uses 40 filters but keeps only the first 13 cepstral coefficients.
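
The log and DCT steps can be sketched as follows. The example assumes the common unnormalised DCT-II form, c[n] = sum over i = 0..L-1 of ln(S[i]) * cos(pi * n * (2i + 1) / (2L)) for n = 0..C-1, which matches the symbols used above but is not confirmed by the slides. The function name `mel_cepstrum` and the small floor that guards against log(0) are also assumptions.

```python
import math

def mel_cepstrum(mel_spec, n_ceps=13):
    """Return the first n_ceps DCT-II coefficients of the log mel spectrum."""
    L = len(mel_spec)
    # Assumption: floor tiny values so that the logarithm is always defined.
    log_spec = [math.log(max(s, 1e-10)) for s in mel_spec]
    return [sum(ls * math.cos(math.pi * n * (2 * i + 1) / (2 * L))
                for i, ls in enumerate(log_spec))
            for n in range(n_ceps)]

# Sanity check: a flat mel spectrum has all its log energy in c[0].
flat = [math.e] * 40                # ln(e) = 1 for each of the 40 filters
ceps = mel_cepstrum(flat)
print(len(ceps), round(ceps[0], 6), max(abs(c) for c in ceps[1:]) < 1e-9)
```

Keeping 13 of 40 possible coefficients discards the fine detail of the log spectrum and retains its overall shape.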

Default values for the SPHINX III front-end

• Typical Feature Extraction Block Diagram