speech recognition feature extraction. speech recognition simplified block diagram speech capture...

Speech Recognition

Feature Extraction

Speech recognition simplified block diagram

[Figure: block diagram with the stages Speech Capture, Feature Extraction, Training, Models, Pattern Matching, Process Results and Text]

Speech capture

• Use good quality noise cancelling mic

• Use bandwidth of 4kHz for phone

• Use bandwidth of 8kHz for desktop

• Sample at 8kHz or 16 kHz

• Alias filter the input

• Avoid background noise

• Speak clearly but naturally
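The bandwidth and sample-rate figures above are linked by the Nyquist criterion: the sample rate must be at least twice the bandwidth to be captured. A minimal sketch of that arithmetic follows; the function name `min_sample_rate_hz` is hypothetical and used only for illustration.

```python
def min_sample_rate_hz(bandwidth_hz):
    """Lowest sample rate that can represent a signal of the given bandwidth."""
    return 2 * bandwidth_hz

# Bandwidths quoted in the list above.
for label, bandwidth_hz in [("phone", 4000), ("desktop", 8000)]:
    print(label, min_sample_rate_hz(bandwidth_hz), "Hz")
```

This is why a 4 kHz phone channel pairs with 8 kHz sampling and an 8 kHz desktop channel with 16 kHz sampling, and why the input needs an alias filter to remove anything above half the sample rate.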

Spectral Features

• Need to extract key frequency components

• Visible in a spectrogram – 2D real-time examples

Feature extraction

• Need to extract frequency content (spectrogram)

• Matching on raw data is inefficient

– Much of the data is redundant

– Analyse the signal and extract key features

– The same word spoken by different people looks very different in the time domain

– In the frequency domain, patterns are more evident

• Generally use Mel Frequency Cepstral Coefficients

The process

• MFCCs are short-term spectral features. They are calculated as follows:

– Divide the signal into frames

– For each frame, obtain the amplitude spectrum

– Convert to the mel spectrum

– Take the natural logarithm

– Take the discrete cosine transform (DCT) to obtain the mel cepstrum

Divide signal into frames

Apply window function – typically Hamming window

• Select about 25 ms of speech data and window it to cleanly cut it out of the data stream

• Shift window by about 10 ms and do the same continuously

• Why Hamming? Why not rectangular?
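One way to look at the question is to compare the two windows sample by sample. A rectangular window cuts the frame off abruptly, and the resulting discontinuities at the frame edges smear energy across the spectrum (spectral leakage). A Hamming window tapers the edges down to about 0.08, which reduces that leakage. The sketch below uses the common textbook definition of the Hamming window and assumes a 200-sample frame; the function names are illustrative only.

```python
import math

def hamming(n_samples):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def rectangular(n_samples):
    """Rectangular window: every sample is kept at full weight."""
    return [1.0] * n_samples

def apply_window(frame, window):
    """Multiply a frame by a window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]

window = hamming(200)
# Edge weights are small, centre weights are close to 1.
print(round(window[0], 2), round(window[100], 2), round(window[-1], 2))
```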

• Now have a series of vectors being produced

– If sampling at 8 kHz then sample period = 125 µs

– Vector size = 25 ms / 125 µs = 25000 / 125 = 200 element array
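The frame arithmetic above can be sketched in a few lines. This is a minimal illustration that assumes the 8 kHz sample rate, 25 ms frame and 10 ms shift quoted in the slides. The name `frame_signal` is hypothetical, and dropping a trailing partial frame is a simplification chosen here.

```python
def frame_signal(samples, sample_rate_hz=8000, frame_ms=25, shift_ms=10):
    """Split a list of samples into overlapping fixed-length frames."""
    frame_len = sample_rate_hz * frame_ms // 1000   # 200 samples at 8 kHz
    shift = sample_rate_hz * shift_ms // 1000       # 80 samples at 8 kHz
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

one_second = [0.0] * 8000            # one second of silence at 8 kHz
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))   # 98 frames of 200 samples each
```

Because the shift is shorter than the frame, consecutive frames overlap by 15 ms.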

• Feed the speech frame into an FFT to get the frequency components of that slice

• Calculate the power of the spectrum for each element of the vector

– s[k] = (Real X[k])^2 + (Imag X[k])^2, where X[k] is the k-th FFT coefficient
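As an illustration of the power calculation, the sketch below computes s[k] for one frame. To stay dependency-free it uses a naive discrete Fourier transform rather than a real FFT, which gives the same coefficients but is much slower; in practice an FFT library routine would be used. The test signal (a 1 kHz sine sampled at 8 kHz) and the function name are assumptions made for the example.

```python
import cmath
import math

def power_spectrum(frame):
    """Return s[k] = Re(X[k])^2 + Im(X[k])^2 for k = 0 .. N/2."""
    n = len(frame)
    powers = []
    for k in range(n // 2 + 1):
        # Naive DFT coefficient X[k]; an FFT yields the same value faster.
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        powers.append(x_k.real ** 2 + x_k.imag ** 2)
    return powers

sample_rate_hz = 8000
frame = [math.sin(2 * math.pi * 1000 * t / sample_rate_hz) for t in range(200)]
power = power_spectrum(frame)

# Bin k corresponds to k * sample_rate / N Hz, so 1 kHz lands in bin 25.
peak_bin = max(range(len(power)), key=lambda k: power[k])
print(peak_bin, peak_bin * sample_rate_hz / len(frame))
```

Only bins 0 to N/2 are kept because the spectrum of a real-valued signal is symmetric about N/2.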

• Use a set of filters to split up frequency bands

– Typically use mel scale filters to match the basilar membrane. Get energy in each band

– Sphinx III uses 40 filters over 8 kHz bandwidth

• Frequency response is non-linear

– Mel(ody): m = 1127.01048 x log_e(1 + f/700)

– Inverse: f = 700 x (e^{m / 1127.01048} – 1)

– Bark = 13 x arctan(0.76 x f / 1000) + 3.5 x arctan((f / 7500)^2)
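The scale conversions can be sketched directly. The code below uses the standard form of these formulas, in which f (in Hz) is divided by 700 for the mel scale and by 1000 and 7500 for the Bark scale, and in which the inverse mel mapping divides m by 1127.01048. The function names are illustrative.

```python
import math

MEL_SCALE = 1127.01048   # constant quoted in the slides

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz."""
    return MEL_SCALE * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping: frequency in Hz for a mel value."""
    return 700.0 * (math.exp(mel / MEL_SCALE) - 1.0)

def hz_to_bark(f_hz):
    """Bark value for a frequency in Hz."""
    return (13.0 * math.atan(0.76 * f_hz / 1000.0)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

for f_hz in (100, 1000, 4000, 8000):
    print(f_hz, round(hz_to_mel(f_hz), 1), round(hz_to_bark(f_hz), 2))
```

The constant is chosen so that 1000 Hz maps to roughly 1000 mel. Above that point equal steps in mel cover ever wider ranges in Hz.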

• Calculate the mel spectrum by multiplying the power spectrum by each of the triangular mel weighting filters and integrating the result.
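A minimal sketch of this step follows, assuming triangular filters whose edges are equally spaced on the mel scale between 0 Hz and half the sample rate. The actual Sphinx III filter edges and any normalisation may differ, and every name and parameter below is hypothetical. "Integrating" is implemented as a weighted sum over the power spectrum bins.

```python
import math

def hz_to_mel(f_hz):
    return 1127.01048 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (math.exp(mel / 1127.01048) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate_hz):
    """Return n_filters triangular weight lists, each of length n_bins."""
    top_mel = hz_to_mel(sample_rate_hz / 2.0)
    # n_filters + 2 edge frequencies, equally spaced on the mel scale.
    edges_hz = [mel_to_hz(top_mel * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = (sample_rate_hz / 2.0) / (n_bins - 1)   # spacing between bins
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        weights = []
        for k in range(n_bins):
            f = k * bin_hz
            if left < f <= centre:
                weights.append((f - left) / (centre - left))     # rising edge
            elif centre < f < right:
                weights.append((right - f) / (right - centre))   # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank

def mel_spectrum(power, bank):
    """Weight the power spectrum by each filter and sum the result."""
    return [sum(w * p for w, p in zip(weights, power)) for weights in bank]

bank = mel_filterbank(n_filters=40, n_bins=257, sample_rate_hz=16000)
flat_power = [1.0] * 257            # placeholder spectrum for illustration
S = mel_spectrum(flat_power, bank)
print(len(S))                       # one energy value per filter
```

The filters are narrow at low frequencies and wide at high frequencies, so the 40 outputs give finer resolution where the mel scale says it matters most.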

• Calculate the mel cepstrum

– A DCT is applied to the natural logarithm of the mel spectrum to obtain the mel cepstrum:

c[n] = sum_{i=0}^{L-1} log_e(S[i]) x cos(pi x n x (2i + 1) / (2L)), n = 0 ... C-1

– C is the number of cepstral coefficients required (n = 0 to 12 gives 13 for Sphinx III), L is the number of filter banks, and S[i] is the mel spectrum coefficient, one for each filter output.

– C is usually less than L, as the DCT has the effect of compressing the spectrum such that the bulk of the information is in the first few coefficients. Sphinx III uses 40 filters but keeps only the first 13 cepstral coefficients.
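The final step can be sketched as an unnormalised DCT-II applied to the log mel spectrum, keeping only the first C coefficients. Scaling conventions vary between implementations, so the absolute values here are an assumption. The function name is hypothetical, and every S[i] must be positive for the logarithm to exist (real front ends usually apply a small floor).

```python
import math

def mel_cepstrum(mel_spec, n_coeffs=13):
    """First n_coeffs DCT-II coefficients of the natural log of the mel spectrum."""
    n_filters = len(mel_spec)                     # L in the text
    log_spec = [math.log(s) for s in mel_spec]    # requires every S[i] > 0
    return [
        sum(log_spec[i] * math.cos(math.pi * n * (2 * i + 1) / (2 * n_filters))
            for i in range(n_filters))
        for n in range(n_coeffs)                  # n = 0 .. C-1
    ]

# A flat mel spectrum has no spectral shape, so only c[0] is non-zero.
flat = [math.e] * 40
cepstrum = mel_cepstrum(flat)
print(len(cepstrum), round(cepstrum[0], 6))
```

With 40 filters and 13 coefficients, each frame of 200 or more samples is reduced to a 13-element feature vector.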

Default values for the SPHINX III front-end

• Typical Feature Extraction Block Diagram
