Automatic Speech Recognition A summary of contributions from multiple disciplines Mark D. Skowronski Computational Neuro-Engineering Lab Electrical and Computer Engineering University of Florida


Page 1:

Automatic Speech Recognition
A summary of contributions from multiple disciplines

Mark D. Skowronski

Computational Neuro-Engineering Lab

Electrical and Computer Engineering

University of Florida

October 6, 2004

Page 2:

What is ASR?

• Automatic Speech Recognition is:
  – A system that converts a raw acoustic signal into phonetically meaningful text.
  – A combination of engineering, linguistics, statistics, psychoacoustics, and computer science.

Page 3:

What is ASR?

“seven” → Feature extraction → Classification → Language model → text

• Psychoacousticians provide expert knowledge about human acoustic perception.
• Engineers provide efficient algorithms and hardware.
• Linguists provide language rules.
• Computer scientists and statisticians provide optimum modeling.

Page 4:

Feature extraction

• Acoustic-phonetic paradigm (pre-1980):
  – Holistic features (voicing and frication measures, durations, formants and bandwidths)
  – Difficult to construct robust classifiers
• Frame-based paradigm (1980 to today):
  – Short (20 ms) sliding analysis window; assumes each speech frame is quasi-stationary
  – Relies on the classifier to account for speech nonstationarity
  – Allows for the inclusion of expert knowledge of speech perception
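The frame-based paradigm above can be sketched in a few lines of NumPy. The helper name `frame_signal`, the 20 ms/10 ms frame and hop lengths, and the dummy signal are illustrative assumptions, not part of the original slides:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20.0, hop_ms=10.0):
    """Split a signal into short overlapping frames, each assumed
    quasi-stationary (hypothetical helper for illustration)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

fs = 8000                      # 8 kHz sampling rate
x = np.random.randn(fs)        # one second of dummy "speech"
frames = frame_signal(x, fs)   # 20 ms frames, 10 ms hop
print(frames.shape)            # (99, 160)
```

Each row of `frames` is then passed to a feature extractor such as MFCC; the classifier, not the window, is left to model how speech changes from frame to frame.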

Page 5:

Feature extraction algorithms

• Cepstrum (1962)
• Linear prediction (1967)
• Mel frequency cepstral coefficients (Davis & Mermelstein, 1980)
• Perceptual linear prediction (Hermansky, 1990)
• Human factor cepstral coefficients (Skowronski & Harris, 2002)

Page 6:

MFCC algorithm

“seven” → x(t) → Fourier transform → Mel-scaled filter bank → log energy → DCT → cepstral domain

(The time-domain signal is analyzed frame by frame; each frame's spectrum is weighted by a mel-scaled filter bank, and the DCT of the log filter energies yields the cepstral coefficients.)
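The pipeline above can be sketched directly in NumPy. This is a minimal single-frame version; the filter count, coefficient count, and helper names (`mel_filterbank`, `mfcc_frame`) are assumptions for illustration, not the slides' implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2              # Fourier -> power
    energies = mel_filterbank(n_filters, len(frame), fs) @ spectrum
    log_e = np.log(energies + 1e-10)                        # log filter energies
    # DCT-II of the log energies gives the cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  2 * n + 1) / (2 * n_filters))
    return dct @ log_e

fs = 8000
frame = np.sin(2 * np.pi * 440 * np.arange(160) / fs)   # one 20 ms test tone
print(mfcc_frame(frame, fs).shape)                      # (13,)
```

Running this per frame over a whole utterance yields the feature matrix that the classifier stage consumes.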

Page 7:

Classification

• Operates on frame-based features
• Accounts for time variations of speech
• Uses training data to transform features into symbols (phonemes, bi-/tri-phones, words)
• Non-parametric: dynamic time warping (DTW)
  – No parameters to estimate
  – Computationally expensive; scaling issues
• Parametric: hidden Markov model (HMM)
  – State-of-the-art model; complements features
  – Data-intensive; scales well
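The non-parametric DTW classifier above reduces to a dynamic-programming recurrence over two feature sequences. A minimal sketch (the toy 2-D features are assumptions for illustration):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (one row per frame). O(len(a)*len(b)) time and memory, which is
    the scaling issue noted above."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A template and a time-stretched copy of it align cheaply, because the
# warp absorbs the difference in speaking rate.
t = np.linspace(0.0, 1.0, 20)
template = np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
stretched = np.repeat(template, 2, axis=0)   # same pattern, twice as slow
other = np.zeros_like(template)
print(dtw_distance(template, stretched) < dtw_distance(template, other))  # True
```

In DTW-based recognition, an unknown utterance is compared against one stored template per word and the nearest template wins; no parameters are estimated, matching the slide's description.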

Page 8:

HMM classification

A hidden Markov model is a piecewise-stationary model of a nonstationary signal.

Model characteristics:
• States: represent domains of piecewise stationarity
• Interstate connections: define the model architecture
• Parameters: pdf means and covariances

Page 9:

HMM diagram (relating the time domain, state space, and feature space)

Page 10:

HMM output symbols

Symbol     # Models   Positive          Negative
Word       <1000      Coarticulation    Scaling
Phoneme    40         pdf estimation    Coarticulation
Biphone    1400
Triphone   40K        Coarticulation    pdf estimation

Tradeoff: larger symbol inventories capture coarticulation but make pdf estimation and scaling harder.

Page 11:

Language models

• Consider multiple output-symbol hypotheses
• Delay making hard decisions on the classifier output
• Use language-based expert knowledge to predict meaningful words/phrases from the classifier's output symbols (N-phones/words)
• Major research topic since the early 1990s, with the advent of large speech corpora
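A simple way to see how a language model delays the hard decision is bigram rescoring: keep several classifier hypotheses and let word-sequence probabilities break the tie. The toy corpus counts and helper names below are assumptions for illustration:

```python
import math

# Hypothetical bigram counts from a toy corpus (assumed for illustration)
bigram_counts = {("call", "nine"): 40, ("call", "mine"): 1,
                 ("<s>", "call"): 50}
unigram_counts = {"<s>": 50, "call": 50, "nine": 41, "mine": 2}

def log_bigram(prev, word, alpha=1.0, vocab_size=4):
    """Add-one smoothed bigram log-probability P(word | prev)."""
    num = bigram_counts.get((prev, word), 0) + alpha
    den = unigram_counts.get(prev, 0) + alpha * vocab_size
    return math.log(num / den)

def rescore(hypotheses):
    """Pick the hypothesis whose word sequence the language model
    finds most probable (acoustic scores omitted for brevity)."""
    def score(words):
        return sum(log_bigram(p, w) for p, w in zip(["<s>"] + words, words))
    return max(hypotheses, key=score)

# The classifier is acoustically unsure between "nine" and "mine";
# the language model prefers the phrase seen more often in the corpus.
print(rescore([["call", "mine"], ["call", "nine"]]))  # ['call', 'nine']
```

In a full system the language-model score is combined with the acoustic score rather than replacing it, which is why large speech and text corpora made this a major research topic.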

Page 12:

ASR problems

• Test/train mismatch
• Speaker variations (gender, accent, mood)
• Weak model assumptions
• Noise: energetic or informational (babble)
• The current state of the art neither models the human brain nor functions with the accuracy or reliability of humans
• Most recent progress comes from faster computers, not new ideas

Page 13:

Conclusions

• Automatic speech recognition technology emerges from several diverse disciplines:
  – Acousticians describe how speech is produced and perceived by humans
  – Computer scientists create machine learning models for signal-to-symbol conversion
  – Linguists provide language information
  – Engineers optimize the algorithms, provide the hardware, and put the pieces together