Automatic Speech Recognition: Introduction. The Human Dialogue System

TRANSCRIPT
Automatic Speech Recognition: Introduction
The Human Dialogue System
Computer Dialogue Systems

signal → Audition → Automatic Speech Recognition → words → Natural Language Understanding → logical form → Dialogue Management / Planning → Natural Language Generation → words → Text-to-speech → signal
Parameters of ASR Capabilities
• Different types of tasks with different difficulties:
– Speaking mode (isolated words / continuous speech)
– Speaking style (read / spontaneous)
– Enrollment (speaker-independent / speaker-dependent)
– Vocabulary (small: < 20 words / large: > 20,000 words)
– Language model (finite state / context sensitive)
– Signal-to-noise ratio (high: > 30 dB / low: < 10 dB)
– Transducer (high-quality microphone / telephone)
The Noisy Channel Model (Shannon)
(Figure: Message → noisy channel → Signal; Message + Channel = Signal.)

Decoding model: find Message* = argmax P(Message | Signal). But how do we represent each of these things?
What are the basic units for acoustic information?
When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable.
Words are good units for small-vocabulary speech recognition, but a poor choice for large-vocabulary, continuous speech recognition:
• Each word is treated individually, which implies a large amount of training data and storage.
• The recognition vocabulary may contain words that never appeared in the training data.
• It is expensive to model inter-word coarticulation effects.
Why phones are better units than words: an example
"SAY BITE AGAIN" spoken so that the phonemes are separated in time
(Figure: recorded sound waveform and its spectrogram.)
And why phones are still not the perfect choice
Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent).
However, each word is not a sequence of independent phonemes!
Our articulators move continuously from one position to another.
The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc.
Different realizations of a phoneme are called allophones.
Example: different spectrograms for “eh”
Triphone model: each triphone captures facts about the preceding and following phone.
• Monophone: p, t, k
• Triphone: iy-p+aa
• a-b+c means “phone b, preceded by phone a, followed by phone c”

In practice, systems use on the order of 100,000 triphones, and the triphone model is the one currently used (e.g., Sphinx).
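The a-b+c notation can be sketched in a few lines. Padding sentence boundaries with a silence phone "sil" is an assumption here; real systems handle boundaries in various ways:

```python
# Sketch: expand a phone sequence into the a-b+c triphone notation above.
# Boundaries are padded with "sil" (an assumption; real systems vary).

def to_triphones(phones, pad="sil"):
    padded = [pad] + list(phones) + [pad]
    # For each phone b, record its left neighbour a and right neighbour c as a-b+c.
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}" for i in range(1, len(padded) - 1)]

print(to_triphones(["iy", "p", "aa"]))
# ['sil-iy+p', 'iy-p+aa', 'p-aa+sil']
```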
Parts of an ASR System
• Feature Calculation: produces acoustic vectors (x_t)
• Acoustic Modeling: maps acoustics to triphones (e.g., k @)
• Pronunciation Modeling: maps triphones to words (e.g., cat: k@t, dog: dog, mail: mAl, the: D&, DE, …)
• Language Modeling: strings words together (e.g., cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …)
Feature calculation
(Figure: spectrogram, frequency vs. time.) Find the energy at each time step in each frequency channel.
(Figure: frequency vs. time.) Take the inverse discrete Fourier transform to decorrelate the frequencies.
Input: the recorded signal. Output: acoustic observation vectors, e.g.

[-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …]
[ 0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …]
[-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …]
[ 0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …]
…
Robust Speech Recognition
• Different schemes have been developed for dealing with noise and reverberation:
– Additive noise: reduce the effects of particular frequencies
– Convolutional noise: remove the effects of linear filters (cepstral mean subtraction)

Cepstrum: the Fourier transform of the LOGARITHM of the spectrum.
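The cepstrum definition and cepstral mean subtraction can be sketched with NumPy. This is a minimal illustration on random frames, not a full MFCC front end; the frame sizes and the small floor added before the log are assumptions:

```python
import numpy as np

# Sketch of the cepstrum and cepstral mean subtraction (CMS) above.
# A linear channel filter multiplies the spectrum, so after the log it
# becomes an additive constant per frame; subtracting the per-utterance
# mean of each cepstral coefficient removes it.

def cepstra(frames):
    """frames: (n_frames, frame_len) windowed signal -> real cepstra."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) + 1e-10  # avoid log(0)
    return np.fft.irfft(np.log(spectrum), axis=1)           # IDFT of the log spectrum

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 256))   # stand-in for windowed speech frames
c = cepstra(frames)
c_cms = c - c.mean(axis=0)                # cepstral mean subtraction
print(np.abs(c_cms.mean(axis=0)).max())   # per-coefficient means are now ~0
```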
How do we map from vectors to word sequences?
[-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …] [0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …] [-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …] [0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …] → “That you” …???
HMM (again)!
Acoustic observation vectors → “That you”: pattern recognition with HMMs.
ASR using HMMs
• Try to solve P(Message | Signal) by breaking the problem up into separate components
• Most common method: Hidden Markov Models
– Assume that a message is composed of words
– Assume that words are composed of sub-word parts (triphones)
– Assume that triphones have some sort of acoustic realization
– Use probabilistic models for matching acoustics to phones to words
Creating HMMs for word sequences: context-independent units
“Need” triphone model
Hierarchical system of HMMs
(Figure: triphone HMMs nested inside a higher-level word HMM, which is in turn governed by the language model.)
To simplify, let’s now ignore the lower-level HMMs. Each phone node has a “hidden” HMM (H2MM).
HMMs for ASR
“go home”

Hidden backbone: g o h o m
Acoustic observations: x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

A Markov-model backbone composed of sequences of triphones (hidden because we don’t know the correspondences). Each line between states and observations represents a probability estimate (more later). One alignment: g o o o o o o h m m.
Even with the same word hypothesis, there can be different alignments between states and observations. We also have to search over all word hypotheses.
For every HMM (in the hierarchy): compute the max-probability sequence

(Figure: lattice “that” (th a t) → “he” (h iy) / “you” (y uw) / “should” (sh uh d), weighted by p(he|that), p(you|that).)

X = acoustic observations, (tri)phones, phone sequences
W = (tri)phones, phone sequences, word sequences

COMPUTE: argmax_W P(W|X) = argmax_W P(X|W)P(W)/P(X) = argmax_W P(X|W)P(W)
Search

• When trying to find W* = argmax_W P(W|X), we need to look at (in theory):
– All possible (triphone, word, etc.) sequences
– All possible segmentations/alignments of W and X
• Generally, this is done by searching the space of W:
– Viterbi search: a dynamic-programming approach that looks for the most likely path
– A* search: an alternative method that keeps a stack of hypotheses around
• If |W| is large, pruning becomes important
• We also need to estimate transition probabilities
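The Viterbi dynamic program can be sketched on a toy two-state model. The states, transition, and emission probabilities below are invented illustrative values, not parameters from the slides:

```python
import math

# Minimal Viterbi sketch for the dynamic-programming search described above:
# delta[s] holds the log-probability of the best state path ending in s, and
# back-pointers let us recover that path at the end.

def viterbi(obs, states, start, trans, emit):
    delta = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, delta = delta, {}
        ptr = {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            delta[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace the best final state back through the pointers.
    last = max(delta, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["g", "o"]
start = {"g": 0.9, "o": 0.1}
trans = {"g": {"g": 0.5, "o": 0.5}, "o": {"g": 0.1, "o": 0.9}}
emit = {"g": {"x1": 0.8, "x2": 0.2}, "o": {"x1": 0.3, "x2": 0.7}}
print(viterbi(["x1", "x2", "x2"], states, start, trans, emit))
# ['g', 'o', 'o']
```

Real decoders work in the log domain like this to avoid underflow, and prune low-scoring states at each step rather than keeping them all.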
Training: speech corpora
• Have a speech corpus at hand
– Should have word (and preferably phone) transcriptions
– Divide into training, development, and test sets
• Develop models of prior knowledge
– Pronunciation dictionary
– Grammar, lexical trees
• Train acoustic models
– Possibly realigning the corpus phonetically
Acoustic Model
Acoustic vectors labelled with phones (dh a a t):
[-0.1, 0.3, 1.4, …] → dh   [0.2, 0.1, 1.2, …] → a   [-6.1, -2.1, 3.1, …] → a   [0.2, 0.0, 1.2, …] → t

• Assume that you can label each vector with a phonetic label
• Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g., a neural network)

N_a(μ, Σ) estimates P(X | state = a)
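Fitting a Gaussian per phone label and scoring a frame against each model can be sketched as follows. The labelled vectors are synthetic toy data, and the diagonal covariance is a simplifying assumption (common in practice, but not stated on the slide):

```python
import numpy as np

# Sketch of the Gaussian acoustic model above: pool all vectors labelled with
# a phone and fit N_a, then use it to score P(X | state = a) for a new frame.

def fit_gaussian(vectors):
    x = np.asarray(vectors)
    return x.mean(axis=0), x.var(axis=0) + 1e-6  # diagonal covariance

def log_likelihood(x, mean, var):
    # log N(x; mean, diag(var)), summed over dimensions
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(1)
examples = {"a": rng.normal(0.0, 1.0, (200, 3)),   # frames labelled "a"
            "t": rng.normal(3.0, 1.0, (200, 3))}   # frames labelled "t"
models = {phone: fit_gaussian(v) for phone, v in examples.items()}

x = np.array([2.9, 3.1, 3.0])  # a new frame that looks like "t"
scores = {p: log_likelihood(x, *m) for p, m in models.items()}
print(max(scores, key=scores.get))  # "t"
```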
Pronunciation model
• Pronunciation model gives connections between phones and words
• Multiple pronunciations (tomato):
(Figure: pronunciation networks; each arc carries a probability, e.g. dh with p_dh vs. 1−p_dh, a with p_a, t with p_t; “tomato” branches over alternative phone paths such as t (ah | ow) m (ey | aa) t ow.)
Training models for a sound unit
Language Model
• The language model gives connections between words (e.g., bigrams: probabilities of two-word sequences)
(Figure: “that” (dh a t) followed by “he” (h iy) or “you” (y uw), weighted by p(he|that) and p(you|that).)
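Bigram probabilities like p(he|that) are estimated from counts. A minimal sketch on a toy corpus (invented for illustration):

```python
from collections import Counter

# Sketch of bigram language-model estimation:
# p(w2 | w1) = count(w1 w2) / count(w1 as a history).

corpus = "that he said that you said that he left".split()
unigrams = Counter(corpus[:-1])                 # every token that serves as a history
bigrams = Counter(zip(corpus, corpus[1:]))      # adjacent word pairs

def p(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1]

# "that" occurs 3 times as a history: followed by "he" twice and "you" once.
print(p("he", "that"), p("you", "that"))
```

Real language models smooth these estimates (and back off to lower-order n-grams) so unseen pairs do not get probability zero.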
Lexical trees

START     S-T-AA-R-TD
STARTING  S-T-AA-R-DX-IX-NG
STARTED   S-T-AA-R-DX-IX-DD
STARTUP   S-T-AA-R-T-AX-PD
START-UP  S-T-AA-R-T-AX-PD

(Figure: prefix tree sharing the common S-T-AA-R prefix, then branching to TD (start), DX-IX-NG (starting), DX-IX-DD (started), and T-AX-PD (startup, start-up).)
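The lexical tree above is a prefix tree over phone sequences; building one is straightforward. A minimal sketch (the `"#word"` end-marker key is an illustrative convention, not Sphinx's):

```python
# Sketch of building the lexical (prefix) tree above: pronunciations that
# share a phone prefix share nodes, so the search expands S-T-AA-R only once.

def build_tree(lexicon):
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones.split("-"):
            node = node.setdefault(ph, {})
        node["#word"] = word  # mark a word ending at this node
    return root

lexicon = {
    "start":    "S-T-AA-R-TD",
    "starting": "S-T-AA-R-DX-IX-NG",
    "started":  "S-T-AA-R-DX-IX-DD",
    "startup":  "S-T-AA-R-T-AX-PD",
}
tree = build_tree(lexicon)
# All four words pass through the single shared S-T-AA-R prefix:
print(list(tree["S"]["T"]["AA"]["R"].keys()))  # ['TD', 'DX', 'T']
```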
Judging the quality of a system
• Usually, ASR performance is judged by the word error rate:

ErrorRate = 100 × (Subs + Ins + Dels) / Nwords

REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D  C    S   C  C    I

100 × (1 S + 1 I + 1 D) / 5 = 60%
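The word error rate is computed with an edit-distance dynamic program over words. A sketch that reproduces the 60% example (ties in the alignment may be broken differently than on the slide, but the total edit count is the same):

```python
# Word error rate via minimum edit distance over words:
# d[i][j] = min edits to turn the first i reference words
# into the first j recognized words.

def wer(ref, rec):
    r, h = ref.split(), rec.split()
    d = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)  # sub / del / ins
    return 100 * d[-1][-1] / len(r)

print(wer("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))  # 60.0
```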
• This assumes that all errors are equal
– There is also a bit of a mismatch between the optimization criterion and the error measurement
• Other (task-specific) measures are sometimes used
– Task completion
– Concept error rate
• Feature extractor
• Mel-Frequency Cepstral Coefficients (MFCCs) → feature vectors
• Acoustic Observations
• Hidden States
• Acoustic Observation likelihoods
“Six”
• Constructs the search graph of HMMs from:
– Acoustic model
– Statistical language model ~or~ grammar
– Dictionary
• Constructs the HMMs for units of speech
• Produces observation likelihoods
• Sampling rate is critical! WSJ vs. WSJ_8k
• Training corpora: TIDIGITS, RM1, AN4, HUB4
• Word likelihoods
• ARPA format example:

1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
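Each ARPA n-gram line is "log10(prob), the n words, and an optional log10 backoff weight". A small reader for lines like the ones above (the parser itself is an illustrative sketch, not Sphinx code):

```python
# Sketch of reading ARPA-style n-gram lines:
# "log10(prob) <n words> [log10(backoff)]".

def parse_arpa_section(lines, n):
    entries = {}
    for line in lines:
        parts = line.split()
        logp, words = float(parts[0]), tuple(parts[1:1 + n])
        backoff = float(parts[1 + n]) if len(parts) > 1 + n else None
        entries[words] = (logp, backoff)
    return entries

bigrams = parse_arpa_section(
    ["-0.7782 as the -0.2717", "-0.4771 at all 0.0000", "-0.7782 at the -0.2915"], 2)
print(bigrams[("at", "all")])  # log10 p(all | at) = -0.4771, i.e. p ≈ 1/3
```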
public <basicCmd> = <startPolite> <command> <endPolite>;
public <startPolite> = (please | kindly | could you) *;
public <endPolite> = [ please | thanks | thank you ];
<command> = <action> <object>;
<action> = (open | close | delete | move);
<object> = [the | a] (window | file | menu);
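One way to see what this grammar accepts is to sample sentences from it. The rules are re-encoded by hand below (this is not a JSGF parser, and capping the `(…)*` repetition at two is an arbitrary choice for the demo):

```python
import random

# Sketch: randomly generate commands the JSGF-style grammar above accepts.

random.seed(3)
actions = ["open", "close", "delete", "move"]
objects = ["window", "file", "menu"]
determiners = ["", "the ", "a "]                  # [the | a] is optional
polite_start = ["please", "kindly", "could you"]  # (…)* repeats zero or more times
polite_end = ["", " please", " thanks", " thank you"]

def sentence():
    start = " ".join(random.choices(polite_start, k=random.randint(0, 2)))
    cmd = f"{random.choice(actions)} {random.choice(determiners)}{random.choice(objects)}"
    return (start + " " + cmd + random.choice(polite_end)).strip()

for _ in range(3):
    print(sentence())
```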
• Maps words to phoneme sequences
• Example from cmudict.06d:

POULTICE P OW L T AH S
POULTICES P OW L T AH S IH Z
POULTON P AW L T AH N
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T
POUNCEY P AW N S IY
POUNCING P AW N S IH NG
POUNCY P UW NG K IY
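Entries in this format map directly to a word → phoneme-sequence dictionary, the component described above. A minimal sketch over a few of the quoted lines:

```python
# Sketch of the dictionary component: map words to phoneme sequences,
# using a few of the cmudict.06d entries quoted above.

raw = """POULTICE P OW L T AH S
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T"""

dictionary = {}
for line in raw.splitlines():
    word, *phones = line.split()
    dictionary[word] = phones

print(dictionary["POUNCE"])  # ['P', 'AW', 'N', 'S']
```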
• Can be statically or dynamically constructed
• Maps feature vectors to search graph
• Searches the graph for the “best fit”
• P(sequence of feature vectors | word/phone), a.k.a. P(O|W): “how likely is the input to have been generated by the word?”
Possible alignments of “five” (f ay v) to ten observation frames:

f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v
…
(Figure: observations O1, O2, O3 over time.)
• Uses algorithms to weed out low-scoring paths during decoding
• Words!
• Most common metric
• Measures the number of modifications needed to transform the recognized sentence into the reference sentence
• Reference: “This is a reference sentence.”
• Result: “This is neuroscience.”
• Requires 2 deletions and 1 substitution: D S D
Installation details
• http://cmusphinx.sourceforge.net/wiki/sphinx4:howtobuildand_run_sphinx4
• Student report on NLP course web site