Automatic Speech Recognition: Introduction. The Human Dialogue System

TRANSCRIPT
Automatic Speech Recognition: Introduction
The Human Dialogue System
Computer Dialogue Systems

signal → Audition → Automatic Speech Recognition → words → Natural Language Understanding → logical form → Dialogue Management / Planning → Natural Language Generation → words → Text-to-speech → signal
Parameters of ASR Capabilities
• Different types of tasks with different difficulties:
– Speaking mode (isolated words / continuous speech)
– Speaking style (read / spontaneous)
– Enrollment (speaker-independent / speaker-dependent)
– Vocabulary (small: < 20 words / large: > 20,000 words)
– Language model (finite state / context sensitive)
– Signal-to-noise ratio (high: > 30 dB / low: < 10 dB)
– Transducer (high-quality microphone / telephone)
The Noisy Channel Model (Shannon)
(Figure: Message → noisy channel → Signal; Message + Channel = Signal.)

Decoding model: find Message* = argmax P(Message | Signal). But how do we represent each of these things?
What are the basic units for acoustic information?
When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable.
Words are good units for small-vocabulary speech recognition, but a poor choice for large-vocabulary, continuous speech recognition:
• Each word is treated individually, which implies a large amount of training data and storage.
• The recognition vocabulary may contain words that never appeared in the training data.
• It is expensive to model inter-word coarticulation effects.
Why phones are better units than words: an example
"SAY BITE AGAIN" spoken so that the phonemes are separated in time
(Figure: recorded sound waveform and its spectrogram.)
And why phones are still not the perfect choice
Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent).
However, each word is not a sequence of independent phonemes!
Our articulators move continuously from one position to another.
The realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects etc.
Different realizations of a phoneme are called allophones.
Example: different spectrograms for “eh”
Triphone model: each triphone captures facts about the preceding and following phone.
• Monophone: p, t, k
• Triphone: iy-p+aa
• a-b+c means “phone b, preceded by phone a, followed by phone c”

In practice, systems use on the order of 100,000 triphones, and the triphone model is the one currently used (e.g., Sphinx).
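The a-b+c notation can be sketched in a few lines. Padding sentence boundaries with a silence phone "sil" is an assumption here; real systems handle boundaries in various ways:

```python
# Sketch: expand a phone sequence into the a-b+c triphone notation above.
# Boundaries are padded with "sil" (an assumption; real systems vary).

def to_triphones(phones, pad="sil"):
    padded = [pad] + list(phones) + [pad]
    # For each phone b, record its left neighbour a and right neighbour c as a-b+c.
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}" for i in range(1, len(padded) - 1)]

print(to_triphones(["iy", "p", "aa"]))
# ['sil-iy+p', 'iy-p+aa', 'p-aa+sil']
```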
Parts of an ASR System
• Feature Calculation: produces acoustic vectors (x_t)
• Acoustic Modeling: maps acoustics to triphones (e.g., k @)
• Pronunciation Modeling: maps triphones to words (e.g., cat: k@t, dog: dog, mail: mAl, the: D&, DE, …)
• Language Modeling: strings words together (e.g., cat dog: 0.00002, cat the: 0.0000005, the cat: 0.029, the dog: 0.031, the mail: 0.054, …)
Feature calculation
(Figure: spectrogram, frequency vs. time.) Find the energy at each time step in each frequency channel.
(Figure: frequency vs. time.) Take the inverse discrete Fourier transform to decorrelate the frequencies.
Input: the recorded signal. Output: acoustic observation vectors, e.g.

[-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …]
[ 0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …]
[-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …]
[ 0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …]
…
Robust Speech Recognition
• Different schemes have been developed for dealing with noise and reverberation:
– Additive noise: reduce the effects of particular frequencies
– Convolutional noise: remove the effects of linear filters (cepstral mean subtraction)

Cepstrum: the Fourier transform of the LOGARITHM of the spectrum.
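The cepstrum definition and cepstral mean subtraction can be sketched with NumPy. This is a minimal illustration on random frames, not a full MFCC front end; the frame sizes and the small floor added before the log are assumptions:

```python
import numpy as np

# Sketch of the cepstrum and cepstral mean subtraction (CMS) above.
# A linear channel filter multiplies the spectrum, so after the log it
# becomes an additive constant per frame; subtracting the per-utterance
# mean of each cepstral coefficient removes it.

def cepstra(frames):
    """frames: (n_frames, frame_len) windowed signal -> real cepstra."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) + 1e-10  # avoid log(0)
    return np.fft.irfft(np.log(spectrum), axis=1)           # IDFT of the log spectrum

rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 256))   # stand-in for windowed speech frames
c = cepstra(frames)
c_cms = c - c.mean(axis=0)                # cepstral mean subtraction
print(np.abs(c_cms.mean(axis=0)).max())   # per-coefficient means are now ~0
```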
How do we map from vectors to word sequences?
[-0.1, 0.3, 1.4, -1.2, 2.3, 2.6, …] [0.2, 0.1, 1.2, -1.2, 4.4, 2.2, …] [-6.1, -2.1, 3.1, 2.4, 1.0, 2.2, …] [0.2, 0.0, 1.2, -1.2, 4.4, 2.2, …] → “That you” …???
HMM (again)!
Acoustic observation vectors → “That you”: pattern recognition with HMMs.
ASR using HMMs
• Try to solve P(Message | Signal) by breaking the problem up into separate components
• Most common method: Hidden Markov Models
– Assume that a message is composed of words
– Assume that words are composed of sub-word parts (triphones)
– Assume that triphones have some sort of acoustic realization
– Use probabilistic models for matching acoustics to phones to words
Creating HMMs for word sequences: context-independent units
“Need” triphone model
Hierarchical system of HMMs
(Figure: triphone HMMs nested inside a higher-level word HMM, which is in turn governed by the language model.)
To simplify, let’s now ignore the lower-level HMMs. Each phone node has a “hidden” HMM (H2MM).
HMMs for ASR
“go home”

Hidden backbone: g o h o m
Acoustic observations: x0 x1 x2 x3 x4 x5 x6 x7 x8 x9

A Markov-model backbone composed of sequences of triphones (hidden because we don’t know the correspondences). Each line between states and observations represents a probability estimate (more later). One alignment: g o o o o o o h m m.
Even with the same word hypothesis, there can be different alignments between states and observations. We also have to search over all word hypotheses.
For every HMM (in the hierarchy): compute the max-probability sequence

(Figure: lattice “that” (th a t) → “he” (h iy) / “you” (y uw) / “should” (sh uh d), weighted by p(he|that), p(you|that).)

X = acoustic observations, (tri)phones, phone sequences
W = (tri)phones, phone sequences, word sequences

COMPUTE: argmax_W P(W|X) = argmax_W P(X|W)P(W)/P(X) = argmax_W P(X|W)P(W)
Search

• When trying to find W* = argmax_W P(W|X), we need to look at (in theory):
– All possible (triphone, word, etc.) sequences
– All possible segmentations/alignments of W and X
• Generally, this is done by searching the space of W:
– Viterbi search: a dynamic-programming approach that looks for the most likely path
– A* search: an alternative method that keeps a stack of hypotheses around
• If |W| is large, pruning becomes important
• We also need to estimate transition probabilities
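The Viterbi dynamic program can be sketched on a toy two-state model. The states, transition, and emission probabilities below are invented illustrative values, not parameters from the slides:

```python
import math

# Minimal Viterbi sketch for the dynamic-programming search described above:
# delta[s] holds the log-probability of the best state path ending in s, and
# back-pointers let us recover that path at the end.

def viterbi(obs, states, start, trans, emit):
    delta = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, delta = delta, {}
        ptr = {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            delta[s] = prev[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace the best final state back through the pointers.
    last = max(delta, key=delta.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["g", "o"]
start = {"g": 0.9, "o": 0.1}
trans = {"g": {"g": 0.5, "o": 0.5}, "o": {"g": 0.1, "o": 0.9}}
emit = {"g": {"x1": 0.8, "x2": 0.2}, "o": {"x1": 0.3, "x2": 0.7}}
print(viterbi(["x1", "x2", "x2"], states, start, trans, emit))
# ['g', 'o', 'o']
```

Real decoders work in the log domain like this to avoid underflow, and prune low-scoring states at each step rather than keeping them all.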
Training: speech corpora
• Have a speech corpus at hand
– Should have word (and preferably phone) transcriptions
– Divide into training, development, and test sets
• Develop models of prior knowledge
– Pronunciation dictionary
– Grammar, lexical trees
• Train acoustic models
– Possibly realigning the corpus phonetically
Acoustic Model
Acoustic vectors labelled with phones (dh a a t):
[-0.1, 0.3, 1.4, …] → dh   [0.2, 0.1, 1.2, …] → a   [-6.1, -2.1, 3.1, …] → a   [0.2, 0.0, 1.2, …] → t

• Assume that you can label each vector with a phonetic label
• Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g., a neural network)

N_a(μ, Σ) estimates P(X | state = a)
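Fitting a Gaussian per phone label and scoring a frame against each model can be sketched as follows. The labelled vectors are synthetic toy data, and the diagonal covariance is a simplifying assumption (common in practice, but not stated on the slide):

```python
import numpy as np

# Sketch of the Gaussian acoustic model above: pool all vectors labelled with
# a phone and fit N_a, then use it to score P(X | state = a) for a new frame.

def fit_gaussian(vectors):
    x = np.asarray(vectors)
    return x.mean(axis=0), x.var(axis=0) + 1e-6  # diagonal covariance

def log_likelihood(x, mean, var):
    # log N(x; mean, diag(var)), summed over dimensions
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

rng = np.random.default_rng(1)
examples = {"a": rng.normal(0.0, 1.0, (200, 3)),   # frames labelled "a"
            "t": rng.normal(3.0, 1.0, (200, 3))}   # frames labelled "t"
models = {phone: fit_gaussian(v) for phone, v in examples.items()}

x = np.array([2.9, 3.1, 3.0])  # a new frame that looks like "t"
scores = {p: log_likelihood(x, *m) for p, m in models.items()}
print(max(scores, key=scores.get))  # "t"
```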
Pronunciation model
• Pronunciation model gives connections between phones and words
• Multiple pronunciations (tomato):
(Figure: pronunciation networks; each arc carries a probability, e.g. dh with p_dh vs. 1−p_dh, a with p_a, t with p_t; “tomato” branches over alternative phone paths such as t (ah | ow) m (ey | aa) t ow.)
Training models for a sound unit
Language Model
• The language model gives connections between words (e.g., bigrams: probabilities of two-word sequences)
(Figure: “that” (dh a t) followed by “he” (h iy) or “you” (y uw), weighted by p(he|that) and p(you|that).)
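Bigram probabilities like p(he|that) are estimated from counts. A minimal sketch on a toy corpus (invented for illustration):

```python
from collections import Counter

# Sketch of bigram language-model estimation:
# p(w2 | w1) = count(w1 w2) / count(w1 as a history).

corpus = "that he said that you said that he left".split()
unigrams = Counter(corpus[:-1])                 # every token that serves as a history
bigrams = Counter(zip(corpus, corpus[1:]))      # adjacent word pairs

def p(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1]

# "that" occurs 3 times as a history: followed by "he" twice and "you" once.
print(p("he", "that"), p("you", "that"))
```

Real language models smooth these estimates (and back off to lower-order n-grams) so unseen pairs do not get probability zero.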
Lexical trees

START     S-T-AA-R-TD
STARTING  S-T-AA-R-DX-IX-NG
STARTED   S-T-AA-R-DX-IX-DD
STARTUP   S-T-AA-R-T-AX-PD
START-UP  S-T-AA-R-T-AX-PD

(Figure: prefix tree sharing the common S-T-AA-R prefix, then branching to TD (start), DX-IX-NG (starting), DX-IX-DD (started), and T-AX-PD (startup, start-up).)
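The lexical tree above is a prefix tree over phone sequences; building one is straightforward. A minimal sketch (the `"#word"` end-marker key is an illustrative convention, not Sphinx's):

```python
# Sketch of building the lexical (prefix) tree above: pronunciations that
# share a phone prefix share nodes, so the search expands S-T-AA-R only once.

def build_tree(lexicon):
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones.split("-"):
            node = node.setdefault(ph, {})
        node["#word"] = word  # mark a word ending at this node
    return root

lexicon = {
    "start":    "S-T-AA-R-TD",
    "starting": "S-T-AA-R-DX-IX-NG",
    "started":  "S-T-AA-R-DX-IX-DD",
    "startup":  "S-T-AA-R-T-AX-PD",
}
tree = build_tree(lexicon)
# All four words pass through the single shared S-T-AA-R prefix:
print(list(tree["S"]["T"]["AA"]["R"].keys()))  # ['TD', 'DX', 'T']
```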
Judging the quality of a system
• Usually, ASR performance is judged by the word error rate:

ErrorRate = 100 × (Subs + Ins + Dels) / Nwords

REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D  C    S   C  C    I

100 × (1 S + 1 I + 1 D) / 5 = 60%
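The word error rate is computed with an edit-distance dynamic program over words. A sketch that reproduces the 60% example (ties in the alignment may be broken differently than on the slide, but the total edit count is the same):

```python
# Word error rate via minimum edit distance over words:
# d[i][j] = min edits to turn the first i reference words
# into the first j recognized words.

def wer(ref, rec):
    r, h = ref.split(), rec.split()
    d = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)  # sub / del / ins
    return 100 * d[-1][-1] / len(r)

print(wer("I WANT TO GO HOME", "WANT TWO GO HOME NOW"))  # 60.0
```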
• This assumes that all errors are equal
– There is also a bit of a mismatch between the optimization criterion and the error measurement
• Other (task-specific) measures are sometimes used
– Task completion
– Concept error rate
• Feature extractor
• Mel-Frequency Cepstral Coefficients (MFCCs) → feature vectors
• Acoustic Observations
• Hidden States
• Acoustic Observation likelihoods
“Six”
• Constructs the search graph of HMMs from:
– Acoustic model
– Statistical language model ~or~ grammar
– Dictionary
• Constructs the HMMs for units of speech
• Produces observation likelihoods
• Sampling rate is critical! WSJ vs. WSJ_8k
• Training corpora: TIDIGITS, RM1, AN4, HUB4
• Word likelihoods
• ARPA format example:

1-grams:
-3.7839 board -0.1552
-2.5998 bottom -0.3207
-3.7839 bunch -0.2174
2-grams:
-0.7782 as the -0.2717
-0.4771 at all 0.0000
-0.7782 at the -0.2915
3-grams:
-2.4450 in the lowest
-0.5211 in the middle
-2.4450 in the on
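Each ARPA n-gram line is "log10(prob), the n words, and an optional log10 backoff weight". A small reader for lines like the ones above (the parser itself is an illustrative sketch, not Sphinx code):

```python
# Sketch of reading ARPA-style n-gram lines:
# "log10(prob) <n words> [log10(backoff)]".

def parse_arpa_section(lines, n):
    entries = {}
    for line in lines:
        parts = line.split()
        logp, words = float(parts[0]), tuple(parts[1:1 + n])
        backoff = float(parts[1 + n]) if len(parts) > 1 + n else None
        entries[words] = (logp, backoff)
    return entries

bigrams = parse_arpa_section(
    ["-0.7782 as the -0.2717", "-0.4771 at all 0.0000", "-0.7782 at the -0.2915"], 2)
print(bigrams[("at", "all")])  # log10 p(all | at) = -0.4771, i.e. p ≈ 1/3
```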
public <basicCmd> = <startPolite> <command> <endPolite>;
public <startPolite> = (please | kindly | could you) *;
public <endPolite> = [ please | thanks | thank you ];
<command> = <action> <object>;
<action> = (open | close | delete | move);
<object> = [the | a] (window | file | menu);
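One way to see what this grammar accepts is to sample sentences from it. The rules are re-encoded by hand below (this is not a JSGF parser, and capping the `(…)*` repetition at two is an arbitrary choice for the demo):

```python
import random

# Sketch: randomly generate commands the JSGF-style grammar above accepts.

random.seed(3)
actions = ["open", "close", "delete", "move"]
objects = ["window", "file", "menu"]
determiners = ["", "the ", "a "]                  # [the | a] is optional
polite_start = ["please", "kindly", "could you"]  # (…)* repeats zero or more times
polite_end = ["", " please", " thanks", " thank you"]

def sentence():
    start = " ".join(random.choices(polite_start, k=random.randint(0, 2)))
    cmd = f"{random.choice(actions)} {random.choice(determiners)}{random.choice(objects)}"
    return (start + " " + cmd + random.choice(polite_end)).strip()

for _ in range(3):
    print(sentence())
```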
• Maps words to phoneme sequences
• Example from cmudict.06d:

POULTICE P OW L T AH S
POULTICES P OW L T AH S IH Z
POULTON P AW L T AH N
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T
POUNCEY P AW N S IY
POUNCING P AW N S IH NG
POUNCY P UW NG K IY
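Entries in this format map directly to a word → phoneme-sequence dictionary, the component described above. A minimal sketch over a few of the quoted lines:

```python
# Sketch of the dictionary component: map words to phoneme sequences,
# using a few of the cmudict.06d entries quoted above.

raw = """POULTICE P OW L T AH S
POULTRY P OW L T R IY
POUNCE P AW N S
POUNCED P AW N S T"""

dictionary = {}
for line in raw.splitlines():
    word, *phones = line.split()
    dictionary[word] = phones

print(dictionary["POUNCE"])  # ['P', 'AW', 'N', 'S']
```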
• Can be statically or dynamically constructed
• Maps feature vectors to search graph
• Searches the graph for the “best fit”
• P(sequence of feature vectors | word/phone), a.k.a. P(O|W): “how likely is the input to have been generated by the word?”
Possible alignments of “five” (f ay v) to ten observation frames:

f ay ay ay ay v v v v v
f f ay ay ay ay v v v v
f f f ay ay ay ay v v v
f f f f ay ay ay ay v v
f f f f ay ay ay ay ay v
f f f f f ay ay ay ay v
f f f f f f ay ay ay v
…
(Figure: observations O1, O2, O3 over time.)
• Uses algorithms to weed out low-scoring paths during decoding
• Words!
• Most common metric
• Measures the number of modifications needed to transform the recognized sentence into the reference sentence
• Reference: “This is a reference sentence.”
• Result: “This is neuroscience.”
• Requires 2 deletions and 1 substitution: D S D
Installation details
• http://cmusphinx.sourceforge.net/wiki/sphinx4:howtobuildand_run_sphinx4
• Student report on NLP course web site