Automatic Speech Recognition (CS753)
Lecture 1: Introduction to Statistical Speech Recognition
Instructor: Preethi Jyothi


Page 1:

Automatic Speech Recognition (CS753)
Lecture 1: Introduction to Statistical Speech Recognition
Instructor: Preethi Jyothi

Page 2:

Course Specifics

Page 3:

About the course (I)

Main Topics:
• Introduction to statistical ASR
• Acoustic models: Hidden Markov models, deep neural network-based models
• Pronunciation models
• Language models (Ngram models, RNN-LMs)
• Decoding search problem (Viterbi algorithm, etc.)

Page 4:

About the course (II)

Course webpage: www.cse.iitb.ac.in/~pjyothi/cs753

Reading: All mandatory reading will be freely available online. Reading material will be posted on the website.

Attendance: You are strongly advised to attend all lectures, given that there is no fixed textbook and a lot of the material covered in class will not be on the slides.

Page 5:

Evaluation — Assignments

Grading: 3 assignments + 1 mid-sem exam, making up 45% of the grade.

Format: 1. One assignment will be almost entirely programming-based.

The other two will mostly contain problems to be solved by hand.

2. Mid-sem will have some questions based on problems in assignment 1. For every problem that appears both in the assignment & exam, your score for that problem in the assignment will be replaced by averaging it with the score in the exam.

Late Policy: 10% reduction in marks for every additional day past the due date. Submissions closed three days after the due date.

Page 6:

Evaluation — Final Project

Grading: Constitutes 25% of the total grade. (Exceptional projects could get extra credit. Details on website soon.)

Team: 2-3 members. Individual projects are highly discouraged.

Project requirements:
• Discuss proposed project with me on or before January 30th
• 4-5 page report about methodology & detailed experiments
• Project demo

Page 7:

Evaluation — Final Project

On the project:

• Could be an implementation of ideas learnt in class, applied to real data (and/or to a new task)
• Could be a new idea/algorithm (with preliminary experiments)
• An ideal project would lead to a conference paper

Sample project ideas:
• Voice tweeting system
• Sentiment classification from voice-based reviews
• Detecting accents from speech
• Language recognition from speech segments
• Audio search of speeches by politicians

Page 8:

Evaluation — Final Exam

Grading: Constitutes 30% of the total grade.

Syllabus: Will be tested on all the material covered in the course.

Format: Closed book, written exam.

Image from LOTR-I; meme not original

Page 9:

Academic Integrity Policy

• Write what you know.

• Use your own words.

• If you refer to *any* external material, *always* cite your sources. Follow proper citation guidelines.

• If you're caught plagiarising or copying, the penalties are much higher than what you would lose by simply omitting that question.

• In short: Just not worth it. Don’t do it!

Image credit: https://www.flickr.com/photos/kurok/22196852451

Page 10:

Introduction to Speech Recognition

Page 11:

Exciting time to be an AI/ML researcher!

Image credit: http://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html

Page 12:

Lots of new progress

What is speech recognition? Why is it such a hard problem?

Page 13:

Automatic Speech Recognition (ASR)

• Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically in the form of a word sequence.

Page 14:

Automatic Speech Recognition (ASR)

• Automatic speech recognition (or speech-to-text) systems transform speech utterances into their corresponding text form, typically in the form of a word sequence.

• Many downstream applications of ASR:

• Speech understanding: comprehending the semantics of text
• Audio information retrieval: searching speech databases
• Spoken translation: translating spoken language into foreign text
• Keyword search: searching for specific content words in speech

• Other related tasks include speaker recognition, speaker diarization, speech detection, etc.

Page 15:

History of ASR

RADIO REX (1922)

Page 16:

History of ASR

SHOEBOX (IBM, 1962)

[Timeline 1922-2012: 1 word, freq. detector (Radio Rex, 1922)]

Page 17:

History of ASR

[Timeline 1922-2012: 1 word, freq. detector (1922); 16 words, isolated word recognition (1962)]

HARPY (CMU, 1976)

Page 18:

History of ASR

[Timeline 1922-2012: 1 word, freq. detector (1922); 16 words, isolated word recognition (1962); 1000 words, connected speech (1976)]

HIDDEN MARKOV MODELS (1980s)

Page 19:

History of ASR

[Timeline 1922-2012: 1 word, freq. detector (1922); 16 words, isolated word recognition (1962); 1000 words, connected speech (1976); 10K+ words, LVCSR systems; Siri, Cortana]

DEEP NEURAL NETWORK BASED SYSTEMS (>2010)

Page 20:

Why is ASR a challenging problem?

Variability along different dimensions:

Style: Read speech or spontaneous (conversational) speech? Continuous natural speech or command & control?

Speaker characteristics: Rate of speech, accent, prosody (stress, intonation), speaker age, pronunciation variability even when the same speaker speaks the same word

Channel characteristics: Background noise, room acoustics, microphone properties, interfering speakers

Task specifics: Vocabulary size (very large number of words to be recognized), language-specific complexity, resource limitations

Page 21:

Noisy channel model

[Diagram: the noisy channel model, with Encoder, Noisy channel, and Decoder blocks and signals S, C, O, W]

Claude Shannon (1916-2001)

Page 22:

Noisy channel model applied to ASR

[Diagram: W → Speaker → Acoustic processor → O → Decoder → W*]

Claude Shannon (1916-2001)
Fred Jelinek (1932-2010)

Page 23:

Statistical Speech Recognition

Let O represent a sequence of acoustic observations (i.e., O = {O1, O2, …, OT}, where Ot is the feature vector observed at time t) and let W denote a word sequence. Then, the decoder chooses W* as follows:

W* = argmax_W Pr(W|O)
   = argmax_W Pr(O|W) Pr(W) / Pr(O)

This maximisation does not depend on Pr(O). So, we have

W* = argmax_W Pr(O|W) Pr(W)
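To make the decomposition concrete, here is a tiny Python sketch (not from the slides; the two candidate transcriptions and their log-probabilities are invented purely for illustration) showing that decoding reduces to an argmax over log Pr(O|W) + log Pr(W):

```python
# Toy sketch of noisy-channel decoding: W* = argmax_W Pr(O|W) Pr(W).
# The candidates and their log-probabilities are invented for illustration;
# a real decoder searches over a vast space of word sequences instead.
candidates = {
    "recognize speech":   (-12.3, -4.1),    # (log Pr(O|W), log Pr(W))
    "wreck a nice beach": (-11.8, -7.9),
}

def decode(candidates):
    # Work in log space, so the product Pr(O|W) Pr(W) becomes a sum.
    return max(candidates, key=lambda w: sum(candidates[w]))

print(decode(candidates))    # -> "recognize speech"
```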

Page 24:

Statistical Speech Recognition

W* = argmax_W Pr(O|W) Pr(W)

Pr(O|W) is referred to as the “acoustic model”

Pr(W) is referred to as the “language model”

[Diagram: speech signal → Acoustic Feature Generator → O → SEARCH → word sequence W*; SEARCH combines the Acoustic Model and the Language Model]

Page 25:

Example: Isolated word ASR task

Vocabulary: 10 digits (zero, one, two, …), 2 operations (plus, minus)

Data: Speech utterances corresponding to each word, sampled from multiple speakers

Recall the acoustic model is Pr(O|W): direct estimation is impractical (why?)

Let's parameterize Pr_α(O|W) using a Markov model with parameters α. Now, the problem reduces to estimating α.

Page 26:

Isolated word-based acoustic models

P. Jyothi, “Discriminative & AF-based Pron. models for ASR”, Ph.D. dissertation, 2013

Transition probabilities denoted by aij from state i to state j

Observation vectors Ot are generated from the probability density bj(Ot)

[Figure 2.1: Standard topology used to represent a phone HMM: a left-to-right HMM with non-emitting entry/exit states 0 and 4, emitting states 1, 2, 3 with self-loops a11, a22, a33, transitions a01, a12, a23, a34, and emission densities b1(·), b2(·), b3(·) generating observations O1, O2, O3, O4, …, OT]

The sub-word units Q correspond to the word sequence W, and the language model P(W) provides a prior probability for W.

Acoustic model: The most commonly used acoustic models in ASR systems today are Hidden Markov Models (HMMs). Please refer to Rabiner (1989) for a comprehensive tutorial on HMMs and their applicability to ASR in the 1980s (with ideas that are largely applicable to systems today). HMMs are used to build probabilistic models for linear sequence labelling problems. Since speech is represented in the form of a sequence of acoustic vectors O, it lends itself naturally to being modelled using HMMs.

The HMM is defined by specifying transition probabilities (aij) and observation (or emission) probability distributions (bj(Ot)), along with the number of hidden states in the HMM. An HMM makes a transition from state i to state j with a probability of aij. On reaching a state j, the observation vector at that state is generated from the emission distribution bj.

Model for word “one”
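As a rough sketch of what such a parameterised model looks like in code (my own illustration, not from the slides): the word-“one” model above becomes a 5-state left-to-right HMM whose parameters α are the transition probabilities aij and the emission densities bj(·). All numeric values below are placeholders that would normally be estimated from the training utterances.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the word-"one" HMM from Figure 2.1: states 0 and 4 are
# non-emitting entry/exit states; states 1-3 emit observation vectors.
# All numeric values are placeholders, not trained parameters.
N_STATES = 5
DIM = 2                       # dimensionality of each acoustic feature vector

# Transition matrix A[i, j] = aij: left-to-right with self-loops.
A = np.zeros((N_STATES, N_STATES))
A[0, 1] = 1.0                                 # a01
A[1, 1], A[1, 2] = 0.6, 0.4                   # a11, a12
A[2, 2], A[2, 3] = 0.6, 0.4                   # a22, a23
A[3, 3], A[3, 4] = 0.6, 0.4                   # a33, a34

# Emission densities bj(.) for the emitting states 1, 2, 3 (Gaussians here).
emissions = {
    j: multivariate_normal(mean=np.full(DIM, float(j)), cov=np.eye(DIM))
    for j in (1, 2, 3)
}

def log_b(j, o):
    """log bj(o): log-density of observation vector o under emitting state j."""
    return emissions[j].logpdf(o)

print(log_b(1, np.zeros(DIM)))    # score one (fake) observation under state 1
```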

Page 27:

Isolated word-based acoustic models

Model for word “one”: [HMM with the Figure 2.1 topology shown on the previous slide]

For an O = {O1, O2, …, O6} and a state sequence Q = {0, 1, 1, 2, 3, 4}:

Pr(O, Q|W = ‘one’) = a01 b1(O1) a11 b1(O2) …

Pr(O|W = ‘one’) = Σ_Q Pr(O, Q|W = ‘one’)
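Summing over every state sequence Q explicitly would take time exponential in the number of frames; the standard way to compute this sum is the forward algorithm, which does it in O(T·N²) time. Below is a minimal, self-contained sketch (my own, with made-up placeholder parameters; emission scores are passed in as a precomputed matrix):

```python
import numpy as np
from scipy.special import logsumexp

def log_forward(log_A, log_b):
    """Forward algorithm: log Pr(O|word model) = log of the sum over all state
    sequences Q, for an HMM with non-emitting entry state 0 and exit state N-1
    (the Figure 2.1 topology). log_A[i, j] = log aij; log_b[t, j] = log bj(Ot)."""
    T, N = log_b.shape
    # alpha[t, j] = log Pr(O1..Ot, state at time t is j)
    alpha = np.full((T, N), -np.inf)
    alpha[0, 1:N-1] = log_A[0, 1:N-1] + log_b[0, 1:N-1]          # enter from state 0
    for t in range(1, T):
        for j in range(1, N - 1):                                # emitting states only
            alpha[t, j] = logsumexp(alpha[t-1, 1:N-1] + log_A[1:N-1, j]) + log_b[t, j]
    return logsumexp(alpha[T-1, 1:N-1] + log_A[1:N-1, N-1])      # exit into state N-1

# Placeholder example: 5 states, 6 frames, made-up numbers.
A = np.zeros((5, 5))
A[0, 1] = 1.0
for j in (1, 2, 3):
    A[j, j], A[j, j + 1] = 0.6, 0.4
with np.errstate(divide="ignore"):                               # log(0) -> -inf
    log_A = np.log(A)
rng = np.random.default_rng(0)
log_b = np.log(rng.uniform(0.1, 1.0, size=(6, 5)))               # stand-in for log bj(Ot)
print("log Pr(O | W='one') =", log_forward(log_A, log_b))
```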

Page 28:

Isolated word recognition

one: [HMM for “one” (Figure 2.1 topology)]
two: [HMM for “two” (Figure 2.1 topology)]
plus: [HMM for “plus” (Figure 2.1 topology)]
minus: [HMM for “minus” (Figure 2.1 topology)]

... acoustic features O

What are we assuming about Pr(W)?

Pr(O|W = ‘one’)
Pr(O|W = ‘two’)
Pr(O|W = ‘plus’)
Pr(O|W = ‘minus’)

Pick argmax_w Pr(O|W = w)
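A minimal sketch of the recogniser implied by this slide (my own illustration; the per-word scorers below are stand-ins for trained HMM likelihoods such as the log_forward scores from the earlier sketch): score O under every word model and pick the argmax. Note that using only Pr(O|W = w) amounts to assuming a uniform prior Pr(W) over the 12-word vocabulary, which is one answer to the question above.

```python
import numpy as np

# Sketch of isolated word recognition: one HMM per vocabulary word; pick the
# word whose model assigns the highest likelihood to the acoustic features O.
# The per-word scorers below are stand-ins for trained HMM likelihoods.
VOCAB = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "plus", "minus"]
rng = np.random.default_rng(0)
word_models = {w: (lambda O, s=rng.normal(): s - 0.01 * len(O)) for w in VOCAB}

def recognize(O, word_models):
    # argmax_w log Pr(O|W = w). Using only the acoustic score amounts to
    # assuming a uniform language-model prior Pr(W) over the vocabulary.
    return max(word_models, key=lambda w: word_models[w](O))

O = rng.normal(size=(60, 13))      # e.g. 60 frames of 13-dimensional features
print(recognize(O, word_models))
```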

Page 29:

Isolated word recognition

one: [HMM for “one” (Figure 2.1 topology)]
two: [HMM for “two” (Figure 2.1 topology)]
plus: [HMM for “plus” (Figure 2.1 topology)]
minus: [HMM for “minus” (Figure 2.1 topology)]

... acoustic features O

Pr(O|W = ‘one’)

Pr(O|W = ‘two’)

Pr(O|W = ‘plus’)

Pr(O|W = ‘minus’)

Is this approach scalable?

Page 30:

Architecture of an ASR system

[Diagram: speech signal → Acoustic Feature Generator → O → SEARCH → word sequence W*; SEARCH combines the Acoustic Model (phones), the Pronunciation Model, and the Language Model]

Page 31:

Evaluate an ASR system

Quantitative metric: Error rates computed on an unseen test set by comparing W* (decoded output) against Wref (reference sentence) for each test utterance

• Sentence/Utterance error rate (trivial to compute!)

• Word/Phone error rate

Page 32:

Evaluate an ASR system

Word/Phone error rate (ER) uses the Levenshtein distance measure: what is the minimum number of edits (insertions/deletions/substitutions) required to convert W* to Wref?

On a test set with N instances:

ER = Σ_{j=1..N} (Ins_j + Del_j + Sub_j) / Σ_{j=1..N} ℓ_j

where Ins_j, Del_j, Sub_j are the numbers of insertions/deletions/substitutions in the jth ASR output, and ℓ_j is the total number of words/phones in the jth reference.
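For completeness, a hedged sketch of how this error rate is typically computed in practice (function and variable names are my own, not from any particular toolkit): a dynamic-programming Levenshtein alignment that also backtracks to count insertions, deletions, and substitutions, accumulated over the test set.

```python
def edit_ops(ref, hyp):
    """Minimum insertions, deletions, substitutions to turn `hyp` into `ref`
    (Levenshtein alignment of two word/phone lists), via dynamic programming."""
    R, H = len(ref), len(hyp)
    # d[i][j] = min edits aligning first i reference tokens with first j hypothesis tokens
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                       # all deletions
    for j in range(H + 1):
        d[0][j] = j                       # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    # Backtrack to split the distance into Ins/Del/Sub counts.
    ins = dels = subs = 0
    i, j = R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            subs += ref[i-1] != hyp[j-1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            dels += 1                     # reference token missing from hypothesis
            i -= 1
        else:
            ins += 1                      # extra token in hypothesis
            j -= 1
    return ins, dels, subs

def error_rate(refs, hyps):
    """ER = sum_j (Ins_j + Del_j + Sub_j) / sum_j len(ref_j) over the test set."""
    total_edits = sum(sum(edit_ops(r, h)) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    return total_edits / total_len

refs = [["the", "cat", "sat"], ["hello", "world"]]
hyps = [["the", "cat", "sat", "down"], ["hello", "word"]]
print(error_rate(refs, hyps))   # (1 insertion + 1 substitution) / 5 tokens = 0.4
```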

Page 33:

Course Overview

[Diagram: the ASR architecture from the previous slides, annotated with the corresponding course topics]

Properties of speech sounds; acoustic signal processing; Hidden Markov Models; Deep Neural Networks; hybrid HMM-DNN systems; speaker adaptation; Ngram/RNN LMs; G2P/feature-based models

Page 34:

Course Overview

[Same architecture diagram and topics as the previous slide, with one addition:]

Search algorithms

Page 35:

Course Overview

[Same architecture diagram and topics as the previous slides, with one addition:]

Formalism: Finite State Transducers

Page 36:

Next two classes: Weighted Finite State Transducers in ASR