
Page 1: Introduction to Speech Signal Processing

Introduction to Speech Signal Processing

Dr. Zhang Sen
[email protected]
Chinese Academy of Sciences, Beijing, China

Page 2: Introduction to Speech Signal Processing

• Introduction
– Sampling and quantization
– Speech coding
• Features and Analysis
– Main features
– Some transformations
• Speech-to-Text
– State of the art
– Main approaches
• Text-to-Speech
– State of the art
– Main approaches
• Applications
– Human-machine dialogue systems

Page 3: Introduction to Speech Signal Processing

• Some useful websites for ASR tools
– http://htk.eng.cam.ac.uk
• Free, available since 2000; has a relationship with Microsoft
• Over 12,000 users; versions 2.1, 3.0, 3.1, 3.2
• Includes source code and the HTK Book
• A set of tools for training, decoding, and evaluation
• Steve Young at Cambridge University
– http://www.cs.cmu.edu
• Free for research and education
• Sphinx 2 and 3
• Tools, source code, speech databases
• Raj Reddy at CMU

Page 4: Introduction to Speech Signal Processing

Research on speech recognition in the world

Page 5: Introduction to Speech Signal Processing

• Carnegie Mellon University
– CMU SCS Speech Group
– Interact Lab
• Oregon Graduate Institute
– Center for Spoken Language Understanding
• MIT
– Lab for Computer Science, Spoken Language Systems
– Acoustics & Vibration Lab
– AI Lab
– Lincoln Lab, Speech Systems Technology Group
• Stanford University
– Center for Computer Research in Music and Acoustics
– Center for the Study of Language and Information

Page 6: Introduction to Speech Signal Processing

• University of California
– Berkeley, Santa Cruz, Los Angeles
• Boston University
– Signal Processing and Interpretation Lab
• Georgia Institute of Technology
– Digital Signal Processing Lab
• Johns Hopkins University
– Center for Language and Speech Processing
• Brown University
– Lab for Engineering Man-Machine Systems
• Mississippi State University
• Colorado University
• Cornell University

Page 7: Introduction to Speech Signal Processing

• Cambridge University
– Speech, Vision and Robotics Group
• Edinburgh University
– Human Communication Research Centre
– Centre for Speech Technology Research
• University College London
– Phonetics and Linguistics
• University of Essex
– Dept. of Language and Linguistics

Page 8: Introduction to Speech Signal Processing

• LIMSI, France
• INRIA, France
– Institut National de Recherche en Informatique et en Automatique
• University of Karlsruhe, Germany
– Interactive Systems Lab
• DFKI, Germany
– German Research Center for Artificial Intelligence
• KTH, Sweden
– Speech Communication & Music Acoustics
• CSELT, Italy
– Centro Studi e Laboratori Telecomunicazioni, Torino
• IRST, Italy
– Istituto per la Ricerca Scientifica e Tecnologica, Trento
• ATR, Japan

Page 9: Introduction to Speech Signal Processing

• AT&T, Advanced Speech Product Group
• Lucent Technologies, Bell Laboratories
• IBM, IBM VoiceType
• Texas Instruments Incorporated
• National Institute of Standards and Technology
• Apple Computer Co.
• Digital Equipment Corporation (DEC)
• SRI International
• Dragon Systems Co.
• Sun Microsystems Labs, speech applications
• Microsoft Corporation, speech technology (SAPI)
• Entropic Research Laboratory, Inc.

Page 10: Introduction to Speech Signal Processing

• Important conferences and journals
– IEEE Trans. on ASSP
– ICASSP (every year)
– EUROSPEECH (every odd year)
– ICSLP (every even year)
– STAR (Speech Technology and Research) at SRI

Page 11: Introduction to Speech Signal Processing

Brief history and state-of-the-art of the research on speech recognition

Page 12: Introduction to Speech Signal Processing

ASR Progress Overview

• 50's: isolated digit recognition (Bell Labs)
• 60's: hardware speech segmenter (Japan); dynamic programming (U.S.S.R.)
• 70's: clustering algorithms (speaker independence); DTW
• 80's: HMM, DARPA, SPHINX
• 90's: adaptation, robustness

Page 13: Introduction to Speech Signal Processing

1952 Bell Labs Digits

• First word (digit) recognizer
• Approximates energy in formants (vocal tract resonances) over the word
• Already had some robust ideas (insensitive to amplitude and timing variation)
• Worked very well
• Main weakness was technological (resistors and capacitors)

Page 14: Introduction to Speech Signal Processing

The 60's

• Better digit recognition
• Breakthroughs: spectrum estimation (FFT, cepstra, LPC), dynamic time warping (DTW), and hidden Markov model (HMM) theory
• Hardware speech segmenter (Japan)

Page 15: Introduction to Speech Signal Processing

1971-76 ARPA Project

• Focus on speech understanding
• Main work at 3 sites: System Development Corporation, CMU, and BBN
• Other work at Lincoln, SRI, Berkeley
• Goal was 1000-word ASR, a few speakers, connected speech, constrained grammar, less than 10% semantic error

Page 16: Introduction to Speech Signal Processing

Results

• Only CMU's Harpy fulfilled the goals - it used LPC, segments, and lots of high-level knowledge, and learned from Dragon* (Baker)

* The CMU system done in the early '70s, as opposed to the company formed in the '80s

Page 17: Introduction to Speech Signal Processing

Achieved by 1976

• Spectral and cepstral features, LPC
• Some work with phonetic features
• Incorporating syntax and semantics
• Initial neural network approaches
• DTW-based systems (many)
• HMM-based systems (Dragon, IBM)

Page 18: Introduction to Speech Signal Processing

Dynamic Time Warp

• Optimal time normalization with dynamic programming (a minimal sketch follows below)
• Proposed by Sakoe and Chiba, circa 1970
• A similar proposal by Itakura at about the same time
• Probably Vintsyuk was first (1968)
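
As a concrete illustration of DTW-based template matching, here is a minimal sketch in Python (numpy assumed; the template dictionary in the usage comment is hypothetical):

import numpy as np

def dtw_distance(X, Y):
    # Dynamic time warping distance between feature sequences X (T1 x D)
    # and Y (T2 x D): Euclidean local cost plus the standard
    # insertion/deletion/match step pattern.
    T1, T2 = len(X), len(Y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D[T1, T2]

# Isolated-word recognition picks the template with the smallest distance:
# templates = {"yes": feats_yes, "no": feats_no}   # hypothetical
# best = min(templates, key=lambda w: dtw_distance(test_feats, templates[w]))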

Page 19: Introduction to Speech Signal Processing

HMMs for Speech

• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the original CMU Dragon system (1974)
• Developed by IBM (Baker, Jelinek, Bahl, Mercer, ...) (1970-1993)
• Extended by others in the mid-1980's

Page 20: Introduction to Speech Signal Processing

The 1980's

• Collection of large standard corpora
• Front ends: auditory models, dynamics
• Engineering: scaling to large-vocabulary continuous speech
• Second major (D)ARPA ASR project
• HMMs become ready for prime time

Page 21: Introduction to Speech Signal Processing

Standard Corpora Collection

• Before 1984, chaos
• TIMIT
• RM (later WSJ)
• ATIS
• NIST, ARPA, LDC

Page 22: Introduction to Speech Signal Processing

Front Ends in the 1980's

• Mel cepstrum (Bridle, Mermelstein)
• PLP (Hermansky)
• Delta cepstrum (Furui)
• Auditory models (Seneff, Ghitza, others)

Page 23: Introduction to Speech Signal Processing

Dynamic Speech Features

• Temporal dynamics are useful for ASR
• Local time derivatives of the cepstra
• "Delta" features estimated over multiple frames (typically 5); see the sketch below
• Usually augment the static features
• Can be viewed as a temporal filter
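
A minimal sketch of the delta computation over a 5-frame window (N = 2 frames on each side), in the usual regression form (as in HTK); numpy assumed, and the cepstra array in the usage comment is hypothetical:

import numpy as np

def delta(feats, N=2):
    # d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    # with edge frames replicated so every frame has a full window.
    T = len(feats)
    padded = np.concatenate([feats[:1].repeat(N, axis=0), feats,
                             feats[-1:].repeat(N, axis=0)])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)) / denom
        for t in range(T)
    ])

# Augment static cepstra with their deltas:
# obs = np.hstack([cepstra, delta(cepstra)])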

Page 24: Introduction to Speech Signal Processing

HMMs for Continuous Speech

• Using dynamic programming for continuous speech (Vintsyuk, Bridle, Sakoe, Ney, ...)
• Application of Baker-Jelinek ideas to continuous speech (IBM, BBN, Philips, ...)
• Multiple groups developing major HMM systems (CMU, SRI, Lincoln, BBN, ATT)
• Engineering development - coping with data, fast computers

Page 25: Introduction to Speech Signal Processing

2nd (D)ARPA Project

• Common task
• Frequent evaluations
• Convergence to good, but similar, systems
• Lots of engineering development - now up to 60,000-word recognition, in real time, on a workstation, with less than 10% word error
• Competition inspired others not in the project - Cambridge did HTK, now widely distributed

Page 26: Introduction to Speech Signal Processing

Some 1990's Issues

• Independence from the long-term spectrum
• Adaptation
• Effects of spontaneous speech
• Information retrieval/extraction with broadcast material
• Query-style systems (e.g., ATIS)
• Applying ASR technology to related areas (language ID, speaker verification)

Page 27: Introduction to Speech Signal Processing

Real Uses

• Telephone: phone company services (collect versus credit card)
• Telephone: call centers for query information (e.g., stock quotes, parcel tracking)
• Dictation products: continuous recognition, speaker dependent/adaptive

Page 28: Introduction to Speech Signal Processing

State-of-the-art of ASR

• Tremendous technical advances in the last few years
• From small to large vocabularies
– 5,000-10,000 word vocabulary
– 10,000-60,000 word vocabulary
• From isolated words to spontaneous talk
– Continuous speech recognition
– Conversational and spontaneous speech recognition
• From speaker-dependent to speaker-independent
– Modern ASR is fully speaker independent

Page 29: Introduction to Speech Signal Processing

SOTA ASR Systems

• IBM, ViaVoice
– Speaker-independent, continuous command recognition
– Large-vocabulary recognition
– Text-to-speech confirmation
– Barge-in (the ability to interrupt an audio prompt as it is playing)
• Microsoft, Whisper, Dr Who

Page 30: Introduction to Speech Signal Processing

SOTA ASR Systems

• DARPA
– 1982
– Goals:
• High accuracy
• Real-time performance
• Understanding capability
• Continuous speech recognition
– DARPA databases:
• 997 words (RM)
• Above 100 speakers
• TIMIT

Page 31: Introduction to Speech Signal Processing

SOTA ASR Systems

• SPHINX II
– CMU
– HMM-based speech recognition
– Bigram and word-pair language models
– Generalized triphones
– DARPA database
– 97% recognition (perplexity 20)
• SPHINX III
– CHMM (continuous-density HMM) based
– WER about 15% on WSJ

Page 32: Introduction to Speech Signal Processing

ASR Advances

How ASR capability expanded from 1985 to 2005, along four dimensions:

NOISE ENVIRONMENT
– 1985: quiet room, fixed high-quality mic
– 1995: normal office, various microphones, telephone
– 2000: vehicle noise, radio, cell phones
– 2005: wherever speech occurs

SPEECH STYLE
– 1985: careful reading
– 1995: planned speech
– 2000: natural human-machine dialog (user can adapt)
– 2005: all styles, including human-human (unaware)

USER POPULATION
– 1985: speaker-dependent
– 1995: speaker independent and adaptive
– 2000: regional accents, native speakers, competent foreign speakers
– 2005: all speakers of the language, including foreign

COMPLEXITY
– 1985: application-specific speech and language
– 1995: expert years to create an app-specific language model
– 2000: some application-specific data and one engineer year
– 2005: application independent or adaptive

Page 33: Introduction to Speech Signal Processing

But

• Still <97% accurate on "yes" for telephone speech
• An unexpected rate of speech causes a doubling or tripling of the error rate
• An unexpected accent hurts badly
• Accuracy on unrestricted speech is around 60%
• We don't know when we know
• Few advances in basic understanding

Page 34: Introduction to Speech Signal Processing

How to Measure the Performance?

• What benchmarks?
– DARPA
– NIST (hub-4, hub-5, ...)
• What was the training data?
• What was the test?
• Were they independent?
• What were the vocabulary and the sample size?
• Was the noise added or coincident with the speech? What kind of noise?
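
The standard criterion behind these questions is the word error rate (WER): substitutions, deletions, and insertions from a word-level edit-distance alignment, divided by the reference length. A minimal scoring sketch (the example transcripts are hypothetical):

def wer(ref, hyp):
    # Levenshtein alignment at the word level:
    # WER = (substitutions + deletions + insertions) / len(ref).
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                    # all deletions
    for j in range(H + 1):
        d[0][j] = j                    # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[R][H] / R

# wer("the cat sat".split(), "the bat sat down".split())  ->  2/3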

Page 35: Introduction to Speech Signal Processing

ASR Performance

• Spontaneous telephone speech is still a "grand challenge".
• Telephone-quality speech is still central to the problem.
• Broadcast news is a very dynamic domain.

[Figure: word error rate (WER, 0%-40%) versus level of difficulty, rising from digits, continuous digits, command and control, and letters and numbers through read speech to broadcast news and conversational speech.]

Page 36: Introduction to Speech Signal Processing

Machine vs Human Performance

[Figure: word error rate versus speech-to-noise ratio (10 dB, 16 dB, 22 dB, quiet) on Wall Street Journal data with additive noise; machines degrade far more steeply than a committee of human listeners.]

• Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
• On some tasks, such as credit card number recognition, machine performance exceeds humans due to human memory retrieval capacity.
• The nature of the noise is as important as the SNR (e.g., cellular phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is the lack of familiarity with the domain (i.e., business terms and corporation names).

Page 37: Introduction to Speech Signal Processing

Core technology for ASR

Page 38: Introduction to Speech Signal Processing

Why is ASR Hard?

• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over: global rate, local rate, pronunciation within a speaker, pronunciation across speakers, phonemes in different contexts

Page 39: Introduction to Speech Signal Processing

Why is ASR Hard? (continued)

• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for performance equal to or greater than "human performance"

Page 40: Introduction to Speech Signal Processing

Main Causes of Speech Variability

• Environment
– Speech-correlated noise: reverberation, reflection
– Uncorrelated noise: additive noise (stationary, nonstationary)
• Speaker
– Attributes of speakers: dialect, gender, age
– Manner of speaking: breath and lip noise, stress, Lombard effect, rate, level, pitch, cooperativeness
• Input Equipment
– Microphone (transmitter), distance from microphone, filter
– Transmission system: distortion, noise, echo
– Recording equipment

Page 41: Introduction to Speech Signal Processing

ASR Dimensions

• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech

Page 42: Introduction to Speech Signal Processing

Telephone Speech

• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and hands-free acoustics

Page 43: Introduction to Speech Signal Processing

What is Speech Recognition?

[Diagram: speech signal -> speech recognition -> words, e.g. "How are you?"]

• Related areas:
– Who is the talker? (speaker recognition, identification)
– What language did he speak? (language recognition)
– What is his meaning? (speech understanding)

Page 44: Introduction to Speech Signal Processing

What is the problem?

Find the most likely word sequence Ŵ among all possible word sequences, given the acoustic evidence A:

Ŵ = argmax_W P(W | A)

A tractable reformulation of the problem (Bayes' rule; P(A) does not affect the argmax) is:

Ŵ = argmax_W P(A | W) · P(W)

where P(W) is the language model, P(A | W) is the acoustic model, and the maximization over all word sequences is a daunting search task.

Page 45: Introduction to Speech Signal Processing

View ASR as Pattern Recognition

[Diagram: analog speech -> Front End -> observation sequence O1 O2 ... OT -> Decoder -> best word sequence W1 W2 ... WT. The decoder draws on the acoustic model, the dictionary, and the language model.]

Page 46: Introduction to Speech Signal Processing

View ASR in Hierarchy

[Diagram: speech waveform -> feature extraction (signal processing) -> spectral feature vectors -> phone likelihood estimation (Gaussians or neural networks) -> phone likelihoods P(o|q) -> decoding (Viterbi or stack decoder) -> words. The likelihood estimator may be a neural net; decoding uses an HMM lexicon and an N-gram grammar.]

Page 47: Introduction to Speech Signal Processing

Front-End Processing

[Diagram after K.F. Lee: the front-end processing chain, producing static and dynamic features.]

Page 48: Introduction to Speech Signal Processing

Feature Extraction

• Goals:
– Less computation and memory
– A simple representation of the signal
• Methods:
– Fourier-spectrum based
• MFCC (mel-frequency cepstrum coefficients)
• LFCC (linear-frequency cepstrum coefficients)
• Filter-bank energies
– Linear-prediction-spectrum based
• LPC (linear predictive coding)
• LPCC (linear predictive cepstrum coefficients)
– Others
• Zero crossings, pitch, formants, amplitude

Page 49: Introduction to Speech Signal Processing

Cepstrum Computation

• The cepstrum is the inverse Fourier transform of the log spectrum:

c(n) = (1/2π) ∫ from -π to π of log |S(e^jω)| · e^jωn dω,  n = 0, 1, ..., L-1

• In computation the IDFT takes the form of a weighted DCT (see HTK); a one-frame sketch follows.
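
A one-frame sketch of this computation (numpy assumed; frame is a hypothetical windowed signal segment, and keeping L = 13 coefficients is an illustrative choice):

import numpy as np

def real_cepstrum(frame, L=13):
    # c(n) = inverse DFT of the log magnitude spectrum; keep the first L terms.
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # floor to avoid log(0)
    return np.fft.irfft(log_mag)[:L]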

Page 50: Introduction to Speech Signal Processing

Mel Cepstral Coefficients

• Construct the mel-frequency domain using a triangularly-shaped weighting function applied to mel-transformed log-magnitude spectral samples
• Filter bank: linear below 1 kHz, logarithmic above 1 kHz
• Motivated by human auditory response characteristics
• Processing chain: FFT and log, then DCT transform (sketched below)
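
A compact sketch of the whole MFCC chain described above (numpy and scipy assumed; the frame and filter-bank parameters are illustrative defaults, not values from the slide):

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced uniformly on the mel scale
    # (roughly linear below 1 kHz, logarithmic above).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(frame, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    # Window -> FFT power spectrum -> mel filter bank -> log -> DCT.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sr) @ spec
    return dct(np.log(energies + 1e-10), type=2, norm='ortho')[:n_ceps]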

Page 51: Introduction to Speech Signal Processing

Cepstrum as Vector Space Features

[Figure: the signal is analyzed in overlapping frames, each frame yielding one cepstral feature vector.]

Page 52: Introduction to Speech Signal Processing

Other Features

• LPC: linear predictive coefficients (a sketch follows below)
• PLP: perceptual linear prediction
• Though MFCC has been used successfully, what is the truly robust speech feature?
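
For LPC, a minimal sketch of the autocorrelation method with the Levinson-Durbin recursion (numpy assumed; the analysis order of 12 is an illustrative choice):

import numpy as np

def lpc(frame, order=12):
    # Autocorrelation method: solve the normal equations with the
    # Levinson-Durbin recursion. Returns a[0..order] with a[0] = 1, so that
    # x[t] is predicted by -sum_{j=1..order} a[j] * x[t-j].
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a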

Page 53: Introduction to Speech Signal Processing

Acoustic Models

• Template-based AM, used in DTW; obsolete
• Hidden Markov Model based AM; popular now
• Other AMs
– Articulatory AM
– Knowledge-based approach: spectrogram reading (expert system)
– Connectionist approach: TDNN

Page 54: Introduction to Speech Signal Processing

Template-based Approach

• Dynamic programming algorithm
• Distance measure
• Isolated word
• Scaling invariance
• Time warping
• Cluster method

Page 55: Introduction to Speech Signal Processing

Definition of HMM

A formal definition of an HMM:

• An output observation alphabet O = {o_1, o_2, ..., o_M}
• The set of states {1, 2, ..., N}
• A transition probability matrix A = {a_ij}, where a_ij = P(s_t = j | s_{t-1} = i)
• An output probability matrix B = {b_i(k)}, where b_i(k) = P(X_t = o_k | s_t = i)
• An initial state distribution π_i = P(s_0 = i)

Assumptions:
• Markov assumption
• Output independence assumption

Page 56: Introduction to Speech Signal Processing

Three Problems of HMM

Given a model Φ and a sequence of observations:

• The Evaluation Problem: how do we compute the probability of the observation sequence? Forward algorithm (sketched below).
• The Decoding Problem: how do we find the optimal state sequence associated with a given observation sequence? Viterbi algorithm.
• The Training/Learning Problem: how can we adjust the model parameters to maximize the joint probability? Baum-Welch algorithm (forward-backward algorithm).
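
A minimal sketch of the forward algorithm for the evaluation problem (numpy assumed; the toy 2-state, 2-symbol model in the comments is hypothetical):

import numpy as np

def forward(pi, A, B, obs):
    # P(O | model) for a discrete HMM.
    # pi: initial state probabilities (N,), A: transition matrix (N, N),
    # B: emission matrix (N, M), obs: sequence of observation indices.
    alpha = pi * B[:, obs[0]]              # initialization: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction over time
    return alpha.sum()                     # termination

# pi = np.array([0.6, 0.4]); A = np.array([[0.7, 0.3], [0.4, 0.6]])
# B = np.array([[0.9, 0.1], [0.2, 0.8]])
# forward(pi, A, B, [0, 1, 0])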

Page 57: Introduction to Speech Signal Processing

Advantages of HMM

• Isolated and continuous speech recognition
• No attempt to find word boundaries
• Recovery from erroneous assumptions
• Scaling invariance, time warping, learning capability

Page 58: Introduction to Speech Signal Processing

Limitations of HMM

• HMMs assume the state duration follows an exponential (geometric) distribution
• The transition probability depends only on the origin and destination states
• All observation frames depend only on the state that generated them, not on the neighboring observation frames

Page 59: Introduction to Speech Signal Processing

HMM-based AM

• Hidden Markov Models (HMMs)
– Probabilistic state machines: the state sequence is unknown, only the feature-vector outputs are observed
– Each state has an output symbol distribution
– Each state has a transition probability distribution
– Issues:
• What topology is proper?
• How many states in a model?
• How many mixtures in a state?

Page 60: Introduction to Speech Signal Processing

Hidden Markov Models

• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of the models.
• Sharing model parameters (tying) is a common strategy to reduce complexity.

Page 61: Introduction to Speech Signal Processing

AM Parameter Estimation

• Closed-loop, data-driven modeling, supervised only by a word-level transcription.
• The expectation-maximization (EM) algorithm is used to improve our parameter estimates.
• Computationally efficient training algorithms (forward-backward) have been crucial.
• Batch-mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter sharing (tying), system complexity, and the use of additional linguistic knowledge.

The usual estimation schedule (sketched below):
• Initialization
• Single Gaussian estimation
• 2-way split
• Mixture distribution reestimation
• 4-way split
• Reestimation
• ...
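
A minimal sketch of this split-and-reestimate schedule on 1-D data (numpy assumed; the split offset eps, the iteration count, and running on raw 1-D samples rather than per-state feature vectors are all simplifications):

import numpy as np

def em_step(x, w, mu, var):
    # One EM pass for a 1-D Gaussian mixture: E-step responsibilities,
    # then M-step weighted reestimation of weights, means, variances.
    lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = lik / lik.sum(axis=1, keepdims=True)
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return nk / len(x), mu, var

def train_by_splitting(x, n_mix=4, eps=0.2, iters=10):
    # Start from a single Gaussian, then repeatedly split each mean by
    # +/- eps * sigma and reestimate, doubling the mixture count each round.
    w, mu, var = np.array([1.0]), np.array([x.mean()]), np.array([x.var()])
    while len(mu) < n_mix:
        sd = np.sqrt(var)
        w = np.concatenate([w, w]) / 2.0
        mu = np.concatenate([mu - eps * sd, mu + eps * sd])
        var = np.concatenate([var, var])
        for _ in range(iters):
            w, mu, var = em_step(x, w, mu, var)
    return w, mu, var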

Page 62: Introduction to Speech Signal Processing

Basic Speech Units

• Recognition units: phoneme, word, syllable, demisyllable, triphone, diphone

Page 63: Introduction to Speech Signal Processing

Basic Units Selection

• Create a set of HMMs representing the basic sounds (phones) of a language
– English has about 40 distinct phonemes
– Chinese has about 22 initials + 37 finals
– Need a "lexicon" for pronunciations
– Letter-to-sound rules for unusual words
– Co-articulation effects must be modeled
• Triphones: each phone modified by its onset and trailing context phones (1k-2k used in English), e.g. pl-c+pr

Page 64: Introduction to Speech Signal Processing

Language Models

• What is a language model?
– A quantitative ordering of the likelihood of word sequences (statistical viewpoint)
– A set of rules specifying how to create word sequences or sentences (grammar viewpoint)
• Why use language models?
– Not all word sequences are equally likely
– Search space optimization (*)
– Improved accuracy (multiple passes)
– Word lattice to n-best

Page 65: Introduction to Speech Signal Processing

Finite-State Language Model

• Write a grammar of possible sentence patterns
• Advantages:
– Long history/context
– No need for a large text database (rapid prototyping)
– Integrated syntactic parsing
• Problems:
– Work to write the grammars
– Word sequences the grammar does not enable simply do not exist
– Used in small-vocabulary ASR, not in LVCSR

[Example grammar from the slide: ("show me" | "display") ("any" | "the next" | "the last") ("page" | "picture" | "text file")]

Page 66: Introduction to Speech Signal Processing

Statistical Language Models

• Predict the next word based on the current word and the history
• The probability of the next word is given by:
– Trigram: P(wi | wi-1, wi-2)
– Bigram: P(wi | wi-1)
– Unigram: P(wi)
• Advantages:
– Trainable on large text databases
– "Soft" prediction (probabilities)
– Can be combined directly with the AM in decoding
• Problems:
– Need a large text database for each domain
– Sparseness problems; smoothing approaches:
• backoff approach (sketched below)
• word-class approach
• Used in LVCSR
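
A minimal sketch of a bigram model with a crude backoff to unigrams, in the spirit of the backoff approach above (not a properly normalized Katz backoff; the training sentences and the backoff weight alpha are hypothetical):

from collections import Counter

def train_bigram(sentences):
    # Count unigrams and bigrams over tokenized sentences
    # with sentence-boundary marks.
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def p_next(w, prev, uni, bi, alpha=0.4):
    # P(w | prev): maximum likelihood if the bigram was seen,
    # otherwise back off to a scaled unigram probability.
    if bi[(prev, w)] > 0:
        return bi[(prev, w)] / uni[prev]
    return alpha * uni[w] / sum(uni.values())

# uni, bi = train_bigram(["show me the next page", "display the last page"])
# p_next("page", "last", uni, bi)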

Page 67: Introduction to Speech Signal Processing

Statistical LM Performance

Page 68: Introduction to Speech Signal Processing

ASR Decoding Levels

[Diagram: the decoder works at several levels at once - HMM states, phonemes (e.g. /w/ -> /ah/ -> /ts/ and /th/ -> /ax/ for "what's the"), words (e.g. display, kirk's, willamette's, sterett's, location, longitude, latitude), and sentences. The acoustic models constrain the states, the dictionary maps phonemes to words, and the language model constrains the word sequences.]

Page 69: Introduction to Speech Signal Processing

Decoding Algorithms

• Given the observations, how do we determine the most probable utterance/word sequence? (DTW in template-based matching)
• The dynamic programming (DP) algorithm was proposed by Bellman in the 50's for multistep decision processes; the "principle of optimality" is divide and conquer.
• DP-based search algorithms have been used in speech recognition decoders to return the n-best paths or a word lattice through the acoustic model and the language model.
• A complete search is usually impossible since the search space is too large, so beam search is required to prune less probable paths and save computation.
• Issues: computational underflow, balancing the LM and AM.

Page 70: Introduction to Speech Signal Processing

Viterbi Search

• Uses Viterbi decoding
– Takes MAX, not SUM (Viterbi vs. forward)
– Finds the optimal state sequence, not the optimal word sequence
– Computation load: O(T·N²)
• Time synchronous
– Extends all paths at each time step
– All paths have the same length (no need to normalize to compare scores, but A* decoding needs it)

Page 71: Introduction to Speech Signal Processing

Viterbi Search Algorithm

function VITERBI(observations of length T, state-graph) returns best-path
    num-states <- NUM-OF-STATES(state-graph)
    create path probability matrix viterbi[num-states+2, T+2]
    viterbi[0, 0] <- 1.0
    for each time step t from 0 to T do
        for each state s from 0 to num-states do
            for each transition s' from s in state-graph do
                new-score <- viterbi[s, t] * a[s, s'] * b_s'(o_t)
                if (viterbi[s', t+1] = 0) or (viterbi[s', t+1] < new-score) then
                    viterbi[s', t+1] <- new-score
                    back-pointer[s', t+1] <- s
    backtrace from the highest-probability state in the final column of viterbi[] and return the path
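
The same search as a runnable Python sketch for a discrete HMM (numpy assumed; pi, A, B follow the notation of the HMM definition slide, and the state graph is implicit in A):

import numpy as np

def viterbi(obs, pi, A, B):
    # Best state path for a discrete HMM: pi (N,) initial probabilities,
    # A (N, N) transitions, B (N, M) emissions, obs: observation indices.
    T, N = len(obs), len(pi)
    score = np.zeros((T, N))             # best path probability per state
    back = np.zeros((T, N), dtype=int)   # back-pointers
    score[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] * A     # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)        # MAX, not SUM (contrast: forward)
        score[t] = cand.max(axis=0) * B[:, obs[t]]
    # Backtrace from the highest-probability state in the final column.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score[-1].max())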

Page 72: Introduction to Speech Signal Processing

Viterbi Search Trellis

[Figure: search trellis over time steps 0, 1, 2, 3, ..., t for word models W1 and W2.]

Page 73: Introduction to Speech Signal Processing

Viterbi Search Insight

[Figure: from time t to t+1, within-word state transitions (S1, S2, S3) are scored as OldProb(S1) · OutProb · TransProb; at a word boundary, the transition from the final state S3 of Word 1 into Word 2 is scored as OldProb(S3) · P(W2 | W1). Each trellis cell stores a score, a back pointer, and a parameter pointer.]

Page 74: Introduction to Speech Signal Processing

Backtracking

• Find the best association between words and signal
• Compose words from phones using the dictionary
• Backtracking recovers the best state sequence

[Figure: alignment of the phones /th/ and /e/ against frames t1 ... tn.]

Page 75: Introduction to Speech Signal Processing

N-Best Speech Results

• Use a grammar to guide recognition
• Post-processing based on the grammar/LM
• Word lattice to n-best conversion

[Figure: a speech waveform plus grammar go into the ASR, which returns an n-best list, e.g. N=1 "Get me two movie tickets...", N=2 "I want to movie trips...", N=3 "My car's too groovy".]

Page 76: Introduction to Speech Signal Processing

Complexity of Search

• Lexicon: contains all the words in the system's vocabulary along with their pronunciations; often there are multiple pronunciations per word (# of items in the lexicon)
• Acoustic models: HMMs that represent the basic sound units the system can recognize (# of models, # of states per model, # of mixtures per state)
• Language model: determines the possible word sequences allowed by the system (fan-out, perplexity, entropy)

Page 77: Introduction to Speech Signal Processing

ASR vs Modern AI

• ASR is based on AI techniques
– Knowledge representation and manipulation: AM and LM, lexicon, observation vectors
– Machine learning: Baum-Welch for HMMs; nearest-neighbor and k-means clustering for signal identification
– "Soft" probabilistic reasoning / Bayes rule: manages the uncertainty of the mapping between signal, phones, and words
– ASR is an expert system

Page 78: Introduction to Speech Signal Processing

ASR Summary

• The performance criterion is WER (word error rate)
• Three main knowledge sources:
– Acoustic model (Gaussian mixture models)
– Language model (n-grams, finite-state grammars)
– Dictionary (context-dependent sub-phonetic units)
• Decoding:
– Viterbi decoder
– Time-synchronous
– A* decoding (stack decoding; IBM, X.D. Huang)

Page 79: Introduction to Speech Signal Processing

We Still Need

• We still need science
• Need language, intelligence
• Acoustic robustness is still poor
• Perceptual research, models
• Fundamentals of statistical pattern recognition for sequences
• Robustness to accent, stress, rate of speech, ...

Page 80: Introduction to Speech Signal Processing

Conclusions:

• supervised training is a good machine learning technique

• large databases are essential for the development of robust statistics

Challenges:

• discrimination vs. representation

• generalization vs. memorization

• pronunciation modeling

• human-centered language modeling

The algorithmic issues for the next decade:

• Better features by extracting articulatory information?
• Bayesian statistics? Bayesian networks?
• Decision trees? Information-theoretic measures?
• Nonlinear dynamics? Chaos?

Future Directions

[Timeline, 1960-2004: analog filter banks -> dynamic time-warping -> hidden Markov models.]

Page 81: Introduction to Speech Signal Processing

References

• Speech and Language Processing - Jurafsky & Martin, Prentice Hall, 2000
• Spoken Language Processing - X. D. Huang et al., Prentice Hall, 2000
• Statistical Methods for Speech Recognition - Jelinek, MIT Press, 1999
• Foundations of Statistical Natural Language Processing - Manning & Schütze, MIT Press, 1999
• Fundamentals of Speech Recognition - L. R. Rabiner and B. H. Juang, Prentice-Hall, 1993
• Dr. J. Picone's speech website - www.isip.msstate.edu

Page 82: Introduction to Speech Signal Processing

Test

• Mode
– A final 4-page report, or
– A 30-minute presentation
• Content
– Review of speech processing
– Speech features and processing approaches
– Review of TTS or ASR
– Audio in computer engineering

Page 83: Introduction to Speech Signal Processing

THANKS