
Page 1: Introduction to Speech Signal Processing

Introduction to Speech Signal Processing

Dr. Zhang Sen
[email protected]
Chinese Academy of Sciences, Beijing, China

Page 2: Introduction to Speech Signal Processing

• Introduction
– Sampling and quantization
– Speech coding
• Features and Analysis
– Main features
– Some transformations
• Speech-to-Text
– State of the art
– Main approaches
• Text-to-Speech
– State of the art
– Main approaches
• Applications
– Human-machine dialogue systems

Page 3: Introduction to Speech Signal Processing

• Some useful websites for ASR tools
– http://htk.eng.cam.ac.uk
• Free, available since 2000; has a relationship with Microsoft
• Over 12,000 users; versions 2.1, 3.0, 3.1, 3.2
• Includes source code and the HTK Book
• A set of tools for training, decoding, and evaluation
• Steve Young at Cambridge University
– http://www.cs.cmu.edu
• Free for research and education
• Sphinx 2 and 3
• Tools, source code, speech databases
• Raj Reddy at CMU

Page 4: Introduction to Speech Signal Processing

Research on speech recognition in the world

Page 5: Introduction to Speech Signal Processing

• Carnegie Mellon University
– CMU SCS Speech Group
– Interact Lab
• Oregon Graduate Institute
– Center for Spoken Language Understanding
• MIT
– Lab for Computer Science, Spoken Language Systems
– Acoustics & Vibration Lab
– AI Lab
– Lincoln Lab, Speech Systems Technology Group
• Stanford University
– Center for Computer Research in Music and Acoustics
– Center for the Study of Language and Information

Page 6: Introduction to Speech Signal Processing

• University of California
– Berkeley, Santa Cruz, Los Angeles
• Boston University
– Signal Processing and Interpretation Lab
• Georgia Institute of Technology
– Digital Signal Processing Lab
• Johns Hopkins University
– Center for Language and Speech Processing
• Brown University
– Lab for Engineering Man-Machine Systems
• Mississippi State University
• Colorado University
• Cornell University

Page 7: Introduction to Speech Signal Processing

• Cambridge University
– Speech, Vision and Robotics Group
• Edinburgh University
– Human Communication Research Centre
– Centre for Speech Technology Research
• University College London
– Phonetics and Linguistics
• University of Essex
– Dept. of Language and Linguistics

Page 8: Introduction to Speech Signal Processing

• LIMSI, France
• INRIA, France
– Institut National de Recherche en Informatique et en Automatique
• University of Karlsruhe, Germany
– Interactive Systems Lab
• DFKI, Germany
– German Research Center for Artificial Intelligence
• KTH, Sweden
– Speech Communication & Music Acoustics
• CSELT, Italy
– Centro Studi e Laboratori Telecomunicazioni, Torino
• IRST, Italy
– Istituto per la Ricerca Scientifica e Tecnologica, Trento
• ATR, Japan

Page 9: Introduction to Speech Signal Processing

• AT&T, Advanced Speech Product Group
• Lucent Technologies, Bell Laboratories
• IBM, IBM VoiceType
• Texas Instruments Incorporated
• National Institute of Standards and Technology
• Apple Computer Co.
• Digital Equipment Corporation (DEC)
• SRI International
• Dragon Systems Co.
• Sun Microsystems Labs, speech applications
• Microsoft Corporation, speech technology (SAPI)
• Entropic Research Laboratory, Inc.

Page 10: Introduction to Speech Signal Processing

• Important conferences and journals
– IEEE Trans. on ASSP
– ICASSP (every year)
– EUROSPEECH (every odd year)
– ICSLP (every even year)
– STAR (Speech Technology and Research) at SRI

Page 11: Introduction to Speech Signal Processing

Brief history and state-of-the-art of the research on speech recognition

Page 12: Introduction to Speech Signal Processing

ASR Progress Overview

• 50's: isolated digit recognition (Bell Labs)
• 60's: hardware speech segmenter (Japan); dynamic programming (U.S.S.R.)
• 70's: clustering algorithms (speaker independence); DTW
• 80's: HMM, DARPA, SPHINX
• 90's: adaptation, robustness

Page 13: Introduction to Speech Signal Processing

1952 Bell Labs Digits

• First word (digit) recognizer
• Approximates energy in formants (vocal tract resonances) over the word
• Already had some robust ideas (insensitive to amplitude and timing variation)
• Worked very well
• Main weakness was technological (resistors and capacitors)

Page 14: Introduction to Speech Signal Processing

The 60's

• Better digit recognition
• Breakthroughs: spectrum estimation (FFT, cepstra, LPC), dynamic time warping (DTW), and hidden Markov model (HMM) theory
• Hardware speech segmenter (Japan)

Page 15: Introduction to Speech Signal Processing

1971-76 ARPA Project

• Focus on speech understanding
• Main work at 3 sites: System Development Corporation, CMU, and BBN
• Other work at Lincoln, SRI, Berkeley
• Goal was 1000-word ASR, a few speakers, connected speech, constrained grammar, less than 10% semantic error

Page 16: Introduction to Speech Signal Processing

Results

• Only CMU's Harpy fulfilled the goals - it used LPC, segments, and lots of high-level knowledge, and learned from Dragon* (Baker)

* The CMU system done in the early '70s, as opposed to the company formed in the '80s

Page 17: Introduction to Speech Signal Processing

Achieved by 1976

• Spectral and cepstral features, LPC
• Some work with phonetic features
• Incorporating syntax and semantics
• Initial neural network approaches
• DTW-based systems (many)
• HMM-based systems (Dragon, IBM)

Page 18: Introduction to Speech Signal Processing

Dynamic Time Warp

• Optimal time normalization with dynamic programming (a minimal sketch follows below)
• Proposed by Sakoe and Chiba, circa 1970
• A similar proposal by Itakura at about the same time
• Probably Vintsyuk was first (1968)
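
As a concrete illustration of DTW-based template matching, here is a minimal sketch in Python (numpy assumed; the template dictionary in the usage comment is hypothetical):

import numpy as np

def dtw_distance(X, Y):
    # Dynamic time warping distance between feature sequences X (T1 x D)
    # and Y (T2 x D): Euclidean local cost plus the standard
    # insertion/deletion/match step pattern.
    T1, T2 = len(X), len(Y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # insertion
                                 D[i, j - 1],       # deletion
                                 D[i - 1, j - 1])   # match
    return D[T1, T2]

# Isolated-word recognition picks the template with the smallest distance:
# templates = {"yes": feats_yes, "no": feats_no}   # hypothetical
# best = min(templates, key=lambda w: dtw_distance(test_feats, templates[w]))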

Page 19: Introduction to Speech Signal Processing

HMMs for Speech

• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the original CMU Dragon system (1974)
• Developed by IBM (Baker, Jelinek, Bahl, Mercer, ...) (1970-1993)
• Extended by others in the mid-1980's

Page 20: Introduction to Speech Signal Processing

The 1980's

• Collection of large standard corpora
• Front ends: auditory models, dynamics
• Engineering: scaling to large-vocabulary continuous speech
• Second major (D)ARPA ASR project
• HMMs become ready for prime time

Page 21: Introduction to Speech Signal Processing

Standard Corpora Collection

• Before 1984, chaos
• TIMIT
• RM (later WSJ)
• ATIS
• NIST, ARPA, LDC

Page 22: Introduction to Speech Signal Processing

Front Ends in the 1980's

• Mel cepstrum (Bridle, Mermelstein)
• PLP (Hermansky)
• Delta cepstrum (Furui)
• Auditory models (Seneff, Ghitza, others)

Page 23: Introduction to Speech Signal Processing

Dynamic Speech Features

• Temporal dynamics are useful for ASR
• Local time derivatives of the cepstra
• "Delta" features estimated over multiple frames (typically 5); see the sketch below
• Usually augment the static features
• Can be viewed as a temporal filter
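
A minimal sketch of the delta computation over a 5-frame window (N = 2 frames on each side), in the usual regression form (as in HTK); numpy assumed, and the cepstra array in the usage comment is hypothetical:

import numpy as np

def delta(feats, N=2):
    # d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    # with edge frames replicated so every frame has a full window.
    T = len(feats)
    padded = np.concatenate([feats[:1].repeat(N, axis=0), feats,
                             feats[-1:].repeat(N, axis=0)])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)) / denom
        for t in range(T)
    ])

# Augment static cepstra with their deltas:
# obs = np.hstack([cepstra, delta(cepstra)])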

Page 24: Introduction to Speech Signal Processing

HMMs for Continuous Speech

• Using dynamic programming for continuous speech (Vintsyuk, Bridle, Sakoe, Ney, ...)
• Application of Baker-Jelinek ideas to continuous speech (IBM, BBN, Philips, ...)
• Multiple groups developing major HMM systems (CMU, SRI, Lincoln, BBN, ATT)
• Engineering development - coping with data, fast computers

Page 25: Introduction to Speech Signal Processing

2nd (D)ARPA Project

• Common task
• Frequent evaluations
• Convergence to good, but similar, systems
• Lots of engineering development - now up to 60,000-word recognition, in real time, on a workstation, with less than 10% word error
• Competition inspired others not in the project - Cambridge did HTK, now widely distributed

Page 26: Introduction to Speech Signal Processing

Some 1990's Issues

• Independence from the long-term spectrum
• Adaptation
• Effects of spontaneous speech
• Information retrieval/extraction with broadcast material
• Query-style systems (e.g., ATIS)
• Applying ASR technology to related areas (language ID, speaker verification)

Page 27: Introduction to Speech Signal Processing

Real Uses

• Telephone: phone company services (collect versus credit card)
• Telephone: call centers for query information (e.g., stock quotes, parcel tracking)
• Dictation products: continuous recognition, speaker dependent/adaptive

Page 28: Introduction to Speech Signal Processing

State-of-the-art of ASR

• Tremendous technical advances in the last few years
• From small to large vocabularies
– 5,000-10,000 word vocabulary
– 10,000-60,000 word vocabulary
• From isolated words to spontaneous talk
– Continuous speech recognition
– Conversational and spontaneous speech recognition
• From speaker-dependent to speaker-independent
– Modern ASR is fully speaker independent

Page 29: Introduction to Speech Signal Processing

SOTA ASR Systems

• IBM, ViaVoice
– Speaker-independent, continuous command recognition
– Large-vocabulary recognition
– Text-to-speech confirmation
– Barge-in (the ability to interrupt an audio prompt as it is playing)
• Microsoft, Whisper, Dr Who

Page 30: Introduction to Speech Signal Processing

SOTA ASR Systems

• DARPA
– 1982
– Goals:
• High accuracy
• Real-time performance
• Understanding capability
• Continuous speech recognition
– DARPA databases:
• 997 words (RM)
• Above 100 speakers
• TIMIT

Page 31: Introduction to Speech Signal Processing

SOTA ASR Systems

• SPHINX II
– CMU
– HMM-based speech recognition
– Bigram and word-pair language models
– Generalized triphones
– DARPA database
– 97% recognition (perplexity 20)
• SPHINX III
– CHMM (continuous-density HMM) based
– WER about 15% on WSJ

Page 32: Introduction to Speech Signal Processing

ASR Advances

How ASR capability expanded from 1985 to 2005, along four dimensions:

NOISE ENVIRONMENT
– 1985: quiet room, fixed high-quality mic
– 1995: normal office, various microphones, telephone
– 2000: vehicle noise, radio, cell phones
– 2005: wherever speech occurs

SPEECH STYLE
– 1985: careful reading
– 1995: planned speech
– 2000: natural human-machine dialog (user can adapt)
– 2005: all styles, including human-human (unaware)

USER POPULATION
– 1985: speaker-dependent
– 1995: speaker independent and adaptive
– 2000: regional accents, native speakers, competent foreign speakers
– 2005: all speakers of the language, including foreign

COMPLEXITY
– 1985: application-specific speech and language
– 1995: expert years to create an app-specific language model
– 2000: some application-specific data and one engineer year
– 2005: application independent or adaptive

Page 33: Introduction to Speech Signal Processing

But

• Still <97% accurate on "yes" for telephone speech
• An unexpected rate of speech causes a doubling or tripling of the error rate
• An unexpected accent hurts badly
• Accuracy on unrestricted speech is around 60%
• We don't know when we know
• Few advances in basic understanding

Page 34: Introduction to Speech Signal Processing

How to Measure the Performance?

• What benchmarks?
– DARPA
– NIST (hub-4, hub-5, ...)
• What was the training data?
• What was the test?
• Were they independent?
• What were the vocabulary and the sample size?
• Was the noise added or coincident with the speech? What kind of noise?
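
The standard criterion behind these questions is the word error rate (WER): substitutions, deletions, and insertions from a word-level edit-distance alignment, divided by the reference length. A minimal scoring sketch (the example transcripts are hypothetical):

def wer(ref, hyp):
    # Levenshtein alignment at the word level:
    # WER = (substitutions + deletions + insertions) / len(ref).
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                    # all deletions
    for j in range(H + 1):
        d[0][j] = j                    # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[R][H] / R

# wer("the cat sat".split(), "the bat sat down".split())  ->  2/3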

Page 35: Introduction to Speech Signal Processing

ASR Performance

• Spontaneous telephone speech is still a "grand challenge".
• Telephone-quality speech is still central to the problem.
• Broadcast news is a very dynamic domain.

[Figure: word error rate (WER, 0%-40%) versus level of difficulty, rising from digits, continuous digits, command and control, and letters and numbers through read speech to broadcast news and conversational speech.]

Page 36: Introduction to Speech Signal Processing

Machine vs Human Performance

[Figure: word error rate versus speech-to-noise ratio (10 dB, 16 dB, 22 dB, quiet) on Wall Street Journal data with additive noise; machines degrade far more steeply than a committee of human listeners.]

• Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
• On some tasks, such as credit card number recognition, machine performance exceeds humans due to human memory retrieval capacity.
• The nature of the noise is as important as the SNR (e.g., cellular phones).
• A primary failure mode for humans is inattention.
• A second major failure mode is the lack of familiarity with the domain (i.e., business terms and corporation names).

Page 37: Introduction to Speech Signal Processing

Core technology for ASR

Page 38: Introduction to Speech Signal Processing

Why is ASR Hard?

• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over: global rate, local rate, pronunciation within a speaker, pronunciation across speakers, phonemes in different contexts

Page 39: Introduction to Speech Signal Processing

Why is ASR Hard? (continued)

• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for performance equal to or greater than "human performance"

Page 40: Introduction to Speech Signal Processing

Main Causes of Speech Variability

• Environment
– Speech-correlated noise: reverberation, reflection
– Uncorrelated noise: additive noise (stationary, nonstationary)
• Speaker
– Attributes of speakers: dialect, gender, age
– Manner of speaking: breath and lip noise, stress, Lombard effect, rate, level, pitch, cooperativeness
• Input Equipment
– Microphone (transmitter), distance from microphone, filter
– Transmission system: distortion, noise, echo
– Recording equipment

Page 41: Introduction to Speech Signal Processing

ASR Dimensions

• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech

Page 42: Introduction to Speech Signal Processing

Telephone Speech

• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and hands-free acoustics

Page 43: Introduction to Speech Signal Processing

What is Speech Recognition?

[Diagram: speech signal -> speech recognition -> words, e.g. "How are you?"]

• Related areas:
– Who is the talker? (speaker recognition, identification)
– What language did he speak? (language recognition)
– What is his meaning? (speech understanding)

Page 44: Introduction to Speech Signal Processing

What is the problem?

Find the most likely word sequence Ŵ among all possible word sequences, given the acoustic evidence A:

Ŵ = argmax_W P(W | A)

A tractable reformulation of the problem (Bayes' rule; P(A) does not affect the argmax) is:

Ŵ = argmax_W P(A | W) · P(W)

where P(W) is the language model, P(A | W) is the acoustic model, and the maximization over all word sequences is a daunting search task.

Page 45: Introduction to Speech Signal Processing

View ASR as Pattern Recognition

[Diagram: analog speech -> Front End -> observation sequence O1 O2 ... OT -> Decoder -> best word sequence W1 W2 ... WT. The decoder draws on the acoustic model, the dictionary, and the language model.]

Page 46: Introduction to Speech Signal Processing

View ASR in Hierarchy

[Diagram: speech waveform -> feature extraction (signal processing) -> spectral feature vectors -> phone likelihood estimation (Gaussians or neural networks) -> phone likelihoods P(o|q) -> decoding (Viterbi or stack decoder) -> words. The likelihood estimator may be a neural net; decoding uses an HMM lexicon and an N-gram grammar.]

Page 47: Introduction to Speech Signal Processing

Front-End Processing

[Diagram after K.F. Lee: the front-end processing chain, producing static and dynamic features.]

Page 48: Introduction to Speech Signal Processing

Feature Extraction

• Goals:
– Less computation and memory
– A simple representation of the signal
• Methods:
– Fourier-spectrum based
• MFCC (mel-frequency cepstrum coefficients)
• LFCC (linear-frequency cepstrum coefficients)
• Filter-bank energies
– Linear-prediction-spectrum based
• LPC (linear predictive coding)
• LPCC (linear predictive cepstrum coefficients)
– Others
• Zero crossings, pitch, formants, amplitude

Page 49: Introduction to Speech Signal Processing

Cepstrum Computation

• The cepstrum is the inverse Fourier transform of the log spectrum:

c(n) = (1/2π) ∫ from -π to π of log |S(e^jω)| · e^jωn dω,  n = 0, 1, ..., L-1

• In computation the IDFT takes the form of a weighted DCT (see HTK); a one-frame sketch follows.
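
A one-frame sketch of this computation (numpy assumed; frame is a hypothetical windowed signal segment, and keeping L = 13 coefficients is an illustrative choice):

import numpy as np

def real_cepstrum(frame, L=13):
    # c(n) = inverse DFT of the log magnitude spectrum; keep the first L terms.
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # floor to avoid log(0)
    return np.fft.irfft(log_mag)[:L]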

Page 50: Introduction to Speech Signal Processing

Mel Cepstral Coefficients

• Construct the mel-frequency domain using a triangularly-shaped weighting function applied to mel-transformed log-magnitude spectral samples
• Filter bank: linear below 1 kHz, logarithmic above 1 kHz
• Motivated by human auditory response characteristics
• Processing chain: FFT and log, then DCT transform (sketched below)
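
A compact sketch of the whole MFCC chain described above (numpy and scipy assumed; the frame and filter-bank parameters are illustrative defaults, not values from the slide):

import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced uniformly on the mel scale
    # (roughly linear below 1 kHz, logarithmic above).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(frame, sr=16000, n_fft=512, n_filters=26, n_ceps=13):
    # Window -> FFT power spectrum -> mel filter bank -> log -> DCT.
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    energies = mel_filterbank(n_filters, n_fft, sr) @ spec
    return dct(np.log(energies + 1e-10), type=2, norm='ortho')[:n_ceps]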

Page 51: Introduction to Speech Signal Processing

Cepstrum as Vector Space Features

[Figure: the signal is analyzed in overlapping frames, each frame yielding one cepstral feature vector.]

Page 52: Introduction to Speech Signal Processing

Other Features

• LPC: linear predictive coefficients (a sketch follows below)
• PLP: perceptual linear prediction
• Though MFCC has been used successfully, what is the truly robust speech feature?
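
For LPC, a minimal sketch of the autocorrelation method with the Levinson-Durbin recursion (numpy assumed; the analysis order of 12 is an illustrative choice):

import numpy as np

def lpc(frame, order=12):
    # Autocorrelation method: solve the normal equations with the
    # Levinson-Durbin recursion. Returns a[0..order] with a[0] = 1, so that
    # x[t] is predicted by -sum_{j=1..order} a[j] * x[t-j].
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a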

Page 53: Introduction to Speech Signal Processing

Acoustic Models

• Template-based AM, used in DTW; obsolete
• Hidden Markov Model based AM; popular now
• Other AMs
– Articulatory AM
– Knowledge-based approach: spectrogram reading (expert system)
– Connectionist approach: TDNN

Page 54: Introduction to Speech Signal Processing

Template-based Approach

• Dynamic programming algorithm
• Distance measure
• Isolated word
• Scaling invariance
• Time warping
• Cluster method

Page 55: Introduction to Speech Signal Processing

Definition of HMM

A formal definition of an HMM:

• An output observation alphabet O = {o_1, o_2, ..., o_M}
• The set of states {1, 2, ..., N}
• A transition probability matrix A = {a_ij}, where a_ij = P(s_t = j | s_{t-1} = i)
• An output probability matrix B = {b_i(k)}, where b_i(k) = P(X_t = o_k | s_t = i)
• An initial state distribution π_i = P(s_0 = i)

Assumptions:
• Markov assumption
• Output independence assumption

Page 56: Introduction to Speech Signal Processing

Three Problems of HMM

Given a model Φ and a sequence of observations:

• The Evaluation Problem: how do we compute the probability of the observation sequence? Forward algorithm (sketched below).
• The Decoding Problem: how do we find the optimal state sequence associated with a given observation sequence? Viterbi algorithm.
• The Training/Learning Problem: how can we adjust the model parameters to maximize the joint probability? Baum-Welch algorithm (forward-backward algorithm).
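
A minimal sketch of the forward algorithm for the evaluation problem (numpy assumed; the toy 2-state, 2-symbol model in the comments is hypothetical):

import numpy as np

def forward(pi, A, B, obs):
    # P(O | model) for a discrete HMM.
    # pi: initial state probabilities (N,), A: transition matrix (N, N),
    # B: emission matrix (N, M), obs: sequence of observation indices.
    alpha = pi * B[:, obs[0]]              # initialization: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # induction over time
    return alpha.sum()                     # termination

# pi = np.array([0.6, 0.4]); A = np.array([[0.7, 0.3], [0.4, 0.6]])
# B = np.array([[0.9, 0.1], [0.2, 0.8]])
# forward(pi, A, B, [0, 1, 0])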

Page 57: Introduction to Speech Signal Processing

Advantages of HMM

• Isolated and continuous speech recognition
• No attempt to find word boundaries
• Recovery from erroneous assumptions
• Scaling invariance, time warping, learning capability

Page 58: Introduction to Speech Signal Processing

Limitations of HMM

• HMMs assume the state duration follows an exponential (geometric) distribution
• The transition probability depends only on the origin and destination states
• All observation frames depend only on the state that generated them, not on the neighboring observation frames

Page 59: Introduction to Speech Signal Processing

HMM-based AM

• Hidden Markov Models (HMMs)
– Probabilistic state machines: the state sequence is unknown, only the feature-vector outputs are observed
– Each state has an output symbol distribution
– Each state has a transition probability distribution
– Issues:
• What topology is proper?
• How many states in a model?
• How many mixtures in a state?

Page 60: Introduction to Speech Signal Processing

Hidden Markov Models

• Acoustic models encode the temporal evolution of the features (spectrum).
• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
• Phonetic model topologies are simple left-to-right structures.
• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of the models.
• Sharing model parameters (tying) is a common strategy to reduce complexity.

Page 61: Introduction to Speech Signal Processing

AM Parameter Estimation

• Closed-loop, data-driven modeling, supervised only by a word-level transcription.
• The expectation-maximization (EM) algorithm is used to improve our parameter estimates.
• Computationally efficient training algorithms (forward-backward) have been crucial.
• Batch-mode parameter updates are typically preferred.
• Decision trees are used to optimize parameter sharing (tying), system complexity, and the use of additional linguistic knowledge.

The usual estimation schedule (sketched below):
• Initialization
• Single Gaussian estimation
• 2-way split
• Mixture distribution reestimation
• 4-way split
• Reestimation
• ...
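
A minimal sketch of this split-and-reestimate schedule on 1-D data (numpy assumed; the split offset eps, the iteration count, and running on raw 1-D samples rather than per-state feature vectors are all simplifications):

import numpy as np

def em_step(x, w, mu, var):
    # One EM pass for a 1-D Gaussian mixture: E-step responsibilities,
    # then M-step weighted reestimation of weights, means, variances.
    lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    resp = lik / lik.sum(axis=1, keepdims=True)
    nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return nk / len(x), mu, var

def train_by_splitting(x, n_mix=4, eps=0.2, iters=10):
    # Start from a single Gaussian, then repeatedly split each mean by
    # +/- eps * sigma and reestimate, doubling the mixture count each round.
    w, mu, var = np.array([1.0]), np.array([x.mean()]), np.array([x.var()])
    while len(mu) < n_mix:
        sd = np.sqrt(var)
        w = np.concatenate([w, w]) / 2.0
        mu = np.concatenate([mu - eps * sd, mu + eps * sd])
        var = np.concatenate([var, var])
        for _ in range(iters):
            w, mu, var = em_step(x, w, mu, var)
    return w, mu, var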

Page 62: Introduction to Speech Signal Processing

Basic Speech Units

• Recognition units: phoneme, word, syllable, demisyllable, triphone, diphone

Page 63: Introduction to Speech Signal Processing

Basic Units Selection

• Create a set of HMMs representing the basic sounds (phones) of a language
– English has about 40 distinct phonemes
– Chinese has about 22 initials + 37 finals
– Need a "lexicon" for pronunciations
– Letter-to-sound rules for unusual words
– Co-articulation effects must be modeled
• Triphones: each phone modified by its onset and trailing context phones (1k-2k used in English), e.g. pl-c+pr

Page 64: Introduction to Speech Signal Processing

Language Models

• What is a language model?
– A quantitative ordering of the likelihood of word sequences (statistical viewpoint)
– A set of rules specifying how to create word sequences or sentences (grammar viewpoint)
• Why use language models?
– Not all word sequences are equally likely
– Search space optimization (*)
– Improved accuracy (multiple passes)
– Word lattice to n-best

Page 65: Introduction to Speech Signal Processing

Finite-State Language Model

• Write a grammar of possible sentence patterns
• Advantages:
– Long history/context
– No need for a large text database (rapid prototyping)
– Integrated syntactic parsing
• Problems:
– Work to write the grammars
– Word sequences the grammar does not enable simply do not exist
– Used in small-vocabulary ASR, not in LVCSR

[Example grammar from the slide: ("show me" | "display") ("any" | "the next" | "the last") ("page" | "picture" | "text file")]

Page 66: Introduction to Speech Signal Processing

Statistical Language Models

• Predict the next word based on the current word and the history
• The probability of the next word is given by:
– Trigram: P(wi | wi-1, wi-2)
– Bigram: P(wi | wi-1)
– Unigram: P(wi)
• Advantages:
– Trainable on large text databases
– "Soft" prediction (probabilities)
– Can be combined directly with the AM in decoding
• Problems:
– Need a large text database for each domain
– Sparseness problems; smoothing approaches:
• backoff approach (sketched below)
• word-class approach
• Used in LVCSR
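
A minimal sketch of a bigram model with a crude backoff to unigrams, in the spirit of the backoff approach above (not a properly normalized Katz backoff; the training sentences and the backoff weight alpha are hypothetical):

from collections import Counter

def train_bigram(sentences):
    # Count unigrams and bigrams over tokenized sentences
    # with sentence-boundary marks.
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def p_next(w, prev, uni, bi, alpha=0.4):
    # P(w | prev): maximum likelihood if the bigram was seen,
    # otherwise back off to a scaled unigram probability.
    if bi[(prev, w)] > 0:
        return bi[(prev, w)] / uni[prev]
    return alpha * uni[w] / sum(uni.values())

# uni, bi = train_bigram(["show me the next page", "display the last page"])
# p_next("page", "last", uni, bi)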

Page 67: Introduction to Speech Signal Processing

Statistical LM Performance

Page 68: Introduction to Speech Signal Processing

ASR Decoding Levels

[Diagram: the decoder works at several levels at once - HMM states, phonemes (e.g. /w/ -> /ah/ -> /ts/ and /th/ -> /ax/ for "what's the"), words (e.g. display, kirk's, willamette's, sterett's, location, longitude, latitude), and sentences. The acoustic models constrain the states, the dictionary maps phonemes to words, and the language model constrains the word sequences.]

Page 69: Introduction to Speech Signal Processing

Decoding Algorithms

• Given the observations, how do we determine the most probable utterance/word sequence? (DTW in template-based matching)
• The dynamic programming (DP) algorithm was proposed by Bellman in the 50's for multistep decision processes; the "principle of optimality" is divide and conquer.
• DP-based search algorithms have been used in speech recognition decoders to return the n-best paths or a word lattice through the acoustic model and the language model.
• A complete search is usually impossible since the search space is too large, so beam search is required to prune less probable paths and save computation.
• Issues: computational underflow, balancing the LM and AM.

Page 70: Introduction to Speech Signal Processing

Viterbi Search

• Uses Viterbi decoding
– Takes MAX, not SUM (Viterbi vs. forward)
– Finds the optimal state sequence, not the optimal word sequence
– Computation load: O(T·N²)
• Time synchronous
– Extends all paths at each time step
– All paths have the same length (no need to normalize to compare scores, but A* decoding needs it)

Page 71: Introduction to Speech Signal Processing

Viterbi Search Algorithm

function VITERBI(observations of length T, state-graph) returns best-path
    num-states <- NUM-OF-STATES(state-graph)
    create path probability matrix viterbi[num-states+2, T+2]
    viterbi[0, 0] <- 1.0
    for each time step t from 0 to T do
        for each state s from 0 to num-states do
            for each transition s' from s in state-graph do
                new-score <- viterbi[s, t] * a[s, s'] * b_s'(o_t)
                if (viterbi[s', t+1] = 0) or (viterbi[s', t+1] < new-score) then
                    viterbi[s', t+1] <- new-score
                    back-pointer[s', t+1] <- s
    backtrace from the highest-probability state in the final column of viterbi[] and return the path
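
The same search as a runnable Python sketch for a discrete HMM (numpy assumed; pi, A, B follow the notation of the HMM definition slide, and the state graph is implicit in A):

import numpy as np

def viterbi(obs, pi, A, B):
    # Best state path for a discrete HMM: pi (N,) initial probabilities,
    # A (N, N) transitions, B (N, M) emissions, obs: observation indices.
    T, N = len(obs), len(pi)
    score = np.zeros((T, N))             # best path probability per state
    back = np.zeros((T, N), dtype=int)   # back-pointers
    score[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] * A     # cand[i, j]: come from i, go to j
        back[t] = cand.argmax(axis=0)        # MAX, not SUM (contrast: forward)
        score[t] = cand.max(axis=0) * B[:, obs[t]]
    # Backtrace from the highest-probability state in the final column.
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score[-1].max())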

Page 72: Introduction to Speech Signal Processing

Viterbi Search Trellis

[Figure: search trellis over time steps 0, 1, 2, 3, ..., t for word models W1 and W2.]

Page 73: Introduction to Speech Signal Processing

Viterbi Search Insight

[Figure: from time t to t+1, within-word state transitions (S1, S2, S3) are scored as OldProb(S1) · OutProb · TransProb; at a word boundary, the transition from the final state S3 of Word 1 into Word 2 is scored as OldProb(S3) · P(W2 | W1). Each trellis cell stores a score, a back pointer, and a parameter pointer.]

Page 74: Introduction to Speech Signal Processing

Backtracking

• Find the best association between words and signal
• Compose words from phones using the dictionary
• Backtracking recovers the best state sequence

[Figure: alignment of the phones /th/ and /e/ against frames t1 ... tn.]

Page 75: Introduction to Speech Signal Processing

N-Best Speech Results

• Use a grammar to guide recognition
• Post-processing based on the grammar/LM
• Word lattice to n-best conversion

[Figure: a speech waveform plus grammar go into the ASR, which returns an n-best list, e.g. N=1 "Get me two movie tickets...", N=2 "I want to movie trips...", N=3 "My car's too groovy".]

Page 76: Introduction to Speech Signal Processing

Complexity of Search

• Lexicon: contains all the words in the system's vocabulary along with their pronunciations; often there are multiple pronunciations per word (# of items in the lexicon)
• Acoustic models: HMMs that represent the basic sound units the system can recognize (# of models, # of states per model, # of mixtures per state)
• Language model: determines the possible word sequences allowed by the system (fan-out, perplexity, entropy)

Page 77: Introduction to Speech Signal Processing

ASR vs Modern AI

• ASR is based on AI techniques
– Knowledge representation and manipulation: AM and LM, lexicon, observation vectors
– Machine learning: Baum-Welch for HMMs; nearest-neighbor and k-means clustering for signal identification
– "Soft" probabilistic reasoning / Bayes rule: manages the uncertainty of the mapping between signal, phones, and words
– ASR is an expert system

Page 78: Introduction to Speech Signal Processing

ASR Summary

• The performance criterion is WER (word error rate)
• Three main knowledge sources:
– Acoustic model (Gaussian mixture models)
– Language model (n-grams, finite-state grammars)
– Dictionary (context-dependent sub-phonetic units)
• Decoding:
– Viterbi decoder
– Time-synchronous
– A* decoding (stack decoding; IBM, X.D. Huang)

Page 79: Introduction to Speech Signal Processing

We Still Need

• We still need science
• Need language, intelligence
• Acoustic robustness is still poor
• Perceptual research, models
• Fundamentals of statistical pattern recognition for sequences
• Robustness to accent, stress, rate of speech, ...

Page 80: Introduction to Speech Signal Processing

Conclusions:

• supervised training is a good machine learning technique

• large databases are essential for the development of robust statistics

Challenges:

• discrimination vs. representation

• generalization vs. memorization

• pronunciation modeling

• human-centered language modeling

The algorithmic issues for the next decade:

• Better features by extracting articulatory information?
• Bayesian statistics? Bayesian networks?
• Decision trees? Information-theoretic measures?
• Nonlinear dynamics? Chaos?

Future Directions

[Timeline, 1960-2004: analog filter banks -> dynamic time-warping -> hidden Markov models.]

Page 81: Introduction to Speech Signal Processing

References

• Speech and Language Processing - Jurafsky & Martin, Prentice Hall, 2000
• Spoken Language Processing - X. D. Huang et al., Prentice Hall, 2000
• Statistical Methods for Speech Recognition - Jelinek, MIT Press, 1999
• Foundations of Statistical Natural Language Processing - Manning & Schütze, MIT Press, 1999
• Fundamentals of Speech Recognition - L. R. Rabiner and B. H. Juang, Prentice-Hall, 1993
• Dr. J. Picone's speech website - www.isip.msstate.edu

Page 82: Introduction to Speech Signal Processing

Test

• Mode
– A final 4-page report, or
– A 30-minute presentation
• Content
– Review of speech processing
– Speech features and processing approaches
– Review of TTS or ASR
– Audio in computer engineering

Page 83: Introduction to Speech Signal Processing

THANKS