8 - Speech Recognition


Page 1: 8- Speech Recognition

8 - Speech Recognition

- Speech Recognition Concepts
- Speech Recognition Approaches
- Recognition Theories
- Bayes' Rule
- Simple Language Model
- P(A|W)
- Network Types

Page 2: 8- Speech Recognition

7 - Speech Recognition (Cont'd)

- HMM Calculating Approaches
- Neural Components
- Three Basic HMM Problems
- Viterbi Algorithm
- State Duration Modeling
- Training in HMM

Page 3: 8- Speech Recognition

Recognition Tasks

- Isolated Word Recognition (IWR), Connected Word (CW), and Continuous Speech Recognition (CSR)
- Speaker Dependent, Multiple Speaker, and Speaker Independent
- Vocabulary Size:
  - Small: fewer than 20 words
  - Medium: 100 to 1,000 words
  - Large: 1,000 to 10,000 words
  - Very Large: more than 10,000 words

Page 4: 8- Speech Recognition

Speech Recognition Concepts

(Figure: speech understanding runs speech → speech processing → NLP → text; speech synthesis runs text → NLP → phone sequence → speech processing → speech.)

Speech recognition is the inverse of speech synthesis.

Page 5: 8- Speech Recognition

Speech Recognition Approaches

Bottom-Up Approach

Top-Down Approach

Blackboard Approach


Page 6: 8- Speech Recognition

Bottom-Up Approach

(Figure: bottom-up pipeline. Signal processing → feature extraction → segmentation, with a voiced/unvoiced/silence decision, followed by knowledge sources applied in order: sound classification rules, phonotactic rules, lexical access, and the language model, producing the recognized utterance.)

Page 7: 8- Speech Recognition

Top-Down Approach

(Figure: top-down pipeline. Feature analysis → unit matching system → lexical hypothesis → syntactic hypothesis → semantic hypothesis → utterance verifier/matcher → recognized utterance, driven by an inventory of speech recognition units, a word dictionary, a grammar, and a task model.)

Page 8: 8- Speech Recognition

Blackboard Approach

(Figure: environmental, acoustic, lexical, syntactic, and semantic processes all communicate through a shared blackboard.)

Page 9: 8- Speech Recognition

Recognition Theories

- Articulatory Based Recognition: uses the articulatory system for recognition; this theory has been the most successful so far.
- Auditory Based Recognition: uses the auditory system for recognition.
- Hybrid Based Recognition: a hybrid of the above theories.
- Motor Theory: models the intended gesture of the speaker.

Page 10: 8- Speech Recognition

Recognition Problem

We have a sequence of acoustic symbols and want to find the word sequence expressed by the speaker.

Solution: find the most probable word sequence given the acoustic symbols.

Page 11: 8- Speech Recognition

Recognition Problem

A: acoustic symbols, W: word sequence.

We should find $\hat{W}$ such that

$$P(\hat{W} \mid A) = \max_{W} P(W \mid A)$$

Page 12: 8- Speech Recognition

Bayes' Rule

$$P(x \mid y) P(y) = P(x, y)$$

$$P(x \mid y) = \frac{P(y \mid x) P(x)}{P(y)}$$

$$P(W \mid A) = \frac{P(A \mid W) P(W)}{P(A)}$$

Page 13: 8- Speech Recognition

Bayes' Rule (Cont'd)

$$P(\hat{W} \mid A) = \max_{W} P(W \mid A) = \max_{W} \frac{P(A \mid W) P(W)}{P(A)}$$

$$\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W) P(W)$$
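As a concrete illustration of this decision rule, the sketch below scores each candidate word sequence by P(A|W)·P(W) and keeps the best. It is a minimal sketch, assuming hypothetical acoustic_likelihood and language_prob scoring functions and a small explicit candidate list; practical recognizers search this space with dynamic programming rather than enumeration.

```python
# Minimal sketch of W_hat = argmax_W P(A|W) * P(W).
# acoustic_likelihood and language_prob are hypothetical stand-ins for
# the acoustic model P(A|W) and the language model P(W).

def decode(acoustics, candidates, acoustic_likelihood, language_prob):
    """Return the candidate word sequence maximizing P(A|W) * P(W)."""
    return max(candidates,
               key=lambda W: acoustic_likelihood(acoustics, W) * language_prob(W))
```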

Page 14: 8- Speech Recognition

Simple Language Model

$$W = w_1 w_2 w_3 \cdots w_n$$

$$P(W) = P(w_1) \prod_{i=2}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$

$$= P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) P(w_4 \mid w_1, w_2, w_3) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$$

Computing this probability directly is very difficult and would require a very large database, so trigram and bigram models are used instead.

Page 15: 8- Speech Recognition

Simple Language Model (Cont’d)

Trigram:

$$P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}, w_{i-2})$$

Bigram:

$$P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})$$

Monogram:

$$P(W) \approx \prod_{i=1}^{n} P(w_i)$$
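For instance, bigram probabilities can be estimated by counting over a training corpus. A minimal sketch, assuming the corpus is a plain list of words; real systems add smoothing for unseen pairs:

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate P(w_i | w_{i-1}) as count(w_{i-1} w_i) / count(w_{i-1})."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus[:-1], corpus[1:]))
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

# Usage: probabilities from a toy corpus.
corpus = "the cat sat on the mat the cat ran".split()
p = train_bigram(corpus)
print(p[("the", "cat")])  # 2/3: "the" occurs 3 times, followed by "cat" twice
```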

Page 16: 8- Speech Recognition

Simple Language Model (Cont’d)

Computing method:

$$P(w_3 \mid w_1, w_2) = \frac{\text{number of occurrences of } w_3 \text{ after } w_1 w_2}{\text{total number of occurrences of } w_1 w_2}$$

Ad hoc method (interpolation of relative frequencies):

$$P(w_3 \mid w_1, w_2) = p_1 f(w_3 \mid w_1, w_2) + p_2 f(w_3 \mid w_2) + p_3 f(w_3)$$
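The ad hoc estimate above might look as follows in code. This is a sketch under the assumption that the relative frequencies f(w3|w1,w2), f(w3|w2), and f(w3) are precomputed dicts; the weights are illustrative and should be non-negative and sum to 1 (often tuned on held-out data).

```python
# Minimal sketch of the interpolated ("ad hoc") trigram estimate.
# f3, f2, f1 are assumed dicts of relative frequencies keyed by
# (w1, w2, w3), (w2, w3), and w3 respectively.

def interp_trigram(w1, w2, w3, f3, f2, f1, weights=(0.6, 0.3, 0.1)):
    """P(w3|w1,w2) ~ p1*f(w3|w1,w2) + p2*f(w3|w2) + p3*f(w3)."""
    p1, p2, p3 = weights
    return (p1 * f3.get((w1, w2, w3), 0.0)
            + p2 * f2.get((w2, w3), 0.0)
            + p3 * f1.get(w3, 0.0))
```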

Page 17: 8- Speech Recognition

Error-Producing Factors

- Prosody (recognition should be prosody independent)
- Noise (noise should be prevented)
- Spontaneous speech

Page 18: 8- Speech Recognition

P(A|W) Computing Approaches

Dynamic Time Warping (DTW)

Hidden Markov Model (HMM)

Artificial Neural Network (ANN)

Hybrid Systems


Page 19: 8- Speech Recognition

Dynamic Time Warping


Page 23: 8- Speech Recognition

Dynamic Time Warping

Search limitations:

- First & end interval
- Global limitation
- Local limitation

Page 24: 8- Speech Recognition

Dynamic Time Warping: Global Limitation (figure)

Page 25: 8- Speech Recognition

Dynamic Time Warping: Local Limitation (figure)
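Underlying these slides is the same dynamic program over a local-distance grid. A minimal sketch, assuming scalar feature sequences and the common symmetric local constraint (match, insertion, deletion); the global and local limitations above would further restrict which grid cells get evaluated:

```python
import numpy as np

def dtw(x, y):
    """Dynamic time warping distance between sequences x and y.

    D[i, j] = d(x[i], y[j]) + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])          # local distance
            D[i, j] = cost + min(D[i - 1, j],        # insertion
                                 D[i, j - 1],        # deletion
                                 D[i - 1, j - 1])    # match
    return D[n, m]

# Usage: identical content aligns with zero cost despite the time warp.
print(dtw([0, 1, 2, 3], [0, 0, 1, 2, 3]))  # 0.0
```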

Page 26: 8- Speech Recognition

Artificial Neural Network

(Figure: a computational element with inputs $x_0, x_1, \ldots, x_{N-1}$ weighted by $w_0, w_1, \ldots, w_{N-1}$.)

$$y = \theta\left( \sum_{i=0}^{N-1} w_i x_i \right)$$

Simple Computation Element of a Neural Network
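A minimal sketch of this computational element, assuming a hard-threshold activation θ; the input and weight values are illustrative:

```python
import numpy as np

def neuron(x, w, theta=lambda s: 1.0 if s >= 0 else 0.0):
    """Compute y = theta(sum_i w_i * x_i) for one computational element."""
    return theta(np.dot(w, x))

# Usage: a 3-input element with illustrative weights.
print(neuron(np.array([1.0, 0.5, -1.0]), np.array([0.2, 0.4, 0.1])))  # 1.0
```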

Page 27: 8- Speech Recognition

Artificial Neural Network (Cont’d)

Neural network types:

- Perceptron
- Time Delay Neural Network Computational Element (TDNN)

Page 28: 8- Speech Recognition

Artificial Neural Network (Cont’d)

(Figure: a single layer perceptron mapping inputs $x_0, \ldots, x_{N-1}$ to outputs $y_0, \ldots, y_{M-1}$.)

Page 29: 8- Speech Recognition

Artificial Neural Network (Cont’d)

(Figure: a three layer perceptron.)

Page 30: 8- Speech Recognition

2.5.4.2 Neural Network Topologies


Page 31: 8- Speech Recognition

TDNN


Page 32: 8- Speech Recognition

2.5.4.6 Neural Network Structures for Speech Recognition


Page 33: 8- Speech Recognition

2.5.4.6 Neural Network Structures for Speech Recognition


Page 34: 8- Speech Recognition

Hybrid Methods

Hybrid Neural Network and Matched Filter For Recognition

(Figure: speech → acoustic features → delays → pattern classifier → output units.)

Page 35: 8- Speech Recognition

Neural Network Properties

- The system is simple, but training requires many iterations
- Does not impose a specific structure
- Despite its simplicity, the results are good
- The training set is large, so training should be done offline
- Accuracy is relatively good

Page 36: 8- Speech Recognition

Pre-processing

Different preprocessing techniques are employed as the front end for speech recognition systems.

The choice of preprocessing method is based on the task, the noise level, the modeling tool, etc.


Page 41: 8- Speech Recognition

The MFCC Method

- The MFCC method is based on how the human ear perceives sounds.
- Compared with other features, MFCC performs better in noisy environments.
- MFCC was introduced primarily for speech recognition applications, but it also achieves good efficiency in speaker recognition.
- The auditory unit of the human ear is the mel, obtained from the following relation:

$$\text{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
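In code, this mapping is one line. A minimal sketch using the common 2595·log10(1 + f/700) form of the mel relation:

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to the perceptual mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))  # ~1000 mel, by construction of the scale
```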

Page 42: 8- Speech Recognition

Steps of the MFCC Method

Step 1: map the signal from the time domain to the frequency domain using the short-time FFT.

- Z(n): the speech signal
- W(n): a window function, such as the Hamming window
- $W_F = e^{-j 2\pi / F}$
- m = 0, …, F − 1
- F: the length of a speech frame

Page 43: 8- Speech Recognition

Steps of the MFCC Method

Step 2: find the energy of each filter-bank channel.

- The number of mel-scale filter banks is M.
- $W_k(j)$, $k = 0, 1, \ldots, M-1$, are the transfer functions of the filter-bank filters.

Page 44: 8- Speech Recognition

(Figure: distribution of the filters based on the mel scale.)

Page 45: 8- Speech Recognition

Steps of the MFCC Method

Step 4: compress the spectrum and apply the DCT to obtain the MFCC coefficients.

In the relation above, n = 0, …, L indexes the MFCC coefficients, and p is their order.

Page 46: 8- Speech Recognition

The Mel-Cepstrum Method

(Figure: block diagram. Time signal → framing → |FFT|² → mel-scaling → logarithm → IDCT → cepstra; the low-order coefficients are kept, and a differentiator yields the delta and delta-delta cepstra.)
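The whole diagram corresponds closely to what librosa computes internally (framing, |FFT|², mel filter bank, log, DCT). A minimal sketch, assuming a 16 kHz mono file named utterance.wav (illustrative) and 13 retained low-order coefficients:

```python
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)        # time signal
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # low-order cepstra
deltas = librosa.feature.delta(mfccs)                  # delta cepstra
delta2 = librosa.feature.delta(mfccs, order=2)         # delta-delta cepstra
features = np.vstack([mfccs, deltas, delta2])          # stacked feature vectors
```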

Page 47: 8- Speech Recognition

(Figure: the mel-cepstrum (MFCC) coefficients.)

Page 48: 8- Speech Recognition

Properties of the Mel-Cepstrum (MFCC)

- Maps the mel filter-bank energies onto the directions of maximum variance (using the DCT)
- Makes the speech features approximately, though not completely, independent of one another (an effect of the DCT)
- Responds well in clean environments
- Its performance drops in noisy environments

Page 49: 8- Speech Recognition

Time-Frequency Analysis

Short-term Fourier Transform: the standard way of frequency analysis is to decompose the incoming signal into its constituent frequency components.

- W(n): windowing function
- N: frame length
- p: step size
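A minimal sketch of this analysis with the parameters named above (window W(n), frame length N, step size p); the values are illustrative:

```python
import numpy as np

def stft(x, N=400, p=160):
    """Return the STFT magnitude of signal x, one spectrum per frame."""
    W = np.hamming(N)                                   # W(n): window function
    starts = range(0, len(x) - N + 1, p)                # hop by step size p
    return np.array([np.abs(np.fft.rfft(x[s:s + N] * W)) for s in starts])

# Usage: a 1-second, 1 kHz tone at 16 kHz sampling.
t = np.arange(16000) / 16000.0
spec = stft(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (frames, N//2 + 1) = (98, 201)
```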

Page 50: 8- Speech Recognition

Critical Band Integration

- Related to the masking phenomenon: the threshold of a sinusoid is elevated when its frequency is close to the center frequency of a narrow-band noise.
- Frequency components within a critical band are not resolved: the auditory system interprets the signals within a critical band as a whole.

Page 51: 8- Speech Recognition

Bark scale


Page 52: 8- Speech Recognition

Feature Orthogonalization

- Spectral values in adjacent frequency channels are highly correlated.
- The correlation results in a Gaussian model with many parameters: all the elements of the covariance matrix have to be estimated.
- Decorrelation is useful to improve the parameter estimation.

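As an illustration of why decorrelation helps, the sketch below rotates synthetic correlated channel energies onto their principal axes (PCA), after which the covariance is near diagonal, so a diagonal-covariance Gaussian needs far fewer parameters. The random data is an assumed stand-in for log filter-bank energies:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 1))
feats = base + 0.1 * rng.normal(size=(1000, 8))   # 8 highly correlated channels

cov = np.cov(feats, rowvar=False)                 # full 8x8 covariance
eigvals, eigvecs = np.linalg.eigh(cov)            # principal axes
decorr = (feats - feats.mean(0)) @ eigvecs        # rotate onto the axes

# Off-diagonal covariance is (near) zero after the rotation.
print(np.round(np.cov(decorr, rowvar=False), 3))
```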

Page 53: 8- Speech Recognition

Language Models for LVCSR

Word Pair Model: specify which word pairs are valid.

$$W = w_1 w_2 \cdots w_Q$$

$$P(W) = P(w_1, w_2, \ldots, w_Q) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_1, w_2) \cdots$$

$$P(w_j \mid w_1, \ldots, w_{j-1}) \approx P(w_j \mid w_{j-N+1}, \ldots, w_{j-1})$$

$$P(w_j \mid w_k) = \begin{cases} 1 & \text{if } w_k\, w_j \text{ is valid} \\ 0 & \text{otherwise} \end{cases}$$
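A minimal sketch of the word pair model; the pair list and words are illustrative:

```python
# P(w_j | w_k) is 1 when the pair is on the valid list and 0 otherwise.
VALID_PAIRS = {("show", "me"), ("me", "flights"), ("flights", "to")}

def word_pair_prob(prev_word, word):
    """Return P(word | prev_word) under the word pair model."""
    return 1.0 if (prev_word, word) in VALID_PAIRS else 0.0

print(word_pair_prob("show", "me"))       # 1.0: valid pair
print(word_pair_prob("show", "flights"))  # 0.0: not listed
```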

Page 54: 8- Speech Recognition

Statistical Language Modeling

$$P(W) = \prod_i P(w_i \mid w_{i-1}, w_{i-2}, \ldots, w_{i-N+1})$$

The N-gram probabilities are estimated from frequency counts $F(\cdot)$ on a training corpus:

$$\hat{P}(w_3 \mid w_1, w_2) = \frac{F(w_1, w_2, w_3)}{F(w_1, w_2)}$$

$$\hat{P}(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) = \frac{F(w_{i-N+1}, \ldots, w_i)}{F(w_{i-N+1}, \ldots, w_{i-1})}$$

$$\hat{P}(w_1) = \frac{F(w_1)}{\sum_w F(w)}$$

Page 55: 8- Speech Recognition

Perplexity of the Language Model

Entropy of the source:

$$H = -\lim_{Q \to \infty} \frac{1}{Q} \sum_{w_1, \ldots, w_Q} P(w_1, w_2, \ldots, w_Q) \log P(w_1, w_2, \ldots, w_Q)$$

First order entropy of the source (treating the words as independent, $P(w_1, w_2, \ldots, w_Q) = P(w_1) P(w_2) \cdots P(w_Q)$):

$$H_1 = -\sum_{w \in V} P(w) \log P(w)$$

If the source is ergodic, meaning its statistical properties can be completely characterized in a sufficiently long sequence that the source puts out:

$$H = -\lim_{Q \to \infty} \frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)$$

Page 56: 8- Speech Recognition

We often compute H based on a finite but sufficiently large Q:

$$H = -\frac{1}{Q} \log P(w_1, w_2, \ldots, w_Q)$$

H is the degree of difficulty that the recognizer encounters, on average, when it is to determine a word from the same source.

If the N-gram language model $P_N(W)$ is used, an estimate of H is:

$$\hat{H}_p = -\frac{1}{Q} \log \hat{P}(w_1, w_2, \ldots, w_Q)$$

In general:

$$\hat{H}_p = -\frac{1}{Q} \sum_{i=1}^{Q} \log \hat{P}(w_i \mid w_{i-1}, \ldots, w_{i-N+1})$$

Perplexity is defined as:

$$B = 2^{\hat{H}_p} = \hat{P}(w_1, w_2, \ldots, w_Q)^{-1/Q}$$
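A minimal sketch of computing B on held-out text with a bigram model (for example, the train_bigram dict sketched earlier); the small floor probability that keeps the log finite for unseen pairs is an assumption of the sketch, not part of the definition:

```python
import math

def perplexity(p_bigram, words, floor=1e-6):
    """B = 2^H with H = -(1/Q) * sum_i log2 P(w_i | w_{i-1})."""
    logsum = sum(math.log2(p_bigram.get((prev, w), floor))
                 for prev, w in zip(words[:-1], words[1:]))
    return 2.0 ** (-logsum / (len(words) - 1))
```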

Page 57: 8- Speech Recognition

Overall recognition system based on subword units