Speech Recognition Trend and Features of the Speech Signal for Speech Recognition. Spring 2014, Hanbat National University, Department of Computer Engineering. Yoon-Joong Kim


Page 1:

Speech Recognition Trend and

Features of the Speech Signal for Speech Recognition

Spring 2014

Hanbat National University, Department of Computer Engineering

Yoon-Joong Kim

Page 2:

Trend of Speech Recognition

• Nuance
  • 1994 – Nuance spun off from SRI's STAR (SRI International's Speech Technology and Research) Lab.
  • The technology, SI-NLSR (speaker-independent NLSR), does not require training for a specific speaker.
• ScanSoft
• ViaVoice
• Vlingo
  • Vlingo is an intelligent software assistant and knowledge navigator functioning as a personal assistant application for Symbian, Android, iPhone, BlackBerry, and other smartphones.

Page 3:

Trend of Speech Recognition

• Siri • Siri is a spin-out from the SRI International Artificial

Intelligence Center, and is an offshoot of the DARPA (Defense Advanced Research Projects Agency)-funded CALO (Cognitive Assistant that Learns and Organizes) project.

• Siri/ˈsɪri/ is an intelligent personal assistant and knowledge navigator which works as an application for Apple's iOS.

• The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services. Apple claims that the software adapts to the user's individual preferences over time and personalizes results, performing tasks such as finding recommendations for nearby restaurants or getting directions.

Page 4:

Trend of Speech Recognition

• S Voice
  • S Voice is an intelligent personal assistant and knowledge navigator which is only available as a built-in application for the Samsung Galaxy S III, S III Mini, S4, S II Plus, Note II, Note 10.1, Note 8.0, Stellar, Grand and Camera.

Page 5:

Trend of Speech Recognition

• CMU Sphinx
  • CMU Sphinx is the general term for a group of speech recognition systems developed at Carnegie Mellon University. These include a series of speech recognizers (Sphinx 2-4) and an acoustic model trainer (SphinxTrain).

• In 2000, the Sphinx group at Carnegie Mellon committed to open source several speech recognizer components, including Sphinx 2 and later Sphinx 3 (in 2001).

• The speech decoders come with acoustic models and sample applications. The available resources also include software for acoustic model training, language model compilation, and a public-domain pronunciation dictionary, cmudict.

Page 6:

An Isolated Word HMM Recognizer

Page 7:

An Isolated Word HMM Recognizer

Given an observation sequence O = o₁o₂… and V word models, the recognizer chooses the word v̂ whose model best explains the observations:

  v̂ = argmax_{1 ≤ v ≤ V} P(O | v)

Front-end feature notes: CMS (cepstral mean subtraction), PLP (Perceptual Linear Prediction) coefficients; see "Multitaper MFCC and PLP Features for Speaker Verification Using i-Vectors".

Page 8:

Features must (a) provide a good representation of phonemes and (b) be robust to non-phonetic changes in the signal.

Features: How to Represent the Speech Signal

Time domain (waveform):

Frequency domain (spectrogram):

“Markov”: male speaker “Markov”: female speaker

Page 9:

Features: Windowing

In many cases, the math assumes that the signal is periodic. We always assume that the data is zero outside the window.

When we apply a rectangular window, there are usually discontinuities in the signal at the ends. So we can window the signal with other shapes, making the signal closer to zero at the ends. This attenuates discontinuities.

Hamming window:

  h(n) = 0.54 − 0.46 cos(2πn/(N−1)),  0 ≤ n ≤ N−1

[Figure: window shape over n = 0 … N−1, amplitude rising from near 0.0 to 1.0 and back]

Typical window size is 16 msec, which equals 256 samples for a 16-kHz (microphone) signal and 128 samples for an 8-kHz (telephone) signal. Window size does not have to equal frame size!
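A minimal sketch of this window in NumPy (the `hamming` helper below is our own; `numpy.hamming` computes the same curve):

```python
import numpy as np

def hamming(N):
    # h(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), 0 <= n <= N-1
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# 16 ms window at a 16-kHz sampling rate = 256 samples
w = hamming(256)
```

Multiplying a frame by `w` tapers it toward 0.08 at both ends, attenuating the edge discontinuities described above.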

Page 10:

Features: Spectrum and Cepstrum

(log power) spectrum:

1. Hamming window
2. Fast Fourier Transform (FFT)
3. Compute 10 log₁₀(r² + i²), where r is the real component and i is the imaginary component

[Figures: time-domain waveform (amplitude vs. time) and log power spectrum (energy (dB) vs. frequency)]
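The three steps above can be sketched as follows (a minimal NumPy version; the small `1e-12` floor is our own addition to avoid log(0) on silent frames):

```python
import numpy as np

def log_power_spectrum(frame):
    # 1. Hamming window  2. FFT  3. 10*log10(r^2 + i^2)
    N = len(frame)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    X = np.fft.rfft(frame * w)
    return 10 * np.log10(X.real**2 + X.imag**2 + 1e-12)
```

For a 256-sample frame, `np.fft.rfft` returns the 129 non-redundant frequency bins of the real-input FFT.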

Page 11:

Features: Spectrum and Cepstrum

cepstrum: treat the spectrum as a signal subject to frequency analysis…

1. Compute log power spectrum
2. Compute FFT of log power spectrum
3. Use only the lower 13 values (cepstral coefficients)
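A sketch of these steps (using the inverse FFT for step 2, the common convention for real cepstra; the `1e-12` floor is our own addition):

```python
import numpy as np

def cepstrum(frame, n_coef=13):
    # log power spectrum, then (inverse) FFT, keep the lowest 13 values
    N = len(frame)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(N) / (N - 1))
    log_spec = np.log10(np.abs(np.fft.rfft(frame * w))**2 + 1e-12)
    return np.fft.irfft(log_spec)[:n_coef]
```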

Page 12:

Features: Spectrum and Cepstrum

Why Use Cepstral Features?

• number of features is small (13 vs. 64 or 128 for spectrum)

• models spectral envelope (relevant to phoneme identity), not (irrelevant) pitch

• coefficients tend to not be correlated with each other (useful to assume that non-diagonal elements of covariance matrix are zero… see Lecture 5, slide 30)

• (relatively) easy to compute

Cepstral features are very commonly used. Another type of feature that is commonly used is called Linear Predictive Coding (LPC).

Page 13:

Features: Autocorrelation

Autocorrelation: measure of periodicity in signal

  R(k) = Σ_m x(m) x(m+k)

  R_n(k) = Σ_{m=0}^{N−1−k} [x(n+m) w(m)] [x(n+m+k) w(m+k)]

n = start sample of analysis, m = sample within analysis window 0…N−1

For periodic signals the function attains a maximum at sample lags of 0, ±P, ±2P, etc., where P is the period of the signal.

[Figure: products of the windowed signal x(m) with shifted copies x(m+k), each over m = 0 … N−1]
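A direct sketch of the windowed autocorrelation (assuming the frame `x` is already windowed and zero outside its N samples):

```python
import numpy as np

def autocorr(x, K):
    # R(k) = sum_m y(m) y(m+k), k = 0..K, with y zero outside the window
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) for k in range(K + 1)])
```

For a periodic input the values peak again at multiples of the period P, as the slide describes.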

Page 14:

Features: Autocorrelation

Autocorrelation: measure of periodicity in signal

From the previous slide:

  R_n(k) = Σ_{m=0}^{N−1−k} [x(n+m) w(m)] [x(n+m+k) w(m+k)]

and if we set y_n(m) = x_n(m) w(m), so that y is the windowed signal of x where the window is zero for m < 0 and m > N−1, then:

  R_n(k) = Σ_{m=0}^{N−1−k} y_n(m) y_n(m+k),  0 ≤ k ≤ K

where K is the maximum autocorrelation index desired.

Note that R_n(k) = R_n(−k), because when we sum over all values of m that have a non-zero y value (or just change the limits in the summation to m = k to N−1 and use negative k), then

  R_n(−k) = Σ_{m=k}^{N−1} y_n(m) y_n(m−k),  0 ≤ k ≤ K

Page 15:

Linear Time-Invariant System

A discrete-time system f() is said to be time-invariant if, when the input is delayed (shifted) by n₀, the output is delayed by the same amount:

  x[n] → y[n]  implies  x[n−n₀] → y[n−n₀]

ex) Is the DTS f() of y[n] = f(x[n]) = x²[n] time-invariant?

  w[n] = f(D_{n₀}(x[n])) = f(x[n−n₀]) = x²[n−n₀]
  y[n−n₀] = D_{n₀}(f(x[n])) = D_{n₀}(x²[n]) = x²[n−n₀] = w[n]

so the DTS f() is time-invariant.

ex) Is the DTS f() of y[n] = f(x[n]) = n·x[n] time-invariant?

  w[n] = f(D_{n₀}(x[n])) = f(x[n−n₀]) = n·x[n−n₀]
  y[n−n₀] = D_{n₀}(f(x[n])) = D_{n₀}(n·x[n]) = (n−n₀)·x[n−n₀] ≠ w[n]

so this DTS f() is not time-invariant.

Page 16:

Linear System

A discrete-time system f() is said to be linear if:

  Additivity: f(x[n] + y[n]) = f(x[n]) + f(y[n])
  Homogeneity: f(a·x[n]) = a·f(x[n])

or, combining the two: f(a₁x₁[n] + a₂x₂[n]) = a₁f(x₁[n]) + a₂f(x₂[n])

ex) Is f(x[n]) = x²[n] linear?

  f(x₁[n] + x₂[n]) = (x₁[n] + x₂[n])² ≠ x₁²[n] + x₂²[n] = f(x₁[n]) + f(x₂[n])

so the DTS f() is not linear.

ex) Is the DTS f() of f(x[n]) = n·x[n] linear?

  f(a₁x₁[n] + a₂x₂[n]) = n·(a₁x₁[n] + a₂x₂[n]) = a₁·n·x₁[n] + a₂·n·x₂[n] = a₁f(x₁[n]) + a₂f(x₂[n])

So the DTS f() is linear.

Page 17:

Is autocorrelation a linear, time-invariant system?

  R(x(n))(k) = Σ_m x_n(m) x_n(m+k)

Is R(x(n))(k) a linear system? Because the signal is multiplied by itself (a squared term), additivity fails:

  R((x₁+x₂)(n))(k) = Σ_m [x₁,n(m) + x₂,n(m)] [x₁,n(m+k) + x₂,n(m+k)] ≠ R(x₁(n))(k) + R(x₂(n))(k)

so R(x(n))(k) is not a linear system.

Is R(x(n))(k) a time-invariant system?

  R(D_{n₀}(x(n)))(k) = R(x(n−n₀))(k) = Σ_m x_{n−n₀}(m) x_{n−n₀}(m+k) = D_{n₀}(R(x(n))(k))

so R(x(n))(k) is a time-invariant system.

Page 18:

Is R_n(k) = R_n(−k)?

  R_n(k) = Σ_m y_n(m) y_n(m+k), where y_n(m) = x_n(m) w(m)

y_n(m) ≠ 0 over 0 ≤ m ≤ N−1 and y_n(m) = 0 otherwise, so y_n(m+k) ≠ 0 over −k ≤ m ≤ N−1−k. The product y_n(m) y_n(m+k) is therefore non-zero only over 0 ≤ m ≤ N−1−k:

  R_n(k) = Σ_{m=0}^{N−1−k} y_n(m) y_n(m+k)

Similarly, for R_n(−k) = Σ_m y_n(m) y_n(m−k): y_n(m−k) ≠ 0 over k ≤ m ≤ N−1+k, so the product is non-zero only over k ≤ m ≤ N−1:

  R_n(−k) = Σ_{m=k}^{N−1} y_n(m) y_n(m−k)

Substituting m′ = m − k (so m = m′ + k, with limits shifted from k…N−1 to 0…N−1−k):

  R_n(−k) = Σ_{m′=0}^{N−1−k} y_n(m′+k) y_n(m′) = R_n(k)

so R_n(k) = R_n(−k).

Page 19:

Features: Autocorrelation

Autocorrelation of speech signals: (from Rabiner & Schafer, p. 143)

Autocorrelation in Speech Signal http://staffwww.dcs.shef.ac.uk/people/M.Cooke/MAD/auto/auto.htm#introduction

Page 20:

Features: Autocorrelation

Eliminate “fall-off” by including samples in w2 not in w1.

  w₁(m) = 1 for 0 ≤ m ≤ N−1, 0 otherwise
  w₂(m) = 1 for 0 ≤ m ≤ N−1+K, 0 otherwise

  R̂_n(k) = Σ_{m=0}^{N−1} [x_n(m) w₁(m)] [x_n(m+k) w₂(m+k)],  0 ≤ k ≤ K

which, with these rectangular windows, is simply

  R̂_n(k) = Σ_{m=0}^{N−1} x_n(m) x_n(m+k),  0 ≤ k ≤ K

R̂ = modified autocorrelation function = cross-correlation function.

Note: requires k·N multiplications; can be slow.

Page 21:

Features: LPC

Linear Predictive Coding (LPC) provides
• low-dimension representation of the speech signal at one frame
• representation of the spectral envelope, not harmonics
• an “analytically tractable” method
• some ability to identify formants

LPC models the speech signal at time point n as an approximate linear combination of the previous p samples:

  s(n) ≈ a₁s(n−1) + a₂s(n−2) + … + a_p s(n−p)    (1)

where a₁, a₂, … a_p are constant for each frame of speech.

We can make the approximation exact by including a “difference” or “residual” term:

  s(n) = Σ_{k=1}^{p} a_k s(n−k) + G u(n)    (2)

where G is a scalar gain factor, and u(n) is the (normalized) error signal (residual).
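Equation (1) can be sketched literally (an unoptimized version, assuming samples before the frame are zero):

```python
import numpy as np

def lpc_predict(s, a):
    # s(n) ~ a1*s(n-1) + ... + ap*s(n-p); samples before the frame taken as 0
    p = len(a)
    pred = np.zeros(len(s))
    for n in range(len(s)):
        for k in range(1, p + 1):
            if n - k >= 0:
                pred[n] += a[k - 1] * s[n - k]
    return pred
```

The residual of equation (2) is then `s - pred` (up to the gain term G u(n)).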

Page 22:

Features: LPC

LPC can be used to generate speech ŝ(m) from either the error signal (residual) or a sequence of impulses as input e(m):

  ŝ(m) = e(m) + a₁s(m−1) + a₂s(m−2) + … + a_p s(m−p)

where ŝ is the generated speech, and e(m) is the error signal or a sequence of impulses. However, we use LPC here as a representation of the signal.

The values a₁…a_p (where p is typically 10 to 15) describe the signal over the range of one window of data (typically 128 to 256 samples).

While it’s true that 10-15 values are needed to predict (model) only one data point (estimating the value at time m from the previous p points), the same 10-15 values are used to represent all data points in the analysis window. When one frame of speech has more than p values, there is data reduction. For speech, the amount of data reduction is about 10:1. In addition, LPC values model the spectral envelope, not pitch information.

Page 23:

Features: LPC

If the error over a segment of speech is defined as

  E_n = Σ_{m=−M₁}^{M₂} e_n²(m) = Σ_{m=−M₁}^{M₂} (s_n(m) − ŝ_n(m))²    (3)

  = Σ_{m=−M₁}^{M₂} ( s_n(m) − Σ_{k=1}^{p} a_k s_n(m−k) )²    (4)

then we can find a_k by setting ∂E_n/∂a_k = 0 for k = 1, 2, … p, obtaining p equations and p unknowns:

  Σ_{m=−M₁}^{M₂} s_n(m−i) s_n(m) = Σ_{k=1}^{p} a_k Σ_{m=−M₁}^{M₂} s_n(m−i) s_n(m−k),  1 ≤ i ≤ p    (5)

(as shown on next slide…) Error is minimum (not maximum) when the derivative is zero, because as any a_k changes away from its optimum value, the error will increase.

Page 24:

Features: LPC

Taking the derivative of the squared error:

  E_n = Σ_{m=−M₁}^{M₂} ( s_n(m) − Σ_{k=1}^{p} a_k s_n(m−k) )²

  ∂E_n/∂a_i = Σ_{m=−M₁}^{M₂} 2 ( s_n(m) − Σ_{k=1}^{p} a_k s_n(m−k) ) (−s_n(m−i))

since

  ∂( a₁s(m−1) + a₂s(m−2) + … + a_i s(m−i) + … + a_p s(m−p) ) / ∂a_i = s(m−i)

Setting ∂E_n/∂a_i = 0:

  Σ_{m=−M₁}^{M₂} s_n(m) s_n(m−i) = Σ_{k=1}^{p} a_k Σ_{m=−M₁}^{M₂} s_n(m−k) s_n(m−i),  1 ≤ i ≤ p

Defining φ_n(i,k) = Σ_{m=−M₁}^{M₂} s_n(m−i) s_n(m−k), this is a set of p linear equations in p unknowns:

  φ_n(1,1)a₁ + φ_n(1,2)a₂ + … + φ_n(1,p)a_p = φ_n(1,0)
  φ_n(2,1)a₁ + φ_n(2,2)a₂ + … + φ_n(2,p)a_p = φ_n(2,0)
  …
  φ_n(p,1)a₁ + φ_n(p,2)a₂ + … + φ_n(p,p)a_p = φ_n(p,0)

Page 25:

Features: LPC

Writing out the minimization for a₁ (the other coefficients follow identically):

  E_n = Σ_{m=−M₁}^{M₂} ( s(m) − Σ_{k=1}^{p} a_k s(m−k) )²    (5-1)

  ∂E_n/∂a₁ = Σ_{m=−M₁}^{M₂} 2 ( s(m) − Σ_{k=1}^{p} a_k s(m−k) ) (−s(m−1)) = 0    (5-2)

  Σ_{m=−M₁}^{M₂} [ 2 s(m)s(m−1) − 2a₁ s(m−1)s(m−1) − 2a₂ s(m−2)s(m−1) − … − 2a_p s(m−p)s(m−1) ] = 0    (5-3)

  Σ_{m=−M₁}^{M₂} s(m)s(m−1) = a₁ Σ_m s(m−1)s(m−1) + a₂ Σ_m s(m−2)s(m−1) + … + a_p Σ_m s(m−p)s(m−1)    (5-4)

repeat (5-4) to (5-6) for a₂, a₃, … a_p, which gives equation (5):

  Σ_{m=−M₁}^{M₂} s(m)s(m−i) = Σ_{k=1}^{p} a_k Σ_{m=−M₁}^{M₂} s(m−k)s(m−i),  1 ≤ i ≤ p

Page 26:

Features: LPC Autocorrelation Method

Then, defining

  φ_n(i,k) = Σ_{m=−M₁}^{M₂} s_n(m−i) s_n(m−k)    (6)

we can re-write equation (5) as:

  Σ_{k=1}^{p} â_k φ_n(i,k) = φ_n(i,0),  1 ≤ i ≤ p    (7)

We can solve for a_k using several methods. The most common method in speech processing is the “autocorrelation” method: Force the signal to be zero outside of the interval 0 ≤ m ≤ N−1:

  ŝ_n(m) = s_n(m) w(m)    (8)

where w(m) is a finite-length window (e.g. Hamming) of length N that is zero when m is less than 0 or greater than N−1. ŝ is the windowed signal. As a result,

  E_n = Σ_{m=0}^{N+p−1} e_n²(m)    (9)

Page 27:

Features: LPC Autocorrelation Method

How did we get from

  E_n = Σ_{m=−M₁}^{M₂} e_n²(m)    (equation (3))

to

  E_n = Σ_{m=0}^{N+p−1} e_n²(m)    (equation (9))

with a window from 0 to N−1? Why not

  E_n = Σ_{m=0}^{N−1} e_n²(m) ??

Because the value of e_n(m) may not be zero when m > N−1… for example, when m = N+p−1, then

  e_n(N+p−1) = ŝ_n(N+p−1) − Σ_{k=1}^{p} a_k ŝ_n(N+p−1−k)

  = ŝ_n(N+p−1) − a₁ ŝ_n(N+p−2) − … − a_p ŝ_n(N−1)

The earlier terms are zero, but ŝ_n(N−1) is not zero!

Page 28:

Features: LPC Autocorrelation Method

Because of setting the signal to zero outside the window, eqn (6) becomes:

  φ_n(i,k) = Σ_{m=0}^{N+p−1} ŝ_n(m−i) ŝ_n(m−k),  1 ≤ i ≤ p, 0 ≤ k ≤ p    (10)

and this can be expressed as

  φ_n(i,k) = Σ_{m=0}^{N−1−(i−k)} ŝ_n(m) ŝ_n(m+i−k),  1 ≤ i ≤ p, 0 ≤ k ≤ p    (11)

and this is identical to the autocorrelation function for |i−k|, because the autocorrelation function is symmetric, R_n(−x) = R_n(x):

  R_n(x) = Σ_{m=0}^{N−1−x} ŝ_n(m) ŝ_n(m+x)    (12)

  φ_n(i,k) = R_n(|i−k|)    (13)

so the set of equations for a_k (eqn (7)) becomes:

  Σ_{k=1}^{p} â_k R_n(|i−k|) = R_n(i),  1 ≤ i ≤ p    (14)

Page 29:

Features: LPC Autocorrelation Method

If R(x) = R(−x), then is R(i−k) = R(|i−k|)?

  R(i−k) = R(|i−k|) when i − k ≥ 0
  R(i−k) = R(−(i−k)) = R(|i−k|) when i − k < 0, using the symmetry R(−x) = R(x)

so in both cases R(i−k) = R(|i−k|). Therefore

  φ_n(i,k) = Σ_{m=0}^{N−1−(i−k)} ŝ_n(m) ŝ_n(m+(i−k)),  1 ≤ i ≤ p, 0 ≤ k ≤ p

  = R_n(i−k) = R_n(|i−k|) = Σ_{m=0}^{N−1−|i−k|} ŝ_n(m) ŝ_n(m+|i−k|)

Page 30:

Features: LPC Autocorrelation Method

Why can equation (10):

  φ_n(i,k) = Σ_{m=0}^{N+p−1} ŝ_n(m−i) ŝ_n(m−k),  1 ≤ i ≤ p, 0 ≤ k ≤ p

be expressed as (11):

  φ_n(i,k) = Σ_{m=0}^{N−1−(i−k)} ŝ_n(m) ŝ_n(m+i−k),  1 ≤ i ≤ p, 0 ≤ k ≤ p ???

Starting from the original equation, substitute m′ = m − i: add i to the ŝ_n() offsets and subtract i from the summation limits. If m′ < 0, ŝ_n(m′) is zero, so the sum still starts at 0:

  φ_n(i,k) = Σ_{m′=0}^{N+p−1−i} ŝ_n(m′) ŝ_n(m′+i−k)

ŝ_n(m′) = 0 at m′ > N−1, and ŝ_n(m′+i−k) = 0 at m′ > N−1−(i−k). The p in the upper limit can therefore be replaced by k (because when m′ > N−1+k−i, ŝ_n(m′+i−k) = 0, and k ≤ p always), giving:

  φ_n(i,k) = Σ_{m′=0}^{N−1−(i−k)} ŝ_n(m′) ŝ_n(m′+i−k)

Page 31:

Features: LPC Autocorrelation Method

In matrix form, equation (14) looks like this:

  | R_n(0)   R_n(1)   R_n(2)   …  R_n(p−1) | | â₁  |   | R_n(1) |
  | R_n(1)   R_n(0)   R_n(1)   …  R_n(p−2) | | â₂  |   | R_n(2) |
  | R_n(2)   R_n(1)   R_n(0)   …  R_n(p−3) | | â₃  | = | R_n(3) |
  | …                                       | | …   |   | …      |
  | R_n(p−1) R_n(p−2) R_n(p−3) …  R_n(0)   | | â_p |   | R_n(p) |

where

  Σ_{k=1}^{p} â_k R_n(|i−k|) = R_n(i),  1 ≤ i ≤ p

  R_n(y) = Σ_{m=0}^{N−1−y} ŝ_n(m) ŝ_n(m+y),  0 ≤ y ≤ p

  ŝ_n(m) = s_n(m) w(m),  s_n(m) = x(n+m)

  w(m) = 0.54 − 0.46 cos(2πm/(N−1)),  0 ≤ m ≤ N−1

There is a recursive algorithm to solve this: Durbin’s solution

Page 32:

Features: LPC Durbin’s Solution

Solve a Toeplitz (symmetric, with equal diagonal elements) matrix for the values of â_k:

  Σ_{k=1}^{p} â_k R_n(|i−k|) = R_n(i),  1 ≤ i ≤ p

  E⁽⁰⁾ = R(0)

  k_i = [ R(i) − Σ_{j=1}^{i−1} α_j⁽ⁱ⁻¹⁾ R(i−j) ] / E⁽ⁱ⁻¹⁾,  1 ≤ i ≤ p

  α_i⁽ⁱ⁾ = k_i

  α_j⁽ⁱ⁾ = α_j⁽ⁱ⁻¹⁾ − k_i α_{i−j}⁽ⁱ⁻¹⁾,  1 ≤ j ≤ i−1

  E⁽ⁱ⁾ = (1 − k_i²) E⁽ⁱ⁻¹⁾

  â_j = α_j⁽ᵖ⁾,  1 ≤ j ≤ p
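The recursion above can be sketched directly in a few lines (tested here against the numeric example worked on the following slides):

```python
def durbin(R, p):
    # Levinson-Durbin recursion: solve sum_k a_k R(|i-k|) = R(i), i = 1..p
    E = R[0]                      # E(0) = R(0)
    a = [0.0] * (p + 1)           # a[j] holds alpha_j^(i); index 0 unused
    for i in range(1, p + 1):
        k = (R[i] - sum(a[j] * R[i - j] for j in range(1, i))) / E
        prev = a[:]
        a[i] = k                  # alpha_i^(i) = k_i
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        E = (1.0 - k * k) * E     # E(i) = (1 - k_i^2) * E(i-1)
    return a[1:], E               # LPC coefficients a_1..a_p, final error
```

With R = [197442, 117319, -946] and p = 2 (the example two slides ahead), this yields a₁ ≈ 0.92289 and a₂ ≈ −0.55317.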

Page 33:

Features: LPC Example

For 2nd-order LPC, with waveform samples {462 16 -294 -374 -178 98 40 -82}:

If we apply a Hamming window (because we assume the signal is zero outside of the window; with a rectangular window there is large prediction error at the edges of the window), which is

  {0.080 0.253 0.642 0.954 0.954 0.642 0.253 0.080}

then we get

  ŝ(m) = {36.96 4.05 -188.85 -356.96 -169.89 62.95 10.13 -6.56}

and so

  R(0) = 197442, R(1) = 117319, R(2) = -946

The model is ŝ(m) = e(m) + a₁s(m−1) + a₂s(m−2), with e(m) = s(m) − Σ_{k=1}^{p} a_k s(m−k). Durbin’s recursion begins:

  E⁽⁰⁾ = R(0) = 197442

  k₁ = R(1)/R(0) = 0.59420

  α₁⁽¹⁾ = k₁ = 0.59420

Page 34:

Features: LPC Example

  E⁽¹⁾ = (1 − k₁²) E⁽⁰⁾ = (1 − 0.59420²) × 197442 = 127731

  k₂ = [ R(2) − α₁⁽¹⁾ R(1) ] / E⁽¹⁾ = [ −946 − 0.59420 × 117319 ] / 127731 = −0.55317

  α₂⁽²⁾ = k₂ = −0.55317

  α₁⁽²⁾ = α₁⁽¹⁾ − k₂ α₁⁽¹⁾ = 0.59420 − (−0.55317)(0.59420) = 0.92289

  â₁ = 0.92289, â₂ = −0.55317

Note: if we divide all R(·) values by R(0), the solution is unchanged, but the error E⁽ⁱ⁾ is now the “normalized error”.
Also: −1 ≤ k_r ≤ 1 for r = 1, 2, …, p.

Page 35:

Features: LPC Example

We can go back and check our results by using these coefficients to “predict” the windowed waveform s(m) = {36.96 4.05 -188.85 -356.96 -169.89 62.95 10.13 -6.56} and compute the error from time 0 to N+p−1 (Eqn (9)), using ŝ(m) = a₁s(m−1) + a₂s(m−2) and e(m) = s(m) − ŝ(m):

  time 0: 0 × 0.92542 + 0 × -0.5554 = 0, vs. 36.96, e(0) = 36.96
  time 1: 36.96 × 0.92542 + 0 × -0.5554 = 34.1, vs. 4.05, e(1) = -30.05
  time 2: 4.05 × 0.92542 + 36.96 × -0.5554 = -16.7, vs. -188.85, e(2) = -172.15
  time 3: -188.9 × 0.92542 + 4.05 × -0.5554 = -176.5, vs. -356.96, e(3) = -180.43
  time 4: -357.0 × 0.92542 + -188.9 × -0.5554 = -225.0, vs. -169.89, e(4) = 55.07
  time 5: -169.9 × 0.92542 + -357.0 × -0.5554 = 40.7, vs. 62.95, e(5) = 22.28
  time 6: 62.95 × 0.92542 + -169.89 × -0.5554 = 152.1, vs. 10.13, e(6) = -141.95
  time 7: 10.13 × 0.92542 + 62.95 × -0.5554 = -25.5, vs. -6.56, e(7) = 18.92
  time 8: -6.56 × 0.92542 + 10.13 × -0.5554 = -11.6, vs. 0, e(8) = 11.65
  time 9: 0 × 0.92542 + -6.56 × -0.5554 = 3.63, vs. 0, e(9) = -3.63

A total squared error of 88,645, or error normalized by R(0) of 0.449.

(If p = 0, then we predict nothing, and the total error equals R(0), so we can normalize all error values by dividing by R(0).)

Page 36:

Features: LPC Example

If we look at a longer speech sample of the vowel /iy/, do pre-emphasis of 0.97 (see following slides), and perform LPC of various orders, we get:

[Figure: normalized prediction error (total squared error / R(0), y-axis 0.00-0.20) vs. LPC order 0-10]

which implies that order 4 captures most of the importantinformation in the signal (probably corresponding to 2 formants)

Page 37:

Features: LPC and Linear Regression

• LPC models the speech at time n as a linear combination of the previous p samples. The term “linear” does not imply that the result involves a straight line, e.g. s = ax + b.

• Speech is then modeled as a linear but time-varying system (piecewise linear).

• LPC is a form of linear regression, called multiple linear regression, in which there is more than one independent variable. In other words, instead of an equation in a single variable of the form s = a₁x + a₂x², it uses an equation of the form s = a₁x + a₂y + …

• Because the function is linear in its parameters, the solution reduces to a system of linear equations, and other techniques for linear regression (e.g. gradient descent) are not necessary.

Page 38:

Features: LPC Spectrum

We can compute the spectral envelope magnitude from the LPC parameters by evaluating the transfer function S(z) for z = e^{jω} (where e^{jω} = cos(ω) + j sin(ω)):

  S(e^{jω}) = G / A(e^{jω}) = G / ( 1 − Σ_{k=1}^{p} a_k e^{−jωk} )

because the log power spectrum is:

  10 log₁₀ ( G² / (Re{A}² + Im{A}²) ) = 10 log₁₀ G² − 10 log₁₀ ( Re{A}² + Im{A}² )

where, at frequency bin n:

  Re{A} = 1 − Σ_{k=1}^{p} a_k cos(2πnk/N),  Im{A} = Σ_{k=1}^{p} a_k sin(2πnk/N),  0 ≤ n ≤ N/2

Each resonance (complex pole) in the spectrum requires two LPC coefficients; each spectral slope factor (at frequency 0 or the Nyquist frequency) requires one LPC coefficient.

For 8 kHz speech with 4 formants, an LPC order of 9 or 10 is used.
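The evaluation above can be sketched in NumPy (a minimal version; bin count N and gain G default to assumed values):

```python
import numpy as np

def lpc_log_spectrum(a, G=1.0, N=256):
    # 10*log10(G^2) - 10*log10(Re{A}^2 + Im{A}^2) at bins n = 0..N/2
    a = np.asarray(a)
    n = np.arange(N // 2 + 1)
    k = np.arange(1, len(a) + 1)
    w = 2 * np.pi * np.outer(n, k) / N        # angles 2*pi*n*k/N
    re = 1 - np.cos(w) @ a                    # Re{A(e^jw)}
    im = np.sin(w) @ a                        # Im{A(e^jw)}
    return 10 * np.log10(G**2) - 10 * np.log10(re**2 + im**2)
```

With all-zero coefficients the envelope is flat at 0 dB; with real speech coefficients the peaks of the returned curve sit at the formants.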

Page 39:

Features: LPC Representations

Page 40:

Features: LPC Cepstral Features

The LPC values are more correlated than cepstral coefficients. But for a GMM with a diagonal covariance matrix, we want the values to be uncorrelated.

So, we can convert the LPC coefficients into cepstral values:

The cepstral coefficients, which are the coefficients of the Fourier transform representation of the log magnitude spectrum, have been shown to be a more robust, reliable feature set for speech recognition than the LPC coefficients, the PARCOR coefficients, or the log area ratio coefficients.

  c₀ = ln G²

  c_m = a_m + Σ_{k=1}^{m−1} (k/m) c_k a_{m−k},  1 ≤ m ≤ p

  c_m = Σ_{k=m−p}^{m−1} (k/m) c_k a_{m−k},  p < m ≤ Q

where Q is the number of cepstral coefficients (typically Q ≈ 3p/2).
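The recursion can be sketched directly (c₀ is left at zero here; a handy check is that for a single-pole model 1/(1 − a z⁻¹) the cepstrum is aᵐ/m):

```python
def lpc_to_cepstrum(a, Q):
    # c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k}     for 1 <= m <= p
    # c_m =       sum_{k=m-p}^{m-1} (k/m) c_k a_{m-k}   for p < m <= Q
    p = len(a)
    c = [0.0] * (Q + 1)               # c[0] (= ln G^2) left as 0 here
    for m in range(1, Q + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]                      # c_1 .. c_Q
```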

Page 41:

Features: LPC History

Wikipedia has an interesting article on the history of LPC:

… The first ideas leading to LPC started in 1966, when S. Saito and F. Itakura of NTT described an approach to automatic phoneme discrimination that involved the first maximum likelihood approach to speech coding. In 1967, John Burg outlined the maximum entropy approach. In 1969, Itakura and Saito introduced partial correlation, Glen Culler proposed real-time speech encoding (in May), and B. S. Atal presented an LPC speech coder at the Annual Meeting of the Acoustical Society of America.

In 1972 Bob Kahn of ARPA, with Jim Forgie (Lincoln Laboratory) and Dave Walden (BBN Technologies), started the first developments in packetized speech, which would eventually lead to Voice over IP. In 1976 the first LPC conference took place over the ARPANET using the Network Voice Protocol.

It is [currently] used as a form of voice compression by phone companies, for example in the GSM standard. It is also used for secure wireless, where voice must be digitized, encrypted and sent over a narrow voice channel.

[from http://en.wikipedia.org/wiki/Linear_predictive_coding]

Page 42:

Features: Pre-emphasis

The source signal for voiced sounds has a slope of -6 dB/octave:

[Figure: energy (dB) vs. frequency, 0-4 kHz, falling at -6 dB/octave]

We want to model only the resonant energies, not the source. But LPC will model both source and resonances.

If we pre-emphasize the signal for voiced sounds, we flatten it in the spectral domain, and the source of speech more closely approximates impulses. LPC can then model only the resonances (the important information) rather than resonances + source.

Pre-emphasis:

  s′_n(m) = s_n(m) − k s_n(m−1),  k = 0.97
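A one-line sketch of this filter (assuming NumPy, and taking the sample before the frame as zero):

```python
import numpy as np

def preemphasize(s, k=0.97):
    # s'(m) = s(m) - k*s(m-1), with s(-1) taken as 0
    s = np.asarray(s, dtype=float)
    out = s.copy()
    out[1:] -= k * s[:-1]
    return out
```

A constant (low-frequency) input is attenuated to 1 − k = 0.03 of its value, which is exactly the spectral flattening described above.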

Page 43:

Features: Pre-emphasis

Adaptive pre-emphasis: a better way to flatten the speech signal

1. LPC of order 1: the spectral slope in dB/octave is captured by R(1)/R(0), the first value of the normalized autocorrelation.
2. Use the result as the pre-emphasis factor:

  s′_n(m) = s_n(m) − (R(1)/R(0)) s_n(m−1)

Page 44:

Features: Cepstral Coefficients

For input speech samples s(n) = {s(n)}:

1. Pre-emphasis (factor 0.97):

  ŝ(n) = s(n) − 0.97 s(n−1)

2. Framing (n: sample index within a frame, M: shift rate, M << N):

  x_l(n) = ŝ(Ml + n),  n = 0, 1, …, N−1,  l = 0, 1, …, L−1

3. Hamming window:

  x̂_l(n) = x_l(n) w(n)

  w(n) = 0.54 − 0.46 cos(2πn/(N−1)),  0 ≤ n ≤ N−1

Page 45:

Features: Cepstral Coefficients

5. Autocorrelation analysis:

  R(y) = Σ_{m=0}^{N−1−y} x_n(m) x_n(m+y),  0 ≤ y ≤ p

where x_n(m) = x(n+m), and x(n) here denotes the windowed signal x̂(n), 0 ≤ n ≤ N−1.

6. LPC analysis (Durbin’s recursion), solving Σ_{k=1}^{p} â_k R(|i−k|) = R(i), 1 ≤ i ≤ p:

  E⁽⁰⁾ = R(0)

  k_i = [ R(i) − Σ_{j=1}^{i−1} α_j⁽ⁱ⁻¹⁾ R(i−j) ] / E⁽ⁱ⁻¹⁾,  1 ≤ i ≤ p

  α_i⁽ⁱ⁾ = k_i

  α_j⁽ⁱ⁾ = α_j⁽ⁱ⁻¹⁾ − k_i α_{i−j}⁽ⁱ⁻¹⁾,  1 ≤ j ≤ i−1

  E⁽ⁱ⁾ = (1 − k_i²) E⁽ⁱ⁻¹⁾

  a_j = α_j⁽ᵖ⁾,  1 ≤ j ≤ p

Page 46:

Features: Cepstral Coefficients

7. LPC parameter conversion to cepstral coefficients:

  c_m = a_m + Σ_{k=1}^{m−1} (k/m) c_k a_{m−k},  1 ≤ m ≤ p

  c_m = Σ_{k=m−p}^{m−1} (k/m) c_k a_{m−k},  p < m ≤ Q