a comparison of front-end configuration for robust speech

8/8/2019 A Comparison of Front-End Configuration for Robust Speech

1/25

ACOMPARISON

OF

FRONT

END

CONFIGURATIONFORROBUST

SPEECHRECOGNITION

Author:BenMilner

Professor:Reporter:

1


2/25

Outline

Introduction

Staticfeatures

NormalizationMethods

TemporalInformation

ExperimentalResults

2/25


3/25

Introduction

Featureextractionisconsideredascomprising

threedifferent

processing

stage;

namely

static

featureextraction,normalizationand

inclusionoftemporalinformation.

Theaimofthisworkistocompare,both

theoreticallyandexperimentally,anumberof

morepopular

techniques

and

identify

which

combinationsworkbest.

3 /25


4/25

Staticfeaturesextraction

Forspeechrecognition,thevocaltract

componentprovides

best

discrimination

betweenspeechsounds.

Mostoffeatureextractionmethodsuse

cepstralanalysis

to

extract

this

vocal

tract

componentfromcomputingcepstralfeature.

Successful

of

these

methods

also

include

attribute ofthepsychophysicalprocessesofhearingintotheanalysis.

4 /25


5/25

5 /25


6/25

ComparisonofMFCCandPLPAnalysis

Spectralanalysis:

PLPand

MFCC

analysis

both

obtain

a

shorttermpowerspectrumbyapplyinga

Fourier

transform

to

a

frame

of

Hamming

windowedspeech,typically2032msin

duration.

6 /25


7/25

ComparisonofMFCCandPLP

Analysis(cont.) Criticalbandanalysis:

BothPLP

and

MFCC

employ

an

auditory

basedwarpingofthefrequencyaxisderived

from

the

frequency

sensitivity

of

human

hearing.

MFCCarebasedon auniformspacing

alongthe

Mel

scale

and

PLP

uses

the

Bark

scale.

7 /25


8/25

8 /25


9/25

BothcriticalbandanalysisandMelfilteranalysiscanbe

viewasapplyingasetofbasisfunctiontothepower

spectrumofthespeechsignal.

9 /25


10/25


Analysis(cont.) Equalloudnesspreemphasis:

Tocompensatefortheunequalsensitivityofhuman

hearingacross

frequency.

PLPanalysisscalesthecriticalbandsamplitudesaccordingtoanequalloudnesspreemphasisfunction,suchas

InMFCCanalysis,preemphasisisappliedinthetime

domainatypical

implementation

uses

afirst

order

high

passfilter

)1038.0()103.6(

)108.56()(

92262

462

++

+=

ww

wwwE

11)(

= azzH

10 /25


11/25


Analysis(cont.) Intensityloudnesspowerlaw:

Thisprocessing

stage

models

the

non

linear

relationbetweentheintensityofsoundandits

perceivedloudness.

InthePLP,cubicrootcompressionofcritical

bandenergiesisusedtoimplementthisfunction.

Inthe

MFCC

analysis,

logarithmic

compression

ofthemelfilterbankchannelsisapplied.

11 /25


12/25


Analysis(cont.) ARModelingand CepstralAnalysis:

MFCC analysis

computes

cepstral

coefficients

fromthe

log

Mel

filter

using

adiscrete

cosine

transform.

PLP

analysis

the

critical

band

spectrum

is

convertedinto

asmall

number

of

LP

coefficients

throughtheapplicationofaninverse DFTtoprovideautocorrelationcoefficients.

Fromthe

LP

coefficients,

cepstral

coefficients

arecomputedandtheseformthefinalstaticfeaturevector.

12 /25


13/25

NormalizationMethods

Followingcomputationofstaticfeatures,front

endprocessing

techniques

typically

employ

some

formofnormalizationtothefeaturestream.

Inthiscomparisontheprocess

RASTAfiltering

cepstralmean

normalization

13 /25


14/25

RASTAFiltering

TheRASTAfilterasafrontendoperationto

reduce

both

communication

channel

effects

andnoisedistortion.

Channeldistortion

is

additive

in

the

log

frequencyandcepstraldomain,soapplyinga

sharp

cut

off

high

pass

to

each

coefficient,

overtime,removetheoffsetandsuppressesthechanneldistortion.

14 /25


15/25

CepstralMeanNormalization(CMN)

Calculatingthemeanofeachcoefficient

acrossareasonably

large

number

of

frames

givesthecepstralmean.

Subtracting thisfromtheoriginalcepstral

vectorsremoveschannelinducedoffsets

togetherwith

any

other

stationary

speech

component.

15 /25


16/25


17/25

TemporalInformation

HMMsneedtoassumptiontheobservationvectoraregeneratedfromanindependentidentically

distributed(IID).

Thetemporalcorrelationwhichexistsinthefeature

vector

stream,

brakes

this

assumption.

encodingtemporal

information:

TemporalDerivatives

Cepstraltimematrices

17 /25


18/25

TemporalDerivatives

Thefirstordertemporalderivative(velocity)

Where isthefirsttimederivativeofthe cepstralcoefficient

at

time

frame

t,

and

is

the

coefficient

ofthet+ staticcepstralvector.Therangeis kto+kisthetimespanofcepstralvectoracrosswhichthe

derivativeis

calculated.

Inasimilarway,thesecondordertemporalderivative(acceleration) isusuallycomputeassimpledifferenceovervelocityvector.

=

=

+

=k

kk

k

kk

kt

t

k

nkc

nc2

)(

)(

)(nct

)(nckt+

thn

thk

thn

)(nct

18 /25


19/25

Cepstraltimematrices(CTM)

Thecepstraltimematrixisanalternative

framework

for

encoding

the

temporal

variationsofspeechintothefeature.

Thecolumns

of

the

resulting

matrix

representsdifferenttemporalregionsandcan

be

truncated

according

to

the

amount

of

temporalinformationrequiredinthefinalspeechfeature.

19 /25


20/25

20 /25


21/25

2DDCT:

=

+

+=

1

02

)12(cos)(),(

M

k

kttM

mkncnmC

21


22/25

ComparisonofTemporalDerivatives

andCTM

22 /25


23/25

Experimentalresults

Toconstraintheexperimentalresultssoonlytheeffectoffeatureisconsidered,testshavebeen

performedon

an

unconstrained

monophone task.

BTSubscribertelephonydatabasewhichcontainsapproximately4330sentencesinthetrainingset

and2560

in

the

test

set.

Eachofthe44phonemesismodeledbya3state,12model,diagonalcovarianceHMM.

Inthe

feature

extraction

schemes

aframe

rate

of

16mshasbeenused,togetherwithaHammingwindowwidthof32ms.

23 /25


24/25

24 /25


25/25

25 /25

a comparison of front-end configuration for robust speech

Documents