a comparison of front-end configuration for robust speech
TRANSCRIPT
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
1/25
ACOMPARISON
OF
FRONT
END
CONFIGURATIONFORROBUST
SPEECHRECOGNITION
Author:BenMilner
Professor:Reporter:
1
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
2/25
Outline
Introduction
Staticfeatures
NormalizationMethods
TemporalInformation
ExperimentalResults
2/25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
3/25
Introduction
Featureextractionisconsideredascomprising
threedifferent
processing
stage;
namely
static
featureextraction,normalizationand
inclusionoftemporalinformation.
Theaimofthisworkistocompare,both
theoreticallyandexperimentally,anumberof
morepopular
techniques
and
identify
which
combinationsworkbest.
3 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
4/25
Staticfeaturesextraction
Forspeechrecognition,thevocaltract
componentprovides
best
discrimination
betweenspeechsounds.
Mostoffeatureextractionmethodsuse
cepstralanalysis
to
extract
this
vocal
tract
componentfromcomputingcepstralfeature.
Successful
of
these
methods
also
include
attribute ofthepsychophysicalprocessesofhearingintotheanalysis.
4 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
5/25
5 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
6/25
ComparisonofMFCCandPLPAnalysis
Spectralanalysis:
PLPand
MFCC
analysis
both
obtain
a
shorttermpowerspectrumbyapplyinga
Fourier
transform
to
a
frame
of
Hamming
windowedspeech,typically2032msin
duration.
6 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
7/25
ComparisonofMFCCandPLP
Analysis(cont.) Criticalbandanalysis:
BothPLP
and
MFCC
employ
an
auditory
basedwarpingofthefrequencyaxisderived
from
the
frequency
sensitivity
of
human
hearing.
MFCCarebasedon auniformspacing
alongthe
Mel
scale
and
PLP
uses
the
Bark
scale.
7 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
8/25
8 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
9/25
BothcriticalbandanalysisandMelfilteranalysiscanbe
viewasapplyingasetofbasisfunctiontothepower
spectrumofthespeechsignal.
9 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
10/25
ComparisonofMFCCandPLP
Analysis(cont.) Equalloudnesspreemphasis:
Tocompensatefortheunequalsensitivityofhuman
hearingacross
frequency.
PLPanalysisscalesthecriticalbandsamplitudesaccordingtoanequalloudnesspreemphasisfunction,suchas
InMFCCanalysis,preemphasisisappliedinthetime
domainatypical
implementation
uses
afirst
order
high
passfilter
)1038.0()103.6(
)108.56()(
92262
462
++
+=
ww
wwwE
11)(
= azzH
10 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
11/25
ComparisonofMFCCandPLP
Analysis(cont.) Intensityloudnesspowerlaw:
Thisprocessing
stage
models
the
non
linear
relationbetweentheintensityofsoundandits
perceivedloudness.
InthePLP,cubicrootcompressionofcritical
bandenergiesisusedtoimplementthisfunction.
Inthe
MFCC
analysis,
logarithmic
compression
ofthemelfilterbankchannelsisapplied.
11 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
12/25
ComparisonofMFCCandPLP
Analysis(cont.) ARModelingand CepstralAnalysis:
MFCC analysis
computes
cepstral
coefficients
fromthe
log
Mel
filter
using
adiscrete
cosine
transform.
PLP
analysis
the
critical
band
spectrum
is
convertedinto
asmall
number
of
LP
coefficients
throughtheapplicationofaninverse DFTtoprovideautocorrelationcoefficients.
Fromthe
LP
coefficients,
cepstral
coefficients
arecomputedandtheseformthefinalstaticfeaturevector.
12 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
13/25
NormalizationMethods
Followingcomputationofstaticfeatures,front
endprocessing
techniques
typically
employ
some
formofnormalizationtothefeaturestream.
Inthiscomparisontheprocess
RASTAfiltering
cepstralmean
normalization
13 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
14/25
RASTAFiltering
TheRASTAfilterasafrontendoperationto
reduce
both
communication
channel
effects
andnoisedistortion.
Channeldistortion
is
additive
in
the
log
frequencyandcepstraldomain,soapplyinga
sharp
cut
off
high
pass
to
each
coefficient,
overtime,removetheoffsetandsuppressesthechanneldistortion.
14 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
15/25
CepstralMeanNormalization(CMN)
Calculatingthemeanofeachcoefficient
acrossareasonably
large
number
of
frames
givesthecepstralmean.
Subtracting thisfromtheoriginalcepstral
vectorsremoveschannelinducedoffsets
togetherwith
any
other
stationary
speech
component.
15 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
16/25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
17/25
TemporalInformation
HMMsneedtoassumptiontheobservationvectoraregeneratedfromanindependentidentically
distributed(IID).
Thetemporalcorrelationwhichexistsinthefeature
vector
stream,
brakes
this
assumption.
encodingtemporal
information:
TemporalDerivatives
Cepstraltimematrices
17 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
18/25
TemporalDerivatives
Thefirstordertemporalderivative(velocity)
Where isthefirsttimederivativeofthe cepstralcoefficient
at
time
frame
t,
and
is
the
coefficient
ofthet+ staticcepstralvector.Therangeis kto+kisthetimespanofcepstralvectoracrosswhichthe
derivativeis
calculated.
Inasimilarway,thesecondordertemporalderivative(acceleration) isusuallycomputeassimpledifferenceovervelocityvector.
=
=
+
=k
kk
k
kk
kt
t
k
nkc
nc2
)(
)(
)(nct
)(nckt+
thn
thk
thn
)(nct
18 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
19/25
Cepstraltimematrices(CTM)
Thecepstraltimematrixisanalternative
framework
for
encoding
the
temporal
variationsofspeechintothefeature.
Thecolumns
of
the
resulting
matrix
representsdifferenttemporalregionsandcan
be
truncated
according
to
the
amount
of
temporalinformationrequiredinthefinalspeechfeature.
19 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
20/25
20 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
21/25
2DDCT:
=
+
+=
1
02
)12(cos)(),(
M
k
kttM
mkncnmC
21
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
22/25
ComparisonofTemporalDerivatives
andCTM
22 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
23/25
Experimentalresults
Toconstraintheexperimentalresultssoonlytheeffectoffeatureisconsidered,testshavebeen
performedon
an
unconstrained
monophone task.
BTSubscribertelephonydatabasewhichcontainsapproximately4330sentencesinthetrainingset
and2560
in
the
test
set.
Eachofthe44phonemesismodeledbya3state,12model,diagonalcovarianceHMM.
Inthe
feature
extraction
schemes
aframe
rate
of
16mshasbeenused,togetherwithaHammingwindowwidthof32ms.
23 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
24/25
24 /25
-
8/8/2019 A Comparison of Front-End Configuration for Robust Speech
25/25
25 /25