speech preprocessing, speech production, cepstrum

46
Speech Preprocessing, Speech Production, Cepstrum Jan ˇ Cernock´ y, Valentina Hubeika {cernocky|ihubeika}@fit.vutbr.cz FIT BUT Brno Speech Preprocessing, Speech Production, Cepstrum Jan ˇ Cernock´ y, Valentina Hubeika, DCGM FIT BUT Brno1/46

Upload: others

Post on 01-Jul-2022

17 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Speech Preprocessing, Speech Production, Cepstrum

Speech Preprocessing, SpeechProduction, Cepstrum

Jan Cernocky, Valentina Hubeika

{cernocky|ihubeika}@fit.vutbr.cz

FIT BUT Brno

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno1/46

Page 2: Speech Preprocessing, Speech Production, Cepstrum

Agenda

• Speech Parameterization

– Preprocessing– Basic Parameters: short-time energy, zero crossing rate.

• Speech production and its model.

• Spectrogram.

• Separation of excitation and modification – cepstrum.

• Approximation of cepstra according to the human auditory system’s response– MFCC.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno2/46

Page 3: Speech Preprocessing, Speech Production, Cepstrum

PARAMETERIZATION

• Goal: express a signal on a limited number of values – a) “parameterization”, b)

“feature extraction”.

• a) representation based on findings in signals processing (filter banks, Fourier

transform, etc.) ⇒ non-parametric representation.

• b) representation based on findings about speech production ⇒ parametric

representation.

BUT:

• b) also makes use of the techniques of non-parametric representation, thus difficult

(sometimes unfavorable) to distinguish between the two groups.

• The calculated values are anyway usually called parameters.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno3/46

Page 4: Speech Preprocessing, Speech Production, Cepstrum

Parameters

• scalar per frame – one number calculated from a speech frame (short-time energy or

zero crossing rate).

• vector per frame – a set of numbers (vector) calculated from a speech frame. When

having a sequence of frames, parameters are usually stored in matrices.

time

index ...

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno4/46

Page 5: Speech Preprocessing, Speech Production, Cepstrum

PRE-PROCESSING

Mean Normalization

Direct current offset (dc-offset) carries no useful information. Moreover, can carry

disturbing information (when calculating energy).

0 200 400 600 800 1000 1200 1400 1600 1800−1.5

−1

−0.5

0

0.5

1

1.5

2

2.5x 10

4

n

s(n)

dc-offset removal!

s′[n] = s[n] − µs, µs must be estimated. (1)

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno5/46

Page 6: Speech Preprocessing, Speech Production, Cepstrum

Mean Value Off-Line

Here, equivalent to the average value:

s =1

N

N∑

n=1

s[n] (2)

0 200 400 600 800 1000 1200 1400 1600 18005256.5

5257

5257.5

5258

5258.5

5259

n

mea

n of

f−lin

e

0 200 400 600 800 1000 1200 1400 1600 1800−2

−1.5

−1

−0.5

0

0.5

1

1.5

2x 10

4

n

s’(n)

offl

ine

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno6/46

Page 7: Speech Preprocessing, Speech Production, Cepstrum

Mean Value On-Line

The whole signal is not (yet) available: either is too long or there is a flow of new values..

s[n] = γs[n − 1] + (1 − γ)s[n], (3)

where γ −→ 1. This is equivalent to passing a signal through a filter with the impulse

response:

h = [(1 − γ) (1 − γ)γ (1 − γ)γ2 . . .]. (4)

Defined as the geometric progression: the initial element is a0 = 1 − γ and the quotient is

q = γ. The sum is thus:∞∑

n=0

h[n] =a0

1 − q=

1 − γ

1 − γ= 1, (5)

(this is what we originally expected ,).

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno7/46

Page 8: Speech Preprocessing, Speech Production, Cepstrum

Example, γ = 0.99 (see the first computer lab):

0 200 400 600 800 1000 1200 1400 1600 18000

1000

2000

3000

4000

5000

6000

n

mea

n on

−lin

e, γ

=0.9

9

0 200 400 600 800 1000 1200 1400 1600 1800−2

−1.5

−1

−0.5

0

0.5

1

1.5

2x 10

4

n

s’(n)

onl

ine

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno8/46

Page 9: Speech Preprocessing, Speech Production, Cepstrum

Preemphasis

Increases the magnitude of the higher frequencies with respect to the magnitude of the

lower frequencies. Equalization of the speech frequency characteristics (the magnitude

decreases towards higher frequencies). Rather a historical operation.

A simple first-order FIR filter:

H(z) = 1 − κz−1, (6)

where κ ∈ [0.9, 1]. Calculated difference between the two neighboring samples. The

magnitude frequency characteristics for κ=0.95:

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno9/46

Page 10: Speech Preprocessing, Speech Production, Cepstrum

0 0.1 0.2 0.3 0.4 0.50

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

normalized frequency

|H(f)

|, κ=

0.95

Passing through the defined filter:

s′[n] = s[n] − κs[n − 1] (7)

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno10/46

Page 11: Speech Preprocessing, Speech Production, Cepstrum

0 20 40 60 80 100 120 140 160 180−2

−1.5

−1

−0.5

0

0.5

1

1.5

2x 10

4

n

orig

inal

0 20 40 60 80 100 120 140 160 180−1

−0.5

0

0.5

1

1.5x 10

4

n

orig

inal

⇒ The processed signal (after applying preemphasis) contains of more higher frequencies.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno11/46

Page 12: Speech Preprocessing, Speech Production, Cepstrum

FRAMES

• Why?

• Speech signal is considered as random, parameter estimation methods require

stationary signals.

• Thus dividing the signal into shorter segments (segments, micro-segments, frames)

within which the signal behaves (we hope) as stationary.

• Frame parameters: length lram, overlap pram, frame shift sram = lram − pram.

Frame Length

1. short enough to assume the signal (within the given length) is stationary.

2. BUT: long enough to provide accurate estimation of the desired parameters (features).

⇒ trade off (momentum of the articulation tract), typical length 20–25 ms (160–200

samples for Fs =8000 Hz).

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno12/46

Page 13: Speech Preprocessing, Speech Production, Cepstrum

Overlap

• small or none: , fast time shift in the signal, low memory/processor demands, / the

difference of the parameter values of the neighboring frames can be significant.

• large: , slow time shift, smooth change in the parameter values, / high

memory/processor demands, alike parameter values (violates the independency

assumption!).

⇒ tradeoff, typical length 10 ms, thus 100 frames per second, centi-second vectors.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno13/46

Page 14: Speech Preprocessing, Speech Production, Cepstrum

How many frames per segment of the length N ?

No overlap, pram = 0

lram

N

ramp =0

Nram =

N

lram

, (8)

. . . ⌊·⌋ denotes the operation ’floor’.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno14/46

Page 15: Speech Preprocessing, Speech Production, Cepstrum

Frames overlap, pram 6= 0

lram

ramp

N

rams

Nram = 1 +

N − lram

sram

(9)

. . . the signal must be at least one frame long.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno15/46

Page 16: Speech Preprocessing, Speech Production, Cepstrum

Signal Segmentation - Windowing Function

Select a frame of a signal using a window - window(ing) function:

Rectangular – no change of the signal, selection only:

w[n] =

1 pro 0 ≤ n ≤ lram − 1

0 otherwise(10)

Hamming – suppresses the signal at the sides of the window, selection and weighting:

w[n] =

0.54 − 0.46 cos2πn

lram − 1pro 0 ≤ n ≤ lram − 1

0 otherwise(11)

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno16/46

Page 17: Speech Preprocessing, Speech Production, Cepstrum

How does windowing change the spectrum of the selected segment? A product in time

domain corresponds to convolution of the speech spectrum with the window spectrum.

X(f) = S(f) ⋆ W (f) (12)

Comparison of the rectangular and the Hamming window:

−50 0 50 100 150 200 2500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

n

rect

angu

lar,

l ram

=160

−0.05 0 0.050

20

40

60

80

100

120

140

160

normalized frequency

rect

angu

lar −

spe

ctru

m

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno17/46

Page 18: Speech Preprocessing, Speech Production, Cepstrum

−50 0 50 100 150 200 2500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

n

Ham

min

g, l

ram

=160

−0.05 0 0.050

10

20

30

40

50

60

70

80

90

normalized frequency

Ham

min

g −

spec

trum

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno18/46

Page 19: Speech Preprocessing, Speech Production, Cepstrum

BASIC PARAMETERS OF SPEECH SIGNAL

all the parameters will be derived for single frames. For each frame:

• scalar ⇒ a row vector.

• vector ⇒ matrix, columns contain dimensions of the parameter vector, rows contain a

sequence of values in a particular dimension over time (time in frames).

Average Short-Time Energy

E =1

lram

lram−1∑

n=0

x2[n] (13)

• speech activity detector.

• separation of phonemes to voiced (high energy) and unvoiced (low energy).

• often we use log-energy.

• careful with noise and low-energy phonemes.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno19/46

Page 20: Speech Preprocessing, Speech Production, Cepstrum

Example: “letajıcı prase”

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000−3

−2

−1

0

1

2

3

x 104

n

s(n)

0 20 40 60 80 100 1200

2

4

6

8

10

12

14

16

x 107

frames

lin energy

0 20 40 60 80 100 1206

8

10

12

14

16

18

frames

log energy

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno20/46

Page 21: Speech Preprocessing, Speech Production, Cepstrum

Zero-Crossing Rate

. . . rate of sign changes of the signal within a given frame.

Z =1

2

lram−1∑

n=1

| sign x[n] − sign x[n − 1]|, (14)

where sign (x) is the sign function defined as:

sign x[n] =

+1 pro x[n] ≥ 0

−1 pro x[n] < 0(15)

How does it work? The function | signx[n] − sign x[n − 1]| results in 2 when there is a

change in the sign between the samples x[n − 1] and x[n]:

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno21/46

Page 22: Speech Preprocessing, Speech Production, Cepstrum

0 20 40 60 80 100 120 140 160−8000

−6000

−4000

−2000

0

2000

4000

6000

8000

10000

12000

n

zero cro

ssings

• distinguishing between the voiced (low zero-crossing rate) and unvoiced (high rate,

rather like in noise).

• very sensitive to noise. . .

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno22/46

Page 23: Speech Preprocessing, Speech Production, Cepstrum

Example: “letajıcı prase”

1000 2000 3000 4000 5000 6000 7000 8000 9000 10000−3

−2

−1

0

1

2

3

x 104

n

s(n)

0 20 40 60 80 100 120

10

20

30

40

50

60

70

frames

zero cros

sings

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno23/46

Page 24: Speech Preprocessing, Speech Production, Cepstrum

HUMAN VOCAL APPARATUS AND ITS MODEL

(Adopted from Wikipedia )

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno24/46

Page 25: Speech Preprocessing, Speech Production, Cepstrum

Organs and their Models in Digital Processing

• lungs — energy source — signals: none

• larynx — energy modulation – signals: excitation.

– opened vocal cords — noise.

– vibrating vocal cords — periodic signal (tone). fundamental frequency:

males 90–120 Hz

females 150–300 Hz

children 350–400 Hz

• vocal (articulatory) tract — modification tract — signals: filter.

– pharynx.– velum.– tongue.– oral and nasal cavity.– teeth.– lips.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno25/46

Page 26: Speech Preprocessing, Speech Production, Cepstrum

Model

±°²¯¡¡µ

±°²¯

±°²¯

¡¡µ

-

-

?

6

- -currentsource

Σ

= vibrating vocal cords

= lungs

= open vocal cords

= articulatory tract

impulsgenerator

noisegenerator

lineartransfersystem(filter)

speech

Transfer system: linear filter - usually IIR.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno26/46

Page 27: Speech Preprocessing, Speech Production, Cepstrum

Vocal Tract Model in Time and Frequency Domain

cca -12 dB/okt.

h(t)

t

b)

G(f)d)

f f

H(f)

f

F1F2

F3

f)

F0

t t

s(t)c)a)

g(t) T0

e)S(f)

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno27/46

Page 28: Speech Preprocessing, Speech Production, Cepstrum

Top part – time behaviour, bottom part – spectrum.

• a) and d) excitation: T0 is the period, F0 is the fundamental frequency (pitch).

• b) and e) articulation tract: F1 to F3 are the formants (resonance frequency of the

vocal tract), are given by the physical configuration of the vocal tract.

• c) and f) the resulting signal and its spectrum.

The resulting signal is given in the time domain by convolution:

s(t) = g(t) ⋆ h(t) =

∫ +∞

−∞

g(τ)h(t − τ)dτ. (16)

Convolution in time domain corresponds to product in frequency:

S(f) = G(f)H(f). (17)

A relevant task in speech processing is de-convolution; the goal is to separate excitation

and modification.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno28/46

Page 29: Speech Preprocessing, Speech Production, Cepstrum

SPECTROGRAM

One spectrum is not enough (speech is non-nonstationary) ⇒ representation of the

spectrum (strictly speaking PSD) behaviour over time:

• segment speech into frames.

• estimate the PSD for each frame, usually using DFT.

• depict:

– horizontal axes represents time (“rough” time in frames).– vertical axes represents frequency.– color represents energy.

Depending on the frame length we talk about:

• long-term spectrogram.

• short-term spectrogram.

/ Drawback of DFT: Fine scale in frequency and time domain cannot be satisfied

simultaneously

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno29/46

Page 30: Speech Preprocessing, Speech Production, Cepstrum

long-term: specgram(s,256,8000,hamming(256),200);

0 0.2 0.4 0.6 0.8 1 1.20

500

1000

1500

2000

2500

3000

3500

4000

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno30/46

Page 31: Speech Preprocessing, Speech Production, Cepstrum

short-term: specgram(s,256,8000,hamming(50));

0 0.2 0.4 0.6 0.8 1 1.20

500

1000

1500

2000

2500

3000

3500

4000

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno31/46

Page 32: Speech Preprocessing, Speech Production, Cepstrum

CEPSTRUM

. . . separates excitation from modification – convenient for encoding; dropping excitation

frequency in speech processing (excitation carries information dependent on speaker,

mood,. . . )

What can we do.. 1: filter off frequency lower than 400 Hz and get rid of the fundamental

frequency. . . BAD IDEA:

0 500 1000 1500 2000 2500 3000 3500 4000100

120

140

160

180

200

220

240

260

280

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno32/46

Page 33: Speech Preprocessing, Speech Production, Cepstrum

• fundamental frequency folds are present

along the whole spectrum.

• we can loose the fist formant.

• land line band starts on 300 Hz and we

still can recognize the pitch.

• . . . so we need a better approach.

⇒ Cepstrum

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno33/46

Page 34: Speech Preprocessing, Speech Production, Cepstrum

Challenge

Excitation e(t) is convoluted with the filter (modification) impulse response:

s(t) = g(t) ⋆ h(t) =

∫ +∞

−∞

g(τ)h(t − τ)dτ, (18)

which in frequency domain corresponds to product:

S(f) = G(f)H(f). (19)

we cannot well separate the two components in either domain. Solution: non-linearity,

which can translate product to summation.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno34/46

Page 35: Speech Preprocessing, Speech Production, Cepstrum

Definition of Cepstrum

lnG(f) =

+∞∑

n=−∞

c(n)e−j2πfn (20)

The c(n) values are the cepstral coefficients. Since G(f) is an even function, c(n) are

real and the following holds:

c(n) = c(−n) (21)

The sum in the equation is the definition of DFT, hence we can compute the c(n) as:

c(n) = F−1 [lnG(f)] (22)

DFT-cepstrum

c(n) = F−1{

ln |F [s(n)]|2}

, (23)

• spectrum −→ cepstrum.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno35/46

Page 36: Speech Preprocessing, Speech Production, Cepstrum

Can it really “break” convolution?

s(n) = e(n) ⋆ h(n), (24)

S(f) = E(f)H(f) a thus |S(f)|2 = |E(f)|2 |H(f)|2. (25)

For the cepstrum calculation we make use of the linearity of the inverse Fourier transform:

F−1(a + b) = F−1(a) + F−1(b).

It results in:

c(n) = F−1{

ln[|E(f)|2 |H(f)|2]}

= F−1{

ln |E(f)|2 + ln |H(f)|2}

= (26)

= F−1{

ln |E(f)|2}

+ F−1{

ln |H(f)|2}

= ce(n) + ch(n) (27)

(28)

Convolution becames summation. The coefficients ce(n) and ch(n) are separable in

frequency, which allows to separate them by windowing.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno36/46

Page 37: Speech Preprocessing, Speech Production, Cepstrum

signal, |F [s(n)]|2, ln |F [s(n)]|2, F−1{

ln |F [s(n)]|2}

.

50 100 150 200 250 300 350 400 450 500

−1.5

−1

−0.5

0

0.5

1

1.5

x 104

0 500 1000 1500 2000 2500 3000 3500 40000

5

10

15x 10

11

0 500 1000 1500 2000 2500 3000 3500 400012

14

16

18

20

22

24

26

28

30

50 100 150 200 250 300 350 400 450 500

0

2

4

6

8

10

12

14

16

18

20

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno37/46

Page 38: Speech Preprocessing, Speech Production, Cepstrum

0 20 40 60 80 100 120 140 160 180 200

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

1.2

excitation

1st multipleof lag

2nd multipleof lag

filter

c0 is log−energy

For the sampling frequency Fs =8000 Hz, we can separate excitation from modification in

frequency domain using threshold of 30.

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno38/46

Page 39: Speech Preprocessing, Speech Production, Cepstrum

Excitation only – set the cepstra related to modification to zero:

modified cepstrum, ln |F [s(n)]|2, |F [s(n)]|, signal (after IDFT, phases of the

original signal are used).

50 100 150 200 250 300 350 400 450 500

−0.25

−0.2

−0.15

−0.1

−0.05

0

0.05

0.1

0.15

0.2

0.25

0 500 1000 1500 2000 2500 3000 3500 4000−7

−6

−5

−4

−3

−2

−1

0

1

2

3

0 100 200 300 400 500 6000

0.5

1

1.5

2

2.5

3

3.5

4

50 100 150 200 250 300 350 400 450 500

−0.4

−0.3

−0.2

−0.1

0

0.1

0.2

0.3

0.4

0.5

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno39/46

Page 40: Speech Preprocessing, Speech Production, Cepstrum

Modification only – set the cepstra related to excitation to zero:

modified cepstrum, ln |F [s(n)]|2, |F [s(n)]|, signal (after IDFT, phases are set to

zero).

50 100 150 200 250 300 350 400 450 500

0

2

4

6

8

10

12

14

16

18

20

0 500 1000 1500 2000 2500 3000 3500 400016

17

18

19

20

21

22

23

24

25

26

0 100 200 300 400 500 6000

0.5

1

1.5

2

2.5

3

3.5x 10

5

0 10 20 30 40 50 60 70 80 90−2

−1

0

1

2

3

4

5

x 104

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno40/46

Page 41: Speech Preprocessing, Speech Production, Cepstrum

Mel-frequency cepstrum – MFCC

• DFT has equivalent frequency resolution along the axes.

• Human ear has higher resolution on lower frequencies than on higher frequencies.

• We want to adjust cepstrum to human hearing.

how do we do it?

• Place filters along the frequency axes non-linearly, calculate the energy on the output,

use the calculated values instead of DFT when calculating cepstra.

• Calculate non-linear frequency axes and place the filters on the modified axes linearly,

. . .

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno41/46

Page 42: Speech Preprocessing, Speech Production, Cepstrum

Convert Hertz to Mel (to transform the frequency axes):

FMel = 2959 log10(1 +FHz

700) (29)

0 500 1000 1500 2000 2500 3000 3500 40000

500

1000

1500

2000

2500

frequency [Hz]

freq

uenc

y [M

el]

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno42/46

Page 43: Speech Preprocessing, Speech Production, Cepstrum

When linearly placed on the Mel axes, the filters correspond to the non-linearly placed

filters on the Hertz axes:

m m m m m m1 2 3 4 5 6

frequency

1

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno43/46

Page 44: Speech Preprocessing, Speech Production, Cepstrum

Energy estimation:

1. construct filter bank, filter the input signal in time domain and calculate

energy:∑

n s2i (n) . . . TOO DIFFICULT.

2. apply DFT, power, multiply by a triangular window and add up. (used in the HTK

toolkit).

Inverse FT can be realized by the discrete cosine transform (DCT) . . . (without derivation:

makes use of symmetry of the spectrum and the fact that the result should be real):

cmf (n) =

K∑

i=1

log mk cos[

n(k − 0.5)π

K

]

(30)

⇒ Mel-frequency cepstral coefficients (MFCC)

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno44/46

Page 45: Speech Preprocessing, Speech Production, Cepstrum

Short

Term

DFT

Mel

Filter

Bank

.

.

.

.

.

ln

Sl(1,t)

Sl(2,t)

Sl(23,t)

InputSpeech DCT

C1

C2

C13

| . |

| . |

| . | ln

ln

.

.

.

.

.

S(1,t)

S(2,t)

S(129,t)

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno45/46

Page 46: Speech Preprocessing, Speech Production, Cepstrum

0 5 10 15 20 25−1

−0.5

0

0.5

1a) Segment of speech signal for vowel ’iy’

time [ms]0 5 10 15 20 25

−0.5

0

0.5b) Speech segment after preemphasis and windowing

time [ms]

0 500 1000 1500 2000 2500 3000 3500 40000

1

2

3

4

5

6c) Fourier spectrum of speech segment

frequency [Hz]1 3 5 7 9 11 13 15 17 19 21 23

0

5

10

15

20d) Filter bank energies − smoothed spectrum

band number

1 3 5 7 9 11 13 15 17 19 21 23−4

−3

−2

−1

0

1

2

3

4e) Log of filter bank energies

band number2 4 6 8 10 12

−5

0

5f) Mel frefuency cepstral coefficients

mel quefrency

Speech Preprocessing, Speech Production, Cepstrum Jan Cernocky, Valentina Hubeika, DCGM FIT BUT Brno46/46