The Journal of China Universities of Posts and Telecommunications
February 2018, 25(1): 1–9
http://jcupt.bupt.edu.cn
Noisy speech emotion recognition using sample reconstruction and multiple-kernel learning
Jiang Xiaoqing1,2, Xia Kewen1 (✉), Lin Yongliang1,3, Bai Jianchuan1
1. School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401, China
2. School of Information Science and Engineering, University of Jinan, Jinan 250022, China
3. Information Center, Tianjin Chengjian University, Tianjin 300384, China
Abstract
Speech emotion recognition (SER) in noisy environments is a vital issue in artificial intelligence (AI). In this paper, speech samples are reconstructed to remove the added noise. Acoustic features extracted from the reconstructed samples are selected to build an optimal feature subset with better emotional recognizability. A multiple-kernel (MK) support vector machine (SVM) classifier solved by semi-definite programming (SDP) is adopted in the SER procedure. The proposed method is demonstrated on the Berlin Database of Emotional Speech. Recognition accuracies of the original, noisy, and reconstructed samples classified by both single-kernel (SK) and MK classifiers are compared and analyzed. The experimental results show that the proposed method is effective and robust when noise exists.
Keywords speech emotion recognition, compressed sensing, multiple-kernel learning, feature selection
1 Introduction
Complementarity exists between human affectivity and logical thinking, so emotional information is significant for understanding the real meaning of human speech. SER is an important research field in the realization of AI [1]. Noise in the environment and in signal processing systems lowers recognition accuracy and limits practical applications of SER, such as intelligent customer service systems and adjuvant therapy systems for autism, where accurate recognition of emotions is needed to make a proper response. In this paper, noisy SER is studied using a combination of sample reconstruction based on compressed sensing (CS) theory and multiple-kernel learning (MKL).
In SER, two essential aspects that influence the performance of an emotion recognition system are an optimal feature set and an effective classifier.
The precision and inherent properties of speech features influence the emotional recognizability of the feature set. Noise has a negative impact on the extraction of acoustic features, and attempts to cope with noise in SER date from 2006 [2]. Schuller et al. selected a feature subset from a 4k feature set to recognize emotions in noisy speech samples [3]. You et al. proposed enhanced Lipschitz embedding to reduce the influence of noise [4]. Techniques such as switching linear dynamic models and two-stage Wiener filtering were also proposed to handle noisy speech for classification [5]. CS theory, proposed by Donoho et al., provides promising methods for noisy speech processing [6–7]. Sparse representation from CS theory has been used in nonparametric classifiers; Zhao et al. adopted an enhanced sparse representation classifier for robust SER [8]. Additionally, because the coefficients of noise are not sparse in any transform domain, the noise cannot be reconstructed from the measurements, so sparse signals contaminated by noise can be reconstructed with high quality [9]. In this paper, CS theory is utilized to denoise noisy speech samples through sample reconstruction. Acoustic features of the reconstructed samples are extracted and selected according to their complementary information to constitute a robust, optimal feature subset.
SVM is one of the most effective methods for pattern recognition problems. SVM is a kernel method that finds the maximum-margin hyperplane in the feature space, and its performance depends strongly on the kernel function, so it is necessary to overcome this kernel dependency when designing an effective SVM classifier. To improve the flexibility of the kernel function, MKL has been proposed and developed to derive an optimal combination of different kernels. Lanckriet et al. proposed MKL with a transduction setting for learning a kernel matrix from data; the method aims at an optimal combination of predefined base kernels that generates a good target kernel [10]. Jin et al. proposed a feature fusion method based on MKL to improve the overall SER performance on clean samples, with the weights of the kernels corresponding to global and local features found through grid search [11]. In this paper, the MK fusion strategy of Lanckriet et al. is adopted to improve the SVM model in a binary-tree-structured multi-class classifier, and the fusion coefficients of the different kernels are solved by SDP to find the optimal weights of the multiple kernels.
The rest of the paper is structured as follows: Sect. 2 reviews the basic idea of CS in speech signal processing and analyzes the performance of noisy sample reconstruction. Sect. 3 introduces MKL solved by SDP. Acoustic features and feature selection are presented in Sect. 4. The performance evaluation of SER and the experimental results are illustrated and analyzed in Sect. 5. Finally, Sect. 6 concludes the paper.
2 CS and sample reconstruction of noisy speech
CS combines sampling and compression into one step, using the minimum number of measurements carrying the maximum information. CS aims to recover a sparse signal from far fewer measurements than the Nyquist–Shannon sampling rate requires, and the reconstruction can be exact under key conditions such as sparsity and the restricted isometry property (RIP) [7,12].
Let $\mathbf{x} = [x(1), x(2), \ldots, x(N)]^{\mathrm{T}}$ be a signal in the N-dimensional space $\mathbb{R}^N$, where N is the number of samples. $\mathbf{x}$ can be represented by a linear combination of N-dimensional orthogonal basis vectors $\{\boldsymbol{\psi}_n\}$, $n = 1, 2, \ldots, N$. Thus $\mathbf{x}$ can be represented as:

$$\mathbf{x} = \sum_{n=1}^{N} a_n \boldsymbol{\psi}_n = \boldsymbol{\Psi}\boldsymbol{\alpha} \tag{1}$$
In Eq. (1), $\boldsymbol{\Psi}$ is the orthogonal basis matrix, also named the representation matrix, $a_n = \langle \mathbf{x}, \boldsymbol{\psi}_n \rangle$ is the projection coefficient, $\boldsymbol{\alpha}$ is the vector of projection coefficients, and $\boldsymbol{\alpha} = \boldsymbol{\Psi}^{\mathrm{T}}\mathbf{x}$. It can be said that $\mathbf{x}$ and $\boldsymbol{\alpha}$ are equivalent representations of the same signal, with $\mathbf{x}$ in the time domain and $\boldsymbol{\alpha}$ in the $\boldsymbol{\Psi}$ domain. When the signal $\mathbf{x}$ has only k non-zero coefficients $a_n$ and $k \ll N$, $\{\boldsymbol{\psi}_n\}$ is the sparse basis of $\mathbf{x}$, and $\mathbf{x}$ can be considered k-sparse with the sparse representation of Eq. (1).
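As an illustrative sketch of Eq. (1) (not part of the paper), the orthonormal DCT can serve as $\boldsymbol{\Psi}$: the inverse transform maps the coefficient vector $\boldsymbol{\alpha}$ to the time-domain signal $\mathbf{x}$, and the forward transform recovers $\boldsymbol{\alpha}$ exactly.

```python
import numpy as np
from scipy.fft import dct, idct

# Build a k-sparse coefficient vector alpha in the DCT (Psi) domain.
N, k = 256, 5
rng = np.random.default_rng(0)
alpha = np.zeros(N)
alpha[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)

# x = Psi alpha: the inverse orthonormal DCT maps coefficients to the time domain.
x = idct(alpha, norm="ortho")

# alpha = Psi^T x: the forward orthonormal DCT recovers the coefficients exactly,
# so x and alpha are equivalent representations of the same signal.
alpha_back = dct(x, norm="ortho")
assert np.allclose(alpha, alpha_back)
```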
In CS theory, the sensing process can be represented as:

$$\mathbf{y} = \boldsymbol{\Phi}\mathbf{x} \tag{2}$$

In Eq. (2), $\boldsymbol{\Phi}$ is the $M \times N$ measurement matrix, and $\mathbf{y} \in \mathbb{R}^M$ ($M \ll N$) is the measurement vector, whose dimension is far less than that of the signal $\mathbf{x}$.

With Eq. (1), Eq. (2) can be rewritten as:

$$\mathbf{y} = \boldsymbol{\Theta}\boldsymbol{\alpha} \tag{3}$$

where $\boldsymbol{\Theta} = \boldsymbol{\Phi}\boldsymbol{\Psi}$ is the $M \times N$ reconstruction matrix, and $\boldsymbol{\alpha}$ is the k-sparse vector representing the projection coefficients of $\mathbf{x}$ in the $\boldsymbol{\Psi}$ domain.
Reconstruction algorithms in CS try to solve Eq. (3), which is an underdetermined equation without a unique solution. When the signal is sparse and $\boldsymbol{\Theta}$ satisfies the RIP condition, a sparse approximate solution to Eq. (3) can be obtained by minimizing the $L_1$-norm. The RIP of the matrix $\boldsymbol{\Theta}$ is defined on an isometry constant $\delta_k \in (0, 1)$ for a k-sparse signal $\mathbf{x}$ and satisfies:

$$(1 - \delta_k)\|\mathbf{x}\|_2^2 \leq \|\boldsymbol{\Theta}\mathbf{x}\|_2^2 \leq (1 + \delta_k)\|\mathbf{x}\|_2^2 \tag{4}$$

It can be loosely said that a matrix obeys the RIP of order k if $\delta_k$ is not too close to one. The RIP ensures that all subsets of k columns taken from the matrix are nearly orthogonal. An equivalent condition of the RIP is incoherence between the measurement matrix $\boldsymbol{\Phi}$ and the representation matrix $\boldsymbol{\Psi}$. A variety of reconstruction methods such as greedy algorithms and convex optimization can be used to solve Eq. (3) [13–21].
When CS theory is applied to speech signal processing, the prerequisite is to achieve a sparse representation of speech signals using a proper orthogonal basis. The excitations of voiced and unvoiced speech are quasi-periodic vibrations of the vocal cords and random noise respectively, so voiced speech carries most of the energy of the sample, concentrated in the lower frequency range. One of the most important spectral characteristics of the discrete cosine transform (DCT) is its strong energy concentration in the low-frequency coefficients, which makes it suitable for analyzing the sparsity of speech signals. The mth DCT coefficient $a(m)$ of a speech frame x with N samples can be calculated by:

$$\left.\begin{aligned} a(m) &= w(m)\sum_{n=1}^{N} x(n)\cos\frac{\pi(2n-1)(m-1)}{2N}; \; 1 \leq m \leq N \\ w(m) &= \begin{cases} \sqrt{\dfrac{1}{N}}; & m = 1 \\ \sqrt{\dfrac{2}{N}}; & 2 \leq m \leq N \end{cases} \end{aligned}\right\} \tag{5}$$

where $x(n)$ denotes the nth sample of the speech frame.
Examples of a clean voiced frame and an unvoiced frame, as well as their DCT coefficients, are plotted in Fig. 1. Obviously, only a few DCT coefficients have large amplitudes while the rest are close to zero. The sparsity is more obvious in the voiced frame. Therefore the DCT coefficients of voiced speech signals can be considered approximately k-sparse.
Fig. 1 The sparsity of voiced and unvoiced frames: (a) waveform of a voiced frame; (b) DCT coefficients of a voiced frame; (c) waveform of an unvoiced frame; (d) DCT coefficients of an unvoiced frame
According to CS theory, voiced signals contaminated by noise can be reconstructed with high quality. Fig. 2 plots the denoising performance of sample reconstruction. A random Gaussian matrix is used as the measurement matrix, and the compressive sampling matching pursuit (CoSaMP) [14], orthogonal matching pursuit (OMP) [16], basis pursuit (BP) [17] and polytope faces pursuit (PFP) [21] algorithms are adopted in the reconstruction of noisy samples. The noisy voiced frame is produced by adding 20 dB Gaussian white noise to the clean frame. It is clear that the reconstructed samples are close in quality to the clean waveform: the amplitude and the period of the clean frame are preserved in the reconstructed frames, and the reconstruction waveforms of BP, OMP and PFP almost coincide. The high quality of the reconstructed voiced speech ensures the precision of feature extraction in SER. Both Fig. 1 and Fig. 2 demonstrate that sample reconstruction is feasible in noisy SER.
Fig. 2 Reconstructed samples of a noisy voiced frame
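A minimal sketch of this denoising-by-reconstruction idea on a synthetic k-sparse "frame" (illustrative only; the signal, noise level and the scikit-learn OMP solver are assumptions, not the paper's exact setup):

```python
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
N, M, k = 256, 128, 8

# Synthetic "voiced frame": k-sparse DCT coefficients mapped to the time domain.
alpha = np.zeros(N)
alpha[rng.choice(N, size=k, replace=False)] = (
    rng.uniform(1.0, 2.0, size=k) * rng.choice([-1.0, 1.0], size=k)
)
Psi = idct(np.eye(N), axis=0, norm="ortho")   # columns are the DCT basis vectors
x = Psi @ alpha

# Noisy frame and M random Gaussian measurements y = Phi x_noisy (M < N).
x_noisy = x + 0.05 * rng.standard_normal(N)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ x_noisy

# Solve y = Theta alpha (Eq. (3)) with OMP, then map back to the time domain.
Theta = Phi @ Psi
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(Theta, y)
x_rec = Psi @ omp.coef_

# The noise is not sparse in the DCT domain, so the reconstruction is
# typically closer to the clean frame than the noisy frame is.
err_noisy = np.linalg.norm(x_noisy - x)
err_rec = np.linalg.norm(x_rec - x)
```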
3 MK SVM classifier
3.1 SK SVM
The SVM, based on the theory of structural risk minimization, is a classifier proposed for the binary classification problem. Given l training patterns $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x}_i$ is the input vector of the ith pattern and $y_i$ is the class label of $\mathbf{x}_i$, in the feature space induced by a mapping function $\phi$ we can find a hyperplane with the maximum margin to classify the two classes with the discriminant function:

$$f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b \tag{6}$$

where $\mathbf{w}$ and b are the weight vector and the offset, which can be computed by solving a quadratic optimization problem:

$$\min_{\mathbf{w},b} \; \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} \quad \text{s.t.} \; y_i(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_i) + b) \geq 1; \; i = 1, 2, \ldots, l \tag{7}$$
To make the method more flexible and robust, a hyperplane can be constructed by relaxing the constraints in Eq. (7), which leads to the following soft-margin formulation with the introduction of slack variables $\xi_i$ to account for misclassifications. The objective function and constraints can be formulated as:

$$\min_{\mathbf{w},b} \; \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + C\sum_{i=1}^{l}\xi_i \quad \text{s.t.} \; y_i(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_i) + b) \geq 1 - \xi_i; \; \xi_i \geq 0, \; i = 1, 2, \ldots, l \tag{8}$$
where l is the number of training patterns, C is a parameter that gives a tradeoff between maximum margin and classification error, and $\phi$ is a mapping from the input space to the feature space. Eq. (8) can be solved by introducing Lagrange multipliers:

$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + C\sum_{i=1}^{l}\xi_i - \sum_{i=1}^{l}\beta_i\xi_i - \sum_{i=1}^{l}\alpha_i[y_i(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_i) + b) - 1 + \xi_i] \tag{9}$$
where $\alpha_i \geq 0$ and $\beta_i \geq 0$, $i = 1, 2, \ldots, l$ are the Lagrange multipliers. By setting the partial derivatives of L to zero and substituting the results into Eq. (9), $\mathbf{w}$, b, $\boldsymbol{\xi}$, $\boldsymbol{\beta}$ can be eliminated and Eq. (9) can be transformed into the following Wolfe dual form:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \; \sum_{i=1}^{l}\alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C; \; i = 1, 2, \ldots, l \tag{10}$$
where $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$ is a kernel function. Eq. (10) can be rewritten in matrix form as:

$$\max_{\boldsymbol{\alpha}} \; \mathbf{e}^{\mathrm{T}}\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^{\mathrm{T}}G(\mathbf{K})\boldsymbol{\alpha} \quad \text{s.t.} \; \boldsymbol{\alpha}^{\mathrm{T}}\mathbf{y} = 0, \; \mathbf{0} \leq \boldsymbol{\alpha} \leq C\mathbf{e} \tag{11}$$

where $\mathbf{e} = [1, 1, \ldots, 1]^{\mathrm{T}}$ and $G(\mathbf{K}) = \mathrm{diag}(\mathbf{y})\mathbf{K}\,\mathrm{diag}(\mathbf{y})$. $\mathrm{diag}(\mathbf{y})$ is the diagonal matrix with diagonal $\mathbf{y}$, and $\mathbf{K} \in \mathbb{R}^{l \times l}$ is the kernel matrix with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $i = 1, 2, \ldots, l$, $j = 1, 2, \ldots, l$.
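The matrix quantities in Eq. (11) are easy to form directly; a small illustrative sketch with toy data (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))          # 6 patterns, 4 features (toy data)
y = np.array([1, 1, 1, -1, -1, -1])      # binary class labels

gamma = 0.1
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-gamma * sq)                               # RBF kernel matrix K_ij
G = np.diag(y) @ K @ np.diag(y)                       # G(K) = diag(y) K diag(y)

# G is symmetric positive semi-definite, like K (a congruence preserves PSD-ness).
assert np.allclose(G, G.T)
assert np.linalg.eigvalsh(G).min() > -1e-8
```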
3.2 MK SVM
The performance of an SK method depends heavily on the choice of the kernel. Kernel fusion has been proposed to deal with this problem by learning a kernel machine with MKs [10,22]. One of the effective kernel fusion strategies is a weighted combination of multiple kernels. The combined kernel function is $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \boldsymbol{\Phi}(\mathbf{x}_i), \boldsymbol{\Phi}(\mathbf{x}_j) \rangle$, where $\boldsymbol{\Phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_M(\mathbf{x})]^{\mathrm{T}}$ and M is the number of kernel functions to be combined. The corresponding kernel matrix can be written as:

$$\mathbf{K} = \sum_{s=1}^{M}\mu_s\mathbf{K}_s \tag{12}$$

where $\mathbf{K}_s$, $1 \leq s \leq M$, is the kernel matrix constructed from $\phi_s$, and $\mu_s$ ($\sum_{s=1}^{M}\mu_s = 1$, $1 \leq s \leq M$) is the corresponding weight.
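Eq. (12) amounts to a weighted sum of base kernel matrices. A small sketch with three RBF kernels (the γ values match those used later in Sect. 5; the weights here are made-up placeholders, since in the paper they are solved by SDP):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 5))                       # toy patterns
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances

gammas = [0.01, 0.1, 1.0]                # base RBF kernels (Sect. 5 values)
mus = [0.2, 0.3, 0.5]                    # hypothetical weights; must sum to 1
Ks = [np.exp(-g * sq) for g in gammas]   # base kernel matrices K_s
K = sum(mu * Ks_ for mu, Ks_ in zip(mus, Ks))   # K = sum_s mu_s K_s

# A nonnegative combination of PSD matrices stays PSD, so K is a valid kernel.
assert abs(sum(mus) - 1.0) < 1e-12
assert np.linalg.eigvalsh(K).min() > -1e-8
```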
Lanckriet et al. proposed an MKL method with a transduction setting to obtain $\mu_s$. According to Eq. (11), training the SVM for a given kernel yields the optimal value of $\omega(\mathbf{K}) = \max_{\boldsymbol{\alpha}} \; \mathbf{e}^{\mathrm{T}}\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^{\mathrm{T}}G(\mathbf{K})\boldsymbol{\alpha}$, which is obviously a function of the particular choice of the kernel matrix. So finding the kernel matrix can be considered an optimization problem: find $\mathbf{K}$ in some convex subset $\kappa$ of positive semi-definite matrices while keeping the trace of $\mathbf{K}$ constant:

$$\min_{\mathbf{K} \in \kappa} \; \omega(\mathbf{K}) \quad \text{s.t.} \; \mathrm{tr}(\mathbf{K}) = c \tag{13}$$
The kernel matrix $\mathbf{K}$ in Eq. (13) can be found by solving the following convex optimization problem:

$$\begin{aligned} \min_{\mathbf{K},t,\lambda,\boldsymbol{\nu},\boldsymbol{\delta}} \; & t \\ \text{s.t.} \; & \mathbf{K} \in \kappa, \; \mathbf{K} \succeq 0, \; \mathrm{tr}(\mathbf{K}) = c \\ & \begin{bmatrix} G(\mathbf{K}) & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y} \\ (\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^{\mathrm{T}} & t - 2C\boldsymbol{\delta}^{\mathrm{T}}\mathbf{e} \end{bmatrix} \succeq 0 \\ & \boldsymbol{\nu} \geq \mathbf{0}, \; \boldsymbol{\delta} \geq \mathbf{0} \end{aligned} \tag{14}$$
In Eq. (14), $t \in \mathbb{R}$, $\lambda \in \mathbb{R}$, $\boldsymbol{\nu} \in \mathbb{R}^l$, and $\boldsymbol{\delta} \in \mathbb{R}^l$. $\mathbf{K} \succeq 0$ means that $\mathbf{K}$ is a positive semi-definite matrix, and the above optimization problem is an SDP. Notice that $\boldsymbol{\nu} \geq \mathbf{0}$ means $\mathrm{diag}(\boldsymbol{\nu}) \succeq 0$ and is thus a linear matrix inequality (LMI); similarly for $\boldsymbol{\delta} \geq \mathbf{0}$. The detailed proof of the above equations can be found in Ref. [10]. In MKL, the combined kernel matrix $\mathbf{K} = \sum_{s=1}^{M}\mu_s\mathbf{K}_s \in \mathbb{R}^{l \times l}$ is a linear combination of fixed kernel matrices, where l is the total number of training and testing patterns. By adding this additional constraint, Eq. (14) can be represented as Eq. (15), from which $\mu_s$ can be solved.
$$\begin{aligned} \min_{\boldsymbol{\mu},t,\lambda,\boldsymbol{\nu},\boldsymbol{\delta}} \; & t \\ \text{s.t.} \; & \sum_{s=1}^{M}\mu_s\mathbf{K}_s \succeq 0, \; \mathrm{tr}\left(\sum_{s=1}^{M}\mu_s\mathbf{K}_s\right) = c \\ & \begin{bmatrix} G\left(\sum_{s=1}^{M}\mu_s\mathbf{K}_s\right) & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y} \\ (\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^{\mathrm{T}} & t - 2C\boldsymbol{\delta}^{\mathrm{T}}\mathbf{e} \end{bmatrix} \succeq 0 \\ & \boldsymbol{\nu} \geq \mathbf{0}, \; \boldsymbol{\delta} \geq \mathbf{0} \end{aligned} \tag{15}$$
A binary tree structure illustrated in Fig. 3 is adopted in the structure design of the multi-class classifier to recognize five emotions: happy, angry, fear, neutral and sad.
This structure is different from the traditional one-to-one, one-to-rest, or hierarchical SVM structures [23–25]. In the binary tree structure, the first classifying node (Model1) is improved by MK SVM to recognize the most confusable emotion, while the deeper classifying nodes (Model2~Model4) retain SK SVM. Taking the Berlin Database of Emotional Speech [26] as an example, happy is the main factor influencing the overall performance of the classifier because of its lowest recognition accuracy [24,27–28]. Thus, Model1 can be used to recognize happy among the other emotions when the Berlin Database of Emotional Speech is studied. This arrangement reduces both the error accumulation caused by the most confusable emotion and the total computational complexity of solving the SDP in every model.
Fig. 3 Multi-class and MK classifier with binary tree structure
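The decision flow of Fig. 3 can be sketched as follows (illustrative only: random toy data, and a plain RBF SVC standing in both for the MK SVM of Model1 and for the chain Model2~Model4):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
emotions = ["happy", "angry", "fear", "neutral", "sad"]
X = rng.standard_normal((100, 10))       # toy feature vectors
y = rng.choice(emotions, size=100)       # toy emotion labels

# Model1: separate the most confusable emotion ("happy") from the rest.
model1 = SVC(kernel="rbf").fit(X, y == "happy")

# One multiclass SVC stands in here for the deeper binary nodes Model2~Model4.
rest = y != "happy"
model_rest = SVC(kernel="rbf").fit(X[rest], y[rest])

def predict(x):
    """Route a sample through the tree: Model1 first, then the deeper nodes."""
    x = x.reshape(1, -1)
    if model1.predict(x)[0]:
        return "happy"
    return model_rest.predict(x)[0]

pred = predict(X[0])
assert pred in emotions
```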
4 Acoustic features and feature selection
Speech features usually used in SER are prosodic features, voice quality features and spectral features. Pitch, energy, duration, formants, Mel-frequency cepstral coefficients (MFCC) and their statistical parameters are extracted in this paper. The total dimension of the feature vector is 45. Table 1 lists the acoustic features adopted in the following experiments.
Table 1 Acoustic features

| Type | Feature | Statistical parameters |
| --- | --- | --- |
| Prosodic features | Pitch | Maximum, minimum, range, mean, standard deviation, first quartile, median, third quartile, inter-quartile range |
| Prosodic features | Energy | Maximum, minimum, range, mean, standard deviation, first quartile, median, third quartile, inter-quartile range |
| Prosodic features | Duration | Total frames, voiced frames, unvoiced frames, ratio of voiced to unvoiced frames, ratio of voiced to total frames, ratio of unvoiced to total frames |
| Voice quality features | Formant | F1, F2, F3: mean, standard deviation, median |
| Spectral features | MFCC | 12-order MFCC |
Feature selection is necessary to build an optimal feature subset with good emotional recognizability. Double input symmetrical relevance (DISR) is an information-theoretic selection criterion that relies on symmetric relevance to take into account the complementarity between two input features. The main advantage of the DISR criterion is that a selected variable is, with high probability, complementary with the other variables in the subset [29].
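DISR itself is not available in common toolkits; as a hedged stand-in, the sketch below ranks 45 synthetic features by mutual information with the class label and keeps a subset (DISR additionally scores pairwise complementarity, which this simple filter ignores):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a 45-dimensional acoustic feature matrix.
X, y = make_classification(n_samples=200, n_features=45, n_informative=8,
                           random_state=0)

# Keep the 24 features most informative about the label (simple MI filter).
selector = SelectKBest(mutual_info_classif, k=24).fit(X, y)
X_sub = selector.transform(X)
assert X_sub.shape == (200, 24)
```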
5 Experimental results and analysis
In this paper the Berlin Database of Emotional Speech is selected to test the proposed method from the aspect of speaker-independent SER. The Berlin Database of Emotional Speech, also known as EMO-DB, is one of the most common databases in SER research. Ten actors (5 female and 5 male) simulated the emotions, producing 10 German utterances that could be used in everyday communication and are interpretable in all applied emotions. The complete database was evaluated in a perception test regarding the recognizability of emotions and their naturalness. 409 utterances of five emotions in this database are studied, including 71 happy, 127 angry, 69 fear, 63 sad and 79 neutral samples. The 207 training samples comprise 36 happy, 64 angry, 35 fear, 32 sad and 40 neutral samples, and the remaining 202 are testing samples.
The LibSVM toolbox is adopted in the experiments. In the MKL stage, a three-kernel SVM is employed, i.e. M = 3 in Eq. (12). The three basis kernel functions are radial basis functions (RBF) with parameters $\gamma_1 = 0.01$, $\gamma_2 = 0.1$, and $\gamma_3 = 1$ respectively. The YALMIP toolbox is used to solve the SDP problem [30]. In all SK models, the default value of $\gamma$ for the RBF is $\gamma = 1/k$, where k is the number of features. Three types of samples (original, noisy, and reconstructed) and two classifiers (SK and MK) are used in the experiments. We consider the clean emotional speech samples in the Berlin Database of Emotional Speech as original samples. Noisy samples are produced by adding Gaussian white noise with different signal-to-noise ratios (SNR) to the original samples, and the reconstructed samples are the reconstruction outputs calculated from measurements of the noisy samples. The SK SVM classifier means the four models shown in Fig. 3 are all SK SVM models, while the classifier improved by MKL in Model1 is called the MK SVM classifier.
5.1 Evaluation of recognition performance
Besides recognition accuracies, root mean square error (RMSE) and maximum error (MAXE) are used to evaluate the performance of both SK and MK SVM classifiers. RMSE and MAXE are calculated by:
$$P_{\mathrm{RMSE}} = \sqrt{\frac{1}{5}\sum_{h=1}^{5}e_h^2}, \quad Q_{\mathrm{MAXE}} = \max_h\{e_h\} \tag{16}$$

where $e_h$ is the recognition error of the hth emotion. Obviously, smaller values of RMSE and MAXE mean better performance of the classifier.
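A direct transcription of Eq. (16) with hypothetical per-emotion error values:

```python
import numpy as np

# Hypothetical recognition errors e_h for the five emotions (fractions).
e = np.array([0.51, 0.19, 0.21, 0.28, 0.00])

rmse = np.sqrt(np.mean(e ** 2))   # sqrt((1/5) * sum_h e_h^2)
maxe = np.max(e)                  # max_h e_h

# MAXE is always at least RMSE (equality only when all errors are equal).
assert rmse <= maxe
```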
5.2 The negative impacts of noise
Original samples, noisy samples (with SNRs of 20 dB, 15 dB, and 10 dB) and the samples reconstructed by the BP method are tested in the SK SVM classifier. In this section all 45 features listed in Table 1 are utilized. The experimental results given in Table 2 demonstrate the negative impact of noise and the performance of the reconstructed samples. The total SER accuracy of the 10 dB noisy samples is the worst, and the confusion between happy and the other emotions becomes serious as the SNR decreases. The results also show that sample reconstruction helps reduce the influence of noise.
Table 2 The emotion recognition accuracies of original, noisy, and reconstructed speech with 45 features

| Speech sample | Total/% | Anger/% | Fear/% | Happy/% | Neutral/% | Sad/% |
| --- | --- | --- | --- | --- | --- | --- |
| Original | 74.257 | 76.190 0 | 76.471 | 48.571 | 71.795 | 100.000 |
| Noisy (20 dB) | 72.772 | 95.238 0 | 58.824 | 20.000 | 76.923 | 96.774 |
| Noisy (15 dB) | 70.921 | 93.508 0 | 59.120 | 20.000 | 71.490 | 100.000 |
| Noisy (10 dB) | 67.218 | 85.143 0 | 76.706 | 14.857 | 53.462 | 100.000 |
| Reconstructed (20 dB) | 80.693 | 88.889 0 | 76.471 | 42.857 | 89.744 | 100.000 |
| Reconstructed (15 dB) | 74.257 | 92.063 5 | 47.059 | 28.571 | 92.308 | 96.774 |
| Reconstructed (10 dB) | 72.772 | 93.651 0 | 61.765 | 22.857 | 71.795 | 100.000 |
When the SNR is higher, such as 20 dB, the total performance of the reconstructed samples even surpasses that of the clean samples. This is mainly because most of the acoustic features are calculated from voiced speech. Since the reconstruction quality of voiced speech is better than that of unvoiced speech, the differentiation between voiced and unvoiced speech becomes much clearer after reconstruction, which leads to more precise feature extraction from the reconstructed speech.
5.3 Noisy speech emotion recognition
In this section the tested speech samples are the original samples, noisy samples with an SNR of 20 dB, and the corresponding reconstructed samples. Both the SK and MK classifiers are adopted. The combination weight coefficients of the kernels $\mu_s$ ($s = 1, 2, 3$) solved by SDP for the three types of speech are given in Table 3.
Table 3 Combination weight coefficients of MK

| Speech sample | $\mu_1$ | $\mu_2$ | $\mu_3$ |
| --- | --- | --- | --- |
| Original | 0.006 4 | 0.017 0 | 0.976 6 |
| Noisy | 0.003 4 | 0.005 5 | 0.991 2 |
| Reconstructed | 0.007 1 | 0.013 2 | 0.979 7 |
The experimental results of SER are plotted in Figs. 4 and 5. The highest recognition accuracies are 76.238% (original, SK), 73.267% (noisy, SK), 82.178% (reconstructed, SK), 88.614% (original, MK), 86.139% (noisy, MK) and 95.545% (reconstructed, MK) respectively. The recognition accuracy of happy, which is the most confusable emotion in the Berlin Database of Emotional Speech, is shown in Fig. 4(b). Fig. 5 plots the curves of RMSE and MAXE corresponding to Fig. 4(a).
Fig. 4 SER accuracies for different selected feature numbers: (a) total accuracy; (b) accuracy of happy
Table 4 gives the detailed recognition results of the five emotions in the Berlin Database of Emotional Speech at the highest total accuracies. The corresponding confusion matrices are listed in Table 5. The experimental results show the effectiveness of the combination of sample reconstruction and MKL in noisy SER.
Fig. 5 RMSE and MAXE values for different selected feature numbers: (a) RMSE; (b) MAXE
Table 4 Recognition details of the highest total accuracies

| Type | Feature number | Total/% | Angry/% | Fear/% | Happy/% | Neutral/% | Sad/% | RMSE | MAXE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original, SK | 44 | 76.238 | 80.952 | 79.412 | 48.571 | 71.795 | 100.000 | 0.462 80 | 0.641 03 |
| Noisy, SK | 43 | 73.267 | 95.238 | 61.765 | 17.143 | 76.923 | 100.000 | 0.421 48 | 0.828 57 |
| Reconstructed, SK | 34 | 82.178 | 90.476 | 82.353 | 42.857 | 89.744 | 100.000 | 0.274 69 | 0.571 43 |
| Original, MK | 39 | 88.614 | 93.651 | 79.412 | 97.143 | 74.359 | 96.774 | 0.151 01 | 0.256 41 |
| Noisy, MK | 24 | 86.139 | 92.063 | 61.765 | 100.000 | 74.359 | 100.000 | 0.288 92 | 0.382 35 |
| Reconstructed, MK | 24 | 95.545 | 98.413 | 85.294 | 100.000 | 92.308 | 100.000 | 0.074 56 | 0.147 06 |
Table 5 Confusion matrices of the highest total accuracies (rows: actual emotion; columns: recognized emotion, in %)

| Sample | Emotion | SK: Angry | SK: Fear | SK: Happy | SK: Neutral | SK: Sad | MK: Angry | MK: Fear | MK: Happy | MK: Neutral | MK: Sad |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | Angry | 80.952 | 3.175 | 15.873 | 0 | 0 | 93.651 | 4.762 | 1.587 | 0 | 0 |
| Original | Fear | 11.765 | 79.412 | 5.882 | 2.941 | 0 | 20.588 | 79.412 | 0 | 0 | 0 |
| Original | Happy | 42.857 | 8.571 | 48.571 | 0 | 0 | 2.857 | 0 | 97.143 | 0 | 0 |
| Original | Neutral | 0 | 28.205 | 0 | 71.795 | 0 | 0 | 25.641 | 0 | 74.359 | 0 |
| Original | Sad | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 | 0 | 3.226 | 96.774 |
| Noisy | Angry | 95.238 | 0 | 1.587 | 3.175 | 0 | 92.063 | 3.175 | 1.587 | 3.175 | 0 |
| Noisy | Fear | 23.529 | 61.765 | 5.882 | 8.824 | 0 | 11.765 | 61.765 | 5.882 | 17.647 | 2.941 |
| Noisy | Happy | 77.143 | 5.714 | 17.143 | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 |
| Noisy | Neutral | 2.564 | 17.949 | 0 | 76.923 | 2.564 | 0 | 0 | 17.949 | 74.359 | 7.692 |
| Noisy | Sad | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 | 0 | 0 | 100.000 |
| Reconstructed | Angry | 90.476 | 0 | 9.524 | 0 | 0 | 98.413 | 0 | 1.587 | 0 | 0 |
| Reconstructed | Fear | 2.941 | 82.353 | 14.706 | 0 | 0 | 8.824 | 85.294 | 0 | 5.882 | 0 |
| Reconstructed | Happy | 31.429 | 25.714 | 42.857 | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 |
| Reconstructed | Neutral | 0 | 10.256 | 0 | 89.744 | 0 | 0 | 0 | 0 | 92.308 | 7.692 |
| Reconstructed | Sad | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 | 0 | 0 | 100.000 |
In the experimental results of ‘original, SK’ shown in Table 4 and Table 5, the main factor influencing the total accuracy is the recognition performance of happy. The confusion between happy and angry is serious: 42.857% of happy samples are falsely recognized as angry. The reason for this phenomenon is the similar manner of utterance in the happy and angry states, which makes the numerical values of acoustic features such as energy and pitch close in these emotions. The negative effect of this similarity of feature values is enlarged when noise exists, so in the case of ‘noisy, SK’ the confusion between happy and angry rises to 77.143%. However, in the data of ‘reconstructed, SK’, the confusion caused by noise is eliminated by the reconstruction and the total performance is even better than on the original samples, which conforms to the analysis of Table 2. The comparison between ‘reconstructed, SK’ and ‘reconstructed, MK’ shows that the confusion between happy and angry that cannot be handled by SK SVM is solved by MKL. In the recognition of the other emotions, ‘reconstructed, MK’ also has the best accuracies, and its total accuracy is the highest. Furthermore, the effectiveness of the different kernels can be compared. In ‘noisy, SK’, 27 testing happy samples are falsely recognized as angry by the RBF with $\gamma = 1/43 \approx 0.023\,3$. If $\gamma_1 = 0.01$, $\gamma_2 = 0.1$ and $\gamma_3 = 1$ are adopted respectively as the gamma parameters of the RBF in SK SVM to recognize these 27 falsely recognized happy samples from 63 angry samples, the corresponding accuracies increase gradually, demonstrating that the RBF with $\gamma_3 = 1$ is helpful in reducing the confusion between happy and angry. It is therefore reasonable that the weight coefficients $\mu_1$ and $\mu_2$ are smaller in Table 3.
The experimental results also show that feature selection can improve the recognition accuracies. For example, only 24 features of the reconstructed samples are needed to achieve the best accuracy with the MK SVM classifier. Thus feature selection is useful for exploring the best performance of classifiers.
Although MKL can improve the recognition performance, this strategy increases the time complexity of the whole system: the SDP solving and MK computation are time-consuming, and neither is needed in SK SVM. For the classifier in Fig. 3, the SDP solving time of Model1 in Matlab is 104.543 8 s and the total training and testing time is 7.758 3 s. If all four SVM models are MK and use the same 24 features as ‘reconstructed, MK’, there are four groups of $\mu_s$ to be solved and the total SDP solving time is 178.816 0 s. In this case, 10.667 8 s is needed to train the four MK models and test the 202 samples; in the SK situation, the training and testing time is only 5.015 6 s. If all models in the classifier are MK, the total accuracy is 96.039 6%. Compared to the 95.545% in Table 4, there is only a 0.495% improvement at the cost of far more computation time, so utilizing MKL in all nodes of the classifier is inefficient. It also means that the Berlin Database of Emotional Speech contains samples that cannot be recognized by the three kernels adopted in this paper.
The performance can be compared with previously published SER works on the Berlin Database of Emotional Speech. We compare the experimental data in this paper with the results of the enhanced sparse representation classifier (Enhanced-SRC) [8], feature fusion based on MKL [11], cross-correlation SVM (CrossCorr-SVM) [24], a Bayesian classifier with a Gaussian class-conditional likelihood [28], and a neural network [31].
Table 6 gives the comparison of the above methods on the original samples in the Berlin Database of Emotional Speech, where ‘sample reconstruction and MKL’ is the method proposed in this paper. In Table 6 the symbol ‘N’ denotes that there are no related experimental results or corresponding emotions in the references. The methods in the references mainly focus on original emotional samples, except for Enhanced-SRC. Though Enhanced-SRC dealt with both original and noisy samples in the Berlin Database of Emotional Speech, the confusion between happy and angry was not solved completely. In Table 6, the classifier proposed in this paper has the best recognition accuracy on the happy state. In the case of 20 dB noisy SER, the highest total recognition accuracy is 95.545%, which is higher than the 80.400% of Enhanced-SRC and the performance of the other traditional recognition methods listed in Ref. [8].
Table 6 Comparison of recognition accuracies of five emotions on original samples in the Berlin Database of Emotional Speech

| Method | Anger/% | Fear/% | Happy/% | Neutral/% | Sad/% |
| --- | --- | --- | --- | --- | --- |
| Sample reconstruction and MKL | 93.651 | 79.412 | 97.143 | 74.359 | 96.774 |
| Enhanced-SRC | 98.550 | 83.160 | 57.730 | 70.080 | 96.710 |
| CrossCorr-SVM | 95.040 | N | 66.670 | 85.070 | 84.340 |
| Bayesian classifier | 86.100 | N | 52.700 | 52.900 | 87.600 |
| Neural network | 84.200 | 63.200 | N | 78.800 | N |
| Feature fusion based on MKL | 81.000 | 83.000 | N | 65.000 | 95.000 |
6 Conclusions and future work
In this paper, sample reconstruction and MKL are combined into noisy SER. Through the comparison and analysis of experimental results on Berlin Database of Emotional Speech, the following conclusions can be drawn:
1) The acoustic features extracted from reconstructed speech samples based on CS theory are robust to noise and even have better emotional recognizability than that of clean speech samples.
2) MKL is effective in the reduction of confusion that cannot be handled by SK classifier and improves the SER performance greatly. However, because of the time complexity added by MKL, the structure design of multi-class classifier must be fully considered to utilize MKL efficiently.
3) Feature selection improves the accuracies of SER and plays an important role in the deep analysis about the performance of recognition methods and classifiers.
There are several challenging aspects for future research. The selection of base kernels suitable for SER and more effective learning strategies for obtaining the MK matrix are important topics for further study. Faster implementations of sample reconstruction and SDP solving need more research to reduce the time complexity of the proposed method. In addition, the automatic acquisition of optimal features is also essential to improving system performance.
Acknowledgements
This work is supported by the DoCoMo Beijing Labs Co. Ltd., and Program for New Century Excellent Talents in Beijing University of Posts and Telecommunications (04-0112).
References
1. Tao J, Tan T. Affective computing: a review. Proceedings of the 1st International Conference on Affective Computing and Intelligent Interaction, Oct 22–24, 2005, Beijing, China. LNCS 3784. Berlin, Germany: Springer, 2005: 981–995
2. Schuller B, Batliner A, Steidl S, et al. Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Communication, 2011, 53(9/10): 1062–1087
3. Schuller B, Arsic D, Wallhoff F, et al. Emotion recognition in the noise applying large acoustic feature sets. Proceedings of the 3rd International Conference on Speech Prosody, May 2–5, 2006, Dresden, Germany. 2006: IP128
4. You M Y, Chen C, Bu J J, et al. Emotion recognition from noisy speech. Proceedings of the 2006 IEEE International Conference on Multimedia and Expo (ICME’06), Jul 9–12, 2006, Toronto, Canada. Piscataway, NJ, USA: IEEE, 2006: 1653–1656
5. Schuller B, Wöllmer M, Moosmayr T, et al. Recognition of noisy speech: a comparative survey of robust model architecture and feature enhancement. EURASIP Journal on Audio, Speech, and Music Processing, 2009: 942617/1–17
6. Donoho D L. Compressed sensing. IEEE Transactions on Information Theory, 2006, 52(4): 1289–1306
7. Candès E J. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 2008, 346(9/10): 589–592
8. Zhao X M, Zhang S Q, Lei B C. Robust emotion recognition in noisy speech via sparse representation. Neural Computing and Applications, 2014, 24(7): 1539–1553
9. Haupt J, Nowak R. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 2006, 52(9): 4036–4048
10. Lanckriet G R G, Cristianini N, Bartlett P, et al. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 2004, 5(1): 27–72
11. Jin Y, Song P, Zheng W M, et al. Novel feature fusion method for speech emotion recognition based on multiple kernel learning. Journal of Southeast University, 2013, 29(2): 129–133
12. Baraniuk R G. Compressive sensing. IEEE Signal Processing Magazine, 2007, 24(4): 118–120
13. Needell D, Vershynin R. Signal recovery from inaccurate and incomplete measurements via regularized orthogonal matching pursuit. IEEE Journal of Selected Topics in Signal Processing, 2010, 4(2): 310–316
14. Needell D, Tropp J A. CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 2008, 26(3): 301–321
15. Dai W, Milenkovic O. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 2009, 55(5): 2230–2249
16. Tropp J A, Gilbert A C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 2007, 53(12): 4655–4666
17. Saligrama V, Zhao M Q. Thresholded basis pursuit: LP algorithm for order-wise optimal support recovery for sparse and approximately sparse signals from noisy random measurements. IEEE Transactions on Information Theory, 2011, 57(3): 1567–1586
18. Chen S S, Donoho D L, Saunders M A. Atomic decomposition by basis pursuit. SIAM Review, 2001, 43(1): 129–159
19. Figueiredo M A, Nowak R D, Wright S J. Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 2007, 1(4): 586–597
20. Blumensath T, Davies M. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 2009, 27(3): 265–274
21. Plumbley M D. Recovery of sparse representations by polytope faces pursuit. Proceedings of the 2006 International Conference on Independent Component Analysis and Blind Source Separation, Mar 5–8, 2006, Charleston, SC, USA. LNCS 3889. Berlin, Germany: Springer, 2006: 206–213
22. Yeh C Y, Su W P, Lee S J. An efficient multiple-kernel learning for pattern classification. Expert Systems with Applications, 2013, 40(9): 3491–3499
23. Chen L J, Mao X, Xue Y L, et al. Speech emotion recognition: features and classification models. Digital Signal Processing, 2012, 22(6): 1154–1160
24. Chandaka S, Chatterjee A, Munshi S. Support vector machines employing cross-correlation for emotional speech recognition. Measurement, 2009, 42(4): 611–618
25. Lee C C, Mower E, Busso C, et al. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 2011, 53(9/10): 1162–1171
26. Burkhardt F, Paeschke A, Rolfes M, et al. A database of German emotional speech. Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH’05), Sept 4–8, 2005, Lisbon, Portugal. 2005: 1517–1520
27. Jiang X Q, Xia K W, Xia X Y, et al. Speech emotion recognition using semi-definite programming multiple-kernel SVM. Journal of Beijing University of Posts and Telecommunications, 2015, 38(S1): 67–71 (in Chinese)
28. Yang B, Lugger M. Emotion recognition from speech signals using new harmony features. Signal Processing, 2010, 90(5): 1415–1423
29. Meyer P E, Schretter C, Bontempi G. Information-theoretic feature selection in microarray data using variable complementarity. IEEE Journal of Selected Topics in Signal Processing, 2008, 2(3): 261–274
30. Löfberg J. YALMIP: a toolbox for modeling and optimization in MATLAB. Proceedings of the 2004 International Symposium on Computer Aided Control Systems Design, Sept 2–4, 2004, Taipei, China. Piscataway, NJ, USA: IEEE, 2004: 284–289
31. Henríquez P, Alonso J B, Ferrer M A, et al. Nonlinear dynamics characterization of emotional speech. Neurocomputing, 2014, 132: 126–135
Received date: 29-09-2016
Corresponding author: Xia Kewen, E-mail: [email protected]
DOI: 10.1016/S1005-8885(17)#####