The Journal of China Universities of Posts and Telecommunications
February 2018, 25(1): 1–9
http://jcupt.bupt.edu.cn
Noisy speech emotion recognition using sample reconstruction and multiple-kernel learning
Jiang Xiaoqing1,2, Xia Kewen1 (✉), Lin Yongliang1,3, Bai Jianchuan1
1. School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401, China
2. School of Information Science and Engineering, University of Jinan, Jinan 250022, China
3. Information Center, Tianjin Chengjian University, Tianjin 300384, China
Abstract
Speech emotion recognition (SER) in noisy environments is a vital issue in artificial intelligence (AI). In this paper, speech samples are reconstructed to remove the added noise. Acoustic features extracted from the reconstructed samples are selected to build an optimal feature subset with better emotional recognizability. A multiple-kernel (MK) support vector machine (SVM) classifier solved by semi-definite programming (SDP) is adopted in the SER procedure. The proposed method is demonstrated on the Berlin Database of Emotional Speech. Recognition accuracies of the original, noisy, and reconstructed samples classified by both single-kernel (SK) and MK classifiers are compared and analyzed. The experimental results show that the proposed method is effective and robust when noise exists.
Keywords speech emotion recognition, compressed sensing, multiple-kernel learning, feature selection
1 Introduction
Complementarity exists between human affectivity and logical thinking, so emotional information is significant for understanding the real meaning of human speech. SER is an important research field in the realization of AI [1]. Noise in the environment and in signal processing systems lowers recognition accuracy and limits practical applications of SER, such as intelligent customer service systems and adjuvant therapy systems for autism, where accurate recognition of emotions is needed to make a proper response. In this paper, noisy SER is studied using a combination of sample reconstruction based on compressed sensing (CS) theory and multiple-kernel learning (MKL).
In SER, two essential aspects that influence the performance of an emotion recognition system are an optimal feature set and an effective classifier.
The precision and inherent properties of speech features influence the emotional recognizability of the feature set. Noise has a negative impact on the extraction of acoustic features, and attempts to cope with noise in SER date from 2006 [2]. Schuller et al. selected a feature subset from a 4k feature set to recognize emotions in noisy speech samples [3]. You et al. proposed enhanced Lipschitz embedding to reduce the influence of noise [4]. Techniques such as switching linear dynamic models and two-stage Wiener filtering were also proposed to handle noisy speech for classification [5]. CS theory, proposed by Donoho et al., provides promising methods for noisy speech processing [6–7]. Sparse representation from CS theory has been used in nonparametric classifiers; Zhao et al. adopted an enhanced sparse representation classifier for robust SER [8]. Additionally, because the coefficients of noise are not sparse in any transform domain, the noise cannot be reconstructed from the measurements, so sparse signals contaminated by noise can be reconstructed with high quality [9]. In this paper, CS theory is utilized to denoise noisy speech samples through sample reconstruction. Acoustic features of the reconstructed samples are extracted and selected according to their complementary information to constitute a robust, optimal feature subset.
SVM is one of the most effective methods for pattern recognition problems. SVM is a kernel method that finds the maximum-margin hyperplane in the feature space, and its performance depends strongly on the kernel function, so it is necessary to overcome this kernel dependency when designing an effective SVM classifier. To improve the flexibility of the kernel function, MKL has been proposed and developed to derive an optimal combination of different kernels. Lanckriet et al. proposed MKL with a transduction setting for learning a kernel matrix from data; the method aims at an optimal combination of predefined base kernels that generates a good target kernel [10]. Jin et al. proposed a feature fusion method based on MKL to improve the overall SER performance on clean samples, with the weights of the kernels corresponding to global and local features found through grid search [11]. In this paper, the MK fusion strategy of Lanckriet et al. is adopted to improve the SVM model in a binary-tree-structured multi-class classifier, and the fusion coefficients of the different kernels are solved by SDP to find the optimal weights of the multiple kernels.
The rest of the paper is structured as follows: Sect. 2 reviews the basic idea of CS in speech signal processing and analyzes the performance of noisy sample reconstruction. Sect. 3 introduces MKL solved by SDP. Acoustic features and feature selection are presented in Sect. 4. The performance evaluation of SER and the experimental results are illustrated and analyzed in Sect. 5. Finally, Sect. 6 concludes the paper.
2 CS and sample reconstruction of noisy speech
CS combines sampling and compression into one step, using the minimum number of measurements carrying the maximum information. CS aims to recover a sparse signal from far fewer measurements than the Nyquist–Shannon sampling rate requires, and the reconstruction can be exact under key conditions such as sparsity and the restricted isometry property (RIP) [7,12].
Let $\mathbf{x} = [x(1), x(2), \ldots, x(N)]^{\mathrm{T}}$ be a signal in the N-dimensional space $\mathbb{R}^N$, where N is the number of samples. $\mathbf{x}$ can be represented by a linear combination of N-dimensional orthogonal basis vectors $\{\boldsymbol{\psi}_n\}$, $n = 1, 2, \ldots, N$. Thus $\mathbf{x}$ can be represented as:

$$\mathbf{x} = \sum_{n=1}^{N} a_n \boldsymbol{\psi}_n = \boldsymbol{\Psi}\boldsymbol{\alpha} \tag{1}$$
In Eq. (1), $\boldsymbol{\Psi}$ is the orthogonal basis matrix, also named the representation matrix, $a_n = \langle \mathbf{x}, \boldsymbol{\psi}_n \rangle$ is the projection coefficient, $\boldsymbol{\alpha}$ is the vector of projection coefficients, and $\boldsymbol{\alpha} = \boldsymbol{\Psi}^{\mathrm{T}}\mathbf{x}$. It can be said that $\mathbf{x}$ and $\boldsymbol{\alpha}$ are equivalent representations of the same signal, with $\mathbf{x}$ in the time domain and $\boldsymbol{\alpha}$ in the $\boldsymbol{\Psi}$ domain. When the signal $\mathbf{x}$ has only k non-zero coefficients $a_n$ and $k \ll N$, $\{\boldsymbol{\psi}_n\}$ is the sparse basis of $\mathbf{x}$, and $\mathbf{x}$ can be considered k-sparse with the sparse representation of Eq. (1).
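As an illustrative sketch of Eq. (1) (not part of the paper), the orthonormal DCT can serve as $\boldsymbol{\Psi}$: the inverse transform maps the coefficient vector $\boldsymbol{\alpha}$ to the time-domain signal $\mathbf{x}$, and the forward transform recovers $\boldsymbol{\alpha}$ exactly.

```python
import numpy as np
from scipy.fft import dct, idct

# Build a k-sparse coefficient vector alpha in the DCT (Psi) domain.
N, k = 256, 5
rng = np.random.default_rng(0)
alpha = np.zeros(N)
alpha[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)

# x = Psi alpha: the inverse orthonormal DCT maps coefficients to the time domain.
x = idct(alpha, norm="ortho")

# alpha = Psi^T x: the forward orthonormal DCT recovers the coefficients exactly,
# so x and alpha are equivalent representations of the same signal.
alpha_back = dct(x, norm="ortho")
assert np.allclose(alpha, alpha_back)
```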
In CS theory, the sensing process can be represented as:

$$\mathbf{y} = \boldsymbol{\Phi}\mathbf{x} \tag{2}$$

In Eq. (2), $\boldsymbol{\Phi}$ is the $M \times N$ measurement matrix, and $\mathbf{y} \in \mathbb{R}^M$ ($M \ll N$) is the measurement vector, whose dimension is far less than that of the signal $\mathbf{x}$.

With Eq. (1), Eq. (2) can be rewritten as:

$$\mathbf{y} = \boldsymbol{\Theta}\boldsymbol{\alpha} \tag{3}$$

where $\boldsymbol{\Theta} = \boldsymbol{\Phi}\boldsymbol{\Psi}$ is the $M \times N$ reconstruction matrix, and $\boldsymbol{\alpha}$ is the k-sparse vector representing the projection coefficients of $\mathbf{x}$ in the $\boldsymbol{\Psi}$ domain.
Reconstruction algorithms in CS try to solve Eq. (3), which is an underdetermined equation without a unique solution. When the signal is sparse and $\boldsymbol{\Theta}$ satisfies the RIP condition, a sparse approximate solution to Eq. (3) can be obtained by minimizing the $L_1$-norm. The RIP of the matrix $\boldsymbol{\Theta}$ is defined on an isometry constant $\delta_k \in (0, 1)$ for a k-sparse signal $\mathbf{x}$ and satisfies:

$$(1 - \delta_k)\|\mathbf{x}\|_2^2 \leq \|\boldsymbol{\Theta}\mathbf{x}\|_2^2 \leq (1 + \delta_k)\|\mathbf{x}\|_2^2 \tag{4}$$

It can be loosely said that a matrix obeys the RIP of order k if $\delta_k$ is not too close to one. The RIP ensures that all subsets of k columns taken from the matrix are nearly orthogonal. An equivalent condition of the RIP is incoherence between the measurement matrix $\boldsymbol{\Phi}$ and the representation matrix $\boldsymbol{\Psi}$. A variety of reconstruction methods such as greedy algorithms and convex optimization can be used to solve Eq. (3) [13–21].
When CS theory is applied to speech signal processing, the prerequisite is to achieve a sparse representation of speech signals using a proper orthogonal basis. The excitations of voiced and unvoiced speech are quasi-periodic vibrations of the vocal cords and random noise respectively, so voiced speech carries most of the energy of the sample, concentrated in the lower frequency range. One of the most important spectral characteristics of the discrete cosine transform (DCT) is its strong energy concentration in the low-frequency coefficients, which makes it suitable for analyzing the sparsity of speech signals. The mth DCT coefficient $a(m)$ of a speech frame x with N samples can be calculated by:

$$\left.\begin{aligned} a(m) &= w(m)\sum_{n=1}^{N} x(n)\cos\frac{\pi(2n-1)(m-1)}{2N}; \; 1 \leq m \leq N \\ w(m) &= \begin{cases} \sqrt{\dfrac{1}{N}}; & m = 1 \\ \sqrt{\dfrac{2}{N}}; & 2 \leq m \leq N \end{cases} \end{aligned}\right\} \tag{5}$$

where $x(n)$ denotes the nth sample of the speech frame.
Examples of a clean voiced frame and an unvoiced frame, as well as their DCT coefficients, are plotted in Fig. 1. Obviously, only a few DCT coefficients have large amplitudes while the rest are close to zero. The sparsity is more obvious in the voiced frame. Therefore the DCT coefficients of voiced speech signals can be considered approximately k-sparse.
Fig. 1 The sparsity of voiced and unvoiced frames: (a) waveform of a voiced frame; (b) DCT coefficients of a voiced frame; (c) waveform of an unvoiced frame; (d) DCT coefficients of an unvoiced frame
According to CS theory, voiced signals contaminated by noise can be reconstructed with high quality. Fig. 2 plots the denoising performance of sample reconstruction. A random Gaussian matrix is used as the measurement matrix, and the compressive sampling matching pursuit (CoSaMP) [14], orthogonal matching pursuit (OMP) [16], basis pursuit (BP) [17] and polytope faces pursuit (PFP) [21] algorithms are adopted in the reconstruction of noisy samples. The noisy voiced frame is produced by adding 20 dB Gaussian white noise to the clean frame. It is clear that the reconstructed samples are close in quality to the clean waveform: the amplitude and the period of the clean frame are preserved in the reconstructed frames, and the reconstruction waveforms of BP, OMP and PFP almost coincide. The high quality of the reconstructed voiced speech ensures the precision of feature extraction in SER. Both Fig. 1 and Fig. 2 demonstrate that sample reconstruction is feasible in noisy SER.
Fig. 2 Reconstructed samples of a noisy voiced frame
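A minimal sketch of this denoising-by-reconstruction idea on a synthetic k-sparse "frame" (illustrative only; the signal, noise level and the scikit-learn OMP solver are assumptions, not the paper's exact setup):

```python
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
N, M, k = 256, 128, 8

# Synthetic "voiced frame": k-sparse DCT coefficients mapped to the time domain.
alpha = np.zeros(N)
alpha[rng.choice(N, size=k, replace=False)] = (
    rng.uniform(1.0, 2.0, size=k) * rng.choice([-1.0, 1.0], size=k)
)
Psi = idct(np.eye(N), axis=0, norm="ortho")   # columns are the DCT basis vectors
x = Psi @ alpha

# Noisy frame and M random Gaussian measurements y = Phi x_noisy (M < N).
x_noisy = x + 0.05 * rng.standard_normal(N)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)
y = Phi @ x_noisy

# Solve y = Theta alpha (Eq. (3)) with OMP, then map back to the time domain.
Theta = Phi @ Psi
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(Theta, y)
x_rec = Psi @ omp.coef_

# The noise is not sparse in the DCT domain, so the reconstruction is
# typically closer to the clean frame than the noisy frame is.
err_noisy = np.linalg.norm(x_noisy - x)
err_rec = np.linalg.norm(x_rec - x)
```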
3 MK SVM classifier
3.1 SK SVM
The SVM, based on the theory of structural risk minimization, is a classifier proposed for the binary classification problem. Given l training patterns $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, where $\mathbf{x}_i$ is the input vector of the ith pattern and $y_i$ is the class label of $\mathbf{x}_i$, in the feature space induced by a mapping function $\phi$ we can find a hyperplane with the maximum margin to classify the two classes with the discriminant function:

$$f(\mathbf{x}) = \langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b \tag{6}$$

where $\mathbf{w}$ and b are the weight vector and the offset, which can be computed by solving a quadratic optimization problem:

$$\min_{\mathbf{w},b} \; \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} \quad \text{s.t.} \; y_i(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_i) + b) \geq 1; \; i = 1, 2, \ldots, l \tag{7}$$
To make the method more flexible and robust, a hyperplane can be constructed by relaxing the constraints in Eq. (7), which leads to the following soft-margin formulation with the introduction of slack variables $\xi_i$ to account for misclassifications. The objective function and constraints can be formulated as:

$$\min_{\mathbf{w},b} \; \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + C\sum_{i=1}^{l}\xi_i \quad \text{s.t.} \; y_i(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_i) + b) \geq 1 - \xi_i; \; \xi_i \geq 0, \; i = 1, 2, \ldots, l \tag{8}$$
where l is the number of training patterns, C is a parameter that gives a tradeoff between maximum margin and classification error, and $\phi$ is a mapping from the input space to the feature space. Eq. (8) can be solved by introducing Lagrange multipliers:

$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + C\sum_{i=1}^{l}\xi_i - \sum_{i=1}^{l}\beta_i\xi_i - \sum_{i=1}^{l}\alpha_i[y_i(\mathbf{w}^{\mathrm{T}}\phi(\mathbf{x}_i) + b) - 1 + \xi_i] \tag{9}$$
where $\alpha_i \geq 0$ and $\beta_i \geq 0$, $i = 1, 2, \ldots, l$ are the Lagrange multipliers. By setting the partial derivatives of L to zero and substituting the results into Eq. (9), $\mathbf{w}$, b, $\boldsymbol{\xi}$, $\boldsymbol{\beta}$ can be eliminated and Eq. (9) can be transformed into the following Wolfe dual form:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j k(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \; \sum_{i=1}^{l}\alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C; \; i = 1, 2, \ldots, l \tag{10}$$
where $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle$ is a kernel function. Eq. (10) can be rewritten in matrix form as:

$$\max_{\boldsymbol{\alpha}} \; \mathbf{e}^{\mathrm{T}}\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^{\mathrm{T}}G(\mathbf{K})\boldsymbol{\alpha} \quad \text{s.t.} \; \boldsymbol{\alpha}^{\mathrm{T}}\mathbf{y} = 0, \; \mathbf{0} \leq \boldsymbol{\alpha} \leq C\mathbf{e} \tag{11}$$

where $\mathbf{e} = [1, 1, \ldots, 1]^{\mathrm{T}}$ and $G(\mathbf{K}) = \mathrm{diag}(\mathbf{y})\mathbf{K}\,\mathrm{diag}(\mathbf{y})$. $\mathrm{diag}(\mathbf{y})$ is the diagonal matrix with diagonal $\mathbf{y}$, and $\mathbf{K} \in \mathbb{R}^{l \times l}$ is the kernel matrix with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $i = 1, 2, \ldots, l$, $j = 1, 2, \ldots, l$.
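The matrix quantities in Eq. (11) are easy to form directly; a small illustrative sketch with toy data (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))          # 6 patterns, 4 features (toy data)
y = np.array([1, 1, 1, -1, -1, -1])      # binary class labels

gamma = 0.1
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-gamma * sq)                               # RBF kernel matrix K_ij
G = np.diag(y) @ K @ np.diag(y)                       # G(K) = diag(y) K diag(y)

# G is symmetric positive semi-definite, like K (a congruence preserves PSD-ness).
assert np.allclose(G, G.T)
assert np.linalg.eigvalsh(G).min() > -1e-8
```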
3.2 MK SVM
The performance of an SK method depends heavily on the choice of the kernel. Kernel fusion has been proposed to deal with this problem by learning a kernel machine with MKs [10,22]. One of the effective kernel fusion strategies is a weighted combination of multiple kernels. The combined kernel function is $k(\mathbf{x}_i, \mathbf{x}_j) = \langle \boldsymbol{\Phi}(\mathbf{x}_i), \boldsymbol{\Phi}(\mathbf{x}_j) \rangle$, where $\boldsymbol{\Phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_M(\mathbf{x})]^{\mathrm{T}}$ and M is the number of kernel functions to be combined. The corresponding kernel matrix can be written as:

$$\mathbf{K} = \sum_{s=1}^{M}\mu_s\mathbf{K}_s \tag{12}$$

where $\mathbf{K}_s$, $1 \leq s \leq M$, is the kernel matrix constructed from $\phi_s$, and $\mu_s$ ($\sum_{s=1}^{M}\mu_s = 1$, $1 \leq s \leq M$) is the corresponding weight.
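Eq. (12) amounts to a weighted sum of base kernel matrices. A small sketch with three RBF kernels (the γ values match those used later in Sect. 5; the weights here are made-up placeholders, since in the paper they are solved by SDP):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 5))                       # toy patterns
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances

gammas = [0.01, 0.1, 1.0]                # base RBF kernels (Sect. 5 values)
mus = [0.2, 0.3, 0.5]                    # hypothetical weights; must sum to 1
Ks = [np.exp(-g * sq) for g in gammas]   # base kernel matrices K_s
K = sum(mu * Ks_ for mu, Ks_ in zip(mus, Ks))   # K = sum_s mu_s K_s

# A nonnegative combination of PSD matrices stays PSD, so K is a valid kernel.
assert abs(sum(mus) - 1.0) < 1e-12
assert np.linalg.eigvalsh(K).min() > -1e-8
```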
Lanckriet et al. proposed an MKL method with a transduction setting to obtain $\mu_s$. According to Eq. (11), training the SVM for a given kernel yields the optimal value of $\omega(\mathbf{K}) = \max_{\boldsymbol{\alpha}} \; \mathbf{e}^{\mathrm{T}}\boldsymbol{\alpha} - \frac{1}{2}\boldsymbol{\alpha}^{\mathrm{T}}G(\mathbf{K})\boldsymbol{\alpha}$, which is obviously a function of the particular choice of the kernel matrix. So finding the kernel matrix can be considered an optimization problem: find $\mathbf{K}$ in some convex subset $\kappa$ of positive semi-definite matrices while keeping the trace of $\mathbf{K}$ constant:

$$\min_{\mathbf{K} \in \kappa} \; \omega(\mathbf{K}) \quad \text{s.t.} \; \mathrm{tr}(\mathbf{K}) = c \tag{13}$$
The kernel matrix $\mathbf{K}$ in Eq. (13) can be found by solving the following convex optimization problem:

$$\begin{aligned} \min_{\mathbf{K},t,\lambda,\boldsymbol{\nu},\boldsymbol{\delta}} \; & t \\ \text{s.t.} \; & \mathbf{K} \in \kappa, \; \mathbf{K} \succeq 0, \; \mathrm{tr}(\mathbf{K}) = c \\ & \begin{bmatrix} G(\mathbf{K}) & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y} \\ (\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^{\mathrm{T}} & t - 2C\boldsymbol{\delta}^{\mathrm{T}}\mathbf{e} \end{bmatrix} \succeq 0 \\ & \boldsymbol{\nu} \geq \mathbf{0}, \; \boldsymbol{\delta} \geq \mathbf{0} \end{aligned} \tag{14}$$
In Eq. (14), $t \in \mathbb{R}$, $\lambda \in \mathbb{R}$, $\boldsymbol{\nu} \in \mathbb{R}^l$, and $\boldsymbol{\delta} \in \mathbb{R}^l$. $\mathbf{K} \succeq 0$ means that $\mathbf{K}$ is a positive semi-definite matrix, and the above optimization problem is an SDP. Notice that $\boldsymbol{\nu} \geq \mathbf{0}$ means $\mathrm{diag}(\boldsymbol{\nu}) \succeq 0$ and is thus a linear matrix inequality (LMI); similarly for $\boldsymbol{\delta} \geq \mathbf{0}$. The detailed proof of the above equations can be found in Ref. [10]. In MKL, the combined kernel matrix $\mathbf{K} = \sum_{s=1}^{M}\mu_s\mathbf{K}_s \in \mathbb{R}^{l \times l}$ is a linear combination of fixed kernel matrices, where l is the total number of training and testing patterns. By adding this additional constraint, Eq. (14) can be represented as Eq. (15), from which $\mu_s$ can be solved.
$$\begin{aligned} \min_{\boldsymbol{\mu},t,\lambda,\boldsymbol{\nu},\boldsymbol{\delta}} \; & t \\ \text{s.t.} \; & \sum_{s=1}^{M}\mu_s\mathbf{K}_s \succeq 0, \; \mathrm{tr}\left(\sum_{s=1}^{M}\mu_s\mathbf{K}_s\right) = c \\ & \begin{bmatrix} G\left(\sum_{s=1}^{M}\mu_s\mathbf{K}_s\right) & \mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y} \\ (\mathbf{e} + \boldsymbol{\nu} - \boldsymbol{\delta} + \lambda\mathbf{y})^{\mathrm{T}} & t - 2C\boldsymbol{\delta}^{\mathrm{T}}\mathbf{e} \end{bmatrix} \succeq 0 \\ & \boldsymbol{\nu} \geq \mathbf{0}, \; \boldsymbol{\delta} \geq \mathbf{0} \end{aligned} \tag{15}$$
A binary tree structure illustrated in Fig. 3 is adopted in the structure design of the multi-class classifier to recognize five emotions: happy, angry, fear, neutral and sad.
This structure is different from the traditional one-to-one, one-to-rest, or hierarchical SVM structures [23–25]. In the binary tree structure, the first classifying node (Model1) is improved by MK SVM to recognize the most confusable emotion, while the deeper classifying nodes (Model2~Model4) retain SK SVM. Taking the Berlin Database of Emotional Speech [26] as an example, happy is the main factor influencing the overall performance of the classifier because of its lowest recognition accuracy [24,27–28]. Thus, Model1 can be used to recognize happy among the other emotions when the Berlin Database of Emotional Speech is studied. This arrangement reduces both the error accumulation caused by the most confusable emotion and the total computational complexity of solving the SDP in every model.
Fig. 3 Multi-class and MK classifier with binary tree structure
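The decision flow of Fig. 3 can be sketched as follows (illustrative only: random toy data, and a plain RBF SVC standing in both for the MK SVM of Model1 and for the chain Model2~Model4):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
emotions = ["happy", "angry", "fear", "neutral", "sad"]
X = rng.standard_normal((100, 10))       # toy feature vectors
y = rng.choice(emotions, size=100)       # toy emotion labels

# Model1: separate the most confusable emotion ("happy") from the rest.
model1 = SVC(kernel="rbf").fit(X, y == "happy")

# One multiclass SVC stands in here for the deeper binary nodes Model2~Model4.
rest = y != "happy"
model_rest = SVC(kernel="rbf").fit(X[rest], y[rest])

def predict(x):
    """Route a sample through the tree: Model1 first, then the deeper nodes."""
    x = x.reshape(1, -1)
    if model1.predict(x)[0]:
        return "happy"
    return model_rest.predict(x)[0]

pred = predict(X[0])
assert pred in emotions
```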
4 Acoustic features and feature selection
Speech features usually used in SER are prosodic features, voice quality features and spectral features. Pitch, energy, duration, formants, Mel-frequency cepstral coefficients (MFCC) and their statistical parameters are extracted in this paper. The total dimension of the feature vector is 45. Table 1 lists the acoustic features adopted in the following experiments.
Table 1 Acoustic features

| Type | Feature | Statistical parameters |
| --- | --- | --- |
| Prosodic features | Pitch | Maximum, minimum, range, mean, standard deviation, first quartile, median, third quartile, inter-quartile range |
| Prosodic features | Energy | Maximum, minimum, range, mean, standard deviation, first quartile, median, third quartile, inter-quartile range |
| Prosodic features | Duration | Total frames, voiced frames, unvoiced frames, ratio of voiced to unvoiced frames, ratio of voiced to total frames, ratio of unvoiced to total frames |
| Voice quality features | Formant | F1, F2, F3: mean, standard deviation, median |
| Spectral features | MFCC | 12-order MFCC |
Feature selection is necessary to build an optimal feature subset with good emotional recognizability. Double input symmetrical relevance (DISR) is an information-theoretic selection criterion that relies on symmetric relevance to take into account the complementarity between two input features. The main advantage of the DISR criterion is that a selected variable is, with high probability, complementary with the other variables in the subset [29].
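DISR itself is not available in common toolkits; as a hedged stand-in, the sketch below ranks 45 synthetic features by mutual information with the class label and keeps a subset (DISR additionally scores pairwise complementarity, which this simple filter ignores):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for a 45-dimensional acoustic feature matrix.
X, y = make_classification(n_samples=200, n_features=45, n_informative=8,
                           random_state=0)

# Keep the 24 features most informative about the label (simple MI filter).
selector = SelectKBest(mutual_info_classif, k=24).fit(X, y)
X_sub = selector.transform(X)
assert X_sub.shape == (200, 24)
```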
5 Experimental results and analysis
In this paper the Berlin Database of Emotional Speech is selected to test the proposed method from the aspect of speaker-independent SER. The Berlin Database of Emotional Speech, also known as EMO-DB, is one of the most common databases in SER research. Ten actors (5 female and 5 male) simulated the emotions, producing 10 German utterances that could be used in everyday communication and are interpretable in all applied emotions. The complete database was evaluated in a perception test regarding the recognizability of emotions and their naturalness. 409 utterances of five emotions in this database are studied, including 71 happy, 127 angry, 69 fear, 63 sad and 79 neutral samples. The 207 training samples comprise 36 happy, 64 angry, 35 fear, 32 sad and 40 neutral samples, and the remaining 202 are testing samples.
The LibSVM toolbox is adopted in the experiments. In the MKL stage, a three-kernel SVM is employed, i.e. M = 3 in Eq. (12). The three basis kernel functions are radial basis functions (RBF) with parameters $\gamma_1 = 0.01$, $\gamma_2 = 0.1$, and $\gamma_3 = 1$ respectively. The YALMIP toolbox is used to solve the SDP problem [30]. In all SK models, the default value of $\gamma$ for the RBF is $\gamma = 1/k$, where k is the number of features. Three types of samples (original, noisy, and reconstructed) and two classifiers (SK and MK) are used in the experiments. We consider the clean emotional speech samples in the Berlin Database of Emotional Speech as original samples. Noisy samples are produced by adding Gaussian white noise with different signal-to-noise ratios (SNR) to the original samples, and the reconstructed samples are the reconstruction outputs calculated from measurements of the noisy samples. The SK SVM classifier means the four models shown in Fig. 3 are all SK SVM models, while the classifier improved by MKL in Model1 is called the MK SVM classifier.
5.1 Evaluation of recognition performance
Besides recognition accuracies, root mean square error (RMSE) and maximum error (MAXE) are used to evaluate the performance of both SK and MK SVM classifiers. RMSE and MAXE are calculated by:
$$P_{\mathrm{RMSE}} = \sqrt{\frac{1}{5}\sum_{h=1}^{5}e_h^2}, \quad Q_{\mathrm{MAXE}} = \max_h\{e_h\} \tag{16}$$

where $e_h$ is the recognition error of the hth emotion. Obviously, smaller values of RMSE and MAXE mean better performance of the classifier.
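A direct transcription of Eq. (16) with hypothetical per-emotion error values:

```python
import numpy as np

# Hypothetical recognition errors e_h for the five emotions (fractions).
e = np.array([0.51, 0.19, 0.21, 0.28, 0.00])

rmse = np.sqrt(np.mean(e ** 2))   # sqrt((1/5) * sum_h e_h^2)
maxe = np.max(e)                  # max_h e_h

# MAXE is always at least RMSE (equality only when all errors are equal).
assert rmse <= maxe
```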
5.2 The negative impacts of noise
Original samples, noisy samples (with SNRs of 20 dB, 15 dB, and 10 dB) and the samples reconstructed by the BP method are tested in the SK SVM classifier. In this section all 45 features listed in Table 1 are utilized. The experimental results given in Table 2 demonstrate the negative impact of noise and the performance of the reconstructed samples. The total SER accuracy of the 10 dB noisy samples is the worst, and the confusion between happy and the other emotions becomes serious as the SNR decreases. The results also show that sample reconstruction helps reduce the influence of noise.
Table 2 The emotion recognition accuracies of original, noisy, and reconstructed speech with 45 features

| Speech sample | Total/% | Anger/% | Fear/% | Happy/% | Neutral/% | Sad/% |
| --- | --- | --- | --- | --- | --- | --- |
| Original | 74.257 | 76.190 0 | 76.471 | 48.571 | 71.795 | 100.000 |
| Noisy (20 dB) | 72.772 | 95.238 0 | 58.824 | 20.000 | 76.923 | 96.774 |
| Noisy (15 dB) | 70.921 | 93.508 0 | 59.120 | 20.000 | 71.490 | 100.000 |
| Noisy (10 dB) | 67.218 | 85.143 0 | 76.706 | 14.857 | 53.462 | 100.000 |
| Reconstructed (20 dB) | 80.693 | 88.889 0 | 76.471 | 42.857 | 89.744 | 100.000 |
| Reconstructed (15 dB) | 74.257 | 92.063 5 | 47.059 | 28.571 | 92.308 | 96.774 |
| Reconstructed (10 dB) | 72.772 | 93.651 0 | 61.765 | 22.857 | 71.795 | 100.000 |
When the SNR is higher, such as 20 dB, the total performance of the reconstructed samples even surpasses that of the clean samples. This is mainly because most of the acoustic features are calculated from voiced speech. Since the reconstruction quality of voiced speech is better than that of unvoiced speech, the differentiation between voiced and unvoiced speech becomes much clearer after reconstruction, which leads to more precise feature extraction from the reconstructed speech.
5.3 Noisy speech emotion recognition
In this section the tested speech samples are the original samples, noisy samples with an SNR of 20 dB, and the corresponding reconstructed samples. Both the SK and MK classifiers are adopted. The combination weight coefficients of the kernels $\mu_s$ ($s = 1, 2, 3$) solved by SDP for the three types of speech are given in Table 3.
Table 3 Combination weight coefficients of MK

| Speech sample | $\mu_1$ | $\mu_2$ | $\mu_3$ |
| --- | --- | --- | --- |
| Original | 0.006 4 | 0.017 0 | 0.976 6 |
| Noisy | 0.003 4 | 0.005 5 | 0.991 2 |
| Reconstructed | 0.007 1 | 0.013 2 | 0.979 7 |
The experimental results of SER are plotted in Figs. 4 and 5. The highest recognition accuracies are 76.238% (original, SK), 73.267% (noisy, SK), 82.178% (reconstructed, SK), 88.614% (original, MK), 86.139% (noisy, MK) and 95.545% (reconstructed, MK) respectively. The recognition accuracy of happy, which is the most confusable emotion in the Berlin Database of Emotional Speech, is shown in Fig. 4(b). Fig. 5 plots the curves of RMSE and MAXE corresponding to Fig. 4(a).
Fig. 4 SER accuracies for different selected feature numbers: (a) total accuracy; (b) accuracy of happy
Table 4 gives the detailed recognition results of the five emotions in the Berlin Database of Emotional Speech at the highest total accuracies. The corresponding confusion matrices are listed in Table 5. The experimental results show the effectiveness of the combination of sample reconstruction and MKL in noisy SER.
Fig. 5 RMSE and MAXE values for different selected feature numbers: (a) RMSE; (b) MAXE
Table 4 Recognition details of the highest total accuracies

| Type | Feature number | Total/% | Angry/% | Fear/% | Happy/% | Neutral/% | Sad/% | RMSE | MAXE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original, SK | 44 | 76.238 | 80.952 | 79.412 | 48.571 | 71.795 | 100.000 | 0.462 80 | 0.641 03 |
| Noisy, SK | 43 | 73.267 | 95.238 | 61.765 | 17.143 | 76.923 | 100.000 | 0.421 48 | 0.828 57 |
| Reconstructed, SK | 34 | 82.178 | 90.476 | 82.353 | 42.857 | 89.744 | 100.000 | 0.274 69 | 0.571 43 |
| Original, MK | 39 | 88.614 | 93.651 | 79.412 | 97.143 | 74.359 | 96.774 | 0.151 01 | 0.256 41 |
| Noisy, MK | 24 | 86.139 | 92.063 | 61.765 | 100.000 | 74.359 | 100.000 | 0.288 92 | 0.382 35 |
| Reconstructed, MK | 24 | 95.545 | 98.413 | 85.294 | 100.000 | 92.308 | 100.000 | 0.074 56 | 0.147 06 |
Table 5 Confusion matrices of the highest total accuracies (rows: actual emotion; columns: recognized emotion, in %)

| Sample | Emotion | SK: Angry | SK: Fear | SK: Happy | SK: Neutral | SK: Sad | MK: Angry | MK: Fear | MK: Happy | MK: Neutral | MK: Sad |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | Angry | 80.952 | 3.175 | 15.873 | 0 | 0 | 93.651 | 4.762 | 1.587 | 0 | 0 |
| Original | Fear | 11.765 | 79.412 | 5.882 | 2.941 | 0 | 20.588 | 79.412 | 0 | 0 | 0 |
| Original | Happy | 42.857 | 8.571 | 48.571 | 0 | 0 | 2.857 | 0 | 97.143 | 0 | 0 |
| Original | Neutral | 0 | 28.205 | 0 | 71.795 | 0 | 0 | 25.641 | 0 | 74.359 | 0 |
| Original | Sad | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 | 0 | 3.226 | 96.774 |
| Noisy | Angry | 95.238 | 0 | 1.587 | 3.175 | 0 | 92.063 | 3.175 | 1.587 | 3.175 | 0 |
| Noisy | Fear | 23.529 | 61.765 | 5.882 | 8.824 | 0 | 11.765 | 61.765 | 5.882 | 17.647 | 2.941 |
| Noisy | Happy | 77.143 | 5.714 | 17.143 | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 |
| Noisy | Neutral | 2.564 | 17.949 | 0 | 76.923 | 2.564 | 0 | 0 | 17.949 | 74.359 | 7.692 |
| Noisy | Sad | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 | 0 | 0 | 100.000 |
| Reconstructed | Angry | 90.476 | 0 | 9.524 | 0 | 0 | 98.413 | 0 | 1.587 | 0 | 0 |
| Reconstructed | Fear | 2.941 | 82.353 | 14.706 | 0 | 0 | 8.824 | 85.294 | 0 | 5.882 | 0 |
| Reconstructed | Happy | 31.429 | 25.714 | 42.857 | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 |
| Reconstructed | Neutral | 0 | 10.256 | 0 | 89.744 | 0 | 0 | 0 | 0 | 92.308 | 7.692 |
| Reconstructed | Sad | 0 | 0 | 0 | 0 | 100.000 | 0 | 0 | 0 | 0 | 100.000 |
In the experimental results of ‘original, SK’ shown in Table 4 and Table 5, the main factor influencing the total accuracy is the recognition performance of happy. The confusion between happy and angry is serious: 42.857% of happy samples are falsely recognized as angry. The reason for this phenomenon is the similar manner of utterance in the happy and angry states, which makes the numerical values of acoustic features such as energy and pitch close in these emotions. The negative effect of this similarity of feature values is enlarged when noise exists, so in the case of ‘noisy, SK’ the confusion between happy and angry rises to 77.143%. However, in the data of ‘reconstructed, SK’, the confusion caused by noise is eliminated by the reconstruction and the total performance is even better than on the original samples, which conforms to the analysis of Table 2. The comparison between ‘reconstructed, SK’ and ‘reconstructed, MK’ shows that the confusion between happy and angry that cannot be handled by SK SVM is solved by MKL. In the recognition of the other emotions, ‘reconstructed, MK’ also has the best accuracies, and its total accuracy is the highest. Furthermore, the effectiveness of the different kernels can be compared. In ‘noisy, SK’, 27 testing happy samples are falsely recognized as angry by the RBF with $\gamma = 1/43 \approx 0.023\,3$. If $\gamma_1 = 0.01$, $\gamma_2 = 0.1$ and $\gamma_3 = 1$ are adopted respectively as the gamma parameters of the RBF in SK SVM to recognize these 27 falsely recognized happy samples from 63 angry samples, the corresponding accuracies increase gradually, demonstrating that the RBF with $\gamma_3 = 1$ is helpful in reducing the confusion between happy and angry. It is therefore reasonable that the weight coefficients $\mu_1$ and $\mu_2$ are smaller in Table 3.
The experimental results also show that feature selection can improve the recognition accuracies. For example, only 24 features of the reconstructed samples are needed to achieve the best accuracy with the MK SVM classifier. Thus feature selection is useful for exploring the best performance of classifiers.
Although MKL can improve the recognition performance, this strategy increases the time complexity of the whole system: the SDP solving and MK computation are time-consuming, and neither is needed in SK SVM. For the classifier in Fig. 3, the SDP solving time of Model1 in Matlab is 104.543 8 s and the total training and testing time is 7.758 3 s. If all four SVM models are MK and use the same 24 features as ‘reconstructed, MK’, there are four groups of $\mu_s$ to be solved and the total SDP solving time is 178.816 0 s. In this case, 10.667 8 s is needed to train the four MK models and test the 202 samples; in the SK situation, the training and testing time is only 5.015 6 s. If all models in the classifier are MK, the total accuracy is 96.039 6%. Compared to the 95.545% in Table 4, there is only a 0.495% improvement at the cost of far more computation time, so utilizing MKL in all nodes of the classifier is inefficient. It also means that the Berlin Database of Emotional Speech contains samples that cannot be recognized by the three kernels adopted in this paper.
The performance can be compared with previously published SER works on the Berlin Database of Emotional Speech. We compare the experimental data in this paper with the results of the enhanced sparse representation classifier (Enhanced-SRC) [8], feature fusion based on MKL [11], cross-correlation SVM (CrossCorr-SVM) [24], a Bayesian classifier with a Gaussian class-conditional likelihood [28], and a neural network [31].
Table 6 gives the comparison of the above methods on the original samples in the Berlin Database of Emotional Speech, where ‘sample reconstruction and MKL’ is the method proposed in this paper. In Table 6 the symbol ‘N’ denotes that there are no related experimental results or corresponding emotions in the references. The methods in the references mainly focus on original emotional samples, except for Enhanced-SRC. Though Enhanced-SRC dealt with both original and noisy samples in the Berlin Database of Emotional Speech, the confusion between happy and angry was not solved completely. In Table 6, the classifier proposed in this paper has the best recognition accuracy on the happy state. In the case of 20 dB noisy SER, the highest total recognition accuracy is 95.545%, which is higher than the 80.400% of Enhanced-SRC and the performance of the other traditional recognition methods listed in Ref. [8].
Table 6 Comparison of recognition accuracies of five emotions on original samples in the Berlin Database of Emotional Speech

| Method | Anger/% | Fear/% | Happy/% | Neutral/% | Sad/% |
| --- | --- | --- | --- | --- | --- |
| Sample reconstruction and MKL | 93.651 | 79.412 | 97.143 | 74.359 | 96.774 |
| Enhanced-SRC | 98.550 | 83.160 | 57.730 | 70.080 | 96.710 |
| CrossCorr-SVM | 95.040 | N | 66.670 | 85.070 | 84.340 |
| Bayesian classifier | 86.100 | N | 52.700 | 52.900 | 87.600 |
| Neural network | 84.200 | 63.200 | N | 78.800 | N |
| Feature fusion based on MKL | 81.000 | 83.000 | N | 65.000 | 95.000 |
6 Conclusions and future work
In this paper, sample reconstruction and MKL are combined into noisy SER. Through the comparison and analysis of experimental results on Berlin Database of Emotional Speech, the following conclusions can be drawn:
1) The acoustic features extracted from reconstructed speech samples based on CS theory are robust to noise and even have better emotional recognizability than that of clean speech samples.
2) MKL is effective in the reduction of confusion that cannot be handled by SK classifier and improves the SER performance greatly. However, because of the time complexity added by MKL, the structure design of multi-class classifier must be fully considered to utilize MKL efficiently.
3) Feature selection improves the accuracies of SER and plays an important role in the deep analysis about the performance of recognition methods and classifiers.
There are several challenging aspects for future research. The selection of base kernels suitable for SER and more effective learning strategies for obtaining the MK matrix are important topics for further study. Faster implementations of sample reconstruction and SDP solving need more research to reduce the time complexity of the proposed method. In addition, the automatic acquisition of optimal features is also essential to improving system performance.
Acknowledgements
This work is supported by the DoCoMo Beijing Labs Co. Ltd., and Program for New Century Excellent Talents in Beijing University of Posts and Telecommunications (04-0112).
References
1. Tao J, Tan T. Affective computing: a review. Proceedings of the 1st International Conference on Affective Computing and Intelligent Interaction, Oct 22–24, 2005, Beijing, China. LNCS 3784. Berlin, Germany: Springer, 2005: 981–995
2. Schuller B, Batliner A, Steidl S, et al. Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Communication, 2011, 53(9/10): 1062–1087
3. Schuller B, Arsic D, Wallhoff F, et al. Emotion recognition in the noise applying large acoustic feature sets. Proceedings of the 3rd International Conference on Speech Prosody, May 2–5, 2006, Dresden, Germany. 2006: IP128
4. You M Y, Chen C, Bu J J, et al. Emotion recognition from noisy speech. Proceedings of the 2006 IEEE International Conference on Multimedia and Expo (ICME’06), Jul 9–12, 2006, Toronto, Canada. Piscataway, NJ, USA: IEEE, 2006: 1653–1656
5. Schuller B, Wöllmer M, Moosmayr T, et al. Recognition of noisy speech: a comparative survey of robust model architecture and feature enhancement. EURASIP Journal on Audio, Speech, and Music Processing, 2009: 942617/1–17
6. Donoho D L. Compressed sensing. IEEE Transactions on Information Theory, 2006, 52(4): 1289–1306
7. Candès E J. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 2008, 346(9/10): 589–592
8. Zhao X M, Zhang S Q, Lei B C. Robust emotion recognition in noisy speech via sparse representation. Neural Computing and Applications, 2014, 24(7): 1539–1553
9. Haupt J, Nowak R. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 2006, 52(9): 4036–4048
10. Lanckriet G R G, Cristianini N, Bartlett P, et al. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 2004, 5(1): 27–72
11. Jin Y, Song P, Zheng W M, et al. Novel feature fusion method for speech emotion recognition based on multiple kernel learning. Journal of Southeast University, 2013, 29(2): 129–133
12. Baraniuk R G. Compressive sensing. IEEE Signal Processing Magazine, 2007, 24(4): 118–120
13. Needell D, Vershynin R. Signal recovery from inaccurate and incomplete measurements via regularized orthogonal matching pursuit. IEEE Journal of Selected Topics in Signal Processing, 2010, 4(2): 310–316
14. Needell D, Tropp J A. CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 2008, 26(3): 301–321
15. Dai W, Milenkovic O. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 2009, 55(5): 2230–2249
16. Tropp J A, Gilbert A C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 2007, 53(12): 4655–4666
17. Saligrama V, Zhao M Q. Thresholded basis pursuit: LP algorithm for order-wise optimal support recovery for sparse and approximately sparse signals from noisy random measurements. IEEE Transactions on Information Theory, 2011, 57(3): 1567–1586
18. Chen S S, Donoho D L, Saunders M A. Atomic decomposition by basis pursuit. SIAM Review, 2001, 43(1): 129–159
19. Figueiredo M A, Nowak R D, Wright S J. Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 2007, 1(4): 586–597
20. Blumensath T, Davies M. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 2009, 27(3): 265–274
21. Plumbley M D. Recovery of sparse representations by polytope faces pursuit. Proceedings of the 2006 International Conference on Independent Component Analysis and Blind Source Separation, Mar 5–8, 2006, Charleston, SC, USA. LNCS 3889. Berlin, Germany: Springer, 2006: 206–213
22. Yeh C Y, Su W P, Lee S J. An efficient multiple-kernel learning for pattern classification. Expert Systems with Applications, 2013, 40(9): 3491–3499
23. Chen L J, Mao X, Xue Y L, et al. Speech emotion recognition: features and classification models. Digital Signal Processing, 2012, 22(6): 1154–1160
24. Chandaka S, Chatterjee A, Munshi S. Support vector machines employing cross-correlation for emotional speech recognition. Measurement, 2009, 42(4): 611–618
25. Lee C C, Mower E, Busso C, et al. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 2011, 53(9/10): 1162–1171
26. Burkhardt F, Paeschke A, Rolfes M, et al. A database of German emotional speech. Proceedings of the 9th European Conference on Speech Communication and Technology (INTERSPEECH’05), Sept 4–8, 2005, Lisbon, Portugal. 2005: 1517–1520
27. Jiang X Q, Xia K W, Xia X Y, et al. Speech emotion recognition using semi-definite programming multiple-kernel SVM. Journal of Beijing University of Posts and Telecommunications, 2015, 38(S1): 67–71 (in Chinese)
28. Yang B, Lugger M. Emotion recognition from speech signals using new harmony features. Signal Processing, 2010, 90(5): 1415–1423
29. Meyer P E, Schretter C, Bontempi G. Information-theoretic feature selection in microarray data using variable complementarity. IEEE Journal of Selected Topics in Signal Processing, 2008, 2(3): 261–274
30. Löfberg J. YALMIP: a toolbox for modeling and optimization in MATLAB. Proceedings of the 2004 International Symposium on Computer Aided Control Systems Design, Sept 2–4, 2004, Taipei, China. Piscataway, NJ, USA: IEEE, 2004: 284–289
31. Henríquez P, Alonso J B, Ferrer M A, et al. Nonlinear dynamics characterization of emotional speech. Neurocomputing, 2014, 132: 126–135
Received date: 29-09-2016
Corresponding author: Xia Kewen, E-mail: [email protected]
DOI: 10.1016/S1005-8885(17)#####