
International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 13, Number 5 (2018), pp. 2380-2389
© Research India Publications. http://www.ripublication.com

Enhancement of an Arabic Speech Emotion Recognition System

Samira Klaylat, Department of Computer Science, Beirut Arab University, Beirut, Lebanon
Ziad Osman, Electrical and Computer Engineering Department, Beirut Arab University, Beirut, Lebanon
Lama Hamandi, Electrical and Computer Engineering Department, American University of Beirut, Beirut, Lebanon
Rached Zantout, Electrical and Computer Engineering Department, Rafik Hariri University, Mechref, Lebanon

Abstract

In this paper, a novel two-phase model is proposed to enhance an emotion recognition system. The system recognizes three emotions, happy, angry and surprised, from a realistic Arabic speech corpus. Thirty-five classification models were applied, and the Sequential Minimal Optimization (SMO) classifier gave the best result with 95.52% accuracy. After applying the proposed two-phase model, an enhancement of 3% is achieved for all classification methods. The model is then verified with two training sets and the results are analyzed.

Keywords: Emotion Recognition; Arabic Speech; Natural corpus; prosodic features.

INTRODUCTION

Emotion recognition from speech data has become a growing field of research in computer science. Various systems have been modeled to recognize emotion using acted speech data in different languages, while no system built on Arabic speech has been found to date. In [1], the first natural Arabic speech corpus to recognize emotions was built. Three emotions, happy, angry and surprised, were recognized. Acoustic low-level descriptors were extracted and different classifiers were applied. The Sequential Minimal Optimization (SMO) classifier gave the best result with 95.52% accuracy. The aim of this paper is to enhance the accuracy of the classification methods by applying a novel two-phase enhancement model.

This paper is organized as follows: a review of existing work related to emotion recognition in speech is given in section II. In section III, work on recognizing emotions from Arabic speech is described. In section IV, a novel two-phase approach to enhance the model of section III is proposed. Section V presents a verification model for the proposed two-phase approach, while in section VI the verification results are analyzed. Finally, in section VII the contributions of this paper are summarized and future work is presented.

RELATED WORK

Two types of emotions have been recognized from speech databases in the literature: discrete emotions [2, 3, 4], like fear, happiness and surprise, and continuous emotions [5, 6, 7], like positive/negative and active/passive. Corpora of different languages like English, German, Danish and French have been studied. Speech corpora are collected either from professional or non-professional actors who are asked to read a specific sentence with a specific emotion, from natural live recordings of doctor/patient, parent/child or employee/employer conversations, or from human/machine telephone calls from call centers. Emotions in natural speech databases reflect real-life situations and may convey a mixture of emotions [8, 9]. The main advantage of acted corpora is that most emotions, as well as most languages, are available, and hence results can be easily compared; however, they do not represent real-life scenarios. In natural corpora, on the other hand, not all emotions are available and the recording environment may not be suitable for modeling. Another way to collect emotional speech corpora is to create an environment that triggers certain emotions in the speakers. Such databases are called induced databases, where the elicited emotions are spontaneous. Table I shows some popular acted, natural and induced speech databases.

Table I. Example of speech corpora

Name | Language | Emotion Type | Emotions | Corpus Type
DES [10] | Danish | Discrete | Neutral, angry, happy, sad, surprised | Acted
Emo-DB [11] | German | Discrete | Neutral, anger, happiness, sadness, fear, boredom, disgust | Acted
LDC [12] | American English | Discrete and continuous | Neutral, panic, anxiety, hot anger, cold anger, despair, sadness, elation, joy, interest, boredom, shame, pride | Acted
Serbian database of acted emotions [13] | Serbian | Discrete | Neutral, anger, happiness, sadness, fear | Acted
ESMBS [14] | Mandarin and Burmese | Discrete | Anger, disgust, fear, joy, sadness, surprise | Acted
KISMET [15] | American English | Continuous | Approval, attention, prohibition, soothing, neutral | Acted
CLDC [16] | Chinese | Discrete | Joy, anger, surprise, fear, neutral, sadness | Acted
IEMOCAP [17] | English | Discrete and continuous | Anger, happiness, sadness, neutrality, valence, activation and dominance | Induced
SmartKom [18] | German | Discrete and continuous | Joy/gratification, anger/irritation, helplessness, pondering/reflecting, surprise, neutral, unidentifiable | Induced
FAU Aibo [19] | German | Continuous | Five-emotion set (Anger, Emphatic, Neutral, Positive and Rest); two-emotion set (Negative and Idle) | Induced
VAM [20] | German | Continuous | Valence, activation and dominance | Natural
NATURAL [21] | Mandarin | Discrete | Anger, neutral | Natural
SPC [23] | French | Continuous | Big five OCEAN dimensions [22]: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism | Natural
TUM AVIC [24] | English | Continuous | Disinterest, indifference, neutrality, interest, curiosity | Natural
CEMO [25] | French | Discrete | Fear, anger, sadness, neutral, relief | Natural

The next step after preparing the speech corpus is to extract the features of every unit of study. Speech features are categorized into three types: spectral, excitation, and acoustic features [26]. Spectral features, also known as vocal features, were used by [27, 28, 29] and include MFCC (Mel-frequency cepstral) coefficients [30], LPCC (Linear prediction cepstral) coefficients, LFPC (Log frequency power) coefficients, MFB (Mel filter bank), the spectral centroid, and the formants F1 and F2 with their bandwidths BW1 and BW2.

Few studies have used excitation speech features [31-34] to recognize emotions from speech. These features are derived from the linear prediction residual of the source signal [28]. The linear prediction residual is obtained by first modeling the vocal tract information using linear prediction coefficients (LPCs) from the speech signal and then separating it out by inverse filter formulation [35].

Acoustic features are the most effective in recognizing emotions from speech signals [36]. These features, also called prosodic features, include the fundamental frequency F0, signal intensity, zero-crossing rate, jitter, and shimmer of the sound signal. Acoustic features are the most widely used to recognize emotions from spoken data [37-41].
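As an illustration of how such low-level descriptors can be computed in practice (not the extraction pipeline of [1], which relies on openSMILE), the following hedged sketch uses the librosa library to obtain a few spectral and acoustic descriptors from a one-second speech unit; the file name is a placeholder.

```python
# Hypothetical illustration: extracting a few spectral and acoustic descriptors
# with librosa (the paper itself uses openSMILE; see the feature extraction section).
import librosa
import numpy as np

# Load a one-second speech unit (file name is a placeholder).
y, sr = librosa.load("speech_unit.wav", sr=16000)

# Spectral descriptors: 12 MFCCs and the spectral centroid, frame by frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)        # shape (12, n_frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # shape (1, n_frames)

# Acoustic (prosodic) descriptors: zero-crossing rate, RMS intensity and F0.
zcr = librosa.feature.zero_crossing_rate(y)               # shape (1, n_frames)
rms = librosa.feature.rms(y=y)                            # shape (1, n_frames)
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)             # shape (n_frames,)

# Stack the per-frame descriptors into one LLD matrix (rows = descriptors),
# trimming to a common number of frames.
n = min(mfcc.shape[1], centroid.shape[1], zcr.shape[1], rms.shape[1], len(f0))
llds = np.vstack([mfcc[:, :n], centroid[:, :n], zcr[:, :n], rms[:, :n], f0[np.newaxis, :n]])
print(llds.shape)
```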

The final step in building a model to recognize emotions is applying machine learning methods to classify the emotions. Both linear and non-linear classifiers have been used in the literature. Linear classifiers are used when the features are linearly separable; they include Naïve Bayes, logistic regression and Support Vector Machines (SVMs). SVMs have been widely used [42, 43, 44] and have proved to produce highly accurate results, especially for small data sets.

For non-linearly separable data, non-linear classifiers are more efficient. They include the Gaussian Mixture Model (GMM) [45, 46], the Hidden Markov Model (HMM), Artificial Neural Networks (ANN) [47, 48], K-nearest neighbours (KNN) [49] and decision trees [50]. No classifier or combination of classifiers has been proven to give the best accuracy in emotion recognition systems [1].
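As a small, hedged illustration of this linear versus non-linear distinction (not an experiment from the paper), the sketch below compares a linear SVM with a K-nearest-neighbours classifier on the same feature matrix using scikit-learn; the feature matrix X and label vector y are random placeholders.

```python
# Illustrative comparison of a linear and a non-linear classifier with scikit-learn.
# X (n_samples x n_features) and y (emotion labels) are random placeholders here.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))      # placeholder features
y = rng.integers(0, 3, size=200)    # placeholder labels: 3 emotion classes

linear_clf = make_pipeline(StandardScaler(), LinearSVC())                      # linear boundary
nonlinear_clf = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

for name, clf in [("linear SVM", linear_clf), ("KNN", nonlinear_clf)]:
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```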

RECOGNIZING EMOTIONS IN ARABIC SPEECH

In this section, the study made by [1] to recognize emotions from Arabic speech is presented. First, the corpus building process is explained, then the feature extraction phase is presented, and finally the classification models applied and their corresponding results are discussed.

The motivation of the study in [1] was to improve the communication of hearing-impaired and deaf people. Integrating an effective emotion recognition system with a reliable speech-to-text system enables deaf and hearing-impaired individuals to make successful phone calls with hearing people. Applications like IP-Relay [51] and SKC Interpret [52] allow a hearing-impaired person to effectively make and receive phone calls. The hearing-impaired individual can type a message and the person on the other side hears the words spoken. When the person at the other end of the line speaks to the hearing-impaired person, the words are received as text on the mobile phone of the hearing-impaired. However, since the emotion of both parties is missing, the reliability and usefulness of those systems is reduced. In [1], natural phone call recordings were collected to recognize emotions.

A. CORPUS ENGINEERING

Speech databases of different languages like English (TUM AVIC [24], TIMIT [53]), German (FAU AEC [19], aGender [54], ALC [55], SLD [56]), and French (SPC [57]) have been used to recognize emotions from speech; however, no reported Arabic corpus is available. Hence, [1] is considered the first study made on a natural Arabic corpus to recognize discrete emotions.

Eight videos (4 Egyptian, 2 Gulf, 1 Jordanian, and 1 Lebanese) of live calls between an anchor and a human outside the studio were downloaded from online Arabic talk shows [58-65]. The total length of all the videos is 1632 seconds. Eighteen human labelers were asked to listen to the videos and label each one of them as happy, angry or surprised. The average result was used to label each video. In the literature, the number of labelers varied between 4 and 5 in TUM AVIC [24] and FAU AEC [19], and reached 32 labelers in the SLD [56] database.

Each video was then divided into turns: callers and receivers. Silence, laughs and noisy chunks were removed. Every chunk was then automatically divided into 1-second speech units, forming our final corpus composed of 1384 records with 505 happy, 137 surprised and 741 angry units.
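A minimal sketch of this kind of automatic one-second segmentation is shown below, assuming the pydub library and a hypothetical cleaned caller-turn file; it illustrates the chunking step, not the exact tooling used in [1].

```python
# Hypothetical sketch: split a cleaned speech turn into 1-second units with pydub.
# The input file name and output naming scheme are placeholders, not from the paper.
from pydub import AudioSegment

turn = AudioSegment.from_file("caller_turn_01.wav")  # one caller/receiver turn

unit_ms = 1000  # 1-second speech units
units = []
for start in range(0, len(turn) - unit_ms + 1, unit_ms):
    unit = turn[start:start + unit_ms]               # pydub slices are in milliseconds
    unit.export(f"unit_{start // unit_ms:04d}.wav", format="wav")
    units.append(unit)

print(f"{len(units)} one-second units written")
```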

B. FEATURE EXTRACTION

A combination of acoustic and spectral features, known as low-level descriptors, was provided to participants of the first international Emotion Challenge at the INTERSPEECH 2009 conference [66]. This challenge was initiated to provide a good benchmark for speech processing tasks and to enable a more accurate comparison between the models proposed by participants. These low-level descriptors (LLDs) are extracted using the open-source openSMILE feature extractor [67], developed at Technische Universität München (TUM). In [1], 25 low-level descriptors (LLDs) were extracted from every speech unit: intensity, zero-crossing rate, MFCC 1-12 (Mel-frequency cepstral coefficients), F0 (fundamental frequency), F0 envelope, probability of voicing, and LSP (line spectral frequencies) 0-7.

Next, on each LLD, 19 statistical functionals were calculated: maximum, minimum, range, absolute position of the maximum, absolute position of the minimum, arithmetic mean, linear regression 1, linear regression 2, linear regression A, linear regression Q, standard deviation, kurtosis, skewness, quartiles 1, 2 and 3, and inter-quartile ranges 1-2, 2-3 and 1-3. The delta coefficient of every LLD is also computed as an estimate of the first derivative, leading to a total of 950 features.
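The count of 950 follows from the 25 LLDs plus their 25 delta (first-derivative) contours, each summarized by the 19 functionals: (25 + 25) x 19 = 950. A hedged NumPy sketch of applying a few such functionals to one LLD contour and its delta is shown below; it illustrates the construction, not the exact openSMILE configuration.

```python
# Illustrative computation of statistical functionals over one LLD contour and its delta.
# The real feature set in [1] comes from openSMILE; this only mirrors the construction.
import numpy as np
from scipy import stats

def functionals(contour: np.ndarray) -> dict:
    """A subset of the 19 functionals applied to a per-frame contour."""
    q1, q2, q3 = np.percentile(contour, [25, 50, 75])
    return {
        "max": contour.max(),
        "min": contour.min(),
        "range": contour.max() - contour.min(),
        "mean": contour.mean(),
        "std": contour.std(),
        "skewness": stats.skew(contour),
        "kurtosis": stats.kurtosis(contour),
        "quartile1": q1, "quartile2": q2, "quartile3": q3,
        "iqr13": q3 - q1,
    }

f0 = np.abs(np.random.default_rng(0).normal(120, 20, size=100))  # placeholder F0 contour
delta_f0 = np.diff(f0, prepend=f0[0])                            # simple first-derivative estimate

features = {}
for name, contour in [("F0", f0), ("F0_delta", delta_f0)]:
    for fname, value in functionals(contour).items():
        features[f"{name}_{fname}"] = value
print(len(features), "features from one LLD and its delta")
```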

To remove ineffective features, the Kruskal-Wallis non-parametric test [68] was applied. A significance level of 0.05 (i.e., a 95% confidence level) was considered; features that did not show a statistically significant difference across the three emotions at this level were removed, resulting in a new database of 1384 records with 845 features.
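A minimal sketch of this kind of Kruskal-Wallis filtering with SciPy is given below, assuming a feature matrix X, a label vector y with the three emotions, and the 0.05 significance level; it illustrates the test, not the exact procedure in [1].

```python
# Hedged sketch: Kruskal-Wallis based feature filtering with SciPy.
# X (n_records x n_features) and y (emotion labels) are random placeholders.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 950))                       # placeholder feature matrix
y = rng.choice(["happy", "angry", "surprised"], 300)  # placeholder emotion labels

alpha = 0.05
keep = []
for j in range(X.shape[1]):
    groups = [X[y == emotion, j] for emotion in ("happy", "angry", "surprised")]
    _, p_value = kruskal(*groups)
    # Keep only features whose distribution differs significantly across emotions.
    if p_value < alpha:
        keep.append(j)

X_selected = X[:, keep]
print(f"kept {X_selected.shape[1]} of {X.shape[1]} features")
```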

C. CLASSIFICATION

Thirty-five classifiers belonging to six classification groups were applied separately to the collected speech corpus, with ten-fold cross-validation applied to all classifiers. The highest accuracy of 95.52% was achieved by the Sequential Minimal Optimization (SMO) classifier, and the lowest result was 53.58%, shared by four different methods. SMO was invented by [69] to solve the quadratic problems that arise when training large SVM [70] models. SMO uses heuristics to partition the training problem into smaller problems that can be solved analytically. The kernel and calibrator parameters of the SMO classifier were set to PolyKernel and Logistic, respectively.

Several studies have used SVM classification models to recognize emotions from acoustic features. In [71], an accuracy of 85.2% was achieved when recognizing anger, sadness, boredom, disgust, fear, joy, and neutral emotions from an acted German corpus, while 83% was achieved by [72] when recognizing stress and neutral emotions from an English corpus. Positive and negative emotions were recognized from a natural French database in [73] with an accuracy of 83.16%, while 76.93% and 76% accuracy were achieved by [21] and [74], respectively, when recognizing anger and neutral emotions from Mandarin speech databases. Hence, the result achieved by [1] is considered high when compared to the literature, especially since the corpus used is natural and the emotions are not prototyped.
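For readers who want to reproduce this kind of setup outside Weka, the hedged scikit-learn sketch below trains an SVM with a polynomial kernel (scikit-learn's SVC is built on libsvm, which uses an SMO-type solver) wrapped in a logistic (sigmoid) probability calibration, roughly mirroring the PolyKernel/Logistic configuration described above; the feature matrix X and labels y are random placeholders.

```python
# Hedged sketch of an SMO-style SVM with a polynomial kernel and logistic calibration.
# This mirrors the Weka SMO configuration described in the text, not the exact setup of [1].
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 845))                       # placeholder: 845 selected features
y = rng.choice(["happy", "angry", "surprised"], 400)  # placeholder emotion labels

svm = make_pipeline(
    StandardScaler(),
    # Polynomial kernel SVM; libsvm's solver is an SMO-type algorithm.
    CalibratedClassifierCV(SVC(kernel="poly", degree=1), method="sigmoid", cv=3),
)

scores = cross_val_score(svm, X, y, cv=10)            # 10-fold cross-validation
print(f"accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```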

ENHANCEMENT

In this section, we propose a novel two-phase classification approach to enhance the accuracy reported in [1].

A. PHASE ONE

In this phase, we aim to remove the bad speech units from the original corpus. By bad units we mean the units that were misclassified by most classifiers. In [1], thirty-five classification methods were applied on 1385 records. Some records were misclassified by more than fourteen methods. Considering the average, which is (35 + 14) / 2, we labeled the records that were misclassified by more than twenty-four methods as 1 (bad units) and the rest as 0 (good units). This produced 110 records labeled as 1 and 1275 labeled as 0. Then, we removed the emotion field and added a new field (Type = 1 or 0) to the original LLD feature set presented in section III part (B), hence forming a new database.
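A hedged sketch of this relabeling step is shown below, assuming a per-record matrix of predictions from the thirty-five classifiers; the names and data are placeholders.

```python
# Hedged sketch of the phase-one relabeling: count how many of the 35 classifiers
# misclassify each record and mark records misclassified by more than 24 as "bad".
import numpy as np

rng = np.random.default_rng(0)
n_records, n_classifiers = 1385, 35
true_labels = rng.choice(["happy", "angry", "surprised"], n_records)
# predictions[i, k] = label predicted for record i by classifier k (placeholder data).
predictions = np.where(
    rng.random((n_records, n_classifiers)) < 0.9,
    true_labels[:, None],
    rng.choice(["happy", "angry", "surprised"], (n_records, n_classifiers)),
)

misclassified_counts = (predictions != true_labels[:, None]).sum(axis=1)
record_type = (misclassified_counts > 24).astype(int)   # 1 = bad unit, 0 = good unit

print("bad units:", int(record_type.sum()), "good units:", int((record_type == 0).sum()))
```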

Next, we applied all thirty-five classification models on the new database and obtained a maximum accuracy of 92% with the ZeroR classifier. Observing the confusion matrix of the ZeroR classifier in Table II, we notice that all the records of type = 1 are misclassified.


Table II. Confusion matrix of ZeroR (phase one)

Classified as -> | Type = 0 | Type = 1
Type = 0 | 1275 | 0
Type = 1 | 110 | 0

[Figure I: classifier accuracy on the original database vs. the enhanced database]

B. PHASE TWO

Based on the results of phase one, all the records of type = 1 were removed from the original database presented in section III part (B), forming a new database of 1275 records. Next, all the classification methods were applied again on the new database. The SMO classifier again showed the highest accuracy, improving from 95.52% to 98.04%. An improvement of almost 3% was achieved by all classifiers (Figure I).

VERIFICATION

To verify the proposed two-phase enhancement model, two verification scenarios were applied. In the first verification scenario, the original dataset proposed in section III is divided into two parts: an 80% training dataset (1034 records) and a 20% testing dataset (277 records). Next, we applied the proposed two-phase model, where in phase one we trained all the classification models on the training set and tested them on the testing set. Results are shown in Table III.

For phase two, we removed the records that were misclassified by more than half of the classifiers (16 methods) from the training set only. The classification models were then trained again on the new training set and tested on the testing set. Results are shown in Table IV.
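A hedged end-to-end sketch of this verification protocol (80/20 split, phase-one misclassification counting on the training portion only, then retraining) is given below, using scikit-learn cross-validated predictions as a stand-in for the thirty-five Weka classifiers; all data and model choices are placeholders.

```python
# Hedged sketch of verification scenario one: remove "bad" records from the
# training split only, retrain, and evaluate on the untouched test split.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1311, 845))                       # placeholder features
y = rng.choice(["happy", "angry", "surprised"], 1311)  # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Phase one: count, per training record, how many classifiers misclassify it.
classifiers = [SVC(kernel="poly"), LogisticRegression(max_iter=1000),
               RandomForestClassifier(n_estimators=100, random_state=0)]
miscounts = np.zeros(len(y_tr), dtype=int)
for clf in classifiers:
    preds = cross_val_predict(clf, X_tr, y_tr, cv=5)   # out-of-fold predictions
    miscounts += (preds != y_tr).astype(int)

# Phase two: drop training records misclassified by more than half of the classifiers.
keep = miscounts <= len(classifiers) // 2
final_clf = SVC(kernel="poly").fit(X_tr[keep], y_tr[keep])
print("test accuracy:", final_clf.score(X_te, y_te))
```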

All the classifiers showed an enhancement of 2% to 6% when applied to the new training set (see Figure II). For the testing set, however, the Logistic classifier improved by 8.7%, and the BayesNet and Decision Table classifiers improved by 2.9% and 2.45%, respectively. Four classifiers (JRip, Decision Stump, OneR, and Naive Bayes Multinomial) improved by 1.9%, while the Attribute Selected and MultiClass Classifier improved by 0.36%. The remaining 25 classifiers showed a decline of between 1% and almost 4%.



Table III. Verification scenario one - phase one

Classification Method | Training Set (80%) | Testing Set (20%)
SMO | 94.85% | 96.74%
Simple Logistic | 94.76% | 94.93%
LMT | 94.76% | 94.93%
Random Subspace | 93.04% | 93.84%
Random Committee | 93.32% | 91.67%
Bagging | 93.59% | 93.12%
Iterative Classifier Optimizer | 92.68% | 92.03%
LogitBoost | 92.68% | 92.03%
Random Forest | 92.86% | 92.03%
MultiClass Classifier Updateable | 92.68% | 91.30%
Classification Via Regression | 93.22% | 92.03%
K nearest | 91.96% | 91.30%
Filtered Classifier | 89.61% | 90.94%
PART | 90.51% | 87.68%
Rep Tree | 90.24% | 89.13%
Attribute Selected | 87.90% | 86.96%
JRip | 87.44% | 85.14%
J48 | 87.53% | 90.22%
MultiClass Classifier | 86.09% | 82.25%
Decision Table | 84.64% | 82.25%
Random Tree | 82.38% | 83.70%
Logistic | 79.77% | 76.09%
AdaBoostM1 | 85.00% | 81.16%
Decision Stump | 79.49% | 77.17%
OneR | 78.95% | 74.64%
BayesNet | 74.25% | 69.20%
HoeffdingTree | 67.84% | 65.94%
Naive Bayes | 67.48% | 64.49%
Naive Bayes Updatable | 67.48% | 64.49%
Naive Bayes Multinomial | 69.29% | 61.96%
Randomizable Filtered Classifier | 64.68% | 69.57%
CV Parameter Selection | 53.75% | 52.90%
Weighted Instances Handler Wrapper | 53.75% | 52.90%
ZeroR | 53.75% | 52.90%
InputMapped Classifier | 53.75% | 52.90%

Table IV. Verification scenario one - phase two

Classification Method | Training Set (80%) | Testing Set (20%)
SMO | 97.97% | 94.20%
Simple Logistic | 97.77% | 92.75%
LMT | 97.77% | 92.75%
Random Subspace | 96.71% | 89.86%
Random Committee | 96.61% | 91.67%
Bagging | 95.84% | 90.58%
Iterative Classifier Optimizer | 97.00% | 89.49%
LogitBoost | 97.00% | 89.49%
Random Forest | 96.22% | 89.49%
MultiClass Classifier Updateable | 96.13% | 89.86%
Classification Via Regression | 95.84% | 89.13%
K nearest | 94.29% | 89.86%
Filtered Classifier | 91.58% | 87.32%
PART | 93.22% | 86.96%
Rep Tree | 91.67% | 88.04%
Attribute Selected | 91.87% | 87.32%
JRip | 92.35% | 86.23%
J48 | 94.00% | 89.49%
MultiClass Classifier | 85.19% | 82.61%
Decision Table | 88.19% | 84.78%
Random Tree | 86.35% | 84.42%
Logistic | 84.70% | 84.78%
AdaBoostM1 | 88.00% | 78.26%
Decision Stump | 84.12% | 78.26%
OneR | 83.93% | 75.72%
BayesNet | 80.74% | 72.10%
HoeffdingTree | 72.02% | 64.13%
Naive Bayes | 71.93% | 64.13%
Naive Bayes Updatable | 71.93% | 64.13%
Naive Bayes Multinomial | 72.12% | 63.04%
Randomizable Filtered Classifier | 67.47% | 67.75%
CV Parameter Selection | 57.12% | 52.90%
Weighted Instances Handler Wrapper | 57.12% | 52.90%
ZeroR | 57.12% | 52.90%
InputMapped Classifier | 57.12% | 52.90%


[Figure II: classifier accuracy on the phase-one vs. phase-two training sets]

In the second verification scenario, the original database is divided into three parts: a 60% training dataset (829 records), a 20% testing dataset (277 records), and another 20% testing dataset (277 records). The three datasets are mutually exclusive. For phase one, we trained the thirty-five classification models on the 60% training dataset and then tested them on the two testing datasets. Results are shown in Table V.

Table V. Verification scenario two - phase one

Classification Method | Training Set | Testing Set 1 | Testing Set 2
SMO | 93.90% | 94.20% | 95.95%
Simple Logistic | 94.35% | 93.12% | 96.40%
LMT | 94.58% | 93.12% | 96.40%
Random Subspace | 92.99% | 92.03% | 94.59%
Random Committee | 92.54% | 92.75% | 93.24%
Bagging | 93.67% | 91.30% | 90.99%
Iterative Classifier Optimizer | 93.33% | 92.75% | 93.24%
LogitBoost | 93.33% | 92.75% | 93.24%
Random Forest | 93.33% | 90.94% | 91.89%
MultiClass Classifier Updateable | 92.77% | 93.12% | 93.24%
Classification Via Regression | 91.19% | 91.30% | 90.54%
K nearest | 91.41% | 91.30% | 92.34%
Filtered Classifier | 89.83% | 89.49% | 89.19%
PART | 87.46% | 84.78% | 87.39%
Rep Tree | 89.72% | 86.23% | 86.49%
Attribute Selected | 87.68% | 86.96% | 87.39%
JRip | 87.01% | 85.87% | 83.33%
J48 | 88.36% | 89.13% | 87.84%
MultiClass Classifier | 80.57% | 75.36% | 80.18%
Decision Table | 82.03% | 82.97% | 81.98%
Random Tree | 79.55% | 81.16% | 81.08%
Logistic | 76.05% | 75.72% | 77.48%
AdaBoostM1 | 82.60% | 80.43% | 82.43%
Decision Stump | 80.34% | 77.54% | 76.13%
OneR | 78.64% | 76.09% | 77.03%
BayesNet | 73.67% | 68.84% | 73.42%
HoeffdingTree | 65.99% | 62.68% | 68.02%
Naive Bayes | 64.63% | 62.68% | 69.37%
Naive Bayes Updatable | 64.63% | 62.68% | 69.37%
Naive Bayes Multinomial | 69.60% | 63.04% | 68.02%
Randomizable Filtered Classifier | 61.13% | 64.86% | 69.37%
CV Parameter Selection | 54.58% | 52.90% | 50.45%
Weighted Instances Handler Wrapper | 54.58% | 52.90% | 50.45%
ZeroR | 54.58% | 52.90% | 50.45%
InputMapped Classifier | 54.58% | 52.90% | 50.45%



Next, we applied phase two by removing the records that were misclassified by more than 16 methods from the training dataset only, and all the classification models were trained again. The models were then tested on the two testing datasets; results are shown in Table VI. All classifiers showed a decline when applied to the new training dataset, except the OneR and the Randomizable Filtered Classifier, which improved by 2% and 4%, respectively, and the PART and Attribute Selected models, which improved by only 1%. For testing set one, eight methods showed an enhancement between 1% and 3% while the remaining methods showed a decline. On the second testing set, the JRip classifier improved by 5% and five classifiers improved by 1% to 2.7%, while the remaining classifiers showed a decline of at most 4%.

ANALYSIS

In this section we analyze and compare the results of the proposed enhancement model and the two verification scenarios. The two-phase enhancement model aims at building a classification model that learns only correct (well-classified) speech units. Obviously, removing the misclassified records improves the accuracy, since the dataset then contains only "good" data. However, in the first verification scenario, we notice an enhancement on the training set but not on the testing set. We believe that this is because the training set does not contain records with feature values similar to those in the testing set. Since there is no other ready-to-use Arabic speech corpus to serve as a testing database, we proposed a second verification scenario where the training set is 60% of the original database and two testing sets (20% each) are evaluated. Unfortunately, neither the training nor the testing sets showed any improvement, most likely because the training dataset is too small for the classifiers to learn from. A good solution to this problem is to increase the training dataset and add more videos to the corpus, so that the classifiers will be able to distinguish "bad" records from "good" ones more accurately.

Table VI. Verification scenario two - phase two

Classification Method | Training Set (modified) | Testing Set 1 | Testing Set 2
SMO | 92.61% | 94.20% | 94.59%
Simple Logistic | 93.64% | 95.65% | 95.50%
LMT | 93.64% | 93.48% | 95.50%
Random Subspace | 91.31% | 90.22% | 90.54%
Random Committee | 91.31% | 90.22% | 93.69%
Bagging | 91.44% | 90.22% | 93.69%
Iterative Classifier Optimizer | 91.44% | 90.22% | 91.44%
LogitBoost | 91.44% | 90.22% | 91.44%
Random Forest | 91.57% | 91.30% | 90.09%
MultiClass Classifier Updateable | 91.18% | 90.22% | 92.79%
Classification Via Regression | 90.14% | 88.77% | 87.84%
K nearest | 90.14% | 92.01% | 91.89%
Filtered Classifier | 85.73% | 87.68% | 88.29%
PART | 88.46% | 86.23% | 88.74%
Rep Tree | 86.25% | 89.13% | 89.19%
Attribute Selected | 89.11% | 87.68% | 87.39%
JRip | 85.99% | 83.33% | 89.19%
J48 | 86.25% | 86.23% | 84.23%
MultiClass Classifier | 76.13% | 75.36% | 77.48%
Decision Table | 82.62% | 82.97% | 82.43%
Random Tree | 78.99% | 81.88% | 82.88%
Logistic | 75.10% | 71.74% | 71.62%
AdaBoostM1 | 82.36% | 80.43% | 81.53%
Decision Stump | 79.51% | 77.17% | 76.13%
OneR | 80.67% | 78.26% | 77.03%
BayesNet | 73.41% | 69.93% | 75.23%
HoeffdingTree | 64.46% | 63.04% | 68.02%
Naive Bayes | 63.42% | 63.77% | 67.57%
Naive Bayes Updatable | 63.42% | 63.77% | 67.57%
Naive Bayes Multinomial | 67.96% | 61.59% | 68.47%
Randomizable Filtered Classifier | 64.72% | 67.03% | 64.86%
CV Parameter Selection | 55.12% | 52.90% | 50.45%
Weighted Instances Handler Wrapper | 55.12% | 52.90% | 50.45%
ZeroR | 55.12% | 52.90% | 50.45%
InputMapped Classifier | 55.12% | 52.90% | 50.45%

CONTRIBUTIONS AND FUTURE WORK

In this paper we introduce an enhancement model to improve the recognition of emotions from Arabic speech. This contribution is language-independent and may be used by other researchers to improve their results. Increasing the corpus by adding more speech units would be of great benefit for building a successful classification model to recognize happy, angry and surprised emotions from Arabic speech.

REFERENCES

[1] Klaylat, S., Zantout, R., Hamandi, L. and Osman, Z., 2017, "Emotion Recognition in Arabic Speech," Proc. IEEE International Conference on Sensors, Networks, Smart and Emerging Technologies, Beirut, Lebanon.
[2] Ekman, P., 1971, "Universals and Cultural Differences in Facial Expressions of Emotion," Proc. Nebraska Symp. Motivation, pp. 207-283.
[3] Ekman, P., 1982, "Emotion in the Human Face," 2nd ed., Cambridge Univ. Press.
[4] Ekman, P. and Oster, H., 1979, "Facial Expressions of Emotion," Ann. Rev. Psychology, 30, pp. 527-554.
[5] Clavel, C., Vasilescu, I., Devillers, L., Richard, G., and Ehrette, T., 2008, "Fear-Type Emotion Recognition for Future Audio-Based Surveillance Systems," Speech Communication, 50(6), pp. 487-503.
[6] Cowie, R. and Cornelius, R., 2003, "Describing the Emotional States that are Expressed in Speech," Speech Communication, 40, pp. 5-32.
[7] Kehrein, R., 2002, "The Prosody of Authentic Emotions," Proc. Speech Prosody, Aix-en-Provence, pp. 423-426.
[8] Douglas-Cowie, E., Devillers, L., Martin, J.C., Cowie, R., Savvidou, S., Abrilian, S. and Cox, C., 2005, "Multimodal Databases of Everyday Emotion: Facing Up to Complexity," Proc. 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, pp. 813-816.
[9] Devillers, L., Vidrascu, L., and Lamel, L., 2005, "Challenges in Real-Life Emotion Annotation and Machine Learning Based Detection," Neural Networks, 18(4), pp. 407-422.
[10] Engberg, I., and Hansen, A., 1996, "Documentation of the Danish Emotional Speech Database DES," Center for Person Kommunikation, Institute of Electronic Systems, Aalborg University, Denmark.
[11] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. and Weiss, B., 2005, "A Database of German Emotional Speech," Proc. of Interspeech, Lisbon, pp. 1517-1520.
[12] University of Pennsylvania Linguistic Data Consortium, Emotional Prosody Speech and Transcripts, http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC200228S, July 2002.
[13] Jovicic, S.T., Kacic, Z., Dordevic, M., and Rajkovic, M., 2004, "Serbian Emotional Speech Database: Design, Processing and Evaluation," Proc. 9th Conference on Speech and Computer, SPECOM'04, St. Petersburg, Russia, pp. 77-81.
[14] Nwe, T.L., 2003, "Analysis and Detection of Human Emotion and Stress from Speech Signals," Ph.D. thesis, Department of Electrical and Computer Engineering, National University of Singapore.
[15] Breazeal, C., and Aryananda, L., 2002, "Recognition of Affective Communicative Intent in Robot-Directed Speech," Autonomous Robots, 12(1), pp. 83-104.
[16] Zhou, J., Wang, G., Yang, Y., and Chen, P., 2006, "Speech Emotion Recognition Based on Rough Set and SVM," Proc. 5th IEEE International Conference on Cognitive Informatics, ICCI, 1, pp. 53-61.
[17] Busso, C., Bulut, M., Lee, C.C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., and Narayanan, S.S., 2008, "IEMOCAP: Interactive Emotional Dyadic Motion Capture Database," Journal of Language Resources and Evaluation, 42(4), pp. 335-359.
[18] Schiel, F., Steininger, S., and Turk, U., 2002, "The SmartKom Multimodal Corpus at BAS," Proc. 3rd Language Resources and Evaluation Conference, LREC'02, Canary Islands, Spain, pp. 200-206.
[19] Batliner, A., Hacker, C., Steidl, S., Noth, E., D'Arcy, S., Russell, M., and Wong, M., 2004, "'You Stupid Tin Box' - Children Interacting with the AIBO Robot: A Cross-Linguistic Emotional Speech Corpus," Proc. 4th Language Resources and Evaluation Conference LREC, Lisbon, Portugal, pp. 171-174.
[20] Grimm, M., Kroschel, K., and Narayanan, S., 2008, "The Vera am Mittag German Audio-Visual Emotional Speech Database," Proc. IEEE International Conference on Multimedia and Expo (ICME), Hannover, Germany, pp. 865-868.
[21] Morrison, D., Wang, R. and De Silva, L., 2007, "Ensemble Methods for Spoken Emotion Recognition in Call-Centers," Speech Communication, 49(2), pp. 98-112.
[22] Wiggins, J.S., 1996, "The Five-Factor Model of Personality," Guilford Press, New York.
[23] Mohammadi, G., Vinciarelli, A., and Mortillaro, M., 2010, "The Voice of Personality: Mapping Nonverbal Vocal Behavior into Trait Attributions," Proc. SSPW, Florence, pp. 17-20.
[24] Schuller, B., Muller, R., Eyben, F., Gast, J., Hornler, B., Wollmer, M., Rigoll, G., Hothker, A., and Konosu, H., 2009, "Being Bored? Recognizing Natural Interest by Extensive Audiovisual Integration for Real-Life Application," Image and Vision Computing, 27, pp. 1760-1774.
[25] Vidrascu, L., and Devillers, L., 2006, "Real-Life Emotions in Naturalistic Data Recorded in a Medical Call Center," Proc. 1st International Workshop on Emotion: Corpora for Research on Emotion and Affect (International Conference on Language Resources and Evaluation), Genoa, Italy, pp. 20-24.
[26] Koolagudi, S.G., and Sreenivasa Rao, K., 2012, "Emotion Recognition from Speech: A Review," Int. J. Speech Technol.
[27] Mubarak, O.M., Ambikairajah, E., and Epps, J., 2005, "Analysis of an MFCC-Based Audio Indexing System for Efficient Coding of Multimedia Sources," Proc. 8th International Symposium on Signal Processing and its Applications, Sydney, Australia, pp. 28-31.
[28] Pao, T.L., Chen, Y.T., Yeh, J.H., and Liao, W.Y., 2005, "Combining Acoustic Features for Improved Emotion Recognition in Mandarin Speech," in ACII (LNCS 3784), Springer-Verlag Berlin Heidelberg, pp. 279-285.
[29] Pao, T.L., Chen, Y.T., Yeh, J.H., Cheng, Y.M., and Chien, C.S., 2007, "Feature Combination for Better Differentiating Anger from Neutral in Mandarin Emotional Speech," LNCS 4738, ACII 2007, Springer-Verlag Berlin Heidelberg.
[30] Davis, S., and Mermelstein, P., 1980, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. Audio, Speech Lang. Process., 28, pp. 357-366.
[31] Rabiner, L.R., and Juang, B.H., 1993, "Fundamentals of Speech Recognition," Englewood Cliffs, New Jersey, Prentice-Hall.
[32] Chauhan, A., Koolagudi, S.G., Kafley, S., and Rao, K.S., 2010, "Emotion Recognition Using LP Residual," Proc. IEEE TechSym, West Bengal, India.
[33] Koolagudi, S.G., Reddy, R., and Rao, K.S., 2010, "Emotion Recognition from Speech Signal Using Epoch Parameters," Proc. International Conference on Signal Processing and Communications (SPCOM), IISc, Bangalore, India, pp. 1-5, New York: IEEE Press.
[34] Iliev, A.I., and Scordilis, M.S., 2001, "Spoken Emotion Recognition Using Glottal Symmetry," EURASIP Journal on Advances in Signal Processing, 1(11).
[35] Makhoul, J., 1975, "Linear Prediction: A Tutorial Review," Proc. of the IEEE, 63(4), pp. 561-580.
[36] Murray, I.R., and Arnott, J.L., 2008, "Applying an Analysis of Acted Vocal Emotions to Improve the Simulation of Synthetic Speech," Computer Speech and Language, 22(2), pp. 107-129.
[37] Li, Y., and Zhao, Y., 1999, "Recognizing Emotions in Speech Using Short-Term and Long-Term Features," Eurospeech, Budapest.
[38] Wang, Y., Meng, Q., and Li, P., 2009, "Emotional Feature Analysis and Recognition in Multilingual Speech Signal," Proc. Electronic Measurement and Instruments (ICEMI), Beijing.
[39] Vidrascu, L., and Devillers, L., 2007, "Five Emotion Classes Detection in Real-World Call Center Data: The Use of Various Types of Paralinguistic Features," LIMSI-CNRS, France.
[40] Ayadi, M.E., Kamel, M.S., and Karray, F., 2011, "Survey on Speech Emotion Recognition: Features, Classification Schemes, and Databases," Pattern Recognition, pp. 572-587.
[41] Xie, B., Chen, L., Chen, G.C., and Chen, C., 2007, "Feature Selection for Emotion Recognition of Mandarin Speech," Journal of Zhejiang University (Engineering Science), 41(11), pp. 1816-1822.
[42] Chuang, Z.J. and Wu, C.H., 2004, "Emotion Recognition Using Acoustic Features and Textual Content," Proc. IEEE International Conference on Multimedia and Expo, 1, pp. 53-56.
[43] Yu, F., Chang, E., Xu, Y.Q., and Shum, H.Y., 2001, "Emotion Detection from Speech to Enrich Multimedia Content," Proc. 2nd IEEE Pacific-Rim Conference on Multimedia, Beijing, China.
[44] Hoch, S., Althoff, F., McGlaun, G., and Rigoll, G., 2005, "Bimodal Fusion of Emotional Data in an Automotive Environment," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2, pp. 1085-1088.
[45] Hoch, S., Althoff, F., McGlaun, G., and Rigoll, G., 2005, "Bimodal Fusion of Emotional Data in an Automotive Environment," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2, pp. 1085-1088.
[46] Lugger, M., and Yang, B., 2007, "The Relevance of Voice Quality Features in Speaker Independent Emotion Recognition," Proc. ICASSP, Honolulu, Hawaii, USA, pp. IV17-IV20, New York: IEEE Press.
[47] Iliou, T., and Anagnostopoulos, C.N., 2009, "Statistical Evaluation of Speech Features for Emotion Recognition," Proc. 4th International Conference on Digital Telecommunications, Colmar, France, pp. 121-126.
[48] Luengo, E., Navas, I., Hernáez, I., and Sánchez, I., 2005, "Automatic Emotion Recognition Using Prosodic Parameters," Proc. INTERSPEECH, Lisbon, Portugal, pp. 493-496.
[49] Lee, C.M., Narayanan, S.S., and Pieraccini, R., 2001, "Recognition of Negative Emotions from the Speech Signal," Proc. ASRU, pp. 240-243.
[50] Litman, D.J. and Forbes-Riley, K., 2006, "Recognizing Student Emotions and Attitudes on the Basis of Utterances in Spoken Tutoring Dialogues With Both Human and Computer Tutors," Speech Comm., 48, pp. 559-590.
[51] http://android-apps.com/apps/skc-interpret/, Android Apps website.
[52] http://appcrawlr.com/android/sprint-mobile-ip, Sprint Mobile IP, App Crawlr website.
[53] Petrushin, V., 2000, "Emotion Recognition in Speech Signal: Experimental Study, Development, and Application," Proc. 6th International Conference on Spoken Language Processing, Beijing, China.
[54] Burkhardt, F., Eckert, M., Johannsen, W., and Stegmann, J., 2010, "A Database of Age and Gender Annotated Telephone Speech," Proc. LREC, Valletta, Malta, pp. 1562-1565.
[55] Schiel, F., and Heinrich, C., 2009, "Laying the Foundation for In-Car Alcohol Detection by Speech," Proc. Interspeech, Brighton, UK, pp. 983-986.
[56] Burkhardt, F., Schuller, B., Weiss, B. and Weninger, F., 2011, "'Would You Buy a Car from Me?' - On the Likability of Telephone Voices," Proc. Interspeech, Florence, pp. 1557-1560.
[57] Mohammadi, G., Vinciarelli, A., and Mortillaro, M., 2010, "The Voice of Personality: Mapping Nonverbal Vocal Behavior into Trait Attributions," Proc. SSPW, Florence, pp. 17-20.
[58] http://www.youtube.com/watch?v=uvhNyAXFTMQ, "Egypt Today show," Alfaraiin channel, May 25, 2012.
[59] http://www.youtube.com/watch?v=S1T_EKDpIR8, "New Cairo show," Al-Hayat channel, May 28, 2012.
[60] http://www.youtube.com/watch?v=2v6X2VEjb4k, "Laka Sumt, AlUraify show," September 12, 2012.
[61] http://www.youtube.com/watch?v=MQv3tKTwm7k, Zain telecommunication, January 22, 2009.
[62] http://www.youtube.com/watch?v=16qNcn03G3s, Prince Sultan bin Fahed call, Alriyadiya sport channel, October 6, 2011.
[63] http://www.youtube.com/watch?v=E4TqhBo1SCk, Althaqafiya channel, January 7, 2012.
[64] http://www.youtube.com/watch?v=Wpf3OxEdJak, "Dairat al Dawe," Noon channel, Haifa Wehbe call, November 19, 2009.
[65] http://www.youtube.com/watch?v=eBznv9QNU7M, "Musalsalati show," Mona Zaki call, June 12, 2011.
[66] Schuller, B., Steidl, S. and Batliner, A., 2009, "The Interspeech 2009 Emotion Challenge," Proc. Interspeech, Brighton, UK, pp. 312-315.
[67] Eyben, F., Wöllmer, M. and Schuller, B., 2010, "openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor," ACM.
[68] https://statistics.laerd.com/spss-tutorials/kruskal-wallis-h-test-using-spss-statistics.php, "Kruskal-Wallis H Test Using SPSS Statistics," Laerd Statistics website.
[69] Platt, J.C., 1998, "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines," Technical Report MSR-TR-98-14.
[70] Vapnik, V., 1995, "The Nature of Statistical Learning Theory," Springer, New York.
[71] Fisher, W., Doddington, G., and Goudie-Marshall, K., 1986, "The DARPA Speech Recognition Research Database: Specifications and Status," Proc. of the DARPA Workshop on Speech Recognition, pp. 93-99.
[72] Vlasenko, B., and Wendemuth, A., 2009, "Processing Affected Speech within Human Machine Interaction," Proc. Annual Conf. of the International Speech Communication Association, Interspeech.
[73] Vidrascu, L., and Devillers, L., 2005, "Real-Life Emotion Representation and Detection in Call Centers Data," LNCS 3784, ACII, Berlin, Springer, pp. 739-746.
[74] Zhang, S., 2008, "Emotion Recognition in Chinese Natural Speech by Combining Prosody and Voice Quality Features," in Sun, et al. (Eds.), Advances in Neural Networks, Lecture Notes in Computer Science, Berlin, Springer, pp. 457-464.