
Int J Speech Technol (2013) 16:143–160. DOI 10.1007/s10772-012-9172-2

Emotion recognition from speech using global and local prosodic features

K. Sreenivasa Rao · Shashidhar G. Koolagudi · Ramu Reddy Vempada

Received: 10 May 2012 / Accepted: 20 July 2012 / Published online: 4 August 2012. © Springer Science+Business Media, LLC 2012

Abstract In this paper, global and local prosodic features extracted from sentences, words and syllables are proposed for speech emotion or affect recognition. In this work, duration, pitch, and energy values are used to represent the prosodic information for recognizing the emotions from speech. Global prosodic features represent gross statistics such as mean, minimum, maximum, standard deviation, and slope of the prosodic contours. Local prosodic features represent the temporal dynamics in the prosody. In this work, global and local prosodic features are analyzed separately and in combination at different levels for the recognition of emotions. In this study, we have also explored the words and syllables at different positions (initial, middle, and final) separately, to analyze their contribution towards the recognition of emotions. In this paper, all the studies are carried out using the simulated Telugu emotion speech corpus (IITKGP-SESC). These results are compared with the results obtained on the internationally known Berlin emotion speech corpus (Emo-DB). Support vector machines are used to develop the emotion recognition models. The results indicate that the recognition performance using local prosodic features is better than the performance of global prosodic features. Words in the final position of the sentences, and syllables in the final position of the words, exhibit more emotion discriminative information compared to the words and syllables present in the other positions.

K.S. Rao (✉) · S.G. Koolagudi · R.R. Vempada
School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur 721302, West Bengal, India
e-mail: [email protected]

S.G. Koolagudi
e-mail: [email protected]

R.R. Vempada
e-mail: [email protected]

Keywords Emotion recognition · Global prosodic features · Local prosodic features · Emo-DB · IITKGP-SESC · Vowel onset point · Segment-wise emotion recognition · Region-wise emotion recognition

1 Introduction

Imposition of duration, intonation, and intensity patterns on the sequence of sound units, while producing speech, makes the speech natural. Lack of prosody knowledge can easily be perceived from the speech. Prosody can be viewed as speech features associated with larger units such as syllables, words, phrases and sentences. Consequently, prosody is often considered as supra-segmental information. The prosody appears to structure the flow of speech. The prosody is represented acoustically by the patterns of duration, intonation (F0 contour), and energy. They represent the perceptual properties of speech, which are normally used by human beings to perform various speech tasks (Rao and Yegnanarayana 2006; Werner and Keller 1994). Human beings mostly use the prosodic cues for identifying the emotions present in day-to-day conversations. For instance, pitch and energy values are high for an active emotion like anger, whereas the same parameters are comparatively low for a passive emotion like sadness. The duration used for expressing anger is shorter than the duration used for sadness (see Table 2). Most of the existing works on emotion recognition focused on spectral features and gross statistics of prosody (Schuller et al. 2011; Koolagudi and Rao 2012a, 2012b, 2012c, 2011, 2011; Rao et al. 2011; Rao 2011a, 2011b).

In the literature, prosodic features such as energy, duration, pitch and their derivatives are treated as high correlates


Table 1 Literature review on emotion recognition using prosodic features

Speech emotion research using prosodic features

Features | Purpose and approach | Ref.

Initially 86 prosodic features are used; later the best 6 features are chosen from the list | Identification of 4 emotions in the Basque language. Around 92 % emotion recognition performance is achieved using GMMs. | (Luengo et al. 2005)

35-dimensional prosodic feature vectors including pitch, energy, and duration are used | Classification of seven emotions of the Berlin emotion speech corpus. Around 51 % emotion recognition results are obtained for speaker independent cases using neural networks. | (Iliou and Anagnostopoulos 2009)

Pitch and energy based features are extracted from frame, syllable, and word levels | Recognizing 4 emotions in Mandarin. Combination of features from frame, syllable and word levels yielded 90 % emotion recognition performance. | (Kao and Lee 2006)

Duration, energy, and pitch based features | Recognizing emotions in the Mandarin language. Sequential forward selection (SFS) is used to select the best features from the pool of prosodic features. Emotion classification studies are conducted on a multi-speaker multi-lingual database. Modular neural networks are used as classifiers. | (Zhu and Luo 2007)

Eight static prosodic features and voice quality features | Classification of 6 emotions (anger, anxiety, boredom, happiness, neutral, and sadness) from the Berlin emotion speech corpus. Speaker independent emotion classification is performed using Bayesian classifiers. | (Lugger and Yang 2007)

Energy, pitch and duration based features | Classification of 6 emotions from the Mandarin language. Around 88 % emotion recognition performance is reported using SVM and genetic algorithms. | (Wang et al. 2008)

Prosody and voice quality based features | Classification of 4 emotions, namely anger, joy, neutral, and sadness, from the Mandarin language. Around 76 % emotion recognition performance is reported using support vector machines (SVMs). | (Zhang 2008)

of emotions (Dellaert et al. 1996; Lee and Narayanan 2005; Nwe et al. 2003; Schroder and Cowie 2006; Banziger and Scherer 2005; Cowie and Cornelius 2003). Features such as minimum, maximum, mean, variance, range and standard deviation of energy and pitch are used as important prosodic information sources for discriminating the emotions (Schroder 2001; Murray and Arnott 1995). Steepness of the F0 contour during rises and falls, articulation rate, and the number and duration of pauses are explored in Cahn (1990) and Murray and Arnott (1995) for characterizing the emotions. Prosodic features extracted from smaller linguistic units at the level of consonants and vowels are also used for analyzing the emotions (Murray and Arnott 1995). The importance of prosodic contour trends in the context of different emotions is discussed in Murray et al. (1996) and Scherer (2003). Peaks and troughs in the profiles of fundamental frequency and intensity, and durations of pauses and bursts, are proposed for identifying four emotions, namely fear, anger, sadness and joy (McGilloway et al. 2000). The

sequences of frame-wise prosodic features, extracted from longer speech segments such as words and phrases, are also used to categorize the emotions present in the speech (Nwe et al. 2003). F0 information is analyzed for emotion classification, and it is reported that the minimum, maximum and median values of F0 and the slopes of F0 contours are emotion salient features. Around 80 % emotion recognition accuracy is achieved using the proposed F0 features with a K-nearest neighbor classifier (Dellaert et al. 1996). Short time supra-segmental features such as pitch, energy, formant locations and their bandwidths, dynamics of pitch, energy and formant contours, and speaking rate are used for analyzing the emotions (Ververidis and Kotropoulos 2006). The complex relations between pitch, duration and energy parameters are exploited in Iida et al. (2003) for detecting speech emotions. Table 1 summarizes some of the other important and recent works on speech emotion recognition using prosodic features.


From the literature, it is observed that most of the speech emotion recognition studies are carried out using sentence level static (global) prosodic features (Nwe et al. 2003; Schroder and Cowie 2006; Dellaert et al. 1996; Koolagudi et al. 2009; Ververidis et al. 2004; Iida et al. 2003; Schuller et al. 2011; Schuller 2012). Very few attempts have explored the dynamic behavior of prosodic patterns (local) for recognizing speech emotions (McGilloway et al. 2000; Rao et al. 2010). Elementary prosodic analysis of speech utterances is carried out in Rao et al. (2007), at sentence, word, and syllable levels, using only the first order statistics of basic prosodic parameters. However, the time-varying dynamic nature of prosodic contours seems to be more emotion specific and has not attracted much research attention.

Generally, emotion specific prosodic cues may not be present uniformly at all positions of the utterance. Some emotions like anger are dominantly perceivable from the initial portion of the utterances, whereas surprise is dominantly expressed at the final part of the utterance. In this context, it is important to study the contribution of static and dynamic (i.e. global and local) prosodic features extracted from sentence, word and syllable segments toward emotion recognition. The approach of recognizing emotions from shorter speech segments may further be helpful for real time emotion verification. None of the existing studies has explored the speech segments with respect to their positional information for identifying the emotions. In this work, prosodic features derived from the words and syllables are analyzed with respect to their positions (i.e. initial, middle, and final) to study their contribution toward recognition of emotions from speech. The study on these issues is carried out by performing the following tasks: (1) analysis of speech emotions using global prosodic features, (2) investigating local prosodic features at the sentence, word, and syllable levels for discriminating the emotions, (3) combination of evidence due to static (global) and dynamic (local) prosodic features at different levels to recognize the emotions, and (4) in all the above three cases, positional information of words and syllables in the sentences and words is analyzed for recognizing the emotions. Support vector machine (SVM) models are used for developing the emotion recognition models for discriminating the emotions.

The rest of the paper is organized as follows. Section 2 gives the details of the two databases used in this work. Section 3 discusses the importance of prosodic features in classifying the speech emotions and briefly mentions the motivation behind this study. Section 4 explains the details of extraction of static and dynamic prosodic features from various segments of speech utterances. Evaluation of the developed emotion recognition models and their performance are discussed in Sect. 5. A brief summary of the present work, along with the conclusions, is given in Sect. 6.

2 Speech databases

In this work, two speech corpora are used to validate the approach and features proposed for emotion recognition from speech. An acted-affect Telugu speech database is used as the primary source, and the results obtained are compared with the results on the internationally known Berlin speech database. Brief details of both the databases are given below. In this paper, we have used the terms emotion and affect interchangeably. Among the 8 emotions considered in this work, sarcasm and compassion are mainly attitudes rather than independent emotions. Therefore, the term affect is more appropriate for the cases mentioned above.

Telugu is a south-central Dravidian language primarily spoken in the state of Andhra Pradesh, India, where it is an official language. According to the 2001 census of India, Telugu is the language with the third largest number of native speakers in India (74 million), thirteenth in the Ethnologue list of most-spoken languages worldwide, and the most spoken Dravidian language. It is one of the twenty-two scheduled languages of the Republic of India and one of the four classical languages. Telugu was heavily influenced by Sanskrit and Prakrit. Telugu words generally end in vowels. Telugu features a form of vowel harmony wherein the second vowel in disyllabic noun and adjective roots alters depending on whether the first vowel is tense or lax. If the second vowel is open (i.e., /a:/ or /a/), then the first vowel will be more open and centralized. Telugu words also have vowels in inflectional suffixes harmonized with the vowels of the preceding syllable. The prosodic patterns in Telugu are mostly similar to those of other Indian languages such as Hindi, Bengali and Tamil. For neutral sentences, generally, the sequence of pitch values is high at the initial words and slowly decreases towards the final words. In emotional sentences, some specific words show high intonation and stress patterns.

The speech corpus, IITKGP-SESC, used in this study was recorded using 10 (5 male and 5 female) professional artists from All India Radio (AIR) Vijayawada, India. The artists were sufficiently experienced in expressing the desired emotions from the neutral sentences. All the artists are in the age group of 25–40 years and had professional experience of 8–12 years. For analyzing the emotions we considered 15 semantically neutral Telugu sentences. Each of the artists had to speak the 15 sentences in the 8 given emotions in one session. The number of sessions considered for preparing the database was 10. The total number of utterances in the database was 12000 (15 sentences × 8 emotions × 10 artists × 10 sessions). Each emotion had 1500 utterances. The number of words and syllables in the sentences varied from 3–6 and 11–18 respectively. The total duration of the database was around 7 hours. The eight emotions considered for collecting the proposed speech corpus were: Anger, Compassion, Disgust,


Fear, Happiness, Neutral, Sarcastic and Surprise. The speech samples were recorded using a SHURE dynamic cardioid microphone C660N. The distance between the microphone and the speaker was maintained at approximately 3–4 inches. The speech signal was sampled at 16 kHz, and each sample is represented as a 16-bit number. The sessions were recorded on alternate days to capture the inevitable variability in the human vocal tract system. In each session, all the artists gave recordings of the 15 sentences in the 8 emotions. The recording was carried out in such a way that each artist had to speak all the sentences at a stretch in a particular emotion. This provides coherence among the sentences for each emotion category. The entire speech database was recorded using a single microphone and at the same location. The recording was done in a quiet room, without any obstacles in the recording path.

The quality of the database was also evaluated using subjective listening tests. Here, the quality represents how well the artists simulated the emotions from the neutral text. The subjects were asked to assess the naturalness of the emotions embedded in the speech utterances. This evaluation was carried out by 25 post-graduate and research students of Indian Institute of Technology, Kharagpur. This subjective listening test was useful for the comparative analysis of emotions from the human versus machine perspective. In this study, 40 sentences (5 sentences from each emotion) randomly selected from male and female speakers were considered for evaluation. Before taking the test, the subjects were given pilot training by playing 8 sentences (a sentence from each emotion) from each artist's speech data, for familiarizing them with the characteristics of emotion expression. The forty sentences used in this evaluation were randomly ordered and played to the listeners. For each sentence, the listener had to mark the emotion category from the set of 8 given emotions. The overall emotion classification performance for male and female speech data was observed to be 61 % and 66 % respectively.

In this work, speaker and text independent emotional speech data is used for analyzing the emotion recognition. Here, training is performed with 8 speakers' (4 male and 4 female) speech data from all 10 sessions. Testing is performed with the remaining 2 speakers' (one male and one female) speech data. To realize text independence, during training the speech utterances corresponding to the first 10 text prompts of the database were used, and the remaining 5 text prompts were used for testing. The development of the emotion recognition models and their verification is discussed in the following sections.
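To make this evaluation protocol concrete, the split described above can be sketched as follows. This is only an illustration: the metadata record and the speaker identifiers (M1 to M5, F1 to F5) are hypothetical, since the paper does not prescribe a file or naming layout for IITKGP-SESC.

```python
# Sketch of the speaker- and text-independent split described above.
# The Utterance record, speaker ids and sentence numbering are hypothetical.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str    # e.g. "M1".."M5", "F1".."F5"
    sentence: int   # text prompt id, 1..15
    emotion: str    # one of the 8 emotion labels
    path: str       # path to the corresponding .wav file

TRAIN_SPEAKERS = {"M1", "M2", "M3", "M4", "F1", "F2", "F3", "F4"}
TEST_SPEAKERS = {"M5", "F5"}

def split_corpus(utterances):
    """Train: 8 speakers, text prompts 1-10 (all 10 sessions).
    Test: the 2 held-out speakers, text prompts 11-15."""
    train = [u for u in utterances
             if u.speaker in TRAIN_SPEAKERS and u.sentence <= 10]
    test = [u for u in utterances
            if u.speaker in TEST_SPEAKERS and u.sentence > 10]
    return train, test
```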

Burkhardt et al. (2005) collected the actor-based simulated emotion Berlin database in the German language. Ten (5 male + 5 female) actors contributed to preparing the database. The emotions recorded in the database were anger, boredom, disgust, fear, happiness, neutral and

sadness. Ten linguistically neutral German sentences were chosen for database construction. The database was recorded using a Sennheiser MKH 40 P48 microphone, with a sampling frequency of 16 kHz. Samples were stored as 16-bit numbers. Eight hundred and forty (840) utterances of Emo-DB were used in this work. In the case of the Berlin database, 8 speakers' speech data was used for training the models and the remaining 2 speakers' speech data was used for validating the trained models.

3 Motivation

Normally, human beings use the dynamics of long term speech features like the energy profile, intonation pattern, duration variations and formant tracks to perceive and process the emotional content of speech utterances (Benesty et al. 2008; Rao 2005). This might be the main reason for the extensive use of prosodic features by most of the research community for speech emotion processing. However, many times humans tend to get confused while distinguishing emotions that share similar acoustic and prosodic properties. In real situations, humans are helped by linguistic, contextual and other modalities like facial expressions while interpreting the emotions from the speech. From the machine perspective, combining some of these modalities would improve the emotion recognition performance. In this study the static and dynamic prosodic parameters are explored for classifying the emotions.

Static prosodic values derived from the sequences of duration, energy, and pitch values of the sentences from IITKGP-SESC are explored for classifying the emotions. The mean duration is calculated by averaging the durations of all sentences. Mean pitch is computed by averaging the frame level pitch values over all sentences. Mean energy is the average of the frame level energies calculated for each sentence. Frames of size 20 ms and a shift of 10 ms are used for the above calculations. Though this statistical analysis of prosody is very simple, it gives a clear insight into the emotion specific knowledge present in the prosodic features. Table 2 gives the mean and standard deviation values of the prosodic parameters derived over the database IITKGP-SESC. Table 3 shows the emotion recognition results based on the above prosodic parameters. Here a simple Euclidean distance measure is used to classify the eight emotions (a sketch of this baseline is given below). Here, columns represent the classified emotions (affects) and rows represent the emotions used for testing or validation. Average emotion recognition performances of around 45 % and 51 % are observed in the cases of male and female speech respectively. From the results, it may be observed that there are misclassifications among high arousal emotions like anger, happiness, and fear.
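As a concrete reading of this baseline, the sketch below computes a three-value mean prosodic vector per sentence and assigns the emotion whose training mean is closest in Euclidean distance. It is only illustrative: the paper does not state whether any feature scaling was applied, and none is applied here.

```python
import numpy as np

def mean_prosody(duration_s, pitch_contour, energy_contour):
    """Sentence-level vector [duration, mean F0 over voiced frames, mean frame energy].
    Frames are assumed to be 20 ms with a 10 ms shift; unvoiced frames carry F0 = 0."""
    pitch = np.asarray(pitch_contour, dtype=float)
    voiced = pitch[pitch > 0]
    return np.array([duration_s, voiced.mean(), np.mean(energy_contour)])

def euclidean_classify(test_vec, emotion_means):
    """Assign the emotion whose mean prosodic vector is closest in Euclidean distance."""
    return min(emotion_means, key=lambda emo: np.linalg.norm(test_vec - emotion_means[emo]))

# emotion_means maps each emotion to its average [duration, pitch, energy] vector
# estimated on the training data, e.g. the male-artist entries of Table 2:
# {"Anger": np.array([1.76, 195.60, 203.45]), "Disgust": np.array([1.62, 188.05, 118.94]), ...}
```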


Table 2 Mean and standard deviation values of the prosodic parameters for each of the emotions of IITKGP-SESC. Standard deviation values are given within braces.

Emotion | Male artist: Duration (s), Pitch (Hz), Energy | Female artist: Duration (s), Pitch (Hz), Energy

Anger 1.76 (0.32) 195.60 (37.12) 203.45 (41.38) 1.80 (0.31) 301.67 (53.56) 103.36 (18.12)

Disgust 1.62 (0.27) 188.05 (32.36) 118.94 (23.12) 1.67 (0.28) 308.62 (56.21) 90.19 (15.65)

Fear 1.79 (0.34) 210.70 (28.74) 263.68 (48.52) 1.89 (0.31) 312.07 (48.97) 144.87 (22.45)

Happiness 2.03 (0.41) 198.30 (39.41) 164.68 (26.44) 2.09 (0.46) 287.78 (50.45) 83.65 (12.78)

Neutral 1.93 (0.45) 184.37 (24.12) 160.44 (31.53) 2.04 (0.40) 267.13 (55.53) 83.42 (11.93)

Sadness 2.09 (0.47) 204.00 (28.43) 225.98 (42.89) 2.13 (0.49) 294.33 (42.67) 86.00 (14.54)

Sarcastic 2.16 (0.42) 188.44 (33.12) 120.03 (18.32) 2.20 (0.44) 301.11 (52.49) 75.26 (11.89)

Surprise 2.05 (0.39) 215.75 (45.76) 202.06 (48.65) 2.09 (0.42) 300.10 (57.09) 86.72 (16.34)

Table 3 Emotion classification performance (in %) using prosodic features. Abbreviations: Ang.—Anger, Dis.—Disgust, Hap.—Happiness, Neu.—Neutral, Sad.—Sadness, Sar.—Sarcastic, Sur.—Surprise

Emo. Ang. Dis. Fear Hap. Neu. Sad. Sar. Sur.

Male, average: 45.38

Ang. 37 0 10 23 13 0 0 17

Dis. 0 40 0 3 20 27 7 3

Fear 17 0 63 0 13 0 0 7

Hap. 33 0 10 37 10 0 0 10

Neu. 0 10 0 0 53 27 10 0

Sad. 0 0 0 0 23 60 17 0

Sar. 0 27 0 0 10 20 43 0

Sur. 23 27 0 0 20 0 0 30

Female, average: 50.88

Ang. 43 0 10 17 13 0 0 17

Dis. 0 47 0 0 3 0 27 23

Fear 13 0 67 10 3 0 0 7

Hap. 10 0 13 57 10 0 0 10

Neu. 0 17 0 0 50 23 0 10

Sad. 0 7 0 0 33 60 0 0

Sar. 0 33 0 0 17 0 43 7

Sur. 3 37 10 0 0 10 0 40

Similar observations with respect to slow emotions such as disgust, sadness, and neutral may also be seen. Most of the misclassifications are biased toward neutral. Emotions expressed by female speakers are recognized somewhat better than the emotions of male speakers.

From Table 2, it may be observed that the average static prosodic values such as energy, pitch, and duration are distinguishable for different emotions. Similarly, the temporal dynamics in the prosodic contours also represent emotion specific information. Figure 1 shows the dynamics in the prosodic contours for different emotions. Obviously, there are inherent overlaps among these static and dynamic prosodic values with respect to the emotions. In the

literature, several existing works have explored static prosodic features for speech emotion recognition (Dellaert et al. 1996; Lee and Narayanan 2005; Nwe et al. 2003; Schroder and Cowie 2006; Banziger and Scherer 2005; Cowie and Cornelius 2003). However, time-dependent prosody variations may be used as the discrimination strategy where static prosodic properties of different emotions show high overlap. Figure 1 shows three subplots indicating the (a) duration patterns of the sequence of syllables, (b) energy contours and (c) pitch contours of the utterance "mAtA aur pitA kA Adar karnA chAhie" in five different emotions. From the subplot indicating the duration patterns, one can observe a common trend of durations for all emotions. However, the trends also indicate that for some emotions such as fear and happiness the durations of the initial syllables of the utterance are longer, for happiness and neutral the middle syllables of the utterance seem to be longer, and the final syllables of the utterance seem to be longer for fear and anger (see Fig. 1(a)). From the energy plots, it is observed that the utterance with anger emotion has the highest energy for the entire duration. Next to the anger emotion, fear and happiness show somewhat more energy than the other two emotions. The dynamics of energy contours can be used to discriminate fear and happiness (see Fig. 1(b)). It is observed from Fig. 1(c) that anger, happiness and neutral have somewhat higher pitch values compared to the other two emotions. Using the dynamics (changes of prosodic values with respect to time) of the pitch contours, easy discrimination is possible between anger, happiness and neutral emotions, even though they have similar average values. Thus, Fig. 1 provides the basic motivation to explore the dynamic prosodic features for discriminating the emotions.

By observing the pitch contours in Fig. 1(c), it may be noted that the initial portions of the plots (the sequence of the first 20 pitch values) do not carry similar pitch information across different emotions. Static features are almost the same for happiness and neutral. However, static features


Fig. 1 (a) Duration patterns for the sequence of syllables, (b) energy contours, and (c) pitch contours in different emotions for the utterance "mAtA aur pitA kA Adar karnA chAhie"

may be used to distinguish the anger, sadness and fear emotions, as their static pitch values are spread widely between 250 and 300 Hz. Similarly, dynamic features are almost the same for all emotions except fear. One may observe the initial decreasing and gradual rising trends of the pitch contours for anger, happiness, neutral, and sadness, whereas for fear the pitch contour starts with a rising trend. Similar local discriminative properties may also be observed in the case of the energy and duration profiles in the initial, middle and final parts of the utterances. This phenomenon indicates that it may sometimes be difficult to classify the emotions based on either global or local prosodic trends derived from the entire utterance. Therefore, in this work, we intend to explore the static (global) and dynamic (local) prosodic features, along with their combination, for speech emotion recognition at different levels (utterance, words, and syllables) and positions (initial, middle, and final).

4 Extraction of global and local prosodic features

In this work, emotion recognition (ER) systems are developed using local and global prosodic features, extracted from sentence, word and syllable levels. Word and syllable

boundaries are identified using vowel onset points (VOPs) as the anchor points (Vuppala et al. 2012). In this work, VOP detection is carried out using the combination of evidence from the excitation source, spectral peaks, and modulation spectrum. This method is known as the combined method for the detection of VOPs. Excitation source information is represented using the Hilbert envelope (HE) of the linear prediction (LP) residual. The sequence of the sum of the ten largest peaks of the spectra of speech frames represents the shape of the vocal tract. The slowly varying temporal envelope of the speech signal can be represented using the modulation spectrum. Each of these three features represents complementary information about the VOP, and hence they are combined to enhance the performance of VOP detection. VOP detection using the combined method is carried out with the following steps: (1) Derive the VOP evidence from the excitation source, spectral peaks, and modulation spectrum. Here, the evidence from the excitation source information is obtained from the Hilbert envelope of the linear prediction residual signal. The evidence from the spectral peaks is obtained by summing the ten largest spectral peaks of each speech frame. The evidence due to the modulation spectrum is derived by passing the speech signal through a set of critical band pass filters, and summing the components corresponding to


Fig. 2 VOP detection using the combination of evidence from excitation source energy, spectral peak energy, and modulation spectrum energy. (a) Speech signal with manually marked VOPs. (b) Evidence plot using excitation source energy. (c) Evidence plot using spectral peak energies. (d) Evidence plot using modulation spectrum energies. (e) Combined evidence plot with detected VOPs

4–16 Hz. (2) The above evidence signals are further enhanced by computing their slope with the help of the first order difference (FOD). (3) These enhanced evidence signals are convolved with the first order Gaussian difference (FOGD) operator for deriving the final VOP evidence. (4) The individual VOP evidences derived from the excitation source, spectral peaks and modulation spectrum are combined to provide a robust VOP evidence plot. (5) The positive peaks in the combined VOP evidence signal are hypothesized as the locations of VOPs. About 95 % of the VOPs are observed to be detected properly by this method, within a 40 ms deviation (Prasanna et al. 2009). Figure 2 shows the intermediate steps of VOP detection using the evidence from the excitation source, spectral peak and modulation spectrum energy plots. The sentence "Don't ask me to carry an oily rag like that", chosen from the TIMIT database, is used for illustrating the automatic detection of VOPs in Fig. 2. From the figure, it is observed that the detected VOPs are close to the manually marked VOPs (see Figs. 2(a) and (e)).
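The enhancement and combination steps (2) to (5) can be summarized in a short sketch. It assumes the three raw evidence signals (excitation source, spectral peaks, modulation spectrum) have already been computed on a common time axis; the FOGD window length and standard deviation are illustrative values, not ones reported in the paper.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.signal.windows import gaussian

def enhance(evidence, fogd_len=100, fogd_std=25):
    """Steps (2)-(3): slope via first order difference (FOD), then convolution with a
    first order Gaussian difference (FOGD) operator (derivative of a Gaussian window)."""
    fod = np.diff(evidence, prepend=evidence[0])
    fogd = np.gradient(gaussian(fogd_len, fogd_std))
    return np.convolve(fod, fogd, mode="same")

def detect_vops(source_evidence, spectral_evidence, modulation_evidence):
    """Steps (4)-(5): sum the three enhanced evidences and hypothesize VOPs at the
    positive peaks of the combined evidence signal."""
    combined = sum(enhance(e) for e in (source_evidence, spectral_evidence, modulation_evidence))
    vop_locations, _ = find_peaks(combined, height=0)
    return vop_locations, combined
```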

4.1 Sentence level features

Sentence level static and dynamic prosodic features are derived by considering the entire sentence as a unit for feature extraction. Pitch contours are extracted using the zero frequency filter based method (Murty and Yegnanarayana 2008). Figure 3(c) shows the zero frequency filtered signal for the segment of voiced speech shown in Fig. 3(b).

Positive zero crossings in the zero frequency filtered speech signal are used as the epoch locations. Figure 3(d) shows the detected epoch locations. To validate the results obtained, the differenced electroglottograph (EGG) signal is shown in Fig. 3(a). The vibration pattern of the glottal folds can be directly recorded with an electroglottograph. It is a device that can be attached to the throat of a speaker; a transducer in it converts pressure variations into an electrical signal. From Figs. 3(a) and (d), it may be observed that the automatically detected and actual epoch locations almost match.

The details of the zero frequency filter are given in Murty and Yegnanarayana (2008). The zero frequency filter method determines the instants of significant excitation (epochs) present in the voiced regions of the speech signal. Voiced regions are determined using frame level energy and periodicity. In unvoiced regions the concept of pitch is not valid, hence pitch values are set to zero for each interval of 10 ms. In the voiced regions pitch is determined using epoch intervals. The time interval between successive epochs is known as the epoch interval. The reciprocal of the epoch interval is considered as the pitch at that instant of time. The energy contour of a speech signal is derived from the sequence of frame energies. Frame energies are computed by summing the squared sample amplitudes within a frame. Fourteen (2 duration, 6 pitch, 6 energy) prosodic parameters are identified to represent the duration, pitch and energy components of the global prosodic features.
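A minimal sketch of the pitch and energy extraction described above is given below. The zero frequency filtering itself and the voiced/unvoiced decision (frame energy plus periodicity) are assumed to be available; only the epoch picking, the conversion of epoch intervals to F0, and the frame energy computation are shown.

```python
import numpy as np

def epochs_from_zff(zff_signal):
    """Epoch locations: positive zero crossings of the zero frequency filtered signal."""
    z = np.asarray(zff_signal)
    return np.where((z[:-1] < 0) & (z[1:] >= 0))[0] + 1

def pitch_from_epochs(epoch_samples, fs):
    """Instantaneous F0 as the reciprocal of the epoch interval (voiced regions only;
    unvoiced regions are assigned F0 = 0 elsewhere, at 10 ms intervals)."""
    intervals = np.diff(epoch_samples) / fs     # epoch intervals in seconds
    return 1.0 / intervals                      # one F0 value per interval

def frame_energies(speech, fs, frame_ms=20, shift_ms=10):
    """Energy contour: sum of squared sample amplitudes in 20 ms frames, 10 ms shift."""
    frame = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    return np.array([np.sum(speech[i:i + frame] ** 2)
                     for i in range(0, len(speech) - frame + 1, shift)])
```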


Fig. 3 (a) EGG signal (reference epoch locations), (b) segment of voiced speech signal, (c) zero frequency filtered speech signal and (d) detected epoch locations

Average syllable and pause durations are considered as the two duration parameters. The average syllable duration is computed as

$$ND_{syl} = \frac{D_s - D_p}{N_{syl}}$$

and the average pause duration is computed as

$$ND_{p} = \frac{D_p}{D_s}$$

where $ND_{syl}$ is the average syllable duration, $D_s$ is the sentence duration, $D_p$ is the pause duration, $N_{syl}$ is the number of syllables, and $ND_{p}$ is the average pause duration.

Six pitch values and six energy values are derived from the sentence-level pitch and energy contours, respectively. These values represent the minimum, maximum, mean, standard deviation, median, and contour slope. The slopes of the pitch and energy contours are determined using the middle pitch and energy values of the first and the last words. These fourteen values are concatenated in the order duration, pitch, and energy to form a feature vector that represents global prosody.
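A sketch of the 14-dimensional global feature vector is shown below. The only deviation from the description above is the slope, which is approximated here by a least-squares line fit over the whole contour rather than from the middle values of the first and last words.

```python
import numpy as np

def global_stats(contour):
    """Six statistics per contour: minimum, maximum, mean, standard deviation,
    median and slope (least-squares fit used here as an approximation)."""
    c = np.asarray(contour, dtype=float)
    slope = np.polyfit(np.arange(len(c)), c, 1)[0]
    return np.array([c.min(), c.max(), c.mean(), c.std(), np.median(c), slope])

def sentence_global_features(nd_syl, nd_pause, pitch_contour, energy_contour):
    """14-dimensional global prosodic vector: 2 duration + 6 pitch + 6 energy values,
    concatenated in the order duration, pitch, energy."""
    return np.concatenate(([nd_syl, nd_pause],
                           global_stats(pitch_contour),
                           global_stats(energy_contour)))
```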

Local prosodic features are intended to capture the variations in the prosodic contours with respect to time. Therefore, the feature vector is expected to retain the natural dynamics of the prosody. In this regard, resampled energy and pitch contours are used as the feature vectors for local prosody. The dimension of the pitch and energy contours is chosen to be 25, after evaluating the emotion recognition performance with 100, 50, 25, and 10 values. The recognition performance with 25 dimensional feature vectors is slightly better than with feature vectors of the other dimensions. Here, the choice of 25 for the dimension of the pitch and energy contours is not crucial. The reduced size of the pitch and energy contours has to be chosen so that the dynamics of the original contours are retained in their resampled versions. The basic reasons for reducing the dimensionality of the original pitch and energy contours are (1) the need for fixed dimensional input feature vectors for developing the SVM models and (2) the number of feature vectors required for training the classifier has to be proportional to the size of the feature vector to avoid the curse of dimensionality. The local duration pattern is represented by the sequence of normalized syllable durations. Here the syllable durations are determined using the time interval between successive VOPs (Prasanna et al. 2009). The length of the duration contour is proportional to the number of syllables present in the sentence, which leads to feature vectors of unequal lengths. To obtain feature vectors of equal length, the length of the duration vector is fixed at 18 (the maximum number of syllables present in the longest utterance of IITKGP-SESC). The length for shorter utterances is compensated by zero padding.
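The local (dynamic) representation can be sketched as follows. FFT-based resampling and sum-normalization of the syllable durations are illustrative choices; the paper only states that the contours are resampled to 25 points and that shorter duration sequences are zero-padded to length 18.

```python
import numpy as np
from scipy.signal import resample

def local_contour(contour, n_points=25):
    """Resample a pitch or energy contour to a fixed 25-point vector while
    retaining its overall temporal dynamics."""
    return resample(np.asarray(contour, dtype=float), n_points)

def local_durations(syllable_durations, max_syllables=18):
    """Normalized syllable-duration sequence, zero-padded to the length of the
    longest utterance in IITKGP-SESC (18 syllables)."""
    d = np.asarray(syllable_durations, dtype=float)[:max_syllables]
    d = d / d.sum()                               # one plausible normalization
    return np.pad(d, (0, max_syllables - len(d)))
```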

4.2 Word and syllable level features

The global and local prosodic features extracted from words and syllables help to analyze the contribution of different segments (sentences, words, and syllables) and their positions (initial, middle, and final) in the utterance toward emotion recognition. Word and syllable boundaries are determined automatically using vowel onset points (Prasanna and Zachariah 2002; Prasanna 2004). Before extracting the features, the words in all the utterances of the database are divided into three groups, namely initial, middle, and final words. Similarly, the syllables within each group of words are also classified as initial, middle, and final syllables. While categorizing the words, the length of the words and the number of words in an utterance are taken into consideration. The length of a word is measured in terms of the number of syllables. If there are more than 3 words in the utterance and the first word is monosyllabic, then the first 2 words are grouped as initial words. This is because monosyllabic words may not be sufficient to capture emotion specific information. Many times monosyllabic words are not sufficient for the speaker to clearly express a specific emotion. The scheme for grouping the words and syllables into the above mentioned three groups is given in Table 4.

This table contains the word and syllable grouping details of the 15 sentences of IITKGP-SESC. For instance, the grouping of words in the case of (S1, S8, S9) and (S5, S11) is straightforward, as there are either 3 or 6 words in the sentences. In the case of S2, two out of the 5 words are grouped as the initial words, as the first word of the sentence is monosyllabic in nature. The last word, which contains 4 syllables, is treated as the final word, and the remaining two words are


Table 4 Linguistic details of the text prompts of IITKGP-SESC: scheme for grouping of words and syllables while extracting prosodic parameters. Abbreviations used: Fin.—Final, Ini.—Initial, Mid.—Middle, No.—Number, Sen.—Sentences, Syl.—Syllables, Wds.—Words

Sen. | No. of Wds. | No. of Syl. in the word sequence | No. of Syl. | No. of Ini. wds. | No. of Mid. wds. | No. of Fin. wds. | Syl. in initial words | Syl. in middle words | Syl. in final words

S1 3 6 + 4 + 3 13 1 1 1 2 + 2 + 2 1 + 2 + 1 1 + 1 + 1

S2 5 1 + 2 + 2 + 4 + 4 13 2 2 1 2 + 0 + 1 2 + 2 + 2 1 + 2 + 1

S3 5 4 + 2 + 3 + 4 + 3 16 1 2 2 1 + 2 + 1 2 + 1 + 2 2 + 3 + 2

S4 4 4 + 4 + 3 + 3 14 1 1 2 1 + 2 + 1 1 + 2 + 1 2 + 2 + 2

S5 6 1 + 2 + 2 + 3 + 2 + 3 13 2 2 2 2 + 0 + 1 2 + 1 + 2 2 + 1 + 2

S6 5 4 + 2 + 5 + 3 + 3 17 2 1 2 2 + 2 + 2 1 + 3 + 1 2 + 2 + 2

S7 5 2 + 5 + 2 + 3 + 2 14 2 2 1 2 + 3 + 2 2 + 1 + 2 1 + 0 + 1

S8 3 3 + 4 + 4 11 1 1 1 1 + 1 + 1 1 + 2 + 1 1 + 2 + 1

S9 3 5 + 3 + 3 11 1 1 1 1 + 3 + 1 1 + 1 + 1 1 + 1 + 1

S10 5 1 + 2 + 6 + 3 + 2 14 2 1 2 2 + 0 + 1 1 + 4 + 1 2 + 1 + 2

S11 6 2 + 5 + 4 + 1 + 3 + 3 18 2 2 2 2 + 3 + 2 2 + 2 + 1 2 + 2 + 2

S12 4 2 + 2 + 4 + 4 12 2 1 1 2 + 0 + 2 1 + 2 + 1 1 + 2 + 1

S13 5 2 + 3 + 4 + 3 + 5 17 2 2 1 2 + 1 + 2 2 + 3 + 2 1 + 3 + 1

S14 4 3 + 2 + 3 + 3 11 1 2 1 1 + 1 + 1 2 + 1 + 2 1 + 1 + 1

S15 4 2 + 3 + 3 + 3 11 1 2 1 1 + 0 + 1 2 + 2 + 2 1 + 1 + 1

considered as the middle words. Similarly, in the case of S3, the first word is considered as the initial word, as it contains 4 syllables. On the basis of production and co-articulation constraints, the words in each group are divided into initial, middle, and final syllables. The last 3 columns of Table 4 indicate the number of initial, middle, and final syllables present in the initial, middle, and final words.

Here the syllable division is carried out using the following principle: (a) if the word contains more than 2 syllables, then the first syllable of the word is considered as the initial syllable, the last syllable of the word is considered as the final syllable, and the remaining syllables are treated as middle syllables; (b) if the word contains 2 syllables, then they are treated as the initial and final syllables; (c) if the word consists of a single syllable, then that syllable is treated as the initial syllable. A sketch of this rule is given below. The English transcriptions of the text prompts of the Telugu database (IITKGP-SESC) are given in Table 5.
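A literal implementation of rules (a) to (c) is sketched below. Note that the groupings listed in Table 4 also account for production and co-articulation constraints, so for longer words they can differ from this simple first/last split; the syllable labels in the example are abstract placeholders.

```python
def group_syllables(word_syllables):
    """Return (initial, middle, final) syllables of one word according to rules (a)-(c)."""
    n = len(word_syllables)
    if n == 1:                                  # (c) lone syllable -> initial syllable
        return word_syllables, [], []
    if n == 2:                                  # (b) two syllables -> initial and final
        return word_syllables[:1], [], word_syllables[1:]
    # (a) more than two syllables: first = initial, last = final, rest = middle
    return word_syllables[:1], word_syllables[1:-1], word_syllables[-1:]

# Example with abstract syllable labels:
# group_syllables(["s1", "s2", "s3", "s4"]) -> (["s1"], ["s2", "s3"], ["s4"])
```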

The process of extracting word level global and local prosodic features is similar to the method of extracting utterance level global and local prosodic features. The length of the feature vector for word level global prosodic features is kept at 13 (1 duration, 6 pitch, and 6 energy). Here, the parameter normalized pause duration is not included as a feature, since only one or two words are used for feature extraction. Slopes of the pitch and energy contours are computed by considering the first and last syllables of the specific words. The length of the feature vectors for word level local prosodic features is fixed at 15 for pitch and energy. This is derived by re-sampling the original prosody

Table 5 English transcriptions of the Telugu text prompts of IITKGP-SESC

Sent. Id. Text prompts

S1 thallidhandrulanu gauravincha valenu.

S2 mI kOsam chAlA sEpatnimchichUsthunnAmu.

S3 samAjamlo prathi okkaru chadhuvuko valenu.

S4 ellappudu sathyamune paluka valenu.

S5 I rOju nEnu tenali vellu chunnAnu.

S6 kOpamunu vIdi sahanamunu pAtinchavalenu.

S7 anni dAnamulalo vidyA dAnamu minnA.

S8 uchitha salahAlu ivvarAdhu.

S9 dongathanamu cheyutA nEramu.

S10 I rOju vAthAvaranamu podigA undhi.

S11 dEsa vAsulandharu samaikhyAthA thomelaga valenu.

S12 mana rAshtra rAjadhAni hyderAbAd.

S13 sangha vidhrOha sekthulaku AshrayamkalpincharAdhu.

S14 thelupu rangu shAnthiki chihnamu.

S15 gangA jalamu pavithra mainadhi.

contours obtained over the words. The length of the local duration vector is fixed at 6, which is equal to the maximum number of syllables in a word of IITKGP-SESC. The length of the local duration vector at the syllable level is fixed at 4, which is equal to the maximum number of syllables in any group, as shown in Table 4.


Fig. 4 Block diagram of an emotion recognition system using SVMs

5 Results and discussion

Emotion recognition systems are separately developed for sentence, word, and syllable level global and local prosodic features. The combination of global and local prosodic features is also explored for emotion recognition (ER). In this work, the static prosodic parameters are referred to as global prosodic features, and the features that represent the temporal dynamics of the prosodic contours are referred to as local prosodic features. Therefore, the terms static and global, and dynamic and local, are used interchangeably.

5.1 Emotion recognition systems using sentence level prosodic features

In this work, we have considered the 8 emotions of IITKGP-SESC for studying the role of global and local prosodic features in recognizing speech emotions. SVMs are used to develop the emotion recognition models. Each SVM is trained with positive and negative examples. Positive feature vectors are derived from the utterances of the intended emotion, and negative feature vectors are derived from the utterances of all other emotions. Therefore, 8 SVMs are developed to represent the 8 emotions. The basic block diagram of the ER system developed using SVMs is shown in Fig. 4. For evaluating the performance of the ER systems, the feature vectors derived from the test utterances are given as inputs to all 8 trained emotion models. The output of each model is given to the decision module, where the category of the emotion is hypothesized based on the highest evidence among the 8 emotion models.
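The one-against-rest setup and the decision module can be sketched as below. The RBF kernel and the use of the signed distance from the separating hyperplane as the "evidence" are illustrative choices; the paper does not report the kernel or hyper-parameters used.

```python
import numpy as np
from sklearn.svm import SVC

EMOTIONS = ["Anger", "Compassion", "Disgust", "Fear",
            "Happiness", "Neutral", "Sarcastic", "Surprise"]

def train_emotion_svms(features, labels):
    """One SVM per emotion: positive examples from that emotion, negative examples
    from all other emotions (kernel choice is illustrative)."""
    X = np.asarray(features)
    y = np.asarray(labels)
    return {emo: SVC(kernel="rbf").fit(X, (y == emo).astype(int)) for emo in EMOTIONS}

def classify(models, feature_vector):
    """Decision module: hypothesize the emotion whose model gives the highest evidence."""
    x = np.asarray(feature_vector).reshape(1, -1)
    scores = {emo: model.decision_function(x)[0] for emo, model in models.items()}
    return max(scores, key=scores.get)
```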

Fig. 5 Emotion recognition system using sentence level global and local prosodic features

Table 6 Emotion recognition performance using global prosodic features computed over entire utterances. Average recognition performance: 43.75. Abbreviations: Emo.—Emotions, Ang.—Anger, Dis.—Disgust, Hap.—Happiness, Neu.—Neutral, Sad.—Sadness, Sar.—Sarcastic, Sur.—Surprise

Emo. Emotion recognition performance in %

Ang. Dis. Fear Hap. Neu. Sad. Sar. Sur.

Ang. 28 17 23 3 13 13 3 0

Dis. 7 47 0 0 3 10 33 0

Fear 7 0 67 7 0 10 0 9

Hap. 3 0 7 14 43 3 10 20

Neu. 0 0 7 17 67 0 3 6

Sad. 7 3 17 17 0 40 13 3

Sar. 0 10 0 13 20 3 44 10

Sur. 7 0 17 13 3 3 13 44

For analyzing the effect of global and local prosodic features on emotion recognition performance, separate models are developed using global and local prosodic features. The overall emotion recognition performance is obtained by combining the evidence from the global and local prosodic features, as shown in Fig. 5.

The emotion recognition system based on global prosodic features consists of 8 emotion models, developed using 14-dimensional feature vectors (duration parameters—2, pitch parameters—6, energy parameters—6). The emotion recognition performance of the models using global prosodic features is given in Table 6. Fear and neutral are recognized with the highest rate of 67 %, whereas happiness utterances are identified with only 14 % accuracy. It is difficult to


attain high performance while classifying the underlying speech emotions using only static prosodic features. This is mainly due to the overlap of the static prosodic features of different emotions. For instance, it is difficult to discriminate pairs like fear and anger, or sarcastic and disgust, using global prosodic features. Utterances of all 8 emotions are misclassified as either neutral, fear, or happiness. The misclassification due to static prosodic features may be reduced by employing dynamic prosodic features for classification. Therefore, the use of the dynamic nature of prosody contours, captured through local prosodic features, is explored in this work for speech emotion recognition.

To study the relevance of individual local prosodic features in emotion recognition, three separate ER systems corresponding to sentence level duration, intonation and energy patterns are developed to capture local emotion specific information. Score level combination of these individual local prosodic systems is performed to obtain the overall emotion recognition performance due to all local sentence level features (a sketch of this fusion is given below). The emotion recognition performance using individual local prosodic features and their score level combination is given in Table 7.
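The score level combination used here (and for the global plus local combination later) can be sketched as a weighted sum of per-emotion evidences. The weights are tuning parameters; the paper mentions "appropriate weighting factors" but does not report their values, so those shown in the usage comment are placeholders.

```python
def fuse_scores(score_dicts, weights):
    """Weighted score level fusion: each element of score_dicts maps every emotion to
    the evidence produced by one subsystem (e.g. duration, pitch, energy)."""
    emotions = score_dicts[0].keys()
    fused = {emo: sum(w * scores[emo] for w, scores in zip(weights, score_dicts))
             for emo in emotions}
    best = max(fused, key=fused.get)
    return best, fused

# e.g. fuse_scores([duration_scores, pitch_scores, energy_scores], weights=[0.3, 0.4, 0.3])
```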

The average emotion recognition performance due to individual local prosodic features is well above the performance of global prosodic features. The information of pitch dynamics has the highest discrimination of about 54 %.

Energy and duration dynamic features have also achieved recognition performance of around 48 %. From the results, it is observed that local prosodic features play a major role in discriminating the emotions. Score level combination of the energy, pitch and duration features further improved the emotion recognition performance up to 64 %. The evidences of the emotion recognition models developed using global and local prosodic features are combined to study the effect of the combination. Table 8 shows the recognition performance of the emotion recognition system developed by combining the evidence from global and local prosodic features. The average emotion recognition performance after combining the global and local prosodic features is observed to be about 66 %. There is a marginal improvement in the emotion recognition performance by combining the evidence from global and local prosodic features. This indicates that the emotion discriminative properties of the global prosodic features are not highly complementary to the local features. Therefore, local prosodic features alone would be sufficient to perform speech emotion recognition, compared to the combination of global and local prosodic features. The comparison of the recognition performance for each emotion with respect to the global, local and combined features is shown in Fig. 6. It may be observed from the figure that anger, neutral, sadness, and surprise have achieved better discrimination using the combination of global and local prosodic features. Local prosodic features play an important role in the

Table 7 Emotion recognition performance using local prosodic features computed over entire sentences. Abbreviations: Ang.—Anger, Dis.—Disgust, Hap.—Happiness, Neu.—Neutral, Sad.—Sadness, Sar.—Sarcastic, Sur.—Surprise

Emo. Ang. Dis. Fear Hap. Neu. Sad. Sar. Sur.

Duration, average emotion recognition: 48.75

Ang. 30 20 7 3 23 3 7 7

Dis. 7 67 3 0 10 3 10 0

Fear 7 7 53 0 10 10 7 6

Hap. 17 7 10 30 3 13 7 13

Neu. 7 3 3 3 57 21 3 3

Sad. 3 7 23 7 20 30 10 0

Sar. 0 13 4 10 0 0 73 0

Sur. 7 3 14 10 3 10 3 50

Pitch, average emotion recognition: 53.75

Ang. 27 43 7 0 3 3 17 0

Dis. 10 60 10 0 0 3 7 0

Fear 3 13 43 7 0 10 7 17

Hap. 4 7 13 40 3 13 13 7

Neu. 3 7 0 3 80 7 0 0

Sad. 3 10 7 7 10 57 6 0

Sar. 0 0 7 3 0 10 63 17

Sur. 0 0 7 10 0 3 20 60

Energy, average emotion recognition: 48

Ang. 43 37 7 0 7 0 6 0

Dis. 27 37 0 0 13 0 20 3

Fear 0 0 57 7 10 13 0 13

Hap. 7 7 10 43 17 7 7 2

Neu. 20 3 3 10 47 10 3 4

Sad. 0 3 13 17 17 40 10 0

Sar. 0 10 3 0 0 0 80 7

Sur. 0 0 37 13 3 0 10 37

Duration + Pitch + Energy, average: 64.38

Ang. 40 4 23 27 3 0 3 0

Dis. 13 73 0 0 4 0 10 0

Fear 3 0 63 10 0 7 0 17

Hap. 7 0 10 57 3 13 3 7

Neu. 0 7 7 3 73 7 0 3

Sad. 0 7 13 7 0 63 10 0

Sar. 0 10 0 0 0 7 83 0

Sur. 3 0 17 10 0 0 7 63

discrimination of disgust, happiness, and sarcastic. Fear is recognized well using global prosodic features.

5.2 Emotion recognition using word level prosodic features

In general, different emotions appear to be expressed most effectively at different parts of the utterances.


Table 8 Emotion recognition performance using the combination of local and global prosodic features computed from entire utterances. Abbreviations: Emo.—Emotions, Ang.—Anger, Dis.—Disgust, Hap.—Happiness, Neu.—Neutral, Sad.—Sadness, Sar.—Sarcastic, Sur.—Surprise

Emo. Emotion recognition performance in %

Ang. Dis. Fear Hap. Neu. Sad. Sar. Sur.

Ang. 47 40 3 0 3 0 7 0

Dis. 10 63 0 0 7 10 10 0

Fear 7 0 60 3 0 13 0 17

Hap. 10 0 7 53 20 0 3 7

Neu. 0 0 3 7 74 13 3 0

Sad. 0 0 17 3 0 77 3 0

Sar. 0 10 0 0 3 0 84 3

Sur. 0 0 10 17 0 0 6 67

For example, anger and happiness show their characteristics mainly at the beginning of the utterance. Effective expression of fear and disgust may be observed in the final part of the utterance. Based on this intuitive hypothesis, the initial, middle, and final portions of the utterances are analyzed separately for capturing the emotion specific information. To analyze the characteristics of emotions at different parts of the utterance, each utterance of IITKGP-SESC is divided into 3 parts, namely initial, middle and final words. The division of the words into the three regions depends upon the length and other emotion related attributes of the words. The details of this division of an utterance into 3 parts are given in Table 4. In this study, global and local prosodic features are extracted from the initial, middle and final parts of the utterance. For each portion of the utterance (initial, middle or final words), emotion analysis is carried out using global and local prosodic features in a similar manner as it was performed at the sentence level. Further, the overall emotion recognition performance from the word level prosodic features is obtained by combining the evidence from the emotion recognition systems (ERSs) developed using initial, middle and final words. Figure 7 shows the block diagram of the ERS developed using word level prosodic features. Each block of Fig. 7 contains the ERS shown in Fig. 5. From each portion of the sentence, global and local prosodic features are computed and the ERSs are developed as shown in Fig. 5. Further, the evidences from the initial, middle and final words are combined to get the overall evidence.

Table 9 shows the average emotion recognition performance of the word level global and local emotion recognition systems. Table 10 shows the overall emotion recognition performance using word level prosodic features, obtained by score level combination of the local and global features of the initial, middle, and final words.

In Table 9, the column Global indicates the recognition performance using only global prosodic features. Using

Fig. 6 Comparison of emotion recognition performance using utterance level global, local, and global + local prosodic features

Fig. 7 Emotion recognition system using word level global and local prosodic features

local prosodic parameters, systems are individually developed using the duration, pitch and energy components. The columns with headings Dur., Pitch, and Energy show the recognition performance using the dynamic features of the duration, pitch and energy patterns respectively. The column Local indicates the recognition performance due to the score level combination of the duration, pitch, and energy parameters with appropriate weighting factors. The column Glo. + Loc. indicates the emotion recognition performance due to the combination of evidence from the global (Glo.) and local (Loc.) systems. All these results are reported for the initial, middle and final words of the utterances of IITKGP-SESC.

From the results of Table 9, it is evident that all parts of the utterances do not contribute uniformly toward emotion recognition. Some of the important observations are mentioned below. There is a drastic change in the recognition performance when using local prosodic features compared to global features. However, the improvement in the recognition performance is marginal over the local features when global and local features are combined. This indicates that global prosodic features at the word level may not be


Table 9 Emotion recognition performance using global and local prosodic features computed from the words of the utterances depending on their position. Abbreviations: Avg.—Average, Dur.—Duration, Glo.—Global, Loc.—Local

Emotions | Global | Local: Dur. | Local: Pitch | Local: Energy | Local (combined) | Glo. + Loc.

Initial words, average emotion recognition: 50.13

Anger 40 20 33 47 53 57

Disgust 33 33 27 33 37 37

Fear 37 33 17 53 50 43

Happiness 20 13 40 20 40 47

Neutral 23 23 47 30 43 43

Sadness 30 23 37 33 40 47

Sarcastic 60 40 43 73 63 67

Surprise 30 23 43 30 47 60

Avg. 34 26 36 39 47 50

Middle words, average emotion recognition: 58.38

Ang. 50 17 43 53 57 60

Dis. 57 53 53 43 63 70

Fear 60 23 30 67 67 70

Hap. 33 30 50 30 40 43

Neu. 43 20 53 37 60 67

Sad. 40 27 47 37 50 57

Sar. 30 67 33 57 63 63

Sur. 30 40 40 23 50 37

Avg. 43 35 44 43 56 58

Final words, average emotion recognition: 64

Ang. 33 40 30 43 43 43

Dis. 53 37 83 30 80 83

Fear 67 10 47 57 60 63

Hap. 23 33 33 23 30 33

Neu. 70 33 80 53 77 77

Sad. 73 27 83 30 80 83

Sar. 23 47 47 60 63 60

Sur. 63 23 67 60 70 70

Avg. 51 31 59 45 63 64

Table 10 Overall emotion classification performance by combining local and global prosodic features computed from the words from different positions. Abbreviations: Fin.—Final, Ini.—Initial, Mid.—Middle, and Wds.—Words

                          Ang.  Dis.  Fear  Hap.  Neu.  Sad.  Sar.  Sur.  Avg.

Ini. + Mid. + Fin. wds.    57    77    70    47    67    80    60    63   65.38

complementary in nature with respect to their local counterparts. In the case of individual local prosodic features, energy features are more discriminative for the initial words of the

Fig. 8 Comparison of emotion recognition performance using word level global, local, and global + local prosodic features

utterances. This is obvious, as generally, all utterances have dominant energy profiles in the beginning. In the case of middle words, energy and pitch parameters have almost equal emotion discrimination, with recognition rates of 43 % and 44 %, respectively. Pitch values are the most discriminative in the case of emotion recognition using final words. Duration information has always been the least discriminative for initial, middle, and final words. Final words carry more emotion discriminative information, of about 64 %, compared to their initial and middle counterparts. It is observed that the recognition performance using final words is almost the same as the performance achieved using the entire sentence. This indicates that only about one third of the sentence (the final part) is sufficient to recognize the emotions. Interestingly, the average performance obtained due to the combination of the evidence of initial, middle, and final words is almost equal to the recognition rate obtained using entire utterances (see Table 7). A comparison of the emotion recognition performance of individual emotions with respect to initial, middle, and final words is given in Fig. 8. It may be seen from the figure that passive emotions like disgust, sadness, neutral, and surprise are better discriminated using the final words of the sentences. Initial words played an important role in recognizing emotions like happiness and sarcastic. Anger and fear are recognized well using the middle parts of the sentences.

5.3 Emotion recognition using syllable level prosodic features

Within each word, the emotion specific information at the initial, middle and final syllables may be different for different emotions. Based on this intuition, we have carried out the analysis of emotion specific information at the syllable level

Fig. 9 Emotion recognition system using syllable level global and local prosodic features

also. While analyzing emotions at the syllable level, two types of models are developed: (1) utterance-wise syllable level emotion recognition models and (2) region-wise syllable level emotion recognition models. In the case of utterance-wise syllable models, the initial, middle and final syllables of all the words of the utterance are grouped separately. Then the emotion models are developed using global and local prosodic features derived from these syllable groups. In the case of region-wise syllable models, the initial, middle, and final syllables taken from a specific portion of the sentence are grouped. In this manner we have 3 sets of initial, middle, and final syllables corresponding to these regions of the sentence. Here the regions indicate initial, middle, and final words. Then the emotion models are developed using the global and local features of these syllable groups.
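
The two grouping schemes can be made concrete with a short sketch. Assuming an utterance is represented as a list of words and each word as a list of syllables, the following hypothetical Python functions build the utterance-wise and region-wise syllable groups (using a simple positional split for illustration; the actual grouping follows Table 4).

def position_groups(items):
    # Split a sequence into (initial, middle, final) parts; units with one or
    # two elements get no middle part.
    n = len(items)
    if n < 3:
        return list(items[:1]), [], list(items[1:])
    k = n // 3
    return list(items[:k]), list(items[k:n - k]), list(items[n - k:])

def utterance_wise_syllables(words):
    # Pool initial, middle, and final syllables over all words of the utterance.
    groups = {"initial": [], "middle": [], "final": []}
    for word in words:
        ini, mid, fin = position_groups(word)
        groups["initial"] += ini
        groups["middle"] += mid
        groups["final"] += fin
    return groups

def region_wise_syllables(words):
    # Group syllables separately within the initial, middle, and final words.
    regions = {}
    for name, region_words in zip(("initial", "middle", "final"), position_groups(words)):
        regions[name] = utterance_wise_syllables(region_words)
    return regions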

5.3.1 Utterance-wise syllable level emotion recognition

Some of the emotions expressed with high arousal properties, such as anger and happiness, cannot retain the same energy throughout the word. Therefore, in such cases, high energy profiles are well exhibited in the initial syllables of the words. Similarly, syllable level variations may be observed in the duration and pitch patterns of different emotions. Hence, in this work, we analyzed the syllables for their emotion discriminative nature. For analyzing the emotions at the syllable level, the syllables present in each sentence are divided into 3 groups, namely initial, middle and final syllables, based on their position in the word. The details of the syllable groups are given in Table 4. The block diagram of the ERSs developed using syllable level prosodic features is shown in Fig. 9. Here the ERS in each block has the same structure as shown in Fig. 5.

The emotion recognition performance of the utterance-wise syllable level models is given in Table 11. From the results, it is observed that, at the syllable level also, local prosodic parameters perform better emotion recognition compared

Table 11 Emotion recognition performance using global and local prosodic features computed from syllables taken from all words of the utterances, grouped based on their position in the word. Abbreviations: Avg.—Average, Dur.—Duration, Glo.—Global, Loc.—Local

Emotions     Global   Local features                    Glo. + Loc.
                      Dur.   Pitch   Energy   Local

Initial syllables, average emotion recognition: 51 %
Anger          30      23      30      57       50         47
Disgust        20      30      43      40       43         53
Fear           53      10      30      47       50         50
Happiness      17      17      47      20       43         47
Neutral        33      27      53      27       50         47
Sadness        30      27      43      40       43         47
Sarcastic      63      53      57      67       63         63
Surprise       23      33      47      17       43         50
Avg.           34      28      44      39       48         51

Middle syllables, average emotion recognition: 46 %
Anger          13      30      27      27       30         30
Disgust        17      43      37      43       47         43
Fear           37      27      27      47       43         47
Happiness      10      17      30      23       30         30
Neutral        40      30      50      43       53         57
Sadness        37      33      33      33       37         43
Sarcastic      30      67      40      47       63         67
Surprise       47      30      47      40       43         50
Avg.           29      35      36      38       43         46

Final syllables, average emotion recognition: 61 %
Anger          13      27      23      40       40         43
Disgust        43      47      53      57       60         60
Fear           53      23      57      63       60         63
Happiness      17      43      37      27       43         47
Neutral        67      33      67      50       70         73
Sadness        33      27      53      37       57         60
Sarcastic      33      37      47      70       73         73
Surprise       27      27      57      30       60         67
Avg.           36      33      49      47       58         61

to global prosodic features. Initial and final syllables carry more emotion specific information compared to middle syllables. The emotion recognition performance using initial syllables is slightly better than the performance of the systems developed using only initial words. This may be due to the dominance of energy and pitch profiles in the initial portion of the words compared to the entire duration of the words. In general, final syllables contribute heavily toward emotion recognition. The average emotion recognition due to only final syllables is about 61 %, whereas it is 51 % and 46 %, respectively, for the groups of initial and middle syllables. The comparison of recognition of different emotions using utterance-wise initial, middle, and final syllables is

Fig. 10 Comparison of emotion recognition performance using utterance-wise initial, middle, and final syllable prosodic features

Table 12 Overall emotion classification performance by combining local and global prosodic features computed from the syllables from different positions

                           Ang.  Dis.  Fear  Hap.  Neu.  Sad.  Sar.  Sur.  Avg.

Ini. + Mid. + Fin. syls.    43    67    63    43    70    67    73    77    63

given in Fig. 10. From the figure, it is observed that final syllables have more emotion discriminative information for most of the emotions, except anger and happiness. These are high arousal emotions, and hence their discrimination is better in the case of initial syllables. Middle syllables do not contribute much toward emotion recognition compared to initial and final syllables. The overall emotion recognition performance due to the combination of initial, middle, and final syllables is given in Table 12. This is comparable to the results of the word and utterance level studies (see Tables 8 and 10).

5.3.2 Region-wise syllable level emotion recognition

In the word level emotion recognition analysis, we have studied the emotion discriminative characteristics of the sets of initial, middle and final words. Within these groups of words, there may be some additional emotion discriminative information present at the syllable level. Therefore, to capture emotion specific information from the syllables within the groups of words (initial, middle and final), the syllables of these words are divided into initial, middle and final syllables. The syllables within each region of words (initial, middle and final words) are grouped into initial, middle, and final syllables, based on their positions in the words. The details of the syllable groups are given in Table 4. The block diagram of the ERS developed using region-wise syllable

level prosodic features is similar to that of the ERS developed using word level prosodic features. Here, the ERS in each block has the same structure as shown in Fig. 9. The overall ER performance using region-wise syllable level features is obtained by combining the evidence from the ERSs developed using initial, middle and final syllables, as shown in Fig. 11.
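
A minimal sketch of this two-stage combination, assuming equal fusion weights (the paper tunes the weights, so the values here are illustrative only), is given below. Each entry of region_scores is a per-emotion score vector from one syllable-position ERS within one word region, mirroring the structure of Fig. 11.

import numpy as np

def fuse(score_list, weights=None):
    # Weighted average of per-emotion score vectors from several ERSs.
    scores = np.asarray(score_list, dtype=float)      # (n_systems, n_emotions)
    if weights is None:
        weights = np.ones(len(scores)) / len(scores)
    return np.average(scores, axis=0, weights=weights)

def region_wise_decision(region_scores):
    # region_scores: {word region: {syllable position: score vector}}.
    # Stage 1 fuses syllable-position scores within each region; stage 2 fuses
    # the region-level scores into the final per-emotion evidence.
    region_level = [fuse(list(pos_scores.values())) for pos_scores in region_scores.values()]
    final_scores = fuse(region_level)
    return int(np.argmax(final_scores)), final_scores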

The performance of the ERSs developed using region-wise syllable level global and local prosodic features is shown in Table 13. In this study, three sets of studies are carried out, similar to the studies on utterance-wise syllable level emotion recognition. Initial, middle and final syllable models are developed using global and local prosodic features derived from the syllables of the initial words of the utterance. Similarly, syllable models are developed using global and local prosodic features derived from the syllables of the middle and final words. In Table 13, these results are shown in separate rows. From the table, it is observed that the syllables of the final words carry comparatively more emotion specific information, with a recognition rate of around 46 %. Within these final words, the final syllables followed by the initial syllables contribute more toward emotion recognition, with accuracies of 53 % and 48 %, respectively. Figure 12 shows the comparison of emotion recognition performance using region-wise syllable level prosodic features.

It is observed from the results that the recognition performance is very poor with the region-wise syllable level prosodic features. The basic reason for this poor performance is the use of shorter speech segments for feature extraction while training and testing the emotion models.

5.4 Emotion recognition using utterance level global and local prosodic features on Emo-DB

An emotion recognition performance of around 65 % is achieved (see Table 8) using the combination of utterance level global

Fig. 11 Emotion recognition system using region-wise syllable level features

Fig. 12 Comparison of emotion recognition performance of the proposed global, local, and global + local prosodic features with respect to syllables within the groups of initial, middle, and final words of IITKGP-SESC. (a) Initial, middle and final syllables of initial words, (b) initial, middle and final syllables of middle words, and (c) initial, middle and final syllables of final words

Table 13 Emotion classification performance using global and local prosodic features computed from syllables within specific regions like initial, middle, and final words. Abbreviations: Avg.—Average, Dur.—Duration, Fin.Syls.—Final syllables, Glo.—Global, Ini.Syls.—Initial syllables, Loc.—Local, Mid.Syls.—Middle syllables, Wd. pos.—Word position

Wd. pos.     Global   Local features                    Glo. + Loc.
                      Dur.   Pitch   Energy   Local

Initial words, average emotion recognition: 26
Ini.Syls.      24      16      20      20       26         27
Mid.Syls.      18      10      14      19       20         21
Fin.Syls.      15      19      22      19       23         24

Middle words, average emotion recognition: 22
Ini.Syls.      14       8      19      15       20         20
Mid.Syls.      17      11      22      18       23         24
Fin.Syls.      12       7      14      16       18         22

Final words, average emotion recognition: 46.33
Ini.Syls.      26      18      31      31       41         48
Mid.Syls.      25      16      31      29       36         38
Fin.Syls.      38      27      42      40       51         53

Table 14 Emotion recognition performance using the combination of utterance-wise global and local prosodic features on Emo-DB

Emotion      Recognition performance (%)

Anger          53
Boredom        63
Disgust        70
Fear           57
Happiness      70
Neutral        67
Sadness        57
Average        62.43

and local prosodic features on the simulated Telugu emotion database (IITKGP-SESC). A similar study is conducted on the internationally known Berlin emotion speech database (Burkhardt et al. 2005). The Berlin emotion database (Emo-DB) contains 7 emotions. An average emotion recognition performance of around 62 % is achieved using the combination of utterance level global and local prosodic features. The emotion recognition performance using Emo-DB is fairly high compared to the corresponding results on IITKGP-SESC (54 %). The reason may be that there are only 7 emotions in Emo-DB and the database is full blown in nature, where the distinction between the expressions of different emotions is high. Table 14 shows the emotion recognition results obtained on Emo-DB.

6 Summary and conclusions

In this paper, prosodic analysis of the speech signal has been performed at different levels of speech segments for the task of recognizing the underlying emotions. Eight emotions of IITKGP-SESC are used for the analysis. Support vector machines are used for developing the emotion models. Global and local prosodic features are separately extracted from utterance, word and syllable segments of speech for developing the emotion models. Word and syllable boundaries are identified using VOPs. Global prosodic features are derived by computing statistical parameters like mean, maximum, and minimum from the sequence of prosodic parameters. Local prosodic parameters are obtained from the sequence of syllable durations, and frame level pitch and energy values. The prosodic contour trends are retained through the local prosodic features. The contribution of different parts of the utterances toward emotion recognition is studied by developing emotion recognition models using the prosodic features obtained from the initial, middle, and final regions of the utterances. The combination of local and global prosodic features was found to marginally improve the performance compared to the performance of the systems developed using only local features. From the word and syllable level prosodic analysis, it is observed that final words and syllables contain more emotion discriminative information compared to the other groups of words and syllables. In future, source and system features may be combined with these prosodic features to study the effect of their combination. The use of other classifiers like GMMs may be studied to evaluate the emotion recognition performance.
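
As a rough illustration of the feature extraction and modelling summarized above, the sketch below computes the global statistics (mean, maximum, minimum, standard deviation, and slope) of a prosodic contour and trains an SVM emotion model with scikit-learn. The RBF kernel and the standardization step are assumptions made for illustration; frame-level pitch and energy extraction and VOP-based segmentation are taken as given.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def global_prosodic_features(contour):
    # Gross statistics of a prosodic contour (e.g., frame-level F0 or energy).
    c = np.asarray(contour, dtype=float)
    slope = np.polyfit(np.arange(len(c)), c, deg=1)[0]   # least-squares trend
    return np.array([c.mean(), c.max(), c.min(), c.std(), slope])

def train_emotion_svm(X, y):
    # X: one feature vector per utterance (e.g., concatenated duration, F0,
    # and energy statistics); y: emotion labels.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    model.fit(X, y)
    return model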

References

Banziger, T., & Scherer, K. R. (2005). The role of intonation in emotional expressions. Speech Communication, 46, 252–267.

Benesty, J., Sondhi, M. M., & Huang, Y. (Eds.) (2008). Springer handbook on speech processing. Berlin: Springer.

Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. In Interspeech.

Cahn, J. E. (1990). The generation of affect in synthesized speech. In JAVIOS (pp. 1–19), July 1990.

Cowie, R., & Cornelius, R. R. (2003). Describing the emotional states that are expressed in speech. Speech Communication, 40, 5–32.

Dellaert, F., Polzin, T., & Waibel, A. (1996). Recognizing emotion in speech. In 4th international conference on spoken language processing (pp. 1970–1973), Philadelphia, PA, USA, Oct. 1996.

Iida, A., Campbell, N., Higuchi, F., & Yasumura, M. (2003). A corpus-based speech synthesis system with emotion. Speech Communication, 40, 161–187.

Iliou, T., & Anagnostopoulos, C. N. (2009). Statistical evaluation of speech features for emotion recognition. In Fourth international conference on digital telecommunications, Colmar, France, July (pp. 121–126).

Kao, Y. H., & Lee, L. S. (2006). Feature analysis for emotion recognition from Mandarin speech considering the special characteristics of Chinese language. In INTERSPEECH–ICSLP (pp. 1814–1817), Pittsburgh, Pennsylvania, Sept. 2006.

Koolagudi, S. G., & Rao, K. S. (2011). Two stage emotion recognition based on speaking rate. International Journal of Speech Technology, 14, 35–48.

Koolagudi, S. G., & Rao, K. S. (2012a). Emotion recognition from speech: a review. International Journal of Speech Technology, 15(2), 99–117.

Koolagudi, S. G., & Rao, K. S. (2012b). Emotion recognition from speech using source, system and prosodic features. International Journal of Speech Technology, 15(2), 265–289.

Koolagudi, S. G., & Rao, K. S. (2012c). Emotion recognition from speech using sub-syllabic and pitch synchronous spectral features. International Journal of Speech Technology. doi:10.1007/s10772-012-9150-8.

Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: speech database for emotion analysis, Aug. 2009. Communications in computer and information science, Lecture notes in computer science. Berlin: Springer.

Lee, C. M., & Narayanan, S. S. (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13, 293–303.

Luengo, I., Navas, E., Hernáez, I., & Sánchez, J. (2005). Automatic emotion recognition using prosodic parameters. In INTERSPEECH, Lisbon, Portugal (pp. 493–496), Sept. 2005.

Lugger, M., & Yang, B. (2007). The relevance of voice quality features in speaker independent emotion recognition. In ICASSP (pp. IV17–IV20), Honolulu, Hawaii, USA, May 2007. New York: IEEE Press.

McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, S., Westerdijk, M., & Stroeve, S. (2000). Approaching automatic recognition of emotion from voice: a rough benchmark. In ISCA workshop on speech and emotion, Belfast.

Murray, I. R., & Arnott, J. L. (1995). Implementation and testing of a system for producing emotion by rule in synthetic speech. Speech Communication, 16, 369–390.

Murray, I. R., Arnott, J. L., & Rohwer, E. A. (1996). Emotional stress in synthetic speech: Progress and future directions. Speech Communication, 20, 85–91.

Murty, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16, 1602–1613.

Nwe, T. L., Foo, S. W., & Silva, L. C. D. (2003). Speech emotion recognition using hidden Markov models. Speech Communication, 41, 603–623.

Prasanna, S. R. M. (2004). Event-based analysis of speech. PhD thesis, Dept. of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India, Mar. 2004.

Prasanna, S. R. M., & Zachariah, J. M. (2002). Detection of vowel onset point in speech. In Proc. IEEE int. conf. acoust., speech, signal processing. Orlando, Florida, USA, May 2002.

Prasanna, S. R. M., Reddy, B. V. S., & Krishnamoorthy, P. (2009). Vowel onset point detection using source, spectral peaks, and modulation spectrum energies. IEEE Transactions on Audio, Speech, and Language Processing, 17, 556–565.

Rao, K. S. (2005). Acquisition and incorporation of prosody knowledge for speech systems in Indian languages. PhD thesis, Dept. of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India, May 2005.

Rao, K. S. (2011a). Application of prosody models for developing speech systems in Indian languages. International Journal of Speech Technology, 14, 19–33.

Rao, K. S. (2011b). Role of neural network models for developing speech systems. Sadhana, 36, 783–836.

Rao, K. S., & Koolagudi, S. G. (2011). Identification of Hindi dialects and emotions using spectral and prosodic features of speech. IJSCI: International Journal of Systemics, Cybernetics and Informatics, 9(4), 24–33.

Rao, K. S., & Yegnanarayana, B. (2006). Prosody modification using instants of significant excitation. IEEE Transactions on Speech and Audio Processing, 14, 972–980.

Rao, K. S., Prasanna, S. R. M., & Sagar, T. V. (2007). Emotion recognition using multilevel prosodic information. In Workshop on image and signal processing (WISP-2007). Guwahati, India, Dec. 2007. Guwahati: IIT Guwahati.

Rao, K. S., Reddy, R., Maity, S., & Koolagudi, S. G. (2010). Characterization of emotions using the dynamics of prosodic features. In International conference on speech prosody. Chicago, USA, May 2010.

Rao, K. S., Saroj, V. K., Maity, S., & Koolagudi, S. G. (2011). Recognition of emotions from video using neural network models. Expert Systems with Applications, 38, 13181–13185.

Scherer, K. R. (2003). Vocal communication of emotion: A review of research paradigms. Speech Communication, 40, 227–256.

Schroder, M. (2001). Emotional speech synthesis: a review. In Seventh European conference on speech communication and technology, Eurospeech, Aalborg, Denmark, Sept. 2001.

Schroder, M., & Cowie, R. (2006). Issues in emotion-oriented computing toward a shared understanding. In Workshop on emotion and computing, HUMAINE.

Schuller, B. (2012). The computational paralinguistics challenge. IEEE Signal Processing Magazine, 29, 97–101.

Schuller, B., Batliner, A., Steidl, S., & Seppi, D. (2011). Recognising realistic emotions and affect in speech: state of the art and lessons learnt from the first challenge. Speech Communication, 53, 1062–1087.

Ververidis, D., & Kotropoulos, C. (2006). A state of the art review on emotional speech databases. In Eleventh Australasian international conference on speech science and technology, Auckland, New Zealand, Dec. 2006.

Ververidis, D., Kotropoulos, C., & Pitas, I. (2004). Automatic emotional speech classification. In ICASSP (pp. I593–I596). New York: IEEE Press.

Vuppala, A. K., Yadav, J., Chakrabarti, S., & Rao, K. S. (2012). Vowel onset point detection for low bit rate coded speech. IEEE Transactions on Audio, Speech, and Language Processing, 20, 1894–1903.

Wang, Y., Du, S., & Zhan, Y. (2008). Adaptive and optimal classification of speech emotion recognition. In Fourth international conference on natural computation (pp. 407–411).

Werner, S., & Keller, E. (1994). Prosodic aspects of speech. In E. Keller (Ed.), Fundamentals of speech synthesis and speech recognition: basic concepts, state of the art, the future challenges (pp. 23–40). Chichester: Wiley.

Zhang, S. (2008). Emotion recognition in Chinese natural speech by combining prosody and voice quality features. In Sun et al. (Eds.), Advances in neural networks (pp. 457–464). Lecture notes in computer science. Berlin: Springer.

Zhu, A., & Luo, Q. (2007). Study on speech emotion recognition system in E-learning. In J. Jacko (Ed.), Human computer interaction, Part III, HCII (pp. 544–552). Lecture notes in computer science. Berlin: Springer.