
Speaker Recognition and Speaker Characterization over Landline, VoIP and Wireless Channels

Laura Fernández Gallardo
University of Canberra, Australia
Telekom Innovation Laboratories, TU Berlin, Germany
[email protected]

Abstract— The automatic detection of people's identity and characteristics such as age, gender, emotion and personality from their voices generally requires the transmission of the speech to remote servers that perform the recognition task. This transmission may introduce severe distortions and channel mismatch that degrade the system performance or distort the detection of the speakers' actual emotions and traits. At the same time, humans also find it difficult to reliably identify and characterize talkers from speech transmitted over telephone channels, for example when detecting emotions during a call to a friend or relative. The present research addresses the evaluation of human and automatic performances under different channel distortions caused by bandwidth limitation, codecs, and electro-acoustic user interfaces, among other impairments. Special attention is paid to the creation of robust automatic systems that can handle channel variations, and thus enhance voice-based human-computer interaction in applications where the speech is transmitted. This paper outlines the work completed to date and the work planned for the remaining 1.5 years.

Keywords- human speaker characterization; automatic speaker characterization; channel variations; speech coding

I. INTRODUCTION

A. Motivation

Recent advances in speech technologies underpin numerous applications and services enabling human-computer interaction. Specifically, speaker recognition and characterization tools, which detect the user's emotions and other characteristics, allow spoken dialogue systems to adapt their behavior and to employ user-tailored language and information content. Typical human-computer interaction applications where speaker characterization plays a significant role include intelligent agents for customer care and service, surveillance, and entertainment, to name a few. In the majority of cases, the recorded speech must be transmitted to remote servers for the subsequent processing of speaker information, since the popular devices employed for these tasks, such as cellphones and PDAs, cannot handle the inherent complexity of the processing. The classification output can be a decision on the speaker identity or a determination of paralinguistic information such as gender, age, emotion or even personality.

The past few years have witnessed a very rapid deployment of high-speed digital networks. In addition to traditional circuit-switched or mobile communications, much of the speech data is now sent over IP networks, such as Voice over Internet Protocol (VoIP). The still predominant Public Switched Telephone Network (PSTN) offers narrowband (NB) speech, covering the relatively narrow frequency range of 300-3,400 Hz, while VoIP services also support wideband (WB) speech, providing the extended range of 50-7,000 Hz. In addition, super-wideband (SWB, 50-14,000 Hz) transmissions are currently gaining adoption in the marketplace. The benefits of WB and SWB over NB speech come from the added information being carried, which accounts for better word intelligibility, higher voice naturalness and higher perceived voice quality [1]. This fact motivated our research, initially aimed at finding added advantages of the enhanced bandwidth for speaker recognition and characterization, performed by human listeners [2] and by automatic systems [3]. Moreover, the coding scheme used to efficiently transmit the speech over the network and the electro-acoustic user interfaces of speakers and listeners also have an important effect on these performances. This study evaluates these effects, proposes techniques for automatic systems to cope with channel variability, and may also serve as motivation for investment in new infrastructure for the transition from NB to WB and SWB channels.

The fundamental task of automatically detecting speaker identity and characteristics involves the extraction of unique voice features from test utterances and their statistical comparison against models created in previous training or enrolment sessions. However, the transmission of the uttered segments inevitably introduces distortions into the speech signal and, if the transmission channel settings vary, channel mismatch between training and testing utterances will arise, causing automatic systems to perform poorly. These variations represent the main challenge that speaker verification technologies attempt to overcome. On the other hand, research in the relatively new field of emotion recognition from speech is mostly concerned with finding the most discriminative speech features and with designing an appropriate classification scheme, while the effects of channel mismatch have been overlooked so far [4]. The proposed work concentrates on the robust detection of speaker identity, emotions, and other characteristics under different transmission channel degradations, which will enhance the interaction between human and machine in the typical scenarios where the speech needs to be transmitted. In particular, the combination of speaker verification, emotion detection and personality detection techniques will offer users an improved experience if they are robust against channel distortions.

Further, channel degradations also affect human performance in identifying interlocutors during a phone call and in detecting their emotions and other characteristics. The recognition of speaker identity and of traits such as age, gender and personality is only meaningful if the caller is unknown, while the recognition of momentary emotions is more interesting when the speaker is known, for instance in a phone call to a relative. It remains unknown which channel configurations better preserve the human perception and the automatic detection of speaker characteristics; in other words, whether and how the detection of age, emotions and personality varies with the transmission channel in comparison to clean, undistorted speech. These evaluations are also addressed in this PhD thesis. Likewise, it will be determined which speaker characteristics are better preserved through voice communications.

B. Related Work

The importance of the acoustic cues that listeners use to distinguish individuals' voices has been well investigated, recently with growing interest from the forensic field [5]. It has been shown that high-frequency components, which are filtered out in NB channels, carry information about voice quality or specific phonation types that are distinctive of each person, e.g. nasal or breathy voice, and are thus critical for superior human speaker recognition performance [6]. Besides the degradations introduced by bandwidth limitation, the coding algorithm, packet loss rates and the electro-acoustic user interface employed for voice transmission and reception also alter the transmitted voice. While these influences have been assessed for the human perception of signal quality [7], their effects on speaker recognition had not been formally studied prior to the commencement of this PhD.

The goal of auditory tests conducted to evaluate the human perception of speaker characteristics is predominantly the exploration of age-related acoustic variation and of features involved in the expression of affect [8]. Some of these psychological and linguistic studies aim to assist the design of reliable automatic classifiers. Extending the analysis of non-linguistic properties of speech, personality recognition has also gained attention recently [9], often applying the Big Five traits [10] for personality assessment: Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism. However, no exhaustive research has been conducted to evaluate the effects of different transmission channels. One example is the work in [11], examining the detection of a talker's stress or urgency, yet that study does not consider the effects of different bandwidths.

Regarding automatic speaker verification, current efforts have been directed towards the separation of speaker-related characteristics from session variability, attributable not only to channel effects and background noise, but also to effects like aging or the speaker's state. This variability may cause two recordings of the same person to be classified as if they were from different talkers, and its adequate modeling is essential for successful speaker verification. The effects of coded speech on speaker recognition performance have been extensively studied over the last two decades, during which many different feature sets and classifiers combined with channel compensation techniques have been proposed to cope with distortions [12][13]. The current state-of-the-art joint factor analysis (JFA) [14] and i-vector approaches have been employed for speaker verification, outperforming the standard Gaussian Mixture Model-Universal Background Model (GMM-UBM) [15] under channel variability. This thesis delves into the training configurations of these systems, which have mostly been tested with NB data only, and proposes suitable schemes to handle channel degradations.

The aspiration to develop effective interfaces for human-machine interaction has also motivated research on automatic speaker characterization classifiers, in addition to authentication systems [16][17]. Special attention has been paid to finding optimal features and to developing efficient classification algorithms that lead to satisfactory results. The Interspeech Challenges and the Audio Visual Emotion Challenges (AVEC) of the past few years [18][19][20] have promoted the unification of these efforts by proposing common test conditions that permit an exact comparison between multiple approaches. Additionally, the organizers have released several corpora of spontaneous speech, covering relevant paralinguistic phenomena such as emotions, age, gender, intoxication, sleepiness, personality, likability, and pathology (up to the year 2012). In the 2009 Interspeech Challenge [18], the FAU Aibo Emotion Corpus was provided for emotion classification, and a five-class (Anger, Emphatic, Neutral, Positive and Rest) and a two-class (NEGative and IDLe) classification problem were proposed. In the Open Performance Sub-Challenge, where participants could employ their own features and classification systems, the winning approach for the two-class task was a fusion of systems (short-term cepstral, long-term prosodic and vocal tract), while a JFA system with Mel-Frequency Cepstral Coefficients (MFCCs) was found to outperform the other submitted techniques for the five-class task [21]. Nevertheless, since no significant differences were found among the participants' results, it is not clear which set of features or which classifier would be optimal for more accurate emotion detection. The exploration of new features and of more sophisticated classification algorithms requires extensive speech data recorded in spontaneous, i.e. non-acted, situations. Currently, efforts are being made towards collecting new realistic emotional databases [4].

Former attempts to extract speaker information from voice inputs have been based on previous psycholinguistic research and typically combined different sets of short-term acoustic features [17], whereas recent investigations also include voice quality and linguistic features [22]. "Brute-forcing" of features is another approach frequently adopted nowadays, also in the Interspeech Challenges, which provide extensive feature sets to the participants, e.g. 6,125 features in [19]. That challenge aimed at the assessment of speaker personality, among other traits. The effects of different feature sets on personality recognition accuracy had been previously evaluated in [23], finding that Linguistic Inquiry and Word Count (LIWC) features are useful for modeling all personality traits and that regression models outperform classification models when continuous personality variation is assumed [10]. A recent review of state-of-the-art methods for automatic personality recognition is given in [24].

C. Scope

In this thesis we attempt to fill some gaps in affective computing research regarding spoken human-machine interaction and communication between humans: 1) addressing the influence of transmission channel variations on the performance of automatic classifiers, and 2) examining the effects of transmission channel variations on the perceptions of human listeners. Hence, we evaluate the performance of current speaker characterization systems and of human listeners under different channel distortions: bandwidth limitation, speech codec, packet loss rates, and electro-acoustic user interface. The effects of the channel on the performance of state-of-the-art speaker verification systems are also assessed. The main contribution of this research is to propose novel strategies for building classifiers that deal with channel mismatch, with the goal of improving human-computer interaction in typical scenarios and applications where the speech needs to be transmitted. Another objective of the present study is to examine human performance in recognizing speakers and their characteristics and to offer a comparison across different communication channels.

Previous studies have demonstrated the importance of the frequency content filtered out in narrowband channels for human quality perception and intelligibility [1] and for conventional automatic speaker recognition [13]. Hence, it is foreseen that signal transmission over extended bandwidths will facilitate both human and automatic recognition of voices, as well as the detection of other meta- and para-linguistic aspects of the speech. The complete analysis of this thesis may also motivate the transition from narrowband to wideband and super-wideband, considering human and automatic speaker recognition and characterization as an additional criterion, beyond quality, naturalness, and intelligibility, when judging the benefits of more extended bandwidths compared to traditional narrowband.

The remainder of this paper is organized as follows. Section II describes the methods used to evaluate human and automatic speaker recognition performance, indicating the work done so far, and proposes future directions for human and automatic speaker characterization. Section III outlines previous and expected contributions of this research. A brief summary is presented in Section IV.

II. PROPOSED METHODOLOGY

A. Human Speaker Recognition

The aims of the human speaker recognition evaluation were to find possible advantages of WB and SWB over NB transmissions and to study the influence of other channel impairments, such as codecs, packet loss rates and electro-acoustic user interfaces. A set of auditory tests was designed to replicate practically relevant cases where a listener is confronted with voices over the telephone and has to determine the identity of the caller.

Two auditory tests were conducted involving speakers that were previously known to the listeners [25][2]. The first test analyzed only the effects of bandwidth and codecs, while the second studied these effects in combination with those of packet loss and of electro-acoustic user interfaces at both end points of the communication, that is, in the sending and in the receiving direction. The analysis of the bandwidth effects was extended to super-wideband in this test. The participants were chosen from the same work environment, ensuring their acquaintance. A small dataset was recorded from a total of 16 speakers (8 male and 8 female), work colleagues at the Quality and Usability Lab of Telekom Innovation Laboratories in Berlin, where they had worked together with the listeners for at least two years. The mother tongue of all participants was German, as was the language of all utterances. Next, excerpts of different lengths were extracted from the initial recordings and transmitted through different transmission channels. For the set-up of the second test, the user interfaces analyzed in the sending direction were mounted on a head-and-torso simulator, which simulates the acoustic transmission path, and connected to an Asterisk server or to the network simulator Rohde & Schwarz CMU 200, where the recordings were made after selecting the transmission conditions. For the remaining conditions, with headsets or without a user interface, the channel effects were applied via software simulation.
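As an illustration of the software-simulation conditions mentioned above, the following minimal sketch (Python with scipy) reproduces only the narrowband band limitation of roughly 300-3,400 Hz on a clean recording; it assumes a mono 16-bit WAV file and does not include the codecs, packet loss or hardware interfaces used in the actual tests.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import butter, sosfiltfilt, resample_poly

    def simulate_nb_bandlimit(path_in, path_out):
        # Read a mono 16-bit PCM WAV file (assumption: no codec applied yet).
        sr, x = wavfile.read(path_in)
        x = x.astype(np.float64)
        # Downsample to the 8 kHz rate of narrowband telephony.
        x8 = resample_poly(x, 8000, sr)
        # Approximate the 300-3,400 Hz passband of an NB channel.
        sos = butter(4, [300.0, 3400.0], btype='bandpass', fs=8000, output='sos')
        y = sosfiltfilt(sos, x8)
        wavfile.write(path_out, 8000, np.clip(y, -32768, 32767).astype(np.int16))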

A group of 26 listeners (19 males and 7 females) participated individually in the first listening test, and 20 (16 males and 4 females) in the second test. They listened to the stimuli in random order with high-quality headphones in the first test and in the sending-direction part of the second test. For the receiving-direction part, in contrast, they employed the corresponding user interface connected to the Asterisk server to listen to the stimuli.

TABLE I. CHANNEL IMPAIRMENTS AND LISTENERS' MEAN ACCURACY (AUDITORY TEST 2).

Interface                            Bandwidth, codec, bit rate (kbps)   Listeners' accuracy (%)
Phone with handset (SNOM 870)        NB,  G.711,    64                   Sending: 67.8   Receiving: 75.0
                                     WB,  G.722,    64                   Sending: 75.0   Receiving: 77.8
Hands-free phone (Polycom IP 7000)   NB,  G.711,    64                   Sending: 60.3   Receiving: 65.0
                                     WB,  G.722,    64                   Sending: 72.2   Receiving: 80.3
Headsets (Beyerdynamic DT 790)       NB,  G.711,    64                   Sending: 66.9   Receiving: 56.9
                                     WB,  G.722,    64                   Sending: 80.3   Receiving: 81.6
                                     SWB, G.722.1C, 32                   Sending: 77.2   Receiving: 77.2
                                     SWB, G.722.1C, 48                   Sending: 77.2   Receiving: 78.1
Mobile phone (SONY XPERIA T)         NB,  AMR-NB,   12.2                 Sending: 63.1
                                     WB,  AMR-WB,   12.65                Sending: 76.9


They were asked to select one of the 16 possible speakers presented on a GUI after listening to each test stimulus. The stimuli consisted of segments of different lengths (words, sentences and paragraphs) for the first test, and excerpts of three words for the second test. Every speaker uttered the same number of segments, and these were processed with all channel conditions, resulting in a balanced experiment.

Table I indicates the characteristics of the transmission channels employed to transmit the utterances for the second test, together with the overall accuracy reached by the groups of listeners. Significant differences were found between NB and WB (p<0.05), with bandwidth having a greater influence on accuracy for the mobile phone and for the headset (p<0.001) in the sending direction, and for the hands-free phone and for the headset in the receiving direction. Surprisingly, no significant effects on accuracy were found when comparing SWB to WB transmissions. This result needs to be investigated by conducting additional auditory tests involving speech signals of selected frequency bands and examining their effects. The analyses of the length of stimuli (studied in the first test [25]) and of packet loss [2] are omitted in this document for brevity.
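The specific statistical test behind these comparisons is not detailed here; purely as a hedged illustration, per-listener accuracies for the NB and WB conditions could be compared with a paired test as in the sketch below. The accuracy values are random placeholders, not data from the study.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Placeholder per-listener accuracies (%); the actual study used the
    # listeners' responses per condition, which are not reproduced here.
    acc_nb = rng.normal(loc=66.0, scale=5.0, size=20)
    acc_wb = rng.normal(loc=77.0, scale=5.0, size=20)
    t_stat, p_value = stats.ttest_rel(acc_wb, acc_nb)
    print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")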

B. Automatic Speaker Recognition

The performances of GMM-UBM and JFA classifiers have been tested on automatic speaker verification. The standard GMM-UBM proposed by Reynolds et al. [15] consists of traditional GMM speaker models derived from a UBM and adapted with speech from the enrolled speakers; the UBM represents general, speaker-independent characteristics. The JFA approach by Kenny et al. [14] extends this standard system and is based on the decomposition of the GMM mean supervectors into speaker- and channel-dependent parts. Since the speaker and channel distributions are modeled separately, JFA has shown excellent capabilities to handle channel variation and mismatch.
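As a rough illustration of the GMM-UBM approach just described, the following sketch trains a UBM, derives a client model by mean-only MAP adaptation in the spirit of Reynolds et al. [15], and scores a test utterance with an average log-likelihood ratio. The helper names, component count and relevance factor are illustrative assumptions; the actual systems in this work use larger 1,024-component models and different tooling.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_ubm(background_feats, n_components=64):
        # background_feats: (n_frames, n_dims) features pooled from many speakers.
        ubm = GaussianMixture(n_components=n_components, covariance_type='diag',
                              max_iter=200, random_state=0)
        ubm.fit(background_feats)
        return ubm

    def map_adapt_means(ubm, speaker_feats, relevance=16.0):
        # Mean-only MAP adaptation of the UBM towards one enrolled speaker.
        post = ubm.predict_proba(speaker_feats)            # frame responsibilities
        n_k = post.sum(axis=0)                             # zeroth-order statistics
        f_k = post.T @ speaker_feats                       # first-order statistics
        e_k = f_k / np.maximum(n_k[:, None], 1e-10)
        alpha = (n_k / (n_k + relevance))[:, None]
        return alpha * e_k + (1.0 - alpha) * ubm.means_

    def verification_score(ubm, adapted_means, test_feats):
        # Average log-likelihood ratio between the claimant model and the UBM.
        spk = GaussianMixture(n_components=ubm.n_components, covariance_type='diag')
        spk.weights_ = ubm.weights_
        spk.means_ = adapted_means
        spk.covariances_ = ubm.covariances_
        spk.precisions_cholesky_ = ubm.precisions_cholesky_
        return spk.score(test_feats) - ubm.score(test_feats)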

Both a narrowband and a wideband GMM-UBM system were implemented [26]; that is, the first was trained and tested with NB data and the second with WB data. Similarly, a narrowband and a wideband JFA system were also implemented [3]. The reason for studying the bandwidths separately for the two classifiers is that it remains challenging to recognize the codec used in a transmission from the acoustic signal alone, and hence the classifier cannot be tailored to one specific codec. In a practical situation, however, it is straightforward to detect the signal bandwidth, simply by measuring the energy of the frequency components above 3.4 kHz, as sketched below. Thus, in this novel study (previous investigations ignored the channel effects in the training data [14]), the systems are trained with NB or with WB data of mixed nature, including speech samples distorted with the same codecs as the test segments expected at verification time.
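A minimal sketch of the bandwidth check mentioned above: it estimates the fraction of spectral energy above 3.4 kHz and declares the signal wideband if that fraction exceeds a threshold. The threshold value is an illustrative assumption, not a tuned parameter from this work.

    import numpy as np
    from scipy.io import wavfile

    def looks_wideband(path, split_hz=3400.0, threshold=0.01):
        # Heuristic NB/WB decision from the energy above ~3.4 kHz.
        sr, x = wavfile.read(path)
        x = x.astype(np.float64)
        spectrum = np.abs(np.fft.rfft(x)) ** 2
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
        high_energy = spectrum[freqs > split_hz].sum()
        ratio = high_energy / max(spectrum.sum(), 1e-12)
        return ratio > threshold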

In a first experiment [26], two gender-dependent UBMs for each GMM-UBM system were trained with 1,024 Gaussian mixtures using speech data from the ANDOSL database. The UBM parameters were then adapted to build 16 client models by maximum a posteriori (MAP) adaptation. Samples from two sessions of the AusTalk database [27] were employed for training, and samples from its remaining session for testing. These samples were transmitted through the NB or WB communication channels in the same way as for the auditory tests described in the previous section: two NB codecs (G.711 at 64 kbps and GSM-EFR at 12.2 kbps) and two WB codecs (G.722 at 64 kbps and AMR-WB at 23.05 kbps) were applied, obtaining four versions of the initial datasets. The feature vector consisted of the first 13 MFCCs, extracted using a 20 ms Hamming window with a frame shift of 10 ms.
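For illustration, a sketch of a 13-MFCC front-end with a 20 ms Hamming window and a 10 ms shift, here using librosa; the exact MFCC implementation of the thesis experiments may differ (e.g. filterbank design, pre-emphasis, normalization).

    import librosa

    def extract_mfcc(path):
        y, sr = librosa.load(path, sr=None)     # keep the file's native sampling rate
        win = int(0.020 * sr)                   # 20 ms analysis window
        hop = int(0.010 * sr)                   # 10 ms frame shift
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=win, hop_length=hop,
                                    win_length=win, window='hamming')
        return mfcc.T                           # shape: (n_frames, 13)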

For the implementation of the NB and WB JFA classifiers [3], the hyperparameters v, u, and d were trained on datasets processed with the same transmission channels as in the preceding experiment, following the complete description of the system given in [14]. Since this model requires extensive training data, this work employed a combination of databases: the TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT), the Resource Management Corpus 2.0, part 1 (RM1), the Continuous Speech Recognition corpora phase I (WSJ0) and phase II (WSJ1), and the North American Business News Corpus (CSRNAB1). Only male speakers were considered. The 26-dimensional feature vector contained 12 MFCCs along with energy and their delta coefficients. After the estimation of the UBMs with 1,024 Gaussian mixtures and of the mentioned subspaces, the speaker and channel factors were jointly determined for enrolment, again using NB-processed or WB-processed speech from 14 speakers not employed for training.

The four test subsets each consisted of speech transmitted through one communication channel, and each was tested against the classifier of the corresponding bandwidth. Table II shows the Equal Error Rates (EERs) and the minimum detection cost function (DCF) values obtained in the two experiments. It can be seen that, while WB data lead to better accuracy of the GMM-UBM system in comparison to NB, the best JFA performance does not depend on the signal bandwidth. The JFA system performs best when it is trained with a mixture of utterances processed with the GSM-EFR and G.711 codecs and tested with the latter, which reveals that these training and testing sets enable better modeling of the speaker and channel characteristics than the other configurations tested. Presumably, this is due to intrinsic characteristics of the codec algorithms. The two classifiers cannot be compared on the basis of these outcomes, since they were trained and tested with different sets of utterances.

TABLE II. EERS AND DCFS FOR THE GMM-UBM AND JFA CLASSIFIERS IMPLEMENTED.

Classifier     Test subset      EER      DCF
GMM-UBM NB     G.711 (NB)       15.1%    0.151
               GSM-EFR (NB)     15.6%    0.155
GMM-UBM WB     G.722 (WB)       12.0%    0.108
               AMR-WB (WB)      11.1%    0.098
JFA NB         G.711 (NB)        4.6%    0.044
               GSM-EFR (NB)      6.6%    0.064
JFA WB         G.722 (WB)        7.5%    0.074
               AMR-WB (WB)       6.8%    0.067
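For reference, the EER and a minimum detection-cost value can be computed from a set of trial scores as in the following sketch; the cost parameters and target prior of the DCF used in these experiments are not specified here, so the defaults below are only illustrative.

    import numpy as np
    from sklearn.metrics import roc_curve

    def eer_and_min_dcf(scores, labels, c_miss=1.0, c_fa=1.0, p_target=0.5):
        # labels: 1 for target (same-speaker) trials, 0 for impostor trials.
        fpr, tpr, _ = roc_curve(labels, scores)
        fnr = 1.0 - tpr
        idx = np.argmin(np.abs(fnr - fpr))
        eer = 0.5 * (fnr[idx] + fpr[idx])
        dcf = c_miss * p_target * fnr + c_fa * (1.0 - p_target) * fpr
        return eer, dcf.min()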


C. Automatic Emotion Detection

An important part of this PhD covers the investigation of channel effects on automatic speaker characterization (recognition of age, gender, emotions and personality), with emphasis on the evaluation of state-of-the-art systems such as JFA, GMMs and SVMs. The aim of these experiments is not to improve the best performance of these classifiers, but to compare their accuracies across different channel degradations, which have been overlooked in past studies [4].

An SVM classifier performing emotion detection has been implemented, employing speech data transmitted through the channels likely to have the largest influence on system performance; as for speaker recognition, we chose channels with different transmission bandwidths and codecs. The FAU Aibo Emotion Corpus [18], containing emotional speech from children elicited as they interacted with a pet robot, is employed for this study. The original utterances have been transmitted through the different communication channels, and acoustic features (MFCCs) have been extracted from the distorted speech for classification, following a similar approach as for speaker verification. Regrettably, no conclusive results can be reported at present, as more experimentation with different configurations is needed.
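A minimal sketch of such an SVM classifier over MFCC-based features follows; the aggregation of frame-level MFCCs into utterance-level means and standard deviations and the SVM settings are illustrative assumptions, not the exact configuration under evaluation.

    import numpy as np
    import librosa
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def utterance_features(path):
        # Utterance-level statistics of the frame-level MFCCs.
        y, sr = librosa.load(path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def train_emotion_svm(train_paths, train_labels):
        # train_paths: coded-speech utterances; train_labels: emotion classes.
        X = np.vstack([utterance_features(p) for p in train_paths])
        clf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0))
        clf.fit(X, train_labels)
        return clf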

D. Tentative Plan

In future work, different classifiers performing emotion and personality detection tasks will be implemented. Methods to enhance performance when the classifiers are trained and tested with distorted speech will be explored, both for these tasks and for automatic speaker verification. Proposals that may lead to this enhancement include codec identification, feature selection, feature extraction from the encoded bitstream, outlier detection and removal, and fusion strategies, although exhaustive analysis is needed to find the appropriate techniques. The i-vector approach to speaker verification will also be analyzed under channel variability.

In addition, auditory tests involving transmitted speech will be conducted to evaluate human speaker characterization performance, following the same methodology developed for the human speaker identification study. Gender detection from speech, considered uncomplicated despite the degradations caused by telephone channels, should be studied together with the detection of age. One possible test design would involve listening to speech samples that had previously been transmitted through particular communication channels and choosing the gender and the age group corresponding to the utterance heard. Unlike in the speaker identification task, previous exposure to the voices is not necessary. Separate listening tests will investigate human emotion recognition and human personality recognition, where the listeners would decide on either the emotional state or the personality of the speaker, employing different datasets. The emotions and personality traits to be identified would depend on the labels of the dataset, which are commonly the Big Six emotions (happiness, anger, fear, surprise, sadness, disgust) plus neutral, and the NEO-FFI distributions for personality recognition [10]. For these auditory tests, as for the speaker identification tests, it is crucial to select a segment duration that provides enough information about the listeners' capabilities; in other words, it should be long enough to permit above-chance performance and short enough to avoid saturation in accuracy. Resulting overall accuracies between 60% and 90% would be acceptable for comparing the effects of the channel impairments studied.

To prepare the audio material both for automatic and for human performance evaluations, the speech samples will be transmitted through transmission channels of the standard bandwidths (NB, WB and SWB), employing the most typical compression schemes available today and in the near future (G.711 and G.722 for digital telephony, GSM-EFR, the standard codec for cellular telephony in Europe, AMR-NB and AMR-WB for VoIP and wireless telephony, and G.722.1C for SWB transmissions, among others). In addition, a head-and-torso simulator with different user interfaces mounted on it could be employed to study their effects in the sending direction, that is, the degradations introduced by their built-in microphones, following the same procedure as in [2]. In the case of auditory tests, speech samples could be transmitted to user interfaces employed in the receiving direction to evaluate the effects introduced by their loudspeakers. Different packet loss rates could also be applied to the transmissions, either employing the Asterisk server or via software simulation.
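As one example of this codec processing, the sketch below performs a G.711 A-law encode/decode round trip by calling ffmpeg (an assumed external tool with A-law support); codecs such as GSM-EFR, AMR-NB/WB and G.722.1C would require additional encoder implementations and are not covered by this snippet.

    import subprocess

    def g711_roundtrip(path_in, path_out, tmp='tmp_g711.wav'):
        # Encode to 8 kHz G.711 A-law, then decode back to linear 16-bit PCM.
        subprocess.run(['ffmpeg', '-y', '-i', path_in, '-ar', '8000',
                        '-acodec', 'pcm_alaw', tmp], check=True)
        subprocess.run(['ffmpeg', '-y', '-i', tmp, '-acodec', 'pcm_s16le',
                        path_out], check=True)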

It should be kept in view, however, that in order to apply the desired channel degradations, the original dataset should contain unprocessed microphone speech with no codec previously applied, sampled at 32 kHz or higher, to allow the study of NB, WB and SWB transmissions. Databases meeting these requirements are not abundant in emotional speech research. However, the Vera am Mittag database, containing spontaneous emotional data sampled at 44.1 kHz, should be suitable for the purpose of this work. The already mentioned FAU Aibo Emotion Corpus [18], with a sampling frequency of 16 kHz, would permit the analysis of NB and WB but not of SWB channels. To study human and automatic personality recognition, one available database is the Conversation Test SPIQUE, containing utterances sampled at 44.1 kHz and labeled according to the NEO-FFI distributions [10]. This dataset consists of speech uttered by a professional speaker performing ten different personalities (fixed text), spontaneous speech elicited by showing non-professional speakers a set of images, and multi-speaker conversational speech.

III. RESEARCH CONTRIBUTIONS

This section recapitulates the research contributions to date and outlines the expected contributions and their impact on the field of affective computing.

The auditory tests examining human speaker identification [25][2] have revealed higher human accuracy in identifying speakers when their voices are transmitted over WB rather than NB channels. A similar outcome can be expected for the human recognition of other meta- and para-linguistic aspects of the voice (such as emotion and personality). These results should motivate the migration from NB to WB transmissions.

Regarding the automatic speaker recognition assessment [26][3], it has been found that the signal bandwidth does not have an important influence on the JFA system, but that the arrangement of the training data becomes a principal issue. This work has proposed a training configuration leading to relatively low EERs, in which the data used for the estimation of the UBM and the subspaces present the same distortions as those encountered at verification time.

In the 1.5 years of work remaining to complete this PhD thesis, the expected contributions are: i) the evaluation of state-of-the-art automatic speaker classifiers detecting speakers' age, gender, emotions, and personality under channel degradations, ii) the proposal and analysis of strategies to achieve better speaker verification and characterization performance under channel degradations, and iii) the evaluation of human speaker characterization performance over different channels.

IV. SUMMARY

This paper has presented the completed work and the future plans on speaker recognition and characterization under channel distortions, towards enhanced human-machine interaction. Unlike more traditional speech processing tasks such as automatic speaker verification, previous investigations of automatic emotion detection from speech have not considered the effects of channel mismatch, which manifest in typical scenarios where recorded voices are transmitted through different channels for the extraction of information. In addition, human performance in distinguishing and characterizing individuals from their transmitted voices is often hampered by channel degradations.

Human and automatic speaker recognition performances over different communication channels have been evaluated. As future work, current approaches for automatic speaker characterization, that is, the detection of age, gender, emotions and personality from speech, will be compared under different channel settings, and novel techniques to handle channel variations will be proposed. The accuracy of listeners in characterizing speakers will also be examined.

REFERENCES

[1] Möller, S., Raake, A., Kitawaki, N., Takahashi, A. and Wältermann, M., "Impairment Factor Framework for Wideband Speech Codecs," IEEE Trans. Audio, Speech and Language Processing, vol. 14, no. 6, pp. 1969-1976, 2006.

[2] Fernández Gallardo, L., Möller, S. and Wagner, M., “Human Speaker Identification of Known Voices Transmitted Through Different User Interfaces and Transmission Channels,” ICASSP, 2013.

[3] Fernández Gallardo, L., Wagner M. and Möller, S., “Joint Factor Analysis for Speaker Verification under Variations of Transmission Bandwidth and Codec,” unpublished.

[4] Schuller, B., Batliner, A., Steidl, S. and Seppi, D., “Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge,” Speech Communication, vol. 53, no. 9, pp. 1062-1087, 2011.

[5] Amino, K., Osanai, T., Kamada, T., Makinae, H. and Arai, T., "Effects of the Phonological Contents and Transmission Channels on Forensic Speaker Recognition," Forensic Speaker Recognition: Law Enforcement and Counter-Terrorism, pp. 275-308, Springer, 2011.

[6] Uzdy, Z., "Human Speaker Recognition Performance of LPC Voice Processors," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 33, no. 3, pp. 752-753, 1985.

[7] Raake, A., "Speech Quality of VoIP: Assessment and Prediction," John Wiley & Sons Ltd, Chichester, West Sussex, UK, 2006.

[8] Gobl, C. and Ní Chasaide, A., "The Role of Voice Quality in Communicating Emotion, Mood and Attitude," Speech Communication, vol. 40, pp. 189-212, 2003.

[9] Polzehl, T., Möller, S. and Metze, F., “Modeling Speaker Personality using Voice,” Interspeech, pp. 2369-2362, 2011.

[10] Goldberg, L. R., "An Alternative "Description of Personality": The Big-Five Factor Structure," Journal of Personality and Social Psychology, vol. 59, pp. 1216-1229, 1990.

[11] Voran, S., “Listener Detection of Talker Stress in Low-Rate Coded Speech”, ICASSP, pp. 4813-4816, 2008.

[12] Janicki, A., “SVM-Based Speaker Verification for Coded and Uncoded Speech,” European Signal Processing Conference, 2012.

[13] Pradhan, G. and Prasanna, S. R. M., “Significance of Speaker Information in Wideband Speech,” National Conf. on Communication (NCC), 2011.

[14] Kenny, P., Ouellet, P., Dehak, N., Gupta, V. and Dumouchel, P., "A Study of Inter-Speaker Variability in Speaker Verification," IEEE Trans. Audio, Speech and Language Processing, vol. 16, no. 5, pp. 980-988, 2008.

[15] Reynolds, D.A., Quatieri, T.F. and Dunn, R.B., “Speaker Verification Using Adapted Gaussian Mixture Models,” Journal of Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.

[16] Metze, F., Ajmera, J., Englert, R., Bub, U., Burkhardt, F., Stegmann, J., Müller, C., Huber, R., Andrassy, B., Bauer, J.G. and Little, B., “Comparison of Four Approaches to Age and Gender Recognition for Telephone Applications,” ICASSP, pp. 1080-1092, 2007.

[17] Ververidis, D. and Kotropoulos, C., “Emotional Speech Recognition: Resources, Features, and Methods,” Speech Communication, vol. 48, no. 9, pp. 1162-1181, 2006.

[18] Schuller, B., Steidl, S. and Batliner, A., “The INTERSPEECH 2009 Emotion Challenge,” Interspeech, 2009.

[19] Schuller, B., Steidl, S. and Batliner, A., “The INTERSPEECH 2012 Speaker Trait Challenge,” Interspeech, 2012.

[20] Schuller, B., Valstar, M., Eyben, F., Cowie, R. and Pantic, M., "AVEC 2012 - The Continuous Audio/Visual Emotion Challenge," ACM Int. Conf. on Multimodal Interaction, pp. 449-456, 2012.

[21] Kockmann, M., Burget, L. and Černocký, J., "Brno University of Technology System for Interspeech 2009 Emotion Challenge," Interspeech, pp. 348-351, 2009.

[22] Metze, F., Polzehl, T. and Wagner, M., “Fusion of Acoustic and Linguistic Speech Features for Emotion Detection,” IEEE Int. Conf. on Semantic Computing, 2009.

[23] Mairesse, F., Walker, M.A., Mehl, M.R. and Moore, R.K., "Using Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text," Journal of Artificial Intelligence Research, vol. 30, no. 1, pp. 457-500, 2007.

[24] Metze, F., Black, A. and Polzehl, T., “A Review of Personality in Voice-Based Man Machine Interaction,” HCI International, vol.2, pp. 358-367, Springer, 2011.

[25] Fernández Gallardo, L., Möller, S. and Wagner, M., “Comparison of Human Speaker Identification of Known Voices Transmitted Through Narrowband and Wideband Communication Systems,” ITG Conference on Speech Communication, 2012.

[26] Fernández Gallardo, L., Wagner M. and Möller, S., “Analysis of Automatic Speaker Verification Performance over Different Narrowband and Wideband Telephone Channels,” 14th Australasian Int. Conf. on Speech Science and Technology, 2012.

[27] Burnham, D., Estival, D., Fazio, S., Cox, F., Dale, R., Viethen, J., Cassidy, S., Epps, J., Togneri, R., Kinoshita, Y., Göcke, R., Arciuli, J., Onslow, M., Lewis, T., Butcher, A., Hajek, J. and Wagner, M., "Building an Audio-Visual Corpus of Australian English: Large Corpus Collection with an Economical Portable and Replicable Black Box," 12th Annual Conf. of the International Speech Communication Association, 2011.
