
VARIABILITY COMPENSATION IN SMALL DATA: OVERSAMPLED EXTRACTION OF I-VECTORS FOR THE CLASSIFICATION OF DEPRESSED SPEECH

Nicholas Cummins 1,2, Julien Epps 1,2, Vidhyasaharan Sethu 1, Jarek Krajewski 3

1 School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, Australia
2 ATP Research Laboratory, National ICT Australia (NICTA), Australia
3 Experimental Industrial Psychology, University of Wuppertal, Wuppertal, Germany

[email protected], [email protected]

ABSTRACT

Variations in the acoustic space due to changes in speaker mental state are potentially overshadowed by variability due to speaker identity and phonetic content. Using the Audio/Visual Emotion Challenge and Workshop 2013 Depression Dataset, we explore the suitability of i-vectors for reducing these latter sources of variability when distinguishing between low and high levels of speaker depression. In addition, we investigate whether supervised variability compensation methods, such as Linear Discriminant Analysis (LDA) and Within Class Covariance Normalisation (WCCN), applied in the i-vector domain, can compensate for speaker and phonetic variability. Classification results show that i-vectors formed using an oversampling methodology outperform a baseline set by KL-means supervectors. However, neither compensation method appears to improve system accuracy. Visualisations afforded by the t-Distributed Stochastic Neighbour Embedding (t-SNE) technique suggest that, despite the application of these techniques, speaker variability remains a strong confounding effect.

Index Terms— Depression, Acoustic Variability, I-vectors, Linear Discriminant Analysis, Within Class Covariance Normalisation, t-Distributed Stochastic Neighbour Embedding

1. INTRODUCTION

A wide range of acoustic information is modulated onto speech signals; this potentially places an upper bound on the accuracy of a speech-based depression classification system. Acoustic variability that arises due to speaker characteristics, channel effects and phonetic content has been shown to have detrimental effects on the recognition accuracy of a range of paralinguistic information: long-term speaker traits including age and gender [1], temporary speaker traits such as intoxication [2] and sleepiness [3], as well as transient speaker states such as emotion [4]. In emotion recognition in particular, it has been shown that speaker variability affects the feature space distribution of emotional data [4]. Automatic emotion and depression recognition systems share common traits: a continuous negative affect is a key symptom of depression [5]. However, depression is more steady-state compared with the transient nature of emotions, with individuals afflicted for weeks or months rather than seconds or minutes [6].

In speaker recognition, i-vectors, together with a range of complementary transforms designed to further reduce errors arising from intersession variability, have become a pseudo-standard due to their ability to compress both speaker and channel variability into a low-dimensional feature space [7], [8]. However, little work has been done exploring the suitability of this paradigm for modelling paralinguistic tasks, which often have substantially smaller amounts of training data (both in terms of number of speakers and duration) than those used in speaker recognition.

Motivated by results showing that both speaker variability and phonetic variability have negative effects on depression classification [9], [10], we investigate the suitability of i-vectors for modelling depressed speech as well as the ability of the paradigm to reduce the effects of variability not related to depression.

2. RELATION TO PRIOR WORK

Whilst a range of prosodic [11], [12], voice quality [13] and spectral features [9], [14], as well as Gaussian Mixture Model (GMM) based supervectors [10], have been established for use in automatic speech-based depression classification, only a small number of papers have investigated the effects of unwanted acoustic variability on depressed speech classification.

Results presented in [9] show that per-speaker normalisation offers no improvement for a depression classifier, indicating that, as in emotion recognition, speaker variability has stronger effects than variability due to depression. Work in [15] shows that depression classification is susceptible to both speaker and channel effects.

Recent results, found using the Audio/Visual Emotion Challenge (AVEC) and Workshop 2013 Depression Dataset, show that Nuisance Attribute Projection (NAP) applied to Kullback-Leibler (KL-means) supervectors may be able to help reduce effects due to phonetic variability in a depression regression system [10]. Whilst this paper focuses on i-vectors for depressed speech classification, we also apply our final i-vector system configuration to the depression scale prediction challenge (Section 5.4) to allow comparison with results presented in [10].

The application of i-vectors to paralinguistic speech classification problems may be complicated by more than just the lack of previous investigation on comparatively small databases. Speaker traits like depression often only have examples of one class (i.e. low or high depression but not both) from a single speaker among the training/development data [14]. Compared with emotion or speaker recognition, in which training databases exist with examples of many emotions or channels per speaker [4], a different approach will be required.

Whilst i-vectors, and the related techniques of Joint Factor Analysis and Latent Factor Analysis, have been used in other paralinguistic classification tasks such as age and gender analysis [1], [16] and emotion classification [4], [17], to the best of the authors' knowledge this is the first paper to explore the suitability of i-vectors for the classification of speech affected by depression. Further, depression data, including AVEC, often pose the additional challenge that speech utterances from only a single level of depression per speaker are available.


3. I-VECTOR PARADIGM

Given a Universal Background Gaussian Mixture Model (UBM) to represent the feature distribution of the acoustic space, individual speech utterances can be used to adapt this UBM and the resulting GMM represented as a supervector. This allows for the application of a number of linear vector space operations, but is held back by the inherent high dimensionality of supervectors. The i-vector space is a low dimensional subspace onto which supervectors are mapped via a linear transformation while retaining most of the variability (useful information) present in the supervector space, and it has been used extensively in speaker recognition [7], [8]. In the context of depression classification, it is expected that the UBM approximately models the phonetic structure of the acoustic space, and hence that the supervectors, and consequently the i-vectors, capture variations in this structure due to other factors including level of depression, speaker identity, channel effects, etc. The i-vector model is given by:

    s = m + Tw                                                      (1)

where m is the supervector corresponding to the UBM, s is the supervector corresponding to a particular utterance, w is the i-vector, and T is the projection matrix [7], [8].
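To make Eq. (1) concrete, the following toy NumPy sketch (our illustration only, not the extractor used in the paper) recovers w from a noisy supervector by regularised least squares; a production extractor would instead compute a posterior estimate of w from zeroth and first order Baum-Welch statistics [7], [8]:

    import numpy as np

    rng = np.random.default_rng(0)
    D, R = 1024, 40                            # supervector / i-vector dimensions (toy sizes)
    m = rng.normal(size=D)                     # UBM mean supervector
    T = rng.normal(size=(D, R)) / np.sqrt(D)   # projection matrix (random here; learned in practice)
    w_true = rng.normal(size=R)
    s = m + T @ w_true + 0.01 * rng.normal(size=D)   # utterance supervector per Eq. (1), plus noise

    # Recover w by regularised least squares on the offset (s - m).
    lam = 1e-3
    w_hat = np.linalg.solve(T.T @ T + lam * np.eye(R), T.T @ (s - m))
    print(np.linalg.norm(w_hat - w_true))      # small residual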

3.2 VariabilityTo attempt tovariability whilvector space, toversampled exAnalysis (LDAvector space thdepression in thNormalisation (class [7], [8].

Both LDAusing subfiles

3. I-VECTOrsal Backgroundeature distributices can be used nted as supervecf linear vector shigh dimension

dimensional sulinear transform

eful informationused extensivelyf depression clamately models tce the supervecons in this strussion, speaker is given by:

e supervector corresponding to is the projectionn of T Matrix y, the i-vector ms estimated froecognition showthe T space, thearalinguistic spemited training setor instances foechnique wheres) were extracte

mple of creating dom a single suffi

rs used to estimat

y Compensationo mitigate thele retaining vartwo forms of xtracted i-vecto

A), attempts to fhat maximises shis case), while (WCCN), attem

A and WCCN labelled with

OR PARADIGd Gaussian mixion of the acouto adapt this U

ctors. This allowspace operationality of superv

ubspace onto wmation while rn) present in ty in speaker reassification, it the phonetic structors and conse

ucture due to otidentity, channe

orresponding to
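The subfile extraction of Figure 1 can be sketched as follows, assuming a mono waveform array and the 60 second window / 10 second overlap setting that performed best in Section 5; function and variable names are our own:

    import numpy as np

    def extract_subfiles(signal, sr, win_s=60.0, overlap_s=10.0):
        """Cut one long recording into overlapping subfiles ('channels')."""
        win = int(win_s * sr)
        step = int((win_s - overlap_s) * sr)   # 60 s windows shifted by 50 s
        # A file shorter than one window yields a single (short) subfile.
        return [signal[start:start + win]
                for start in range(0, max(len(signal) - win, 0) + 1, step)]

    # e.g. a 5 minute file at 16 kHz yields 5 overlapping 60 s subfiles
    x = np.zeros(5 * 60 * 16000)
    print(len(extract_subfiles(x, 16000)))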

3.2. Variability Compensation

To attempt to mitigate the effects of speaker and channel variability while retaining variability due to depression in the i-vector space, two forms of compensation were applied to the oversampled extracted i-vectors. The first, Linear Discriminant Analysis (LDA), attempts to find the linear projection from the i-vector space that maximises the separation between classes (levels of depression in this case), while the second, Within Class Covariance Normalisation (WCCN), attempts to minimise variations within a class [7], [8].

Both LDA and WCCN were applied on a per-class basis, using subfiles labelled with a class (low or high levels of depression). WCCN was also trialled on a per-file basis; the transform matrix was trained using subfiles labelled on a per-file basis.
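A minimal sketch of the two transforms on toy i-vectors is given below; train_wccn is our own helper name, and note that scikit-learn's LDA yields at most (number of classes - 1) output dimensions, so the larger LDA dimensionalities of Table 2 reflect the authors' own implementation rather than this sketch:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def train_wccn(X, labels):
        """WCCN projection B with B B^T = W^-1, where W is the average
        within-class covariance of the training i-vectors [7], [8]."""
        W = np.mean([np.cov(X[labels == c], rowvar=False)
                     for c in np.unique(labels)], axis=0)
        return np.linalg.cholesky(np.linalg.inv(W))   # apply as X @ B

    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 20))       # toy i-vectors
    y = rng.integers(0, 2, size=400)     # low/high depression subfile labels

    lda = LinearDiscriminantAnalysis().fit(X, y)   # supervised, maximises class separation
    X_lda = lda.transform(X)             # at most n_classes - 1 dimensions here
    X_wccn = X @ train_wccn(X, y)        # whitens the within-class scatter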

4. EXPERIMENTAL SET-UP

4.1. Corpora

All experiments presented used a subset of the Audio/Visual Emotion Challenge and Workshop (AVEC) 2013 dataset [19]. The full data set comprises 150 recordings, divided into training, development and testing partitions, each of 50 files (recordings). Each recording has an associated Beck Depression Inventory (BDI) score, a self-reported measure of depression with clinical validity, which rates the severity of cognitive, affective and somatic symptoms to give a patient a score which reflects their level of depression [20]. For an in-depth description of the corpus, the reader is referred to [10], [19].

The make-up of this corpus provides some unique challenges. One is the large variety in terms of both file length (ranging from 5 to 27 minutes in length) and speech tasks (vocal exercises, free and read speech tasks, noting that not all files include all tasks) contained within each file. This introduces a large degree of phonetic variability within each file [10]. Specifically induced changes in speaker affect are also present throughout each file [19]; this could potentially introduce another source of unwanted variation into our system.

To form a suitable two-class classification data set for preliminary experiments (i.e. one in which differences in acoustic variability between each class are most clearly due to the effects of depression), herein referred to as the development classification (DVC) partition, all files (twenty files from twenty distinct speakers) from the development set with BDI scores between 0-9 (indicating mild to non-existent levels of depression) were selected to form the 'low' class. A further twenty files from twenty other speakers (out of 21 such files available in the development set; the longest one was left out to balance the two classes as best as possible), with BDI greater than 19 (severe depression), were selected from the development partition to form the 'high' class.

4.2. Experimental Settings

The experimental settings (unless otherwise stated) of the classification system were as follows: the frame level features were MFCCs extracted as per [10]. A baseline classification accuracy was found using KL-means supervectors; the reader is referred to [21] for the extraction methodology. All UBMs were trained with 10 iterations of the EM algorithm from the entire training partition. The entire AVEC training set was also used to train the T matrix; i-vectors were then extracted for the two-class DVC partition described in Section 4.1. The number of UBM mixtures, MAP adaptation iterations, T matrix dimensionality and LDA dimensionality were set empirically using the training partition and cross-fold validation on the DVC partition.

LIBSVM, with a linear kernel [22] and default user settings, was used for all Support Vector Machine (SVM) training and testing. All training vectors were normalized to a range of [0, 1]; the test vectors were normalized by the same factors as the training vectors. The choice of MFCCs as features was made for two reasons: firstly, MFCCs, in combination with GMMs, have proved successful for classifying low/high levels of depression [14], [15]; secondly, the use of MFCCs ensures that the feature space dimensionality is low enough to allow UBMs to be trained such that these UBMs are representative of the acoustic/phonetic structure of speech.
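A sketch of this SVM stage is given below, with scikit-learn's LIBSVM wrapper standing in for the LIBSVM package [22] itself; the [0, 1] scaling is learned on the training vectors only and reused on the test vectors, as described above, and the data shown are synthetic:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train, y_train = rng.normal(size=(80, 200)), rng.integers(0, 2, 80)
    X_test = rng.normal(size=(20, 200))

    # MinMaxScaler learns per-dimension ranges on the training vectors only;
    # SVC(kernel='linear') uses the same LIBSVM code as the command-line tools [22].
    clf = make_pipeline(MinMaxScaler(feature_range=(0, 1)), SVC(kernel="linear"))
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)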


4.3. Evaluation Method

A similar cross fold validation scheme to that presented in [10] was used to generate the classification accuracy results. In this method 100 trials of 5-fold cross-validation are employed, where in each trial the DVC partition was randomly split into the 5 folds, which were then used as the SVM training and test permutations. Results are reported in terms of average accuracy across all trials (where each trial is the average of the 5 folds of cross-validation), minimum and maximum accuracy from each set of trials, as well as the standard deviation in each trial set. Classification systems are referred to either as 'standard', where a single supervector / i-vector is extracted per file, or 'oversampled'. When using the oversampling technique, multiple scores were generated per file (one per supervector/i-vector) and the median score was used to generate one prediction per file.
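Our own recreation of this protocol is sketched below; array names are illustrative, file labels are assumed to be encoded as 0 (low) and 1 (high), and the SVM decision value is used as the per-subfile score:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    def cv_trials(X, sub_file, y_file, n_trials=100, n_folds=5):
        """Repeated k-fold CV over files; subfile SVM scores are fused into
        one prediction per file via the median decision value."""
        files = np.arange(len(y_file))
        trial_accs = []
        for trial in range(n_trials):
            folds = KFold(n_folds, shuffle=True, random_state=trial).split(files)
            fold_accs = []
            for train_f, test_f in folds:
                train = np.isin(sub_file, train_f)
                clf = SVC(kernel="linear").fit(X[train], y_file[sub_file[train]])
                # Median fusion: positive fused decision value -> class 1 ('high').
                preds = [int(np.median(clf.decision_function(X[sub_file == f])) > 0)
                         for f in test_f]
                fold_accs.append(np.mean(np.array(preds) == y_file[test_f]))
            trial_accs.append(np.mean(fold_accs))
        a = np.asarray(trial_accs)
        return a.mean(), a.min(), a.max(), a.std()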

5. RESULTS

5.1. Baseline System Accuracy

As in [10], an initial series of tests was run on the DVC partition using an uncompensated KL-means supervector system. This system was chosen as our baseline as it was the most consistently performing system in [10]. Results confirmed that KL-means supervectors extracted from a 128 mixture GMM using 5 iterations of MAP also gave the best standard system (one supervector per file) result (70.30%) (Table 1). An oversampled uncompensated KL-means system (multiple supervectors per file) gave an accuracy of 67.85%, with the subfiles extracted from 60 second segments with a 10 second overlap (Table 1).

5.2. i-vector Classification

To determine the suitability of the i-vector paradigm for modelling depression, an exhaustive series of tests was run on a standard i-vector system (only one i-vector per file) to determine the ideal dimensionality of T. None of these classification accuracies was able to better either KL-means baseline (standard or oversampled); the best accuracy (58.88%) was found for a T matrix dimensionality of 100 (Table 1).

For the oversampled system, we ran an exhaustive series of tests to determine the optimal parameters in terms of window size, overlap and T matrix dimensionality. Results indicated that the best setting was a 60 second window with a 10 second overlap. A sample of these results is shown in Figure 2. The best accuracy (74.85%) was achieved with a T dimensionality of 200; this is a relative improvement of 6.5% and 10.3% over the single and oversampled KL-means baselines respectively (Table 1). Perhaps more importantly, oversampling seems to allow relatively competitive i-vector systems to be developed where a 'standard' i-vector implementation provides close to chance-level accuracy.

Table 1: Standard and oversampled classification accuracies generated using uncompensated KL-means and i-vectors, found on the DVC partition of the AVEC 2013 corpus.

    System                    mean    min     max     st.dev.
    KL-means, standard        70.30   57.50   80.00   4.01
    KL-means, oversampled     67.85   55.00   77.50   4.61
    i-vectors, standard       58.45   45.00   72.50   6.06
    i-vectors, oversampled    74.85   60.00   85.00   5.41

We speculate that the improvement when using the oversampled i-vectors compared with the standard system is due to the increase in training samples (despite some redundancy) used to estimate the T matrix: 50 files were used to train the standard system whilst 1505 subfiles were used to estimate T in the oversampled system. Further, the superior performance of the i-vector system when compared with the supervector system suggests that variations in the i-vector space may be more robust to phonetic content and speaker characteristics than those in the supervector space. Also, given that this system makes no attempt at speaker normalisation, and given that i-vectors feature heavily in state-of-the-art speaker recognition systems, it is reasonable to infer from Table 1 that the use of the oversampled i-vector paradigm makes the depression classification system more robust to phonetic variability.

Figure 2: Classification accuracy for different dimensions of T, estimated using the oversampling technique with subfiles extracted from 60 second segments with a 10 second overlap, on the DVC partition of the AVEC 2013 corpus.

5.3. Effect of Variability Compensation

While the i-vector representation offers some inherent robustness to undesired variability, further variability compensation may be carried out in the i-vector space. The application of two compensation methods to the best performing system from Section 5.2 (the 200 dimensional oversampled i-vector system, with an estimated classification accuracy of 74.85%) was comprehensively evaluated. Specifically, the use of LDA and WCCN individually, and of LDA in combination with WCCN, in the i-vector space was explored. Further, WCCN was applied on a per-depression-class basis as well as a per-speaker basis (i.e., within-speaker covariance normalisation). All results from these analyses (Table 2) indicate that these methods have a negative impact on system accuracy. Moreover, their performance was also lower than that of the KL-means baselines from Section 5.1. Section 5.4 attempts to shed some light on the reasons behind this drop in performance.

Table 2: Classification accuracy for a range of different i-vector variability compensation techniques, using 200-dimensional i-vectors extracted via oversampling, found on the DVC partition of the AVEC 2013 corpus.

    LDA dim    LDA only   LDA+WCCN (class)   LDA+WCCN (speaker)
    No LDA     -          64.53              66.13
    190        62.83      66.18              64.05
    120        63.43      65.45              64.13
    40         63.20      64.05              63.18

5.4. Visualisation of the i-vector Space

In addition to the classification experiments, the t-Distributed Stochastic Neighbour Embedding (t-SNE) technique [23] was employed to visualise the i-vector space before and after the application of LDA and WCCN. The t-SNE technique allows visualization of high dimensional data in 2 or 3 dimensions by attempting to preserve the local structure of the high dimensional space in the low dimensional space [23].
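A minimal recreation of such a visualisation with scikit-learn's t-SNE is sketched below on synthetic data; the perplexity and other settings are our choices, not necessarily those used to produce Figure 3:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # X: (n_subfiles, 200) oversampled i-vectors; y: 0 = low, 1 = high class
    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 200))
    y = rng.integers(0, 2, size=600)

    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(X)
    plt.scatter(emb[:, 0], emb[:, 1], c=np.where(y == 0, "black", "grey"), s=8)
    plt.title("t-SNE of the i-vector space (black: low, grey: high)")
    plt.show()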


The t-SNE plots shown in Figure 3 help us gain an understanding of why LDA and WCCN have such a negative impact on i-vector system accuracy. Figure 3a shows an uncompensated i-vector space taken from one fold of a cross validation test (test partitions for all folds were chosen randomly and of equal size). We speculate that, as the number of distinct clusters approximately equals the number of speakers in the DVC partition, each individual cluster in the visualisation represents the oversampled i-vectors from a single speaker.

LDA has the effect of separating the training samples into two distinct class groups very well (Figure 3b). However, this separation does not appear to generalise to unseen speakers (clusters corresponding to test samples are depicted in the plots within circles) and consequently classification accuracy on these samples suffers. Further, the test samples still appear to be organised in speaker clusters and are well separated from the two training classes. We speculate this could be due to effects of the speaker identity that are not successfully mitigated by the transform. Similar results were seen for WCCN applied on a per-class basis (not shown here).

The effect of WCCN applied on a per-speaker basis (Figure 3c) is a reduction in the 'length' of each speaker cluster compared with Figure 3a, but it does not alter the global structure of the i-vector space and consequently does not achieve any 'useful normalisation'. We speculate that WCCN on a per-speaker basis is reducing phonetic variability within each file. However, as the i-vector representation itself appears to do this to a large extent, and as within-speaker covariance normalisation does not actually improve results, we speculate that speaker identity is the major confounding factor.

Figure 3: (a) t-SNE plot of the uncompensated i-vector system, (b) the LDA compensated i-vector system, and (c) the WCCN (per speaker) compensated i-vector system. Black points represent the low class, grey the high class. Test samples are circled.

5.5. Depression Scale Prediction using i-vectors

Finally, for transparency, we compare the proposed systems with those presented in [10] and the baseline corpus results [19]. We tested both an oversampled uncompensated i-vector system and an oversampled i-vector WCCN (per speaker) system for their ability to estimate depression scale (i.e. regression as opposed to 2-class classification) on two sets, the AVEC development and test partitions. All system accuracies are reported in terms of Root Mean Square Error (RMSE). The reader is referred to [10] for the setup of these tests.

Table 3: RMSEs generated using the oversampling method compared with system accuracies from [10].

    System                                   RMSE Devel.   RMSE Test
    Baseline [19] (standard)                 10.75         14.12
    KL-means [10]                            9.00          10.17
    KL-means (oversampled + NAP) [10]        8.94          13.34
    i-vectors (oversampled)                  10.34         11.37
    i-vectors (oversampled, WCCN speaker)    10.13         11.58
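For reference, a short sketch of the RMSE scoring used in Table 3; fusing the per-subfile regression outputs with the median is our assumption, mirroring the classification fusion of Section 4.3:

    import numpy as np

    def file_rmse(bdi_true, subfile_scores, sub_file):
        """Median-fuse per-subfile regression outputs to one BDI prediction
        per file, then score with RMSE as reported in Table 3."""
        preds = np.array([np.median(subfile_scores[sub_file == f])
                          for f in range(len(bdi_true))])
        return float(np.sqrt(np.mean((preds - bdi_true) ** 2)))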

On the development set, both i-vector systems beat the challenge baseline but were unable to beat the KL-means supervector systems. On the test set, the proposed i-vector systems outperformed the challenge baseline and the oversampled, NAP compensated KL-means supervector system, but not the standard KL-means supervector system. Interestingly, WCCN applied on a per-speaker basis lowers the RMSE as evaluated on the development set but not on the test set. It should be noted that the i-vector systems were set up and optimised for a classification task rather than a regression task.

6. CONCLUSION

Speaker variability, phonetic content and intersession (recording setup) variability have all previously been shown to introduce unwanted variability into both depression classification and prediction systems [9], [10], [15]. This paper presents an oversampled extraction technique for i-vector systems in smaller datasets. Apart from permitting the development of useful i-vector systems in this context, this technique was found to be better suited to classifying high and low levels of depression than both an uncompensated KL-means supervector system and the standard i-vector (one i-vector per file) extraction technique. We speculate that this is due to both the suitability of i-vectors for minimizing the effects of unwanted variability in a classification context and the possibility of estimating reasonable i-vector parameters via oversampling.

Further attempts to minimize unwanted variability through LDA and WCCN were unsuccessful. We speculate that, as in [9], this is due to the speaker variability being stronger than the variability due to the effects of depression. Visualisations based on t-SNE support this speculation. In addition to the classification systems, results from the regression analysis show that i-vector systems could outperform the challenge baseline and provide performance close to that of a competitive system operating on the entire AVEC data [10].

Future work includes testing the effects of a less naïve oversampling method. Given the results shown in the t-SNE plots in Section 5.4, cluster-based classification techniques will also be trialled.

7. ACKNOWLEDGEMENTS

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. This work was partly funded by ARC Discovery Project DP130101094 (Epps) and the German Research Foundation (KR3698/4-1) (Krajewski).


8. REFERENCES

[1] M. Li, A. Metallinou, D. Bone, and S. Narayanan, “Speaker states recognition using latent factor analysis based Eigenchannel factor vector modeling,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 1937–1940.

[2] D. Bone, M. Li, M. P. Black, and S. S. Narayanan, “Intoxicated Speech Detection: A Fusion Framework with Speaker-Normalized Hierarchical Functionals and GMM Supervectors,” Comput. Speech Lang., p. NA, 2012.

[3] J. Krajewski, S. Schnieder, D. Sommer, A. Batliner, and B. Schuller, “Applying multiple classifiers and non-linear dynamics features for detecting sleepiness from speech,” Neurocomputing, vol. 84, pp. 65–75, 2012.

[4] V. Sethu, J. Epps, and E. Ambikairajah, “Speaker Variability In Speech Based Emotion Models - Analysis and Normalisation,” in 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7522–7526.

[5] R. J. Davidson, D. Pizzagalli, J. B. Nitschke, and K. Putnam, “Depression: Perspectives from Affective Neuroscience,” Annu. Rev. Psychol., vol. 53, no. 1, pp. 545–574, 2002.

[6] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human-computer interaction,” Signal Process. Mag. IEEE, vol. 18, no. 1, pp. 32–80, 2001.

[7] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, P. Ouellet, and P. Dumouchel, “Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification,” in 10th Annual Conference of the International Speech Communication Association Interspeech2009, 2009, vol. 9, pp. 1559–1562.

[8] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Trans. Audio. Speech. Lang. Processing, vol. 19, no. 4, pp. 788–798, May 2011.

[9] N. Cummins, J. Epps, M. Breakspear, and R. Goecke, “An Investigation of Depressed Speech Detection: Features and Normalization,” in 12th Annual Conference of the International Speech Communication Association Interspeech2011, 2011, pp. 2997–3000.

[10] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, and J. Epps, “Diagnosis of Depression by Behavioural Signals: A Multimodal Approach,” in The 21st ACM International Conference on Multimedia, 2013, p. NA.

[11] J. C. Mundt, P. J. Snyder, M. S. Cannizzaro, K. Chappie, and D. S. Geralts, “Voice acoustic measures of depression severity and treatment response collected via interactive voice response (IVR) technology,” J. Neurolinguistics, vol. 20, no. 1, pp. 50–64, 2007.

[12] J. C. Mundt, A. P. Vogel, D. E. Feltner, and W. R. Lenderking, “Vocal Acoustic Biomarkers of Depression Severity and Treatment Response,” Biol. Psychiatry, vol. 72, pp. 580 –587, 2012.

[13] S. Scherer, G. Stratou, J. Gratch, and L. Morency,

“Investigating Voice Quality as a Speaker-Independent Indicator of Depression and PTSD,” in 14th Annual Conference of the International Speech Communication Association Interspeech2013, Lyon, France, 25-29 Aug 2013, 2013, pp. 847–851.

[14] N. Cummins, J. Epps, V. Sethu, M. Breakspear, and R. Goecke, “Modeling Spectral Variability for the Classification of Depressed Speech,” in 14th Annual Conference of the International Speech Communication Association Interspeech2013, Lyon, France, 25-29 Aug 2013, 2013.

[15] D. Sturim, P. A. Torres-Carrasquillo, T. F. Quatieri, N. Malyska, and A. McCree, “Automatic Detection of Depression in Speech Using Gaussian Mixture Modeling with Factor Analysis,” in 12th Annual Conference of the International Speech Communication Association Interspeech2011, 2011, pp. 2983–2986.

[16] M. Li, K. J. Han, and S. Narayanan, “Automatic speaker age and gender recognition using acoustic and prosodic level information fusion,” Comput. Speech Lang., vol. 27, no. 1, pp. 151–167, Jan. 2013.

[17] R. Xia, and Y. Liu, “Using i-Vector Space Model for Emotion Recognition,” in 13th Annual Conference of the International Speech Communication Association Interspeech2012, 2012, pp. 2230–33.

[18] M. Senoussaoui, P. Kenny, N. Dehak, and P. Dumouchel, “An i-vector extractor suitable for speaker recognition with both microphone and telephone speech,” in Proc. IEEE Odyssey, 2010.

[19] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic, “The Continuous Audio/Visual Emotion and Depression Recognition Challenge,” in The 21st ACM International Conference on Multimedia, 2013, p. NA.

[20] D. Maust, M. Cristancho, L. Gray, S. Rushing, C. Tjoa, and M. E. Thase, “Chapter 13 - Psychiatric rating scales,” in Handbook of Clinical Neurology, vol. Volume 106, F. B. Michael J. Aminoff, and F. S. Dick, Eds. Elsevier, 2012, pp. 227–237.

[21] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, “SVM Based Speaker Verification using a GMM Supervector Kernel and NAP Variability Compensation,” in 2006 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, (ICASSP’ 06)., 2006, vol. 1, pp. 97–100.

[22] C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, 2011.

[23] L. J. P. van der Maaten and G. E. Hinton, “Visualizing High-Dimensional Data Using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.
