2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
VARIABILITY COMPENSATION IN SMALL DATA: OVERSAMPLED EXTRACTION OF I-VECTORS FOR THE CLASSIFICATION OF DEPRESSED SPEECH
Nicholas Cummins1,2, Julien Epps1,2, Vidhyasaharan Sethu1, Jarek Krajewski3
1School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, Australia
2ATP Research Laboratory, National ICT Australia (NICTA), Australia
3Experimental Industrial Psychology, University of Wuppertal, Wuppertal, Germany
[email protected], [email protected]
ABSTRACT
Variations in the acoustic space due to changes in speaker mental state are potentially overshadowed by variability due to speaker identity and phonetic content. Using the Audio/Visual Emotion Challenge and Workshop 2013 Depression Dataset, we explore the suitability of i-vectors for reducing these latter sources of variability when distinguishing between low and high levels of speaker depression. In addition, we investigate whether supervised variability compensation methods, such as Linear Discriminant Analysis (LDA) and Within Class Covariance Normalisation (WCCN), applied in the i-vector domain, can compensate for speaker and phonetic variability. Classification results show that i-vectors formed using an oversampling methodology outperform a baseline set by KL-means supervectors. However, neither compensation method appears to improve system accuracy. Visualisations afforded by the t-Distributed Stochastic Neighbour Embedding (t-SNE) technique suggest that, despite the application of these techniques, speaker variability remains a strong confounding effect.
Index Terms— Depression, Acoustic Variability, I-vectors, Linear Discriminant Analysis, Within Class Covariance Normalisation, t-Distributed Stochastic Neighbour Embedding
1. INTRODUCTION
A wide range of acoustic information is modulated onto speech signals; this potentially places an upper bound on the accuracy of a speech-based depression classification system. Acoustic variability that arises due to speaker characteristics, channel effects and phonetic content has been shown to have detrimental effects on the recognition accuracy of a range of paralinguistic information, including long-term speaker traits such as age and gender [1], temporary speaker traits such as intoxication [2] and sleepiness [3], as well as transient speaker states such as emotion [4]. In emotion recognition in particular, it has been shown that speaker variability affects the feature space distribution of emotional data [4]. Automatic emotion and depression recognition systems share common traits; a continuous negative affect is a key symptom of depression [5]. However, depression is more steady-state than the transient nature of emotions, with individuals afflicted for weeks or months rather than seconds or minutes [6].
In speaker recognition, i-vectors, together with a range of complementary transforms designed to further reduce errors arising from intersession variability, have become a pseudo-standard due to their ability to compress both speaker and channel variability into a low-dimensional feature space [7], [8]. However, little work has explored the suitability of this paradigm for modelling paralinguistic tasks, which often have substantially smaller amounts of training data (both in terms of number of speakers and duration) than those used in speaker recognition.
Motivated by results showing that both speaker variability and phonetic variability have negative effects on depression classification [9], [10], we investigate the suitability of i-vectors for modelling depressed speech as well as the ability of the paradigm to reduce the effects of variability not related to depression.
2. RELATION TO PRIOR WORK
Whilst a range of prosodic [11], [12], voice quality [13] and spectral [9], [14] features, as well as Gaussian Mixture Model (GMM) based supervectors [10], have been established for use in automatic speech-based depression classifiers, only a small number of papers have investigated the effects of unwanted acoustic variability on depressed speech classification.
Results presented in [9] show that per-speaker normalisation offers no improvement for a depression classifier indicating that, as in emotion recognition, speaker variability has stronger effects than variability due to depression. Work in [15] shows that depression classification is susceptible to both speaker and channel effects.
Recent results, found using the Audio/Visual Emotion Challenge (AVEC) and Workshop 2013 Depression Dataset, show that Nuisance Attribute Projection (NAP) applied to Kullback-Leibler (KL-means) supervectors may be able to help reduce effects due to phonetic variability in a depression regression system [10]. Whilst this paper focuses on i-vectors for depressed speech classification, we also apply our final i-vector system configuration to the depression scale prediction challenge (Section 5.4) to allow comparison with results presented in [10].
The application of i-vectors to paralinguistic speech classification problems may be complicated by more than just the lack of previous investigation on comparatively small databases. For attributes like depression, the training/development data often contain examples of only one class (i.e. low or high depression, but not both) from any single speaker [14]. Compared with emotion or speaker recognition, in which training databases exist with examples of many emotions or channels per speaker [4], a different approach is required.
Whilst i-vectors, and the related techniques of Joint Factor Analysis and Latent Factor Analysis, have been used in other paralinguistic classification tasks such as age and gender analysis [1], [16] and emotion classification [4], [17], to the best of the authors' knowledge this is the first paper to explore the suitability of i-vectors for the classification of speech affected by depression. Further, depression data, including AVEC, often pose the additional challenge that speech utterances from only a single level of depression per speaker are available.
978-1-4799-2893-4/14/$31.00 ©2014 IEEE
3. I-VECTOR PARADIGM
Given a Universal Background Model (UBM) to represent the feature distribution of the acoustic space, individual speech utterances can be used to adapt this UBM and the resulting GMM represented as supervectors. This allows for the application of a number of linear vector space operations, but is held back by the inherent high dimensionality of supervectors. The i-vector space is a low dimensional subspace onto which supervectors are mapped via a linear transformation while retaining most of the variability (useful information) present in the supervector space [8], and has been used extensively in speaker recognition [7], [8]. In the context of depression classification, it is expected that the UBM approximately models the phonetic structure of the acoustic space, and hence that the supervectors, and consequently the i-vectors, capture variations in this structure due to other factors including level of depression, speaker identity, channel effects, etc. The i-vector model is given by:

M = m + Tw    (1)

where m is the supervector corresponding to the UBM, M is the supervector corresponding to a particular utterance, w is the i-vector, and T is the projection matrix [7], [8].

3.1. Estimation of the T Matrix
Mathematically, the i-vector model is a factor analysis method and the matrix T is estimated from training data. In general, results from speaker recognition show that the more training data used in the training of the T space, the more accurate the final system [18]. Typically of paralinguistic speech systems, the database we are using has a limited training set (50 files, Section 4.1); to provide more supervector instances for the estimation of T, we used an oversampling technique wherein multiple overlapping segments of speech (subfiles) were extracted from each file (Figure 1) [10].

Figure 1: Example of creating different speech segments (subfiles) or 'channels' from a single sufficiently long audio file for extracting supervectors used to estimate T. Figure reproduced from [10].

3.2. Variability Compensation
To attempt to mitigate the effects of speaker and channel variability while retaining variability due to depression in the i-vector space, two forms of compensation were applied to the oversampled extracted i-vectors. The first, Linear Discriminant Analysis (LDA), attempts to find the linear projection from the i-vector space that maximises separation between classes (levels of depression in this case), while the second, Within Class Covariance Normalisation (WCCN), attempts to minimise variations within a class [7], [8].
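As an illustration, WCCN of the kind described above can be sketched in a few lines of NumPy; the function names and toy data below are ours, not the paper's, and this is a minimal sketch rather than the authors' implementation:

```python
import numpy as np

def wccn(ivectors, labels):
    """Train a WCCN projection: whiten the average within-class
    covariance of the i-vector space. Depending on the basis chosen,
    labels may be depression classes, files, or speakers."""
    classes = np.unique(labels)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        X = ivectors[labels == c]
        W += np.cov(X, rowvar=False, bias=True)
    W /= len(classes)
    # B satisfies B B^T = W^{-1}; i-vectors are projected as B^T w
    B = np.linalg.cholesky(np.linalg.inv(W))
    return B

# toy usage: two classes of 3-dimensional "i-vectors"
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = np.repeat([0, 1], 20)
B = wccn(X, y)
X_wccn = X @ B  # compensated i-vectors
```

After projection, the average within-class covariance of the compensated vectors is the identity, which is the intended effect of the normalisation.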
Both LDA and WCCN were applied on a per-class basis, using subfiles labelled with a class (low or high level of depression). WCCN was also trialled on a per-file basis; here the transform matrix was trained using subfiles labelled on a per-file basis.

4. EXPERIMENTAL SET-UP
4.1. Corpora
All experiments presented used a subset of the Audio/Visual Emotion Challenge and Workshop (AVEC) 2013 dataset [19]. The full dataset comprises 150 recordings, divided into training, development and testing partitions each of 50 files (recordings). Each recording has an associated Beck Depression Inventory (BDI) score, a self-reported measure of depression with clinical validity, which rates the severity of cognitive, affective and somatic symptoms to give a patient a score which reflects their level of depression [20]. For an in-depth description of the corpus, the reader is referred to [10], [19].

The make-up of this corpus provides some unique challenges. There is large variety in terms of both file length (ranging from 5 to 27 minutes) and the speech tasks (vocal exercises, free and read speech tasks, noting that not all files include all tasks) contained within each file. This introduces a large degree of phonetic variability within each file [10]. Specifically induced changes in speaker affect are also present throughout each file [19]; this could potentially introduce another source of unwanted variation into our system.

To form a suitable two-class classification data set for preliminary experiments (i.e. one in which differences in acoustic variability between each class are most clearly due to the effects of depression), herein referred to as the development classification (DVC) partition, all files (twenty files from twenty distinct speakers) from the development set with BDI scores between 0-9 (indicating mild to non-existent levels of depression) were selected to form the 'low' class. A further twenty files from twenty other speakers (out of 21 such files available in the development set - the longest one was left out to balance the two classes as best as possible), with BDI greater than 19 (severe depression), were selected from the development partition to form the high class.

4.2. Experimental Settings
The experimental settings (unless otherwise stated) of the classification systems were as follows: the frame-level features were MFCCs extracted as per [10]. A baseline classification accuracy was found using KL-means supervectors; the reader is referred to [21] for the extraction methodology. All UBMs were trained using 10 iterations of the EM algorithm from the entire training partition. The entire AVEC training set was also used to train the T matrix; i-vectors were then extracted for the two-class DVC partition described in Section 4.1. The number of UBM mixtures, MAP adaptation iterations, T matrix dimensionality and LDA dimensionality were set empirically using the training partition and cross-fold validation on the DVC partition.

LIBSVM with a linear kernel [22], and default user settings, was used for all Support Vector Machine (SVM) testing and training. All training vectors were normalized to a range of [0, 1]; the development and test vectors were normalized by the same factors as the training vectors. The choice of MFCCs as features was made for two reasons: firstly, MFCCs, in combination with GMMs, have proved successful for classifying low/high levels of depression [14], [15]; secondly, the use of MFCCs ensures that the feature space dimensionality is low enough to allow UBMs to be trained such that these UBMs are representative of the acoustic/phonetic structure of speech.
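The oversampled subfile extraction (multiple overlapping segments cut from one long recording, Section 3.1) can be sketched as follows; the window and overlap values are the best settings reported later in Section 5.2, and the function name and toy signal are ours:

```python
import numpy as np

def extract_subfiles(signal, sr, win_s=60.0, overlap_s=10.0):
    """Slice one long recording into overlapping 'subfiles'.
    hop = window - overlap, so consecutive subfiles share
    overlap_s seconds of audio."""
    win = int(win_s * sr)
    hop = int((win_s - overlap_s) * sr)
    subfiles = []
    start = 0
    while start + win <= len(signal):
        subfiles.append(signal[start:start + win])
        start += hop
    return subfiles

# toy usage: a 5-minute "recording" at 16 kHz
sr = 16000
audio = np.zeros(5 * 60 * sr)
subs = extract_subfiles(audio, sr)
# 300 s of audio, 60 s windows, 50 s hop -> 5 subfiles
# starting at 0, 50, 100, 150 and 200 s
```

Each subfile is then treated as an independent 'channel' from which a supervector (and later an i-vector) is extracted.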
4.3. Evaluation Method
A similar cross-fold validation scheme to that presented in [10] was used to generate the classification accuracy results. In this method 100 trials of 5-fold cross-validation are employed, wherein each trial the DVC partition is randomly split into the 5 folds, which are then used as the SVM training and test set permutations. Results are reported in terms of average accuracy across all trials (where each trial is the average of the 5 folds of cross-validation), as well as the minimum and maximum accuracy from each set of trials and the standard deviation in each trial set. Classification systems are referred to either as 'standard', where a single supervector / i-vector is extracted per file, or 'oversampled'. When using the oversampling technique, multiple scores were generated per file (one per supervector/i-vector) and the median score was used to generate one prediction per file.

5. RESULTS
5.1. Baseline System Accuracy
As in [10], an initial series of comprehensive tests were run on the DVC partition using an uncompensated KL-means supervector system. This system was chosen as our baseline as it was the most consistent performing system in [10]. Results confirmed that KL-means supervectors extracted from a 128-mixture GMM using 5 iterations of MAP also gave the best standard system (one supervector per file) result (70.30%) (Table 1). An oversampled KL-means system (multiple supervectors per file) gave an accuracy of 67.85%, with the subfiles extracted from 60 second segments with a 10 second overlap (Table 1).

5.2. i-vector Classification
To determine the suitability of the i-vector paradigm for modelling depression, an exhaustive series of tests was run on a standard i-vector system (only one i-vector per file) to determine the ideal dimensionality of T. None of these classification accuracies were able to better either KL-means baseline (standard or oversampled); the best accuracy (58.88%) was found for a T dimensionality of 100 (Table 1).

For the oversampled system, we ran an exhaustive series of tests to determine the optimal parameters in terms of window size, overlap and T matrix dimensionality. Results indicated that the best setting was a 60 second window with a 10 second overlap. A sample of these results is shown in Figure 2. The best accuracy (74.85%) was achieved with a T dimensionality of 200; this is a relative improvement of 6.5% and 10.3% over the single and oversampled KL-means baselines respectively (Table 1). Perhaps more importantly, oversampling seems to allow relatively competitive i-vector systems to be developed where a 'standard' i-vector implementation provides close to chance-level accuracy.

Table 1: Standard and oversampled classification accuracies for the uncompensated KL-means and i-vector systems, found using the DVC partition of the AVEC 2013 corpus.

System                    mean    min     max     st.dev.
KL-means   Standard       70.30   57.50   80.00   4.01
KL-means   Oversampled    67.85   55.00   77.50   4.61
i-vectors  Standard       58.45   45.00   72.50   6.06
i-vectors  Oversampled    74.85   60.00   85.00   5.41
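A minimal sketch of an evaluation protocol of this kind (repeated randomised 5-fold cross-validation, with median-score fusion of subfile scores into one prediction per file) is shown below; the classifier inside the loop is a trivial majority-label stand-in for the SVM, and all names are ours:

```python
import numpy as np

def median_fusion(scores_per_file):
    """Oversampled systems score every subfile; the file-level
    prediction is the median of its subfile scores."""
    return {f: float(np.median(s)) for f, s in scores_per_file.items()}

def trial_accuracies(labels, n_trials=100, n_folds=5, seed=0):
    """100 trials of randomised 5-fold cross-validation; each
    trial's accuracy is the mean over its folds. The majority-label
    predictor below is a stand-in for SVM train/test."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    accs = []
    for _ in range(n_trials):
        order = rng.permutation(n)
        folds = np.array_split(order, n_folds)
        fold_accs = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate(
                [folds[j] for j in range(n_folds) if j != k])
            # stand-in classifier: predict the majority training label
            majority = np.bincount(labels[train]).argmax()
            fold_accs.append(np.mean(labels[test] == majority))
        accs.append(np.mean(fold_accs))
    return np.mean(accs), np.min(accs), np.max(accs), np.std(accs)

# toy usage: 40 balanced files, as in the DVC partition
labels = np.repeat([0, 1], 20)
mean_acc, min_acc, max_acc, std_acc = trial_accuracies(labels)
```

The four returned statistics correspond to the mean, min, max and st.dev. columns reported in Table 1.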
We speculate that the improvement when using the oversampled i-vectors compared with the standard system is due to the increase in training samples (despite some redundancy) used to estimate the T matrix: 50 files were used to train the standard system, whilst 1505 subfiles were used for the T matrix in the oversampled system. Further, the superior performance of the i-vector system when compared with the supervector system suggests that variations in the i-vector space may be more robust to phonetic content and speaker characteristics than those in the supervector space. Also, given that this system makes no attempt at speaker normalisation, and given that i-vectors feature heavily in state-of-the-art speaker recognition systems, it is reasonable to infer from Table 1 that the use of the oversampled i-vector paradigm makes the depression classification system more robust to phonetic variability.

Figure 2: Classification accuracy for different dimensions of T, estimated using the oversampled technique (subfiles extracted from 60 second segments with a 10 second overlap), on the DVC partition of the AVEC 2013 corpus.

5.3. Effect of Variability Compensation
While the i-vector representation offers some inherent robustness to undesired variability, further variability compensation may be carried out in the i-vector space, and the application of these methods to the best performing system from Section 5.2 (the 200-dimensional oversampled i-vector system with an estimated classification accuracy of 74.85%) was comprehensively evaluated. Specifically, the use of LDA and WCCN individually, and LDA in combination with WCCN, in the i-vector space were explored. Further, WCCN was applied on a per-depression-class basis as well as a per-speaker basis (i.e., within-speaker covariance normalisation). All results from these analyses (Table 2) indicate that these methods have a negative impact on system accuracy. Moreover, their performance was also lower than that of the KL-means baselines from Section 5.1. Section 5.4 attempts to shed some light on the reasons behind this drop in performance.

Table 2: Classification accuracy for a range of different variability compensation techniques, using 200-dimension i-vectors extracted via oversampling, found on the DVC partition of the AVEC 2013 corpus.

LDA dim   LDA only   LDA+WCCN(class)   LDA+WCCN(speaker)
No LDA    -          64.53             66.13
190       62.83      66.18             64.05
120       63.43      65.45             64.13
40        63.20      64.05             63.18

5.4. Visualisation of the i-vector space
In addition to the classification experiments, the t-Distributed Stochastic Neighbour Embedding (t-SNE) technique [23] was employed to visualise the i-vector space before and after the application of LDA and WCCN. The t-SNE technique allows the visualization of high dimensional data in 2 or 3 dimensions by attempting to preserve the local structure of the high dimensional space in the low dimensional space [23].

The t-SNE plots shown in Figure 3 help us gain an understanding of why LDA and WCCN have such a negative impact on i-vector system accuracy.
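A visualisation of this kind can be sketched with any t-SNE implementation (scikit-learn's is used here as an assumption; the toy "i-vectors" below are synthetic stand-ins, not the paper's data):

```python
import numpy as np
from sklearn.manifold import TSNE  # assumption: any t-SNE implementation works

# toy stand-in for an oversampled i-vector space: four "speakers",
# several subfile i-vectors clustered around each speaker
rng = np.random.default_rng(0)
speakers = [rng.normal(loc=m, scale=0.1, size=(15, 20))
            for m in (-2.0, -1.0, 1.0, 2.0)]
ivectors = np.vstack(speakers)  # 200-dim in the paper; 20-dim here

# embed in 2-D while preserving local neighbourhood structure
embedding = TSNE(n_components=2, perplexity=10.0,
                 init="pca", random_state=0).fit_transform(ivectors)
# embedding can then be scatter-plotted, coloured by class and speaker
```

Because t-SNE preserves local structure, subfile i-vectors from the same recording tend to fall into one cluster, which is exactly the per-speaker clustering the paper observes.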
Figure 3a shows an uncompensated i-vector space taken from one fold of a cross-fold validation test (test partitions for all folds were chosen randomly and of equal size). We speculate that, as the number of distinct clusters approximately equals the number of speakers in the DVC partition, each individual cluster in the visualisation represents the oversampled i-vectors from a single speaker.

LDA has the effect of separating the training samples into two distinct class groups very well (Figure 3b). However, this separation does not appear to generalise to unseen speakers (clusters corresponding to test samples are depicted in the plots within circles) and consequently classification accuracy on these samples suffers. Further, the test samples still appear to be organised in speaker clusters and are well separated from the two training classes. We speculate this could be due to effects of the speaker identity that are not successfully mitigated by the transform. Similar results were seen for WCCN applied on a per-class basis (not shown here).

The effect of WCCN applied on a per-speaker basis (Figure 3c) is a reduction in the 'length' of each speaker cluster when compared with Figure 3a, but it does not alter the global structure of the i-vector space and consequently does not achieve any 'useful normalisation'. We speculate that WCCN on a per-speaker basis is reducing phonetic variability within each file. However, as the i-vector representation itself appears to do this to a large extent, and as within-speaker covariance normalisation does not actually improve results, we speculate that speaker identity is the major confounding factor.

Figure 3: (a) t-SNE plot of the uncompensated i-vector system, (b) the LDA compensated i-vector system, and (c) the WCCN (per speaker) compensated i-vector system. Black points represent the low class, grey the high class. Test samples are circled.

5.5. Depression Scale Prediction using i-vectors
Finally, for transparency, to compare the proposed systems with those presented in [10] and the baseline corpus results [19], we tested both an oversampled uncompensated i-vector system and an oversampled i-vector WCCN (per speaker) system for their ability to estimate depression scale (i.e. regression as opposed to 2-class classification) on two sets, the AVEC development and test partitions. All system accuracies are reported in terms of Root Mean Square Error (RMSE). The reader is referred to [10] for the setup of these tests.

Table 3: RMSEs generated using the oversampling method compared with system accuracies from [10].

System                                  RMSE Devel.   Test
Baseline [19]                           10.75         14.12
KL-means (standard) [10]                9.00          10.17
KL-means (oversampled + NAP) [10]       8.94          13.34
i-vectors (oversampled)                 10.34         11.37
i-vectors (oversampled WCCN speaker)    10.13         11.58
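For completeness, the RMSE metric used above is simply the root of the mean squared difference between predicted and true BDI scores; a minimal sketch with made-up scores (the function name and values are ours):

```python
import numpy as np

def rmse(predicted_bdi, true_bdi):
    """Root Mean Square Error, the AVEC depression
    sub-challenge metric."""
    predicted_bdi = np.asarray(predicted_bdi, dtype=float)
    true_bdi = np.asarray(true_bdi, dtype=float)
    return float(np.sqrt(np.mean((predicted_bdi - true_bdi) ** 2)))

# toy usage with made-up BDI scores (0-63 scale)
print(rmse([10, 20, 30], [13, 23, 33]))  # -> 3.0
```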
On the development set, both i-vector systems beat the challenge baselines, but were unable to beat the KL-means supervector system. On the test set, the proposed i-vector systems outperformed the challenge baseline and the oversampled, NAP compensated, KL-means supervector system, but not the standard KL-means supervector system. Interestingly, WCCN applied on a per-speaker basis lowers the RMSE as evaluated on the development set but not on the test set. It should be noted that the i-vector systems were set up and optimised for a classification task rather than a regression task.

6. CONCLUSION
Speaker variability, phonetic content and intersession (recording setup) variability have all previously been shown to introduce unwanted variability into both depression classification and prediction systems [9], [10], [15]. This paper presents an oversampled extraction technique for the i-vector system in smaller datasets. Apart from permitting the development of useful i-vector systems in this context, this technique was found to be better suited to classifying high and low levels of depression than both an uncompensated KL-means supervector system and the standard i-vector (one i-vector per file) extraction technique. We speculate this is due to both the suitability of i-vectors for minimizing the effects of unwanted variability in a classification context and the possibility of estimating reasonable i-vector parameters via oversampling.

Further attempts to minimize unwanted variability through LDA and WCCN were unsuccessful. We speculate that, as in [9], this is due to the speaker variability being stronger than the variability due to the effects of depression. Visualisations based on t-SNE support this speculation. In addition to classification, results from the regression analysis show that i-vector systems could outperform the challenge baseline and provide performance close to that of a competitive system operating on the entire AVEC data [10].

Future work includes testing the effects of a less naïve oversampling method. Given the results shown in the t-SNE plots of Section 5.4, cluster-based classification techniques will also be trialled.

7. ACKNOWLEDGEMENTS
NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. This work was partly funded by ARC Discovery Project DP130101094 (Epps) and the German Research Foundation (KR3698/4-1) (Krajewski).
8. REFERENCES [1] M. Li, A. Metallinou, D. Bone, and S. Narayanan, “Speaker
states recognition using latent factor analysis based Eigenchannel factor vector modeling,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 1937–1940.
[2] D. Bone, M. Li, M. P. Black, and S. S. Narayanan, “Intoxicated Speech Detection: A Fusion Framework with Speaker-Normalized Hierarchical Functionals and GMM Supervectors,” Comput. Speech Lang., p. NA, 2012.
[3] J. Krajewski, S. Schnieder, D. Sommer, A. Batliner, and B. Schuller, “Applying multiple classifiers and non-linear dynamics features for detecting sleepiness from speech,” Neurocomputing, vol. 84, pp. 65–75, 2012.
[4] V. Sethu, J. Epps, and E. Ambikairajah, “Speaker Variability In Speech Based Emotion Models - Analysis and Normalisation,” in 2013 IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 7522–7526.
[5] R. J. Davidson, D. Pizzagalli, J. B. Nitschke, and K. Putnam, “Depression: Perspectives from Affective Neuroscience,” Annu. Rev. Psychol., vol. 53, no. 1, pp. 545–574, 2002.
[6] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, “Emotion recognition in human-computer interaction,” Signal Process. Mag. IEEE, vol. 18, no. 1, pp. 32–80, 2001.
[7] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, P. Ouellet, and P. Dumouchel, “Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification.,” in 10th Annual Conference of the International Speech Communication Association Interspeech2009, 2009, vol. 9, pp. 1559–1562.
[8] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Trans. Audio. Speech. Lang. Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[9] N. Cummins, J. Epps, M. Breakspear, and R. Goecke, “An Investigation of Depressed Speech Detection: Features and Normalization,” in 12th Annual Conference of the International Speech Communication Association Interspeech2011, 2011, pp. 2997–3000.
[10] N. Cummins, J. Joshi, A. Dhall, V. Sethu, R. Goecke, and J. Epps, “Diagnosis of Depression by Behavioural Signals: A Multimodal Approach,” in The 21st ACM International Conference on Multimedia, 2013, p. NA.
[11] J. C. Mundt, P. J. Snyder, M. S. Cannizzaro, K. Chappie, and D. S. Geralts, “Voice acoustic measures of depression severity and treatment response collected via interactive voice response (IVR) technology,” J. Neurolinguistics, vol. 20, no. 1, pp. 50–64, 2007.
[12] J. C. Mundt, A. P. Vogel, D. E. Feltner, and W. R. Lenderking, “Vocal Acoustic Biomarkers of Depression Severity and Treatment Response,” Biol. Psychiatry, vol. 72, pp. 580 –587, 2012.
[13] S. Scherer, G. Stratou, J. Gratch, and L. Morency,
“Investigating Voice Quality as a Speaker-Independent Indicator of Depression and PTSD,” in 14th Annual Conference of the International Speech Communication Association Interspeech2013, Lyon, France, 25-29 Aug 2013, 2013, pp. 847–851.
[14] N. Cummins, J. Epps, V. Sethu, M. Breakspear, and R. Goecke, “Modeling Spectral Variability for the Classification of Depressed Speech,” in 14th Annual Conference of the International Speech Communication Association Interspeech2013, Lyon, France, 25-29 Aug 2013, 2013.
[15] D. Sturim, P. A. Torres-Carrasquillo, T. F. Quatieri, N. Malyska, and A. McCree, “Automatic Detection of Depression in Speech Using Gaussian Mixture Modeling with Factor Analysis,” in 12th Annual Conference of the International Speech Communication Association Interspeech2011, 2011, pp. 2983–2986.
[16] M. Li, K. J. Han, and S. Narayanan, “Automatic speaker age and gender recognition using acoustic and prosodic level information fusion,” Comput. Speech Lang., vol. 27, no. 1, pp. 151–167, Jan. 2013.
[17] R. Xia, and Y. Liu, “Using i-Vector Space Model for Emotion Recognition,” in 13th Annual Conference of the International Speech Communication Association Interspeech2012, 2012, pp. 2230–33.
[18] M. Senoussaoui, P. Kenny, N. Dehak, and P. Dumouchel, “An i-vector extractor suitable for speaker recognition with both microphone and telephone speech,” in Proc. IEEE Odyssey, 2010.
[19] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic, “The Continuous Audio/Visual Emotion and Depression Recognition Challenge,” in The 21st ACM International Conference on Multimedia, 2013, p. NA.
[20] D. Maust, M. Cristancho, L. Gray, S. Rushing, C. Tjoa, and M. E. Thase, “Chapter 13 - Psychiatric rating scales,” in Handbook of Clinical Neurology, vol. Volume 106, F. B. Michael J. Aminoff, and F. S. Dick, Eds. Elsevier, 2012, pp. 227–237.
[21] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, “SVM Based Speaker Verification using a GMM Supervector Kernel and NAP Variability Compensation,” in 2006 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, (ICASSP’ 06)., 2006, vol. 1, pp. 97–100.
[22] C.-C. Chang, and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 1:27, 2011.
[23] L. J. P. van der Maaten, and G. E. Hinton, “Visualizing High-Dimensional Data Using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, Nov. 2008.