
Active Learning for Speech Emotion Recognition using Deep Neural Network

Mohammed Abdelwahab and Carlos Busso

Slide 2: Generalization of Models

§ A mismatch between training and testing conditions is one of the main barriers in speech emotion recognition
§ Under ideal classification conditions:
§  The training and testing sets come from the same domain
§ Under real application conditions:
§  The training and testing sets come from different domains
§  This leads to a performance drop [Shami and Verhelst 2007, Parthasarathy and Busso 2017]

Classification accuracy across corpora [Shami and Verhelst 2007]:

Training   Testing   Accuracy
Danish     Danish    64.90%
Berlin     Berlin    80.70%
Berlin     Danish    22.90%
Danish     Berlin    52.60%

Regression performance [Parthasarathy and Busso 2017]:

Training      Arousal [CCC]   Valence [CCC]   Dominance [CCC]
In-corpus     0.764           0.289           0.713
Cross-corpus  0.464           0.184           0.451

Slide 3: The Problem

§ The performance of a classifier degrades if there is a mismatch between training and testing conditions
§  Speaker variations, channels (environments, noise), language, and microphone settings
§ How do we build a classifier that generalizes well?
§  Minimize the discrepancy between the source and target domains
§ We explore this problem relying on active learning

[Figure: source domain used for training, target domain used for testing]

Slide 4: Motivation

§ Active learning has been widely used to iteratively select the training samples that maximize the model's performance
§  Not all samples are equally useful
§ DNNs push state-of-the-art performance
§  They require vast amounts of labeled data
§ There is a need for a scalable active learning approach for DNNs
§  Explore approaches to identify the most useful N samples

[Figure: unlabeled sentences from the target domain are selected, labeled, and added to the source-domain training set (before/after)]

Slide 5: Related Work

§ Speech emotion recognition
§  Use labelers' agreement to build uncertainty models [Zhang et al. 2013]
§  Multi-view uncertainty sampling to minimize the amount of labeled data [Zhang et al. 2015]
§  Minimize annotations per sample using an agreement threshold [Zhang et al. 2015]
§  Minimize noise accumulation in self-training [Zhang et al. 2016]
§  Adapt the model with correctly classified samples that have low confidence [Abdelwahab & Busso 2017]
§  Combine ensembles and active learning to mitigate the performance loss in a new domain [Abdelwahab & Busso 2017]
§  Greedy sampling for multi-task speech emotion linear regression [Wu & Huang 2018]
§ None of these approaches used deep neural networks

Slide 6: Data Acquisition Functions

§ There is no data acquisition function that works well in all scenarios
§  Heuristic approaches were shown to work in practice
§ Greedy sampling
§  Label space
§  Feature space
§  Combination
§ Uncertainty sampling (see the sketch after this list)
§  Least confident samples
§  Margin
§  Entropy
§  Vote entropy (ensembles)
§  Dropout
§ Random sampling (baseline)

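For the classification-style measures in this list, the scores have simple closed forms. A minimal NumPy sketch for reference (function names and the toy posterior matrix are illustrative, not from the talk; the dropout variant used for the regression task in this work is sketched on a later slide):

```python
import numpy as np

def least_confident(probs):
    """1 - max posterior: higher means more uncertain."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Gap between the two most likely classes: smaller means more uncertain."""
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(probs, eps=1e-12):
    """Shannon entropy of the posterior: higher means more uncertain."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

# Toy example: rank three unlabeled samples by entropy and pick the top 2
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.60, 0.30, 0.10]])
selected = np.argsort(-entropy(probs))[:2]  # indices of samples to annotate
```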

Slide 7: Greedy Sampling Approach

§ Greedy sampling for regression [Wu et al., 2019]
§  Maximize the diversity in the training set
1. Select initial samples
   §  Previously selected samples
2. Compute distances
   §  Feature space
   §  Label space
   §  Combination
3. Select k samples to annotate
4. Update the model and repeat

Distances between a candidate $x_i$ and a selected sample $x_j$:

$d^{x}_{i,j} = \lVert x_i - x_j \rVert_2, \qquad d^{y}_{i,j} = \lvert y_i - y_j \rvert, \qquad d^{xy}_{i,j} = d^{x}_{i,j}\, d^{y}_{i,j}$

[Figure: feature-space distances are computed between the source and target domains; label-space distances use the labels of the source domain and the predicted labels from the target domain]
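A minimal NumPy sketch of this farthest-first selection under the distance definitions above. All names are ours: y_pool stands for the predicted labels of the target-domain pool, y_sel for the labels of the already-selected samples:

```python
import numpy as np

def greedy_select(X_pool, y_pool, X_sel, y_sel, k, mode="feature"):
    """Pick k pool samples, each maximizing the minimum distance
    (d^x, d^y, or d^xy) to the already-selected set."""
    X_sel, y_sel = list(X_sel), list(y_sel)
    remaining, chosen = set(range(len(X_pool))), []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in remaining:
            dx = np.linalg.norm(np.array(X_sel) - X_pool[i], axis=1)  # d^x_{i,j}
            dy = np.abs(np.array(y_sel) - y_pool[i])                  # d^y_{i,j}
            d = {"feature": dx, "label": dy, "combination": dx * dy}[mode]
            score = d.min()         # distance to the closest selected sample
            if score > best_score:  # keep the candidate farthest from the set
                best, best_score = i, score
        chosen.append(best)
        remaining.discard(best)
        X_sel.append(X_pool[best])
        y_sel.append(y_pool[best])
    return chosen
```

In the experiments below, the feature-space distances are computed in the autoencoder embedding rather than on the raw 6,373-dimensional vectors, which shrinks the search space.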

Slide 8: Uncertainty Sampling: Dropout

§ Dropout can approximate Bayesian inference [Gal et al., 2016]
§  We can represent the model's uncertainty
§ Use different configurations of dropout, analyzing the predictions per sample
§ Goal: select the samples for which the existing model is most uncertain across several dropout iterations

[Figure: unlabeled sentences from the target domain]
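A minimal PyTorch sketch of this Monte Carlo dropout scoring, assuming a regression network that contains dropout layers (the function and variable names are ours):

```python
import torch

def mc_dropout_scores(model, X, passes=20):
    """Variance of the predictions over stochastic forward passes with
    dropout left active (MC dropout in the sense of Gal et al., 2016)."""
    model.train()  # keeps dropout layers stochastic at inference time
    with torch.no_grad():
        preds = torch.stack([model(X) for _ in range(passes)])
    return preds.var(dim=0).squeeze(-1)  # higher variance = more uncertain

# Usage sketch: annotate the k most uncertain unlabeled samples
# scores = mc_dropout_scores(regressor, X_pool)  # X_pool: (n, 6373) tensor
# selected = torch.topk(scores, k=50).indices
```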

Slide 9: The MSP-Podcast Database

§ Use existing podcast recordings
§ Divide them into speaker turns
§ Emotion retrieval to balance the emotional content
§ Annotate using a crowdsourcing framework

[Figure: podcast recording segmented into speaker turns]

Reza Lotfian and Carlos Busso, "Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings," IEEE Transactions on Affective Computing, vol. to appear, 2018.

Slide 10: The MSP-Podcast Database

§ MSP-Podcast
§  Collection of publicly available podcasts (naturalness and diversity of emotions)
§  Interviews, talk shows, news, discussions, education, storytelling, comedy, science, technology, politics, etc.
§  Creative Commons copyright licenses
§  Single-speaker segments, high SNR, no music, no telephone quality
§  Developing and optimizing different machine learning frameworks using existing databases
§  Balance the emotional content
§  Emotional annotation using a crowdsourcing platform

Processing pipeline: podcast audio (16 kHz, 16-bit PCM, mono) -> diarization -> duration filter (2.75 s < duration < 11 s) -> SNR filter -> remove telephone quality -> remove segments with music -> emotion retrieval -> manual screening -> emotional annotation

Slide 11: MSP-Podcast Corpus Version 1.1

§ Segmented turns: 152,975 sentences over 1,000 podcasts
§ With emotional labels (arousal, valence, dominance): 22,630 sentences (38 h, 57 m)
§ Test set: 7,181 segments from 50 speakers (25 male, 25 female)
§ Development set: 2,614 segments from 15 speakers (10 male, 5 female)
§ Train set: remaining 12,830 segments

Slide 12: Acoustic Features

§ Interspeech 2013 feature set (an extraction sketch follows this list)
§  65 low-level descriptors (LLDs)
§  Functionals are calculated on the LLDs, resulting in a total of 6,373 features
§ Functionals include:
§  Quartile ranges
§  Arithmetic mean
§  Root quadratic mean
§  Moments
§  Mean/std. of rising/falling slopes
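For reference, this kind of functional feature matrix can be extracted with openSMILE. A sketch using the opensmile Python package and its predefined ComParE_2016 set, which shares the 6,373-dimensional functional layout; the authors' exact Interspeech 2013 configuration may differ, so treat this as an approximation:

```python
import opensmile

# ComParE_2016 functionals: 6,373 features per utterance
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speaker_turn.wav")  # DataFrame of shape (1, 6373)
```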

Slide 13: Proposed Architecture

§ Multitask learning network
§  Primary task: emotion regression
§   Concordance correlation coefficient (CCC)
§  Secondary task: feature reconstruction
§   Mean square error (MSE)
§ The secondary task helps to generalize the model, especially with limited data

[Figure: autoencoder over the IS-2013 input (6,373 features, input nodes); the embedding feeds the reconstruction branch and the emotion regressor]

Loss (first term: reconstruction MSE; second term: 1 - CCC), with trade-off weights $\alpha_1$, $\alpha_2$ (a sketch of this loss follows below):

$\mathcal{L} = \alpha_1 \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert^2 + \alpha_2 \left[ 1 - \frac{2\rho\,\sigma_y \sigma_{\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2} \right]$
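A PyTorch sketch of this loss; a1 and a2 stand in for the trade-off weights above, and the tensors are assumed to be batched (x: input features, x_hat: reconstruction, y/y_hat: true and predicted emotion scores):

```python
import torch

def ccc_loss(y_hat, y):
    """1 - CCC: the regression term of the loss above."""
    mu_h, mu = y_hat.mean(), y.mean()
    cov = ((y_hat - mu_h) * (y - mu)).mean()  # equals rho * sigma_yhat * sigma_y
    var_h, var = y_hat.var(unbiased=False), y.var(unbiased=False)
    return 1.0 - 2.0 * cov / (var_h + var + (mu_h - mu) ** 2)

def multitask_loss(x, x_hat, y, y_hat, a1=1.0, a2=1.0):
    """Weighted sum: a1 * reconstruction MSE + a2 * (1 - CCC)."""
    mse = ((x - x_hat) ** 2).sum(dim=1).mean()  # mean over samples of ||x - x_hat||^2
    return a1 * mse + a2 * ccc_loss(y_hat, y)
```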

Slide 14: Experimental Settings

§ We consider 50, 100, 200, 400, 800, and 1,200 labeled samples from the target domain
§ Samples are selected based on the latest model
§ We consider two starting points:
§  From scratch
§  Autoencoder pretrained on the reconstruction loss only for 20 epochs
§ Results are the average of 20 trials
§ Greedy sampling (feature space): use the embedding of the autoencoder to reduce the search space
§ The overall protocol is sketched below

[Figure: unlabeled sentences from the target domain; target numbers of samples: 50, 100, 200, 400, 800, 1,200]
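Putting the pieces together, the protocol is a standard pool-based active learning loop. A schematic Python sketch with placeholder callables (train_model, acquire, and oracle are hypothetical, not the authors' code):

```python
import numpy as np

def active_learning(pool_X, budgets=(50, 100, 200, 400, 800, 1200), k=10,
                    train_model=None, acquire=None, oracle=None):
    """Score the pool with the latest model, annotate the top k samples,
    retrain, and record the model at each target budget."""
    labeled, results = [], {}
    model = train_model(pool_X, labeled)  # from scratch or pretrained autoencoder
    for budget in budgets:
        while len(labeled) < budget:
            scores = acquire(model, pool_X, labeled)  # any acquisition function
            ranked = [i for i in np.argsort(-scores) if i not in labeled]
            batch = ranked[:k]
            oracle(batch)                 # request annotations for these samples
            labeled.extend(batch)
            model = train_model(pool_X, labeled)  # update with the new labels
        results[budget] = model
    return results
```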

Slide 15: Results for Arousal

§ Observations:
§  Greedy sampling in the feature space leads to better performance
§  Dropout is not as effective
§  Random sampling approaches the best methods as we add more data
§  The pretrained encoder helps with limited samples

[Figure: arousal CCC versus number of samples, with a pretrained encoder and from scratch; the within-corpus performance is shown for reference]

Slide 16: Results for Valence

§ Observations:
§  The pretrained autoencoder helps to achieve better performance
§  We approach the within-corpus performance with only 10% of the training data
§  Random sampling is less effective

[Figure: valence CCC versus number of samples, with a pretrained encoder and from scratch]

Slide 17: Statistical Significance

CCC by number of samples (bold entries in the original slide mark statistically significant improvements over random sampling):

                     Arousal                    Valence
#samples             100   200   400   800     100   200   400   800
Random Sampling      0.57  0.61  0.66  0.69    0.07  0.10  0.13  0.17
Greedy Feature       0.58  0.64  0.68  0.71    0.10  0.12  0.16  0.21
Greedy Label         0.50  0.53  0.66  0.69    0.09  0.12  0.17  0.21
Greedy Combination   0.52  0.55  0.68  0.70    0.09  0.12  0.17  0.20
Dropout              0.56  0.59  0.63  0.67    0.11  0.12  0.15  0.18

§ Observations:
§  Greedy sampling in the feature space is almost always better than random sampling
§  Dropout was not as effective

Slide 18: Sensitivity to k (How Often We Update the Model)

§ Multitask autoencoder framework with greedy methods
§  No statistical difference between k=1 and k=10
§  The method is not sensitive to this parameter (reducing complexity)

[Figure: arousal and valence CCC with 100, 200, and 400 samples, as a function of k]

Slide 19: Consistency of the Results

§ 20 results starting with different initializations
§ Greedy sampling on the feature space versus random sampling
§  The standard deviation of the CCC values achieved by the greedy sampling method decreases faster as the sample size increases
§  More consistent than random sampling

[Figure: standard deviation of the CCC for arousal and valence]

Slide 20: Conclusions

§ Greedy sampling achieves higher performance with lower variance compared to random sampling
§ Greedy sampling in the label space depends on the model's performance
§ As we introduce more data, the differences in performance across data acquisition functions shrink
§ Reducing the computation cost:
§  Calculate the distances in an embedding with lower dimensionality
§  Setting an adequate value of k reduces the frequency of model updates
§ Future work:
§  Combine active learning with curriculum learning
§  Consider new acquisition functions that scale well

Slide 21: Thank You

§ This work was funded by NSF CAREER award IIS-1453781

Interested in our research? msp.utdallas.edu