ACOUSTIC FEATURE-BASED SENTIMENT ANALYSIS OF CALL CENTER DATA

A Thesis
Presented to
The Faculty of the Graduate School
At the University of Missouri-Columbia

In Partial Fulfillment
Of the Requirements for the Degree
Master of Science

By
Zeshan Peng
Dr. Yi Shang, Thesis Supervisor
December 2017
The undersigned, appointed by the dean of the Graduate School, have examined the thesis entitled

ACOUSTIC FEATURE-BASED SENTIMENT ANALYSIS OF CALL CENTER DATA

Presented by Zeshan Peng
A candidate for the degree of
Master of Science
And hereby certify that, in their opinion, it is worthy of acceptance.

Dr. Yi Shang
Dr. Detelina Marinova
Dr. Dong Xu
ACKNOWLEDGEMENTS

I would like to thank my academic advisor, Dr. Yi Shang, for all of the help and great suggestions I received from him. I would not have been able to complete this thesis without his help.

I would also like to thank everyone on our project team, especially Nickolas Wergeles and Wenbo Wang. They always provided me with the best support and critical feedback throughout the whole process, which let me make it this far with this work.

Finally, I would like to thank Dr. Detelina Marinova and her student Bitty Balducci for their great efforts in providing datasets for this project, as well as the business insights that directed my initial work.

My last thanks go to Dr. Dong Xu for serving on my committee and helping me defend my thesis work.

- Zeshan Peng
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
1. INTRODUCTION
2. BACKGROUND AND RELATED WORK
   2.1 Text-based approach
   2.2 Acoustic feature-based approach
   2.3 Multimodal approach
3. PROPOSED METHODS
   3.1 Acoustic Feature-Based Sentiment Recognition Using Classic Machine Learning Algorithms
      3.1.1 Gathering Audio Data
      3.1.2 Data Cleaning and Pre-processing
      3.1.3 Feature Extraction and Selection
      3.1.4 Train Classification Model
      3.1.5 Test Classification Model
   3.2 Acoustic Feature-Matrix-Based Sentiment Recognition Using Deep Convolutional Neural Network
      3.2.1 Deep Learning
      3.2.2 Feature Matrix
      3.2.3 Architecture
4. FEATURE SETS AND CLASSIFICATION ALGORITHMS
   4.1 Feature Sets
      4.1.1 Fundamental Features
      4.1.2 Shimmer and Jitter
      4.1.3 Short-term Features
      4.1.4 Emotion Features
      4.1.5 Spectrograms
   4.2 Classification Algorithms
5. EXPERIMENT RESULTS
   5.1 Dataset
   5.2 Feature Extraction Libraries
   5.3 Training Tools
   5.4 Per-Segment Results
   5.5 Per-Record Results
   5.6 Deep Learning Results
6. CONCLUSION
7. FUTURE WORK
8. REFERENCES
LIST OF FIGURES

Figure 1. Problem Illustration
Figure 2. Sentiment Recognition Process Overview
Figure 3. Speaker Diarization Process
Figure 4. Data Flow Diagram for Model Training
Figure 5. Majority Vote Method for Model Testing
Figure 6. Feature Matrix Illustration
Figure 7. Deep Convolutional Neural Network Architecture
Figure 8. Comparison of Customer Model and Representative Model on Different Feature Sets
Figure 9. Model Comparison Based on Majority Votes Per Audio Record
Figure 10. Prediction Accuracy vs. Number of Convolutional Layers
Figure 11. 1-D CNN vs. 2-D CNN on Different Pooling Kernel Sizes
Figure 12. Comparison on Different Window Sizes
Figure 13. Comparison on Different Machine Learning Algorithms
LIST OF TABLES

Table 1. Short-term Feature Descriptions
Table 2. Dataset Summary
ABSTRACT

With the advancement of machine learning methods, audio sentiment analysis has become an active research area in recent years. For example, business organizations are interested in persuasion tactics inferred from vocal cues and acoustic measures in speech. A typical approach is to find a set of acoustic features from audio data that can indicate or predict a customer's attitude, opinion, or emotional state. For audio signals, acoustic features have been widely used in many machine learning applications, such as music classification, language recognition, and emotion recognition. For emotion recognition, previous work shows that pitch and speech rate are important features. This thesis focuses on determining sentiment from call center audio records, each containing a conversation between a sales representative and a customer. The sentiment of an audio record is considered positive if the conversation ended with an appointment being made, and negative otherwise. In this project, a data processing and machine learning pipeline for this problem has been developed. It consists of three major steps: 1) an audio record is split into segments by speaker turns; 2) acoustic features are extracted from each segment; and 3) classification models are trained on the acoustic features to predict sentiment. Different sets of features have been used, and different machine learning methods, including classical machine learning algorithms and deep neural networks, have been implemented in the pipeline. In our deep neural network method, the feature vectors of audio segments are stacked in temporal order into a feature matrix, which is fed into deep convolutional neural networks as input. Experimental results on real data show that acoustic features, such as Mel-frequency cepstral coefficients, timbre, and Chroma features, are good indicators of sentiment. Temporal information in an audio record can be captured by deep convolutional neural networks for improved prediction accuracy.
1. INTRODUCTION

Speech analytics allows people to extract information from audio data. It has been widely used by businesses to gather intelligence through the analysis of recorded calls in contact centers. Sentiment analysis is one branch of speech analytics that tries to infer subjective feelings about products or services in a conversation. This analysis can also be used to help agents who talk to customers over the phone build customer relationships and solve any issues that may emerge.

There are two major methods for audio sentiment analysis: acoustic modeling and linguistic modeling. Linguistic modeling requires transcribing audio records into text files and conducting analyses based on the text content. It assumes that certain words or phrases are used with higher probability in certain situations, so one can expect specific outcomes when those words or phrases appear frequently in the transcripts. Many researchers have shown strong evidence that linguistic modeling performs well on audio sentiment analysis. Among their work, lexical features, disfluencies, and semantic features are most often used when building models. However, acquiring a good transcript for each audio record can be tedious for humans and financially costly. A good automatic transcription system that accurately transcribes audio records into text is also hard to establish, and even a small error in the text can result in a big difference in a linguistic model. Acoustic modeling, in contrast, relies on acoustic features of the audio data. These features, which include pitch, intensity, speech rate, and so on, can be easily calculated by computer programs, and together they can provide basic indicators of sentiment to some degree. Acoustic features have been used in many other business or psychological analyses as well, which is another reason to use them in sentiment analysis. However, one problem with acoustic modeling is that the quality of the audio records has a strong influence on the final result. Since we are using acoustic characteristics of the audio data, poor audio quality can significantly impact our ability to get accurate values for those acoustic features, which in the end prevents building good models. The other problem is that in a real-world setting there is always some random background noise. Model training data may not capture this noise, which would challenge the sentiment analysis results if we want to monitor a live conversation. Still, this method is worth further exploration and research because of its convenience and efficiency.

The remaining structure of this thesis is as follows: Chapter 2 gives background and related work on sentiment analysis, Chapter 3 introduces the two methods that we propose and our sentiment recognition pipeline in detail, Chapter 4 explains our selection of feature sets and classification models, Chapter 5 shows our experiment results, Chapter 6 is the conclusion, Chapter 7 discusses potential problems and future work, and the last chapter lists the references.
2. BACKGROUND AND RELATED WORK

Sentiment analysis can be generalized into three categories: the text-based approach, the acoustic feature-based approach, and the multimodal approach, which combines the previous two. Different features are used for each approach, classification models are trained with different learning algorithms, and their performances are compared. Many sentiment classification techniques have been proposed in recent years [4]. However, the overall performance is not as good as one might expect, and much more work could be done on this problem in the future.
2.1 Text-basedapproach
Textdataareoneofthemost immensecorporathataregenerated
daily.Researchesareeagertotakeadvantageofthisandmanydatamining
workshavebeendoneforthissake.Forexample,Twitter,asoneofthemost
popularsocialmediaintheworld,generatestonsoftexteveryday.Different
miningmethodsbasedonTwitter texthavebeenproposed forsentiment
analysis [23] [24]. In 2010, it is shown that recurrent neural network can
exploittemporalfeaturesintextcontentsformodelinglanguage[13].Even
formusicfield,mininglyricsdirectlycancreatemodelsforsongsentiment
classificationproblem[20]andmusicmoodclassificationproblem[21].
2.2 Acoustic feature-based approach

Many different acoustic features can be generated from an audio file: pitch, intensity, speech rate, articulation rate, and so on. Short-term acoustic features, such as Mel-frequency cepstral coefficients (MFCCs), are proved useful for music modeling in [5]. MFCCs are good for the musical genre classification task, and evidence is shown in [6]. The authors in [9] take MFCCs as the major acoustic features for the emotion recognition problem. MFCC features are also used to train a single deep neural network for speaker and language recognition in [12]. Timbre and Chroma are other acoustic features that have interested many researchers in past years. They can be used to generate music from text in [7]: in that work, emotions are extracted from text, and then a combination of timbre and Chroma features is used to compose music that expresses those emotions. Chroma and timbre features can be used to classify music as well in [8]. The authors also claim that "Chroma features are less informative for classes such as artist, but contain information that is almost entirely independent of the spectral features." Therefore, they are equally important to include when conducting sentiment analysis on acoustic features. Rhythm, timbre, and intensity of music are used for the music mood classification problem as well [22]. The acoustic feature-based approach extends to speech recognition in [10] and automatic language identification in [11] when modeling with deep neural networks.
2.3 Multimodalapproach
Withtheemergeofsocialmedia,exceptfortextcorporaandaudiorecords,
videos and images brings new opportunities for sentiment analysis. Facial and
vocalfeaturesareextractedfromthesenewstreamsandmultimodalsentiment
analysisseemstohavemuchmorepotentialinthefuture[14][15].YouTube,as
oneofthebiggestvideohosts,hasahugenumberofvideos.Researchesaretrying
to identify sentiments from those videos based on linguistic, audio and visual
features all together [25] [26]. Fusion of modalities of audio, visual and text
sentimentfeaturesassourceofinformationisanotherwaythatisproposedin[27].
For linguisticanalysis, languagedifferencecouldbeaproblem. It is shown that
language-independent audio-visual analysis is competitive with mere linguistic
analysis[28].Worksin[16][17][18]showthatsystemsthatcombinelyricfeatures
andaudiofeaturessignificantlyoutperformothersystemsthatuselyricsoraudios
singularly for music mood classification problem. Fusion of lyrics features and
melody featuresofmusic improvesperformanceofmusic information retrieval
system[19].Fusionofmultiplemodalitiesiscurrenttrendsincewecanalwaysfind
newinformationwhenweviewanexistingobjectfromanewangle.Sometimes,
7
theaccesstovisualandlinguisticsourcescouldbehardandcostly.Forexample,
inourdatasource, it ishardtogetfacialexpressionsfromacustomeroverthe
phone.Butinothercircumstanceswhenwehavealltheinformation,multimodal
approachwouldbethefirstchoice.
3. PROPOSED METHODS

The problem that this thesis addresses is to identify whether a meeting has been scheduled between a customer and a representative in a phone call conversation. We propose two methods for this kind of problem. The first method is to split each audio record into segments by speaker turns, extract a feature vector from each segment, and then use classic machine learning algorithms to classify those feature vectors according to sentiment. All segments that come from the same audio record share the same sentiment label as the audio record. The second method is to cut each audio record into short audio clips of the same length and extract a feature vector from each clip. After that, we stack those feature vectors into feature matrices in time order and feed the feature matrices into deep convolutional neural networks to build classification models. All feature matrices that come from the same audio record share the same sentiment label as the audio record. The next two subsections describe these two methods in more detail.

Figure 1. Problem Illustration
3.1 Acoustic Feature-Based Sentiment Recognition Using Classic Machine Learning Algorithms

The whole sentiment recognition process consists of five steps. First, we gather useful audio data from call center records. We select audio data that have labels, since we are more interested in using supervised machine learning algorithms. It is quite possible that unsupervised machine learning could provide interesting results for audio sentiment analysis; that is left to future exploration, either by combining it with supervised machine learning or on its own. The second step is cleaning and pre-processing the chosen audio files. Any unrelated parts of the audio data are removed, such as ring tones, a receptionist talking, music, advertisements, etc. Several methods have been tried to remove those unrelated parts; however, none of them met our expectations, so we decided to remove them manually, since the dataset we currently have is fairly small. After that, we apply a speaker diarization algorithm to split each audio record into segments by speaker turns. The third step is feature extraction and selection. We consider different combinations of feature sets and select different features to train on in the next stage. The fourth step is classification model training. We use different training algorithms, techniques, and models, then compare their performance in terms of prediction accuracy. The last step is testing the trained classification model.

Figure 2. Sentiment Recognition Process Overview (Pre-processing/Cleaning → Feature Extraction/Feature Selection → Model Training → Model Testing)
3.1.1 Gathering Audio Data

Every training example affects the final classification model. We do not want bad data to hurt the model's performance, and we want effective training datasets that generalize well to all likely scenarios without overrepresenting unlikely events. For audio datasets, the quality of the audio file can be one metric used when selecting files. Call center audio records are not always well recorded: due to internet latency and background noise, some files are very blurred and even hard for a human to understand. We need audio data that are recorded with good quality. There are also files that do not have a label, and we want to discard those as well for supervised learning purposes.
3.1.2 DataCleaningandPre-processing
Inanaudiofile,itisusuallyaconversationbetweentwospeakers:acustomer
andasalesperson.Insomescenarios,thereisareceptionistansweringthephone
callandtransferringthesalespersontowhoeverheorshewantstoreachto.The
featuresofreceptionistarenotourmainlyinterestheresoweneedtocutthose
conversationsbetweensalespersonandreceptionistifithappensintheaudiofile.
Afterthat,wewouldliketospliteachaudiofileintosegmentsbyspeakerturns.
12
Knowingchangesofcertainfeaturesbetweenconsecutivesegmentswouldhelp
thewholesentimentanalysis.Thatiswhyit is importanttosplitaudiofilesinto
segmentsbyspeakerturnsbeforegatheringfeaturesfromthem.Wealsowantto
studythetemporalrelationsoffeaturesinconsecutivesegmentsinanaudiofile.
Thatwould helpmonitor live conversations in call centers. There are plenty of
speakerdiarizationalgorithmsthatarepubliclyavailable.As referred in [1], [2],
and[3],alotofresearcheshasbeenconductedforthistypeofproblem,andmany
ofthemhaveagoodperformancetosplitaudiofilebyspeakerturnsautomatically.
In our research, we have used IBM Watson and Google cloud computing
services to split our audio files into segments by speaker turns. Both of these
systemshavetheabilitytoidentifywhospeaksatwhattimeandtranscribethe
conversationintotext.Eventhoughthetranscriptionpartisnotasaccurateaswe
though, the timestamps of start and end time for each speaker turn are good
enoughforustousetoseparateeachaudiofileintosegments.
Sinceweonlyhaveasmallportionofaudiodatafornow,andthetranscripts
ofallaudiofilesareavailableforus,wespliteachaudiofileintosegmentsbased
on timestamps in the transcript accordingly. However, the speaker diarization
procedureisstillcrucialwhenhavemoreandmoredatacomingininthefuture
andtheaccesstoallofthetranscriptswouldbeverycostly.Thatisthetimewe
maywanttouseautomaticspeakerdiarizationtechniques.
13
Figure3.SpeakerDiarizationProcess
Pseudo-code:
function speaker_diarization():
    open audio file
    open transcription file
    for each timestamp pair (start_time, stop_time) in transcription file:
        segment = cut_audio_file(start_time, stop_time)
        segment_label = audio_file_label
        if segment belongs to customer:
            add segment to customer_cluster
        if segment belongs to representative:
            add segment to representative_cluster
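The pseudo-code above can be sketched concretely in Python. This is a minimal illustration, not the thesis's actual code: it assumes the transcript has already been parsed into (start, stop, speaker) tuples in seconds, and that the audio is available as a mono sample sequence at a known rate; all names are ours.

```python
# Minimal sketch of timestamp-based speaker segmentation (hypothetical code).
# `samples` is a mono PCM sample sequence at `rate` Hz; `turns` comes from a
# parsed transcript as (start_seconds, stop_seconds, speaker) tuples.

def split_by_turns(samples, rate, turns, record_label):
    """Cut `samples` at each speaker turn and bucket segments by speaker."""
    clusters = {"customer": [], "representative": []}
    for start_s, stop_s, speaker in turns:
        lo = int(start_s * rate)
        hi = min(int(stop_s * rate), len(samples))
        segment = samples[lo:hi]
        if segment:  # skip empty or out-of-range turns
            # every segment inherits the whole record's sentiment label
            clusters[speaker].append((segment, record_label))
    return clusters

# Usage: a fake 1 kHz "recording" with two turns of 0.5 s each
audio = list(range(1000))
turns = [(0.0, 0.5, "representative"), (0.5, 1.0, "customer")]
out = split_by_turns(audio, 1000, turns, "positive")
print(len(out["customer"][0][0]))  # -> 500 samples in the customer turn
```

In practice the cutting would be done on the decoded waveform (e.g. with an audio library) rather than a plain list, but the bookkeeping is the same.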
3.1.3 Feature Extraction and Selection

After we have split each audio file into segments by speaker turns, we extract features from each segment. There are many features that can be extracted from an audio file; for our research purposes, we focus on acoustic features. Two groups of features mainly interest us. One group is prosody features, such as pitch, intensity, speech rate, number of pauses, etc. The other group is short-term features. We explain our feature sets in detail in Chapter 4.
Pseudo-code:
function feature_extraction():
    for each segment in customer_cluster:
        prosody, short_term = calc_algorithm(segment)
        add prosody to prosody_feature_cluster for customer
        add short_term to short_term_cluster for customer
    for each segment in representative_cluster:
        prosody, short_term = calc_algorithm(segment)
        add prosody to prosody_feature_cluster for representative
        add short_term to short_term_cluster for representative
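As a toy stand-in for `calc_algorithm` (the thesis uses Praat and pyAudioAnalysis, which compute far richer feature sets), the sketch below extracts two simple short-term descriptors per segment, short-term energy and zero-crossing rate, and mirrors the per-cluster loop of the pseudo-code. All names are illustrative assumptions.

```python
import math

# Toy per-segment feature extractor: energy and zero-crossing rate.
# A real pipeline would call Praat / pyAudioAnalysis here instead.

def calc_features(segment):
    """Return (short-term energy, zero-crossing rate) for a sample list."""
    n = len(segment)
    energy = sum(x * x for x in segment) / n
    zcr = sum(
        1 for a, b in zip(segment, segment[1:]) if (a >= 0) != (b >= 0)
    ) / (n - 1)
    return energy, zcr

def extract_clusters(customer_cluster, representative_cluster):
    """Mirror the pseudo-code: one feature list per speaker role."""
    return (
        [calc_features(seg) for seg in customer_cluster],
        [calc_features(seg) for seg in representative_cluster],
    )

# A unit sine tone has mean-square energy 0.5; its zero-crossing rate
# reflects its frequency.
tone = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
feats, _ = extract_clusters([tone], [tone])
print(round(feats[0][0], 2))  # -> 0.5
```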
3.1.4 Train Classification Model

Once we have features for each segment, we split the data into a training set, a validation set, and a testing set, then train a classification model on these features and their associated labels. Every segment in the same audio file has the same label as that audio file. We also partition segments into two groups: a customer group and a salesperson group. Since customers may behave differently from salespersons, we train two different models to represent each of them. We take all features in the customer group as the training dataset for the customer model, and all features in the salesperson group as the training data for the salesperson model. The validation set is used to tune the hyperparameters of the learning algorithms, and the testing set is used to measure the performance of the trained model. More details of the classification models and learning algorithms are in Chapter 4.

Figure 4. Data Flow Diagram for Model Training
3.1.5 Test Classification Model

After the model is built in the previous procedure, we use it to predict labels for any audio file on which we want to perform sentiment analysis. The audio file is still split into segments by the timestamps in the transcript. The model predicts a label for each segment in that file, and we take the majority vote of the labels as the final label for that test audio file:

    y = 1 if (1/n) * Σ yᵢ ≥ 0.5, and y = 0 otherwise,

where n is the number of segments in the audio record and yᵢ ∈ {0, 1} is the predicted label of segment i.
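The majority-vote rule translates directly into a few lines of Python (the function name is ours; ties go to positive, as the ≥ in the formula implies):

```python
def majority_vote(segment_labels):
    """Record-level label: 1 if at least half of the per-segment
    predictions are positive (label 1), 0 otherwise."""
    n = len(segment_labels)
    return 1 if sum(segment_labels) / n >= 0.5 else 0

print(majority_vote([1, 0, 1, 1]))  # -> 1  (3/4 positive)
print(majority_vote([0, 0, 1]))     # -> 0  (1/3 positive)
```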
Figure 5. Majority Vote Method for Model Testing

3.2 Acoustic Feature-Matrix-Based Sentiment Recognition Using Deep Convolutional Neural Network

3.2.1 Deep Learning

It has been shown in [29] that deep convolutional neural networks can be applied to character-level feature matrices for the text classification problem, achieving results competitive with current state-of-the-art methods. This is evidence that temporal information can be captured by applying deep convolutional neural networks to feature matrices of time series data. With the advancement of machine learning, especially deep learning methods, audio signal analysis can take advantage of it to build models that are competitive with, or even better than, other approaches.
3.2.2 Feature Matrix

Convolutional neural networks usually take a two-dimensional matrix as input, so we stack audio feature vectors into a feature matrix to meet that requirement. Short-term features are extracted every 50 milliseconds from each audio record and then stacked in time order to build the feature matrix for that record. The window size is 300, which means we cut the matrix every 300 rows. Every matrix generated from the same audio record shares the same sentiment label as the audio record. If the number of rows, or the number of remaining rows after cutting, is less than 300, we pad with zero-valued rows. Each training matrix therefore has a size of 300 by 34, since we have 34 short-term features in total. The figure below illustrates the process of building feature matrices.
Pseudo-code:
function feature_matricization():
    for each audio record:
        cut it into equal-length (50 ms) segments
        for each segment:
            short_term = calc_algorithm(segment)
        stack up short-term features into a feature matrix in time order
        cut the feature matrix every 300 rows
        matrix_label = audio_record_label
        add to training dataset

Figure 6. Feature Matrix Illustration
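The cut-and-pad step above can be sketched as follows. Plain Python lists stand in for the 300x34 arrays a real pipeline would build with NumPy; the function name and defaults are ours.

```python
# Sketch of the matricization step: stack per-clip feature vectors in time
# order, cut every `window` rows, and zero-pad the final chunk.

def build_matrices(feature_vectors, window=300):
    """Split a time-ordered list of feature rows into fixed-height matrices."""
    if not feature_vectors:
        return []
    width = len(feature_vectors[0])
    matrices = []
    for i in range(0, len(feature_vectors), window):
        chunk = feature_vectors[i:i + window]
        # pad the last chunk with zero-valued rows up to `window` rows
        chunk += [[0.0] * width] * (window - len(chunk))
        matrices.append(chunk)
    return matrices

# 700 time steps of 34 features -> three 300x34 matrices, the last padded
rows = [[1.0] * 34 for _ in range(700)]
mats = build_matrices(rows)
print(len(mats), len(mats[0]), len(mats[0][0]))  # -> 3 300 34
```

Every matrix produced from one record would then carry that record's sentiment label, as in the pseudo-code.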
3.2.3 Architecture

Massive experiments have been conducted to compare performance under different network settings. Those settings include the number of convolutional layers, the kernel size, the max pooling kernel size, and the window size used when cutting the matrix. The figure below shows a typical network architecture with four convolutional layers. This architecture design is based on pre-trained neural networks that work extremely well on image datasets. Every convolutional layer is followed by a max pooling layer. We use 1-D convolutional kernels here, since the features are distinct from each other, and our experiment results show that a 1-D convolutional kernel performs better than a 2-D convolutional kernel.

Figure 7. Deep Convolutional Neural Network Architecture
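To make the layer arithmetic concrete, one can trace how the 300-step time axis of a feature matrix shrinks through four 1-D convolution + max-pooling stages. The kernel size (3) and pooling size (2) below are illustrative assumptions, not the exact settings from the experiments:

```python
def conv1d_out(length, kernel, stride=1):
    """Output length of a 'valid' 1-D convolution along the time axis."""
    return (length - kernel) // stride + 1

def pool1d_out(length, kernel):
    """Output length of non-overlapping 1-D max pooling."""
    return length // kernel

# 300 time steps in, four conv(kernel=3) + maxpool(kernel=2) stages
t = 300
for stage in range(4):
    t = pool1d_out(conv1d_out(t, 3), 2)
    print(f"after stage {stage + 1}: {t} time steps")
# prints 149, 73, 35, 16 time steps
```

The same arithmetic explains why the window size matters: a deeper stack needs enough input rows to survive repeated halving.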
4. FEATURE SETS AND CLASSIFICATION ALGORITHMS

4.1 Feature Sets

4.1.1 Fundamental Features

The first feature set that we consider consists of fundamental features of any audio data, such as loudness in decibels, mean of pitch, standard deviation of pitch, covariance of pitch, speech rate, number of pauses in a segment, and length of the segment in milliseconds. These are general features that can represent an audio segment to some extent, and the literature of the past decade has shown their crucial importance in audio signal analysis.
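A few of these fundamental features can be illustrated schematically from a per-frame pitch track. The thesis obtains such values from Praat; the code below is only a hypothetical stand-in that treats 0.0 as an unvoiced (pause) frame:

```python
import statistics

# Schematic computation of some fundamental features from a pitch track
# (one pitch value in Hz per 50 ms frame; 0.0 marks unvoiced frames).

def fundamental_features(pitch_track, frame_ms=50):
    """Pitch statistics, pause count, and length for one segment."""
    voiced = [p for p in pitch_track if p > 0]
    # count a pause as each maximal run of unvoiced frames
    pauses = sum(
        1 for i, p in enumerate(pitch_track)
        if p == 0 and (i == 0 or pitch_track[i - 1] > 0)
    )
    return {
        "pitch_mean": statistics.mean(voiced),
        "pitch_std": statistics.stdev(voiced),
        "num_pauses": pauses,
        "length_ms": len(pitch_track) * frame_ms,
    }

track = [210, 220, 0, 0, 230, 240, 0, 250]  # Hz per 50 ms frame
f = fundamental_features(track)
print(f["num_pauses"], f["length_ms"])  # -> 2 400
```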
4.1.2 Shimmer and Jitter

Jitter measures the cycle-to-cycle variation in the frequency (period) of vocal fold vibration, while shimmer measures the cycle-to-cycle variation in its amplitude. These two features characterize an audio segment from another perspective.
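As a schematic illustration (the thesis obtains these values from Praat), local jitter and local shimmer can both be computed as the mean absolute difference between consecutive cycles, normalized by the mean value, applied to cycle periods and cycle amplitudes respectively:

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive cycles divided by the
    mean value: local jitter when fed periods, local shimmer when fed
    peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

periods = [0.010, 0.011, 0.010, 0.012]  # seconds per vocal-fold cycle
amplitudes = [0.80, 0.78, 0.82, 0.79]   # peak amplitude per cycle
jitter = relative_perturbation(periods)
shimmer = relative_perturbation(amplitudes)
print(round(jitter, 3), round(shimmer, 3))
```

Real extraction is harder than this sketch suggests, since the pitch periods themselves must first be located reliably in the waveform.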
4.1.3 Short-term Features

Mel-frequency cepstral coefficients, Chroma, and timbre features are the short-term features used in this work. The Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, and Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up that cepstrum. Different combinations of those coefficients carry different indications for sentiment analysis, so we include them in our feature set. Chroma features represent the spectral energy of different pitch classes. Timbre features capture the character or quality of a sound as distinct from its pitch and intensity. These three groups of features have been frequently used in many classification problems, as explained in the related work in Chapter 2.

Feature  | Description
MFCCs    | Mel-frequency cepstral coefficients from a cepstral representation where the frequency bands are not linear but distributed according to the mel scale
Chroma   | A representation of the spectral energy where the bins represent equal-tempered pitch classes
Timbre   | The character or quality of a sound or voice as distinct from its pitch and intensity

Table 1. Short-term Feature Descriptions
4.1.4 Emotion Features

We also find that emotion features can be critical indicators of sentiment. Positive emotions, like happiness and satisfaction, can lead to a good result at the end of the conversation, while negative emotions, like being upset or angry, can turn things the other way. We want to train our model on these features too.

4.1.5 Spectrograms

Besides scalar-valued inputs, we also generate an image feature from each segment: its spectrogram. A spectrogram is a visual representation of the spectrum of frequencies of a sound as it varies with time. Some sentiments may have specific patterns in the spectrogram, and we want our model to be able to identify those patterns if they exist.
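A spectrogram is just a sequence of windowed magnitude spectra. The toy sketch below computes one with a naive DFT from the standard library; a real pipeline (as in Chapter 5) would use matplotlib's spectrogram plotting or an FFT library instead.

```python
import cmath
import math

def spectrogram(samples, win=64, hop=32):
    """Magnitude spectrum per window via a naive DFT (O(n^2) per frame;
    fine for a demo, but real code would use an FFT)."""
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        spectrum = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / win)
                    for n, x in enumerate(frame)))
            for k in range(win // 2)  # keep non-negative frequencies only
        ]
        frames.append(spectrum)
    return frames  # time x frequency magnitude matrix

# A pure tone concentrates its energy in a single frequency bin:
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(32), key=lambda k: spec[0][k])
print(peak_bin)  # -> 8 (the tone completes 8 cycles per 64-sample window)
```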
4.2 Classification Algorithms

In our experiments, we use both classical machine learning algorithms and neural networks. For the classical algorithms, we have used a support vector machine with a cubic kernel and a k-nearest neighbor classifier. For neural networks, we have used classic shallow feed-forward neural networks and deep convolutional neural networks. Different learning algorithms have different advantages and characteristics, and we want to experiment with all of them and compare their results.
5. EXPERIMENT RESULTS

5.1 Dataset

Our dataset consists of 88 audio recording files from Penske corporate offices. The duration of these files ranges from a few minutes to ten minutes, with an average length of about three minutes. After we split these audio files into segments by speaker turns, we have 2859 segments in total. Among them, 1438 are from customers and 1421 are from salespersons. Features extracted from the customer group are used to train the customer model, and likewise the salesperson group for the salesperson model.

              | Positive | Negative | Total
Audio Records | 31       | 55       | 86
Segments      | 1284     | 1304     | 2588

Table 2. Dataset Summary
5.2 Feature Extraction Libraries

Praat is the main software that we used to extract the fundamental audio features as well as the shimmer and jitter values. We use a Python library named pyAudioAnalysis, available on GitHub, to extract the MFCC features, and the openSMILE package to extract the emotion features. The pyplot module of the matplotlib package is used to generate a spectrogram for each audio segment.

5.3 Training Tools

We use the machine learning package in Matlab to train models with the classical machine learning algorithms. The neural networks are coded in Keras, which runs on top of TensorFlow.
5.4 Per-SegmentResults
Since customerbehaviors different than representative,wewant to two
separatemodelstorepresenteachofthem.Speakerdiarizationprocesshelpson
that. Features extracted fromcustomer segments are grouped intoone cluster
while other features extracted from representative segments are grouped into
anothercluster.Featuresincustomerclusterareusedtotraincustomermodeland
so as the representativemodel.Models’ performances are compared by using
different feature set for training. Below is the figure that summarizes the
comparisonbasedonpredictionaccuracyforsegments:
28
Figure8.ComparisonofCustomerModelandRepresentativeModelonDifferentFeatureSets
From above figure, we can easily conclude that short-term features are
superiorindicatorsthanprosodyfeatures.Whenbothofthemareusedfortraining,
accuracyisslightlyimproved.Thatdoesnotmeancombinationofthemhasbetter
indicationstrength.Moreexperimentsshouldbeconductedfora largerdataset
beforewecansayanythingaboutit.
5.5 Per-RecordResults
Segment-wise prediction is not our final goal here.We want to predict
whetherrepresentativeispersuasiveornotinawholeaudiorecord.Toaccomplish
this,wepredicteachsegmentofatestaudiorecordusingmodelstrainedin5.2.1
0.646
0.82 0.822
0.570.684 0.694
0
0.2
0.4
0.6
0.8
1
Prosody Short-term Together
CustomerModelvs.RepresentativeModel
Customer Representative
29
and take the majority vote of all the predictions as our final prediction for that audio record. We run our experiments using different models for prediction. Here we have three model settings: 1) use the customer model only to predict customer segments and take the majority vote as the final prediction for that audio record; 2) use the representative model only to predict representative segments and take the majority vote as the final prediction; 3) use both the customer and representative models to predict all segments in an audio record and take the majority vote as the final prediction. Below is the comparison result.
Figure 9. Model Comparison Based on Majority Votes Per Audio Record (per-record accuracy for Prosody / Short-term / Together: Customer model 0.583 / 0.773 / 0.786; Representative model 0.536 / 0.685 / 0.691; Both models 0.571 / 0.739 / 0.75)
The figure above shows that using the customer model alone gives the best performance. This matches the intuition that representatives behave similarly regardless of customer behavior: representatives are consistent no matter how customers respond, while customers show different characteristics depending on whether their sentiment is positive or negative. Using both models gives a performance in between.
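The per-record majority vote used in these three settings can be sketched as below. The model callables and labels are placeholders, not the trained models from our experiments.

```python
from collections import Counter

def record_prediction(segments, models, setting):
    """Majority vote over per-segment predictions for one audio record.

    segments: list of (speaker, features) pairs; `setting` selects which
    segments and models contribute, mirroring the three settings above.
    """
    votes = []
    for speaker, feats in segments:
        if setting == "customer" and speaker != "customer":
            continue
        if setting == "representative" and speaker != "representative":
            continue
        votes.append(models[speaker](feats))  # per-segment prediction
    return Counter(votes).most_common(1)[0][0]

# Toy per-speaker models that ignore features and return fixed labels
models = {"customer": lambda f: "positive",
          "representative": lambda f: "negative"}
record = [("customer", None), ("customer", None), ("representative", None)]

print(record_prediction(record, models, "customer"))  # positive
print(record_prediction(record, models, "both"))      # positive (2 votes to 1)
```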
5.6 Deep Learning Results
The figure below compares results for different numbers of convolutional layers, using either feature vectors or feature matrices as input to the neural network.
Figure 10. Prediction Accuracy vs. Number of Convolutional Layers (feature vector vs. feature matrix input)
It can be seen that four convolutional layers give the best performance, and that using feature matrices outperforms using feature vectors, which suggests that the temporal information between audio clips can be captured by a deep convolutional neural network.
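The feature-matrix construction described above, stacking one feature vector per equal-length clip so that row order preserves time, can be sketched as follows. The toy FFT-based "feature" merely stands in for the real short-term features, and the clip length is an illustrative choice.

```python
import numpy as np

def feature_matrix(signal, clip_len, n_features=13):
    """Stack one feature vector per equal-length clip into a matrix.

    Rows follow time order, so consecutive rows correspond to consecutive
    clips, which is the temporal structure the CNN is meant to exploit.
    Here the first FFT magnitudes stand in for real short-term features.
    """
    n_clips = len(signal) // clip_len
    rows = []
    for i in range(n_clips):
        clip = signal[i * clip_len:(i + 1) * clip_len]
        rows.append(np.abs(np.fft.rfft(clip))[:n_features])
    return np.stack(rows)  # shape: (n_clips, n_features)

mat = feature_matrix(np.random.randn(8000), clip_len=800)
print(mat.shape)  # (10, 13)
```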
Figure 11. 1-D CNN vs. 2-D CNN on Different Pooling Kernel Sizes (kernel sizes 2 to 5)
The figure above shows the comparison of 1-D convolution versus 2-D convolution for different pooling kernel sizes. From the result, we find that the 1-D convolutional neural network performs better than the 2-D convolutional neural network. This might be because there is no temporal meaning between each
individual short-term feature, since the columns of the matrix represent different short-term features. It also shows that prediction accuracy can be improved by applying a larger max-pooling kernel size.
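A minimal NumPy sketch of 1-D convolution plus max pooling applied down the time axis of a feature matrix, treating each column as an independent feature channel, consistent with the observation that the columns carry no temporal meaning. The kernel values and shapes are illustrative, not the trained network's.

```python
import numpy as np

def conv1d_time(mat, kernel):
    """Valid 1-D convolution down the time axis, per feature column."""
    out = np.stack([np.convolve(mat[:, j], kernel, mode="valid")
                    for j in range(mat.shape[1])], axis=1)
    return out  # shape: (time - len(kernel) + 1, n_features)

def max_pool_time(mat, pool):
    """Non-overlapping max pooling down the time axis."""
    n = mat.shape[0] // pool
    return mat[:n * pool].reshape(n, pool, -1).max(axis=1)

x = np.random.randn(20, 13)             # (time, features) matrix
h = conv1d_time(x, np.array([0.25, 0.5, 0.25]))
print(h.shape)                          # (18, 13)
print(max_pool_time(h, pool=3).shape)   # (6, 13)
```

A larger `pool` compresses the time axis more aggressively, which is the knob varied in Figure 11.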
Figure 12. Comparison of Different Window Sizes (1-D CNN; 50 ms, 100 ms, 300 ms, 1000 ms)
This figure shows the comparison result of using different window sizes when extracting features from audio records. It appears that a larger window size gives better performance. However, a larger window size also results in a smaller training dataset for the neural network. Whether the trend holds needs to be verified by conducting more experiments on a larger dataset.
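The trade-off noted above, that a larger window yields fewer training samples, follows directly from the frame count. The sampling rate here is an assumed example, not the rate of our recordings.

```python
def n_windows(n_samples, sr, window_ms):
    """Number of non-overlapping windows a recording yields."""
    return n_samples // int(sr * window_ms / 1000)

sr = 8000                 # assumed sampling rate for illustration
one_minute = 60 * sr
for ms in (50, 100, 300, 1000):
    print(ms, n_windows(one_minute, sr, ms))
# 50 ms -> 1200 windows per minute, 1000 ms -> only 60: a larger
# window size shrinks the training set available to the network.
```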
Figure 13. Comparison of Different Machine Learning Algorithms (accuracy for Prosody / Short-term / Together: SVM 0.583 / 0.773 / 0.786; K-nearest neighbor 0.592 / 0.735 / 0.749; Feedforward NN 0.614 / 0.792 / 0.796; Deep CNN 0.647 / 0.826 / 0.813)
This figure compares model performance across different machine learning algorithms. It can be seen that the deep convolutional neural network performs slightly better than the other methods listed. Short-term features indicate sentiment more strongly than prosody features.
6. Conclusion
Audio sentiment analysis is a method of inferring the opinions of customers in conversations with customer support representatives. Having a mature system to analyze sentiment would be very beneficial to the business world. This thesis presents two methods to address this problem. One is based on segment-wise feature-vector classification: every audio recording is split into segments by speaker turns before analysis, different sets of features are extracted from each segment, and a classification model is trained on these features. The other is to feed feature matrices into a deep convolutional neural network and let it learn the temporal information between consecutive audio clips; each feature matrix is constructed by stacking up feature vectors extracted from equal-length audio clips of the records. We compare the performance of models trained with different machine learning algorithms and show that there are strong connections between the sentiment we analyze and the features we extract. Short-term features, such as MFCCs, chroma, and timbre features, are better indicators than prosody features. Our experimental results also show that deep convolutional neural networks can be applied not only to image datasets but also to non-image data such as feature matrices, capturing the temporal information between consecutive audio clips within a feature matrix. Finally, we present the problems encountered during our research; further work can be conducted to address these problems and improve overall system performance.
7. FUTURE WORK
An automatic speaker diarization process can be embedded into this audio sentiment analysis pipeline, since we want to split each audio recording into segments by speaker turns. This is even more important when transcripts are costly and time-consuming to obtain. Accurate text transcription is another direction, since much work has already been done on text-based sentiment analysis; if we can automatically get a good text transcription of an audio file, we can combine those text-based techniques with our acoustic-feature-based techniques to improve the performance of the analysis system. The feature matrix is not the end: there could be a better representation of features to feed into deep neural networks. Moreover, we can look for other acoustic features that are better indicators for characterizing audio data. More experiments should be conducted on a larger dataset to build more reliable models.