ACOUSTIC FEATURE-BASED SENTIMENT ANALYSIS OF CALL CENTER DATA

A Thesis
Presented to
The Faculty of the Graduate School
At the University of Missouri-Columbia

In Partial Fulfillment
Of the Requirements for the Degree
Master of Science

By
Zeshan Peng
Dr. Yi Shang, Thesis Supervisor
December 2017
The undersigned, appointed by the dean of the Graduate School, have examined the thesis entitled

ACOUSTIC FEATURE-BASED SENTIMENT ANALYSIS OF CALL CENTER DATA

Presented by Zeshan Peng
A candidate for the degree of
Master of Science
And hereby certify that, in their opinion, it is worthy of acceptance.

Dr. Yi Shang
Dr. Detelina Marinova
Dr. Dong Xu
ACKNOWLEDGEMENTS

I would like to thank my academic advisor, Dr. Yi Shang, for all of the help and great suggestions I received from him. I would not have been able to complete this thesis without his help.

I would also like to thank everyone on our project team, especially Nickolas Wergeles and Wenbo Wang. They always provided me with the best support and critical feedback throughout the whole process, which let me make it this far with this work.

Finally, I would like to thank Dr. Detelina Marinova and her student Bitty Balducci for their great efforts in providing datasets for this project, as well as the business insights that directed my initial work.

My last thanks go to Dr. Dong Xu for serving on my committee and helping me defend my thesis work.

- Zeshan Peng
TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
1. INTRODUCTION
2. BACKGROUND AND RELATED WORK
   2.1 Text-based approach
   2.2 Acoustic feature-based approach
   2.3 Multimodal approach
3. PROPOSED METHODS
   3.1 Acoustic Feature-Based Sentiment Recognition Using Classic Machine Learning Algorithms
      3.1.1 Gathering Audio Data
      3.1.2 Data Cleaning and Pre-processing
      3.1.3 Feature Extraction and Selection
      3.1.4 Train Classification Model
      3.1.5 Test Classification Model
   3.2 Acoustic Feature-Matrix-Based Sentiment Recognition Using Deep Convolutional Neural Network
      3.2.1 Deep Learning
      3.2.2 Feature Matrix
      3.2.3 Architecture
4. FEATURE SETS AND CLASSIFICATION ALGORITHMS
   4.1 Feature Sets
      4.1.1 Fundamental Features
      4.1.2 Shimmer and Jitter
      4.1.3 Short-term Features
      4.1.4 Emotion Features
      4.1.5 Spectrograms
   4.2 Classification Algorithms
5. EXPERIMENT RESULTS
   5.1 Dataset
   5.2 Feature Extraction Libraries
   5.3 Training Tools
   5.4 Per-Segment Results
   5.5 Per-Record Results
   5.6 Deep Learning Results
6. CONCLUSION
7. FUTURE WORK
8. REFERENCES
LIST OF FIGURES

Figure 1. Problem Illustration
Figure 2. Sentiment Recognition Process Overview
Figure 3. Speaker Diarization Process
Figure 4. Data Flow Diagram for Model Training
Figure 5. Majority Vote Method for Model Testing
Figure 6. Feature Matrix Illustration
Figure 7. Deep Convolutional Neural Network Architecture
Figure 8. Comparison of Customer Model and Representative Model on Different Feature Sets
Figure 9. Model Comparison Based on Majority Votes Per Audio Record
Figure 10. Prediction Accuracy vs. Number of Convolutional Layers
Figure 11. 1-D CNN vs. 2-D CNN on Different Pooling Kernel Sizes
Figure 12. Comparison on Different Window Sizes
Figure 13. Comparison on Different Machine Learning Algorithms
LIST OF TABLES

Table 1. Short-term Feature Descriptions
Table 2. Dataset Summary
ABSTRACT

With the advancement of machine learning methods, audio sentiment analysis has become an active research area in recent years. For example, business organizations are interested in persuasion tactics inferred from vocal cues and acoustic measures in speech. A typical approach is to find a set of acoustic features from audio data that can indicate or predict a customer's attitude, opinion, or emotional state. For audio signals, acoustic features have been widely used in many machine learning applications, such as music classification, language recognition, and emotion recognition. For emotion recognition, previous work shows that pitch and speech rate are important features. This thesis focuses on determining sentiment from call center audio records, each containing a conversation between a sales representative and a customer. The sentiment of an audio record is considered positive if the conversation ended with an appointment being made, and negative otherwise. In this project, a data processing and machine learning pipeline for this problem has been developed. It consists of three major steps: 1) an audio record is split into segments by speaker turns; 2) acoustic features are extracted from each segment; and 3) classification models are trained on the acoustic features to predict sentiment. Different sets of features have been used, and different machine learning methods, including classical machine learning algorithms and deep neural networks, have been implemented in the pipeline. In our deep neural network method, the feature vectors of audio segments are stacked in temporal order into a feature matrix, which is fed into deep convolutional neural networks as input. Experimental results on real data show that acoustic features, such as Mel-frequency cepstral coefficients, timbre, and Chroma features, are good indicators of sentiment. Temporal information in an audio record can be captured by deep convolutional neural networks for improved prediction accuracy.
1. INTRODUCTION

Speech analytics allows people to extract information from audio data. It has been widely used by businesses to gather intelligence through the analysis of recorded calls in contact centers. Sentiment analysis is one branch of speech analytics that tries to infer subjective feelings about products or services in a conversation. This analysis can also be used to help agents who talk to customers over the phone build customer relationships and solve any issues that may emerge.

There are two major methods for audio sentiment analysis: acoustic modeling and linguistic modeling. Linguistic modeling requires transcribing audio records into text files and conducting analyses based on the text content. It assumes that certain words or phrases are used with higher probability in certain situations, so one can expect specific outcomes when those words or phrases appear frequently in the transcripts. Many researchers have shown strong evidence that linguistic modeling performs well on audio sentiment analysis. Among their work, lexical features, disfluencies, and semantic features are most often used when building models. However, acquiring a good transcript for each audio record can be tedious for humans and financially costly. A good automatic transcription system that accurately transcribes audio records into text is also hard to establish, and even a small error in the text can result in a big difference in a linguistic model. Acoustic modeling, in contrast, relies on acoustic features of the audio data. These features, which include pitch, intensity, speech rate, and so on, can be easily calculated by computer programs, and together they can provide basic indicators of sentiment to some degree. Acoustic features have been used in many other business or psychological analyses as well, which is another reason to use them in sentiment analysis. However, one problem with acoustic modeling is that the quality of the audio records has a strong influence on the final result. Since we are using acoustic characteristics of the audio data, poor audio quality can significantly impact our ability to get accurate values for those acoustic features, which in the end prevents building good models. The other problem is that in a real-world setting there is always some random background noise. Model training data may not capture this noise, which would challenge the sentiment analysis results if we want to monitor a live conversation. Still, this method is worth further exploration and research because of its convenience and efficiency.

The remaining structure of this thesis is as follows: Chapter 2 gives background and related work on sentiment analysis, Chapter 3 introduces the two methods that we propose and our sentiment recognition pipeline in detail, Chapter 4 explains our selection of feature sets and classification models, Chapter 5 shows our experiment results, Chapter 6 is the conclusion, Chapter 7 discusses potential problems and future work, and the last chapter lists the references.
2. BACKGROUND AND RELATED WORK

Sentiment analysis can be generalized into three categories: the text-based approach, the acoustic feature-based approach, and the multimodal approach, which combines the previous two. Different features are used for each approach, classification models are trained with different learning algorithms, and their performances are compared. Many sentiment classification techniques have been proposed in recent years [4]. However, the overall performance is not as good as one might expect, and much more work could be done on this problem in the future.
2.1 Text-basedapproach
Textdataareoneofthemost immensecorporathataregenerated
daily.Researchesareeagertotakeadvantageofthisandmanydatamining
workshavebeendoneforthissake.Forexample,Twitter,asoneofthemost
popularsocialmediaintheworld,generatestonsoftexteveryday.Different
miningmethodsbasedonTwitter texthavebeenproposed forsentiment
analysis [23] [24]. In 2010, it is shown that recurrent neural network can
exploittemporalfeaturesintextcontentsformodelinglanguage[13].Even
formusicfield,mininglyricsdirectlycancreatemodelsforsongsentiment
classificationproblem[20]andmusicmoodclassificationproblem[21].
2.2 Acoustic feature-based approach

Many different acoustic features can be generated from an audio file: pitch, intensity, speech rate, articulation rate, and so on. Short-term acoustic features, such as Mel-frequency cepstral coefficients (MFCCs), are proved useful for music modeling in [5]. MFCCs are good for the musical genre classification task, and evidence is shown in [6]. The authors in [9] take MFCCs as the major acoustic features for the emotion recognition problem. MFCC features are also used to train a single deep neural network for speaker and language recognition in [12]. Timbre and Chroma are other acoustic features that have interested many researchers in past years. They can be used to generate music from text in [7]: in that work, emotions are extracted from text, and then a combination of timbre and Chroma features is used to compose music that expresses those emotions. Chroma and timbre features can be used to classify music as well in [8]. The authors also claim that "Chroma features are less informative for classes such as artist, but contain information that is almost entirely independent of the spectral features." Therefore, they are equally important to include when conducting sentiment analysis on acoustic features. Rhythm, timbre, and intensity of music are used for the music mood classification problem as well [22]. The acoustic feature-based approach extends to speech recognition in [10] and automatic language identification in [11] when modeling with deep neural networks.
2.3 Multimodalapproach
Withtheemergeofsocialmedia,exceptfortextcorporaandaudiorecords,
videos and images brings new opportunities for sentiment analysis. Facial and
vocalfeaturesareextractedfromthesenewstreamsandmultimodalsentiment
analysisseemstohavemuchmorepotentialinthefuture[14][15].YouTube,as
oneofthebiggestvideohosts,hasahugenumberofvideos.Researchesaretrying
to identify sentiments from those videos based on linguistic, audio and visual
features all together [25] [26]. Fusion of modalities of audio, visual and text
sentimentfeaturesassourceofinformationisanotherwaythatisproposedin[27].
For linguisticanalysis, languagedifferencecouldbeaproblem. It is shown that
language-independent audio-visual analysis is competitive with mere linguistic
analysis[28].Worksin[16][17][18]showthatsystemsthatcombinelyricfeatures
andaudiofeaturessignificantlyoutperformothersystemsthatuselyricsoraudios
singularly for music mood classification problem. Fusion of lyrics features and
melody featuresofmusic improvesperformanceofmusic information retrieval
system[19].Fusionofmultiplemodalitiesiscurrenttrendsincewecanalwaysfind
newinformationwhenweviewanexistingobjectfromanewangle.Sometimes,
7
theaccesstovisualandlinguisticsourcescouldbehardandcostly.Forexample,
inourdatasource, it ishardtogetfacialexpressionsfromacustomeroverthe
phone.Butinothercircumstanceswhenwehavealltheinformation,multimodal
approachwouldbethefirstchoice.
3. PROPOSED METHODS

The problem that this thesis addresses is to identify whether a meeting has been scheduled between a customer and a representative in a phone call conversation. We propose two methods for this kind of problem. The first method is to split each audio record into segments by speaker turns, extract a feature vector from each segment, and then use classic machine learning algorithms to classify those feature vectors according to sentiment. All segments that come from the same audio record share the same sentiment label as the audio record. The second method is to cut each audio record into short audio clips of the same length and extract a feature vector from each clip. After that, we stack those feature vectors into feature matrices in time order and feed the feature matrices into deep convolutional neural networks to build classification models. All feature matrices that come from the same audio record share the same sentiment label as the audio record. The next two subsections describe these two methods in more detail.

Figure 1. Problem Illustration
3.1 Acoustic Feature-Based Sentiment Recognition Using Classic Machine Learning Algorithms

The whole sentiment recognition process consists of five steps. First, we gather useful audio data from call center records. We select audio data that have labels, since we are more interested in using supervised machine learning algorithms. It is quite possible that unsupervised machine learning could provide interesting results for audio sentiment analysis; that is left to future exploration, either by combining it with supervised machine learning or on its own. The second step is cleaning and pre-processing the chosen audio files. Any unrelated parts of the audio data are removed, such as ring tones, a receptionist talking, music, advertisements, etc. Several methods have been tried to remove those unrelated parts; however, none of them met our expectations, so we decided to remove them manually, since the dataset we currently have is fairly small. After that, we apply a speaker diarization algorithm to split each audio record into segments by speaker turns. The third step is feature extraction and selection. We consider different combinations of feature sets and select different features to train on in the next stage. The fourth step is classification model training. We use different training algorithms, techniques, and models, then compare their performance in terms of prediction accuracy. The last step is testing the trained classification model.

Figure 2. Sentiment Recognition Process Overview (Pre-processing/Cleaning → Feature Extraction/Feature Selection → Model Training → Model Testing)
3.1.1 Gathering Audio Data

Every training example affects the final classification model. We do not want bad data to hurt the model's performance, and we want effective training datasets that generalize well to all likely scenarios without overrepresenting unlikely events. For audio datasets, the quality of the audio file can be one metric used when selecting files. Call center audio records are not always well recorded: due to internet latency and background noise, some files are very blurred and even hard for a human to understand. We need audio data that are recorded with good quality. There are also files that do not have a label, and we want to discard those as well for supervised learning purposes.
3.1.2 DataCleaningandPre-processing
Inanaudiofile,itisusuallyaconversationbetweentwospeakers:acustomer
andasalesperson.Insomescenarios,thereisareceptionistansweringthephone
callandtransferringthesalespersontowhoeverheorshewantstoreachto.The
featuresofreceptionistarenotourmainlyinterestheresoweneedtocutthose
conversationsbetweensalespersonandreceptionistifithappensintheaudiofile.
Afterthat,wewouldliketospliteachaudiofileintosegmentsbyspeakerturns.
12
Knowingchangesofcertainfeaturesbetweenconsecutivesegmentswouldhelp
thewholesentimentanalysis.Thatiswhyit is importanttosplitaudiofilesinto
segmentsbyspeakerturnsbeforegatheringfeaturesfromthem.Wealsowantto
studythetemporalrelationsoffeaturesinconsecutivesegmentsinanaudiofile.
Thatwould helpmonitor live conversations in call centers. There are plenty of
speakerdiarizationalgorithmsthatarepubliclyavailable.As referred in [1], [2],
and[3],alotofresearcheshasbeenconductedforthistypeofproblem,andmany
ofthemhaveagoodperformancetosplitaudiofilebyspeakerturnsautomatically.
In our research, we have used IBM Watson and Google cloud computing
services to split our audio files into segments by speaker turns. Both of these
systemshavetheabilitytoidentifywhospeaksatwhattimeandtranscribethe
conversationintotext.Eventhoughthetranscriptionpartisnotasaccurateaswe
though, the timestamps of start and end time for each speaker turn are good
enoughforustousetoseparateeachaudiofileintosegments.
Sinceweonlyhaveasmallportionofaudiodatafornow,andthetranscripts
ofallaudiofilesareavailableforus,wespliteachaudiofileintosegmentsbased
on timestamps in the transcript accordingly. However, the speaker diarization
procedureisstillcrucialwhenhavemoreandmoredatacomingininthefuture
andtheaccesstoallofthetranscriptswouldbeverycostly.Thatisthetimewe
maywanttouseautomaticspeakerdiarizationtechniques.
13
Figure3.SpeakerDiarizationProcess
Pseudo-code:
function speaker_diarization():
    open audio file
    open transcription file
    for each timestamp pair (start_time, stop_time) in transcription file:
        segment = cut_audio_file(start_time, stop_time)
        segment_label = audio_file_label
        if segment belongs to customer:
            add segment to customer_cluster
        if segment belongs to representative:
            add segment to representative_cluster
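The pseudo-code above can be sketched concretely in Python. This is a minimal illustration, not the thesis's actual code: it assumes the transcript has already been parsed into (start, stop, speaker) tuples in seconds, and that the audio is available as a mono sample sequence at a known rate; all names are ours.

```python
# Minimal sketch of timestamp-based speaker segmentation (hypothetical code).
# `samples` is a mono PCM sample sequence at `rate` Hz; `turns` comes from a
# parsed transcript as (start_seconds, stop_seconds, speaker) tuples.

def split_by_turns(samples, rate, turns, record_label):
    """Cut `samples` at each speaker turn and bucket segments by speaker."""
    clusters = {"customer": [], "representative": []}
    for start_s, stop_s, speaker in turns:
        lo = int(start_s * rate)
        hi = min(int(stop_s * rate), len(samples))
        segment = samples[lo:hi]
        if segment:  # skip empty or out-of-range turns
            # every segment inherits the whole record's sentiment label
            clusters[speaker].append((segment, record_label))
    return clusters

# Usage: a fake 1 kHz "recording" with two turns of 0.5 s each
audio = list(range(1000))
turns = [(0.0, 0.5, "representative"), (0.5, 1.0, "customer")]
out = split_by_turns(audio, 1000, turns, "positive")
print(len(out["customer"][0][0]))  # -> 500 samples in the customer turn
```

In practice the cutting would be done on the decoded waveform (e.g. with an audio library) rather than a plain list, but the bookkeeping is the same.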
3.1.3 Feature Extraction and Selection

After we have split each audio file into segments by speaker turns, we extract features from each segment. There are many features that can be extracted from an audio file; for our research purposes, we focus on acoustic features. Two groups of features mainly interest us. One group is prosody features, such as pitch, intensity, speech rate, number of pauses, etc. The other group is short-term features. We explain our feature sets in detail in Chapter 4.
Pseudo-code:
function feature_extraction():
    for each segment in customer_cluster:
        prosody, short_term = calc_algorithm(segment)
        add prosody to prosody_feature_cluster for customer
        add short_term to short_term_cluster for customer
    for each segment in representative_cluster:
        prosody, short_term = calc_algorithm(segment)
        add prosody to prosody_feature_cluster for representative
        add short_term to short_term_cluster for representative
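As a toy stand-in for `calc_algorithm` (the thesis uses Praat and pyAudioAnalysis, which compute far richer feature sets), the sketch below extracts two simple short-term descriptors per segment, short-term energy and zero-crossing rate, and mirrors the per-cluster loop of the pseudo-code. All names are illustrative assumptions.

```python
import math

# Toy per-segment feature extractor: energy and zero-crossing rate.
# A real pipeline would call Praat / pyAudioAnalysis here instead.

def calc_features(segment):
    """Return (short-term energy, zero-crossing rate) for a sample list."""
    n = len(segment)
    energy = sum(x * x for x in segment) / n
    zcr = sum(
        1 for a, b in zip(segment, segment[1:]) if (a >= 0) != (b >= 0)
    ) / (n - 1)
    return energy, zcr

def extract_clusters(customer_cluster, representative_cluster):
    """Mirror the pseudo-code: one feature list per speaker role."""
    return (
        [calc_features(seg) for seg in customer_cluster],
        [calc_features(seg) for seg in representative_cluster],
    )

# A unit sine tone has mean-square energy 0.5; its zero-crossing rate
# reflects its frequency.
tone = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
feats, _ = extract_clusters([tone], [tone])
print(round(feats[0][0], 2))  # -> 0.5
```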
3.1.4 Train Classification Model

Once we have features for each segment, we split the data into a training set, a validation set, and a testing set, then train a classification model on these features and their associated labels. Every segment in the same audio file has the same label as that audio file. We also partition segments into two groups: a customer group and a salesperson group. Since customers may behave differently from salespersons, we train two different models to represent each of them. We take all features in the customer group as the training dataset for the customer model, and all features in the salesperson group as the training data for the salesperson model. The validation set is used to tune the hyperparameters of the learning algorithms, and the testing set is used to measure the performance of the trained model. More details of the classification models and learning algorithms are in Chapter 4.

Figure 4. Data Flow Diagram for Model Training
3.1.5 Test Classification Model

After the model is built in the previous procedure, we use it to predict labels for any audio file on which we want to perform sentiment analysis. The audio file is still split into segments by the timestamps in the transcript. The model predicts a label for each segment in that file, and we take the majority vote of the labels as the final label for that test audio file:

    y = 1 if (1/n) * Σ yᵢ ≥ 0.5, and y = 0 otherwise,

where n is the number of segments in the audio record and yᵢ ∈ {0, 1} is the predicted label of segment i.
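The majority-vote rule translates directly into a few lines of Python (the function name is ours; ties go to positive, as the ≥ in the formula implies):

```python
def majority_vote(segment_labels):
    """Record-level label: 1 if at least half of the per-segment
    predictions are positive (label 1), 0 otherwise."""
    n = len(segment_labels)
    return 1 if sum(segment_labels) / n >= 0.5 else 0

print(majority_vote([1, 0, 1, 1]))  # -> 1  (3/4 positive)
print(majority_vote([0, 0, 1]))     # -> 0  (1/3 positive)
```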
Figure 5. Majority Vote Method for Model Testing

3.2 Acoustic Feature-Matrix-Based Sentiment Recognition Using Deep Convolutional Neural Network

3.2.1 Deep Learning

It has been shown in [29] that deep convolutional neural networks can be applied to character-level feature matrices for the text classification problem, achieving results competitive with current state-of-the-art methods. This is evidence that temporal information can be captured by applying deep convolutional neural networks to feature matrices of time series data. With the advancement of machine learning, especially deep learning methods, audio signal analysis can take advantage of it to build models that are competitive with, or even better than, other approaches.
3.2.2 Feature Matrix

Convolutional neural networks usually take a two-dimensional matrix as input, so we stack audio feature vectors into a feature matrix to meet that requirement. Short-term features are extracted every 50 milliseconds from each audio record and then stacked in time order to build the feature matrix for that record. The window size is 300, which means we cut the matrix every 300 rows. Every matrix generated from the same audio record shares the same sentiment label as the audio record. If the number of rows, or the number of remaining rows after cutting, is less than 300, we pad with zero-valued rows. Each training matrix therefore has a size of 300 by 34, since we have 34 short-term features in total. The figure below illustrates the process of building feature matrices.
Pseudo-code:
function feature_matricization():
    for each audio record:
        cut it into equal-length (50 ms) segments
        for each segment:
            short_term = calc_algorithm(segment)
        stack up short-term features into a feature matrix in time order
        cut the feature matrix every 300 rows
        matrix_label = audio_record_label
        add to training dataset

Figure 6. Feature Matrix Illustration
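The cut-and-pad step above can be sketched as follows. Plain Python lists stand in for the 300x34 arrays a real pipeline would build with NumPy; the function name and defaults are ours.

```python
# Sketch of the matricization step: stack per-clip feature vectors in time
# order, cut every `window` rows, and zero-pad the final chunk.

def build_matrices(feature_vectors, window=300):
    """Split a time-ordered list of feature rows into fixed-height matrices."""
    if not feature_vectors:
        return []
    width = len(feature_vectors[0])
    matrices = []
    for i in range(0, len(feature_vectors), window):
        chunk = feature_vectors[i:i + window]
        # pad the last chunk with zero-valued rows up to `window` rows
        chunk += [[0.0] * width] * (window - len(chunk))
        matrices.append(chunk)
    return matrices

# 700 time steps of 34 features -> three 300x34 matrices, the last padded
rows = [[1.0] * 34 for _ in range(700)]
mats = build_matrices(rows)
print(len(mats), len(mats[0]), len(mats[0][0]))  # -> 3 300 34
```

Every matrix produced from one record would then carry that record's sentiment label, as in the pseudo-code.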
3.2.3 Architecture

Massive experiments have been conducted to compare performance under different network settings. Those settings include the number of convolutional layers, the kernel size, the max pooling kernel size, and the window size used when cutting the matrix. The figure below shows a typical network architecture with four convolutional layers. This architecture design is based on pre-trained neural networks that work extremely well on image datasets. Every convolutional layer is followed by a max pooling layer. We use 1-D convolutional kernels here, since the features are distinct from each other, and our experiment results show that a 1-D convolutional kernel performs better than a 2-D convolutional kernel.

Figure 7. Deep Convolutional Neural Network Architecture
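To make the layer arithmetic concrete, one can trace how the 300-step time axis of a feature matrix shrinks through four 1-D convolution + max-pooling stages. The kernel size (3) and pooling size (2) below are illustrative assumptions, not the exact settings from the experiments:

```python
def conv1d_out(length, kernel, stride=1):
    """Output length of a 'valid' 1-D convolution along the time axis."""
    return (length - kernel) // stride + 1

def pool1d_out(length, kernel):
    """Output length of non-overlapping 1-D max pooling."""
    return length // kernel

# 300 time steps in, four conv(kernel=3) + maxpool(kernel=2) stages
t = 300
for stage in range(4):
    t = pool1d_out(conv1d_out(t, 3), 2)
    print(f"after stage {stage + 1}: {t} time steps")
# prints 149, 73, 35, 16 time steps
```

The same arithmetic explains why the window size matters: a deeper stack needs enough input rows to survive repeated halving.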
4. FEATURE SETS AND CLASSIFICATION ALGORITHMS

4.1 Feature Sets

4.1.1 Fundamental Features

The first feature set that we consider consists of fundamental features of any audio data, such as loudness in decibels, mean of pitch, standard deviation of pitch, covariance of pitch, speech rate, number of pauses in a segment, and length of the segment in milliseconds. These are general features that can represent an audio segment to some extent, and the literature of the past decade has shown their crucial importance in audio signal analysis.
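A few of these fundamental features can be illustrated schematically from a per-frame pitch track. The thesis obtains such values from Praat; the code below is only a hypothetical stand-in that treats 0.0 as an unvoiced (pause) frame:

```python
import statistics

# Schematic computation of some fundamental features from a pitch track
# (one pitch value in Hz per 50 ms frame; 0.0 marks unvoiced frames).

def fundamental_features(pitch_track, frame_ms=50):
    """Pitch statistics, pause count, and length for one segment."""
    voiced = [p for p in pitch_track if p > 0]
    # count a pause as each maximal run of unvoiced frames
    pauses = sum(
        1 for i, p in enumerate(pitch_track)
        if p == 0 and (i == 0 or pitch_track[i - 1] > 0)
    )
    return {
        "pitch_mean": statistics.mean(voiced),
        "pitch_std": statistics.stdev(voiced),
        "num_pauses": pauses,
        "length_ms": len(pitch_track) * frame_ms,
    }

track = [210, 220, 0, 0, 230, 240, 0, 250]  # Hz per 50 ms frame
f = fundamental_features(track)
print(f["num_pauses"], f["length_ms"])  # -> 2 400
```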
4.1.2 Shimmer and Jitter

Jitter measures the cycle-to-cycle variation in the frequency (period) of vocal fold vibration, while shimmer measures the cycle-to-cycle variation in its amplitude. These two features characterize an audio segment from another perspective.
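As a schematic illustration (the thesis obtains these values from Praat), local jitter and local shimmer can both be computed as the mean absolute difference between consecutive cycles, normalized by the mean value, applied to cycle periods and cycle amplitudes respectively:

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive cycles divided by the
    mean value: local jitter when fed periods, local shimmer when fed
    peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))

periods = [0.010, 0.011, 0.010, 0.012]  # seconds per vocal-fold cycle
amplitudes = [0.80, 0.78, 0.82, 0.79]   # peak amplitude per cycle
jitter = relative_perturbation(periods)
shimmer = relative_perturbation(amplitudes)
print(round(jitter, 3), round(shimmer, 3))
```

Real extraction is harder than this sketch suggests, since the pitch periods themselves must first be located reliably in the waveform.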
4.1.3 Short-term Features

Mel-frequency cepstral coefficients, Chroma, and timbre features are the short-term features used in this work. The Mel-frequency cepstrum is a representation of the short-term power spectrum of a sound, and Mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up that cepstrum. Different combinations of those coefficients carry different indications for sentiment analysis, so we include them in our feature set. Chroma features represent the spectral energy of different pitch classes. Timbre features capture the character or quality of a sound as distinct from its pitch and intensity. These three groups of features have been frequently used in many classification problems, as explained in the related work in Chapter 2.

Feature  | Description
MFCCs    | Mel-frequency cepstral coefficients from a cepstral representation where the frequency bands are not linear but distributed according to the mel scale
Chroma   | A representation of the spectral energy where the bins represent equal-tempered pitch classes
Timbre   | The character or quality of a sound or voice as distinct from its pitch and intensity

Table 1. Short-term Feature Descriptions
4.1.4 Emotion Features

We also find that emotion features can be critical indicators of sentiment. Positive emotions, like happiness and satisfaction, can lead to a good result at the end of the conversation, while negative emotions, like being upset or angry, can turn things the other way. We want to train our model on these features too.

4.1.5 Spectrograms

Besides scalar-valued inputs, we also generate an image feature from each segment: its spectrogram. A spectrogram is a visual representation of the spectrum of frequencies of a sound as it varies with time. Some sentiments may have specific patterns in the spectrogram, and we want our model to be able to identify those patterns if they exist.
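A spectrogram is just a sequence of windowed magnitude spectra. The toy sketch below computes one with a naive DFT from the standard library; a real pipeline (as in Chapter 5) would use matplotlib's spectrogram plotting or an FFT library instead.

```python
import cmath
import math

def spectrogram(samples, win=64, hop=32):
    """Magnitude spectrum per window via a naive DFT (O(n^2) per frame;
    fine for a demo, but real code would use an FFT)."""
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frame = samples[start:start + win]
        spectrum = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / win)
                    for n, x in enumerate(frame)))
            for k in range(win // 2)  # keep non-negative frequencies only
        ]
        frames.append(spectrum)
    return frames  # time x frequency magnitude matrix

# A pure tone concentrates its energy in a single frequency bin:
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(32), key=lambda k: spec[0][k])
print(peak_bin)  # -> 8 (the tone completes 8 cycles per 64-sample window)
```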
4.2 Classification Algorithms

In our experiments, we use both classical machine learning algorithms and neural networks. For the classical algorithms, we have used a support vector machine with a cubic kernel and a k-nearest neighbor classifier. For neural networks, we have used classic shallow feed-forward neural networks and deep convolutional neural networks. Different learning algorithms have different advantages and characteristics, and we want to experiment with all of them and compare their results.
5. EXPERIMENT RESULTS

5.1 Dataset

Our dataset consists of 88 audio recording files from Penske corporate offices. The duration of these files ranges from a few minutes to ten minutes, with an average length of about three minutes. After we split these audio files into segments by speaker turns, we have 2859 segments in total. Among them, 1438 are from customers and 1421 are from salespersons. Features extracted from the customer group are used to train the customer model, and likewise the salesperson group for the salesperson model.

              | Positive | Negative | Total
Audio Records | 31       | 55       | 86
Segments      | 1284     | 1304     | 2588

Table 2. Dataset Summary
5.2 Feature Extraction Libraries

Praat is the main software that we used to extract the fundamental audio features as well as the shimmer and jitter values. We use a Python library named pyAudioAnalysis, available on GitHub, to extract the MFCC features, and the openSMILE package to extract the emotion features. The pyplot module of the matplotlib package is used to generate a spectrogram for each audio segment.

5.3 Training Tools

We use the machine learning package in Matlab to train models with the classical machine learning algorithms. The neural networks are coded in Keras, which runs on top of TensorFlow.
5.4 Per-SegmentResults
Since customerbehaviors different than representative,wewant to two
separatemodelstorepresenteachofthem.Speakerdiarizationprocesshelpson
that. Features extracted fromcustomer segments are grouped intoone cluster
while other features extracted from representative segments are grouped into
anothercluster.Featuresincustomerclusterareusedtotraincustomermodeland
so as the representativemodel.Models’ performances are compared by using
different feature set for training. Below is the figure that summarizes the
comparisonbasedonpredictionaccuracyforsegments:
28
Figure8.ComparisonofCustomerModelandRepresentativeModelonDifferentFeatureSets
From above figure, we can easily conclude that short-term features are
superiorindicatorsthanprosodyfeatures.Whenbothofthemareusedfortraining,
accuracyisslightlyimproved.Thatdoesnotmeancombinationofthemhasbetter
indicationstrength.Moreexperimentsshouldbeconductedfora largerdataset
beforewecansayanythingaboutit.
5.5 Per-RecordResults
Segment-wise prediction is not our final goal here.We want to predict
whetherrepresentativeispersuasiveornotinawholeaudiorecord.Toaccomplish
this,wepredicteachsegmentofatestaudiorecordusingmodelstrainedin5.2.1
0.646
0.82 0.822
0.570.684 0.694
0
0.2
0.4
0.6
0.8
1
Prosody Short-term Together
CustomerModelvs.RepresentativeModel
Customer Representative
29
and take the majority vote of all the predictions as our final prediction for that audio record. We run our experiments using different models for prediction. Here we have three model settings: 1) use the customer model only to predict customer segments and take the majority vote as the final prediction for that audio record; 2) use the representative model only to predict representative segments and take the majority vote as the final prediction; 3) use both the customer and representative models to predict all segments in an audio record and take the majority vote as the final prediction. Below is the comparison result.
Figure 9. Model Comparison Based on Majority Votes Per Audio Record (per-record accuracy for Prosody / Short-term / Together: Customer model 0.583 / 0.773 / 0.786; Representative model 0.536 / 0.685 / 0.691; Both models 0.571 / 0.739 / 0.75)
The figure above shows that using the customer model alone gives the best performance. This matches the intuition that representatives behave similarly regardless of customer behavior: representatives are consistent no matter how customers respond, while customers show different characteristics depending on whether their sentiment is positive or negative. Using both models gives a performance in between.
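The per-record majority vote used in these three settings can be sketched as below. The model callables and labels are placeholders, not the trained models from our experiments.

```python
from collections import Counter

def record_prediction(segments, models, setting):
    """Majority vote over per-segment predictions for one audio record.

    segments: list of (speaker, features) pairs; `setting` selects which
    segments and models contribute, mirroring the three settings above.
    """
    votes = []
    for speaker, feats in segments:
        if setting == "customer" and speaker != "customer":
            continue
        if setting == "representative" and speaker != "representative":
            continue
        votes.append(models[speaker](feats))  # per-segment prediction
    return Counter(votes).most_common(1)[0][0]

# Toy per-speaker models that ignore features and return fixed labels
models = {"customer": lambda f: "positive",
          "representative": lambda f: "negative"}
record = [("customer", None), ("customer", None), ("representative", None)]

print(record_prediction(record, models, "customer"))  # positive
print(record_prediction(record, models, "both"))      # positive (2 votes to 1)
```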
5.6 Deep Learning Results
The figure below compares results for different numbers of convolutional layers, using either feature vectors or feature matrices as input to the neural network.
Figure 10. Prediction Accuracy vs. Number of Convolutional Layers (feature vector vs. feature matrix input)
It can be seen that four convolutional layers give the best performance, and that using feature matrices outperforms using feature vectors, which suggests that the temporal information between audio clips can be captured by a deep convolutional neural network.
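The feature-matrix construction described above, stacking one feature vector per equal-length clip so that row order preserves time, can be sketched as follows. The toy FFT-based "feature" merely stands in for the real short-term features, and the clip length is an illustrative choice.

```python
import numpy as np

def feature_matrix(signal, clip_len, n_features=13):
    """Stack one feature vector per equal-length clip into a matrix.

    Rows follow time order, so consecutive rows correspond to consecutive
    clips, which is the temporal structure the CNN is meant to exploit.
    Here the first FFT magnitudes stand in for real short-term features.
    """
    n_clips = len(signal) // clip_len
    rows = []
    for i in range(n_clips):
        clip = signal[i * clip_len:(i + 1) * clip_len]
        rows.append(np.abs(np.fft.rfft(clip))[:n_features])
    return np.stack(rows)  # shape: (n_clips, n_features)

mat = feature_matrix(np.random.randn(8000), clip_len=800)
print(mat.shape)  # (10, 13)
```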
Figure 11. 1-D CNN vs. 2-D CNN on Different Pooling Kernel Sizes (kernel sizes 2 to 5)
The figure above shows the comparison of 1-D convolution versus 2-D convolution for different pooling kernel sizes. From the result, we find that the 1-D convolutional neural network performs better than the 2-D convolutional neural network. This might be because there is no temporal meaning between each
individual short-term feature, since the columns of the matrix represent different short-term features. It also shows that prediction accuracy can be improved by applying a larger max-pooling kernel size.
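A minimal NumPy sketch of 1-D convolution plus max pooling applied down the time axis of a feature matrix, treating each column as an independent feature channel, consistent with the observation that the columns carry no temporal meaning. The kernel values and shapes are illustrative, not the trained network's.

```python
import numpy as np

def conv1d_time(mat, kernel):
    """Valid 1-D convolution down the time axis, per feature column."""
    out = np.stack([np.convolve(mat[:, j], kernel, mode="valid")
                    for j in range(mat.shape[1])], axis=1)
    return out  # shape: (time - len(kernel) + 1, n_features)

def max_pool_time(mat, pool):
    """Non-overlapping max pooling down the time axis."""
    n = mat.shape[0] // pool
    return mat[:n * pool].reshape(n, pool, -1).max(axis=1)

x = np.random.randn(20, 13)             # (time, features) matrix
h = conv1d_time(x, np.array([0.25, 0.5, 0.25]))
print(h.shape)                          # (18, 13)
print(max_pool_time(h, pool=3).shape)   # (6, 13)
```

A larger `pool` compresses the time axis more aggressively, which is the knob varied in Figure 11.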
Figure 12. Comparison of Different Window Sizes (1-D CNN; 50 ms, 100 ms, 300 ms, 1000 ms)
This figure shows the comparison result of using different window sizes when extracting features from audio records. It appears that a larger window size gives better performance. However, a larger window size also results in a smaller training dataset for the neural network. Whether the trend holds needs to be verified by conducting more experiments on a larger dataset.
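The trade-off noted above, that a larger window yields fewer training samples, follows directly from the frame count. The sampling rate here is an assumed example, not the rate of our recordings.

```python
def n_windows(n_samples, sr, window_ms):
    """Number of non-overlapping windows a recording yields."""
    return n_samples // int(sr * window_ms / 1000)

sr = 8000                 # assumed sampling rate for illustration
one_minute = 60 * sr
for ms in (50, 100, 300, 1000):
    print(ms, n_windows(one_minute, sr, ms))
# 50 ms -> 1200 windows per minute, 1000 ms -> only 60: a larger
# window size shrinks the training set available to the network.
```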
Figure 13. Comparison of Different Machine Learning Algorithms (accuracy for Prosody / Short-term / Together: SVM 0.583 / 0.773 / 0.786; K-nearest neighbor 0.592 / 0.735 / 0.749; Feedforward NN 0.614 / 0.792 / 0.796; Deep CNN 0.647 / 0.826 / 0.813)
This figure compares model performance across different machine learning algorithms. It can be seen that the deep convolutional neural network performs slightly better than the other methods listed. Short-term features indicate sentiment more strongly than prosody features.
6. Conclusion
Audio sentiment analysis is a method of inferring the opinions of customers in conversations with customer support representatives. Having a mature system to analyze sentiment would be very beneficial to the business world. This thesis presents two methods to address this problem. One is based on segment-wise feature-vector classification: every audio recording is split into segments by speaker turns before analysis, different sets of features are extracted from each segment, and a classification model is trained on these features. The other is to feed feature matrices into a deep convolutional neural network and let it learn the temporal information between consecutive audio clips; each feature matrix is constructed by stacking up feature vectors extracted from equal-length audio clips of the records. We compare the performance of models trained with different machine learning algorithms and show that there are strong connections between the sentiment we analyze and the features we extract. Short-term features, such as MFCCs, chroma, and timbre features, are better indicators than prosody features. Our experimental results also show that deep convolutional neural networks can be applied not only to image datasets but also to non-image data such as feature matrices, capturing the temporal information between consecutive audio clips within a feature matrix. Finally, we present the problems encountered during our research; further work can be conducted to address these problems and improve overall system performance.
7. FUTURE WORK
An automatic speaker diarization process can be embedded into this audio sentiment analysis pipeline, since we want to split each audio recording into segments by speaker turns. This is even more important when transcripts are costly and time-consuming to obtain. Accurate text transcription is another direction, since much work has already been done on text-based sentiment analysis; if we can automatically get a good text transcription of an audio file, we can combine those text-based techniques with our acoustic-feature-based techniques to improve the performance of the analysis system. The feature matrix is not the end: there could be a better representation of features to feed into deep neural networks. Moreover, we can look for other acoustic features that are better indicators for characterizing audio data. More experiments should be conducted on a larger dataset to build more reliable models.