0409 humaine wp4 santorin speechemotionrecognition_v3
TRANSCRIPT
-
7/31/2019 0409 HUMAINE WP4 Santorin SpeechEmotionRecognition_v3
1/21
Occasion: HUMAINE / WP4 / Workshop
"From Signals to Signs of Emotion and Vice Versa"
Santorin / Fira, 18th-22nd September, 2004
Talk: Ronald Müller
Speech Emotion Recognition: Combining Acoustic and Semantic Analyses
Institute for Human-Machine Communication
Technische Universität München
-
Slide -2-
Outline
System Overview
Emotional Speech Corpus
Acoustic Analysis
Semantic Analysis
Stream Fusion
Results
-
Slide -3-
System Overview
Speech signal
Acoustic stream: Prosodic features -> Classifier (SVM)
Semantic stream: ASR-unit -> Semantic interpretation (Bayesian Networks)
Stream fusion (MLP)
Emotion
-
Slide -4-
Emotional Speech Corpus
Emotion set: Anger, disgust, fear, joy, neutrality, sadness, surprise
Corpus 1: Practical course
404 acted samples per emotion (2828 samples in total)
13 speakers (1 female)
Recorded within one year
Corpus 2: Driving simulator
500 spontaneous emotion samples
200 acted samples (disgust, sadness)
700 samples in total
-
Slide -6-
Acoustic Analysis
Low-level features:
Pitch contour (AMDF, low-pass filtering)
Energy contour
Spectrum
Signal
High-level features:
Statistical analysis of contours
Elimination of mean, normalization to standard deviation
Duration of one utterance (1-5 seconds)
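The low-level and high-level steps on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the talk: the function names, the pitch search range (75-400 Hz) and the choice of statistics are assumptions, and the low-pass filtering mentioned on the slide is not shown.

```python
import numpy as np

def amdf_pitch(frame, sample_rate, f_min=75.0, f_max=400.0):
    """Estimate the pitch (Hz) of one voiced frame from the AMDF minimum.

    f_min and f_max are assumed search limits; octave errors are not handled.
    """
    frame = np.asarray(frame, dtype=float)
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    lags = np.arange(lag_min, lag_max + 1)
    # AMDF(lag) = mean |x[n] - x[n + lag]|; it dips at the pitch period.
    amdf = np.array([np.mean(np.abs(frame[:-lag] - frame[lag:])) for lag in lags])
    best_lag = lags[np.argmin(amdf)]
    return sample_rate / best_lag

def contour_statistics(contour):
    """A few utterance-level statistics of a frame-wise contour (illustrative subset)."""
    contour = np.asarray(contour, dtype=float)
    steps = np.diff(contour)
    return {
        "mean": float(contour.mean()),
        "std": float(contour.std()),
        # Assumption: "maximum gradient" taken as the largest frame-to-frame change.
        "max_gradient": float(np.max(np.abs(steps))),
    }

def normalize(features):
    """Remove the mean and scale each feature column to unit standard deviation."""
    features = np.asarray(features, dtype=float)
    return (features - features.mean(axis=0)) / features.std(axis=0)
```

Calling `amdf_pitch` on successive frames of an utterance yields a pitch contour, which `contour_statistics` then condenses into a fixed-length description regardless of utterance duration.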
-
Slide -7-
Acoustic Analysis
Feature selection (1/2)
Initial set of 200 statistical features
Ranking 1: Performance of each single feature (nearest-mean classifier)
Ranking 2: Sequential Forward Floating Search (SFFS), wrapped around a nearest-mean classifier
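Both rankings can be sketched with a nearest-mean classifier as the wrapper. The code below is a compact illustration under assumptions: the helper names, the train/test split used for scoring and the Euclidean distance are not taken from the talk, and `n_select` must not exceed the number of features.

```python
import numpy as np

def nearest_mean_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy of a nearest-mean classifier (Euclidean distance)."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
    predictions = classes[np.argmin(dists, axis=1)]
    return float(np.mean(predictions == y_test))

def make_score(X_train, y_train, X_test, y_test):
    """Return score(columns): wrapper accuracy using only the given feature columns."""
    def score(columns):
        cols = list(columns)
        return nearest_mean_accuracy(X_train[:, cols], y_train, X_test[:, cols], y_test)
    return score

def single_feature_ranking(score, n_features):
    """Ranking 1: order features by their individual wrapper accuracy."""
    scores = [score([j]) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def sffs(score, n_features, n_select):
    """Ranking 2: Sequential Forward Floating Search up to n_select features."""
    selected, best = [], {}  # best[k]: best score seen so far for subsets of size k
    while len(selected) < n_select:
        # Forward step: add the feature that raises the score most.
        candidates = [j for j in range(n_features) if j not in selected]
        gains = [score(selected + [j]) for j in candidates]
        selected.append(candidates[int(np.argmax(gains))])
        size = len(selected)
        best[size] = max(best.get(size, -np.inf), max(gains))
        # Floating step: drop a feature while that beats the best smaller subset.
        while len(selected) > 2:
            drops = [score([f for f in selected if f != j]) for j in selected]
            i = int(np.argmax(drops))
            if drops[i] <= best[len(selected) - 1]:
                break
            del selected[i]
            best[len(selected)] = drops[i]
    return selected
```

The floating step is what distinguishes SFFS from plain forward selection: a feature added early can be discarded later once a better combination makes it redundant.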
-
Slide -8-
Acoustic Analysis
Feature selection (2/2)
Top 10 features
Acoustic Feature                                               SFFS-Rank  Single Perf.
Pitch, maximum gradient                                        1          31.5
Pitch, standard deviation of distance between reversal points  2          23.0
Pitch, mean value                                              3          25.6
Signal, number of zero-crossings                               4          16.9
Pitch, standard deviation                                      5          27.6
Duration of silences, mean value                               6          17.5
Duration of voiced sounds, mean value                          7          18.5
Energy, median of fall-time                                    8          17.8
Energy, mean distance between reversal points                  9          19.0
Energy, mean of rise-time                                      10         17.6
-
Slide -9-
Acoustic Analysis
Classification
Evaluation of various classification methods
33 features
Classifier  Error, % (speaker indep.)  Error, % (speaker dep.)
kMeans      57.05                      27.38
kNN         30.41                      17.39
GMM         25.17                      10.88
MLP         26.86                      9.36
SVM         23.88                      7.05
ML-SVM      18.71                      9.05
Output: Vector of (pseudo-) recognition confidences
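The slide does not state how the two error columns were obtained. One common way to separate the two conditions is sketched below, purely as an assumption: leave-one-speaker-out for the speaker-independent case, and within-speaker folds for the speaker-dependent case. A nearest-mean classifier stands in for the classifiers in the table, and all function names are hypothetical.

```python
import numpy as np

def nearest_mean_predict(X_train, y_train, X_test):
    """Stand-in classifier: assign each test sample to the nearest class mean."""
    classes = np.unique(y_train)
    means = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - means[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

def speaker_independent_error(X, y, speakers):
    """Leave-one-speaker-out error in percent: the test speaker is never seen in training."""
    errors = 0
    for s in np.unique(speakers):
        test = speakers == s
        pred = nearest_mean_predict(X[~test], y[~test], X[test])
        errors += int(np.sum(pred != y[test]))
    return 100.0 * errors / len(y)

def speaker_dependent_error(X, y, speakers, n_folds=2):
    """Error in percent when training and testing on data of the same speaker."""
    errors = 0
    for s in np.unique(speakers):
        idx = np.flatnonzero(speakers == s)
        for fold in range(n_folds):
            test = idx[fold::n_folds]          # every n-th sample of this speaker
            train = np.setdiff1d(idx, test)    # the speaker's remaining samples
            pred = nearest_mean_predict(X[train], y[train], X[test])
            errors += int(np.sum(pred != y[test]))
    return 100.0 * errors / len(y)
```

With speaker-specific offsets in the features, the within-speaker error stays low while the leave-one-speaker-out error rises, which is the kind of gap the table shows between its two columns.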
-
Slide -10-
Acoustic Analysis
Classification
Multi-Layer Support Vector Machines
Acoustic feature vector
Layer 1: ang, ntl, fea, joy / dis, sur, sad
Layer 2: ang, ntl / fea, joy; dis, sur / sad
Layer 3: ang / ntl; fea / joy; dis / sur
Leaves: ang, ntl, fea, joy, sad, dis, sur
No confidence vector to forward to fusion
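The layered decision can be sketched as a binary tree that is walked from the root to a single leaf. The class grouping below follows the slide; everything else is an assumption for illustration, in particular the nearest-mean decider that stands in for each binary SVM.

```python
import numpy as np

# Class grouping from the slide: every inner node separates two groups of emotions.
TREE = ((("ang", "ntl"), ("fea", "joy")), (("dis", "sur"), "sad"))

def leaves(node):
    """List the emotion labels below a node."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def train(node, X, y):
    """Recursively train one binary decider per inner node.

    Stand-in for an SVM: each decider only stores the mean of its two groups.
    """
    if isinstance(node, str):
        return node
    mean_left = X[np.isin(y, leaves(node[0]))].mean(axis=0)
    mean_right = X[np.isin(y, leaves(node[1]))].mean(axis=0)
    return (mean_left, mean_right, train(node[0], X, y), train(node[1], X, y))

def classify(model, x):
    """Walk from the root to a leaf; the result is one label, not a score per class."""
    while not isinstance(model, str):
        mean_left, mean_right, left, right = model
        go_left = np.linalg.norm(x - mean_left) <= np.linalg.norm(x - mean_right)
        model = left if go_left else right
    return model
```

Because each sample follows only one path, the tree returns a hard label and no scores for the six emotions it did not choose, which is why this variant has no confidence vector to pass on to the fusion stage.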
-
Slide -12-
Semantic Analysis
ASR-unit: HMM-based
German vocabulary of 1300 words
No language model
5-best phrase hypotheses
Recognition confidences per word
Example output (first hypothesis):
I     can't  stand  this  every  tray  traffic-jam
69.3  34.6   72.1   20.0  36.1   15.9  55.8
-
Slide -13-
Semantic Analysis
Conditions:
Natural language
Erroneous speech recognition
Uncertain knowledge
Incomplete knowledge
Superfluous knowledge
Probabilistic spotting approach
Bayesian Belief Networks
-
Slide -14-
Semantic Analysis
Bayesian Belief Networks
Acyclic graph of nodes and directed edges
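One state variable $X_i$ per node (here with the two states $x_i$ and $\bar{x}_i$)
Setting node dependencies via conditional probability matrices (P: parent, C: child):
$$P(X_C \mid X_P) = \begin{pmatrix} P(x_C \mid x_P) & P(x_C \mid \bar{x}_P) \\ P(\bar{x}_C \mid x_P) & P(\bar{x}_C \mid \bar{x}_P) \end{pmatrix}$$
Setting initial probabilities in root nodes:
$$P(X_R) = \big(P(x_R),\ P(\bar{x}_R)\big)^T$$
Observation A causes evidence in a child node (i.e. $P(x_C)$ is known)
Inference to direct parent nodes and finally to root nodes
Bayes' rule:
$$P(X_P \mid X_C) = \frac{P(X_C \mid X_P)\,P(X_P)}{P(X_C)}$$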
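A small numeric example may make the inference step concrete. It covers a single parent-child pair with two states each; the probabilities are invented for illustration, and weighting the posteriors by an uncertain child belief (Jeffrey's rule) is an assumption about how soft evidence could be handled, not a statement about the talk's system.

```python
import numpy as np

def posterior_parent(prior, cpm, child_belief):
    """Infer P(X_P | evidence) for one parent-child pair.

    prior:        P(X_P) as [P(x_P), P(not x_P)]
    cpm[c, p]:    P(X_C = c | X_P = p); each column sums to 1
    child_belief: belief in each child state after the observation;
                  [1, 0] means the child state x_C was observed with certainty
    """
    prior = np.asarray(prior, dtype=float)
    cpm = np.asarray(cpm, dtype=float)
    child_belief = np.asarray(child_belief, dtype=float)
    joint = cpm * prior                      # joint[c, p] = P(c | p) * P(p)
    p_child = joint.sum(axis=1)              # P(X_C), the denominator of Bayes' rule
    parent_given_child = joint / p_child[:, None]   # row c holds P(X_P | X_C = c)
    # Assumption: soft evidence weights the per-state posteriors (Jeffrey's rule).
    return child_belief @ parent_given_child

prior = [0.2, 0.8]                 # illustrative root probabilities
cpm = [[0.9, 0.1],
       [0.1, 0.9]]                 # illustrative conditional probability matrix
hard = posterior_parent(prior, cpm, [1.0, 0.0])
soft = posterior_parent(prior, cpm, [0.693, 0.307])
```

With these numbers, certain evidence for the child raises the parent probability from 0.2 to 9/13 (about 0.69), while the softer belief `[0.693, 0.307]` moves it less far. In a deeper network the same step would be repeated from each parent towards the root nodes.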
-
Slide -15-
Semantic Analysis
Emotion modelling
[Figure: Bayesian network for emotion modelling. Levels from input to interpretation: Words, Superwords, Phrases, Superphrases. Processing steps shown: Spotting, Clustering, Sequence Handling. Example input: "I can't stand this nasty every tray traffic-jam"; spotted words: "can't", "stand", "nasty"; clustered superwords: "cannot", "stand", "bad", "disgusting". Example nodes: I_hate, I_like, first_person, Bad, Good, Abhorrence, Positive, Negative, Joy, Anger, Disgust. Interpretation of the example: Disgust.]
Output: Vector of real recognition confidences
-
Slide -17-
Stream Fusion
Pairwise mean:
$$P_{fusion}(E_n) = \tfrac{1}{2}\big(P_{acoustic}(E_n) + P_{semantic}(E_n)\big)$$
Discriminative fusion applying MLP
Input layer: 2 x 7 confidences
Hidden layer: 100 nodes
Output layer: 7 recognition confidences
Decision:
$$\hat{E} = \arg\max_n P_{fusion}(E_n)$$
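The two fusion variants on this slide can be sketched as follows. The layer sizes (14 inputs, 100 hidden nodes, 7 outputs) come from the slide; the tanh and softmax activations, the function names and the untrained random weights in the usage lines are assumptions for illustration only.

```python
import numpy as np

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]

def fuse_by_mean(p_acoustic, p_semantic):
    """Pairwise mean of the two 7-dimensional confidence vectors."""
    return 0.5 * (np.asarray(p_acoustic, dtype=float) + np.asarray(p_semantic, dtype=float))

def fuse_by_mlp(p_acoustic, p_semantic, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 14-100-7 MLP (activations are assumed, not from the talk)."""
    x = np.concatenate([p_acoustic, p_semantic])    # input layer: 2 x 7 confidences
    hidden = np.tanh(w_hidden @ x + b_hidden)       # hidden layer: 100 nodes
    logits = w_out @ hidden + b_out                 # output layer: 7 values
    exp = np.exp(logits - logits.max())             # softmax, shifted for stability
    return exp / exp.sum()

def decide(p_fusion):
    """Pick the emotion with the highest fused confidence."""
    return EMOTIONS[int(np.argmax(p_fusion))]

# Illustrative confidences: the acoustic stream favors anger, the semantic one disgust.
p_ac = np.array([0.40, 0.30, 0.05, 0.05, 0.10, 0.05, 0.05])
p_sem = np.array([0.10, 0.70, 0.04, 0.04, 0.04, 0.04, 0.04])
mean_decision = decide(fuse_by_mean(p_ac, p_sem))

# Untrained random weights, only to show the shapes involved.
rng = np.random.default_rng(0)
p_mlp = fuse_by_mlp(p_ac, p_sem,
                    rng.normal(size=(100, 14)), np.zeros(100),
                    rng.normal(size=(7, 100)), np.zeros(7))
```

The mean treats both streams as equally reliable for every emotion, whereas a trained MLP can learn which stream to trust for which emotion.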
-
Slide -18-
Results
Acoustic recognition rates (SVM):
Emotion  ang   dis   fea   joy   ntl   sad   sur   Mean
%        95.5  61.3  78.7  75.1  78.5  62.1  68.3  74.2
Semantic recognition rates:
Emotion  ang   dis   fea   joy   ntl   sad   sur   Mean
%        78.4  71.2  53.4  57.7  56.0  35.0  65.5  59.6
-
Slide -19-
Results
Recognition rates after discriminative fusion:
Emotion  ang   dis   fea   joy   ntl   sad   sur   Mean
%        98.0  78.7  88.3  95.9  98.2  91.7  95.8  92.0
Overview:
Stream  Acoustic information  Language information  Fusion by means  Fusion by MLP
%       74.2                  59.6                  83.1             92.0
-
Slide -20-
Summary
Acted emotions
7 discrete emotion categories
Prosodic feature selection via
Single-feature performance
Sequential forward floating search
Evaluative comparison of different classifiers
SVMs outperform the other classifiers
Semantic analysis applying Bayesian Networks
Significant gain by discriminative stream fusion
-
Slide -21-