1 wp3 speech and emotion (analysis & recognition) human language technologies
Post on 31-Mar-2015
1
WP3 speech and emotion (analysis & recognition)
human language technologies
2
Databases and Annotations
3
UERLN: SYMPAFLY
Fully automatic speech dialogue telephone system for flight reservation and booking, different system stages; 270 Dialogues.
• Annotations: word-based emotional user states, prosodic and conversational peculiarities; dialogue (step) success; emotional user states distribution follows nested Pareto (80/20) principle
4
UERLN: AIBO
Children's interaction (age 10-12, 51 children, 9.2 hours of speech) with SONY’s AIBO robot, Wizard-of-Oz-scenario; cf. WP5 (plus English and read speech)
• Annotations: word-based emotional user states (holistic, 5 labellers) and prosodic peculiarities; alignment of children's utterances with AIBO's actions; manual correction of F0, labelling of voice quality. Emotional user states for the English data.
5
AIBO disobedient: from motherese to angry
g'radeaus Aibolein ja M fein M gut M machst M du M *da M | *tz l"aufst du mal bitte nach links | stopp E Aibo stopp | nach links E umdrehen | nein M <*ne> nein M <*ne> nein M <*ne> so M weit M *simma M noch M nicht M aufstehen M Schlafm"utze M komm M hoch M | ja M so M ist M es M <*is> guter M Hund M lauf mal jetzt nach links | nach links Aibo | Aibolein M aufstehen M *son M sonst M werd' M ich M b"ose M hoch E | nach A links A | Aibo A nach A links A | Aibolein A ganz A b"oser A Hund A jetzt A stehst A du A auf A | hoch A | dreh dich ein bisschen | ja M so ist es <*is> gut stopp Aibo stopp | *tz lauf g'radeaus |
6
UERLN: Different Conceptualizations
Remote control tool: Aibo straight on stop Aibo stop turn round to the left Aibo get up turn round to the left Aibo get up turn round, to the left Aibo get up get up Aibo now go left now straight on Aibo st´ straight on
Pet dog: Straight on little Aibo ok great You're doing fine now please to the left stop Aibo stop turn to the left no no no we aren't that far yet get up sleepyhead get up yes that's a good dog now go left left Aibo little Aibo get up else I'm getting angry get up Aibo left little Aibo bad boy now get up turn a little ok that's fine stop Aibo stop straight on
7
ITC: Targhe
Fully automatic speech dialogue telephone system
• 15.6 hours of Italian natural speech
• 9444 files (turns) -> 450 emotionally rich
Word-level
• Orthographic transcription and word segmentation
• Prosodic peculiarities annotated
Turn-level
• Holistic emotion labels
Sympafly (cf. UERLN) for comparison and benchmarking
8
UKA: LDC2002S28
Elicited emotional speech database; native American English
• labels: 1 of 15 holistic speaker states per utterance; used in algorithm and feature set development
9
UKA: ISL Meeting Corpus
18 recordings of multi-party (mean 5.1 participants) meetings; mean 35 minute duration; American English
• Annotations: orthographic transcription; Verbmobil II, and discourse-level annotations.
10
Assessment of Data Collection:
• focus on
  • spontaneous, realistic data
  • important/new types of dialogues/interaction
  • evaluation of annotations
• considerable percentage of realistic (processed and available) databases world-wide
11
Features & Classification
12
UERLN: Features
• large feature vector for a context of 2 words:
  • 95 prosodic (duration, energy, F0, pauses)
  • 80 spectral (HNR, formant based frequencies and energy)
  • 24 MFCC
  • 30 POS
• Language Models & dialogue based features
13
ITC: Features
Baseline feature set
• 96 features
• Based on energy, duration, and pitch
Final feature set
• 273 features (many redundant)
• Based on energy, duration, pitch, and pauses
• Different pitch extractors tried
  • Normalized Cross Correlation
  • Weighted Auto Correlation
  • UERLN PDA
• Different subsets compared
• Different tests to reduce the feature space
  • Principal component analysis
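The slide names principal component analysis as one way to shrink a redundant feature space. The following is only a minimal sketch of that idea, not the ITC implementation: it finds the first principal component by power iteration on the sample covariance matrix. The function name, the toy data, and the iteration count are assumptions made for illustration.

```python
import math
import random


def first_principal_component(rows, iterations=200):
    """Return (direction, explained variance ratio) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication by cov converges to the
    # eigenvector with the largest eigenvalue (assuming the start vector
    # is not orthogonal to it).
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance


# Hypothetical toy data: the second feature is almost a rescaled copy of the
# first (a redundant feature), the third is small independent noise.
random.seed(0)
rows = []
for _ in range(100):
    energy = random.gauss(0.0, 1.0)
    rows.append([energy,
                 2.0 * energy + random.gauss(0.0, 0.05),
                 random.gauss(0.0, 0.1)])

direction, ratio = first_principal_component(rows)
```

When features are as redundant as in this toy set, a single component carries almost all of the variance, which is the situation in which dropping components loses little information.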
14
UKA: 133 Acoustic Features
• pitch, voiced/unvoiced energy, quartiles (15)
• voice quality, Praat metrics (11)
• harmonicity, quartiles (5) and Praat metrics (3)
• zero-crossing rate vs energy, histogram (20)
• correlation/regression, coefficients (36)
• vocal tract volume, quartiles (25)
• duration/timing, verbmobil features (18)
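Several of the feature groups above are described as "quartiles" of a frame-level contour. The slide does not say exactly which statistics are meant. One reading that fits the counts (for example 3 contours × 5 values = 15) is a five-number summary per contour, and that reading is an assumption here. The sketch below computes such a summary for an F0 contour; the function names and the convention that unvoiced frames are stored as 0.0 are also assumptions.

```python
import statistics


def five_number_summary(contour):
    """Minimum, lower quartile, median, upper quartile, maximum of a contour."""
    values = sorted(contour)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [values[0], q1, median, q3, values[-1]]


def pitch_features(f0_hz):
    """Summarise an F0 contour, ignoring unvoiced frames (assumed stored as 0.0)."""
    voiced = [f for f in f0_hz if f > 0.0]
    if len(voiced) < 2:
        # Too few voiced frames to compute quartiles.
        return None
    return five_number_summary(voiced)


# Hypothetical utterance: three unvoiced frames around a rising pitch contour.
f0 = [0.0, 0.0, 180.0, 190.0, 200.0, 210.0, 220.0, 0.0]
features = pitch_features(f0)
```

The same summary function could be applied to any other frame-level contour, such as energy or harmonicity, to obtain a fixed-length utterance-level vector.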
15
Classifiers
UERLN: Linear Discriminant Analysis LDA, Decision Trees (CARTs), Neural Networks NN, Support Vector Machines SVM, Gaussian Mixtures GM, Language Models LM
ITC: Decision Trees (CARTs), Neural Networks NN
UKA: Linear, Neural Networks NN, Support Vector Machines SVM
16
UERLN classification I: SympaFly
GM/NN, 2 classes, neutral vs. problem, l≠t
dialogue step success, 2 classes, SVM: CL 82.5
dialogue success, 2 classes, CART: CL 85.4
combination   CL    RR
Pros.+MFCC    74.4  74.2
HNR+Pros.     74.8  76.0
HNR+MFCC      70.4  69.8
RR: overall rec. rate
CL: class-wise averaged rec. rate
LDA, 4 classes
SVM/CART, 2 classes, loo
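The two scores reported on this slide, RR (overall recognition rate) and CL (class-wise averaged recognition rate), can differ a lot when classes are unbalanced. The sketch below shows how both can be computed from reference and predicted labels. The function name and the toy labels are illustrative assumptions, not data from the project.

```python
from collections import Counter


def recognition_rates(reference, predicted):
    """Return (RR, CL) in percent.

    RR: share of all items classified correctly.
    CL: mean of the per-class recall values, so every class counts equally.
    """
    if not reference or len(reference) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    total = Counter(reference)
    correct = Counter(r for r, p in zip(reference, predicted) if r == p)
    rr = 100.0 * sum(correct.values()) / len(reference)
    cl = 100.0 * sum(correct[c] / total[c] for c in total) / len(total)
    return rr, cl


# Hypothetical skewed test set: 8 neutral turns, 2 problem turns,
# and a classifier that always answers "neutral".
reference = ["neutral"] * 8 + ["problem"] * 2
predicted = ["neutral"] * 10
rr, cl = recognition_rates(reference, predicted)
```

In this toy case RR is 80 while CL is only 50, because the "problem" class is never recognised. This is why CL is the more informative figure when non-neutral states are rare.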
17
UERLN classification II: AIBO
features           CL
pros/POS           59.7
pros/POS, opt.     63.2
MFCC, frames       45.4
MFCC, words        58.3
pros/POS + MFCC    65.3
4 classes "AMEN", NN
joyful; surprised; motherese; neutral (default); rest (non-neutral); bored; helpless, hesitant; emphatic; touchy (= irritated); angry; reprimanding
18
ITC Classification II:
Final feature set
• 273 (acoustic/temporal) features
• 2 class problem (neutral and non neutral)
Classifier   CART               Neural Networks
Database     Targhe   Sympafly  Targhe   Sympafly
RR           73.2%    73.9%     74.2%    73.5%
CL           70.7%    72.1%     69.4%    74.1%
RR = overall rec. rate; CL = class-wise averaged rec. rate
N = neutral turns; NN = Non neutral turns
19
UKA Classification II:
133 utterance-level prosodic features, 15 classes, acted speech, 8 speakers:
Task        Classifier   Feat Selection   CL
spk-indep   linear       none             19.0%
spk-indep   linear       spk-indep        21.3%
spk-indep   linear       spk-dep          31.3%
spk-dep     linear       none             38.7%
spk-dep     SVM          none             53.0%
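The table contrasts speaker-independent and speaker-dependent tasks. The slide does not describe the exact evaluation protocol, so the following is only a sketch of one common way to set up the two conditions: leave-one-speaker-out folds for the speaker-independent case, and a split inside each speaker's data for the speaker-dependent case. All function names and the toy speaker list are assumptions.

```python
def leave_one_speaker_out(speaker_ids):
    """Speaker-independent folds: the test speaker never appears in training."""
    for held_out in sorted(set(speaker_ids)):
        train = [i for i, s in enumerate(speaker_ids) if s != held_out]
        test = [i for i, s in enumerate(speaker_ids) if s == held_out]
        yield held_out, train, test


def within_speaker_split(speaker_ids, test_every=2):
    """Speaker-dependent split: each speaker contributes to train and test."""
    seen = {}
    train, test = [], []
    for i, s in enumerate(speaker_ids):
        count = seen.get(s, 0)
        # Every `test_every`-th utterance of a speaker goes to the test set.
        if count % test_every == test_every - 1:
            test.append(i)
        else:
            train.append(i)
        seen[s] = count + 1
    return train, test


# Hypothetical corpus: one speaker label per utterance.
speakers = ["s1", "s1", "s2", "s2", "s3", "s3"]
folds = list(leave_one_speaker_out(speakers))
dep_train, dep_test = within_speaker_split(speakers)
```

In the first setup a classifier has to generalise to an unseen voice, while in the second it has already seen the test speaker during training.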
20
Assessment of Features
• a pool of many different features/feature groups implemented/compared
• prosodic features better (more consistent) than "spectral" features in realistic speech
• combination of knowledge sources improves performance
• relevance of single features (feature classes)?
21
Assessment of Classifications
• not much difference between different classifiers in classification performance (linear classifiers highly competitive in speaker-independent classification)
• large differences between speaker-dependent and speaker-independent classification
22
Categories & Dimensions
cf. also tomorrow
23
UKA: Meeting Annotation
Meeting audio appears to be rich in non-neutral speech.
[Figure: bar chart, open-set holistic labeling of 5 meetings by 3 labellers. Horizontal axis labels: project, work, game, discuss, chat. Vertical axis: 0 to 70. Series: Labeler 1, Labeler 2, Labeler 3.]
24
UKA: towards new Dimensions for Social Interaction in Meetings denoting conflict, building community, skepticism, etc.
[Figure: two-dimensional map of interaction labels. Horizontal axis, IMAGE PROMOTION: self at expense of group, self more than group, no bias, group more than self, group at expense of self. Vertical axis, power: weak to strong. Further axis labels: self, support, group. Labels placed in the map: resolve/strength, grateful, doubt/weakness, insecure, ego-building, conflict-diffusing, giving up, skeptical, demanding, encouraging/comforting, advocating, directing/leading, ignoring/interrupting, collegial-conflict, hostile-conflict, acceding, community-building.]
25
Assessment of Categories & Dimensions
New categories, new dimensions, new consistency measure
• prototypical "full-blown" emotions are rare
• labels depend on type of data (call center, human-robot, different types of multi-party meeting)
• new dimensions that do not model emotions but interaction between participants in communication
• new entropy based consistency measure
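The slides mention an entropy based consistency measure but do not define it. The sketch below is therefore not the project's measure. It only illustrates the general idea under stated assumptions: for each labelled item, take the Shannon entropy of the labels given by the labellers, normalise it by the maximum possible for that number of labellers, and report one minus the mean. All names and the toy annotations are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy in bits of the labels assigned to one item."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def consistency(items):
    """Mean of (1 - normalised entropy): 1.0 is full agreement, 0.0 is none.

    Assumption: entropy is normalised by log2(number of labellers), the value
    reached when every labeller picks a different label.
    """
    scores = []
    for labels in items:
        max_entropy = math.log2(len(labels)) if len(labels) > 1 else 1.0
        scores.append(1.0 - label_entropy(labels) / max_entropy)
    return sum(scores) / len(scores)


# Hypothetical annotations: three labellers per item.
annotations = [
    ["neutral", "neutral", "neutral"],     # full agreement
    ["angry", "angry", "emphatic"],        # two of three agree
    ["motherese", "neutral", "emphatic"],  # no agreement
]
score = consistency(annotations)
```

Unlike a plain majority count, an entropy based score grades partial agreement, so two out of three matching labels counts for more than none but less than all three.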
26
Thank you for your attention