WP3 speech and emotion (analysis & recognition)

human language technologies

Databases and Annotations

UERLN: SYMPAFLY

Fully automatic speech dialogue telephone system for flight reservation and booking, different system stages; 270 Dialogues.

• Annotations: word-based emotional user states, prosodic and conversational peculiarities; dialogue (step) success; emotional user states distribution follows nested Pareto (80/20) principle

UERLN: AIBO

Children's interaction (age 10-12, 51 children, 9.2 hours of speech) with SONY’s AIBO robot, Wizard-of-Oz-scenario; cf. WP5 (plus English and read speech)

• Annotations: word-based emotional user states (holistic, 5 labellers) and prosodic peculiarities; alignment of children's utterances with AIBO's actions; manual correction of F0, labelling of voice quality. Emotional user states for the English data.

AIBO disobedient: from motherese to angry

UERLN: Different Conceptualizations

Remote control tool:

Aibo straight on stop Aibo stop turn round to the left Aibo get up turn round to the left Aibo get up turn round, to the left Aibo get up get up Aibo now go left now straight on Aibo st´ straight on

Pet dog:

Straight on little Aibo ok great You're doing fine now please to the left stop Aibo stop turn to the left no no no we aren't that far yet get up sleepyhead get up yes that's a good dog now go left left Aibo little Aibo get up else I'm getting angry get up Aibo left little Aibo bad boy now get up turn a little ok that's fine stop Aibo stop straight on

ITC: Targhe

Fully automatic speech dialogue telephone system
• 15.6 hours of Italian natural speech
• 9444 files (turns) -> 450 emotionally rich

Word-level
• Orthographic transcription and word segmentation
• Prosodic peculiarities annotated

Turn-level
• Holistic emotion labels

Sympafly (cf. UERLN) for comparison and benchmarking

UKA: LDC2002S28

Elicited emotional speech database; native American English

• labels: 1 of 15 holistic speaker states per utterance; used in algorithm and feature set development

UKA: ISL Meeting Corpus

18 recordings of multi-party (mean 5.1 participants) meetings; mean 35 minute duration; American English

• Annotations: orthographic transcription; Verbmobil II, and discourse-level annotations.

Assessment of Data Collection:

• focus on
  • spontaneous, realistic data
  • important/new types of dialogues/interaction
  • evaluation of annotations

• considerable percentage of realistic (processed and available) databases world-wide

Features & Classification

UERLN: Features

• large feature vector for a context of 2 words:
  • 95 prosodic (duration, energy, F0, pauses)
  • 80 spectral (HNR, formant based frequencies and energy)
  • 24 MFCC
  • 30 POS

• Language Models & dialogue based features

ITC: Features

Baseline feature set
• 96 features
• Based on energy, duration, and pitch

Final feature set
• 273 features (many redundant)
• Based on energy, duration, pitch, and pauses
• Different pitch extractors tried: Normalized Cross Correlation, Weighted Auto Correlation, UERLN PDA
• Different subsets compared
• Different tests to reduce the feature space: Principal component analysis
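Principal component analysis is named on the ITC slide as one way to shrink the redundant 273-feature space. The slides give no implementation details, so the following is only a minimal sketch of the general idea: it centres a small feature matrix, builds the covariance matrix, and finds the first principal component by power iteration. The function names and the toy data are illustrative assumptions, not part of the original work.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (dims x dims) of already centred rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def first_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration.

    Assumes the uniform start vector is not orthogonal to that eigenvector.
    """
    dims = len(cov)
    v = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient: variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dims))
                     for i in range(dims))
    return v, eigenvalue


# Toy data (hypothetical): feature 2 is exactly twice feature 1, feature 3 is constant.
features = [[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]]
cov = covariance(center(features))
component, variance = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance / total_variance
```

In this toy case one component carries all the variance, which is the situation "many redundant" features would approximate: projecting onto a few leading components loses little information.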

UKA: 133 Acoustic Features

• pitch, voiced/unvoiced energy, quartiles (15)
• voice quality, Praat metrics (11)
• harmonicity, quartiles (5) and Praat metrics (3)
• zero-crossing rate vs energy, histogram (20)
• correlation/regression, coefficients (36)
• vocal tract volume, quartiles (25)
• duration/timing, Verbmobil features (18)
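Several of the UKA feature groups are described as "quartiles" of a frame-level contour. The slides do not define the exact statistics, so the sketch below assumes a five-number summary (minimum, lower quartile, median, upper quartile, maximum) computed over the voiced frames of one utterance, with unvoiced frames marked by 0. The function name `quartile_features` and the sample contour are hypothetical.

```python
from statistics import quantiles


def quartile_features(contour):
    """Five-number summary of a frame-level contour for one utterance.

    Frames with value 0 are treated as unvoiced and ignored (an assumption).
    """
    voiced = [v for v in contour if v > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    q1, median, q3 = quantiles(voiced, n=4, method="inclusive")
    return {
        "min": min(voiced),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(voiced),
    }


# Hypothetical F0 contour in Hz; the leading and trailing zeros are unvoiced frames.
f0 = [0, 100, 110, 120, 130, 140, 0]
stats = quartile_features(f0)
```

Summaries like this turn a variable-length contour into a fixed number of utterance-level values, which is what a fixed-size feature vector requires.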

Classifiers

UERLN: Linear Discriminant Analysis LDA, Decision Trees (CARTs), Neural Networks NN, Support Vector machines SVM, Gaussian Mixtures GM, Language Models LM

ITC: Decision Trees (CARTs), Neural Networks NN

UKA: Linear, Neural Networks NN, Support Vector machines SVM

UERLN classification I: SympaFly

GM/NN, 2 classes, neutral vs. problem, l≠t

dialogue step success, 2 classes, SVM: CL 82.5

dialogue success, 2 classes, CART: CL 85.4

combination    CL     RR
Pros.+MFCC     74.4   74.2
HNR+Pros.      74.8   76.0
HNR+MFCC       70.4   69.8

RR: overall rec. rate
CL: class-wise averaged rec. rate
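The two measures RR and CL diverge when classes are unbalanced, as they are for emotional user states. A minimal sketch of how both can be computed from reference and predicted labels follows; the function name `recognition_rates` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def recognition_rates(reference, predicted):
    """Return (RR, CL) in percent.

    RR: share of all items classified correctly.
    CL: mean of the per-class recall values (unweighted average over classes).
    """
    if not reference or len(reference) != len(predicted):
        raise ValueError("label lists must be non-empty and of equal length")
    totals = Counter(reference)
    hits = Counter(r for r, p in zip(reference, predicted) if r == p)
    rr = 100.0 * sum(hits.values()) / len(reference)
    cl = 100.0 * sum(hits[c] / totals[c] for c in totals) / len(totals)
    return rr, cl


# Hypothetical example: 8 neutral and 2 problem items, one problem item missed.
reference = ["neutral"] * 8 + ["problem"] * 2
predicted = ["neutral"] * 8 + ["neutral", "problem"]
rr, cl = recognition_rates(reference, predicted)
```

Here RR is 90.0 but CL is only 75.0, because the rare class is recognised half of the time. Reporting CL alongside RR keeps a classifier from looking good merely by favouring the frequent class.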

LDA, 4 classes

SVM/CART, 2 classes, loo

UERLN classification II: AIBO

features              CL
pros./POS             59.7
pros./POS, opt.       63.2
MFCC, frames          45.4
MFCC, words           58.3
pros./POS + MFCC      65.3

4 classes "AMEN", NN

Labels: joyful, surprised, motherese, neutral (default), rest (non-neutral), bored, helpless, hesitant, emphatic, touchy (= irritated), angry, reprimanding

ITC Classification II:

Final feature set
• 273 (acoustic/temporal) features
• 2 class problem (neutral and non neutral)

Classifier   CART                 Neural Networks
Database     Targhe    Sympafly   Targhe    Sympafly
RR           73.2%     73.9%      74.2%     73.5%
CL           70.7%     72.1%      69.4%     74.1%

RR = overall rec. rate; CL = class-wise averaged rec. rate
N = neutral turns; NN = Non neutral turns

UKA Classification II:

133 utterance-level prosodic features, 15 classes, acted speech, 8 speakers:

Task        Classifier   Feat Selection   CL
spk-indep   linear       none             19.0%
spk-indep   linear       spk-indep        21.3%
spk-indep   linear       spk-dep          31.3%
spk-dep     linear       none             38.7%
spk-dep     SVM          none             53.0%
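The gap between the speaker-independent and speaker-dependent rows comes from how the data is split: in a speaker-independent test, no utterance of the test speaker may appear in training. The slides do not state the exact protocol, so the sketch below assumes a leave-one-speaker-out scheme; the function name and the speaker ids are hypothetical.

```python
def leave_one_speaker_out(speakers):
    """Yield (held_out, train_indices, test_indices), one fold per speaker.

    `speakers` holds one speaker id per utterance.
    """
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


# Hypothetical speaker ids for five utterances.
speaker_ids = ["a", "a", "b", "c", "b"]
folds = list(leave_one_speaker_out(speaker_ids))
```

A speaker-dependent evaluation would instead split each speaker's own utterances into training and test parts, so the classifier has already seen the voice it is tested on.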

Assessment of Features

• a pool of many different features/feature groups implemented/compared
• prosodic features better (more consistent) than "spectral" features in realistic speech
• combination of knowledge sources improves performance
• relevance of single features (feature classes)?

Assessment of Classifications

• not much difference between different classifiers in classification performance (linear classifiers highly competitive in speaker-independent classification)
• large differences between speaker-dependent and speaker-independent classification

Categories & Dimensions

cf. also tomorrow

UKA: Meeting Annotation

Meeting audio appears to be rich in non-neutral speech.

[Figure: open-set holistic labeling of 5 meetings by 3 labellers (Labeler 1, Labeler 2, Labeler 3); panel labels: project, work, game, discuss, chat]

UKA: towards new Dimensions for Social Interaction in Meetings denoting conflict, building community, or skepticism etc.

[Figure: dimensions for social interaction in meetings. Axis "IMAGE PROMOTION", running from "self at expense of group" through "self more than group", "no bias" and "group more than self" to "group at expense of self". Further axis labels: "resolve/strength", "doubt/weakness", "self support group". Plotted labels: grateful, insecure, ego-building, conflict-diffusing, giving up, skeptical, demanding, encouraging/comforting, advocating, directing/leading, ignoring/interrupting, collegial-conflict, hostile-conflict, acceding, community-building]

Assessment of Categories & Dimensions

New categories, new dimensions, new consistency measure

• prototypical "full-blown" emotions are rare
• labels depend on type of data (call center, human-robot, different types of multi-party meeting)
• new dimensions that do not model emotions but interaction between participants in communication
• new entropy based consistency measure
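The slides name an entropy based consistency measure without defining it. Purely as an illustration of the general idea, and not as the measure used in this work, the sketch below computes the Shannon entropy of the labels that several labellers assign to one item: 0 bits when all agree, higher values the more the labels scatter. The function name and the example labels are assumptions.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the label distribution for one item."""
    if not labels:
        raise ValueError("need at least one label")
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical decisions of 5 labellers for two words.
agreed = label_entropy(["angry"] * 5)
mixed = label_entropy(["neutral", "neutral", "neutral", "emphatic", "angry"])
```

Unlike a simple majority vote, an entropy value also reflects how the minority votes are distributed, which may matter when labels are as close as "emphatic" and "touchy".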

Thank you for your attention
