Prosodic and Phonetic Features for Speaking Styles Classification and Detection

Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão


IberSPEECH 2012 - VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop

November 21-23 2012, Universidad Autónoma de Madrid, Madrid, SPAIN


Summary


Objective

Characterization of the corpus

Features

Methods
• Automatic segmentation
• Classification

Results
• Automatic detection
• Segmentation
• Classification
Tasks: speech versus non-speech, read versus spontaneous

Conclusions and future works


Objective


Automatic detection of speaking styles for segmentation purposes of multimedia data

Style of a speech segment? Slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …

• Segment broadcast news documents into the two most evident classes: read versus spontaneous speech (prepared and unprepared speech)
• Use a combination of phonetic and prosodic features
• Also explore speech/non-speech segmentation




Broadcast News audio corpus

• TV Broadcast News MP4 podcasts
• Daily download
• Extract audio stream and downsample from 44.1 kHz to 16 kHz
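The slides do not say which tool performed the extraction and downsampling. As a minimal sketch, assuming the `ffmpeg` command-line tool is installed, the step could look like the following; the file names and the mono mix-down are illustrative assumptions, not details from the presentation.

```python
import subprocess

def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Build an ffmpeg command that drops the video and resamples the audio."""
    return [
        "ffmpeg",
        "-i", src,                 # input MP4 podcast (hypothetical path)
        "-vn",                     # discard the video stream
        "-ac", "1",                # assumption: mix down to a single channel
        "-ar", str(sample_rate),   # output sample rate, 16 kHz as on the slide
        dst,
    ]

def extract_audio(src, dst):
    """Run the conversion; requires ffmpeg to be available on the PATH."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Calling `extract_audio("news.mp4", "news.wav")` would write a 16 kHz audio file next to the downloaded podcast.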

30 daily news programs (~27 hours) were manually segmented and annotated in 4 levels:

Level 1 – dominant signal: speech, noise, music, silence, clapping, …

For speech:
Level 2 – acoustical environment: clean, music, road, crowd, …
Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)
Level 4 – speaker info: BN anchor, gender, public figures, …




From Level 1 – speech versus non-speech
From Level 3 – read speech (prepared) versus spontaneous speech

Type of segment       Number of segments   Average duration ± std deviation (s)
Speech                7971                 11.0 ± 9.4
Non-speech            2529                 4.1 ± 5.3
Read speech           4989                 10.6 ± 8.5
Spontaneous speech    1738                 12.0 ± 10.4

For each segment, a vector of 322 features (214 phonetic features and 108 prosodic features) is computed


Features


Phonetic (size of parameter vector for each segment: 214)
• Based on the results of a free phone loop speech recognition
• Phone duration and recognized log-likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
• Silence and speech rate
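To make the phonetic measures concrete, here is a small sketch of the five statistical functions and of one possible pair of rate definitions. The `(label, duration)` input format, the `"sil"` label and the exact rate definitions are assumptions for illustration; the slides do not specify them, nor whether the standard deviation is the population or sample form.

```python
import statistics

def five_stats(values):
    """Mean, median, maximum, minimum and standard deviation of a list."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "std": statistics.pstdev(values),  # assumption: population form
    }

def phone_rates(phones, silence_label="sil"):
    """Return (silence_rate, speech_rate) for a list of (label, duration_s).

    Assumed definitions: silence rate is the fraction of time labelled as
    silence; speech rate is the number of non-silence phones per second.
    """
    total = sum(duration for _, duration in phones)
    silence = sum(d for label, d in phones if label == silence_label)
    n_speech = sum(1 for label, _ in phones if label != silence_label)
    return silence / total, n_speech / total
```

The same `five_stats` helper can be applied to the per-phone log-likelihoods returned by the recogniser.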

Prosodic (size of parameter vector for each segment: 108)
• Based on the pitch (F0) and harmonics-to-noise ratio (HNR) envelope
• First and second order statistics
• Polynomial fit of first and second order
• Reset rate (rate of voiced portions)
• Voiced and unvoiced duration rates
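A few of the prosodic measures listed above can be sketched from a frame-level F0 track. This is an illustration only: it assumes a 10 ms frame step, that unvoiced frames are marked with F0 = 0, and that the reset rate is the number of voiced portions per second. It covers only the first-order polynomial fit and needs at least two voiced frames.

```python
import statistics

def linear_fit(times, values):
    """Least-squares line through (time, value) points: (slope, intercept)."""
    t_mean = statistics.fmean(times)
    v_mean = statistics.fmean(values)
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean

def prosodic_sketch(f0, step=0.01):
    """Toy prosodic descriptors from an F0 track (0 marks an unvoiced frame)."""
    voiced = [(i * step, f) for i, f in enumerate(f0) if f > 0]
    times = [t for t, _ in voiced]
    values = [f for _, f in voiced]
    # Each unvoiced-to-voiced transition starts a new voiced portion.
    onsets = sum(1 for i, f in enumerate(f0)
                 if f > 0 and (i == 0 or f0[i - 1] <= 0))
    slope, _ = linear_fit(times, values)
    voiced_rate = len(values) / len(f0)
    return {
        "f0_mean": statistics.fmean(values),   # first-order statistic
        "f0_std": statistics.pstdev(values),   # second-order statistic
        "f0_slope": slope,                     # first-order polynomial fit
        "reset_rate": onsets / (len(f0) * step),
        "voiced_rate": voiced_rate,
        "unvoiced_rate": 1 - voiced_rate,
    }
```

The same functions could be applied to the HNR envelope in place of F0.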


Methods


Automatic detection

Implies automatic segmentation and automatic classification

Automatic segmentation based on modified BIC (Bayesian Information Criterion) - DISTBIC

Binary classification: SVM classifiers


Methods


Automatic segmentation

DISTBIC: uses a distance (Kullback-Leibler) in the first step and delta BIC (ΔBIC) to validate marks

[Figure: consecutive segments s(i-1), s(i), s(i+1), s(i+2); each candidate mark between them is validated by the sign of ΔBIC (ΔBIC < 0 or ΔBIC > 0)]

Parameters:
• Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and logarithm of energy (windows 25 ms, step 10 ms)
• A threshold of 0.6 of the distance standard deviation was used to select significant local maxima; window size: 2000 ms, step 100 ms
• Silence segments with duration above 0.5 seconds are detected and removed for the DISTBIC process
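One reading of the 0.6 threshold, assumed here rather than stated on the slide, is that a local maximum of the distance curve counts as significant when it rises above both neighbouring minima by more than 0.6 times the standard deviation of the curve. A minimal sketch under that assumption, with a hypothetical list `dist` holding one distance value per 100 ms step:

```python
import statistics

def significant_maxima(dist, alpha=0.6):
    """Indices of local maxima that stand out from both neighbouring minima.

    Assumed rule: the drop from the peak to the nearest minimum on each side
    must exceed alpha times the standard deviation of the whole curve.
    """
    threshold = alpha * statistics.pstdev(dist)
    peaks = []
    for i in range(1, len(dist) - 1):
        if dist[i] > dist[i - 1] and dist[i] > dist[i + 1]:
            left = i
            while left > 0 and dist[left - 1] < dist[left]:
                left -= 1            # walk down to the minimum on the left
            right = i
            while right < len(dist) - 1 and dist[right + 1] < dist[right]:
                right += 1           # walk down to the minimum on the right
            if (dist[i] - dist[left] > threshold
                    and dist[i] - dist[right] > threshold):
                peaks.append(i)
    return peaks
```

The surviving indices are the candidate marks that the second step would then validate with ΔBIC.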


Methods


Classification

SVM classifiers (WEKA tool – SMO, linear kernel, C=14):
• speech / non-speech
• read / spontaneous

2-step classification approach

[Diagram: speech / non-speech classification outputs non-speech or speech; speech segments go on to read / spontaneous classification, which outputs read or spontaneous]
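The two-step approach amounts to a small cascade. The sketch below shows only that control flow; `is_speech` and `is_read` are hypothetical callables standing in for the two trained SVM classifiers, which are not reproduced here.

```python
def classify_segment(features, is_speech, is_read):
    """Two-step cascade: speech / non-speech first, then read / spontaneous."""
    if not is_speech(features):
        return "non-speech"          # the second classifier is never consulted
    return "read" if is_read(features) else "spontaneous"
```

With toy stand-ins such as `lambda f: f["energy"] > 0.1` for the first step, the function returns one of the three labels for each segment.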


Results


Performance measure

Segmentation only:
• Collar (detection tolerance) range: 0.5 s to 2.0 s
• A detected mark is counted as correct if there is a reference mark less than one “collar” away

Automatic detection:
• “AT” – agreement time = % of frames correctly classified

Classification only:
• Accuracy (“Acc.”)
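The collar rule can be turned into precision, recall and F1-score as follows. This is a sketch: the greedy one-to-one matching (each reference mark may validate at most one detected mark) is an assumption, since the slide only says that a reference mark must lie within the collar.

```python
def count_correct(reference, detected, collar):
    """Count detected marks lying less than `collar` seconds from a free reference mark."""
    free = sorted(reference)
    correct = 0
    for mark in sorted(detected):
        hit = next((r for r in free if abs(r - mark) < collar), None)
        if hit is not None:
            free.remove(hit)         # assumption: a reference mark is used once
            correct += 1
    return correct

def segmentation_scores(reference, detected, collar):
    """Return (precision, recall, f1) for one collar value."""
    correct = count_correct(reference, detected, collar)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Sweeping `collar` over 0.5, 1.0, 1.5 and 2.0 seconds gives one score per tolerance, which is how curves of score versus collar are usually drawn.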


Results


Segmentation performance:

[Figure: F1-score versus collar (seconds), collar range 0.5 s to 2.0 s; F1-score axis from 0.3 to 0.8]


Results


Segmentation performance:

[Figure: recall versus collar (seconds), collar range 0.5 s to 2.0 s; vertical axis labelled “Accuracy”, from 0.5 to 1.0]


Results


Automatic detection

Speech / non-speech detection

Type of features   AT      Speech   Non-speech
Phonetic           91.5%   94.9%    62.2%
Prosodic           93.2%   97.0%    61.0%
Combination        93.3%   96.6%    64.9%

Read / spontaneous detection

Type of features   AT      Read     Spontaneous
Phonetic           76.7%   91.9%    38.6%
Prosodic           81.1%   93.0%    51.2%
Combination        83.3%   92.7%    59.6%

“AT” – agreement time = % of frames correctly classified
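As an illustration of the agreement-time measure, the sketch below labels 10 ms frames from two segmentations and reports the percentage of frames whose labels match. The `(start_s, end_s, label)` segment format and the 10 ms step are assumptions; frames not covered by any segment are left as `None`.

```python
def label_frames(segments, n_frames, step=0.01):
    """Expand (start_s, end_s, label) segments into one label per frame."""
    labels = [None] * n_frames
    for start, end, label in segments:
        first = round(start / step)
        last = min(round(end / step), n_frames)
        for i in range(first, last):
            labels[i] = label
    return labels

def agreement_time(reference, hypothesis, duration, step=0.01):
    """Percentage of frames on which the two segmentations agree."""
    n_frames = round(duration / step)
    ref = label_frames(reference, n_frames, step)
    hyp = label_frames(hypothesis, n_frames, step)
    agree = sum(1 for r, h in zip(ref, hyp) if r == h)
    return 100.0 * agree / n_frames
```

For example, if the reference places a read-to-spontaneous boundary at 6 s and the detector places it at 5 s in a 10 s file, the two agree on 9 of the 10 seconds.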


Results


Classification only (using given manual segmentation)

Speech / non-speech classifier

Type of features   Acc.    Speech   Non-speech
Phonetic           93.8%   96.7%    82.0%
Prosodic           93.8%   97.5%    81.9%
Combination        94.4%   97.6%    84.0%

Read / spontaneous classifier

Type of features   Acc.    Read     Spontaneous
Phonetic           83.2%   92.8%    55.4%
Prosodic           86.4%   95.0%    61.6%
Combination        87.4%   93.7%    69.5%

“Acc.” – Accuracy


Conclusions and future work


Read speech can be differentiated from spontaneous speech with reasonable accuracy.

Good results were obtained with only a few simple measures of the speech signal.

A combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).

We have already implemented several important features, such as hesitation detection, aspiration detection using word spotting techniques, speaker identification using GMMs and jingle detection based on audio fingerprinting.

We intend to automatically segment all audio genres and speaking styles.


THANK YOU



Appendix – BIC


BIC (Bayesian Information Criterion): dissimilarity measure between 2 consecutive segments

Two hypotheses, where X is the whole segment and X1, X2 are its two consecutive parts:
• H0 – No change of signal characteristics. Model: 1 Gaussian: $X \sim N(x; \mu_X, \Sigma_X)$
• H1 – Change of characteristics. Model: 2 Gaussians: $X_1 \sim N(x; \mu_{X_1}, \Sigma_{X_1})$; $X_2 \sim N(x; \mu_{X_2}, \Sigma_{X_2})$

μ – mean vector; Σ – covariance matrix

Maximum likelihood ratio between H0 and H1:

$$R(i) = \frac{N_X}{2}\log|\Sigma_X| - \frac{N_{X_1}}{2}\log|\Sigma_{X_1}| - \frac{N_{X_2}}{2}\log|\Sigma_{X_2}|$$


Appendix – BIC


$$\Delta BIC(i) = R(i) - \lambda P$$

P – complexity penalization
λ – penalization factor (ideal 1.0)

Change if:

$$\Delta BIC^{*}(i) > 0, \quad \text{i.e. } R(i) > \lambda P$$

Parameters used in this work: p = 16; λ = 1.3; frame rate = 100; N = 200; M = 10
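The criterion can be illustrated on a one-dimensional feature, where each covariance determinant reduces to a variance. The sketch below makes several assumptions that are not on the slides: positive ΔBIC is taken to indicate a change, the penalty uses the standard BIC form P = ½(p + ½p(p+1)) log N, and p = 1 because the toy signal is scalar. Every sub-segment must have non-zero variance.

```python
import math
import statistics

def likelihood_ratio(x, i):
    """R(i) for a 1-D sequence split at index i (|Sigma| becomes the variance)."""
    def term(segment):
        return 0.5 * len(segment) * math.log(statistics.pvariance(segment))
    return term(x) - term(x[:i]) - term(x[i:])

def delta_bic(x, i, lam=1.3, p=1):
    """Delta BIC at split i; positive values suggest a change (assumed convention)."""
    # Assumption: standard BIC penalty for a full-covariance Gaussian of dimension p.
    penalty = 0.5 * (p + 0.5 * p * (p + 1)) * math.log(len(x))
    return likelihood_ratio(x, i) - lam * penalty
```

On a sequence whose mean jumps halfway through, `delta_bic` at the jump is clearly positive; on a homogeneous sequence the likelihood ratio is near zero and the penalty makes the value negative.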
