

9th Conference on Telecommunications – Conftele 2013

Castelo Branco, Portugal, May 8-10, 2013

Sara Candeias 1

Dirce Celorico 1

Jorge Proença 1

Arlindo Veiga 1,2

Fernando Perdigão 1,2

1 Instituto de Telecomunicações, Polo de Coimbra, Portugal

2 Universidade de Coimbra, DEEC, Portugal

Automatically distinguishing Styles of Speech


Summary

Objective

Characterization of the corpus

Automatic segmentation

• Method

• Performance

Automatic classification

• Features

• Classification method

• Results: speech versus non-speech, read versus spontaneous

Conclusions and future work

| Conftele 2013 - Castelo Branco, Portugal - May 8-10 2013


Objective

Automatic detection of styles of speech for segmentation of multimedia data

Speech - Who? What? How?

Style of a speech segment?

Segment broadcast news samples into the two most evident classes: read versus spontaneous speech (prepared and unprepared speech)

Using a combination of phonetic and prosodic features

First explore a speech/non-speech segmentation

Style descriptors: slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …


Characterization of the corpus

Broadcast News audio corpus

TV Broadcast News MP4 podcasts

Daily download

Extract audio stream and downsample from 44.1 kHz to 16 kHz

30 daily news programs (~27 hours) were manually segmented and annotated in 4 levels:

Level 1 – dominant signal: speech, noise, music, silence, clapping, …

For speech:

Level 2 – acoustical environment: clean, music, road, crowd, …

Level 3 – speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)

Level 4 – speaker info: BN anchor, gender, public figures, …


Characterization of the corpus

From Level 1 – speech versus non-speech

From Level 3 – read speech (prepared) versus spontaneous speech

Type of segment      Number of segments   Average duration (± std deviation) (s)
Speech               7971                 11.0 (± 9.4)
Non-Speech           2529                 4.1 (± 5.3)
Read Speech          4989                 10.6 (± 8.5)
Spontaneous Speech   1738                 12.0 (± 10.4)
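As a quick consistency check on the table, the total duration of each class can be estimated as number of segments times average duration. The sketch below does only that arithmetic; the dictionary layout and the helper name `total_hours` are illustrative choices, not part of the original work.

```python
# Segment counts and average durations (seconds) copied from the corpus table.
corpus = {
    "Speech": (7971, 11.0),
    "Non-Speech": (2529, 4.1),
    "Read Speech": (4989, 10.6),
    "Spontaneous Speech": (1738, 12.0),
}


def total_hours(n_segments, mean_duration_s):
    """Estimate total duration in hours from a count and a mean duration."""
    return n_segments * mean_duration_s / 3600.0


hours = {name: total_hours(n, mean) for name, (n, mean) in corpus.items()}

# Speech and non-speech together should cover the annotated material.
annotated_hours = hours["Speech"] + hours["Non-Speech"]

for name, value in hours.items():
    print(f"{name}: {value:.1f} h")
print(f"Speech + Non-Speech: {annotated_hours:.1f} h")
```

The speech plus non-speech total comes out close to the roughly 27 hours of annotated programs reported for the corpus.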


Methods

Automatic Detection

1. Automatic Segmentation (find/mark different segments on the audio signal)

2. Automatic Classification (classify the segments)


Methods

1. Automatic segmentation

Based on a modified BIC (Bayesian Information Criterion): DISTBIC uses a distance (Kullback-Leibler) in the first step and delta BIC (DBIC) to validate marks

[Figure: consecutive segments si-1, si, si+1, si+2 with candidate marks labelled DBIC<0 and DBIC>0]

Parameters:

Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and logarithm of energy (windows 25 ms, step 10 ms)

A threshold of 0.6 in the distance standard deviation was used to select significant local maxima; window size: 2000 ms, step 100 ms

Silence segments with a duration above 0.5 seconds are detected and removed before the DISTBIC process
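The distance step can be sketched as follows. The slides only name a Kullback-Leibler distance; the symmetric form between two single-Gaussian window models used here, and the function name `symmetric_kl`, are assumptions made for illustration. NumPy is assumed to be available.

```python
import numpy as np


def symmetric_kl(mu1, cov1, mu2, cov2):
    """Symmetric Kullback-Leibler distance between two Gaussians.

    Assumption: each analysis window is modelled by one full-covariance
    Gaussian (mean vector `mu`, covariance matrix `cov`).
    """
    mu1 = np.atleast_1d(np.asarray(mu1, dtype=float))
    mu2 = np.atleast_1d(np.asarray(mu2, dtype=float))
    cov1 = np.atleast_2d(np.asarray(cov1, dtype=float))
    cov2 = np.atleast_2d(np.asarray(cov2, dtype=float))
    d = mu1.size
    inv1 = np.linalg.inv(cov1)
    inv2 = np.linalg.inv(cov2)
    diff = mu1 - mu2
    # The log-determinant terms of the two directed divergences cancel.
    trace_term = np.trace(inv2 @ cov1) + np.trace(inv1 @ cov2) - 2 * d
    mean_term = diff @ (inv1 + inv2) @ diff
    return 0.5 * (trace_term + mean_term)


# Two unit-variance 1-D Gaussians whose means differ by 1.
print(symmetric_kl([0.0], [[1.0]], [1.0], [[1.0]]))
```

In a DISTBIC-style pipeline this distance would be evaluated between adjacent sliding windows, and its significant local maxima would become candidate marks for the BIC validation step.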


Results

Performance measure

Automatic Segmentation:

Collar (detection tolerance): range 0.5 s to 2.0 s

A detected mark is counted as correct if there is one reference mark inside the interval allowed by the collar.

$$\text{Recall} = \frac{\#\,\text{correctly detected marks}}{\#\,\text{reference marks}}$$

$$\text{Precision} = \frac{\#\,\text{correctly detected marks}}{\#\,\text{correctly detected marks} + \#\,\text{unexpected marks}}$$

$$F_1\text{-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
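A minimal sketch of how these segmentation measures could be computed from two lists of mark times. Two details are assumptions, since the slides do not specify them: a detected mark is accepted when it lies within plus or minus `collar` seconds of a reference mark, and each reference mark can validate at most one detected mark. The function names are hypothetical.

```python
def count_correct(detected, reference, collar):
    """Count detected marks that have an unused reference mark within the collar."""
    unused = sorted(reference)
    correct = 0
    for mark in sorted(detected):
        # Closest reference mark that has not been matched yet.
        nearest = min(unused, key=lambda r: abs(r - mark), default=None)
        if nearest is not None and abs(nearest - mark) <= collar:
            correct += 1
            unused.remove(nearest)
    return correct


def segmentation_scores(detected, reference, collar):
    """Return (precision, recall, f1) for a given collar in seconds."""
    correct = count_correct(detected, reference, collar)
    unexpected = len(detected) - correct
    recall = correct / len(reference) if reference else 0.0
    precision = correct / (correct + unexpected) if detected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical mark times in seconds.
detected = [1.0, 5.2, 9.0, 14.0]
reference = [1.3, 5.0, 12.0]
for collar in (0.5, 1.0, 1.5, 2.0):
    print(collar, segmentation_scores(detected, reference, collar))
```

Widening the collar can only keep or increase the number of accepted marks, which is why the scores are reported as a function of the collar.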


Results

Segmentation performance

[Figure: F1-score versus collar (seconds), for collars from 0.5 s to 2.0 s; y-axis range 0.3 to 0.8]

Results

Segmentation performance

[Figure: Recall versus collar (seconds), for collars from 0.5 s to 2.0 s; y-axis labelled Accuracy, range 0.5 to 1.0]

Methods

Phonetic (size of parameter vector for each segment: 214)

• Based on the results of a free phone loop speech recognition

• Phone duration and recognized log likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)

• Silence and speech rate
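The five statistical functions applied to a per-segment sequence (for example, the durations of the recognized phones) can be sketched as below. The function name `five_functionals`, the example durations, and the use of the population standard deviation are assumptions; the slides do not say which variant was used.

```python
import statistics


def five_functionals(values):
    """Mean, median, maximum, minimum and standard deviation of a sequence."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "maximum": max(values),
        "minimum": min(values),
        # Assumption: population standard deviation (divide by N).
        "std": statistics.pstdev(values),
    }


# Hypothetical phone durations for one segment, in 10 ms frames.
phone_durations = [2, 4, 4, 4, 5, 5, 7, 9]
print(five_functionals(phone_durations))
```

Applying the same five functions to several per-phone measurements is one way a fixed-length vector can be built for segments of varying length.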

Prosodic (size of parameter vector for each segment: 108)

• Based on the pitch (F0) and harmonic to noise ratio (HNR) envelope

• First and second order statistics

• Polynomial fit of first and second order

• Reset rate (rate of voiced portions)

• Voiced and unvoiced duration rates
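A few of the prosodic measures listed above can be sketched from a frame-level F0 contour. Everything specific here is an assumption for illustration: the contour uses 0 for unvoiced frames, the frame step is 10 ms, the polynomial fits are taken over voiced frames only, and the reset rate is counted as voiced portions per second. NumPy is assumed to be available, and the function name is hypothetical.

```python
import numpy as np


def prosodic_sketch(f0, frame_step=0.01):
    """Compute a handful of F0-based measures for one segment.

    Needs at least three voiced frames for the second-order fit.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    frames = np.arange(len(f0))[voiced]
    values = f0[voiced]
    # A voiced portion starts where a voiced frame follows an unvoiced one.
    previous = np.concatenate(([False], voiced[:-1]))
    starts = voiced & ~previous
    duration = len(f0) * frame_step
    return {
        "mean": float(values.mean()),
        "std": float(values.std()),
        "slope": float(np.polyfit(frames, values, 1)[0]),
        "curvature": float(np.polyfit(frames, values, 2)[0]),
        "voiced_rate": float(voiced.mean()),
        "reset_rate": float(starts.sum() / duration),
    }


# Hypothetical contour: two voiced portions on a rising line of 2 Hz per frame.
contour = [0, 0, 104, 106, 108, 0, 0, 114, 116, 0]
print(prosodic_sketch(contour))
```

The same pattern would apply to an HNR envelope; only the input contour changes.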

2. Automatic Classification – Features: a vector of 322 features (214 phonetic + 108 prosodic) is computed for each segment

Methods

Classification

SVM (Support Vector Machine) classifiers (WEKA tool, linear kernel, C=14):

• speech / non-speech

• read / spontaneous

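2-step classification approach

[Diagram: each segment first goes through speech / non-speech classification and may be labelled non-speech; segments labelled speech then go through read / spontaneous classification and are labelled read or spontaneous]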
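The two-step approach amounts to a small cascade, sketched below. The two classifiers are passed in as plain callables standing in for the trained SVMs; the stub rules and the feature names `energy` and `speech_rate` are purely hypothetical and are not the features or decision rules used in the original work.

```python
def classify_segment(features, is_speech, is_read):
    """Two-step cascade: speech / non-speech first, then read / spontaneous."""
    if not is_speech(features):
        return "non-speech"
    return "read" if is_read(features) else "spontaneous"


# Hypothetical stand-ins for the two trained classifiers.
def stub_is_speech(features):
    return features["energy"] > 0.2


def stub_is_read(features):
    return features["speech_rate"] > 12.0


segments = [
    {"energy": 0.1, "speech_rate": 0.0},
    {"energy": 0.8, "speech_rate": 14.5},
    {"energy": 0.7, "speech_rate": 9.0},
]
print([classify_segment(s, stub_is_speech, stub_is_read) for s in segments])
```

One consequence of this design is that the second classifier only ever sees segments the first one labelled as speech, so errors in the first step carry over to the second.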


Results

Automatic detection (automatic segmentation + classification)

Agreement time = % of frames correctly classified

Speech / non-speech detection

Type of features   All     Speech   Non-speech
Phonetic           91.5%   94.9%    62.2%
Prosodic           93.2%   97.0%    61.0%
Combination        93.3%   96.6%    64.9%

Read / spontaneous detection

Type of features   All     Read     Spontaneous
Phonetic           76.7%   91.9%    38.6%
Prosodic           81.1%   93.0%    51.2%
Combination        83.3%   92.7%    59.6%
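A frame-level agreement measure of this kind can be sketched as follows. The segment representation as `(start, end, label)` tuples, the 10 ms frame step, and the function names are assumptions for illustration; the slides only define agreement time as the percentage of correctly classified frames.

```python
def frame_labels(segments, n_frames, frame_step=0.01):
    """Expand (start_s, end_s, label) segments into one label per frame."""
    labels = [None] * n_frames
    for start, end, label in segments:
        first = round(start / frame_step)
        last = min(round(end / frame_step), n_frames)
        for k in range(first, last):
            labels[k] = label
    return labels


def agreement_time(reference, hypothesis, total_duration, frame_step=0.01):
    """Percentage of frames whose hypothesis label equals the reference label."""
    n_frames = round(total_duration / frame_step)
    ref = frame_labels(reference, n_frames, frame_step)
    hyp = frame_labels(hypothesis, n_frames, frame_step)
    same = sum(r == h for r, h in zip(ref, hyp))
    return 100.0 * same / n_frames


# Hypothetical 10 s excerpt: the detected boundary is 1 s too early.
reference = [(0.0, 6.0, "read"), (6.0, 10.0, "spontaneous")]
hypothesis = [(0.0, 5.0, "read"), (5.0, 10.0, "spontaneous")]
print(agreement_time(reference, hypothesis, 10.0))
```

Because the measure is computed per frame, it penalizes both misplaced boundaries and wrong labels, which is why detection figures are lower than classification figures obtained with manual segmentation.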


Results

Classification only (using given manual segmentation)

Values are accuracy (%)

Speech / non-speech classifier

Type of features   All     Speech   Non-speech
Phonetic           93.8%   96.7%    82.0%
Prosodic           93.8%   97.5%    81.9%
Combination        94.4%   97.6%    84.0%

Read / spontaneous classifier

Type of features   All     Read     Spontaneous
Phonetic           83.2%   92.8%    55.4%
Prosodic           86.4%   95.0%    61.6%
Combination        87.4%   93.7%    69.5%
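The overall figure is much closer to the read accuracy than to the spontaneous accuracy, which is what a count-weighted average would give when read segments dominate. The sketch below checks this for the combined features, under two assumptions that the slides do not confirm: that the overall accuracy is computed per segment, and that the evaluated segments follow the class counts from the corpus table (4989 read, 1738 spontaneous).

```python
def weighted_accuracy(acc_a, n_a, acc_b, n_b):
    """Overall accuracy (%) as the count-weighted mean of two class accuracies."""
    return (acc_a * n_a + acc_b * n_b) / (n_a + n_b)


# Combination features: 93.7% on read, 69.5% on spontaneous.
overall = weighted_accuracy(93.7, 4989, 69.5, 1738)
print(f"{overall:.1f}%")
```

Under those assumptions the weighted mean lands very close to the 87.4% reported in the "All" column, so the modest spontaneous accuracy has limited effect on the overall figure.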


Conclusions and future work

Read speech can be distinguished from spontaneous speech with reasonable accuracy.

Results were obtained with only a few simple measures of the speech signal.

A combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).

We have already implemented several important features, such as hesitation detection, aspiration detection using word-spotting techniques, speaker identification using GMM and jingle detection based on audio fingerprinting.

We intend to automatically segment all audio genres and speaking styles.


THANK YOU


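Appendix – BIC

BIC (Bayesian Information Criterion): dissimilarity measure between 2 consecutive segments

The analysis window X is split at frame i into two consecutive segments, X_1 and X_2.

Two hypotheses:

H0 – No change of signal characteristics. Model: 1 Gaussian:

$$X \sim N(x;\, \mu_X, \Sigma_X)$$

H1 – Change of characteristics. 2 Gaussians:

$$X_1 \sim N(x;\, \mu_{X_1}, \Sigma_{X_1}); \quad X_2 \sim N(x;\, \mu_{X_2}, \Sigma_{X_2})$$

μ – mean vector; Σ – covariance matrix

Maximum likelihood ratio between H0 and H1:

$$R(i) = \frac{N_X}{2}\log|\Sigma_X| - \frac{N_{X_1}}{2}\log|\Sigma_{X_1}| - \frac{N_{X_2}}{2}\log|\Sigma_{X_2}|$$

where N_X, N_{X_1} and N_{X_2} are the numbers of frames in X, X_1 and X_2.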


Appendix – BIC

$$\Delta\mathrm{BIC}(i) = R(i) - \lambda P$$

P – complexity penalization

λ – penalization factor (ideal 1.0)

Change if:

$$\Delta\mathrm{BIC}(i^{*}) > 0$$

Parameters used in this work:

p=16; λ=1.3; frame rate = 100; N=200; M=10;
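A minimal sketch of a BIC-based change test on one analysis window, using the log-determinant likelihood ratio between one Gaussian and two Gaussians. Several points are assumptions rather than details taken from the slides: the sign convention (a positive value indicates a change), the penalty P = ½ (p + ½ p(p+1)) log N, which is the form commonly used in BIC segmentation, and the function name. NumPy is assumed to be available.

```python
import numpy as np


def delta_bic(window, i, lam=1.3):
    """Delta BIC for a candidate change at frame i of `window` (frames x dims).

    Both sides of the split need more frames than dimensions so that the
    covariance estimates are non-singular.
    """
    window = np.asarray(window, dtype=float)
    n, p = window.shape

    def half_n_logdet(frames):
        # Maximum-likelihood covariance (bias=True divides by N).
        cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
        _, logdet = np.linalg.slogdet(cov)
        return 0.5 * len(frames) * logdet

    ratio = (half_n_logdet(window)
             - half_n_logdet(window[:i])
             - half_n_logdet(window[i:]))
    # Assumed standard penalty for a full-covariance Gaussian in p dimensions.
    penalty = 0.5 * (p + 0.5 * p * (p + 1)) * np.log(n)
    return ratio - lam * penalty


rng = np.random.default_rng(0)
same = rng.normal(0.0, 1.0, size=(200, 2))
changed = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
                     rng.normal(5.0, 1.0, size=(100, 2))])
print(delta_bic(same, 100), delta_bic(changed, 100))
```

With synthetic data, the homogeneous window gives a negative value (no change) and the window whose second half has a shifted mean gives a clearly positive one. A penalization factor above 1.0 makes the test more conservative.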
