Prosodic and Phonetic Features for Speaking Styles Classification and Detection
Arlindo Veiga, Dirce Celorico, Jorge Proença, Sara Candeias, Fernando Perdigão
IberSPEECH 2012 - VII Jornadas en Tecnología del Habla & III Iberian SLTech Workshop
November 21-23 2012, Universidad Autónoma de Madrid, Madrid, SPAIN
Summary
Objective
Characterization of the corpus
Features
Methods: automatic segmentation, classification
Results: automatic detection
Segmentation: speech versus non-speech, read versus spontaneous
Classification: speech versus non-speech, read versus spontaneous
Conclusions and future works
Objective
Automatic detection of speaking styles for segmentation purposes of multimedia data
Style of a speech segment? slow, fast, clear, informal, casual, planned, prepared, spontaneous, unprepared, …
• Segment broadcast news documents into the two most evident classes: read versus spontaneous speech (prepared and unprepared speech)
• Using a combination of phonetic and prosodic features
• Explore also speech/non-speech segmentation
Characterization of the corpus
Broadcast News audio corpus
• TV Broadcast News MP4 podcasts
• Daily download
• Extract audio stream and downsample from 44.1 kHz to 16 kHz
30 daily news programs (~27 hours) were manually segmented and annotated in 4 levels:
• Level 1, dominant signal: speech, noise, music, silence, clapping, …
For speech:
• Level 2, acoustical environment: clean, music, road, crowd, …
• Level 3, speech style: prepared speech, Lombard speech and 3 levels of unprepared speech (as a function of spontaneity)
• Level 4, speaker info: BN anchor, gender, public figures, …
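The audio extraction and downsampling step could be scripted along the following lines. This is only a sketch: the slides do not say which tool was used, so calling ffmpeg is an assumption, as are the mono output (`-ac 1`), the function names and the file names.

```python
import subprocess

def ffmpeg_extract_cmd(src, dst, sample_rate=16000, channels=1):
    """Build an ffmpeg command that drops the video stream and resamples the audio.

    Assumptions (not stated in the slides): ffmpeg is the tool, output is mono.
    """
    return ["ffmpeg", "-i", src,
            "-vn",                    # discard the video stream
            "-ac", str(channels),     # number of audio channels
            "-ar", str(sample_rate),  # target sampling rate in Hz
            dst]

def extract_audio(src, dst):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_extract_cmd(src, dst), check=True)
```

For a hypothetical podcast file, `extract_audio("news.mp4", "news.wav")` would write a 16 kHz WAV file.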
Characterization of the corpus
From Level 1: speech versus non-speech
From Level 3: read speech (prepared) versus spontaneous speech
Type of segment | Number of segments | Average duration ± std deviation (s)
Speech | 7971 | 11.0 ± 9.4
Non-Speech | 2529 | 4.1 ± 5.3
Read Speech | 4989 | 10.6 ± 8.5
Spontaneous Speech | 1738 | 12.0 ± 10.4
For each segment, a vector of 322 features (214 phonetic features and 108 prosodic features) is computed
Features
Phonetic (size of parameter vector for each segment: 214)
• Based on the results of a free phone loop speech recognition
• Phone duration and recognized log-likelihood: 5 statistical functions (mean, median, maximum, minimum and standard deviation)
• Silence and speech rate
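As an illustration of the phonetic measures listed above, the sketch below computes the five statistical functions over phone durations and log-likelihoods, plus two rate measures. It covers only a small subset of the 214 features. The input format, the "sil" silence label and the exact definitions of speech rate (phones per second) and silence rate (fraction of time in silence) are assumptions, not taken from the slides.

```python
import statistics

def five_stats(values):
    """Mean, median, maximum, minimum and standard deviation of a list."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "max": max(values),
        "min": min(values),
        "std": statistics.pstdev(values),
    }

def phone_features(phones):
    """phones: list of (label, start_s, end_s, log_likelihood) from a phone recognizer.

    Hypothetical format; 'sil' is assumed to mark silence.
    """
    speech = [p for p in phones if p[0] != "sil"]
    durations = [end - start for _, start, end, _ in speech]
    logliks = [ll for *_, ll in speech]
    total = phones[-1][2] - phones[0][1]  # segment duration in seconds
    silence = sum(end - start for label, start, end, _ in phones if label == "sil")
    feats = {f"dur_{k}": v for k, v in five_stats(durations).items()}
    feats.update({f"ll_{k}": v for k, v in five_stats(logliks).items()})
    feats["speech_rate"] = len(speech) / total  # phones per second (assumed definition)
    feats["silence_rate"] = silence / total     # fraction of time in silence (assumed definition)
    return feats
```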
Prosodic (size of parameter vector for each segment: 108)
• Based on the pitch (F0) and harmonics-to-noise ratio (HNR) envelope
• First and second order statistics
• Polynomial fit of first and second order
• Reset rate (rate of voiced portions)
• Voiced and unvoiced duration rates
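A few of the prosodic measures can be sketched from a frame-level F0 contour as follows. This is a simplified illustration under stated assumptions: unvoiced frames are marked with F0 = 0, the reset rate is taken as voiced portions per second, and only the first-order polynomial fit is shown. All names are hypothetical, and the HNR-based features are omitted.

```python
import statistics

def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept (first-order polynomial fit)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def prosodic_features(f0, frame_rate=100):
    """f0: one F0 value (Hz) per frame, 0 for unvoiced frames (assumed convention)."""
    voiced = [(k / frame_rate, v) for k, v in enumerate(f0) if v > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    times, values = zip(*voiced)
    slope, _ = linear_fit(times, values)
    # A voiced portion starts wherever a voiced frame follows an unvoiced one.
    portions = sum(1 for k, v in enumerate(f0) if v > 0 and (k == 0 or f0[k - 1] <= 0))
    duration = len(f0) / frame_rate
    voiced_rate = len(voiced) / len(f0)
    return {
        "f0_mean": statistics.fmean(values),
        "f0_std": statistics.pstdev(values),
        "f0_slope": slope,                  # Hz per second
        "reset_rate": portions / duration,  # voiced portions per second (assumed definition)
        "voiced_rate": voiced_rate,
        "unvoiced_rate": 1.0 - voiced_rate,
    }
```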
Methods
Automatic detection
Implies automatic segmentation and automatic classification
Automatic segmentation based on modified BIC (Bayesian Information Criterion) - DISTBIC
Binary classification: SVM classifiers
Methods
Automatic segmentation
DISTBIC: uses a distance (Kullback-Leibler) in the first step and delta BIC (ΔBIC) to validate marks
[Diagram: candidate marks s(i-1), s(i), s(i+1), s(i+2), with ΔBIC<0 and ΔBIC>0 cases]
Parameters:
• Acoustic vector: 16 Mel-Frequency Cepstral Coefficients (MFCCs) and logarithm of energy (windows 25 ms, step 10 ms)
• A threshold of 0.6 in the distance standard deviation was used to select significant local maxima; window size: 2000 ms, step 100 ms
• Silence segments with duration above 0.5 seconds are detected and removed for the DISTBIC process
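The first DISTBIC step can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes diagonal-covariance Gaussians per window, a symmetric Kullback-Leibler distance, and a peak-picking rule that keeps a local maximum when it exceeds the nearest minimum on each side by 0.6 times the standard deviation of the distance curve. Function names are hypothetical; `window` and `step` are in frames (200 and 10 at 100 frames per second).

```python
import statistics

def diag_gaussian(frames):
    """Per-dimension mean and variance of a list of feature vectors."""
    dims = list(zip(*frames))
    means = [statistics.fmean(d) for d in dims]
    # A small floor keeps the ratio terms finite for constant dimensions.
    variances = [max(statistics.pvariance(d), 1e-8) for d in dims]
    return means, variances

def symmetric_kl(a, b):
    """KL(a||b) + KL(b||a) for diagonal Gaussians fitted to two windows."""
    (m1, v1), (m2, v2) = diag_gaussian(a), diag_gaussian(b)
    total = 0.0
    for mu1, s1, mu2, s2 in zip(m1, v1, m2, v2):
        d2 = (mu1 - mu2) ** 2
        total += 0.5 * (s1 / s2 + s2 / s1 - 2.0) + 0.5 * d2 * (1.0 / s1 + 1.0 / s2)
    return total

def distance_curve(frames, window=200, step=10):
    """Distance between adjacent windows, as (boundary frame, distance) pairs."""
    curve = []
    for start in range(0, len(frames) - 2 * window + 1, step):
        left = frames[start:start + window]
        right = frames[start + window:start + 2 * window]
        curve.append((start + window, symmetric_kl(left, right)))
    return curve

def significant_maxima(curve, alpha=0.6):
    """Boundary frames of local maxima that stand out by alpha * std of the curve."""
    values = [d for _, d in curve]
    if len(values) < 3:
        return []
    sigma = statistics.pstdev(values)
    marks = []
    for k in range(1, len(values) - 1):
        if values[k] > values[k - 1] and values[k] > values[k + 1]:
            left = k
            while left > 0 and values[left - 1] < values[left]:
                left -= 1   # walk down to the nearest minimum on the left
            right = k
            while right < len(values) - 1 and values[right + 1] < values[right]:
                right += 1  # and on the right
            if (values[k] - values[left] > alpha * sigma
                    and values[k] - values[right] > alpha * sigma):
                marks.append(curve[k][0])
    return marks
```

The marks returned here are only candidates; in DISTBIC they are then validated with ΔBIC.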
Methods
Classification
SVM classifiers (WEKA tool, SMO, linear kernel, C=14):
• speech / non-speech
• read / spontaneous
2-step classification approach
[Diagram: speech / non-speech classification outputs non-speech or speech; speech segments go to read / spontaneous classification, which outputs read or spontaneous]
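The 2-step approach amounts to a small cascade, sketched below. The two classifiers are passed in as plain callables, so this shows only the control flow. The slides use WEKA SMO models; the toy decision rules and feature names here are hypothetical stand-ins, not the trained classifiers.

```python
def classify_segment(features, is_speech, is_read):
    """Two-step cascade: speech / non-speech first, then read / spontaneous."""
    if not is_speech(features):
        return "non-speech"
    return "read" if is_read(features) else "spontaneous"

# Stand-in decision rules (hypothetical thresholds, for illustration only).
def toy_is_speech(features):
    return features["voiced_rate"] > 0.2

def toy_is_read(features):
    return features["silence_rate"] < 0.15
```

Only segments accepted as speech in the first step reach the read / spontaneous classifier.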
Results
Performance measure
• Segmentation only: collar (detection tolerance) range 0.5 s to 2.0 s. A detected mark is counted as correct if there is a reference mark less than one "collar" away
• Automatic detection: "AT" (agreement time) = % of frames correctly classified
• Classification only: accuracy ("Acc.")
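The collar-based scoring of segmentation marks could be computed as in the sketch below. It assumes that each reference mark can validate at most one detected mark (greedy nearest-neighbour matching). The slides do not specify the matching rule, so that choice and the function name are assumptions.

```python
def score_marks(detected, reference, collar):
    """Precision, recall and F1 of detected marks against reference marks (seconds)."""
    unused = sorted(reference)
    correct = 0
    for mark in sorted(detected):
        # Nearest reference mark that has not been matched yet.
        best = min(unused, key=lambda r: abs(r - mark), default=None)
        if best is not None and abs(best - mark) < collar:
            correct += 1
            unused.remove(best)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Widening the collar can only keep or increase the number of matched marks.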
Results
Segmentation performance:
[Figure: F1-score versus collar (seconds), collar range 0.5 s to 2.0 s; F1-score axis from 0.3 to 0.8]
Results
Segmentation performance:
[Figure: recall versus collar (seconds), collar range 0.5 s to 2.0 s; vertical axis labelled "Accuracy", from 0.5 to 1.0]
Results
Automatic detection
Speech / non-speech detection
Type of features | AT | Speech | Non-speech
Phonetic | 91.5% | 94.9% | 62.2%
Prosodic | 93.2% | 97.0% | 61.0%
Combination | 93.3% | 96.6% | 64.9%
Read / spontaneous detection
Type of features | AT | Read | Spontaneous
Phonetic | 76.7% | 91.9% | 38.6%
Prosodic | 81.1% | 93.0% | 51.2%
Combination | 83.3% | 92.7% | 59.6%
"AT": agreement time = % of frames correctly classified
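The agreement time measure can be illustrated with the following sketch, which labels every frame from a list of segments and counts the frames where reference and hypothesis agree. The segment format, the function names and the default of 100 frames per second are assumptions made for the example.

```python
def label_frames(segments, n_frames, frame_rate=100):
    """Expand (start_s, end_s, label) segments into one label per frame."""
    labels = [None] * n_frames
    for start, end, label in segments:
        first = int(round(start * frame_rate))
        last = min(int(round(end * frame_rate)), n_frames)
        for k in range(first, last):
            labels[k] = label
    return labels

def agreement_time(reference, hypothesis, duration, frame_rate=100):
    """Percentage of frames with the same label in reference and hypothesis."""
    n = int(round(duration * frame_rate))
    ref = label_frames(reference, n, frame_rate)
    hyp = label_frames(hypothesis, n, frame_rate)
    # Frames left unlabelled in both lists also count as agreeing.
    agree = sum(1 for r, h in zip(ref, hyp) if r == h)
    return 100.0 * agree / n
```

For example, if the detected speech to non-speech boundary falls 1 s early in a 10 s excerpt, 100 of 1000 frames disagree and the agreement time is 90%.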
Results
Classification only (using given manual segmentation)
Speech / non-speech classifier
Type of features | Acc. | Speech | Non-speech
Phonetic | 93.8% | 96.7% | 82.0%
Prosodic | 93.8% | 97.5% | 81.9%
Combination | 94.4% | 97.6% | 84.0%
Read / spontaneous classifier
Type of features | Acc. | Read | Spontaneous
Phonetic | 83.2% | 92.8% | 55.4%
Prosodic | 86.4% | 95.0% | 61.6%
Combination | 87.4% | 93.7% | 69.5%
"Acc.": accuracy
Conclusions and future work
Read speech can be differentiated from spontaneous speech with reasonable accuracy.
Good results were obtained with only a few simple measures of the speech signal.
A combination of phonetic and prosodic features provided the best results (both seem to carry important and complementary information).
We have already implemented several important features, such as hesitation detection, aspiration detection using word spotting techniques, speaker identification using GMMs and jingle detection based on audio fingerprinting.
We intend to automatically segment all audio genres and speaking styles.
THANK YOU
Appendix – BIC
BIC (Bayesian Information Criterion)
Dissimilarity measure between 2 consecutive segments X1 and X2 of a window X, split at frame i
Two hypotheses:
H0: no change of signal characteristics. Model: 1 Gaussian: $X \sim N(x; \mu_X, \Sigma_X)$
H1: change of characteristics. Model: 2 Gaussians: $X_1 \sim N(x; \mu_{X_1}, \Sigma_{X_1})$; $X_2 \sim N(x; \mu_{X_2}, \Sigma_{X_2})$
μ: mean vector; Σ: covariance matrix
Maximum likelihood ratio between H0 and H1:
$$R(i) = \frac{N_X}{2}\log|\Sigma_X| - \frac{N_{X_1}}{2}\log|\Sigma_{X_1}| - \frac{N_{X_2}}{2}\log|\Sigma_{X_2}|$$
Appendix – BIC
$$\Delta BIC(i) = -R(i) + \lambda P$$
P: complexity penalization
λ: penalization factor (ideal 1.0)
Change if:
$$\Delta BIC(i^*) < 0$$
Parameters used in this work: p=16; λ=1.3; frame rate = 100; N=200; M=10
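The BIC test can be sketched as below. It is a simplified illustration rather than the authors' code: covariances are taken as diagonal so that log|Σ| is a sum of log variances. The penalty uses the usual BIC form, P = ½ (number of model parameters) log N, which is an assumption because the slide does not spell P out. The decision is written as R(i) > λP, which is the same test as the sign check on ΔBIC. Function names are hypothetical.

```python
import math
import statistics

def log_det_diag(frames):
    """log|Sigma| under a diagonal-covariance simplification."""
    return sum(math.log(max(statistics.pvariance(dim), 1e-8)) for dim in zip(*frames))

def likelihood_ratio(frames, i):
    """R(i) for a window X split at frame i into X1 and X2."""
    x1, x2 = frames[:i], frames[i:]
    return (0.5 * len(frames) * log_det_diag(frames)
            - 0.5 * len(x1) * log_det_diag(x1)
            - 0.5 * len(x2) * log_det_diag(x2))

def bic_penalty(p, n, full_covariance=True):
    """Assumed standard BIC penalty: half the parameter count times log N."""
    n_cov = p * (p + 1) / 2 if full_covariance else p
    return 0.5 * (p + n_cov) * math.log(n)

def is_change(frames, i, lam=1.3):
    """True when the two-Gaussian model wins after penalization."""
    p = len(frames[0])
    # Diagonal covariances are used above, so the penalty counts p variances.
    penalty = bic_penalty(p, len(frames), full_covariance=False)
    return likelihood_ratio(frames, i) > lam * penalty
```

With full covariances, p=16 and N=200, this assumed penalty evaluates to about 402.7, which λ=1.3 then scales.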