Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs. Andrew Rosenberg, Queens College / CUNY. Interspeech 2013, August 26, 2013.

TRANSCRIPT

  • Slide 1
  • Slide 2
  • Modeling Prosodic Sequences with K-Means and Dirichlet Process GMMs. Andrew Rosenberg, Queens College / CUNY. Interspeech 2013, August 26, 2013.
  • Slide 3
  • Prosody: pitch, intensity, rhythm, and silence. Prosody carries information about a speaker's intent and identity. Here: prosodic recognition of speaking style, nativeness, and speaker.
  • Slide 4
  • Approach: unsupervised clustering of acoustic/prosodic features, then sequence modeling of cluster identities.
  • Slide 5
  • K-Means: K-means is a simple distance-based clustering algorithm. It is iterative and non-deterministic (sensitive to initialization), and K must be specified. We evaluate K between 2 and 100, taking the optimal value from cross-validation for each task.
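The sensitivity to initialization mentioned on this slide is easy to see in a minimal K-means sketch. The code below is an illustrative pure-Python version of Lloyd's algorithm, not the implementation used in the talk; the function name `kmeans`, the `seed` parameter, and the toy points are assumptions made for the example.

```python
import random


def kmeans(points, k, seed=0, iters=100):
    """Lloyd's algorithm on a list of equal-length tuples.

    Returns (centroids, labels). The seed controls the random initial
    centroids, which is where the non-determinism comes from.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return centroids, labels


# Toy data: two well-separated groups of 2-D points.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(toy, k=2)
```

In a setting like the one described, each point would be a syllable's feature vector, and K would be chosen by repeating this for several values and comparing downstream performance under cross-validation.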
  • Slide 6
  • Dirichlet Process GMMs: a non-parametric infinite mixture model. Needs a prior of the Dirichlet process and a prior over N, a zero-mean Gaussian. Still need to set hyperparameters and G_0. Stick-breaking and Chinese Restaurant metaphors. Blei and Jordan (2005): variational inference. Rich get richer. [Figure: plate notation from M. Jordan 2005 NIPS tutorial]
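The Chinese Restaurant metaphor and the "rich get richer" behavior can be illustrated with a short simulation. This is only a sketch of the prior over cluster assignments, not the variational inference procedure cited on the slide; the function name and the concentration value `alpha=1.0` are assumptions chosen for illustration.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Simulate table sizes under a Chinese Restaurant Process.

    Customer n (0-indexed) joins an existing table with probability
    proportional to its current size, or opens a new table with
    probability proportional to alpha.
    """
    rng = random.Random(seed)
    table_sizes = []
    for n in range(n_customers):
        # Total weight is n (seated customers) + alpha (new table).
        r = rng.random() * (n + alpha)
        cumulative = 0.0
        for i, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                table_sizes[i] += 1  # larger tables are more likely to grow
                break
        else:
            table_sizes.append(1)  # r fell in the alpha-weighted region
    return table_sizes


sizes = chinese_restaurant_process(1000, alpha=1.0)
print(sorted(sizes, reverse=True)[:5])  # a few tables usually hold most customers
```

Because large tables attract new customers in proportion to their size, a small number of clusters tends to dominate, which is the effect the slide refers to.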
  • Slide 7
  • DPGMM: rich get richer. Artificially omit the largest cluster. = 0.25
  • Slide 8
  • Prosodic Event Distribution: ToBI prosodic labels (pitch accents, phrase accent/boundary tones). [Figure panels: Accent Type Distribution; Phrase Ending Distribution]
  • Slide 9
  • Sequence Modeling: SRILM 3-gram model with backoff and Good-Turing (GT) smoothing. Clusters are learned over all material; sequence models are trained over the training sets.
  • Slide 10
  • Experiments: speaking style, nativeness, and speaker recognition. Evaluation: 500 samples of between 10 and 100 syllables (~2-20 seconds); ToBI, K-means, DPGMM, and DPGMM with the largest cluster removed; 5-fold cross-validation to learn hyperparameters. Classification: train one SRILM model per class and classify by lowest perplexity. Outlier detection: train a single model; a classifier learns a perplexity threshold.
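The two decision rules on this slide (lowest perplexity across per-class models, and a perplexity threshold on a single model) can be sketched as follows. This is a deliberately simplified stand-in: it uses a bigram model with add-one smoothing rather than the SRILM 3-gram model with backoff and Good-Turing smoothing named in the talk, and every function name, the `BOS`/`EOS` markers, and the toy sequences are assumptions for illustration.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count bigrams over sequences of cluster IDs, with boundary markers."""
    context_counts, bigram_counts = Counter(), Counter()
    for seq in sequences:
        padded = ["BOS"] + list(seq) + ["EOS"]
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts


def perplexity(model, seq, vocab_size):
    """Per-token perplexity of seq under an add-one smoothed bigram model."""
    context_counts, bigram_counts = model
    padded = ["BOS"] + list(seq) + ["EOS"]
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothing over vocab_size cluster symbols plus the end marker.
        p = (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size + 1)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))


def classify(models, seq, vocab_size):
    """Classification: pick the class whose model gives the lowest perplexity."""
    return min(models, key=lambda label: perplexity(models[label], seq, vocab_size))


def is_outlier(model, seq, vocab_size, threshold):
    """Outlier detection: flag sequences whose perplexity exceeds a threshold."""
    return perplexity(model, seq, vocab_size) > threshold


# Toy example with two cluster IDs (0 and 1) and two hypothetical classes.
models = {
    "a": train_bigram([[0, 0, 0, 0], [0, 0, 0]]),
    "b": train_bigram([[1, 0, 1, 0], [0, 1, 0, 1]]),
}
print(classify(models, [0, 0, 0], vocab_size=2))
print(classify(models, [1, 0, 1], vocab_size=2))
```

In the outlier setting the threshold would itself be learned from held-out data; here it is simply a parameter.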
  • Slide 11
  • Data: Boston Directions Corpus (READ, SPONTANEOUS; 4 speakers, used for speaker classification). Boston University Radio News Corpus (BROADCAST NEWS; 6 speakers). Columbia Games Corpus (SPONTANEOUS DIALOG; 13 speakers). Native Mandarin Chinese speakers reading BURNC stories (4 speakers). All ToBI labeled.
  • Slide 12
  • Features: Villing (2004) pseudosyllabification. Syllables with mean intensity below 10 dB are considered silent. 7 features: (1) mean range-normalized intensity; (2) mean range-normalized delta intensity; (3) mean z-score-normalized log f0; (4) mean z-score-normalized delta log f0; (5) syllable duration; (6) duration of previous silence (if any); (7) duration of following silence (if any).
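The two normalizations named in the feature list can be sketched as below. The slide does not say over what span the statistics are computed or how frame values are pooled, so this example assumes normalization over all frames passed in (for instance, one speaker's frames) followed by a per-syllable mean; the function names and toy values are hypothetical.

```python
import math
from statistics import mean, pstdev


def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant input: no variation to scale
    return [(v - mu) / sigma for v in values]


def range_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalized_syllable_means(syllable_frames, normalize):
    """Normalize over all frames, then average within each (non-empty) syllable.

    syllable_frames is a list of per-syllable lists of frame-level values.
    """
    flat = [v for syl in syllable_frames for v in syl]
    normed = iter(normalize(flat))
    return [mean(next(normed) for _ in syl) for syl in syllable_frames]


# Toy frame-level f0 values in Hz for three syllables (assumed data).
f0_frames = [[110.0, 120.0], [180.0, 200.0, 190.0], [100.0]]
log_f0 = [[math.log(v) for v in syl] for syl in f0_frames]
print(normalized_syllable_means(log_f0, zscore))
```

Delta features would be computed the same way after taking frame-to-frame differences, again under the assumptions stated above.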
  • Slide 13
  • Consistency with ToBI labels: V-Measure between ToBI accent types and clusters, and between ToBI intonational phrase-ending tones and clusters. K-means: solid line. DPGMM: gray line for reference (doesn't vary by more than 0.001). [Figure panels: Accenting; Phrasing]
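For readers unfamiliar with V-Measure, it is the harmonic mean of homogeneity (each cluster contains a single class) and completeness (each class falls in a single cluster), both defined through conditional entropy. The sketch below is a minimal pure-Python version with equal weighting of the two terms; the function names and toy labels are assumptions, and in the setting above the classes would be ToBI labels and the clusters the K-means or DPGMM assignments.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given) for two aligned label sequences."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum(c / n * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(classes, clusters):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_classes, h_clusters = entropy(classes), entropy(clusters)
    homogeneity = (1.0 if h_classes == 0
                   else 1.0 - conditional_entropy(classes, clusters) / h_classes)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - conditional_entropy(clusters, classes) / h_clusters)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Cluster IDs are arbitrary, so a relabeled perfect clustering still scores 1.
print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))
# Clusters that are independent of the classes score 0.
print(v_measure([0, 0, 1, 1], [0, 1, 0, 1]))
```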
  • Slide 14
  • Speaking Style Recognition: 4 styles (READ, SPON, BN, DIALOG). Single speaker for evaluation. [Figure panels: Classification; Outlier Detection - Dialog]
  • Slide 15
  • Nativeness Recognition: Native (BURNC) vs. Non-Native. Single speaker for evaluation. [Figure panels: Classification; Outlier Detection - Native]
  • Slide 16
  • Speaker Recognition. Classification: 4 BDC speakers; 6 tasks for training, 3 for testing. Outlier detection: 6 BURNC speakers; detect f2b vs. others.
  • Slide 17
  • Conclusions: K-means works well to represent prosodic information. DPGMM does not work so well out of the box; despite being non-parametric, hyperparameter setting is still critically important. Future work: a larger acoustic/prosodic feature set (requires pre-processing); evaluating the universality of prosodic representations; integration of K-means and DPGMM, using one to seed the other.
  • Slide 18
  • Thank you. [email protected] http://speech.cs.qc.cuny.edu