
September 29, 2008, Université Toulouse III - Paul Sabatier

Spectral methods for automatic processing of audio documents

José Anibal Arias Aguilar

Advisor: Régine André-Obrecht
Tutor: Jérôme Farinas


Objectives

- Unify different dimensionality reduction approaches
- Visualize speech
- Identify basic units in speech
- Represent variable-length acoustic sequences by 3D vectors


Outline

- Introduction
- State of the art
  - Kernel functions
  - Spectral methods for dimensionality reduction
- Contribution
  - Acoustic information in low-dimensional spaces
  - Speech segmentation and labeling
  - Content visualization of audio databases
- Conclusions and perspectives


Introduction

- Data and machine learning
  - Quantity and quality, but dimensionality?
- Manifolds
  - Low-dimensional data embedded in high-dimensional spaces
- Kernel functions
  - Link between pattern space and feature space
- Speech sounds
  - Complex, high-dimensional information




State of the art: kernel functions

Pattern space vs feature space

- We can transform the pattern space to find more informative data representations



Feature space: properties

- Desirable properties of the new space
  - Contains a rich class of functions
  - Has linear structure
  - Has an inner product, so that we can take projections
- Example: Hilbert space (complete vector space with inner product)



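Access to feature space: Kernels

- X is a compact metric space
- A Mercer kernel is a function

  $$\kappa : X \times X \to \mathbb{R} \quad \text{such that} \quad \kappa(x, z) = \kappa(z, x) \;\; \forall x, z \in X, \quad K_{ij} = \kappa(x_i, x_j) \text{ is positive semidefinite (psd)}$$

- For every Mercer kernel there is a map

  $$\Phi : X \to F, \text{ where } F \text{ is a Hilbert space, such that} \quad \kappa(x, z) = \Phi(x) \cdot \Phi(z)$$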
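The two Mercer conditions (symmetry and a psd Gram matrix) can be checked numerically on a sample. The sketch below is only an illustration: the Gaussian kernel, the toy data and the names `gaussian_kernel` and `gram_matrix` are assumptions, not taken from the slides.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """kappa(x, z) = exp(-||x - z||^2 / (2 sigma^2)), a standard Mercer kernel."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))

def gram_matrix(X, kernel):
    """K[i, j] = kappa(x_i, x_j) for every pair of rows of X."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 toy patterns in R^3 (assumed data)
K = gram_matrix(X, gaussian_kernel)

is_symmetric = np.allclose(K, K.T)              # kappa(x, z) == kappa(z, x)
min_eigenvalue = np.linalg.eigvalsh(K).min()    # psd: no eigenvalue below 0 (up to rounding)
```

A Gram matrix that passes both checks on a finite sample does not prove that a function is a Mercer kernel, but a failure is enough to rule it out.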



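Kernels and regularization theory [Evg99]

- Data: $(x_1, o_1), \ldots, (x_n, o_n) \in \mathbb{R}^d \times \mathbb{R}$
- Estimate $f : X \to O$
- Hypothesis space $H$ (RKHS), complexity of the solution controlled by the Hilbert space norm

  $$f = \arg\min_{f \in H} \; \underbrace{\frac{1}{n} \sum_{i} V(f(x_i), o_i)}_{\text{fit to data}} + \underbrace{\lambda \|f\|_H^2}_{\text{complexity penalty}}$$

- Representer theorem:

  $$f(x) = \sum_{i} \alpha_i \, \kappa(x, x_i)$$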
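As one concrete instance of the regularized problem, take the squared loss V(f(x), o) = (f(x) - o)^2. The representer theorem then gives the coefficients in closed form, alpha = (K + n lambda I)^{-1} o. The sketch below assumes this loss, a Gaussian kernel and toy sine data; the function names are hypothetical.

```python
import numpy as np

def rbf_gram(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def fit_alpha(X, o, lam, sigma):
    """Minimize (1/n) sum_i (f(x_i) - o_i)^2 + lam * ||f||_H^2 over the RKHS.

    With f(x) = sum_i alpha_i kappa(x, x_i), the minimizer solves
    (K + n * lam * I) alpha = o.
    """
    n = len(X)
    K = rbf_gram(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), o)

def predict(X_train, alpha, X_new, sigma):
    """Evaluate f(x) = sum_i alpha_i kappa(x, x_i) at new points."""
    return rbf_gram(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = np.linspace(0.0, 2.0 * np.pi, 40)[:, None]           # toy inputs (assumed)
o = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40)          # noisy targets
alpha = fit_alpha(X, o, lam=1e-3, sigma=0.5)
pred = predict(X, alpha, np.array([[np.pi / 2.0], [np.pi]]), sigma=0.5)
```

Increasing `lam` strengthens the complexity penalty and gives a smoother, flatter estimate; decreasing it favors the fit to data.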


Outline

- Introduction
- State of the art
  - Kernel functions
  - Spectral methods for dimensionality reduction
    - Principal Component Analysis (PCA)
    - Metric Multidimensional Scaling (MDS)
    - Isometric mapping (ISOMAP)
    - Locally Linear Embedding (LLE)
    - Spectral Clustering (SC)


State of the art: spectral methods

Spectral methods for dimensionality reduction: two approaches

- Manifold learning: nearby points remain nearby, distant points remain distant
- Information extraction: separate clusters
- Spectral methods reveal low-dimensional structure through the eigenvalues and eigenvectors of special matrices



Linear methods: PCA

- Principal Component Analysis [Alp04]
  - Spectral decomposition of the covariance matrix
  - Eigenvectors: principal axes of the maximum-variance subspace
  - Eigenvalues: projected variance of inputs along principal axes. The number of significant (non-negative) eigenvalues estimates dimensionality
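A minimal NumPy sketch of PCA as described above. The toy data (a 2D latent signal placed in 5 dimensions) and the 1% significance threshold are assumptions chosen for illustration.

```python
import numpy as np

def pca(X, d):
    """Project X onto the d principal axes of its covariance matrix."""
    Xc = X - X.mean(axis=0)                       # zero-mean inputs
    C = Xc.T @ Xc / (len(X) - 1)                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # spectral decomposition
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Xc @ eigvecs[:, :d], eigvals

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2)) * [3.0, 1.0]           # 2 true degrees of freedom
mixing = np.linalg.qr(rng.normal(size=(5, 2)))[0].T       # orthonormal 2x5 embedding
X = latent @ mixing + 0.01 * rng.normal(size=(500, 5))    # small isotropic noise

Y, eigvals = pca(X, d=2)
# Count eigenvalues that are not negligible next to the largest one
# (the 1% threshold is an arbitrary choice for this example).
n_significant = int((eigvals > 1e-2 * eigvals[0]).sum())
variance_retained = eigvals[:2].sum() / eigvals.sum()
```

The ratio `variance_retained` is the same kind of quantity as the "original variance retained" percentages reported later for PCA.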



Linear methods: MDS

- Metric Multidimensional Scaling [Bor97]
  - Spectral decomposition of the dot product matrix (computed in terms of Euclidean distances of zero-mean vectors)
  - Eigenvectors: low-dimensional embedding
  - Eigenvalues: measure how each dimension contributes to dot products. The number of significant (non-negative) eigenvalues estimates dimensionality
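The dot product matrix can be recovered from squared distances by double centering, B = -1/2 J D^2 J with J = I - (1/n) 1 1^T. The sketch below follows that standard construction on assumed toy data; function names are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(D2, 0.0))

def classical_mds(D, d):
    """Metric MDS: embed points in d dimensions from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # dot products of zero-mean vectors
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Y = eigvecs[:, :d] * np.sqrt(np.maximum(eigvals[:d], 0.0))
    return Y, eigvals

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # toy points in the plane (assumed)
D = pairwise_distances(X)
Y, eigvals = classical_mds(D, d=2)
D_rec = pairwise_distances(Y)                     # distances in the embedding
n_significant = int((eigvals > 1e-8).sum())       # estimated dimensionality
```

For Euclidean input distances the embedding reproduces them exactly (up to rotation and reflection), which is why MDS and PCA give equivalent results in that case.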


State of the art: spectral methods – manifold learning

Nonlinear methods: ISOMAP

- Preserve geodesic distances as estimated along the manifold
- Algorithm [Ten00]:
  - Build adjacency graph: vertices represent inputs, and edges weighted by local distances connect neighbors
  - Estimate geodesics: compute shortest paths through the graph
  - Metric MDS
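The three steps can be sketched in plain NumPy. This is an illustrative version, not the reference implementation of [Ten00]: the k-nearest-neighbor rule, the Floyd-Warshall shortest-path routine and the arc-shaped toy data are assumptions.

```python
import numpy as np

def isomap(X, n_neighbors, d):
    """Sketch of ISOMAP: neighbor graph, geodesic estimates, metric MDS."""
    n = len(X)
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

    # Step 1: adjacency graph. Each point is linked to its nearest neighbors,
    # edges are weighted by Euclidean distance, missing edges are infinite.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    neighbors = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    for i in range(n):
        G[i, neighbors[i]] = D[i, neighbors[i]]
        G[neighbors[i], i] = D[i, neighbors[i]]   # keep the graph symmetric

    # Step 2: geodesic estimates as shortest paths (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, k, None] + G[None, k, :])
    if np.isinf(G).any():
        raise ValueError("graph is not connected; increase n_neighbors")

    # Step 3: metric MDS on the geodesic distances.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:d]
    Y = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
    return Y, G

# Toy 1D manifold: three quarters of a circle embedded in the plane.
t = np.linspace(0.0, 1.5 * np.pi, 100)
X = np.column_stack([np.cos(t), np.sin(t)])
Y, G = isomap(X, n_neighbors=4, d=1)
corr = abs(np.corrcoef(Y[:, 0], t)[0, 1])   # agreement with the arc parameter
```

On this arc the single ISOMAP coordinate tracks the position along the curve, which a linear projection of the same points cannot do.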



Nonlinear methods: ISOMAP

- Assumptions
  - Graph is connected
  - Neighborhoods on graph reflect neighborhoods on manifold (no shortcuts)
  - Dense graph without “holes”



Nonlinear methods: ISOMAP

[Figure: ISOMAP example; axes: finger extension, wrist rotation]



Nonlinear methods: LLE

- Preserve local geometric relationships
- Algorithm [Row00]:
  - Nearest neighbor search
  - Characterize local geometry of each neighborhood by weights W_ij
  - Optimize low-dimensional outputs
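A compact NumPy sketch of the three LLE steps. The regularization term `reg` (needed when a neighborhood's local covariance is singular) and the toy arc data are assumptions of this example.

```python
import numpy as np

def lle(X, n_neighbors, d, reg=1e-3):
    """Sketch of LLE: neighbors, reconstruction weights, bottom eigenvectors."""
    n = len(X)
    sq = (X ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T

    # Step 1: nearest neighbor search (column 0 of the sort is the point itself).
    neighbors = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]

    # Step 2: weights W_ij that best reconstruct x_i from its neighbors,
    # constrained to sum to one.
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                         # neighbors centered on x_i
        C = Z @ Z.T                                        # local covariance
        C += reg * np.trace(C) * np.eye(n_neighbors)       # regularization (assumed)
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()

    # Step 3: outputs minimizing sum_i ||y_i - sum_j W_ij y_j||^2, given by the
    # bottom eigenvectors of M = (I - W)^T (I - W), skipping the constant one.
    I = np.eye(n)
    M = (I - W).T @ (I - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:d + 1], W, eigvals

t = np.linspace(0.0, 1.5 * np.pi, 100)
X = np.column_stack([np.cos(t), np.sin(t)])    # toy 1D manifold in the plane
Y, W, eigvals = lle(X, n_neighbors=6, d=1)
```

Each row of `W` has only `n_neighbors` non-zero entries, so for large data sets the matrix would normally be stored in sparse form.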



Nonlinear methods: LLE

- Different approach than ISOMAP
  - Preserve local geometry: assume neighbors lie on locally linear patches
  - Construct sparse matrix



Nonlinear methods: LLE

[Figure: LLE example; axes: pose, expression]


State of the art: spectral methods – information extraction

Nonlinear methods: Spectral clustering

- Discover non-convex clusters
- Graph partition problem (minimal cut)



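Nonlinear methods: Spectral clustering

- Relaxation of the Ncut problem
- Solution based on eigenvectors of an affinity matrix [Ng01]

$$A = \begin{pmatrix} A^{(11)} & 0 & 0 \\ 0 & A^{(22)} & 0 \\ 0 & 0 & A^{(33)} \end{pmatrix} \;\Rightarrow\; Y = \begin{pmatrix} v^{(1)} & 0 & 0 \\ 0 & v^{(2)} & 0 \\ 0 & 0 & v^{(3)} \end{pmatrix}$$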
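A sketch of spectral clustering in the style of [Ng01]: Gaussian affinity, normalized affinity matrix, top-k eigenvectors, row normalization, then k-means on the rows. The small deterministic k-means, the value of `sigma` and the two-ring toy data are assumptions of this example.

```python
import numpy as np

def kmeans(Y, k, n_iter=50):
    """Small k-means with farthest-first initialization (deterministic)."""
    centers = [Y[0]]
    for _ in range(1, k):
        dist = np.min([((Y - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Y[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((Y[:, None, :] - centers[None]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels

def spectral_clustering(X, k, sigma):
    sq = (X ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    A = np.exp(-D2 / (2.0 * sigma ** 2))          # affinity matrix
    np.fill_diagonal(A, 0.0)
    deg = A.sum(axis=1)
    L = A / np.sqrt(np.outer(deg, deg))           # D^{-1/2} A D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, -k:]                           # k largest eigenvectors
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)   # rows on the unit sphere
    return kmeans(Y, k), Y

# Two concentric rings: non-convex clusters that plain k-means cannot separate.
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([1.0 * ring, 4.0 * ring])
labels, Y = spectral_clustering(X, k=2, sigma=0.5)
```

With well-separated rings the affinity matrix is close to block diagonal, so the rows of `Y` collapse onto two nearly orthogonal points, one per ring.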




Contribution

Corpora

- OGI-MLTS
  - 100 files of spontaneous telephone speech (~45 s, 8 kHz)
  - Multilanguage (English, German, Hindi, Japanese, Mandarin, Spanish)
  - Phonetically labeled
- ANITA
  - 150 files of studio speech (~7 s, 16 kHz)
  - 6 speakers
  - Posed and stressed conditions
- MUSIC
  - 70 files (60 s, 16 kHz)
  - Classical, singing voice, rock, jazz



Some considerations

- Complexity
  - Isomap, SC-Kernel PCA: ~8000 vectors (~1 min of signal), 10 min if we use phonetic speech labels
  - LLE, LapEig, Landmark Isomap: ~10000 vectors
- Audio intrinsic dimensionality (MLE)
  - Speech: ~8-9 MFCC
  - Music: ~7-8 MFCC
  - Speech in stress conditions: dim - 1
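The slides do not say which MLE estimator was used. One common choice is the nearest-neighbor maximum likelihood estimator of Levina and Bickel, sketched below as an assumption. For each point, with T_j the distance to its j-th nearest neighbor, the local estimate is (k - 1) / sum_{j<k} log(T_k / T_j), and the local estimates are averaged.

```python
import numpy as np

def mle_intrinsic_dimension(X, k=10):
    """Nearest-neighbor MLE of intrinsic dimension (assumed estimator)."""
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    T = np.sort(D, axis=1)[:, 1:k + 1]             # distances to the k nearest neighbors
    log_ratios = np.log(T[:, -1:] / T[:, :-1])     # log(T_k / T_j), j = 1..k-1
    local_dims = (k - 1) / log_ratios.sum(axis=1)  # one estimate per point
    return float(local_dims.mean())

# Toy check: a 2D plane embedded in a 10-dimensional space (assumed data).
rng = np.random.default_rng(0)
plane = rng.uniform(size=(1000, 2))
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]   # 10x2, orthonormal columns
X = plane @ basis.T
estimate = mle_intrinsic_dimension(X, k=10)
```

The estimate depends on `k` and is slightly biased upward for small `k`, so in practice it is usually averaged over a range of neighborhood sizes.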


- acoustic information in low-dimensional spaces
- speech segmentation and labeling
- content visualization of audio databases


Contribution: acoustic information in low-dimensional spaces

Speech manifolds: speech structure

- OGI sequence
- 15 MFCC
- Simplified phonetic labels
- ISOMAP discovers a particular distribution of phonetic classes



Eigenvalues as intrinsic dimensionality estimators

- OGI sequence
- 15 MFCC
- Original variance retained in the first 6 dimensions:
  - PCA: 74.27%
  - Kernel PCA: 86.84%
  - ISOMAP: 89.80%



Speech manifolds: speech and music

- 20 s of audio signal containing speech and music
- 15 MFCC
- Laplacian eigenmaps
- Different zones of variation



Information extraction: a new kind of projections

- OGI sequences in English, Mandarin and Spanish
- 15 MFCC
- Spectral clustering
- Different geometric structure than manifold learning approach



Information extraction: labels


Contribution: speech segmentation and labeling

Temporal spectral clustering

- OGI sequences
- 15 MFCC + Δ + ΔΔ
- Classical SC affinity matrix
- New metric applied to the main diagonal

[Figure: affinity matrices A and A’]



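Temporal spectral clustering

$$a_{ik} = \begin{cases} e^{-\frac{\|x_i - x_k\|^2}{2\sigma^2}} & \text{if } x_k \in S_i \\ 0 & \text{otherwise} \end{cases} \quad \text{for } i < k, \qquad a_{ki} = a_{ik}$$

- The new metric takes into account temporal closeness between vectors
- Eigenvectors of A’ are associated with segments of the signal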
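A sketch of how such a temporally restricted affinity matrix could be built. The slides do not define the set S_i; here it is assumed to be the `window` frames that follow frame i, and the diagonal is assumed to be zero. The function name, `window`, and the toy sequence are hypothetical.

```python
import numpy as np

def temporal_affinity(X, sigma, window):
    """Gaussian affinity kept only between frames that are close in time.

    Assumption: S_i = {x_k : i < k <= i + window}; every other entry is 0.
    """
    n = len(X)
    A = np.zeros((n, n))
    for i in range(n):
        for k in range(i + 1, min(i + window, n - 1) + 1):
            a = np.exp(-np.sum((X[i] - X[k]) ** 2) / (2.0 * sigma ** 2))
            A[i, k] = a
            A[k, i] = a        # a_ki = a_ik
    return A

# Toy sequence with two steady segments of 10 frames each (assumed data).
rng = np.random.default_rng(0)
X = np.vstack([np.zeros((10, 2)), 5.0 * np.ones((10, 2))])
X = X + 0.1 * rng.normal(size=X.shape)
A = temporal_affinity(X, sigma=1.0, window=3)
```

The result is a band around the main diagonal. Inside a steady segment the band entries are close to 1, and they drop towards 0 where the signal changes, which is what lets the eigenvectors pick out segments.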



TSC: results



SCV labeling

- Segments produced by TSC
- MFCC from the middle of each segment
- Kernel PCA
- k-means (k=3)
- Labeling of clusters according to their mean energy



TSC-SCV labeling: test conditions & results

- 40 minutes of speech from the OGI corpus (6 languages)
- Results:
  - 74.66% accuracy compared to manual labeling
  - Fbd + VActivity + Edetection [AO88]: 72.66%
  - HMM system: 81.22%



Application: projection alignment

- Spectral projections can randomly rotate
- After SCV labeling
  - Mean of S cluster on the positive side of X
  - Mean of V cluster on the positive side of Y
  - Mean of S cluster on the positive side of Z
- We can now model and compare projections
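One simple way to apply such a convention, assuming that flipping the sign of individual axes is enough to bring projections into a common orientation. The function name, the label strings and the toy projection are hypothetical; the axis-to-cluster assignment follows the list above as written.

```python
import numpy as np

def align_projection(Y, labels, axis_clusters=("S", "V", "S")):
    """Flip axes so that each reference cluster has a positive mean coordinate.

    axis_clusters[a] names the cluster whose mean must be positive on axis a.
    """
    Y = np.array(Y, dtype=float)          # work on a copy
    labels = np.asarray(labels)
    for axis, cluster in enumerate(axis_clusters):
        if Y[labels == cluster, axis].mean() < 0:
            Y[:, axis] *= -1.0            # reflection of one axis
    return Y

# Toy 3D projection with S, C and V labels (assumed data).
rng = np.random.default_rng(0)
labels = np.array(["S"] * 20 + ["C"] * 20 + ["V"] * 20)
Y = rng.normal(size=(60, 3))
Y[labels == "S"] += [-2.0, 0.0, -1.0]
Y[labels == "V"] += [0.0, -2.0, 0.0]
aligned = align_projection(Y, labels)
```

Axis flips change no pairwise distance, so the aligned projection keeps the same geometry while becoming comparable with other projections aligned the same way.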



Application: Voiced C - Non Voiced C labeling

- After TSC-SCV, Isomap with consonants
- 67.08% accuracy


Contribution: content visualization of audio databases

Audio databases

- Speech/music
- Music
- Languages
- Speakers
- Proposal: Visualization of acoustic sequences in 3D spaces!
  - Unsupervised and supervised analysis



KL system



KL system: speech – music database

- 60 files
  - 30 from music db (60 s)
  - 30 from OGI (45 s)
- 15 MFCC
- GMM 16 components
- 2 well-defined clusters



KL system: music results

- Music cluster files
  - 9 singing voice
  - 17 instrumental
  - 30 rock/jazz



KL system: languages database

- 60 OGI files, 3 languages (English, Italian, Mandarin)
- MFCC-SDC parameters
- Very difficult task



KL system: speakers databases

- 6 speakers from ANITA corpus
  - 3 women, 3 men
- 25 files per speaker
- 15 MFCC + Δ
- GMM 32 components
- SC eigengap indicates 6 clusters in the set



KL-CV system: two modeling spaces



KL-CV system: speakers database

- 6 speakers from [...]
- [...] components




SV system



SV system: speakers database

- 6 speakers from [...]
- [...] components




Supervised learning results on speakers database

- SVM multiclass, one vs. all configuration
  - 90 files for learning, 60 files for tests
- KL system
  - 0% test error, 85 support vectors
- KL-C system
  - 3.33% test error, 22 support vectors
- KL-V system
  - 3.33% test error, 17 support vectors ?
- SV system
  - 6.66% test error, 33 support vectors




Conclusions and perspectives

Conclusions

- Spectral matrices => kernel matrices
- Intrinsic dimensionality < original MFCC dimensionality
- Speech manifolds
  - Particular structure
  - Hints to phonetic and perceptive studies
- Interpretation of speech invariants



Conclusions

- Speech segmentation and labeling
  - Original approach
  - Good results and several applications
- Several proposals to transform variable-length acoustic sequences into 3D vectors
  - Similarity measure between sequences
  - Unsupervised and supervised analysis of results



Future work

- Generalize regression, classification and clustering in manifolds
- Study intra-inter speaker variations
- Identify intrinsic dimensions of speech and music
- Source separation
- Framework for time series studies
  - Speech coding schemes
  - Statistical modeling of sequences, distance measures between models


Bibliography

- [Alp04] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
- [AO88] R. André-Obrecht. A new statistical approach for automatic speech segmentation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1988.
- [Bor97] I. Borg, P. Groenen. Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
- [Evg99] T. Evgeniou, M. Pontil, T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 1999.
- [Ng01] A. Ng, M. Jordan, Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, MIT Press, 2001.
- [Row00] S. Roweis, L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.
- [Ten00] J. Tenenbaum, V. de Silva, J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 2000.


Thank you!
