on the use of statistical tools for audio processing mathieu lagrange and juan josé burred analyse...
TRANSCRIPT
On the use of statistical tools for audio processing
Mathieu Lagrange and Juan José BurredAnalyse / Synthèse Team, [email protected]
École centrale de NantesFilière ISBA
Mathieu Lagrange. Statistical Tools for Audio Processing.
2
Outline
1. Introduction1. Context and challenges
2. Past and Present1. Speech
1. Model
2. Applications (coding, speaker recognition, speech recognition)
2. Audio (Music)
1. Sound models
2. Applications (classification, similarity)
3. Limitations
3. Sound source separation1. Paradigms, tasks and applications
2. Mixing Models
3. Methods for the under determined case
4. Clustering of Spectral Audio (CoSA)1. Auditory Scene Analysis (ASA)
2. Clustering
Mathieu Lagrange. Statistical Tools for Audio Processing.
3
Outline
1. Introduction1. Context and challenges
Mathieu Lagrange. Statistical Tools for Audio Processing.
Technological Context« We are drowning in information and starving for knowledge »
R. Roger
• Needs:
o Measurement
o Transmission
o Access
• Aim of a numerical representation:
o Precision
o Efficiency
o Relevance
• Means
o Mechanical biology
o Psycho-acoustic
o Cognition
4
Mathieu Lagrange. Statistical Tools for Audio Processing.
Challenges
« Forty-two! yelled Loonquawl. Is that all you've got to show for seven and a half million years' work? »
D. Adams
Music is great to study as it is both:
• object : arrangement de sons et de silences au cours du temps
• function: more or less codified form of expression of :
o Individual feelings (mood)
o Collective feelings (party, singing, dance)
5
Mathieu Lagrange. Statistical Tools for Audio Processing.
Audio Processing: Past and Present
6
Mathieu Lagrange. Statistical Tools for Audio Processing.
7
Outline
1. Introduction1. Context and challenges
2. Past and Present1. Speech
1. Model
2. Applications (coding, speaker recognition, speech recognition)
Mathieu Lagrange. Statistical Tools for Audio Processing.
Speech signal
• The speech signal is produced when the air flow coming from the lungs go through the vocal chords and the vocal tract.
o The size and the shape of the vocal tract as well as the vocal chords excitations are changing relatively slowly
o The speech signal can therefore be considered as quasi-stationary over short period of about 20 ms.
• Type of speech production
o Voiced: <a>, <e>, …
o Unvoiced: <s>, <ch>,
o Plosives: <pe>, <ke>
Mathieu Lagrange. Statistical Tools for Audio Processing.
Source / Filter Model
• In the case of an idealized voiced speech signal, the vocal chords are producing a perfectly periodic harmonic signal
• The influence of the vocal tract can be considered as a filtering with a given frequency response whose maximas are called formants.
Mathieu Lagrange. Statistical Tools for Audio Processing.
Source / Filter Coding• Algorithm :
o Voiced / Unvoiced detection;
o Voiced case: the source signal is approximated with a Dirac comb:
o a Dirac comb whose successive Diracs are respectively T spaced by T as a spectrum which is a Dirac comb whose successive combs are 1/T spaced.
o Parameters : T, gain
o Unvoiced: the source signal is approximated by a stochastic signal:
o Parameter : gain.
o The Source signal is next filtered.
o Parameters : filter coefficients.
Mathieu Lagrange. Statistical Tools for Audio Processing.
« Code-Excited Linear Predictive » (CELP)
For each frame of 20 ms :
o Auto-Regressive coefficients are computed such that the prediction error is minimized over the entire duration of the frame:
o Quantified coefficients and an index encoding the error signal are transmitted.
Mathieu Lagrange. Statistical Tools for Audio Processing.
« Code-Excited Linear Predictive » (CELP)
Signal ResidualARCoefficients
index
Mathieu Lagrange. Statistical Tools for Audio Processing.
Speaker Recognition• Classical pattern recognition
problem
• Specific problems:
o Open Set / Closed Set: rejection problem
o Identification / Verification
o Text Dependency
• Method
o Feature extraction: model each speech with Mel-Frequency Cepstral Coefficients (MFCCs) and their derivatives.
o Classification
o Text independent: Vector Quantization Codebooks or Gaussian Mixture Models (GMMs)
o Text dependent: Dynamic Time Warping (DTW) or Hidden Markov Model (HMM)
13
s1s2
s3
Mathieu Lagrange. Statistical Tools for Audio Processing.
Speech recognition• An Automatic Speech Recognition System is
typically decomposed into:
o Feature Extraction: MFCCs
o Acoustic Models: HMMs trained for set of phones
o Each phone is modelled with 3 states
o Pronunciation dictionary: convert a series of phones into a word
o Language Model: predict the likelihood of specific words occurring one after another with n-grams
14
(Fig. from HTK documentation)
Mathieu Lagrange. Statistical Tools for Audio Processing.
MFCCs rules ?Mel Frequency Cepstral Coefficients are commonly
derived as follows:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform (DCT) of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
15
Mathieu Lagrange. Statistical Tools for Audio Processing.
MFCCs Rules ?
16
Mathieu Lagrange. Statistical Tools for Audio Processing.
Issues with MFCCs computation steps
• The MEL frequency wraping:
o highly criticized form a perceptual point of view (Greenwood)
o conceptually: periodicity analysis over data that are not periodic anymore (Camacho)
• The Cepstral Coefficients are COSINE coefficients:
o cannot shift with speaker size to capture the shift in formant frequencies that occurs as children grow up and their vocal tracts get longer
• Not a sound representation:
o no way to provide enhancements such as speaker and channel adaptation, background noise suppression, source separation
17
Mathieu Lagrange. Statistical Tools for Audio Processing.
Potentials of the DCT step• Observation of Pols that the main components
capture most of the variance using a few smooth basis functions, smoothing away the pitch ripples
• Principal components of a collection of vowel spectra on a warped frequency scale aren't so far from the cosine basis functions
• Decorrelates the features.
o This is important because the MFCC are in most cases modelled by Gaussians with diagonal covariance matrices
18
Mathieu Lagrange. Statistical Tools for Audio Processing.
19
Outline
1. Introduction1. Context and challenges
2. Past and Present1. Speech
1. Model
2. Applications (coding, speaker recognition, speech recognition)
2. Audio (Music)
1. Sound models
2. Applications (classification, similarity)
3. Limitations
Mathieu Lagrange. Statistical Tools for Audio Processing.
Sound Models• Major classes of sounds
1. Transients (castanets, …)
2. Pseudo periodic (flute, …)
3. Stochastic (waves, …)
• Models
1. Impulsive noise
2. Sum of sinusoids
3. Wide band noise
20
Mathieu Lagrange. Statistical Tools for Audio Processing.
Classification• Method: [Tzanetakis’02]
o Agree on mutually exclusive set of tags (the ontology)
o Extract features from audio (MFCCs and variations)
o Train statistical models:
o Due to the high dimensionality of the feature vectors discriminatives approaches are prefered (SVMs)
• Segmentation
o Smoothing decision using dynamic programming (DP)
21
Tzanetakis’02Tzanetakis, G. Cook, P. Musical Genre Classification of Audio Signals
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING 2002
(Fig. from [Ramona07])
Mathieu Lagrange. Statistical Tools for Audio Processing.
Multi-Class Discriminative Classification
• Usually performed by combining binary classifiers
• Two approaches:
o One-vs-all: For each class build a classifier for that class versus the rest
o Often very imbalanced classifiers (use asymmetric regularization)
o All-vs-all Build a classifier for each couple of class
o A priori a large number of classifiers to build but the pairwise classification are faster and the classifications are balanced (easier to find the best regularization)
22
s1s2
s3
Mathieu Lagrange. Statistical Tools for Audio Processing.
Multi-Label Discriminative Classification
• Each object may be tagged using several labels
• Computational approaches
o Power Sets
o Binary Relevance (equivalent to one-vs-all)
• Multiple criteria:
o « Flattening » the ontology
o Research trend: considering the ontology structure to benefit from co-occurrence labels of different semantic criterion
23
C1 C2
C3
Mathieu Lagrange. Statistical Tools for Audio Processing.
Music Similarity• Question to solve: « Given a seed song, provide us
with the entries of the database which are the most similar »
• Annotation type: Artist / Album
• Method: [Aucouturier’04]
o Songs are modeled as GMM of MFCCs
o Proximity of GMMs are considered as similiarity measure:
o Likelihood (requires access to the MFCCs)
o Sampling
24
[Aucouturier’04] J.-J. Aucouturier and F. Pachet. Improving Timbre Similarity: How High is the Sky? Journal of Negative Results in Speech and Audio Sciences, 1 (1), 2004.
Mathieu Lagrange. Statistical Tools for Audio Processing.
Cover Version Detection• Question to solve: « Given a seed song, provide us
with the entries of the database which are cover versions »
• Annotation: canonical song
• Method: [Serra’08]
o Songs are modeled as a time series of Chromas
o Computation of the similarity matrix between the two time series
o Similarity is measured using Dynamic Programming Local Alignment
25
[Serra’08] Chroma Binary Similarity and Local AlignmentApplied to Cover Song Identification, IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2008
(Fig. from [Serr08])
Mathieu Lagrange. Statistical Tools for Audio Processing.
Limitations• Description of audio and music
o Polyphonic
o Multiple shapes varying in various ways
• Statistical Modeling
o Curse of dimensionality
o Sense of structure relevant at multiple levels of temporality
26
Mathieu Lagrange. Statistical Tools for Audio Processing.
27
Outline
1. Introduction1. Context and challenges
2. Past and Present1. Speech
1. Model
2. Applications (coding, speaker recognition, speech recognition)
2. Audio (Music)
1. Sound models
2. Applications (classification, similarity)
3. Limitations
3. Sound source separation1. Paradigms, tasks and applications
2. Mixing Models
3. Methods for the under determined case
Mathieu Lagrange. Statistical Tools for Audio Processing.
28
Sound Source Separation• “Cocktail party effect”
o E. C. Cherry, 1953.
o Ability to concentrate attention on a specific sound source from within a mixture.
o Even when interfering energy is close to energy of desired source.
• “Prince Shotoku Challenge”o Legendary Japanese prince Shotoku (6th
Century AD) could listen and understand simultaneously the petitions by ten people.
o Concentrate attention on several sources at the same time!
o “Prince Shotoku Computer” (Okuno et al., 1997)
• Both allegories imply an extra step of semantic understanding of the sources, beyond mere acoustical isolation.
[Cherry53]
[Okuno97]
E. C. Cherry. Some Experiments on the Recognition of Speech, With One and Two Ears. Journal of the Acoustical Society of America, Vol. 25, 1953.H. G. Okuno, T. Nakatani and T. Kawabata. Understanging Three Simultaneous Speeches. Proc. Int. Joint Conference on Artificial Intelligence (IJCAI), Nagoya, Japan, 1997.
Mathieu Lagrange. Statistical Tools for Audio Processing.
29
The paradigms of Musical Source Separation
• (based on [Scheirer00])
Understanding without separation
E.g. music genre classification
“Glass ceiling” of traditional methods (MFCC+GMM) [Aucouturier&Pachet04]
Separation for understanding
First (partially) separate, then feature extraction
Source separation as a way to break the glass ceiling?
o Separation without understanding
BSS: Blind Source Separation (ICA, ISA, NMF)
Blind means: only very general statistical assumptions taken.
Understanding for separation
Supervised source separation (based on a training database)
[Scheirer00][Aucouturier&Pachet04]
E. D. Scheirer. Music-Listening Systems. PhD thesis, Massachusetts Institute of Technology, 2000.J.-J. Aucouturier and F. Pachet. Improving Timbre Similarity: How High is the Sky? Journal of Negative Results in Speech and Audio Sciences, 1 (1), 2004.
Mathieu Lagrange. Statistical Tools for Audio Processing.
30
Required sound quality• Audio Quality Oriented (AQO)
o Aimed at full unmixing at the highest possible quality.o Applications:
o Unmixing, remixing, upmixingo Hearing aidso Post-production
• Significance Oriented (SO)o Separation quality just enough for facilitating semantic
analysis of complex signals.o Less demanding, more realistic.o Applications:
o Music Information Retrievalo Polyphonic Transcriptiono Object-based audio coding
Mathieu Lagrange. Statistical Tools for Audio Processing.
31
Musical Source Separation Tasks
• Classification according to the nature of the mixtures:
• Classification according to available a priori information:
Mathieu Lagrange. Statistical Tools for Audio Processing.
32
Linear mixing model• Only amplitude scaling before mixing (summing)
• Linear stereo recording setups:XY Stereo MS Stereo Close miking Direct injection
Mathieu Lagrange. Statistical Tools for Audio Processing.
33
• Amplitude scaling and delay before mixing
• Delayed stereo recording setups:
Delayed mixing model
AB Stereo Mixed StereoClose miking
with delayDirect injection
with delay
Mathieu Lagrange. Statistical Tools for Audio Processing.
34
Convolutive mixing model• Filtering between sources and sensors
• Convolutive stereo recording setups:
Reverberant environmentBinaural
Close miking with reverb
Direct injectionwith reverb
Mathieu Lagrange. Statistical Tools for Audio Processing.
35
Some terminology• System of linear equations:
o Usual algebraic methods from high school: X known, A known, S unknown
o But in source separation: unknown variables (S, sources) AND unknown coefficients (A, mixing matrix)
• Algebra terminology is retained for source separation:o More equations (mixtures) than unknowns (sources): overdetermined
o Same number of equations (mixtures) than unknowns (sources): determined (square A)
o Less equations (mixtures) than unknowns (sources): underdetermined
• The underdetermined case is the most demanding, but also the most important for music!
o Music is (still) mostly in stereo, with usually more than 2 instruments
o Overdetermined and determined situtations are only of interest for arrays of sensors or arrays of microphones (localization, tracking)
Mathieu Lagrange. Statistical Tools for Audio Processing.
36
Binaural Case (1)• Goal: find a mask M that retrieves one source when used to
filter a given time-frequency representation.
• DUET (Degenerate Unmixing Estimation Technique) [Yilmaz&Rickard04]
o Histogram of Interchannel Intensity (IID) and Phase (IPD) Differences
o Binary Mask created by selecting bins around histogram peaks.
• Drawback of t-f masking: “musical noise” or “burbling” artifacts
º is the Hadamard (element-wise) product
(Fig. from [Vincent06])
(Fig. from [Yilmaz&Rickard04])
[Yilmaz&Rickard04]
Ö. Yilmaz and S. Rickard. Blind Separation of Speech Mixtures via Time-Frequency Masking. IEEE Trans. on Signal Processing. Vol. 52(7), July 2004
Mathieu Lagrange. Statistical Tools for Audio Processing.
37
Binaural Case (2)• Human-assisted time-frequency masking
[Vinyes06]
o Human-assisted selection of the time-frequency bins out of the DUET-like histogram for creating the unmixing mask
o Implementation as a VST plugin (“Audio Scanner”)
[Vinyes06] M. Vinyes, J. Bonada and A. Loscos. Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking. 120th AES convention, Paris, France, 2006.
Mathieu Lagrange. Statistical Tools for Audio Processing.
38
Monaural Case• Classification according to a priori knowledge
o Supervised
o Based on training the model with a sound example database
o Better quality and more demanding situations at the cost of less generality
o Unsupervised
• Classification according to model type
o Adaptive basis decompositions (ISA, NMF, NSC)
o Sinusoidal Modeling
• Classification according to mixture type
o Monaural systems
o Hybrid systems combining advanced source models with spatial diversity
Mathieu Lagrange. Statistical Tools for Audio Processing.
39
Independent Subspace Analysis
• Application of ISA to audio: Casey and Westner, 2000.
• Application of ICA to the spectogram of a mono mixture.
• Each independent component corresponds to an independent subspace of the spectrogram.
• Component-to-source clustering
o The extracted components usually do not directly correspond to the sources.
o They must be clustered together according to some similarity criterion.
o Casey&Westner use a matrix of Kullback-Leibler divergences called the ixegram.
(Fig. from [Casey&Westner00])
[Casey&Westner00]
M. Casey and A. Westner. Separation of Mixed Audio Sources by Independent Subspace Analysis. Proc, Int. Computer Music Conference (ICMC), Berlin, Germany, 2000.
Mathieu Lagrange. Statistical Tools for Audio Processing.
ICA for Audio
40
(Figs from Virtanen)
Mathieu Lagrange. Statistical Tools for Audio Processing.
41
Nonnegative Matrix Factorization• Matrix factorization ( ) imposing non-
negativity.
• Needed when using magnitude or power spectrograms.
• NMF does not aim at statistical independence, but:o It has been proven that, under some conditions, non-negativity is
sufficient for separation.
o NMF yields components that very closely correspond to the sources.
o To date, there is no exact theoretical explanation why is that so!
• Use for transcription:
o P. Smaragdis and J.C. Brown. Non-Negative Matrix Factorization for Polyphonic Music Transcription. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2003.
• Use for separation:
o B. Wang and M. D. Plumbley. Musical Audio Stream Separation by Non-Negative Matrix Factorization. Proc. UK Digital Music Research Network (DMRN) Summer Conf., 2005.
Mathieu Lagrange. Statistical Tools for Audio Processing.
NMF for Audio
42
(Figs from Virtanen)
Mathieu Lagrange. Statistical Tools for Audio Processing.
NMF for Vision• By representing signals as a sum purely additive,
non- negative sources, we get a parts-based representation [Lee’99]
43
[Lee’99]Lee and Seung, Learning the parts of objects by nonnegative matrix factorization, Nature, 1999, 41
Mathieu Lagrange. Statistical Tools for Audio Processing.
44
Nonnegative Sparse Coding• Combination of non-negativity and sparsity constraints in the
factorization.
• [Virtanen03]: NSC is optimized with an additional criterion of temporal continuity.
o Measured by the absolute value of the overall amplitude difference between consecutive frames.
• [Virtanen04]: Convolutive Sparse Coding
o Improved temporal accuracy by modeling the sources as the convolution of spectrograms with a vector of onsets.
Mixture Component 1 Component 2
Mixture Component 1 Component 2
[Virtanen03]
[Virtanen04]
T. Virtanen. Sound Source Separation Using Sparse Coding with Temporal Continuity Objective. Proc. Int. Computer Music Conference (ICMC), Singapore, 2003.T. Virtanen. Separation of Sound Sources by Convolutive Sparse Coding. Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA), Jeju, Korea, 2004.
Mathieu Lagrange. Statistical Tools for Audio Processing.
45
Sinusoidal Methods• Sinusoidal Modeling: detection and tracking
of the sinusoidal partial peaks on the spectrogram.
• Based on Auditory Scene Analysis (ASA) cues of good-continuation, common fate
and smoothness of sinusoidal tracks.
• Overall, very good reduction of interfering sources, but moderate timbral
quality.
o Appropriate for Significance-Oriented applications
• [Virtanen&Klapuri02]: model of spectral smoothness of harmonic sounds
o Based on basis decomposition of harmonic structures
o Additive resynthesis of partial parameters
• [Every&Szymanski06]
o Spectral subtraction instead of additive resynthesis
(Fig. from [Every06])
MixtureSeparated
sources
[Virtanen&Klapuri02]
[Every&Szymanski06]
T. Virtanen and A. Klapuri. Separation of Harmonic Sounds Using Linear Models for the Overtone Series. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Orlando, USA, 2002.M. R. Every and J. E. Szymanski. Separation of Synchronous Pitched Notes by Spectral Filtering of Harmonics. IEEE Trans. on Audio, Speech and Signal Processing. Vol. 14(5), 2006.
Mathieu Lagrange. Statistical Tools for Audio Processing.
46
Supervised Methods (1)• Use of a training database to create a set of source models,
each one modeling a specific instrument.
o Better separation as a trade-off for generality.
• Supervised sinusoidal methods
o [Burred&Sikora07]
o The source models are compact descriptions of the spectral envelope
and its temporal evolution.
o The detailed temporal evolution allows to ignore harmonicity constraints,
and thus separation of chords and inharmonic sounds is possible.
Mixture Separated sources
Mixture Separated sources
Separation of chords
Inharmonic separation
[Burred&Sikora07]J.J. Burred and T. Sikora. Monaural Source Separation from Musical Mixtures Based on Time-Frequency Timbre Models. Proc. Int. Conf. on Music Information Retrieval (ISMIR), Vienna, Austria, September 2007.
Mathieu Lagrange. Statistical Tools for Audio Processing.
47
Supervised Methods (2)• Bayesian Networks
o [Vincent06]
o Multilayered model describing note probabilities (state layer), spectral decomposition (source layer) and spatial information (mixture layer).
o Trained on a database of isolated notes.
o Allows separation of sounds with reverb.
• Learnt priors for Wiener-based separation
o [Ozerov05]
o Single-channel
o GMM models of singing voice and accompaniment.
Mixture Separated sources
Separated sources
Mixture
[Vincent06]
[Ozerov05]
E. Vincent. Musical Source Separation Using Time-Frequency Source Priors. IEEE Trans. on Audio, Speech and Language Processing, Vol. 14 (1), 2006.A. Ozerov, O. Philippe, R. Gribonval and F. Bimbot. One Microphone Singing Voice Separation Using Source-Adapted Models. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2005.
Mathieu Lagrange. Statistical Tools for Audio Processing.
48
Conclusions• Still far from fully-general, audio-quality-oriented
system.
• More realistic: significance orientedo Separation good enough to facilitate content analysis
• Methods based on adaptive models, time-frequency masking:
o More realistic mixtures, but more artifacts and interferences
• Methods based on sinusoidal modeling:o More artificial timbre, but less interferences.
• Current polyphony limitations:o Mono signals: up to 3, 4 instruments
o Stereo signals: up to 5, 6 instruments
Mathieu Lagrange. Statistical Tools for Audio Processing.
49
Outline
1. Introduction1. Context and challenges
2. Past and Present1. Speech
1. Model
2. Applications (coding, speaker recognition, speech recognition)
2. Audio (Music)
1. Sound models
2. Applications (classification, similarity)
3. Limitations
3. Sound source separation1. Paradigms, tasks and applications
2. Mixing Models
3. Methods for the under determined case
4. Clustering of Spectral Audio (CoSA)1. Auditory Scene Analysis (ASA)
2. Clustering
Mathieu Lagrange. Statistical Tools for Audio Processing.
Binary Masking with Oracle• Binary masking is an
effective way of performing the separation
• Using an oracle allows to assess the relevance of Fourier spectrograms as an atomic representation
• The binary mask is set to 1 if the source of interest is dominant in the considered frequency bin, and 0 otherwise
50
STFT ISTFT
|STFT|
Mask
Mathieu Lagrange. Statistical Tools for Audio Processing.
Auditory Scene Analysis• Formalism proposed by
psycho-acousticians [Bregman’90]
• Main principle :
o The scene can be decomposed into a set of atoms
o A first level of structuration clusters atoms into entities (notes)
o A second one clusters entities into streams (voices)
51
[Bregman’90]
A. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, The MIT Press, 1990
Mathieu Lagrange. Statistical Tools for Audio Processing.
Affinity Cues• Proximity
o Time
o Frequency
o Amplitude
• Dynamic
• Harmonicity
52
Mathieu Lagrange. Statistical Tools for Audio Processing.
Vector-based Clustering• Each object of the dataset is
described as a set of features
• For that purpose “k-means” is widely used due to its efficiency
• Aim: minimize the within-cluster sum of squares (WCSS)
• 2-Steps iterative method:
o Assignment step: Assign each observation to the cluster with the closest mean
o Update step: Calculate the new means to be the centroid of the observations in the cluster
53
Mathieu Lagrange. Statistical Tools for Audio Processing.
Graph-based Clustering• Given a set of data points A, the similarity matrix
may be defined as a matrix W where w(i, j) represents a measure of the similarity between points.
• More generic approach, as vector-based descriptions can be trivially converted
• Many existing methods:
o Agglomerative Hierarchical Clustering (AHC)
o k-medoids
o Spectral clustering
54
Mathieu Lagrange. Statistical Tools for Audio Processing.
Spectral Clustering (1)• Spectral clustering techniques make use of
o the spectrum of the similarity matrix of the data (eigenvectors)
o to perform dimensionality reduction for clustering in fewer dimensions.
55
Mathieu Lagrange. Statistical Tools for Audio Processing.
Spectral Clustering (2)• Method:
o Compute the similarity between each objects (W)
o Computes the normalized laplacian (L):
o With D the degree matrix defined as the diagonal matrix with the degrees on the diagonal:
o Select the eigenvectors corresponding to the k largest eigenvalues of the laplacian and normalize them by rows
o Use those eigenvectors for clustering with k-means
56
Mathieu Lagrange. Statistical Tools for Audio Processing.
Spectral Clustering (3)• Can be viewed as a relaxation of the Normalized
Cuts problem:
57
Mathieu Lagrange. Statistical Tools for Audio Processing.
Performance Criterion• Normalized Mutual Information
o Given 2 sets of labels (the ground truth and the cluster results) estimate the degree of matching between the 2
• The NMI is between 0 and 1, 1 being a perfect match.
58
€
NMI(X,Y ) =H(X) −H(X |Y )
(H(X) +H(Y )) /2
H(X |Y ) = − p(x,y)log(p(x | y))∑
Mathieu Lagrange. Statistical Tools for Audio Processing.
Clustering of Spectral Audio (CoSA)
• Aim: apply clustering method for generating the binary mask
• Method:
o Spectrogram as the atomic representation
o Prune the representation to retain only high amplitude components
o Compute the features related to each retained components
o Split the spectrogram into contiguous texture windows
o For each texture window
o Apply clustering
o Select clusters that are likely to belong to the source of interest
o Apply spectrogram inversion59
Mathieu Lagrange. Statistical Tools for Audio Processing.
Performance Criteria• NMI
o Does not consider the amplitude of the spectral components
• Spectral domain Signal to Noise Ratio
o Does not consider phase and framing issues
• Time domain Signal to Noise Ratio
o Is still not perfect as it is only vaguely related to perception
60
€
SSNR(X,Y ) =| Xn,m |
| Xn,m | − |Yn,m |m=1
M
∑n=1
N
∑
€
TSNR(x,y) =1
N
x(nT + t)2
(x(nT + t) − y(nT + t))2t=1
T
∑n=1
N
∑
Mathieu Lagrange. Statistical Tools for Audio Processing.
61
Literature• Very few overview materials on Musical Source
Separation
• P. D. O´Grady, B. A. Pearlmutter and S. T. Rickard. Survey of sparse and non-sparse methods in source separation. International Journal of Imaging Systems and Technology, 15(1). 2005.
• E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley and M. E. Davies. Model-based audio source separation. Technical Report C4DM-TR-05-01, Queen Mary University, London, UK, 2006.
• T. Virtanen. Unsupervised Learning Methods for Source Separation in Monaural Music Signals. Chapter in A. Klapuri, M. Davy (Eds.), Signal Processing Methods for Music Transcription, Springer 2006.
• Stereo Audio Source Separation Evaluation Campaign:
o http://sassec.gforge.inria.fr
• Tutorial (section 3 is an excerpt)
o Juan Jose Burred. Musical Source Separation: Principles and State of the Art. 2nd Int. Workshop on Learning Semantics of Audio Signals (LSAS)