
CS591k - 20th November - Fall 2003


Content-Based Retrieval of Music and Audio

Seminar : CS591k Multimedia Systems

By

Rahul Parthe

Anirudha Vaidya

Instructor : Dr. Donald Adjeroh


Introduction

• An audio search engine that retrieves, from a large database, the sound files most similar to an input query sound.

• Sounds are characterized by "templates" derived from a tree-based vector quantizer trained to maximize mutual information (MMI).


Basic Operation

• A corpus containing different classes of audio files is parameterized into feature vectors.

• Construction of a tree-based quantizer.

• Generation of the audio template using the parameterized data.

• The template is thus generated by capturing the salient characteristics of the input audio.

• Construction of a template for the query audio and matching it with the templates in the database.


Basic Operation [cont..]

Fig 1: Audio Template Construction


Audio Parameterization

• The basic objective is to parameterize the audio files into mel-scaled cepstral coefficients (MFCCs).

• The audio waveform sampled at 16 kHz is transformed into a sequence of 13-dimensional feature vectors (12 MFCC coefficients + energy term).


Audio Parameterization - Steps

• The audio is Hamming-windowed in overlapping steps. Each window is 25 ms wide, and the windows overlap so that 1 second of audio contains 500 windows.

• Calculate the log power spectrum of each window using a DFT.

• Mel-scaling. This emphasizes the mid-frequency bands in proportion to their perceptual importance.

• Transform the mel-scaled coefficients into cepstral coefficients using another DFT, which yields features that are uncorrelated across dimensions.

• The audio waveform is thus transformed into 13 dimensional feature vectors (12 MFCCs + energy).
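The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation behind the slides: the 2 ms hop is inferred from "500 windows per second", the 512-point FFT and 24 mel filters are assumed values, the mel filterbank is applied before the log (the common ordering), and a DCT-II stands in for the "another DFT" step. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (assumed design)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):          # rising edge of the triangle
            fb[i - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):          # falling edge of the triangle
            fb[i - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def mfcc_features(signal, sr=16000, win_ms=25.0, hop_ms=2.0,
                  n_fft=512, n_filters=24, n_ceps=12):
    """Return one 13-dimensional vector (12 MFCCs + log energy) per window."""
    win = int(sr * win_ms / 1000)         # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)         # 32 samples: about 500 windows/s
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hamming(win)
    fb = mel_filterbank(n_filters, n_fft, sr)
    # DCT-II basis for coefficients 1..n_ceps (coefficient 0 is skipped).
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_filters)
    feats = np.empty((n_frames, n_ceps + 1))
    for t in range(n_frames):
        frame = signal[t * hop: t * hop + win] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        log_mel = np.log(fb @ power + 1e-10)
        feats[t, :n_ceps] = dct @ log_mel
        feats[t, n_ceps] = np.log(np.sum(frame ** 2) + 1e-10)  # energy term
    return feats

# One second of a 440 Hz tone sampled at 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
features = mfcc_features(tone)
```

Because the last partial windows are dropped, one second of audio gives slightly fewer than 500 frames in this sketch.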


Audio Parameterization [cont..]

Fig 2: Audio parameterization into Mel cepstral coefficients


Tree-Structured Quantization

• The Q-tree is grown offline using as much training data as possible.

• Tree-based quantization is supervised, i.e. the quantizer learns the critical differences between classes while ignoring other variability.

• The advantage of this technique is that it can still find similarities between similar slides, intervals or scales, even though the time-dependent vectors are lumped into one time-independent template.


Tree Construction

• The quantizer tree partitions the feature space into distinct regions.

• Each threshold in the tree is chosen to maximize the mutual information I(X;C) between the data X and the associated class C.

• The best MMI split is found by considering all possible thresholds in all possible dimensions.

Consider an MMI split on dimension d with threshold value t. The hyperplane $x_d = t$ divides the set of N training vectors X into two sets, $X_a = \{x \in X : x_d \ge t\}$ and $X_b = \{x \in X : x_d < t\}$.

• First split (root node): the left child inherits $X_b$, the training samples below the threshold, while the right child inherits $X_a$, those at or above the threshold.

• The splitting process is repeated recursively on each child, producing more nodes and splits in the tree.
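The exhaustive search for the best split can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: candidate thresholds are taken as midpoints between consecutive distinct values in each dimension, class labels are integers 0..n_classes-1, and the mutual information is computed between the class and the binary "at or above threshold" indicator. The function names are hypothetical.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        return 0.0
    p = counts[counts > 0] / total
    return float(-(p * np.log2(p)).sum())

def split_mutual_information(x_d, labels, t, n_classes):
    """I(A;C) = H(C) - H(C|A), where A indicates whether x_d >= t."""
    above = x_d >= t
    n = len(labels)
    h_c = entropy(np.bincount(labels, minlength=n_classes))
    h_c_given_a = 0.0
    for side in (above, ~above):
        n_side = int(side.sum())
        if n_side:
            side_counts = np.bincount(labels[side], minlength=n_classes)
            h_c_given_a += (n_side / n) * entropy(side_counts)
    return h_c - h_c_given_a

def best_mmi_split(X, labels, n_classes):
    """Try every candidate threshold in every dimension; keep the best."""
    best = (-1.0, None, None)  # (mutual information, dimension, threshold)
    for d in range(X.shape[1]):
        values = np.unique(X[:, d])
        for t in (values[:-1] + values[1:]) / 2.0:
            mi = split_mutual_information(X[:, d], labels, t, n_classes)
            if mi > best[0]:
                best = (mi, d, t)
    return best

# Toy data: dimension 0 separates the two classes, dimension 1 is constant.
X = np.array([[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]])
labels = np.array([0, 0, 1, 1])
mi, dim, threshold = best_mmi_split(X, labels, n_classes=2)
```

On the toy data the search picks dimension 0 with a threshold of 1.5, which separates the two classes perfectly and yields one bit of mutual information.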


Tree Construction [cont..]

Each node in the tree corresponds to a hyper-rectangular cell in the feature space. The leaves of the tree partition the feature space into non-overlapping regions as shown.

Fig 4: Nearest neighbor MMI Tree


Estimating I(X;C)

H2 is the binary entropy function

The probabilities Pr(ci) & Pr(ai) can be defined as follows
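The formulas from this slide did not survive in the text. As a hedged sketch only, one standard way to write the mutual information for a single threshold uses a binary indicator A ("above the threshold") and count-based probability estimates. The symbols $N_i$ (number of training points in class $c_i$) and $A_i$ (number of those points above the threshold) are assumptions introduced here, and the slide's Pr(ai) plausibly corresponds to $\Pr(a \mid c_i)$.

```latex
\[
I(A;C) \;=\; H(A) - H(A \mid C)
       \;=\; H_2\!\big(\Pr(a)\big) \;-\; \sum_i \Pr(c_i)\, H_2\!\big(\Pr(a \mid c_i)\big)
\]
\[
H_2(p) \;=\; -p \log_2 p \;-\; (1-p)\log_2(1-p)
\]
\[
\Pr(c_i) \approx \frac{N_i}{N}, \qquad
\Pr(a \mid c_i) \approx \frac{A_i}{N_i}, \qquad
\Pr(a) \approx \frac{1}{N}\sum_i A_i
\]
```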


Stopping condition

• The stopping rule decides that further splits are unnecessary and stops the recursive splitting process.

• The best-split mutual information is weighted by the probability mass inside the cell to be split.

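The stopping metric for the cell $l_j$ is given as:

$\mathrm{stop}(l_j) = \frac{N_j}{N}\, I_j(X;C)$

where $I_j(X;C)$ is the best-split mutual information for cell $l_j$, $N_j$ is the number of data points in cell j, and N is the total number of data points.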
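As a small worked illustration, assuming the metric is the best-split mutual information weighted by N_j / N as the bullets describe, the check could look like this. The threshold value of 0.01 is purely hypothetical; the slides do not state one.

```python
def stop_metric(n_cell, n_total, best_split_mi):
    """Best-split mutual information weighted by the cell's probability mass."""
    return (n_cell / n_total) * best_split_mi

def should_stop(n_cell, n_total, best_split_mi, threshold=0.01):
    """Stop splitting a cell once the weighted gain drops below the threshold."""
    return stop_metric(n_cell, n_total, best_split_mi) < threshold

# A cell holding 50 of 1000 points keeps splitting; one holding 5 does not.
keep_going = not should_stop(50, 1000, 0.4)
give_up = should_stop(5, 1000, 0.4)
```

The weighting means a split with a large information gain can still be rejected when it affects only a tiny fraction of the training data.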


Template generation

The tree partitions the feature space into L non-overlapping regions, or cells, each of which corresponds to a leaf of the tree.

One approach is to label each leaf with a class name and use the tree as a classifier. This is not robust, because the classes overlap and a single leaf typically contains data from many classes.

The approach the paper suggests instead is to consider the ensemble of leaf probabilities from the quantized class data, in short, to use the histogram of leaf probabilities over a sequence of frames.

The resulting histogram captures the essential class qualities so that it can be compared with other histograms.
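A minimal sketch of the histogram step, assuming each frame has already been quantized to an integer leaf index in the range 0..L-1. The function name is hypothetical.

```python
import numpy as np

def make_template(leaf_indices, n_leaves):
    """Normalized histogram of leaf visits over a sequence of frames."""
    counts = np.bincount(leaf_indices, minlength=n_leaves).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts

# Four frames landing in leaves 0, 2, 2 and 3 of a 4-leaf tree.
template = make_template(np.array([0, 2, 2, 3]), n_leaves=4)
```

The order of the frames does not affect the result, which is why the template is time-independent.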


Template generation [cont]

Since the size of the tree determines the size of the templates, the tree can easily be pruned to vary the number of free parameters as the application requires, allowing better characterization of the data.

Because each node test is a comparison in a single dimension, quantization is rapid and takes only about log(N) comparisons for an N-leaf tree.

Visual approximation of the Vectors.
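The logarithmic quantization cost noted on this slide can be illustrated with a small tree descent. This is a sketch with a hypothetical `Node` structure, and it assumes a balanced tree and the "at or above the threshold goes right" convention.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class Node:
    dim: int = -1                    # dimension tested at an internal node
    threshold: float = 0.0
    below: Optional["Node"] = None   # child for x[dim] < threshold
    above: Optional["Node"] = None   # child for x[dim] >= threshold
    leaf_id: int = -1                # non-negative only at leaves

def quantize(node: Node, x: Sequence[float]) -> Tuple[int, int]:
    """Descend to a leaf; return (leaf index, number of comparisons made)."""
    comparisons = 0
    while node.leaf_id < 0:
        comparisons += 1
        node = node.above if x[node.dim] >= node.threshold else node.below
    return node.leaf_id, comparisons

# A balanced 4-leaf tree over a 2-dimensional feature space.
tree = Node(dim=0, threshold=0.5,
            below=Node(dim=1, threshold=0.5,
                       below=Node(leaf_id=0), above=Node(leaf_id=1)),
            above=Node(dim=1, threshold=0.5,
                       below=Node(leaf_id=2), above=Node(leaf_id=3)))

leaf, n_compares = quantize(tree, [0.2, 0.9])
```

Each step compares a single coordinate against a single threshold, so reaching one of the 4 leaves takes 2 comparisons here.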


Distance Metrics

The generated templates need to be compared with reference templates to determine which class they belong to. The comparison measures acoustic similarity.

Several distance measures have been proposed, but the two main ones in use are Euclidean distance and cosine distance.

The Euclidean measure treats the histograms as N-dimensional vectors and computes the L2 norm of their difference.

The cosine measure also treats the histograms as N-dimensional vectors but measures the angle between them. It is more effective because it is independent of the magnitude of the vectors.


Distance Metrics [cont..]

Euclidean Distance Measure

Cosine Distance Measure
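The formulas for the two measures are not reproduced in the text. A hedged sketch follows, assuming the Euclidean measure is the L2 norm of the difference and taking the cosine distance as one minus the cosine of the angle between the vectors. That is one common convention and may differ from the exact form used in the paper.

```python
import numpy as np

def euclidean_distance(p, q):
    """L2 norm of the difference between two templates."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.linalg.norm(p - q))

def cosine_distance(p, q):
    """One minus the cosine of the angle between two templates (assumed form)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    if denom == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - float(p @ q) / float(denom)

# Scaling a template changes the Euclidean distance but not the cosine distance.
p = np.array([0.5, 0.5, 0.0])
d_euclid = euclidean_distance(p, 2 * p)
d_cosine = cosine_distance(p, 2 * p)
```

The example shows the magnitude independence mentioned earlier: doubling every histogram bin leaves the cosine distance at zero.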


Classification

The query template is matched against the corpus templates using the distance measures discussed previously.

The results are returned as a list sorted by similarity, much like the results of a search engine such as Google.

The search is exhaustive: the query must be compared with every template in the database, so the cost grows with the size of the corpus.
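The exhaustive matching step can be sketched as a sort over all corpus templates. The corpus contents, names and the choice of one minus cosine similarity as the distance are illustrative assumptions.

```python
import numpy as np

def cosine_distance(p, q):
    """One minus the cosine of the angle between two templates (assumed form)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 1.0 - float(p @ q) / float(np.linalg.norm(p) * np.linalg.norm(q))

def rank_corpus(query, corpus):
    """Compare the query with every template; return (name, distance) pairs,
    most similar first."""
    scored = [(name, cosine_distance(query, tmpl)) for name, tmpl in corpus.items()]
    return sorted(scored, key=lambda item: item[1])

# Hypothetical 3-leaf templates for three corpus sounds.
corpus = {
    "oboe": [0.6, 0.3, 0.1],
    "rain": [0.1, 0.1, 0.8],
    "speech": [0.3, 0.4, 0.3],
}
ranked = rank_corpus([0.7, 0.2, 0.1], corpus)
ranked_names = [name for name, _ in ranked]
```

Every query touches every template, so the work per query is linear in the number of corpus items.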


Experiments & Results

1. Sound Retrieval

A simple test was conducted to compare the performance of the system with that of the Muscle Fish system available on the web.

Two types of trees were used: one quasi-supervised and the other supervised.

Quasi-supervised means that the tree was trained to separate the whole sample space into distinct classes, so the number of cells in the feature space equals the size of the sample space. (The results table labels this tree unsupervised.)

The supervised tree was trained to classify each sample into a subclass, or group with similar properties, which gives the better results.


Experiments & Results [cont..]

Sound class     Q Tree (D_C)    Q Tree (D_C)    Muscle Fish    Muscle Fish
                unsupervised    supervised      (no DPL)       (+ DPL)
Laughter (M)    0.68            0.82            1.00           1.00
Oboe            0.11            0.43            0.69           0.94
Agogo           1.00            1.00            0.53           0.58
Speech (F)      0.77            0.87            0.69           0.94
Touchtone       0.61            1.00            0.44           0.73
Rain/Thunder    0.22            0.35            0.30           0.42
Mean AP         0.58            0.772           0.608          0.768

Retrieval Average Precision (AP) for different schemes. Quantization tree results used the unweighted cosine distance measure.

Both kinds of distance measure were tried but, as mentioned, the cosine measure performed considerably better.
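The slides do not define how Average Precision was computed. A sketch of one standard definition, the mean of the precision values measured at each rank where a relevant item appears, is given below as an assumption rather than as the exact procedure behind the table.

```python
def average_precision(relevance):
    """relevance: ranked list of booleans, True where the retrieved item
    belongs to the same class as the query."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / hits if hits else 0.0

# Relevant items at ranks 1 and 3: (1/1 + 2/3) / 2
ap_example = average_precision([True, False, True, False])
```

Under this definition a score of 1.00 means every relevant item was ranked ahead of every irrelevant one.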


Experiments & Results [cont..]

2. Music Retrieval

In this application, music clips were used for classification. The genres included jazz, pop, rock, rap, etc.

Clips from the same artist were considered to be from the same class. Each artist had 5 clips in the corpus.

The corpus consisted of 255 seven-second clips, 5 clips per artist, from 40 artists.

Distance    Euclidean (D_E)    Euclidean (D_E)    Cosine (D_C)    Vector
                               unsupervised                       distance
AP          0.35               0.32               0.40            0.31

Retrieval Average Precision (AP) for music retrieval experiment.


Conclusions

The retrieval works effectively on complex data and measures acoustic similarity. The sorted comparison results give the order of similarity between the query and the references in the corpus.

The computational and storage requirements are also modest, since the feature vectors are just arrays of integers and the Q-tree quantization and classification operate in one dimension at a time.

This method can be used to automatically segment multimedia data based on changes of speaker, pauses, musical interludes, etc.

Finally, by varying the number of free parameters and ignoring dimensions that are never used, the templates can be optimized to suit the application requirements.


Limitations

The classifier is a simple hyperplane classifier, which can only separate two subspaces with a flat plane. In real data the vectors may not be distributed in a way that such a plane can separate.

Only simple acoustical parameters are used for matching; more sophisticated systems could use other parameters such as pitch and speaker-dependent properties.

Recorded musical clips are needed to find the genres. If the clips are distorted or lossy, the system will not work well.

There is no option for dynamic training, i.e. the system is not self-updating.


Suggestions

• Neural networks can be used to divide the feature space in a more complex fashion, forming curved, concave and hybrid boundaries.

• If the dimensionality of the feature vectors is increased, simple Euclidean distance will no longer work well and other comparison methods will be needed.

• Additional features such as pitch, brightness and speaker-dependent parameters can be used to obtain good classification with a smaller database.

• Dynamic training should be added so that the system includes new samples in the database as it encounters them.

• Speech recognition could possibly be added so that the user can search for data with a spoken query.


References

[1] S. Pfeiffer, S. Fischer, and W. Effelsberg, "Automatic audio content analysis," Tech. Rep. TR-96-008, University of Mannheim, D-68131 Mannheim, Germany, April 1996.

[2] E. Wold, T. Blum, D. Keslar, and J. Wheaton, "Content-based classification, search, and retrieval of audio," IEEE Multimedia, pp. 27-36, Fall 1996.

[3] T. Blum, D. Keslar, J. Wheaton, and E. Wold, "Audio analysis for content-based retrieval," Tech. Rep., Muscle Fish LLC, 2550 Ninth St., Suite 207B, Berkeley, CA 94710, USA, May 1996.

[4] B. Feiten and S. Gunzel, "Automatic indexing of a sound database using self-organizing neural nets," Computer Music Journal 18(3), pp. 53-65, 1994.


Questions & Comments


Thanks
