
Algorithms for Music InformationRetrieval

A Thesissubmitted for the Degree of

Master of Science (Engineering)

in the Faculty of Engineering

By

Balaji Thoshkahna

Department of Electrical EngineeringIndian Institute of Science

Bangalore – 560 012

April 2006


© Balaji Thoshkahna, April 2006

All rights reserved


to

Amma and Appa


Contents

List of symbols

Acknowledgements

Abstract

1 Introduction
  1.1 Music Information Retrieval Systems (MIRS)
  1.2 Architecture of MIRS
    1.2.1 Signal acquisition
    1.2.2 Signal thumbnailing
    1.2.3 Feature extraction
    1.2.4 Feature comparison and retrieval
  1.3 MIRS, an introductory literature survey
  1.4 Types of MIRS
    1.4.1 QBH
    1.4.2 QBE
    1.4.3 QBT
    1.4.4 QBB
  1.5 Motivation for research in QBE
  1.6 Problem statement
  1.7 Organisation of the thesis
  1.8 Chapter Summary

2 Literature Survey on QBES'
  2.1 Understanding the QBE form of MIR
  2.2 Systems with better audio related features
  2.3 Systems with better retrieval algorithms
  2.4 Systems with efficient indexing structures
  2.5 Linking the QBES'
    2.5.1 Similarity in a QBE system
    2.5.2 Features in a QBE system
    2.5.3 Retrieval algorithm and distance measure in a QBE system
    2.5.4 Database size in a QBE system
  2.6 Chapter Summary

3 Quebex: A QBES for MIR
  3.1 Database Information
  3.2 Similarity definition for Quebex
  3.3 Feature extraction for the Quebex system
    3.3.1 I. Extraction of features for representing timbre
    3.3.2 II. Extraction of temporal and energy features
    3.3.3 III. Extraction of rhythm features
  3.4 Algorithm for retrieval
  3.5 Experiments and Results
  3.6 Conclusions and Chapter Summary

4 Arminion: An improved QBES
  4.1 Similarity for Arminion system
  4.2 Feature extraction
    4.2.1 Timbre descriptors - MFCCs and their extraction
    4.2.2 "Noisyness" descriptors - HILN based features and their extraction
  4.3 Calculating the distance between 2 GMMs
    4.3.1 A new and simple distance measure for using in GMMs with M > 1
  4.4 Algorithm for retrieval in Arminion
  4.5 Experiments using the proposed distance measure
  4.6 Experiments using the HILN features
  4.7 Conclusions and Future Work

5 A Speech / Music Discriminator algorithm
  5.1 Current state of art
  5.2 Speech Music Discriminator algorithm
    5.2.1 HILN model based features
  5.3 Experiments and Results
  5.4 Conclusions and Future work

6 Conclusions and future work
  6.1 Summary of this work
  6.2 Future work

Appendix A: Triangular inequality proof

Appendix B: Triangular inequality proof

Bibliography


List of Figures

1.1 Block diagram of a futuristic MIRS
1.2 Block diagram of a generic MIRS used in this thesis
3.1 Rhythm feature based on normalized autocorrelation template
3.2 Block diagram of the Quebex algorithm
3.3 A suboptimal mapping function for distance calculation between feature vector trajectories. Node 1 of Song 1 can be mapped to the nearest of Node 1, Node 2 or Node 3 of Song 2. Any intermediate Node i of Song 1 can be mapped to any of Node i-1, Node i or Node i+1 of Song 2.
3.4 Projekt Quebex's MATLAB GUI
4.1 Top figure indicates the FFT of a frame of a violin piece. Bottom figure indicates the sine likeness measure for the frame.
4.2 Block diagram of Arminion's architecture
4.3 Arminion's MATLAB GUI
5.1 Block diagram of the speech-music discriminator
5.2 Average number of sinusoids per voiced frame
5.3 Average number of individual lines per voiced frame. As can be seen from the plot, the Iavg feature is not of much use, so we neglected this feature.
5.4 Average number of harmonics per voiced frame
5.5 Average residual noise energy per voiced frame. In all the above histograms, the x-axis corresponds to the value of the feature and the y-axis corresponds to the number of times the feature value occurs in the experiment.
1 Proof of triangular inequality failure of the distance measure
2 Proof of triangular inequality failure of the distance measure

List of Tables

2.1 Table of papers reviewed in the chapter
4.1 Table for experiment no. 1
4.2 Table for experiment no. 2
4.3 Table for experiment no. 3
4.4 Table for experiment no. 4
4.5 Table for experiment no. 5
5.1 Tests on Scheirer and Slaney's database

List of symbols

CBR → Content Based Retrieval.

CBRS → Content Based Retrieval System.

CBMR → Content Based Multimedia Retrieval.

CBIR → Content Based Image Retrieval.

PR → Pattern Recognition.

AIR → Audio Information Retrieval.

MIR → Music Information Retrieval.

MIRS → Music Information Retrieval System.

QBE → Query by Example.

QBH → Query by Humming.

QBB → Query by Beat-boxing.

QBT → Query by Tapping.

QBES → Query by Example System.

MIDI → Musical Instruments Digital Interface.

HILN → Harmonics, Individual Lines and Noise.

MFCC → Mel Frequency Cepstral Coefficients.

LPC → Linear Prediction Coefficients.



EMD → Earth Mover’s Distance.

KLD → Kullback-Leibler Divergence.

GMM → Gaussian Mixture Model.

ZCR → Zero Crossing Rate.

TC → Twin clip.

DTC → Dubbed twin clip.

SMD → Speech/Music Discriminator.

Acknowledgements

I would like to thank my adviser and guide Prof. KRR for having endured me for the past three years. He has been extremely helpful during my time at IISc and has stood by me during the toughest of times. It has been a long way for me, from his "I won't allow you to work on music...." to presenting me with the opportunity and resources to grow with the music research group here at LSML.

I would like to thank my labmates for having given me a wonderful time here at LSML. I got a new perspective towards life due to my interaction with them. Bigfoot, LL VJ, Vatsan, "Tamil fanatic" Ravi, Kunche, Pranav, Sandy Rat, Um'E'sh devru, Pavan, 'Hase' Vivek, Prasi and Nash have been amazing colleagues during my time here.

Special thanks to Chash, AGK and Pramoda (for all the left justification we did at the 16th cross) for the great time at the shuttle courts, TB and the gymkhana movies. Without you guys my stay would have been definitely less colorful. Thanks also to Prof. TVS, Ram Sir and the speech lab guys for treating me as one of their own. Thanks also to Younger, Elder and "Eh.... Gunnersa" Mahesh for all the "low bandwidth" football interactions. Thanks to my "Mad" friend Vijaykrishna for all those enjoyable times.

Finally, thanks to my parents and sister for having endured my fits of idiosyncrasies with calm and patience over the last quarter century. This thesis is dedicated to you for all the sacrifices you made to help me pursue my dreams.



Abstract

With the growth of the Internet, physical boundaries have been almost erased. Fast access to information and information exchange have become the requirements of the day. Search engines like Google, AltaVista, Yahoo etc. search and retrieve textual content from huge databases (read websites) pretty fast. But these "all-seeing" search engines cannot go beyond the text retrieval paradigm. With the ever increasing presence of multimedia content like images, audio and videos on the Internet, there is a need to incorporate multimedia searching capabilities in today's search engines. Though many search engines have the ability to search for multimedia, it is only metatag based search, a minor extension of the existing textual search capabilities. Content Based Multimedia Retrieval (CBMR) is the need of the hour for searching multimedia based on the content itself rather than labels and tags attached to the content.

After images, audio is the second most popular form of multimedia on the web. Most of the current search engines on websites that distribute audio on the web (such as www.mp3.com, www.raaga.com etc.) are still textual metalabel based (i.e. search can only be in the form of pre-defined and restricted queries such as "search by singer / album name"). Searching by textual queries is cumbersome for a normal user who may not have any knowledge about the content he is searching for. For example, a person may remember a tune but may well have forgotten the name of the album / artist. In such cases we need Music Information Retrieval Systems (MIRS) that try to work on the content itself rather than the meta content. Towards this, the current work attempts Content Based Music Information Retrieval (CBMIR).

MIR systems can be queried in various modes, such as query by humming (QBH), query by example (QBE), query by beat boxing (QBB), query by tapping (QBT) etc. In these systems, a user gives an input signal (in various possible modes depending on the system) as a query and the system retrieves audio that is similar, in some sense, to the query input. Most of the currently available systems work on the MIDI (Musical Instruments Digital Interface) audio format. All the above querying systems except for the QBE mode of MIRS expect some skill or the other from the user (QBH requires the user to hum accurately, QBT requires the user to have drumming skills etc.). The QBE mode of querying is the most convenient form of querying and doesn't require any skill from the user.

The main contributions of this thesis are two algorithms that perform content based retrieval on music data using the QBE paradigm and one algorithm for front end processing in QBH systems. We have also come up with a systematic procedure to build databases for MIRS such that efficient objective evaluation can be performed on systems using these databases.

A database containing 1581 songs in the PCM (.wav) format has been built for experimentation. The database has songs from various genres, languages, artists and bands. The database incorporates a controlled redundancy of data, in the form of multiple versions of a given song, such that an objective evaluation of any MIRS can be done.

The first MIR system developed in the course of the research work (called Quebex [1]) is a QBE system for audio retrieval. In this system a query song can be selected from the database using a GUI and fed as input. A strict form of similarity is used for retrieval in Quebex. This form of similarity tries to match the timbre of two songs temporally, followed by matching their rhythms as well. For this purpose we describe the features, both timbre and rhythm features, that are required. A modified version of the Hausdorff distance is conceived to match the timbres of any two songs temporally. We use an elimination method that removes the 50% of songs whose timbre is not well matched to the query song. Only 50% of the songs are further processed for the purpose of rhythm matching. A nearest neighbour matching paradigm is used, since no genre based or hierarchical structure based retrieval methods are possible (since we are not using any metadata for classification). The algorithm performs very well when there are a lot of songs of the type of the query song, but fails miserably when the database size is small or the number of songs of the type of the query is small. This motivated us to dilute the strict form of similarity we had used for Quebex and come up with a much simpler form of similarity that would retrieve songs well even in a small database.

In Arminion [2], our second QBE system, songs that are timbrewise similar and have similar "noisyness" to the query song are retrieved. "Noisyness" is defined loosely as the feeling similar to that experienced while listening to a heavy metal / rock song. For modeling timbre, MFCCs with GMMs (Gaussian Mixture Models) are used. A new distance measure between two mixtures of Gaussians is derived and an intuitive explanation is given for the same. The HILN (Harmonics, Individual Lines and Noise) model of audio coding used in the MPEG4 standard has been employed for extracting certain features that denote the "noisyness" of a signal. Here again, a two stage process to retrieve songs that are timbrewise similar is used. The first stage retrieves the best matching 50% (say) of the total number of songs in the database that are timbrewise similar to the query song. The second stage involves the comparison of HILN features. In this step, songs that are timbrewise similar and also have the same kind of "noisyness" are retrieved (say, the best 10 matches). Through a set of experiments, the utility of the new distance measure, the capability of MFCCs to capture the timbre of an audio clip and also the ability of the HILN features to capture "noisyness" in a song are shown. We get nearly 60% acceptable retrieval (among the top 10 retrievals, around 6 will be acceptable) using the Arminion system.

Our third algorithm is a Speech / Music Discriminator (SMD) [3]. This algorithm uses HILN features and has been motivated by the performance of the HILN features in our earlier system towards better discrimination of "noisyness" in music. A very simple 3 dimensional feature vector is used, with a Bayesian classifier along each of the 3 dimensions. A simple voting mechanism is then used to choose between the speech and music labels. A tune can be played or sung or hummed (using specific syllables) and used as input for a QBH system. The SMD would help in distinguishing between the sung queries and the played queries. This SMD performs on par with far more compute intensive and complex classifier based SMDs on a standard database, with an overall accuracy of 96%.

This thesis is organised as follows. Chapter 1 introduces the concepts of MIR and CBR to the reader. Chapter 2 is a brief literature survey on the topic of QBE [4]. Chapters 3 and 4 deal with Quebex and the Arminion system and their corresponding experiments. The shortcomings of the systems and our solutions to them are dealt with in detail in the two chapters. Chapter 5 describes our SMD and its implementation. We conclude the thesis in Chapter 6 by giving out possible extensions to our work and its significance in the field of MIR.


Chapter 1

Introduction

With the phenomenal growth of the Internet, information exchange and fast access to information have become the requirements of the day. Search engines like Google, AltaVista, Yahoo etc. search from huge databases and retrieve textual content (and to some extent images) pretty fast. Textual content retrieval has become more or less trivial with the presence of these search engines. Recently, Google Desktop has made powerful search functionalities possible on personal computers.

With the ever increasing presence of various forms of multimedia such as images, audio and video on the web and on PCs, content based multimedia retrieval is seen as a necessity. The second most popular form of multimedia, next only to images, is audio. Most of the current search engines on websites that distribute audio on the web (such as www.mp3.com, www.raaga.com etc.) are textual metalabel based, i.e. search can only be in the form of pre-defined and restricted queries such as "search by singer / album name". Usually websites provide names of artists, albums and genre for every song. With over a billion songs on the web, tasks such as indexing and labeling are still done manually. This leads to incompatibilities among various websites. The same audio content downloaded from different content providers would, many times, have confusing metadata (such as genre). Personal tastes will play a role in labeling the 'genre' of a song. The end user / consumer is at a loss in such cases. Also, two different content distributors may store the same song in different audio formats and quality. A song may be named differently by different content distributors. Thus we need some form of Content Based Retrieval System (CBRS) that works on the audio content itself rather than the metadata and retrieves relevant songs for a user.

To overcome the above said problems, the MPEG (Motion Pictures Experts Group) committee came up with the MPEG7 standard in 2001 (http://www.chiariglione.org/mpeg/standards/mpeg-7/mpeg-7.htm). Also called the Multimedia Content Description Interface, the standard intends to provide a common framework for content based multimedia retrieval systems. The standard only provides the structures of various interfaces, such as those between the user and the content creator, the user and the content provider etc. The way content can be described is also specified, along with the possible applications of the framework. CBIR (Content Based Image Retrieval), Music Information Retrieval Systems (MIRS) etc. have been the outcome of such collective efforts by the standards committee. The standard does not specify the algorithms to implement various forms of MIRS, but specifies only the framework within which each of the standard blocks may be implemented. In this thesis we shall look at algorithms and systems that try to do this job of retrieving music based on CBR technologies.

In this chapter, we shall deal with MIRS in section 1.1 and their architecture in section 1.2. In section 1.3, we will take a look into the current literature on MIRS. We describe the various types of MIRS in section 1.4 and the user skills needed to use them. We briefly discuss the motivations for this work and what we intend to achieve in this thesis in sections 1.5 and 1.6. We shall end the chapter with a brief summary in section 1.8.

1.1 Music Information Retrieval Systems (MIRS)

MIRS are those CBR (Content Based Retrieval) systems that take textual queries, aural queries etc. as input and give some audio as output. MIRS are used for indexing, searching and retrieving music in an efficient and fast manner. MIRS are usually employed in situations where huge audio databases are constantly accessed by various clients / users. This kind of a scenario rules out manual sorting, indexing and retrieval of audio, since the databases are huge and the requests for audio information are varied.

To illustrate the scenario with an example, let us take the case of a music director A working on an animation film. He has a huge database of sounds for special effects (funny sounds, eerie sounds, hand claps, glass breaks etc.). He needs to search it for certain special effects to be used at various places in the movie. Now this database can have some hierarchy in it, imposed by the database creator (for example, www.mp3.com categorizes its 208,000 songs into 215 separate genres [5]). But the music director may not feel like using the pre-defined hierarchy to search for the sounds of his choice. The director may have a set of special effects and would like to see if the database has "similar" effects. In this situation, he would want some automated system that would accept his clips of special effects audio as input and retrieve "similar" clips from the database. If another music director B, working on a horror movie, would like to retrieve sounds by using textual descriptions of the sounds, he might like an automated system that would take his textual descriptive input and retrieve the best possible matches from the database ('Screechy sound with hordes of glass breaking in the background', 'a hard hitting piano sound with a comical feeling' etc. would be his typical queries). Though the database is the same in the above two cases, the needs are varied and therefore the applications, though similar, need to be able to cater to different situations. In both the said cases, textual metalabelling would not only be difficult but would also reflect the preferences and tastes of the content creator. It may be impossible for a human assistant to wade through the entire database searching for the kind of audio the director is looking for.

With PCs having harddisk capacities of over 40GB becoming a common thing, pocket mp3 players (like the Apple iPod and Sony Discman) having capacities of over 20GB hitting the markets, and websites offering more and more audio for download (like pay-and-listen services), searching for favourite audio will be cumbersome for the normal user. Thus, some form of MIRS is needed to go through the ocean of music and audio available at one's disposal. In the next section, we shall discuss the architecture of a very generic MIRS and describe its working to bring forth the motivation for this work.

1.2 Architecture of MIRS

Figure 1.1 shows a block diagram of an MIRS. The system takes audio input from the database and extracts low level features LLi, where i varies from 1 to n. These features are combined in some way, using either psychoacoustic grouping rules [6] or cognition based rules [7], to get higher level semantic features / representations [8] HLi, where i varies from 1 to m. As part of an input query to the system, a user can type in a query like "Get me songs that feel like Yanni" [9, 10]. Systems that use some finite set of words to convey the moods / impressions that a song would convey are currently being used instead of natural language (NL) queries [11]. This query has to be parsed by an NL understanding system that tries to understand the user's query. Upon understanding the query, the system throws out the cognitive equivalent expressions / features (QHLi, where i can vary between 1 and m) about what the user wanted. Now, some form of distance measure is used for calculating the similarity between the query features QHLi and the database features HLi to retrieve songs that satisfy the user's query. This form of an MIRS is really far fetched, since we are yet to make progress in NL understanding and also in our understanding of the cognitive processes that give rise to various "gestalt" effects. Hence, we attempt much easier versions of the above said MIRS. We shall discuss a few such systems in this thesis. From now on, whenever we refer to an MIRS, we mean a simpler form of MIRS that takes in some form of aural input and processes the input to extract features and retrieve audio based on some "similarity" measure.

[Figure 1.1: Block diagram of a futuristic MIRS. Blocks: input signal → extract low level features (LL1, LL2, ..., LLn) → generate high level semantics (HLS1, HLS2, ..., HLSm); input natural language query → parse query using grammar rules → understand query and extract semantics (QHLS1, QHLS2, ..., QHLSm).]

As described above, a very simple and generic MIRS is shown in Fig 1.2. The system takes some form of input query signal (which is dependent on the type of MIRS) and processes the signal to extract certain features. The same set of features is extracted (usually offline) for the audio in the database. Now we compare the query song and the songs in the database using these extracted features as a form of middle level representation. Depending on some distance measure, we would like to retrieve songs that are similar in some sense.

[Figure 1.2: Block diagram of a generic MIRS used in this thesis. Blocks: input signal → feature extraction → comparison between query song and the database ← feature extraction and indexing ← database; the comparison block outputs the retrieved songs.]

The following are the different stages of an MIRS and they involve different signal processing and pattern recognition (PR) techniques. They are as follows:

i. Signal acquisition.

ii. Signal thumbnailing.

iii. Feature extraction.

iv. Feature comparison and retrieval.

We shall elaborate on the above in the following sections.

1.2.1 Signal acquisition

Depending on the input signal acquisition method, MIRS are called by different names. Usually some form of aural query such as whistling, humming, singing, tapping on drum pads or even a piece of music is given as the input. Though the features extracted and similarity measures used in the various types of MIRS are different, we have made a classification of MIRS based on the input signal type alone. This point is elaborated in section 1.4.

1.2.2 Signal thumbnailing

Thumbnailing is the process of finding the most representative part of a music piece. For example, the chorus and the verse portions of a song are usually the most repetitive and hence can be called the representative summary of the song. Cooper et al [12] and Chai et al [13] describe work on similarity analysis for summary / thumbnail extraction from audio clips. Usually the feature set is extracted from the database songs using the thumbnails of the songs rather than the whole songs themselves. Also, the thumbnail may be retrieved in the retrieval stage instead of the whole song. This is particularly useful in web based applications when a user needs only a sample of some music to make a decision about the content he needs. This is also logical, since a thumbnail is much smaller in size than the full song and would take less bandwidth / transfer time to transfer from the server to the client.
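The thumbnailing idea can be illustrated with a crude self-similarity heuristic (a sketch of the general principle only, not the methods of Cooper et al or Chai et al; the per-frame features and the window length are assumptions): score every candidate window by the average similarity of its frames to the whole song and keep the highest-scoring window.

import numpy as np

def crude_thumbnail(frame_feats, win_frames):
    """frame_feats: (n_frames, n_dims) array of per-frame features.
    Returns (start, end) frame indices of the window whose frames are,
    on average, most similar (cosine) to the rest of the song."""
    # L2-normalise frames so that dot products become cosine similarities
    norms = np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-12
    f = frame_feats / norms
    sim = f @ f.T                      # (n_frames, n_frames) self-similarity
    avg_sim = sim.mean(axis=1)         # how "typical" each frame is
    # score each window by the mean typicality of its frames
    scores = np.convolve(avg_sim, np.ones(win_frames) / win_frames, mode="valid")
    start = int(np.argmax(scores))
    return start, start + win_frames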

1.2.3 Feature extraction

In this stage, the input signal is usually split into small frames and various kinds of features are extracted from each frame, assuming that the signal is stationary within the frame. Usually frame lengths of 20-30 milliseconds are used. The most popular sets of features [14, 15, 16] in the current literature are time domain features (such as mean energy, zero crossings, number of silent frames etc.), frequency domain features (such as spectral centroid, roll-off, spectral flux etc.), coefficient domain features (such as MFCCs (Mel frequency cepstral coefficients), LPCs (Linear Prediction Coefficients) etc.) and time-frequency domain features such as pitch. These features are usually concatenated to form feature vectors. The input signal is represented using these feature vectors as a compact form to represent certain characteristics of the signal.
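A minimal sketch of this framing step is given below, assuming 16 kHz mono PCM input and a 25 ms frame with a 10 ms hop (these numbers are illustrative, not the settings used later in this thesis); it computes three of the features mentioned above for every frame.

import numpy as np

def frame_features(x, sr=16000, frame_ms=25, hop_ms=10):
    """Return one feature vector [energy, zero-crossing rate, spectral centroid]
    per frame of the mono signal x, assuming x is stationary within a frame."""
    flen, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    feats = []
    for start in range(0, len(x) - flen + 1, hop):
        frame = x[start:start + flen] * np.hamming(flen)
        energy = float(np.mean(frame ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(flen, d=1.0 / sr)
        centroid = float(np.sum(freqs * spec) / (np.sum(spec) + 1e-12))
        feats.append([energy, zcr, centroid])
    return np.array(feats)             # shape: (n_frames, 3)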

1.2.4 Feature comparison and retrieval

In this stage, the features extracted from the query and the features in the database are compared for some form of similarity using some distance measure. "Similarity" can be in terms of timbre [17, 18, 19, 20, 21], rhythm [22], melody [23], mood [24], genre [25], artist [26] or a combination of the above [1, 27]. Based on the definition of similarity in an MIRS, the system extracts the necessary features from the query song and compares the feature set with those of the songs in the database using some distance measure [28, 29, 30, 21]. The nearest set of songs is then retrieved as being most similar to the query song. Usually a nearest neighbour approach [31, 15, 32] or a hierarchical classification based approach [15, 33] is used for retrieval.
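As a sketch of the nearest neighbour retrieval step (the song-level feature vectors, the Euclidean distance and the database layout here are illustrative assumptions, not the measures used by any particular system cited in this chapter):

import numpy as np

def retrieve_nearest(query_vec, db_vecs, db_names, k=10):
    """Rank database songs by Euclidean distance between their (precomputed)
    song-level feature vectors and the query's feature vector."""
    d = np.linalg.norm(db_vecs - query_vec, axis=1)   # one distance per song
    order = np.argsort(d)[:k]
    return [(db_names[i], float(d[i])) for i in order]

# usage sketch: db_vecs could hold the mean per-frame feature vector of each song,
# e.g. query_vec = frame_features(query_signal).mean(axis=0)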

1.3 MIRS, an introductory literature survey

In this section we shall look into the various aspects of MIRS that we have discussed till this point, with the help of some well quoted papers in the literature. We will first discuss a few important papers that shall act as pointers to research in MIR and then proceed towards more specific MIRS in the next sections.

Foote [8], in his literature survey of various Audio Information Retrieval (AIR) systems, discusses automated systems that can do video mail retrieval, spoken document retrieval, keyword spotting using speech recognition technology, music browsing and indexing, music retrieval (MuscleFish, one of the first AIR systems, is described) etc. This paper predicts that one day we might come up with an "AltaVista for music".

Wold et al [31] discuss MuscleFish, an AIR system. The system used very basic spectral and temporal features such as loudness, pitch, brightness, bandwidth and harmonicity. The system is trained using various types of sounds. A pattern classifier approach is taken to classify new input sounds. When a user wants to retrieve all "dog barks", he can give an example query as input to the system, as a .wav file. The class of the input sound is found and the nearest neighbours are retrieved as the output (a demo can be found at http://www.musclefish.com/cbrdemo.html).

Tzanetakis [34] talks about various aspects of retrieval systems. These include newer features, newer modes of retrieval and also newer ways of visualizing audio data. Temporal, spectral, rhythm based and also wavelet based features are well documented in his work. Experiments comparing human performance against that of machines, in tasks such as audio classification and segmentation, have been well explained. New forms of visualizing music, such as the Timbregram, are proposed. The future of MIR systems, with various possible ways of querying, is also well elaborated. Marsyas, a music analysis and synthesis framework developed by Tzanetakis [16], has a set of general music analysis algorithms implemented in a user friendly Java environment. The system supports feature extraction, input segmentation, classification and a few other basic algorithms. Any new application can very easily be developed by modifying / using the inbuilt functions of the system (MARSYAS is downloadable from http://opihi.cs.uvic.ca/marsyas/). Tzanetakis [35] describes various tools that can be implemented using this framework. Audio segmenting, thumbnailing, retrieval using the QBE (Query by Example) paradigm, sound editing tools and timbregrams (a tool to visualize timbral similarity in audio clips) are a few of the tools that have been implemented using the MARSYAS framework.

Keuser [14] talks of various types of MIRS and also various feature extraction algorithms. His literature survey is concise and indicative of the state of the art in MIRS. Audio genre classification, theme extraction, beat estimation, QBE and query by humming (QBH) systems have been described. Various features (temporal, spectral, beat and others) have been detailed and the motivation to use them for MIRS has been well elucidated. A library of audio utility functions in the line of Marsyas has been implemented to achieve a generic MIRS.

A very recent paper by Typke et al [36] surveys nearly 20 MIRS of various types. This paper does a comparative study of their features, retrieval methods and also their application areas. The paper also deals with various aspects of MIRS in general and is a good source from which to start a literature survey about MIRS.

1.4 Types of MIRS

We shall look into the description of various forms of MIRS in sufficient detail and provide the motivation for this work on QBE. Music transcription [37, 38], synthesis [39, 40], classification [25], representation [41] and music retrieval in various forms have been the main topics in the field of music research for the past decade. Music description and retrieval have been challenging fields, with the complexity of music signals still winning over even the most intelligent and involved signal processing techniques. We will look at a few such systems to see why MIRS is still a very open problem for researchers to tackle.

Depending on the type / mode of input query, MIRS can be classified as listed below.

I. Query by Humming (QBH).

II. Query by Example (QBE).

III. Query by Tapping (QBT).

IV. Query by Beat-boxing (QBB).

We shall describe each of these types of MIRS in the following sections in as concise a manner as possible.

1.4.1 QBH

The earliest attempts at music retrieval came in the form of QBH systems. Here the user hums / sings a query song into a microphone and the system retrieves tunes that match the hummed query. QBH systems work by retrieving songs that match the melody contour and the timing information of the query song. McNab et al [42] described one of the first QBH systems, using New Zealand folk songs. The MIDI (Musical Instruments Digital Interface) format of audio was used for melody extraction, and similarity comparison was based on string edit distances. This has been the usual technique used since then for QBH systems. Almost all QBH systems have some form of melody contour transcription system (most systems use the MIDI format of audio to extract this!), followed by matching the melody contour of the aural query (singing, humming, whistling or even performing using some musical instrument) to the melody contour of each song in the database using some form of approximate string matching technique [43]. The best matching songs are then retrieved from the database [44, 45, 46, 47, 48]. QBH systems have inherent problems such as pitch transcription in raw polyphonic music [49, 50], users not being able to reproduce the melody of the song accurately in their hummed queries [51, 49] and also efficient searching, which is usually not possible in the case of huge databases. We are still far away from realizing QBH on raw polyphonic audio or MP3 songs.
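The approximate string matching step can be sketched as follows; the three-letter contour alphabet (U/D/S for up/down/same) and the plain Levenshtein costs are illustrative assumptions, since actual QBH systems use richer contour and timing representations.

def edit_distance(a, b):
    """Levenshtein distance between two melody-contour strings, e.g. 'UUDSU'."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# a hummed query contour would be compared against every song's contour and the
# songs with the smallest edit distances returned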

1.4.2 QBE

The second most popular form of MIR system that has been attempted is the QBE system. In these systems, a user would like to hear songs that sound "similar" to a query song. Hence he gives a piece of music itself as a query to the system. The system analyses the query piece, extracts the necessary features and, upon comparison with the database songs, retrieves a few pieces of music as being "similar" in some sense to the input query song. We shall discuss QBE systems in detail in the next chapter.

1.4.3 QBT

In QBT [52] a user taps his query song on a tapping pad. The intervals between taps act as the features for the purpose of retrieval. Onset intervals from the database songs are compared with those of the query song, and songs that have similar onset patterns are retrieved. Even this mode of querying is yet to become popular, though it may well be used in karaoke systems, DJing or even as a teaching aid for kids.

1.4.4 QBB

In QBB [53] a singer sings a bunch of 'bols' (bol is an Indian word for speech and here roughly translates to spoken syllables) / animated sounds that resemble percussive sounds. The system retrieves sounds that are similar to the sung 'bols'. Though this is useful for DJing, rappers, music teachers etc., research in this field has not picked up much.

1.5 Motivation for research in QBE

Example is the school of mankind, and they will learn at no other.

– Edmund Burke (1729-1797)

As quoted above, humans learn faster and better when they see examples. Many times, when the description of a particular object / task becomes difficult with language, giving an example clarifies a lot of ideas about the problem at hand. This aspect of our understanding of the world around us has motivated this work largely. Audio retrieval too has seen some researchers take the same route, that of solving the MIR problem through the QBE paradigm. A lot of concepts about music, such as genre and mood (these are the usually available metalabels for online audio), are ill defined and dependent on the cultural backgrounds and personal likings of the people who tag the audio metadata. Thus, what one user calls rock music, another might call hard rock or soft rock or even metal. A normal user searching for music in huge databases will find it useful if he has a simple tool that would retrieve for him songs "similar" to the one he gives as input. For the QBH, QBB or QBT forms of MIRS, some skill is expected of the user for the sake of musical input. This makes querying itself a hard task for a layman. QBE requires no such skills. In QBE, if "similarity" is defined for the user, he would know well what to expect. For example, if a QBES offers retrieval by similarity of mood, the user would know pretty well that music (irrespective of genre, artists, instrumentation etc.) that tries to convey the same mood as the query song would be returned to him. This has been a major motivation for our research in QBE MIRS.

Consider the following scenario. An ordinary user listens to a Carnatic classical performance and likes the mood and feel of the song. Being a layman, he doesn't understand "raaga" (loosely, the set of notes that have to be traversed using a set of predefined rules) or "taala" (the aspect of music that maintains the tempo and prescribes the periodic beat pattern for a musical performance), nor can he remember the composer of the piece. Yet he would like to listen to more songs with the "same feel". This presents a serious problem to a search engine that uses textual input as the query, since the user is unable to express the feeling he experiences upon listening to the song. Also, any metalabel based search like "Search by Raaga" or "Search by Composer" etc. fails since the user doesn't know them. In this scenario, a QBE MIRS is the best way to help the user. It doesn't require any input other than the audio that the user already has. Giving this as the input, in a generic QBE MIRS, the user can select "Search by similar mood" or "Search by Raaga" etc. without having to worry about any information about the audio. Thus, the QBE form of CBR is one of the most powerful and user friendly MIRS.

We would like to design algorithms and features for music retrieval using some forms of "similarity". Until we come up with better and stable descriptors for music content and easier forms of querying, the simplest and easiest way to search for audio using CBR techniques would be to use the QBE form of MIR.

1.6 Problem statement

We shall formalize our approach to providing a solution to the QBE paradigm in MIR. We would like to state the problem as rigorously as possible, leaving out the loose ends if they have not been introduced in this thesis yet.

Let DBS be a database of songs. Let QS be a query song. Let S be a type of 'similarity' (S can be similarity of mood or timbre or rhythm etc.). Let F be the set of features extracted from DBS. Let DM be some form of distance measure applied on F to measure S.

We would like to build a QBES that would take QS as the input and search the database DBS for songs that are 'similar' according to S. We intend to explore various forms of S, various features F that suit a particular S, and also look at various possible DMs. We intend to design algorithms that are fast and computationally cheap for retrieval purposes.
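The statement above can be read as the following interface (a sketch only; the names QS, DBS, F and DM mirror the symbols defined above, and any concrete feature extractor or distance measure is left as a parameter):

from typing import Callable, Sequence
import numpy as np

def qbe_retrieve(QS: np.ndarray,
                 DBS: Sequence[np.ndarray],
                 F: Callable[[np.ndarray], np.ndarray],
                 DM: Callable[[np.ndarray, np.ndarray], float],
                 k: int = 10) -> list:
    """Return the indices of the k database songs most 'similar' to the query
    QS, where the similarity S is implicitly defined by the feature extractor F
    and the distance measure DM applied on those features."""
    fq = F(QS)
    distances = [DM(fq, F(s)) for s in DBS]   # in practice F(s) is precomputed offline
    return sorted(range(len(DBS)), key=lambda i: distances[i])[:k]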

1.7 Organisation of the thesis

This thesis is organised as follows. In Chapter 2, we discuss a few papers about QBESs. This chapter shall pave the way for a better understanding of the state of the art in QBESs and also the approaches that are in vogue. We discuss the three important approaches that have been taken by researchers to address the QBE problem in MIR.

In Chapter 3 we present Projekt Quebex, the first QBES developed for our work. This system uses a feature trajectory matching algorithm to retrieve similar songs. A strict form of similarity is defined between any two songs. Songs that are similar in instrumentation along time and also in rhythm are retrieved for a query song using this system. This had some disadvantages when used for small databases and motivated us to look for better solutions. The result was Arminion, another QBE MIRS that retrieves audio based on timbral similarity.

In Chapter 4, the shortcomings of the Projekt Quebex system are discussed and Arminion is offered as a solution to the shortcomings. We use the HILN (Harmonics, Individual Lines and Noise) model used for MPEG4 audio coding to extract features for improved retrieval performance. We conduct various experiments to prove that Arminion works well even in the case of average sized databases.

In Chapter 5, a Speech/Music Discriminator (SMD) based on HILN features will be described. This work is more of an offshoot of our work on Arminion. As an application in MIR, this SMD can be used in a QBH system for front end processing. The HILN features used are extremely simple to evaluate, and a simple voting system instead of compute intensive distance measures is used to demonstrate the performance of our SMD. The SMD performs well when compared with the performance of various existing SMDs.

We conclude and summarize the contributions of this work in Chapter 6, along with future directions for MIRS. We discuss some interesting ideas and views for further research in this area.

1.8 Chapter Summary

In this chapter, we discussed MIR and the need for MIRS. A generic MIRS architecture was described and its various aspects were discussed. We did a brief survey of the current techniques in the field of MIR and also had a bird's eye view of the literature in this field. We looked at the various kinds of MIRS, namely QBH, QBE, QBB and QBT, and discussed the motivation for the current work on the QBE form of MIRS. The advantages of querying by example were elucidated and the format of the thesis has been elaborated. In the next chapter we shall do a literature survey that is indicative of the state of the art in the field of QBESs.

Chapter 2

Literature Survey on QBES’

A king wanting to know about the entire world asked his wisest ministers to catalog information about every man who had ever lived. Seven weeks later, the king was presented with seven camels, each carrying seven huge books. Not ready to go through so many books, the king asked for a smaller catalog. Seven days later he got one book, said to contain a brief history of every man who had walked the earth. Still dissatisfied, the king asked for a smaller catalog. One minute later his wisest minister presented him with a small note, saying that it contained the summary that the king wanted. The note read: "They all were born, they lived and they died".

- An Arabic tale

"Copy from one, it's plagiarism; copy from many, it's research."

- Wilson Mizner

This chapter discusses some of the past work in the field of QBE. In section 2.1, we discuss the general architecture of QBES. We elaborate upon three different directions that researchers in QBES' have taken over the past few years. Systems dealing with optimal features, systems that deal with fast retrieval techniques and systems that use efficient indexing methods are the most described in the literature today. We describe systems that have handled each of the above said issues separately or in tandem. We have confined ourselves to citing important papers that have implications of the signal processing kind. In section 2.2 we study systems that apply better audio features for retrieval purposes. Section 2.3 deals with systems that have better retrieval algorithms, on account of using better distance measures, or because of using better features along with better distance metrics, or because of paradigm changes in looking at audio retrieval by QBE. Section 2.4 looks into various indexing structures that are efficient and lead to faster retrieval on account of intelligent methods of storing data.

2.1 Understanding the QBE form of MIR

QBE has been very popular since it is one of the easiest modes of querying a database. A QBES takes in a piece of audio as input and extracts the needed features to process it further. The system would have extracted the same set of features offline from the audio in the database. Now, using some form of a distance metric, the system tries to retrieve audio that is "similar" in some sense to the query. The system can be asked to (i) retrieve songs that have similar mood, instrumentation (timbre), rhythm or melody, or (ii) retrieve the set of instances of an audio piece based on the query input piece. In the first case a QBES works to retrieve some k-nearest neighbours. In the second case, instances of an advertisement based on a query / instances of an actor's dialogues based on the actor's voice sample query are the kind of applications that can be looked at. Usually the best matching audio pieces are retrieved in some ranked order. In the above described systems, there are three aspects for researchers to consider: features, retrieval strategy and indexing strategy. We can use any one or more of these aspects when designing a new QBE system. Consequently, QBE systems can be classified as listed below;

I. Systems with better audio related features.

II. Systems with better retrieval algorithms.

III. Systems with efficient indexing structures.

We look at each aspect of a QBE system and discuss some of the important contributions in these areas. We will discuss each of the above viewpoints with a few examples in the following sections. Though a few systems combine more than one of the above said aspects, the following classification is based on our understanding of QBE systems and is hence not the only way to classify QBE systems. Also, the literature survey is only indicative of the kind of efforts in the area of QBE and is in no way exhaustive.

2.2 Systems with better audio related features

Welsh et al [5] describe the Ninja Jukebox QBE system, which has a database of around 7000 songs. Using tonal histograms, tonal transitions, noise, volume and tempo based features, a simple k-nearest neighbour algorithm is implemented to retrieve 'similar sounding' songs. Though the algorithm is novel, it has a high feature dimension of 1248. Also, according to the authors, the algorithm retrieves 'soft' pieces of music of other instruments as well when, say, a classical music piece on piano is given as input.

Foote [22] attempts audio retrieval based on "rhythmic similarity" by comparing two songs spectrally. A simple 2D matrix based similarity function is derived to indicate how similar two songs are. This measure is loosely based on the autocorrelation and cross correlation functions. A new feature called the beat spectrum has been designed using the above said matrix similarity function. He has used a small database of around 120 songs (a demo can be found on the web at http://www.fxpal.com/people/foote/musicr/doc0.html).
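The flavour of this construction can be sketched as follows (a simplification under stated assumptions: cosine similarity between per-frame spectral features and a plain diagonal sum as the beat spectrum; Foote's exact formulation differs in its details):

import numpy as np

def beat_spectrum(frame_feats):
    """frame_feats: (n_frames, n_dims) spectral features, one row per frame.
    Builds the 2-D self-similarity matrix and sums its diagonals; peaks in the
    result indicate strongly repeating (rhythmic) lags."""
    f = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-12)
    S = f @ f.T                                        # cosine self-similarity matrix
    n = S.shape[0]
    return np.array([np.mean(np.diag(S, k=lag)) for lag in range(n // 2)])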

Cheng [20] considers retrieval based on spectral energy similarity. Peak picking in the signal power domain, followed by spectrum matching, is used. A DTW (Dynamic Time Warping) based algorithm for matching songs of different lengths is developed. The algorithm takes care of different speeds of performance of a given music piece and retrieves them as similar. Cheng uses a small database of around 200 songs of orchestra music.
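A minimal DTW sketch for aligning two feature sequences of different lengths is given below (illustrative only; Cheng's algorithm works on matched spectral peaks rather than raw per-frame features):

import numpy as np

def dtw_distance(A, B):
    """Dynamic time warping cost between two feature sequences A: (n, d) and
    B: (m, d); tolerates one piece being a slower or faster performance of the other."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]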

Feng et al [24, 54] describe a QBE system that retrieves songs by similarity of mood. Tempo and articulation features are used to train a BP-NN (Back propagation Neural Network) to classify the mood of a song as one of four possible moods (happiness, fear, sorrow and anger). This, to the best of our knowledge, is one of the first attempts that accomplishes retrieval by mood. A database of 223 pop songs was used to train and test the system.

Liu and Wan [15] discuss a QBE system that has a hierarchical search engine built into it. Using the Sequential Forward Selection (SFS) method for feature selection, the best set of 20 features out of 87 features belonging to 4 classes of features (time domain features, frequency domain features, time-frequency features and coefficient domain features) well recognised in the literature is selected, and various classifiers (like nearest neighbour, k-NN and a Probabilistic Neural Network (PNN)) are built on them. The data is first classified into coarse classes and each coarse class is further classified into fine classes, giving rise to a natural hierarchy within the database. The hierarchical search engine retrieves audio clips from the databases quickly and efficiently by first determining the coarse class of a query clip and then searching for similar clips within the coarse class. Two databases of sizes around 400 and 1200 clips of labelled audio are used for the experiments. This work essentially attempts to point out the best set of features that can be used from the gamut of features available today to a researcher.

Venkat [55] proposed a form of QBH for Indian classical music using a QBE format. He has used a "continuity descriptor" to decide whether the query signal is Indian classical music or Western classical. This descriptor accounts for the presence of "gamaka" (continuous pitch variations between notes / embellishments) that is popular in Indian classical music (both Hindustani and Carnatic forms), and based on the "amount of continuity" in a song, it is classified either as an Indian classical or a Western classical piece. This new feature reduces the search time drastically in classical music databases. Venkat used a database of around 100 songs.

2.3 Systems with better retrieval algorithms

Though it is difficult to classify a lot of the systems described below as systems using better retrieval algorithms, we do so based on the importance of the retrieval algorithm, since most of the systems described use very popular features and improve the retrieval mechanism by using different distance functions.

The Cuidado system [21, 32], one of the earliest QBE systems attempted on a large scale, defines similarity as having similar instrumentation ("global timbre") in the query piece of music and the retrieved music pieces. MFCCs are used as the features to describe timbre. The system uses a GMM (Gaussian Mixture Model) based similarity retrieval algorithm. The Cuidado system has one of the biggest databases seen in the literature (more than 17,000 songs). The system has both objective and subjective evaluation methods. The authors come up with a measure of "interestingness", or that of finding something unexpected yet interesting for the user. But the algorithm is a bit compute intensive.
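A sketch of this kind of "global timbre" model is given below, assuming scikit-learn is available and using the average log-likelihood of one song's MFCC frames under another song's GMM as a (non-symmetric) similarity score; Aucouturier's exact distance computation differs.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_timbre_model(mfcc_frames, n_components=8):
    """Fit a GMM to the (n_frames, n_mfcc) matrix of MFCC vectors of one song."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(mfcc_frames)

def timbre_similarity(gmm_a, mfcc_frames_b):
    """Average log-likelihood of song B's frames under song A's timbre model;
    it can be symmetrised by also scoring A's frames under B's model."""
    return float(gmm_a.score(mfcc_frames_b))   # score() averages per sample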

Logan and Salomon [17] describe a system that tries to identify similarity based on the Earth Mover's Distance (EMD) between two pieces of music whose frame based features (MFCCs) have been k-means clustered. Similarity between a database song and the query song is given by the amount of "work" done in moving the feature set of the query song to the positions occupied by the feature set of the database song. The algorithm is extremely compute intensive. Their database has around 8,000 songs.
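The signature-plus-EMD idea can be sketched as below, under several assumptions: scikit-learn's k-means for the signatures, Euclidean ground distance between cluster centroids, and a small linear program solved with SciPy for the transportation problem. Logan and Salomon's exact ground distance and implementation differ.

import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linprog

def signature(mfcc_frames, k=8):
    """Cluster a song's MFCC frames; return (centroids, weights) with weights
    proportional to cluster sizes (they sum to 1)."""
    km = KMeans(n_clusters=k, n_init=10).fit(mfcc_frames)
    w = np.bincount(km.labels_, minlength=k).astype(float)
    return km.cluster_centers_, w / w.sum()

def emd(sig_a, sig_b):
    """Earth Mover's Distance between two signatures, solved as a
    transportation linear program (minimise the total 'work')."""
    (ca, wa), (cb, wb) = sig_a, sig_b
    n, m = len(wa), len(wb)
    cost = np.linalg.norm(ca[:, None, :] - cb[None, :, :], axis=2).ravel()
    A_eq, b_eq = [], []
    for i in range(n):                       # mass leaving cluster i of song A
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1
        A_eq.append(row); b_eq.append(wa[i])
    for j in range(m):                       # mass arriving at cluster j of song B
        col = np.zeros(n * m); col[j::m] = 1
        A_eq.append(col); b_eq.append(wb[j])
    res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return float(res.fun)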

Spevak et al [19] describe a system (Soundspotter) that attempts to match the MFCC trajectory of selected sounds in an audio piece using a DTW based algorithm. Soundspotter can retrieve songs / clips of audio that have a given sound (say a dog bark or the sound of rain) somewhere within an audio file.

Baumann et al [56] compare the work of Aucouturier (the Cuidado system) [21] and Logan [30] and use the best of both methods to build a peer-to-peer (P2P) QBE system. Baumann has used a database of 800 songs from 33 genres. MFCC features followed by k-means clustering have been used, followed by a minimum of the mean KLD (Kullback-Leibler Divergence) instead of the EMD used by Logan. The authors quote results similar to those of Logan.
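For single Gaussians the KLD has a closed form, which is part of what makes this cheaper than EMD. A sketch is given below; symmetrising by taking the mean of the two directed divergences is one common choice and is an assumption here, not necessarily exactly Baumann's formulation.

import numpy as np

def kl_gaussian(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) for full covariance matrices."""
    k = len(mu0)
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - k
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def symmetric_kl(mu0, cov0, mu1, cov1):
    """Mean of the two directed divergences, usable as a distance-like score."""
    return 0.5 * (kl_gaussian(mu0, cov0, mu1, cov1) + kl_gaussian(mu1, cov1, mu0, cov0))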

Velivelli et al [57] design an HMM based QBE system that defines the example query as a "theme" and builds a "theme" HMM on it. The set of HMMs built on the audio in the database is compared with the "theme" HMM to find the most similar audio clips. MFCCs and the energy coefficient are used as features for each frame. Though novel in its approach, the algorithm is compute intensive and needs a great amount of training to work well.

Liu et al [58] show a QBE system that works on automatically extracted segments of audio signals. GMMs of features (MFCC features) from segments are built and stored as the representatives of the segments. The authors come up with a distance measure [59] to find the distance between two mixtures of Gaussians. They use this distance metric to speed up the process of finding similar audio clips. The authors use a database of around 250 music segments extracted from 7 hours of TV recordings.

Paulus et al [60] describe an algorithm for rhythmically similar song retrieval on a database of around 400 songs. Loudness, spectral centroid and MFCCs are the features used for pattern matching. Beat is estimated using well established algorithms, and this helps in choosing window lengths for the pattern matching algorithm. A DTW approach is taken to retrieve songs that are rhythmically similar.

Kashino et al [61, 62] propose a retrieval system for both audio and video using histogram similarity. In this method, histograms of features are built over the length of the query signal and compared with histograms built from various segments of the database video and audio. A time ordering feature is also incorporated for better discrimination. This algorithm is called Time-series Active Search (TAS) [62] and reduces search times in huge databases by nearly 500 times when compared to brute force exhaustive search algorithms. The audio features used in their implementation are BPFs (bandpass filterbank features), LPC-Mel cepstra (called LPCs by the authors) and ∆-LPCs. Robustness to noise is achieved by using a "probabilistic dither voting" concept [63], wherein a feature vector is said to belong to various classes, depending on the class density, instead of a single class.
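The underlying histogram-matching step can be sketched as follows (a brute-force version for clarity; the actual TAS algorithm gains its large speed-up by pruning window positions using bounds on the histogram distance, which is omitted here, and the quantised feature codes are an assumption):

import numpy as np

def histogram(codes, n_bins):
    """Normalised histogram of a sequence of quantised feature codes (0..n_bins-1)."""
    h = np.bincount(codes, minlength=n_bins).astype(float)
    return h / max(h.sum(), 1.0)

def search_by_histogram(query_codes, stream_codes, n_bins, hop=10):
    """Slide a query-length window over the stream and return the window start
    whose histogram is closest (L1 distance) to the query histogram."""
    hq = histogram(query_codes, n_bins)
    L = len(query_codes)
    best, best_d = None, np.inf
    for start in range(0, len(stream_codes) - L + 1, hop):
        d = np.abs(histogram(stream_codes[start:start + L], n_bins) - hq).sum()
        if d < best_d:
            best, best_d = start, d
    return best, best_d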

Park et al [64] propose MRTB (Music Retrieval via Thumbnail Browser), a system that incorporates Multi Feature Clustering (MFC) techniques on sufficiently long windows (of around 15 secs). Features such as LPCs, MFCCs, spectral centroid and zero crossing rates are used for clustering (using the k-means algorithm) within each of the 15 second windows. Classification and query by example systems are built using the MRTB framework. A database of around 2000 songs belonging to 4 genres of music is chosen for the experiments.

Doraisamy et al [18, 65] propose the use of n-grams to retrieve melodywise similar music. Polyphonic MIDI files are used to extract pitch, note onset and note duration information and store them as n-grams. These n-grams are converted to text strings, and text retrieval methods are applied to retrieve similar songs. A database of 3096 MIDI songs is used for the experiments.
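The n-gram indexing step can be sketched as below; encoding each note as a (pitch interval, duration ratio) pair turned into a short text token is an illustrative assumption in the spirit of this approach rather than its exact encoding.

def melody_ngrams(pitches, durations, n=3):
    """Turn a monophonic note sequence into overlapping n-gram strings built
    from pitch intervals and (rounded) duration ratios, so that ordinary
    text-retrieval machinery can index and search them."""
    tokens = []
    for i in range(1, len(pitches)):
        interval = pitches[i] - pitches[i - 1]                # semitones
        ratio = round(durations[i] / durations[i - 1], 1)     # tempo-invariant
        tokens.append(f"{interval:+d}/{ratio}")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# e.g. melody_ngrams([60, 62, 64, 62, 60], [1, 1, 2, 1, 1]) yields 3-gram strings
# that can be fed to a standard text search engine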

Zhang et al [66] propose a QBE system that works on physical (energy, zero crossing rates and fundamental frequency) and perceptual features (rhythm and timbre) of sounds and does a hierarchical classification (a coarse classification followed by a fine classification). HMMs are then used for audio retrieval, by classifying the input query sound and retrieving sounds of the determined class type. A decently large database of around 1500 sounds is used for the system.


2.4 Systems with efficient indexing structures

In this section we discuss systems that use better indexing structures to enable faster searching in music databases. These efficient storage structures eliminate much of the searching by storing data intelligently (like tree-structured VQ, for example). In our work we do not make use of such indexing structures but concentrate on better audio features and retrieval mechanisms.

Lu [44] describes various structures for multimedia data retrieval. He discusses the three types of queries (point query, range query and k-nearest-neighbour query) and the data structures used for such query engines. Multidimensional data structures such as B+ and B trees, kd-trees, grid files and R-trees are dealt with in this paper.

Cano et al. [67] discuss a FastMap-based approach to a QBE system. FastMap [68] was introduced to overcome the high computational complexity of MDS (multidimensional scaling) methods for visualizing datasets; it also allows the QBE paradigm to be based on any generic distance matrix. Cano’s work emphasizes a generic QBE system using around 1900 songs. The authors stress the ease of using FastMap along with its low computational complexity3.

Foote [33] describes a hierarchical QBE system using MFCCs and a Q-tree structure. The Q-tree structure helps retrieve similar-sounding clips faster and can be built in a supervised fashion. The cosine distance measure was found to be superior to the Euclidean distance and is hence used for retrieval. A sound corpus of around 400 clips is used for the experiments.

Chen and Chen [69] describe a system that retrieves rhythmically similar songs using an efficient indexing structure called the L-tree. A rhythmic similarity measure is defined and an approximate string-matching paradigm is used for retrieval. A small database of 102 folk songs is used for the experiments.

We have not discussed retrieval by melody (or QBH systems in QBE format), since they usually involve the same QBH framework except for the mode of the input query. The systems reviewed in this brief literature survey are summarized in Table 2.1.

3www.iua.ufp.es/mtg/SongSurfer


2.5 Linking the QBES’

We now try to link the works described above using a different view of QBES. The common aspects of the QBES detailed above are dealt with in the following sections. Systems belonging to the three classes explained above have some similarities and differences when viewed from different perspectives, and we try to bring these out here. This will enable us to understand clearly the requirements of a QBES for various types of similarity.

2.5.1 Similarity in a QBE system

QBES allow for various forms of similarity during the retrieval stage, unlike QBH (retrieval by melody), QBB (retrieval by sound similarity) or QBT (retrieval by rhythm or tapping pattern). Similarity between songs can be of many types and QBE allows for that: similarity of mood [24, 54], similarity of rhythm [22, 60, 66, 69], similarity of timbre [33, 30, 21, 56, 58], similarity of melody [55, 18, 65] and many other forms of similarity can be implemented in a QBES. As can be seen in Section 2.1, systems concentrating on different aspects of QBES may work towards achieving the same mode of retrieval. We will look at one of these types of similarity in our work.

2.5.2 Features in a QBE system

A QBE system needs some set of features to be extracted from the database songs in order to compare a query song with any database song. Different types of features have been used in the literature for different types of similarity. As reported in Section 1.2.3, many classes of features have been used to represent different properties of a signal. MFCCs are the most popular of them, especially for representing the timbre of a song [30, 17, 21, 32, 58, 33, 56]. Inter-onset intervals (IOI) [60, 69, 22] and the melody line (pitch contour) [55, 18, 65] are usually used for rhythmic and melodic similarity retrieval respectively. Energy, ∆-energy, spectral flux and zero crossings are usually used to estimate the mood and “noisyness” of music [24, 54, 1]. As can be seen, these features cut across the classes of features, and many QBES use them in tandem (for example, rhythm and timbre features are used together to retrieve songs that are not only timbrally similar but also rhythmically similar [27]). In the following chapters, we look at these features in detail to build our MIR algorithms.

2.5.3 Retrieval algorithm and distance measure in a QBE system

This stage is the most important and compute-intensive stage in a QBES. In this stage, the feature set extracted from the query song is compared, using some distance measure, with those of the songs in the database. Usually, a high-dimensional feature vector is formed from the extracted feature set, and every song in the database is represented as a point or a cluster of points in a high-dimensional space [17, 21, 32, 58]. Some form of parametric representation is used for this cluster of points, for ease of storage and searching; GMMs and k-means are the most widely used modelling techniques for representing the clusters [17, 21, 32, 58]. The distance between the parametric representation of the query song and that of each database song is then calculated using some distance measure. Euclidean or cosine distance [22], KLD [56], EMD [17], transportation distances [28] or even newer distance measures [58, 59, 1, 2] are commonly used. After calculating the distances, the closest matches (the best 10 or top 50 matching songs, say) in a nearest-neighbour sense are retrieved as the set of best matching songs.

2.5.4 Database size in a QBE system

Though we mention this aspect of a QBE system at this point in the thesis, it will be elaborated upon at later stages, when we have developed our retrieval system and discuss the problems faced by an MIR algorithm when dealing with small databases. For now, a qualitative discussion shall suffice.

There are effectively two factors that determine retrieval in a QBES: the computational cost of the algorithm and the database size. Setting aside computational cost, the database size plays a very important role in retrieval and in the quality of the retrievals. In general, the bigger the database, the better the retrieval tends to be. The content of the database also matters a great deal. Some form of ground truth is needed to evaluate a QBE system objectively, and some form of subjective evaluation is needed for a complete analysis.


We shall deal with these aspects later in the thesis. We shall also devise a few self-tests and ways to intelligently build databases so that systems can be evaluated even without cumbersome subjective evaluation tests.

2.6 Chapter Summary

We have attempted to give a brief introduction to QBES and a broad overview of the QBES literature. We have tried to showcase the motivation that researchers over the past decade have had to work on QBES and the directions they have taken. QBES have been tackled from three different viewpoints: designing better features, designing better retrieval algorithms and designing better indexing structures. We would like to emphasize that our work concentrates on improving retrieval in QBES from a signal processing point of view. We have refrained from using or designing any form of indexing structure and hence will not refer to them in this thesis any further. We have also looked at a QBES at a block-diagram level and analyzed each block in sufficient detail to understand the overall flow of control and information in a QBES.

We discuss our first QBES in the next chapter. We start by choosing the features we deem important for various aspects of a retrieval system and justify our choices. A simple matching algorithm that retrieves songs similar in timbre and rhythm to the query song is described, along with the experiments we conducted on this system to show its strengths. The various disadvantages of the system are then elucidated and possible improvements suggested.


Table 2.1: Papers reviewed in this chapter

Research group        | Features                           | Functionality | Database size
Welsh [5]             | Tonal histograms etc.              | T.S           | 7000
Foote [22]            | Beat spectrum                      | R.S           | 120
Cheng [20]            | Power spectrum + DTW               | E.S           | 200
Feng [24, 54]         | Tempo and articulation features    | S.M           | 223
Liu and Wan [15]      | 87 features                        | G.S           | 1600
Venkat [55]           | Pitch contour + continuity         | M.S           | 100
Aucouturier [21, 32]  | MFCC + GMM                         | T.S           | 17000
Logan [17]            | MFCC + EMD                         | T.S           | 8000
Spevak [19]           | MFCC + DTW                         | C.S           | -
Baumann [56]          | MFCC + k-means + KLD               | T.S           | 800
Velivelli [57]        | MFCC + Energy + HMM                | Th.S          | -
Liu [58]              | MFCC + GMM                         | T.S           | 250
Paulus [60]           | Loudness + MFCCs + DTW             | R.S           | 400
Kashino [61]          | BPFs + LPC + ∆-LPC                 | I and S       | -
Park [64]             | LPC + MFCC + ZCR etc.              | C             | 2000
Doraisamy [18, 65]    | Pitch, note onset etc. + n-grams   | M.S           | 3096
Zhang [66]            | Physical and perceptual features   | H.C           | 1500
Cano [67]             | Fastmap                            | Gen.S         | 1900
Foote [33]            | MFCCs + Q-tree                     | T.S           | 400
Chen [69]             | Notes                              | R.S           | 102

T.S : Timbral Similarity, R.S : Rhythmic Similarity, Th.S : Thematic Similarity, I and S : Identification and Search, H.C : Hierarchical Classification, M.S : Melodic Similarity, S.M : Similarity of Mood, C : Classification, E.S : Energy Similarity, G.S : Genre Similarity, Gen.S : Generic Similarity, C.S : Clip Search.


Chapter 3

Quebex : A QBES for MIR

You may fail trying but don't fail to try

- Mark Twain

This chapter describes a Query by Example (QBE) system, Quebex, that takes a piece of music as input and retrieves “similar” music pieces. We begin by describing our audio database in Section 3.1, since the same database is used for the two QBESs we have designed. In Section 3.2 we define a strict form of similarity for retrieval in the Quebex system. The features used in the system are elaborated in Section 3.3, and a simple retrieval algorithm is presented in Section 3.4. Section 3.5 deals with the experiments conducted on the Quebex system and the results. We wrap up the chapter with a summary and the motivation for our next algorithm.

3.1 Database Information

The audio database we have created is used by both Quebex and our Arminion system (Chapter 4). The database contains 1581 clips, each approximately 10 seconds long, in raw audio (.wav) format. All clips have been manually extracted from CDs and resampled at 16 kHz. The clips are numbered 1.wav through 1581.wav. Each clip is normalized in intensity to an average value of 70 dB, since humans are most accustomed to listening at this level [70]. This compensates for intensity variations that may have occurred during either the recording stage or the clip-extraction stage.

The database has clips1 in various languages such as English, Hindi, Telugu, Tamil, Kannada, Punjabi and Malayalam. Songs belonging to various genres have also been used: rock, pop, classical (both Western and Indian), film music, fusion, orchestral music and jazz make up the genres in the database. The database has around 800 pieces from Indian films, and there are also some clips with no vocals. Clips have been extracted from international artistes and bands such as the Beatles, Michael Jackson, Britney Spears, Yanni, Kenny G, Vanessa Mae, Enigma, Bryan Adams, Christina Aguilera and others.

The clips do not carry any metalabel2 such as “genre” or “artist”. Metalabels were not used since they could not be reliably obtained for the Indian film songs; moreover, genre is ill defined for Indian film music. The lack of metalabels prevents retrieval systems using this database from following the “retrieval by metalabel tags” paradigm, but improper metalabels would lead to poor QBES performance. Otherwise, label / metadata information could have been used for text-based querying of the database [9].

For purposes of self-evaluation of the QBESs, the database was developed in such a way as to allow some objective analysis. Two identical (or almost identical) clips were extracted from different places in a full five-minute song and stored as adjacent clips in the database. We call these “twin clips (TCs)”. For example, clip 610.wav contains a Britney Spears song and clip 611.wav contains a clipping from a different position in the same song with similar lyrics / instrumentation. These TCs act as very good test inputs, since the performance of a QBES can be estimated by verifying whether the system retrieves the twin of a clip within the top 10 retrievals. We also had around 30 songs with a dubbed / remixed version in the database: when a film is dubbed or remade into a different language in India, the songs get dubbed too, and these songs usually have the same tune, instrumentation and performers. These “dubbed twin clips (DTCs)” likewise act as good indicators of a QBES’ retrieval performance. This kind of controlled redundancy in the database for testing purposes can eliminate a lot of cumbersome subjective evaluation tests.

1 We have used the words “clip” and “song” interchangeably in Chapters 3 and 4.
2 Metalabel: a label that contains some information about the data, e.g. genre, artist or album name.

Though not very large, the current database is adequate for testing an MIRS. The clips in the database can be viewed as thumbnails of the corresponding full songs, even though thumbnails are usually extracted automatically. Ideally we would have liked to build a much bigger database of around 10,000 songs with reliable metalabels.

3.2 Similarity definition for Quebex

The similarity functions used in the systems described in Chapter 2 are usually based on spectral content, with little or no regard to tempo (the perceived speed of a musical performance). One way to account for tempo would be to use the average tempo or beat of the song [71] as a feature; a retrieved song is then considered similar only if its tempo also matches, in addition to the spectral features. In our system, we define a strict form of similarity between any two songs.

Our algorithm combines the techniques used in [21, 22, 19, 71]. We use tempo information along with timbre information to define a strict form of similarity. Two audio pieces are considered “similar” if they are:

1. Rhythmically similar (have the same tempo and onset pattern).

2. Temporally similar in timbre. For example, suppose song A has two segments, with violin in the first segment and piano in the second. We say song B is similar to song A if it has a section where violin is played followed by piano; that is, we also match the timbres of the two songs over time.

Rule 2 above retrieved extremely good matches when there were enough similar songs in the database. From now on, “similar” will mean the above two rules unless otherwise mentioned. Our similarity definition is somewhat equivalent to that used for clip search in a QBES.

3.3 Feature extraction for the Quebex system

In this section, we analyze the set of features needed to best characterize the type of similarity defined above. Since we need temporal similarity of timbre, we need some form of time-indexed timbre feature set that can compare the timbre of the query song with that of a database song in a sequential manner. We also need features that give a good estimate of the rhythm of a song: a feature representing the average beat rate and possibly a feature vector storing a template of significant onsets for every clip. To capture the similarity defined above, we extract three classes of features: timbre features, temporal and energy features, and rhythm features. We describe the extraction of these features and their properties in the following pages.

We represent the temporal structure of the timbre of a clip using MFCCs, spectral flux and zero crossings (the latter two estimate the noise in the signal). To account for the non-stationary nature of acoustic signals, each song is divided into non-overlapping frames of 20 milliseconds and features are extracted on a per-frame basis. This concept of splitting the signal into small frames, within which the signal is assumed to be stationary, is well established in speech and audio research.

3.3.1 I. Extraction of features for representing timbre

i. Spectral flux

Spectral flux is defined as the overall change in spectrum between two adjacent frames. Let F(i, j) be the jth bin of the Short Time Fourier Transform (STFT) of the ith frame. We use a 512-point FFT in our implementation, since each 20 ms frame contains 320 samples. The spectral flux Flux(i) for the ith frame is calculated as

Flux(i) = \sum_{j=1}^{512} |F(i, j) - F(i+1, j)|        (3.1)

The mean and standard deviation of the spectral flux (Fluxavg and Fluxstd) are computed for every song [14, 16, 25]. Spectral flux gives an estimate of the amount of spectral variation in the signal; songs that are “texturally stable” or have few transients will have low values for these two features.

ii. Zero crossings

For every frame, the number of times the signal crosses zero (changes its sign) is taken as the zero-crossing count for the frame. Let S(i, j) be the jth sample in the ith frame of the signal. The zero-crossing count ZCR(i) for the ith frame is calculated as


ZCR(i) = \frac{1}{2} \sum_{j=1}^{319} |sgn(S(i, j)) - sgn(S(i, j+1))|        (3.2)

where sgn(x) is 1 if x > 0 and -1 if x ≤ 0.

The average and standard deviation of the ZCR (ZCRavg and ZCRstd) are computed for every song in the database. Zero crossings indicate the amount of “noisyness” in a signal and have been used successfully to distinguish speech from music [72] and even as features for genre recognition [25].
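A minimal NumPy sketch of these two frame-level computations is given below. The frame layout (one 20 ms, 320-sample frame per row) and the 512-point FFT follow the description above; the function and variable names are illustrative, not the thesis implementation.

import numpy as np

def frame_features(frames, n_fft=512):
    # frames: (num_frames x 320) array of 20 ms frames at 16 kHz
    spectra = np.abs(np.fft.fft(frames, n=n_fft, axis=1))       # |STFT| of every frame
    flux = np.sum(np.abs(np.diff(spectra, axis=0)), axis=1)     # Eq. (3.1): change between adjacent frames
    signs = np.where(frames > 0, 1, -1)                         # sgn(x) as defined above
    zcr = np.sum(np.abs(np.diff(signs, axis=1)), axis=1) / 2    # Eq. (3.2)
    return flux, zcr

# Song-level descriptors are the means and standard deviations over frames, e.g.
# flux_avg, flux_std = flux.mean(), flux.std()
# zcr_avg, zcr_std = zcr.mean(), zcr.std()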

iii. Mel Frequency Cepstral Coefficients (MFCCs)

MFCCs are perceptually motivated features that have been in vogue in speech research for the past decade [73]. They have been shown to be useful in music research for capturing timbre [30], and their strength in discriminating between different genres of music is well known [25].

MFCCs are computed for each frame by passing the signal through a Mel-scaled filterbank and taking the Discrete Cosine Transform (DCT) of the output for compaction. The filterbank performs an integration on the spectrum of the signal, smoothing out fine detail and presenting a frequency-averaged version of the spectrum. We use 40 Mel filters in our filterbank, which yields 40 cepstral coefficients, of which the first 13 are the most significant. We retain only the first 8 coefficients, since they capture the “global timbre” of the song [21, 17] (MFCC(i, j) denotes the MFCC vector of the jth frame of the ith song). Since we have about 10 seconds of music per clip, this comes to around 500 frames.

We split the 10-second clip into non-overlapping windows of 0.5 seconds. We take the mean and standard deviation of the MFCC vectors falling within each 0.5-second window and represent the group of frames within the window by these mean and standard-deviation vectors. There are 25 frames within each half-second window.

MFCCAvg(i, j) = \frac{1}{25} \sum_{k=(j-1) \cdot 25}^{j \cdot 25} MFCC(i, k)        (3.3)

MFCCDvn(i, k, j) = \| MFCC(i, k) - MFCCAvg(i, j) \|_2        (3.4)

MFCCStd(i, j) = \sqrt{ \frac{1}{25} \sum_{k=(j-1) \cdot 25}^{j \cdot 25} MFCCDvn(i, k, j) }        (3.5)

where j varies from 1 to 20. This way we have a trajectory of 40 MFCC-derived vectors (20 mean vectors and 20 standard-deviation vectors) for every song. The frames within each half-second window could also be modelled using GMMs.
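The windowed MFCC trajectory can be sketched as follows, assuming an already computed (num_frames x 8) MFCC matrix. Per-coefficient statistics are used here so that both summaries are points in the 8-dimensional MFCC space, which is how they are treated by the distance in Section 3.4; the function and variable names are illustrative.

import numpy as np

def mfcc_trajectory(mfcc, frames_per_window=25):
    # mfcc: (num_frames x 8) matrix; 25 frames per 0.5 s window -> 20 windows per 10 s clip
    n_win = mfcc.shape[0] // frames_per_window
    means = np.empty((n_win, mfcc.shape[1]))
    stds = np.empty((n_win, mfcc.shape[1]))
    for j in range(n_win):
        block = mfcc[j * frames_per_window:(j + 1) * frames_per_window]
        means[j] = block.mean(axis=0)   # MFCCAvg(i, j), Eq. (3.3)
        stds[j] = block.std(axis=0)     # spread within the window (cf. Eqs. 3.4-3.5)
    return means, stds                  # 20 mean vectors and 20 spread vectors per clip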

3.3.2 II. Extraction of temporal and energy features

To estimate the amount of silence and the average intensity in a clip, we extract energy-based features as follows. For each music piece, the percentage of frames whose energy is less than the average energy of the piece (Per_Eavg) is calculated and stored. If the signal is represented as x(n) and has N samples, the average energy of the music piece is

Eavg = \frac{1}{N} \sum_{n=1}^{N} x(n)^2        (3.6)

The energy of frame i, E_frame(i), is calculated using Eq. 3.7:

E_frame(i) = \sum_{k=1}^{320} x_i(k)^2        (3.7)

where x_i(k) is the kth sample of the ith frame (320 samples per 20 ms frame).

The mean and standard deviation of the frame energies E_frame(i) are computed as Song_Eavg and Song_Estd.
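A small sketch of these energy features, assuming the clip x is a 1-D array of samples split into 20 ms (320-sample) non-overlapping frames. The comparison against the clip average is normalised per sample, which is one reading of the Per_Eavg definition; names are illustrative.

import numpy as np

def energy_features(x, frame_len=320):
    e_avg = np.mean(x ** 2)                                    # Eq. (3.6): average energy per sample
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    e_frame = np.sum(frames ** 2, axis=1)                      # Eq. (3.7): energy of each frame
    per_e_avg = np.mean(e_frame / frame_len < e_avg)           # Per_Eavg: fraction of low-energy frames
    return per_e_avg, e_frame.mean(), e_frame.std()            # Per_Eavg, Song_Eavg, Song_Estd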

3.3.3 III. Extraction of rhythm features

To estimate the beat and average onset rate of a song, we extract the following set of rhythm-based features. A psychoacoustically motivated onset-detection algorithm [70] is used to extract significant onsets. First the audio clip is passed through a filterbank of 30 filters, with adjacent filters spaced one equivalent rectangular bandwidth (ERB) apart. Within each subband, we pick significant energy transitions (transitions above a threshold) and sum the transitions across subbands to find the significant onsets. The mean and standard deviation of the onset energies are calculated, where the energy of an onset is taken to be the energy of the frame declared an onset frame (OnsetEn_avg and OnsetEn_std). This yields an onset track giving the time instants at which significant onsets occur. Onsets that are very close (within 50 milliseconds of each other) are merged, and one representative onset with the highest onset strength is retained for each such group. A low-pass-filtered version of the onset track is computed and its autocorrelation taken; the low-pass filtering simulates the ear's grouping and time-masking phenomena [70]. The autocorrelation reveals the presence of rhythm in the signal, and the beat is well captured by the autocorrelation function [74]. We normalize the autocorrelation function and downsample it to a 50-point template. Let Templ_acf(i) be the template for the ith song. Songs with similar onset strengths and inter-onset intervals (IOI) will have similar onset templates. Note that in Fig. 3.1 the autocorrelation function of Song 3 is different from those of Song 1 and Song 2, since Song 1 and Song 2 are DTCs. The number of onsets per second is taken as the average tempo of the song (Tempo_avg). The average and standard deviation of onset strength indicate how strong each onset is and how much the onsets vary from the mean.

Figure 3.1: Rhythm feature based on normalized autocorrelation template
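The template construction can be sketched as below, assuming the onset detection has already produced an onset-strength track sampled at the frame rate. The first-order low-pass filter merely stands in for the perceptual smoothing described above, and SciPy's resample is used for the 50-point downsampling; neither is claimed to be the thesis implementation.

import numpy as np
from scipy.signal import lfilter, resample

def rhythm_template(onset_track, template_len=50):
    # onset_track: per-frame onset strengths (zeros where no onset was kept)
    smoothed = lfilter([0.1], [1.0, -0.9], onset_track)          # crude low-pass (leaky integrator)
    acf = np.correlate(smoothed, smoothed, mode='full')[len(smoothed) - 1:]
    acf = acf / acf[0] if acf[0] > 0 else acf                    # normalise so lag 0 equals 1
    return resample(acf, template_len)                           # Templ_acf: 50-point template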

In the next section we shall look at the algorithm to retrieve similar songs.


3.4 Algorithm for retrieval

Figure 3.2 shows a block diagram of the steps involved in the Quebex algorithm. As described in Section 3.1, the first two steps (resampling and intensity normalization) are performed offline for the audio in the database; for any new audio clip they are performed online before the feature extraction stage. The feature extraction stage extracts the three types of features described in Section 3.3.

Figure 3.2: Block diagram of the Quebex algorithm (resample to 16 kHz and normalize average intensity to 70 dB; extract spectral, temporal/energy and rhythm features offline for the database; remove 50% of the songs after the first stage and return the top 10 best retrievals).

A MATLAB GUI has been developed for demonstrating the Quebex system (Fig. 3.4). A query song can be selected from the database itself using the GUI; a song can be played (using the “Play song” button) before selecting it (using the “Select song” button) as the query. The algorithm then searches for similar songs in the database and retrieves the nearest 10 songs based on a ranking system.

The following steps are used for retrieval:

(i) Given a query song A, a symmetric distance between the MFCC-vector trajectories of song A and every song in the database is calculated as shown in Figure 3.3. That is, in the 8-dimensional MFCC space, we map each point of query song A to the nearest point in song B's trajectory, subject to a temporal restriction (Figure 3.3). Let A(i) and B(i) denote the MFCC features extracted for the ith half-second window of songs A and B respectively (each MFCCAvg and MFCCStd feature of the ith half-second window is a point in the 8-dimensional space). The Euclidean distance between point i of the query song's trajectory and the database song B is computed by matching to the nearest of B's points numbered i-1, i and i+1. This allows a temporally constrained matching of the timbres of the two songs, as in the following equations:

d(A, B) = \sum_{i=2}^{19} \min\{ \|A(i) - B(i-1)\|_2, \|A(i) - B(i)\|_2, \|A(i) - B(i+1)\|_2 \}
          + \min\{ \|A(1) - B(1)\|_2, \|A(1) - B(2)\|_2, \|A(1) - B(3)\|_2 \}
          + \min\{ \|A(20) - B(18)\|_2, \|A(20) - B(19)\|_2, \|A(20) - B(20)\|_2 \}        (3.8)

where d(A, B) is the distance of song A with respect to song B. The last two terms of the equation handle the start and end points.

Notice that this distance is asymmetric, due to the asymmetric temporal constraints on songs A and B:

d(A, B) \neq d(B, A)        (3.9)

To make it symmetric, the two mutual distances are averaged:

D(A, B) = (d(A, B) + d(B, A)) / 2        (3.10)


Figure 3.3: A suboptimal mapping function for distance calculation between feature-vector trajectories. Node 1 of Song 1 can be mapped to the nearest of Nodes 1, 2 or 3 of Song 2; any intermediate Node i of Song 1 can be mapped to the nearest of Nodes i-1, i or i+1 of Song 2.

The above distance measure ensures that the distance between a song and itself is zero, i.e.

D(A, A) = 0        (3.11)
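A minimal sketch of this temporally constrained trajectory distance follows, with A and B given as (20 x 8) arrays of per-window vectors (the distance is applied separately to the MFCCAvg and MFCCStd trajectories); the function names are illustrative.

import numpy as np

def d_traj(A, B):
    # One-sided distance of Eq. (3.8): A(i) is matched to the nearest of
    # B(i-1), B(i), B(i+1); the first and last windows use the first/last three of B.
    n = len(A)
    total = 0.0
    for i in range(n):
        lo, hi = (0, 3) if i == 0 else ((n - 3, n) if i == n - 1 else (i - 1, i + 2))
        total += min(np.linalg.norm(A[i] - B[j]) for j in range(lo, hi))
    return total

def D_sym(A, B):
    return 0.5 * (d_traj(A, B) + d_traj(B, A))   # Eq. (3.10): symmetrised distance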

The distance function is based on the Hausdorff distance measure [75]; it satisfies the identity property (3.11), the symmetry property (3.10) and the uniqueness property (i.e. D(A, B) = 0 implies A = B), but it does not satisfy the triangle inequality (see Appendix .1) and is therefore not a metric. The above distance is applied to two features, MFCCAvg and MFCCStd (block C1 in Fig. 3.2), and the songs are ranked according to their distance from the query song (rank R1). Next, Euclidean distances between the features Fluxavg, Fluxstd, ZCRavg and ZCRstd of the query song A and a database song B are calculated for all songs in the database:

dfavg(A, B) = (Fluxavg(A) - Fluxavg(B))^2        (3.12)

dfstd(A, B) = (Fluxstd(A) - Fluxstd(B))^2        (3.13)

D_Flux(A, B) = dfavg(A, B) + dfstd(A, B)        (3.14)

dzavg(A, B) = (ZCRavg(A) - ZCRavg(B))^2        (3.15)


Figure 3.4: Projekt Quebex’s MATLAB GUI

dzstd(A, B) = (ZCRstd(A) - ZCRstd(B))^2        (3.16)

D_Zcr(A, B) = dzavg(A, B) + dzstd(A, B)        (3.17)

These distances are again converted to ranks (R2 for D_Flux and R3 for D_Zcr). Ranking is intuitively satisfying since the features have different orders of magnitude, and adding raw distances would neglect the importance of some features. A final rank R (block C1 in Fig. 3.2) is calculated as

R = R1 + α · R2 + β · R3        (3.18)

where α and β are constants weighting the ranks R2 and R3 respectively, so that different features can be given different importance. We set α = 1 and β = 0.25, meaning that the zero-crossings measure, which relates to the 'noisyness' of a music piece, is downweighted relative to the spectral distance measures R1 and R2. (A small sketch of this rank fusion appears after the retrieval steps below.)

(ii) All songs ranked in the top 50% by the rank R are passed on for further processing. This sieving eliminates songs whose timbre and spectral features are far from those of the query song.

(iii) A Euclidean distance is then calculated between the temporal features (Per_Eavg, Song_Eavg and Song_Estd) of the query song and each sieved song, and the songs are ranked (block C2 in Fig. 3.2). These ranks are combined, as explained above, into a “temporal feature matching rank” (R_temporal).

(iv) Distances between the autocorrelation templates (calculated as the sum of differences between corresponding points of the templates), the mean tempo and the onset strengths (OnsetEn_avg, OnsetEn_std, Templ_acf(i) and Tempo_avg) are calculated between the query song and the sieved songs. A weighted ranking is used to obtain a “rhythm feature matching rank” (R_rhythm): each of these features is ranked and the ranks are added with weights (for example, if we want to match beat templates more closely, that ranking is given greater weight).

(v) The two ranks are merged to obtain a final rank for each song:

R_final = R_rhythm + R_temporal        (3.19)

(vi) The 10 songs with the best final rank (R_final) are retrieved; the query song itself is always ranked 1. This ranking system results in better retrieval than a simple Euclidean distance between the features.

The sieving method described above eliminates songs that are far off in terms of timbre. The second stage tries to match not only the average onset rate of the songs but also their onset pattern (through the autocorrelation template). We are justified in doing timbre matching followed by rhythm matching, rather than the other way around, since rhythm features are not as good at discriminating music [27] as timbral features; a rhythm-first retrieval would give very poor results on a small database.
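The rank fusion used in steps (i)-(v) can be sketched as follows, assuming each stage has produced a vector of distances over the candidate songs; the weights follow the values quoted above and the helper names are illustrative.

import numpy as np

def to_ranks(distances):
    # Rank 1 = smallest distance (closest song)
    order = np.argsort(distances)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(distances) + 1)
    return ranks

def first_stage_rank(d_mfcc, d_flux, d_zcr, alpha=1.0, beta=0.25):
    # Eq. (3.18): R = R1 + alpha*R2 + beta*R3
    return to_ranks(d_mfcc) + alpha * to_ranks(d_flux) + beta * to_ranks(d_zcr)

# After sieving to the top 50% by R, the temporal and rhythm ranks are merged
# in the same way: R_final = R_rhythm + R_temporal (Eq. 3.19).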

We believe that the human brain looks for the “strongest” feature in a song and decides on similarity based on the perceptual strength of features (for example, users of our system accepted two rock songs with different instrumentation, tempo and rhythm, but the same “noisyness” (related to zero-crossing rate), as “similar”). However, we do not have a perceptual ranking of these features that would let us say which one dominates. This makes retrieval by similarity challenging.


3.5 Experiments and Results

Ten songs were randomly chosen from the database and their retrievals were evaluated by 3 listeners. For these 10 queries, users reported an average of 50% of the retrieved songs to be similar to the query. The system also always retrieved the TCs and DTCs of the query songs well within the top 10; in this self-test the system worked well. Retrieval was very poor (less than 20%) when there were not enough songs of the query's type in the database, which has motivated us to increase the database size to around 5000 or more songs. The systems in [21, 17] had huge databases to start with and therefore would not have faced this problem, whereas [22, 20] had very few songs (around 200) and must have had it (we tested the outputs of Foote's system [22] and observed this problem for some songs). More detailed tests on the Quebex system were abandoned since the main problem was found to be the size of the database; however, this enabled us to formulate certain empirical rules about databases and their development, which are discussed in detail in the next chapter.

As an alternative to increasing the database size (which is time consuming and involves manual labour), we address the QBE retrieval problem using other similarity measures and algorithms in the following chapters. Quebex retrieved very good matches when there were enough songs of the query's type, but failed badly when songs of a particular type were scarce.

3.6 Conclusions and Chapter Summary

In this chapter we have designed a QBE system that retrieves songs based on a strict form of similarity. The algorithm matches temporal similarity of timbre and rhythm to retrieve similar sound clips. A distance function based on a modified Hausdorff distance measure is used for matching, and a sieving technique that removes irrelevant songs from further processing reduces computation. The system matches timbre and rhythm sequentially for the reasons given above. Though retrieval is good for certain songs, the performance of the system falls drastically if the database is small.

QBE systems based on perceptual similarity measures appear promising. An appropriate characterization of a song or piece of recorded music, of the kind made by listeners in terms of bass, noise, rhythm and so on, may be incorporated into QBE systems. Finding suitable computational features and descriptors to match such a characterization is challenging, and more than one feature set may be needed since no single feature is likely to contain all the relevant information. Future research could concentrate on finding features that are perceptually relevant rather than merely effective objectively, and on ways to automate testing so that subjective listening tests can be dispensed with when evaluating MIRS. We point out some directions for this research in the next chapters.

Chapter 4

Arminion : An improved QBES

It is not enough that you aim, you must hit

- Italian proverb

In this chapter we describe a new QBES. As seen in the previous chapter, a strict form of similarity led to poor audio retrieval in the absence of a huge database. One solution would be to increase the database size, but this was found to be laborious and time consuming. Hence we opted to design a new algorithm that does well on smaller databases, which meant relaxing the strict form of similarity defined for Quebex. The system we designed, called Arminion1, works on raw audio (PCM) data of polyphonic music signals to retrieve pieces that are timbrally similar. We use MFCCs for timbre description; the MFCCs computed for each music piece are clustered into a 3-component Gaussian mixture. We describe a new and intuitively pleasing distance measure between mixture Gaussians that is used to compute the timbral similarity between any two songs in the database, and we show through various experiments that this distance measure successfully retrieves timbrally similar songs. We also propose a new set of features based on the MPEG-4 audio coding model (the HILN model) and evaluate their usefulness for retrieving timbrally similar songs that have similar “noisyness”. We elaborate on various methods of self-evaluation of a QBES and derive guidelines for building databases that aid such self-evaluation.

1 The word is made up from the hARMonics, INdividual lines and NOIse (NOI changed to ION) (HILN) model used in this system.



This chapter is organised as follows. In Section 4.1 we define a new similarity measure for our algorithm. Section 4.2 describes the features needed to capture timbre and “noisyness”; a detailed description of the HILN (Harmonics, Individual Lines and Noise) model and its use in extracting the “noisyness” features is given in the same section. Section 4.3 deals with the clustering of MFCCs using GMMs; a new distance measure between two mixture Gaussians is derived and its properties elaborated. Section 4.4 describes our new algorithm, which works on the above features to retrieve timbrally similar songs. In Section 4.5 we describe the experiments done to study the performance of the MFCCs and our new distance measure, and Section 4.6 demonstrates the capabilities of the HILN features through further experiments. In Section 4.7 we conclude by pointing out directions in which research can progress by taking a cue from our work.

4.1 Similarity for Arminion system

For Arminion, two songs are defined to be similar if they have the same timbre and “noisyness”. Timbre is the perceptual aspect of instrumentation: two songs with similar instrumentation have similar timbres. Noisyness can be loosely defined as the perceived amount of noise in a song; for example, a rock song with electric guitars is perceived as noisy compared with a piano performance of a Bach composition. We use HILN (Harmonics, Individual Lines and Noise) based features to distinguish between these types of songs (i.e. clean or noisy songs). Thus, two songs having the same or similar instruments and similar “noisyness” are considered similar to each other.

4.2 Feature extraction

The features needed to capture timbre and “noisyness / cleanliness” in a song are described in the following sections. As described in Chapter 3, timbre is well captured by MFCCs; we give a more rigorous justification in the next section.


4.2.1 Timbre descriptors - MFCCs and their extraction

MFCC features have long been used in speech recognition. Logan [30] showed that although MFCCs might not model music better than other features, they did not “cause any harm”. Considerable work has been done on timbre using MFCCs [21], and they have proven to be powerful features because of their discriminative abilities [25, 27, 76]. We show through various experiments that MFCCs do indeed capture the timbre of polyphonic music very well.

To extract the MFCCs, we split each music piece in the database into non-overlapping frames of 20 milliseconds. For each frame, we compute 13 MFCC coefficients (see Chapter 3) and retain only the first 8, since they model the global timbre very well [21, 32].

We use a GMM (Gaussian Mixture Model) to cluster the MFCCs into 3 mixtures2. The weight of each mixture, along with its mean and covariance, is stored for each song. A 3-component GMM, i.e. M = 3 (where M is the number of component Gaussians in the GMM), was chosen since it has been shown that 3 mixtures are enough to capture most of the timbral variations in a song [21].

So, for each song in the database we extract the MFCCs and store the parameters of the 3-component GMM built from them. Given a new song, we extract the same parameters and compare them with those of the songs in the database using some distance measure. More details about GMMs and our new distance measure are given in the coming sections.
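A sketch of this per-song timbre model, using scikit-learn's GaussianMixture purely as one convenient EM implementation (the thesis does not prescribe a particular library); only the weights, means and covariances are kept for later comparison.

import numpy as np
from sklearn.mixture import GaussianMixture

def timbre_model(mfcc, n_components=3):
    # mfcc: (num_frames x 8) matrix of the clip's MFCCs
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(mfcc)
    return gmm.weights_, gmm.means_, gmm.covariances_   # parameters stored per song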

4.2.2 “Noisyness” descriptors - HILN based features and their extraction

The HILN model has been used for audio coding in the MPEG-4 standard. It is an advancement over the general McAulay-Quatieri sinusoidal model [77] and its variants such as the sinusoidal + noise model [78] and sinusoidal + noise + transient models [79]. The basic concept of the model is to capture individual entities, namely individual sinusoids, harmonic components and residual noise, from a signal spectrum. For audio coding, depending on the bandwidth / storage capabilities, these components are coded systematically: an extensible coding framework allows the MPEG-4 audio codec to first code only the harmonics in a signal, then add individual sinusoids (those that have no harmonics or are not harmonics of other sinusoids) for better reconstruction of the spectrum, and later add the residual noise for better timbre. For a more detailed explanation of the HILN model see [80]. For the purpose of extracting the HILN-based features, we have implemented the sine + noise model using the frame-based techniques given in [78].

2http://www.autonlab.org/tutorials/gmm.html

In this technique, the signal is split into frames of 20 ms (320 samples per frame at a 16 kHz sampling rate) and multiplied by a Hanning window of the same length. This frame-wise analysis allows a stationarity assumption on the signal characteristics within short windows. A 2048-point Short Time Fourier Transform (STFT) is computed for each frame; the zero padding increases the frequency resolution. From the Fourier transform of each frame we extract the sinusoidal, harmonic and noise components as described below.

Individual sinusoids are found by picking peaks from the Fourier transform, as can be seen in Figure 4.1. We can pick either a fixed or a variable number of pure sinusoids per frame based on constraints such as the energy of the peaks. Picking a variable number of sinusoids is usually better for audio coding, so we follow the same procedure. For picking pure sinusoids, we use the “sinusoid likeness measure”, whose value lies between 0 and 1; a value towards 1 denotes greater sine-likeness.

Let S(ω) and W(ω) be the spectra of the frame under consideration and of the windowing function respectively, and let F(ω) be the spectrum of a pure sinusoid multiplied by the windowing function. The “sinusoid likeness measure” at frequency ωk, SLM(ωk), is calculated as the normalized cross-correlation between S(ω) and F(ω):

r(ω_k) = \sum_{ω = ω_k - B}^{ω_k + B} S(ω) F(ω - ω_k)        (4.1)

SLM(ω_k) = \frac{ |r(ω_k)| }{ \sqrt{ \sum_{ω = ω_k - B}^{ω_k + B} |S(ω)|^2 \; \sum_{ω = ω_k - B}^{ω_k + B} |F(ω - ω_k)|^2 } }        (4.2)

where B is the bandwidth of the main lobe of the windowing function (calculated from the spectrum of the windowing function) and is used to limit the computation.

Figure 4.1: The top panel shows the FFT of a frame of a violin piece; the bottom panel shows the sine-likeness measure for the frame.
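A sketch of the measure, assuming S holds the (zero-padded) frame spectrum and W_lobe holds the precomputed main lobe of the window spectrum, centred and of length 2B + 1; the loop follows Eqs. (4.1)-(4.2) literally and the names are illustrative.

import numpy as np

def sinusoid_likeness(S, W_lobe):
    B = len(W_lobe) // 2
    slm = np.zeros(len(S))
    for k in range(B, len(S) - B):
        seg = S[k - B:k + B + 1]
        r = np.sum(seg * W_lobe)                                       # Eq. (4.1)
        denom = np.sqrt(np.sum(np.abs(seg) ** 2) * np.sum(np.abs(W_lobe) ** 2))
        slm[k] = np.abs(r) / denom if denom > 0 else 0.0               # Eq. (4.2)
    return slm   # values near 1 indicate sinusoidal peaks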

Using the “sinusoid likeness measure”, we first find the pure sines in the signal and subtract them from the overall signal to get the noise component. A pure sinusoid here is not an individual sinusoid but a sine signal shaped by the windowing function. Among the pure sinusoids we find components that form harmonics of one another (a sinusoid within ±20 Hz of a harmonic of a particular sinusoid is considered its harmonic; this accounts for the frequency resolution delivered by the 2048-point Fourier transform) and separate them out. Thus we extract the harmonic components from the set of pure sinusoids, and the remaining sinusoids are labelled as the individual sinusoids (also called individual lines) in the signal.

We then subtract the smoothed sinusoidal components from the spectrum to get the residual spectrum of the frame. This residual spectrum is usually devoid of sharp peaks (since these would have been removed as pure sinusoids) and can be useful for detecting percussion instruments [78, 81].

This entire processing is done in the frequency domain, as detailed in [78], and the phase information is ignored since we do not resynthesize the signal from the extracted parameters. From the three extracted components, we derive the following set of features:

I. Mean harmonic energy (mhe) and standard deviation of harmonic energy (she): for each frame, the energy in the harmonic components is computed, and its mean and standard deviation are taken over the entire signal.

II. Mean individual-lines energy (mile) and standard deviation of individual-lines energy (sile): for each frame, the energy in the individual-line components is computed, and its mean and standard deviation are taken over the entire signal. Together with the two features above, this is a very good indication of the 'tone-likeness' of the signal.

III. Mean noise energy (mne) and standard deviation of noise energy (sne): for each frame, the energy in the noise is computed, and its mean and standard deviation are taken over the entire signal. This is a very good indication of the 'noise-likeness' of the signal; a high percentage of noise energy implies the signal is usually harsh on the ears, as in hard rock music. Though zero crossings give an estimate of the presence of noise in a signal, they fail to give an estimate of its strength.

IV. Mean number of harmonic components (mhc) and standard deviation of the number of harmonic components (shc): this is the mean number of harmonic components per frame and its standard deviation. A large number of harmonic components with a high percentage of energy in the harmonic part, together with a low standard deviation of the number of harmonic components, implies that the music is very harmonious and has very 'pure' instrumentation (like a solo violin playing a slow melody). Deductions of this kind can be made for the whole set of features defined here, which may help us derive various descriptors of music.

V. Mean number of individual lines (mil) and standard deviation of the number of individual lines (sil): this is the mean number of individual sinusoids per frame and its standard deviation. A large number of individual lines implies that the music is very tone-like (a solo violin playing a set of high-frequency notes is usually not very consonant but is still more tone-like than noise-like).


VI. Mean noise centroid (mnc) and standard deviation of the noise centroid (snc): we calculate the spectral centroid of the noise component for every frame and take the mean and standard deviation across frames. This indicates the type of noise in the signal. For example, a drum signal is better detected from the noise residual [78, 81] than from the whole spectrum, so the type of percussion instrument can be estimated (though only crudely, since much information is lost by taking just the mean and standard deviation). Thus, signals with castanets and those with bass drums can be well separated using these two features.

The above set of features also helps to distinguish between speech and music very well, as we illustrate in the next chapter.
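Once the per-frame harmonic, individual-line and noise quantities are available, the twelve descriptors reduce to means and standard deviations, e.g. as in the sketch below (the sine + noise decomposition itself is assumed done; the function and key names are illustrative).

import numpy as np

def hiln_features(harm_e, line_e, noise_e, n_harm, n_lines, noise_centroid):
    # Each argument is a per-frame sequence produced by the sine + noise analysis.
    feats = {}
    for name, x in zip(['he', 'ile', 'ne', 'hc', 'il', 'nc'],
                       [harm_e, line_e, noise_e, n_harm, n_lines, noise_centroid]):
        x = np.asarray(x, dtype=float)
        feats['m' + name] = x.mean()   # mhe, mile, mne, mhc, mil, mnc
        feats['s' + name] = x.std()    # she, sile, sne, shc, sil, snc
    return feats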

Now that we have extracted the features from the songs in the database, we would like fast distance measures that compare songs using as little computation as possible. Since we have clustered the MFCCs using GMMs, we need a distance measure between GMMs. Liu et al. [59] proposed a distance measure between mixture Gaussians; we arrived at a similar measure, though our motivation differs from that of Liu [58].

4.3 Calculating the distance between 2 GMMs

As mentioned above, we would like to compute the distance between two GMMs. For M = 1 (M being the number of mixture components), we have the Kullback-Leibler divergence, which measures how similar one Gaussian is to another. But no simple closed-form distance measures exist for GMMs with M > 1. We therefore arrive at a simple distance measure and show its properties and effectiveness.

4.3.1 A new and simple distance measure for GMMs with M > 1

When MFCCs are used to represent timbre, a multicomponent GMM (M > 1) models the timbres of the individual instruments better than a single Gaussian (M = 1). GMMs with M = 3 are usually sufficient to model most of the timbral variations in a musical piece [32]; a recent publication gives a detailed analysis of the impact of M on retrieval accuracy [82]. We choose M = 3, since it performs well on most clips in our database.

In polyphonic music, each component instrument usually has a different timbre. When a GMM with M > 1 is used to model such a piece through its MFCCs, each component (mixture) Gaussian 'tries' to capture the individual timbre of a component instrument. In other words, the individual component Gaussians effectively capture the timbre of each of the significant instruments in the polyphonic piece, because the MFCCs of each instrument cluster around their own region of the n-dimensional timbre space. This is intuitively pleasing once we note that most polyphonic pieces do not contain two instruments that sound almost the same (like a pan flute and a bamboo flute), so the component instrument timbres are well separated and thus well captured by the components of the GMM. In some cases this does not hold, since the final configuration of the EM (Expectation Maximization) algorithm used for mixture estimation depends on various stopping factors (the strength and duration of occurrence of each instrument in the piece also matter), but in general the assumption holds reasonably well.

We now propose a new distance measure based on the Mahalanobis distance. The Mahalanobis distance between a Gaussian with mean µ (an N × 1 column vector) and covariance matrix Σ (of size N × N), and a point x in the N-dimensional space, is given by

D(µ, x) = (µ - x)^T Σ^{-1} (µ - x)        (4.3)

We now modify this distance measure to suit the case M > 1. Let there be two GMMs A and B, each with M = 3. Let p11, p12 and p13 be the weights of the component Gaussians of GMM A, with mean vectors µ11, µ12 and µ13 and covariance matrices Σ11, Σ12 and Σ13. For GMM B, let the corresponding parameters be p21, p22 and p23, µ21, µ22 and µ23, and Σ21, Σ22 and Σ23 respectively.

The distance between GMM A and GMM B is calculated by the following steps:

(i) We calculate the Mahalanobis distance between µ11 and µ21 with respect to each of the two Gaussians and add them:

D(µ11, µ21) = (µ11 - µ21)^T Σ11^{-1} (µ11 - µ21)        (4.4)


D(µ21, µ11) = (µ21 - µ11)^T Σ21^{-1} (µ21 - µ11)        (4.5)

(ii) We now weight this distance by the component Gaussian weights, i.e. p11 and p21. This weighting implies that greater importance is given to component Gaussians with high weights and less importance to components with low weights:

Dfin(µ11, µ21) = {D(µ11, µ21) + D(µ21, µ11)} / (p11 · p21)        (4.6)

For example, let the weight of component Gaussian A1 be 0.05 and that of B1 be 0.01, and let the Mahalanobis distance calculated as above between the mean of A1 and that of B1 be d. According to our modification, the distance becomes d/(0.05 · 0.01). This is intuitively appealing since the distance calculation should take into account how important a particular component is to its GMM: if a component is very important, its weight is high (but always ≤ 1), and then the distance d between that component and a component of the other GMM is strongly indicative of the actual separation of the two GMMs. We calculate this distance for all pairs of component mean vectors (each pair containing one component Gaussian from A and one from B).

(iii) We define the distance between GMM A's first component mean µ11 and GMM B as

Dfin(µ11, B) = min{Dfin(µ11, µ21), Dfin(µ11, µ22), Dfin(µ11, µ23)} (4.7)

This operation matches a component Gaussian to the nearest Gaussian of the other GMM, which amounts to matching a timbre component of the query song to a timbre component of the database song.

(iv) We find the overall distance between GMM A and GMM B as

DGMM(A, B) = Dfin(µ11, B) + Dfin(µ12, B) + Dfin(µ13, B)        (4.8)

(v) This distance measure is not symmetric, i.e

DGMM(A, B) ≠ DGMM(B, A)        (4.9)

We make it symmetric by averaging DGMM(A, B) and DGMM(B, A), so the final distance between GMMs A and B is


Dsym(A,B) = {DGMM(A,B) + DGMM(B,A)}/2 (4.10)

Thus we have a new and simple distance measure that is computationally inexpensive and requires only a few parameters to be stored3.

(vi) The distance between A and itself is 0,

DGMM(A,A) = 0 (4.11)

(vii) The distance between A and B is ≥ 0,

DGMM(A,B) ≥ 0 (4.12)

Note that this distance measure is not a metric since it does not satisfy the triangle inequality. We demonstrate its efficacy through three experiments in the following sections.
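A minimal sketch of the measure, with each GMM stored as a (weights, means, covariances) triple as in the previous section; in practice the inverse covariances would be precomputed and cached, which is omitted here for clarity.

import numpy as np

def gmm_distance(gmm_a, gmm_b):
    # Symmetrised distance of Eqs. (4.4)-(4.10) between two GMMs.
    def one_sided(a, b):
        wa, mua, cova = a
        wb, mub, covb = b
        total = 0.0
        for i in range(len(wa)):
            best = np.inf
            for j in range(len(wb)):
                diff = mua[i] - mub[j]
                d_ab = diff @ np.linalg.inv(cova[i]) @ diff        # Eq. (4.4)
                d_ba = diff @ np.linalg.inv(covb[j]) @ diff        # Eq. (4.5)
                best = min(best, (d_ab + d_ba) / (wa[i] * wb[j]))  # Eqs. (4.6)-(4.7)
            total += best                                          # Eq. (4.8)
        return total
    return 0.5 * (one_sided(gmm_a, gmm_b) + one_sided(gmm_b, gmm_a))   # Eq. (4.10)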

4.4 Algorithm for retrieval in Arminion

We present here the Arminion system architecture. As shown in Fig. 4.2, the system is based essentially on the MPEG-7 model for an MIRS. In this section we integrate the feature extraction stage and the feature comparison stage and explain the algorithm developed to retrieve songs.

i. The first stage of processing extracts the GMM parameters and the HILN features from the query song.

ii. The second stage finds the 500 songs nearest to the query song in the database, using the MFCC features and the distance measure derived earlier. The songs are ranked according to their timbral nearness to the query song; this rank is named r_gmm.

3We could have used any other generic distance function instead of the Mahalanobis distance [59].

Figure 4.2: Block diagram of Arminion's architecture (extract GMM and HILN parameters from the input query song, retrieve the timbrally best 500 songs by sieving, apply ranking-based HILN feature comparison and retrieve the top 10 matches).

iii. The above step eliminates nearly 1000 songs from further processing. Euclidean distance is used for the HILN feature comparison: the Euclidean distance between the HILN features of the query song and each of the best-matching 500 songs (from step ii) is calculated. We form separate augmented features from the HILN features depending on their orders of magnitude. The six augmented feature vectors are (i) mne, (ii) mhe, (iii) mile, sile, she, sne, (iv) mhc, (v) shc, mil, sil and (vi) mnc, snc. Since the orders of magnitude differ and equal importance must be given to each feature, we group them into these six sets. We find the Euclidean distances between the augmented features of the query song (q) and the ith song among the top 500 retrievals and rank them. The Euclidean distance between two augmented feature vectors a and b is taken as

d(a, b) = \sum_{i=1}^{n} (a_i - b_i)^2        (4.13)

r_mne(i) = \sqrt{ d(mne(q), mne(i)) }        (4.14)

r_mhe(i) = \sqrt{ d(mhe(q), mhe(i)) }        (4.15)

r_aug1(i) = ( d(mile(q), mile(i)) + d(sile(q), sile(i)) + d(she(q), she(i)) + d(sne(q), sne(i)) )^{1/2}        (4.16)

r_mhc(i) = \sqrt{ d(mhc(q), mhc(i)) }        (4.17)

r_aug2(i) = ( d(shc(q), shc(i)) + d(mil(q), mil(i)) + d(sil(q), sil(i)) )^{1/2}        (4.18)

r_aug3(i) = ( d(mnc(q), mnc(i)) + d(snc(q), snc(i)) )^{1/2}        (4.19)


Now, we compute a new ranking of the songs based on these six distances:

hiln_rank(i) = α1 · r_mne(i) + α2 · r_mhe(i) + α3 · r_aug1(i) + α4 · r_mhc(i) + α5 · r_aug2(i) + α6 · r_aug3(i)        (4.20)

where the α's are weighting factors. We use the values 1, 1, 0.5, 1, 0.5 and 0.5 for α1, α2, α3, α4, α5 and α6 respectively.

The final ranking for each of the 500 songs is obtained as

Rfin(i) = r_gmm(i) + α · hiln_rank(i)        (4.21)

where α is a weighting constant giving relative importance to hiln_rank(i) with respect to r_gmm; we use α = 1.5. The final ranking retains r_gmm because songs timbrally closer to the query are ranked higher, and ignoring that ranking would discard valuable information about the timbre ordering of the 500 retrieved songs. Since we do not know the preference of the human brain among these rankings (i.e. which feature set is perceptually more important), they need to be fine-tuned to obtain the best perceptually acceptable result. In such cases we have no alternative but to evaluate our MIRS objectively, as in the next section. (A small sketch of this rank combination follows step iv below.)

iv. The top 10 songs in the nearest ranked neighbourhood (i.e. the 10 smallest values of R_fin) are retrieved as the nearest matches to the query song.
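The two-stage ranking of equations (4.13)-(4.21) can be summarised in the following sketch. It is only an illustration under stated assumptions, not the exact implementation: dist_gmm stands for the GMM distance measure derived earlier in this chapter, each song's HILN features are assumed to be available in a dictionary keyed by the feature names used above, r_gmm(i) is read as the stage-one rank (1 to 500), and the weights are the values quoted in the text.

```python
import numpy as np

# Feature groups behind the augmented distances of eqns (4.14)-(4.19).
GROUPS = [
    ("r_mne",  ["mne"]),
    ("r_mhe",  ["mhe"]),
    ("r_aug1", ["mile", "sile", "she", "sne"]),
    ("r_mhc",  ["mhc"]),
    ("r_aug2", ["shc", "mil", "sil"]),
    ("r_aug3", ["mnc", "snc"]),
]
ALPHAS = [1.0, 1.0, 0.5, 1.0, 0.5, 0.5]   # alpha_1 .. alpha_6 quoted in the text
ALPHA_FINAL = 1.5                          # alpha in eqn (4.21)

def sq_dist(a, b):
    """Squared Euclidean distance, eqn (4.13)."""
    a = np.atleast_1d(np.asarray(a, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    return float(np.sum((a - b) ** 2))

def sieve_top_k(query_gmm, database_gmms, dist_gmm, k=500):
    """Stage one (steps i-ii): rank the database by timbral (GMM) distance
    to the query and keep the k closest songs. Returns {song_id: r_gmm}."""
    ordered = sorted(database_gmms,
                     key=lambda sid: dist_gmm(query_gmm, database_gmms[sid]))
    return {sid: rank + 1 for rank, sid in enumerate(ordered[:k])}

def hiln_rank(query_feats, song_feats):
    """Weighted sum of the six group distances, eqns (4.14)-(4.20)."""
    total = 0.0
    for alpha, (_, names) in zip(ALPHAS, GROUPS):
        group = sum(sq_dist(query_feats[n], song_feats[n]) for n in names)
        total += alpha * np.sqrt(group)
    return total

def final_ranking(query_feats, r_gmm, hiln_feats, top_n=10):
    """Stage two (steps iii-iv): R_fin(i) = r_gmm(i) + alpha * hiln_rank(i),
    eqn (4.21); return the top_n songs with the smallest R_fin."""
    scored = [(sid, r_gmm[sid] + ALPHA_FINAL * hiln_rank(query_feats, hiln_feats[sid]))
              for sid in r_gmm]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]
```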

4.5 Experiments using the proposed distance measure

This section deals with the experiments conducted on the Arminion system. The GUI for the Arminion system can be seen in Fig. 4.3. Using the 'Play song' button, users can listen to a song (from the database song list) before selecting it as the query song. Any song can be selected as a query song using the 'Select song' button. This leads to the top 10 retrievals being displayed in the right hand window (marked as Output songs) using only the MFCC features (timbral similarity alone). Any of the retrieved songs can be played using the 'Play song' button on the right. Using the 'HILN feat' button, a user can perform the second stage of HILN feature comparison. Of the best 500 timbrally matching songs, the top 10 songs with similar noisyness are retrieved.


The following experiments were conducted on the Arminion system's first stage (retrieving only the best 500 timbre-wise similar songs) to demonstrate the usefulness of the proposed distance measure.

Experiment No.1

This experiment was conducted to see whether the proposed distance measure retrieves the TC (twin clip) of a song within the top 10 retrievals (see section 3.1 on the database creation for more information). As can be seen from Table 4.1, for 12 randomly selected songs the corresponding clips from the same songs were retrieved well within the top 10 retrievals; the TC's number is given in brackets. For clips 749.wav to 760.wav, which are taken from a piano performance of a Bach composition, 4 clips were retrieved in the top 10 when 750.wav was given as the query clip. Overall, the TC is retrieved in the top 10 retrievals for 10 out of the 12 randomly selected songs.

Experiment No.2

This experiment was conducted to see whether the proposed distance measure retrieves the DTC (dubbed twin clip) of a clip within the top 10 retrievals. As can be seen from Table 4.2, for 9 songs which had dubbed or remixed versions in the database, the corresponding clips were retrieved well within the top 10 retrievals. For the clip numbered 237.wav, though we have prior knowledge that a dubbed version exists, we were unable to locate it manually ourselves. As Table 4.2 shows, the dubbed version of a song is retrieved in 7 out of the 9 cases tested, and the retrieval rank is also very high on average.

Experiment No.3

This experiment tabulates the number of clips with the same instrumentation as the query song in the top 10 retrievals. Table 4.3 lists, for 6 songs randomly selected from the database, the predominant instruments in the second column and, in the third column, the number of clips retrieved in the top 10 that had the same instrumentation. From Table 4.3 it is seen that 31 out of the 60 songs retrieved (around 50%) had the same timbre as the query song.

The above three experiments demonstrate the strength of our proposed distance measure. We now test the usefulness of the HILN based features by conducting two further experiments in the next section.


4.6 Experiments using the HILN features

Experiment No.4

In this experiment, we chose 20 songs randomly and labeled them as 'noisy', 'middle' or 'clean' depending on the perception of noise (for example, a rock song was perceived to be noisier than a solo piano performance). The 'middle' label is chosen for songs that could not easily be classified as 'clean' or 'noisy'. The retrievals for these songs were then checked for similar timbre and marked with one of the 3 labels defined above based on the perceived noisyness. Table 4.4 shows the results for the Arminion system. Initial results show that around 58% of the songs in the top 10 retrieved are of the same timbre and noisyness. For certain songs the retrievals are very low, but the other retrievals for these songs were rejected because they contained an extra instrument or were missing one of the instruments in the query song.

This is better than the results reported for a very small test conducted by Logan et al [17]. They claim a 50% retrieval accuracy on around 20 songs retrieved for 2 users. Moreover, their project had a much bigger database, which should have worked to their advantage, as we explain in the next few paragraphs.

For song number 7.wav, besides a female singer with a sparse, hissing voice, the instrumentation was supported by drums. Though the number of matching songs retrieved was only 3 in the top 10, 6 other songs with almost the same 'cleanliness' but with a male lead voice and drums were retrieved. This is justified by our original proposal that GMMs with M > 1 attempt to model individual instrument timbres. Because of this, the percussion in the 6 songs with a male lead voice was well matched to the query song.

Similarly, for song 1223.wav, 5 out of the top 10 retrievals had extra vocals that the query song did not have. For 1017.wav, the female vocals were missing but the male chorus was perfectly matched, and songs of Gregorian chants were retrieved in the top 10.

Experiment No.5

In this experiment we would like to see whether the HILN feature comparison stage helps in retrieving TCs and DTCs better than the first stage alone. As can be seen in Table 4.5, out of the 9 cases tested, the corresponding clips were retrieved for 7 after the first stage and for 7 after the second stage. As the table shows, the second stage improved or maintained the ranking of the retrieved clip in 6 cases and worsened it in 3 cases.


Figure 4.3: Arminion’s MATLAB GUI

Though this cannot be used as conclusive evidence that the HILN features improve retrieval, we can confidently state that they do aid retrieval.

The above experiments show that the system retrieves similar sounding songs fairly well. However, the system performs best when the number of songs in the database is large and when there are enough similar songs. We believe that Aucouturier [32] and McNab [42] must have faced no such problems since their systems have huge databases of over 15,000 songs, while Foote [22], Balaji [1], Yang [20] and Narita [83] had problems since their database sizes were small (a cross-check with the QBE system of Jonathan Foote4 confirmed our hypothesis).

This leads to a very interesting problem in database creation. One thumb rule for creating any new database would be to use a single genre or type of songs at a time and populate the feature space well enough (say 200 songs per genre), and only then move on to the next genre, instead of randomly adding new songs to the database without knowing which point in the feature space a new song might map to. We can virtually fool a user into believing that the system is doing an "intelligent" search simply by populating the feature space well.

4 http://www.fxpal.com/people/foote/musicr/doc0.html


As an example, consider a case where there are many rock songs in a database: when a rock song is given as a query, whatever the retrieved rock songs may be, the user feels that the system identified the genre well and retrieved the right songs. This is similar to what a search engine like Google does.

Building databases by incorporating controlled redundancy in the form of TCs, DTCs and remixed versions of songs has the advantage of objective verifiability. Also, if meta labels have to be used on databases, we feel they must be taken from multiple sources instead of a single source. Using these kinds of techniques, database development can take meaningful strides towards eliminating the need for subjective evaluation of MIRS.

4.7 Conclusions and Future Work

We have demonstrated the use of a QBE system, which can be applied to playlist creation, retrieval by mood and similar tasks. Without the use of any metadata, the system is shown to retrieve songs of similar timbre and noisyness quite well. Splitting the retrieval stage into two steps has its advantages, since newer forms of similarity can be defined and the system can easily be modified to retrieve songs based on those forms of similarity. We have also given an intuitive feel for the 'noisyness' of a music piece and have proposed features that describe it. A new distance measure that works for mixture Gaussians has been proposed and its properties have been explained. The distance measure can be modified to match whole timbres of songs instead of matching just the individual instrumentation, as we have done. We have shown how objective evaluation of a database can be made easier by using DTCs and TCs.

The usefulness of the HILN features for retrieval has been clearly elucidated. Apart from being useful for quantifying 'noisyness', the HILN features can be used for other tasks, as we shall see in the next chapter.

We also feel that HILN features can be used in other tasks. These features should be able to differentiate instruments of the same class, such as a pan flute from a bamboo flute, since HILN models the noise residual and the number of harmonics in the signal well. We are also looking at the use of these features for percussion detection, since percussion is better represented in the noise residual than in the original signal itself [78, 81].


Table 4.1: Table for experiment no.1

Experiment No.1
Song Number   Clip retrieved?   Rank in Top 10
793           Yes (792)         8
900           Yes (901)         3
913           No (914)          -
610           Yes (611)         2
546           Yes (548)         2
750           Yes (749:760)     4 songs in top 10
674           Yes (675)         2
644           Yes (645)         2
783           Yes (784)         2
1051          No (1052)         -
1035          Yes (1036)        6
583           Yes (584)         3

Table 4.2: Table for experiment no.2

Experiment No.2
Song Number   Dubbed/Remix clip retrieved?   Rank in Top 10
7             Yes (104)                      6
150           Yes (401)                      2
411           Yes (424)                      3
391           Yes (206)                      2
131           Yes (81)                       2
348           No (30)                        -
347           Yes (1020)                     3
237           No                             -
231           Yes (1061)                     2


Table 4.3: Table for experiment no.3

Experiment No.3
Song Number   Dominant instruments            Number of retrievals in top 10
1482          piano                           9
560           opera female voice, bass drum   2
1549          Spanish Guitar, drums           5
1512          Concert Violins                 4
1536          Flute, ghatam                   8
1576          Solo violin, Bass drums         3

Table 4.4: Table for experiment no.4

Experiment No.4
Song No.   Main instruments                         Noisyness   Songs in top 10
141        Male voice, Drums                        Middle      9
360        Piano, Bass drum                         Clean       6
6          Harsh male voice, Ghatam                 Noisy       7
80         Male, drums, piano                       Clean       7
7          Hissing Female voice, Percussion         Clean       3
610        Britney Spears, Drums                    Middle      5
263        Male lead, electric guitar, bass drums   Noisy       8
1076       Male Voice, shakers                      Middle      6
1017       Female lead, Male chorus, Tabla          Middle      2
1088       Female vocals, Bass Drums                Middle      5
1134       Male vocals, guitar                      Clean       6
848        Male Chorus, drums                       Noisy       5
1215       piano, drums                             Clean       5
1223       Guitar, congo drums                      Middle      4
1327       Bass Drums                               Noisy       6
1094       Female vocals, chorus, bass drums        Middle      4
1419       Male Chorus, percussion                  Middle      7
1499       Concert Violins                          Clean       9
1512       Flute, Mridangam                         Clean       7
1578       Solo violin, bass drums                  Clean       4


Table 4.5: Table for experiment no.5

Experiment No.5
Song No.   DTC/TC       Retrieved in I stage? (Rank)   Retrieved in II stage? (Rank)
81         TC (132)     No                             Yes (10)
81         TC (131)     Yes (2)                        Yes (2)
30         DTC (348)    Yes (8)                        No (-)
224        DTC (355)    Yes (2)                        Yes (8)
7          DTC (104)    Yes (6)                        Yes (4)
11         DTC (1002)   Yes (2)                        Yes (2)
163        DTC (379)    Yes (7)                        No (-)
6          DTC (1452)   No (-)                         Yes (9)
1020       DTC (347)    Yes (3)                        Yes (4)


Chapter 5

A Speech / Music Discriminator algorithm

It is a good morning exercise for a research scientist to discard a pet hypothesis every day before breakfast. It keeps him young.

- Konrad Lorenz

In the last chapter, the HILN model was used for retrieval purposes. As an offshoot of that work, we propose a simple speech / music discriminator in this chapter. The algorithm uses features based on the HILN model and a simple voting mechanism to decide whether an audio piece is speech or music. We classify singing and humming as speech, and only instrumental pieces (both monophonic and polyphonic) as music. We have tested the algorithm on a standard database of 66 files and achieved a discrimination accuracy of around 97%. We have also tested it on sung queries and polyphonic music with very good results. The algorithm can be used to discriminate between sung queries and played queries (using an instrument like the flute) in a generic Query by Humming (QBH) system, and it is currently being used for front end processing in a QBH system being developed in our lab.

This chapter is organised as follows. We begin with an introduction and a brief survey of existing speech / music discriminators (SMD) in section 5.1. The SMD is described in section 5.2, and the feature extraction techniques and the features themselves are described in section 5.2.1. We describe the experiments on the SMD in section 5.3. Section 5.4 concludes with our views on SMDs and future directions for work on this topic.



5.1 Current state of the art

Speech / music discrimination is an important task for Music Information Retrieval (MIR). Highlights extraction in sports videos and advertisement detection in broadcast videos are applications where an SMD is extremely useful. News channels in particular have an immense use for SMDs.

For the purpose of discriminating between programs and advertisements, Saunders [72] designed a realtime speech/music discriminator (SMD) using zero-crossing rates as a simple feature. Scheirer et al [84] designed an SMD using 13 temporal and spectral features followed by a GMM based classifier. A novel SMD was designed by Jarina et al [85] that used rhythm based features of polyphonic music. Maleh et al [86] describe a frame level SMD that works on 20 millisecond frames to declare them as speech or music.

A much simpler form of SMD was designed by Karneback [87] using the 4 Hz modulation energy feature. A PR based SMD that could detect sung phrases was designed by Chou et al [88]. A fast SMD with low complexity was designed by Wang et al [89] using a modified energy ratio feature followed by a Bayesian classifier. Our work is aimed at developing an SMD that can distinguish singing from music, in addition to speech from music.

Most of the algorithms described above use GMMs, neural networks and similar classifiers, and rely on a large number of features. These are usually compute intensive and slow for real-time purposes. In this work we aim to incorporate the capabilities of Chou's algorithm with low complexity, using the HILN model to derive features that separate speech from music.

5.2 Speech Music Discriminator algorithm

A generic QBH system allows users to hum a query or to play it using an instrument such as the flute or violin. Since sung queries and instrumental music are processed separately by the system (usually different pitch tracking algorithms are used for hummed and instrumental queries), we would like to discriminate between sung queries and instrumental queries.

Figure 5.1 shows the block diagram of our speech-music discriminator. All input signals are resampled to 16 kHz and normalized to an average intensity of 70 dB. We then pass the processed signal through a HILN model based feature extraction block.


This block extracts 4 features, which are described below.

Figure 5.1: Block diagram of the speech-music discriminator. The input signal is resampled to 16 kHz and normalized to an intensity of 70 dB, HILN features are extracted, and a voting mechanism (with thresholds set using training data) produces the speech / music classification.
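As a rough illustration of the pre-processing block, the sketch below resamples the input and normalizes its level. It is a sketch under assumptions only: plain linear interpolation stands in for a proper resampler, and because an absolute 70 dB SPL calibration needs a physical reference that is not reproducible here, the level is normalized to an assumed RMS value in dBFS instead.

```python
import numpy as np

def preprocess(signal, fs, target_fs=16000, target_rms_db=-20.0):
    """Resample to 16 kHz and normalize the average level.
    target_rms_db (in dBFS) is an assumed stand-in for the 70 dB figure."""
    signal = np.asarray(signal, dtype=float)
    if fs != target_fs:
        n_out = int(round(len(signal) * target_fs / fs))
        t_in = np.arange(len(signal)) / fs
        t_out = np.arange(n_out) / target_fs
        signal = np.interp(t_out, t_in, signal)      # simple linear resampling
    rms = np.sqrt(np.mean(signal ** 2)) + 1e-12      # avoid division by zero
    gain = 10 ** (target_rms_db / 20.0) / rms
    return signal * gain, target_fs
```

In practice a polyphase resampler (for example scipy.signal.resample_poly) would be preferable to linear interpolation; the sketch avoids the extra dependency.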

5.2.1 HILN model based features

We use HILN model based features that are computed for speech, music (both monophonic and polyphonic ) and also singing / humming.

The sine+noise model analyzes the signal and picks peaks in the frequency spectrum. As explained in the previous chapter of this thesis, we have used Virtanen's algorithms to implement the sine+noise model for our feature extraction purposes [78].

The signal is split into 20 millisecond frames with an overlap of 10 milliseconds. Each frame of the signal is classified as "silence" or "sound" based on a simple energy threshold. A high resolution 4096 point STFT (short-time Fourier transform) is taken and the peaks in the STFT are picked using an energy threshold, i.e. all peaks within 90% of the maximum energy are picked using the sinusoid likeness measure as in Chapter 4.


A continuity analysis across the time frames is performed in the frequency domain to pick only sinusoids that are stable [78]. Only sinusoids that are continuous for at least 15 frames (150 milliseconds) are retained. In this way, spurious or discontinuous peaks are eliminated from further processing.
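A simplified sketch of the peak picking and continuity analysis is given below. It replaces the sinusoid likeness measure with a plain magnitude test, and the allowed bin jump between consecutive frames (bin_tol) is an assumed value, since the text only states that a continuity analysis is performed.

```python
def pick_frame_peaks(mag_frame, rel_threshold=0.9):
    """Indices of local maxima whose magnitude is within rel_threshold of
    the frame maximum (the 90% rule in the text). mag_frame is a sequence
    of STFT magnitudes for one frame."""
    max_mag = max(mag_frame)
    peaks = []
    for k in range(1, len(mag_frame) - 1):
        is_local_max = mag_frame[k] >= mag_frame[k - 1] and mag_frame[k] >= mag_frame[k + 1]
        if is_local_max and mag_frame[k] >= rel_threshold * max_mag:
            peaks.append(k)
    return peaks

def stable_sinusoids(peaks_per_frame, min_frames=15, bin_tol=1):
    """Greedy continuity analysis: link peaks in consecutive frames whose
    bins differ by at most bin_tol, and keep only tracks lasting at least
    min_frames frames (150 ms at a 10 ms hop)."""
    tracks = []                              # each track: list of (frame, bin)
    for t, peaks in enumerate(peaks_per_frame):
        for k in peaks:
            for track in tracks:
                last_t, last_k = track[-1]
                if last_t == t - 1 and abs(last_k - k) <= bin_tol:
                    track.append((t, k))
                    break
            else:
                tracks.append([(t, k)])      # no match found: start a new track
    return [track for track in tracks if len(track) >= min_frames]
```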

Now, we search for harmonics of the sinusoids in every frame. If a "sound" frame does not contain any sinusoids, it is labelled "unvoiced" and removed from further processing. If a frame contains at least one sinusoid, it is labelled "voiced". For every "voiced" frame, we search for the harmonics of a given sinusoid of value S Hz. Sinusoids that are within +/- 10 Hz of the multiples of S are also considered as its harmonics. In this way the harmonics of every sinusoid in the kth frame are obtained.

The sinusoids that are not harmonic partials of any of the sinusoids found in the frame are labelled "individual lines" (see eqn. 5.1). For example, let a frame contain peaks at 100 Hz, 200 Hz, 330 Hz, 411 Hz and 500 Hz. Then the number of harmonics in the frame is 3 (100 Hz, 200 Hz and 500 Hz) and the number of individual lines is 2 (330 Hz and 411 Hz).

S(k) = H(k) + I(k) (5.1)

where S(k) is the total number of sinusoids in the kth "voiced" frame, and similarly H(k) and I(k) are the total number of harmonics and individual lines in the kth frame respectively.

Using the three quantities S, H and I, we calculate the "average number of sinusoids per voiced frame" (Savg), the "average number of harmonics per voiced frame" (Havg) and the "average number of individual lines per voiced frame" (Iavg), i.e. the averages of S(k), H(k) and I(k) over all voiced frames.
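The harmonic grouping and the per-frame averages can be illustrated with the short sketch below. The +/- 10 Hz tolerance is the rule quoted above; treating a sinusoid that has no multiples in the frame as an individual line (so that a harmonic "family" needs at least two members) is our reading of the worked example, not a rule stated in the text.

```python
def count_harmonics_and_lines(freqs_hz, tol_hz=10.0):
    """Split one voiced frame's sinusoid frequencies into harmonics H(k)
    and individual lines I(k), using the +/-10 Hz rule described above."""
    freqs = sorted(freqs_hz)
    harmonic = set()
    for f0 in freqs:
        # members of f0's harmonic family: frequencies within tol_hz of a multiple of f0
        family = [f for f in freqs
                  if round(f / f0) >= 1 and abs(f / f0 - round(f / f0)) * f0 <= tol_hz]
        if len(family) >= 2:                 # fundamental plus at least one multiple
            harmonic.update(family)
    H = len(harmonic)
    I = len(freqs) - H
    return H, I

def frame_averages(voiced_frames):
    """Savg, Havg and Iavg from a list of per-voiced-frame frequency lists."""
    counts = [count_harmonics_and_lines(f) for f in voiced_frames]
    Havg = sum(h for h, _ in counts) / len(counts)
    Iavg = sum(i for _, i in counts) / len(counts)
    Savg = Havg + Iavg                       # S(k) = H(k) + I(k), eqn (5.1)
    return Savg, Havg, Iavg

# Worked example from the text: peaks at 100, 200, 330, 411 and 500 Hz give
# 3 harmonics (100, 200, 500) and 2 individual lines (330, 411).
print(count_harmonics_and_lines([100, 200, 330, 411, 500]))     # -> (3, 2)
```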

The noise residual is calculated for the frames classified as "sound" as follows. For every frame, we calculate the energy in the sinusoids picked in the previous steps (if the frame is voiced) and subtract it from the energy of the frame. Before subtraction, the sinusoids are smoothened by the windowing signal, as shown in the following steps.

Let the original signal be x(n) and the DT-STFT (discrete time STFT) be X(n, k). Let the windowing function be w(n) and the Fourier transform of w(n) be W(ω). Then,


X(n, k) = Σ_{m=n}^{n+N−1} x(m) · w(n − m) · e^{−2jπmk/N}    (5.2)

In the frequency domain, the multiplication of x(n) and w(n) in (5.2) corresponds to convolution of the Fourier spectrum of the signal with the Fourier spectrum of the window. Let the signal representing the picked harmonics, as mentioned in the previous paragraph, be H(n, k). H(n, k) contains only impulses at the samples corresponding to the frequencies identified by the peak picking algorithm. We convolve the signal H(n, k) with W(ω). This leads to a smoothened version of the harmonic spectrum H(n, k).

Hsmooth(n, k) = H(n, k) ∗ W(ω)    (5.3)

This Hsmooth(n, k) is the Fourier spectrum of the windowed version of the harmonics described above. We subtract Hsmooth(n, k) from X(n, k) to get the residual function R(n, k). We find the average of the energy in R(n, k) along the time frames and call it the "average residual energy per voiced frame", Ravg.
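A sketch of the residual computation for a single frame is given below. It fixes several implementation details that the text leaves open (Hann window, FFT length, and scaling of the impulses so that the smoothed peak matches the observed spectrum); the thesis follows Virtanen's implementation [78], which this sketch does not claim to reproduce.

```python
import numpy as np

def circ_conv(a, b):
    """Circular convolution of two equal-length sequences (via the FFT)."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))

def frame_residual_energy(frame, peak_bins, n_fft=4096):
    """Energy left in one frame after removing its picked sinusoids.
    peak_bins holds the positive-frequency FFT bins picked for the frame."""
    window = np.hanning(len(frame))
    X = np.fft.fft(frame * window, n_fft)          # windowed spectrum X(n,k)
    W = np.fft.fft(window, n_fft)                  # window spectrum W(omega)
    H = np.zeros(n_fft, dtype=complex)             # impulses at the sinusoid bins
    for k in peak_bins:
        H[k] = X[k] / W[0]                         # scale so the smoothed peak matches X
        H[-k % n_fft] = X[-k % n_fft] / W[0]       # mirror bin of the real signal
    Hsmooth = circ_conv(H, W)                      # eqn (5.3): Hsmooth = H * W
    R = X - Hsmooth                                # residual spectrum R(n,k)
    return float(np.sum(np.abs(R) ** 2) / n_fft)   # frame energy via Parseval

def r_avg(frames_with_peaks):
    """Ravg: mean residual energy over the voiced frames.
    frames_with_peaks: list of (frame_samples, peak_bins) pairs."""
    energies = [frame_residual_energy(f, p) for f, p in frames_with_peaks]
    return sum(energies) / len(energies)
```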

Having extracted these features for a training set, we set a threshold for each feature to classify an audio piece as speech or music along that feature. For any new piece of audio to be classified, we label the audio as "speech" or "music" along each of the 4 feature dimensions depending on the value of the extracted feature with respect to its threshold. Using a voting strategy, the audio clip is then declared speech or music according to the majority label. For example, if an audio piece is labelled "speech" along 3 dimensions and "music" along 1 dimension, we classify it as speech.
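The threshold-and-vote decision can be written down as a short sketch. The thresholds shown are the ones quoted in Section 5.3 (1.5, 2 and 0.075 for Savg, Havg and Ravg); which side of each threshold is labelled "music" is read off the training histograms in practice, so the directions used in the example below are assumptions for illustration only.

```python
def classify_clip(features, thresholds):
    """Vote 'speech' vs 'music' over the feature dimensions.
    features:   {'Savg': ..., 'Havg': ..., 'Ravg': ...} for one clip.
    thresholds: {name: (threshold, label_if_above)} learned from training."""
    votes = []
    for name, value in features.items():
        threshold, label_if_above = thresholds[name]
        if value > threshold:
            votes.append(label_if_above)
        else:
            votes.append("speech" if label_if_above == "music" else "music")
    # majority vote over the per-feature labels
    return "music" if votes.count("music") > votes.count("speech") else "speech"

# Illustrative use; the label directions are assumed, not taken from the thesis.
example_thresholds = {
    "Savg": (1.5, "music"),
    "Havg": (2.0, "music"),
    "Ravg": (0.075, "speech"),
}
print(classify_clip({"Savg": 2.4, "Havg": 3.1, "Ravg": 0.03}, example_thresholds))
```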

5.3 Experiments and Results

To evaluate the decision thresholds, the above 4 features were calculated for a set of 84 training files containing 45 music files (monophonic trumpet, piano, flute and violin, of average length 4 seconds), 25 speech files (of average length 4 seconds) and 14 singing files (of average length 15 seconds). Figures 5.2, 5.3, 5.4 and 5.5 show histograms of the values of Savg, Havg, Iavg and Ravg for music, speech and singing files from the training set, in three separate subplots. As can be seen, simple thresholds can separate the speech signals from the music signals.


We can also see very easily that the "average number of individual lines per voiced frame", Iavg, has no discriminative power between speech, music and singing. We therefore ignore this feature, based on the histogram plot shown in Figure 5.3. The thresholds set for Savg, Havg and Ravg are 1.5, 2 and 0.075 respectively; they are chosen to minimize the misclassification error. The experiments and results on test data, along with the database information, are given next.

Figure 5.2: Average number of sinusoids per voiced frame

For testing on the database, the voting strategy was applied to the 3 remaining features to decide whether a sound is "speech (singing included)" or "music", depending on whether the sound is more music-like or speech-like in at least 2 of the 3 dimensions. This technique works very well for monophonic music. We are yet to understand the robustness of this technique on polyphonic music.

For the experiments, we used Slaney and Scheirer's database1 for speech / music discriminators.

1 http://www.ee.columbia.edu/~dpwe/sounds/musp/music-speech-20051006.tgz


Figure 5.3: Average number of individual lines per voiced frame. As can be seen from the plot, the Iavg feature is not of much use, so we neglected this feature.

The database has around 150 files belonging to various categories such as music, vocals and non-vocal sounds. We also used high quality singing files from our QBH database; these clips are recordings of "hums" (of Hindi film songs) by semi-professional singers. We split the combined database into training and testing data, with 60% of the files going into training and the remaining 40% into testing. As described in the algorithm, the training phase was used to estimate the thresholds for the various features.

After setting the thresholds, the testing was done as a simple speech or music classification. The 3 features Savg, Havg and Ravg were calculated for each test file. Along each feature we classified the file as speech or music based on whether the feature was greater than or less than the set threshold.


Figure 5.4: Average number of harmonics per voiced frame

A final voting was then done to select the majority class as the class label for the audio file.

Table 5.1 gives the results for the test data. The first column gives the instrument class, the second column the number of files of the particular instrument / speech class, and the third and fourth columns give the number of files classified as speech or music respectively for that class. As can be seen from the table, this simple technique gave an accuracy of 96.8% for the 32 monophonic instrumental files and 97.5% for the 41 speech and singing files we tested.

We also tested this technique on 15 polyphonic music files of average length 7 seconds. These files were clips from various CD recordings of movie songs and albums (note that no training was done using polyphonic music data). 12 out of the 15 were classified as music. This technique compares well with the results of various other algorithms, but is much simpler and computationally light. With slight modifications it can also be used for realtime speech / music discrimination, since the classifier is extremely simple.


Figure 5.5: Average residual noise energy per voiced frame. In all the above histograms, the x-axis corresponds to the value of the feature and the y-axis corresponds to the number of times the feature value occurs in the experiment.

Compared with the various algorithms referred to in this chapter (see [85] for a comparative analysis), we get an accuracy that is good enough for an SMD using simple features. Also, the peak picking in the frequency domain and the continuity analysis are fast and involve no iterative techniques (unlike the training methods used in the papers referred to in section 5.1).

5.4 Conclusions and Future work

We have proposed a simple and computationally fast algorithm that discriminates between speech (spoken sentences and sung phrases) and music (monophonic instrumental performances and polyphonic sounds).


This has been proposed taking into consideration that music is inherently more harmonic than speech. The algorithm is currently being used to discriminate between sung queries and played queries in a QBH system being developed in our lab. It can be extended to realtime applications such as advertisement detection in broadcasts, speech-music segmentation in movies and so on. This work also shows that the HILN model is useful in applications beyond audio coding.

Table 5.1: Tests on Scheirer and Slaney's database

Experimental results
Audio Class           No. of files   Classified as Speech   Classified as Music
Flute                 10             0                      10
Violin                10             1                      9
Trumpet               6              0                      6
Piano                 6              0                      6
Polyphonic            15             3                      12
Total Music           47             4                      43
TIMIT speech set      14             14                     0
Slaney's speech set   7              6                      1
Singing               20             20                     0
Total Speech          41             40                     1

Chapter 6

Conclusions and future work

6.1 Summary of this work

In this thesis, we have looked at MIRS from various possible perspectives and elaborated on the possible methods of music retrieval from a repository. We have also discussed the various modes of querying, such as humming, whistling, singing, tapping, beat-boxing and querying by example. The motivation for research in the query by example mode of MIR was reinforced by formally stating the problem and giving the reasons for our interest in this work.

We have built a database that can be used for any content based music information retrieval system. The database incorporates controlled redundancy in the form of "twin clips" and "dubbed twin clips" that aid objective evaluation of music information retrieval systems. We have also formally stated the steps to build any new database for a generic MIRS.

We have designed a new algorithm, Quebex, that retrieves songs having a temporal similarity of timbre along with similar rhythm. We capture timbre and rhythm using temporal, spectral, rhythm and energy based features. The feature comparison involves a two stage process that enables us to decouple the timbre features from the rhythm features. This is prudent, since the timbre features and the rhythm features have different discriminating capabilities [25]. We thus use a timbre matching stage followed by a rhythm matching stage. Since we do not have predefined classes or labels for our data, a nearest neighbour approach is used to retrieve the best matching songs for a query song. The system had a few shortcomings in that it worked poorly on our small database, given the strict nature of similarity that we had defined.



To overcome these shortcomings, we designed a new algorithm that also works well on small databases.

Retrieving songs well on a small database implied relaxing the strict form of similarity that we had defined for Quebex. We modified our similarity to mean only timbral similarity and similarity of noisyness. In Arminion, our second system, we retrieve songs based on this new similarity. We use a clustering algorithm (GMM) for a compact representation of every clip's timbre and propose a new distance measure between two mixture models. This distance measure is computationally cheap and fast, yet captures certain similarities between mixtures. We have also proposed a new set of features based on the HILN model used in MPEG4 audio coding. These features help to capture the "noisyness / cleanliness" of a song. We feel that since MFCCs smooth out the spectrum, the "noisyness" factor is lost; the HILN based features that we propose capture precisely this aspect of a music signal. This has a strong perceptual bearing, since a "noisy" song is perceived differently from a "clean" song, and therefore these features are extremely useful in any MIRS. We conduct experiments to show the strength of our distance measure and the usefulness of our new feature set. The system works very well on a small database and we predict that it would work even better with increasing database size.

The HILN features that we proposed have found use in more than one scenario. Through a new speech / music discriminator algorithm we have demonstrated that these features are indeed useful for SMD. HILN features extracted from audio clips are fed to a simple voting mechanism that declares whether the audio is speech or music. Simple thresholding allows us to classify sounds as speech or music, and our algorithm compares well with standard algorithms, with an overall classification rate of 97% on a standard database.

6.2 Future work

A researcher can build upon the framework that we have proposed. We have adopted a multistep processing approach, which helps in sieving out unwanted candidates from further processing. Retrieval based on similarity of mood, lyrics, genre and many other attributes can be developed in future. We can do away with subjective evaluation altogether for future MIRS by using the techniques that we have proposed for developing databases.


The HILN features that we have proposed in this thesis can be used for various other tasks, such as instrument recognition, advertisement detection in sports videos and multipitch estimation.

Instruments belonging to the same family or class may be well recognised using the HILN model, since it captures the harmonics and individual sinusoids very well. Thus a pan flute and a bamboo flute may be well distinguished, even though they belong to the same family. Within-class misclassification is usually high in instrument recognition tasks, but we feel that the HILN model based features will be able to cope well in such scenarios.

We are extending our work on SMD to advertisement detection in sports videos, which can be used for highlights generation and summary generation of sports telecasts. Estimating the pitch of multiple instruments in a polyphony is another challenge, and we feel that HILN features could be of some help in this direction.

MIRS that can provide an indication of their own performance (say, an 8 on 10 performance for a given song, interpreted as the system retrieving 8 acceptable songs out of the top 10 retrievals) can help system developers to improve the retrieval of certain types of songs. If an MIRS can identify potentially bad retrievals based on some measure, corrective measures can be taken to improve retrieval for such songs. This is different from the relevance feedback approach, which relies on a user to help better the system. From a user's perspective it also means that the user need not go through the entire set of retrieved songs when the system itself says that it could not retrieve more than a specific number of songs accurately for a given query. Some initial simulations were carried out in this direction using the density of songs around a query song in the timbre space as a measure, but no conclusive results could be obtained. More work, or a better understanding of the perceptual aspects of the timbre space, is needed to establish the use or disuse of such objective performance measures.

Work can also be done on the perceptual ordering of the features used for retrieval. We still do not know which features dominate in retrieving the perceptually best set of songs for a query, nor how features combine to give the 'gestalt' effect during retrieval. If these aspects could be incorporated into retrieval systems, retrieval based on the best set of features could be attempted. The best subset of perceptually relevant features could then be used for retrieval when either computational power or memory resources are low.

The future of MIR lies in bringing out newer forms of interfaces to


"visualize" and "feel" audio. Tzanetakis [41] talks of various futuristic avatars of MIR systems, and the "tomorrow" of MIR definitely lies in retrieval based on user tastes; hence user feedback should also be taken into consideration while designing MIRs [90]. Feature spaces can be partitioned offline so that algorithms search only a small number of files in huge databases, speeding up retrieval times. Features can be augmented together to form new features which may reveal important characteristics of the signal and its possible effect on the listener. This can have great implications for the way we perceive information through our senses. MIR systems catering to each individual's tastes, while maintaining an underlying architecture that gives an insight into the way humans understand music, may be needed for better retrieval. Though a global listener model is very difficult to conceive, individual user models can be built using machine learning techniques. It is a challenge for music researchers to realize at least a few of these dreams for a better and more entertaining future.

Appendix A

.1 Triangular inequality proof

We prove here that the distance measure used in Chapter 3 does not satisfy the triangle inequality and hence cannot be called a distance metric.

Figure 1: Proof of triangular inequality failure of the distance measure

As can be seen from the figure, we have the trajectories of three songs A, B and C. The trajectories of songs A and C lie in hyperplanes perpendicular to that of song B and are mutually parallel. The trajectories of songs A and C are at a distance x from the starting node of B (i.e. the starting node of B is the centroid of an equilateral triangle in the hyperplanes). From Chapter 3, using the definition of the distance between any two song trajectories, we have



d(A,B) = 2x + y (1)

d(B,A) = 2y + x (2)

D(A,B) = (3x + 3y)/2    (3)

Similarly, D(B, C) is also (3x + 3y)/2. Now D(A, C) will be 6y, as can be seen from Figure 1.

Now,

D(A,B) + D(B,C) ≥ D(A,C) (4)

iff

x ≥ y (5)

Thus the triangle inequality is not satisfied whenever x < y (for instance, x = 1 and y = 2 give D(A,B) + D(B,C) = 9 < 12 = D(A,C)). We have thereby shown one simple case where the triangle inequality fails.

Appendix B

.2 Triangular inequality proof

We prove here that the distance measure used in Chapter 4 does not satisfy the triangle inequality and hence cannot be called a distance metric.

Figure 2: Proof of triangular inequality failure of the distance measure

We have used a 2 mixture GMM for illustration. The proof given here is geometric and intuitive. Let the GMM for the MFCC clusters of song A be the two blue spheres encircled by the dark blue ellipse. The weights of both Gaussians are taken to be the same (equal to 0.5).



Similarly, let the GMM for the MFCC clusters of song B be the spheres encircled by the light blue ellipse, and that of song C be the spheres encircled by the red ellipse.

Now, the centres of the spheres are at a distance d apart and at right angles to each other, i.e. the three ellipses lie in three mutually perpendicular planes. Since A and B have one Gaussian in common, and B and C also have one Gaussian in common, as can be seen from Fig. 2, the distance DGMM(A,B) between A and B is

DGMM(A,B) = d (6)

Similarly, DGMM(B, C) is also d. Now D(A, C) will be d + d√2, as can be seen from Figure 2.

Now,

D(A,B) + D(B, C) = 2d (7)

and

D(A,C) = d + d√2    (8)

Thus the triangle inequality is not satisfied, since 2d < d + d√2. We have thereby shown one simple case where the triangle inequality fails.

Publications from this work

1. B. Thoshkahna and K.R.Ramakrishnan, Projekt quebex: a query by example system for audio retrieval, Proceedings of IEEE International Conference on Multimedia and Expo (ICME)-2005, 2005.

2. B. Thoshkahna and K.R.Ramakrishnan, Arminion: a query by example system for audio retrieval, Proceedings of Computer Music Modelling and Retrieval-2005, 2005.

3. B. Thoshkahna, V.Sudha, and K.R.Ramakrishnan, A hiln based speech / music discriminator, Accepted for IEEE International Conference on Acoustic Signal and Speech Processing-2006, 2006.

4. B. Thoshkahna and K.R.Ramakrishnan, A literature survey of query by example systems for audio retrieval, Manuscript under preparation, 2006.



Bibliography

[1] B. Thoshkahna and K.R.Ramakrishnan, "Projekt quebex: a query by example system for audio retrieval," Proceedings of IEEE International Conference on Multimedia and Expo (ICME)-2005, 2005.

[2] B. Thoshkahna and K.R.Ramakrishnan, "Arminion: a query by example system for audio retrieval," Proceedings of Computer Music Modelling and Retrieval-2005, 2005.

[3] B. Thoshkahna, V.Sudha, and K.R.Ramakrishnan, "A hiln based speech/music discriminator," Accepted for IEEE International Conference on Acoustic Signal and Speech Processing-2006, 2006.

[4] B. Thoshkahna and K.R.Ramakrishnan, "A literature survey of query by example systems for audio retrieval," Manuscript under preparation, 2006.

[5] M. Welsh, N. Borisov, J. Hill, R. von Behren, and A. Woo, "Querying large collections of music for similarity," UCB Technical report, 1999.

[6] D. P. Ellis, "A computer implementation of psychoacoustic grouping rules," ICPR, 1994.

[7] D. Liu, L. Lu, and H.-J. Zhang, "Automatic mood detection from acoustic music data," ISMIR, 2003.

[8] J. Foote, "An overview of audio information retrieval," Multimedia Systems, 1999.

[9] S.Baumann, A.Klter, and M.Norlien, "Using natural language input and audio analysis for a human-oriented mir system," Proceedings of WEDELMUSIC, the International Conference on Web Delivering of Music, 2002.


[10] S.Baumann and A.Klter, "Super-convenience for non-musicians: Querying mp3 and the semantic web," ISMIR, 2002.

[11] T. K. K. Ohta, "Evaluation and comparison of natural language and graphical user interfaces in query-by-impressions scenes," ITCC, 2004.

[12] M. Cooper and J. Foote, "Summarizing popular music via structural similarity analysis," IEEE-WASPAA, 2003.

[13] W.Chai and B.Vercoe, "Structural analysis of musical signals for indexing and thumbnailing," JCDL, 2003.

[14] S. Keuser, "Similarity search on musical data," Diploma Thesis, 2002.

[15] M. Liu and C. Wan, "A study on content-based classification and retrieval of audio database," Proceedings of the International Database Engineering and Applications Symposium, 2001.

[16] G.Tzanetakis and P.Cook, "Marsyas: a framework for audio analysis," Organised Sound, 2000.

[17] B.Logan and A.Salomon, "A content based music similarity function," Cambridge Research Labs - Tech Report, June 2001.

[18] S.Doraiswamy and S.Ruger, "Robust polyphonic music retrieval with n-grams," Journal of Intelligent Information Systems, 2003.

[19] C.Spevak and E.Favreau, "Soundspotter - a prototype system for content based audio retrieval," 5th Intl. Conference on Digital Audio Effects (DAFx-02), 2002.

[20] C. Yang, "Macs: music audio characteristic sequence indexing for similarity retrieval," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2001.

[21] J.J.Aucouturier and F.Pachet, "Music similarity measures: what's the use?," Proceedings of the 3rd International Symposium on Music Information Retrieval, 2002.

[22] J.Foote, M.Cooper, and U.Nam, "Audio retrieval by rhythmic similarity," International Symposium on Music Information Retrieval, 2002.


[23] N.Liu, Y.Wu, and A.L.P.Chen, "Efficient k-nn search in polyphonic music databases using a lower bounding mechanism," MIR 03, 2003.

[24] Y.Feng, Y.Zhuang, and Y.Pan, "Popular music retrieval by detecting mood," ACM SIGIR 03, 2003.

[25] G.Tzanetakis, G.Essl, and P.Cook, "Automatic musical genre classification of audio signals," ISMIR, 2001.

[26] B.Logan, D.P.W.Ellis, and A.Berenzweig, "Toward evaluation techniques for music similarity," SIGIR, 2003.

[27] S.Lippens, J. Martens, T. Mulder, and G.Tzanetakis, "A comparison of human and automatic musical genre classification," ICASSP-04, 2004.

[28] R. Typke, P. Giannopoulos, R. C.Veltkamp, F. Wiering, and R.Oostrum, "Using transportation distances for measuring melodic similarity," Technical report, 2003.

[29] P. Rao, "Applying perceptual distance to the discrimination of sounds," NCC-2001, 2001.

[30] B. Logan, "Mel frequency cepstral coefficients for music modeling," International Symposium on Music Information Retrieval, 2000.

[31] E.Wold, T.Blum, D.Keislar, and J.Wheaton, "Content-based classification, search and retrieval of audio," IEEE Multimedia Magazine, 1996.

[32] J.-J. Aucouturier and F. Pachet, "Finding songs that sound the same," MPCA-2002, 2002.

[33] J. Foote, "Content-based retrieval of music and audio," SPIE, 1997.

[34] G.Tzanetakis, "Manipulation, analysis and retrieval systems for audio signals," Ph.D thesis, Princeton Univ, 2002.

[35] G.Tzanetakis and P.Cook, "Audio information retrieval (air) tools," ISMIR, 2000.


[37] A.Klapuri, "Automatic transcription of music," MSc Thesis, 1997.

[38] J.P.Bello, G.Monti, and M.Sandler, "Techniques for automatic music transcription," ISMIR01, 2001.

[39] K. Johnson, "Controlled chaos and other sound synthesis techniques," BS degree Thesis, 2000.

[40] X.Rodet, "Musical sound signal analysis/synthesis: sinusoidal + residual and elementary waveform models," TFTS-97, 1997.

[41] G.Tzanetakis, A.Ermolinskyi, and P.Cook, "Beyond the query-by-example paradigm: New query interfaces for music information retrieval," ICMC-02, 2002.

[42] I. R.J.McNab, L.A.Smith and C.L.Henderson, "Tune retrieval in the multimedia library," Multimedia Tools and Applications, 2000.

[43] C. Francu and C. G. Nevill-Manning, "Distance metrics and indexing strategies for a digital library of popular music," ICME-2000, 2000.

[44] L.Lu, H.You, and H.J.Zhang, "A new approach to query by humming in music retrieval," ICME01, 2001.

[45] T.Miura and I.Shioya, "Similarity among melodies for music information retrieval," CIKM03, 2003.

[46] B.Pardo, C.Meek, and W.Birmingham, "Comparing aural music-information retrieval systems," The MIR/MDL Evaluation Project White Paper Collection (2nd ed), 2002.

[47] P.Y.Rolland, G.RaSkinis, and J.G.Ganascia, "Musical content-based retrieval: Overview of the melodiscov approach and system," ACM Multimedia 99, 1999.

[48] P.Salosaari and K.Jrvelin, "Musir: a retrieval model for music," Tampere University Research notes, 1998.

[49] M.Carre, P.Philippe, and C.Apelian, "New query-by-humming music retrieval system conception and evaluation based on a query nature study," Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-01), 2001.


[50] D.Byrd and T.Crawford, "Problems of music information retrieval in real world," Information Processing and Management: an International Journal, Volume 38, Issue 2, 2002.

[51] C.Meek and W.Birmingham, "Johnny can't sing: A comprehensive error model for sung music queries," International Symposium on Music Information Retrieval (ISMIR2002), 2002.

[52] G. Eisenberg, J. M. Batke, and T. Sikora, "Beatbank: an mpeg-7 compliant query by tapping system," 116th AES convention, 2004.

[53] A.Kapur, M.Benning, and G.Tzanetakis, "Query by beat boxing: music retrieval for the dj," ISMIR2004, 2004.

[54] Y.Feng, Y.Zhuang, and Y.Pan, "Music information retrieval by detecting mood via computational media aesthetics," IEEE Intl Conference on Web Intelligence, 2003.

[55] V. Subramaniam, "Music information retrieval using continuity," MSc(Engg) Thesis, IISc, 2003.

[56] S.Baumann and T.Pohle, "A comparison of music similarity measures for a p2p application," DaFX-03, 2003.

[57] A. Velivelli, C. Zhui, and T. S. Huung, "Audio segment retrieval using a short duration example query," ICME, 2004.

[58] Z.Liu and Q.Huang, "Content based indexing and retrieval by example in audio," ICME, 2000.

[59] Z.Liu and Q.Huang, "A new distance measure for probability distribution function of mixture type," ICASSP, 2000.

[60] J.Paulus and A.Klapuri, "Measuring the similarity of rhythmic patterns," ISMIR, 2002.

[61] K. Kashino, T. Kurozumi, and H. Murase, "A quick search method for audio and video signals based on histogram pruning," IEEE Trans. on Multimedia, 2003.

[62] G. Smith, H. Murase, and K. Kashino, "Quick audio retrieval using active search," ICASSP, 1998.


[63] K. Kashino, T. Kurozumi, and H. Murase, "Feature fluctuation absorption for a quick audio retrieval from long recordings," International Conference on Pattern Recognition (ICPR), 2000.

[64] K.-S. Park, W.-J. Yoon, K.-K. Lee, S.-H. Oh, and K.-M. Kim, "Mrtb framework: A robust content-based music retrieval and browsing," The International Conference on Consumer Electronics (ICCE), 2005.

[65] S.Doraiswamy and S.Ruger, "An approach towards a polyphonic music retrieval system," ISMIR, 2001.

[66] T. Zhang and C. Kuo, "Hierarchical classification of audio data for archiving and retrieving," ICASSP, 1999.

[67] P. Cano, M. Kaltenbrunner, F. Gouyon, and E. Batlle, "On the use of fastmap for audio retrieval and browsing," ISMIR, 2002.

[68] C.Faloutsos and K.I.Lin, "Fastmap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets," SIGMOD, 1995.

[69] J.Chen and A.Chen, "Query by rhythm: an approach to song retrieval in music databases," Proceedings of the Workshop on Research Issues in Database Engineering, 1998.

[70] A.Klapuri, "Sound onset detection by applying psychoacoustic knowledge," ICASSP, 1999.

[71] E.Scheirer, "Music listening systems," PhD Thesis, MIT, 2000.

[72] J. Saunders, "Real-time discrimination of broadcast speech/music," ICASSP, 1996.

[73] L.Rabiner and B.H.Juang, "Fundamentals of speech recognition," Prentice-Hall, 1993.

[74] G.Tzanetakis, G.Essl, and P.Cook, "Human perception and computer extraction of musical beat strength," DAFx-02, 2002.

[75] R. C.Veltkamp, "Shape matching: Similarity measures and algorithms," Shape Modelling International, 2001.


[76] D. Li, I. Sethi, N. Dimitrova, and T. McGee, "Classification of general audio data for content-based retrieval," Pattern Recognition Letters, April, 2001.

[77] R.J.Macaulay and T.F.Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Transactions on acoustics, speech and signal processing (ASSP), 1986.

[78] T. Virtanen, "Audio signal modeling with sinusoids plus noise," MSc Thesis, 2000.

[79] Serra.X, "A system for sound analysis / transformation / synthesis based on a deterministic plus stochastic decomposition," PhD Thesis, Stanford University, 1989.

[80] H. Purnhagen and N. Meine, "Hiln - the mpeg-4 parametric audio coding tools," ISCAS-2000, 2000.

[81] T. Heittola and A. Klapuri, "Locating segments with drums in music signals," ISMIR02, 2002.

[82] J.-J. Aucouturier and F. Pachet, "Improving timbral similarity: how high is the sky?," Journal of negative research in speech and audio sciences, 2004.

[83] T.Narita and M.Sugiyama, "Fast music retrieval using spectrum and power information," CRAC-01, 2001.

[84] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," Proc. of High Performance Computing on the Information Superhighway, 1997.

[85] R.Jarina, N.O.Connor, S.Marlow, and N.Murphy, "Rhythm detection for speech-music discrimination in mpeg compressed domain," 14th International Conference on Digital Signal Processing, 2002.

[86] K. El-Maleh, M. Klein, G. Petrucci, and P. Kabal, "Speech/music discrimination for multimedia applications," ICASSP, 2000.

[87] S. Karneback, "Discrimination between speech and music based on a low frequency modulation feature," Eurospeech, 2001.


[88] W. Chou and L. Gu, "Robust singing detection in speech/music discriminator design," ICASSP, 2001.

[89] W. Wang, W. Gao, and D. Ying, "A fast and robust speech/music discrimination approach," Proceedings of ICICS-PCM, 2003.

[90] W.Chai and B.Vercoe, "Using user models in music information retrieval systems," International Symposium on Music Information Retrieval (ISMIR2000), 2000.
