impact of audio degradation on music...

Universitat Politècnica de Catalunya

Master Thesis

Impact of audio degradation on musicclassification

Author:Francesc Capó Clar

Supervisor:Dr. Andreas Rauber

A thesis submitted in fulfilment of the requirementsfor the degree of Telecommunications Engineering

in the

Departament de Teoria del Senyal i ComunicacionsEscola Tècnica Superior d’Enginyeria de Telecomunicació de Barcelona

July 2014

http://www.upc.edu/

[email protected]

http://www.ifs.tuwien.ac.at/~andi

http://www.tsc.upc.es/

http://www.etsetb.upc.edu/

Declaration of Authorship

I, Francesc Capó Clar, declare that this thesis titled, ’Impact of audio degradation on

music classification’ and the work presented in it are my own. I confirm that:

� This work was done wholly or mainly while in candidature for a research degree at

this University.

� Where any part of this thesis has previously been submitted for a degree or any

other qualification at this University or any other institution, this has been clearly

stated.

� Where I have consulted the published work of others, this is always clearly at-

tributed.

� Where I have quoted from the work of others, the source is always given. With the

exception of such quotations, this thesis is entirely my own work.

� I have acknowledged all main sources of help.

� Where the thesis is based on work done by myself jointly with others, I have made

clear exactly what was done by others and what I have contributed myself.

Signed:

Date:

i

“Music expresses that which cannot be said and on which it is impossible to be silent."

Victor Hugo

UNIVERSITAT POLITÈCNICA DE CATALUNYA

AbstractEscola Tècnica Superior d’Enginyeria de Telecomunicació de Barcelona

Departament de Teoria del Senyal i Comunicacions

Telecommunications Engineering

Impact of audio degradation on music classification

by Francesc Capó Clar

Music genre classification is an important part in the domain of Music Information

Retrieval that tries to find several algorithms and audio analysis to classify audio tracks

in different music styles or musical genres, like classical, rock or electronic, among others.

The benchmark collections used to make the training model for the classifier tend to

be based on high-quality studio recordings. Nevertheless, sometimes we need to classify

audio coming from several recording ways or from a non high-quality audio recording, so

this fact could have a direct impact on the genre classification.

In this document we present extensive studies focused on the impact of distorted audio

on the music genre classification. We intend to find which are the most robust and the

weakest attributes of several feature sets against a variety of distortions in controlled

settings, as well as their effect on the resulting classification.

Keywords: Music genre classification, audio degradation, Music Information Retrieval,

automatic classification, music features robustness

http://www.upc.edu/

http://www.etsetb.upc.edu/

http://www.tsc.upc.es/

Acknowledgements

First of all, I would want to thank my supervisor, Andreas Rauber, for suggesting me

this interesting project and for his help, comprehension and for being patient with me

during my learning and work.

Next, thanks to my home university supervisor, Antonio Bonafonte, for all the questions

and doubts solved.

In particular, I would like to thank my parents and my family, for giving me the op-

portunity to study this degree and for their unconditional support during all these years

and for helping me in the hardest moments.

Further, an acknowledgement to my all lifelong friends and the “telecos” for encouraging

me in every moment and for making this a more enjoyable and nice way. Especially,

thanks to Xavi Bernat, Pere Guillem Mas and Bernat Orell for their technical help and

Maria Antònia Orell for her linguistic support.

Finally, thanks to the Technische Universität Wien for receiving me as a student, as well

as thanks to Vienna city for making my Erasmus an unrepeatable experience.

Xesc

iv

Agraïments

En primer lloc, voldria agrair al meu supervisor, Andreas Rauber, per suggerir-me aquest

interessant projecte i per la seva ajuda, comprensió i paciència amb jo durant el meu

aprenentatge i treball.

També, un agraïment al meu supervisor de la meva pròpia universitat, Antonio Bona-

fonte, per totes les preguntes i dubtes resolts.

En particular, m’agradaria donar les gràcies als meus pares i família, per donar-me la

oportunitat d’estudiar aquesta carrera, pel seu suport incondicional durant tots aquests

anys i per ajudar-me en els moments més difícils.

També, un especial agraïment als meus amics de tota la vida i als “telecos” per donar-me

ànims en tot moment i fer d’aquest, un camí molt més divertit i agradable. Especialment,

agrair a en Xavi Bernat, en Pere Guillem Mas i en Bernat Orell per la seva ajuda tècnica,

a la vegada que a na Maria Antònia Orell pel seu suport lingüístic.

Per acabar, donar les gràcies a la Technische Universität Wien per acollir-me com a

estudiant, així com agrair a la ciutat de Viena per fer del meu Eramus una experiència

irrepetible.

Xesc

Contents

Declaration of Authorship i

Abstract iii

Acknowledgements / Agraïments iv

Contents vi

List of Figures viii

List of Tables ix

Abbreviations x

1 Introduction 11.1 Music Information Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Music genre classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.4 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Related Work 42.1 Music classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Audio degradation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Noise models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

3 Experimental Set-up 103.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.2 Audio Degradation Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3.2.1 Synthetic Distortions . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2.2 Real World Distortions . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.3 Feature Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.4 Machine Learning Software: Weka . . . . . . . . . . . . . . . . . . . . . . 17

4 Impact of Degradations 214.1 Effect on features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1.1 Feature processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.1.2 Feature differences . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

vi

Contents vii

4.2 Effect on classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.2.1 Creation of 10 - CV folds . . . . . . . . . . . . . . . . . . . . . . . 274.2.2 Analysis of classification results . . . . . . . . . . . . . . . . . . . . 28

5 Results classifying with mixed degradations 325.1 Creation of training and mixed test sets . . . . . . . . . . . . . . . . . . . 325.2 Training and classifying with all attributes . . . . . . . . . . . . . . . . . . 345.3 Attribute selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.3.1 Attribute selection process . . . . . . . . . . . . . . . . . . . . . . . 355.3.2 Possible results with attribute selection . . . . . . . . . . . . . . . 415.3.3 Training and classifying with most robust attributes . . . . . . . . 425.3.4 Training with all attributes and classifying missing the weaker at-

tributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

6 Summary and Further Work 476.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476.2 Further work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

A Worst degradations - Attribute selection 50A.1 Mean differences of ISMIR worst degradations . . . . . . . . . . . . . . . . 50A.2 Variance differences of ISMIR worst degradations . . . . . . . . . . . . . . 51A.3 Mean differences of GTZAN worst degradations . . . . . . . . . . . . . . . 52A.4 Variance differences of GTZAN worst degradations . . . . . . . . . . . . . 53

B Classification of mixed degradations 54B.1 Complete classification results of Section 5.3.3 . . . . . . . . . . . . . . . . 55B.2 Complete classification results of Section 5.3.4 . . . . . . . . . . . . . . . . 58

C Attached files 61

Bibliography 62

List of Figures

1.1 Spotify Radio Genres . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 Weighting Loudness Filter Curves . . . . . . . . . . . . . . . . . . . . . . . 82.2 Masking Auditory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.3 Dolby System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.1 Feature extraction process . . . . . . . . . . . . . . . . . . . . . . . . . . . 163.2 10-Fold cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.1 Mean and variance differences calculation . . . . . . . . . . . . . . . . . . 224.2 Folder structure for the features processing . . . . . . . . . . . . . . . . . 224.3 RP mean differences with Smartphone Recording degradation . . . . . . . 244.4 RH mean differences with Vinyl degradation . . . . . . . . . . . . . . . . . 254.5 SSD mean differences with Low Pass Filtering degradation . . . . . . . . . 254.6 MVD mean differences with Smartphone Playback degradation . . . . . . 26

5.1 Creation of mixed degraded test set . . . . . . . . . . . . . . . . . . . . . . 335.2 Creation of worst degradation . . . . . . . . . . . . . . . . . . . . . . . . . 365.3 ISMIR MVD worst degradation and attributes selection . . . . . . . . . . 385.4 ISMIR RP worst degradation and attributes selection . . . . . . . . . . . . 385.5 ISMIR RH worst degradation and attributes selection . . . . . . . . . . . 395.6 ISMIR SSD worst degradation and attributes selection . . . . . . . . . . . 395.7 ISMIR TSSD worst degradation and attributes selection . . . . . . . . . . 395.8 ISMIR TRH worst degradation and attributes selection . . . . . . . . . . . 40

A.1 ISMIR worst degradations variances and attributes selection . . . . . . . . 51A.2 GTZAN worst degradations means and attributes selection . . . . . . . . 52A.3 GTZAN worst degradations variances and attributes selection . . . . . . . 53

viii

List of Tables

2.1 Central frequency (in Hz) of frequency bands of the Bark Scale . . . . . . 8

4.1 Correct mean classification percentage of clean data sets . . . . . . . . . . 294.2 GTZAN data set: vinyl degradation classification . . . . . . . . . . . . . . 294.3 GTZAN data set: harmonic distortion classification . . . . . . . . . . . . . 304.4 ISMIR data set: low pass filtering degradation classification . . . . . . . . 31

5.1 Mixed degraded classification . . . . . . . . . . . . . . . . . . . . . . . . . 345.2 ISMIR attribute selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 405.3 GTZAN attribute selection . . . . . . . . . . . . . . . . . . . . . . . . . . 415.4 ISMIR, mean, most robust attributes, strong selection . . . . . . . . . . . 435.5 ISMIR, mean, most robust attributes, strong selection . . . . . . . . . . . 445.6 ISMIR, mean, missing weakest attributes, strong selection . . . . . . . . . 455.7 GTZAN, mean, missing weakest attributes, strong selection . . . . . . . . 46

B.1 ISMIR, mean, most robust attributes, tolerant selection . . . . . . . . . . 55B.2 ISMIR, variance, most robust attributes, tolerant selection . . . . . . . . . 55B.3 ISMIR, variance, most robust attributes, strong selection . . . . . . . . . . 56B.4 GTZAN, mean, most robust attributes, tolerant selection . . . . . . . . . 56B.5 GTZAN, variance, most robust attributes, tolerant selection . . . . . . . . 57B.6 GTZAN, variance, most robust attributes, strong selection . . . . . . . . . 57B.7 ISMIR, mean, missing weakest attributes, tolerant selection . . . . . . . . 58B.8 ISMIR, variance, missing weakest attributes, tolerant selection . . . . . . . 58B.9 ISMIR,variance, missing weakest attributes, strong selection . . . . . . . . 59B.10 GTZAN, mean, missing weakest attributes, tolerant selection . . . . . . . 59B.11 GTZAN,variance, missing weakest attributes, tolerant selection . . . . . . 60B.12 GTZAN, variance, missing weakest attributes, strong selection . . . . . . . 60

ix

Abbreviations

MIR Music Information Retreival

ISMIR International Society for Music Information Retreival

MFCC Mel-Frequency Cepstral Coefficients

ADT Audio Degradation Toolbox

RP Rhythm Patterns

RH Rhythm Histogram

SSD Stadistical Spectrum Descriptor

TSSD Temporal Stadistical Spectrum Descriptor

MVD Modulation frequency Variance Descriptor

TRH Temporal Rhythm Histograms

10 - CV 10 - fold Cross Validation

SVM Suport Vector Machines

KNN K-Nearest Neighbours

x

Chapter 1

Introduction

This project has been done in the Music Information Retrieval Group 1 of the Institute

of Software Technology and Interactive Systems 2 at Technische Universität Wien 3, with

the supervision of Dr. Andreas Rauber.

1.1 Music Information Retrieval

Music Information Retrieval, or MIR, is a growing field of research related to the science

of retrieving information from music, as the name itself suggests. Those involved in

this field use a mathematical and scientific background combined with knowledge in

musicology, psychology and academic music studies in order to make extensive studies

that could be very useful. Some of the most important examples could be recommender

systems, track separation and music recognition, automatic music transcription, music

generation and automatic categorization, to which this thesis is related.

Every year, there is a conference of the International Society for Music Information

Retrieval, ISMIR [1], where the MIR researchers around the world can present their

studies on that field and share improvements in the different areas.

The methods used in the MIR studies are common to all of them. These methods are

specific data source (which is used as a training or benchmark), features extraction, pro-

cessing and representation (which have the extracted information of each music track)1http://www.ifs.tuwien.ac.at/mir/index.html2http://www.ifs.tuwien.ac.at/3http://www.tuwien.ac.at/

1

http://www.ifs.tuwien.ac.at/mir/index.html

http://www.ifs.tuwien.ac.at/

http://www.tuwien.ac.at/

Chapter 1. Introduction 2

and finally statistics and machine learning (to classify and get the results). This docu-

ment expects to focus on the music genre classification.

1.2 Music genre classification

Nowadays, we can access to several vast musical databases in due to extended internet

services (e.g. Spotify, Grooveshark, iTunes), creating a need for methods to search and

organise these databases. One way to do this is to classify the audio tracks depending on

their genre or music style (e.g. Figure 1.1). This information can be extracted directly

from the metadata present on several of the new audio formats, as MP3 or files, but

the specified genre can’t be in our own classification, i.e. it could be more general (e.g.

classical, electronic) or more specific (e.g. baroque, dance). Otherwise, in other audio

formats, as an audio CD, there is not metadata information. Thus, we need an automatic

and not subjective process to classify audio tracks among a specific list of available genres.

Figure 1.1: Genre classification used on Spotify Radio

In order to make this classification we need a benchmark data set, which has to contain

several audio tracks already classified on the genres in which we want to classify future

audio files. Thus, the bigger a benchmark data set is, the much larger classification set

can be done.

1.3 Motivation

Normally, the benchmark collections used for the training of the classifiers tends to be

based in a single source, or in high-quality studio recordings. Indeed, this is not a problem

if the audio that we need to classify is also from a high-quality recording, i.e. the audio

Chapter 1. Introduction 3

used as a benchmark and the audio that we want to classify come from an homogeneous

recording quality.

Otherwise, some users compile their own audio collections coming from different data

source, i.e. combining recordings from vinyl, smartphones, live recordings, ethnical and

historical recordings, which could be a mixing of recordings with different encoding qual-

ity and with several distortions. This could have a direct impact on the genre classifica-

tion because the audio has changed its musical features, therefore it could be classified

as a wrong genre. The main goal of this thesis is to study which exactly the effect of this

degradations on the genre classification is, as well as their effect on prominent musical

features. However, we are not interested in an absolute performance improvement of the

correct classification, the complexity classes or any real-world size features of collection.

1.4 Thesis outline

We start reviewing some related work in audio classifications field, including human and

machine learning classification, as well as different degradation and noise models which

we could find in low-quality recordings in Chapter 2.

In Chapter 3 we present our experimental set-up, including all the software, features and

classifiers used in the different experiments.

The degradations have an impact on the features, as well as on the classification, that is

presented in Chapter 4.

In Chapter 5 we present the main results of this study, as well as two ways to try to

increment our correct classification percentage.

Finally, in Chapter 6 we present our conclusions and the summary of our experiments

and we also propose some new experiments to keep working in this study field.

Chapter 2

Related Work

Music classification is one of the dominant areas of MIR research. This study is about

music genre classification, however there exist other studies related to mood or even

author classification, which have to be very precise. In addition, there are several studies

related to the audio degradation and noise models that can be connected, which is the

goal of this thesis. In this section we will do a review of several studies related to it.

2.1 Music classification

One of the data sets used in this thesis (more data set details in Section 3.1) was collected

by G. Tzanetakis in order to make an extensive study of the automatic genre classifica-

tion [2] using several features sets different than the ones that we use in our study. The

features he uses are Timbral Texture Features (that includes spectral centroid, spectral

rolloff, spectral flux, time domain zero crossings, MFCC, analysis and texture window,

and low-energy feature), Timbral Texture Feature Vector (that consists on several statisti-

cal measures of the Timbral Texture Features), Rhythmic Content Features (that consists

on a beat histogram made by the correlation of several extracted envelopes from the

discrete wavelet transform octave frequency bands) and Pitch Content Features (based

on multiple pitch detection techniques). The classifiers used in this study are Simple

Gaussian, Gaussian Mixture Model, Expectation-Maximization and K-Nearest Neighbour.

About the classification, they first proceed to a general classification (across 10 genres)

where they achieve a maximum correct classification of 60% of the instances. Then they

4

Chapter 2. Related Work 5

also perform a sub-genre classification of classical genre (between choir, orchestra, piano

and string quartet) achieving a maximum correct classification of 88% of instances, as

well as a sub-genre classification of jazz genre (between big band, cool, fusion, piano,

quartet and swing) achieving a maximum correct classification of 68%. In addition they

present the confusion matrix for the different classifications. They do a classification by

humans in order to compare both classifications, where humans achieve 53% of correctly

classified instances with only 250 ms of the audio tracks and 70% of correctly classified

instances with 3 seconds of the audio tracks.

Another related studio was made by A. Schindler and A. Rauber [3], in which they

perform a classification of four different data sets, two of them also used in our study,

using several feature sets. The features used are Echonest features, Marysas Features and

Psychoacoustic Features. As a conclusion, the best classification performance in most

of the cases is achieved with the Temporal Echonest features, that is based on all the

statistical moments of Segments Pitches, Segments Timbre, Segments Loudness Max,

Segments Loudness Max Time and lengths of segments calculated from Segments Start.

Furthermore, the dimensionality of this feature is 224 dimensions so the computational

cost required is not as high as other features that can have more than 1000 dimensions.

Support Vector Machines and K-Nearest Neighbours are two classifiers frequently used in

MIR community. In [4], they perform an study of the automatic performer classification

over a group of 18 performers, proving a better classification using the SVM instead of

K-NN. In this study they use conventional features as MFCC and their related statistical

measures taking the whole duration of each audio track to extract the feature. They also

report that, in the case of the performer classification, the percentage of correct classified

instances is higher using songs in the training and testing sets from the same album than

from different albums. They achieve a 84% of correct classified instances with songs from

the same album and 69% of correct classified instances with songs from different albums,

both of them using SVM. The improvement compared to the use of K-NN in this case is

about 15%.


2.2 Audio degradation

In a study related to a software that is used in this thesis (Audio Degradation Toolbox ),

the authors perform an extensive study of the most common ways to degrade audio tracks

[5]. Apart from the instructions and features of the software, the perform experiments of

the impact of some audio degradations on several methods for standard music informatics

on suitable audio data: Audio Identification Service provided by EchoNest, Score-to-

Audio Alignment, Beat-tracking and Chord Detection. In the study they test several

Real World degradations (more information about them in Section 3.2.2), comparing

their impact for each service.

Another example of study related to audio degradation is [6],in which the authors study

the effect of the degradation on the audio track classification, i.e. for each of the music

items in the database there is one class. They use three features in their study: Loud-

ness, that belongs to the category of the intensity sensations; Spectral Flatness Measures

(SFM), that is related to the tonality aspect of the audio signal, used in the discrimi-

nating criterion between different audio tracks; and Spectral Crest Factor (SCF), that

is similar to the SFM but using the maximum values of the audio signal instead of the

mean values used in the SFM. All the feature are extracted individually from the dif-

ferent frequency bands. Their experiment consist in performing a degradation of the

audio tracks in several ways, then extract the features from the degraded tracks, as well

as from the original tracks, and finally classify the degraded audio features among the

original audio. The degradations performed are: time shifts, cropping, volume change

(although the features are designed in order to be level independents of the volume, thus

there are not separate test results), codification in 96 kbps MPGE-1/2 Layer-3, equal-

isation (used with adjacent band attenuations set to -6 dB and +6 dB in an alternate

fashion), band limiting (low pass filtering), dynamic range compression, noise addition

and loudspeaker-microphone transmission. The results of the experiments are different

depending on the frequency band selection, but as a conclusion, SFM and SCF are more

robust than Loudness against audio degradation.

Sometimes is also good to perform a genre classification by humans in order to compare

the results achieved with the automatic classification. The audio degradation could

have a direct impact on the human classification as well. In [7] paper, the authors

discuss the audio degradation effects on the genre classification by music students. The


audio is degraded with 3 different levels of timbre or rhythm alterations. The timbre

alterations consist in changing 3 different frequency bands (3rd, 6th and 12th octaves)

by a Gaussian noise of the same spectral power on that band. On the other hand, the

rhythmic degradation consist in shuffling the frames during three different times (125,

250 and 500 ms) preserving the average global and the timbre information. The results

show that the rhythm degradations are not meaningful, besides the timbre degradations

show a very important deterioration, sometimes reaching the 70% of deterioration in the

case of classical music.

The MSc thesis by J. Mansen [8] performs a complete feature extraction covering the

whole spectrum of standard features used in MIR. The features are organized by differ-

ent toolboxes used in their extraction: Binaural Cue Selection, Music Analysis, Chroma,

ISP, MIR, PSY and YAAFE. The author also discusses the effect of the MP3 encod-

ing and the resampling on the feature extraction. According to the results, the MP3

compression with 192 Kbps and above is acceptable for the musical data, whereas below

bit rates can be significant due to features extraction. As far as frequency resampling is

concerned, although it could have no effect on the audio signal, the features extraction

could be affected due to the exchange of information between frequency bands, and in

some cases like down-sampling, it could even lead to the removing a few attributes from

some feature sets.

2.3 Noise models

The perception of the audio by humans is different than the machines’ perception. The

human audiology system is not lineal for all the frequencies due to the size of the different

parts of the ear. Thus, the audio noise doesn’t affect humans with the same magnitude as

well. In order to linearise the humans auditive response to the audio systems and noise is

common the use of Weighting filters [9]. There exist several weighting filters specifically

designed for different applications, but the most prominent one used in the noise studies

and measurements is the A-Weighting. This one emphasises the frequencies around 3-6

KHz, while attenuates the low and very high frequencies, as the ear sensitivity. Thus,

the unit used to measure the loudness is called dBA. There are also filters B, C (used

for louder sounds) and D (used for loud aircraft noise) (Figure 2.1).


Figure 2.1: Weighting loudness filter curves: A-weighting (blue), B-weighting (yel-low), C-weighting (red) and D-weighting (black)

Another fact of the human ear system related to the noise distortion is the masking

auditory. This effect appears when a sound has a high loudness in a specific frequency

range and makes imperceptible other lower sounds present in the same critical frequency

band (Table 2.1) or in the nearest ones (Figure 2.2). This can be a problem in noisy

environments but also in some music mixing where some instruments or vocals can be

masked by others.

100 200 300 400 510 630 770 920 1080 1270 1480 17202000 2320 2700 3150 3700 4400 5300 6400 7700 9500 12000 15500

Table 2.1: Central frequency (in Hz) of frequency bands of the Bark Scale

Related to both previous psychoacoustic facts, there is a study related to a different noise

model that can be useful to improve the non masking by noise in the audio signals [10].

In it, the author explains that in some cases it is possible that in an audio track there is

a noise from a sinusoidal noise (e.g. interferences from 50 Hz from the power grid); that

can cause an effect of auditory masking. Even in some cases, the interference could be in

the frequency bands where the human ear is most sensible, producing a higher masking

of a largest frequency spectrum range. One solution proposed by the author is to locate

the region that promotes the noise using the humans masking auditory knowledge in

order to filter it and try to eliminate it.


Figure 2.2: Masking auditory of a sound between 125 and 250 Hz of 40 dB maskedby a sound between 250 and 500 Hz of 70 dB

Another system to avoid the noise distortion and provide the cassettes of more robustness

is this degradation [11]. This system is based in encoder and decoder systems called

companders that compress the range between loud and soft of the audio that we want

to record in the cassette and expand the range back again on the playback, therefore

reducing the tape noise (Figure 2.3).

Figure 2.3: Dolby system used in the cassettes recordings

Chapter 3

Experimental Set-up

This study is based on the repetitions of several processes: degradation of all the data

sets, extraction of features and finally a genre classification of the different audio sets

and combinations between them. As this is a repetitive action that we need to do in

several data subsets, we use Matlab [12] in order to run the different software of feature

extraction, classify and to do graphs with the results. The Matlab is working on an

Ubuntu 13.10 remote server.

3.1 Data Sets

The data sets are the collections of audio tracks that are used to work in our study.

Due to copyright terms, we are not allowed to use our own audio collection, so we use

two prominent data sets that have been studied intensively over the time in the MIR

community. Each data set has its own list of musical genres in which the audio tracks

are previously classified. In the study we do the same procedure for both data sets, but

without mixing any data between them, so they are two different compilations, and as

we will see in their features, they have different audio format and quality. The two data

sets used are:

• GTZAN: Data set collected by G. Tzanetakis and P. Cook [2]. It consists of 1000

audio tracks, each 30 seconds long, equally distributed across 10 musical genres:

blues, classical, country, disco, hiphop, pop, jazz, metal, reggae and rock. All

tracks are 22050 Hz Mono 16-bit in .au format.

10

Chapter 3. Experimental Set-up 11

• ISMIR: Data set created for the ISMIR 2004 in the genre classification evaluation

campaign [13]. It consists of 1458 full length files distributed in 6 musical genres:

classical, electronic, jazz&blues, metal&punk, rock&pop and world. All tracks are

44 KHz Stereo 128 Kbits in .mp3 format.

The GTZAN data set is classified across 10 musical genres whereas the ISMIR data set

is classified across 6 musical genres: the last one is more general and it also mixes similar

genres e.g. jazz&blues or rock&pop. Thus, the percentage of correct genre classification

tracks in the second one should be higher than in the first one. We will check it in the

following sections.

3.2 Audio Degradation Toolbox

In order to study the differences between clean and distorted audio, we need to compare

all the clean audio tracks with their respective degraded versions. The Audio Degradation

Toolbox 0.2 (ADT) is used with the purpose of creating the different degraded versions

of the clean audio in controlled settings [5].

The ADT is implemented in Matlab. This software works doing a degraded version of

all the audio tracks on the input list. We can choose between 14 distortion units, in

which case the software applies only a single way to degrade the audio track, or we can

combine several distortions at the same time. In addition, all the parameters of the

distortions can be configured. In this study we use the predefined options, which consist

in 12 distortion units that we call synthetic distortions and 6 other distortions that we

call real world distortions. All the distortions are applied across all the audio tracks and

after this, the audio signal is also normalized with a maximum amplitude of 0.999. The

clean audio is also normalized with a maximum amplitude of 0.999.

3.2.1 Synthetic Distortions

They involve isolated distortions that only would be added deliberately to an audio track

in order to make an audio processing study as well as a music effect. The 12 distortions

that we use are:


• Add pink noise: Adds a pink noise with a final SNR = 10 dB. The noise is

implemented as white noise plus filter.

• Add background sound: Adds a background sound with a final SNR = 10 dB.

The sound used is a noise from a restaurant environment [14].

• Aliasing: The signal is down-sampled to 4000 Hz sampling rate without low pass

filtering, causing a purposeful violation of Nyquist-Shannon sampling theorem.

Then, the original sampling rate is restored using a regular re-sampling method

with filtering.

• Clipping: Normalize the audio signal such that a 10% is outside the interval [-1,1],

and each resulting sample x is clipped using sign(x).

• Dynamic range compression: Applies a signal-dependent normalization to the

audio signal, reducing the energy differences between soft and loud parts of the

signal. The parameters of this distortion are: forgetting time = 0.1 seconds, com-

pressor threshold = -40 dB, compressor slope = 0.9, attack time = 0.01 seconds,

release time = 0.01 seconds and delay time = 0.01 seconds.

• Harmonic distortion: Applies a quadratic distortion to the audio signal with 5

iterative applications.

• Low quality MP3 compression: The audio signal is compressed to MP3 with

a constant bit rate of 32 Kbps. Then is decompressed to the out quality.

• Speed up: The audio signal is resampled in order to speed-up the music 5% up.

• Wow re-sampling: Applies a time-dependent resampling of the audio signal with

intensity of change = 3 and frequency of change = 0.5, imitating the non-constant

of some analogue players.

• Delay: Pads the first 22050 audio samples with zero samples.

• High-pass filtering: Applies a linear high-pass filtering using a Hamming window

with the stop frequency = 1000 Hz.

• Low-pass filtering: Analogue to the high-pass filtering, applies a low-pass filter-

ing with the stop frequency = 800 Hz.


In all of these cases, the distortions are applied with prominent presence in all the cases,

in order to observe the maximum effect that could produce each distortion on the clean

audio tracks.

3.2.2 Real World Distortions

These distortions are a few examples of audio degradations that we could find in real-

life recording conditions. These are made as combination of synthetic distortions with

specific parameters that try to emulate a typical recording instrument or emplacement.

The 6 real world distortions used are:

• Live recording: Applies an impulse response of a reverberation effect called Great

Hall (taken from [15] but included in the ADT) and adds a pink noise with SNR

ratio= 40 dB. As the name says, it makes the effect of a live recording of a musical

concert on an open stage.

• Strong MP3 compression: The audio signal is compressed to MP3 with a

constant bit rate of 64 kbps. The bit rate compression used is the double as in the

low quality MP3 compression used in the synthetic distortion. It makes that the

difference with the uncompressed audio can be almost imperceptible to the human

ear.

• Vinyl recording: Applies an impulse response of a vinyl effect extracted from a

plug-in 1 (included in the ADT), adds a sound of a vinyl player with a SNR ratio

= 40 dB from the same plug-in effect, applies a Wow resampling distortion with

intensity of change = 1.3 and frequency of change = 33/60 (vinyl rpm = 33 rpm)

and finally adds a pink noise with SNR ratio = 40 dB. It imitates a vinyl player

recording with his typical fluctuations.

• Radio broadcast: Applies a dynamic range compression with forgetting time =

0.3 seconds, compressor threshold = -40 dB, compressor slope = 0.6 dB, attack

time = 0.2 seconds, release time = 0.2 seconds and delay time = 0.2 seconds, and

applies a speed up of +2%. It imitates the loudness characteristic of many radio

stations and the speed up used to shorten music and create more advertisement

time.1https://www.izotope.com/fr/products/effects-instruments/vinyl/

https://www.izotope.com/fr/products/effects-instruments/vinyl/


• Smart phone recording: Applies an impulse response from a Google Nexus One

front microphone (included in the ADT), applies a dynamic range compression with

forgetting time = 0.2 seconds, compressor threshold = -35 dB, compressor slope

= 0.5, attack time = 0.01 seconds, release time = 0.01 seconds and delay time =

0.01 seconds; also applies a clipping distortion with 0.3% of samples clipped, and

finally adds a pink noise with a SNR ratio = 35 dB.

• Smart phone playback: Applies an impulse response from a Google Nexus One

front speaker (included in the ADT) which has a high-pass filtering and cut-off =

500 Hz, and adds a pink noise with a SNR ratio = 40 dB.

At the end of the procedure, all output audio tracks are compressed to MP3 in order

to store the distorted tracks in MP3 format. The compressor used is LAME 2 with the

parameters set to the highest quality encoding, a constant bit rate of 256 kbps and joint

stereo.

3.3 Feature Sets

In order to do an audio genre classification we need to extract information from those

tracks as a manageable set of values. This values are called music features and there are

several kinds of them, e.g. MFCC (Mel-Frequency Cepstral Coefficients) which measure

the timbre of music, and Chroma features which is related to the harmony and chords

of music, among others. In this study we want to focus on several subsets of psycho-

acoustic features [16], which can be extracted with the Audio Feature Extraction Software

v.6411 using a Matlab implementation 3. These features are called this way because they

describe rhythmic structures on variety of frequency bands considering psycho-acoustic

phenomenons according to the human perception of music and sound. The features that

we use are:

• Rhythm Patterns (RP): Represents the modulation amplitudes for a range of

modulation frequencies on critical bands of the Bark scale (Table 2.1) according to

the human auditory range perception and the loudness sensation per band. The2http://lame.sourceforge.net3http://www.ifs.tuwien.ac.at/mir/audiofeatureextraction.html

http://lame.sourceforge.net

http://www.ifs.tuwien.ac.at/mir/audiofeatureextraction.html


whole extraction process is shown in Figure 3.1). The resulting feature is a 1440

dimensional value vector (60 bins for each of 24 critical band).

• Rhythm Histograms (RH): This is a histogram of 60 bins based on the average

of the 24 critical bands (Table 2.1) computed on RP. It captures the modulation

between 0 and 10 Hz which represents the general rhythm characteristics of the au-

dio track. The extraction process is on Figure 3.1. The feature is a 60 dimensional

value vector.

• Statistical Spectrum Descriptor (SSD): Computes several statistical measures

on each of 24 critical bands with the audio track already adapted to the human au-

ditory perception. The statistical measures computed are: mean, median, variance,

skewness, kurtosis, minimum value and maximum value. The extraction process is

on figure 3.1. The resulting feature is a 168 dimensional value vector, where the

values are organized first in the seven different statistical measures, and in each

subgroup we can find the value for each different frequency band.

• Modulation frequency Variance Descriptors (MVD): This represents the

variations over the critical bands (Table 2.1) for a specific modulation frequency.

Similar to the SSD, it calculates the 7 statistical measures to each 60 fluctuation

bins over all the 24 critical bands, and then makes the average of the statistical

measures for all the critical bands, resulting a 420 dimensional value vector.

• Temporal Statistical Spectrum Descriptor (TSSD): In order to incorporate

time series aspect, it takes the SSD over 7 different parts of the audio track de-

scribing the variations over the time. The resulting vector has a dimensionality of

1176 values, i.e. 7 times the dimension of SSD.

• Temporal Rhythm Histograms (TRH): It takes the RH over 7 different parts

of the audio track and describes the difference in the fluctuations between 0 and

10 Hz in the different parts of the audio track. The dimensionality of the resulting

vector is of 420 values, ie. 7 times the dimension of RH.

All the features are extracted as a text file with the name of the feature as the extension

for each different feature set. Each feature set has a header that specifies information

related to the extracted feature: the dimension of the feature (including subsets dimen-

sion), the number of tracks extracted and more information about the software version.


Figure 3.1: Feature extraction process for the RP, RH and SSD

After the header there are different values that are called attributes, i.e. the dimension

of each feature is the number of attributes. Each attribute is the music information

extracted from the audio track. Each attribute has a different meaning, e.g. mean of sig-

nal in a concrete frequency band, fluctuation of frequencies in a frequency band, among

others. The attributes of each feature can be compared to each other but not to the at-

tributes of other features, because they are not related to the same music characteristic.

This is an example of the RH text file of feature extraction (comments are written after

%):

$TYPE vec

$DATA_TYPE audio-rh

$DATA_DIM 1x60 %1 group of 60 attributes per audio track

$EXTRACTOR Matlab rp_extract v 0.6411 by tml

$XDIM 100 % number of audio tracks analyzed

$YDIM 1

$VEC_DIM 60 % total dimension of the feature

Val_Attr1_Track1 Val_Attr2_Track1 ... Val_Attr60_Track1 Name_Track1


Val_Attr1_Track2 Val_Attr2_Track2 ... Val_Attr60_Track2 Name_Track2

...

3.4 Machine Learning Software: Weka

The last step to do the genre classification is to process all the vector values coming

from the feature extraction. The most common way to do the genre classification is the

machine learning. This is a branch of artificial intelligence which concerns the construc-

tion and study of systems, in this case the genre classification, that can be created and

learned from data.

In the genre classification systems there are two parts: the training set and the test set.

The training set is a representative collection of audio tracks already classified in genre,

that have to contain all the genres in which we want to classify. With the training set we

are able to make the model using a classifier. Finally, we can use the test set, which is

also already classified in genre, and make the classification using the model constructed

before. The procedure that the machine learning software makes is to compare the genre

classification made with the trained model and the classification previously done of the

test set. Then, the results are several values with statistical measures comparing both

classifications and the confusion matrix. In this study we will use only the percentage of

correctly classified instances, which is the most general value for the genre classification.

In order to make a good genre classification, the training set should contain more audio

tracks than the test set (normally, the proportion used is 90% of training set and 10%

of test set). This is because the model needs more statistical information about the

benchmark data in order to be more accurate.

One of the most used procedures in the classification studies is the 10-Fold cross val-

idation. This is a way of using all the data set that we want to study as a training

set and at the same time also as a test set. It consists in generating 10 independent

classification experiments in which we create a test set based on the 10% of the data set

and the remaining 90% is used to create the model to be used in the classification of the

test set selected before. Then the experiment is repeated but selecting 10% of different

audio tracks from the data set as a test set and taking the 90% remaining as a training


set. This action is repeated 10 times in order to use all the data as a test set. Then the

results are averaged. Example Figure: 3.2.

Figure 3.2: Example of 10-fold cross validation

The software that we use in this study to do our experiments is Weka (Waikato Environ-

ment for Knowledge Analysis) [17]. This is a popular suite of machine learning written

in Java that can also be run by Java, so it can be called from Matlab. It is also an

open-source software. The program includes a graphic interface but we only will use it

with the command line calls. Weka can read *.arff files, which are text files that have all

the data values from each of the instance that we want to process. This files can allow

several different kinds of information (see the full *.arff files specifications in 4) but our

files will always have the same structure:

@Relation Name_of_the_file

@Attribute term1 numeric

@Attribute term2 numeric

...

@Attribute class {list_of_different_genres}

@Data

Value_term1_Audio1, Value_term2_Audio1, ... , genre_Audio14http://www.cs.waikato.ac.nz/ml/weka/arff.html

http://www.cs.waikato.ac.nz/ml/weka/arff.html


Value_term1_Audio2, Value_term2_Audio2, ... , genre_Audio2

...

We need to do a conversion from the files from the feature extraction to the *.arff files

because, even if the structure is similar, weka software can only read files with this

format. We also need to put together all the files from the different genres of the data

set, specifying the genre of the audio track and at the end of the attribute values line

instead of the name of the audio track.

When the *.arff file is ready, we are able to do the genre classification. To do it, we have

to chose which classifier we want to use. Weka contains several classifiers to be selected,

which have too several parameters to specify. In this study we will use the most common

classifiers used in the MIR research [3]:

• Naive Bayes: Is a popular probabilistic classifier based on the Bayes’ theorem:

P (A|B) =P (B|A)P (A)

P (B)

The main feature of this classifier is that it assumes that all the attributes are

independent between them. This classifier is efficient and robust against noisy

data and has a simple structure. This classifier can work fine with less training

files than the others.

• Support Vector Machines (SVM): This classifier constructs a set of hyper-

planes in a high dimensional space with the attributes and then choses the one

with the largest distance between the different genres. We use two versions of

the SVM classifier: Linear Polykernel and RBFKernel (RBF), both with penalty

parameter = 1, RBF Gamma = 0.01 and c = 0.1 (default parameters).

• J48: Is the open-source Java implementation of decision tree C4.5. It works using

the concept of information entropy. In the genre classification is very useful because

it has the advantage of being relatively quick to train. It is used with a confidence

factor used for pruning from 0.25 and a minimum of 2 instances per leaf.

• Random Forest: constructs a multitude of decision trees at training time, needing

more time but being superior in precision than J48. The parameters that we use


for this classifier are unlimited depth of trees, 10 generated trees and the number

of attributes to be used in random selection set to 0.

• K-Nearest Neighbours (KNN): Is a popular non-parametric classifier. It is

based on the lazy learning, where the function is only approximated locally and all

computation is deferred until classification. In our study we use Euclidean distance

(L2) as well as Manhattan distance (L1), both with k = 1.

Chapter 4

Impact of Degradations

4.1 Effect on features

After the performing of the degradation of all data sets and features extraction, we are

able to evaluate the effect of each degradation on the different feature sets. In order to

see the differences between the clean and the distorted audio, we need to process the

feature values.

4.1.1 Feature processing

As the feature extraction writes the values on a text file for each feature set of each

genre and degradation, we need to create a Matlab script in order to read the values

from the different text files and load them on a Matlab matrix that we will use to make

the different computations. We want to see the differences between the clean and the

degraded audio for each feature set depending on the degradation, but otherwise, in this

part we don’t discriminate between different genres. An absolute value difference is done

for each attribute across all the feature set; then the mean and the variance of each

difference of the attributes is calculated for all the audio tracks degraded with the same

distortion. The whole process is in the example Figure 4.1.

Mean and variance differences are calculated for all the tracks for each degradation case,

in order to study the differences independently of the genre, although this is done for

each data set separately. This involves a new folder structure as in the Figure 4.2.

21

Chapter 4. Impact of Degradations 22

$XGLR�WUDFN��

$XGLR�WUDFN��

)HDWXUH�6HW�$WWULEXWHV $WWULEXWH�YDOXH�IURP�FOHDQ�DXGLR

$WWULEXWH�YDOXH�IURP�GHJUDGHG�DXGLR

$EVROXWH�YDOXH�RI�WKH�GLIIHUHQFH�EHWZHHQ�DWWULEXWHV�IURP�FOHDQ�DQG�GHJUDGHG�DXGLR

$OO�DXGLR�WUDFNV�IURP�WKH�VDPH�GHJUDGDWLRQ

$OO�GLIIHUHQFHV�IURP�WKH�VDPH�GHJUDGDWLRQ

0HDQ�RI�GLIIHUHQFHV�IURP�WKH�VDPH�GHJUDGDWLRQ�

9DULDQFH�RI�GLIIHUHQFHV�IURP�WKH�VDPH�GHJUDGDWLRQ�

Figure 4.1: Performing of mean and variance differences calculation. First step:differences between attributes from clean and degraded audio for each audio track;

second step: mean and variance of all the attributes from the same degradation.

Figure 4.2: Folder structure before (separated by genre) and after (all genres togetherfor each degradation, and creation of new subfolder with mean and variance plots) the

features processing.


4.1.2 Feature differences

Regarding to the features processing done before, now we are able to analyse the robust-

ness and the weakness of the attributes against all the degradations that we applied, per

each attribute. Our results are presented as a plot for each degradation and feature, of

the mean and variance differences over all the feature set. Conceptually, a small mean

difference of an attribute means that this attribute is hardly affect by the respective

degradation, thus making it a robust attribute in an audio collection having both clean

and degraded audio with this degradation, i.e. the mean attribute values are similar

between clean and degraded audio. Oppositely, a higher mean difference of an attribute

means that this attribute has a significant impact of the clean audio by the degradation,

thus making it a weak attribute, i.e. the mean attribute values are different between

clean and degraded audio. On the other hand, a high variance difference of an attribute

means that this attribute is also weak against this degradation because this means a high

dispersion of the attribute differences, that could also induce to a wrong classification.

In summary: robuster attributes have small mean and variance difference, as well as

weaker attributes have high mean difference and/or high variance difference.

Differences observed between clean and degraded audio follow a similar degradation

pattern for each feature set, mainly in the mean differences. However, differences remain

between different degradations. On the other hand, differences between both data sets

used in this study are very similar, mainly on mean differences and diverging only in

some isolated cases on variance differences. All mean and variance attribute differences

are presented in attached files (Appendix C), although in this section we will analyse

some relevant results of the main feature differences.

Regarding to differences referring to RP, as in the Figure 4.3, the differences are irregu-

larly distributed over the whole feature set due to the composition of it. RP is a large

feature set organized by 60 groups that belongs to the amplitude modulation, where each

group includes information of the 24 frequency bands, resulting of a 1440 dimensionality.

The degradation effect to each frequency band is different for each degradation, as well

as for each of 60 groups, so we cannot simply select a frequency band in order to classify

the robuster and weaker attributes. High variance on attribute differences are irregularly

distributed as well, but with less presence than mean differences.


0 200 400 600 800 1000 1200 14000

0.02

0.04

0.06

0.08

0.1

0.12

0.14Mean rp smartPhoneRecording

Figure 4.3: Rhythm Patterns mean attribute differences between clean audio andSmartphone Recording degradation of the ISMIR data set.

In the case of RH, the attributes classification is simpler, as high shifts are located in

lower part of 60 bins that model the feature set. As the feature set describes amplitude

modulation of the aggregation of all the frequency bands, degradations affect more to

slower rhythm features than faster. You can see an example of feature differences in

Figure 4.4. On the other hand, the variance of attribute differences is not very impor-

tant except by degradations as Clipping and Harmonic Distortion, which lead to a high

variance on firsts attribute differences.

Regarding to the differences of SSD, in the major part of the cases, the most significant

shifts are found in skewness measures, mainly in the highest frequency bands as you

can see in Figure 4.5 which belongs to difference of Low pass filtering degradation. In

that case low frequency bands are barely affected comparing to high frequency bands

due to the degradation peculiarities, i.e. the audio signal in the lowest frequency range

shouldn’t be affected by the degradation, so neither do features belonging to that part.

The highest shifts observed in the MVD case are also found in skewness measures group,

but in this feature, differences on the skewness have a similar value over all the 60 bins as

you can see in the example Figure 4.6. The second measure group with higher differences


0 10 20 30 40 50 600

0.2

0.4

0.6

0.8

1

1.2

1.4Mean rh vinylRecording

Figure 4.4: Rhythm Histograms mean attribute differences between clean audio andVinyl degradation of the GTZAN data set.

0 20 40 60 80 100 120 140 1600

20

40

60

80

100

120

140Mean ssd unit applyLowpassFilter

Mean Median Variance Skewness Kurtosis Minimum Maximum

Figure 4.5: Statistical Spectrum Descriptor mean attribute differences between cleanaudio and Low Pass Filtering degradation of the ISMIR data set. Dashed red linesindicate the separation of statistical measure groups that contains each measure related

to the 24 frequency bands 2.1.


in this feature is the variance between all bins corresponding to each frequency band,

resulting a group of 60 bins (don’t confuse with the variance between all the tracks

difference). In some degradations, we can also see another important difference in the

first bins of the group of maximum measures, but these shifts are not as relevant as the

two later discussed.

0 50 100 150 200 250 300 350 4000

0.5

1

1.5

2

2.5Mean mvd smartPhonePlayback

Mean VarianceMedian Skewness Kurtosis Minimum Maximum

Figure 4.6: Modulation frequency Variance Descriptor mean attribute differencesbetween clean audio and Smartphone Playback degradation of the ISMIR data set.Dashed red lines indicate the separation of statistical measure groups that contains 60

bins from the amplitude modulation of all the frequency bands aggregated 2.1.

Referring to the temporal features (TSSD and TRH ), the differences between degraded

and clean audio are not very prominent in all attributes except for an small attribute

range, so the effect on the genre classification will not be affected either.

As a summary of this section, the most affected features are RP and RH over all at-

tributes range, whereas on SSD and MVD high shifts are located on a clear range of

the features. Otherwise TSSD and TRH are not specially affected by degradations.

After this extensive analysis of degradation effects on our used features, we expect an

important effect on the genre classification.


4.2 Effect on classification

In this section we analyse the results achieved in the classification of both data sets

degraded by the different degradations explained in Sections 3.2.1 and 3.2.2. We compare

the results with the classification of the same clean data sets as well. First we need to

perform a creation of a whole 10-CV environment consisting of different folds: training

set and test set. It is important to emphasize that in this section, in both training and

test sets we are using the same single degradation over all the data set.

4.2.1 Creation of 10 - CV folds

As we already explained in Section 3.4, Weka software uses feature values in order to

perform the classification. We have to create an individual *.arff file per each data set,

feature and degradation. The different files have to be like the explained model in the

mentioned section, containing the attribute values of each audio track from all the genres

of each data set, and the genre where it belongs at the end of the line. We have to repeat

the same procedure for each degradation and for each feature set as well. For instance,

in the GTZAN case:

GTZAN data set = 100 audio tracks per genre · 10 genres = 1000 audio tracks

6 feature sets · (clean+ 12 synthetic degr.+ 6 real world degr.) = 114 arff files

Each one of these 114 *.arff files will contain information about all the 1000 audio tracks

from the GTZAN data set, each file with different feature sets and degradations. The

same operation has to be repeated for the ISMIR data set, resulting the same number of

*.arff files but containing information about the 1458 audio tracks of the data set. When

we have already created the *.arff files we are able to create the 10-CV environment

necessary to perform the classification.

In order to use all audio files information as a test file, we use the 10-CV technique

(Figure 3.2). Weka software has an option which is able to construct its own 10-CV

environment, with the different folds containing training and test sets, but we don’t use

it because we want more information about the different folds that can not be provided


with this mechanism. Therefore, we construct our own 10-CV model using filtering

Weka options with which we can create the different folds of the whole *.arff file; then we

perform a single validation for each fold, getting as a result the percentage of correctly

classified instances per each fold. Finally we are able to calculate the correctly classified

instances mean and variance.

4.2.2 Analysis of classification results

Before the analysis of the classification achieved with different degradation, we will per-

form an extra classification of the clean audio set using the same 10-CV procedure ex-

plained before, in order to be able to compare both classifications: clean audio and

degraded audio. In Tables 4.1 you can see the classification results of both clean data

sets (GTZAN & ISMIR). The maximum mean values achieved in both data sets are

around 80% on ISMIR and 75% on GTZAN. The classifier that achieved better mean

results is the Support Vector Machine by Sequential Minimal Optimization with Polyno-

mial Kernel (SMO Polykernel), whereas the feature that achieved better results is the

SSD, in both libraries as well. Otherwise, as we commented before, the classification of

ISMIR is better than the classification of GTZAN due to the number of genres in which

they are distributed, i.e. is more difficult to achieve good results in data sets distributed

with a large number of genres than in others less selective. The second better classifiers

are either KNN (Euclidean and Manhattan), and the second better feature is TSSD,

which is directly related with SSD. About the variance between different folds classifica-

tion, in GTZAN data set is around 25% whereas in ISMIR data set is around 10%. This

means that the percentage of correctly classified instances is more regular for each fold

in ISMIR than in GTZAN.

In most of the cases, the application of a single degradation over training and test sets

has not a prominent impact on the genre classification, i.e. the average mean percentage

values are similar on the classification of clean audio. As you can see in the example

Tables 4.2, on the vinyl degradation classification, the values are around ± 2%, so the

differences between both classifications are not relevant. If we take a look to the RH

difference studied before in Figure 4.4 (the same case), even if we see differences over all

the attributes of the feature, we still have similar classification results.


(a) GTZAN data set: 10 genres

GTZAN&MEANoriginalClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 34,70% 49,00% 37,40% 52,30% 54,60% 35,80%SMO PolyKernel 42,60% 65,80% 49,10% 74,40% 67,50% 38,90%SMO RBFKernel 28,60% 59,60% 36,40% 52,10% 64,40% 37,60%J48 32,90% 36,40% 33,60% 49,60% 47,20% 29,50%RandomForest 35,30% 43,90% 37,90% 61,60% 59,40% 37,90%KNN Euclidean 40,40% 51,60% 40,80% 66,10% 51,60% 30,80%KNN Manhattan 40,20% 53,60% 42,80% 66,20% 61,30% 35,40%

Difference&Original&/&DegradationsliveRecording Positive&=&improvement&of&classification&with&degradation

Negative&&=&deterioration&with&degradationClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 33,00% 49,10% 38,70% 51,00% 48,30% 35,10% Naive Bayes L1,70% 0,10% 1,30% L1,30% L6,30% L0,70%SMO PolyKernel 40,40% 66,10% 45,10% 65,20% 59,90% 33,40% SMO PolyKernel L2,20% 0,30% L4,00% L9,20% L7,60% L5,50%SMO RBFKernel 27,60% 61,10% 33,80% 50,90% 59,60% 32,80% SMO RBFKernel L1,00% 1,50% L2,60% L1,20% L4,80% L4,80%J48 28,00% 34,20% 30,90% 46,30% 43,90% 27,30% J48 L4,90% L2,20% L2,70% L3,30% L3,30% L2,20%RandomForest 30,90% 39,50% 37,80% 55,40% 53,50% 32,70% RandomForest L4,40% L4,40% L0,10% L6,20% L5,90% L5,20%KNN Euclidean 37,00% 53,20% 39,00% 58,90% 49,20% 30,80% KNN Euclidean L3,40% 1,60% L1,80% L7,20% L2,40% 0,00%KNN Manhattan 35,20% 53,90% 39,70% 60,60% 56,40% 35,40% KNN Manhattan L5,00% 0,30% L3,10% L5,60% L4,90% 0,00%

strongMp3CompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 34,80% 48,30% 38,60% 53,10% 54,70% 35,60% Naive Bayes 0,10% L0,70% 1,20% 0,80% 0,10% L0,20%SMO PolyKernel 42,10% 65,10% 48,80% 73,50% 68,70% 38,90% SMO PolyKernel L0,50% L0,70% L0,30% L0,90% 1,20% 0,00%SMO RBFKernel 27,40% 59,60% 36,50% 53,60% 65,90% 37,00% SMO RBFKernel L1,20% 0,00% 0,10% 1,50% 1,50% L0,60%J48 30,60% 36,50% 30,40% 49,10% 47,00% 30,10% J48 L2,30% 0,10% L3,20% L0,50% L0,20% 0,60%RandomForest 35,50% 43,20% 39,80% 59,70% 58,00% 34,50% RandomForest 0,20% L0,70% 1,90% L1,90% L1,40% L3,40%KNN Euclidean 40,10% 51,60% 41,40% 66,30% 51,60% 30,70% KNN Euclidean L0,30% 0,00% 0,60% 0,20% 0,00% L0,10%KNN Manhattan 40,90% 53,10% 42,40% 66,40% 60,90% 36,00% KNN Manhattan 0,70% L0,50% L0,40% 0,20% L0,40% 0,60%

vinylRecordingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 35,20% 49,60% 38,50% 49,70% 50,90% 37,00% Naive Bayes 0,50% 0,60% 1,10% L2,60% L3,70% 1,20%SMO PolyKernel 40,10% 64,30% 49,50% 72,80% 65,10% 37,70% SMO PolyKernel L2,50% L1,50% 0,40% L1,60% L2,40% L1,20%SMO RBFKernel 29,80% 60,10% 37,90% 49,50% 63,10% 39,30% SMO RBFKernel 1,20% 0,50% 1,50% L2,60% L1,30% 1,70%J48 30,40% 36,40% 31,50% 46,90% 42,60% 29,90% J48 L2,50% 0,00% L2,10% L2,70% L4,60% 0,40%RandomForest 36,60% 41,30% 37,30% 54,70% 54,20% 34,50% RandomForest 1,30% L2,60% L0,60% L6,90% L5,20% L3,40%KNN Euclidean 38,50% 51,50% 41,60% 62,00% 48,90% 31,90% KNN Euclidean L1,90% L0,10% 0,80% L4,10% L2,70% 1,10%KNN Manhattan 38,50% 53,40% 42,30% 62,70% 60,20% 36,40% KNN Manhattan L1,70% L0,20% L0,50% L3,50% L1,10% 1,00%

radioBroadcastClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 35,00% 52,50% 39,90% 53,00% 52,60% 34,50% Naive Bayes 0,30% 3,50% 2,50% 0,70% L2,00% L1,30%SMO PolyKernel 41,30% 64,90% 45,10% 72,40% 67,50% 38,00% SMO PolyKernel L1,30% L0,90% L4,00% L2,00% 0,00% L0,90%SMO RBFKernel 26,50% 62,80% 36,80% 53,30% 66,00% 37,60% SMO RBFKernel L2,10% 3,20% 0,40% 1,20% 1,60% 0,00%J48 29,70% 33,30% 31,40% 48,70% 48,50% 28,60% J48 L3,20% L3,10% L2,20% L0,90% 1,30% L0,90%RandomForest 37,40% 41,30% 36,70% 58,70% 55,70% 34,60% RandomForest 2,10% L2,60% L1,20% L2,90% L3,70% L3,30%KNN Euclidean 37,50% 50,80% 38,40% 63,40% 54,00% 28,70% KNN Euclidean L2,90% L0,80% L2,40% L2,70% 2,40% L2,10%KNN Manhattan 39,20% 53,30% 39,60% 64,00% 60,00% 34,90% KNN Manhattan L1,00% L0,30% L3,20% L2,20% L1,30% L0,50%

smartPhoneRecordingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 39,00% 53,10% 42,00% 51,70% 52,80% 36,50% Naive Bayes 4,30% 4,10% 4,60% L0,60% L1,80% 0,70%SMO PolyKernel 44,50% 66,70% 47,50% 68,60% 63,70% 37,00% SMO PolyKernel 1,90% 0,90% L1,60% L5,80% L3,80% L1,90%SMO RBFKernel 28,90% 61,10% 43,10% 52,50% 61,00% 39,40% SMO RBFKernel 0,30% 1,50% 6,70% 0,40% L3,40% 1,80%J48 32,30% 35,50% 34,00% 45,10% 45,40% 28,80% J48 L0,60% L0,90% 0,40% L4,50% L1,80% L0,70%RandomForest 39,90% 36,90% 40,80% 52,60% 49,10% 35,60% RandomForest 4,60% L7,00% 2,90% L9,00% L10,30% L2,30%KNN Euclidean 40,70% 52,30% 41,40% 60,60% 49,80% 32,20% KNN Euclidean 0,30% 0,70% 0,60% L5,50% L1,80% 1,40%KNN Manhattan 39,60% 52,20% 43,00% 62,80% 56,90% 35,20% KNN Manhattan L0,60% L1,40% 0,20% L3,40% L4,40% L0,20%

smartPhonePlaybackClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 32,20% 47,80% 34,50% 42,80% 46,20% 35,20% Naive Bayes L2,50% L1,20% L2,90% L9,50% L8,40% L0,60%SMO PolyKernel 38,20% 63,00% 42,10% 65,30% 56,20% 35,50% SMO PolyKernel L4,40% L2,80% L7,00% L9,10% L11,30% L3,40%SMO RBFKernel 25,70% 55,70% 35,60% 43,70% 58,30% 34,30% SMO RBFKernel L2,90% L3,90% L0,80% L8,40% L6,10% L3,30%J48 27,60% 33,30% 27,80% 44,40% 43,10% 28,00% J48 L5,30% L3,10% L5,80% L5,20% L4,10% L1,50%RandomForest 36,30% 39,80% 33,90% 53,30% 50,80% 32,70% RandomForest 1,00% L4,10% L4,00% L8,30% L8,60% L5,20%KNN Euclidean 38,90% 52,20% 36,20% 53,60% 42,30% 30,20% KNN Euclidean L1,50% 0,60% L4,60% L12,50% L9,30% L0,60%KNN Manhattan 36,70% 52,10% 37,30% 54,40% 52,40% 37,60% KNN Manhattan L3,50% L1,50% L5,50% L11,80% L8,90% 2,20%

unit_addNoiseClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 33,80% 45,80% 36,80% 51,40% 52,50% 33,90% Naive Bayes L0,90% L3,20% L0,60% L0,90% L2,10% L1,90%SMO PolyKernel 41,00% 64,00% 46,10% 72,80% 62,50% 38,00% SMO PolyKernel L1,60% L1,80% L3,00% L1,60% L5,00% L0,90%SMO RBFKernel 25,30% 61,00% 36,30% 52,60% 62,80% 36,30% SMO RBFKernel L3,30% 1,40% L0,10% 0,50% L1,60% L1,30%J48 29,60% 35,40% 28,00% 45,50% 47,40% 28,60% J48 L3,30% L1,00% L5,60% L4,10% 0,20% L0,90%RandomForest 35,80% 41,40% 38,00% 59,50% 53,20% 33,60% RandomForest 0,50% L2,50% 0,10% L2,10% L6,20% L4,30%KNN Euclidean 40,10% 51,70% 39,70% 64,50% 46,50% 31,80% KNN Euclidean L0,30% 0,10% L1,10% L1,60% L5,10% 1,00%KNN Manhattan 40,10% 52,60% 38,70% 64,00% 57,80% 35,20% KNN Manhattan L0,10% L1,00% L4,10% L2,20% L3,50% L0,20%

unit_addSoundClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 34,80% 48,40% 38,40% 52,20% 54,50% 35,80% Naive Bayes 0,10% L0,60% 1,00% L0,10% L0,10% 0,00%SMO PolyKernel 41,70% 65,50% 46,00% 72,70% 65,00% 35,60% SMO PolyKernel L0,90% L0,30% L3,10% L1,70% L2,50% L3,30%SMO RBFKernel 29,40% 61,30% 37,00% 53,00% 63,50% 36,00% SMO RBFKernel 0,80% 1,70% 0,60% 0,90% L0,90% L1,60%J48 27,60% 36,10% 29,20% 45,70% 45,10% 28,60% J48 L5,30% L0,30% L4,40% L3,90% L2,10% L0,90%RandomForest 36,00% 40,60% 36,50% 56,20% 51,50% 32,70% RandomForest 0,70% L3,30% L1,40% L5,40% L7,90% L5,20%KNN Euclidean 37,80% 51,20% 41,80% 64,70% 52,90% 31,80% KNN Euclidean L2,60% L0,40% 1,00% L1,40% 1,30% 1,00%KNN Manhattan 37,70% 53,90% 43,40% 63,30% 58,10% 35,50% KNN Manhattan L2,50% 0,30% 0,60% L2,90% L3,20% 0,10%

Mean&of&all&folds&with&10&cross&validation

(b) ISMIR data set: 6 genres

ISMIR%MEANoriginalClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 57,07% 63,38% 61,25% 61,04% 52,81% 60,01%SMO PolyKernel 63,65% 75,11% 70,51% 79,29% 80,31% 66,12%SMO RBFKernel 54,94% 68,79% 63,99% 63,86% 73,12% 64,40%J48 58,23% 59,26% 60,84% 68,11% 66,67% 55,01%RandomForest 65,98% 68,86% 68,79% 75,31% 74,69% 64,54%KNN Euclidean 60,35% 73,25% 63,04% 78,81% 76,68% 63,79%KNN Manhattan 61,66% 71,33% 64,81% 78,61% 77,78% 63,44%

Difference%Original%/%DegradationsliveRecording Positive%=%improvement%of%classification%with%degradation

Negative%%=%deterioration%with%degradationClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 50,48% 66,74% 58,57% 63,99% 53,22% 34,77% Naive Bayes L6,59% 3,36% L2,67% 2,95% 0,41% L25,24%SMO PolyKernel 61,11% 72,36% 69,42% 78,68% 77,37% 65,57% SMO PolyKernel L2,54% L2,74% L1,09% L0,62% L2,94% L0,55%SMO RBFKernel 48,01% 70,65% 61,52% 66,81% 73,19% 59,67% SMO RBFKernel L6,93% 1,85% L2,47% 2,95% 0,07% L4,73%J48 55,49% 62,34% 58,78% 68,38% 66,26% 53,70% J48 L2,74% 3,09% L2,05% 0,27% L0,41% L1,31%RandomForest 61,38% 68,52% 66,05% 73,46% 72,09% 61,86% RandomForest L4,60% L0,34% L2,74% L1,85% L2,60% L2,68%KNN Euclidean 59,94% 70,37% 58,98% 75,45% 73,87% 61,52% KNN Euclidean L0,41% L2,88% L4,05% L3,36% L2,81% L2,26%KNN Manhattan 58,85% 70,37% 60,43% 77,44% 75,86% 62,69% KNN Manhattan L2,81% L0,96% L4,39% L1,16% L1,91% L0,75%

strongMp3CompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 57,48% 63,58% 58,03% 61,94% 58,50% 60,36% Naive Bayes 0,41% 0,20% L3,22% 0,89% 5,69% 0,34%SMO PolyKernel 63,17% 74,70% 70,17% 78,95% 79,49% 66,26% SMO PolyKernel L0,48% L0,41% L0,34% L0,34% L0,83% 0,14%SMO RBFKernel 54,12% 69,21% 62,42% 64,06% 73,94% 64,75% SMO RBFKernel L0,82% 0,41% L1,58% 0,20% 0,82% 0,34%J48 56,52% 61,94% 60,22% 68,73% 68,17% 55,90% J48 L1,72% 2,68% L0,62% 0,62% 1,50% 0,89%RandomForest 63,10% 69,27% 66,74% 74,08% 73,04% 63,79% RandomForest L2,88% 0,41% L2,05% L1,23% L1,65% L0,75%KNN Euclidean 59,88% 72,70% 63,17% 78,05% 76,41% 63,58% KNN Euclidean L0,48% L0,55% 0,14% L0,76% L0,27% L0,21%KNN Manhattan 60,49% 71,47% 63,24% 77,23% 78,05% 64,75% KNN Manhattan L1,17% 0,14% L1,58% L1,37% 0,27% 1,30%

vinylRecordingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 54,60% 63,72% 61,52% 62,55% 52,26% 59,05% Naive Bayes L2,47% 0,34% 0,27% 1,51% L0,55% L0,96%SMO PolyKernel 60,97% 74,97% 68,73% 77,92% 79,49% 65,36% SMO PolyKernel L2,68% L0,14% L1,78% L1,37% L0,82% L0,75%SMO RBFKernel 49,93% 69,89% 62,14% 65,91% 74,35% 62,69% SMO RBFKernel L5,01% 1,10% L1,85% 2,06% 1,23% L1,71%J48 53,92% 61,11% 56,59% 68,04% 66,67% 55,69% J48 L4,32% 1,85% L4,25% L0,07% 0,00% 0,69%RandomForest 62,48% 68,38% 66,39% 73,32% 73,25% 64,27% RandomForest L3,50% L0,48% L2,40% L1,99% L1,44% L0,27%KNN Euclidean 57,89% 73,18% 59,53% 77,16% 75,18% 60,22% KNN Euclidean L2,47% L0,07% L3,50% L1,65% L1,50% L3,57%KNN Manhattan 57,41% 73,11% 62,21% 77,91% 76,34% 62,14% KNN Manhattan L4,25% 1,78% L2,60% L0,69% L1,44% L1,30%

radioBroadcastClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 53,36% 63,44% 57,27% 60,70% 50,48% 55,90% Naive Bayes L3,71% 0,07% L3,97% L0,35% L2,34% L4,12%SMO PolyKernel 58,64% 75,03% 69,00% 78,94% 78,33% 63,03% SMO PolyKernel L5,01% L0,07% L1,51% L0,35% L1,98% L3,09%SMO RBFKernel 48,35% 70,16% 62,76% 64,19% 72,63% 55,49% SMO RBFKernel L6,59% 1,37% L1,23% 0,34% L0,48% L8,91%J48 53,37% 60,70% 57,89% 65,84% 65,64% 52,68% J48 L4,87% 1,44% L2,95% L2,27% L1,03% L2,33%RandomForest 61,52% 67,43% 65,29% 71,95% 72,02% 60,22% RandomForest L4,46% L1,44% L3,50% L3,36% L2,67% L4,32%KNN Euclidean 57,68% 73,11% 62,07% 78,12% 76,00% 57,27% KNN Euclidean L2,68% L0,14% L0,96% L0,69% L0,68% L6,52%KNN Manhattan 59,26% 71,88% 63,31% 78,32% 76,61% 60,08% KNN Manhattan L2,40% 0,55% L1,51% L0,28% L1,16% L3,36%

smartPhoneRecordingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 54,12% 65,29% 59,33% 63,10% 38,27% 56,44% Naive Bayes L2,95% 1,91% L1,92% 2,05% L14,54% L3,57%SMO PolyKernel 59,95% 73,74% 69,89% 78,53% 79,36% 64,88% SMO PolyKernel L3,71% L1,37% L0,62% L0,76% L0,96% L1,23%SMO RBFKernel 49,45% 71,54% 65,57% 66,94% 72,70% 59,19% SMO RBFKernel L5,49% 2,74% 1,58% 3,09% L0,41% L5,21%J48 53,16% 58,92% 58,77% 65,71% 64,74% 53,56% J48 L5,08% L0,34% L2,06% L2,40% L1,93% L1,45%RandomForest 61,46% 67,97% 66,88% 70,85% 71,94% 62,76% RandomForest L4,53% L0,90% L1,92% L4,46% L2,74% L1,78%KNN Euclidean 59,95% 72,98% 65,64% 78,19% 76,95% 61,25% KNN Euclidean L0,41% L0,27% 2,60% L0,62% 0,28% L2,54%KNN Manhattan 60,63% 72,77% 63,85% 78,74% 77,57% 60,84% KNN Manhattan L1,03% 1,44% L0,96% 0,13% L0,21% L2,61%

smartPhonePlaybackClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 50,48% 53,98% 41,70% 51,71% 42,79% 46,22% Naive Bayes L6,59% L9,40% L19,55% L9,33% L10,02% L13,79%SMO PolyKernel 60,08% 72,70% 64,34% 76,48% 75,92% 63,85% SMO PolyKernel L3,57% L2,40% L6,17% L2,81% L4,39% L2,26%SMO RBFKernel 46,37% 69,41% 55,01% 60,49% 70,10% 55,49% SMO RBFKernel L8,57% 0,62% L8,99% L3,36% L3,02% L8,92%J48 53,02% 56,93% 55,29% 65,85% 63,10% 50,62% J48 L5,21% L2,33% L5,55% L2,26% L3,57% L4,39%RandomForest 61,59% 69,34% 63,30% 72,43% 70,44% 59,74% RandomForest L4,39% 0,48% L5,49% L2,88% L4,25% L4,80%KNN Euclidean 56,72% 72,63% 58,57% 75,93% 70,09% 55,15% KNN Euclidean L3,63% L0,62% L4,46% L2,88% L6,58% L8,64%KNN Manhattan 56,31% 72,43% 59,05% 76,62% 74,35% 56,58% KNN Manhattan L5,35% 1,09% L5,76% L1,99% L3,43% L6,87%

unit_addNoiseClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 55,08% 63,72% 60,77% 63,85% 56,51% 59,67% Naive Bayes L1,99% 0,34% L0,48% 2,81% 3,70% L0,34%SMO PolyKernel 63,44% 74,56% 69,55% 78,60% 78,46% 65,77% SMO PolyKernel L0,21% L0,55% L0,96% L0,69% L1,85% L0,34%SMO RBFKernel 55,21% 69,27% 64,41% 67,70% 74,63% 63,79% SMO RBFKernel 0,27% 0,48% 0,41% 3,84% 1,51% L0,62%J48 54,32% 62,55% 61,45% 69,48% 67,42% 58,16% J48 L3,91% 3,30% 0,62% 1,37% 0,75% 3,15%RandomForest 62,90% 69,00% 69,14% 74,42% 74,01% 64,41% RandomForest L3,09% 0,14% 0,35% L0,89% L0,68% L0,14%KNN Euclidean 59,06% 72,42% 65,30% 79,29% 77,44% 61,59% KNN Euclidean L1,30% L0,83% 2,26% 0,48% 0,76% L2,20%KNN Manhattan 58,85% 70,57% 65,23% 79,63% 78,74% 62,07% KNN Manhattan L2,81% L0,76% 0,41% 1,03% 0,96% L1,37%

unit_addSoundClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 54,53% 66,67% 62,41% 66,87% 34,77% 59,95% Naive Bayes L2,54% 3,29% 1,17% 5,83% L18,04% L0,07%SMO PolyKernel 60,84% 74,69% 69,96% 79,29% 81,00% 66,60% SMO PolyKernel L2,82% L0,42% L0,55% 0,00% 0,69% 0,48%SMO RBFKernel 49,93% 69,69% 65,02% 67,01% 71,74% 62,76% SMO RBFKernel L5,01% 0,89% 1,03% 3,16% L1,37% L1,65%J48 54,18% 58,50% 58,02% 68,11% 67,14% 54,53% J48 L4,05% L0,75% L2,82% 0,00% 0,47% L0,48%RandomForest 63,04% 67,29% 65,57% 75,04% 72,22% 64,68% RandomForest L2,95% L1,58% L3,22% L0,27% L2,47% 0,14%KNN Euclidean 58,57% 70,78% 62,00% 79,63% 79,77% 54,87% KNN Euclidean L1,78% L2,47% L1,03% 0,82% 3,09% L8,92%KNN Manhattan 58,64% 72,29% 63,72% 79,56% 79,49% 58,30% KNN Manhattan L3,03% 0,96% L1,10% 0,95% 1,72% L5,15%

Mean%of%all%folds%with%10%cross%validation

Table 4.1: Correct mean classification percentage of clean used data sets and thenumber of genres across which are distributed.

(a) Mean percentage of correctly classified instances (highlighted values mean an improve-ment with respect to clean audio classification)












(b) Mean percentage differences of correctly classified instances between clear and degradedaudio (positive values = improvement, negative values = deterioration; a darker highlightedvalue means a better improvement)












Table 4.2: Classification of GTZAN data set degraded by vinyl degradation

On the other hand and surprisingly, in some classification cases of degraded data sets, we

got results improvement, as in the example tables 4.3 from the classification of GTZAN

data set degraded by harmonic distortion. In this case, the RH has got an improvement

around 4% in the whole feature, achieving a maximum improvement of 6,10% using

the Naive Bayes classifier. In addition, TRH has got an improvement around 4% too

but with more variance between differen classifiers, achieving a maximum of 7,30% with

Naive Bayes too. Other features as RP, MVD and SSD also have got some improvement

in isolated cases whereas in TSSD feature set, all classifier results suffer a deterioration

around 3%.


There are more cases with several classification improvements, mainly in GTZAN data

set, as in Smartphone recording degradation for Naive Bayes classifier or in Clipping

degradation for Naive Bayes and SMO RBFKernel classifiers, among others. Although

this is not the main goal of this study, we think that it could be an interesting research

topic to review changes that these degradations perform on the audio files and then apply

them to the actual classifiers in order to improve their results.

(a) Mean percentage of correctly classified instances (highlighted values mean an improve-ment with respect to clean audio classification)

unit_applyAliasingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 37,40% 48,20% 41,70% 44,80% 45,80% 38,50% Naive Bayes 2,70% L0,80% 4,30% L7,50% L8,80% 2,70%SMO PolyKernel 43,80% 62,10% 47,80% 65,70% 57,90% 37,50% SMO PolyKernel 1,20% L3,70% L1,30% L8,70% L9,60% L1,40%SMO RBFKernel 29,60% 56,20% 41,80% 46,10% 57,00% 39,40% SMO RBFKernel 1,00% L3,40% 5,40% L6,00% L7,40% 1,80%J48 33,20% 32,80% 33,60% 44,30% 42,00% 30,70% J48 0,30% L3,60% 0,00% L5,30% L5,20% 1,20%RandomForest 36,70% 40,00% 37,40% 52,80% 49,10% 35,40% RandomForest 1,40% L3,90% L0,50% L8,80% L10,30% L2,50%KNN Euclidean 40,70% 47,80% 43,10% 54,20% 45,00% 33,30% KNN Euclidean 0,30% L3,80% 2,30% L11,90% L6,60% 2,50%KNN Manhattan 40,70% 48,70% 42,50% 54,80% 51,30% 37,70% KNN Manhattan 0,50% L4,90% L0,30% L11,40% L10,00% 2,30%

unit_applyClippingAlternativeClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 37,60% 52,10% 40,40% 54,10% 54,80% 38,30% Naive Bayes 2,90% 3,10% 3,00% 1,80% 0,20% 2,50%SMO PolyKernel 41,50% 66,30% 46,80% 71,10% 66,00% 37,50% SMO PolyKernel L1,10% 0,50% L2,30% L3,30% L1,50% L1,40%SMO RBFKernel 29,20% 62,40% 39,40% 55,90% 66,60% 39,60% SMO RBFKernel 0,60% 2,80% 3,00% 3,80% 2,20% 2,00%J48 32,10% 36,70% 29,30% 49,40% 48,30% 30,80% J48 L0,80% 0,30% L4,30% L0,20% 1,10% 1,30%RandomForest 36,50% 42,60% 38,90% 59,80% 56,00% 34,20% RandomForest 1,20% L1,30% 1,00% L1,80% L3,40% L3,70%KNN Euclidean 39,70% 50,10% 37,80% 62,20% 53,70% 30,50% KNN Euclidean L0,70% L1,50% L3,00% L3,90% 2,10% L0,30%KNN Manhattan 40,40% 52,10% 37,50% 64,10% 61,20% 35,30% KNN Manhattan 0,20% L1,50% L5,30% L2,10% L0,10% L0,10%

unit_applyDynamicRangeCompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 35,60% 52,60% 37,40% 53,20% 53,40% 36,70% Naive Bayes 0,90% 3,60% 0,00% 0,90% L1,20% 0,90%SMO PolyKernel 38,80% 66,20% 48,10% 72,30% 63,50% 34,90% SMO PolyKernel L3,80% 0,40% L1,00% L2,10% L4,00% L4,00%SMO RBFKernel 25,80% 63,20% 38,20% 55,00% 63,90% 36,00% SMO RBFKernel L2,80% 3,60% 1,80% 2,90% L0,50% L1,60%J48 29,80% 34,80% 32,40% 45,70% 41,20% 28,00% J48 L3,10% L1,60% L1,20% L3,90% L6,00% L1,50%RandomForest 34,50% 40,30% 38,10% 57,00% 53,80% 34,00% RandomForest L0,80% L3,60% 0,20% L4,60% L5,60% L3,90%KNN Euclidean 35,20% 52,70% 37,80% 61,30% 50,00% 27,70% KNN Euclidean L5,20% 1,10% L3,00% L4,80% L1,60% L3,10%KNN Manhattan 35,50% 53,50% 40,50% 61,40% 57,90% 34,30% KNN Manhattan L4,70% L0,10% L2,30% L4,80% L3,40% L1,10%

unit_applyHarmonicDistortionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 40,80% 52,50% 41,70% 52,40% 53,20% 43,10% Naive Bayes 6,10% 3,50% 4,30% 0,10% L1,40% 7,30%SMO PolyKernel 47,90% 65,70% 47,00% 71,90% 64,70% 39,30% SMO PolyKernel 5,30% L0,10% L2,10% L2,50% L2,80% 0,40%SMO RBFKernel 34,50% 62,70% 41,40% 55,20% 63,00% 41,60% SMO RBFKernel 5,90% 3,10% 5,00% 3,10% L1,40% 4,00%J48 34,40% 37,00% 33,40% 52,00% 46,80% 31,20% J48 1,50% 0,60% L0,20% 2,40% L0,40% 1,70%RandomForest 38,60% 42,00% 40,40% 58,40% 54,80% 38,10% RandomForest 3,30% L1,90% 2,50% L3,20% L4,60% 0,20%KNN Euclidean 43,50% 51,90% 40,50% 61,90% 51,00% 34,80% KNN Euclidean 3,10% 0,30% L0,30% L4,20% L0,60% 4,00%KNN Manhattan 41,50% 54,40% 42,00% 63,50% 58,50% 37,90% KNN Manhattan 1,30% 0,80% L0,80% L2,70% L2,80% 2,50%

unit_applyMp3CompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 34,30% 44,60% 39,20% 50,20% 51,70% 35,90% Naive Bayes L0,40% L4,40% 1,80% L2,10% L2,90% 0,10%SMO PolyKernel 41,60% 63,80% 47,10% 70,90% 64,20% 38,10% SMO PolyKernel L1,00% L2,00% L2,00% L3,50% L3,30% L0,80%SMO RBFKernel 28,80% 58,80% 37,10% 50,70% 62,30% 37,80% SMO RBFKernel 0,20% L0,80% 0,70% L1,40% L2,10% 0,20%J48 30,30% 34,20% 31,10% 47,90% 44,90% 28,30% J48 L2,60% L2,20% L2,50% L1,70% L2,30% L1,20%RandomForest 38,50% 40,80% 38,50% 55,50% 52,50% 36,00% RandomForest 3,20% L3,10% 0,60% L6,10% L6,90% L1,90%KNN Euclidean 39,60% 51,60% 42,10% 61,90% 49,00% 29,40% KNN Euclidean L0,80% 0,00% 1,30% L4,20% L2,60% L1,40%KNN Manhattan 39,90% 52,20% 44,00% 62,80% 58,20% 36,30% KNN Manhattan L0,30% L1,40% 1,20% L3,40% L3,10% 0,90%

unit_applySpeedupClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 34,20% 49,10% 36,70% 51,20% 50,60% 35,00% Naive Bayes L0,50% 0,10% L0,70% L1,10% L4,00% L0,80%SMO PolyKernel 41,40% 65,30% 47,30% 72,30% 64,80% 37,40% SMO PolyKernel L1,20% L0,50% L1,80% L2,10% L2,70% L1,50%SMO RBFKernel 28,80% 59,30% 36,20% 52,00% 64,70% 36,70% SMO RBFKernel 0,20% L0,30% L0,20% L0,10% 0,30% L0,90%J48 30,10% 36,90% 33,00% 47,60% 47,70% 31,70% J48 L2,80% 0,50% L0,60% L2,00% 0,50% 2,20%RandomForest 38,10% 41,70% 37,90% 57,80% 54,90% 35,20% RandomForest 2,80% L2,20% 0,00% L3,80% L4,50% L2,70%KNN Euclidean 39,90% 51,30% 41,20% 63,90% 51,50% 30,00% KNN Euclidean L0,50% L0,30% 0,40% L2,20% L0,10% L0,80%KNN Manhattan 37,60% 50,80% 43,10% 63,30% 60,90% 34,60% KNN Manhattan L2,60% L2,80% 0,30% L2,90% L0,40% L0,80%

unit_applyWowResamplingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 35,50% 48,50% 38,20% 51,70% 52,90% 36,50% Naive Bayes 0,80% L0,50% 0,80% L0,60% L1,70% 0,70%SMO PolyKernel 42,30% 67,10% 49,10% 74,50% 68,10% 37,50% SMO PolyKernel L0,30% 1,30% 0,00% 0,10% 0,60% L1,40%SMO RBFKernel 27,90% 60,80% 37,30% 53,70% 66,20% 36,40% SMO RBFKernel L0,70% 1,20% 0,90% 1,60% 1,80% L1,20%J48 28,70% 35,80% 31,90% 50,40% 45,10% 27,60% J48 L4,20% L0,60% L1,70% 0,80% L2,10% L1,90%RandomForest 38,10% 44,10% 37,20% 58,60% 56,20% 34,70% RandomForest 2,80% 0,20% L0,70% L3,00% L3,20% L3,20%KNN Euclidean 40,20% 52,90% 42,80% 65,00% 50,60% 31,00% KNN Euclidean L0,20% 1,30% 2,00% L1,10% L1,00% 0,20%KNN Manhattan 39,90% 54,50% 43,30% 64,70% 59,80% 36,80% KNN Manhattan L0,30% 0,90% 0,50% L1,50% L1,50% 1,40%

unit_applyDelayClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 33,90% 48,30% 36,60% 51,60% 53,60% 34,50% Naive Bayes L0,80% L0,70% L0,80% L0,70% L1,00% L1,30%SMO PolyKernel 42,40% 66,90% 47,30% 73,00% 70,20% 39,40% SMO PolyKernel L0,20% 1,10% L1,80% L1,40% 2,70% 0,50%SMO RBFKernel 28,10% 59,50% 37,20% 52,10% 66,40% 40,50% SMO RBFKernel L0,50% L0,10% 0,80% 0,00% 2,00% 2,90%J48 29,30% 35,90% 29,70% 48,10% 44,70% 30,20% J48 L3,60% L0,50% L3,90% L1,50% L2,50% 0,70%RandomForest 36,20% 42,30% 38,60% 60,60% 58,90% 36,30% RandomForest 0,90% L1,60% 0,70% L1,00% L0,50% L1,60%KNN Euclidean 40,40% 51,80% 42,20% 64,10% 54,20% 34,10% KNN Euclidean 0,00% 0,20% 1,40% L2,00% 2,60% 3,30%KNN Manhattan 38,30% 53,00% 42,70% 64,50% 63,30% 37,80% KNN Manhattan L1,90% L0,60% L0,10% L1,70% 2,00% 2,40%

unit_applyHighpassFilterClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 31,90% 36,60% 39,90% 39,80% 44,50% 35,80% Naive Bayes L2,80% L12,40% 2,50% L12,50% L10,10% 0,00%SMO PolyKernel 37,60% 61,50% 46,10% 63,40% 56,40% 35,00% SMO PolyKernel L5,00% L4,30% L3,00% L11,00% L11,10% L3,90%SMO RBFKernel 22,90% 53,80% 40,20% 46,10% 56,60% 33,50% SMO RBFKernel L5,70% L5,80% 3,80% L6,00% L7,80% L4,10%J48 27,80% 34,60% 29,20% 43,40% 41,90% 27,70% J48 L5,10% L1,80% L4,40% L6,20% L5,30% L1,80%RandomForest 32,30% 41,60% 36,90% 55,40% 52,70% 29,90% RandomForest L3,00% L2,30% L1,00% L6,20% L6,70% L8,00%KNN Euclidean 35,80% 49,20% 36,10% 54,40% 46,90% 32,30% KNN Euclidean L4,60% L2,40% L4,70% L11,70% L4,70% 1,50%KNN Manhattan 37,80% 49,80% 39,60% 58,50% 56,70% 35,90% KNN Manhattan L2,40% L3,80% L3,20% L7,70% L4,60% 0,50%

unit_applyLowpassFilterClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 38,30% 32,80% 36,90% 27,60% 25,10% 39,10% Naive Bayes 3,60% L16,20% L0,50% L24,70% L29,50% 3,30%SMO PolyKernel 44,10% 56,80% 37,90% 60,90% 55,90% 35,20% SMO PolyKernel 1,50% L9,00% L11,20% L13,50% L11,60% L3,70%SMO RBFKernel 31,30% 48,50% 36,10% 43,50% 48,90% 39,40% SMO RBFKernel 2,70% L11,10% L0,30% L8,60% L15,50% 1,80%J48 34,40% 31,10% 33,70% 43,70% 39,40% 33,70% J48 1,50% L5,30% 0,10% L5,90% L7,80% 4,20%RandomForest 36,30% 38,60% 37,20% 52,70% 46,70% 35,40% RandomForest 1,00% L5,30% L0,70% L8,90% L12,70% L2,50%KNN Euclidean 38,60% 47,80% 30,50% 49,20% 36,60% 29,00% KNN Euclidean L1,80% L3,80% L10,30% L16,90% L15,00% L1,80%KNN Manhattan 40,10% 48,50% 34,70% 51,00% 41,90% 34,40% KNN Manhattan L0,10% L5,10% L8,10% L15,20% L19,40% L1,00%

(b) Mean percentage differences of correctly classified instances between clear and degradedaudio (positive values = improvement, negative values = deterioration; a darker highlightedvalue means a better improvement)

unit_applyAliasingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 37,40% 48,20% 41,70% 44,80% 45,80% 38,50% Naive Bayes 2,70% L0,80% 4,30% L7,50% L8,80% 2,70%SMO PolyKernel 43,80% 62,10% 47,80% 65,70% 57,90% 37,50% SMO PolyKernel 1,20% L3,70% L1,30% L8,70% L9,60% L1,40%SMO RBFKernel 29,60% 56,20% 41,80% 46,10% 57,00% 39,40% SMO RBFKernel 1,00% L3,40% 5,40% L6,00% L7,40% 1,80%J48 33,20% 32,80% 33,60% 44,30% 42,00% 30,70% J48 0,30% L3,60% 0,00% L5,30% L5,20% 1,20%RandomForest 36,70% 40,00% 37,40% 52,80% 49,10% 35,40% RandomForest 1,40% L3,90% L0,50% L8,80% L10,30% L2,50%KNN Euclidean 40,70% 47,80% 43,10% 54,20% 45,00% 33,30% KNN Euclidean 0,30% L3,80% 2,30% L11,90% L6,60% 2,50%KNN Manhattan 40,70% 48,70% 42,50% 54,80% 51,30% 37,70% KNN Manhattan 0,50% L4,90% L0,30% L11,40% L10,00% 2,30%

unit_applyClippingAlternativeClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 37,60% 52,10% 40,40% 54,10% 54,80% 38,30% Naive Bayes 2,90% 3,10% 3,00% 1,80% 0,20% 2,50%SMO PolyKernel 41,50% 66,30% 46,80% 71,10% 66,00% 37,50% SMO PolyKernel L1,10% 0,50% L2,30% L3,30% L1,50% L1,40%SMO RBFKernel 29,20% 62,40% 39,40% 55,90% 66,60% 39,60% SMO RBFKernel 0,60% 2,80% 3,00% 3,80% 2,20% 2,00%J48 32,10% 36,70% 29,30% 49,40% 48,30% 30,80% J48 L0,80% 0,30% L4,30% L0,20% 1,10% 1,30%RandomForest 36,50% 42,60% 38,90% 59,80% 56,00% 34,20% RandomForest 1,20% L1,30% 1,00% L1,80% L3,40% L3,70%KNN Euclidean 39,70% 50,10% 37,80% 62,20% 53,70% 30,50% KNN Euclidean L0,70% L1,50% L3,00% L3,90% 2,10% L0,30%KNN Manhattan 40,40% 52,10% 37,50% 64,10% 61,20% 35,30% KNN Manhattan 0,20% L1,50% L5,30% L2,10% L0,10% L0,10%

unit_applyDynamicRangeCompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 35,60% 52,60% 37,40% 53,20% 53,40% 36,70% Naive Bayes 0,90% 3,60% 0,00% 0,90% L1,20% 0,90%SMO PolyKernel 38,80% 66,20% 48,10% 72,30% 63,50% 34,90% SMO PolyKernel L3,80% 0,40% L1,00% L2,10% L4,00% L4,00%SMO RBFKernel 25,80% 63,20% 38,20% 55,00% 63,90% 36,00% SMO RBFKernel L2,80% 3,60% 1,80% 2,90% L0,50% L1,60%J48 29,80% 34,80% 32,40% 45,70% 41,20% 28,00% J48 L3,10% L1,60% L1,20% L3,90% L6,00% L1,50%RandomForest 34,50% 40,30% 38,10% 57,00% 53,80% 34,00% RandomForest L0,80% L3,60% 0,20% L4,60% L5,60% L3,90%KNN Euclidean 35,20% 52,70% 37,80% 61,30% 50,00% 27,70% KNN Euclidean L5,20% 1,10% L3,00% L4,80% L1,60% L3,10%KNN Manhattan 35,50% 53,50% 40,50% 61,40% 57,90% 34,30% KNN Manhattan L4,70% L0,10% L2,30% L4,80% L3,40% L1,10%

unit_applyHarmonicDistortionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 40,80% 52,50% 41,70% 52,40% 53,20% 43,10% Naive Bayes 6,10% 3,50% 4,30% 0,10% L1,40% 7,30%SMO PolyKernel 47,90% 65,70% 47,00% 71,90% 64,70% 39,30% SMO PolyKernel 5,30% L0,10% L2,10% L2,50% L2,80% 0,40%SMO RBFKernel 34,50% 62,70% 41,40% 55,20% 63,00% 41,60% SMO RBFKernel 5,90% 3,10% 5,00% 3,10% L1,40% 4,00%J48 34,40% 37,00% 33,40% 52,00% 46,80% 31,20% J48 1,50% 0,60% L0,20% 2,40% L0,40% 1,70%RandomForest 38,60% 42,00% 40,40% 58,40% 54,80% 38,10% RandomForest 3,30% L1,90% 2,50% L3,20% L4,60% 0,20%KNN Euclidean 43,50% 51,90% 40,50% 61,90% 51,00% 34,80% KNN Euclidean 3,10% 0,30% L0,30% L4,20% L0,60% 4,00%KNN Manhattan 41,50% 54,40% 42,00% 63,50% 58,50% 37,90% KNN Manhattan 1,30% 0,80% L0,80% L2,70% L2,80% 2,50%

unit_applyMp3CompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 34,30% 44,60% 39,20% 50,20% 51,70% 35,90% Naive Bayes L0,40% L4,40% 1,80% L2,10% L2,90% 0,10%SMO PolyKernel 41,60% 63,80% 47,10% 70,90% 64,20% 38,10% SMO PolyKernel L1,00% L2,00% L2,00% L3,50% L3,30% L0,80%SMO RBFKernel 28,80% 58,80% 37,10% 50,70% 62,30% 37,80% SMO RBFKernel 0,20% L0,80% 0,70% L1,40% L2,10% 0,20%J48 30,30% 34,20% 31,10% 47,90% 44,90% 28,30% J48 L2,60% L2,20% L2,50% L1,70% L2,30% L1,20%RandomForest 38,50% 40,80% 38,50% 55,50% 52,50% 36,00% RandomForest 3,20% L3,10% 0,60% L6,10% L6,90% L1,90%KNN Euclidean 39,60% 51,60% 42,10% 61,90% 49,00% 29,40% KNN Euclidean L0,80% 0,00% 1,30% L4,20% L2,60% L1,40%KNN Manhattan 39,90% 52,20% 44,00% 62,80% 58,20% 36,30% KNN Manhattan L0,30% L1,40% 1,20% L3,40% L3,10% 0,90%

unit_applySpeedupClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 34,20% 49,10% 36,70% 51,20% 50,60% 35,00% Naive Bayes L0,50% 0,10% L0,70% L1,10% L4,00% L0,80%SMO PolyKernel 41,40% 65,30% 47,30% 72,30% 64,80% 37,40% SMO PolyKernel L1,20% L0,50% L1,80% L2,10% L2,70% L1,50%SMO RBFKernel 28,80% 59,30% 36,20% 52,00% 64,70% 36,70% SMO RBFKernel 0,20% L0,30% L0,20% L0,10% 0,30% L0,90%J48 30,10% 36,90% 33,00% 47,60% 47,70% 31,70% J48 L2,80% 0,50% L0,60% L2,00% 0,50% 2,20%RandomForest 38,10% 41,70% 37,90% 57,80% 54,90% 35,20% RandomForest 2,80% L2,20% 0,00% L3,80% L4,50% L2,70%KNN Euclidean 39,90% 51,30% 41,20% 63,90% 51,50% 30,00% KNN Euclidean L0,50% L0,30% 0,40% L2,20% L0,10% L0,80%KNN Manhattan 37,60% 50,80% 43,10% 63,30% 60,90% 34,60% KNN Manhattan L2,60% L2,80% 0,30% L2,90% L0,40% L0,80%

unit_applyWowResamplingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 35,50% 48,50% 38,20% 51,70% 52,90% 36,50% Naive Bayes 0,80% L0,50% 0,80% L0,60% L1,70% 0,70%SMO PolyKernel 42,30% 67,10% 49,10% 74,50% 68,10% 37,50% SMO PolyKernel L0,30% 1,30% 0,00% 0,10% 0,60% L1,40%SMO RBFKernel 27,90% 60,80% 37,30% 53,70% 66,20% 36,40% SMO RBFKernel L0,70% 1,20% 0,90% 1,60% 1,80% L1,20%J48 28,70% 35,80% 31,90% 50,40% 45,10% 27,60% J48 L4,20% L0,60% L1,70% 0,80% L2,10% L1,90%RandomForest 38,10% 44,10% 37,20% 58,60% 56,20% 34,70% RandomForest 2,80% 0,20% L0,70% L3,00% L3,20% L3,20%KNN Euclidean 40,20% 52,90% 42,80% 65,00% 50,60% 31,00% KNN Euclidean L0,20% 1,30% 2,00% L1,10% L1,00% 0,20%KNN Manhattan 39,90% 54,50% 43,30% 64,70% 59,80% 36,80% KNN Manhattan L0,30% 0,90% 0,50% L1,50% L1,50% 1,40%

unit_applyDelayClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 33,90% 48,30% 36,60% 51,60% 53,60% 34,50% Naive Bayes L0,80% L0,70% L0,80% L0,70% L1,00% L1,30%SMO PolyKernel 42,40% 66,90% 47,30% 73,00% 70,20% 39,40% SMO PolyKernel L0,20% 1,10% L1,80% L1,40% 2,70% 0,50%SMO RBFKernel 28,10% 59,50% 37,20% 52,10% 66,40% 40,50% SMO RBFKernel L0,50% L0,10% 0,80% 0,00% 2,00% 2,90%J48 29,30% 35,90% 29,70% 48,10% 44,70% 30,20% J48 L3,60% L0,50% L3,90% L1,50% L2,50% 0,70%RandomForest 36,20% 42,30% 38,60% 60,60% 58,90% 36,30% RandomForest 0,90% L1,60% 0,70% L1,00% L0,50% L1,60%KNN Euclidean 40,40% 51,80% 42,20% 64,10% 54,20% 34,10% KNN Euclidean 0,00% 0,20% 1,40% L2,00% 2,60% 3,30%KNN Manhattan 38,30% 53,00% 42,70% 64,50% 63,30% 37,80% KNN Manhattan L1,90% L0,60% L0,10% L1,70% 2,00% 2,40%

unit_applyHighpassFilterClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 31,90% 36,60% 39,90% 39,80% 44,50% 35,80% Naive Bayes L2,80% L12,40% 2,50% L12,50% L10,10% 0,00%SMO PolyKernel 37,60% 61,50% 46,10% 63,40% 56,40% 35,00% SMO PolyKernel L5,00% L4,30% L3,00% L11,00% L11,10% L3,90%SMO RBFKernel 22,90% 53,80% 40,20% 46,10% 56,60% 33,50% SMO RBFKernel L5,70% L5,80% 3,80% L6,00% L7,80% L4,10%J48 27,80% 34,60% 29,20% 43,40% 41,90% 27,70% J48 L5,10% L1,80% L4,40% L6,20% L5,30% L1,80%RandomForest 32,30% 41,60% 36,90% 55,40% 52,70% 29,90% RandomForest L3,00% L2,30% L1,00% L6,20% L6,70% L8,00%KNN Euclidean 35,80% 49,20% 36,10% 54,40% 46,90% 32,30% KNN Euclidean L4,60% L2,40% L4,70% L11,70% L4,70% 1,50%KNN Manhattan 37,80% 49,80% 39,60% 58,50% 56,70% 35,90% KNN Manhattan L2,40% L3,80% L3,20% L7,70% L4,60% 0,50%

unit_applyLowpassFilterClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 38,30% 32,80% 36,90% 27,60% 25,10% 39,10% Naive Bayes 3,60% L16,20% L0,50% L24,70% L29,50% 3,30%SMO PolyKernel 44,10% 56,80% 37,90% 60,90% 55,90% 35,20% SMO PolyKernel 1,50% L9,00% L11,20% L13,50% L11,60% L3,70%SMO RBFKernel 31,30% 48,50% 36,10% 43,50% 48,90% 39,40% SMO RBFKernel 2,70% L11,10% L0,30% L8,60% L15,50% 1,80%J48 34,40% 31,10% 33,70% 43,70% 39,40% 33,70% J48 1,50% L5,30% 0,10% L5,90% L7,80% 4,20%RandomForest 36,30% 38,60% 37,20% 52,70% 46,70% 35,40% RandomForest 1,00% L5,30% L0,70% L8,90% L12,70% L2,50%KNN Euclidean 38,60% 47,80% 30,50% 49,20% 36,60% 29,00% KNN Euclidean L1,80% L3,80% L10,30% L16,90% L15,00% L1,80%KNN Manhattan 40,10% 48,50% 34,70% 51,00% 41,90% 34,40% KNN Manhattan L0,10% L5,10% L8,10% L15,20% L19,40% L1,00%

Table 4.3: Classification of GTZAN data set degraded by harmonic distortion

Referring to classification deteriorations due to degraded data sets, the two most affected

degradations are Low pass filtering degradation (Tables 4.4) and High pass filtering

degradation. In Low pass filtering case, deteriorations are around 7% with a maximum

deterioration of 27,57% in TSSD feature with Naive Bayes classifier. In both filtering

degradations, this significant degradation is due to the removing of some frequency bands

caused by the filtering, i.e. feature information belonging to removed frequency bands is

lost as well, hindering the classification performance.

Another important classification deterioration is caused by the Smartphone playback,

where the deterioration is around 7% in GTZAN data set and 9% in ISMIR data set. In

this case, the deterioration is due to the degradation characteristics: high pass filtering,

cut-off and bad SNR. This means a high degradation of the audio signal, changing the

extracted features and hindering a correct classification as well.


(a) Mean percentage of correctly classified instances (highlighted values means an improve-ment respect clean audio classification)

unit_applyAliasingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 53,85% 59,26% 57,48% 59,95% 31,62% 59,12% Naive Bayes L3,22% L4,12% L3,77% L1,10% L21,20% L0,89%SMO PolyKernel 61,46% 72,22% 68,52% 74,69% 75,86% 62,83% SMO PolyKernel L2,20% L2,88% L1,99% L4,60% L4,46% L3,29%SMO RBFKernel 50,96% 67,35% 63,03% 63,17% 67,56% 62,28% SMO RBFKernel L3,98% L1,44% L0,96% L0,69% L5,56% L2,13%J48 54,61% 59,67% 57,06% 66,73% 65,56% 54,94% J48 L3,63% 0,42% L3,78% L1,38% L1,11% L0,07%RandomForest 63,51% 67,70% 67,01% 73,12% 71,06% 62,69% RandomForest L2,47% L1,17% L1,78% L2,19% L3,63% L1,85%KNN Euclidean 60,15% 69,21% 60,02% 73,46% 70,91% 59,87% KNN Euclidean L0,20% L4,05% L3,02% L5,35% L5,76% L3,91%KNN Manhattan 59,54% 69,48% 61,93% 75,10% 74,28% 61,05% KNN Manhattan L2,12% L1,85% L2,88% L3,50% L3,50% L2,40%

unit_applyClippingAlternativeClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 53,77% 63,58% 58,30% 63,44% 52,74% 55,63% Naive Bayes L3,29% 0,20% L2,95% 2,40% L0,07% L4,39%SMO PolyKernel 59,12% 73,53% 67,63% 77,92% 78,33% 62,96% SMO PolyKernel L4,53% L1,58% L2,88% L1,38% L1,99% L3,15%SMO RBFKernel 49,66% 70,44% 63,44% 66,53% 74,49% 56,31% SMO RBFKernel L5,28% 1,65% L0,55% 2,68% 1,37% L8,09%J48 53,29% 58,78% 57,68% 67,42% 67,29% 53,84% J48 L4,94% L0,47% L3,16% L0,69% 0,62% L1,17%RandomForest 61,04% 69,35% 67,01% 74,35% 71,74% 62,41% RandomForest L4,94% 0,48% L1,78% L0,96% L2,95% L2,13%KNN Euclidean 60,70% 72,50% 64,47% 77,23% 75,58% 58,78% KNN Euclidean 0,35% L0,75% 1,44% L1,58% L1,10% L5,01%KNN Manhattan 58,71% 71,95% 64,34% 78,05% 76,96% 61,59% KNN Manhattan L2,95% 0,61% L0,48% L0,55% L0,82% L1,85%

unit_applyDynamicRangeCompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 47,12% 64,47% 56,58% 61,38% 49,58% 31,48% Naive Bayes L9,95% 1,09% L4,67% 0,34% L3,23% L28,53%SMO PolyKernel 58,57% 73,53% 67,42% 77,98% 76,95% 62,90% SMO PolyKernel L5,08% L1,58% L3,09% L1,31% L3,36% L3,22%SMO RBFKernel 47,46% 71,05% 62,48% 64,74% 72,15% 55,42% SMO RBFKernel L7,48% 2,26% L1,51% 0,89% L0,96% L8,98%J48 50,55% 57,61% 56,93% 63,86% 62,42% 51,30% J48 L7,68% L1,64% L3,91% L4,26% L4,25% L3,71%RandomForest 60,70% 64,68% 65,16% 70,78% 70,71% 61,11% RandomForest L5,28% L4,19% L3,63% L4,53% L3,98% L3,43%KNN Euclidean 57,48% 70,37% 62,69% 76,61% 74,28% 59,05% KNN Euclidean L2,88% L2,88% L0,34% L2,20% L2,40% L4,73%KNN Manhattan 57,82% 70,58% 61,32% 76,95% 75,72% 60,01% KNN Manhattan L3,84% L0,75% L3,49% L1,65% L2,06% L3,43%

unit_applyHarmonicDistortionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 54,46% 65,84% 60,63% 62,49% 53,91% 57,06% Naive Bayes L2,61% 2,46% L0,62% 1,44% 1,10% L2,95%SMO PolyKernel 59,67% 73,46% 69,21% 78,33% 78,12% 63,17% SMO PolyKernel L3,98% L1,65% L1,30% L0,96% L2,19% L2,95%SMO RBFKernel 51,30% 70,78% 64,48% 66,32% 74,69% 62,28% SMO RBFKernel L3,63% 1,99% 0,48% 2,47% 1,58% L2,12%J48 54,32% 59,47% 59,12% 68,18% 66,46% 55,69% J48 L3,91% 0,21% L1,72% 0,07% L0,21% 0,68%RandomForest 63,45% 67,22% 66,94% 74,07% 74,01% 63,71% RandomForest L2,54% L1,65% L1,85% L1,24% L0,68% L0,83%KNN Euclidean 60,29% 71,54% 62,62% 77,78% 76,40% 60,15% KNN Euclidean L0,06% L1,71% L0,41% L1,03% L0,27% L3,64%KNN Manhattan 57,54% 72,02% 63,85% 77,64% 77,64% 61,38% KNN Manhattan L4,12% 0,68% L0,96% L0,97% L0,14% L2,06%

unit_applyMp3CompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 56,11% 61,94% 58,44% 60,77% 58,84% 59,46% Naive Bayes L0,96% L1,44% L2,81% L0,28% 6,03% L0,55%SMO PolyKernel 63,31% 74,90% 69,21% 76,96% 79,22% 65,29% SMO PolyKernel L0,34% L0,21% L1,30% L2,34% L1,10% L0,82%SMO RBFKernel 54,53% 69,34% 63,44% 62,55% 72,77% 63,99% SMO RBFKernel L0,41% 0,55% L0,55% L1,31% L0,34% L0,41%J48 54,74% 58,57% 57,13% 68,80% 67,56% 57,13% J48 L3,49% L0,69% L3,70% 0,69% 0,89% 2,13%RandomForest 64,75% 68,65% 67,63% 73,12% 71,13% 64,47% RandomForest L1,23% L0,21% L1,17% L2,19% L3,56% L0,07%KNN Euclidean 62,00% 71,40% 62,56% 77,65% 74,90% 61,18% KNN Euclidean 1,65% L1,85% L0,48% L1,16% L1,78% L2,61%KNN Manhattan 61,45% 70,58% 63,86% 76,76% 77,44% 62,35% KNN Manhattan L0,21% L0,75% L0,96% L1,85% L0,34% L1,10%

unit_applySpeedupClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 56,58% 64,41% 61,66% 61,93% 36,62% 60,70% Naive Bayes L0,48% 1,03% 0,41% 0,89% L16,19% 0,68%SMO PolyKernel 63,85% 72,98% 70,37% 79,57% 80,45% 66,46% SMO PolyKernel 0,20% L2,13% L0,14% 0,27% 0,14% 0,34%SMO RBFKernel 55,01% 68,93% 62,69% 65,43% 72,09% 63,58% SMO RBFKernel 0,07% 0,13% L1,30% 1,57% L1,03% L0,82%J48 55,42% 62,14% 59,26% 67,97% 69,41% 57,06% J48 L2,82% 2,89% L1,58% L0,14% 2,74% 2,06%RandomForest 64,41% 69,00% 67,22% 75,65% 73,67% 64,75% RandomForest L1,58% 0,13% L1,57% 0,34% L1,02% 0,21%KNN Euclidean 61,87% 70,92% 63,17% 79,97% 78,33% 61,80% KNN Euclidean 1,51% L2,33% 0,13% 1,16% 1,65% L1,99%KNN Manhattan 60,56% 71,40% 64,95% 78,60% 79,02% 63,78% KNN Manhattan L1,10% 0,07% 0,14% 0,00% 1,24% 0,34%

unit_applyWowResamplingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 56,72% 64,06% 60,70% 61,53% 31,55% 59,74% Naive Bayes L0,34% 0,68% L0,55% 0,48% L21,27% L0,27%SMO PolyKernel 63,38% 75,38% 69,76% 79,22% 81,62% 67,49% SMO PolyKernel L0,28% 0,27% L0,75% L0,07% 1,31% 1,37%SMO RBFKernel 56,24% 69,14% 63,65% 65,02% 71,68% 64,68% SMO RBFKernel 1,31% 0,34% L0,34% 1,16% L1,44% 0,28%J48 55,00% 61,05% 59,94% 68,04% 68,58% 56,51% J48 L3,23% 1,79% L0,90% L0,07% 1,91% 1,50%RandomForest 65,02% 69,76% 69,48% 75,58% 73,19% 66,53% RandomForest L0,96% 0,89% 0,69% 0,28% L1,50% 1,99%KNN Euclidean 60,83% 72,29% 64,41% 79,36% 76,40% 63,31% KNN Euclidean 0,48% L0,96% 1,37% 0,55% L0,27% L0,48%KNN Manhattan 60,01% 71,81% 65,70% 78,54% 77,64% 63,38% KNN Manhattan L1,65% 0,48% 0,89% L0,07% L0,14% L0,07%

unit_applyDelayClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 56,79% 63,86% 61,45% 61,53% 47,94% 60,49% Naive Bayes L0,27% 0,48% 0,21% 0,48% L4,87% 0,48%SMO PolyKernel 63,86% 74,21% 68,87% 79,29% 79,84% 67,97% SMO PolyKernel 0,20% L0,89% L1,64% 0,00% L0,48% 1,85%SMO RBFKernel 55,42% 68,66% 63,79% 63,65% 73,05% 64,33% SMO RBFKernel 0,48% L0,14% L0,20% L0,21% L0,07% L0,07%J48 57,00% 61,39% 60,29% 69,14% 70,85% 55,28% J48 L1,24% 2,13% L0,55% 1,03% 4,18% 0,27%RandomForest 64,96% 69,41% 68,32% 75,86% 74,56% 64,82% RandomForest L1,02% 0,55% L0,47% 0,55% L0,13% 0,28%KNN Euclidean 61,11% 73,46% 62,76% 78,81% 76,54% 63,38% KNN Euclidean 0,75% 0,21% L0,28% 0,00% L0,14% L0,41%KNN Manhattan 62,07% 70,92% 64,41% 78,19% 77,85% 65,02% KNN Manhattan 0,41% L0,41% L0,41% L0,41% 0,07% 1,57%

unit_applyHighpassFilterClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 48,90% 56,45% 55,56% 49,72% 32,44% 52,60% Naive Bayes L8,16% L6,93% L5,69% L11,32% L20,37% L7,41%SMO PolyKernel 58,37% 71,33% 66,19% 75,52% 76,33% 64,06% SMO PolyKernel L5,28% L3,78% L4,32% L3,78% L3,98% L2,06%SMO RBFKernel 46,43% 66,53% 59,74% 61,18% 68,31% 53,50% SMO RBFKernel L8,50% L2,26% L4,25% L2,68% L4,80% L10,91%J48 51,09% 62,34% 55,35% 67,49% 64,89% 51,58% J48 L7,14% 3,09% L5,49% L0,62% L1,78% L3,43%RandomForest 61,93% 68,66% 66,05% 73,18% 71,88% 61,80% RandomForest L4,05% L0,21% L2,74% L2,13% L2,81% L2,74%KNN Euclidean 58,37% 71,33% 59,05% 74,62% 70,72% 58,03% KNN Euclidean L1,99% L1,92% L3,98% L4,19% L5,96% L5,76%KNN Manhattan 56,92% 71,33% 62,21% 75,72% 74,49% 58,23% KNN Manhattan L4,74% 0,00% L2,61% L2,88% L3,29% L5,22%

unit_applyLowpassFilterClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 50,28% 51,24% 57,20% 52,74% 25,24% 53,98% Naive Bayes L6,79% L12,14% L4,05% L8,30% L27,57% L6,03%SMO PolyKernel 58,30% 69,21% 62,21% 71,12% 71,81% 63,58% SMO PolyKernel L5,35% L5,90% L8,30% L8,17% L8,51% L2,53%SMO RBFKernel 49,25% 63,37% 60,36% 60,29% 65,22% 57,47% SMO RBFKernel L5,69% L5,42% L3,64% L3,57% L7,89% L6,93%J48 51,99% 58,58% 55,15% 63,85% 64,07% 50,68% J48 L6,24% L0,68% L5,69% L4,26% L2,60% L4,33%RandomForest 60,42% 65,23% 61,60% 69,47% 68,25% 59,74% RandomForest L5,56% L3,64% L7,19% L5,83% L6,44% L4,80%KNN Euclidean 56,31% 65,02% 53,63% 68,59% 66,12% 56,10% KNN Euclidean L4,05% L8,23% L9,40% L10,22% L10,56% L7,69%KNN Manhattan 56,59% 65,84% 55,00% 70,16% 68,79% 55,97% KNN Manhattan L5,07% L5,49% L9,81% L8,44% L8,98% L7,48%

(b) Mean percentage differences of correctly classified instances between clear and degradedaudio (positive values = improvement, negative values = deterioration)

unit_applyAliasingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 53,85% 59,26% 57,48% 59,95% 31,62% 59,12% Naive Bayes L3,22% L4,12% L3,77% L1,10% L21,20% L0,89%SMO PolyKernel 61,46% 72,22% 68,52% 74,69% 75,86% 62,83% SMO PolyKernel L2,20% L2,88% L1,99% L4,60% L4,46% L3,29%SMO RBFKernel 50,96% 67,35% 63,03% 63,17% 67,56% 62,28% SMO RBFKernel L3,98% L1,44% L0,96% L0,69% L5,56% L2,13%J48 54,61% 59,67% 57,06% 66,73% 65,56% 54,94% J48 L3,63% 0,42% L3,78% L1,38% L1,11% L0,07%RandomForest 63,51% 67,70% 67,01% 73,12% 71,06% 62,69% RandomForest L2,47% L1,17% L1,78% L2,19% L3,63% L1,85%KNN Euclidean 60,15% 69,21% 60,02% 73,46% 70,91% 59,87% KNN Euclidean L0,20% L4,05% L3,02% L5,35% L5,76% L3,91%KNN Manhattan 59,54% 69,48% 61,93% 75,10% 74,28% 61,05% KNN Manhattan L2,12% L1,85% L2,88% L3,50% L3,50% L2,40%

unit_applyClippingAlternativeClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 53,77% 63,58% 58,30% 63,44% 52,74% 55,63% Naive Bayes L3,29% 0,20% L2,95% 2,40% L0,07% L4,39%SMO PolyKernel 59,12% 73,53% 67,63% 77,92% 78,33% 62,96% SMO PolyKernel L4,53% L1,58% L2,88% L1,38% L1,99% L3,15%SMO RBFKernel 49,66% 70,44% 63,44% 66,53% 74,49% 56,31% SMO RBFKernel L5,28% 1,65% L0,55% 2,68% 1,37% L8,09%J48 53,29% 58,78% 57,68% 67,42% 67,29% 53,84% J48 L4,94% L0,47% L3,16% L0,69% 0,62% L1,17%RandomForest 61,04% 69,35% 67,01% 74,35% 71,74% 62,41% RandomForest L4,94% 0,48% L1,78% L0,96% L2,95% L2,13%KNN Euclidean 60,70% 72,50% 64,47% 77,23% 75,58% 58,78% KNN Euclidean 0,35% L0,75% 1,44% L1,58% L1,10% L5,01%KNN Manhattan 58,71% 71,95% 64,34% 78,05% 76,96% 61,59% KNN Manhattan L2,95% 0,61% L0,48% L0,55% L0,82% L1,85%

unit_applyDynamicRangeCompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 47,12% 64,47% 56,58% 61,38% 49,58% 31,48% Naive Bayes L9,95% 1,09% L4,67% 0,34% L3,23% L28,53%SMO PolyKernel 58,57% 73,53% 67,42% 77,98% 76,95% 62,90% SMO PolyKernel L5,08% L1,58% L3,09% L1,31% L3,36% L3,22%SMO RBFKernel 47,46% 71,05% 62,48% 64,74% 72,15% 55,42% SMO RBFKernel L7,48% 2,26% L1,51% 0,89% L0,96% L8,98%J48 50,55% 57,61% 56,93% 63,86% 62,42% 51,30% J48 L7,68% L1,64% L3,91% L4,26% L4,25% L3,71%RandomForest 60,70% 64,68% 65,16% 70,78% 70,71% 61,11% RandomForest L5,28% L4,19% L3,63% L4,53% L3,98% L3,43%KNN Euclidean 57,48% 70,37% 62,69% 76,61% 74,28% 59,05% KNN Euclidean L2,88% L2,88% L0,34% L2,20% L2,40% L4,73%KNN Manhattan 57,82% 70,58% 61,32% 76,95% 75,72% 60,01% KNN Manhattan L3,84% L0,75% L3,49% L1,65% L2,06% L3,43%

unit_applyHarmonicDistortionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 54,46% 65,84% 60,63% 62,49% 53,91% 57,06% Naive Bayes L2,61% 2,46% L0,62% 1,44% 1,10% L2,95%SMO PolyKernel 59,67% 73,46% 69,21% 78,33% 78,12% 63,17% SMO PolyKernel L3,98% L1,65% L1,30% L0,96% L2,19% L2,95%SMO RBFKernel 51,30% 70,78% 64,48% 66,32% 74,69% 62,28% SMO RBFKernel L3,63% 1,99% 0,48% 2,47% 1,58% L2,12%J48 54,32% 59,47% 59,12% 68,18% 66,46% 55,69% J48 L3,91% 0,21% L1,72% 0,07% L0,21% 0,68%RandomForest 63,45% 67,22% 66,94% 74,07% 74,01% 63,71% RandomForest L2,54% L1,65% L1,85% L1,24% L0,68% L0,83%KNN Euclidean 60,29% 71,54% 62,62% 77,78% 76,40% 60,15% KNN Euclidean L0,06% L1,71% L0,41% L1,03% L0,27% L3,64%KNN Manhattan 57,54% 72,02% 63,85% 77,64% 77,64% 61,38% KNN Manhattan L4,12% 0,68% L0,96% L0,97% L0,14% L2,06%

unit_applyMp3CompressionClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 56,11% 61,94% 58,44% 60,77% 58,84% 59,46% Naive Bayes L0,96% L1,44% L2,81% L0,28% 6,03% L0,55%SMO PolyKernel 63,31% 74,90% 69,21% 76,96% 79,22% 65,29% SMO PolyKernel L0,34% L0,21% L1,30% L2,34% L1,10% L0,82%SMO RBFKernel 54,53% 69,34% 63,44% 62,55% 72,77% 63,99% SMO RBFKernel L0,41% 0,55% L0,55% L1,31% L0,34% L0,41%J48 54,74% 58,57% 57,13% 68,80% 67,56% 57,13% J48 L3,49% L0,69% L3,70% 0,69% 0,89% 2,13%RandomForest 64,75% 68,65% 67,63% 73,12% 71,13% 64,47% RandomForest L1,23% L0,21% L1,17% L2,19% L3,56% L0,07%KNN Euclidean 62,00% 71,40% 62,56% 77,65% 74,90% 61,18% KNN Euclidean 1,65% L1,85% L0,48% L1,16% L1,78% L2,61%KNN Manhattan 61,45% 70,58% 63,86% 76,76% 77,44% 62,35% KNN Manhattan L0,21% L0,75% L0,96% L1,85% L0,34% L1,10%

unit_applySpeedupClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 56,58% 64,41% 61,66% 61,93% 36,62% 60,70% Naive Bayes L0,48% 1,03% 0,41% 0,89% L16,19% 0,68%SMO PolyKernel 63,85% 72,98% 70,37% 79,57% 80,45% 66,46% SMO PolyKernel 0,20% L2,13% L0,14% 0,27% 0,14% 0,34%SMO RBFKernel 55,01% 68,93% 62,69% 65,43% 72,09% 63,58% SMO RBFKernel 0,07% 0,13% L1,30% 1,57% L1,03% L0,82%J48 55,42% 62,14% 59,26% 67,97% 69,41% 57,06% J48 L2,82% 2,89% L1,58% L0,14% 2,74% 2,06%RandomForest 64,41% 69,00% 67,22% 75,65% 73,67% 64,75% RandomForest L1,58% 0,13% L1,57% 0,34% L1,02% 0,21%KNN Euclidean 61,87% 70,92% 63,17% 79,97% 78,33% 61,80% KNN Euclidean 1,51% L2,33% 0,13% 1,16% 1,65% L1,99%KNN Manhattan 60,56% 71,40% 64,95% 78,60% 79,02% 63,78% KNN Manhattan L1,10% 0,07% 0,14% 0,00% 1,24% 0,34%

unit_applyWowResamplingClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 56,72% 64,06% 60,70% 61,53% 31,55% 59,74% Naive Bayes L0,34% 0,68% L0,55% 0,48% L21,27% L0,27%SMO PolyKernel 63,38% 75,38% 69,76% 79,22% 81,62% 67,49% SMO PolyKernel L0,28% 0,27% L0,75% L0,07% 1,31% 1,37%SMO RBFKernel 56,24% 69,14% 63,65% 65,02% 71,68% 64,68% SMO RBFKernel 1,31% 0,34% L0,34% 1,16% L1,44% 0,28%J48 55,00% 61,05% 59,94% 68,04% 68,58% 56,51% J48 L3,23% 1,79% L0,90% L0,07% 1,91% 1,50%RandomForest 65,02% 69,76% 69,48% 75,58% 73,19% 66,53% RandomForest L0,96% 0,89% 0,69% 0,28% L1,50% 1,99%KNN Euclidean 60,83% 72,29% 64,41% 79,36% 76,40% 63,31% KNN Euclidean 0,48% L0,96% 1,37% 0,55% L0,27% L0,48%KNN Manhattan 60,01% 71,81% 65,70% 78,54% 77,64% 63,38% KNN Manhattan L1,65% 0,48% 0,89% L0,07% L0,14% L0,07%

unit_applyDelayClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 56,79% 63,86% 61,45% 61,53% 47,94% 60,49% Naive Bayes L0,27% 0,48% 0,21% 0,48% L4,87% 0,48%SMO PolyKernel 63,86% 74,21% 68,87% 79,29% 79,84% 67,97% SMO PolyKernel 0,20% L0,89% L1,64% 0,00% L0,48% 1,85%SMO RBFKernel 55,42% 68,66% 63,79% 63,65% 73,05% 64,33% SMO RBFKernel 0,48% L0,14% L0,20% L0,21% L0,07% L0,07%J48 57,00% 61,39% 60,29% 69,14% 70,85% 55,28% J48 L1,24% 2,13% L0,55% 1,03% 4,18% 0,27%RandomForest 64,96% 69,41% 68,32% 75,86% 74,56% 64,82% RandomForest L1,02% 0,55% L0,47% 0,55% L0,13% 0,28%KNN Euclidean 61,11% 73,46% 62,76% 78,81% 76,54% 63,38% KNN Euclidean 0,75% 0,21% L0,28% 0,00% L0,14% L0,41%KNN Manhattan 62,07% 70,92% 64,41% 78,19% 77,85% 65,02% KNN Manhattan 0,41% L0,41% L0,41% L0,41% 0,07% 1,57%

unit_applyHighpassFilterClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 48,90% 56,45% 55,56% 49,72% 32,44% 52,60% Naive Bayes L8,16% L6,93% L5,69% L11,32% L20,37% L7,41%SMO PolyKernel 58,37% 71,33% 66,19% 75,52% 76,33% 64,06% SMO PolyKernel L5,28% L3,78% L4,32% L3,78% L3,98% L2,06%SMO RBFKernel 46,43% 66,53% 59,74% 61,18% 68,31% 53,50% SMO RBFKernel L8,50% L2,26% L4,25% L2,68% L4,80% L10,91%J48 51,09% 62,34% 55,35% 67,49% 64,89% 51,58% J48 L7,14% 3,09% L5,49% L0,62% L1,78% L3,43%RandomForest 61,93% 68,66% 66,05% 73,18% 71,88% 61,80% RandomForest L4,05% L0,21% L2,74% L2,13% L2,81% L2,74%KNN Euclidean 58,37% 71,33% 59,05% 74,62% 70,72% 58,03% KNN Euclidean L1,99% L1,92% L3,98% L4,19% L5,96% L5,76%KNN Manhattan 56,92% 71,33% 62,21% 75,72% 74,49% 58,23% KNN Manhattan L4,74% 0,00% L2,61% L2,88% L3,29% L5,22%

unit_applyLowpassFilterClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 50,28% 51,24% 57,20% 52,74% 25,24% 53,98% Naive Bayes L6,79% L12,14% L4,05% L8,30% L27,57% L6,03%SMO PolyKernel 58,30% 69,21% 62,21% 71,12% 71,81% 63,58% SMO PolyKernel L5,35% L5,90% L8,30% L8,17% L8,51% L2,53%SMO RBFKernel 49,25% 63,37% 60,36% 60,29% 65,22% 57,47% SMO RBFKernel L5,69% L5,42% L3,64% L3,57% L7,89% L6,93%J48 51,99% 58,58% 55,15% 63,85% 64,07% 50,68% J48 L6,24% L0,68% L5,69% L4,26% L2,60% L4,33%RandomForest 60,42% 65,23% 61,60% 69,47% 68,25% 59,74% RandomForest L5,56% L3,64% L7,19% L5,83% L6,44% L4,80%KNN Euclidean 56,31% 65,02% 53,63% 68,59% 66,12% 56,10% KNN Euclidean L4,05% L8,23% L9,40% L10,22% L10,56% L7,69%KNN Manhattan 56,59% 65,84% 55,00% 70,16% 68,79% 55,97% KNN Manhattan L5,07% L5,49% L9,81% L8,44% L8,98% L7,48%

Table 4.4: Classification of ISMIR data set degraded by low pass filtering degradation

As a summary of this section, if a whole environment, including training and test sets, is

degraded by the same single degradation, there exist three different classifying repercus-

sions: slightly results improvements, provided only in by a reduced group of degradations

in some specific cases; a notable deterioration, caused by degradations that remove a

frequency band of the audio signal and/or introduce a high noise level; and the most

common, a not prominent impact on the classification due to the model construction of

degraded files by the same degradation as the training data set. This last part can be

a good way to improve our results for the classification of audio collection that comes

from a similar source or degradation.

Chapter 5

Results classifying with mixed

degradations

So far, we studied the effect of different degradations on music features and on genre

classification as well, but always for each degradation in an isolated case, i.e. we haven’t

created any training or test set formed by degraded audio tracks from different degra-

dation systems, which is not common in real-world cases. Thereby, in this chapter we

will create an environment formed by training sets from clean audio data sets, and test

sets from mixed real-world degraded data sets (Section 3.2.2); then we will analyse the

results achieved by the genre classification and compare them with the results achieved

by the clean audio classification; finally we will propose some different experiments to

try to improve the results achieved on the classification for the created environment.

5.1 Creation of training and mixed test sets

In order to create 10-CV new fold versions as we described before, we will use the

procedure explained in Section 4.2.1 with a few modifications. First of all we will create

a 10-CV environment from the extracted features of clean data sets; then we will create

only the different test set folds of all the real-world degradations using exactly the same

order of extracted features of the degraded data sets; finally we have to proceed to

exchange *.arff test set file lines (where each line belongs to a concrete audio track

features) between files of clean audio and mixed degraded audio, as the example Figure

32

Chapter 5. Results classifying with mixed degradations 33

5.1. It is important to take care that the order of the created test set folds has to be the

same on clean sets and degraded sets, because we have to exchange features between the

same origin file, in order to avoid missing or repeating any track files.

Audio track 1 attr.

Audio track 2 attr.

Audio track 3 attr.

Audio track 4 attr.

Audio track 5 attr.

Audio track 6 attr.

Audio track 7 attr.

...

Audio track 1 attr.

Audio track 2 attr.

Audio track 3 attr.

Audio track 4 attr.

Audio track 5 attr.

Audio track 6 attr.

Audio track 7 attr.

...

Audio track 1 attr.

Audio track 2 attr.

Audio track 3 attr.

Audio track 4 attr.

Audio track 5 attr.

Audio track 6 attr.

Audio track 7 attr.

...

Audio track 1 attr.

Audio track 2 attr.

Audio track 3 attr.

Audio track 4 attr.

Audio track 5 attr.

Audio track 6 attr.

Audio track 7 attr.

...

Audio track 1 attr.

Audio track 2 attr.

Audio track 3 attr.

Audio track 4 attr.

Audio track 5 attr.

Audio track 6 attr.

Audio track 7 attr.

...

Clean Test Set Fold(*.arff file)

Mixed Degraded Test Set Fold(*.arff file)

Real World

Degradation I

Real World Degradation III

Real World Degradation II

Single Degraded Test Set Folds(*.arff files)

Figure 5.1: Creation of mixed degraded test set file from one fold of the 10-CV,picking one audio track feature from each single degraded test set circularly over alltest set long. Each table cell corresponds to each *.arff line that belongs to feature

attribute values and the audio track genre at the end of line.

This process has created 10 fold training sets (clean audio) and 10 fold test sets (mixed

degraded audio) for each feature and data set (6 features ∗ 2 data set = 12), resulting

of 12 complete 10-CV environments. Then, the classification is performed fold per fold,

as in classifications made before, using the last created environments. With the results,

we will be able to calculate the mean and variance between correctly fold percentage

classification of the same 10-CV.


5.2 Training and classifying with all attributes

As we thought, the classification of mixed degraded audio instead of clean audio, using

clean data set as a training set, has a clearly deterioration on all data sets, classifiers and

features. This fact is due to modifications on audio by the different degradation which

makes a confusion on the classification model constructed by clean audio. On Tables 4.1

there is the ISMIR data set example of the classification results and the deterioration

comparing with classification of clean audio. The most affected feature is SSD in all

classifiers, except from SMO RBFKernel which has its highest deterioration on TSSD

feature. The deterioration of SSD is around 20% which in an important degradation.

The second most affected feature is TSSD, because it is directly related to SSD. The

following ones are RP and MVD features, with a deterioration around 15%, and finally

RH and TRH with a deterioration around 12%. About the variance results, in both

classification ways, it has no prominent result. About GTZAN data set, the maximum

deterioration is also located on SSD features, being slightly higher than on ISMIR data

set (around 25%). RH is also the less affected classification on GTZAN data set with

average values of 11%.

(a) Mean percentage of correctly classified instances of mixed degradaded data set classifi-cation.

MEAN:&CLEAN&audio&Vs.&MIXED&DEGRADED&audio&(ISMIR)Clean&audio&classificationClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 57,07% 63,38% 61,25% 61,04% 52,81% 60,01%SMO PolyKernel 63,65% 75,11% 70,51% 79,29% 80,31% 66,12%SMO RBFKernel 54,94% 68,79% 63,99% 63,86% 73,12% 64,40%J48 58,23% 59,26% 60,84% 68,11% 66,67% 55,01%RandomForest 65,98% 68,86% 68,79% 75,31% 74,69% 64,54%KNN Euclidean 60,35% 73,25% 63,04% 78,81% 76,68% 63,79%KNN Manhattan 61,66% 71,33% 64,81% 78,61% 77,78% 63,44%

Mixed&degraded&audio&classification Deterioration&=&Clean&K&Mixed&degradedClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 43,21% 49,18% 47,19% 43,00% 40,32% 45,89% Naive Bayes 13,86% 14,20% 14,06% 18,05% 12,49% 14,13%SMO PolyKernel 51,03% 60,02% 56,65% 57,82% 58,92% 55,76% SMO PolyKernel 12,62% 15,09% 13,86% 21,47% 21,40% 10,36%SMO RBFKernel 45,68% 56,79% 51,51% 49,79% 57,75% 50,41% SMO RBFKernel 9,26% 12,01% 12,49% 14,06% 15,37% 13,99%J48 46,71% 45,34% 46,71% 48,63% 51,44% 46,37% J48 11,52% 13,92% 14,13% 19,48% 15,23% 8,64%RandomForest 51,30% 53,50% 49,11% 54,25% 57,20% 50,27% RandomForest 14,68% 15,37% 19,68% 21,06% 17,49% 14,27%KNN Euclidean 48,28% 55,41% 47,05% 54,60% 57,40% 51,24% KNN Euclidean 12,07% 17,84% 15,99% 24,21% 19,28% 12,55%KNN Manhattan 48,69% 55,63% 48,42% 56,44% 57,82% 52,67% KNN Manhattan 12,97% 15,71% 16,39% 22,16% 19,96% 10,77%

Mixed&degraded&audio&classification:&Classification&using&clean&audio&data&set&as&a&training&and&mixed&degraded&audio&data&set&as&a&test&set.

Clean&audio&classification:&Classification&using&clean&audio&data&set&as&a&training&and&as&a&test&set.

(b) Mean deterioration on classification of mixed degradations comparing to classification ofclean audio, using the same clean audio data set as training set (a darker highlighted valuemeans a higher deterioration).

MEAN:&CLEAN&audio&Vs.&MIXED&DEGRADED&audio&(ISMIR)Clean&audio&classificationClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 57,07% 63,38% 61,25% 61,04% 52,81% 60,01%SMO PolyKernel 63,65% 75,11% 70,51% 79,29% 80,31% 66,12%SMO RBFKernel 54,94% 68,79% 63,99% 63,86% 73,12% 64,40%J48 58,23% 59,26% 60,84% 68,11% 66,67% 55,01%RandomForest 65,98% 68,86% 68,79% 75,31% 74,69% 64,54%KNN Euclidean 60,35% 73,25% 63,04% 78,81% 76,68% 63,79%KNN Manhattan 61,66% 71,33% 64,81% 78,61% 77,78% 63,44%

Mixed&degraded&audio&classification Deterioration&=&Clean&K&Mixed&degradedClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 43,21% 49,18% 47,19% 43,00% 40,32% 45,89% Naive Bayes 13,86% 14,20% 14,06% 18,05% 12,49% 14,13%SMO PolyKernel 51,03% 60,02% 56,65% 57,82% 58,92% 55,76% SMO PolyKernel 12,62% 15,09% 13,86% 21,47% 21,40% 10,36%SMO RBFKernel 45,68% 56,79% 51,51% 49,79% 57,75% 50,41% SMO RBFKernel 9,26% 12,01% 12,49% 14,06% 15,37% 13,99%J48 46,71% 45,34% 46,71% 48,63% 51,44% 46,37% J48 11,52% 13,92% 14,13% 19,48% 15,23% 8,64%RandomForest 51,30% 53,50% 49,11% 54,25% 57,20% 50,27% RandomForest 14,68% 15,37% 19,68% 21,06% 17,49% 14,27%KNN Euclidean 48,28% 55,41% 47,05% 54,60% 57,40% 51,24% KNN Euclidean 12,07% 17,84% 15,99% 24,21% 19,28% 12,55%KNN Manhattan 48,69% 55,63% 48,42% 56,44% 57,82% 52,67% KNN Manhattan 12,97% 15,71% 16,39% 22,16% 19,96% 10,77%

Mixed&degraded&audio&classification:&Classification&using&clean&audio&data&set&as&a&training&and&mixed&degraded&audio&data&set&as&a&test&set.

Clean&audio&classification:&Classification&using&clean&audio&data&set&as&a&training&and&as&a&test&set.

Table 5.1: Mixed degraded classification of ISMIR data set, using clean data set as atraining set.

The last values analyzed confirm our thoughts about the main goal of this thesis: degra-

dations have an important impact on the classification results due to their effect on


different features used in this study. On the other hand, there is not an specific fea-

ture or frequency band that is clearly more affected than others, so it won’t be an easy

work to avoid the degradations effect and restore the results achieved by the clean audio

classification. The experiments that we have on mind are related to selecting the most

robust features and removing or missing the weakest ones in the classification process.

5.3 Attribute selection

In order to avoid the degradation effect, we will select the most robust feature attributes

that we will use on the classification process, i.e. we will use attributes that have less

shifts on mean and variance when we apply degradations over the clean audio tracks,

so the difference between degraded and clean audio is slight and it shouldn’t lead to

a wrong classification result. Although the mixed degradations involve only real-world

degradations, for the attribute selection we will take into account all the degradations

studied in this thesis (both synthetic 3.2.1 and real-world 3.2.2) because this way we

extend our study to other real-world degradations that could be made by combinations

of the different synthetic degradations studied.

5.3.1 Attribute selection process

Before explaining the attribute selection process, we have to introduce an important

concept that we will use in the selection process: the worst degradation. This is an

inexistent degradation which is constructed by the highest attribute mean and variance

differences between the different degradations that we discussed in this study. We call it

worst degradation because if it could exist it would be the degradation which would have

a higher impact on audio features, consequently affecting the classification process as

well. One worst degradation is created for each data set and feature, using all synthetic

and real-world degradation for mean and variance differences.

The creation process of each worst degradation uses the mean and variance differences

calculated on Section 4.1.1; then, the highest shift value is selected between all the

degradations that belong to the same data set for each mean and variance difference

value; finally we will have all feature mean and variance differences for each data set and

feature. The creation process is on the example Figure 5.2.


'HJUDGDWLRQ�,

'HJUDGDWLRQ�,,

'HJUDGDWLRQ�,,,

�

�

�

�

�

��

9DOXH�GLIIHUHQFHV

$WWULEXWHV

'LIIHUHQFHV�EHWZHHQ�FOHDQ�DQG�GHJUDGHG�DXGLR

+LJKHU�GLIIHUHQFHV�EHWZHHQ�FOHDQ�DQG�GHJUDGHG�DXGLR

:RUVW�GHJUDGDWLRQ

��

$WWULEXWHV

�

�

�

�

�

9DOXH�GLIIHUHQFHV

Figure 5.2: Example of performing process of the worst degradation: an inexistentdegradation made by the higher differences of all studied degradations for each attribute

After the creation of different worst degradations, we perform the attribute selection

which consists in cutting-off of the highest differences on mean and variance attributes.

We set the highest difference attribute as the 100% level for each data set, feature set,

mean and variance. Then, we set up two levels of selectivity: the tolerant level, which

is softer on the attribute selection, and the strong level, which is more selective on

the attribute selection. The tolerant level is always higher than the strong level. It’s

important to explain that the percentage level of cut-off is not referred to the number

of attributes, but it is referred to the level of highest attribute difference of the worst

degradation of which we are performing the attribute selection, being this one the 100%


level. Using this criteria, an attribute difference over the established could be considered

a weak attribute, whereas an attribute difference above the established level could be a

robust one. About the mean and variance interaction, if an attribute is selected as weak

one in either measure, finally is considered a weak attribute for the future experiments,

i.e. an attribute has to be robust in mean as well as in variance to be considered a robust

attribute.

Here we will analyse the selection process specifically for the ISMIR data set, although

the feature selection between ISMIR and GTZAN is very similar. The complete worst

degradations and attribute selection are presented in Appendix A.

The criteria for the setting of the selection level is performed by the observation of the

worst degradations resulting of the last process. We try to establish two levels taking

account of the emergence of a possible threshold of the several attribute groups for each

feature and also we have to take care not to remove too many attributes in order to

provide the classifier of enough values to perform a correct classification.

In Figure 5.3(a) there are the mean differences of the worst degradation of MVD ISMIR

data set. In the right side of the plot, there are red dashed lines indicating the different

levels: the 100% level corresponds to the highest attribute difference, the 90% level

corresponds to the tolerant selection level and the 35% level corresponds to the strong

selection level. In Figure 5.3(b) there is the same degradation after the application of

the tolerant selection, where 60 attributes are removed. In this case, all the attributes

that belong to the skewness measures of the MVD are removed due to the effect of the

mean differences on the low pass filtering. In Figure 5.3(c) there is the same degradation

after the application of the strong selection, where 128 attributes are removed. In this

case, all the attributes that belong to the variance measures, the first bin that belongs

to the median measure, the 7 first bins that belong to maximum measures and all the

attributes already removed in the tolerant selection. This is due to the effect of several

degradations, also including the low pass filtering degradation.

In the RP (Figure 5.4), the worst degradation has not the weakest attributes located in

a continuous range of attributes, so in this case the use of our method of level selection

is more useful than other methods like attribute range selection. Our set levels are 75%

for the tolerant level and 60% for the strong level. In this case the number of attributes


0 50 100 150 200 250 300 350 4000

1

2

mvd Original Mean

0 50 100 150 200 250 300 350 4000

1

2

mvd First Selection Mean

0 50 100 150 200 250 300 350 4000

1

2

mvd Second Selection Mean

0 50 100 150 200 250 300 350 4000

1

2mvd Original Variance

0 50 100 150 200 250 300 350 4000

1

2mvd First Selection Variance

0 50 100 150 200 250 300 350 4000

1

2mvd Second Selection Variance

100%

90%

35%

(a) MVD worst degradation: maximum attribute = 100%0 50 100 150 200 250 300 350 4000

1

2

mvd Original Mean

0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1


(b) Modified MVD worst degradation due to the application of the tolerant cut-off attributes (dottedline) = 90%

0 50 100 150 200 250 300 350 4000

1

2

mvd Original Mean

0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1


(c) Modified MVD worst degradation due to the application of the strong cut-off attributes (dashedline) = 35%

Figure 5.3: MVD worst degradation and attributes selection on ISMIR data set:tolerant cut-off (dotted line) and strong cut-off (dashed line).

removed are 122 and 325 , respectively. The attributes removed in this case are unequally

distributed across all the feature set.

0 200 400 600 800 1000 1200 14000

0.05

0.1

0.15rp Original Mean

0 200 400 600 800 1000 1200 14000

0.05

0.1

rp First Selection Mean

0 200 400 600 800 1000 1200 14000

0.05

0.1

rp Second Selection Mean

0 200 400 600 800 1000 1200 14000

0.01

0.02

0.03rp Original Variance

0 200 400 600 800 1000 1200 14000

0.01

0.02

0.03rp First Selection Variance

0 200 400 600 800 1000 1200 14000

0.01

0.02

0.03rp Second Selection Variance

Figure 5.4: RP worst degradation and attributes selection on ISMIR data set: tolerantcut-off (dotted line) and strong cut-off (dashed line).

Regarding the RH selection process (Figure 5.5), the worst degradation has the shape of

a decreasing curve, having its higher shifts in the lower part of the feature. In this case

the tolerant selection level is on 40%, resulting from a removal of the 7 first attributes of

the feature set. On the other hand, the strong selection level is on 25%, resulting from a

removal of the 17 first attributes of the feature set. In this case, the attributes removal

is due to the effect of several degradations, most of them have the same decreasing curve

as a graph shape.


0 10 20 30 40 50 600

5

10

15

rh Original Mean

0 10 20 30 40 50 600

5

10

15

rh First Selection Mean

0 10 20 30 40 50 600

5

10

15

rh Second Selection Mean

0 10 20 30 40 50 600

20

40

60rh Original Variance

0 10 20 30 40 50 600

20

40

60rh First Selection Variance

0 10 20 30 40 50 600

20

40

60rh Second Selection Variance

Figure 5.5: RH worst degradation and attributes selection on ISMIR data set: toler-ant cut-off (dotted line) and strong cut-off (dashed line).

In respect to the attribute removing of SSD features (Figure 5.6), as the worst degrada-

tion has the highest differences on the skewness measures part the weakest attributes are

from 73 to 96, the skewness region for the SSD features. The tolerant level is set on 50%,

resulting of a removal of 12 attributes, whereas the strong level is set on 20%, resulting

of a removal of 24 attributes (the whole skewness measure group). The removal of these

attributes is due to the effect of the high pass filtering (attributes from 73 to 80) and

low pass filtering (attributes from 81 to 96).

0 20 40 60 80 100 120 140 1600

50

100

ssd Original Mean

0 20 40 60 80 100 120 140 1600

50

100

ssd First Selection Mean

0 20 40 60 80 100 120 140 1600

50

100

ssd Second Selection Mean

0 20 40 60 80 100 120 140 1600

2000

4000

6000

ssd Original Variance

0 20 40 60 80 100 120 140 1600

2000

4000

6000

ssd First Selection Variance

0 20 40 60 80 100 120 140 1600

2000

4000

6000

ssd Second Selection Variance

Figure 5.6: SSD worst degradation and attributes selection on ISMIR data set: tol-erant cut-off (dotted line) and strong cut-off (dashed line).

For the TSSD features (Figure 5.7), the worst degradation has a range of high shifts

between attributes from 241 to 264. The degradation effect of the remaining attributes is

negligible. However, the level of the difference between the affected attributes is different

and not regular for all of them. The tolerant selection level is set on 53%, resulting of a

removal of 11 attributes, whereas the strong selection level is set on 10%, resulting of a

removal of 24 attributes.

0 200 400 600 800 10000

1

2x 104 tssd Original Mean

0 200 400 600 800 10000

1

2x 104 tssd First Selection Mean

0 200 400 600 800 10000

1

2x 104 tssd Second Selection Mean

0 200 400 600 800 10000

2

4x 108 tssd Original Variance

0 200 400 600 800 10000

2

4x 108 tssd First Selection Variance

0 200 400 600 800 10000

2

4x 108 tssd Second Selection Variance

_Figure 5.7: TSSD worst degradation and attributes selection on ISMIR data set:

tolerant cut-off (dotted line) and strong cut-off (dashed line).


Regarding the TRH feature (Figure 5.8), in its worst degradation there are several sep-

arated groups that remind us the RH shape, repeated over all the TRH feature. This

fact is due to the TRH feature structure, that is basically the same RH feature model

repeated several times over the track recording, resulting of similar differences between

degraded and clean audio, repeated several times over the whole temporal feature TRH.

The tolerant selection level is set on 20% and the strong one is set on 10%. The number

of attributes removed is 4 and 19, respectively. In the tolerant selection level the first

bins that belong to two of the mentioned RH groups are removed; in the strong selection

level that first bins that belong to four of the mentioned RH groups as well are removed.

0 50 100 150 200 250 300 350 4000

50

100trh Original Mean

0 50 100 150 200 250 300 350 4000

50

100trh First Selection Mean

0 50 100 150 200 250 300 350 4000

50

100trh Second Selection Mean

0 50 100 150 200 250 300 350 4000

5000

10000trh Original Variance

0 50 100 150 200 250 300 350 4000

5000

10000trh First Selection Variance

0 50 100 150 200 250 300 350 4000

5000

10000trh Second Selection Variance

Figure 5.8: TRH worst degradation and attributes selection on ISMIR data set:tolerant cut-off (dotted line) and strong cut-off (dashed line).

In respect to the worst variance of the differences between clean and degraded audio,

there are only very isolated cases where we found a high variance without a high mean

difference, but we also performed a selection on variance graphs, using exactly the same

procedure that we used on the mean graphs. As we already said before, if an attribute is

weak either on mean or on variance, we consider it as a weak attribute and we proceed

to the removal of it. All the different graphs of the worst degradations with mean and

variance difference values with their own selection levels is presented in the Appendix A.

(a) Tolerant cut-off

Attribute(Selection((ISMIR)Tolerant)cut,offFeature RH RP MVD SSD TSSD TRHOriginal dimensionality 60 1440 420 168 1176 420Num. Attributes removed 7 122 60 12 11 4Num. Attributes remaining 53 1318 360 156 1165 416Removing relative attribute set size 11,67% 8,47% 14,29% 7,14% 0,94% 0,95%Remaining relative attribute set size 88,33% 91,53% 85,71% 92,86% 99,06% 99,05%

Strong)cut,offFeature RH RP MVD SSD TSSD TRHOriginal dimensionality 60 1440 420 168 1176 420Num. Attributes removed 17 325 128 24 24 19Num. Attributes remaining 43 1115 292 144 1152 401Removing relative attribute set size 28,33% 22,57% 30,48% 14,29% 2,04% 4,52%Remaining relative attribute set size 71,67% 77,43% 69,52% 85,71% 97,96% 95,48%

(b) Strong cut-off

Attribute(Selection((ISMIR)Tolerant)cut,offFeature RH RP MVD SSD TSSD TRHOriginal dimensionality 60 1440 420 168 1176 420Num. Attributes removed 7 122 60 12 11 4Num. Attributes remaining 53 1318 360 156 1165 416Removing relative attribute set size 11,67% 8,47% 14,29% 7,14% 0,94% 0,95%Remaining relative attribute set size 88,33% 91,53% 85,71% 92,86% 99,06% 99,05%


Table 5.2: Number of attributes selected on ISMIR data set


As a summary, in the Table 5.2 there are the different relations and values regarding the

attribute selection of the ISMIR data set. In the majority of the features, the number of

removed attributes is around the double in the strong selection comparing to the tolerant.

selection. The feature with the strongest selectivity level is MVD with a removal of 30%

of attributes on the strong selection, whereas features with less attributes removed are

the temporal features TSSD and TRH, which could even have negligible effect on the

classification.

(a) Tolerant cut-off

Attribute(Selection((GTZAN)Tolerant)cut,offFeature RH RP MVD SSD TSSD TRHOriginal dimensionality 60 1440 420 168 1176 420Num. Attributes removed 12 102 60 15 13 2Num. Attributes remaining 48 1338 360 153 1163 418Removing relative attribute set size 20,00% 7,08% 14,29% 8,93% 1,11% 0,48%Remaining relative attribute set size 80,00% 92,92% 85,71% 91,07% 98,89% 99,52%


(b) Strong cut-off

Attribute(Selection((GTZAN)Tolerant)cut,offFeature RH RP MVD SSD TSSD TRHOriginal dimensionality 60 1440 420 168 1176 420Num. Attributes removed 12 102 60 15 13 2Num. Attributes remaining 48 1338 360 153 1163 418Removing relative attribute set size 20,00% 7,08% 14,29% 8,93% 1,11% 0,48%Remaining relative attribute set size 80,00% 92,92% 85,71% 91,07% 98,89% 99,52%


Table 5.3: Number of attributes selected on GTZAN data set

Regarding the GTZAN data set (Tables 5.3), the number of attributes removed is close to

the ISMIR attribute selection, except for the RH case, in which the difference of removed

attributes is about 10% between both data sets, because it has a less pronounced curve,

the attribute selection is harder than on ISMIR data set. About the remainder features,

the differences between both data sets are around 1%.

5.3.2 Possible results with attribute selection

Using all the information about the feature selection, we will proceed to two different

experiments in order to try to achieve the same results as the classification of clean audio.

We expect three possible results:

• A classification results improvement with respect to the results achieved by the

classification of mixed degraded audio using all features: this could mean that

we removed the weaker attributes against the degradations without affecting the

correct genre classification of the different audio tracks.


• A similar classification results with respect to the results achieved by the classifi-

cation of mixed degraded audio using all features: in this case, it could mean that

the attributes that we removed are not important or useful on the classification of

mixed degradation audio tracks. It could lead to a new experiment to study the

effect of the removal of the same attributes in the classification of clean audio, in

order to try to reduce the dimensionality of the studied attributes.

• A classification results deterioration with respect to the results achieved by the

classification of mixed degraded audio using all features: the removal of weakest

attributes is not allowed in the genre classification of mixed degraded audio files.

5.3.3 Training and classifying with most robust attributes

This is the first experiment that we will perform in order to improve the genre classifica-

tion results of mixed degraded audio and using clean audio as a training set. We know

that in the best of the cases we will achieve the same results as in the classification of

clean audio.

In this experiment we will proceed to remove the weakest attributes from our training and

test sets as well, using both selections obtained in the previous section. The procedure

is to copy and modify the different *.arff files, which contain the attribute values of each

audio track, removing the header parts of attributes that we want to remove, as well as

the different values from each line, which belong to the attribute value that we want to

remove. We have to repeat the procedure for each fold (training and test sets) of the 12

environments of the 10-CV, using in each feature, the attributes selected to be removed

in the previous procedure. We have to perform it for the tolerant selection as well as for

the strong one. (24 complete 10-CV environments). Then, the classification is performed

as in the other sections, resulting of a mean and variance between the 10 folds of the

classification.

The Table 5.4 shows the results of the classification of the before mentioned experiments

using the strong selection of the ISMIR data set. The results achieved by this experiment

are similar as the ones achieved in Section 5.2, so we get the results commented in the

second point of Section 5.3.2.


It is surprisingly good that MVD feature, which is the one with the highest percentage of

removed attributes (30%), achieved the best classification marks. In addition, with the

KNN Euclidean classifier, it has a percentage of correct instances 2,26% higher than the

classification without attributes removing. This means that in the MVD case, for the

classification of mixed degraded audio files, the removed attributes are totally useless. In

addition, for RP and SSD, even with their significant percentage of attributes removal,

the results still are very similar to the mixed degraded audio files classification, only with

maximum deterioration of 1,51% in all their classifiers cases. In RP features, the clas-

sification is slightly worst, but is still very similar to the other mentioned classification.

About TSSD and TRH, the results are not significant due to the low ratio of removed

attributes.

(a) Mean percentage of correctly classified instances (highlighted values mean an improve-ment with respect to mixed degraded classification without attribute selection)

MEAN%of%ISMIR%CLASSIFICATION%WITH%TRAINING%3333>%1st%Scenario%Vs.%2nd%Scenario1st$Scenario:$Using$all$the$attributesClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 43,21% 49,18% 47,19% 43,00% 40,32% 45,89%SMO PolyKernel 51,03% 60,02% 56,65% 57,82% 58,92% 55,76%SMO RBFKernel 45,68% 56,79% 51,51% 49,79% 57,75% 50,41%J48 46,71% 45,34% 46,71% 48,63% 51,44% 46,37%RandomForest 51,30% 53,50% 49,11% 54,25% 57,20% 50,27%KNN Euclidean 48,28% 55,41% 47,05% 54,60% 57,40% 51,24%KNN Manhattan 48,69% 55,63% 48,42% 56,44% 57,82% 52,67%

Difference%=%2nd%Scen%3%1st%Scen2nd$Scenario:$Most$robust$attributes$(1st$selection) Positive%%=%improving%of%the%classification%with%2nd%scenario

Negative%%=%deterioration%of%the%classification%with%2nd%scenarioClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 43,48% 49,25% 46,91% 42,93% 40,05% 45,82% Naive Bayes 0,28% 0,07% 30,27% 30,07% 30,28% 30,07%SMO PolyKernel 48,90% 60,02% 56,44% 58,16% 58,85% 55,62% SMO PolyKernel 32,13% 0,00% 30,21% 0,34% 30,07% 30,14%SMO RBFKernel 47,26% 55,69% 51,65% 49,38% 57,75% 50,62% SMO RBFKernel 1,58% 31,10% 0,14% 30,41% 0,00% 0,21%J48 44,17% 45,75% 47,26% 48,63% 52,20% 45,96% J48 32,54% 0,41% 0,55% 0,00% 0,76% 30,41%RandomForest 48,42% 55,01% 49,24% 53,98% 54,94% 51,37% RandomForest 32,88% 1,51% 0,13% 30,27% 32,26% 1,10%KNN Euclidean 45,33% 55,28% 48,49% 54,66% 57,33% 51,10% KNN Euclidean 32,95% 30,14% 1,44% 0,07% 30,07% 30,14%KNN Manhattan 44,78% 55,14% 48,08% 56,79% 57,88% 51,23% KNN Manhattan 33,91% 30,48% 30,34% 0,34% 0,07% 31,44%

2nd$Scenario:$Most$robust$attributes$(2nd$selection)Classifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 42,94% 49,18% 46,02% 43,55% 40,66% 45,88% Naive Bayes 30,27% 0,00% 31,17% 0,55% 0,34% 0,00%SMO PolyKernel 46,91% 61,39% 55,28% 58,85% 59,12% 55,69% SMO PolyKernel 34,12% 1,37% 31,37% 1,03% 0,20% 30,07%SMO RBFKernel 45,82% 55,49% 49,86% 49,03% 57,75% 50,21% SMO RBFKernel 0,14% 31,30% 31,64% 30,76% 0,00% 30,21%J48 46,09% 46,29% 46,91% 47,12% 51,86% 45,20% J48 30,62% 0,96% 0,20% 31,51% 0,41% 31,16%RandomForest 46,56% 53,16% 49,18% 53,43% 56,65% 50,00% RandomForest 34,73% 30,34% 0,07% 30,81% 30,55% 30,27%KNN Euclidean 44,37% 54,46% 49,31% 54,87% 57,33% 50,21% KNN Euclidean 33,91% 30,96% 2,26% 0,27% 30,07% 31,03%KNN Manhattan 44,71% 54,11% 49,45% 56,17% 57,54% 50,27% KNN Manhattan 33,98% 31,51% 1,03% 30,27% 30,27% 32,40%

2nd%SCENARIO:%Classification%by%10%cross%validation,%usign%clean%data%for%training,%and%mixed%degraded%audio%for%testing,%using%the%most%robust%attributes%in%both%cases.%Two%levels%of%selectivity.

1st%SCENARIO:%Classification%using%all%the%attributes,%by%10%cross%validation,%using%clean%data%for%training,%and%mixed%degraded%audio%for%testing.

(b) Mean differences on classification of mixed degradations using the most robust attributescompared to classification of mixed degraded data sets without attribute selection, using cleanaudio data set as training set (positive values = improvement, negative values = deterioration;a darker highlighted value means a better improvement).



Negative%%=%deterioration%of%the%classification%with%2nd%scenarioClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 43,48% 49,25% 46,91% 42,93% 40,05% 45,82% Naive Bayes 0,28% 0,07% 30,27% 30,07% 30,28% 30,07%SMO PolyKernel 48,90% 60,02% 56,44% 58,16% 58,85% 55,62% SMO PolyKernel 32,13% 0,00% 30,21% 0,34% 30,07% 30,14%SMO RBFKernel 47,26% 55,69% 51,65% 49,38% 57,75% 50,62% SMO RBFKernel 1,58% 31,10% 0,14% 30,41% 0,00% 0,21%J48 44,17% 45,75% 47,26% 48,63% 52,20% 45,96% J48 32,54% 0,41% 0,55% 0,00% 0,76% 30,41%RandomForest 48,42% 55,01% 49,24% 53,98% 54,94% 51,37% RandomForest 32,88% 1,51% 0,13% 30,27% 32,26% 1,10%KNN Euclidean 45,33% 55,28% 48,49% 54,66% 57,33% 51,10% KNN Euclidean 32,95% 30,14% 1,44% 0,07% 30,07% 30,14%KNN Manhattan 44,78% 55,14% 48,08% 56,79% 57,88% 51,23% KNN Manhattan 33,91% 30,48% 30,34% 0,34% 0,07% 31,44%




Table 5.4: Mixed degraded classification of ISMIR data set, using clean data set asa training set, and using only the most robust attributes of training and test sets with

strong selection.

In Table 5.5 we present the classification results of the same experiment on GTZAN data

set. In this case, the best feature is also MVD, achieving the best results with 2,10% of

correct classified instances using the KNN Euclidean classifier: the same classifier which

achieved the best results of the ISMIR classification. About RP and SSD, the results

achieved are also similar as the classification of mixed degraded audio, even with the high

ratio of removed attributes. On the other hand, the RP has a deterioration around 5%

which isn’t as high as on the ISMIR case. This is due to the different ratio of removed



MEAN%of%GTZAN%CLASSIFICATION%WITH%TRAINING%4444>%1st%Scenario%Vs.%2nd%Scenario1st$Scenario:$Using$all$the$attributesClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 24,60% 34,30% 27,50% 32,60% 37,30% 25,00%SMO PolyKernel 27,90% 45,30% 35,20% 42,20% 42,70% 29,50%SMO RBFKernel 19,90% 40,40% 27,40% 33,50% 42,50% 24,50%J48 21,00% 24,50% 22,80% 27,40% 29,90% 18,70%RandomForest 22,50% 28,00% 25,90% 36,40% 37,00% 24,40%KNN Euclidean 28,60% 36,50% 27,40% 38,20% 34,70% 20,30%KNN Manhattan 28,30% 36,90% 28,30% 40,90% 41,00% 24,50%

Difference%=%2nd%Scen%4%1st%Scen2nd$Scenario:$Most$robust$attributes$(1st$selection)Classifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 21,50% 34,80% 28,00% 33,10% 37,20% 24,90% Naive Bayes 43,10% 0,50% 0,50% 0,50% 40,10% 40,10%SMO PolyKernel 23,30% 46,30% 36,60% 42,90% 42,70% 29,80% SMO PolyKernel 44,60% 1,00% 1,40% 0,70% 0,00% 0,30%SMO RBFKernel 16,50% 40,00% 27,50% 33,80% 42,50% 25,30% SMO RBFKernel 43,40% 40,40% 0,10% 0,30% 0,00% 0,80%J48 20,30% 23,50% 22,30% 27,90% 30,10% 18,70% J48 40,70% 41,00% 40,50% 0,50% 0,20% 0,00%RandomForest 21,30% 27,80% 24,90% 37,10% 36,30% 22,80% RandomForest 41,20% 40,20% 41,00% 0,70% 40,70% 41,60%KNN Euclidean 24,70% 36,40% 28,80% 39,00% 34,70% 20,50% KNN Euclidean 43,90% 40,10% 1,40% 0,80% 0,00% 0,20%KNN Manhattan 25,20% 35,80% 27,40% 40,60% 41,20% 24,20% KNN Manhattan 43,10% 41,10% 40,90% 40,30% 0,20% 40,30%




Positive%%=%improving%of%the%classification%with%2nd%scenario

Negative%%=%deterioration%of%the%classification%with%2nd%scenario

(b) Mean differences on classification of mixed degradations using the most robust attributescomparing to classification of mixed degraded data sets without attribute selection, usingclean audio data set as training set (positive values = improvement, negative values = dete-rioration; a darker highlighted value means a better improvement).







Negative%%=%deterioration%of%the%classification%with%2nd%scenario

Table 5.5: Mixed degraded classification of GTZAN data set, using clean data set asa training set, and using only the most robust attributes of training and test sets with

strong selection.

attributes (37% on the GTZAN case and 28% on the ISMIR case). TSSD and TRH

features are in the same situation as in the classification of ISMIR data set.

5.3.4 Training with all attributes and classifying missing the weaker

attributes

In this section we will perform an experiment proceeding to train the model using all the

attributes from clean audio, and missing the weakest attributes on the test set of mixed

degraded audio.

The procedure of this experiment is similar to the other experiments performed and uses

several files that we already have made. The *.arff files for training folds used in this

section will be exactly the same training folds created in Section 5.1, which are made by

all the feature attributes from clean audio tracks. On the other hand, we will proceed

to modify a copy of the *.arff files folds created in the same section, changing in each

line of the *.arff file, all the values that a belong to weak attributes (depending on the

level of selectivity and feature case) by the interrogation sign ?. The Weka software

translates this sign as a missing value, changing the classification algorithm for each

classifier and changing the result achieved. We have to do the whole procedure for the


tolerant selection level as well as the strong selection level, having at the end 24 complete

10-CV environments (the same number of environments as in the previous section).

In Table 5.6 we present the classifying results achieved by the last described experiment.

In this case, the results achieved are worst than in the previous section and in less cases

we see a slightly improvement. The worst case of deterioration is with the use of the

classifier KNN Euclidean, getting deteriorations up to 40,87% on RP feature or 34,64%

on the RH one. With this classifier, the deterioration is directly related to the ratio of

removed attributes, so it means that this classifier is not good using missing values in its

classifying algorithm. The KNN Manhattan has also bad classification results, related

to the number of removing attributes as well. On the other hand, Naive Bayes and J48

classifiers have results quite similar to the classification achieved with the use of all the

attributes.


MEAN%of%ISMIR%CLASSIFICATION%WITH%TRAINING%3333>%1st%Scenario%Vs.%3rd%Scenario1st$Scenario:$Using$all$the$attributesClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 43,21% 49,18% 47,19% 43,00% 40,32% 45,89%SMO PolyKernel 51,03% 60,02% 56,65% 57,82% 58,92% 55,76%SMO RBFKernel 45,68% 56,79% 51,51% 49,79% 57,75% 50,41%J48 46,71% 45,34% 46,71% 48,63% 51,44% 46,37%RandomForest 51,30% 53,50% 49,11% 54,25% 57,20% 50,27%KNN Euclidean 48,28% 55,41% 47,05% 54,60% 57,40% 51,24%KNN Manhattan 48,69% 55,63% 48,42% 56,44% 57,82% 52,67%

Difference%=%3rd%Scen%3%1st%Scen3rd$Scenario:$Most$robust$attributes$(1st$selection) Positive%%=%improving%of%the%classification%with%3rd%scenario

Negative%%=%deterioration%of%the%classification%with%3rd%scenarioClassifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 43,48% 49,25% 46,91% 42,93% 40,05% 45,82% Naive Bayes 0,28% 0,07% 30,27% 30,07% 30,28% 30,07%SMO PolyKernel 44,79% 58,10% 53,77% 57,68% 59,12% 55,69% SMO PolyKernel 36,24% 31,92% 32,88% 30,14% 0,21% 30,07%SMO RBFKernel 45,75% 55,49% 51,44% 49,04% 57,68% 50,75% SMO RBFKernel 0,07% 31,30% 30,07% 30,76% 30,07% 0,34%J48 43,08% 45,33% 47,26% 49,38% 51,51% 46,30% J48 33,63% 0,00% 0,55% 0,75% 0,07% 30,07%RandomForest 45,67% 52,74% 49,59% 53,01% 56,79% 49,99% RandomForest 35,62% 30,75% 0,48% 31,23% 30,41% 30,28%KNN Euclidean 42,66% 40,47% 35,67% 32,78% 57,33% 46,23% KNN Euclidean 35,62% 314,95% 311,38% 321,82% 30,07% 35,01%KNN Manhattan 45,95% 53,63% 39,10% 56,72% 57,95% 50,48% KNN Manhattan 32,74% 31,99% 39,32% 0,28% 0,14% 32,20%

3rd$Scenario:$Most$robust$attributes$(2nd$selection)Classifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 42,94% 49,18% 46,02% 43,55% 40,66% 45,88% Naive Bayes 30,27% 0,00% 31,17% 0,55% 0,34% 0,00%SMO PolyKernel 42,18% 54,74% 49,17% 57,61% 59,12% 54,32% SMO PolyKernel 38,85% 35,28% 37,48% 30,21% 0,21% 31,44%SMO RBFKernel 45,54% 54,18% 50,00% 48,42% 57,68% 50,27% SMO RBFKernel 30,14% 32,60% 31,51% 31,37% 30,07% 30,14%J48 42,59% 46,09% 47,12% 48,56% 51,51% 46,16% J48 34,12% 0,75% 0,42% 30,07% 0,07% 30,20%RandomForest 44,85% 53,56% 48,83% 53,42% 57,00% 49,17% RandomForest 36,45% 0,07% 30,27% 30,82% 30,21% 31,10%KNN Euclidean 13,65% 14,54% 28,88% 24,96% 56,85% 41,29% KNN Euclidean 334,64% 340,87% 318,17% 329,63% 30,55% 39,95%KNN Manhattan 39,50% 46,98% 40,67% 55,21% 57,75% 47,80% KNN Manhattan 39,19% 38,64% 37,75% 31,24% 30,07% 34,87%


3rd%SCENARIO:%Classification%by%10%cross%validation,%usign%clean%data%for%training,%and%mixed%degraded%audio%for%testing,%missing%the%weakest%sttributes%in%the%test%files.%Two%levels%of%selectivity.

(b) Mean differences on classification of mixed degradations mising weakest attributes com-paring to classification of mixed degraded data sets without attribute selection, using cleanaudio data set as training set (positive values = improvement, negative values = deteriora-tion; a darker highlighted value means a better improvement).







Table 5.6: Mixed degraded classification of ISMIR data set, using clean data sets asa training set, and using all the attributes of training but missing weakest attributes of

test sets.

With ISMIR data set, we can not define the experiment results like one of the expected

results of the Section 5.3.2, because this depends on the classifier used in each case.

For Naive Bayes, SMO RBFKernel, J48 and Random Forest classifiers we achieved the

results expected in the second point, where the attributes removed are not useful for the

classification process. On the other hand, for SMO Polykernel, KNN Euclidean and KNN

Manhattan classifiers, we achieved worst results than with the first classification with all


the attributes, so the attribute selection using missing values is not recommended with

the use ot these classifiers.


MEAN%of%GTZAN%CLASSIFICATION%WITH%TRAINING%4444>%1st%Scenario%Vs.%3rd%Scenario1st$Scenario:$Using$all$the$attributesClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 24,60% 34,30% 27,50% 32,60% 37,30% 25,00%SMO PolyKernel 27,90% 45,30% 35,20% 42,20% 42,70% 29,50%SMO RBFKernel 19,90% 40,40% 27,40% 33,50% 42,50% 24,50%J48 21,00% 24,50% 22,80% 27,40% 29,90% 18,70%RandomForest 22,50% 28,00% 25,90% 36,40% 37,00% 24,40%KNN Euclidean 28,60% 36,50% 27,40% 38,20% 34,70% 20,30%KNN Manhattan 28,30% 36,90% 28,30% 40,90% 41,00% 24,50%






(b) Mean differences on classification of mixed degradations missing weakest attributes com-paring to classification of mixed degraded data sets without attribute selection, using cleanaudio data sets as training set (positive values = improvement, negative values = deteriora-tion; a darker highlighted value means a better improvement).







Table 5.7: Mixed degraded classification of GTZAN data set, using clean data setsas a training set, and using all the attributes of training but missing weakest attributes

of test sets.

In Table 5.7 we show the results of the experiment described in this section on the

classification of GTZAN data set. In this case, the results are similar as the results

achieved for the ISMIR data set: the classification depends on the classifier used instead

of the feature used. KNN Euclidean is still the worst classifier, but with less deterioration

than in ISMIR, achieving a maximum deterioration of 18,60% on RP features. Except for

the classifier KNN Euclidean, the results of the SSD are similar to the results achieved

in the non attribute selection classification, like it happened in the ISMIR data set.

As a summary, the second experiment, using all attributes on the training set and missing

values on the test set achieved worst results than the first experiment (Section 5.3.3).

However, the results achieved in the first experiment have not improved the classification

results as we expected, achieving similar results as in the classification of mixed degraded

audio using all the attributes in both data sets.

Chapter 6

Summary and Further Work

6.1 Summary

There exist several studies related to musical genre classification systems, but they are

usually related to the classification of collections of rather consistent recording quality,

using as a training data sets with the same recording quality as well. On the other hand,

we can find situations where the audio tracks that we want to classify could come from

different sources, as well as recorded with different qualities. This has a direct impact on

the genre classification due to the effect of the audio features by different audio recording

degradations.

The study reported in this thesis evaluates the impact of degradations produced in

controlled environments as well as degradations that we could find on real-world audio

recordings, coming from several popular sources. We discussed this impact comparing

psychoacoustic features extracted from clean audio tracks and several degraded versions

of the same audio tracks. Thus, as the genre classification uses this feature sets in order

to perform its work, the classification is affected as well, decreasing the percentage of

correct classified instances.

Our hypothesis was that we could select the most robust features against an aggregation

of the most common degradations, and then proceed with a new classification process

using only the robuster features selected, thus minimizing the negative impact of the

degradations on the genre classification.

47

Chapter 6. Summary and Further Work 48

Referring to the impact of degradations on psychoacoustic features, we analysed that

the effect of degradations is not evenly spread across different feature sets, i.e. some

attributes suffer more strongly than others, depending on the degradation applied as

well. In addition, some feature sets have their weaker attributes in a continuous range

(such as RH or MVD), although for others (such as RP and SSD), the effect is not located

in a focused range. In this fashion, the best way to select the robuster features is set a

threshold level on the attribute differences between clean and degraded audio for each

feature set.

Related to the effect of degradations on the classification, we observed that if training and

test sets are formed by degraded audio of the same kind of degradation, the percentage

of correct classification is similar to the percentages achieved by training and test with

the original, i.e. high quality or clean audio. In some cases, the classification is slightly

even better, such as Harmony Distortion degradation. However, in Low-pass and High-

pass filtering, the classification is significantly worst because of the removing of several

frequency bands.

In the classification of mixed degradations using high quality audio as training set, the

results achieved are worse than the classification of collections of rather consistent record-

ing quality, as we thought in the beginning. On the other hand, the strategies proposed

in order to mitigate the effect by relying only the most robust attributes failed, because

the results achieved are similar or even worse than the new models that we proposed.

Thus, classifiers may be trained using high-quality recordings and then classifing mixed

degraded audio collections, without taking special precautions to the attributes selection,

at least for the rather broad range of degradations studied.

6.2 Further work

One interesting fact that we observed during our study is that the genre classification

whole executed in a degraded environment, i.e. training set and test set, both degraded

by the same distortion, has similar results comparing with the use of clean environment.

It could be useful when we intend to classify an audio collection stemming predominantly

or purely from the same specific degradation, e.g. all the audio files that we want to

classify are from mobile phone recordings. In this case, we could define a training set of

Chapter 6. Summary and Further Work 49

audio coming from the same degradation or degraded a clean audio collection, in order

to get a degraded one.

In addition, in the results review, we get an improvement of the correct classification of

degraded audio from some specific degradations. It would be interesting to do a research

about how these degradations modify the feature attributes in order to get a better

classification, and then apply it to the traditional genre classification systems.

Related to the attribute selection, for each song that has a known degradation, we

could perform an attribute selection of the most robust attributes only of the specific

degradation of the audio track, removing the weaker ones (like the experiment in Section

5.3.3) or missing their values (like the experiment in Section 5.3.4).

On the other hand, the combination of several feature set, taking care of the different

weight of each attribute per feature, could improve the classification using the attribute

selection.

Finally, another interesting study could be to perform our experiments using the tradi-

tional features as MFCC, Chroma features, among others, and then analyse the impact

if different degradations on them and the results achieved with the implementation of

the different selection attributes experiments.

Appendix A

Worst degradations - Attribute

selection

A.1 Mean differences of ISMIR worst degradations

Mean differences between ISMIR worst degraded and clean audio with both attribute

selection levels of all the features studied are presented during the Chapter 5:

• RP features: 5.4

• RH features: 5.5

• SSD features: 5.6

• MVD features: 5.3

• TSSD features: 5.7

• TRH features: 5.8

50

Appendix A. Worst degradations - Attribute selection 51

A.2 Variance differences of ISMIR worst degradations

0 200 400 600 800 1000 1200 14000

0.05

0.1

0.15rp Original Mean

0 200 400 600 800 1000 1200 14000

0.05

0.1


0 200 400 600 800 1000 1200 14000

0.05

0.1


0 200 400 600 800 1000 1200 14000

0.01

0.02


0 200 400 600 800 1000 1200 14000

0.01

0.02


0 200 400 600 800 1000 1200 14000

0.01

0.02


(a) RP. Tolerant selection level: 28%; Strong selection level: 14%.

0 10 20 30 40 50 600

5

10

15

rh Original Mean

0 10 20 30 40 50 600

5

10

15


0 10 20 30 40 50 600

5

10

15


0 10 20 30 40 50 600

20

40

60rh Original Variance

0 10 20 30 40 50 600

20

40

60rh First Selection Variance

0 10 20 30 40 50 600

20

40

60rh Second Selection Variance

(b) RH. Tolerant selection level: 19%; Strong selection level: 13%.

0 20 40 60 80 100 120 140 1600

50

100

ssd Original Mean

0 20 40 60 80 100 120 140 1600

50

100

ssd First Selection Mean

0 20 40 60 80 100 120 140 1600

50

100

ssd Second Selection Mean

0 20 40 60 80 100 120 140 1600

2000

4000

6000


0 20 40 60 80 100 120 140 1600

2000

4000

6000


0 20 40 60 80 100 120 140 1600

2000

4000

6000


(c) SSD. Tolerant selection level: 82%; Strong selection level: 8%.

0 50 100 150 200 250 300 350 4000

1

2

mvd Original Mean

0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1


(d) MVD. Tolerant selection level: 60%; Strong selection level: 20%.

0 200 400 600 800 10000

1

2x 104 tssd Original Mean

0 200 400 600 800 10000

1

2x 104 tssd First Selection Mean

0 200 400 600 800 10000

1

2x 104 tssd Second Selection Mean

0 200 400 600 800 10000

2


0 200 400 600 800 10000

2


0 200 400 600 800 10000

2


_(e) TSSD. Tolerant selection level: 72,5%; Strong selection level: 7,5%.

0 50 100 150 200 250 300 350 4000

50

100trh Original Mean

0 50 100 150 200 250 300 350 4000

50


0 50 100 150 200 250 300 350 4000

50


0 50 100 150 200 250 300 350 4000

5000


0 50 100 150 200 250 300 350 4000

5000


0 50 100 150 200 250 300 350 4000

5000


(f) TRH. Tolerant selection level: 19%; Strong selection level: 3%.

Figure A.1: Variance differences between ISMIR worst degraded and clean audio withboth attribute selection levels of all the features studied.


A.3 Mean differences of GTZAN worst degradations

0 200 400 600 800 1000 1200 14000

0.05

0.1

0.15

rp Original Mean

0 200 400 600 800 1000 1200 14000

0.05

0.1

0.15


0 200 400 600 800 1000 1200 14000

0.05

0.1

0.15


0 200 400 600 800 1000 1200 14000

0.01

0.02


0 200 400 600 800 1000 1200 14000

0.01

0.02


0 200 400 600 800 1000 1200 14000

0.01

0.02



0 10 20 30 40 50 600

5

10

rh Original Mean

0 10 20 30 40 50 600

5

10


0 10 20 30 40 50 600

5

10


0 10 20 30 40 50 600

10

20

30

rh Original Variance

0 10 20 30 40 50 600

10

20

30

rh First Selection Variance

0 10 20 30 40 50 600

10

20

30

rh Second Selection Variance


0 20 40 60 80 100 120 140 1600

50

100

150ssd Original Mean

0 20 40 60 80 100 120 140 1600

50

100

150ssd First Selection Mean

0 20 40 60 80 100 120 140 1600

50

100

150ssd Second Selection Mean

0 20 40 60 80 100 120 140 1600

5000

10000


0 20 40 60 80 100 120 140 1600

5000

10000


0 20 40 60 80 100 120 140 1600

5000

10000



0 50 100 150 200 250 300 350 4000

1

2

mvd Original Mean

0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1



0 200 400 600 800 10000

1

2

x 104 tssd Original Mean

0 200 400 600 800 10000

1

2

x 104 tssd First Selection Mean

0 200 400 600 800 10000

1

2

x 104 tssd Second Selection Mean

0 200 400 600 800 10000

2

4


0 200 400 600 800 10000

2

4


0 200 400 600 800 10000

2

4


_(e) TSSD. Tolerant selection level: 43%; Strong selection level: 10%.

0 50 100 150 200 250 300 350 4000

20

40

60trh Original Mean

0 50 100 150 200 250 300 350 4000

20

40


0 50 100 150 200 250 300 350 4000

20

40


0 50 100 150 200 250 300 350 4000

2000

4000


0 50 100 150 200 250 300 350 4000

2000

4000


0 50 100 150 200 250 300 350 4000

2000

4000



Figure A.2: Mean differences between GTZAN worst degraded and clean audio withboth attribute selection levels of all the features studied.


A.4 Variance differences of GTZAN worst degradations

0 200 400 600 800 1000 1200 14000

0.05

0.1

0.15

rp Original Mean

0 200 400 600 800 1000 1200 14000

0.05

0.1

0.15


0 200 400 600 800 1000 1200 14000

0.05

0.1

0.15


0 200 400 600 800 1000 1200 14000

0.01

0.02


0 200 400 600 800 1000 1200 14000

0.01

0.02


0 200 400 600 800 1000 1200 14000

0.01

0.02



0 10 20 30 40 50 600

5

10

rh Original Mean

0 10 20 30 40 50 600

5

10


0 10 20 30 40 50 600

5

10


0 10 20 30 40 50 600

10

20

30

rh Original Variance

0 10 20 30 40 50 600

10

20

30

rh First Selection Variance

0 10 20 30 40 50 600

10

20

30

rh Second Selection Variance


0 20 40 60 80 100 120 140 1600

50

100

150ssd Original Mean

0 20 40 60 80 100 120 140 1600

50

100

150ssd First Selection Mean

0 20 40 60 80 100 120 140 1600

50

100

150ssd Second Selection Mean

0 20 40 60 80 100 120 140 1600

5000

10000


0 20 40 60 80 100 120 140 1600

5000

10000


0 20 40 60 80 100 120 140 1600

5000

10000



0 50 100 150 200 250 300 350 4000

1

2

mvd Original Mean

0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1

2


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1


0 50 100 150 200 250 300 350 4000

1



0 200 400 600 800 10000

1

2

x 104 tssd Original Mean

0 200 400 600 800 10000

1

2

x 104 tssd First Selection Mean

0 200 400 600 800 10000

1

2

x 104 tssd Second Selection Mean

0 200 400 600 800 10000

2

4


0 200 400 600 800 10000

2

4


0 200 400 600 800 10000

2

4


_(e) TSSD. Tolerant selection level: 50%; Strong selection level: 2,5%.

0 50 100 150 200 250 300 350 4000

20

40

60trh Original Mean

0 50 100 150 200 250 300 350 4000

20

40


0 50 100 150 200 250 300 350 4000

20

40


0 50 100 150 200 250 300 350 4000

2000

4000


0 50 100 150 200 250 300 350 4000

2000

4000


0 50 100 150 200 250 300 350 4000

2000

4000



Figure A.3: Variance differences between GTZAN worst degraded and clean audiowith both attribute selection levels of all the features studied.

Appendix B

Classification of mixed degradations

In this Appendix we present all mean and variance classification results of both data sets

used in this study that belong to classification experiments of Sections 5.3.3 and 5.3.4.

The meaning of the different tables and their highlighted colours is the same for each

group:

• Mean classification results: Mean of correct classified instances percentage of all

the folds regarding the caption of the table. Highlighted values mean an improve-

ment with respect to mixed degraded classification without attribute selection.

• Difference mean classification results: Mean differences of correct classified

instances percentage of all the folds regarding the caption of the table comparing

to the same classification without attribute selection. Positive values mean im-

provement, negative values mean deterioration; a darker highlighted value means

a better improvement.

• Variance classification results: Variance between correct classified instances

percentage of all the folds regarding the caption of the table. A darker highlighted

value means a higher variance.

• Difference variance classification results: Variance differences between correct

classified instances percentage of all the folds regarding the caption of the table

comparing to the same classification without attribute selection. Positive values

mean a higher variance on the experiment with attribute selection, negative values

mean a higher variance on the experiment without attribute selection.

54

Appendix B. Classification of mixed degradations 55

B.1 Complete classification results of Section 5.3.3

(a) Mean classification results



Classifier/Feature RH RP MVD SSD TSSD TRH Classifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 43,48% 49,25% 46,91% 42,93% 40,05% 45,82% Naive Bayes 0,28% 0,07% 30,27% 30,07% 30,28% 30,07%SMO PolyKernel 48,90% 60,02% 56,44% 58,16% 58,85% 55,62% SMO PolyKernel 32,13% 0,00% 30,21% 0,34% 30,07% 30,14%SMO RBFKernel 47,26% 55,69% 51,65% 49,38% 57,75% 50,62% SMO RBFKernel 1,58% 31,10% 0,14% 30,41% 0,00% 0,21%J48 44,17% 45,75% 47,26% 48,63% 52,20% 45,96% J48 32,54% 0,41% 0,55% 0,00% 0,76% 30,41%RandomForest 48,42% 55,01% 49,24% 53,98% 54,94% 51,37% RandomForest 32,88% 1,51% 0,13% 30,27% 32,26% 1,10%KNN Euclidean 45,33% 55,28% 48,49% 54,66% 57,33% 51,10% KNN Euclidean 32,95% 30,14% 1,44% 0,07% 30,07% 30,14%KNN Manhattan 44,78% 55,14% 48,08% 56,79% 57,88% 51,23% KNN Manhattan 33,91% 30,48% 30,34% 0,34% 0,07% 31,44%




(b) Difference mean classification results







Table B.1: ISMIR mean classification results of mixed degraded audio, using cleanaudio as training set, and using only the most robust attributes with tolerant selection

(a) Variance classification results

VARIANCE%of%ISMIR%CLASSIFICATION%WITH%TRAINING%3333>%1st%Scenario%Vs.%2nd%Scenario1st$Scenario:$Using$all$the$attributesClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 0,10% 0,11% 0,03% 0,14% 0,17% 0,10%SMO PolyKernel 0,04% 0,20% 0,16% 0,08% 0,06% 0,10%SMO RBFKernel 0,03% 0,07% 0,08% 0,06% 0,08% 0,11%J48 0,13% 0,28% 0,07% 0,27% 0,08% 0,20%RandomForest 0,20% 0,08% 0,04% 0,20% 0,06% 0,13%KNN Euclidean 0,08% 0,13% 0,11% 0,07% 0,12% 0,12%KNN Manhattan 0,07% 0,14% 0,03% 0,05% 0,07% 0,18%

Difference%=%2nd%Scen%3%1st%Scen2nd$Scenario:$Most$robust$attributes$(1st$selection) Positive%=%lower%variance%in%1st%scenario





(b) Difference variance classification results







Table B.2: ISMIR variance classification results of mixed degraded audio, using cleanaudio as training set, and using only the most robust attributes with tolerant selection

• ISMIR mean classification results of mixed degraded audio, using clean audio as

training set, and using only the most robust attributes with strong selection are

presented on Tables 5.4 (Chapter 5).
















Table B.3: ISMIR variance classification results of mixed degraded audio, using cleanaudio as training set, and using only the most robust attributes with strong selection















Table B.4: GTZAN mean classification results of mixed degraded audio, using cleanaudio as training set, and using only the most robust attributes with tolerant selection



VARIANCE%of%GTZAN%CLASSIFICATION%WITH%TRAINING%4444>%1st%Scenario%Vs.%2nd%Scenario1st$Scenario:$Using$all$the$attributesClassifier/Feature RH RP MVD SSD TSSD TRHNaive Bayes 0,04% 0,11% 0,09% 0,21% 0,06% 0,05%SMO PolyKernel 0,11% 0,20% 0,08% 0,14% 0,28% 0,27%SMO RBFKernel 0,06% 0,08% 0,12% 0,12% 0,40% 0,13%J48 0,22% 0,25% 0,11% 0,19% 0,16% 0,12%RandomForest 0,13% 0,27% 0,10% 0,17% 0,10% 0,09%KNN Euclidean 0,11% 0,31% 0,12% 0,18% 0,11% 0,25%KNN Manhattan 0,04% 0,14% 0,11% 0,18% 0,11% 0,20%





Positive%=%lower%variance%in%1st%scenario








Table B.5: GTZAN variance classification results of mixed degraded audio, usingclean audio as training set, and using only the most robust attributes with tolerant

selection

• GTZAN mean classification results of mixed degraded audio, using clean audio as

training set, and using only the most robust attributes with strong selection are

presented on Tables 5.5 (Chapter 5).















Table B.6: GTZAN variance classification results of mixed degraded audio, usingclean audio as training set, and using only the most robust attributes with strong

selection


B.2 Complete classification results of Section 5.3.4















Table B.7: ISMIR mean classification results of mixed degraded audio, using cleanaudio as training set, and missing the weakest attributes with tolerant selection



Difference%=%3rd%Scen%3%1st%Scen3rd$Scenario:$Most$robust$attributes$(1st$selection) Positive%=%lower%variance%in%1st%scenario












Table B.8: ISMIR variance classification results of mixed degraded audio, using cleanaudio as training set, and missing the weakest attributes with tolerant selection

• ISMIR mean classification results of mixed degraded audio, using clean audio as

training set, and missing the weakest attributes with strong selection are presented

on Tables 5.6 (Chapter 5).
















Table B.9: ISMIR variance classification results of mixed degraded audio, using cleanaudio as training set, and missing the weakest attributes with strong selection















Table B.10: GTZAN mean classification results of mixed degraded audio, using cleanaudio as training set, and missing the weakest attributes with tolerant selection
















Table B.11: GTZAN variance classification results of mixed degraded audio, usingclean audio as training set, and missing the weakest attributes with tolerant selection

• GTZAN mean classification results of mixed degraded audio, using clean audio as

training set, and missing the weakest attributes with strong selection are presented

on Tables 5.7 (Chapter 5).















Table B.12: GTZAN variance classification results of mixed degraded audio, usingclean audio as training set, and missing the weakest attributes with strong selection

Appendix C

Attached files

The attached files set contains all the extra plots regarding to the Section 4.1, showing

mean and variance differences between all degradations used in this study and clean

audio. The plots are showed in two different display ways:

• Small-scale plots with Y-axis normalized difference values (the normalization is

applied using the highest difference attribute on all the degradations set), in order

to be able to compare different degradations for the same files.

• Individual plots with high resolution quality for each degradation without joint

normalization.

In addition, we attach two files regarding to the classification of degraded audio studied

in Section 4.2, containing all tables of mean and variance classification results for all

degradations used in this study.

61

Bibliography

[1] ISMIR. International society for music information retrieval, 2014. URL http:

//www.ismir.net/. [Online; accessed May-2014].

[2] George Tzanetakis and Perry Cook. Musical genre classification of audio signals.

Speech and Audio Processing, IEEE transactions on, 10(5):293–302, 2002.

[3] Alexander Schindler and Andreas Rauber. Capturing the temporal domain in echon-

est features for improved classification effectiveness. In Adaptive Multimedia Re-

trieval, Lecture Notes in Computer Science, Copenhagen, Denmark, October 24-25

2012. Springer.

[4] Michael I Mandel and Daniel PW Ellis. Song-level features and support vector

machines for music classification. In ISMIR 2005: 6th International Conference on

Music Information Retrieval: Proceedings: Variation 2: Queen Mary, University of

London & Goldsmiths College, University of London, 11-15 September, 2005, pages

594–599. Queen Mary, University of London, 2005.

[5] Matthias Mauch and Sebastian Ewert. The audio degradation toolbox and its ap-

plication to robustness evaluation. In Proceedings of the 14th International Society

for Music Information Retrieval Conference (ISMIR 2013), Curitiba, Brazil, 2013.

accepted.

[6] Eric Allamanche, Jürgen Herre, Oliver Hellmuth, Bernhard Fröba, Throsten Kast-

ner, and Markus Cremer. Content-based identification of audio material using mpeg-

7 low level description. In ISMIR, 2001.

[7] Enric Guaus and Perfecto Herrera. Music genre categorization in humans and ma-

chines. In Audio Engineering Society Convention 121. Audio Engineering Society,

2006.

62

http://www.ismir.net/

http://www.ismir.net/

Bibliography 63

[8] Jens Madsen. Modeling of emotions expressed in music using audio features. pages

117–142, 2011.

[9] Lindos electronics. A-weighting in detail. URL http://www.lindos.co.

uk/cgi-bin/FlexiData.cgi?SOURCE=Articles&VIEW=full&id=2. Last visited:

12/06/2014.

[10] Steven Van De Par, Armin Kohlrausch, Ghassan Charestan, and Richard Heusdens.

A new psychoacoustical masking model for audio coding applications. In Acoustics,

Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on,

volume 2, pages II–1805. IEEE, 2002.

[11] Center for Computer Research in Music and Acoustics Standford. Dolby

b, c, and s noise reduction systems: Making cassettes sound better.

URL http://www.dolby.com/uploadedFiles/English_(US)/Professional/

Technical_Library/Technologies/Dolby_A-type_NR/212_Dolby_B,_C_and_S_

Noise_Reduction_Systems.pdf. Last visited: 12/06/2014.

[12] MATLAB. version 7.12.0.635 (R2011a). The MathWorks Inc., Natick, Mas-

sachusetts, 2011.

[13] ISMIR 2004, 5th International Conference on Music Information Retrieval,

Barcelona, Spain, October 10-14, 2004, Proceedings, 2004. URL http://ismir2004.

ismir.net/genre_contest/index.html.

[14] D. Stowell D. Giannoulis, E. Benetos and M. D. Plumbley. Public dataset for

scene classification task. IEEE AASP Challenge on Detection and Classification of

Acoustic Scenes and Events, 2012.

[15] Rebecca Stewart and Mark B. Sandler. Database of omnidirectional and b-format

room impulse responses. In ICASSP, pages 165–168. IEEE, 2010. ISBN 978-

1-4244-4296-6. URL http://dblp.uni-trier.de/db/conf/icassp/icassp2010.

html#StewartS10.

[16] Thomas Lidy and Andreas Rauber. Evaluation of feature extractors and psycho-

acoustic transformations for music genre classification. In Proceedings of the Sixth

International Conference on Music Information Retrieval, pages 34–41, 2005. ISBN

0-9551179-0-9.

http://www.lindos.co.uk/cgi-bin/FlexiData.cgi?SOURCE=Articles&VIEW=full&id=2

http://www.lindos.co.uk/cgi-bin/FlexiData.cgi?SOURCE=Articles&VIEW=full&id=2

http://www.dolby.com/uploadedFiles/English_(US)/Professional/Technical_Library/Technologies/Dolby_A-type_NR/212_Dolby_B,_C_and_S_Noise_Reduction_Systems.pdf



http://ismir2004.ismir.net/genre_contest/index.html

http://ismir2004.ismir.net/genre_contest/index.html

http://dblp.uni-trier.de/db/conf/icassp/icassp2010.html#StewartS10

http://dblp.uni-trier.de/db/conf/icassp/icassp2010.html#StewartS10

Bibliography 64

[17] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann,

and Ian H. Witten. The weka data mining software: an update. SIGKDD Explor.

Newsl., 11(1):10–18, 2009. ISSN 1931-0145. doi: 10.1145/1656274.1656278. URL

http://dx.doi.org/10.1145/1656274.1656278.

http://dx.doi.org/10.1145/1656274.1656278

impact of audio degradation on music...

Documents