
Finding the genre of a song with Deep Learning — A.I. Odyssey part 1

A step-by-step guide to make your computer a music expert.

Julien Despois · AI Engineer, Enthusiast, Musician · Nov 22, 2016 · 8 min read

If you like Artificial Intelligence, subscribe to the newsletter to receive

updates on articles and much more!

One of the things we, humans, are particularly good at is classifying songs. In

just a few seconds we can tell whether we’re listening to Classical music, Rap,

Blues or EDM. However, as simple as this task is for us, millions of people still

live with unclassified digital music libraries.

. . .

The average library is estimated to have about 7,160 songs. If it takes 3

seconds to classify a song (either by listening or because you already know), a

quick back-of-the-envelope calculation gives around 6 hours to classify them

all.

If you add the time it takes to manually label the song, this can easily go up to

10+ hours of manual work. No one wants to do that.

In this post, we’ll see how we can use Deep Learning to help us in this labour-

intensive task.

Here’s a general overview of what we will do:

Extract a simplified representation of each song in the library

Train a deep neural network to classify the songs

Use the classifier to fill in the missing genres of our library

The data

First of all, we’re going to need a dataset. For that I have started with my own

iTunes library — which is already labelled due to my slightly obsessive passion

for order. Although it is not as diverse, complete or even as big as other

datasets we could find, it is a good start. Note that I have only used 2,000

songs as it already represents a lot of data.

Refining the dataset

The first observation is that there are too many genres and subgenres, or to

put it differently, genres with too few examples. This needs to be corrected,

either by removing the examples from the dataset, or by assigning them to a

broader genre. We don’t really need this Concertos genre; Classical will do the

trick.
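To make this concrete, here is a tiny sketch of the kind of mapping I mean (the genre names and the helper function are purely illustrative, not taken from the repository):

```python
# Hypothetical mapping from niche sub-genres to broader "super genres".
# Anything not in the map keeps its label; genres that remain too small
# can simply be dropped from the dataset.
SUPER_GENRES = {
    "Concerto": "Classical",
    "Symphony": "Classical",
    "Trip-Hop": "Electro",
}

def refine_genre(genre):
    return SUPER_GENRES.get(genre, genre)
```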

Too much information — Waaaaaay too much

Once we have a decent number of genres, with enough songs each, we can

start to extract the important information from the data. A song is nothing but

a very, very long series of values. The classic sampling frequency is 44100Hz

— there are 44100 values stored for every second of audio, and twice as many

for stereo.

This means that a 3 minute long song contains 7,938,000 samples per channel.

That’s a lot of information, and we need to reduce this to a more manageable

level if we want to do anything with it. We can start by mixing the track down

to mono, as the two stereo channels contain highly redundant information.

We will use the Fourier transform to convert our audio data to the frequency

domain. This allows for a much simpler and more compact representation of

the data, which we will export as a spectrogram. This process will give us a

PNG file containing the evolution of all the frequencies of our song through

time.

The 44100Hz sampling rate we talked about earlier allows us to reconstruct

frequencies up to 22050Hz — see Nyquist-Shannon sampling theorem — but

now that the frequencies are extracted, we can use a much lower resolution.

Here, we’ll use 50 pixels per second (20 ms per pixel), which is more than

enough to capture all the information we need.

NB: If you know a genre characterized by ~20ms frequency variations, you’ve

got me.

Here’s what our song looks like after the process (12.8s sample shown here).

Time is on the x axis, and frequency on the y axis. The highest frequencies

are at the top and the lowest at the bottom. The scaled amplitude of the

frequency is shown in greyscale, with white being the maximum and black the

minimum.

I have used a spectrogram with 128 frequency levels, because it contains

all the relevant information of the song — we can easily distinguish different

notes/frequencies.
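For reference, here is roughly what that preprocessing could look like in Python. I’m using librosa and Pillow purely for illustration (the repository has its own pipeline), and the mel-scaled spectrogram is an assumption rather than necessarily what the original code does:

```python
import numpy as np
import librosa
from PIL import Image

PIXELS_PER_SECOND = 50   # ~20 ms of audio per pixel column
N_BANDS = 128            # frequency levels on the y axis

def song_to_spectrogram(audio_path, png_path):
    """Render a mono spectrogram of a song as a greyscale PNG."""
    y, sr = librosa.load(audio_path, sr=22050, mono=True)   # mixes stereo down to mono
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=N_BANDS,
                                       hop_length=sr // PIXELS_PER_SECOND)
    S_db = librosa.power_to_db(S, ref=np.max)                     # log-scaled amplitude
    img = 255 * (S_db - S_db.min()) / (S_db.max() - S_db.min())   # scale to 0..255
    # Flip vertically so the lowest frequencies end up at the bottom of the image
    Image.fromarray(img[::-1].astype(np.uint8), mode='L').save(png_path)
```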

Further processing

The next thing we have to do is to deal with the length of the songs. There are

two approaches for this problem. The first one would be to use a recurrent

neural network, to which we would feed each column of the image in order.

Instead, I have chosen to exploit even further the fact that humans are able to

classify songs with short extracts.

If we can classify songs by ear in under 3 seconds, why

couldn’t machines do the same?

We can create fixed-length slices of the spectrogram, and consider them as

independent samples representing the genre. We can use square slices for

convenience, which means that we will cut down the spectrogram into

128x128 pixel slices. This represents 2.56s worth of data in each slice.
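A minimal sketch of the slicing with Pillow (the function name and the decision to drop the incomplete last slice are mine):

```python
from PIL import Image

def slice_spectrogram(png_path, slice_size=128):
    """Cut a spectrogram PNG into square slice_size x slice_size images."""
    img = Image.open(png_path)
    n_slices = img.width // slice_size      # drop the incomplete tail of the song
    return [img.crop((i * slice_size, 0, (i + 1) * slice_size, slice_size))
            for i in range(n_slices)]
```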

At this point, we could use data augmentation to expand the dataset even

more (we won’t here because we already have a lot of data). We could for

instance add random noise to the images, or slightly stretch them horizontally

and then crop them.

However, we have to make sure that we do not break the patterns of the data.

We can’t rotate the images, nor flip them horizontally, because sounds are not

symmetrical.

E.g., see those white fading lines? These are decaying sounds, which cannot be

reversed.
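If we did want to augment, a safe transformation could look like this (the noise level and the stretch factor are arbitrary values picked for the example):

```python
import numpy as np
from PIL import Image

def augment_slice(slice_img, noise_std=2.0, max_stretch=0.1):
    """Return a noisy, slightly horizontally-stretched copy of a square slice."""
    arr = np.asarray(slice_img, dtype=np.float32)
    arr = np.clip(arr + np.random.normal(0.0, noise_std, arr.shape), 0, 255)
    stretched = Image.fromarray(arr.astype(np.uint8))
    factor = 1.0 + np.random.uniform(0.0, max_stretch)          # stretch only, never flip
    stretched = stretched.resize((int(stretched.width * factor), stretched.height))
    return stretched.crop((0, 0, slice_img.width, slice_img.height))  # crop back to 128x128
```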

Choice of the model — Let’s build a classifier!

After we have sliced all our songs into square spectral images, we have a

dataset containing tens of thousands of samples for each genre. We can now

train a Deep Convolutional Neural Network to classify these samples. For

this purpose, I have used TensorFlow’s wrapper TFLearn (a minimal sketch of the network follows the implementation details below).

Implementation details

Dataset split: Training (70%), validation (20%), testing (10%)

Model: Convolutional neural network.

Layers: Kernels of size 2x2 with stride of 2

Optimizer: RMSProp.

Activation function: ELU (Exponential Linear Unit), because of the

performance it has shown when compared to ReLUs

Initialization: Xavier for the weight matrices in all layers.

Regularization: Dropout with probability 0.5
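Putting those settings together, a minimal TFLearn sketch could look like the following. The number of convolutional blocks, the filter counts and the size of the fully connected layer are my assumptions; only the 2x2 kernels, ELU, Xavier, RMSProp and dropout come from the list above:

```python
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

N_GENRES = 6  # Hardcore, Dubstep, Electro, Classical, Soundtrack, Rap

def build_model():
    net = input_data(shape=[None, 128, 128, 1])                 # one greyscale slice per sample
    for n_filters in (64, 128, 256, 512):                       # filter counts are an assumption
        net = conv_2d(net, n_filters, 2, activation='elu', weights_init='xavier')
        net = max_pool_2d(net, 2)                               # 2x2 pooling, stride 2
    net = fully_connected(net, 1024, activation='elu', weights_init='xavier')
    net = dropout(net, 0.5)                                     # keep probability 0.5
    net = fully_connected(net, N_GENRES, activation='softmax')
    net = regression(net, optimizer='rmsprop', loss='categorical_crossentropy')
    return tflearn.DNN(net)
```

Training is then a single call such as model.fit(X_train, y_train, validation_set=(X_val, y_val), n_epoch=20, batch_size=128, shuffle=True), with the slices shaped as 128x128x1 arrays and the labels one-hot encoded.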

Results — Does this thing work?

With 2,000 songs split between 6 genres — Hardcore, Dubstep, Electro,

Classical, Soundtrack and Rap, and using more than 12,000 128x128

spectrogram slices in total, the model reached 90% accuracy on the

validation set. This is pretty good, especially considering that we are

processing the songs in tiny bits at a time. Note that this is not the final

accuracy we’ll have on classifying whole songs (it will be even better). We’re

only talking slices here.

Time to classify some files!

So far, we have converted our songs from stereo to mono and created a

spectrogram, which we sliced into small bits. We then used these slices to

train a deep neural network. We can now use the model to classify a new song

that we have never seen.

We start off by generating the spectrogram the same way we did with the

training data. Because of the slicing, we cannot predict the class of the song in

one go. We have to slice the new song, and then put together the predicted

classes for all the slices.
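Reusing the earlier sketches, preparing the slices of a new, unlabelled file could look like this (the normalisation to 0..1 and the temporary PNG path are my own choices); putting the predictions together is what the voting system below handles:

```python
import numpy as np

def song_slices(audio_path, tmp_png="/tmp/spectrogram.png"):
    """Apply the exact same preprocessing to a new song as to the training data."""
    song_to_spectrogram(audio_path, tmp_png)
    return [np.asarray(s, dtype=np.float32).reshape(128, 128, 1) / 255.0
            for s in slice_spectrogram(tmp_png)]
```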

To do that, we will use a voting system. Each sample of the track will “vote”

for a genre, and we choose the genre with the most votes. This will increase

our accuracy, as we’ll get rid of many classification errors with this ensemble

learning-esque method.

NB: a 3 minute long track has about 70 slices.
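A bare-bones version of that voting system, assuming the model and helper functions sketched above (the order of GENRES must match the one-hot encoding used at training time):

```python
from collections import Counter
import numpy as np

GENRES = ["Hardcore", "Dubstep", "Electro", "Classical", "Soundtrack", "Rap"]

def predict_genre(model, audio_path):
    """Each slice votes for a genre; the genre with the most votes wins."""
    votes = Counter()
    for s in song_slices(audio_path):
        probs = model.predict([s])[0]            # softmax output for this slice
        votes[GENRES[int(np.argmax(probs))]] += 1
    return votes.most_common(1)[0][0]
```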

With this pipeline, we can now classify the unlabelled songs from our library.

We could simply run the voting system on all the songs for which we need a

genre, and take the word of the classifier. This would give good results but we

might want to improve our voting system.

A better voting system

The last layer of the classifier we have built is a softmax layer. This means that

it doesn’t really output the detected genre, but rather the probabilities of each.

This is what we call the classification confidence.

We can use this to improve our voting system. For instance, we could reject

votes from slices with low confidence. If there is no clear winner, we reject the

vote.

It’s better to say “I don’t know” than to give an answer we’re not sure of.

Similarly, we could leave unlabelled the songs for which no genre received

more than a certain fraction (70%?) of the votes. This way, we will avoid

mislabeling songs, which we can still label later by hand.
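Here is the same voting loop with both rejection rules added. The 70% threshold on the winning genre comes from the paragraph above; the per-slice confidence threshold of 0.5 is an arbitrary value for the example:

```python
def predict_genre_or_skip(model, audio_path,
                          slice_confidence=0.5, song_confidence=0.7):
    """Only confident slices vote; return None when there is no clear winner."""
    votes = Counter()
    for s in song_slices(audio_path):
        probs = model.predict([s])[0]
        best = int(np.argmax(probs))
        if probs[best] >= slice_confidence:      # reject low-confidence slices
            votes[GENRES[best]] += 1
    if not votes:
        return None                              # better to say "I don't know"
    winner, count = votes.most_common(1)[0]
    if count / sum(votes.values()) < song_confidence:
        return None                              # no genre got enough of the votes
    return winner
```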

Conclusions

In this post, we have seen how we could extract important information from a

redundant and high-dimensional data structure — tracks. We have taken

advantage of short patterns in the data, which allowed us to classify 2.56-

second-long extracts. Finally, we have used our model to fill in the blanks in a

digital library.

If you like Artificial Intelligence, subscribe to the newsletter to receive

updates on articles and much more!

(Psst! Part 2 is out!)

You can play with the code here:

despoisj/DeepAudioClassification - Finding the genre of a song with Deep Learning (github.com)

If you want to go further on audio classification, there are other approaches

which yield impressive results, such as Shazam’s fingerprinting technique or

dilated convolutions.

Thanks for reading this post, stay tuned for more!


Tags: Machine Learning, Artificial Intelligence, Deep Learning, Music, Howto


Responses


Dan Warne · Dec 5, 2016

Fantastic. I really hope Spotify hires you to add genres to their library. It is

confounding that they don’t offer this.

nikos voutouras · Dec 5, 2016

Nice work!! You can use a bandpass filter to smooth useless frequencies from

the FFT for reducing data on input. Most Fingerprint Recognition Systems use

100 Hz - 2500 kHz.

Where Angels Fear · Dec 4, 2016

Of course, I might dispute your categorisation — that’s not Dubstep, it’s crappy

EDM!!!

https://cdn-images-1.medium.com/max/800/1*pmF7vvgWQWHRGoejyCBw-w.jpeg

;)

Conversation between Thomas Haferlach and Julien Despois.

Thomas Haferlach · Nov 14, 2017

I have changed the code to do musical key classification. I basically just

changed the part that reads the genre to return the key from the metadata.

I get surprisingly high accuracy on the test set. Now I’m wondering: is it

possible that slices of the same song are both in the test set and the training

set?


Julien Despois · Nov 14, 2017

I haven’t looked back at the code, but from what I remember, that might be an

issue. In that case, the results would be a bit biased, but not completely. In

fact, a single slice would not be present in both datasets (training/test), but

slices from the same song could be. The network is then biased because the

slices might look similar.


Joshua Heagle · Dec 5, 2016

This is a very interesting concept, thanks for sharing.

Potentially you could assign genres to artists, and then use this data to weight

the votes in favour of the genre based on the artist.

Sai Kiran Kannaiah · Dec 3, 2016

Good approach! You can also use Mel cepstral coefficients to do audio

classification. That’s the traditional approach.

Sergey Zuev · Dec 4, 2016

Great post, thanks Julien!

Zeger Hoogeboom · Dec 4, 2016

Nice and fun read! Keep it up :)

Conversation with Julien Despois.

Fernando Martínez · Dec 2, 2016

Really awesome post! Thank you for sharing


Julien Despois · Dec 2, 2016

Thanks for the support!

Conversation with Julien Despois.

João Paulo Vilaça · Dec 2, 2016

Very well explained, nice post man! keep it up :)


Julien Despois · Dec 2, 2016

Glad you like it! :)


