Julien Despois · AI Engineer, Enthusiast, Musician · Nov 22, 2016 · 8 min read
Finding the genre of a song with Deep Learning — A.I. Odyssey part. 1
A step-by-step guide to make your computer a music expert.
If you like Artificial Intelligence, subscribe to the newsletter to receive updates on articles and much more!
One of the things we, humans, are particularly good at is classifying songs. In
just a few seconds we can tell whether we’re listening to Classical music, Rap,
Blues or EDM. However, as simple as this task is for us, millions of people still live with unclassified digital music libraries.
. . .
The average library is estimated to have about 7,160 songs. If it takes 3
seconds to classify a song (either by listening or because you already know), a
quick back-of-the-envelope calculation gives around 6 hours to classify them
all.
If you add the time it takes to manually label each song, this can easily go up to 10+ hours of manual work. No one wants to do that.
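The back-of-the-envelope estimate above checks out:

```python
songs = 7160            # estimated average library size
seconds_per_song = 3    # time to recognize a genre by ear
hours = songs * seconds_per_song / 3600
print(round(hours, 1))  # roughly 6 hours just to listen
```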
In this post, we’ll see how we can use Deep Learning to help us in this labour-
intensive task.
Here’s a general overview of what we will do:
Extract a simplified representation of each song in the library
Train a deep neural network to classify the songs
Use the classifier to fill in the missing genres of our library
The data
First of all, we’re going to need a dataset. For that I have started with my own
iTunes library — which is already labelled due to my slightly obsessive passion
for order. Although it is not as diverse, complete or even as big as other
datasets we could find, it is a good start. Note that I have only used 2,000
songs as it already represents a lot of data.
Refining the dataset
The first observation is that there are too many genres and subgenres, or to put it differently, genres with too few examples. This needs to be corrected, either by removing those examples from the dataset, or by assigning them to a broader genre. We don't really need this Concertos genre; Classical will do the trick.
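The merging of subgenres into broader "super genres" can be done with a simple lookup table. A minimal sketch; the mapping below is hypothetical, the actual entries depend on your own library:

```python
# Hypothetical mapping from narrow subgenres to broader ones
SUPER_GENRES = {
    "Concertos": "Classical",
    "Baroque": "Classical",
    "Trap": "Rap",
    "Drum & Bass": "Electro",
}

def broaden(genre):
    # Fall back to the original label when no broader genre is defined
    return SUPER_GENRES.get(genre, genre)

print(broaden("Concertos"))  # Classical
print(broaden("Blues"))      # Blues
```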
Too much information — Waaaaaay too much
Once we have a decent number of genres, with enough songs each, we can
start to extract the important information from the data. A song is nothing but
a very, very long series of values. The standard sampling frequency is 44,100 Hz: there are 44,100 values stored for every second of audio, and twice as many for stereo.
This means that a 3-minute song contains 7,938,000 samples per channel. That's a lot of information, and we need to reduce it to a more manageable level if we want to do anything with it. We can start by merging the stereo channels into mono, as they contain highly redundant information.
We will use the Fourier transform to convert our audio data to the frequency domain. This allows for a much simpler and more compact representation of the data, which we will export as a spectrogram. This process will give us a PNG file containing the evolution of all the frequencies of our song through time.
The 44,100 Hz sampling rate we talked about earlier allows us to reconstruct
frequencies up to 22,050 Hz — see the Nyquist–Shannon sampling theorem — but
now that the frequencies are extracted, we can use a much lower resolution.
Here, we'll use 50 pixels per second (20 ms per pixel), which is more than enough to capture all the information we need.
NB: If you know a genre characterized by ~20ms frequency variations, you
got me.
Here’s what our song looks like after the process (12.8s sample shown here).
Time is on the x axis, and frequency on the y axis. The highest frequencies
are at the top and the lowest at the bottom. The scaled amplitude of the
frequency is shown in greyscale, with white being the maximum and black the
minimum.
I have used a spectrogram with 128 frequency levels, because it contains all the relevant information of the song: we can easily distinguish different notes/frequencies.
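However the spectrogram is produced in practice, each of its columns is the frequency content of a short window of samples. A toy, pure-Python sketch of one spectrogram column via a naive discrete Fourier transform (not the actual pipeline, just the underlying idea):

```python
import math

def dft_magnitudes(frame):
    # Naive DFT: one spectrogram column is the magnitude of each
    # frequency bin over a short window of samples.
    n = len(frame)
    mags = []
    for k in range(n // 2):  # only bins up to the Nyquist frequency
        re = sum(x * math.cos(-2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(-2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

# A pure tone completing 4 cycles over the window shows up in bin 4
frame = [math.sin(2 * math.pi * 4 * i / 64) for i in range(64)]
mags = dft_magnitudes(frame)
print(mags.index(max(mags)))  # 4
```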
Further processing
The next thing we have to do is deal with the length of the songs. There are two approaches to this problem. The first one would be to use a recurrent neural network, to which we would feed each column of the image in order.
Instead, I have chosen to exploit even further the fact that humans are able to
classify songs with short extracts.
If we can classify songs by ear in under 3 seconds, why
couldn’t machines do the same ?
We can create fixed-length slices of the spectrogram and consider them as independent samples representing the genre. We can use square slices for convenience, which means we will cut the spectrogram into 128×128 pixel slices. This represents 2.56 s worth of data per slice.
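The slicing itself is a simple sliding cut along the time axis. A minimal sketch on a toy image (at 50 pixels per second, a real 128-pixel-wide slice is indeed 128 / 50 = 2.56 s):

```python
def slice_spectrogram(spectrogram, size):
    # Cut the image (rows x cols) into consecutive size x size squares
    # along the time axis; a trailing partial slice is discarded.
    cols = len(spectrogram[0])
    return [
        [row[start:start + size] for row in spectrogram]
        for start in range(0, cols - size + 1, size)
    ]

# Toy example: a 2x5 "spectrogram" cut into 2x2 slices
image = [[1, 2, 3, 4, 5],
         [6, 7, 8, 9, 0]]
slices = slice_spectrogram(image, 2)
print(len(slices))   # 2 slices; the last column is dropped
print(slices[0])     # [[1, 2], [6, 7]]
```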
At this point, we could use data augmentation to expand the dataset even more (we won't here, because we already have a lot of data). We could for instance add random noise to the images, or slightly stretch them horizontally and then crop them.
However, we have to make sure that we do not break the patterns of the data.
We can’t rotate the images, nor �ip them horizontally because sounds are not
symmetrical.
E.g., see those white fading lines? Those are decaying sounds, which cannot be reversed.
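For reference, the noise-based augmentation mentioned above (not used in the post) could look like this sketch, which perturbs each pixel slightly while leaving the time/frequency patterns intact:

```python
import random

def add_noise(image, amount=0.05, seed=0):
    # Add a small random perturbation to each greyscale pixel,
    # clamped back into the valid [0, 1] range.
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, px + rng.uniform(-amount, amount)))
             for px in row] for row in image]

original = [[0.0, 0.5, 1.0]]
noisy = add_noise(original)
# Every pixel moved by at most `amount`, so the pattern survives
print(all(abs(a - b) <= 0.05 for a, b in zip(original[0], noisy[0])))  # True
```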
Choice of the model — Let's build a classifier!
After we have sliced all our songs into square spectral images, we have a dataset containing tens of thousands of samples for each genre. We can now train a Deep Convolutional Neural Network to classify these samples. For this purpose, I have used TensorFlow's wrapper TFLearn.
Implementation details
Dataset split: training (70%), validation (20%), testing (10%)
Model: convolutional neural network
Layers: kernels of size 2×2 with stride 2
Optimizer: RMSProp
Activation function: ELU (Exponential Linear Unit), because of the performance it has shown compared to ReLU
Initialization: Xavier for the weight matrices in all layers
Regularization: dropout with probability 0.5
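The post doesn't spell out the full layer stack, but 2×2 kernels with stride 2 mean each layer halves the spatial resolution. A quick sanity check of that arithmetic (the depth of 4 is hypothetical):

```python
def conv_output_size(size, kernel=2, stride=2, padding=0):
    # Standard convolution arithmetic:
    # out = floor((in - kernel + 2*padding) / stride) + 1
    return (size - kernel + 2 * padding) // stride + 1

size = 128  # each input slice is 128x128
sizes = []
for _ in range(4):  # hypothetical depth of 4 conv layers
    size = conv_output_size(size)
    sizes.append(size)
print(sizes)  # [64, 32, 16, 8]
```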
Results — Does this thing work?
With 2,000 songs split between 6 genres — Hardcore, Dubstep, Electro,
Classical, Soundtrack and Rap, and using more than 12,000 128x128
spectrogram slices in total, the model reached 90% accuracy on the
validation set. This is pretty good, especially considering that we are
processing the songs tiny bits at a time. Note that this is not the final accuracy we'll have when classifying whole songs (it will be even better); we're only talking about slices here.
Time to classify some files!
So far, we have converted our songs from stereo to mono and created a
spectrogram, which we sliced into small bits. We then used these slices to
train a deep neural network. We can now use the model to classify a new song that it has never seen.
We start off by generating the spectrogram the same way we did with the
training data. Because of the slicing, we cannot predict the class of the song in
one go. We have to slice the new song, and then put together the predicted
classes for all the slices.
To do that, we will use a voting system. Each sample of the track will "vote" for a genre, and we choose the genre with the most votes. This will increase our accuracy, as this ensemble-learning-esque method gets rid of many classification errors.
NB: a 3 minute long track has about 70 slices.
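The voting step above is a plain majority vote over the per-slice predictions. A minimal sketch (the predictions below are made up for illustration):

```python
from collections import Counter

def vote(slice_predictions):
    # Each slice votes for one genre; the whole track gets the
    # most common prediction.
    return Counter(slice_predictions).most_common(1)[0][0]

# Hypothetical predictions for 10 slices of one track
predictions = ["Rap", "Rap", "Electro", "Rap", "Dubstep", "Rap",
               "Rap", "Electro", "Rap", "Rap"]
print(vote(predictions))  # Rap
```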
With this pipeline, we can now classify the unlabelled songs from our library. We could simply run the voting system on all the songs for which we need a genre, and take the word of the classifier. This would give good results, but we might want to improve our voting system.
A better voting system
The last layer of the classifier we have built is a softmax layer. This means that it doesn't really output the detected genre, but rather the probabilities of each. This is what we call the classification confidence.
We can use this to improve our voting system. For instance, we could reject votes from slices with low confidence. If there is no clear winner, we reject the vote.
It's better to say "I don't know" than to give an answer we're not sure of.
Similarly, we could leave unlabelled the songs for which no genre received more than a certain fraction (say, 70%) of the votes. This way, we will avoid mislabeling songs, which we can still label later by hand.
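Both rejection rules can be combined in one pass over the softmax outputs. A hedged sketch, with made-up probabilities and arbitrary thresholds:

```python
def confident_vote(slice_probs, slice_threshold=0.5, track_threshold=0.7):
    # slice_probs: one {genre: probability} dict (softmax output) per slice.
    votes = {}
    for probs in slice_probs:
        genre, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= slice_threshold:  # reject low-confidence slices
            votes[genre] = votes.get(genre, 0) + 1
    total = sum(votes.values())
    if total == 0:
        return None  # every slice was uncertain
    winner, count = max(votes.items(), key=lambda kv: kv[1])
    # Leave the track unlabelled unless one genre clearly dominates
    return winner if count / total >= track_threshold else None

slices = [{"Rap": 0.9, "Electro": 0.1},
          {"Rap": 0.8, "Electro": 0.2},
          {"Rap": 0.45, "Electro": 0.45, "Classical": 0.10}]  # rejected
print(confident_vote(slices))  # Rap
```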
Conclusions
In this post, we have seen how we could extract important information from a redundant and high-dimensional data structure: audio tracks. We have taken advantage of short patterns in the data, which allowed us to classify 2.56-second extracts. Finally, we have used our model to fill in the blanks in a digital library.
(Psst! part. 2 is out!)
You can play with the code here:
If you want to go further on audio classification, there are other approaches which yield impressive results, such as Shazam's fingerprinting technique or dilated convolutions.
Thanks for reading this post, stay tuned for more!
despoisj/DeepAudioClassification — Finding the genre of a song with Deep Learning (github.com)
Responses
Dan Warne
Dec 5, 2016
Fantastic. I really hope Spotify hires you to add genres to their library. It is confounding that they don't offer this.
nikos voutouras
Dec 5, 2016
Nice work!! You can use a bandpass filter to smooth useless frequencies from the FFT, reducing the input data. Most fingerprint recognition systems use 100 Hz - 2500 kHz.
Where Angels Fear
Dec 4, 2016
Of course, I might dispute your categorisation — that’s not Dubstep, it’s crappy
EDM!!!
;)
Conversation between Thomas Haferlach and Julien Despois.
Thomas Haferlach
Nov 14, 2017 · 1 min read
I have changed the code to do musical key classification. I basically just changed the part that reads the genre to return the key from the metadata.
I get surprisingly high accuracy on the test set. Now I’m wondering: is it
possible that slices of the same song are both in the test set and the training
set?
Julien Despois
Nov 14, 2017 · 1 min read
I haven't looked back at the code, but from what I remember, that might be an issue. In that case, the results would be a bit biased, but not completely. In fact, a single slice would not be present in both datasets (training/test), but slices from the same song could be. The network is then biased because the slices might look similar.
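A simple way to rule out the leakage discussed above is to split at the song level before slicing, so that all slices from a song land on the same side. This helper is a hypothetical sketch, not part of the original code:

```python
import random

def split_by_song(song_ids, train_frac=0.7, seed=42):
    # Assign whole songs (not slices) to train/test, so slices from
    # the same song never end up on both sides of the split.
    songs = sorted(set(song_ids))
    rng = random.Random(seed)
    rng.shuffle(songs)
    cut = int(len(songs) * train_frac)
    train_songs = set(songs[:cut])
    train = [i for i, s in enumerate(song_ids) if s in train_songs]
    test = [i for i, s in enumerate(song_ids) if s not in train_songs]
    return train, test

# Each slice carries the id of the song it came from
slice_song_ids = ["a", "a", "b", "b", "c", "c"]
train, test = split_by_song(slice_song_ids)
overlap = {slice_song_ids[i] for i in train} & {slice_song_ids[i] for i in test}
print(overlap)  # set() — no song appears in both sets
```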
Joshua Heagle
Dec 5, 2016
This is a very interesting concept, thanks for sharing.
Potentially you could assign genres to artists, and then use this data to weight
the votes in favour of the genre based on the artist.
Sai Kiran Kannaiah
Dec 3, 2016
Good approach! You can also use Mel cepstral coefficients to do audio classification. That's the traditional approach.
Sergey Zuev
Dec 4, 2016
Great post, thanks Julien!
Zeger Hoogeboom
Dec 4, 2016
Nice and fun read! Keep it up :)
Conversation with Julien Despois.
Fernando Martínez
Dec 2, 2016
Really awesome post! Thank you for sharing
Julien Despois
Dec 2, 2016
Thanks for the support !
Conversation with Julien Despois.
João Paulo Vilaça
Dec 2, 2016
Very well explained, nice post man! keep it up :)
Julien Despois
Dec 2, 2016
Glad you like it ! :)
[Figure captions from the original post: Creating super genres · DeepMind — WaveNet · Spectrogram of an extract of the song · Sliced spectrogram · Decaying sound · Convolutional neural network · Voting system · Full classification pipeline · Classification confidence · Classification rejected because of the low confidence · Track left unlabeled because of low voting-system confidence]