
BACHELOR THESIS

Vojtěch Lanz

Automatic Chord Recognition in Audio Recording

Institute of Formal and Applied Linguistics

Supervisor of the bachelor thesis: doc. Ing. Zdeněk Žabokrtský, Ph.D.

Study programme: Computer Science
Study branch: General Computer Science

Prague 2021


I declare that I carried out this bachelor thesis independently, and only with the cited sources, literature and other professional sources. It has not been used to obtain another or the same degree.

I understand that my work relates to the rights and obligations under the Act No. 121/2000 Sb., the Copyright Act, as amended, in particular the fact that the Charles University has the right to conclude a license agreement on the use of this work as a school work pursuant to Section 60 subsection 1 of the Copyright Act.

In ........................ date ........................
Author's signature


I would like to thank Zdeněk Žabokrtský for being my supervisor and for all the recommendations he gave me. I am grateful for all the time he spent leading this work. His enthusiasm for the topic was amazing and inspiring.


Title: Automatic Chord Recognition in Audio Recording

Author: Vojtěch Lanz

Institute: Institute of Formal and Applied Linguistics

Supervisor: doc. Ing. Zdeněk Žabokrtský, Ph.D., Institute of Formal and Applied Linguistics

Abstract: Transforming sheet music into sound is a straightforward task: we only have to follow certain instructions, in our case notes and their detailed descriptions. The inverse process is much more complicated. For it, the song's harmony is an essential basis: improvisations, solos, and song melodies are built on it. Automatic Chord Recognition is one of the most challenging tasks in Music Information Retrieval and has been actively researched over the last few decades. State-of-the-art algorithms rely on deep learning, and they inspired this work: we explore deep learning models and how they interact with various preprocessing methods. We also present a user-friendly web application that visualizes the chord sequence of an uploaded audio file together with its key and BPM value. The application additionally provides a more robust algorithm for unusual audio, although this model is much less accurate on typical songs than the deep learning one.

Keywords: musical harmony, chord recognition, deep learning


Contents

Introduction

1 Theoretical background
  1.1 Musical background
    1.1.1 Pitches
    1.1.2 Chords
    1.1.3 Keys and modes
    1.1.4 Tempo
  1.2 Signal processing
    1.2.1 Spectrogram
    1.2.2 Chromagram
  1.3 Machine learning
    1.3.1 Preprocessing
    1.3.2 Activation functions and Optimizers
    1.3.3 Deep Learning

2 State of the art
  2.1 Feature Representation
  2.2 Models
  2.3 Datasets

3 Experimental results
  3.1 Chord recognition
    3.1.1 Preprocessing
    3.1.2 Model architectures
    3.1.3 Datasets
    3.1.4 Evaluation scores
    3.1.5 Results
    3.1.6 Discussion
  3.2 Segmentation
    3.2.1 Approaches
    3.2.2 Discussion

4 Application structure
  4.1 Application endpoints
    4.1.1 The main page /index
    4.1.2 The visualization of the Template Voter /VisualizeTemplateVoter
    4.1.3 The visualization of the Predictors /VisualizePredictors
    4.1.4 The about page /about
    4.1.5 The about page /IncorrectInputFormat
  4.2 Template Voter
    4.2.1 Audio Source
    4.2.2 Fourier transform
    4.2.3 Graphs
    4.2.4 Chord Classifier
  4.3 Predictors
    4.3.1 Models
    4.3.2 Key prediction
    4.3.3 Beat assigning and tempo prediction

Conclusions and future works
  4.4 Future work

Bibliography

List of Figures

List of Tables

List of Abbreviations

A Attachments


Introduction

Automatic music transcription is a challenging problem that is potentially very useful for musicians. Nowadays, recordings are often more easily available than sheet music. And even when sheet music exists, it may have been created by an amateur, with inaccuracies, mistakes, or parts that are not intuitive to read. Even when a professional pianist produces sheet music for a professional guitarist, it may not be comfortable for the latter to read and play. Automatic music transcription could therefore be very helpful, and researchers have been actively exploring this area over the last few decades.

There are several sub-tasks included in the transcription problem:

• Harmony and melody recognition, which gives us all played notes.

• Detection of the song's key signature.

• Detection of the time signature and the beats per minute (BPM) value, so that notes can be divided into measures.

• Detection of the dynamic instructions of the played parts, or even timbres and textures.

Of course, there are additional tasks as well, such as recognizing the song's author, title, genre, or specific arrangement, and printing the result as a sheet. Automatic harmony recognition benefits other musical aspects too. The key signature is based on the tones played in the harmony. The melody also comes from harmony tones, complemented by non-harmony tones that make a phrase complete. We can even estimate the tempo from harmony segments by looking for patterns of chord changes and their durations. Sometimes chords are all we are looking for in music, for instance when we want to play and sing songs with friends on a train with guitar accompaniment, or when we need a basis for improvisations and solos.

This work predominantly focuses on automatic chord recognition, but we also consider key, tempo, and harmony segment recognition as valuable features for our research. The main goal is to let everyone, but primarily musicians, benefit from Music Information Retrieval research outcomes via an intuitive web application. We would also like to develop our own solution in the automatic chord recognition field. For this purpose, only major and minor chords are considered.

The first two chapters describe the necessary theoretical background and work related to this topic. The third chapter describes our machine learning experiments and compares the results of different approaches. The last chapter presents the implementation of our web application, which provides two models: the first uses the best model from the research chapter, and the second is a simple non-statistical algorithm for cases where the deep learning model fails.


1. Theoretical background

1.1 Musical background

1.1.1 Pitches

Songs are composed of tones. Usually, we consider only 108 tones of distinct frequencies. Each tone has its pitch, intensity, timbre, and duration; here we deal with pitch. From this point of view, tones are divided into nine octaves of twelve chromas each. The octaves, sorted from the lowest, are sub-contra, contra, great, small, one-lined, two-lined, three-lined, four-lined, and five-lined. The twelve chromas, also sorted from the lowest, are C, C# (or Db), D, D# (or Eb), E, F, F# (or Gb), G, G# (or Ab), A, A# (or Bb), and B. The frequency of the n-th tone is calculated as follows

$$\mathrm{freq}(n) = \sqrt[12]{2}^{\,n-69} \cdot f_{A4},$$

where $f_{A4}$ is the frequency of the one-lined A, which is around 440 Hz; the exact value depends on the circumstances. The standard tone scale begins at n = 12, which corresponds to the sub-contra C.
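As an illustration, a minimal Python sketch of this relation (assuming the MIDI numbering where n = 69 is the one-lined A):

def freq(n, f_a4=440.0):
    """Frequency in Hz of the n-th tone in equal temperament."""
    return f_a4 * 2 ** ((n - 69) / 12)

print(freq(69))   # 440.0 Hz, the one-lined A
print(freq(60))   # ~261.63 Hz, the one-lined C
print(freq(12))   # ~16.35 Hz, the sub-contra C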

A pair of tones constitutes an interval. According to the number of half-tones between the pitches of the interval, we distinguish the following basic intervals: perfect unison, minor second, major second, minor third, major third, perfect fourth, tritone, perfect fifth, minor sixth, major sixth, minor seventh, major seventh, and perfect octave, plus their variations, the so-called enharmonic changes. We consider only these thirteen intervals, with at most eleven half-tones between the two pitches, i.e., up to the perfect octave. However, there can be more half-tones between two pitches, giving intervals like the ninth, etc.

If three or more tones are played at once, we call it a chord.

1.1.2 Chords

The most basic chord is the triad, which consists of a root tone, a third, and a fifth. It does not matter how the tones are combined or which of them is the lowest or highest; such a rearrangement is called an inversion, but the triad stays the same. We name the chord after its root tone, for example C. The third can be minor or major, giving a minor or a major chord, so the chord is either C major (written simply as C) or C minor. The lowest tone can be specified as in C/G, which means that the bass tone is G and the chord built on top of it is C. The fifth can also be perfect, augmented, or diminished, which adds augmented and diminished triads to the major and minor types. These are not so common outside of jazz music, and in this work we deal only with major and minor chords.

If we want to add more tones, we can do so with a seventh, ninth, eleventh, etc., which creates a new harmonic character. We do not deal with such chords in this work; they are more connected with the already mentioned jazz music.


1.1.3 Keys and modes

Each song has its own key signature. Sometimes a modulation occurs and the key signature changes in the middle of the song. The key signature specifies the seven chromas that fit the given part of the song best. There is also another term, the key. Each song has its key, which is closely connected with the key signature: the key specifies the song's character and can often be recognized from the song's last tone. A key is composed of a key signature and a mode. There are seven basic modes: Ionian, Dorian, Phrygian, Lydian, Mixolydian, Aeolian, and Locrian. For the same key signature, all modes share the same set of fitting tones, and hence chords too; the only difference is the overall character.

1.1.4 Tempo

Other significant characteristics of a song are its rhythm and tempo. Firstly, they specify the speed: the so-called beats per minute (BPM) value says how many beats fit into one minute. Secondly, they describe the rhythmic pattern, i.e., the pattern of chords, their durations, and how much time each melodic part takes. We also distinguish the number of beats in one bar. For instance, a song played in 4/4 bars has bars of four quarter notes; 3/8 bars represent bars of three eighth notes.

We have thus gone through the very basics of music theory, which is sufficient for this text. If you are interested in deeper knowledge, we highly recommend starting with the Berklee Music Theory book by Schmeling [2005].

1.2 Signal processing

We distinguish two similar terms in signal processing. The first is the audio, the file containing an audio description and the encoded waveform. The second is the audio waveform, an array decoded from the audio that describes the wave's displacement changing over time. To process audio for Automatic Chord Recognition (ACR), we must somehow obtain the distribution of intensities along the frequency scale from the waveform. There are two commonly used methods: the Fourier transform and the Constant-Q transform (CQT).

Using these methods, we obtain the intensity of each frequency played during a specific time window, whose length and starting position we determine. The output is called a spectrogram, which can be averaged or summed into a chromagram.

1.2.1 Spectrogram

Fourier analysis

The Fourier transform decomposes the signal into individual frequencies, and its equation looks as follows


$$\hat{f}(\xi) = \int_{-\infty}^{+\infty} f(x) \cdot e^{-2\pi i x \xi} \, dx,$$

where ξ is a frequency value and f is the audio waveform function. Thanks to Cooley and Tukey [1965], we can compute a discrete version of this function in time O(n · log(n)) using the so-called Fast Fourier transform (FFT). However, when processing audio, we are always interested in only a short segment of the whole waveform, because chord labels are recognized at a given frame. Hence, we apply a convolution window function over the waveform, which keeps only the sound we are interested in, and pass the result into the FFT algorithm. This process is the Short-time Fourier transform (STFT), given for each particular section by

$$\mathrm{STFT}(t, m) = \sum_{k=0}^{N-1} c\!\left(t \cdot I - \frac{N}{2} + k\right) \cdot w\!\left(\frac{2k}{N} - 1\right) \cdot e^{\frac{-2\pi i k m}{N}}.$$

We ask for the time point t and the frequency bin m. We use the window function w, and the windows are shifted by the value I, sometimes called the hop length.
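In practice, the STFT does not have to be implemented by hand; a minimal sketch using librosa follows, where the file name and window parameters are illustrative assumptions, not values prescribed by this text:

import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=44100)            # audio waveform and sample rate
stft = librosa.stft(y, n_fft=4096, hop_length=1024)   # window length N, hop length I
spectrogram = np.abs(stft)                            # intensity per (frequency bin, time frame)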

Constant-Q transform

The advantage of the STFT method is that we obtain the intensities of all frequencies. Unfortunately, this is not exactly what we need when analyzing music: we are interested only in tone frequencies. Here the CQT is helpful; with specific parameters, it yields exactly the tone frequencies. The equation is very similar to the STFT:

$$\mathrm{CQT}(t, k) = \frac{1}{N_k} \sum_{n \le N_k} x(n - t \cdot \mathrm{hop\_length}) \cdot w(n) \cdot e^{\frac{j 2\pi n Q}{N_k}}$$

for $Q = (2^{\frac{1}{\beta}} - 1)^{-1}$ and $N_k = \left\lceil Q \cdot \frac{f_s}{f_k} \right\rceil$, where $f_k = f_0 \cdot 2^{\frac{k}{\beta}}$, k is the frequency index, $f_0$ is the minimum frequency, and β is the number of bins per octave. Today, there is a way to compute this quickly using STFT and FFT results, proposed by Brown and Puckette [1992].

1.2.2 Chromagram

Sometimes it is more useful to start with a chromagram, which contains the intensity of each chroma summed over all octaves. It is calculated either by averaging or by summing the spectrogram values mapped to the given chroma in each octave. When analyzing harmony, the list of played chromas is the most important thing, although it also helps to know which tone is the lowest (the bass or root) and which tones are less important because they more likely belong to the melody than to the harmony. We can compute the chromagram as follows:


$$C_\beta(t, b) = \sum_{o=0}^{O} \left| \mathrm{CQT}(t, b + o \cdot \beta) \right|$$

for $b \in [1, \beta]$, where β represents the number of bins per octave and we sum over all octaves O. We can compute the chromagram from the STFT result analogously.
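A hedged librosa sketch of this folding, assuming β = 12 bins per octave and seven octaves (the parameters are illustrative):

import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050)
beta = 12
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                         n_bins=7 * beta, bins_per_octave=beta))

# Sum the same chroma over all octaves, as in the equation above.
chroma = cqt.reshape(-1, beta, cqt.shape[1]).sum(axis=0)

# librosa also provides the chromagram directly:
chroma_direct = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=512)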

1.3 Machine learning

In summary, machine learning approaches are based on large amounts of provided data; in our case, a set of songs. Every new, unknown piece of audio is then compared with our dataset, and the goal is to trace similarities there. One of the machine learning models is the multi-layer perceptron (MLP) classifier. The MLP comprises several layers: the input layer, the output layer, and several hidden ones. Each layer has a specific number of units that are connected with those of the previous layer, and each such connection is assigned a weight. The goal is to set the weights so as to minimize the deviation between the output layer and the results we expect.

Figure 1.1: Sketch of an MLP with an input layer, one hidden layer, and an output layer.

1.3.1 Preprocessing

First, we need to prepare the data for the input layer. The goal is to simplify the data as much as possible while keeping all relevant information in them. For instance, an audio waveform says almost nothing about the song's harmony, so we can analyze the audio with the STFT to get a spectrogram, from which we can read all frequencies connected with a particular time frame. The spectrogram can therefore be considered as an input to a machine learning model.

Besides the CQT spectrogram, we can also generate a chromagram and pass it to the model. In any case, we can further simplify our data features


for model training with other preprocessing functions. One of them is the Standard Scaler. This method standardizes data features (in our case, the spectrogram) by removing the mean and scaling to unit variance, so all data features are numerically normalized. As a consequence, our songs become more similar to each other, and it is simpler for the model to recognize similar songs and their chords.
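A small scikit-learn sketch of this standardization step (the feature matrix here is random, purely for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

features = np.random.rand(1000, 128)         # e.g. 1000 frames x 128 spectrogram bins
scaler = StandardScaler()
normalized = scaler.fit_transform(features)  # zero mean and unit variance per feature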

1.3.2 Activation functions and Optimizers

We already know that the data features form the input to the model. But what happens inside the model during training or evaluation? Each unit, or neuron, of the neural network has a value, obtained as the sum of all weighted unit outputs from the previous layer, where each connection between two units of adjacent layers has its own weight. The unit's output could simply be that sum; however, we can apply a function that computes a more complex output value and can improve model performance. We call it an activation function, and these three are the most common:

• ReLU: $\mathrm{ReLU}(x) = \max\{0, x\}$

• Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$

• Tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

The last layer is more specific than the hidden or input layers: its output should determine the result. The first possible type of output is simply a real number, for which only one output unit is needed. Another type is classification, where the output layer has as many units as there are classes. The neural network then predicts how probable each class is, with the outputs of all units summing to one. We can compute this with the following function:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}}$$

for J output units, where $x_i$ denotes the i-th unit.

To train the weights of the unit connections, an optimizer is used. An optimizer is an algorithm that takes the network output, compares it with the correct data, and determines how different they are. To minimize this difference, i.e., the value of the loss function, the gradient of the error function is computed, and all weights are moved (increased or decreased) in the direction that descends the loss surface. The two most used optimizers are Adam and Stochastic Gradient Descent.
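To make the preceding definitions concrete, a small NumPy sketch of the activation functions and the softmax output (illustrative only):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# np.tanh already implements the Tanh activation.

def softmax(x):
    e = np.exp(x - np.max(x))     # shifting by the maximum keeps the exponents stable
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))            # class probabilities, summing to one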


1.3.3 Deep Learning

It makes sense that more variables, i.e., weights, allow the model to adjust its parameters for prediction better. We can increase the number of weights by adding units and layers. When we work with such more complex, multi-layered models, we speak of deep learning. A deeper model indeed has a better chance of learning the training data well, but we need to avoid so-called overfitting: the model predicts the training data very well but fails on new data that differs even slightly. Deep learning offers several types of layers that try to deal with this issue.

We will mention two layer types that help to take the surrounding context into account: convolutional and recurrent layers.

Convolutional layers

Convolutional networks are a good choice for image or audio processing. A convolutional layer provides local interactions, parameter sharing, and shift invariance among nearby data points. A kernel template is applied to each possible area of the input data, and each value of the output includes information from the surrounding context. The simplest case is the 1D convolutional layer, whose example is shown in Figure 1.2.

Figure 1.2: 1D convolutional layer considering one neighbour on each side.

We can extend the same parameter-sharing idea to the 2D convolutional layer, illustrated in Figure 1.3.

Recurrent layers

For sequential data, such as speech recordings or texts, a recurrent layer is a good choice. The recurrent layer contains several recurrent units; the model starts with the first one and goes through them one by one. During the process, each unit combines the memory of previous calculations with the layer's input value. Two types of recurrent layers are mostly used. The first is a layer of gated recurrent units (GRU) with two inputs, one being a memory value and the other an input value, and a single output value. Its diagram is shown in Figure 1.4.

The second is the long short-term memory (LSTM) unit, which is a little more complex. Its coefficients, memory update, and computation process are described in Figure 1.5.


Figure 1.3: Simple example of a 2D convolutional layer with a 2x2 kernel [Goodfellow et al., 2016].

Figure 1.4: Diagram of GRU [Abdulwahab et al., 2017].

Figure 1.5: Diagram of LSTM unit [Hrnjica, 2019].
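A minimal Keras sketch (an assumption for illustration, not a model from this thesis) combining the two layer types discussed above on a chromagram-like sequence of 12 features per frame:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 12)),                         # (time frames, chroma bins)
    tf.keras.layers.Conv1D(16, kernel_size=3, padding="same",
                           activation="relu"),                # local context around each frame
    tf.keras.layers.Bidirectional(
        tf.keras.layers.GRU(32, return_sequences=True)),      # sequential context in both directions
    tf.keras.layers.Dense(25, activation="softmax"),          # e.g. 25 chord classes per frame
])
model.summary()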


2. State of the art

This chapter gives an overview of the last few decades of work on the ACR task. Very active research and a community around the ACR topic started in 1999, when Fujishima published an article on chord recognition of musical sounds. He proposed to create a chromagram of 12 features, consider all possible chord templates, and choose the most similar one. This idea inspired us in Section 4.2 of Chapter 4. However, the field has evolved since then, and statistical models became popular. Some approaches explored the Hidden Markov model (HMM), others convolutional neural networks (CNN) or recurrent neural networks (RNN). The community also collected several datasets. In the beginning, the question of predicting triad chords was investigated; nowadays, more researchers study how to estimate large-vocabulary chords that also include sevenths and so on. There are a few problems ACR researchers have to address: first, the feature representation that the model takes as input; second, the model itself; and finally, which datasets should be used.

2.1 Feature Representation

The audio waveform itself does not say much about the harmony. To give our model all relevant information, we have to extract features from the audio signal. We already know that the chromagram is one option; as mentioned, Fujishima [1999] used this feature representation, based on the Discrete Fourier transform spectrogram. A chromagram based on the CQT spectrogram is possible as well. Chen and Su [2019] use the non-negative-least-squares chromagram computed with the Chordino VAMP plugin [Mauch and Dixon, 2010]; their feature set contains 24 features, 12 bass chromas and 12 treble chromas. The advantage of these feature representations is that they are already part of the McGill Billboard dataset, so we do not have to generate them ourselves. In addition to the graphs already mentioned, the Tonnetz space was considered as a feature representation by Humphrey et al. [2012] and Carsault et al. [2018]. The significant advantage of this space is that it stores and reflects the similarity of certain tones and chords; for instance, the perfect fifth is much more similar to the root tone than its major seventh. There is a physical explanation for this: each instrument also generates multiples of the played frequency, where the perfect fifth corresponds to the third multiple and the major seventh to a much higher one.

With the advent of deep learning models, the idea of the chromagram representation persists; on the other hand, the spectrogram is used as input to neural networks more often. A spectrogram can contain a lot of noise that we would like to clean up, so some researchers apply principal component analysis (PCA) to it [Boulanger-Lewandowski et al., 2013, Zhou and Lerch, 2015]. Overall, feature extractor models and generators have become popular: either there is one extra model that generates the chromagram separately, or the feature extraction into a chromagram is directly part of the model. An example of the first option is described by Korzeniowski and


Widmer [2016b], where a logarithmic quarter-tone spectrogram is computed and reduced to a chromagram that Korzeniowski and Widmer [2016a] use. Deng and Kwok [2016] proposed something similar, but they used an STFT-based spectrogram reduced via the non-negative-least-squares method. In the paper of Deng and Kwok [2017], the chromagram was decoded using a Gaussian-HMM model, and the result was the input to the ACR model. When the feature extractor is part of the model, for instance as an acoustic model or so-called encoder, articles work with an STFT spectrogram passed through a logarithmically spaced triangular filter bank [Korzeniowski and Widmer, 2018, Nadar et al., 2019] or with a CQT spectrogram [Zhou and Lerch, 2015, Sigtia et al., 2015, McFee and Bello, 2017, Jiang et al., 2019]. One paper tried a cepstrum spectrogram as the model input to support and highlight the root note of a played chord [Yang et al., 2016]. The real cepstrum, i.e., the inverse discrete Fourier transform of a log-magnitude spectrum, can be computed as follows:

$$F^{-1}(\log(|F(x)|)),$$

where F denotes the Fourier transform and $F^{-1}$ its inverse.
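A hedged NumPy sketch of this real cepstrum for a single analysis frame (our own illustration, not the implementation of the cited paper):

import numpy as np

def real_cepstrum(frame):
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # small constant avoids log(0)
    return np.fft.ifft(log_magnitude).real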

2.2 Models

The most intuitive model, and the kickoff for the ACR community, is the template-matching system [Fujishima, 1999, Hausner, 2014]: a set of chord templates from the chord vocabulary is compared with the features, and the most similar one is chosen. However, models based on deep learning currently predominate. Neural networks can process much more complex audio and its spectrogram than earlier methods. Music is a complicated domain where all frequencies played at one moment are somehow connected with those played around them. For instance, we can play chords as arpeggios, and then the whole arpeggio duration matters. There can also be non-trivial changes between two different parts of a track that represent the same harmony or even melody. Therefore, CNNs appear often. A CNN with various loss function formulations is presented by Carsault et al. [2018]. CNN-based models with extra classification layers are described by Zhou and Lerch [2015] and Korzeniowski and Widmer [2018]. Zhou and Lerch [2015] discussed the differences between a Support Vector Machine, an HMM, and a simple argmax applied to the CNN's softmax output to improve accuracy and smooth wrongly segmented chord sub-sequences; Korzeniowski and Widmer [2018] researched conditional random fields (CRF) for the same purpose.

The community splits the architecture into two separate parts: the acoustic model and the language model. The purpose of the acoustic model is to process the audio's spectrogram and extract features from it. The language model considers context information and predicts the entire chord sequence of the song. Each song has patterns that repeat in some way, and many songs are composed primarily of three basic chords: tonic, subdominant, and dominant. RNNs are therefore an excellent choice. Sigtia et al. [2015] replaced the HMM in the language model with an RNN and applied extra


beam search smoothing. Another language model approach is described by Boulanger-Lewandowski et al. [2013], where the acoustic model is a deep belief network whose layers are pre-trained in an unsupervised way to filter out noise and unwanted pitches. HMM, beam search, and dynamic programming are analyzed there as language model smoothing.

New deep learning improvements are often quickly applied in the automatic chord estimation field as well. One RNN improvement was the bidirectional LSTM [Deng and Kwok, 2017] or GRU [Korzeniowski and Widmer, 2018]. Korzeniowski and Widmer [2018] also came up with a third model, the so-called duration model: the acoustic model is a VGG-style CNN, which is the base both for an RNN-GRU language model predicting the next chord and for the duration model predicting whether the next chord will be different. Another significant discovery in the deep learning community was the RNN with attention and the transformer. Chen and Su [2019] were inspired by this idea and used an encoder-decoder harmonic transformer, where the encoder performs chord segmentation and the decoder performs chord recognition with the encoder output as additional information. Still, the HMM is not wholly obsolete here: an architecture with an HMM for forced alignment and a deep neural network for automatic chord estimation was published in 2019 [Wu et al., 2019].

An influential idea that helps a lot in ACR research, mainly when a large chord vocabulary is considered, is to predict the root (the tone the chord is named after), the chord pitches, and the bass tone separately. In McFee and Bello [2017], an encoder consisting of a CNN with a bidirectional GRU is connected to a decoder that predicts chord pitches from individual encoder layers, and all this information together classifies the chord itself. The same pitch separation idea based on a CNN encoder was compared with basic chord classification by Nadar et al. [2019]. Jiang et al. [2019] explored these thoughts as well: the encoder is a CNN-based feature extractor with a bidirectional LSTM, and the decoder is a CRF predicting the root together with the third, seventh, ninth, eleventh, etc.

2.3 Datasets

The more data samples we have in the training dataset, the better scores we can hope to get from deep learning models. To achieve this, someone has to manually or automatically produce audio data and its chord annotations for every single time segment. Some datasets have already been collected and annotated. So-called LAB files are commonly used to describe a song's chords with time alignment; see the first 15 seconds of Queen's song Crazy Little Thing Called Love as an example in Figure 2.1. The first two columns specify the time interval, and the third one is a chord description. However, none of the existing ACR datasets also contains audio files, so if we want to create data features such as spectrograms from them, we need to obtain the audio separately on our own. Another issue with chord annotations is that sometimes it is not clear which chord is actually played or where one chord ends and another starts; in other words, the annotations are not, and may not even be able to be, one hundred percent correct. Moreover, datasets are not consistent about the chromas B and B flat: some countries use B and Bb, others H and B respectively. Annotations mix these two conventions, and we have no way to programmatically recognize whether B means


0.000   0.273   N
0.273   9.535   D:maj
9.535   11.102  G:maj
11.102  11.820  C:maj
11.820  12.702  G:maj
12.702  15.808  D:maj

Figure 2.1: A few lines of Crazy Little Thing Called Love chord LAB file.

B or B flat. Still, these datasets are at least a good starting point, and they are listed below.

• Isophonics dataset^1
The dataset contains chord, key, segmentation, and beat annotations for 225 songs by the Beatles, Queen, Zweieck, and Carole King.

• McGill Billboard dataset^2
The Billboard dataset provides chord and key annotations for 890 American popular songs released between 1950 and 1990. It also includes each song's title, artist, and metre. Although audio files are not included, the authors generated NNLS chroma features that can be used instead.

• RWC Music Database^3
A mixture of 20 American 1980s hits and 80 Japanese popular songs. Only chord annotations are provided.

• US Pop 2002^4
One hundred ninety-five popular American hits were collected and annotated here. Annotations consist only of chords.

• Robbie Williams dataset^5
There are 65 key- and chord-annotated songs by Robbie Williams.

• BPS-FH dataset^6
The Beethoven Piano Sonata with Functional Harmony dataset consists of chord, beat, and note annotations of the 1st movements of Beethoven's 32 piano sonatas. Annotations are provided in xlsx format rather than as LAB files.

• WJD dataset^7
The Weimar Jazz Database includes 456 jazz songs with scores, chords, notes, beats, etc. The data is not stored in LAB files; the dataset has its own, more complicated structure.

^1 http://isophonics.net/content/reference-annotations
^2 http://ddmal.music.mcgill.ca/billboard
^3 https://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation/
^4 https://github.com/tmc323/Chord-Annotations
^5 https://www.researchgate.net/publication/260399240_Chord_and_Harmony_annotations_of_the_first_five_albums_by_Robbie_Williams
^6 https://github.com/Tsung-Ping/functional-harmony
^7 https://jazzomat.hfm-weimar.de/


These datasets are built from authentic recordings of real popular songs, jazz standards, or classical sonatas that were manually annotated; the data are therefore realistic, with all their noise and tuning issues. On the other hand, because training data are rare and human annotations imprecise, synthetic data can come in handy as well. We refer to one more dataset [Nadar et al., 2019] that was generated from two MIDI files covering all possible seventh chords and their inversions and alternatives. The dataset's name is IDMT SMT CHORDS^8, and the authors provide both piano and guitar sounds.

^8 https://www.idmt.fraunhofer.de/en/business_units/m2d/smt/chords.html


3. Experimental results

The experiments started without any previous ACR experience or prior work of our own. The main goal is to explore several possibilities and see which ones work and which do not. The first experimental part focuses on chord recognition: finding the best way to predict the chord of a time frame as accurately as possible. The second practical part is about determining chord segment lengths and borders. The segment information can help in post-processing and improve prediction accuracy, or when we want to visualize chords in a readable form, for example in song bars; otherwise, a single segment mistake could shift the rest of the chord visualization and make the result unreadable.

3.1 Chord recognition

We started with a simple MLP classifier provided by Pedregosa et al. [2011] and explored several preprocessing options with this default approach. To increase the score, we then designed more complex neural network architectures using deep learning methods; in particular, convolutional and recurrent layers were considered and implemented using the software developed by Abadi et al. [2015].

3.1.1 Preprocessing

First, we need to extract data features from the audio waveforms. After that, we can apply functions to those data that may help in model training; we focus on these two steps in the subsections below. We also need to adapt the annotations to our research. The chord vocabulary is formed by all 24 major and minor chords plus one extra None chord. All augmented and diminished chords are mapped to the None chord, and all sixth, seventh, ninth, etc. chords are simplified to their triad form, so only the third and the bass tone determine the chord label. The annotation labels from the annotation files, which are in the LAB format, are modified accordingly.
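A minimal sketch (an assumption, not the thesis code) of reading a LAB file and mapping its labels onto this 25-class vocabulary:

def simplify_chord(label):
    if label == "N":
        return "N"                              # the extra None chord
    root, _, quality = label.partition(":")
    if quality.startswith("min"):
        return root + ":min"                    # e.g. A:min7 -> A:min
    if quality.startswith("aug") or quality.startswith("dim"):
        return "N"                              # augmented and diminished chords -> None
    return root + ":maj"                        # majors, sevenths, ninths, ... -> major triad

def load_lab(path):
    annotations = []
    with open(path) as f:
        for line in f:
            start, end, label = line.split()
            annotations.append((float(start), float(end), simplify_chord(label)))
    return annotations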

Generating features

The WAV audio files were loaded and transformed into spectrograms using the Python library librosa [Brian et al., 2015]. We compared two different spectrograms, the log Mel and the CQT. The log Mel spectrogram turned out to be the better choice for the MLP classifier; on the contrary, the CQT spectrogram works better for the convolutional recurrent neural network (CRNN) models. Let us discuss the MLP and CRNN input formats.

Data features for the MLP are represented as a flattened window of several log Mel spectrogram frames around the target frame, so that context is kept. For loading audio files and generating spectrograms, we used a sample rate of 44 100 Hz and a hop length of 1024 samples. The window size is 11 frames, five on the left and five on the right, where 22 further spectrogram frames between every two neighbors in the window are skipped; the context is thus somewhat larger but less dense. All 11 spectrogram frames are


flattened into a one-dimensional array. The log Mel spectrogram is generated with 128 bins and 16 384 FFT samples.

Data features for the CRNN are represented as a sequence of individual CQT spectrogram frames. When loading audio files and generating the spectrogram, a sample rate of 22 050 Hz and a hop length of 512 samples were set. Each song is divided into sequences of 1000 spectrogram frames, which corresponds to approximately 23 s [Jiang et al., 2019]; the last sequence of a song is padded with None chords and zero-valued spectrogram frames. The CQT spectrogram is generated with 252 bins, where 36 bins correspond to one octave.
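A hedged librosa sketch of generating both feature types with the parameters stated above; the exact windowing details of the thesis code may differ:

import librosa
import numpy as np

# Log Mel spectrogram for the MLP classifier.
y, sr = librosa.load("song.wav", sr=44100)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=16384,
                                     hop_length=1024, n_mels=128)
log_mel = librosa.power_to_db(mel)                         # shape (128, n_frames)

def mlp_window(spec, center, skip=22, half=5):
    # 11 frames around the center, 22 frames skipped between neighbours.
    idx = [np.clip(center + i * (skip + 1), 0, spec.shape[1] - 1)
           for i in range(-half, half + 1)]
    return spec[:, idx].T.flatten()                        # 11 * 128 = 1408 features

# CQT spectrogram for the CRNN models.
y, sr = librosa.load("song.wav", sr=22050)
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512,
                         n_bins=252, bins_per_octave=36))  # shape (252, n_frames)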

Feature preprocessing

We considered two preprocessing methods. The first is the Standard Scaler implemented by Pedregosa et al. [2011], which normalizes feature values. The second is PCA decomposition, also provided by Pedregosa et al. [2011]. The MLP classifier measurements consider both and compare them with the situation without any preprocessing function. The CRNN measurements, on the other hand, apply only the Scaler method, because the other possibilities do not work well at all.

Another preprocessing method proposed in this work is transposing all songs to the one key signature that has no sharps and no flats. All songs in major keys are transposed to C major, all songs in minor keys to A minor, and songs in the remaining modes analogously. The audio file can be shifted by any number of half-tones using the librosa library [Brian et al., 2015]. The information about a song's key is taken from the dataset annotations. Another way to estimate a song's key is to consider all of the song's chords and choose the key signature that they fit best; we use and describe this approach in Chapter 4, where we want the best model to predict chords of songs in any key signature.
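A minimal sketch of the transposition with librosa; the key root would come from the dataset annotation, and the chord labels have to be shifted by the same number of half-tones (an assumption for illustration):

import librosa

CHROMAS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose_audio(y, sr, key_root, minor=False):
    target = CHROMAS.index("A") if minor else CHROMAS.index("C")
    shift = target - CHROMAS.index(key_root)
    if shift < -6:                 # never shift by more than a tritone
        shift += 12
    elif shift > 6:
        shift -= 12
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=shift)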

3.1.2 Model architectures

The first experiment focuses on a simple MLP classifier model with the default settings of Pedregosa et al. [2011]. The model contains 100 hidden units with ReLU activation and is trained with the Adam optimizer with an initial learning rate of 0.001, a first beta of 0.9, a second beta of 0.999, and an epsilon coefficient of 1e-8. The output layer is a softmax over 25 classes, one class per chord.

Figure 3.1: The architecture of our MLP classifier (input layer of 1408 units, hidden layer of 100 units, softmax output of 25 units).
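A sketch of this classifier with the default scikit-learn settings listed above; X_train and y_train stand for the flattened 1408-dimensional windows and their chord labels:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(100,), activation="relu",
                    solver="adam", learning_rate_init=0.001,
                    beta_1=0.9, beta_2=0.999, epsilon=1e-8)
# clf.fit(X_train, y_train)
# predicted_chords = clf.predict(X_test)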

The second part of the research explores CRNN models, all implemented in


Python using the tensorflow library [Abadi et al., 2015]. The first part of each architecture is a feature extractor formed by convolutional layers, followed by the second part, an RNN trained on the song's chord sequence structure. All of our CRNN models are again optimized with Adam, with a learning rate of 0.001, a first beta of 0.9, a second beta of 0.999, and an epsilon of 1e-07. All of them, except the one using a CRF, are trained with the sparse categorical cross-entropy loss.

• Basic CRNN
The basic CRNN architecture contains six convolutional layers, each followed by batch normalization, one max pooling layer, and two bidirectional GRU layers, each with a dropout of 0.5. (A Keras sketch of this architecture is given after this list.)

Figure 3.2: The architecture of our Basic CRNN: three Conv2D layers (16 filters, 3x3, ReLU) with batch normalization, MaxPooling2D (1,3), three Conv2D layers (32 filters, 3x3, ReLU) with batch normalization, a bidirectional GRU (128 units), a bidirectional GRU (16 units), and a softmax output.

• CRNN based on EfficientNet
Instead of the simple convolutional part, EfficientNet [Tan and Le, 2019] is used as the feature extractor.

Figure 3.3: The architecture of the CRNN using EfficientNet B0 as the feature extractor, followed by bidirectional GRUs (128 and 16 units) and a softmax output [Tan and Le, 2019].

• CRNN with a CRF as the output layer
Instead of the last RNN layer, a CRF output layer is used, as proposed by Jiang et al. [2019].


Figure 3.4: The architecture of the CRNN finished by a CRF: the convolutional stack of the Basic CRNN, a bidirectional GRU (128 units), a dense layer (512 units), and a CRF output layer.

• Bass-Third CRNN model
The design contains two models with the Basic CRNN architecture: the first is trained to predict the bass, the second to predict the third. Their results are combined to form the chord.

Figure 3.5: The architecture of two of our CRNN models combined (a Bass-Third separator feeding two Basic CRNNs, one predicting the bass and one the third, followed by a triad compounder).
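The following Keras sketch of the Basic CRNN is an approximation: the filter counts, pooling size, and GRU widths follow Figure 3.2, while the reshaping between the convolutional and recurrent parts and other details are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def basic_crnn(n_frames=1000, n_bins=252, n_chords=25):
    inputs = tf.keras.Input(shape=(n_frames, n_bins, 1))
    x = inputs
    for filters in (16, 16, 16):
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(1, 3))(x)
    for filters in (32, 32, 32):
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    # Collapse the frequency axis so that each time frame becomes one feature vector.
    x = layers.Reshape((n_frames, -1))(x)
    x = layers.Bidirectional(layers.GRU(128, return_sequences=True, dropout=0.5))(x)
    x = layers.Bidirectional(layers.GRU(16, return_sequences=True, dropout=0.5))(x)
    outputs = layers.Dense(n_chords, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model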

3.1.3 Datasets

For evaluation and measurement, only the Beatles dataset was used. The Beatles dataset is part of the Isophonics dataset^1 and contains 180 Beatles songs from 12 albums. The advantage is that all songs have a similar sound and similar recording quality. On the other hand, the Beatles are known for complicated harmonies, at least in comparison with popular music in general. Therefore, every

^1 http://isophonics.net/content/reference-annotations


two songs can be completely different from the harmonic point of view. For testing and evaluation, 36 pieces were picked: all first and second tracks of the Beatles CDs plus some third tracks to reach the desired size. The rest of the Beatles songs form the training set. The dataset contains chord and key annotations; the key information is used for the dataset transposition.

Overall, the training dataset contains approximately 6 hours and 27 minutes of music. That corresponds to 1075 sequences of 1000 chords as input to the CRNN models and 999 412 windows as MLP training data. The test dataset contains approximately 1 hour and 40 minutes, i.e., 276 sequences for the CRNN models and 258 222 windows for the MLP classifier. The baseline on the transposed dataset is 26.87 %, obtained when all chords are predicted as the tonic, i.e., C major.

As already discussed, the strength of deep learning lies in large datasets. We could use many other datasets, but to keep training time and RAM consumption manageable, we had to limit the dataset size.

For the web application, we trained the model on all Isophonics songs to increase its performance; that means we used 225 songs by the Beatles, Queen, Carole King, and Zweieck. The application applies some extra post-processing to make the result more readable, and therefore we have not evaluated the model trained on the Isophonics dataset with that post-processing.

3.1.4 Evaluation scores

The primary score for us is the accuracy of correctly predicted labels of individual chords, generated in the manner described in Subsection 3.1.1. To compare our results with the state-of-the-art models, we also use the scores computed by the mir_eval library [Raffel et al., 2014]. The triads score compares the durations of triad chords, in our case only majors and minors. The over-segmentation score is computed as one minus the directional Hamming distance between the estimated and reference chord intervals; it lies between zero and one, where one means no over-segmentation at all. The under-segmentation score is computed similarly, only with the estimated and reference intervals switched, and a value of one means no under-segmentation. The segmentation score is the minimum of the two, and one means that the segmentation is perfect.
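A minimal sketch of computing these scores with mir_eval from reference and estimated LAB files (assumed usage):

import mir_eval

ref_intervals, ref_labels = mir_eval.io.load_labeled_intervals("reference.lab")
est_intervals, est_labels = mir_eval.io.load_labeled_intervals("estimated.lab")

scores = mir_eval.chord.evaluate(ref_intervals, ref_labels,
                                 est_intervals, est_labels)
print(scores["triads"], scores["underseg"], scores["overseg"], scores["seg"])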

3.1.5 Results

The MLP classifier results with different preprocessing functions are listed in Table 3.1. We applied the Standard Scaler and PCA to preprocess and simplify the data for the model, and we compared these preprocessing functions on the original and the transposed Beatles songs. The accuracy in the table is the fraction of chords the MLP model predicted correctly.

All evaluation results of the CRNN models described in Subsection 3.1.2 are listed in Table 3.2. We are interested in the difference between training on the original and on the transposed songs. We preprocess the Beatles dataset with the Standard Scaler function. Every model was trained for 100 epochs, almost all of them with a batch size of 32 sequences; only the Bass-Third model was trained with a batch size of 128 in order to speed up training on a Tensor Processing Unit. The evaluation accuracy is measured as the fraction of


MLP Classifier with various preprocessing functions

Preprocessing functions | Chord Accuracy
OS                      | 26.04 %
OS, Scaler              | 27.75 %
OS, PCA                 | 29.02 %
TS, Scaler              | 30.67 %
TS, PCA                 | 32.03 %
TS, Scaler, PCA         | 32.26 %

Table 3.1: MLP classifier results with different preprocessing functions on the original songs (OS) and transposed songs (TS).

correctly predicted chords. The rest of the scores in the table are provided by the mir_eval library [Raffel et al., 2014], where the triads are limited to major, minor, and None chords.

CRNN models and their scores

Model            | Accuracy | Triads | UnderSeg | OverSeg | Seg
Bass-Third, OS   | 49.79 %  | 14.65  | 13.22    | 1.0     | 13.22
Bass-Third, TS   | 54.74 %  | 14.65  | 13.22    | 1.0     | 13.22
CRF, OS          | 56.48 %  | 14.65  | 13.22    | 1.0     | 13.22
CRF, TS          | 57.99 %  | 14.65  | 13.22    | 1.0     | 13.22
EfficientNet, OS | 56.48 %  | 57.76  | 84.20    | 71.37   | 69.54
EfficientNet, TS | 58.61 %  | 60.33  | 84.88    | 72.83   | 72.17
Basic, OS        | 57.72 %  | 58.72  | 83.26    | 73.69   | 71.67
Basic, TS        | 59.67 %  | 61.28  | 84.54    | 72.98   | 72.21

Table 3.2: Results of various CRNN models evaluated on the original (OS) and transposed (TS) datasets. All figures can be interpreted as percentages; Subsection 3.1.4 describes the scores in detail.

We can see the Beatles dataset's chord distribution and the places where predictions go wrong in the confusion matrices displayed in Figure 3.6. The matrix on the left is generated from the Basic CRNN model's predictions on the original songs, where almost all chords of the scale appear. The matrix on the right is also generated from the Basic CRNN model's predictions, but this time the model was trained and evaluated on the transposed songs; this is why predominantly C major, F major, G major, and A minor are highlighted.

3.1.6 Discussion

As we can see, the dataset transposition works quite well. All models trained and evaluated on the original dataset have significantly worse accuracy and triads scores than those using transposed songs. The PCA or Standard Scaler cannot replace the dataset transposition approach; the MLP measurements show that the Standard Scaler and PCA have a similar effect. With the transposition approach, models have fewer potential chords to learn and can focus more on their harmony functions. Furthermore, there are only a few important


Figure 3.6: The confusion matrix of Basic CRNN predictions on the original dataset (left) and on the dataset transposed to C major (right).

chords, primarily the tonic, subdominant, and dominant, which predominate in most songs. When a model learns to predict those harmony functions, it will predict many labels correctly.

The CRNN architecture improves the performance a lot, especially the variants with a language model at the end of the structure. The CRF and the chord compounding in the Bass-Third model cannot take the harmonic structure of the song sequence into account as well as an RNN layer can; therefore, the accuracies of these models in Table 3.2 are still close to the rest of the models, but the other scores, which also measure chord continuity, are not. Comparing our own CNN proposal, the Basic one, with EfficientNet [Tan and Le, 2019], ours performs better. Comparing our results with the MIREX scoreboard^2 (Music Information Retrieval Evaluation eXchange), we are approaching their numbers; some algorithms are worse, some better. We are not using all songs from the Isophonics dataset, only the Beatles, so there could be room for improvement in extending the dataset. Another possible improvement would be adding a language layer to the Bass-Third model. Of course, using segmentation and tempo information could also help.

3.2 Segmentation

The primary motivation for the segmentation research in this work is to recognize harmony changes so that chords can be mapped into bars correctly: we would like to know how long each chord segment is and when it starts. The secondary motivation is to help models predict chord labels. In other words, the segmentation can be used both as an extra data feature for the ACR model and as a post-processing function that fixes inaccuracies. We proposed several CNN-based architectures, which we compare with the librosa functionality [Brian et al., 2015].

^2 https://www.music-ir.org/mirex/wiki/2020:Audio_Chord_Estimation_Results


3.2.1 Approaches

CRNN model, zeros versus ones

The first approach was designed as a CRNN model that predicts sequences of ones and zeros: a predicted zero means no chord change between two frames, a one means a harmonic change. We used a model with the architecture sketched in Figure 3.7, with Adam as the optimizer and sparse categorical cross-entropy as the loss function. We divided the Isophonics dataset into a training and a validation set in a ratio of seven to three and preprocessed the data similarly to the CRNN models in Section 3.1: sequences of log Mel spectrograms about 40 s long, each containing 100 frames. The accuracy on the validation set is 79.04 %, which is not that much when we consider that there can be at least three times more zeros than ones; predicting zeros everywhere would also give a similarly high score. So we have not proceeded with this approach.

On the other hand, our loss function is not a good fit for this task. We do not actually want to optimize the number of correctly predicted binary values; rather, we want to predict ones as close in the sequence to the target ones as possible. Therefore, the result could be much better with a different loss function definition.
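One possible direction, stated here as an assumption rather than an experiment we ran, is to weight the rare chord-change frames more heavily, so that predicting zeros everywhere stops being a cheap optimum; the sketch below assumes a per-frame sigmoid output instead of the two-class softmax used above.

import tensorflow as tf

def weighted_binary_crossentropy(pos_weight=3.0):
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        per_frame = tf.keras.backend.binary_crossentropy(y_true, y_pred)
        weights = 1.0 + (pos_weight - 1.0) * y_true   # chord-change frames weighted more
        return tf.reduce_mean(weights * per_frame)
    return loss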

Figure 3.7: The CRNN segmentation model that predicts ones for chord changes and zeros otherwise (a stack of Conv2D layers with 16, 32, 64, and 80 filters, batch normalization and max pooling, followed by a bidirectional LSTM and a softmax output).


CNN encoder-decoder

The main goal of the CNN encoder-decoder is to transform spectrograms into graphically visualized harmony segments. We propose a visualization in which each harmony segment is a peak at the beginning of the harmony change that narrows rapidly towards the end of the segment. The part of the Python script creating the visualization graph is shown in Figure 3.8.

for i in range(start, chord_ind):
    n_ones = int(n_features
                 - n_features * ((i - start) / (chord_ind - start)) ** (1 / 2))
    n_zeros = int((n_features - n_ones) / 2)
    segmented_sequence.append(
        np.concatenate((
            np.zeros(n_zeros),
            np.ones(n_ones),
            np.zeros(int(max(n_features - (n_ones + n_zeros), 0))),
        )))

Figure 3.8: The Python algorithm generating the segment visualization graph.

The CNN encoder-decoder model was designed to process spectrograms and build the segment visualizations. Data features are processed similarly to the previous approach, but the sequence length is different: we use sequences of 50 frames, each corresponding to about 23 s. The loss function is the mean squared error, and we have chosen the mean absolute error between labels and predictions as the metric.

In Figure 3.10, we can see an example of how chord labels are preprocessed for one sample of a 1000-frame sequence corresponding to around 23 seconds of a song. Its spectrogram, the model input, is shown right above the segment graph visualization.

As mentioned, the model was trained on 50-frame sequences for 88 epochs. The final validation mean absolute error is 0.3418, which is quite good considering that the best-case scenario is an MAE of zero and the worst case is an MAE of one. We can see the difference between the predicted and gold visualization graphs in Figure 3.11, with the predicted sequence on the left and the gold one on the right. As you can notice, some post-processing would have to be applied to recognize the actual harmony segment borders, and the predictions are still not that accurate. On the other hand, the result is not bad enough to reject the whole approach.

Dynamic Programming based algorithm

There is a beat tracking algorithm proposed by Ellis [2007]. The author considers two components in the algorithm: firstly, the onset strength of the instruments at different times, and secondly, the requirement that the inter-beat interval stay the same over some


Figure 3.9: The encoder-decoder architecture for harmony segmentation (an encoder of Conv2D layers with 64, 128, 256, 512, and 512 filters, with batch normalization and max pooling, mirrored by a decoder of Conv2D layers with 256, 128, and 64 filters and upsampling, finished by a Conv2D layer with 3 filters and tanh activation).

Figure 3.10: Spectrogram of approximately 23 s that contains four harmony segments, with a graphical visualisation sample which we are trying to teach our model to predict.

The onset strength envelope is constructed from the STFT, and the global tempo can be obtained by applying autocorrelation functions. Combining these two features, we get the song's beats and the BPM value. The algorithm is already implemented by Brian et al. [2015]. As Ellis [2007] described, the beat and tempo estimation is often very close to the truth. However, it is still not 100% precise, and every millisecond of inaccuracy accumulates and affects the song significantly later on. Once we have the BPM value and the beats, we can assume that each beat is mapped to one chord. That could help us in the segmentation. For instance, with some extra post-processing, we can get the harmony segmentation based on ACR predictions, as we will use in Chapter 4. Of course, the algorithm outcome can also be used as an additional feature of the harmony segmentation model.
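For illustration, the following is a minimal Python sketch (not part of the thesis implementation) of calling this beat tracker through librosa [Brian et al., 2015]; the file name is only an example.

import librosa

# Load the audio; librosa resamples to 22 050 Hz by default.
y, sr = librosa.load("song.wav")

# Onset strength envelope and the dynamic-programming beat tracker [Ellis, 2007].
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)

# Convert beat frames to seconds; each beat can later be mapped to one chord.
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print(tempo, beat_times[:8])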


Figure 3.11: Sample of the difference between the correct and the predicted graphical visualisation of harmony segments.

Figure 3.12: Tempogram and onset strength of Queen's song A Kind of Magic [Brian et al., 2015].

3.2.2 Discussion

We have three completely different approaches for estimating harmony segments. The CRNN model predicting zeros and ones has to be improved by a better loss function definition, and the architecture and the training have to be optimized as well. It is not a dead end, but it needs a bit more research. The CNN encoder-decoder approach is in a very similar situation. Some post-processing functions would have to be implemented; however, the model output is not that clear, and probably more data should be provided, or we should design a different output format. Both of them need a lot of extra work with no guarantee of success. The last option, the dynamic programming-based algorithm, is the easiest one to implement and use. We only have to be careful about possible inaccuracies, and we need already predicted chord sequences from the previous chord recognition research.


4. Application structure

Song Chords Recognizer is an ASP.NET application with the Model, View, and Controller architecture that provides chord sequence visualization of any song in WAV format that you upload. You can pick between two different models to process the audio and return the song's chord sequence.

• Template Voter

• Predictors

They are based on two different approaches. The first one is a model coded in C# that needs a lot of parameters. It is good enough for audio of a single-instrument harmony and when we already know the BPM value. On the other hand, it does not work well with more complex songs. The other model is based on deep learning methods with additional correcting algorithms. We described the deep learning approaches and their outcomes in detail in Chapter 3. Its advantage is that the model also predicts the BPM value, the key signature, and the time signature. For now, the deep learning model is trained on a small dataset, and there is still room for improvement in the research. We coded the backend of this part in Python because of the already existing libraries that the language supports. We will focus on model contexts and functionalities later in this chapter.

We provide only one controller, the so-called Recognizer controller, with the sample link https://localhost:44304/Recognizer, where the domain address and port depend on the application location. There are several endpoints, and we will describe them one by one.

4.1 Application endpoints

4.1.1 The main page /index

Here is the default index page where you can upload and process an audio file with any of the provided models. Besides the menu bar, a motivational and informational header is included. Users can quickly read about the web page's purpose and whether it fits their interests. Below is a form with the model type choice at its top. We set the Predictors as the default option, but the user can easily switch to the Template Voter without reloading the page, using JavaScript. The default option has only one argument, the audio file itself. The Template Voter has four more parameters. Of course, we provided default settings that should work in most cases. However, the user can change them if they want. Only the BPM value is problematic here since each song has a different one. On the other hand, users who do not care about the bars or rhythm can ignore this field. Figure 4.1 shows the interface of the Predictors form, and Figure 4.2 displays the form for the Template Voter.

Endpoint Method: GET
Endpoint Arguments: None


Figure 4.1: Default page with /index endpoint displaying the form of Predictors.

Figure 4.2: Default page with /index endpoint displaying the form of Template Voter.

4.1.2 The visualization of the Template Voter /VisualizeTemplateVoter

Here, we can find the outcome of the Song Chords Recognizer model based on the simple template-matching algorithm applied to generated and filtered spectrograms. Chords are always grouped in bars of four beats, and the beats depend on the user input. No further description of the song is provided.

Method: POST
Arguments: audio, windowArg, filtrationArg, sampleLengthLevel, bpm


Figure 4.3: Chord sequence visualization page of Template Voter with /VisualizeTemplateVoter endpoint displaying the song's chord sequence.

4.1.3 The visualization of the Predictors /VisualizePredictors

This page shows the outcome of the Predictors' process. We can see an example of the interface in Figure 4.4. Three boxes describing the processed song are placed below the header. The first one is the key signature. We assume a song is major-key-based, or so-called in Ionian mode. The user has to extract the information about the key signature, i.e. the number of sharps or flats, on their own, which is not a challenging task at all. Another box shows the BPM value. The processed song has to keep the same tempo throughout; otherwise, the pace would be inaccurate. The last box contains the time signature information. We can see two values there divided by a slash, which describe the bars. The first value specifies how many beats of the default bar type fit the measure. The second value describes the type of default beat for the bar. We support only four and three quarter notes per measure, i.e., 4/4 and 3/4. Based on the time signature, the bars of chords are grouped according to the predicted quarters. Each predicted chord is also mapped to its time point, so the user can potentially synchronize themselves with the uploaded audio.

Method: POST
Arguments: audio

4.1.4 The about page /about

This page contains nothing but information about the project with the GitHub link. Figure 4.5 shows the page screenshot.

Method: GET
Arguments: None


Figure 4.4: Chord sequence visualization page of Predictors with /VisualizePredictors endpoint displaying the song's chord sequence.

Figure 4.5: Page containing basic application information with /about endpoint.

4.1.5 The error page /IncorrectInputFormat

If something goes wrong during the processing, the user is redirected to this endpoint, and a corresponding error message is displayed.

Method: GET
Arguments: messages

Figure 4.6: Error message with /IncorrectInputFormat endpoint that occurs during the process.


4.2 Template Voter

This section focuses on one of our models, based on simple voting of the most common chord for some chromagram interval. This part is implemented entirely in C#, and we do not use any additional libraries except those that are part of ASP.NET. Therefore, we have our own solutions even for audio parsing and STFT computation. The algorithm takes five arguments: the convolutional window and the sample length level for the STFT algorithm, the spectrogram filtration type, the BPM value, and the audio itself. Figure 4.7 shows the workflow diagram of the individual steps of the process. We will go through them separately.

Figure 4.7: Workflow diagram of Template Voter algorithm.

4.2.1 Audio Source

The audio file is parsed byte by byte. We only dealt with the implementation of the WAV format. Other audio formats are not supported, but the Template Voter algorithm is easily extendable and prepared for other format parsers. We can see the WAV structure in Figure 4.8. There are several subchunks, at least two. The first one is called fmt and contains data describing the audio format, for instance, the sample rate, the number of channels, or how many bits are assigned to one sample. We are interested in one more subchunk, the data subchunk, but we have to iterate over all subchunks first to find its offset. The data subchunk includes the encoded waveform of the song, and the fmt subchunk already gives us all the information needed to decode it.


Figure 4.8: Data structure of bytes in WAV audio format [Luma et al., 2014].
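For illustration, a minimal Python sketch of this subchunk iteration (the application itself does it in C#) could look as follows; the file name is an example and only uncompressed PCM WAV files are assumed.

import struct

with open("song.wav", "rb") as f:
    riff, riff_size, wave = struct.unpack("<4sI4s", f.read(12))
    assert riff == b"RIFF" and wave == b"WAVE"

    # Iterate over subchunks until the data subchunk is found.
    while True:
        chunk_id, chunk_size = struct.unpack("<4sI", f.read(8))
        if chunk_id == b"fmt ":
            fmt = f.read(chunk_size)
            channels, sample_rate = struct.unpack("<HI", fmt[2:8])
            bits_per_sample = struct.unpack("<H", fmt[14:16])[0]
        elif chunk_id == b"data":
            waveform_bytes = f.read(chunk_size)  # encoded waveform samples
            break
        else:
            f.seek(chunk_size, 1)  # skip any other subchunk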

4.2.2 Fourier transform

The Fourier transform part provides the Discrete Fourier transform, the FFT, and the STFT. The primary motivation is to get the frequency intensities of all time frames; therefore, the object offers this functionality as well. Furthermore, the function is designed asynchronously, with parallel tasks computing STFT results to increase the speed of the whole process. Its dependency on the other Fourier transform parts is why we have to pass the same arguments here as the STFT method takes.

Fast Fourier transform

We coded the FFT recursively. Our approach is stable-based, i.e. we do not rewrite the audio waveform. When we compare our solution to other existing libraries providing a stable-based FFT algorithm, ours is much faster. There are also libraries with in-place functionality, i.e. they do not use extra space for the result, but the waveform function is overwritten; those approaches are a little bit faster than ours. We did five measurements of the FFT process on 16384 song samples for each library. The averaged results are listed in Table 4.1.


Library                  Time in ms
Our Approach (stable)    17.89
FFTW (stable)            783.04
Math.NET (stable)        147.46
Accord (in place)        3.18
AForge (in place)        2.79
NAudio (in place)        1.31

Table 4.1: Average of five measurements of total execution time in milliseconds of FFT libraries.

The implementation of our FFT function is shown in Figure 4.9.

private static Complex[] FFTRecursion(int N, double[] g, int offset, int fft_length)
{
    Complex[] result = new Complex[N];
    if (N == 1)
    {
        // Base case: a single sample.
        result[0] = g[offset];
    }
    else
    {
        // Principal N-th root of unity.
        Complex w = Complex.Exp(new Complex(0, (-2.0f * Math.PI) / N));

        // Recursively transform the even and odd interleaved subsequences.
        Complex[] even = FFTRecursion(N / 2, g, offset, fft_length);
        Complex[] odd = FFTRecursion(N / 2, g, offset + fft_length / N, fft_length);

        // Butterfly combination of the two half-size results.
        for (int i = 0; i < even.Length; i++)
        {
            result[i] = even[i] + Complex.Pow(w, i) * odd[i];
            result[i + N / 2] = even[i] - Complex.Pow(w, i) * odd[i];
        }
    }
    return result;
}

Figure 4.9: Our C# recursive implementation of FFT.
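As a sanity check of the same radix-2 recursion, a short Python sketch (not part of the application) can be compared against numpy's reference FFT:

import numpy as np

def fft_recursion(g):
    # Radix-2 decimation-in-time FFT; the input length must be a power of two.
    n = len(g)
    if n == 1:
        return np.array(g, dtype=complex)
    even = fft_recursion(g[0::2])
    odd = fft_recursion(g[1::2])
    w = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate((even + w * odd, even - w * odd))

x = np.random.rand(16384)
assert np.allclose(fft_recursion(x), np.fft.fft(x))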


Short-time Fourier transform

The function computing the STFT does not support overlapping windows. The only arguments the method takes are the audio waveform, the number of window samples, and the convolutional window type. The number of samples that one window contains is passed as its base-two logarithm to ensure that the number of window samples is a power of two. We call this variable the sample length level. The window time in seconds is computed according to the following formula

time in seconds = 2^(sample length level) / sample rate.

This parameter is very sensitive, and the performance of the Template Voter model depends on it very significantly. If we choose a small number of seconds, each time frame will miss context. On the contrary, large windows would contain a lot of irrelevant information. We want to set something in between. For better intuition of which parameter corresponds to which time sample, Table 4.2 shows a few examples. For optimizing the parameter, 0.37152 seconds is a good starting point.

sample length level    sample rate    time in seconds
10                     44100          0.02322
11                     44100          0.04644
12                     44100          0.09288
13                     44100          0.18576
14                     44100          0.37152
15                     44100          0.74304
16                     44100          1.48608
17                     44100          2.97215
18                     88200          2.97215

Table 4.2: A few examples of the sample length level.
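A minimal Python sketch reproducing the values in Table 4.2 directly from the formula above:

# time in seconds = 2^(sample length level) / sample rate
for level, sample_rate in [(10, 44100), (14, 44100), (17, 44100), (18, 88200)]:
    print(level, sample_rate, round(2 ** level / sample_rate, 5))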

The last parameter here is the STFT convolutional window type, which is a function applied to the waveform that modifies its shape. The main reason for doing that is to make the FFT more accurate and to highlight specific areas more than others. There are several window functions that this project supports, and it is also well designed for users who would like to add their own convolutional window function. A short code sketch of these window functions follows after the list.

• Rectangular window

w[n] = 1


Figure 4.10: Visualization of rectangular convolutional window function [Wikipedia contributors, 2021].

• Triangular window

w[n] = 1 − |(n − N/2) / (L/2)|, 0 ≤ n ≤ N

Figure 4.11: Visualization of triangular convolutional window function [Wikipedia contributors, 2021].

• Parzen window

w[n] = w0(n − N/2), 0 ≤ n ≤ N

w0(n) = 1 − 6 (n / (L/2))^2 (1 − |n| / (L/2)),   0 ≤ |n| ≤ L/4
w0(n) = 2 (1 − |n| / (L/2))^3,                   L/4 < |n| ≤ L/2


Figure 4.12: Visualization of Parzen convolutional window function [Wikipedia contributors, 2021].

• Welch window

w[n] = 1 − ((n − N/2) / (N/2))^2, 0 ≤ n ≤ N

Figure 4.13: Visualization of Welch convolutional window function [Wikipedia contributors, 2021].

• Nuttall window

w[n] = a0 − a1 · cos(2πn/N) + a2 · cos(4πn/N) − a3 · cos(6πn/N)

a0 = 0.355768, a1 = 0.487396, a2 = 0.144232, a3 = 0.012604


Figure 4.14: Visualization of Nuttall convolutional window function [Wikipedia contributors, 2021].
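As mentioned before the list, the following is a minimal Python sketch of the rectangular, triangular, Welch, and Nuttall windows, assuming the symmetric N+1-sample convention used in the formulas above:

import numpy as np

def rectangular(N):
    return np.ones(N + 1)

def triangular(N, L=None):
    L = N if L is None else L
    n = np.arange(N + 1)
    return 1 - np.abs((n - N / 2) / (L / 2))

def welch(N):
    n = np.arange(N + 1)
    return 1 - ((n - N / 2) / (N / 2)) ** 2

def nuttall(N):
    n = np.arange(N + 1)
    a = (0.355768, 0.487396, 0.144232, 0.012604)
    return (a[0] - a[1] * np.cos(2 * np.pi * n / N)
            + a[2] * np.cos(4 * np.pi * n / N)
            - a[3] * np.cos(6 * np.pi * n / N))

# Example: apply a window to one STFT frame before the FFT.
frame = np.random.rand(1025)
windowed = frame * nuttall(1024)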

4.2.3 Graphs

The project has a Graph interface that generates a graphical visualization or programmatically returns all graph data. We use only two graphs for the Song Chords Recognizer purpose: the first one is the Spectrogram and the second one is the Chromagram. The only graph printer we implemented is a simple ASCII-based graph inserted into a text file, but again, the user can easily add their own graph printer object.

Spectrogram

We already know what the spectrogram is. In the code, it is computed by the STFT function called asynchronously for each time window. It is important to realize that our goal is to analyze the harmony of musical instruments. Every music recording contains a lot of noise that we have to suppress. Furthermore, instruments also generate n-th multiples of played tones. For instance, when C1 is played, the spectrogram also shows C2, G2, C3, E3, G3, etc. These tones are called aliquots. For this purpose, we provide several approaches that clean the spectrogram a little bit or suppress uninteresting spectrogram areas.

• Identity
Filtering that does not apply any filtration. It returns the default spectrogram as generated by the STFT.

• Accompaniment Frequency Area Mask
When we are interested only in chord recognition, there is only one area of tones that is intended for the accompaniment, in other words for the harmony. This harmony range begins somewhere at the end of the great octave and ends somewhere in the middle of the one-lined octave. This filtering creates a mask and allows only values in this area; the rest of the frequencies are set to zero. Because of that, chromagram creation would not work well due to the inconsistent number of valued octaves for each chroma. Therefore, the chromagram generation algorithm explicitly ignores zero values.

• Weighted Octaves
We have four types of tones.


– Beats
– Bass - Bass tones are essential for chords because they often specify the root tone of the chord. But sometimes, these tones can be misleading, for instance, in more complex bass lines like a walking bass in jazz.
– Harmony - Harmony is the area we are interested in the most. It directly represents chords.
– Melody - Melody is not that interesting for single chords. There could be passages that are not part of our chord. On the other hand, the melody is always based on the harmony.

So it is a good idea to weight the tones of individual octaves by their importance for the ACR. For instance, the great, small, and one-lined octaves are much more important than the rest. Users can set the weight values on their own or use the default settings.

• Filter Nth Harmonics
This filtration was proposed by Hausner [2014]. The algorithm filters the spectrogram from n-th harmonics, the so-called aliquots, which are automatically generated by instruments. The filtration iterates over all spectrogram values, and each value spectrogram[i][j] is rewritten by the minimum of

spectrogram[i][j · n] · (1/epsilon)^(n−1)

over all integers n ∈ [1, n_harmonics]. The epsilon and n_harmonics are by default set to epsilon = 0.4 and n_harmonics = 10. A short sketch of this filtration follows below.
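A rough numpy sketch of this filtration as described above (not the application's C# code); variable names are illustrative.

import numpy as np

def filter_nth_harmonics(spectrogram, epsilon=0.4, n_harmonics=10):
    # spectrogram: 2D array of shape (time frames, frequency bins).
    n_frames, n_bins = spectrogram.shape
    filtered = np.copy(spectrogram)
    for i in range(n_frames):
        for j in range(n_bins):
            # Each value becomes the minimum over its n-th harmonics,
            # with higher harmonics weighted by (1/epsilon)^(n-1).
            candidates = [
                spectrogram[i][j * n] * (1 / epsilon) ** (n - 1)
                for n in range(1, n_harmonics + 1)
                if j * n < n_bins
            ]
            filtered[i][j] = min(candidates)
    return filtered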

Chromagram

The creation of a chromagram is straightforward. The spectrogram is taken, one of the spectrogram filtrations is applied, and the result is averaged over all octaves. The averaging is a little trickier here. The filtered spectrogram is divided into frequency intervals according to which pitch frequency is the nearest. The highest peak of each interval is selected to represent the interval's pitch, and these values are averaged for all twelve chromas over all octaves. The averaged result forms the chromagram.
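A rough Python sketch of this averaging, assuming we already know the (strictly positive) frequency of every spectrogram bin; the pitch mapping via MIDI numbers is an illustrative choice, not the application's exact code.

import numpy as np

def chromagram(filtered_spectrogram, bin_frequencies):
    # filtered_spectrogram: (time frames, frequency bins); bin_frequencies in Hz, all > 0.
    midi = np.round(12 * np.log2(bin_frequencies / 440.0) + 69).astype(int)  # nearest pitch
    n_frames = filtered_spectrogram.shape[0]
    chroma = np.zeros((n_frames, 12))
    for t in range(n_frames):
        for pitch in np.unique(midi):
            bins = np.where(midi == pitch)[0]
            peak = filtered_spectrogram[t, bins].max()  # highest peak represents the pitch
            if peak > 0:                                # zeroed-out areas are ignored
                chroma[t, pitch % 12] += peak
    # Average the accumulated peaks over the octaves of each chroma.
    for pitch_class in range(12):
        n_octaves = len({p // 12 for p in np.unique(midi) if p % 12 == pitch_class})
        chroma[:, pitch_class] /= max(n_octaves, 1)
    return chroma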

4.2.4 Chord Classifier

This part classifies the chord that corresponds to the harmony played in a given time frame. The BPM value specifies those time frames as the duration between two beats. All chromagram samples corresponding to that duration are summed together. We prepared all triad and seventh chord templates, and the classifier is designed to choose the one that fits the most. At first, all triad chords are considered, i.e., diminished, minor, major, and augmented, for each root tone. The most likely triad chord is the one with the highest product of the root, third, and fifth tone intensities, normalized to a scale between zero and one.


In case the triad fifth is a perfect fifth, we multiply the result by two to increase the chances of the more intuitive and statistically more common chords, major and minor. The same selection algorithm is run with seventh chords. If there is a seventh chord with the same root, third, and fifth tones as the most likely triad and with a score, i.e., the intensity product, higher than 0.6, the most likely seventh chord is used. Otherwise, the most likely triad chord is selected. This heuristic helps us prioritize triad chords; a seventh chord is considered only if it has a solid score.
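A simplified Python sketch of this voting, restricted for brevity to major and minor triads plus a dominant seventh; the chord label format and the exact scoring details are illustrative assumptions.

import numpy as np

PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def classify_chord(chroma_sum):
    # chroma_sum: 12 summed chromagram values for one beat duration.
    c = chroma_sum / (chroma_sum.max() + 1e-9)       # normalize to [0, 1]
    best, best_score = None, -1.0
    for root in range(12):
        for name, third in [("", 4), ("m", 3)]:      # major, minor (both with a perfect fifth)
            score = c[root] * c[(root + third) % 12] * c[(root + 7) % 12]
            score *= 2                               # perfect-fifth bonus
            if score > best_score:
                best, best_score = (root, name, third), score
    root, name, third = best
    # Consider a seventh chord built on the winning triad.
    seventh_score = c[root] * c[(root + third) % 12] * c[(root + 7) % 12] * c[(root + 10) % 12]
    if seventh_score > 0.6:
        return PITCHES[root] + name + "7"
    return PITCHES[root] + name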

4.3 Predictors

This section focuses on the model based on deep learning, described in detail in Chapter 3. Although the rest of the program is coded in .NET, we decided to proceed with the Python language. The main reason is the significant advantage of Python libraries, namely the audio library librosa [Brian et al., 2015], the machine learning library sklearn [Pedregosa et al., 2011], and the deep learning library tensorflow [Abadi et al., 2015]. Therefore, when the Predictors model is selected to process audio in the main ASP.NET application, a Python script with all the required steps is called. Our script takes the audio waveform with its sample rate as input in JSON format on the standard input. The format of the JSON request is described in Figure 4.15.

{" Waveform ": [SONGs WAVEFORM ]," SampleRate ": [ WAVEFORMs SAMPLE RATE]

}

Figure 4.15: Format of the JSON request of the Python Song Chord Recognizer Pipeline script.

The result of the Python pipeline is printed on the standard output, again as JSON, which contains information about the key signature, the BPM value, the beat chord sequence of the song, the beat times of those chords, and the time signature. Figure 4.16 shows the format of that JSON response.

{"Key ": [KEY DESCRIPTION ],"BPM ": [BPM VALUE]," ChordSequence ": [LIST OF CHORDS]," BeatTimes ": [LIST OF BEAT TIMES IN SECONDS ]," BarQuarters ": [NUMBER OF QUARTERS IN ONE BAR]

}

Figure 4.16: Format of the JSON response of the Python Song Chord Recognizer Pipeline script.
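A minimal sketch of how such a script can exchange these JSON messages over the standard input and output; run_pipeline is a placeholder stub, not the real prediction code.

import json
import sys

import numpy as np

def run_pipeline(waveform, sample_rate):
    # Placeholder for the actual steps (CRNN models, key, beats, time signature).
    return "C major", 120, ["C", "G"], [0.5, 1.0], 4

# Read the request produced by the ASP.NET application.
request = json.loads(sys.stdin.read())
waveform = np.array(request["Waveform"], dtype=float)
sample_rate = request["SampleRate"]

key, bpm, chords, beat_times, bar_quarters = run_pipeline(waveform, sample_rate)

# Print the response consumed by the ASP.NET application.
print(json.dumps({
    "Key": key,
    "BPM": bpm,
    "ChordSequence": chords,
    "BeatTimes": beat_times,
    "BarQuarters": bar_quarters,
}))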

The algorithm has two models. The first one predicts the chords of the original song, and the second one predicts the chord sequence of the song transposed to the C Ionian key or its mode alternatives, which should be more accurate. Firstly,


the pipeline considers the audio waveform and preprocesses it into the format of our original deep learning model. Let us suppose that the chord prediction from the first model is good enough to estimate the song's key signature. Secondly, librosa transposes the audio to the C Ionian key [Brian et al., 2015], and our second model predicts the transposed chord sequence, which should be more accurate. The last part of the process is to recognize the beat times and assign them their chords. That also tells us the time signature and the BPM value. The whole process is depicted in Figure 4.17.

[Figure 4.17 diagram: Audio Waveform -> Spectrogram Sequences -> CRNN model (original song) -> Original Chords -> Key Prediction -> C Transposed Waveform -> Spectrogram Sequences -> CRNN model (C transposed song) -> C Transposed Chords -> Beat Chords, BPM, Time Signature -> C Transposed Beat Chords and Original Beat Chords.]

Figure 4.17: Workflow diagram of Predictors algorithm.

4.3.1 Models

The models in this pipeline are based on the ACR research described in Chapter 3. We proceed with the Basic CRNN model that seems to give the most accurate score, and we want to use the one for transposed songs. But first, we have to find out the key signature of the uploaded audio. That is why we need both models, the one for original songs and the one for transposed songs. The models are already trained on the Isophonics dataset. We need to prepare features from the provided WAV audio file for those models in the same format as our training data. The WAV parsing is already done by the C# web application, but the rest of the preprocessing is done by the librosa library [Brian et al., 2015]. The sample rate is set to 22 050 Hz, the hop length is 512 samples, and one sequence of features contains 1000 CQT spectrogram frames, which corresponds to about 23 seconds of the song.
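A minimal sketch of this preprocessing with librosa, using the parameters stated above; the number of CQT bins is an assumed choice, and the file name is an example.

import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050)

# Constant-Q transform with hop length 512; 84 bins (7 octaves) is an assumption.
cqt = np.abs(librosa.cqt(y, sr=sr, hop_length=512, n_bins=84))
frames = cqt.T                                             # shape: (time frames, bins)

# Split the frames into sequences of 1000 (about 23 seconds each).
sequences = [frames[i:i + 1000] for i in range(0, len(frames), 1000)]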

4.3.2 Key prediction

As mentioned in Chapter 1, each key signature has seven chromas that characterize the whole song. Based on these chromas, we can build seven triad chords constructed from those tones. For instance, let us consider the C Ionian key. Its characteristic tones are

C, D, E, F, G, A, B.

Their triad chords fitting the C Ionian are

C major, D minor, E minor, F major, G major, A minor, and B diminished.


Considering all of the song's predicted chords, we take all possible key signatures and compute a score: how many of the key's fitting chords are included in the song. The key with the highest score is the one we use.
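A small Python sketch of this scoring over the twelve Ionian keys; the diatonic triad pattern (major on I, IV, V; minor on ii, iii, vi; diminished on vii) is encoded relative to the key root, and the chord label format is an assumption.

PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Diatonic triads of an Ionian key as (semitone offset from the root, quality suffix).
DIATONIC = [(0, ""), (2, "m"), (4, "m"), (5, ""), (7, ""), (9, "m"), (11, "dim")]

def predict_key(predicted_chords):
    best_key, best_score = None, -1
    for root in range(12):
        fitting = {PITCHES[(root + offset) % 12] + quality for offset, quality in DIATONIC}
        score = sum(1 for chord in predicted_chords if chord in fitting)
        if score > best_score:
            best_key, best_score = PITCHES[root], score
    return best_key

print(predict_key(["C", "G", "Am", "F", "C", "G", "F", "C"]))  # -> C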

4.3.3 Beat assigning and tempo prediction

The motivation for this part of the project is the chord sequence visualization. We want to show chords in a readable format. One intuitive way is to separate the song into bars where each bar also carries the chord description. We decided to show a chord for each beat, in case some chord change occurs in the middle of a measure. But that is not entirely straightforward; it leads to three problems: what the BPM value is, when a beat starts and ends, and how many beats are included in a bar. We implemented two different approaches. Both are based on the librosa library, which finds the BPM value and beat times for us. Unfortunately, librosa's output is not entirely accurate. The BPM estimate can be off by one in either direction, which can cause significant inaccuracy later in the song.

• Chord Voting
Based on the librosa BPM and beat time estimation, the chord sequence is processed, and the most frequent chord between two beat times is voted on and assigned to that beat (a short sketch of this voting follows below). We rely on librosa's precision, which is not one hundred percent accurate.

• Chord beats estimating
The BPM value is used to estimate how many chord frames correspond to one beat. After that, we create a first draft of the beat chord sequence, where each chord corresponds to some beat. This sequence is used for the time signature estimation. We support only songs with four quarters or three quarters in one bar. We compare the number of same-chord beat runs whose length is divisible by four with those divisible by three; the one with the higher occurrence wins. After that, we use a greedy algorithm to supply or remove missing or single chords to complete a full bar of the same chord. We do that to prevent all chord bars from being shifted by one beat, which would make the chord sequence visualization more difficult to read.

We proceed with the second approach, which seems to be more accurate and gives a better visualization.
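For illustration, the simpler chord voting idea from the first approach can be sketched as follows; inputs are the frame-wise chord predictions, their time stamps, and the librosa beat times.

from collections import Counter

def vote_beat_chords(frame_chords, frame_times, beat_times):
    beat_chords = []
    for start, end in zip(beat_times[:-1], beat_times[1:]):
        window = [c for c, t in zip(frame_chords, frame_times) if start <= t < end]
        # The most frequent predicted chord between two beats wins;
        # "N" (no chord) is used as a fallback for empty windows.
        beat_chords.append(Counter(window).most_common(1)[0][0] if window else "N")
    return beat_chords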


Conclusions and future work

In this work, we introduced one of the most important Music Information Retrieval topics, Automatic Chord Recognition. In the research part, we explored various types of preprocessing methods. Both PCA and the Standard Scaler distinctly improved the results of the MLP classifier model. A very helpful idea was transposing all songs into the C major key or its key mode equivalents, which also benefited the convolutional recurrent neural networks. We proposed several CRNN architectures. It turned out that the model structure has to include a language model, or an alternative to it, that learns harmony patterns in order to produce chord sequences that are structured better and more accurately; therefore, models finished with a recurrent layer showed better performance. Considering the first part of the CRNN, the so-called feature extractor, the one designed as a convolutional neural network with six convolutional layers provided by us achieved better scores than the one using the much more complex EfficientNet [Tan and Le, 2019]. Despite the fact that we were using such a limited dataset, our results are close to the state-of-the-art models.

We also described our research on harmony segment recognition, for which we developed three approaches. The study is still at its beginning, but it already gave us some insights and ideas. For finding exact chord interval borders, the dynamic programming beat tracking algorithm [Ellis, 2007] seems to be a beneficial tool for the post-processing we used.

Based on the research, we implemented an ASP.NET web application that takes a WAV audio file, processes it, and prints the chord sequence visualization. The application uses two different algorithms, one using deep learning and one based on template matching. In some cases, it makes more sense to use the Template Voter since the deep learning model depends on the training dataset. Using Python during the research part was very reasonable because of all the already existing libraries in that language. For the application purpose and the visualization, the C# language was a good choice because of its comfortable and clear object design. The connection between these two parts was not an issue at all.

4.4 Future work

In order to improve the ACR scores, more data would come in handy. There are many more available datasets, but some collecting and processing work must be done first. It is also worth considering creating a new synthetic dataset of all chord combinations played by one or more instruments. We should also try modern improvements of recurrent neural networks, how attention affects the result, etc.

We worked with a chord vocabulary of 25 chords. An extension to a larger vocabulary containing and recognizing seventh, ninth, and other extended chords is another possible step in the future. The segmentation research can also be investigated further.

The web application we provided does most of the processing on the server. That could be very slow if more users wanted to use the application at


the same time. For instance, the audio waveform parsing and the spectrogram generation, which form the major part of the whole process, could be done on the client side. Another possible application improvement is a connection of our ACR methods with some already existing Automatic Speech Recognition (ASR) system, to show chords and lyrics in one place.


Bibliography

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.

Saddam Abdulwahab, Mohammed Jabreel, and Dr Moreno. Deep Learning Models for Paraphrases Identification. PhD thesis, 09 2017.

Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Audio chord recognition with Recurrent Neural Networks. Proceedings of the 14th Conference of the International Society for Music Information Retrieval (ISMIR), 2013.

McFee Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, 12:18–25, 2015.

Judith Brown and Miller Puckette. An efficient algorithm for the calculation of a constant Q transform. Journal of the Acoustical Society of America, 92:2698, 11 1992. doi: 10.1121/1.404385.

Tristan Carsault, Jerome Nika, and Philippe Esling. Using musical relationships between chord labels in automatic chord extraction tasks. Proceedings of the 19th Conference of the International Society for Music Information Retrieval (ISMIR), 2018.

Tsung-Ping Chen and Li Su. Harmony Transformer: Incorporating Chord Segmentation into Harmony Recognition. Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR, pages 259–267, 2019.

James W. Cooley and John W. Tukey. An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation, 19(90):297–301, 1965. ISSN 00255718, 10886842. URL http://www.jstor.org/stable/2003354.

Junqi Deng and Yu-Kwong Kwok. A hybrid Gaussian-HMM-deep learning approach for automatic chord estimation with very large vocabulary. Proceedings of the 17th Conference of the International Society for Music Information Retrieval (ISMIR), pages 812–818, 2016.


Junqi Deng and Yu-Kwong Kwok. Large vocabulary automatic chord estimation with an even chance training scheme. Proceedings of the 18th Conference of the International Society for Music Information Retrieval (ISMIR), 2017.

Daniel P. W. Ellis. Beat Tracking by Dynamic Programming. Journal of New Music Research, 36(1):51–60, 2007. doi: 10.1080/09298210701653344. URL https://doi.org/10.1080/09298210701653344.

Takuya Fujishima. Realtime chord recognition of musical sound: a system using Common Lisp Music. Proceedings of the International Computer Music Conference (ICMC), page 464–467, 1999.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Christoph Hausner. Design and Evaluation of a Simple Chord Detection Algorithm. Bachelor's thesis, University of Passau, 2014.

Bahrudin Hrnjica. In depth LSTM Implementation using CNTK on .NET platform. https://hrnjica.net/2019/04/08/in-depth-lstm-implementation-using-cntk-on-net-platform/, 04 2019. Accessed: 2021-05-07.

Eric J. Humphrey, Taemin Cho, and Juan P. Bello. Learning a robust Tonnetz-space transform for automatic chord recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 453–456, 2012.

Junyan Jiang, Ke Chen, Wei Li, and Gus Xia. Large-vocabulary Chord Transcription Via Chord Structure Decomposition. Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR, pages 644–651, 2019.

Filip Korzeniowski and Gerhard Widmer. A fully convolutional deep auditory model for musical chord recognition. Proceedings of the IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, 2016a.

Filip Korzeniowski and Gerhard Widmer. Feature learning for chord recognition: The deep chroma extractor. Proceedings of the 17th Conference of the International Society for Music Information Retrieval (ISMIR), 2016b.

Filip Korzeniowski and Gerhard Widmer. Improved chord recognition by combining duration and harmonic language models. Proceedings of the 19th Conference of the International Society for Music Information Retrieval (ISMIR), pages 10–17, 2018.

Artan Luma, Besnik Selimi, and Lirim Ameti. Audio Message Transmitter Secured Through Elliptical Curve Cryptosystem. International Journal of Applied Mathematics, Electronics and Computers, 2:54–58, 12 2014. doi: 10.18100/ijamec.68742.


Matthias Mauch and Simon Dixon. Approximate note transcription for the improved identification of difficult chords. Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), pages 135–140, 2010.

Brian McFee and Juan Pablo Bello. Structured training for large-vocabulary chord recognition. Proceedings of the 18th Conference of the International Society for Music Information Retrieval (ISMIR), 2017.

Christon-Ragavan Nadar, Jakob Abeßer, and Sascha Grollmisch. Towards CNN-based Acoustic Modeling of Seventh Chords for Automatic Chord Recognition. 2019.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Colin Raffel, Brian Mcfee, Eric Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel Ellis. mir_eval: A Transparent Implementation of Common MIR Metrics. 10 2014.

P. Schmeling. Berklee Music Theory. Number bk. 1 in Berklee Methods Series. Berklee Press, 2005. ISBN 9780876390467. URL https://books.google.cz/books?id=ZYtHSwAACAAJ.

Siddharth Sigtia, Nicolas Boulanger-Lewandowski, and Simon Dixon. Audio chord recognition with a hybrid recurrent neural network. Proceedings of the 16th Conference of the International Society for Music Information Retrieval (ISMIR), pages 127–133, 2015.

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019. URL http://arxiv.org/abs/1905.11946. Published in ICML 2019.

Wikipedia contributors. Window function — Wikipedia, the free encyclopedia, 2021. URL https://en.wikipedia.org/w/index.php?title=Window_function&oldid=1021741408. [Online; accessed 9-May-2021].

Y. Wu, T. Carsault, and K. Yoshii. Automatic chord estimation based on a frame-wise convolutional recurrent neural network with nonaligned annotations. Journal of New Music Research, 48:232–252, 2019.

Mu-Heng Yang, Li Su, and Yi-Hsuan Yang. Highlighting root notes in chord recognition using cepstral features and multi-task learning. Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pages 1–8, 2016.

Xinquan Zhou and Alexander Lerch. Chord detection using deep learning. Proceedings of the 16th Conference of the International Society for Music Information Retrieval (ISMIR), pages 52–58, 2015.


List of Figures

1.1 Sketch of MLP.
1.2 1D convolutional layer considering one neighbour on each side.
1.3 Simple example of 2D convolutional layer with kernel 2x2 [Goodfellow et al., 2016].
1.4 Diagram of GRU [Abdulwahab et al., 2017].
1.5 Diagram of LSTM unit [Hrnjica, 2019].
2.1 A few lines of Crazy Little Thing Called Love chord LAB file.
3.1 The architecture of our MLP classifier.
3.2 The architecture of our approach of CRNN.
3.3 The architecture of CRNN using EfficientNet [Tan and Le, 2019].
3.4 The architecture of CRNN finished by CRF.
3.5 The architecture of two of our CRNN models combined, one predicts bass, the other predicts third.
3.6 The confusion matrix of predictions on original and transposed datasets.
3.7 CRNN segmentation model that predicts ones for chord changes, otherwise zeros.
3.8 The Python algorithm generating segments visualisation graph.
3.9 Encoder-decoder architecture for harmony segmentation.
3.10 Spectrogram of approximately 23 s that contains four harmony segments with a graphical visualisation sample which we are trying to teach our model to predict.
3.11 Sample of difference between the correct and the predicted graphical visualisation of harmony segments.
3.12 Tempogram and onset strength of Queen's song A Kind of Magic [Brian et al., 2015].
4.1 Default page displaying the form of Predictors.
4.2 Default page displaying the form of Template Voter.
4.3 Chord sequence visualization page of Template Voter displaying song's chord sequence.
4.4 Chord sequence visualization page of Predictors displaying song's chord sequence.
4.5 Page containing basic application information.
4.6 Error message that occurs during the process.
4.7 Workflow diagram of Template Voter algorithm.
4.8 Data structure of bytes in WAV audio format [Luma et al., 2014].
4.9 Our C# recursive implementation of FFT.
4.10 Visualization of rectangular convolutional window function.
4.11 Visualization of triangular convolutional window function.
4.12 Visualization of Parzen convolutional window function.
4.13 Visualization of Welch convolutional window function.
4.14 Visualization of Nuttall convolutional window function.


4.15 Format of JSON request of Python Song Chord Recognizer Pipeline script.
4.16 Format of JSON response of Python Song Chord Recognizer Pipeline script.
4.17 Workflow diagram of Predictors algorithm.


List of Tables

3.1 MLP classifier results of different preprocessing functions and on different datasets, of OS and TS.
3.2 Results of various CRNN models evaluated on dataset of OS and on dataset of TS. All figures can be interpreted as percentages, which Subsection 3.1.4 describes in detail.
4.1 Average of five measurements of total execution time in milliseconds of FFT libraries.
4.2 A few examples of the sample length level.


List of Abbreviations

FFT Fast Fourier transform

STFT Short-time Fourier transform

CQT Constant-Q transform

MLP multi-layer perceptron

CNN convolutional neural network

RNN recurrent neural network

CRNN convolutional recurrent neural network

GRU gated recurrent unit

LSTM long short-term memory

PCA principal component analysis

CRF conditional random field

TS transposed songs

OS original songs

BPM beats per minute

HMM Hidden Markov model

ACR Automatic Chord Recognition


A. Attachments

Folder   Description

demo/    Web application package with the installation and run ReadMe guideline. We also provided two sample demo audios here.

docs/    Autogenerated technical documentation of the web application and the Template Voter model.

src/     Source code of the web application, the Predictors model, and the Template Voter model. The folder also contains the ACR research outcomes in Jupyter notebooks. The folder structure is explained in detail in its ReadMe files.
