ISMIR 2017 - International Society for Music Information Retrieval Conference

A Metric for Music Notation Transcription Accuracy
Andrea Cogliati and Zhiyao Duan <[email protected]>
Dept. of Electrical and Computer Engineering, University of Rochester

Abstract
• Automatic music transcription (AMT) aims at transcribing musical performances into music notation
• Most existing AMT systems only focus on parametric transcription
• There is no objective metric to evaluate music notation transcription
• The proposed edit metric counts differences between a transcription and the ground-truth music score across twelve different musical features
• The metric can be used to predict human evaluations of music notation transcription with an average R^2 of 0.564

Examples
[Figure: Comparison of two transcriptions of the same piece containing similar errors but with different readability. (a) Ground truth. (b) Transcription with a wrong pickup measure. (c) Transcription off by a 16th note.]

Proposed Method
• Align the transcription to the ground truth based on the pitch content only
  • Pitch content is arguably the most salient feature of a transcription
  • Invariant to meter and key mistakes
  • Increases the robustness of the alignment
• Compare musical objects at aligned portions between the scores and count differences on the following features:
  • Binary matching: barlines, clefs, key signatures, time signatures
  • Rests: duration, staff assignment
  • Notes: spelling, duration, stem direction, staff assignment, grouping into chords
• Normalize error counts by the total number of musical objects
• Translate normalized error counts into a musically relevant evaluation with a linear regression fit to human ratings
  • Human ratings of three musical aspects taken from [1]: pitch notation, rhythm notation, note positioning
  • For each aspect, the linear regression learns twelve weights, one for each normalized error count

Alignment Stage
[Figure: Alignment between the ground truth (top) and a transcription (bottom) of Bach's Minuet in G. Arrows indicate aligned beats.]
[Figure: Alignment between the ground truth (top) and another transcription (bottom) of Bach's Minuet in G. Arrows indicate aligned beats.]

Results
• Human evaluators in [1] were graduate students in Music Theory
• The dataset shows a low inter-evaluator agreement
• Average standard deviations (score range is 1 to 10):
  • Pitch notation: 1.64
  • Rhythm notation: 1.52
  • Note positioning: 1.84
[Figure: Correlation between the predicted ratings and the average human ratings. (a) Pitch Notation. (b) Rhythm Notation. (c) Note Positioning.]

Conclusions
• Clear correlation between predicted ratings and average human ratings
  • Pitch notation R^2 = 0.558
  • Rhythm notation R^2 = 0.534
  • Note positioning R^2 = 0.601
• The twelve proposed error count categories capture musically relevant features of music notation transcription
• High variance between evaluator scores may reduce performance
• Full code is available at [2]

References
[1] Andrea Cogliati, Zhiyao Duan, and David Temperley, "Transcribing human piano performances into music notation," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016.
[2] http://www.ece.rochester.edu/~acogliat/repository.html


(NMF) constrained with a rhythmic structure modeled with a Gaussian mixture model. Collins et al. [8] proposed a model for multiple fundamental frequency estimation, beat tracking, quantization, and pattern discovery. The pitches are estimated with a neural network. An HMM is separately used for beat tracking. The results are then combined to quantize the notes. Note spelling is performed by estimating the key of the piece and assigning to MIDI notes the most probable pitch class given the key.

An immediate problem arising when building a music notation transcription system by incorporating the above-mentioned musical structure inference methods is finding an appropriate way to evaluate the transcription accuracy of the system. In our prior work [7], we asked music theorists to evaluate music notation transcriptions along three different musical aspects, i.e., the pitch notation, the rhythm notation, and the note positioning. However, subjective evaluation is time consuming and difficult to scale to provide enough feedback to further improve the transcription system. It would be very helpful to have an objective metric for music notation transcription, just like the standard F-measure for parametric transcription [1]. Considering the inherent complexity of music notation, such a metric would need to take into account all of the aspects of the high-level musical structures in the notation. To the best of our knowledge, there is no such metric, and the goal of this paper is to propose one.

Specifically, in this paper we propose an edit distance, based on similar metrics used in bioinformatics and linguistics, to compare a music transcription with the ground-truth score. The design of the metric was guided by a data-driven approach and by simplicity. The metric is calculated in two stages. In the first stage, the two scores are aligned based on the pitch content; in the second stage, the differences between the two scores are accumulated, taking into account twelve different aspects of music notation: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment. This will serve the same purpose as F-measure in evaluating parametric transcription. To validate the saliency and the usefulness of this metric, we also apply a linear regression model to the errors measured by the metric to predict human evaluations of transcriptions.

2. BACKGROUND

Approximate sequence comparison is a typical problem in bioinformatics [13], linguistics, information retrieval, and computational biology [15]. Its purpose is to find similarities and differences between two or more sequences of elements or characters. The sequences are assumed sufficiently similar but potentially corrupted by errors. Possible differences include the presence of different elements, missing elements, or extra elements. Several metrics have been proposed to measure the distance between two sequences, including the family of edit metrics [15] and gap-penalizing alignment techniques [13].
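As a concrete illustration of the edit-metric family, the Levenshtein distance between two sequences can be computed with the classic Wagner-Fischer dynamic program. This sketch is purely illustrative background, not part of the proposed metric:

```python
def levenshtein(a, b):
    """Wagner-Fischer dynamic program: minimum number of insertions,
    deletions, and substitutions needed to turn sequence a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, x in enumerate(a, 1):
        curr = [i]                  # distance from a[:i] to ""
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete x
                            curr[j - 1] + 1,            # insert y
                            prev[j - 1] + (x != y)))    # substitute (free on match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The same recurrence works on any comparable elements, e.g., sequences of note events instead of characters.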

A music score in traditional Western notation can be viewed as a sequence of musical characters, such as clefs, time and key signatures, notes, and rests, possibly occurring concurrently, as in simultaneous notes or chords. Transcription errors include alignment errors due to wrong meter estimation or quantization, extra or missing notes and rests, note and rest duration errors, wrong note spelling, wrong staff assignment, wrong note grouping and beaming, and wrong stem direction. All of these errors contribute to varying degrees to the quality of the resulting transcription. However, the impact of each error and error category has not, to the best of our knowledge, been researched.

As an example, Fig. 1 shows two transcriptions of the same piece. Both transcriptions contain similar errors, i.e., wrong meter detection, but the transcription in Fig. 1c is arguably worse than that in Fig. 1b. A similar problem can be observed with the standard F-measure typically used to evaluate parametric transcriptions [1]; while the metric is objective and widely used, the impact of different errors on the perceptual quality of a transcription has not been researched. Intuitively, certain errors, such as extra notes outside of the harmony, should be perceptually more objectionable than others, such as octave errors. This is the reason for both proposing an objective metric and correlating the metric with human evaluations of transcriptions.
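For reference, the note-level F-measure used for parametric transcription can be sketched as follows. The greedy one-to-one matching and the 50 ms onset tolerance are illustrative assumptions, not the exact evaluation protocol of [1]:

```python
def f_measure(ref_notes, est_notes, onset_tol=0.05):
    """Note-level F-measure in the spirit of parametric-transcription
    evaluation: an estimated note is a hit if it matches a reference
    note in pitch with onset within onset_tol seconds. Notes are
    (midi_pitch, onset_seconds) tuples. Greedy matching and the
    tolerance value are simplifying assumptions for illustration."""
    unmatched = list(ref_notes)
    tp = 0
    for pitch, onset in est_notes:
        for r in unmatched:
            if r[0] == pitch and abs(r[1] - onset) <= onset_tol:
                unmatched.remove(r)  # one-to-one: consume the reference note
                tp += 1
                break
    precision = tp / len(est_notes) if est_notes else 0.0
    recall = tp / len(ref_notes) if ref_notes else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

ref = [(60, 0.0), (64, 0.5), (67, 1.0)]
est = [(60, 0.01), (64, 0.5), (66, 1.0)]  # one wrong pitch
print(round(f_measure(ref, est), 3))      # 0.667
```

Note that this score treats every missed or spurious note equally, which is exactly the limitation discussed above.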

[Figure 1: Comparison of two transcriptions of the same piece containing similar errors but with different readability. (a) Ground truth. (b) Transcription with a wrong pickup measure. (c) Transcription off by a 16th note.]

3. PROPOSED METHOD

The proposed metric is calculated in two stages: in the first stage, the transcription is aligned with the ground-truth music notation based on its pitch content only, i.e., all of the other objects, such as rests, barlines, and time and key signatures, are ignored; in the second stage, all of the objects occurring at the aligned portions of the scores


[Figure 2: Alignment between the ground-truth (top) and a transcription (bottom) of Bach's Minuet in G. Arrows indicate aligned beats.]

[Figure 3: Alignment between the ground-truth (top) and another transcription (bottom) of Bach's Minuet in G. Arrows indicate aligned beats.]

are grouped together and compared. The metric reports the differences in aligned portions in terms of twelve aspects: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment.

Some algorithms that efficiently calculate certain edit distances, e.g., the Wagner-Fischer algorithm for the Levenshtein distance between two strings, are able to align two sequences and calculate the edit costs in a single stage. We initially tried to apply the same strategy to our problem, but we discovered that the algorithm was not sufficiently robust, especially with transcriptions highly corrupted by wrong meter estimation. Intuitively, notes are the most salient aspects of music, so it is arguable that the alignment of two transcriptions should be based primarily on that aspect, while the overall quality of the transcription should be judged on a variety of other aspects.

The ground truth and the transcription are both encoded in MusicXML, a standard format to share sheet music files between applications [10]. The two scores are aligned using Dynamic Time Warping [17]. The local distance is simply the number of mismatching pitches, regardless of duration, spelling, and staff positioning.
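The alignment stage can be sketched roughly as follows. Representing each onset as a frozenset of MIDI pitches and using the size of the symmetric set difference as the mismatch count are our assumptions; the paper does not specify its exact data structures:

```python
def dtw_align(gt, tr):
    """Align two sequences of per-onset pitch sets with Dynamic Time
    Warping. Local cost = number of mismatching pitches (symmetric set
    difference), ignoring duration, spelling, and staff. Returns the
    optimal warping path as (gt_index, tr_index) pairs."""
    n, m = len(gt), len(tr)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = len(gt[i - 1] ^ tr[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from the end; ties prefer the diagonal step.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((D[i - 1][j - 1], i - 1, j - 1),
                      (D[i - 1][j],     i - 1, j),
                      (D[i][j - 1],     i,     j - 1))
    return path[::-1]

gt = [frozenset({60}), frozenset({62}), frozenset({64})]
tr = [frozenset({60}), frozenset({62}), frozenset({62}), frozenset({64})]
print(dtw_align(gt, tr))  # [(0, 0), (1, 1), (1, 2), (2, 3)]
```

In this toy example the repeated pitch in the transcription makes ground-truth onset 1 match two transcription onsets, the same behavior discussed for Fig. 2 below.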

To illustrate the purpose of the initial alignment, we show two examples in Fig. 2 and Fig. 3. The alignment stage outputs a list of pairs of aligned beats. Fig. 2 shows the alignment of a fairly good transcription of Bach's Minuet in G from the Notebook for Anna Magdalena Bach with the ground truth, which corresponds to the following sequence, expressed in beats, numbered as quarter notes starting from 0 (GT is ground truth, T is transcription):

GT 0.0 1.0 1.5 2.0 2.5 3.0 4.0 4.0 5.0 6.0 7.0 7.5 8.0 8.5 9.0 10.0 10.0 11.0 12.0 13.0 13.5 14.0 14.5 15.0 16.0 16.5 17.0 17.5
T  0.0 1.0 1.5 2.0 2.5 3.0 4.0 5.0 5.0 6.0 7.0 7.5 8.0 8.5 9.0 10.0 11.0 11.0 12.0 13.0 13.5 14.0 14.5 15.0 16.0 16.5 17.0 17.5

In this case, since the transcription is properly aligned with the ground truth, the sequence is just a list of equal numbers, one for each onset of the notes in the score. However, beat 4.0 in the ground truth is matched with beats 4.0 and 5.0 in the transcription; the same happens for beats 10.0 and 11.0, so DTW cannot properly distinguish repeated pitches. Only one alignment is shown in the figure for clarity.

Fig. 3 shows an example of an alignment for a badly aligned transcription of the same piece. The corresponding sequence is the following:

GT 0.0 0.0 0.0 1.0  1.0  1.5 2.0 2.5  3.0  3.0 3.0  3.0   4.0   4.0 5.0  6.0 6.0 6.0  7.0   7.5   8.0   8.0  8.5  9.0  10.0  10.0 10.0 11.0
T  0.0 0.5 1.0 1.75 2.0  2.5 3.0 3.75 4.25 4.5 5.0  5.5   7.0   7.0 8.25 8.5 9.0 9.75 10.25 10.75 11.0  11.5 12.0 13.5 14.75 15.0 15.0

In this case, multiple beats in the transcription correspond to the same beat in the ground truth, e.g., beat 1.0 in the ground truth corresponds to beats 1.75 and 2.0 in the transcription, because a single note in the ground truth has been transcribed as two tied notes. Only one alignment is shown in the figure for clarity.

To calculate the distance between the two aligned scores, we proceed by first grouping all of the musical objects occurring inside aligned portions of the two scores into sets, thus losing the relative location of the objects within each set but preserving all of the other aspects, including staff assignment. Then the aligned sets are compared, and the differences between the two sets are reported separately. The following aspects only allow binary matching: barlines, clefs, key signatures, and time signatures. Rests are matched for duration and staff assignment, i.e., a rest with the correct duration but on the wrong staff is counted as a staff assignment error, and a rest with the correct staff assignment but wrong duration is counted as a rest duration error. A missing or an extra rest is counted as a rest error. Notes are matched for spelling, duration, stem direction, staff assignment, and grouping into chords. For groupings, we only report the absolute value of the difference between the number of chords present in the two sets. The metric does not distinguish missing or





[Figure 4: Correlation between the predicted ratings and the average human evaluator ratings of all of the transcriptions in the dataset. (a) Pitch Notation. (b) Rhythm Notation. (c) Note Positioning.]

extra elements. These choices were dictated by simplicity of design and implementation.

All of the errors are accumulated over all of the matching sets. The errors for barlines, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment are then normalized by dividing the total number of errors for each aspect by the total number of musical objects taken into account in the score. This step is necessary to normalize the number of errors for pieces of different lengths. The errors for clefs, key signatures, and time signatures are not normalized, as they are typically global aspects of the scores and not influenced by the length of the piece. This might be a limitation for pieces with frequent changes of key signature or time signature.
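The normalization step above can be sketched as follows. The dictionary keys are the paper's twelve aspect names; the specific spellings are our assumption:

```python
def normalize_errors(raw_counts, n_objects):
    """Normalize raw per-aspect error counts by the total number of
    musical objects in the score, leaving the global aspects (clefs,
    key signatures, time signatures) unnormalized, as described above."""
    global_aspects = {"clefs", "key_signatures", "time_signatures"}
    return {aspect: count if aspect in global_aspects else count / n_objects
            for aspect, count in raw_counts.items()}

# Hypothetical raw counts for a score with 200 musical objects.
raw = {"barlines": 2, "clefs": 0, "key_signatures": 1,
       "time_signatures": 1, "notes": 8, "note_spelling": 3,
       "note_durations": 5, "stem_directions": 4, "groupings": 2,
       "rests": 6, "rest_duration": 1, "staff_assignment": 2}
norm = normalize_errors(raw, 200)
print(norm["notes"])           # 0.04 (8 / 200)
print(norm["key_signatures"])  # 1 (global aspect, left unnormalized)
```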

As an example, the set of objects at the first beat of the first measure of Fig. 2 includes the initial barlines, clefs, time signature, key signature, and the notes starting on the downbeat of the measure. Barlines, clefs, time signature, and key signature are all correctly matched. All of the notes are correct in pitch, spelling, and duration; however, there are two errors in stem direction, one error in grouping, and one error in staff assignment. All of the rests are counted as rest errors at their respective onsets.

For the first beat of the first measure of Fig. 3, all of the elements of the transcription up to the first transcribed notes (the three notes pointed to by the first arrow), together with the notes tied to them, are considered part of the same set. The wrong key signature and time signature are reported as errors. The two eighth rests are reported as rest errors. The three notes in the transcription are properly spelled, but their duration is wrong, so they are counted as three note duration errors. The missing D from the chord is reported as a note error. The extra tied notes are reported as note errors as well.

In summary, the following twelve normalized error counts are calculated by the metric: barlines, clefs, key signatures, time signatures, notes, note spelling, note durations, stem directions, groupings, rests, rest duration, and staff assignment. In order to translate these error counts into a musically relevant evaluation, we propose to use linear regression of the twelve error counts to fit human ratings of three musical aspects of automatic transcriptions, i.e., the pitch notation, the rhythm notation, and the note positioning. For each aspect, the linear regression learns twelve weights, one for each of the normalized error counts, to fit the human ratings. These weights can then be used to predict the human ratings of other music notation transcriptions.
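The regression step can be sketched with an ordinary least-squares fit, one weight per normalized error count plus an intercept. The synthetic data below is a toy stand-in for the real ratings, and the paper does not specify details such as regularization:

```python
import numpy as np

def fit_rating_weights(error_counts, ratings):
    """Least-squares fit of one weight per normalized error count
    (plus an intercept) to average human ratings for one aspect,
    e.g. pitch notation. error_counts: (n_transcriptions, 12);
    ratings: (n_transcriptions,)."""
    X = np.hstack([error_counts, np.ones((len(ratings), 1))])
    w, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    return w

def predict(w, error_counts):
    X = np.hstack([error_counts, np.ones((len(error_counts), 1))])
    return X @ w

# Toy demonstration on synthetic, noiseless data.
rng = np.random.default_rng(0)
E = rng.random((40, 12))       # fake normalized error counts
true_w = -rng.random(12)       # more errors -> lower rating
y = E @ true_w + 8.0           # fake ratings centered near 8
w = fit_rating_weights(E, y)
print(np.allclose(predict(w, E), y))  # True
```

On the noiseless toy data the weights are recovered exactly; on real human ratings the fit quality is measured by R^2, as reported in the results.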

4. EXPERIMENTAL RESULTS

To evaluate the proposed approach, we calculate the normalized error counts and run linear regression to fit human ratings of 19 short music excerpts collected in our prior work [7]. These music excerpts were taken from the Kostka-Payne music theory book, all of them piano pieces by well-known composers, and were performed on a MIDI keyboard by a semi-professional piano player. These excerpts were then transcribed into music notation using four differ-
