


ELSEVIER Image and Vision Computing 14 (1996) 39-46

Recognition and reconstruction of primitives in music scores

K.C. Ng, R.D. Boyle

Division of Artificial Intelligence, School of Computer Studies, The University of Leeds, Leeds LS2 9JT, UK

Received 25 April 1994; revised 20 March 1995

Abstract

Music recognition bears similarities and differences to OCR. In this paper we identify some of the problems peculiar to musical scores, and propose an approach which succeeds in a wide range of non-trivial cases. The composer customarily proceeds by writing notes, then stems, beams, ties and slurs - we have inverted this approach by segmenting and then sub-segmenting scores to recapture the component parts of symbols. In this paper, we concentrate on the strategy of recognizing sub-segmented primitives, and the reassembly process which reconstructs low level graphical primitives back to musical symbols. The sub-segmentation process proves to be worthwhile, since many primitives complement each other and high level musical theory can be employed to enhance the recognition process.

Keywords: Document analysis; OCR; Score recognition

1. Introduction

The goal of automatic reading of musical scores has been pursued for some time [1,2]. Various authors have drawn attention both to the potential benefits, which include reprinting, part creation and transcription into other notations such as Braille, and to the particular difficulties inherent in the problem, which include interconnected symbols, differing font styles and broken symbols. Although superficially similar to the analysis of text, there are significant differences, some of which work to our advantage and some of which do not; in summary, we may note:

Musical scores benefit from a clear geometric framework (the five horizontal stave lines that make up the staff, subdivided by vertical bars).

Within bars, there is pre-determined timing, usually revealed by a time signature at the beginning of the piece. There are general location guidelines for some symbols; for example, clef signs normally appear at the beginning of a staff, and a staff normally ends with a bar line. Despite several useful geometric constraints, musical symbols may be compounded in a complex way.

email: {kia,roger}@scs.leeds.ac.uk

0262-8856/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. SSDI 0262-8856(95)01038-6

Unlike text, musical symbols may by design overlap one another - beams, slurs and phrase marks may, especially in combination with stave lines, become very hard to disambiguate. Not only are the musical symbols interconnected, but there are many valid methods of grouping. Fig. 2 illustrates the form of a quaver and two of the many possible appearances of a beamed group of four quavers. The layout of musical notation is not sequential; for example, two or more note heads stacked vertically indicate a chord that should be played simultaneously (see Fig. 3).

Fig. 1 illustrates some of these points. Although the basic shapes of music symbols, for example note heads, are simple, the high degree of interconnection and the confusing connections and overlaps have proved difficult to overcome [1,3]. To solve this problem, we have defined a process [4] to sub-segment interconnected composite features below the level of the symbols usually accessed by the composer - for example, a beamed group (see Fig. 4) is segmented into 'sub-primitives' such as note heads, stems and others. These sub-primitives then need reconstructing into the score.

The full pre-processing sub-segmentation process is described elsewhere [4], but in summary, primitives


Fig. 1. Some of the complexities of scored music.

Fig. 2. Quavers.

Fig. 3. Nonsequential ordering.

Fig. 4. Beamed group is sub-segmented into note heads, beam and stems.

such as beams and slurs are sequentially traced and removed from the original image; compound features such as minims, quavers, chords and others are dismantled into sub-primitives. Relevant parts of this procedure are described later.

In this paper, we discuss our recognition strategies and concentrate on the issue of reconstructing music scores, particularly timing, from the sub-segmented primitives, with confirmation derived from syntactic and semantic musical knowledge. We demonstrate that critical information may be reconstructed using our ideas, in particular the notes (including timing), as well as musical features such as chords and 'accidentals' (flats and sharps). The next section outlines the general approach and assumptions. We then review the pre-processing and sub-segmentation we take as a starting point, describe the recognition process, based on a nearest neighbour (NN) classifier, and describe reconstruction and the deployment of higher level knowledge. Our experimental procedures and results are summarized, and conclusions drawn.

Throughout, we use elementary terminology of musical notation; in most instances, our meaning should be clear. Where further definitions are required, any standard introductory music text [5,6] will provide them.

2. Approach

The analysis of musical scores is fraught with difficulty. This is due to various factors: primarily the nature of the symbols, which are often very easy to confuse at the resolutions customarily used; inconsistencies in the printing; and the standard document analysis problem of noise induced by scanning. In our experiments, we have scanned at 300 dpi and find ambiguity between, for example, ♯ (sharp) and ♮ (natural), and also observe that 'horizontal' lines are often bowed, while 'regular' features often are not.

Our approach is to recognize that the musical features, the target of the recognition process, are often going to be inaccessible to common forms of attack, and that subdividing them may provide lower level primitives that may be recognized, and then reassembled into the original score. In following this approach we use two strategies:

1. The recognizer should be robust, since it has to be proof against poorly defined features and noisy images; to this end we use the simplest possible features, and attach confidence to the recognizer’s efforts. The recognizer is described later.

2. Recognition follows a phased approach. Features essential to an interpretation of the score are sought first and verified by their mutual coherence, thereby allowing an intelligent search for more ambiguous features.

To demonstrate our approach we present a simplifica- tion of the algorithm by permitting assumptions about the score under analysis. Specifically, these assumptions are:

1. Foreknowledge of the time signature.

2. Foreknowledge of the key signature.

3. Foreknowledge that the primitive feature set under examination is limited to ten: treble clef, bass clef, curve, beam, vertical bar, horizontal bar, note head, dot, accidental signs and quaver rest.

Ultimately, the first and second assumptions here may be overcome moderately simply by exploiting the knowledge that this form of information is normally provided at geometrically predictable and known positions in the score, and comes from a very limited symbol set. We mention possible approaches to automatic extraction of the time signature in a later section, while the key signature will be expected to align with the five-line staff, and normally to represent a 'common' key.

The third assumption permits us to demonstrate the approach, and its relaxation will in due course provide a measure of its worth. Note that some of these features actually represent sets of musical symbols: thus 'accidental sign' may be flat, natural or sharp, 'vertical bar' may be note stem or bar line, 'curve' may be slur, flag (tail), phrase mark or upper part of a pause symbol, and 'note head' may be solid or hollow.

Fig. 5. Pre-processing stages with example image.

3. Pre-processing and sub-segmentation

The pre-processing flow of this system is shown in Fig. 5.

Firstly, the continuous-tone image from the scanner is converted into a binary image using the iterative threshold selection method of Ridler and Calvard [7] with Lloyd's [8] modification. After that, the skew of the image is detected and the image is rotated to de-skew it. A more detailed description of the pre-processing and sub-segmentation process can be found elsewhere [4].
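The Ridler and Calvard scheme [7] can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name, the half-grey-level convergence tolerance, and the omission of Lloyd's modification are our own choices:

```python
def iterative_threshold(pixels):
    """Ridler-Calvard iterative threshold selection on a flat list of grey values."""
    t = sum(pixels) / len(pixels)              # start from the global mean
    while True:
        bright = [p for p in pixels if p > t]  # putative paper
        dark = [p for p in pixels if p <= t]   # putative ink
        if not bright or not dark:
            return t
        # new threshold: midway between the two class means
        new_t = 0.5 * (sum(bright) / len(bright) + sum(dark) / len(dark))
        if abs(new_t - t) < 0.5:               # converged within half a grey level
            return new_t
        t = new_t
```

On a clearly bimodal histogram the iteration settles midway between the two modes after very few passes.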

Next, the stave lines, the fundamental elements of a musical score, are located and erased. The positions of these parallel horizontal lines are important as the reference for many musical symbols, so their locations must be recorded. The lines are then erased, since they connect the musical symbols located on them. The average thickness of the stave lines and the inter-stave height (see Fig. 6) are important measurements for automatically setting many thresholds in later processing; for example, a note head cannot be taller than the inter-stave line distance plus two stave line thicknesses.
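The two measurements can be estimated robustly from vertical run-lengths: the most frequent black run is the line thickness and the most frequent white run is the inter-stave gap. This is a commonly used heuristic, not necessarily the paper's exact procedure; `estimate_alpha` and the list-of-rows image representation are our own:

```python
from collections import Counter

def estimate_alpha(image):
    """Estimate stave line thickness and inter-stave gap from a binary image
    (a list of rows; 1 = black ink). Returns (thickness, gap, alpha)."""
    black_runs, white_runs = Counter(), Counter()
    for x in range(len(image[0])):          # scan each column top to bottom
        run, colour = 0, image[0][x]
        for y in range(len(image)):
            if image[y][x] == colour:
                run += 1
            else:
                (black_runs if colour else white_runs)[run] += 1
                run, colour = 1, image[y][x]
        (black_runs if colour else white_runs)[run] += 1
    thickness = black_runs.most_common(1)[0][0]   # most frequent black run
    gap = white_runs.most_common(1)[0][0]         # most frequent white run
    return thickness, gap, thickness + gap
```

On a clean score the stave lines dominate both histograms, so the estimate is insensitive to the symbols drawn over them.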

After stave line removal, the image will consist of blocks of connected foreground pixels. These may be

Fig. 6. Fundamental unit α = (thickness of stave line) + (inter-stave line height).

Fig. 7. Before and after stave line removal.


Fig. 8. Feature and its bounding box

recognizable primitives, such as note heads or bar lines, or composite objects (for example, a group of four semi-quavers), or noise, or part of a stave. Fig. 7 shows a section of a score with two beamed quaver groups and the result after stave line removal. These blocks are inspected sequentially to determine whether they are primitive features or need further segmentation. Before sub-segmentation, a rule-based recognition is performed to identify simple isolated symbols, which include:

- Treble clef signs, located at the beginning of a staff, with confirmation from the nearest neighbour classifier discussed in the next section.

- Bass clef signs, also located at the beginning of a staff, with a characteristic pair of dots in between the top three stave lines.

- Isolated curves, which can be identified using their bounding box (see Fig. 8). The criteria used are that the density (given by number of foreground pixels / area of the bounding box) is less than 30%, and the mean projective width or mean projective height is less than α.

- Isolated vertical bars, which exhibit a high ratio of height/width (> 10), with width < α and density > 60%. If the top of a vertical bar is located on the first line of a staff and the bottom on the last line, it is a bar line provided it has no associated note head or flag.

- Isolated horizontal bars, which have a high ratio of width/height (> 3), with height < α and density > 60%.

- Quaver rests, which are normally located between the second and fifth lines of the staff. The number of foreground pixels in sections A and B (see Fig. 10) is at least twice the number in sections C and D. This pixel count test is important for confirming that the symbol is a quaver rest and not an accidental (discussed later; see Fig. 18).

The initial recognition of isolated symbols is necessary to prevent any unnecessary sub-segmentation.
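The bar and curve thresholds above can be sketched as a single rule-based filter. This is an illustration under stated assumptions only: the function name is ours, the clef and quaver-rest tests (which need positional context) are omitted, and the curve rule here uses only the density criterion:

```python
def classify_isolated(w, h, density, alpha):
    """Apply the stated thresholds for isolated primitives.
    w, h: bounding box size in pixels; density: foreground pixels / (w * h);
    alpha: the fundamental unit. Returns a label, or None if no rule fires."""
    if h / w > 10 and w < alpha and density > 0.60:
        return "vertical bar"          # note stem or bar line
    if w / h > 3 and h < alpha and density > 0.60:
        return "horizontal bar"
    if density < 0.30:                 # sparse box: likely slur, tie or flag
        return "curve"                 # (projective-extent check omitted here)
    return None
```

Anything returning None is handed on to sub-segmentation, matching the paper's intent that this pass only filters out the easy cases.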


Fig. 9. Three sub-segmentation processes: (i) horizontal curve/line tracking; (ii) horizontal cut; and (iii) vertical cut.


Fig. 9 illustrates three types of sub-segmentation process. Which process to use depends upon two measurements - the size of the composite feature relative to the derived unit α (the average inter-stave line height plus the average thickness of stave line; see Fig. 6), and a crude measurement of the location of information within the feature's bounding box (see Figs. 8 and 10). This measure is used to judge the likely presence of a note head and stem, and has to be supported by a suitable density of black pixels in the appropriate position. For a given composite feature of width w and height h, the decision on which to use is given by:

IF (w > 2α) AND (h > 3α) THEN use method (i)
ELSE IF (w < 2α) AND (h > 2α) AND (the A and D sections contain foreground pixels) AND (an α² window above or below the cut position is ≥ 75% black) THEN use method (ii)
ELSE IF (w > 2α OR h ≥ 2α) AND (w ≥ α AND h ≥ α) THEN use method (iii)
ELSE stop sub-segmentation (unknown).

Method (i) is designed to locate horizontal (or near-horizontal) connectors, such as beams and slurs - it will trace such features after severing them. When the traced line or curve is too short to be considered as a beam or curve (less than α), method (iii) is used instead. The decision on where to cut features is determined by the maximum value of the second differential of the appropriate projection histogram. See elsewhere [4] for details.

After sub-segmentation, the possible primitives are passed to a NN classifier. These two processes (sub-segmentation and recognition) iterate until one of the following termination criteria is satisfied:

- the possible primitive is recognized;
- the size is too small for further sub-segmentation (i.e. an error has occurred);
- the density (measured as the proportion of foreground pixels inside the bounding box) is greater than 75% (i.e. near solid).

Fig. 11 shows part of a music score, before and after sub-segmentation.

Fig. 10. The four-section-check.

4. Recognizer

Musical primitives differ from each other not only in their shapes, but also in their relative sizes, positions, importance, frequency and other features based on the rules of music notation. In our case, a very simple recognizer makes the first attempt at recognition, and there are specific methods to confirm the correctness of each primitive. This simple classifier uses only the width w and height h of a feature's bounding box (see Fig. 8), measured as multiples of the fundamental unit α, to perform the recognition based on a pre-sampled training set. This extreme simplicity lends a robustness to the approach.

We have already specified a set of known patterns $P = \{P_1, P_2, \ldots, P_N\}$. For each known pattern $P_c \in P$ we have a training set of (2D) feature vectors $\{p_1^c, p_2^c, \ldots, p_n^c\}$, where:

$$p_i^c = (w_i^c, h_i^c)$$

from which we can calculate a mean (exemplar) $\bar{p}_c = (\bar{w}_c, \bar{h}_c)$ as the centroid of the cluster representing that feature.

We are interested in the spread of these clusters, and can measure it by examining the (Euclidean) distances of the cluster points to their respective centroid. We define:

$$\mu_c = \frac{1}{n} \sum_{i=1}^{n} d(p_i^c, \bar{p}_c)$$

Fig. 11. Sub-segmentation.


Fig. 12. Plot of frequency versus $\xi$ for the training data.

and:

$$\sigma_c = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( d(p_i^c, \bar{p}_c) - \mu_c \right)^2}$$

giving for the $c$th cluster the mean such distance, and the standard deviation of their distribution. A 'small' $\mu_c$ would suggest a 'tight' cluster, especially if supported by a 'small' $\sigma_c$; conversely, a large $\mu_c$ or $\sigma_c$ would suggest widely spread and/or highly variable data.

The training set used consisted of 15 samples for each group of symbols (120 in all), drawn from a range of scores. This set could obviously be extended, both in the number of patterns and the number of representa- tives of each pattern, but we consider it sufficient with which to experiment.

For a feature vector $p$ and a cluster $c$, we define the measure $\xi(c, p)$ by:

$$\xi(c, p) = \frac{d(p, \bar{p}_c) - \mu_c}{\sigma_c}$$

the distance of the vector from the cluster's exemplar relative to the mean such distance, scaled by the standard deviation of the corresponding distances of known members of the cluster. This measurement may take any value in the range $[-\mu_c/\sigma_c, +\infty)$, but for known patterns we observe in fact the (perhaps intuitive) distribution shown in Fig. 12, which is aggregated over all training patterns (the denominator $\sigma_c$ having a normalizing effect). Very few observations lie more than $3\sigma_c$ away from the relevant exemplar, while most are less than $\mu_c$.

On the basis of this empirical evidence, we may take an unknown pattern $p$ and use the measurements $\xi(c, p)$ as classification confidences. We proceed by taking an unknown pattern, determining the $c$ for which $\xi(c, p)$ is minimal, and accepting this as a true classification if $\xi(c, p) < 2.5$. With our chosen feature set, the clusters are well separated and we never observe a pattern simultaneously satisfying this condition for two (or more) features. This simple classifier demonstrates over 90% reliability, with misclassification normally occurring only due to the unexpected interconnection of

Fig. 13. Some example combinations.

Fig. 14. Example ‘chord’ (without stem) and a note head.

two or more primitives with no clear break point, for example /. Misclassification can usually be detected and corrected with high level knowledge at the reconstruction stage. We find that recognition in a single pass is both unreliable and difficult, but since we are working with sub-symbol primitives, many can be used for mutual confirmation after simple recognition. For example, if we find a solid note head, there must be a nearby stem, or else the note head is probably mis-identified. A primitive may have different semantics according to its location relative to other primitives. For example, a dot may be a duration sign if it is located at the right-hand side of a note head, but it can be an articulation sign (i.e. a staccato) if it is below a note head - without the recognition of the note head, it is too early to guess the actual semantics of the dot. Small beams and note heads can also present ambiguities. Thus, there sometimes remains confusion to be resolved at a later stage. With the subsequent recognition of surrounding features, the primitives in doubt may be deduced using musical knowledge. For example, if we find a note head located far away from the staff with unclear positioning, its pitch information may be left for a later stage when all the ledger lines are found.
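The exemplar-based classifier of this section can be sketched as follows. The function names and the toy feature values are our own; only the centroid/$\mu_c$/$\sigma_c$ statistics, the $\xi$ measure and the 2.5 acceptance threshold come from the text:

```python
from math import dist, sqrt

def train(clusters):
    """clusters: {label: [(w, h), ...]} in units of alpha.
    Returns {label: (centroid, mu, sigma)}."""
    model = {}
    for label, pts in clusters.items():
        cx = sum(w for w, _ in pts) / len(pts)
        cy = sum(h for _, h in pts) / len(pts)
        ds = [dist(p, (cx, cy)) for p in pts]        # distances to the centroid
        mu = sum(ds) / len(ds)
        sigma = sqrt(sum((d - mu) ** 2 for d in ds) / len(ds))
        model[label] = ((cx, cy), mu, sigma)
    return model

def classify(model, p, accept=2.5):
    """Pick the label minimising xi(c, p) = (d(p, centroid) - mu) / sigma;
    reject the pattern when even the best xi exceeds the threshold."""
    best, best_xi = None, float("inf")
    for label, (centroid, mu, sigma) in model.items():
        xi = (dist(p, centroid) - mu) / sigma if sigma > 0 else float("inf")
        if xi < best_xi:
            best, best_xi = label, xi
    return best if best_xi < accept else None
```

Rejected patterns (None) would go back for further sub-segmentation or be deferred until surrounding features are known, as described above.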

We find that recognition in a single pass is both unreliable and difficult, but since we are working with sub-symbol primitives, many can be used for mutual confirmation after simple recognition. For example, if we find a solid note head, there must be a nearby stem or else the note head is probably mis-identified. A primi- tive may have different semantics according to its loca- tion related to other primitives. For example, a dot may be a duration sign if it is located at the right-hand side of a note head, but it can be a articulation sign (i.e. a stac- cato) if it is below a note head - without the recognition of the note head, it is too early to guess the actual semantic of the dot. Small beams and note heads can also present ambiguities. Thus, there sometimes remains confusion to be resolved at a later stage. With the subsequent recogni- tion of surrounding features, the primitives in doubt may be deduced using musical knowledge. For example, if we find a note head located far away from the staff with unclear positioning, its pitch information may be left for a later stage when all the ledger lines are found.

5. Reconstruction

At this stage, sub-segmented primitives are grouped to reconstruct their original musical semantics. For example, when we find a likely solid note head with a stem close by, it constitutes a crotchet. Fig. 13 shows two examples of common combinations.

On reconstruction, we consider a 'chord' (connected solid/hollow note head primitives, without stem; see Fig. 14), or a single note head or rest, as the basis of musical events from the musical score, with all other primitives as complementary features, except perhaps global information such as tempo. This idea is similar to the event in MIDI [9], but in this case the duration of the event is held by the event itself, and not by the next event as a delay.
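The event basis and the head-plus-stem pairing can be sketched as below. The `Event` fields, the crotchet-unit convention and the half-α proximity tolerance are illustrative assumptions, not the paper's data structures:

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    """A musical event anchored on a note head, chord or rest basis.
    Unlike a MIDI event, it carries its own duration."""
    pitch: int
    duration: float                       # in crotchet (quarter-note) units
    onset: float                          # position within the bar
    modifiers: list = field(default_factory=list)   # dots, accidentals, ...

def pair_head_with_stem(head_box, stems, alpha):
    """Return the first stem whose x-range touches the head's bounding box
    (within half a fundamental unit); None suggests a likely misrecognition."""
    hx0, hy0, hx1, hy1 = head_box
    for sx0, sy0, sx1, sy1 in stems:
        if hx0 - alpha / 2 <= sx1 and sx0 <= hx1 + alpha / 2:
            return (sx0, sy0, sx1, sy1)
    return None
```

A solid head paired with a stem would then become a crotchet event, with beams, flags and dots later adjusting its duration tag.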



Fig. 15. Recognition of accidental signs.

For each likely note head, the following sequence of tests and actions is performed:

1. Verify identification, by overlaying an ellipse and counting the number of foreground pixels. This also reveals whether the note head is solid or hollow by the density of foreground pixels in the ellipse. This checking method is robust on solid note heads, but occasionally fails on broken or incomplete hollow note heads; this is attended to at a later stage.

2. Find the pitch - this is revealed straightforwardly by the vertical position of the note head with respect to the staff.

3. Search the surrounding neighbourhood for possible complementary features such as stem, dot, accidental and articulation marks. At this stage, a number of features which have multiple semantics, such as dot and vertical bar, can be clarified. Update the information tag of the affected basis.

4. Search the surrounding neighbourhood for possible complementary features such as beams and flags. Update the information tag of the affected basis.

5. The NN classifier identifies possible accidental signs (flat, natural and sharp) by their relative width and height. This identification may be confirmed by the proximity of a note head, and we then proceed to identify the precise accidental; they are trichotomised into ♯, ♮ or ♭. The maximum and minimum points of the left and right sections of the accidental sign are located (see Fig. 15), and with these four points the accidental signs can be recognized. Specifically, the rules are:

- If max_right > max_left, the symbol is a ♯.
- If max_left > max_right and min_left > min_right, the symbol is a ♮.
- If max_left > max_right and min_left ≤ min_right, the symbol is a ♭.

Ambiguities may occur between ♯ and ♮ in some fonts, but a ♮ only appears after a ♯ or a ♭ on the particular note earlier in the same bar, or near a change of key-signature. Further, there are clear constraining guidelines about when a particular accidental is or is not permissible or likely that may be used to guide the recognition.
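The trichotomy can be sketched as a three-way rule on the four extrema. Note the caveats: the assignment of the two min-comparisons to ♮ and ♭ follows our reading of the (partly garbled) rules, the function name is ours, and a y-up coordinate convention is assumed:

```python
def classify_accidental(max_left, max_right, min_left, min_right):
    """Trichotomise an accidental from the extrema of its left and right
    halves (y increases upwards)."""
    if max_right > max_left:
        return "sharp"
    if min_left > min_right:
        return "natural"   # the right stroke reaches lower than the left
    return "flat"          # the left stem is the lowest point
```

In practice this would only be applied after the NN classifier and a nearby note head have confirmed that the blob is an accidental at all.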

There are various places in this procedure where extra information may be brought to bear on recognition. In particular, misidentification can be corrected by checking stems with no valid note heads (in the case of a minim), the duration of bars (breve), or unclassified small segments (height ≤ α and width ≤ 2α) connected to other note heads (chords). More interestingly, we can deploy musical knowledge such as the observation that, although there is no complete or convincing theory, it is generally true that an accidental sign is more likely to be a ♯ in a sharp key-signature (e.g. G major and A major) and more likely to be a ♭ in a flat key-signature (e.g. F major and E♭ major).

6. High level knowledge correction - time and bar

The output of the reconstruction phase is frequently incomplete, or occasionally in error, as a result of pixel level effects causing misrecognition. This is always going to occur, since many of the features under inspection are very small (simple dots, for example), and some are very easily confused with each other when viewed in isolation; this effect is exacerbated by stave lines passing through, for instance, hollow note heads. At this stage, musical knowledge can be employed to enhance the result in many ways; for example, timing, common rhythmical patterns and others.

Isolated vertical lines (with a height of 4α or greater) which are drawn across the staff are called bar lines. The section included from a bar line to a subsequent bar line is called a bar. It divides the music into parts of equal duration and serves to indicate the position of the accent.

A time signature (for example 3/4 or 6/8) can normally be found at the beginning of a score; it reveals the number of beats in each bar and the type of the beats. The top number is the number of beats and the bottom number is the type of beat: for example, 2 for minims (half notes), 4 for crotchets (quarter notes), 8 for quavers (eighth notes), and so on (4/4 time can also be written as the sign 'C'). Notes and rests of many durations may be mixed together in a bar, but the total number of beats has to be equal to the number of beats shown by the time signature.
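This beats-per-bar arithmetic can be sketched as a small verification helper. The function name, the crotchet-unit convention and the use of exact fractions are our own choices; the single-unknown correction mirrors the verification described in this section:

```python
from fractions import Fraction

def bar_check(time_sig, durations):
    """Verify a bar against its time signature; durations are in crotchet
    units, with None marking a single unrecognized feature.
    time_sig is a (beats, beat_type) pair such as (3, 4)."""
    beats, beat_type = time_sig
    expected = Fraction(beats) * Fraction(4, beat_type)   # bar length in crotchets
    known = sum(Fraction(d) for d in durations if d is not None)
    unknowns = durations.count(None)
    if unknowns == 0:
        return known == expected          # plain verification
    if unknowns == 1:
        return expected - known           # solve for the missing duration
    raise ValueError("more than one ambiguity: need further clues")
```

Exact rationals avoid the rounding problems that floating point would introduce with dotted and compound-time durations.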

When all features in a bar have been (putatively) recognized, this timing constraint provides a verification of the recognition. In simple cases where there is only one feature of unknown duration in the bar, the unknown can be calculated from the difference between the actual duration of a bar and the sum of the identified durations. If there is more than one ambiguity, other clues may be needed; for example, common patterns such as a dotted basis usually being followed by a halved basis (see Fig. 16), and other rhythmical groupings.

Fig. 16. Two examples of a dotted basis being followed by a halved note basis.

The time signature is very useful in helping to discover any necessary corrections, and one approach in processing scanned scores is to assume its availability. This is often a reasonable approach since, in a lengthy score, it is unlikely to change, and it represents a small overhead for the user to provide it as an input parameter. At worst, resolving the time signature may be viewed as a separate OCR task, since it normally appears near the beginning of the score at very predictable positions after the clef. In practice, however, it does not appear in every staff and may not appear for many pages; many scanned inputs may not have a time signature at all. For this reason, approaches to automatic and robust extraction of the time signature are of interest.

The semi-quaver is the smallest resolution note in commonly used time signatures. For each bar which has been recognized with confidence, we count the number of semi-quavers (sixteenth notes) that can make up the same duration as the bar; Fig. 17 shows most of the commonly used time signatures and the corresponding number of semi-quavers per bar. Using the number of semi-quavers alone cannot pinpoint the time signature precisely, but it normally gives enough information for correction within bars. Assuming that there are no time signature changes, we take the most frequently occurring number of semi-quavers as an index into the time of the music score. If needed, the ambiguity between time signatures sharing the same semi-quaver count may be resolved using information from the grouping of beamed notes; generally, slurring will also give some hints.

Refinement of this approach, and further exploitation of high level knowledge about intra-bar patterns of timing, represent work in hand in this project.

7. Experiments and results

The system we have outlined has been trained and tested on live data.

The classifier was built to recognize eight different (sub-)symbols - their location in feature space is illustrated in Fig. 18, in which each cluster exemplar lies at the centre of a circle whose radius is drawn at 2.5σ according to the calculations outlined earlier. In this figure, all distances are scaled with respect to the fundamental unit of the score α, defined above. These data are derived from 15 examples of each feature, scanned from six different scores. We acknowledge that a larger training set would be an advantage.

We have tested the system on 22 A4 pages of music, representing a variety of publication styles and composers. These are far from trivial, representing musical complexity at the limit of that manageable by other analysis systems of which we know. We observe as follows:

- Sub-segmentation and NN classification: vertical and horizontal sub-segmentations are robust. Failure to locate the best break point may occur if the symbol is skewed, characteristically a stem that is not perpendicular to a stave line.

Fig. 17. Table of commonly used time signatures and the number of semi-quavers per bar.

Simple times: 2/2 = 16, 2/4 = 8, 2/8 = 4; 3/2 = 24, 3/4 = 12, 3/8 = 6; 4/2 = 32, 4/4 = 16, 4/8 = 8.
Compound times: 6/4 = 24, 6/8 = 12, 6/16 = 6; 9/4 = 36, 9/8 = 18, 9/16 = 9; 12/4 = 48, 12/8 = 24, 12/16 = 12.


Fig. 18. Clusters used by the NN classifier; the axes are width and height, in units of α.

However, accidentally broken stems can normally be corrected in the reconstruction stage. The horizontal tracing sub-segmentation performs well with beamed groups and picks out overlapping curves. It cannot trace beams that are connected to one another due to printer flooding or digitizing error.

- Reconstruction of notes and compound symbols: with correct classification, reconstruction has few problems, with the help of simple musical syntax.

- Recognition of accidentals and other relevant symbols: isolated symbols are recognized at the first pass, before sub-segmentation. From the test data, very few features were mis-identified; characteristically these are broken or small symbols that were connected with no clear break point.

There are definite shortcomings to this work, but we feel these are less serious than those exhibited by comparable systems published in the literature, and are confident that these can be overcome in due course by the deployment of higher level knowledge such as that outlined above.

8. Conclusion

In this paper, a recognition system for printed musical scores, based on primitive reconstruction, has been described. To take full advantage of musical knowledge, we chose to start with sub-feature primitives, working bottom-up, and to deploy top-down (musical) knowledge to constrain the reconstructions.

The system uses the simplest possible features to assist recognition, and avoids unresolvable ambiguity by deploying syntactic musical knowledge. At this stage, the system is only equipped with a subset of musical primitives, and expansion of the set of recognizable primitive symbols represents work in hand. To assist in this, constraints of timing information, which are already used in a crude form, will be explored further. We shall be looking into the possible application of high level musical perception [10,11] and harmonic analysis [12] for correcting more complicated music scores.

The system is being developed for fully automatic recognition without human assistance. With the primi- tive approach and knowledge reinforcement, we hope to recognize more complex music scores.

Acknowledgements

The authors are most grateful to the referees, whose suggestions have resulted in considerable clarification of the material in this paper.

References

[1] W.B. Hewlett and E. Selfridge-Field (eds), Computing in Musicology, Vol. 9, Center for Computer Assisted Research in the Humanities (1994).

[2] D. Pruslin, Automatic Recognition of Sheet Music, ScD thesis, MIT (June 1966).

[3] A. Clarke, M. Brown and M. Thorne, Problems to be faced by developers of computer based automatic music recognisers, Int. Computer Music Conf. (1990) 345-347.

[4] K.C. Ng and R.D. Boyle, Segmentation of music primitives, Br. Machine Vision Conf., Springer-Verlag, London (September 1992).

[5] G.A. Holmes, The Academic Manual of the Rudiments of Music (1949).

[6] M. Kennedy, The Concise Oxford Dictionary of Music, Oxford University Press (1980).

[7] T.W. Ridler and S. Calvard, Picture thresholding using an iterative selection method, IEEE Trans. SMC, Vol. 8, No. 8 (August 1978) 630-632.

[8] D.E. Lloyd, Automatic target classification using moment invariants of image shapes, Technical Report RAE IDN AW 126, Farnborough, UK (December 1985).

[9] International MIDI Association, Standard Musical Instrument Digital Interface (MIDI) Files 1.0 (July 1988).

[10] R. Aiello and J.A. Sloboda (eds), Musical Perceptions, Oxford University Press (1994).

[11] J.A. Sloboda, The Musical Mind: The Cognitive Psychology of Music, Vol. 5 of Oxford Psychology Series, Clarendon Press, Oxford (1988).

[12] G. Pratt, The Dynamics of Harmony, Open University Press, Milton Keynes (1984).