
Linking music-related information and audio data

Stage 1 Report

Robert Macrae

Centre for Digital Music

Queen Mary, University of London

Supervisor: Simon Dixon

2nd Supervisor: Mark Plumbley

Independent Assessor: Josh Reiss

[email protected]

August 2008


Abstract

Due to recent technological advances we now have a near endless supply of musical content, and there is growing interest in new ways of interacting with and filtering this music. Examples are music editing suites that can identify audio and align segments automatically, personal music players that respond to their wearer's mood, context and movement, or educational software that not only teaches music but can assess, grade and accompany students.

In the field of digital music, work is being done to provide computer programs with the tools to examine, classify, generalise and annotate music in ways that were previously only possible by hand. The work here is based on synchronising audio and meta-data, with a focus on methods that run in real time with a high degree of accuracy. Such methods can be used by music editors to align audio segments, allow personal music players to keep the music in time with the user's footsteps, and allow educational software to follow a student with an animated score while playing other instruments in time. It is hoped that this work will make playing music more accessible to those who would normally struggle to learn it, and will provide methods and tools that help people appreciate music in new contexts.


Contents

1 Introduction
  1.1 Synchronisation
    1.1.1 Online and Offline Synchronisation
  1.2 Audio Meta-Data
  1.3 Applications of Synchronisation

2 Background
  2.1 Factors in Synchronisation
    2.1.1 Features
    2.1.2 Window Frames
    2.1.3 Evaluating Methods
  2.2 Feature Space
    2.2.1 Low Level Features
    2.2.2 High Level Features
  2.3 Synchronisation Techniques
    2.3.1 Dynamic Time Warping
    2.3.2 Dynamic Programming
    2.3.3 Synchronising Multiple Features
    2.3.4 Hidden Markov Models
  2.4 Non-Linear Alignments
  2.5 State-of-the-Art Applications
  2.6 Challenges Identified in Linking Music-Related Information and Audio Data

3 Work plan
  3.1 Challenges
  3.2 Work done so far
    3.2.1 Audio-Score synchronisation using On-Line DTW
    3.2.2 Synchronisation in musical education software and NoteScroller
    3.2.3 Runner driven music and Trackster
  3.3 Quantitative Goals
    3.3.1 Comparing synchronisation methods
    3.3.2 Audio-Score synchronisation using On-Line DTW
    3.3.3 Synchronisation in musical education software and NoteScroller
    3.3.4 Runner driven music and Trackster
    3.3.5 Mobile phone score synchronisation and Tabster
  3.4 Other Possible Research Aims
  3.5 Qualitative Goals


Chapter 1

Introduction

Recent technological achievements have resulted in greater access to digital music. Online digital libraries contain millions of songs, while personal MP3 players allow us to enjoy our music wherever we are. Digital radio, websites and recommendation software are even helping us find the artists that appeal to us. Because of this endless supply of music, there is growing interest in new ways of interacting with and filtering this content. Examples are music editing suites that can identify audio and align segments automatically, personal music players that respond to their wearer's mood, context and tempo, or educational software that not only teaches music but can assess, grade and accompany students.

Within the field of Music Information Retrieval there is a need to develop tools and applications that can combine audio with meta-data for uses such as studying, editing and searching. Meta-data can be information such as associated lyrics, musical instructions, or probabilistic features extracted from the audio. The challenge is how to automatically ensure that the audio and this meta-data are linked and presented together such that the corresponding musical events within the two streams of information are synchronised.

The focus of this work is on real-time and accurate synchronisation of audio and meta-data. This can then be used by music editors to align audio segments, allow personal music players to keep the music in time with the user's footsteps, and allow educational software to follow a student with an animated score while playing other instruments in time. It is hoped that this work will make playing music more accessible to those who would normally struggle to learn it, and will provide methods and tools that help people find and appreciate new music.


1.1 Synchronisation

We explore methods of synchronising meta-data and audio so that the two become aligned. A particular focus of this research is on live and accurate synchronisation, to create tools that improve learning music and playing instruments, as well as searching and listening to audio. Fig. 1.1 illustrates an audio signal aligned with various meta-data. We consider audio and meta-data to be synchronised, or aligned, when they are presented in such a manner that the corresponding musical events in each are linked. If we take sheet music as an example of meta-data, then for the two to be aligned the musical events in the audio must be matched up to the instructions for those events in the sheet music. A number of methods have been proposed for synchronisation, and one purpose of this work is to establish which techniques work best and under which particular circumstances.

Figure 1.1: Synchronicity II by The Police aligned with various meta-data


Meta-data                 Examples
High level descriptors    Chorus/verse boundaries, key changes, metrical structure
Musical instructions      Sheet music, tablature, chord sequences or MIDI
Text                      Labels, lyrics, speech
Transients                Note onsets, drumbeats
Movements                 Conductor's gestures, footsteps, dance movements

Table 1.1: Examples of audio meta-data

1.1.1 Online and Offline Synchronisation

When discussing synchronisation methods, a key consideration is whether they function online or offline. Offline synchronisation matches two complete streams of information and is not necessarily restricted by time requirements. Online synchronisation, on the other hand, takes either or both of the streams of information in sequential parts and updates the best match in real time. Offline synchronisation, with the luxury of knowing the complete information, has its uses in non-time-critical applications such as gathering annotated data. However, some applications require synchronising audio and music-related information in real time, most typically in tracking scores and automatic accompaniment.

1.2 Audio Meta-Data

Audio meta-data can be defined as data about audio data. In the most common use of the term, audio meta-data is information such as artist, song and album names, or information on how to efficiently process the audio data in specific music file formats. In this thesis, however, we will use the term to describe streams of information that correspond musically to a specific song. Examples of the meta-data considered within the scope of this project are given in Table 1.1.

1.3 Applications of Synchronisation

Here we list some of the applications that could benefit from the work proposed in this thesis. This list is an attempt to group the common ideas in this field, as well as to propose some new ones, with a mention of the corresponding meta-data the audio would be linked with. Along with examples, these ideas are, where possible, attributed to the original contributor.

Score Tracking

The most common type of audio and meta-data synchronisation is online score tracking, or score following, using the musical instructions for the piece. By synchronising the live recording with the known musical score it becomes possible to follow where the performer is within a piece. The meta-data is usually in MIDI form, as this is the most basic means of storing digital musical instructions, but it can also be standard music notation, tablature or another score format. The first example of score following forms part of Dannenberg's [10] automatic accompaniment system. Score tracking can lead to digital music books that turn pages for a musician automatically.

By following the score, the synchronisation method can indicate where the user has missed notes or played parts incorrectly. This leads to the possibility of music education programs that attempt to improve performance or teach music to beginners, such as that proposed by Dannenberg et al. [14].

Automatic Accompaniment

Automatic accompaniment is the process of synchronising audio and the score and then automatically playing music (often synthesised) along, and in time, with the musician [10]. In this manner musicians can practise by themselves with the full sound of the overall performance. Examples are Dannenberg's [10] On-Line Algorithm for Real-Time Accompaniment and Raphael's [28] Music Plus One. The accompaniment may also take the form of visual information, relating to the score being tracked or to pre-recorded sequences, enhancing the experience of the music.

Random Access to Audio Recordings

Synchronising audio and music scores is also useful in an offline context. If a piece of audio is aligned and linked to its musical score, then parts of the score can be selected and the corresponding audio viewed at any time. In this manner a piece of music could be searched by features, segments, lyrics [19], etc. The idea of the intelligent editor of digital audio goes back to Chafe et al. [9] and Foster et al. [17], who defined the need for music editing software that could allow users to interact with audio in high-level musical terms. Synchronising the audio with the musical instructions, labels and other meta-data would allow the music to be accessible through these linked features. Dannenberg and Hu [11] also describe an 'Intelligent Audio Editor' that could automatically adjust audio segments to fit in with the overall mix as and when it receives them.

Gathering Annotated Data

Synchronising audio samples with meta-data could allow for the quick and automatic collection of annotated data. Such annotated audio samples could aid the study of musical attributes or enhance future synthesised sound [13].

Audio that has been synchronised with musical scores can also be systematically gathered to provide ground truth data for training and/or testing music information retrieval methods. The free availability of almost limitless MIDI files of known songs on the internet makes this a powerful method for gathering data. Turetsky and Ellis [37] first used MIDI data aligned with audio to provide training data for automatic transcription methods. You and Dannenberg [38] also used such data, gathered in a 'semi-supervised' manner, as training data for note onset detection.

As with MIDI, lyrics are also commonly available on the internet and can therefore be automatically gathered and synchronised with the matching music. This could then be used to train speech and vocal recognition methods for music. The linked lyrics can also be used as karaoke data [19].

User/Beat Driven Music

By using a beat detector on a live recording of a drummer, we can alter recorded segments of music to keep them in time with the drummer. B-Keeper by Robertson [30] is an example of this: it can be used to improve live performances where a backing audio track is to accompany a band, or as an advanced loop station that alters the recorded loops. Turning this around, one can imagine an automatic drum accompaniment that detects the tempo of the live audio being played by musicians.

Similarly to B-Keeper, synchronisation makes it possible for music to be synchronised with the footsteps of a runner, jogger or walker. Using synchronisation with meta-data to drive music can also be applied to other user actions such as singing in karaoke, dance movements, dance video games, user actions in video games, gestures from a conductor, and so on.


Chapter 2

Background

Linking audio and meta-data, and specifically synchronisation, is intrinsically linked with many other aspects of Music Information Retrieval. Having defined the scope of the problem, the dependencies on other functions such as feature extraction and onset detection will become clear. Synchronisation can also, as mentioned previously, benefit other aspects of MIR by training and developing onset detection functions and automatic annotation. In this chapter we present a study of the current trends and lines of research being undertaken in linking meta-data and synchronisation. We then endeavour to establish the current state of the art across the field of linking audio and meta-data, and how far these techniques are realised in current applications.

2.1 Factors in Synchronisation

When considering any synchronisation method for use in a given application, it is important to take into account a number of factors that will have an impact on how effective the technique is.

2.1.1 Features

Synchronisation between two streams of data in different formats requires finding some common format in which to recognise the similar sub-patterns within the data. The sequences can then be aligned by finding the greatest fit to these matched sub-patterns. Methods for synchronising two audio waves or matching two lists of meta-data are common, and in fact most audio and meta-data synchronisation methods are based on these, with the added step of converting one stream's format to the other. By either extracting or transcribing the expected information from the audio in the format of the meta-data, or synthesising the expected audio from the meta-data, we can compare one with the other in a common format. As Fig. 2.1 shows, there are many features in the possible feature space to choose from.

Figure 2.1: The different features audio and meta-data can be synchronised with

2.1.2 Window Frames

Those synchronisation methods that rely on feature extraction start by breaking the audio down into a series of successive frames. A consideration in any music information retrieval method is the size of these frames and how much they overlap. Typically the frame length is between 10 and 250 ms, with around half of that overlapped. The frame size, or granularity, has a large effect on the accuracy, computational efficiency and frequency resolution of the feature extraction. The methods we examine have chosen differing window lengths based on the needs of the features chosen and the demands of the application of their synchronisation.
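
As an illustration of this framing step, the following sketch splits a mono signal into overlapping frames using NumPy. It is an assumption for illustration only, not code from any of the systems surveyed; the 46 ms frame length and 50% overlap are arbitrary values within the range quoted above.

    import numpy as np

    def frame_signal(signal, sample_rate, frame_ms=46.0, overlap=0.5):
        # Split a mono signal into overlapping frames. The surveyed methods
        # use window lengths of roughly 10-250 ms with around half overlapped.
        frame_len = int(sample_rate * frame_ms / 1000.0)
        hop = max(1, int(frame_len * (1.0 - overlap)))
        n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
        return np.stack([signal[i * hop:i * hop + frame_len]
                         for i in range(n_frames)])   # shape (n_frames, frame_len)

    # Example: one second of a 440 Hz sine tone at 44.1 kHz
    sr = 44100
    t = np.arange(sr) / sr
    frames = frame_signal(np.sin(2 * np.pi * 440 * t), sr)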

2.1.3 Evaluating Methods

How to assess the accuracy of synchronisation methods is another consideration we will look at, as the test data often needs to be annotated by hand. This can be difficult when looking for precise events, such as note onset times, and is also time consuming. Methods of automatically gathering correct test data will also be looked into.

2.2 Feature Space

Between the high-level musical constructs and the low-level digital audio data lies the musical feature space, where the music can be represented by a variety of features. Some of the meta-data examples given in the previous chapter can be seen as features themselves and are presumably pre-computed or hand-written. There exist methods to extract musical features from the raw audio data or to synthesise expected features from the higher-level musical score. Examples of the features we may wish to use include (but are not limited to) amplitude/energy, zero crossing rate, tempo, key, fundamental frequency, spectrum, cepstrum, mel-frequency cepstral coefficients, quefrency, note onsets and beat onsets.
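
For instance, the first two features in this list can be computed per frame as in the sketch below. This is an illustrative assumption using NumPy, operating on a frames array such as that produced by the framing step of Section 2.1.2, and is not taken from any of the cited systems.

    import numpy as np

    def frame_energy(frames):
        # Mean squared amplitude of each frame.
        return np.mean(frames ** 2, axis=1)

    def zero_crossing_rate(frames):
        # Fraction of adjacent sample pairs in each frame whose signs differ.
        signs = np.sign(frames)
        return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)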

2.2.1 Low Level Features

Methods that synchronise audio and meta-data using a low-level approach require the meta-data to be brought down to the level at which audio is typically synchronised with other audio, so that typical audio-to-audio synchronisation methods can be used for comparison. This can be done either by synthesising the expected audio from the meta-data or by directly computing the expected spectral data from the meta-data. The most common way to compare two audio streams is to use the spectral information; however, other low-level features have also been used, such as Mel-Frequency Cepstral Coefficients or chroma vectors.

Due to the nature of the sound, it is impossible to make any meaningful match directly on the digital audio samples (the base level), but the spectral information, the strengths of the different frequencies in the signal, can be calculated from the audio or synthesised audio and makes a good basis for synchronisation.

A spectral representation of the synthesised or original audio is made by taking the Fast Fourier Transform of the frames to get the spectrum of each frame (see Fig. 2.2). By plotting these together, a spectral map of the audio is obtained (see Fig. 2.3). A chroma map can then be produced by folding the spectrum into a 12-dimensional map of the different notes, in which the octaves are grouped together. MFCCs can be calculated by mapping the spectrum onto the mel scale and then taking a further, cepstral, transform of the result; this mel warping makes MFCCs better matched to the human auditory system.

Figure 2.2: Spectrum of an audio frame

Figure 2.3: Spectrum view of the audio
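
The spectral and chroma computations just described can be sketched as follows with NumPy. The Hann window, the 440 Hz reference frequency and the simple linear folding of magnitudes into pitch classes are illustrative assumptions rather than the choices of any particular method cited here.

    import numpy as np

    def spectrogram(frames):
        # Magnitude spectrum of each windowed frame via the FFT.
        window = np.hanning(frames.shape[1])
        return np.abs(np.fft.rfft(frames * window, axis=1))

    def chroma(spec, sample_rate, frame_len, f_ref=440.0):
        # Fold the FFT bins into 12 pitch classes, grouping octaves together.
        # frame_len is the original frame length used as the FFT size.
        n_bins = spec.shape[1]
        freqs = np.arange(n_bins) * sample_rate / frame_len
        chroma_map = np.zeros((spec.shape[0], 12))
        for b in range(1, n_bins):                   # skip the DC bin
            pitch_class = int(round(12 * np.log2(freqs[b] / f_ref))) % 12
            chroma_map[:, pitch_class] += spec[:, b]
        return chroma_map                            # pitch class 0 corresponds to A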


2.2.2 High Level Features

Methods that synchronise audio and meta-data using a high-level approach require extracting the high-level features from the audio. The accuracy of any such synchronisation method is then dependent on the accuracy of the feature extraction methods used.

2.3 Synchronisation Techniques

2.3.1 Dynamic Time Warping

Dynamic Time Warping (DTW) is a process for finding the best path through a matrix of costs. In audio and meta-data synchronisation these costs relate to the difference between a frame of the audio and a frame of the meta-data. If the cost is low, the difference between these frames is small and they are therefore more likely to be representative of each other. The problems with using DTW are the quadratic time and memory costs needed to compute the best path, and the fact that it needs the complete information and so cannot work in real time for long sequences without alterations.

Similarity Matrix

To establish the matrix of costs, two sets of frames that contain comparable feature representations of the pieces to be synchronised are compared. A value is given for the difference of each comparison made, and all these values together form a similarity matrix (see Fig. 2.4). If the two pieces are similar, a path will appear diagonally through the similarity matrix that shows the progression of the least different frames throughout each piece. This path represents how to synchronise the audio and meta-data frames by warping one set of frames to fit the other (see Fig. 2.4).
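
A minimal sketch of building such a cost matrix from two feature sequences (for example chroma or spectral frames) is given below. The Euclidean frame distance is an illustrative assumption; the cited methods each define their own cost measure.

    import numpy as np

    def cost_matrix(features_a, features_b):
        # Pairwise frame distances; low values mark likely matching frames.
        # features_a has shape (n, d) and features_b has shape (m, d).
        diff = features_a[:, None, :] - features_b[None, :, :]
        return np.sqrt(np.sum(diff ** 2, axis=2))    # shape (n, m)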

Path Finding

The path finding function is performed by dynamic programming, which recursively finds the minimum cost of reaching the last position in the matrix. The output of the DTW is a list of co-ordinates which, in synchronisation terms, relates to pairs of aligned frames.
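
The sketch below shows a textbook form of this dynamic-programming step, accumulating minimum costs over the matrix built above and backtracking to recover the aligned frame pairs. It illustrates the principle rather than the exact recursion used by any of the cited systems.

    import numpy as np

    def dtw_path(cost):
        # Accumulate the minimum cost of reaching each cell of the matrix.
        n, m = cost.shape
        acc = np.full((n, m), np.inf)
        acc[0, 0] = cost[0, 0]
        for i in range(n):
            for j in range(m):
                if i == 0 and j == 0:
                    continue
                prev = min(acc[i - 1, j] if i else np.inf,            # vertical step
                           acc[i, j - 1] if j else np.inf,            # horizontal step
                           acc[i - 1, j - 1] if i and j else np.inf)  # diagonal step
                acc[i, j] = cost[i, j] + prev
        # Backtrack from the final cell to recover pairs of aligned frames.
        i, j = n - 1, m - 1
        path = [(i, j)]
        while (i, j) != (0, 0):
            steps = []
            if i and j:
                steps.append((acc[i - 1, j - 1], i - 1, j - 1))
            if i:
                steps.append((acc[i - 1, j], i - 1, j))
            if j:
                steps.append((acc[i, j - 1], i, j - 1))
            _, i, j = min(steps)
            path.append((i, j))
        return list(reversed(path))                  # aligned (frame_a, frame_b) pairs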


Figure 2.4: (Left) Similarity matrix made from two similar spectral maps and (right) the best path

Examples of DTW

An example of the DTW method outlined above is that by Orio and Schwarz [26], which bases its matrix of difference costs on the spectral peak structure of the frames. This method achieves an average offset of 25 ms from the ground truth. Dannenberg and Hu [11] experiment with this approach using chroma features. Soulez et al. [36] use DTW based on spectral differences to synchronise audio with MIDI files. This technique could handle up to 5 instruments. Using an automatic evaluation system with MIDI recorded from keyboards, the method produces an alignment with a standard deviation of 23 ms from the ground truth alignment.

Computational Constraints

As the time and memory costs of DTW are quadratic, synchronising long segments of audio with meta-data this way can become arduous. Constraints can be added to improve the efficiency of the technique, such as Sakoe and Chiba's slope constraint [31], which binds the path within a 10% window; however, any such bound means losing the guarantee of an optimal solution. Soulez et al. [36] mention the use of local constraints, path pruning and storing only the frames linked to specific score events as methods of cutting down the memory and processing requirements of their DTW algorithm.
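
The band idea can be sketched as follows: cells far from the (scaled) diagonal of the cost matrix are set to an infinite cost so that the dynamic-programming step above never visits them. The 10% width follows the constraint described above; everything else is an illustrative assumption.

    import numpy as np

    def band_limited_cost(cost, band_fraction=0.10):
        # Keep only a band of cells around the scaled diagonal of the matrix.
        n, m = cost.shape
        band = band_fraction * max(n, m)
        masked = np.full((n, m), np.inf)
        for i in range(n):
            centre = i * (m - 1) / max(1, n - 1)     # diagonal position for row i
            lo = max(0, int(np.floor(centre - band)))
            hi = min(m, int(np.ceil(centre + band)) + 1)
            masked[i, lo:hi] = cost[i, lo:hi]
        return masked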

Salvador and Chan [32] use a multi-resolution approach to DTW between two sequences, where a low-resolution DTW first finds a rough path by which a higher-resolution DTW is then bound. The algorithm works in linear time and space and is therefore much better suited to matching long pieces of audio and meta-data.

On-line Dynamic Time Warping by Dixon [15] synchronises two audio files in real time by changing the standard DTW formulae to iteratively match parts of a buffered audio stream as they are received. This is the only known method that allows DTW to work online.

2.3.2 Dynamic Programming

In one of the early works on score-tracking and automatic accompaniment, Dannenberg [10] defined the best alignment as that with the longest common sub-sequence of the two streams. In this case the two streams were the musical events that the musician played and the expected musical events in the musical score. Assuming knowledge of what is played, Dannenberg used dynamic programming guided by heuristics to match the sequences. This template matching method could equally be applied to types of meta-data other than musical notes; however, it relies on other techniques to extract the features from the audio.
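
The core of this idea is the classic longest-common-subsequence recursion, sketched below under the assumption that both streams have already been reduced to comparable symbols (here MIDI note numbers). Dannenberg's heuristics for guiding the match in real time are not reproduced.

    def longest_common_subsequence(performed, score):
        # Length of the longest common sub-sequence of two event sequences.
        n, m = len(performed), len(score)
        table = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if performed[i - 1] == score[j - 1]:
                    table[i][j] = table[i - 1][j - 1] + 1
                else:
                    table[i][j] = max(table[i - 1][j], table[i][j - 1])
        return table[n][m]

    # Example: MIDI note numbers played against those expected in the score
    print(longest_common_subsequence([60, 62, 64, 65], [60, 64, 65, 67]))   # prints 3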

2.3.3 Synchronising Multiple Features

Arifi et al. [7] use a combination of note onset detection and pitch estimation to make an offline score-to-audio alignment of polyphonic piano music. They use a frame size of 23 ms but apply methods to predict the time signal, resulting in a resolution of 10 ms. Raphael [29] has suggested a new method that uses dynamic programming and a graphical model of features to expand on a musical score and then synchronise this with the pitch of a piece of audio offline. Evaluation showed this synchronisation method rarely gets lost with multi-instrument orchestral recordings, but it is prone to errors when faced with large tempo variations.


2.3.4 Hidden Markov Models

Another way of looking at the synchronisation problem is to use a stochastic model based on the random variation of the inputs (the features chosen) over time. Hidden Markov Models (HMMs) have been used in speech recognition since the 1970s and have been used in MIR for various functions such as pitch tracking and musical query retrieval. HMMs are suitable for musical recordings as their probabilistic nature does not require an exact replication of the expected music. However, HMMs depend on training on similar data and have not been shown to handle multi-instrument recordings well.

The HMM Model

The meta-data is modelled as a Hidden Markov Model, with the features forming the observed sequence and the position within the meta-data forming the hidden state sequence. The meta-data output is dependent on the features and can be computed using the forward/backward algorithm. This algorithm uses dynamic programming to calculate the probability of a sequence (the output).
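
A minimal sketch of the forward algorithm for a discrete-observation HMM is given below, with NumPy arrays standing in for the score-position states and observed features described above. The numbers in the example are invented purely for illustration.

    import numpy as np

    def forward(observations, initial, transition, emission):
        # Probability of an observation sequence under a discrete HMM.
        # initial:    (n_states,)            prior over hidden states
        # transition: (n_states, n_states)   P(next state | state)
        # emission:   (n_states, n_symbols)  P(observed symbol | state)
        alpha = initial * emission[:, observations[0]]
        for obs in observations[1:]:
            alpha = (alpha @ transition) * emission[:, obs]
        return alpha.sum()

    # Tiny two-state example with made-up numbers
    initial = np.array([0.8, 0.2])
    transition = np.array([[0.7, 0.3], [0.0, 1.0]])
    emission = np.array([[0.9, 0.1], [0.2, 0.8]])
    print(forward([0, 0, 1], initial, transition, emission))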

Examples of HMMs

Various score-tracking methods have been implemented using HMMs. Raphael [27] used 32 ms frames and pitch estimation, while Cano et al. [8] use HMMs with 6 features to track a singing voice in a musical score. Similarly, in LyricAlly by Kan et al. [19], HMMs are used to synchronise lyrics with audio using beat, chroma, chord and key features, for the purpose of automatically gathering karaoke data.

Figure 2.5: Architecture of an HMM (arrows represent dependencies)


2.4 Non-Linear Alignments

A problem for all synchronisation methods arises when one stream has duplicate or otherwise differently ordered segments. It is entirely conceivable that in different renditions of the same song a musician might play the piece differently, add or skip parts, or go back to correct a mistake. Even missed notes and general imperfections in renditions can disrupt the alignment, as one stream appears in a different order. There have been some attempts at catering for this problem within the different synchronisation methods.

Within dynamic time warping, Muller and Appelt [23] refer to this as path-constrained partial music synchronisation and have proposed improvements to methods that use DTW on similarity matrices, guiding the DTW using heuristics and dynamic programming to ensure that only aligned segments with a high enough 'score' (and also favouring longer sections) are considered in the final alignment.
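
The selection step can be caricatured as in the sketch below: candidate aligned segments, reduced here to (start, end, score) tuples, are kept only if their score, weighted to favour longer sections, is high enough and they do not overlap segments already kept. The weighting and the threshold are illustrative assumptions, not the actual algorithm of [23].

    def select_segments(candidates, min_score=0.5, length_weight=0.01):
        # candidates: list of (start_frame, end_frame, score) tuples.
        def weighted(segment):
            start, end, score = segment
            return score + length_weight * (end - start)   # favour longer sections

        kept = []
        for segment in sorted(candidates, key=weighted, reverse=True):
            start, end, _ = segment
            if weighted(segment) < min_score:
                break                                       # remaining candidates score lower
            if all(end <= s or start >= e for s, e, _ in kept):   # no overlap
                kept.append(segment)
        return sorted(kept)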

For template matching methods, Dannenberg and Mukaino [12] have proposed initiating a new matcher object every time there is ambiguity in the alignment process, so that for each possible path there is a specific alignment occurring. When one of these objects obtains a clear match, the other alignments are terminated.

2.5 State-of-the-Art Applications

Score Tracking

SyncPlayer [20] is the culmination of synchronisation work by the Signal Processing Group at the University of Bonn led by Michael Clausen [6], [7], [21], [24] and [25]. SyncPlayer is a framework for implementing prototypical applications using various meta-data synchronisation and random access methods. Clausen et al. use this framework to demonstrate their methods, which collectively represent a large part of the state-of-the-art research in score-tracking and related applications.

MuseBook Score [3] is a commercial digital music book for pianists that displays MIDI files represented as sheet music and follows the musician's progress through the file, effectively turning the pages automatically. MuseBook Score can use either audio (through microphones) or MIDI input to track the user's performance; however, little information could be gathered on the specific techniques used by this application.


Musical Education

Dannenberg's early work on a computer-based Piano Tutor [14] inspired similar projects such as Software Toolworks' Miracle Piano Teaching System [5] and Smoliar et al.'s PianoFORTE [35]. These methods all used score-tracking to teach new piano students how to read music and play basic songs.

The computer games industry has recently been producing music-based games such as Guitar Hero, Rock Band and Frets on Fire that attempt to make playing music more fun and accessible to gamers. The technology used in these games has allowed for novel interfaces for representing musical instructions, but the instruments and the music created have been greatly simplified, to the point where the skills learnt are not transferable to the actual instruments they seek to recreate. Piano Wizard by the Music Wizard Group [4] has taken concepts from these games to provide a piano teaching application that evolves from a simplified game to actual score-reading as the user advances in knowledge.

Automatic Accompaniment

Music Plus One by Raphael [28] is an automatic accompaniment system that works in real time using HMMs based on pitch estimation. The output is a synthesised piano that plays in synchronisation with a musician playing monophonic music.

MySong by Simon et al. [34] uses HMMs and tempo tracking to extract and formulate harmonising chords to accompany a user singing. The synthesised chords are then aligned with the user's recording. It allows people who do not know how to play an instrument to record something and automatically add music to it.

Intelligent Music Editors

The intelligent editor for digital audio proposed by Chafe et al. [9] and Foster et al. [17], and later by Dannenberg and Hu [11], is in some ways realised in current research projects such as SyncPlayer [20] and AudioDB by Casey et al. [to be published].

Ableton Live [1] by Ableton is a commercial music editing application that can analyse audio clips and warp new segments to be in time with any current audio being worked on. In this manner the tempo of the music can be automatically synchronised.


Beat/User Driven Music

BODiBEAT [2] is a personal audio player by Yamaha, developed with the help of Perfecto Herrera-Boyer, that is worn on the user's wrist and detects the user's running or jogging tempo. It uses this tempo to select which song stored on the device best fits the user's current tempo. If the user changes their tempo dramatically, the device will change track to find a new one that fits.

An application for the iPhone called SyncStep by Elliot [16] takes this concept further by allowing the user to select the music and adjusting the tempo of the music to fit that of a walker. When the user's tempo changes dramatically, the music will change to the new tempo. SyncStep is currently limited to walking speeds.

B-Keeper by Robertson [30]...

2.6 Challenges Identified in Linking Music-Related Information and Audio Data

• Score-tracking systems are still imprecise with highly polyphonic instruments [33] and multiple instruments, so how can score-tracking methods be adapted to synchronise this music better?

• How can non-instrumental methods of user expression be synchronised with music?

• How can we make use of the intuitive displays of computer music games to teach users how to play real instruments?

• How can score-tracking be used to develop intelligent education applications that respond to the user's needs?

• In what way can the links made between audio and music-related information be stored and accessed?


Chapter 3

Work plan

3.1 Challenges

3.2 Work done so far

3.2.1 Audio-Score synchronisation using On-Line DTW

By synthesising the MIDI instructions we can adapt the On-Line Dynamic Time Warping function to synchronise MIDI and audio. This can then be compared with other audio to score synchronisation methods.

• Adapted Match to synthesise expected MIDI and align this with audio.

3.2.2 Synchronisation in musical education software and NoteScroller

Demonstrating synchronisation as a way of making a more interactive learning device and using the alignment to evaluate the performance.

• Improved the NoteScroller interface for using MIDI to teach piano music.

• Presented a demo of NoteScroller at NIME 07.


3.2.3 Runner driven music and Trackster

Efficient audio stretching and beat detection to predict a user's footsteps and use this to synchronise music in real time. This is to be prototyped in a mobile audio device.

• Built a prototype of Trackster to synchronise music with a runner's footsteps.

3.3 Quantitative Goals

3.3.1 Comparing synchronisation methods

A comparison of the leading audio to score synchronisation methods on a variety of musical pieces.

Research topics:

• By testing the various methods on a variety of pieces, it is hoped we will demonstrate the differences between them and uncover which is the best method to use in particular situations.

• We will also work towards automatic evaluation methods, addressing the challenge of obtaining test data and looking into means of gathering it in a 'semi-supervised' autonomous process.

• As a precursor to work on efficient synchronisation methods (see below), it is also hoped to measure the efficiency of these processes.

Objectives:

• A paper introducing and comparing the various audio to score synchronisation methods using new evaluation techniques. (Target: IEEE Transactions paper)

3.3.2 Audio-Score synchronisation using On-Line DTW

By synthesising the MIDI instructions we can adapt the On-Line Dynamic Time Warping function to synchronise MIDI and audio. This can then be compared with other audio to score synchronisation methods.


Research topics:

• Evaluate and finalise the MidiMatch system.

• Compare this alignment process with the current state of the art.

• Look into integrating MidiMatch with AudioDB and Sonic Visualiser.

Objectives:

• A MIREX 2008/MIREX 2009 submission to Real-time Audio to Score Alignment (a.k.a. Score Following).

• An accompanying poster/paper outlining the methods used for the MIREX submission.

3.3.3 Synchronisation in musical education software and NoteScroller

Demonstrating synchronisation as a way of making a more interactive learning device and using the alignment to evaluate the performance.

Research topics:

• Incorporate MidiMatch into NoteScroller.

• Work on an autonomous evaluation technique.

Objectives:

• NoteScroller with score-tracking and evaluation techniques. (Target: ICMC/ISMIR 2009)

3.3.4 Runner driven music and Trackster

Efficient audio stretching and beat detection to predict a user's footsteps and use this to synchronise music in real time. This is to be prototyped in a mobile audio device.


Research topics:

• Build a system of offline beat detection coupled with efficient online time warping.

• Evaluate various time warping methods for efficiency and accuracy.

• Evaluate the system and work on rules to accurately predict a user's footsteps.

• Evaluate the entrainment factor of how we are susceptible to the beat whilst also driving it. For example, are we driving the music or is the music driving us?

• Experiment with the sensitivity of the system for efficiency and entrainment.

Objectives:

• An application for a mobile audio device with an accelerometer to run to music. (Target: January)

• A poster on the application and the processes involved. (Target: ICME 2009, deadline 31st Dec)

• A paper on efficient DSP in mobile audio devices.

3.3.5 Mobile phone score synchronisation and Tabster

Finding efficient chord recognition methods and template matching to synchronise guitar tabs and live guitar music recorded through a mobile phone.
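
As a sketch of the kind of chord recognition envisaged here, the code below compares an observed chroma vector (as in Chapter 2) against binary major-triad templates and returns the best-matching root. The templates and the correlation rule are illustrative assumptions, not necessarily the method Tabster will use.

    import numpy as np

    NOTE_NAMES = ['A', 'A#', 'B', 'C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#']

    def chord_templates():
        # Binary major-triad templates (root, major third, fifth) for all 12 roots,
        # indexed with pitch class 0 = A to match the chroma sketch in Chapter 2.
        templates = {}
        for root in range(12):
            t = np.zeros(12)
            t[[root, (root + 4) % 12, (root + 7) % 12]] = 1.0
            templates[NOTE_NAMES[root]] = t / np.linalg.norm(t)
        return templates

    def recognise_chord(chroma_vector, templates):
        # Pick the template with the highest correlation to the observed chroma.
        v = chroma_vector / (np.linalg.norm(chroma_vector) + 1e-9)
        return max(templates, key=lambda name: float(np.dot(templates[name], v)))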

Research topics:

• Build a system for efficient live chord recognition coupled with a template match.

• Evaluate different chord recognition methods for efficiency and accuracy.

• Work towards a method of measuring the energy efficiency of DSP algorithms and the techniques that affect this.


Objectives:

• An application for a mobile phone to view guitar tabs. (Target: December)

• A paper on efficient live DSP algorithms. (Target: IEEE Transactions paper)

• A poster on the application and the processes involved. (Target: ICME 2009, deadline 31st Dec)

3.4 Other Possible Research Aims

As well as the projects mentioned above, the following are also under consideration:

• Synchronised scores have been used as training data for onset detection [38] and transcription methods [37]. Using synchronised lyrics and audio could provide training data for speech recognition methods.

• Using a high-level feature match to guide a higher-resolution low-level DTW. This could then in turn be refined by onset detection, as suggested by Soulez et al. [36].

• Using source separation on polyphonic tracks before DTW and then letting the strong matches define the overall match.

• Using feature synchronisation to find and align appropriate matches between one song finishing and another starting.

• Or using this feature synchronisation to automatically find and create mash-ups of different songs.

• (OMRAS2) Establishing a common framework for storing linked meta-data with audio as RDF in the semantic web.

• (OMRAS2) Look into alternative meta-data alignment options for AudioDB.


3.5 Qualitative Goals

• To ensure that new music information retrieval methods are turned into real applications that provide greater interaction with music and allow this music to be enjoyed in novel ways.

• To make it easier to learn how to play new instruments and to help those with difficulties in doing so.

• To discover tools and functions that allow musicians to learn and create music.

• To use synchronisation processes to help automate the collection of musical instructions and karaoke files.

• To ensure that robots know how to dance as soon as they get the right legs.


Bibliography

[1] Ableton Live. http://www.ableton.com/live

[2] BODiBEAT. http://www.yamaha.com/bodibeat/

[3] MuseBook Score. http://musebook.com/?page=mbscore

[4] Piano Wizard. http://www.pianowizard.com/

[5] The Miracle Piano Teaching System. http://www.mobygames.com/game/nes/miracle-piano-teaching-system

[6] V. Arifi, M. Clausen, F. Kurth, and M. Muller. Automatic synchronization of music data in score-, MIDI- and PCM-format. 2003.

[7] V. Arifi, M. Clausen, F. Kurth, and M. Muller. Score-PCM music synchronization based on extracted score parameters. Esbjerg, Denmark, 2004.

[8] P. Cano, A. Loscos, and J. Bonada. Score-performance matching using HMMs. In International Computer Music Conference, pages 441–444, 1999.

[9] C. Chafe, B. Mont-Reynaud, and L. Rush. Towards an intelligent editor of digital audio: Recognition of musical constructs. Computer Music Journal, 6(1):30–41, 1982.

[10] R. B. Dannenberg. An on-line algorithm for real-time accompaniment. In International Computer Music Conference, pages 193–198, 1984.

[11] R. B. Dannenberg and N. Hu. Polyphonic audio matching for score following and intelligent audio editors. In International Computer Music Conference, pages 27–34, 2003.


[12] R. B. Dannenberg and H. Mukaino. New techniques for enhanced quality of computer accompaniment. In International Computer Music Conference, pages 243–249, 1988.

[13] R. B. Dannenberg and C. Raphael. Music score alignment and computer accompaniment. Communications of the ACM, pages 38–43, 2006.

[14] R. B. Dannenberg, M. Sanchez, A. Joseph, P. Capell, R. Joseph, and R. Saul. A computer-based multi-media tutor for beginning piano students. Journal of New Music Research, 19(2-3):155–173, 1990.

[15] S. Dixon. Live tracking of musical performances using on-line time warping. In Proceedings of the 8th International Conference on Digital Audio Effects, pages 92–97, Madrid, Spain, 2005.

[16] G. Elliot. PersonalSoundtrack: Context-aware playlists that adapt to user pace. In Conference on Human Factors in Computing Systems, 2006.

[17] S. Foster, W. A. Schloss, and A. J. Rockmore. Towards an intelligent editor of digital audio: Signal processing methods. Computer Music Journal, 6(1):42–51, 1982.

[18] S. Haynes. The computer as a sound processor: A tutorial. Computer Music Journal, 6(1):7–17, 1982.

[19] M.-Y. Kan, Y. Wang, D. Iskandar, T. L. Nwe, and A. Shenoy. LyricAlly: Automatic synchronization of textual lyrics to acoustic music signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(2):338–349, 2008.

[20] F. Kurth, M. Muller, D. Damm, C. Fremerey, A. Ribbrock, and M. Clausen. SyncPlayer — an advanced system for multimodal music access. In ISMIR, 2005.

[21] F. Kurth, M. Muller, C. Fremerey, Y. ha Chang, and M. Clausen. Automated synchronization of scanned sheet music with audio recordings. In Proceedings of the 8th International Conference on Music Information Retrieval, 2007.


[22] Y. Meron and K. Hirose. Automatic alignment of a musical score to performed music. Acoustic Science & Technology, 22(3):189–198, 2001.

[23] M. Muller and D. Appelt. Path-constrained partial music synchronization. In ICASSP, 2008.

[24] M. Muller and F. Kurth. Enhancing similarity matrices for music audio analysis. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 437–440, 2006.

[25] M. Muller, F. Kurth, and T. Roder. Towards an efficient algorithm for automatic score-to-audio synchronization. In ISMIR, 2004.

[26] N. Orio and D. Schwarz. Alignment of monophonic and polyphonic music to a score. In International Computer Music Conference, pages 155–158, 2001.

[27] C. Raphael. Automatic segmentation of acoustic musical signals using hidden Markov models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4):360–370, Apr 1999.

[28] C. Raphael. Music Plus One: A system for flexible and expressive musical accompaniment. In Proceedings of the International Computer Music Conference, Havana, Cuba, 2001.

[29] C. Raphael. Aligning music audio with symbolic scores using a hybrid graphical model. Machine Learning, 65(2-3):389–409, 2005.

[30] A. Robertson and M. Plumbley. B-Keeper: A beat-tracker for live performance. In Proceedings of the 7th International Conference on New Interfaces for Musical Expression, pages 234–237, New York, New York, 2007.

[31] H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1), 1978.

[32] S. Salvador and P. Chan. FastDTW: Toward accurate dynamic time warping in linear time and space. In Workshop on Mining Temporal and Sequential Data, page 11, 2004.


[33] D. Schwarz, N. Orio, and N. Schnell. Robust polyphonic MIDI score following with hidden Markov models. In Proceedings of the 2004 International Computer Music Conference, 2004.

[34] I. Simon, D. Morris, and S. Basu. MySong: Automatic accompaniment generation for vocal melodies. In Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, pages 725–734, 2008.

[35] S. W. Smoliar, J. A. Waterworth, and P. R. Kellock. PianoFORTE: A system for piano education beyond notation literacy, 1995.

[36] F. Soulez, X. Rodet, and D. Schwarz. Improving polyphonic and poly-instrumental music to score alignment. In ISMIR, pages 143–148, 2003.

[37] R. J. Turetsky and D. P. Ellis. Ground-truth transcriptions of real music from force-aligned MIDI syntheses. 2003.

[38] W. You and R. B. Dannenberg. Polyphonic music note onset detection using semi-supervised learning. In ISMIR, 2007.
