
Computational Approaches to Melodic Analysis of Indian Art Music

Indian Institute of Sciences, Bengaluru, India 2016

Sankalp Gulati Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain

Melodic description: tonic, melody, intonation, rāga, motifs, similarity.

Tonic Identification

[Figure: spectrogram of an excerpt (frequency 0–5000 Hz vs. time 0–8 s) alongside a multi-pitch histogram of normalized salience vs. frequency bins (1 bin = 10 cents, reference 55 Hz), with peaks f2–f6 marked and the tonic indicated.]

Approaches: signal processing and learning.

- Tanpura / drone background sound
- Extent of gamakas on the Sa and Pa svaras
- Vadi and samvadi svaras of the rāga

S. Gulati, A. Bellur, J. Salamon, H. G. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic identification in Indian art music: approaches and evaluation. Journal of New Music Research, 43(1):55–73, 2014.

Salamon, J., Gulati, S., & Serra, X. (2012). A multipitch approach to tonic identification in Indian classical music. In Proc. of the Int. Society for Music Information Retrieval Conf. (ISMIR), pp. 499–504, Porto, Portugal.

Bellur, A., Ishwar, V., Serra, X., & Murthy, H. (2012). A knowledge-based signal processing approach to tonic identification in Indian classical music. In 2nd CompMusic Workshop, pp. 113–118, Istanbul, Turkey.

Ranjani, H. G., Arthi, S., & Sreenivas, T. V. (2011). Carnatic music analysis: Shadja, swara identification and raga verification in Alapana using stochastic models. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 29–32, New Paltz, NY.

Accuracy: ~90%

Tonic Identification: Multipitch Approach

- Audio example (vocals and drone tracks)
- Utilizing the drone sound
- Multi-pitch analysis

J. Salamon, E. Gómez, and J. Bonada. Sinusoid extraction and salience function design for predominant melody estimation. In Proc. 14th Int. Conf. on Digital Audio Effects (DAFx-11), pp. 73–80, Paris, France, Sep. 2011.

Tonic Identification: Block Diagram

The pipeline, reconstructed from the block diagram:

- Sinusoid extraction: STFT → spectral peak picking → frequency/amplitude correction (audio in, sinusoids out)
- Salience function computation: bin-salience mapping and harmonic summation (sinusoids in, time-frequency salience out)
- Tonic candidate generation: salience peak picking → multi-pitch histogram → histogram peak picking (salience in, tonic candidates out)

Tonic Identification: Signal Processing

- STFT
  - Hop size: 11 ms
  - Window length: 46 ms
  - Window type: Hamming
  - FFT size: 8192 points
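As a rough sketch, these STFT settings can be reproduced with SciPy; the 44.1 kHz sampling rate is an assumption, since the slide does not state it:

```python
import numpy as np
from scipy.signal import stft

fs = 44100                 # assumed sampling rate (not given on the slide)
hop = round(0.011 * fs)    # 11 ms hop size
win = round(0.046 * fs)    # 46 ms window length

x = np.random.randn(fs)    # stand-in for one second of audio
f, t, X = stft(x, fs=fs, window='hamming', nperseg=win,
               noverlap=win - hop, nfft=8192)
print(X.shape[0])          # 8192-point FFT -> 8192 // 2 + 1 = 4097 bins
```

Zero-padding the ~2000-sample window to 8192 FFT points interpolates the spectrum, which helps the peak refinement described next.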

Tonic Identification: Signal Processing

- Spectral peak picking
  - Absolute threshold: −60 dB
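A minimal illustration of absolute-threshold peak picking on a dB magnitude spectrum (the function name and toy spectrum are made up for the example):

```python
import numpy as np

def spectral_peaks(mag_db, threshold_db=-60.0):
    """Indices of local maxima of a dB magnitude spectrum that lie
    above an absolute threshold."""
    m = np.asarray(mag_db)
    is_peak = (m[1:-1] > m[:-2]) & (m[1:-1] >= m[2:]) & (m[1:-1] > threshold_db)
    return np.flatnonzero(is_peak) + 1

spec = np.array([-80.0, -50.0, -70.0, -65.0, -55.0, -61.0, -90.0])
print(spectral_peaks(spec))   # [1 4]: only local maxima above -60 dB survive
```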

Tonic Identification: Signal Processing

- Frequency/amplitude correction
  - Parabolic interpolation
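Parabolic interpolation refines each picked peak using the magnitudes at the peak bin and its two neighbours; the standard three-point formula is:

```python
def parabolic_interp(alpha, beta, gamma):
    """Fit a parabola through (k-1, alpha), (k, beta), (k+1, gamma).
    Returns the fractional bin offset p (in [-0.5, 0.5] for a true peak)
    and the interpolated peak amplitude."""
    p = 0.5 * (alpha - gamma) / (alpha - 2.0 * beta + gamma)
    y = beta - 0.25 * (alpha - gamma) * p
    return p, y

# Example: a peak whose right neighbour is higher than its left one
p, y = parabolic_interp(-40.0, -30.0, -35.0)
print(round(p, 3))   # positive offset: the true peak lies toward bin k+1
```

The refined frequency is then (k + p) times the bin spacing of the zero-padded FFT.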

Tonic Identification: Signal Processing

- Harmonic summation (bin-salience mapping)
  - Spectrum considered: 55–7200 Hz
  - Frequency range: 55–1760 Hz
  - Base frequency: 55 Hz
  - Bin resolution: 10 cents per bin (120 per octave)
  - Number of octaves: 5
  - Maximum harmonics: 20
  - Squared-cosine window across 50 cents
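A sketch of harmonic-summation salience under the parameters above; the per-harmonic weight decay (`DECAY`) and the exact window normalization are assumptions in the spirit of Melodia, not values stated on the slide:

```python
import numpy as np

F_REF, BINS_PER_OCT, N_BINS = 55.0, 120, 600   # 5 octaves, 10 cents/bin
N_HARM = 20
DECAY = 0.8    # assumed per-harmonic weight decay

def hz_to_bin(f_hz):
    return BINS_PER_OCT * np.log2(f_hz / F_REF)

def salience(peak_freqs, peak_mags):
    """Harmonic summation: every sinusoid votes for the bins of its
    candidate fundamentals f/h (h = 1..20), weighted by DECAY**(h-1)
    and spread by a squared-cosine window of +-50 cents (5 bins)."""
    s = np.zeros(N_BINS)
    for f, a in zip(peak_freqs, peak_mags):
        for h in range(1, N_HARM + 1):
            b = hz_to_bin(f / h)
            if b < 0 or b > N_BINS - 1:
                continue
            for k in range(int(np.ceil(b - 5)), int(np.floor(b + 5)) + 1):
                if 0 <= k < N_BINS:
                    s[k] += a * DECAY ** (h - 1) * np.cos((k - b) * np.pi / 10) ** 2
    return s

s = salience([220.0], [1.0])   # a single 220 Hz sinusoid
print(np.argmax(s))            # 240: two octaves above the 55 Hz reference
```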

Tonic Identification: Signal Processing

- Tonic candidate generation (multi-pitch histogram)
  - Salience peaks per frame: 5
  - Frequency range: 110–550 Hz
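The multi-pitch histogram can be sketched as accumulating, per frame, the five strongest salience peaks that fall inside the 110–550 Hz candidate range (helper names are illustrative):

```python
import numpy as np

F_REF, BINS_PER_OCT = 55.0, 120
LOW_HZ, HIGH_HZ = 110.0, 550.0

def hz_to_bin(f_hz):
    return int(round(BINS_PER_OCT * np.log2(f_hz / F_REF)))

def multipitch_histogram(frames, n_bins=600, peaks_per_frame=5):
    """frames: per-frame lists of (freq_hz, salience) peaks."""
    hist = np.zeros(n_bins)
    for peaks in frames:
        in_range = [p for p in peaks if LOW_HZ <= p[0] <= HIGH_HZ]
        for f, s in sorted(in_range, key=lambda p: -p[1])[:peaks_per_frame]:
            hist[hz_to_bin(f)] += s
    return hist

frames = [[(220.0, 1.0), (660.0, 0.9), (146.8, 0.4)]] * 10
hist = multipitch_histogram(frames)
print(np.argmax(hist))   # 240: the 220 Hz peak dominates; 660 Hz is out of range
```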

Tonic Identification: Feature Extraction

- Identifying the tonic in the correct octave using the multi-pitch histogram
- Classification-based template learning
- The class of an instance is the rank of the tonic peak in the histogram
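One way to realize this is to feed intervals between the ranked histogram peaks to a decision tree that predicts which peak is the tonic. The feature layout and training rows below are hypothetical, purely to show the framing:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: each row holds the intervals (in semitones)
# between the highest multi-pitch-histogram peak and the next four peaks;
# the label is the rank of the peak that turned out to be the tonic.
X = np.array([[ 7.0, 12.0,  -5.0, 19.0],
              [ 5.0, -7.0,  12.0, 17.0],
              [-5.0,  7.0, -12.0,  2.0],
              [ 7.0, 12.0,  -5.0, 19.0]])
y = np.array([1, 2, 1, 1])   # tonic was the 1st, 2nd, 1st, 1st peak

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict([[7.0, 12.0, -5.0, 19.0]])[0])   # predicted tonic rank
```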

[Figure: multi-pitch histogram, normalized salience vs. frequency bins (1 bin = 10 cents, reference 55 Hz), with the prominent peaks labelled f2–f5.]

Tonic Identification: Classification

- Decision tree:

[Figure: decision tree splitting on the peak-interval features f2, f3, and f5 at thresholds such as ≤5 / >5, ≤−7 / >−7, ≤−11 / >−11, and ≤−6 / >−6, with leaves giving the rank (1st–5th) of the tonic peak; companion salience-vs-frequency sketches mark which histogram peaks are Sa and Pa.]

Tonic Identification: Results

S. Gulati, A. Bellur, J. Salamon, H. G. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic identification in Indian art music: approaches and evaluation. Journal of New Music Research, 43(1):55–73, 2014.

Predominant Pitch Estimation

Pitch Estimation Algorithms

- Time-domain approaches
  - ACF-based (Rabiner, 1977)
  - AMDF-based: YIN (de Cheveigné & Kawahara, 2002)
- Frequency-domain approaches
  - Two-way mismatch (Maher & Beauchamp, 1994)
  - Subharmonic summation (Hermes, 1988)

Rabiner, L. (1977). On the use of autocorrelation analysis for pitch detection. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(1), 24–33.

De Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.

- Multi-pitch approaches
  - Source separation-based (Klapuri, 2003)
  - Harmonic summation: Melodia (Salamon & Gómez, 2012)

Medan, Y., & Yair, E. (1991). Super resolution pitch determination of speech signals. IEEE Transactions on Signal Processing, 39(1), 40–48.

Maher, R., & Beauchamp, J. W. (1994). Fundamental frequency estimation of musical signals using a two-way mismatch procedure. The Journal of the Acoustical Society of America, 95(4), 2254–2263.

Hermes, D. (1988). Measurement of pitch by subharmonic summation. Journal of the Acoustical Society of America, 83, 257–264.

Klapuri, A. (2003). Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Transactions on Speech and Audio Processing, 11(6), 804–816.

Salamon, J., & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.


Predominant Pitch Estimation: YIN

[Figure: the input signal, its auto-correlation, the difference function, and the cumulative mean normalized difference function, plotted against lag in samples.]

The present article introduces a method for F0 estimation that produces fewer errors than other well-known methods. The name YIN (from "yin" and "yang" of oriental philosophy) alludes to the interplay between autocorrelation and cancellation that it involves. This article is the first of a series of two, of which the second (Kawahara et al., in preparation) is also devoted to fundamental frequency estimation.

II. THE METHOD

This section presents the method step by step to provide insight as to what makes it effective. The classic autocorrelation algorithm is presented first, its error mechanisms are analyzed, and then a series of improvements are introduced to reduce error rates. Error rates are measured at each step over a small database for illustration purposes. Fuller evaluation is proposed in Sec. III.

A. Step 1: The autocorrelation method

The autocorrelation function (ACF) of a discrete signal $x_t$ may be defined as

$$ r_t(\tau) = \sum_{j=t+1}^{t+W} x_j\, x_{j+\tau}, \qquad (1) $$

where $r_t(\tau)$ is the autocorrelation function of lag $\tau$ calculated at time index $t$, and $W$ is the integration window size. This function is illustrated in Fig. 1(b) for the signal plotted in Fig. 1(a). It is common in signal processing to use a slightly different definition:

$$ r'_t(\tau) = \sum_{j=t+1}^{t+W-\tau} x_j\, x_{j+\tau}. \qquad (2) $$

Here the integration window size shrinks with increasing values of $\tau$, with the result that the envelope of the function decreases as a function of lag as illustrated in Fig. 1(c). The two definitions give the same result if the signal is zero outside $[t+1, t+W]$, but differ otherwise. Except where noted, this article assumes the first definition (also known as "modified autocorrelation," "covariance," or "cross-correlation," Rabiner and Schafer, 1978; Huang et al., 2001).

In response to a periodic signal, the ACF shows peaks at multiples of the period. The "autocorrelation method" chooses the highest non-zero-lag peak by exhaustive search within a range of lags (horizontal arrows in Fig. 1). Obviously if the lower limit is too close to zero, the algorithm may erroneously choose the zero-lag peak. Conversely, if the higher limit is large enough, it may erroneously choose a higher-order peak. The definition of Eq. (1) is prone to the second problem, and that of Eq. (2) to the first (all the more so as the window size $W$ is small).

To evaluate the effect of a tapered ACF envelope on error rates, the function calculated as in Eq. (1) was multiplied by a negative ramp to simulate the result of Eq. (2) with a window size $W = \tau_{\max}$:

$$ r''_t(\tau) = \begin{cases} r_t(\tau)\,(1 - \tau/\tau_{\max}) & \text{if } \tau < \tau_{\max}, \\ 0 & \text{otherwise.} \end{cases} \qquad (3) $$

Error rates were measured on a small database of speech (see Sec. III for details) and plotted in Fig. 2 as a function of $\tau_{\max}$.

FIG. 1. (a) Example of a speech waveform. (b) Autocorrelation function (ACF) calculated from the waveform in (a) according to Eq. (1). (c) Same, calculated according to Eq. (2). The envelope of this function is tapered to zero because of the smaller number of terms in the summation at larger $\tau$. The horizontal arrows symbolize the search range for the period.

FIG. 2. F0 estimation error rates as a function of the slope of the envelope of the ACF, quantified by its intercept with the abscissa. The dotted line represents errors for which the F0 estimate was too high, the dashed line those for which it was too low, and the full line their sum. Triangles at the right represent error rates for ACF calculated as in Eq. (1) ($\tau_{\max} = \infty$). These rates were measured over a subset of the database used in Sec. III.

[p. 1918, J. Acoust. Soc. Am., Vol. 111, No. 4, April 2002. A. de Cheveigné and H. Kawahara: YIN, an F0 estimator.]
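Equation (1) translates directly into code; the sketch below locates the highest non-zero-lag ACF peak of a 100 Hz tone (the sampling rate, window size, and search range are arbitrary demo choices):

```python
import numpy as np

def acf(x, t, W, tau_max):
    """Eq. (1): r_t(tau) = sum_{j=t+1}^{t+W} x_j * x_{j+tau}."""
    return np.array([np.dot(x[t + 1:t + W + 1], x[t + 1 + tau:t + W + 1 + tau])
                     for tau in range(tau_max)])

fs = 8000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)   # 100 Hz -> period 80 samples
r = acf(x, t=0, W=400, tau_max=120)
lag = np.argmax(r[40:]) + 40   # search a lag range away from the zero-lag peak
print(lag)                     # 80, the period
```

Note how the result depends on the search range: had the lower limit included lag 0, the zero-lag peak would have won, exactly the failure mode the text describes.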


The parameter $\tau_{\max}$ allows the algorithm to be biased to favor one form of error at the expense of the other, with a minimum of total error for intermediate values. Using Eq. (2) rather than Eq. (1) introduces a natural bias that can be tuned by adjusting $W$. However, changing the window size has other effects, and one can argue that a bias of this sort, if useful, should be applied explicitly rather than implicitly. This is one reason to prefer the definition of Eq. (1).

The autocorrelation method compares the signal to its shifted self. In that sense it is related to the AMDF method (average magnitude difference function, Ross et al., 1974; Ney, 1982) that performs its comparison using differences rather than products, and more generally to time-domain methods that measure intervals between events in time (Hess, 1983). The ACF is the Fourier transform of the power spectrum, and can be seen as measuring the regular spacing of harmonics within that spectrum. The cepstrum method (Noll, 1967) replaces the power spectrum by the log magnitude spectrum and thus puts less weight on high-amplitude parts of the spectrum (particularly near the first formant that often dominates the ACF). Similar "spectral whitening" effects can be obtained by linear predictive inverse filtering or center-clipping (Rabiner and Schafer, 1978), or by splitting the signal over a bank of filters, calculating ACFs within each channel, and adding the results after amplitude normalization (de Cheveigné, 1991). Auditory models based on autocorrelation are currently one of the more popular ways to explain pitch perception (Meddis and Hewitt, 1991; Cariani and Delgutte, 1996).

Despite its appeal and many efforts to improve its performance, the autocorrelation method (and other methods for that matter) makes too many errors for many applications. The following steps are designed to reduce error rates. The first row of Table I gives the gross error rate (defined in Sec. III and measured over a subset of the database used in that section) of the basic autocorrelation method based on Eq. (1) without bias. The next rows are rates for a succession of improvements described in the next paragraphs. These numbers are given for didactic purposes; a more formal evaluation is reported in Sec. III.

B. Step 2: Difference function

We start by modeling the signal $x_t$ as a periodic function with period $T$, by definition invariant for a time shift of $T$:

$$ x_t - x_{t+T} = 0, \quad \forall t. \qquad (4) $$

The same is true after taking the square and averaging over a window:

$$ \sum_{j=t+1}^{t+W} (x_j - x_{j+T})^2 = 0. \qquad (5) $$

Conversely, an unknown period may be found by forming the difference function:

$$ d_t(\tau) = \sum_{j=1}^{W} (x_j - x_{j+\tau})^2, \qquad (6) $$

and searching for the values of $\tau$ for which the function is zero. There is an infinite set of such values, all multiples of the period. The difference function calculated from the signal in Fig. 1(a) is illustrated in Fig. 3(a). The squared sum may be expanded and the function expressed in terms of the ACF:

$$ d_t(\tau) = r_t(0) + r_{t+\tau}(0) - 2 r_t(\tau). \qquad (7) $$

The first two terms are energy terms. Were they constant, the difference function $d_t(\tau)$ would vary as the opposite of $r_t(\tau)$, and searching for a minimum of one or the maximum of the other would give the same result. However, the second energy term also varies with $\tau$, implying that maxima of $r_t(\tau)$ and minima of $d_t(\tau)$ may sometimes not coincide. Indeed, the error rate fell to 1.95% for the difference function from 10.0% for unbiased autocorrelation (Table I).

The magnitude of this decrease in error rate may come as a surprise. An explanation is that the ACF implemented according to Eq. (1) is quite sensitive to amplitude changes. As pointed out by Hess (1983, p. 355), an increase in signal amplitude with time causes ACF peak amplitudes to grow with lag rather than remain constant as in Fig. 1(b). This encourages the algorithm to choose a higher-order peak and make a "too low" error (an amplitude decrease has the opposite effect). The difference function is immune to this particular problem.

FIG. 3. (a) Difference function calculated for the speech signal of Fig. 1(a). (b) Cumulative mean normalized difference function. Note that the function starts at 1 rather than 0 and remains high until the dip at the period.

TABLE I. Gross error rates for the simple unbiased autocorrelation method (step 1), and for the cumulated steps described in the text. These rates were measured over a subset of the database used in Sec. III. Integration window size was 25 ms, window shift was one sample, search range was 40 to 800 Hz, and threshold (step 4) was 0.1.

Version | Gross error (%)
Step 1  | 10.0
Step 2  | 1.95
Step 3  | 1.69
Step 4  | 0.78
Step 5  | 0.77
Step 6  | 0.50
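The difference function of Eq. (6) can be sketched likewise; its minima (rather than maxima) mark the period, and for a clean periodic tone the dip at the period is essentially zero:

```python
import numpy as np

def difference_function(x, t, W, tau_max):
    """Eq. (6): d_t(tau) = sum over a W-sample window at t of
    (x_j - x_{j+tau})**2, for tau = 0..tau_max-1."""
    seg = x[t:t + W + tau_max]
    return np.array([np.sum((seg[:W] - seg[tau:tau + W]) ** 2)
                     for tau in range(tau_max)])

fs = 8000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)   # 100 Hz -> period 80 samples
d = difference_function(x, t=0, W=400, tau_max=120)
lag = np.argmin(d[40:]) + 40   # skip the trivial dip at tau = 0
print(lag)                     # 80: d_t is (near) zero at the period
```

Step 3 of YIN (not excerpted above) then divides each $d_t(\tau)$ by its cumulative mean over smaller lags, which removes the trivial dip at $\tau = 0$ without needing an explicit lower search limit.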

Lag(samples)

!max . The parameter !max allows the algorithm to be biasedto favor one form of error at the expense of the other, with aminimum of total error for intermediate values. Using Eq. "2#rather than Eq. "1# introduces a natural bias that can be tunedby adjusting W. However, changing the window size hasother effects, and one can argue that a bias of this sort, ifuseful, should be applied explicitly rather than implicitly.This is one reason to prefer the definition of Eq. "1#.

The autocorrelation method compares the signal to itsshifted self. In that sense it is related to the AMDF method"average magnitude difference function, Ross et al., 1974;Ney, 1982# that performs its comparison using differencesrather than products, and more generally to time-domainmethods that measure intervals between events in time"Hess, 1983#. The ACF is the Fourier transform of the powerspectrum, and can be seen as measuring the regular spacingof harmonics within that spectrum. The cepstrum method"Noll, 1967# replaces the power spectrum by the log magni-tude spectrum and thus puts less weight on high-amplitudeparts of the spectrum "particularly near the first formant thatoften dominates the ACF#. Similar ‘‘spectral whitening’’ ef-fects can be obtained by linear predictive inverse filtering orcenter-clipping "Rabiner and Schafer, 1978#, or by splittingthe signal over a bank of filters, calculating ACFs withineach channel, and adding the results after amplitude normal-ization "de Cheveigne, 1991#. Auditory models based on au-tocorrelation are currently one of the more popular ways toexplain pitch perception "Meddis and Hewitt, 1991; Carianiand Delgutte, 1996#.

Despite its appeal and many efforts to improve its per-formance, the autocorrelation method "and other methods forthat matter# makes too many errors for many applications.The following steps are designed to reduce error rates. Thefirst row of Table I gives the gross error rate "defined in Sec.III and measured over a subset of the database used in thatsection# of the basic autocorrelation method based on Eq. "1#without bias. The next rows are rates for a succession ofimprovements described in the next paragraphs. These num-bers are given for didactic purposes; a more formal evalua-tion is reported in Sec. III.

B. Step 2: Difference function

We start by modeling the signal xt as a periodic functionwith period T, by definition invariant for a time shift of T:

xt!xt"T#0, $t . "4#

The same is true after taking the square and averaging over awindow:

$j#t"1

t"W

"x j!x j"T#2#0. "5#

Conversely, an unknown period may be found by formingthe difference function:

dt"!##$j#1

W

"x j!x j"!#2, "6#

and searching for the values of ! for which the function iszero. There is an infinite set of such values, all multiples ofthe period. The difference function calculated from the signalin Fig. 1"a# is illustrated in Fig. 3"a#. The squared sum maybe expanded and the function expressed in terms of the ACF:

dt"!##rt"0 #"rt"!"0 #!2rt"!#. "7#

The first two terms are energy terms. Were they constant, thedifference function dt(!) would vary as the opposite ofrt(!), and searching for a minimum of one or the maximumof the other would give the same result. However, the secondenergy term also varies with !, implying that maxima ofrt(!) and minima of dt(!) may sometimes not coincide. In-deed, the error rate fell to 1.95% for the difference functionfrom 10.0% for unbiased autocorrelation "Table I#.

The magnitude of this decrease in error rate may comeas a surprise. An explanation is that the ACF implementedaccording to Eq. "1# is quite sensitive to amplitude changes.As pointed out by Hess "1983, p. 355#, an increase in signalamplitude with time causes ACF peak amplitudes to growwith lag rather than remain constant as in Fig. 1"b#. Thisencourages the algorithm to choose a higher-order peak andmake a ‘‘too low’’ error "an amplitude decrease has the op-posite effect#. The difference function is immune to this par-

FIG. 3. "a# Difference function calculated for the speech signal of Fig. 1"a#."b# Cumulative mean normalized difference function. Note that the functionstarts at 1 rather than 0 and remains high until the dip at the period.

TABLE I. Gross error rates for the simple unbiased autocorrelation method"step 1#, and for the cumulated steps described in the text. These rates weremeasured over a subset of the database used in Sec. III. Integration windowsize was 25 ms, window shift was one sample, search range was 40 to 800Hz, and threshold "step 4# was 0.1.

Version Gross error "%#

Step 1 10.0Step 2 1.95Step 3 1.69Step 4 0.78Step 5 0.77Step 6 0.50

1919J. Acoust. Soc. Am., Vol. 111, No. 4, April 2002 A. de Cheveigne and H. Kawahara: YIN, an F0 estimator

!max . The parameter !max allows the algorithm to be biasedto favor one form of error at the expense of the other, with aminimum of total error for intermediate values. Using Eq. "2#rather than Eq. "1# introduces a natural bias that can be tunedby adjusting W. However, changing the window size hasother effects, and one can argue that a bias of this sort, ifuseful, should be applied explicitly rather than implicitly.This is one reason to prefer the definition of Eq. "1#.

The autocorrelation method compares the signal to itsshifted self. In that sense it is related to the AMDF method"average magnitude difference function, Ross et al., 1974;Ney, 1982# that performs its comparison using differencesrather than products, and more generally to time-domainmethods that measure intervals between events in time"Hess, 1983#. The ACF is the Fourier transform of the powerspectrum, and can be seen as measuring the regular spacingof harmonics within that spectrum. The cepstrum method"Noll, 1967# replaces the power spectrum by the log magni-tude spectrum and thus puts less weight on high-amplitudeparts of the spectrum "particularly near the first formant thatoften dominates the ACF#. Similar ‘‘spectral whitening’’ ef-fects can be obtained by linear predictive inverse filtering orcenter-clipping "Rabiner and Schafer, 1978#, or by splittingthe signal over a bank of filters, calculating ACFs withineach channel, and adding the results after amplitude normal-ization "de Cheveigne, 1991#. Auditory models based on au-tocorrelation are currently one of the more popular ways toexplain pitch perception "Meddis and Hewitt, 1991; Carianiand Delgutte, 1996#.

Despite its appeal and many efforts to improve its per-formance, the autocorrelation method "and other methods forthat matter# makes too many errors for many applications.The following steps are designed to reduce error rates. Thefirst row of Table I gives the gross error rate "defined in Sec.III and measured over a subset of the database used in thatsection# of the basic autocorrelation method based on Eq. "1#without bias. The next rows are rates for a succession ofimprovements described in the next paragraphs. These num-bers are given for didactic purposes; a more formal evalua-tion is reported in Sec. III.

B. Step 2: Difference function

We start by modeling the signal xt as a periodic functionwith period T, by definition invariant for a time shift of T:

xt!xt"T#0, $t . "4#

The same is true after taking the square and averaging over awindow:

$j#t"1

t"W

"x j!x j"T#2#0. "5#

Conversely, an unknown period may be found by formingthe difference function:

dt"!##$j#1

W

"x j!x j"!#2, "6#

and searching for the values of ! for which the function iszero. There is an infinite set of such values, all multiples ofthe period. The difference function calculated from the signalin Fig. 1"a# is illustrated in Fig. 3"a#. The squared sum maybe expanded and the function expressed in terms of the ACF:

dt"!##rt"0 #"rt"!"0 #!2rt"!#. "7#

The first two terms are energy terms. Were they constant, thedifference function dt(!) would vary as the opposite ofrt(!), and searching for a minimum of one or the maximumof the other would give the same result. However, the secondenergy term also varies with !, implying that maxima ofrt(!) and minima of dt(!) may sometimes not coincide. In-deed, the error rate fell to 1.95% for the difference functionfrom 10.0% for unbiased autocorrelation "Table I#.

The magnitude of this decrease in error rate may comeas a surprise. An explanation is that the ACF implementedaccording to Eq. "1# is quite sensitive to amplitude changes.As pointed out by Hess "1983, p. 355#, an increase in signalamplitude with time causes ACF peak amplitudes to growwith lag rather than remain constant as in Fig. 1"b#. Thisencourages the algorithm to choose a higher-order peak andmake a ‘‘too low’’ error "an amplitude decrease has the op-posite effect#. The difference function is immune to this par-

FIG. 3. "a# Difference function calculated for the speech signal of Fig. 1"a#."b# Cumulative mean normalized difference function. Note that the functionstarts at 1 rather than 0 and remains high until the dip at the period.

TABLE I. Gross error rates for the simple unbiased autocorrelation method"step 1#, and for the cumulated steps described in the text. These rates weremeasured over a subset of the database used in Sec. III. Integration windowsize was 25 ms, window shift was one sample, search range was 40 to 800Hz, and threshold "step 4# was 0.1.

Version Gross error "%#

Step 1 10.0Step 2 1.95Step 3 1.69Step 4 0.78Step 5 0.77Step 6 0.50

1919J. Acoust. Soc. Am., Vol. 111, No. 4, April 2002 A. de Cheveigne and H. Kawahara: YIN, an F0 estimator

Lag(samples)

!max . The parameter !max allows the algorithm to be biasedto favor one form of error at the expense of the other, with aminimum of total error for intermediate values. Using Eq. "2#rather than Eq. "1# introduces a natural bias that can be tunedby adjusting W. However, changing the window size hasother effects, and one can argue that a bias of this sort, ifuseful, should be applied explicitly rather than implicitly.This is one reason to prefer the definition of Eq. "1#.

The autocorrelation method compares the signal to itsshifted self. In that sense it is related to the AMDF method"average magnitude difference function, Ross et al., 1974;Ney, 1982# that performs its comparison using differencesrather than products, and more generally to time-domainmethods that measure intervals between events in time"Hess, 1983#. The ACF is the Fourier transform of the powerspectrum, and can be seen as measuring the regular spacingof harmonics within that spectrum. The cepstrum method"Noll, 1967# replaces the power spectrum by the log magni-tude spectrum and thus puts less weight on high-amplitudeparts of the spectrum "particularly near the first formant thatoften dominates the ACF#. Similar ‘‘spectral whitening’’ ef-fects can be obtained by linear predictive inverse filtering orcenter-clipping "Rabiner and Schafer, 1978#, or by splittingthe signal over a bank of filters, calculating ACFs withineach channel, and adding the results after amplitude normal-ization "de Cheveigne, 1991#. Auditory models based on au-tocorrelation are currently one of the more popular ways toexplain pitch perception "Meddis and Hewitt, 1991; Carianiand Delgutte, 1996#.

Despite its appeal and many efforts to improve its per-formance, the autocorrelation method "and other methods forthat matter# makes too many errors for many applications.The following steps are designed to reduce error rates. Thefirst row of Table I gives the gross error rate "defined in Sec.III and measured over a subset of the database used in thatsection# of the basic autocorrelation method based on Eq. "1#without bias. The next rows are rates for a succession ofimprovements described in the next paragraphs. These num-bers are given for didactic purposes; a more formal evalua-tion is reported in Sec. III.

B. Step 2: Difference function

We start by modeling the signal xt as a periodic functionwith period T, by definition invariant for a time shift of T:

xt!xt"T#0, $t . "4#

The same is true after taking the square and averaging over awindow:

$j#t"1

t"W

"x j!x j"T#2#0. "5#

Conversely, an unknown period may be found by formingthe difference function:

dt"!##$j#1

W

"x j!x j"!#2, "6#

and searching for the values of $\tau$ for which the function is zero. There is an infinite set of such values, all multiples of the period. The difference function calculated from the signal in Fig. 1(a) is illustrated in Fig. 3(a). The squared sum may be expanded and the function expressed in terms of the ACF:

$d_t(\tau) = r_t(0) + r_{t+\tau}(0) - 2 r_t(\tau).$  (7)

The first two terms are energy terms. Were they constant, the difference function $d_t(\tau)$ would vary as the opposite of $r_t(\tau)$, and searching for a minimum of one or the maximum of the other would give the same result. However, the second energy term also varies with $\tau$, implying that maxima of $r_t(\tau)$ and minima of $d_t(\tau)$ may sometimes not coincide. Indeed, the error rate fell to 1.95% for the difference function from 10.0% for unbiased autocorrelation (Table I).

The magnitude of this decrease in error rate may come as a surprise. An explanation is that the ACF implemented according to Eq. (1) is quite sensitive to amplitude changes. As pointed out by Hess (1983, p. 355), an increase in signal amplitude with time causes ACF peak amplitudes to grow with lag rather than remain constant as in Fig. 1(b). This encourages the algorithm to choose a higher-order peak and make a "too low" error (an amplitude decrease has the opposite effect). The difference function is immune to this particular problem.

FIG. 3. (a) Difference function calculated for the speech signal of Fig. 1(a). (b) Cumulative mean normalized difference function. Note that the function starts at 1 rather than 0 and remains high until the dip at the period.
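The difference function of Eq. (6) and the cumulative mean normalized difference function shown in Fig. 3(b) are easy to sketch in NumPy. This is an illustrative re-implementation of those two YIN steps plus the absolute-threshold step (threshold 0.1, as in Table I), not the authors' code:

```python
import numpy as np

def difference_function(x, max_lag):
    """YIN step 2, Eq. (6): d_t(tau) = sum_j (x_j - x_{j+tau})^2."""
    W = len(x) - max_lag  # integration window size
    d = np.empty(max_lag)
    for tau in range(max_lag):
        diff = x[:W] - x[tau:tau + W]
        d[tau] = np.dot(diff, diff)
    return d

def cmndf(d):
    """YIN step 3: cumulative mean normalized difference function.
    Starts at 1 by definition and dips near 0 at the period."""
    out = np.ones_like(d)
    running_sum = np.cumsum(d[1:])
    taus = np.arange(1, len(d))
    out[1:] = d[1:] * taus / np.where(running_sum > 0, running_sum, 1e-12)
    return out

# A pure tone with a period of exactly 100 samples.
x = np.sin(2 * np.pi * np.arange(2000) / 100.0)
c = cmndf(difference_function(x, 400))

# Step 4 (absolute threshold): take the first dip below 0.1, then walk
# down to its local minimum, which lands on the true period.
tau = int(np.where(c < 0.1)[0][0])
while tau + 1 < len(c) and c[tau + 1] < c[tau]:
    tau += 1
period = tau  # → 100
```

Searching for the first dip below the threshold, rather than the global minimum, is what avoids picking a higher multiple of the period.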

TABLE I. Gross error rates for the simple unbiased autocorrelation method (step 1), and for the cumulated steps described in the text. These rates were measured over a subset of the database used in Sec. III. Integration window size was 25 ms, window shift was one sample, search range was 40 to 800 Hz, and threshold (step 4) was 0.1.

Version   Gross error (%)
Step 1    10.0
Step 2    1.95
Step 3    1.69
Step 4    0.78
Step 5    0.77
Step 6    0.50


de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.

Predominant Pitch Estimation: YIN

Predominant Pitch Estimation: Melodia

Salience-based melody extraction pipeline (built up one stage per slide):
- Audio
- Spectrogram
- Spectral peaks
- Time-frequency salience
- Salience peaks
- Contours
- Predominant melody contours

Salamon, J., & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
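The time-frequency salience stage can be illustrated with a toy harmonic-summation function: every spectral peak votes, with decaying weight, for each candidate F0 of which it could be a harmonic. This is only a simplified sketch of the idea; the bin resolution, weighting scheme, and peak handling below are placeholder choices, not the published Melodia parameters:

```python
import numpy as np

def harmonic_salience(peak_freqs, peak_mags, fmin=55.0, n_bins=600,
                      cents_per_bin=10.0, n_harmonics=8, alpha=0.8):
    """Toy harmonic summation: each spectral peak at frequency f adds
    salience alpha**(h-1) * magnitude at the F0 candidates f/h."""
    salience = np.zeros(n_bins)
    for f, m in zip(peak_freqs, peak_mags):
        for h in range(1, n_harmonics + 1):
            f0 = f / h
            if f0 < fmin:
                break
            # map the F0 candidate to a 10-cent-resolution bin above fmin
            b = int(round(1200.0 * np.log2(f0 / fmin) / cents_per_bin))
            if b < n_bins:
                salience[b] += alpha ** (h - 1) * m
    return salience

# One peak at 440 Hz votes for F0 candidates 440, 220, 146.7, 110, ... Hz.
s = harmonic_salience([440.0], [1.0])
```

The strongest vote lands at the bin for 440 Hz itself (weight 1), with progressively weaker votes at each subharmonic; real melody extractors then track peaks of this salience over time.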

Essentia implementation of Melodia

The same chain, visualized step by step in Essentia:
- Audio
- Spectrogram
- Spectral peaks
- Time-frequency salience
- Salience peaks
- All contours
- Predominant melody contours
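In code, the whole chain is exposed as a single Essentia algorithm. The sketch below shows typical usage; the parameter values are illustrative, and the import is guarded because Essentia's Python bindings are an optional compiled dependency:

```python
import numpy as np

try:
    import essentia.standard as es
    HAVE_ESSENTIA = True
except ImportError:  # Essentia not installed in this environment
    HAVE_ESSENTIA = False

def extract_melody(path, sample_rate=44100, hop_size=128):
    """Return (times, f0) for the predominant melody of an audio file.
    f0 is in Hz, with 0 marking unvoiced frames."""
    audio = es.EqloudLoader(filename=path, sampleRate=sample_rate)()
    melodia = es.PredominantPitchMelodia(frameSize=2048, hopSize=hop_size,
                                         sampleRate=sample_rate)
    f0, confidence = melodia(audio)
    times = np.arange(len(f0)) * hop_size / float(sample_rate)
    return times, f0
```

On older Essentia versions the algorithm is named `PredominantMelody`; check the installed API reference for the exact name and parameters.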


What about loudness and timbre?

Loudness features in Essentia

Loudness of predominant voice

[Figure: spectrogram (frequency vs. time) with the F0 trajectory of the predominant voice overlaid, built up over successive slides]

Loudness of predominant voice: example

Spectral centroid of predominant voice
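Similarly, a timbre descriptor such as the spectral centroid can be restricted to the predominant voice by computing it only over the bins nearest the harmonics of the extracted F0. Again an illustrative sketch under the same assumptions, not a definitive implementation:

```python
import numpy as np

def harmonic_centroid(spectrum, freqs, f0, n_harmonics=10):
    """Magnitude-weighted mean frequency over the bins nearest the
    harmonics of f0 -- a voice-specific spectral centroid."""
    if f0 <= 0:
        return 0.0
    idx = []
    for h in range(1, n_harmonics + 1):
        if h * f0 > freqs[-1]:
            break
        idx.append(int(np.argmin(np.abs(freqs - h * f0))))
    mags = spectrum[idx]
    if mags.sum() == 0:
        return 0.0
    return float(np.dot(freqs[idx], mags) / mags.sum())

# Equal-magnitude harmonics of 100 Hz: the centroid is their mean, 550 Hz.
freqs = np.arange(0.0, 4000.0, 10.0)
spectrum = np.zeros_like(freqs)
for h in range(1, 11):
    spectrum[10 * h] = 1.0
centroid = harmonic_centroid(spectrum, freqs, 100.0)  # → 550.0
```

A brighter voice (more energy in upper harmonics) pushes this centroid upward, which is what makes it useful as a timbre cue.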

CompMusic: Dunya

[Diagram: Dunya server exposing its corpora through an API over the Internet]

CompMusic: Dunya Web

CompMusic: Dunya API

https://github.com/MTG/pycompmusic

Dunya API Examples
- Metadata
- Features
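Metadata and precomputed features are fetched over Dunya's REST API, which pycompmusic wraps. The sketch below builds a raw authenticated request with only the standard library; the endpoint path and the token header follow pycompmusic's conventions but should be checked against the current API docs, and a per-account token is required:

```python
import json
import urllib.request

DUNYA_ROOT = "https://dunya.compmusic.upf.edu"

def dunya_get(endpoint, token):
    """GET a JSON document from the Dunya API, authenticating with the
    per-account token from your Dunya profile."""
    req = urllib.request.Request(
        DUNYA_ROOT + endpoint,
        headers={"Authorization": "Token " + token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# e.g. dunya_get("/api/carnatic/recording", "<your-token>") would list
# Carnatic recordings (network access and a valid token needed).
```

For real use, prefer the pycompmusic helpers in the repository above, which handle paging and per-tradition endpoints.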