[tutorial] computational approaches to melodic analysis of indian art music
TRANSCRIPT
Computational Approaches to Melodic Analysis of Indian Art Music
Indian Institute of Science, Bengaluru, India, 2016
Sankalp Gulati, Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
Tonic Identification
[Figure: spectrogram of an excerpt (frequency 0–5000 Hz vs. time 0–8 s) and a normalized pitch-salience histogram (frequency in bins, 1 bin = 10 cents, ref = 55 Hz) with peaks labelled f2–f6 and the tonic peak marked.]
Approaches: signal processing and learning
• Tanpura / drone background sound
• Extent of gamakas on Sa and Pa svaras
• Vadi and samvadi svaras of the rāga
S. Gulati, A. Bellur, J. Salamon, H. G. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic identification in Indian art music: approaches and evaluation. Journal of New Music Research, 43(1):55–73, 2014.
Salamon, J., Gulati, S., & Serra, X. (2012). A multipitch approach to tonic identification in Indian classical music. In Proc. of Int. Conf. on Music Information Retrieval (ISMIR) (pp. 499–504), Porto, Portugal.
Bellur, A., Ishwar, V., Serra, X., & Murthy, H. (2012). A knowledge-based signal processing approach to tonic identification in Indian classical music. In Proc. of the 2nd CompMusic Workshop (pp. 113–118), Istanbul, Turkey.
Ranjani, H. G., Arthi, S., & Sreenivas, T. V. (2011). Carnatic music analysis: Shadja, swara identification and raga verification in Alapana using stochastic models. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 29–32), New Paltz, NY.
Accuracy: ~90%
Tonic Identification: Multipitch Approach
• Audio example: vocals / drone
• Utilizing the drone sound
• Multi-pitch analysis
J. Salamon, E. Gómez, and J. Bonada. Sinusoid extraction and salience function design for predominant melody estimation. In Proc. 14th Int. Conf. on Digital Audio Effects (DAFx-11), pages 73–80, Paris, France, Sep. 2011.
Tonic Identification: Block Diagram
Audio → Sinusoid extraction (STFT → spectral peak picking → frequency/amplitude correction) → sinusoids
Sinusoids → Salience function computation (bin salience mapping → harmonic summation) → time-frequency salience
Salience → Tonic candidate generation (salience peak picking → multi-pitch histogram → histogram peak picking) → tonic candidates
Tonic Identification: Signal Processing
• STFT
  – Hop size: 11 ms
  – Window length: 46 ms
  – Window type: Hamming
  – FFT size: 8192 points
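The analysis front end with the stated parameters can be sketched with standard tools. This is a minimal illustration using scipy (11 ms hop, 46 ms Hamming window, 8192-point FFT), not the authors' implementation; the function name is mine.

```python
import numpy as np
from scipy.signal import stft

def compute_stft(audio, fs=44100):
    """Magnitude spectrogram with the slide's analysis parameters."""
    win_len = int(round(0.046 * fs))   # ~46 ms analysis window
    hop = int(round(0.011 * fs))       # ~11 ms hop size
    f, t, X = stft(audio, fs=fs, window='hamming',
                   nperseg=win_len, noverlap=win_len - hop, nfft=8192)
    return f, t, np.abs(X)

# Sanity check on a 1 s test tone at 220 Hz: the strongest bin of a
# mid-signal frame should sit close to 220 Hz (bin width = fs/8192 ≈ 5.4 Hz).
fs = 44100
x = np.sin(2 * np.pi * 220.0 * np.arange(fs) / fs)
f, t, mag = compute_stft(x, fs)
print(f[np.argmax(mag[:, mag.shape[1] // 2])])  # close to 220 Hz
```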
Tonic Identification: Signal Processing
• Spectral peak picking
  – Absolute threshold: −60 dB
Tonic Identification: Signal Processing
• Frequency/amplitude correction
  – Parabolic interpolation
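The two steps above can be sketched as follows: local maxima above the −60 dB absolute threshold are picked from a dB-magnitude spectrum, then each peak's frequency and amplitude are refined by parabolic (quadratic) interpolation through the peak bin and its two neighbours. Function names are illustrative, not from the authors' code.

```python
import numpy as np

def pick_peaks(mag_db, threshold_db=-60.0):
    """Indices of local maxima above the absolute dB threshold."""
    k = np.arange(1, len(mag_db) - 1)
    is_peak = (mag_db[k] > mag_db[k - 1]) & (mag_db[k] > mag_db[k + 1])
    return k[is_peak & (mag_db[k] > threshold_db)]

def refine_peak(mag_db, k, bin_hz):
    """Parabolic interpolation: corrected (frequency, amplitude) of peak at bin k."""
    a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
    p = 0.5 * (a - c) / (a - 2 * b + c)   # fractional bin offset, in [-0.5, 0.5]
    return (k + p) * bin_hz, b - 0.25 * (a - c) * p

# A parabola peaking between bins (at 4.3) is recovered by the interpolation.
spectrum = -((np.arange(10) - 4.3) ** 2)
k = pick_peaks(spectrum)[0]
freq, amp = refine_peak(spectrum, k, bin_hz=1.0)
print(freq, amp)  # recovers the true peak position 4.3 and height ~0
```

Interpolation matters here because a 10-cent salience grid is much finer than the FFT's bin spacing at low frequencies.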
Tonic Identification: Signal Processing
• Harmonic summation
  – Spectrum considered: 55–7200 Hz
  – Frequency range: 55–1760 Hz
  – Base frequency: 55 Hz
  – Bin resolution: 10 cents per bin (120 bins per octave)
  – Number of octaves: 5
  – Maximum harmonics: 20
  – Squared-cosine weighting window across 50 cents
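The salience computation above can be sketched as: each spectral peak votes for the candidate fundamentals of which it could be one of up to 20 harmonics, with the vote spread over a squared-cosine window of 50 cents around the candidate's bin (10 cents per bin, reference 55 Hz, 5 octaves). The harmonic decay weight ALPHA is an illustrative assumption; the slide does not specify the weighting scheme.

```python
import numpy as np

REF_HZ, CENTS_PER_BIN, N_BINS = 55.0, 10.0, 600   # 5 octaves x 120 bins/octave
N_HARMONICS, ALPHA = 20, 0.8                      # ALPHA: assumed harmonic decay

def salience(peak_freqs, peak_mags):
    """Salience over candidate-f0 bins by harmonic summation of spectral peaks."""
    s = np.zeros(N_BINS)
    for f, m in zip(peak_freqs, peak_mags):
        for h in range(1, N_HARMONICS + 1):
            cents = 1200.0 * np.log2((f / h) / REF_HZ)   # candidate f0 = f/h
            b = cents / CENTS_PER_BIN                    # fractional bin index
            for k in range(int(np.ceil(b - 5)), int(np.floor(b + 5)) + 1):
                if 0 <= k < N_BINS:
                    d = (k - b) * CENTS_PER_BIN          # distance in cents
                    w = np.cos(0.5 * np.pi * d / 50.0) ** 2  # 50-cent cos^2 window
                    s[k] += (ALPHA ** (h - 1)) * w * m
    return s

# A single 440 Hz peak is most salient at its own bin: 3600 cents above 55 Hz.
print(np.argmax(salience([440.0], [1.0])))  # 360
```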
Tonic Identification: Signal Processing
• Tonic candidate generation
  – Number of salience peaks per frame: 5
  – Frequency range: 110–550 Hz
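Candidate generation can be sketched as: the top five salience peaks of every frame are accumulated into a multi-pitch histogram, and the strongest histogram peaks inside 110–550 Hz become the ranked tonic candidates. Variable and function names are mine, not the authors'.

```python
import numpy as np

def tonic_candidates(frame_saliences, ref_hz=55.0, cents_per_bin=10.0, n_cands=5):
    """Ranked candidate frequencies from per-frame salience vectors."""
    n_bins = frame_saliences.shape[1]
    hist = np.zeros(n_bins)
    for s in frame_saliences:                      # one salience vector per frame
        top = np.argsort(s)[-5:]                   # 5 salience peaks per frame
        hist[top] += s[top]
    freqs = ref_hz * 2.0 ** (np.arange(n_bins) * cents_per_bin / 1200.0)
    hist[(freqs < 110.0) | (freqs > 550.0)] = 0.0  # restrict candidate range
    ranked = np.argsort(hist)[::-1][:n_cands]
    return freqs[ranked]

# Frames whose salience always peaks at bin 220 (~196 Hz, i.e. G3) yield
# that frequency as the top tonic candidate.
frames = np.zeros((4, 600))
frames[:, 220] = 1.0
print(tonic_candidates(frames)[0])  # ~196 Hz
```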
Tonic Identification: Feature Extraction
• Identifying the tonic in the correct octave using the multi-pitch histogram
• Classification-based template learning
• The class of an instance is the rank of the tonic peak
[Figure: normalized multi-pitch histogram (frequency bins, 1 bin = 10 cents, ref: 55 Hz) with peaks labelled f2–f5.]
Tonic Identification: Classification
• Decision tree:
[Figure: learned decision tree splitting on features f2, f3, and f5 with thresholds such as ≤5 / >5, ≤−7 / >−7, ≤−11 / >−11, and ≤−6 / >−6, assigning tonic ranks 1st–5th; insets sketch salience vs. frequency with Sa and Pa peaks.]
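The features behind the tree can be sketched as follows: histogram peaks ranked by salience are encoded as pitch distances, in semitones, of peaks 2..N from the top peak. Thresholds like 5 or −7 semitones then test for characteristic intervals such as the fifth (Sa–Pa). The helper name is mine.

```python
def interval_features(ranked_peak_bins, cents_per_bin=10.0):
    """Semitone distances of lower-ranked histogram peaks from the top peak."""
    top = ranked_peak_bins[0]
    # 1 bin = 10 cents, 100 cents = 1 semitone
    return [(b - top) * cents_per_bin / 100.0 for b in ranked_peak_bins[1:]]

# Second peak a fifth above, third a fifth below, fourth 5 semitones above:
print(interval_features([240, 310, 170, 290]))  # [7.0, -7.0, 5.0]
```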
Tonic Identification: Results
S. Gulati, A. Bellur, J. Salamon, H. G. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic identification in Indian art music: approaches and evaluation. Journal of New Music Research, 43(1):55–73, 2014.
Pitch Estimation Algorithms
• Time-domain approaches
  – ACF-based (Rabiner, 1977)
  – AMDF-based: YIN (de Cheveigné and Kawahara, 2002)
• Frequency-domain approaches
  – Two-way mismatch (Maher and Beauchamp, 1994)
  – Subharmonic summation (Hermes, 1988)
• Multi-pitch approaches
  – Source separation-based (Klapuri, 2003)
  – Harmonic summation: Melodia (Salamon and Gómez, 2012)
Rabiner, L. (1977, February). On the use of autocorrelation analysis for pitch detection. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(1), 24–33.
De Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
Medan, Y., & Yair, E. (1991). Super resolution pitch determination of speech signals. IEEE Transactions on Signal Processing, 39(1), 40–48.
Maher, R., & Beauchamp, J. W. (1994). Fundamental frequency estimation of musical signals using a two-way mismatch procedure. The Journal of the Acoustical Society of America, 95(4), 2254–2263.
Hermes, D. (1988). Measurement of pitch by subharmonic summation. Journal of the Acoustical Society of America, 83, 257–264.
Klapuri, A. (2003, November). Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Transactions on Speech and Audio Processing, 11(6), 804–816.
Salamon, J., & Gómez, E. (2012, August). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
Predominant Pitch Estimation: YIN
• Signal
• Difference function
• Auto-correlation
• Cumulative difference function
The present article introduces a method for F0 estimation that produces fewer errors than other well-known methods. The name YIN (from "yin" and "yang" of oriental philosophy) alludes to the interplay between autocorrelation and cancellation that it involves. This article is the first of a series of two, of which the second (Kawahara et al., in preparation) is also devoted to fundamental frequency estimation.
II. THE METHOD
This section presents the method step by step to provide insight as to what makes it effective. The classic autocorrelation algorithm is presented first, its error mechanisms are analyzed, and then a series of improvements are introduced to reduce error rates. Error rates are measured at each step over a small database for illustration purposes. Fuller evaluation is proposed in Sec. III.
A. Step 1: The autocorrelation method
The autocorrelation function (ACF) of a discrete signal x_t may be defined as

r_t(τ) = Σ_{j=t+1}^{t+W} x_j x_{j+τ},   (1)

where r_t(τ) is the autocorrelation function of lag τ calculated at time index t, and W is the integration window size. This function is illustrated in Fig. 1(b) for the signal plotted in Fig. 1(a). It is common in signal processing to use a slightly different definition:

r′_t(τ) = Σ_{j=t+1}^{t+W−τ} x_j x_{j+τ}.   (2)

Here the integration window size shrinks with increasing values of τ, with the result that the envelope of the function decreases as a function of lag, as illustrated in Fig. 1(c). The two definitions give the same result if the signal is zero outside [t+1, t+W], but differ otherwise. Except where noted, this article assumes the first definition (also known as "modified autocorrelation," "covariance," or "cross-correlation"; Rabiner and Schafer, 1978; Huang et al., 2001).
In response to a periodic signal, the ACF shows peaks at multiples of the period. The "autocorrelation method" chooses the highest non-zero-lag peak by exhaustive search within a range of lags (horizontal arrows in Fig. 1). Obviously if the lower limit is too close to zero, the algorithm may erroneously choose the zero-lag peak. Conversely, if the higher limit is large enough, it may erroneously choose a higher-order peak. The definition of Eq. (1) is prone to the second problem, and that of Eq. (2) to the first (all the more so as the window size W is small).
To evaluate the effect of a tapered ACF envelope on error rates, the function calculated as in Eq. (1) was multiplied by a negative ramp to simulate the result of Eq. (2) with a window size W = τ_max:

r″_t(τ) = r_t(τ)(1 − τ/τ_max) if τ < τ_max, and 0 otherwise.   (3)

Error rates were measured on a small database of speech (see Sec. III for details) and plotted in Fig. 2 as a function of τ_max.
FIG. 1. (a) Example of a speech waveform. (b) Autocorrelation function (ACF) calculated from the waveform in (a) according to Eq. (1). (c) Same, calculated according to Eq. (2). The envelope of this function is tapered to zero because of the smaller number of terms in the summation at larger τ. The horizontal arrows symbolize the search range for the period. [Abscissa: lag (samples).]
FIG. 2. F0 estimation error rates as a function of the slope of the envelope of the ACF, quantified by its intercept with the abscissa. The dotted line represents errors for which the F0 estimate was too high, the dashed line those for which it was too low, and the full line their sum. Triangles at the right represent error rates for ACF calculated as in Eq. (1) (τ_max = ∞). These rates were measured over a subset of the database used in Sec. III.
1918 J. Acoust. Soc. Am., Vol. 111, No. 4, April 2002. A. de Cheveigné and H. Kawahara: YIN, an F0 estimator
The parameter τ_max allows the algorithm to be biased to favor one form of error at the expense of the other, with a minimum of total error for intermediate values. Using Eq. (2) rather than Eq. (1) introduces a natural bias that can be tuned by adjusting W. However, changing the window size has other effects, and one can argue that a bias of this sort, if useful, should be applied explicitly rather than implicitly. This is one reason to prefer the definition of Eq. (1).
The autocorrelation method compares the signal to its shifted self. In that sense it is related to the AMDF method (average magnitude difference function; Ross et al., 1974; Ney, 1982) that performs its comparison using differences rather than products, and more generally to time-domain methods that measure intervals between events in time (Hess, 1983). The ACF is the Fourier transform of the power spectrum, and can be seen as measuring the regular spacing of harmonics within that spectrum. The cepstrum method (Noll, 1967) replaces the power spectrum by the log magnitude spectrum and thus puts less weight on high-amplitude parts of the spectrum (particularly near the first formant that often dominates the ACF). Similar "spectral whitening" effects can be obtained by linear predictive inverse filtering or center-clipping (Rabiner and Schafer, 1978), or by splitting the signal over a bank of filters, calculating ACFs within each channel, and adding the results after amplitude normalization (de Cheveigné, 1991). Auditory models based on autocorrelation are currently one of the more popular ways to explain pitch perception (Meddis and Hewitt, 1991; Cariani and Delgutte, 1996).
Despite its appeal and many efforts to improve its performance, the autocorrelation method (and other methods for that matter) makes too many errors for many applications. The following steps are designed to reduce error rates. The first row of Table I gives the gross error rate (defined in Sec. III and measured over a subset of the database used in that section) of the basic autocorrelation method based on Eq. (1) without bias. The next rows are rates for a succession of improvements described in the next paragraphs. These numbers are given for didactic purposes; a more formal evaluation is reported in Sec. III.
B. Step 2: Difference function
We start by modeling the signal x_t as a periodic function with period T, by definition invariant for a time shift of T:

x_t − x_{t+T} = 0,  ∀t.   (4)

The same is true after taking the square and averaging over a window:

Σ_{j=t+1}^{t+W} (x_j − x_{j+T})² = 0.   (5)

Conversely, an unknown period may be found by forming the difference function:

d_t(τ) = Σ_{j=1}^{W} (x_j − x_{j+τ})²,   (6)

and searching for the values of τ for which the function is zero. There is an infinite set of such values, all multiples of the period. The difference function calculated from the signal in Fig. 1(a) is illustrated in Fig. 3(a). The squared sum may be expanded and the function expressed in terms of the ACF:

d_t(τ) = r_t(0) + r_{t+τ}(0) − 2r_t(τ).   (7)

The first two terms are energy terms. Were they constant, the difference function d_t(τ) would vary as the opposite of r_t(τ), and searching for a minimum of one or the maximum of the other would give the same result. However, the second energy term also varies with τ, implying that maxima of r_t(τ) and minima of d_t(τ) may sometimes not coincide. Indeed, the error rate fell to 1.95% for the difference function from 10.0% for unbiased autocorrelation (Table I).
The magnitude of this decrease in error rate may come as a surprise. An explanation is that the ACF implemented according to Eq. (1) is quite sensitive to amplitude changes. As pointed out by Hess (1983, p. 355), an increase in signal amplitude with time causes ACF peak amplitudes to grow with lag rather than remain constant as in Fig. 1(b). This encourages the algorithm to choose a higher-order peak and make a "too low" error (an amplitude decrease has the opposite effect). The difference function is immune to this particular problem.
FIG. 3. (a) Difference function calculated for the speech signal of Fig. 1(a). (b) Cumulative mean normalized difference function. Note that the function starts at 1 rather than 0 and remains high until the dip at the period.
TABLE I. Gross error rates for the simple unbiased autocorrelation method (step 1), and for the cumulated steps described in the text. These rates were measured over a subset of the database used in Sec. III. Integration window size was 25 ms, window shift was one sample, search range was 40 to 800 Hz, and threshold (step 4) was 0.1.

Version | Gross error (%)
Step 1  | 10.0
Step 2  | 1.95
Step 3  | 1.69
Step 4  | 0.78
Step 5  | 0.77
Step 6  | 0.50

1919 J. Acoust. Soc. Am., Vol. 111, No. 4, April 2002. A. de Cheveigné and H. Kawahara: YIN, an F0 estimator
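The excerpt above can be sketched in code: the difference function of Eq. (6), the cumulative mean normalized difference function shown in Fig. 3(b), and the absolute threshold of 0.1 from Table I. This is a minimal sketch assuming a clean periodic input; function names are mine, and a full implementation adds the paper's remaining steps (parabolic interpolation, local estimate refinement).

```python
import numpy as np

def yin_period(x, w, tau_max, threshold=0.1):
    """Period (in samples) via difference function + CMNDF + 0.1 threshold."""
    # Eq. (6): d_t(tau) = sum_{j=1..W} (x_j - x_{j+tau})^2
    d = np.array([np.sum((x[:w] - x[tau:tau + w]) ** 2)
                  for tau in range(tau_max)])
    # Cumulative mean normalized difference: starts at 1, dips at the period.
    cmndf = np.ones(tau_max)
    running = np.cumsum(d[1:])
    cmndf[1:] = d[1:] * np.arange(1, tau_max) / np.maximum(running, 1e-12)
    for tau in range(1, tau_max):            # first dip below the threshold
        if cmndf[tau] < threshold:
            while tau + 1 < tau_max and cmndf[tau + 1] < cmndf[tau]:
                tau += 1                     # descend to the local minimum
            return tau
    return int(np.argmin(cmndf[1:])) + 1     # fallback: global minimum

fs = 8000
x = np.sin(2 * np.pi * 100.0 * np.arange(2048) / fs)  # 100 Hz: period 80 samples
print(yin_period(x, w=400, tau_max=200))  # 80
```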
De Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
Predominant Pitch Estimation: Melodia
• Audio → spectrogram → spectral peaks
• Spectral peaks → time-frequency salience
• Time-frequency salience → salience peaks → contours
• Contours → predominant melody contours
Salamon, J., & Gómez, E. (2012, August). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.