on the use of autocorrelation analysis for pitch detection

24 IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-25, NO. 1, FEBRUARY 1977

On the Use of Autocorrelation Analysis for Pitch Detection

LAWRENCE R. RABINER, FELLOW, IEEE

Abstract: One of the most time-honored methods of detecting pitch is to use some type of autocorrelation analysis on speech which has been appropriately preprocessed. The goal of the speech preprocessing in most systems is to whiten, or spectrally flatten, the signal so as to eliminate the effects of the vocal tract spectrum on the detailed shape of the resulting autocorrelation function. The purpose of this paper is to present some results on several types of (nonlinear) preprocessing which can be used to effectively spectrally flatten the speech signal. The types of nonlinearities which are considered are classified by a nonlinear input-output quantizer characteristic. By appropriate adjustment of the quantizer threshold levels, both the ordinary (linear) autocorrelation analysis and the center clipping-peak clipping autocorrelation of Dubnowski et al. [1] can be obtained. Results are presented to demonstrate the degree of spectrum flattening obtained using these methods. Each of the proposed methods was tested on several of the utterances used in a recent pitch detector comparison study by Rabiner et al. [2]. Results of this comparison are included in this paper. One final topic which is discussed in this paper is an algorithm for adaptively choosing a frame size for an autocorrelation pitch analysis.

I. INTRODUCTION

Although a large number of different methods have been proposed for detecting pitch, the autocorrelation pitch detector is still one of the most robust and reliable of pitch detectors [2]. There are several reasons why autocorrelation methods for pitch detection have generally met with good success. The autocorrelation computation is made directly on the waveform and is a fairly straightforward (albeit time consuming) computation. Although a high processing rate is required, the autocorrelation computation is readily amenable to digital hardware implementation, generally requiring only a single multiplier and an accumulator as the computational elements. Finally, the autocorrelation computation is largely phase insensitive.¹ Thus, it is a good method to use to detect pitch of speech which has been transmitted over a telephone line, or has suffered some degree of phase distortion via transmission.

Although an autocorrelation pitch detector has some advantages for pitch detection, there are several problems associated with the use of this method. Although the autocorrelation function of a section of voiced speech generally displays a fairly prominent peak at the pitch period, autocorrelation peaks due to the detailed formant structure of the signal are also often present. Thus, one problem is to decide which of several autocorrelation peaks corresponds to the pitch period. Another problem with the autocorrelation computation is the required use of a window for computing the short-time autocorrelation function. The use of a window for analysis leads to at least two difficulties. First, there is the problem of choosing an appropriate window. Second, there is the problem that (for a stationary analysis),² no matter which window is selected, the effect of the window is to taper the autocorrelation function smoothly to 0 as the autocorrelation index increases. This effect tends to compound the difficulties mentioned above, in which formant peaks in the autocorrelation function (which occur at lower indices than the pitch period peak) tend to be of greater magnitude than the peak due to the pitch period.

Manuscript received April 4, 1976; revised August 16, 1976. The author is with Bell Laboratories, Murray Hill, NJ 07974.

¹ In the limit of exactly periodic signals, or for an infinite correlation function, it is exactly phase insensitive.

A final difficulty with the autocorrelation computation is the problem of choosing an appropriate analysis frame (window) size. The ideal analysis frame should contain from 2 to 3 complete pitch periods. Thus, for high pitched speakers the analysis frame should be short (5-20 ms), whereas for low pitched speakers it should be long (20-50 ms).

A wide variety of solutions have been proposed to the above problems. To partially eliminate the effects of the higher formant structure on the autocorrelation function, most methods use a sharp cutoff low-pass filter with cutoff around 900 Hz. This will, in general, preserve a sufficient number of pitch harmonics for accurate pitch detection, but will eliminate the second and higher formants. In addition to linear filtering to remove the formant structure, a wide variety of methods have been proposed for directly or indirectly spectrally flattening the speech signal to remove the effects of the first formant [3]-[5], [1]. Included among these techniques are center clipping and spectral equalization by filter bank methods [3], inverse filtering using linear prediction methods [4], spectral flattening by linear prediction and a Newton transformation [5], and spectral flattening by a combination of center and peak clipping methods [1].

Each of these methods has met with some degree of success; however, problems still remain. It is the purpose of this paper to investigate the properties of a class of nonlinearities applied to the speech signal prior to autocorrelation analysis with the purpose of spectrally flattening the signal. Also, a solution to the problem of choosing an analysis frame size which adapts to the estimated average pitch of the speaker will be presented.

The organization of this paper is as follows. In Section II we review the theory of short-time autocorrelation analysis and present the various types of nonlinearities to be investigated for spectrally flattening the speech. Examples of signal spectra

² A stationary analysis is one for which the same set of input samples is used in computing all the points of the autocorrelation function. A nonstationary analysis is impractical for pitch detection because of the large number of autocorrelation points involved in the computation.

RABINER: AUTOCORRELATION ANALYSIS FOR PITCH DETECTION 25

obtained with the nonlinearities being used will be given in this section. In Section III the results of a limited but formal evaluation of each of the nonlinear autocorrelation analyses are given. Several of the test utterances used in [2] are used in this test for comparison purposes. In Section IV we discuss a simple algorithm for adapting the frame size of the analysis based on the estimated average pitch period for the speaker, and present results on how well it worked on several test examples.

II. SHORT-TIME AUTOCORRELATION ANALYSIS

Given a discrete-time signal x(n), defined for all n, the autocorrelation function is generally defined as

$$\phi_x(m) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n)\, x(n+m). \tag{1}$$

The autocorrelation function of a signal is basically a (noninvertible) transformation of the signal which is useful for displaying structure in the waveform. Thus, for pitch detection, if we assume x(n) is exactly periodic with period P, i.e., x(n) = x(n + P) for all n, then it is easily shown that

$$\phi_x(m) = \phi_x(m + P), \tag{2}$$

i.e., the autocorrelation is also periodic with the same period. Conversely, periodicity in the autocorrelation function indicates periodicity in the signal.
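The periodicity property follows in one step from the long-time definition. A brief sketch, assuming only that x(n + P) = x(n) for all n (so that x(n + m + P) = x(n + m)):

$$\phi_x(m+P) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n)\, x(n+m+P) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n)\, x(n+m) = \phi_x(m).$$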

For a nonstationary signal, such as speech, the concept of a long-time autocorrelation measurement as given by (1) is not really meaningful. Thus, it is reasonable to define a short-time autocorrelation function, which operates on short segments of the signal, as

$$\phi_\ell(m) = \frac{1}{N} \sum_{n=0}^{N'-1} [x(n+\ell)\, w(n)]\, [x(n+\ell+m)\, w(n+m)], \qquad 0 \le m \le M_0 - 1 \tag{3}$$

where w(n) is an appropriate window for analysis, N is the section length being analyzed, N' is the number of signal samples used in the computation of φ_ℓ(m), M0 is the number of autocorrelation points to be computed, and ℓ is the index of the starting sample of the frame. For pitch detection applications N' is generally set to the value

$$N' = N - m. \tag{4}$$
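As an illustration of (3) and (4), the following is a minimal Python sketch of the short-time autocorrelation. The function and argument names are our own and are not taken from the original implementation; the rectangular default window is likewise an assumption made only to keep the example short.

```python
def short_time_autocorrelation(x, start, frame_len, num_lags, window=None):
    """Short-time autocorrelation in the form of (3), with N' = N - m as in (4).

    x         : sequence of signal samples
    start     : index of the first sample of the frame (l in the text)
    frame_len : section length N
    num_lags  : number of lags M0, computed for m = 0 .. M0 - 1 (M0 <= N)
    window    : optional length-N window w(n); rectangular if omitted (assumption)
    """
    if window is None:
        window = [1.0] * frame_len
    # Windowed frame: frame[n] = x(n + l) * w(n)
    frame = [x[start + n] * window[n] for n in range(frame_len)]
    phi = []
    for m in range(num_lags):
        n_used = frame_len - m  # N' = N - m, so only samples inside the frame are used
        acc = 0.0
        for n in range(n_used):
            acc += frame[n] * frame[n + m]
        phi.append(acc / frame_len)
    return phi
```

For a signal that is periodic within the frame, the largest value of the returned list away from lag 0 falls at the period; because N' shrinks with m, peaks at multiples of the period are progressively smaller.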

[Fig. 1. Block diagram of the nonlinear correlation processing.]

computation of (3), as discussed in Section I. Fig. 1 shows a block diagram of the processing which was used. The speech signal s(n) is first low-pass filtered by an FIR, linear phase, digital filter with a passband of 0 to 900 Hz and a stopband beginning at 1700 Hz.³ The output of the low-pass filter is then used as input to two nonlinear processors, labeled NL1 and NL2 in Fig. 1. The nonlinearities used in each path may or may not be identical. The types of nonlinearities which were investigated were various center clippers and peak clippers. Earlier work [3], [1] has shown that such nonlinearities can provide a fairly high degree of spectral flattening, and are computationally quite efficient to implement [1]. Additionally, the capability of correlating two nonlinearly processed versions of the same signal provides a useful degree of flexibility in the system. It has also been argued that such a correlation will be most appropriate in a variety of actual situations in pitch detection.⁴
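To make the low-pass prefiltering step concrete, the sketch below designs a linear-phase FIR filter by the Hamming-windowed-sinc method. Only the band edges (passband to 900 Hz, stopband from 1700 Hz) and the 10 kHz sampling rate come from the text; the design method, the 41-tap length, and the 1300 Hz cutoff (the midpoint of the transition band) are assumptions chosen for illustration and are not claimed to match the filter actually used.

```python
import math


def design_lowpass(num_taps=41, cutoff_hz=1300.0, fs_hz=10000.0):
    """Hamming-windowed-sinc low-pass filter; symmetric taps give linear phase."""
    fc = cutoff_hz / fs_hz            # cutoff in cycles per sample
    mid = (num_taps - 1) / 2.0        # center of symmetry (num_taps assumed odd)
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * k) / (math.pi * k)
        hamming = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * hamming)
    gain = sum(taps)
    return [t / gain for t in taps]   # normalize to unity gain at 0 Hz


def magnitude_response(taps, freq_hz, fs_hz=10000.0):
    """Magnitude of the filter frequency response at a single frequency."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = -sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)


def fir_filter(taps, x):
    """Direct-form FIR filtering; the output has the same length as x."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, t in enumerate(taps):
            if n - k >= 0:
                acc += t * x[n - k]
        y.append(acc)
    return y
```

With these assumed parameters the response is close to unity through 900 Hz and strongly attenuated well above 1700 Hz, which is the behavior the text relies on to suppress the second and higher formants.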

Three types of nonlinearity have been considered. They are classified according to their input-output quantization characteristic in the following way. The first type of nonlinearity is a compressed center clipper whose output y(n) obeys the relation (with x(n) as input)⁵

$$y(n) = \mathrm{clc}[x(n)] = \begin{cases} x(n) - C_L, & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ x(n) + C_L, & x(n) \le -C_L. \end{cases}$$


[Fig. 2. Input-output characteristics of each of the three nonlinearities used in the investigation.]

$$y(n) = \mathrm{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge C_L \\ 0, & |x(n)| < C_L \\ -1, & x(n) \le -C_L. \end{cases}$$
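The two quantizer characteristics written out above can be sketched in a few lines of Python. The function names mirror the clc and sgn labels of the text; the helper apply_nonlinearity is a hypothetical name of our own, and the clipping level is left as a plain argument because the rule for choosing it is not reproduced here.

```python
def clc(x, clip_level):
    """Compressed center clipper: the output is the amount by which the
    input exceeds the clipping threshold, and 0 inside the threshold."""
    if x >= clip_level:
        return x - clip_level
    if x <= -clip_level:
        return x + clip_level
    return 0.0


def sgn(x, clip_level):
    """Three-level quantizer: +1 above the threshold, -1 below the
    negative threshold, and 0 in between."""
    if x >= clip_level:
        return 1.0
    if x <= -clip_level:
        return -1.0
    return 0.0


def apply_nonlinearity(func, samples, clip_level):
    """Apply one of the quantizers sample by sample (hypothetical helper)."""
    return [func(s, clip_level) for s in samples]
```

Because the sgn output takes only the values -1, 0, and +1, the products in the correlation sum reduce to sign logic, which is consistent with the computational efficiency noted in connection with [1].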



period there are five pulses of varying width whereas in the first periods there are only three pulses. Fig. 3(b) shows that such problems are inherently eliminated by the clc quantizer, whose output samples are proportional in amplitude to the amount by which they exceed the clipping threshold.

Spectral Flattening from the Quantizers

It has already been argued that the effect of the nonlinear processing preceding the correlation computation is to approximately flatten the signal spectrum, thereby enhancing the periodicity of the signal. To investigate this, the power spectrum of each of the correlation functions of Table I was computed directly from the correlation function by the Fourier transform relation

$$S(f) = \sum_{m=-(M_0-1)}^{M_0-1} \phi(m)\, e^{-j 2\pi f m}. \tag{8}$$

A 512-point FFT was used to compute S(f_k), k = 0, 1, ..., 511, where

$$f_k = k \left( \frac{10000}{512} \right), \tag{9}$$

i.e., at 512 points around the unit circle. Since φ(m) is theoretically infinite, a (2M0 + 1) point Hamming window was used to taper φ(m) smoothly to 0. (Note we are assuming φ(m) is symmetric, i.e., φ(m) = φ(-m).)
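The spectrum computation just described can be sketched as follows. This is an illustration under stated assumptions rather than the original program: it evaluates the 512-point transform directly with cosines (exploiting the assumed symmetry of the correlation) instead of calling an FFT, it takes the Hamming taper to be centered on lag 0, and it reports 10 log10 of the magnitude with a small floor to avoid taking the logarithm of 0. All names are our own.

```python
import math


def power_spectrum_db(phi, n_fft=512, fs_hz=10000.0):
    """Power spectrum (dB) from one-sided correlation values phi(0..M0-1).

    The correlation is assumed symmetric, phi(-m) = phi(m), and is tapered
    by a (2*M0 + 1)-point Hamming window centered on lag 0.
    Returns (frequencies in Hz, spectrum in dB) at n_fft points.
    """
    m0 = len(phi)
    # Hamming taper evaluated at lags 0 .. M0-1 of a window spanning -M0 .. M0
    tapered = [phi[m] * (0.54 + 0.46 * math.cos(math.pi * m / m0))
               for m in range(m0)]
    freqs = []
    spec_db = []
    for k in range(n_fft):
        # Symmetric sequence: the transform reduces to a cosine sum
        s = tapered[0]
        for m in range(1, m0):
            s += 2.0 * tapered[m] * math.cos(2.0 * math.pi * k * m / n_fft)
        freqs.append(k * fs_hz / n_fft)
        spec_db.append(10.0 * math.log10(max(abs(s), 1e-12)))
    return freqs, spec_db
```

As a quick check, a correlation of the form cos(2πm/8) with M0 = 200 produces its spectral peak at k = 64, i.e., at 1250 Hz for the 10 kHz sampling rate.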

Figs. 4-7 show plots of the results of processing four different sections of voiced speech. The left-hand column shows the signal x1(n), the middle column shows the signal φ(m) = x1(n) correlated with x2(n) (where x2(n) is as specified in Table I), and the right-hand column shows the power spectrum S(f) obtained as described above. The ten rows in each figure correspond to the ten combinations of signals to be correlated as shown in Table I. An examination of Fig. 4 shows that for the unprocessed signal (i.e., the top row) the first several harmonics are seen in the power spectrum. Beyond 1 kHz, the spectrum decays rapidly due to the low-pass filter (the lack of a sharp falloff in the spectrum is due to a combination of the signal and autocorrelation windows). The amplitudes of the harmonics vary with the first formant envelope. It can be seen that the spectra for the autocorrelations of each of the nonlinear quantizers (i.e., rows 2, 3, and 10) are much flatter than the original signal spectrum. Additionally, the spectra of the nonlinearly processed signals are much broader than the original spectrum. It is interesting to note that the spectra from correlations involving x(n), i.e., correlation numbers 1, 4, 7, and 8, are the least flattened and are generally quite irregular (i.e., the harmonics are not very easy to find).

Fig. 5 shows similar results from a different section of voiced speech. As seen from the spectrum of the unprocessed signal (on the top line), the bandwidth of the first formant is fairly small, causing the correlation function to show a great deal of formant periodicity for small values of m. The effects of the nonlinearities on the signal spectra are quite impressive even for such a difficult case as this one.

Fig. 6 shows results from a section of speech from a female speaker (high pitch). Again the spectrum from the unprocessed signal shows only a few harmonics whose amplitudes vary with the formant amplitude. The nonlinearly processed samples show various degrees of spectral flattening, as anticipated by the previous discussion.

Finally, Fig. 7 shows the results obtained with a voiced frame from a low pitched (long period) male speaker. In this example the first formant has a very narrow bandwidth, as seen in the spectrum at the top of Fig. 7. Pitch detection directly on the autocorrelation of the signal yields incorrect results in this case due to the first formant peak(s) in the autocorrelation function. However, as shown in Fig. 7, almost any of the nonlinearities flatten the spectrum and eliminate the troublesome effects of the sharp first formant in the resulting correlation function.

[Fig. 3. Each of the processed signals and the resulting correlation function for a section of voiced speech.]

[Fig. 4. The signal x1(n); the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech. Columns: signal x1(n), correlation φ(m), power spectrum S(f) (dB).]

[Fig. 7. The signal x1(n); the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech from a low pitched male speaker. Columns: signal x1(n), correlation φ(m), power spectrum S(f) (dB).]

In summary, we have presented examples which tend to show that, as anticipated, the effect of nonlinearly quantizing the signal amplitudes using the quantizers of Fig. 1 is to effectively flatten and broaden the signal power spectrum, thereby reducing the effects of the first formant on the correlation function and simplifying the pitch detection problem. In the next section we present results of a comparative test of the performance of the ten correlation pitch detectors discussed in this section on a series of speech utterances.

III. EVALUATION OF THE TEN NONLINEAR CORRELATIONS

In order to evaluate and compare the performance of the ten nonlinear correlations discussed in the preceding section, a small set of the utterances from the data base in [2] was used as a test set. For each of the utterances a reference pitch contour was available from which an error analysis was made [6]. Since the problem of making a reliable voiced-unvoiced decision was not a concern here, the reference voiced-unvoiced contour was used directly, i.e., each correlator was required to estimate the pitch period, assuming a priori that the interval was properly classified as voiced. (No pitch detection was done during unvoiced intervals.) However, if the peak correlation value (normalized) fell below a threshold (0.25), the interval was classified as unvoiced since reliable selection of

[Fig. 5. The signal x1(n); the resulting correlation and power spectrum for each of the ten correlators of Table I for another section of voiced speech. Columns: signal x1(n), correlation φ(m), power spectrum S(f) (dB).]

[Fig. 6. The signal x1(n); the resulting correlation and power spectrum for each of the ten correlators of Table I for a section of voiced speech from a female speaker. Columns: signal x1(n), correlation φ(m), power spectrum S(f) (dB).]


TABLE II
STANDARD DEVIATIONS FOR TEN CORRELATORS

Correlator   Utterance (13 columns, in the order printed)

Standard Deviation of Pitch Period (Unsmoothed)
 1     .53    .60    .8i    .97    .51.   1.35   1.58    .85    .88   1.23   1.24   1.52   1.23
 2     .63    .71    .88    .68    .58    1.59   1.70   1.18   1.19   1.21   1.32   1.39   1.05
 3     .1!.   .61,   .72    .79    .56    1.50   1.66   1.10   1.18   1.31   1.17   1.1.6  1.08
 4     .52    .60    .79    .83    .79    1.54   1.82    .97   1.11   1.32   1.24   1.50   1.34
 5     .63    .68    .76    .86    .99    1.47   1.75   1.07   1.07   1.28   1.02   1.37   1.27
 6     .ly    .65    .72    .58    .82    1.55   1.66   1.08   1.19   1.29   1.05   1.43   1.25
 7     .40    .63    .80    .78    .51.   1.46   1.70    .99   1.04   1.35   1.29   1.115  1.26
 8     .37    .65   1.15    .73    .58    1.67   1.88   1.09   1.18   1.29   1.51   1.45   1.24
 9     .46    .68    .97    .67    .56    1.56   1.76   1.13   1.17   1.30   1.111  1.1.6  1.11
10     .45    .79    .69    .75    .75    1.50   1.69   1.13   1.25   1.39   1.28   0.57   1.17

Standard Deviation of Pitch Period (Smoothed)
 1     .61    .1.0   .51    .50    .53    .811   1.12   1.59   1.52   2.08   1.22   1.92   1.75
 2     .62    '.8    .61    .50    .59    1.09   1.40   1.63   1.61   1.54   1.25   2.15   1.50
 3     .61    .19    .55    .50    .57     .85   1.31   1.49   1.76   1.36   1.28   1.98   1.81
 4     .60    .1.3   .55    .56    .1.9   1.25    .97   1.1.8  1.59   1.67   1.33   2.17   1.83
 5     .70    .118   .57    .511   .62    1.18   1.02   1.65   1.60   1.36   1.85   2.23   1.53
 6     .60    .56    .60    .56    .52    1.26   1.03   1.63   1.68   1.69   1.57   1.93   1.51
 7     .59    .46    .58    .71    .51    1.03   1.14   1.55   1.75   1.77   1.21   2.15   2.11
 8     .63    .48    .67    .68    .50    1.19   1.43   1.78   1.83   1.75   1.86   2.48   1.92
 9     .59    .50    .61    .51    .57    1.21.  1.1.,. 1.67   1.92   1.18   1.1.1  2.39   1.66
10     .59    .59    .51.   .56    .58    1.05   1.26   1.63   0.76   1.68   1.56   2.29   1.63

the pitch was not possible with a correlation peak below this threshold.
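The peak selection and threshold test described above can be sketched as follows. The 0.25 threshold is from the text; normalizing the peak by the zero-lag value and excluding lags below a minimum (min_lag, a hypothetical parameter set here to 20 samples) are assumptions made for this illustration.

```python
def estimate_pitch_period(phi, min_lag=20, threshold=0.25):
    """Return the lag (in samples) of the largest correlation peak, or None
    when the normalized peak falls below the threshold, in which case the
    frame is treated as unvoiced.

    phi       : correlation values for lags 0 .. M0 - 1
    min_lag   : smallest lag considered (assumption, not from the paper)
    threshold : minimum normalized peak value (0.25 in the text)
    """
    if phi[0] <= 0.0:
        return None  # no energy in the frame, nothing to normalize by
    best_lag = max(range(min_lag, len(phi)), key=lambda m: phi[m])
    if phi[best_lag] / phi[0] < threshold:
        return None
    return best_lag
```

A None result corresponds to a voiced-to-unvoiced error whenever the reference contour marks the frame as voiced.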

Thirteen utterances from [2] were used in this comparison. Tables II-V present the results of an error analysis which measured the average and standard deviation of the pitch period, the number of gross pitch period errors, and the number of voiced-to-unvoiced errors [2]. For all utterances the average pitch period error was well below 0.5 samples (10 kHz sampling rate) and so the results of this measurement are not presented. Table II presents the standard deviations of the pitch period for the ten correlations. The results are also presented for the errors when the pitch contours were nonlinearly smoothed using a median smoothing algorithm [7]. From Table II it can be seen that the standard deviations for all correlators were approximately the same for the same utterance. It is also seen that as the average pitch period gets longer (reading from left to right) the standard deviation increases proportionally.
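To indicate how median-based smoothing removes isolated gross errors from a pitch contour, a minimal running-median sketch is given below. It is not the smoother of [7], which is more elaborate; the window length of 5 and the shortened windows at the ends of the contour are assumptions made for illustration.

```python
def running_median(values, length=5):
    """Running median of a pitch period contour.

    values : pitch period estimates, one per voiced frame
    length : odd window length (assumption: 5 frames)
    Near the ends the window is shortened to the available frames.
    """
    half = length // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        window = sorted(values[lo:hi])
        smoothed.append(window[len(window) // 2])
    return smoothed
```

For example, in the contour 50, 50, 51, 100, 50, 49, 50 the isolated value 100 (a period doubling) is replaced by 50, while the surrounding values are essentially unchanged.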

Tables III and IV show the error statistics for gross errors and voiced-to-unvoiced errors, both for the unsmoothed pitch contours (Table III) and for the smoothed pitch contours (Table IV). These tables show that for the high pitched speakers (utterances prefaced by C1, F1, F2), although some differences¹¹ were present in the error scores for the unsmoothed data, the nonlinear smoother was able to correct most of the errors. Thus the overall performance on the first

¹⁰ A voiced-to-unvoiced error occurred when a voiced region was improperly classified as an unvoiced region because no peak above the threshold was present in the correlation function.

¹¹ The differences for the high pitched (short period) speakers were due to pitch period doubling, i.e., the correlation peak at twice the period was somewhat higher than the correlation peak at the true period. This is a common effect when the pitch period is on the order of 3 ms (300 Hz pitch, or about 30 samples at the 10 kHz sampling rate), as was the case for these speakers.

TABLE III
ERROR STATISTICS FOR TEN CORRELATORS (UNSMOOTHED)

Utteraoce

a a a F F F F Fa a a a a a a a a a a a a

1 6 1 1 2 7 1 1. 24 33 1.0 18 iS 25212 8 6 2 8 1 6 8 15 11 6 5 83 9

6m 5 '.1.

6 117 688 7,.c9 7

5 2 3 6 1 5 13 22 10 18 7 111 2 I. 6 2 3 21 29 21. 15 11. 206 2 3 6 2 3 111 25 18 18 17 195 2 1 4 2 1. 13 21. 12 11 11. 132 2 1. 8 1 1. 21. 31 27 13 9 162 3 5 10 1 00 28 37 31. 16 13 221 3 1. 8 1 7 16 25 '.1. 12 8 11

10 15 5 1. 1 6 2 6 15 27 15 11 9 13Number of Sr000 Voiced Errors (Unsetoothed)

1 0 S 1 2 1 1 0 1 3 S 2 3 22 2 1 5 1. 3 2 0 10 11 I. 10 13 1113 2 0 2 I. 2 2 0 3 6 2 5 I. 5I. 85 0 S 1 2 0 0 0 1 1 0 4 5 2S 1 1 0 S 2 2 1. 0 1. 2 1.6 0

aT 01

9 2

0 2 3 2 0 0 3 5 3 5 2 1.0 2 1 0 2 2 1 3 0 2 1 1.0 2 2 1 2 0 5 3 1 3 15 1.8 3I. 2 2 0 II 7 I. 1. 11 10

10 0 0 1 5 3 0 1 6 5 2 6 6 7Nlamber of Voiced Unvoiced Errors (Usomootbed)

Total Numberof Voiced 213Intervals

133 176 169 118 152 157 170 170 11.1. 105 131. 11.7

TABLE IV
ERROR STATISTICS FOR TEN CORRELATORS (SMOOTHED)

Utterance

0a

0aa

0 0 K o') K K K 0 0 0a a a a a a aa a a a a a a a a a0aa

1 0 0 0 0 0 0 5 11 17 1. 0 2 22 03 0

0056 00

'88 00

00000000

0 0 c 0 0 6 5 6 2 00 0 5 0 0 7 3 1. 1 10 0 0 0 0 8 8 2 0 20 0 0 0 0 6 2 11 3 30 0 0 0 0 3 3 3 2 20 0 0 0 0 9 8 8 020 0 0 0 1 9 8 5 3 00 0 S 0 1 I. 9 7 2 0

31

1153

73

10 2 0 0 0 S 0 0 3 1 3 2 2 3. Nocsher of lr000 Voiced Errors (Smoothed)

1 2 1 0 3 5 2 1 13 13 20 18 'ii 172 3 0 3 4 9 2 1 9 15 7 9 9 23 3I. 2

62223

0110110

0 4 8 2 1 3 11. 5 9 30 1. 5. 5 2 3 8 8 19 80 3 51 2 l 7 ii 15 30 3 6 1 2 9 15 8 10 80 I. 2 1 1 6 7 17 31 5 8 2 1 10 19 8 9 121 6 9 2 3 7 12 6 7 ii

2100261.

10 2 C 0 . II 9 1 3 11 10 8 10 10 1.Number of Voiced Sonrdced Errors (Smoothed)

four utterances was approximately the same for all correlators. For the low pitched speakers (utterances prefaced by LM, M2) there were more significant differences between the correlators. For the category of gross errors, correlators 1 and 8 generally had the largest numbers of errors across the last 6 utterances in the test. However, for the category of voiced-to-unvoiced errors, correlators 2 and 9 had consistently the largest number of errors. Although the smoothing significantly reduced the number of gross errors for many of the correlators, it in turn increased the number of voiced-to-unvoiced errors. Since both errors constitute a pitch error, in this case the most significant error statistic is probably the sum of the gross errors and voiced-to-unvoiced errors. Table V shows these results. Based on this combined error statistic, the following conclusions can be drawn about the performance of the ten correlators.

1) For high pitched speakers the differences in performance scores between the different correlators are small and probably insignificant. It is for this class of speakers that any type of correlation measurement of pitch period tends to work very well.

2) For low pitched speakers fairly significant differences in the performance scores existed. Correlator number 1 (the normal linear autocorrelation) tended to give the worst performance for all utterances in this class. Correlators number 4, 7, and 8 (the ones involving an unprocessed x(n) in the computation) were also somewhat poorer in their overall performance based on the sum of gross errors and voiced-to-unvoiced errors.

3) Differences in the performance among the remaining six correlators were not consistent. Thus, any one of these correlators would be appropriate for an autocorrelation pitch detector.

It is interesting to note that (as seen in Tables III-V) the results for utterance M208T were significantly worse than for utterance M208M. These utterances were simultaneously recorded, the difference being that M208T was recorded off a telephone line, whereas M208M was recorded from a close talking microphone. This result is due to the band-limiting effects of the telephone line (300 Hz cutoff frequency), which eliminate the first few harmonics of the pitch, thereby making accurate pitch detection more difficult.

To illustrate the errors made during one of the more difficult utterances, Fig. 8 shows the pitch period contours from three of the correlators for the utterance LM05T ("we were away a year ago," spoken by a low pitched male over the telephone line). Also shown in this figure is the nonlinearly smoothed pitch contour from correlator number 10. The pitch period contour from correlator number 1 [Fig. 8(a)] shows the large number of gross pitch period errors made during the analysis. It is readily seen that most of the errors involved choosing a low valued correlation peak rather than the one at the pitch period. These errors, although due somewhat to the frame size used for analysis (30 ms or 300 samples), are primarily due to the narrow bandwidth first formant, which has a stronger correlation peak than the one due to the pitch period. The results for correlators number 2 [Fig. 8(b)] and 10 [Fig. 8(c)] confirm the fact that the use of the nonlinearities prior to correlation greatly flattens the spectrum, thereby reducing the number of errors of the type discussed above. As shown in Fig. 8(d), the nonlinear smoother is quite capable of correcting most of the gross pitch errors in analysis from the correlators using the nonlinearities; however, the number of errors for correlator number 1 is too large to be adequately corrected by this smoother. The nonlinearly smoothed pitch contour also shows that the only gross pitch period errors which were not

TABLE V
TOTAL ERROR STATISTICS FOR TEN CORRELATORS

Utterance

F3 S S S S S S S F F F FF F S 5 7 F F S F F F1 6 1 2 3 5 2 1 25 36 .0 20 20 252 13 9 11 6 ii 3 6 iS 26 15 o6 ii 19311 5 1 7 8 3 5 i6 25 12 15 11 i61 6O iS 16 3 3 6 2 3 22 30 25 19 il 223 I 6 2 3 i6 29 iS 22 19 23

5611 5 1 I 6 2 1 i6 29 15 16 i6 1750 6 2 1 5 6 3 1 25 33 27 15 13 205 6 2 5 7 ii 3 10 33 II 35 19 23 26

9 9 1 6 6 10 3 2 20 32 10 11 19 2112 13 5 5 6 9 2 5 21 32 17 13 15 20

Octal lnccicr nO Pitch Ecrcra Oncrncnthcd

1 2 1 0 3 5 2 1 23 31 21 16 13 192 3 I 1 3 9 2 1 15 13 13 11 9 5

31 2

Z5 6So 2Cs 2CS 2

3

1

1

0

1

1

0 3 6 2 1 10 17 9 10 I 30 I I 1 2 11 i6 10 19 10 120 3 5 1 2 10 9 15 10 6 51 3 6 1 2 12 13 11 12 10 30 I 5 2 1 10 iS 15 17 5 51 7 6 2 2 19 27 i6 12 12 1309 3 0 1 6 9 2 2 11 21 53 9 11 7

10 2 0 0 1 9 1 3 11 11 11 12 12 7

Octal Nnccbcr cO Pitch Errors iaccthch

corrected by the smoother were those that occurred near an unvoiced boundary. As already mentioned, these gross pitch period errors were often changed into voiced-to-unvoiced errors in the smoothed pitch contour.

IV. ADAPTIVE FRAME SIZE FOR PITCH ANALYSIS

One of the remaining problems in designing an effective correlation pitch detector is to implement an algorithm for making the analysis frame size variable. It is important to note that the variability of frame size for a given speaker is not nearly as important as the variability of frame size from speaker to speaker. The most important feature of the analysis frame size is that it be large enough to encompass at least two complete pitch periods, but not so large that it encompasses a large number of pitch periods. If we consider the range of pitch period variation across speakers [2], then a frame size on the order of 40 samples (4 ms) is required for a high pitched speaker, and a frame size on the order of 400 samples (40 ms) is required for a low pitched speaker. Thus, a single fixed frame size will not be suitable for all speakers.

The question now remains as to a suitable method of adapting the frame size to the pitch of the speaker. We have already argued that adaptation to the detailed pitch variation within an utterance is generally unnecessary, mainly because the range of pitch variation within an utterance is generally 1 octave or less (a factor of 2 to 1) from the average pitch for the utterance. Thus, an instantaneously adapting algorithm for choosing the analysis frame size is not required. This is fortunate in that instantaneously adaptive methods generally do not work well when the pitch estimates include gross pitch errors.

In lieu of an instantaneously adaptive method, a simple but effective method of adapting the frame size is to estimate the average pitch P̄(m) of the speaker using the relation

$$\bar{P}(m) = \begin{cases} \dfrac{1}{N_m} \sum_{i=1}^{N_m} p(i), & N_m \ge 10 \\ 100, & N_m < 10. \end{cases}$$

[Fig. 8. The pitch period contours for the utterance LM05T ("we were away a year ago") from three of the correlators of Table I and a nonlinearly smoothed pitch contour from correlator number 10. Panels: (a) correlator no. 1; (b) correlator no. 2; (c) correlator no. 10; (d) correlator no. 10 (smoothed contour). Horizontal axis: frame number.]

[Fig. 9 panels: (a) frame 147 of LM05T, N = 300; (b) frame 147 of LM05T, N = 600. Columns: signal x1(n), correlation φ(m), power spectrum S(f) (dB).]


[Fig. 10. Plots of the pitch contour and the resulting adaptive frame size for four typical utterances. Panel shown here: LM05T (correlator 10, smoothed data); horizontal axis: frame number.]

To demonstrate the necessity and effectiveness of matching the analysis frame size to the speaker's average pitch, Fig. 9 shows plots of the waveforms, correlation functions, and power spectra for a section of voiced speech from a low pitched male. The pitch period during this section was about 150 samples. Fig. 9(a) shows the results for the 10 correlators for an analysis frame size of 300 samples; Fig. 9(b) shows the results for a 600 sample analysis frame size. By comparing the flatness of the power spectrum for the best correlators (i.e., numbers 2, 3, and 10), it can be readily seen that the longer analysis frame size leads to significantly flatter spectra.

The analysis frame adaptation algorithm discussed above was tested on several utterances used in the study of [2]. Fig. 10 shows plots of both the nonlinearly smoothed pitch period contour and the analysis frame size as obtained from (11) and (12). Fig. 10(a) shows the results on a low pitched male whose average pitch period was about 140 samples. As discussed above, the first 10 voiced frames used a 300 sample frame; after that the frame size adapted slowly to the pitch period, reaching a fairly constant value of about 420 samples. Fig. 10(b) shows the results for a normal pitched male speaker with very little pitch variation throughout the utterance. The algorithm very rapidly converges to an analysis frame size of about 210 samples for this speaker. Fig. 10(c) shows the results for a female speaker. In this case the analysis frame size quickly converged to a length of about 135 samples.

[Fig. 10 panels (correlator 10, smoothed data): M107M, C108T, F107M. Horizontal axis: frame number.]

Finally, Fig. 10(d) shows the results for a high pitched child. In this case the frame size reached the lower limit of a 100 sample frame size at the first iteration, and remained at that value throughout the utterance.

Adapting the frame size to the estimated average pitch of the speaker can have advantages other than the ones discussed above. In cases where the resulting frame size is smaller than 300 samples, the computation of the correlation function is speeded up. In cases where the frame size falls below 200 samples, the computation is speeded up even more because fewer than 200 correlations need to be computed. Thus, for example, for a frame size of 300 samples, on the order of N1 = 300 × 200 = 60 000 operations (multiply, addition) need to be performed to compute 200 autocorrelation points, whereas for a frame size of 100 samples, on the order of N2 = 100 × 100 = 10 000 operations are required, providing a 6 to 1 savings in computation. However, in cases where the frame size exceeds 300 samples, the correlation computation time increases, but this increase in computation time is unavoidable if one is to use the proper frame size.
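The operation counts quoted above follow from a simple order-of-magnitude model, sketched here. The model (one pass over the frame per computed lag, with at most 200 lags and no more lags than samples in the frame) is our reading of the text, and the function name is hypothetical.

```python
def correlation_operations(frame_size, max_lags=200):
    """Approximate number of multiply-add operations per frame:
    one pass of frame_size products for each computed lag."""
    lags = min(max_lags, frame_size)  # fewer lags are needed for short frames
    return frame_size * lags


n1 = correlation_operations(300)  # 300 x 200
n2 = correlation_operations(100)  # 100 x 100
savings = n1 / n2
```

Here n1 is 60000, n2 is 10000, and savings is 6.0, in agreement with the 6 to 1 figure.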

V. SUMMARY

In this paper we have examined several methods for combining nonlinear processing of the speech waveform with a standard correlation analysis to give correlation functions which have sharp peaks at the pitch period. We have shown that the nonlinearities provide some degree of spectral flattening, thereby enhancing the periodicity peaks in the correlation function and reducing the correlation peaks due to the formant structure of the waveform. A formal evaluation of ten types of nonlinear correlation showed that correlations involving the unprocessed signal were somewhat inferior to correlations involving the nonlinearly processed signal; however, almost all the nonlinearities provided essentially the same performance.

In addition, a simple procedure for adapting the analysis frame size of the correlation to the estimated average pitch period of the speaker was proposed and evaluated for several utterances. By basing the adaptation on a running estimate of the pitch period, it was shown that a fairly reliable and robust method of adapting analysis frame size resulted. This method should be appropriate for any frame-by-frame speech analysis system in which pitch is extracted.

REFERENCES

[1] J. J. Dubnowski, R. W. Schafer, and L. R. Rabiner, "Real-time digital hardware pitch detector," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-24, pp. 2-8, Feb. 1976.

[2] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg, and C. A. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-24, pp. 399-418, Oct. 1976.

[3] M. M. Sondhi, "New methods of pitch extraction," IEEE Trans. Audio Electroacoust., vol. AU-16, pp. 262-266, June 1968.

[4] J. D. Markel, "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. Audio Electroacoust., vol. AU-20, pp. 367-377, Dec. 1972.

[5] B. S. Atal, unpublished work.

[6] C. A. McGonegal, L. R. Rabiner, and A. E. Rosenberg, "A semiautomatic pitch detector (SAPD)," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-23, pp. 570-574, Dec. 1975.

[7] L. R. Rabiner, M. R. Sambur, and C. F. Schmidt, "Applications of a nonlinear smoothing algorithm to speech processing," IEEE Trans. Acoust., Speech, and Signal Processing, vol. ASSP-23, pp. 552-557, Dec. 1975.
