
Digital Speech Processing: Lecture 14A

Algorithms for Speech Processing

Speech Processing Algorithms

• Speech/Non-speech detection
  – Rule-based method using log energy and zero crossing rate
  – Single speech interval in background noise
• Voiced/Unvoiced/Background classification
  – Bayesian approach using 5 speech parameters
  – Needs to be trained (mainly to establish statistics for background signals)
• Pitch detection
  – Estimation of pitch period (or pitch frequency) during regions of voiced speech
  – Implicitly needs classification of signal as voiced speech
  – Algorithms in time domain, frequency domain, cepstral domain, or using LPC-based processing methods
• Formant estimation
  – Estimation of the frequencies of the major resonances during voiced speech regions
  – Implicitly needs classification of signal as voiced speech
  – Need to handle birth and death processes as formants appear and disappear depending on spectral intensity

Median Smoothing and Speech Processing

Why Median Smoothing

Obvious pitch period discontinuities need to be smoothed in a manner that preserves the character of the surrounding regions, using a median (rather than a linear filter) smoother.

Running Medians

[Figure: the same contour smoothed with a 5 point median and with 5 point averaging]

Non-Linear Smoothing

• linear smoothers (filters) are not always appropriate for smoothing parameter estimates because of smearing and blurring discontinuities
• linear smoothing of a pitch period contour would emphasize errors and distort the contour
• use combination of non-linear smoother of running medians and linear smoothing
• linear smoothing => separation of signals based on non-overlapping frequency content
• non-linear smoothing => separating signals based on their character (smooth or noise-like)

$$x[n] = S(x[n]) + R(x[n]) \quad \text{(smooth + rough components)}$$

$$y[n] = \operatorname{median}(x[n]) = M_L(x[n])$$

$$M_L(x[n]) = \text{median of } x[n], \ldots, x[n-L+1]$$

Properties of Running Medians

Running medians of length L:
1. $M_L(\alpha x[n]) = \alpha M_L(x[n])$
2. Medians will not smear out discontinuities (jumps) in the signal if there are no discontinuities within L/2 samples
3. $M_L(\alpha x_1[n] + \beta x_2[n]) \neq \alpha M_L(x_1[n]) + \beta M_L(x_2[n])$
4. Median smoothers generally preserve sharp discontinuities in signal, but fail to adequately smooth noise-like components
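The properties above can be checked on a toy contour. The following is a minimal sketch, not the lecture's implementation: the function names are made up for illustration, and the edge handling (repeating the first sample) is an assumption, since the slides do not say how the start of the signal is treated.

```python
from statistics import median


def running_median(x, L):
    """Causal running median M_L: median of x[n], x[n-1], ..., x[n-L+1].

    Samples before the start of the signal are filled by repeating x[0]
    (an assumption made here for illustration).
    """
    padded = [x[0]] * (L - 1) + list(x)
    return [median(padded[n:n + L]) for n in range(len(x))]


def running_mean(x, L):
    """Causal L-point average with the same edge handling, for comparison."""
    padded = [x[0]] * (L - 1) + list(x)
    return [sum(padded[n:n + L]) / L for n in range(len(x))]


# Toy pitch-period contour: one gross error (100) and one genuine jump (50 -> 80).
contour = [50, 50, 50, 100, 50, 50, 50, 80, 80, 80, 80, 80]

print(running_median(contour, 5))  # isolated error removed, jump kept sharp
print(running_mean(contour, 5))    # error smeared over 5 samples, jump blurred
```

With L = 5 the median output is 50 for the first nine samples and 80 afterwards: the isolated error disappears and the jump survives, delayed by (L-1)/2 = 2 samples because the window is causal. The averaging filter instead spreads the error over five output samples.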


Median Smoothing

Nonlinear Smoother Based on Medians

Nonlinear Smoother

- $y[n]$ is an approximation to the signal $S(x[n])$
- second pass of non-linear smoothing improves performance based on: $y[n] = S(x[n])$
- the difference signal, $z[n]$, is formed as: $z[n] = x[n] - y[n] = R(x[n])$
- second pass of nonlinear smoothing of $z[n]$ yields a correction term that is added to $y[n]$ to give $w[n]$, a refined approximation to $S(x[n])$:
  $$w[n] = S(x[n]) + S[R(x[n])]$$
- if $z[n] = R(x[n])$ exactly, i.e., the non-linear smoother was ideal, then $S[R(x[n])]$ would be identically zero and the correction term would be unnecessary
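The two-pass scheme above can be sketched as follows. This is a hedged illustration: the smoother S is assumed to be a centered 5-point running median followed by a 3-point (1/4, 1/2, 1/4) linear smoother, and centered windows are used so that no delay compensation is needed. The function names and these parameter choices are illustrative, not taken from the slides.

```python
from statistics import median


def centered_median(x, L):
    """L-point running median centered on n (L odd); end values are repeated."""
    h = L // 2
    padded = [x[0]] * h + list(x) + [x[-1]] * h
    return [median(padded[n:n + L]) for n in range(len(x))]


def hann3(x):
    """3-point linear smoother with weights (1/4, 1/2, 1/4); end values repeated."""
    padded = [x[0]] + list(x) + [x[-1]]
    return [0.25 * padded[n] + 0.5 * padded[n + 1] + 0.25 * padded[n + 2]
            for n in range(len(x))]


def smooth(x, L=5):
    """One pass of S: running median followed by linear smoothing."""
    return hann3(centered_median(x, L))


def double_smooth(x, L=5):
    """w[n] = S(x[n]) + S(z[n]), where z[n] = x[n] - S(x[n])."""
    y = smooth(x, L)                          # first-pass estimate of S(x[n])
    z = [xi - yi for xi, yi in zip(x, y)]     # rough part, z[n] = x[n] - y[n]
    correction = smooth(z, L)                 # second pass on the rough part
    return [yi + ci for yi, ci in zip(y, correction)]


# Flat contour with a single gross error: both passes leave a flat output.
x = [50.0] * 10
x[4] = 100.0
print(double_smooth(x))
```

For this input the first pass already returns a constant 50, the difference signal is a single spike, and the second pass returns zero, so the correction term vanishes as the last bullet describes.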

Nonlinear Smoother with Delay Compensation

Algorithm #1

Speech/Non-Speech Detection Using Simple Rules

Speech Detection Issues

• key problem in speech processing is locating accurately the beginning and end of a speech utterance in noise/background signal

[Figure annotation: beginning of speech]

• need endpoint detection to enable:

• computation reduction (don’t have to process background signal)

• better recognition performance (can’t mistake background for speech)

• non-trivial problem except for high SNR recordings

Ideal Speech/Non-Speech Detection

Beginning of speech interval

Ending of speech interval

Speech Detection Examples

case of low background noise => simple case

can find beginning of speech based on knowledge of sounds (/S/ in six)

Speech Detection Examples

difficult case because of weak fricative sound, /f/, at beginning of speech

Problems for Reliable Speech Detection

• weak fricatives (/f/, /th/, /h/) at beginning or end of utterance
• weak plosive bursts for /p/, /t/, or /k/
• nasals at end of utterance (often devoiced and reduced levels)
• voiced fricatives which become devoiced at end of utterance
• trailing off of vowel sounds at end of utterance

the good news is that highly reliable endpoint detection is not required for most practical applications; also we will see how some applications can process background signal/silence in the same way that speech is processed, so endpoint detection becomes a moot issue

Speech/Non-Speech Detection

• sampling rate conversion to standard rate (10 kHz)
• highpass filtering to eliminate DC offset and hum, using a length 101 FIR equiripple highpass filter
• short-time analysis using frame size of 40 msec, with a frame shift of 10 msec; compute short-time log energy and short-time zero crossing rate
• detect putative beginning and ending frames based entirely on short-time log energy concentrations
• detect improved beginning and ending frames based on extensions to putative endpoints using short-time zero crossing concentrations

Speech/Non-Speech Detection: Algorithm #1

1. Detect beginning and ending of speech intervals using short-time energy and short-time zero crossings

2. Find major concentration of signal (guaranteed to be speech) using region of signal energy around maximum value of short-time energy => energy normalization

3. Refine region of concentration of speech using reasonably tight short-time energy thresholds that separate speech from backgrounds, but may fail to find weak fricatives, low level nasals, etc.

4. Refine endpoint estimates using zero crossing information outside intervals identified from energy concentrations, based on zero crossing rates commensurate with unvoiced speech

Speech/Non-Speech Detection

Log energy separates Voiced from Unvoiced and Silence

Zero crossings separate Unvoiced from Silence and Voiced

Rule-Based Short-Time Measurements of Speech

Algorithm for endpoint detection:

1. compute mean and σ of log En and Z100 for first 100 msec of signal (assuming no speech in this interval and assuming FS=10,000 Hz).

2. determine maximum value of log En for entire recording => normalization.

3. compute log En thresholds based on results of steps 1 and 2—e.g., take some percentage of the peaks over the entire interval. Use threshold for zero crossings based on ZC distribution for unvoiced speech.

4. find an interval of log En that exceeds a high threshold ITU.

5. find a putative starting point (N1) where log En crosses ITL from above; find a putative ending point (N2) where log En crosses ITL from above.

6. move backwards from N1 by comparing Z100 to IZCT, and find the first point where Z100 exceeds IZCT; similarly move forward from N2 by comparing Z100 to IZCT and finding last point where Z100 exceeds IZCT.
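Steps 4 to 6 above can be sketched on per-frame measurements. This is a simplified illustration under stated assumptions, not the lecture's code: the thresholds ITU, ITL and IZCT are passed in rather than derived from background statistics (steps 1 to 3), the search outside the energy endpoints is limited to a hypothetical `max_extend` frames, and the function names and toy numbers are made up.

```python
import math


def short_time_features(x, frame_len, frame_shift):
    """Return per-frame (log energy in dB, zero crossing count)."""
    log_e, zc = [], []
    for start in range(0, len(x) - frame_len + 1, frame_shift):
        frame = x[start:start + frame_len]
        energy = sum(s * s for s in frame)
        log_e.append(10.0 * math.log10(energy + 1e-10))  # small floor avoids log(0)
        zc.append(sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)))
    return log_e, zc


def find_endpoints(log_e, zc, itu, itl, izct, max_extend=25):
    """Return (first_frame, last_frame) of the speech interval, or None."""
    # Step 4: frames above the high threshold ITU mark the heart of the speech.
    above = [i for i, e in enumerate(log_e) if e > itu]
    if not above:
        return None
    n1, n2 = above[0], above[-1]
    # Step 5: move outward while log energy stays above the low threshold ITL.
    while n1 > 0 and log_e[n1 - 1] > itl:
        n1 -= 1
    while n2 < len(log_e) - 1 and log_e[n2 + 1] > itl:
        n2 += 1
    # Step 6: outside [n1, n2], extend to the outermost frame (within max_extend
    # frames) whose zero crossing count exceeds IZCT.
    start = n1
    for i in range(n1 - 1, max(n1 - max_extend, 0) - 1, -1):
        if zc[i] > izct:
            start = i
    end = n2
    for i in range(n2 + 1, min(n2 + max_extend, len(zc) - 1) + 1):
        if zc[i] > izct:
            end = i
    return start, end


# Toy per-frame values: an unvoiced onset (high ZC, low energy) before the vowel.
log_e = [10, 10, 10, 12, 30, 50, 55, 50, 28, 11, 10, 10]
zc = [5, 5, 40, 45, 20, 10, 10, 10, 12, 50, 6, 5]
print(find_endpoints(log_e, zc, itu=45, itl=25, izct=30))  # (2, 9)
```

In the toy data, energy alone gives frames 4 to 8; the zero crossing test extends the interval to frames 2 to 9, picking up the low-energy, high zero crossing frames on either side.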

Endpoint Detection Algorithm

1. find heart of signal via conservative energy threshold => Interval 1

2. refine beginning and ending points using tighter threshold on energy => Interval 2

3. check outside the regions using zero crossing and unvoiced threshold => Interval 3

Isolated Digit Detection

Panels 1 and 2: digit /one/ - both initial and final endpoint frames determined from short-time log energy

Panels 3 and 4: digit /six/ - both initial and final endpoints determined from both short-time log energy and short-time zero crossings

Panels 5 and 6: digit /eight/ - initial endpoint determined from short-time log energy; final endpoint determined from both short-time log energy and short-time zero crossings


Voiced/Unvoiced/Background (Silence) Classification

Voiced/Unvoiced/Background Classification: Algorithm #2

• Utilize a Bayesian statistical approach to classification of frames as voiced speech, unvoiced speech or background signal (i.e., 3-class recognition/classification problem)
• Use 5 short-time speech parameters as the basic feature set
• Utilize a (hand) labeled training set to learn the statistics (means and variances for Gaussian model) of each of the 5 short-time speech parameters for each of the classes

Speech Parameters

$$X = [x_1, x_2, x_3, x_4, x_5]$$

- $x_1 = \log E_S$ -- short-time log energy of the signal
- $x_2 = Z_{100}$ -- short-time zero crossing rate of the signal for a 100-sample frame
- $x_3 = C_1$ -- short-time autocorrelation coefficient at unit sample delay
- $x_4 = \alpha_1$ -- first predictor coefficient of a $p^{th}$ order linear predictor
- $x_5 = E_p$ -- normalized energy of the prediction error of a $p^{th}$ order linear predictor

Speech Parameter Signal Processing

• Frame-based measurements
• Frame size of 10 msec
• Frame shift of 10 msec
• 200 Hz highpass filter used to eliminate any residual low frequency hum or dc offset in signal

Manual Training

• Using a designated training set of sentences, each 10 msec interval is classified manually (based on waveform displays and plots of parameter values) as either:
  – Voiced speech – clear periodicity seen in waveform
  – Unvoiced speech – clear indication of frication or whisper
  – Background signal – lack of voicing or unvoicing traits
  – Unclassified – unclear as to whether low level voiced, low level unvoiced, or background signal (usually at speech beginnings and endings); not used as part of the training set
• Each classified frame is used to train a single Gaussian model, for each speech parameter and for each pattern class; i.e., the mean and variance of each speech parameter is measured for each of the 3 classes

Gaussian Fits to Training Data

Bayesian Classifier

Classes $\omega_i$, $i = 1, 2, 3$:
- Class 1, $\omega_1$, representing the background signal class
- Class 2, $\omega_2$, representing the unvoiced class
- Class 3, $\omega_3$, representing the voiced class

$$\mathbf{m}_i = E[x] \quad \text{for all } x \text{ in class } \omega_i$$

$$W_i = E[(x - \mathbf{m}_i)(x - \mathbf{m}_i)^T] \quad \text{for all } x \text{ in class } \omega_i$$

Bayesian Classifier

Maximize the probability:

$$p(\omega_i \mid x) = \frac{p(x \mid \omega_i) \cdot P(\omega_i)}{p(x)}$$

where

$$p(x) = \sum_{i=1}^{3} p(x \mid \omega_i) \cdot P(\omega_i)$$

$$p(x \mid \omega_i) = \frac{1}{(2\pi)^{5/2} |W_i|^{1/2}} \, e^{-(1/2)(x - \mathbf{m}_i)^T W_i^{-1} (x - \mathbf{m}_i)}$$

Bayesian Classifier

Maximize $p(\omega_i \mid x)$ using the monotonic discriminant function

$$g_i(x) = \ln p(\omega_i \mid x) = \ln[p(x \mid \omega_i) \cdot P(\omega_i)] - \ln p(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i) - \ln p(x)$$

Disregard the term $\ln p(x)$ since it is independent of class, $\omega_i$, giving

$$g_i(x) = -\frac{1}{2}(x - \mathbf{m}_i)^T W_i^{-1} (x - \mathbf{m}_i) + \ln P(\omega_i) + c_i$$

$$c_i = -\frac{5}{2}\ln(2\pi) - \frac{1}{2}\ln |W_i|$$

Bayesian Classifier

Ignore the bias term, $c_i$, and the a priori class probability, $\ln P(\omega_i)$. Then we can convert maximization to a minimization by reversing the sign, giving the decision rule:

Decide class $\omega_i$ if and only if

$$d_i(x) = (x - \mathbf{m}_i)^T W_i^{-1} (x - \mathbf{m}_i) \le d_j(x) \quad \forall j \ne i$$

Utilize confidence measure, based on relative decision scores, to enable a no-decision output when no reliable class information is obtained.
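The decision rule above can be sketched in a few lines. This is a hedged illustration with two loud simplifications: each $W_i$ is assumed diagonal (independent features, so only per-parameter means and variances are needed), and the toy training vectors use 2 made-up features instead of the 5 speech parameters. The function names and all numbers are hypothetical.

```python
def train_class(frames):
    """Per-feature mean and variance of one class from labeled feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    var = [sum((f[k] - mean[k]) ** 2 for f in frames) / n for k in range(dim)]
    return mean, var


def distance(x, mean, var):
    """d_i(x) = (x - m_i)^T W_i^{-1} (x - m_i) for a diagonal W_i (assumption)."""
    return sum((xk - mk) ** 2 / vk for xk, mk, vk in zip(x, mean, var))


def classify(x, models):
    """Decide the class whose distance d_i(x) is smallest."""
    return min(models, key=lambda label: distance(x, *models[label]))


# Hypothetical training data: [log energy (dB), zero crossings per frame].
models = {
    "background": train_class([[-60.0, 10.0], [-58.0, 14.0], [-62.0, 12.0]]),
    "unvoiced": train_class([[-30.0, 50.0], [-34.0, 46.0], [-32.0, 54.0]]),
    "voiced": train_class([[-10.0, 8.0], [-12.0, 12.0], [-8.0, 10.0]]),
}

print(classify([-11.0, 9.0], models))   # voiced
print(classify([-31.0, 48.0], models))  # unvoiced
```

A confidence measure of the kind mentioned on the slide could compare the smallest and second-smallest distances and return no decision when they are close; that step is left out of this sketch.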

Classification Performance

Class                  Training Set   Count   Testing Set   Count
Background (Class 1)   85.5%          76      96.8%         94
Unvoiced (Class 2)     98.2%          57      85.4%         82
Voiced (Class 3)       99%            313     98.9%         375

VUS Classifications

Panel (a): synthetic vowel sequence

Panel (b): all voiced utterance

Panels (c-e): speech utterances with a mixture of regions of voiced speech, unvoiced speech and background signal (silence)


Pitch Detection (Pitch Period Estimation Methods)

Pitch Period Estimation

• Essential component of general synthesis model for speech production
• Major component of excitation source information (along with voiced-unvoiced decision, amplitude)
• Pitch period estimation involves two problems, simultaneously: determination as to whether the speech is periodic and, if so, the resulting pitch (period or frequency)
• A range of pitch detection methods have been proposed including several time domain/frequency domain/cepstral domain/LPC domain methods

Fundamentals of Pitch Period Estimation

The Ideal Case of Perfectly Periodic Signals

Periodic Signals

• An analog signal x(t) is periodic with period T0 if:

$$x(t) = x(t + mT_0) \quad \forall t, \quad m = \ldots, -1, 0, 1, \ldots$$

• The fundamental frequency is:

$$f_0 = \frac{1}{T_0}$$

• A true periodic signal has a line spectrum, i.e., non-zero spectral values exist only at frequencies f = k f0, where k is an integer
• Speech is not precisely periodic, hence its spectrum is not strictly a line spectrum; further the period generally changes slowly with time

The “Ideal” Pitch Detector

• To estimate pitch period reliably, the ideal input would be either:
  – a periodic impulse train at the pitch period
  – a pure sinusoid at the pitch frequency
• In reality, we can’t get either (although we use signal processing to either try to flatten the signal spectrum, or eliminate all harmonics but the fundamental)

Ideal Input to Pitch Detector

[Figure: periodic impulse train with T0=50 samples (amplitude vs. time in samples) and its log magnitude spectrum (log magnitude vs. frequency); F0=200 Hz with sampling rate of FS=10 kHz]

Ideal Input to Pitch Detector

[Figure: pure sinewave at 200 Hz (amplitude vs. time in samples) and its log magnitude spectrum (log magnitude vs. frequency), showing a single harmonic at 200 Hz]

Ideal Synthetic Signal Input

[Figure: synthetic vowel with 100 Hz pitch (amplitude vs. time in samples) and its log magnitude spectrum (log magnitude vs. frequency)]

The “Real World”

[Figure: vowel with varying pitch period (amplitude vs. time in samples) and its log magnitude spectrum (log magnitude vs. frequency)]

Time Domain Pitch Detection (Pitch Period Estimation) Algorithm

1. Filter speech to 900 Hz region (adequate for all ranges of pitch; eliminates extraneous signal harmonics)
2. Find all positive and negative peaks in the waveform
3. At each positive peak:
  • determine peak amplitude pulse (positive pulses only)
  • determine peak-valley amplitude pulse (positive pulses only)
  • determine peak-previous peak amplitude pulse (positive pulses only)
4. At each negative peak:
  • determine peak amplitude pulse (negative pulses only)
  • determine peak-valley amplitude pulse (negative pulses only)
  • determine peak-previous peak amplitude pulse (negative pulses only)
5. Filter pulses with an exponential (peak detecting) window to eliminate false positives and negatives that are far too short to be pitch pulse estimates
6. Determine pitch period estimate as the time between remaining major pulses in each of the six elementary pitch period detectors
7. Vote for best pitch period estimate by combining the 3 most recent estimates for each of the 6 pitch period detectors
8. Clean up errors using some type of non-linear smoother

Time Domain Pitch Measurements

[Figure: waveform vs. time t with positive peaks and negative peaks marked]

Basic Pitch Detection Principles

• use 6 semi-independent, parallel processors to create a number of impulse trains which (hopefully) retain the periodicity of the original signal and discard features which are irrelevant to the pitch detection process (e.g., amplitude variations, spectral shape, etc.)
• very simple pitch detectors are used
• the 6 pitch estimates are logically combined to infer the best estimate of pitch period for the frame being analyzed
• the frame could be classified as unvoiced/silence, with zero pitch period

Parallel Processing Pitch Detector

• 10 kHz speech
• speech lowpass filtered to 900 Hz => guarantees 1 or more harmonics, even for high pitched females and children
• a set of ‘peaks and valleys’ (local maxima and minima) are located, and from their locations and amplitudes, 6 impulse trains are derived

Pitch Detection Algorithm

6 impulse trains:
1. m1(n): an impulse equal to the peak amplitude at the location of each peak
2. m2(n): an impulse equal to the difference between the peak amplitude and the preceding valley amplitude occurs at each peak
3. m3(n): an impulse equal to the difference between the peak amplitude and the preceding peak amplitude occurs at each peak (so long as it is positive)
4. m4(n): an impulse equal to the negative of the amplitude at a valley occurs at each valley
5. m5(n): an impulse equal to the negative of the amplitude at a valley plus the amplitude at the preceding peak occurs at each valley
6. m6(n): an impulse equal to the negative of the amplitude at a valley plus the amplitude at the preceding local minimum occurs at each valley (so long as it is positive)

Peak Detection for Sinusoids

Processing of Pulse Trains

• each impulse train is processed by a time-varying non-linear system (called a peak detecting exponential window)peak detecting exponential window)

• impulse of sufficient amplitude is detected => output is reset to value of impulse and “held” for a blanking interval, Tau(n) during which no new pulses can be detectedcan be detected

• after the blanking interval, the detector output decays exponentially with a rate of decay dependent on the most recent estimate of pitch period

• the decay continues until an impulse that exceeds the level of the decay is detected

• output is a quasi-periodic sequence of pulses, and the duration between estimated pulses is an estimate of the pitch period

• pitch period estimated periodically, e.g., 100/sec

Final Processing for Pitch

• same detection applied to all 6 detectors => 6 estimates of pitch period every sampling interval
  – the 6 current estimates are combined with the two most recent estimates for each of the 6 detectors
  – the pitch period with the most occurrences (to within some tolerance) is declared the pitch period estimate at that time
• the algorithm works well for voiced speech
• there is a lack of pitch period consistency for unvoiced speech or background signal

Pitch Detector Performance

• using synthetic speech gives a measure of accuracy of the algorithm

• pitch period estimates generally within 2 samples of actual pitch period

• first 10-30 msec of voicing often classified as unvoiced since decision method needs about 3 pitch periods before consistency check works properly => delay of 2 pitch periods in detection

Yet Another Pitch Detector (YAPD)

Autocorrelation Method of Pitch Detection

Autocorrelation Pitch Detection

• basic principle – a periodic function has a periodic autocorrelation – just find the correct peak
• basic problem – the autocorrelation representation of speech is just too rich
  – it contains information that enables you to estimate the vocal tract transfer function (from the first 10 or so values)
  – many peaks in autocorrelation in addition to pitch periodicity peaks
  – some peaks due to rapidly changing formants
  – some peaks due to window size interactions with the speech signal
• need some type of spectrum flattening so that the speech signal more closely approximates a periodic impulse train => center clipping spectrum flattener

Autocorrelation of Voiced Speech Frame

[Figure: frames x[n], n = 0, 1, ..., 399 and x[n], n = 0, 1, ..., 559, with autocorrelation R[k], k = 0, 1, ..., pmax+10; lags pmin, ploc and pmax marked]

Center Clipping

• CL = % of Amax (e.g., 30%)

• Center clipper definition:

  • if x(n) > CL, y(n) = x(n) - CL

  • if x(n) < -CL, y(n) = x(n) + CL

  • if |x(n)| ≤ CL, y(n) = 0

3-Level Center Clipper

y(n) = +1 if x(n) > CL

= -1 if x(n) < -CL

= 0 otherwise

• significantly simplified computation (no multiplications)

• autocorrelation function is very similar to that from a conventional center clipper => most of the extraneous peaks are eliminated and a clear indication of periodicity is retained
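Both clippers can be sketched directly from their definitions. This is a minimal illustration: the function names are made up, the clipping level CL is taken as a fraction of the peak absolute amplitude of the frame, and the conventional clipper is applied symmetrically to positive and negative samples.

```python
def center_clip(x, fraction=0.3):
    """Conventional center clipper: remove the band [-CL, CL] and shift the rest."""
    cl = fraction * max(abs(s) for s in x)   # CL as a fraction of Amax
    return [s - cl if s > cl else s + cl if s < -cl else 0.0 for s in x]


def three_level_clip(x, fraction=0.3):
    """3-level center clipper: output is +1, -1 or 0 only."""
    cl = fraction * max(abs(s) for s in x)
    return [1 if s > cl else -1 if s < -cl else 0 for s in x]


frame = [0.0, 0.5, 1.0, -0.2, -1.0]
print(center_clip(frame, fraction=0.5))       # [0.0, 0.0, 0.5, 0.0, -0.5]
print(three_level_clip(frame, fraction=0.5))  # [0, 0, 1, 0, -1]
```

Because the 3-level output takes only the values +1, 0 and -1, every product in its autocorrelation is also +1, 0 or -1, which is why no multiplications are needed.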

Waveforms and Autocorrelations

First row: no clipping (dashed lines show 70% clipping level)

Second row: center clipped at 70% threshold

Third row: 3-level center clipped

Autocorrelations of Center-Clipped Speech

Clipping Level:

(a) 30%

(b) 60%

(c) 90%

Doubling Errors in Autocorrelation

Second and fourth harmonics much stronger than first and third harmonics => potential doubling error in pitch detection.

Doubling Errors in Autocorrelation

Second and fourth harmonics again much stronger than first and third harmonics => potential doubling error in pitch detection.

Autocorrelation Pitch Detector

• lots of errors with conventional autocorrelation—especially short lag estimates of pitch period

• center clipping eliminates most of the gross errors

• nonlinear smoothing fixes the remaining errors
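A per-frame autocorrelation pitch detector with 3-level center clipping might look like the sketch below. It is an illustration under assumptions, not the lecture's implementation: the 30% clipping level, the voicing threshold of 0.3 on the normalized peak, the lag search range and the function names are all choices made here. Nonlinear smoothing of the resulting contour would be a separate step.

```python
import math


def autocorrelation(x, max_lag):
    """Short-time autocorrelation R[k] = sum_i x[i] * x[i + k], k = 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def pitch_period(frame, min_lag, max_lag, clip_fraction=0.3, voicing_threshold=0.3):
    """Return the pitch period in samples, or 0 if the frame looks unvoiced."""
    cl = clip_fraction * max(abs(s) for s in frame)
    clipped = [1 if s > cl else -1 if s < -cl else 0 for s in frame]  # 3-level clip
    r = autocorrelation(clipped, max_lag)
    if r[0] == 0:                       # nothing survived clipping (e.g. silence)
        return 0
    best = max(range(min_lag, max_lag + 1), key=lambda k: r[k])
    # Call the frame voiced only if the peak is a sizeable fraction of R[0].
    return best if r[best] / r[0] >= voicing_threshold else 0


# Synthetic voiced frame: 200 Hz fundamental plus second harmonic at FS = 10 kHz,
# so the true pitch period is 50 samples.
frame = [math.sin(2 * math.pi * i / 50) + 0.5 * math.sin(4 * math.pi * i / 50)
         for i in range(400)]
print(pitch_period(frame, min_lag=25, max_lag=150))  # 50
```

The lag range 25 to 150 samples corresponds to 400 Hz down to about 67 Hz at a 10 kHz sampling rate.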

Yet Another Pitch Detector (YAPD)

Log Harmonic Product Spectrum Pitch Detector

STFT for Pitch Detection

• from narrowband STFT's we see that the pitch period is manifested in sharp peaks at integer multiples of the fundamental frequency => good input for designing a pitch detection algorithm

• define a new measure, called the harmonic product spectrum, as

$$P_n(e^{j\omega}) = \prod_{r=1}^{K} |X_n(e^{j\omega r})|^2$$

• the log harmonic product spectrum is thus

$$\hat{P}_n(e^{j\omega}) = 2 \sum_{r=1}^{K} \log |X_n(e^{j\omega r})|$$

• $\hat{P}_n$ is a sum of K frequency compressed replicas of $\log |X_n(e^{j\omega})|$ => for periodic voiced speech, the harmonics will all align at the fundamental frequency and reinforce each other => sharp peak at F0
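The log harmonic product spectrum can be sketched on DFT bins, where compressing the frequency axis by r means reading bin r*k. This is a hedged illustration: a direct O(N^2) DFT is used to stay dependency-free, no analysis window is applied, a small floor guards the logarithm, and the function names and test signal are assumptions made here.

```python
import cmath
import math


def dft_magnitude(x, nfft):
    """|X[k]| for k = 0..nfft/2 by direct DFT evaluation (slow but self-contained)."""
    mags = []
    for k in range(nfft // 2 + 1):
        acc = sum(x[n] * cmath.exp(-2j * math.pi * k * n / nfft) for n in range(len(x)))
        mags.append(abs(acc))
    return mags


def log_hps(mags, K, floor=1e-6):
    """P_hat[k] = 2 * sum_{r=1..K} log|X[r*k]|, for bins k with K*k still in range."""
    n_bins = (len(mags) - 1) // K + 1
    return [2.0 * sum(math.log(max(mags[r * k], floor)) for r in range(1, K + 1))
            for k in range(n_bins)]


# Synthetic voiced frame: 5 equal harmonics of 200 Hz at FS = 10 kHz, 400 samples.
fs, n = 10000, 400
x = [sum(math.sin(2 * math.pi * 200 * h * i / fs) for h in range(1, 6))
     for i in range(n)]
p = log_hps(dft_magnitude(x, n), K=5)
peak_bin = max(range(1, len(p)), key=lambda k: p[k])
print(peak_bin * fs / n)  # 200.0
```

Only at the bin of the fundamental do all K compressed replicas contribute a harmonic peak at once; at every other bin at least one term sits in a spectral valley, which pulls the sum down.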

Column (a): sequence of log harmonic product spectra during a voiced region of speech

Column (b): sequence of harmonic product spectra during a voiced region of speech

STFT for Pitch Detection

• no problem with unvoiced speech: no strong peak is manifest in log harmonic product spectrum

• no problem if fundamental is missing (e.g., highpass filtered speech) as fundamental is found from higher order terms that line up at the fundamental but nowhere else

• no problem with additive noise or linear distortion (see plot at 0 dB SNR)


Cepstral Pitch Detector

Cepstral Pitch Detection

• simple procedure for cepstral pitch detection

1. compute cepstrum every 10-20 msec
2. search for periodicity peak in expected range of n
3. if found and above threshold => voiced, pitch = location of cepstral peak
4. if not found => unvoiced
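The four steps above can be sketched for a single frame. This is an illustration under assumptions: the real cepstrum is computed with a direct DFT to stay dependency-free, the frame is assumed to be already windowed, and the 0.1 threshold, the quefrency range and the function names are choices made here rather than values fixed by this slide.

```python
import cmath
import math


def real_cepstrum(x, quefrencies):
    """Real cepstrum c[q] = IDFT(log|X[k]|), evaluated only at the given q values."""
    n = len(x)
    log_mag = []
    for k in range(n):
        xk = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        log_mag.append(math.log(max(abs(xk), 1e-10)))   # floor guards log(0)
    # log|X[k]| is real and even, so the inverse DFT reduces to a cosine sum.
    return {q: sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
            for q in quefrencies}


def cepstral_pitch(frame, min_q, max_q, threshold=0.1):
    """Return the pitch period in samples, or 0 if no peak reaches the threshold."""
    c = real_cepstrum(frame, range(min_q, max_q + 1))
    peak = max(c, key=c.get)                 # step 2: strongest peak in the range
    return peak if c[peak] >= threshold else 0   # steps 3 and 4


# Synthetic voiced frame: decaying impulse train with a period of 50 samples.
voiced = [0.8 ** (i // 50) if i % 50 == 0 else 0.0 for i in range(400)]
print(cepstral_pitch(voiced, min_q=25, max_q=150))  # 50
```

For this test signal the cepstrum has peaks at 50, 100 and 150 samples whose heights fall off with quefrency, so the first peak wins. Real speech frames generally give weaker peaks, which is why a low threshold combined with secondary checks is used in practice.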

Cepstral Sequences for Voiced and Unvoiced Speech

[Figure: cepstral sequences for a male talker and a female talker]

Comparison of Cepstrum and ACF

Pitch doubling errors eliminated in cepstral display, but not in autocorrelation display. Weak cepstral peaks still stand out in cepstral display.

Issues in Cepstral Pitch Detection

1. strong peak in 3-20 msec range is strong indication of voiced speech; absence of such a peak does not guarantee unvoiced speech
  • cepstral peak depends on length of window, and formant structure
  • maximum height of pitch peak is 1 (rectangular window (RW), unchanging pitch, window contains exactly N periods); height varies dramatically with Hamming window (HW), changing pitch, window interactions with pitch period => need at least 2 full pitch periods in window to define pitch period well in cepstrum => need 40 msec window for low pitch male, but this is way too long for high pitch female

2. bandlimited speech makes finding pitch period harder
  • extreme case of single harmonic => single peak in log spectrum => no peak in cepstrum
  • this occurs during voiced stop sounds (b, d, g) where the spectrum is cut off above a few hundred Hz

3. need very low threshold (e.g., 0.1) on pitch period, with lots of secondary verifications of pitch period


LPC-Based Pitch Detector

LPC Pitch Detection-SIFT

• sampling rate reduced from 10 kHz to 2 kHz

• p=4 analysis

• inverse filter signal to give spectrally flat result

• compute short time autocorrelation and find strongest peak in estimated pitch region
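The SIFT steps listed above might be sketched as below. This is an illustration under stated assumptions, not the published algorithm: the windowed-sinc decimation filter, the Hamming analysis window, the 3-20 msec search range, and the helper names `lpc_inverse_filter` and `sift_pitch` are all hypothetical choices.

```python
import numpy as np

def lpc_inverse_filter(x, order):
    """Autocorrelation-method LPC; returns A(z) coefficients [1, -a1, ..., -ap]."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])                    # normal equations
    return np.concatenate(([1.0], -a))

def sift_pitch(frame, fs=10000, factor=5, order=4, min_ms=3.0, max_ms=20.0):
    """Return (pitch_period_seconds, normalized_autocorrelation_peak)."""
    frame = np.asarray(frame, dtype=float)
    # Step 1: lowpass and decimate 10 kHz -> 2 kHz (assumed windowed-sinc FIR).
    n = np.arange(-32, 33)
    h = np.sinc(n / factor) * np.hamming(len(n)) / factor
    low = np.convolve(frame, h, mode="same")[::factor]
    fs_low = fs / factor
    # Step 2: order-4 LPC analysis on the windowed, decimated frame.
    w = low * np.hamming(len(low))
    a = lpc_inverse_filter(w, order)
    # Step 3: inverse filter to obtain a spectrally flat signal.
    residual = np.convolve(w, a)[:len(w)]
    # Step 4: normalized short-time autocorrelation and peak search.
    acf = np.correlate(residual, residual, mode="full")[len(residual) - 1:]
    acf = acf / acf[0]
    lo = int(round(min_ms * 1e-3 * fs_low))
    hi = min(int(round(max_ms * 1e-3 * fs_low)), len(acf) - 1)
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return lag / fs_low, float(acf[lag])
```

At the 2 kHz rate the lag resolution is only 0.5 msec, so a practical detector would interpolate around the peak; that refinement is omitted here.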

LPC Pitch Detection-SIFT

• part a: section of input waveform being analyzed

• part b: input spectrum and reciprocal of the inverse filter

• part c: spectrum of signal at output of the inverse filter

• part d: time waveform at output of the inverse filter

• part e: normalized autocorrelation of the signal at the output of the inverse filter

=> 8 msec pitch period found here

Algorithm #4 – Formant Estimation

Cepstral-Based Formant Estimation

Cepstral Formant Estimation

• the low-time cepstrum corresponds primarily to the combination of vocal tract, glottal pulse, and radiation, while the high-time part corresponds primarily to excitation
=> use lowpass liftered cepstrum to give smoothed log spectra to estimate formants
• want to estimate time-varying model parameters every 10-20 msec
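Lowpass liftering can be illustrated with a short sketch: keep only the low-time cepstrum, transform back, and read formant candidates off the peaks of the smoothed log spectrum. The rectangular lifter, the 2.5 msec cutoff, the FFT size, and the names `smoothed_log_spectrum` and `pick_peaks` are assumptions made for the example.

```python
import numpy as np

def smoothed_log_spectrum(frame, fs, cutoff_ms=2.5, nfft=1024):
    """Return (freqs_hz, smoothed_log_magnitude) from the low-time cepstrum."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(windowed, nfft)) + 1e-10)
    cep = np.fft.ifft(log_mag).real                 # real cepstrum (even in n)
    n_c = int(cutoff_ms * 1e-3 * fs)                # lifter cutoff in samples
    lifter = np.zeros(nfft)
    lifter[:n_c + 1] = 1.0                          # keep low-time part ...
    lifter[nfft - n_c:] = 1.0                       # ... and its mirror image
    smooth = np.fft.fft(cep * lifter).real
    freqs = np.arange(nfft // 2 + 1) * fs / nfft
    return freqs, smooth[:nfft // 2 + 1]

def pick_peaks(freqs, spec, count=3):
    """Return the frequencies of the first `count` local maxima of spec."""
    idx = [k for k in range(1, len(spec) - 1)
           if spec[k - 1] < spec[k] >= spec[k + 1]]
    return [float(freqs[k]) for k in idx[:count]]
```

The cutoff has to stay below the shortest expected pitch period so that the excitation peak is excluded from the smoothed spectrum.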

Cepstral Formant Estimation

1. fit peaks in cepstrum; decide if section of speech voiced or unvoiced
2. if voiced, estimate pitch period, lowpass lifter cepstrum, match first 3 formant frequencies to smooth log magnitude spectrum
3. if unvoiced, set pole frequency to highest peak in smoothed log spectrum; choose zero to maximize fit to smoothed log spectrum

Cepstral Formant Estimation

[Figure: cepstra and corresponding spectra]

• sometimes 2 formants get so close that they merge and there are not 2 distinct peaks in the log magnitude spectrum => use higher resolution spectral analysis via CZT

• blown up region of 0-900 Hz showing 2 peaks when only 1 seen in normal spectrum
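What the CZT evaluates can be shown with a direct (slow) computation of the z-transform on a finely sampled arc, for example 0-900 Hz. The sketch below is only illustrative: the function name `zoom_spectrum` and the optional `radius` parameter are assumptions, and a true chirp z-transform would compute the same samples with FFTs instead of the O(N·M) sum used here. Choosing `radius` slightly below 1 moves the contour closer to the vocal tract poles, which is one possible way to sharpen closely spaced peaks.

```python
import numpy as np

def zoom_spectrum(x, fs, f_lo, f_hi, points, radius=1.0):
    """Evaluate X(z) = sum_n x[n] z^(-n) at z = radius * exp(j*2*pi*f/fs).

    Returns (freqs_hz, X) for `points` frequencies spaced evenly on [f_lo, f_hi].
    """
    x = np.asarray(x, dtype=float)
    freqs = np.linspace(f_lo, f_hi, points)
    n = np.arange(len(x))
    z = radius * np.exp(2j * np.pi * freqs / fs)
    # One row per frequency, one column per time sample.
    X = (x[None, :] * z[:, None] ** (-n[None, :])).sum(axis=1)
    return freqs, X
```

For example, `zoom_spectrum(h, 10000, 0, 900, 256)` samples the band 0-900 Hz at about 3.5 Hz spacing, where `h` stands for whatever short sequence is being analyzed.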

Cepstral Speech Processing

• Cepstral pitch detector – median smoothed
• Cepstral formant estimation – using CZT to resolve close peaks
• Formant synthesizer – 3 estimated formants for voiced speech; estimated formant and zero for unvoiced speech
• All parameters quantized to appropriate number of levels
• essential features of signal well preserved
• very intelligible synthetic speech
• speaker easily identified

[Figure: formant synthesis]

LPC-Based Formant Estimation

Formant Analysis Using LPC

• factor predictor polynomial; assign roots to formants
• pick prominent peaks in LPC spectrum
• problems on nasals where roots are not poles or zeros
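The first approach, factoring the predictor polynomial, can be sketched as follows. The function name `lpc_formants`, the bandwidth limit used to reject broad roots, and the convention that the input holds the coefficients of A(z) = 1 + a1 z^-1 + ... + ap z^-p are assumptions for this example; the bandwidth relation B = -ln|z| · fs / π is the usual approximation for a pole at radius |z|.

```python
import numpy as np

def lpc_formants(a, fs, max_bw=400.0):
    """Map roots of A(z) to (frequency_hz, bandwidth_hz) formant candidates.

    a      : coefficients [1, a1, ..., ap] of the inverse filter A(z)
    max_bw : hypothetical limit; broader roots are treated as spectral shaping
    """
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]               # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)      # pole angle -> frequency
    bws = -np.log(np.abs(roots)) * fs / np.pi       # pole radius -> bandwidth
    order = np.argsort(freqs)
    return [(float(f), float(b))
            for f, b in zip(freqs[order], bws[order]) if b < max_bw]
```

Assigning the surviving roots to F1, F2, F3 from frame to frame still needs continuity constraints, which this sketch does not attempt.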

Algorithm #5 – Speech Synthesis Methods

Speech Synthesis

• can use cepstrally (or LPC) estimated parameters to control speech synthesis model
• for voiced speech the vocal tract transfer function is modeled as

$$V(z) = \prod_{k=1}^{4} \frac{1 - 2e^{-\alpha_k T}\cos(2\pi F_k T) + e^{-2\alpha_k T}}{1 - 2e^{-\alpha_k T}\cos(2\pi F_k T)\,z^{-1} + e^{-2\alpha_k T}\,z^{-2}}$$

-- cascade of digital resonators ($F_1$-$F_4$) with unity gain at $f = 0$
-- estimate $F_1$-$F_3$ using formant estimation methods, $F_4$ fixed at 4000 Hz
-- formant bandwidths ($\alpha_1$-$\alpha_4$) fixed

• fixed spectral compensation approximates glottal pulse shape and radiation

$$S(z) = \frac{(1 - e^{-aT})(1 + e^{-bT})}{(1 - e^{-aT}z^{-1})(1 + e^{-bT}z^{-1})}, \qquad a = 400\pi,\quad b = 5000\pi$$
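The voiced model can be sketched as a cascade of second-order recursions, one per formant, each scaled so that its gain at z = 1 is unity as in V(z). The function names and the example bandwidth values are assumptions for illustration only; the slides state that the bandwidths are fixed but do not give their values here.

```python
import numpy as np

def resonator(x, F, alpha, T):
    """One section of V(z): pole pair at frequency F (Hz), radius exp(-alpha*T)."""
    r = np.exp(-alpha * T)
    c = 2.0 * r * np.cos(2.0 * np.pi * F * T)
    gain = 1.0 - c + r * r                  # numerator constant: unity gain at z = 1
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = gain * x[n]
        if n >= 1:
            y[n] += c * y[n - 1]
        if n >= 2:
            y[n] -= r * r * y[n - 2]
    return y

def voiced_tract(x, formants, alphas, fs):
    """Cascade the resonators for F1..F4 (F4 would be fixed at 4000 Hz)."""
    y = np.asarray(x, dtype=float)
    for F, alpha in zip(formants, alphas):
        y = resonator(y, F, alpha, 1.0 / fs)
    return y

# Hypothetical usage: bandwidths of 60, 100, 120, 175 Hz are assumed values,
# converted with alpha = pi * bandwidth.
# y = voiced_tract(excitation, [500, 1500, 2500, 4000],
#                  [np.pi * b for b in (60, 100, 120, 175)], fs=10000)
```

The compensation filter S(z) could be applied the same way, as two first-order sections with poles at z = e^{-aT} and z = -e^{-bT}.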

Speech Synthesis

• for unvoiced speech the model is a complex pole and zero of the form

$$V(z) = \frac{\left(1 - 2e^{-\beta T}\cos(2\pi F_p T) + e^{-2\beta T}\right)\left(1 - 2e^{-\beta T}\cos(2\pi F_z T)\,z^{-1} + e^{-2\beta T}\,z^{-2}\right)}{\left(1 - 2e^{-\beta T}\cos(2\pi F_p T)\,z^{-1} + e^{-2\beta T}\,z^{-2}\right)\left(1 - 2e^{-\beta T}\cos(2\pi F_z T) + e^{-2\beta T}\right)}$$

$F_p$ = largest peak in smoothed spectrum above 1000 Hz

$$F_z = (0.0065\,F_p + 4.5 - \Delta)(0.014\,F_p + 28)$$

$$\Delta = 20\log_{10}\left|H(e^{j2\pi F_p T})\right| - 20\log_{10}\left|H(e^{j0})\right|$$

• these formulas ensure spectral amplitudes are preserved
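A small sketch of how the unvoiced parameters might be turned into filter coefficients is given below. The function names and the example value of β are assumptions; the formula for the zero frequency and the form of V(z) follow the slide. Note that the two constant factors in V(z) make the gain at z = 1 equal to one.

```python
import numpy as np

def zero_frequency(F_p, delta_db):
    """Zero location F_z (Hz) from pole frequency F_p (Hz) and level difference (dB)."""
    return (0.0065 * F_p + 4.5 - delta_db) * (0.014 * F_p + 28.0)

def unvoiced_coeffs(F_p, F_z, beta, fs):
    """Return (b, a) for V(z): zero pair at F_z, pole pair at F_p, radius exp(-beta*T)."""
    T = 1.0 / fs
    r = np.exp(-beta * T)
    pole = np.array([1.0, -2.0 * r * np.cos(2.0 * np.pi * F_p * T), r * r])
    zero = np.array([1.0, -2.0 * r * np.cos(2.0 * np.pi * F_z * T), r * r])
    gain = pole.sum() / zero.sum()          # the two constant factors of V(z)
    return gain * zero, pole
```

For instance, `zero_frequency(3000.0, 10.0)` evaluates (19.5 + 4.5 - 10)(42 + 28) = 980 Hz; the inputs are made-up numbers used only to exercise the formula.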

Quantization of Synthesizer Parameters

• model parameters estimated at 100/sec rate, lowpass filtered
• SR reduced to twice the LP cutoff and parameters quantized
• parameters could be filtered to 16 Hz BW with no noticeable degradation => 33 Hz SR
• formants and pitch quantized with a linear quantizer; amplitude quantized with a logarithmic quantizer
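The two quantizer types named above can be sketched as follows. The function names, the mid-point reconstruction rule, and the parameter ranges in the usage note are assumptions; the slides give only the quantizer types and bit allocations.

```python
import numpy as np

def linear_quantize(x, lo, hi, bits):
    """Uniform quantizer on [lo, hi]; returns (indices, reconstructed values)."""
    levels = 2 ** bits
    step = (hi - lo) / levels
    idx = np.floor((np.asarray(x, dtype=float) - lo) / step)
    idx = np.clip(idx, 0, levels - 1).astype(int)
    return idx, lo + (idx + 0.5) * step     # reconstruct at the cell midpoint

def log_quantize(x, lo, hi, bits):
    """Uniform quantizer applied to log(x), for amplitude-like parameters."""
    idx, q = linear_quantize(np.log(x), np.log(lo), np.log(hi), bits)
    return idx, np.exp(q)
```

With a hypothetical first-formant range of 200-900 Hz and 3 bits, `linear_quantize(f1_track, 200.0, 900.0, 3)` gives 8 levels spaced 87.5 Hz apart. A parameter track estimated at 100/sec and lowpass filtered would be decimated before quantizing, for example with `track[::3]`.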

Quantization of Synthesizer Parameters

Parameter               Required Bits/Sample
Pitch Period (Tau)      6
First Formant (F1)      3
Second Formant (F2)     4
Third Formant (F3)      3
log-amplitude (AV)      2

600 bps total rate for voiced speech with 100 bps for V/UV decisions
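The 600 bps figure can be checked against the table, assuming the parameters are sent at the decimated rate of 100/3 (about 33) frames per second mentioned on the previous slide. The variable names are illustrative.

```python
# Bits per frame from the table above.
bits = {
    "pitch period": 6,
    "F1": 3,
    "F2": 4,
    "F3": 3,
    "log-amplitude": 2,
}
frame_rate = 100 / 3                        # assumed: 100/sec estimates decimated 3-to-1
bits_per_frame = sum(bits.values())         # 6 + 3 + 4 + 3 + 2 = 18
bit_rate = bits_per_frame * frame_rate      # 18 bits x 33.3 frames/sec = 600 bps
```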

Quantization of Synthesizer Parameters

[Figure: pitch and formant contours. a: original; b: smoothed; c: quantized]

• formant and pitch modifications: lowpass filtered, quantized, and decimated by 3-to-1 ratio
-- little perceptual difference

Algorithms for Speech Processing

• Based on the various representations of speech we can create algorithms for measuring features that characterize speech and estimating properties of the speech signal, e.g.,
  – presence or absence of speech (Speech/Non-Speech Discrimination)
  – classification of signal frame as Voiced/Unvoiced/Background signal
  – estimation of the pitch period (or pitch frequency) for a voiced speech frame
  – estimation of the formant frequencies (resonances and anti-resonances of the vocal tract) for both voiced and unvoiced speech frames
• Based on the model of speech production, we can build a speech synthesizer on the basis of speech parameters estimated by the above set of algorithms and synthesize intelligible speech
