
CODING OF SPEECH SIGNALS

Speech Coding

• Waveform Coding

• Analysis-Synthesis, or Vocoders

• Hybrid Coding

Waveform Coding: an attempt is made to preserve the original waveform.

Vocoders: a theoretical model of the speech production mechanism is

considered.

Hybrid Coding: uses techniques from the other two.

Speech Coders

Speech Quality vs. Bit Rate for Codecs

From: J. Woodard, “Speech coding overview”,

http://www-mobile.ecs.soton.ac.uk/speech_codecs

Speech Coding Objectives

– High perceived quality

– High measured intelligibility

– Low bit rate (bits per second of speech)

– Low computational requirement (MIPS)

– Robustness to successive encode/decode cycles

– Robustness to transmission errors

Objectives for real-time only:

– Low coding/decoding delay (ms)

– Work with non-speech signals (e.g. touch tone)

Speech Information Rates

• Fundamental level:

• 10-15 phonemes/second for continuous speech.

• 32-64 phonemes per language => 6 bits/phoneme.

• Information Rate=60-90 bps at the source.

• Waveform level

• Speech bandwidth from 4-10 kHz => sampling rate from 8-20 kHz.

• Need 12-16 bit quantization for high quality digital coding.

• Information Rate=96-320 kbps => more than 3 orders of magnitude difference in Information rates between the production and waveform levels.
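The arithmetic behind these two information rates can be checked with a short sketch. The function names are illustrative only, and a fixed-length code per phoneme is assumed (6 bits for up to 64 phonemes).

```python
import math

def source_rate_bps(phonemes_per_second, phonemes_in_language):
    """Bits per second needed to label each phoneme with a fixed-length code."""
    bits_per_phoneme = math.ceil(math.log2(phonemes_in_language))
    return phonemes_per_second * bits_per_phoneme

def waveform_rate_bps(sampling_rate_hz, bits_per_sample):
    """Bits per second of uniformly quantized PCM."""
    return sampling_rate_hz * bits_per_sample

low_source = source_rate_bps(10, 64)       # 10 phonemes/s, 6 bits each
high_source = source_rate_bps(15, 64)      # 15 phonemes/s, 6 bits each
low_wave = waveform_rate_bps(8000, 12)     # 8 kHz, 12 bits
high_wave = waveform_rate_bps(20000, 16)   # 20 kHz, 16 bits
smallest_ratio = low_wave / high_source    # most conservative waveform/source ratio
```

Even the most conservative ratio (lowest waveform rate over highest source rate) exceeds 1000, which is the "3 orders of magnitude" gap noted above.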

MOS (Mean Opinion Scores)

• Why MOS:

– SNR is just not good enough as a subjective measure for most coders (especially model-based coders, where the waveform is inherently not preserved)

– noise is not simple white (uncorrelated) noise

– error is signal correlated

• clicks/transients

• frequency dependent spectrum—not white

• includes components due to reverberation and echo

• noise comes from at least two sources, namely quantization and background noise

• delay due to transmission, block coding, processing

• transmission bit errors—can use Unequal Protection Methods

• tandem encodings

MOS Ratings

• Subjective evaluation of speech quality

– 5—excellent

– 4—good

– 3—fair

– 2—poor

– 1—bad

• MOS Scores:

– 4.5 for natural wideband speech

– 4.05 for toll quality telephone speech

– 3.5-4.0 for communications quality telephone speech

– 2.0-3.5 for lower quality speech from synthesizers, low bit rate coders, etc

• Other measures of intelligibility:

– DRT (diagnostic rhyme test) => uses rhyming words

– DAM (diagnostic acceptability measure)

Digital Speech Coding

Sampling Theorem

• Theorem: If the highest frequency contained in an analog signal xa(t) is Fmax = B, and the signal is sampled at a frequency Fs > 2B, then the analog signal can be exactly recovered from its samples using the following reconstruction formula:

$$x_a(t) = \sum_{n=-\infty}^{\infty} x_a(nT)\,\frac{\sin\left[\frac{\pi}{T}(t-nT)\right]}{\frac{\pi}{T}(t-nT)}, \qquad T = 1/F_s$$

• Note that at the original sample instants (t = nT), the reconstructed analog signal is equal to the value of the original analog signal. At times between the sample instants, the signal is the weighted sum of shifted sinc functions.

GRAPHICAL INTERPRETATION OF THE SAMPLING THEOREM

RECONSTRUCTION VIA SINC(X) INTERPOLATION
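A minimal sketch of sinc interpolation, assuming a finite block of samples (the theorem's sum is infinite, so values between samples are only approximate near the block edges). The function name and the 1 kHz test tone are illustrative choices.

```python
import math

def sinc_reconstruct(samples, T, t):
    """Evaluate sum_n x(n) * sin(pi (t - nT)/T) / (pi (t - nT)/T) over a finite block."""
    total = 0.0
    for n, xn in enumerate(samples):
        u = math.pi * (t - n * T) / T
        # The sinc kernel equals 1 at u = 0 (limit of sin(u)/u).
        total += xn if abs(u) < 1e-12 else xn * math.sin(u) / u
    return total

Fs = 8000.0          # sampling rate in Hz
T = 1.0 / Fs
f0 = 1000.0          # tone frequency, well below Fs/2
samples = [math.cos(2 * math.pi * f0 * n * T) for n in range(400)]

at_sample = sinc_reconstruct(samples, T, 200 * T)      # on a sample instant
between = sinc_reconstruct(samples, T, 200.5 * T)      # halfway between two samples
```

At a sample instant only one kernel is non-zero, so the original sample is returned; between samples every kernel contributes.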

TYPICAL SAMPLING FREQUENCIES IN SPEECH RECOGNITION

• 8 kHz: Popular in digital telephony. Provides

coverage of first three formants for most

speakers and most sounds.

• 16 kHz: Popular in speech research. Why?

• Sub 8 kHz Sampling: Can aliasing be useful in

speech recognition? Hint: Consumer

electronics.

Problems

• Sampling theorem for bandlimited signals.

• How to change the sample rate of a signal?

• How this can be implemented using time domain

interpolation (based on the Sampling Theorem)?

• How this can be implemented efficiently using

digital filters?

Digital Speech Coding

Speech Probability Density Function

• The probability density function for x(n) is the same as for xa(t), since x(n) = xa(nT); the mean and variance are therefore the same for both x(n) and xa(t).

• Need to estimate probability density and power spectrum from speech waveforms

– probability density estimated from long term histogram of amplitudes

– a good approximation is a gamma distribution of the form:

$$p(x) = \left[\frac{\sqrt{3}}{8\pi\sigma_x |x|}\right]^{1/2} e^{-\frac{\sqrt{3}\,|x|}{2\sigma_x}}$$

– a simpler approximation is the Laplacian density, of the form:

$$p(x) = \frac{1}{\sqrt{2}\,\sigma_x}\, e^{-\frac{\sqrt{2}\,|x|}{\sigma_x}}$$

Measured Speech Densities

• Distribution normalized so that the mean is 0 and the variance is 1 (x̄ = 0, σx = 1)

• Gamma density more closely

approximates measured

distribution for speech than

Laplacian

• Laplacian is still a good

model and is used in

analytical studies

• Small amplitudes much

more likely than large

amplitudes by 100:1 ratio.

Speech AC and Power Spectrum

• Can estimate long term autocorrelation and power spectrum using time-series analysis methods

$$\hat{\phi}(m) = \frac{1}{L}\sum_{n=0}^{L-1-m} x(n)\,x(n+m), \qquad 0 \le m \le L-1$$

where L is a large integer

• 8kHz sampled speech for several

speakers

• High correlation between

adjacent samples

• Low pass speech more highly

correlated than bandpass

speech

Instantaneous Quantization

• Separating the processes of sampling and

quantization

• Assume x(n) obtained by sampling a bandlimited

signal at a rate at or above the Nyquist rate.

• Assume x(n) is known to infinite precision in

amplitude

• Need to quantize x(n) in some suitable manner.

Quantization and Encoding

Coding is a two-stage process

• Quantization process: x( n)→x’ (n)

• Encoding process: x’ (n) → c(n)

where Δ is the (assumed fixed) quantization step size

• Decoding is a single-stage process

• decoding process:c’(n) → x’’(n)

• if c’(n)=c(n), (no errors in transmission) then x’’(n)

=x’(n)

• In general x’’(n) ≠ x(n): coding and quantization lose information.

B-bit Quantization

• Use B-bit binary numbers to represent the quantized samples => 2^B quantization levels

• Information Rate of Coder: I = B·FS = total bit rate in bits/second

• B=16, FS= 8 kHz => I=128 Kbps

• B=8, FS= 8 kHz => I=64 Kbps

• B=4, FS= 8 kHz => I=32 Kbps

• Goal of waveform coding is to get the highest quality at a fixed value of I (Kbps), or equivalently to get the lowest value of I for a fixed quality.

• Since FS is fixed, need most efficient quantization methods to minimize I.

Quantization Basics

• Assume |x(n)| ≤Xmax (possibly ∞)

• For Laplacian density (where Xmax=∞), can show that

0.35% of the samples fall outside the range -4σx ≤

x(n) ≤ 4σx => large quantization errors for 0.35% of

the samples.

• Can safely assume that Xmax is proportional to σx.
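The 0.35% figure can be checked directly: for a zero-mean Laplacian density with standard deviation σx, integrating the two tails gives P(|x| > kσx) = exp(−√2·k). A small sketch (the function name is illustrative):

```python
import math

def laplacian_tail_probability(k):
    """P(|x| > k * sigma_x) for a zero-mean Laplacian density."""
    return math.exp(-math.sqrt(2.0) * k)

p_outside_4_sigma = laplacian_tail_probability(4.0)
percent_outside = 100.0 * p_outside_4_sigma
```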

Quantization Process

Uniform Quantizer

• The quantization range and levels are chosen so that the signal can easily be processed digitally

Mid-Riser and Mid-Tread Quantizers

• mid-riser

– origin (x=0) in middle of rising part of the staircase

– same number of positive and negative levels

– symmetrical around origin.

• mid-tread

– origin (x=0) in middle of quantization level

– one more negative level than positive

– one quantization level of 0 (where a lot of activity occurs)

• Code words have direct numerical significance (sign-magnitude

representation for mid-riser, two’s complement for mid-tread).

• Uniform quantizers are characterized by:

– number of levels: 2^B (B bits)

– quantization step size: Δ

• If |x(n)| ≤ Xmax and x(n) has a symmetric density, then

Δ·2^B = 2Xmax

Δ = 2Xmax / 2^B

• If we let

x’(n) = x(n) + e(n)

• with x(n) the unquantized speech sample and e(n) the quantization error, then

−Δ/2 ≤ e(n) ≤ Δ/2
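A minimal sketch of a B-bit uniform mid-riser quantizer, assuming the input is clipped to ±Xmax. The function name and the index convention (integers from −2^(B−1) to 2^(B−1)−1) are choices made for this illustration.

```python
import math

def uniform_midriser(x, bits, x_max):
    """Return (code index, quantized value) for a B-bit mid-riser quantizer."""
    levels = 2 ** bits
    delta = 2.0 * x_max / levels                 # delta * 2^B = 2 * Xmax
    index = math.floor(x / delta)
    index = max(-(levels // 2), min(levels // 2 - 1, index))  # clip to the B-bit range
    return index, (index + 0.5) * delta          # reconstruction at the cell midpoint

bits, x_max = 3, 1.0
delta = 2.0 * x_max / 2 ** bits
grid = [-1.0 + 0.01 * i for i in range(201)]
results = [uniform_midriser(x, bits, x_max) for x in grid]
errors = [xq - x for x, (_, xq) in zip(grid, results)]
used_indices = sorted({idx for idx, _ in results})
```

With no input outside ±Xmax, every error stays within ±Δ/2 and exactly 2^B indices are used.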

Quantization Noise Model

• Quantization noise is a zero-mean, stationary white

noise process.

E[e(n)e(n+m)] = σe², m = 0

= 0, otherwise

• Quantization noise is uncorrelated with the input signal

E[x(n)e(n+m)] = 0, for all m

• Distribution of quantization errors is uniform over each quantization interval

pe(e) = 1/Δ, −Δ/2 ≤ e ≤ Δ/2

= 0, otherwise

so that ē = 0 and σe² = Δ²/12

SNR for Quantization

Review of Quantization

Assumptions

• Input signal fluctuates in a complicated manner so a

statistical model is valid.

• Quantization step size is small enough to remove

any signal correlated patterns in quantization error.

• Range of quantizer matches peak-to-peak range of

signal, utilizing full quantizer range with essentially

no clipping.

• For a uniform quantizer with a peak-to-peak range of

±4σx, the resulting SNR(dB) is 6B-7.2.
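The 6B − 7.2 figure follows from the uniform noise model; a short derivation, assuming a B-bit uniform quantizer with Δ = 2Xmax/2^B, noise variance Δ²/12 and no clipping:

```latex
\begin{align*}
\sigma_e^2 &= \frac{\Delta^2}{12} = \frac{X_{max}^2}{3\cdot 2^{2B}} \\
SNR &= \frac{\sigma_x^2}{\sigma_e^2} = \frac{3\cdot 2^{2B}}{(X_{max}/\sigma_x)^2} \\
SNR(\mathrm{dB}) &= 6.02\,B + 4.77 - 20\log_{10}\frac{X_{max}}{\sigma_x} \\
X_{max} = 4\sigma_x \;&\Rightarrow\; SNR(\mathrm{dB}) = 6.02\,B + 4.77 - 12.04 \approx 6B - 7.2
\end{align*}
```

Each extra bit adds about 6 dB, and each halving of the signal level relative to Xmax costs about 6 dB.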

Instantaneous Companding

• In order to get constant percentage error (rather than constant variance error), need logarithmically spaced quantization levels

– quantize logarithm of input signal rather than input signal itself

Insensitivity to Signal Level

y(n) = ln|x(n)|

x(n) = exp[y(n)]·sign[x(n)]

where sign[x(n)] = −1, x(n) ≤ 0

= +1, x(n) > 0

The quantized log magnitude is

y’(n) = Q[log|x(n)|]

= log|x(n)| + e(n), where e(n) is a new error signal

Assume that e(n) is independent of log|x(n)|. The inverse is

x’(n) = exp[y’(n)]·sign[x(n)]

= x(n)·exp[e(n)]

Assume e(n) to be small, then

exp[e(n)] = 1 + e(n) + …

x’(n) ≈ x(n)[1 + e(n)] = x(n) + e(n)x(n) = x(n) + f(n)

Since x(n) and e(n) are independent, then

σf² = σx²·σe²

SNR = σx²/σf² = 1/σe²

so the SNR does not depend on the signal level σx².
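The level insensitivity can be checked numerically. The sketch below assumes an idealized log quantizer with an unlimited number of levels spaced `step` apart in ln|x| (which is not realizable in practice) and Gaussian test data; both are assumptions of this illustration.

```python
import math
import random

def log_quantize(x, step):
    """Quantize ln|x| uniformly with the given step, then invert."""
    y_q = step * round(math.log(abs(x)) / step)
    return math.copysign(math.exp(y_q), x)

def snr_db(signal, step):
    noise = sum((log_quantize(v, step) - v) ** 2 for v in signal)
    power = sum(v * v for v in signal)
    return 10.0 * math.log10(power / noise)

random.seed(0)
base = [v for v in (random.gauss(0.0, 1.0) for _ in range(20000)) if v != 0.0]
step = 0.1

snr_loud = snr_db(base, step)
snr_quiet = snr_db([0.001 * v for v in base], step)   # same signal, 60 dB lower
snr_predicted = 10.0 * math.log10(12.0 / step ** 2)   # 1 / sigma_e^2 with sigma_e^2 = step^2 / 12
```

Both measured values should sit close to the predicted 1/σe², regardless of the 60 dB change in level.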

Pseudo-Logarithmic Compression

• Unfortunately true logarithmic compression is not practical, since the dynamic range (ratio between the largest and smallest values) is infinite => need an infinite number of quantization levels.

• Need an approximation to logarithmic compression => µ-law/A-law compression.

µ-law Compression

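• y(n) = F[x(n)]

$$y(n) = X_{max}\,\frac{\log\left[1 + \mu\,\frac{|x(n)|}{X_{max}}\right]}{\log(1+\mu)}\cdot \mathrm{sign}[x(n)]$$

• When x(n) = 0, y(n) = 0;

• When µ = 0, y(n) = x(n); no compression

• When µ is large and for large |x(n)|

$$|y(n)| \approx \frac{X_{max}}{\log \mu}\,\log\left[\frac{\mu\,|x(n)|}{X_{max}}\right]$$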
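A minimal sketch of the µ-law compressor and its inverse (the expander), assuming µ > 0. Function names and default values are illustrative.

```python
import math

def mu_law_compress(x, mu=255.0, x_max=1.0):
    """y = Xmax * log(1 + mu |x| / Xmax) / log(1 + mu) * sign(x)."""
    y = x_max * math.log1p(mu * abs(x) / x_max) / math.log1p(mu)
    return math.copysign(y, x)

def mu_law_expand(y, mu=255.0, x_max=1.0):
    """Inverse of mu_law_compress."""
    x = (x_max / mu) * math.expm1(abs(y) * math.log1p(mu) / x_max)
    return math.copysign(x, y)

small = mu_law_compress(0.01)     # a low-level sample is boosted before quantization
full_scale = mu_law_compress(1.0)
round_trip = [mu_law_expand(mu_law_compress(v)) for v in (-0.8, -0.01, 0.0, 0.003, 0.5)]
```

Small amplitudes are expanded before the uniform quantizer, so they see finer effective step sizes than large amplitudes.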

SNR for µ-law Quantizer

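$$SNR(\mathrm{dB}) = 6B + 4.77 - 20\log_{10}\left[\ln(1+\mu)\right] - 10\log_{10}\left[1 + \left(\frac{X_{max}}{\mu\,\sigma_x}\right)^2 + \sqrt{2}\,\frac{X_{max}}{\mu\,\sigma_x}\right]$$

• 6B dependence on B: good

• Much less dependence on Xmax/σx: good

• For large µ, SNR is less sensitive to changes in Xmax/σx: good

• µ-law used in wireline telephony for more than 30 years.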
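To see how weak the dependence on Xmax/σx is, the µ-law SNR expression can be evaluated next to the standard uniform-quantizer result SNR(dB) = 6B + 4.77 − 20 log10(Xmax/σx). The function names and the chosen ratios are illustrative.

```python
import math

def mu_law_snr_db(bits, mu, ratio):
    """SNR in dB of a mu-law quantizer; ratio = Xmax / sigma_x."""
    a = ratio / mu
    return (6.0 * bits + 4.77
            - 20.0 * math.log10(math.log(1.0 + mu))
            - 10.0 * math.log10(1.0 + a * a + math.sqrt(2.0) * a))

def uniform_snr_db(bits, ratio):
    """SNR in dB of a uniform quantizer spanning +/- Xmax."""
    return 6.0 * bits + 4.77 - 20.0 * math.log10(ratio)

# Signal level changes by a factor of 16 (about 24 dB) between the two cases.
mu_spread = mu_law_snr_db(8, 255.0, 4.0) - mu_law_snr_db(8, 255.0, 64.0)
uniform_spread = uniform_snr_db(8, 4.0) - uniform_snr_db(8, 64.0)
```

The uniform quantizer loses the full 24 dB, while the µ = 255 quantizer loses under 2 dB over the same range.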

Companding

µ-Law Companding

Quantization for Optimum SNR

• Goal is to match quantizer to actual signal density to

achieve optimum SNR

• µ-law tries to achieve constant SNR over wide range of

signal variances => some sacrifice over SNR

performance when quantizer step size is matched to

signal variance

• If σx² is known, you can choose quantizer levels to minimize quantization error variance and maximise SNR.

Quantizer Levels for Maximum

SNR

• Variance of quantization noise is:

• σe² = E[e²(n)] = E[(x’(n) − x(n))²]

• with x’(n) = Q[x(n)]. Assume M quantization levels

• [x’−M/2, x’−M/2+1, …, x’−1, x’1, …, x’M/2]

• associating quantization level with signal intervals as:

• x’j = quantization level for interval [xj−1, xj]

• For symmetric, zero-mean distributions with large amplitudes (∞) it makes sense to define the boundary points

• x0 = 0 (central boundary point), x±M/2 = ±∞

• The error variance is thus

$$\sigma_e^2 = \int_{-\infty}^{\infty} e^2\, p_e(e)\, de$$

Optimum Quantization Levels

Solution for Optimum Levels

• To solve for optimum values for {x’i} and {xi}, we differentiate σe² with respect to the parameters, set the derivatives to 0, and solve numerically:

$$\int_{x_{i-1}}^{x_i} (x'_i - x)\,p(x)\,dx = 0, \qquad i = 1, 2, \ldots, M/2$$

$$x_i = \frac{1}{2}\left(x'_i + x'_{i+1}\right), \qquad i = 1, 2, \ldots, M/2 - 1$$

• with boundary conditions of x0 = 0, x±M/2 = ±∞

Proof??

• Can also constrain quantizer to be uniform and solve for value of Δ that maximizes SNR

• Optimum boundary points lie halfway between the M/2 quantizer levels

• Optimum location of quantization level x’i is at the centroid of the probability density over the interval xi−1 to xi.

• Solve the above set of equations iteratively to obtain {x’i}, {xi}
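The iterative solution can be sketched with a sample-based Lloyd iteration: boundaries are set halfway between levels, then each level moves to the centroid of its cell. Using random Laplacian samples in place of the density p(x), and the particular starting levels, are assumptions of this illustration.

```python
import math
import random

def laplacian_samples(count, seed=0):
    """Unit-variance Laplacian samples via the inverse CDF."""
    rng = random.Random(seed)
    b = 1.0 / math.sqrt(2.0)
    out = []
    for _ in range(count):
        u = rng.random() - 0.5
        magnitude = -b * math.log(max(1e-300, 1.0 - 2.0 * abs(u)))
        out.append(math.copysign(magnitude, u))
    return out

def mse(samples, levels):
    return sum(min((x - q) ** 2 for q in levels) for x in samples) / len(samples)

def lloyd_max(samples, levels, iterations=30):
    levels = sorted(levels)
    for _ in range(iterations):
        # Boundary condition: halfway between adjacent levels.
        bounds = [(a + b) / 2.0 for a, b in zip(levels, levels[1:])]
        cells = [[] for _ in levels]
        for x in samples:
            cells[sum(x > b for b in bounds)].append(x)
        # Centroid condition: each level is the mean of its cell (kept if the cell is empty).
        levels = [sum(c) / len(c) if c else q for c, q in zip(cells, levels)]
    return levels

samples = laplacian_samples(5000)
start_levels = [-1.5, -0.5, 0.5, 1.5]            # uniform starting point, M = 4
final_levels = lloyd_max(samples, start_levels)
mse_start = mse(samples, start_levels)
mse_final = mse(samples, final_levels)
```

Each iteration can only lower (or keep) the mean squared error on the sample set, so the final non-uniform levels are at least as good as the uniform starting point.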

Uniform/ Non-uniform Quantizers

Adaptive Quantization

• Linear quantization => SNR depends on σx being constant (this is clearly not the case)

• Instantaneous companding => SNR only weakly dependent on Xmax/σx for large µ-law compression (µ = 100-500)

• Optimum SNR => minimize σe² when σx² is known, non-uniform distribution of quantization levels

• Quantization dilemma:

– want to choose quantization step size large enough to accommodate maximum peak-to-peak range of x(n);

– at the same time need to make the quantization step size small so as to minimize the quantization error.

• The non-stationary nature of speech (variability across sounds, speakers, backgrounds) compounds this problem greatly.

Types of Adaptive Quantization

• Instantaneous: amplitude changes reflect sample-to-sample variation of x(n), implying rapid adaptation.

• Syllabic: amplitude changes reflect syllable-to-syllable variations in x(n) => slow adaptation

• Feed-forward: adaptive quantizers that estimate σx² from x(n) itself.

• Feedback: adaptive quantizers that adapt the step size, Δ, on the basis of the quantized signal, x’(n) (or equivalently the codewords, c(n)).

Feed Forward Adaptation

• Variable step size

• assume uniform quantizer

with step size Δ(n)

• x(n) is quantized using Δ(n)

=>c(n) and Δ(n) need to be

transmitted to the decoder

• if c’(n)=c(n) and Δ’(n)= Δ(n)

=> no error in the channel,

and

• x’’(n) = x’(n)

Don’t have x(n) at the decoder to estimate Δ(n) => need to

transmit Δ(n); This is the major drawback of the feed

forward adaptation.

Feed-Forward Quantizer

• time varying gain, G(n) =>

c(n) and G(n) need to be

transmitted to the decoder.

Can’t estimate G(n) at

the decoder => it has to

be transmitted.

Feed Forward Quantizers

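• Feed forward systems make estimates of σx², then make Δ or the quantization levels proportional to σx; the gain is inversely proportional to σx.

• Assume σx² is proportional to the short-time energy

$$\sigma^2(n) = \sum_{m=-\infty}^{\infty} x^2(m)\,h(n-m)$$

• where h(n) is a low pass filter, so that

$$E[\sigma^2(n)] \propto \sigma_x^2$$

• Consider h(n) = α^(n−1), n ≥ 1

• = 0, otherwise

$$\sigma^2(n) = \sum_{m=-\infty}^{n-1} x^2(m)\,\alpha^{n-m-1}, \qquad 0 < \alpha < 1$$

$$\sigma^2(n) = \alpha\,\sigma^2(n-1) + x^2(n-1) \qquad \text{(recursion; proof??)}$$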
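The recursion can be checked against the direct exponentially weighted sum. This sketch assumes x(m) = 0 for m < 0, so that the infinite sum becomes finite; the function names are illustrative.

```python
import math

def variance_direct(x, n, alpha):
    """sigma^2(n) = sum over m <= n-1 of x(m)^2 * alpha^(n-m-1)."""
    return sum(x[m] ** 2 * alpha ** (n - m - 1) for m in range(n))

def variance_recursive(x, alpha):
    """sigma^2(n) = alpha * sigma^2(n-1) + x(n-1)^2, starting from sigma^2(0) = 0."""
    sigma2 = [0.0]
    for n in range(1, len(x) + 1):
        sigma2.append(alpha * sigma2[-1] + x[n - 1] ** 2)
    return sigma2

x = [math.sin(0.1 * n) * (1.0 + 0.5 * math.cos(0.01 * n)) for n in range(300)]
alpha = 0.9
recursive = variance_recursive(x, alpha)
direct = [variance_direct(x, n, alpha) for n in range(len(x) + 1)]
```

The recursion needs one multiply and one add per sample, whereas the direct sum grows with n.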

Feed Forward Quantizers

• Δ(n) and G(n) vary slowly compared to x(n)

• They must be sampled and transmitted as part of the waveform

coder parameters

• Rate of sampling depends on the bandwidth of the lowpass filter, h(n): for α = 0.99, the rate is about 13 Hz; for α = 0.9, the rate is about 135 Hz

• It is reasonable to place limits on the variation of Δ(n) or G(n), of the form

• Gmin ≤ G(n) ≤ Gmax

• Δmin ≤ Δ(n) ≤ Δmax

• For obtaining σy² ≈ constant over a 40 dB range in signal levels

Gmax/Gmin = Δmax/Δmin = 100 (40 dB range)

Feed Forward Adaptation Gain

• Δ(n) or G(n) is evaluated after every M samples

• Use 128 to 1024 samples for estimation

• Adaptive quantizer achieves up to 5.6 dB better SNR than non-adaptive quantizers

• Can achieve this SNR with low "idle channel noise" and wide speech dynamic range by suitable choice of Δmin and Δmax

Feedback Adaptation

• σ²(n) estimated from quantizer output (or the code words).

• Advantage of feedback adaptation is that neither Δ(n) nor G(n) needs to be

transmitted to the decoder since they can be derived from the code words.

• Disadvantage of feedback adaptation is increased sensitivity to errors in

codewords, since such errors affect Δ(n) and G(n).

Feedback Adaptation

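• σ²(n) is based only on past values of x’(n)

$$\sigma^2(n) = \sum_{m=-\infty}^{\infty} x'^2(m)\,h(n-m)$$

• Two typical windows/filters are:

$$h(n) = \begin{cases} \alpha^{n-1}, & n \ge 1 \\ 0, & \text{otherwise} \end{cases}$$

$$h(n) = \begin{cases} 1/M, & 1 \le n \le M \\ 0, & \text{otherwise} \end{cases} \qquad\Rightarrow\qquad \sigma^2(n) = \frac{1}{M}\sum_{m=n-M}^{n-1} x'^2(m)$$

• Can use very short window lengths (e.g. M=2) to achieve 12 dB SNR for a B=3 bit quantizer.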
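A sketch of feedback adaptation with the rectangular window: the step size is derived only from the last M quantized outputs, so the decoder can rebuild it from the codewords alone. The proportionality constant `k`, the limits `d_min`/`d_max` and all function names are hypothetical values chosen for this illustration.

```python
import math

def step_from_history(history, M, k, d_min, d_max):
    """Step size proportional to the RMS of the last M quantized outputs, with limits."""
    recent = history[-M:]
    sigma = math.sqrt(sum(v * v for v in recent) / M) if recent else 0.0
    return min(d_max, max(d_min, k * sigma))

def quantize(x, delta, bits):
    half = 2 ** (bits - 1)
    return max(-half, min(half - 1, math.floor(x / delta)))

def dequantize(code, delta):
    return (code + 0.5) * delta

def encode(x, bits=3, M=2, k=0.6, d_min=0.01, d_max=1.0):
    codes, history = [], []
    for sample in x:
        delta = step_from_history(history, M, k, d_min, d_max)
        code = quantize(sample, delta, bits)
        codes.append(code)
        history.append(dequantize(code, delta))   # x'(n), known to both ends
    return codes, history

def decode(codes, M=2, k=0.6, d_min=0.01, d_max=1.0):
    history = []
    for code in codes:
        delta = step_from_history(history, M, k, d_min, d_max)
        history.append(dequantize(code, delta))
    return history

x = [0.5 * math.sin(0.2 * n) for n in range(100)]
codes, encoder_output = encode(x)
decoder_output = decode(codes)
```

No step size is transmitted; the price is that a single corrupted codeword would also corrupt the decoder's later step sizes.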

Alternative Approach to Adaptation

Input-output characteristic of a 3-bit adaptive quantizer

Optimal Step Size Multipliers

Nonuniform quantizer

[Block diagram: discrete samples → Compressor → Uniform Quantizer → digital signals → Channel → received digital signals → Decoder → Expander → output]

“Compressing-and-expanding” is called “companding.”

Compression Techniques

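• µ-law compressor (very popular internationally)

$$|w_2(t)| = \frac{\ln\left(1 + \mu\,|w_1(t)|\right)}{\ln(1+\mu)}, \qquad 0 \le |w_1(t)| \le 1$$

In the U.S., µ = 255 is used.

• A-law compressor

$$|w_2(t)| = \begin{cases} \dfrac{A\,|w_1(t)|}{1+\ln A}, & 0 \le |w_1(t)| \le \dfrac{1}{A} \\[2ex] \dfrac{1+\ln\left(A\,|w_1(t)|\right)}{1+\ln A}, & \dfrac{1}{A} \le |w_1(t)| \le 1 \end{cases}$$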

Practical Implementation of µ-law compressor

Waveform Coders

Pulse Code Modulation (PCM)

• Needs the sampling frequency, fs, to be greater than the Nyquist rate (twice the maximum frequency in the signal)

• For n bits per sample, the dynamic range is ±2^(n−1) and the quantisation noise power equals Δ²/12 (Δ = step size)

• Total bit rate = n·fs

• Can use non-uniform quantisation / variable length codes

– Logarithmic quantization (A-law, µ-law)

– Adaptive Quantization

G.711

• Pulse Code Modulation (PCM) codecs are the simplest form of waveform codecs.

• Narrowband speech is typically sampled 8000 times per second, and then each speech sample must be quantized.

• If linear quantization is used then about 12 bits per sample are needed, giving a bit rate of about 96 kbits/s.

• However this can be easily reduced by using non-linear quantization.

• For coding speech it was found that with non-linear quantization 8 bits per sample was sufficient for speech quality which is almost indistinguishable from the original.

• This gives a bit rate of 64 kbits/s, and two such non-linear PCM codecs were standardised in the 1960s

Waveform Coders

Differential Pulse Code Modulation (DPCM)

• Predict the next sample based on the last few decoded

samples

• Minimise mean squared error of prediction residual

– use LP coding

• Good prediction results in a reduction in the dynamic range

needed to code the prediction residual and hence a

reduction in the bit rate.

• Can use non-uniform quantisation or variable length codes

Differential PCM (DPCM)

• Fixed predictors can give from 4-11dB SNR

improvement over direct quantization (PCM)

• Most of the gain occurs with first order predictor

• Prediction up to 4th or 5th order helps

Another Implementation of DPCM

Quantization error is not accumulated.

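For slowly varying signals, a future sample can be predicted from past samples.

A transversal filter can perform the prediction process.

[Block diagram, transmitter side: the predictor output is subtracted from s(t) to form the prediction error e(t).]

[Block diagram, receiver side: the predictor output is added to the received e(t) to reconstruct s(t).]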
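A minimal DPCM sketch in which the predictor runs on decoded samples at both ends, so quantization error does not accumulate. The first-order predictor coefficient `a`, the step size and the unbounded integer codes (no clipping) are simplifying assumptions of this illustration.

```python
import math

def dpcm_encode(x, a=0.9, step=0.05):
    codes, recon = [], 0.0
    for sample in x:
        prediction = a * recon                     # predicted from the last decoded sample
        code = round((sample - prediction) / step) # quantized prediction residual
        codes.append(code)
        recon = prediction + code * step           # what the decoder will reconstruct
    return codes

def dpcm_decode(codes, a=0.9, step=0.05):
    out, recon = [], 0.0
    for code in codes:
        recon = a * recon + code * step
        out.append(recon)
    return out

step = 0.05
x = [math.sin(0.05 * n) for n in range(200)]
codes = dpcm_encode(x, step=step)
decoded = dpcm_decode(codes, step=step)
max_error = max(abs(u - v) for u, v in zip(x, decoded))
largest_dpcm_code = max(abs(c) for c in codes)
largest_pcm_code = max(abs(round(v / step)) for v in x)
```

The reconstruction error stays within half a step for every sample, and the residual codes span a much smaller range than direct PCM codes with the same step, which is where the bit-rate saving comes from.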

DPCM with Adaptive Quantization

• Quantizer step size proportional to variance at quantizer input

• Can use d(n) or x(n) to control step size

• Get 5 dB improvement in SNR over µ-law non-adaptive PCM

• Get 6 dB improvement in SNR using differential configuration with fixed prediction => ADPCM is about 10-11 dB better in SNR than a fixed quantizer.

Feedback ADPCM

• Can achieve same improvement in SNR as feed forward system

DPCM with Adaptive Prediction

• Need adaptive prediction to

handle non-stationarity of

speech.

• ADPCM encoders with

pole-zero decoder filters

have proved to be

particularly versatile in

speech applications.

• The ADPCM 32 kbits/s

algorithm adopted for the

G.721 CCITT standard

(1984) uses a pole-zero

adaptive predictor.

DPCM with Adaptive Prediction

• Prediction coefficients assumed to be time-dependent, of the form

$$\tilde{x}(n) = \sum_{k=1}^{p} \alpha_k(n)\,x'(n-k)$$

• Assume speech properties remain fixed over short time intervals.

• Choose αk(n) to minimize the average squared prediction error over short intervals.

• The optimum predictor coefficients satisfy the relationships

$$R_n(j) = \sum_{k=1}^{p} \alpha_k(n)\,R_n(j-k), \qquad j = 1, 2, \ldots, p$$

• where Rn(j) is the short-time autocorrelation function of the form

$$R_n(j) = \sum_{m=-\infty}^{\infty} x(m)\,w(n-m)\,x(j+m)\,w(n-m-j), \qquad 0 \le j \le p$$

• w(n−m) is a window positioned at sample n of the input.

• Update the αk’s every 10-20 msec.
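One way to solve these relationships for a single frame is the Levinson-Durbin recursion, which exploits the symmetry Rn(−j) = Rn(j). The sketch below assumes a Hamming window, a 160-sample frame (20 ms at 8 kHz) and a synthetic test signal; none of these come from the slides.

```python
import math
import random

def short_time_autocorrelation(frame, p):
    """R(j) = sum_m xw(m) * xw(m + j), j = 0..p, for a Hamming-windowed frame."""
    N = len(frame)
    xw = [v * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (N - 1)))
          for i, v in enumerate(frame)]
    return [sum(xw[m] * xw[m + j] for m in range(N - j)) for j in range(p + 1)]

def levinson_durbin(R, p):
    """Solve R(j) = sum_k alpha_k R(|j - k|), j = 1..p; return (alphas, residual energy)."""
    alpha = [0.0] * (p + 1)
    err = R[0]
    for i in range(1, p + 1):
        k_i = (R[i] - sum(alpha[k] * R[i - k] for k in range(1, i))) / err
        updated = alpha[:]
        updated[i] = k_i
        for k in range(1, i):
            updated[k] = alpha[k] - k_i * alpha[i - k]
        alpha = updated
        err *= 1.0 - k_i * k_i
    return alpha[1:], err

random.seed(1)
frame = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) + 0.05 * random.gauss(0.0, 1.0)
         for n in range(160)]
p = 4
R = short_time_autocorrelation(frame, p)
alpha, residual_energy = levinson_durbin(R, p)
prediction_gain_db = 10.0 * math.log10(R[0] / residual_energy)
```

In an adaptive predictor this computation would be repeated for each new frame.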

Prediction Gain for DPCM with

Adaptive Prediction

• Fixed prediction

10.5dB prediction gain

for large p.

• Adaptive prediction

14dB gain for large p.

• Adaptive prediction

more robust to

speaker and speech

material.

• Prediction gain in dB:

$$10\log_{10} G_p = 10\log_{10}\left[\frac{E[x^2(n)]}{E[d^2(n)]}\right]$$

Comparison of Coders

• 6 dB between curves

• Sharp increase in

SNR with both fixed

prediction and

adaptive quantization

• Almost no gain for

adapting first order

predictor

ADPCM G.721 Encoder

ADPCM G.721 Encoder

• The algorithm consists of an adaptive quantizer and an adaptive pole-zero predictor.

• The pole-zero predictor (2 poles, 6 zeros) estimates the input signal and hence it reduces the error variance.

• The quantizer encodes the error sequence into a sequence of 4-bit words. The prediction coefficients are estimated using a gradient algorithm and the stability of the decoder is checked by testing the two roots of A(z).

• The performance of the coder, in terms of the MOS scale, is above 4 but it degrades as the number of asynchronous tandem codings increases. The G.721 ADPCM algorithm was also modified to accommodate 24 and 40 kbits/s in the G.723 standard.

• The performance of ADPCM degrades quickly for rates below 24 kbits/s.

Delta Modulation

• Simplest form of differential quantization is in delta

modulation (DM).

• Sampling rate chosen to be many times the Nyquist

rate for the input signal => adjacent samples are

highly correlated.

• This leads to a high ability to predict x(n) from past

samples, with the variance of the prediction error

being very low, leading to a high prediction gain =>

can use simple 1-bit (2-level) quantizer =>the bit rate

for DM systems is just the (high) sampling rate of the

signal.

Linear Delta Modulation (LDM)

Linear Delta Modulation

2-level quantizer with fixed step size Δ, with quantizer form

d’(n) = Δ if d(n) > 0 (c(n) = 1)

= −Δ if d(n) < 0 (c(n) = 0)

Illustration of DM

• Basic equations of DM are x’(n) = α x’(n−1) + d’(n)

• When α ≈ 1, this is digital integration (accumulation of increments of ±Δ)

• d(n) = x(n) − x’(n−1) = x(n) − x(n−1) − e(n−1)

• d(n) is the first backward difference of x(n), or an approximation to the derivative of the input.

• How big do we make Δ? At the maximum slope of xa(t) we need

• Δ/T ≥ max |dxa(t)/dt|

• or else the reconstructed signal will lag the actual signal. This 'slope overload' condition results in quantization error called 'slope overload distortion'.

• Since x’(n) can only increase by fixed increments of Δ, fixed DM is called linear DM or LDM

DM Waveform

DM Granular Noise

• When xa(t) has small slope, Δ determines the peak error. When xa(t) = 0, the quantizer output will be an alternating sequence of 0's and 1's, and x’(n) will alternate around zero with peak variation of Δ. This condition is called “granular noise”.

• Need large step size to handle wide dynamic range

• Need small step size to accurately represent low level signals.

• With LDM we need to worry about dynamic range and amplitude of the difference signal => choose Δ to minimize mean-squared quantization error (a compromise between slope overload and granular noise).
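Both failure modes can be reproduced with a few lines. This sketch assumes α = 1, treats d(n) = 0 as a negative step, and uses made-up test inputs (an idle channel and a ramp steeper than Δ per sample).

```python
def ldm(x, delta):
    """Linear delta modulation: returns (codewords, reconstructed signal)."""
    codes, recon, prev = [], [], 0.0
    for sample in x:
        c = 1 if sample - prev > 0 else 0     # 1-bit quantizer on d(n) = x(n) - x'(n-1)
        prev += delta if c else -delta        # x'(n) = x'(n-1) +/- delta
        codes.append(c)
        recon.append(prev)
    return codes, recon

delta = 0.1

# Idle channel: the output hunts around zero (granular noise).
idle_codes, idle_recon = ldm([0.0] * 8, delta)

# Ramp rising 0.5 per sample, five times faster than delta (slope overload).
ramp = [0.5 * n for n in range(1, 21)]
ramp_codes, ramp_recon = ldm(ramp, delta)
final_lag = ramp[-1] - ramp_recon[-1]
```

A larger Δ would shrink the lag on the ramp but enlarge the idle-channel ripple, which is the compromise described above.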

Performance of DM Systems

Normalized step size defined as

$$\frac{\Delta}{\left(E\left[(x(n) - x(n-1))^2\right]\right)^{1/2}}$$

Oversampling index defined as

F0 = Fs/(2FN)

where Fs is the sampling rate of the DM and FN is the Nyquist frequency of the signal.

The total bit rate of the DM is

BR = Fs = 2FN·F0

• For a given value of F0, there is an optimum value of Δ.

• Optimum SNR increases by 9 dB for each doubling of F0 => this is better than the 6 dB obtained by increasing the number of bits/sample by 1 bit.

• Curves are very sharp around the optimum value of Δ => SNR is very sensitive to input level.

• For SNR = 35 dB and FN = 3 kHz => 200 kbps rate.

• For toll quality a much higher rate is required.

Adaptive Delta Mod

• Step size adaptation for DM (from codewords)

– Δ(n) = M·Δ(n−1)

– Δmin ≤ Δ(n) ≤ Δmax

• M is a function of c(n) and c(n−1), since c(n) depends only on the sign of d(n)

– d(n) = x(n) − x’(n−1)

• The sign of d(n) can be determined before the actual quantized value d’(n), which needs the new value of Δ(n) for evaluation

• The algorithm for choosing the step size multiplier is

– M = P > 1 if c(n) = c(n−1)

– M = Q < 1 if c(n) ≠ c(n−1)

Adaptive DM Performance

• Slope overload in LDM causes runs of 0’s or 1’s

• Granularity causes runs of alternating 0’s and 1’s

• The figure above shows how adaptive DM performs with P = 2, Q = 1/2, α = 1.

• During slope overload, step size increases exponentially to follow

increase in waveform slope.

• During granularity, step size decreases exponentially to Δmin and

stays there as long as the slope is small.
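The step-size behaviour described here can be sketched as follows, with P = 2, Q = 1/2 and α = 1. The limits `d_min` and `d_max`, the starting step and the test inputs are assumptions of this illustration.

```python
def adm(x, d_min=0.1, d_max=3.2, P=2.0, Q=0.5):
    """Adaptive delta modulation: returns (codewords, step sizes, reconstruction)."""
    codes, steps, recon = [], [], []
    prev, step, last_code = 0.0, d_min, None
    for sample in x:
        c = 1 if sample - prev > 0 else 0          # sign of d(n) is known before the step
        if last_code is not None:
            multiplier = P if c == last_code else Q
            step = min(d_max, max(d_min, multiplier * step))
        prev += step if c else -step
        codes.append(c)
        steps.append(step)
        recon.append(prev)
        last_code = c
    return codes, steps, recon

# Sudden jump: repeated codewords make the step grow exponentially up to d_max.
jump_codes, jump_steps, jump_recon = adm([10.0] * 12)

# Idle channel: alternating codewords hold the step at d_min.
idle_codes, idle_steps, idle_recon = adm([0.0] * 12)
```

The decoder can run the same multiplier logic on the received codewords, so no step size has to be transmitted.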

ADM Parameter Behavior

• ADM parameters are P, Q, Δmin

and Δmax

• Choose Δmax/ Δmin to maintain

high SNR over range of input

signal levels.

• Δmin should be chosen to

minimize idle channel noise.

• PQ should satisfy PQ ≤ 1 for stability.

Comparison of LDM, ADM and log PCM

• ADM is 8 dB better SNR at 20

kbps than LDM, and 14 dB

better SNR at 60 kbps than

LDM.

• ADM gives a 10 dB increase in

SNR for each doubling of the bit

rate; LDM gives about 6 dB.

• For bit rate below 40 kbps, ADM

has higher SNR than µ-law

PCM; for higher bit rates log

PCM has higher SNR.

Higher Order Prediction in DM

Waveform Coding versus Block

Processing

• Waveform coding

– sample-by-sample matching of waveforms

– coding quality measured using SNR

• Source modeling (block processing)

– block processing of signal => vector of outputs every block

– overlapped blocks

Adaptive Predictive Coder

Transmitter

Receiver

Adaptive Predictive Coder

• The use of adaptive long-term prediction in addition to short-term prediction provides additional coding gain (at the expense of higher complexity) and high quality speech at 16 kbits/s. The long-term (long delay) predictor

$$A_L(z) = \sum_{i=-j}^{j} a_i\, z^{-i-\tau}$$

• provides for the pitch (fine) structure of the short-time voiced spectrum. The index τ is the pitch period in samples and j is a small integer. The long-term predictor (ideally) removes the periodicity and thereby redundancy.

• At the receiver the long-term synthesis filter introduces periodicity while the synthesis filter associated with the short-term prediction polynomial represents the vocal tract.

• The parameters of the short-term predictors are computed for every frame (typically 10 to 30 ms). The long-term prediction parameters are computed more often.

Model-Based Speech Coding

• Waveform coding, based on optimizing and maximizing SNR, has been pushed about as far as possible.

– achieved bit rate reductions on the order of 4:1 (i.e., from 128 kbps PCM to 32 kbps ADPCM) while at the same time achieving toll quality SNR for telephone-bandwidth speech

• To lower bit rate further without reducing speech quality, we need to exploit features of the speech production model, including:

– source modeling

– spectrum modeling

– use of codebook methods for coding efficiency

• We also need a new way of comparing performance of different waveform and model-based coding methods

– an objective measure, like SNR, isn’t an appropriate measure for model-based coders since they operate on blocks of speech and don’t follow the waveform on a sample-by-sample basis.

– new subjective measures need to be used that measure user perceived quality, intelligibility, and robustness to multiple factors.

Frequency Domain Speech Coding

• All frequency domain methods for speech coding exploit the Short-Time Fourier Transform using a filter bank view with scalar quantization

• Sub-band Coding-use small number of filters with

wide and overlapping bandwidths

• 2-band sub-band coder

– advantage of sub-band coder is that the quantization noise

is limited to the sub-band that generated it => better

perceptual control of noise spectrum

– with careful design of filters, can get complete cancellation

of quantization noise that leaks across bands => use QMF-

Quadrature Mirror Filters

– can continue to split lower bands into 2 bands, giving

octave band filter bank => auditory front-end like analysis.

Sub-band coders

• Exploit the frequency sensitivity of the auditory

system.

• Split the signal into sub-band using band pass

filters.

• Code each sub-band at an appropriate resolution

– e.g. 4 bits per sample in the lower sub-bands and

– 2 bits per sample in the upper sub-bands

• Can also exploit auditory masking

– use fewer bits if a neighbouring sub-band is much louder

• Basis for the MPEG audio standard (5:1

compression of CD quality audio with no perceptual

degradation)

Sub-band coder

The 16 kbits/s SBC compared favorably against 16 kbits/s ADPCM, and the 9.6 kbits/s SBC compared favorably against 10.3 and 12.9 kbits/s ADM.

The low-band filters in speech-specific implementations are usually associated with narrower widths so that they can resolve more accurately the low-frequency narrowband formants. In the absence of quantization noise, perfect reconstruction can be achieved using Quadrature-Mirror Filter (QMF) banks.
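Perfect reconstruction without quantization can be illustrated with the simplest two-band QMF pair, the 2-tap Haar filters. Practical sub-band coders use longer filters with sharper band separation; the Haar choice here is only an assumption made to keep the sketch short.

```python
import math

S = math.sqrt(0.5)

def analysis(x):
    """Split an even-length signal into low and high bands, each decimated by 2."""
    low = [S * (x[2 * i] + x[2 * i + 1]) for i in range(len(x) // 2)]
    high = [S * (x[2 * i] - x[2 * i + 1]) for i in range(len(x) // 2)]
    return low, high

def synthesis(low, high):
    """Recombine the two bands into the full-rate signal."""
    out = []
    for l, h in zip(low, high):
        out.append(S * (l + h))
        out.append(S * (l - h))
    return out

x = [math.sin(0.2 * n) + 0.3 * math.cos(2.5 * n) for n in range(64)]
low, high = analysis(x)
rebuilt = synthesis(low, high)
max_error = max(abs(a - b) for a, b in zip(x, rebuilt))
```

Each band runs at half the input rate, so the total sample count is unchanged; a coder would quantize `low` and `high` with different numbers of bits before synthesis.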

AT&T Sub-band coder

AT&T Sub-band coder

• The AT&T SBC was used for voice storage at 16 or 24

kbits/s and consists of a five-band non-uniform tree-

structured QMF bank in conjunction with APCM coders.

• A silence compression algorithm is also part of the

standard. The frequency ranges for each band are: 0-0.5

kHz, 0.5-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. For the 16

kbits/s implementation the bit allocations are {4/4/2/2/0}

and for the 24 kbits/s the bit assignments are {5/5/4/3/0}.

The one-way delay of this coder is less than 18 ms. It

must be noted that although this coder was the

workhorse for the older AT&T voice store and forward

machines, the most recent AT&T machines use the new

16 kbits/s Low Delay CELP algorithm.
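The quoted bit allocations can be checked against the total rates, assuming each band is critically sampled at twice its bandwidth (an assumption of this sketch, not stated above).

```python
BANDS_HZ = [(0, 500), (500, 1000), (1000, 2000), (2000, 3000), (3000, 4000)]

def total_rate_bps(bits_per_band):
    """Sum over bands of (2 * bandwidth) samples/s times bits per sample."""
    return sum(2 * (hi - lo) * bits for (lo, hi), bits in zip(BANDS_HZ, bits_per_band))

rate_16k = total_rate_bps([4, 4, 2, 2, 0])
rate_24k = total_rate_bps([5, 5, 4, 3, 0])
```

Both allocations spend no bits on the 3-4 kHz band and give the narrow low bands the finest resolution.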

G.722 Sub-band coder

CCITT standard (G.722)

• The CCITT standard (G.722) for 7kHz audio at 64 kbits/s for

ISDN teleconferencing is based on a two-band sub-

band/ADPCM coder.

• The low-frequency sub-band is quantized at 48 kbits/s while the

high-frequency sub-band is coded at 16 kbits/s.

• The G.722 coder includes an adaptive bit allocation scheme

and an auxiliary data channel.

• Provisions for lower rates have been made by quantizing the

low-frequency sub-band at 40 kbits/s or at 32 kbits/s.

• The MOS at 64 kbits/s is greater than four for speech and

slightly less than four for music signals, and the analysis-

synthesis QMF banks introduce a delay of less than 3 ms.

Introduction to VQ

• Vector quantization (VQ) is a lossy data

compression method

• based on the principle of block coding.

• It is a fixed-to-fixed length algorithm.

• In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ

design algorithm based on a training sequence.

• The use of a training sequence bypasses the need for multi-dimensional integration. This algorithm is referred to as LBG-VQ.

• Toll quality speech coder (digital wireline phone)

– G.711 (A-LAW and μ-LAW at 64 kbits/sec)

– G.721 (ADPCM at 32 kbits/ sec)

– G.723 (ADPCM at 40, 24 kbps)

– G.726 (ADPCM at 16,24,32,40 kbps)

• Low bit rate speech coder (cellular phone/IP phone)

– G.728 low delay (16 kbps, delay <2ms, same or better quality

than G.721)

– G. 723.1 (CELP Based, 5.3 and 6.4 kbits/sec)

– G.729 (CELP based, 8 kbps)

– GSM 06.10 (13 and 6.5 kbits/sec, simple to implement, used in

GSM phones)

1-D VQ

• A VQ is nothing more than an approximator.

• The idea is similar to that of “rounding-off”. An example of a 1-dimensional VQ is shown below

• Here, every number less than -2 is approximated by -3. Every number between -2 and 0 is approximated by -1. Every number between 0 and 2 is approximated by +1. Every number greater than 2 is approximated by +3. Note that the approximate values are uniquely represented by 2 bits. This is a 1-dimensional, 2-bit VQ. It has a rate of 2 bits/dimension.
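This 1-dimensional, 2-bit VQ can be written as a nearest-codevector search; the function and variable names are illustrative.

```python
CODEBOOK = [-3.0, -1.0, 1.0, 3.0]   # four codevectors => 2-bit index

def vq_encode(x):
    """Index of the nearest codevector (boundaries fall at -2, 0 and 2)."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(x - CODEBOOK[i]))

def vq_decode(index):
    return CODEBOOK[index]

approximations = [vq_decode(vq_encode(x)) for x in (-2.5, -1.2, 0.3, 7.0)]
```

Only the 2-bit index is transmitted; the decoder looks the value up in the same codebook.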

2-D VQ

• An example of a 2-

dimensional VQ is shown

below:

• Here, every pair of numbers

falling in a particular region

are approximated by a red star

associated with that region.

Note that there are 16 regions

and 16 red stars -- each of

which can be uniquely

represented by 4 bits. Thus,

this is a 2-dimensional, 4-bit

VQ. Its rate is also 2

bits/dimension.

• In the above two examples, the red stars are called

codevectors and the regions defined by the blue

borders are called encoding regions. The set of all

codevectors is called the codebook and the set of all

encoding regions is called the partition of the space.

Design Problem

• Given a

– vector source with its statistical properties known

– a distortion measure

– the number of codevectors

• To find

– codebook (the set of all red stars)

– partition (the set of blue lines) which result in the

smallest average distortion.

Design Problem Cont.

• Assume that there is a training sequenceconsisting of M source vectors

• This training sequence can be obtained from some large database– For example, if the source is a speech signal, then

the training sequence can be obtained by recording several long telephone conversations.

– M is assumed to be sufficiently large so that all the statistical properties of the source are captured by the training sequence


• Assume that the source vectors are k-dimensional.

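• Let N be the number of codevectors and let $C = \{c_1, c_2, \ldots, c_N\}$ represent the codebook.

• Each codevector is k-dimensional, e.g., $c_n = (c_{n,1}, c_{n,2}, \ldots, c_{n,k})$, n = 1, 2, …, N.

• Let Sn be the encoding region associated with codevector cn and let $P = \{S_1, S_2, \ldots, S_N\}$ denote the partition of the space.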


• If the source vector xm is in the encoding region Sn, then its approximation (denoted by Q(xm)) is cn: $Q(x_m) = c_n$ if $x_m \in S_n$.

• Assuming a squared-error distortion measure, the average distortion is given by: $D_{ave} = \frac{1}{Mk} \sum_{m=1}^{M} \| x_m - Q(x_m) \|^2$

• The design problem can be succinctly stated as follows: Given T and N, find C and P such that Dave is minimized.
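The quantity being minimized can be computed directly from a training sequence and a codebook. The sketch below assumes the per-dimension normalization 1/(Mk) for the average squared-error distortion and a nearest-codevector rule for Q; the function names are illustrative.

```python
def squared_error(x, c):
    """Squared-error distortion between a source vector and a codevector."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def quantize(x, codebook):
    """Q(x): the codevector closest to x under squared error."""
    return min(codebook, key=lambda c: squared_error(x, c))


def average_distortion(training, codebook):
    """D_ave = (1 / (M k)) * sum over m of ||x_m - Q(x_m)||^2 (assumed normalization)."""
    M = len(training)      # number of training vectors
    k = len(training[0])   # vector dimension
    total = sum(squared_error(x, quantize(x, codebook)) for x in training)
    return total / (M * k)
```

For example, with training vectors (0, 0), (2, 0), (10, 10) and codevectors (1, 0), (10, 10), the squared errors are 1, 1 and 0, so the average distortion is 2 / (3 x 2) = 1/3.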

Optimality Criteria

• If C and P are a solution to the above minimization problem, then they must satisfy the following two criteria.

• Nearest Neighbor Condition: $S_n = \{ x : \| x - c_n \|^2 \le \| x - c_{n'} \|^2 \ \ \forall n' = 1, 2, \ldots, N \}$

– This condition says that the encoding region Sn should consist of all vectors that are closer to cn than to any of the other codevectors. For those vectors lying on the boundary (blue lines), any tie-breaking procedure will do.

• Centroid Condition: $c_n = \frac{\sum_{x_m \in S_n} x_m}{\sum_{x_m \in S_n} 1}$, n = 1, 2, …, N

– This condition says that the codevector cn should be the average of all those training vectors that are in encoding region Sn. In implementation, one should ensure that at least one training vector belongs to each encoding region (so that the denominator in the above equation is never 0).
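The two conditions translate into two small routines, sketched below under stated assumptions: ties are broken in favour of the lowest codevector index, and an empty region keeps its old codevector rather than dividing by zero. Both choices are assumptions made for the example; the slides only require that some tie-breaking rule exists and that the denominator is never 0.

```python
def squared_error(x, c):
    """Squared-error distortion between two vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def nearest_neighbor_partition(training, codebook):
    """Nearest neighbor condition: put each training vector in the region
    of its closest codevector (ties go to the lowest index)."""
    regions = [[] for _ in codebook]
    for x in training:
        n = min(range(len(codebook)), key=lambda j: squared_error(x, codebook[j]))
        regions[n].append(x)
    return regions


def update_codevectors(regions, codebook):
    """Centroid condition: replace each codevector by the mean of its region.
    An empty region keeps its old codevector (assumed guard against division by 0)."""
    updated = []
    for region, old in zip(regions, codebook):
        if not region:
            updated.append(old)
            continue
        k = len(old)
        updated.append(tuple(sum(x[d] for x in region) / len(region) for d in range(k)))
    return updated
```

Applying the two routines in alternation is one iteration of the codebook design procedure.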

LBG Design Algorithm

• An iterative algorithm that solves the above two optimality criteria.

• Requires an initial codebook C(0).

• The initial codebook is obtained by the splitting method.

• In this method, an initial codevector is set as the average of the entire training sequence.

• This codevector is then split into two. The iterative algorithm is run with these two vectors as the initial codebook. The final two codevectors are split into four, and the process is repeated until the desired number of codevectors is obtained.

LBG Design Algorithm Cont.

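1. Given T. Fix $\epsilon > 0$ to be a “small” number.

2. Let N = 1 and $c_1^* = \frac{1}{M} \sum_{m=1}^{M} x_m$.

Calculate $D_{ave}^* = \frac{1}{Mk} \sum_{m=1}^{M} \| x_m - c_1^* \|^2$

3. Splitting: For i = 1, 2, …, N, set $c_i^{(0)} = (1 + \epsilon) c_i^*$ and $c_{N+i}^{(0)} = (1 - \epsilon) c_i^*$.

Set N = 2N.

4. Iteration: Let $D_{ave}^{(0)} = D_{ave}^*$. Set the iteration index i = 0.

i. For m = 1, 2, …, M, find the minimum value of $\| x_m - c_n^{(i)} \|^2$ over all n = 1, 2, …, N. Let n* be the index which achieves the minimum. Set $Q(x_m) = c_{n^*}^{(i)}$.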

LBG Design Algorithm Cont.

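ii. For n = 1, 2, …, N, update the codevector: $c_n^{(i+1)} = \frac{\sum_{Q(x_m) = c_n^{(i)}} x_m}{\sum_{Q(x_m) = c_n^{(i)}} 1}$

iii. Set i = i + 1.

iv. Calculate $D_{ave}^{(i)} = \frac{1}{Mk} \sum_{m=1}^{M} \| x_m - Q(x_m) \|^2$

v. If $(D_{ave}^{(i-1)} - D_{ave}^{(i)}) / D_{ave}^{(i-1)} > \epsilon$, go back to Step (i).

vi. Set $D_{ave}^* = D_{ave}^{(i)}$. For n = 1, 2, …, N, set $c_n^* = c_n^{(i)}$ as the final codevector.

5. Repeat steps 3 and 4 until the desired number of codevectors is obtained.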
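The steps above can be sketched in pure Python as follows. This is a minimal illustration, not a reference implementation: it assumes a multiplicative split into (1 + eps)c and (1 - eps)c, a relative-improvement stopping test with the same eps, the 1/(Mk) distortion normalization, and a target codebook size that is a power of two. Empty cells simply keep their old codevector, and the multiplicative split does not separate a codevector that is exactly zero.

```python
def squared_error(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    k = len(vectors[0])
    return tuple(sum(v[d] for v in vectors) / len(vectors) for d in range(k))


def assign(training, codebook):
    """Step 4(i): index of the nearest codevector for every training vector."""
    return [min(range(len(codebook)), key=lambda n: squared_error(x, codebook[n]))
            for x in training]


def distortion(training, codebook):
    """Average squared error per dimension under nearest-codevector encoding."""
    M, k = len(training), len(training[0])
    labels = assign(training, codebook)
    return sum(squared_error(x, codebook[n]) for x, n in zip(training, labels)) / (M * k)


def lbg(training, target_size, eps=0.01):
    """Design a codebook of target_size codevectors (assumed to be a power of two)."""
    codebook = [mean(training)]                       # step 2: one codevector
    d_star = distortion(training, codebook)
    while len(codebook) < target_size:                # step 5: repeat 3 and 4
        # Step 3: split every codevector into two perturbed copies.
        codebook = ([tuple((1 + eps) * v for v in c) for c in codebook] +
                    [tuple((1 - eps) * v for v in c) for c in codebook])
        prev = d_star
        while True:                                   # step 4
            labels = assign(training, codebook)
            for n in range(len(codebook)):            # step 4(ii): centroids
                members = [x for x, lab in zip(training, labels) if lab == n]
                if members:                           # empty cell: keep old codevector
                    codebook[n] = mean(members)
            cur = distortion(training, codebook)
            if prev == 0 or (prev - cur) / prev <= eps:
                break                                 # step 4(v): improvement too small
            prev = cur
        d_star = cur                                  # step 4(vi)
    return codebook
```

With the 1-dimensional training set 1.0, 1.2, 0.8, 9.0, 9.2, 8.8 and a target of two codevectors, the procedure starts from the global mean 5.0, splits it, and converges to codevectors near 1.0 and 9.0.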

Performance

• The performance of VQ is typically given in terms of the signal-to-distortion ratio (SDR): $SDR = 10 \log_{10} \left( \sigma^2 / D_{ave} \right)$ (in dB),

• where $\sigma^2$ is the variance of the source and Dave is the average squared-error distortion. The higher the SDR, the better the performance.
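As a small worked example, the SDR can be evaluated as below. The formula 10 log10(variance / D_ave) is the usual definition and is assumed here, because the equation itself is not legible in the slide text; `sdr_db` is an illustrative name.

```python
import math


def sdr_db(source_variance, d_ave):
    """Signal-to-distortion ratio in dB: 10 * log10(sigma^2 / D_ave)."""
    return 10.0 * math.log10(source_variance / d_ave)
```

A unit-variance source quantized with an average distortion of 0.01 gives an SDR of about 20 dB; halving the distortion raises the SDR by about 3 dB.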

Toy Example of VQ Coding

• 4-pole model of the vocal tract => 4 reflection coefficients

• 4 possible vocal tract shapes => 4 sets of reflection coefficients

• 1. Scalar Quantization: assume 4 values for each reflection coefficient => 2 bits x 4 coefficients = 8 bits/frame

• 2. Vector Quantization: only 4 possible vectors => 2 bits to choose which of the 4 vectors to use for each frame (pointer into a codebook)

• This works because the scalar components of each vector are highly correlated.

• If the scalar components are independent => VQ offers no advantage over scalar quantization
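The bit counts in the toy example follow from simple arithmetic, sketched below. The function names are illustrative; each quantizer index is assumed to use the smallest whole number of bits that can address its levels or codevectors.

```python
import math


def scalar_bits_per_frame(num_coefficients, levels_per_coefficient):
    """Each coefficient is quantized separately with its own index."""
    return num_coefficients * math.ceil(math.log2(levels_per_coefficient))


def vq_bits_per_frame(codebook_size):
    """One index selects a whole vector of coefficients from the codebook."""
    return math.ceil(math.log2(codebook_size))
```

With 4 coefficients at 4 levels each, scalar quantization needs 8 bits per frame, while a 4-entry vector codebook needs only 2.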

Comparison of Scalar and Vector Quantization

VQ Codebook of LPC Vectors

64 vectors in a codebook of spectral shapes

10-bit VQ is comparable to 24-bit scalar quantization.

VQ Properties

• In VQ, a cell can have arbitrary size and shape.

• In scalar quantization, a decision region can have arbitrary size, but its shape is fixed.

• VQ has a distortion measure, which is a measure of distance between the input and the output. It is used both to design the codebook vectors and to choose the optimal reconstruction vector.

Iterative VQ Design Algorithm

• 1. Assume an initial set of points is given.

– map all vectors to the best (closest) point in the set

– recompute centroids from the ensemble vectors in each cell

• 2. Iterate until the change in reconstruction levels is small.

• Problem: need to know px(x) to correctly compute centroids.

• Solution: use a training set as an estimate of the ensemble (the k-means clustering algorithm).

VQ Performance

• The simplest form of a vector quantizer can be considered as a generalization of the scalar PCM and is called Vector PCM (VPCM). In VPCM, the codebook is fully searched (full search VQ or F-VQ) for each incoming vector. The number of bits per sample in VPCM is given by

– $B = (\log_2 N)/k$

• and the signal-to-noise ratio for VPCM is given by

– $SNR_k = 6B + K_k$ (dB)

• VPCM yields improved SNR since it exploits the correlation within the vectors. In the case of speech coding, it is reported that $K_2$ is larger than $K_1$ by more than 3 dB, while $K_8$ is larger than $K_1$ by more than 8 dB.

VQ Performance

• Even though VQ offers significant coding gain by increasing N and k, its memory and computational complexity grow exponentially with k for a given rate.

• More specifically, the number of computations required for F-VQ (full search VQ) is of the order of $2^{Bk}$, while the number of memory locations required is $k \, 2^{Bk}$.

• In general the benefits of VQ are realized at rates of 1 bit per sample.
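To make the exponential growth concrete, the sketch below evaluates the rate B = (log2 N)/k and the full-search cost for a given rate and dimension. The function names are illustrative, and B times k is assumed to be a whole number of bits.

```python
import math


def bits_per_sample(n_codevectors, k):
    """Rate B = log2(N) / k in bits per sample."""
    return math.log2(n_codevectors) / k


def full_search_cost(b, k):
    """Return (distance computations, stored scalars) for full-search VQ:
    N = 2**(B*k) codevectors searched, each holding k components."""
    n = 2 ** (b * k)
    return n, k * n
```

At B = 1 bit per sample, going from k = 4 to k = 8 raises the codebook from 16 to 256 codevectors and the storage from 64 to 2048 scalars.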

GS-VQ

• The complexity of VQ can also be reduced by normalizing the vectors of the codebook and encoding the gain separately. The technique is called Gain/Shape VQ (GS-VQ); it was introduced by Buzo and later studied by Sabin and Gray.

• The waveform shape is represented by a codevector from the shape codebook, while the gain is encoded from the gain codebook.

• Comparisons of GS-VQ with F-VQ in the case of speech coding at one bit per sample revealed that GS-VQ yields about 0.7 dB improvement at the same level of complexity.

[Figure: GS-VQ encoder and decoder]
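A gain/shape encoder and decoder might look like the sketch below. The encoding rule is an assumption, since the slide gives only the block diagram: the shape codevectors are taken to have unit norm, the shape is chosen to maximize the inner product with the input, and the gain level closest to that inner product is then selected. The codebooks in the usage note are made up for illustration.

```python
def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def gs_encode(x, shape_codebook, gain_codebook):
    """Return (shape index, gain index) for input vector x.
    Assumes every shape codevector has unit norm."""
    # Shape: the unit-norm codevector best aligned with x.
    s = max(range(len(shape_codebook)),
            key=lambda j: inner_product(x, shape_codebook[j]))
    # Gain: the scalar level closest to the projection of x onto that shape.
    target = inner_product(x, shape_codebook[s])
    g = min(range(len(gain_codebook)), key=lambda j: abs(gain_codebook[j] - target))
    return s, g


def gs_decode(s, g, shape_codebook, gain_codebook):
    """Reconstruct the vector as gain times shape."""
    return tuple(gain_codebook[g] * v for v in shape_codebook[s])
```

For instance, with shapes (1, 0) and (0, 1) and gains 0.5, 1, 2, 4, the input (0.1, 3.7) is encoded as shape 1 and gain 3 and reconstructed as (0, 4). Two small codebooks are searched instead of one large one, which is where the complexity saving comes from.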

Adaptive VQ

• The VQ methods discussed thus far are associated with time-invariant (fixed) codebooks. Since speech is a non-stationary process, one would like to adapt the codebooks ("codebook design on the fly") to its changing statistics.

• VQ with adaptive codebooks is called adaptive VQ (A-VQ), and applications to speech coding have been reported.

• There are two types of A-VQ, namely, forward adaptive and backward adaptive.

• In backward adaptive VQ, codebook updating is based on past data, which is also available at the decoder.

• Forward A-VQ updates the codebooks based on current (or sometimes future) data, and as such additional information must be encoded.

• The principles of forward and backward A-VQ are similar to those of scalar adaptive quantization.

• Practical A-VQ systems are backward adaptive, and they can be classified into vector predictive quantizers and finite state quantizers. Vector predictive coders are essentially an extension of scalar predictive DPCM coders.

Implementation Issues

• The complexity in high-dimensionality VQ can be reduced significantly with the use of structured codebooks, which allow for efficient search.

• Tree-structured and multistep vector quantizers are associated with lower encoding complexity at the expense of loss of performance and, in some cases, increased memory requirements.

• Gray and Abut compared the performance of F-VQ and binary tree search VQ for speech coding and reported a degradation of 1 dB in the SNR for tree-structured VQ.

• Multistep vector quantizers consist of a cascade of two or more quantizers, each one encoding the error or residual of the previous quantizer.

• Gersho and Cuperman compared the performance of full search (dimension 4) and multistep vector quantizers (dimension 12) for encoding speech waveforms at 1 bit per sample and reported a gain of 1 dB in the SNR in the case of multistep VQ.
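The cascade structure of a multistep quantizer can be sketched as follows: each stage quantizes the residual left by the previous one, and the decoder sums the selected codevectors. The stage codebooks in the usage note are hypothetical, and the function names are illustrative.

```python
def squared_error(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def nearest(x, codebook):
    """Index of the codevector closest to x."""
    return min(range(len(codebook)), key=lambda n: squared_error(x, codebook[n]))


def multistep_encode(x, stages):
    """stages is a list of codebooks; return one index per stage."""
    indices = []
    residual = tuple(x)
    for codebook in stages:
        n = nearest(residual, codebook)
        indices.append(n)
        # The next stage encodes what this stage failed to represent.
        residual = tuple(r - c for r, c in zip(residual, codebook[n]))
    return indices


def multistep_decode(indices, stages):
    """Sum the selected codevector from every stage."""
    out = [0.0] * len(stages[0][0])
    for n, codebook in zip(indices, stages):
        out = [o + c for o, c in zip(out, codebook[n])]
    return tuple(out)
```

With a coarse stage {-2, 2} and a fine stage {-0.5, 0.5}, the input 1.6 is encoded as indices [1, 0] and decoded as 2 - 0.5 = 1.5. Two searches over 2 entries each replace one search over the 4 combined reconstruction values.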

I SEMESTER M. TECH. SESSIONAL TEST 2005-06

VOICE AND PICTURE CODING (EL-653)

Differentiate between vowels and diphthongs. (4)

Calculate the pitch in mels for a signal with a frequency of 5000 Hz. (4)

Prove that the optimum quantizer level is at the centroid of the probability density function over the interval.

For an AT&T sub-band coder the frequency ranges for each band are: 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. For the bit allocation {4/4/2/2/0} calculate the bit rate. Explain why 0 bits are used for the 3-4 kHz band.

Ans: 4x1 + 4x1 + 2x2 + 2x2 = 16 kbps (bits per sample times the sampling rate in kHz of each band, with each band sampled at twice its width); only 3.4 kHz of bandwidth is available.
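The arithmetic in the answer can be checked with a short script. It assumes each sub-band is sampled at its Nyquist rate, that is, at twice the band's width; the function name is illustrative.

```python
def subband_bit_rate_kbps(band_edges_khz, bits_per_sample):
    """Total bit rate: sum over bands of bits/sample times sampling rate (kHz)."""
    total = 0.0
    for (low, high), bits in zip(band_edges_khz, bits_per_sample):
        sampling_rate_khz = 2.0 * (high - low)  # assumed Nyquist-rate sampling of the band
        total += bits * sampling_rate_khz
    return total


BANDS = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
ALLOCATION = [4, 4, 2, 2, 0]
```

Calling `subband_bit_rate_kbps(BANDS, ALLOCATION)` reproduces the 16 kbps of the answer; the 3-4 kHz band contributes nothing because it is allocated 0 bits.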
