
Page 1: CODING OF SPEECH SIGNALS • Instantaneous companding => SNR only weakly dependent on Xmax/σx for large µ-law compression (µ = 100-500) • Optimum SNR => minimize σe² when σx² is known

CODING OF SPEECH SIGNALS

Speech Coding:

• Waveform Coding

• Analysis-Synthesis or Vocoders

• Hybrid Coding

Waveform Coding: an attempt is made to preserve the original waveform.

Vocoders: a theoretical model of the speech production mechanism is considered.

Hybrid Coding: uses techniques from the other two.

Page 2:

Speech Coders

Page 3:

Speech Quality vs Bit Rate for Codecs

From: J. Woodard, “Speech coding overview”,

http://www-mobile.ecs.soton.ac.uk/speech_codecs

Page 4:

Speech Coding Objectives

– High perceived quality

– High measured intelligibility

– Low bit rate (bits per second of speech)

– Low computational requirement (MIPS)

– Robustness to successive encode/decode cycles

– Robustness to transmission errors

Objectives for real-time only:

– Low coding/decoding delay (ms)

– Work with non-speech signals (e.g. touch tone)

Page 5:

Speech Information Rates

• Fundamental level:

• 10-15 phonemes/second for continuous speech.

• 32-64 phonemes per language => 6 bits/phoneme.

• Information Rate=60-90 bps at the source.

• Waveform level

• Speech bandwidth from 4 – 10 kHz => sampling rate from 8 –20 kHz.

• Need 12-16 bit quantization for high quality digital coding.

• Information Rate=96-320 kbps => more than 3 orders of magnitude difference in Information rates between the production and waveform levels.

Page 6:

MOS (Mean Opinion Scores)

• Why MOS?

– SNR is just not good enough as a subjective measure for most coders (especially model-based coders, where the waveform is not inherently preserved)

– noise is not simple white (uncorrelated) noise

– error is signal correlated

• clicks/transients

• frequency dependent spectrum—not white

• includes components due to reverberation and echo

• noise comes from at least two sources, namely quantization and background noise

• delay due to transmission, block coding, processing

• transmission bit errors—can use Unequal Protection Methods

• tandem encodings

Page 7:

MOS Ratings

• Subjective evaluation of speech quality

– 5—excellent

– 4—good

– 3—fair

– 2—poor

– 1—bad

• MOS Scores:

– 4.5 for natural wideband speech

– 4.05 for toll quality telephone speech

– 3.5-4.0 for communications quality telephone speech

– 2.0-3.5 for lower quality speech from synthesizers, low bit rate coders, etc

• other measures of intelligibility

– DRT (diagnostic rhyme test) => uses rhyming words

– DAM-diagnostic acceptability measure

Page 8:

Digital Speech Coding

Page 9:

Sampling Theorem

• Theorem: If the highest frequency contained in an analog signal xa(t) is Fmax = B, and the signal is sampled at a rate Fs > 2B, then the analog signal can be exactly recovered from its samples using the following reconstruction formula:

• Note that at the original sample instances (t = nT), the reconstructed analog signal is equal to the value of the original analog signal. At times between the sample instances, the signal is the weighted sum of shifted sinc functions.

xa(t) = Σ from n = −∞ to ∞ of xa(nT) · sin[π(t − nT)/T] / [π(t − nT)/T]

Page 10:

GRAPHICAL INTERPRETATION OF THE SAMPLING THEOREM

Page 11:

RECONSTRUCTION VIA SINC(X) INTERPOLATION

Page 12:

TYPICAL SAMPLING FREQUENCIES IN SPEECH RECOGNITION

• 8 kHz: Popular in digital telephony. Provides

coverage of first three formants for most

speakers and most sounds.

• 16 kHz: Popular in speech research. Why?

• Sub 8 kHz Sampling: Can aliasing be useful in

speech recognition? Hint: Consumer

electronics.

Page 13:

Problems

• Sampling theorem for bandlimited signals.

• How to change the sample rate of a signal?

• How this can be implemented using time domain

interpolation (based on the Sampling Theorem)?

• How this can be implemented efficiently using

digital filters?

Page 14:

Digital Speech Coding

Page 15:

Speech Probability Density Function

• The probability density function for x(n) is the same as for xa(t): since x(n) = xa(nT), the mean and variance are the same for both x(n) and xa(t).

• Need to estimate probability density and power spectrum from

speech waveforms

– probability density estimated from long term histogram of

amplitudes

– a good approximation is a gamma density, of the form:

p(x) = [√3 / (8π σx |x|)]^(1/2) · exp(−√3 |x| / (2σx))

– a simpler approximation is the Laplacian density, of the form:

p(x) = [1 / (√2 σx)] · exp(−√2 |x| / σx)

Page 16:

Measured Speech Densities

• Distribution normalized so mean is 0 and variance is 1 (x̄ = 0, σx = 1)

• Gamma density more closely

approximates measured

distribution for speech than

Laplacian

• Laplacian is still a good

model and is used in

analytical studies

• Small amplitudes much

more likely than large

amplitudes by 100:1 ratio.

Page 17:

Speech AC and Power Spectrum

• Can estimate the long-term autocorrelation and power spectrum using time-series analysis methods, e.g.

R(m) ≈ (1/L) Σ from n = 0 to L−1 of x(n) x(n + m), where L is a large integer

• 8kHz sampled speech for several

speakers

• High correlation between

adjacent samples

• Low pass speech more highly

correlated than bandpass

speech

Page 18:

Instantaneous Quantization

• Separating the processes of sampling and

quantization

• Assume x(n) obtained by sampling a bandlimited

signal at a rate at or above the Nyquist rate.

• Assume x(n) is known to infinite precision in

amplitude

• Need to quantize x(n) in some suitable manner.

Page 19:

Quantization and Encoding

Coding is a two-stage process

• Quantization process: x(n) → x’(n)

• Encoding process: x’(n) → c(n)

where Δ is the (assumed fixed) quantization step size

• Decoding is a single-stage process

• decoding process:c’(n) → x’’(n)

• if c’(n) = c(n) (no errors in transmission), then x’’(n) = x’(n)

• if x’’(n) ≠ x’(n), coding and quantization lose information.

Page 20:

B-bit Quantization

• Use B-bit binary numbers to represent the quantized samples => 2^B quantization levels

• Information Rate of Coder: I=B FS= total bit rate in bits/second

• B=16, FS= 8 kHz => I=128 Kbps

• B=8, FS= 8 kHz => I=64 Kbps

• B=4, FS= 8 kHz => I=32 Kbps

• Goal of waveform coding is to get the highest quality at a fixed value of I (Kbps), or equivalently to get the lowest value of I for a fixed quality.

• Since FS is fixed, need most efficient quantization methods to minimize I.

Page 21:

Quantization Basics

• Assume |x(n)| ≤Xmax (possibly ∞)

• For Laplacian density (where Xmax=∞), can show that

0.35% of the samples fall outside the range -4σx ≤

x(n) ≤ 4σx => large quantization errors for 0.35% of

the samples.

• Can safely assume that Xmax is proportional to σx.

Page 22:

Quantization Process

Page 23:

Uniform Quantizer

• The choice of quantization range and levels chosen such that signal can easily be processed digitally

Page 24:

Mid-Riser and Mid-Tread Quantizers

• mid-riser

– origin (x=0) in middle of rising part of the staircase

– same number of positive and negative levels

– symmetrical around origin.

• mid-tread

– origin (x=0) in middle of quantization level

– one more negative level than positive

– one quantization level of 0 (where a lot of activity occurs)

• Code words have direct numerical significance (sign-magnitude

representation for mid-riser, two’s complement for mid-tread).

Page 25:

• Uniform Quantizers characterized by:

– number of levels: 2^B (B bits)

– quantization step size: Δ

• If |x(n)| ≤ Xmax and x(n) has a symmetric density, then

Δ · 2^B = 2Xmax  =>  Δ = 2Xmax / 2^B

• If we let

x’(n)=x(n) + e(n)

• with x(n) the unquantized speech sample, and e(n) the quantization error, where

−Δ/2 ≤ e(n) ≤ Δ/2

Page 26:

Quantization Noise Model

• Quantization noise is a zero-mean, stationary white

noise process.

E[e(n)e(n+m)] = σe², m = 0
              = 0, otherwise

• Quantization noise is uncorrelated with the input

signal

E[x(n)e(n+m)] = 0 for all m

• Distribution of quantization errors is uniform over

each quantization interval

pe(e) = 1/Δ, −Δ/2 ≤ e ≤ Δ/2
      = 0, otherwise

so that ē = 0 and σe² = Δ²/12

Page 27:

SNR for Quantization

Page 28:

Review of Quantization

Assumptions

• Input signal fluctuates in a complicated manner so a

statistical model is valid.

• Quantization step size is small enough to remove

any signal correlated patterns in quantization error.

• Range of quantizer matches peak-to-peak range of

signal, utilizing full quantizer range with essentially

no clipping.

• For a uniform quantizer with a peak-to-peak range of

±4σx, the resulting SNR(dB) is 6B-7.2.
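The 6B − 7.2 dB figure can be checked by simulation. The sketch below is illustrative (a unit-variance Gaussian test signal and a mid-riser quantizer are my assumptions, not from the slides); with B = 8 and a ±4σx range the measured SNR lands close to 6·8 − 7.2 = 40.8 dB:

```python
import math
import random

def uniform_quantize(x, b, x_max):
    """Mid-riser uniform b-bit quantizer over [-x_max, x_max]."""
    delta = 2.0 * x_max / (2 ** b)
    i = math.floor(x / delta)
    i = max(-(2 ** (b - 1)), min(2 ** (b - 1) - 1, i))  # clip overloads
    return (i + 0.5) * delta

random.seed(0)
B = 8
sigma = 1.0
x = [random.gauss(0.0, sigma) for _ in range(200_000)]
e = [uniform_quantize(v, B, 4.0 * sigma) - v for v in x]
sig_pow = sum(v * v for v in x) / len(x)
noise_pow = sum(v * v for v in e) / len(e)
snr_db = 10.0 * math.log10(sig_pow / noise_pow)
```

For a Gaussian input at ±4σx the clipped fraction is tiny, so the measured noise power is essentially Δ²/12 and the 6B − 7.2 rule holds; for heavier-tailed densities the overload noise grows.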

Page 29:

Instantaneous Companding

• In order to get constant percentage error (rather than constant variance error), need logarithmically spaced quantization levels

– quantize the logarithm of the input signal rather than the input signal itself

Page 30:

Insensitivity to Signal Level

y(n) = ln|x(n)|

x(n) = exp[(y(n)].sign[x(n)]

where sign[x(n)] = -1 x(n) ≤ 0

= +1 x(n) > 0

The quantized log magnitude is

y’(n) = Q[log|x(n)|]

= log|x(n)| + e(n) a new error signal

Assume that e(n) is independent of log|x(n)|. The inverse is

x’(n) = exp[y’(n)].sign[x(n)]

= x(n).exp[e(n)]

Assume e(n) to be small; then

exp[e(n)] ≈ 1 + e(n) + …

x’(n) = x(n)[1 + e(n)] = x(n) + e(n)x(n) = x(n) + f(n)

Since x(n) and e(n) are independent,

σf² = σx² · σe²

SNR = σx² / σf² = 1 / σe²

Page 31:

Pseudo-Logarithmic Compression

• Unfortunately true logarithmic compression is not practical, since the dynamic range (ratio between the largest and smallest values) is infinite => need an infinite number of quantization levels.

• Need an approximation to logarithmic compression => µ-law/A-law compression.

Page 32:

µ-law Compression

• y(n) = F[x(n)]

• When x(n) = 0, y(n) = 0

• When µ = 0, y(n) = x(n); no compression

• When µ is large and for large |x(n)|:

y(n) = Xmax · [log(1 + µ|x(n)|/Xmax) / log(1 + µ)] · sign[x(n)]

and, for large µ and large |x(n)|,

y(n) ≈ [Xmax / log µ] · log[µ|x(n)|/Xmax] · sign[x(n)]
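The compressor and its inverse (expander) can be sketched directly from the formula. A minimal Python version (function names and the normalized Xmax = 1 are illustrative):

```python
import math

def mu_compress(x, mu=255.0, x_max=1.0):
    """mu-law: y = Xmax * ln(1 + mu*|x|/Xmax) / ln(1 + mu) * sign(x)."""
    s = -1.0 if x < 0 else 1.0
    return s * x_max * math.log(1.0 + mu * abs(x) / x_max) / math.log(1.0 + mu)

def mu_expand(y, mu=255.0, x_max=1.0):
    """Inverse of mu_compress: x = (Xmax/mu) * ((1+mu)^(|y|/Xmax) - 1) * sign(y)."""
    s = -1.0 if y < 0 else 1.0
    return s * (x_max / mu) * ((1.0 + mu) ** (abs(y) / x_max) - 1.0)
```

Note how small amplitudes are boosted before uniform quantization (e.g. an input of 0.01 maps to roughly 0.23 for µ = 255), which is exactly what equalizes the percentage error across signal levels.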

Page 33:

SNR for µ-law Quantizer

• 6B dependence on B: good

• Much less dependence on Xmax/σx: good

• For large µ, SNR is less sensitive to changes in Xmax/σx: good

• µ-law used in wireline telephony for more than 30 years.

SNR(dB) = 6B + 4.77 − 20 log10[ln(1 + µ)] − 10 log10[1 + √2·Xmax/(µσx) + (Xmax/(µσx))²]

Page 34:

Companding

Page 35:

µ-Law Companding

Page 36:

Quantization for Optimum SNR

• Goal is to match quantizer to actual signal density to

achieve optimum SNR

• µ-law tries to achieve constant SNR over a wide range of signal variances => some sacrifice in SNR performance relative to a quantizer whose step size is matched to the signal variance

• If σx is known, you can choose quantizer levels to minimize quantization error variance and maximise SNR.

Page 37:

Quantizer Levels for Maximum

SNR

• Variance of quantization noise is:

σe² = E[e²(n)] = E[(x’(n) − x(n))²]

• with x’(n) = Q[x(n)]. Assume quantization levels

[x’−M/2, x’−M/2+1, …, x’−1, x’1, …, x’M/2]

• associating each quantization level with a signal interval:

x’j = quantization level for the interval [xj−1, xj]

• For symmetric, zero-mean distributions with unbounded amplitudes it makes sense to define the boundaries

x0 = 0 (central boundary point), x±M/2 = ±∞

• The error variance is thus

σe² = ∫ e² pe(e) de

Page 38:

Optimum Quantization Levels

Page 39:

Solution for Optimum Levels

• To solve for optimum values for {x’i} and {xi}, we differentiate σe² wrt the parameters, set the derivative to 0, and solve numerically:

Proof??

• With boundary conditions of x0 = 0, x±M/2 = ±∞

• Can also constrain quantizer to be uniform and solve for value of Δ that maximizes SNR

• Optimum boundary points lie halfway between M/2 quantizer levels

• Optimum location of quantization level x’ is at the centroid of the probability density over the interval xi-1 to xi.

• Solve the above set of equations iteratively to obtain {x’i}, {xi}

xi = (x’i + x’i+1) / 2,  i = 1, 2, …, M/2 − 1

∫ from xi−1 to xi of (x − x’i) p(x) dx = 0,  i = 1, 2, …, M/2
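The two conditions above (boundaries at midpoints, levels at centroids) are exactly the Lloyd-Max iteration. Below is an illustrative Python sketch (the numerical grid integration and the unit-variance Gaussian example are my choices, not from the slides):

```python
import math

def lloyd_max(pdf, levels, lo, hi, iters=100, n_grid=4000):
    """1-D Lloyd-Max iteration: boundaries at midpoints of adjacent
    levels; each level moved to the centroid of the pdf over its
    interval. The pdf is integrated numerically on a grid over [lo, hi]."""
    dx = (hi - lo) / n_grid
    xs = [lo + (i + 0.5) * dx for i in range(n_grid)]
    ps = [pdf(x) for x in xs]
    levels = sorted(levels)
    for _ in range(iters):
        bounds = [(a + b) / 2.0 for a, b in zip(levels, levels[1:])]
        new = []
        for i in range(len(levels)):
            a = lo if i == 0 else bounds[i - 1]
            b = hi if i == len(levels) - 1 else bounds[i]
            num = sum(x * p * dx for x, p in zip(xs, ps) if a <= x < b)
            den = sum(p * dx for x, p in zip(xs, ps) if a <= x < b)
            new.append(num / den if den > 0 else levels[i])
        levels = new
    return levels

# Unit-variance Gaussian; the optimum 2-level (1-bit) quantizer is
# known to place its levels at +/- sqrt(2/pi) ~ +/- 0.798.
gauss = lambda x: math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
opt = lloyd_max(gauss, [-0.5, 0.5], -8.0, 8.0)
```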

Page 40:

Uniform/ Non-uniform Quantizers

Page 41:

Adaptive Quantization

• Linear quantization => SNR depends on σx being constant (this is clearly not the case)

• Instantaneous companding => SNR only weakly dependent on Xmax/σx for large µ-law compression (µ = 100-500)

• Optimum SNR => minimize σe² when σx² is known; non-uniform distribution of quantization levels

• Quantization dilemma:

– want to choose the quantization step size large enough to accommodate the maximum peak-to-peak range of x(n);

– at the same time need to make the quantization step size small so as to minimize the quantization error.

• The non-stationary nature of speech (variability across sounds, speakers, backgrounds) compounds this problem greatly.

Page 42:

Types of Adaptive Quantization

• Instantaneous: amplitude changes reflect sample-to-sample variation of x(n), implying rapid adaptation.

• Syllabic: amplitude changes reflect syllable-to-syllable variations in x(n) => slow adaptation

• Feed-forward: adaptive quantizers that estimate σx² from x(n) itself.

• Feedback-adaptive quantizers that adapt the step

size, Δ, on the basis of the quantized signal, x’(n), (or

equivalently the codewords, c(n)).

Page 43:

Feed Forward Adaptation

• Variable step size

• assume uniform quantizer

with step size Δ(n)

• x(n) is quantized using Δ(n)

=>c(n) and Δ(n) need to be

transmitted to the decoder

• if c’(n)=c(n) and Δ’(n)= Δ(n)

=> no error in the channel,

and

• x’’(n) = x’(n)

Don’t have x(n) at the decoder to estimate Δ(n) => need to

transmit Δ(n); This is the major drawback of the feed

forward adaptation.

Page 44:

Feed-Forward Quantizer

• time varying gain, G(n) =>

c(n) and G(n) need to be

transmitted to the decoder.

Can’t estimate G(n) at

the decoder => it has to

be transmitted.

Page 45:

Feed Forward Quantizers

• Feed-forward systems make estimates of σx², then make Δ or the quantization levels proportional to σx; the gain is inversely proportional to σx.

• Assume σ²(n) is proportional to the short-time energy:

σ²(n) = Σm x²(m) h(n − m)

• where h(n) is a lowpass filter; then E[σ²(n)] ∝ σx²

• Consider h(n) = α^(n−1) for n ≥ 1, = 0 otherwise; then

σ²(n) = Σ from m = −∞ to n−1 of x²(m) α^(n−m−1),  0 < α < 1

σ²(n) = α σ²(n−1) + x²(n−1)  (recursion; proof??)
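The recursion makes the energy estimate cheap to compute sample by sample. A minimal Python sketch (names and the initial value are illustrative):

```python
def feedforward_sigma2(x, alpha=0.99, init=1e-6):
    """Recursive short-time energy estimate for feed-forward adaptation:
    sigma2(n) = alpha * sigma2(n-1) + x(n-1)**2, which unrolls to
    sum over m < n of x(m)^2 * alpha^(n-1-m)."""
    s = init
    out = []
    for n in range(len(x)):
        out.append(s)          # estimate at time n uses only x(0..n-1)
        s = alpha * s + x[n] * x[n]
    return out
```

The step size Δ(n) (or gain G(n)) would then be taken proportional to the square root of this estimate, with the Δmin/Δmax limits discussed on the next slide.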

Page 46:

Feed Forward Quantizers

• Δ(n) and G(n) vary slowly compared to x(n)

• They must be sampled and transmitted as part of the waveform

coder parameters

• Rate of sampling depends on the bandwidth of the lowpass filter h(n): for α = 0.99 the rate is about 13 Hz; for α = 0.9 the rate is about 135 Hz

• It is reasonable to place limits on the variation of Δ(n) or G(n), of the form

Gmin ≤ G(n) ≤ Gmax

Δmin ≤ Δ(n) ≤ Δmax

• For obtaining σy² ≈ constant over a 40 dB range in signal levels:

Gmax/Gmin = Δmax/Δmin = 100 (40 dB range)

Page 47:

Feed Forward Adaptation Gain

• Δ(n) or G(n) is evaluated after every M samples

• Use 128 to 1024 samples for estimation

• Adaptive quantizer achieves up to 5.6 dB better SNR than non-adaptive quantizers

• Can achieve this SNR with low "idle channel noise" and wide speech dynamic range by suitable choice of Δmin and Δmax

Page 48:

Feedback Adaptation

• σ²(n) estimated from the quantizer output (or the code words).

• Advantage of feedback adaptation is that neither Δ(n) nor G(n) needs to be

transmitted to the decoder since they can be derived from the code words.

• Disadvantage of feedback adaptation is increased sensitivity to errors in

codewords, since such errors affect Δ(n) and G(n).

Page 49:

Feedback Adaptation

• σ²(n) is based only on past values of the quantized signal x’

• Two typical windows/filters are:

• Can use very short window lengths (e.g. M=2) to achieve

12dB SNR for a B=3 bit quantizer.

σ²(n) = Σm x’²(m) h(n − m)

with, e.g., h(n) = α^(n−1) for n ≥ 1 (= 0 otherwise), or a rectangular window

h(n) = 1/M for 1 ≤ n ≤ M (= 0 otherwise), which gives

σ²(n) = (1/M) Σ from m = n−M to n−1 of x’²(m)
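With the rectangular window this estimate is just a sliding average of the last M quantized samples squared. An illustrative Python sketch (the start-up convention for n < M is my choice):

```python
def feedback_sigma2(x_hat, M=2):
    """Feedback variance estimate from the *quantized* samples:
    sigma2(n) = (1/M) * sum over m = n-M .. n-1 of x_hat(m)^2.
    Only past decoder-visible values are used, so the decoder can
    rebuild the step size with no side information."""
    out = []
    for n in range(len(x_hat)):
        window = x_hat[max(0, n - M):n]   # shorter window at start-up
        out.append(sum(v * v for v in window) / M if window else 0.0)
    return out
```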

Page 50:

Alternative Approach to Adaptation

Input-output characteristic of a 3-bit adaptive quantizer

Page 51:

Optimal Step Size Multipliers

Page 52:

Nonuniform quantizer

[Block diagram: discrete samples → Compressor → Uniform Quantizer → digital signals → Channel → received digital signals → Decoder → Expander → output]

“Compressing-and-expanding” is called “companding.”

Page 53:

Compression Techniques

µ-law compressor (very popular internationally):

w₂(t) = sign[w₁(t)] · ln(1 + µ|w₁(t)|) / ln(1 + µ),  |w₁(t)| ≤ 1

In the U.S., µ = 255 is used.

A-law compressor:

w₂(t) = sign[w₁(t)] · A|w₁(t)| / (1 + ln A),  0 ≤ |w₁(t)| ≤ 1/A

w₂(t) = sign[w₁(t)] · [1 + ln(A|w₁(t)|)] / (1 + ln A),  1/A ≤ |w₁(t)| ≤ 1

Page 54:

Practical Implementation of µ-law compressor

Page 55:

Waveform Coders

Pulse Code Modulation (PCM)

• Needs the sampling frequency, fs, to be greater than the Nyquist rate (twice the maximum frequency in the signal)

• For n bits per sample, the dynamic range is ±2^(n−1) levels and the quantisation noise power equals Δ²/12 (Δ = step size)

• Total bit rate = n·fs

• Can use non-uniform quantisation / variable length codes

– Logarithmic quantization (A-law, µ-law)

– Adaptive Quantization

Page 56:

G.711

• Pulse Code Modulation (PCM) codecs are the simplest form of waveform codecs.

• Narrowband speech is typically sampled 8000 times per second, and then each speech sample must be quantized.

• If linear quantization is used then about 12 bits per sample are needed, giving a bit rate of about 96 kbits/s.

• However this can be easily reduced by using non-linear quantization.

• For coding speech it was found that with non-linear quantization 8 bits per sample was sufficient for speech quality which is almost indistinguishable from the original.

• This gives a bit rate of 64 kbits/s; two such non-linear PCM codecs (µ-law and A-law) were standardised in the 1960s.

Page 57:

Waveform Coders

Differential Pulse Code Modulation (DPCM)

• Predict the next sample based on the last few decoded

samples

• Minimise mean squared error of prediction residual

– use LP coding

• Good prediction results in a reduction in the dynamic range

needed to code the prediction residual and hence a

reduction in the bit rate.

• Can use non-uniform quantisation or variable length codes

Page 58:

Differential PCM (DPCM)

• Fixed predictors can give from 4-11dB SNR

improvement over direct quantization (PCM)

• Most of the gain occurs with first order predictor

• Prediction up to 4th or 5th order helps

Page 59:

Another Implementation of DPCM

Quantization error is not accumulated.

Page 60:

• For slowly varying signals, a future sample can be predicted from past samples.

• A transversal filter can perform the prediction process.

[Block diagrams: Transmitter side — s(t) minus the predictor output gives the residual e(t); Receiver side — e(t) plus the predictor output reconstructs s(t).]
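The transmitter/receiver loop can be sketched in a few lines of Python (the first-order predictor coefficient and step size are illustrative choices, not from the slides). Note that the encoder predicts from its own *decoded* samples, which is what keeps quantization error from accumulating:

```python
import math

def quantize(v, delta=0.1):
    """Simple uniform quantizer for the prediction residual."""
    return delta * round(v / delta)

def dpcm_encode(x, a=0.9, delta=0.1):
    """First-order DPCM encoder with a local decoder in the loop."""
    codes, x_rec = [], 0.0
    for s in x:
        e = s - a * x_rec              # prediction residual
        eq = quantize(e, delta)        # quantized residual (transmitted)
        codes.append(eq)
        x_rec = a * x_rec + eq         # local decoder reconstruction
    return codes

def dpcm_decode(codes, a=0.9):
    """Receiver: same predictor driven by the received residuals."""
    x_rec, out = 0.0, []
    for eq in codes:
        x_rec = a * x_rec + eq
        out.append(x_rec)
    return out

x = [math.sin(0.05 * n) for n in range(200)]
y = dpcm_decode(dpcm_encode(x))
```

Because the same predictor runs at both ends, the per-sample reconstruction error is bounded by half the residual step size.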

Page 61:

DPCM with Adaptive Quantization

• Quantizer step size proportional to variance at quantizer input

• Can use d(n) or x(n) to control step size

• Get 5 dB improvement in SNR over µ-law non-adaptive PCM

• Get 6 dB improvement in SNR using differential configuration with fixed prediction => ADPCM is about 10-11 dB SNR better than from a fixed quantizer.

Page 62:

Feedback ADPCM

• Can achieve same improvement in SNR as feed forward system

Page 63:

DPCM with Adaptive Prediction

• Need adaptive prediction to

handle non-stationarity of

speech.

• ADPCM encoders with

pole-zero decoder filters

have proved to be

particularly versatile in

speech applications.

• The ADPCM 32 kbits/s

algorithm adopted for the

G.721 CCITT standard

(1984) uses a pole-zero

adaptive predictor.

Page 64:

DPCM with Adaptive Prediction

• Prediction coefficients assumed to be time-dependent of the form

• Assume speech properties remain fixed over short time intervals.

• Choose the αk(n) to minimize the average squared prediction error over short intervals.

• The optimum predictor coefficients satisfy the relationships

• Where Rn(j) is the short-time autocorrelation function of the form

• w(n − m) is a window positioned at sample n of the input.

• Update every 10-20msec.

x̃(n) = Σ from k = 1 to p of αk(n) x̃(n − k)

Rn(j) = Σ from k = 1 to p of αk(n) Rn(j − k),  j = 1, 2, …, p

Rn(j) = Σm x(m) w(n − m) x(m + j) w(n − m − j),  0 ≤ j ≤ p
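For p = 1 the normal equations reduce to a single division, which makes a compact sketch (the rectangular window, exponential test signal, and function names are illustrative assumptions):

```python
def w_at(w, k):
    """Window value, zero outside its support."""
    return w[k] if 0 <= k < len(w) else 0.0

def short_time_autocorr(x, w, n, j):
    """R_n(j) = sum over m of x(m) w(n-m) x(m+j) w(n-m-j)."""
    total = 0.0
    for m in range(len(x) - j):
        total += x[m] * w_at(w, n - m) * x[m + j] * w_at(w, n - m - j)
    return total

def lpc_order1(x, w, n):
    """Solve the p = 1 normal equation R_n(1) = a1 * R_n(0)."""
    r0 = short_time_autocorr(x, w, n, 0)
    r1 = short_time_autocorr(x, w, n, 1)
    return r1 / r0 if r0 else 0.0

# A decaying exponential x(m) = 0.9^m should yield a1 ~ 0.9.
x = [0.9 ** m for m in range(100)]
w = [1.0] * 50          # rectangular analysis window
a1 = lpc_order1(x, w, 49)
```

For p > 1 the same autocorrelation values feed a p-by-p linear system (commonly solved with the Levinson-Durbin recursion), updated every 10-20 ms as the slide states.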

Page 65:

Prediction Gain for DPCM with

Adaptive Prediction

• Fixed prediction: 10.5 dB prediction gain for large p.

• Adaptive prediction: 14 dB gain for large p.

• Adaptive prediction more robust to speaker and speech material.

Gp = 10 log10 { E[x²(n)] / E[d²(n)] }

Page 66:

Comparison of Coders

• 6 dB between curves

• Sharp increase in

SNR with both fixed

prediction and

adaptive quantization

• Almost no gain for

adapting first order

predictor

Page 67:

ADPCM G.721 Encoder

Page 68:

ADPCM G.721 Encoder

• The algorithm consists of an adaptive quantizer and an adaptive pole-zero predictor.

• The pole-zero predictor (2 poles, 6 zeros) estimates the input signal and hence reduces the error variance.

• The quantizer encodes the error sequence into a sequence of 4-bit words. The prediction coefficients are estimated using a gradient algorithm, and the stability of the decoder is checked by testing the two roots of A(z).

• The performance of the coder, on the MOS scale, is above 4, but it degrades as the number of asynchronous tandem codings increases. The G.721 ADPCM algorithm was also modified to accommodate 24 and 40 kbits/s in the G.723 standard.

• The performance of ADPCM degrades quickly for rates below 24 kbits/s.

Page 69:

Delta Modulation

• Simplest form of differential quantization is in delta

modulation (DM).

• Sampling rate chosen to be many times the Nyquist

rate for the input signal => adjacent samples are

highly correlated.

• This leads to a high ability to predict x(n) from past

samples, with the variance of the prediction error

being very low, leading to a high prediction gain =>

can use simple 1-bit (2-level) quantizer =>the bit rate

for DM systems is just the (high) sampling rate of the

signal.

Page 70:

Linear Delta Modulation (LDM)

Page 71:

Linear Delta Modulation

2-level quantizer with fixed step size Δ, of the form

d'(n) = +Δ if d(n) ≥ 0 (c(n) = 1)

d'(n) = -Δ if d(n) < 0 (c(n) = 0)

Page 72:

Illustration of DM

• Basic equations of DM are x'(n) = αx'(n-1) + d'(n)

• When α ≈ 1 this is digital integration, i.e., accumulation of increments of ±Δ.

• d(n) = x(n) - x'(n-1) = x(n) - x(n-1) - e(n-1)

• d(n) is the first backward difference of x(n), i.e., an approximation to the derivative of the input.

• How big do we make Δ? At the maximum slope of xa(t) we need Δ/T ≥ max|dxa(t)/dt|

• Otherwise the reconstructed signal will lag the actual signal, a "slope overload" condition resulting in quantization error called "slope overload distortion".

• Since x'(n) can only change by fixed increments of Δ, fixed-step DM is called linear DM or LDM.
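A minimal LDM encoder/decoder can make the equations above concrete. This is an illustrative sketch (function names and the oversampled sine are assumptions); the step Δ is chosen just above the signal's maximum per-sample slope so that no slope overload occurs:

```python
import numpy as np

def ldm_encode(x, delta):
    """Linear delta modulation: one bit per sample, fixed step size."""
    xhat_prev = 0.0
    bits = []
    for s in x:
        d = s - xhat_prev                     # d(n) = x(n) - x'(n-1)
        c = 1 if d >= 0 else 0                # only the sign is transmitted
        xhat_prev += delta if c else -delta   # x'(n) = x'(n-1) + d'(n)
        bits.append(c)
    return np.array(bits)

def ldm_decode(bits, delta):
    steps = np.where(np.array(bits) == 1, delta, -delta)
    return np.cumsum(steps)                   # digital integration of +/-delta

# Heavily oversampled sine => small per-sample slope => LDM can track it
t = np.arange(4000)
x = np.sin(2 * np.pi * t / 400.0)
delta = 0.02                                  # > max slope 2*pi/400 ~ 0.0157
bits = ldm_encode(x, delta)
y = ldm_decode(bits, delta)
err = np.max(np.abs(x - y))                   # granular error, on the order of 2*delta
```

Shrinking the period of the sine (raising its slope) past Δ per sample reproduces the slope-overload condition described above.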

Page 73:

DM Waveform

Page 74:

DM Granular Noise

• When xa(t) has a small slope, Δ determines the peak error; when xa(t) = 0, the quantizer output will be an alternating sequence of 0's and 1's, and x'(n) will alternate around zero with a peak variation of Δ. This condition is called "granular noise".

• Need a large step size to handle a wide dynamic range.

• Need a small step size to accurately represent low-level signals.

• With LDM we must accommodate both the dynamic range and the amplitude of the difference signal => choose Δ to minimize the mean-squared quantization error (a compromise between slope overload and granular noise).

Page 75:

Performance of DM Systems

Normalized step size defined as

Δ / E[(x(n) - x(n-1))²]^(1/2)

Oversampling index defined as

F0 = Fs/(2FN)

where Fs is the sampling rate of the DM and FN is the Nyquist frequency of the signal.

The total bit rate of the DM is

BR = Fs = 2FN·F0

For a given value of F0 there is an optimum value of Δ.

The optimum SNR increases by 9 dB for each doubling of F0 => better than the 6 dB obtained by increasing the number of bits/sample by 1 bit.

The curves are very sharp around the optimum value of Δ => the SNR is very sensitive to the input level.

For SNR = 35 dB with FN = 3 kHz => 200 kbps rate; for toll quality a much higher rate is required.

Page 76:

Adaptive Delta Mod

• Step size adaptation for DM (from codewords)

– Δ(n) =M.Δ(n-1)

– Δmin ≤ Δ(n) ≤ Δmax

• M is a function of c(n) and c(n-1), since c(n) depends only on the sign of d(n)

– d(n) = x(n) - x’(n-1)

• The sign of d(n) can be determined before the actual quantized value d'(n), which requires the new value of Δ(n) for its evaluation

• The algorithm for choosing the step size multiplier is

– M = P > 1 if c(n) = c(n-1)

– M = Q < 1 if c(n) ≠ c(n-1)
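The adaptation rule above can be sketched as follows. This is an illustrative sketch with assumed parameter values (P = 2, Q = 1/2, and arbitrary Δmin/Δmax); note the sign of d(n) is computed before Δ(n) is updated, exactly as the slide describes:

```python
import numpy as np

def adm_encode(x, d_min=0.01, d_max=1.0, P=2.0, Q=0.5):
    """Adaptive DM: step multiplied by P on repeated bits, by Q on alternation."""
    delta, xhat, c_prev = d_min, 0.0, None
    bits, steps = [], []
    for s in x:
        c = 1 if (s - xhat) >= 0 else 0           # sign of d(n), known before Delta(n)
        if c_prev is not None:
            M = P if c == c_prev else Q           # M = P > 1 if c(n) = c(n-1), else Q < 1
            delta = min(max(delta * M, d_min), d_max)  # clamp to [d_min, d_max]
        xhat += delta if c else -delta
        bits.append(c)
        steps.append(delta)
        c_prev = c
    return np.array(bits), np.array(steps)

t = np.arange(2000)
x = np.sin(2 * np.pi * t / 100.0)   # slope exceeds d_min, so the step must grow
bits, steps = adm_encode(x)
```

With P = 2 and Q = 1/2 the product PQ = 1, which sits at the edge of the stability condition discussed on a later slide.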

Page 77:

Adaptive DM Performance

• Slope overload in LDM causes runs of 0’s or 1’s

• Granularity causes runs of alternating 0’s and 1’s

• The figure above shows how adaptive DM performs with P = 2, Q = 1/2, α = 1.

• During slope overload, step size increases exponentially to follow

increase in waveform slope.

• During granularity, step size decreases exponentially to Δmin and

stays there as long as the slope is small.

Page 78:

ADM Parameter Behavior

• ADM parameters are P, Q, Δmin

and Δmax

• Choose Δmax/ Δmin to maintain

high SNR over range of input

signal levels.

• Δmin should be chosen to

minimize idle channel noise.

• P and Q should satisfy PQ ≤ 1 for stability.

Page 79:

Comparison of LDM, ADM and of log

PCM

• ADM is 8 dB better SNR at 20

kbps than LDM, and 14 dB

better SNR at 60 kbps than

LDM.

• ADM gives a 10 dB increase in

SNR for each doubling of the bit

rate; LDM gives about 6 dB.

• For bit rate below 40 kbps, ADM

has higher SNR than µ-law

PCM; for higher bit rates log

PCM has higher SNR.

Page 80:

Higher Order Prediction in DM

Page 81:

Waveform Coding versus Block

Processing

• Waveform coding

– sample-by-sample matching of waveforms

– coding quality measured using SNR

• Source modeling (block processing)

– block processing of signal => vector of outputs every block

– overlapped blocks

Page 82:

Adaptive Predictive Coder

Transmitter

Receiver

Page 83:

Adaptive Predictive Coder

• The use of adaptive long-term prediction in addition to short-term prediction provides additional coding gain (at the expense of higher complexity) and high-quality speech at 16 kbits/s. The long-term (long-delay) predictor is

A_L(z) = Σ_j a_j z^-(M+j)

• It provides the pitch (fine) structure of the short-time voiced spectrum. The index M is the pitch period in samples and j is a small integer. The long-term predictor (ideally) removes the periodicity and thereby the redundancy.

• At the receiver, the synthesis filter associated with A_L(z) reintroduces the periodicity, while the synthesis filter associated with the short-term prediction polynomial represents the vocal tract.

• The parameters of the short-term predictor are computed for every frame (typically 10 to 30 ms). The long-term prediction parameters are computed more often.

Page 84:

Model-Based Speech Coding

• Waveform coding based on optimizing and maximizing SNR has been pushed about as far as it can go.

– achieved bit rate reductions on the order of 4:1 (i.e., from 128 kbps PCM to 32 kbps ADPCM) while still achieving toll-quality SNR for telephone-bandwidth speech

• To lower the bit rate further without reducing speech quality, we need to exploit features of the speech production model, including:

– source modeling

– spectrum modeling

– use of codebook methods for coding efficiency

• We also need a new way of comparing the performance of waveform and model-based coding methods:

– an objective measure like SNR isn't appropriate for model-based coders, since they operate on blocks of speech and don't follow the waveform on a sample-by-sample basis

– new subjective measures are needed that assess perceived quality, intelligibility, and robustness to multiple factors

Page 85:

Frequency Domain Speech Coding • All frequency domain methods for speech coding

exploit the Short-Time Fourier Transform using a

filter bank view with scalar quantization

• Sub-band Coding-use small number of filters with

wide and overlapping bandwidths

• 2-band sub-band coder

– advantage of sub-band coder is that the quantization noise

is limited to the sub-band that generated it => better

perceptual control of noise spectrum

– with careful design of filters, can get complete cancellation

of quantization noise that leaks across bands => use QMF (Quadrature Mirror Filters)

– can continue to split lower bands into 2 bands, giving

octave band filter bank => auditory front-end like analysis.

Page 86:

Sub-band coders

• Exploit the frequency sensitivity of the auditory

system.

• Split the signal into sub-band using band pass

filters.

• Code each sub-band at an appropriate resolution

– e.g. 4 bits per sample in the lower sub-bands and

– 2 bits per sample in the upper sub-bands

• Can also exploit auditory masking

– use fewer bits if a neighbouring sub-band is much louder

• Basis for the MPEG audio standard (5:1

compression of CD quality audio with no perceptual

degradation)

Page 87:

Sub-band coder

Page 88:

The 16 kbits/s SBC compared favorably against 16 kbits/s ADPCM, and the 9.6 kbits/s SBC compared favorably against 10.3 and 12.9 kbits/s ADM.

The low-band filters in speech-specific implementations are usually given narrower bandwidths so that they can resolve more accurately the low-frequency narrowband formants. In the absence of quantization noise, perfect reconstruction can be achieved using Quadrature-Mirror Filter (QMF) banks.
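The perfect-reconstruction property can be demonstrated with the simplest orthogonal two-band pair, the Haar filters, written here in polyphase form. This is an illustrative stand-in: real sub-band coders use longer QMF designs, and the function names are assumptions:

```python
import numpy as np

def analysis(x):
    """Two-band orthogonal (Haar) filter bank: lowpass average and highpass
    difference, each decimated by 2 -- same total sample count as the input."""
    e, o = x[0::2], x[1::2]
    low  = (e + o) / np.sqrt(2.0)
    high = (e - o) / np.sqrt(2.0)
    return low, high

def synthesis(low, high):
    """Inverse bank: reinterleave the even/odd samples recovered from the bands."""
    e = (low + high) / np.sqrt(2.0)
    o = (low - high) / np.sqrt(2.0)
    x = np.empty(2 * len(low))
    x[0::2], x[1::2] = e, o
    return x

rng = np.random.default_rng(1)
x = rng.standard_normal(512)
low, high = analysis(x)         # in a coder, each band would be quantized here
xr = synthesis(low, high)       # with no quantization, xr equals x exactly
```

Quantizing `low` and `high` at different resolutions between analysis and synthesis is exactly the sub-band coding idea of the preceding slides, with the quantization noise of each band confined to that band.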

Page 89:

AT&T Sub-band coder

Page 90:

AT&T Sub-band coder

• The AT&T SBC was used for voice storage at 16 or 24

kbits/s and consists of a five-band non-uniform tree-structured QMF bank in conjunction with APCM coders.

• A silence compression algorithm is also part of the

standard. The frequency ranges for each band are: 0-0.5

kHz, 0.5-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. For the 16

kbits/s implementation the bit allocations are {4/4/2/2/0}

and for the 24 kbits/s the bit assignments are {5/5/4/3/0}.

The one-way delay of this coder is less than 18 ms. It

must be noted that although this coder was the

workhorse for the older AT&T voice store and forward

machines, the most recent AT&T machines use the new

16 kbits/s Low Delay CELP algorithm.
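The bit-rate arithmetic behind the {4/4/2/2/0} and {5/5/4/3/0} allocations above can be checked directly, assuming each sub-band is critically sampled at twice its bandwidth (the helper name is illustrative):

```python
# Band edges from the slide: 0-0.5, 0.5-1, 1-2, 2-3, and 3-4 kHz
bandwidth_hz = [500, 500, 1000, 1000, 1000]
bits_16k = [4, 4, 2, 2, 0]
bits_24k = [5, 5, 4, 3, 0]

def sbc_rate(bandwidths, bits):
    # critical sampling: each band is sampled at 2 x its bandwidth
    return sum(2 * bw * b for bw, b in zip(bandwidths, bits))

rate_16k = sbc_rate(bandwidth_hz, bits_16k)   # -> 16000 bps
rate_24k = sbc_rate(bandwidth_hz, bits_24k)   # -> 24000 bps
```

The 3-4 kHz band gets 0 bits because telephone-bandwidth speech carries little energy there.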

Page 91:

G.722 Sub-band coder

Page 92:

CCITT standard (G.722)

• The CCITT standard (G.722) for 7kHz audio at 64 kbits/s for

ISDN teleconferencing is based on a two-band sub-

band/ADPCM coder.

• The low-frequency sub-band is quantized at 48 kbits/s while the

high-frequency sub-band is coded at 16 kbits/s.

• The G.722 coder includes an adaptive bit allocation scheme

and an auxiliary data channel.

• Provisions for lower rates have been made by quantizing the

low-frequency sub-band at 40 kbits/s or at 32 kbits/s.

• The MOS at 64 kbits/s is greater than four for speech and

slightly less than four for music signals, and the analysis-

synthesis QMF banks introduce a delay of less than 3 ms.

Page 93:

Introduction to VQ

• Vector quantization (VQ) is a lossy data

compression method

• based on the principle of block coding.

• It is a fixed-to-fixed length algorithm.

• In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ

design algorithm based on a training sequence.

• The use of a training sequence bypasses the need for multi-dimensional integration. This algorithm is referred to as LBG-VQ.

Page 94:

• Toll quality speech coder (digital wireline phone)

– G.711 (A-LAW and μ-LAW at 64 kbits/sec)

– G.721 (ADPCM at 32 kbits/sec)

– G.723 (ADPCM at 40, 24 kbps)

– G.726 (ADPCM at 16, 24, 32, 40 kbps)

• Low bit rate speech coder (cellular phone/IP phone)

– G.728 low delay (16 kbps, delay <2ms, same or better quality

than G.721)

– G.723.1 (CELP based, 5.3 and 6.3 kbits/sec)

– G.729 (CELP based, 8 kbits/sec)

– GSM 06.10 (13 and 6.5 kbits/sec, simple to implement, used in

GSM phones)

Page 95:

1-D VQ

• A VQ is nothing more than an approximator, similar to "rounding off". An example of a 1-dimensional VQ is shown below.

• Here, every number less than -2 is approximated by -3; every number between -2 and 0 is approximated by -1; every number between 0 and 2 is approximated by +1; every number greater than 2 is approximated by +3. Note that the approximate values are uniquely represented by 2 bits. This is a 1-dimensional, 2-bit VQ. It has a rate of 2 bits/dimension.
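The 1-D quantizer above is small enough to write out in full. A minimal sketch (the function name is illustrative, and ties at the thresholds are broken toward the upper level, a convention the slide leaves open):

```python
def vq1d(x):
    """2-bit, 1-D VQ: codebook {-3, -1, +1, +3}, thresholds at -2, 0, +2."""
    if x < -2:
        return -3
    if x < 0:
        return -1
    if x < 2:
        return +1
    return +3

# one input per encoding region
codes = [vq1d(v) for v in (-5.0, -1.2, 0.7, 9.0)]   # -> [-3, -1, 1, 3]
```

Each of the four output values can be indexed with 2 bits, which is what makes the rate 2 bits/dimension.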

Page 96:

2-D VQ

• An example of a 2-

dimensional VQ is shown

below:

• Here, every pair of numbers

falling in a particular region

are approximated by a red star

associated with that region.

Note that there are 16 regions

and 16 red stars -- each of

which can be uniquely

represented by 4 bits. Thus,

this is a 2-dimensional, 4-bit

VQ. Its rate is also 2

bits/dimension.

Page 97:

• In the above two examples, the red stars are called

codevectors and the regions defined by the blue

borders are called encoding regions. The set of all

codevectors is called the codebook and the set of all

encoding regions is called the partition of the space.

Page 98:

Design Problem

• Given a

– vector source with its statistical properties known

– a distortion measure

– the number of codevectors

• To find

– codebook (the set of all red stars)

– partition (the set of blue lines) which result in the

smallest average distortion.

Page 99:

Design Problem Cont.

• Assume that there is a training sequence consisting of M source vectors: T = { x1, x2, ..., xM }

• This training sequence can be obtained from some large database.

– For example, if the source is a speech signal, then the training sequence can be obtained by recording several long telephone conversations.

– M is assumed to be sufficiently large that all the statistical properties of the source are captured by the training sequence.

Page 100:

Design Problem Cont.

• Assume that the source vectors are k-dimensional.

• Let N be the number of codevectors and let C = { c1, c2, ..., cN } represent the codebook.

• Each codevector is k-dimensional, e.g., cn = ( c_{n,1}, c_{n,2}, ..., c_{n,k} )

• Let Sn be the encoding region associated with codevector cn, and let P = { S1, S2, ..., SN } denote the partition of the space.

Page 101:

Design Problem Cont.

• If the source vector xm is in the encoding region Sn, then its

approximation (denoted by Q(xm)) is cn.

• Assuming a squared-error distortion measure, the average distortion is given by

Dave = (1/(Mk)) Σ_{m=1..M} ||xm - Q(xm)||²

• The design problem can be succinctly stated as follows: given T and N, find C and P such that Dave is minimized.

Page 102:

Optimality Criteria

• If C and P are a solution to the above minimization problem, then they must satisfy the following two criteria.

• Nearest Neighbor Condition:

Sn = { x : ||x - cn||² ≤ ||x - cn'||² for all n' = 1, 2, ..., N }

– This condition says that the encoding region Sn should consist of all vectors that are closer to cn than to any other codevector. For vectors lying on a boundary (blue lines), any tie-breaking procedure will do.

• Centroid Condition:

cn = ( Σ_{xm ∈ Sn} xm ) / ( Σ_{xm ∈ Sn} 1 ),  n = 1, 2, ..., N

– This condition says that the codevector cn should be the average of all training vectors in encoding region Sn. In implementation, one should ensure that at least one training vector belongs to each encoding region (so that the denominator above is never 0).

Page 103:

LBG Design Algorithm

• Iterative algorithm which solves the two optimality criteria.

• Requires initial codebook C(0).

• The initial codebook is obtained by the splitting method.

• In this method, an initial codevector is set as the average of the entire training sequence.

• This codevector is then split into two. The iterative algorithm is run with these two vectors as the initial codebook. The final two codevectors are splitted into four and the process is repeated until the desired number of codevectors is obtained.

Page 104:

LBG Design Algorithm Cont.

1. Given T. Fix ε > 0 to be a "small" number.

2. Let N = 1 and calculate

c1* = (1/M) Σ_{m=1..M} xm,   D*ave = (1/(Mk)) Σ_{m=1..M} ||xm - c1*||²

3. Splitting: for i = 1, 2, ..., N, set

ci(0) = (1+ε) ci*,   c_{N+i}(0) = (1-ε) ci*

Set N = 2N.

4. Iteration: let D(0)ave = D*ave and set the iteration index i = 0.

i. For m = 1, 2, ..., M, find the minimum value of ||xm - cn(i)||² over all n = 1, 2, ..., N. Let n* be the index which achieves the minimum, and set Q(xm) = c_{n*}(i).

Page 105:

LBG Design Algorithm Cont.

ii. For n = 1, 2, ..., N, update the codevector:

cn(i+1) = ( Σ_{Q(xm)=cn(i)} xm ) / ( Σ_{Q(xm)=cn(i)} 1 )

iii. Set i = i + 1.

iv. Calculate

D(i)ave = (1/(Mk)) Σ_{m=1..M} ||xm - Q(xm)||²

v. If (D(i-1)ave - D(i)ave) / D(i-1)ave > ε, go back to Step (i).

vi. Set D*ave = D(i)ave. For n = 1, 2, ..., N, set cn* = cn(i) as the final codevectors.

5. Repeat Steps 3 and 4 until the desired number of codevectors is obtained.
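The splitting and iteration steps above can be sketched compactly in NumPy. This is a simplified illustration, not a production implementation: the function name and test data are assumptions, empty cells simply keep their old codevector, and the distortion here is averaged per training vector (the slides additionally normalize by the dimension k):

```python
import numpy as np

def lbg(train, N, eps=1e-3):
    """LBG sketch: split from the global centroid, then Lloyd-iterate
    (nearest-neighbor + centroid conditions) until the relative drop
    in average distortion falls below eps."""
    C = train.mean(axis=0, keepdims=True)            # step 2: one-vector codebook
    Dave = None
    while C.shape[0] < N:
        C = np.vstack([C * (1 + eps), C * (1 - eps)])  # step 3: splitting
        prev = None
        while True:
            # nearest-neighbor condition: assign each vector to its closest codevector
            d2 = ((train[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            idx = d2.argmin(axis=1)
            Dave = d2[np.arange(len(train)), idx].mean()
            if prev is not None and (prev - Dave) / prev <= eps:
                break                                 # step v: converged
            prev = Dave
            # centroid condition: keep the old codevector if its cell is empty
            for n in range(C.shape[0]):
                if np.any(idx == n):
                    C[n] = train[idx == n].mean(axis=0)
    return C, Dave

rng = np.random.default_rng(2)
# two well-separated 2-D clusters: a 2-vector codebook should find both centers
train = np.vstack([rng.normal((1.0, 1.0), 0.3, (200, 2)),
                   rng.normal((5.0, 5.0), 0.3, (200, 2))])
C, Dave = lbg(train, N=2)
```

On this toy training set the two codevectors converge to the two cluster centers, and the final distortion is roughly the within-cluster variance.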

Page 106:

Performance

• The performance of a VQ is typically given in terms of the signal-to-distortion ratio (SDR):

SDR = 10 log10 ( σ² / Dave )  (in dB),

• where σ² is the variance of the source and Dave is the average squared-error distortion. The higher the SDR, the better the performance.

Page 107:

Toy Example of VQ Coding

• 2-pole model of the vocal tract => 4 reflection coefficients

• 4-possible vocal tract shapes => 4 sets of reflection coefficients

• 1. Scalar Quantization-assume 4 values for each reflection coefficient => 2-bits x 4 coefficients = 8 bits/frame

• 2. Vector Quantization-only 4 possible vectors => 2-bits to choose which of the 4 vectors to use for each frame (pointer into a codebook)

• this works because the scalar components of each vector are highly correlated.

• if scalar components are independent => VQ offers no advantage over scalar quantization

Page 108:

Comparison of Scalar and Vector

Quantization

Page 109:

VQ Codebook of LPC Vectors

64 vectors

in a

codebook

of spectral

Shapes

10-bit VQ is

comparable

to 24-bit

scalar

quantization.

Page 110:

VQ Properties

• In VQ a cell can have arbitrary size and shape;

• In scalar quantization a decision region can have arbitrary size, but its shape is fixed

• VQ has a distortion measure which is a measure of distance between the input and output used to both design the codebook vectors and to choose the optimal reconstruction vector

Iterative VQ Design Algorithm

• 1. assume initial set of points given

– map all vectors to best set of points

– recompute centroids from ensemble vectors in each cell

• 2. iterate until change in reconstruction levels is small

• Problem-need to know px(x) to correctly compute centroids

• Solution-use training set as an estimate of ensemble, k-mean clustering algorithm.

Page 111:

VQ Performance

• The simplest form of a vector quantizer can be considered as a generalization of the scalar PCM and is called Vector PCM (VPCM). In VPCM, the codebook is fully searched (full search VQ or F-VQ) for each incoming vector. The number of bits per sample in VPCM is given by

– B =(log2N)/k

• and the signal-to-noise ratio for VPCM is given by

– SNRk = 6B + Kk (dB)

• VPCM yields improved SNR since it exploits the correlation within the vectors. In the case of speech coding, it is reported that K2 is larger than K1 by more than 3 dB while K8 is larger than K1 by more than 8 dB.
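The VPCM rate formula above is just bookkeeping and can be checked directly (the helper name and the N = 1024, k = 8 example are illustrative):

```python
import math

# For full-search VPCM, an N-vector codebook of dimension k spends
# B = log2(N)/k bits per sample.
def vpcm_bits_per_sample(N, k):
    return math.log2(N) / k

B = vpcm_bits_per_sample(N=1024, k=8)   # 10-bit codebook, 8-dim vectors -> 1.25
entries_searched = 2 ** (B * 8)         # a full search visits all 2^(B*k) = N entries
```

The `entries_searched` term previews the complexity issue discussed on the next slide: at a fixed rate B, the codebook size grows exponentially with the dimension k.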

Page 112:

VQ Performance

• Even though VQ offers significant coding gain by

increasing N and k its memory and computational

complexity grows exponentially with k for a given

rate.

• More specifically, the number of computations

required for F-VQ (full search VQ) is of the order of

2Bk while the number of memory locations required is

k 2Bk.

• In general the benefits of VQ are realized at rates of 1

bit per sample.

Page 113:

GS-VQ

• The complexity of VQ can also

be reduced by normalizing the

vectors of the codebook and

encoding the gain separately.

The technique is called

Gain/Shape VQ (GS-VQ) and

has been introduced by Buzo

and later studied by Sabin and

Gray.

• The waveform shape is

represented by a codevector

from the shape codebook while

the gain can be encoded from

the gain codebook.

Comparisons of GS-VQ with F-VQ in the case of speech

coding at one bit per sample

revealed that GS-VQ yields

about 0.7 dB improvement at

the same level of complexity.

Encoder

Decoder

Page 114:

Adaptive VQ

• The VQ methods discussed thus far use time-invariant (fixed) codebooks. Since speech is a non-stationary process, one would like to adapt the codebooks ("codebook design on the fly") to its changing statistics.

• VQ with adaptive codebooks is called adaptive VQ (A-VQ), and applications to speech coding have been reported.

• There are two types of A-VQ, namely forward adaptive and backward adaptive.

• In backward adaptive VQ, codebook updating is based on past data which is also available at the decoder.

• Forward A-VQ updates the codebooks based on current (or sometimes future) data, and as such additional information must be encoded.

• The principles of forward and backward A-VQ are similar to those of scalar adaptive quantization.

• Practical A-VQ systems are backward adaptive and can be classified into vector predictive quantizers and finite-state quantizers. Vector predictive coders are essentially an extension of scalar predictive DPCM coders.

Page 115:

Implementation Issues

• The complexity in high-dimensionality VQ can be reduced significantly

with the use of structured codebooks which allow for efficient search.

• Tree-structured and multistep vector quantizers are associated with

lower encoding complexity at the expense of loss of performance and in

some cases increased memory requirements.

• Gray and Abut compared the performance of F-VQ and binary tree

search VQ for speech coding and reported a degradation of 1 dB in the

SNR for tree-structured VQ.

• Multistep vector quantizers consist of a cascade of two or more

quantizers each one encoding the error or residual of the previous

quantizer.

• Gersho and Cuperman compared the performance of full search

(dimension 4) and multistep vector quantizers (dimension 12) for

encoding speech waveforms at 1 bit per sample and reported a gain of 1

dB in the SNR in the case of multistep VQ.

Page 116:

I SEMESTER M. TECH. SESSIONAL TEST 2005-06

VOICE AND PICTURE CODING (EL-653)

Differentiate between vowels and diphthongs. (4)

Calculate the pitch in mel for a 5000 Hz signal. (4)

Prove that the optimum quantizer output level lies at the centroid of the probability density

For an AT&T sub-band coder the frequency ranges for each band are: 0-0.5 kHz, 0.5-1 kHz,

1-2 kHz, 2-3 kHz, and 3-4 kHz. For the bit allocation {4/4/2/2/0} calculate the bit rate. Explain

why 0 bit is used for the 3-4 kHz?

Ans: 4×1 + 4×1 + 2×2 + 2×2 = 16 kbps; speech energy above about 3.4 kHz is negligible (only ~3.4 kHz of bandwidth is used), so the 3-4 kHz band is allocated 0 bits.