CODING OF SPEECH SIGNALS
Speech Coding
• Waveform Coding
• Analysis-Synthesis or Vocoders
• Hybrid Coding
Waveform Coding: an attempt is made to preserve the original waveform.
Vocoders: a theoretical model of the speech production mechanism is
considered.
Hybrid Coding: uses techniques from the other two.
Speech Coders
Speech Quality Vs Bit Rate for
Codecs
From: J. Woodard, “Speech coding overview”,
http://www-mobile.ecs.soton.ac.uk/speech_codecs
Speech Coding Objectives
– High perceived quality
– High measured intelligibility
– Low bit rate (bits per second of speech)
– Low computational requirement (MIPS)
– Robustness to successive encode/decode cycles
– Robustness to transmission errors
Objectives for real-time only:
– Low coding/decoding delay (ms)
– Work with non-speech signals (e.g. touch tone)
Speech Information Rates
• Fundamental level:
• 10-15 phonemes/second for continuous speech.
• 32-64 phonemes per language => 6 bits/phoneme.
• Information Rate=60-90 bps at the source.
• Waveform level
• Speech bandwidth from 4 – 10 kHz => sampling rate from 8 –20 kHz.
• Need 12-16 bit quantization for high quality digital coding.
• Information Rate=96-320 kbps => more than 3 orders of magnitude difference in Information rates between the production and waveform levels.
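The arithmetic behind these two levels can be checked in a few lines. This is only a sketch: the function names are illustrative, and it assumes 64 phonemes (6 bits each) for the source-level estimate.

```python
import math

def phoneme_rate_bps(phonemes_per_second, phonemes_in_language):
    """Source-level rate: phonemes/second times bits needed to index one phoneme."""
    bits_per_phoneme = math.ceil(math.log2(phonemes_in_language))
    return phonemes_per_second * bits_per_phoneme

def waveform_rate_bps(sampling_rate_hz, bits_per_sample):
    """Waveform-level rate: samples/second times bits per sample."""
    return sampling_rate_hz * bits_per_sample

source_low = phoneme_rate_bps(10, 64)        # 10 phonemes/s, 6 bits each
source_high = phoneme_rate_bps(15, 64)       # 15 phonemes/s, 6 bits each
wave_low = waveform_rate_bps(8000, 12)       # 8 kHz, 12 bits
wave_high = waveform_rate_bps(20000, 16)     # 20 kHz, 16 bits
ratio = wave_low / source_high               # smallest gap between the two levels
```

Even the smallest waveform rate divided by the largest source rate exceeds 1000, which is the "3 orders of magnitude" gap quoted above.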
MOS (Mean Opinion Scores)
• Why MOS:
– SNR is just not good enough as a subjective measure for most coders (especially model-based coders, where the waveform is inherently not preserved)
– noise is not simple white (uncorrelated) noise
– error is signal correlated
• clicks/transients
• frequency dependent spectrum—not white
• includes components due to reverberation and echo
• noise comes from at least two sources, namely quantization and background noise
• delay due to transmission, block coding, processing
• transmission bit errors—can use Unequal Protection Methods
• tandem encodings
MOS Ratings
• Subjective evaluation of speech quality
– 5—excellent
– 4—good
– 3—fair
– 2—poor
– 1—bad
• MOS Scores:
– 4.5 for natural wideband speech
– 4.05 for toll quality telephone speech
– 3.5-4.0 for communications quality telephone speech
– 2.0-3.5 for lower quality speech from synthesizers, low bit rate coders, etc
• Other measures of intelligibility:
– DRT (diagnostic rhyme test) => uses rhyming words
– DAM (diagnostic acceptability measure)
Digital Speech Coding
Sampling Theorem
• Theorem: If the highest frequency contained in an analog signal xa(t) is Fmax = B, and the signal is sampled at a frequency Fs > 2B, then the analog signal can be exactly recovered from its samples using the following reconstruction formula, where T = 1/Fs is the sampling period:
$$x_a(t) = \sum_{n=-\infty}^{\infty} x_a(nT)\,\frac{\sin[(\pi/T)(t - nT)]}{(\pi/T)(t - nT)}$$
• Note that at the original sample instants (t = nT), the reconstructed analog signal is equal to the value of the original analog signal. At times between the sample instants, the signal is the weighted sum of shifted sinc functions.
GRAPHICAL INTERPRETATION OF THE SAMPLING THEOREM
RECONSTRUCTION VIA SINC(X) INTERPOLATION
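A minimal sketch of sinc interpolation, assuming a finite block of samples (so the infinite sum is truncated) and a 500 Hz test tone sampled at 8 kHz; both choices are illustrative and not from the slides.

```python
import math

def sinc_reconstruct(samples, T, t):
    """Evaluate the (truncated) reconstruction sum at time t, given samples x_a(nT)."""
    total = 0.0
    for n, x_n in enumerate(samples):
        arg = math.pi * (t - n * T) / T
        # sin(arg)/arg tends to 1 as arg tends to 0
        total += x_n if abs(arg) < 1e-12 else x_n * math.sin(arg) / arg
    return total

Fs = 8000.0              # sampling rate, well above 2B for the tone below
T = 1.0 / Fs
f0 = 500.0               # test tone frequency in Hz
samples = [math.sin(2 * math.pi * f0 * n * T) for n in range(400)]

at_sample = sinc_reconstruct(samples, T, 201 * T)       # on a sample instant
between = sinc_reconstruct(samples, T, 201.5 * T)       # halfway between two samples
exact_between = math.sin(2 * math.pi * f0 * 201.5 * T)  # true analog value there
```

On a sample instant the sum returns the sample itself. Between samples it approaches the analog value, with a small residual error caused by truncating the sum to 400 terms.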
TYPICAL SAMPLING FREQUENCIES IN SPEECH RECOGNITION
• 8 kHz: Popular in digital telephony. Provides
coverage of first three formants for most
speakers and most sounds.
• 16 kHz: Popular in speech research. Why?
• Sub 8 kHz Sampling: Can aliasing be useful in
speech recognition? Hint: Consumer
electronics.
Problems
• Sampling theorem for bandlimited signals.
• How to change the sample rate of a signal?
• How this can be implemented using time domain
interpolation (based on the Sampling Theorem)?
• How this can be implemented efficiently using
digital filters?
Digital Speech Coding
Speech Probability Density Function
• The probability density function for x(n) is the same as for xa(t), since x(n) = xa(nT) => the mean and variance are the same for both x(n) and xa(t).
• Need to estimate probability density and power spectrum from speech waveforms
– probability density estimated from long term histogram of amplitudes
– a good approximation is a gamma distribution of the form:
$$p(x) = \left[\frac{\sqrt{3}}{8\pi\sigma_x|x|}\right]^{1/2} e^{-\sqrt{3}|x|/(2\sigma_x)}$$
– a simpler approximation is the Laplacian density, of the form:
$$p(x) = \frac{1}{\sqrt{2}\,\sigma_x}\, e^{-\sqrt{2}|x|/\sigma_x}$$
Measured Speech Densities
• Distribution normalized so mean is 0 and variance is 1 (x̄ = 0, σx = 1)
• Gamma density more closely
approximates measured
distribution for speech than
Laplacian
• Laplacian is still a good
model and is used in
analytical studies
• Small amplitudes much
more likely than large
amplitudes by 100:1 ratio.
Speech AC and Power Spectrum
• Can estimate long term autocorrelation and power spectrum using time-series analysis methods
$$\hat{\phi}(m) = \frac{1}{L}\sum_{n=0}^{L-1-|m|} x(n)\,x(n+|m|), \quad 0 \le |m| \le L-1$$
where L is a large integer
• 8kHz sampled speech for several
speakers
• High correlation between
adjacent samples
• Low pass speech more highly
correlated than bandpass
speech
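A small sketch of the long-term autocorrelation estimate. No speech data is available here, so white noise passed through a one-pole low-pass filter (coefficient 0.9, an arbitrary choice) stands in for low-pass speech; the function name is illustrative.

```python
import random

def autocorrelation(x, m):
    """Estimate phi(m) = (1/L) * sum over n of x(n) * x(n + |m|)."""
    L = len(x)
    m = abs(m)
    return sum(x[n] * x[n + m] for n in range(L - m)) / L

# Stand-in signal (assumption): low-pass filtered white Gaussian noise.
rng = random.Random(1)
signal = []
previous = 0.0
for _ in range(20000):
    previous = 0.9 * previous + rng.gauss(0.0, 1.0)
    signal.append(previous)

# Normalized autocorrelation at lags 1 and 10.
rho1 = autocorrelation(signal, 1) / autocorrelation(signal, 0)
rho10 = autocorrelation(signal, 10) / autocorrelation(signal, 0)
```

For this stand-in signal the lag-1 value comes out close to the filter coefficient, illustrating the high correlation between adjacent samples of a low-pass signal.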
Instantaneous Quantization
• Separating the processes of sampling and
quantization
• Assume x(n) obtained by sampling a bandlimited
signal at a rate at or above the Nyquist rate.
• Assume x(n) is known to infinite precision in
amplitude
• Need to quantize x(n) in some suitable manner.
Quantization and Encoding
Coding is a two-stage process
• Quantization process: x(n) → x’(n)
• Encoding process: x’(n) → c(n)
where Δ is the (assumed fixed) quantization step size
• Decoding is a single-stage process
• Decoding process: c’(n) → x’’(n)
• If c’(n) = c(n) (no errors in transmission), then x’’(n) = x’(n)
• x’’(n) ≠ x(n) => coding and quantization lose information.
B-bit Quantization
• Use B-bit binary numbers to represent the quantized samples => 2^B quantization levels
• Information rate of coder: I = B·Fs = total bit rate in bits/second
• B=16, Fs = 8 kHz => I = 128 kbps
• B=8, Fs = 8 kHz => I = 64 kbps
• B=4, Fs = 8 kHz => I = 32 kbps
• Goal of waveform coding is to get the highest quality at a fixed value of I (kbps), or equivalently to get the lowest value of I for a fixed quality.
• Since Fs is fixed, need most efficient quantization methods to minimize I.
Quantization Basics
• Assume |x(n)| ≤Xmax (possibly ∞)
• For Laplacian density (where Xmax=∞), can show that
0.35% of the samples fall outside the range -4σx ≤
x(n) ≤ 4σx => large quantization errors for 0.35% of
the samples.
• Can safely assume that Xmax is proportional to σx.
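The 0.35% figure can be verified directly: integrating the Laplacian density beyond ±cσx gives P(|x| > cσx) = exp(-√2·c). A minimal sketch (the function name is illustrative):

```python
import math

def laplacian_tail_probability(c):
    """P(|x| > c * sigma_x) for a zero-mean Laplacian density."""
    return math.exp(-math.sqrt(2.0) * c)

p4 = laplacian_tail_probability(4.0)   # fraction of samples outside +/- 4 sigma_x
```

For c = 4 this evaluates to roughly 0.0035, matching the 0.35% quoted above.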
Quantization Process
Uniform Quantizer
• The quantization range and levels are chosen so that the signal can easily be processed digitally
Mid-Riser and Mid-Tread Quantizers
• mid-riser
– origin (x=0) in middle of rising part of the staircase
– same number of positive and negative levels
– symmetrical around origin.
• mid-tread
– origin (x=0) in middle of quantization level
– one more negative level than positive
– one quantization level of 0 (where a lot of activity occurs)
• Code words have direct numerical significance (sign-magnitude
representation for mid-riser, two’s complement for mid-tread).
• Uniform quantizers are characterized by:
– number of levels: 2^B (B bits)
– quantization step size: Δ
• If |x(n)| ≤ Xmax and x(n) has a symmetric density, then
Δ·2^B = 2Xmax
Δ = 2Xmax / 2^B
• If we let
x’(n) = x(n) + e(n)
• with x(n) the unquantized speech sample and e(n) the quantization error, then
-Δ/2 ≤ e(n) ≤ Δ/2
Quantization Noise Model
• Quantization noise is a zero-mean, stationary white noise process.
E[e(n)e(n+m)] = σe², m = 0
= 0 otherwise
• Quantization noise is uncorrelated with the input signal
E[x(n)e(n+m)] = 0 for all m
• Distribution of quantization errors is uniform over each quantization interval
pe(e) = 1/Δ, -Δ/2 ≤ e ≤ Δ/2
= 0 otherwise
=> ē = 0, σe² = Δ²/12
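The two uniform quantizer types and the Δ²/12 noise variance can be illustrated with a short simulation. This is a sketch under stated assumptions: the input is uniformly distributed over ±Xmax (not speech), B = 8, and the function names are illustrative.

```python
import math
import random

def quantize_mid_riser(x, delta, bits):
    """Mid-riser: levels at odd multiples of delta/2, 2**bits levels in total."""
    half = 2 ** (bits - 1)
    k = math.floor(x / delta)
    k = max(-half, min(half - 1, k))        # clip to the available levels
    return (k + 0.5) * delta

def quantize_mid_tread(x, delta, bits):
    """Mid-tread: levels at integer multiples of delta, including a zero level."""
    half = 2 ** (bits - 1)
    k = math.floor(x / delta + 0.5)
    k = max(-half, min(half - 1, k))        # one more negative level than positive
    return k * delta

bits = 8
x_max = 1.0
delta = 2 * x_max / 2 ** bits               # step size from delta * 2^B = 2 * Xmax

rng = random.Random(0)
xs = [rng.uniform(-x_max, x_max) for _ in range(100000)]
errors = [quantize_mid_riser(x, delta, bits) - x for x in xs]
noise_var = sum(e * e for e in errors) / len(errors)   # measured sigma_e^2
predicted_var = delta ** 2 / 12                        # model value
```

With no clipping, every error stays within ±Δ/2 and the measured noise variance lands close to Δ²/12.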
SNR for Quantization
Review of Quantization Assumptions
• Input signal fluctuates in a complicated manner so a
statistical model is valid.
• Quantization step size is small enough to remove
any signal correlated patterns in quantization error.
• Range of quantizer matches peak-to-peak range of
signal, utilizing full quantizer range with essentially
no clipping.
• For a uniform quantizer with a peak-to-peak range of
±4σx, the resulting SNR(dB) is 6B-7.2.
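The 6B - 7.2 figure follows from the noise model in a few lines. A sketch of the arithmetic, assuming the uniform-noise variance Δ²/12, Δ = 2Xmax/2^B, and Xmax = 4σx as stated above:

```latex
\begin{aligned}
\mathrm{SNR} &= \frac{\sigma_x^2}{\sigma_e^2}, \qquad
  \sigma_e^2 = \frac{\Delta^2}{12}, \qquad
  \Delta = \frac{2X_{max}}{2^B} \\
\mathrm{SNR} &= \frac{12\,\sigma_x^2\,2^{2B}}{4X_{max}^2}
  = \frac{3\cdot 2^{2B}}{(X_{max}/\sigma_x)^2} \\
\mathrm{SNR(dB)} &= 20B\log_{10}2 + 10\log_{10}3 - 20\log_{10}\frac{X_{max}}{\sigma_x}
  \approx 6.02B + 4.77 - 20\log_{10}\frac{X_{max}}{\sigma_x} \\
X_{max} = 4\sigma_x \;&\Rightarrow\; \mathrm{SNR(dB)} \approx 6.02B + 4.77 - 12.04 \approx 6B - 7.2
\end{aligned}
```

Each extra bit therefore adds about 6 dB, and each doubling of Xmax/σx costs about 6 dB.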
Instantaneous Companding
• In order to get constant percentage error (rather than
constant variance error), need logarithmically spaced
quantization levels
– quantize logarithm of input signal rather than input signal
itself
Insensitivity to Signal Level
y(n) = ln|x(n)|
x(n) = exp[y(n)]·sign[x(n)]
where sign[x(n)] = -1, x(n) ≤ 0
= +1, x(n) > 0
The quantized log magnitude is
y’(n) = Q[ln|x(n)|]
= ln|x(n)| + e(n), where e(n) is a new error signal
Assume that e(n) is independent of ln|x(n)|. The inverse is
x’(n) = exp[y’(n)]·sign[x(n)]
= x(n)·exp[e(n)]
Assume e(n) to be small, then
exp[e(n)] ≈ 1 + e(n) + …
x’(n) ≈ x(n)[1 + e(n)] = x(n) + e(n)x(n) = x(n) + f(n)
Since x(n) and e(n) are independent,
σf² = σx²·σe²
SNR = σx² / σf² = 1 / σe²
so the SNR does not depend on the signal level σx.
Pseudo-Logarithmic Compression
• Unfortunately true logarithmic compression is not practical, since the dynamic range (ratio between the largest and smallest values) is infinite => need an infinite number of quantization levels.
• Need an approximation to logarithmic compression => µ-law/A-law compression.
µ-law Compression
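• y(n) = F[x(n)]
$$y(n) = X_{max}\,\frac{\log\left[1 + \mu\,|x(n)|/X_{max}\right]}{\log(1+\mu)}\;\mathrm{sign}[x(n)]$$
• When x(n) = 0, y(n) = 0
• When µ = 0, y(n) = x(n): no compression
• When µ is large and |x(n)| is large:
$$|y(n)| \approx \frac{X_{max}}{\log\mu}\,\log\left[\frac{\mu\,|x(n)|}{X_{max}}\right]$$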
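A minimal sketch of the µ-law compressor and its inverse (the expander), written directly from the compression formula. The function names and the defaults Xmax = 1 and µ = 255 are illustrative choices.

```python
import math

def mu_law_compress(x, x_max=1.0, mu=255.0):
    """y = Xmax * log(1 + mu*|x|/Xmax) / log(1 + mu) * sign(x)."""
    sign = -1.0 if x < 0 else 1.0
    return sign * x_max * math.log1p(mu * abs(x) / x_max) / math.log1p(mu)

def mu_law_expand(y, x_max=1.0, mu=255.0):
    """Inverse mapping: |x| = (Xmax/mu) * ((1 + mu)**(|y|/Xmax) - 1)."""
    sign = -1.0 if y < 0 else 1.0
    return sign * (x_max / mu) * math.expm1(abs(y) / x_max * math.log1p(mu))

small = mu_law_compress(0.01)                    # a low-level input is boosted
full = mu_law_compress(1.0)                      # full scale maps to full scale
round_trip = mu_law_expand(mu_law_compress(-0.3))
```

A uniform quantizer placed between the two functions then behaves like a non-uniform quantizer with finer steps at low amplitudes.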
SNR for µ-law Quantizer
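$$SNR(dB) = 6B + 4.77 - 20\log_{10}[\ln(1+\mu)] - 10\log_{10}\left[1 + \left(\frac{X_{max}}{\mu\sigma_x}\right)^2 + \sqrt{2}\,\frac{X_{max}}{\mu\sigma_x}\right]$$
• 6B dependence on B => good
• Much less dependence on Xmax/σx => good
• For large µ, SNR is less sensitive to changes in Xmax/σx => good
• µ-law has been used in wireline telephony for more than 30 years.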
Companding
µ-Law Companding
Quantization for Optimum SNR
• Goal is to match quantizer to actual signal density to
achieve optimum SNR
• µ-law tries to achieve constant SNR over a wide range of signal variances => some sacrifice in SNR performance compared with a quantizer whose step size is matched to the signal variance
• If σx is known, you can choose quantizer levels to minimize quantization error variance and maximise SNR.
Quantizer Levels for Maximum SNR
• Variance of quantization noise is:
• σe² = E[e²(n)] = E[(x’(n) - x(n))²]
• with x’(n) = Q[x(n)]. Assume M quantization levels
• [x’_{-M/2}, x’_{-M/2+1}, …, x’_{-1}, x’_1, …, x’_{M/2}]
• associating quantization level with signal intervals as:
• x’_j = quantization level for interval [x_{j-1}, x_j]
• For symmetric, zero-mean distributions with unbounded amplitudes (Xmax = ∞), it makes sense to define the boundary points
• x_0 = 0 (central boundary point), x_{±M/2} = ±∞
• The error variance is thus
$$\sigma_e^2 = \int_{-\infty}^{\infty} e^2\, p_e(e)\, de$$
Optimum Quantization Levels
Solution for Optimum Levels
• To solve for optimum values for {x’_i} and {x_i}, we differentiate σe² with respect to the parameters, set the derivatives to 0, and solve numerically:
$$\int_{x_{i-1}}^{x_i} (x'_i - x)\, p(x)\, dx = 0, \quad i = 1, 2, \ldots, M/2$$
$$x_i = \tfrac{1}{2}\,(x'_i + x'_{i+1}), \quad i = 1, 2, \ldots, M/2 - 1$$
Proof??
• With boundary conditions of x_0 = 0, x_{±M/2} = ±∞
• Can also constrain quantizer to be uniform and solve for the value of Δ that maximizes SNR
• Optimum boundary points lie halfway between the M/2 quantizer levels
• Optimum location of quantization level x’_i is at the centroid of the probability density over the interval x_{i-1} to x_i.
• Solve the above set of equations iteratively to obtain {x’_i}, {x_i}
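The iterative solution can be sketched as alternating the two conditions: put each boundary halfway between neighbouring levels, then move each level to the centroid of its interval. The sketch below replaces the integrals with averages over random samples (an assumption made to keep the code short) and uses a unit-variance Laplacian source with M = 4 levels; all names are illustrative.

```python
import math
import random

def lloyd_max(samples, num_levels, iterations=40):
    """Alternate the midpoint (boundary) and centroid (level) conditions."""
    data = sorted(samples)
    lo, hi = data[0], data[-1]
    width = (hi - lo) / num_levels
    levels = [lo + (i + 0.5) * width for i in range(num_levels)]   # uniform start
    for _ in range(iterations):
        boundaries = [(levels[i] + levels[i + 1]) / 2 for i in range(num_levels - 1)]
        sums = [0.0] * num_levels
        counts = [0] * num_levels
        cell = 0
        for x in data:                      # data is sorted, so cell only moves up
            while cell < num_levels - 1 and x > boundaries[cell]:
                cell += 1
            sums[cell] += x
            counts[cell] += 1
        # centroid of each interval; an empty interval keeps its old level
        levels = [sums[i] / counts[i] if counts[i] else levels[i]
                  for i in range(num_levels)]
    boundaries = [(levels[i] + levels[i + 1]) / 2 for i in range(num_levels - 1)]
    return levels, boundaries

def laplacian_sample(rng):
    """Draw one sample from a zero-mean, unit-variance Laplacian density."""
    magnitude = -math.log(1.0 - rng.random()) / math.sqrt(2.0)
    return magnitude if rng.random() < 0.5 else -magnitude

rng = random.Random(3)
samples = [laplacian_sample(rng) for _ in range(20000)]
levels, boundaries = lloyd_max(samples, 4)
```

For this source the levels settle near ±0.42 and ±1.83, with the inner boundaries near 0 and ±1.13, up to sampling noise.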
Uniform/ Non-uniform Quantizers
Adaptive Quantization
• Linear quantization => SNR depends on σx being constant (this is clearly not the case)
• Instantaneous companding => SNR only weakly dependent on Xmax/σx for large µ-law compression (µ = 100-500)
• Optimum SNR => minimize σe² when σx² is known, non-uniform distribution of quantization levels
• Quantization dilemma:
– want to choose quantization step size large enough to accommodate maximum peak-to-peak range of x(n);
– at the same time need to make the quantization step size small so as to minimize the quantization error.
• The non-stationary nature of speech (variability across sounds, speakers, backgrounds) compounds this problem greatly.
Types of Adaptive Quantization
• Instantaneous: amplitude changes reflect sample-to-sample variation of x(n) => rapid adaptation.
• Syllabic: amplitude changes reflect syllable-to-syllable variations in x(n) => slow adaptation.
• Feed-forward: adaptive quantizers that estimate σx² from x(n) itself.
• Feedback: adaptive quantizers that adapt the step size, Δ, on the basis of the quantized signal, x’(n) (or equivalently the codewords, c(n)).
Feed Forward Adaptation
• Variable step size
• assume uniform quantizer
with step size Δ(n)
• x(n) is quantized using Δ(n)
=>c(n) and Δ(n) need to be
transmitted to the decoder
• if c’(n)=c(n) and Δ’(n)= Δ(n)
=> no error in the channel,
and
• x’’(n) = x’(n)
Don’t have x(n) at the decoder to estimate Δ(n) => need to
transmit Δ(n); This is the major drawback of the feed
forward adaptation.
Feed-Forward Quantizer
• time varying gain, G(n) =>
c(n) and G(n) need to be
transmitted to the decoder.
Can’t estimate G(n) at
the decoder => it has to
be transmitted.
Feed Forward Quantizers
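• Feed forward systems make estimates of σx², then make Δ or the quantization levels proportional to σx; the gain is inversely proportional to σx.
• Assume σx² is proportional to the short time energy
$$\sigma^2(n) = \sum_{m=-\infty}^{\infty} x^2(m)\, h(n-m), \qquad E[\sigma^2(n)] \propto \sigma_x^2$$
• where h(n) is a low pass filter
• Consider h(n) = α^(n-1), n ≥ 1
• = 0 otherwise
$$\sigma^2(n) = \sum_{m=-\infty}^{n-1} x^2(m)\, \alpha^{n-m-1}, \qquad 0 < \alpha < 1$$
$$\sigma^2(n) = \alpha\, \sigma^2(n-1) + x^2(n-1) \quad \text{(recursion; proof??)}$$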
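The recursion can be checked numerically against the direct sum. A minimal sketch, assuming x(m) = 0 for m < 0 so that the sum starts at m = 0; the function names and test values are illustrative.

```python
def variance_estimate_direct(x, n, alpha):
    """sigma^2(n) = sum over m < n of x(m)^2 * alpha^(n-m-1)."""
    return sum(x[m] ** 2 * alpha ** (n - m - 1) for m in range(n))

def variance_estimate_recursive(x, alpha):
    """sigma^2(n) = alpha * sigma^2(n-1) + x(n-1)^2, starting from sigma^2(0) = 0."""
    sigma2 = [0.0]
    for n in range(1, len(x) + 1):
        sigma2.append(alpha * sigma2[n - 1] + x[n - 1] ** 2)
    return sigma2

x = [0.1, -0.4, 0.9, 0.3, -0.7, 0.2, 0.0, 0.5]
alpha = 0.9
recursive = variance_estimate_recursive(x, alpha)
direct = [variance_estimate_direct(x, n, alpha) for n in range(len(x) + 1)]
```

The recursion needs one multiply and one add per sample, which is why the exponential window is attractive for tracking the short-time energy.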
Feed Forward Quantizers
• Δ(n) and G(n) vary slowly compared to x(n)
• They must be sampled and transmitted as part of the waveform
coder parameters
• Rate of sampling depends on the bandwidth of the lowpass filter, h(n): for α = 0.99, the rate is about 13 Hz; for α = 0.9, the rate is about 135 Hz
• It is reasonable to place limits on the variation of Δ(n) or G(n), of the form
• Gmin ≤ G(n) ≤ Gmax
• Δmin ≤ Δ(n) ≤ Δmax
• For obtaining σy² ≈ constant over a 40 dB range in signal levels
Gmax/Gmin = Δmax/Δmin = 100 (40 dB range)
Feed Forward Adaptation Gain
• Δ(n) or G(n) is evaluated after every M samples
• Use 128 to 1024 samples for estimation
• Adaptive quantizer achieves up to 5.6 dB better SNR than non-adaptive quantizers
• Can achieve this SNR with low "idle channel noise" and wide speech dynamic range by suitable choice of Δmin and Δmax
Feedback Adaptation
• σ²(n) estimated from quantizer output (or the code words).
• Advantage of feedback adaptation is that neither Δ(n) nor G(n) needs to be
transmitted to the decoder since they can be derived from the code words.
• Disadvantage of feedback adaptation is increased sensitivity to errors in
codewords, since such errors affect Δ(n) and G(n).
Feedback Adaptation
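• σ²(n) is based only on past values of x’(n):
$$\sigma^2(n) = \sum_{m=-\infty}^{\infty} x'^2(m)\, h(n-m)$$
• Two typical windows/filters are:
$$h(n) = \begin{cases} \alpha^{n-1}, & n \ge 1 \\ 0, & \text{otherwise} \end{cases}$$
$$h(n) = \begin{cases} 1/M, & 1 \le n \le M \\ 0, & \text{otherwise} \end{cases} \;\Rightarrow\; \sigma^2(n) = \frac{1}{M}\sum_{m=n-M}^{n-1} x'^2(m)$$
• Can use very short window lengths (e.g. M=2) to achieve 12 dB SNR for a B=3 bit quantizer.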
Alternative Approach to Adaptation
Input-output characteristic of a 3-bit adaptive quantizer
Optimal Step Size Multipliers
Nonuniform quantizer
[Block diagram: discrete samples → Compressor → Uniform Quantizer → digital signals → Channel → received digital signals → Decoder → Expander → output]
“Compressing-and-expanding” is called “companding.”
Compression Techniques
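µ-law compressor (very popular internationally)
$$|w_2(t)| = \frac{\ln(1 + \mu\,|w_1(t)|)}{\ln(1 + \mu)}, \qquad |w_1(t)| \le 1$$
In the U.S., µ = 255 is used.
A-law compressor
$$|w_2(t)| = \begin{cases} \dfrac{A\,|w_1(t)|}{1 + \ln A}, & 0 \le |w_1(t)| \le \dfrac{1}{A} \\[2ex] \dfrac{1 + \ln(A\,|w_1(t)|)}{1 + \ln A}, & \dfrac{1}{A} \le |w_1(t)| \le 1 \end{cases}$$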
Practical Implementation of µ-law compressor
Waveform Coders
Pulse Code Modulation (PCM)
• Needs the sampling frequency, fs, to be greater than the Nyquist rate (twice the maximum frequency in the signal)
• For n bits per sample, the dynamic range is ±2^(n-1) and the quantisation noise power equals Δ²/12 (Δ = step size)
• Total bit rate = n·fs
• Can use non-uniform quantisation / variable length codes
– Logarithmic quantization (A-law, µ-law)
– Adaptive Quantization
G.711
• Pulse Code Modulation (PCM) codecs are the simplest form of waveform codecs.
• Narrowband speech is typically sampled 8000 times per second, and then each speech sample must be quantized.
• If linear quantization is used then about 12 bits per sample are needed, giving a bit rate of about 96 kbits/s.
• However this can be easily reduced by using non-linear quantization.
• For coding speech it was found that with non-linear quantization 8 bits per sample was sufficient for speech quality which is almost indistinguishable from the original.
• This gives a bit rate of 64 kbits/s, and two such non-linear PCM codecs were standardised in the 1960s
Waveform Coders
Differential Pulse Code Modulation (DPCM)
• Predict the next sample based on the last few decoded
samples
• Minimise mean squared error of prediction residual
– use LP coding
• Good prediction results in a reduction in the dynamic range
needed to code the prediction residual and hence a
reduction in the bit rate.
• Can use non-uniform quantisation or variable length codes
Differential PCM (DPCM)
• Fixed predictors can give from 4-11dB SNR
improvement over direct quantization (PCM)
• Most of the gain occurs with first order predictor
• Prediction up to 4th or 5th order helps
Another Implementation of DPCM
Quantization error is not accumulated.
For slowly varying signals, a future sample can be
predicted from past samples.
Transversal filter can perform the prediction process.
[Block diagram. Transmitter side: the predictor output is subtracted from s(t) to form e(t). Receiver side: the predictor output is added to e(t) to recover s(t).]
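A minimal DPCM sketch showing why quantization error does not accumulate when the encoder predicts from the same decoded samples the receiver will have. The first-order predictor with coefficient a = 0.9, the 4-bit uniform quantizer, the step size, and the 200 Hz test tone are all illustrative assumptions.

```python
import math

def quantize(d, delta, bits):
    """Uniform mid-tread quantizer; returns the integer codeword k (value k*delta)."""
    half = 2 ** (bits - 1)
    return max(-half, min(half - 1, math.floor(d / delta + 0.5)))

def dpcm_encode(x, a, delta, bits):
    codes = []
    prev_decoded = 0.0
    for sample in x:
        prediction = a * prev_decoded                 # predict from decoded history
        k = quantize(sample - prediction, delta, bits)
        codes.append(k)
        prev_decoded = prediction + k * delta         # same value the decoder forms
    return codes

def dpcm_decode(codes, a, delta):
    out = []
    prev_decoded = 0.0
    for k in codes:
        prev_decoded = a * prev_decoded + k * delta
        out.append(prev_decoded)
    return out

a, delta, bits = 0.9, 0.05, 4
x = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
codes = dpcm_encode(x, a, delta, bits)
decoded = dpcm_decode(codes, a, delta)
max_error = max(abs(xi - yi) for xi, yi in zip(x, decoded))
```

Because the prediction residual of this slowly varying tone stays inside the quantizer range, the reconstruction error at every sample is bounded by Δ/2 rather than growing over time.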
DPCM with Adaptive Quantization
• Quantizer step size proportional to variance at quantizer input
• Can use d(n) or x(n) to control step size
• Get 5 dB improvement in SNR over µ-law non-adaptive PCM
• Get 6 dB improvement in SNR using differential configuration with fixed prediction => ADPCM gives about 10-11 dB better SNR than a fixed quantizer.
Feedback ADPCM
• Can achieve same improvement in SNR as feed forward system
DPCM with Adaptive Prediction
• Need adaptive prediction to
handle non-stationarity of
speech.
• ADPCM encoders with
pole-zero decoder filters
have proved to be
particularly versatile in
speech applications.
• The ADPCM 32 kbits/s
algorithm adopted for the
G.721 CCITT standard
(1984) uses a pole-zero
adaptive predictor.
DPCM with Adaptive Prediction
• Prediction coefficients assumed to be time-dependent, with a predictor of the form
$$\tilde{x}(n) = \sum_{k=1}^{p} \alpha_k(n)\, x'(n-k)$$
• Assume speech properties remain fixed over short time intervals.
• Choose α_k(n) to minimize the average squared prediction error over short intervals.
• The optimum predictor coefficients satisfy the relationships
$$R_n(j) = \sum_{k=1}^{p} \alpha_k(n)\, R_n(j-k), \quad j = 1, 2, \ldots, p$$
• where R_n(j) is the short-time autocorrelation function of the form
$$R_n(j) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\, x(j+m)\, w(n-m-j), \quad 0 \le j \le p$$
• w(n-m) is a window positioned at sample n of the input.
• Update the α’s every 10-20 msec.
Prediction Gain for DPCM with Adaptive Prediction
$$10\log_{10} G_p = 10\log_{10}\left[\frac{E[x^2(n)]}{E[d^2(n)]}\right]$$
• Fixed prediction => 10.5 dB prediction gain for large p.
• Adaptive prediction => 14 dB gain for large p.
• Adaptive prediction is more robust to speaker and speech material.
Comparison of Coders
• 6 dB between curves
• Sharp increase in
SNR with both fixed
prediction and
adaptive quantization
• Almost no gain for
adapting first order
predictor
ADPCM G.721 Encoder
• The algorithm consists of an adaptive quantizer and an adaptive pole-zero predictor.
• The pole-zero predictor (2 poles, 6 zeros) estimates the input signal and hence it reduces the error variance.
• The quantizer encodes the error sequence into a sequence of 4-bit words. The prediction coefficients are estimated using a gradient algorithm and the stability of the decoder is checked by testing the two roots of A(z).
• The performance of the coder, in terms of the MOS scale, is above 4 but it degrades as the number of asynchronous tandem codings increases. The G.721 ADPCM algorithm was also modified to accommodate 24 and 40 kbits/s in the G.723 standard.
• The performance of ADPCM degrades quickly for rates below 24 kbits/s.
Delta Modulation
• Simplest form of differential quantization is in delta
modulation (DM).
• Sampling rate chosen to be many times the Nyquist
rate for the input signal => adjacent samples are
highly correlated.
• This leads to a high ability to predict x(n) from past
samples, with the variance of the prediction error
being very low, leading to a high prediction gain =>
can use simple 1-bit (2-level) quantizer =>the bit rate
for DM systems is just the (high) sampling rate of the
signal.
Linear Delta Modulation (LDM)
Linear Delta Modulation
2-level quantizer with fixed step size Δ, of the form
d’(n) = Δ if d(n) > 0 (c(n) = 1)
= -Δ if d(n) < 0 (c(n) = 0)
Illustration of DM
• Basic equation of DM is x’(n) = αx’(n-1) + d’(n)
• When α ≈ 1, this is digital integration (accumulation of increments of ±Δ)
• d(n) = x(n) - x’(n-1) = x(n) - x(n-1) - e(n-1)
• d(n) is the first backward difference of x(n), or an approximation to the derivative of the input.
• How big do we make Δ? At the maximum slope of xa(t) we need
• Δ/T ≥ max|dxa(t)/dt|
• or else the reconstructed signal will lag the actual signal: a ‘slope overload’ condition, resulting in quantization error called 'slope overload distortion'
• Since x’(n) can only increase by fixed increments of Δ, fixed DM is called linear DM or LDM
DM Waveform
DM Granular Noise
• When xa(t) has small slope, Δ determines the peak error. When xa(t) = 0, the quantizer output will be an alternating sequence of 0's and 1's, and x’(n) will alternate around zero with peak variation of Δ. This condition is called “granular noise”.
• Need large step size to handle wide dynamic range
• Need small step size to accurately represent low level signals.
• With LDM we need to worry about dynamic range and amplitude of the difference signal => choose Δ to minimize mean-squared quantization error (a compromise between slope overload and granular noise).
Performance of DM Systems
Normalized step size defined as
$$\Delta \big/ \left(E[(x(n) - x(n-1))^2]\right)^{1/2}$$
Oversampling index defined as
F0 = Fs/(2FN)
where Fs is the sampling rate of the DM and FN is the Nyquist frequency of the signal.
The total bit rate of the DM is
BR = Fs = 2FN·F0
Can see that for a given value of F0, there is an optimum value of Δ.
Optimum SNR increases by 9 dB for each doubling of F0 => this is better than the 6 dB obtained by increasing the number of bits/sample by 1 bit.
Curves are very sharp around the optimum value of Δ => SNR is very sensitive to input level.
For SNR = 35 dB and FN = 3 kHz => 200 kbps rate.
For toll quality a much higher rate is required.
Adaptive Delta Mod
• Step size adaptation for DM (from codewords)
– Δ(n) = M·Δ(n-1)
– Δmin ≤ Δ(n) ≤ Δmax
• M is a function of c(n) and c(n-1), since c(n) depends only on the sign of d(n)
– d(n) = x(n) - x’(n-1)
• The sign of d(n) can be determined before the actual quantized value d’(n), whose evaluation needs the new value of Δ(n)
• The algorithm for choosing the step size multiplier is
– M = P > 1 if c(n) = c(n-1)
– M = Q < 1 if c(n) ≠ c(n-1)
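The step size rule can be sketched as a small encoder/decoder pair. Assumptions made here for illustration: the integrator uses α = 1, d(n) = 0 is coded as c(n) = 0, the step size starts at Δmin, and the test input is a unit step. Setting P = Q = 1 turns the same code into linear DM.

```python
def next_step(delta, bit, prev_bit, delta_min, delta_max, P, Q):
    """Apply Delta(n) = M * Delta(n-1), limited to [delta_min, delta_max]."""
    if prev_bit is None:                   # first sample: no previous codeword yet
        return delta
    M = P if bit == prev_bit else Q
    return min(delta_max, max(delta_min, M * delta))

def adm_encode(x, delta_min, delta_max, P=2.0, Q=0.5):
    bits, recon = [], []
    estimate, prev_bit, delta = 0.0, None, delta_min
    for sample in x:
        bit = 1 if sample - estimate > 0 else 0        # c(n) from the sign of d(n)
        delta = next_step(delta, bit, prev_bit, delta_min, delta_max, P, Q)
        estimate += delta if bit else -delta           # x'(n) = x'(n-1) + d'(n)
        bits.append(bit)
        recon.append(estimate)
        prev_bit = bit
    return bits, recon

def adm_decode(bits, delta_min, delta_max, P=2.0, Q=0.5):
    recon = []
    estimate, prev_bit, delta = 0.0, None, delta_min
    for bit in bits:                                   # step size rebuilt from codewords only
        delta = next_step(delta, bit, prev_bit, delta_min, delta_max, P, Q)
        estimate += delta if bit else -delta
        recon.append(estimate)
        prev_bit = bit
    return recon

def mean_abs_error(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

step_input = [0.0] * 5 + [1.0] * 30
adm_bits, adm_recon = adm_encode(step_input, 0.01, 0.5)
ldm_bits, ldm_recon = adm_encode(step_input, 0.01, 0.5, P=1.0, Q=1.0)
adm_error = mean_abs_error(step_input, adm_recon)
ldm_error = mean_abs_error(step_input, ldm_recon)
```

The decoder recovers exactly the encoder's reconstruction from the bit stream alone, and on the step input the adaptive version escapes slope overload much faster than the fixed step size.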
Adaptive DM Performance
• Slope overload in LDM causes runs of 0’s or 1’s
• Granularity causes runs of alternating 0’s and 1’s
• The figure above shows how adaptive DM performs with P=2, Q=1/2, α=1.
• During slope overload, step size increases exponentially to follow
increase in waveform slope.
• During granularity, step size decreases exponentially to Δmin and
stays there as long as the slope is small.
ADM Parameter Behavior
• ADM parameters are P, Q, Δmin
and Δmax
• Choose Δmax/ Δmin to maintain
high SNR over range of input
signal levels.
• Δmin should be chosen to
minimize idle channel noise.
• PQ should satisfy PQ ≤ 1 for stability.
Comparison of LDM, ADM and log PCM
• ADM is 8 dB better SNR at 20
kbps than LDM, and 14 dB
better SNR at 60 kbps than
LDM.
• ADM gives a 10 dB increase in
SNR for each doubling of the bit
rate; LDM gives about 6 dB.
• For bit rate below 40 kbps, ADM
has higher SNR than µ-law
PCM; for higher bit rates log
PCM has higher SNR.
Higher Order Prediction in DM
Waveform Coding versus Block
Processing
• Waveform coding
– sample-by-sample matching of waveforms
– coding quality measured using SNR
• Source modeling (block processing)
– block processing of signal => vector of outputs every block
– overlapped blocks
Adaptive Predictive Coder
Transmitter
Receiver
Adaptive Predictive Coder
• The use of adaptive long-term prediction in addition to short-term prediction provides additional coding gain (at the expense of higher complexity) and high quality speech at 16 kbits/s. The long-term (long delay) predictor
$$A_L(z) = \sum_{i=-j}^{j} a_i\, z^{-(\tau + i)}$$
• provides for the pitch (fine) structure of the short-time voiced spectrum. The index τ is the pitch period in samples and j is a small integer. The long-term predictor (ideally) removes the periodicity and thereby redundancy.
• At the receiver, the synthesis filter associated with the long-term predictor introduces periodicity, while the synthesis filter associated with the short-term prediction polynomial represents the vocal tract.
• The parameters of the short-term predictors are computed for every frame (typically 10 to 30 ms). The long-term prediction parameters are computed more often.
Model-Based Speech Coding
• Waveform coding based on optimizing and maximizing SNR has gone about as far as possible.
– achieved bit rate reductions on the order of 4:1 (i.e., from 128 kbps PCM to 32 kbps ADPCM) while at the same time achieving toll quality SNR for telephone-bandwidth speech
• To lower bit rate further without reducing speech quality, we need to exploit features of the speech production model, including:
– source modeling
– spectrum modeling
– use of codebook methods for coding efficiency
• We also need a new way of comparing performance of different waveform and model-based coding methods
– an objective measure, like SNR, isn’t an appropriate measure for model-based coders since they operate on blocks of speech and don’t follow the waveform on a sample-by-sample basis.
– new subjective measures need to be used that measure user perceived quality, intelligibility, and robustness to multiple factors.
Frequency Domain Speech Coding
• All frequency domain methods for speech coding exploit the Short-Time Fourier Transform using a filter bank view with scalar quantization
• Sub-band Coding-use small number of filters with
wide and overlapping bandwidths
• 2-band sub-band coder
– advantage of sub-band coder is that the quantization noise
is limited to the sub-band that generated it => better
perceptual control of noise spectrum
– with careful design of filters, can get complete cancellation of quantization noise that leaks across bands => use QMF (Quadrature Mirror Filters)
– can continue to split lower bands into 2 bands, giving
octave band filter bank => auditory front-end like analysis.
Sub-band coders
• Exploit the frequency sensitivity of the auditory
system.
• Split the signal into sub-band using band pass
filters.
• Code each sub-band at an appropriate resolution
– e.g. 4 bits per sample in the lower sub-bands and
– 2 bits per sample in the upper sub-bands
• Can also exploit auditory masking
– use fewer bits if a neighbouring sub-band is much louder
• Basis for the MPEG audio standard (5:1
compression of CD quality audio with no perceptual
degradation)
Sub-band coder
The 16 kbits/s SBC compared favorably against 16 kbits/s ADPCM, and the 9.6 kbits/s SBC compared favorably against 10.3 and 12.9 kbits/s ADM.
The low-band filters in speech-specific implementations are usually associated with narrower widths so that they can resolve more accurately the low-frequency narrowband formants. In the absence of quantization noise, perfect reconstruction can be achieved using Quadrature-Mirror Filter (QMF) banks.
AT&T Sub-band coder
• The AT&T SBC was used for voice storage at 16 or 24
kbits/s and consists of a five-band non-uniform tree-
structured QMF bank in conjunction with APCM coders.
• A silence compression algorithm is also part of the
standard. The frequency ranges for each band are: 0-0.5
kHz, 0.5-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. For the 16
kbits/s implementation the bit allocations are {4/4/2/2/0}
and for the 24 kbits/s the bit assignments are {5/5/4/3/0}.
The one-way delay of this coder is less than 18 ms. It
must be noted that although this coder was the
workhorse for the older AT&T voice store and forward
machines, the most recent AT&T machines use the new
16 kbits/s Low Delay CELP algorithm.
G.722 Sub-band coder
CCITT standard (G.722)
• The CCITT standard (G.722) for 7kHz audio at 64 kbits/s for
ISDN teleconferencing is based on a two-band sub-
band/ADPCM coder.
• The low-frequency sub-band is quantized at 48 kbits/s while the
high-frequency sub-band is coded at 16 kbits/s.
• The G.722 coder includes an adaptive bit allocation scheme
and an auxiliary data channel.
• Provisions for lower rates have been made by quantizing the
low-frequency sub-band at 40 kbits/s or at 32 kbits/s.
• The MOS at 64 kbits/s is greater than four for speech and
slightly less than four for music signals, and the analysis-
synthesis QMF banks introduce a delay of less than 3 ms.
Introduction to VQ
• Vector quantization (VQ) is a lossy data compression method based on the principle of block coding.
• It is a fixed-to-fixed length algorithm.
• In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ design algorithm based on a training sequence.
• The use of a training sequence bypasses the need for multi-dimensional integration. This algorithm is referred to as LBG-VQ.
• Toll quality speech coder (digital wireline phone)
– G.711 (A-LAW and μ-LAW at 64 kbits/sec)
– G.721 (ADPCM at 32 kbits/ sec)
– G.723 (ADPCM at 40, 24 kbps)
– G.726 (ADPCM at 16,24,32,40 kbps)
• Low bit rate speech coder (cellular phone/IP phone)
– G.728 low delay (16 kbps, delay <2ms, same or better quality
than G.721)
– G.723.1 (CELP based, 5.3 and 6.3 kbits/sec)
– G.729 (CELP based, 8 kbps)
– GSM 06.10 (13 and 6.5 kbits/sec, simple to implement, used in
GSM phones)
1-D VQ
• A VQ is nothing more than an approximator.
• The idea is similar to that of "rounding off". An example of a 1-dimensional VQ is shown below.
• Here, every number less than -2 is approximated by -3. Every number between -2 and 0 is approximated by -1. Every number between 0 and 2 is approximated by +1. Every number greater than 2 is approximated by +3. Note that the approximate values are uniquely represented by 2 bits. This is a 1-dimensional, 2-bit VQ. It has a rate of 2 bits/dimension.
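The 1-dimensional, 2-bit VQ described above can be written as a nearest-codevector search. A minimal sketch: the names are illustrative, and ties at the boundaries -2, 0 and 2 (which the description leaves open) go to the lower codevector here.

```python
CODEBOOK = [-3.0, -1.0, 1.0, 3.0]     # the four approximating values

def vq_encode(x):
    """Return the 2-bit index of the codevector closest to x."""
    return min(range(len(CODEBOOK)), key=lambda i: abs(x - CODEBOOK[i]))

def vq_decode(index):
    """Look the index up in the codebook."""
    return CODEBOOK[index]

approximations = [vq_decode(vq_encode(x)) for x in (-2.7, -0.4, 0.4, 5.0)]
```

Only the index is transmitted, so each input number costs 2 bits regardless of its value.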
2-D VQ
• An example of a 2-
dimensional VQ is shown
below:
• Here, every pair of numbers
falling in a particular region
are approximated by a red star
associated with that region.
Note that there are 16 regions
and 16 red stars -- each of
which can be uniquely
represented by 4 bits. Thus,
this is a 2-dimensional, 4-bit
VQ. Its rate is also 2
bits/dimension.
• In the above two examples, the red stars are called
codevectors and the regions defined by the blue
borders are called encoding regions. The set of all
codevectors is called the codebook and the set of all
encoding regions is called the partition of the space.
Design Problem
• Given a
– vector source with its statistical properties known
– a distortion measure
– the number of codevectors
• To find
– codebook (the set of all red stars)
– partition (the set of blue lines) which result in the
smallest average distortion.
Design Problem Cont.
• Assume that there is a training sequence consisting of M source vectors: T = {x_1, x_2, …, x_M}
• This training sequence can be obtained from some large database
– For example, if the source is a speech signal, then the training sequence can be obtained by recording several long telephone conversations.
– M is assumed to be sufficiently large so that all the statistical properties of the source are captured by the training sequence
Design Problem Cont.
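• Assume that the source vectors are k-dimensional, e.g., x_m = (x_{m,1}, x_{m,2}, …, x_{m,k}), m = 1, 2, …, M
• Let N be the number of codevectors and let C = {c_1, c_2, …, c_N} represent the codebook.
• Each codevector is k-dimensional, e.g., c_n = (c_{n,1}, c_{n,2}, …, c_{n,k}), n = 1, 2, …, N
• Let S_n be the encoding region associated with codevector c_n and let P = {S_1, S_2, …, S_N} denote the partition of the space.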
Design Problem Cont.
• If the source vector xm is in the encoding region Sn, then its
approximation (denoted by Q(xm)) is cn.
• Assuming a squared-error distortion measure, the average distortion is given by:
$$D_{ave} = \frac{1}{Mk}\sum_{m=1}^{M} \|x_m - Q(x_m)\|^2$$
• The design problem can be succinctly stated as follows: Given T and N, find C and P such that Dave is minimized.
Optimality Criteria
• If C and P are a solution to the above minimization problem, then they must satisfy the following two criteria.
• Nearest Neighbor Condition: $S_n = \{x : \|x - c_n\|^2 \le \|x - c_{n'}\|^2 \ \forall n' = 1, 2, \ldots, N\}$
– This condition says that the encoding region $S_n$ should consist of all vectors that are closer to $c_n$ than to any of the other codevectors. For those vectors lying on the boundary (blue lines), any tie-breaking procedure will do.
• Centroid Condition: $c_n = \frac{\sum_{x_m \in S_n} x_m}{\sum_{x_m \in S_n} 1}$, $n = 1, 2, \ldots, N$
– This condition says that the codevector $c_n$ should be the average of all those training vectors that are in encoding region $S_n$. In implementation, one should ensure that at least one training vector belongs to each encoding region (so that the denominator in the above equation is never 0).
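The two conditions can be sketched in Python as one partition step followed by one centroid step. This is an illustrative sketch only: the function names are hypothetical, ties go to the lowest index, and an empty region simply keeps its old codevector (one of several possible ways to avoid a zero denominator).

```python
def sq_dist(x, c):
    """Squared Euclidean distance."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def nearest_neighbor_partition(training, codebook):
    """Nearest neighbor condition: put each vector in the cell of its closest codevector."""
    cells = [[] for _ in codebook]
    for x in training:
        n_star = min(range(len(codebook)), key=lambda n: sq_dist(x, codebook[n]))
        cells[n_star].append(x)
    return cells

def centroids(cells, codebook):
    """Centroid condition: average each cell; an empty cell keeps its old codevector."""
    new_codebook = []
    for cell, old in zip(cells, codebook):
        if not cell:
            new_codebook.append(old)
            continue
        k = len(old)
        new_codebook.append(tuple(sum(x[j] for x in cell) / len(cell) for j in range(k)))
    return new_codebook

T = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
C = [(0.0, 0.0), (12.0, 10.0)]
cells = nearest_neighbor_partition(T, C)
C_new = centroids(cells, C)
```

The starting codebook satisfies neither condition exactly; after one pass the codevectors move to the cell averages (1, 0) and (11, 10).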
LBG Design Algorithm
• Iterative algorithm which solves the two optimality criteria.
• Requires initial codebook C(0).
• The initial codebook is obtained by the splitting method.
• In this method, an initial codevector is set as the average of the entire training sequence.
• This codevector is then split into two. The iterative algorithm is run with these two vectors as the initial codebook. The final two codevectors are split into four, and the process is repeated until the desired number of codevectors is obtained.
LBG Design Algorithm Cont.
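1. Given T. Fix $\epsilon > 0$ to be a “small” number.
2. Let N = 1 and $c_1^* = \frac{1}{M} \sum_{m=1}^{M} x_m$.
Calculate $D_{ave}^* = \frac{1}{Mk} \sum_{m=1}^{M} \|x_m - c_1^*\|^2$
3. Splitting: For $i = 1, 2, \ldots, N$, set $c_i^{(0)} = (1 + \epsilon) c_i^*$ and $c_{N+i}^{(0)} = (1 - \epsilon) c_i^*$.
Set N = 2N.
4. Iteration: Let $D_{ave}^{(0)} = D_{ave}^*$. Set the iteration index i = 0.
i. For m = 1, 2, …, M, find the minimum value of $\|x_m - c_n^{(i)}\|^2$ over all n = 1, 2, …, N. Let n* be the index which achieves the minimum. Set $Q(x_m) = c_{n^*}^{(i)}$.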
LBG Design Algorithm Cont.
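ii. For n = 1, 2, …, N, update the codevector: $c_n^{(i+1)} = \frac{\sum_{Q(x_m) = c_n^{(i)}} x_m}{\sum_{Q(x_m) = c_n^{(i)}} 1}$
iii. Set i = i + 1.
iv. Calculate $D_{ave}^{(i)} = \frac{1}{Mk} \sum_{m=1}^{M} \|x_m - Q(x_m)\|^2$
v. If $(D_{ave}^{(i-1)} - D_{ave}^{(i)}) / D_{ave}^{(i-1)} > \epsilon$, go back to Step (i).
vi. Set $D_{ave}^* = D_{ave}^{(i)}$. For n = 1, 2, …, N, set $c_n^* = c_n^{(i)}$ as the final codevector.
5. Repeat steps 3 and 4 until the desired number of codevectors is obtained.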
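The splitting and iteration steps can be sketched in plain Python as follows. This is a sketch under stated assumptions, not a reference implementation: the (1 + eps) / (1 - eps) perturbation, the relative-improvement stopping test, the 1/(Mk) normalization, and the choice to leave an empty cell's codevector unchanged are all assumptions, and the function names are hypothetical.

```python
def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def mean_vector(vectors):
    k = len(vectors[0])
    return tuple(sum(v[j] for v in vectors) / len(vectors) for j in range(k))

def lbg(training, n_codevectors, eps=0.01):
    """Grow a codebook by repeated splitting until it has at least n_codevectors entries."""
    M, k = len(training), len(training[0])
    codebook = [mean_vector(training)]            # step 2: N = 1, global centroid
    d_star = sum(sq_dist(x, codebook[0]) for x in training) / (M * k)
    while len(codebook) < n_codevectors:
        # Step 3: split each codevector into a (1 + eps) and a (1 - eps) copy.
        codebook = ([tuple((1 + eps) * v for v in c) for c in codebook] +
                    [tuple((1 - eps) * v for v in c) for c in codebook])
        d_prev = d_star
        while True:                               # step 4
            # (i) nearest-neighbor assignment of every training vector
            labels = [min(range(len(codebook)), key=lambda n: sq_dist(x, codebook[n]))
                      for x in training]
            # (ii) centroid update; an empty cell keeps its old codevector
            for n in range(len(codebook)):
                cell = [x for x, lab in zip(training, labels) if lab == n]
                if cell:
                    codebook[n] = mean_vector(cell)
            # (iv) distortion of the updated codevectors under the current assignment
            d_new = sum(sq_dist(x, codebook[lab])
                        for x, lab in zip(training, labels)) / (M * k)
            # (v) stop once the relative improvement is no larger than eps
            if d_prev == 0 or (d_prev - d_new) / d_prev <= eps:
                break
            d_prev = d_new
        d_star = d_new                            # (vi) keep the final distortion
    return codebook, d_star

T = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
codebook, d_star = lbg(T, 2)
```

On this toy data the two codevectors settle on the two cluster means. Note that multiplicative splitting cannot separate a codevector that sits exactly at the origin; an additive perturbation would be needed in that case.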
Performance
• The performance of VQ is typically given in terms of the signal-to-distortion ratio (SDR): $SDR = 10 \log_{10} \frac{\sigma^2}{D_{ave}}$ (in dB),
• where $\sigma^2$ is the variance of the source and $D_{ave}$ is the average squared-error distortion. The higher the SDR, the better the performance.
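A small Python sketch of the SDR computation, assuming the usual definition SDR = 10 log10(variance / D_ave) and a population (divide-by-M) variance estimate. The function name and sample values are illustrative only.

```python
import math

def sdr_db(source, d_ave):
    """Signal-to-distortion ratio in dB for scalar source samples and a given D_ave."""
    mean = sum(source) / len(source)
    variance = sum((s - mean) ** 2 for s in source) / len(source)
    return 10.0 * math.log10(variance / d_ave)

samples = [-1.0, 1.0, -1.0, 1.0]   # zero mean, unit variance
sdr = sdr_db(samples, 0.01)
```

With unit variance and an average distortion of 0.01, the ratio is 100, i.e. 20 dB.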
Toy Example of VQ Coding
• 2-pole model of the vocal tract => 4 reflection coefficients
• 4-possible vocal tract shapes => 4 sets of reflection coefficients
• 1. Scalar Quantization-assume 4 values for each reflection coefficient => 2-bits x 4 coefficients = 8 bits/frame
• 2. Vector Quantization-only 4 possible vectors => 2-bits to choose which of the 4 vectors to use for each frame (pointer into a codebook)
• this works because the scalar components of each vector are highly correlated.
• if scalar components are independent => VQ offers no advantage over scalar quantization
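The bit counts in the toy example can be checked with a few lines of Python. The helper names are hypothetical; each index is assumed to be coded with a fixed-length binary word.

```python
import math

def scalar_bits_per_frame(levels_per_coeff, n_coeffs):
    """Each coefficient is quantized on its own with ceil(log2(levels)) bits."""
    return n_coeffs * math.ceil(math.log2(levels_per_coeff))

def vq_bits_per_frame(n_codevectors):
    """One index into the codebook selects the whole vector."""
    return math.ceil(math.log2(n_codevectors))

scalar_bits = scalar_bits_per_frame(4, 4)   # 4 levels x 4 coefficients
vq_bits = vq_bits_per_frame(4)              # 4 possible vocal tract shapes
```

The 4:1 saving comes entirely from the fact that only 4 of the 4^4 = 256 scalar combinations ever occur.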
Comparison of Scalar and Vector Quantization
VQ Codebook of LPC Vectors
• [Figure: 64 vectors in a codebook of spectral shapes]
• 10-bit VQ is comparable to 24-bit scalar quantization.
VQ Properties
• In VQ, a cell can have arbitrary size and shape.
• In scalar quantization, a decision region can have arbitrary size, but its shape is fixed.
• VQ has a distortion measure, a measure of distance between the input and the output, which is used both to design the codebook vectors and to choose the optimal reconstruction vector.
Iterative VQ Design Algorithm
• 1. assume an initial set of points is given
– map all vectors to the best set of points
– recompute centroids from the ensemble vectors in each cell
• 2. iterate until the change in reconstruction levels is small
• Problem: need to know $p_x(x)$ to correctly compute centroids
• Solution: use a training set as an estimate of the ensemble (the k-means clustering algorithm).
VQ Performance
• The simplest form of a vector quantizer can be considered as a generalization of the scalar PCM and is called Vector PCM (VPCM). In VPCM, the codebook is fully searched (full search VQ or F-VQ) for each incoming vector. The number of bits per sample in VPCM is given by
– $B = (\log_2 N)/k$
• and the signal-to-noise ratio for VPCM is given by
– $SNR_k = 6B + K_k$ (dB)
• VPCM yields improved SNR since it exploits the correlation within the vectors. In the case of speech coding, it is reported that $K_2$ is larger than $K_1$ by more than 3 dB while $K_8$ is larger than $K_1$ by more than 8 dB.
VQ Performance
• Even though VQ offers significant coding gain by increasing N and k, its memory and computational complexity grow exponentially with k for a given rate.
• More specifically, the number of computations required for F-VQ (full search VQ) is of the order of $2^{Bk}$ while the number of memory locations required is $k \cdot 2^{Bk}$.
• In general the benefits of VQ are realized at rates of 1 bit per sample.
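To make the exponential growth concrete, the rate and the order-of-magnitude cost figures can be tabulated in Python. The function names are hypothetical, and the counts are the rough orders quoted for full search (one distance computation per codevector, k stored values per codevector), not exact operation counts.

```python
import math

def bits_per_sample(n_codevectors, k):
    """B = log2(N) / k."""
    return math.log2(n_codevectors) / k

def full_search_cost(bits, k):
    """Return (codevectors searched per input vector, stored scalar values)."""
    n_codevectors = 2 ** round(bits * k)   # N = 2^(B*k)
    return n_codevectors, k * n_codevectors

B = bits_per_sample(256, 8)          # 256 codevectors of dimension 8
cost_k8 = full_search_cost(1, 8)     # 1 bit/sample, k = 8
cost_k16 = full_search_cost(1, 16)   # 1 bit/sample, k = 16
```

At the same rate of 1 bit per sample, doubling the dimension from 8 to 16 raises the search from 256 to 65,536 codevectors and the storage from 2,048 to 1,048,576 values.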
GS-VQ
• The complexity of VQ can also be reduced by normalizing the vectors of the codebook and encoding the gain separately. The technique is called Gain/Shape VQ (GS-VQ); it was introduced by Buzo and later studied by Sabin and Gray.
• The waveform shape is represented by a codevector from the shape codebook, while the gain is encoded from the gain codebook.
• Comparisons of GS-VQ with F-VQ in the case of speech coding at one bit per sample revealed that GS-VQ yields about 0.7 dB improvement at the same level of complexity.
[Figure: GS-VQ encoder and decoder block diagrams]
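One common way to arrange a gain/shape encoder and decoder is sketched below in Python. It is an illustration under assumptions, not necessarily the structure in the slide's figure: the shape codevectors are assumed to have unit norm, the shape is chosen by maximum inner product, and the gain is then quantized to the nearest entry of a scalar gain codebook. All names and codebook values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gs_encode(x, shape_codebook, gain_codebook):
    """Return (shape index, gain index) for input vector x."""
    # Shape: the unit-norm codevector with the largest inner product with x.
    j = max(range(len(shape_codebook)), key=lambda n: dot(x, shape_codebook[n]))
    # Gain: quantize the projection of x onto the chosen shape.
    target = dot(x, shape_codebook[j])
    g = min(range(len(gain_codebook)), key=lambda n: (gain_codebook[n] - target) ** 2)
    return j, g

def gs_decode(j, g, shape_codebook, gain_codebook):
    """Reconstruct the vector as gain * shape."""
    return tuple(gain_codebook[g] * s for s in shape_codebook[j])

shapes = [(1.0, 0.0), (0.0, 1.0), (math.sqrt(0.5), math.sqrt(0.5))]
gains = [0.5, 1.0, 2.0, 4.0]
j, g = gs_encode((2.9, 3.1), shapes, gains)
x_hat = gs_decode(j, g, shapes, gains)
```

Only the two indices are transmitted, so a codebook of 3 shapes and 4 gains covers 12 reconstruction vectors while storing 3 vectors and 4 scalars.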
Adaptive VQ
• The VQ methods discussed thus far are associated with time-invariant (fixed) codebooks. Since speech is a non-stationary process, one would like to adapt the codebooks ("codebook design on the fly") to its changing statistics.
• VQ with adaptive codebooks is called adaptive VQ (A-VQ), and applications to speech coding have been reported.
• There are two types of A-VQ, namely, forward adaptive and backward adaptive.
• In backward adaptive VQ, codebook updating is based on past data which is also available at the decoder.
• Forward A-VQ updates the codebooks based on current (or sometimes future) data and as such additional information must be encoded.
• The principles of forward and backward A-VQ are similar to those of scalar adaptive quantization.
• Practical A-VQ systems are backward adaptive and they can be classified into vector predictive quantizers and finite state quantizers. Vector predictive coders are essentially an extension of scalar predictive DPCM coders.
Implementation Issues
• The complexity in high-dimensionality VQ can be reduced significantly
with the use of structured codebooks which allow for efficient search.
• Tree-structured and multistep vector quantizers are associated with
lower encoding complexity at the expense of loss of performance and in
some cases increased memory requirements.
• Gray and Abut compared the performance of F-VQ and binary tree
search VQ for speech coding and reported a degradation of 1 dB in the
SNR for tree-structured VQ.
• Multistep vector quantizers consist of a cascade of two or more
quantizers each one encoding the error or residual of the previous
quantizer.
• Gersho and Cuperman compared the performance of full search
(dimension 4) and multistep vector quantizers (dimension 12) for
encoding speech waveforms at 1 bit per sample and reported a gain of 1
dB in the SNR in the case of multistep VQ.
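The cascade structure of a multistep quantizer, in which each stage encodes the residual left by the previous one, can be sketched in Python as follows. The two toy codebooks and all function names are hypothetical; each stage uses a plain full search.

```python
def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def nearest(x, codebook):
    """Index of the codevector closest to x."""
    return min(range(len(codebook)), key=lambda n: sq_dist(x, codebook[n]))

def multistage_encode(x, stages):
    """Quantize x with stage 1, then quantize the remaining error with each later stage."""
    indices = []
    residual = tuple(x)
    for codebook in stages:
        n = nearest(residual, codebook)
        indices.append(n)
        residual = tuple(r - c for r, c in zip(residual, codebook[n]))
    return indices

def multistage_decode(indices, stages):
    """Sum the selected codevector from every stage."""
    x_hat = [0.0] * len(stages[0][0])
    for n, codebook in zip(indices, stages):
        x_hat = [a + c for a, c in zip(x_hat, codebook[n])]
    return tuple(x_hat)

stage1 = [(0.0, 0.0), (4.0, 4.0)]                               # coarse
stage2 = [(-0.5, -0.5), (0.5, 0.5), (-0.5, 0.5), (0.5, -0.5)]   # refinement
idx = multistage_encode((4.4, 3.6), [stage1, stage2])
x_hat = multistage_decode(idx, [stage1, stage2])
```

The two stages search 2 + 4 = 6 codevectors but can reach 2 x 4 = 8 reconstruction points, which is the source of the complexity saving.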
I SEMESTER M. TECH. SESSIONAL TEST 2005-06
VOICE AND PICTURE CODING (EL-653)
Differentiate between vowels and diphthongs. (4)
Calculate the pitch in mel for a signal of frequency 5000 Hz. (4)
Prove that the optimum quantizer reconstruction level is at the centroid of the probability density function over the interval.
For an AT&T sub-band coder the frequency ranges for each band are: 0-0.5 kHz, 0.5-1 kHz, 1-2 kHz, 2-3 kHz, and 3-4 kHz. For the bit allocation {4/4/2/2/0} calculate the bit rate. Explain why 0 bits are used for the 3-4 kHz band.
Ans: 4×1 + 4×1 + 2×2 + 2×2 = 16 kbps; only 3.4 kHz of bandwidth is available.
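The arithmetic in the answer can be reproduced with a short Python sketch, assuming each band is sampled at twice its bandwidth (the Nyquist rate) and every sample is coded with the allocated number of bits. The function name is hypothetical.

```python
def subband_bit_rate_kbps(bands_khz, bits):
    """Sum over bands of (2 * bandwidth in kHz) * (bits per sample)."""
    return sum(2 * (hi - lo) * b for (lo, hi), b in zip(bands_khz, bits))

bands = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
rate = subband_bit_rate_kbps(bands, [4, 4, 2, 2, 0])
```

Each of the four coded bands contributes 4 kbps (1 kHz x 4 bits for the two narrow bands, 2 kHz x 2 bits for the two wide ones), giving 16 kbps in total.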