
    POLITECNICO DI TORINO

    ANALOG AND TELECOMMUNICATION ELECTRONICS

    MINIPROJECT:

     

SPEECH CODING: TECHNIQUES, STANDARDS AND APPLICATIONS

    PROFESSOR: DANTE DEL CORSO

    STUDENT: NADIA PERRECA, ID: 211012

ACADEMIC YEAR: 2013-2014


    INDEX

INTRODUCTION

CHAPTER I: SPEECH CODING

I.1 Speech signal

I.2 Speech processing

I.3 Speech coding

I.4 Speech coding standards

I.5 Parametric representations

I.6 Waveform representations

I.7 Methods of comparison of speech coding techniques

CHAPTER II: PULSE CODE MODULATION TECHNIQUES

II.1 Pulse Code Modulation

II.2 Linear PCM

II.3 Logarithmic PCM

II.3.1 A and μ conversion laws

II.4 Differential PCM

II.5 Adaptive Differential PCM

II.6 Time division multiplexing

APPENDIX

BIBLIOGRAPHY


    INTRODUCTION

Speech signals are perhaps the most natural and common signals we can imagine dealing with; that's why, in the sphere of Information Technologies, voice communication has always played a very important role. Speech signals are unpredictable: their values and characteristics vary greatly with the speaker and the message to be transmitted, so specific techniques are needed to simplify their complicated processing. Among these techniques, speech coding is the process of obtaining a compact representation of voice signals for efficient transmission over band-limited wired and wireless channels, for storage, and for many other applications. "Compact" is not a simple adjective but a key word: the goal of speech coding is to represent speech in digital form with as few bits per second as possible without losing the intelligibility and "pleasantness" of speech, which include speaker identity, emotions, intonation, timbre and so on. The need for compactness is due to the technological transition from analog to digital electronics.

In the past, speech coding techniques were implemented and optimized for networks "dedicated" to telephone traffic; today, speech coders have become essential components in both telecommunications and multimedia infrastructures. Commercial systems that rely on efficient speech coding include cellular communication, voice over internet protocol (VoIP), videoconferencing, electronic toys and so on.

The aim of this project is to analyze the basic characteristics of speech signals and the most common speech signal processing techniques, with particular attention to speech coding as a data compression technique. We'll give some information about the different types of speech signal representations and the methods we can use to compare them. Then, we'll focus on the different Pulse Code Modulation techniques and their application to speech coding, in order to point out the benefits and drawbacks of each technique and the differences among them. In the end, we'll analyze the Time Division Multiplexing technique, which is one of the most important current applications of Pulse Code Modulation.


    CHAPTER I: SPEECH CODING

This chapter is an introduction to speech coding techniques, standards and applications. We'll analyze the basic characteristics of speech signals and the most common speech signal processing techniques, with particular attention to speech coding as a data compression technique. We'll give some information about the different types of speech signal representations and the methods we can use to compare them.

    I.1 SPEECH SIGNAL

A speech signal is created by the vocal cords, travels through the vocal tract, leaves the speaker's mouth and reaches the listener's ear as a pressure wave.

From an engineering point of view, we can model speech production with a source-filter model: we can see the vocal cords as a source and the vocal tract as a resonant cavity. If you placed a microphone right above someone's glottis during voicing, you would hear the glottal source by itself as a buzzing sound. The vocal tract filters the sound energy by suppressing some components of the glottal wave and amplifying the ones that are close to the resonance frequencies of the vocal tract, which depend on its shape and length. In this way, it changes the sound quality of the complex wave produced by the sound source. That means that when we talk about the speech signal, we mean a sort of filtered version of the really emitted sound.

In Figure 1, we can see a representation of a speech signal.

    Figure 1: Speech signal.


We can divide speech sounds into voiced and unvoiced: voiced signals are produced when the vocal cords vibrate during the pronunciation of a phoneme, while unvoiced signals do not entail the use of the vocal cords. Voiced signals tend to be louder, like the vowels; on the other hand, unvoiced signals tend to be more abrupt, like the stop consonants. The production of voiced and unvoiced speech is separated by silence regions: during a silence region, no excitation is supplied to the vocal tract and hence there is no speech output. However, silence is an integral part of the speech signal: even if from an energy point of view it's unimportant, its duration is essential for intelligible speech and it helps to recognize certain categories of sounds. Without silence regions between voiced and unvoiced speech, the speech would not be intelligible.

A first distinction among voiced, unvoiced and silence regions can be made by looking at the signal amplitude: if it's low or negligible, the frame can be marked as silence; if it's higher but below a threshold level, usually chosen by the user according to the previously observed characteristics of the sound under study, it is marked as unvoiced; if it exceeds that threshold, it is declared to be voiced. In Figure 2, this classification is illustrated.

Figure 2: Speech signal: voiced, unvoiced and silence regions.
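As a concrete illustration of this amplitude-based classification, here is a minimal Python sketch (not part of the original discussion) that labels each frame of a signal by comparing its short-time energy against two thresholds; the frame length and the threshold values are arbitrary illustrative assumptions that should be tuned to the recording at hand.

    import numpy as np

    def classify_frames(signal, frame_len=200, t_silence=1e-4, t_voiced=1e-2):
        """Label each frame as silence, unvoiced or voiced by comparing
        its average power with two illustrative thresholds."""
        labels = []
        for start in range(0, len(signal) - frame_len + 1, frame_len):
            frame = signal[start:start + frame_len]
            energy = np.mean(frame ** 2)      # short-time average power
            if energy < t_silence:
                labels.append("silence")
            elif energy < t_voiced:
                labels.append("unvoiced")
            else:
                labels.append("voiced")
        return labels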

As we can see from the previous figures, speech is not a predictable signal; from an analytic point of view, this means that it's a non-stationary signal with a non-uniform probability distribution, even if sometimes it's approximated with a Gaussian distribution. Its characteristics vary quickly and depend on the emitted sound; this makes the speech signal hard to analyze and model.

A quite common practical solution consists in modeling the speech signal as a slowly varying function of time: during intervals of 5 to 25 ms, the speech characteristics hopefully don't change too much and we can consider them to be almost constant. That means that, over small time intervals, the speech signal can be considered stationary with good approximation. Over these windows we can analyze the signal spectrum and the power density distribution, and we can distinguish voiced and unvoiced sounds.


    A general block diagram is shown in Figure 3.

    Figure 3: Block diagram of the analysis of speech signal by using frames.

The energy of speech signals is concentrated in the band 300 Hz - 3.4 kHz; they have a low-pass trend.

    Even if it’s difficult to recognize this fact in the time domain, voiced signal waveforms are

    periodic signals, so they have a line spectrum. On the contrary, unvoiced signal spectrum is

    continuous. There may be regions where the speech can be mixed version of voiced and unvoiced

    speech. In mixed speech, the speech signal will look like unvoiced speech, but you will also observe

    some periodic structures. We can see the voiced speech as the useful signal and the unvoiced speech

    as a sort of noise; usually, it is modeled as a White Gaussian Random Variable.

    In Figure 4 we can see the signal for the speech word “six”; if we consider a frame during the

    pronunciation of the consonant “s” (unvoiced signal), the signal appears continuous, as a sort of

    noise, while if we analyze a frame during the pronunciation of the vocal “i” (voiced signal), we can

    see a more regular signal.

Figure 4: Spoken word "six".


    I.2 SPEECH PROCESSING

Speech processing is the study of speech signals and of the methods used to process them. It involves the study of the techniques we use to deal with speech signals and of all the applications they are suitable for. There are analog and digital speech signal processing techniques. At the very beginning, speech signal processing was developed using analog electronics; in fact, all 1G systems have analog speech transmission. Since the 1970s, signal processing has increasingly been implemented on computers in the digital domain; in fact, all 2G and 3G systems have digital speech transmission.

Digital speech signal processing techniques are easier, less expensive, more sophisticated and faster than the analog ones. They allow an improvement in speech quality, are reliable and very compact, and can be implemented in Integrated Circuits. Moreover, with regard to the transmission of voice signals requiring security, the digital representation has a distinct advantage over analog systems: the information bits can be scrambled in a manner which can ultimately be unscrambled at the receiver.

    Digital signal processing techniques can be applied in many speech communication areas:

    - Transmission and storage;

    - Speaker verification and identification;

    - Speech recognition (conversion from speech to written text);

    - Aids to the handicapped;

    - Enhancement of signal quality.

A general scheme which includes the fundamental signal processing is shown in Figure 5: a speech coder converts the analog speech signal into a coded digital representation, which is usually transmitted in frames over a non-distorting channel. A speech decoder receives the coded frames and synthesizes the reconstructed speech.

    Figure 5: Block diagram of a general speech signal processing.


    I.3 SPEECH CODING

As we can see in Figure 5, speech signal coding is a very important step in speech signal processing. It involves the study of all the techniques used to represent and treat speech signals in a more convenient form that allows us to use them for several applications such as acquisition, manipulation, storage, transfer and so on.

In general, a speech signal coder has two essential characteristics:

- Integrity of the speech: the information contained in the speech signal must be kept intact, without distortions.

- Quality: the speech signal must be intelligible and pleasant, which means that things such as speaker identity, emotions, intonation, timbre and so on must be recognizable.

In addition, it can have several desirable properties, such as:

    - Low bit-rate;

    - Low memory requirements;

    - Low transmission power required;

    - Fast transmission speed;

    - Low computational complexity;

    - Low coding delay;

    - Robustness.

We can easily understand why all these properties are desirable. If we have a low bit-rate coder, less bandwidth is required for the transmission. Moreover, we reduce the amount of transmitted data, saving memory and transmission power and increasing the transmission speed. If the coder has a low computational complexity, the required power decreases further. The coding delay is the time that elapses from the moment a speech sample arrives at the encoder input to the moment it appears at the decoder output, so it's clear that we want a coding delay that is as low as possible, in order to minimize interferences and interruptions during the communication. A speech coder is robust if it's suitable for any type of speaker (male, female, children) and for many different languages. That's a very difficult property to satisfy, because in general we need different circuits and devices to deal with different types of speech sound, according to their complexity.

However, we have to keep in mind that there is always a tradeoff between one property and another, in particular between low bit-rate and speech quality. In general, we have to design the system according to the given specifications.


There are essentially two types of digital speech signal coding: waveform representations and parametric representations. As we can see in Figure 6, both of them involve a series of techniques that we'll analyze in the next sections.

    Figure 6: Speech signal coding techniques

     

Waveform representations, as the name implies, are concerned with preserving the wave shape of the analog speech signal through a sampling and quantization process.

Parametric representations, on the other hand, are concerned with representing the speech signal by using some of its characteristic parameters.

There are also hybrid representations, which are a fusion of the two illustrated coding techniques, but we won't analyze them.

In the study of speech signal processing, speech coding is a very important matter, especially in the telecommunications area.

Speech coding is the process of obtaining a compact representation of the speech signal that can be efficiently transmitted over band-limited wired and wireless channels or stored in digital media. "Compact" is not a simple adjective but a key word: the goal of speech coding is to represent speech in digital form with as few bits as possible without losing the intelligibility and "pleasantness" of speech, which include speaker identity, emotions, intonation, timbre and so on.


Other requirements, such as low coding delay, good performance, low complexity and low losses, depend on the particular application we're dealing with.

In general, we can recognize parametric representations as a form of speech coding, but we'll refer more in detail to the ways we have to reduce the bit rate of waveform representations, because they preserve the quality of the signal. We'll see that the standard bit-rate for a waveform representation is fixed at 64 kb/s: any bit-rate below 64 kb/s is treated as compression, and the output of the source encoder is an encoded speech signal having a bit-rate lower than 64 kb/s.

If we compress the speech signal by reducing the number of bits per sample, we obtain a lot of benefits:

- Reduction of the bandwidth;

- Reduction of the transmitted data (memory occupation);

- Reduction of the required transmission power;

- Increase of the transmission speed;

- Increase of the immunity to noise (some of the saved bits per sample can be used as error control bits protecting the speech parameters).

Usually we can distinguish four levels of quality, according to the bit rate:

- TOLL: perfect quality;

- NEAR TOLL: almost perfect;

- DIGITAL CELLULAR: some background noise is introduced, but the speech is still very well reconstructed;

- LOW BIT RATE: speech sounds noisy, artificial and unnatural, but is still understandable.

In Table 1, we can see some examples of the cited speech coding techniques with their bit rates and quality.

Table 1: Speech coding bit rate and quality.

Speech coding | Bit rate (kb/s) | Quality
PCM           | 64/32           | TOLL
DPCM          | 32/16           | NEAR TOLL
ADPCM         | 4               | DIGITAL CELLULAR
Vocoder       | 4.2             | LOW BIT RATE
LPC-10        | 2.4             | LOW BIT RATE


In the past, speech coding techniques were implemented and optimized for networks "dedicated" to telephone traffic; the growing need for integration between telephony and data, however, drives the study of new standards, such as voice over IP (Internet Protocol), able to ensure quality levels comparable to those offered by the old telephone network. Satellite communication systems, where the cost of the channel is very high; mobile systems, where the number of users grows exponentially; and multimedia systems, whose information content requires considerable mass memory: all of these are applications for which it is necessary to introduce voice encoding processes.

We can summarize all the areas of application in a single graph, shown in Figure 7.


    I.4 SPEECH CODING STANDARDS

Standards for the landline Public Switched Telephone Network (PSTN) are established by the International Telecommunication Union (ITU), a United Nations agency; video and audio coding standards such as those of the Moving Picture Experts Group (MPEG) are instead developed by the International Organization for Standardization (ISO). The ITU has promulgated a number of important speech and waveform coding standards at high bit rates and with very low delay, including:

- G.711: it standardizes PCM at 64 kb/s, in which a logarithmic (A-law or μ-law) quantization is used for the discretization of the amplitudes with 8 bits per sample;

- G.721: it standardizes ADPCM, halving the bit-rate to 32 kb/s while maintaining the same encoding quality;

- G.722: it standardizes ADPCM at 64 kb/s, using two ADPCM coders at 32 kb/s, one in the 0-4 kHz band and the other in the 4-7 kHz band;

- G.723.1: it provides two operating speeds: one at 6.3 kb/s and the other at 5.3 kb/s.

Standards for cellular telephony in Europe are established by the European Telecommunications Standards Institute (ETSI). The ETSI has standardized speech coding algorithms for digital mobile communication, published by the Global System for Mobile Communications (GSM) subcommittee. All speech coding standards for digital cellular telephony are based on LPC-AS algorithms.

The first GSM standard coder was based on a precursor of CELP called regular-pulse excitation with long-term prediction. In 1999 a new mobile network with global coverage, called UMTS (Universal Mobile Telecommunication System), was designed; for it, the ETSI proposed a new coding standard called AMR (Adaptive Multi-Rate), which uses an encoder that generates adaptive traffic flows at eight different rates (from 12.2 kb/s down to 4.75 kb/s), depending on the operating conditions.

In Table 2 we can see some applications of the illustrated standards.

    In Table 2 we can see some applications of illustrated standards.


Table 2: Speech coding for several applications.

Application        | Bandwidth (kHz) | Bit rate (kb/s) | Standards organization | Standard number | Algorithm | Year
Landline telephone | 3.4             | 64              | ITU                    | G.711           | PCM       | 1988
Video conferencing | 7               | 64 (32+32)      | ITU                    | G.722           | ADPCM     | 1988
Digital cellular   | 3.4             | 8               | ITU                    | G.729           | ACELP     | 1996
Digital cellular   | 3.4             | 12.2            | ETSI                   | EFR             | ACELP     | 1997
VoIP               | 3.4             | 5.3-6.3         | ITU                    | G.723.1         | CELP      | 1996

    I.5 PARAMETRIC REPRESENTATIONS

Parametric representations are concerned with representing the speech signal by using some parameters which are obtained by analyzing the speech signal spectrum.

The idea is that a sampled speech signal contains a great deal of information that is either redundant (nonzero mutual information between successive samples) or perceptually irrelevant (information that is not perceived by human listeners): if we succeed in describing the signal by using some of its characteristic parameters, we can transmit these parameters instead of the signal itself. The parameters of a signal change more slowly than the signal they describe and they are few in number, so we can save bandwidth and increase the speed of transmission. In this way we obtain a lot of benefits in terms of transmitted data, memory saving, transmission power, speed and so on.

Clearly, there are drawbacks in terms of quality: the output signal is not a reconstruction of the input signal based on its samples, but on some parameters which describe it indirectly; we're not able to recreate the original speech, but only a dehumanized version of it.

In order to obtain these parameters, we need to describe speech production with a mathematical model, like the one introduced in Section I.1. This model is called the source-filter model because we model the vocal cords as a source and the vocal tract as a resonant cavity. The vocal tract filters the sound energy by suppressing some components of the glottal wave and amplifying others, the ones that are close to the resonance frequencies of the vocal tract, which depend on its shape and length. In this way, it changes the sound quality of the complex wave produced by the sound source.

    Figure 8: Source – Filter model.

The next step consists in representing voiced and unvoiced signals.

If we segment the speech signal into frames of small time duration, in which it can be considered a stationary signal (as seen in Section I.1), we can examine one frame at a time. The duration of a single frame must be short enough that the properties of the sound do not change significantly within it, and long enough to be able to calculate the parameters that we want to estimate (which is also useful for reducing the effect of any noise affecting the signal). Moreover, the series of windows should cover the entire signal, as shown in Figure 9.

    Figure 9: Framed speech signal.

We can use different types of window: this choice, clearly, influences the quality of the analysis. The simplest one is the rectangular window, but this choice can produce large fluctuations of the parameters we are interested in. For example, if we're measuring the energy of the signal and the frame shifts, the part of the signal that is contained in the new frame can assume higher values than it did in the previous frame, causing a big difference in the signal energy. So, an alternative to the rectangular window is the Hanning window: by tapering the ends of the window, we avoid large effects on the parameters even if the signal suddenly changes. Both windows are shown in Figure 10, and a sketch of the framing procedure follows the figure.

    Figure 10: Different types of windows.
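To make the framing procedure concrete, the following Python sketch splits a signal into overlapping frames and applies a Hanning window to each one; the 25 ms frame length and 50% overlap are typical but arbitrary choices, assumed here only for illustration.

    import numpy as np

    def frame_signal(x, fs=8000, frame_ms=25, overlap=0.5):
        """Split x into overlapping frames and apply a Hanning window."""
        frame_len = int(fs * frame_ms / 1000)   # samples per frame
        hop = int(frame_len * (1 - overlap))    # frame advance in samples
        window = np.hanning(frame_len)          # tapered (Hanning) window
        frames = []
        for start in range(0, len(x) - frame_len + 1, hop):
            frames.append(x[start:start + frame_len] * window)
        return np.array(frames)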

If we look at a single frame, we know that we can model the voiced signal as a periodic signal and the unvoiced one as a white Gaussian random process. In this way, we can model the signal source (the vocal cords) as two distinct signal sources.

Looking at Figures 9 and 11, we can see that there's an overlap between different frames: this allows us to predict the trend of the signal in the next frame by studying its trend in the current frame.


Figure 11: Overlapping windows.

One of the most powerful speech analysis techniques, and one of the most useful methods for encoding good-quality speech at a low bit rate (only 2 kb/s!), is Linear Predictive Coding (LPC). It's defined as a method for encoding an analog signal in which the value of the current speech sample is estimated by using its past few speech sample values. That's possible precisely because frames are usually overlapped.

    In Figure 12 we can see a schematization of an LPC model.

    Figure 12: Linear Prediction Coding model.


    First of all, we’re interested in understanding if a voiced or an unvoiced signal has been transmitted:

    according to the analysis of the frame, a switch selects the right source. We can’t neglect the effect

    of area which acts like a multiplicative factor on the signal, so we find a multiplier. Then, we have

    to characterize the filter, which models the effect of the vocal tract; the characteristic parameters of

    this filter, such as the gain, the cut off frequency, depends on the specific frame and changes when

    we consider different frames and can be estimated using different methods, such as the one of

    interest, the LPC.

    The parameters which describe the model, and so the speech signal, are transmitted to the

    receiver. The receiver unit needs to be set up in the same channel configuration to re-synthesize a

    version of the original signal spectrum in order to recreate speech; it will carry out LPC synthesis

    using the received parameters and builds a source filter model, that when provided a correct input,

    will accurately reconstruct the original speech signal.
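As a rough sketch of how the filter parameters can be estimated from a frame, the fragment below computes LPC coefficients by solving a least-squares problem on past samples; production coders normally use the autocorrelation method with the Levinson-Durbin recursion, so this is only a didactic approximation, and the prediction order p = 10 is an assumed value.

    import numpy as np

    def lpc_coefficients(frame, p=10):
        """Estimate p LPC coefficients by least squares:
        x[n] ~ a1*x[n-1] + ... + ap*x[n-p]."""
        rows = [frame[n - p:n][::-1] for n in range(p, len(frame))]
        X = np.array(rows)          # each row holds the p previous samples
        y = frame[p:]               # samples to be predicted
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        return a

    def prediction_error(frame, a):
        """Residual that an LPC-based coder would quantize and transmit."""
        p = len(a)
        pred = np.array([frame[n - p:n][::-1] @ a for n in range(p, len(frame))])
        return frame[p:] - pred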

    LPC is generally used for speech analysis and resynthesis. Some applications of this technique are:

    - Phone companies (GSM standard);

    - Vocoders;

    - Secure Wireless;

    - Audio codecs.

    I.6 WAVEFORM REPRESENTATIONS

    Waveform representation (also called standard representations) are concerned with preserving the

    wave shape of the analog speech signal in order to transmit a loyal representation of the speech

    signal: they are characterized by a great quality. That implies many data to transmit, a lot of power

    required, but also a simple structure based on the well noted sampling and quantization techniques:

    that’s the reason why they are not specific to speech signals and can be used for any type of signals.

    As we can say in Figure 6, there are two different types of waveform coders: Time domain

    Waveform Coders and Frequency domain Waveform Coders. They are based essentially on the

    same idea to represent the speech signal using the set of its samples, but they differ themselves

    about the techniques used to implement it. We’ll focus on the first class of techniques and in

    particular on the Pulse Code Modulation and its variants (Linear PCM (LPCM), Logarithmic PCM

    (LPCM), Differential PCM (DPCM), Adaptive Differential PCM (ADPCM)).


In Figure 13, the general scheme of a waveform representation technique is shown. As anticipated, this block diagram is quite general and is suitable to schematize many types of applications; however, regarding speech processing, we have a series of standards which govern every single step according to the application of interest.

    Figure 13: Block diagram of a generic waveform representation technique.

    Let’s analyze the sampler and the A/D conversion blocks more in detail, as shown in Figure 14.

    Figure 14: Detailed block diagram of sampling and A/D conversion.

The sampler turns a continuous-time signal into a discrete-time signal. In order to keep the value of the signal constant for the time required by the following circuits to convert it, a Sample and Hold technique is used.

The most used sampling technique is Pulse Amplitude Modulation (PAM). It is an analog, impulsive modulation technique: the modulating signal is an analog signal and the carrier is a train of pulses whose rate depends on the Nyquist criterion and whose duration depends on the time required for the A/D conversion. In Figure 15 the PAM modulation is shown.


Figure 15: Pulse Amplitude Modulation; A) Analog signal (modulating signal), B) Train of pulses (carrier), C) Sampled signal (modulated signal).

     

Since the full audible bandwidth goes from 20 Hz up to 20 kHz, we should sample at a rate of at least 40 kHz (according to the Nyquist criterion). Anyway, we said that the energy of a speech signal is concentrated in the first 4 kHz, so we can choose the sampling rate according, for example, to the bandwidth of telephone channel lines, which goes from 300 Hz to 3.4 kHz. As we well know, the output signal of a telephone line is a clear and pleasant sound!

In this way we could sample at a frequency of at least 6.8 kHz; international standards fixed the telephone sampling rate at 8 kHz, so the sampling period has a duration of 125 μs. It means that the speech signal can be perfectly reconstructed if we have at least 8000 samples per second:

fs = 8 kHz → Ts = 1/fs = 125 μs

The sampled signal is a discrete-time signal with continuous amplitude. The A/D converter provides the quantization and the coding of the amplitude of each modulated signal pulse. The number of bits assigned for the encoding has been fixed by international standards at 8 bits, so, since we have to transmit 8000 samples per second, we work at 64 kb/s. The type of code used depends on the nature of the communication channel and on the transmission speed. As anticipated, any bit-rate below 64 kb/s is treated as compression, and the output of the source encoder is an encoded speech signal having a bit-rate lower than 64 kb/s. A small numerical sketch of this sampling and quantization chain is shown below.
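A minimal numerical sketch of the chain just described, assuming a uniform 8-bit quantizer for simplicity (the actual telephone standard uses the logarithmic companding treated in Chapter II):

    import numpy as np

    fs = 8000                        # telephone sampling rate (Hz), Ts = 125 us
    bits = 8                         # bits per sample fixed by the standard
    t = np.arange(fs) / fs           # one second of sampling instants
    x = 0.5 * np.sin(2 * np.pi * 1000 * t)   # 1 kHz test tone in [-1, 1]

    levels = 2 ** bits               # 256 quantization levels
    step = 2.0 / levels              # uniform step over the [-1, 1) range
    codes = np.clip(np.round(x / step), -levels // 2, levels // 2 - 1).astype(int)

    print(fs * bits)                 # bit rate: 8000 * 8 = 64000 b/s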


In the end, we apply source and channel encoding, a set of operations that simplify the transmission of data over the channel. For example, source encoding allows several traffic flows to converge on a single physical medium at the same time.

The decoding step performs more or less the same operations in reverse, in order to obtain a faithful version of the analog input speech signal at the output.

Waveform coders are most useful in applications that require the successful coding of both voiced and unvoiced signals. In the Public Switched Telephone Network (PSTN), for example, the successful transmission of modem and fax signaling tones and of switching signals is nearly as important as the successful transmission of speech.

    I.7 METHODS OF COMPARISON OF SPEECH CODING TECHNIQUES

If we want to compare a parametric technique with a waveform technique (or with a hybrid technique), we need some indicators of the intelligibility and quality of the speech produced by each coder. There are two methods for evaluating the quality of a speech signal that has undergone a compression process:

1) Subjective methods: they are the most significant and reliable methods of comparison, but they are also very expensive and require long test development times. The most commonly used parameter is the MOS (Mean Opinion Score), which represents the average opinion rating of a group of listeners. To establish the MOS of a coder, listeners are asked to classify the quality of the encoded speech into one of five categories, each characterized by a numerical value:

    1- Bad

    2- Poor

    3- Fair

    4- Good

    5- Excellent

We can consider a speech coder a good one if it's characterized by a MOS greater than 3.5-4. In Figure 16 we can see how the MOS of different types of speech coders changes as the bit rate increases.


    Figure 16: Comparison of different types of speech coding techniques.

2) Objective methods: these methods are used in the initial phase of a codec project. They provide several analytical measurements; the most important one is the Signal-to-Noise Ratio (SNR), the ratio between the power of the input signal and the power of the coding error. Objective measures have the advantage of being computable automatically (and therefore over very large databases); moreover, they don't depend on the tastes of the listeners. The main problem with objective measures is that, especially for coders operating at low bit rates, they are poorly correlated with the perceived quality of the speech signal.


    CHAPTER II: PULSE CODE MODULATION TECHNIQUES

    In this chapter we’ll analyze the different Pulse Code Modulation techniques and their application

    to the speech coding, in order to point out benefits and drawbacks of each technique and differences

    among all of them. Then, we’ll analyze the Time Division Multiplexing technique, that’s a very

    important application based on the Pulse Code Modulation.

    II.1 PULSE CODE MODULATION

Pulse Code Modulation (PCM) is a method used to digitally represent sampled analog signals; in other words, it's a quantization technique that is usually applied to PAM signals, as anticipated in the previous chapter, Section I.6.

Quantization is an operation which, given a continuous-amplitude signal, returns a discrete-amplitude signal. The set of discrete amplitudes depends on the range of values assumed by the input signal and on the number of bits, N, of the quantizer. In the case of interest, in which the input signal is a PAM signal, quantization affects the amplitude of each sample, comparing it with the different quantization levels and rounding it to the closest one. Quantization introduces an error, called quantization error, due to the rounding or truncation of the signal amplitude: it's the difference between the real analog value (A) and the quantized digital value of the same (A').

The quantization error can be quantified by evaluating the Signal to Noise Ratio (SNR):

SNR = Ps / Pe

where Ps is the power of the signal and Pe the power of the quantization error. This quantity is usually expressed in dB, so:

SNR|dB = 10 · log10(Ps / Pe)


In order to compute the power of a signal s(t), we need to know its probability density function ρ(s), because:

Ps = ∫ s² · ρ(s) ds = σ²

where σ² is the variance of the (zero-mean) signal. As explained in the previous chapter, Section I.1, the speech signal probability density function can be approximated with a Gaussian; if the quantizer range S is chosen to cover ±3σ (that is, S = 6σ, so that almost all samples fall inside the range), we have:

Ps = σ² = S²/36

And then:

SNR|dB = 10 · log10( (S²/36) · (1/Pe) )

Since this parameter depends on the signal power and, therefore, on the signal amplitude, and since speech signals are usually low-level signals, we can expect the SNR to be very low. As we can see in Figure 17, the SNR related to speech signals is lower than the one computed for sine or square input signals. We will see that the SNR improves as the number of bits increases; in particular, for each bit we add, we improve the SNR by 6 dB.


    Figure 17: SNR for different types of waveforms.

    II.2 LINEAR PCM

Linear PCM, or uniform PCM, is the name given to the quantization algorithm in which the reconstruction levels are uniformly distributed over the PAM range of values [0; S] (we consider a unipolar signal, but nothing changes if we consider a bipolar signal whose dynamic belongs to a range [-V; V]). It means that we divide the signal dynamic range into a number M = 2^N of intervals having the same amplitude; the interval amplitude is also called the quantization step and is equal to:

Δ = S / 2^N

    As we can see in Figure 18, the ideal quantization characteristic is a step function.


    Figure 18: Ideal transfer function of a linear quantizer.

As anticipated in the previous section, quantization introduces an error that's intrinsic to the process: it can't be eliminated, but only reduced. This error is due to the rounding or truncation of the signal amplitude and can be expressed as the difference between the real analog value and the quantized digital value of the same.

In the case of LPCM, the error has a sawtooth trend (as shown in Figure 19) and its maximum value is:

e_max = Δ / 2

    Figure 19: Quantization error.

The advantage of LPCM is that the quantization error affects the whole signal dynamic in the same way; this property is desirable in many digital audio applications. However, we can easily understand that the quantization error is much more relevant when the signal has a low amplitude than when it is high: if we deal with a low-power signal, the quantization error affects its value more than it does for a high-power signal. It means that high-level signals are quantized with good precision, while low-level signals with bad precision: in general, we want to quantize all signal levels with the same precision.

Let's evaluate the SNR.

Since the quantization error has a sawtooth trend, we can approximate it with a triangular waveform, whose probability density function is uniform over the interval (-Δ/2; Δ/2], and easily calculate:

Pe = ∫ from -Δ/2 to +Δ/2 of e² · (1/Δ) de = Δ²/12 = S² / (12 · 2^(2N))

So, we obtain:

SNR|dB = 10 · log10( (S²/36) · (12 · 2^(2N) / S²) ) = 10 · log10( 2^(2N) / 3 ) ≈ 6.02 · N - 4.77 dB

As anticipated in Section II.1, we have a very low SNR (a negative constant term is present, while for other input signals, such as a full-scale sine wave, the constant term is positive), because of the low level of speech signals.

The SNR grows linearly until the input signal reaches an amplitude that is higher than the quantizer dynamic: this condition is called overload and is shown in Figure 20. The 6 dB-per-bit rule can be verified numerically, as in the sketch after the figure.

    Figure 20: SNR for Linear PCM.
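The sketch below quantizes a zero-mean Gaussian signal loaded at ±3σ with a uniform N-bit quantizer and measures the resulting SNR, to be compared with the 6.02·N - 4.77 dB formula; the rare samples clipped beyond ±3σ make the measured value differ slightly from the theoretical one.

    import numpy as np

    def measured_snr_db(n_bits, n_samples=200_000):
        """Uniformly quantize a Gaussian signal with sigma = S/6
        and return the measured SNR in dB."""
        S = 1.0                              # full range [-S/2, S/2]
        sigma = S / 6                        # 3-sigma loading per side
        x = np.random.randn(n_samples) * sigma
        step = S / 2 ** n_bits               # quantization step
        xq = np.clip(np.round(x / step) * step, -S / 2, S / 2 - step)
        noise = x - xq
        return 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

    for n in (6, 7, 8):
        print(n, round(measured_snr_db(n), 2), round(6.02 * n - 4.77, 2))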


Moreover, we can see that if we increase the number of bits, we improve the SNR by 6 dB per bit: the drawback of this choice is clearly the complexity of a system characterized by a high number of bits.

    II.3 LOGARITHMIC PCM

A clever way to improve the accuracy of the quantization technique consists in realizing a non-linear (or non-uniform) quantization; it means that we consider a quantization step size which is not constant over the entire dynamic range of the signal but changes according to the level of the input signal. In this way, we can quantize a low-level signal using a very small quantization step size and a high-level signal using a wider quantization step size, obtaining an acceptable precision over all PAM signal levels.

Since, in general, we don't know the signal distribution, a good criterion to follow in order to realize a non-linear PCM is to make the SNR constant over all PAM signal levels. In this way we realize a general technique which doesn't depend on the specific signal amplitude.

    We can obtain a non-linear quantization by using an analog or a digital process.

In the analog process, the continuous-amplitude PAM signal passes through an analog compressor before being converted into a digital signal. The compressor is essentially a logarithmic amplifier that has the task of amplifying the lowest levels of the PAM signal and compressing the highest ones.

In the digital process, the continuous-amplitude PAM signal is converted into a digital signal using a linear quantization; subsequently, it passes through a compressor which modifies the digital representation using a different number of bits.

In the latest generation of systems, the digital non-linear quantization method is the most adopted, for obvious reasons of cost, performance, simplicity of construction and integration with all the other digital equipment. Anyway, both solutions require the receiving PCM apparatus to contain a unit complementary to the compressor, known as the expander, which restores the original analog levels of the information.

For voice signals, whose values are usually very low, we want narrower intervals close to zero, so the best type of non-linear PCM to adopt is, intuitively, the logarithmic one.


    Figure 21: Ideal transfer function of a logarithmic quantizer.

    Actually, we can prove this fact in a more rigorous way.

    If we look at the block diagram in Figure 22, we can see that the quantization error is generally

    seen as an additive error.

    Figure 22: Block diagram of a logarithmic quantizer.

At the output of our scheme we will have the digital signal D, which is:

D = S + e

where S is the input sample and e the quantization error. For low-level input signals, whose amplitude is close to zero, we want a quantization error which is also close to zero, so we can express it as:

e = (K - 1) · S

where K is close to 1. In this way, we can see the additive quantization error as a multiplicative error:

D = S + (K - 1) · S = K · S


This is a very good thing, because the quantization error is now relative to the value of the signal and, since we can see it as a multiplicative error, it scales the output of the system linearly.

As we can see in Figure 23, we have a constant SNR until the overload condition.

    Figure 23: SNR for Logarithmic PCM.

    II.3.1 A AND μ CONVERSION LAWS

The logarithmic characteristic does not pass through the origin: this can be a problem when we process signals whose level is very close to the origin, and that's quite common when we deal with speech signals.

    Two laws have been enacted to standardize the ways to solve this problem:

- A – Law: provides the linearization of the logarithmic characteristic for values close to the origin, in a region whose width is fixed by a parameter A, as shown in Figure 24.


    Figure 24: A – Law.

Mathematically speaking, we have:

F(x) = sgn(x) · A·|x| / (1 + ln A)             for |x| < 1/A

F(x) = sgn(x) · (1 + ln(A·|x|)) / (1 + ln A)   for 1/A ≤ |x| ≤ 1

This law is the standard in Europe (with A = 87.6).

- μ – Law: provides a translation of the logarithmic characteristic in order to obtain a passage through the origin, as shown in Figure 25. The translation must be equal to the intercept of the curve; the transfer function will be:

F(x) = sgn(x) · ln(1 + μ·|x|) / ln(1 + μ)

    Figure 25: μ – Law.

This law is the standard in the USA and Japan (with μ = 255).
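A minimal sketch of both compression characteristics, using the standard parameter values A = 87.6 and μ = 255 (the expander at the receiver applies the inverse functions):

    import numpy as np

    def a_law_compress(x, A=87.6):
        """A-law: linear below |x| = 1/A, logarithmic above (x in [-1, 1])."""
        ax = np.abs(x)
        y = np.where(ax < 1 / A,
                     A * ax / (1 + np.log(A)),
                     (1 + np.log(A * np.maximum(ax, 1 / A))) / (1 + np.log(A)))
        return np.sign(x) * y

    def mu_law_compress(x, mu=255):
        """mu-law: logarithmic characteristic translated through the origin."""
        return np.sign(x) * np.log(1 + mu * np.abs(x)) / np.log(1 + mu)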


These laws introduce an approximation of the logarithmic characteristic: in both cases, in fact, we don't have a logarithmic trend close to the origin, and that causes a decrease of the SNR, as shown in Figure 26.

    Figure 26: SNR of an approximated logarithmic PCM.

In Figure 27, the SNR trend for both linear and logarithmic PCM is shown.

    Figure 27: Comparison of the SNR for different PCM.

In order to simplify the analysis of a logarithmic characteristic, we can introduce a piecewise approximation: the shape of the function is approximated with a set of straight lines, one after the other. We divide the function into segments, which are in turn divided into levels, as shown in Figure 28.


    Figure 28: Piecewise approximation.

We can notice the presence of three numbers: the first one is a sign parameter, a bit which identifies the polarity of the signal; the second one is a segment parameter, which identifies which line we are considering; and the last one is a level parameter, which identifies the point within the considered segment. A sketch of this encoding is shown below.
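The following sketch illustrates this sign/segment/level decomposition in the spirit of the G.711 A-law 8-bit code word (1 sign bit, 3 segment bits, 4 level bits); the bias and bit-inversion details of the real standard are deliberately omitted.

    def segment_level_encode(sample):
        """Encode a 12-bit magnitude (0..4095) into (sign, segment, level):
        1 sign bit, 3 segment bits, 4 level bits."""
        sign = 0 if sample >= 0 else 1
        mag = min(abs(sample), 4095)
        # Segment thresholds double each time: 32, 64, 128, ..., 2048.
        segment, threshold = 0, 32
        while segment < 7 and mag >= threshold:
            segment += 1
            threshold *= 2
        if segment == 0:
            level = mag // 2                  # step = 2 in the first segment
        else:
            seg_start = 16 << segment         # 32, 64, 128, ..., 2048
            step = seg_start // 16            # step doubles with the segment
            level = (mag - seg_start) // step
        return sign, segment, level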

With logarithmic functions, at every point the same distance represents the same ratio; in the piecewise approximation, we can imagine that this behavior is satisfied globally, but not locally. This approximation introduces an unpleasant issue: on each segment, we have a linear approximation of the logarithm, which means that the quantization error is constant only within each segment. This is a bad thing because if we compare points near the boundary between two segments, which have almost the same amplitude but lie on opposite sides of the boundary, we will have very different quantization errors, even though the points are close.

The actual behavior of the signal-to-noise ratio is shown in Figure 29: there are ripples where we expected a flat behavior; these ripples are 6 dB wide, because of the considerations made above.


    Figure 29: SNR of a piecewise approximated Log-PCM.

    II.4 DIFFERENTIAL PCM

The basic problem with the previous types of PCM is that the quantizer works on a fixed dynamic range, while the speech signal, by its nature, is usually very low. It means that we effectively work with a number of bits that's smaller than the number of bits of the quantizer: we don't use the bits which are associated with the higher values of the dynamic range. In the following section we'll see how to solve this problem in a very clever way that allows us to drastically reduce the bit rate. Clearly, it also means a drastic reduction of the speech signal quality.

These methods are based on the observation that consecutive samples are often correlated. This allows two considerations:

1) if samples are correlated, we can predict, in a more or less precise way, the value of a sample from an estimation of the previous samples;

2) correlated samples contain redundant information that is not useful, so we can remove it and obtain a faster transmission.

Differential pulse-code modulation (DPCM) is a signal encoder that uses the baseline of pulse-code modulation (PCM) but adds some functionalities based on the ideas just exposed. DPCM was invented by C. Chapin Cutler at Bell Labs in 1950.

This method provides the coding of the difference between an input signal and its predicted value, estimated by an evaluation of previous input samples; in other words, we code the prediction error. If the difference (that is, the error) is small, it means that the two samples are strongly correlated


and we can remove redundant information; moreover, the number of bits required for transmission is reduced. In this way, we can obtain compression ratios on the order of 2 to 4 and we drastically reduce the bit rate: clearly, as we know well, this brings a drawback in terms of the quality of the obtained signal, which will still be clear and understandable, but no longer as pleasant.

It's a predictive form of coding, because we have to predict the current sample value based upon previous samples.

    The general block diagram of a transmitter and receiver system based on DPCM is shown in

    Figure 30.

    Figure 30: Transmitter and Receiver for a DPCM.

    Let’s analyze them.

The Predictor is a unit whose task is to predict the quantized value of the current input sample, through an estimation which depends on the previously quantized sample (or samples) and on a prediction factor. We can intuitively understand that we can make a better estimation if we consider many samples instead of only one. In order to consider more samples, we need to consider a framed input signal, so that all the samples which occur in a frame are used to make a new estimation. Let's assume that K is the number of samples in a frame; the choice of the value of K is critical because:

- If K is low, we have fewer samples and they are more correlated: they contain a lot of redundant information that we can remove. We have few parameters to deal with, so the complexity of the circuit is reduced, but the precision of the estimation decreases.

- If K is high, we have more samples and they may have very different values, so they are less correlated. We can make a good estimation, but the complexity of the circuit increases while


its efficiency decreases, because if we have uncorrelated samples, there is no redundant information to remove.

We have to choose K in such a way that a possible error weighs little on the estimate and, at the same time, that we work with samples which are correlated. A clever idea that's usually implemented is to multiply the frame by an exponential function, which gives more weight to the close samples and reduces the weight of the more distant ones.

The predicted value is a function of those K previous samples:

x̂[n] = f( x[n-1]; x[n-2]; …; x[n-K] )

The difference between the input signal and its predicted value is called the prediction error and it's the input of the quantizer.

If the prediction was right, so that x[n] = x̂[n], this signal is zero: redundant information is removed. Otherwise, it's a very low-level signal and we need few bits to quantize it. In this way, the bit rate is strongly reduced: if we use, for example, 4 bits, we work at 32 kb/s.

In the receiver the process is reversed: we need to decode the input signal and then add the predicted signal.

    Now we have to understand how the Predictor works.

Usually, the prediction is linear: it means that the predicted value is a weighted linear combination of previous quantized samples:

x̂[n] = A·xq[n-1] + B·xq[n-2] + C·xq[n-3] + …

where A, B, C, … are the prediction coefficients.

    A block diagram for a linear Predictor is shown in Figure 31.


    Figure 31: Block diagram of a Linear Predictor.

The values of the coefficients depend on the autocorrelation function of the signal in the frame we're analyzing, so they are not constant. Their optimum values are the ones which minimize the prediction error power:

(A, B, C, …) = argmin Pe

We can obtain them by using variable-gain amplifiers. The D units are delay units.

If the prediction error power is low, we can reduce the number of bits we need to quantize the prediction error. A minimal sketch of the whole scheme follows Figure 32.

    The complete block diagram for a linear Prediction DPCM coder is shown in Figure 32.

    Figure 32: Transmitter with Linear Predictor for a DPCM.
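As a complement to the block diagrams, here is a minimal first-order DPCM encoder/decoder sketch in Python, assuming a single prediction coefficient equal to 1 (the predictor simply holds the last reconstructed sample) and an arbitrary quantization step; real coders use higher orders and optimized coefficients.

    import numpy as np

    def dpcm_encode(x, step=0.05):
        """First-order DPCM: quantize the difference between each sample
        and the previous reconstructed sample (predictor = last value)."""
        codes, pred = [], 0.0
        for sample in x:
            err = sample - pred               # prediction error
            code = int(round(err / step))     # quantized error
            codes.append(code)
            pred = pred + code * step         # track the decoder's output
        return codes

    def dpcm_decode(codes, step=0.05):
        """Invert the process: accumulate the quantized differences."""
        out, pred = [], 0.0
        for code in codes:
            pred = pred + code * step
            out.append(pred)
        return np.array(out)

Note that the encoder predicts from the reconstructed (quantized) samples, not from the original ones, so that encoder and decoder stay aligned and the quantization error does not accumulate.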


II.5 ADAPTIVE DIFFERENTIAL PCM

Adaptive differential pulse-code modulation (ADPCM) is a variant of DPCM that introduces some improvements in order to obtain a further reduction of the bit rate. Actually, this technique can also be applied to improve standard PCM, LPCM, Log-PCM and so on, not only DPCM. It was developed in the early 1970s at Bell Labs for voice coding, by P. Cummiskey, N. S. Jayant and James L. Flanagan.

The basic idea of ADPCM is to adapt the quantization step to the effective dynamic of the signal we want to deal with. If we consider a DPCM, we can say that if the difference signal at the input is low, ADPCM decreases the quantization step, so that it can quantize this small value with better precision. Otherwise, if the difference signal is high, ADPCM increases the quantization step, in order to cover the entire dynamic.

With this technique, we need just a few bits per sample and we succeed in working at bit rates lower than 8 kb/s! However, ADPCM cannot produce satisfactory quality when the bit rate is lower than 16 kb/s. We can work down to 4 kb/s, but we lose the quality of the signal and, in particular, the possibility to recognize the speaker.

    To further reduce the bit rate, we need to use speech signal parametric representations.

The basic block diagram for an ADPCM transmitter is shown in Figure 33: it's very similar to the block diagram of a DPCM shown in Figure 30, except for the presence of the interconnection between the Predictor and the Quantizer, due to the adaptation of the quantization step.

    Figure 33: Transmitter for an ADPCM.


In order to adapt the quantization step to the dynamic of the signal, we can use a multiplier that changes it according to the Predictor's output. There are two basic and conflicting requirements in the design of the step-size multiplier: the need for a fast response and the prevention of excessive step-size alterations in a stationary or steady-state situation (in which no step change is required).

There are two types of ADPCM configuration:

- Adaptive Quantization Forward: the prediction is estimated on samples which haven't been quantized yet;

- Adaptive Quantization Backward: the prediction is estimated on samples which have already been quantized.

In the adaptive quantization forward technique, input samples are stored in a buffer and sent to the Prediction unit; then, the quantization step is changed. This technique can be implemented with a very simple structure but has two limits:

1) it introduces a delay, related to the storage of samples in the buffer, and an additional amount of information to send, which makes the data rate higher;

2) the analog signal cannot be recovered directly, so the realization of the receiver is problematic.

    Figure 34: Adaptive Quantization Forward.

These problems are solved by the adaptive quantization backward technique, because it is implemented using a feedback configuration. The buffer is placed outside the quantizer, so it doesn't introduce any delay; moreover, the predictor and quantizer information does not need to be transmitted: the "side information" data rate is lower! The receiver can be realized just by inverting the process. However, this technique is less precise, because we adapt the quantization step according to past frames. A minimal sketch of backward adaptation follows Figure 35.


    Figure 35: Adaptive Quantization Backward.
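In the sketch below, both encoder and decoder update the step with the same rule, driven only by the transmitted codes, so no side information is needed; the 2-bit quantizer and the multiplier values (1.6 and 0.9) are illustrative assumptions, not values from any standard.

    def adpcm_encode(x, step0=0.02):
        """DPCM with backward step adaptation and a 2-bit quantizer.
        Codes lie in {-2, -1, 0, 1}; the step update depends only on the
        previous code, so the decoder can reproduce the adaptation."""
        codes, pred, step = [], 0.0, step0
        for sample in x:
            err = sample - pred
            code = max(-2, min(1, int(round(err / step))))
            codes.append(code)
            pred += code * step                   # reconstructed value
            # Grow the step when the quantizer saturates (outer codes),
            # shrink it otherwise; keep a floor to avoid underflow.
            step = max(step * (1.6 if code in (-2, 1) else 0.9), 1e-6)
        return codes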

     

    II.6 TIME DIVISION MULTIPLEXING

Nowadays, PCM is the standard form of digital audio in computers, CDs, digital telephony and other digital audio applications. The technique was proposed around the 1930s-1940s, when there was the need to increase the number of long-distance telephone connections. This requirement, however, conflicted with the difficulties and costs associated with large conductor bundles, which were very bulky and difficult to connect. So the idea arose to multiplex a large number of telephone connections on a single coaxial cable. This gave rise to the Time Division Multiplexing (TDM) technique, a very modern and efficient digital technique based on PCM.

    A general scheme of TDM technique is shown in Figure 36.

    Figure 36: Time Division Multiplexing.

We have n channels, a switch that selects one of them, and a single coaxial cable through which the information is transmitted to the receiver. The demultiplexing unit delivers each sample to the channel that corresponds to its source channel, and the transmission is complete.

The idea of TDM is based on the ability to sample a speech signal and, at the same time, during the sampling period, to transmit another speech signal on another channel. Consider the same sampling period (T = 125 μs) for each channel: it is divided into n time intervals called time-slots; in each time-slot, the system transmits the sample generated by one of the n channels. All channels are served cyclically, which means that each channel transmits its samples with period T = 125 μs. In each sampling period, a word of n samples is sent to the receiver.

We can better understand this process by looking at Figure 37.

    Figure 37: Time Division Multiplexing; transmission.
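The interleaving itself is easy to express in code. The following sketch (hypothetical helper names; 8-bit samples assumed) builds one frame per sampling period by taking one sample from each channel in turn, and splits the frames back into channels on the receiving side:

```python
def tdm_mux(channels):
    """channels: list of n equal-length lists of 8-bit samples.
    Returns one frame (a bytes object of n samples) per sampling period."""
    return [bytes(ch[t] for ch in channels) for t in range(len(channels[0]))]

def tdm_demux(frames, n):
    """Reassign each time-slot of every frame to its source channel."""
    return [[frame[slot] for frame in frames] for slot in range(n)]

# Example: 3 channels, 2 sampling periods
frames = tdm_mux([[10, 11], [20, 21], [30, 31]])
assert tdm_demux(frames, 3) == [[10, 11], [20, 21], [30, 31]]
```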

The receiver receives a continuous flow of information: in order to allow correct decoding and the association of each sample with its channel, we need to send the receiver information about the time duration of the sampling period (the length of the word associated with each period) and about the channel associated with each received sample. That is why the information transmitted over the coaxial cable also contains, besides the serial data lines (DATA), a reference frequency signal (CLOCK) and a frame-alignment signal (FRAME).

Actually, the clock information may or may not be present, according to the type of TDM; there are, in fact, two different types of TDM: synchronous and asynchronous. In synchronous TDM, the clock provides bit synchronization, while the frame signal provides a time reference to identify the slot in which each device is enabled to transmit or receive information.

    The number of channels and the rates are established by international standards.

The European TDM allows 32 channels to pass simultaneously on a single coaxial cable without, of course, interference between them. Of the 32 multiplexed channels, 30 are voice channels (calls) and 2 are service channels: channel no. 0 is used to send the clock information to the receiver, and channel no. 16 carries the phase (signaling) information. It applies the G.711 standard: logarithmic (A-law) PCM at 64 kb/s (sampling rate of 8 kHz and 8 bits per code). That means that we obtain a rate of

$$R = \frac{32 \cdot 8\ \text{bit}}{125\ \mu\text{s}} = 2.048\ \text{Mb/s}$$

The American TDM allows 24 channels plus a single service bit to pass: all channels are dedicated to calls. It applies logarithmic (μ-law) PCM at 64 kb/s (sampling rate of 8 kHz and 8 bits per code), so we obtain a rate of

$$R = \frac{24 \cdot 8\ \text{bit} + 1\ \text{bit}}{125\ \mu\text{s}} = 1.544\ \text{Mb/s}$$
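As a quick numerical check, the two rates can be reproduced with a few lines of Python:

```python
T = 125e-6             # sampling period in seconds
e1 = 32 * 8 / T        # 32 time-slots of 8 bits            -> 2,048,000 b/s
t1 = (24 * 8 + 1) / T  # 24 slots of 8 bits + 1 framing bit -> 1,544,000 b/s
print(e1, t1)          # 2048000.0 1544000.0
```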

    TDM can also be used within Time Division Multiple Access (TDMA), where stations sharing the

    same frequency channel can communicate with one another.

    An example of application that utilizes both TDM and TDMA is GSM.


    APPENDIX

In the following, you can see some real speech signals. The images were obtained in the LED 2 laboratory, second floor of the "Cittadella", Politecnico di Torino. Instruments: Analog Oscilloscope Hameg 1004-3, Microphone, Power Supply.

In Figure 38, two signals are shown, in order to point out the difference between a voiced and an unvoiced sound.

    Figure 38: Difference between voiced signals and unvoiced signals.

In Figure 39, an amplitude-modulated signal is shown: it is obtained by varying the tone of the vowel "a".

    Figure 39: Amplitude modulation of the tone “a”.


In Figure 40, we can see four different signals; they all correspond to the pronunciation of the word "Hello", spoken by four different people.

Figure 40: The word "Hello" spoken by four different people.


    BIBLIOGRAPHY

    Texts:

L. R. Rabiner & R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, 1978. ISBN 0-13-213603-1.

    D. Del Corso, Elettronica per Telecomunicazioni, McGraw-Hill, 2002. ISBN 88-386-0832-6.

Jerry D. Gibson, Digital Compression for Multimedia: Principles and Standards, Elsevier Science (USA), 1998. ISBN 1-55860-369-7. (Partial version available at http://books.google.it/books?hl=it&lr=&id=aqQ2Ry6spu0C&oi=fnd&pg=PR13&dq=Speech+Coding+Methods,+Standards,+and+Applications+Jerry+D.+Gibson&ots=vJ8yfLOEV3&sig=ovrzOwYvkCLDU7kgBgusxljWeP0#v=onepage&q=Speech%20Coding%20Methods%2C%20Standards%2C%20and%20Applications%20Jerry%20D.%20Gibson&f=false)

    Wiley Encyclopedia of Telecommunications, John Wiley & Sons, 2003. ISBN 978-0-471-36972-1.

ITU-T Recommendations (extract from the Blue Book), ITU, 1988, 1993.

Articles and slides downloaded from the web (May 2014):

M. Hasegawa-Johnson & A. Alwan, Speech Coding: Fundamentals and Applications, University of Illinois at Urbana-Champaign. (http://www.seas.ucla.edu/spapl/paper/mark_eot156.pdf)

J. D. Gibson, Speech Coding Methods, Standards, Applications, University of California at Santa Barbara. (http://vivonets.ece.ucsb.edu/casmagarticlefinal.pdf)

  • Nadia Perreca, ID 211012 Speech coding: techniques, standards and applications

    45  

D. Tipper, Digital Speech Processing, University of Pittsburgh. (www.pitt.edu/~dtipper/2720/2720_Slides7.pdf)

D. P. W. Ellis, An introduction to signal processing for speech, Columbia University, 2008. It is a chapter of The Handbook of Phonetic Sciences, edited by William J. Hardcastle, published by Wiley-Blackwell. (http://academiccommons.columbia.edu/catalog/ac%3A144483)

P. Cummiskey, Adaptive Quantization in DPCM Coding of Speech, The Bell System Technical Journal, vol. 52, issue 7, pp. 1105-1118, 1973. (http://www.alcatel.hu/bstj/vol52-1973/articles/bstj52-7-1105.pdf)

    Websites:

    http://en.wikipedia.org/wiki/Speech_processing

    http://en.wikipedia.org/wiki/Pcm

    http://www.itu.int/rec/T-REC-G.711/_page.print