Recommended citation: Hall, Stephen Charles, "The design and implementation of a speech codec for packet switched networks", Doctor of Philosophy thesis, Department of Electrical and Computer Engineering, University of Wollongong, 1988. http://ro.uow.edu.au/theses/1352
THE DESIGN AND IMPLEMENTATION OF A SPEECH CODEC
FOR PACKET SWITCHED NETWORKS
A thesis submitted in fulfilment of the
requirements for the award of the degree of
DOCTOR OF PHILOSOPHY
from
THE UNIVERSITY OF WOLLONGONG
by
STEPHEN CHARLES HALL, B.Sc. (Eng.)
Department of Electrical
and Computer Engineering
1988
I hereby certify that no part of the work presented in this thesis has
been previously submitted for a degree to any university or similar
institution.
Stephen Charles Hall 29/08/88
CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
CHAPTER 1 : INTRODUCTION
1.1 Background to the thesis
1.1.1 Segregated and integrated communications networks
1.1.2 Local and wide area networks
1.1.3 Problems associated with the addition of voice to a data LAN
1.1.4 The need for a special speech codec
1.2 Aims of the thesis
1.3 An overview of the thesis contents
1.4 Original contributions made by the thesis
1.5 Publications by the author related to the thesis
CHAPTER 2 : THE NETWORK AND WORKSTATIONS
2.1 Introduction
2.2 The network
2.2.1 Configuration
2.2.2 Switching technique
2.2.3 Capacity
2.2.4 Channel errors
2.2.5 Delay
2.3 The workstations
2.3.1 Functional components
2.3.2 Structure of the packet voice terminal 15
2.3.3 Conclusions 18
CHAPTER 3 : SPEECH QUALITY IN PACKET VOICE COMMUNICATIONS 20
3.1 Introduction 20
3.2 Signal distortion 20
3.2.1 Introduction 20
3.2.2 Fixed distortion 21
3.2.3 Variable distortion 21
3.2.4 Summary and conclusions 22
3.3 Signal delay 22
3.3.1 Types of delay 22
3.3.2 The subjective effects of fixed delay 23
3.3.3 Delay minimization 24
3.3.4 Summary and conclusions 26
3.4 Signal loss 26
3.4.1 Introduction 26
3.4.2 Lost packets 27
3.4.3 The effect of lost packets on speech quality 27
3.4.4 Summary and conclusions 28
3.5 Signal corruption 29
3.5.1 Introduction 29
3.5.2 Corruption of voice packets 29
3.5.3 Summary and conclusions 31
3.6 Silence elimination 31
3.6.1 Introduction 31
3.6.2 The advantage of silence elimination 31
3.6.3 The disadvantages of silence elimination 32
3.6.4 Summary and conclusions
3.7 Overall speech quality 35
3.7.1 Quality standards 35
3.7.2 Maximizing the overall quality 36
3.7.3 Conclusions 37
CHAPTER 4 : THE ACCESS CONTROLLER 38
4.1 Introduction 38
4.2 Contention-based vs. ordered access 38
4.3 Priority access 39
4.4 Summary and conclusions 40
CHAPTER 5 : THE NETWORK VOICE PROTOCOL 41
5.1 Introduction 41
5.2 Packetization 41
5.2.1 Introduction 41
5.2.2 Factors influencing the optimum packet length 42
5.2.3 Instantaneous variations in the packet length 42
5.2.4 Long-term variations in the packet length 43
5.2.5 Summary and conclusions 44
5.3 Prioritization 44
5.3.1 Introduction 44
5.3.2 The relative prioritization of voice and data 45
5.3.3 Prioritization of voice according to its activity 45
5.3.4 Prioritization of voice according to its transmission
history 46
5.3.5 Summary and conclusions 47
5.4 Flow control 47
5.4.1 Introduction 47
5.4.2 Flow control of voice traffic 48
5.4.3 Network load estimation/prediction 49
5.4.4 Summary and conclusions 49
5.5 Synchronization 49
5.5.1 Introduction 49
5.5.2 Essential issues in packet voice synchronization 50
5.5.2.1 Packet ordering 50
5.5.2.2 Identification of the type of a missing packet 51
5.5.2.3 Correction of variable packet delay 52
5.5.2.4 Clock frequency matching 55
5.5.2.5 Temporal distortion of silence intervals 55
5.5.2.6 Adjustment of the playout rate 56
5.5.3 A taxonomy of packet voice synchronization schemes 57
5.5.3.1 Introduction 57
5.5.3.2 Synchronization schemes with exact knowledge
of Dv 57
5.5.3.3 Synchronization schemes with approximate
knowledge of Dv 58
5.5.3.4 Synchronization schemes with no knowledge
of Dv 58
5.5.4 Summary and conclusions 60
5.6 Fill-in 61
5.6.1 Introduction 61
5.6.2 Simple packet fill-in schemes 61
5.6.3 Advanced packet fill-in schemes 62
5.6.4 Summary and conclusions 64
CHAPTER 6 : CODEC REQUIREMENTS 65
6.1 Introduction 65
6.2 Input signal characteristics 65
6.3 Signal distortion 66
6.4 Signal delay 66
6.5 Bandwidth efficiency 66
6.6 Variable rate coding 67
6.7 Robustness to bit errors 67
6.8 Robustness to packet loss 68
6.9 Tandem coding 68
6.10 Voice conferencing 69
6.11 Voice messaging 70
6.12 PCM compatibility 71
6.13 Non-speech code 71
6.14 Control information 72
6.15 Packetization 72
6.16 Implementation 72
CHAPTER 7 : DESIGN AND DEVELOPMENT OF THE CODEC 74
7.1 Introduction 74
7.2 Variable rate coding 74
7.2.1 Introduction 74
7.2.2 Variable rate coding in DCM systems 75
7.2.2.1 Techniques 75
7.2.2.2 Issues 76
7.2.3 Variable rate coding in packet switched networks 77
7.2.3.1 Techniques 77
7.2.3.2 Issues 78
7.2.4 Multirate coding 79
7.2.5 Embedded coding 80
7.2.6 Issues in the design of the embedded code 81
7.2.6.1 Code hierarchy 81
7.2.6.2 Explicit noise coding vs. coarse feedback coding 82
7.2.6.3 Code format 83
7.2.7 Summary and conclusions 84
7.3 Redundancy reduction coding 85
7.3.1 Introduction 85
7.3.2 Waveform coders vs. vocoders 85
7.3.3 Time domain vs. frequency domain waveform coders 86
7.3.4 Predictive waveform coders 86
7.3.5 Conclusions 92
7.4 The adaptive quantizer in the primary coder 92
7.4.1 Introduction 92
7.4.2 Adaptation vs. companding 93
7.4.3 Backward vs. forward adaptation 96
7.4.4 Syllabic, instantaneous and hybrid adaptation 96
7.4.5 The optimization of backward adaptive quantizers 98
7.4.6 The Generalized Hybrid Adaptive Quantizer 101
7.4.6.1 Introduction 101
7.4.6.2 The syllabic compandor 101
7.4.6.3 The instantaneously adaptive quantizer 103
7.4.7 Derivation of the GHAQ optimization procedure 107
7.4.8 Performance measures 110
7.4.9 The training set 111
7.4.10 Evaluation of the GHAQ optimization procedure 112
7.4.10.1 Introduction 112
7.4.10.2 Convergence 113
7.4.10.3 Design optimality 117
7.4.10.4 The effect of p on the performance of the GHAQ 118
7.4.10.5 The effect of L on the performance of the GHAQ 120
7.5 The predictor in the primary coder 120
7.5.1 Introduction 120
7.5.2 An analytic approach to predictor optimization 122
7.5.3 An iterative approach to predictor optimization 123
7.6 Comparative performance tests 124
7.6.1 Introduction 124
7.6.2 Test conditions 125
7.6.3 Results for the 1-bit adaptive quantizers 126
7.6.3.1 The optimum coder parameters 126
7.6.3.2 SNR results 128
7.6.3.3 Step responses 130
7.6.4 Results for the 2-bit adaptive quantizers 133
7.6.4.1 The optimum coder parameters 133
7.6.4.2 SNR results 135
7.6.4.3 Step responses 135
7.6.5 Summary and conclusions 135
7.7 Development of the secondary coding algorithm 138
7.7.1 Introduction 138
7.7.2 Selection of the coding technique 138
7.7.3 Adaptation of the secondary quantizer 139
7.7.4 Embedded code generation 140
7.7.5 Optimization of the secondary quantizer 143
7.7.5.1 The optimization procedure 143
7.7.5.2 Convergence of the optimization procedure 144
7.7.5.3 Results 145
7.8 Recovery from bit errors 147
7.8.1 Introduction 147
7.8.2 Effects of bit errors on the primary decoder 148
7.8.3 The development of the robust GHAQ 150
7.8.4 Performance of the robust GHAQ 153
7.8.5 Idle channel noise in the robust GHAQ 154
7.8.6 The effects of bit errors on the secondary decoder 155
7.9 Recovery from missing packets 155
7.9.1 Introduction 155
7.9.2 The effect of missing packets on the embedded decoder 156
7.9.3 A mechanism for recovering from missing packets 157
7.10 Packetization issues 158
7.11 Prioritization and flow control issues 160
7.11.1 Introduction 160
7.11.2 Speech prioritization in DCM systems 161
7.11.3 Fixed-rate performance of the embedded coder 162
7.11.4 Generation of the prioritization variables 165
7.11.5 Use of the prioritization variables 167
7.12 Packet voice synchronization and fill-in issues 170
7.12.1 Synchronization 170
7.12.2 Fill-in 170
CHAPTER 8 : IMPLEMENTATION OF THE CODEC 173
8.1 Introduction 173
8.2 Implementation strategy 173
8.3 An overview of the codec card 173
8.4 Signal conditioning and conversion 175
8.5 The embedded codec 176
8.5.1 Choice of digital signal processor 176
8.5.2 Program structure and timing 176
8.5.3 Arithmetic considerations 179
8.5.3.1 Fixed-point notation 179
8.5.3.2 Arithmetic overflow 180
8.5.3.3 Truncation error 180
8.5.4 Code and control information formats 181
8.5.4.1 Introduction 181
8.5.4.2 Transmit group structure 182
8.5.4.3 Receive group structure 184
8.5.4.4 Codec control/status word 185
8.5.5 DSP resource usage 187
8.6 The codec/network voice protocol interface 188
8.6.1 Introduction 188
8.6.2 Information transfer techniques 188
8.6.3 Memory buffers and blocks 189
8.6.4 Information parcels and frames 189
8.7 The card control/status register 192
8.8 Card configuration options 194
CHAPTER 9 : EVALUATION OF THE CODEC 196
9.1 Introduction 196
9.2 Performance comparison with log PCM 196
9.3 Dynamic range 199
9.4 Signal delay 200
9.5 Robustness to bit errors and missing packets 200
9.6 Idle channel noise 202
9.7 Transcoding 203
9.8 Subjective quality 205
9.9 Cost 206
CHAPTER 10 : CONCLUSIONS AND FURTHER WORK 207
10.1 Conclusions 207
10.1.1 Embedded coding 207
10.1.2 Silence elimination 208
10.1.3 Adaptive quantization 208
10.1.4 Interdependence of adaptive quantizers and predictors 209
10.1.5 The codec implementation 209
10.1.5.1 Performance 209
10.1.5.2 Facilities 210
10.1.6 The network voice protocol 211
10.2 Further work 212
10.2.1 The codec 212
10.2.1.1 Optimization of the GHAQ with alternative
distortion measures 212
10.2.1.2 Adaptation of the syllabic compandor in the GHAQ 212
10.2.1.3 Switched predictor adaptation 213
10.2.1.4 Prioritization variables 214
10.2.1.5 Channel error robustness 214
10.2.2 The network voice protocol 215
APPENDIX A : Minimization of the GHAQ distortion measure 217
APPENDIX B : Error dissipation in the robust GHAQ 220
APPENDIX C : The minimum output level of the robust GHAQ 222
APPENDIX D : Development equipment and software 224
APPENDIX E : Reference speech material 225
APPENDIX F : Adaptive quantizers used in the comparative tests 225
REFERENCES 230
ABBREVIATIONS AND ACRONYMS 241
ACKNOWLEDGEMENTS
I am deeply grateful to my supervisor, Professor Hugh Bradlow, for his
guidance, insight and enthusiasm, and to my family, for their unfailing
support.
I would also like to thank Mr. James Irvine for his assistance during
the development of the simulation software used in this thesis, Mr.
Carlo Giusti for laying out the printed-circuit version of the codec
card, and Mr. Joe Tiziano for assisting with its fabrication.
Finally, I would like to acknowledge the financial support of the
University of Wollongong, the Council for Scientific and Industrial
Research (South Africa), and St. John's College (Johannesburg).
ABSTRACT
Packet switching is used extensively in Local Area Networks (LANs) for
data communications, and is becoming increasingly important in the
trend towards integrated services Wide Area Networks (WANs). As most
existing speech codecs were designed with circuit switched connections
in mind, they are vulnerable to packet loss, and are unable to fully
exploit the variable capacity of packet switched connections. Consideration
is therefore given in this thesis to the design and implementation
of a speech codec specifically intended for use with packet
switched networks.
The thesis starts with a discussion of the general characteristics of
local and wide area networks. Then a model for a packet voice terminal,
consisting of a speech codec, network voice protocol, and access
controller, is described. The way in which the network and the components
of the packet voice terminal can affect the quality of speech
communications is then discussed, and this leads to a detailed set of
requirements for the codec itself.
The codec design makes use of an "embedded" coding scheme, which allows
rapid flow control of voice traffic to be performed, and enables the
variable activity of the signal to be exploited for bandwidth compression
purposes. The fundamental coding technique used is Adaptive
Differential Pulse Code Modulation (ADPCM), and particular attention is
given to the design of the adaptive quantizer in this algorithm.
A new structure for this device is developed, and the result is called
the Generalized Hybrid Adaptive Quantizer (GHAQ). The GHAQ is easily
optimized to the statistics of a particular signal by means of an
iterative procedure, and is shown to yield improved signal-to-noise
ratio over other well-known adaptive quantizers in Adaptive Delta
Modulators and 2-bit ADPCM coders.
The codec is implemented on an IBM PC expansion card using a programmable
digital signal processor. Associated interface hardware, designed
to allow the packetization of coded speech with minimal processing
overhead, is also included. This hardware/software system represents
an economical means of adding voice traffic to an existing data LAN,
and is a flexible vehicle for further research into packet voice
communications.
CHAPTER 1 : INTRODUCTION
1.1 Background to the thesis
1.1.1 Segregated and integrated communications networks
The field of telecommunications has traditionally been dominated by
interactive voice traffic. However, with the rise of computer technology
in recent decades, the provision of high-speed data communications
has become increasingly important. For example, while in 1987 voice
traffic represented about 87% of the total volume of communications
traffic (the remaining 13% being data), it is anticipated that this
will drop to 57% by 1991 [Malek 88]. Unfortunately, the current Public
Switched Telephone Network (PSTN) is not well suited to data communications,
due to inherent differences between the properties and requirements
of voice and data traffic.
For example, interactive data traffic can have a "burstiness factor"
(the ratio of the peak to average information rate) of more than 10,
whereas for interactive voice traffic a lower figure of between 2 and 5
applies [Burgin 87]. Furthermore, data traffic is relatively tolerant
of delay, but intolerant of transmission errors, whereas the reverse is
true of voice traffic [Gruber 83]. These incompatibilities have led to
the installation of special-purpose networks for data communications,
such as the ARPANET in the USA [Weinstein 83]. However, the creation
of a global data communications network in parallel with the PSTN is an
enormously expensive proposition.
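The burstiness factor defined above is straightforward to compute. The sketch below uses hypothetical peak and average rates, chosen only to reproduce the orders of magnitude quoted, not taken from the cited references:

```python
def burstiness(peak_bps: float, average_bps: float) -> float:
    """Burstiness factor: ratio of peak to average information rate."""
    return peak_bps / average_bps

# Hypothetical interactive data source: 9600 bps bursts, 800 bps average.
data_b = burstiness(9600, 800)             # 12.0 -> "more than 10"

# Voice with silence elimination: 64 kbps during talkspurts and a
# hypothetical 40% activity factor give a 25.6 kbps long-term average.
voice_b = burstiness(64000, 0.4 * 64000)   # 2.5 -> "between 2 and 5"

print(data_b, voice_b)
```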
A way out of this impasse is provided by recent technological developments,
such as high-bandwidth optical transmission and high-speed digital
signal processing and switching. These will allow the current PSTN
to evolve into a network which will provide efficient, reliable and
inexpensive transmission of both voice and data traffic, and which can
therefore be called an integrated services network.
Early forms of this network will make use of integrated access facilities,
but will retain segregated transmission and switching facilities.
Current standards for the Integrated Services Digital Network (ISDN)
therefore relate primarily to the network access interface [Pandhi 87].
However, it is highly likely that in the future, the separate transmission
and switching facilities will be fused, resulting in a truly
integrated network. Apart from the economic advantages of eliminating
duplicate equipment, an integrated network is easier to manage than
segregated ones are, and offers the possibility of providing new multimedia
communications services (eg. video-conferencing).
In the long term, increasing use of the above technologies will increase
the capacity and flexibility of the ISDN, allowing it to evolve
further into the Broadband ISDN (BISDN), which will be capable of
carrying high-quality video and high-speed data traffic [Weinstein 87].
It is highly likely that the BISDN will be based largely on packet
switching rather than circuit switching, due to the resulting ease and
economy with which different types of traffic may be integrated, and
the flexibility of the network with respect to changing service demands
[Burgin 87]. Fast Packet Switching (FPS), which makes use of high-speed
switch architectures and simple link protocols to maintain low
delay in multi-link networks, is particularly promising in this respect
[Burgin 87].
These profound changes in global communications networks will need to
be matched by corresponding changes in the equipment connected to them
if their full potential is to be realised. In particular, local area
communications networks, discussed in the next section, will need to be
adapted or re-designed appropriately.
1.1.2 Local and wide area networks
A distinction is commonly made between Local Area Networks (LANs) and
Wide Area Networks (WANs). While these terms are conventionally associated
with data communications, they have corresponding concepts in
telephony, namely Private Automatic Branch Exchange (PABX) networks,
and the PSTN itself. Accordingly, the terms LAN and WAN are used in
this thesis in connection with both voice and data traffic.
By definition, a LAN is confined to a limited geographical area, and
usually serves a single organization, such as a business corporation or
a university [Tanenbaum 81]. Due to the relatively low cost of laying
high-capacity cables over a small geographical area, LANs have proliferated
in recent years as a means of providing data communications.
However, due to the absence of a global data WAN, long-distance data
communications must still be conducted over dedicated lines or special
data networks (which are expensive and restricted in connectivity), or
via the PSTN (which has relatively long call set-up times, a low transmission
rate, and a high bit error rate).
As the PSTN evolves into the ISDN, the service offered to data traffic
will improve dramatically, so that the number of LANs connected to this
WAN is likely to increase correspondingly. It will then be natural to
extend service integration from the wide environment to the local
environment, requiring that existing data LANs are adapted for voice
traffic, or alternatively that new LANs suited to both types of traffic
are developed [Anido 87]. Another possibility is that PABXs, which
have traditionally controlled voice communications in the local environment,
will be designed to handle data traffic, a trend which is
already emerging in practice [Camrass 87]. It is difficult to predict
which of these paths to integrated local networks will prevail in the
future, as each has its advantages.
The path considered in this thesis involves the addition of voice
traffic to an existing data LAN, and particular attention is given here
to the processes of encoding and decoding speech signals for this
application. This option is likely to be the most economical of the
above three, as it does not require the replacement of any equipment,
and can make use of existing network hardware and software.
1.1.3 Problems associated with the addition of voice to a data LAN
As most data LANs are designed to provide a high-integrity transport
service, which responds to increased traffic load with increased delay,
they are intrinsically ill-suited to the transmission of voice traffic.
Assuming that speech must incur some form of service degradation when
the network is congested, it would prefer a decreased signal-to-noise
ratio to an increased delay. A large part of the problem of adding
voice traffic to a data network therefore lies in making the network
look less hostile to the voice signal (or conversely making the voice
signal appear more pliable to the network). This function may be
performed jointly by the speech coding algorithm and the network communications
protocols, as is discussed in detail in the body of this
thesis.
1.1.4 The need for a special speech codec
A digital voice terminal requires a speech codec (coder/decoder) to
convert the speech signal between its analog and digital forms and to
enable it to be represented as a code suited to transmission over a
particular network. While much work has been done on digital speech
coding in the past two decades [Jayant 84], [Rabiner 78], attention has
been concentrated to a large extent on algorithms suitable for use with
the current PSTN, ie. with fixed bandwidth, circuit switched connections.
At present, the most widely used speech coding technique is 64 kbps
A-law or µ-law Pulse Code Modulation (PCM), referred to from this point
as 64 kbps PCM. This technique has the advantages of providing good
speech quality and signal transparency (ie. it is able to handle non-
speech signals such as voiceband data). This makes it suitable for use
in the current PSTN, in which modulated data is carried on voice channels.
Due to the relative simplicity of 64 kbps PCM and the fact that
it has been standardized by the CCITT (specification G.711) [Jayant
84], inexpensive PCM codecs have been available in integrated circuit
form for a number of years.
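As an illustrative aside, the logarithmic companding that underlies A-law and µ-law PCM can be sketched in a few lines. The continuous µ-law characteristic is used here for simplicity; the actual G.711 codec implements a piecewise-linear segmented approximation of this curve, so this is a sketch of the principle rather than of the standard itself.

```python
import math

MU = 255  # mu-law companding parameter used in North American PCM

def mu_law_compress(x: float) -> float:
    """Continuous mu-law characteristic for inputs normalized to |x| <= 1."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y: float) -> float:
    """Inverse of the compression characteristic."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Small amplitudes are boosted before uniform 8-bit quantization, giving
# near-constant SNR over a wide dynamic range at
# 8 bits/sample x 8000 samples/s = 64 kbps.
x = 0.01
y = mu_law_compress(x)   # roughly 0.23: boosted towards full scale
assert abs(mu_law_expand(y) - x) < 1e-12
```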
The main disadvantage of 64 kbps PCM is that it is relatively inefficient
in its use of transmission bandwidth, and this fact has led to
the recent standardization by the CCITT of a 32 kbps Adaptive Differential
PCM (ADPCM) algorithm (specification G.721) [Jayant 84]. While
this algorithm is far more complex than 64 kbps PCM, single-chip Digital
Signal Processor (DSP) implementations have been reported
[Nishitani 87]. It is intended that these devices will replace 64 kbps PCM
codecs in the PSTN, thereby allowing an increase in bandwidth efficiency
while maintaining good speech quality and signal transparency.
However, the ADPCM algorithm is not well suited for use with packet
switched networks for a number of reasons. Firstly, as it operates at
a fixed rate (32 kbps), it cannot take advantage of the variable capacity
of packet switched channels. Secondly, it makes use of a number
of adaptive coding variables (for example 8 predictor coefficients),
which are vulnerable to the effects of packet loss. While there is
provision in the algorithm for the effects of channel bit errors on
these variables to be dissipated with time, this strategy does not cope
well with the signal discontinuity represented by a lost packet.
Thirdly, while the algorithm's ability to handle voiceband data does
not actually make it unsuitable for use in an all-digital network, it
does mean that significant extra complexity is associated with a redundant
function.
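To make the packet-loss argument concrete, the following deliberately simplified first-order DPCM model (an illustrative stand-in, not the G.721 algorithm itself) shows how a decoder that relies on predictor state reconstructs every sample after a lost block incorrectly, with the error decaying only gradually:

```python
# Toy first-order DPCM with a fixed predictor (a = 0.9) and unquantized
# residual transmission, illustrating decoder state divergence after a
# lost "packet" of samples. G.721 ADPCM's adaptive predictor state
# suffers in an analogous way when a packet is lost.

A = 0.9  # predictor coefficient

def encode(samples):
    state, residuals = 0.0, []
    for x in samples:
        e = x - A * state        # prediction residual
        residuals.append(e)
        state = A * state + e    # encoder mirrors the decoder's state
    return residuals

def decode(residuals, state=0.0):
    out = []
    for e in residuals:
        state = A * state + e
        out.append(state)
    return out

signal = [1.0, 0.8, 0.6, 0.9, 1.0, 0.7, 0.5, 0.8]
res = encode(signal)

# Error-free channel: reconstruction is essentially perfect.
assert all(abs(a - b) < 1e-12 for a, b in zip(decode(res), signal))

# Lose the first "packet" (4 residuals): the decoder's predictor state
# is wrong for every subsequent sample, not just the missing ones.
tail = decode(res[4:])           # decoder restarts with state = 0
errors = [abs(a - b) for a, b in zip(tail, signal[4:])]
print(errors)                    # nonzero, decaying only as A**n
```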
Another speech coding algorithm available in integrated circuit form is
16 kbps Continuously Variable Slope Delta Modulation (CVSD) [Jayant
84]. This algorithm gives a speech quality which is significantly
lower than that of the two described above, and it is intended for
specialized applications such as military communications, rather than
for commercial telephony [Glasbergen 81]. Its main disadvantages as
far as this project is concerned are its fixed coding rate, low speech
quality and the fact that its adaptive step size is vulnerable to
packet loss, as described for the 32 kbps ADPCM codec.
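The slope-tracking step-size adaptation that characterizes CVSD can be sketched as follows; the growth and decay constants, and the three-bit coincidence rule, are illustrative choices rather than values from any particular CVSD device:

```python
# Minimal CVSD-style delta modulator: a 1-bit coder whose step size grows
# when the last three output bits agree (indicating slope overload) and
# decays towards a minimum otherwise. Note that the step size lives in
# both encoder and decoder, so a lost packet desynchronizes it, just as
# described in the text for the adaptive step size.

GROW, DECAY = 1.2, 0.98
MIN_STEP = 0.01

def cvsd_encode(samples):
    est, step, bits, history = 0.0, MIN_STEP, [], []
    for x in samples:
        bit = 1 if x >= est else 0
        bits.append(bit)
        history = (history + [bit])[-3:]
        if len(history) == 3 and len(set(history)) == 1:
            step *= GROW                        # three equal bits: steepen
        else:
            step = max(MIN_STEP, step * DECAY)  # otherwise relax
        est += step if bit else -step           # track the input
    return bits

bits = cvsd_encode([0.0, 0.3, 0.6, 0.9, 1.0, 0.9, 0.6, 0.2])
print(bits)
```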
While a detailed discussion of the requirements of a speech codec
suitable for use with packet switched networks is left to Chapter 6, it
is clear that the above codecs are not well suited to this application,
thus providing the incentive for the rest of the work described in this
thesis.
1.2 Aims of the thesis
The ultimate aim of the work of which this thesis forms a part is to
allow interactive voice communications to be conducted over existing
data LANs, which may be linked together through a WAN (eg. the ISDN).
Important objectives are that the voice facility should be flexible
(meaning that it should not make restrictive assumptions about the
nature of the network), non-intrusive (meaning that it should not
impact severely on existing data communications), efficient (thus
making the facility simultaneously available to a large number of
users), and economical.
The specific concern of this thesis is the design and implementation of
a speech codec for the above application. The implementation of the
voice communications protocols and the evaluation of the overall system
is not considered here, but in order to allow the project objectives to
be reached, it is important that the communications protocols and
network are considered when the codec is designed.
Apart from its immediate application in telephony, the codec/protocol
combination will also be used for research purposes. As described in
Section 1.1.1, packet switching is likely to become of considerable
importance in future versions of the ISDN, and due to the relatively
high capacity and low delay variance of LANs, they may be used as low-
cost vehicles for the study of real-time traffic in more general packet
switched networks.
1.3 An overview of the thesis contents
In Chapter 2 of this thesis, a brief description of the communications
network is given, with particular reference to its limitations. In
addition, a model for a packet voice terminal is presented. In Chapter
3, the causes of quality degradation in packet voice communications are
summarized, and general techniques for controlling these are described.
The access controller used in each workstation is considered in Chapter
4, and the need to avoid restrictive assumptions about this device is
made clear. In Chapter 5, the functions of the network voice protocol
are considered in detail, and it is shown that a number of these
functions impact on the codec design.
Chapter 6 provides a summary of the codec design requirements, with
reference both to the material in the preceding chapters and to additional
issues. The development of the speech coding algorithm and
associated operations is covered in Chapter 7, leading to the hardware
implementation described in Chapter 8. In Chapter 9, this implementation
is evaluated with reference to the requirements of Chapter 6.
Finally, conclusions are drawn in Chapter 10, and some opportunities
for further work are suggested.
1.4 Original contributions made by the thesis
The original contributions made by this thesis to the fields of speech
coding and packet voice communications are as follows:
A new "hybrid" adaptive quantizer for speech coding is described,
and is shown to give superior signal-to-noise ratios to a number
of other algorithms of similar complexity. It has a generalized,
flexible structure which permits its use in applications other
than that described in this thesis.
An efficient optimization procedure is derived for the new adaptive
quantizer, allowing it to be easily tailored to the statistics
of its input signal in a given application. By contrast, optimization
of previous adaptive quantizers had to be performed using
time-consuming random search techniques.
A simple bit error recovery mechanism is adapted for use with the
new adaptive quantizer, and an analysis of its operation is presented.
It is shown that this technique is suitable for speech
coding applications in the context of low error-rate digital
networks.
The interaction between the optimum parameters of the predictor
and the adaptive quantizer in a delta modulator is demonstrated,
and it is shown that a random search procedure can be used for
finding optimum predictor coefficients in this context.
A new speech prioritization scheme for use in packet switched
communications is described. In contrast to the silence/talkspurt
discriminators conventionally used in this context, this scheme
prioritizes speech on a continuous scale, thereby allowing the
variable capacity of packet switched channels to be used more
effectively. In addition, the fact that this scheme transmits
"silence" at a low bit rate, instead of eliminating it, simplifies
the design of the network voice protocol and has perceptual advantages.
An inexpensive hardware implementation of the speech codec and the
associated codec/protocol interface is described and evaluated.
1.5 Publications by the author related to the thesis
S.C. Hall, "A review of speech coding : theory and techniques", Elektron
(Journal of the South African Institute of Electrical Engineers),
vol. 1, pp. 25-31, Sept. 1984.
J.M. Irvine, S.C. Hall, H.S. Bradlow, "An improved hybrid companding
delta modulator", IEEE Trans. Commun., vol. COM-34, pp. 995-998, Oct.
1986.
H.S. Bradlow and S.C. Hall, "Integration of conversational voice into
networks designed for data communications", 2nd Fast Packet Switching
Workshop, Melbourne, May 1987.
H.S. Bradlow and S.C. Hall, "The design of an integrated voice/data
terminal and voice transport protocol", 3rd Fast Packet Switching
Workshop, Melbourne, May 1988.
S.C. Hall and H.S. Bradlow, "The design of a hybrid adaptive quantizer
for speech coding applications", to be published in IEEE Trans.
Commun., Nov. 1988.
CHAPTER 2 : THE NETWORK AND WORKSTATIONS
2.1 Introduction
In this chapter, an overview is given of the network and workstations
with which the codec is to be used, in order to establish a basis for
the work in the rest of the thesis. In particular, limiting assumptions
about the network characteristics are made. Attention is also given to
the functional and physical structure of the workstations, and a model
for a packet voice terminal is presented, the individual elements of
which are considered in detail in subsequent chapters.
2.2 The network
2.2.1 Configuration
The basic network configuration considered in this thesis is shown in
Fig. 2.1. A number of integrated voice/data workstations are connected
to a local area network, enabling the transmission of voice and data
traffic among the "local" users (eg. A and B). A user outside this
local environment (eg. C) is considered to be "remote". In order to
extend communications to remote users, a gateway to a wide area network
supporting voice and data traffic is used. Since the gateway handles
all the traffic between local and remote users, it can access the WAN
by means of a single high-capacity link, as is done by current PABX
systems.
The possible inclusion of both local and wide area networks in the call
path makes the design of the voice terminal more demanding, as the two
types of network are often significantly different in terms of topology,
capacity, delay statistics and error rate. While the trend
towards wide area networks using fast packet switching and optical
transmission will reduce the magnitude of these differences, the process
of change is slow, due to the scale of the exercise, and there are
also limits to the convergence which may be achieved.
2.2.2 Switching technique
Packet switching is invariably used in local area networks, because it
is an efficient means for a number of data users to utilize a single
transmission channel. In the wider environment, circuit switching
still dominates the PSTN, due mainly to its ease of implementation.
However, as described in Chapter 1, packet switching is likely to find
increasing use in wide area networks in the future.
The fact that at least part of the overall link for voice traffic is
packet switched should be taken into account in the design of the
speech codec. The most important considerations in this respect are
that the network has the potential to provide a variable rate connection
between the encoder and decoder, and that any shortcomings of the
network in terms of capacity or error rate are likely to manifest as
gaps in the received speech signal, corresponding to missing packets.
By contrast, a wholly circuit switched network would provide a fixed-
rate connection, and network limitations would result in call blocking
and isolated bit errors.
2.2.3 Capacity
A fundamental assumption about the network is that the capacity of each
link is sufficient to allow digitized speech to be transmitted in real
time. From past experience with speech coding, it can be predicted
that a bit rate of about 16 - 64 kbps will be involved [Flanagan 79].
This is not likely to be a problem in the local environment, as LANs
require a link capacity of at least 1 Mbps in order to support bursts
of data traffic [Tanenbaum 81].
Considering the wider environment, standard voice channels in the PSTN
are currently capable of supporting data rates up to about 10 kbps,
although higher-bandwidth connections can be made over leased lines
[Tanenbaum 81]. However, it is certain that increased bandwidth will
be available to users of future versions of the PSTN, for example as
specified in the ISDN basic- and primary-rate interfaces, which provide
respectively for 144 kbps and 2.048 Mbps of transmission capacity
[Pandhi 87].
2.2.4 Channel errors
Assuming that the interface equipment associated with a network link is
functioning correctly, bit errors can be caused in the channel by
thermal noise, interference, and signal dispersion [Tanenbaum 81]. The
error rate in a particular channel is highly dependent on the type of
transmission medium used. For example, coaxial cables are more resistant to electrical interference than twisted pairs [Gee 83], and local area networks using the former typically have error rates of the order of 1 in 10^9.
While the current trend in wide area networks is towards the use of
optical transmission technology, which is highly reliable, the gradual
nature of this process means that the possible inclusion of a low-integrity link in the overall voice path must be considered. Error
rates up to 1 in 10^4 are currently encountered in the PSTN, and this is
likely to remain true for some time to come [Maitre 82]. An extreme
type of network as regards error rate is the mobile radio network, in
which signal fading may cause average error rates of up to 1 in 10^2
[Jayant 75]. For this reason, and because of the tight bandwidth
restrictions involved, the transmission of coded speech over mobile
radio networks is not considered in this thesis.
2.2.5 Delay
Signal delay has a deleterious effect on interactive voice communications, as is discussed in detail in Chapter 3. In general, the dominant causes of signal delay in packet switched networks are propagation
delay and queueing delay, the latter being incurred in network nodes
(ie. switches). In order for a network to support interactive voice
communications, queueing delay should be minimized, a requirement which
makes some networks installed for data communications poorly suited to
voice traffic, an example being the ARPANET [Weinstein 83].
2.3 The workstations
2.3.1 Functional components
Each integrated workstation on the network may be considered to consist of three functional components: a packet voice terminal associated with
voice communications, a packet data terminal associated with data
communications, and a local computing facility. While these entities
may overlap in their use of the workstation's physical resources,
distinguishing among them is conceptually useful, because this thesis
is concerned specifically with the packet voice terminal. By contrast,
the packet data terminal and local computing facility are assumed to
exist already.
2.3.2 Structure of the packet voice terminal
As the field of packet voice communications is evolving rapidly, a
sensible approach to the design of a packet voice terminal is to specify its functions in terms of a set of independent modules with clearly
defined interfaces [O'Leary 81]. Then if one of the modules is changed
or updated, the impact on the other modules is minimized. This philosophy is also compatible with a fundamental aim of this project, namely
to provide a voice communications facility which is independent of the
nature of the network as far as possible.
There are three essential modules in a packet voice terminal [O'Leary
81], as shown in Fig. 2.2. The first is the speech codec, which
converts the speech signal between its analog and digital forms and
implements the coding algorithm. The second is the network voice
protocol [Cohen 78], which provides service-specific features to allow
voice communications to take place over a network. In particular, the
packetization, prioritization, flow control, synchronization and fill-
in of voice traffic are performed in this module, as is discussed in
detail in Chapter 5. The third module is the access controller, which
provides the network-specific packet transport mechanism [O'Leary 81].
The nature of the access controller, and in particular its ability to
handle prioritized traffic, can have a considerable effect on voice
communications, as is discussed in Chapter 4.
In terms of the International Standards Organization (ISO) Reference
Model of Open Systems Interconnection [Tanenbaum 81], the access
controller covers the Physical and Data Link protocol layers, and the
network voice protocol covers the Network, Transport and Session
layers. While a practical packet voice terminal will probably include
higher-level protocols as well (ie. Presentation and Application
layers), these are not considered here.
The maintenance of independence among the above three modules places
[Figure: a stack of three modules, CODEC, NETWORK VOICE PROTOCOL and ACCESS CONTROLLER, connecting the handset to the NETWORK]
Fig. 2.2 Structure of the packet voice terminal
certain restrictions on the design of the packet voice terminal. For
example, some schemes for performing packet voice fill-in (a function
of the network voice protocol) operate on the speech signal in decoded
form. They therefore need to be implemented between the codec and the
handset, which clearly violates the structure of Fig. 2.2. A similar
observation applies to packet voice synchronization techniques which
require network-wide synchronized clocks, and therefore make special
demands on the access controller.
In the context of an integrated voice/data workstation, the packet
voice terminal is likely to share some physical resources with other
functional components, as described above. It is assumed in this
thesis that the network voice protocol is implemented on the workstation's main processor, and that the access controller is implemented
by means of dedicated hardware (eg. on an expansion card).
It is known from past experience that implementing the speech coding
algorithm on the workstation's main processor is not feasible, because
the general purpose nature of such devices leads to inefficient implementations of signal processing functions. Furthermore, the processor
already has a significant load placed on it by the high-level protocol
software. For these reasons, extra hardware is required for the speech
codec and the associated codec/protocol interface.
2.3.3 Summary and conclusions
The maintenance of modularity in a packet voice terminal is an important design goal. While it precludes the use of some techniques, it is
believed that its advantage in terms of "future-proofing" is of greater
significance. However, the flexibility provided by terminal modularity
is achieved at the expense of efficiency, and in practice it may be
necessary to violate this principle in order to increase information
throughput. An example of this is described in Section 7.10, where
part of the process of packetization (which is strictly speaking a
function of the network voice protocol), is assigned to the codec.
CHAPTER 3 : SPEECH QUALITY IN PACKET VOICE COMMUNICATIONS
3.1 Introduction
An important measure of the performance of a voice communication system
is the subjective speech quality as perceived by the users. Other
performance issues are the service availability and the network response time [Gruber 83], but as these are not related directly to the
design of the codec, they are not considered here.
Subjective speech quality in packet communications is typically influenced by a number of factors, namely distortion, delay, loss and corruption of the signal. In this context, "distortion" refers to deterministic perturbation of the signal, such as is caused by quantization,
whereas "corruption" refers to stochastic perturbations caused by chan
nel errors. In this chapter, the causes of these various degradations
are discussed, and consideration is given to their minimization.
3.2 Signal distortion
3.2.1 Introduction
The speech path in a digital communication system typically consists of
a microphone, pre-sampling filter, encoder, transmission channel, decoder, reconstruction filter, and earphone or loudspeaker. One of the major benefits of a digital channel is that distortion-free transmission over an arbitrary distance may be achieved, given sufficient
bandwidth and signal power [Taub 71]. However, practical restrictions
mean that a certain amount of signal corruption occurs in digital
channels and this issue is considered separately in Section 3.5.
In this section, signal distortion introduced by the other elements in
the speech path is discussed, in terms of "variable" distortion, which
is available for manipulation, and "fixed" distortion, which is not.
3.2.2 Fixed distortion
The distortion introduced by the pre-sampling and reconstruction filters is essentially in the form of signal bandlimiting, and is necessary to control the transmission bandwidth used by the signal and to prevent aliasing in digital systems. At present, "narrowband" telephony (300 - 3400 Hz) is almost universal [Jayant 84], although "wideband" telephony (50 Hz - 7000 Hz) is likely to become of increasing importance in the future [Mermelstein 88].
Microphones and earphones currently used in narrowband telephony
exhibit considerable non-linearity in dynamic range and non-uniformity
in frequency response, implying correspondingly large signal distortion, although some aspects of this distortion, such as the attenuation of low-level background noise and the pre-emphasis of high
frequencies, can actually enhance speech intelligibility [Gayford 70].
While it is likely that future transducers will introduce less signal
distortion, in response to the requirements of wideband telephony
[Maitre 82], the nature of current devices should be allowed for in the
design and evaluation of a narrowband voice communications system.
3.2.3 Variable distortion
The nature and degree of the distortion introduced by the process of
encoding and decoding the signal is dependent on the coding algorithm
used. Appropriate measures of this distortion are also coder-
dependent, since some techniques attempt to preserve the perceptual
qualities of speech, while others attempt to preserve its waveform, as
is discussed in Chapter 7. Where a waveform coding technique is used,
a reasonable measure of signal distortion is the signal-to-noise ratio
(SNR) of the decoded speech. The SNR level required for speech coders
in the current telephone network is 33.9 dB [Jayant 84]. However, it
is important to note that this figure includes an allowance for up to
14 tandem encoding-decoding operations in the network. If no such
operations take place (as is likely in a fully digital network), then
an SNR of 22 dB provides equivalent quality [Bylanski 84].
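The SNR figures quoted above can be computed directly from the original and decoded sample sequences. The following sketch uses the standard definition; the constant-error test signal is synthetic and chosen purely for illustration:

```python
import math

def snr_db(original, decoded):
    # SNR of decoded speech in dB: 10*log10(signal power / error power)
    signal_power = sum(x * x for x in original)
    error_power = sum((x - y) ** 2 for x, y in zip(original, decoded))
    return 10.0 * math.log10(signal_power / error_power)

# A constant error of sqrt(0.001) on a unit-amplitude signal gives an
# error power 1/1000 of the signal power, i.e. 30 dB SNR.
original = [1.0] * 1000
decoded = [1.0 + math.sqrt(0.001)] * 1000
print(round(snr_db(original, decoded), 1))  # -> 30.0
```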
3.2.4 Conclusions
In order to maintain compatibility with current systems, it should be
assumed that standard narrowband transducers and filters are used.
Attempts to minimize the overall signal distortion should therefore be
concentrated on the speech coding algorithm.
3.3 Signal delay
3.3.1 Types of delay
It is possible to identify a number of independent components of the
total signal delay in packet switched networks. The first is the
length of time it takes to sample the input speech signal and convert
it to digitally coded form, which is known as the encoding delay.
Sufficient code words must then be accumulated to fill a packet, leading to the packetization delay. While the packet waits to be transmitted, it incurs queueing (ie. access) delay. The time taken to clock
the packet out of the transmitter depends on the bit rate of the link,
and is known as the transmission time (or delay).
If the packet does not have to pass through any intermediate nodes in
the network, the time it takes to reach the receiver is determined
solely by the propagation delay. However, in a multi-link network
there will be further queueing and transmission delays associated with
each network node. Once the packet reaches the receiver, it is put
into a buffer, in which it incurs a synchronization delay before being
played out to the decoder. Finally, the process of converting the
coded speech back to analog form adds decoding delay.
For a given call path, some of the above delays are variable (eg.
queueing delay) while others are fixed (eg. propagation delay). It is
the function of the synchronization algorithm in the network voice
protocol to compensate for variable delays, which it does by introducing extra synchronization delay where required, as is described in
Section 5.5. If the synchronization process is perfect, then the total
signal delay is constant for all packets.
In practice, perfect synchronization is difficult to achieve, so that
the total signal delay may still contain a degree of variability. This
implies temporal distortion of the decoded speech, the subjective
effects of which are discussed in Section 5.5.2.5. For the purposes of
the present discussion, however, it is assumed that any variable delay
the signal incurs is fully compensated for by the synchronization algorithm, although at the possible expense of extra fixed delay.
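As a rough worked example, the delay components identified above can be summed into a one-way delay budget. All of the figures below are assumptions chosen for illustration, apart from the one-sample encoding delay of a sequential coder and the 270 mS satellite propagation delay cited elsewhere in this chapter:

```python
# Illustrative one-way delay budget for a packet voice path
# (all figures in milliseconds; values are assumptions, not measurements).
delays_ms = {
    "encoding":        0.125,  # one sample period for a sequential coder
    "packetization":   16.0,   # time to accumulate one packet of code
    "queueing":        5.0,    # access delay at the transmitter
    "transmission":    1.0,    # packet length divided by link bit rate
    "propagation":     270.0,  # eg. one hop over a satellite link
    "synchronization": 20.0,   # receiver buffering for variable delay
    "decoding":        0.125,
}

total_ms = sum(delays_ms.values())
print(f"Total one-way delay: {total_ms:.1f} ms")
```

With these assumed values the total already exceeds the 300 mS figure discussed in the next section, which illustrates why the delays under the terminal designer's control must be kept small.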
3.3.2 The subjective effects of fixed delay
When the fixed delay in a voice communications system is excessive,
users involved in a conversation tend to mistake the pause which occurs
while a talkspurt is propagating across the network as an indication
that the other party has stopped talking. This results in frequent
talkspurt "collisions" between the two users, and a degradation in the
quality of the service [Seidl 87]. While a "limit of acceptability"
cannot be defined precisely for fixed signal delay, the CCITT has
suggested that only very disciplined users who are aware of the problems involved can effectively use a connection with an end-to-end delay
of greater than 300 mS [Seidl 87].
Smaller delays can also cause significant service degradation when
combined with inadequate isolation between the transmit and receive
speech paths, as delayed echoes are confusing to users if they are not
sufficiently attenuated, typically making them stutter [Tanenbaum 81].
While the electrical echoes associated with two-wire/four-wire hybrids
and other line impedance mismatches do not occur in fully digital
networks, acoustic echoes can still be introduced if there is a significant degree of acoustic coupling between the transmit and receive
paths in the remote telephone set, as in the case of loudspeaking
telephones [Seidl 87].
3.3.3 Delay minimization
In the general case, a speech path will include local and remote LANs,
as well as an intervening WAN, as discussed in Chapter 2. The signal
delay within the WAN will consist of transmission, propagation and
queueing delays. Through the use of high-capacity links and high-speed
switches, it is possible to reduce the transmission and queueing delays
in wide area networks to sub-millisecond values. However, the propagation delay is constrained by the speed of light, and can be considerable in a long-distance link, an example being the 270 mS one-way
propagation delay in a satellite link [Tanenbaum 81]. This means that
the delay incurred by the signal in the LANs should be a few tens of
milliseconds at most, so as to maintain acceptable performance for
long-distance calls.
For a given LAN, the only delays which cannot be manipulated are the
transmission and propagation delays. Minimization of the encoding and
decoding delays requires the choice of a suitable coding technique.
Some speech coding algorithms operate on a "block" or "frame" of the
speech signal at a time, and the encoding delay hence consists of a
period during which sufficient data is acquired to fill the block
(typically about 30 mS), and a further period in which the block is
analysed (typically about 20 mS) [Seidl 87]. (However, in this case no
further packetization delay is involved, so that the effective encoding
delay is the time taken to analyse the block.)
Other speech coding algorithms operate on a sample-by-sample or
"sequential" basis, and therefore introduce an encoding delay of only
one sample period (typically 125 uS). With both block and sequential
coding techniques, the decoding delay is usually less than or equal to
the encoding delay.
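The contrast between the two classes of coder can be made concrete with the figures quoted above. This is a minimal sketch; the 8 kHz sampling rate and the 16 mS packet length are assumed values:

```python
SAMPLE_RATE_HZ = 8000   # assumed narrowband sampling rate
PACKET_MS = 16.0        # assumed packet length (hypothetical)

# Block coder: a 30 ms frame fills the packet directly, so the
# effective encoding delay reduces to the 20 ms analysis time.
block_analysis_ms = 20.0
block_effective_ms = block_analysis_ms

# Sequential coder: one sample period of encoding delay, with the
# packetization delay still incurred on top of it.
sequential_encoding_ms = 1000.0 / SAMPLE_RATE_HZ   # 0.125 ms
sequential_total_ms = sequential_encoding_ms + PACKET_MS

print(block_effective_ms, sequential_total_ms)  # -> 20.0 16.125
```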
Current specifications for encoding/decoding delays in interactive
voice communications over wide area networks are in the range 2 mS
[Maitre 82] to 4 mS [Mermelstein 88]. However, these requirements are
imposed primarily for echo control purposes [Mermelstein 88], and it is
likely that in a fully digital network they could be relaxed somewhat.
The packetization delay is determined by the bit rate of the codec and
the packet length, which is a parameter of the network voice protocol.
The packet length is usually selected by trading off a number of conflicting requirements, one of which is the requirement for a small
packetization delay, as is discussed in Section 5.2.
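The relationship is simply the packet payload size divided by the codec bit rate. A small sketch (the 512-bit payload and 32 kbps rate are hypothetical figures for illustration):

```python
def packetization_delay_ms(payload_bits: int, codec_rate_bps: int) -> float:
    # Time taken to accumulate enough code words to fill one packet.
    return 1000.0 * payload_bits / codec_rate_bps

# eg. a 512-bit payload at 32 kbps takes 16 ms to fill
print(packetization_delay_ms(512, 32_000))  # -> 16.0
```

Halving the codec bit rate doubles this delay for a fixed packet size, which is one of the conflicting requirements referred to above.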
The queueing delay in the transmitter (ie. the access delay) depends
in general on the nature of the access mechanism and on the prevailing
network load, which is in turn influenced by the flow control algorithms implemented by all workstations on the network. It is thus to be expected that a well-designed flow control algorithm will tend to
reduce the access delay.
As mentioned in Section 3.3.1, the synchronization delay is a variable
quantity introduced by the synchronization algorithm in the network
voice protocol to compensate for other variable delays in the signal
path. Minimization of this delay is thus dependent on the minimization
of the other variable delays, as well as on the design of the synchronization algorithm.
3.3.4 Summary and conclusions
Due to its deleterious effects on interactive communications, and the
possibility of echo-related problems, it is desirable that the total
delay between the production and reproduction of speech is minimized.
For a given network and access controller, this requires the use of a
low-delay speech coding algorithm, an appropriate voice packet length,
an effective voice/data flow control algorithm, and a well-designed
packet synchronization scheme. Furthermore, the variability of the
delay encountered by packets within the network is of concern, as
variable delay must be compensated for by increased total delay if
temporal distortion of the signal is to be avoided.
3.4 Signal loss
3.4.1 Introduction
In a packet switched network, signal loss manifests as the absence at
the receiver of a portion of the code stream corresponding to the
length of one or more packets. There are in general four possible
causes of such "missing" packets. One possibility is that no attempt
is made to transmit a packet, because it contains code which corresponds to an interval of silence in the input signal. These are called
"silent" packets in this thesis, and their elimination should in theory
not detract from the perceived speech quality. The extent to which
this is true in practice is discussed separately in Section 3.6.
3.4.2 Lost packets
When an attempt is made to transmit a packet (ignoring for the moment
whether it is silent or not), there are in general three ways in which
it can become "lost". Firstly, it may be discarded at the transmitter
or at one of the network nodes, due to a temporary overload in the
associated network link. This is referred to here as a "blocked"
packet. Secondly, a packet which is not blocked may have its header
(containing address and control information) corrupted by a channel
error, with the result that it fails to reach the receiver, or is
discarded when it does so. This is called a "corrupted" packet. The
third possibility is that the packet is neither blocked nor corrupted,
but incurs so much delay in transit that by the time it reaches the
receiver it is too late to be used, and is hence a "late" packet. As
blocked, corrupted and late packets are those which are offered to the
transport network but are not delivered in a useful way, they are
referred to collectively as "lost" packets.
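The taxonomy above can be summarised as a simple receiver-side classification. The packet structure and field names below are hypothetical, introduced only to make the three loss categories explicit:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoicePacket:
    seq: int
    header_ok: bool               # header passed its error check
    arrival_ms: Optional[float]   # None if never delivered

def classify(pkt: VoicePacket, deadline_ms: float) -> str:
    # Classify a packet according to the taxonomy of Section 3.4.2.
    if pkt.arrival_ms is None:
        return "blocked"     # discarded due to a link overload
    if not pkt.header_ok:
        return "corrupted"   # header damaged by a channel error
    if pkt.arrival_ms > deadline_ms:
        return "late"        # arrived after its play-out time
    return "delivered"
```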
3.4.3 The effect of lost packets on speech quality
As lost packets may occur at any point in the speech signal, including
during talkspurts, they are expected to detract to some extent from the
overall speech quality. It is known that for PCM-coded speech a lost
packet rate of about 1 in 100 is subjectively acceptable, provided that
individual packets contain less than about 32 mS of speech [Jayant 81].
As a PCM system encodes each sample independently, this figure relates
only to the effect of the gaps in the decoded signal, and assumes that
the loss of a packet has no effect on subsequent portions of the
signal. However, in coding schemes which are more bandwidth-efficient
than PCM, samples are usually not coded independently, implying that
the effect of a lost packet will not be isolated. This means that with
such schemes attention must be given to ensuring that tracking between
the encoder and decoder is quickly re-established after packet loss
occurs.
It is also important to note that the above figure for an acceptable
packet loss rate is based on average statistics, and does not consider
the temporal distribution of lost packets or their relative perceptual
importance. For example, the loss of two successive packets is likely
to be more disturbing than if the packets are widely separated. In
general, it is possible to manipulate the probability that a packet
will be blocked or late by altering its transmission priority relative
to that of other traffic. This can take place either in the access
controller, as is discussed in Chapter 4, or in the network voice
protocol, as is discussed in Chapter 5. The effect of lost packets on
perceived speech quality also depends on the way in which gaps in the
packet stream are filled in at the receiver, as is discussed in Chapter
5.
3.4.4 Summary and conclusions
Gaps in the received packet stream due to lost (blocked, corrupted or
late) packets are expected to have a more severe effect on speech
quality than those due to silent packets. While packet loss caused by
channel errors is unavoidable (in the absence of error correction), it
is possible to minimize the incidence of blocked and late packets by
means of appropriate flow control and synchronization algorithms.
Furthermore, the effect of such gaps on speech quality can be reduced
by means of suitable voice prioritization and packet fill-in algorithms.
3.5 Signal corruption
3.5.1 Introduction
Bit errors in the transmission channel can corrupt either the header or
the information field of a voice or data packet. There are three
possible responses to this situation. Firstly, it can be assumed that
errors are sufficiently infrequent as to be insignificant, so that the
possibility of a packet being corrupted is ignored. Secondly, corrupted packets can be identified by means of error-detecting codes [Tanenbaum 81], after which they are simply discarded. Thirdly, errors can
be corrected, either by requesting that corrupted packets are retransmitted, or through the use of error-correcting codes [Tanenbaum 81].
A minimum requirement in any packet switched network is that errors in
the packet header are detected, in order to ensure that the packet is
not delivered to the wrong destination. In the case of a data packet,
it is also required that errors in the header or information field are
corrected. Due to the relatively low error rates of the majority of
data networks, this is most efficiently done by means of retransmission
[Tanenbaum 81].
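Header error detection is commonly implemented by appending a checksum or cyclic redundancy check to the protected field. A minimal sketch using a standard 32-bit CRC (the header contents are hypothetical; real networks define their own check fields and polynomials):

```python
import zlib

def protect(header: bytes) -> bytes:
    # Append a 32-bit CRC so the receiver can detect header errors.
    return header + zlib.crc32(header).to_bytes(4, "big")

def header_intact(protected: bytes) -> bool:
    # Recompute the CRC over the header and compare with the stored value.
    header, crc = protected[:-4], protected[-4:]
    return zlib.crc32(header).to_bytes(4, "big") == crc

pkt = protect(b"\x01\x02dest=42")
assert header_intact(pkt)
corrupted = bytes([pkt[0] ^ 0x80]) + pkt[1:]   # flip one header bit
assert not header_intact(corrupted)
```

A packet failing this check would be discarded rather than misdelivered, consistent with the minimum requirement stated above.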
3.5.2 Corruption of voice packets
The retransmission of a corrupted voice packet is undesirable because
of the extra delay incurred. A packet with a corrupted header will
therefore be discarded at some point in the network, so that the packet
is "lost" as far as the receiver is concerned, as discussed in Section
3.4.
If only the information field is corrupted, it is usually better to
make use of the packet than to discard it, as the robustness of the
speech perception mechanism means that errors in the decoded signal are
tolerable. While error-correcting codes can be used to remove errors
in the information field, the extra processing and bandwidth associated
with such codes means they are only worthwhile with very low-integrity
channels such as in mobile radio networks [Natvig 88].
The effect of corruption of the speech code on the decoded signal will
depend on the coding technique used and the way in which the code is
formatted, as is discussed in detail in Section 7.8. However, of
particular concern is the case where the coding algorithm is adaptive,
as it is then possible that a single bit error will cause a decoder
error which persists indefinitely, unless special provision is made in
the coding algorithm itself for the effect of bit errors to be dissipated with time [Goodman 75].
Apart from the actual speech code, there may also be "side information"
in the packet information field, such as a quantizer step size or
predictor coefficient, which can be used to assist the decoder in
recovering from packet loss. As the decoder is likely to be more
sensitive to the corruption of this side information than of the speech
code itself, limited forward error correction might be applied in this
case. As a minimum requirement, errors in the side information should
be detected, so that it can be ignored if it is corrupted.
3.5.3 Summary and conclusions
Error correction by retransmission is not appropriate for voice traffic, and should therefore not be implemented by the low-level protocols
of a network used for interactive voice communications. Forward error
correction is also not applicable in general, although it may in some
cases be applied to small portions of the information field. It is
therefore important that the decoder itself is able to recover from the
effects of speech code corruption.
3.6 Silence elimination
3.6.1 Introduction
The detection and elimination of silence from the transmitted signal in
a packet voice network represents a special case of signal loss, as
considered in general in Section 3.4. In theory, silence elimination
should not have a noticeable effect on perceived speech quality, as
long as appropriate inter-talkspurt pauses are reconstructed at the
receiver. In practice, there may be significant direct effects on
speech quality, as considered below, as well as indirect effects, as
are discussed in Chapter 5 in the context of packet synchronization and
fill-in.
3.6.2 The advantage of silence elimination
Silence elimination offers a potential bandwidth saving of up to about
50% in packet voice networks [Forgie 76]. In the somewhat unrealistic
case of a voice-only packet switched network, realization of this
saving requires that 40 or more conversations are in progress at a
given time, in order to allow the statistical multiplexing of independent signals to be effective [Forgie 76]. While this requirement can
be relaxed through the queueing of voice traffic [Weinstein 79], this
has the undesirable effect of increasing the signal delay.
By contrast, in an integrated voice/data network, the less stringent
delay requirements of data traffic mean that it can be queued in order
to aid the multiplexing process, with the result that the increase in
total traffic throughput associated with silence elimination is realizable with fewer than 40 conversations [Forgie 76].
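The multiplexing argument can be illustrated with a simple binomial model, in which each of n independent conversations is in a talkspurt with probability p. The 40% activity figure below is an assumption, consistent with the roughly 50% saving cited above:

```python
from math import comb

def prob_overload(n_talkers: int, p_active: float, channels: int) -> float:
    # Probability that more than `channels` of n independent talkers
    # are in a talkspurt at the same instant (binomial tail).
    return sum(comb(n_talkers, k) * p_active ** k
               * (1 - p_active) ** (n_talkers - k)
               for k in range(channels + 1, n_talkers + 1))

# With 40 conversations at 40% activity, a link sized for 24
# simultaneous talkspurts is rarely overloaded:
print(prob_overload(40, 0.4, 24) < 0.01)  # -> True
```

With far fewer talkers the binomial tail is much heavier relative to capacity, which is why the saving is only realizable once enough independent conversations are multiplexed.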
3.6.3 The disadvantages of silence elimination
The process of discriminating between talkspurts and silence is commonly known as Speech Activity Detection (SAD) [Weinstein 83], and one
disadvantage of silence elimination is the potential degradation of the
speech signal caused by the non-ideal operation of practical SAD devices [Seidl 87]. Traditional SAD algorithms discriminate between
talkspurts and silence on the basis of a measurement of the short-term
signal energy [Drago 78]. More sophisticated schemes which make use of
other signal characteristics, such as the signal's zero-crossing rate,
have also been proposed, with corresponding increases in algorithmic
complexity and processing delay [Un Aug. 80].
There are two types of "mistake" which a SAD algorithm can make, namely
to erroneously reject a portion of a talkspurt, and to erroneously
accept silence. The first type of mistake typically manifests as the
clipping of the front-end or tail-end of a talkspurt, which may be
difficult to distinguish from background noise due to its low energy
[Drago 78]. Unfortunately, these portions of talkspurts are often
perceptually important, particularly low-energy consonants such as the
"s" in "stop" [Drago 78].
This problem is exacerbated by high levels of background noise, and by
any mismatch which may exist between the long-term signal energy and
the decision threshold of the SAD algorithm, although an improvement
can be obtained in such cases through the use of adaptive energy
thresholds [Drago 78]. (Long-term energy mismatch can be caused by
changes in the parameters of analog portions of the speech path, eg.
transducers, transmission lines and amplifiers, as well as by the
varying loudness of speech from different speakers).
The erroneous rejection of portions of talkspurts by simple energy-based SAD algorithms can also be reduced by introducing "hangover"
periods, which keep the SAD output in the active state for a short time
(typically about 250 mS [Drago 78]) after the end of a talkspurt is
indicated by the short-term energy calculation. A similar technique
can be used to allow anticipation of the commencement of a talkspurt,
although this requires an appropriate degree of buffering in the transmitter, and consequent extra delay [Weinstein 83].
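The energy-threshold-with-hangover scheme described above can be sketched as follows. This is a minimal illustration operating on abstract per-frame energies; the threshold and hangover length would be tuned in practice (a 250 mS hangover corresponds to several frames at typical frame rates):

```python
def sad_with_hangover(frame_energies, threshold, hangover_frames):
    # Energy-based speech activity detection with a hangover period:
    # the output stays active for `hangover_frames` frames after the
    # short-term energy last exceeded the threshold.
    active = []
    countdown = 0
    for energy in frame_energies:
        if energy >= threshold:
            countdown = hangover_frames
            active.append(True)
        elif countdown > 0:
            countdown -= 1
            active.append(True)
        else:
            active.append(False)
    return active

# A low-energy talkspurt tail is retained by the hangover:
print(sad_with_hangover([5, 9, 2, 1, 1, 1], threshold=4, hangover_frames=2))
# -> [True, True, True, True, False, False]
```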
The second type of SAD error, involving the acceptance of a portion of
a silence interval, is not detrimental to the intelligibility of
speech, but can nevertheless be subjectively annoying. With this type
of error, the "silence", which is in practice low-level background
noise, is transmitted, and because it is often different in nature from
the artificial silence which is generated by the receiver to fill in
for missing packets, it is more noticeable to the listener than would
have been the case if the connection were continuous [Forgie 76].
Finally, the overall effect of silence elimination in a speech communication link is a loss of "subjective transparency" [Derby 87]. During
talkspurts, the listener is usually aware to some extent of background
noise in the speaker's environment. Depending on the fill-in strategy
used in the receiver, this noise either ceases or is replaced by random
noise during silent intervals. Whether this is perceived as a loss of
subjective transparency depends on the level of background noise, and
how well the fill-in noise matches it [Dvorak 88].
In recognition of this problem, it has been proposed that a few packets
of background noise be transmitted at the end of each talkspurt, and
that at the receiver these packets be played out with gradually increasing attenuation, thereby fading the noise into silence [DeTreville 83]. While this strategy is reported to be effective in an "office"
environment, it is not known whether it will be sufficient in more
noisy environments.
Also of concern is the case where the conversation is suspended for a
time (such as when one user puts down the handset in order to perform
some task). If silence is eliminated from the speech signal, the other
user is left with a "dead" connection. Although a solution to this is
for the user who puts down the handset to put the other "on hold" until
the conversation is resumed, this requires a degree of user discipline,
and is arguably less friendly than the maintenance of an open connection.
3.6.4 Summary and conclusions
While bandwidth saving has traditionally formed a strong incentive for
eliminating silence from packet voice communications, it is believed
that this is outweighed by the perceptual disadvantages when
transmission bandwidth is not at a premium. Furthermore, the on/off
approach to speech transmission does not take full advantage of the
ability of a packet switched network to provide a truly variable rate
connection.
For example, low bit rate transmission of silence can be implemented
straightforwardly on such a network. Apart from the improvement in
subjective transparency this offers, it also means that any mistake
made by the SAD algorithm is likely to have a relatively minor
perceptual effect, since it will merely cause a change in coding rate
(and hence in speech quality), instead of affecting the continuity of
the transmitted signal. Such a scheme also has advantages for the
synchronization and fill-in of voice packets, as is described in Chapter 5.
3.7 Overall speech quality
3.7.1 Quality standards
The subjective quality of digitally coded speech is usually described
in terms of four broad categories, namely broadcast, toll,
communications, and synthetic quality [Flanagan 79]. Broadcast quality speech
has a bandwidth (7 kHz or more) which is wider than that currently
associated with telephony. Toll quality speech is accepted as standard
for commercial telephony, and is indistinguishable in terms of signal
distortion from undigitized speech which has been correspondingly band-
limited [Jayant 84].
Communications quality speech, by contrast, contains detectable
distortion, but suffers from very little degradation of intelligibility.
Finally, synthetic quality speech is characterized by substantial loss
of "naturalness", which may not be a disadvantage in person-machine
transactions [Gold 77], but is unacceptable in person-to-person
communications in which factors such as speaker recognition are important.
It is important to note that standards for digital speech quality have
until recently been dominated by the characteristics of the current
PSTN (ie. analog transmission, fixed bandwidth connections, circuit
switching). With the increasing use of Digital Circuit Multiplication
(DCM) equipment and the advent of packet switched voice communications,
it has become necessary to take account of forms of degradation
peculiar to such systems, such as sample (or packet) loss, variable
delay, and the effects of variable rate coding. A subjective testing
methodology for such systems is currently under development by the
CCITT [Dvorak 88].
3.7.2 Maximizing the overall speech quality
In principle, it is possible to trade the various factors affecting
speech quality in packet voice communications off against each other.
For example, packet loss due to network congestion can be avoided by
reducing the bit rate (and hence possibly increasing the distortion) of
the coded speech. Alternatively, for a given bit rate, the incidence
of packet loss due to momentary link overload can be reduced by
increasing the maximum lengths of queues in the network (and hence
increasing the average signal delay). As another example, signal
corruption can be reduced by means of forward error correction, at the
expense of increased network load.
A rigorous approach to the design of a packet voice communications
system would thus attempt to maximize the overall subjective speech
quality by finding an optimal compromise among the above degradations.
However, this is difficult to do in practice because the various types
of degradation interact with each other. For example, it has been
found that when the signal distortion is low (as in 64 kbps PCM
coding), subjective quality is affected more by increased packet loss
than by increased signal delay, whereas for higher signal distortion
(as in 32 kbps Delta Modulation coding) the reverse is true [Aoki 86].
Furthermore, performing formal subjective tests for incremental changes
in each type of degradation is impractical.
Another issue is that because the causes of degradation are distributed
across the network, it is not possible to apply accurate control at any
single point. For example, the transmitter does not know in general
how long a packet will take to reach the receiver, implying that it
cannot make an accurate trade-off between packet loss and signal delay
by extending its transmit queue length. (However, this trade-off can
be made at the receiver, by appropriate manipulation of the receive
buffer length, as is described in Section 5.5.) Similarly, the trans
mitter does not know in advance how many packets will be discarded by
"downstream" network nodes due to local congestion conditions, and thus
cannot make an accurate trade-off between the signal coding rate and
the packet loss rate.
3.7.3 Conclusions
Due to the practical problems involved in a rigorous trade-off among
the various types of quality degradation in a packet switched network,
it is necessary to adopt a pragmatic design approach. Specifically,
fixed target levels are set for most of the forms of degradation, and
an attempt is then made to minimize the remaining degradations. Thus
if the packet loss rate is constrained to be less than 1 in 100, the
total signal delay (excluding propagation delay) less than 50 mS, and
the channel bit error rate less than 1 in 10³, then speech quality can
be measured in terms of the signal distortion.
CHAPTER 4 : THE ACCESS CONTROLLER
4.1 Introduction
In a packet switched local area network, it is necessary for access to
the transmission channel to be re-negotiated for each packet in a call,
and this is done by the access controller shown in Fig. 2.2. As the
terminals connected to the LAN usually compete for the use of the
channel on a distributed basis, the problem of access control is
frequently complex [Kurose 84].
It is assumed in this thesis that access multiplexing is performed by
means of statistical time division. This approach, which is used in
the majority of existing data LANs, implies that in general packets do
not get instant access to the channel, and have to wait for some period
(which may be fixed or variable) until the channel is available. An
alternative multiplexing technique uses "spread spectrum" modulation,
which can avoid access delay at the expense of variable SNR in the
received signals [Kahn 78].
4.2 Contention-based access vs. ordered access
While many distinct access control schemes have been proposed for local
area networks [Kurose 84], it is possible to identify two generic
types. The first is the "ordered" or "controlled" access type, an
example being token passing, which is used in the Token Ring [Gee 83].
The second is the "contention-based" or "random" access type, an
example being Carrier Sense Multiple Access with Collision Detection
(CSMA/CD), which is used in the Ethernet [Gee 83].
A characteristic of contention-based access schemes is that the access
delay is not bounded (ie. it cannot be guaranteed that a particular
packet will be successfully transmitted within a given time period.)
This is not a major problem for most data traffic, but is clearly of
concern when the packet contains coded speech, which must meet real
time end-to-end delay constraints. In practice, voice packets which
experience excessive access delay will be discarded at the transmitter,
and hence lost.
By contrast, an upper bound can be defined for the access delay of a
particular packet when an ordered access technique is used. (Whether
or not this upper bound will be acceptable to voice traffic is another
issue.) As might be expected from these considerations, it has been
found that token passing results in better voice performance than
CSMA/CD in an integrated voice/data network [DeTreville 84].
Nevertheless, the relative ease of implementation of contention-based
access schemes has resulted in their wide proliferation in data LANs.
4.3 Priority access
In addition to discriminating between ordered and contention-based
access techniques, it is possible to make a distinction between those
which support prioritized traffic, and those which do not.
Workstations of the former type typically refrain from accessing the
channel if they do not have packets of a sufficiently high priority to
transmit, thereby making
bandwidth available to other workstations which have more important
traffic. For example, if voice data is given a higher access priority
than data traffic in an integrated network, then the variable delay
incurred by voice packets will be reduced, regardless of whether an
ordered or a contention-based access technique is used.
However, as far as this thesis is concerned, it is important to note
that a large number of currently installed data LANs (in particular
Ethernet LANs), do not support priority access. This means that it is
not possible to guarantee the continuity of voice calls on these LANs,
particularly under adverse channel conditions, such as during bulk data
transfers. Nevertheless, it is likely that it will be possible to
reduce the incidence of call interruptions to an acceptably low level,
by means of suitable flow control measures applied to both voice and
data traffic.
4.4 Summary and conclusions
In order to allow the speech codec and network voice protocol to be
used with a variety of local area networks, restrictive assumptions
should not be made about the access controller associated with the
network. In particular, the possibility of voice packets incurring
variable, and in some cases unacceptably high access delays should be
considered. This implies the need for packet voice synchronization and
fill-in algorithms in the network voice protocol, and these are
discussed in Chapter 5. In addition, the effect of excessively delayed
(ie. lost) packets on the codec must be considered, and provision made
for tracking between the encoder and decoder to be recovered.
Furthermore, it should not be assumed that priority access to the
channel is supported by the access controller. This means that
attention must be given to the identification and isolation of the most
important portions of the coded voice signal, so that the impact of
adverse channel conditions on voice communications is minimized. This
issue is discussed in detail in Chapters 5 and 7.
CHAPTER 5 : THE NETWORK VOICE PROTOCOL
5.1 Introduction
The network voice protocol utilizes the network and access controller
to provide a real-time voice communication facility. It has two main
tasks, namely to establish/disconnect a call, and to maintain a call
which is in progress [Cohen 78], [O'Leary 81].
In order to establish a call, the network voice protocol must interact
with the user to obtain the number of the callee. It must then
associate this number with the physical address of a workstation, and
negotiate the setting up of the call with the network and workstation
concerned. At the end of the call, disconnection must be similarly
negotiated. It is also necessary that audible signalling tones
(dialling, ringing and busy) be conveyed to the user at various stages of
the above process, and the generation of these tones impacts on the
codec design, as is considered in Chapter 6.
As regards call maintenance, the network voice protocol in the
transmitter must packetize, prioritize and flow control the coded speech.
In the receiver, it must depacketize the speech, synchronize its play-
out, and fill in any gaps caused by missing packets. As many of these
functions of the network voice protocol have a significant effect on
the design of the speech codec, they are considered in detail below.
5.2 Packetization
5.2.1 Introduction
The process of packetization involves collecting a suitable amount of
coded speech, formatting it into the information field of a packet, and
appending a packet header containing addresses and control information.
Depacketization reverses this process to produce the original coded
speech. A fundamental issue in this context is the choice of the size
of the packet information field, or equivalently, the total packet
length for a given header length.
5.2.2 Factors influencing the optimum packet length
Overall signal delay is an important factor which influences the choice
of the length of voice packets. Increasing the packet length reduces
the packet production rate, thereby reducing the delay incurred in
network nodes in processing each packet [Suda 84]. On the other hand,
increasing the packet length increases the packetization delay, for a
given coding rate. This means that minimization of the overall delay
involves trading these delay components off against each other [Minoli
79].
Also of importance is the fact that increasing the packet length for a
given coding rate increases the audibility of gaps in the decoded
speech caused by missing packets [Jayant 81]. Finally, decreasing the
packet length decreases the bandwidth efficiency of the packetization
process, since the proportion of the total packet length consumed by
the packet header increases. In practice, an upper bound is placed on
the packet length by delay and loss-perception issues, while a lower
bound is defined by efficiency considerations.
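The competing bounds described above can be made concrete with a small calculation. The header length and coding rate below are assumed for illustration only, not values specified by this thesis.

```python
# Illustrative trade-off between packetization delay and header
# efficiency as the packet length varies. All parameters assumed.

HEADER_BITS = 26 * 8   # assumed packet header length in bits
CODING_RATE = 16000    # assumed speech coding rate in bits/s

def packetization_delay_ms(info_bits):
    """Time to fill the information field at the given coding rate."""
    return 1000.0 * info_bits / CODING_RATE

def efficiency(info_bits):
    """Fraction of the total packet occupied by speech code."""
    return info_bits / (info_bits + HEADER_BITS)

for info_bits in (128, 512, 2048):
    print(f"{info_bits:5d} bits: "
          f"delay {packetization_delay_ms(info_bits):6.1f} ms, "
          f"efficiency {efficiency(info_bits):.2f}")
```

As the information field grows, the packetization delay rises linearly while the header overhead shrinks, which is precisely the tension between the upper and lower bounds on the packet length noted above.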
5.2.3 Instantaneous variations in the packet length
Given that the conditions prevailing in a network vary dynamically, it
has been proposed that the packet length should be adapted accordingly.
In one such scheme [Gonsalves 82], an attempt is made to transmit a
packet once it contains a reasonable amount of information. During any
delay incurred in the process of transmission, the packet grows in
length, up to a specified maximum. If the packet still has not been
transmitted when the maximum length is reached, its contents are
managed as a FIFO queue, the oldest information being discarded when
new information becomes available.
The main benefits of this scheme are that it reduces the total delay in
the transmitter by effectively overlapping the packetization and access
delays, thereby avoiding the discarding of packets due to excessive
access delay. It also transmits fewer packets under heavy load,
thereby applying a degree of negative feedback [DeTreville 84]. However,
satisfactory synchronization of a stream of variable-length packets can
only be achieved at the receiver at the expense of extra synchroniza
tion delay, in order to prevent the receiver running out of code for
playback when a short packet is followed by a long packet. In addi
tion, the fact that the packet length in the above scheme is not set
until the time of transmission means that it must be implemented in the
access controller itself, rather than in the network voice protocol.
5.2.4 Long-term variations in the packet length
It is also possible to adapt the packet length to long-term changes in
the network conditions. As described in Section 5.2.2, minimization of
the total packet delay involves a trade-off between the packetization
delay and the nodal queueing delay, so that if the queueing delay
increases significantly due to congestion, the use of longer packets is
indicated. In order to avoid the possible synchronization problems
described in the previous section, the packet length can be adapted
gradually. Another possibility is to determine an appropriate packet
length at the beginning of a call, and keep this fixed for the call's
duration.
5.2.5 Summary and conclusions
A significant implication of the above discussion is that the codec
should allow flexibility in the choice of the packet length. In block-
oriented speech coding techniques, the encoder collects a number of
input samples and analyses them before producing the output code. This
means that the total encoding delay (ie. collection + analysis) is
considerable, typically about 50 mS [Seidl 87]. However, if the packet
length is made equal to an integral number of block lengths, then there
is no need for further delay to be incurred by the packetization
process, as mentioned in Section 3.3.3. Nevertheless, the need to relate
the packet and block lengths to each other in this case increases the
degree of dependence between the codec and the network voice protocol.
By contrast, with sequential coding techniques each input sample is
encoded immediately, so that no constraints are placed by the codec on
the packet length chosen by the network voice protocol (or the access
controller). However, in order to obtain this benefit in practice, it
must be ensured that control information produced by the codec (such as
an indication of speech activity), is also produced sequentially,
rather than for a block of samples at a time.
5.3 Prioritization
5.3.1 Introduction
The traffic prioritization algorithm in an integrated voice/data
terminal should attempt to maintain acceptable speech quality by identifying
the relative importance of traffic before it is offered to the flow
control algorithm. As described in Chapter 3, there are four fundamen-
tal causes of speech quality degradation in packet switched networks,
namely signal distortion, delay, loss, and corruption, and these can be
manipulated by the prioritization algorithm in order to maximize the
overall perceptual quality.
In particular, signal distortion can be controlled by adjusting the
coding rate according to the activity of the signal, signal delay can
be controlled by suitable prioritization of competing data traffic, and
signal loss can be controlled by prioritizing the signal according to
its transmission history. These distinct aspects of the prioritization
algorithm are discussed below.
5.3.2 The relative prioritization of voice and data
It is usually easy for voice traffic from a particular workstation to
be given priority over data traffic originating in the same work
station, but it cannot be assumed in general that an access controller
is available which will give voice traffic priority over data traffic
from other workstations. The best that can be done in this case is to
provide voice traffic with "quasi-priority", meaning that flow control
is applied to data traffic at a lower network load than for voice
traffic, thereby making network capacity available (on a statistical
basis) for use by voice traffic from any workstation. While this
policy is likely to lead to a reduction in the peak data rate offered
to each user, it is unrealistic to expect that a voice service can be
added to a data LAN with no impact whatsoever on data performance.
5.3.3 Prioritization of voice according to its activity
If high-activity portions of the speech signal are given priority over
low-activity portions, then in principle a saving in the average bit
rate required by the voice traffic can be obtained with no perceptible
quality degradation. If it becomes necessary to reduce the channel
bandwidth used by the voice traffic due to network congestion, then
identification of the relative importance of different portions of the
signal will also ensure that rate reduction has the minimum impact on
speech quality.
In practice, the success of this approach will depend on the accuracy
with which the activity of the signal is estimated. The simplest
activity estimate classifies the speech signal into talkspurts and
silence, the implication being that talkspurts should be transmitted
and silence discarded. However, as a packet switched network offers a
truly variable rate transmission service, more refined levels of
activity classification can be used, with potentially better results.
5.3.4 Prioritization of voice according to its transmission history
A performance measure often used in studies of packet voice
transmission is the percentage of voice packets lost due to excessive
delay [DeTreville 84], [Gonsalves 82]. However, this measure gives no
indication of the temporal distribution of lost packets. This factor is
likely to have a significant effect on speech quality, as the loss of
two successive packets is equivalent to the loss of a single larger
packet, and it is known that the probability of losing an entire speech
phoneme increases rapidly for packet periods greater than 20 mS [Jayant
81]. It is therefore desirable that the loss of successive packets be
avoided by prioritizing the voice traffic according to the degree of
success of previous packet transmission attempts.
A generalization of this principle is possible in the case of variable
rate speech coding, as it is reasonable to suppose that a momentary
reduction of the transmitted bit rate due to channel congestion will be
less noticeable than a sustained reduction. It is thus indicated that
the prioritization algorithm should maintain a record of the degree of
transmission success achieved by recent packets in a particular call,
and prioritize succeeding packets accordingly.
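One way such history-based prioritization might be realized is sketched below. The priority scale and history window are invented for illustration; the thesis does not prescribe particular values.

```python
# Sketch of prioritizing a voice packet according to the fate of
# recent packets in the same call. Levels and window are assumed.

from collections import deque

class HistoryPrioritizer:
    """Raise a packet's priority when recent packets in the call
    were lost, so that successive losses become less likely."""

    def __init__(self, window=8):
        # True = delivered successfully, False = lost/discarded.
        self.history = deque(maxlen=window)

    def record(self, delivered):
        """Record the outcome of the most recent transmission."""
        self.history.append(delivered)

    def priority(self):
        """Base priority 0, boosted by one level per recent loss."""
        return sum(1 for ok in self.history if not ok)
```

A call that has just suffered a loss thus competes harder for the channel, while a call with a clean recent record can tolerate a momentary deferral.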
5.3.5 Summary and conclusions
The relative prioritization of voice and data traffic, and the
prioritization of voice traffic according to its transmission history, are
expected to have significant effects on speech quality in packet voice
communications. However, these are not of direct concern in the design
of the codec, in contrast to the relative prioritization of different
portions of the speech signal according to its activity.
The modular structure of the packet voice terminal in Fig. 2.2 indi
cates that the network voice protocol has access to the speech signal
in its coded form only, from which it may be difficult to derive
prioritization information. However, assigning the prioritization
function entirely to the codec is undesirable, because it increases the
dependency of the network voice protocol on the codec. For example, if
the codec simply tags "silent" portions of code [O'Leary 81], then the
network voice protocol can only use an "on/off" prioritization
strategy. A more flexible approach is for the codec to supply suitable side
information to the network voice protocol to allow it to prioritize the
speech as it chooses, and with a minimum of effort.
5.4 Flow control
5.4.1 Introduction
The objective of network flow control is to regulate the entry of
traffic into a network in order to avoid congestion [Bially Mar. 80].
Data traffic is usually flow controlled by delaying its transmission,
ie. queueing it. If the queues become too long, fundamental corrective
action must be taken, typically involving a reduction in the rate at
which the source of the traffic (such as a disk drive) produces infor
mation. In the case of voice traffic, however, flow control by means
of queueing is undesirable, as it introduces extra signal delay.
Furthermore, reducing the information rate of the source by requiring
the user to speak more slowly is clearly not acceptable, so that
alternative approaches must be used, as described below.
5.4.2 Flow control of voice traffic
Flow control has traditionally been performed in voice communications
networks by means of blocking. In the case of new calls, the user is
required to wait until bandwidth is available, and in the case of
momentary channel overload during calls in progress, excess demand is
simply "frozen out", implying loss of speech [Weinstein 79]. Both of
these manifestations of the blocking policy are unfriendly to the user,
and motivate a more refined flow control scheme.
Fortunately, the robustness of the speech perception mechanism means
that a slight reduction in signal quality does not significantly impair
its intelligibility. This means that it is possible to flow control
voice traffic by reducing the coding rate (and hence probably
increasing the signal distortion) when the network is congested. This
principle can be applied to the control of access to the network by new
users as well as to the control of rapid statistical variations in the
traffic of calls in progress [Bially Mar. 80], although it will still be
necessary to block new calls at a certain level of congestion.
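The rate-reduction principle can be sketched as a simple mapping from an estimated congestion level to a coding rate. The rates and thresholds below are assumptions for illustration; they do not correspond to the codec rates adopted later in this thesis.

```python
# Sketch of flow control by coding-rate reduction. The available
# rates and congestion thresholds are illustrative assumptions.

RATES_BPS = [32000, 24000, 16000]  # assumed coder rates, best first

def select_coding_rate(congestion):
    """Map a congestion estimate in [0, 1] to a coding rate.

    Light load uses the highest rate; heavier load trades extra
    signal distortion for reduced channel occupancy. Beyond a
    final threshold, new calls would have to be blocked instead.
    """
    if congestion < 0.5:
        return RATES_BPS[0]
    if congestion < 0.8:
        return RATES_BPS[1]
    return RATES_BPS[2]
```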
5.4.3 Network load estimation/prediction
In order for effective flow control to be implemented by the network
voice protocol, it must be provided with some quantity correlated with
the current network load, such as the delay experienced by previous
packets in gaining access to the channel. This quantity can then be
used to estimate or even predict the network load on a statistical
basis, so that appropriate flow control action can be taken before
network congestion has time to degrade the voice service [Johnson 81].
Prediction is possible in this case because it is expected that the
load in an integrated voice/data network will not vary arbitrarily with
time. Rather, because the average talkspurt duration is much greater
than the typical packet generation time, and because of the long
holding time of individual voice calls, there will be some short-term
correlation in successive measures of the level of voice traffic on the
network [Johnson 81]. Furthermore, while data traffic is inherently
less structured than voice traffic, queueing in network nodes will tend
to give data traffic Markov characteristics, thereby increasing the
predictability of the overall network load.
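A load estimate of the kind described can be maintained, for example, by exponentially smoothing the access delay of recent packets; the smoothing constant here is an assumed value, chosen only to illustrate the idea.

```python
# Sketch of network load estimation from observed access delays,
# exploiting the short-term correlation of the load. The smoothing
# constant ALPHA is an assumed value.

ALPHA = 0.1  # weight given to the newest observation

class LoadEstimator:
    """First-order (exponentially smoothed) estimate of the access
    delay, usable as a congestion indicator by the flow control."""

    def __init__(self):
        self.estimate = 0.0

    def update(self, access_delay_ms):
        """Fold a new access-delay observation into the estimate."""
        self.estimate += ALPHA * (access_delay_ms - self.estimate)
        return self.estimate
```

Because the estimate responds gradually, transient bursts do not trigger flow control action, while a sustained rise in access delay does.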
5.4.4 Summary and conclusions
Flow controlling voice traffic by means of queueing or blocking is
undesirable. A better strategy involves the reduction of the speech
coding rate in response to network congestion, requiring that a
variable rate coding technique is implemented in the codec.
5.5 Synchronization
5.5.1 Introduction
The function of the packet voice synchronization algorithm in the
receiver is to establish a suitable playout time for each received
packet. This requires not only that the variable delay incurred by
each packet in traversing the network is accounted for, but also that
missing packets, and packets which arrive out of order, are identified.
As a number of disparate packet voice synchronization schemes have been
described in the literature [Montgomery 83], an attempt is made below
to identify the essential problems in this area, and to categorize the
solutions adopted.
5.5.2 Essential issues in packet voice synchronization
5.5.2.1 Packet ordering
In general, a packet switched network may deliver voice packets to the
receiver in a different order from that in which they are transmitted.
This can occur if the route taken by successive packets through the
network varies with time (eg. when adaptive routing is used as a means
of controlling congestion). The traditional solution to this problem
is to place a sequence number in each packet prior to transmission
[Weinstein 83]. The sequence number is incremented every time a packet
is transmitted, so that the receiver can determine the correct packet
ordering by examining successive packet sequence numbers.
It is reasonable to assume in this application that mis-ordering of
packets by the network never occurs. Obviously, this assumption holds
within LANs, which have only one possible transmission route.
Furthermore, a strong trend in packet switched wide area networks for
real-time services is towards the use of packet virtual circuits, which also
maintain packet ordering [Burgin 87]. Nevertheless, packet sequence
numbers are still of value, as they can be used to detect missing
packets, and to distinguish among the various types of missing packet
described in Section 3.4.
5.5.2.2 Identification of the type of a missing packet
It is useful if the synchronization algorithm is able to detect the
cause of a potential gap in the played out speech, ie. whether it is
due to a silent or lost packet, as this information may be used by the
packet fill-in algorithm to generate an appropriate substitute packet,
as is described in Section 5.6. The synchronization algorithm should
also be able to discriminate between "on-time" and late packets, as
late packets must usually be discarded.
If silence is eliminated from the transmitted packet stream, then
packet sequence numbers are required if silent packets are to be
detected unambiguously at the receiver [Weinstein 83]. In this case, the
packet sequence number is only incremented when a packet is actually
transmitted, ie. it is unchanged during silence intervals. Therefore,
a gap at the receiver between two packets with successive sequence
numbers must be due to one or more silent packets. Conversely, a
missing sequence number indicates a lost rather than a silent packet.
However, it is important to note that discrimination between lost and
silent packets can only be performed in retrospect, ie. once the packet
after a missing one has arrived. This may imply that extra delay must
be introduced at the receiver, which is undesirable. If silence is not
eliminated from the transmitted packet stream, then the above process
of discrimination is unnecessary.
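The sequence-number rule described above can be expressed directly in code. The function and classification names are illustrative, not part of any protocol defined in this thesis.

```python
# Sketch of classifying a gap in the received packet stream using
# sequence numbers, where the transmitter increments the number
# only for packets actually sent (silence is not transmitted).

def classify_gap(last_seq, new_seq, expected_packets):
    """Classify the interval between two received packets.

    last_seq / new_seq: sequence numbers of the packets bounding
    the gap; expected_packets: how many packet periods elapsed
    between their arrivals (derived from the playout clock).
    """
    if new_seq == last_seq + 1 and expected_packets > 1:
        # Consecutive numbers despite a temporal gap: the
        # transmitter sent nothing, so the gap is silence.
        return "silent"
    if new_seq > last_seq + 1:
        # Missing sequence numbers: packets were lost in transit.
        return "lost"
    return "contiguous"
```

Note that, as the text observes, the classification is only possible once the packet after the gap has arrived, since it is that packet's sequence number which resolves the ambiguity.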
Discrimination between on-time and late packets is also required, as
failure to do so will result in late packets being played out as if
they were on time, and the resulting temporal distortion of the signal
is likely to be objectionable [Forgie 76]. If explicit timing
information is included in the packet header, then this may be used to
detect late packets. In the absence of such information, packet
sequence numbers must be used for this purpose, as late packets cannot
be detected unambiguously by observation of the packet arrival times alone,
due to the possibility that packets may be blocked or corrupted.
5.5.2.3 Correction of variable packet delay
The total network delay Dn suffered by a packet consists of a fixed
component Df (due to transmission and propagation delays) and a
variable component Dv (due to queueing delays). Variations in the value of
Dv associated with successive packets result in temporal "jitter" in
the packet stream. If not accounted for, this jitter leads to the loss
of portions of the reconstructed signal, and the introduction of
artificial gaps. Both of these phenomena degrade the perceptual quality of
the signal, and should therefore be prevented from occurring as far as
possible. This requires that packets are buffered in the receiver prior
to being decoded, in order to smooth out jitter in the packet stream.
Fig. 5.1 shows a generic Probability Density Function (PDF) for the
network delay in packet switched networks [Aoki 86]. The tail on the
left of the distribution is truncated because packets must incur at
least some fixed delay Df. However, the tail on the right may in
general have an asymptotic form, due to the possible presence of a
number of queues of varying length in the call path. It is undesirable
to wait for the arrival of late packets falling into this region of the
distribution, as this would result in excessive overall delay. Such
packets are therefore considered lost as far as the playout of speech
is concerned.
It is well known from listening tests that a lost packet rate of 1 in
100 or less is acceptable [Forgie 76], [Jayant 81]. This means that a
threshold of acceptability, Dv,th, can be defined in Fig. 5.1 such that
the shaded area represents a cumulative probability of 0.01.
A fundamental goal of the synchronization algorithm is to maintain a
continuous playout of speech. This means that if a packet incurs an
unusually small delay (ie. less than the median in Fig. 5.1), it should
not be played out immediately, as it would then be likely that a gap
would be introduced into the decoded speech at a later stage. Rather,
the packet should be delayed by an additional amount Ds (the
synchronization delay), where:

Ds = Dv,th - Dv                                              (5.1)
which gives a probability of 0.99 that subsequent packets will arrive
in time to allow a continuous playout.
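The playout rule of equation (5.1) amounts to a one-line computation. The following sketch (in Python; the names and numerical values are purely illustrative, not part of any scheme described here) shows how a receiver would schedule an early packet:

```python
def synchronization_delay(measured_dv, dv_threshold):
    # Equation (5.1): Ds = Dv,th - Dv. A packet with a small variable
    # delay is held longer, one near the threshold hardly at all, so
    # playout remains continuous with probability 0.99.
    return dv_threshold - measured_dv

# A packet that incurred Dv = 5 ms against a threshold Dv,th = 30 ms
# is buffered for a further 25 ms before decoding.
ds = synchronization_delay(measured_dv=5.0, dv_threshold=30.0)
```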
In practice, there are two problems associated with the implementation
of the above strategy. The first is that the delay PDF is in general
not known when a call is established. This is because unknown networks
(with unknown delay characteristics) may be included in the call path,
and these characteristics may also change with time as the network
loads change. The result is that Dv,th, ie. the delay threshold which
will include 99% of all packets, is not known in equation (5.1). The
second problem is that it is often difficult for the receiver to determine the delay associated with a particular packet, implying that Dv in
(5.1) is also not known. Practical synchronization schemes therefore
usually involve imperfect estimates of Dv,th or Dv (or both).
5.5.2.4 Clock frequency matching
The fundamental timing references in a packet voice communications
system are the encoder and decoder sampling clocks, since these determine the rate at which packetized speech code is offered to and demanded from the network. All operations of the synchronization algorithm
should therefore be referenced to the codec sampling clock, and this
would ideally be derived from a network-wide master source. For example, in a LAN with a bit-synchronous channel, the sampling clocks can
be derived from the channel clock [Anido 87]. However, this is often
not possible in more general networks, and the installation of special
lines for the distribution of the clock signal is usually impractical
[Montgomery 83].
It is therefore more usual to assume independent crystal-controlled
sampling clocks in the packet voice terminals, which are typically
accurate to 1 part in 10^5. A problem in this case is that any difference in frequency between the clocks in the transmitter and receiver
will manifest as an increasing relative phase drift (up to 72 mS per
hour in the above example). Whether this represents a problem in
practice depends on the way in which the synchronization algorithm
responds to such changes, as is discussed in Section 5.5.3.
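The 72 mS figure follows from simple arithmetic: two free-running clocks, each accurate to 1 part in 10^5, can drift apart at twice that rate in the worst case. A minimal check (Python, purely illustrative):

```python
def worst_case_drift_ms(accuracy, seconds):
    # Two independent clocks, each off by up to +/- accuracy in opposite
    # directions, diverge at 2 x accuracy; result in milliseconds.
    return 2 * accuracy * seconds * 1000.0

# 1 part in 10^5 over one hour gives the 72 ms quoted in the text.
drift = worst_case_drift_ms(accuracy=1e-5, seconds=3600)
```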
5.5.2.5 Temporal distortion of silence intervals
If silence is eliminated from the stream of transmitted packets, then
it is in general necessary to re-synchronize the packets in each talkspurt. This is of no consequence to synchronization schemes which know
exactly how much delay each packet has incurred (these schemes are
discussed in Section 5.5.3.2). However, when this information is
approximate (Section 5.5.3.3), or not available at all (Section
5.5.3.4), then each packet is typically synchronized relative to the
preceding one, which clearly presents a problem in the case of the
first packet in a talkspurt. The result is that the silence intervals
between talkspurts tend to become distorted in length [Suda 84].
The perceptual importance of accurately reconstructing silence intervals is a subject of ongoing research. One study claims that modification of the lengths of silence intervals by up to 50% is acceptable
[Webber 77]. More detailed studies note that distortion of the relatively long silence intervals between sentences and phrases is less
noticeable than for intervals between words and syllables [Gruber 81],
[Gruber 85]. This is one reason for using a relatively long hangover
period (about 200 mS) in speech activity detectors, to ensure that
short silence intervals are not eliminated.
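The hangover mechanism mentioned above can be sketched as a simple countdown applied to the output of a raw activity detector (illustrative Python; the frame length, and hence the frame count corresponding to 200 mS, is an assumption):

```python
def apply_hangover(frame_active, hangover_frames):
    # frame_active: booleans from a raw energy detector, one per frame.
    # With 20 ms frames, hangover_frames = 10 corresponds to the 200 ms
    # hangover period mentioned in the text.
    out, countdown = [], 0
    for active in frame_active:
        if active:
            countdown = hangover_frames
            out.append(True)
        elif countdown > 0:
            countdown -= 1
            out.append(True)   # hangover: a short pause is kept as speech
        else:
            out.append(False)
    return out

# A two-frame pause inside a talkspurt is bridged; the long tail is not.
smoothed = apply_hangover([True, False, False, True, False, False, False], 2)
```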
5.5.2.6 Adjustment of the playout rate
Most packet synchronization schemes assume that the decoder empties the
receive buffer at a constant rate, or at least at a rate which is
independent of the buffer occupancy, implying that the synchronization
algorithm must control buffer overflow and underflow conditions by
discarding or inserting packets of code as necessary. However, it is
also possible to adjust the rate at which the decoder operates according to the buffer occupancy [Gold 77], in a manner analogous to the
control of the encoding rate according to the occupancy of the transmit
buffer [Dubnowski 79]. In general, this strategy cannot be implemented
by simply altering the sampling rate of the decoder, as this will
change the pitch of the decoded speech. Instead, vocoder techniques
must be used, which allow the duration of a speech sound to be increased or decreased, without altering its pitch [Gold 77].
5.5.3 A taxonomy of packet voice synchronization schemes
5.5.3.1 Introduction
As mentioned in Section 5.5.2.3, an issue common to most packet voice
synchronization schemes is that Dv,th in equation (5.1) is not known in
advance for a general call path. However, there are significant differences among synchronization schemes as regards their knowledge of Dv
in (5.1), and accordingly they are classified below on this basis.
5.5.3.2 Synchronization schemes with exact knowledge of Dv
There are essentially two ways in which the receiver can obtain exact
knowledge of the network delay experienced by a particular packet. In
the first, the transmitter and receiver maintain clocks which are
phase-synchronized, or equivalently, both have access to a network-wide
master clock. The transmitter places a time stamp into each packet
immediately prior to transmission, and the receiver can then compare
this stamp with its own clock when the packet is received, thereby
obtaining an exact delay measurement (in the absence of channel bit
errors).
This is known as the "absolute timing" technique [Montgomery 83]. Its
main disadvantage is that the requirement for perfectly synchronized
transmitter and receiver clocks or a network-wide master clock is often
difficult to satisfy in practice, particularly when the network covers
a wide area.
The second technique measures the variable delay suffered by a packet
at the point where it actually occurs (for example in a queue in a
network node), and then places this "delay stamp" in the packet. If
all network nodes implement this strategy by suitably incrementing the
delay stamps of packets in transit, then by the time a particular
packet reaches the receiver, it will contain an accurate representation
of the total delay it has suffered in traversing the network. This is
known as the "added variable delay" technique [Montgomery 83], and
although it avoids the need for clock synchronization, it has the major
disadvantage of placing special requirements on the network nodes,
namely that they must be able to process the delay stamp.
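The two stamping techniques can be contrasted in a few lines (an illustrative Python sketch; the function and field names are assumptions, not part of the cited schemes):

```python
# Absolute timing: transmitter and receiver share a phase-synchronized
# timebase, so the receiver recovers the delay by simple subtraction.
def delay_from_time_stamp(time_stamp, receive_time):
    return receive_time - time_stamp

# Added variable delay: each node along the path adds its own queueing
# delay to a stamp carried in the packet; no common clock is needed,
# but every node must understand and update the stamp.
def forward_through_node(packet, queueing_delay):
    packet["delay_stamp"] += queueing_delay
    return packet
```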
5.5.3.3 Synchronization schemes with approximate knowledge of Dv
A less restrictive assumption than that of the absolute timing scheme
is that the transmitter and receiver have separate clocks which are
matched in frequency, but not necessarily in phase. Then if packet
time stamps are used, it is possible for the receiver to gradually
correct any phase offset between its clock and that of the transmitter,
by continually revising its timebase whenever it observes a value of Dv
which is smaller than those for preceding packets [Barberis 80].
This type of scheme can also be used to track slow changes in the
packet arrival statistics [Forgie 76]. These might be due to real
changes in the packet delay statistics caused by changing network load,
or due to a slight mismatch between the frequencies of the transmitter
and receiver clocks, which will cause apparent changes in the delay
statistics.
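The timebase revision of [Barberis 80] reduces, in essence, to tracking the minimum apparent delay, as in this illustrative sketch (a loose interpretation, not the cited algorithm itself):

```python
def revise_timebase(offset_estimate, apparent_delays):
    # With frequency-matched but phase-offset clocks, every time-stamp
    # comparison contains the unknown phase offset. The smallest reading
    # seen so far is the best available bound on that offset, so the
    # receiver revises its timebase whenever a smaller value appears.
    for dv in apparent_delays:
        offset_estimate = min(offset_estimate, dv)
    return offset_estimate

estimate = revise_timebase(float("inf"), [40.0, 35.0, 50.0, 30.0])
```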
5.5.3.4 Synchronization schemes with no knowledge of Dv
Synchronization schemes in this category do not make use of time or
delay stamps, and therefore have no knowledge of the actual network
delay incurred by individual packets. Instead, they assume a worst-case value of Ds for the first packet in a talkspurt (ie. Dv = 0 in
equation (5.1)). If Dv,th is known exactly, and remains fixed, then
this assumption ensures that subsequent packets in the talkspurt arrive
in time to be played out with a probability of 0.99.
However, if an inappropriate value of Dv,th is assumed, or Dv,th varies
with time, then this synchronization scheme functions sub-optimally.
This means either that unnecessary extra delay is introduced into the
speech path, or that the packet loss rate becomes unacceptably high.
Such "blind delay" [Montgomery 83] or "null timing information" [Barberis 80] schemes are therefore best suited to applications in which a
good estimate of Dv,th is available from a priori knowledge about the
network, for example within the controlled environment of a LAN.
The limitations of the above schemes can be substantially removed if
the possibility of revising the initial value of Ds (ie. the initial
estimate of Dv,th) is allowed. For example, if Ds is increased whenever a late packet is encountered, and decreased after a "faultless"
period during which no late packets occur, then the average rate at
which late packets occur can be set. If the "faultless" period is set
to correspond to 100 packets, then at most 1% of packets will be
discarded due to being late [Forgie 76]. (In order to obtain an initial value for Ds at the beginning of a call without affecting speech
quality, a few "dummy" packets can be transmitted as part of the call
set-up procedure.)
This synchronization scheme is appealing because it requires no explicit knowledge of Dv or Dv,th, and may therefore be used for a call
spanning unknown networks. Furthermore, it is able to adapt automatically to changes in the packet arrival statistics caused by changing
network load or drift between the transmitter and receiver clocks.
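The adaptive blind-delay scheme described above can be sketched as follows (illustrative Python; the step size and initial value of Ds are assumptions, not values from the cited work):

```python
class AdaptiveSync:
    """Revise Ds upward on each late packet and downward after a
    faultless run, bounding the late-packet rate near 1/faultless_period
    (1% for a period of 100 packets, as in [Forgie 76])."""

    def __init__(self, initial_ds, step, faultless_period=100):
        self.ds = initial_ds
        self.step = step
        self.period = faultless_period
        self.ok_run = 0

    def packet_arrived(self, late):
        if late:
            self.ds += self.step                         # stretch the buffer
            self.ok_run = 0
        else:
            self.ok_run += 1
            if self.ok_run >= self.period:
                self.ds = max(0.0, self.ds - self.step)  # shrink cautiously
                self.ok_run = 0
        return self.ds
```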
5.5.4 Summary and conclusions
It is apparent that the elimination of silence from the transmitted
packet stream makes the task of the synchronization algorithm significantly more complex. In particular, the possible distortion of the
lengths of silence intervals becomes an issue, and the unambiguous
identification by the receiver of gaps in the packet stream caused by
the elimination of silence requires the introduction of extra delay.
If a single packet route is used for the duration of a call, and
silence intervals are transmitted, then packet sequence numbers are not
required for the correction of packet mis-ordering or the detection of
silent packets, although they are still needed if late packets are to
be detected unambiguously.
Due to the restrictive requirements they place on the network, synchronization schemes using time or delay stamps to obtain exact knowledge
of Dv for each packet are not considered appropriate in this application. Furthermore, schemes which require an a priori assumption to be
made about Dv,th are too restrictive for synchronization over general
call paths. It is therefore recommended that a synchronization technique which adapts to implicit changes in the packet delay distribution, as indicated by the packet arrival statistics, is used.
Since the encoder and decoder will operate from separate clocks, all
operations of the synchronization algorithm should be referenced to the
decoder clock. If the latter is not explicitly synchronized with the
main processor clock in the workstation (for implementation reasons),
then some means of scheduling the synchronization algorithm software is
needed, for example an interrupt signal from the codec.
While the ability to vary the decoding rate without changing the pitch
of the speech can be useful in avoiding receive buffer overflow or
underflow, the restrictions this places on the nature of the coding
algorithm (ie. that it is vocoder-like) are undesirable if flexibility
of the terminal is to be maintained. This implies that receive buffer
management should rather be based on packet discarding and insertion.
5.6 Fill-in
5.6.1 Introduction
The task of the packet voice fill-in algorithm in the receiver is to
generate substitutes for missing packets. Such packets may have been
discarded by the transmitter or the network, or may arrive too late to
be useable. In any event, the gap in the packet stream is detected by
the packet synchronization algorithm. Ideally, the fact that a fill-in
packet has been used should be imperceptible, but the degree to which
this is true in practice depends on the length of the fill-in packet
and how well it matches adjacent packets.
Various packet fill-in schemes are discussed below in two categories:
those which require minimal processing (at most the copying of a block
of code from one area of memory to another), and those which involve a
considerable amount of computation.
5.6.2 Simple packet fill-in schemes
The simplest fill-in schemes substitute silence or low-level random
noise for missing packets [Seneviratne 87]. Whether such a substitute
matches adjacent packets depends on the type of the missing packet. For
example, if the packet is missing because it was silent, and therefore
not transmitted, then replacing it by silence is a good strategy.
However, if the packet was lost from the middle of a talkspurt, then
replacing it with silence is inappropriate [Forgie 76]. As described
in Section 5.5, sequence numbers can be used to discriminate between
silent and lost packets, at the possible expense of extra delay.
Attempts to improve the matching between fill-in packets and adjacent
ones typically use some form of extrapolation or interpolation. The
simplest scheme in this category uses the previously received packet as
a fill-in packet (ie. packet extrapolation). While this approach works
well for isolated missing packets, and provides a significant improvement over silence substitution [Wasem 88], a problem arises when a
number of consecutive packets are missing (such as when silence is
eliminated), as in this case packet repetition introduces an artificial
periodicity into the signal, which is audible as a buzzing sound. A
variant which avoids this phenomenon replaces the first in a sequence
of missing packets by the previous packet, and subsequent missing
packets by silence [Anido 87].
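A minimal repeat-then-silence fill-in, after the idea in [Anido 87], might look like this (illustrative Python; the contents of a silent packet depend on the actual code format):

```python
SILENCE = b"\x00" * 160   # placeholder: one packet of idle code

def fill_in(previous_packet, gap_length):
    # Replace the first packet of a gap with a copy of the last good
    # packet, and any further missing packets with silence, avoiding
    # the buzz that sustained repetition would introduce.
    if gap_length == 0:
        return []
    return [previous_packet] + [SILENCE] * (gap_length - 1)
```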
5.6.3 Advanced packet fill-in schemes
A problem in the design of more effective fill-in schemes is that a
missing packet represents the loss of a large number of consecutive
speech samples (typically 80 - 160 samples for packet periods of 10 -
20 mS), implying that relatively sophisticated extrapolation or interpolation algorithms must be used.
One example is the "pattern matching" technique, where an attempt is
made to find an earlier occurrence in the speech waveform of the
pattern leading up to the missing packet [Goodman 86]. If such an
occurrence is found, then a portion of the subsequent waveform is used
to fill in for the missing packet. A disadvantage of this technique is
that it operates on the speech in decoded form, thereby violating the
modular structure of the packet voice terminal described in Chapter 2.
Furthermore, a considerable amount of real-time signal processing is
involved.
Other sophisticated approaches make use of vocoding techniques to
generate the fill-in packet. If a model of the speech production
process is available (or specially created) at the receiver, then its
state just before a missing packet provides a good clue as to the
likely behaviour of the signal during the missing packet. For example,
it has been found in systems using LPC vocoding that the synthesis of
voiceless speech is an effective packet fill-in technique [Weinstein
83]. Another scheme estimates the local pitch of the speech signal,
and generates a fill-in packet by repeating a recent waveform segment
equal in duration to the pitch period [Goodman 86]. However, like the
pattern-matching technique, these vocoder-oriented fill-in strategies
violate the modular structure of the packet voice terminal and involve
considerable computation.
The packet extrapolation problem considered above is greatly simplified
if "even" samples in the code stream are placed in one packet and "odd"
samples in another [Jayant 81]. Then if the packet containing the even
samples is missing at the receiver, these samples can be interpolated
from the odd samples in the adjacent packet (assuming it is not also
missing), and vice versa. If neither of the packets is missing, the
samples are simply put back into their original order and played out.
The advantage of this scheme is that it replaces packet extrapolation
by sample interpolation, so that a simple first-order interpolation
algorithm gives good results with missing packet rates as high as 5%
[Jayant 81]. However, it still operates on the speech signal in decoded form, and the separation of odd and even samples doubles the
packetization delay for a particular packet length.
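The odd/even scheme of [Jayant 81] can be illustrated as follows (Python sketch; a simplified first-order interpolator is used, and the names are assumptions):

```python
def split_odd_even(samples):
    # Transmitter: even-indexed samples into one packet, odd into another.
    return samples[0::2], samples[1::2]

def reconstruct(evens, odds, evens_lost=False):
    # Receiver: re-interleave, or interpolate a lost half from its
    # neighbours. Here each missing even sample becomes the mean of the
    # adjacent odd samples; the first even sample has only one neighbour.
    if evens_lost:
        evens = []
        for i in range(len(odds)):
            left = odds[i - 1] if i > 0 else odds[0]
            evens.append((left + odds[i]) / 2)
    out = []
    for e, o in zip(evens, odds):
        out.extend([e, o])
    return out
```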
5.6.4 Summary and conclusions
In order to retain the modular structure of the packet voice terminal,
packet fill-in strategies which operate on the decoded speech signal
should not be used. Furthermore, real-time constraints motivate the
use of simple fill-in packets containing silence, random noise, the
contents of previous packets, or some combination of these. This
allows substitution for missing packets to be performed with minimal
processing.
The effectiveness of such simple fill-in schemes will depend on a
number of other aspects of the network voice protocol; namely the
packet length, the voice prioritization algorithm, and the flow control
algorithm, as these will determine the length and frequency of gaps in
the received packet stream. Optimization of the fill-in algorithm will
thus need to be performed once these components of the packet voice
terminal have been finalized.
CHAPTER 6 : CODEC REQUIREMENTS
6.1 Introduction
In this chapter the codec requirements are discussed, with reference to
the assumptions made about the network (Chapter 2) and the access
controller (Chapter 4), the functions performed by the network voice
protocol (Chapter 5), and the factors which influence the overall
speech quality in packet voice networks (Chapter 3). Additional issues
such as the cost of implementation and the ability to handle non-voice
signals are also considered.
6.2 Input signal characteristics
It is assumed that the codec input signal is captured by a standard
telephone handset and has a bandwidth of 300 - 3400 Hz. As digitization of the signal will take place immediately, only speech and acoustic background noise will be routed through the codec. This means that
the codec can be optimized specifically for speech signals, thereby
improving its performance for a given design complexity [Mermelstein
88].
By contrast, speech codecs designed to enable the progressive
conversion to digital technology of portions of the current public
telephone network need to be able to successfully encode and decode
modulated voiceband data, as well as speech. This requirement leads
either to codecs which do not exploit much of the redundancy in speech
signals, such as PCM codecs, or to designs which are very complex, and
therefore difficult to implement, such as the CCITT G.721 ADPCM algorithm [Jayant 84].
6.3 Signal distortion
No fixed distortion target is set for the codec, due to the statistical
nature of the transport mechanism in packet switched networks. Nevertheless, it should be possible under favourable network conditions to
obtain distortion levels similar to those of 64 kbps PCM (ie. toll
quality), although as noted in Section 3.2, reduction of the number of
tandem coding operations in the call path reduces the SNR required for
the initial encoding operation. Under unfavourable conditions, such as
when the channel is heavily congested, the codec should still provide
intelligible speech (ie. communications quality), in line with current
specifications for error robustness [Mermelstein 88].
6.4 Signal delay
In order to maintain acceptable overall signal delay for a call path
which includes a wide area network, the delay introduced by the encoding and decoding processes should be no more than about 4 mS. However,
if a coding technique is used in which a block of code is collected
before being analysed, then the packetization and encoding delays will
overlap, so that this requirement can be relaxed somewhat.
6.5 Bandwidth efficiency
The codec should make efficient use of the transmission bandwidth,
while maintaining acceptable speech quality and allowing an economical
implementation. Bandwidth compression of speech can in general be
achieved through the exploitation of the variable activity of the
speech signal [Dubnowski 79] and/or through the use of redundancy
reduction coding techniques [Flanagan 79]. However, due to the relatively large bandwidth available to each user on a LAN, bandwidth
compression is not as important as it would be, for example, in a
packet radio system. The use of sophisticated compression techniques
is thus not indicated in this application, particularly in the light of
the observation that the compression of speech by redundancy reduction
methods tends to follow the "law of diminishing returns" [Jayant 84].
6.6 Variable rate coding
In order to allow the use of a flow control scheme for voice traffic
which varies the voice bit rate in response to the network congestion,
a variable rate coding scheme is required in the codec. A reasonably
wide range of rates should be covered, and the codec should respond
quickly to rate control commands. Furthermore, variable rate coding
can be used to achieve a reduction in the average transmission bandwidth required by the voice traffic, if the coding rate is varied in
accordance with the activity of the speech signal. This requires that
suitable prioritization information is made available to the network
voice protocol by the codec.
6.7 Robustness to bit errors
As protocol-associated error detection and correction are in general
inappropriate for voice traffic in packet switched networks, the codec
needs to be robust to bit errors in the channel. In particular, if an
adaptive coding algorithm is used, it is necessary to ensure that
tracking is re-established between the encoder and decoder after the
occurrence of a bit error. In the light of current specifications for
speech codecs to be used over wide area networks [Mermelstein 88], it
is required that intelligibility of the decoded speech be maintained
for bit error rates up to 1 in 10^.
6.8 Robustness to packet loss
In order to reduce the audibility of missing packets at the receiver,
the codec should be amenable to the substitution of fill-in code. For
example, it should be possible to generate silence or white noise using
synthesized code, and to repeat the code from a previously received
packet without producing an anomalous decoder output. Furthermore,
recovery of tracking between the encoder and decoder subsequent to a
missing packet should be as rapid as possible. This is an important
consideration if an adaptive coding technique is used, as a mechanism
for gradually dissipating the effects of individual bit errors is
unlikely to be sufficient to deal with the signal discontinuities
caused by packet loss. Robustness to a packet loss rate of 1 in 100 is
a reasonable target.
6.9 Tandem coding
If the speech signal has different representations in different portions of the call path, the effect of tandem coding operations on
speech quality must be considered. For example, if the network uses a
mixture of transmission technologies, the signal may be converted
between analog and digital forms a number of times as it traverses the
network. Such conversions are termed "asynchronous" [Jayant 84], and
involve an inevitable degradation in SNR.
In an all-digital connection, no asynchronous conversions are required,
although it may be necessary to convert the code from one digital
format to another at one or more points in the call path. Such "synchronous" conversions, or "transcoding" operations, do not necessarily
result in reduced SNR, although this is dependent on the code formats
involved. For example, synchronous conversion between log PCM and
linear PCM codes results in no cumulative SNR degradation, regardless
of the number of tandem stages involved [Jayant 84].
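The distinction can be illustrated with a toy uniform quantizer standing in for the PCM laws (an illustrative Python sketch, not the actual log PCM rule): the first encoding is lossy, but because decoding maps each code word to a level that re-encodes to the same word, repeated synchronous conversions add nothing further.

```python
STEP = 8   # quantizer step size; an arbitrary illustrative value

def encode(x):            # linear sample -> code word (lossy)
    return round(x / STEP)

def decode(code):         # code word -> reconstructed linear level
    return code * STEP

c1 = encode(100)          # first encoding incurs quantization error
c2 = encode(decode(c1))   # transcode once: same code word recovered
c3 = encode(decode(c2))   # transcode again: still the same
```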
In the application considered in this thesis, transcoding is unlikely
to take place within the LAN itself. However, it may be necessary to
communicate with a remote terminal equipped with a PCM coding facility
only. Alternatively, it may be necessary to use a WAN channel intended
for 64 kbps PCM code to link the local and remote LANs. In either of
these cases, transcoding to 64 kbps PCM code will be necessary, and
this is likely to be performed in the LAN gateways. It is thus important to ensure that this transcoding operation can be performed without
excessive signal quality degradation.
6.10 Voice conferencing
Voice conferencing among a number of users of a network is typically
implemented in one of two ways. In the first, the speech signal from
each user is transmitted to a conference bridge, where it is combined
with those of the other users involved. Since almost all speech coding
algorithms do not permit direct superposition of code words, each
signal must first be converted to linear form [Mermelstein 88]. The
composite signal is then re-encoded and broadcasted to all the users,
so that they can all hear each other simultaneously. An important consideration in this case is the amount of processing, delay, and signal
degradation involved in converting the speech code to linear form and
then re-encoding it.
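The summing bridge's core operation, once each signal has been converted to linear form, is a saturating per-sample addition, as in this illustrative sketch (the decode and re-encode steps for the actual speech code are assumed to exist elsewhere):

```python
def mix(linear_signals, full_scale=32767):
    # Sum the parties' linear samples and clip to the codec's full
    # scale, as a conference bridge would before re-encoding the
    # composite signal for broadcast.
    mixed = []
    for samples in zip(*linear_signals):
        s = sum(samples)
        mixed.append(max(-full_scale, min(full_scale, s)))
    return mixed
```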
The second scheme makes use of an arbitration mechanism in the conference bridge which "gives the floor" to one user at a time. This
becomes more bandwidth-efficient than the first scheme as the number of
users involved in the conference increases [Weinstein 83]. In order to
implement this one-at-a-time conferencing scheme, an indication of the
activity of each speaker is required, so that it is useful if the codec
provides this information to the network voice protocol in each workstation.
A third possibility is to combine the above two schemes, so that the
signals from all active speakers are combined in the conference bridge
and then broadcasted. This is likely to have the advantage of preventing the "fastest talker" from dominating a discussion, while at the
same time conserving transmission bandwidth. For this scheme to be
implemented, the requirements of both the other two must be satisfied
by the codec.
6.11 Voice messaging
Since the packet voice terminal is part of a workstation which is
likely to include local storage facilities such as disk drives, it is
relatively straightforward for digitized speech to be stored for messaging purposes. The main problem in this regard is the relatively
large storage requirements of even short spoken messages. For example,
a single 20-second message requires about 160 Kbytes of storage if
coded using 64 kbps PCM. From this point of view, a coding scheme more
efficient than PCM is desirable.
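The 160 Kbyte figure follows directly from rate times duration; as a one-line check (Python, names illustrative):

```python
def storage_kbytes(bit_rate_kbps, seconds):
    # bits = rate x duration; divide by 8 for bytes, by 1000 for Kbytes.
    return bit_rate_kbps * 1000 * seconds / 8 / 1000

kb = storage_kbytes(bit_rate_kbps=64, seconds=20)   # the message above
```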
Furthermore, it is likely that a lower-quality representation of the
signal than that associated with interactive communications would be
accepted for voice messages, if the benefit was that a greater number
of messages could be stored. The elimination of silence from the
stored signal, which would further reduce the storage requirements,
would probably be tolerated for the same reason [Gan 88].
6.12 PCM compatibility
PCM encoding of speech at 64 kbps with a signal band of 300-3400 Hz and
a sampling rate of 8 kHz is a well established standard for telephony
purposes [Jayant 84], and it is desirable that the codec maintain
compatibility with this standard as far as possible. While direct code
compatibility is not expected, the coding rates used should be multiples of 8 kbps, and it should be possible to format one or more code
words into 8-bit bytes. The benefits of such compatibility include the
following. Firstly, devices intended for PCM systems such as integrated anti-aliasing filters can be used. Secondly, performance comparisons with published results are made easier. Thirdly, transcoding
the output of the codec to PCM form, and the handling of PCM code by
the codec, are facilitated.
6.13 Non-speech code
While a codec in a fully digital network is not required to actually
encode non-voice signals, it should nevertheless allow dialling, ringing and busy tones to be conveyed to the user through the earphone in a
straightforward manner. These tones will typically be generated by the
network voice protocol and be sent to the codec at the appropriate time
for decoding, in the same manner as coded speech. This means that the
codec should not employ a code which makes tone synthesis complex.
Alternatively, a back-up mode of operation should be possible, in which
the decoder operates with a simple code such as PCM. Other non-speech
code which the decoder needs to handle is that used to fill in for
missing packets at the receiver, as discussed in Section 6.8.
6.14 Control information
Apart from speech code, fill-in code and signalling tones, it is necessary for control information to be exchanged between the network voice
protocol and the codec. Some of this information may be "code-asynchronous", such as a command to disable the codec or a flag reporting a codec error condition. Code-asynchronous information may be
exchanged relatively simply, by means of a special-purpose register or
memory location. By contrast, "code-synchronous" control information
(such as an indication of speech activity) is only useful when it is
associated with the corresponding code, and must therefore be embedded
in the actual code stream flowing between the codec and network voice
protocol. This requires the specification of a suitable format for the
composite code stream.
6.15 Packetization
Strictly, the process of formatting the coded speech into packets of
suitable length and appending packet headers is a function of the
network voice protocol. However, the codec can assist considerably in
this process by pre-formatting the code and transferring it directly to
the appropriate packet locations in the workstation's memory. At the
receiver, the reverse process is implemented. This leaves the workstation's processor free to perform more complex tasks such as network
flow control.
6.16 Implementation
It is intended that the codec will initially be used on an expansion
card in an IBM PC/XT/AT or compatible workstation. The total cost of
the card should thus be a reasonably small fraction of the cost of a
PC, a requirement which places a limit on the complexity of the speech
coding technique which can be used. The workstation will be connected
to a LAN by means of a suitable access controller, and the high-level
communications protocols will be implemented in software running on the
workstation's main processor.
As these protocols represent a substantial processing burden, it is
desirable that the processing overhead associated with the addition of
the codec card be kept to a minimum. This indicates that techniques
such as direct memory access, interrupt signalling and pointer passing
should be used in preference to program-controlled I/O, device polling
and location-to-location code copying.
CHAPTER 7 : DESIGN AND DEVELOPMENT OF THE CODEC
7.1 Introduction
In this chapter, a description is given of the design and development
of the speech coding algorithm and its associated operations, with
reference to the codec requirements in Chapter 6. This development was
performed by means of simulation on a general-purpose computer, and a
brief description of the associated hardware and software is given in
Appendix D. The implementation of the codec in special-purpose digital
hardware, and issues specifically associated with this implementation
(such as the effect of limited precision arithmetic), are considered in
Chapter 8.
7.2 Variable rate coding
7.2.1 Introduction
"Variable rate coding" is used in this thesis as a generic term to
refer to any coding scheme involving a time-varying bit rate. Variable
rate coding of speech does not necessarily imply variable quality, as
the speech signal itself contains a time-varying information rate.
However, assuming that the coder is able to track the information rate
of the signal exactly, variations in speech quality can be introduced
by the constraints of the channel. A fundamental assumption in this
thesis is that such variations do not have disastrous consequences for
voice communications, due to the robustness of the speech perception
mechanism.
Variable rate coding can be used as a means of flow controlling voice
traffic, as well as of reducing its average transmission bandwidth
requirements. As this principle has already been applied to Digital
Circuit Multiplication (DCM) systems in the current telephone network,
it is instructive to compare these systems with packet switched
networks, and this is done in Sections 7.2.2 and 7.2.3 below. In the
case of packet switched networks, it is particularly important to
distinguish between variable rate coding schemes based on "multirate" coding
[Bially Mar. 80] and those based on "embedded" coding [Bially Mar. 80],
as is explained in Sections 7.2.4 and 7.2.5 below.
7.2.2 Variable rate coding in DCM systems
7.2.2.1 Techniques
Digital Circuit Multiplication (DCM) systems make use of Time
Assignment Speech Interpolation (TASI) and low bit rate speech coding
techniques to increase (with respect to 64 kbps PCM) the number of
simultaneous voice calls that can be transmitted over a digital link of
given capacity [Decina 88]. The incentive for using DCM equipment is
to make the most efficient use of an expensive transmission facility,
such as a submarine cable or satellite link.
There are two fundamental assumptions about the nature of the input
voice channels which are employed by DCM systems in order to enable
concentration. The first is that the activity of each channel varies
with time, implying that the bandwidth assigned to it may be varied
accordingly. The second assumption is that the activity of each
channel is statistically independent of that of any other channel, implying
that if enough channels are concentrated in a single DCM system, then
any bandwidth in the output channel not used by a particular input
channel can be used by another input channel. The total capacity of
the input channels and the capacity of the output channel are related
by a "concentration factor", which is greater than unity in a useful
DCM system.
Early circuit multiplication systems assigned output channel bandwidth
to each input channel on the basis of whether the signal represented a
talkspurt or a silence interval. Using this technique a concentration
factor (or "TASI gain") of about 2 is obtained with 40 or more input
channels [Bially Sept. 80]. However, a problem associated with this
approach is that there is a finite probability that the output channel
will be overloaded at a given time. This means that one or more input
channels requiring bandwidth in the output channel may be denied it,
leading to the phenomenon known as "freeze-out". This usually
manifests as clipping of the initial portion of a talkspurt, which is
subjectively unacceptable if the fraction of speech lost is greater
than about 0.5% [Weinstein 79].
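The freeze-out phenomenon can be illustrated with a short calculation. The sketch below is not from the thesis; it assumes each input channel is in a talkspurt with a fixed probability, independently of the others, and computes the probability that the instantaneous number of active channels exceeds the output capacity:

```python
# Freeze-out probability sketch (illustrative assumptions, not thesis data):
# N independent on-off channels, each active with probability p_active, and
# an output channel able to carry at most `capacity` of them at once.
from math import comb

def freeze_out_probability(n_channels, capacity, p_active=0.4):
    """P(more than `capacity` of the n_channels are simultaneously active)."""
    return sum(comb(n_channels, k)
               * p_active ** k * (1 - p_active) ** (n_channels - k)
               for k in range(capacity + 1, n_channels + 1))

# 40 input channels concentrated onto 20 output circuits (a TASI gain of 2):
print(round(freeze_out_probability(40, 20), 4))
```

With more input channels the binomial demand concentrates around its mean, which is one way to see why a TASI gain of about 2 requires 40 or more channels.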
By combining the TASI principle with low bit rate digital speech coding
techniques, concentration factors of greater than 2 become possible in
circuit multiplication equipment. Furthermore, as the signal is
represented in digital form, a more refined level of bandwidth-assignment
than that used in early systems is feasible. Specifically, variable
rate coding can be used to track the changing information rate of each
input signal [Yatsuzuka 82]. In addition, the coding rate can be
adapted to the dynamic conditions existing in the output channel, so
that freeze-out can be prevented, by reducing the coding rate of all
input signals momentarily [Gruber 81]. With these features, the
concentration factor of DCM systems can be as high as 7 [Yatsuzuka 82].
7.2.2.2 Issues
There are two important points to note as regards the use of variable
rate coding in DCM systems. Firstly, due to their constrained
connectivity, DCM systems require relatively little addressing and control
information to be transmitted along with the speech code. This effi-
ciency means that it is practical to multiplex relatively short por-
tions of code from each input channel, typical figures being 2 - 5 ms
worth of speech [Langenbucher 82], [Yatsuzuka 82]. The coding rate for
a particular input channel can thus be changed with corresponding
frequency, ie. every 2 - 5 ms.
Secondly, because the use of centralized multiplexing means that the
output link utilization is known exactly to the flow control algorithm,
it can specify a bit rate for each input channel in the sure knowledge
that the required capacity will be available to convey the speech
signal to its destination (ie. the other end of the multiplexed link).
7.2.3 Variable rate coding in packet switched networks
7.2.3.1 Techniques
The flexibility of a packet switched network means that there are a
number of ways in which variable rate speech coding can be applied.
One possibility is to negotiate a coding rate between the voice
terminal and the network when the call is set up [Cohen 78]. The
negotiated rate might be influenced by the prevailing level of network
congestion and the amount the user is prepared to pay for a particular
quality of service. However, this approach does not exploit the
variable activity of speech, and does not permit dynamic flow control.
Another possibility is to allow the coding rate to be set independently
for each packet in a call. Suitable rate-control information is
included in each packet header, and at the receiver the information is
used to set the decoding rate accordingly for the duration of the
packet. The coding rate can thus be adapted on a per-packet basis to
the source activity, as well as to changing network conditions, while
the call is in progress.
A third possibility is to allow the coding rate to be varied within a
packet (in the limit on a per-sample basis), although flow control-
related rate variations are still likely to be implemented on a per-
packet basis. While this approach allows the maximum bandwidth
compression to be achieved [Dubnowski 79], the extra complexity involved
makes it unsuitable for this project.
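A per-packet rate field of the kind just described might look as follows. The packet layout, field width and rate set are hypothetical illustrations for this discussion, not part of any protocol defined in the thesis:

```python
# Hypothetical per-packet rate signalling: the transmitter stamps each packet
# with the coding rate used for its payload, and the receiver reads the header
# field to set its decoding rate for the duration of that packet.
from dataclasses import dataclass

RATES_KBPS = (16, 32, 48, 64)        # assumed rate set (a 2-bit header field)

@dataclass
class VoicePacket:
    rate_index: int                  # selects an entry of RATES_KBPS
    payload: bytes                   # speech code for one packet interval

def decode_rate(packet: VoicePacket) -> int:
    """Decoding rate in kbps, taken from the packet header per-packet."""
    return RATES_KBPS[packet.rate_index]

pkt = VoicePacket(rate_index=1, payload=b"\x00" * 80)
print(decode_rate(pkt))              # prints 32
```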
7.2.3.2 Issues
The fact that the connectivity of a packet switched network is
invariably less constrained than that of a DCM system means that the
former has a greater bandwidth overhead associated with addressing and
control information. Thus for the same level of efficiency, the amount
of speech in a packet must be greater than the 2 - 5 ms portion of
speech code multiplexed in DCM systems, a value of 20 ms being typical [Forgie
76].
Although this is unlikely to be restrictive as regards flow control, it
does put a limit on the bandwidth compression which can be achieved by
varying the coding rate on a per-packet basis. However, per-packet
rate adaptation can still account for the significant long-term
variations in the speech signal, such as the differences between talkspurts
and pauses, and between voiced and unvoiced sounds.
As mentioned in Section 7.2.2.2, the centralized multiplexing
implemented in DCM systems means that rate assignment can be done in the
sure knowledge that the required bandwidth will be available in the trans-
mission channel. By contrast, the multiplexing in packet switched
networks is often performed on a distributed basis, examples being
local area networks and packet radio networks. This means that there
is in general a degree of uncertainty involved in rate assignment, as
the demands made by other network nodes on the transmission channel are
not known. This fact has important implications for the way in which
the variable rate code is generated, as is discussed in the next two
sections.
7.2.4 Multirate coding
A multirate coder is defined here as one for which the bit rate of the
output code stream is varied by means of some change in the encoding
algorithm itself, such as the sampling rate, or the number of bits with
which each sample is quantized [Un 82], [Dubnowski 79]. When a multi-
rate coder is used in a packet switched network, rate control commands
must be issued to it by the flow control algorithm in the network voice
protocol. Assuming that a per-packet rate adaptation strategy is used,
it is then necessary for the encoder to know the location of packet
boundaries in its output code stream, in order that rate changes can be
executed on these boundaries. This increases the complexity of the
interface between the codec and the network voice protocol, and
decreases the degree of independence between these two modules.
Furthermore, a flow control "dead time" is introduced between the
sensing of an overload condition and the reduction of the coding rate.
While this delay is limited to the length of a packet in a single-link
network such as a LAN, there is no such limit in multi-link networks.
For example, if an overload occurs on a link which is geographically
remote from the transmitter, it may take a considerable time for a
supervisory packet to travel back from the point of congestion to the
transmitter to inform it of this condition. In the meantime, the
transmitter will continue to produce packets at a high rate, thereby
compounding the congestion problem [Bially Mar. 80].
7.2.5 Embedded coding
In contrast to a multirate coder, an embedded coder produces code at a
fixed rate, but allows the network to remove appropriate portions of
the code stream in order to reduce the effective bit rate if necessary.
The code words are generated in a manner which minimizes the effect of
this removal on the quality of the decoded signal [Goodman 80].
As the removal of bits from the code stream may be performed at any
point in the call path, without the transmitter or receiver being
explicitly informed, the "dead time" problem associated with multirate
coding is avoided. Furthermore, if the embedded code is organized into
packets of bits of equal significance, then bit rate reduction can be
implemented by dropping entire packets at a time [Tierney 81].
Embedded coding thus allows packet loss (due to overflowing queues) to be
replaced by a gradual SNR reduction in both local [Frost 85] and wide
area [Bially Mar. 80] networks. In addition, as flow control may be
exercised by the network voice protocol in a packet voice terminal
without informing the codec, it permits a considerable simplification
of the interface between the two modules.
A disadvantage of embedded coding in comparison with multirate coding
is that it results in a lower SNR than the latter. This is a
consequence of the fact that the embedded encoder cannot be simultaneously
optimized for all possible decoder rates, and therefore represents a
compromise design [Tzou 86]. However, in practical speech coding
applications this SNR reduction can be made small (typically less than
1 dB) [Goodman 80]. On balance, the slight reduction in SNR incurred
by embedded coding is outweighed by the advantage of fast and simple
flow control, and it is therefore chosen in preference to multirate
coding for this application.
7.2.6 Issues in the design of the embedded code
7.2.6.1 Code hierarchy
It is usual, though not essential, for an embedded code word to have a
hierarchical structure, meaning that different bits in the word are of
differing significance [Jayant 84]. Bits of lesser significance should
therefore be removed from the word before bits of greater significance,
in order to provide a graceful degradation in the quality of the
decoded speech. Furthermore, a certain minimum number of bits must
typically be left in the word in order to maintain an acceptable signal
quality. In this sense, the code word can be considered to consist of
"primary" (ie. essential) and "secondary" (ie. disposable) bits.
It is reasonable to assume that a packet switched network will
successfully convey the primary bits in each embedded code word to the
receiver with a high probability (say more than 99% of the time). This
implies that at any given time the information contained in the primary
bits of previous code words will almost certainly be available at the
receiver for the decoding of the current code word. The primary bits
may therefore be generated and decoded using an algorithm containing a
certain amount of "memory". Such algorithms are in general more
bandwidth-efficient than memoryless ones in speech coding applications
[Jayant 84].
By contrast, it is expected that the secondary bits in each code word
will be successfully conveyed to the receiver with a significantly
lower probability than the primary bits, and it should therefore not be
assumed that the secondary bits from previous code words will be
available at the receiver when the current code word is decoded. With these
considerations in mind, two ways in which the embedded code words can
be generated are by means of "explicit noise coding" and by means of
"coarse feedback coding", and these are discussed in the next section.
7.2.6.2 Explicit noise coding vs. coarse feedback coding
In the explicit noise coding scheme [Jayant 83], each input sample is
first applied to a "primary" encoder, which generates the primary bits
in the corresponding code word. The quantization noise or "residual"
generated by this process becomes the input signal to a "secondary"
encoder, which produces the secondary code bits in the code word. At
the receiver, the two portions of each code word are decoded by
separate algorithms, and the outputs are then summed to produce the
reconstructed sample.
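The two-stage arrangement just described can be sketched as follows; uniform quantizers and the particular step sizes are illustrative assumptions, not the coders chosen later in this chapter:

```python
# Explicit noise coding sketch: a coarse "primary" quantizer codes each input
# sample, and a finer "secondary" quantizer codes the primary stage's
# quantization noise; the receiver decodes both portions separately and sums
# them. Uniform quantizers and the step sizes are illustrative assumptions.
def make_uniform_quantizer(step):
    encode = lambda v: round(v / step)      # code word (integer index)
    decode = lambda code: code * step       # reconstruction level
    return encode, decode

prim_enc, prim_dec = make_uniform_quantizer(0.25)     # primary bits (coarse)
sec_enc, sec_dec = make_uniform_quantizer(0.0625)     # secondary bits (fine)

x = 0.4                                  # one input sample
p = prim_enc(x)                          # primary code word
residual = x - prim_dec(p)               # quantization noise of primary stage
s = sec_enc(residual)                    # secondary code word

y_full = prim_dec(p) + sec_dec(s)        # both portions received
y_primary = prim_dec(p)                  # secondary bits discarded en route
print(y_full, y_primary)                 # prints 0.375 0.5
```

Dropping the secondary bits merely falls back to the coarser primary reconstruction, which is the embedded property the section relies on.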
In principle, the number of coding operations may be increased beyond
two, so that the quantization noise from the secondary coder is encoded
by a tertiary encoder, and so on. The use of more than two coders has
been studied for the special case where each coder is a delta
modulator, and SNR improvements over a single delta modulator operating at an
equivalent total bit rate have been demonstrated [Chakravarthy 76].
However, in this application only two coders are considered, in the
interests of simplicity of implementation.
In the coarse feedback coding scheme [Goodman 80], a single encoder
generates both the primary and the secondary bits in each code word.
However, any portions of the coding algorithm involving memory make use
of the information contained in the primary bits only (hence the term
"coarse feedback"). An example of this scheme is an 8 bits/sample DPCM
coder in which the input samples to the predictor are represented with
only 2 bits, implying that each code word contains 2 primary and 6
secondary bits [Goodman 80].
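A minimal sketch of this coarse feedback arrangement follows, assuming a uniform 8-bit quantizer and an illustrative predictor coefficient (not the parameters of [Goodman 80]). Because the prediction loop uses only the primary bits, encoder and decoder predictors stay in step even when the secondary bits are discarded:

```python
# Coarse feedback coding sketch: a single DPCM coder emits an 8-bit word per
# sample, but the feedback loop "sees" only the 2 most significant (primary)
# bits. Quantizer and coefficient values are illustrative assumptions.
import math

A = 0.9                                        # fixed predictor coefficient

def quantize8(v):
    """Uniform 8-bit mid-rise quantizer over [-1, 1); returns the code word."""
    return max(0, min(255, int((v + 1.0) * 128)))

def coarse_level(word8):
    """Reconstruction level implied by the 2 primary bits alone."""
    return ((word8 >> 6) + 0.5) * 0.5 - 1.0

def coarse_feedback_encode(samples):
    y, words = 0.0, []
    for x in samples:
        v = x - A * y                          # prediction residual
        w = quantize8(v)                       # full 8-bit code word
        y = A * y + coarse_level(w)            # feedback uses primary bits only
        words.append(w)
    return words

def coarse_feedback_decode(words, secondary=True):
    y, out = 0.0, []
    for w in words:
        v_hat = (w + 0.5) / 128 - 1.0 if secondary else coarse_level(w)
        out.append(A * y + v_hat)              # reconstructed sample
        y = A * y + coarse_level(w)            # loop matches the encoder
    return out

samples = [0.3 * math.sin(2 * math.pi * n / 50) for n in range(400)]
words = coarse_feedback_encode(samples)
full = coarse_feedback_decode(words)                   # all 8 bits received
primary_only = coarse_feedback_decode(words, secondary=False)

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

print(mse(samples, full) < mse(samples, primary_only))  # True
```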
The single coding algorithm used in coarse feedback coding makes it
conceptually simpler than explicit noise coding. However, the latter
is more flexible than coarse feedback coding because it allows the use
of different types of algorithm for the primary and secondary coders,
and these can then be independently optimized. For instance, it has
been shown that the SNR of an explicit noise coding scheme with an
ADPCM primary coder can be significantly improved through the use of a
block coding technique in the secondary encoder [Jayant 83]. Another
example uses CVSD as the primary coder and log PCM as the secondary
coder [Tierney 81]. Because of this flexibility, explicit noise coding
is chosen as a means of realizing an embedded code in this application.
7.2.6.3 Code format
In order to maintain compatibility with 64 kbps PCM technology, the
code words are specified to be 8 bits long, and to be produced at a
rate of one per 125 µs, thus setting a maximum coding rate of 64 kbps.
At the other extreme, it is known that the lowest rate at which
communications quality speech can be achieved without resorting to
high-complexity coding techniques is approximately 16 kbps [Jayant 84].
This implies that at least 2 bits in each code word must be
transmitted, and these 2 primary bits are therefore grouped into a single code
"fragment".
If the secondary bits in each code word are also grouped into 2-bit
fragments, then additional bit rates of 32 and 48 kbps may be obtained
by discarding 1 or 2 secondary fragments from each word. In keeping
with the hierarchical structure described in Section 7.2.6.1, the
fragments are distinguished on the basis of their significance in the
code word, and discarding must be performed by working upwards, from
the least significant fragment to the most significant fragment.
It would also be possible to define 1-bit instead of 2-bit secondary
fragments, so that rates of 24, 32, 40, 48, and 56 kbps were obtained,
in addition to the 16 and 64 kbps rates. However, since fragments of
like significance must be placed in the same packet, as explained in
Section 7.2.5, the time taken to accumulate a packet of a particular
type of fragment (ie. the packetization delay for a particular packet
length) would then be doubled.
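The fragment format above can be sketched directly; the helper names are illustrative:

```python
# 2-bit fragment format: each 8-bit code word splits into four fragments of
# decreasing significance; discarding 0-3 of the least significant fragments
# gives 64, 48, 32 or 16 kbps at one word per 125 µs.
def split_fragments(word8):
    """Most significant fragment first: [bits 7-6, 5-4, 3-2, 1-0]."""
    return [(word8 >> shift) & 0b11 for shift in (6, 4, 2, 0)]

def merge_fragments(fragments):
    """Rebuild a code word; missing (discarded) fragments read as zero."""
    word = 0
    for i, frag in enumerate(fragments):
        word |= frag << (6 - 2 * i)
    return word

word = 0b10110110
frags = split_fragments(word)           # [0b10, 0b11, 0b01, 0b10]
# Drop the two least significant fragments -> effective 32 kbps:
print(bin(merge_fragments(frags[:2])))  # prints 0b10110000
```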
7.2.7 Summary and conclusions
Variable rate coding is most easily applied to packet voice
communications if the rate is varied on a per-packet basis. With this approach,
embedded coding can be used to permit simple, fast flow control to be
performed at any point in the call path by simply discarding packets
appropriately, and explicit noise coding using separate primary and
secondary coders is a flexible means of generating an embedded code.
Given the code format specification of Section 7.2.6.3, it is necessary
to find suitable algorithms for the primary and secondary encoders by
considering such factors as bandwidth compression and implementation
complexity.
7.3 Redundancy reduction coding
7.3.1 Introduction
While variable rate coding permits a degree of statistical bandwidth
compression, further gains can be made through the use of redundancy
reduction coding techniques. The degree of compression achieved by
such techniques is directly related to their complexity, so that a
suitable technique should be chosen on the basis of a trade-off between
performance and complexity. A brief taxonomy of speech coding
techniques offering redundancy reduction is given below, leading to the
selection of appropriate algorithms for the primary and secondary
coders.
7.3.2 Waveform coders vs. vocoders
Speech coders are conventionally grouped into two main classes, namely
waveform coders and vocoders [Jayant 84]. Vocoders attempt to extract
the perceptually important features of the speech signal (such as its
formant frequencies or pitch), in order to enable the synthesis at the
decoder of a signal which "sounds the same" as the original, but which
may have a significantly different waveform [Jayant 84]. A well-known
example is the Linear Predictive Coding (LPC) vocoder [Rabiner 78].
While vocoders may achieve high bandwidth compression factors, they are
correspondingly complex to implement, and also tend to produce speech
with a synthetic quality, so that they are not considered further here.
Waveform coders, on the other hand, attempt to preserve the time-
amplitude waveform of the speech signal, an inherently simpler approach
than that of vocoders. The class of waveform coders thus includes
algorithms with implementation complexities appropriate to this
application.
7.3.3 Time domain vs. frequency domain waveform coders
A distinction can be made between waveform coders which attempt to
preserve the signal waveform by means of operations performed directly
on its time-domain representation, and those which first transform the
signal to the frequency domain. In the latter case, the Discrete
Fourier Transform may be used, and as this transform by itself
represents a significant implementation problem, such techniques are not
considered appropriate to this project.
An alternative means of obtaining a frequency-domain representation of
the signal involves passing it through a filter bank. However, even
when this bank contains only a few filters, as in Sub-Band Coding (SBC)
[Jayant 84], its implementation in digital form still represents a
considerable computational load, and a sub-band coder typically has a
complexity similar to that of an LPC vocoder [Goldberg 79].
Elimination of the "frequency-domain" waveform coders leaves those
which operate directly on the time-domain representation of the input
signal. Such coders remove redundancy from the signal by means of
linear prediction (Differential PCM, Delta Modulation, Adaptive
Predictive Coding), or by some other means (Run-length Coding, Entropy
Coding, Vector Coding) [Jayant 84]. For reasons of implementation
complexity, attention is restricted here to the simpler predictive
coders.
7.3.4 Predictive waveform coders
The basic structure of a predictive encoder/decoder pair is shown in
Fig. 7.1. In the notation used here, the quantizer produces both a
code word w(n), which is transmitted over the channel, and an estimate
[Block diagram omitted: encoder with subtractor, quantizer and predictor
(signals x(n), v(n), w(n), s(n), u(n), y(n)); decoder with inverse
quantizer and predictor (signals w'(n), s'(n), u'(n), y'(n))]
Fig. 7.1 A predictive encoder and decoder
s(n) of the prediction residual v(n). In the decoder, an "inverse"
quantizer produces an estimate s'(n) of the prediction residual from
the received code word w'(n). The predictor is assumed to be linear,
as is standard practice in speech coding [Jayant 84], and in the
simplest case all its coefficients may be equal to zero, resulting in a
PCM coder.
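The structure just described can be sketched in the notation of Fig. 7.1, assuming for illustration a fixed first-order predictor and a fixed uniform quantizer (both simplifications; the coders studied later use adaptive quantization):

```python
# Minimal predictive coder in the style of Fig. 7.1. The quantizer produces
# the code word w(n) and the residual estimate s(n); the decoder mirrors the
# prediction loop exactly. STEP and A are illustrative values.
import math

STEP, A = 0.05, 0.9        # quantizer step size and predictor coefficient

def encode(x_samples):
    y = 0.0                            # y(n): predictor output
    for x in x_samples:
        v = x - y                      # v(n): prediction residual
        w = round(v / STEP)            # w(n): code word sent over the channel
        s = w * STEP                   # s(n): residual estimate
        y = A * (y + s)                # predict next sample from y(n) + s(n)
        yield w

def decode(code_words):
    y = 0.0                            # y'(n)
    for w in code_words:
        s = w * STEP                   # s'(n) from the received word w'(n)
        u = y + s                      # u'(n): reconstructed sample
        y = A * u
        yield u

x = [0.5 * math.sin(2 * math.pi * n / 40) for n in range(200)]
x_hat = list(decode(encode(x)))
print(max(abs(a - b) for a, b in zip(x, x_hat)) <= STEP / 2 + 1e-12)  # True
```

Because the encoder's feedback loop uses the same estimate s(n) that the decoder recovers, the reconstruction error is bounded by the quantization error of each residual.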
The predictor and quantizer may each be fixed or adaptive, adaptive
forms giving better performance than fixed ones with non-stationary
signals such as speech, at the expense of increased complexity and
increased vulnerability to channel errors (ie. packet loss or bit
errors). In general, it is possible to adapt the quantizer and
predictor in a predictive coder in a "forward" or "backward" manner.
This issue is discussed further in Section 7.4.3, but it is worth
mentioning here that in practice forward adaptation is associated with
block coding techniques, whereas backward adaptation is associated with
sequential techniques. For the reasons given in Sections 3.3.3 and
5.2.5, sequential coding (and hence backward adaptation) is preferred
in this application.
Fig. 7.2 shows relative performance figures for a range of predictive
waveform coders operating at 16 kbps [Noll 75], which is the intended
coding rate of the primary coder. These results are presented here in
order to give an impression of the relative performance gains which can
be expected as the algorithmic complexity of the predictive coding
system is increased.
Details of the quantizers and predictors and their associated
adaptation mechanisms used to obtain the results in Fig. 7.2 are given in
[Noll 75]. While points "d", "e" and "f" in this figure were generated
[Chart omitted: SNR vs. coder type]
Fig. 7.2 The relative performance of predictive coders at 16 kbps:
(a) Log PCM (b) APCM (c) ADPCM with a 1st-order fixed
predictor (d) ADPCM with a 1st-order adaptive predictor (e)
ADPCM with a 4th-order adaptive predictor (f) ADPCM with a
12th-order adaptive predictor [Noll 75]
using forward predictor adaptation, the performance of backward
adaptation is known to be similar for transmission rates > 16 kbps, provided
that the extra side information which must be transmitted in forward
adaptation is taken into account [Jayant 84].
The simplest predictive coder (at least conceptually, if not in terms
of implementation) is uniformly quantized ("linear") PCM, which
requires 96 kbps for toll quality coding of speech. Logarithmically
quantized ("log") PCM provides toll quality at 64 kbps, at the expense
of increased quantizer complexity. However, its performance falls off
relatively rapidly with decreasing bit rate, and speech quality is
unacceptable (lower than communications quality) at 16 kbps (point "a"
in Fig. 7.2).
If the quantizer in a PCM system is made adaptive, yielding Adaptive
PCM (APCM), then communications quality can be approximated at 16 kbps
(point "b" in Fig. 7.2). The penalty for this improvement in bandwidth
efficiency is that the coder becomes vulnerable to channel errors
(packet loss and bit errors), since tracking between the adaptive
quantizers in the encoder and decoder may be lost. In practice,
specific measures must be taken to deal with this problem, as mentioned in
Section 6.8.
The addition of a fixed predictor to an APCM system, resulting in
Adaptive Differential PCM (ADPCM), improves the SNR by an amount which
is related to the average degree of correlation among samples of the
input signal. This potential improvement, the "prediction gain", is
dependent upon the statistics of the speech source as well as on the
transfer function of the acoustic/electrical transducer used. Point
"c" in Fig. 7.2 shows the performance obtainable for speech if the
transducer has a uniform frequency response.
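The dependence of the prediction gain on inter-sample correlation can be made concrete. For a first-order predictor the optimum coefficient equals the input's normalized lag-1 autocorrelation rho, and the resulting gain is 1/(1 - rho^2), a standard result (see e.g. [Jayant 84]); the test signal below is purely illustrative:

```python
# First-order prediction gain from the lag-1 autocorrelation of the input.
import math

def lag1_autocorr(x):
    """Normalized autocorrelation of x at lag 1."""
    return sum(a * b for a, b in zip(x, x[1:])) / sum(a * a for a in x)

def prediction_gain_db(rho):
    """Maximum SNR gain of a first-order predictor with coefficient rho."""
    return 10 * math.log10(1.0 / (1.0 - rho * rho))

# A strongly correlated test signal: a 200 Hz tone sampled at 8 kHz.
x = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(8000)]
rho = lag1_autocorr(x)
print(round(rho, 3), round(prediction_gain_db(rho), 1))
```

A flat-spectrum (decorrelated) input gives rho near zero and hence a gain near 0 dB, which is the situation described next for typical telephone microphones.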
However, with a typical telephone microphone, which has a non-uniform
frequency response, the prediction gain over APCM is virtually zero
[Cattermole 69]. The predictor also increases the vulnerability of the
coder to channel errors, although error effects decay automatically
with time for stable predictors [Jayant 84].
Making the predictor in an ADPCM coder adaptive yields SNR improvements
of about 0.5, 2 and 3.5 dB for predictor orders of 1, 4 and 12,
respectively (points "d", "e" and "f" in Fig. 7.2). However, predictor
adaptation algorithms are fairly complex to implement, and
significantly increase the sensitivity of the coder to channel errors [Yatrou 88].
An alternative way of improving the SNR of an ADPCM coder is to
increase the sampling rate. This has the effect of increasing the
correlation between signal samples, so that the prediction gain obtainable
with a simple fixed predictor is increased. In the special case where
a 1-bit adaptive quantizer is used, the scheme is known as Adaptive
Delta Modulation (ADM). The increase in SNR obtained for ADM with
increasing bit rate is less than that obtained for ADPCM, although at
low bit rates (around 16 kbps) the performances of the two systems
converge [Jayant 84].
Increasing the predictor order above 1 in an ADM system does not
generally yield significant SNR improvement, and tends to lead to codec
instability, due to the highly non-linear nature of the 1-bit quantizer
[Jayant 84]. For the same reason, adaptive prediction is also not used
in delta modulators. The presence of an adaptive quantizer and a
predictor in an ADM coder makes it vulnerable to channel errors,
although individual bit errors tend to have less effect than on a multi-
bit/sample ADPCM coder [Jayant 84].
7.3.5 Conclusions
The two coding techniques suitable for this project in terms of speech
quality, complexity of implementation, and robustness to packet loss,
are Differential PCM with fixed prediction and adaptive quantization,
and Delta Modulation with fixed prediction and adaptive quantization
(referred to simply as ADPCM and ADM from this point). Both of these
techniques provide communications quality speech at 16 kbps, which
represents a reasonable lower quality bound for this application.
However, as ADM and ADPCM have significantly different implementation
requirements (in terms of sampling rate and quantizer resolution), a
more detailed study of each scheme is required before a choice can be
made between the two.
In particular, it is necessary to optimize the predictor and adaptive
quantizer in each case for the appropriate sampling rate and input
transducer type, if a fair comparison is to be made between the two
techniques. Optimization of the predictors in these coders is
relatively straightforward, requiring only that optimum predictor
coefficients be found, as is discussed in Section 7.5. By contrast, a large
number of distinct quantizer adaptation algorithms have been described
in the literature [Jayant 84], so that in this case it is necessary to
select an appropriate algorithm, as well as to optimize its parameters,
as is considered in the next section.
7.4 The adaptive quantizer in the primary coder
7.4.1 Introduction
During preliminary comparisons of various delta modulation algorithms
conducted as part of this project, it was noted that anomalously low
performance was obtained with telephone speech inputs (ie. speech
captured by a standard telephone microphone). Specifically, the
difference in SNR between ADPCM and ADM coders with fixed first-order
predictors was less than 1 dB with high-quality input speech, but
greater than 4 dB with telephone speech. As these results were
obtained after the predictor coefficient had been adjusted for maximum
SNR in each case, this indicated that the performance discrepancy
might be due to inadequacies in the adaptive quantizers used in the ADM
coders.
In particular, the fact that these adaptive quantizers had been
designed for high-quality speech inputs suggested that they were poorly
matched to the statistics of telephone speech, and that this could not
be compensated for by manipulation of the predictor coefficient.
However, as the adaptive quantizers involved had to be optimized using
time-consuming random search methods, it was difficult to test this
hypothesis. It was therefore considered worthwhile to investigate the
operation of adaptive quantizers in detail, and to attempt to develop
one which could be optimized by means of a more efficient procedure.
During the course of this work, a new ADM algorithm which offered good
subjective performance and simple implementation was developed [Irvine
86]. Subsequently, a generalized form of adaptive quantizer and an
associated optimization procedure were developed [Hall 88]. The
structure of this new adaptive quantizer is described in detail in
Section 7.4.6, but in order to facilitate the description, a review of
pertinent issues in adaptive quantization is given below.
7.4.2 Adaptation vs. companding
In the past, it has been widely assumed that changes in the mean and
PDF shape of the quantizer input signal are insignificant in comparison
with changes in its local variance, and adaptive quantizers have
therefore tended to take account of the latter statistic only, typically by
relating all the quantizer levels to a fundamental "step size", and
then increasing or decreasing this step size appropriately [Jayant 70],
[Jayant 73]. For the same reason, the term "companding" (compressing/
expanding) is often used interchangeably with "adaptation" in the
literature. Nevertheless, it is quite feasible to adapt a quantizer to
changes in the mean and PDF shape of its input signal [Jayant 84].
Accordingly, "adaptation" is used here in the more general sense, while
"companding" is used to refer to adaptation to changes in the signal
variance only.
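Step-size companding of the kind described above can be sketched in the style of the one-word-memory quantizer of [Jayant 73]; the multiplier values and limits below are illustrative, not optimized:

```python
# One-word-memory companding sketch: after each sample the step size is
# multiplied by a factor selected by the magnitude of the code word just
# emitted, so a backward-adaptive decoder can apply the identical scaling
# from the received code alone. Values shown are illustrative, for a 3-bit
# sign-magnitude quantizer.
MULTIPLIERS = (0.85, 1.0, 1.2, 1.6)    # indexed by |code|, 0..3
STEP_MIN, STEP_MAX = 1e-3, 1.0

def adapt_step(step, code_magnitude):
    """Scale the step size and confine it to its working range."""
    step *= MULTIPLIERS[code_magnitude]
    return min(STEP_MAX, max(STEP_MIN, step))

step = 0.01
for mag in (3, 3, 3, 0, 0):            # large residuals, then small ones
    step = adapt_step(step, mag)
print(round(step, 6))
```

Large code magnitudes (quantizer near overload) grow the step; small magnitudes shrink it, tracking the local variance of the input.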
Adaptation to a change in the local variance of the quantizer input
signal requires that all the quantizer levels are scaled by an
appropriate factor, or equivalently that the signal is scaled by a
reciprocal factor prior to quantization. Adaptation to a change in the
mean of the signal requires that a uniform shift is applied to the
quantizer levels, and adaptation to a change in PDF shape requires the
redistribution of these levels [Jayant 84]. In principle, all of the
above types of adaptation may be required at each sampling instant in
order to quantize the input signal "optimally".
One case in which quantizer adaptation has a clear potential advantage
over simple companding is delta modulation, where the relatively small
range of the quantizer means that it is frequently overloaded. Fig.
7.3 shows the encoder input and predictor output signals for an ADM
system with a perfect integrator in the encoder feedback loop. Two
distinct regions of operation can be identified, one in which "slope
overload" distortion is incurred, and the other in which "granular"
[Waveform sketch omitted: encoder input and predictor output, with a
slope overload region followed by a granular noise region]
Fig. 7.3 Slope overload and granular noise in delta modulation
noise is produced.
The quantizer input signal in this case is equal to the difference
between the encoder input and the predictor output, as indicated by the
vertical arrows in Fig. 7.3. Clearly, this difference signal has a
mean which is approximately zero in the granular noise region, but
which is non-zero in the slope overload region. This suggests that
adaptation of the quantizer to the shift in the mean might result in a
worthwhile performance improvement. The potential advantage of
adaptation over companding is less clear for higher-resolution adaptive
quantizers, but it is shown in Section 7.6.4.2 that an improvement is
obtained at least in the case of 2-bit quantization.
7.4.3 Backward vs. forward adaptation
A distinction can be made between "forward" adaptive quantizers, which
are adapted from observations of the quantizer input signal, and
"backward" adaptive quantizers, which are adapted from observations of
the
quantizer output signal only [Jayant 84]. As the quantizer input
signal is not available to the decoder, forward adaptive quantizers
typically calculate adaptation parameters for a block of speech and
transmit these to the decoder as side information. This association
between forward adaptation and block coding makes it undesirable in
this application (as mentioned in Section 7.3.4), so that attention is
restricted here to backward adaptation, which can be implemented in a
sequential manner.
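The practical force of the distinction is that a backward adaptive
quantizer must derive its adaptation entirely from information the
decoder also has. A minimal sketch in the spirit of Jayant's
one-word-memory step adaptation is given below; the multiplier values
and function names are illustrative, not taken from the thesis.

```python
def backward_adaptive_dm(x, m_fast=1.5, m_slow=0.66, step0=0.1):
    """Backward-adaptive delta modulator: the step size is adapted
    from the transmitted bits only, so no side information is needed.
    Returns the bit stream and the encoder's local reconstruction."""
    pred, step, prev_bit = 0.0, step0, None
    bits, staircase = [], []
    for sample in x:
        bit = 1 if sample >= pred else 0
        # Jayant-style rule: consecutive equal bits suggest slope
        # overload (grow the step); alternating bits suggest granular
        # noise (shrink it).
        if prev_bit is not None:
            step *= m_fast if bit == prev_bit else m_slow
        pred += step if bit else -step
        bits.append(bit)
        staircase.append(pred)
        prev_bit = bit
    return bits, staircase

def decode(bits, m_fast=1.5, m_slow=0.66, step0=0.1):
    """Decoder re-derives the identical step sequence from the bits."""
    pred, step, prev_bit = 0.0, step0, None
    out = []
    for bit in bits:
        if prev_bit is not None:
            step *= m_fast if bit == prev_bit else m_slow
        pred += step if bit else -step
        out.append(pred)
        prev_bit = bit
    return out
```

Because both ends run the same recursion on the same bits, the decoder
output matches the encoder's local reconstruction exactly.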
7.4.4 Syllabic, instantaneous, and hybrid adaptation
Adaptive quantizers can be further classified according to the "speed"
of their adaptation characteristics, which may be instantaneous (ie. an
attempt is made to track sample-to-sample variations in the signal
magnitude), syllabic (ie. an attempt is made to track changes in the
signal magnitude associated with different syllables), or some
combination of the two [Jayant 84]. Instantaneously adaptive
quantizers are able to quantize signal transients well, but can cause
codec instability if their response is too rapid. This tendency is
accentuated by low
quantizer resolution and the use of fixed predictors of order 2 or more
[Gibson 78]. At the other extreme, syllabically adaptive quantizers
are inherently stable in predictive coders, but quantize transients
poorly.
While it may appear that there should be an adaptation scheme which has
a response time somewhere between those of instantaneously and
syllabically adaptive quantizers, and which could therefore be called
"critically damped", the fast attack/slow decay characteristic of
speech signals makes such a simple solution sub-optimal. A better
approach is represented by "pseudosyllabically" adaptive quantizers
[Nasr 84], which employ instantaneous adaptation during periods of
quantizer overload, and syllabic adaptation elsewhere. An adaptive
quantizer of this type is used in Continuously Variable Slope Delta
Modulation (CVSD) [Jayant 84].
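The pseudosyllabic idea can be sketched as follows. The run-of-three
overload detector is the classic CVSD arrangement; all constants here
are illustrative rather than taken from the thesis or any particular
CVSD standard.

```python
def cvsd_encode(x, step_min=0.01, step_max=1.0, decay=0.98, boost=0.05):
    """Pseudosyllabic adaptation sketch: a run of three equal output
    bits is taken as evidence of quantizer overload, triggering a fast
    (instantaneous) step increase; otherwise the step decays slowly
    (syllabically) toward step_min. Returns the bit stream and the
    final predictor value."""
    pred, step = 0.0, step_min
    bits = []
    for sample in x:
        bit = 1 if sample >= pred else 0
        bits.append(bit)
        if len(bits) >= 3 and bits[-1] == bits[-2] == bits[-3]:
            step = min(step + boost, step_max)   # fast attack on overload
        else:
            step = max(step * decay, step_min)   # slow syllabic decay
        pred += step if bit else -step
    return bits, pred
```

On silence the step stays pinned at step_min, while a large constant
input produces a run of equal bits that quickly raises the step until
the predictor catches up.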
An alternative to pseudosyllabic adaptation is hybrid adaptation, which
makes continuous use of both instantaneous and syllabic adaptation.
This technique was first used in a delta modulator [Magill 77], [Un
81], where it was shown to provide significant improvements in SNR and
dynamic range over other schemes [Un Jan. 80], and was later
generalized to include multi-bit quantizers [Nasr 84].
7.4.5 The optimization of backward adaptive quantizers
For a fixed quantizer, the decision and reconstruction levels used to
quantize a particular input sample depend solely on its magnitude (and
sign). Optimization of the quantizer therefore simply requires that
the decision and reconstruction levels are matched to the likely values
of the input samples.
However, in the case of a backward adaptive quantizer, the levels used
to quantize a particular sample are determined not only by its
magnitude, but also by one or more quantizer output bits. In
principle,
these output bits could be related to the "past" or the "future", but
the use of future output bits (involving Delayed Decision or Multipath
Search Coding [Jayant 84]) requires considerably more processing than
the use of past output bits, so that attention is restricted here to
the latter case.
Optimization of a backward adaptive quantizer thus requires that the
decision and reconstruction levels are matched to the likely values of
the input samples for each possible combination of previous quantizer
output bits, which significantly increases the dimensionality of the
optimization problem. This tends to make optimization by means of
trial-and-error or using a random search procedure very time-consuming,
particularly since a large number of samples must be included in the
training sequence (>10 000) in order to obtain a statistically
representative input.
An alternative to the above approach has been proposed for optimizing a
particular instantaneously adaptive quantizer [Castellino 77]. In this
procedure, the appropriate quantizer levels are updated after each
sample is coded, so that suitable values are "learnt" over the course
of the training sequence. While this procedure is much more efficient
than random search methods, it is not rigorous when the training
sequence is taken from a non-stationary signal, since it assigns more
weight to samples at the beginning of the sequence than at the end.
By contrast, the optimization procedure developed by Bello et al. for
the Statistical Delta Modulation (SDM) system assigns equal weight to
all samples in the training sequence [Bello 67]. The SDM system, which
is shown in Fig. 7.4, differs from conventional delta modulators in
that it does not use an integrator (ie. a linear predictor) in the
encoder feedback loop. Instead, a generalized mapping operation is
used to map past quantizer output bits, stored in a shift register, to
a predicted signal value, which is then used as a decision level for
calculating the output bit. A corresponding operation in the decoder
maps received bits to an estimated signal value, or reconstruction
level. The SDM optimization procedure iteratively improves these
mappings when the input to the system is a training sequence of
samples.
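The mapping operation described above can be sketched directly. The
level tables below are placeholders standing in for values produced by
the training procedure, and L is kept small; the function names are
mine.

```python
def sdm_encode(x, p, L):
    """Statistical Delta Modulation sketch: the last L output bits,
    read as an unsigned integer, index a table p of decision levels.
    No integrator is used in the feedback loop; the table values would
    come from the Bello et al. training procedure."""
    reg = 0                       # shift register of the last L bits
    mask = (1 << L) - 1
    bits = []
    for sample in x:
        decision = p[reg]         # mapping: past bits -> decision level
        bit = 1 if sample >= decision else 0
        reg = ((reg << 1) | bit) & mask
        bits.append(bit)
    return bits

def sdm_decode(bits, q, L):
    """Decoder: the same register contents index a table q of
    reconstruction levels (estimated signal values)."""
    reg, mask, out = 0, (1 << L) - 1, []
    for bit in bits:
        reg = ((reg << 1) | bit) & mask
        out.append(q[reg])        # mapping: received bits -> estimate
    return out
```

With an all-zero decision table the encoder degenerates to a sign
detector, which makes the register/table indexing easy to check by
hand.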
The SDM system can thus be thought of as a form of backward adaptive
quantizer, as its decision and reconstruction levels at any given
sampling instant are dependent on a number of quantizer output bits.
Unfortunately, however, due to the relatively short memories which must
be used in practical implementations, account cannot be taken of long-
term (ie. syllabic) changes in the signal variance. As such changes
are considerable in speech, typically spanning 40 dB [Jayant 84], the
original SDM system is unsuitable for use in speech coding
applications.
Nevertheless, it has been found that an optimized SDM system performs
significantly better than a conventional DM system when coding a
stationary signal, to the extent that the coding rate can be reduced by
a factor of 1.5 for the same SNR [Bello 67]. This suggests that the
ability of the SDM coder to adapt to short-term changes in the
variance, mean and PDF shape of its input signal is important.

Fig. 7.4 The Statistical Delta Modulation system
7.4.6 The Generalized Hybrid Adaptive Quantizer
7.4.6.1 Introduction
The fundamental problem with the SDM system, which prevents it from
being used in speech coding applications, is that it is unable to adapt
to long-term changes in the variance of its input signal. This problem
is overcome here by the addition of a syllabic compandor, thereby
creating a hybrid adaptive quantizer. (While this does not allow
adaptation to long-term changes in the mean and PDF shape of the
quantizer input signal, these are expected to be of secondary
importance in
speech coding applications.) Furthermore, the scheme is generalized to
include multi-bit quantization. The resulting adaptive quantizer has
an arbitrary memory and code word length (in contrast to many
previously proposed schemes), and is therefore called the Generalized
Hybrid
Adaptive Quantizer (GHAQ).
Fig. 7.5 shows the GHAQ in the context of a predictive encoder, the
GHAQ being the portion of the figure to the right of the vertical
dashed line. Its input is the prediction residual v(n), and its
outputs are the code word w(n) and the estimate s(n), which is a
quantized version of the prediction residual v(n). The GHAQ itself can
be further divided into an instantaneously adaptive quantizer, which is
shown inside the dashed box in Fig. 7.5, and a syllabic compandor.
7.4.6.2 The syllabic compandor
The syllabic compandor makes use of an exponentially-weighted average
magnitude estimate. This is not the only possible weighting function
for average magnitude estimation, nor is it necessarily optimal
[Fischer 85], but the fact that it can be realized in recursive form
makes its implementation considerably simpler than for most other
estimators. The average magnitude estimate g(n) is used to compress
the prediction residual v(n) before it is quantized, and to expand it
again afterwards. The compression operation is performed by inverting
g(n) before multiplying, as shown by the block labelled "INV" in Fig.
7.5. The block labelled "ABS" calculates the magnitude of the syllabic
compandor input sample. The minimum value of g(n) is constrained to be
unity, in order to prevent g(n) from diminishing indefinitely if the
input to the syllabic compandor is zero.
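A plausible recursive form of such an estimator is sketched below.
The excerpt does not give the exact recursion or the signal driving
the "ABS" block, so both are assumptions here; the reconstructed
sample s is used so that the adaptation remains trackable by the
decoder, and beta = 0.94 is the optimum later found for the 2-bit
GHAQ.

```python
def update_g(g_prev, s_prev, beta=0.94):
    """Exponentially weighted average-magnitude estimate g(n), floored
    at unity so that it cannot diminish indefinitely on a zero input.
    The exact recursion and the choice of driving signal are
    assumptions; the thesis specifies only the estimator family."""
    return max(1.0, beta * g_prev + (1.0 - beta) * abs(s_prev))

# Compression before quantization and expansion afterwards:
#   c(n) = v(n) / g(n)        s(n) = d(n) * g(n)
def compress(v, g):
    return v / g

def expand(d, g):
    return d * g
```

The unity floor is visible immediately: on silence g decays toward 1
and stops there, while for a sustained loud input g converges
geometrically (factor beta per sample) toward the input magnitude.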
Multiplication of the quantizer decision and reconstruction levels by
g(n) would be equivalent to the compression/expansion representation
in Fig. 7.5, and would have the practical advantage that the inversion
operation could be avoided. However, the form shown in Fig. 7.5 is
used here, because it facilitates the description of the optimization
problem, although the hardware implementation of the codec described in
Chapter 8 uses the alternative form.
7.4.6.3 The instantaneously adaptive quantizer
The instantaneously adaptive quantizer in the GHAQ consists of a number
of sets of decision and reconstruction levels, a fixed 1-bit quantizer,
and a FIFO memory of length L bits, which stores one or more quantizer
output words. Multi-bit quantization is implemented by determining the
bits b_m(n) ; m = 0,1,...,M-1 in code word w(n) sequentially, starting
with the Most Significant Bit (MSBit) b_(M-1)(n), and finishing with
the Least Significant Bit (LSBit) b_0(n), where M is the number of
bits/sample. This approach is similar to the "successive
approximation" technique used in analog-to-digital converters, although
it is important to note that in the GHAQ the instantaneously adaptive
quantizer is adapted during the generation of each code word.
Associated with each value of the bit index m is a set of decision
levels P_m = {p_m(j_m) | j_m = 0,1,...,J_m-1}, and a set of
corresponding reconstruction levels Q_m = {q_m(k_m) | k_m =
0,1,...,K_m-1}. The quantizer level indices j_m(n) and k_m(n) are
generated by interpreting the contents of the memory as respective
unsigned binary numbers. The exact interpretation in each case depends
on the current state of the "successive approximation" process. This
is illustrated in Fig. 7.6 using the 2-bit GHAQ with memory length
L = 4 as an example, and the sequential generation of these indices is
described below.
First, j_1(n) is generated, and this selects the appropriate decision
level from the set P_1, enabling b_1(n) to be determined. This bit is
latched to form the MSBit of w(n), and is also clocked into the memory,
in the position shown in Fig. 7.6. Next, j_0(n) is generated, and this
selects a decision level from the set P_0, enabling b_0(n) to be
determined. This bit is latched to form the LSBit of w(n), and is also
clocked into the memory. Finally k_0(n) is determined, which selects a
reconstruction level from the set Q_0, and this level becomes the
quantizer output level d(n). The contents of the entire memory are
then shifted to the right by 2 bits, in preparation for the next input
sample c(n+1).
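The successive-approximation sequence above can be sketched for the
M = 2, L = 4 case. The bit ordering of the index words follows my
reading of Fig. 7.6 and Table 7.2 (most recently clocked bit most
significant); the level tables and function name are mine.

```python
def ghaq_2bit_quantize(c, P1, P0, Q0, memory):
    """One sample of 2-bit GHAQ successive-approximation coding with
    memory length L = 4. `memory` is [b0(n-1), b1(n-1)], the previous
    code word's bits; P1, P0, Q0 are level tables of lengths 4, 8 and
    16, matching J1 = 4, J0 = 8, K0 = 16 from (7.1)-(7.2)."""
    b0_prev, b1_prev = memory
    # MSBit: j1(n) is the 2-bit number (b0(n-1), b1(n-1))
    j1 = (b0_prev << 1) | b1_prev
    b1 = 1 if c >= P1[j1] else 0
    # LSBit: j0(n) is the 3-bit number (b1(n), b0(n-1), b1(n-1))
    j0 = (b1 << 2) | (b0_prev << 1) | b1_prev
    b0 = 1 if c >= P0[j0] else 0
    # Reconstruction: k0(n) is the 4-bit number (b0, b1, b0(n-1), b1(n-1))
    k0 = (b0 << 3) | (b1 << 2) | (b0_prev << 1) | b1_prev
    d = Q0[k0]
    codeword = (b1 << 1) | b0        # w(n): MSBit b1(n), LSBit b0(n)
    return codeword, d, [b0, b1]     # new memory for the next sample
```

With all-zero decision tables the coder reduces to sign coding, which
makes the index arithmetic easy to verify by hand.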
It can be seen that at no stage in the above process is the index
k_1(n) required, implying that the set Q_1 is not used. This
observation can be generalized to the sets Q_(M-1), Q_(M-2), ..., Q_1
in an implementation of the M-bit GHAQ. However, these sets are
required in the process of optimizing the GHAQ, since the optimum
decision levels are defined in terms of the optimum reconstruction
levels, as is shown in Section 7.4.7.

Fig. 7.6 Generation of the GHAQ level indices j_m(n) and k_m(n) for
code word length M = 2 and memory length L = 4
The number of elements in each set P_m or Q_m is determined by the
number of different values that j_m or k_m can assume, which is in turn
determined by the memory length L and the bit index m, as illustrated
in Fig. 7.6. Since there are by definition two reconstruction levels
associated with each decision level, the number of levels in each case
can be expressed:

J_m = 2^(L-1-m)                                               (7.1)

K_m = 2^(L-m) = 2 J_m                                         (7.2)
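As a quick sanity check, (7.1) and (7.2) can be evaluated directly; the
helper below is trivial and not from the thesis.

```python
def num_levels(L, m):
    """Number of decision (J_m) and reconstruction (K_m) levels for
    memory length L and bit index m, per equations (7.1)-(7.2)."""
    Jm = 2 ** (L - 1 - m)
    Km = 2 ** (L - m)
    return Jm, Km

# For the 2-bit GHAQ of Fig. 7.6 (L = 4): m = 1 gives (4, 8) and
# m = 0 gives (8, 16), matching the index widths used there.
```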
In the case of speech inputs, it is to be expected that the signal
statistics will be symmetrical with respect to the signal mean [Jayant
84], implying a corresponding symmetry in the quantizer decision and
reconstruction levels:

p_m(j_m) = -p_m(J_m-1-j_m)                                    (7.3)

q_m(k_m) = -q_m(K_m-1-k_m)                                    (7.4)
Use of this assumption in an implementation of the GHAQ halves the
number of levels which must be stored, as the missing levels can be
derived from the stored ones by using the MSBit of the current code
word, b_(M-1)(n), as a sign bit. In the interests of simplicity of
presentation, statistical symmetry of the signal is not assumed in the
derivation of the optimization procedure in Section 7.4.7. However, it
is assumed when the optimization procedure is actually used, by simply
averaging the corresponding values at each iteration.
Another implementation issue worth noting is that it was found
empirically that the SNR of the 2-bit GHAQ with speech inputs was
degraded only slightly (< 0.5 dB) by setting all the decision levels in
the set P_0 to zero. This simplifies the implementation significantly,
and accordingly all references in the rest of this thesis to the 2-bit
GHAQ assume the use of this scheme.
7.4.7 Derivation of the GHAQ optimization procedure
There are three independent factors which simultaneously determine the
performance of the GHAQ with a given input signal, namely the syllabic
compandor coefficient β, the memory length L, and the values of the
quantizer decision and reconstruction levels. An iterative procedure
is derived below which is suitable for finding optimum quantizer levels
for given β and L. This procedure is similar to the Lloyd-Max
procedure for optimizing fixed quantizers [Lloyd 57], [Max 60], and the
optimization procedure associated with the SDM system [Bello 67].
It is assumed for the purposes of quantizer optimization that speech is
"quasi-ergodic", in the sense that if time averages are calculated for
a sufficiently long period, they will tend to approach ensemble
averages across a correspondingly large number of utterances [Linde
80]. Furthermore, a training sequence representative of general speech
signals is used, as is described in Section 7.4.9. For these reasons,
time-independent random variables x, y, etc., rather than individual
samples x(n), y(n), etc. are considered in the analysis below.
A quadratic cost function is assumed, as done in [Max 60] and [Lloyd
57], although other functions could also be used, as is discussed in
Section 7.4.8. The distortion to be minimized is thus:

D = E{(x - y)^2}                                              (7.5)

where E{.} denotes expectation. It can be seen from Fig. 7.5 that:

x = v + u
y = s + u

Also:

v = c g
s = d g

Thus (7.5) can be rewritten:

D = E{g^2 (c - d)^2}                                          (7.6)

As d is the random variable associated with the quantizer output level
d(n), it can only take on values equal to the reconstruction levels in
the sets Q_m ; m = 0,1,...,M-1. In the interests of clarity of
presentation, the subscript m is omitted in the development below, with
the understanding that the analysis is valid for any bit b_m(n) in code
word w(n). Associated with each decision level p(j) are two
reconstruction levels, q(k|b=0) and q(k|b=1), so that (7.6) can be
expressed:

        J-1    inf  p(j)
D     =  Σ  [   ∫    ∫    g^2 (c - q(k|b=0))^2 f_cg(c,g|j) dc dg
        j=0   -inf -inf

               inf  inf
            +   ∫    ∫    g^2 (c - q(k|b=1))^2 f_cg(c,g|j) dc dg ] f_j(j)  (7.7)
              -inf  p(j)

where : f_cg(c,g|j) is the joint PDF of c and g conditional upon j
        f_j(j) is the PDF of j
A crucial issue in the development of an optimization procedure from
(7.7) is the extent to which the PDFs f_cg(c,g|j) and f_j(j) are
independent of the decision levels p(j) for j = 0,1,...,J-1, as (7.7)
must be differentiated with respect to these levels. Strictly, g(n) is
a weighted sum of quantizer output samples d(n), d(n-1), ..., d(1),
which
are in turn dependent on the reconstruction levels, and hence on the
decision levels. However, in order to maintain a tractable analysis,
it is assumed in the development below that the assumption of
independence is valid. This point is discussed further in Section
7.4.10,
where the performance of the optimization procedure is evaluated.
In order to minimize D, it is differentiated with respect to the p(j)'s
and q(k)'s, and the derivatives are equated with zero. This gives
necessary, although not sufficient, conditions for a local minimum.
Details of the minimization are given in Appendix A, and the main
results are:
p(j) = 1/2 [ q(j) + q(j+K/2) ]  ;  j = 0,1,...,J-1            (7.8)

        E{c g^2 | k}
q(k) = --------------  ;  k = 0,1,...,K-1                     (7.9)
        E{g^2 | k}
Equation (7.8), which is identical to the design equation obtained for
non-uniform fixed quantizers [Max 60], [Lloyd 57], states that the
optimum decision levels lie midway between the corresponding
reconstruction levels. Equation (7.9) states that the optimum
reconstruction levels are the "weighted" centroids of the quantizer
input signal on the corresponding quantizer intervals.
In order to optimize the GHAQ, initial sets {p(j)} and {q(k)} are
assumed, and these are then iteratively improved with a training
sequence of input samples by repeated evaluation of (7.8) and (7.9),
where the expectations are replaced by sample averages. This procedure
is repeated for values of β in the range [0,1), in order to optimize
this parameter simultaneously with the quantizer levels. The memory
length L is assumed to be given, but unlike β, there is no theoretical
upper bound on this parameter. Rather, it is expected that increasing
L will result in a monotonic increase in the SNR of the GHAQ, and that
the SNR will tend to saturate above a certain value of L, due to the
finite correlation time of the input signal, as found for the SDM
system [Bello 67]. In practice, L will be constrained by the desire
for a simple implementation.
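One iteration of this procedure can be sketched as follows. The sketch
assumes the coder has already been run over the training sequence with
the current levels to collect, for every sample, the compressed value
c, the gain g, and the selected reconstruction index k; how those
triples are produced is abstracted away, and the function name is mine.

```python
def update_levels(samples, q, K):
    """One iteration of the GHAQ level update: equation (7.9) with the
    expectations replaced by sample averages, followed by equation
    (7.8). `samples` is a list of (c, g, k) triples gathered over the
    training sequence; `q` holds the current reconstruction levels,
    retained for any index that was never selected."""
    num = [0.0] * K
    den = [0.0] * K
    for c, g, k in samples:
        num[k] += c * g * g       # accumulates E{c g^2 | k}
        den[k] += g * g           # accumulates E{g^2 | k}
    q_new = [num[k] / den[k] if den[k] > 0 else q[k] for k in range(K)]
    J = K // 2                    # two reconstruction levels per decision level
    p_new = [0.5 * (q_new[j] + q_new[j + J]) for j in range(J)]
    return p_new, q_new
```

With g fixed at 1 the update reduces to the ordinary Lloyd-Max
centroid, while unequal gains weight the centroid by g^2 as (7.9)
requires.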
If it is assumed that c and g are independent random variables, then
(7.9) reduces to:
q(k) = E{c|k}  ;  k = 0,1,...,K-1                             (7.10)
which corresponds to the well-known result for fixed quantizers [Max
60], [Lloyd 57]. However, it was found empirically that if (7.10) is
used instead of (7.9) in the optimization procedure, unsatisfactory
convergence behaviour results in many instances, indicating that the
assumption of independence of c and g is not valid in general.
7.4.8 Performance measures
The distortion measure used in the GHAQ optimization procedure is the
Mean Square Error (MSE), implying that the procedure will attempt to
maximize the SNR of the GHAQ. The MSE has the advantage that it is
easily calculated and tractable, and is widely used in the optimization
and evaluation of speech coding systems [Jayant 84]. However, as the
ultimate objective is to maximize the perceived quality of the decoded
speech, the question arises as to how well SNR is correlated with
perceptual judgements.
A number of studies have addressed this issue [McDermott 78], [Nakatsui
82], [Scagliola 79], [Barnwell 82], and average correlations between
SNR and subjective judgements in the range 0.24 to 0.89 have been
reported. However, the major conclusion of these studies is that SNR
is not a good predictor of subjective quality ratings for different
coder types (such as when both waveform coders and vocoders are
included in the tests). When only time-domain waveform coders are
included, the correlation between SNR and subjective judgements
improves considerably [Nakatsui 82].
It has also been found in these studies that segmental SNR (SNRSEG)
[Jayant 84] is a better predictor than SNR of the subjective quality of
waveform coded speech, and has correlations in the range 0.77 to 0.95.
The distortion measure associated with SNRSEG is the Energy-Weighted
Mean Square Error (EWMSE) [Chen 87], and while this measure involves
more computation than the MSE, it is considerably simpler than
alternative distortion measures such as the LPC Cepstrum Distance
[Kitawaki
82]. Although the EWMSE has not been used in the GHAQ optimization
procedure described here, it should be directly applicable, and this
represents a reasonable area for further work. Nevertheless, SNRSEG
figures are given in all performance evaluations of the GHAQ, in order
to allow comparison with SNR figures.
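The two measures can be stated concretely. The sketch below uses an
illustrative 128-sample segment length and omits the silence-exclusion
refinements sometimes applied in SNRSEG measurement.

```python
import math

def snr_db(x, y):
    """Conventional SNR in dB over the whole signal: original x,
    decoded y."""
    sig = sum(s * s for s in x)
    err = sum((s - t) ** 2 for s, t in zip(x, y))
    return 10.0 * math.log10(sig / err)

def snrseg_db(x, y, seg=128):
    """Segmental SNR: the mean of per-segment SNRs, which weights
    quiet passages equally with loud ones instead of letting the
    loudest segments dominate."""
    vals = []
    for i in range(0, len(x) - seg + 1, seg):
        vals.append(snr_db(x[i:i + seg], y[i:i + seg]))
    return sum(vals) / len(vals)
```

A loud, accurately coded segment followed by a quiet, poorly coded one
shows the difference: the conventional SNR is dominated by the loud
segment, while SNRSEG averages the two per-segment figures.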
7.4.9 The training set
When any algorithm is optimized by tailoring its characteristics using
a particular training set, it is important to ensure that the training
set is representative of the class of signals from which it is taken
[Linde 80]. In the case of speech, significant variations in training
set statistics can be caused by variations in the linguistic material
used, and in the identity (in particular the sex), of the speaker. For
these reasons, a special training set containing "phonetically
balanced" linguistic material spoken by both male and female speakers,
was
constructed, as is standard practice in evaluations of speech coders
[IEEE 69]. This training set, consisting of Sentences 1 to 4 in
Appendix E, was used in all GHAQ optimization runs.
It was observed during the development of the GHAQ that there was in
general little difference among the sets of optimum quantizer levels
obtained for training sets containing different (but phonetically
balanced) utterances by different speakers. By contrast, significant
differences were noted between the sets of optimum quantizer levels
when the training set contained only voiced speech or only unvoiced
speech, thus emphasizing the need to use phonetically balanced
utterances.
As a check on the representative nature of the training set, the
performance of the optimized 2-bit GHAQ for this set was compared with
its performance for another input sequence (Sentences 5 to 8 in
Appendix E). The difference between the performance figures obtained
in each case was 0.26 dB for SNR and 0.02 dB for SNRSEG, and in fact
the outside-training-set utterance yielded the higher figures. It is
therefore concluded that the above training set is sufficiently
representative of general speech signals for this application.
7.4.10 Evaluation of the GHAQ optimization procedure
7.4.10.1 Introduction
There are a number of issues which must be addressed in order to
determine the usefulness of the optimization procedure derived in
Section 7.4.7. Firstly, it needs to be established whether and under
what conditions the procedure converges in practice. Due to the
assumption that the probability distributions of c and g are
independent of the quantizer levels, convergence cannot be guaranteed
from the analysis. Secondly, if the procedure converges, does the
point of convergence represent an optimum design, ie. has the objective
function been minimized? Thirdly, if the point of convergence is a
minimum of the objective function, is the minimum global or not?
Fourthly, what are the effects of the GHAQ parameters β and L on the
peak performance obtained?
7.4.10.2 Convergence
The optimization procedure was evaluated with 1-bit and 2-bit versions
of the GHAQ, operating at respective sampling rates of 16 kHz and 8
kHz, and with speech inputs. The procedure was found to converge from
a number of different starting points and for a range of values of β
and L, although some exceptions are described below.
Typical convergence characteristics are shown in Figs. 7.7 and 7.8,
where SNR is plotted against the iteration number i. (The SNRF measure
used in Fig. 7.7 is the SNR calculated after the decoder output
filter.) While convergence is not necessarily monotonic, as can be
seen from Fig. 7.7, the final optima are stable with respect to
further iteration in both cases. Also plotted in these figures is a
convergence measure δ_i, which is defined:

               K-1
δ_i = [ 1/(K-1) Σ  ( q(k)|_i - q(k)|_(i-1) )^2 ]^(1/2)        (7.11)
               k=0

where the vertical bars indicate the iteration at which q(k) is
evaluated. This convergence measure is used to provide a termination
criterion for the optimization procedure, namely δ_i < 0.05.
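The convergence measure and the termination test translate directly
into a few lines; the 1/(K-1) normalization follows the printed
formula, and the function names are mine.

```python
import math

def convergence_measure(q_now, q_prev):
    """delta_i of equation (7.11): the RMS change in the
    reconstruction levels between successive iterations, normalized by
    K-1 as printed in the thesis."""
    K = len(q_now)
    s = sum((a - b) ** 2 for a, b in zip(q_now, q_prev))
    return math.sqrt(s / (K - 1))

def converged(q_now, q_prev, tol=0.05):
    """Termination criterion for the optimization procedure."""
    return convergence_measure(q_now, q_prev) < tol
```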
The GHAQ reconstruction levels corresponding to the first and fifteenth
iterations in Figs. 7.7 and 7.8 are given in Tables 7.1 and 7.2,
Fig. 7.7 Convergence characteristics of the optimization procedure for
the 1-bit GHAQ

Fig. 7.8 Convergence characteristics of the optimization procedure for
the 2-bit GHAQ
b0(n)  b0(n-1)  b0(n-2)  b0(n-3)  | k(n) | q(k), i=0 | q(k), i=15
  0       0        0        0     |   0  |    1.50   |   3.21
  0       0        0        1     |   1  |    1.50   |   2.73
  0       0        1        0     |   2  |    1.50   |   1.63
  0       0        1        1     |   3  |    1.50   |   1.61
  0       1        0        0     |   4  |   -0.50   |   1.02
  0       1        0        1     |   5  |   -0.50   |   0.58
  0       1        1        0     |   6  |   -0.50   |   0.07
  0       1        1        1     |   7  |   -0.50   |   0.16
  1       0        0        0     |   8  |    0.50   |  -0.16
  1       0        0        1     |   9  |    0.50   |  -0.07
  1       0        1        0     |  10  |    0.50   |  -0.58
  1       0        1        1     |  11  |    0.50   |  -1.02
  1       1        0        0     |  12  |   -1.50   |  -1.61
  1       1        0        1     |  13  |   -1.50   |  -1.63
  1       1        1        0     |  14  |   -1.50   |  -2.73
  1       1        1        1     |  15  |   -1.50   |  -3.21

Table 7.1 Reconstruction levels q(k) of the 1-bit GHAQ before
iteration i = 1 and after iteration i = 15 of the
optimization procedure
b0(n)  b1(n)  b0(n-1)  b1(n-1)  | k(n) | q(k), i=0 | q(k), i=15
  0      0       0        0     |   0  |    1.50   |   3.34
  0      0       0        1     |   1  |    1.50   |   1.67
  0      0       1        0     |   2  |    1.50   |   1.86
  0      0       1        1     |   3  |    1.50   |   2.00
  0      1       0        0     |   4  |   -0.50   |  -0.55
  0      1       0        1     |   5  |   -0.50   |  -0.57
  0      1       1        0     |   6  |   -0.50   |  -0.46
  0      1       1        1     |   7  |   -0.50   |  -0.91
  1      0       0        0     |   8  |    0.50   |   0.91
  1      0       0        1     |   9  |    0.50   |   0.46
  1      0       1        0     |  10  |    0.50   |   0.57
  1      0       1        1     |  11  |    0.50   |   0.55
  1      1       0        0     |  12  |   -1.50   |  -2.00
  1      1       0        1     |  13  |   -1.50   |  -1.86
  1      1       1        0     |  14  |   -1.50   |  -1.67
  1      1       1        1     |  15  |   -1.50   |  -3.34

Table 7.2 Reconstruction levels q(k) of the 2-bit GHAQ before
iteration i = 1 and after iteration i = 15 of the
optimization procedure
respectively. The corresponding decision levels can be calculated
straightforwardly using equation (7.8). Although all 16 reconstruction
levels are given in each case for the sake of clarity, the fact that
they are constrained to be symmetrical means that only 8 need be stored
in a practical implementation, as explained in Section 7.4.6.3.
The choice of a starting point for the GHAQ optimization procedure is
an important issue, not only because it affects the number of
iterations required for convergence, but also because in extreme cases
the adaptive quantizer design associated with a particular point will
be unstable (such as when all the reconstruction levels are greater
than unity), with the result that the procedure will not converge. The
starting set shown in Table 7.1 is chosen from intuitive
considerations, by distinguishing those cases in which an increase in
signal variance is indicated from those in which a decrease is
indicated. However, as the memory length increases, this intuitive
approach becomes less useful, and a more rigorous way of finding a
starting set is required.
This issue has been addressed in a study of adaptive vector quantizers,
and a two-phase optimization procedure has been proposed [Chen 87]. In
Phase 1, the quantizer is optimized using unquantized inputs to the
"gain predictor" (equivalent to the syllabic compandor in the GHAQ).
The use of unquantized values ensures that the adaptive quantizer
cannot become unstable during the optimization process, no matter what
starting set is used. Once Phase 1 has converged, its final design is
used as a starting point for Phase 2, in which quantized syllabic
compandor inputs are used. This second phase thus corresponds to the
optimization procedure derived in this thesis, and Phase 1 is a means
of finding a good starting set for the procedure.
Non-convergence of the optimization procedure was observed when an
attempt was made to optimize the 1-bit GHAQ in a delta modulator with a
value of β which was too small and/or a predictor coefficient h_1 which
was too small. This phenomenon may be related to the assumption that
the conditional PDFs in equation (7.7) are independent of the quantizer
levels, which becomes less valid as the quantization noise increases
(eg. when the quantizer resolution is reduced).
Furthermore, there are two feedback loops around the instantaneously
adaptive quantizer in Fig. 7.5, which tend to compensate for the
effects of quantization noise. These loops contain the predictor and
the syllabic compandor, respectively. Reducing the value of the
predictor coefficient reduces the amount of negative feedback applied
by the first loop, and reducing the syllabic compandor coefficient
reduces the "smoothing" action of the second. The effect in both cases
is to perturb the PDFs between iterations of the optimization
procedure, thereby hindering convergence. However, it was found that
with speech inputs convergence always occurred for β > 0.4 and
h_1 > 0.5, and these limits are sufficiently distant from the
respective optimum values (given in Sections 7.4.10.4 and 7.6.3.1), to
be unrestrictive in practice.
7.4.10.3 Design optimality
The optimality of the final design was verified in the cases of Figs.
7.7 and 7.8 by means of a general-purpose search procedure [Nelder 65],
which failed to improve by more than 0.2 dB on the final SNRs shown in
these figures. However, due to the large number of iterations required
by the general-purpose procedure, this verification process was not
extended to other cases.
The question of whether a local minimum is also global is common to all
optimization procedures, and cannot be answered conclusively without
knowledge of the nature of the objective function, which is not
available in this case. However, it is worth noting that an empirical
technique has been suggested for ensuring the location of a global
optimum using the Lloyd-Max optimization procedure [Linde 80]. It
involves the addition of independent noise to the input signal, the
noise variance being gradually attenuated as the procedure progresses.
While this technique has not been evaluated with the GHAQ optimization
procedure, it is likely to be applicable, and represents a possible
area for further work.
7.4.10.4 The effect of β on the performance of the GHAQ

The effect of varying β on the performance of the 1-bit and 2-bit GHAQs
can be seen in Figs. 7.9 and 7.10, respectively. The GHAQ was re-
optimized for each value of β plotted. A first-order predictor was
used, and L was fixed at 4, as negligible interaction was observed
between the respective effects of β and L on performance. It can be
seen that there is a well-defined performance peak in each case,
occurring at approximately β = 0.97 for the 1-bit GHAQ, and β = 0.94
for the 2-bit GHAQ. The difference between the SNR and SNRSEG curves
(up to 1.2 dB) is due to the MSE distortion measure used by the
optimization procedure, which gives more significance to
large-amplitude than small-amplitude signals.
The optimum values of β in Figs. 7.9 and 7.10 correspond to a syllabic
compandor time constant of about 2 ms in both cases, which is somewhat
lower than corresponding values used in other hybrid adaptive
quantizers (5 - 20 ms) [Nasr 84], [Un 81]. This may be attributed to
the fact that the instantaneously adaptive quantizers in the latter
schemes typically have a more rapid response than does the
corresponding quantizer in the GHAQ, implying that a more
rapidly-responding syllabic compandor is required in the GHAQ.

[Figure: SNRF and SNRSEGF (dB) plotted against β from 0 to 1]

Fig. 7.9 Performance of the optimized 1-bit GHAQ vs. β

[Figure: SNR and SNRSEG (dB) plotted against β from 0 to 1]

Fig. 7.10 Performance of the optimized 2-bit GHAQ vs. β
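The correspondence between β and the quoted time constant can be checked with a short calculation. The sketch below assumes the syllabic estimate behaves as a first-order leaky average with leakage factor β (an assumption for illustration; the exact GHAQ recursion is defined earlier in Section 7.4), with the 1-bit coder sampling at 16 kHz and the 2-bit coder at 8 kHz:

```python
# Time constant of a first-order leaky average m(n) = beta*m(n-1) + (1-beta)*|x(n)|.
# Assumption for illustration: the GHAQ syllabic estimator behaves like this filter.
import math

def time_constant_ms(beta, fs_hz):
    # The impulse response decays as beta**n = exp(-n / tau_samples),
    # so tau_samples = -1 / ln(beta); convert samples to milliseconds.
    tau_samples = -1.0 / math.log(beta)
    return 1000.0 * tau_samples / fs_hz

print(round(time_constant_ms(0.97, 16000), 2))  # 1-bit GHAQ at 16 kHz: 2.05 ms
print(round(time_constant_ms(0.94, 8000), 2))   # 2-bit GHAQ at 8 kHz:  2.02 ms
```

Both values are close to the 2 ms quoted above.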
7.4.10.5 The effect of L on the performance of the GHAQ
In Figs. 7.11 and 7.12, the effect on performance of varying the memory
length L is shown for the 1-bit and 2-bit GHAQs. Only even values of L
are shown in Fig. 7.12, because it is required that an integral number
of 2-bit code words are stored in the quantizer memory. Values of L
greater than 8 were not considered, due to the relatively large amount
of storage that would be required for the quantizer levels. As in the
previous section, the GHAQ was re-optimized for each value of L, and a
first-order predictor was used. The optimum values of β found in the
previous section (0.97 and 0.94) were used for the 1-bit and 2-bit
GHAQs, respectively.

The increase in performance with increasing L is as predicted from the
discussion in Section 7.4.7, although negligible improvement is
observable above L = 4. Again, the difference between the SNR and
SNRSEG figures can be attributed to the MSE distortion measure used in
the optimization procedure.
[Figure: SNRF and SNRSEGF (dB) plotted against memory length L]

Fig. 7.11 Performance of the optimized 1-bit GHAQ vs. L

[Figure: SNR and SNRSEG (dB) plotted against memory length L]

Fig. 7.12 Performance of the optimized 2-bit GHAQ vs. L

7.5 The predictor in the primary coder

7.5.1 Introduction

In order to enable a fair comparison to be made among various ADM and
ADPCM coders, it is necessary to ensure that the predictor used in each
case is optimized, as this can have a significant effect on coder
performance. Predictors used in speech coders are usually constrained
to be linear, in the interests of simplicity of analysis and design,
and also from consideration of the speech production process [Jayant
84]. The predictor shown in Fig. 7.1 thus predicts the value of the
input sample x(n) using a linear combination of previous signal
estimates:

    u(n) = Σ (a=1 to A) h_a y(n-a)                               (7.12)

where h_a ; a = 1,2,...,A are the predictor coefficients.

The predicted value u(n) is then subtracted from the input sample x(n),
to form the prediction residual v(n). If the predictor coefficients
are all zero, then the coding system is effectively PCM. If one or
more of the predictor coefficients is non-zero, the system is DPCM, and
a special case of this is DM. For a given predictor order A, predictor
optimization requires the calculation of the optimum coefficients.
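As an illustration of this structure, the following sketch runs the prediction loop of (7.12) with a uniform stand-in quantizer in place of the GHAQ; the quantizer, its step size, and the test signal are illustrative assumptions, not the thesis design:

```python
# Prediction loop of (7.12) with a uniform stand-in quantizer (an
# illustrative assumption; the thesis uses the GHAQ of Section 7.4).
import math

def dpcm_encode(x, h, step=0.1):
    """Return residuals v(n) and signal estimates y(n) for coefficients h."""
    A = len(h)
    y_past = [0.0] * A                 # y(n-1), ..., y(n-A)
    v_out, y_out = [], []
    for xn in x:
        u = sum(h[a] * y_past[a] for a in range(A))   # prediction, (7.12)
        v = xn - u                                    # prediction residual
        vq = step * round(v / step)                   # stand-in quantizer
        yn = u + vq                                   # new signal estimate
        y_past = [yn] + y_past[:-1]
        v_out.append(v)
        y_out.append(yn)
    return v_out, y_out

tone = [math.sin(0.05 * n) for n in range(200)]
v_pcm, _ = dpcm_encode(tone, [0.0])   # all-zero coefficients: effectively PCM
v_dm, _ = dpcm_encode(tone, [1.0])    # first-order prediction: DM-like DPCM
# The residual energy is far lower with prediction on this smooth signal.
```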
7.5.2 An analytic approach to predictor optimization
In order to find optimum values of h_a ; a = 1,2,...,A in (7.12), the
following approximation is usually made:

    u(n) ≈ Σ (a=1 to A) h_a x(n-a)                               (7.13)

The well-known Wiener-Hopf equations [Jayant 84] can then be developed
from (7.13), so that the optimum predictor coefficients may be found
from the long-time-averaged Autocorrelation Function (ACF) of the
encoder input signal. However, the approximation in (7.13) is only
useful when y(n) ≈ x(n), implying that the quantization of the
prediction residual is sufficiently fine, typically requiring 2 or more
bits/sample.
This approximation can therefore not be used to find the optimum
predictor coefficients in delta modulators, except in the special case
where a first-order predictor is used and the first coefficient ρ1 of
the long-time-averaged ACF of the input signal is close to unity
[Jayant 78], implying that the magnitude of the prediction residual is
small in comparison with the signal magnitude.
The above ACF criterion is usually satisfied in speech coding
applications using high-quality microphones with uniform frequency
response. However, when a standard telephone microphone is used, the
resulting pre-emphasis of the input speech reduces ρ1 considerably.
For example, it was found in this project that the same utterance
captured by a high-quality microphone and a telephone microphone had
ρ1 = 0.95 and ρ1 = 0.66, respectively, at a sampling rate of 16 kHz.
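The ACF coefficient quoted here can be computed directly from a captured utterance. The sketch below uses a synthetic AR(1) signal as a stand-in for smooth speech-like material, and a simple differencing filter as a rough stand-in for microphone pre-emphasis (both are illustrative assumptions):

```python
import random

def rho1(x):
    """Normalized first autocorrelation coefficient of the sequence x."""
    m = sum(x) / len(x)
    xz = [v - m for v in x]
    num = sum(xz[n] * xz[n - 1] for n in range(1, len(xz)))
    return num / sum(v * v for v in xz)

# Synthetic stand-ins: an AR(1) process for "smooth" material, and a
# differencing filter as a rough model of telephone-microphone pre-emphasis.
random.seed(1)
ar = [0.0]
for _ in range(4000):
    ar.append(0.95 * ar[-1] + random.gauss(0.0, 1.0))
emphasized = [ar[n] - 0.9 * ar[n - 1] for n in range(1, len(ar))]

print(rho1(ar) > rho1(emphasized))   # True: pre-emphasis lowers rho_1
```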
Another factor to be considered is the interaction which can occur
between the backwardly adaptive quantizer and the predictor [Gibson
78], and which can perturb the optimum predictor coefficients from the
values which might otherwise be expected. For this reason, as well as
because of the coarseness of the quantization process, an alternative
method to (7.13) for finding the optimum predictor coefficients is
required in the case of ADM.
7.5.3 An iterative approach to predictor optimization
In this thesis, predictor optimization was performed for a number of
ADM coders using a multivariable search technique, known as the
"Simplex" Method [Nelder 65], or more accurately, the Flexible
Polyhedron Method (FPM) [Himmelblau 72]. (Details of the coders are
given in Section 7.6.) Because the FPM does not use derivatives of the
objective function, no analysis of the optimization problem is
necessary. The only restriction is that the objective function must be
convex, otherwise a non-global minimum might be found.
The distortion measure used in this case was the MSE, implying an SNR
performance measure, as for the GHAQ optimization procedure. Strictly,
the performance measure applied to oversampled systems should be SNRF
[Jayant 84] (ie. the SNR calculated after filtering at the decoder
output). However, in this work it is assumed that the shape of the
quantization noise spectrum is independent of the predictor
coefficients, implying that optimization of the SNR is equivalent to
optimization of the SNRF. This assumption considerably reduces the
computing time required to perform each optimization, by eliminating
the need to simulate the decoder output filter.
The Flexible Polyhedron Method was found to perform well in this
application, although as it does not estimate derivatives it takes longer to
converge than steepest-descent algorithms. Convergence was observed to
be smooth and unambiguous, at least for the first- and second-order
predictors studied. The optimum predictor coefficients found in each
case are given in Section 7.6.3.1.
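For reference, a minimal flexible-polyhedron (Nelder-Mead) search of the kind used here can be written compactly. The objective below, the mean-squared error of a two-tap predictor on a synthetic tone, is an illustrative surrogate for the full coder simulation used in the thesis:

```python
import math

def nelder_mead(f, x0, steps=300, alpha=1.0, gamma=2.0, beta=0.5, sigma=0.5):
    """Minimal flexible-polyhedron (Nelder-Mead) minimizer for small problems."""
    n = len(x0)
    simplex = [list(x0)] + [
        [x0[j] + (0.1 if j == i else 0.0) for j in range(n)] for i in range(n)
    ]
    for _ in range(steps):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        cen = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]   # centroid
        xr = [cen[j] + alpha * (cen[j] - worst[j]) for j in range(n)]   # reflection
        if f(xr) < f(best):
            xe = [cen[j] + gamma * (xr[j] - cen[j]) for j in range(n)]  # expansion
            simplex[-1] = xe if f(xe) < f(xr) else xr
        elif f(xr) < f(simplex[-2]):
            simplex[-1] = xr
        else:
            xc = [cen[j] + beta * (worst[j] - cen[j]) for j in range(n)]  # contraction
            if f(xc) < f(worst):
                simplex[-1] = xc
            else:                                                       # shrink
                simplex = [best] + [
                    [best[j] + sigma * (p[j] - best[j]) for j in range(n)]
                    for p in simplex[1:]
                ]
    return min(simplex, key=f)

# Surrogate objective: mean-squared error of a 2-tap predictor on a test tone.
x = [math.sin(0.3 * n) for n in range(400)]

def mse(h):
    return sum((x[n] - h[0] * x[n - 1] - h[1] * x[n - 2]) ** 2
               for n in range(2, len(x))) / (len(x) - 2)

h_opt = nelder_mead(mse, [0.0, 0.0])
# For a pure tone the ideal 2-tap predictor is h1 = 2*cos(0.3), h2 = -1.
```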
7.6 Comparative performance tests
7.6.1 Introduction
In order to evaluate the usefulness of the optimized GHAQ, the
performance figures of 1-bit and 2-bit versions were compared with
those of a representative selection of other adaptive quantizers of
similar complexity. This selection included the Constant Factor Delta
Modulation Adaptive Quantizer (CFDMAQ) [Jayant 70], the Hybrid
Companding Delta Modulation Adaptive Quantizer (HCDMAQ) [Un 81], and a
2-bit version of the "Adaptive Quantizer with a One-Word Memory", which
is referred to here as the Jayant Adaptive Quantizer (JAQ) [Jayant 73].
Details of these algorithms are given in Appendix F.
While the Continuously Variable Slope Delta Modulation Adaptive
Quantizer (CVSDAQ) [Jayant 84] is equally well known, it was not included in
the comparison, as it is known to have a limited dynamic range [Un Jan.
80]. By contrast, the dynamic range of the other algorithms can be
made arbitrarily large by setting the step size limits appropriately.
7.6.2 Test conditions
The test signal for the comparisons consisted of Sentences 1 to 4 in
Appendix E. While strictly speaking this signal is "inside" the
training set for the GHAQ, the GHAQ is not given an unfair advantage over
the other adaptive quantizers, because the utterance is designed to be
representative of general speech signals, as discussed in Section 7.4.9.
Both a telephone handset and a high-quality microphone were used for
speech acquisition, in order to allow the effect of each transducer on
the coding algorithms to be assessed. The performance measures used
were SNR and SNRSEG, although for the oversampled DM systems the
corresponding measures SNRF and SNRSEGF [Jayant 84] were also
calculated, as explained in Section 7.5.3. The decoder output filter in
this case had a 6th-order Chebyshev low-pass characteristic, with 0.5
dB passband ripple and a cutoff frequency of 3.4 kHz.
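The two objective measures can be sketched as follows; the 256-sample segment length (16 ms at 16 kHz) is an assumption for illustration:

```python
import math

def snr_db(signal, error):
    """Overall SNR in dB."""
    return 10.0 * math.log10(sum(s * s for s in signal) /
                             sum(e * e for e in error))

def snrseg_db(signal, error, seg_len=256):
    """Segmental SNR: mean of per-segment SNRs (256 samples = 16 ms at 16 kHz)."""
    segs = [snr_db(signal[i:i + seg_len], error[i:i + seg_len])
            for i in range(0, len(signal) - seg_len + 1, seg_len)]
    return sum(segs) / len(segs)

# For a constant-envelope example the two measures coincide:
sig, err = [2.0] * 512, [1.0] * 512
print(round(snr_db(sig, err), 2))     # 6.02 dB
print(round(snrseg_db(sig, err), 2))  # 6.02 dB
```

SNRSEG weights quiet and loud segments equally, which is why it tracks perceived quality more closely than the overall SNR.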
Fixed predictors of order 0, 1, and 2 were used, with the exception
that the zero-order predictor was not used with the DM systems, as this
resulted in coder instability. The optimum predictor coefficients were
found for the DM systems using the Flexible Polyhedron Method, and for
the 2-bit/sample systems from the signal autocorrelation function, as
described in Section 7.5. The 1-bit and 2-bit versions of the GHAQ
were optimized using the procedure described in Section 7.4.7, with
respective values for β of 0.97 and 0.94, and with L = 4 in both cases.
(For the DM coder using the 1-bit GHAQ, the predictor and adaptive
quantizer were optimized simultaneously.)
Step responses were also plotted for each algorithm, in order to allow
comparison of the transient-handling ability of the algorithm with its
SNR performance when coding speech. For these plots, second-order
predictors were used, as this increased the likelihood of tendencies
towards instability being revealed [Gibson 78]. The predictor
coefficients used for the 1-bit/sample coders were h1 = 1.25 and
h2 = -0.31, and for the 2-bit/sample coders h1 = 1.38 and h2 = -0.50. For the
1-bit and 2-bit GHAQs, the quantizer levels which were optimum for
coding high-quality speech were used (see Tables 7.4 and 7.7).
7.6.3 Results for the 1-bit adaptive quantizers
7.6.3.1 The optimum coder parameters
The optimum predictor coefficients for all the 1-bit/sample coders are
given in Table 7.3. It can be seen that the microphone type and the
adaptive quantizer used in each case have a significant effect on the
optimum coefficients. For the high-quality speech input, ρ1 was
calculated to be 0.95, whereas for the telephone speech input it was
0.66. Comparing these values with the optimum h1 for A = 1 in Table
7.3, it is clear that h1 = ρ1 is a good design only when ρ1 ≈ 1 [Jayant 78].
It can also be seen from Table 7.3 that the optimum predictor
coefficients for the 1-bit GHAQ are in general smaller in magnitude
than those for the other two adaptive quantizers. This indicates that
the GHAQ is taking over part of the function of the predictor, by
removing some of the correlation from the input signal. The reason the
GHAQ is able to do this, whereas the other adaptive quantizers are not,
is that it produces an "optimal" prediction of its input signal on the
basis of past observations, ie. it generates a "Bayesian" estimate
[Schwartz 75].

             HIGH-QUALITY SPEECH        TELEPHONE SPEECH
             A=1     A=2                A=1     A=2
             h1      h1      h2         h1      h1      h2
  CFDMAQ     0.95    1.25   -0.31       0.75    0.95   -0.39
  HCDMAQ     0.95    1.34   -0.39       0.80    0.96   -0.33
  GHAQ       0.93    0.96   -0.03       0.55    0.69   -0.35

Table 7.3 Optimum predictor coefficients h_a for the 1-bit/sample
coders with predictor order A = 1 and 2

                                              q(k)
  b0(n) b0(n-1) b0(n-2) b0(n-3)  k(n)  HIGH-QUALITY SPEECH  TELEPHONE SPEECH
                                       A=1      A=2         A=1      A=2
    0      0       0       0       0    3.13     3.16        2.22     2.48
    0      0       0       1       1    2.64     2.67        1.66     1.76
    0      0       1       0       2    1.58     1.59        0.85     0.96
    0      0       1       1       3    1.51     1.53        2.31     2.05
    0      1       0       0       4    1.07     1.07       -0.28     0.47
    0      1       0       1       5    0.55     0.60        0.84     0.86
    0      1       1       0       6    0.02     0.11        0.34     0.42
    0      1       1       1       7    0.07     0.19        1.04     0.65
    1      0       0       0       8   -3.13    -3.16       -2.22    -2.48
    1      0       0       1       9   -2.64    -2.67       -1.66    -1.76
    1      0       1       0      10   -1.58    -1.59       -0.85    -0.96
    1      0       1       1      11   -1.51    -1.53       -2.31    -2.05
    1      1       0       0      12   -1.07    -1.07        0.28    -0.47
    1      1       0       1      13   -0.55    -0.60       -0.84    -0.86
    1      1       1       0      14   -0.02    -0.11       -0.34    -0.42
    1      1       1       1      15   -0.07    -0.19       -1.04    -0.65

Table 7.4 Optimum reconstruction levels q(k) for the 1-bit GHAQ with
L = 4 for predictor order A = 1 and 2
The optimum reconstruction levels for the 1-bit GHAQ are given in Table
7.4, and the effect of the microphone type and of the predictor order
on these levels can clearly be seen. This observation supports the
hypothesis put forward in Section 7.4.1 that 1-bit adaptive quantizers
optimized with high-quality speech are not well matched to telephone
speech.
7.6.3.2 SNR results
Tables 7.5 and 7.6 show that in all cases the 1-bit GHAQ gives an
improvement over the other adaptive quantizers, in terms of SNR(F) and
SNRSEG(F). The performance advantage is most marked for the telephone
speech input, which can be attributed to the fact that the other
adaptive quantizers were designed for signals with a higher degree of
adjacent-sample correlation. (Due to the complex interaction between
the predictor and the adaptive quantizer in a DM system, the
pre-emphasis introduced by the telephone microphone cannot be
completely compensated for by manipulation of the predictor
coefficients alone.) Furthermore, the performance of the 1-bit GHAQ is
seen to be virtually independent of the predictor order in all cases,
which can be attributed to the ability of the GHAQ to perform a degree
of prediction, as mentioned in Section 7.6.3.1.
             HIGH-QUALITY SPEECH            TELEPHONE SPEECH
             A=1           A=2              A=1          A=2
             SNR  SNRSEG   SNR   SNRSEG     SNR  SNRSEG  SNR  SNRSEG
  CFDMAQ    10.2    9.0    11.4   10.2      0.7   1.3    3.6    3.9
  HCDMAQ    11.7   10.2    13.0   11.4      2.5   3.2    3.0    4.2
  GHAQ      13.7   12.0    13.8   12.0      5.7   5.6    6.0    5.7

Table 7.5 SNR and SNRSEG [dB] figures for the 1-bit/sample coders with
predictor order A = 1 and 2

             HIGH-QUALITY SPEECH            TELEPHONE SPEECH
             A=1           A=2              A=1           A=2
             SNRF SNRSEGF  SNRF  SNRSEGF    SNRF SNRSEGF  SNRF SNRSEGF
  CFDMAQ    12.7   11.7    12.9   11.9      3.8   4.7     5.7   6.4
  HCDMAQ    13.7   12.2    13.5   12.1      3.3   4.8     3.4   5.2
  GHAQ      14.6   13.1    14.7   13.1      7.0   7.2     7.2   7.4

Table 7.6 SNRF and SNRSEGF [dB] figures for the 1-bit/sample coders
with predictor order A = 1 and 2

It is to be expected that if the quantization noise in the DM systems
were uniformly spectrally distributed, then filtering the decoder
output signal to 3.4 kHz would give increases in the SNRF and SNRSEGF
figures over the SNR and SNRSEG figures equal to about 3 dB, since the
quantization noise power should be approximately halved. However,
comparing the results in Table 7.5 with those in Table 7.6, it is
apparent that the increase is almost always less than 3 dB in practice,
and furthermore that it is dependent on the type of adaptive quantizer
used.
The increase tends to be greatest for the CFDMAQ, and smallest for the
GHAQ, which can be related to the fact that the CFDMAQ has a faster
transient response than the 1-bit GHAQ (see the step response plots),
implying that the CFDMAQ tends to generate a greater proportion of
high-frequency noise than the 1-bit GHAQ. This suggests that the SNRF
of the 1-bit GHAQ might be improved in oversampled systems by
spectrally redistributing its quantization noise, so that more of the
noise lay outside the signal band. This would require the use of a
frequency-weighted distortion function in the optimization procedure,
and represents an area for further work. Alternatively, the reduced
level of out-of-band noise in DM systems using the 1-bit GHAQ would
justify a relaxation of the roll-off requirements of the decoder output
filter.
7.6.3.3 Step responses
The step response plots are shown in Figs. 7.13 - 7.15. It can be seen
that the CFDMAQ responds more quickly to a step input than does the
1-bit GHAQ optimized for speech. However, the penalty for the rapid
response of the CFDMAQ is relatively severe oscillatory behaviour after
the step level has been reached. The benefit to be gained by
increasing the quantizer memory length can be seen from the response of
the HCDMAQ, which has a memory one bit longer than that of the CFDMAQ.
(The HCDMAQ is essentially an instantaneously adaptive quantizer for
the short period of time defining the overload response.)

[Figure: step response plot, decoded amplitude vs. sample index n]

Fig. 7.13 Step response of the CFDMAQ

[Figure: step response plot, decoded amplitude vs. sample index n]

Fig. 7.14 Step response of the HCDMAQ
As the memory in the 1-bit GHAQ can be made arbitrarily long, it is
reasonable to believe that an overload response similar to that of the
instantaneously adaptive quantizers could be obtained while avoiding
the subsequent oscillatory behaviour, if the quantizer were optimized
for a step input. This possibility is not of particular interest in
the context of speech coding, but would be in video coding
applications, where step-like transients are frequently encountered
[Weiss 75].
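The qualitative behaviour described in this section can be reproduced with a generic constant-factor delta modulator. The sketch below uses Jayant-style step-size multipliers and limits chosen for illustration; it is not the exact CFDMAQ of Appendix F:

```python
def cfdm_step_response(level=100.0, n=300, inc=1.5, dec=0.6,
                       d_min=0.5, d_max=50.0):
    """1-bit delta modulator with constant-factor step-size adaptation."""
    y, delta, prev_b = 0.0, d_min, 1
    out = []
    for _ in range(n):
        b = 1 if level >= y else -1               # 1-bit quantizer decision
        delta *= inc if b == prev_b else dec      # constant-factor adaptation
        delta = min(max(delta, d_min), d_max)     # step size limits
        y += b * delta                            # decoded estimate
        prev_b = b
        out.append(y)
    return out

resp = cfdm_step_response()
# The trace rises rapidly, overshoots the step level, then oscillates
# about it with decaying amplitude, as in Fig. 7.13.
```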
7.6.4 Results for 2-bit adaptive quantizers
7.6.4.1 Optimum coder parameters
The optimum predictor coefficients for the high-quality speech input
were h1 = 0.92 for a first-order predictor and h1 = 1.38, h2 = -0.50
for a second-order predictor. With the telephone speech input, the
optimum coefficients for first- and second-order predictors were
virtually zero, an observation which is also made in [Cattermole 69].
Accordingly, the predictor was omitted in this case.
The optimum reconstruction levels for the 2-bit GHAQ are shown in Table
7.7. The effect of the microphone type and predictor order on the
optimum levels is less pronounced in this case than for the 1-bit GHAQ,
which can be attributed to the reduced interaction between the adaptive
quantizer and the predictor. While the parameters of the 2-bit JAQ are
given in Appendix F, it should be noted that in the specific case of
the telephone speech input, the appropriate step size multipliers are
those specified for DPCM rather than PCM, due to the de-correlation of
the signal performed by the telephone microphone.

                                            q(k)
  b0(n) b1(n) b0(n-1) b1(n-1)  k(n)  HIGH-QUALITY SPEECH       TELEPHONE SPEECH
                                     A=0      A=1      A=2     A=0
    0     0      0       0       0    2.19     3.34     3.43    3.37
    0     0      0       1       1    1.28     1.60     1.67    1.74
    0     0      1       0       2    1.28     1.75     1.79    1.84
    0     0      1       1       3    1.94     1.95     2.07    2.24
    0     1      0       0       4   -0.44    -0.54    -0.66   -0.77
    0     1      0       1       5   -0.45    -0.51    -0.51   -0.53
    0     1      1       0       6   -0.25    -0.42    -0.47   -0.51
    0     1      1       1       7   -1.15    -0.91    -0.77   -0.89
    1     0      0       0       8    1.15     0.91     0.77    0.89
    1     0      0       1       9    0.25     0.42     0.47    0.51
    1     0      1       0      10    0.45     0.51     0.51    0.53
    1     0      1       1      11    0.44     0.54     0.66    0.77
    1     1      0       0      12   -1.94    -1.95    -2.07   -2.24
    1     1      0       1      13   -1.28    -1.75    -1.79   -1.84
    1     1      1       0      14   -1.28    -1.60    -1.67   -1.74
    1     1      1       1      15   -2.19    -3.34    -3.43   -3.37

Table 7.7 Optimum reconstruction levels q(k) for the 2-bit GHAQ for
predictor order A = 0, 1 and 2

           HIGH-QUALITY SPEECH                   TELEPHONE SPEECH
           A=0           A=1           A=2       A=0
           SNR  SNRSEG   SNR  SNRSEG   SNR SNRSEG  SNR  SNRSEG
  JAQ      7.8    7.6    13.1  12.8    14.1  13.0   6.9    7.3
  GHAQ    10.7   10.0    15.0  13.9    15.5  14.4   8.2    8.0

Table 7.8 SNR and SNRSEG [dB] figures for the 2-bit/sample coders with
predictor order A = 0, 1 and 2
7.6.4.2 SNR results
It can be seen from Table 7.8 that the 2-bit GHAQ gives improved
performance in terms of SNR and SNRSEG over the 2-bit JAQ in all cases.
The improvement increases with decreasing predictor order, a trend
which can be attributed to the fact that the optimized GHAQ is able to
perform a degree of prediction, so that it can exploit correlation in
its input signal in order to improve the accuracy of the quantization
process, as mentioned in Section 7.6.3.1. In the PCM system (ie.
without a predictor), significant correlation is present in the
quantizer input signal, whereas in the DPCM system with a second-order
predictor, most of the correlation has been removed.
7.6.4.3 Step responses
It can be seen from Figs. 7.16 and 7.17 that the step response of the
2-bit GHAQ optimized for speech is slower than that of the 2-bit JAQ,
although the discrepancy is less pronounced than for the 1-bit adaptive
quantizers. As in Section 7.6.3.3, the faster-responding system
exhibits more severe oscillations after the step level is reached.
[Figure: step response plot, decoded amplitude vs. sample index n]

Fig. 7.16 Step response of the 2-bit JAQ

[Figure: step response plot, decoded amplitude vs. sample index n]

Fig. 7.17 Step response of the 2-bit GHAQ

7.6.5 Summary and conclusions

It is evident from the results of the comparative tests that the 1-bit
and 2-bit versions of the GHAQ provide the best SNR and SNRSEG
performance with telephone speech inputs. As the implementation
complexity of the GHAQ is of the same order as that of the other
schemes, its use in the primary coding algorithm of the embedded coder
is indicated.

If the primary coding algorithm is ADM, then only a first-order
predictor is required, as negligible improvement results through an
increase in predictor order. However, if it is ADPCM, then the
predictor can be eliminated altogether. It can be seen from Tables 7.6
and 7.8 that the differences in SNR(F) and SNRSEG(F) between the
GHAQ-based ADM and ADPCM coders with telephone speech inputs are small
(about 1 dB). The choice between the two coder types is thus dependent
primarily on the implementation strategy.
In a full-custom or semi-custom integrated implementation, the ADM
coder has the advantages of a simpler quantizer and a higher sampling
rate than the ADPCM coder. Specifically, the quantizer may be
implemented as a single analog comparator, and the high sampling rate
eases the roll-off requirements of the pre-sampling and reconstruction
filters. These two advantages of delta modulation have led to its use
as an initial digitizing technique in 64 kbps PCM codecs, the DM code
being digitally converted to PCM format prior to transmission [Sakane
78].
On the other hand, the fact that the ADPCM coder operates at 8 kHz
means that a commercial PCM codec chip can be used to perform A/D and
D/A conversion of the signal in a board-level implementation, leaving
the embedded coder itself with an all-digital transcoding task. While
this means that the overall communication link involves two additional
PCM coding operations in tandem with the embedded coder, implying that
the peak SNR and dynamic range of the overall link will be limited by
the PCM codecs, this is not likely to be a problem in practice, because
the subjective quality of 64 kbps PCM coded speech is known to be good
[Jayant 84].
As the original codec implementation strategy was based on semi-custom
integration, the use of an ADM coder was favoured initially. However,
during the course of the project, an alternative strategy based on a
programmable digital signal processor was adopted (for the reasons
given in Chapter 8). Accordingly, the primary coding algorithm as
finally specified is ADPCM (strictly, APCM), using the 2-bit GHAQ
optimized for telephone speech inputs.
7.7 Development of the secondary coding algorithm
7.7.1 Introduction
The function of the secondary coding algorithm in the embedded coder is
to encode the quantization noise (or "residual" signal) produced by the
primary coding algorithm. Furthermore, it must do this while preserv
ing the embedded code hierarchy described in Section 7.2.6.1, and with
the code format described in Section 7.2.6.3. Given that the primary
coder uses an ADPCM structure, with a sampling rate of 8 kHz, it is
reasonable to constrain the secondary coder to be of a similar form, in
order to facilitate implementation.
7.7.2 Selection of the coding technique
Since the degree of average correlation in the quantizer input signal
of an optimized ADPCM coder is low by definition, the quantization
noise will be similarly uncorrelated. This implies that a predictor is
not required in the secondary coder, so that it is only necessary to
choose a "secondary" quantizer. Furthermore, as the residual signal
will be non-stationary (due to the non-stationarity of the input
signal), the secondary quantizer should be adaptive.
One proposal for an explicit noise coding system makes use of a
logarithmic secondary quantizer (ie. a 64 kbps PCM codec) [Tierney 81].
However, this has the disadvantages that it does not produce a
strictly embedded code (as is explained in Section 7.7.4), and is not
necessarily well matched to the residual signal statistics, either
locally or in the long term.
Another proposal uses a block coding approach to the quantization of
the residual signal [Jayant 83]. While this produces substantial
performance improvements over sequential techniques, it does so at the
expense of considerable extra complexity. Furthermore, as mentioned a
number of times in this thesis, sequential coding is preferred to block
coding in this application.
Another possibility is to use the GHAQ in the secondary coder as well
as the primary coder. However, as speech quality is expected to be
good at the "secondary" coding rates of 32 kbps and higher, a simple
variance-adaptive quantizer is considered adequate. Given this choice,
there are three main design tasks: to find a suitable way of adapting
the secondary quantizer, to ensure that its output code is embedded,
and to maximize its SNR. These issues are discussed below.
7.7.3 Adaptation of the secondary quantizer
The secondary quantizer is adapted by scaling its levels by a suitable
factor, in order to account for changes in the local variance of the
residual signal. The scaling factor cannot be obtained recursively
from the output of the secondary quantizer itself, because there is no
guarantee that the decoder will have the same information available, as
some of it may be discarded by the channel. The scaling factor must
therefore be derived from some quantity in the primary encoder which is
reasonably well correlated with the variance of the residual signal.
Since the GHAQ attempts to preserve an approximately constant SNR,
regardless of the input signal variance, it is to be expected that the
magnitude of its quantization noise will be relatively well correlated
with the signal magnitude. As the GHAQ syllabic factor is an estimate
of the local average magnitude of the input signal, it thus represents
a reasonable scaling factor for the secondary quantizer. However, the
magnitude of the GHAQ output level s(n) is even better in this respect,
as it is an estimate of the most recent input sample, rather than of
the local average, and is therefore less likely to lead to momentary
overload of the secondary quantizer. For this reason, |s(n)| is chosen
as the scaling factor for the secondary quantizer.
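The scaling arrangement can be sketched as follows. The 2-bit level set is a placeholder (the optimized levels are derived in Section 7.7.5); the point is only that encoder and decoder both derive the scale from |s(n)|, which is available at the decoder even when secondary bits are discarded by the channel:

```python
# Placeholder 2-bit reconstruction levels for the normalized residual;
# the actual optimized levels are derived in Section 7.7.5.
LEVELS = [-1.5, -0.5, 0.5, 1.5]

def secondary_encode(r, s_mag):
    v = r / s_mag                 # normalize by |s(n)| from the primary coder
    return min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - v))

def secondary_decode(k, s_mag):
    return LEVELS[k] * s_mag      # the decoder rescales identically

r, s_mag = 0.3, 0.8
k = secondary_encode(r, s_mag)
r_hat = secondary_decode(k, s_mag)
# |r - r_hat| is bounded by |s(n)| times half the level spacing.
```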
Fig. 7.18 shows the structure of the embedded encoder based on explicit
noise coding. (A predictor is shown in the primary coder in this
figure to maintain generality, although it is not used in the imple
mentation described in Chapter 8, due to the fortuitous observation
that the optimum coefficients are zeros, as mentioned in Section
7.6.5.) The dependence of the secondary quantizer on the primary one
is indicated by means of the dashed linkage in Fig. 7.18, and the
output code word w(n) is shown as being formed from its primary and
secondary components, namely wp(n) and ws(n), respectively.

[Figure: block diagram of the embedded encoder: the primary quantizer
and predictor form the usual feedback loop around the prediction
residual v(n); a secondary quantizer, scaled from the primary output
s(n) via the dashed linkage, codes the residual r(n); the primary and
secondary code words wp(n) and ws(n) are combined into w(n)]

Fig. 7.18 The embedded encoder based upon explicit noise coding

7.7.4 Embedded code generation

One criterion which must be satisfied by an embedded code is that it
should result in successively better approximations to the input sample
as additional code fragments become available to the decoder [Tzou 86].
This is clearly satisfied for the 16 and 64 kbps rates by the structure
of the explicit noise coding scheme, ie. reception of all 4 fragments
of a particular code word must result in a better estimate of the input
sample than if only 1 fragment is received. However, as the embedded
coder must also support intermediate rates of 32 and 48 kbps, it is
necessary to ensure that the output code of the secondary quantizer
also satisfies the above criterion, and it is not immediately clear
that it will.
It has been shown that in order for the output code of a multi-bit
quantizer to be embedded, the decision levels for quantization with F
fragments must form a subset of the set of decision levels for
quantization with (F+1) fragments [Tzou 86]. This condition is also
known as "threshold alignment", because the decision levels in the
lower-resolution quantizer are aligned with those in the
higher-resolution one [Tzou 86]. The condition is not satisfied in
general when the F-fragment and (F+1)-fragment quantizers are
independently optimized for maximum SNR. An exception is when both
quantizers are constrained to have uniform level distributions,
although this distribution is optimal for uniform signal PDFs only, and
is therefore not appropriate for speech coding [Jayant 84].
There is thus a conflict between the desire to maximize the SNR for all
coding rates, requiring independent sets of decision levels for each
rate, and the desire for an embedded output code, requiring a single
set of decision levels for all rates. A compromise must therefore be
reached among the SNR degradations incurred at the various rates when a
single set of quantizer decision levels is chosen. One possibility is
to assign equal amounts of SNR degradation to the highest and lowest
bit rates, with no degradation at a "central" bit rate [Tzou 86].
However, in this project it was decided to maximize the SNR at the
highest coding rate, in order to allow the SNR of 64 kbps PCM to be
approached as closely as possible.
As regards the quantizer reconstruction levels, it is well known from
the theory of optimum non-uniform quantization that these should be
equal to the centroids of the corresponding quantization intervals if
the SNR is to be maximized [Lloyd 57], [Max 60]. However, the location
of the quantization intervals will depend on the number of fragments of
the embedded code which are received, thus causing a corresponding
variation in the location of the centroids. This implies that separate
sets of reconstruction levels should be used for the various bit rates,
and the use of a single set as in [Tierney 81] is in general
sub-optimal.
7.7.5 Optimization of the secondary quantizer
7.7.5.1 The optimization procedure
According to the discussion in the previous section, optimization of
the secondary quantizer requires a single "compromise" set of decision
levels and four sets of reconstruction levels (one for each rate) to be
found. This can be done using the standard Lloyd-Max optimization
procedure [Lloyd 57], [Max 60], although a "weighted" objective
function must be used to take account of the quantizer scaling factor
obtained from the primary coder [Chen 87]. The optimum decision and
reconstruction levels of the secondary quantizer are therefore given
by:

    p(k) = 1/2 [ q(k) + q(k-1) ]              ; k = 1,2,...,K-1    (7.14)

    q(k) = E{ r |s|² | k } / E{ |s|² | k }    ; k = 0,1,...,K-1    (7.15)

where : p(k) is the decision level associated with level index k
        q(k) is the reconstruction level associated with level index k
        r is the residual signal from the primary coder
        s is the output level from the GHAQ in the primary coder
It should be noted that in this case k is the standard quantizer level
index associated with memoryless quantizers [Jayant 84], and is not
related to the level index used in the GHAQ, as described in Section
7.4.6.3. It is assumed that the residual signal has a symmetrical PDF
about a mean of zero, implying that the optimum quantizer levels will
exhibit similar symmetry.
In order to maximize the SNR at the 64 kbps rate (ie. K = 32), it is
necessary to find the sets of optimum decision and reconstruction
levels for this rate first. Once this has been done, the set of
optimum decision levels is used to calculate sets of reconstruction
levels from (7.15) for the 48 and 32 kbps rates (ie. K = 8 and K = 2).
Due to the zero-mean assumption for the residual signal, the optimum
reconstruction level for the 16 kbps rate is zero.
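The procedure can be illustrated with a toy example: one shared set of fine decision levels, coarse indices formed by dropping code fragments (so coarse decisions are a subset of the fine ones, i.e. threshold alignment), and per-rate reconstruction levels computed as conditional centroids. The uniform 3-bit fine quantizer, the unweighted centroid (i.e. (7.15) without the |s| weighting) and the synthetic data are illustrative assumptions:

```python
import bisect

DECISIONS = [-0.75, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75]   # fine (3-bit) thresholds

def fine_index(v):
    return bisect.bisect(DECISIONS, v)                   # index 0..7

def centroids(samples, bits):
    """Reconstruction levels for a rate using the top `bits` of the fine index."""
    shift = 3 - bits
    cells = {}
    for v in samples:
        cells.setdefault(fine_index(v) >> shift, []).append(v)
    return {k: sum(c) / len(c) for k, c in cells.items()}

samples = [n / 1000.0 for n in range(-999, 1000)]        # ~uniform on (-1, 1)
one_bit = centroids(samples, 1)     # coarse rate: centroids near -0.5 and +0.5
three_bit = centroids(samples, 3)   # full rate: centroids near cell midpoints
```

Because the coarse cells are unions of fine cells, their centroids differ from the fine ones, which is why each rate needs its own reconstruction set.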
7.7.5.2 Convergence of the optimization procedure
When a quantizer is optimized with the Lloyd-Max procedure using a
training sequence of input samples, convergence problems may be
encountered when the number of quantizer levels is large. This is
firstly because a finite number of input samples is used, and secondly
because the samples do not change between successive iterations. The
first factor means that the objective function is granular, to a degree
which is determined by the number of samples used. In terms of the
minimization problem, it may be said that the surface of the objective
function is "pitted" with local minima. The second factor means that
these local minima do not shift from iteration to iteration, so that
the procedure may become "trapped" in one of these minima and terminate
spuriously.
While this problem cannot be overcome completely, its effects can be
minimized through the choice of a good starting point for the
procedure, which reduces the distance which must be traversed to the
optimum, and therefore decreases the likelihood of spurious
termination. A good starting point also reduces the number of
iterations required for the optimum to be reached, which can be
considerable for multi-bit quantizers.
One way of finding such a starting point is by means of "level
splitting" [Linde 80]. This technique starts with a 1-bit quantizer,
which can be optimized analytically [Jayant 84], and then splits each
reconstruction level to form a 2-bit quantizer. The process of
splitting a level amounts to the addition and subtraction of a small
constant to and from the level, thereby forming two new levels. The
2-bit quantizer is then re-optimized, after which its levels are split
to give a 3-bit quantizer, and so on, until the desired quantizer
resolution is reached.
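The level-splitting procedure may be sketched as follows. (This is an
illustrative Python reconstruction, not the original implementation;
for simplicity it uses the plain conditional mean rather than the
weighted form (7.15), and the synthetic Laplacian training data, the
function names and the splitting constant are assumptions of the
sketch.)

```python
import numpy as np

def lloyd_max(x, q, iters=50):
    """Alternate midpoint decision levels, as in (7.14), and
    conditional-mean reconstruction levels until convergence."""
    q = np.sort(np.asarray(q, float))
    for _ in range(iters):
        p = 0.5 * (q[:-1] + q[1:])          # midpoint decision levels
        k = np.digitize(x, p)
        q = np.sort([x[k == i].mean() if np.any(k == i) else q[i]
                     for i in range(len(q))])
    return q

def split(q, eps=1e-3):
    """Level splitting [Linde 80]: replace each level by q-eps, q+eps."""
    return np.sort(np.concatenate([q - eps, q + eps]))

# grow a quantizer for folded (magnitude) samples from 1 to 4 levels
x = np.abs(np.random.default_rng(0).laplace(size=20000))
q = np.array([x.mean()])        # 1-level optimum: the overall mean
q = lloyd_max(x, split(q))      # 2 levels
q = lloyd_max(x, split(q))      # 4 levels
```

Each splitting step doubles the resolution while starting the next
optimization close to a good solution, which is what limits the risk
of spurious termination in a local minimum.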
7.7.5.3 Results
The level-splitting technique was applied to the procedure described in
Section 7.7.5.1, in order to find the optimum decision and
reconstruction levels for the secondary quantizer. The training
sequence was obtained by applying Sentence 1 in Appendix E to the
primary encoder, and then collecting the residual samples r(n). The
reconstruction levels for K = 32, 8 and 2 are given in Table 7.9, and
the corresponding decision levels can be calculated from (7.14).
Also shown in Table 7.9 are the optimum reconstruction levels
calculated without the constraint that the output code must be
embedded.
EMBEDDED, K = 32:

   k    q(k)       k    q(k)       k    q(k)       k    q(k)
   0    0.0143     8    0.2651    16    0.7074    24    1.5352
   1    0.0454     9    0.2959    17    0.7852    25    1.7357
   2    0.0753    10    0.3310    18    0.8678    26    2.0206
   3    0.1039    11    0.3628    19    0.9605    27    2.1875
   4    0.1336    12    0.4278    20    1.0509    28    2.9429
   5    0.1616    13    0.4889    21    1.1412    29    3.7147
   6    0.1950    14    0.5561    22    1.2488    30    8.7802
   7    0.2322    15    0.6262    23    1.3763    31   17.6459

K = 8:

   k    EMBEDDED q(k)    NON-EMBEDDED q(k)
   0        0.0592            0.0572
   1        0.1841            0.1781
   2        0.3117            0.3083
   3        0.5233            0.5112
   4        0.8303            0.8048
   5        1.1840            1.1806
   6        1.7665            2.0210
   7        3.4271            3.5840

K = 2:

   k    EMBEDDED q(k)    NON-EMBEDDED q(k)
   0        0.2395            0.2275
   1        1.0164            0.9478

Table 7.9 Optimum embedded and non-embedded reconstruction levels q(k)
for the secondary quantizer
While the differences between the corresponding levels are not
substantial in this case, this might not be so for different residual
signal statistics, which might arise if the primary coding algorithm is
modified. It was therefore considered advisable to retain the separate
sets of reconstruction levels in the codec implementation described in
Chapter 8. Performance figures for the embedded coder using these
levels are given in Chapter 9.
7.8 Recovery from bit errors
7.8.1 Introduction
In order to allow the voice call path to include low-integrity wide
area networks, the codec should have a reasonable degree of robustness
to bit errors in the channel. Furthermore, such a feature can assist
the decoder in recovering from missing packets, as is discussed in
Section 7.9, as well as in dissipating the effects of different initial
conditions in the encoder and decoder at the start of a call.
In general, a bit error will cause the decoder output to be perturbed
at one or more sampling times. This error effect can be usefully
separated into "immediate" and "long-term" components. The immediate
effect of a bit error depends on the relative significance of the bit
in the code word hierarchy, which is determined by the shape of the
quantizer characteristic (ie. its level distribution), as well as on
the "weight" assigned to the code word by the coding scheme.
When the quantizer is adaptive, the weight assigned to a code word
changes with time, and is typically correlated with the signal energy.
The "noise" produced by a bit error thus also tends to be correlated in
magnitude with the signal energy, which is desirable if it is required
that a constant SNR is maintained. By contrast, when the quantizer is
fixed, as in a 64 kbps PCM coder, the magnitude of the noise due to a
bit error will in general be uncorrelated with the signal amplitude,
and for example might stand out as a noticeable click during a
low-energy portion of the signal [Jayant 84].
The long-term effects of bit errors are caused by "memory" in the
coding algorithm. In a general predictive coder, memory might be
present in the quantizer adaptation mechanism, the predictor, and the
predictor adaptation mechanism. In each case, the use of memory is
associated with increased bandwidth efficiency. The degree to which
this memory causes long-term error effects is dependent on the memory
length, and the relative weighting of the individual memory elements.
Recursive algorithms tend to be particularly sensitive in this respect,
due to their (theoretically) infinite memory lengths [Jayant 84].
7.8.2 Effects of bit errors on the primary decoder
Referring to Fig 7.5, the output level of the GHAQ in the primary
encoder is given by:
s(n) = d(n) g(n) (7.16)
where d(n) is the quantizer output level and g(n) is the syllabic
factor. The corresponding quantities in the decoder are:
s'(n) = d'(n) g'(n) (7.17)
The immediate effect of a bit error in the channel will be to perturb
d'(n), and thereby s'(n). The magnitude of this perturbation will
depend on g'(n), and hence on the local average magnitude of the
signal. This dependence of the error effect on the signal magnitude is
desirable, as explained in the previous section.
However, since
g'(n+1) = (1-β) |s'(n)| + β g'(n)                        (7.18)
the bit error will also perturb g'(n+1). Furthermore, due to the
recursive form of (7.18), this perturbation will propagate to future
values of g'(n) and s'(n). The only way in which this perturbation
will be removed is if explicit limits are put on g(n) and g'(n), or if
a subsequent channel error happens by chance to have a cancelling
effect.
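This persistence can be seen directly from the recursion: with
s'(n) = d'(n) g'(n), the loop from g'(n) back to itself has unity
average gain whenever the average magnitude of d'(n) is unity, so a
perturbation of the syllabic factor never dies away. A minimal
simulation illustrates this (the value of β and the constant-magnitude
level sequence are arbitrary choices for this sketch, not taken from
the thesis):

```python
BETA = 0.98

def run_decoder(g0, d_seq):
    """Iterate g'(n+1) = (1-β)|s'(n)| + β g'(n), with s'(n) = d'(n) g'(n),
    from initial syllabic factor g0."""
    g = g0
    for d in d_seq:
        s = d * g                       # decoder output level
        g = (1 - BETA) * abs(s) + BETA * g
    return g

# with unit-magnitude levels, a doubled initial syllabic factor
# (e.g. the after-effect of a bit error) persists indefinitely:
d_seq = [1.0] * 500
g_ref = run_decoder(1.0, d_seq)
g_err = run_decoder(2.0, d_seq)         # ratio remains ~2: no recovery
```

The two trajectories never reconverge, which is precisely the
behaviour reported in the listening tests below.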
To determine whether the inability of the GHAQ to recover from bit
errors represented a problem in practice, informal listening tests were
performed. These indicated that the speech quality was unacceptable
with a channel error rate of 1 in 10^3 or worse. Specifically, while
short-term error effects (occasional crackles) were not objectionable,
the long-term effects (perturbations of the average magnitude of the
signal), were particularly noticeable in connection with background
noise, an observation which is explained below.
While a listener might attribute perturbations in the average magnitude
of successive talkspurts to the behaviour of the speaker, the same
reasoning is not applicable to background noise, so that variations in
the level of this noise are perceived as anomalous. The same
observation has been made in the context of systems which eliminate
silence from the transmitted signal, and which must therefore attempt
to match the level of artificially generated noise with that of the
background noise in the original signal [Dvorak 88].
7.8.3 The development of the robust GHAQ
It is clear from the previous section that the inclusion of a specific
error-recovery mechanism in the GHAQ in the primary coder is necessary,
if satisfactory performance is to be obtained with non-ideal channels.
Two such mechanisms which have been successfully applied to other
adaptive quantizers were considered for this purpose. The first is the
"exponential leakage" technique used in the Robust Adaptive Quantizer
(RAQ) [Goodman 75], which is essentially a modified form of the Jayant
Adaptive Quantizer (JAQ) [Jayant 73]. In the RAQ, exponential leakage
is applied to the quantizer step size at each sampling instant. The
quantizer adaptation mechanism is thus given by:
Δ(n) = Δ(n-1)^γ M(w(n-1))                                (7.19)

where : Δ(n) is the quantizer step size
        γ is the leakage constant
        M(w(n-1)) is the step size multiplier
        w(n-1) is the previous quantizer output code word
It has been shown that for γ < 1, the effect of channel bit errors on
the quantizer step size in the decoder is dissipated with time [Goodman
75]. Furthermore, the recovery process is independent of the
statistics of the input signal. However, the increase in
implementation complexity over the JAQ (ie. γ = 1) is significant.
The second error recovery mechanism considered here, which has been
successfully used in an ADM coder, involves the addition of a small
constant to the quantizer step size at each sampling instant [Dodds
80]. This scheme is much simpler to implement than the exponential
leakage algorithm, but has the disadvantage that the speed of recovery
is dependent on the input signal statistics. Specifically, recovery
takes longer when the average magnitude of the signal is large [Dodds
80]. The recovery time can be shortened by increasing the magnitude of
the step size increment, but only at the expense of a reduced SNR
during low-level portions of the input signal.
In this case, the quantizer adaptation mechanism is given by:
Δ(n) = Δ(n-1) M(w(n-1)) + I                              (7.20)

where : Δ(n) is the quantizer step size
        I is the step size increment
        M(w(n-1)) is the step size multiplier
        w(n-1) is the previous quantizer output code word
The recovery mechanism of (7.20) is considered more suitable for this
project than that of (7.19), due to its considerably simpler
implementation. Also, the fact that error rates greater than 1 in 10^3
are not expected means that a relatively small step size increment can
be used, thereby avoiding significant SNR degradation for low-level
signals.
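The two candidate update rules can be compared side by side. (This is
an illustrative Python sketch; the values of γ and I, and the constant
multiplier sequence used in the demonstration, are placeholder choices
rather than parameters from the thesis.)

```python
def step_leaky(delta, mult, gamma=0.98):
    """Eq. (7.19): exponential-leakage step size update [Goodman 75]."""
    return (delta ** gamma) * mult

def step_increment(delta, mult, inc=0.01):
    """Eq. (7.20): additive-increment step size update [Dodds 80]."""
    return delta * mult + inc

# two decoders whose step sizes were driven apart by a channel error
# reconverge under (7.19), independently of the input statistics:
d1, d2 = 2.0, 1.0
for _ in range(500):
    d1, d2 = step_leaky(d1, 1.0), step_leaky(d2, 1.0)
```

The exponentiation in (7.19) is what makes it the more expensive of
the two rules to implement, while (7.20) needs only one extra
addition per sample.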
As the syllabic factor in the GHAQ corresponds to the step size Δ(n) in
(7.20), a "syllabic increment" I is used in this application, instead
of a step size increment. The syllabic increment is added to the input
of the syllabic compandor, as shown in Fig. 7.19, so that the syllabic
factor in the resulting "robust" GHAQ is calculated as follows:
g(n) = (1-β) [ |s(n-1)| + I ] + β g(n-1)                 (7.21)
It is not immediately obvious why the addition of the syllabic
increment should cause the effects of channel errors on the syllabic
factor to dissipate. An intuitive explanation given in [Dodds 80] is
that it causes an increase in the probability of the selection of
quantizer
Fig. 7.19 The robust GHAQ used in a predictive encoder

Fig. 7.20 Performance of the robust 2-bit GHAQ vs. the syllabic
increment I for various channel bit error probabilities
(a) pe = 0  (b) pe = 10^-3  (c) pe = 10^-2.5  (d) pe = 10^-2
reconstruction levels which have magnitudes less than unity. This
causes the feedback path between the output and input of the syllabic
compandor (ie. between g(n) and s(n) in Fig. 7.5) to have an average
"gain" less than unity, so that perturbations of the syllabic factor
are leaked away with time. A more rigorous explanation of error
recovery in the robust GHAQ is given in Appendix B.
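The leakage effect described above can be checked numerically. In the
sketch below (illustrative Python; the values of β, I and the constant
average level magnitude below unity are placeholder choices), two
decoders started from very different syllabic factors converge to the
same steady state, in contrast to the non-robust recursion:

```python
BETA, INC = 0.98, 1.0

def run_robust(g0, d_seq):
    """Iterate the robust recursion (7.21),
    g(n) = (1-β)[|s(n-1)| + I] + β g(n-1), with s(n) = d(n) g(n)."""
    g = g0
    for d in d_seq:
        s = d * g
        g = (1 - BETA) * (abs(s) + INC) + BETA * g
    return g

# with average level magnitude 0.5 < 1, the loop gain is below unity
# and mismatched decoders reconverge toward the fixed point
# I / (1 - 0.5) = 2:
d_seq = [0.5] * 2000
g_a = run_robust(1.0, d_seq)
g_b = run_robust(10.0, d_seq)
```

The fixed point I/(1 - |d|) reached here is the same steady-state
relationship that reappears as the idle channel noise bound (7.22) in
Section 7.8.5.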
7.8.4 Performance of the robust GHAQ
As observed in the previous section, the selection of a suitable value
for the syllabic increment involves a trade-off between the SNR with an
error-free channel on the one hand, and the SNR with channel errors on
the other. Furthermore, the effect of the bit error recovery mechanism
is greater on low-level signals than on high-level signals. The com
plex nature of this trade-off makes it difficult to identify a clear
"optimum" for the syllabic increment. Fortunately, however, the per
formance of the robust GHAQ is not very sensitive to the exact value of
this parameter.
This is illustrated in Fig. 7.20, where SNRSEG is plotted against the
value of the syllabic increment, with the channel error rate as a
parameter. (The units of the syllabic increment are 1/4095 of the
maximum signal magnitude.) SNRSEG, rather than SNR, is used as a
performance measure in this case for two reasons. Firstly, it is known
that SNR is a poor indicator of the quality of ADPCM coders under noisy
channel conditions, and that SNRSEG is significantly better in this
respect [Scagliola 79]. Secondly, the effect of the bit error recovery
mechanism is greater on low-level than on high-level signals, and these
low-level signals tend to be more fairly weighted in the SNRSEG
measure.
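The distinction between the two measures can be illustrated as follows
(a generic Python sketch, not the measurement code used in the thesis;
the 128-sample segment length is an arbitrary choice):

```python
import numpy as np

def snr_db(x, e):
    """Conventional SNR: dominated by high-energy portions of x."""
    return 10 * np.log10(np.sum(x ** 2) / np.sum(e ** 2))

def snrseg_db(x, e, seg=128):
    """Segmental SNR: the mean of per-segment SNRs, so that low-level
    segments are weighted as heavily as high-level ones."""
    vals = []
    for i in range(0, len(x) - seg + 1, seg):
        xs, es = x[i:i + seg], e[i:i + seg]
        vals.append(10 * np.log10(np.sum(xs ** 2) / np.sum(es ** 2)))
    return float(np.mean(vals))

# one loud and one quiet segment with the same noise floor:
x = np.concatenate([np.full(128, 10.0), np.full(128, 1.0)])
e = np.full(256, 0.1)
```

Here the loud segment alone dominates the global SNR, while SNRSEG
averages the 40 dB and 20 dB segments to 30 dB, reflecting the quiet
segment fairly.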
Curve (a) in Fig. 7.20 shows that the performance of the robust GHAQ
with an error-free channel degrades gradually as the syllabic increment
is increased, which can be attributed to increasingly sub-optimal
estimation by the syllabic compandor of the average magnitude of the
signal. The large effect of an error rate of 1 in 10^3 on the
performance of the "non-robust" GHAQ (ie. I = 0) is evident from curve
(b), confirming the need for the bit error recovery mechanism.
At a higher error rate (1 in 10^2.5), the performance improves as the
syllabic increment is increased, until a saturation level is reached,
as can be seen from curve (c). Given that error rates in excess of 1
in 10^3 are not expected in this application, a reasonable choice for
the syllabic increment is between about 1 and 4.
7.8.5 Idle channel noise in the robust GHAQ
Another issue which must be considered in choosing a value for the
syllabic increment is that it determines, in combination with the
smallest quantizer reconstruction level, the minimum output level of
the robust GHAQ. The relationship between these three quantities is
shown in Appendix C to be:
smin = I dmin / (1 - dmin)                               (7.22)

where : I is the syllabic increment
        dmin is the smallest quantizer reconstruction level
        smin is the minimum output level of the robust GHAQ
The minimum output level of the robust GHAQ is of interest because it
determines the "idle channel noise", ie. the decoder output when the
input to the encoder is zero or close to zero. (The GHAQ has a
"midrise" quantizer characteristic, which means that it reconstructs
zero or near-zero amplitudes using a small non-zero level [Jayant 84].)
For example, for the idle channel noise to have an amplitude no greater
than 1, and assuming that the minimum quantizer reconstruction level
dmin is 0.5, then from (7.22) the maximum permissible value of the
syllabic increment I is also 1.
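The worked example above amounts to a one-line rearrangement of
(7.22), which may be written as follows (illustrative Python; the
function name is arbitrary):

```python
def max_syllabic_increment(s_min, d_min):
    """Rearrange (7.22), s_min = I * d_min / (1 - d_min), to give the
    largest increment I for a target idle channel noise amplitude."""
    return s_min * (1.0 - d_min) / d_min

# an idle channel noise bound of 1 with d_min = 0.5 permits I <= 1:
limit = max_syllabic_increment(1.0, 0.5)
```

Together with the performance curves of Fig. 7.20, this bound narrows
the useful range of the syllabic increment from both directions.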
7.8.6 The effects of bit errors on the secondary decoder
As the secondary decoder itself contains no memory, bit errors which
corrupt secondary bits in the embedded code words will have only
short-term (spike-like) effects on the secondary decoder output. The
magnitude of these spikes will depend on the significance of the
affected bit within the code word hierarchy, as well as on the weight
assigned
to the code word by the quantizer scaling factor, as discussed in
Section 7.8.1. While it is possible to use smoothing techniques to
remove spikes from the decoder output [Jayant 84], this is unlikely to
be worthwhile at the error rates expected in this case.
Bit errors which corrupt primary bits of the embedded code will also
affect the secondary decoder output, through perturbations in the GHAQ
output level s'(n), which is used to adapt the secondary quantizer.
However, making the GHAQ robust to such errors, as described in Section
7.8.3, will also reduce their effects on the secondary decoder output.
7.9 Recovery from missing packets
7.9.1 Introduction
A packet may be missing from the received stream due to being silent or
lost. In either case, it is the task of the packet fill-in algorithm
to replace the missing packet with substitute code, so that the decoder
sees a continuous code stream. However, a side-effect of the missing
packet is that with adaptive coding algorithms the decoder is likely to
lose track of the encoder, as in the case of channel bit errors.
Whereas bit errors may occur at arbitrary points in the code stream,
missing packets may only occur at known points, corresponding to packet
boundaries. Furthermore, whereas bit errors may be scattered, and
hence cause "diffused" mistracking effects, missing packets are
equivalent to concentrated error bursts of considerable duration
(typically 10 - 20 mS), implying the probable loss of a significant
amount of tracking information.
While mechanisms developed to enable adaptive coding algorithms to
recover from channel bit errors will also lead to eventual recovery
from missing packets, this may take a substantial length of time. For
example, using the Robust Adaptive Quantizer (RAQ) with exponential
leakage (described in Section 7.8.3), a burst of 60 bit errors in 255
successive code words can cause a decoder error which takes a further
400 sample periods to dissipate [Yatrou 88]. (While the coder in this
example also makes use of backward adaptive prediction, it is stated in
[Yatrou 88] that the decoder error is attributable mainly to
mistracking in the adaptive quantizer.) When the code words in a packet
are
missing altogether (as opposed to merely corrupted), the potential
error effect is even greater.
7.9.2 The effect of missing packets on the embedded decoder
In the embedded decoder of this project, information related to the
signal history is present in the primary decoder only. Specifically,
it is contained in the bits b'0(n-1) and b'1(n-1) in the GHAQ memory,
and in the syllabic factor g'(n). The absence of b'0(n-1) and b'1(n-1)
after a missing packet is equivalent in the worst case to a pair of bit
errors. Assuming a packet size of 256 bits (16 mS at 16 kbps) and a
missing packet rate of 1 in 100, this amounts to a bit error rate of
about 1 in 10^4. However, according to the results of Section 7.8.4,
this will cause little degradation of the performance of the robust
version of the GHAQ.
Of more concern is the probable loss of tracking between g(n) in the
encoder and g'(n) in the decoder. If a packet is missing because it
was eliminated due to being silent, then the mismatch between g(n) and
g'(n) is likely to be negligible if g'(n) is simply held fixed for the
duration of the missing packet. However, when the packet is missing
due to being lost, then the mismatch produced by this strategy might be
considerable. As a packet period of 16 mS corresponds to 128 sampling
periods of 125 uS, it is possible for the encoder syllabic factor to
change substantially in magnitude over the course of a missing packet.
For example, in the 14 sampling periods taken to reach the step level
in the step response of Fig. 7.17, g(n) increases by a factor of about
3.
7.9.3 A mechanism for recovering from missing packets
The fact that missing packet effects can only manifest at packet
boundaries, rather than at arbitrary points in the code stream, makes
it feasible to transmit side information in each packet which will
assist with the re-establishment of tracking in the event that the
preceding packet is missing at the receiver. As it is shown in the
previous section that the most important quantity in this respect is
the syllabic factor in the GHAQ, provision is made for this variable to
be re-initialized in the decoder to a value provided by the receiver's
network voice protocol.
This also requires that the syllabic factor in the encoder is made
available to the transmitter's network voice protocol at the
appropriate time, so that it can be included in each packet. It is
worth noting in
this connection that the syllabic factor is also used by the network
voice protocol for prioritization purposes, as is discussed in Section
7.11.4.
7.10 Packetization issues
The embedded encoder produces code words containing fragments with four
different levels of significance. However, embedded coding is most
useful in a packet switched network if code fragments of equal
significance are grouped together during the packetization process.
This means that network flow control may be performed by discarding
entire packets at a time. As requiring the network voice protocol to
group the code fragments appropriately places a substantial burden on
the workstation's processor, this task is given to the codec instead.
The packetization of the embedded code fragments is illustrated in Fig.
7.21. The specific example used in this figure is a fragment with
significance 0, but the packetization of fragments with significance 1,
2 and 3 is exactly analogous. As can be seen, significance 0 fragments
from four successive sampling periods are grouped into a single byte.
These bytes are then formed into a packet of the desired length, so
that the packet contains only fragments of a particular significance.
(An implication of this approach is that the packet period must be
equal to a multiple of four sampling periods, ie. of 500 uS.)
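Since the 8-bit embedded code words divide into four 2-bit fragments
(one per significance level, matching the 16/32/48/64 kbps rates), the
grouping of fragments into bytes can be sketched as follows.
(Illustrative Python; the bit ordering within the byte is an
assumption of this sketch rather than a detail taken from the thesis.)

```python
def pack_significance_bytes(fragments):
    """Pack 2-bit fragments of one significance level, taken from four
    successive sampling periods, into one byte each.
    fragments: ints in 0..3, length a multiple of 4."""
    out = bytearray()
    for i in range(0, len(fragments), 4):
        b = 0
        for f in fragments[i:i + 4]:
            b = (b << 2) | (f & 0x3)    # oldest fragment ends up highest
        out.append(b)
    return bytes(out)
```

A packet then consists of a run of such bytes at a single significance
level, so that flow control can discard the packet as a unit.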
Once this process is complete, the network voice protocol simply has to
make a transmit/discard decision for each packet, using the associated
prioritization and flow control information, and append side
information and packet headers as appropriate.

Fig. 7.21 Packetization of an embedded code fragment with
significance 0
The fact that the packetization process is under the control of the
codec implies that the packet length cannot be varied instantaneously
according to the access delay experienced by each packet. However, at
the end of every packet generation period, the network voice protocol
has a chance to set the length of the next packet, so that long-term
adaptation of the packet length to changing network conditions is still
possible. As described in Section 5.2, this approach is likely to
cause fewer problems in the synchronization process at the receiver
than instantaneous variations in the packet length.
7.11 Prioritization and flow control issues
7.11.1 Introduction
In general, the network voice protocol assigns a priority to a packet
of speech code on the basis of the source activity, the degree of
transmission success achieved by previous packets in the call, and the
relative requirements of data packets. Attention is restricted here to
the first factor, as this is of specific concern in the design of the
codec, which is required to supply the network voice protocol with
suitable prioritization information.
The flow control algorithm uses the packet priority, as well as
information supplied by the load prediction algorithm, to decide
whether an
attempt should be made to transmit the packet. As the formatting of
voice packets to facilitate flow control has already been described in
Section 7.10, no further attention is given here to the flow control
issue. However, implicit in the discussion of the prioritization
algorithm below is the assumption that the flow control algorithm will
discard (as opposed to queue) voice packets which it decides are not
important enough to transmit immediately.
7.11.2 Speech prioritization in DCM systems
Speech prioritization is used in DCM systems not only to allow
efficient concentration through silence elimination, but also to
identify information which can be discarded in overload situations when
variable rate coding is used, as described in Section 7.2.2. Two main
types of prioritization scheme have been described in the literature,
namely "signal-based" and "coder-based" types.
Signal-based schemes use properties of the input signal as the basis of
prioritization. Included in this category are the talkspurt/silence
discrimination schemes used in traditional TASI systems, which are
usually based on a measurement of the local average signal energy
[Drago 78]. More recently, higher resolution prioritization using
spectral properties of the signal has been described [Yatsuzuka 82].
In this approach, the number of bits with which each sample is
quantized is dependent not only on the signal energy, but also on
whether the signal is classified as a "wideband" or a "narrowband"
sound, the former being assigned more bits/sample than the latter. The
reasoning behind this is that wideband sounds are less predictable than
narrowband ones, and are therefore less well coded by ADPCM algorithms
[Yatsuzuka 82].
Coder-based prioritization schemes, on the other hand, vary the coding
rate according to the performance of the coder, rather than according
to the properties of the signal. For example, a common approach is to
attempt to maintain a constant coding SNR for all types of sound
[Langenbucher 82]. Obviously, coder-based prioritization schemes have
the potential to be more effective than signal-based ones, as it is the
decoded signal that the listener actually hears, rather than the
encoder input signal. However, in practice this advantage is dependent
on how well the performance measure used is correlated with perceptual
speech quality.
7.11.3 Fixed-rate performance of the embedded coder
In order to develop a scheme for varying the number of bits assigned to
various portions of the input signal, it is necessary to evaluate the
performance of the embedded coding algorithm at each individual bit
rate. Informal listening tests were therefore conducted, with the
following results:
At 16 kbps, good intelligibility was observed, but distortion was
noticeable in all types of sound. This "roughness" was most
noticeable in sounds with substantial high-frequency content,
particularly fricatives.
At 32 kbps, slight distortion was audible in high-frequency
sounds.
At 48 and 64 kbps, no distortion was audible in any type of sound.
A significant aspect of the above observations is the discrepancy
between the perceptual performance for low and high-frequency sounds.
This might be explained by the fact that the fixed predictor
coefficients (which happen to be zeros in this case) are calculated
using long-term average statistics of the signal, implying that in the
short term significant mismatch may exist between the signal and
predictor.
As voiced (low-frequency) sounds are more prevalent than unvoiced
(high-frequency) sounds in speech [Jayant 84], the long term average
statistics will be dominated by the former, with the result that the
coding of unvoiced sounds will be less accurate [Evci 1981]. However,
informal listening tests with a prioritization scheme which attempted
to maintain a constant SNR for all sounds (as in Section 7.11.2) were
not encouraging, and suggested that some factor other than SNR needed
to be considered.
An investigation of the spectral properties of the signal and coding
noise was therefore undertaken. In Figs. 7.22 and 7.23, the spectra of
the signal and quantization noise are shown for the coding of a voiced
and an unvoiced sound by the primary coder. The voiced sound is the
"AA" in "dark", and the unvoiced sound is the "SH" in "sheet", the
utterance involved being Sentence 1 in Appendix E. It can be seen that
in both cases the noise spectrum is substantially flat, apart from
minor deviations, as is to be expected in APCM systems [Jayant 84].
In the case of the voiced sound (Fig. 7.22), there is a predominance of
low-frequency energy, but it is important to note that at higher
frequencies the noise power does not significantly exceed that of the
signal. (An exception to this is at the extreme high end of the
spectrum, where the signal power decreases rapidly. This is due to the
roll-off above 3.4 kHz of the pre-sampling filter, and as the
reconstruction filter has a similar roll-off, the noise power will be
similarly reduced at the earphone.)
In the case of the unvoiced sound in Fig. 7.23, there is a predominance
of high-frequency energy, and at lower frequencies the signal power
drops considerably below the noise power. It is well known that coding
noise is more objectionable if its power significantly exceeds that of
the signal in any part of the spectrum within the signal band [Jayant
84].

Fig. 7.22 Signal and coding noise spectra for the primary coder with a
voiced sound

Fig. 7.23 Signal and coding noise spectra for the primary coder with
an unvoiced sound

This is the basis of the Noise Feedback Coding (NFC) technique,
in which the quantization noise spectrum is purposely re-shaped so that
it follows that of the signal, and is therefore perceptually "masked"
[Jayant 84]. Referring again to Fig. 7.23, the fact that the noise is
"exposed" in the low frequency region makes it more noticeable than
would otherwise be the case, thus further degrading the speech quality
obtained for unvoiced sounds.
The question naturally arises as to whether the reverse situation might
occur, ie. whether a voiced sound might produce a significant degree of
exposed noise in the high frequency region. This was not found to
occur in practice, an observation which may be attributed to the
pre-emphasis of the input signal performed by the telephone microphone,
which flattens the "low-pass" spectrum conventionally associated with
voiced sounds.
It is thus indicated that the speech coding rate should be increased
for sounds which have a predominance of high-frequency energy, which
will decrease the noise power across the entire spectrum in Fig. 7.23,
thereby reducing the amount of exposed noise. This approach is
consistent with the increased number of bits allocated to unvoiced
sounds in some DCM systems based on ADPCM [Cox 80], [Yatsuzuka 82].
7.11.4 Generation of the prioritization variables
Clearly, an important prioritization variable is the local average
energy of the input signal, as the priority of low-energy pauses must
be made lower than that of high-energy talkspurts. This does not
necessarily imply that there should be a linear relationship between
the signal energy and the transmission priority, as this might
encourage users to speak more loudly in order to claim more
transmission bandwidth! Nevertheless, as far as the codec is
concerned, the generation of some quantity related to the local
average signal energy is required, the exact way in which this
information is used being left to the network voice protocol.
Since the primary coder already generates such a quantity, namely the
syllabic factor of the GHAQ, it makes sense that this is transferred to
the network voice protocol for prioritization purposes. Furthermore,
this fits in with the need to transmit the encoder syllabic factor to
the decoder to aid in the recovery from packet loss, as described in
Section 7.9.3.
The discussion in the previous section indicates that it is also
necessary to discriminate between voiced and unvoiced sounds for
prioritization purposes. A well-known means of doing this is to
estimate the
zero crossing rate of the input signal, a high rate indicating a high
frequency (unvoiced) sound [Rabiner 78]. Although this technique is
only approximate, it is considered sufficient for this application,
particularly since prioritization takes place only on a per-packet
basis, and is therefore inherently limited in precision.
The use of the zero crossing rate for prioritization is also consistent
with the need to assign a higher priority to talkspurts than to
silence, as high-frequency consonants (such as the "s" in "stop")
frequently have a low energy, and are therefore not given a
sufficiently high priority by energy-based schemes [Rabiner 78].
The zero crossing rate estimate is generated by the codec as follows.
Successive output code words from the primary encoder are examined, a
reversal of the sign bit indicating that the signal has crossed the
zero axis. In this case a counter is incremented, but if no sign
reversal is observed, the counter is decremented. The value in the
counter is referred to as the "zero crossing count". Due to the way in
which it is generated, the zero crossing count tends to be close to
zero for voiced sounds and close to its maximum value for unvoiced
sounds, spending little time between these limits.
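The update rule described above can be sketched in C as follows. The saturation of the count at 0 and at a maximum of 255 (the range carried in Byte 5 of the transmit group, Section 8.5.4.2) is an assumption about the implementation, and the function name is illustrative.

```c
#define ZCC_MAX 255  /* assumed maximum; Byte 5 of the transmit group holds 0 - 255 */

/* Update the zero crossing count for one new primary encoder output
   code word.  A reversal of the sign bit indicates that the signal
   has crossed the zero axis. */
int zcc_update(int count, int prev_sign_bit, int new_sign_bit)
{
    if (new_sign_bit != prev_sign_bit) {   /* sign reversal: increment */
        if (count < ZCC_MAX)
            count++;
    } else {                               /* no sign reversal: decrement */
        if (count > 0)
            count--;
    }
    return count;
}
```

Because the count is incremented on crossings and decremented otherwise, it drifts quickly to one extreme or the other, which is the bimodal behaviour noted above.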
The distinct types of behaviour of the two prioritization variables are
shown in Figs. 7.24 and 7.25, which were generated using Sentence 1 in
Appendix E. The syllabic factor g(n) simply follows the local signal
energy as expected, whereas the zero crossing count in Fig. 7.25
discriminates between the significant burst of high frequency energy in
the word "sheet" and the predominance of low frequency energy in most
of the rest of the sentence.
7.11.5 Use of the prioritization variables
The portion of the prioritization algorithm which takes account of the
source activity should be of the following form:
P = f(S,G,Z) (7.23)
where : P is priority
S is the significance of the code fragments in the packet
G is the maximum syllabic factor associated with the packet
Z is the maximum zero crossing count associated with the packet
f(*) is some function
A suitable form for f(S,G,Z) needs to be determined, and is a subject
for future research, although it can be stated at this stage that it
should be an increasing function of S, G, and Z. This will mean that
the highest priority will be assigned to a packet containing the most
significant fragments of code associated with high-level, high-frequency
sounds, and the lowest priority to a packet containing the
least significant fragments of code associated with low-level, low-frequency
sounds. (Maximum values of the syllabic factor and zero
crossing count are specified above, in order to ensure that a high
priority is given to packets in which only a small portion of the code
is associated with a "high-priority" sound.)

[Plot of g(n) against n (0 - 20000) over the sentence "GLUE THE SHEET
TO THE DARK BLUE BACKGROUND"]

Fig. 7.24 Plot of the syllabic factor g(n) over the course of a
sentence

[Plot of the zero crossing count against n over the same sentence]

Fig. 7.25 Plot of the zero crossing count over the course of a
sentence
In addition, the following guidelines are suggested from informal
listening tests:

The best subjective quality is obtained if all portions of the
speech signal are transmitted, including intervals of so-called
silence. This implies that f(S,G,Z) should never produce a priority
of zero.
Transmitting any type of speech sound at a bit rate greater than
48 kbps produces no audible quality improvement. (While this
seems to imply that the 64 kbps coding rate is redundant, this
rate might nevertheless be useful in special cases where the SNR
of the signal is degraded considerably in the transmission
channel, due for example to a large number of asynchronous tandem
coding operations.)

Transmitting silence (ie. background noise) at a bit rate greater
than 16 kbps produces no audible improvement.

Talkspurts (detected by means of the syllabic factor) should be
transmitted at a higher bit rate than silence, an appropriate rate
increment being 16 kbps.

Unvoiced sounds (detected by means of the zero crossing count)
should be transmitted at a higher bit rate than voiced sounds, an
appropriate rate increment being 16 kbps.
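A minimal sketch of a rate assignment consistent with these guidelines is given below. The thresholds G_TALK and Z_UNVOICED, and the function name, are illustrative assumptions, not values from the thesis, which leaves the exact form of f(S,G,Z) to future research.

```c
/* Illustrative thresholds -- assumptions, not values from the thesis. */
#define G_TALK     40   /* syllabic factor level taken to indicate a talkspurt */
#define Z_UNVOICED 128  /* zero crossing count taken to indicate an unvoiced sound */

/* Choose a transmission rate for a packet from the maximum syllabic
   factor G and maximum zero crossing count Z associated with it:
   never less than 16 kbps (so that silence is still transmitted),
   plus 16 kbps for a talkspurt and a further 16 kbps for unvoiced
   sounds, giving at most the 48 kbps beyond which no audible
   improvement was observed. */
int packet_rate_kbps(int g_max, int z_max)
{
    int rate = 16;
    if (g_max > G_TALK)
        rate += 16;
    if (z_max > Z_UNVOICED)
        rate += 16;
    return rate;
}
```

As the embedded coder produces code in 16 kbps layers, such a rate maps directly onto the number of fragment types retained in a packet.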
7.12 Packet voice synchronization and fill-in issues
7.12.1 Synchronization
The embedded coder design places a few minor restrictions on the type
of synchronization algorithm which can be used in the network voice
protocol. Firstly, as the decoder sampling rate is fixed, receive
buffer management must be done on the basis of code insertion and
discarding, rather than sampling rate adjustment. Secondly, because
the coding algorithm is adaptive, samples within a particular packet
are dependent on each other, so that where code is discarded, this
should be either an entire packet or the tail-end of a packet. If the
front-end of a packet is discarded, an error will be produced in the
decoder syllabic factor.
The codec is thus suitable for use with the packet voice
synchronization algorithm described in Section 5.5.3.4, which adapts to the packet
delay statistics on the basis of the observed arrival times, by
increasing or decreasing the packet synchronization delay in units of a
packet period. As mentioned in Section 5.5.4, such an algorithm is
recommended in this application.
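The discard rule implied above can be sketched as follows; the function name is illustrative.

```c
/* Because samples within a packet are interdependent under the
   adaptive coding algorithm, code may be discarded only as an entire
   packet or from the tail-end of a packet; discarding the front-end
   would produce an error in the decoder syllabic factor.  Returns
   the number of samples of the packet to keep. */
int samples_to_keep(int packet_len, int surplus)
{
    if (surplus >= packet_len)
        return 0;                  /* discard the entire packet */
    return packet_len - surplus;   /* otherwise drop the tail-end only */
}
```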
7.12.2 Fill-in
A number of features are included in the codec design in order to
facilitate the implementation of the simpler packet voice fill-in
strategies described in Section 5.6.2 (ie. the insertion of silence,
random noise or the contents of the previous packet). While further
experimentation is required to determine the perceptually optimal
fill-in strategy for the embedded codec with given packet loss statistics,
the features described below provide considerable flexibility in this
regard.
Firstly, the decoder syllabic factor can be explicitly set to a
particular value by the fill-in algorithm, implying that the average
magnitude of the fill-in signal can be easily manipulated. For example, an
appropriate level for random noise can be determined from the syllabic
factor of the previous packet and set accordingly, thereby improving
the matching between the substitute packet and adjacent packets.
If the contents of the previous packet are used as a substitute, then
the decoder syllabic factor can be re-initialized to its value at the
start of the previous packet, to ensure that the signal at the decoder
output is a replica of that associated with the previous packet. If
this were not done, repetition of the code in the previous packet would
in general produce a different decoder output signal, due to the
different initial state of the adaptive quantizer.
Secondly, provision is made to "freeze" (ie. not update) the decoder
syllabic factor for the duration of a packet. This means that the
average magnitude of the decoder output can be held constant,
regardless of the substituted code. This feature can be used to ensure that
digitally synthesized random noise does not cause the decoder syllabic
factor to grow or decay with time.
Thirdly, a decoding rate of "zero" kbps can be selected. This simply
replaces the decoder output with zero-valued samples, regardless of the
actual speech code involved, and means that silence can be inserted
without copying special code into the receive buffer.
Finally, provision is made for the embedded decoding algorithm to be
bypassed, so that the substitute code can be in PCM form. This feature
allows fill-in material to be held and inserted directly as PCM code.
CHAPTER 8 : IMPLEMENTATION OF THE CODEC
8.1 Introduction
In this chapter, a real-time implementation in digital hardware of the
embedded codec developed in Chapter 7 is described. Interface and
control circuitry associated with the transfer of information between
the codec and the network voice protocol is also discussed, and the
format of this information is specified.
8.2 Implementation strategy
It was originally proposed that the codec be implemented by integrating
it onto a number of digital gate arrays and analog uncommitted chips.
At the time of the proposal (January 1984), this seemed the only feas
ible strategy, as general-purpose microprocessors did not have the
speed necessary for real-time speech coding, and the specialized Digi
tal Signal Processors (DSPs) which were beginning to emerge were either
prohibitively priced or too slow for this application.
Since then, however, the cost of DSPs has fallen to the point where
their use in this project became reasonable. Apart from the relative
ease with which complex signal processing functions may be implemented
using these devices, the fact that they are programmable means that
modifications and improvements may be made to the coding algorithm at a
later date.
8.3 An overview of the codec card
The speech coding system is implemented on a printed circuit expansion
card for the IBM PC (Fig. 8.1), referred to here as the "codec card".
A block diagram of the circuitry on the card is given in Fig. 8.2.
Fig. 8.1 Photograph of the codec card
[Block diagram: hook switch, PCM codec/filter, digital signal
processor, control register, DMA controller, memory and memory access
arbitrator, with an interface to the PC]
Fig. 8.2 Block diagram of the circuitry on the codec card
This circuitry performs four essential functions. Firstly, signal
conditioning and conversion is performed by analog amplifiers and a PCM
codec/filter chip. Secondly, the embedded coding algorithm and
associated operations are implemented by a digital signal processor.
Thirdly, the transfer of speech code and associated information between
the codec and the network voice protocol is performed by a
shared-memory interface. Finally, the operation of individual card elements
is co-ordinated by means of a card control/status register.
8.4 Signal conditioning and conversion
The input signal from the handset mouthpiece is high-pass filtered by a
first-order RC filter with a lower cutoff frequency of 190 Hz, to
prevent 50 Hz mains "hum" picked up by the handset lead from reaching
the earpiece via the sidetone path. The signal is then amplified by 35
dB before being fed into the PCM filter/codec IC. Further gain of up
to 20 dB (variable by means of a trimpot) is applied to the signal
within the PCM codec. A fraction of the input signal (variable between
0 and 1 using a trimpot) is also added to the output signal from the
PCM codec/filter chip, in order to provide the sidetone. The gain of
the composite output signal can be varied between -∞ and 0 dB (using a
trimpot) before it reaches the handset earpiece.
The PCM codec/filter chip performs the pre-sampling and reconstruction
filtering, with a standard signal band of 300 - 3400 Hz. Conversion
between analog and digital forms is also performed, and the digital
code is 8-bit, sign-magnitude, A-law PCM.
8.5 The embedded codec
8.5.1 Choice of digital signal processor
The DSP used to implement the embedded codec is the Texas Instruments
TMS320E17, a variant of the TMS32010. Features of this device which
make it particularly suitable for this application are a serial PCM
interface, logarithmic companding hardware, a latched "coprocessor"
port, and a relatively low cost.
The serial PCM interface facilitates the use of a PCM filter/codec chip
for signal filtering and conversion, and the companding hardware allows
log PCM code to be linearized for signal processing purposes without
the use of time-consuming software. The latched coprocessor port
allows the asynchronous exchange of data between the DSP and the card's
shared memory, thus avoiding the reduction in the speed of program
execution which would result if the program had to wait for memory
accesses to be completed.
8.5.2 Program structure and timing
The DSP program is written in TMS32010 assembly language, and is
contained in an EPROM which is part of the device itself. The program
consists of an initialization section, an interrupt handler, and a main
routine, as shown in Fig. 8.3. (Subroutines are avoided because of the
processing overhead associated with the call/return sequence.) The
initialization section is executed when the DSP is reset, and the
interrupt handler is associated with real-time synchronization, as
explained below.
[Flow chart: RESET leads into the initialization section and then the
main routine; the framing pulse interrupt invokes the interrupt
handler]

Fig. 8.3 DSP program flow chart

The main routine contains the transmit and receive functions associated
with a single sampling period, ie. 125 uS. (The "transmit" direction
is defined to be from the mouthpiece to the network, and the "receive"
direction is from the network to the earpiece.) The transmit functions
include fetching a PCM input sample from the serial port and encoding
it into embedded form, as well as generating speech prioritization
information. In addition, a transmit buffer inside the DSP, which is
required to allow formatting of the embedded code as is described in
Section 8.5.4, must be managed. The receive functions include decoding
an embedded code word and sending the result to the PCM codec via the
serial port, as well as managing an internal receive buffer.
Apart from executing the transmit and receive functions, it is also
necessary for the codec to maintain synchronization with the 125 uS
sample "framing pulse" associated with the PCM codec. This obviously
requires that the main routine executes in less than 125 uS, which
corresponds to 625 cycles of the DSP's clock. Synchronization is
maintained by tying the framing pulse to a DSP interrupt, rather than
by means of polling. This approach has the advantage that any
violation of the 125 uS period, due for example to a software modification,
is easily detected.
The 125 uS period during which one sample is encoded/decoded is
referred to as a "sample cycle", and is subdivided into "active" and
"inactive" phases. During the active phase, the transmit and receive
functions in the main routine are executed, and during the subsequent
inactive phase the DSP sits in an idle loop waiting for the next
framing pulse. Thus as the processing time required by the active
phase of the sample cycle tends to 125 uS (see Section 8.5.5), the
duration of the inactive phase will tend to a minimum.
While information is transferred between the DSP and the PCM codec on a
sample-by-sample basis, the exchange of information between the DSP and
the card memory is done on a "group" basis. This is necessary because
of the way that the embedded code words are formatted into bytes, as is
described in Section 8.5.4. A "group cycle" equal to four sample
cycles (ie. 500 uS) is thus defined. During a group cycle, four
successive samples are encoded (and decoded). The code, as well as
associated control information, is formatted into an 8-byte group, which is
stored in an internal transmit buffer in the DSP itself before being
transferred to the card memory. A corresponding operation also takes
place in the receive direction. The composition of the transmit and
receive groups is described in Section 8.5.4.
8.5.3 Arithmetic considerations
8.5.3.1 Fixed-point notation
The fundamental wordlength of the TMS320E17 DSP is 16 bits (this is the
length of most registers and all memory locations), although 32-bit
arithmetic can be performed in the accumulator. Since linearized A-law
PCM code requires 13 bits (including the sign bit) for its
representation, a natural notation is Q3, ie. 13 whole bits and 3 fractional bits
[Texas Instruments 85]. This implies a representational accuracy of
+/- 0.0625 "linear units", where one linear unit is the weight of the
LSBit of the linearized PCM sample. For this reason, most variables in
the coding algorithm are represented in Q3 form. As two's complement
arithmetic is used, conversion to and from the sign-magnitude notation
used by the PCM codec is required.
Some quantities in the coding algorithm, such as the secondary
quantizer levels, require a higher representational accuracy than that
provided by Q3 notation. Careful choice of the notation in these cases
minimizes the number of instructions required to convert non-Q3
products to Q3 form after multiplication operations. For example,
representing the secondary quantizer levels in Q12 notation means that a Q15
product is formed after multiplication by the scaling factor from the
primary coder, which is a Q3 quantity. This 32-bit Q15 product can
then be converted back to a 16-bit Q3 quantity with a single
shift-and-store instruction. As only shifts of 0, 1 or 4 places
can be specified in this instruction [Texas Instruments 85], the need
for care in selecting the notation of the quantities involved is clear.
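The fixed-point arithmetic described above can be illustrated in portable C (on the TMS320E17 itself the final conversion is the single accumulator store-with-shift instruction mentioned above); the macro and function names are illustrative.

```c
#include <stdint.h>

#define Q3(x)  ((int16_t)((x) * 8))     /* 3 fractional bits  */
#define Q12(x) ((int16_t)((x) * 4096))  /* 12 fractional bits */

/* Multiply a Q12 secondary quantizer level by a Q3 scaling factor:
   the 32-bit product carries 12 + 3 = 15 fractional bits (Q15), and
   discarding 12 of them returns the result to 16-bit Q3 form. */
int16_t scale_q12_by_q3(int16_t level_q12, int16_t factor_q3)
{
    int32_t product_q15 = (int32_t)level_q12 * factor_q3;
    return (int16_t)(product_q15 >> 12);
}
```

For example, a quantizer level of 0.5 (Q12) scaled by a factor of 2.0 (Q3) yields 1.0 in Q3 form.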
8.5.3.2 Arithmetic overflow
Arithmetic overflow is avoided through the use of appropriate scaling,
and an explicit upper limit on the syllabic factor in the primary
coding algorithm. Also, the TMS320E17's "arithmetic saturation"
function is enabled, which causes overflow results in the accumulator to be
replaced by the largest number which can be represented.
8.5.3.3 Truncation error
Another issue in the choice of notation is the effect of truncation
error on the long-term operation of the coding algorithm. This is
often important when recursive structures are used in digital signal
processing applications, as truncation errors can then accumulate. In
this project, only the primary coder contains such a structure, namely
the syllabic compandor in the GHAQ. However, in this case the use of
feedback tends to compensate for truncation errors in the same way as
it compensates for quantization errors, as discussed in Section
7.4.10.2. Truncation error is thus not expected to affect the SNR of
the codec significantly, and this was confirmed experimentally by
comparing the performance of the simulated codec (which used floating
point arithmetic) with the DSP version. With the same input sequence
(Sentence 1 in Appendix E), the respective SNR and SNRSEG figures
differed by at most 0.1 dB.
However, in contrast to the above, the ability of the codec to recover
from channel errors is not protected from the effects of truncation by
the use of feedback. As discussed in Section 7.8.3, the "robust" GHAQ
derives its robustness from the addition of the syllabic increment
which is applied to the input of the syllabic compandor. Suitable
values for this increment are shown in Section 7.8.4 to be in the
region of 1 - 4. As this increment is then scaled by a factor (1-B) =
0.06 (see Fig. 7.5), it is in danger of being truncated out of
existence in a Q3 quantity.
For this reason, B is changed in the codec implementation from 0.94 to
0.9375, which according to Fig. 7.10 has negligible effect on the
performance of the 2-bit GHAQ. The term (1-B) then becomes equal to
0.0625, which corresponds to an arithmetic right shift of 4 places, as
opposed to more than 4 in the case of (1-B) = 0.06. In addition, the
syllabic increment is made equal to 2, ensuring that it will not
disappear from a Q3 quantity after being shifted right by 4 places. The
effect of these two measures is thus to improve channel error recovery
in the fixed-point implementation of the robust GHAQ, by reducing
truncation error in the syllabic increment. Appropriate results are
presented in Chapter 9.
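The arithmetic behind these two measures can be checked directly: in Q3 notation one linear unit is stored as 8, and scaling by (1-B) = 0.0625 is a right shift of 4 places, so an increment of 1 linear unit would truncate to zero while an increment of 2 survives. A minimal demonstration (names illustrative):

```c
#include <stdint.h>

#define Q3_PER_UNIT 8   /* one linear unit in Q3 notation (3 fractional bits) */

/* Scale the syllabic increment by (1-B) = 0.0625, ie. an arithmetic
   right shift of 4 places on the Q3 representation. */
int16_t scaled_increment_q3(int16_t increment_linear_units)
{
    int16_t inc_q3 = (int16_t)(increment_linear_units * Q3_PER_UNIT);
    return (int16_t)(inc_q3 >> 4);
}
```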
8.5.4 Code and control information formats
8.5.4.1 Introduction
Information is transferred between the DSP and the card memory in
groups of 8 bytes, one group being transferred in each direction every
500 uS. The interpretation of each byte depends on its position within
the group, and whether the group is associated with the transmit or
receive directions. Furthermore, the interpretation of each transmit
and receive group is affected by the "codec control/status word", which
is described in Section 8.5.4.4.
8.5.4.2 Transmit group structure
Byte 0 : Type 0 code - This byte contains either four embedded code
fragments or one PCM code word, depending on the "Tx PCM"
flag in the codec control word. If this flag is clear, the
byte contains embedded code fragments of significance 0 (ie.
the lowest significance) for samples n-3 to n. If the flag
is set, the byte contains an 8-bit PCM code word
corresponding to sample n-3. These two possible code formats are shown
in Fig. 8.4.

Byte 1 : Type 1 code - As for Byte 0, except that embedded code
fragments have significance 1, and PCM code corresponds to sample
n-2.

Byte 2 : Type 2 code - As for Byte 0, except that embedded code
fragments have significance 2, and PCM code corresponds to sample
n-1.

Byte 3 : Type 3 code - As for Byte 0, except that embedded code
fragments have significance 3, and PCM code corresponds to sample
n.
Byte 4 : Codec control/status word - The upper 4 bits of this byte are
always zero in the transmit group, but the lower 4 bits are
the lower 4 bits of the codec control/status word, described
in Section 8.5.4.4.

EMBEDDED CODE (TX PCM = 0):
b1(n-3) b0(n-3) b1(n-2) b0(n-2) b1(n-1) b0(n-1) b1(n) b0(n)

PCM CODE (TX PCM = 1):
b7(n-3) b6(n-3) b5(n-3) b4(n-3) b3(n-3) b2(n-3) b1(n-3) b0(n-3)

Fig. 8.4 Alternative formats for Type 0 code bytes, depending on the
state of the "TX PCM" flag

[Diagram: in the transmit path, the signal from the PCM codec passes
through a PCM decoder and the embedded encoder to the channel; in the
receive path, code from the channel passes through the embedded
decoder and a PCM encoder to the PCM codec. Switches controlled by the
"RX LOOP", "TX LOOP", "RX PCM" and "TX PCM" flags select loopback
paths and PCM bypass configurations.]

Fig. 8.5 Signal paths inside the DSP which may be reconfigured using
the lower 4 bits of the codec control/status word
Byte 5 : Zero crossing count - This byte contains the zero crossing
count (0 - 255), latched after encoding sample n.

Byte 6 : MSByte of syllabic factor - This byte contains the MSByte of
the encoder syllabic factor (in Q3 notation), latched prior
to encoding sample n-3.

Byte 7 : LSByte of syllabic factor - As for Byte 6, except that the
LSByte is involved.
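The transmit group layout can be summarized as a C structure, with a helper packing the four 2-bit fragments of one significance level into a code byte. The field and function names are illustrative, and the bit ordering assumes the Fig. 8.4 sequence runs from the most significant bit downwards.

```c
#include <stdint.h>

/* Layout of the 8-byte transmit group; field names are illustrative. */
typedef struct {
    uint8_t code[4];        /* Bytes 0 - 3: Type 0 - 3 code bytes        */
    uint8_t control_status; /* Byte 4: lower 4 bits of control/status    */
    uint8_t zcc;            /* Byte 5: zero crossing count (0 - 255)     */
    uint8_t syllabic_msb;   /* Byte 6: MSByte of encoder syllabic factor */
    uint8_t syllabic_lsb;   /* Byte 7: LSByte of encoder syllabic factor */
} tx_group;

/* Pack the four 2-bit embedded fragments of one significance level
   (samples n-3 to n) into a code byte, most significant bits first. */
uint8_t pack_type_byte(uint8_t f_n3, uint8_t f_n2, uint8_t f_n1, uint8_t f_n)
{
    return (uint8_t)((f_n3 << 6) | (f_n2 << 4) | (f_n1 << 2) | f_n);
}
```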
8.5.4.3 Receive group structure
Byte 0 : Type 0 code - This byte contains either four embedded code
fragments or one PCM code word, depending on the "Rx PCM"
flag in the codec control/status word. If this flag is
clear, the byte contains embedded code fragments of
significance 0 for samples n-3 to n. If the flag is set, the byte
contains an 8-bit PCM code word corresponding to sample n-3.
These two formats are the same as those shown in Fig. 8.4 for
the transmit group.

Byte 1 : Type 1 code - As for Byte 0, except that embedded code
fragments have significance 1, and PCM code corresponds to sample
n-2.

Byte 2 : Type 2 code - As for Byte 0, except that embedded code
fragments have significance 2, and PCM code corresponds to sample
n-1.
Byte 3 : Type 3 code - As for Byte 0, except that embedded code
fragments have significance 3, and PCM code corresponds to sample
n.

Byte 4 : Codec control/status word - This byte contains the codec
control/status word, described in Section 8.5.4.4.

Byte 5 : Rate indicator - This byte contains a number from 0 to 4
which indicates to the decoder the number of valid fragments
in the received embedded code words. Equivalently, when
multiplied by 16 kbps, it gives the effective decoding rate.
Thus "0" selects a decoding rate of 0 kbps (resulting in
zero-valued output samples), and "4" selects a decoding rate
of 64 kbps. The byte is ignored if the "Latch rate
indicator" flag in the codec control/status word is clear, or if
the "Rx PCM" flag is set.

Byte 6 : MSByte of syllabic factor - This byte contains the MSByte of
the syllabic factor (in Q3 notation). If the "Latch decoder
syllabic factor" flag in the codec control/status word is
set, the byte is latched into the MSByte of the decoder
syllabic factor prior to the decoding of sample n-3.
Otherwise this byte is ignored.

Byte 7 : LSByte of syllabic factor - As for Byte 6, except that the
LSByte is involved.
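Two helpers sketch how Bytes 5 - 7 of the receive group are interpreted; the function names are illustrative.

```c
#include <stdint.h>

/* Byte 5: the rate indicator (0 - 4) gives the number of valid
   embedded fragments, or equivalently the decoding rate in units of
   16 kbps. */
int decoding_rate_kbps(uint8_t rate_indicator)
{
    return rate_indicator * 16;
}

/* Bytes 6 and 7: recombine the MSByte and LSByte into the 16-bit Q3
   value to be latched into the decoder syllabic factor. */
uint16_t syllabic_factor_from_bytes(uint8_t msbyte, uint8_t lsbyte)
{
    return (uint16_t)(((uint16_t)msbyte << 8) | lsbyte);
}
```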
8.5.4.4 Codec control/status word
The codec control/status word enables code-synchronous control/status
information to be transferred between the codec and the network voice
protocol. It consists of eight single-bit flags, as described below.
There is a corresponding "codec control/status register" in the DSP
which stores the lower 4 bits of the codec control/status word. These
bits determine the internal connections of the codec, as shown in Fig.
8.5, thus allowing test configurations such as signal loops to be set
up.
Bit 0 : "Rx loop" flag (Read/Write) - If this flag is set, the
decoder output is looped back to the encoder input (see Fig.
8.5). This feature can be used for testing purposes. Note
that a delay of one sample cycle (125 uS) is associated with
the loop.

Bit 1 : "Tx loop" flag (Read/Write) - If this flag is set, the
encoder output is looped back to the decoder input (see Fig.
8.5). This feature can be used for testing purposes. Note
that a delay of one sample cycle (125 uS) is associated with
the loop.
Bit 2 : "Rx PCM" flag (Read/Write) - If this flag is clear, bytes 0 -
3 of the receive group are assumed by the codec to contain
embedded code, otherwise they are assumed to contain PCM
code.
Bit 3 : "Tx PCM" flag (Read/Write) - If this flag is clear, bytes 0 -
3 of the transmit group contain embedded code, otherwise they
contain PCM code.
Bit 4 : "Freeze decoder syllabic factor" flag (Read/Write) - If this
flag is set, the decoder syllabic factor is not updated
during each sample cycle, although all other aspects of the
decoder's operation are unchanged.
Bit 5 : "Latch decoder syllabic factor" flag (Write only) - If this
flag is set, the contents of receive group Bytes 6 and 7 are
latched into the decoder syllabic factor in the DSP.
Bit 6 : "Latch rate indicator" flag (Write only) - If this flag is
set, the contents of receive group Byte 5 are latched into
the rate indicator in the DSP.
Bit 7 : "Latch codec control/status word" flag (Write only) - If this
flag is set, Bits 0 - 3 of the codec control/status word are
latched into the codec control/status register in the DSP,
and the operations specified by bits 4-6 are performed.
Otherwise, bits 0-6 are ignored.
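The bit assignments above can be written directly as mask constants; the constant names are illustrative.

```c
/* Bit positions of the codec control/status word flags, as listed above. */
#define RX_LOOP_FLAG         (1u << 0)  /* loop decoder output to encoder input   */
#define TX_LOOP_FLAG         (1u << 1)  /* loop encoder output to decoder input   */
#define RX_PCM_FLAG          (1u << 2)  /* receive Bytes 0 - 3 hold PCM code      */
#define TX_PCM_FLAG          (1u << 3)  /* transmit Bytes 0 - 3 hold PCM code     */
#define FREEZE_SYLLABIC_FLAG (1u << 4)  /* do not update decoder syllabic factor  */
#define LATCH_SYLLABIC_FLAG  (1u << 5)  /* latch Bytes 6 - 7 into syllabic factor */
#define LATCH_RATE_FLAG      (1u << 6)  /* latch Byte 5 into the rate indicator   */
#define LATCH_CONTROL_FLAG   (1u << 7)  /* apply bits 0 - 6 of this word          */
```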
8.5.5 DSP resource usage
RAM usage : 56% (143/256 words)
ROM usage : 14% (580/4096 words)
Processing time : 92% (574/625 clock cycles)
The above processing time applies to the active phase of the sample
cycle (described in Section 8.5.2), and is a worst-case figure, as it
assumes that the least favourable branch is taken at conditional
instructions. As mentioned in Section 8.5.1, transfers between the DSP
and card memory are performed by means of asynchronous latches, so that
slight variations in the times required by these transfers (due to
contention for access to the shared memory) have no effect on the
processing time given above.
8.6 The codec/network voice protocol interface
8.6.1 Introduction
The interface between the codec and the network voice protocol is an
area of shared memory which is physically located on the codec card,
but which is mapped into the PC's memory space. Single-byte memory
accesses made by the DSP and the PC are interleaved by an access
arbitrator, as shown in Fig. 8.2.
This design has the advantage that information can be communicated
between the codec and network voice protocol with a minimum of
processing on the part of the PC, and without slowing down the DSP.
Furthermore, the memory can be made to appear non-contiguous to the codec, a
feature which has benefits explained below. Alternative interface
designs, for example using the DMA controller in the PC (which has only
one spare channel), or program-controlled I/O, would not offer these
features.
8.6.2 Information transfer techniques
Information is transferred between the card memory and the DSP by means
of a DMA controller chip on the card, which can be programmed by the
PC. This device has four DMA channels, of which two are associated
with the transmit direction (channels 0 and 2), and two with the
receive direction (channels 1 and 3). Alternate use is made of the
channels associated with a particular direction, so that at any given
time two channels are "active" and two are "inactive". Special
hardware on the card switches automatically between the active and inactive
channels for each transfer direction, the switching operation being
initiated by a "terminal count" on the associated active channel. A PC
hardware interrupt is also generated at this stage, to inform the
network voice protocol that the changeover has taken place.
The above design facilitates the implementation of an "alternating
buffer" approach to the exchange of information between the codec and
the network voice protocol, thereby avoiding the transfer
synchronization problems which might otherwise occur. As the network voice
protocol is not expected to move the speech code once it is in the card
memory, the exchange of information between this protocol and the
network access controller can be performed by simply passing pointers
to areas of this memory. The speech code can then be read from or
written to the card memory by the network access controller, using
program-controlled or DMA-controlled transfers.
8.6.3 Memory buffers and blocks
The 4 Kbyte card memory is organized into 4 "buffers", each with a size
of 1 Kbyte. The buffers are numbered from 0 to 3, working upwards in
memory. Each buffer is subdivided into 8 "blocks" of 128 bytes. The
blocks in a buffer are numbered from 0 to 7, working upwards in memory.
The significance of the buffers is that they define regions of memory
which the codec treats as non-contiguous. Thus when speech code and
control/status information is written to or read from these buffers by
the codec, successive bytes are associated with successive blocks,
rather than successive memory locations. (This is achieved at a
hardware level by permutation of the address lines of the DMA controller.)
The utility of this feature is explained in the next section.
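The block-major mapping can be sketched as an address computation; this assumes the frame begins at offset 0 of the buffer, and in hardware the same mapping is produced by the address-line permutation mentioned above.

```c
#define BLOCK_SIZE 128  /* bytes per block; 8 blocks per 1 Kbyte buffer */

/* Map byte i (0 - 7) of group g within a frame to its offset in the
   buffer: successive bytes of a group land in successive blocks, so
   each block accumulates bytes of a single type. */
int buffer_offset(int group, int byte_index)
{
    return byte_index * BLOCK_SIZE + group;
}
```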
8.6.4 Information parcels and frames
As explained in Section 8.5.4, the codec works with "groups" of 8 bytes
of speech code and control/status information. If successive bytes in
a group were stored in successive locations in the card memory, then
speech code would be intermingled with control/status information, so
that when the network voice protocol wanted to packetize the code, it
would have to extract it byte by byte, which would require extra
processing. However, as successive bytes in a group are in fact stored in
successive blocks, as shown in Fig. 8.6, each block is filled with a
single type of byte only. For example, referring to Section 8.5.4,
Block 0 contains only Type 0 code bytes, so that when the network voice
protocol wants to create a packet from these bytes, it simply appends a
header.
A collection of bytes of the same type in a particular block is
referred to as a "parcel", and the 8 parcels which are formed in the 8
blocks of a particular buffer constitute a "frame". The frame is the
fundamental unit of data transferred between the codec and the network
voice protocol. In the simplest case, a single frame is associated
with each buffer, so that there are two transmit frames and two receive
frames which are used alternately by the codec and the network voice
protocol. When the end of a transmit or receive frame is reached by
the DSP, the alternate DMA channels are selected, as described in
Section 8.6.2, and an interrupt is sent to the PC.
[Diagram: Bytes 0 - 7 of each 8-byte group are stored in Blocks 0 - 7
of the buffer respectively, so that each block receives bytes of a
single type]

Fig. 8.6 Non-contiguous storage of a group in a buffer

Unlike buffers and blocks, whose locations are determined at the
hardware level, frames (and hence parcels) have starting addresses and
lengths which can be programmed by means of the "base address" and
"word count" registers of the DMA controller. It is obviously
desirable that the parcel length is made equal to the packet length used by
the network voice protocol, in order to facilitate packetization.
There is an upper limit on the parcel length of 128 bytes (ie. the
block size), but as this corresponds to 64 mS of speech, it is unlikely
to be restrictive in practice. With typical packet periods of 16 mS,
it is even possible to store up to 4 frames in a single buffer, as
might be required for example in the adaptive packet voice
synchronization scheme described in Section 5.5.3.4.
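The sizing above follows from the group timing in Section 8.5.2: one byte of each parcel is produced per 500 uS group cycle, so parcel length and packet period are related as sketched below (function names are illustrative).

```c
/* Each parcel byte corresponds to one 500 uS group cycle, so a
   parcel of N bytes spans N/2 mS of speech. */
int parcel_duration_ms(int parcel_bytes)
{
    return parcel_bytes / 2;       /* 128-byte parcel -> 64 mS of speech */
}

int parcel_bytes_for_packet(int packet_period_ms)
{
    return packet_period_ms * 2;   /* 16 mS packet -> 32-byte parcel */
}
```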
The alternating use of the DMA channels means that it is possible for
the PC to re-program the base address and word count registers of
channels while they are inactive. This allows parcel lengths to be
varied "on the fly", as might be required for variable packet length
schemes (Section 5.2). Also, changing the starting address of a frame
allows the "alternating buffer" approach to be overridden when
required. For example, a frame of "dial tone" can be kept permanently in
a buffer and brought into use when required. As another example, a
frame containing speech code can be repeated in the event of packet
loss, as described in Section 5.6.2.
8.7 The card control/status register
The card control/status register co-ordinates the operation of the
devices on the card and provides a code-asynchronous interface between
the codec and the network voice protocol. It contains an 8-bit "card
control/status word", which is composed as follows:
Bit 0 : "Error" flag (Read/Clear) - The function of this flag is to indicate codec error conditions to the network voice protocol, and to disable the codec automatically in the event of such errors. It is set by a PC RESET signal (thereby disabling the codec) and should be cleared before an attempt is made to enable the codec. Once the codec has been enabled, the "Error" flag is set by the codec if a sample cycle error occurs, which means either that the DSP program is not executing sufficiently quickly (due for example to a program modification), or that the DMA controller has failed to respond to a service request. In the latter case, the DMA controller may have been programmed incorrectly, or DMA requests may simply have been masked out, such as at the end of a call. In either case, a PC hardware interrupt is generated, and the "Codec Enable" flag is cleared, thereby disabling the codec.
Bit 1 : "End-of-frame" flag (Read/Clear) - The function of this flag
is to indicate to the network voice protocol when the end of
the current receive or transmit frame is reached by the
codec, by generating a PC hardware interrupt. It should be
cleared during card initialization and upon acknowledgement
of the interrupt. The network voice protocol can determine
which DMA channel caused the interrupt by examining Bits 6
and 7 of the card control/status register.
Bit 2 : "Off-hook" flag (Read/Clear) - The function of this flag is
to indicate to the network voice protocol when the handset is
lifted, by generating a PC hardware interrupt. It should be
cleared during card initialization and upon acknowledgement
of the interrupt.
Bit 3 : "On-hook" flag (Read/Clear) - The function of this flag is to indicate to the network voice protocol when the handset is replaced, by generating a PC hardware interrupt. It should be cleared during card initialization and upon acknowledgement of the interrupt.
Bit 4 : "Analog Loop" flag (Read/Write) - This flag controls the
"loop" function of the 2914 PCM codec/filter chip [Intel 84],
and can be used for testing purposes. When set by a write
operation, the loop is closed. It should be cleared for
normal communications.
Bit 5 : "Codec Enable" flag (Read/Write) - This flag enables and disables the codec by means of the DSP reset line. It is cleared automatically by the "Error" flag, as explained above. It should be cleared during card initialization, and set at the start of a call. It is not necessary to clear it explicitly at the end of a call, as masking out the DMA requests will cause the "Error" flag to do this automatically.
Bit 6 : Tx active channel index (Read only) - This bit indicates which DMA channel is currently being used by the codec for transmit transfers. If it is clear, it indicates that channel 0 is being used, and if it is set, it indicates that channel 2 is being used.
Bit 7 : Rx active channel index (Read only) - This bit indicates
which DMA channel is currently being used by the codec for
receive transfers. If it is clear, it indicates that channel
1 is being used, and if it is set, it indicates that channel
3 is being used.
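The bit assignments listed above can be collected into a short decoding sketch; the constant and function names are illustrative, not taken from the thesis software.

```python
# Sketch: decoding the 8-bit card control/status word described above.
# Bit positions follow the text; helper names are hypothetical.

ERROR, END_OF_FRAME, OFF_HOOK, ON_HOOK = 0, 1, 2, 3
ANALOG_LOOP, CODEC_ENABLE, TX_CHANNEL, RX_CHANNEL = 4, 5, 6, 7

def bit(word, n):
    """Extract bit n of the control/status word."""
    return (word >> n) & 1

def active_dma_channels(word):
    """Return (tx_channel, rx_channel) decoded from bits 6 and 7."""
    tx = 2 if bit(word, TX_CHANNEL) else 0
    rx = 3 if bit(word, RX_CHANNEL) else 1
    return tx, rx

# Example: codec enabled, end-of-frame pending, Tx on channel 2, Rx on 1.
word = (1 << CODEC_ENABLE) | (1 << END_OF_FRAME) | (1 << TX_CHANNEL)
```

Bits 6 and 7 are exactly what the interrupt handler consults to decide which frame just completed, as described under Bit 1.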
8.8 Card configuration options
In order to allow the codec card to be used in a variety of hardware/
software environments, the following hardware features are made user-
configurable:
The location of the card memory in the PC's memory space. The buffer memory can be given a starting address of 8C000, AC000, CC000, or EC000 hex, by means of jumpers J3 and J4. These jumpers are respectively associated with address lines A18 and A17 on the PC's expansion bus, and tying either of them low requires the corresponding address line to be high to select the card memory.
The location of the DMA controller registers and the card control/status register in the PC's I/O space. These can be given a starting address of 300, 320, 340 or 360 hex, by means of jumpers J1 and J2. These jumpers are respectively associated with address lines A6 and A5 on the PC expansion bus, and tying either of them low requires the corresponding address line to be high to select these registers. The DMA controller registers occupy the 16 locations from the starting address upwards, followed by the card control/status register.
The interrupt request line used by the card. This can be selected
from IRQ2, IRQ3, IRQ4, and IRQ5 using jumper J5.
Provided the above configuration options are set correctly, the card
can be used in IBM PC/XT/AT and compatible microcomputers, including
"Turbo" machines, ie. those with accelerated system clocks.
CHAPTER 9 : EVALUATION OF THE CODEC
9.1 Introduction
In this chapter, the codec implementation described in Chapter 8 is evaluated. The results presented here were obtained using test software which is described in Appendix D. The hardware test configuration, which uses a single codec card, is shown in Fig. 9.1. It can be seen that by connecting point "c" to point "d" in this configuration, a loop is completed which allows the coding operations in an end-to-end speech path to be evaluated. In this case the signal undergoes two PCM encoding/decoding operations, as well as an embedded encoding/decoding operation, and the complete process is thus referred to as PCM-EC-PCM coding in this chapter.
In order to calculate signal-to-noise ratios for PCM-EC-PCM coding, a linearly quantized test signal (Sentence 1 in Appendix E) was encoded into 64 kbps PCM form by the test software and then injected into the speech path at point "a" in Fig. 9.1. The output signal was then extracted at point "b", and converted back to linear form by the software for comparison with the input signal. Injection and extraction of the test signal and the looping back of the encoder output to the decoder input were performed using the flags in the lower 4 bits of the codec control/status word, as described in Section 8.5.4.4.
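The SNRSEG figures quoted throughout this chapter can be computed along the following lines; the 128-sample (16 mS) segment length and the function name are illustrative assumptions, not values stated in the text.

```python
# Minimal sketch of a segmental SNR (SNRSEG) calculation: per-segment
# SNRs in dB, averaged over all full segments of the utterance.
import math

def snrseg(original, decoded, seg_len=128):
    """Average of per-segment SNRs (dB); segments with zero signal or
    zero noise energy are skipped."""
    snrs = []
    for start in range(0, len(original) - seg_len + 1, seg_len):
        sig = noise = 0.0
        for i in range(start, start + seg_len):
            sig += original[i] ** 2
            noise += (original[i] - decoded[i]) ** 2
        if sig > 0 and noise > 0:
            snrs.append(10.0 * math.log10(sig / noise))
    if not snrs:
        raise ValueError("no usable segments")
    return sum(snrs) / len(snrs)
```

Unlike a conventional long-term SNR, this measure weights quiet and loud segments equally, which is why it is preferred for speech.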
9.2 Performance comparison with log PCM
As each input sample is coded independently in log PCM systems, it is
possible to obtain a fairly graceful reduction in speech quality by
stripping bits progressively from each code word, and replacing them
with "filler" bits prior to decoding [Goodman 80]. However, because
Fig. 9.1 Codec card test configuration

Fig. 9.2 Performance vs. coding rate : (a) PCM-EC-PCM coding (b) log PCM coding with bit deletion and insertion
the decision thresholds are not aligned in this case, the code is not strictly embedded, as explained in Section 7.7.4. Furthermore, the use of a single set of reconstruction levels for all decoding rates is sub-optimal.
These facts, combined with the use of a non-adaptive quantizer in log
PCM, mean that it is not a viable alternative to the embedded coding
scheme developed in this thesis, as the speech quality it yields at 16
kbps is unacceptable. Nevertheless, it provides a useful reference
point with which to compare the performance of the embedded codec.
In Fig. 9.2, SNRSEG vs. bit rate is shown for PCM-EC-PCM coding (curve "a") and for log PCM with bit stripping and insertion (curve "b"). At a rate of 64 kbps, log PCM yields better performance than PCM-EC-PCM coding, the difference being about 3 dB. This is attributable to the fact that the quantizer decision and reconstruction levels used in the embedded coder are not related to those used by the adjacent PCM coders, so that the two PCM operations are in effect "asynchronous" with respect to each other, ie. it is as if the output of the first one is converted to analog form and then re-sampled by the second. It is known that the SNR degradation in such asynchronous tandems is approximately 3 dB [Jayant 84].
As the performance of both log PCM and PCM-EC-PCM coding is very good
at 64 kbps (as indicated by the SNRSEG figures of over 30 dB) the 3 dB
performance discrepancy at this rate is not important. However, at a
coding rate of 16 kbps, audible distortion is present in the outputs of
both coding schemes, so that the SNRSEG improvement of 8 dB provided by
PCM-EC-PCM coding is significant.
9.3 Dynamic range
As the local variance of speech varies over a wide range (typically 40 dB [Jayant 84]), it is important to establish that the codec has an appropriate dynamic range, which requires that its SNR be measured for a range of input signal powers. However, a problem in this context is the choice of a suitable test signal. In the past, sinusoids and bandlimited Gaussian noise have been used for evaluating the dynamic range of PCM codecs [Jayant 84]. The attractive feature of these signals is that because they may be described exactly, the tests are easily repeatable.
However, as speech codecs become more signal-specific, testing them using non-speech signals becomes increasingly less meaningful. Accordingly, a number of researchers have evaluated the dynamic range of codecs by plotting the SNR for a speech signal which has been attenuated to various degrees [Un Jan. 80], [Nasr 84]. While this represents an improvement on the use of synthetic signals, the SNR figure is still dominated, for all the points plotted, by the portions of the signal which have the largest relative amplitude.
In recognition of these problems, an improved technique for evaluating coder dynamic range has recently been proposed [Chen 87]. It involves the calculation of segmental signal-to-noise ratios, as in the SNRSEG measure. However, instead of averaging the segmental figures across the entire utterance to obtain a single result, they are grouped on the basis of the normalized average signal power during each segment, the average power of the entire utterance being used as a normalizing factor. (For example, the signal-to-noise ratios of all segments with a normalized average signal power between -2.5 dB and +2.5 dB are grouped together.) The average signal-to-noise ratio is then calculated for each group, and the results plotted against the corresponding normalized signal power. These figures give a very precise idea of how the quantizer performs for speech segments having different power levels [Chen 87].
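The grouping procedure of [Chen 87] described above can be sketched as follows; the 5 dB bin width matches the example in the text, while the function and variable names are mine.

```python
# Sketch of the dynamic-range evaluation of [Chen 87]: segmental SNRs
# are grouped into bins of normalized segment power and averaged.

def dynamic_range_curve(seg_powers_db, seg_snrs_db,
                        utterance_power_db, bin_db=5.0):
    """Map normalized-power bin centre (dB) -> mean segmental SNR (dB).
    The average power of the whole utterance is the normalizing factor."""
    groups = {}
    for power, snr in zip(seg_powers_db, seg_snrs_db):
        norm = power - utterance_power_db        # normalized segment power
        centre = round(norm / bin_db) * bin_db   # e.g. -2.5..+2.5 dB -> 0.0
        groups.setdefault(centre, []).append(snr)
    return {c: sum(v) / len(v) for c, v in groups.items()}
```

Plotting the resulting bin averages against the bin centres gives curves of the kind shown in Fig. 9.3.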
In Fig. 9.3, dynamic range curves calculated as described above are
shown, for PCM-EC-PCM coding with various bit rates (curves "b" to
"e"). Also shown is a curve applying to a single 64 kbps PCM coding
operation (curve "a"). It can be seen that at the lower bit rates the
dynamic range of PCM-EC-PCM coding is at least 40 dB, and that it is
limited at the higher rates by the dynamic range of the PCM coder. As
in Section 9.2, the SNR difference between PCM-EC-PCM coding and PCM
coding is about 3 dB at 64 kbps.
9.4 Signal delay
The use of sequential rather than block coding techniques in the em
bedded codec means that the encoding and decoding delays per se are
each equal to a single sampling period (125 uS). However, the format
ting process performed in the DSP to aid in packetization means that
internal buffering must be used, as described in Section 8.5.2. Each
internal buffer stores a "group" of information, corresponding to 4
sampling periods. The total signal delay in the embedded codec is thus
(2 + (4 x 2)) x 125 uS = 1.25 mS, which is wel 1 within the encoding/
decoding delay limit of 4 mS specified in Chapter 6.
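The delay figure above can be restated as worked arithmetic; the constant names are mine.

```python
# Worked restatement of the total signal delay calculation: one sampling
# period at the encoder and one at the decoder, plus a 4-sample group
# buffer at each end of the link.

SAMPLE_PERIOD_US = 125        # 8 kHz sampling rate
GROUP_SAMPLES = 4             # sampling periods per internal "group"

coding_delay = 2 * SAMPLE_PERIOD_US                       # encode + decode
buffering_delay = 2 * GROUP_SAMPLES * SAMPLE_PERIOD_US    # Tx and Rx groups
total_ms = (coding_delay + buffering_delay) / 1000.0      # -> 1.25 mS
```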
9.5 Robustness to bit errors and missing packets
The performance of PCM-EC-PCM coding with a noisy channel was evaluated
by introducing pseudo-random errors into the loop between points "c"
and "d" in Fig. 9.1, and the results are shown in Fig. 9.4. It can be
Fig. 9.3 Dynamic range curves : (a) 64 kbps PCM (b) PCM-EC-PCM coding, EC rate = 64 kbps (c) PCM-EC-PCM coding, EC rate = 48 kbps (d) PCM-EC-PCM coding, EC rate = 32 kbps (e) PCM-EC-PCM coding, EC rate = 16 kbps

Fig. 9.4 Performance vs. channel bit error probability for PCM-EC-PCM coding : (a) EC rate = 64 kbps (b) EC rate = 48 kbps (c) EC rate = 32 kbps (d) EC rate = 16 kbps
seen that the SNRSEG is only slightly degraded for error rates up to about 1 in 10³, but that at higher rates it drops off rapidly. In informal listening tests, bit errors were audible as background "crackling" at an error rate of 1 in 10³, although the intelligibility was not degraded substantially.
Recovery from missing packets is facilitated by allowing the decoder
syllabic factor to be reset to a value transmitted as side information
in each packet, and it was verified that this function was implemented
correctly in the codec.
It is also worthwhile noting that if the decoder syllabic factor is
reset appropriately at the start of each packet, regardless of whether
the previous packet is missing or not, then any perturbation of this
quantity which may have been caused by corruption of the speech code by
bit errors will be eliminated. However, it is also possible that the
side information in a packet may be corrupted, implying that some form
of error detection and/or correction may be needed, as discussed in
Section 3.5. An issue in the design of the network voice protocol is
thus the identification of the precise circumstances in which the
decoder syllabic factor should be reset.
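The per-packet reset behaviour discussed above can be sketched as a decision made at the start of each packet; the function and field names, and the error-detection check, are illustrative assumptions rather than part of the thesis protocol.

```python
# Sketch of the decoder syllabic-factor reset at the start of a packet.
# Resetting unconditionally eliminates drift caused by earlier bit
# errors, but only if the side information itself survived the channel.

def start_of_packet(decoder_syllabic, packet, side_info_ok):
    """Return the syllabic factor to use for decoding this packet.
    'side_info_ok' is a hypothetical error-detection check on the
    packet's side information field."""
    if side_info_ok(packet):
        return packet["syllabic"]   # resynchronize to transmitted value
    return decoder_syllabic         # keep running value if side info suspect
```

Deciding precisely when to trust the side information is the protocol design question identified above.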
9.6 Idle channel noise
In the idle channel state (ie. with no background noise or DC offset),
the first PCM encoder in the PCM-EC-PCM coding operation will produce
code words of 00 hex and 80 hex with equal probability, corresponding
to the use of its innermost quantization levels. With these input
codes, the additional idle channel noise due to the embedded coding operation will be determined by the quantizer levels used in the primary and secondary coding algorithms, the minimum value of the syllabic factor, and the transmission bit rate, as discussed in Section 7.8.5. Since the quantizer levels in the primary coder are selected on the basis of two successive quantizer output words, there are four (ie. 2²) combinations of the two idle channel input codes which must be considered, namely 00 00, 00 80, 80 00, and 80 80 hex.
When the above PCM code combinations were injected at point "a" in Fig. 9.1, the codes shown in Table 9.1 were obtained at point "b". It can be seen that only at the 64 kbps rate are the codes preserved in all cases, although the maximum deviation at lower rates is 1 LSBit. However, as the reconstruction levels corresponding to these code words are very small in log PCM coding (about 60 dB below the largest reconstruction level), the idle channel noise performance of the overall PCM-EC-PCM coding operation is considered satisfactory.
9.7 Transcoding
A number of circumstances in which it may be necessary to transcode the
embedded code words within the network are identified in Chapter 6, and
are discussed here with reference to Fig. 2.1. Firstly, it may be
necessary to communicate with a remote voice terminal equipped with a
PCM coding facility only. In this case, the embedded code is likely to
be converted to PCM form in the local gateway, and then transmitted in
this form to the remote terminal for decoding. Referring to Fig. 9.1,
this is equivalent to inserting an extra channel at point "b", but as
no extra coding operations are involved in the end-to-end speech path,
the performance results of Sections 9.2 to 9.6 apply.
Secondly, it may be necessary to communicate with a remote terminal equipped with an embedded coding facility, but which can only be reached via a 64 kbps PCM connection (ie. embedded coding is used for voice
                             OUTPUT CODES
INPUT CODES    16 kbps    32 kbps    48 kbps    64 kbps
  00 00         01 01      00 00      00 00      00 00
  00 80         01 81      00 81      00 81      00 80
  80 00         81 01      81 00      81 00      80 00
  80 80         81 81      81 81      81 81      80 80

Table 9.1 Idle channel input and output codes of the embedded codec
Fig. 9.5 Tandem coding performance of PCM-EC-PCM coding : (a) 1 coding operation (b) 2 coding operations
traffic within the LANs in Fig. 2.1, but 64 kbps PCM is used within the
WAN). In this case the embedded code must be converted to PCM form at
the local gateway, and back to embedded form at the remote gateway.
This is equivalent to a tandem connection of two PCM-EC-PCM coding
operations.
Fig. 9.5 shows SNRSEG vs. bit rate for this tandem connection (curve "b"), as well as for a single PCM-EC-PCM coding operation (curve "a"). It is assumed that the rates associated with the two embedded coding operations are equal in the tandem case. Clearly, the extra degradation incurred by the tandem connection (at most 1.5 dB) is not substantial at any coding rate. Furthermore, the extra delay in this case need be no greater than the 1.25 mS specified in Section 9.4.
A third situation in which transcoding is required is when the speech code must be converted to linear form for superposition in a conference bridge, located in one of the LAN gateways, or within the WAN itself. If the conference bridge operates with 64 kbps PCM inputs and outputs, then the speech code may need to be converted from embedded form to PCM form and back again. However, the overall speech path is then equivalent to a tandem connection of two PCM-EC-PCM coding operations, and the results of Fig. 9.5 apply. Alternatively, if the conference bridge operates with linear inputs and outputs, then converting the embedded code directly to and from linear form will incur even less degradation than is indicated by Fig. 9.5.
9.8 Subjective quality
As the codec itself represents only a portion of the speech communication link in a packet voice system, formal tests of the subjective speech quality are not appropriate at this stage of the voice/data integration project. Such tests (eg. Mean Opinion Scores [Jayant 84]) will be meaningful only when the combined effects of signal distortion, delay, loss and corruption can be assessed. As explained in Chapters 2, 3, 4, and 5, the levels of these types of degradation are critically dependent on the prioritization, flow control, synchronization and fill-in algorithms in the network voice protocol, as well as on the nature of the network and access controller. The development and optimization of the above algorithms in the network voice protocol, and the evaluation of the subjective speech quality in suitable networks, is thus an area for further work.
9.9 Cost
The cost of the codec card is kept low through the use of a standard
PCM codec/filter chip for signal conversion and filtering purposes.
The most expensive item on the card is the DSP chip, although only one
of these is required for full duplex communications. Other components
(apart from the DMA controller) are mainly MSI or SSI devices. It is
estimated that the total materials cost for the codec card is A$ 200.
CHAPTER 10 : CONCLUSIONS AND FURTHER WORK
10.1 Conclusions
10.1.1 Embedded coding
It is clear that two major advantages of embedded coding in packet
switched networks are that it enables both rapid flow control and
statistical bandwidth compression of speech to be performed. A further
benefit is that it provides an "evolutionary path" for the coding
equipment associated with the transmission of bursty real-time traffic
on packet switched networks, as described below.
Two well-established trends in the field of digital communications are
the falling cost of transmission bandwidth, caused primarily by the
increasing use of optical transmission media, and the falling cost of
signal compression, caused by the increase in the integration density
and speed of digital signal processing chips [Vickers 87]. These two
trends are to some extent in competition, since the reduced cost of
transmission bandwidth makes signal compression less attractive. At
any given time, a trade-off must be made between these two factors, the
optimal solution being the one which minimizes the overall cost of the
system. However, the position of this optimum will shift with time,
unless transmission bandwidth and compression facilities decrease in
cost at exactly the same rate, which is unlikely.
A solution to this problem is to specify a range of bit rates which may
be used by a codec, and then to allow the codec to demand transmission
capacity as required, with the user being charged accordingly. Thus as
sophisticated signal compression techniques become less expensive to
implement, they can be used to provide higher quality at lower
transmission rates, within the same bit rate structure. Similarly, if
transmission costs fall, simpler codecs can be produced which use the
higher rates more frequently. This feature of embedded coding is
likely to be particularly useful in connection with video services, as
the technology associated with the real-time compression of these
signals is less mature than in the case of speech.
10.1.2 Silence elimination
It is pointed out in this thesis that there are a number of disadvantages to the traditional elimination of silence from the transmitted signal in packet voice communications, in terms of its effects on subjective speech quality, packet voice synchronization and packet voice fill-in. A more flexible approach, which assigns transmission priority to the speech signal on a continuous scale, is proposed here. This has none of the disadvantages of silence elimination, but still allows statistical bandwidth compression of the signal to be achieved by exploiting the ability of a packet switched network to provide a variable capacity connection.
10.1.3 Adaptive quantization
It is demonstrated in this thesis that 1-bit and 2-bit versions of the Generalized Hybrid Adaptive Quantizer provide improved performance over a number of other adaptive quantizers currently used in ADM and ADPCM speech coders. This improvement is most marked in the case of ADM with telephone speech inputs. Furthermore, as the GHAQ can be readily optimized in situ to the statistics of its input signal, it can be expected to give improved performance in other speech coding applications as well. Examples are Sub-Band Coding and Adaptive Predictive Coding systems [Jayant 84], which both require the quantization of a "residual" signal which has application-specific statistics.
10.1.4 Interdependence of adaptive quantizers and predictors
It is shown in this work that in predictive waveform coders a considerable degree of interdependence exists between the optimum parameters of backward adaptive quantizers and those of fixed predictors, for given input signal statistics. In particular, when 1-bit quantization is used and the speech signal is pre-emphasized by a conventional telephone microphone, joint optimization of the adaptive quantizer and predictor is required if the best performance is to be obtained. Furthermore, with both 1-bit and 2-bit quantization, sub-optimal prediction can be compensated for to some extent by careful optimization of the adaptive quantizer.
10.1.5 The codec implementation
10.1.5.1 Performance
The embedded coding algorithm, implemented as described in Chapter 8,
yields segmental SNR performance which is similar to that of log PCM at
64 and 48 kbps, and which is considerably better at lower bit rates.
Subjectively, the codec provides good speech quality at rates of 32
kbps and above. At 16 kbps, noticeable distortion is present in
speech, although not in background noise. It is therefore recommended
that under normal operating conditions talkspurts should be transmitted
at a rate of 32 kbps or more, with the 16 kbps rate being used for the
transmission of "silence", and in conditions of momentary channel
overload.
However, as the coding algorithm itself is implemented in software, it is possible that in the future its performance may be improved, as is considered in Section 10.2.1. In particular, an improvement of the performance at the 16 kbps rate will be passed on to the higher rates as well, due to the explicit noise coding structure. In this case it will be desirable to increase the frequency with which the lower rates are used, and this can be done by simply altering the parameters of the prioritization algorithm in the network voice protocol.
The codec provides good robustness to channel bit errors with rates up to 1 in 10³. The inclusion of a facility to prevent the propagation of mistracking due to missing packets means that the acceptable packet loss rate associated with PCM coding (1 in 10²) applies. The total signal delay introduced by the codec is 1.25 mS, which is acceptable, as is the idle channel noise at all bit rates. Finally, transcoding of the embedded code to PCM form for transmission over a 64 kbps channel, and/or conversion to linear form for voice conferencing purposes, results in little signal degradation.
10.1.5.2 Facilities
The organization of the embedded code into 8-bit bytes means that operation with 64 kbps PCM code is facilitated. This feature can be used for communicating with voice terminals equipped with PCM coding facilities only, for playing out non-speech signals that are conveniently represented in PCM form, and for codec testing.
The prioritization variables provided to the network voice protocol by
the codec permit voice traffic to be prioritized on a continuous scale,
rather than in an on/off manner. Nevertheless, discrimination between
talkspurts and silence can also be performed (as might be required for
voice conferencing purposes), by combining the energy and frequency
measurements as done in conventional speech detectors.
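The talkspurt/silence decision mentioned above, combining an energy measure with the zero crossing count in the manner of conventional speech detectors, might be sketched as follows; the thresholds and names are illustrative assumptions, not values from the codec.

```python
# Sketch of a conventional-style speech detector built from the two
# prioritization variables the codec already provides: the syllabic
# factor (an energy measure) and the zero crossing count per frame.

def is_talkspurt(syllabic_factor, zero_crossings,
                 energy_threshold=0.05, zc_threshold=40):
    """Declare a talkspurt on high energy alone, or on moderate energy
    combined with a high zero crossing count (to catch weak unvoiced
    fricatives). Thresholds here are hypothetical."""
    if syllabic_factor >= energy_threshold:
        return True
    return (syllabic_factor >= energy_threshold / 4
            and zero_crossings >= zc_threshold)
```

Only the binary decision is new; both input variables already travel in the control/status parcels.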
The hardware interface between the codec and the network voice protocol, and its associated data structures (parcels and frames), facilitate the rapid packetization of speech code while maintaining flexibility as regards packet length. Provision for a substantial amount of control/status information to be exchanged along with the speech code permits additional features to be included in the codec in the future. For example, the inclusion of a switched predictor (discussed in Section 10.2.1.3) would require the transmission of additional side information in each packet, which could be communicated to the network voice protocol in the control/status parcels.
10.1.6 The network voice protocol
During the development of the speech codec in this project, a number of
important issues in the design of the network voice protocol were
considered. As a result, the following specific recommendations are
made with regard to the packetization, prioritization, flow control,
synchronization and fill-in algorithms of this protocol:
- Flexibility in the selection of the length of voice packets should be maintained, and it should be possible to change this length on a per-call basis. Slow adaptation of the packet length during a call may be beneficial in some cases.
- Voice traffic should be prioritized on a continuous scale according to its activity, transmission history, and the relative requirements of data traffic.
- Flow control of voice traffic should be performed by discarding low-priority packets containing embedded code of lesser significance, and a network load estimation algorithm should be developed to facilitate this.
- Packet voice synchronization should be based on the observed packet arrival statistics, rather than on time stamping, and should preferably be adaptive.
- Packet voice fill-in should be based on the repetition of previous packets, except where more than one or two successive packets are missing, in which case a synthetic substitute is required.
10.2 Further work
10.2.1 The codec
10.2.1.1 Optimization of the GHAQ with alternative distortion measures
The use of an alternative distortion measure to the MSE in the GHAQ optimization procedure is suggested in Section 7.4.8. The incentive for doing this is the possibility of improving the subjective speech quality of the embedded codec without increasing its complexity. The Energy-Weighted MSE distortion measure is considered worthy of investigation in this respect, due to the relative simplicity of its calculation and its established correlation with subjective quality judgements.
10.2.1.2 Adaptation of the syllabic compandor in the GHAQ
The coefficient β, which determines the time constant of the syllabic compandor in the GHAQ, is essentially a compromise value which maximizes the average performance for an entire utterance. A less constrained design would allow β to vary with time, according to the local signal statistics. The likelihood that this would result in a performance improvement is suggested by the work in [Chen 87], where it is shown that an exponentially-weighted variance estimator with a fixed time constant (as used in the GHAQ) is sub-optimal for the adaptive quantization of speech at low bit rates.
The possibility of applying backward adaptation to the variance estimator is also suggested in [Chen 87], and this would amount to adapting β in the GHAQ. This parameter could be adapted in the same way that the quantizer in the GHAQ is currently adapted, namely by selecting an optimum value on the basis of a number of previous quantizer output bits. This would require a set of optimum values for β to be found, which could be done by incorporating a suitable calculation into the iterative optimization procedure, as suggested in [Chen 87]. Apart from the simplicity of implementation of this scheme, it has the advantage that the existing mechanisms for allowing the decoder to recover from bit errors and missing packets would continue to operate.
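The suggested backward adaptation of β could take the following form, mirroring the way the GHAQ quantizer itself is adapted; the table values are illustrative assumptions rather than optimized figures.

```python
# Sketch of backward adaptation of the variance-estimator coefficient:
# beta is selected from a small pre-optimized table indexed by the last
# two quantizer output bits, so encoder and decoder stay in step using
# only transmitted information. Table values are hypothetical.

BETA_TABLE = {(0, 0): 0.98, (0, 1): 0.95, (1, 0): 0.95, (1, 1): 0.90}

def update_variance(variance, sample, last_bits):
    """One step of an exponentially weighted variance estimate, with
    the time constant chosen by the recent quantizer output bits."""
    beta = BETA_TABLE[last_bits]
    return beta * variance + (1.0 - beta) * sample * sample
```

Because the selection depends only on quantizer output bits, the decoder can track β without extra side information, which is why the existing error-recovery mechanisms would be preserved.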
10.2.1.3 Switched predictor adaptation
Adaptive prediction was originally rejected when the codec was designed, due to the associated implementation complexity and vulnerability to packet loss. However, these disadvantages are substantially reduced if switched rather than continuous adaptation is used [Evci 81]. In this approach, the predictor coefficients are selected from a small set of possibilities, according to a simple estimate of the autocorrelation function of the input signal. The "selector" is then transmitted to the decoder as side information, and due to its constrained form requires far fewer bits for its representation than do the coefficients themselves.
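Switched predictor adaptation in the style of [Evci 81] can be sketched as follows; the coefficient table, thresholds and names are illustrative assumptions, not values from that work.

```python
# Sketch of switched predictor adaptation: a crude normalized lag-1
# autocorrelation estimate selects one of a small table of coefficient
# sets, and only the selector index is sent as side information.

PREDICTOR_TABLE = [
    (1.6, -0.7),   # strongly voiced (high positive correlation)
    (0.9, 0.0),    # mildly correlated
    (-0.4, 0.0),   # unvoiced (negative correlation)
]

def select_predictor(samples):
    """Return (selector, coefficients); the selector alone is
    transmitted, needing only 2 bits for this 3-entry table."""
    num = sum(a * b for a, b in zip(samples, samples[1:]))
    den = sum(a * a for a in samples) or 1.0
    rho = num / den                 # crude lag-1 autocorrelation estimate
    if rho > 0.7:
        selector = 0
    elif rho >= 0.0:
        selector = 1
    else:
        selector = 2
    return selector, PREDICTOR_TABLE[selector]
```

The negative-correlation entry is what would help the unvoiced sounds on which, as noted below, the present fixed predictor performs poorly.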
While the average SNR improvement of adaptive over fixed prediction is not considerable for low predictor orders (as shown in Fig. 7.2), it has been reported that the short-term improvement can be large (about 5 dB) in the specific case of unvoiced speech sounds, due to the improvement in predictor/signal matching [Evci 81]. As observed in Section 7.11.3, it is in precisely these cases that the primary coding algorithm currently used performs poorly.
10.2.1.4 Prioritization variables
The prioritization variables currently produced by the codec are the syllabic factor and the zero crossing count. The use of the latter variable is related to some extent to the observation that the performance of the primary coding algorithm is worse for unvoiced sounds than for voiced sounds. As explained in Section 7.11.3, this can be attributed to the use of a fixed rather than an adaptive predictor. Thus if the primary coding algorithm is modified significantly, for example by including a switched predictor (as discussed in the previous section), then the prioritization strategy will need to be re-assessed. Nevertheless, the zero crossing count and the syllabic factor will remain useful in this case for discriminating between talkspurts and silence.
10.2.1.5 Channel error robustness
The technique described in Section 7.8.3 to ensure the robustness of
the codec to channel bit errors has the important advantage of being
extremely simple to implement, but has the disadvantage that its
operation is dependent on the input signal statistics. As explained in
Section 7.8.3, the more complex "exponential leakage" algorithm does
not suffer from this limitation. However, a problem with this scheme
is that it tends to dampen the response of the adaptive quantizer
[Jayant 84], thus requiring the recalculation of optimum step size
multipliers by time-consuming random search methods. This suggests
that further work might usefully be done on the application of the
exponential leakage technique to the GHAQ, as in this case the
dampening effect could be accounted for by re-optimizing the adaptive
quantizer using the procedure described in Section 7.4.7.
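The behaviour of exponential leakage can be illustrated numerically. The recursion below is a generic leaky Jayant-style step-size update, not the GHAQ itself; the multipliers {0.8, 1.6} and the leakage constant 0.98 are illustrative values.

```python
def adapt_step(codes, multipliers, gamma, delta0):
    """Step-size adaptation with exponential leakage:
        delta(n) = m(n) * delta(n-1)**gamma,  0 < gamma <= 1.
    gamma = 1 gives the ordinary recursive update; gamma < 1 "leaks"
    the state, so an encoder/decoder mismatch decays with time
    regardless of the input signal statistics."""
    delta = delta0
    history = []
    for c in codes:
        delta = multipliers[c] * delta ** gamma
        history.append(delta)
    return history

# After a bit error the decoder holds the wrong step size (2.0 instead
# of 1.0); both sides then see the same code sequence.
codes = [0, 1] * 150
mult = {0: 0.8, 1: 1.6}
enc = adapt_step(codes, mult, 0.98, 1.0)
dec = adapt_step(codes, mult, 0.98, 2.0)
mismatch = dec[-1] / enc[-1]   # ~1.0 with leakage; exactly 2.0 if gamma = 1
```

The same exponent that dissipates the mismatch also damps the quantizer's response to genuine level changes, which is why the step size multipliers must be re-optimized when leakage is introduced.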
10.2.2 The network voice protocol
The next stage in the project of which this thesis forms a part is the
development and implementation of a network voice protocol suitable for
use with the speech codec card. The codec/protocol combination will
then be used to add a voice facility to an existing local area network
linking a number of workstations. In this environment, it will be
possible to evaluate the overall subjective speech quality obtained,
and to optimize the trade-off among the various types of signal
degradation by adjusting the parameters of the packetization,
prioritization, flow control, packet synchronization and packet
fill-in algorithms. It is suggested that specific areas of interest in
this context will be:
The prioritization of variable rate speech according to its
transmission history. Specifically, to what extent are momentary bit
rate variations caused by channel conditions noticeable?
The relative prioritization of voice and data traffic. Where
prioritized network access is not available, to what extent can
this be compensated for by traffic prioritization in the high-
level protocols? Is it possible to prevent data traffic from
affecting the continuity of voice calls without impacting
significantly on the data service performance in the absence of
voice traffic?
Network load estimation/prediction. This must be performed on a
distributed basis, and must only use information which is readily
available to the network voice protocol. A compromise must be
achieved between rapidity of response to changing network load and
stability.
Adaptive packet voice synchronization. Given that the delay
statistics of a particular call path are not known when the call is
set up, how quickly can satisfactory synchronization of voice
packets be achieved using an adaptive strategy without time
stamping? (It might be necessary to transmit a number of "dummy"
packets at the start of the call in order to achieve a reasonable
degree of synchronization.)
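One possible adaptive strategy for the last point can be sketched as follows: track the mean and mean deviation of the packet arrival jitter recursively, and buffer a few deviations' worth of speech before playout. The estimator form and the constants beta and k are assumptions of the sketch, not part of any protocol proposed here.

```python
def playout_delays(arrival_gaps, frame_period, beta=0.95, k=4.0):
    """Adaptive playout delay estimation without time stamps: the jitter
    is observed as the deviation of each packet arrival gap from the
    nominal frame period, and the playout buffer is sized to absorb
    k mean-deviations of jitter."""
    mean_j = 0.0
    dev_j = 0.0
    delays = []
    for gap in arrival_gaps:
        jitter = gap - frame_period
        mean_j = beta * mean_j + (1 - beta) * jitter
        dev_j = beta * dev_j + (1 - beta) * abs(jitter - mean_j)
        delays.append(max(0.0, mean_j + k * dev_j))
    return delays

steady = playout_delays([20.0] * 50, 20.0)          # no jitter: no buffering
jittery = playout_delays([15.0, 25.0] * 25, 20.0)   # +/-5 ms jitter: buffering
```

"Dummy" packets at the start of the call would simply prime these estimators before any speech is played out.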
APPENDIX A : Minimization of the GHAQ distortion measure
From Section 7.4.7, it is required that the distortion given by:

D = ∑_{j=0}^{J-1} [ ∫₀^∞ ∫_{-∞}^{p(j)} g² (c − q(k|b=0))² f_cg(c,g|j) dc dg f_j(j)

  + ∫₀^∞ ∫_{p(j)}^{∞} g² (c − q(k|b=1))² f_cg(c,g|j) dc dg f_j(j) ]        (A1)
is minimized by differentiation with respect to p(j) ; j = 0,1,...,J−1
and q(k) ; k = 0,1,...,K−1, with the assumption that the PDFs
f_cg(c,g|j) and f_j(j) are independent of p(j) ; j = 0,1,...,J−1.
Noting that k and j can be related by k = j + (K/2)b, (A1) can be
rewritten:
D = ∑_{j=0}^{J-1} [ ∫₀^∞ ∫_{-∞}^{p(j)} g² (c − q(j))² f_cg(c,g|j) dc dg f_j(j)

  + ∫₀^∞ ∫_{p(j)}^{∞} g² (c − q(j+K/2))² f_cg(c,g|j) dc dg f_j(j) ]        (A2)
Differentiating (A2) with respect to p(j) ; j = 0,1,...,J−1:

∂D/∂p(j) = ∂/∂p(j) [ ∫₀^∞ ∫_{-∞}^{p(j)} g² (c − q(j))² f_cg(c,g|j) dc dg

  + ∫₀^∞ ∫_{p(j)}^{∞} g² (c − q(j+K/2))² f_cg(c,g|j) dc dg ] f_j(j)

  ; j = 0,1,...,J−1        (A3)
Splitting the joint PDF f_cg(c,g|j) in (A3):

∂D/∂p(j) = ∂/∂p(j) [ ∫₀^∞ ∫_{-∞}^{p(j)} g² (c − q(j))² f_c(c|g,j) dc f_g(g|j) dg

  + ∫₀^∞ ∫_{p(j)}^{∞} g² (c − q(j+K/2))² f_c(c|g,j) dc f_g(g|j) dg ] f_j(j)

  ; j = 0,1,...,J−1        (A4)
In order to simplify (A4), use is made of Leibniz's Rule for the
differentiation of an integral containing a parameter [Wilson 11],
which can be expressed as follows:

If:

F(α) = ∫_{a(α)}^{b(α)} f(x,α) dx

Then:

∂F/∂α = ∫_{a(α)}^{b(α)} ∂f/∂α dx − f(a(α),α) ∂a/∂α + f(b(α),α) ∂b/∂α
Applying this rule to (A4), with p(j) as the variable limit of the
inner integrals, gives:

∂D/∂p(j) = [ ∫₀^∞ g² (p(j) − q(j))² f_c(p(j)|g,j) f_g(g|j) dg

  − ∫₀^∞ g² (p(j) − q(j+K/2))² f_c(p(j)|g,j) f_g(g|j) dg ] f_j(j)

  ; j = 0,1,...,J−1        (A5)
And hence:

∂D/∂p(j) = [ (p(j) − q(j))² − (p(j) − q(j+K/2))² ]

  · [ ∫₀^∞ g² f_c(p(j)|g,j) f_g(g|j) dg ] f_j(j)  ; j = 0,1,...,J−1        (A6)
Now (A6) is equated with zero, and since g is constrained to be
strictly positive:

(p(j) − q(j))² − (p(j) − q(j+K/2))² = 0

Giving:

p(j) = 1/2 [ q(j) + q(j+K/2) ]  ; j = 0,1,...,J−1        (A7)
which is the first design equation.
In differentiating D with respect to q(k) ; k = 0,1,...,K−1, the cases
k = 0,1,...,K/2−1 and k = K/2,K/2+1,...,K−1 are considered separately.
Noting that k ≤ K/2−1 for b = 0 and k ≥ K/2 for b = 1, (A1) can be
rewritten:
D = ∑_{j=0}^{J-1} ∫₀^∞ ∫_{-∞}^{p(j)} g² (c − q(k))² f_cg(c,g|j) dc dg f_j(j)

  ; k = 0,1,...,K/2−1        (A8a)

D = ∑_{j=0}^{J-1} ∫₀^∞ ∫_{p(j)}^{∞} g² (c − q(k))² f_cg(c,g|j) dc dg f_j(j)

  ; k = K/2,K/2+1,...,K−1        (A8b)
Now (A8a) and (A8b) are differentiated with respect to q(k):

∂D/∂q(k) = ∫₀^∞ ∫_{-∞}^{p(j)} ∂/∂q(k) [ g² (c − q(k))² ] f_cg(c,g|j) dc dg f_j(j)

  ; k = 0,1,...,K/2−1        (A9a)

∂D/∂q(k) = ∫₀^∞ ∫_{p(j)}^{∞} ∂/∂q(k) [ g² (c − q(k))² ] f_cg(c,g|j) dc dg f_j(j)

  ; k = K/2,K/2+1,...,K−1        (A9b)
(A9a) and (A9b) may be simplified to give:

∂D/∂q(k) = −2 ∫₀^∞ ∫_{-∞}^{p(j)} g² (c − q(k)) f_cg(c,g|j) dc dg f_j(j)

  ; k = 0,1,...,K/2−1        (A10a)

∂D/∂q(k) = −2 ∫₀^∞ ∫_{p(j)}^{∞} g² (c − q(k)) f_cg(c,g|j) dc dg f_j(j)

  ; k = K/2,K/2+1,...,K−1        (A10b)
Equating (A10a) and (A10b) with zero and solving for q(k):

q(k) = [ ∫₀^∞ ∫_{-∞}^{p(j)} g² c f_cg(c,g|j) dc dg ] / [ ∫₀^∞ ∫_{-∞}^{p(j)} g² f_cg(c,g|j) dc dg ]

  ; k = 0,1,...,K/2−1        (A11a)

q(k) = [ ∫₀^∞ ∫_{p(j)}^{∞} g² c f_cg(c,g|j) dc dg ] / [ ∫₀^∞ ∫_{p(j)}^{∞} g² f_cg(c,g|j) dc dg ]

  ; k = K/2,K/2+1,...,K−1        (A11b)

Combining (A11a) and (A11b) gives:

q(k) = E{c g²|k} / E{g²|k}  ; k = 0,1,...,K−1        (A12)
which is the second design equation.
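The character of the second design equation can be checked by a small Monte-Carlo computation: q(k) is a gain-weighted centroid of the compressed signal over the decision region, and it moves away from the ordinary centroid E{c|k} whenever the gain is correlated with the signal. The one-sided exponential "region" and the gain model g = 1 + 0.5c below are arbitrary choices made only for the illustration.

```python
import random

random.seed(3)
# Samples of the compressed signal c in one decision region (c > 0),
# together with an illustrative gain that grows with |c|.
c = [random.expovariate(1.0) for _ in range(100000)]
g = [1.0 + 0.5 * x for x in c]

# Second design equation (A12): q = E{c g^2} / E{g^2} over the region.
q_weighted = sum(gi * gi * ci for ci, gi in zip(c, g)) / sum(gi * gi for gi in g)
# Ordinary (unweighted) centroid of the same region, ~1.0 here.
q_plain = sum(c) / len(c)
```

The first design equation (A7) then places each decision level midway between the two reconstruction levels it separates, exactly as in the Lloyd-Max conditions.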
APPENDIX B : Error dissipation in the robust GHAQ
Referring to Fig. 7.5, the encoder syllabic factor g(n) in the robust
GHAQ is given by:

g(n) = (1−β) [ |s(n−1)| + I ] + β g(n−1)        (B1)

where : β is the syllabic compandor coefficient
        s(n) is the GHAQ output level
        I is the syllabic increment
Now since:

s(n) = d(n) g(n)        (B2)

where d(n) is the quantizer output level, (B1) can be rewritten:
g(n) = (1−β) [ |d(n−1)| g(n−1) + I ] + β g(n−1)

     = g(n−1) [ (1−β) |d(n−1)| + β ] + I (1−β)        (B3)

Now g(n−1) in (B3) can be expanded in a manner similar to that in (B1),
yielding an expression containing g(n−2), and so on. Continuing this
process leads to a general expression:

g(n) = g(n−i) ∏_{h=1}^{i} k₁(n−h) + k₂ [ ∑_{m=1}^{i−1} ∏_{h=1}^{m} k₁(n−h) ] + k₂  ; i ≥ 1        (B4)

where : k₁(h) = (1−β) |d(h)| + β
        k₂ = I (1−β)
Rewriting (B4) in terms of j = n−i:

g(n) = g(j) ∏_{h=1}^{n−j} k₁(n−h) + k₂ [ ∑_{m=1}^{n−j−1} ∏_{h=1}^{m} k₁(n−h) ] + k₂  ; j ≤ n−1        (B5)
A similar expression can also be written for the decoder syllabic
factor g'(n):

g'(n) = g'(j) ∏_{h=1}^{n−j} k₁'(n−h) + k₂ [ ∑_{m=1}^{n−j−1} ∏_{h=1}^{m} k₁'(n−h) ] + k₂  ; j ≤ n−1        (B6)

where : k₁'(h) = (1−β) |d'(h)| + β
        k₂ is as in (B4)
Now consider the case where a single channel bit error causes the
quantities g(j) and g'(j) to differ, for some j. Using (B5) and (B6),
and assuming that no subsequent channel errors occur (so that
k₁'(h) = k₁(h) for h > j), the difference between g(n) and g'(n) for
n = j+1, j+2, ... can be written:

g(n) − g'(n) = [ g(j) − g'(j) ] ∏_{h=1}^{n−j} k₁(n−h)  ; j ≤ n−1        (B7)
From (B5) it is evident that if g(n) is to remain bounded as n → ∞,
then:

lim_{n→∞} [ ∑_{m=1}^{n−j−1} ∏_{h=1}^{m} k₁(n−h) ] ≤ c  ; j ≤ n−1        (B8)

where c is an arbitrary finite constant. This implies that:

lim_{n→∞} [ ∏_{h=1}^{m} k₁(n−h) ] = 0  ; m = n−j−1, j ≤ n−1        (B9)
Applying (B9) to (B7), it is evident that:

lim_{n→∞} [ g(n) − g'(n) ] = 0        (B10)
That is, the difference between the syllabic factors in the encoder and
decoder is dissipated with time. The rate at which this difference is
dissipated depends on the successive values of k₁(n−h) in the product
term in (B7), which are in turn dependent on the input signal
statistics.
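The dissipation result can be verified numerically by driving the recursion (B3) from two different initial states, as would occur after a channel bit error. The values of β and I, the quantizer levels (taken from Table F3) and their probabilities are illustrative; the probabilities favour the small level so that the boundedness condition behind (B8) holds.

```python
import random

def syllabic_track(d, g0, beta=0.97, inc=0.01):
    """Iterate (B3): g(n) = g(n-1)*[(1-beta)|d(n-1)| + beta] + inc*(1-beta)."""
    g = [g0]
    for dn in d:
        g.append(g[-1] * ((1 - beta) * abs(dn) + beta) + inc * (1 - beta))
    return g

random.seed(4)
# The same decoded levels drive both recursions after the error.
d = random.choices([-1.834, -0.420, 0.420, 1.834], weights=[1, 4, 4, 1], k=500)
enc = syllabic_track(d, 1.0)
dec = syllabic_track(d, 2.0)   # decoder state corrupted to 2.0 by a bit error
gap0 = abs(enc[0] - dec[0])    # 1.0 at the moment of the error
gapN = abs(enc[-1] - dec[-1])  # dissipated after 500 samples
```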
APPENDIX C : The minimum output level of the robust GHAQ
The general expression derived in Appendix B for the syllabic factor
g(n) of the robust GHAQ in terms of g(j) is:

g(n) = g(j) ∏_{h=1}^{n−j} k₁(n−h) + k₂ [ ∑_{m=1}^{n−j−1} ∏_{h=1}^{m} k₁(n−h) ] + k₂  ; j ≤ n−1        (C1)

where : k₁(h) = (1−β) |d(h)| + β
        k₂ = I (1−β)
        β is the syllabic compandor coefficient
        d(h) is the quantizer output level
        I is the syllabic increment
Consider the case where |d(n)| assumes its minimum value, d_min, for
some n = j, and maintains it for all subsequent n = j+1, j+2, ... The
corresponding value of k₁(h) is denoted by k₁ (ie. a constant). Now
(C1) may be rewritten:
lim_{n→∞} [g(n)] = g(j) lim_{n→∞} [ ∏_{h=1}^{n−j} k₁ ] + k₂ lim_{n→∞} [ ∑_{m=1}^{n−j−1} ∏_{h=1}^{m} k₁ + 1 ]  ; j ≤ n−1        (C2)
But since d_min must be less than unity if g(n) is to remain bounded,
the first limit is zero and (C2) may be reduced to:

lim_{n→∞} [g(n)] = k₂ lim_{n→∞} [ ∑_{m=1}^{n−j−1} k₁^m + 1 ]  ; j ≤ n−1        (C3)
Since k₁ is a constant, the right-hand term in (C3) contains an
infinite geometric series, so that (C3) may be reduced to:

lim_{n→∞} [g(n)] = k₂ / (1 − k₁)

                 = I / (1 − d_min)        (C4)
This expression shows that the minimum value of the syllabic factor in
the robust GHAQ is dependent on the syllabic increment and the minimum
quantizer reconstruction level. The minimum GHAQ output level, s_min,
will thus be the product of the syllabic factor and the minimum
quantizer reconstruction level:

s_min = I d_min / (1 − d_min)        (C5)
It should be noted that infinite-precision arithmetic is assumed in the
above analysis. In practice, truncation error will cause g(n) to be
slightly less than the value calculated in (C1) to (C4), with the
result that (C5) provides an upper bound on s_min in a limited-precision
implementation of the robust GHAQ.
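The result (C4), and hence (C5), is easily checked by iterating the syllabic recursion with the quantizer held at its minimum level; the parameter values below are illustrative.

```python
def limiting_syllabic(beta, inc, d_min, steps=5000):
    """Iterate g(n) = (1-beta)(|s(n-1)| + I) + beta*g(n-1), from (B1),
    with the quantizer pinned at its minimum level: s(n) = d_min * g(n)."""
    g = 1.0
    for _ in range(steps):
        g = (1 - beta) * (d_min * g + inc) + beta * g
    return g

beta, inc, d_min = 0.97, 0.01, 0.420   # illustrative parameter values
g_inf = limiting_syllabic(beta, inc, d_min)
g_pred = inc / (1 - d_min)             # limiting syllabic factor, (C4)
s_min = d_min * g_pred                 # minimum output level, (C5)
```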
APPENDIX D : Development equipment and software
The coding algorithm described in this thesis was developed using
software written in Pascal and Basic, running on a Hewlett-Packard
9836C computer. The capabilities of this software included the
acquisition, storage, and playback of digitized speech, as well as the
implementation of PCM, ADPCM and ADM coding algorithms and the
calculation of the SNR and SNRSEG performance measures. Other
capabilities included the design and implementation of digital filters
for interpolation and decimation of sample sequences, quantizer
optimization using the Lloyd-Max algorithm, optimization of the GHAQ
using the procedure derived in Section 7.4.7, and predictor
optimization using the Flexible Polyhedron Method. The associated
hardware comprised 12-bit A/D and D/A convertors, standard PCM
pre-sampling and reconstruction filters, a high-quality electret
condenser microphone, and a standard electromagnetic telephone handset.
The assembly language program implementing the coding algorithm on the
TMS320E17 digital signal processor was developed using a Texas
Instruments simulation program running on an IBM PC. Input and output
sample sequences were transferred to and from a 12-bit data
acquisition/playback card, with the same filters as described above and
an electromagnetic telephone handset.
The codec card was tested and evaluated using software written in the C
language, running on an IBM PC. Capabilities of this software included
the examination and modification of the card control/status register
and of any byte in the card memory, and the corruption of code bits on
a random basis, thereby simulating channel errors. In addition, code
could be stored on and retrieved from disk, and converted to and from
64 kbps PCM form, thereby allowing the calculation of signal-to-noise
ratio figures for the complete PCM-EC-PCM coding operation, as
described in Section 9.1.
APPENDIX E : Reference speech material
The following sentences from the "Harvard" list [IEEE 69] of
phonetically balanced sentences were used as test inputs. The sex and
identity of the speaker is indicated in brackets in each case.
(1) The birch canoe slid on the smooth planks. (Male 1)
(2) Glue the sheet to the dark blue background. (Male 1)
(3) It's easy to tell the depth of a well. (Female 1)
(4) These days a chicken leg is a rare dish. (Female 1)
(5) The boy was there when the sun rose. (Male 2)
(6) A rod is used to catch pink salmon. (Male 2)
(7) The source of the huge river is the clear spring. (Female 2)
(8) Kick the ball straight and follow through. (Female 2)
APPENDIX F : Adaptive quantizers used in the comparative tests
The adaptive quantizers used in the comparative tests of Section 7.6,
namely the CFDMAQ, the HCDMAQ, and the JAQ, are described here in terms
of the predictive coder of Fig. 7.1. Each adaptive quantizer may
be thought of as consisting of a fixed quantizer and some form of
variance estimator. The input sample v(n) is compressed using a
scaling factor or "step size" Δ(n) obtained from the variance
estimator, quantized using a fixed quantizer, and then expanded again
to give s(n). This process may be expressed:
s(n) = Q_M(c(n)) Δ(n)        (F1)

c(n) = v(n) / Δ(n)        (F2)

where Q_M(c(n)) denotes fixed (ie. time-invariant) M-bit quantization of
c(n). The distinguishing features of the CFDMAQ, the HCDMAQ and the
JAQ are as follows:
CFDMAQ:
In the CFDMAQ, a 1-bit fixed quantizer is used, with a single decision
level equal to zero, and reconstruction levels with nominal values of
+1 and −1. The quantizer scaling factor Δ(n) is calculated recursively
as shown in (F4) below. The CFDMAQ may be described by:

s(n) = Q₁(c(n)) Δ(n)        (F3)

Δ(n) = m(n) Δ(n−1)        (F4)

m(n) = 1.5   if b₀(n) = b₀(n−1)
     = 0.67  if b₀(n) ≠ b₀(n−1)        (F5)
HCDMAQ:
The fixed quantizer in the HCDMAQ is the same as that in the CFDMAQ.
However, in the HCDMAQ Δ(n) is proportional to the product of an
"instantaneous" factor γ(n) and a "syllabic" factor g(n). The
instantaneous factor is calculated recursively as shown in (F8) below.
In the original form of the HCDMAQ [Un 81], the syllabic factor is
calculated as the RMS slope energy of the signal estimate y(n), using a
periodically updated block average.
However, the results presented in Section 7.6.3 correspond to the use
of a recursive average magnitude estimator (shown in (F9) below) to
calculate the syllabic factor, as the recursive estimator is much
easier to implement in digital hardware than the original scheme, and
little difference (<1 dB) was found between the performances of the two
versions of the HCDMAQ when coding speech. The version used here may
be described by:

s(n) = Q₁(c(n)) Δ(n)        (F6)

Δ(n) = α g(n) γ(n)        (F7)

γ(n) = m(n) γ(n−1)        (F8)

g(n) = (1−β) Δ(n−1) + β g(n−1)        (F9)

where α is a constant and m(n) is selected according to the rule given
in Table F1. The parameter β associated with the time constant of the
average magnitude estimator is set to 0.98, and the constant α to 0.7,
these values having been found empirically to maximize the SNR.
JAQ:
In the 2-bit JAQ a 2-bit fixed quantizer is used, with decision and
reconstruction levels matched to the PDF of the compressed input
signal. The quantizer scaling factor Δ(n) is calculated recursively as
shown in (F11) below. The 2-bit JAQ may be described by:

s(n) = Q₂(c(n)) Δ(n)        (F10)

Δ(n) = m(n) Δ(n−1)        (F11)

where the multiplier m(n) is selected according to the rule given in
Table F2. Since the optimum values of the multipliers depend upon the
long-time-averaged autocorrelation function of the input signal [Jayant
73], different sets are specified for use in PCM and DPCM codecs,
b₀(n)   b₀(n−1)   b₀(n−2)   m(n)
  0        0         0      1.5
  0        0         1      1.0
  0        1         0      0.66
  0        1         1      0.66
  1        0         0      0.66
  1        0         1      0.66
  1        1         0      1.0
  1        1         1      1.5

Table F1  Step size multipliers m(n) for the HCDMAQ [Un 81]
b₁(n)   b₀(n)   m(n), PCM   m(n), DPCM
  0       0       2.20        1.60
  0       1       0.60        0.80
  1       0       0.60        0.80
  1       1       2.20        1.60

Table F2  Step size multipliers m(n) for the 2-bit JAQ [Jayant 73]
  j      p(j)      q(j)
  0      −∞       −1.834
  1      −1.127   −0.420
  2       0.000    0.420
  3       1.127    1.834
  4       ∞

Table F3  Normalized decision levels p(j) and reconstruction levels
          q(j) for a 2-bit non-uniform quantizer optimized for a signal
          with a Laplacian PDF [Jayant 84]
respectively. In this work a 2-bit nonuniform quantizer optimized for
signals with symmetrical Laplacian PDFs [Jayant 84] was used, and the
associated decision and reconstruction levels (normalized for a signal
variance of unity) are given in Table F3.
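The complete 2-bit JAQ can be assembled from (F10), (F11) and Tables F2 and F3, as in the following sketch. The (sign, magnitude) labelling of the two transmitted bits is this sketch's own convention; under it, the four rows of Table F2 (PCM column) reduce to "inner level → 0.60, outer level → 2.20".

```python
def jaq2(samples, multipliers=(0.60, 2.20), delta0=1.0):
    """2-bit Jayant adaptive quantizer: fixed Laplacian-optimized
    quantizer (Table F3) with one-word-memory step-size adaptation
    (F10)-(F11) and the PCM multipliers of Table F2."""
    decision = 1.127            # inner/outer boundary |p(j)| from Table F3
    recon = (0.420, 1.834)      # inner/outer levels |q(j)| from Table F3
    delta = delta0
    codes, recons = [], []
    for v in samples:
        c = v / delta                                      # compress, (F2)
        mag = 1 if abs(c) >= decision else 0               # magnitude bit
        sign = 1 if c < 0 else 0                           # sign bit
        s = (-1.0 if sign else 1.0) * recon[mag] * delta   # expand, (F10)
        codes.append((sign, mag))
        recons.append(s)
        delta *= multipliers[mag]                          # adapt, (F11)
    return codes, recons

codes, recons = jaq2([10.0, 10.0, 10.0, 10.0, 0.1])
```

A run of outer-level codes expands the step size by 2.20 per sample until the compressed input falls back inside the inner region, after which the 0.60 multiplier contracts it again.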
REFERENCES
G.J. Anido, "The design, analysis and implementation of a fully
distributed local area network for integrated voice and data
communications", Ph.D. thesis, University of New South Wales, Nov. 1987.

M. Aoki and F. Ishino, "Speech quality of conversational packetized
voice communications", Trans. IECE Japan, vol. E69, pp. 107-112, Feb.
1986.
G. Barberis and D. Pazzaglia, "Analysis and optimal design of a packet-
voice receiver", IEEE Trans. Commun., vol. 28, pp. 217-227, Feb. 1980.

T.P. Barnwell and S.R. Quackenbush, "An analysis of objectively
computable measures for speech quality testing", Proc. Int. Conf.
Acoust., Speech, and Signal Proc., Paris, pp. 996-999, 1982.

P.A. Bello, R.N. Lincoln, H. Gish, "Statistical delta modulation",
Proc. IEEE, vol. 55, pp. 308-319, Mar. 1967.

T. Bially, B. Gold, S. Seneff, "A technique for adaptive voice flow
control in integrated packet networks", IEEE Trans. Commun., vol. 28,
pp. 325-333, Mar. 1980.

T. Bially, A.J. McLaughlin, C.J. Weinstein, "Voice communication in
integrated digital voice and data networks", IEEE Trans. Commun., vol.
28, pp. 1478-1489, Sept. 1980.

J. Burgin, "Integrated services packet network - protocol design
considerations", Telecom Australia Branch Paper 108 (Switching and
Signalling Branch), Jan. 1987.
P. Bylanski and T.W. Chong, "Advances in speech coding for
communications", GEC J. of Research, vol. 2, pp. 16-22, 1984.

R. Camrass, "New technologies to make their mark in the corporate
network", Communication Systems Worldwide, pp. 30-40, Oct. 1987.

P. Castellino and C. Scagliola, "Design of instantaneously companded
delta modulators with m-bit memory", Proc. Int. Conf. on Acoustics,
Speech and Signal Proc., Hartford, pp. 196-199, May 1977.

K.W. Cattermole, Principles of Pulse Code Modulation. London : Iliffe,
1969.

C.V. Chakravarthy and M.N. Faruqui, "A multidigit adaptive delta
modulation (ADM) system", IEEE Trans. Commun., vol. 24, pp. 931-935,
Aug. 1976.

J.-H. Chen and A. Gersho, "Gain-adaptive vector quantization with
application to speech coding", IEEE Trans. Commun., vol. 35, pp. 918-
930, Sept. 1987.

D. Cohen, "A protocol for packet-switching voice communication",
Computer Networks, vol. 2, pp. 320-331, 1978.

R.V. Cox and R.E. Crochiere, "Multiple user variable rate coding for
TASI and packet transmission systems", IEEE Trans. Commun., vol. 28,
pp. 334-344, Mar. 1980.

M. Decina and G. Modena, "CCITT Standards on Digital Speech
Processing", IEEE J. Selected Areas Commun., vol. 6, pp. 227-234, Feb.
1988.

J.H. Derby and C.R. Galand, "Multirate subband coding applied to
digital speech interpolation", IEEE Trans. Acoust., Speech, Signal
Proc., vol. 35, pp. 1684-1698, Dec. 1987.
J. DeTreville and W.D. Sincoskie, "A distributed experimental
communications system", IEEE J. Selected Areas Commun., vol. 1, pp.
1070-1075, Dec. 1983.

J.D. DeTreville, "A simulation-based comparison of voice transmission
on CSMA/CD networks and on token buses", Bell Syst. Tech. J., vol. 63,
pp. 33-55, Jan. 1984.

D.E. Dodds, A.M. Sendyk, D.B. Wohlberg, "Error tolerant adaptive
algorithms for delta-modulation coding", IEEE Trans. Commun., vol. 28,
pp. 385-391, Mar. 1980.

P.G. Drago, A.M. Molinari, F.C. Vagliani, "Digital dynamic speech
detectors", IEEE Trans. Commun., vol. 26, pp. 140-145, Jan. 1978.

J.J. Dubnowski and R.E. Crochiere, "Variable rate coding of speech",
Bell Syst. Tech. J., vol. 58, pp. 577-600, Mar. 1979.

C.A. Dvorak and J.R. Rosenberger, "Deriving a subjective testing
methodology for digital circuit multiplication and packetized voice
systems", IEEE J. Selected Areas Commun., vol. 6, pp. 235-241, Feb.
1988.

C.C. Evci, C.S. Xydeas, R. Steele, "Sequential adaptive predictors for
ADPCM speech encoders", Nat. Telecomm. Conf., pp. E8.1.1-E8.1.5, Nov.
1981.

T.R. Fischer and P.F. Dahm, "Variance estimation and adaptive
quantization", IEEE Trans. Inf. Theory, vol. 31, pp. 428-433, May 1985.

J.L. Flanagan et al., "Speech coding", IEEE Trans. Commun., vol. 27,
pp. 710-736, Apr. 1979.
J.W. Forgie, "Network speech - System implications of packetized
speech", Annual Report to the Defense Communications Agency, Lincoln
Lab., Massachusetts Institute of Technology, 1976.

V.S. Frost, E.M. Friedman, G.J. Minden, "Multirate voice coding for
load control on CSMA/CD local computer networks", 10th Conf. on Local
Computer Networks, Minneapolis, pp. 10-19, Oct. 1985.

C.K. Gan and R.W. Donaldson, "Adaptive silence deletion for speech
storage and voice mail applications", IEEE Trans. Acoust., Speech,
Signal Proc., vol. 36, pp. 924-927, June 1988.

M.L. Gayford, Electroacoustics. London : Newnes-Butterworths, 1970.

K.C.E. Gee, Introduction to Local Area Networks. London : Macmillan,
1983.

J.D. Gibson, "Sequentially adaptive backward prediction in ADPCM speech
coders", IEEE Trans. Commun., vol. 26, pp. 145-150, Jan. 1978.

J.W. Glasbergen, "This versatile IC digitizes speech", Philips Telecom.
Review, vol. 39, pp. 147-154, Sept. 1981.

B. Gold, "Digital speech networks", Proc. IEEE, vol. 65, pp. 1636-1658,
Dec. 1977.

A.J. Goldberg, "Practical implementations of speech waveform coders for
the present day and for the mid 1980s", J. Acoust. Soc. Amer., vol. 66,
pp. 1653-1657, Dec. 1979.

T.A. Gonsalves, "Packet-voice communication on an Ethernet local
computer network : An experimental study", Technical Report No. 230,
Computer Systems Lab., Stanford Univ., Feb. 1982.
D.J. Goodman and R.M. Wilkinson, "A Robust Adaptive Quantizer", IEEE
Trans. Commun., vol. 23, pp. 1362-1365, Nov. 1975.

D.J. Goodman, "Embedded DPCM for variable bit rate transmission", IEEE
Trans. Commun., vol. 28, pp. 1040-1046, July 1980.

D.J. Goodman et al., "Waveform substitution techniques for recovering
missing speech segments in packet voice communications", IEEE Trans.
Acoust., Speech, Signal Proc., vol. 34, pp. 1440-1447, Dec. 1986.

J.G. Gruber, "Delay related issues in integrated voice and data
networks", IEEE Trans. Commun., vol. 29, pp. 786-800, June 1981.

J.G. Gruber and N.H. Le, "Performance requirements for integrated
voice/data networks", IEEE J. Selected Areas Commun., vol. 1, pp. 981-
1005, Dec. 1983.

J.G. Gruber and L. Strawczynski, "Subjective effects of variable delay
and speech clipping in dynamically managed voice systems", IEEE Trans.
Commun., vol. 33, pp. 801-808, Aug. 1985.

S.C. Hall and H.S. Bradlow, "The design of a hybrid adaptive quantizer
for speech coding applications", to be published in IEEE Trans.
Commun., Nov. 1988.

D.M. Himmelblau, Applied Nonlinear Programming. New York : McGraw-
Hill, 1972.

IEEE Subcommittee on Subjective Measurements, "IEEE recommended
practice for speech quality measurements", IEEE Trans. on Audio and
Electroacoustics, vol. 17, pp. 227-246, Sept. 1969.

Intel Corporation, Telecommunications Products Handbook. 1984.
J.M. Irvine, S.C. Hall, H.S. Bradlow, "An improved hybrid companding
delta modulator", IEEE Trans. Commun., vol. 34, pp. 995-998, Oct. 1986.

N.S. Jayant, "Adaptive delta modulation with a one-bit memory", Bell
Syst. Tech. J., vol. 49, pp. 321-342, Mar. 1970.

N.S. Jayant, "Adaptive quantization with a one-word memory", Bell Syst.
Tech. J., pp. 1119-1144, Sept. 1973.

N.S. Jayant, "Step-size transmitting differential coders for mobile
telephony", Bell Syst. Tech. J., vol. 54, pp. 1557-1580, Nov. 1975.

N.S. Jayant, "On the delta modulation of a first-order Gauss-Markov
signal", IEEE Trans. Commun., vol. 26, pp. 150-156, Jan. 1978.

N.S. Jayant and S.W. Christensen, "Effects of packet losses in waveform
coded speech and improvements due to an odd-even sample-interpolation
procedure", IEEE Trans. Commun., vol. 29, pp. 101-109, Feb. 1981.

N.S. Jayant, "Variable rate ADPCM based on explicit noise coding", Bell
Syst. Tech. J., vol. 62, pp. 657-677, Mar. 1983.

N.S. Jayant and P. Noll, Digital Coding of Waveforms. New Jersey :
Prentice-Hall, 1984.

D.H. Johnson and G.C. O'Leary, "A local access network for packetized
digital voice communication", IEEE Trans. Commun., vol. 29, pp. 679-
688, May 1981.

R.E. Kahn, "Advances in packet radio technology", Proc. IEEE, vol. 66,
pp. 1468-1496, Nov. 1978.
N. Kitawaki et al., "Comparison of objective speech quality measures
for voiceband codecs", Proc. Int. Conf. Acoust., Speech, Signal Proc.,
Paris, pp. 1000-1003, 1982.

J.F. Kurose, M. Schwartz, Y. Yemini, "Multiple-access protocols and
time-constrained communication", Computing Surveys, vol. 16, pp. 43-70,
Mar. 1984.

G.G. Langenbucher, "Efficient coding and speech interpolation :
Principles and performance characterization", IEEE Trans. Commun., vol.
30, pp. 769-779, Apr. 1982.

Y. Linde, A. Buzo, R.M. Gray, "An algorithm for vector quantizer
design", IEEE Trans. Commun., vol. 28, pp. 84-95, Jan. 1980.

S.P. Lloyd, "Least squares quantization in PCM", unpublished memo.,
Bell Lab., 1957; IEEE Trans. Information Theory, vol. 28, pp. 129-137,
Mar. 1982.

D.T. Magill and C.K. Un, "Speech residual coding by adaptive delta
modulation with hybrid companding", Proc. Nat. Electronics Conf., vol.
29, pp. 403-408, 1977.

X. Maitre and T. Aoyama, "Speech coding activities within CCITT :
Status and trends", Proc. Int. Conf. Acoust., Speech, Signal Proc.,
Paris, pp. 954-959, 1982.

M. Malek, "Integrated voice and data communications overview", IEEE
Commun. Mag., vol. 26, pp. 5-15, June 1988.

J. Max, "Quantizing for minimum distortion", IRE Trans. Inf. Theory,
vol. 6, pp. 7-12, Mar. 1960.
B. McDermott, C. Scagliola, D. Goodman, "Perceptual and objective
evaluation of speech processed by adaptive differential PCM", Bell
Syst. Tech. J., vol. 57, pp. 1597-1619, May 1978.

P. Mermelstein, "G.722, A new CCITT coding standard for digital
transmission of wideband audio signals", IEEE Commun. Mag., vol. 26,
pp. 8-15, Jan. 1988.

D. Minoli, "Optimal packet length for packet voice communication", IEEE
Trans. Commun., vol. 27, pp. 607-611, Mar. 1979.

W.A. Montgomery, "Techniques for packet voice synchronization", IEEE J.
Selected Areas Commun., vol. 1, pp. 1022-1028, Dec. 1983.

M. Nakatsui and P. Mermelstein, "Subjective speech-to-noise ratio as a
measure of speech quality for digital waveform coders", J. Acoust. Soc.
Amer., vol. 72, pp. 1136-1144, Oct. 1982.

M.E.M. Nasr and C.V. Chakravarthy, "Hybrid adaptive quantization for
speech coding", IEEE Trans. Commun., vol. 32, pp. 1358-1361, Dec. 1984.

J.E. Natvig, "Evaluation of six medium bit-rate coders for the Pan-
European digital mobile radio system", IEEE J. Selected Areas Commun.,
vol. 6, pp. 324-331, Feb. 1988.

J.A. Nelder and R. Mead, "A simplex method for function minimization",
Computer Journal, vol. 7, pp. 308-313, 1965.

T. Nishitani et al., "A CCITT standard 32 kbit/s ADPCM LSI codec", IEEE
Trans. Acoust., Speech, Signal Proc., vol. 35, pp. 219-225, Feb. 1987.

P. Noll, "A comparative study of various quantization schemes for
speech coding", Bell Syst. Tech. J., vol. 54, pp. 1597-1614, Nov. 1975.
G.C. O'Leary et al., "A modular approach to packet voice terminal
design", Nat. Computer Conf., pp. 183-189, 1981.

S.N. Pandhi, "The universal data connection", IEEE Spectrum, vol. 24,
pp. 31-37, July 1987.

L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals.
New Jersey : Prentice-Hall, 1978.

F.T. Sakane and R. Steele, "Two-bit instantaneously adaptive delta
modulation for p.c.m. encoding", The Radio and Electronic Engineer,
vol. 48, pp. 187-197, Apr. 1978.

C. Scagliola, "Evaluation of adaptive speech coders under noisy channel
conditions", Bell Syst. Tech. J., vol. 58, pp. 1369-1394, July 1979.

M. Schwartz and L. Shaw, Signal Processing : Discrete Spectral
Analysis, Detection and Estimation. New York : McGraw-Hill, 1975.

R. Seidl, "Echo and delay in packet voice services", 2nd Fast Packet
Switching Workshop, Melbourne, May 1987.

A. Seneviratne, "Factors influencing the quality of packet voice", 2nd
Fast Packet Switching Workshop, Melbourne, May 1987.

T. Suda, H. Miyahara, T. Hasegawa, "Performance evaluation of a
packetized voice system - simulation study", IEEE Trans. Commun., vol.
32, pp. 97-102, Jan. 1984.

A.S. Tanenbaum, Computer Networks. New Jersey : Prentice-Hall, 1981.

H. Taub and D.L. Schilling, Principles of Communication Systems. New
York : McGraw-Hill, 1971.

Texas Instruments Inc., TMS32010 User's Guide. 1985.
J. Tierney and M.L. Malpass, "Enhanced CVSD - an embedded speech coder
for 64-16 kbps", Proc. IEEE Int. Conf. on Acoustics, Atlanta, pp. 840-
843, Mar. 1981.

K.-H. Tzou, "Embedded Max quantization", Int. Conf. Acoust., Speech,
Signal Proc., Tokyo, pp. 10B.3.1-10B.3.4, 1986.

C.K. Un and H.S. Lee, "A study of the comparative performance of
adaptive delta modulation systems", IEEE Trans. Commun., vol. 28, pp.
96-101, Jan. 1980.

C.K. Un and H.H. Lee, "Voiced/unvoiced/silence discrimination of speech
by delta modulation", IEEE Trans. Acoust., Speech and Signal Proc.,
vol. 28, pp. 398-407, Aug. 1980.

C.K. Un, H.S. Lee, J.S. Song, "Hybrid companding delta modulation",
IEEE Trans. Commun., vol. 29, pp. 1337-1344, Sept. 1981.

C.K. Un and D.H. Cho, "Hybrid companding delta modulation with
variable-rate sampling", IEEE Trans. Commun., vol. 30, pp. 593-599,
Apr. 1982.

R. Vickers and T. Vilmansen, "The evolution of telecommunications
technology", IEEE Commun. Mag., vol. 25, pp. 6-18, July 1987.

O.J. Wasem et al., "The effect of waveform substitution on the quality
of PCM packet communications", IEEE Trans. Acoust., Speech, Signal
Proc., vol. 36, pp. 342-348, Mar. 1988.

S.A. Webber, C.J. Harris, J.L. Flanagan, "Use of variable-quality
coding and time-interval modification in packet transmission of
speech", Bell Syst. Tech. J., vol. 56, pp. 1569-1573, Oct. 1977.

C.J. Weinstein and E.M. Hofstetter, "The tradeoff between delay and
TASI advantage in a packetized speech multiplexer", IEEE Trans.
Commun., vol. 27, pp. 1716-1720, Nov. 1979.

C.J. Weinstein and J.W. Forgie, "Experience with speech communication
in packet networks", IEEE J. Selected Areas Commun., vol. 1, pp. 963-
980, Dec. 1983.

S.B. Weinstein, "Telecommunications in the coming decades", IEEE
Spectrum, pp. 62-67, Nov. 1987.

L. Weiss, I.M. Paz, D.L. Schilling, "Video encoding using an adaptive
digital delta modulator with overshoot suppression", IEEE Trans.
Commun., vol. 23, pp. 905-920, Sept. 1975.

E.B. Wilson, Advanced Calculus. Boston : Athenaeum Press, 1911.

P. Yatrou and P. Mermelstein, "Ensuring predictor tracking in ADPCM
speech coders under noisy transmission conditions", IEEE J. Selected
Areas Commun., vol. 6, pp. 249-261, Feb. 1988.

Y. Yatsuzuka, "High-gain digital speech interpolation with adaptive
differential PCM encoding", IEEE Trans. Commun., vol. 30, pp. 750-761,
Apr. 1982.
ABBREVIATIONS AND ACRONYMS
ACF - Autocorrelation Function
ADM - Adaptive Delta Modulation
ADPCM - Adaptive Differential Pulse Code Modulation
APCM - Adaptive Pulse Code Modulation
bps - bits per second
CCITT - International Telegraph and Telephone Consultative Committee
CFDMAQ - Constant Factor Delta Modulation Adaptive Quantizer
CVSD - Continuously Variable Slope Delta modulation
DM - Delta Modulation
DMA - Direct Memory Access
DPCM - Differential Pulse Code Modulation
DCM - Digital Circuit Multiplication
DSI - Digital Speech Interpolation
DSP - Digital Signal Processor
EC - Embedded Coding
FIFO - First-In First-Out
FPS - Fast Packet Switching
GHAQ - Generalized Hybrid Adaptive Quantizer
HCDMAQ - Hybrid Companding Delta Modulation Adaptive Quantizer
ISDN - Integrated Services Digital Network
JAQ - Jayant Adaptive Quantizer
k = 1000 (eg. kHz)
K = 1024 (eg. Kbytes)
LAN - Local Area Network
M = 10^6 (eg. Mbps)
PABX - Private Automatic Branch Exchange
PCM - Pulse Code Modulation
PDF - Probability Density Function
PSTN - Public Switched Telephone Network
SAD - Speech Activity Detection
SBC - Sub-Band Coding
SDM - Statistical Delta Modulation
SNR - Signal-to-Noise Ratio
SNRF - Signal-to-Noise Ratio after Filtering
SNRSEG - Segmental Signal-to-Noise Ratio
SNRSEGF - Segmental Signal-to-Noise Ratio after Filtering
TASI - Time Assignment Speech Interpolation
WAN - Wide Area Network