
AUDIO SIGNAL PROCESSING AND CODING

Andreas Spanias, Ted Painter, Venkatraman Atti

WILEY-INTERSCIENCE
A John Wiley & Sons, Inc., Publication


Copyright © 2007 by John Wiley & Sons, Inc. All rights reserved.

Published by John Wiley & Sons, Inc., Hoboken, New Jersey. Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Wiley Bicentennial Logo: Richard J. Pacifico

Library of Congress Cataloging-in-Publication Data

Spanias, Andreas.
Audio signal processing and coding / by Andreas Spanias, Ted Painter, Venkatraman Atti.
p. cm.
"Wiley-Interscience publication."
Includes bibliographical references and index.
ISBN 978-0-471-79147-8
1. Coding theory. 2. Signal processing–Digital techniques. 3. Sound–Recording and reproducing–Digital techniques. I. Painter, Ted, 1967-. II. Atti, Venkatraman, 1978-. III. Title.
TK5102.92.S73 2006
621.382'8–dc22
2006040507

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

To
Photini, John, and Louis
Lizzy, Katie, and Lee
Srinivasan, Sudha, Kavitha, Satish, and Ammu

CONTENTS

PREFACE

1 INTRODUCTION
1.1 Historical Perspective
1.2 A General Perceptual Audio Coding Architecture
1.3 Audio Coder Attributes
1.3.1 Audio Quality
1.3.2 Bit Rates
1.3.3 Complexity
1.3.4 Codec Delay
1.3.5 Error Robustness
1.4 Types of Audio Coders – An Overview
1.5 Organization of the Book
1.6 Notational Conventions
Problems
Computer Exercises

2 SIGNAL PROCESSING ESSENTIALS
2.1 Introduction
2.2 Spectra of Analog Signals
2.3 Review of Convolution and Filtering
2.4 Uniform Sampling
2.5 Discrete-Time Signal Processing
2.5.1 Transforms for Discrete-Time Signals
2.5.2 The Discrete and the Fast Fourier Transform
2.5.3 The Discrete Cosine Transform
2.5.4 The Short-Time Fourier Transform
2.6 Difference Equations and Digital Filters
2.7 The Transfer and the Frequency Response Functions
2.7.1 Poles, Zeros, and Frequency Response
2.7.2 Examples of Digital Filters for Audio Applications
2.8 Review of Multirate Signal Processing
2.8.1 Down-sampling by an Integer
2.8.2 Up-sampling by an Integer
2.8.3 Sampling Rate Changes by Noninteger Factors
2.8.4 Quadrature Mirror Filter Banks
2.9 Discrete-Time Random Signals
2.9.1 Random Signals Processed by LTI Digital Filters
2.9.2 Autocorrelation Estimation from Finite-Length Data
2.10 Summary
Problems
Computer Exercises

3 QUANTIZATION AND ENTROPY CODING
3.1 Introduction
3.1.1 The Quantization–Bit Allocation–Entropy Coding Module
3.2 Density Functions and Quantization
3.3 Scalar Quantization
3.3.1 Uniform Quantization
3.3.2 Nonuniform Quantization
3.3.3 Differential PCM
3.4 Vector Quantization
3.4.1 Structured VQ
3.4.2 Split-VQ
3.4.3 Conjugate-Structure VQ
3.5 Bit-Allocation Algorithms
3.6 Entropy Coding
3.6.1 Huffman Coding
3.6.2 Rice Coding
3.6.3 Golomb Coding
3.6.4 Arithmetic Coding
3.7 Summary
Problems
Computer Exercises

4 LINEAR PREDICTION IN NARROWBAND AND WIDEBAND CODING
4.1 Introduction
4.2 LP-Based Source-System Modeling for Speech
4.3 Short-Term Linear Prediction
4.3.1 Long-Term Prediction
4.3.2 ADPCM Using Linear Prediction
4.4 Open-Loop Analysis-Synthesis Linear Prediction
4.5 Analysis-by-Synthesis Linear Prediction
4.5.1 Code-Excited Linear Prediction Algorithms
4.6 Linear Prediction in Wideband Coding
4.6.1 Wideband Speech Coding
4.6.2 Wideband Audio Coding
4.7 Summary
Problems
Computer Exercises

5 PSYCHOACOUSTIC PRINCIPLES
5.1 Introduction
5.2 Absolute Threshold of Hearing
5.3 Critical Bands
5.4 Simultaneous Masking, Masking Asymmetry, and the Spread of Masking
5.4.1 Noise-Masking-Tone
5.4.2 Tone-Masking-Noise
5.4.3 Noise-Masking-Noise
5.4.4 Asymmetry of Masking
5.4.5 The Spread of Masking
5.5 Nonsimultaneous Masking
5.6 Perceptual Entropy
5.7 Example Codec Perceptual Model: ISO/IEC 11172-3 (MPEG-1) Psychoacoustic Model 1
5.7.1 Step 1: Spectral Analysis and SPL Normalization
5.7.2 Step 2: Identification of Tonal and Noise Maskers
5.7.3 Step 3: Decimation and Reorganization of Maskers
5.7.4 Step 4: Calculation of Individual Masking Thresholds
5.7.5 Step 5: Calculation of Global Masking Thresholds
5.8 Perceptual Bit Allocation
5.9 Summary
Problems
Computer Exercises

6 TIME-FREQUENCY ANALYSIS: FILTER BANKS AND TRANSFORMS
6.1 Introduction
6.2 Analysis-Synthesis Framework for M-band Filter Banks
6.3 Filter Banks for Audio Coding: Design Considerations
6.3.1 The Role of Time-Frequency Resolution in Masking Power Estimation
6.3.2 The Role of Frequency Resolution in Perceptual Bit Allocation
6.3.3 The Role of Time Resolution in Perceptual Bit Allocation
6.4 Quadrature Mirror and Conjugate Quadrature Filters
6.5 Tree-Structured QMF and CQF M-band Banks
6.6 Cosine Modulated "Pseudo QMF" M-band Banks
6.7 Cosine Modulated Perfect Reconstruction (PR) M-band Banks and the Modified Discrete Cosine Transform (MDCT)
6.7.1 Forward and Inverse MDCT
6.7.2 MDCT Window Design
6.7.3 Example MDCT Windows (Prototype FIR Filters)
6.8 Discrete Fourier and Discrete Cosine Transform
6.9 Pre-echo Distortion
6.10 Pre-echo Control Strategies
6.10.1 Bit Reservoir
6.10.2 Window Switching
6.10.3 Hybrid, Switched Filter Banks
6.10.4 Gain Modification
6.10.5 Temporal Noise Shaping
6.11 Summary
Problems
Computer Exercises

7 TRANSFORM CODERS
7.1 Introduction
7.2 Optimum Coding in the Frequency Domain
7.3 Perceptual Transform Coder
7.3.1 PXFM
7.3.2 SEPXFM
7.4 Brandenburg-Johnston Hybrid Coder
7.5 CNET Coders
7.5.1 CNET DFT Coder
7.5.2 CNET MDCT Coder 1
7.5.3 CNET MDCT Coder 2
7.6 Adaptive Spectral Entropy Coding
7.7 Differential Perceptual Audio Coder
7.8 DFT Noise Substitution
7.9 DCT with Vector Quantization
7.10 MDCT with Vector Quantization
7.11 Summary
Problems
Computer Exercises

8 SUBBAND CODERS
8.1 Introduction
8.1.1 Subband Algorithms
8.2 DWT and Discrete Wavelet Packet Transform (DWPT)
8.3 Adapted WP Algorithms
8.3.1 DWPT Coder with Globally Adapted Daubechies Analysis Wavelet
8.3.2 Scalable DWPT Coder with Adaptive Tree Structure
8.3.3 DWPT Coder with Globally Adapted General Analysis Wavelet
8.3.4 DWPT Coder with Adaptive Tree Structure and Locally Adapted Analysis Wavelet
8.3.5 DWPT Coder with Perceptually Optimized Synthesis Wavelets
8.4 Adapted Nonuniform Filter Banks
8.4.1 Switched Nonuniform Filter Bank Cascade
8.4.2 Frequency-Varying Modulated Lapped Transforms
8.5 Hybrid WP and Adapted WP/Sinusoidal Algorithms
8.5.1 Hybrid Sinusoidal/Classical DWPT Coder
8.5.2 Hybrid Sinusoidal/M-band DWPT Coder
8.5.3 Hybrid Sinusoidal/DWPT Coder with WP Tree Structure Adaptation (ARCO)
8.6 Subband Coding with Hybrid Filter Bank/CELP Algorithms
8.6.1 Hybrid Subband/CELP Algorithm for Low-Delay Applications
8.6.2 Hybrid Subband/CELP Algorithm for Low-Complexity Applications
8.7 Subband Coding with IIR Filter Banks
Problems
Computer Exercise

9 SINUSOIDAL CODERS
9.1 Introduction
9.2 The Sinusoidal Model
9.2.1 Sinusoidal Analysis and Parameter Tracking
9.2.2 Sinusoidal Synthesis and Parameter Interpolation
9.3 Analysis/Synthesis Audio Codec (ASAC)
9.3.1 ASAC Segmentation
9.3.2 ASAC Sinusoidal Analysis-by-Synthesis
9.3.3 ASAC Bit Allocation, Quantization, Encoding, and Scalability
9.4 Harmonic and Individual Lines Plus Noise Coder (HILN)
9.4.1 HILN Sinusoidal Analysis-by-Synthesis
9.4.2 HILN Bit Allocation, Quantization, Encoding, and Decoding
9.5 FM Synthesis
9.5.1 Principles of FM Synthesis
9.5.2 Perceptual Audio Coding Using an FM Synthesis Model
9.6 The Sines + Transients + Noise (STN) Model
9.7 Hybrid Sinusoidal Coders
9.7.1 Hybrid Sinusoidal-MDCT Algorithm
9.7.2 Hybrid Sinusoidal-Vocoder Algorithm
9.8 Summary
Problems
Computer Exercises

10 AUDIO CODING STANDARDS AND ALGORITHMS
10.1 Introduction
10.2 MIDI Versus Digital Audio
10.2.1 MIDI Synthesizer
10.2.2 General MIDI (GM)
10.2.3 MIDI Applications
10.3 Multichannel Surround Sound
10.3.1 The Evolution of Surround Sound
10.3.2 The Mono, the Stereo, and the Surround Sound Formats
10.3.3 The ITU-R BS.775 5.1-Channel Configuration
10.4 MPEG Audio Standards
10.4.1 MPEG-1 Audio (ISO/IEC 11172-3)
10.4.2 MPEG-2 BC/LSF (ISO/IEC-13818-3)
10.4.3 MPEG-2 NBC/AAC (ISO/IEC-13818-7)
10.4.4 MPEG-4 Audio (ISO/IEC 14496-3)
10.4.5 MPEG-7 Audio (ISO/IEC 15938-4)
10.4.6 MPEG-21 Framework (ISO/IEC-21000)
10.4.7 MPEG Surround and Spatial Audio Coding
10.5 Adaptive Transform Acoustic Coding (ATRAC)
10.6 Lucent Technologies PAC, EPAC, and MPAC
10.6.1 Perceptual Audio Coder (PAC)
10.6.2 Enhanced PAC (EPAC)
10.6.3 Multichannel PAC (MPAC)
10.7 Dolby Audio Coding Standards
10.7.1 Dolby AC-2, AC-2A
10.7.2 Dolby AC-3/Dolby Digital/Dolby SR·D
10.8 Audio Processing Technology APT-x100
10.9 DTS – Coherent Acoustics
10.9.1 Framing and Subband Analysis
10.9.2 Psychoacoustic Analysis
10.9.3 ADPCM – Differential Subband Coding
10.9.4 Bit Allocation, Quantization, and Multiplexing
10.9.5 DTS-CA Versus Dolby Digital
Problems
Computer Exercise

11 LOSSLESS AUDIO CODING AND DIGITAL WATERMARKING
11.1 Introduction
11.2 Lossless Audio Coding (L2AC)
11.2.1 L2AC Principles
11.2.2 L2AC Algorithms
11.3 DVD-Audio
11.3.1 Meridian Lossless Packing (MLP)
11.4 Super-Audio CD (SACD)
11.4.1 SACD Storage Format
11.4.2 Sigma-Delta Modulators (SDM)
11.4.3 Direct Stream Digital (DSD) Encoding
11.5 Digital Audio Watermarking
11.5.1 Background
11.5.2 A Generic Architecture for DAW
11.5.3 DAW Schemes – Attributes
11.6 Summary of Commercial Applications
Problems
Computer Exercise

12 QUALITY MEASURES FOR PERCEPTUAL AUDIO CODING
12.1 Introduction
12.2 Subjective Quality Measures
12.3 Confounding Factors in Subjective Evaluations
12.4 Subjective Evaluations of Two-Channel Standardized Codecs
12.5 Subjective Evaluations of 5.1-Channel Standardized Codecs
12.6 Subjective Evaluations Using Perceptual Measurement Systems
12.6.1 CIR Perceptual Measurement Schemes
12.6.2 NSE Perceptual Measurement Schemes
12.7 Algorithms for Perceptual Measurement
12.7.1 Example: Perceptual Audio Quality Measure (PAQM)
12.7.2 Example: Noise-to-Mask Ratio (NMR)
12.7.3 Example: Objective Audio Signal Evaluation (OASE)
12.8 ITU-R BS.1387 and ITU-T P.861: Standards for Perceptual Quality Measurement
12.9 Research Directions for Perceptual Codec Quality Measures

REFERENCES

INDEX

PREFACE

Audio processing and recording has been part of telecommunication and entertainment systems for more than a century. Moreover, bandwidth issues associated with audio recording, transmission, and storage occupied engineers from the very early stages in this field. A series of important technological developments paved the way from early phonographs to magnetic tape recording and, lately, compact disk (CD) and super storage devices. In the following, we capture some of the main events and milestones that mark the history in audio recording and storage.¹

Prototypes of phonographs appeared around 1877, and the first attempt to market cylinder-based gramophones was by the Columbia Phonograph Co. in 1889. Five years later, Marconi demonstrated the first radio transmission, which marked the beginning of audio broadcasting. The Victor Talking Machine Company, with the little nipper dog as its trademark, was formed in 1901. The "telegraphone," a magnetic recorder for voice that used steel wire, was patented in Denmark around the end of the nineteenth century. The Odeon and His Master's Voice (HMV) labels produced and marketed music recordings in the early nineteen hundreds. The cabinet phonograph with a horn, called "Victrola," appeared at about the same time. Diamond disk players were marketed in 1913, followed by efforts to produce sound-on-film for motion pictures. Other milestones include the first commercial transmission in Pittsburgh and the emergence of public address amplifiers. Electrically recorded material appeared in the 1920s, and the first sound-on-film was demonstrated in the mid 1920s by Warner Brothers. Cinema applications in the 1930s promoted advances in loudspeaker technologies, leading to the development of woofer, tweeter, and crossover network concepts. Jukeboxes for music also appeared in the 1930s. Magnetic tape recording was demonstrated in Germany in the 1930s by BASF and AEG/Telefunken. The Ampex tape recorders appeared in the US in the late 1940s. The demonstration of stereo high-fidelity (Hi-Fi) sound in the late 1940s spurred the development of amplifiers, speakers, and reel-to-reel tape recorders for home use in the 1950s, both in Europe and the US.

Apple iPod (Courtesy of Apple Computer, Inc.). Apple iPod is a registered trademark of Apple Computer, Inc.

Meanwhile, Columbia produced the 33-rpm long-play (LP) vinyl record, while its rival RCA Victor produced the compact 45-rpm format, whose sales took off with the emergence of rock and roll music. Technological developments in the mid 1950s resulted in the emergence of compact transistor-based radios and, soon after, small tape players. In 1963, Philips introduced the compact cassette tape format with its EL3300 series portable players (marketed in the US as Norelco), which became an instant success with accessories for home, portable, and car use. Eight-track cassettes became popular in the late 1960s, mainly for car use. The Dolby system for compact cassette noise reduction was also a landmark in the audio signal processing field. Meanwhile, FM broadcasting, which had been invented earlier, took off in the 1960s and 1970s with stereo transmissions. Helical tape-head technologies invented in Japan in the 1960s provided high-bandwidth recording capabilities, which enabled video tape recorders for home use in the 1970s (e.g., VHS and Beta formats). This technology was also used in the 1980s for audio PCM stereo recording. Laser compact disk technology was introduced in 1982 and by the late 1980s became the preferred format for Hi-Fi stereo recording. Analog compact cassette players, high-quality reel-to-reel recorders, expensive turntables, and virtually all analog recording devices started fading away by the late 1980s. The launch of the digital CD audio format in the 1980s coincided with the advent of personal computers and took over in all aspects of music recording and distribution. CD playback soon dominated broadcasting, automobile and home stereo, and the analog vinyl LP. The compact cassette formats became relics of an old era and eventually disappeared from music stores. Digital audio tape (DAT) systems, enabled by helical tape-head technology, were also introduced in the 1980s but were commercially unsuccessful because of strict copyright laws and unusually large taxes.

Parallel developments in digital video formats for laser disk technologies included work in audio compression systems. Audio compression research papers started appearing mostly in the 1980s at IEEE ICASSP and Audio Engineering Society conferences, by authors from several research and development labs including Erlangen-Nuremberg University and Fraunhofer IIS, AT&T Bell Laboratories, and Dolby Laboratories. Audio compression or audio coding research, the art of representing an audio signal with the least number of information bits while maintaining its fidelity, went through quantum leaps in the late 1980s and 1990s. Although originally most audio compression algorithms were developed as part of the digital motion video compression standards, e.g., the MPEG series, these algorithms eventually became important as stand-alone technologies for audio recording and playback. Progress in VLSI technologies, psychoacoustics, and efficient time-frequency signal representations made possible a series of scalable real-time compression algorithms for use in audio and cinema applications. In the 1990s, we witnessed the emergence of the first products that used compressed audio formats, such as the MiniDisc (MD) and the Digital Compact Cassette (DCC). The sound and video playing capabilities of the PC and the proliferation of multimedia content through the Internet had a profound impact on audio compression technologies. The MPEG-1/-2 Layer III (MP3) algorithm became a de facto standard for Internet music downloads. Specialized web sites that feature music content changed the ways people buy and share music. Compact MP3 players appeared in the late 1990s. In the early 2000s, we had the emergence of the Apple iPod player, with a hard drive that supports the MP3 and MPEG advanced audio coding (AAC) algorithms.

In order to enhance cinematic and home theater listening experiences and deliver greater realism than ever before, audio codec designers pursued sophisticated multichannel audio coding techniques. In the mid 1990s, techniques for encoding 5.1 separate channels of audio were standardized in MPEG-2 BC and later MPEG-2 AAC audio. Proprietary multichannel algorithms were also developed and commercialized by Dolby Laboratories (AC-3), Digital Theater Systems (DTS), Lucent (EPAC), Sony (SDDS), and Microsoft (WMA). Dolby Labs, DTS, Lexicon, and other companies also introduced 2:N-channel upmix algorithms capable of synthesizing a multichannel surround presentation from conventional stereo content (e.g., Dolby ProLogic II, DTS Neo:6). The human auditory system is capable of localizing sound with greater spatial resolution than current multichannel audio systems offer, and as a result the quest continues to achieve the ultimate spatial fidelity in sound reproduction. Research involving spatial audio, real-time acoustic source localization, binaural cue coding, and the application of head-related transfer functions (HRTF) towards rendering immersive audio has gained interest. Audiophiles appeared skeptical of the 44.1-kHz, 16-bit CD stereo format, and some were critical of the sound quality of compression formats. These ideas, along with the need for copyright protection, eventually gained momentum, and new standards and formats appeared in the early 2000s. In particular, multichannel lossless coding formats such as the DVD-Audio (DVD-A) and the Super-Audio-CD (SACD) appeared. The standardization of these storage formats provided the audio codec designers with enormous storage capacity. This motivated lossless coding of digital audio.

The purpose of this book is to provide an in-depth treatment of audio compression algorithms and standards. The topic currently occupies several communities in signal processing, multimedia, and audio engineering. The intended readership for this book includes at least three groups. At the highest level, any reader with a general scientific background will be able to gain an appreciation for the heuristics of perceptual coding. Second, readers with a general electrical and computer engineering background will become familiar with the essential signal processing techniques and perceptual models embedded in most audio coders. Finally, undergraduate and graduate students with focuses in multimedia, DSP, and computer music will gain important knowledge in signal analysis and audio coding algorithms. The vast body of literature provided and the tutorial aspects of the book make it an asset for audiophiles as well.

Organization

This book is in part the outcome of many years of research and teaching at Arizona State University. We opted to include exercises and computer problems and hence enable instructors either to use the content in existing DSP and multimedia courses or to promote the creation of new courses with a focus on audio and speech processing and coding. The book has twelve chapters, and each chapter contains problems, proofs, and computer exercises. Chapter 1 introduces the reader to the field of audio signal processing and coding. In Chapter 2, we review basic signal processing theory and emphasize concepts relevant to audio coding. Chapter 3 describes waveform quantization and entropy coding schemes. Chapter 4 covers linear predictive coding and its utility in speech and audio coding. Chapter 5 covers psychoacoustics, and Chapter 6 explores filter bank design. Chapter 7 describes transform coding methodologies. Subband and sinusoidal coding algorithms are addressed in Chapters 8 and 9, respectively. Chapter 10 reviews several audio coding standards, including the ISO/IEC MPEG family, the cinematic Sony SDDS, the Dolby AC-3, and the DTS coherent acoustics (DTS-CA). Chapter 11 focuses on lossless audio coding and digital audio watermarking techniques. Chapter 12 provides information on subjective quality measures.

Use in Courses

For an undergraduate elective course with little or no background in DSP, the instructor can cover in detail Chapters 1, 2, 3, 4, and 5, then present select sections of Chapter 6, and describe in an expository and qualitative manner certain basic algorithms and standards from Chapters 7–11. A graduate class in audio coding with students that have a background in DSP can start from Chapter 5 and cover in detail Chapters 6 through 11. Audio coding practitioners and researchers that are interested mostly in qualitative descriptions of the standards and information on the bibliography can start at Chapter 5 and proceed reading through Chapter 11.

Trademarks and Copyrights

Sony Dynamic Digital Sound (SDDS), ATRAC, and MiniDisc are trademarks of Sony Corporation. Dolby, Dolby Digital, AC-2, AC-3, DolbyFAX, and Dolby Pro-Logic are trademarks of Dolby Laboratories. The perceptual audio coder (PAC), EPAC, and MPAC are trademarks of AT&T and Lucent Technologies. APT-x100 is a trademark of Audio Processing Technology, Inc. DTS-CA is a trademark of Digital Theater Systems, Inc. Apple iPod is a registered trademark of Apple Computer, Inc.

Acknowledgments

The authors have all spent time at Arizona State University (ASU), and Prof. Spanias is in fact still teaching and directing research in this area at ASU. The group of authors has worked on grants with Intel Corporation and would like to thank this organization for providing grants in scalable speech and audio coding that created opportunities for in-depth studies in these areas. Special thanks to our colleagues at Intel Corporation at that time, including Brian Mears, Gopal Nair, Hedayat Daie, Mark Walker, Michael Deisher, and Tom Gardos. We also wish to acknowledge the support of current Intel colleagues Gang Liang, Mike Rosenzweig, and Jim Zhou, as well as Scott Peirce for proofreading some of the material. Thanks also to former doctoral students at ASU, including Philip Loizou and Sassan Ahmadi, for many useful discussions in speech and audio processing. We also appreciate discussions on narrowband vocoders with Bruce Fette in the late 1990s, then with Motorola GEG and now with General Dynamics.

The authors also acknowledge the National Science Foundation (NSF) CCLI for grants in education that supported in part the preparation of several computer examples and paradigms in psychoacoustics and signal coding. Also, some of the early work in coding of Dr. Spanias was supported by the Naval Research Laboratories (NRL), and we would like to thank that organization for providing ideas for projects that inspired future work in this area. We also wish to thank ASU and some of the faculty and administrators that provided moral and material support for work in this area. Thanks are extended to current ASU students Shibani Misra, Visar Berisha, and Mahesh Banavar for proofreading some of the material. We thank the Wiley-Interscience production team, George Telecki, Melissa Yanuzzi, and Rachel Witmer, for their diligent efforts in copyediting, cover design, and typesetting. We also thank all the anonymous reviewers for their useful comments. Finally, we all wish to express our thanks to our families for their support.

The book content is used frequently in ASU online courses and industry short courses offered by Andreas Spanias. Contact Andreas Spanias (spanias@asu.edu, http://www.fulton.asu.edu/~spanias) for details.

¹ Resources used for obtaining important dates in recording history include web sites at the University of San Diego, Arizona State University, and Wikipedia.

CHAPTER 1

INTRODUCTION

Audio coding or audio compression algorithms are used to obtain compact digital representations of high-fidelity (wideband) audio signals for the purpose of efficient transmission or storage. The central objective in audio coding is to represent the signal with a minimum number of bits while achieving transparent signal reproduction, i.e., generating output audio that cannot be distinguished from the original input, even by a sensitive listener ("golden ears"). This text gives an in-depth treatment of algorithms and standards for transparent coding of high-fidelity audio.

1.1 HISTORICAL PERSPECTIVE

The introduction of the compact disc (CD) in the early 1980s brought to the fore all of the advantages of digital audio representation, including true high fidelity, dynamic range, and robustness. These advantages, however, came at the expense of high data rates. Conventional CD and digital audio tape (DAT) systems are typically sampled at either 44.1 or 48 kHz using pulse code modulation (PCM) with a 16-bit sample resolution. This results in uncompressed data rates of 705.6/768 kb/s for a monaural channel, or 1.41/1.54 Mb/s for a stereo pair. Although these data rates were accommodated successfully in first-generation CD and DAT players, second-generation audio players and wirelessly connected systems are often subject to bandwidth constraints that are incompatible with high data rates.

Because of the success enjoyed by the first-generation systems, however, end users have come to expect "CD-quality" audio reproduction from any digital system. Therefore, new network and wireless multimedia digital audio systems must reduce data rates without compromising reproduction quality. Motivated by the need for compression algorithms that can satisfy simultaneously the conflicting demands of high compression ratios and transparent quality for high-fidelity audio signals, several coding methodologies have been established over the last two decades. Audio compression schemes, in general, employ design techniques that exploit both perceptual irrelevancies and statistical redundancies.
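The uncompressed rates quoted above follow directly from the sampling rate and the word length; the short MATLAB sketch below (illustrative, not from the text) reproduces them.

% Uncompressed PCM data rates for CD/DAT-grade audio (illustrative sketch)
fs   = [44100 48000];          % sampling rates in Hz (CD and DAT)
bits = 16;                     % bits per sample
mono_rate   = fs * bits;       % b/s for a single channel
stereo_rate = 2 * mono_rate;   % b/s for a stereo pair
fprintf('Mono:   %.1f and %.1f kb/s\n', mono_rate/1e3);    % 705.6 and 768.0 kb/s
fprintf('Stereo: %.2f and %.2f Mb/s\n', stereo_rate/1e6);  % 1.41 and 1.54 Mb/s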

PCM was the primary audio encoding scheme employed until the early 1980s. PCM does not provide any mechanisms for redundancy removal. Quantization methods that exploit signal correlation, such as differential PCM (DPCM), delta modulation [Jaya76] [Jaya84], and adaptive DPCM (ADPCM), were applied to audio compression later (e.g., PC audio cards). Owing to the need for drastic reduction in bit rates, researchers began to pursue new approaches for audio coding based on the principles of psychoacoustics [Zwic90] [Moor03]. Psychoacoustic notions in conjunction with the basic properties of signal quantization have led to the theory of perceptual entropy [John88a] [John88b]. Perceptual entropy is a quantitative estimate of the fundamental limit of transparent audio signal compression. Another key contribution to the field was the characterization of the auditory filter bank, and particularly the time-frequency analysis capabilities of the inner ear [Moor83]. Over the years, several filter-bank structures that mimic the critical band structure of the auditory filter bank have been proposed. A filter bank is a parallel bank of bandpass filters covering the audio spectrum, which, when used in conjunction with a perceptual model, can play an important role in the identification of perceptual irrelevancies.

During the early 1990s, several workgroups and organizations, such as the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC), the International Telecommunications Union (ITU), AT&T, Dolby Laboratories, Digital Theatre Systems (DTS), Lucent Technologies, Philips, and Sony, were actively involved in developing perceptual audio coding algorithms and standards. Some of the popular commercial standards published in the early 1990s include Dolby's Audio Coder-3 (AC-3), the DTS Coherent Acoustics (DTS-CA), Lucent Technologies' Perceptual Audio Coder (PAC), Philips' Precision Adaptive Subband Coding (PASC), and Sony's Adaptive Transform Acoustic Coding (ATRAC). Table 1.1 lists chronologically some of the prominent audio coding standards. The commercial success enjoyed by these audio coding standards triggered the launch of several multimedia storage formats.

Table 1.2 lists some of the popular multimedia storage formats since the beginning of the CD era. High-performance stereo systems became quite common with the advent of CDs in the early 1980s. A compact-disc–read-only memory (CD-ROM) can store data up to 700–800 MB in digital form as "microscopic pits" that can be read by a laser beam off of a reflective surface or medium. Three competing storage media – DAT, the digital compact cassette (DCC), and the MiniDisc (MD) – entered the commercial market during 1987–1992. Intended mainly for back-up high-density storage (~1.3 GB), the DAT became the primary source of mass data storage/transfer [Watk88] [Tan89]. In 1991–1992, Sony proposed a storage medium called the MiniDisc, primarily for audio storage. MD employs the ATRAC algorithm for compression. In 1991, Philips introduced the DCC, a successor of the analog compact cassette. Philips DCC employs a compression scheme called PASC [Lokh91] [Lokh92] [Hoog94]. The DCC began as a potential competitor for DAT but was discontinued in 1996. The introduction of the digital versatile disc (DVD) in 1996 enabled both video and audio recording/storage, as well as text-message programming. The DVD became one of the most successful storage media. With the improvements in audio compression and DVD storage technologies, multichannel surround sound encoding formats gained interest [Bosi93] [Holm99] [Bosi00].

Table 1.1. List of perceptual and lossless audio coding standards/algorithms

Standard/algorithm – Related references
1. ISO/IEC MPEG-1 audio – [ISOI92]
2. Philips' PASC (for DCC applications) – [Lokh92]
3. AT&T/Lucent PAC/EPAC – [John96c] [Sinh96]
4. Dolby AC-2 – [Davi92] [Fiel91]
5. AC-3/Dolby Digital – [Davis93] [Fiel96]
6. ISO/IEC MPEG-2 (BC/LSF) audio – [ISOI94a]
7. Sony's ATRAC (MiniDisc and SDDS) – [Yosh94] [Tsut96]
8. SHORTEN – [Robi94]
9. Audio processing technology – APT-x100 – [Wyli96b]
10. ISO/IEC MPEG-2 AAC – [ISOI96]
11. DTS coherent acoustics – [Smyt96] [Smyt99]
12. The DVD algorithm – [Crav96] [Crav97]
13. MUSICompress – [Wege97]
14. Lossless transform coding of audio (LTAC) – [Pura97]
15. AudioPaK – [Hans98b] [Hans01]
16. ISO/IEC MPEG-4 audio version 1 – [ISOI99]
17. Meridian lossless packing (MLP) – [Gerz99]
18. ISO/IEC MPEG-4 audio version 2 – [ISOI00]
19. Audio coding based on integer transforms – [Geig01] [Geig02]
20. Direct-stream digital (DSD) technology – [Reef01a] [Jans03]

Table 1.2. Some of the popular audio storage formats

Audio storage format – Related references
1. Compact disc – [CD82] [IECA87]
2. Digital audio tape (DAT) – [Watk88] [Tan89]
3. Digital compact cassette (DCC) – [Lokh91] [Lokh92]
4. MiniDisc – [Yosh94] [Tsut96]
5. Digital versatile disc (DVD) – [DVD96]
6. DVD-audio (DVD-A) – [DVD01]
7. Super audio CD (SACD) – [SACD02]

With the emergence of streaming audio applications during the late 1990s, researchers pursued techniques such as combined speech and audio architectures, as well as joint source-channel coding algorithms that are optimized for the packet-switched Internet. The advent of the ISO/IEC MPEG-4 standard (1996–2000) [ISOI99] [ISOI00] established new research goals for high-quality coding of audio at low bit rates. MPEG-4 audio encompasses more functionality than perceptual coding [Koen98] [Koen99]. It comprises an integrated family of algorithms with provisions for scalable, object-based speech and audio coding at bit rates from as low as 200 b/s up to 64 kb/s per channel.

The emergence of the DVD-audio and the super audio CD (SACD) provided designers with additional storage capacity, which motivated research in lossless audio coding [Crav96] [Gerz99] [Reef01a]. A lossless audio coding system is able to reconstruct perfectly a bit-for-bit representation of the original input audio. In contrast, a coding scheme incapable of perfect reconstruction is called lossy. For most audio program material, lossy schemes offer the advantage of lower bit rates (e.g., less than 1 bit per sample) relative to lossless schemes (e.g., 10 bits per sample). Delivering real-time lossless audio content to the network browser at low bit rates is the next grand challenge for codec designers.

1.2 A GENERAL PERCEPTUAL AUDIO CODING ARCHITECTURE

Over the last few years, researchers have proposed several efficient signal models (e.g., transform-based, subband-filter structures, wavelet-packet) and compression standards (Table 1.1) for high-quality digital audio reproduction. Most of these algorithms are based on the generic architecture shown in Figure 1.1.

The coders typically segment input signals into quasi-stationary frames ranging from 2 to 50 ms. Then, a time-frequency analysis section estimates the temporal and spectral components of each frame. The time-frequency mapping is usually matched to the analysis properties of the human auditory system. Either way, the ultimate objective is to extract from the input audio a set of time-frequency parameters that is amenable to quantization according to a perceptual distortion metric. Depending on the overall design objectives, the time-frequency analysis section usually contains one of the following:

• Unitary transform
• Time-invariant bank of critically sampled, uniform/nonuniform bandpass filters
• Time-varying (signal-adaptive) bank of critically sampled, uniform/nonuniform bandpass filters
• Harmonic/sinusoidal analyzer
• Source-system analysis (LPC and multipulse excitation)
• Hybrid versions of the above

Figure 1.1. A generic perceptual audio encoder. The input audio feeds a time-frequency analysis block whose parameters are quantized and encoded, entropy (lossless) coded, and multiplexed to the channel; a psychoacoustic analysis block supplies masking thresholds that drive the bit allocation, and side information is multiplexed along with the coded parameters.

The choice of time-frequency analysis methodology always involves a fundamental tradeoff between time and frequency resolution requirements. Perceptual distortion control is achieved by a psychoacoustic signal analysis section that estimates signal masking power based on psychoacoustic principles. The psychoacoustic model delivers masking thresholds that quantify the maximum amount of distortion at each point in the time-frequency plane such that quantization of the time-frequency parameters does not introduce audible artifacts. The psychoacoustic model therefore allows the quantization section to exploit perceptual irrelevancies. This section can also exploit statistical redundancies through classical techniques such as DPCM or ADPCM. Once a quantized compact parametric set has been formed, the remaining redundancies are typically removed through noiseless run-length (RL) and entropy coding techniques, e.g., Huffman [Cove91], arithmetic [Witt87], or Lempel-Ziv-Welch (LZW) [Ziv77] [Welc84]. Since the output of the psychoacoustic distortion control model is signal-dependent, most algorithms are inherently variable rate. Fixed channel rate requirements are usually satisfied through buffer feedback schemes, which often introduce encoding delays.
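To make the data flow of Figure 1.1 concrete, the following MATLAB-style sketch outlines one frame of a hypothetical encoder. The masking model, the 6-dB-per-bit allocation rule, and the input vector audio_in are deliberate simplifications and assumptions for illustration; no standardized coder computes its thresholds this way.

% Skeleton of the generic encoder loop of Figure 1.1 (illustrative only).
% 'audio_in' is any mono signal vector; all numeric choices are placeholders.
N = 1024;                                    % frame length in samples
w = 0.5 - 0.5*cos(2*pi*(0:N-1)'/N);          % analysis window
nFrames = floor(length(audio_in)/N);
for k = 1:nFrames
    frame = audio_in((k-1)*N + (1:N));
    X = fft(w .* frame(:));                  % time-frequency analysis
    P = 10*log10(abs(X(1:N/2)).^2 + eps);    % power spectrum in dB
    % Crude stand-in for the psychoacoustic model: a smoothed spectrum
    % lowered by a fixed offset acts as the "masking threshold".
    mask = filter(ones(1,31)/31, 1, P) - 13;
    smr  = max(P - mask, 0);                 % signal-to-mask ratio per bin
    b    = round(smr / 6.02);                % ~6 dB per bit rule of thumb
    % ...quantize X according to 'b', entropy-code, and multiplex...
end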

1.3 AUDIO CODER ATTRIBUTES

Perceptual audio coders are typically evaluated based on the following attributes: audio reproduction quality, operating bit rates, computational complexity, codec delay, and channel error robustness. The objective is to attain a high-quality (transparent) audio output at low bit rates (<32 kb/s), with an acceptable algorithmic delay (~5 to 20 ms), and with low computational complexity (~1 to 10 million instructions per second, or MIPS).

1.3.1 Audio Quality

Audio quality is of paramount importance when designing an audio coding algorithm. Successful strides have been made since the development of simple, near-transparent perceptual coders. Typically, classical objective measures of signal fidelity, such as the signal-to-noise ratio (SNR) and the total harmonic distortion (THD), are inadequate [Ryde96]. As the field of perceptual audio coding matured rapidly and created greater demand for listening tests, there was a corresponding growth of interest in perceptual measurement schemes. Several subjective and objective quality measures have been proposed and standardized during the last decade. Some of these schemes include the noise-to-mask ratio (NMR, 1987) [Bran87a], the perceptual audio quality measure (PAQM, 1991) [Beer91], the perceptual evaluation (PERCEVAL, 1992) [Pail92], the perceptual objective measure (POM, 1995) [Colo95], and the objective audio signal evaluation (OASE, 1997) [Spor97]. We will address these and several other quality assessment schemes in detail in Chapter 12.

1.3.2 Bit Rates

From a codec designer's point of view, one of the key challenges is to represent high-fidelity audio with a minimum number of bits. For instance, if a 5-ms audio frame sampled at 48 kHz (240 samples per frame) is represented using 80 bits, then the encoding bit rate would be 80 bits/5 ms = 16 kb/s. Low bit rates imply high compression ratios and, generally, low reproduction quality. Early coders, such as the ISO/IEC MPEG-1 (32–448 kb/s), the Dolby AC-3 (32–384 kb/s), the Sony ATRAC (256 kb/s), and the Philips PASC (192 kb/s), employ high bit rates for obtaining transparent audio reproduction. However, the development of several sophisticated audio coding tools (e.g., MPEG-4 audio tools) created ways for efficient transmission or storage of audio at rates between 8 and 32 kb/s. Future audio coding algorithms promise to offer reasonable quality at low rates, along with the ability to scale both rate and quality to match different requirements, such as time-varying channel capacity.
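The frame-based arithmetic in this example is summarized by the short MATLAB sketch below (the values are those of the example, used only for illustration).

% Bit rate from frame size and bits per frame (illustrative)
fs = 48000;  frame_ms = 5;  bits_per_frame = 80;
samples_per_frame = fs * frame_ms/1000;               % 240 samples
bit_rate = bits_per_frame / (frame_ms/1000);          % 16000 b/s = 16 kb/s
bits_per_sample = bits_per_frame / samples_per_frame; % 1/3 bit per sample on average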

1.3.3 Complexity

Reduced computational complexity not only enables real-time implementation but may also decrease power consumption and extend battery life. Computational complexity is usually measured in terms of millions of instructions per second (MIPS). Complexity estimates are processor-dependent. For example, the complexity associated with Dolby's AC-3 decoder was estimated at approximately 27 MIPS using the Zoran ZR38001 general-purpose DSP core [Vern95]; for the Motorola DSP56002 processor, the complexity was estimated at 45 MIPS [Vern95]. Usually, most audio codecs rely on the so-called asymmetric encoding principle. This means that the codec complexity is not evenly shared between the encoder and the decoder (typically, encoder 80% and decoder 20% complexity), with more emphasis on reducing the decoder complexity.

1.3.4 Codec Delay

Many of the network applications for high-fidelity audio (streaming audio, audio-on-demand) are delay tolerant (up to 100–200 ms), providing the opportunity to exploit long-term signal properties in order to achieve high coding gain. However, in two-way real-time communication and voice-over-Internet-protocol (VoIP) applications, low-delay encoding (10–20 ms) is important. Consider the example described before, i.e., an audio coder operating on frames of 5 ms at a 48-kHz sampling frequency. In an ideal encoding scenario, the minimum amount of delay should be 5 ms at the encoder and 5 ms at the decoder (the same as the frame length). However, other factors, such as the analysis-synthesis filter bank window, the look-ahead, the bit reservoir, and the channel delay, contribute additional delays. Employing shorter analysis-synthesis windows, avoiding look-ahead, and restructuring the bit-reservoir functions could result in low-delay encoding, nonetheless with reduced coding efficiency.
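As a rough illustration of how these contributions add up, the sketch below tallies a hypothetical delay budget in MATLAB; the look-ahead, overlap, and bit-reservoir values are assumptions chosen for illustration, not figures from any particular coder.

% Back-of-the-envelope algorithmic delay budget for a frame-based coder
% (all values below are assumed for illustration)
frame_ms     = 5;   % one frame buffered at the encoder and one at the decoder
overlap_ms   = 5;   % analysis-synthesis window overlap (assumed)
lookahead_ms = 2;   % psychoacoustic look-ahead (assumed)
reservoir_ms = 8;   % worst-case bit-reservoir contribution (assumed)
total_ms = 2*frame_ms + overlap_ms + lookahead_ms + reservoir_ms  % = 25 ms, excluding channel delay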

1.3.5 Error Robustness

The increasing popularity of streaming audio over packet-switched and wireless networks such as the Internet implies that any algorithm intended for such applications must be able to deal with a noisy, time-varying channel. In particular, provisions for error robustness and error protection must be incorporated at the encoder in order to achieve reliable transmission of digital audio over error-prone channels. One simple idea could be to provide better protection to the error-sensitive and priority (important) bits. For instance, the audio frame header requires the maximum error robustness; otherwise, transmission errors in the header will seriously impair the entire audio frame. Several error detecting/correcting codes [Lin82] [Wick95] [Bayl97] [Swee02] [Zara02] can also be employed. Inclusion of error correcting codes in the bitstream might help to obtain error-free reproduction of the input audio, however, with increased complexity and bit rates.

From the discussion in the previous sections, it is evident that several tradeoffs must be considered in designing an algorithm for a particular application. For this reason, audio coding standards consist of several tools that enable the design of scalable algorithms. For example, MPEG-4 provides tools to design algorithms that satisfy a variety of bit rate, delay, complexity, and robustness requirements.

1.4 TYPES OF AUDIO CODERS – AN OVERVIEW

Based on the signal model or the analysis-synthesis technique employed to encode audio signals, audio coders can be broadly classified as follows:

• Linear predictive
• Transform
• Subband
• Sinusoidal

Algorithms are also classified based on the lossy or lossless nature of audio coding. Lossy audio coding schemes achieve compression by exploiting perceptually irrelevant information. Some examples of lossy audio coding schemes include the ISO/IEC MPEG codec series, the Dolby AC-3, and the DTS CA. In lossless audio coding, the audio data is merely "packed" to obtain a bit-for-bit representation of the original. The meridian lossless packing (MLP) [Gerz99] and the direct stream digital (DSD) techniques [Brue97] [Reef01a] form a class of high-end lossless compression algorithms that are embedded in the DVD-audio [DVD01] and the SACD [SACD02] storage formats, respectively. Lossless audio coding techniques, in general, yield high-quality digital audio without any artifacts, but at high rates. For instance, perceptual audio coding yields compression ratios from 10:1 to 25:1, while lossless audio coding can achieve compression ratios from 2:1 to 4:1.

1.5 ORGANIZATION OF THE BOOK

This book is organized as follows. In Chapter 2, we review basic signal processing concepts associated with audio coding. Chapter 3 provides introductory material on waveform quantization and entropy coding schemes. Some of the key topics covered in this chapter include scalar quantization, uniform/nonuniform quantization, pulse code modulation (PCM), differential PCM (DPCM), adaptive DPCM (ADPCM), vector quantization (VQ), bit-allocation techniques, and entropy coding schemes (Huffman, Rice, and arithmetic).

Chapter 4 provides information on linear prediction and its application in narrowband and wideband coding. First, we address the utility of the LP analysis/synthesis approach in speech applications. Next, we describe the open-loop analysis-synthesis LP and closed-loop analysis-by-synthesis LP techniques.

In Chapter 5, psychoacoustic principles are described. Johnston's notion of perceptual entropy is presented as a measure of the fundamental limit of transparent compression for audio. The ISO/IEC 11172-3 MPEG-1 psychoacoustic analysis model 1 is used to describe the five important steps associated with the global masking threshold computation. Chapter 6 explores filter bank design issues and algorithms, with a particular emphasis placed on the modified discrete cosine transform (MDCT) that is widely used in several perceptual audio coding algorithms. Chapter 6 also addresses pre-echo artifacts and control strategies.

Chapters 7, 8, and 9 review established and emerging techniques for transparent coding of FM- and CD-quality audio signals, including several algorithms that have become international standards. Transform coding methodologies are described in Chapter 7, subband coding algorithms are addressed in Chapter 8, and sinusoidal algorithms are presented in Chapter 9. In addition to methods based on uniform bandwidth filter banks, Chapter 8 covers coding methods that utilize discrete wavelet transforms (DWT), discrete wavelet packet transforms (DWPT), and other nonuniform filter banks. Examples of hybrid algorithms that make use of more than one signal model appear throughout Chapters 7, 8, and 9.

Chapter 10 is concerned with standardization activities in audio coding. It describes coding standards and products such as the ISO/IEC MPEG family (-1 "MP1/2/3", -2, -4, -7, and -21), the Sony MiniDisc (ATRAC), the cinematic Sony SDDS, the Lucent Technologies PAC/EPAC/MPAC, the Dolby AC-2/AC-3, the Audio Processing Technology APT-x100, and the DTS coherent acoustics (DTS-CA). Details on the MP3 and MPEG-4 AAC algorithms that are popular in Web and handheld media applications, e.g., the Apple iPod, are provided.

Chapter 11 focuses on lossless audio coding and digital audio watermarking techniques. In particular, the SHORTEN, the DVD algorithm, the MUSICompress, the AudioPaK, the C-LPAC, the LTAC, and the IntMDCT lossless coding schemes are described in detail. Chapter 11 also addresses the two popular high-end storage formats, i.e., the SACD and the DVD-Audio. The MLP and the DSD techniques for lossless audio coding are also presented.

Chapter 12 provides information on subjective quality measures for perceptual codecs. The five-point absolute and differential subjective quality scales are addressed, as well as the subjective test methodologies specified in ITU-R Recommendation BS.1116. A set of subjective benchmarks is provided for the various standards in both stereophonic and multichannel modes to facilitate algorithm comparisons.

1.6 NOTATIONAL CONVENTIONS

Unless otherwise specified, bit rates correspond to single-channel or monaural coding throughout this text. Subjective quality measurements are specified in terms of either the five-point mean opinion score (MOS, Table 1.3) or the 41-point subjective difference grade (SDG, Chapter 12, Table 12.1). Table 1.4 lists some of the symbols and notation used in the book.

Table 1.3. Mean opinion score (MOS) scale

MOS – Perceptual quality
1 – Bad
2 – Poor
3 – Average
4 – Good
5 – Excellent


Table 1.4. Symbols and notation used in the book

Symbol/notation – Description
t, n – Time index, sample index
ω, Ω – Frequency index (analog domain, discrete domain)
f (= ω/2π) – Frequency (Hz)
Fs, Ts – Sampling frequency, sampling period
x(t) ↔ X(ω) – Continuous-time Fourier transform (FT) pair
x(n) ↔ X(Ω) – Discrete-time Fourier transform (DTFT) pair
x(n) ↔ X(z) – z-transform pair
[ ] – Indicates a particular element in a coefficient array
h(n) ↔ H(Ω) – Impulse-frequency response pair of a discrete-time system
e(n) – Error/prediction residual
H(z) = B(z)/A(z) = (1 + b1 z^{−1} + ... + bL z^{−L}) / (1 + a1 z^{−1} + ... + aM z^{−M}) – Transfer function consisting of a numerator polynomial and a denominator polynomial (corresponding to b-coefficients and a-coefficients)
ŝ(n) = Σ_{i=0}^{M} ai s(n − i) – Predicted signal
Q⟨s(n)⟩ = ŝ(n) – Quantization/approximation operator, or estimated/encoded value
s^[ ] – Square brackets in the superscript denote recursion
s^( ) – Parentheses in the superscript denote time dependency
N, Nf, Nsf – Total number of samples, samples per frame, samples per subframe
log(), ln(), log_p() – Log to the base 10, log to the base e, log to the base p
E[ ] – Expectation operator
ε – Mean squared error (MSE)
μx, σx² – Mean and variance of the signal x(n)
rxx(m) – Autocorrelation of the signal x(n)
rxy(m) – Cross-correlation of x(n) and y(n)
Rxx(e^{jΩ}) – Power spectral density of the signal x(n)
Bit rate – Number of bits per second (b/s, kb/s, or Mb/s)
dB SPL – Decibels sound pressure level


PROBLEMS

The objective of these introductory problems is to introduce the novice to simple relations between the sampling rate and the bit rate in PCM-coded sequences.

1.1. Consider an audio signal s(n) sampled at 44.1 kHz and digitized using (a) 8-bit, (b) 24-bit, and (c) 32-bit resolution. Compute the data rates for the cases (a)–(c). Give the number of samples within a 16-ms frame and compute the number of bits per frame.

1.2. List some of the typical data rates (in kb/s) and sampling rates (in kHz) employed in applications such as (a) video streaming, (b) audio streaming, (c) digital audio broadcasting, (d) digital compact cassette, (e) MiniDisc, (f) DVD, (g) DVD-audio, (h) SACD, (i) MP3, (j) MP4, (k) video conferencing, and (l) cellular telephony.

COMPUTER EXERCISES

The objective of this exercise is to familiarize the reader with the handling of sound files using MATLAB and to expose the novice to the perceptual attributes of sampling rate and bit resolution.

1.3. For this computer exercise, use the MATLAB workspace ch1pb1.mat from the website.

Load the workspace ch1pb1.mat using

>> load('ch1pb1.mat');

Use the whos command to view the variables in the workspace. The data vector 'audio_in' contains 44,100 samples of audio data. Perform the following in MATLAB:

>> wavwrite(audio_in, 44100, 16, 'pb1_aud44_16.wav');
>> wavwrite(audio_in, 10000, 16, 'pb1_aud10_16.wav');
>> wavwrite(audio_in, 44100, 8, 'pb1_aud44_08.wav');

Listen to the wave files pb1_aud44_16.wav, pb1_aud10_16.wav, and pb1_aud44_08.wav using a media player. Comment on the perceptual quality of the three wave files.

1.4. Down-sample the data vector 'audio_in' in Problem 1.3 using

>> aud_down_4 = downsample(audio_in, 4);

Use the following commands to listen to audio_in and aud_down_4. Comment on the perceptual quality of the data vectors in each of the cases below:

>> sound(audio_in, fs);
>> sound(aud_down_4, fs);
>> sound(aud_down_4, fs/4);

CHAPTER 2

SIGNAL PROCESSING ESSENTIALS

2.1 INTRODUCTION

The signal processing theory described here is restricted to the concepts that are relevant to audio coding. Because of the limited scope of this chapter, we provide mostly qualitative descriptions and establish only the essential mathematical formulas. First, we briefly review the basics of continuous-time (analog) signals and systems and the methods used to characterize the frequency spectrum of analog signals. We then present the basics of analog filters and subsequently describe discrete-time signals. Coverage of the basics of discrete-time signals includes the fundamentals of transforms that represent the spectra of digital sequences and the theory of digital filters. The essentials of random and multirate signal processing are also reviewed in this chapter.

2.2 SPECTRA OF ANALOG SIGNALS

The frequency spectrum of an analog signal is described in terms of the continuous Fourier transform (CFT). The CFT of a continuous-time signal x(t) is given by

X(ω) = ∫_{−∞}^{∞} x(t) e^{−jωt} dt,    (2.1)

where ω is the frequency in radians per second (rad/s). Note that ω = 2πf, where f is the frequency in Hz. The complex-valued function X(ω) describes the magnitude and phase spectrum of the signal. The inverse CFT is given by

x(t) = (1/2π) ∫_{−∞}^{∞} X(ω) e^{jωt} dω.    (2.2)

The inverse CFT is also known as the synthesis formula because it describes the time-domain signal x(t) in terms of complex sinusoids. In CFT theory, x(t) and X(ω) are called a transform pair, i.e.,

x(t) ↔ X(ω).    (2.3)

Figure 2.1. The pulse-sinc CFT pair: the rectangular pulse x(t) = 1 for −T0/2 ≤ t ≤ T0/2 (and 0 otherwise) transforms to X(ω) = T0 sinc(ωT0/2).

Figure 2.2. CFT of a sinusoid and of a truncated sinusoid: an infinite sinusoid cos(ω0 t) transforms to a pair of impulses, X(ω) = π(δ(ω − ω0) + δ(ω + ω0)), while truncation by a window of length T0 replaces each impulse with a shifted sinc.
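As a quick numerical check (not part of the text), the CFT integral of Eq. (2.1) can be approximated by a Riemann sum in MATLAB and compared against the closed-form pulse-sinc pair of Figure 2.1; the grid sizes below are arbitrary choices.

% Numerical check of the pulse-sinc CFT pair (Eq. 2.1) via a Riemann sum
T0 = 1e-3;  dt = 1e-6;                    % 1-ms pulse, 1-us integration step
t  = -T0/2:dt:T0/2;                       % the pulse equals 1 on this interval, 0 elsewhere
w  = linspace(-4*pi/T0, 4*pi/T0, 400);    % frequency grid in rad/s (grid avoids w = 0)
X  = ones(size(t)) * exp(-1j*t'*w) * dt;  % X(w) ~ sum over t of x(t) e^{-jwt} dt
X_theory = T0 * sin(w*T0/2) ./ (w*T0/2);  % closed-form CFT: T0*sinc(wT0/2)
plot(w, real(X), w, X_theory, '--');      % curves overlay; X is essentially real (even pulse)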


The pulse-sinc pair shown in Figure 2.1 is useful in explaining the effects of time-domain truncation on the spectra. For example, when a sinusoid is truncated, there is loss of resolution and spectral leakage, as shown in Figure 2.2.

In real-life signal processing, all signals have finite length and hence time-domain truncation always occurs. The truncation of an audio segment by a rectangular window is shown in Figure 2.3. To smooth out frame transitions and control spectral leakage effects, the signal is often tapered prior to truncation using window functions such as the Hamming, the Bartlett, and the trapezoidal windows. A tapered window avoids sharp discontinuities at the edges of the truncated time-domain frame. This in turn reduces the spectral leakage in the frequency spectrum of the truncated signal. This reduction of spectral leakage is attributed to the reduced level of the sidelobes associated with tapered windows. The reduced sidelobe effects come at the expense of a modest loss of spectral resolution. An audio segment formed using a Hamming window is shown in Figure 2.4.

Figure 2.3. (a) Audio signal x(t) and a rectangular window w(t) (shown in dashed line); (b) truncated audio signal xw(t); and (c) frequency-domain representation Xw(ω) of the truncated audio.

Figure 2.4. (a) Audio signal x(t) and a Hamming window w(t) (shown in dashed line); (b) truncated audio signal xw(t); and (c) frequency-domain representation Xw(ω) of the truncated audio.
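The resolution-versus-leakage trade described above can be visualized with a few lines of MATLAB; the tone frequency, frame length, and FFT size below are arbitrary illustrative choices.

% Spectral leakage: rectangular vs. Hamming window (illustrative)
fs = 8000;  N = 256;  f0 = 1234;                  % tone deliberately not on a bin center
n  = (0:N-1)';
x  = cos(2*pi*f0*n/fs);
w_hamm = 0.54 - 0.46*cos(2*pi*n/(N-1));           % Hamming window
Nfft = 4096;
Xr = 20*log10(abs(fft(x,         Nfft))/N);               % rectangular (plain truncation)
Xh = 20*log10(abs(fft(x.*w_hamm, Nfft))/sum(w_hamm));      % Hamming-tapered segment
f  = (0:Nfft-1)*fs/Nfft;
plot(f(1:Nfft/2), Xr(1:Nfft/2), f(1:Nfft/2), Xh(1:Nfft/2));
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
legend('Rectangular','Hamming');   % Hamming: much lower sidelobes, slightly wider mainlobe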

2.3 REVIEW OF CONVOLUTION AND FILTERING

A linear time-invariant (LTI) system configuration is shown in Figure 2.5. A linear filter satisfies the property of generalized superposition, and hence its output y(t) is the convolution of the input x(t) with the filter impulse response h(t). Mathematically, convolution is represented by the integral in Eq. (2.4):

y(t) = ∫_{−∞}^{∞} h(τ) x(t − τ) dτ = h(t) ∗ x(t).   (2.4)

Figure 2.5 A linear time-invariant (LTI) system and the convolution operation: the input x(t) applied to an LTI system with impulse response h(t) produces the output y(t) = x(t) ∗ h(t).

Figure 2.6 A simple RC low-pass filter with frequency response H(ω) = 1/(1 + jωRC), plotted for RC = 1 as the magnitude |H(ω)|² in dB versus frequency ω (rad/s).

The symbol ∗ between the impulse response h(t) and the input x(t) is often used to denote the convolution operation.

The CFT of the impulse response h(t) is the frequency response of the filter, i.e.,

h(t) ↔ H(ω).   (2.5)

As an example for reviewing these fundamental concepts in linear systems, we present in Figure 2.6 a simple first-order RC circuit that corresponds to a low-pass filter. The impulse response for this RC filter is a decaying exponential, and its frequency response is given by a simple first-order rational function H(ω). This function is complex-valued, and its magnitude represents the gain of the filter with respect to frequency at steady state. If a sinusoidal signal drives the linear filter, the steady-state output is also a sinusoid with the same frequency. However, its amplitude is scaled and its phase is shifted in a manner consistent with the magnitude and phase of the frequency response function, respectively.

2.4 UNIFORM SAMPLING

In all of our subsequent discussions, we will be treating audio signals and associated systems in discrete time. The rules for uniform sampling of analog speech/audio are provided by the sampling theorem [Shan48]. This theorem states that a signal that is strictly bandlimited to a bandwidth of B rad/s can be uniquely represented by its sampled values spaced at uniform intervals that are not more than π/B seconds apart. In other words, if we denote the sampling period as Ts, then the sampling theorem states that Ts ≤ π/B. In the frequency domain, and with the sampling frequency defined as ωs = 2πfs = 2π/Ts, this condition can be stated as

ωs ≥ 2B (rad/s)  or  fs ≥ B/π.   (2.6)

Mathematically, the sampling process is represented by time-domain multiplication of the analog signal x(t) with an impulse train p(t), as shown in Figure 2.7.

Since multiplication in time is convolution in frequency, the CFT of the sampled signal xs(t) corresponds to the CFT of the original analog signal x(t) convolved with the CFT of the impulse train p(t). The CFT of the impulses is also a train of uniformly spaced impulses in frequency that are spaced 1/Ts Hz apart. The CFT of the sampled signal is therefore a periodic extension of the CFT of the analog signal, as shown in Figure 2.8. In Figure 2.8, the analog signal was considered to be ideally bandlimited, and the sampling frequency ωs was chosen to be more than 2B to avoid aliasing. The CFT of the sampled signal is given by

Xs(ω) = (1/Ts) Σ_{k=−∞}^{∞} X(ω − kωs).   (2.7)

Figure 2.7 Uniform sampling of analog signals: the analog signal x(t) is multiplied by the impulse train p(t) to produce the discrete signal xs(t).

Figure 2.8 Spectrum of ideally bandlimited and uniformly sampled signals: X(ω), bandlimited to [−B, B], and its periodic extension Xs(ω) with replicas centered at multiples of ωs.

Note that the spectrum of the sampled signal in Figure 2.8 is such that an ideal low-pass filter (LPF) can recover the baseband of the signal and hence perfectly reconstruct the analog signal from the digital signal. The reconstruction process is shown in Figure 2.9. This reconstruction LPF essentially interpolates between the samples and reproduces the analog signal from the digital signal. The interpolation process becomes evident once the filtering operation is interpreted in the time domain as convolution. Reconstruction occurs by interpolating with the sinc function, which is the impulse response of the ideal low-pass filter. The reconstruction process for ωs = 2B is given by

x(t) = Σ_{n=−∞}^{∞} x(nTs) sinc(B(t − nTs)).   (2.8)

Note that if the sampling frequency is less than 2B, then aliasing will occur and the signal can no longer be reconstructed perfectly. Figure 2.10 illustrates aliasing.

In real-life applications, the analog signal is not ideally bandlimited, and the sampling process is not perfect, i.e., sampling pulses have finite amplitude and finite duration. Therefore, some level of aliasing is always present. To reduce aliasing, the signal is prefiltered by an anti-aliasing low-pass filter and usually over-sampled (ωs > 2B). The degree of over-sampling depends also on the choice of the analog anti-aliasing filter. For high-quality reconstruction and modest over-sampling, the anti-aliasing filter must have good rejection characteristics. On the other hand, over-sampling by a large factor relaxes the requirements on the analog anti-aliasing filter and hence simplifies the analog hardware at the expense of a higher data rate. Nowadays, over-sampling is practiced often, even in high-fidelity systems. In fact, the use of inexpensive Sigma-Delta (Σ∆) analog-to-digital (A/D) converters in conjunction with down-sampling in the digital domain is a common practice. Details on Σ∆ A/D conversion and some over-sampling schemes tailored for high-fidelity audio will be presented in Chapter 11. Standard sampling rates for the different grades of speech and audio are given in Table 2.1.

Figure 2.9 Reconstruction (interpolation) using a low-pass filter: the sampled spectrum Xs(ω) is filtered by a reconstruction low-pass filter to produce the interpolated signal x(t).

Figure 2.10 Aliasing in the sampled spectrum Xs(ω) when ωs < 2B.

Table 2.1 Sampling rates and bandwidth specifications

  Format                      Bandwidth    Sampling frequency
  Telephony                   3.2 kHz      8 kHz
  Wideband audio              7 kHz        16 kHz
  High-fidelity CD            20 kHz       44.1 kHz
  Digital audio tape (DAT)    20 kHz       48 kHz
  Super audio CD (SACD)       100 kHz      2.8224 MHz
  DVD audio (DVD-A)           96 kHz       44.1, 48, 88.2, 96, 176.4, or 192 kHz

2.5 DISCRETE-TIME SIGNAL PROCESSING

Audio coding algorithms operate on a quantized discrete-time signal. Prior to compression, most algorithms require that the audio signal is acquired with high-fidelity characteristics. In typical standardized algorithms, audio is assumed to be bandlimited at 20 kHz, sampled at 44.1 kHz, and quantized at 16 bits per sample. In the following discussion, we will treat audio as a sequence, i.e., as a stream of numbers denoted x(n) = x(t)|_{t = nTs}. Initially, we will review the discrete-time signal processing concepts without considering further aliasing and quantization effects. Quantization effects will be discussed later during the description of specific coding algorithms.

2.5.1 Transforms for Discrete-Time Signals

Discrete-time signals are described in the transform domain using the z-transform and the discrete-time Fourier transform (DTFT). These two transformations have similar roles as the Laplace transform and the CFT for analog signals, respectively. The z-transform is defined as

X(z) = Σ_{n=−∞}^{∞} x(n) z^{−n},   (2.9)

where z is a complex variable. Note that if the z-transform is evaluated on the unit circle, i.e., for

z = e^{jΩ},  Ω = ωTs = 2πf Ts,   (2.10)

then the z-transform becomes the discrete-time Fourier transform (DTFT). The DTFT is given by

X(e^{jΩ}) = Σ_{n=−∞}^{∞} x(n) e^{−jΩn}.   (2.11)

The DTFT is discrete in time and continuous in frequency. As expected, the frequency spectrum associated with the DTFT is periodic with period 2π rad.

Example 2.1

Consider the DTFT of a finite-length pulse:

x(n) = 1 for 0 ≤ n ≤ N − 1, and x(n) = 0 otherwise.

Using geometric series results and trigonometric identities on the DTFT sum,

X(e^{jΩ}) = Σ_{n=0}^{N−1} e^{−jΩn} = (1 − e^{−jΩN}) / (1 − e^{−jΩ}) = e^{−jΩ(N−1)/2} sin(ΩN/2) / sin(Ω/2).   (2.12)

Figure 2.11 DTFT of a sampled pulse for Example 2.1: magnitude |X(e^{jΩ})| of the digital sinc versus normalized frequency Ω (×π rad) for N = 8 (dashed line) and N = 16 (solid line).

The ratio of sinusoidal functions in Eq. (2.12) is known as the Dirichlet function, or as a digital sinc function. Figure 2.11 shows the DTFT of a finite-length pulse. The digital sinc is quite similar to the continuous-time sinc function, except that it is periodic with period 2π and has a finite number of sidelobes within a period.
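The closed-form digital sinc of Eq. (2.12) can be checked against a direct evaluation of the DTFT sum. The following Python sketch is an illustrative example, not part of the text.

```python
import numpy as np

def digital_sinc(N, omega):
    """Magnitude of the DTFT of a length-N rectangular pulse, Eq. (2.12)."""
    omega = np.asarray(omega, dtype=float)
    den = np.sin(omega / 2.0)
    out = np.full_like(omega, float(N))          # limit value N at omega = 0, 2*pi, ...
    nz = np.abs(den) > 1e-12
    out[nz] = np.abs(np.sin(N * omega[nz] / 2.0) / den[nz])
    return out

omega = np.linspace(0.0, np.pi, 513)
for N in (8, 16):
    # Direct evaluation of the DTFT sum for comparison with the closed form
    direct = np.abs(np.exp(-1j * np.outer(omega, np.arange(N))).sum(axis=1))
    print(N, np.max(np.abs(direct - digital_sinc(N, omega))))   # ~1e-13
```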

2.5.2 The Discrete and the Fast Fourier Transform

A computational tool for Fourier transforms is developed by starting from the DTFT analysis expression (2.11) and considering a finite-length signal consisting of N points, i.e.,

X(e^{jΩ}) = Σ_{n=0}^{N−1} x(n) e^{−jΩn}.   (2.13)

Furthermore, the frequency-domain signal is sampled uniformly at N points within one period, Ω = 0 to 2π, i.e.,

Ω → Ωk = (2π/N) k,  k = 0, 1, ..., N − 1.   (2.14)

With the sampling in the frequency domain, the Fourier sum of Eq. (2.13) becomes

X(e^{jΩk}) = Σ_{n=0}^{N−1} x(n) e^{−jΩk n}.   (2.15)

It is typical in the DSP literature to replace Ωk with the frequency index k, and hence Eq. (2.15) can be written as

X(k) = Σ_{n=0}^{N−1} x(n) e^{−j2πkn/N},  k = 0, 1, 2, ..., N − 1.   (2.16)

The expression in (2.16) is called the discrete Fourier transform (DFT). Note that the sampling in the frequency domain forces periodicity in the time domain, i.e., x(n) = x(n + N). We also have periodicity in the frequency domain, X(k) = X(k + N), because the signal in the time domain is also discrete. These periodicities create circular effects when convolution is performed by frequency-domain multiplication, i.e.,

x(n) ⊗ h(n) ↔ X(k)H(k),   (2.17)

where

x(n) ⊗ h(n) = Σ_{m=0}^{N−1} h(m) x((n − m) mod N).   (2.18)

The symbol ⊗ stands for circular or periodic convolution, and mod N implies modulo-N subtraction of indices. The DFT is a one-to-one transformation whose basis functions are orthogonal. With the proper normalization, the DFT matrix can be written as a unitary matrix. The N-point inverse DFT (IDFT) is written as

x(n) = (1/N) Σ_{k=0}^{N−1} X(k) e^{j2πkn/N},  n = 0, 1, 2, ..., N − 1.   (2.19)

The DFT transform pair is represented by the following notation:

x(n) ↔ X(k).   (2.20)

The DFT can be computed efficiently using the fast Fourier transform (FFT). The FFT takes advantage of redundancies in the DFT sum by decimating the sequence into subsequences with even and odd indices. It can be shown that, if N is a radix-2 integer, the N-point DFT can be computed using a series of butterfly stages. The complexity associated with the DFT algorithm is of the order of N² computations. In contrast, the number of computations associated with the FFT algorithm is roughly of the order of N log₂ N. This is a significant reduction in computational complexity, and FFTs are almost always used in lieu of a DFT.
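As an illustrative check of Eqs. (2.17) and (2.18), the short Python sketch below (not from the text) compares a direct modulo-N convolution with multiplication of DFTs computed via the FFT.

```python
import numpy as np

def circular_convolution(x, h):
    """Direct evaluation of Eq. (2.18): y(n) = sum_m h(m) x((n - m) mod N)."""
    N = len(x)
    return np.array([sum(h[m] * x[(n - m) % N] for m in range(N)) for n in range(N)])

rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(N)
h = rng.standard_normal(N)

y_direct = circular_convolution(x, h)
y_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real   # X(k)H(k) <-> circular convolution

print(np.allclose(y_direct, y_fft))   # True
```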

2.5.3 The Discrete Cosine Transform

The discrete cosine transform (DCT) of x(n) can be defined as

X(k) = c(k) √(2/N) Σ_{n=0}^{N−1} x(n) cos[ (π/N) (n + 1/2) k ],  0 ≤ k ≤ N − 1,   (2.21)

where c(0) = 1/√2 and c(k) = 1 for 1 ≤ k ≤ N − 1. Depending on the periodicity and the symmetry of the input signal x(n), the DCT can be computed using different orthonormal transforms (usually DCT-1, DCT-2, DCT-3, and DCT-4). More details on the DCT and the modified DCT (MDCT) [Malv91] are given in Chapter 6.
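A direct evaluation of Eq. (2.21) can be compared with a library routine. The Python sketch below is illustrative and not part of the text; it assumes that SciPy's type-2 DCT with norm='ortho' uses the same normalization as Eq. (2.21).

```python
import numpy as np
from scipy.fft import dct

def dct_direct(x):
    """Direct evaluation of Eq. (2.21) (orthonormal DCT-II)."""
    N = len(x)
    n = np.arange(N)
    X = np.empty(N)
    for k in range(N):
        c = 1.0 / np.sqrt(2.0) if k == 0 else 1.0
        X[k] = c * np.sqrt(2.0 / N) * np.sum(x * np.cos(np.pi / N * (n + 0.5) * k))
    return X

x = np.random.default_rng(1).standard_normal(16)
print(np.allclose(dct_direct(x), dct(x, type=2, norm='ortho')))   # True
```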

2.5.4 The Short-Time Fourier Transform

Figure 2.12 The k-th channel of the analysis-synthesis filterbank (after [Rabi78]), built from the modulated window hk(n) and the complex exponentials e^{±jΩk n}.

Spectral analysis of nonstationary signals cannot be accommodated by the classical Fourier transform, since the signal has time-varying characteristics. Instead, a time-frequency transformation is required. Time-varying spectral analysis [Silv74] [Alle77] [Port81a] can be performed using the short-time Fourier transform (STFT). The analysis expression for the STFT is given by

X(n, Ω) = Σ_{m=−∞}^{∞} x(m) h(n − m) e^{−jΩm} = h(n) ∗ [x(n) e^{−jΩn}],   (2.22)

where Ω = ωT = 2πf T is the normalized frequency in radians, and h(n) is the sliding analysis window. The synthesis expression (inverse transform) is given by

h(n − m) x(m) = (1/2π) ∫_{−π}^{π} X(n, Ω) e^{jΩm} dΩ.   (2.23)

Note that if n = m and h(0) = 1 [Rabi78] [Port80], then x(n) can be obtained from Eq. (2.23). The basic assumption in this type of analysis-synthesis is that the signal is slowly time-varying and can be modeled by its short-time spectrum. The temporal and spectral resolution of the STFT are controlled by the length and shape of the analysis window. For speech and audio signals, the length of the window is often constrained to be about 5–20 ms, and hence spectral resolution is sacrificed. The sequence h(n) can also be viewed as the impulse response of an LTI filter, which is excited by a frequency-shifted signal (see Eq. (2.22)). The latter leads to the filter-bank interpretation of the STFT, i.e., for a discrete frequency variable Ωk = k(∆Ω), k = 0, 1, ..., N − 1, with ∆Ω and N chosen such that the speech band is covered. Then the analysis expression is written as

X(n, Ωk) = Σ_{m=−∞}^{∞} x(m) h(n − m) e^{−jΩk m} = h(n) ∗ [x(n) e^{−jΩk n}],   (2.24)

and the synthesis expression is

x_STFT(n) = Σ_{k=0}^{N−1} X(n, Ωk) e^{jΩk n},   (2.25)

where x_STFT(n) is the signal reconstructed within the band of interest. If h(n) and N are chosen carefully [Scha75], the reconstruction by Eq. (2.25) can be exact. The k-th channel analysis-synthesis scheme is depicted in Figure 2.12, where hk(n) = h(n) e^{jΩk n}.

2.6 DIFFERENCE EQUATIONS AND DIGITAL FILTERS

Digital filters are characterized by difference equations of the form

y(n) = Σ_{i=0}^{L} b_i x(n − i) − Σ_{i=1}^{M} a_i y(n − i).   (2.26)

In the input-output difference equation above, the output y(n) is given as a linear combination of present and past inputs minus a linear combination of past outputs (feedback term). The parameters a_i and b_i are the filter coefficients, or filter taps, and they control the frequency response characteristics of the digital filter. Filter coefficients are programmable and can be made adaptive (time-varying). A direct-form realization of the digital filter is shown in Figure 2.13.

The filter in Eq. (2.26) is referred to as an infinite-length impulse response (IIR) filter. The impulse response h(n) of the filter shown in Figure 2.13 is given by

h(n) = Σ_{i=0}^{L} b_i δ(n − i) − Σ_{i=1}^{M} a_i h(n − i).   (2.27)

The IIR classification stems from the fact that, when the feedback coefficients are non-zero, the impulse response is infinitely long. In a statistical signal representation, Eq. (2.26) is referred to as a time-series model. That is, if the input of this filter is white noise, then y(n) is called an autoregressive moving average (ARMA) process. The feedback coefficients a_i are chosen such that the filter is stable, i.e., a bounded input gives a bounded output (BIBO). An input-output equation of a causal filter can also be written in terms of the impulse response of the filter, i.e.,

y(n) = Σ_{m=0}^{∞} h(m) x(n − m).   (2.28)

Figure 2.13 Direct-form realization of an IIR digital filter: a chain of unit delays z^{−1} with feedforward taps b0, b1, ..., bL, feedback taps −a1, −a2, ..., −aM, and a summing node producing y(n) from x(n).

The impulse response of the filter is associated with its coefficients and can be computed explicitly by programming the difference equation. It can also be obtained in closed form by solving the difference equation.

Example 2.2

Consider the first-order IIR filter shown in Figure 2.14. The difference equation of this digital filter is given by

y(n) = 0.2 x(n) + 0.8 y(n − 1).

The coefficients are b0 = 0.2 and a1 = −0.8. The impulse response of this filter is given by

h(n) = 0.2 δ(n) + 0.8 h(n − 1).

The impulse response can be determined in closed form by solving the above difference equation. Note that h(0) = 0.2 and h(1) = 0.16. Therefore, the closed-form expression for the impulse response is h(n) = 0.2 (0.8)^n u(n). Note also that this first-order IIR filter is BIBO stable because

Σ_{n=−∞}^{∞} |h(n)| < ∞.   (2.29)
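The impulse response in Example 2.2 can be checked by programming the difference equation, as suggested above. The following Python sketch is an illustrative example, not part of the text.

```python
import numpy as np
from scipy.signal import lfilter

# First-order IIR filter of Example 2.2: y(n) = 0.2 x(n) + 0.8 y(n-1)
b, a = [0.2], [1.0, -0.8]

n = np.arange(20)
delta = np.zeros_like(n, dtype=float)
delta[0] = 1.0

h_recursion = lfilter(b, a, delta)      # impulse response via the difference equation
h_closed = 0.2 * 0.8 ** n               # closed form: h(n) = 0.2 (0.8)^n u(n)

print(np.allclose(h_recursion, h_closed))   # True
```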

Digital filters with a finite-length impulse response (FIR) are realized by setting the feedback coefficients a_i = 0 for i = 1, 2, ..., M. FIR filters (Figure 2.15) are inherently BIBO stable because their impulse response is always absolutely summable. The output of an FIR filter is a weighted moving average of the input. The simplest FIR filter is the so-called averaging filter that is used in some simple estimation applications. The input-output equation of the averaging filter is given by

y(n) = (1/(L + 1)) Σ_{i=0}^{L} x(n − i).   (2.30)

Figure 2.14 A first-order IIR digital filter with feedforward gain 0.2, feedback gain 0.8, and a unit delay z^{−1} in the feedback path.

Figure 2.15 An FIR digital filter: a tapped delay line with coefficients b0, b1, ..., bL feeding a summing node.

The impulse response of this filter is equal to h(n) = 1/(L + 1) for n = 0, 1, ..., L. The frequency response of the averaging filter is the DTFT of its impulse response h(n). Therefore, the frequency responses of averaging filters for L = 7 and 15 are normalized versions of the DTFT spectra shown in Figure 2.11(a) and Figure 2.11(b), respectively.

2.7 THE TRANSFER AND THE FREQUENCY RESPONSE FUNCTIONS

The z-transform of the impulse response of a filter is called the transfer function and is given by

H(z) = Σ_{n=−∞}^{∞} h(n) z^{−n}.   (2.31)

Considering the difference equation, we can also obtain the transfer function in terms of the filter parameters, i.e.,

X(z) ( Σ_{i=0}^{L} b_i z^{−i} ) = Y(z) ( 1 + Σ_{i=1}^{M} a_i z^{−i} ).   (2.32)

The ratio of output over input in the z-domain gives the transfer function in terms of the filter coefficients:

H(z) = Y(z)/X(z) = (b0 + b1 z^{−1} + ... + bL z^{−L}) / (1 + a1 z^{−1} + ... + aM z^{−M}).   (2.33)

For an FIR filter, the transfer function is given by

H(z) = Σ_{i=0}^{L} b_i z^{−i}.   (2.34)

The frequency response function is a special case of the transfer function of the filter. That is, for z = e^{jΩ},

H(e^{jΩ}) = Σ_{n=−∞}^{∞} h(n) e^{−jΩn}.

By considering the difference equation associated with the LTI digital filter, the frequency response can be written as the ratio of two polynomials, i.e.,

H(e^{jΩ}) = (b0 + b1 e^{−jΩ} + b2 e^{−j2Ω} + ... + bL e^{−jLΩ}) / (1 + a1 e^{−jΩ} + a2 e^{−j2Ω} + ... + aM e^{−jMΩ}).

Note that for an FIR filter the frequency response becomes

H(e^{jΩ}) = b0 + b1 e^{−jΩ} + b2 e^{−j2Ω} + ... + bL e^{−jLΩ}.

Example 2.3

Frequency responses of four different first-order filters are shown in Figure 2.16. The frequency responses in Figure 2.16 are plotted up to the fold-over frequency, which is half the sampling frequency. Note from Figure 2.16 that low-pass and high-pass filters can be realized as either FIR or IIR filters. The location of the root of the polynomial of the FIR filter determines where the notch in the frequency response occurs. Therefore, in the top two figures that correspond to the FIR filters, the low-pass filter (top left) has a notch at π rad (zero at z = −1), while the high-pass filter has a notch at 0 rad (zero at z = 1). The bottom two IIR filters have a pole at z = 0.9 (peak at 0 rad) and z = −0.9 (peak at π rad) for the low-pass and high-pass filters, respectively.

Figure 2.16 Frequency responses of first-order FIR and IIR digital filters: the FIR filters H(z) = 1 + z^{−1} (low-pass) and H(z) = 1 − z^{−1} (high-pass), and the IIR filters H(z) = 1/(1 − 0.9z^{−1}) (low-pass) and H(z) = 1/(1 + 0.9z^{−1}) (high-pass), with |H(z)|² in dB plotted versus normalized frequency Ω (×π rad).
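The four responses in Figure 2.16 are straightforward to evaluate numerically. The Python sketch below is an illustrative example, not part of the text (values near −240 dB simply reflect the numerical floor at an exact spectral zero).

```python
import numpy as np
from scipy.signal import freqz

# First-order filters of Example 2.3: (numerator b, denominator a)
filters = {
    "FIR low-pass  H(z) = 1 + z^-1":        ([1.0, 1.0], [1.0]),
    "FIR high-pass H(z) = 1 - z^-1":        ([1.0, -1.0], [1.0]),
    "IIR low-pass  H(z) = 1/(1 - 0.9z^-1)": ([1.0], [1.0, -0.9]),
    "IIR high-pass H(z) = 1/(1 + 0.9z^-1)": ([1.0], [1.0, 0.9]),
}

for name, (b, a) in filters.items():
    w, H = freqz(b, a, worN=512)             # w in rad/sample, from 0 toward pi
    mag_db = 20 * np.log10(np.abs(H) + 1e-12)
    print("%s: gain near 0 = %6.1f dB, gain near pi = %6.1f dB"
          % (name, mag_db[0], mag_db[-1]))
```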

2.7.1 Poles, Zeros, and Frequency Response

A z-domain function H(z) can be written in terms of its poles and zeros as follows:

H(z) = G (z − ζ1)(z − ζ2) ... (z − ζL) / [(z − p1)(z − p2) ... (z − pM)] = G ∏_{i=1}^{L} (z − ζi) / ∏_{i=1}^{M} (z − pi),   (2.35)

where ζi and pi are the zeros and poles of H(z), respectively, and G is a constant. The locations of the poles and zeros affect the shape of the frequency response. The magnitude of the frequency response can be written as

|H(e^{jΩ})| = G ∏_{i=1}^{L} |e^{jΩ} − ζi| / ∏_{i=1}^{M} |e^{jΩ} − pi|.   (2.36)

It is therefore evident that, when an isolated zero is close to the unit circle, the magnitude frequency response will assume a small value at that frequency. When an isolated pole is close to the unit circle, it will give rise to a peak in the magnitude frequency response at that frequency. In speech processing, the presence of poles in z-domain representations of the vocal tract has been associated with the speech formants [Rabi78]. In fact, formant synthesizers use the pole locations to form synthesis filters for certain phonemes. On the other hand, the presence of zeros has been associated with the coupling of the nasal tract. For example, zeros are associated with nasal sounds such as m and n [Span94].

Example 2.4

For the second-order system below, find the poles and zeros, give a z-domain diagram with the poles and zeros, and sketch the frequency response:

H(z) = (1 − 1.3435 z^{−1} + 0.9025 z^{−2}) / (1 − 0.45 z^{−1} + 0.55 z^{−2}).

The poles and zeros appear in conjugate pairs because the coefficients of H(z) are real-valued:

H(z) = (z − 0.95 e^{j45°})(z − 0.95 e^{−j45°}) / [(z − 0.7416 e^{j72.34°})(z − 0.7416 e^{−j72.34°})].

The pole-zero diagram and the frequency response are shown in Figure 2.17. Poles give rise to spectral peaks and zeros create spectral valleys in the magnitude of the frequency response. The symmetry around π is due to the fact that the roots appear in complex conjugate pairs.

Figure 2.17 z-domain (pole-zero) plot and frequency response (magnitude in dB versus normalized frequency, ×π rad) of the second-order system given in Example 2.4.
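The pole and zero locations quoted in Example 2.4 can be verified numerically. The following Python sketch is illustrative and not part of the text.

```python
import numpy as np

b = [1.0, -1.3435, 0.9025]   # numerator coefficients of Example 2.4 (powers of z^-1)
a = [1.0, -0.45, 0.55]       # denominator coefficients

zeros, poles = np.roots(b), np.roots(a)
print("zero radii/angles:", np.abs(zeros), np.degrees(np.angle(zeros)))  # ~0.95, +/-45 deg
print("pole radii/angles:", np.abs(poles), np.degrees(np.angle(poles)))  # ~0.7416, +/-72.34 deg

# Evaluate |H(e^jw)| at the zero and pole angles: a deep valley near the zeros and
# a larger gain near the poles, as predicted by Eq. (2.36).
def mag(w):
    z = np.exp(1j * w)
    return abs(np.polyval(b, z) / np.polyval(a, z))

print("|H| at 45 deg:    %.3f" % mag(np.pi / 4))
print("|H| at 72.34 deg: %.3f" % mag(np.radians(72.34)))
```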

2.7.2 Examples of Digital Filters for Audio Applications

There are several standard designs for digital filters that are targeted specifically at audio-type applications. These designs include the so-called shelving filters, peaking filters, cross-over filters, and quadrature mirror filter (QMF) bank filters. Low-pass and high-pass shelving filters are used for bass and treble tone controls, respectively, in stereo systems.

Example 2.5

The transfer function of a low-pass shelving filter can be expressed as

Hlp(z) = Clp (1 − b1 z^{−1}) / (1 − a1 z^{−1}),   (2.37)

where

Clp = (1 + kμ) / (1 + k),  b1 = (1 − kμ) / (1 + kμ),  a1 = (1 − k) / (1 + k),

k = (4 / (1 + μ)) tan(Ωc / 2),  and  μ = 10^{g/20}.

Note also that Ωc = 2πfc/fs is the normalized cutoff frequency and g is the gain in decibels (dB).

Example 2.6

The transfer function of a high-pass shelving filter is given by

Hhp(z) = Chp (1 − b1 z^{−1}) / (1 − a1 z^{−1}),   (2.38)

where

Chp = (μ + p) / (1 + p),  b1 = (μ − p) / (μ + p),  a1 = (1 − p) / (1 + p),

p = ((1 + μ) / 4) tan(Ωc / 2),  and  μ = 10^{g/20}.

Again, Ωc = 2πfc/fs is the normalized cutoff frequency and g is the gain in dB. More complex tone controls that operate as graphic equalizers are accomplished using bandpass peaking filters.

Example 2.7

The transfer function of a peaking filter is given by

Hpk(z) = Cpk (1 + b1 z^{−1} + b2 z^{−2}) / (1 + a1 z^{−1} + a2 z^{−2}),   (2.39)

where

Cpk = (1 + kq μ) / (1 + kq),  b1 = −2 cos(Ωc) / (1 + kq μ),  b2 = (1 − kq μ) / (1 + kq μ),

a1 = −2 cos(Ωc) / (1 + kq),  a2 = (1 − kq) / (1 + kq),

kq = (4 / (1 + μ)) tan(Ωc / (2Q)),  and  μ = 10^{g/20}.

The frequency Ωc = 2πfc/fs is the normalized cutoff frequency, Q is the quality factor, and g is the gain in dB. Example frequency responses of shelving and peaking filters for different gains and cutoff frequencies are given in Figures 2.18 and 2.19.

Figure 2.18 Frequency responses of a low-pass shelving filter (a) for different gains (20, 10, 5, −5, −10, and −20 dB) and (b) for different cutoff frequencies, Ωc = π/6 and π/4.

Figure 2.19 Frequency responses of a peaking filter (a) for different gains, g = 10 dB and 20 dB, (b) for different quality factors, Q = 2 and 4, and (c) for different cutoff frequencies, Ωc = π/4 and π/2.
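To make the role of the parameters concrete, the following Python sketch (an illustrative example, not part of the text; the center frequency, sampling rate, gain, and Q are arbitrary choices) implements the peaking filter of Eq. (2.39) and checks its gain at the center frequency and at DC.

```python
import numpy as np
from scipy.signal import freqz

def peaking_filter(fc, fs, g_db, Q):
    """Coefficients (b, a) of the peaking filter of Eq. (2.39)."""
    mu = 10.0 ** (g_db / 20.0)
    wc = 2.0 * np.pi * fc / fs
    kq = 4.0 / (1.0 + mu) * np.tan(wc / (2.0 * Q))
    Cpk = (1.0 + kq * mu) / (1.0 + kq)
    b = Cpk * np.array([1.0,
                        -2.0 * np.cos(wc) / (1.0 + kq * mu),
                        (1.0 - kq * mu) / (1.0 + kq * mu)])
    a = np.array([1.0,
                  -2.0 * np.cos(wc) / (1.0 + kq),
                  (1.0 - kq) / (1.0 + kq)])
    return b, a

fs = 44100.0
b, a = peaking_filter(fc=1000.0, fs=fs, g_db=10.0, Q=2.0)
w, H = freqz(b, a, worN=8192)
f = w * fs / (2.0 * np.pi)

print("gain at fc: %.1f dB" % (20 * np.log10(np.abs(H[np.argmin(np.abs(f - 1000.0))]))))  # ~ +10 dB
print("gain at DC: %.1f dB" % (20 * np.log10(np.abs(H[0]))))                              # ~ 0 dB
```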

Example 2.8

An audio graphic equalizer is designed by cascading peaking filters, as shown in Figure 2.20. The main idea behind the audio graphic equalizer is that it applies a set of peaking filters to modify the frequency spectrum of the input audio signal by dividing its audible frequency spectrum into several frequency bands. The frequency response of each band can then be controlled by varying the corresponding peaking filter's gain.

Figure 2.20 A cascaded setup of peaking filters (Peaking Filter 1 through Peaking Filter N) used to design an audio graphic equalizer.

2.8 REVIEW OF MULTIRATE SIGNAL PROCESSING

Multirate signal processing (MSP) involves changing the sampling rate while the signal is in the digital domain. Sampling rate changes have been popular in DSP and audio applications. Depending on the application, changes in the sampling rate may reduce algorithmic and hardware complexity, or increase resolution in certain signal processing operations by introducing additional signal samples. Perhaps the most popular application of MSP is over-sampling analog-to-digital (A/D) and digital-to-analog (D/A) conversion. In over-sampling A/D, the signal is over-sampled, thereby relaxing the anti-aliasing filter design requirements and hence the hardware complexity. The additional time-resolution in the over-sampled signal allows a simple 1-bit delta modulation (DM) quantizer to deliver a digital signal with sufficient resolution, even for high-fidelity audio applications. This reduction of analog hardware complexity comes at the expense of a data rate increase. Therefore, a down-sampling operation is subsequently performed using a DSP chip to reduce the data rate. This reduction in the sampling rate requires a high-precision anti-aliasing digital low-pass filter, along with some other correcting DSP algorithmic steps that are of appreciable complexity. Therefore, over-sampling analog-to-digital (A/D) conversion, otherwise called Delta-Sigma A/D conversion, involves a process where complexity is transferred from the analog hardware domain to the digital software domain. The reduction of analog hardware complexity is also important in D/A conversion. In that case, the signal is up-sampled and interpolated in the digital domain, thereby reducing the requirements on the analog reconstruction (interpolation) filter.

2.8.1 Down-sampling by an Integer

Multirate signal processing is characterized by two basic operations, namely, up-sampling and down-sampling. Down-sampling involves increasing the sampling period and hence decreasing the sampling frequency and data rate of the digital signal. A sampling rate reduction by an integer L is represented by

xd(n) = x(nL).   (2.40)

Given the DTFT transform pairs

x(n) ↔ X(e^{jΩ})  and  xd(n) ↔ Xd(e^{jΩ}),   (2.41)

it can be shown [Oppe99] that the DTFTs of the original and decimated signals are related by

Xd(e^{jΩ}) = (1/L) Σ_{l=0}^{L−1} X(e^{j(Ω − 2πl)/L}).   (2.42)

Therefore, down-sampling introduces L copies of the original DTFT that are both amplitude and frequency scaled by L. It is clear that the additional copies may introduce aliasing. Aliasing can be eliminated if the DTFT of the original signal is bandlimited to the frequency π/L, i.e.,

X(e^{jΩ}) = 0,  π/L ≤ |Ω| ≤ π.   (2.43)

An example of the DTFTs of the signal during the down-sampling process is shown in Figure 2.21. To approximate the condition in Eq. (2.43), a digital anti-aliasing filter is used. The down-sampling process is illustrated in Figure 2.22.

Figure 2.21 (a) The original and the down-sampled (by 2) signal in the time domain, xd(n) = x(2n), and (b) the corresponding DTFTs X(e^{jΩ}) and Xd(e^{jΩ}).

Figure 2.22 Down-sampling by an integer L: the input x(n) is first filtered by the anti-aliasing filter HD(e^{jΩ}) to give x′(n), which is then down-sampled to x′(nL).

In Figure 2.22, HD(e^{jΩ}) is given by

HD(e^{jΩ}) = 1 for 0 ≤ |Ω| ≤ π/L, and 0 for π/L < |Ω| ≤ π.   (2.44)
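The need for the anti-aliasing filter HD(e^{jΩ}) before discarding samples can be demonstrated with a short experiment. The Python sketch below is an illustrative example, not part of the text; the tone frequencies and the 101-tap FIR filter are arbitrary choices that only approximate the ideal filter of Eq. (2.44).

```python
import numpy as np
from scipy.signal import firwin, lfilter

fs, L = 8000.0, 2
n = np.arange(4096)
# A 500-Hz tone (kept) plus a 3-kHz tone that exceeds the new Nyquist frequency (2 kHz)
x = np.cos(2 * np.pi * 500.0 * n / fs) + np.cos(2 * np.pi * 3000.0 * n / fs)

h = firwin(101, 1.0 / L)                       # FIR approximation of HD in Eq. (2.44)
candidates = {
    "no prefilter ": x[::L],                   # the 3-kHz tone aliases to 1 kHz
    "with prefilter": lfilter(h, 1.0, x)[::L], # alias suppressed by the anti-aliasing filter
}
for name, y in candidates.items():
    Y = np.abs(np.fft.rfft(y * np.hanning(len(y))))
    f = np.fft.rfftfreq(len(y), L / fs)
    print(name, "energy near 1 kHz = %.1f" % Y[np.argmin(np.abs(f - 1000.0))])
```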

2.8.2 Up-sampling by an Integer

Up-sampling involves reducing the sampling period by introducing additional, regularly spaced samples in the signal sequence:

xu(n) = Σ_{m=−∞}^{∞} x(m) δ(n − mM) = x(n/M) for n = 0, ±M, ±2M, ..., and 0 otherwise.   (2.45)

The introduction of zero-valued samples in the up-sampled signal xu(n) increases the sampling rate of the signal. The DTFT of the up-sampled signal relates to the DTFT of the original signal as follows:

Xu(e^{jΩ}) = X(e^{jΩM}).   (2.46)

Figure 2.23 (a) The original and the up-sampled (by 2) signal in the time domain, xu(n) = x(n/2) at the even indices, and (b) the corresponding DTFTs X(e^{jΩ}) and Xu(e^{jΩ}).

Figure 2.24 Up-sampling by an integer M: the input x(n) is up-sampled by M and then passed through the interpolation filter HU(e^{jΩ}).

Figure 2.25 Sampling rate changes by a noninteger factor: up-sampling by L, a low-pass filter HLP(e^{jΩ}) with Ωc = π/max(L, M), and down-sampling by M.

Therefore, the DTFT of the up-sampled signal, Xu(e^{jΩ}), is described by a series of frequency-compressed images of the DTFT of the original signal, located at integer multiples of 2π/M rad (see Figure 2.23). To complete the up-sampling process, an interpolation stage is required that fills in appropriate values in the time domain to replace the artificial zero-valued samples introduced by the sampling. In Figure 2.24, HU(e^{jΩ}) is given by

HU(e^{jΩ}) = M for 0 ≤ |Ω| ≤ π/M, and 0 for π/M < |Ω| ≤ π.   (2.47)

2.8.3 Sampling Rate Changes by Noninteger Factors

Sampling rate changes by noninteger factors can be accomplished by cascading up-sampling and down-sampling operations. The up-sampling stage precedes the down-sampling stage, and the low-pass interpolation and anti-aliasing filters are combined into one filter whose bandwidth is the minimum of the two filters (Figure 2.25). For example, suppose we want a noninteger sampling period modification such that Tnew = 5T/12. In this case, we choose L = 12 and M = 5. Hence, the bandwidth of the low-pass filter is the minimum of π/12 and π/5.

2.8.4 Quadrature Mirror Filter Banks

The analysis of the signal in a perceptual audio coding system is usually accomplished using either filter banks or frequency-domain transformations, or a combination of both. The filter bank is used to decompose the signal into several frequency subbands. Different coding strategies are then derived and implemented in each subband. This technique is known as subband coding in the coding literature (Figure 2.26).

One of the important aspects of subband decomposition is the aliasing between the different subbands because of the imperfect frequency responses of the digital filters (Figure 2.27). These aliasing problems prevented early analysis-synthesis filter banks from perfectly reconstructing the original input signal in the absence of quantization effects. In 1977, a solution to this problem was provided by combining down-sampling and up-sampling operations with appropriate filter designs [Este77]. The perfect reconstruction filter bank design came to be known as a quadrature mirror filter (QMF) bank. An analysis-synthesis QMF consists of anti-aliasing filters, down-sampling stages, up-sampling stages, and interpolation filters. A two-band QMF structure is shown in Figure 2.28.

Figure 2.26 Signal coding in subbands: the input x(n) is split by bandpass filters BP1, ..., BPN; each subband is encoded (Encoder 1, ..., Encoder N) at rates 2f1, ..., 2fN, multiplexed onto the channel, demultiplexed, decoded, and recombined through bandpass filters to form x̂(n).

Figure 2.27 Aliasing effects in a two-band filter bank: the responses |Hk(e^{jΩ})| of the lower-band filter H0 and the upper-band filter H1 overlap around Ω = π/2.

Figure 2.28 A two-band QMF structure: the analysis stage (H0(z) and H1(z), each followed by down-sampling by 2) produces xd0(n) and xd1(n); the synthesis stage (up-sampling by 2, followed by F0(z) and F1(z)) sums the two branches to form x̂(n).

The analysis stage consists of the filters H0(z) and H1(z) and down-sampling operations. The synthesis stage includes up-sampling stages and the filters F0(z) and F1(z). If the process includes quantizers, these are placed after the down-sampling stages. We first examine the filter bank without the quantization stage. The input signal x(n) is first filtered and then down-sampled. The DTFT of the down-sampled signal can be shown to be

Xdk(e^{jΩ}) = (1/2) [ Xk(e^{jΩ/2}) + Xk(e^{j(Ω − 2π)/2}) ],  k = 0, 1.   (2.48)

Figure 2.29 presents plots of the DTFTs of the original and down-sampled signals. It can be seen that an aliasing term is present.

Figure 2.29 DTFTs of the original and down-sampled signals, illustrating the aliasing term X(−e^{jΩ/2}) that overlaps the stretched spectrum X(e^{jΩ/2}).

Figure 2.30 Tree-structured QMF bank: cascaded two-band QMF blocks (stages 0 through 3) that split the signal into progressively narrower subbands.

The reconstructed signal x̂(n) is derived by adding the contributions from the up-sampling and interpolations of the low band and the high band. It can be shown that the reconstructed signal in the z-domain has the form

X̂(z) = (1/2) [H0(z)F0(z) + H1(z)F1(z)] X(z) + (1/2) [H0(−z)F0(z) + H1(−z)F1(z)] X(−z).   (2.49)

The signal X(−z) in Eq. (2.49) is associated with the aliasing term. The aliasing term can be cancelled by designing the filters to have the following mirror symmetries:

F0(z) = H1(−z),  F1(z) = −H0(−z).   (2.50)

Under these conditions, the overall transfer function of the filter bank can be written as

T(z) = (1/2) [H0(z)F0(z) + H1(z)F1(z)].   (2.51)

If T(z) = 1, the filter bank allows perfect reconstruction. Perfect delay-less reconstruction is not realizable, but an all-pass filter bank with linear phase characteristics can be designed easily. For example, the choice of first-order FIR filters

H0(z) = 1 + z^{−1},  H1(z) = 1 − z^{−1},   (2.52)

results in alias-free reconstruction. The overall transfer function of the QMF in this case is

T(z) = (1/2) [(1 + z^{−1})² − (1 − z^{−1})²] = 2z^{−1}.   (2.53)

Therefore, the signal is reconstructed within a delay of one sample and with an overall gain of 2. QMF filter banks can be cascaded to form tree structures. If we represent the analysis stage of a filter bank as a block that divides the signal into low- and high-frequency subbands, then by cascading several such blocks we can divide the signal into smaller subbands. This is shown in Figure 2.30. QMF banks are part of many subband and hybrid subband/transform audio and image/video coding standards [Thei87] [Stoll88] [Veld89] [John96]. Note that the theory of quadrature mirror filter banks has been associated with wavelet transform theory [Wick94] [Akan96] [Stra96].
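The alias-cancellation conditions of Eqs. (2.50)–(2.53) can be verified on an arbitrary signal. The following Python sketch is an illustrative example, not part of the text.

```python
import numpy as np
from scipy.signal import lfilter

def qmf_two_band(x):
    """Two-band QMF of Figure 2.28 with H0 = 1 + z^-1, H1 = 1 - z^-1 and
    F0(z) = H1(-z), F1(z) = -H0(-z), as in Eq. (2.50)."""
    h0, h1 = [1.0, 1.0], [1.0, -1.0]
    f0, f1 = [1.0, 1.0], [-1.0, 1.0]       # F0 = 1 + z^-1, F1 = -1 + z^-1

    # Analysis: filter, then keep every other sample (down-sampling by 2)
    xd0 = lfilter(h0, 1.0, x)[::2]
    xd1 = lfilter(h1, 1.0, x)[::2]

    # Synthesis: insert zeros (up-sampling by 2), interpolate, and sum the branches
    xu0 = np.zeros(2 * len(xd0))
    xu0[::2] = xd0
    xu1 = np.zeros(2 * len(xd1))
    xu1[::2] = xd1
    return lfilter(f0, 1.0, xu0) + lfilter(f1, 1.0, xu1)

x = np.random.default_rng(2).standard_normal(64)
y = qmf_two_band(x)
print(np.allclose(y[1:], 2.0 * x[:-1]))    # True: y(n) = 2 x(n-1), as in Eq. (2.53)
```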

2.9 DISCRETE-TIME RANDOM SIGNALS

In signal processing, we generally classify signals as deterministic or random. A signal is defined as deterministic if its values at any point in time can be defined precisely by a mathematical equation. For example, the signal x(n) = sin(πn/4) is deterministic. On the other hand, random signals have uncertain values and are usually described using their statistics. A discrete-time random process involves an ensemble of sequences x(n, m), where m is the index of the m-th sequence in the ensemble and n is the time index. In practice, one does not have access to all possible sample signals of a random process. Therefore, the determination of the statistical structure of a random process is often done from the observed waveform. This approach becomes valid, and simplifies random signal analysis, if the random process at hand is ergodic. Ergodicity implies that the statistics of a random process can be determined using time-averaging operations on a single observed signal. Ergodicity requires that the statistics of the signal are independent of the time of observation. Random signals whose statistical structure is independent of the time of origin are generally called stationary. More specifically, a random process is said to be wide-sense stationary if its statistics up to the second order are independent of time. Although it is difficult to show analytically that signals with various statistical distributions are ergodic, it can be shown that a stationary zero-mean Gaussian process is ergodic up to second order. In many practical applications involving a stationary or quasi-stationary process, it is assumed that the process is also ergodic. The definitions of signal statistics presented henceforth will focus on real-valued stationary processes that are ergodic.

The mean value μx of the discrete-time, wide-sense stationary signal x(n) is a first-order statistic that is defined as the expected value of x(n), i.e.,

μx = E[x(n)] = lim_{N→∞} (1/(2N + 1)) Σ_{n=−N}^{N} x(n),   (2.54)

where E[·] denotes the statistical expectation. The assumption of ergodicity allows us to determine the mean value with the time-averaging process shown on the right-hand side of Eq. (2.54). The mean value can be viewed as the DC component in the signal. In many applications involving speech and audio signals, the DC component does not carry any useful information and is either ignored or filtered out.

The variance σx² is a second-order signal statistic and is a measure of the signal dispersion from its mean value. The variance is defined as

σx² = E[(x(n) − μx)(x(n) − μx)] = E[x²(n)] − μx².   (2.55)

The square root of the variance is the standard deviation of the signal. For a zero-mean signal, the variance is simply E[x²(n)]. The autocorrelation of a signal is a second-order statistic defined by

rxx(m) = E[x(n + m) x(n)] = lim_{N→∞} (1/(2N + 1)) Σ_{n=−N}^{N} x(n + m) x(n),   (2.56)

where m is called the autocorrelation lag index. The autocorrelation can be viewed as a measure of predictability of the signal, in the sense that a future value of a correlated signal can be predicted by processing information associated with its past values. For example, speech is a correlated waveform, and hence it can be modeled by linear prediction mechanisms that predict its current value from a linear combination of past values. Correlation can also be viewed as a measure of redundancy in the signal, in that correlated waveforms can be parameterized in terms of statistical time-series models and hence represented by a reduced number of information bits.

The autocorrelation sequence rxx(m) is symmetric and positive definite, i.e.,

rxx(−m) = rxx(m),  rxx(0) ≥ |rxx(m)|.   (2.57)

Example 2.9

The autocorrelation of a white noise signal is

rww(m) = σw² δ(m),

where σw² is the variance of the noise. The fact that the autocorrelation of white noise is the unit impulse implies that white noise is an uncorrelated signal.

Example 2.10

The autocorrelation of the output of the second-order FIR digital filter H(z) (in Figure 2.31), driven by a white noise input of zero mean and unit variance, is

ryy(m) = E[y(n + m) y(n)] = δ(m + 2) + 2δ(m + 1) + 3δ(m) + 2δ(m − 1) + δ(m − 2).

Cross-correlation is a measure of similarity between two signals. The cross-correlation of a signal x(n) relative to a signal y(n) is given by

rxy(m) = E[x(n + m) y(n)].   (2.58)

Similarly, the cross-correlation of a signal y(n) relative to a signal x(n) is given by

ryx(m) = E[y(n + m) x(n)].   (2.59)

Note that the symmetry property of the cross-correlation is

ryx(m) = rxy(−m).   (2.60)

Figure 2.31 The FIR filter H(z) = 1 + z^{−1} + z^{−2} excited by white noise x(n), producing the output y(n).

Figure 2.32 The PSD of white noise: the autocorrelation rww(m) = σw² δ(m) and the flat power spectral density Rww(e^{jΩ}) = σw².

The power spectral density (PSD) of a random signal is defined as the DTFT of the autocorrelation sequence,

Rxx(e^{jΩ}) = Σ_{m=−∞}^{∞} rxx(m) e^{−jΩm}.   (2.61)

The PSD is real-valued and positive, and it describes how the power of the random process is distributed across frequency.

Example 2.11

The PSD of a white noise signal (see Figure 2.32) is

Rww(e^{jΩ}) = σw² Σ_{m=−∞}^{∞} δ(m) e^{−jΩm} = σw².

2.9.1 Random Signals Processed by LTI Digital Filters

In Example 2.10, we determined the autocorrelation of the output of a second-order FIR digital filter when the excitation is white noise. In this section, we briefly review the characterization of the statistics of the output of a causal LTI digital filter that is excited by a random signal. The output of a causal digital filter can be computed by convolving the input with its impulse response, i.e.,

y(n) = Σ_{i=0}^{∞} h(i) x(n − i).   (2.62)

Based on the convolution sum, we can derive the following expressions for the mean, the autocorrelation, the cross-correlation, and the power spectral density of the steady-state output of an LTI digital filter:

μy = Σ_{k=0}^{∞} h(k) μx = H(e^{jΩ})|_{Ω=0} μx,   (2.63)

ryy(m) = Σ_{k=0}^{∞} Σ_{i=0}^{∞} h(k) h(i) rxx(m − k + i),   (2.64)

ryx(m) = Σ_{i=0}^{∞} h(i) rxx(m − i),   (2.65)

Ryy(e^{jΩ}) = |H(e^{jΩ})|² Rxx(e^{jΩ}).   (2.66)

These equations describe the statistical behavior of the output at steady state. During the transient state of the filter, the output is essentially nonstationary, i.e.,

μy(n) = Σ_{k=0}^{n} h(k) μx.   (2.67)

Example 2.12

Determine the output variance of an LTI digital filter excited by white noise of zero mean and unit variance:

σy² = ryy(0) = Σ_{k=0}^{∞} Σ_{i=0}^{∞} h(k) h(i) δ(i − k) = Σ_{k=0}^{∞} h²(k).

Example 2.13

Determine the variance, the autocorrelation, and the PSD of the output of the digital filter in Figure 2.33 when its input is white noise of zero mean and unit variance.

The impulse response and transfer function of this first-order IIR filter are

h(n) = 0.8^n u(n),  H(z) = 1 / (1 − 0.8 z^{−1}).

The variance of the output at steady state is

σy² = Σ_{k=0}^{∞} h²(k) σx² = Σ_{k=0}^{∞} 0.64^k = 1 / (1 − 0.64) = 2.78.

Figure 2.33 An example of an IIR filter: y(n) = x(n) + 0.8 y(n − 1), i.e., a summing node with a feedback path of gain 0.8 through a unit delay z^{−1}.

The autocorrelation of the output is given by

ryy(m) = Σ_{k=0}^{∞} Σ_{i=0}^{∞} 0.8^{k+i} δ(m − k + i).

It is easy to see that the unit impulse will be non-zero only for k = m + i, and hence

ryy(m) = Σ_{i=0}^{∞} 0.8^{m+2i},  m ≥ 0.

Taking into account the autocorrelation symmetry,

ryy(m) = ryy(−m) = 2.77 (0.8)^{|m|}  for all m.

Finally, the PSD is given by

Ryy(e^{jΩ}) = |H(e^{jΩ})|² Rxx(e^{jΩ}) = 1 / |1 − 0.8 e^{−jΩ}|².
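The steady-state statistics in Example 2.13 can be checked by simulation. The Python sketch below is an illustrative example, not part of the text.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)          # white noise, zero mean, unit variance
y = lfilter([1.0], [1.0, -0.8], x)          # H(z) = 1/(1 - 0.8 z^-1)

print("estimated output variance: %.3f (theory 1/(1 - 0.64) = 2.78)" % np.var(y))

# Sample-average autocorrelation estimates for lags 0..4 (theory: 2.78 * 0.8^|m|)
for m in range(5):
    r_hat = np.mean(y[m:] * y[:len(y) - m])
    print("r_yy(%d) = %.3f  (theory %.3f)" % (m, r_hat, (1.0 / 0.36) * 0.8 ** m))
```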

2.9.2 Autocorrelation Estimation from Finite-Length Data

Estimators of signal statistics given N observations are based on the assumption of stationarity and ergodicity. The following is an estimator of the autocorrelation (based on sample averaging) of a signal x(n):

r̂xx(m) = (1/N) Σ_{n=0}^{N−m−1} x(n + m) x(n),  m = 0, 1, 2, ..., N − 1.   (2.68)

Correlations for negative lags can be obtained using the symmetry property in Eq. (2.57). The estimator above is asymptotically unbiased (fixed m and N >> m), but for small N it is biased.
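A direct implementation of the biased estimator in Eq. (2.68) is shown below; the sketch is an illustrative Python example, not part of the text. Dividing by N − m instead of N would give an unbiased variant.

```python
import numpy as np

def autocorr_biased(x, max_lag):
    """Biased sample autocorrelation of Eq. (2.68): divide by N for every lag."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([np.sum(x[m:] * x[:N - m]) / N for m in range(max_lag + 1)])

x = np.random.default_rng(4).standard_normal(1000)
r = autocorr_biased(x, 5)
print(r[0], r[1:])   # r[0] is close to the variance (1); the other lags are near zero
```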

2.10 SUMMARY

A brief review of some of the essentials of signal processing was presented in this chapter. In particular, some of the important concepts covered in this chapter include:

• Continuous Fourier transform
• Spectral leakage effects
• Convolution, sampling, and aliasing issues
• Discrete-time Fourier transform and z-transform
• The DFT, FFT, DCT, and STFT basics
• IIR and FIR filter representations
• Pole/zero and frequency response interpretations
• Shelving and peaking filters, audio graphic equalizers
• Down-sampling and up-sampling
• QMF banks and alias-free reconstruction
• Discrete-time random signal processing review

PROBLEMS

2.1 Determine the continuous Fourier transform (CFT) of a pulse described by

x(t) = u(t + 1) − u(t − 1),

where u(t) is the unit step function.

2.2 State and derive the CFT properties of duality, time shift, modulation, and convolution.

2.3 For the circuit shown in Figure 2.34(a) and for RC = 1:
(a) Write the input-output differential equation.
(b) Determine the impulse response in closed form by solving the differential equation.
(c) Write the frequency response function.
(d) Determine the steady-state response y(t) for x(t) = sin(10t).

(e) Given x(t) as shown in Figure 2.34(b), find the circuit output y(t) using convolution.
(f) Determine the Fourier series of the output y(t) of the RC circuit for the input shown in Figure 2.34(c).

Figure 2.34 (a) A simple RC circuit, (b) the input signal x(t) for problem 2.3(e), and (c) the input signal for problem 2.3(f).

2.4 Determine the CFT of p(t) = Σ_{n=−∞}^{∞} δ(t − nTs). Given xs(t) = x(t) p(t), derive the following:

Xs(ω) = (1/Ts) Σ_{k=−∞}^{∞} X(ω − kωs),

x(t) = Σ_{n=−∞}^{∞} x(nTs) sinc(B(t − nTs)),

where X(ω) and Xs(ω) are the spectra of the ideally bandlimited and uniformly sampled signals, respectively, and ωs = 2π/Ts. (Refer to Figure 2.8 for variable definitions.)

2.5 Determine the z-transforms of the following causal signals:
(a) sin(Ωn)
(b) δ(n) + δ(n − 1)
(c) p^n sin(Ωn)
(d) u(n) − u(n − 9)

2.6 Determine the impulse and frequency responses of the averaging filter

h(n) = 1/(L + 1),  n = 0, 1, ..., L,

for L = 9.

2.7 Show that the IDFT can be derived as a least-squares signal matching problem, where N points in the time domain are matched by a linear combination of N sampled complex sinusoids.

2.8 Given the transfer function H(z) = (z − 1)² / (z² + 0.81):
(a) Determine the impulse response h(n).
(b) Determine the steady-state response due to the sinusoid sin(πn/4).

2.9 Derive the decimation-in-time FFT algorithm and determine the number of complex multiplications required for an FFT size of N = 1024.

2.10 Derive the following expression for a simple two-band QMF:

Xdk(e^{jΩ}) = (1/2) [ Xk(e^{jΩ/2}) + Xk(e^{j(Ω − 2π)/2}) ],  k = 0, 1.

Refer to Figure 2.28 for variable definitions. Give and justify the conditions for alias-free reconstruction in a simple QMF bank.

2.11 Design a tree-structured uniform QMF bank that will divide the spectrum of 0–20 kHz into eight uniform subbands. Give appropriate figures and denote on the branches the range of frequencies.

2.12 Modify your design in problem 2.11 and give one possible realization of a simple nonuniform tree-structured QMF bank that will divide the 0–20 kHz spectrum into eight subbands whose bandwidth increases with the center frequency.

2.13 Design a low-pass shelving filter for the following specifications: fs = 16 kHz, fc = 4 kHz, and g = 10 dB.

2.14 Design peaking filters for the following cases: (a) Ωc = π/4, Q = 2, g = 10 dB, and (b) Ωc = π/2, Q = 3, g = 5 dB. Give the frequency responses of the designed peaking filters.

2.15 Design a five-band digital audio equalizer using the concept of peaking digital filters. Select the center frequencies fc as follows: 500 Hz, 1500 Hz, 4000 Hz, 10 kHz, and 16 kHz; the sampling frequency fs = 44.1 kHz; and the corresponding peaking filter gains as 10 dB. Choose a constant Q for all the peaking filters.

2.16 Derive equations (2.63) and (2.64).

2.17 Derive equation (2.66).

2.18 Show that the PSD is real-valued and positive.

2.19 Show that the estimator (2.68) of the autocorrelation is biased.

2.20 Show that the estimator (2.68) provides autocorrelations such that rxx(0) ≥ |rxx(m)|.

2.21 Provide an unbiased autocorrelation estimator by modifying the estimator in (2.68).

2.22 A digital filter with impulse response h(n) = 0.7^n u(n) is excited by white Gaussian noise of zero mean and unit variance. Determine the mean and variance of the output of the digital filter in closed form during the transient and steady state.

2.23 The filter H(z) = z/(z − 0.8) is excited by white noise of zero mean and unit variance.
(a) Determine all the autocorrelation values at the output of the filter at steady state.
(b) Determine the PSD at the output.

COMPUTER EXERCISES

Use the speech file 'Ch2speech.wav' from the Book Website for all the computer exercises in this chapter.

2.24 Consider the two-band QMF bank shown in Figure 2.35. In this figure, x(n) denotes speech frames of 256 samples and x̂(n) denotes the synthesized speech frames.

(a) Design the transfer functions F0(z) and F1(z) such that aliasing is cancelled. Also calculate the overall delay of the QMF bank.

(b) Select an arbitrary voiced speech frame from Ch2speech.wav. Give time-domain and frequency-domain plots of xd0(n) and xd1(n) for that particular frame. Comment on the frequency-domain plots with regard to low-pass/high-pass band-splitting.

Figure 2.35 A two-band QMF bank: the given analysis filters H0(z) = 1 − z^{−1} and H1(z) = 1 + z^{−1}, followed by down-sampling by 2, produce xd0(n) and xd1(n); after up-sampling by 2, the synthesis filters F0(z) and F1(z), to be chosen such that the aliasing term is cancelled, are summed to form x̂(n).

Figure 2.36 Speech synthesis from a select number (subset) of FFT components: the 256-sample frame x(n) is transformed by a size-256 FFT to X(k); L of the N components are selected to form X′(k), and a size-256 inverse FFT produces x′(n).

Table 2.2 Signal-to-noise ratio (SNR) and MOS values

  Number of FFT components, L    SNR_overall    Subjective evaluation, MOS (mean opinion score),
                                                on a scale of 1–5 for the entire speech record
  16
  128

(c) Repeat step (b) for x1(n) and xd1(n) in order to compare the signals before and after the down-sampling stage.

(d) Calculate the SNR between the input speech record x(n) and the synthesized speech record x̂(n). Use the following equation to compute the SNR:

SNR = 10 log10 ( Σn x²(n) / Σn (x(n) − x̂(n))² )  (dB).

Listen to the synthesized speech record and comment on its quality.

(e) Choose a low-pass F0(z) and a high-pass F1(z) such that aliasing occurs. Compute the SNR. Listen to the synthesized speech and describe its perceptual quality relative to the output speech in step (d). (Hint: use first-order IIR filters.)

2.25 In Figure 2.36, x(n) denotes speech frames of 256 samples and x′(n) denotes the synthesized speech frames. For L = N (= 256), the synthesized speech will be identical to the input speech. In this computer exercise, you need to perform speech synthesis on a frame-by-frame basis from a select number (subset) of FFT components, i.e., L < N. We will use two methods for the FFT component selection: (i) Method 1, selecting the first L components, including their conjugate-symmetric ones, out of a total of N components; and (ii) Method 2, the least-squares method (a peak-picking method that selects the L components that minimize the sum of squares error).

(a) Use L = 64 and Method 1 for component selection. Perform speech synthesis and give time-domain plots of both the input and output speech records.

(b) Repeat the above step using the peak-picking Method 2 (choose L peaks, including symmetric components, in the FFT magnitude spectrum). List the SNR values in both cases. Listen to the output files corresponding to (a) and (b) and provide a subjective evaluation (on a MOS scale, 1–5). To calibrate the process, think of wireline telephone quality (toll) as 4 and cellphone quality as around 3.7.

(c) Perform speech synthesis for (i) L = 16 and (ii) L = 128. Use the peak-picking Method 2 for the FFT component selection. Compute the overall SNR values and provide a subjective evaluation of the output speech for both cases. Tabulate your results in Table 2.2.

CHAPTER 3

QUANTIZATION AND ENTROPY CODING

3.1 INTRODUCTION

This chapter provides an introduction to waveform quantization and entropy coding algorithms. Waveform quantization deals with the digital, or more specifically binary, representation of signals. All audio encoding algorithms typically include a quantization module. Theoretical aspects of waveform quantization methods were established about fifty years ago [Shan48]. Waveform quantization can be (i) memoryless or with memory, depending upon whether the encoding rules rely on past samples, and (ii) uniform or nonuniform, based on the step size or the quantization (discretization) levels employed. Pulse code modulation (PCM) [Oliv48] [Jaya76] [Jaya84] [Span94] is a memoryless method for discrete-time, discrete-amplitude quantization of analog waveforms. On the other hand, differential PCM (DPCM), delta modulation (DM), and adaptive DPCM (ADPCM) have memory.

Waveform quantization can also be classified as scalar or vector. In scalar quantization, each sample is quantized individually, as opposed to vector quantization, where a block of samples is quantized jointly. Scalar quantization [Jaya84] methods include PCM, DPCM, DM, and their adaptive versions. Several vector quantization (VQ) schemes have been proposed, including the VQ [Lind80], the split-VQ [Pali91] [Pali93], and the conjugate structure-VQ [Kata93] [Kata96]. Quantization can be parametric or nonparametric. In nonparametric quantization, the actual signal is quantized. Parametric representations are generally based on signal transformations (often unitary) or on source-system signal models.



A bit-allocation algorithm is typically employed to compute the number of quantization bits required to encode an audio segment. Several bit-allocation schemes have been proposed over the years; these include bit allocation based on the noise-to-mask ratio (NMR) and masking thresholds [Bran87a] [John88a] [ISOI92], perceptually motivated bit allocation [Vora97] [Naja00], and dynamic bit allocation based on signal statistics [Jaya84] [Rams86] [Shoh88] [West88] [Beat89] [Madi97]. Note that the NMR-based perceptual bit-allocation scheme [Bran87a] is one of the most popular techniques and is embedded in several audio coding standards (e.g., the ISO/IEC MPEG codec series).

In audio compression, entropy coding techniques are employed in conjunction with the quantization and bit-allocation modules in order to obtain improved coding efficiencies. Unlike the DPCM and ADPCM techniques, which remove redundancy by exploiting the correlation of the signal, entropy coding schemes exploit the likelihood of symbol occurrence [Cove91]. Entropy is a measure of the uncertainty of a random variable. For example, consider two random variables x and y and two random events A and B. For the random variable x, let the probability of occurrence of event A be px(A) = 0.5 and of event B be px(B) = 0.5. Similarly, define py(A) = 0.99999 and py(B) = 0.00001. The random variable x has a high uncertainty measure, i.e., it is very hard to predict whether event A or B is likely to occur. On the contrary, in the case of the random variable y, the event A is much more likely to occur, and therefore we have less uncertainty relative to the random variable x. In entropy coding, the information symbols are mapped into codes based on the frequency of each symbol. Several entropy-coding schemes have been proposed, including Huffman coding [Huff52], Rice coding [Rice79], Golomb coding [Golo66], arithmetic coding [Riss79] [Howa94], and Lempel-Ziv coding [Ziv77]. These entropy coding schemes are typically called noiseless. A noiseless coding system is able to reconstruct the signal perfectly from its coded representation. In contrast, a coding scheme incapable of perfect reconstruction is called lossy.
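The uncertainty contrast between x and y can be quantified with the Shannon entropy, H = −Σ p log2 p. The following Python sketch is an illustrative example, not part of the text.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), ignoring zero-probability symbols."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print("H(x) = %.5f bits" % entropy_bits([0.5, 0.5]))          # 1.00000 bit: maximum uncertainty
print("H(y) = %.5f bits" % entropy_bits([0.99999, 0.00001]))  # ~0.00018 bits: nearly predictable
```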

In the rest of the chapter, we provide an overview of the quantization–bit allocation–entropy coding (QBE) framework. We also provide background on probabilistic signal structures, and we show how they are exploited in quantization algorithms. Finally, we introduce vector quantization basics.

3.1.1 The Quantization–Bit Allocation–Entropy Coding Module

After perceptual irrelevancies in an audio frame are exploited, a quantization–bit allocation–entropy coding (QBE) module is employed to exploit statistical correlation. In Figure 3.1, typical output parameters from stage I include the transform coefficients, the scale factors, and the residual error. These parameters are first quantized using one of the aforementioned PCM, DPCM, or VQ schemes. The number of bits allocated per frame is typically specified by a bit-allocation module that uses perceptual masking thresholds.

The quantized parameters are entropy coded using an explicit noiseless codingstage for final redundancy reduction Huffman or Rice codes are available in theform of look-up tables at the entropy coding stage Entropy coders are employed

DENSITY FUNCTIONS AND QUANTIZATION 53

[Figure 3.1: A typical QBE module employed in audio coding. Stage I (psychoacoustic analysis together with MDCT analysis or prediction of the input audio signal) produces transform coefficients, scale factors, and the residual error; these parameters feed a quantization and entropy coding block, supported by Huffman, Rice, or Golomb code tables and controlled by a bit-allocation module driven by the perceptual masking thresholds and a distortion measure, producing the quantized and entropy coded bitstream.]

Entropy coders are employed in scenarios where the objective is to achieve maximum coding efficiency. In entropy coders, more probable symbols (i.e., frequently occurring amplitudes) are encoded with shorter codewords, and vice versa. This essentially reduces the average data rate. Next, a distortion measure between the input and the encoded parameters is computed and compared against an established threshold. If the distortion metric is greater than the specified threshold, the bit-allocation module supplies additional bits in order to reduce the quantization error. The above procedure is repeated until the distortion falls below the threshold.

3.2 DENSITY FUNCTIONS AND QUANTIZATION

In this section, we discuss the characterization of a random process in terms of its probability density function (PDF). This approach will help us derive the quantization noise equations for different quantization schemes. A random process is characterized by its PDF, which is a non-negative function p(x) whose properties are

\int_{-\infty}^{\infty} p(x)\,dx = 1    (3.1)

and

\int_{x_1}^{x_2} p(x)\,dx = \Pr(x_1 < X \le x_2)    (3.2)

From the above equations, it is evident that the PDF area from x_1 to x_2 is the probability that the random variable X is observed in this range. Since X lies somewhere in (-\infty, \infty), the total area under p(x) is one. The mean and the variance of the random variable X are defined as

\mu_x = E[X] = \int_{-\infty}^{\infty} x\, p(x)\,dx    (3.3)


[Figure 3.2: (a) The Gaussian PDF, p_G(x), and (b) the Laplacian PDF, p_L(x).]

\sigma_x^2 = \int_{-\infty}^{\infty} (x - \mu_x)^2\, p(x)\,dx = E[(X - \mu_x)^2]    (3.4)

Note that the expectation is computed either as a weighted average (3.3) or, under ergodicity assumptions, as a time average (Chapter 2, Eq. 2.54). PDFs are useful in the design of optimal signal quantizers, as they can be used to determine the assignment of optimal quantization levels. PDFs often used to design or analyze quantizers include the zero-mean uniform (p_U(x)), the Gaussian (p_G(x)) (Figure 3.2a), and the Laplacian (p_L(x)). These are given, in that order, below:

p_U(x) = \frac{1}{2S}, \quad -S \le x \le S    (3.5)

p_G(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\, e^{-x^2/(2\sigma_x^2)}    (3.6)

p_L(x) = \frac{1}{\sqrt{2}\,\sigma_x}\, e^{-\sqrt{2}\,|x|/\sigma_x}    (3.7)

where S is some arbitrary non-zero real number and \sigma_x^2 is the variance of the random variable X. Readers are referred to Papoulis' classical book on probability and random variables [Papo91] for an in-depth treatment of random processes.

3.3 SCALAR QUANTIZATION

In this section, we describe various scalar quantization schemes. In particular, we review uniform and nonuniform quantization, and then we present differential PCM coding methods and their adaptive versions.

3.3.1 Uniform Quantization

Uniform PCM is a memoryless process that quantizes amplitudes by rounding off each sample to one of a set of discrete values (Figure 3.3). The difference between adjacent quantization levels, i.e., the step size, is constant in non-adaptive uniform PCM.


[Figure 3.3: (a) Uniform PCM, showing the analog waveform s_a(t), the quantized waveform s_q(t), the quantization levels, the step size Δ, and the quantization noise e_q(t); (b) uniform quantization of a triangular waveform. From the figure, R_b = 3 bits and Q = 8 uniform quantizer levels.]

The number of quantization levels, Q, in uniform PCM binary representations is Q = 2^{R_b}, where R_b denotes the number of bits. The performance of uniform PCM can be described in terms of the signal-to-noise ratio (SNR). Consider that the signal s is to be quantized and that its values lie in the interval

s \in (-s_{max},\, s_{max})    (3.8)

A uniform step size \Delta can then be determined by

\Delta = \frac{2\, s_{max}}{2^{R_b}}    (3.9)

Let us assume that the quantization noise e_q has a uniform PDF, i.e.,

-\frac{\Delta}{2} \le e_q \le \frac{\Delta}{2}    (3.10)

p_{e_q}(e_q) = \frac{1}{\Delta} \quad \text{for } |e_q| \le \frac{\Delta}{2}    (3.11)

From (3.9), (3.10), and (3.11), the variance of the quantization noise can be shown [Jaya84] to be

\sigma_{e_q}^2 = \frac{\Delta^2}{12} = \frac{s_{max}^2\, 2^{-2R_b}}{3}    (3.12)

Therefore, if the input signal is bounded, an increase by 1 bit reduces the noise variance by a factor of four.

[Figure 3.4: (a) Nonuniform PCM and (b) nonuniform quantization of a decaying-exponential waveform. From the figure, R_b = 3 bits and Q = 8 nonuniform quantizer levels.]


In other words, the SNR for uniform PCM will improve by approximately 6 dB per bit, i.e.,

SNR_{PCM} = 6.02\, R_b + K_1 \;(\text{dB})    (3.13)

The factor K_1 is a constant that accounts for the step size and loading factors. For telephone speech, K_1 = -10 [Jaya84].
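
As a quick numerical check of the 6 dB-per-bit rule, the following Python sketch (illustrative only, not part of the original text) quantizes a sine wave with a uniform quantizer built from the step size of (3.9) and measures the SNR as R_b grows; the test signal and the midpoint reconstruction rule are assumptions made for this example.

```python
import numpy as np

def uniform_quantize(s, rb, smax):
    """Uniform quantizer with 2**rb levels over (-smax, smax), step from Eq. (3.9)."""
    delta = 2 * smax / (2 ** rb)                       # step size
    idx = np.floor(s / delta)                          # cell index
    idx = np.clip(idx, -(2 ** (rb - 1)), 2 ** (rb - 1) - 1)
    return (idx + 0.5) * delta                         # reconstruct at cell midpoints

t = np.arange(0, 1, 1.0 / 8000)
s = 0.9 * np.sin(2 * np.pi * 100 * t)                  # test signal bounded by smax = 1
for rb in (4, 6, 8, 10):
    sq = uniform_quantize(s, rb, smax=1.0)
    snr = 10 * np.log10(np.sum(s ** 2) / np.sum((s - sq) ** 2))
    print(f"Rb = {rb:2d} bits -> SNR = {snr:5.1f} dB")
```

Each added bit raises the measured SNR by roughly 6 dB, consistent with (3.13).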

3.3.2 Nonuniform Quantization

Uniform, nonadaptive PCM has no mechanism for exploiting signal redundancy. Moreover, uniform quantizers are optimal in the mean square error (MSE) sense only for signals with a uniform PDF. Nonuniform PCM quantizers use a nonuniform step size (Figure 3.4) that can be determined from the statistical structure of the signal.

PDF-optimized PCM uses fine step sizes for frequently occurring amplitudes and coarse step sizes for less frequently occurring amplitudes. The step sizes can be optimally designed by exploiting the shape of the signal's PDF. A signal with a Gaussian PDF (Figure 3.5), for instance, can be quantized more efficiently in terms of the overall MSE by computing the quantization step sizes and the corresponding centroids such that the mean square quantization noise is minimized [Scha79].

Another class of nonuniform PCM relies on log-quantizers, which are quite common in telephony applications [Scha79] [Jaya84]. In Figure 3.6, a nonuniform quantizer is realized by using a nonlinear mapping function g(.) that maps nonuniform step sizes to uniform step sizes, such that a simple linear quantizer can be used. An example of the mapping function is given in Figure 3.7. The decoder uses an expansion function g^{-1}(.) to recover the signal.

Two telephony standards have been developed based on logarithmic companding, i.e., the μ-law and the A-law.

[Figure 3.5: PDF-optimized PCM for signals with Gaussian distribution, showing the assigned values (centroids) and step sizes Δ_1, Δ_2, Δ_3; quantization levels are on the horizontal axis.]


[Figure 3.6: Nonuniform PCM via compressor g(.) and expander g^{-1}(.) functions around a uniform quantizer.]

[Figure 3.7: Companding function g(s) for nonuniform PCM, with nonuniform input steps Δ_1, Δ_2, Δ_3 mapped to uniform output steps.]

The μ-law companding function is used in the North American PCM standard (μ = 255). The μ-law is given by

|g(s)| = \frac{\log(1 + \mu |s/s_{max}|)}{\log(1 + \mu)}    (3.14)

For μ = 255, (3.14) gives an approximately linear mapping for small amplitudes and a logarithmic mapping for larger amplitudes. The European A-law companding standard is slightly different and is based on the mapping

|g(s)| = \begin{cases} \dfrac{A |s/s_{max}|}{1 + \log(A)}, & \text{for } 0 < |s/s_{max}| < \dfrac{1}{A} \\[2mm] \dfrac{1 + \log(A |s/s_{max}|)}{1 + \log(A)}, & \text{for } \dfrac{1}{A} < |s/s_{max}| < 1 \end{cases}    (3.15)

The idea behind A-law companding is similar to that of the μ-law: again, for signals with small amplitudes the mapping is almost linear, and for large amplitudes the transformation is logarithmic. Both of these techniques can yield superior SNRs, particularly for small amplitudes. In telephony, the companding schemes have been found to reduce bit rates, without degradation, by as much as 4 bits/sample relative to uniform PCM.
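
A minimal sketch of μ-law companding per (3.14), assuming μ = 255 and s_max = 1; the expander implements the inverse mapping g^{-1}(.) used at the decoder.

```python
import numpy as np

def mu_law_compress(s, mu=255.0, smax=1.0):
    """|g(s)| from Eq. (3.14); the sign of s is preserved."""
    return np.sign(s) * np.log1p(mu * np.abs(s / smax)) / np.log1p(mu)

def mu_law_expand(g, mu=255.0, smax=1.0):
    """Inverse mapping g^{-1}(.) applied at the decoder."""
    return np.sign(g) * smax * np.expm1(np.abs(g) * np.log1p(mu)) / mu

s = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
g = mu_law_compress(s)
print(np.allclose(mu_law_expand(g), s))   # True: companding is invertible
```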

Dynamic range variations in PCM can be handled by using an adaptive step size. A PCM system with an adaptive step size is called adaptive PCM (APCM). The step size in a feed-forward system is transmitted as side information, while in a feedback system the step size is estimated from past coded samples (Figure 3.8).


[Figure 3.8: Adaptive PCM with (a) forward estimation of the step size and (b) backward estimation of the step size; each configuration uses a buffer, a quantizer Q(.) and inverse quantizer Q^{-1}(.), and Δ estimators at the encoder and decoder.]

In this figure, Q(.) represents either a uniform or a nonuniform quantization (compression) scheme, and Δ corresponds to the step size.

3.3.3 Differential PCM

A more efficient scalar quantizer is differential PCM (DPCM), which removes redundancy in the audio waveform by exploiting the correlation between adjacent samples. In its simplest form, a DPCM transmitter encodes only the difference between successive samples, and the receiver recovers the signal by integration. Practical DPCM schemes incorporate a time-invariant short-term prediction process, A(z), given by

A(z) = \sum_{i=1}^{p} a_i z^{-i}    (3.16)

where a_i are the prediction coefficients and z is the complex variable of the z-transform. This DPCM scheme is also called predictive differential coding (Figure 3.9), and it reduces the quantization error variance by reducing the variance of the quantizer input. An example of a representative DPCM waveform, e_q(n), along with the associated analog and PCM quantized waveforms, s(t) and s(n), respectively, is given in Figure 3.10.

The DPCM system (Figure 3.9) works as follows. The sample s'(n) is the estimate of the current sample s(n) and is obtained from past sample values. The prediction error e(n) is then quantized (i.e., e_q(n)) and transmitted to the receiver.


[Figure 3.9: DPCM system: (a) transmitter, where the prediction filter A(z) produces s'(n), the prediction error e(n) = s(n) - s'(n) is quantized to e_q(n), and the local reconstruction feeds the predictor; (b) receiver.]

The quantized prediction error is also added to s'(n) in order to reconstruct the sample ŝ(n). In the absence of channel errors, ŝ(n) = s(n). In the simplest case, A(z) is a first-order polynomial. In Figure 3.9, A(z) is given by

A(z) = \sum_{i=1}^{p} a_i z^{-i}    (3.17)

and the predicted signal is given by

s'(n) = \sum_{i=1}^{p} a_i\, \hat{s}(n - i)    (3.18)

The prediction coefficients are usually determined by solving the autocorrelation equations

r_{ss}(m) - \sum_{i=1}^{p} a_i\, r_{ss}(m - i) = 0, \quad \text{for } m = 1, 2, \ldots, p    (3.19)

where r_{ss}(m) are the autocorrelation samples of s(n). The details of the equation above will be discussed in Chapter 4. Two other types of scalar coders are the delta modulation (DM) and the adaptive DPCM (ADPCM) coders [Cumm73] [Gibs74] [Gibs78] [Yatr88]. DM can be viewed as a special case of DPCM where the difference (prediction error) is encoded with one bit. DM typically operates at sampling rates much higher than the rates commonly used with DPCM. The step size in DM may also be adaptive.
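
The following sketch illustrates a first-order DPCM loop (p = 1); the predictor coefficient a_1 and the uniform error quantizer step are illustrative assumptions, not values from the text.

```python
import numpy as np

def dpcm_encode_decode(s, a1=0.9, delta=0.05):
    """First-order DPCM: predict from the previously reconstructed sample,
    quantize the prediction error uniformly, and integrate at the receiver."""
    s_rec = np.zeros_like(s)
    prev = 0.0                                   # last reconstructed sample
    for n in range(len(s)):
        pred = a1 * prev                         # s'(n), Eq. (3.18) with p = 1
        e = s[n] - pred                          # prediction error e(n)
        eq = delta * np.round(e / delta)         # quantized error eq(n)
        s_rec[n] = pred + eq                     # reconstruction (same at transmitter and receiver)
        prev = s_rec[n]
    return s_rec

t = np.arange(0, 1, 1.0 / 8000)
s = np.sin(2 * np.pi * 50 * t)
s_rec = dpcm_encode_decode(s)
print("reconstruction MSE:", np.mean((s - s_rec) ** 2))
```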

In an adaptive differential PCM (ADPCM) system, both the step size and the predictor are allowed to adapt and track the time-varying statistics of the input signal [Span94]. The predictor can be either forward adaptive or backward adaptive. In forward adaptation, the prediction parameters are estimated from the current data, which are not available at the receiver.


[Figure 3.10: Uniform quantization. (a) Analog input signal x_a(t). (b) PCM waveform y_d(n) (3-bit digitization of the analog signal); output after quantization: [101 100 011 011 101 001 100 001 101 010 100]; total number of bits in the PCM digitization = 33. (c) Differential PCM waveform z_d(n) (2-bit digitization of the analog signal); output after quantization: [10 10 01 01 10 00 10 00 10 01 01]; total number of bits in the DPCM digitization = 22. As an aside, note that, relative to PCM, DPCM reduces the number of bits for encoding by reducing the variance of the input signal; the dynamic range of the input signal can be reduced by exploiting the redundancy present within adjacent samples of the signal.]

Therefore, the prediction parameters must be encoded and transmitted separately in order to reconstruct the signal at the receiver. In backward adaptation, the parameters are estimated from past data, which are already available at the receiver. Therefore, the prediction parameters can be estimated locally at the receiver. Backward predictor adaptation is amenable to low-delay coding [Gibs90] [Chen92].


ADPCM encoders with pole-zero decoder filters have proved to be particularly versatile in speech applications. In fact, the ADPCM 32 kb/s algorithm adopted for the ITU-T G.726 [G726] standard (formerly G.721 [G721]) uses a pole-zero adaptive predictor.

3.4 VECTOR QUANTIZATION

Data compression via vector quantization (VQ) is achieved by encoding a data set jointly in block or vector form. Figure 3.11(a) shows an N-dimensional quantizer and a codebook. The incoming vectors can be formed from consecutive data samples or from model parameters. The quantizer maps the i-th incoming [N x 1] vector, given by

s_i = [s_i(0), s_i(1), \ldots, s_i(N-1)]^T    (3.20)

to an n-th channel symbol u_n, n = 1, 2, \ldots, L, as shown in Figure 3.11(a). The codebook consists of L code vectors,

\hat{s}_n = [\hat{s}_n(0), \hat{s}_n(1), \ldots, \hat{s}_n(N-1)]^T, \quad n = 1, 2, \ldots, L    (3.21)

which reside in the memory of the transmitter and the receiver. A vector quantizer works as follows. The input vectors s_i are compared to each codeword ŝ_n, and the address of the closest codeword, with respect to a distortion measure ε(s_i, ŝ_n), determines the channel symbol to be transmitted. The simplest and most commonly used distortion measure is the sum of squared errors, which is given by

\varepsilon(s_i, \hat{s}_n) = \sum_{k=0}^{N-1} (s_i(k) - \hat{s}_n(k))^2    (3.22)

The L [N x 1] real-valued vectors are entries of the codebook and are designed by dividing the vector space into L nonoverlapping cells, c_n, as shown in Figure 3.11(b). Each cell c_n is associated with a template vector ŝ_n. The quantizer assigns the channel symbol u_n to the vector s_i if s_i belongs to c_n. The channel symbol u_n is usually a binary representation of the codebook index of ŝ_n.
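
A minimal full-search VQ encoder per (3.22) is sketched below; the codebook is a random placeholder, and the decoder is simply a table look-up of ŝ_n by the received index.

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Full-search VQ: for each [N x 1] input vector, return the index of the
    code vector minimizing the squared-error distortion of Eq. (3.22)."""
    # vectors: [M x N], codebook: [L x N]
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)                   # channel symbols u_n (codebook indices)

rng = np.random.default_rng(0)
codebook = rng.standard_normal((16, 4))          # L = 16 code vectors, N = 4
x = rng.standard_normal((8, 4))                  # eight input vectors
idx = vq_encode(x, codebook)
x_hat = codebook[idx]                            # decoder table look-up
print(idx, float(np.mean((x - x_hat) ** 2)))
```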

A vector quantizer can be considered as a generalization of scalar PCM; in fact, Gersho [Gers83] calls it vector PCM (VPCM). In VPCM, the codebook is fully searched, and the number of bits per sample is given by

B = \frac{1}{N} \log_2 L    (3.23)

The signal-to-noise ratio for VPCM is given by

SNR_N = 6B + K_N \;(\text{dB})    (3.24)


[Figure 3.11: Vector quantization scheme: (a) block diagram with encoder, codebook ROM, channel, decoder, and codebook ROM (s_i -> u_n -> ŝ_n); (b) cells c_n and their centroids for two-dimensional VQ, i.e., N = 2.]

Note that for N = 1, VPCM defaults to scalar PCM, and therefore (3.13) is a special case of (3.24). Although the two equations are quite similar, VPCM yields improved SNR (reflected in K_N) since it exploits the correlation within the vectors. VQ offers significant coding gain by increasing N and L. However, the memory and the computational complexity required grow exponentially with N for a given rate. In general, the benefits of VQ are realized at rates of 1 bit per sample or less. The codebook design process, also known as the training or populating process, can be fixed or adaptive.

Fixed codebooks are designed a priori; the basic design procedure involves an initial guess for the codebook and then iterative improvement by using a large number of training vectors.


An iterative codebook design algorithm that works for a large class of distortion measures was given by Linde, Buzo, and Gray [Lind80]. This is essentially an extension of Lloyd's [Lloy82] scalar quantizer design and is often referred to as the "LBG algorithm." Typically, the number of training vectors per code vector must be at least ten, and preferably fifty [Makh85]. Since speech and audio are nonstationary signals, one may also wish to adapt the codebooks ("codebook design on the fly") to the signal statistics. A quantizer with an adaptive codebook is called adaptive VQ (A-VQ), and applications to speech coding have been reported in [Paul82], [Cupe85], and [Cupe89]. There are two types of A-VQ, namely forward adaptive and backward adaptive. In backward A-VQ, codebook updating is based on past data that is also available at the decoder. Forward A-VQ updates the codebooks based on current (or sometimes future) data, and as such, additional information must be encoded.
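
A compact sketch of the LBG iteration [Lind80] under the squared-error distortion: assign each training vector to its nearest code vector, then move each code vector to the centroid of its cell. The random initialization, fixed iteration count, and empty-cell handling are simplifications assumed for this example.

```python
import numpy as np

def lbg(train, L, iters=50, seed=0):
    """Design an [L x N] codebook by iterating nearest-neighbor partitioning
    and centroid updates (LBG / generalized Lloyd algorithm)."""
    rng = np.random.default_rng(seed)
    codebook = train[rng.choice(len(train), L, replace=False)].copy()
    for _ in range(iters):
        d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d.argmin(axis=1)
        for l in range(L):
            members = train[nearest == l]
            if len(members):                     # keep the old code vector if the cell is empty
                codebook[l] = members.mean(axis=0)
    return codebook

rng = np.random.default_rng(1)
train = rng.standard_normal((160, 4))            # ten training vectors per code vector
cb = lbg(train, L=16)
print(cb.shape)                                  # (16, 4)
```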

3.4.1 Structured VQ

The complexity of high-dimensionality VQ can be reduced significantly with the use of structured codebooks that allow for efficient search. Tree-structured [Buzo80] and multi-step [Juan82] vector quantizers are associated with lower encoding complexity at the expense of a modest loss of performance. Multi-step vector quantizers consist of a cascade of two or more quantizers, each one encoding the error, or residual, of the previous quantizer. In Figure 3.12(a), the first VQ codebook, L1, encodes the signal s(k), and the subsequent VQ stages, L2 through LM, encode the errors e_1(k) through e_{M-1}(k) from the previous stages, respectively. In particular, the codebook L1 is first searched, and the vector ŝ_{l_1}(k) that minimizes the MSE (3.25) is selected, where l_1 is the codebook index associated with the first-stage codebook,

\varepsilon_l = \sum_{k=0}^{N-1} (s(k) - \hat{s}_l(k))^2, \quad \text{for } l = 1, 2, 3, \ldots, L_1    (3.25)

Next, the difference between the original input s(k) and the first-stage codeword ŝ_{l_1}(k) is computed, as shown in Figure 3.12(a). This is given by

e_1(k) = s(k) - \hat{s}_{l_1}(k), \quad \text{for } k = 0, 1, 2, 3, \ldots, N-1    (3.26)

The residual e_1(k) is used in the second stage as the reference signal to be approximated. Codebooks L2, L3, ..., LM are searched sequentially, and the code vectors ê_{l_2,1}(k), ê_{l_3,2}(k), ..., ê_{l_M,M-1}(k) that result in the minimum MSE (3.27) are chosen as the codewords. Note that l_2, l_3, ..., l_M are the codebook indices

[Figure 3.12: (a) Multi-step (M-stage) VQ block diagram: stage-1 (vector quantizer-1, codebook L1) encodes s(k), k = 0, 1, ..., N-1, and stages 2 through M (codebooks L2, ..., LM) encode the residuals e_1(k), ..., e_{M-1}(k); at the decoder, the signal ŝ(k) can be reconstructed as ŝ(k) = ŝ_{l_1}(k) + ê_{l_2,1}(k) + ... + ê_{l_M,M-1}(k), for k = 0, 1, 2, ..., N-1. (b) An example multi-step codebook structure (M = 3): the first-stage N-dimensional codebook consists of L1 code vectors, and the numbers of entries in the second- and third-stage codebooks are L2 and L3, respectively. Usually, in order to reduce the computational complexity, the numbers of code vectors are chosen such that L1 > L2 > L3.]


associated with the 2nd-, 3rd-, ..., M-th-stage codebooks, respectively.

For codebook L2:

\varepsilon_{l_2} = \sum_{k=0}^{N-1} (e_1(k) - \hat{e}_{l,1}(k))^2, \quad \text{for } l = 1, 2, 3, \ldots, L_2

For codebook LM:

\varepsilon_{l_M} = \sum_{k=0}^{N-1} (e_{M-1}(k) - \hat{e}_{l,M-1}(k))^2, \quad \text{for } l = 1, 2, 3, \ldots, L_M    (3.27)

At the decoder, the transmitted codeword ŝ(k) can be reconstructed as follows:

\hat{s}(k) = \hat{s}_{l_1}(k) + \hat{e}_{l_2,1}(k) + \cdots + \hat{e}_{l_M,M-1}(k), \quad \text{for } k = 0, 1, 2, \ldots, N-1    (3.28)
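
A two-stage residual (multi-step) VQ sketch following (3.25)-(3.28): the second codebook quantizes the error left by the first stage, and the decoder sums the selected code vectors. The random codebooks stand in for trained ones.

```python
import numpy as np

def nearest(codebook, x):
    """Index of the code vector closest to x in the squared-error sense."""
    return int(((codebook - x) ** 2).sum(axis=1).argmin())

def multistep_encode(x, codebooks):
    """Each stage quantizes the residual left by the previous stage (Eqs. 3.25-3.27)."""
    indices, residual = [], x.copy()
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = residual - cb[i]
    return indices

def multistep_decode(indices, codebooks):
    """Decoder reconstruction, Eq. (3.28): sum of the selected code vectors."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(2)
codebooks = [rng.standard_normal((16, 4)), rng.standard_normal((8, 4))]
x = rng.standard_normal(4)
idx = multistep_encode(x, codebooks)
print(idx, float(np.sum((x - multistep_decode(idx, codebooks)) ** 2)))
```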

The complexity of VQ can also be reduced by normalizing the vectors of the codebook and encoding the gain separately. This technique is called gain/shape VQ (GS-VQ); it was introduced by Buzo et al. [Buzo80] and later studied by Sabin and Gray [Sabi82]. The waveform shape is represented by a code vector from the shape codebook, while the gain is encoded from the gain codebook (Figure 3.13).

[Figure 3.13: Gain/shape (GS)-VQ encoder and decoder (s_i -> u_n over the channel -> ŝ_i), each with a shape codebook and a gain codebook. In GS-VQ, the idea is to encode the waveform shape and the gain separately, using the shape and gain codebooks, respectively.]


The idea of encoding the gain separately allows for the encoding of vectors of high dimensionality with manageable complexity, and it is widely used in encoding the excitation signal in code-excited linear predictive (CELP) coders [Atal90]. An alternative method for building highly structured codebooks consists of forming the code vectors by linearly combining a small set of basis vectors [Gers90b].

3.4.2 Split-VQ

In split-VQ, it is typical to employ two stages of VQs; multiple codebooks of smaller dimensions are used in the second stage. A two-stage split-VQ block diagram is given in Figure 3.14(a), and the corresponding split-VQ codebook structure is shown in Figure 3.14(b). From Figure 3.14(b), note that the first-stage codebook, L1, employs an N-dimensional VQ and consists of L1 code vectors. The second-stage codebook is implemented as a split-VQ and consists of a combination of two N/2-dimensional codebooks, L2 and L3. The numbers of entries in these two codebooks are L2 and L3, respectively. First, the codebook L1 is searched, and the vector ŝ_{l_1}(k) that minimizes the MSE (3.29) is selected, where l_1 is the index of the first-stage codebook,

\varepsilon_l = \sum_{k=0}^{N-1} (s(k) - \hat{s}_l(k))^2, \quad \text{for } l = 1, 2, 3, \ldots, L_1    (3.29)

Next, the difference between the original input s(k) and the first-stage codeword ŝ_{l_1}(k) is computed, as shown in Figure 3.14(a):

e(k) = s(k) - \hat{s}_{l_1}(k), \quad \text{for } k = 0, 1, 2, 3, \ldots, N-1    (3.30)

The residual e(k) is used in the second stage as the reference signal to be approximated. Codebooks L2 and L3 are searched separately, and the code vectors ê_{l_2,low} and ê_{l_3,upp} that result in the minimum MSE (3.31) and (3.32), respectively, are chosen as the codewords. Note that l_2 and l_3 are the codebook indices associated with the second- and third-stage codebooks.

For codebook L2:

\varepsilon_{l,low} = \sum_{k=0}^{N/2 - 1} (e(k) - \hat{e}_{l,low}(k))^2, \quad \text{for } l = 1, 2, 3, \ldots, L_2    (3.31)

For codebook L3:

\varepsilon_{l,upp} = \sum_{k=N/2}^{N-1} (e(k) - \hat{e}_{l,upp}(k - N/2))^2, \quad \text{for } l = 1, 2, 3, \ldots, L_3    (3.32)


[Figure 3.14: Split VQ. (a) A two-stage split-VQ block diagram: vector quantizer-1 (codebook L1) operates on s(k), k = 0, 1, ..., N-1; vector quantizer-2A (codebook L2) operates on the lower half of the residual, e_low = e(k'), k' = 0, 1, ..., N/2 - 1, and vector quantizer-2B (codebook L3) on the upper half, e_upp = e(k''), k'' = N/2, N/2 + 1, ..., N-1. (b) The split-VQ codebook structure. In split-VQ, the code vector search is performed by "dividing" the codebook into smaller-dimension codebooks; here, the second-stage N-dimensional VQ has been divided into two N/2-dimensional split VQs. The first-stage codebook L1 employs an N-dimensional VQ and consists of L1 entries, while the second-stage codebook is implemented as a split-VQ consisting of two N/2-dimensional codebooks, L2 and L3, with L2 and L3 entries, respectively. The codebook indices, i.e., l_1, l_2, and l_3, are encoded and transmitted; at the decoder, these codebook indices are used to reconstruct the transmitted codeword ŝ(k), Eq. (3.33).]

At the decoder, the transmitted codeword ŝ(k) can be reconstructed as follows:

\hat{s}(k) = \begin{cases} \hat{s}_{l_1}(k) + \hat{e}_{l_2,low}(k), & \text{for } k = 0, 1, 2, \ldots, N/2 - 1 \\ \hat{s}_{l_1}(k) + \hat{e}_{l_3,upp}(k - N/2), & \text{for } k = N/2, N/2+1, \ldots, N-1 \end{cases}    (3.33)

Split-VQ offers high coding accuracy, however at the cost of increased computational complexity and a slight drop in coding gain.


Paliwal and Atal discuss these issues in [Pali91] [Pali93], while presenting an algorithm for vector quantization of speech LPC parameters at 24 bits/frame. Despite the aforementioned shortcomings, split-VQ techniques are very efficient when it comes to encoding line spectrum prediction parameters in several speech and audio coding standards, such as the ITU-T G.729 CS-ACELP standard [G729], the IS-893 Selectable Mode Vocoder (SMV) [IS-893], and the MPEG-4 General Audio (GA) Coding-Twin VQ tool [ISOI99].

3.4.3 Conjugate-Structure VQ

Conjugate-structure VQ (CS-VQ) [Kata93] [Kata96] enables joint quantization of two or more parameters. The CS-VQ works as follows. Let s(n) be the target vector that has to be approximated; the MSE ε to be minimized is given by

\varepsilon = \frac{1}{N} \sum_{n=0}^{N-1} |e(n)|^2 = \frac{1}{N} \sum_{n=0}^{N-1} |s(n) - g_1 u(n) - g_2 v(n)|^2    (3.34)

where u(n) and v(n) are some arbitrary vectors, and g_1 and g_2 are the gains to be vector quantized using the CS codebook given in Figure 3.15. From this figure, codebooks A and B contain P and Q entries, respectively. In both codebooks, the first-column element corresponds to parameter 1, i.e., g_1, and the second-column element represents parameter 2, i.e., g_2. The optimum combination of g_1 and g_2 that results in the minimum MSE (3.34) is computed from the PQ permutations as follows:

g_1(i, j) = g_{A1i} + g_{B1j}, \quad i \in [1, 2, \ldots, P], \; j \in [1, 2, \ldots, Q]    (3.35)

g_2(i, j) = g_{A2i} + g_{B2j}, \quad i \in [1, 2, \ldots, P], \; j \in [1, 2, \ldots, Q]    (3.36)
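
A brute-force sketch of the conjugate-structure gain search of (3.34)-(3.36): all P*Q index pairs are evaluated and the pair minimizing the MSE is kept. The codebooks and the vectors u(n) and v(n) are random placeholders assumed for this example.

```python
import numpy as np

def csvq_gain_search(s, u, v, cb_a, cb_b):
    """Search all (i, j) pairs; gains are g1 = gA1i + gB1j and g2 = gA2i + gB2j."""
    best = (None, None, np.inf)
    for i, (ga1, ga2) in enumerate(cb_a):
        for j, (gb1, gb2) in enumerate(cb_b):
            g1, g2 = ga1 + gb1, ga2 + gb2               # Eqs. (3.35)-(3.36)
            mse = np.mean((s - g1 * u - g2 * v) ** 2)    # Eq. (3.34)
            if mse < best[2]:
                best = (i, j, mse)
    return best

rng = np.random.default_rng(3)
N = 40
s, u, v = rng.standard_normal((3, N))
cb_a = rng.standard_normal((8, 2))                       # P = 8 conjugate entries (g1, g2 parts)
cb_b = rng.standard_normal((4, 2))                       # Q = 4 conjugate entries
print(csvq_gain_search(s, u, v, cb_a, cb_b))
```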

[Figure 3.15: An example CS-VQ. Codebook A has P indexed rows (g_{A1i}, g_{A2i}) and codebook B has Q indexed rows (g_{B1j}, g_{B2j}); the codebooks 'A' and 'B' are conjugate.]


CS-VQ codebooks are particularly handy in scenarios that involve joint quantization of excitation gains. Second-generation near-toll-quality CELP codecs (e.g., ITU-T G.729) and third-generation (3G) CELP standards for cellular applications (e.g., the TIA IS-893 Selectable Mode Vocoder) employ CS-VQ codebooks to encode the adaptive and stochastic excitation gains [Atal90] [Sala98] [G729] [IS-893]. A CS-VQ is also used to vector quantize the transformed spectral coefficients in the MPEG-4 Twin-VQ encoder.

3.5 BIT-ALLOCATION ALGORITHMS

Until now, we discussed various scalar and vector quantization algorithms without emphasizing how the number of quantization levels is determined. In this section, we review some of the fundamental bit-allocation techniques. A bit-allocation algorithm determines the number of bits required to quantize an audio frame with reduced audible distortions. Bit allocation can be based on certain perceptual rules or on spectral characteristics. From Figure 3.1, parameters typically quantized include the transform coefficients x, the scale factors S, and the residual error e. For now, let us consider that the transform coefficients x are to be quantized, i.e.,

x = [x_1, x_2, x_3, \ldots, x_{N_f}]^T    (3.37)

where N_f represents the total number of transform coefficients. Let the total number of bits available to quantize the transform coefficients be N bits. Our objective is to find an optimum way of distributing the available N bits across the individual transform coefficients such that a distortion measure D is minimized. The distortion D is given by

D = \frac{1}{N_f} \sum_{i=1}^{N_f} E[(x_i - \hat{x}_i)^2] = \frac{1}{N_f} \sum_{i=1}^{N_f} d_i    (3.38)

where x_i and \hat{x}_i denote the i-th unquantized and quantized transform coefficients, respectively, and E[.] is the expectation operator. Let n_i be the number of bits assigned to the coefficient x_i for quantization, such that

\sum_{i=1}^{N_f} n_i \le N    (3.39)

Note that if the x_i are uniformly distributed for all i, then we can employ a simple uniform bit allocation across all the transform coefficients, i.e.,

n_i = \left[\frac{N}{N_f}\right], \quad \forall i \in [1, N_f]    (3.40)

However, in practice, the transform coefficients x may not have a uniform probability distribution.


Therefore, employing an equal number of bits for both large and small amplitudes may result in spending extra bits on the smaller amplitudes. Moreover, in such scenarios, for a given N, the distortion D can be very high.

Example 3.1

An example of the aforementioned discussion is presented in Table 3.1. Uniform bit allocation is employed to quantize both the uniformly distributed and the Gaussian-distributed transform coefficients. In this example, we assume that a total of N = 64 bits are available for quantization and N_f = 16 samples; therefore, n_i = 4, for all i in [1, 16]. Note that the input vectors x_u and x_g have been randomly generated in MATLAB using the rand(1, N_f) and randn(1, N_f) functions, respectively. The distortions D_u and D_g are computed using (3.38) and are given by 0.00023927 and 0.00042573, respectively. From Example 3.1, we note that uniform bit allocation is not optimal in all cases, especially when the distribution of the unquantized vector x is not uniform. Therefore, we need a cost function that minimizes the distortion d_i while the constraint given in (3.39) is met. This is given by

\min_{n_i} D = \min_{n_i} \left\{ \frac{1}{N_f} \sum_{i=1}^{N_f} E[(x_i - \hat{x}_i)^2] \right\} = \min_{n_i} \left\{ \frac{1}{N_f} \sum_{i=1}^{N_f} \sigma_i^2 \right\}    (3.41)

where \sigma_i^2 is the variance. Note that the above minimization problem can be simplified if the quantization noise has a uniform PDF [Jaya84]. From (3.12),

\sigma_i^2 = \frac{x_i^2}{3(2^{2n_i})}    (3.42)

Table 3.1. Uniform bit-allocation scheme, where n_i = [N/N_f], for all i in [1, N_f].

Uniformly distributed coefficients:
  Input vector x_u      = [0.6029, 0.3806, 0.56222, 0.12649, 0.026904, 0.47535, 0.4553, 0.38398, 0.41811, 0.35213, 0.23434, 0.32256, 0.31352, 0.3026, 0.32179, 0.16496]
  Quantized vector x̂_u  = [0.625, 0.375, 0.5625, 0.125, 0.25, 0.5, 0.4375, 0.375, 0.4375, 0.375, 0.25, 0.3125, 0.3125, 0.3125, 0.3125, 0.1875]

Gaussian-distributed coefficients:
  Input vector x_g      = [0.5199, 2.4205, -0.94578, -0.0081113, -0.42986, -0.87688, 1.1553, -0.82724, -1.345, -0.15859, -0.23544, 0.85353, 0.016574, -2.0292, 1.2702, 0.28333]
  Quantized vector x̂_g  = [0.5, 2.4375, -0.9375, 0, -0.4375, -0.875, 1.125, -0.8125, -1.375, -0.1875, -0.25, 0.875, 0, -2, 1.25, 0.3125]

D = (1/N_f) \sum_{i=1}^{N_f} E[(x_i - \hat{x}_i)^2]; n_i = 4 for all i in [1, 16]; D_u = 0.00023927 and D_g = 0.00042573.


Substituting (3.42) in (3.41) and minimizing with respect to n_i,

\frac{\partial D}{\partial n_i} = \frac{x_i^2}{3}(-2)\, 2^{-2n_i} \ln 2 + K_1 = 0    (3.43)

n_i = \frac{1}{2} \log_2 x_i^2 + K    (3.44)

From (3.44) and (3.39),

\sum_{i=1}^{N_f} n_i = \sum_{i=1}^{N_f} \left( \frac{1}{2} \log_2 x_i^2 + K \right) = N    (3.45)

K = \frac{N}{N_f} - \frac{1}{2N_f} \sum_{i=1}^{N_f} \log_2 x_i^2 = \frac{N}{N_f} - \frac{1}{2N_f} \log_2 \left( \prod_{i=1}^{N_f} x_i^2 \right)    (3.46)

Substituting (3.46) in (3.44), we can obtain the optimum bit allocation, n_i^{optimum}, as

n_i^{optimum} = \frac{N}{N_f} + \frac{1}{2} \log_2 \frac{x_i^2}{\left( \prod_{i=1}^{N_f} x_i^2 \right)^{1/N_f}}    (3.47)
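
A direct sketch of (3.47), assuming all coefficients are non-zero; the returned allocations are real-valued and can be negative, which is exactly the issue discussed next. The input vector below is illustrative data, not the values of Table 3.1.

```python
import numpy as np

def optimal_bit_allocation(x, n_total):
    """n_i = N/Nf + 0.5*log2( x_i^2 / (prod x_i^2)^(1/Nf) ), Eq. (3.47)."""
    x = np.asarray(x, dtype=float)
    nf = len(x)
    log_gm = np.mean(np.log2(x ** 2))          # log2 of the geometric mean of x_i^2
    return n_total / nf + 0.5 * (np.log2(x ** 2) - log_gm)

x = np.array([0.52, 2.42, -0.95, -0.008, -0.43, -0.88, 1.16, -0.83,
              -1.35, -0.16, -0.24, 0.85, 0.017, -2.03, 1.27, 0.28])
n = optimal_bit_allocation(x, n_total=64)
print(np.round(n, 2), round(float(n.sum()), 2))   # the allocations sum to N = 64
```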

Table 3.2 presents the optimum bit assignment for both the uniformly distributed and the Gaussian-distributed transform coefficients considered in the previous example (Table 3.1). From Table 3.2, note that the resulting optimal bit allocation for the Gaussian-distributed transform coefficients produced two negative integers. Several techniques have been proposed to avoid this scenario, namely, the sequential bit-allocation method [Rams82] and Segall's method [Sega76]; for more detailed descriptions, readers are referred to [Jaya84] [Madi97]. Note that the bit-allocation scheme given by (3.47) may not be optimal either in the perceptual sense or in the SNR sense. This is because the minimization of (3.41) is performed without considering either the perceptual noise-masking thresholds or the dependence of the signal-to-noise power on the optimal number of bits, n_i^{optimum}. Also note that the distortion D_u in the case of optimal bit assignment (Table 3.2) is slightly greater than in the case of uniform bit allocation (Table 3.1). This can be attributed to the fact that fewer quantization levels must have been assigned to the low-powered transform coefficients relative to the number of levels implied by (3.47). Moreover, a maximum coding gain can be achieved when the audio signal spectrum is non-flat. One of the important remarks presented in [Jaya84] relevant to the ongoing discussion is that when the geometric mean of x_i^2 is less than the arithmetic mean of x_i^2, the optimal bit-allocation scheme performs better than the uniform bit allocation.

Table 3.2. Optimal bit-allocation scheme, where n_i^{optimum} = N/N_f + (1/2) log_2 x_i^2 - (1/2) log_2 (\prod_{i=1}^{N_f} x_i^2)^{1/N_f}. The input vectors x_u and x_g are those of Table 3.1.

Uniformly distributed coefficients: bits allocated n = [5, 4, 5, 3, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3]; distortion D_u = 0.0002468.
Gaussian-distributed coefficients: bits allocated n = [4, 6, 5, -2, 4, 5, 5, 5, 6, 3, 3, 5, -1, 6, 6, 3]; distortion D_g = 0.00022878.


The ratio of the two means is captured in the spectral flatness measure (SFM), i.e.,

\text{Geometric Mean, } GM = \left( \prod_{i=1}^{N_f} x_i^2 \right)^{1/N_f}

\text{Arithmetic Mean, } AM = \frac{1}{N_f} \sum_{i=1}^{N_f} x_i^2

SFM = \frac{GM}{AM}, \quad SFM \in [0, 1]    (3.48)
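
A small sketch of the SFM in (3.48), computed in the log domain for numerical robustness; the test vectors are arbitrary illustrative data.

```python
import numpy as np

def spectral_flatness(x):
    """SFM = geometric mean / arithmetic mean of x_i^2, Eq. (3.48)."""
    p = np.asarray(x, dtype=float) ** 2
    gm = np.exp(np.mean(np.log(p)))            # geometric mean (assumes all p > 0)
    am = np.mean(p)
    return gm / am

print(spectral_flatness(np.ones(16)))                       # flat spectrum  -> 1.0
print(spectral_flatness(np.array([4.0, 0.1, 0.1, 0.1])))    # peaky spectrum -> close to 0
```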

Other important considerations, in addition to the SNR and spectral flatness measures, are the perceptual noise masking, the noise-to-mask ratio (NMR), and the signal-to-mask ratio (SMR). All these measures are used in perceptual bit-allocation methods. Since, at this point, readers have not yet been introduced to the principles of psychoacoustics and the concepts of SMR and NMR, we defer a discussion of perceptually based bit allocation to Chapter 5, Section 5.8.

3.6 ENTROPY CODING

It is worthwhile to consider the theoretical limits on the minimum number of bits required to represent an audio sample. Shannon, in his mathematical theory of communication [Shan48], proved that the minimum number of bits required to encode a message X is given by the entropy H_e(X). The entropy of an input signal can be defined as follows. Let X = [x_1, x_2, \ldots, x_N] be the input data vector of length N, and let p_i be the probability that the i-th symbol (over the symbol set V = [v_1, v_2, \ldots, v_K]) is transmitted. The entropy H_e(X) is given by

H_e(X) = -\sum_{i=1}^{K} p_i \log_2(p_i)    (3.49)

In simple terms, entropy is a measure of the uncertainty of a random variable. For example, let the input bitstream to be encoded be X = [4, 5, 6, 6, 2, 5, 4, 4, 5, 4, 4], i.e., N = 11, the symbol set V = [2, 4, 5, 6], and the corresponding probabilities [1/11, 5/11, 3/11, 2/11], respectively, with K = 4. The entropy H_e(X) can be computed as follows:

H_e(X) = -\sum_{i=1}^{K} p_i \log_2(p_i) = -\left[ \frac{1}{11}\log_2\!\left(\frac{1}{11}\right) + \frac{5}{11}\log_2\!\left(\frac{5}{11}\right) + \frac{3}{11}\log_2\!\left(\frac{3}{11}\right) + \frac{2}{11}\log_2\!\left(\frac{2}{11}\right) \right] = 1.7899    (3.50)
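
The entropy computation of (3.49) and (3.50) takes only a few lines of Python; the bitstream below is the one used in the text.

```python
from collections import Counter
from math import log2

def entropy(symbols):
    """He(X) = -sum_i p_i log2 p_i, Eq. (3.49), with p_i estimated from symbol counts."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in counts.values())

X = [4, 5, 6, 6, 2, 5, 4, 4, 5, 4, 4]
print(round(entropy(X), 4))    # 1.7899, matching Eq. (3.50)
```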


Table 3.3. An example entropy code for Example 3.2.

Swimmer    Probability of winning    Binary string (identifier)
S1         1/2                       0
S2         1/4                       10
S3         1/8                       110
S4         1/16                      1110
S5         1/64                      111100
S6         1/64                      111101
S7         1/64                      111110
S8         1/64                      111111

Example 3.2

Consider eight swimmers, S1, S2, S3, S4, S5, S6, S7, and S8, in a race with win probabilities 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, and 1/64, respectively. The entropy of the message announcing the winner can be computed as

H_e(X) = -\sum_{i=1}^{8} p_i \log_2(p_i) = -\left[ \frac{1}{2}\log_2\!\left(\frac{1}{2}\right) + \frac{1}{4}\log_2\!\left(\frac{1}{4}\right) + \frac{1}{8}\log_2\!\left(\frac{1}{8}\right) + \frac{1}{16}\log_2\!\left(\frac{1}{16}\right) + \frac{4}{64}\log_2\!\left(\frac{1}{64}\right) \right] = 2

An example of an entropy code for the above message can be obtained by associating binary strings with the swimmers' probabilities of winning, as shown in Table 3.3. The average length of the example entropy code given in Table 3.3 is 2 bits, in contrast with 3 bits for a uniform code.

The statistical entropy alone does not provide a good measure of compressibility in the case of audio coding, since several other factors, i.e., quantization noise, masking thresholds, and tone- and noise-masking effects, must be accounted for in order to achieve efficiency. Johnston, in 1988, proposed a theoretical limit on compressibility for audio signals (about 2.1 bits/sample) based on a measure of the perceptually relevant information content. The limit is obtained from both psychoacoustic signal analysis and statistical entropy and is called the perceptual entropy [John88a] [John88b]. The various steps involved in the perceptual entropy estimation are described later in Chapter 5. In all the entropy coding schemes, the objective is to construct an ensemble code for each message such that the code is uniquely decodable, prefix-free, and optimum in the sense that it provides minimum-redundancy encoding. In particular, some basic restrictions and design considerations imposed on a source-coding process include:


• Condition 1: Each message should be assigned a unique code (see Example 3.3).

• Condition 2: The codes must be prefix-free. For example, consider the following two code sequences (CS): CS-1 = {00, 11, 10, 011, 001} and CS-2 = {00, 11, 10, 011, 010}, and let the output sequence to be decoded be 001100011. At the decoder, if CS-1 is employed, the decoded sequence can be {00, 11, 00, 011}, or {001, 10, 001, 1}, or {001, 10, 00, 11}, etc. This is due to the confusion at the decoder over whether to select '00' or '001' from the output sequence. This confusion is avoided using CS-2, where the decoded sequence is unique and is given by {00, 11, 00, 011}. Therefore, in a valid prefix-free code, no codeword in its entirety can be found as the prefix of another codeword.

• Condition 3: Additional information regarding the beginning and end points of a message source will not usually be available at the decoder (once synchronization occurs).

• Condition 4: A necessary condition for a code to be prefix-free is given by the Kraft inequality [Cove91],

KI = \sum_{i=1}^{N} 2^{-L_i} \le 1    (3.51)

where L_i is the codeword length of the i-th symbol. In the example discussed in Condition 2, the Kraft inequality KI for both CS-1 and CS-2 is '1'. Although the Kraft inequality for CS-1 is satisfied, the encoding sequence is not prefix-free; note that (3.51) is not a sufficient condition for a code to be prefix-free.

• Condition 5: To obtain a minimum-redundancy code, the compression rate R must be minimized while (3.51) is satisfied,

R = \sum_{i=1}^{N} p_i L_i    (3.52)

Example 3.3

Let the input bitstream X = [4, 5, 6, 6, 2, 5, 4, 4, 1, 4, 4] be chosen over the data set V = [0, 1, 2, 3, 4, 5, 6, 7]. Here, N = 11, K = 8, and the probabilities are p_i = [0, 1/11, 1/11, 0, 5/11, 2/11, 2/11, 0], i ∈ V.

In Figure 3.16(a), a simple binary representation with an equal-length code is used. The code length of each symbol is given by l = int(log_2 K), i.e., l = int(log_2 8) = 3. Therefore, a possible binary mapping would be 1 -> 001, 2 -> 010, 4 -> 100, 5 -> 101, 6 -> 110, and the total length is L_b = 33 bits.

Figure 3.16(b) depicts the Shannon-Fano coding procedure. Each symbol is encoded using a unary representation based on the symbol probability.


Input bitstream: [4 5 6 6 2 5 4 4 1 4 4]

(a) Encoding based on the binary equal-length code:
[100 101 110 110 010 101 100 100 001 100 100]; total length L_b = 33 bits.

(b) Encoding based on Shannon-Fano coding:
4's occur 5 times, 5's twice, 6's twice, 2's once, and 1's once.
Probabilities: p_4 = 5/11, p_5 = 2/11, p_6 = 2/11, p_2 = 1/11, and p_1 = 1/11.
Representations: 4 -> 0, 5 -> 10, 6 -> 110, 1 -> 1110, 2 -> 11110.
[0 10 110 110 11110 10 0 0 1110 0 0]; total length L_SF = 24 bits.

Figure 3.16. Entropy coding schemes: (a) binary equal-length code; (b) Shannon-Fano coding.

3.6.1 Huffman Coding

Huffman proposed a technique to construct minimum-redundancy codes [Huff52]. Huffman codes have found applications in audio and video encoding due to their simplicity and efficiency. Moreover, Huffman coding is regarded as the most effective compression method, provided that the codes designed using a specific set of symbol frequencies match the input symbol frequencies. The PDFs of audio signals over shorter frame lengths are better described by the Gaussian distribution, while the long-time PDFs of audio can be characterized by the Laplacian or gamma densities [Rabi78] [Jaya84]. Hence, for example, Huffman codes designed based on the Gaussian or Laplacian PDFs can provide minimum-redundancy entropy codes for audio encoding. Moreover, depending on the symbol frequencies, a series of Huffman code tables can also be employed for entropy coding; e.g., MPEG-1 Layer III employs 32 Huffman code tables [ISOI94].

Example 3.4

Figure 3.17 depicts the Huffman coding procedure for the numerical Example 3.3. The input symbols, i.e., 1, 2, 4, 5, and 6, are first arranged in ascending order with respect to their probabilities. Next, the two symbols with the smallest probabilities are combined to form a binary tree. The left branch is assigned a "0" and the right branch is represented by a "1". The probability of the resulting node is obtained by adding the two probabilities of the previous nodes, as shown in Figure 3.17. The above procedure is continued until all the input symbol nodes are used. Finally, the Huffman code for each input symbol is formed by reading the bits along the tree. For example, the Huffman codeword for the input symbol "1" is given by "0000". The resulting Huffman bit mapping is given in Table 3.4, and the total length of the encoded bitstream is L_HF = 23 bits.


[Figure 3.17: A possible Huffman coding tree for Example 3.4. The symbols 1, 2, 6, and 5 (probabilities 1/11, 1/11, 2/11, 2/11) are merged pairwise toward the root (probability 11/11), with symbol 4 (probability 5/11) joining last, yielding the codewords 0000, 0001, 001, 01, and 1, respectively.]

Table 3.4. Huffman code table for the input bitstream X = [4, 5, 6, 6, 2, 5, 4, 4, 1, 4, 4].

Input symbol    Probability    Huffman codeword
4               5/11           1
5               2/11           01
6               2/11           001
2               1/11           0001
1               1/11           0000

Note that, depending on the node selection during code-tree formation, several Huffman bit mappings are possible, for example, 4 -> 1, 5 -> 011, 6 -> 010, 2 -> 001, and 1 -> 000, as shown in Figure 3.18. However, the total number of bits remains the same, i.e., L_HF = 23 bits.
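
A compact Huffman construction sketch using a binary heap. Applied to the bitstream of Example 3.4 it produces a total length of 23 bits; the individual codewords can differ from Table 3.4 since, as noted above, several optimal trees exist.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code from symbol counts; returns {symbol: bitstring}."""
    counts = Counter(symbols)
    heap = [[w, i, [s, ""]] for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                               # degenerate single-symbol case
        return {heap[0][2][0]: "0"}
    tiebreak = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        for pair in lo[2:]:
            pair[1] = "0" + pair[1]                  # prepend a 0 on the low-probability side
        for pair in hi[2:]:
            pair[1] = "1" + pair[1]                  # prepend a 1 on the high-probability side
        heapq.heappush(heap, [lo[0] + hi[0], tiebreak] + lo[2:] + hi[2:])
        tiebreak += 1
    return dict(heap[0][2:])

X = [4, 5, 6, 6, 2, 5, 4, 4, 1, 4, 4]
code = huffman_code(X)
print(code)
print("total bits:", sum(len(code[s]) for s in X))   # 23, as in Example 3.4
```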

Example 3.5

The entropy of the input bitstream X = [4, 5, 6, 6, 2, 5, 4, 4, 1, 4, 4] is given by

H_e(X) = -\sum_{i=1}^{K} p_i \log_2(p_i) = -\left[ \frac{2}{11}\log_2\!\left(\frac{1}{11}\right) + \frac{4}{11}\log_2\!\left(\frac{2}{11}\right) + \frac{5}{11}\log_2\!\left(\frac{5}{11}\right) \right] = 2.04    (3.53)


[Figure 3.18: Huffman coding tree for Example 3.4 corresponding to the alternative bit mapping 1 -> 000, 2 -> 001, 6 -> 010, 5 -> 011, and 4 -> 1.]

From Figures 3.16 and 3.17, the compression rate R is obtained using:

(a) the uniform binary representation: 33/11 = 3 bits/symbol;
(b) Shannon-Fano coding: 24/11 = 2.18 bits/symbol; and
(c) Huffman coding: 23/11 = 2.09 bits/symbol.

In the case of Huffman coding, the entropy H_e(X) and the compression rate R can be related using the entropy bounds [Cove91]:

H_e(X) \le R \le H_e(X) + 1    (3.54)

It is interesting to note that the compression rate for the Huffman code will be equal to the lower entropy bound, i.e., R = H_e(X), if the input symbol frequencies are radix-2 (see Example 3.2).

Example 3.6

The Huffman code table for a different input symbol frequency than the one given in Example 3.4. Consider the input bitstream Y = [2, 5, 6, 6, 2, 5, 5, 4, 1, 4, 4] chosen over the data set V = [0, 1, 2, 3, 4, 5, 6, 7], with the probabilities p_i = [0, 1/11, 2/11, 0, 3/11, 3/11, 2/11, 0], i ∈ V. Using the design procedure described above, a Huffman code tree can be formed as shown in Figure 3.19. Table 3.5 presents the resulting Huffman code table. The total length of the Huffman-encoded bitstream is L_HF = 25 bits.

Depending on the Huffman code table design procedure employed, three different encoding approaches are possible.


[Figure 3.19: Huffman coding tree for Example 3.6, yielding the codewords 000, 001, 01, 10, and 11 for the symbols 1, 2, 6, 5, and 4, respectively.]

Table 3.5. Huffman code table for the input bitstream Y = [2, 5, 6, 6, 2, 5, 5, 4, 1, 4, 4].

Input symbol    Probability    Huffman codeword
4               3/11           11
5               3/11           10
6               2/11           01
2               2/11           001
1               1/11           000

First, entropy coding can be based on Huffman codes designed beforehand, i.e., nonadaptive Huffman coding. In particular, a training process involving a large database of input symbols is employed to design the Huffman codes. These Huffman code tables are then available both at the encoder and at the decoder. It is important to note that this approach may not (always) result in minimum-redundancy encoding. For example, if the Huffman bit mapping given in Table 3.5 is used to encode the input bitstream X = [4, 5, 6, 6, 2, 5, 4, 4, 1, 4, 4] of Example 3.3, the resulting total number of bits is L_HF = 24 bits, i.e., one bit more compared to Example 3.4. Therefore, in order to obtain better compression, a reliable symbol-frequency model is necessary. A series of Huffman code tables (in the range of 10 to 32), based on the symbol probabilities, is usually employed in order to overcome this shortcoming. The nonadaptive Huffman coding method is typically employed in a variety of audio coding standards [ISOI92] [ISOI94] [ISOI96] [John96]. Second, Huffman coding can be based on an iterative design/encode procedure, i.e., semi-adaptive Huffman coding. In the entropy coding literature, this approach is typically called the "two-pass" encoding scheme.


In the first pass, a Huffman code table is designed based on the input symbol statistics; in the second pass, entropy coding is performed using the designed Huffman code table (similar to Examples 3.4 and 3.6). In this approach, note that the designed Huffman code tables must also be transmitted along with the entropy-coded bitstream. This results in reduced coding efficiency, however with improved symbol-frequency modeling. Third, adaptive Huffman coding is based on symbol frequencies computed dynamically from the previous samples. Adaptive Huffman coding schemes based on the input quantization step size have also been proposed in order to accommodate a wide range of input word lengths [Crav96] [Crav97].

3.6.2 Rice Coding

Rice, in 1979, proposed a method for constructing practical noiseless codes [Rice79]. Rice codes are usually employed when the input signal x exhibits the Laplacian distribution, i.e.,

p_L(x) = \frac{1}{\sqrt{2}\,\sigma_x}\, e^{-\sqrt{2}\,|x|/\sigma_x}    (3.55)

A Rice code can be considered as a Huffman code for the Laplacian PDF. Several efficient algorithms are available to form Rice codes [Rice79] [Cove91]. A simple method to represent the integer I as a Rice code is to divide the integer into four parts: a sign bit, the m low-order bits (LSBs), the number corresponding to the remaining MSBs of I expressed as a run of zeros, and a stop bit '1'. The parameter 'm' characterizes the Rice code and is given by [Robi94]

m = \log_2(\log_e(2)\, E(|x|))    (3.56)

For example, the Rice code for I = 69 and m = 4 is given by [0 0101 0000 1].

Example 3.7

Rice coding for Example 3.3; input bitstream = [4, 5, 6, 6, 2, 5, 4, 4, 1, 4, 4].

Input symbol    Binary representation    Rice code (m = 2)
4               100                      0 00 0 1
5               101                      0 01 0 1
6               110                      0 10 0 1
2               010                      0 10 1
1               001                      0 01 1
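
A sketch of the simple Rice construction described above (sign bit, the m LSBs, the remaining MSBs as a run of zeros, and a stop bit '1'); non-negative integers are assumed, so the sign bit is always '0' here.

```python
def rice_encode(i, m):
    """Rice code of a non-negative integer: sign bit, m LSBs of i,
    then (i >> m) zeros, then a stop bit '1'."""
    sign = "0"                                   # non-negative values assumed
    lsb = format(i & ((1 << m) - 1), f"0{m}b")   # m low-order bits
    msb = "0" * (i >> m)                         # remaining MSBs as a run of zeros
    return sign + lsb + msb + "1"

print(rice_encode(69, 4))                        # 0 0101 0000 1 -> "0010100001"
for sym in (4, 5, 6, 2, 1):
    print(sym, rice_encode(sym, 2))              # matches the table of Example 3.7
```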


3.6.3 Golomb Coding

Golomb codes [Golo66] are optimal for exponentially decaying probability distributions of positive integers. Golomb codes are prefix codes that can be characterized by a unique parameter "m". An integer "I" can be encoded using a Golomb code as follows. The code consists of two parts: a binary representation of (I mod m) and a unary representation of [I/m]. For example, consider I = 69 and m = 16. The Golomb code will be [010111100], as explained in Figure 3.20. In Method 1, the positive integer I is divided into two parts, i.e., binary and unary bits, along with a stop bit. On the other hand, in Method 2, if m = 2^k, the codeword for I consists of the k LSBs of I, followed by the number formed by the remaining MSBs of I in unary representation, and a stop bit. Therefore, the length of the code is k + [I/2^k] + 1.

Example 3.8

Consider the input bitstream X = [4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4] chosen over the data set V = [2, 4]. The run-length encoding scheme [Golo66] can be employed to efficiently encode X. Note that '4' is the most frequently occurring symbol in X. The number of consecutive occurrences of '4' is called the run length, n. The run lengths are monitored and encoded, i.e., [3, 0, 6, 4]; here, '0' represents the consecutive occurrence of '2'. The probabilities of occurrence of '4' and '2' are given by p(4) = p = 13/16 and p(2) = (1 - p) = q = 3/16, respectively. Note that p is much greater than q. For this case,

Method 1: n = 69, m = 16
  First part: Binary(n mod m) = Binary(69 mod 16) = Binary(5) = 0101
  Second part: Unary([n/m]) = Unary(4) = 1110
  Stop bit = 0
  Golomb code: [0101 1110 0] = first part + second part + stop bit

Method 2: n = 69, m = 16, i.e., k = 4 (where m = 2^k)
  First part: k LSBs of n = 4 LSBs of [1000101] = 0101
  Second part: Unary(rest of the MSBs) = Unary(4) = 1110
  Stop bit = 0
  Golomb code: [0101 1110 0] = first part + second part + stop bit

Figure 3.20. Golomb coding.


Huffman coding of X results in 16 bits. Moreover, the PDFs of the run lengths are better described using an exponential distribution, i.e., the probability of a run length of n is given by q p^n, which is an exponential distribution. Rice coding [Rice79] or Golomb coding [Golo66] can be employed to efficiently encode the exponentially distributed run lengths. Furthermore, both Golomb and Rice codes are fast prefix codes that allow for practical implementation.
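
As a concrete follow-up to Example 3.8, the sketch below run-length encodes X and then Golomb-Rice encodes the run lengths. It uses the common Golomb-Rice convention (unary quotient as a run of ones closed by a single zero, followed by the k LSBs), which differs slightly in bit layout from the two methods drawn in Figure 3.20.

```python
def run_lengths(seq, frequent=4):
    """Lengths of consecutive runs of the frequent symbol ('0' marks an empty run)."""
    runs, count = [], 0
    for s in seq:
        if s == frequent:
            count += 1
        else:
            runs.append(count)
            count = 0
    runs.append(count)
    return runs

def golomb_rice(i, k):
    """Golomb-Rice code for m = 2**k: unary quotient (ones + terminating zero), then k LSBs."""
    return "1" * (i >> k) + "0" + format(i & ((1 << k) - 1), f"0{k}b")

X = [4, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4]
runs = run_lengths(X)
print(runs)                                      # [3, 0, 6, 4], as in Example 3.8
print([golomb_rice(r, 1) for r in runs])
```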

3.6.4 Arithmetic Coding

Arithmetic coding [Riss79] [Witt87] [Howa94] deals with encoding a sequence of input symbols as a single large binary fraction known as a "codeword." For example, let V = [v_1, v_2, \ldots, v_K] be the data set, let p_i be the probability that the i-th symbol is transmitted, and let X = [x_1, x_2, \ldots, x_N] be the input data vector of length N. The main idea behind an arithmetic coder is to encode the input data stream X as one codeword that corresponds to a rational number in the half-open unit interval [0, 1). Arithmetic coding is particularly useful when dealing with adaptive encoding and with highly skewed symbol probabilities [Witt87].

Example 3.9

Arithmetic coding of the input stream X = [1, 0, -1, 0, 1] chosen over a data set V = [-1, 0, 1]. Here, N = 5 and K = 3. We will use the following symbol probabilities: p_i = [1/3, 1/2, 1/6], i ∈ V.

Step 1

The probabilities associated with the data set V = [-1, 0, 1] are arranged as intervals on a scale of [0, 1), as shown in Figure 3.21.

Step 2

The first input symbol in the data stream X is '1'. Therefore, the interval [5/6, 1) is chosen as the target range.

Step 3

The second input symbol in the data stream X is '0'. Now, the interval [5/6, 1) is partitioned according to the symbol probabilities 1/3, 1/2, and 1/6. The resulting interval ranges are given in Figure 3.21. For example, the interval range for symbol '-1' can be computed as

\left[ \frac{5}{6},\; \frac{5}{6} + \frac{1}{6}\cdot\frac{1}{3} \right) = \left[ \frac{5}{6},\; \frac{16}{18} \right)

and for symbol '0' the interval range is given by

\left[ \frac{16}{18},\; \frac{5}{6} + \frac{1}{6}\cdot\frac{1}{3} + \frac{1}{6}\cdot\frac{1}{2} \right) = \left[ \frac{16}{18},\; \frac{35}{36} \right)


[Figure 3.21: Arithmetic coding of the input data stream X = [1, 0, -1, 0, 1]. First, the probabilities associated with the data set V = [-1, 0, 1] are arranged as intervals on a scale of [0, 1). Next, in step 2, an interval is chosen that corresponds to the probability of the input symbol '1' in the data sequence X, i.e., [5/6, 1). In step 3, the interval [5/6, 1) is partitioned according to the probabilities 1/3, 1/2, and 1/6, and the range corresponding to the input symbol '0' is chosen, i.e., [16/18, 35/36). This procedure is repeated for the rest of the input symbols, and the final interval range (typically a rational number) is encoded in binary form.]


Step 4

The third input symbol in the data stream X is '-1'. The interval [16/18, 35/36) is partitioned according to the symbol probabilities 1/3, 1/2, and 1/6. The resulting interval ranges are given in Figure 3.21.

The above procedure is repeated for the rest of the input symbols, and a final interval range is obtained, i.e., [393/432, 394/432), which is approximately [0.9097222, 0.912037). In binary form, the interval is given by [0.1110100011..., 0.1110100101...). Since all binary numbers that begin with 0.1110100100 lie within the interval [393/432, 394/432), the binary codeword 1110100100 uniquely represents the input data stream X = [1, 0, -1, 0, 1].
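
A small sketch that reproduces the interval narrowing of Example 3.9 using exact fractions; it stops at the final interval rather than emitting bits, since the binary-expansion step is described in the text. The symbol layout on [0, 1) follows Figure 3.21.

```python
from fractions import Fraction

def arithmetic_interval(stream, probs):
    """Narrow [0, 1) according to the cumulative probabilities of each symbol."""
    cum, low = {}, Fraction(0)
    for s in probs:                          # cumulative lower bound of each symbol's slot
        cum[s] = low
        low += probs[s]
    lo, width = Fraction(0), Fraction(1)
    for s in stream:
        lo = lo + width * cum[s]
        width = width * probs[s]
    return lo, lo + width

# symbols laid out on [0, 1) in the order -1, 0, 1, as in Figure 3.21
probs = {-1: Fraction(1, 3), 0: Fraction(1, 2), 1: Fraction(1, 6)}
lo, hi = arithmetic_interval([1, 0, -1, 0, 1], probs)
print(lo, hi)   # 131/144 197/216, i.e., [393/432, 394/432) of Example 3.9 in lowest terms
```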

3.7 SUMMARY

This chapter covered quantization essentials and provided background on PCM, DPCM, vector quantization, bit allocation, and entropy coding algorithms. A quantization-bit allocation-entropy coding (QBE) framework that is part of most of the audio coding standards was described. Some of the important concepts addressed in this chapter include:

• Uniform and nonuniform quantization
• PCM, DPCM, and ADPCM techniques
• Vector quantization, structured VQ, split VQ, and conjugate-structure VQ
• Bit-allocation strategies
• Source coding principles
• Lossless (entropy) coders: Huffman coding, Rice coding, Golomb coding, and arithmetic coding

PROBLEMS

3.1. Derive the PCM 6 dB-per-bit rule when the quantization error has a uniform probability density function.

3.2. For a signal with Gaussian distribution (zero mean and unit variance):
(a) Design a uniform PCM quantizer with four levels.
(b) Design a nonuniform four-level quantizer that is optimized for the signal PDF. Compare with the uniform PCM in terms of SNR.

3.3. For the PDF p(x) = (1/2) e^{-|x|}, determine the mean, the variance, and the probability that a random variable will fall within ±σ_x of the mean value.


[Figure 3.22: An example PDF p(x), defined on -1 ≤ x ≤ 1 with peak value 1.]

3.4. For the PDF p(x) given in Figure 3.22, design a four-level PDF-optimized PCM and compare to uniform PCM in terms of SNR.

3.5. Give and justify a formula for the number of bits in simple vector quantization with N x 1 vectors and L template vectors.

3.6. Give, in terms of L and N, the order of complexity of a VQ codebook search, where L is the number of codebook entries and N is the codebook dimension. Consider the following cases: (i) a simple VQ, (ii) a multi-step VQ, and (iii) a split VQ. For (ii) and (iii), use the configurations given in Figure 3.12 and Figure 3.14, respectively.

COMPUTER EXERCISES

3.7. Design a DPCM coder for a stationary random signal with power spectral density S(e^{jΩ}) = 1 / |1 + 0.8 e^{jΩ}|^2. Use a first-order predictor. Give a block diagram and all pertinent equations. Write a program that implements the DPCM coder and evaluate the MSE at the receiver. Compare the SNR (for the same data) with a PCM system operating at the same bit rate.

3.8. In this problem, you will write a computer program to design a vector quantizer and generate a codebook of size [L x N]. Here, L is the number of codebook entries and N is the codebook dimension.

Step 1. Generate a training set T_in of size [Ln x N], where n is the number of training vectors per code vector. Assume L = 16, N = 4, and n = 10. Denote the training set elements as t_in(i, j), for i = 0, 1, ..., 159 and j = 0, 1, 2, 3. Use Gaussian vectors of zero mean and unit variance for training.

Step 2. Using the LBG algorithm [Lind80], design a vector quantizer and generate a codebook C of size [L x N], i.e., [16 x 4]. In the LBG VQ design, choose the distortion threshold as 0.001. Label the code vectors as c(i, j), for i = 0, 1, ..., 15 and j = 0, 1, 2, 3.

Table 3.6. Segmental and overall SNR values for a test signal within and outside the training sequence, for different numbers of codebook entries L, different codebook dimensions N, and ε = 0.001. (The SNR cells are to be filled in by the reader.)

Training vectors   Codebook        SNR within the training sequence (dB)    SNR outside the training sequence (dB)
per entry, n       dimension, N    Segmental (L=16, L=64) | Overall (L=16, L=64)    Segmental (L=16, L=64) | Overall (L=16, L=64)
10                 2
10                 4
100                2
100                4


[Figure 3.23: Four-bit VQ design specifications for Problem 3.9: a vector quantizer (codebook L1) with L = 16, N = 4, n = 100, and ε = 0.00001, mapping s to ŝ.]

Step 3. Similar to Step 1, generate another training set T_out of size [160 x 4] that will be used for testing the VQ performance. Label these training set values as t_out(i, j), for i = 0, 1, ..., 159 and j = 0, 1, 2, 3.

Step 4. Using the codebook C designed in Step 2, perform vector quantization of t_in(i, j) and t_out(i, j). Denote the VQ results as t̂_in(i, j) and t̂_out(i, j), respectively.

(a) When the test vectors are within the training sequence, compute the overall SNR and segmental SNR values as follows:

SNR_{overall} = \frac{\sum_{i=0}^{Ln-1} \sum_{j=0}^{N-1} t_{in}^2(i, j)}{\sum_{i=0}^{Ln-1} \sum_{j=0}^{N-1} (t_{in}(i, j) - \hat{t}_{in}(i, j))^2}    (3.57)

SNR_{segmental} = \frac{1}{Ln} \sum_{i=0}^{Ln-1} \frac{\sum_{j=0}^{N-1} t_{in}^2(i, j)}{\sum_{j=0}^{N-1} (t_{in}(i, j) - \hat{t}_{in}(i, j))^2}    (3.58)

b Compute the over-all SNR and segmental SNR values when the testvectors are different from the training ones ie replace tin(i j) withtout (i j) and tin(i j) with tout (i j) in (357) and (358)

c. List in Table 3.6 the overall and segmental SNR values for different numbers of codebook entries and different codebook dimensions. Explain the effects of choosing different values of L, n, and N on the SNR values.

d. Compute the MSE,

\varepsilon(t_{in}, \hat{t}_{in}) = \frac{1}{Ln} \frac{1}{N} \sum_{i=0}^{Ln-1} \sum_{j=0}^{N-1} \left( t_{in}(i,j) - \hat{t}_{in}(i,j) \right)^2,

for different cases, e.g., L = 16, 64; n = 10, 100, 1000; N = 2, 8. Explain how the MSE varies for different values of L, n, and N. (A MATLAB sketch of the SNR and MSE computations in parts (a)-(d) is given below.)
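The following MATLAB sketch evaluates (3.57), (3.58), and the MSE of part (d), assuming the training set Tin and codebook C from Steps 1-2 are already in the workspace; the nearest-neighbor quantization loop shown here is only an illustrative stand-in for your own VQ routine.

% Quantize the training set with codebook C, then compute the SNR and MSE measures.
[Ln, N] = size(Tin);
Tin_hat = zeros(Ln, N);
for i = 1:Ln
    [~, k] = min(sum((C - Tin(i,:)).^2, 2));    % index of the nearest codevector
    Tin_hat(i,:) = C(k,:);
end
err = Tin - Tin_hat;
SNR_overall   = sum(Tin(:).^2) / sum(err(:).^2);           % Eq. (3.57), as a ratio
SNR_segmental = mean(sum(Tin.^2, 2) ./ sum(err.^2, 2));    % Eq. (3.58)
MSE = sum(err(:).^2) / (Ln*N);                             % part (d)
% Take 10*log10 of the SNR ratios to express them in dB for Table 3.6.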

3.9 Write a program to design a 4-bit VQ codebook L1 (i.e., use L = 16 codebook entries) with codebook dimension N = 4 (Figure 3.23). Use n = 100 training vectors per codebook entry and a distortion threshold ε = 0.00001. For VQ training, choose zero-mean and unit-variance Gaussian vectors.


Figure 3.24. A three-stage vector quantizer. Stage 1 (Vector Quantizer-1, codebook L1) encodes the input signal s; stages 2 and 3 (Vector Quantizer-2 and -3, codebooks L2 and L3) encode the error signals e1 and e2 left by the previous stages, producing ê1 and ê2. Each stage is a 4-bit VQ with L = 16, N = 4, and n = 100; the per-stage distortion thresholds are 0.001, 0.0001, and 0.00001.

3.10 Extend the VQ design in Problem 3.9 to a multi-step VQ (see Figure 3.24 for an example multi-step VQ configuration). Use a total of three stages in your VQ design. Choose the MSE distortion thresholds in each of the stages as ε1 = 0.001, ε2 = 0.0001, and ε3 = 0.00001. Comment on the MSE convergence in each of the stages. How would you compare the multi-step VQ with a simple VQ in terms of the segmental and overall SNR values? (Note: In Figure 3.24, the first VQ codebook (L1) encodes the signal s, and the subsequent VQ stages L2 and L3 encode the error from the previous stage. At the decoder, the signal s′ can be reconstructed as s′ = ŝ + ê1 + ê2.)

3.11 Design a two-stage split-VQ. Choose L = 16 and n = 100. Implement the first stage as a 4-dimensional VQ and the second stage as two 2-dimensional VQs. Select the distortion thresholds as follows: for the first stage, ε1 = 0.001, and for the second stage, ε2 = 0.00001. Compare the coding accuracy, in terms of a distance measure, and the coding gain, in terms of the number of bits/sample, of the split-VQ with respect to the simple VQ in Problem 3.9 and the multi-step VQ in Problem 3.10. (See Figure 3.14.)

3.12 Given the input data stream X = [1 0 2 1 0 1 2 1 0 2 0 1 1] chosen over a data set V = [0 1 2]:
a. Write a program to compute the entropy He(X).
b. Compute the symbol probabilities pi, i ∈ V, for the input data stream X.
c. Write a program to encode X using Huffman codes. Employ an appropriate Huffman bit-mapping. Give the length of the output bitstream. (Hint: See Example 3.4 and Example 3.6.)
d. Use arithmetic coding to encode the input data stream X. Give the final codeword interval range in binary form. Give also the length of the output bitstream. (See Example 3.9.)

CHAPTER 4

LINEAR PREDICTION IN NARROWBAND AND WIDEBAND CODING

4.1 INTRODUCTION

Linear predictive coders are embedded in several telephony and multimedia standards [G729] [G7231] [IS-893] [ISOI99]. Linear predictive coding (LPC) [Kroo95] is mostly used for source coding of speech signals, and the dominant application of LPC is cellular telephony. Recently, linear prediction (LP) analysis/synthesis has also been integrated in some of the wideband speech coding standards [G722] [G7222] [Bess02] and in audio modeling [Iwak96] [Mori96] [Harm97a] [Harm97b] [Bola98] [ISOI00].

LP analysis/synthesis exploits the short- and long-term correlation to parameterize the signal in terms of a source-system representation. LP analysis can be open loop or closed loop. In closed-loop analysis, also called analysis-by-synthesis, the LP parameters are estimated by minimizing the "perceptually weighted" difference between the original and reconstructed signal. Speech coding standards use a perceptual weighting filter (PWF) to shape the quantization noise according to the masking properties of the human ear [Schr79] [Kroo95] [Sala98]. Although the PWF has been successful in speech coding, audio coding requires a more sophisticated strategy to exploit perceptual redundancies. To this end, several extensions [Bess02] [G7222] to the conventional LPC have been proposed. Hybrid transform/predictive coding techniques have also been employed for high-quality, low-bit-rate coding [Ramp98] [Ramp99] [Rong99] [ISOI99] [ISOI00]. Other LP methods that make use of perceptual constraints and



auditory psychophysics include the perceptual LP (PLP) [Herm90], the warped LP (WLP) [Stru80] [Harm96] [Harm01], and the perceptually motivated all-pole (PMAP) modeling [Atti05]. In the PLP, a perceptually based auditory spectrum is obtained by filtering the signal using a filter bank that mimics the auditory filter bank. An all-pole filter that approximates the auditory spectrum is then computed using the autocorrelation method [Makh75]. On the other hand, in the WLP, the main idea is to warp the frequency axis according to a Bark scale prior to performing LP analysis. The PMAP modeling employs an auditory excitation pattern matching method to directly estimate the perceptually relevant pole locations. The estimated "perceptual poles" are then used to construct an all-pole filter for speech analysis/synthesis.

Whether or not LPC is amenable to audio modeling depends on the signal properties. For example, a code-excited linear predictive (CELP) coder seems to be more adequate than a sinusoidal coder for telephone speech, while the sinusoidal coder seems to be more promising for music.

4.2 LP-BASED SOURCE-SYSTEM MODELING FOR SPEECH

Speech is produced by the interaction of the vocal tract with the vocal cords. The LP analysis/synthesis framework (Figures 4.1 and 4.2) has been successful for speech coding because it fits well the source-system paradigm for speech [Makh75] [Mark76]. In particular, the slowly time-varying spectral characteristics of the upper vocal tract (system) are modeled by an all-pole filter, while the prediction residual captures the voiced, unvoiced, or mixed excitation signal. The LP analysis filter A(z) in Figure 4.1 is given by

A(z) = 1 - \sum_{i=1}^{L} a_i z^{-i}    (4.1)

Figure 4.1. Parameter estimation using linear prediction: the input speech is framed (buffered), and LP analysis of each speech frame yields the vocal tract (LP spectral envelope) model A(z) = 1 - \sum_{i=1}^{L} a_i z^{-i} and the linear prediction residual.


Figure 4.2. Engineering model for speech synthesis: an excitation scaled by a gain drives the vocal tract filter 1/A(z) (LP synthesis) to produce the synthesized signal.

where L is the order of the linear predictor. Figure 4.2 depicts a simple speech synthesis model where a time-varying digital filter is excited by quasi-periodic waveforms when speech is voiced (e.g., as in steady vowels) and random waveforms for unvoiced speech (e.g., as in consonants). The inverse filter 1/A(z) shown in Figure 4.2 is an all-pole LP synthesis filter,

H(z) = \frac{G}{A(z)} = \frac{G}{1 - \sum_{i=1}^{L} a_i z^{-i}}    (4.2)

where G represents the gain. Note that the term all-pole is used loosely, since (4.2) has zeros at z = 0. The frequency response associated with the LP synthesis filter, i.e., the LPC spectrum, represents the formant structure of the speech signal (Figure 4.3). In this figure, F1, F2, F3, and F4 represent the four formants. (A short MATLAB sketch of this LPC-versus-FFT spectrum comparison follows Figure 4.3.)

Figure 4.3. The LPC spectrum and the FFT spectrum (dotted line), shown as magnitude (dB) versus normalized frequency (× 4 kHz); F1, F2, F3, and F4 mark the formants, which represent the resonant modes of the vocal tract.
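The sketch below reproduces the flavor of Figure 4.3 for one analysis frame; the frame vector s, the sampling rate, the model order, and the use of the Signal Processing Toolbox routines lpc and hamming are illustrative assumptions.

% LPC spectrum vs. FFT spectrum for one speech frame s (column vector).
L  = 10;                                % predictor order
x  = s .* hamming(length(s));           % windowed analysis frame
a  = lpc(x, L);                         % coefficients of A(z): [1 -a1 ... -aL]
G  = sqrt(mean(filter(a, 1, x).^2));    % residual RMS used as the gain in (4.2)
Nf = 512;
X  = fft(x, Nf);                        % FFT spectrum of the frame
A  = fft(a, Nf);                        % A(z) evaluated on the unit circle
Henv = G ./ A;                          % LPC spectral envelope G/A(z)
k = 1:Nf/2;                             % keep bins up to the fold-over frequency
plot(k/Nf, 20*log10(abs(X(k))), ':', k/Nf, 20*log10(abs(Henv(k))));
xlabel('Normalized frequency (x Fs)'); ylabel('Magnitude (dB)');
legend('FFT spectrum', 'LPC spectrum');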


4.3 SHORT-TERM LINEAR PREDICTION

Figure 4.4 presents a typical L-th order FIR linear predictor. During forward linear prediction of s(n), an estimated value ŝ(n) is computed as a linear combination of the previous samples, i.e.,

\hat{s}(n) = \sum_{i=1}^{L} a_i s(n-i)    (4.3)

where the weights a_i are the LP coefficients. The output of the LP analysis filter A(z) is called the prediction residual, e(n) = s(n) - ŝ(n). This is given by

e(n) = s(n) - \sum_{i=1}^{L} a_i s(n-i)    (4.4)

Because only short-term delays are considered in (4.4), the linear predictor in Figure 4.4 is also referred to as the short-term linear predictor. The linear predictor coefficients a_i are estimated using least-squares minimization of the prediction error, i.e.,

\varepsilon = E[e^2(n)] = E\left[ \left( s(n) - \sum_{i=1}^{L} a_i s(n-i) \right)^2 \right]    (4.5)

The minimization of ε in (4.5) with respect to a_i, i.e., ∂ε/∂a_i = 0 for i = 1, 2, ..., L, yields a set of equations involving autocorrelations:

r_{ss}(m) - \sum_{i=1}^{L} a_i r_{ss}(m-i) = 0, \quad m = 1, 2, \ldots, L    (4.6)

where r_ss(m) is the autocorrelation sequence of the signal s(n). Equation (4.6) can be written in matrix form, i.e.,

Figure 4.4. Linear prediction (LP) analysis: the LP analysis filter A(z) = 1 - \sum_{i=1}^{L} a_i z^{-i} is an FIR structure in which delayed samples of s(n), weighted by a_1, ..., a_{L-1}, a_L, form the prediction that is subtracted from s(n) to produce the residual e(n).


\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_L \end{bmatrix} =
\begin{bmatrix}
r_{ss}(0) & r_{ss}(-1) & r_{ss}(-2) & \cdots & r_{ss}(1-L) \\
r_{ss}(1) & r_{ss}(0) & r_{ss}(-1) & \cdots & r_{ss}(2-L) \\
r_{ss}(2) & r_{ss}(1) & r_{ss}(0) & \cdots & r_{ss}(3-L) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
r_{ss}(L-1) & r_{ss}(L-2) & r_{ss}(L-3) & \cdots & r_{ss}(0)
\end{bmatrix}^{-1}
\begin{bmatrix} r_{ss}(1) \\ r_{ss}(2) \\ r_{ss}(3) \\ \vdots \\ r_{ss}(L) \end{bmatrix}    (4.7)

or more compactly

a = R_{ss}^{-1} r_{ss}    (4.8)

where a is the LP coefficient vector, r_ss is the autocorrelation vector, and R_ss is the autocorrelation matrix. Note that R_ss has a Toeplitz and symmetric structure. Efficient algorithms [Makh75] [Mark76] [Marp87] are available for inverting the autocorrelation matrix R_ss, including algorithms tailored to work well with finite-precision arithmetic [Gers90]. Typically, the Levinson-Durbin recursive algorithm [Makh75] is used to compute the LP coefficients. Preconditioning of the input sequence s(n) and the autocorrelation data r_ss(m) using tapered windows improves the numerical behavior of these algorithms [Klei95] [Kroo95]. In addition, bandwidth expansion or scaling of the LP coefficients is typical in LPC, as it reduces distortion during synthesis.
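A compact MATLAB sketch of solving (4.6)-(4.8) with the Levinson-Durbin recursion follows; the function name and the direct matrix-inverse cross-check are illustrative, not a prescribed implementation.

% Minimal Levinson-Durbin recursion for the normal equations (4.6)-(4.8).
% rss(1) = r_ss(0), rss(2) = r_ss(1), ..., rss(L+1) = r_ss(L).
function a = lev_durbin(rss, L)
    rss = rss(:);
    a = zeros(L, 1);                 % predictor coefficients a_1 ... a_L of (4.1)
    E = rss(1);                      % zeroth-order prediction error energy
    for m = 1:L
        k = (rss(m+1) - sum(a(1:m-1) .* rss(m:-1:2))) / E;   % reflection coefficient
        a(1:m) = [a(1:m-1) - k * a(m-1:-1:1); k];            % order-m coefficient update
        E = (1 - k^2) * E;                                   % updated error energy
    end
end
% Cross-check against the matrix form (4.7)-(4.8):
%   Rss = toeplitz(rss(1:L));  a_check = Rss \ rss(2:L+1);
% The Signal Processing Toolbox function levinson(rss, L) performs the same recursion
% but returns [1, -a1, ..., -aL], i.e., the coefficients of A(z) in (4.1).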

In low-bit-rate coding, the prediction coefficients and the residual must be efficiently quantized. Because the direct-form LP coefficients a_i do not have adequate quantization properties, transformed coefficients are typically quantized. First-generation voice coders (vocoders), such as the LPC10e [FS1015] and the IS-54 VSELP [IS-54], quantize reflection coefficients that are a by-product of the Levinson-Durbin recursion. Transformation of the reflection coefficients can lead to a set of parameters that are also less sensitive to quantization. In particular, the log area ratios and the inverse sine transformation have been used in the early GSM 6.10 algorithm [GSM89] and in the Skyphone standard [Boyd88]. Recent LP-based cellular standards quantize line spectrum pairs (LSPs). The main advantage of the LSPs is that they relate directly to the frequency domain, and hence they can be encoded using perceptual criteria.

4.3.1 Long-Term Prediction

Long-term prediction (LTP), as opposed to short-term prediction, is a process that captures the long-term correlation in the signal. The LTP provides a mechanism for representing the periodicity in the signal, and as such it represents the fine harmonic structure in the short-term spectrum (see Eq. (4.1)). LTP synthesis (4.9) requires estimation of two parameters, i.e., the delay τ and the gain parameter a_τ. For strongly voiced segments of speech, the delay is usually an integer that approximates the pitch period. A transfer function of a simple LTP synthesis filter H_τ(z) is given in (4.9). More complex LTP filters involve multiple parameters and noninteger delays [Kroo90].

H_\tau(z) = \frac{1}{A_L(z)} = \frac{1}{1 - a_\tau z^{-\tau}}    (4.9)

The LTP can be implemented by open-loop or closed-loop analysis. The open-loop LTP parameters are typically obtained by searching the autocorrelation sequence. The gain is simply obtained by a_τ = r_ss(τ)/r_ss(0). In closed-loop LTP search, the signal is synthesized for a range of candidate LTP lags, and the lag that produces the best waveform matching is chosen. Because of the intensive computations in full-search closed-loop LTP, recent algorithms use open-loop LTP to establish an initial LTP lag that is then refined using a closed-loop search around the neighborhood of the initial estimate. In order to further reduce the complexity, LTP searches are often carried out in every other subframe.
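The open-loop lag search described above can be sketched in a few lines of MATLAB; the input frame e (a residual or speech vector), the 20-147 sample lag range (typical for 8-kHz speech), and the variable names are assumptions made for illustration.

% Open-loop LTP parameter estimation from one frame e (column vector, length > 147).
minLag = 20; maxLag = 147;                     % candidate pitch lags at 8 kHz
r = zeros(maxLag, 1);
for tau = minLag:maxLag
    r(tau) = e(tau+1:end).' * e(1:end-tau);    % autocorrelation at lag tau
end
[~, tau_opt] = max(r(minLag:maxLag));
tau_opt = tau_opt + minLag - 1;                % open-loop LTP lag
r0 = e.' * e;
a_tau = r(tau_opt) / r0;                       % LTP gain, a_tau = r_ss(tau)/r_ss(0)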

4.3.2 ADPCM Using Linear Prediction

One of the simplest compression schemes that uses the short-term LP analysis-synthesis is the adaptive differential pulse code modulation (ADPCM) coder [Bene86] [G726]. ADPCM algorithms encode the difference between the current and the predicted speech samples. The block diagram of the ITU-T G.726 32 kb/s ADPCM encoder [Bene86] is shown in Figure 4.5. The algorithm consists of an adaptive quantizer and an adaptive pole-zero predictor. The prediction parameters are obtained by backward estimation, i.e., from quantized data, using a gradient algorithm at the decoder. From Figure 4.5, it can be noted that the decoder is embedded in the encoder. The pole-zero predictor (2 poles and 6 zeros) estimates the input signal and hence reduces the variance of e(n). The quantizer encodes the error e(n) into a sequence of 4-bit words. The ITU-T G.726 also accommodates 16, 24, and 40 kb/s with individually optimized quantizers.

4.4 OPEN-LOOP ANALYSIS-SYNTHESIS LINEAR PREDICTION

In almost all LP-based speech codecs, speech is approximated on short analysis intervals, typically in the neighborhood of 20 ms. As shown in Figure 4.6, a set of LP synthesis parameters is estimated on each analysis frame to capture the shape of the vocal tract envelope and to model the excitation.

Some of the typical synthesis parameters encoded and transmitted in the open-loop LP include the prediction coefficients, the pitch information, the frame energy, and the voicing. At the receiver, the transmitted "source" parameters are used to form the excitation. The excitation e(n) is then used to excite the LP synthesis filter 1/A(z) to reconstruct the speech signal. Some of the standardized open-loop analysis-synthesis LP algorithms include the LPC10e Federal Standard FS-1015 [FS1015] [Trem82] [Camp86] and the Mixed Excitation LP (MELP) [McCr91]. The LPC10e FS-1015 uses a tenth-order predictor to estimate the vocal tract parameters and a two-state voiced or unvoiced excitation model for residual modeling. Mixed excitation schemes in conjunction with LPC were


Figure 4.5. The ADPCM ITU-T G.726 encoder: the prediction error e(n) between the input s(n) and the predicted value ŝ(n) is encoded by the adaptive quantizer Q to form the coder output; the embedded decoder (Q^{-1} followed by the adaptive pole-zero predictor with sections B(z) and A(z)) produces the decoder output, and a step-update module adapts the quantizer.

Figure 4.6. Open-loop analysis-synthesis LP. Analysis (encoder): LP analysis of the input speech with A(z) = 1 - \sum_{i=1}^{L} a_i z^{-i} produces the transmitted parameters (LPC, excitation, pitch period, gain/energy, voicing, etc.). Synthesis (decoder): the residual or excitation parameters form the excitation and gain that drive the vocal tract filter 1/A(z) to produce the reconstructed (synthetic) speech.

proposed by Makhoul et al. [Makh78] and were later revisited by McCree and Barnwell [McCr91] [McCr93].

4.5 ANALYSIS-BY-SYNTHESIS LINEAR PREDICTION

In closed-loop source-system coders (Figure 4.7), the excitation source is determined by closed-loop or analysis-by-synthesis (A-by-S) optimization. The optimization process determines an excitation sequence that minimizes the perceptually weighted mean-square error (MSE) between the input speech and the reconstructed speech [Atal82b] [Sing84] [Schr85]. The closed-loop LP combines the


spectral modeling properties of vocoders with the waveform matching attributes of waveform coders, and hence the A-by-S LP coders are also called hybrid LP coders. The system consists of a short-term LP synthesis filter 1/A(z) and an LTP synthesis filter 1/A_L(z), shown in Figure 4.7. The perceptual weighting filter (PWF) W(z) shapes the error such that quantization noise is masked by the high-energy formants. The PWF is given by

W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} = \frac{1 - \sum_{i=1}^{L} \gamma_1^i a_i z^{-i}}{1 - \sum_{i=1}^{L} \gamma_2^i a_i z^{-i}}, \quad 0 < \gamma_2 < \gamma_1 < 1    (4.10)

where γ1 and γ2 are the adaptive weights and L is the order of the linear predictor. Typically, γ1 ranges from 0.94 to 0.98, and γ2 varies between 0.4 and 0.7, depending upon the tilt or the flatness characteristics associated with the LPC spectral envelope [Sala98] [Bess02]. The role of W(z) is to de-emphasize the error energy in the formant regions [Schr79]. This de-emphasis strategy is based on the fact that quantization noise in the formant regions is partially masked by speech. From Figure 4.7, note that a gain factor g scales the excitation vector x, and the excitation samples are filtered by the long-term and short-term synthesis filters.
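Under the convention of (4.10), the numerator and denominator of W(z) are simply bandwidth-expanded versions of A(z). A short MATLAB sketch is given below; the weights γ1 = 0.94 and γ2 = 0.6, the row vector a_lp of predictor coefficients, and the input frame s are assumptions for illustration.

% Perceptual weighting filter W(z) = A(z/g1)/A(z/g2) from the LP coefficients.
% a_lp = [a1 ... aL] (row) are the predictor coefficients of (4.1), so A(z) = 1 - sum a_i z^-i.
g1 = 0.94; g2 = 0.6;
L   = length(a_lp);
Az  = [1, -a_lp];                 % coefficients of A(z)
num = Az .* (g1 .^ (0:L));        % A(z/g1): scale the tap at delay i by g1^i
den = Az .* (g2 .^ (0:L));        % A(z/g2)
sw  = filter(num, den, s);        % perceptually weighted speech, W(z) applied to s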

The three most common excitation models typically embedded in the excitation generator module (Figure 4.7) in the A-by-S LP schemes include the multi-pulse excitation (MPE) [Atal82b] [Sing84], the regular pulse excitation (RPE) [Kroo86], and the vector or code excited linear prediction (CELP) [Schr85]. A 9.6 kb/s multi-pulse excited linear prediction (MPE-LP) algorithm is used in Skyphone airline applications [Boyd88]. A 13 kb/s coding scheme that uses regular pulse excitation (RPE) [Kroo86] was adopted for the first-generation full-rate ETSI GSM Pan-European digital cellular standard [GSM89].

The aforementioned MPE-LP and RPE schemes achieve high-quality speech at medium rates (13 kb/s). For low-rate, high-quality speech coding, a more efficient representation of the excitation sequence is required. Atal [Atal82a] suggested that high-quality speech at low rates may be produced by using noninstantaneous

Figure 4.7. A typical source-system model employed in analysis-by-synthesis LP: an excitation generator (MPE or RPE) produces the excitation x, which is scaled by the gain g and filtered by the LTP synthesis filter 1/A_L(z) and the LP synthesis filter 1/A(z) to form the synthetic speech ŝ; the error e between the input speech s and ŝ is shaped by the weighting filter W(z) and drives the error minimization.


(delayed decision) coding of Gaussian excitation sequences in conjunction with A-by-S linear prediction and perceptual weighting. In the mid-1980s, Atal and Schroeder [Atal84] [Schr85] proposed a vector or code excited linear prediction (CELP) algorithm for A-by-S linear predictive coding.

We provide further details on CELP in this section because of its recent use in wideband coding standards. The excitation codebook search process in CELP can be explained by considering the A-by-S scheme shown in Figure 4.8. The N × 1 error vector e associated with the k-th excitation vector can be written as

e[k] = s_w - s_w^0 - g_k \hat{s}_w[k]    (4.11)

where s_w is the N × 1 vector that contains the perceptually weighted speech samples, s_w^0 is the vector that contains the output due to the initial filter states, \hat{s}_w[k] is the filtered synthetic speech vector associated with the k-th excitation vector, and g_k is the gain factor. Minimizing ε_k = e^T[k] e[k] with respect to g_k, we obtain

g_k = \frac{\tilde{s}_w^T \hat{s}_w[k]}{\hat{s}_w^T[k] \, \hat{s}_w[k]}    (4.12)

where \tilde{s}_w = s_w - s_w^0 and T represents the transpose operator. From (4.12), ε_k can be written as

\varepsilon_k = \tilde{s}_w^T \tilde{s}_w - \frac{(\tilde{s}_w^T \hat{s}_w[k])^2}{\hat{s}_w^T[k] \, \hat{s}_w[k]}    (4.13)

Figure 4.8. A generic block diagram for A-by-S code-excited linear predictive (CELP) coding: each codebook excitation vector x[k] is scaled by g_k and filtered by the LTP synthesis filter 1/A_L(z) and the LP synthesis filter 1/A(z) to give the synthetic speech ŝ[k]. Note that the perceptual weighting W(z) is applied directly to the input speech s and to the synthetic speech ŝ[k] (giving s_w and ŝ_w[k]) in order to facilitate the CELP analysis that follows. The k-th excitation vector x[k] that minimizes ε_k in (4.13) is selected, and the corresponding gain factor g_k is obtained from (4.12). The codebook index k and the gain g_k associated with the candidate excitation vector x[k] are encoded and transmitted along with the short-term and long-term prediction filter parameters.


The k-th excitation vector x[k] that minimizes (4.13) is selected, and the corresponding gain factor g_k is obtained from (4.12).
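A brute-force MATLAB sketch of this gain computation and codebook search is shown below; the codebook matrix X (one excitation vector per column), the combined synthesis/weighting filter coefficients (bw, aw), and the zero-input response s0w are assumptions standing in for the blocks of Figure 4.8.

% A-by-S codebook search: pick the codevector and gain that minimize eps_k in (4.13).
% sw   : [N x 1] weighted input speech,  s0w : [N x 1] weighted zero-input response,
% X    : [N x K] codebook, one candidate excitation vector per column,
% bw,aw: coefficients of the cascade 1/AL(z) * 1/A(z) * W(z), assuming zero initial states.
t = sw - s0w;                              % target vector, s~w = sw - s0w
best_eps = inf;
for k = 1:size(X, 2)
    swk  = filter(bw, aw, X(:,k));         % filtered synthetic speech s^w[k]
    gk   = (t.' * swk) / (swk.' * swk);    % optimal gain, Eq. (4.12)
    epsk = t.' * t - (t.' * swk)^2 / (swk.' * swk);   % Eq. (4.13)
    if epsk < best_eps
        best_eps = epsk; best_k = k; best_g = gk;     % remember the best codebook entry
    end
end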

One of the disadvantages of the original CELP algorithm is the large computational complexity required for the codebook search [Schr85]. This problem motivated a great deal of work focused upon developing structured codebooks [Davi86] [Klei90a] and fast search procedures [Tran90]. In particular, Davidson and Gersho [Davi86] proposed sparse codebooks, and Kleijn et al. [Klei90a] proposed a fast algorithm for searching stochastic codebooks with overlapping vectors. In addition, Gerson and Jasiuk [Gers90] [Gers91] proposed a vector sum excited linear predictive (VSELP) coder, which is associated with fast codebook search and robustness to channel errors. Other implementation issues associated with CELP include the quantization of the CELP parameters, the effects of channel errors on CELP coders, and the operation of the algorithm on finite-precision and fixed-point machines. A study on the effects of parameter quantization on the performance of CELP was presented in [Kroo90], and the issues associated with the channel coding of the CELP parameters were discussed by Kleijn [Klei90b]. Some of the problems associated with the fixed-point implementation of CELP algorithms were presented in [Span92].

4.5.1 Code-Excited Linear Prediction Algorithms

In this section, we taxonomize CELP algorithms into three categories that are consistent with the chronology of their development, i.e., first-generation CELP (1986–1992), second-generation CELP (1993–1998), and third-generation CELP (1999–present).

4.5.1.1 First-Generation CELP Coders. The first-generation CELP algorithms operate at bit rates between 5.8 kb/s and 16 kb/s. These are generally high-complexity and non-toll-quality algorithms. Some of the first-generation CELP algorithms include the FS-1016 CELP, the IS-54 VSELP, the ITU-T G.728 low-delay CELP, and the IS-96 Qualcomm CELP. The FS-1016 4.8 kb/s CELP standard [Camp90] [FS1016] was jointly developed by the Department of Defense (DoD) and the Bell Labs for possible use in the third-generation secure telephone unit (STU-III). The IS-54 VSELP algorithm [IS-54] [Gers90] and its variants are embedded in three digital cellular standards, i.e., the 8 kb/s TIA IS-54 [IS-54], the 6.3 kb/s Japanese standard [GSM96a], and the 5.6 kb/s half-rate GSM [GSM96b]. The VSELP algorithm uses highly structured codebooks that are tailored for reduced computational complexity and increased robustness to channel errors. The ITU-T G.728 low-delay (LD) CELP coder [G728] [Chen92] achieves low one-way delay by using very short frames, a backward-adaptive predictor, and short excitation vectors (five samples). The IS-96 Qualcomm CELP [IS-96] is a variable bit rate algorithm and is part of the original code division multiple access (CDMA) standard for cellular communications.

4.5.1.2 Second-Generation Near-Toll-Quality CELP Coders. The second-generation CELP algorithms are targeted for TDMA and CDMA cellphones, Internet audio streaming, voice-over-Internet-protocol (VoIP), and


secure communications. Second-generation CELP algorithms include the ITU-T G.723.1 dual-rate speech codec [G7231], the GSM enhanced full rate (EFR) [GSM96a] [IS-641], the IS-127 Relaxed CELP (RCELP) [IS-127] [Klei92], and the ITU-T G.729 CS-ACELP [G729] [Sala98].

The coding gain improvements in second-generation CELP coders can be attributed partly to the use of algebraic codebooks in excitation coding [Adou87] [Lee90] [Sala98] [G729]. The term algebraic CELP refers to the structure of the excitation codebooks. Various algebraic codebook structures have been proposed [Adou87] [Lafl90], but the most popular is the interleaved pulse permutation code. In this codebook, the code vector consists of a set of interleaved permutation codes containing only a few non-zero elements. This is given by

p_i = i + jd, \quad j = 0, 1, \ldots, 2^M - 1    (4.14)

where p_i is the pulse position, i is the pulse number, and d is the interleaving depth. The integer M represents the number of bits describing the pulse positions. Table 4.1 shows an example ACELP codebook structure, where the interleaving depth is d = 5, the number of pulses or tracks is equal to 5, and the number of bits to represent the pulse positions is M = 3. From (4.14), p_i = i + 5j, where i = 0, 1, 2, 3, 4 and j = 0, 1, 2, ..., 7.

For a given value of i, the set defined by (4.14) is known as a 'track', and the value of j defines the pulse position. From the codebook structure shown in Table 4.1, the codevector x(n) is given by

x(n) = \sum_{i=0}^{4} \alpha_i \delta(n - p_i), \quad n = 0, 1, \ldots, 39    (4.15)

where δ(n) is the unit impulse, α_i are the pulse amplitudes (±1), and p_i are the pulse positions. In particular, the codebook vector x(n) is computed by placing the five unit pulses at the determined locations p_i, multiplied by their signs (±1). The pulse position indices and the signs are coded and transmitted. Note that the algebraic codebooks do not require any storage.
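The construction in (4.14)-(4.15) and Table 4.1 amounts to placing one signed unit pulse per track; a MATLAB sketch follows, in which the pulse-position indices j and the signs alpha are arbitrary example values, not standardized data.

% Build a 40-sample ACELP codevector from Table 4.1: p_i = i + 5*j, j = 0..7 (M = 3 bits).
j     = [3 0 7 2 5];           % example pulse-position indices, one per track i = 0..4
alpha = [1 -1 1 1 -1];         % example pulse signs (+/-1)
x = zeros(40, 1);
for i = 0:4
    p = i + 5*j(i+1);          % pulse position for track i, Eq. (4.14)
    x(p+1) = alpha(i+1);       % place the signed unit pulse (MATLAB indices start at 1)
end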

4.5.1.3 Third-Generation CELP for 3G Cellular Standards. The third-generation (3G) CELP algorithms are multimodal and accommodate several

Table 4.1. An example algebraic codebook structure: tracks and pulse positions.

Track (i)    Pulse positions (p_i)
0            P0: 0, 5, 10, 15, 20, 25, 30, 35
1            P1: 1, 6, 11, 16, 21, 26, 31, 36
2            P2: 2, 7, 12, 17, 22, 27, 32, 37
3            P3: 3, 8, 13, 18, 23, 28, 33, 38
4            P4: 4, 9, 14, 19, 24, 29, 34, 39


different bit rates. This is consistent with the vision on wideband wireless standards [Knis98] that operate in different modes, including low-mobility, high-mobility, indoor, etc. There are at least two algorithms that have been developed and standardized for these applications. In Europe, GSM standardized the adaptive multi-rate coder [ETSI98] [Ekud99], and in the United States, the TIA has tested the selectable mode vocoder (SMV) [Gao01a] [Gao01b] [IS-893]. In particular, the adaptive multirate coder [ETSI98] [Ekud99] has been adopted by ETSI for use in the GSM network. This is an algebraic CELP algorithm that operates at multiple rates: 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, and 4.75 kb/s. The bit rate is adjusted according to the traffic conditions. The SMV algorithm (IS-893) was developed to provide higher quality, flexibility, and capacity over the existing IS-96 QCELP and IS-127 enhanced variable rate coding (EVRC) CDMA algorithms. The SMV is based on 4 codecs: full-rate at 8.5 kb/s, half-rate at 4 kb/s, quarter-rate at 2 kb/s, and eighth-rate at 0.8 kb/s. The rate and mode selections in SMV are based on the frame voicing characteristics and the network conditions. Efforts to establish wideband cellular standards continue to drive further the research and development towards algorithms that work at multiple rates and deliver enhanced speech quality.

4.6 LINEAR PREDICTION IN WIDEBAND CODING

Until now, we discussed the use of LP in narrowband coding, with signal bandwidth limited to 150–3400 Hz. Signal bandwidth in wideband speech coding spans 50 Hz to 7 kHz, which substantially improves the quality of signal reconstruction, intelligibility, and naturalness. In particular, the introduction of the low-frequency components improves the naturalness, while the higher frequency extension provides more adequate speech intelligibility. In the case of high-fidelity audio, it is typical to consider sampling rates of 44.1 kHz, and the signal bandwidth can range from 20 Hz to 20 kHz. Some of the recent super high-fidelity audio storage formats (Chapter 11), such as the DVD-audio and the super audio CD (SACD), consider signal bandwidths up to 100 kHz.

4.6.1 Wideband Speech Coding

Over the last few years, several wideband speech coding algorithms have been proposed [Orde91] [Jaya92] [Lafl93] [Adou95]. Some of the coding principles associated with these algorithms have been successfully integrated into several speech coding standards, for example, the ITU-T G.722 subband ADPCM standard and the ITU-T G.722.2 AMR-WB codec.

4.6.1.1 The ITU-T G.722 Codec. The ITU-T G.722 standard (Figure 4.9) uses a combination of both subband and ADPCM (SB-ADPCM) techniques [G722] [Merm88] [Span94] [Pain00]. The input signal is sampled at 16 kHz and decomposed into two subbands of equal bandwidth using quadrature mirror filter (QMF) banks. The subband filters h_low(n) and h_high(n) should satisfy

h_{high}(n) = (-1)^n h_{low}(n) \quad \text{and} \quad |H_{low}(z)|^2 + |H_{high}(z)|^2 = 1    (4.16)


The low-frequency subband is typically quantized at 48 kb/s, while the high-frequency subband is coded at 16 kb/s. The G.722 coder includes an adaptive bit allocation scheme and an auxiliary data channel. Moreover, provisions for quantizing the low-frequency subband at 40 or at 32 kb/s are available. In particular, the G.722 algorithm is multimodal and can operate in three different modes, i.e., 48, 56, and 64 kb/s, by varying the bits used to represent the lower band signal. The MOS at 64 kb/s is greater than four for speech and slightly less than four for music signals [Jaya90], and the analysis-synthesis QMF banks introduce a delay of less than 3 ms. Details on the real-time implementation of this coder are given in [Taka88].
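A minimal two-band QMF analysis stage obeying the mirror relation in (4.16) can be sketched as follows; the 24-tap lowpass prototype obtained with fir1 is an illustrative stand-in and is not the actual set of G.722 filter coefficients.

% Two-band QMF analysis: split a 16-kHz input s into two 4-kHz-wide subbands.
hlow  = fir1(23, 0.5);                 % prototype lowpass filter (illustrative taps)
hhigh = hlow .* ((-1).^(0:23));        % mirror highpass filter, Eq. (4.16)
xlow  = filter(hlow,  1, s);
xhigh = filter(hhigh, 1, s);
slow  = xlow(1:2:end);                 % decimate by 2: 8-kHz subband signals
shigh = xhigh(1:2:end);
% In G.722, slow would feed the 48 kb/s ADPCM encoder and shigh the 16 kb/s encoder.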

4.6.1.2 The ITU-T G.722.2 AMR-WB Codec. The ITU-T G.722.2 [G7722] [Bess02] is an adaptive multi-rate wideband (AMR-WB) codec that operates at bit rates ranging from 6.6 to 23.85 kb/s. The G.722.2 AMR-WB standard is primarily targeted for voice-over IP (VoIP), 3G wireless communications, ISDN wideband telephony, and audio/video teleconferencing. It is important to note that the AMR-WB codec has also been adopted by the third-generation partnership project (3GPP) for GSM and the 3G WCDMA systems for wideband mobile communications [Bess02]. This, in fact, brought to the fore all the interoperability-related advantages for wideband voice applications across wireline and wireless communications. The ITU-T G.722.2 AMR-WB codec is based on the ACELP coder and operates on audio frames of 20 ms sampled at 16 kHz. The codec supports the following nine bit rates: 23.85, 23.05, 19.85, 18.25, 15.85, 14.25, 12.65, 8.85, and 6.6 kb/s. Excepting the two lowest modes, i.e., the 8.85 kb/s and the

Figure 4.9. The ITU-T G.722 standard for ISDN teleconferencing: wideband coding at 64 kb/s based on a two-band QMF analysis/synthesis bank and ADPCM. (a) Encoder: the input s(n) is split by the QMF analysis bank and coded by ADPCM encoders 1 and 2; the MUX and data-insertion stage forms the 64 kb/s output, with an auxiliary data channel at 16 kb/s. (b) Decoder: the de-MUX and data-extraction stage feeds ADPCM decoders 1 and 2 and the QMF synthesis bank to reconstruct the output speech. Note that the low-frequency band is encoded at 32 kb/s in order to allow for an auxiliary data channel at 16 kb/s.


6.6 kb/s modes, that are intended for transmission over noisy time-varying channels, the other encoding modes, i.e., 23.85 through 12.65 kb/s, offer high-quality signal reconstruction. The G.722.2 AMR-WB embeds several innovative techniques [Bess02], such as (i) a modified perceptual weighting filter that decouples the formant weighting from the spectrum tilt, (ii) an enhanced closed-loop pitch search to better accommodate the variations in the voicing level, and (iii) efficient codebook structures for fast searches. The codec also includes a voice activity detection (VAD) scheme that activates a comfort noise generator module (1–2 kb/s) in case of discontinuous transmission.

4.6.2 Wideband Audio Coding

Motivated by the need to reduce the computational complexity associated with the CELP-based excitation source coding, researchers have proposed several hybrid (LP + subband/transform) coders [Lefe94] [Ramp98] [Rong99]. In this section, we consider LP-based wideband coding methods that encode the prediction residual based upon transform, subband, or sinusoidal coding techniques.

4.6.2.1 Multipulse Excitation Model. Singhal at Bell Labs [Sing90] reported that analysis-by-synthesis multipulse excitation with sufficient pulse density can be applied to correct for LP envelope errors introduced by bandwidth expansion and quantization (Figure 4.10). This algorithm uses a 24th-order LPC synthesis filter, while optimizing pulse positions and amplitudes to minimize perceptually weighted reconstruction errors. Singhal determined that densities of approximately 1 pulse per 4 output samples of each excitation subframe are required to achieve near-transparent quality.

Spectral coefficients are transformed to inverse sine reflection coefficients, then differentially encoded and quantized using PDF-optimized Max quantizers. Entropy (Huffman) codes are also used. Pulse locations are differentially encoded relative to the location of the first pulse. Pulse amplitudes are fractionally encoded relative to the largest pulse and then quantized using a Max quantizer. The proposed MPLPC audio coder achieved output SNRs of 35–40 dB at a bit rate of 128 kb/s. Other MPLPC audio coders have also been proposed [Lin91],

Figure 4.10. Multipulse excitation model used in [Sing90]: an excitation generator produces u(n), which drives the LP synthesis filter to give ŝ(n); the difference between s(n) and ŝ(n) is passed through an error weighting filter.


including a scheme based on MPLPC in conjunction with the discrete wavelet transform [Bola95].

4.6.2.2 Discrete Wavelet Excitation Coding. While most of the successful speech codecs nowadays use some form of closed-loop time-domain analysis-by-synthesis such as MPLPC, high-performance LP-based perceptual audio coding has been realized with alternative frequency-domain excitation models. For instance, Boland and Deriche reported output quality comparable to MPEG-1 Layer II at 128 kb/s for an LPC audio coder operating at 96 kb/s [Bola98], in which the prediction residual was transform coded using a three-level discrete-wavelet-transform (DWT) (see also Section 8.2) based on a four-band uniform filter bank. At each level of the DWT, the lowest subband of the previous level was decomposed into four uniform bands. This 10-band nonuniform structure was intended to mimic critical bandwidths to a certain extent. A perceptual bit allocation according to MPEG-1 psychoacoustic model-2 was applied to the transform coefficients.

4.6.2.3 Sinusoidal Excitation Coding. Excitation sequences modeled as a sum of sinusoids were investigated in [Chan96]. This form of excitation is based on the tendency of the prediction residuals to be spectrally impulsive rather than flat for high-fidelity audio. In coding experiments using 32-kHz-sampled input audio, subjective and objective quality improvements relative to the MPLPC coders were reported for the sinusoidal excitation schemes, with high-quality output audio reported at 72 kb/s. In the experiments reported in [Chan97], a set of ten LP coefficients is estimated on 9.4 ms analysis frames and split-vector quantized using 24 bits. Then, the prediction residual is analyzed, and sinusoidal parameters are estimated for the seven best out of a candidate set of thirteen sinusoids for each of six subframes. The masked threshold is estimated and used to form a time-varying bit allocation for the amplitudes, frequencies, and phases on each subframe. Given a frame allocation of 675 bits, a total of 573, 78, and 24 bits, respectively, are allocated to the sinusoidal parameters, the bit allocation side information, and the LP coefficients. Sinusoidal excitation coding, when used in conjunction with a masking-threshold adapted weighting filter, resulted in improved quality relative to MPEG-1 Layer I at a bit rate of 96 kb/s [Chan96] for selected test material.

4.6.2.4 Frequency-Warped LP. Beyond the performance improvements realized through the use of different excitation models, there has been interest in warping the frequency axis before LP analysis to effectively provide better resolution at certain frequencies. In the context of perceptual coding, it is naturally of interest to achieve a Bark-scale warping. Frequency axis warping to achieve nonuniform FFT resolution was first introduced by Oppenheim, Johnson, and Steiglitz [Oppe71] [Oppe72], using a network of cascaded first-order all-pass sections for frequency warping of the signal, followed by a standard FFT. The idea was later extended to warped linear prediction (WLP) by Strube [Stru80] and was ultimately applied to an ADPCM codec [Krug88]. Cascaded first-order all-pass sections were used to warp the signal, and then the LP autocorrelation


analysis was performed on the warped autocorrelation sequence. In this scenario, a single-parameter warping of the frequency axis can be introduced into the LP analysis by replacing the delay elements in the FIR analysis filter (Figure 4.4) with all-pass sections. This is done by replacing the complex variable z^{-1} of the FIR system function with another filter, H(z), of the form

H(z) = \frac{z^{-1} - \lambda}{1 - \lambda z^{-1}}    (4.17)

Thus, the predicted sample value is not produced from a combination of past samples, as in Eq. (4.4), but rather from the samples of a warped signal. In fact, it has been shown [Smit95] [Smit99] that selecting the value of 0.723 for the parameter λ leads to a frequency warp that approximates well the Bark frequency scale. A WLP-based audio codec [Harm96] was recently proposed. The inherent Bark frequency resolution of the WLP residual yields a perceptually shaped quantization noise without the use of an explicit perceptual model or time-varying bit allocation. In this system, a 40th-order WLP synthesis filter is combined with differential encoding of the prediction residual. A fixed rate of 2 bits per sample (88.2 kb/s) is allocated to the residual sequence, and 5 bits per coefficient are allocated to the prediction coefficients on an analysis frame of 800 samples, or 18 ms. This translates to a bit rate of 99.2 kb/s per channel. In objective terms, an auditory error measure showed considerable improvement for the WLP coding error in comparison to a conventional LP coding error when the same number of bits was allocated to the prediction residuals. Subjectively, the algorithm was reported to achieve transparent quality for some material, but it also had difficulty with transients at the frame boundaries.
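To make the warping in (4.17) concrete, the following MATLAB sketch computes a warped autocorrelation sequence by repeatedly passing the analysis frame through the all-pass section and then runs a Levinson recursion on it. It follows the general WLP structure described above, with λ computed from the expression quoted later in Problem 4.7, but the code is only an illustrative outline and not the codec of [Harm96].

% Warped LP analysis of order L for one analysis frame x (vector), Fs = 44100 Hz.
Fs  = 44100;
lam = 1.0674*sqrt((2/pi)*atan(0.06583*Fs/1000)) - 0.1916;   % Bark-warping coefficient [Smit99]
L   = 40;
y   = x(:);
rw  = zeros(L+1, 1);
for m = 0:L
    rw(m+1) = x(:).' * y;                    % warped autocorrelation: inner product of x with D^m x
    y = filter([-lam 1], [1 -lam], y);       % one all-pass section D(z) = (z^-1 - lam)/(1 - lam z^-1)
end
aw = levinson(rw, L);                        % coefficients of the warped prediction error filter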

The algorithm was later extended to handle stereophonic signals [Harm97a] by forming a complex-valued representation of the two channels and then using a version of WLP modified for complex signals (CWLP). It was suggested that significant quality improvement could be realized for the WLPC audio coder by using a closed-loop analysis-by-synthesis procedure [Harm97b]. One of the shortcomings of the original WLP coder was inadequate attention to temporal effects. As a result, further experiments were reported [Harm98] in which WLP was combined with temporal noise shaping (TNS) to realize additional quality improvement.

4.7 SUMMARY

In this chapter, we presented the LP-based source-system model and described its applications in narrowband and wideband coding. Some of the topics presented in this chapter include:

• Short-term linear prediction
• Conventional LP analysis-synthesis


• Closed-loop analysis-by-synthesis hybrid coders
• Code-excited linear prediction (CELP) speech standards
• Linear prediction in wideband coding
• Frequency-warped LP

PROBLEMS

4.1 The following autocorrelation sequence is given: r_{ss}(m) = \frac{0.8^{|m|}}{0.36}. Describe a source-system mechanism that will generate a signal with this autocorrelation.

4.2 Sketch the magnitude frequency response of the following filter function:

H(z) = \frac{1}{1 - 0.9 z^{-10}}

4.3 The autocorrelation sequence in Figure 4.11 corresponds to a strongly voiced speech segment. Show with an arrow which autocorrelation sample relates to the pitch period of the voiced signal. Estimate the pitch period from the graph.

4.4 Consider Figure 4.12 with a white Gaussian input signal.
a. Determine analytically the LP coefficients for a first-order predictor and for H(z) = 1/(1 − 0.8z^{-1}).
b. Determine analytically the LP coefficients for a second-order predictor and for H(z) = 1 + z^{-1} + z^{-2}.

Figure 4.11. Autocorrelation r_ss of a voiced speech segment, plotted against the lag index m (0 to 120).

Figure 4.12. LP coefficient estimation: a white Gaussian input (zero mean, unit variance) is filtered by H(z), and linear prediction of the filter output yields the residual e(n).


COMPUTER EXERCISES

The necessary MATLAB software for computer simulations and the speech/audio files (Ch4Sp8.wav, Ch4Au8.wav, and Ch4Au16.wav) can be obtained from the Book website.

4.5 Linear predictive coding (LPC)

a. Write a MATLAB program to load, display, and play back speech files. Use Ch4Sp8.wav for this computer exercise.

b. Include a framing module in your program and set the frame size to 256 samples. Every frame should be read in a 256 × 1 real vector called Stime. Compute the fast Fourier transform (FFT) of this vector, i.e., Sfreq = fft(Stime). Next, compute the magnitude of the complex vector Sfreq and plot its magnitude in dB up to the fold-over frequency. This computation should be part of your frame-by-frame speech processing program.

Deliverable 1: Present at least one plot of time-domain data and one corresponding plot of frequency-domain data for a voiced, an unvoiced, and a mixed speech segment. (A total of six plots; use the subplot command.)

c. Pitch period and voicing estimation: The period of a strongly voiced speech signal is associated in a reciprocal manner with the fundamental frequency of the corresponding harmonic spectrum. That is, if the pitch period is T, the fundamental frequency is 1/T. Note that T can be measured in terms of the number of samples within a pitch period for voiced speech. If T is to be measured in ms, then multiply the number of samples by 1/Fs, where Fs is the sampling frequency of the input speech.

Deliverable 2: Create and fill Table 4.2 for the first 30 speech frames by visual inspection as follows: when the segment is voiced, enter 1 in the 2nd column. If speech pause (i.e., no speech present), enter 0; if unvoiced, enter 0.25; and if mixed, enter 0.5. Measure the pitch period visually from the time-domain plot in terms of the number of samples in a pitch period. If the segment is unvoiced or pause, enter infinity for the pitch period and hence zero for the fundamental frequency. If the segment is mixed, do your best to obtain an estimate of the pitch; if it is not possible, set the pitch to infinity.
Deliverable 3: From Table 4.2, plot the fundamental frequency as a function of the frame number for all thirty frames. This is called the pitch frequency contour.


Table 4.2. Pitch period, voicing, and frame energy measurements.

Speech frame   Voiced/unvoiced/   Pitch (number   Frame     Fundamental
number         mixed/pause        of samples)     energy    frequency (Hz)
1
2
...
30

Deliverable 4: From Table 4.2, plot also (i) the voicing and (ii) the frame energy (in dB) as a function of the frame number for all thirty frames.

d. The FFT and LP spectra: Write a MATLAB program to implement the Levinson-Durbin recursion. Assume a tenth-order LP analysis and estimate the LP coefficients (lp_coeff) for each speech frame.

Deliverable 5: Compute the LP spectra as follows: H_allpole = freqz(1, lp_coeff). Superimpose the LP spectra (H_allpole) with the FFT speech spectra (Sfreq) for a voiced segment and an unvoiced segment. Plot the spectral magnitudes in dB up to the fold-over frequency. Note that the LPC spectra look like a smoothed version of the FFT spectra. Divide Sfreq by H_allpole. Plot the magnitude of the result in dB up to the fold-over frequency. What does the resulting spectrum represent?
Deliverable 6: From the LP spectra, measure (visually) the frequencies of the first three formants, F1, F2, and F3. Give these frequencies in Hz (Table 4.3). Plot the three formants across the frame number. These will be the formant contours. Use different line types or colors to discriminate the three contours.

e. LP analysis-synthesis: Using the prediction coefficients (lp_coeff) from part (d), perform LP analysis. Use the mathematical formulation given in Sections 4.2 and 4.3. Quantize both the LP coefficients and the prediction

Table 4.3. Formants F1, F2, and F3.

Speech frame number   F1 (Hz)   F2 (Hz)   F3 (Hz)
1
2
...
30


residual using a 3-bit (i.e., 8-level) scalar quantizer. Next, perform LP synthesis and reconstruct the speech signal.

Deliverable 7: Plot the quantized residual and its corresponding dB spectrum for a voiced and an unvoiced frame. Provide plots of the original and reconstructed speech. Compute the SNR in dB for the entire reconstructed signal relative to the original record. Listen to the reconstructed signal and provide a subjective score on a scale of 1 to 5. Repeat this step when an 8-bit scalar quantizer is employed. In your simulation, when the LP coefficients are quantized using a 3-bit scalar quantizer, the LP synthesis filter will become unstable for certain frames. What are the consequences of this?

4.6 The FS-1016 CELP standard
a. Obtain the MATLAB software for the FS-1016 CELP from the Book website. Use the following wave files: Ch4Sp8.wav and Ch4Au8.wav.

Deliverable 1: Give the plots of the entire input and the FS-1016 synthesized output for the two wave files. Comment on the quality of the two synthesized wave files. In particular, give more emphasis to Ch4Au8.wav, and give specific reasons why the FS-1016 does not synthesize Ch4Au8.wav with high quality. Listen to the output files and provide a subjective evaluation. The CELP FS-1016 coder scored 3.2 out of 5 in government tests on a MOS scale. How would you rate its performance in terms of MOS for Ch4Sp8.wav and Ch4Au8.wav? Also give segmental SNR values for a voiced/unvoiced/mixed frame and the overall SNR for the entire record. Present your results as Table 4.4 with an appropriate caption.

b. Spectrum analysis: File CELPANAL.M, from lines 134 to 159. Valuable comments describing the variable names, globals, and inputs/outputs are provided at the beginning of the MATLAB file CELPANAL.M to further assist you with understanding the MATLAB

Table 4.4. FS-1016 CELP subjective and objective evaluation.

                Speech frame number    Segmental SNR for the chosen frame (dB)
Ch4Sp8.wav      Voiced
                Unvoiced
                Mixed
                Over-all SNR for the entire speech record = ___ (dB); MOS on a scale of 1-5 = ___
Ch4Au8.wav      Over-all SNR for the entire music record = ___ (dB); MOS on a scale of 1-5 = ___


program. Specifically, some of the useful variables in the MATLAB code include: snew (input speech buffer), fcn (LP filter coefficients of 1/A(z)), rcn (reflection coefficients), newfreq (LSP frequencies), unqfreq (unquantized LSPs), newfreq (quantized LSPs), and lsp (interpolated LSPs for each subframe).

Deliverable 2: Choose Ch4Sp8.wav and use the voiced/unvoiced/mixed frames selected in the previous step. Indicate the frame numbers. The FS-1016 employs 30-ms speech frames, so the frame size is fixed at 240 samples. Give time-domain plots (variable in the code: 'snew') and frequency-domain plots (use FFT size 512; include commands as necessary in the program to obtain the FFT) of the selected voiced/unvoiced/mixed frame. Also plot the LPC spectrum using

figure; freqz(1, fcn)

Study the interlacing property of the LSPs on the unit circle. Note that in the FS-1016 standard, the LSPs are encoded using scalar quantization. Plot the LPC spectra obtained from the unquantized LSPs and the quantized LSPs. You have to convert LSPs to LPCs in both cases (unquantized and quantized) and use the freqz command to plot the LPC spectra. Give a z-domain plot (of a voiced and an unvoiced frame) containing the pole locations (shown as crosses, 'x') of the LPC spectra and the roots of the symmetric (shown as black circles, 'o') and antisymmetric (shown as red circles, 'o') LSP polynomials. Note the interlacing nature of the black and red circles; they always lie on the unit circle. Also note that if a pole is close to the unit circle, the corresponding LSPs will be close to each other. In the file CELPANAL.M, line 124, high-pass filtering is performed to eliminate the undesired low frequencies. Experiment with and without the high-pass filter to note the presence of humming and low-frequency noise in the synthesized speech.

c. Pitch analysis: File CELPANAL.M, from lines 162 to 191.

Deliverable 3: What are the key advantages of employing subframes in speech coding (e.g., interpolation, pitch prediction)? Explain, in general, the differences between long-term prediction and short-term prediction. Give the necessary transfer functions. In particular, describe what aspects of speech each of the two predictors captures.
Deliverable 4: Insert in file CELPANAL.M, after line 182:

tauptr = 75

Perform an evaluation of the perceptual quality of the synthesized speech and give your remarks. How does the speech quality change when the pitch is forced to a predetermined value? (Choose different tauptr values: 40, 75, and 110.)


4.7 The warped LP for audio analysis-synthesis
a. Write a MATLAB program to perform analysis-synthesis using (i) the conventional LP and (ii) the warped LP. Use Ch4Au8.wav as the input wave file.

b. Perform a tenth-order LP analysis and use a 5-bit scalar quantizer to quantize the LP and the WLP coefficients, and a 3-bit scalar quantizer for the excitation vector. Perform audio synthesis using the quantized LP and WLP analysis parameters. Compute the warping coefficient [Smit99] [Harm01] using

\lambda = 1.0674 \left( \frac{2}{\pi} \arctan\left( 0.06583 \, \frac{F_s}{1000} \right) \right)^{1/2} - 0.1916

where Fs is the sampling frequency of the input audio. Comment on the quality of the synthesized audio from the LP and WLP analysis-synthesis. Repeat this step for Ch4Au16.wav. Refer to [Harm00] for the implementation of WLP synthesis filters.

CHAPTER 5

PSYCHOACOUSTIC PRINCIPLES

5.1 INTRODUCTION

The field of psychoacoustics [Flet40] [Gree61] [Zwis65] [Scha70] [Hell72] [Zwic90] [Zwic91] has made significant progress toward characterizing human auditory perception and particularly the time-frequency analysis capabilities of the inner ear. Although applying perceptual rules to signal coding is not a new idea [Schr79], most current audio coders achieve compression by exploiting the fact that "irrelevant" signal information is not detectable by even a well-trained or sensitive listener. Irrelevant information is identified during signal analysis by incorporating into the coder several psychoacoustic principles, including absolute hearing thresholds, critical band frequency analysis, simultaneous masking, the spread of masking along the basilar membrane, and temporal masking. Combining these psychoacoustic notions with basic properties of signal quantization has also led to the theory of perceptual entropy [John88b], a quantitative estimate of the fundamental limit of transparent audio signal compression.

This chapter reviews psychoacoustic fundamentals and perceptual entropy, and then gives as an application example some details of the ISO/MPEG psychoacoustic model 1. Before proceeding, however, it is necessary to define the sound pressure level (SPL), a standard metric that quantifies the intensity of an acoustical stimulus [Zwic90]. Nearly all of the auditory psychophysical phenomena addressed in this book are treated in terms of SPL. The SPL gives the level (intensity) of sound pressure in decibels (dB) relative to an internationally defined reference level, i.e., L_SPL = 20 log10(p/p0) dB, where L_SPL is the SPL of a stimulus, p is the sound pressure of the stimulus in Pascals (Pa, equivalent to Newtons



per square meter (N/m^2)), and p0 is the standard reference level of 20 μPa, or 2 × 10^{-5} N/m^2 [Moor77]. About 150 dB SPL spans the dynamic range of intensity for the human auditory system, from the limits of detection for low-intensity (quiet) stimuli up to the threshold of pain for high-intensity (loud) stimuli. The SPL reference level is calibrated such that the frequency-dependent absolute threshold of hearing in quiet (Section 5.2) tends to measure in the vicinity of 0 dB SPL. On the other hand, a stimulus level of 140 dB SPL is typically at or above the threshold of pain.

5.2 ABSOLUTE THRESHOLD OF HEARING

The absolute threshold of hearing characterizes the amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment. The absolute threshold is typically expressed in terms of dB SPL. The frequency dependence of this threshold was quantified as early as 1940, when Fletcher [Flet40] reported test results for a range of listeners that were generated in a National Institutes of Health (NIH) study of typical American hearing acuity. The quiet threshold is well approximated [Terh79] by the nonlinear function

T_q(f) = 3.64 (f/1000)^{-0.8} - 6.5 e^{-0.6 (f/1000 - 3.3)^2} + 10^{-3} (f/1000)^4 \;\; \text{(dB SPL)}    (5.1)

which is representative of a young listener with acute hearing. When applied to signal compression, T_q(f) could be interpreted naively as a maximum allowable energy level for coding distortions introduced in the frequency domain (Figure 5.1). (A short MATLAB evaluation of (5.1) follows Figure 5.1.)

Figure 5.1. The absolute threshold of hearing in quiet: sound pressure level, SPL (dB), versus frequency (Hz), plotted on a logarithmic frequency axis from about 100 Hz to 10 kHz.
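A direct MATLAB evaluation of (5.1), which reproduces a curve like Figure 5.1, is shown below; the frequency grid is an arbitrary illustrative choice.

% Absolute threshold of hearing in quiet, Eq. (5.1).
f  = logspace(log10(50), log10(16000), 400);     % frequency axis in Hz
Tq = 3.64*(f/1000).^(-0.8) - 6.5*exp(-0.6*(f/1000 - 3.3).^2) + 1e-3*(f/1000).^4;
semilogx(f, Tq); xlabel('Frequency (Hz)'); ylabel('SPL (dB)');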


At least two caveats must govern this practice, however. First, whereas the thresholds captured in Figure 5.1 are associated with pure tone stimuli, the quantization noise in perceptual coders tends to be spectrally complex rather than tonal. Secondly, it is important to realize that algorithm designers have no a priori knowledge regarding actual playback levels (SPL), and therefore the curve is often referenced to the coding system by equating the lowest point (i.e., near 4 kHz) to the energy in ±1 bit of signal amplitude. In other words, it is assumed that the playback level (volume control) on a typical decoder will be set such that the smallest possible output signal will be presented close to 0 dB SPL. This assumption is conservative for quiet to moderate listening levels in uncontrolled open-air listening environments, and therefore this referencing practice is commonly found in algorithms that utilize the absolute threshold of hearing. We note that the absolute hearing threshold is related to a commonly encountered acoustical metric other than SPL, namely, dB sensation level (dB SL). Sensation level (SL) denotes the intensity level difference for a stimulus relative to a listener's individual unmasked detection threshold for the stimulus [Moor77]. Hence, "equal SL" signal components may have markedly different absolute SPLs, but all equal SL components will have equal supra-threshold margins. The motivation for the use of SL measurements is that SL quantifies listener-specific audibility rather than an absolute level. Whether the target metric is SPL or SL, perceptual coders must eventually reference the internal PCM data to a physical scale. A detailed example of this referencing for SPL is given in Section 5.7 of this chapter.

5.3 CRITICAL BANDS

Using the absolute threshold of hearing to shape the coding distortion spectrumrepresents the first step towards perceptual coding Considered on its own how-ever the absolute threshold is of limited value in coding The detection thresholdfor spectrally complex quantization noise is a modified version of the abso-lute threshold with its shape determined by the stimuli present at any giventime Since stimuli are in general time-varying the detection threshold is alsoa time-varying function of the input signal In order to estimate this thresholdwe must first understand how the ear performs spectral analysis A frequency-to-place transformation takes place in the cochlea (inner ear) along the basilarmembrane [Zwic90]

The transformation works as follows A sound wave generated by an acousticstimulus moves the eardrum and the attached ossicular bones which in turn trans-fer the mechanical vibrations to the cochlea a spiral-shaped fluid-filled structurethat contains the coiled basilar membrane Once excited by mechanical vibrationsat its oval window (the input) the cochlear structure induces traveling wavesalong the length of the basilar membrane Neural receptors are connected alongthe length of the basilar membrane The traveling waves generate peak responsesat frequency-specific membrane positions and therefore different neural receptors

116 PSYCHOACOUSTIC PRINCIPLES

are effectively "tuned" to different frequency bands according to their locations. For sinusoidal stimuli, the traveling wave on the basilar membrane propagates from the oval window until it nears the region with a resonant frequency near that of the stimulus frequency. The wave then slows, and the magnitude increases to a peak. The wave decays rapidly beyond the peak. The location of the peak is referred to as the "best place" or "characteristic place" for the stimulus frequency, and the frequency that best excites a particular place [Beke60] [Gree90] is called the "best frequency" or "characteristic frequency." Thus, a frequency-to-place transformation occurs. An example is given in Figure 5.2 for a three-tone stimulus. The interested reader can also find online a number of high-quality animations demonstrating this aspect of cochlear mechanics [Twve99].

As a result of the frequency-to-place transformation the cochlea can be viewedfrom a signal processing perspective as a bank of highly overlapping bandpassfilters The magnitude responses are asymmetric and nonlinear (level-dependent)Moreover the cochlear filter passbands are of nonuniform bandwidth and thebandwidths increase with increasing frequency The ldquocritical bandwidthrdquo is afunction of frequency that quantifies the cochlear filter passbands Empirical workby several observers led to the modern notion of critical bands [Flet40] [Gree61][Zwis65] [Scha70] We will consider two typical examples

In one scenario, the loudness (perceived intensity) remains constant for a narrowband noise source presented at a constant SPL even as the noise bandwidth is increased up to the critical bandwidth. For any increase beyond the critical bandwidth, the loudness then begins to increase. In this case, one can imagine that loudness remains constant as long as the noise energy is present within only one cochlear "channel" (critical bandwidth), and then that the loudness increases as soon as the noise energy is forced into adjacent cochlear "channels." Critical bandwidth can also be viewed as the result of auditory detection efficacy in terms of a signal-to-noise ratio (SNR) criterion. The power spectrum model

Figure 5.2 The frequency-to-place transformation along the basilar membrane. The picture gives a schematic representation of the traveling wave envelopes (measured in terms of vertical membrane displacement) that occur in response to an acoustic tone complex containing sinusoids of 400, 1600, and 6400 Hz. Peak responses for each sinusoid are localized along the membrane surface, with each peak occurring at a particular distance from the oval window (cochlear "input"). Thus, each component of the complex stimulus evokes strong responses only from the neural receptors associated with frequency-specific loci (after [Zwic90]).


of hearing assumes that masked threshold for a given listener will occur at a constant, listener-specific SNR [Moor96]. In the critical bandwidth measurement experiments, the detection threshold for a narrowband noise source presented between two masking tones remains constant as long as the frequency separation between the tones remains within a critical bandwidth (Figure 5.3a). Beyond this bandwidth, the threshold rapidly decreases (Figure 5.3c). From the SNR viewpoint, one can imagine that as long as the masking tones are presented within the passband of the auditory filter (critical bandwidth) that is tuned to the probe noise, the SNR presented to the auditory system remains constant, and hence the detection threshold does not change. As the tones spread further apart and transition into the filter stopband, however, the SNR presented to the auditory system improves, and hence the detection task becomes easier. In order to maintain a constant SNR at threshold for a particular listener, the power spectrum model calls for a reduction in the probe noise commensurate with the reduction in the energy of the masking tones as they transition out of the auditory filter passband. Thus, beyond critical bandwidth, the detection threshold for the probe tones decreases, and the threshold SNR remains constant. A notched-noise experiment with a similar interpretation can be constructed with masker and maskee roles reversed (Figure 5.3b and d). Critical bandwidth tends to remain constant (about 100 Hz) up to 500 Hz, and increases to approximately 20% of the center frequency above 500 Hz. For an average listener, critical bandwidth (Figure 5.3b) is conveniently approximated [Zwic90] by

BW_c(f) = 25 + 75[1 + 1.4(f/1000)²]^0.69  (Hz)    (5.2)

Although the function BW_c is continuous, it is useful when building practical systems to treat the ear as a discrete set of bandpass filters that conforms to (5.2). The function [Zwic90]

Z_b(f) = 13 arctan(0.00076 f) + 3.5 arctan[(f/7500)²]  (Bark)    (5.3)

is often used to convert from frequency in Hertz to the Bark scale, Figure 5.4(a). Corresponding to the center frequencies of the Table 5.1 filter bank, the numbered points in Figure 5.4(a) illustrate that the nonuniform Hertz spacing of the filter bank (Figure 5.5) is actually uniform on a Bark scale. Thus, one critical bandwidth (CB) comprises one Bark. Table 5.1 gives an idealized filter bank that corresponds to the discrete points labeled on the curves in Figure 5.4(a, b). A distance of 1 critical band is commonly referred to as one Bark in the literature.
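As a quick illustration, Eqs. (5.2) and (5.3) are simple closed-form mappings that can be evaluated directly. The following MATLAB fragment (a minimal sketch in the spirit of the computer exercises at the end of this chapter; the variable names are our own) plots the critical bandwidth and the critical band rate over the audio band:

f   = 20:10:20000;                                   % frequency axis in Hz
BWc = 25 + 75*(1 + 1.4*(f/1000).^2).^0.69;           % critical bandwidth, Eq. (5.2)
Zb  = 13*atan(0.00076*f) + 3.5*atan((f/7500).^2);    % critical band rate, Eq. (5.3)
semilogx(f, BWc); xlabel('Frequency (Hz)'); ylabel('Critical bandwidth (Hz)');
figure; semilogx(f, Zb); xlabel('Frequency (Hz)'); ylabel('Critical band rate (Bark)');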

Although the critical bandwidth captured in Eq. (5.2) is widely used in perceptual models for audio coding, we note that there are alternative expressions. In particular, the equivalent rectangular bandwidth (ERB) scale emerged from research directed towards measurement of auditory filter shapes. Experimental data is obtained typically from notched noise masking procedures. Then the masking data is fitted with parametric weighting functions that represent the spectral shaping properties of the


Figure 5.3 Critical band measurement methods. (a, c) Detection threshold decreases as masking tones transition from the auditory filter passband into the stopband, thus improving detection SNR. (b, d) Same interpretation with roles reversed (after [Zwic90]).

auditory filters [Moor96]. Rounded exponential models with one or two free parameters are popular. For example, the single-parameter roex(p) model is given by

W(g) = (1 + pg)e^{−pg}    (5.4)

where g = |f − f0|/f0 is the normalized frequency, f0 is the center frequency of the filter, and f represents frequency in Hz. Although the roex(p) model does not capture filter asymmetry, asymmetric filter shapes are possible if two roex(p) models are used independently for the high- and low-frequency filter skirts. Two-parameter models such as the roex(p, r) are also used to gain additional degrees of freedom [Moor96] in order to improve the accuracy of the filter shape estimates. After curve-fitting, an ERB estimate is obtained directly from the parametric filter shape. For the roex(p) model, it can be shown easily that the equivalent rectangular bandwidth is given by

ERB_roex(p) = 4f0/p    (5.5)

We note that some texts denote ERB by equivalent noise bandwidth. An example is given in Figure 5.6. The solid line in Figure 5.6(a) shows an example roex(p) filter estimated for a center frequency of 1 kHz, while the dashed line shows the ERB associated with the given roex(p) filter shape.

In [Moor83] and [Glas90], Moore and Glasberg summarized experimental ERB measurements for roex(p, r) models obtained over a period of several years


Figure 5.4 Two views of critical bandwidth. (a) Critical band rate Z_b(f) maps from Hertz to Barks, and (b) critical bandwidth BW_c(f) expresses critical bandwidth as a function of center frequency in Hertz. The "x"s denote the center frequencies of the idealized critical band filter bank given in Table 5.1.

by a number of different investigators. Given a collection of ERB measurements on center frequencies across the audio spectrum, a curve fitting on the data set yielded the following expression for ERB as a function of center frequency:

ERB(f) = 24.7(4.37(f/1000) + 1)    (5.6)


Figure 5.5 Idealized critical band filter bank.

As shown in Figure 5.6(b), the function specified by Eq. (5.6) differs from the critical bandwidth of Eq. (5.2). Of particular interest for perceptual codec designers, the ERB scale implies that auditory filter bandwidths decrease below 500 Hz, whereas the critical bandwidth remains essentially flat. The apparent increased frequency selectivity of the auditory system below 500 Hz has implications for optimal filter-bank design as well as for perceptual bit allocation strategies. These implications are addressed later in the book.
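The divergence of the two scales below 500 Hz is easy to reproduce numerically. A minimal MATLAB sketch (our own variable names; cf. Figure 5.6(b)) that overlays Eq. (5.6) and Eq. (5.2) is:

f   = logspace(2, 4, 200);                       % center frequencies, 100 Hz to 10 kHz
ERB = 24.7*(4.37*(f/1000) + 1);                  % Eq. (5.6)
BWc = 25 + 75*(1 + 1.4*(f/1000).^2).^0.69;       % Eq. (5.2)
loglog(f, ERB, '-', f, BWc, '--');
xlabel('Center frequency (Hz)'); ylabel('Bandwidth (Hz)');
legend('ERB scale', 'Critical bandwidth');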

Regardless of whether it is best characterized in terms of critical bandwidth or ERB, the frequency resolution of the auditory filter bank largely determines which portions of a signal are perceptually irrelevant. The auditory time-frequency analysis that occurs in the critical band filter bank induces simultaneous and nonsimultaneous masking phenomena that are routinely used by modern audio coders to shape the coding distortion spectrum. In particular, the perceptual models allocate bits for signal components such that the quantization noise is shaped to exploit the detection thresholds for a complex sound (e.g., quantization noise). These thresholds are determined by the energy within a critical band [Gass54]. Masking properties and masking thresholds are described next.

5.4 SIMULTANEOUS MASKING, MASKING ASYMMETRY, AND THE SPREAD OF MASKING

Masking refers to a process where one sound is rendered inaudible because of the presence of another sound. Simultaneous masking may occur whenever two


Table 5.1 Idealized critical band filter bank (after [Scha70]). Band edges and center frequencies for a collection of 25 critical bandwidth auditory filters that span the audio spectrum.

Band number    Center frequency (Hz)    Bandwidth (Hz)
 1                 50                        –100
 2                150                     100–200
 3                250                     200–300
 4                350                     300–400
 5                450                     400–510
 6                570                     510–630
 7                700                     630–770
 8                840                     770–920
 9               1000                     920–1080
10               1175                    1080–1270
11               1370                    1270–1480
12               1600                    1480–1720
13               1850                    1720–2000
14               2150                    2000–2320
15               2500                    2320–2700
16               2900                    2700–3150
17               3400                    3150–3700
18               4000                    3700–4400
19               4800                    4400–5300
20               5800                    5300–6400
21               7000                    6400–7700
22               8500                    7700–9500
23              10500                    9500–12000
24              13500                   12000–15500
25              19500                   15500–

or more stimuli are simultaneously presented to the auditory system. From a frequency-domain point of view, the relative shapes of the masker and maskee magnitude spectra determine to what extent the presence of certain spectral energy will mask the presence of other spectral energy. From a time-domain perspective, phase relationships between stimuli can also affect masking outcomes. A simplified explanation of the mechanism underlying simultaneous masking phenomena is that the presence of a strong noise or tone masker creates an excitation of sufficient strength on the basilar membrane at the critical band location to block effectively detection of a weaker signal. Although arbitrary audio spectra may contain complex simultaneous masking scenarios, for the purposes of shaping coding distortions it is convenient to distinguish between three types of simultaneous masking, namely noise-masking-tone (NMT) [Scha70], tone-masking-noise (TMN) [Hell72], and noise-masking-noise (NMN) [Hall98].


Figure 5.6 Equivalent rectangular bandwidth (ERB). (a) Example ERB for a roex(p) single-parameter estimate of the shape of the auditory filter centered at 1 kHz. The solid line represents an estimated spectral weighting function for a single-parameter fit to data from a notched noise masking experiment; the dashed line represents the equivalent rectangular bandwidth. (b) ERB vs. critical bandwidth: the ERB scale of Eq. (5.6) (solid) vs. critical bandwidth of Eq. (5.2) (dashed) as a function of center frequency.


Figure 5.7 Example to illustrate the asymmetry of simultaneous masking. (a) Noise-masking-tone: at the threshold of detection, a 410-Hz pure tone presented at 76-dB SPL is just masked by a critical bandwidth narrowband noise centered at 410 Hz (90 Hz BW) of overall intensity 80 dB SPL. This corresponds to a threshold minimum signal-to-mask ratio (SMR) of 4 dB. The threshold SMR increases as the probe tone is shifted either above or below 410 Hz. (b) Tone-masking-noise: at the threshold of detection, a 1-kHz pure tone presented at 80-dB SPL just masks a critical-band narrowband noise centered at 1 kHz of overall intensity 56-dB SPL. This corresponds to a threshold minimum SMR of 24 dB. As for the NMT experiment, threshold SMR for the TMN increases as the masking tone is shifted either above or below the noise center frequency, 1 kHz. When comparing (a) to (b), it is important to notice the apparent "masking asymmetry," namely that NMT produces a significantly smaller threshold minimum SMR (4 dB) than does TMN (24 dB).

5.4.1 Noise-Masking-Tone

In the NMT scenario (Figure 5.7a), a narrowband noise (e.g., having 1 Bark bandwidth) masks a tone within the same critical band, provided that the intensity of the masked tone is below a predictable threshold directly related to the intensity and, to a lesser extent, center frequency of the masking noise. Numerous studies characterizing NMT for random noise and pure-tone stimuli have appeared since the 1930s (e.g., [Flet37] [Egan50]). At the threshold of detection for the masked tone, the minimum signal-to-mask ratio (SMR), i.e., the smallest difference between the intensity (SPL) of the masking noise ("signal") and the intensity of the masked tone ("mask"), occurs when the frequency of the masked tone is close to the masker's center frequency. In most studies, the minimum SMR tends to lie between −5 and +5 dB. For example, a sample threshold SMR result from the NMT investigation [Egan50] is schematically represented in Figure 5.7a. In the figure, a critical band noise masker centered at 410 Hz with an intensity of 80 dB SPL masks a 410 Hz tone, and the resulting SMR at the threshold of detection is 4 dB.

Masking power decreases (i.e., SMR increases) for probe tones above and below the frequency of the minimum SMR tone, in accordance with a level- and frequency-dependent spreading function that is described later. We note that temporal factors also affect simultaneous masking. For example, in the NMT scenario, an overshoot effect is possible when the probe tone onset occurs within a short interval immediately following masker onset. Overshoot can boost simultaneous masking (i.e., decrease the threshold minimum SMR) by as much as 10 dB over a brief time span [Zwic90].

5.4.2 Tone-Masking-Noise

In the case of TMN (Figure 5.7b), a pure tone occurring at the center of a critical band masks noise of any subcritical bandwidth or shape, provided the noise spectrum is below a predictable threshold directly related to the strength and, to a lesser extent, the center frequency of the masking tone. In contrast to NMT, relatively few studies have attempted to characterize TMN. At the threshold of detection for a noise band masked by a pure tone, however, it was found in both [Hell72] and [Schr79] that the minimum SMR, i.e., the smallest difference between the intensity of the masking tone ("signal") and the intensity of the masked noise ("mask"), occurs when the masker frequency is close to the center frequency of the probe noise, and that the minimum SMR for TMN tends to lie between 21 and 28 dB. A sample result from the TMN study [Schr79] is given in Figure 5.7b. In the figure, a narrowband noise of one Bark bandwidth centered at 1 kHz is masked by a 1 kHz tone of intensity 80 dB SPL. The resulting SMR at the threshold of detection is 24 dB. As with NMT, the TMN masking power decreases for critical bandwidth probe noises centered above and below the minimum SMR probe noise.

5.4.3 Noise-Masking-Noise

The NMN scenario, in which a narrowband noise masks another narrowband noise, is more difficult to characterize than either NMT or TMN because of the confounding influence of phase relationships between the masker and maskee [Hall98]. Essentially, different relative phases between the components of each can lead to different threshold SMRs. The results from one study of intensity difference detection thresholds for wideband noise [Mill47] produced threshold SMRs of nearly 26 dB for NMN [Hall98].

5.4.4 Asymmetry of Masking

The NMT and TMN examples in Figure 5.7 clearly show an asymmetry in masking power between the noise masker and the tone masker. In spite of the fact that both maskers are presented at a level of 80 dB SPL, the associated threshold SMRs differ by 20 dB. This asymmetry motivates our interest in both the TMN and NMT masking paradigms, as well as NMN. In fact, knowledge of all three is critical to success in the task of shaping coding distortion such that it is undetectable by the human auditory system. For each temporal analysis interval, a codec's perceptual model should identify across the frequency spectrum noise-like and tone-like components within both the audio signal and the coding distortion. Next, the model should apply the appropriate masking relationships in a frequency-specific manner. In conjunction with the spread of masking (below), NMT, NMN, and TMN properties can then be used to construct a global masking threshold. Although several methods for masking threshold estimation have proven effective, we note that a deeper understanding of masking asymmetry may provide opportunities for improved perceptual models. In particular, Hall [Hall97] has shown that masking asymmetry can be explained in terms of relative masker/maskee bandwidths, and not necessarily exclusively in terms of absolute masker properties. Ultimately, this implies that the de facto standard energy-based schemes for masking power estimation among perceptual codecs may be valid only so long as the masker bandwidth equals or exceeds the maskee (probe) bandwidth. In cases where the probe bandwidth exceeds the masker bandwidth, an envelope-based measure should be embedded in the masking calculation [Hall97] [Hall98].

5.4.5 The Spread of Masking

The simultaneous masking effects characterized before by the paradigms NMT, TMN, and NMN are not bandlimited to within the boundaries of a single critical band. Interband masking also occurs, i.e., a masker centered within one critical band has some predictable effect on detection thresholds in other critical bands. This effect, also known as the spread of masking, is often modeled in coding applications by an approximately triangular spreading function that has slopes of +25 and −10 dB per Bark. A convenient analytical expression [Schr79] is given by

SF_dB(x) = 15.81 + 7.5(x + 0.474) − 17.5 √(1 + (x + 0.474)²)  dB    (5.7)

where x has units of Barks and SF_dB(x) is expressed in dB. After critical band analysis is done and the spread of masking has been accounted for, masking thresholds in perceptual coders are often established by the [Jaya93] decibel (dB) relations:

TH_N = E_T − 14.5 − B    (5.8)

TH_T = E_N − K    (5.9)

where TH_N and TH_T, respectively, are the noise- and tone-masking thresholds due to tone-masking-noise and noise-masking-tone, E_N and E_T are the critical band noise and tone masker energy levels, and B is the critical band number. Depending upon the algorithm, the parameter K is typically set between 3 and 5 dB. Of course, the thresholds of Eqs. (5.8) and (5.9) capture only the contributions of individual tone-like or noise-like maskers. In the actual coding scenario, each frame typically contains a collection of both masker types. One can see easily that Eqs. (5.8) and (5.9) capture the masking asymmetry described previously. After they have been identified, these individual masking thresholds are combined to form a global masking threshold.
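As an illustration of how Eqs. (5.7)-(5.9) are used, the following MATLAB fragment (a minimal sketch with illustrative, assumed masker levels; not taken from any particular codec) evaluates the triangular spreading function and the two masking-threshold relations:

x    = -4:0.1:4;                                                 % maskee-masker distance (Barks)
SF   = 15.81 + 7.5*(x + 0.474) - 17.5*sqrt(1 + (x + 0.474).^2);  % Eq. (5.7), in dB
E_T  = 70;  B = 10;                                              % example tone masker energy (dB) in band 10
TH_N = E_T - 14.5 - B;                                           % noise masked by tone, Eq. (5.8)
E_N  = 70;  K = 4;                                               % example noise masker energy and offset
TH_T = E_N - K;                                                  % tone masked by noise, Eq. (5.9)

With both maskers at 70 dB, the sketch reproduces the masking asymmetry described above: the tone masker protects quantization noise only down to about 45.5 dB, whereas the noise masker protects a tone down to 66 dB.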

The global masking threshold comprises an estimate of the level at which quantization noise becomes just noticeable. Consequently, the global masking threshold is sometimes referred to as the level of "just-noticeable distortion," or JND. The standard practice in perceptual coding involves first classifying masking signals as either noise or tone, next computing appropriate thresholds, then using this information to shape the noise spectrum beneath the JND level. Two illustrated examples are given later in Sections 5.6 and 5.7, which address the perceptual entropy and the ISO/IEC MPEG Model 1, respectively. Note that the absolute threshold (Tq) of hearing is also considered when shaping the noise spectra, and that max(JND, Tq) is most often used as the permissible distortion threshold. Notions of critical bandwidth and simultaneous masking in the audio coding context give rise to some convenient terminology illustrated in Figure 5.8, where we consider the case of a single masking tone occurring at the center of a critical band. All levels in the figure are given in terms of dB SPL.

A hypothetical masking tone occurs at some masking level. This generates an excitation along the basilar membrane that is modeled by a spreading function and a corresponding masking threshold. For the band under consideration, the minimum masking threshold denotes the spreading function in-band minimum. Assuming the masker is quantized using an m-bit uniform scalar quantizer, noise might be introduced at the level m. Signal-to-mask ratio (SMR) and noise-to-mask ratio (NMR) denote the log distances from the minimum masking threshold to the masker and noise levels, respectively.

Figure 5.8 Schematic representation of simultaneous masking (after [Noll93]).


5.5 NONSIMULTANEOUS MASKING

As shown in Figure 5.9, masking phenomena extend in time beyond the window of simultaneous stimulus presentation. In other words, for a masker of finite duration, nonsimultaneous (also sometimes denoted "temporal") masking occurs both prior to masker onset as well as after masker removal. The skirts on both regions are schematically represented in Figure 5.9. Essentially, absolute audibility thresholds for masked sounds are artificially increased prior to, during, and following the occurrence of a masking signal. Whereas significant premasking tends to last only about 1–2 ms, postmasking will extend anywhere from 50 to 300 ms, depending upon the strength and duration of the masker [Zwic90]. Below, we consider key nonsimultaneous masking properties that should be embedded in audio codec perceptual models.

Of the two nonsimultaneous masking modes, forward masking is better understood. For masker and probe of the same frequency, experimental studies have shown that the amount of forward (post-) masking depends in a predictable way on stimulus frequency [Jest82], masker intensity [Jest82], probe delay after masker cessation [Jest82], and masker duration [Moor96]. Forward masking also exhibits frequency-dependent behavior, similar to simultaneous masking, that can be observed when the masker and probe frequency relationship is varied [Moor78]. Although backward (pre-) masking has also been the subject of many studies, it is not well understood [Moor96]. As shown in Figure 5.9, backward masking decays much more rapidly than forward masking. For example, one study at Thomson Consumer Electronics showed that only 2 ms prior to masker onset, the masked threshold was already 25 dB below the threshold of simultaneous masking [Bran98].

We note, however, that the literature lacks consensus over the maximum time persistence of significant backward masking. Despite the inconsistent results across studies, it is generally accepted that the amount of measured backward

Figure 5.9 Nonsimultaneous masking properties of the human ear. Backward (pre-) masking occurs prior to masker onset and lasts only a few milliseconds. Forward (post-) masking may persist for more than 100 ms after masker removal (after [Zwic90]).


masking depends significantly on the training of the experimental subjects. For the purposes of perceptual coding, abrupt audio signal transients (e.g., the onset of a percussive musical instrument) create pre- and postmasking regions during which a listener will not perceive signals beneath the elevated audibility thresholds produced by a masker. In fact, temporal masking has been used in several audio coding algorithms (e.g., [Bran94a] [Papa95] [ISOI96a] [Fiel96] [Sinh98a]). Premasking, in particular, has been exploited in conjunction with adaptive block size transform coding to compensate for pre-echo distortions (Chapter 6, Sections 6.9 and 6.10).

5.6 PERCEPTUAL ENTROPY

Johnston [John88a] combined notions of psychoacoustic masking with signal quantization principles to define perceptual entropy (PE), a measure of perceptually relevant information contained in any audio record. Expressed in bits per sample, PE represents a theoretical limit on the compressibility of a particular signal. PE measurements reported in [John88a] and [John88b] suggest that a wide variety of CD-quality audio source material can be transparently compressed at approximately 2.1 bits per sample. The PE estimation process is accomplished as follows. The signal is first windowed and transformed to the frequency domain. A masking threshold is then obtained using perceptual rules. Finally, a determination is made of the number of bits required to quantize the spectrum without injecting perceptible noise. The PE measurement is obtained by constructing a PE histogram over many frames and then choosing a worst-case value as the actual measurement.

The frequency-domain transformation is done with a Hann window followed by a 2048-point fast Fourier transform (FFT). Masking thresholds are obtained by performing critical band analysis (with spreading), making a determination of the noise-like or tone-like nature of the signal, applying thresholding rules for the signal quality, then accounting for the absolute hearing threshold. First, real and imaginary transform components are converted to power spectral components

P(ω) = Re²(ω) + Im²(ω)    (5.10)

then a discrete Bark spectrum is formed by summing the energy in each critical band (Table 5.1):

B_i = Σ_{ω=bl_i}^{bh_i} P(ω)    (5.11)

where the summation limits are the critical band boundaries. The range of the index i is sample-rate dependent, and in particular i ∈ {1, ..., 25} for CD-quality signals. A spreading function, Eq. (5.7), is then convolved with the discrete Bark spectrum:

C_i = B_i ∗ SF_i    (5.12)


to account for the spread of masking. An estimation of the tone-like or noise-like quality for C_i is then obtained using the spectral flatness measure (SFM):

SFM = μ_g / μ_a    (5.13)

where μ_g and μ_a, respectively, correspond to the geometric and arithmetic means of the PSD components for each band. The SFM has the property that it is bounded by 0 and 1. Values close to 1 will occur if the spectrum is flat in a particular band, indicating a decorrelated (noisy) band. Values close to zero will occur if the spectrum in a particular band is narrowband. A coefficient of tonality, α, is next derived from the SFM, on a dB scale:

α = min( SFM_dB / −60, 1 )    (5.14)

and this is used to weight the thresholding rules given by Eqs. (5.8) and (5.9) (with K = 5.5) as follows for each band to form an offset:

O_i = α(14.5 + i) + (1 − α)5.5  (in dB)    (5.15)

A set of JND estimates in the frequency power domain is then formed by subtracting the offsets from the Bark spectral components:

T_i = 10^{log10(C_i) − O_i/10}    (5.16)

These estimates are scaled by a correction factor to simulate deconvolution of the spreading function, and then each T_i is checked against the absolute threshold of hearing and replaced by max(T_i, T_q(i)). In a manner essentially identical to the SPL calibration procedure that was described in Section 5.2, the PE estimation is calibrated by equating the minimum absolute threshold to the energy in a 4 kHz signal of ±1 bit amplitude. In other words, the system assumes that the playback level (volume control) is configured such that the smallest possible signal amplitude will be associated with an SPL equal to the minimum absolute threshold. By applying uniform quantization principles to the signal and associated set of JND estimates, it is possible to estimate a lower bound on the number of bits required to achieve transparent coding. In fact, it can be shown that the perceptual entropy in bits per sample is given by

PE = Σ_{i=1}^{25} Σ_{ω=bl_i}^{bh_i} [ log2( 2 |nint( Re(ω) / √(6T_i/k_i) )| + 1 ) + log2( 2 |nint( Im(ω) / √(6T_i/k_i) )| + 1 ) ]  (bits/sample)    (5.17)


where i is the index of critical band, bl_i and bh_i are the lower and upper bounds of band i, k_i is the number of transform components in band i, T_i is the masking threshold in band i (Eq. (5.16)), and nint denotes rounding to the nearest integer. Note that if 0 occurs in the log, we assign 0 for the result. The masking thresholds used in the above PE computation also form the basis for a transform coding algorithm described in Chapter 7. In addition, the ISO/IEC MPEG-1 psychoacoustic model 2, which is often used in MP3 encoders, is closely related to the PE procedure.
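To make the sequence of Eqs. (5.10)-(5.17) concrete, the following MATLAB sketch computes a PE estimate for a single 2048-sample frame. It is only an outline of the procedure described above: the frame s (a column vector), the critical band bin boundaries bl(i) and bh(i), the sampled Bark-domain spreading function SFbark, and the per-band absolute threshold Tq(i) (in the power domain) are all assumed to be available, and the spreading-function deconvolution correction is omitted.

X  = fft(hanning(2048).*s(1:2048));                      % windowed transform
P  = real(X).^2 + imag(X).^2;                            % Eq. (5.10)
for i = 1:25, B(i) = sum(P(bl(i):bh(i))); end            % Bark spectrum, Eq. (5.11)
C  = conv(B, SFbark, 'same');                            % spread Bark spectrum, Eq. (5.12)
PE = 0;
for i = 1:25
    idx   = bl(i):bh(i);  ki = length(idx);
    SFMdB = 10*log10(exp(mean(log(P(idx))))/mean(P(idx)));  % geometric/arithmetic ratio, Eq. (5.13), in dB
    alpha = min(SFMdB/(-60), 1);                         % tonality coefficient, Eq. (5.14)
    Oi    = alpha*(14.5 + i) + (1 - alpha)*5.5;          % offset, Eq. (5.15)
    Ti    = max(10^(log10(C(i)) - Oi/10), Tq(i));        % Eq. (5.16), floored by threshold in quiet
    q     = sqrt(6*Ti/ki);                               % quantizer step implied by Eq. (5.17)
    PE    = PE + sum(log2(2*abs(round(real(X(idx))/q)) + 1) ...
                   + log2(2*abs(round(imag(X(idx))/q)) + 1));
end
PE = PE/2048;                                            % normalize to bits per sample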

We note, however, that there have been evolutionary improvements since the PE estimation scheme first appeared in 1988. For example, the PE calculation in many systems (e.g., [ISOI94]) relies on improved tonality estimates relative to the SFM-based measure of Eq. (5.14). The SFM-based measure is both time- and frequency-constrained. Only one spectral estimate (analysis frame) is examined in time, and in frequency the measure by definition lumps together multiple spectral lines. In contrast, other tonality estimation schemes, e.g., the "chaos measure" [ISOI94] [Bran98], consider the predictability of individual frequency components across time, in terms of magnitude and phase-tracking properties. A predicted value for each component is compared against its actual value, and the Euclidean distance is mapped to a measure of predictability. Highly predictable spectral components are considered to be tonal, while unpredictable components are treated as noise-like. A tonality coefficient that allows weighting towards one extreme or the other is computed from the chaos measure, just as in Eq. (5.14). Improved performance has been demonstrated in several instances (e.g., [Bran90] [ISOI94] [Bran98]). Nevertheless, the PE measurement as proposed in its original form conveys valuable insight on the application of simultaneous masking asymmetry to a perceptual model in a practical system.

5.7 EXAMPLE CODEC PERCEPTUAL MODEL: ISO/IEC 11172-3 (MPEG-1) PSYCHOACOUSTIC MODEL 1

It is useful to consider an example of how the psychoacoustic principles described thus far are applied in actual coding algorithms. The ISO/IEC 11172-3 (MPEG-1, layer 1) psychoacoustic model 1 [ISOI92] determines the maximum allowable quantization noise energy in each critical band such that quantization noise remains inaudible. In one of its modes, the model uses a 512-point FFT for high-resolution spectral analysis (86.13 Hz), then estimates for each input frame individual simultaneous masking thresholds due to the presence of tone-like and noise-like maskers in the signal spectrum. A global masking threshold is then estimated for a subset of the original 256 frequency bins by (power) additive combination of the tonal and nontonal individual masking thresholds. The remainder of this section describes the step-by-step model operations. Sample results are given for one frame of CD-quality pop music sampled at 44.1 kHz, 16 bits per sample. We note that although this model is suitable for any of the MPEG-1 coding layers I–III, the standard [ISOI92] recommends that model 1 be used with layers I and II, while model 2 is recommended for layer III (MP3). The five steps leading to computation of global masking thresholds are described in the following sections.

5.7.1 Step 1: Spectral Analysis and SPL Normalization

Spectral analysis and normalization are performed first. The goal of this step is to obtain a high-resolution spectral estimate of the input, with spectral components expressed in terms of sound pressure level (SPL). Much like the PE calculation described previously, this SPL normalization guarantees that a 4 kHz signal of ±1 bit amplitude will be associated with an SPL near 0 dB (close to an acceptable Tq value for normal listeners at 4 kHz), whereas a full-scale sinusoid will be associated with an SPL near 90 dB. The spectral analysis procedure works as follows. First, incoming audio samples s(n) are normalized according to the FFT length N and the number of bits per sample b using the relation

x(n) = s(n) / (N(2^{b−1}))    (5.18)

Normalization references the power spectrum to a 0-dB maximum. The normalized input x(n) is then segmented into 12-ms frames (512 samples) using a 1/16th-overlapped Hann window, such that each frame contains 10.9 ms of new data. A power spectral density (PSD) estimate, P(k), is then obtained using a 512-point FFT, i.e.,

P(k) = PN + 10 log10 | Σ_{n=0}^{N−1} w(n) x(n) e^{−j2πkn/N} |²,   0 ≤ k ≤ N/2    (5.19)

where the power normalization term PN is fixed at 90.302 dB and the Hann window w(n) is defined as

w(n) = (1/2)[1 − cos(2πn/N)]    (5.20)

Because playback levels are unknown during psychoacoustic signal analysis, the normalization procedure (Eq. (5.18)) and the parameter PN in Eq. (5.19) are used to estimate SPL conservatively from the input signal. For example, a full-scale sinusoid that is precisely resolved by the 512-point FFT in bin k0 will yield a spectral line P(k0) having 84 dB SPL. With 16-bit sample resolution, SPL estimates for very-low-amplitude input signals will be at or below the absolute threshold. An example PSD estimate obtained in this manner for a CD-quality pop music selection is given in Figure 5.10(a). The spectrum is shown both on a linear frequency scale (upper plot) and on the Bark scale (lower plot). The dashed line in both plots corresponds to the absolute threshold of hearing approximation used by the model.
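A minimal MATLAB sketch of step 1 for one frame follows; the raw samples s are assumed to be signed 16-bit PCM values (b = 16), and eps is added inside the logarithm only to guard against log of zero.

N  = 512;  b = 16;  PN = 90.302;
x  = s(1:N)/(N*2^(b-1));                          % normalization, Eq. (5.18)
w  = 0.5*(1 - cos(2*pi*(0:N-1)'/N));              % Hann window, Eq. (5.20)
X  = fft(w.*x, N);
P  = PN + 10*log10(abs(X(1:N/2+1)).^2 + eps);     % PSD in dB SPL, Eq. (5.19)
plot(0:N/2, P);  xlabel('FFT bin k');  ylabel('SPL (dB)');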

5.7.2 Step 2: Identification of Tonal and Noise Maskers

After PSD estimation and SPL normalization, tonal and nontonal masking components are identified. Local maxima in the sample PSD that exceed neighboring


Figure 5.10a ISO/IEC MPEG-1 psychoacoustic analysis model 1 for an example pop music selection, steps 1–5 as described in the text. (a) Step 1: Obtain PSD, express in dB SPL. Top panel gives linear frequency scale, bottom panel gives Bark frequency scale. Absolute threshold superimposed. Step 2: Tonal maskers identified and denoted by 'x' symbol; noise maskers identified and denoted by 'o' symbol. (b) Collection of prototype spreading functions (Eq. (5.31)) shown with level as the parameter. These illustrate the incorporation of excitation pattern level-dependence into the model. Note that the prototype functions are defined to be piecewise linear on the Bark scale. These will be associated with maskers in steps 3 and 4. (c) Steps 3 and 4: Spreading functions are associated with each of the individual tonal maskers satisfying the rules outlined in the text. Note that the signal-to-mask ratio (SMR) at the peak is close to the widely accepted tonal value of 14.5 dB. (d) Spreading functions are associated with each of the individual noise maskers that were extracted after the tonal maskers had been eliminated from consideration, as described in the text. Note that the peak SMR is close to the widely accepted noise-masker value of 5 dB. (e) Step 5: A global masking threshold is obtained by combining the individual thresholds as described in the text. The maximum of the global threshold and the absolute threshold is taken at each point in frequency to be the final global threshold. The figure clearly shows that some portions of the input spectrum require SNRs better than 20 dB to prevent audible distortion, while other spectral regions require less than 3 dB SNR.


Figure 5.10b, c


Figure 5.10d, e


components within a certain Bark distance by at least 7 dB are classified as tonal. Specifically, the "tonal" set, S_T, is defined as

S_T = { P(k) | P(k) > P(k ± 1),  P(k) > P(k ± Δk) + 7 dB }    (5.21)

where

Δk ∈ {  2,        2 < k < 63      (0.17–5.5 kHz)
       [2, 3],   63 ≤ k < 127    (5.5–11 kHz)
       [2, 6],   127 ≤ k ≤ 256   (11–20 kHz) }    (5.22)

Tonal maskers, P_TM(k), are computed from the spectral peaks listed in S_T as follows:

P_TM(k) = 10 log10 Σ_{j=−1}^{1} 10^{0.1 P(k+j)}  (dB)    (5.23)

In other words, for each neighborhood maximum, energy from three adjacent spectral components centered at the peak is combined to form a single tonal masker. Tonal maskers extracted from the example pop music selection are identified using 'x' symbols in Figure 5.10(a). A single noise masker for each critical band, P_NM(k̄), is then computed from (remaining) spectral lines not within the ±Δk neighborhood of a tonal masker using the sum

P_NM(k̄) = 10 log10 Σ_j 10^{0.1 P(j)}  (dB),   ∀ P(j) ∉ {P_TM(k, k ± 1, k ± Δk)}    (5.24)

where k̄ is defined to be the geometric mean spectral line of the critical band, i.e.,

k̄ = ( ∏_{j=l}^{u} j )^{1/(u−l+1)}    (5.25)

where l and u are the lower and upper spectral line boundaries of the critical band, respectively. The idea behind Eq. (5.24) is that residual spectral energy within a critical bandwidth not associated with a tonal masker must, by default, be associated with a noise masker. Therefore, in each critical band, Eq. (5.24) combines into a single noise masker all of the energy from spectral components that have not contributed to a tonal masker within the same band. Noise maskers are denoted in Figure 5.10 by 'o' symbols. Dashed vertical lines are included in the Bark scale plot to show the associated critical band for each masker.
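The tonal-masker search of Eqs. (5.21)-(5.23) reduces to a few lines of MATLAB. In the sketch below, P holds the PSD in dB SPL from step 1, and dk(k) is a hypothetical helper returning the largest neighborhood distance of Eq. (5.22) for bin k (2, 3, or 6); both are assumptions for illustration, not part of the standard.

ST = [];                                          % indices of tonal maskers
for k = 3:250
    nb = [k-dk(k):k-2, k+2:k+dk(k)];              % non-adjacent neighbors, Eq. (5.22)
    if P(k) > P(k-1) && P(k) > P(k+1) && all(P(k) > P(nb) + 7)
        ST(end+1) = k;                            % local maximum satisfying Eq. (5.21)
    end
end
PTM = 10*log10(10.^(0.1*P(ST-1)) + 10.^(0.1*P(ST)) + 10.^(0.1*P(ST+1)));   % Eq. (5.23)

The noise masker of Eq. (5.24) is then formed per critical band from the remaining bins, as described above.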

5.7.3 Step 3: Decimation and Reorganization of Maskers

In this step, the number of maskers is reduced using two criteria. First, any tonal or noise maskers below the absolute threshold are discarded, i.e., only maskers that satisfy

P_{TM,NM}(k) ≥ T_q(k)    (5.26)


are retained, where T_q(k) is the SPL of the threshold in quiet at spectral line k. In the pop music example, two high-frequency noise maskers identified during step 2 (Figure 5.10(a)) are dropped after application of Eq. (5.26) (Figure 5.10(c–e)). Next, a sliding 0.5-Bark-wide window is used to replace any pair of maskers occurring within a distance of 0.5 Bark by the stronger of the two. In the pop music example, two tonal maskers appear between 19.5 and 20.5 Barks (Figure 5.10(a)). It can be seen that the pair is replaced by the stronger of the two during threshold calculations (Figure 5.10(c–e)). After the sliding window procedure, masker frequency bins are reorganized according to the subsampling scheme

P_{TM,NM}(i) = P_{TM,NM}(k)    (5.27)

P_{TM,NM}(k) = 0    (5.28)

where

i = {  k,                          1 ≤ k ≤ 48
       k + (k mod 2),              49 ≤ k ≤ 96
       k + 3 − ((k − 1) mod 4),    97 ≤ k ≤ 232 }    (5.29)

The net effect of Eq. (5.29) is 2:1 decimation of masker bins in critical bands 18–22 and 4:1 decimation of masker bins in critical bands 22–25, with no loss of masking components. This procedure reduces the total number of tone and noise masker frequency bins under consideration from 256 to 106. Tonal and noise maskers shown in Figure 5.10(c–e) have been relocated according to this decimation scheme.

5.7.4 Step 4: Calculation of Individual Masking Thresholds

Using the decimated set of tonal and noise maskers, individual tone and noise masking thresholds are computed next. Each individual threshold represents a masking contribution at frequency bin i due to the tone or noise masker located at bin j (reorganized during step 3). Tonal masker thresholds, T_TM(i, j), are given by

T_TM(i, j) = P_TM(j) − 0.275 Z_b(j) + SF(i, j) − 6.025  (dB SPL)    (5.30)

where P_TM(j) denotes the SPL of the tonal masker in frequency bin j, Z_b(j) denotes the Bark frequency of bin j (Eq. (5.3)), and the spread of masking from masker bin j to maskee bin i, SF(i, j), is modeled by the expression

SF(i, j) = {  17 ΔZ_b − 0.4 P_TM(j) + 11,                    −3 ≤ ΔZ_b < −1
              (0.4 P_TM(j) + 6) ΔZ_b,                        −1 ≤ ΔZ_b < 0
              −17 ΔZ_b,                                       0 ≤ ΔZ_b < 1
              (0.15 P_TM(j) − 17) ΔZ_b − 0.15 P_TM(j),        1 ≤ ΔZ_b < 8  }  (dB SPL)    (5.31)


i.e., as a piecewise linear function of masker level, P(j), and Bark maskee-masker separation, ΔZ_b = Z_b(i) − Z_b(j). SF(i, j) approximates the basilar spreading (excitation pattern) described in Section 5.4. Prototype individual masking thresholds, T_TM(i, j), are shown as a function of masker level in Figure 5.10(b) for an example tonal masker occurring at Z_b = 10 Barks. As shown in the figure, the slope of T_TM(i, j) decreases with increasing masker level. This is a reflection of psychophysical test results, which have demonstrated [Zwic90] that the ear's frequency selectivity decreases as stimulus levels increase. It is also noted here that the spread of masking in this particular model is constrained to a 10-Bark neighborhood for computational efficiency. This simplifying assumption is reasonable given the very low masking levels that occur in the tails of the excitation patterns modeled by SF(i, j). Figure 5.10(c) shows the individual masking thresholds (Eq. (5.30)) associated with the tonal maskers in Figure 5.10(a) ('x'). It can be seen here that the pair of maskers identified near 19 Barks has been replaced by the stronger of the two during the decimation phase. The plot includes the absolute hearing threshold for reference. Individual noise masker thresholds, T_NM(i, j), are given by

T_NM(i, j) = P_NM(j) − 0.175 Z_b(j) + SF(i, j) − 2.025  (dB SPL)    (5.32)

where P_NM(j) denotes the SPL of the noise masker in frequency bin j, Z_b(j) denotes the Bark frequency of bin j (Eq. (5.3)), and SF(i, j) is obtained by replacing P_TM(j) with P_NM(j) in Eq. (5.31). Figure 5.10(d) shows individual masking thresholds associated with the noise maskers identified in step 2 (Figure 5.10(a), 'o'). It can be seen in Figure 5.10(d) that the two high-frequency noise maskers that occur below the absolute threshold have been eliminated.

Before we proceed to step 5 and compute a global masking threshold, it is worthwhile to consider the relationship between Eq. (5.8) and Eq. (5.30), as well as the connection between Eq. (5.9) and Eq. (5.32). Equations (5.8) and (5.30) are related in that both model the TMN paradigm (Section 5.4) in order to generate a masking threshold for quantization noise masked by a tonal signal component. In the case of Eq. (5.8), a Bark-dependent offset that is consistent with experimental TMN data for the threshold minimum SMR, namely the quantity 14.5 + B, is subtracted from the masker intensity. In a similar manner, Eq. (5.30) estimates, for a quantization noise maskee located in bin i, the intensity of the masking contribution due to the tonal masker located in bin j. Like Eq. (5.8), the psychophysical motivation for Eq. (5.30) is the desire to model the relatively weak masking contributions of a TMN. Unlike Eq. (5.8), however, Eq. (5.30) uses an offset of only 6.025 + 0.275B, i.e., Eq. (5.30) assumes a smaller minimum SMR at threshold than does Eq. (5.8). The connection between Eqs. (5.9) and (5.32) is analogous. In the case of this equation pair, however, the psychophysical motivation is to model the masking contributions of NMT. Equation (5.9) assumes a Bark-independent minimum SMR of 3–5 dB, depending on the value of the parameter K. Equation (5.32), on the other hand, assumes a Bark-dependent threshold minimum SMR of 2.025 + 0.175B dB. Also, whereas the spreading function (SF) terms embedded in Eqs. (5.30) and (5.32) explicitly account for the spread of masking, Eqs. (5.8) and (5.9) assume that the spread of masking was captured during the computation of the terms E_T and E_N, respectively.

5.7.5 Step 5: Calculation of Global Masking Thresholds

In this step, individual masking thresholds are combined to estimate a global masking threshold for each frequency bin in the subset given by Eq. (5.29). The model assumes that masking effects are additive. The global masking threshold, T_g(i), is therefore obtained by computing the sum

T_g(i) = 10 log10( 10^{0.1 T_q(i)} + Σ_{l=1}^{L} 10^{0.1 T_TM(i,l)} + Σ_{m=1}^{M} 10^{0.1 T_NM(i,m)} )  (dB SPL)    (5.33)

where T_q(i) is the absolute hearing threshold for frequency bin i, T_TM(i, l) and T_NM(i, m) are the individual masking thresholds from step 4, and L and M are the numbers of tonal and noise maskers, respectively, identified during step 3. In other words, the global threshold for each frequency bin represents a signal-dependent, power-additive modification of the absolute threshold due to the basilar spread of all tonal and noise maskers in the signal power spectrum. Figure 5.10(e) shows the global masking threshold obtained by adding the power of the individual tonal (Figure 5.10(c)) and noise (Figure 5.10(d)) maskers to the absolute threshold in quiet.
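Because the combination in Eq. (5.33) is a simple power sum, it is a one-line computation once the individual thresholds are available. In the MATLAB sketch below, Tq(i), TTM(i,:), and TNM(i,:) are assumed to hold, in dB SPL, the absolute threshold and the individual tonal and noise thresholds at frequency bin i.

Tg = 10*log10( 10.^(0.1*Tq(i)) + sum(10.^(0.1*TTM(i,:))) + sum(10.^(0.1*TNM(i,:))) );   % Eq. (5.33)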

5.8 PERCEPTUAL BIT ALLOCATION

In this section, we will extend the uniform- and optimal-bit allocation algorithms presented in Chapter 3, Section 3.5, with perceptual bit-assignment strategies. In the perceptual bit allocation method, the number of bits allocated to different bands is determined based on the global masking thresholds obtained from the psychoacoustic model. The steps involved in the computation of the global masking thresholds have been presented in detail in the previous section. The signal-to-mask ratio (SMR) determines the number of bits to be assigned in each band for perceptually transparent coding of the input audio. The noise-to-mask ratios (NMRs) are computed by subtracting the SMR from the SNR in each subband, i.e.,

NMR = SNR − SMR  (dB)    (5.34)

The main objective in a perceptual bit allocation scheme is to keep the quantization noise below a masking threshold. For example, note that the NMR in Figure 5.11(a) is larger than the NMR in Figure 5.11(b). Hence, the (quantization) noise in the case of Figure 5.11(a) can be masked more easily than in the case of Figure 5.11(b). Therefore, it is logical to assign a sufficient number of bits to the subband with the lowest NMR. This criterion is applied to all the subbands until all the bits are exhausted; a simple greedy sketch is given below. Typically, in audio coding standards, an iterative procedure is employed that satisfies both the bit rate and global masking threshold requirements.
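The following MATLAB fragment is a minimal greedy sketch of the NMR-driven allocation just described; the per-band SMR values, the total bit pool, and the assumption of roughly 6 dB of SNR gain per allocated bit are all illustrative, not values taken from any standard.

SMR  = [20 14 8 2];             % example signal-to-mask ratios per subband (dB)
bits = zeros(size(SMR));        % bits assigned so far to each subband
SNR  = zeros(size(SMR));        % SNR achieved with the current assignment (dB)
pool = 16;                      % total bits available for this frame
while pool > 0
    NMR = SNR - SMR;            % Eq. (5.34)
    [~, k]  = min(NMR);         % subband where the noise is least well masked
    bits(k) = bits(k) + 1;      % give it one more bit ...
    SNR(k)  = SNR(k) + 6;       % ... worth roughly 6 dB of SNR
    pool    = pool - 1;
end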

Figure 5.11 Simultaneous masking, depicting relatively large NMR in (a) compared to (b).


5.9 SUMMARY

This chapter dealt with some of the basics of psychoacoustics. We covered the absolute threshold of hearing, the Bark scale, the simultaneous and temporal masking effects, and the perceptual entropy. A step-by-step procedure that describes the ISO/IEC psychoacoustic model 1 was provided.

PROBLEMS

5.1 Describe the difference between a Mel scale and a Bark scale. Give tables and itemize, side-by-side, the center frequencies and bandwidths for 0–5 kHz. Describe how the two different scales are constructed.

5.2 In Figure 5.12, the solid line indicates the just noticeable distortion (JND) curve and the dotted line indicates the absolute threshold in quiet. State which of the tones A, B, C, or D would be audible and which ones are likely to be masked. Explain.

5.3 In Figure 5.13, state whether tone B would be masked by tone A. Explain. Also indicate whether tone C would mask the narrow-band noise. Give reasons.

5.4 In Figure 5.14, the solid line indicates the JND curve obtained from the psychoacoustic model 1. A broadband noise component is shown that spans

Figure 5.12 JND curve for Problem 5.2.


Figure 5.13 Masking experiment, Problem 5.3.

from 3 to 11 Barks, and a tone is present at 10 Barks. Sketch the portions of the noise and the tone that could be considered perceptually relevant.

COMPUTER EXERCISES

5.5 Design a 3-band equalizer using the peaking filter equations of Chapter 2. The center frequencies should correspond to the auditory filters (see Table 5.1) at center frequencies 450 Hz, 1000 Hz, and 2500 Hz. Compute the Q-factors associated with each of these filters using Q = f0/BW, where f0 is the center frequency and BW is the filter bandwidth (obtain from Table 5.1). Choose g = 5 dB for all the filters. Give the frequency response of the 3-band equalizer in terms of a Bark scale.

5.6 Write a program to plot the absolute threshold of hearing in quiet, Eq. (5.1). Give a plot in terms of a linear Hz scale.

5.7 Use the program of Problem 5.6 and plot the absolute threshold of hearing on a Bark scale.

5.8 Generate four sinusoids with frequencies 400 Hz, 1000 Hz, 2500 Hz, and 6000 Hz; fs = 44.1 kHz. Obtain s(n) by adding these individual sinusoids as follows:

s(n) = Σ_{i=1}^{4} sin(2π f_i n / f_s),   n = 1, 2, ..., 1024


Figure 5.14 Perceptual bit-allocation, Problem 5.4.

Give power spectrum plots of s(n) (in dB) in terms of a Bark scale and in terms of a linear Hz scale. List the Bark-band numbers where the four peaks are located. (Hint: see Table 5.1 for the Bark band numbers.)

5.9 Extend the above problem and give the power spectrum plot in dB SPL. See Section 5.7.1 for details. Also include the absolute threshold of hearing in quiet in your plot.

5.10 Write a program to compute the perceptual entropy (in bits/sample) of the following signals:
a. ch5_malespeech.wav (8 kHz, 16 bit)
b. ch5_music.wav (44.1 kHz, 16 bit)

(Hint: Use equations (5.10)–(5.17) in Chapter 5, Section 5.6.) Choose the frame size as 512 samples. Also, the perceptual entropy (PE) measurement is obtained by constructing a PE histogram over many frames and then choosing a worst-case value as the actual measurement.

5.11 FFT-based perceptual audio synthesis using the MPEG-1 psychoacoustic model 1.

In this computer exercise, we will consider an example to show how the psychoacoustic principles are applied in actual audio coding algorithms. Recall that in Chapter 2, Computer Exercise 2.25, we employed the peak-picking method to select a subset of FFT components for audio synthesis. In this exercise, we will use the just-noticeable-distortion (JND) curve as the "reference" to select the perceptually important FFT components. All the FFT components below the JND curve are assigned a minimal value (for example, −50 dB SPL) such that these perceptually irrelevant FFT components receive a minimum number of bits for encoding. The ISO/IEC MPEG-1 psychoacoustic model 1 (see Section 5.7, Steps 1 through 5) is used to compute the JND curve.

Use the MATLAB software from the Book website to simulate the ISO/IEC MPEG-1 psychoacoustic model 1. The software package consists of three MATLAB files, psymain.m, psychoacoustics.m, and audio_synthesis.m, a wave file ch5_music.wav, and a hints.doc file. The psymain.m file is the main file that contains the complete flow of your computer exercise. The psychoacoustics.m file contains the steps performed in the psychoacoustic analysis.

Deliverable 1:
a. Using the comments included in the hints.doc file, fill in the MATLAB commands in the audio_synthesis.m file to complete the program.
b. Give plots of the input and the synthesized audio. Is the psychoacoustic criterion for picking FFT components equivalent to using a Parseval's-related criterion? In the audio synthesis, what happens if the FFT conjugate symmetry was not maintained?
c. How is this FFT-based perceptual audio synthesis method different from a typical peak-picking method for audio synthesis?

Deliverable 2:
d. Provide a subjective evaluation of the synthesized audio in terms of a MOS scale. Did you hear any clicks between the frames in the synthesized audio? If yes, what would you do to eliminate such artifacts? Compute the overall and segmental SNR values between the input and the synthesized audio.
e. On the average, how many FFT components per frame were selected? If only 30 FFT components out of 512 were to be picked (because of bit rate considerations), would the application of the psychoacoustic model to select the FFT components yield the best possible SNR?

5.12 In this computer exercise, we will analyze the asymmetry of simultaneous masking. Use the MATLAB software from the Book website. The software package consists of two MATLAB files, asymm_mask.m and psychoacoustics.m. The asymm_mask.m file includes steps to generate a pure tone s1(n) with f = 4 kHz and fs = 44.1 kHz:

s1(n) = sin(2π f n / f_s),   n = 0, 1, 2, ..., 44099

a. Simulate a broadband noise s2(n) by band-pass filtering uniform white noise (μ = 0 and σ² = 1) using a Butterworth filter (of appropriate order, e.g., 8) with center frequency 4 kHz. Assume the 3-dB cut-off frequencies of the bandpass filter are 3500 Hz and 4500 Hz. Generate a test signal s(n) = αs1(n) + βs2(n). Choose α = 0.025 and β = 1. Observe if the broadband noise completely masks the tone. Experiment with different values of α and β, and find out when 1) the broadband noise masks the tone and 2) the tone masks the broadband noise.

b. Simulate the two cases of masking (i.e., the NMT and the TMN) given in Figure 5.7, Section 5.4.

CHAPTER 6

TIME-FREQUENCY ANALYSIS: FILTER BANKS AND TRANSFORMS

6.1 INTRODUCTION

Audio codecs typically use a time-frequency analysis block to extract a set of parameters that is amenable to quantization. The tool most commonly employed for this mapping is a filter bank of bandpass filters. The filter bank divides the signal spectrum into frequency subbands and generates a time-indexed series of coefficients representing the frequency-localized signal power within each band. By providing explicit information about the distribution of signal, and hence masking power, over the time-frequency plane, the filter bank plays an essential role in the identification of perceptual irrelevancies. Additionally, the time-frequency parameters generated by the filter bank provide a signal mapping that is conveniently manipulated to shape the coding distortion. On the other hand, by decomposing the signal into its constituent frequency components, the filter bank also assists in the reduction of statistical redundancies.

This chapter provides a perspective on filter-bank design and other techniques of particular importance in audio coding. The chapter is organized as follows. Sections 6.2 and 6.3 introduce filter-bank design issues for audio coding. Sections 6.4 through 6.7 review specific filter-bank methodologies found in audio codecs, namely, the two-band quadrature mirror filter (QMF), the M-band tree-structured QMF, the M-band pseudo-QMF bank, and the M-band Modified Discrete Cosine Transform (MDCT). The 'MP3' or MPEG-1 Layer III audio codec, pseudo-QMF, and MDCT are discussed in Sections 6.6 and 6.7, respectively. Section 6.8 provides filter-bank interpretations of the discrete Fourier and discrete cosine transforms. Finally, Sections 6.9 and 6.10 examine the time-domain "pre-echo" artifact in conjunction with pre-echo control techniques.

Beyond the references cited in this chapter, the reader is referred to in-depth tutorials on filter banks that have appeared in the literature [Croc83] [Vaid87] [Vaid90] [Malv91] [Vaid93] [Akan96]. The reader may also wish to explore the connection between filter banks and wavelets that has been well documented in [Riou91] [Vett92] and in several texts [Akan92] [Wick94] [Akan96] [Stra96].

6.2 ANALYSIS-SYNTHESIS FRAMEWORK FOR M-BAND FILTER BANKS

Filter banks are perhaps most conveniently described in terms of an analysis-synthesis framework (Figure 6.1), in which the input signal s(n) is processed at the encoder by a parallel bank of (L − 1)-th order FIR bandpass filters, H_k(z). The bandpass analysis outputs

vk(n) = hk(n) ∗ s(n) = Σ_{m=0}^{L−1} s(n − m) hk(m),   k = 0, 1, ..., M − 1,   (6.1)

are decimated by a factor of M, yielding the subband sequences

yk(n) = vk(Mn) = Σ_{m=0}^{L−1} s(nM − m) hk(m),   k = 0, 1, ..., M − 1,   (6.2)

which comprise a critically sampled or maximally decimated signal representation, i.e., the number of subband samples is equal to the number of input samples. Because it is impossible to achieve perfect "brickwall" magnitude responses with finite-order bandpass filters, there is unavoidable aliasing between the decimated subband sequences. Quantization and coding are performed on the subband sequences yk(n). In the perceptual audio codec, the quantization noise is usually shaped according to a perceptual model. The quantized subband samples ŷk(n) are eventually received by the decoder, where they are upsampled by M to form the intermediate sequences

wk(n) = ŷk(n/M) for n = 0, M, 2M, 3M, ..., and wk(n) = 0 otherwise.   (6.3)

In order to eliminate the imaging distortions introduced by the upsampling operations, the sequences wk(n) are processed by a parallel bank of synthesis filters, Gk(z), and then the filter outputs are combined to form the overall output, ŝ(n).
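The per-channel operations of Eqs. (6.1)–(6.3) translate directly into a few lines of MATLAB. The sketch below is illustrative only: hk and gk are arbitrary placeholders, not a designed analysis-synthesis pair, so no alias cancellation or PR is implied.

% One channel of the M-band analysis-synthesis framework (illustrative placeholders)
M  = 8;                          % number of channels and decimation factor
hk = ones(1, 16)/16;             % placeholder analysis filter hk(n), length L = 16
gk = fliplr(hk);                 % placeholder synthesis filter gk(n)
s  = randn(1, 1024);             % input signal s(n)
vk = filter(hk, 1, s);           % Eq. (6.1): vk(n) = hk(n) * s(n)
yk = vk(1:M:end);                % Eq. (6.2): decimation by M, yk(n) = vk(Mn)
wk = zeros(1, M*length(yk));     % Eq. (6.3): upsampling by M (zero insertion)
wk(1:M:end) = yk;
sk = filter(gk, 1, wk);          % channel-k contribution to the reconstructed output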


The analysis and synthesis filters are carefully designed to cancel aliasing and imaging distortions. It can be shown [Akan96] that

ŝ(n) = (1/M) Σ_{m=−∞}^{∞} Σ_{l=−∞}^{∞} Σ_{k=0}^{M−1} s(m) hk(lM − m) gk(l − Mn),   (6.4)

or, in the frequency domain,

Ŝ(Ω) = (1/M) Σ_{k=0}^{M−1} Σ_{l=0}^{M−1} S(Ω + 2πl/M) Hk(Ω + 2πl/M) Gk(Ω).   (6.5)

For perfect reconstruction (PR) filter banks, the output ŝ(n) will be identical to the input s(n) within a delay, i.e., ŝ(n) = s(n − n0), as long as there is no quantization noise introduced, i.e., ŷk(n) = yk(n). This is naturally not the case for a codec, and therefore quantization sensitivity is an important filter bank property, since PR guarantees are lost in the presence of quantization.

Figures 6.2 and 6.3, respectively, give example magnitude responses for banks of uniform and nonuniform bandwidth filters that can be realized within the framework of Figure 6.1. A uniform bandwidth M-channel filter bank is shown in Figure 6.2. The M analysis filters have normalized center frequencies (2k + 1)π/2M, and are characterized by individual impulse responses hk(n), as well as frequency responses Hk(Ω), 0 ≤ k ≤ M − 1.

Some of the popular audio codecs contain parallel bandpass filters of uniform bandwidth, similar to Figure 6.2. Other coders strive for a "critical band" analysis by relying upon filters of nonuniform bandwidth. The octave-band filter bank,

Figure 6.1. Uniform M-band, maximally decimated analysis-synthesis filter bank.


Figure 6.2. Magnitude frequency response for a uniform M-band filter bank (oddly stacked).

Figure 6.3. Magnitude frequency response for an octave-band filter bank.

for which the four-band case is illustrated in Figure 6.3, is sometimes used as an approximation to the auditory filter bank, albeit a poor one.

As shown in Figure 6.3, octave-band analysis filters have normalized center frequencies and filter bandwidths that are dyadically related. Naturally, much better approximations are possible.

6.3 FILTER BANKS FOR AUDIO CODING: DESIGN CONSIDERATIONS

This section addresses the issues that govern the selection of a filter bank for audio coding. Efficient coding performance depends heavily on adequately matching the properties of the analysis filter bank to the characteristics of the input signal. Algorithm designers face an important and difficult tradeoff between time and frequency resolution when selecting a filter-bank structure [Bran92a]. Failure to choose a suitable filter bank can result in perceptible artifacts in the output (e.g., pre-echoes) or low coding gain and therefore high bit rates. No single tradeoff between time and frequency resolution is optimal for all signals. We will present three examples to illustrate the challenge facing codec designers. In the first example, we consider the importance of matching time-frequency analysis resolution to the signal-dependent distribution of masking power in the time-frequency plane. The second example illustrates the effect of inadequate frequency resolution on perceptual bit allocation. Finally, the third example illustrates the effect of inadequate time resolution on perceptual bit allocation. These examples clarify the fundamental tradeoff required during filter-bank selection for perceptual coding.


6.3.1 The Role of Time-Frequency Resolution in Masking Power Estimation

Through schematic representations of masking thresholds for castanets and piccolo, Figure 6.4(a,b) illustrates the difficulty of selecting a single filter bank to satisfy the diverse time and frequency resolution requirements associated with different classes of audio. In the figures, darker regions correspond to higher masking thresholds. To realize maximum coding gain, the strongly harmonic piccolo signal clearly calls for fine frequency resolution and coarse time resolution, because the masking thresholds are quite localized in frequency. Quite the opposite is true of the castanets. The fast attacks associated with this percussive sound create highly time-localized masking thresholds that are also widely dispersed in frequency. Therefore, adequate time resolution is essential for accurate estimation of the highly time-varying masked threshold. Naturally, similar resolution properties should also be associated with the filter bank used to decompose the signal into a parametric set for quantization and encoding. Using real signals and filter banks, the next two examples illustrate the bit rate impact of adequate and inadequate resolutions in each domain.

6.3.2 The Role of Frequency Resolution in Perceptual Bit Allocation

To demonstrate the importance of matching a filter bank's resolution properties with the noise-shaping requirements imposed by a perceptual model, the next two examples combine high- and low-resolution filter banks with two input extremes, namely those of a sinusoid and an impulse. First, we consider the importance of adequate frequency resolution. The need for high-resolution frequency analysis is most pronounced when the input contains strong sinusoidal components. Given a tonal input, inadequate frequency resolution can produce unreasonably high signal-to-noise ratio (SNR) requirements within individual subbands, resulting in high bit rates. To see this, we compare the results of processing a 2.7-kHz pure tone first with a 32-channel and then with a 1024-channel MDCT filter bank, as shown in Figure 6.5(a) and Figure 6.5(b), respectively. The vertical line in each figure represents the frequency and level (80 dB SPL) of the input tone. In the figures, the masked threshold associated with the sinusoid is represented by a solid, nearly triangular line. For each filter bank in Figure 6.5(a) and Figure 6.5(b), the band containing most of the signal energy is quantized with sufficient resolution to create an in-band SNR of 15.4 dB. Then, the quantization noise is superimposed on the masked threshold. In the 32-band case (Figure 6.5a), it can be seen that the quantization noise at an SNR of 15.4 dB spreads considerably beyond the masked threshold, implying that significant artifacts will be audible in the reconstructed signal. On the other hand, the improved selectivity of the 1024-channel filter bank restricts the spread of quantization noise at 15.4 dB SNR to well within the limits of the masked threshold (Figure 6.5b). The figure clearly shows that for a tonal signal, good frequency selectivity is essential for low bit rates. In fact, the 32-channel filter bank for this signal requires greater than a 60 dB SNR to satisfy the masked threshold (Figure 6.5c). This high cost (in terms of bits required) results


Figure 6.4. Masking thresholds in the time-frequency plane: (a) castanets, (b) piccolo (after [Prin95]). Axes are time (ms) and frequency (Barks).

from the mismatch between the filter bank's poor selectivity and the very limited downward spread of masking in the human ear. As this experiment would imply, high-resolution frequency analysis is usually appropriate for tonal audio signals.

6.3.3 The Role of Time Resolution in Perceptual Bit Allocation

A time-domain dual of the effect observed in the previous example can be used to illustrate the importance of adequate time resolution. Whereas the previous experiment showed that simultaneous masking criteria determine how much frequency


Figure 6.5. The effect of frequency resolution on perceptual SNR requirements for a 2.7-kHz pure tone input. The input tone is represented by the spectral line at 2.7 kHz with 80 dB SPL. The conservative masked threshold due to the presence of the tone is shown. Quantization noise for a given number of bits per sample is superimposed for several precisions: (a) 32-channel MDCT with 15.4 dB in-band SNR, quantization noise spreads beyond the masked threshold; (b) 1024-channel MDCT with 15.4 dB in-band SNR, quantization noise remains below the masked threshold; (c) 32-channel MDCT requires 63.7 dB in-band SNR to satisfy the masked threshold, i.e., requires a larger bit allocation to mask the quantization noise.


resolution is necessary for good performance, one can also imagine that temporal masking effects dictate time resolution requirements. In fact, the need for good time resolution is pronounced when the input contains sharp transients. This is best illustrated with an impulse. Unlike the sinusoid with its highly frequency-localized masking power, the broadband impulse contains broadband masking power that is highly time-localized. Given a transient input, therefore, inadequate time resolution can result in a temporal smearing of quantization noise beyond the time window of effective masking. To illustrate this point, consider the results obtained by processing an impulse with the same filter banks as in the previous example. The results are shown in Figure 6.6(a) and Figure 6.6(b), respectively. The vertical line in each figure corresponds to the input impulse. The figures also show the temporal envelope of masking power associated with the impulse as a solid, nearly triangular window. For each filter bank, all subbands were quantized with identical bit allocations. Then, the error signal at the output (quantization noise) was superimposed on the masking envelope. In the 32-band case (Figure 6.6a), one can observe that the time resolution (impulse response length) restricts the spread of quantization noise to well within the limits of the masked threshold. For the 1024-band filter bank (Figure 6.6b), on the other hand, quantization noise spreads considerably beyond the time envelope masked threshold, implying that significant artifacts will be audible in the reconstructed signal. The only remedy in this case would be to "overcode" the transient so that the signal surrounding the transient receives precision adequate to satisfy the limited masking power in the regions before and after the impulse. The figures clearly show that for a transient signal, good time resolution is essential for low bit rates. The combination of long impulse responses and limited masking windows in the presence of transient signals can lead to an artifact known as "pre-echo" distortion. Pre-echo distortion and pre-echo compensation are covered at the end of this chapter, in Sections 6.9 and 6.10.

Unfortunately, most audio source material is highly nonstationary and contains significant tonal and atonal energy, as well as both steady-state and transient intervals. As a rule, signal models [John96a] tend to remain constant for long periods and then change abruptly. Therefore, the ideal coder should make adaptive decisions regarding optimal time-frequency signal decomposition, and the ideal analysis filter bank would have time-varying resolutions in both the time and frequency domains. This fact has motivated many algorithm designers to experiment with switched and hybrid filter-bank structures, with switching decisions occurring on the basis of the changing signal properties. Filter banks emulating the analysis properties of the human auditory system, i.e., those containing nonuniform "critical bandwidth" subbands, have proven highly effective in the coding of highly transient signals such as the castanets, glockenspiel, or triangle. For dense, harmonically structured signals such as the harpsichord or pitch pipe, on the other hand, the "critical band" filter banks have been less successful because of their reduced coding gain relative to filter banks with a large number of subbands. In short, several filter-bank characteristics are highly desirable for audio coding:

• Signal-adaptive time-frequency tiling
• Low-resolution, "critical-band" mode (e.g., 32 subbands)


Figure 6.6. The effect of time resolution on perceptual bit allocation for an impulse. The impulse input occurs at time 0. The conservative temporal masked threshold due to the presence of the impulse is shown. Quantization noise for a fixed number of bits per sample is superimposed for low- and high-resolution filter banks: (a) 32-channel MDCT; (b) 1024-channel MDCT.

• High-resolution mode (e.g., 4096 subbands)
• Efficient resolution switching
• Minimum blocking artifacts
• Good channel separation
• Strong stop-band attenuation


• Perfect reconstruction
• Critical sampling
• Fast implementation

Good channel separation and stop-band attenuation are particularly desirable for signals containing very little irrelevancy, such as the harpsichord. Maximum redundancy removal is essential for maintaining high quality at low bit rates for these signals. Blocking artifacts in time-varying filter banks can lead to audible distortion in the reconstruction. Beyond filter-bank-specific architectural and performance criteria, system-level considerations may also influence the best choice of filter bank for a codec design. For example, the codec architecture could contain two separate, parallel time-frequency analysis blocks, one for the perceptual model and one for generating the parametric set that is ultimately quantized and encoded. The parallel scenario offers the advantage that each filter bank can be optimized independently. This is possible since the perceptual analysis section does not typically require signal reconstruction, whereas the coefficients for coding must eventually be mapped back to the time-domain. In the interest of computational efficiency, however, many audio codecs have only one time-frequency analysis block that does "double duty," in the sense that the perceptual model obtains information from the same set of coefficients that are ultimately quantized and encoded.

Algorithms for filter-bank design, as well as fast algorithms for efficient filter-bank realizations, offer many choices to designers of perceptual audio codecs. Among the many types available are those characterized by the following:

• Uniform or nonuniform frequency partitioning
• An arbitrary number of subbands
• Perfect or almost perfect reconstruction
• Critically sampled or oversampled representations
• FIR or IIR constituent filters

In the next few sections, we will focus on the design and performance of well-known filter banks that are popular in audio coding. Rather than dealing with the efficient implementation structures that are available, we have elected, for each filter-bank architecture, to describe the individual bandpass filters in terms of impulse and frequency response functions that are easily related to the analysis-synthesis framework of Figure 6.1. These descriptions are intended to provide insight regarding the filter-bank response characteristics and to allow for comparisons across different methods. The reader should be aware, however, that structures for efficient realizations are almost always used in practice, and, because computational efficiency is of paramount importance, most audio coding filter-bank realizations, although functionally equivalent, may or may not resemble the maximally decimated analysis-synthesis structure given in Figure 6.1. In other words, most of the filter banks used in audio coders have equivalent parallel forms and can be conveniently analyzed in terms of this analysis-synthesis framework. The framework provides a useful interpretation for the sets of coefficients


generated by the unitary transforms often embedded in audio coders, such as the discrete cosine transform (DCT), the discrete Fourier transform (DFT), the discrete wavelet transform (DWT), and the discrete wavelet packet transform (DWPT).

6.4 QUADRATURE MIRROR AND CONJUGATE QUADRATURE FILTERS

The two-band quadrature mirror and conjugate quadrature filter (QMF and CQF) banks are logical starting points for the discussion on filter banks for audio coding. Two-band QMF banks were used in early subband algorithms for speech coding [Croc76] and later for the first standardized 7-kHz wideband audio algorithm, the ITU G.722 [G722]. Also, the strong connection between two-band perfect reconstruction (PR) CQF filter banks and the discrete wavelet transform [Akan96] has played a significant role in the development of high-performance audio coding filter banks. Ultimately, tree-structured cascades of the CQF filters have been used to construct several "critical-band" filter banks in a number of high quality algorithms. The two-channel bank, which can provide a building block for structured M-channel banks, is developed as follows. If the analysis-synthesis filter bank (Figure 6.1) is constrained to two channels, i.e., if M = 2, then Eq. (6.5) becomes

Ŝ(Ω) = (1/2) S(Ω) [H0²(Ω) − H1²(Ω)].   (6.6)

Esteban and Galand showed [Este77] that aliasing is cancelled between the upper and lower bands if the QMF conditions are satisfied, namely,

H1(Ω) = H0(Ω + π) ⇒ h1(n) = (−1)^n h0(n)
G0(Ω) = H0(Ω) ⇒ g0(n) = h0(n)
G1(Ω) = −H0(Ω + π) ⇒ g1(n) = −(−1)^n h0(n).   (6.7)

Thus, the two-band filter-bank design task is reduced to the design of a single lowpass filter, h0(n), under the constraint that the overall transfer function, Eq. (6.6), be an allpass function with constant group delay (linear phase). Although filter families satisfying the QMF criteria with good stop-band and transition-band characteristics have been designed (e.g., [John80]) to minimize overall distortion, the QMF conditions actually make perfect reconstruction impossible. Smith and Barnwell showed in [Smit86], however, that PR two-band filter banks based on a lowpass prototype are possible if the CQF conditions are satisfied, namely,

h1(n) = (−1)^n h0(L − 1 − n)
g0(n) = h0(L − 1 − n)   (6.8)
g1(n) = −(−1)^n h0(n).
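The bookkeeping in Eqs. (6.7) and (6.8) is easily expressed in MATLAB. In the sketch below, h0 is a generic placeholder lowpass prototype (a fir1 design, which requires the Signal Processing Toolbox) rather than an actual QMF or Smith-Barnwell CQF prototype, so the resulting banks illustrate the filter relationships only, not an optimized design.

% QMF (Eq. 6.7) and CQF (Eq. 6.8) companion filters derived from a lowpass prototype h0
L  = 16; n = 0:L-1;
h0 = fir1(L-1, 0.5);                  % placeholder lowpass prototype
% QMF conditions, Eq. (6.7)
h1_qmf = (-1).^n .* h0;               % h1(n) = (-1)^n h0(n)
g0_qmf = h0;                          % g0(n) = h0(n)
g1_qmf = -(-1).^n .* h0;              % g1(n) = -(-1)^n h0(n)
% CQF conditions, Eq. (6.8)
h1_cqf = (-1).^n .* h0(end:-1:1);     % h1(n) = (-1)^n h0(L-1-n)
g0_cqf = h0(end:-1:1);                % g0(n) = h0(L-1-n)
g1_cqf = -(-1).^n .* h0;              % g1(n) = -(-1)^n h0(n)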


Figure 6.7. Two-band Smith-Barnwell CQF filter bank magnitude frequency response, with L = 8 [Smit86].

The magnitude response of an example Smith-Barnwell CQF filter bank from [Smit86] with L = 8 is shown in Figure 6.7. The lowpass response is shown as a solid line, while the highpass is dashed. One can observe the significant overlap between channels, as well as monotonic passband and equiripple stop-band characteristics, with minimum stop-band rejection of 40 dB. As mentioned previously, efficiency concerns dictate that filter banks are rarely implemented in the direct form of Eq. (6.1). The QMF banks are most often realized using a polyphase factorization [Bell76], i.e.,

H(z) = Σ_{l=0}^{M−1} z^{-l} El(z^M),   (6.9)

where

El(z) = Σ_{n=−∞}^{∞} h(Mn + l) z^{-n},   (6.10)

which yields better than a 2:1 computational load reduction. On the other hand, the CQF filters are incompatible with the polyphase factorization, but can be efficiently realized using alternative structures such as the lattice [Vaid88].
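The polyphase components of Eqs. (6.9) and (6.10) are obtained by de-interleaving the FIR impulse response. A minimal sketch (assuming the filter length is a multiple of M) is shown below; the length-8 impulse response is purely illustrative.

% Polyphase decomposition of an FIR filter into M branches, Eqs. (6.9)-(6.10)
M = 2;                          % number of polyphase branches (decimation factor)
h = (1:8)/8;                    % placeholder FIR impulse response, length a multiple of M
E = reshape(h, M, []);          % row l+1 holds el(n) = h(Mn + l), l = 0, ..., M-1
h_check = reshape(E, 1, []);    % interleaving the rows recovers h(n), cf. Eq. (6.9)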

6.5 TREE-STRUCTURED QMF AND CQF M-BAND BANKS

Clearly, audio coders require better frequency resolution than either the QMF or CQF two-band decompositions can provide in order to realize sufficient coding


Figure 6.8. Tree-structured realization of a uniform eight-channel analysis filter bank.

gain for spectrally complex signals. Tree-structured cascades are one straightforward method for creating M-band filter banks from the two-band QMF and CQF prototypes. These are constructed as follows. The two-band filters are connected in a cascade that can be represented well using either a binary tree or a pruned binary tree. The root node of the tree is formed by a single two-band QMF or CQF section. Then, each of the root node outputs is connected to a cascaded QMF or CQF bank. The cascade structure may be continued to the depth necessary to achieve the desired magnitude response characteristics. At each node in the tree, a two-channel QMF or CQF bank operates on the output from a higher-level two-channel bank. Thus, frequency subdivision occurs through a series of two-band splits. Tree-structured filter banks have several advantages. First of all, the designer can approximate an arbitrary partitioning of the frequency axis by creating an appropriate cascade. Consider, for example, the uniform subband tree (Figure 6.8) or the octave-band tree (Figure 6.9). The ability to partition the frequency axis in a nonuniform manner also has implications for multi-resolution temporal analysis or nonuniform tiling of the time-frequency plane. This property can be advantageous if the ultimate objective is to approximate the analysis properties of the human ear, and, in fact, many algorithms make use of tree-structured filter banks for this very reason [Bran90] [Sinh93b]


[Tsut98]. In addition, the designer has the flexibility to optimize the length and other properties of the constituent filters at each node in the tree. This flexibility has been exploited to enhance the performance of several experimental audio codecs [Sinh93b] [Phil95a] [Phil95b]. Tree-structured filter banks are also attractive for their computational efficiency relative to other M-band techniques. One disadvantage of the tree-structured filter bank is that delays accumulate through the cascaded nodes, and hence the overall delay can become quite large.

The example tree in Figure 6.8 shows an eight-band uniform analysis filter bank in which the analysis filters are indexed first by level and then by type. Lowpass filters are indexed with a 0, and highpass with a 1. For instance, highpass filters at level 2 in the tree are denoted by H21(z), and lowpass by H20(z). It is often convenient to analyze the M-band tree-structured CQF or QMF bank using an equivalent parallel form. To see the connection between the cascaded

Figure 6.9. Tree-structured realization of an octave-band four-channel analysis filter bank.

Figure 6.10. The Noble identities. In each picture, the structure on the left is equivalent to the structure on the right. (a) Interchange of a filter and a downsampler: the positions are swapped after the complex variable z is replaced by z^M in the system function H(z). (b) Interchange of a filter and an upsampler: the positions are swapped after the complex variable z is replaced by z^M in the system function H(z).


Figure 6.11. Eight-channel cascaded CQF (CCQF) filter bank, N = 53. (a) Magnitude frequency responses for all eight channels. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. This view highlights the presence of a significant sidelobe in the stop-band. In this figure, N is the length of the impulse response.

tree structures (Figures 6.8 and 6.9) and the parallel analysis-synthesis structure (Figure 6.1), one can apply the "noble identities" (Figure 6.10), which allow for the interchange of the down-sampling and filtering operations. In a straightforward manner, this practice collapses the cascaded filter transfer functions into


single parallel-form analysis filters for each channel. In the case of the eight-channel bank (Figure 6.8), for example, we have

H0(z) = H00(z) H10(z^2) H20(z^4)
H1(z) = H00(z) H10(z^2) H21(z^4)   (6.11)
...
H7(z) = H01(z) H11(z^2) H21(z^4)

Figure 6.11(a) shows the magnitude spectrum of an eight-channel tree-structured filter bank based on a three-level cascade (Figure 6.8) of the same Smith-Barnwell CQF filter examined previously (Figure 6.7). Even-numbered channel responses are drawn with solid lines, and odd-numbered channel responses are drawn with dashed lines. One can observe the effect that the cascaded structure has on the shape of the channel responses. Bands 0 through 3 are each uniquely shaped and are mirror images of bands 4 through 7. Moreover, the M-band stop-band characteristics are significantly different than those of the prototype filter, i.e., the equiripple property does not extend to the M channels. Figure 6.11(b) shows |H2(Ω)|², making it possible to observe clearly a sidelobe of significant magnitude in the stop-band. The figure also illustrates the impulse response h2(n) associated with the filter H2(z). One can see that the effective length of the parallel-form impulse response represents the cumulative contributions from each of the cascaded filters.
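Equation (6.11) can be evaluated numerically by applying the noble identities directly: each stage filter is upsampled (H(z) becomes H(z^M) via zero insertion) and the path impulse responses are convolved. The sketch below does this for the lowpass path H0(z); the stage filters are placeholders (the same fir1 prototype reused at every level, requiring the Signal Processing Toolbox) rather than a designed CQF set.

% Collapse a three-level tree into the parallel-form filter H0(z) of Eq. (6.11)
h00 = fir1(7, 0.5);  h10 = h00;  h20 = h00;                 % placeholder stage filters
up  = @(h, M) reshape([h; zeros(M-1, length(h))], 1, []);   % zero insertion: impulse response of H(z^M)
H0  = conv(conv(h00, up(h10, 2)), up(h20, 4));              % H0(z) = H00(z) H10(z^2) H20(z^4)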

6.6 COSINE MODULATED "PSEUDO QMF" M-BAND BANKS

A tree-structured cascade of two-channel prototypes is only one of several well-known methods available for realization of an M-band filter bank. Although the tree structures offer opportunities for optimization at each node and are conceptually simple, the potential for long delay and irregular channel responses is sometimes unappealing. As an alternative to the tree-structured architecture, cosine modulation of a lowpass prototype filter has been used since the early 1980s [Nuss81] [Roth83] [Chu85] [Mass85] [Cox86] to realize parallel M-channel filter banks with nearly perfect reconstruction. Because they do not achieve perfect reconstruction, these filter banks are known collectively as "pseudo QMF," and they are characterized by several attractive properties:

• Constrained design; single FIR prototype filter
• Uniform, linear phase channel responses
• Overall linear phase, hence constant group delay
• Low complexity, i.e., one filter plus modulation
• Amenable to fast block algorithms
• Critical sampling


In the pseudo QMF (PQMF) bank derivation, phase distortion is completely eliminated from the overall transfer function, Eq. (6.5), by forcing the analysis and synthesis filters to satisfy the mirror image condition

gk(n) = hk(L − 1 − n).   (6.12)

Moreover, adjacent channel aliasing is cancelled by establishing precise relationships between the analysis and synthesis filters, Hk(z) and Gk(z), respectively. In the critically sampled analysis-synthesis notation of Figure 6.1, these conditions ultimately yield analysis filters given by

hk(n) = 2 w(n) cos[ (π/M)(k + 0.5)(n − (L − 1)/2) + φk ]   (6.13)

and synthesis filters given by

gk(n) = 2 w(n) cos[ (π/M)(k + 0.5)(n − (L − 1)/2) − φk ],   (6.14)

where φk = (−1)^k π/4,   (6.15)

and the sequence w(n) corresponds to the L-sample "window," a real-coefficient, linear phase FIR prototype lowpass filter with normalized cutoff frequency π/2M. Given that aliasing and phase distortions have been eliminated in this formulation, the filter-bank design procedure is reduced to the design of the window, w(n), such that overall amplitude distortion (Eq. (6.5)) is minimized. One approach [Vaid93] is to minimize a composite objective function, i.e.,

C = α c1 + (1 − α) c2,   (6.16)

where constraint c1, of the form

c1 = ∫_0^{π/M} [ |W(Ω)|² + |W(Ω − π/M)|² − 1 ]² dΩ,   (6.17)

minimizes spectral nonflatness in the reconstruction, and constraint c2, of the form

c2 = ∫_{π/2M+ε}^{π} |W(Ω)|² dΩ,   (6.18)

maximizes stop-band attenuation. The parameter ε is related to the transition bandwidth, and the parameter α determines which design constraint is more dominant.
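Once a window w(n) has been chosen, the PQMF analysis and synthesis filters follow mechanically from Eqs. (6.13)–(6.15). The sketch below uses a plain Hann-shaped placeholder window simply to keep it self-contained; an actual PQMF design would substitute a prototype optimized via Eq. (6.16).

% PQMF analysis/synthesis filters from a prototype window w(n), Eqs. (6.13)-(6.15)
M = 8;  L = 8*M;  n = 0:L-1;                       % number of bands and prototype length
w = 0.5 - 0.5*cos(2*pi*n/(L-1));                   % placeholder (Hann) window; not an optimized design
h = zeros(M, L);  g = zeros(M, L);
for k = 0:M-1
    phi = (-1)^k * pi/4;                           % Eq. (6.15)
    arg = (pi/M)*(k + 0.5)*(n - (L-1)/2);
    h(k+1, :) = 2*w .* cos(arg + phi);             % analysis filters, Eq. (6.13)
    g(k+1, :) = 2*w .* cos(arg - phi);             % synthesis filters, Eq. (6.14)
end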

The magnitude frequency response of an example eight-channel pseudo-QMF bank designed using Eqs. (6.6) and (6.7) is shown in Figure 6.12. In contrast to the previous CCQF example, one can observe that all of the channel magnitude responses are identical, modulated versions of the lowpass prototype, and therefore the passband and stop-band characteristics are uniform. The impulse


Figure 6.12. Eight-channel PQMF bank, N = 39. (a) Magnitude frequency responses for all eight channels. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Here, N is the impulse response length.

response symmetry associated with a linear phase filter is also evident in an examination of Figure 6.12(b).

The PQMF bank plays a significant role in several popular audio coding algorithms. In particular, the IS11172-3 and IS13818-3 algorithms ("MPEG-1"


[ISOI92] and "MPEG-2 BC/LSF" [ISOI94a]) employ a 32-channel PQMF bank for spectral decomposition in both layers I and II. The prototype filter, w(n), contains 512 samples, yielding better than 96-dB sidelobe suppression in the stop-band of each analysis channel. Output ripple (non-PR) is less than 0.07 dB. In addition, the same PQMF is used in conjunction with a PR cosine modulated filter bank (discussed in the next section) in layer III to form a hybrid filter-bank architecture with time-varying properties. The MPEG-1 algorithm has reached a position of prominence with the widespread use of "MP3" files (MPEG-1, layer 3) on the World Wide Web (WWW) for the exchange of audio recordings, as well as with the deployment of MPEG-1, layer II in direct broadcast satellite (DBS/DSS) and European digital audio broadcast (DAB) initiatives. Because of the availability of common algorithms for pseudo-QMF and PR QMF banks, we defer the discussion on generic complexity and efficient implementation strategies until later. In the particular case of MPEG-1, however, note that the 32-band pseudo-QMF analysis bank, as defined in the standard, requires approximately 80 real multiplies and 80 real additions per output sample [ISOI92], although a more efficient implementation based on a fast algorithm for the DCT was also proposed [Pan93] [Kons94].

6.7 COSINE MODULATED PERFECT RECONSTRUCTION (PR) M-BAND BANKS AND THE MODIFIED DISCRETE COSINE TRANSFORM (MDCT)

Although PQMF banks have been used quite successfully in perceptual audio coders, the overall system design still must compensate for the inherent distortion induced by the lack of perfect reconstruction to avoid audible artifacts in the codec output. The compensation strategy may be a simple one (e.g., increased prototype filter length), but perfect reconstruction is actually preferable because it constrains the sources of output distortion to the quantization stage. Beginning in the early 1990s, independent work by Malvar [Malv90b], Ramstad [Rams91], and Koilpillai and Vaidyanathan [Koil91] [Koil92] showed that generalized perfect reconstruction (PR) cosine modulated filter banks are possible by appropriately constraining the prototype lowpass filter, w(n), and the synthesis filters, gk(n), for 0 ≤ k ≤ M − 1. In particular, perfect reconstruction is guaranteed for a cosine-modulated filter bank with analysis filters hk(n) given by Eqs. (6.13) and (6.15) if four conditions are satisfied. First, the length L of the window w(n) must be an integer multiple of the number of subbands, i.e.,

1. L = 2mM,   (6.19)

where the parameter m is an integer greater than zero. Next, the synthesis filters gk(n) must be related to the analysis filters by a time-reversal, such that

2. gk(n) = hk(L − 1 − n).   (6.20)


In addition, the FIR lowpass prototype must have linear phase, which means that

3. w(n) = w(L − 1 − n),   (6.21)

and, finally, the polyphase components of w(n) must satisfy the pairwise power complementary requirement, i.e.,

4. Ẽk(z)Ek(z) + ẼM+k(z)EM+k(z) = α,   (6.22)

where the constant α is greater than 0, the functions Ek(z) are the k = 0, 1, 2, ..., M − 1 polyphase components (Eq. (6.9)) of W(z), and the tilde notation denotes the paraconjugate, i.e.,

Ẽ(z) = E∗(z^{-1}),   (6.23)

or, in other words, the coefficients of E(z) are conjugated, and then the complex variable z^{-1} is substituted for the complex variable z.

The generalized PR cosine-modulated filter banks developed in Eqs. (6.19) through (6.23) are of considerable interest in many applications. This section, however, concentrates on the special case that has become of central importance in the advancement of perceptual audio coding algorithms, namely, the filter bank for which L = 2M, i.e., m = 1. The PR properties of this special case were first demonstrated by Princen and Bradley [Prin86] using time-domain arguments for the development of the time domain aliasing cancellation (TDAC) filter bank. Later, Malvar [Malv90a] developed the modulated lapped transform (MLT) by restricting attention to a particular prototype filter and formulating the filter bank as a lapped orthogonal block transform. More recently, the consensus name in the audio coding literature for the lapped block transform interpretation of this special case filter bank has evolved into the modified discrete cosine transform (MDCT). To avoid confusion, we will denote throughout this book by MDCT the PR cosine-modulated filter bank with L = 2M, and we will restrict the window w(n) in accordance with Eqs. (6.19) and (6.21). In short, the reader should be aware that the different acronyms TDAC, MLT, and MDCT all refer essentially to the same PR cosine modulated filter bank. Only Malvar's MLT label implies a particular choice for w(n), as described below. From the perspective of an analysis-synthesis filter bank (Figure 6.1), the MDCT analysis filter impulse responses are given by

hk(n) = w(n) √(2/M) cos[ (2n + M + 1)(2k + 1)π / 4M ],   (6.24)

and the synthesis filters, to satisfy the overall linear phase constraint, are obtained by a time reversal, i.e.,

gk(n) = hk(2M − 1 − n).   (6.25)


This perspective is useful for visualizing individual channel characteristics in terms of their impulse and frequency responses. In practice, however, the MDCT is typically realized as a block transform, usually via a fast algorithm. The remainder of this section treats several MDCT facets that are of importance in audio coding applications, including its forward and inverse transform interpretations, prototype filter (window) design criteria, window design examples, time-varying forms, and fast algorithms.

6.7.1 Forward and Inverse MDCT

The analysis filter bank (Figure 6.13(a)) is realized as a block transform of length 2M samples, while using a block advance of only M samples, i.e., with 50% overlap between blocks. Thus, the MDCT basis functions extend across two blocks in time, leading to virtual elimination of the blocking artifacts that plague the reconstruction of nonoverlapped transform coders. Despite the 50% overlap, however, the MDCT is still critically sampled, and only M coefficients are generated by the forward transform for each 2M-sample input block. Given an input block, x(n), the transform coefficients, X(k), for 0 ≤ k ≤ M − 1, are obtained by means of the forward MDCT, defined as

X(k) = Σ_{n=0}^{2M−1} x(n) hk(n).   (6.26)

Clearly, the forward MDCT performs a series of inner products between the M analysis filter impulse responses, hk(n), and the input, x(n). On the other hand, the inverse MDCT (Figure 6.13(b)) obtains a reconstruction by computing a sum of the basis vectors weighted by the transform coefficients from two blocks. The first M samples of the k-th basis vector, hk(n) for 0 ≤ n ≤ M − 1, are weighted by the k-th coefficient of the current block, X(k). Simultaneously, the second M samples of the k-th basis vector, hk(n) for M ≤ n ≤ 2M − 1, are weighted by the k-th coefficient of the previous block, XP(k). Then, the weighted basis vectors are overlapped and added at each time index, n. Note that the extended basis functions require that the inverse transform maintain an M-sample memory to retain the previous set of coefficients. Thus, the reconstructed samples x(n), for 0 ≤ n ≤ M − 1, are obtained via the inverse MDCT, defined as

x(n) = Σ_{k=0}^{M−1} [ X(k) hk(n) + XP(k) hk(n + M) ],   (6.27)

where XP(k) denotes the previous block of transform coefficients. The overlapped analysis and overlap-add synthesis processes are illustrated in Figure 6.13(a) and Figure 6.13(b), respectively.
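A direct (non-fast) realization of Eqs. (6.24)–(6.27) is sketched below in MATLAB. The sine window of Section 6.7.3.1 is assumed purely for illustration; with any window satisfying the PR window constraints of Section 6.7.2, the overlap-add output reproduces the input exactly, apart from the first and last M samples of this finite-length simulation.

% Direct MDCT analysis and overlap-add synthesis, Eqs. (6.24)-(6.27)
M = 8;  n = 0:2*M-1;
w = sin((n + 0.5)*pi/(2*M));                        % sine window (Eq. (6.30)), illustrative choice
H = zeros(M, 2*M);
for k = 0:M-1
    H(k+1, :) = w .* sqrt(2/M) .* cos((2*n + M + 1)*(2*k + 1)*pi/(4*M));   % Eq. (6.24)
end
x    = randn(8*M, 1);                               % test signal, length a multiple of M
xhat = zeros(size(x));
Xp   = zeros(M, 1);                                 % coefficients of the previous block
for m = 0:(length(x)/M - 2)                         % advance by M samples (50% overlap)
    X = H * x(m*M + (1:2*M));                       % forward MDCT, Eq. (6.26)
    xhat(m*M + (1:M)) = H(:, 1:M).'*X + H(:, M+1:2*M).'*Xp;   % inverse MDCT + overlap-add, Eq. (6.27)
    Xp = X;
end
% xhat(M+1:end-M) now equals x(M+1:end-M) to within numerical precision.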

6.7.2 MDCT Window Design

Given the forward (Eq. (6.26)) and inverse (Eq. (6.27)) transform definitions, one still must design a suitable FIR prototype filter (window), w(n). Several general


Figure 6.13. Modified discrete cosine transform (MDCT). (a) Lapped forward transform (analysis): 2M samples are mapped to M spectral components (Eq. (6.26)). The analysis block length is 2M samples, but the analysis stride (hop size) and time resolution are M samples. (b) Inverse transform (synthesis): M spectral components are mapped to a vector of 2M samples (Eq. (6.27)) that is overlapped by M samples and added to the vector of 2M samples associated with the previous frame.

purpose orthogonal [Prin86] [Prin87] [Malv90a] and biorthogonal [Jawe95] [Smar95] [Matv96] windows have been proposed, while still other orthogonal [USAT95a] [Ferr96a] [ISOI96a] [Fiel96] and biorthogonal [Cheu95] [Malv98] windows are optimized explicitly for audio coding. In the orthogonal case, the generalized PR conditions [Vaid93] given in Eqs. (6.19)–(6.23) can be reduced to linear phase and Nyquist constraints on the window, namely,

w(2M − 1 − n) = w(n)   (6.28a)
w²(n) + w²(n + M) = 1   (6.28b)

for the sample indices 0 ≤ n ≤ M − 1. These constraints give rise to two considerations. First, unlike the pseudo-QMF bank, linear phase in the MDCT lowpass prototype does not translate into linear phase for the modulated analysis filters on each subband channel. The overall MDCT analysis-synthesis filter bank, however, is characterized by perfect reconstruction and hence linear phase with a constant group delay of L − 1 samples. Secondly, although Eqs. (6.28a) and (6.28b) guarantee an orthogonal basis for the MDCT, an orthogonal basis is not required to satisfy the PR constraints in Eqs. (6.19)–(6.23). In fact, it can be shown [Cheu95]


that Eq. (6.28b) can be revised and that perfect reconstruction for the MDCT is still guaranteed as long as it is true that

ws(n) = wa(n) / [ wa²(n) + wa²(n + M) ]   (6.29)

for the sample indices 0 ≤ n ≤ M − 1, where ws(n) denotes the synthesis window, and wa(n) denotes the analysis window. From the transform perspective, Eqs. (6.28a) and (6.29) guarantee a biorthogonal MDCT basis. Clearly, this relaxation of the prototype FIR lowpass filter design requirements increases the degrees of freedom available to the filter bank designer from M/2 to M. In effect, it is no longer necessary to use the same analysis and synthesis windows. In any case, whether an orthogonal or biorthogonal basis is used, the MDCT window design problem can be formulated in the same manner as it was for the PQMF bank (Eq. (6.16)), except that the PR property of the MDCT eliminates the spectral flatness constraint (Eq. (6.17)), such that the designer can concentrate solely on minimizing either the stop-band energy or the maximum stop-band magnitude of W(Ω). Well-known tools are available (e.g., [Pres89]) for minimizing Eq. (6.16), but in many cases one can safely forego the design process and rely instead upon the general purpose orthogonal [Prin86] [Prin87] [Malv90a] or biorthogonal [Jawe95] [Smar95] [Matv96] MDCT windows that have been proposed in the literature. In fact, several existing orthogonal [Ferr96a] [ISOI96a] [USAT95a] and biorthogonal [Cheu95] [Malv98] transform windows were explicitly designed to be, in some sense, optimal for audio coding.
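Equation (6.29) is straightforward to apply once an analysis window has been selected. The sketch below builds a Kaiser-shaped analysis window from the zeroth-order modified Bessel function (besseli) and derives the matching synthesis window; the shape parameter beta = 11 simply echoes the example cited below, and the construction is illustrative rather than a reproduction of [Cheu95].

% Biorthogonal MDCT window pair: synthesis window from Eq. (6.29)
M = 256;  N = 2*M;  beta = 11;
n = 0:N-1;
t = (n - (N-1)/2) / ((N-1)/2);                            % normalized time in [-1, 1]
wa = besseli(0, beta*sqrt(1 - t.^2)) / besseli(0, beta);  % Kaiser-shaped analysis window
ws = zeros(1, N);
ws(1:M) = wa(1:M) ./ (wa(1:M).^2 + wa(M+1:N).^2);         % Eq. (6.29), samples 0 <= n <= M-1
ws(N:-1:M+1) = ws(1:M);                                   % symmetric extension, Eq. (6.28a)
% Check: wa(1:M).*ws(1:M) + wa(M+1:N).*ws(M+1:N) is all ones (biorthogonal PR condition).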

6.7.3 Example MDCT Windows (Prototype FIR Filters)

It is instructive to consider some example MDCT windows in order to appreciate more fully the characteristics well suited to audio coding, as well as the tradeoffs that are involved in the window selection process.

6.7.3.1 Sine Window. Malvar [Malv90a] denotes by MLT the MDCT filter bank that makes use of the sine window, defined as

w(n) = sin[ (n + 1/2) π / 2M ],   (6.30)

for 0 ≤ n ≤ M − 1. This particular window is perhaps the most popular in audio coding. It appears, for example, in the MPEG-1 layer III (MP3) hybrid filter bank [ISOI92], the MPEG-2 AAC/MPEG-4 time-frequency filter bank [ISOI96a], and numerous experimental MDCT-based coders that have appeared in the literature. In fact, this window has become the de facto standard in MDCT audio applications, and its properties are typically referenced as performance benchmarks when windows are proposed. The sine window (Figure 6.14) has several unique properties that make it advantageous. First, DC energy is concentrated in a single transform coefficient, because all basis functions except for the first one have infinite attenuation at DC. Secondly, the filter bank


channels achieve 24 dB sidelobe attenuation when the sine window (Figure 6.14, dashed line) is used. Finally, the sine window has been shown [Malv90a] to make the MDCT asymptotically optimal in terms of coding gain for a lapped transform. Coding gain is desirable because it quantifies the factor by which the mean square error (MSE) is reduced when using the filter bank, relative to using direct pulse code modulated (PCM) quantization of the time-domain signal at the same rate.

6.7.3.2 Parametric Phase-Modulated Sine Window. Optimization criteria other than coding gain or DC localization are possible and have also been investigated for the MDCT. Ferreira [Ferr96a] proposed a parametric window for the orthogonal MDCT that offers a controlled tradeoff between reduction of the time-domain ringing artifacts produced by coarse quantization and reduction of stop-band leakage relative to the sine window. The window (Figure 6.14, solid), which is defined in terms of three parameters for any value of M, i.e.,

w(n) = sin[ (n + 1/2) π / 2M + φopt(n) ],   (6.31)

where φopt(n) = [ 4πβ / (1 − δ²) ] [ (4n/(2M − 2))^α − δ ] [ (4n/(2M − 2))^α − 1 ],   (6.32)

was motivated by the observation that explicit simultaneous minimization of time-domain aliasing and stop-band energy resulted in a window well approximated by a nonlinear phase difference with respect to the sine window. Moreover, the parametric solution provided nearly optimal results and was tractable, while the explicit minimization was numerically unstable for long windows. Parameters are given in [Ferr96a] for three windows that offer, respectively, time-domain aliasing/stop-band leakage percentage improvements relative to the sine window of 6.3/10.1, 8.3/0.7, and 13.3/−3.1. Figure 6.14 compares the latter parametric window (β = 0.03125, α = 0.92, δ = 0.0) in both time and frequency against the sine window. It can be seen that the negative gain in stop-band attenuation is caused by a slight increase in the first sidelobe energy. It is also clear, however, that the stop-band attenuation characteristics improve with increasing frequency. In fact, the Ferreira window has a broader range of better than 110 dB attenuation than does the sine window. This characteristic of improved ultimate stop-band rejection can be beneficial for perceptual gain, particularly for strongly harmonic signals.

6.7.3.3 Separate Analysis and Synthesis Windows – Biorthogonal MDCT Basis. Even more dramatic improvements in ultimate stop-band rejection are possible when the orthogonality constraint is removed. Cheung and Lim [Cheu95] derived for the MDCT the biorthogonality window constraint given by

Eq. (6.29), and then demonstrated with a Kaiser analysis window wa(n) the potential for improved stop-band attenuation. In a similar fashion, Figure 6.15(a) shows the analysis (solid) and synthesis (dashed) windows that result for the biorthogonal MDCT when wa(n) is a Kaiser window [Oppe99] with β = 11. The most significant benefit of this arrangement is apparent from the frequency response plot for


Figure 6.14. Orthogonal MDCT analysis/synthesis windows of Malvar [Malv90a] (dashed) and Ferreira [Ferr96a] (solid): (a) time-domain; (b) frequency-domain magnitude response. The parametric Ferreira window provides better stop-band attenuation over a broader range of frequencies at the expense of transition bandwidth and slightly reduced attenuation of the first sidelobe.

two 256-channel filter banks depicted in Figure 6.15(b). In this figure, the dashed line represents the frequency response associated with channel four of the sine window MDCT, and the lighter solid line corresponds to the frequency response associated with the same channel in the Kaiser window MDCT filter bank. Also


Figure 6.15. Biorthogonal MDCT basis. (a) Time-domain view of the analysis (solid) and synthesis (dashed) windows (Cheung and Lim [Cheu95]) that are associated with a biorthogonal MDCT basis. (b) Frequency-domain magnitude responses associated with MDCT channel four of 256 for the sine (orthogonal basis, dashed) and Kaiser (biorthogonal basis, solid) windows. The simultaneous masking threshold is superimposed for a pure tone occurring at the channel center frequency, 388 Hz. The picture demonstrates the potential for super-threshold leakage associated with the sine window and the improved stop-band attenuation realized with the Kaiser window.


superimposed on the plot is the simultaneous masking threshold generated by a 388-Hz pure tone occurring at the channel center. It can be seen that, although the main lobe for the Kaiser MDCT is somewhat broader than that of the sine MDCT, the stop-band attenuation is significantly below the masking threshold, whereas the sine window MDCT stop-band leakage has substantial super-threshold energy. The sine window, therefore, has the potential to cause artificially high bit rates because of its greater leakage. This type of artifact motivated the designers of the Dolby AC-2/AC-3 [USAT95a] and MPEG-2 AAC/MPEG-4 T-F [ISOI96a] algorithms to use customized windows rather than the standard sine window in their respective orthogonal MDCT filter banks.

6.7.3.4 The Dolby AC-2/Dolby AC-3/MPEG-2 AAC KBD Window. The Kaiser-Bessel Derived (KBD) window was obtained in a procedure devised at Dolby Laboratories. The AC-2 and AC-3 designers showed [Fiel96] that the prototype filter for an M-channel orthogonal MDCT filter bank satisfying the PR conditions (Eqs. (6.28a) and (6.28b)) can be derived from any symmetric kernel window of length M + 1 by applying a transformation of the form

wa(n) = ws(n) = √[ Σ_{j=0}^{n} v(j) / Σ_{j=0}^{M} v(j) ],   0 ≤ n < M,   (6.33)

where the sequence v(n) represents the symmetric kernel. The resulting identical analysis and synthesis windows, wa(n) and ws(n), respectively, are of length M + 1 and symmetric, i.e., w(2M − n − 1) = w(n). Note that, although a more general form of Eq. (6.33) appeared in [Fiel96], we have simplified it here for the particular case of the 50%-overlapped MDCT. During the development of the AC-2 and AC-3 algorithms, novel MDCT prototype filters optimized to satisfy a minimum masking template (e.g., Figure 6.16(a) for AC-3) were designed using Eq. (6.33) with a parametric Kaiser-Bessel kernel, v(n). At the expense of some passband selectivity, the KBD windows achieve considerably better stop-band attenuation (greater than 40 dB improvement) than the sine window (Figure 6.16b). Thus, for a pure tone occurring at the center of a particular MDCT channel, the KBD filter bank concentrates more energy into a single transform coefficient. The remaining dispersed energy tends to generate coefficient magnitudes that lie below the worst-case pure tone excitation pattern ("masking template," Figure 6.16b). Particularly for signals with adequately spaced tonal components, the presence of fewer supra-threshold MDCT components reduces the perceptual bit allocation and therefore tends to improve coding gain. In spite of the reduced bit allocation, the filter bank still renders the quantization noise inaudible, since the uncoded coefficients have smaller magnitudes than the masked threshold. A KBD filter bank simulation exemplifying this behavior for the MPEG-2 AAC algorithm is given later.
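Equation (6.33) can be evaluated in a few lines once a kernel has been chosen. The sketch below uses a Kaiser-Bessel kernel of length M + 1 built from besseli; the kernel parameter (here πα with α = 4) is an illustrative assumption and is not claimed to match the AC-2, AC-3, or AAC designs.

% Kaiser-Bessel Derived (KBD) window from a symmetric kernel v(n), Eq. (6.33)
M = 256;  alpha = 4;                                              % illustrative kernel parameter
j = 0:M;                                                          % kernel index, length M + 1
t = (j - M/2)/(M/2);                                              % normalized kernel time in [-1, 1]
v = besseli(0, pi*alpha*sqrt(1 - t.^2)) / besseli(0, pi*alpha);   % Kaiser-Bessel kernel
c = cumsum(v);                                                    % running sums of v(j)
w = sqrt(c(1:M)/c(M+1));                                          % Eq. (6.33), samples 0 <= n < M
w_full = [w, w(M:-1:1)];                                          % symmetric extension to length 2M
% w_full satisfies w^2(n) + w^2(n+M) = 1, so it meets the PR conditions (6.28a,b).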

6.7.3.5 Parametric Windows for a Biorthogonal MDCT Basis. In another example of biorthogonal window design, Malvar [Malv98] proposed the 'modulated biorthogonal lapped transform (MBLT),' a biorthogonal version of the


Figure 6.16. Dolby AC-3 (solid) vs. sine (dashed) MDCT windows: (a) time-domain views; (b) frequency-domain magnitude responses in relation to the worst-case masking template. The improved stop-band attenuation of the AC-3 (KBD) window is shown to approximate well the minimum masking template.

MDCT based on a parametric window defined as

ws(n) = { 1 − cos[ ((n + 1)/2M)^α π ] + β } / (2 + β),   (6.34)


for 0 ≤ n ≤ M − 1. Like [Cheu95], Eq. (6.34) was also motivated by a desire to realize improved stop-band attenuation. Additionally, it was used to achieve good characteristics in a novel nonuniform filter bank based on a straightforward manipulation of the MBLT. In this design, the parameter α controls the window width, while the parameter β controls its end values.

6.7.3.6 Summary and Example Eight-Channel Filter Bank (MDCT) Using a Sine Window. The foregoing examples demonstrate that MDCT window designs are predominantly concerned with optimizing, in some sense, the tradeoff between mainlobe width and stopband attenuation, as is true of any FIR filter design. We also note that biorthogonal MDCT extensions are a recent development, and, consequently, most current audio coders incorporate primarily design innovations that have occurred within the orthogonal MDCT framework. To facilitate comparisons with the previously described filter bank methodologies (QMF, CQF, tree-structured QMF, pseudo-QMF, etc.), the analysis filter magnitude responses for an example eight-channel MDCT filter bank using the sine window are shown in Figure 6.17(a). Examination of the channel-3 impulse response in Figure 6.17(b) reveals the asymmetry that precludes linear phase for the analysis filters.

6.7.3.7 Time-Varying Forms of the MDCT. One final point regarding MDCT window design is of particular relevance for perceptual audio coders. The earlier examples for tone-like and noise-like signals (Chapter 6, Section 6.2) demonstrated clearly that characteristics of the "best" filter bank for audio are signal-specific and therefore time-varying. In practice, it is very common for codecs using the MDCT (e.g., MPEG-1 [ISOI92a], MPEG-2 AAC [ISOI96a], Dolby AC-3 [USAT95a], Sony ATRAC [Tsut96], etc.) to change the number of channels, and hence the window length, to match the signal properties of the input. Typically, a binary classification scheme identifies the input as either stationary or nonstationary/transient. Then, a long window is used to maximize coding gain and achieve good channel separation during segments identified as stationary, or a short window is used to localize time-domain artifacts when pre-echoes are likely. Although the strategy has proven to be highly effective, it does complicate the codec structure. In particular, because of the time overlap between basis vectors, either boundary filters [Herl95] or special transitional windows [Herl93] are required to preserve perfect reconstruction when window switching occurs. Other schemes are also available to achieve perfect reconstruction with time-varying filter bank properties [Quei93] [Soda94], but for practical reasons these are not typically used. Consequently, window switching has been the method of choice. In this scenario, the transitional window function does not need to be symmetrical. It can be shown that the PR property is preserved as long as the transitional window satisfies the following constraints:

w²(n) + w²(M − n) = 1,   n < M   (6.35a)

w²(M + n) + w²(2M − n) = 1,   n < M   (6.35b)


Figure 6.17. Eight-channel MDCT filter bank constructed with the sine window, N = 15. (a) Magnitude frequency responses for all eight channels of the analysis filter bank. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Asymmetry is clearly visible in the channel impulse response, precluding the possibility of linear phase, although the overall analysis-synthesis filter bank has linear phase on all channels.


and provided that the relationship between the transitional window and the adjoining new-length window obeys

w_1(M + n) = w_2(M - n)    (6.36)

where w1(n) and w2(n) are the left and right window functions, respectively. In spite of the preserved PR property, it should be noted that MDCT transitional windows are highly nonideal in the sense that they seriously impair the channel selectivity and stopband attenuation of the filter bank.
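To make the window-switching mechanics concrete, the following sketch builds an MPEG-1 layer III style long (36-point) sine window and a long-to-short "start" transition window, then numerically checks that the overlapping halves of consecutive windows remain power-complementary. The block sizes, the start-window construction, and all names are illustrative assumptions rather than a normative description of any standard.

```python
import numpy as np

def sine_window(length):
    n = np.arange(length)
    return np.sin((n + 0.5) * np.pi / length)

M, Ms = 18, 6                        # long and short hop sizes (MP3-like choice)
w_long = sine_window(2 * M)          # 36-point long window
w_short = sine_window(2 * Ms)        # 12-point short window

# Long-to-short "start" transition window: long rise, flat top, short fall, zero tail.
w_start = np.concatenate([
    w_long[:M],                      # rising half of the long window
    np.ones(M - 2 * Ms),             # flat section
    w_short[Ms:],                    # falling half of the short window
    np.zeros(Ms),                    # zero tail
])

# PR check on the overlap between adjacent blocks: the falling half of the
# previous window must be power-complementary to the rising half of the next.
print(np.allclose(w_long[M:] ** 2 + w_long[:M] ** 2, 1.0))    # long -> long
print(np.allclose(w_long[M:] ** 2 + w_start[:M] ** 2, 1.0))   # long -> start
```

The asymmetry of w_start is what degrades the channel selectivity noted above, even though the overlap conditions remain satisfied.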

The Dolby AC-3 algorithm, as well as the MPEG MDCT-based coders, employs MDCT window switching to maximize filter bank-to-signal matching. The MPEG-1 layer III and MPEG-2 AAC window switching schemes use transitional windows that are described in some detail later (Section 6.10). Unlike the MPEG approach, the AC-3 algorithm maintains perfect reconstruction while avoiding transitional windows. The AC-3 applies high-resolution frequency analysis to stationary signals using an MDCT as defined in Eqs. (6.26) and (6.27), with M = 256. During transient segments, a pair of two half-length transforms (M = 128), given by

X_1(k) = \sum_{n=0}^{2M-1} x(n)\, h_{k,1}(n)    (6.37a)

X_2(k) = \sum_{n=0}^{2M-1} x(n + 2M)\, h_{k,2}(n + 2M)    (6.37b)

replaces the single long-block transform, and the short-block filter impulse responses, h_{k,1} and h_{k,2}, are defined as

h_{k,1}(n) = w(n)\,\sqrt{\frac{2}{M}}\,\cos\left[\frac{(2n + 1)(2k + 1)\pi}{4M}\right]    (6.38a)

h_{k,2}(n) = w(n)\,\sqrt{\frac{2}{M}}\,\cos\left[\frac{(2n + 2M + 1)(2k + 1)\pi}{4M}\right]    (6.38b)

The window function w(n) remains identical for both the long and short transforms. Here, the key to maintaining the PR property is that the different phase shifts in Eqs. (6.38a) and (6.38b), relative to Eq. (6.24), guarantee an orthogonal basis. Also note that the AC-3 window is customized and incorporates into its design some perceptual properties [USAT95a]. The spectral and temporal analysis tradeoffs involved in transitional window designs are well illustrated in [Shli97] for both the MPEG-1 layer III [ISOI92a] and the Dolby AC-3 [USAT95a] filter banks.
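The short sketch below is a literal transcription of Eqs. (6.37)–(6.38) as reconstructed above: it evaluates the two half-length analysis bases and applies them to the first and second halves of a windowed 4M-sample block. The window w, the block alignment, and the function name are assumptions for illustration; the AC-3 specification fixes these details.

```python
import numpy as np

def ac3_short_block_pair(x, w, M=128):
    """Pair of half-length transforms, Eqs. (6.37a,b), with bases from Eqs. (6.38a,b).
    x, w: 4M-sample input block and window (illustrative alignment)."""
    k = np.arange(M).reshape(-1, 1)

    def h1(n):   # Eq. (6.38a)
        return w[n] * np.sqrt(2.0 / M) * np.cos((2 * n + 1) * (2 * k + 1) * np.pi / (4 * M))

    def h2(n):   # Eq. (6.38b)
        return w[n] * np.sqrt(2.0 / M) * np.cos((2 * n + 2 * M + 1) * (2 * k + 1) * np.pi / (4 * M))

    n = np.arange(2 * M)
    X1 = h1(n) @ x[n]                   # Eq. (6.37a)
    X2 = h2(n + 2 * M) @ x[n + 2 * M]   # Eq. (6.37b)
    return X1, X2
```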

6.7.3.8 Fast Algorithms, Complexity, and Implementation Issues. One of the attractive properties that has contributed to the widespread use of the MDCT,


particularly in the standards, is the availability of FFT-based fast algorithms [Duha91] [Sevi94] that make the filter bank viable for real-time applications. A unified fast algorithm [Liu98] is available for the MPEG-1, -2, -4, and AC-3 long-block MDCT (Eqs. (6.26) and (6.27)), the AC-3 short-block MDCT (Eqs. (6.38a) and (6.38b)), and the MPEG-1 pseudo-QMF bank (Eqs. (6.13) and (6.14)). The computational load of [Liu98] for an M = 1024 (2048-point) MDCT (e.g., MPEG-2 AAC, AT&T PAC) is 8192 multiplies and 13,920 adds. This translates into a complexity of O(M log2 M) for multiplies and O(2M log2 M) for adds. The complexity scales accordingly for other values of M. Both [Duha91] and [Liu98] exploit the fact that the forward MDCT can be decomposed into two cascaded stages (Figure 6.18), namely, a set of M/2 butterflies followed by an M-point discrete cosine transform (DCT). The inverse transform is decomposed in the inverse manner, i.e., a DCT followed by a butterfly network. In both cases, the butterflies capture the windowing behavior, while the DCT performs the modulation and filtering. The decompositions are efficient, as well-known fast algorithms are available for the various DCT types [Rao90]. The butterflies are of low complexity, typically O(2M) for both multiplies and adds. In addition to the computationally efficient algorithms of [Duha91] and [Liu98], a regressive structure suitable for parallel VLSI implementation of the Eq. (6.26) forward MDCT was proposed in [Chia96], with a complexity of 3M adds and 2M multiplies per output for the forward transform, and 3M adds and M multiplies for the inverse transform.

As far as other implementation issues are concerned, several researchers have addressed the quantization sensitivity of the MDCT. There are available

Figure 6.18. A fast algorithm for the 2M-point (M-channel) forward MDCT (Eq. (6.26)) consists of a butterfly network and memory, followed by a Type IV DCT. The inverse structure can be formed to compute the inverse MDCT (Eq. (6.27)). Efficient FFT-based algorithms are available for the Type IV DCT.


expressions [Jako96] for the reconstruction error of the quantized system in terms of signal-correlated and uncorrelated components that can be used to assist algorithm designers in the identification and optimization of perceptually disturbing reconstruction artifacts induced by quantization noise. A more general treatment of quantization issues for PR cosine-modulated filter banks has also appeared [Akan92].

6.7.3.9 Remarks on the MDCT. The MDCT has become of central importance in audio coding, and the majority of standardized algorithms make some use of this filter bank. This section has traced the origins of the MDCT, reviewed common terminology and definitions, addressed the major window design issues, examined the strategies for time-varying implementations, and noted the availability of fast algorithms for efficient realization. It has also provided numerous examples. The important properties of the MDCT filter bank can be summarized as follows:

• Perfect reconstruction
• Overlapping basis vectors
• Linear overall filter bank phase response
• Extended ringing artifacts due to quantization
• Critical sampling
• Virtual elimination of blocking artifacts
• Constant group delay = L − 1
• Nonlinear analysis filter phase responses
• Low complexity: one filter and modulation
• Orthogonal version: M/2 degrees of freedom for w(n)
• Amenable to time-varying implementations, with some performance sacrifices
• Amenable to fast algorithms
• Constrained design: a single FIR lowpass prototype filter
• Biorthogonal version: M degrees of freedom for wa(n) or ws(n)

One can see from this synopsis that the MDCT possesses many of the qualities suitable for audio coding (Section 6.3). As a PR cosine-modulated filter bank, it inherits all of the advantages realized for the pseudo-QMF, except for phase linearity on individual analysis channels, and it does so at the expense of less than a 5 dB reduction (typically) in stopband attenuation. Moreover, the MDCT offers the added advantage that the number of parameters to be optimized in the design of the lowpass prototype is essentially reduced to M/2 in the orthogonal case. If more freedom is desired, however, one can opt for the biorthogonal construction. Finally, we have presented the MDCT as both a filter bank and a block transform. To maintain consistency, we recognize the filter-bank/transform duality of some of the other tools presented in this chapter. Recall that the MDCT


is a special case of the PR cosine-modulated filter bank for which L = 2mM and m = 1. Note, then, that the PQMF bank (Chapter 6, Section 6.6) can also be interpreted as a lapped transform for which it is possible to have L = 2mM. In the case of the MPEG-1 filter bank for layers I and II, for example, L = 512 and M = 32, or, in other words, m = 8. As the coder architectures described in Chapters 7 through 10 will demonstrate, many filter banks for audio coding are most efficiently realized as block transforms, particularly when fast algorithms are available.

6.8 DISCRETE FOURIER AND DISCRETE COSINE TRANSFORM

This section offers abbreviated filter bank interpretations of the discrete Fourier transform (DFT) and the discrete cosine transform (DCT). These classical block transforms were often used to achieve high-resolution frequency analysis in the early experimental transform-based audio coders (Chapter 7) that preceded the adaptive spectral entropy coding (ASPEC) and, ultimately, the MPEG-1 algorithms, layers I–III (Chapter 10). For example, the FFT realization of the DFT plays an important role in layer III of MPEG-1 (MP3). The FFT is embedded in efficient realizations of both MP3 hybrid filter bank stages (pseudo-QMF and MDCT), as well as in the spectral estimation blocks of the psychoacoustic models 1 and 2 recommended in the MPEG-1 standard [ISOI92]. It can be seen that block transforms are a special case of the more general uniform-band analysis-synthesis filter bank of Figure 6.1. For example, consider the unitary DFT and its inverse [Akan92], which can be written, respectively, as

X(k) = \frac{1}{\sqrt{2M}} \sum_{n=0}^{2M-1} x(n)\, W^{-nk},   0 \le k \le 2M - 1    (6.39a)

x(n) = \frac{1}{\sqrt{2M}} \sum_{k=0}^{2M-1} X(k)\, W^{nk},   0 \le n \le 2M - 1    (6.39b)

where W = e^{j\pi/M}. If the analysis filters in Eq. (6.1) all have the same length and L = 2M, then the filter bank could be interpreted as taking contiguous L-sample blocks of the input and applying to each block the transform in Eq. (6.39a). Although the DFT is usually defined with a block size of N instead of 2M, Eqs. (6.39a) and (6.39b) are given using notation slightly different from the usual to remain consistent with the convention of this chapter, throughout which the number of filter bank channels is denoted by the parameter M. The DFT has conjugate symmetry for real signals and thus, from the audio filter bank perspective, effectively half as many channels as its block length. Also from the filter bank viewpoint, the impulse response of the k-th-channel analysis filter is given by the k-th DFT basis vector, i.e.,

h_k(n) = \frac{1}{\sqrt{2M}}\, W^{kn},   0 \le n \le 2M - 1,  0 \le k \le M - 1    (6.40)


Figure 6.19. Eight-band STFT filter bank. (a) Magnitude frequency responses for all eight channels of the analysis filter bank. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Note that the impulse response is complex-valued and that only its magnitude is shown in the figure.


An example eight-channel DFT analysis filter bank magnitude frequency response appears in Figure 6.19, with the magnitude response of the third channel magnified in Figure 6.19(b). The magnitude of the complex-valued impulse response for the same channel also appears in Figure 6.19(b). Note that the DFT filter bank is evenly stacked, whereas the cosine-modulated filter banks in Sections 6.6 and 6.7 of this chapter were oddly stacked. In other words, the center frequencies for the DFT analysis and synthesis filters occur at kπ/M for 0 ≤ k ≤ M − 1, while the center frequencies for the oddly stacked filters occur at (2k + 1)π/2M for 0 ≤ k ≤ M − 1. As is evident from Figure 6.19(a), even stacking means that the low-band filter is only half the bandwidth of the other channels and that it "wraps around" the fold-over frequency.
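A minimal sketch of the filter bank interpretation in Eq. (6.40): it builds the eight complex analysis filters, evaluates their magnitude responses, and lists the even-stacked channel center frequencies. The 512-point grid used for the responses is an arbitrary choice.

```python
import numpy as np

M = 8                                    # number of channels
L = 2 * M                                # analysis filter length
n = np.arange(L)
k = np.arange(M).reshape(-1, 1)

# Eq. (6.40): h_k(n) = W^{kn} / sqrt(2M), with W = exp(j*pi/M)
H = np.exp(1j * np.pi * k * n / M) / np.sqrt(2 * M)

# Magnitude responses (dB) on a 512-point frequency grid
mag_db = 20 * np.log10(np.abs(np.fft.fft(H, 512, axis=1)) + 1e-12)

# Even stacking: channel centers at k*pi/M rad
print(np.round(k.ravel() / M, 3))        # 0, 0.125, 0.25, ..., 0.875 (x pi rad)
```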

A filter bank perspective can also be provided for the DCT. As a block transform, the forward DCT (Type II) and its inverse are given by the analysis and synthesis expressions, respectively,

X(k) = c(k)\,\sqrt{\frac{2}{M}} \sum_{n=0}^{M-1} x(n) \cos\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right) k\right],   0 \le k \le M - 1    (6.41a)

x(n) = \sqrt{\frac{2}{M}} \sum_{k=0}^{M-1} c(k)\, X(k) \cos\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right) k\right],   0 \le n \le M - 1    (6.41b)

where c(0) = 1/\sqrt{2} and c(k) = 1 for 1 ≤ k ≤ M − 1. Using the same duality arguments as for the DFT, one can view the DCT from the perspective of the analysis-synthesis filter bank (Figure 6.1), in which case the impulse response of the k-th-channel analysis filter is the k-th DCT-II basis vector, given by

h_k(n) = c(k)\,\sqrt{\frac{2}{M}} \cos\left[\frac{\pi}{M}\left(n + \frac{1}{2}\right) k\right],   0 \le n, k \le M - 1    (6.42)

As an example, the magnitude frequency responses of an eight-channel DCT analysis filter bank are given in Figure 6.20(a), and the isolated magnitude response of the third channel is given in Figure 6.20(b). The impulse response for the same channel is also given in the figure.
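As with the DFT case, the DCT-II filter bank view of Eq. (6.42) is easy to reproduce numerically. The sketch below forms the eight real analysis filters and verifies that the basis is orthonormal; the orthonormality check is a quick sanity test added here, not a statement from the text.

```python
import numpy as np

M = 8
n = np.arange(M)
k = np.arange(M).reshape(-1, 1)
c = np.where(k == 0, 1.0 / np.sqrt(2.0), 1.0)

# Eq. (6.42): DCT-II basis vectors viewed as analysis filter impulse responses
H = c * np.sqrt(2.0 / M) * np.cos(np.pi * (n + 0.5) * k / M)

print(np.allclose(H @ H.T, np.eye(M)))                     # orthonormal -> True
mag_db = 20 * np.log10(np.abs(np.fft.fft(H, 512, axis=1)) + 1e-12)
```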

6.9 PRE-ECHO DISTORTION

An artifact known as pre-echo distortion can arise in transform coders using perceptual coding rules. Pre-echoes occur when a signal with a sharp attack begins near the end of a transform block, immediately following a region of low energy. This situation can arise when coding recordings of percussive instruments such as the triangle, the glockenspiel, or the castanets, for example (Figure 6.21a). For a block-based algorithm, when quantization and encoding are performed in order to satisfy the masking thresholds associated with the block average spectral estimate, time-frequency uncertainty dictates that the inverse transform will spread quantization distortion evenly in time throughout the reconstructed block


Figure 6.20. Eight-band DCT filter bank. (a) Magnitude frequency responses for all eight channels of the analysis filter bank. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3.


(Figure 6.21b). This results in unmasked distortion throughout the low-energy region preceding in time the signal attack at the decoder. Although it has the potential to compensate for pre-echo, temporal premasking is possible only if the transform block size is sufficiently small (minimal coder delay). Percussive sounds are not the only signals likely to produce pre-echoes. Such artifacts also often plague coders when processing "pitched" signals containing nearly impulsive bursts at the beginning of each pitch period, e.g., the "German male speech" recording [Herr96]. For a male speaker with a fundamental frequency of 125 Hz, the interval between impulsive events is only 8 ms, which is much less than the typical analysis block length. Several methods proposed to eliminate pre-echoes are reviewed next.

6.10 PRE-ECHO CONTROL STRATEGIES

Several methodologies have been proposed and successfully applied in the effort to mitigate the pre-echoes that tend to plague block-based coding schemes. This section describes several of the most widespread techniques, including the bit reservoir, window switching, gain modification, switched filter banks, and temporal noise shaping. Advantages and drawbacks associated with each method are also discussed.

6.10.1 Bit Reservoir

Some coders [ISOI92] [John96c] utilize this technique to satisfy the greater bit demand associated with transients. Although most algorithms are fixed rate, the instantaneous bit rates required to satisfy masked thresholds on each frame are, in fact, time-varying. Thus, the idea behind a bit reservoir is to store surplus bits during periods of low demand, and then to allocate bits from the reservoir during localized periods of peak demand, resulting in a time-varying instantaneous bit rate but at the same time a fixed average bit rate. One problem, however, is that very large reservoirs are needed to deal with certain transient signals, e.g., "pitched" signals. Particular bit reservoir implementations are addressed later in conjunction with the MPEG [ISOI92a] and PAC [John96c] standards.
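The following is a minimal, purely illustrative sketch of bit-reservoir bookkeeping; the per-frame demand values, the reservoir limit, and all names are assumptions and do not correspond to any particular standard's rules.

```python
def grant_bits(frame_demand, mean_frame_bits, reservoir, reservoir_max):
    """Grant up to the nominal per-frame budget plus any banked surplus,
    then bank whatever the current frame did not use."""
    grant = min(frame_demand, mean_frame_bits + reservoir)
    reservoir = min(reservoir_max, reservoir + mean_frame_bits - grant)
    return grant, reservoir

# Example: a transient frame (demand 1800) draws on bits saved by two
# low-demand frames, while the long-term average stays near 1000 bits/frame.
res = 0
for demand in (600, 700, 1800):
    bits, res = grant_bits(demand, mean_frame_bits=1000,
                           reservoir=res, reservoir_max=800)
    print(bits, res)
```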

6.10.2 Window Switching

First introduced by Edler [Edle89], this is also a popular method for pre-echo suppression, particularly in the case of MDCT-based algorithms. Window switching works by changing the analysis block length from long duration (e.g., 25 ms) during stationary segments to "short" duration (e.g., 4 ms) when transients are detected (Figure 6.22). At least two considerations motivate this method. First, a short window applied to the frame containing the transient will tend to minimize the temporal spread of quantization noise, such that temporal premasking effects might preclude audibility. Secondly, it is desirable to constrain the high bit rates associated with transients to the shortest possible temporal regions. Although window switching has been successful [ISOI92] [John96c] [Tsut98], it also has


Figure 6.21. Pre-echo example (time-domain waveforms). (a) Uncoded castanets; (b) transform-coded castanets, 2048-point block size. Pre-echo distortion is clearly visible in the first 1300 samples of the reconstructed signal.


significant drawbacks. For one, the perceptual model and lossless coding portions of the coder must support multiple time resolutions. This usually translates into increased complexity. Furthermore, most coders nowadays use lapped transforms such as the MDCT. To satisfy PR constraints, window switching typically requires transition windows between the long and short blocks. Even when suitable transition windows (Figure 6.22) satisfy the PR constraints, they do so at the expense of poor time and frequency localization properties [Shli97], resulting in reduced coding gain. Other difficulties inherent to window switching schemes are increased coder delay, undesirable latency for closely spaced transients (e.g., long-start-short-stop-start-short), and impractical overuse of short windows for "pitched" signals.

6.10.3 Hybrid, Switched Filter Banks

Window switching essentially relies upon a fixed filter bank with adaptive window lengths. In contrast, the hybrid and switched filter-bank architectures rely upon distinct filter bank modes. In hybrid schemes (e.g., [Prin95]), compatible filter-bank elements are cascaded in order to achieve the time-frequency tiling best suited to the current input signal. Switched filter banks (e.g., [Sinh96]), on the other hand, make hard switching decisions on each analysis interval in order to select a single monolithic filter bank tailored to the current input. Examples of these methods are given in later chapters, along with some discussion of their associated tradeoffs.

Figure 6.22. Example window switching scheme (MPEG-1 layer III, or "MP3"). Transitional start and stop windows are required in between the long and short blocks to preserve the PR properties of the filter bank.


6.10.4 Gain Modification

The gain modification approach (Figure 6.23) has also shown promise in the task of pre-echo control [Vaup91] [Link93]. The gain modification procedure smoothes transient peaks in the time domain prior to spectral analysis. Then, perceptual coding may proceed as it does for normal, stationary blocks. Quantization noise is shaped to satisfy masking thresholds computed for the equalized long block without compensating for an undesirable temporal spread of quantization noise. A time-varying gain and the modification time interval are transmitted as side information. Inverse operations are performed at the decoder to recover the original signal. Like the other techniques, caveats also apply to this method. For example, gain modification effectively distorts the spectral analysis time window. Depending upon the chosen filter bank, this distortion could have the unintended consequence of broadening the filter-bank responses at low frequencies beyond critical bandwidth. One solution for this problem is to apply independent gain modifications selectively within only the frequency bands affected by the transient event. This selective approach, however, requires embedding of the gain blocks within a hybrid filter-bank structure, which increases coder complexity [Akag94].

6.10.5 Temporal Noise Shaping

The final pre-echo control technique considered in this section is temporal noise shaping (TNS). As shown in Figure 6.24, TNS [Herr96] is a frequency-domain technique that operates on the spectral coefficients X(k) generated by the analysis filter bank. TNS is applied only during input attacks susceptible to pre-echoes.

Figure 6.23. Gain modification scheme for pre-echo control.

Figure 6.24. Temporal noise shaping (TNS) scheme for pre-echo control.


The idea is to apply linear prediction (LP) across frequency (rather than time), since, for an impulsive time signal, frequency-domain coding gain is maximized using prediction techniques. The method works as follows. Parameters of a spectral LP synthesis filter, A(z), are estimated via application of standard minimum-MSE estimation methods (e.g., Levinson-Durbin) to the spectral coefficients X(k). The resulting prediction residual, e(k), is quantized and encoded using standard perceptual coding according to the original masking threshold. Prediction coefficients are transmitted to the receiver as side information to allow recovery of the original signal. The convolution operation associated with spectral-domain prediction is associated with multiplication in time. In a manner analogous to the source-system separation realized by time-domain LP analysis for traditional speech codecs, TNS effectively separates the time-domain waveform into an envelope and a temporally flat "excitation." Then, because quantization noise is added to the flattened residual, the time-domain multiplicative envelope corresponding to A(z) shapes the quantization noise such that it follows the original signal envelope.
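A minimal sketch of the core TNS operation just described: estimate LP coefficients across frequency with a Levinson-Durbin recursion and filter the spectrum to obtain the residual e(k); the decoder inverts the filtering. The prediction order, the use of the full spectrum rather than selected bands, and the absence of coefficient quantization are simplifying assumptions relative to the MPEG-2 AAC implementation.

```python
import numpy as np
from scipy.signal import lfilter

def levinson_durbin(r, order):
    """Prediction-error filter A(z) = 1 + a1 z^-1 + ... from autocorrelation r."""
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k * a_prev[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def tns_analysis(X, order=8):
    """Prediction across frequency: returns residual e(k) and coefficients of A(z)."""
    r = np.array([np.dot(X[:len(X) - i], X[i:]) for i in range(order + 1)])
    a = levinson_durbin(r, order)
    e = lfilter(a, [1.0], X)             # e(k) = X(k) + a1 X(k-1) + ...
    return e, a

def tns_synthesis(e_hat, a):
    """Decoder side: all-pole filtering across frequency restores the spectrum."""
    return lfilter([1.0], a, e_hat)
```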

Quantization noise for the castanets applied to a DCT-based coder is shown in Figure 6.25(a) and Figure 6.25(b), without and with TNS active, respectively. TNS clearly shapes the quantization noise to follow the input signal's energy envelope. TNS mitigates pre-echoes since the error energy is now concentrated in the time interval associated with the largest masking threshold. Although they are related as time-frequency dual operations, TNS is advantageous relative to gain shaping because it is easily applied selectively in specific frequency subbands. Moreover, TNS has the advantages of compatibility with most filter-bank structures and manageable complexity. Unlike window switching schemes, for example, TNS does not require modification of the perceptual model or lossless coding stages to a new time-frequency mapping. TNS was reported in [Herr96] to dramatically improve performance on a five-point mean opinion score (MOS) test from 2.64 to 3.54 for a particularly troublesome pitched signal, "German Male Speech," for the MPEG-2 nonbackward compatible (NBC) coder [Herr96]. A MOS improvement of 0.3 was also realized for the well-known "Glockenspiel" test signal. This ultimately led to the adoption of TNS in the MPEG NBC scheme [Bosi96a] [ISOI96a].

6.11 SUMMARY

This chapter covered the basics of time-frequency analysis techniques for audio signal processing and coding. We also highlighted the time-frequency tradeoff challenges faced by audio codec designers when designing a filter bank. We discussed both the QMF and CQF filter-bank designs and their extended tree-structured forms. We also dealt in detail with the cosine-modulated pseudo-QMF and perfect reconstruction M-band filter-bank designs. The modified discrete cosine transform (MDCT) and the various window designs were also covered in detail.


Figure 6.25. Temporal noise shaping example showing quantization noise and the input signal energy envelope for castanets: (a) without TNS and (b) with TNS.


PROBLEMS

6.1 Prove the identities shown in Figure 6.10.

6.2 Consider the analysis-synthesis filter bank shown in Figure 6.1 with input signal s(n). Show that

\hat{s}(n) = \frac{1}{M} \sum_{m=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} \sum_{k=0}^{M-1} s(m)\, h_k(lM - m)\, g_k(l - Mn),

where M is the number of subbands.

6.3 For the down-sampling and up-sampling processes given in Figure 6.26, show that

S_d(\omega) = \frac{1}{M} \sum_{l=0}^{M-1} S\!\left(\frac{\omega + 2\pi l}{M}\right) H\!\left(\frac{\omega + 2\pi l}{M}\right)

and S_u(\omega) = S(M\omega)\, G(\omega).

6.4 Using the results from Problem 6.3, prove Eq. (6.5) for the analysis-synthesis framework shown in Figure 6.1.

6.5 Consider Figure 6.27. Given s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, ..., 6, H0(z) = 1 − z^{-1}, and H1(z) = 1 + z^{-1}:

a. Design the synthesis filters G0(z) and G1(z) in Figure 6.27 such that aliasing distortions are minimized.

b. Write the closed-form expressions for v0(n), v1(n), y0(n), y1(n), w0(n), w1(n), and the synthesized waveform ŝ(n). In Figure 6.27, assume ŷi(n) = yi(n) for i = 0, 1.

c. Assuming an alias-free scenario, show that ŝ(n) = αs(n − n0), where α is the QMF bank gain and n0 is a delay that depends on Hi(z) and Gi(z). Estimate the value of n0.

d. Repeat steps (a) and (c) for H0(z) = 1 − 0.75z^{-1} and H1(z) = 1 + 0.75z^{-1}.

Figure 6.26. Down-sampling and up-sampling processes.

Figure 6.27. A two-band maximally decimated analysis-synthesis filter bank.


6.6 In this problem, we will compare the two-band QMF and CQF designs. Given H0(z) = 1 − 0.9z^{-1}:

a. Design a two-band (i.e., M = 2) QMF [use Eq. (6.7)].

b. Design a two-band CQF for L = 8 [use Eq. (6.8)].

c. Consider the two-band QMF and CQF banks in Figure 6.28 with input signal s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, ..., 6. Compare the designs in (a) and (b) and check for alias-free reconstruction in the case of the CQF. Give the delay values d1 and d2.

d. Extend the two-band QMF design in part (a) to polyphase factorization [use Eqs. (6.9) and (6.10)]. What are the advantages of employing polyphase factorization?

6.7 In this problem, we will design and analyze a four-channel uniform tree-structured QMF bank.

a. Given H00(z) = 1 + 0.1z^{-1} and H10(z) = 1 + 0.9z^{-1}, complete the tree-structured QMF bank (use Figure 6.8) for four channels.

b. Using the identities given in Figure 6.10 (or Eq. (6.11)), construct a parallel analysis-synthesis filter bank. The parallel analysis-synthesis filter bank structure must be similar to the one shown in Figure 6.1, with M = 4.

c. Plot the frequency responses of the resulting parallel filter bank analysis filters, i.e., H0(z), H1(z), H2(z), and H3(z). Comment on the passband and stopband structures of the magnitude responses associated with these filters.

d. Plot the impulse response h1(n). Is h1(n) symmetric?

6.8 Repeat Problem 6.7 for a four-channel uniform tree-structured CQF bank with L = 4.

6.9 A time-domain plot of an audio signal is shown in Figure 6.29. Given the flexibility to encode the regions A through E with varying frame sizes, which of the following choices is preferred?

Choice I: Long frames in regions B and D.
Choice II: Long frames in regions A, C, and E.

Figure 6.28. A two-band QMF and CQF design comparison.


Figure 6.29. An example audio segment with harmonics (region A), a transient (region B), background noise (region C), exponentially weighted harmonics (region D), and background noise (region E) segments.

Choice III: Short frames in region B only.
Choice IV: Short frames in regions B and D.
Choice V: Short frames in region A only.

Explain how you would assign frequency resolution (high or low) among the regions A, B, C, D, and E.

6.10 A pure tone at f0 with P0 dB SPL is encoded such that the quantization noise is masked. Let us assume that a 256-point MDCT produced an in-band signal-to-noise ratio of SNR_A and encodes the tone with b_A bits/sample, and that a 1024-point MDCT yielded SNR_B and encodes the tone with b_B bits/sample.

Figure 6.30. Audio frames x1(n), x2(n), x3(n), and x4(n) for Problem 6.11.


In which of the two cases will we require the most bits/sample (state whether b_A > b_B or b_A < b_B) to mask the quantization noise?

6.11 Given the signals x1(n), x2(n), x3(n), and x4(n) as shown in Figure 6.30, let all the signals be of length 1024 samples. When transform coded using a 512-point MDCT, which of the signals will result in pre-echo distortion?

COMPUTER EXERCISES

6.12 In this problem, we will study filter banks that are based on the DFT. Use Eq. (6.39a) to implement a 2M-point DFT of x(n) given in Figure 6.31. Assume M = 8.

a. Give the plots of |X(k)|.

b. Plot the frequency responses of the second- and third-channel analysis filters that are associated with the basis vectors h1(n) and h2(n).

c. State whether the DFT filter bank is evenly stacked or oddly stacked.

6.13 In this problem, we will study filter banks that are based on the DCT. Use Eq. (6.41a) to implement an M-point DCT of x(n) given in Figure 6.31. Assume M = 8.

a. Give the plots of |X(k)|.

b. Also plot the frequency responses of the second- and third-channel analysis filters that are associated with the basis vectors h1(n) and h2(n).

c. Plot the impulse response of h1(n) and see if it is symmetric.

d. Is the DCT filter bank evenly stacked or oddly stacked?

6.14 In this problem, we will study filter banks based on the MDCT.

a. First design a sine window, w(n) = sin[(2n + 1)π/4M], with M = 8.

b. Check whether the sine window satisfies the generalized perfect reconstruction conditions, i.e., Eqs. (6.28a) and (6.28b).

c. Next, design an MDCT analysis filter bank, hk(n), for 0 ≤ k ≤ 7.

Figure 6.31. Input signal x(n) for Problems 6.12, 6.13, and 6.14.


d. Plot both the impulse responses and the frequency responses of the analysis filters h1(n) and h2(n).

e. Compute the MDCT coefficients X(k) of the input signal x(n) shown in Figure 6.31.

f. Is the impulse response of the analysis filter h1(n) symmetric?

6.15 Show analytically that a DCT can be implemented using FFTs. Also, use x(n) given in Figure 6.32 as your test signal and verify your software implementation.

6.16 Give expressions for the DCT-I, DCT-II, DCT-III, and DCT-IV orthonormal transforms (e.g., see [Rao90]). Use the signals x1(n) and x2(n) shown in Figure 6.33 to study the differences in the 4-point DCT coefficients obtained from the different types of DCT. Describe, in general, whether choosing a particular type of DCT affects the energy compaction of a signal.

6.17 In this problem, we will design a two-band (M = 2) cosine-modulated PQMF bank with L = 8.

a. First design a linear-phase FIR prototype lowpass filter (i.e., w(n)) with normalized cutoff frequency π/4. Plot the frequency response of this window. Use the fir2 command in MATLAB to design the lowpass filter.

b. Use Eqs. (6.13) and (6.14) to design the PQMF analysis and synthesis filters, respectively.

Figure 6.32. Input signal x(n) for Problem 6.15.

Figure 6.33. Test signals to study the differences among various types of orthonormal DCT transforms.


Figure 6.34. A two-band PQMF design.

c. In Figure 6.34, use s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, ..., 6, and generate s3(n). Compare s(n) and s3(n) and comment on any type of distortion that you may observe.

d. What are the advantages of employing a cosine-modulated filter bank over the two-band QMF and two-band CQF?

e. List some of the key differences between the CQF and the PQMF in terms of 1) the analysis filter bank frequency responses, 2) phase distortion, and 3) impulse response symmetries.

f. Use the sine window, w(n) = sin[(2n + 1)π/4M], with M = 8 and repeat steps (b) and (c).

CHAPTER 7

TRANSFORM CODERS

7.1 INTRODUCTION

Transform coders make use of unitary transforms (e.g., DFT, DCT, etc.) for the time/frequency analysis section of the audio coder shown in Figure 1.1. Many transform coding schemes for wideband and high-fidelity audio have been proposed, starting with some of the earliest perceptual audio codecs. For example, in the mid-1980s, Krahe applied psychoacoustic bit allocation principles to a transform coding scheme [Krah85] [Krah88]. Schroeder [Schr86] later extended these ideas into multiple adaptive spectral audio coding (MSC). The MSC utilizes a 1024-point DFT, then groups coefficients into 26 subbands inspired by the critical bands of the ear. This chapter gives an overview of algorithms that were proposed for transform coding of high-fidelity audio following the early work of Schroeder [Schr86].

The chapter is organized as follows. Sections 7.2 through 7.5 describe in some detail the transform coding algorithms proposed by Brandenburg, Johnston, and Mahieux [Bran87b] [John88a] [Mahi89] [Bran90]. Most of this research became connected with the MPEG standardization, and the ISO/IEC eventually clustered these algorithms into a single candidate algorithm, called adaptive spectral entropy coding (ASPEC) of high-quality music signals [Bran91]. The ASPEC algorithm (Section 7.6) has become part of the ISO/IEC MPEG-1 [ISOI92] and the MPEG-2 BC-LSF [ISOI94a] audio coding standards. Sections 7.7 and 7.8 are concerned with two transform coefficient substitution schemes, namely, the differential perceptual audio coder (DPAC) and the DFT noise substitution algorithm. Finally,




Sections 7.9 and 7.10 address several early applications of vector quantization (VQ) to transform coding of high-fidelity audio.

The algorithms described in this chapter that make use of modulated filter banks (e.g., ASPEC, DPAC, TwinVQ) can also be characterized as high-resolution subband coders. Typically, transform coders perform high-resolution frequency analysis, while subband coders rely on a coarse division of the frequency spectrum. In many ways the transform and subband coder categories overlap, and in some cases it is hard to categorize a coder in a definite manner. The source of this overlap of the transform/subband categories comes from the fact that block transform realizations are used for cosine-modulated filter banks.

7.2 OPTIMUM CODING IN THE FREQUENCY DOMAIN

Brandenburg in 1987 proposed a 132 kb/s algorithm known as optimum coding in the frequency domain (OCF) [Bran87b], which is in some respects an extension of the well-known adaptive transform coder (ATC) for speech. The OCF was refined several times over the years, with two enhanced versions appearing after the original algorithm. The OCF is of interest because of its influence on current standards.

The original OCF (Figure 7.1) works as follows. The input signal is first buffered in 512-sample blocks and transformed to the frequency domain using the DCT. Next, transform components are quantized and entropy coded. A single quantizer is used for all transform components. Adaptive quantization and entropy coding work together in an iterative procedure to achieve a fixed bit rate. The initial quantizer step size is derived from the spectral flatness measure (Eq. (5.13)).

In the inner loop of Figure 7.1, the quantizer step size is iteratively increased and a new entropy-coded bit stream is formed at each update until the desired bit

Figure 7.1. OCF encoder (after [Bran88b]).


rate is achieved. Increasing the step size at each update produces fewer levels, which in turn reduces the bit rate. Using a second iterative procedure, a perceptual analysis is introduced after the inner loop is done. First, critical band analysis is applied. Then, a masking function is applied which combines a flat −6 dB masking threshold with an interband masking threshold, leading to an estimate of JND for each critical band. If, after inner-loop quantization and entropy encoding, the measured distortion exceeds JND in at least one critical band, then quantization step sizes are adjusted only in the out-of-tolerance critical bands. The outer loop repeats until the JND criteria are satisfied or a maximum loop count is reached. Entropy-coded transform components are then transmitted to the receiver, along with side information, which includes the log-encoded SFM, the number of quantizer updates during the inner loop, and the number of step size reductions that occurred for each critical band in the outer loop. This side information is sufficient to decode the transform components and perform reconstruction at the receiver.
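The nested rate/distortion control described above can be summarized in a short, self-contained sketch. This is an illustrative simplification (uniform quantizer, crude bit-count proxy, arbitrary step-size update factors) and not the OCF specification.

```python
import numpy as np

def ocf_style_loops(coeffs, band_edges, jnd, target_bits, max_outer=32):
    """OCF-style nested loops: the inner loop coarsens a global step size until
    the frame fits the bit budget; the outer loop reduces step sizes only in
    critical bands whose quantization error exceeds JND."""
    band_scale = np.ones(len(jnd))                     # per-band step-size scaling
    widths = np.diff(band_edges)
    for _ in range(max_outer):
        step = 0.5                                     # inner (rate) loop
        while True:
            scale = step * band_scale.repeat(widths)
            q = np.round(coeffs / scale)
            bits = np.sum(np.ceil(np.log2(2 * np.abs(q) + 2)))   # crude rate proxy
            if bits <= target_bits:
                break
            step *= 1.25                               # coarser steps -> fewer bits
        err = coeffs - q * scale
        viol = [b for b in range(len(jnd))
                if np.sum(err[band_edges[b]:band_edges[b + 1]] ** 2) > jnd[b]]
        if not viol:                                   # outer (distortion) loop done
            return q, step, band_scale
        for b in viol:
            band_scale[b] *= 0.8                       # step-size reduction in band b
    return q, step, band_scale
```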

Brandenburg in 1988 reported an enhanced OCF (OCF-2), which achieved subjective quality improvements at a reduced bit rate of only 110 kb/s [Bran88a]. The improvements were realized by replacing the DCT with the MDCT and adding a pre-echo detection/compensation scheme. Reconstruction quality is improved due to the effective time resolution increase (i.e., 50% time overlap) associated with the MDCT. OCF-2 quality is also improved for difficult signals such as triangle and castanets due to a simple pre-echo detection/compensation scheme. The encoder detects pre-echoes using analysis-by-synthesis. Pre-echoes are detected when the noise energy in a reconstructed segment (16 samples = 0.36 ms at 44.1 kHz) exceeds the signal energy. The encoder then determines the frequency below which 90% of the signal energy is contained and transmits this cutoff to the decoder. Given pre-echo detection at the encoder (1 bit) and a cutoff frequency, the decoder discards frequency components above the cutoff, in effect low-pass filtering pre-echoes. Due to these enhancements, the OCF-2 was reported to achieve transparency over a wide variety of source material.

Later in 1988, Brandenburg reported further OCF enhancements (OCF-3), in which better quality was realized at a lower bit rate (64 kb/s) with reduced complexity [Bran88b]. This was achieved through differential coding of spectral components to exploit correlation between adjacent samples, an enhanced psychoacoustic model modified to account for temporal masking, and an improved rate-distortion loop.

7.3 PERCEPTUAL TRANSFORM CODER

While Brandenburg developed the OCF algorithm, similar work was simultaneously underway at AT&T Bell Labs. Johnston developed several DFT-based transform coders [John88a] [John89] for audio during the late 1980s that became an integral part of the ASPEC proposal. Johnston's work in perceptual entropy [John88b] forms the basis for a transform coder reported in 1988 [John88a] that


Figure 7.2. PXFM encoder (after [John88a]).

achieves transparent coding of FM-quality monaural audio signals (Figure 7.2). A stereophonic coder based on similar principles was developed later.

7.3.1 PXFM

A monaural algorithm, the perceptual transform coder (PXFM), was developed first. The idea behind the PXFM is to estimate the amount of quantization noise that can be inaudibly injected into each transform-domain subband using PE estimates. The coder works as follows. The signal is first windowed into overlapping (1/16) segments and transformed using a 2048-point FFT. Next, the PE procedure described in Section 5.6 is used to estimate JND thresholds for each critical band. Then, an iterative quantization loop adapts a set of 128 subband quantizers to satisfy the JND thresholds until the fixed bit rate is achieved. Finally, quantization and bit packing are performed. Quantized transform components are transmitted to the receiver along with appropriate side information. Quantization subbands consist of 8-sample blocks of complex-valued transform components. The quantizer adaptation loop first initializes the j ∈ [1, 128] subband quantizers (1024 unique FFT components/8 components per subband) with kj levels and step sizes of Ti, as follows:

k_j = 2\,\mathrm{nint}\!\left(\frac{P_j}{T_i}\right) + 1    (7.1)

where Ti are the quantized critical band JND thresholds, Pj is the quantized magnitude of the largest real or imaginary transform component in the j-th subband, and nint() is the nearest-integer rounding function. The adaptation process involves repeated application of two steps. First, bit packing is attempted using the current quantizer set. Although many bit packing techniques are possible, one simple scenario involves sorting quantizers in kj order, then filling 64-bit words with encoded transform components according to the sorted results. After bit packing, the Ti are adjusted by a carefully controlled scale factor, and the adaptation cycle repeats. Quantizer adaptation halts as soon as the packed data length satisfies the desired bit rate. Both Pj and the modified Ti are quantized on a dB scale using 8-bit uniform quantizers with a 170 dB dynamic


range. These parameters are transmitted as side information and used at the receiver to recover the quantization levels (and thus implicit bit allocations) for each subband, which are, in turn, used to decode quantized transform components. The DC FFT component is quantized with 16 bits and is also transmitted as side information.
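A small numerical illustration of the quantizer initialization in Eq. (7.1); the peak magnitude and threshold values below are made up for the example.

```python
import numpy as np

def pxfm_levels(P_j, T_i):
    """Eq. (7.1): number of quantizer levels for subband j, given the peak
    component magnitude P_j and the critical-band JND threshold T_i."""
    return 2 * int(np.rint(P_j / T_i)) + 1

print(pxfm_levels(12.3, 1.7))   # illustrative values -> 15 levels
```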

7.3.2 SEPXFM

In 1989, Johnston extended the PXFM coder to handle stereophonic signals (SEPXFM) and attained transparent coding of a CD-quality stereophonic channel at 192 kb/s, or 2.2 bits/sample. SEPXFM [John89] realizes performance improvements over PXFM by exploiting inherent stereo cross-channel redundancy and by assuming that both channels are presented to a single listener rather than being used as separate signal sources. The SEPXFM structure is similar to that of PXFM, with variable-radix bit packing replaced by adaptive entropy coding. Side information is therefore reduced to include only adjusted JND thresholds (step sizes) and pointers to the entropy codebooks used in each transform-domain subband. The coder works in the following manner. First, sum (L + R) and difference (L − R) signals are extracted from the left (L) and right (R) channels to exploit left/right redundancy. Next, the sum and difference signals are windowed and transformed using the FFT. Then, a single JND threshold for each critical band is established via the PE method using the summed power spectra from the L + R and L − R signals. A single combined JND threshold is applied to quantization noise shaping for both signals (L + R and L − R), based upon the assumption that a listener is more than one "critical distance" [Jetz79] away from the stereo speakers.

Like PXFM, a fixed bit rate is achieved by applying an iterative threshold adjustment procedure after the initial determination of JND levels. The adaptation process, analogous to PXFM bit rate adjustment and bit packing, consists of several steps. First, transform components from both (L + R) and (L − R) are split into subband blocks, each averaging 8 real/imaginary samples. Then, one of six entropy codebooks is selected for each subband based on the average component magnitude within that subband. Next, transform components are quantized given the JND levels and encoded using the selected codebook. Subband codebook selections are themselves entropy encoded and transmitted as side information. After encoding, JND thresholds are scaled by an estimator and the quantizer adaptation process repeats. Threshold adaptation stops when the combined bitstream of quantized JND levels, Huffman-encoded (L + R) components, Huffman-encoded (L − R) components, and Huffman-encoded average magnitudes achieves the desired bit rate. The Huffman codebooks are developed using a large music and speech database. They are optimized for difficult signals at the expense of mean compression rate. It is also interesting to note that headphone listeners reported no noticeable acoustic mixing, despite the critical distance assumption and the single combined JND level estimate for both channels (L + R) and (L − R).


7.4 BRANDENBURG-JOHNSTON HYBRID CODER

Johnston and Brandenburg [Bran90] collaborated in 1990 to produce a hybrid coder that, strictly speaking, is both a subband and transform coding algorithm. The idea behind the hybrid coder is to improve time and frequency resolution relative to OCF and PXFM by constructing a filter bank that more closely resembles the auditory filter bank. This is accomplished at the encoder by first splitting the input signal into four octave-width subbands using a QMF filter bank.

The decimated output sequence from each subband is then followed by one or more transforms to achieve the desired time/frequency resolution, Figure 7.3(a). Both the DFT and the MDCT were investigated. Given the tiling of the time-frequency plane shown in Figure 7.3(b), frequency resolution at low frequencies (23.4 Hz) is well matched to the ear, while the time resolution at high frequencies (2.7 ms) is sufficient for pre-echo control.

The quantization and coding schemes of the hybrid coder combine elements from both PXFM and OCF. Masking thresholds are estimated using the PXFM

[Figure 7.3 detail recovered from the original diagram: three 80-tap QMF stages split the input at 12 kHz, 6 kHz, and 3 kHz; the lowest band feeds a 128-point DFT (once per 1024-sample frame), while the higher bands feed 64-point DFTs (2, 4, and 8 times per frame), for a total of 320 spectral lines per frame, with resolutions ranging from 23 Hz/21 ms at low frequencies to 188 Hz/2.7 ms at high frequencies.]

Figure 7.3. Brandenburg-Johnston coder: (a) filter bank structure; (b) time/frequency tiling (after [Bran90]).


approach for eight time slices in each frequency subband. A more sophisticated tonality estimate was defined to replace the SFM (Eq. (5.13)) used in PXFM, however, such that tonality is estimated in the hybrid coder as a local characteristic of each individual spectral line. Predictability of magnitude and phase spectral components across time is used to evaluate tonality, instead of just global spectral shape within a single frame. High temporal predictability of magnitudes and phases is associated with the presence of a tonal signal. In contrast, low predictability implies the presence of a noise-like signal. The hybrid coder employs a quantization and coding scheme borrowed from OCF. As far as quality, the hybrid coder without any explicit pre-echo control mechanism was reported to achieve quality better than or equal to OCF-3 at 64 kb/s [Bran90]. The only disadvantage noted by the authors was increased complexity. A similar hybrid structure was eventually adopted in MPEG-1 and -2 layer III.

7.5 CNET CODERS

Research at the Centre National d'Etudes des Telecommunications (CNET) resulted in several transform coders based on the DFT and the MDCT.

7.5.1 CNET DFT Coder

In 1989, Mahieux, Petit, et al. proposed a DFT-based audio coding system that introduced a novel scheme to exploit DFT interblock redundancy. Nearly transparent quality was reported for 15-kHz (FM-grade) audio at 96 kb/s [Mahi89], except for some highly harmonic signals. The encoder applies first-order backward-adaptive predictors (across time) to DFT magnitude and differential phase components, then quantizes separately the prediction residuals. Magnitude and differential phase residuals are quantized using an adaptive nonuniform pdf-optimized quantizer designed for a Laplacian distribution and an adaptive uniform quantizer, respectively. The backward-adaptive quantizers are reinitialized during transients. Bits are allocated during step-size adaptation to shape quantization noise such that a psychoacoustic noise threshold is satisfied for each block. The perceptual model used is similar to Johnston's model that was described earlier in Section 5.6. The use of linear prediction is justified because it exploits magnitude and differential phase time redundancy, which tends to be large during periods when the audio signal is quasi-stationary, especially for signal harmonics. Quasi-stationarity might occur, for example, during a sustained note. A similar technique was eventually embedded in the MPEG-2 AAC algorithm.

7.5.2 CNET MDCT Coder 1

In 1990, Mahieux and Petit reported on the development of a similar MDCT-based transform coder for which they reported transparent CD-quality at 64 kb/s [Mahi90]. This algorithm introduced a novel spectrum descriptor scheme for representing the power spectral envelope. The algorithm first segments input audio into frames of 1024 samples, corresponding to 12 ms of new data per frame,


given 50% MDCT time overlap. Then, a bit allocation is computed at the encoder using a set of "spectrum descriptors." Spectrum descriptors consist of quantized sample variances for MDCT coefficients grouped into 35 nonuniform frequency subbands. Like their DFT coder, this algorithm exploits either interblock or intrablock redundancy by differentially encoding the spectrum descriptors with respect to time or frequency and transmitting them to the receiver as side information. A decision whether to code with respect to time or frequency is made on the basis of which method requires fewer bits; the binary decision requires only 1 bit. Either way, spectral descriptor encoding is done using log DPCM with a first-order predictor and a 16-level uniform quantizer with a step size of 5 dB. Huffman coding of the spectral descriptor codewords results in less than 2 bits/descriptor. A global masking threshold is estimated by convolving the spectral descriptors with a basilar spreading function on a Bark scale, somewhat like the approach taken by Johnston's PXFM. Bit allocations for quantization of normalized transform coefficients are obtained from the masking threshold estimate. As usual, bits are allocated such that quantization noise is below the masking threshold at every spectral line. Transform coefficients are normalized by the appropriate spectral descriptor, then quantized and coded, with one exception. Masked transform coefficients, which have lower energy than the global masking threshold, are treated differently. The authors found that masked coefficient bins tend to be clustered; therefore, they can be compactly represented using run-length encoding (RLE). RLE codewords are Huffman coded for maximum coding gain. The first CNET MDCT coder was reported to perform well for broadband signals with many harmonics but had some problems in the case of spectrally flat signals.
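The sketch below illustrates the kind of log-domain DPCM described for the spectrum descriptors (first-order predictor, 16-level uniform quantizer, 5 dB step). The clipping behavior, the initial predictor state, and the variable names are illustrative assumptions.

```python
import numpy as np

def descriptor_dpcm(descriptors_db, step_db=5.0, levels=16):
    """First-order log-DPCM of per-band spectrum descriptors (illustrative sketch).
    Returns quantizer indices and the locally decoded descriptors in dB."""
    half = levels // 2
    pred = 0.0
    indices, decoded = [], []
    for d in descriptors_db:
        idx = int(np.clip(np.rint((d - pred) / step_db), -half, half - 1))
        rec = pred + idx * step_db
        indices.append(idx)
        decoded.append(rec)
        pred = rec                     # predict from the previously decoded value
    return np.array(indices), np.array(decoded)

idx, rec = descriptor_dpcm(np.array([40.0, 42.5, 37.0, 30.0, 31.0]))  # made-up dB values
```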

7.5.3 CNET MDCT Coder 2

Mahieux and Petit enhanced their 64 kb/s algorithm by incorporating a sophisticated pre-echo detection and postfiltering scheme, as well as by incorporating a novel quantization scheme for two-coefficient (low-frequency) spectral descriptor bands [Mahi94]. For improved quantization performance, two-component spectral descriptors are efficiently vector encoded in terms of polar coordinates. Pre-echoes are detected at the encoder and flagged using 1 bit. The idea behind the pre-echo compensation is to temporarily activate a postfilter at the decoder in the corrupted quiet region prior to the signal attack, and therefore a stopping index must also be transmitted. The second-order IIR postfilter difference equation is given by

s_{pf}(n) = b_0 s(n) + a_1 s_{pf}(n - 1) + a_2 s_{pf}(n - 2)    (7.2)

where s(n) is the nonpostfiltered output signal that is corrupted by pre-echo distortion, s_{pf}(n) is the postfiltered output signal, and the a_i are related to the parameters α_i by

a_1 = \alpha_1 \left[1 - \left(\frac{p(0,0)}{p(0,0) + \sigma_b^2}\right)\right]    (7.3a)

a_2 = \alpha_2 \left[1 - \left(\frac{p(1,0)}{p(0,0) + \sigma_b^2}\right)\right]    (7.3b)


where the α_i are the parameters of a second-order autoregressive (AR-2) spectral estimate of the output audio, s(n), during the previous nonpostfiltered frame. The AR-2 estimate, ŝ(n), can be expressed in the time domain as

\hat{s}(n) = w(n) + \alpha_1 s(n - 1) + \alpha_2 s(n - 2)    (7.4)

where w(n) represents Gaussian white noise. The prediction error is then defined as

e(n) = s(n) - \hat{s}(n)    (7.5)

The parameters p(i,j) in Eqs. (7.3a) and (7.3b) are elements of the prediction error covariance matrix, P, and the parameter σ_b^2 is the pre-echo distortion variance, which is derived from side information. Pre-echo postfiltering and the improved quantization schemes resulted in a subjective score of 3.65 for two-channel stereo coding at 64 kb/s per channel on the 5-point CCIR 5-grade impairment scale (described in Section 12.3) over a wide range of listening material. The CCIR J.41 reference audio codec (MPEG-1 layer II) achieved a score of 3.84 at 384 kb/s per channel over the same set of tests.

7.6 ADAPTIVE SPECTRAL ENTROPY CODING

The MSC, OCF, PXFM, Brandenburg-Johnston hybrid, and CNET transform coders were eventually clustered into a single proposal by the ISO/IEC JTC1/SC2/WG11 committee. As a result, Schroeder, Brandenburg, Johnston, Herre, and Mahieux collaborated in 1991 to propose, for acceptance as the new MPEG audio compression standard, a flexible coding algorithm, ASPEC, which incorporated the best features of each coder in the group. ASPEC [Bran91] was claimed to produce better quality than any of the individual coders at 64 kb/s.

The structure of ASPEC combines elements from all of its predecessors. Like OCF and the later CNET coders, ASPEC uses the MDCT for time-frequency mapping. The masking model is similar to that used in PXFM and the Brandenburg-Johnston hybrid coders, including the sophisticated tonality estimation scheme at lower bit rates. The quantization and coding procedures use the pair of nested loops proposed for OCF, as well as the block differential coding scheme developed at CNET. Moreover, long runs of masked coefficients are run-length and Huffman encoded. Quantized scale factors and transform coefficients are Huffman coded also. Pre-echoes are controlled using a dynamic window switching mechanism, like the Thomson coder [Edle89]. ASPEC offers several modes for different quality levels, ranging from 64 to 192 kb/s per channel. A real-time ASPEC implementation for coding one channel at 64 kb/s was realized on a pair of 33-MHz Motorola DSP56001 devices. ASPEC ultimately formed the basis for layer III of the MPEG-1 and MPEG-2 BC-LSF standards. We note that similar contributions were made in the area of transform coding for audio outside of the ASPEC cluster. For example, Iwadare et al. reported on DCT-based [Sugi90] and MDCT-based [Iwad92] perceptual adaptive transform coders that control pre-echo distortion using an adaptive window size.


7.7 DIFFERENTIAL PERCEPTUAL AUDIO CODER

Other investigators have also developed promising schemes for transform coding of audio. Paraskevas and Mourjopoulos [Para95] reported on a differential perceptual audio coder (DPAC), which makes use of a novel scheme for exploiting long-term correlations. DPAC works as follows. Input audio is transformed using the MDCT. A two-state classifier then labels each new frame of transform coefficients as either a "reference" frame or a "simple" frame. The classifier labels as "reference" those frames that contain significant audible differences from the previous frame. The classifier labels nonreference frames as "simple." Reference frames are quantized and encoded using scalar quantization and psychoacoustic bit allocation strategies similar to Johnston's PXFM. Simple frames, however, are subjected to coefficient substitution. Coefficients whose magnitude differences with respect to the previous reference frame are below an experimentally optimized threshold are replaced at the decoder by the corresponding reference frame coefficients. The encoder then replaces subthreshold coefficients with zeros, thus saving transmission bits. Unlike the interframe predictive coding schemes of Mahieux and Petit, the DPAC coefficient substitution system is advantageous in that it guarantees that the "simple" frame bit allocation will always be less than or equal to the bit allocation that would be required if the frame were coded as a "reference" frame. Suprathreshold "simple" frame coefficients are coded in the same way as reference frame coefficients. DPAC performance was evaluated for frame classifiers that utilized three different selection criteria, listed below (a short code sketch of the first criterion follows the list).

1. Euclidean distance. Under the Euclidean criterion, test frames satisfying the inequality

\left[\frac{\mathbf{s}_d^T \mathbf{s}_d}{\mathbf{s}_r^T \mathbf{s}_r}\right]^{1/2} \le \lambda    (7.6)

are classified as simple, where the vectors s_r and s_t, respectively, contain the reference and test frame time-domain samples, and the difference vector, s_d, is defined as

sd = sr minus st (77)

2. Perceptual entropy. Under the PE criterion (Eq. (5.17)), a test frame is labeled as "simple" if it satisfies the inequality
$$\frac{PE_S}{PE_R} \leqslant \lambda \qquad (7.8)$$
where $PE_S$ corresponds to the PE of the "simple" (coefficient-substituted) version of the test frame, and $PE_R$ corresponds to the PE of the unmodified test frame.


3. Spectral flatness measure. Finally, under the SFM criterion (Eq. (5.13)), a test frame is labeled as "simple" if it satisfies the inequality
$$\mathrm{abs}\!\left( 10 \log_{10} \frac{SFM_T}{SFM_R} \right) \leqslant \lambda \qquad (7.9)$$

where $SFM_T$ corresponds to the test frame SFM, and $SFM_R$ corresponds to the SFM of the previous reference frame. The decision threshold λ was experimentally optimized for all three criteria. Best performance was obtained while encoding source material using a PE criterion. As far as overall performance is concerned, noise-to-mask ratio (NMR) measurements were compared between DPAC and Johnston's PXFM algorithm at 64, 88, and 128 kb/s. Despite an average drop of 30–35% in PE measured at the DPAC coefficient substitution stage output relative to the coefficient substitution input, comparative NMR studies indicated that DPAC outperforms PXFM only below 88 kb/s, and then only for certain types of source material such as pop or jazz music. The desirable PE reduction led to an undesirable drop in reconstruction quality. The authors concluded that DPAC may be preferable to algorithms such as PXFM for low-bit-rate, nontransparent applications.
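A minimal sketch of how the Euclidean and SFM classification tests of Eqs. (7.6), (7.7), and (7.9) might be evaluated is given below. The function names, the FFT-based SFM estimator, and the threshold value are illustrative assumptions and not part of the original DPAC specification; the PE criterion of Eq. (7.8) is omitted because it requires a full psychoacoustic model.

```python
import numpy as np

def euclidean_criterion(s_r, s_t):
    """Eq. (7.6): normalized Euclidean distance between reference and test frames."""
    s_d = s_r - s_t                                   # Eq. (7.7)
    return np.sqrt(np.dot(s_d, s_d) / np.dot(s_r, s_r))

def sfm(x):
    """Spectral flatness: geometric mean over arithmetic mean of the power spectrum."""
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

def sfm_criterion(s_r, s_t):
    """Eq. (7.9): absolute log-ratio (in dB) of test and reference frame SFMs."""
    return abs(10.0 * np.log10(sfm(s_t) / sfm(s_r)))

def classify_frame(s_r, s_t, lam=0.5, criterion=euclidean_criterion):
    """Label the test frame 'simple' if the chosen distance measure is below lambda."""
    return "simple" if criterion(s_r, s_t) <= lam else "reference"

# Usage with two 512-sample frames (illustrative length and threshold)
rng = np.random.default_rng(0)
ref = rng.standard_normal(512)
test = ref + 0.01 * rng.standard_normal(512)          # nearly identical -> 'simple'
print(classify_frame(ref, test))
```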

7.8 DFT NOISE SUBSTITUTION

Whereas DPAC exploits temporal correlation, a substitution technique that exploits decorrelation was devised for coding efficiently noise-like portions of the spectrum. In a noise substitution procedure [Schu96], Schulz parameterizes transform coefficients corresponding to noise-like portions of the spectrum in terms of average power, frequency range, and temporal evolution, resulting in an increased coding efficiency of 15% on average. A temporal envelope for each parametric noise band is required because transform block sizes for most codecs are much longer (e.g., 30 ms) than the human auditory system's temporal resolution (e.g., 2 ms). In this method, noise-like spectral regions are identified in the following way. First, least-mean-square (LMS) adaptive linear predictors (LP) are applied to the output channels of a multi-band QMF analysis filter bank that has as input the original audio, s(n). A predicted signal, ŝ(n), is obtained by passing the LP output sequences through the QMF synthesis filter bank. Prediction is done in subbands rather than over the entire spectrum to prevent classification errors that could result if high-energy noise subbands are allowed to dominate predictor adaptation, resulting in misinterpretation of low-energy tonal subbands as noisy. Next, the DFT is used to obtain magnitude (S(k), Ŝ(k)) and phase components (θ(k), θ̂(k)) of the input, s(n), and prediction, ŝ(n), respectively. Then, tonality, T(k), is estimated as a function of the magnitude and phase predictability, i.e.,

$$T(k) = \alpha \left| \frac{S(k) - \hat{S}(k)}{S(k)} \right| + \beta \left| \frac{\theta(k) - \hat{\theta}(k)}{\theta(k)} \right| \qquad (7.10)$$


where α and β are experimentally determined constants. Noise substitution is applied to contiguous blocks of transform coefficient bins for which T(k) is very small. The 15% average bit savings realized using this method in conjunction with transform coding is offset to a large extent by a significant complexity increase due to the additions of the adaptive linear predictors and a multi-band analysis-synthesis QMF filter bank. As a result, the author focused his attention on the application of noise substitution to QMF-based subband coding algorithms. A modified version of this scheme was adopted as part of the MPEG-2 AAC time-frequency coder within the MPEG-4 reference model [Herr98].
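As a concrete illustration of Eq. (7.10), the short sketch below computes the tonality estimate from the DFT magnitudes and phases of an input frame and its prediction. The weights α = β = 0.5 and the regularization constant are illustrative assumptions, not the experimentally determined values of [Schu96].

```python
import numpy as np

def tonality(x, x_pred, alpha=0.5, beta=0.5, eps=1e-12):
    """Eq. (7.10): tonality from magnitude and phase predictability."""
    X, Xp = np.fft.rfft(x), np.fft.rfft(x_pred)
    mag, mag_p = np.abs(X), np.abs(Xp)
    ph, ph_p = np.angle(X), np.angle(Xp)
    return (alpha * np.abs((mag - mag_p) / (mag + eps))
            + beta * np.abs((ph - ph_p) / (ph + eps)))

# Bins with very small T(k) would then be grouped into contiguous blocks and
# replaced by a parametric noise description (average power, frequency range,
# temporal envelope) instead of being coded individually.
```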

7.9 DCT WITH VECTOR QUANTIZATION

For the most part, the algorithms described thus far rely upon scalar quantization of transform coefficients. This is not unreasonable, since scalar quantization in combination with entropy coding can achieve very good performance. As one might expect, however, vector quantization (VQ) has also been applied to transform coding of audio, although on a much more limited scale. Gersho and Chan investigated VQ schemes for coding DCT coefficients subject to a constraint of minimum perceptual distortion. They reported on a variable rate coder [Chan90] that achieves high quality in the range of 55–106 kb/s for audio sequences bandlimited to 15 kHz (32-kHz sample rate). After computing the DCT on 512-sample blocks, the algorithm utilizes a novel multi-stage tree-structured VQ (MSTVQ) scheme for quantization of normalized vectors, with each vector containing four DCT components. Bit allocation and vector normalization are derived at both the encoder and decoder from a sampled power spectral envelope which consists of 29 groups of transform coefficients. A simplified masking model assumes that each sample of the power envelope represents a single masker. Masking is assumed to be additive, as in the ASPEC algorithms. Thresholds are computed as a fixed offset from the masking level. The authors observed a strong correlation between the SFM and the amount of offset required to achieve high quality. Two-segment scalar quantizers that are piecewise linear on a dB scale are used to encode the power spectral envelope. Quadratic interpolation is used to restore full resolution to the subsampled envelope.

Gersho and Chan later enhanced [Chan91b] their algorithm by improving the power envelope and transform coefficient quantization schemes. In the new approach to quantization of transform coefficients, constrained-storage VQ [Chan91a] techniques are combined with the MSTVQ from the original coder, allowing the new coder to handle peak noise-to-mask ratio (NMR) requirements without impractical codebook storage requirements. In fact, CS-MSTVQ enabled quantization of 127 four-coefficient vectors using only four unique quantizers. Power spectral envelope quantization is enhanced by extending its resolution to 127 samples. The power envelope samples are encoded using a two-stage process. The first stage applies nonlinear interpolative VQ (NLIVQ), a dimensionality reduction process which represents the 127-element power spectral envelope vector using only a 12-dimensional "feature power envelope." Unstructured VQ is applied to the feature


power envelope. Then, a full-resolution quantized envelope is obtained from the unstructured VQ index into a corresponding interpolation codebook. In the second stage, segments of a power envelope residual are encoded using 8-, 9-, and 10-element TSVQ. Relative to their first VQ/DCT coder, the authors reported savings of 10–20 kb/s with no reduction in quality due to the CS-VQ and NLIVQ schemes. Although VQ schemes with this level of sophistication typically have not been seen in the audio coding literature since [Chan90] and [Chan91b] first appeared, there have been successful applications of less-sophisticated VQ in some of the standards (e.g., [Sree98a], [Sree98b]).

7.10 MDCT WITH VECTOR QUANTIZATION

Iwakami et al. developed transform-domain weighted interleave vector quantization (TWIN-VQ), an MDCT-based coder that also involves transform coefficient VQ [Iwak95]. This algorithm exploits LPC analysis, spectral interframe redundancy, and interleaved VQ.

At the encoder (Figure 7.4), each frame of MDCT coefficients is first divided by the corresponding elements of the LPC spectral envelope, resulting in a spectrally flattened quotient (residual) sequence. This procedure flattens the MDCT envelope but does not affect the fine structure. The next step, therefore, divides the first step residual by a predicted fine structure envelope. This predicted fine structure envelope is computed as a weighted sum of three previous quantized fine structure envelopes, i.e., using backward prediction. Interleaved VQ is applied to the normalized second step residual. The interleaved VQ vectors are structured in the following way. Each N-sample normalized second step residual vector is split into K subvectors, each containing N/K coefficients. Second step residuals from the N-sample vector are interleaved in the K subvectors such that the i-th subvector contains elements i + nK, where n = 0, 1, . . . , (N/K) − 1. Perceptual weighting is also incorporated by weighting each subvector by a nonlinearly transformed version of its corresponding LPC envelope component prior to the codebook search. VQ indices are transmitted to the receiver. Side information

Figure 7.4 TWIN-VQ encoder (after [Iwak95]). The block diagram shows MDCT and LPC analysis of the input s(n), normalization of the MDCT coefficients by the LPC envelope and by an inter-frame predicted fine-structure envelope, weighted interleave VQ, and transmission of the VQ indices together with side information.


consists of VQ normalization coefficients and the LPC envelope encoded in terms of LSPs. The authors claimed higher subjective quality than MPEG-1 layer II at 64 kb/s for 48-kHz CD-quality audio, as well as higher quality than MPEG-1 layer II for 32-kHz audio at 32 kb/s.
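The interleaving of the normalized second-step residual described above can be sketched as follows; the vector length N and subvector count K are illustrative choices, and the perceptual weighting and codebook search are omitted.

```python
import numpy as np

def twinvq_interleave(residual, K):
    """Split an N-sample residual into K interleaved subvectors:
    subvector i holds elements i, i+K, i+2K, ..., i+(N/K-1)K."""
    N = len(residual)
    assert N % K == 0
    return [residual[i::K] for i in range(K)]

# Usage (illustrative sizes): N = 16 coefficients, K = 4 subvectors of N/K = 4
res = np.arange(16.0)
for i, sub in enumerate(twinvq_interleave(res, 4)):
    print(i, sub)   # subvector 0 -> [0, 4, 8, 12], subvector 1 -> [1, 5, 9, 13], ...
```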

TWIN-VQ performance at lower bit rates has also been investigated. At least three trends were identified during ISO-sponsored comparative tests [ISOI98] of TWIN-VQ and MPEG-2 AAC. First, AAC outperformed TWIN-VQ for bit rates above 16 kb/s. Secondly, TWIN-VQ and AAC achieved similar performance at 16 kb/s, with AAC having a slight edge. Finally, the performance of TWIN-VQ exceeded that of AAC at a rate of 8 kb/s. These results ultimately motivated a combined AAC/TWIN-VQ architecture for inclusion in MPEG-4 [Herre98]. Enhancements to the weighted interleaving scheme and LPC envelope representation [Mori96] enabled real-time implementation of stereo decoders on Pentium-I and PowerPC platforms. Channel error robustness issues are addressed in [Iked95]. A later version of the TWIN-VQ scheme is embedded in the set of tools for MPEG-4 audio.

7.11 SUMMARY

Transform coders for high-fidelity audio were described in this chapter. The transform coding algorithms presented include:

• the OCF algorithm,
• the monaural and stereophonic perceptual transform coders (PXFM and SEPXFM),
• the CNET DFT and MDCT coders,
• the ASPEC,
• the differential PAC, and
• the TWIN-VQ algorithm.

PROBLEMS

7.1 Given the expressions for the DFT, the DCT, and the MDCT,
$$X_{DFT}(k) = \frac{1}{\sqrt{2M}} \sum_{n=0}^{2M-1} x(n)\, e^{-j\pi nk/M}, \qquad 0 \leqslant k \leqslant 2M-1,$$
$$X_{DCT}(k) = c(k)\sqrt{\frac{2}{M}} \sum_{n=0}^{M-1} x(n)\cos\!\left[\frac{\pi}{M}\left(n+\frac{1}{2}\right)k\right], \qquad 0 \leqslant k \leqslant M-1,$$
where $c(0) = 1/\sqrt{2}$ and $c(k) = 1$ for $1 \leqslant k \leqslant M-1$, and
$$X_{MDCT}(k) = \sqrt{\frac{2}{M}} \sum_{n=0}^{2M-1} x(n)\,\underbrace{\sin\!\left[\left(n+\frac{1}{2}\right)\frac{\pi}{2M}\right]}_{w(n)} \cos\!\left[\frac{(2n+M+1)(2k+1)\pi}{4M}\right], \qquad 0 \leqslant k \leqslant M-1.$$

Figure 7.5 FFT analysis/synthesis within the two bands of a QMF bank. (The diagram shows the input x(n) split by analysis filters H0(z) and H1(z) with 2:1 decimation into xd0(n) and xd1(n); an FFT analysis/synthesis module of size N = 128 in each band that selects L out of N components; and synthesis filters F0(z) and F1(z) with 1:2 interpolation producing the output x′(n).)


Write the three transforms in matrix form as follows:
$$\mathbf{X}_T = \mathbf{H}\mathbf{x},$$
where H is the transform matrix, and x and X_T denote the input and transformed vector, respectively. Note the structure in the transform matrices.
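A short numerical sketch of Problem 7.1 follows: it builds the DCT and MDCT matrices directly from the definitions above so that their structure can be inspected. The helper names and the choice M = 4 are illustrative assumptions.

```python
import numpy as np

def dct_matrix(M):
    """H such that X_DCT = H x, per the DCT definition above (x is M x 1)."""
    n, k = np.meshgrid(np.arange(M), np.arange(M))    # k indexes rows, n indexes columns
    c = np.where(k == 0, 1.0 / np.sqrt(2.0), 1.0)
    return c * np.sqrt(2.0 / M) * np.cos(np.pi / M * (n + 0.5) * k)

def mdct_matrix(M):
    """H such that X_MDCT = H x, with the sine window w(n) absorbed into H (x is 2M x 1)."""
    n, k = np.meshgrid(np.arange(2 * M), np.arange(M))
    w = np.sin((n + 0.5) * np.pi / (2 * M))
    return np.sqrt(2.0 / M) * w * np.cos((2 * n + M + 1) * (2 * k + 1) * np.pi / (4 * M))

M = 4
print(np.round(dct_matrix(M) @ dct_matrix(M).T, 6))   # identity: this DCT matrix is orthonormal
print(mdct_matrix(M).shape)                            # (M, 2M): the MDCT is a lapped transform
```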

7.2 Give the signal flowgraph of the FFT butterfly structure for an 8-point DFT, an 8-point DCT, and an 8-point MDCT. Specify clearly the values on the nodes and the branches. [Hint: See Problem 6.16 and Figure 6.18 in Chapter 6.]

COMPUTER EXERCISES

7.3 In this problem, we will study the energy compaction of the DFT and the DCT. Use x(n) = e^{−0.5n} sin(0.4πn), n = 0, 1, . . . , 15. Plot the 16-point DFT and 16-point DCT of the input signal x(n). See how the energy of the sequence is concentrated. Now pick two peaks of the DFT vector and the DCT vector and synthesize the input signal x(n). Let the synthesized signals be x_DFT(n) and x_DCT(n). Compute the MSE values between the input signal and the two reconstructed signals. Repeat this for four peaks, six peaks, and eight peaks. Plot the estimated MSE values across the number of peaks selected and comment on your result.

7.4 This computer exercise is a combination of Problems 2.24 and 2.25 in Chapter 2. In particular, the FFT analysis/synthesis module in Problem 2.25 will be used within the two bands of the QMF bank. The configuration is shown in Figure 7.5.
a. Given H0(z) = 1 − z^{−1} and H1(z) = 1 + z^{−1}, choose F0(z) and F1(z) such that the aliasing term can be cancelled. Use L = 32 and the peak-picking method for component selection. Perform speech synthesis and give time-domain plots of both input and output speech records.
b. Use the same voiced frame selected in Problem 2.24. Give time-domain and frequency-domain plots of x′_d0(n) and x′_d1(n) in Figure 7.5.
c. Compute the overall SNR (between x(n) and x′(n)) and estimate a MOS score for the output speech.
d. Describe whether the perceptual quality of the output speech improves if the FFT analysis/synthesis module is employed within the subbands instead of using it for the entire band.

CHAPTER 8

SUBBAND CODERS

8.1 INTRODUCTION

Similar to the transform coders described in the previous chapter, subband coders also exploit signal redundancy and psychoacoustic irrelevancy in the frequency domain. The audible frequency spectrum (20 Hz–20 kHz) is divided into frequency subbands using a bank of bandpass filters. The output of each filter is then sampled and encoded. At the receiver, the signals are demultiplexed, decoded, demodulated, and then summed to reconstruct the signal. Audio subband coders realize coding gains by efficiently quantizing decimated output sequences from perfect reconstruction filter banks. Efficient quantization methods usually rely upon psychoacoustically controlled dynamic bit allocation rules that allocate bits to subbands in such a way that the reconstructed output signal is free of audible quantization noise or other artifacts. In a generic subband audio coder, the input signal is first split into several uniform or nonuniform subbands using some critically sampled, perfect reconstruction (or nearly perfect reconstruction) filter bank. Nonideal reconstruction properties in the presence of quantization noise are compensated for by utilizing subband filters that have good sidelobe attenuation. Then, decimated output sequences from the filter bank are normalized and quantized over short, 2–10 ms blocks. Psychoacoustic signal analysis is used to allocate an appropriate number of bits for the quantization of each subband. The usual approach is to allocate an adequate number of bits to mask quantization noise in each block while simultaneously satisfying some bit rate constraint. Since masking thresholds and hence bit allocation requirements are time-varying, buffering is often introduced to match the coder output to a fixed rate. The encoder



sends to the decoder quantized subband output samples, normalization scale factors for each block of samples, and bit allocation side information. Bit allocation may be transmitted as explicit side information, or it may be implicitly represented by some parameter such as the scale factor magnitudes. The decoder uses side information and scale factors in conjunction with an inverse filter bank to reconstruct a coded version of the original input.
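The masking-threshold-driven bit allocation just described can be sketched as follows. The 6-dB-per-bit rule and the example band energies and thresholds are illustrative assumptions used only to show the mechanics, not any particular standard's allocation table.

```python
import numpy as np

def allocate_bits(band_energy, mask_threshold, max_bits=15):
    """Give each subband just enough bits that quantization noise
    (assumed to drop ~6 dB per bit) stays below its masking threshold."""
    smr_db = 10.0 * np.log10(band_energy / mask_threshold)        # signal-to-mask ratio
    bits = np.ceil(np.maximum(smr_db, 0.0) / 6.02).astype(int)    # ~6 dB per allocated bit
    return np.minimum(bits, max_bits)

# Usage with illustrative values for a 4-band split
energy = np.array([1e4, 3e2, 5e1, 2e0])
thresh = np.array([1e2, 1e2, 1e1, 1e0])
print(allocate_bits(energy, thresh))   # -> [4 1 2 1]
```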

The purpose of this chapter is to expose the reader to subband coding algorithms for high-fidelity audio. This chapter is organized much like Chapter 7. The first portion of this chapter is concerned with early subband algorithms that not only contributed to the MPEG-1 standardization but also had an impact on later developments in the field. The remainder of the chapter examines a variety of recent experimental subband algorithms that make use of discrete wavelet transforms (DWT), discrete wavelet packet transforms (DWPT), and hybrid filter banks. The chapter is organized as follows. Section 8.1.1 concentrates upon the early subband coding algorithms for high-fidelity audio, including the Masking Pattern Adapted Universal Subband Integrated Coding and Multiplexing (MUSICAM). Section 8.2 presents the filter-bank interpretations of the DWT and the DWPT. Section 8.3 addresses subband audio coding algorithms in which time-invariant and time-varying signal adaptive filter banks are constructed from the DWT and the DWPT. Section 8.4 examines the use of nonuniform filter banks related to the DWPT. Sections 8.5 and 8.6 are concerned with hybrid subband architectures involving sinusoidal modeling and code-excited linear prediction (CELP). Finally, Section 8.7 addresses subband audio coding with IIR filter banks.

8.1.1 Subband Algorithms

This section is concerned with early subband algorithms proposed by researchers from the Institut für Rundfunktechnik (IRT) [Thei87] [Stol88], Philips Research Laboratories [Veld89], and CCETT. Much of this work was motivated by standardization activities for the European Eureka-147 digital broadcast audio (DBA) system. The ISO/IEC eventually clustered the IRT, Philips, and CCETT proposals into the MUSICAM algorithm [Wies90] [Dehe91], which was adopted as part of the ISO/IEC MPEG-1 and MPEG-2 BC-LSF audio coding standards.

8.1.1.1 Masking Pattern Adapted Subband Coding (MASCAM) The MUSICAM algorithm is derived from coders developed at IRT, Philips, and CNET. At IRT, Theile, Stoll, and Link developed Masking Pattern Adapted Subband Coding (MASCAM), a subband audio coder [Thei87] based upon a tree-structured quadrature mirror filter (QMF) filter bank that was designed to mimic the critical band structure of the auditory filter bank. The coder has 24 nonuniform subbands, with bandwidths of 125 Hz below 1 kHz, 250 Hz in the range 1–2 kHz, 500 Hz in the range 2–4 kHz, 1 kHz in the range 4–8 kHz, and 2 kHz from 8 kHz to 16 kHz. The prototype QMF has 64 taps. Subband output sequences are processed in 2-ms blocks. A normalization scale factor is quantized


and transmitted for each block from each subband. Subband bit allocations are derived from a simplified psychoacoustic analysis. The original coder reported in [Thei87] considered only in-band simultaneous masking. Later, as described in [Stol88], interband simultaneous masking and temporal masking were added to the bit rate calculation. Temporal postmasking is exploited by updating scale factors less frequently during periods of signal decay. The MASCAM coder was reported to achieve high-quality results for 15-kHz bandwidth input signals at bit rates between 80 and 100 kb/s per channel. A similar subband coder was developed at Philips during this same period. As described by Veldhuis et al. in [Veld89], the Philips group investigated subband schemes based on 20- and 26-band nonuniform filter banks. Like the original MASCAM system, the Philips coder relies upon a highly simplified masking model that considers only the upward spread of simultaneous masking. Thresholds are derived from a prototype basilar excitation function under worst-case assumptions regarding the frequency separation of masker and maskee. Within each subband, signal energy levels are treated as single maskers. Given SNR targets due to the masking model, uniform ADPCM is applied to the normalized output of each subband. The Philips coder was claimed to deliver high-quality coding of CD-quality signals at 110 kb/s for the 26-band version and 180 kb/s for the 20-band version.

8.1.1.2 Masking Pattern Adapted Universal Subband Integrated Coding and Multiplexing (MUSICAM) Based primarily upon coders developed at IRT and Philips, the MUSICAM algorithm [Wies90] [Dehe91] was successful in the 1990 ISO/IEC competition [SBC90] for a new audio coding standard. It eventually formed the basis for MPEG-1 and MPEG-2 audio layers I and II. Relative to its predecessors, MUSICAM (Figure 8.1) makes several practical tradeoffs between complexity, delay, and quality. By utilizing a uniform bandwidth, 32-band pseudo-QMF bank (a.k.a. "polyphase" filter bank) instead of a tree-structured QMF bank, both complexity and delay are greatly reduced relative to the IRT and Philips coders. Delay and complexity are 10.66 ms and 5 MFLOPS, respectively. These improvements are realized at the expense of using a sub-optimal

Figure 8.1 MUSICAM encoder (after [Wies90]). The input s(n) feeds a 32-channel polyphase analysis filter bank (750-Hz subbands at 48 kHz) and, in parallel, a 1024-point FFT driving the psychoacoustic analysis; the resulting bit allocation controls quantization of the subband samples, scale factors, and side information over 8/16/24-ms blocks.


filter bank, however, in the sense that filter bandwidths (constant 750 Hz for 48-kHz sample rate) no longer correspond to the critical band rate. Despite these excessive filter bandwidths at low frequencies, high-quality coding is still possible with MUSICAM due to its enhanced psychoacoustic analysis. High-resolution spectral estimates (46 Hz/line at 48-kHz sample rate) are obtained through the use of a 1024-point FFT in parallel with the PQMF bank. This parallel structure allows for improved estimation of masking thresholds and hence determination of more accurate minimum signal-to-mask ratios (SMRs) required within each subband.

The MUSICAM psychoacoustic analysis procedure is essentially the same as the MPEG-1 psychoacoustic model 1. The remainder of MUSICAM works as follows. Subband output sequences are processed in 8-ms blocks (12 samples at 48 kHz), which is close to the temporal resolution of the auditory system (4–6 ms). Scale factors are extracted from each block and encoded using 6 bits over a 120-dB dynamic range. Occasionally, temporal redundancy is exploited by repetition over 2 or 3 blocks (16 or 24 ms) of slowly changing scale factors within a single subband. Repetition is avoided during transient periods such as sharp attacks. Subband samples are quantized and coded in accordance with SMR requirements for each subband as determined by the psychoacoustic analysis. Bit allocations for each subband are transmitted as side information. On the CCIR five-grade impairment scale, MUSICAM scored 4.6 (std. dev. 0.7) at 128 kb/s and 4.3 (std. dev. 1.1) at 96 kb/s per monaural channel, compared to 4.7 (std. dev. 0.6) on the same scale for the uncoded original. Quality was reported to suffer somewhat at 96 kb/s for critical signals which contained sharp attacks (e.g., triangle, castanets), and this was reflected in a relatively high standard deviation of 1.1. MUSICAM was selected by ISO/IEC for MPEG-1 audio due to its desirable combination of high quality, reasonable complexity, and manageable delay. Also, bit error robustness was found to be very good (errors nearly imperceptible) up to a bit error rate of 10^−3.
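To make the scale-factor description concrete, the sketch below maps a block peak onto a 6-bit logarithmic scale spanning 120 dB. The exact MUSICAM/MPEG scale-factor table is not reproduced here, so the step size (~1.9 dB) and the 0-dBFS reference level are illustrative assumptions.

```python
import numpy as np

def scalefactor_index(block, n_bits=6, range_db=120.0):
    """Map the block peak magnitude to a logarithmic scale-factor index."""
    n_steps = 2 ** n_bits                       # 64 indices over 120 dB -> ~1.9 dB per step
    step_db = range_db / n_steps
    peak_db = 20.0 * np.log10(max(np.max(np.abs(block)), 1e-10))
    idx = int(np.clip(round(-peak_db / step_db), 0, n_steps - 1))
    return idx, -idx * step_db                  # index and the level (dB) it represents

# Usage: a 12-sample subband block (8 ms at 48 kHz), peak about -12 dB
blk = 0.25 * np.sin(2 * np.pi * np.arange(12) / 12.0)
print(scalefactor_index(blk))
```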

8.2 DWT AND DISCRETE WAVELET PACKET TRANSFORM (DWPT)

The previous section described subband coding algorithms that utilize banks of fixed resolution bandpass QMF or pseudo-QMF finite impulse response (FIR) filters. This section describes a different class of subband coders that rely instead upon a filter-bank interpretation of the discrete wavelet transform (DWT). DWT-based subband coders offer increased flexibility over the subband coders described previously, since identical filter-bank magnitude frequency responses can be obtained for many different choices of a wavelet basis, or, equivalently, choices of filter coefficients. This flexibility presents an opportunity for basis optimization. The advantage of this optimization in the audio coding application is illustrated by the following example. First, a desired filter-bank magnitude response can be established. This response might be matched to the auditory filter bank. Then, for each segment of audio, one can adaptively choose a wavelet basis that minimizes the rate for some target distortion level. Given a psychoacoustically derived distortion target, the encoding remains perceptually transparent.


Figure 8.2 Filter-bank interpretation of the DWT. The transform y = Qx is equivalent to filtering x with the lowpass and highpass filters Hlp(z) and Hhp(z), followed by 2:1 decimation, producing the approximation and detail vectors ylp and yhp.

A detailed discussion of specific technical conditions associated with the various wavelet families is beyond the scope of this book, and this chapter therefore concentrates upon high-level coder architectures. In-depth treatment of wavelets is available from many sources, e.g., [Daub92]. Before describing the wavelet-based coders, however, it is useful to summarize some basic wavelet characteristics. Wavelets are a family of basis functions for the space of square integrable signals. A finite energy signal can be represented as a weighted sum of the translates and dilates of a single wavelet. Continuous-time wavelet signal analysis can be extended to discrete-time and square summable sequences. Under certain assumptions, the DWT acts as an orthonormal linear transform T : R^N → R^N. For a compact (finite) support wavelet of length K, the associated transformation matrix Q is fully determined by a set of coefficients c_k, for 0 ≤ k ≤ K − 1. As shown in Figure 8.2, this transformation matrix has an associated filter-bank interpretation. One application of the transform matrix Q to an N × 1 signal vector x generates an N × 1 vector of wavelet-domain transform coefficients y. The N × 1 vector y can be separated into two N/2 × 1 vectors of approximation and detail coefficients, y_lp and y_hp, respectively. The spectral content of the signal x captured in y_lp and y_hp corresponds to the frequency subbands realized in the 2:1 decimated output sequences from a QMF bank (Section 6.4), which obeys the "power complementary condition," i.e.,
$$|H_{lp}(\Omega)|^2 + |H_{lp}(\Omega + \pi)|^2 = 1, \qquad (8.1)$$
where $H_{lp}(\Omega)$ is the frequency response of the lowpass filter. Therefore, recursive DWT applications effectively pass input data through a tree-structured cascade of lowpass (LP) and highpass (HP) filters followed by 2:1 decimation at every node. The forward/inverse transform matrices of a particular wavelet are associated with a corresponding QMF analysis/synthesis filter bank. The usual wavelet decomposition implements an octave-band filter bank structure, as shown in Figure 8.3. In the figure, frequency subbands associated with the coefficients from each stage are schematically represented for an audio signal sampled at 44.1 kHz.
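A minimal one-level DWT in the filter-bank form of Figure 8.2, using the Haar filters as the simplest example (other wavelet families would substitute different coefficients c_k). The filter scaling conventions are assumptions: the orthonormal 1/√2 scaling is used for the transform, while the check of Eq. (8.1) uses the DC-normalized filter Hlp(0) = 1.

```python
import numpy as np

# Haar analysis pair (K = 2), orthonormal scaling
h_lp = np.array([1.0, 1.0]) / np.sqrt(2.0)   # lowpass  -> approximation
h_hp = np.array([1.0, -1.0]) / np.sqrt(2.0)  # highpass -> detail

def dwt_level(x):
    """One DWT stage of Figure 8.2: filter with Hlp/Hhp, then 2:1 decimation."""
    return np.convolve(x, h_lp)[1::2], np.convolve(x, h_hp)[1::2]

x = np.random.default_rng(1).standard_normal(16)
y_lp, y_hp = dwt_level(x)
# Orthonormality: the signal energy is split between approximation and detail
print(np.isclose(np.sum(x**2), np.sum(y_lp**2) + np.sum(y_hp**2)))

# Power-complementary condition of Eq. (8.1), using the DC-normalized
# Haar lowpass filter with coefficients [0.5, 0.5] (Hlp(0) = 1):
w = np.linspace(0.0, np.pi, 5)
H = lambda om: 0.5 + 0.5 * np.exp(-1j * om)
print(np.abs(H(w))**2 + np.abs(H(w + np.pi))**2)     # -> all ones
```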

Wavelet packet (WP) or discrete wavelet packet transform (DWPT) representations, on the other hand, decompose both the detail and approximation coefficients at each stage of the tree, as shown in Figure 8.4. In the figure, frequency subbands


Figure 8.3 Octave-band subband decomposition associated with a discrete wavelet transform ("DWT"). Repeated application of Q to the approximation coefficients produces subbands y1 through y5, whose upper band edges (approximately 22, 11, 5.5, 2.8, and 1.4 kHz) follow an octave spacing.

associated with the coefficients from each stage are schematically represented for a 44.1-kHz sample rate.

A filter-bank interpretation of wavelet transforms is attractive in the context of audio coding algorithms. Wavelet or wavelet packet decompositions can be tree structured as necessary (unbalanced trees are possible) to decompose input audio into a set of frequency subbands tailored to some application. It is possible, for example, to approximate the critical band auditory filter bank utilizing a wavelet packet approach. Moreover, many K-coefficient finite support wavelets are associated with a single magnitude frequency response QMF pair, and therefore a specific subband decomposition can be realized while retaining the freedom to choose a wavelet basis which is in some sense "optimal." These considerations have motivated the development of several experimental wavelet-based subband coders in recent years. The basic idea behind DWT and DWPT-based subband coders is to quantize and encode efficiently the coefficient sequences associated with each stage of the wavelet decomposition tree using the same noise shaping techniques as the previously described perceptual subband coders.

The next few sections of this chapter, Sections 8.3 through 8.5, expose the reader to several WP-based subband coders developed in the early 1990s by Sinha, Tewfik, et al. [Sinh93a] [Sinh93b] [Tewf93], as well as more recently proposed hybrid sinusoidal/WPT algorithms developed by Hamdy and Tewfik [Hamd96], Boland and Deriche [Bola97], and Pena et al. [Pena96] [Prel96a] [Prel96b] [Pena97a]. The core of at least one experimental WP audio coder [Sinh96] has been embedded in a commercial standard, namely the AT&T Perceptual Audio Coder (PAC) [Sinh98]. Although not addressed in this chapter, we note that other studies of DWT and DWPT-based audio coding schemes have appeared. For example, experimental coder architectures for low-complexity, low-delay, combined wavelet/multipulse LPC coding, and combined scalar/vector quantization of transform coefficients were reported, respectively, by Black and Zeytinoglu [Blac95], Kudumakis and Sandler [Kudu95a] [Kudu95b] [Kudu96], and Boland and Deriche [Bola95] [Bola96]. Several bit rate scalable DWPT-based schemes have also been investigated recently. For example, a fixed-tree DWPT coding scheme capable of nearly transparent quality with scalable bitrates below 100 kb/s was proposed by Dobson et al. and implemented in real-time on a 75-MHz Pentium-class platform [Dobs97]. Additionally, Lu and Pearlman investigated a rate-scalable DWPT-based coder that applies set partitioning in hierarchical trees (SPIHT) to generate an embedded bitstream. Nearly transparent quality was reported at bit rates between 55 and 66 kb/s [Lu98].

Figure 8.4 Subband decomposition associated with a particular wavelet packet transform ("WPT" or "WP"). Although the picture illustrates a balanced binary tree and the associated uniform bandwidth subbands (y1 through y8, spanning 0 to 22 kHz), nodes could be pruned in order to achieve nonuniform frequency subdivision.


8.3 ADAPTED WP ALGORITHMS

The "best basis" methodologies [Coif92] [Wick94] for adapting the WP tree structure to signal properties are typically formulated in terms of Shannon entropy [Shan48] and other perceptually blind statistical measures. For a given WP tree, related research directed towards optimal filter selection [Hedg97] [Hedg98a] [Hedg98b] has also emphasized optimization of statistical rather than perceptual properties. The questions of perceptually motivated filter selection and tree construction are central to successful application of WP analysis in audio coding algorithms. The WP tree structure determines the time and frequency resolution of the transform and therefore also creates a particular tiling of the time-frequency plane. Several WP audio algorithms [Sinh93b] [Dobs97] have successfully employed time-invariant WP tree structures that mimic the ear's critical band frequency resolution properties. In some cases, however, a more efficient perceptual bit allocation is possible with a signal-specific time-frequency tiling that tracks the shape of the time-varying masking threshold. Some examples are described next.

8.3.1 DWPT Coder with Globally Adapted Daubechies Analysis Wavelet

Sinha and Tewfik developed a variable-rate wavelet-based coding scheme for which they reported nearly transparent coding of CD-quality audio at 48–64 kb/s [Sinh93a] [Sinh93b]. The encoder (Figure 8.5) exploits redundancy using a VQ scheme and irrelevancy using a wavelet packet (WP) signal decomposition combined with perceptual masking thresholds. The algorithm works as follows. Input audio is segmented into N × 1 vectors, which are then

Figure 8.5 Dynamic dictionary/optimal wavelet packet encoder (after [Sinh93a]). The dynamic dictionary search selects the codeword sd perceptually closest to the input s; if the perceptual distance satisfies d(s, sd) ≤ T, only the index of sd is transmitted, otherwise wavelet packet search/analysis is applied, under psychoacoustic control, to the residual r = s − sd or to s itself.


windowed using a 1/16-th overlap square root Hann window. The dynamic dictionary (DD), which is essentially an adaptive VQ subsystem, then eliminates signal redundancy. A dictionary of N × 1 codewords is searched for the vector perceptually closest to the input vector. The effective size of the dictionary is made larger than its actual size by a novel correlation lag search/time-warping procedure that identifies two N/2-sample codewords for each N-sample input vector. At both the transmitter and receiver, the dictionary is systematically updated with N-sample reconstructed output audio vectors according to a perceptual distance criterion and last-used-first-out rule. After the DD procedure has been completed, an optimized WP decomposition is applied to the original signal as well as the DD residual. The decomposition tree is structured such that its 29 frequency subbands roughly correspond to the critical bands of the auditory filter bank. A masking threshold, obtained as in [Veld89], is assumed constant within each subband and then used to compute a perceptual bit allocation.

The encoder transmits the particular combination of DD and WP information that minimizes the bit rate while maintaining perceptual quality. Three combinations are possible. In one scenario, the DD index and time-warping factor are transmitted alone if the DD residual energy is below the masking threshold at all frequencies. Alternatively, if the DD residual has audible noise energy, then WP coefficients of the DD residual are also quantized, encoded, and transmitted. In some cases, however, WP coefficients corresponding to the original signal are more compactly represented than the combination of the DD plus WP residual information. In this case, the DD information is discarded and only quantized and encoded WP coefficients are transmitted. In the latter two cases, the encoder also transmits subband scale factors, bit allocations, and energy normalization side information.

This algorithm is unique in that it contains the first reported application of adapted WP analysis to perceptual subband coding of high-fidelity, CD-quality audio. During each frame, the WP basis selection procedure applies an optimality criterion of minimum bit rate for a given distortion level. The adaptation is "global" in the sense that the same analysis wavelet is applied to the entire decomposition. The authors reached several conclusions regarding the optimal compact support (K-coefficient) wavelet basis when selecting from among the Daubechies orthogonal wavelet bases ([Daub88]).

First, optimization produced average bit rate savings, dependent on filter length, of up to 15%. Average bit rate savings were 3%, 6.5%, 8.75%, and 15% for wavelets selected from the sets associated with coefficient sequences of lengths 10, 20, 40, and 60, respectively. In an extreme case, a savings of 1.7 bits/sample is realized for transparent coding of a difficult castanets sequence when using best-case rather than worst-case wavelets (0.8 vs. 2.5 bits/sample for K = 40). The second conclusion reached by the researchers was that it is not necessary to search exhaustively the space of all wavelets for a particular value of K. The search can be constrained to wavelets with K/2 vanishing moments (the maximum possible number) with minimal impact on bit rate. The frequency responses of the filters associated with a p-th-order vanishing moment wavelet have p-th-order zeros at


the foldover frequency, i.e., Ω = π. Only a 3.1% bitrate reduction was realized for an exhaustive search versus a maximal vanishing moment constrained search. Third, the authors found that larger K, i.e., more taps, and deeper decomposition trees tended to yield better results. Given identical distortion criteria, for a castanets sequence, bit rates of 2.1 bits/sample for K = 4 wavelets were realized versus 0.8 bits/sample for K = 40 wavelets.

As far as quality is concerned, subjective tests showed that the algorithm produced transparent quality for certain test material, including drums, pop, violin with orchestra, and clarinet. Subjects detected differences, however, for the castanets and piano sequences. These difficulties arise, respectively, because of inadequate pre-echo control and inefficient modeling of steady sinusoids. The coder utilizes only an adaptive window scheme, which switches between 1024- and 2048-sample windows. Shorter windows (N = 1024, or 23 ms) are used for signals that are likely to produce pre-echoes. The piano sequence contained long segments of nearly steady or slowly decaying sinusoids. The wavelet coder does not handle steady sinusoids as well as other signals. With the exception of these troublesome signals, in a comparative test, one additional expert listener also found that the WP coder outperformed MPEG-1 layer II at 64 kb/s.

Tewfik and Ali later enhanced the WP coder to improve pre-echo control and increase coding efficiency. After elimination of the dynamic dictionary, they reported improved quality in the range of 55 to 63 kb/s, as well as a real-time implementation of a simplified 64 to 78 kb/s coder on two TMS320C31 devices [Tewf93]. Other improvements included exploitation of auditory temporal masking for pre-echo control, more efficient quantization and encoding of scale-factors, and run-length coding of long zero sequences. The improved WP coder also upgraded its psychoacoustic analysis section with a more sophisticated model similar to Johnston's PXFM coder [John88a]. The most notable improvement occurred in the area of pre-echo control. This was accomplished in the following manner. First, input frames likely to produce pre-echoes are identified using a normalized energy measure criterion. These frames are parsed into 5-ms time slots (256 samples). Then, WP coefficients from all scales within each time slot are combined to estimate subframe energies. Masking thresholds computed over the global 1024-sample frame are assumed only to apply during high-energy time slots. Masking thresholds are reduced across all subbands for low-energy time slots, utilizing weighting factors proportional to the energy ratio between high- and low-energy time-slots. The remaining enhancements of improved scale factor coding efficiency and run-length coding of zero sequences more than compensated for removal of the dynamic dictionary.
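The time-slot threshold weighting just described might be sketched as follows. The slot length, the simple energy-ratio weighting, and the synthetic test frame are illustrative assumptions in the spirit of [Tewf93], not its exact rules.

```python
import numpy as np

def slot_weighted_thresholds(frame, global_threshold, slot_len=256):
    """Reduce the frame-wide masking threshold in low-energy time slots
    so that quantization noise cannot spread ahead of a transient (pre-echo)."""
    slots = frame.reshape(-1, slot_len)            # e.g. 1024 samples -> four 5-ms slots
    e = np.sum(slots ** 2, axis=1) + 1e-12         # per-slot energies
    weights = e / np.max(e)                        # 1 for the loudest slot, < 1 elsewhere
    return np.outer(weights, global_threshold)     # one threshold vector per time slot

# Usage with an illustrative transient: silence followed by a burst
frame = np.zeros(1024)
frame[768:] = np.random.default_rng(2).standard_normal(256)
thr = np.ones(29)                                  # flat global threshold, 29 bands (illustrative)
print(slot_weighted_thresholds(frame, thr)[:, 0])  # thresholds are much lower in the quiet slots
```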

8.3.2 Scalable DWPT Coder with Adaptive Tree Structure

Srinivasan and Jamieson proposed a WP-based audio coding scheme [Srin97] [Srin98] in which a signal-specific perceptual best basis is constructed by adapting the WP tree structure on each frame such that perceptual entropy, and ultimately the bit rate, are minimized. While the tree structure is signal-adaptive, the analysis


Figure 8.6 Masking-threshold adapted WP audio coder [Srin98]. The input s(n) passes through an adaptive WPT, a zerotree quantizer, and lossless coding, all under the control of a perceptual model. On each frame, the WP tree structure is adapted in order to minimize a perceptually motivated rate constraint.

filters are time-invariant and obtained from the family of spline-based biorthogonal wavelets [Daub92]. The algorithm (Figure 8.6) is also unique in the sense that it incorporates mechanisms for both bit rate and complexity scaling. Before the tree adaptation process can commence for a given frame, a set of 63 masking thresholds corresponding to a set of threshold frequency partitions roughly 1/3 Bark wide is obtained from the ISO/IEC MPEG-1 psychoacoustic model recommendation 2 [ISOI92]. Of course, depending upon the WP tree, the subbands may or may not align with the threshold partitions. For any particular WP tree, the associated bit rate (cost) is computed by extracting the minimum masking thresholds from each subband and then allocating sufficient bits to guarantee that the quantization noise in each band does not exceed the minimum threshold.

The objective of the tree adaptation process, therefore, is to construct a minimum cost subband decomposition by maximizing the minimum masking threshold in every subband. Figure 8.7a shows a possible subband structure in which subband 0 contains five threshold partitions. This choice of bandsplitting is clearly undesirable, since the minimum masking threshold for partition 1 is far below partition 4. Bit allocation for subband 0 will be forced to satisfy partition 1, with a resulting overallocation for partitions 2 through 5.

It can be seen that subdividing the band (Figure 8.7b) relaxes the minimum masking threshold in band 1 to the level of partition 5. Naturally, the ideal bandsplitting would in this case ultimately match the subband boundaries to the threshold partition boundaries. On each frame, therefore, the tree adaptation process performs the following top-down iterative "growing" procedure. During any iteration, the existing subbands are assigned individual costs based on the bit allocation required for transparent coding. Then, a decision on whether or not to subdivide further at a node is made on the basis of cost reduction. Subbands are examined for potential splitting in order of decreasing cost, and the search is "breadth-first," meaning that each level is completely decomposed before proceeding to the next level. Subdivision occurs only if the associated bit rate improvement exceeds a threshold. The tree adaptation is also constrained by a complexity scaling mechanism. Top-down tree growth is halted by the complexity scaling constraint, λ, when the estimated total cost of computing the


Figure 8.7 Example masking-threshold adapted WP filter bank: (a) initial condition, (b) after one iteration. Each panel plots masking threshold versus frequency; threshold partitions are denoted by dashed lines and labeled by t_k, and idealized subband boundaries are denoted by heavy black lines. Under the initial condition, with only one subband (subband 0), the minimum masking threshold is given by t1, and therefore the bit allocation will be relatively large in order to satisfy a small threshold. After one band splitting, however, the minimum threshold in subband 1 increases from t1 to t5, thereby reducing the perceptual bit allocation. Hence, the cost function is reduced in part (b) relative to part (a).

DWPT reaches a predetermined limit. With this feature, it is envisioned that in a real-time environment the WP adaptation process could respond to changing CPU resources by controlling the cost of the analysis and synthesis filter banks.
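The following sketch captures the flavor of the top-down growing procedure: each band's cost is driven by the minimum masking threshold it contains, and a band is split only if splitting reduces the total cost by more than a fixed margin. The cost model (bits proportional to the log energy-to-threshold ratio, a fixed number of coefficients per partition) and the stopping margin are illustrative assumptions, not the rate measure of [Srin98].

```python
import numpy as np

COEFS_PER_PARTITION = 4   # illustrative: coefficients covered by one threshold partition

def band_cost(e_coef, thresholds):
    """Bits for one band: every coefficient must satisfy the band's *minimum*
    masking threshold (~6 dB per bit, illustrative cost model)."""
    bits_per_coef = max(10.0 * np.log10(e_coef / np.min(thresholds)), 0.0) / 6.02
    return len(thresholds) * COEFS_PER_PARTITION * bits_per_coef

def grow_tree(e_coef, thresholds, min_gain=2.0):
    """Top-down growing: split a band in two only if the bit-rate gain exceeds min_gain."""
    if len(thresholds) < 2:
        return [thresholds]
    half = len(thresholds) // 2
    gain = band_cost(e_coef, thresholds) - (band_cost(e_coef, thresholds[:half])
                                            + band_cost(e_coef, thresholds[half:]))
    if gain > min_gain:
        return (grow_tree(e_coef, thresholds[:half], min_gain)
                + grow_tree(e_coef, thresholds[half:], min_gain))
    return [thresholds]

# Usage: eight threshold partitions, the first far lower than the rest (cf. Figure 8.7)
t = np.array([1.0, 40.0, 35.0, 30.0, 25.0, 20.0, 15.0, 10.0])
print([list(b) for b in grow_tree(e_coef=100.0, thresholds=t)])
```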

In [Srin98], a complexity-constrained tree adaptation procedure is shown to yield a basis requiring the fewest bits for perceptually transparent coding for a given complexity and temporal resolution. After the WP tree adaptation procedure has been completed, Shapiro's zerotree algorithm [Shap93] is applied iteratively to quantize the coefficients and exploit remaining temporal correlation until the perceptual rate-distortion criteria are satisfied, i.e., until sufficient bits have been allocated to satisfy the perceptually transparent bit rate associated with the given


WP tree. The zerotree technique has the added benefit of generating an embedded bitstream, making this coder amenable to progressive transmission. In scalable applications, the embedded bitstream has the property that it can be partially decoded and still guarantee the best possible quality of reconstruction given the number of bits decoded. The complete bitstream consists of the encoded tree structure, the number of zerotree iterations, and a block of zerotree encoded data. These elements are coded in a lossless fashion (e.g., Huffman, arithmetic, etc.) to remove any remaining redundancies and transmitted to the decoder. For informal listening tests over coded program material that included violin, violin/viola, flute, sitar, vocals/orchestra, and sax, the coded outputs at rates in the vicinity of 45 kb/s were reported to be indistinguishable from the originals, with the exceptions of the flute and sax.

8.3.3 DWPT Coder with Globally Adapted General Analysis Wavelet

Srinivasan and Jamieson [Srin98] demonstrated the advantages of a masking threshold adapted WP tree with a time-invariant analysis wavelet. On the other hand, Sinha and Tewfik [Sinh93b] used a time-invariant WP tree but a globally adapted analysis wavelet to demonstrate that there exists a signal-specific "best" wavelet basis in terms of perceptual coding gain for a particular number of filter taps. The basis optimization in [Sinh93b], however, was restricted to Daubechies' wavelets. Recent research has attempted to identify which wavelet properties portend an optimal basis, as well as to consider basis optimization over a broader class of wavelets. In an effort to identify those wavelet properties that could be associated with the "best" filter, Philippe et al. measured the impact on perceptual coding gain of wavelet regularity, AR(1) coding gain, and filter bank frequency selectivity [Phil95a] [Phil95b]. The study compared performance between orthogonal Rioul [Riou94], orthogonal Onno [Onno93], and the biorthogonal wavelets of [More95] in a WP coding scheme that had essentially the same time-invariant critical band WP decomposition tree as [Sinh93b]. Using filters of lengths varying between 4 and 120 taps, minimum bit rates required for transparent coding in accordance with the usual perceptual subband bit allocations were measured for each wavelet. For a given filter length, the results suggested that neither regularity nor frequency selectivity mattered significantly. On the other hand, the minimum bit rate required for transparent coding was shown to decrease with increasing analysis filter AR(1) coding gain, leading the authors to conclude that AR(1) coding gain is a legitimate criterion for WP filter selection in perceptual coding schemes.

8.3.4 DWPT Coder with Adaptive Tree Structure and Locally Adapted Analysis Wavelet

Philippe et al. [Phil96] measured the perceptual coding gain associated with optimization of the WP analysis filters at every node in the tree, as well as optimization of the tree structure. In the first experiment, the WP tree structure was fixed, and then optimal filters were selected for each


tree node (local adaptation) such that the bit rate required for transparent coding was minimized. Simulated annealing [Kirk83] was used to solve the discrete optimization problem posed by a search space containing 300 filters of varying lengths from the Daubechies [Daub92], Onno [Onno93], Smith-Barnwell [Smit86], Rioul [Riou94], and Akansu-Caglar [Cagl91] families. Then, the filters selected by simulated annealing were used in a second set of experiments on tree structure optimization. The best WP decomposition tree was constructed by means of a growing procedure, starting from a single cell and progressively subdividing. Further splitting at each node occurred only if it significantly reduced the perceptually transparent bit rate. As in [Phil95b], these filter and tree adaptation experiments estimated bit rates required for perceptually transparent coding of 48-kHz sampled source material using statistical signal properties. For a fixed tree, the filter adaptation experiments yielded several noteworthy results. First, a nominal bit rate reduction of 3% was realized for Onno's filters (66.5 kb/s) relative to Daubechies' filters (68 kb/s) when the same filter family was applied in all tree nodes and filter length was the only free parameter. Secondly, simulated annealing over the search space of 300 filters yielded a nominal 1% bit rate reduction (66 kb/s) relative to the Onno-only case. Finally, longer filter bank delay, i.e., longer analysis filters and hence better frequency selectivity, yielded lower bitrates. For low-delay applications, however, a sevenfold delay reduction, from 700 down to only 100 samples, is realized at the cost of only a 10% increase in bit rate. The tree adaptation experiments showed that a 16-band decomposition yielded the best bit rate when tree description overhead was accounted for. In light of these results and the wavelet adaptation results of [Sinh93b], one might conclude that WP filter and WP tree optimization are warranted if less than a 10% bit rate improvement justifies the added complexity.

8.3.5 DWPT Coder with Perceptually Optimized Synthesis Wavelets

The wavelet-based audio coding schemes, as well as the WP tree and filter adaptation experiments, described in the foregoing sections (e.g., [Sinh93b] [Phil95a] [Phil95b] [Phil96]) seek to maximize perceptual coding efficiency by matching subband bandwidths (i.e., the time-frequency tiling) and/or individual filter magnitude and phase characteristics to incoming signal properties. All of these techniques make use of perfect reconstruction ("PR") DWT or WP filter banks that are designed to split a signal into frequency subbands in the analysis filter bank, and then later recombine the subband signals in the synthesis filter bank to reproduce exactly the original input signal. The PR property only holds, however, so long as distortion is not injected into the subband sequences, i.e., in the absence of quantization. This is an important point to consider in the context of coding. The quantization noise introduced into the subbands during bit allocation leads to filter bank-induced reconstruction artifacts, because the synthesis filter bank has carefully controlled spectral leakage properties specifically designed to cancel the aliasing and imaging distortions introduced by the critically sampled analysis-synthesis process. Whether using classical or perceptual bit allocation rules, most subband


coders do not account explicitly for the filter bank distortion artifacts introduced by quantization. Using explicit knowledge of the analysis filters and the quantization noise, however, recent research has shown that reconstruction distortion can be minimized in the mean square sense (MMSE) by relaxing PR constraints and tuning the synthesis filters [Chen95] [Hadd95] [Kova95] [Delo96] [Goss97b]. Naturally, mean square error minimization is of limited value for subband audio coders. As a result, Gosse et al. [Goss95] [Goss97] extended the MMSE synthesis filter tuning procedure [Goss96] to minimize a mean perceptual error (MMPE) rather than MMSE. Experiments were conducted to determine whether or not tuned synthesis filters outperform the unmodified PR synthesis filters, and, if so, whether or not MMPE filters outperform MMSE filters in subjective listening tests. A WP audio coding scheme configured for 128 kb/s operation and having a time-invariant filter-bank structure (Figure 8.8) formed the basis for the experiments.

The tree and filter selections were derived from the minimum-rate filter and tree adaptation investigation reported in [Phil96]. In the figure, each of the 16 subbands is labeled with its upper cutoff frequency (kHz). The experiments involved first a design phase and then an evaluation phase. During the design phase, optimized synthesis filter coefficients were obtained as follows. For the MMPE filters, coding simulations were run using the unmodified PR synthesis filter bank, with psychoacoustically derived bit allocations for each subband on each frame. A mean perceptual error (MPE) was evaluated at the PR filter bank output in terms of a unique JND measure [Duro96]. Then, the filter tuning algorithm [Goss96] was applied to minimize the reconstruction error. Since the bit allocation was perceptually motivated, the tuning and reconstruction error minimization procedure yielded MMPE filter coefficients. For the MMSE filters, coefficients were also obtained using [Goss96], without the benefit of a perceptual bit allocation step.

Figure 8.8 Wavelet packet analysis filter bank optimized for minimum bit rate, used in the MMPE experiments. Each of the 16 subbands (upper cutoff frequencies of 0.1, 0.2, 0.4, 0.6, 0.8, 1.1, 1.5, 3, 4.5, 6, 7.5, 9, 10.5, 12, 18, and 24 kHz) is realized with filters drawn from the Haar, Daubechies (Daub-14, Daub-18, Daub-24), and Onno (Onno-4, Onno-6, Onno-32) families.


During the evaluation phase of the experiments, three 128 kb/s coding simulations with psychoacoustic bit allocations were run with:

• PR synthesis filters,
• MMSE-tuned synthesis filters, and
• MMPE-tuned synthesis filters.

Performance was evaluated in terms of a perceptual objective measure (POM) [Colo95], an estimate of the probability that an expert listener can distinguish between the original and coded signal. The POM results were 44% distinguishability for the PR case, versus only 16% for both the MMSE and MMPE cases. The authors concluded that synthesis filter tuning is worthwhile, since some performance enhancement exists over the PR case. They also concluded that MMPE filters failed to outperform MMSE filters because they were designed to minimize the perceptual error over a long period rather than on a time-localized basis. Since perceptual signal properties are strongly time-variant, it is possible that time-variant MMPE tuning will realize some performance gain relative to MMSE tuning. The perceptual synthesis filter tuning ideas explored in this work have shown promise, but further investigation is required to better characterize their costs and benefits.

8.4 ADAPTED NONUNIFORM FILTER BANKS

The most popular method for realizing nonuniform frequency subbands is to cascade uniform filters in an unbalanced tree structure, as with, for example, the DWPT. For a given impulse response length, however, cascade structures in general produce poor channel isolation. Recent advances in modulated filter bank design methodologies (e.g., [Prin94]) have made tractable direct-form, near perfect reconstruction, nonuniform designs that are critically sampled. This section is concerned with subband coders that employ signal-adaptive nonuniform modulated filter banks to approximate the time-frequency analysis properties of the auditory system more effectively than the other subband coders. Two examples are given. Beyond the pair of algorithms addressed below, we note that other investigators have proposed nonuniform filter bank coding techniques that address redundancy reduction utilizing lattice [Mont94] and bidimensional VQ schemes [Main96].

8.4.1 Switched Nonuniform Filter Bank Cascade

Princen and Johnston developed a CD-quality coder based upon a signal-adaptive filter bank [Prin95] for which they reported quality better than the sophisticated MPEG-1 layer III algorithm at both 48 and 64 kb/s. The analysis filter bank for this coder consists of a two-stage cascade. The first stage is a 48-band nonuniform modulated filter bank split into four uniform-bandwidth sections. There are 8 uniform subbands from 0 to 750 Hz, 4 uniform subbands from 750 to 1500 Hz, 12 uniform subbands from 1.5 to 6 kHz, and 24 uniform subbands from 6 to 24 kHz.


The second stage in the cascade optionally decomposes nonuniform bank outputs with on/off switchable banks of finer resolution uniform subbands. During filter bank adaptation, a suitable overall time-frequency resolution is attained by selectively enabling or disabling the second stage filters for each of the four uniform bandwidth sections. The low-resolution mode for this architecture corresponds to slightly better than auditory filter-bank frequency resolution. On the other hand, the high-resolution mode corresponds roughly to a 512 uniform subband decomposition. Adaptation decisions are made independently for each of the four cascaded sections based on a criterion of minimum perceptual entropy (PE). The second stage filters in each section are enabled only if a reduction in PE (hence bit rate) is realized. Uniform PCM is applied to subband samples under the constraint of perceptually masked quantization noise. Masking thresholds are transmitted as side information. Further redundancy reduction is achieved by Huffman coding of both quantized subband sequences and masking thresholds.

8.4.2 Frequency-Varying Modulated Lapped Transforms

Purat and Noll [Pura96] also developed a CD-quality audio coding scheme based on a signal-adaptive, nonuniform, tree-structured wavelet packet decomposition. This coder is unique in two ways. First of all, it makes use of a novel wavelet packet decomposition [Pura95]. Secondly, the algorithm adapts to the signal the wavelet packet tree decomposition depth and breadth (branching structure), based on a minimum bit rate criterion, subject to the constraint of inaudible distortions. In informal subjective tests, the algorithm achieved excellent quality at a bit rate of 55 kb/s.

8.5 HYBRID WP AND ADAPTED WP/SINUSOIDAL ALGORITHMS

This section examines audio coding algorithms that make use of a hybrid wavelet packet/sinusoidal signal analysis. Hybrid coder architectures often improve coder robustness to diverse program material. In this case, the wavelet portion of a coder might be better suited to certain signal classes (e.g., transient), while the harmonic portion might be better suited to other classes of input signal (e.g., tonal or steady-state). In an effort to improve coder overall performance (e.g., better output quality for a given bit rate), several of the signal-adaptive wavelet and wavelet packet subband coding schemes presented in the previous section have been embedded in experimental hybrid coding schemes that seek to adapt the analysis properties of the coding algorithm to the signal content. Several examples are considered in this section.

Although the WP coder improvements reported in [Tewf93] addressed the pre-echo control problems evident in [Sinh93b], they did not rectify the coder's inadequate performance for harmonic signals such as the piano test sequence. This is in part because the low-order FIR analysis filters typically employed in a WP decomposition are characterized by poor frequency selectivity, and therefore wavelet bases tend not to provide compact representations for strongly sinusoidal signals.


Figure 8.9. Hybrid sinusoidal/wavelet encoder (after [Hamd96]).

On the other hand, wavelet decompositions provide some control over time-resolution properties, leading to efficient representations of transient signals. These considerations have inspired several researchers to investigate hybrid coders.

8.5.1 Hybrid Sinusoidal/Classical DWPT Coder

Hamdy et al. developed a hybrid coder [Hamd96] designed to exploit the efficiencies of both harmonic and wavelet signal representations. For each frame, the encoder (Figure 8.9) chooses a compact signal representation from combined sinusoidal and wavelet bases. This algorithm is based on the notion that short-time audio signals can be decomposed into tonal, transient, and noise components. It assumes that tonal components are most compactly represented in terms of sinusoidal basis functions, while transient and noise components are most efficiently represented in terms of wavelet bases. The encoder works as follows. First, Thomson's analysis model [Thom82] is applied to extract sinusoidal parameters (frequencies, amplitudes, and phases) for each input frame. Harmonic synthesis using the McAulay and Quatieri reconstruction algorithm [McAu86] for phase and amplitude interpolation is next applied to obtain a residual sequence. Then the residual is decomposed into WP subbands.

The overall WP analysis tree approximates an auditory filter bank. Edge-detection processing identifies and removes transients in low-frequency subbands. Without transients, the residual WP coefficients at each scale become largely decorrelated. In fact, the authors determined that the sequences are well approximated by white Gaussian noise (WGN) sources having exponential decay envelopes. As far as quantization and encoding are concerned, sinusoidal frequencies are quantized with sufficient precision to satisfy just-noticeable differences in frequency (JNDF), which requires 8-bit absolute coding for a new frequency track, and then 5-bit differential coding for the duration of the lifetime of the track. The sinusoidal amplitudes are quantized and encoded in a similar absolute/differential manner, using simultaneous masking thresholds for shaping of


quantization noise. This may require up to 8 bits per component. Sinusoidal phases are uniformly quantized on the interval [−π, π] and encoded using 6 bits. As for quantization and encoding of WP parameters, all coefficients below 11 kHz are encoded as in [Sinh93b]. Above 11 kHz, however, parametric representations are utilized. Transients are represented in terms of a binary edge mask that can be run-length encoded, while the Gaussian noise components are represented in terms of means, variances, and exponential decay constants. The hybrid harmonic-wavelet coder was reported to achieve nearly transparent coding over a wide range of CD-quality source material at bit rates in the vicinity of 44 kb/s [Ali96].
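As an illustration of the absolute/differential track coding described above, the following Python sketch quantizes a partial's frequency with an 8-bit absolute code when a track is born and a 5-bit signed differential code while it persists. The step sizes and function name are hypothetical and are not taken from [Hamd96]; they only show the coding pattern.

def encode_track_frequency(f_hz, prev_f_hz=None, abs_step=62.5, diff_step=2.0):
    # 8-bit absolute code for a newly born track (JNDF-motivated grid),
    # 5-bit signed differential code for the lifetime of the track.
    if prev_f_hz is None:
        code = int(round(f_hz / abs_step))
        return ('abs8', max(0, min(255, code)))
    delta = int(round((f_hz - prev_f_hz) / diff_step))
    return ('diff5', max(-16, min(15, delta)))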

8.5.2 Hybrid Sinusoidal/M-band DWPT Coder

During the late 1990s, other researchers continued to explore the potential of hybrid sinusoidal-wavelet signal analysis schemes for audio coding. Boland and Deriche [Bola97] reported on an experimental sinusoidal-wavelet hybrid audio codec with a high-level architecture very similar to [Hamd96], but with low-level differences in the sinusoidal and wavelet analysis blocks. In particular, for harmonic analysis the proposed algorithm replaces Thomson's method used in [Hamd96] with a combination of total least squares linear prediction (TLS-LP) and Prony's method. Then, in the harmonic residual wavelet decomposition block, the proposed method replaces the usual DWT cascade of two-band QMF sections with a cascade of four-band QMF sections. The algorithm works as follows. First, harmonic analysis operates on nonoverlapping 12-ms blocks of rectangularly windowed input audio (512 samples at 44.1 kHz). For each block, sinusoidal frequencies, f_k, are extracted using TLS-LP spectral estimation [Rahm87], a procedure that is formulated to deal with closely spaced sinusoids in low SNR environments. Given the set of TLS-LP frequencies, a classical Prony algorithm [Marp87] next determines the corresponding amplitudes, A_k, and phases, φ_k. Masking thresholds for the tonal sequence are calculated in a manner similar to the ISO/IEC MPEG-1 psychoacoustic recommendation 2 [ISOI92]. After masked tones are discarded, the parameters of the remaining sinusoids are uniformly quantized and encoded in a procedure similar to [Hamd96]. Frequencies are encoded according to JNDFs (3 nonuniform bands, 8 bits per component in each band), phases are allocated 6 bits across all frequencies, and amplitudes are block companded with 5 bits for the gain and 6 bits per normalized amplitude. Unlike [Hamd96], however, amplitude bit allocations are fixed rather than signal adaptive. Quantized sinusoidal components are used to synthesize a tonal sequence, s_tonal(n), as follows:

s_{\mathrm{tonal}}(n) = \sum_{k=1}^{p} A_k \, e^{\,j(\Omega_k n + \phi_k)}    (8.2)

where the parameters Ω_k = 2πf_k/f_s are the normalized radian frequencies, and only p/2 frequency components are independent, since the complex exponentials are organized into conjugate symmetric pairs. As in [Hamd96], the synthetic


Figure 8.10. Subband decomposition associated with the cascaded M-band DWT in [Bola97].

tonal sequence, s_tonal(n), is subtracted from the input sequence, s(n), to form a spectrally flattened residual, r(n).
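A minimal sketch of this tonal synthesis and residual formation is shown below. The real-valued cosine form follows from organizing the complex exponentials of Eq. (8.2) into conjugate symmetric pairs; the array and variable names are placeholders, not identifiers from [Bola97].

import numpy as np

def synthesize_tonal(amps, freqs_hz, phases, fs, num_samples):
    # Sum of quantized sinusoids over one analysis block, in the spirit of Eq. (8.2)
    n = np.arange(num_samples)
    s_tonal = np.zeros(num_samples)
    for A_k, f_k, phi_k in zip(amps, freqs_hz, phases):
        omega_k = 2.0 * np.pi * f_k / fs      # normalized radian frequency
        s_tonal += A_k * np.cos(omega_k * n + phi_k)
    return s_tonal

# residual handed to the M-band wavelet section:
# r = s - synthesize_tonal(amps, freqs_hz, phases, 44100, len(s))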

In the wavelet analysis section, the harmonic residual, r(n), is decomposed such that critical bandwidths are roughly approximated using a three-level cascade (Figure 8.10) of 4-band analysis filters (i.e., 10 subbands), designed according to the M-band technique in [Alki95]. Compared to the usual DWT cascade of 2-band QMF sections, the M-band cascade offers the advantages of reduced complexity, reduced delay, and linear phase. The DWT coefficients are uniformly quantized and encoded in a block companding scheme with 5 bits per subband gain and a dynamic bit allocation according to a perceptual noise model for the normalized coefficients. A Huffman coding section removes remaining statistical redundancies from the quantized harmonic and DWT coefficient sets. In subjective listening comparisons between the proposed scheme at 60–70 kb/s and MPEG-1 layer III at 64 kb/s on 12 SQAM CD [SQAM88] source items, the authors reported indistinguishable quality for "acoustic guitar," "Eddie Rabbit," "castanets," and "female speech." Slight impairments relative to MPEG-1 layer III were reported for the remaining eight items. No comparisons were reported in terms of delay or complexity.

8.5.3 Hybrid Sinusoidal/DWPT Coder with WP Tree Structure Adaptation (ARCO)

Other researchers have also developed hybrid algorithms that represent audio using a combination of sinusoidal and wavelet packet bases. Pena et al. [Pena96] have reported on the Adaptive Resolution COdec (ARCO). This algorithm employs a two-stage hybrid tonal-WP analysis section, architecturally similar to both [Hamd96] and [Bola97]. The experimental ARCO algorithm has introduced several novelties in the segmentation, psychoacoustic analysis, tonal analysis, bit allocation, and WP analysis blocks. In addition, recent work on this project has produced a unique MDCT-based filter bank. The remainder of this subsection gives some details on these developments.

8.5.3.1 ARCO Segmentation, Perceptual Model, and Sinusoidal Analysis-by-Synthesis. In an effort to match the time-frequency analysis resolution to the signal properties, ARCO includes a segmentation scheme that


makes use of both time and frequency block clustering to determine optimal analysis frame lengths [Pena97b]. Similar blocks are assumed to contain stationary signals and are therefore combined into larger frames. Dissimilar blocks, on the other hand, are assumed to contain nonstationarities that are best analyzed using individual short segments. The ARCO psychoacoustic model resembles ISO/IEC MPEG-1 model recommendation 1 [ISOI92], with some enhancements. Unlike [ISOI92], tonality labeling is based on [Terh82], and noise maskers are segregated into narrowband and wideband subclasses. Then, frequency-dependent excitation patterns are associated with the wideband noise maskers. ARCO quantizes tonal signal components in a perceptually motivated analysis-by-synthesis. Using an iterative procedure, bits are allocated on each analysis frame until the synthetic tonal signal's excitation pattern matches the original signal's excitation pattern to within some tolerance.

8.5.3.2 ARCO WP Decomposition. The ARCO WP decomposition procedure optimizes both the tree structure, as in [Srin98], and filter selections, as in [Sinh93b] and [Phil96]. For the purposes of WP tree adaptation [Prel96a], ARCO defines for the k-th band a cost, ε_k, as

\varepsilon_k = \frac{\int_{f_k - B_k/2}^{\,f_k + B_k/2} \left( U(f) - A_k \right) df}{\int_{f_k - B_k/2}^{\,f_k + B_k/2} U(f)\, df}    (8.3)

where U(f) is the masking threshold expressed as a continuous function, the parameter f represents frequency, f_k is the center frequency for the k-th subband, B_k is the k-th subband bandwidth, and A_k is the minimum masking threshold in the k-th band. Then, the total cost, C, to be minimized over all M subbands is given by

C = \sum_{k=1}^{M} \varepsilon_k    (8.4)

By minimizing Eq. (8.4) on each frame, ARCO essentially arranges the subbands such that the corresponding set of idealized brickwall rectangular filters, having amplitude equal to the height of the minimum masking threshold in each band, matches as closely as possible the shape of the masking threshold. Then bits are allocated in each subband to satisfy the minimum masking threshold, A_k. Therefore, uniform quantization in each subband with sufficient bits effects a noise shaping that satisfies perceptual requirements without wasting bits. The method was found to be effective without accounting explicitly for the spectral leakage associated with the filter bank sidelobes [Prel96b]. As far as filter selection is concerned, ARCO employs signal-adaptive filters during steady-state segments and time-invariant filters during transients. Some of the filter selection strategies were reported to have been inspired by Agerkvist's auditory modeling work [Ager94] [Ager96]. In [Pena97a], it was found that the "symmetrization" technique [Bamb94] [Kiya94] was effective for minimizing the boundary distortions associated with the time-varying WP analysis.
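A small sketch of the cost computation in Eqs. (8.3) and (8.4) follows, assuming the masking threshold U(f) has been sampled on a dense frequency grid; the function and variable names are illustrative only.

import numpy as np

def band_cost(U, f, f_k, B_k):
    # Eq. (8.3): normalized area between U(f) and its in-band minimum A_k
    in_band = (f >= f_k - B_k / 2.0) & (f < f_k + B_k / 2.0)
    A_k = U[in_band].min()
    return np.trapz(U[in_band] - A_k, f[in_band]) / np.trapz(U[in_band], f[in_band])

def total_cost(U, f, centers, bandwidths):
    # Eq. (8.4): total cost to be minimized over candidate WP tree structures
    return sum(band_cost(U, f, fk, Bk) for fk, Bk in zip(centers, bandwidths))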


8.5.3.3 ARCO Bit Allocation. Unlike most other algorithms, ARCO encodes and transmits the masking threshold to the decoder. This has the advantage of efficiently representing both the adapted WP tree and the subband bit allocations with a single piece of information. The disadvantage, however, is that the decoder is no longer decoupled from the details of perceptual bit allocation, as is typically the case with other algorithms. The ARCO bit allocation strategy [Sera97] achieves fast convergence to a desired bit rate by shifting the masking threshold up or down using a novel noise scaling procedure. The technique essentially uses a Newton algorithm to converge in only a few iterations to the noise scaling level that achieves the desired bit rate. The technique takes into account bit allocations from previous frames and allocates bits to all subbands simultaneously. Convergence speed and accuracy are controlled by a single parameter, and the procedure is amenable to subband weighting of the threshold to create unique noise profiles. In one set of experiments, convergence to a target rate with perceptual noise shaping was achieved in between two and seven iterations of the low-complexity technique. Another unique property of ARCO is its set of high-level "cognitive rules" that seek to minimize the objectionable distortion when insufficient bits are available to guarantee transparent coding [Pena95]. These rules monitor the evolution of coding distortion over many frames and make fine noise-shaping adjustments on individual frames in order to avoid perceptually annoying noise patterns that could not otherwise be detected on a short-time basis.
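The noise-scaling idea can be illustrated with a generic Newton-style iteration on a threshold offset. This is only a sketch of the principle, not the procedure of [Sera97]; bits_for_offset is a placeholder for the coder's rate-measurement routine, which returns the frame's bit demand when the masking threshold is shifted by c dB.

def scale_threshold_to_rate(bits_for_offset, target_bits, tol=0.02, max_iter=10):
    # Find the threshold offset c (in dB) such that the frame's bit demand
    # approximately equals target_bits, using a numerical Newton update.
    c = 0.0
    for _ in range(max_iter):
        b = bits_for_offset(c)
        if abs(b - target_bits) <= tol * target_bits:
            break
        slope = bits_for_offset(c + 0.5) - bits_for_offset(c - 0.5)  # bits per dB
        if slope == 0:
            break
        c -= (b - target_bits) / slope
    return c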

8.5.3.4 ARCO Developments. It is interesting to note that the researchers developing ARCO recently replaced the hybrid sinusoidal-WP analysis filter bank with a novel multiresolution MDCT-based filter bank. In [Casa98], Casal et al. developed a "multi-transform" (MT) that retains the lapped properties of the MDCT but creates a nonuniform time-frequency tiling by transforming back into time the high-frequency MDCT components in L-sample blocks. The proposed MT is characterized by high resolution in frequency for the low subbands and high resolution in time for the high frequencies. Like the MDCT upon which it is based, the MT maintains critical sampling and perfect reconstruction in the absence of quantization. Preliminary results for application of the MT in the TARCO (Tonal Adaptive Resolution COdec) are given in [Casa98]. As far as bit rates, reconstruction quality, and complexity are concerned, details on ARCO/TARCO have not yet appeared in the literature.

We conclude this section with the observation that hybrid DWT-sinusoidal and DWPT-sinusoidal architectures, such as those advocated by Hamdy [Hamd96], Boland [Bola97], and Pena [Pena96], have been motivated by the notion that a source-robust audio coder must represent radically different signal types with uniform efficiency. The idea behind the hybrid structure is that providing two extreme basis possibilities might yield opportunities for maximally efficient signal-adaptive basis selection. By offering superior frequency resolution with inherently narrowband basis elements, sinusoidal signal models are ideally suited for strongly tonal signals, while DWT and WPT filter banks, on the other hand, sacrifice some frequency resolution but offer greater time-resolution flexibility, making these bases inherently more efficient for representing transient signals. As


this section has demonstrated, the combination of both signal models within a single codec can provide compact representations for a wide range of input signals. The next section of this chapter examines a different type of hybrid audio coding architecture in which code-excited linear prediction (CELP) is embedded within subband coding schemes.

8.6 SUBBAND CODING WITH HYBRID FILTER BANK/CELP ALGORITHMS

While hybrid sinusoidal-DWT and sinusoidal-DWPT signal models seek to maximize robustness and basis flexibility, other hybrid signal models have been motivated by low-delay and low-complexity concerns. In this section, we consider in particular algorithms that combine a filter bank front end with subband-specific code-excited linear prediction (CELP) blocks for quantization and coding of the decimated subband sequences. The goal of these experimental hybrid coders is to achieve very low delay and/or low-complexity perceptual coding with reconstruction quality comparable to any state-of-the-art audio codec. Before considering these algorithms, however, we first define what is meant by "code-excited linear prediction."

In the coding literature, the acronym "CELP" denotes an entire class of efficient analysis-by-synthesis source coding techniques developed primarily for speech applications, in which the analyzed signal is treated as the output of a source-system mechanism such as the human vocal apparatus. In the CELP scheme, excitation vectors corresponding to the lower vocal tract "source" contribution drive a slowly time-varying LP synthesis filter that corresponds to the upper vocal tract "system." Parameters of the LP synthesis filter are usually estimated on a block basis, typically every 20 ms, while the excitation vectors are usually updated more frequently, typically every 5 ms. The LP parameters are most often estimated in an open-loop procedure by solving a set of normal equations that have been formulated to minimize the mean square prediction error. In contrast, the excitation vectors are optimized in a closed-loop, analysis-by-synthesis procedure such that the reconstruction error is minimized, most often in the perceptually weighted mean square sense. Given a vector of input speech, the analysis-by-synthesis process essentially reduces to a search during which the encoder must identify within a vector codebook that candidate excitation that generates the best synthetic output speech when processed by the LP synthesis filter. The set of encoded parameters is therefore a set of filter parameters and one (or more) vector indices, depending upon the codebook structure. Since its introduction in the mid-1980s [Schr85], CELP and its derivatives have received considerable attention in the literature. As a result, numerous high-quality, highly efficient algorithms have been proposed and adopted as international standards in speech coding. Although a detailed discussion of CELP is beyond the scope of this book, we refer the reader to the comprehensive tutorial in [Span94] for further details, as well as a complete perspective on CELP research and standards. The remainder of this section assumes that the reader has a basic understanding


of CELP coding principles. Several examples of experimental subband/CELP algorithms are examined next.
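As a concrete, simplified reference for the closed-loop search described above, the sketch below passes each candidate excitation through an LP synthesis filter and keeps the index/gain pair with the smallest squared error. It is a minimal illustration, not a standard codec: a practical CELP coder would add a perceptual weighting filter and adaptive plus fixed codebooks.

import numpy as np
from scipy.signal import lfilter

def celp_codebook_search(target, codebook, lpc_coeffs):
    # Analysis-by-synthesis: synthesize each candidate with 1/A(z) and
    # minimize the (here unweighted) squared reconstruction error.
    best_idx, best_gain, best_err = -1, 0.0, np.inf
    for idx, excitation in enumerate(codebook):
        synth = lfilter([1.0], lpc_coeffs, excitation)   # LP synthesis filter
        gain = np.dot(target, synth) / (np.dot(synth, synth) + 1e-12)
        err = np.sum((target - gain * synth) ** 2)
        if err < best_err:
            best_idx, best_gain, best_err = idx, gain, err
    return best_idx, best_gain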

8.6.1 Hybrid Subband/CELP Algorithm for Low-Delay Applications

One example of a hybrid filter bank/CELP low-delay audio codec was developed jointly by Hay and Saoudi at ENST and Mainard at CCETT. They devised a system for generic audio signals sampled at 32 kHz, based on the four-band polyphase quadrature filter bank (pseudo-QMF) borrowed from the ISO/IEC MPEG-2 AAC scalable sample rate profile [Akai95] and a bank of modified ITU G.728 [ITUR92] low-delay CELP speech codecs (Figure 8.11). The primary objective of this system is to achieve transparent coding of the high-fidelity input with very low delay. The coder was first reported in [Hay96] and then enhanced in [Hay97]. The enhanced algorithm works as follows. First, the filter bank decomposes the input into four equal-width subbands. Then, each of the decimated subband sequences is quantized and encoded in five-sample blocks (0.625 ms) using modified G.728 codecs (low-delay CELP) for each subband. The backward-adaptive G.728 algorithm [ITUR92] generates as output a single vector index for each block of input samples, and therefore a set of four codebook indices, i1, i2, i3, i4, comprises the complete bitstream for the hybrid audio codec. Algorithmic delay consists of the 3-ms filter bank delay (96-tap filters) plus the additional 2-ms delay contributed by the G.728 stages, resulting in a total delay of only 5 ms. Bit allocation targets for each subband are computed by means of a modified ISO/IEC MPEG-1 psychoacoustic model 1 that computes masking thresholds, signal-to-mask ratios, and ultimately the number of bits required for transparent coding, by analyzing the quantized outputs of the i-th band, Ŝi, from a 4-ms-old block of data.

Figure 8.11. Low-delay hybrid filter-bank/LD-CELP algorithm [Hay97].


The perceptual model utilizes an alias-cancelled DFT [Tang95] to compensate for the analysis filter bank's aliasing distortion. Bit allocations are derived at both the transmitter and receiver from the same set of quantized data, making it unnecessary to transmit explicitly any bit allocation information. Average bit allocations on a subset of the standard ISO test material were 31, 18, 12, and 3 kb/s, respectively, for subbands 1 through 4. Given that the G.728 codec is intended to operate at a fixed rate, the primary challenge facing the algorithm designers was implementing dynamic subband bit allocations. Computationally efficient, variable-rate versions of G.728 were constructed for bands 2 through 4 by structuring standard LBG (K-means) [Lind80] codebooks to deal with multiple rates (variable precision codebook indices). Unfortunately, the first (low-frequency) subband requires an average bit rate of 32 kb/s for perceptual transparency, which translates to an impractical codebook size of 2^20 vectors. To solve this problem, the authors implemented a highly efficient D5 lattice VQ scheme [Conw88], which dramatically reduced the search complexity for each input vector by constraining the search space to a 50-vector neighborhood. Lattice vector shapes were assigned 16 bits and gains 4 bits. The lattice scheme was shown to perform nearly as well as an exhaustive search over a codebook containing more than 50,000 vectors. Neither objective nor subjective quality measures were reported for this hybrid system.

8.6.2 Hybrid Subband/CELP Algorithm for Low-Complexity Applications

Intended for achieving CD quality in low-complexity decoder applications, a second example of a hybrid filter bank/CELP algorithm appeared in [Vand98]. Like [Hay97], the proposed algorithm follows a critically sampled filter bank with a quantization and encoding stage of parallel, variable-rate CELP coders, one per subband (Figure 8.12). Unlike [Hay97], however, this algorithm makes use of a higher-resolution, longer-delay filter bank. Thus, channel separation is gained at the expense of delay. At the same time, this algorithm utilizes relatively low-order LP synthesis filters, which significantly reduce decoder complexity. In contrast, [Hay97] captures significant spectral detail in the high-order (50-th order) predictors that are embedded in the G.728 blocks. The proposed algorithm closely resembles ISO/IEC MPEG-1 layer 1 in its filter bank and psychoacoustic modeling sections. In particular, the filter bank is identical to the 32-band, 512-tap PQMF bank of [ISOI92]. Also like [ISOI92], the subband sequences are processed in 12-sample blocks, corresponding to 384 input samples.

The proposed algorithm, however, replaces the block companding of [ISOI92] with CELP quantization and encoding for all 32 subbands. For every block of 12 subband samples, bits are allocated to the subbands on the basis of masking thresholds delivered by the perceptual model. This practice establishes the minimum SNRs required in each subband to achieve perceptually transparent coding. Then, parallel noise scaling is applied to the target SNRs to adjust the bit rate to a scalable target. Finally, CELP blocks quantize and encode each subband using the number of bits allocated by the perceptual model. The particulars of the 32


Figure 8.12. Low-complexity hybrid filter-bank/CELP algorithm [Vand98].

identical CELP stages are as follows. In order to maintain low complexity, the backward-adaptive LP synthesis filters are second order. The codebook, which is identical for all stages, contains 12-element stochastic excitation vectors that are structured for gain-shape quantization, with 6 bits allocated to the gains and 8 bits allocated to the shapes for each of the 256 codewords. Because bits are allocated dynamically for each subband in accordance with a masking threshold, the CELP blocks are configured for variable-rate operation. Each CELP coder will combine excitation contributions from up to 4 codebooks, meaning that the available rates for each subband are 0, 1.17, 2.33, 3.5, and 4.67 bits per sample. The closed-loop analysis-by-synthesis excitation search procedure relies upon a standard MSE minimization codebook search. The total bit budget, R, is given by

R = 2 N_b + \sum_{i=1}^{N_b} 8 N_c(i) + \sum_{i=1}^{N_b} 6 N_c(i)    (8.5)

where N_b is the number of bands (32), and N_c(i) is the number of codebooks required in the i-th band to achieve the SNR demanded by the perceptual model. From left to right, the terms in Eq. (8.5) represent the bits required to specify the number of codebooks being used in each subband, the bits required for the shape codewords, and the bits required for the gain codewords. In informal subjective tests over a set of unspecified test material, the algorithm was reported to produce quality "near transparency" at 62 kb/s, "good quality" at 50 and 37 kb/s, and quality that was "weak" at 30 kb/s.
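For reference, Eq. (8.5) reduces to the following one-line computation (a sketch; Nc is assumed to be the list of per-band codebook counts delivered by the perceptual model).

def total_bits(Nc):
    # Eq. (8.5): 2 bits per band signal the codebook count; each codebook
    # actually used costs 8 shape bits plus 6 gain bits per 12-sample block.
    return 2 * len(Nc) + sum(8 * n + 6 * n for n in Nc)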


8.7 SUBBAND CODING WITH IIR FILTER BANKS

Although the majority of subband and wavelet audio coding algorithms found in the literature employ banks of perfect reconstruction FIR filters, this does not preclude the possibility of using infinite impulse response (IIR) filter banks for the same purpose. Compared to FIR filters, IIR filters are able to achieve similar magnitude response characteristics with reduced filter orders, and hence with reduced complexity. In the multiband case, IIR filter banks also offer complexity advantages over FIR filter banks. Enhanced performance, however, comes at the expense of increased sensitivity and implementation cost for IIR filter banks. Creusere and Mitra constructed a template subband audio coding system, modeled after [Lokh92], to compare performance and to study the tradeoffs involved when choosing between FIR and IIR filter banks for the audio coding application [Creu96]. In the study, two IIR and two FIR coding schemes were constructed from the template using a structured all-pass filter bank, a parallel all-pass filter bank, a tree-structured QMF bank, and a PQMF bank. Beyond this study, IIR filter banks have not been widely used for audio coding. The application of IIR filter banks to subband audio coding remains a subject that is largely unexplored.

PROBLEMS

8.1 In this problem, we will show that the STFT can be interpreted as a bank of subband filters. Given the STFT, X(n, k), of the input signal x(n),

X(n, k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n - m)\, e^{-j\Omega_k m} = w(n) * \left[ x(n)\, e^{-j\Omega_k n} \right]

where w(n) is the sliding analysis window. Give a filter-bank realization of the STFT for a discrete frequency variable Ω_k = k(ΔΩ), k = 0, 1, ..., 7 (i.e., 8 bands). Choose ΔΩ such that the speech band (20–4000 Hz) is covered. Assume that the frequencies Ω_k are uniformly spaced.

8.2 The mother wavelet function, ξ(t), is given in Figure 8.13. Determine and sketch carefully the wavelet basis functions, ξ_{υ,τ}(t), for υ = 0, 1, 2 and τ = 0, 1, 2 associated with ξ(t),

\xi_{\upsilon\tau}(t) \triangleq 2^{-\upsilon/2}\, \xi(2^{-\upsilon} t - \tau)    (8.6)

where υ and τ denote the dilation (frequency scaling) and translation (time shift) indices, respectively.

8.3 Let h_0(n) = [1/\sqrt{2}, 1/\sqrt{2}] and h_1(n) = [1/\sqrt{2}, -1/\sqrt{2}]. Compute the scaling and wavelet functions, φ(t) and ξ(t). Using ξ(t) as the mother wavelet, generate the wavelet basis functions ξ_{0,0}(t), ξ_{0,1}(t), ξ_{1,0}(t), and ξ_{1,1}(t).


Figure 8.13. An example wavelet function.

Hint: From DWT theory, the Fourier transforms of φ(t) and ξ(t) are given by

\Phi(\Omega) = \frac{1}{\sqrt{2}}\, H_0(e^{j\Omega/2}) \prod_{p=2}^{\infty} H_0(e^{j\Omega/2^p})    (8.7)

\Xi(\Omega) = \frac{1}{\sqrt{2}}\, H_1(e^{j\Omega/2}) \prod_{p=2}^{\infty} H_0(e^{j\Omega/2^p})    (8.8)

where H_0(e^{j\Omega}) and H_1(e^{j\Omega}) are the DTFTs of the causal FIR filters h_0(n) and h_1(n), respectively. For convenience, we assumed T_s = 1 in \Omega = \omega T_s, and \phi(t) \xleftrightarrow{\text{CFT}} \Phi(\omega) \equiv \Phi(\Omega), h_0(n) \xleftrightarrow{\text{DTFT}} H_0(e^{j\Omega}).

8.4 Let H_0(e^{j\Omega}) and H_1(e^{j\Omega}) be ideal lowpass and highpass filters with cutoff frequency π/2, as shown in Figure 8.14. Sketch \Phi(\Omega), \Xi(\Omega), and the wavelet basis functions ξ_{0,0}(t), ξ_{0,1}(t), ξ_{1,0}(t), and ξ_{1,1}(t).

8.5 Show that if both H_0(e^{j\Omega}) and H_1(e^{j\Omega}) are causal FIR filters of order N, then the wavelet basis functions ξ_{υτ}(t) will have finite duration of (N + 1)2^{υ}.

8.6 Using equations (8.7) and (8.8), prove the following: 1) \Phi(\Omega/2) = \prod_{p=2}^{\infty} H_0(e^{j\Omega/2^p}) and 2) |\Phi(\Omega)|^2 + |\Xi(\Omega)|^2 = |\Phi(\Omega/2)|^2.

8.7 From problem 8.6, we have \Phi(\Omega) = \frac{1}{\sqrt{2}} H_0(e^{j\Omega/2})\, \Phi(\Omega/2) and \Xi(\Omega) = \frac{1}{\sqrt{2}} H_1(e^{j\Omega/2})\, \Phi(\Omega/2). Show that φ(t) and ξ(t) can be obtained


Figure 8.14. Ideal lowpass and highpass filters, H_0(e^{jΩ}) and H_1(e^{jΩ}), with cutoff frequency π/2.

Figure 8.15. (a) The scaling function, φ(t); (b) the mother wavelet function, ξ(t); and (c) the input signal, x(t).


recursively as

\phi(t) = \sqrt{2} \sum_{n} h_0(n)\, \phi(2t - n)    (8.9)

\xi(t) = \sqrt{2} \sum_{n} h_1(n)\, \phi(2t - n)    (8.10)

COMPUTER EXERCISE

8.8 Let the scaling function, φ(t), and the mother wavelet function, ξ(t), be as shown in Figure 8.15(a) and Figure 8.15(b), respectively. Assume that the input signal, x(t), is as shown in Figure 8.15(c). Given the wavelet series expansion

x(t) = \sum_{\tau=-\infty}^{\infty} \alpha(\tau)\, \phi(t - \tau) + \sum_{\upsilon=0}^{\infty} \sum_{\tau=-\infty}^{\infty} \beta(\upsilon, \tau)\, \xi_{\upsilon\tau}(t)    (8.11)

where both υ and τ are integers and denote the dilation and translation indices, respectively, and α(τ) and β(υ, τ) are the wavelet expansion coefficients, solve for α(τ) and β(υ, τ).
[Hint: Compute the coefficients using inner products, \alpha(\tau) \triangleq \langle x(t), \phi(t - \tau) \rangle = \int x(t)\, \phi(t - \tau)\, dt and \beta(\upsilon, \tau) \triangleq \langle x(t), \xi_{\upsilon\tau}(t) \rangle = \int x(t)\, \xi_{\upsilon\tau}(t)\, dt = \int x(t)\, 2^{-\upsilon/2}\, \xi(2^{-\upsilon} t - \tau)\, dt.]

CHAPTER 9

SINUSOIDAL CODERS

9.1 INTRODUCTION

This chapter addresses perceptual coding algorithms based on sinusoidal models. Although sinusoidal signal models have been applied successfully since the 1980s in speech coding [Hede81] [Alme83] [McAu86] [Geor87] [Geor92] and music synthesis [Serr90], perceptual properties were not introduced in sinusoidal modeling until later [Edle96c] [Pena96] [Levin98a] [Pain01]. The advent of MPEG-4 standardization established new research goals for high-quality coding of general audio signals at bit rates in the range of 6–24 kb/s. In experiments reported as part of the MPEG-4 standardization effort, it was determined that sinusoidal coding is capable of achieving good quality at low rates without being constrained by a restrictive source model. Furthermore, unlike CELP and other classical low-rate speech coding models, parametric sinusoidal coding is amenable to pitch and time-scale modification at the decoder. Additionally, the emergence of Internet-based streaming audio has motivated considerable research on the application of sinusoidal signal models to high-quality audio coding at low bit rates. For example, Levine and Smith developed a hybrid sinusoidal-filter-bank coding scheme that achieves very high quality at rates around 32 kb/s [Levin98a] [Levi99].

This chapter describes some of the sinusoidal algorithms for low-rate audio coding that exploit perceptual properties. In Section 9.2, we review the classical sinusoidal model. Section 9.3 presents the analysis/synthesis audio codec (ASAC), which was eventually considered for MPEG-4 standardization. Section 9.4 describes an enhanced version of ASAC, the harmonic and individual lines plus noise (HILN) algorithm. The HILN algorithm has been adopted



as part of the MPEG-4 standard. Section 9.5 examines the use of FM synthesis operators in sinusoidal audio coding. In Section 9.6, we investigate the sines + transients + noise (STN) model. Finally, Section 9.7 is concerned with algorithms that combine sinusoidal modeling with other well-known techniques in various hybrid architectures to achieve efficient low-rate audio coding.

9.2 THE SINUSOIDAL MODEL

This section describes the sinusoidal model that forms the basis for the parametric audio coding and the extended hybrid model given in the latter portions of this chapter. In particular, standard methodologies are presented for sinusoidal analysis, tracking, interpolation, and synthesis. The classical sinusoidal model comprises an analysis-synthesis framework ([McAu86] [Serr90] [Quat02]) that represents a signal, s(n), as the sum of a collection of K sinusoids ("partials") with time-varying frequencies, phases, and amplitudes, i.e.,

s(n) \approx \hat{s}(n) = \sum_{k=1}^{K} A_k \cos(\omega_k(n)\, n + \phi_k(n))    (9.1)

where A_k represents the amplitude, ω_k(n) represents the instantaneous frequency, and φ_k(n) represents the instantaneous phase of the k-th sinusoid. It is assumed that the amplitude, frequency, and phase functions evolve on a time scale substantially longer than a signal period. Analysis for this model amounts to estimating the amplitudes, phases, and frequencies of the constituent partials. Although this estimation is typically accomplished by peak picking in the short-time Fourier domain [McAu86] [Span91] [Serr90], analysis-by-synthesis estimation techniques that minimize explicitly a mean square error in terms of the sinusoidal parameters have also been proposed [Geor87] [Geor90] [Geor92]. Sinusoidal analysis-by-synthesis has also been presented within the more generalized framework of matching pursuits using overcomplete signal dictionaries [Good97] [Verm99]. Whether classical short-time Fourier transform (STFT) peak picking or analysis-by-synthesis is used for parameter estimation, the analysis yields partial parameters on each frame, and the data rate of the parameterization is given by the analysis stride and the order of the model. In the synthesis stage, the frame-rate model parameters are connected from frame to frame by a line tracking process, and then interpolated using low-order polynomial models to derive sample-rate control functions for a bank of oscillators. Interpolation is carried out based on synthesis frames, which are implicitly established by the analysis stride. Although the bank of synthesis oscillators can be realized through additive combination of cosines, computationally efficient alternatives are available based on the FFT (e.g., [McAu88] [Rode92]).

9.2.1 Sinusoidal Analysis and Parameter Tracking

The STFT-based analysis scheme [McAu86] [Serr89] that estimates the sinusoidal model parameters is presented here (Figure 9.1). First, the input is segmented into


Figure 9.1. Sinusoidal model analysis. The time-domain signal is segmented into overlapping frames that are transformed to the frequency domain using STFT analysis. Local magnitude-spectral maxima are identified. It is assumed that each peak is associated with a pure tone (partial) component of the input. For each of the peaks, a parameter triad containing frequency, amplitude, and phase is extracted. Finally, a tracking algorithm forms time trajectories for the sinusoids by matching the amplitude and/or frequency parameters across time.

overlapping frames. In the hybrid signal model, analysis frame lengths are signal adaptive. Frames are typically overlapped by half of their length. After segmentation, the frames are analyzed with the STFT, which yields magnitude and phase spectra. The sinusoidal analysis scheme assumes that magnitude spectral peaks are associated with underlying pure tones in the input. Therefore, spectral peaks are identified by a peak detector and then passed to a tracking algorithm that forms time trajectories by associating peaks from frame to frame. For a time-domain input s(n), let S_l(k) denote the complex-valued STFT of the signal s(n) on the l-th frame. A spectral peak is defined as a local maximum in the magnitude spectrum |S_l(k)|, i.e., an STFT magnitude peak on bin k_0 that satisfies the inequality

|S_l(k_0 - 1)| \le |S_l(k_0)| \ge |S_l(k_0 + 1)|    (9.2)
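A direct implementation of this peak-picking rule over an STFT magnitude spectrum might look as follows (a sketch; S_mag is assumed to be the magnitude spectrum |S_l(k)| of one frame).

import numpy as np

def spectral_peaks(S_mag):
    # Bins k0 satisfying Eq. (9.2): local maxima of the magnitude spectrum
    k = np.arange(1, len(S_mag) - 1)
    is_peak = (S_mag[k] >= S_mag[k - 1]) & (S_mag[k] >= S_mag[k + 1])
    return k[is_peak]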


Following peak identification on frames l and l + 1, a tracking procedure forms time trajectories by matching across frames those spectral peaks which satisfy certain matching criteria. The resulting trajectories are intended to represent the smoothly time-varying frequencies, amplitudes, and phases of the sinusoidal partials that comprise the signal under analysis. Several trajectory tracking algorithms have been demonstrated to perform well [McAu86] [Serr89].

The tracking procedure (Figure 9.2) works in the following way. First, denote by ω_i^l the frequencies associated with the sinusoids identified on frame l, with 1 ≤ i ≤ p. Similarly, denote by ω_j^{l+1} the frequencies associated with the sinusoids identified on frame l + 1, with 1 ≤ j ≤ r. Given two sets of unmatched sinusoids, the tracking objective is to identify, for the i-th sinusoid on frame l, the j-th sinusoid on frame l + 1 that is closest in frequency and/or amplitude (here, only frequency matching is considered). Therefore, in the first step of the procedure, an initial match is formed between ω_i^l and ω_j^{l+1} such that the difference Δω = |ω_i^l − ω_j^{l+1}| is minimized, and such that the distance Δω is less than a specified maximum, Δω_max.

Following an initial match, three outcomes are possible. First, the trajectory will be continued (Figure 9.2a) if a match is found and there are no match conflicts to be resolved. In this case, the frequency, amplitude, and phase parameters are interpolated from frame l to frame l + 1. On the other hand, if no initial match is found during the first step, it is assumed that the trajectory associated with frequency ω_i^l must terminate. In this case, the trajectory is declared "dead" (Figure 9.2c) and is matched to itself with zero amplitude on frame l + 1. In the third possible outcome, the initial match creates a conflict. In this case, the i-th trajectory attempts to match with a peak that has already been claimed by another

Figure 9.2. Sinusoidal trajectory formation (continuation, birth, and death). In the figure, an 'x' denotes the presence of a sinusoid at the specified frequency, while an 'o' denotes the absence of a sinusoid at the specified frequency. In part (a), a sinusoid on frame k is matched to a sinusoid on frame k + 1 because the two sinusoids are sufficiently close in frequency and because there are no conflicts. During synthesis, the frequency, amplitude, and phase parameters are interpolated from frame k to frame k + 1. In part (b), a sinusoid on frame k + 1 is declared "born" because a sufficiently close matching sinusoid does not exist on frame k. In this case, frequency is held constant, but amplitude is interpolated from zero on frame k to the measured amplitude on frame k + 1. In part (c), a sinusoid on frame k is declared "dead" because a sufficiently close matching sinusoid does not exist on frame k + 1. In this case, frequency is held constant, but amplitude is interpolated from the measured amplitude on frame k to zero on frame k + 1.


trajectory. The conflict is resolved in favor of the closest frequency match. If the current trajectory loses, it picks the next best available match that satisfies the difference criterion outlined above. If the pre-existing match loses the conflict, the current trajectory claims the peak, and the pre-existing match is returned to the pool of available trajectories. This process is repeated until all trajectories are either matched or declared "dead." At the conclusion of the matching procedure, any unclaimed sinusoids on frame l + 1 are declared "born." As shown in Figure 9.2(b), trajectories at "birth" are backwards matched to themselves on frame l, with the amplitude interpolated from zero on frame l to the measured amplitude on frame l + 1.
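The following sketch captures the essence of the matching logic in a simplified, greedy form; it omits the conflict-resolution and backtracking details described above, and all names are illustrative rather than taken from [McAu86] or [Serr89].

def match_trajectories(freqs_l, freqs_l1, delta_max):
    # Nearest-frequency matching between frame l and frame l+1.
    # Unmatched entries of freqs_l "die"; unclaimed entries of freqs_l1 are "born".
    matches, claimed = {}, set()
    for i, fi in enumerate(freqs_l):
        best_j, best_d = None, delta_max
        for j, fj in enumerate(freqs_l1):
            if j not in claimed and abs(fi - fj) <= best_d:
                best_j, best_d = j, abs(fi - fj)
        if best_j is not None:
            matches[i] = best_j            # continuation
            claimed.add(best_j)
    deaths = [i for i in range(len(freqs_l)) if i not in matches]
    births = [j for j in range(len(freqs_l1)) if j not in claimed]
    return matches, deaths, births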

9.2.2 Sinusoidal Synthesis and Parameter Interpolation

The sinusoidal trajectories of frequency, amplitude, and phase triads are updated at a rate of once per frame. The synthesis portion of the sinusoidal model uses the frame-rate parameters that were extracted during the analysis procedure to generate a sample-rate output sequence, ŝ(n), by appropriately controlling the output of a bank of oscillators. One method for generating the model output is as follows. On the l-th frame, let the output sample on index m + lH represent the sum of the contributions of the K partials that were estimated on the l-th frame, i.e.,

\hat{s}(m + lH) = \sum_{k=1}^{K} A_k^l \cos(\omega_k^l m + \phi_k^l), \quad 0 \le m < H    (9.3)

where the parameter triad {ω_k^l, A_k^l, φ_k^l} represents the frequency, amplitude, and phase, respectively, of the k-th sinusoid, and the parameter H corresponds to the synthesis hop size (equal to the analysis hop size unless time-scale modification is required). The problem with this approach is that the sinusoidal parameters are not interpolated between frames, and therefore the sequence ŝ(n) will in general contain jump discontinuities at the frame boundaries. In order to avoid discontinuities and the associated artifacts, a better approach is to use oscillator control functions that interpolate the trajectory parameters from one frame to the next. If the k-th trajectory parameters on frames l and l + 1 are given by {ω_k^l, A_k^l, φ_k^l} and {ω_k^{l+1}, A_k^{l+1}, φ_k^{l+1}}, respectively, then the instantaneous amplitude, A_k^l(m), can be linearly interpolated between the measured amplitudes A_k^l and A_k^{l+1} using the relation

A_k^l(m) = A_k^l + \frac{A_k^{l+1} - A_k^l}{H}\, m, \quad 0 \le m < H    (9.4)

Measured values for frequency and phase are interpolated next. For clarity, the subscript index k has been dropped throughout the remainder of this discussion, and the frame index l is used in its place. Frequency and phase interpolation are less straightforward than amplitude interpolation because of the fact that frequency is the phase derivative. Before defining a phase interpolation function, it is important to note that the instantaneous phase, θ(m), is defined as

\theta(m) = m\omega + \phi, \quad 0 \le m < H    (9.5)


where ω and φ are the measured frequency and measured phase, respectively. For smooth interpolation between frames, therefore, it is necessary that the instantaneous phase be equal to the measured phases at the frame boundaries and, simultaneously, it is also necessary that the instantaneous phase derivatives be equal to the measured frequencies at the frame boundaries. To accomplish this, a cubic phase interpolation polynomial was proposed [McAu86] of the form

\theta_l(m) = \gamma + \kappa m + \alpha m^2 + \beta m^3, \quad 0 \le m < H    (9.6)

After some manipulation, it can be shown [McAu86] that the instantaneous phase is given by

\theta_l(m) = \phi^l + \omega^l m + \alpha(M^*) m^2 + \beta(M^*) m^3, \quad 0 \le m < H    (9.7)

and that the parameters α and β are obtained as follows:

\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix} = \begin{bmatrix} \dfrac{3}{H^2} & -\dfrac{1}{H} \\ -\dfrac{2}{H^3} & \dfrac{1}{H^2} \end{bmatrix} \begin{bmatrix} \phi^{l+1} - \phi^{l} - \omega^{l} H + 2\pi M \\ \omega^{l+1} - \omega^{l} \end{bmatrix}    (9.8)

Although many values for the parameter M will allow θ(m) to satisfy the frame boundary conditions, it was shown in [McAu86] that the smoothest function (in the sense that the integral of the square of the second derivative of the function θ(m) is minimized) is obtained for the value M = M^*, which is given by

M^{*} = \mathrm{round}\left( \frac{1}{2\pi} \left[ \left( \phi^{l} + \omega^{l} H - \phi^{l+1} \right) + \left( \omega^{l+1} - \omega^{l} \right) \frac{H}{2} \right] \right)    (9.9)

where the round(·) operation denotes taking the integer closest to the function argument. We now restore the notational conventions used earlier, i.e., that the subscript corresponds to the parameter index and that the superscript represents a frame index. Given the interpolated amplitude and instantaneous phase functions, A_k^l(m) and θ_k^l(m), respectively, the interpolated sinusoidal model synthesis expression becomes

\hat{s}(m + lH) = \sum_{k=1}^{K} A_k^l(m) \cos(\theta_k^l(m)), \quad 0 \le m < H    (9.10)

Unless otherwise specified, this is the standard synthesis expression used in the hybrid sinusoidal model presented throughout the remainder of this chapter. This section has provided the essential details of the basic sinusoidal model. In Section 9.6, the basic model is extended to enhance its robustness over a wider class of signals by including explicit model extensions for transient and noise-like signal energy.
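The interpolation and synthesis equations (9.4) and (9.7)-(9.10) translate almost directly into code. The sketch below synthesizes one partial over one synthesis frame; the frame-boundary parameters are assumed to come from the tracking stage, and the function name is illustrative.

import numpy as np

def synthesize_partial(A0, A1, w0, w1, phi0, phi1, H):
    # Linear amplitude interpolation, Eq. (9.4)
    m = np.arange(H)
    A = A0 + (A1 - A0) * m / H
    # Maximally smooth cubic phase interpolation, Eqs. (9.7)-(9.9)
    M = np.round(((phi0 + w0 * H - phi1) + (w1 - w0) * H / 2.0) / (2.0 * np.pi))
    e = phi1 - phi0 - w0 * H + 2.0 * np.pi * M
    alpha = (3.0 / H**2) * e - (1.0 / H) * (w1 - w0)
    beta = (-2.0 / H**3) * e + (1.0 / H**2) * (w1 - w0)
    theta = phi0 + w0 * m + alpha * m**2 + beta * m**3
    return A * np.cos(theta)               # one term of Eq. (9.10)

# The frame output is the sum of synthesize_partial(...) over all K trajectories.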


Figure 9.3. ASAC encoder (after [Edle96c]).

In Sections 9.3 through 9.5, we apply the sinusoidal model presented here in the context of parametric audio codecs, i.e., the ASAC, the HILN, and the FM coders.

9.3 ANALYSIS/SYNTHESIS AUDIO CODEC (ASAC)

The sinusoidal analysis/synthesis audio codec (ASAC) for robust coding of general audio signals at rates between 6 and 24 kb/s was developed by Edler et al. at the University of Hannover and proposed for MPEG-4 standardization [Edle95] in 1995. An enhanced ASAC proposal later appeared in [Edle96a]. Initially, ASAC segments input audio into analysis frames over which the signal is assumed to be nearly stationary. Sinusoidal synthesis parameters are then extracted according to perceptual criteria, quantized, encoded, and transmitted to the decoder for synthesis. The algorithm distributes synthesis parameters across basic and enhanced bit streams to allow scalable output quality at bit rates of 6 and 24 kb/s. Architecturally, the ASAC scheme (Figure 9.3) consists of a pre-analysis block for window selection and envelope extraction, a sinusoidal analysis-by-synthesis parameter estimation block, a perceptual model, and a quantization and coding block. Although it bears similarities to sinusoidal speech coding [McAu86] [Geor92] [Geor97] and music synthesis [Serr90] algorithms that have been available for some time, the ASAC coder also incorporates some new techniques. In particular, whereas previous speech-specific sinusoidal coders emphasized waveform matching by minimizing reconstruction error norms such as the mean square error (MSE), ASAC disregards classical error minimization criteria and instead selects sinusoids in decreasing order of perceptual importance by means of an iterative analysis-by-synthesis loop. The perceptual significance of each component sinusoid is judged with respect to the masking power of the synthesis signal, which is determined by a simplified version of the psychoacoustic model from [Baum95], but without the temporal masking considerations. The remainder of the section presents details of the ASAC encoder [Edle96c].


9.3.1 ASAC Segmentation

Before analysis-by-synthesis begins, a pre-analysis segmentation adapts the analysis window length to match characteristics of the input signal, s(n). A short rectangular window with tapered edges is used during transient events, or a long Hanning window, twice the synthesis window length, is used during stationary segments. The pre-analysis process also extracts, via weighted linear regression, a parametric amplitude envelope for each frame to deal with rapid changes in the input signal. During transients, envelopes are parameterized in terms of a peak time, as well as attack and decay slopes. Non-attack envelopes, on the other hand, are parameterized with only a single attack (+) or decay (−) slope.

9.3.2 ASAC Sinusoidal Analysis-by-Synthesis

The iterative analysis-by-synthesis block [Edle96c] estimates, one at a time, the parameters of the i-th individual constituent sinusoid, or partial, and every iteration identifies the most perceptually significant sinusoid remaining in the synthesis residual, e_i(n) = s(n) − ŝ_i(n), and adds it to the synthetic output, ŝ_i(n). Perceptual significance is assessed by comparing the synthesis residual against the masked threshold associated with the current synthetic output and choosing the residual sinusoid with the largest suprathreshold margin. This perceptual selection criterion assumes that the signal component with the most unmasked energy will be the most perceptually important. After the most important peak frequency, f_i, has been identified, a high-resolution spectral analysis technique inspired by [Kay89] refines the frequency estimate. Given refined frequency estimates for the current partial at frame start, f_i^start, and end, f_i^end, the amplitude, A_i, and phase, φ_i, parameters are extracted from a complex correlation coefficient obtained from an inner product between the synthesis residual and a complex exponential of linearly time-varying frequency. End-point frequencies for the complex exponential correspond to the frame start and frame end frequencies for the partial of interest. For the synthesis phase, both unscaled and envelope-scaled versions of the current partial are generated. The analysis-by-synthesis loop selects the version that minimizes the residual variance, subtracts the new partial from the current residual, and then repeats the analysis-by-synthesis parameter extraction loop with the updated residual. The loop repeats until the bit budget is exhausted. We note that, in contrast to the time-varying global gain used in [Geor92], a unique feature of the ASAC coder is that the temporal envelope may be selectively applied to individual synthesis partials.
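The control flow of this selection loop can be summarized schematically as follows. All helper functions (masked_threshold, estimate_partial, synthesize_partial) are placeholders standing in for the ASAC psychoacoustic model and high-resolution estimator, so this is an illustration of the loop structure rather than the published algorithm.

import numpy as np

def asac_selection_loop(s, masked_threshold, estimate_partial, synthesize_partial,
                        max_partials):
    residual = s.astype(float).copy()
    partials = []
    for _ in range(max_partials):
        spectrum = np.abs(np.fft.rfft(residual)) ** 2
        margin = spectrum - masked_threshold(partials)   # unmasked energy per bin
        k0 = int(np.argmax(margin))
        if margin[k0] <= 0.0:
            break                                        # everything left is masked
        p = estimate_partial(residual, k0)               # refine frequency, amplitude, phase
        partials.append(p)
        residual -= synthesize_partial(p, len(residual))
    return partials, residual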

9.3.3 ASAC Bit Allocation, Quantization, Encoding, and Scalability

The ASAC quantization and coding block provides bit rate and output quality scaling by generating simultaneously basic and enhanced bitstreams for each frame. The basic bitstream contains the three temporal envelope parameters, as well as parameters representing frequencies, amplitudes, envelope enable bits,


and continuation bits for each partial. A tracking algorithm classifies each partial as either new or continued by matching partials from frame to frame on the basis of amplitude and frequency similarities [Edle96c]. Track disputes (multiple matches for a given partial) are resolved in favor of the candidate that maximizes a weighted amplitude and frequency closeness measure. ASAC partial tracking and dispute resolution differs from [McAu86] in that amplitude matching is considered explicitly, in addition to frequency proximity. For noncontinued partials, both frequencies and amplitudes are log quantized. For continued partials, on the other hand, frequency differences are uniformly quantized, while amplitude ratios are quantized on a log scale. The enhancement bitstream contains parameters to improve reconstructed signal fidelity over the basic bitstream. Enhanced bitstream parameters include finer quantization bits for the envelope parameters, phases for each partial, and finer frequency quantization bits for noncontinued partials above a threshold frequency. The phases are uniformly quantized. Because of the fixed-rate ASAC architecture, between 10 and 17 spectral lines may fit within the 6 kb/s basic bitstream for a given 32-ms frame. Like the time-frequency MPEG family algorithms [Bran94a] [Bosi96] (Chapter 10), the ASAC quantization and coding block maintains a bit reservoir to smooth local maxima in bit demand. These maxima tend to occur during transients, when many noncontinued partials arise and inflate the bit rate. Bits are deposited into the reservoir whenever the number of partials reaches a threshold and before exhausting the bit budget. Conversely, reservoir bits are withdrawn whenever the number of partials is below a threshold while the bit budget has already been exhausted.

9.3.3.1 ASAC Performance. When compared to standard speech codecs at similar bit rates, the first version of ASAC [Edle95] reportedly offered improved quality for nonharmonic tonal signals such as spectrally complex music, similar quality for single instruments, and impaired quality for clean speech [ISOI96b]. The later ASAC [Edle96a] was improved for certain signals [ISOI96c].

9.4 HARMONIC AND INDIVIDUAL LINES PLUS NOISE CODER (HILN)

The ASAC algorithm outperformed speech-specific algorithms at the same bit rate in subjective tests for certain test signals, particularly spectrally complex music characterized by large numbers of nonharmonically related sinusoids. The original ASAC, however, failed to match speech codec performance for other test signals, such as clean speech. As a result, the ASAC core was embedded in an enhanced algorithm [Purn98] intended to better match the coder's signal model with diverse input signal characteristics. In research proposed as part of an MPEG-4 "core experiment" [Purn97], Purnhagen et al. developed an "object-based" algorithm. In this approach, harmonic sinusoid, individual sinusoid, and colored noise objects could be combined in a hybrid source model to create a parametric signal representation. The enhanced algorithm, known as the "harmonic and individual lines plus noise" (HILN) coder, is architecturally very similar to the original ASAC, with some modifications.


Like ASAC, the HILN coder (Figure 9.4) segments the input signal into 32-ms overlapping analysis frames and extracts a parametric temporal envelope during preanalysis. Unlike ASAC, however, the iterative analysis-synthesis block is extended to include a cascade of analysis stages for each of the available object types. In the enhanced analysis-synthesis system of HILN, harmonic analysis is applied first, followed by individual spectral line analysis, followed by shaped noise modeling of the two-stage residual. The remainder of the section presents some unique details of the HILN encoder, including descriptions of the HILN schemes for sinusoidal analysis-by-synthesis, bit allocation, and quantization and encoding.

9.4.1 HILN Sinusoidal Analysis-by-Synthesis

The HILN sinusoidal analysis-by-synthesis occurs in a cascade of three stages. First, the harmonic section estimates a fundamental frequency and also quantifies the harmonic partial amplitudes, using essentially the same method as ASAC. Cepstral analysis is used to estimate the fundamental. Harmonics may occur on either integer or noninteger multiples of the fundamental. A stretching [Flet91] parameter is available to account for the noninteger harmonic phenomena induced by, e.g., stiff-stringed instruments. In the second stage, a partial discriminator separates those partials belonging to the harmonic object from those partials belonging to the individual sinusoid object. In the last stage, the noise extraction module uses a cosine series expansion to estimate the spectral envelope of the noise residual left after all partials have been extracted, including those partials not already extracted by the harmonic and individual stages.
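Cepstral fundamental-frequency estimation, used in the HILN harmonic stage, can be sketched as a generic real-cepstrum peak picker; the search range, window, and function name below are assumptions for illustration and are not HILN parameters.

import numpy as np

def cepstral_f0(frame, fs, f_min=50.0, f_max=500.0):
    # Real cepstrum of one windowed frame; the quefrency of the dominant
    # peak in the search range gives an estimate of the fundamental period.
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
    q_min, q_max = int(fs / f_max), int(fs / f_min)   # quefrency search range
    q_peak = q_min + int(np.argmax(cepstrum[q_min:q_max]))
    return fs / q_peak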

Figure 9.4. HILN encoder (after [Purn98]).


9.4.2 HILN Bit Allocation, Quantization, Encoding, and Decoding

Quantization and encoding is applied to the various object parameters in the following way. For individual sinusoids, quantization and coding is essentially the same as in ASAC. For the harmonic sinusoids, the fundamental frequency and partial amplitudes are log quantized and encoded. Since spectrally complex signals may contain in excess of 100 partials, the harmonic partial amplitudes beyond the tenth are grouped and represented by an average amplitude instead of by the individual amplitudes. This approach saves bits by exploiting the reduced frequency resolution of the auditory filter bank at higher frequencies. For the colored noise object, a gain is log quantized and transmitted along with the quantized parameters of the noise envelope. Bits are distributed between the three object types according to perceptual relevance. Unlike ASAC, none of the HILN sinusoidal coding objects include phase information. Start-up phases for new sinusoids are randomized at the decoder and then continued smoothly for later frames. The decoder also employs smooth fade-in and fade-out of partial tracks to prevent jump discontinuities in the output. The output noise object is synthesized using the parametric spectral envelope and randomized phase in conjunction with an inverse FFT. As is true of the ASAC algorithm, the parametric nature of HILN means that speed and pitch changes are realized in a straightforward manner.

9.4.2.1 HILN Performance. Results from subjective listening tests at 6 kb/s showed significant improvements for HILN over ASAC, particularly for the most critical test items that had previously generated the most objectionable ASAC artifacts [Purn97]. Compared to HILN, the CELP speech codecs are still able to represent clean speech more efficiently at low rates, and time-frequency codecs are able to encode general audio more efficiently at rates above 32 kb/s. Nevertheless, the HILN improvements relative to ASAC inspired the MPEG-4 committee to incorporate HILN into the MPEG-4 committee draft as the recommended low-rate parametric audio coder [ISOI97a]. As far as applications are concerned, the HILN algorithm was deployed in a scalable, low-rate, Internet streaming audio scheme [Feit98].

9.5 FM SYNTHESIS

The HILN algorithm seeks to optimize coding efficiency by making combined use of three distinct source models. Although the HILN harmonic sinusoid object has been shown to facilitate increased coding gain for certain signals, it is possible that other object types may offer opportunities for greater efficiency when representing spectrally complex harmonic signals. This notion motivated a recent investigation into the use of frequency modulation (FM) synthesis techniques [Chow73] in low-rate sinusoidal audio coding for harmonically structured single-instrument sounds [Wind98]. In this section, we first review the basic principles of FM synthesis and then present an experimental audio codec that makes use of FM synthesis operators for low-rate coding.


9.5.1 Principles of FM Synthesis

FM synthesis offers advantages over other harmonic coding methods (e.g., [Ferr96b] [Edle96c]) because of its ability to model, with relatively few parameters, harmonic signals that have many partials. In the simplest FM synthesis, for example, the frequency of a sine wave (carrier) is modulated by another sine wave (modulator) to generate a complex waveform with spectral characteristics that depend on a modulation index and the parameters of the two sine waves. In continuous time, the FM signal is given by

s(t) = A sin[2πf_c t + I sin(2πf_m t)]    (9.11)

where A is the amplitude, f_c is the carrier frequency, f_m is the modulation frequency, I is the modulation index, and t is the time index. The associated Fourier series representation is

s(t) = Σ_{k=−∞}^{∞} J_k(I) sin(2πf_c t + 2πk f_m t)    (9.12)

where J_k(I) is the Bessel function of the first kind. It can be seen from Eq. (9.12) that a large number of harmonic partials can be generated (Figure 9.5) by controlling only three parameters per FM operator. One can observe that the fundamental and harmonic frequencies are determined by f_c and f_m, and that the harmonic partial amplitudes are controlled by the modulation index, I. The Bessel envelope, moreover, essentially determines the FM spectral bandwidth. Example harmonic FM spectra for a unit-amplitude, 200-Hz carrier are given in Figure 9.5 for modulation indices of 1 (Figure 9.5a) and 15 (Figure 9.5b). While both examples have identical harmonic structure, the amplitude envelopes and bandwidths differ markedly as a function of the index, I. Clearly, the central issue in making effective use of the FM technique for signal modeling is parameter estimation accuracy.
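The following MATLAB sketch generates the FM waveform of Eq. (9.11) and the Bessel-function partial amplitudes predicted by Eq. (9.12); the sampling rate and signal duration are illustrative assumptions.

% Minimal sketch (assumed fs and duration): one FM operator per Eq. (9.11)
% and its predicted partial amplitudes |J_k(I)| per Eq. (9.12).
fs = 16000; t = 0:1/fs:0.5-1/fs;          % 0.5 s at an assumed 16-kHz rate
A = 1; fc = 200; fm = 200; I = 1;         % unit-amplitude 200-Hz carrier, index 1
s = A*sin(2*pi*fc*t + I*sin(2*pi*fm*t));  % Eq. (9.11)

k  = -10:10;                              % partials located at fc + k*fm
Jk = besselj(k, I);                       % Bessel envelope, Eq. (9.12)
stem(fc + k*fm, 20*log10(abs(Jk) + eps)); % partial levels in dB
xlabel('Frequency (Hz)'); ylabel('Level (dB)');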

9.5.2 Perceptual Audio Coding Using an FM Synthesis Model

Winduratna [Wind98] proposed an FM synthesis audio coding scheme in which the outputs of parallel FM "operators" are combined to model a single-instrument sound. The algorithm (Figure 9.6) works as follows. First, the preanalysis block segments the input into frames and then extracts parameters for a set of individual spectral lines, as in [Edle96c]. Next, the preanalysis identifies a harmonic structure by maximizing an objective function [Wind98]. Given a fundamental frequency estimate from the preanalysis, f0, the iterative parameter extraction loop estimates the parameters of individual FM operators and accumulates their contributions until the composite spectrum from the multiple parallel FM operators closely resembles the original. Perceptual closeness is judged to be adequate when the magnitude of the original minus synthetic harmonic (difference) spectrum is below the masked threshold [Baum95]. During each loop iteration, error-minimizing values for the current operator are determined by means of an exhaustive search. Then, the contribution of the current operator is added to the existing output. Finally, the harmonic residual spectrum is checked against the masked threshold. The loop repeats, and additional operators are synthesized, until the error spectrum is below the masked threshold. Spectral line amplitude prediction is used for adjacent frames having the same fundamental frequency.

[Figure 9.5. Examples of harmonic FM spectra for fc = fm = 200 Hz, with (a) I = 1 and (b) I = 15; level (dB) versus frequency (Hz).]

In the quantization and encoding block, the parameters f0, fc, fm, I, and A are encoded with 12, 7, 7, 5, and 7 bits, respectively, for each operator. The fundamental frequency and amplitude are log quantized. The carrier and modulation frequencies are encoded as integer multiples of the fundamental, while the amplitude prediction weight, p, is quantized and encoded with 8 bits for adjacent frames having identical fundamental frequencies. At the expense of high complexity, the FM coding scheme was shown to efficiently represent single-instrument sounds at bit rates between 2.1 and 4.8 kb/s. Performance of the scheme was investigated not only for audio but also for speech. Significant performance gains were realized relative to previous sinusoidal schemes. Using a 24-ms analysis window, for example, one critical male speech item was encoded at 2.12 kb/s using FM synthesis, compared to 4.5 kb/s for ASAC [Wind98], with similar output quality. Despite estimation difficulties for signals with more than one fundamental frequency, e.g., polyphonic music, the high efficiency of the FM synthesis technique for parametric representations of complex harmonic spectra makes it a likely candidate for future inclusion in object-based algorithms such as the HILN coder.

[Figure 9.6. FM synthesis coding scheme (after [Wind98]): the preanalysis of the input s(n) supplies a fundamental frequency estimate f0 and feeds a psychoacoustic model; the FM synthesis loop accumulates operator contributions, a comparator checks the residual against the masked threshold, and the operator parameters and amplitude prediction weight p are quantized and encoded.]
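To make the parameter budget concrete, a minimal MATLAB sketch of one operator's quantization with the bit widths quoted above follows. The quantizer ranges and example values are assumptions rather than Winduratna's exact tables.

% Minimal sketch (assumed ranges, not the exact quantizers of [Wind98]):
% encode one FM operator with 12, 7, 7, 5, and 7 bits for f0, fc, fm, I, A.
f0 = 196; fc = 392; fm = 196; I = 3.2; A = 0.8;             % example operator parameters

f0_idx = round((2^12 - 1)*log(f0/20)/log(20000/20));        % 12-bit log index, 20 Hz-20 kHz (assumed range)
nc_idx = min(round(fc/f0), 2^7 - 1);                        % 7-bit integer multiple of f0
nm_idx = min(round(fm/f0), 2^7 - 1);                        % 7-bit integer multiple of f0
I_idx  = min(round(I), 2^5 - 1);                            % 5-bit modulation index (assumed uniform)
A_idx  = min(round(20*log10(A) + 96), 2^7 - 1);             % 7-bit log amplitude, 96-dB range (assumed)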

9.6 THE SINES + TRANSIENTS + NOISE (STN) MODEL

Although the basic sinusoidal model (Eq. (9.1)) can achieve very efficient representations of some signals (e.g., sustained vowels via harmonically related components), extensions to the basic model have also been proposed to deal with other signals containing significant nontonal energy that can lead to modeling inefficiencies. A viable model for many musical signals is one of a deterministic plus a stochastic component. The deterministic part corresponds to the pitched part of the sound, and the stochastic part accounts for intrinsically random musical characteristics such as breath noise or bow noise. The spectral modeling and synthesis system (SMS) [Serr89] [Serr90] treats audio as the sum of a collection of K sinusoids ("partials") with time-varying parameters (Eq. (9.1)), with the addition of a stochastic component, e(n), i.e.,

s(n) ≈ ŝ(n) = Σ_{k=1}^{K} A_k cos(ω_k(n)n + φ_k(n)) + e(n)    (9.13)

where A_k represents the amplitude, ω_k(n) represents the instantaneous frequency, and φ_k(n) represents the instantaneous phase of the k-th sinusoid. Different parametric models for the stochastic component have been proposed, such as parametric noise spectral envelopes [Serr89] or parametric Bark-band noise [Good96]. Other investigators have constructed hybrid signal models in which the residual of the sinusoidal analysis-synthesis, s(n) − ŝ(n), is decomposed using a discrete wavelet packet transform (DWPT) [Hamd96] and then represented in terms of carefully selected principal components. Although the sines + noise signal model has demonstrated improved performance relative to the basic model, one further extension has gained popularity, namely, the addition of a separate model for transient components, or, in other words, the construction of a three-part model consisting of sines + transients + noise [Verm98a] [Verm98b].
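As a concrete illustration of Eq. (9.13), the MATLAB sketch below synthesizes one frame from a few example partial tracks plus a noise component shaped by an assumed spectral envelope; all parameter values are illustrative and are not taken from SMS reference data.

% Minimal sketch (illustrative parameters): sines + noise synthesis per Eq. (9.13).
fs = 16000; N = 512; n = 0:N-1;
Ak   = [0.8 0.5 0.3];                  % partial amplitudes (example)
wk   = 2*pi*[220 440 660]/fs;          % partial frequencies in rad/sample (example)
phik = [0 pi/4 pi/2];                  % partial phases (example)

shat = zeros(1, N);
for k = 1:length(Ak)                   % deterministic (pitched) component
    shat = shat + Ak(k)*cos(wk(k)*n + phik(k));
end

env = linspace(0.2, 0.02, N/2+1);      % assumed parametric noise spectral envelope
E   = [env, fliplr(env(2:end-1))];     % full-length magnitude spectrum
e   = real(ifft(E .* exp(1j*2*pi*rand(1, N))));   % random-phase noise shaped by env
shat = shat + e;                       % stochastic component added, Eq. (9.13)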

9.7 HYBRID SINUSOIDAL CODERS

This section examines two experimental audio coders that attempt to enhance source robustness and output quality beyond that of the purely sinusoidal algorithms at low bit rates by embedding additional signal models in the coder architecture. The motivation for this work is as follows. Whereas the waveform-preserving perceptual transform (Chapter 7) and subband (Chapter 8) coders tend to target transparent quality at bit rates between 32 and 128 kb/s per channel, the sinusoidal coders proposed thus far in the literature have concentrated on low-rate applications between 2 and 16 kb/s. Rather than transparent quality, these algorithms have emphasized source robustness, i.e., the ability to deal with general audio at low rates without constraining source model dependence. Low-rate sinusoidal algorithms (ASAC, HILN, etc.) represent the perceptually significant portions of the magnitude spectrum from the original signal without explicitly treating the phase spectrum. As a result, perceptually transparent coding is typically not achieved with these algorithms. It is generally agreed that different classes of state-of-the-art coding techniques perform most efficiently in terms of output quality achieved for a given bit rate. In particular, CELP speech algorithms offer the best performance for clean speech below 16 kb/s, parametric sinusoidal techniques perform best for general audio between 16 and 32 kb/s, and so-called "time-frequency" audio codecs tend to offer the best performance at rates above 32 kb/s. Designers of comprehensive bit-rate-scalable coding systems, therefore, must decide whether to cascade multiple stages of fundamentally different coder architectures, with each stage operating on a residual signal from the previous stage, or, alternatively, to simulcast independent bitstreams from different coder architectures and then select an appropriate decoder at the receiver. In fact, some experimental work performed in the context of MPEG-4 standardization has demonstrated that a cascaded hybrid sinusoidal/time-frequency coder can not only meet but in some cases even exceed the output quality achieved by the time-frequency (transform) coder alone at the same bit rate for certain critical test signals [Edle96b]. Several hybrid coder architectures that exploit these ideas by combining sinusoidal modeling with other well-known techniques have been proposed. Two examples are considered next.

9.7.1 Hybrid Sinusoidal-MDCT Algorithm

In one experimental hybrid scheme (Figure 9.7), for example, a locally decoded sinusoidal output signal (generated by, e.g., ASAC) is transformed using an independent MDCT that is configured identically to the second-stage coder's internal MDCT. Then, a frequency-selective switch (scalability tool) in the second-stage coder determines whether quantization and encoding will be more efficient for MDCT coefficients from the original signal or for MDCT coefficient differences between the original and first-stage decoded output. Clearly, the coefficient differencing scheme is only beneficial if the difference magnitudes are smaller than the original spectral magnitudes. Several of the issues critical to cascading successfully a parametric sinusoidal coder with a transform-based time-frequency coder are addressed in [Edle98]. For the scheme depicted in Figure 9.7, phase information is essential to minimization of the MDCT coefficient differences. In addition, frequency quantization also influences significantly the maximum residual amplitude. Although log frequency quantization is well suited for stand-alone operation, uniform frequency quantization better controls residual amplitudes in the hybrid scheme. These considerations motivated the inclusion of phase parameters and additional high-frequency spectral line quantization bits in the ASAC enhancement bitstream. In the case of harmonically structured signals, however, the presence of many partials could overwhelm the enhancement bit allocation if hundreds of partial phases were required. One solution is to limit the enhancement processing to the first ten or so partials. Moreover, for a first-stage coder such as the object-based HILN, the noise object would increase rather than decrease the MDCT residuals and should therefore be disabled in a cascade configuration.

[Figure 9.7. Hybrid sinusoidal/T-F (MDCT) encoder (after [Edle98]): the input s(n) drives a sinusoidal encoder/decoder; the original and the locally decoded sinusoidal signal are each transformed by an MDCT, their difference is formed, and a frequency-selective switch chooses between original and difference coefficients for quantization and encoding.]
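A minimal MATLAB sketch of the frequency-selective decision is given below; the band boundaries and the energy-based criterion are assumptions used only to illustrate the idea of choosing, per band, between original coefficients and coefficient differences.

% Minimal sketch (assumed band layout and criterion): per-band selection between
% original MDCT coefficients and differences against the first-stage output.
X  = mdct_orig;                   % MDCT of the original signal (hypothetical variable)
Xs = mdct_sinu;                   % MDCT of the decoded sinusoidal signal (hypothetical variable)
bands = [1 33 65 129 257 513];    % assumed band boundaries over the MDCT bins

D = X - Xs;                       % coefficient differences
Y = X; useDiff = false(1, numel(bands)-1);
for b = 1:numel(bands)-1
    idx = bands(b):bands(b+1)-1;
    useDiff(b) = sum(D(idx).^2) < sum(X(idx).^2);   % differencing beneficial in this band?
    if useDiff(b), Y(idx) = D(idx); end
end
% Y and the per-band flags useDiff would be passed on to quantization and encoding.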

9.7.2 Hybrid Sinusoidal-Vocoder Algorithm

It was earlier noted that CELP speech algorithms typically outperform the parametric sinusoidal coders for clean speech inputs at rates below 16 kb/s. There is some uncertainty, however, as to which class of algorithm is best suited when both speech and music are present. A hybrid scheme intended to outperform CELP/parametric "simulcast" for speech/music mixtures was proposed in [Edle98]. The hybrid scheme (Figure 9.8) works as follows. A first-stage parametric coder extracts the dominant harmonic tones, forms an internal residual, and then extracts the remaining harmonic as well as individual tones. Then, an external residual is formed between the original signal and a parametrically synthesized signal containing the contributions of only the second set of harmonic and individual tones. This system implicitly assumes that the speech signal is dominant. The idea is that the second-stage cascaded vocoder will receive primarily the speech portions of the signal, since the first-stage parametric coder removes the secondary harmonics and individual tones associated with the musical signal components.

This hybrid structure was reported to outperform simulcast configurations only when the voice signal was dominant [Edle98]. Quality degradations were reported for mixtures containing dominant musical signals. In the future, hybrid structures of this type will benefit from emerging techniques in speech/music discrimination (e.g., [Saun96] [Sche98b]). As observed by Edler, on the other hand, future audio coding research is also quite likely to focus on automatic decomposition of complex input signals into components for which individual coding is more efficient than direct coding of the mixture [Edle97] using hybrid structures. Advances in sound separation and auditory scene analysis [Breg90] [Elli96] techniques will eventually make the automated decomposition process viable.

[Figure 9.8. Hybrid sinusoidal/vocoder (after [Edle98]): the sinusoidal coder extracts first the dominant harmonic and then the second harmonic plus individual tones from s(n); the external residual that remains after the second set is removed is passed to a vocoder, and all parameters are quantized and encoded.]

9.8 SUMMARY

In this chapter, we reviewed sinusoidal signal models for low-rate audio coding. Some of the key topics we studied include the ASAC, the HILN algorithm, hybrid sinusoidal coders, and the STN model. The parametric representation of sinusoidal coders has the potential for scalable audio compression and delivery of high-quality audio over the Internet. We will discuss the scalable and parametric audio coding tools integrated in the ISO/IEC MPEG-4 audio standard in the next chapter.

PROBLEMS

9.1 Modify the DFT such that the frequency components are computed from n = −N/2, ..., 0, ..., N/2 − 1. Why is this approach useful in sinusoidal analysis-synthesis?

9.2 In sinusoidal analysis-synthesis, we require the analysis window to be normalized, i.e., Σ_{n=−N/2}^{N/2} w(n) = 1. Explain, with mathematical arguments, the reason for the use of this normalization.

9.3 Explain with diagrams, text, and equations how the sinusoidal birth-death frequency tracker works.

9.4 Let s(n) be the input audio frame of size N samples. Let S(k) be the N-point DFT,

S(k) = (1/N) Σ_{n=0}^{N−1} s(n) e^{−j2πkn/N},   k = 0, 1, 2, ..., N − 1.

Obtain a p-th-order frequency-domain AR model, using a least-squares algorithm, that fits the DFT spectral components. Assume that p < N − 1.

9.5 Write an algorithm that selects the K peaks from an N-point DFT magnitude spectrum (K < N). Assume that the K spectral peaks are associated with pure tones in the input signal s(n). Design a sinusoidal analysis-synthesis model that minimizes the MSE, ε,

ε = Σ_n e²(n) = Σ_n [ s(n) − Σ_{k=1}^{K} A_k cos(ω_k n + φ_k) ]²

where s(n) is the input speech, A_k represents the amplitude, ω_k represents the frequency, and φ_k denotes the phase of the k-th sinusoid. Hint: Refer to [McAu86].


9.6 In this problem, we will design a sinusoidal analysis-by-synthesis (A-by-S) algorithm. Use the sinusoidal model s(n) ≈ ŝ(n) = Σ_{k=1}^{K} A_k cos(ω_k n + φ_k), where A_k is the amplitude, ω_k represents the frequency, and φ_k denotes the phase of the k-th sinusoid. Attempting to minimize

E = Σ_n e²(n) = Σ_n [ s(n) − Σ_{k=1}^{K} A_k cos(ω_k n + φ_k) ]²

with respect to A_k, ω_k, and φ_k simultaneously would lead to a nonlinear optimization problem. Hence, design a suboptimal A-by-S algorithm to solve for the parameters separately. Hint: Start your design by assuming that the (K − 1) sinusoidal parameters are already determined and the frequencies ω_k are known. The MSE will then become

ε = Σ_n [ s(n) − Σ_{k=1}^{K−1} A_k cos(ω_k n + φ_k) − A_K cos(ω_K n + φ_K) ]² = Σ_n [ e_{K−1}(n) − A_K cos(ω_K n + φ_K) ]²

where e_{K−1}(n) is the error after the K − 1 sinusoidal components were determined. See [Geor87] and [Geor90] to obtain additional information.

9.7 Assuming that a cubic phase interpolation polynomial is used in a sinusoidal analysis-synthesis model, derive Eq. (9.7) using results from [McAu86].

COMPUTER EXERCISES

9.8 Write a MATLAB program to implement the sinusoidal analysis-by-synthesis algorithm that you derived in Problem 9.6. Use the audio file ch9aud1.wav. Consider a frame size of N = 512 samples, no overlap between frames, and K = 20 sinusoids. In your deliverables, provide the MATLAB program, and give the plots of the input audio s(n), the synthesized signal ŝ(n), and the error e(n) = s(n) − ŝ(n) for the entire audio record.

9.9 Use the MATLAB program from the previous problem and estimate the sample mean square error, E_i = E[e_i²(n)], for each frame i, where E[·] is the expectation operator.
a. Calculate the average MSE, E_Avg = (1/N_f) Σ_{i=1}^{N_f} E_i, where N_f is the number of frames.
b. Repeat step (a) for a different number of sinusoids, e.g., K = 5, 15, 30, 50, and 60. Plot E_Avg in dB across the number of sinusoids, K.
c. How does the MSE behave with respect to the number of sinusoids selected?

9.10 In this problem, we will study the significance of phase information in the sinusoidal coding of audio. Use the audio file ch9aud1.wav. Consider a frame size of N = 512 samples. Let S_i(k) be the 512-point FFT of the i-th audio frame, and S_i(k) = |S_i(k)| e^{jΦ_i(k)}, where |S_i(k)| is the magnitude spectrum and Φ_i(k) is the phase at the k-th FFT bin. Write a MATLAB program to pick 30 spectral peaks from the 512-point FFT magnitude spectrum. Set the rest of the spectral component magnitudes to a very small value, e.g., 0.000001. Call the resulting spectral magnitude |Ŝ_i(k)|.
a. Set the phase Φ_i(k) = 0 and reconstruct the audio signal ŝ_i(n) = IFFT_512[Ŝ_i(k)] for all the frames, i = 1, 2, ..., N_f.
b. Calculate the sample MSE, E_i = E[e_i²(n)], for each frame i. Compute the average MSE, E_Avg = (1/N_f) Σ_{i=1}^{N_f} E_i, where N_f is the number of frames.
c. Set the phase Φ_i = π(2 rand(1, 512) − 1), where Φ_i is the [1 × 512] uniform random phase vector that varies between −π and π. Reconstruct the audio signal ŝ_i(n) = IFFT_512[Ŝ_i(k) e^{jΦ_i}] for all the frames. Note that you must avoid using the same set of random phase components in all the frames. Repeat step (b).
d. Set the phase Φ_i(k) = arg[S_i(k)], i.e., we are using the input signal phase. Reconstruct the audio ŝ_i(n) = IFFT_512[Ŝ_i(k) e^{jΦ_i(k)}] for all the frames. Repeat step (b).
e. Perform an experiment where the low-frequency sinusoids (< 1.5 kHz) use the input signal phase and the high-frequency components use random phase as in part (c). Compute E_i and E_Avg using step (b).
f. Give a dB plot of E_i obtained from steps (a) and (b) across the number of frames, i = 1, 2, ..., N_f. Now superimpose the E_i's obtained in steps (c), (d), and (e). Which case performed better in terms of the MSE measure?
g. Listen to the synthesized audio obtained from steps (a), (c), (d), and (e). Which case results in better perceptual quality?

9.11 In this problem, we will become familiar with the sinusoidal trajectories. Use the audio file ch9aud2.wav. Consider a frame size of N = 512 samples, no overlap between frames, and K = 10 sinusoids. Write a MATLAB program to implement the sinusoidal A-by-S algorithm [Geor90]. For each frame i, compute the amplitudes A_k^i, the frequencies ω_k^i, and the phases φ_k^i of the k-th sinusoid.
a. Give a contour plot of the frequencies ω_k^i versus the frame number, i = 1, 2, ..., N_f, associated with each of the 10 sinusoids.
b. Now assume a 50% overlap and repeat the above step. Comment on the smoothness and continuity of the frequency trajectories when frames are overlapped 50%.

9.12 Extend Computer Exercise 9.8 when a 50% overlap is allowed between the frames. Use a Bartlett window or a Hanning window for overlap. See [Geor97] for hints on an overlap-add A-by-S implementation. Do you observe any improvements in terms of audio quality when overlapping is employed?

9.13 In this problem, we will design a hybrid subband/sinusoidal algorithm. Use the audio file ch9aud3_24k.wav. Consider a frame size of N = 512 samples, no overlap between frames, and K = 20 sinusoids. Note that the sampling frequency of the given audio file is 24 kHz.
a. Write a MATLAB program to design a sinusoidal A-by-S algorithm to pick 20 sinusoids in each frame. Compute the MSE, E = (1/N_f) Σ_{i=1}^{N_f} E_i, where N_f is the number of frames and E_i = E[e_i²(n)] is the MSE at the i-th frame.
b. Modify the program in (a) to design a sinusoidal A-by-S algorithm to pick 15 sinusoids between 20 Hz and 6 kHz, and 5 sinusoids between 6 kHz and 12 kHz, in each of the frames. Did you observe any performance improvements in terms of computational efficiency and/or perceptual quality?

CHAPTER 10

AUDIO CODING STANDARDS AND ALGORITHMS

10.1 INTRODUCTION

Despite the several advances, research towards developing lower-rate coders for stereophonic and multichannel surround sound systems is strong in many industry and university labs. Multimedia applications such as online radio, web jukeboxes, and teleconferencing created a demand for audio coding algorithms that can deliver real-time wireless audio content. This will, in turn, require audio compression algorithms to deliver high-quality audio at low bit rates with resilience/robustness to bit errors. Motivated by the need for audio compression algorithms for streaming audio, researchers pursue techniques such as combined speech/audio architectures, as well as joint source-channel coding algorithms that are optimized for the packet-switched Internet [Ben99] [Liu99] [Gril02], Bluetooth [Joha01] [Chen04] [BWEB], and, in some cases, wideband cellular networks [Ji02] [Toh03]. Also, the need for transparent reproduction quality coding algorithms in storage media such as the super audio CD (SACD) and the DVD-audio provided designers with new challenges. There is, in fact, an ongoing debate over the quality limitations associated with lossy compression. Some experts believe that uncompressed digital CD-quality audio (44.1 kHz/16 bit) is inferior to the analog original. They contend that sample rates above 55 kHz and word lengths greater than 20 bits are necessary to achieve transparency in the absence of any compression.

As a result, several standards have been developed [ISOI92] [ISOI94a] [Davi94] [Fiel96] [Wyl96b] [ISOI97b], particularly in the last five years [Gerz99] [ISOI99] [ISOI00] [ISOI01b] [ISOI02a] [Jans03] [Kuzu03], and several are now being deployed commercially. This chapter and the next address some of the important audio coding algorithms and standards deployed during the last decade. In particular, we describe the lossy audio compression (LAC) algorithms in this chapter and the lossless audio coding (L2AC) schemes in Chapter 11.

Some of the LAC schemes (Figure 10.1) described in this chapter include the ISO/MPEG codec series, the Sony ATRAC, the Lucent Technologies PAC/EPAC/MPAC, the Dolby AC-2/AC-3, the APT-x 100, and the DTS coherent acoustics.

The rest of the chapter is organized as follows. Section 10.2 reviews the MIDI standard. Section 10.3 serves as an introduction to the multichannel surround sound format. Section 10.4 is dedicated to MPEG audio standards. In particular, Sections 10.4.1 through 10.4.6, respectively, describe the MPEG-1, MPEG-2 BC/LSF, MPEG-2 AAC, MPEG-4, MPEG-7, and MPEG-21 audio standards. Section 10.5 presents the adaptive transform acoustic coding (ATRAC) algorithm, the MiniDisc, and the Sony dynamic digital sound (SDDS) systems. Section 10.6 reviews the Lucent Technologies perceptual audio coder (PAC), the enhanced PAC (EPAC), and the multichannel PAC (MPAC) coders. Section 10.7 describes the Dolby AC-2 and the AC-3/Dolby Digital algorithms. Section 10.8 is devoted to the Audio Processing Technology – APT-x 100 system. Finally, in Section 10.9, we examine the principles of coherent acoustics in coding that are embedded in the Digital Theater Systems–Coherent Acoustics (DTS-CA).

10.2 MIDI VERSUS DIGITAL AUDIO

The musical instrument digital interface (MIDI) encoding is an efficient way of extracting and representing semantic features from audio signals [Lehr93] [Penn95] [Hube98] [Whit00]. MIDI synthesizers, originally established in 1983, are widely used for musical transcriptions. Currently, the MIDI standards are governed by the MIDI Manufacturers Association (MMA) in collaboration with the Japanese Association of Musical Electronics Industry (AMEI).

The digital audio representation contains the actual sampled audio data, while a MIDI synthesizer represents only the instructions that are required to play the sounds. Therefore, the MIDI data files are extremely small when compared to the digital audio data files. Despite being able to represent high-quality stereo data at 10–30 kb/s, there are certain limitations with MIDI formats. In particular, the MIDI protocol uses a slow serial interface for data streaming at 31.25 kb/s [Foss95]. Moreover, MIDI is hardware dependent. Despite such limitations, musicians prefer the MIDI standard because of its simplicity and high-quality sound synthesis capability.

10.2.1 MIDI Synthesizer

A simple MIDI system (Figure 10.2) consists of a MIDI controller, a sequencer, and a MIDI sound module. The keyboard is an example of a MIDI controller that translates the music notes into a real-time MIDI data stream. The MIDI data stream includes a start bit, 8 data bits, and one stop bit. A MIDI sequencer captures the MIDI data sequence and allows for various manipulations (e.g., editing, morphing, combining, etc.). On the other hand, a MIDI sound module acts as a sound player.

[Figure 10.1. A list of some of the lossy audio coding algorithms. ISO/IEC MPEG audio standards: MPEG-1 (ISO/IEC 11172-3) [ISOI92] [Bran94a]; MPEG-2 BC/LSF (ISO/IEC 13818-3) [ISOI94a]; MPEG-2 AAC (ISO/IEC 13818-7) [ISOI97b] [Bosi97]; MPEG-4 (ISO/IEC 14496-3) [ISOI99] [ISOI00]; MPEG-7 (ISO/IEC 15938-4) [ISOI01b]. Sony Adaptive Transform Acoustic Coding (ATRAC) [Yosh94]. Lucent Technologies audio coding standards: Perceptual Audio Coder (PAC) [John96c]; Enhanced PAC (EPAC) [Sinh96]; Multichannel PAC (MPAC) [Sinh98a]. Dolby audio coding standards: Dolby AC-2, AC-2A [Fiel91] [Davi92]; Dolby AC-3 [Fiel96] [Davis93]. Audio Processing Technology (APT-x 100) [Wyli96b]. DTS – Coherent Acoustics [Smyt96] [Smyt99].]

[Figure 10.2. A simple MIDI system: a MIDI controller, a MIDI sequencer, and a MIDI sound module connected through their MIDI IN/OUT ports.]

10.2.2 General MIDI (GM)

In order to facilitate a greater degree of file compatibility, the MMA developed the general MIDI (GM) standard. The GM constitutes a MIDI synthesizer with a standard set of voices (16 categories of 8 different sounds = 16 × 8 = 128 sounds) that are fixed. Although the GM standard does not describe the sound quality of synthesizer outputs, it provides details on the MIDI compatibility, i.e., the MIDI sounds composed on one sequencer can be reproduced or played back on any other system with reduced or no distortion. Different GM versions are available in the market today, i.e., GM Level-1, GM Level-2, GM lite, and scalable polyphonic MIDI (SPMIDI). Table 10.1 summarizes the various GM levels and versions.

10.2.3 MIDI Applications

MIDI has been successful in a wide range of applications, including music retrieval and classification [Mana02], music database search [Kost96], musical instrument control [MID03], MIDI karaoke players [MIDI], real-time object-based coding [Bros03], automatic recognition of musical phrases [Kost96], audio authoring [Mode98], waveform editing [MIDI], singing voice synthesis [Maco97], loudspeaker design [Bald96], and feature extraction [Kost95]. The MPEG-4 structured audio tool incorporates many MIDI-like features. Other applications of MIDI are attributed to MIDI GM Level-2 [Mode00], XMIDI [LKur96], and PCM to MIDI transportation [Mart02] [MID03] [MIDI].


Table 10.1. General MIDI (GM) formats and versions

GM specifications              GM L-1      GM L-2      GM Lite    SPMIDI
Number of MIDI channels        16          16          16         16
Percussion (drumming) channel  10          10, 11      10         –
Polyphony (voices)             24 voices   32 voices   Limited    –

Other related information [MID03]:
GM L-2: This is the latest standard, introduced with capabilities of registered parameter controllers, MIDI tuning, and universal system exclusive messages. GM L-2 is backwards compatible with GM L-1.
GM Lite: As the name implies, this is a light version of GM L-1 and is intended for devices with limited polyphony.
SPMIDI: Intended for mobile devices, SPMIDI functions based on the fundamentals of GM Lite and scalable polyphony. This GM standard has been adopted by the Third-Generation Partnership Project (3GPP) for the multimedia messaging applications in cellular phones.

10.3 MULTICHANNEL SURROUND SOUND

Surround sound tracks (or channels) were included in motion pictures in the early 1950s in order to provide a more realistic cinema experience. Later, the popularity of surround sound resulted in its migration from cinema halls to home theaters equipped with matrixed multichannel sound (e.g., Dolby ProLogic). This can be attributed to the multichannel surround sound format [Bosi93] [Holm99] [DOLBY] and subsequent improvements in the audio compression technology.

Until the early 1990s, almost all surround sound formats were based on matrixing, i.e., the information from all the channels (front and surround) was encoded as a two-channel stereo, as shown in Figure 10.3. In the mid-1990s, discrete encoding, i.e., 5.1 separate channels of audio, was introduced by Dolby Laboratories and Digital Theater Systems (DTS).

10.3.1 The Evolution of Surround Sound

Table 10.2 lists some of the milestones in the history of multichannel surround sound systems. In the early 1950s, the first commercial multichannel sound format was developed for cinema applications. "Quad" (quadraphonic) was the first home-multichannel format, promoted in the early 1970s. But due to some incompatibility issues in the encoding/decoding techniques, the Quad was not successful. In the mid-1970s, Dolby overcame the incompatibility issues associated with the optical sound tracks and introduced a new format called the Dolby stereo, a special encoding technique that later became very popular. With the advent of compact discs (CDs) in the early 1980s, high-performance stereo systems became quite common. With the emergence of digital versatile discs (DVDs) in 1995–1996, content creators began to distribute multichannel music in digital format. Dolby Laboratories, in 1992, introduced another coding algorithm (Dolby AC-3, Section 10.7), called the Dolby Digital, that offers a high-quality multichannel (5.1-channel) surround sound experience. The Dolby Digital was later chosen as the primary audio coding technique for DVDs and for digital audio broadcasting (DAB). The following year, Digital Theater Systems Inc. (DTS) announced a new format based on the Coherent Acoustics encoding principle (DTS-CA). The same year, Sony proposed the Sony Dynamic Digital Sound (SDDS) system that employs the Adaptive Transform Acoustic Coding (ATRAC) algorithm. Lucent Technologies' Multichannel Perceptual Audio Coder (MPAC) also has a five-channel surround sound configuration. Moreover, the development of two new audio recording technologies, namely, the Meridian Lossless Packing (MLP) and the Direct Stream Digital (DSD), for use in the DVD-Audio [DVD01] and SACD [SACD02] formats, respectively, offer audiophiles listening experiences that promise to be more realistic.

10.3.2 The Mono, the Stereo, and the Surround Sound Formats

Figure 10.4 shows the three most common sound formats, i.e., mono, stereo, and surround. Mono is a simple method of recording sound onto a single channel that is typically played back on one speaker. In stereo encoding, a two-channel recording is employed. Stereo provides a sound field in front, while the multichannel surround sound provides a multi-dimensional sound experience. The surround sound systems typically employ a 5.1-channel configuration, i.e., sound tracks are recorded using five main channels: left (L), center (C), right (R), left surround (LS), and right surround (RS). In addition to these five channels, a sixth channel called the low-frequency-effects (LFE) channel is used for the subwoofer. Since the LFE channel covers only a fraction (less than 150 Hz) of the total frequency range, it is referred to as the .1 channel.

10.3.3 The ITU-R BS.775 5.1-Channel Configuration

In an effort to evaluate and standardize the so-called 5.1- or 3/2-channel configuration, several technical documents appeared [Bosi93] [ITUR94c] [EBU99] [Holm99] [SMPTE99] [AES00] [Bosi00] [SMPTE02]. Various international standardization bodies became involved in the multichannel algorithm adoption/evaluation process. These include the Audio Engineering Society (AES), the European Broadcasting Union (EBU), the Society of Motion Picture and Television Engineers group (SMPTE), the ISO/IEC MPEG, and the ITU-Radio communication sector (ITU-R).

[Figure 10.3. Multichannel surround sound matrixing: the L, C, R, and S channels are combined, with −3-dB weights on C and S and a 90° phase shift on the surround, to form the two matrixed channels L-total and R-total.]

Table 10.2. Milestones in multichannel surround sound

Year       Description
1941       Fantasia (Walt Disney Productions) was the first motion picture to be released in the multichannel format.
1955       Introduction of the first 35/70-mm magnetic stripe capable of providing 4–6 channels.
1972       Video cassette consumer format – mono (1 channel).
1976       Dolby's stereo in optical format.
1978       Videocassette – stereo (2 channels).
1979       Dolby's first stereo surround, called the split-surround sound format, offered 3 screen channels, 2 surround channels, and a subwoofer (3/2/.1).
1982       Dolby surround format implemented on a compact disc (2 channel).
1992       Dolby Digital optical (5.1 channel).
1993       Digital Theater Systems (DTS).
1993–94    Sony Dynamic Digital Sound (SDDS) based on ATRAC.
1994       The ISO/IEC 13818-3 MPEG-2 backward compatible audio standard.
1995       Dolby Digital chosen for DVD (5.1 channel).
1997       DVD video released in market (5.1 channel).
1998       Dolby Digital selected for digital audio broadcasting (DAB) in US.
1999–      Super Audio CD and DVD-Audio storage formats.
2000–      Direct Stream Digital (DSD) and Meridian Lossless Packing (MLP) technologies.

Figure 10.5 shows a 5.1-channel configuration described in the ITU-R BS.775-1 standard [ITUR94c]. Ideally, five full-bandwidth (150 Hz–20 kHz) loudspeakers, i.e., L, R, C, LS, and RS, are placed on the circumference of a circle in the following manner: the left (L) and right (R) front loudspeakers are placed at the extremities of an arc subtending 2θ = 60° at the reference listening position (see Figure 10.5), and the center (C) loudspeaker must be placed at 0° from the listener's axis. This enables the compatibility with the listening arrangement for a conventional two-channel system. The two surround speakers, i.e., LS and RS, are usually placed at φ = 110° to 120° from the listener's axis. In order to achieve synchronization, the front and surround speakers must be equidistant (usually 2–4 m) from the reference listening point, with their acoustic centers in the horizontal plane, as shown in the figure. The sixth channel, i.e., the LFE channel, delivers bass-only omnidirectional information (20–150 Hz). This is because low frequencies imply longer wavelengths, where the ears are not sensitive to localization. The subwoofer placement receives less attention in the ITU-R standard; however, we note that the subwoofers are typically placed in a front corner (see Figure 10.4). In [Ohma97], Ingvar discusses the various problems associated with the subwoofer placement. Moreover, [SMPTE02] provides information on the loudspeaker placement for audio monitoring. The [SMPTE99] specifies the audio channel assignment and their relative levels for audio program recordings (3–6 audio channels) onto storage media for television sound.

[Figure 10.4. Mono, stereo, and surround sound systems: mono (single channel), stereo (left and right), and 5.1-channel surround with left, center, right, left surround, right surround, and a subwoofer.]

10.4 MPEG AUDIO STANDARDS

MPEG is the acronym for Moving Pictures Experts Group, which forms a workgroup (WG-11) of the ISO/IEC JTC-1 subcommittee (SC-29). The main functions of MPEG are: a) to publish technical results and reports related to audio/video compression techniques; b) to define means to multiplex (combine) video, audio, and information bitstreams into a single bitstream; and c) to provide descriptions and syntax for low bit rate audio/video coding tools for Internet and bandwidth-restricted communications applications. MPEG standards do not characterize or provide any rigid encoder specifications, but rather standardize the type of information that an encoder has to produce, as well as the way in which the decoder has to decompress this information. The MPEG workgroup has its own official webpage that can be accessed at [MPEG]. The MPEG video aspect of the standard is beyond the scope of this book; however, we include some tutorial references and relevant standards [LeGal92] [Scha95] [ISO-V96] [Hask97] [Mitc97] [Siko97a] [Siko97b].


[Figure 10.5. A typical 3/2-channel configuration described in the ITU-R BS.775-1 standard [ITUR94c]: L, left; C, center; R, right; LS, left surround; and RS, right surround loudspeakers, with the reference and worst-case listening positions indicated; θ = 30°, φ = 110° to 120°, ℓ ≈ g = 2 to 4 m, ℓ′ ≈ 0.65 to 0.85 m. Note that the figure is not drawn to scale.]

MPEG Audio – Background. MPEG has come a long way since the first ISO/IEC MPEG standard was published in 1992. With the emergence of the Internet, MPEG is now also addressing content-based multimedia descriptions and database search. There are five different MPEG audio standards published, i.e., MPEG-1, MPEG-2 BC, MPEG-2 NBC/AAC, MPEG-4, and MPEG-7; MPEG-21 is being formed.

Before proceeding with the details of the MPEG audio standards, however, it is necessary to discuss terminology and notation. The phases correspond to the MPEG audio standards type and, to a lesser extent, to their relative release time, e.g., MPEG-1, MPEG-2, MPEG-4, etc. The layers represent a family of coding algorithms within the MPEG standards. Only MPEG-1 and -2 are provided with layers, i.e., MPEG-1 layer-I, -II, and -III; MPEG-2 layer-I, -II, and -III. The versions denote the various stages in the audio coding standardization phase. MPEG-4 was standardized in two stages (version-1 and -2), with new functionality being added to the older version. The newer versions are backward compatible to the older versions. Table 10.3 itemizes the various MPEG audio standards and their specifications. A brief overview of the MPEG standards follows.

Table 10.3. An overview of the MPEG audio standards

MPEG-1
  Standardization details: ISO/IEC 11172-3, 1992
  Bit rates (kb/s): 32–448 (Layer I); 32–384 (Layer II); 32–320 (Layer III)
  Sampling rates (kHz): 32, 44.1, 48
  Channels: Mono (1), stereo (2)
  Related references: [ISOI92] [Bran94a] [Shli94] [Pan95] [Noll93] [Noll95] [Noll97] [John99]
  Related information: A generic compression standard that targets primarily multimedia storage and retrieval.

MPEG-2 BC/LSF
  Standardization details: ISO/IEC 13818-3, 1994
  Bit rates (kb/s): 32–256 (Layer I); 8–160 (Layers II, III)
  Sampling rates (kHz): 16, 22.05, 24
  Channels: Multichannel surround sound (5.1)
  Related references: [ISOI94a] [Stol93a] [Gril94] [Noll93] [Noll95] [John99]
  Related information: First digital television standard that enables lower frequencies and multichannel audio coding.

MPEG-2 NBC/AAC
  Standardization details: ISO/IEC 13818-7, 1997
  Bit rates (kb/s): 8–160
  Sampling rates (kHz): 8–96
  Channels: Multichannel
  Related references: [ISOI97b] [Bosi96] [Bosi97] [John99] [ISOI99] [Koen99] [Quac98b] [Gril99]
  Related information: Advanced audio coding scheme that incorporates new coding methodologies (e.g., prediction, noise shaping, etc.).

MPEG-4 (Version 1) and MPEG-4 (Version 2)
  Standardization details: ISO/IEC 14496-3, Oct. 1998 (Version 1); ISO/IEC 14496-3 AMD-1, Dec. 1999 (Version 2)
  Bit rates (kb/s): 0.2–384 (finer levels of increment possible)
  Sampling rates (kHz): 8–96
  Channels: Multichannel
  Related references: [Gril97] [Park97] [Edle99] [Purn00a] [Sche98a] [Vaan00] [Sche01] [ISOI00] [Kim01] [Herr00a] [Purn99b] [Alla99] [Hilp00] [Sper00] [Sper01]
  Related information: The first content-based multimedia standard, allowing universality, interactivity, and a combination of natural and synthetic material coded in the form of objects.

MPEG-7
  Standardization details: ISO/IEC 15938-4, Sept. 2001
  Bit rates (kb/s): –
  Sampling rates (kHz): –
  Channels: –
  Related references: [ISOI01b] [Nack99a] [Nack99b] [Quac01] [Lind99] [Lind00] [Lind01] [Speng01] [ISOI02a]
  Related information: A normative metadata standard that provides a Multimedia Content Description Interface.

MPEG-21
  Standardization details: ISO/IEC-21000
  Bit rates (kb/s): –
  Sampling rates (kHz): –
  Channels: –
  Related references: [ISOI03a] [Borm03] [Burn03]
  Related information: A multimedia framework that provides interoperability in content-access and distribution.


MPEG-1. After four years of extensive collaborative research by audio coding experts worldwide, the first ISO/MPEG audio coding standard, MPEG-1 [ISOI92], for VHS-stereo-CD-quality was adopted in 1992. The MPEG-1 supports video bit rates up to about 1.5 Mb/s, providing a Video Home System (VHS) quality, and stereo audio at 192 kb/s. Applications of MPEG-1 range from storing video and audio on CD-ROMs to Internet streaming through the popular MPEG-1 layer III (MP3) format.

MPEG-2 Backward Compatible (BC). In order to extend the capabilities offered by MPEG-1 to support the so-called 3/2 (or 5.1) channel format, and to facilitate higher bit rates for video, MPEG-2 [ISOI94a] was published in 1994. The MPEG-2 standard supports digital video transmission in the range of 2–15 Mb/s over cable, satellite, and other broadcast channels; audio coding is defined at the bit rates of 64–192 kb/s/channel. Multichannel MPEG-2 is backward compatible with MPEG-1, hence the acronym MPEG-2 BC. The MPEG-2 BC standard is used in the high-definition television (HDTV) [ISOI94a] and produces the video quality required in digital television applications.

MPEG-2 Nonbackward Compatible/Advanced Audio Coding (AAC). The backward compatibility constraints imposed on the MPEG-2 BC/LSF algorithm made it impractical to code five channels at rates below 640 kb/s. As a result, MPEG began standardization activities for a nonbackward compatible advanced coding system targeting "indistinguishable" quality at a rate of 384 kb/s for five full-bandwidth channels. In less than three years, this effort led to the adoption of the MPEG-2 nonbackward compatible/advanced audio coding (NBC/AAC) algorithm [ISOI97b], a system that exceeded design goals and produced the desired quality at only 320 kb/s for five full-bandwidth channels.

MPEG-4. MPEG-4 was established in December 1998 after many proposed algorithms were tested for compliance with the program objectives established by the MPEG committee. MPEG-4 video supports bit rates up to about 1 Gb/s. The MPEG-4 audio [ISOI99] [ISOI00] was released in several steps, resulting in versions 1 and 2. MPEG-4 comprises an integrated family of algorithms with wide-ranging provisions for scalable, object-based speech and audio coding at bit rates from as low as 200 b/s up to 60 kb/s per channel. The distinguishing features of MPEG-4 relative to its predecessors are extensive scalability, object-based representations, user interactivity/object manipulation, and a comprehensive set of coding tools available to accommodate trade-offs between bit rate, complexity, and quality. Very low rates are achieved through the use of structured representations for synthetic speech and music, such as text-to-speech and MIDI. The standard also provides integrated coding tools that make use of different signal models, depending upon the desired bit rate, bandwidth, complexity, and quality.

MPEG-7. The MPEG-7 audio committee activities started in 1996. In less than four years, a committee draft was finalized, and the first audio standard addressing the "multimedia content description interface" was published in September 2001. MPEG-7 [ISOI01b] targets the content-based multimedia applications. In particular, the MPEG-7 audio supports a broad range of applications – multimedia digital libraries, broadcast media selection, multimedia editing and searching, and multimedia indexing/searching. Moreover, it provides ways for efficient audio file retrieval and supports both text-based and context-based queries.

MPEG-21. The MPEG-21 ISO/IEC-21000 standard [MPEG] [ISOI02a] [ISOI03a] defines interoperable and highly automated tools that enable content distribution across different terminals and networks in a programmed manner. This structure enables end-users to have capabilities for universal multimedia access.

10.4.1 MPEG-1 Audio (ISO/IEC 11172-3)

The MPEG-1 audio standard (ISO/IEC 11172-3) [ISOI92] comprises a flexible hybrid coding technique that incorporates several methods, including subband decomposition, filter-bank analysis, transform coding, entropy coding, dynamic bit allocation, nonuniform quantization, adaptive segmentation, and psychoacoustic analysis. The MPEG-1 audio codec operates on 16-bit PCM input data at sample rates of 32, 44.1, and 48 kHz. Moreover, MPEG-1 offers separate modes for mono, stereo, dual independent mono, and joint stereo. Available bit rates are 32–192 kb/s for mono and 64–384 kb/s for stereo. Several tutorials on the MPEG-1 standards [Noll93] [Bran94a] [Shli94] [Bran95] [Herr95] [Noll95] [Pan95] [Noll97] [John99] have appeared. Chapter 5, Section 5.7, presents the step-by-step procedure involved in the ISO/IEC 11172-3 (MPEG-1 layer I) psychoacoustic model 1 [ISOI92] simulation. We summarize these steps in the context of the MPEG-1 audio standard.

[Figure 10.6. ISO/MPEG-1 layer I/II encoder: the input x(n) is analyzed by a 32-channel PQMF bank; an FFT (Layer I: 512 points, Layer II: 1024 points) feeds the psychoacoustic signal analysis that delivers the SMRs; the 32 subbands are block companded and quantized under dynamic bit allocation, and the data and side info are multiplexed to the channel.]

The MPEG-1 architecture contains three layers of increasing complexity, delay, and output quality. Each higher layer incorporates functional blocks from the lower layers. Figure 10.6 shows the MPEG-1 layer I/II encoder block diagram. The input signal is first decomposed into 32 critically subsampled subbands using a polyphase realization of a pseudo-QMF (PQMF) bank (see also Chapter 6). The channels are equally spaced, such that a 48-kHz input signal is split into 750-Hz subbands, with the subbands decimated 32:1. A 511th-order prototype filter was chosen such that the inherent overall PQMF distortion remains below the threshold of audibility. Moreover, the prototype filter was designed for a high sidelobe attenuation (96 dB) to insure that intraband aliasing remains negligible. For the purposes of psychoacoustic analysis and determination of just noticeable distortion (JND) thresholds, a 512- (layer I) or 1024-point (layer II) FFT is computed in parallel with the subband decomposition for each decimated block of 12 input samples (8 ms at 48 kHz). Next, the subbands are block companded (normalized by a scale factor) such that the maximum sample amplitude in each block is unity; then, an iterative bit allocation procedure applies the JND thresholds to select an optimal quantizer from a predetermined set for each subband. Quantizers are selected such that both the masking and bit rate requirements are simultaneously satisfied. In each subband, scale factors are quantized using 6 bits and quantizer selections are encoded using 4 bits.
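A minimal MATLAB sketch of the block companding and subband quantization step described above follows; the uniform scale-factor grid and the example bit allocation are assumptions and not the normative MPEG-1 tables.

% Minimal sketch (assumed scale-factor grid and bit allocation, not the
% normative MPEG-1 tables): block-compand 12 subband samples by their scale
% factor and quantize with the resolution chosen by the bit allocation.
sb    = subband_block(:);                      % 12 decimated samples of one subband (hypothetical variable)
nbits = 8;                                     % quantizer resolution from bit allocation (example)

scf     = max(abs(sb));                        % scale factor: maximum amplitude in the block
scf_idx = round(63*min(scf, 1));               % 6-bit scale-factor index (assumed uniform grid)
q       = round(sb/max(scf, eps)*(2^(nbits-1) - 1));   % companded, uniformly quantized samples
% scf_idx (6 bits), the 4-bit quantizer selection, and q are multiplexed with side info.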

10.4.1.1 Layers I and II. For layer I encoding, decimated subband sequences are quantized and transmitted to the receiver in conjunction with side information, including quantized scale factors and quantizer selections. Layer II improves three portions of layer I in order to realize enhanced output quality and reduced bit rates, at the expense of greater complexity and increased delay. First, the layer II perceptual model relies upon a higher-resolution FFT (1024 points) than does layer I (512 points). Second, the maximum subband quantizer resolution is increased from 15 to 16 bits. Despite this increase, a lower overall bit rate is achieved by decreasing the number of available quantizers with increasing subband index. Finally, scale factor side information is reduced, while exploiting temporal masking, by considering properties of three adjacent 12-sample blocks and optionally transmitting one, two, or three scale factors, plus a 2-bit side parameter to indicate the scale-factor mode. Average mean opinion scores (MOS) of 4.7 and 4.8 were reported [Noll93] for monaural layer I and layer II codecs operating at 192 and 128 kb/s, respectively. Averages were computed over a range of test material.

10.4.1.2 Layer III. The layer III MPEG architecture (Figure 10.7) achieves performance improvements by adding several important mechanisms on top of the layer I/II foundation. The MPEG layer-III algorithm operates on consecutive frames of data. Each frame consists of 1152 audio samples; a frame is further split into two subframes of 576 samples each. A subframe is called a granule. At the decoder, every granule can be decoded independently. A hybrid filter bank is introduced to increase frequency resolution and thereby better approximate critical band behavior. The hybrid filter bank includes adaptive segmentation to improve pre-echo control. Sophisticated bit allocation and quantization strategies that rely upon nonuniform quantization, analysis-by-synthesis, and entropy coding are introduced to allow reduced bit rates and improved quality. The hybrid filter bank is constructed by following each subband filter with an adaptive MDCT. This practice allows for higher-frequency resolution and pre-echo control. Use of an 18-point MDCT, for example, improves frequency resolution to 41.67 Hz per spectral line. The adaptive MDCT switches between 6 and 18 points to allow improved pre-echo control. Shorter blocks (4 ms) provide for temporal pre-masking of pre-echoes during transients; longer blocks during steady-state periods improve coding gain by reducing side information and hence bit rates.

[Figure 10.7. ISO/MPEG-1 layer III encoder: the input s(n) is analyzed by a 32-channel polyphase filter bank followed by MDCTs with adaptive segmentation; an FFT-based psychoacoustic analysis supplies the SMRs; block companding/quantization with Huffman coding runs inside a bit-allocation loop, and the coded data and side info are multiplexed to the channel.]

Bit allocation and quantization of the spectral lines are realized in a nested loop procedure that uses both nonuniform quantization and Huffman coding. The inner loop adjusts the nonuniform quantizer step sizes for each block until the number of bits required to encode the transform components falls within the bit budget. The outer loop evaluates the quality of the coded signal (analysis-by-synthesis) in terms of quantization noise relative to the JND thresholds. Average MOS of 3.1 and 3.7 were reported [Noll93] for monaural layer II and layer III codecs operating at 64 kb/s.
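The nested-loop idea can be sketched as follows in MATLAB; the power-law exponent, step-size increments, and the helper functions countBits and bandNoise are assumptions standing in for the actual Huffman-coding and distortion-measurement routines of the standard.

% Minimal sketch (assumed helpers and step sizes, not the normative layer III
% loops): inner loop = rate control, outer loop = distortion control.
X = spectrum; jnd = thresholds; budget = 1000;        % hypothetical inputs
scale = ones(size(X)); step = 1;
for outer = 1:32
    while true                                        % inner loop: fit the bit budget
        q = sign(X).*round((abs(X).*scale/step).^0.75);   % nonuniform quantizer
        if countBits(q) <= budget, break; end         % countBits: hypothetical bit counter
        step = step*2^(1/4);                          % coarser global step size
    end
    noise = bandNoise(X, q, step, scale);             % bandNoise: hypothetical per-band error
    bad   = noise > jnd;                              % bands violating the JND thresholds
    if ~any(bad), break; end
    scale(bad) = scale(bad)*2^(1/2);                  % amplify violating bands and iterate
end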

10.4.1.3 Applications. MPEG-1 has been successful in numerous applications. For example, MPEG-1 layer III has become the standard for transmission and storage of compressed audio for both World Wide Web (WWW) and handheld media applications (e.g., iPod). In these applications, the "MP3" label denotes MPEG-1 layer III. Note that MPEG-1 audio coding has steadily gained acceptance and ultimately has been deployed in several other large-scale systems, including the European digital radio (DAB) or Eureka [Jurg96], the direct broadcast satellite or "DBS" [Prit90], and the digital compact cassette or "DCC" [Lokh92]. In particular, the Philips Digital Compact Cassette (DCC) is an example of a consumer product that essentially implements the 384 kb/s stereo mode of MPEG-1 layer I. A discussion of the precision adaptive subband coding (PASC) algorithm and other elements of the DCC system is given in [Lokh92] and [Hoog94].

The collaborative European Advanced Communications Technologies and Services (ACTS) program adopted MPEG audio and video as the core compression technology for the Advanced Television at Low Bit rates And Networked Transmission over Integrated Communication systems (ATLANTIC) project. ATLANTIC is a system intended to provide functionality for television program production and distribution [Stor97] [Gilc98]. This system posed new challenges for MPEG deployment, such as seamless bitstream (source) switching [Laub98] and robust transcoding (tandem coding). Bitstream (source) switching becomes nontrivial when different bit rates and/or MPEG layers are associated with different program sources. Robust transcoding is also essential in the video production environment. Editing tasks inevitably require retrieval of compressed bit streams from archival storage, processing of program material in uncompressed form, and then replacement of the recoded compressed bit stream to the archival system. Unfortunately, transcoding is neither guaranteed nor likely to preserve perceptual noise masking [Rits96]. The ATLANTIC designers proposed a buried data "MOLE" signal to mitigate, and in some cases eliminate, transcoding distortion for cascaded MPEG-1 layer II codecs [Flet98], ideally allowing downstream tandem stages to preserve the original bit stream. The idea behind the MOLE is to apply the same set of quantizers to the same set of data in the downstream codecs as in the original codec. The output bit stream will then be identical to the original bit stream, provided that numerical precision in the analysis filter banks does not bias the data [tenK96b]. It is possible in a cascade of MPEG-1 layer II codecs to regenerate the same set of decimated subband sequences in the downstream codec filter banks as in the original codec filter bank if the full-band PCM signal is properly time aligned at the input to each cascaded stage. Essentially, delays at the filter-bank input must correspond to integer delays at the subband level [tenK96b], and the analysis frames must contain the same block of data in each analysis filter bank. The MOLE signal, therefore, provides downstream codecs with timing synchronization, bit allocation, and scale-factor information for the MPEG bit stream on each frame. The MOLE is buried in the PCM samples between tandem stages and remains inaudible by occupying the LSB of each 20-bit PCM word. Although optimal time-alignment between codecs is possible even without the MOLE [tenK96b], there is unfortunately no easy way to force selection of the same set of quantizers and thus preserve the bit stream.
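A minimal MATLAB sketch of the LSB burying idea follows; the variable names are hypothetical, and the sketch ignores the actual MOLE syntax and synchronization fields.

% Minimal sketch (hypothetical variables, not the MOLE syntax): bury one
% auxiliary bit per 20-bit PCM word in its LSB and read it back downstream.
pcm = pcm_words;                       % integer-valued 20-bit PCM samples (hypothetical variable)
aux = mole_bits;                       % auxiliary bits, 0 or 1, one per sample (hypothetical variable)

embedded  = pcm - mod(pcm, 2) + aux;   % overwrite the LSB of each PCM word
recovered = mod(embedded, 2);          % downstream stage extracts the buried bits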

The widespread use and maturity of MPEG-1, relative to the more recent standards, provided several concrete examples for the above discussion of MPEG-1 audio applications. Various real-time implementation schemes of MPEG-1 layers-I, II, and III codecs were proposed [Gbur96] [Hans96] [Main96] [Wang01]. We will next consider the MPEG-2 BC/LSF, MPEG-2 AAC, the MPEG-4, and the MPEG-7 algorithms. The discussion will focus primarily upon architectural novelties and differences with respect to MPEG-1.

10.4.2 MPEG-2 BC/LSF (ISO/IEC-13818-3)

MPEG-2 BC/LSF audio [Stol93a] [Gril94] [ISOI94a] [Stol96] extends the capabilities offered by MPEG-1 to support the so-called 3/2-channel format with left (L), right (R), center (C), and left and right surround (LS and RS) channels. The MPEG-2 BC/LSF audio standard is backward compatible with MPEG-1, which means that the 3/2 channel information transmitted by an MPEG-2 encoder can be appropriately decoded for 2-channel presentation by an MPEG-1 receiver. Another important feature that was implemented in MPEG-2 BC/LSF is the multilingual compatibility. The acronym BC corresponds to the backward compatibility of MPEG-2 towards MPEG-1, and the extension of sampling frequencies to lower ranges (16, 22.05, and 24 kHz) is denoted by LSF. Several tutorials on MPEG-2 [Noll93] [Noll95] [John99] have appeared. Meares and Theile studied the potential application of matrixed surround sound [Mear97] in MPEG audio algorithms.

10.4.2.1 The Backward Compatibility Feature. Depending on the bit-demand constraints, interchannel dependencies, and the complexity allowed at the decoder, different methods can be employed to realize compatibility between the 3/2- and 2-channel formats. These methods include mid/side (MS), intensity coding, simulcast, and matrixing. The MS and intensity coding techniques are particularly handy when the bit demand imposed by multiple independent channels exceeds the bit budget. The MS scheme is carefully controlled [Davi98] to maintain compatibility among the mono, the stereo, and the surround sound formats. Intensity coding, also known as channel coupling, is a multichannel irrelevancy reduction coding technique that exploits properties of spatial hearing. The idea behind intensity coding is to transmit only one envelope, with some side information, instead of two or more from independent channels. The side information consists of a set of coefficients that is used to recover individual spectra from the intensity channel. Simulcast encoding involves transmission of both stereo and multichannel bitstreams. Two separate bitstreams, i.e., one for 2-channel stereo and another for the multichannel audio, are transmitted, resulting in reduced coding efficiency.

MPEG-2 BC/LSF employs matrixing techniques [tenK92] [tenK94] [Mear97] to down-mix the 3/2-channel format to the 2-channel format. Down-mixing capability is essential for the 5.1-channel system, since many of the playback systems are stereophonic or even monaural. Figure 10.8 depicts the matrixing technique employed in MPEG-2 BC/LSF, which can be mathematically expressed as follows:

L_total = x(L + yC + zLs),    (10.1)

R_total = x(R + yC + zRs),    (10.2)

where x, y, and z are constants specified by the IS-13818-3 MPEG-2 standard [ISOI94a]. In Eqs. (10.1) and (10.2), L, C, R, Ls, and Rs represent the 3/2-channel configuration, and the parameters L_total and R_total correspond to the 2-channel format.

Three different choices are provided in the MPEG-2 audio standard [ISOI94a] for the values of x, y, and z to perform the 3/2-channel to 2-channel down-mixing. These include:

Choice 1:  x = 1/(1 + √2),  y = 1/√2,  and  z = 1/√2    (10.3)

Choice 2:  x = 2/(3 + √2),  y = 1/√2,  and  z = 1/2    (10.4)

Choice 3:  x = 1,  y = 1,  and  z = 1    (10.5)

The selection of the down-mixing parameters is encoder dependent. The availability of the basic stereo-format channels, i.e., L_total and R_total, and the surround sound extension channels, i.e., C, Ls, and Rs, at the decoder helps to decode both 3/2-channel and 2-channel bitstreams. This ensures the backward compatibility of the MPEG-2 BC/LSF audio coding standard.
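As a simple numerical illustration of Eqs. (10.1)-(10.5), the following Python sketch applies the Choice 1 coefficients of Eq. (10.3) to form the two down-mix channels; the channel sample values used in the example are arbitrary placeholders, not data from the standard.

import numpy as np

# Down-mix coefficients, Choice 1 of Eq. (10.3)
x = 1.0 / (1.0 + np.sqrt(2.0))
y = 1.0 / np.sqrt(2.0)
z = 1.0 / np.sqrt(2.0)

def downmix(L, R, C, Ls, Rs):
    """Two-channel down-mix of the 3/2 format, per Eqs. (10.1)-(10.2)."""
    L_total = x * (L + y * C + z * Ls)
    R_total = x * (R + y * C + z * Rs)
    return L_total, R_total

# Example: one sample per channel (arbitrary values)
L_total, R_total = downmix(0.5, -0.2, 0.1, 0.05, -0.05)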

10.4.2.2 MPEG-2 BC/LSF Encoder. The steps involved in the reduction of the objective redundancies and the removal of the perceptual irrelevancies in MPEG-2 BC/LSF encoding are the same as in the MPEG-1 audio standard. However, differences arise from employing a multichannel and multilingual bitstream format in the MPEG-2 audio.

Figure 10.8. Multichannel surround sound matrixing: (a) encoder; (b) decoder.

A matrixing module is used for this purpose. Another important feature employed in MPEG-2 BC/LSF is the "dynamic cross-talk," a multichannel irrelevancy reduction technique. This feature exploits properties of spatial hearing and encodes only one envelope instead of two or more, together with some side information (i.e., scale factors). Note that this technique is in some sense similar to the intensity coding that we discussed earlier. In summary, matrixing enables backward compatibility between the MPEG-2 and MPEG-1 bitstreams, and dynamic cross-talk reduces the interchannel redundancies.

In Figure 10.9, first the segmented audio frames are decomposed into 32 critically subsampled subbands using a polyphase realization of a pseudo-QMF (PQMF) bank. Next, a matrixing module is employed for down-mixing purposes. Matrixing results in two stereo-format channels, i.e., L_total and R_total, and three extension channels, i.e., C, Ls, and Rs. In order to remove statistical redundancies associated with these channels, a second-order linear predictor is employed [Fuch93] [ISOI94a]. The predictor coefficients are updated on each subband using a backward adaptive LMS algorithm [Widr85].

Figure 10.9. ISO/MPEG-2 BC/LSF audio (layers I, II) encoder algorithm: 32-channel PQMF analysis bank, FFT computation and psychoacoustic signal analysis, matrixing, prediction, dynamic cross-talk, scale factors, block companding/quantization, dynamic bit allocation, and multiplexing of quantizer, data, and side information.

The resulting prediction error is further processed to eliminate interchannel dependencies. JND thresholds are computed in parallel with the subband decomposition for each decimated block. A bit-allocation procedure similar to the one in the MPEG-1 audio standard is used to estimate the number of bits required for quantization.

10.4.2.3 MPEG-2 BC/LSF Decoder. Synchronization, followed by error detection and correction, is performed first at the decoder. Then the coded audio bitstream is demultiplexed into the individual subbands of each audio channel. Next, the subband signals are converted to subband PCM signals based on the instructions in the header and the side information transmitted for every subband. De-matrixing is performed to compute the L and R channels as follows:

L = L_total/x − (yC + zLs),    (10.6)

R = R_total/x − (yC + zRs),    (10.7)

where x, y, and z are constants known at the decoder. The inverse-quantized, de-matrixed subband PCM signals are then inverse-filtered to reconstruct the full-band time-domain PCM signals for each channel.
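A minimal sketch of the decoder-side de-matrixing of Eqs. (10.6)-(10.7) is given below; it assumes the same x, y, z coefficients used in the down-mix illustration above are known (or signaled) at the decoder.

def dematrix(L_total, R_total, C, Ls, Rs, x, y, z):
    """Recover L and R from the transmitted stereo pair and the extension
    channels C, Ls, Rs, per Eqs. (10.6)-(10.7)."""
    L = L_total / x - (y * C + z * Ls)
    R = R_total / x - (y * C + z * Rs)
    return L, R

Together with the down-mix sketch shown earlier, this recovers L and R exactly when quantization effects are ignored.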

The second MPEG-2 standard, i.e., MPEG-2 NBC/AAC, sacrificed backward MPEG-1 compatibility in order to eliminate the quantization noise unmasking artifacts [tenK94] that are potentially introduced by the forced backward compatibility.

10.4.3 MPEG-2 NBC/AAC (ISO/IEC-13818-7)

The IS 11172-3 MPEG-1 and IS 13818-3 MPEG-2 BC/LSF are standardized algorithms for high-quality coding of monaural and stereophonic program material. By the early 1990s, however, the demand for high-quality coding of multichannel audio at reduced bit rates had increased significantly. The backward compatibility constraints imposed on the MPEG-2 BC/LSF algorithm made it impractical to code 5-channel program material at rates below 640 kb/s. As a result, MPEG began standardization activities for a nonbackward compatible advanced coding system targeting "indistinguishable" quality [ITUR91] [ISOI96a] at a rate of 384 kb/s for five full-bandwidth channels. In less than three years, this effort led to the adoption of the IS 13818-7 MPEG-2 Nonbackward Compatible/Advanced Audio Coding (NBC/AAC) algorithm [ISOI97b], a system that exceeded design goals and produced the desired quality at 320 kb/s for five full-bandwidth channels. While similar in many respects to its predecessors, the AAC algorithm [Bosi96] [Bosi97] [Bran97] [John99] achieves performance improvements by incorporating coding tools previously not found in the standards, such as filter-bank window shape adaptation, spectral coefficient prediction, temporal noise shaping (TNS), and bandwidth- and bit-rate-scalable operation. Bit rate and quality improvements are also realized through the use of a sophisticated noiseless coding scheme integrated with a two-stage bit allocation procedure. Moreover, the AAC algorithm contains scalability and complexity management tools not previously included with the MPEG algorithms. As far as applications are concerned, the AAC algorithm is embedded in the a2b and LiquidAudio players for streaming of high-fidelity stereophonic audio. It is also a candidate for standardization in the United States Digital Audio Radio (US DAR) project. The remainder of this section describes some of the features unique to MPEG-2 AAC.

The MPEG-2 AAC algorithm (Figure 10.10) is organized as a set of coding tools. Depending upon available CPU or channel resources and desired quality, one can select from among three complexity "profiles," namely, main, low, and scalable sample rate profiles. Each profile recommends a specific combination of tools. Our focus here is on the complete set of tools available for main profile coding, which works as follows.

10.4.3.1 Filter Bank. First, a high-resolution MDCT filter bank obtains a spectral representation of the input. Like previous MPEG coders, the AAC filter-bank resolution is signal adaptive. Quasi-stationary segments are analyzed with a 2048-point window, while transients are analyzed with a block of eight 256-point windows to maintain time synchronization for channels using different filter-bank resolutions during multichannel operations. The frequency resolution is therefore 23 Hz for a 48-kHz sample rate, and the time resolution is 2.6 ms. Unlike previous MPEG coders, however, AAC eliminates the hybrid filter bank and relies on the MDCT exclusively. The AAC filter bank is also unique in its ability to switch between two distinct MDCT analysis window shapes, i.e., a sine window (Eq. (10.8)) and a Kaiser-Bessel designed (KBD) window (Eq. (10.9)). Given specific input signal characteristics, the idea behind window shape adaptation is to optimize filter-bank frequency selectivity in order to localize the supra-masking threshold signal energy in the fewest spectral coefficients. This strategy essentially seeks to maximize the perceptual coding gain of the filter bank. While both windows satisfy the perfect reconstruction and aliasing cancellation constraints of the MDCT, they offer different spectral analysis properties. The sine window is given by

w(n) = sin[(n + 1/2)π/(2M)]    (10.8)

for 0 ≤ n ≤ M − 1, where M is the number of subbands. This particular window is perhaps the most popular in audio coding. In fact, this window has become standard in MDCT audio applications, and its properties are typically referenced as performance benchmarks when new windows are proposed. The so-called KBD window was obtained in a procedure devised at Dolby Laboratories by applying a transformation of the form

w_a(n) = w_s(n) = √[ Σ_{j=0}^{n} v(j) / Σ_{j=0}^{M} v(j) ],    0 ≤ n < M,    (10.9)

where the sequence v(n) represents the symmetric kernel. The resulting identical analysis and synthesis windows, w_a(n) and w_s(n), respectively, are of length M + 1 and symmetric, i.e., w(2M − n − 1) = w(n). A more detailed explanation of the MDCT windows is given in Chapter 6, Section 6.7.

Figure 10.10. ISO/IEC IS 13818-7 (MPEG-2 NBC/AAC) encoder ([Bosi96]).

A filter-bank simulation exemplifying the performance of the two windows, sine and KBD, for the MPEG-2 AAC algorithm follows. A sine window is selected when narrow pass-band selectivity is more beneficial than strong stop-band attenuation. For example, sounds characterized by a dense harmonic structure (less than 140 Hz spacing), such as harpsichord or pitch pipe, benefit from a sine window. On the other hand, a Kaiser-Bessel designed (KBD) window is selected in cases for which stronger stop-band attenuation is required, or for situations in which strong components are separated by more than 220 Hz. The KBD window in AAC has its origins in the MDCT filter-bank window designed at Dolby Labs for the AC-3 algorithm using explicit perceptual criteria. By sacrificing pass-band selectivity, the KBD window gains improved stop-band attenuation relative to the sine window. In fact, the stop-band magnitude response is below a conservative composite minimum masking threshold for a tonal masker at the center of the pass-band. A KBD versus sine window simulation example (Figure 10.11) for a signal containing 300 Hz plus 3 harmonics shows the KBD potential for reduced bit allocation. A masking threshold estimate generated by MPEG-1 psychoacoustic model 2 is superimposed (red line). It can be seen that, for the given input, the KBD window is advantageous in terms of supra-threshold component minimization. All of the MDCT components below the superimposed masking threshold will potentially require allocations of zero bits. This tradeoff can ultimately lead to a lower bit rate. Details of the minimum masking template design procedure are given in [Davi94] and [Fiel96].
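To make the two window definitions concrete, the sketch below computes the first half (M samples) of each window; the full 2M-point windows follow from the symmetry w(2M − n − 1) = w(n). The Kaiser kernel parameter alpha is an assumption chosen for illustration (values around 4 to 6 are typical in AC-3/AAC practice), not a value taken from the text.

import numpy as np

def sine_window(M):
    # Eq. (10.8): w(n) = sin[(n + 1/2) * pi / (2M)], n = 0, ..., M-1
    n = np.arange(M)
    return np.sin((n + 0.5) * np.pi / (2 * M))

def kbd_window(M, alpha=4.0):
    # Eq. (10.9): cumulative-sum transformation of a symmetric Kaiser kernel
    # v(j), j = 0, ..., M (length M + 1); alpha is an illustrative assumption.
    v = np.kaiser(M + 1, alpha * np.pi)
    num = np.cumsum(v)[:M]              # sum of v(0..n) for n = 0, ..., M-1
    return np.sqrt(num / np.sum(v))     # denominator sums v(0..M)

# Full 2M-point windows by symmetry, e.g., for M = 1024 subbands
M = 1024
w_sine = np.concatenate([sine_window(M), sine_window(M)[::-1]])
w_kbd = np.concatenate([kbd_window(M), kbd_window(M)[::-1]])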

10.4.3.2 Spectral Prediction. The AAC algorithm realizes improved coding efficiency relative to its predecessors by applying prediction over time to the transform coefficients below 16 kHz, as was done previously in [Mahi89] [Fuch93] [Fuch95]. In this case, the spectral prediction tool is applied only during long analysis windows, and then only if a bit-rate reduction is obtained when coding the prediction residuals instead of the original coefficients. Side information is minimal, since the second-order lattice predictors are updated on each frame using a backward adaptive LMS algorithm. The predictor banks, which can be selectively activated for individual quantization scale-factor bands, produced an improvement for a fixed bit rate of +1 point on the ITU 5-point impairment scale for the critical pitch pipe and harpsichord test material.

10.4.3.3 Bit Allocation. The bit allocation and quantization strategies in AAC bear some similarities to previous MPEG coders in that they make use of a nested-loop, iterative procedure, and in that psychoacoustic masking thresholds are obtained from an analysis model similar to MPEG-1 model recommendation number two. Both lossy and lossless coding blocks are integrated into the rate-control loop structure, so that redundancy removal and irrelevancy reduction are simultaneously affected in a single analysis-by-synthesis process.

Figure 10.11. Comparison of the MPEG-2 AAC MDCT analysis filter-bank outputs (level in dB versus frequency in Hz) for the sine window vs. the KBD window, with the supra-masking threshold superimposed.

The scheme works as follows. As in the case of MPEG-1 layer III, the AAC coefficients are grouped into 49 scale-factor bands that mimic the auditory system's frequency resolution.

In the nested-loop allocation procedure, the inner loop adjusts scale-factor quantizer step sizes in increments of 1.5 dB (approximating the intensity difference limen (DL)) and obtains Huffman codewords for both quantized scale factors and quantized coefficients until the desired bit rate is achieved. Then, in the outer loop, the quantization noise introduced by the inner loop is compared to the masking threshold in order to assess noise audibility. Undercoded scale-factor bands are amplified to force increased coding precision, and then the inner loop is called again for compliance with the desired bit rate. A best result is stored after each iteration, since the two-loop process is not guaranteed to converge. As with other algorithms, such as MPEG-1 layer III and the Lucent Technologies PAC [John96c], a bit reservoir is maintained to compensate for time-varying perceptual bit-rate requirements.
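The following Python sketch illustrates the structure of such a two-loop allocation under simplifying assumptions: a plain uniform quantizer stands in for the AAC nonuniform quantizer, and a crude bit-count stand-in replaces the actual Huffman tables, so the numbers are illustrative only and this is not the normative AAC procedure.

import numpy as np

def quantize(band, step_db):
    """Uniform stand-in quantizer; AAC actually uses a nonuniform power-law quantizer."""
    step = 10.0 ** (step_db / 20.0)
    return np.round(band / step), step

def bits_for(q):
    """Crude stand-in for the Huffman tables: ~log2(|q| + 1) + 1 bits per coefficient."""
    return int(np.sum(np.ceil(np.log2(np.abs(q) + 1.0)) + 1.0))

def two_loop_allocation(bands, thresholds, target_bits, max_outer=30):
    """bands: spectral coefficients per scale-factor band (list of arrays);
    thresholds: allowed noise energy per band; target_bits: frame bit budget."""
    amp = np.zeros(len(bands))                    # per-band amplification in dB
    best = None
    for _ in range(max_outer):
        g = 0.0                                   # inner (rate) loop: global step size
        while True:
            total, noise = 0, []
            for b, band in enumerate(bands):
                q, step = quantize(band, g - amp[b])
                total += bits_for(q)
                noise.append(float(np.sum((band - q * step) ** 2)))
            if total <= target_bits or g > 120.0: # 1.5-dB steps until the frame fits
                break
            g += 1.5
        audible = [b for b in range(len(bands)) if noise[b] > thresholds[b]]
        if best is None or len(audible) < best[0]:
            best = (len(audible), g, amp.copy())  # keep the best result found so far
        if not audible:                           # outer (distortion) loop
            break
        amp[np.array(audible)] += 1.5             # more precision where noise is audible
    return best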

10.4.3.4 Noiseless Coding. The noiseless coding block [Quac97] embedded in the rate-control loop has several innovative features as well. Twelve Huffman code books are available for 2- and 4-tuple blocks of quantized coefficients. Sectioning and merging techniques are applied to maximize redundancy reduction. Individual code books are applied to time-varying "sections" of scale-factor bands, and the sections are defined on each frame through a greedy merge algorithm that minimizes the bit rate. Grouping across time and intraframe frequency interleaving of coefficients prior to code-book application are also applied to maximize zero-coefficient runs and further reduce bit rates.

10.4.3.5 Other Enhancements. Relative to MPEG-1 and MPEG-2 BC/LSF, other enhancements have also been embedded in AAC. For example, the AAC algorithm has an embedded TNS module [Herr96] for pre-echo control (Section 6.9), a special profile for sample-rate scalability (SSR), and time-varying as well as frequency subband selective application of MS and/or intensity stereo coding for 5-channel inputs [John96b].

10.4.3.6 Performance. Incorporation of the nonbackward compatible coding enhancements proved to be a judicious strategy for the AAC algorithm. In independent listening tests conducted worldwide [ISOI96d], the AAC algorithm met the strict ITU-R BS.1116 criteria for indistinguishable quality [ITUR94b] at a rate of 320 kb/s for five full-bandwidth channels [Kirb97]. This level of quality was achieved with a manageable decoder complexity. Two-channel, real-time AAC decoders were reported to run on 133-MHz Pentium platforms using 40% and 25% of available CPU resources for the main and low-complexity profiles, respectively [Quac98a]. MPEG-2 AAC maintained its presence as the core "time-frequency" coder reference model for the MPEG-4 standard.

10.4.3.7 Reference Model Validation (RM). Before proceeding with a discussion of MPEG-4, we first consider a significant system-level aspect of MPEG-2 AAC that also propagated into MPEG-4. Both algorithms are structured in terms of so-called reference models (RMs). In the RM approach, generic coder blocks or tools (e.g., perceptual model, filter bank, rate-control loop, etc.) adhere to a set of defined interfaces. The RM therefore facilitates the testing of incremental single-block improvements without disturbing the existing macroscopic RM structure. For instance, one could devise a new psychoacoustic analysis model that satisfies the AAC RM interface and then simply replace the existing RM perceptual model in the reference software with the proposed model. It is then a straightforward matter to construct performance comparisons between the RM method and the proposed method in terms of quality, complexity, bit rate, delay, or robustness. The RM definitions are intended to expedite the process of evolutionary coder improvements.

In fact, several practical AAC improvements have already been analyzed within the RM framework. For example, a backward predictor was proposed [Yin97] as a replacement for the existing backward adaptive LMS predictors. This method, which relies upon a block LPC estimation procedure rather than a running LMS estimation, was reported to achieve comparable quality with a 38% (instruction) complexity reduction [Yin97]. This contribution was significant in light of the fact that the spectral prediction tool in the AAC main profile decoder constitutes 40% of the computational complexity [Yin97]. Decoder complexity is further reduced since the block predictors only require updates when the prediction module has been enabled, rather than requiring sample-by-sample updating regardless of activation status. Forward adaptive predictors have also been investigated [Ojan99]. In another example of RM efficacy, improvements to the AAC noiseless coding module were reported in [Taka97]. A modification to the greedy merge sectioning algorithm was proposed in which high-magnitude spectral peaks that tended to degrade Huffman coding efficiency were coded separately. The improvement yielded consistent bit-rate reductions of up to 11%. In informal listening tests, it was found that the bit savings resulted in higher quality at the same bit rate. In yet another example of RM innovation aimed at improving quality for a given bit rate, product code VQ techniques [Gers92] were applied to increase AAC scale-factor coding efficiency [Sree98a]. In the proposed scheme, scale factors are decorrelated using a DCT and then grouped into subvectors for quantization by a product code VQ. The method is intended primarily for low-rate coding, since the side-information bit burden rises from roughly 6% at 64 kb/s to, in some cases, 25% at 16 kb/s. As expected, subjective tests reflected an insignificant quality improvement at 64 kb/s. On the other hand, the reduction in bits allocated to side information at low rates (e.g., 16 kb/s) allowed more bits for spectral coefficient coding and therefore produced mean improvements of +0.52 and +0.36 on subjective differential improvement tests at bit rates of 16 and 40 kb/s, respectively [Sree98b]. Additionally, noise-to-mask ratios (NMRs) were reduced by as much as -2.43 for the "harpsichord" critical test item at 16 kb/s. Several architectures for MPEG-2 AAC real-time implementations were proposed, including [Chen99] [Hilp98] [Geye99] [Saka00] [Hong01] [Rett01] [Taka01] [Duen02] [Tsai02].

10.4.3.8 Enhanced AAC in MPEG-4. The next section is concerned with the multimodal MPEG-4 audio standard, for which the MPEG-2 AAC RM core was selected as the "time-frequency" audio coding RM, with some improvements. For example, perceptual noise substitution (PNS) was included [Herr98a] as part of the MPEG-4 AAC RM. Moreover, the long-term prediction (LTP) [Ojan99] and transform-domain weighted interleave VQ (TwinVQ) [Iwak96] modules became part of the MPEG-4 audio. The LTP, modeled after the MPEG-2 AAC prediction block, provides a higher coding precision for tonal signals, while the TwinVQ provides scalability and ultra-low bit-rate audio coding.

10.4.4 MPEG-4 Audio (ISO/IEC 14496-3)

The MPEG-4 ISO/IEC-14496 Part 3 audio standard was adopted in December 1998, after many proposed algorithms were tested [Cont96] [Edle96a] [ISOI96b] [ISOI96c] for compliance with the program objectives [ISOI94b] established by the MPEG committee. MPEG-4 audio (Figure 10.12) encompasses a great deal more functionality than just perceptual coding [Koen96] [Koen98] [Koen99]. It comprises an integrated family of algorithms with wide-ranging provisions for scalable, object-based speech and audio coding at bit rates from as low as 200 b/s up to 64 kb/s per channel.

Figure 10.12. An overview of the MPEG-4 audio coder [ISOI99]: features (universality, scalability, object-based representation, content-based interactivity, enhanced composition, natural/synthetic representation), profiles (synthesis, speech, scalable, and main profiles), and tools (general audio coding [Gril99], scalable audio coding [Bran94b] [Gril97], parametric audio coding [Edle96a] [Purn99a], speech coding [Edle99] [Nish99], structured audio coding [Sche98a] [Sche01], low-delay audio coding [Alla99] [Hilp00], and the error resilient feature [Sper00] [Mein01]).

The distinguishing features of MPEG-4 relative to its predecessors are extensive scalability, object-based representations, user interactivity/object manipulation, and a comprehensive set of coding tools available to accommodate almost any desired tradeoff between bit rate, complexity, and quality. Efficient and flexible coding of different content (objects), such as natural audio/speech and synthetic audio/speech, became indispensable for some of the innovative multimedia applications. To facilitate this, MPEG-4 audio provides coding and composition of natural and synthetic audio/speech content at various bit rates. Very low rates are achieved through the use of structured representations for synthetic speech and music, such as text-to-speech and MIDI. For higher bit rates and "natural audio" speech and music, the standard provides integrated coding tools that make use of different signal models, the choice of which is made depending upon the desired bit rate, bandwidth, complexity, and quality. Coding tools are also specified in terms of MPEG-4 "profiles" that essentially recommend tool sets for a given level of functionality and complexity. Beyond its provisions specific to coding of speech and audio, MPEG-4 also specifies numerous sophisticated system-level functions for media-independent transport, efficient buffer management, syntactic bitstream descriptions, and time-stamping for synchronization of audiovisual information units.

10.4.4.1 MPEG-4 Audio Versions. The MPEG-4 audio standard was released in several steps due to timing constraints. This resulted in two different versions of MPEG-4. Version 1 [ISOI99] was standardized in February 1999, followed by version 2 [ISOI00] (also referred to as Amendment 1 to version 1) in February 2000. New amendments for bandwidth extension, parametric audio extension, MP3 on MP4, audio lossless coding, and scalable-to-lossless coding have also been considered in the MPEG-4 audio standard.

MPEG-4 Audio Version 1. The MPEG-4 audio version 1 comprises the majority of the MPEG-4 audio tools. These are general audio coding, scalable coding, speech coding techniques, structured audio coding, and text-to-speech synthetic coding. These techniques can be grouped into two main categories, i.e., natural [Quac98b] and synthetic audio coding [Vaan00]. The MPEG-4 natural audio coding part describes traditional-type speech coding and high-quality audio coding algorithms at bit rates ranging from 2 kb/s to 64 kb/s and above. Three types of coders enable hierarchical (scalable) coding in MPEG-4 audio version 1 at different bit rates. First, at lower bit rates ranging from 2 kb/s to 6 kb/s, parametric speech coding is employed. Second, code excited linear predictive (CELP) coding is used for medium bit rates between 6 kb/s and 24 kb/s. Finally, for higher bit rates, typically ranging upward from 24 kb/s, transform-based (time-frequency) general audio coding techniques are applied. The MPEG-4 synthetic audio coding part describes the text-to-speech (TTS) and structured audio synthesis tools. Typically, the structured audio tools are used to provide effects like echo, reverberation, and chorus, while the TTS synthetic tools generate synthetic speech from text parameters.

MPEG-4 Audio Version 2. While remaining backward compatible with MPEG-4 version 1, version 2 adds new profiles that incorporate a number of significant system-level enhancements. These include error robustness, low-delay audio coding, small-step scalability, and enhanced composition [Purn99b]. At the system level, version 2 includes a media-independent bitstream format that supports streaming, editing, local playback, and interchange of contents. Furthermore, in version 2, an MPEG-J programmatic system specifies an application programming interface (API) for interoperation of MPEG players with JAVA. Version 2 offers improved audio realism in sound rendering. New tools allow parameterization of the acoustical properties of an audio scene, enabling features such as immersive audiovisual rendering, room acoustical modeling, and enhanced 3-D sound presentation. New error resilience techniques in version 2 allow both equal and unequal error protection for the audio bit streams. Low-delay audio coding is employed at low bit rates, where the coding delay would otherwise be significantly high. Moreover, to facilitate bit-rate scalability in small steps, version 2 provides a highly desirable tool called small-step scalability or fine-grain scalability. Text-to-speech (TTS) interfaces from version 1 are enhanced in version 2 with a markup TTS intended for applications such as speech-enhanced web browsing, verbal email, and storyteller on demand. Markup TTS has the ability to process HTML, SABLE, and facial animation parameter (FAP) bookmarks.

10.4.4.2 MPEG-4 Audio Profiles. Although many coding and processing tools are available in MPEG-4 audio, cost and complexity constraints often dictate that it is not practical to implement all of them in a particular system. Version 1, therefore, defines four complexity-ranked audio profiles intended to help system designers in the task of appropriate tool subset selection. In order of bit rate, they are as follows. The low-rate synthesis audio profile provides only wavetable-based synthesis and a text-to-speech (TTS) interface. For natural audio processing capabilities, the speech audio profile provides a very-low-rate speech coder and a CELP speech coder. The scalable audio profile offers a superset of the first two profiles. With bit rates ranging from 6 to 24 kb/s and bandwidths from 3.5 to 9 kHz, this profile is suitable for scalable coding of speech, music, and synthetic music in applications such as Internet streaming or narrow-band audio digital broadcasting (NADIB). Finally, the main audio profile is a superset of all other profiles, and it contains tools for both natural and synthetic audio.

10.4.4.3 MPEG-4 Audio Tools. Unlike MPEG-1 and MPEG-2, the MPEG-4 audio standard describes not only a set of compression schemes but also a complete functionality for a broad range of applications, from low-bit-rate speech coding to high-quality audio coding or music synthesis. This feature is called universality. MPEG-4 enables scalable audio coding, i.e., variable-rate encoding is provided to adapt dynamically to the varying transmission channel capacity. This property is called scalability. One of the main features of MPEG-4 audio is its ability to represent the audiovisual content as a set of objects. This enables content-based interactivity.

Natural Audio Coding Tools. MPEG-4 audio [Koen99] integrates a set of tools (Figure 10.13) for coding of natural sounds [Quac98b] at bit rates ranging from as low as 200 b/s up to 64 kb/s per channel. For speech and audio, three distinct algorithms are integrated into the framework. These include parametric coding, CELP coding, and transform coding. Parametric coding is employed for bit rates of 2-4 kb/s and 8 kHz sampling rate, as well as 4-16 kb/s and 8 or 16 kHz sampling rates (Section 9.4). For higher quality, narrow-band (8 kHz sampling rate) and wideband (16 kHz) speech is handled by a CELP speech codec operating between 6 and 24 kb/s. For generic audio at bit rates above 16 kb/s, a time/frequency perceptual coder is employed; in particular, the MPEG-2 AAC algorithm, with extensions for fine-grain bit-rate scalability [Park97], is specified in the MPEG-4 version 1 RM as the time-frequency coder. The multimodal framework of MPEG-4 audio allows the user to tailor the coder characteristics to the program material.

Synthetic Audio Coding Tools. While the earlier MPEG standards treated only natural audio program material, MPEG-4 audio achieves very-low-rate coding by supplementing its natural audio coding techniques with tools for synthetic audio processing [Sche98a] [Sche01] and interfaces for structured, high-level audio representations.

Figure 10.13. ISO/IEC MPEG-4 integrated tools for audio coding ([Koen99]): bit rate per channel (from 0.2 to 64 kb/s) and typical audio bandwidth (4 kHz, 8 kHz, up to 20 kHz) covered by the text-to-speech, parametric (e.g., HILN), CELP, scalable, and T/F (AAC or TwinVQ) general audio coders, grouped into natural and synthetic audio coding tools.

Chief among these are the text-to-speech interface (TTSI) and methods for score-driven synthesis. The TTSI provides the capability for 200-1200 b/s transmission of synthetic speech that can be represented in terms of either text only, or text plus prosodic parameters such as a pitch contour or a set of phoneme durations. Also, one can specify the age, gender, and speech rate of the speaker. Additionally, there are facilities for lip synchronization control, international language and dialect support, as well as controls for pause, resume, and jump forward/backward. The TTSI specifies only an interface, rather than a normative speech synthesis methodology, in order to maximize implementation flexibility.

Beyond speech, general music synthesis capabilities in MPEG-4 are provided by a set of structured audio tools [Sche98a] [Sche98d] [Sche98e]. Synthetic sounds are represented using the structured audio orchestra language (SAOL). SAOL [Sche98d] treats music as a collection of instruments. Instruments are then treated as small networks of signal-processing primitives, all of which can be downloaded to a decoder. Some of the available synthesis methods include wavetable, FM, additive, physical modeling, granular synthesis, or nonparametric hybrids of any of these methods [Sche98c]. An excellent tutorial on these and other structured audio methods and applications appeared in [Verc98]. The SAOL instruments are controlled at the decoder by "scores" or scripts in the structured audio score language (SASL). A score is a time-sequenced set of commands that invokes various instruments at specific times to contribute their outputs to an overall performance. SASL provides significant flexibility in that not only can instruments be controlled, but the existing sounds can be modified. For those situations in which fine control is not required, structured audio in MPEG-4 also provides backward compatibility with the MIDI protocol. Moreover, a standardized "wavetable bank format" is available for low-functionality terminals [Koen99]. In the next seven subsections, i.e., 10.4.4.4 through 10.4.4.10, we describe in detail the features and tools (Figure 10.12) integrated in the MPEG-4 audio standard.

10.4.4.4 MPEG-4 General Audio Coding. The MPEG-4 General Audio Coder (GAC) [Gril99] has the most vital and versatile functionality associated with the MPEG-4 tool set, in that it covers arbitrary natural audio signals. The MPEG-4 GAC is often called the "all-round" coding system among the MPEG-4 audio schemes, and it operates at bit rates ranging from 6 to 300 kb/s and at sampling rates between 7.35 kHz and 96 kHz. The MPEG-4 GA coder is built around the MPEG-2 AAC (Figure 10.10, discussed in Section 10.4.3), along with some extended features and coder configurations highlighted in Figure 10.14. These features are the perceptual noise substitution (PNS), long-term prediction (LTP), TwinVQ coding, and scalability.

Perceptual Noise Substitution (PNS). The PNS exploits the fact that a random noise process can be used to model efficiently the transform coefficients in noise-like frequency subbands, provided the noise vector has an appropriate temporal fine structure [Schu96]. Bit-rate reduction is realized since only a compact parametric representation is required for each PNS subband (i.e., the noise energy), rather than full quantization and coding of the subband transform coefficients. The PNS technique was integrated into the existing AAC bitstream definition in a backward-compatible manner. Moreover, PNS actually led to reduced decoder complexity, since pseudo-random sequences are less expensive to compute than Huffman decoding operations. In order to improve the coding efficiency, the principle of PNS is therefore employed as follows.

The PNS acronym is composed from the following: perceptual coding plus substitution of noise-like signals by a parametric form; i.e., PNS allows frequency-selective parametric encoding of noise-like components. These noise-like components are detected on a scale-factor band basis and are grouped into separate categories. The spectral coefficients corresponding to these categories are not quantized and are excluded from the coding process. Furthermore, only a noise substitution flag, along with the total power of these spectral coefficients, is transmitted for each band. At the decoder, the spectral coefficients are replaced by pseudo-random vectors with the desired target noise power. At a bit rate of 32 kb/s, a mean improvement due to PNS of +0.61 on the comparison mean opinion score (CMOS) test (for critical test items such as speech, castanets, and complex sound mixtures) was reported in [Herr98a]. The multichannel PNS modes include some provisions for binaural masking level difference (BMLD) compensation.
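A minimal sketch of the decoder-side substitution step described above follows; it assumes the band's noise energy has already been decoded, and the Gaussian noise source and exact energy-matching rule are illustrative choices, not normative ones.

import numpy as np

def pns_band(noise_energy, band_width, rng=np.random.default_rng(0)):
    """Replace one scale-factor band by a pseudo-random vector scaled so that
    its total power matches the transmitted noise energy."""
    v = rng.standard_normal(band_width)
    v *= np.sqrt(noise_energy / np.sum(v ** 2))   # match the target band energy
    return v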

Figure 10.14. MPEG-4 GA coder: gain control, MDCT, TNS, LTP, PNS, multichannel MS/intensity coupling, perceptual model, scale-factor extraction and quantization, TwinVQ, entropy coding, iterative rate control loop, and side information coding/bitstream formatting.

Long-Term Prediction (LTP). Unlike the noise-like signals, the tonal signals require higher coding precision. In order to achieve the required coding precision (20 dB for tone-like and 6 dB for noise-like signals), the long-term prediction (LTP) technique [Ojan99] is employed. In particular, since the tonal signal components are predictable, the speech coding pitch prediction techniques [Span94] can be used to improve the coding precision. The only significant difference between the prediction techniques performed in a common speech coder and in the MPEG-4 GA coder is that, in the latter case, the LTP is performed in the frequency domain, while in speech codecs the LTP is carried out in the time domain. A brief description of the LTP scheme in the MPEG-4 GA coder follows. First, the input audio is transformed to the frequency domain using an analysis filter bank, and a TNS analysis filter is then employed for shaping the noise artifacts. Next, the processed spectral coefficients are quantized and encoded. For prediction purposes, these quantized coefficients are transformed back to the time domain by a synthesis filter bank and the associated TNS operation. The optimum pitch lag and gain parameters are determined based on the residual and the input signal. In the next step, both the input signal and the residual are mapped to a spectral representation via the analysis filter bank and the forward TNS filter bank. Depending on which alternative is more favorable, coding of either the difference signal or the original signal is selected on a scale-factor band basis. This is achieved by means of a so-called frequency-selective switch (FSS), which is also used in the context of the MPEG-4 GA scalable systems. The complexity associated with the LTP in the MPEG-4 GA scheme is considerably (50%) reduced compared to the MPEG-2 AAC prediction scheme [Gril99].
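The lag/gain estimation step can be sketched as below: a simplified least-squares time-domain search over the locally reconstructed signal (frame and history are assumed to be NumPy arrays); the per-scale-factor-band frequency-domain application of the prediction via the FSS is omitted.

import numpy as np

def ltp_lag_gain(frame, history, min_lag=64, max_lag=2048):
    """Search for the pitch lag and gain that best predict the current frame
    from previously reconstructed samples (history), in the least-squares sense.
    Lags shorter than the frame length are skipped in this simplified search."""
    N = len(frame)
    best_lag, best_gain, best_err = 0, 0.0, float(np.sum(frame ** 2))
    for lag in range(max(min_lag, N), min(max_lag, len(history)) + 1):
        start = len(history) - lag
        pred = history[start:start + N]            # reconstructed segment delayed by 'lag'
        denom = float(np.dot(pred, pred))
        if denom == 0.0:
            continue
        gain = float(np.dot(frame, pred)) / denom  # optimal gain for this lag
        err = float(np.sum((frame - gain * pred) ** 2))
        if err < best_err:
            best_lag, best_gain, best_err = lag, gain, err
    return best_lag, best_gain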

TwinVQ. TwinVQ [Iwak96] [Hwan01] [Iwak01] is an acronym for Transform-domain Weighted Interleave Vector Quantization. The TwinVQ performs vector quantization of the transformed spectral coefficients based on a perceptually weighted model. The quantization distortion is controlled through a perceptual model [Iwak96]. The TwinVQ provides high coding efficiency even for music and tonal signals at extremely low bit rates (6-8 kb/s), which CELP coders fail to achieve. The TwinVQ performs quantization of the spectral coefficients in two steps, as shown in Figure 10.15. First, the spectral coefficients are flattened and normalized across the frequency axis. Second, the flattened spectral coefficients are quantized based on a perceptually weighted vector quantizer.

From Figure 10.15, the first step includes linear predictive coding, a periodicity computation, a Bark-scale spectral estimation scheme, and a power computation block. The LPC provides the overall spectral shape. The periodic component includes information on the harmonic structure. The Bark-scale envelope coding provides the required additional flattening of the spectral coefficients. The normalization restricts these spectral coefficients to a specific target range. In the second step, the flattened and normalized spectral coefficients are interleaved into subvectors. Based on some spectral properties and a weighted distortion measure, perceptual weights are computed for each subvector. These weights are applied to the vector quantizer (VQ). A conjugate-structure VQ that uses a pair of code books is employed. More detailed information on the conjugate-structure VQ can be obtained from [Kata93] [Kata96].

Figure 10.15. TwinVQ scheme in the MPEG-4 GA coder: step 1 flattens and normalizes the input spectral coefficients using LPC, periodicity extraction, and Bark-scale power estimation; step 2 interleaves the flattened coefficients into subvectors and applies a perceptually weighted VQ.

The MPEG-4 audio TwinVQ scheme provides audio coding at ultra-low bit rates (6-8 kb/s) and supports perceptual control of the quantization distortion. Comparative tests of MPEG AAC with and without the TwinVQ tool were performed and are given in [ISOI98]. Furthermore, the TwinVQ tool has provisions for scalable audio coding, which will be discussed next.
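The interleaving and weighted codebook search of step 2 can be sketched as follows; the codebook and the perceptual weights are placeholders supplied by the caller, and the real TwinVQ uses a conjugate-structure pair of codebooks rather than the single-codebook search shown here.

import numpy as np

def interleave(flat, n_sub):
    """Split the flattened spectrum into interleaved subvectors:
    coefficient k is assigned to subvector k mod n_sub."""
    return [flat[i::n_sub] for i in range(n_sub)]

def weighted_vq_index(subvec, weights, codebook):
    """Perceptually weighted nearest-neighbour search over a codebook
    (codebook: one candidate vector per row, same length as subvec)."""
    d = np.sum(weights * (codebook - subvec) ** 2, axis=1)
    return int(np.argmin(d))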

10.4.4.5 MPEG-4 Scalable Audio Coding. MPEG-4 scalable audio coding implies variable-rate encoding/decoding of bitstreams at bit rates that can be adapted dynamically to the varying transmission channel capacity [Gril97] [Park97] [Herr98b] [Creu02]. Scalable coding schemes [Bran94b] generate partial bitstreams that can be decoded separately. Therefore, encoding/decoding of a subset of the total bitstream will result in a valid signal at a lower bit rate. The various types of scalability [Gril97] are given by signal-to-noise ratio (SNR) scalability, noise-to-mask ratio (NMR) scalability, audio bandwidth scalability, and bit-rate scalability. The bit-rate scalability is considered to be one of the core functionalities of the MPEG-4 audio standard. Therefore, in our discussion of MPEG-4 scalable audio coding, we will consider only the bit-rate scalability and the various scalable coder configurations described in the standard.

The MPEG-4 bit-rate scalability scheme (Figure 10.16) allows an encoder to transmit bitstreams at a high bit rate while a low-rate bitstream contained within the high-rate code can still be decoded successfully. For instance, if an encoder transmits bitstreams at 64 kb/s, the decoder can decode at bit rates of 16, 32, or 64 kb/s, according to channel capacity, receiver complexity, and quality requirements. Typically, scalable audio coders consist of several layers, i.e., a core layer and a series of enhancement layers. For example, Figure 10.16 depicts one core layer and two enhancement layers. The core layer encodes the core (main) audio stream, while the enhancement layers provide further resolution and scalability. In particular, in the first stage, the core layer encodes the input audio s(n) based on a conventional lossy compression scheme. Next, an error signal (residual) E1(n) is calculated by subtracting the reconstructed signal ŝ(n) (obtained by decoding the compressed bitstream locally) from the input signal s(n). In the second stage (first enhancement layer), the error signal E1(n) is encoded to obtain the compressed residual e1(n). The above sequence of steps is repeated for all the enhancement layers.

To further demonstrate this principle, we consider an example (Figure 10.16) where the core layer uses 32 kb/s, the two enhancement layers employ bit rates of 16 kb/s and 8 kb/s, and the final sink layer supports 8 kb/s coding. Therefore, if no side information is encoded, the coding rate associated with the codec is 64 kb/s. At the decoder, one can decode this multiplexed audio bitstream at various rates, i.e., 64, 32, or 40 kb/s, etc., depending upon the bit-rate requirements, receiver complexity, and channel capacity. In particular, the core bitstream guarantees reconstruction of the original input audio with minimum artifacts. On top of the core layer, additional enhancement layers are added to increase the quality of the decoded signal.
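The layered residual structure of Figure 10.16 can be summarized by the following sketch, in which each layer is represented abstractly by a hypothetical pair of encode/local-decode callables supplied by the caller; these are placeholders, not MPEG-4 coding tools.

def scalable_encode(x, layers):
    """x: input signal (array-like supporting subtraction);
    layers: list of (encode, decode) pairs, core layer first.
    Returns one bitstream per layer; each layer codes the residual
    left by the layers below it."""
    streams = []
    residual = x
    for encode, decode in layers:
        bits = encode(residual)              # compress the current residual E_k(n)
        streams.append(bits)
        residual = residual - decode(bits)   # error signal passed to the next layer
    return streams

A decoder that receives only the first k streams simply sums the k locally decoded signals, which is what makes the lower-rate subsets valid bitstreams on their own.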

Figure 10.16. MPEG-4 scalable audio coding: the core layer compresses the input audio s(n), each enhancement layer compresses the residual between the input and the locally decoded output of the previous layer, and the resulting compressed residuals e1(n), e2(n), e3(n) are multiplexed.

Scalable audio coding finds potential applications in the fields of digital audio broadcasting, mobile multimedia communication, and streaming audio. It supports real-time streaming with a low buffer delay. One of the significant extensions of MPEG-4 scalable audio coding is the fine-grain scalability [Kim01], where bit-sliced arithmetic coding (BSAC) [Kim02a] is used. In each frame, bit planes are coded in the order of significance, beginning with the most significant bits (MSBs) and progressing to the LSBs. This results in a fully embedded coder containing all lower-rate codecs. The BSAC and fine-grain scalability concepts are explained below in detail.

Fine-Grain Scalability. It is important that bit-rate scalability is achieved without a significant coding efficiency penalty compared to fixed-bit-rate systems, and with low computational complexity. This can be achieved using the fine-grain scalability technique [Purn99b] [Kim01]. In this approach, bit-sliced arithmetic coding is employed along with the combination of advanced audio coding tools (Section 10.4.2). In particular, the noiseless coding of spectral coefficients and the scale-factor selection scheme is replaced by the BSAC technique, which provides scalability in steps of 1 kb/s/channel. The BSAC scheme works as follows. First, the quantized spectral values are grouped into frequency bands; each of these groups contains the quantized spectral values in binary form. Then the bits of each group are processed in slices and in the order of significance, beginning with the MSBs. These bit-slices are then encoded using an arithmetic coding technique (Chapter 3). Usually, the BSAC technique is used in conjunction with the MPEG-4 GA tool, where the Huffman coding is replaced by this special type of arithmetic coding.
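The bit-slicing step itself is simple to illustrate, as in the sketch below; the grouping into frequency bands and the arithmetic coding of each slice are omitted, the number of planes is simply set by the largest magnitude, and signs would be transmitted separately.

import numpy as np

def bit_slices(quantized):
    """Return the bit planes of the quantized magnitudes, MSB plane first.
    Truncating the list of planes corresponds to decoding at a lower rate."""
    mags = np.abs(quantized).astype(np.int64)
    n_planes = max(1, int(mags.max()).bit_length())
    return [((mags >> p) & 1) for p in range(n_planes - 1, -1, -1)]

# Example: a few quantized spectral values
planes = bit_slices(np.array([5, -3, 0, 12, 7]))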

10.4.4.6 MPEG-4 Parametric Audio Coding. In research proposed as part of an MPEG-4 "core experiment" [Purn97], Purnhagen at the University of Hannover, in conjunction with Deutsche Telekom Berkom, developed an object-based algorithm. In this approach, harmonic sinusoid, individual sinusoid, and colored noise objects were combined in a hybrid source model to create a parametric signal representation. The enhanced algorithm, known as "Harmonic and Individual Lines Plus Noise" (HILN) [Purn00a] [Purn00b], is architecturally very similar to the original ASAC [Edle96b] [Edle96c] [Purn98] [Purn99a], with some modifications. The parametric audio coding scheme is part of MPEG-4 version 2 and is based on the HILN scheme (see also Section 9.4). This technique involves coding of audio signals at bit rates of 4 kb/s and above, and it offers the possibility of modifying the playback speed or pitch during decoding. The parametric audio coding tools have also been extended to high-quality audio [Oom03].

10.4.4.7 MPEG-4 Speech Coding. The MPEG-4 natural speech coding tool [Edle99] [Nish99] provides a generic coding framework for a wide range of applications with speech signals at bit rates between 2 kb/s and 24 kb/s. The MPEG-4 speech coding is based on two algorithms, namely, harmonic vector excitation coding (HVXC) and code excited linear predictive (CELP) coding. The HVXC algorithm, essentially based on a parametric representation of speech, handles very low bit rates of 1.4-4 kb/s at a sampling rate of 8 kHz. On the other hand, the CELP algorithm employs multipulse excitation (MPE) and regular-pulse excitation (RPE) coding techniques (Chapter 4) and supports higher bit rates of 4-24 kb/s, operating at sampling rates of 8 kHz and 16 kHz. The specifications of the MPEG-4 natural speech coding tool set are summarized in Table 10.4.

In all the aforementioned algorithms, i.e., HVXC, CELP-MPE, and CELP-RPE, the idea is that an LP analysis filter models the human vocal tract, while an excitation signal models the vocal cord and glottal activity. All three configurations share the same LP analysis method, while they generally differ only in the excitation computation. In the LP analysis, the autocorrelation coefficients of the input speech are first computed once every 10 ms and are converted to LP coefficients using the Levinson-Durbin algorithm. The LP coefficients are transformed to line spectrum pairs using Chebyshev polynomials [Kaba86]. These are later quantized using a two-stage split-vector quantizer. The excitation signal is chosen in such a way that the error between the original and reconstructed signal is minimized according to a perceptually weighted distortion measure.
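For reference, the Levinson-Durbin recursion mentioned above converts the autocorrelation sequence into LP coefficients as sketched below; this is a textbook form of the recursion, not the standard's fixed-point implementation.

import numpy as np

def levinson_durbin(r, order):
    """r: autocorrelation values r[0..order]; returns (a, err) where
    a[0..order] are the LP coefficients (a[0] = 1) and err is the
    final prediction error energy."""
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i-1:0:-1])   # correlation with current predictor
        k = -acc / err                             # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]          # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)
    return a, err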

Multiple Bit Rates/Sampling Rates, Scalability. The speech coder family in MPEG-4 audio is different from standard speech coding algorithms (e.g., ITU-T G.723.1, G.729, etc.). Some of the salient features and functionalities (Figure 10.17) of the MPEG-4 speech coder include multiple sampling rates and bit rates, bit-rate scalability [Gril97], and bandwidth scalability [Nomu98].

The multiple bit rates/sampling rates functionality provides flexible bit-rate selection among multiple available bit rates (1.4-24 kb/s), based on the channel conditions and the bandwidth availability (8 kHz and 16 kHz). At lower bit rates, an algorithmic delay on the order of 30-40 ms is expected, while at higher bit rates a 15-25 ms delay is common.

Table 10.4. MPEG-4 speech coding sampling rates and bandwidth specifications [Edle99]

Specification              HVXC               CELP-MPE                   CELP-RPE
Sampling frequency (kHz)   8                  8, 16                      16
Bit rate (kb/s)            1.4-4              3.85-23.8 (58 bit rates)   10.9-23.8 (30 bit rates)
Frame size (ms)            10-40              10-40                      10-20
Delay (ms)                 33.5-56            ~15-45                     ~20-25
Features                   Multi-bit-rate     Multi-bit-rate coding,     Multi-bit-rate coding,
                           coding, bit-rate   bit-rate scalability,      bit-rate scalability
                           scalability        bandwidth scalability

Figure 10.17. MPEG-4 speech coder functionalities: multiple sampling rates and bit rates (narrow-band speech coding at 8 kHz using HVXC and CELP, and wide-band speech coding at 16 kHz using CELP), bit-rate scalability (HVXC and CELP), and bandwidth scalability (CELP only).

The bit-rate scalability feature allows a wide range of bit rates (2-24 kb/s) in step sizes as low as 100 b/s. Both HVXC and CELP tools can be used to realize bit-rate scalability by employing a core layer and a series of enhancement layers at the encoder. When HVXC encoding is used, one enhancement layer is preferred, while three bit-rate scalable enhancement layers may be used for the CELP codec [Gril97]. Bandwidth scalability improves the audio quality by adding an additional coding layer that extends the transmitted audio bandwidth. Only the CELP-RPE and CELP-MPE schemes allow bandwidth scalability in MPEG-4 audio. Furthermore, only one bandwidth-scalable enhancement layer is possible [Nomu98] [Herr00a].

10.4.4.8 MPEG-4 Structured Audio Coding. Structured audio (SA), introduced by Vercoe et al. [Verc98], presents a new dimension to MPEG-4 audio, primarily due to its ability to represent and encode efficiently synthetic audio and multimedia content. The MPEG-4 SA tool [Sche98a] [Sche98c] [Sche99a] [Sche99b] was developed based on a synthesizer-description language called Csound [Verc95], developed by Vercoe at the MIT Media Labs. Moreover, the MPEG-4 SA tool inherits features from "Netsound" [Case96], a structured audio experiment carried out by Casey et al. based on the Csound synthesis language. Instead of specifying a synthesis method, the MPEG-4 SA describes a special language that defines synthesis methods. In particular, the MPEG-4 SA tool defines a set of syntax and semantic rules corresponding to the synthesis-description language called the Structured Audio Orchestra Language (SAOL) [Sche98d]. A control (score) language called the Structured Audio Score Language (SASL) was also defined to describe the details of the SAOL code compaction. Another component, namely, the Structured Audio Sample Bank Format (SASBF), is used for the transmission of data samples in blocks. These blocks contain sample data as well as details of the parameters used for selecting optimum wave-table synthesizers, and they facilitate algorithmic modifications. A theoretical basis for SA coding was established in [Sche01] based on Kolmogorov complexity theory. Also in [Sche01], Scheirer proposed a new paradigm called generalized audio coding, in which SA encompasses all other audio coding techniques. Furthermore, a treatment of structured audio in view of both lossless coding and perceptual coding is also given in [Sche01].

The SA bitstream available at the MPEG-4 SA decoder (Figure 10.18) consists of a header, sample data, and score data. The SAOL decoder block acts as an interpreter and reads the header structure. It also provides the information required to reconfigure the synthesis engine. The header carries descriptions of several instruments, synthesizers, control algorithms, and routing instructions. The Event List and Data block obtains the actual stream of data samples and the parameters controlling algorithmic modifications. In particular, the bitstream data consist of access units that primarily contain the list of events. Furthermore, each event refers to an instrument described (e.g., in the orchestra chunk) in the header [Sche01]. The SASL decoder block compiles the score data from the SA bitstream and provides control sequences and signals to the synthesis engine via a run-time scheduler. This control information determines the time at which the events (or commands) are to be dispatched in order to create notes (or instances) of an instrument. Each note produces some sound output. Finally, all these sound outputs (corresponding to each note) are added in order to create the overall orchestra output. In Figure 10.18, we represent the run-time scheduler and reconfigurable synthesis engine blocks separately; however, in practice they are usually combined into one block.

As mentioned earlier, the structured audio tool and the text-to-speech (TTS) tool fall in the synthetic audio coding group. Recall that the structured audio tools convert a structured representation into synthetic sound, while the TTS tools translate text to synthetic speech. In both of these methods, the particular synthesis method or implementation is not defined by the MPEG-4 audio standard; however, the input-output relation for SA and the TTS interface are standardized. The next question that arises is how the natural and synthetic audio content can be mixed. This is typically carried out based on a special format specified by MPEG-4, namely, the Audio Binary Format for Scene Description (AudioBIFS) [Sche98e]. AudioBIFS enables sound mixing, grouping, morphing, and effects like echo (delay), reverberation (feedback delay), chorus, etc.

10.4.4.9 MPEG-4 Low-Delay Audio Coding. Significantly large algorithmic delays (on the order of 100-200 ms) in the MPEG-4 GA coding tool (discussed in Section 10.4.4.4) hinder its application in two-way real-time communication. These algorithmic delays in the GA coder can be attributed primarily to the analysis/synthesis filter-bank window, the look-ahead, the bit reservoir, and the frame length. In order to overcome large algorithmic delays, a simplified version of the GA tool, i.e., the MPEG-4 low-delay (LD) audio coder, has been proposed [Herr98c] [Herr99]. One of the main reasons for the wide proliferation of this tool is the low algorithmic delay requirement in voice-over-Internet-protocol (VoIP) applications. In contrast to the ITU-T G.728 speech standard, which is based on LD-CELP [G728], the MPEG-4 LD audio coder [Alla99] is derived from the GA coder and MPEG-2 AAC.

Figure 10.18. MPEG-4 SA decoder (after [Sche98a]): the demultiplexed SA bitstream (header, samples, and score data) feeds the SAOL decoder, the Event List and Data block, and the SASL decoder, which drive a run-time scheduler and a reconfigurable synthesis engine producing the audio output.

The ITU-T G.728 LD-CELP algorithm operates on speech frames of 2.5 ms (20 samples) at a sampling rate of 8 kHz and results in an algorithmic delay of 0.625 ms (5 samples). On the other hand, the MPEG-4 LD audio coding tool operates on 512 or 480 samples at a sampling rate of up to 48 kHz, with an overall algorithmic delay of 20 ms. Recall that the GA tool, which is based on the MPEG-2 AAC, operates on frames of 1024 or 960 samples.

The delays due to the analysis/synthesis filter-bank window can be reduced by employing shorter windows. The look-ahead delays can be avoided by not employing block switching. To reduce pre-echo distortions (Sections 6.9 and 6.10), TNS is employed in conjunction with window shape adaptation. In particular, for nontransient parts of the signal a sine window is used, while a so-called low-overlap window is used in the case of transient signals to achieve optimum TNS performance [Purn99b] [ISOI00]. Although most algorithms are fixed rate, the instantaneous bit rates required to satisfy masked thresholds on each frame are in fact time-varying. Thus, the idea behind a bit reservoir is to store surplus bits during periods of low demand and then to allocate bits from the reservoir during localized periods of peak demand, resulting in a time-varying instantaneous bit rate but, at the same time, a fixed average bit rate. However, in the MPEG-4 LD audio codec, the use of the bit reservoir is minimized in order to reach the desired target delay.

Based on the results published in [Alla99] [Herr99] [Purn99b] [ISOI00], the MPEG-4 LD audio codec performs relatively well compared to the MP3 coder at a bit rate of 64 kb/s per channel. It can also be noted from the MPEG-4 version 2 audio verification test [ISOI00] that the quality of the MPEG-2 AAC at 24 kb/s and of the MPEG-4 LD audio codec at 32 kb/s compare favorably. Moreover, the MPEG-4 LD audio codec [Herr98c] [Alla99] [Herr99] outperformed the ITU-T G.728 LD-CELP [G728] for coding of both music and speech signals. However, as expected, the coding efficiency of the MPEG-4 LD codec is slightly reduced compared to its predecessors, MPEG-2 AAC and MPEG-4 GA. It should be noted that this reduction in coding efficiency is attributed to the low coding delay achieved.

10.4.4.10 MPEG-4 Audio Error Robustness Tool. One of the key issues in achieving reliable transmission over noisy and fast time-varying channels is the bit-rate scalability feature (discussed in Section 10.4.4.5). Bit-rate scalability enables flexible selection of coding features and dynamically adapts to the channel conditions and the varying channel capacity. However, the bit-rate scalability feature alone is not adequate for reliable transmission; error resilience and error protection tools are also essential to obtain high-quality audio. To this end, MPEG-4 audio version 2 is fitted with codec-specific error robustness techniques [Purn99b] [ISOI00]. In this subsection, we review the error robustness and the equal and unequal error protection (EEP and UEP) tools of MPEG-4 audio version 2. In particular, we discuss the error resilience [Sper00] [Sper02], error protection [Purn99b] [Mein01], and error concealment [Sper01] functionalities that are primarily designed for mobile applications.


The main idea behind the error resilience and protection tools is to provide better protection to sensitive and priority (important) bits. For instance, the audio frame header requires maximum error robustness; otherwise, transmission errors in the header will seriously impair the entire audio frame. The codewords corresponding to these priority bits are called the priority codewords (PCWs). The error resilience tools available in MPEG-4 audio version 2 are classified into three groups: the Huffman codeword reordering (HCR), the reversible variable-length coding (RVLC), and the virtual codebooks (VCB11). In the HCR technique, some of the codewords, e.g., the PCWs, are sorted in advance and placed at known positions. First, a presorting procedure is employed that reorders the codewords based on their priority. The resulting PCWs are placed such that an error in one codeword will not affect the subsequent codewords. This can be achieved by defining segments of known length (LSEG) and placing the PCWs at the beginning of these segments. The non-PCWs are filled into the gaps left by the PCWs, as shown in Figure 10.19.
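A minimal sketch of the HCR placement idea follows (the codewords and segment length are hypothetical; the actual segment sizes and bitstream syntax are defined in [ISOI00]):

```python
def hcr_reorder(priority_cws, other_cws, seg_len_bits):
    """Place each priority codeword (PCW) at the start of a fixed-length
    segment, then fill the remaining gaps with non-priority codewords."""
    segments = [[pcw] for pcw in priority_cws]
    fill = list(other_cws)
    for seg in segments:
        while fill and sum(len(cw) for cw in seg) + len(fill[0]) <= seg_len_bits:
            seg.append(fill.pop(0))
    return segments

# Codewords represented as bit strings; an 8-bit segment length is illustrative.
pcws = ["1101", "001", "10"]
non_pcws = ["111", "0", "01", "110"]
print(hcr_reorder(pcws, non_pcws, 8))
# Each segment starts with a PCW at a known position, so a bit error inside one
# segment cannot desynchronize the decoding of the PCWs in later segments.
```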

The various applications of reversible variable-length codes (RVLC) [Taki95] [Wen98] [Tsai01] in image coding have inspired researchers to consider them in error-resilient techniques for MPEG-4 audio. RVLC codes are used instead of Huffman codes for packing the scale factors in an AAC bitstream. The RVLC codes are (symmetrically) designed to enable both forward and backward decoding without affecting the coding efficiency. In particular, RVLCs allow instantaneous decoding in both directions, which provides error robustness and significantly reduces the effects of bit errors in delay-constrained real-time applications. The next important tool employed for error resilience is the virtual codebook 11 (VCB11). Virtual codebooks are used to detect serious errors within spectral data [Purn99b] [ISOI00]. The error robustness techniques are codec specific (e.g., AAC and BSAC bitstreams). For example, AAC supports the HCR, the RVLC, and the VCB11 error-resilient tools. On the other hand, BSAC supports segmented binary arithmetic coding [ISOI00] to avoid error propagation within spectral data.

[Figure 10.19: Huffman codeword reordering (HCR) algorithm to minimize error propagation in spectral data. Before reordering, PCWs and non-PCWs are interleaved; after reordering, each PCW is placed at the start of a segment of known length (LSEG), and the non-PCWs fill the remaining gaps.]


The MPEG-4 audio error protection tools include cyclic redundancy check (CRC), forward error correction (FEC), and interleaving. Note that these tools are inspired by some of the error correcting/detecting features inherent in convolutional and block codes, which essentially provide the controlled redundancy desired for error protection. Unlike the error-resilient tools, which are limited to the AAC and BSAC bitstreams, the error protection tools can be used in conjunction with a variety of MPEG-4 audio tools, namely, the General Audio coder (LTP and TwinVQ), the scalable audio coder, the parametric audio coder (HILN), CELP, HVXC, and the low-delay audio coder. Similar to the error-resilient tools, the first step in the EP tools is to reorder the bits based on their priority and error sensitivity. The bits are sorted and grouped into different classes (usually 4 or 5) according to their error sensitivities. For example, consider that there are four error-sensitive classes (ESC), namely, ESC-0, ESC-1, ESC-2, and ESC-3. Usually, the header bitstream or other very important bits that control the syntax and the global gain are included in ESC-0, while the scale factors and the spectral data (spectral envelope) are grouped in ESC-1 and ESC-2, respectively. The remaining side information and indices of MDCT coefficients are classified in ESC-3. After reordering (grouping) the bits, each error-sensitive class receives a different error protection, depending on the overhead allowed for each configuration. CRC and systematic rate-compatible punctured convolutional (SRCPC) codes enable error detection and forward error correction (FEC). The SRCPC codes also aid in adjusting the redundancy rates in small steps. An interleaver is typically employed to deconcentrate or spread the burst errors. Shortened Reed-Solomon (SRS) codes are used to protect the interleaved data. Details on the design of Reed-Solomon codes for MPEG AAC are given in [Huan02]. For an in-depth treatment of error correcting and error detecting codes, refer to [Lin82] [Wick95].
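The class-based reordering step can be illustrated as follows; the field names and the per-class code rates are hypothetical, chosen only to show the principle of stronger protection for more sensitive classes:

```python
# Error-sensitivity classes (ESC) for one illustrative audio frame:
# ESC-0 receives the strongest protection, ESC-3 the weakest.
frame_fields = {
    "header_and_global_gain": "ESC-0",
    "scale_factors":          "ESC-1",
    "spectral_envelope":      "ESC-2",
    "side_info":              "ESC-3",
    "mdct_indices":           "ESC-3",
}

# Illustrative SRCPC code rates per class (more redundancy for sensitive classes).
srcpc_rate = {"ESC-0": "1/2", "ESC-1": "2/3", "ESC-2": "3/4", "ESC-3": "8/9"}

for field, esc in frame_fields.items():
    print(f"{field:22s} -> {esc} (SRCPC rate {srcpc_rate[esc]})")
```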

10.4.4.11 MPEG-4 Speech Coding Tool Versus ITU-T Speech Standards. It is noteworthy to compare the MPEG-4 speech coding tool against the ITU-T speech coding standards. While the latter apply a source-filter configuration to model the speech parameters, the former employs a variety of techniques in addition to the traditional parametric representation. The MPEG-4 speech coding tool allows bit-rate scalability and real-time processing, as well as applications related to storage media. The MPEG-4 speech coding tool incorporates algorithms such as the TwinVQ, the BSAC, and the HVXC/CELP. The MPEG-4 speech coding tool also accommodates multiple sampling rates. Error protection and error resilience techniques are provided in the MPEG-4 speech coding tool to obtain improved performance over error-prone channels. Other important features that distinguish the MPEG-4 tool from the ITU-T speech standards are the content-based interactivity and the ability to represent the audiovisual content as a set of objects.

10.4.4.12 MPEG-4 Audio Applications. The MPEG-4 audio standard finds applications in low-bit-rate audio/speech compression, individual coding of natural and synthetic audio objects, low-delay coding, error-resilient transmission,


and real-time audio transmission over packet-switching networks such as the Internet [Diet96] [Liu99]. MPEG-4 tools allow parameterization of the acoustical properties of an audio scene, with features such as immersive audiovisual rendering (virtual 3-D environments [Kau98]), room acoustical modeling, and enhanced 3-D sound presentation. MPEG-4 also finds interesting applications in remote robot control system design [Kim02b]. Streaming audio codecs have also been proposed as a result of the MPEG-4 standardization efforts.

Applications of MPEG-4 audio in DRM digital narrowband broadcasting (DNB) and in digital multimedia broadcasting (DMB) are given in [Diet00] and [Grub01], respectively. The general audio coding tool provides the necessary infrastructure for the design of error-robust scalable coders [Mori00b] and delivers improved speech/audio quality [Mori00a]. The "bit-rate scalability" and "error resilience/protection" tools of the MPEG-4 audio standard dynamically adapt to the channel conditions and the varying channel capacity. Other important application-oriented features of MPEG-4 audio include low-delay bi-directional audio transmission, content-based interactivity, and object-based representation. Real-time implementations of MPEG-4 audio are reported in [Hilp00] [Mesa00] [Pena01].

10.4.4.13 Spectral Band Replication and Parametric Stereo. Spectral band replication (SBR) [Diet02] and parametric stereo (PS) [Schu04] are the two new compression techniques recently added to the MPEG-4 audio standard [ISOI03c]. The SBR technique is used in conjunction with a conventional coder such as the MP3 or the MPEG AAC. The audio signal is divided into low- and high-frequency bands. The underlying core coder operates at a reduced sampling rate and encodes the low-frequency band. The SBR technique operates at the original sampling rate to estimate the spectral envelope associated with the input audio. The spectral envelope, along with a set of control parameters, is encoded and transmitted to the decoder. The control parameters contain information regarding the gain and the spectral envelope level adjustment of the high-frequency components. At the decoder, the SBR reconstructs the high frequencies based on the transposition of the lower frequencies.
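A highly simplified sketch of the decoder-side idea follows: the low-band spectrum is transposed (copied) upward and rescaled to the transmitted envelope. The band split, gains, and patching rule here are illustrative and do not reflect the actual SBR patching algorithm:

```python
import numpy as np

def sbr_like_reconstruction(low_band, envelope_gains):
    """Regenerate the high band by copying the low-band spectrum upward and
    scaling each copy according to the transmitted spectral envelope."""
    low_band = np.asarray(low_band, dtype=float)
    high_band = np.concatenate([g * low_band for g in envelope_gains])
    return np.concatenate([low_band, high_band])

low = np.abs(np.random.randn(8))   # stand-in for decoded low-band magnitudes
gains = [0.8, 0.5, 0.3]            # illustrative envelope gains for three patches
full_spectrum = sbr_like_reconstruction(low, gains)
```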

aacPlus v1 is the combination of AAC and SBR and is standardized as the MPEG-4 high-efficiency (HE)-AAC [ISOI03c] [Wolt03]. Relative to conventional AAC, MPEG-4 HE-AAC results in bit-rate reductions of about 30% [Wolt03]. The SBR has also been used to enhance the performance of MP3 [Zieg02] and of the MPEG Layer II digital audio broadcasting systems [Gros03].

aacPlus v2 [Purn03] adds parametric stereo coding to the MPEG-4 HE-AAC standard. In PS encoding [Schu04], the stereo signal is represented as a monaural signal plus ancillary data that describe the stereo image. The stereo image is described using four different PS parameters, i.e., inter-channel intensity differences (IID), inter-channel phase differences (IPD), inter-channel coherence (IC), and the overall phase difference (OPD). These PS parameters can capture the perceptually relevant spatial cues at bit rates as low as 10 kb/s [Bree04].
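Two of these parameters are easy to illustrate; the sketch below estimates a per-frame inter-channel intensity difference and coherence from a stereo frame. This merely illustrates the definitions; it is not the standardized PS extraction, which operates per band in a filter-bank domain:

```python
import numpy as np

def ps_cues(left, right, eps=1e-12):
    """Inter-channel intensity difference (IID, dB) and coherence (IC) for one frame."""
    e_l, e_r = np.sum(left**2), np.sum(right**2)
    iid_db = 10.0 * np.log10((e_l + eps) / (e_r + eps))
    ic = np.abs(np.sum(left * right)) / (np.sqrt(e_l * e_r) + eps)
    return iid_db, ic

fs = 48000
t = np.arange(1024) / fs
l = np.sin(2 * np.pi * 440 * t)
r = 0.5 * np.sin(2 * np.pi * 440 * t + 0.2)   # quieter, slightly phase-shifted copy
print(ps_cues(l, r))
```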


10.4.5 MPEG-7 Audio (ISO/IEC 15938-4)

The MPEG-7 audio standard targets content-based multimedia applications [ISOI01b]. MPEG-7 audio supports a broad range of applications [ISOI01d] that include multimedia indexing/searching, multimedia editing, broadcast media selection, and multimedia digital library sorting. Moreover, it provides ways for efficient audio file retrieval and supports both text-based and context-based queries. It is important to note that MPEG-7 will not replace MPEG-1, MPEG-2 BC/LSF, MPEG-2 AAC, or MPEG-4; it is intended to provide complementary functionality to these MPEG standards. If MPEG-4 is considered the first object-based multimedia representation standard, then MPEG-7 can be regarded as the first content-based standard that incorporates multimedia interfaces through descriptions. These descriptions are the means of linking the audio content features and attributes with the audio itself. Figure 10.20 presents an overview of the MPEG-7 audio standard; it depicts the various audio tools, features, and profiles associated with MPEG-7 audio. Publications on the MPEG-7 audio standard include [Lind99] [Nack99a] [Nack99b] [Lind00] [ISOI01b] [ISOI01e] [Lind01] [Quac01] [Manj02].

Motivated by the need to exchange multimedia content through the World Wide Web, in 1996 the ISO/IEC MPEG workgroup began work on a project called the "Multimedia Content Description Interface" (MCDI) – MPEG-7. A working draft was formed in December 1999, followed by a final committee draft in February 2001. Seven months later, MPEG-7 ISO/IEC 15938 Part 4: Audio, an international standard (IS) for content-based multimedia applications, was published along with the seven other parts of the MPEG-7 standard. Figure 10.20 shows a summary of the various features, applications, and profiles specified by the MPEG-7 audio coding standard.

10.4.5.1 MPEG-7 Parts. MPEG-7 defines the following eight parts [MPEG] (Figure 10.20): MPEG-7 Systems, MPEG-7 DDL, MPEG-7 Visual, MPEG-7 Audio, MPEG-7 MDS, MPEG-7 Reference Software (RS), MPEG-7 Conformance Testing (CT), and MPEG-7 Extraction and Use of Descriptions.

MPEG-7 Systems (Part I) specifies the binary format for encoding MPEG-7 descriptions. MPEG-7 DDL (Part II) is the language for defining the syntax of the description tools. MPEG-7 Visual (Part III) and MPEG-7 Audio (Part IV) deal with the visual and audio descriptions, respectively. MPEG-7 MDS (Part V) defines the structures for multimedia descriptions. MPEG-7 RS (Part VI) is a unique software implementation of certain parts of the MPEG-7 standard with noninformative status. MPEG-7 CT (Part VII) provides the essential guidelines and procedures for conformance testing of MPEG-7 implementations. Finally, the eighth part addresses the use and formulation of a variety of description tools that we will discuss later in this section.

In our discussion of MPEG-7 Audio, we refer to the MPEG-7 DDL and MPEG-7 MDS parts quite regularly, mostly due to their interconnectivity within the MPEG-7 audio framework. Therefore, it is necessary that we introduce these two parts first, before we move on to the MPEG-7 audio description tools.

[Figure 10.20: An overview of the MPEG-7 standard. The figure summarizes the MPEG-7 audio description tools [ISOI01e], i.e., the low-level or generic tools (1. audio framework, 2. silence segment) and the high-level (application-specific) description tools (1. spoken content, 2. musical instrument timbre, 3. melody, 4. sound recognition and indexing, 5. audio signature/robust matching); the MPEG-7 parts (Systems, Description Definition Language (DDL) [ISOI01a], Visual, Audio [ISOI01b], Multimedia Description Scheme (MDS) [ISOI01c], Reference Software, Conformance Testing, Extraction and Use of Descriptions); the profiles (simple profile, user description profile, summary profile, audiovisual logging profile); and the features and applications [ISOI01d] (multimedia indexing, multimedia editing, selecting broadcast media, multimedia digital library sorting, cognitive coding, a new description technology, multimedia content description interface).]



[Figure 10.21: Some essential building blocks of the MPEG-7 standard: descriptors (Ds), description schemes (DSs), and the description definition language (DDL). The figure shows descriptors D1, D2, ... grouped into description schemes DS1, DS2, ..., DSN, all defined within the DDL. Notes: (1) Descriptors (Ds) are the features and attributes associated with an audio waveform. (2) The structure and the relationships among the descriptors are defined by a description scheme (DS). (3) The description definition language (DDL) defines the syntax necessary to create, extend, and combine a variety of DSs and Ds.]

MPEG-7 Description Definition Language (DDL) – Part II. We mentioned earlier that MPEG-7 incorporates multimedia interfaces through descriptors. These descriptors are the features and attributes associated with the audio. For example, descriptors in the case of the MPEG-7 Visual part describe visual features such as color, resolution, contour mapping techniques, etc. A group of descriptors related in a manner suitable for a specific application forms a description scheme (DS). The standard [ISOI01e] defines the description scheme as one that specifies a structure for the descriptors and the semantics of their relationships.

MPEG-7 in its entirety has been built around these descriptors (Ds) and description schemes (DSs) and, most importantly, on a language called the description definition language (DDL). The DDL defines the syntax necessary to create, extend, and combine a variety of DSs and Ds. In particular, the DDL forms "the core part" of the MPEG-7 standard and is also invoked by other parts (i.e., Visual, Audio, and MDS) to create new Ds and DSs. The DDL follows a set of programming rules and structure similar to the ones employed in the eXtensible Markup Language (XML). It is important to note that the DDL is not a modeling language but a schema language that is based on the WWW Consortium's XML Schema [XML] [ISOI01e]. Several modifications were needed before adopting the XML Schema language as the basis for the DDL. We refer to [XML] [ISOI01a] [ISOI01e] for further details on the XML Schema language and its liaison with the MPEG-7 DDL [ISOI01a].

MPEG-7 Multimedia Description Schemes (MDS) – Part V. Recall that a description scheme (DS) specifies structures for descriptors; similarly, a multimedia


description scheme (MDS) [ISOI01c] provides details on the structures for describing multimedia content (in particular, audio, visual, and textual data). MPEG-7 MDS defines two classes of description tools, namely, the basic (or low-level) and the multimedia (or high-level) tools [ISOI01c]. Figure 10.22 shows the classification of MDS elements. The basic tools specified by the MPEG-7 MDS are the generic entities usually associated with simple descriptors, such as the basic data types, textual database, etc. On the other hand, the high-level multimedia tools deal with content-specific entities that are complex and involve signal structures, semantics, models, efficient navigation, and access. The high-level (complex) tools are further subdivided into five groups (Figure 10.22): content description, content management, content organization, navigation and access, and user interaction.

Let us consider an example to better understand the concepts of the DDL and the MDS framework. Suppose that an audio signal s(n) is described using three descriptors, namely, spectral features (D1), parametric models (D2), and energy (D3). Similarly, visual, v(i, j), and textual content can also be described, as shown in Table 10.5. We arbitrarily chose four description schemes (DS1 through DS4) that link these multimedia features (audio, visual, and textual) in a structured manner. This linking mechanism is performed through the DDL, a schema language designed specifically for MPEG-7. From Table 10.5, the descriptors D2, D8, and D9

are related using the description scheme DS2. The melody descriptor D8 provides the melodic information (e.g., rhythm, high pitch, etc.), and the timbre descriptor D9 represents some perceptual features (e.g., pitch/loudness details, bass/treble adjustments in audio, etc.). The parametric model descriptor D2 describes the audio encoding model and related encoder details (e.g., MPEG-1 Layer III, sampling rates, delay, bit rates, etc.). While the descriptor D2 provides details on the encoding procedure, the descriptors D8 and D9 describe audio morphing, echo/reverberation, tone control, etc.

MPEG-7 Audio – Part IV. MPEG-7 Audio represents Part IV of the MPEG-7 standard and provides structures for describing the audio content. Figure 10.23 shows the organization of the MPEG-7 audio framework.

[Figure 10.22: Classification of multimedia description scheme (MDS) tools. The MPEG-7 MDS comprises basic tools and concept-specific tools; the concept-specific tools are grouped into content description, content management, content organization, navigation and access, and user interaction.]


Table 10.5. A hypothetical example that gives a broader perspective on multimedia descriptors, i.e., audio, visual, and textual features used to describe multimedia content.

  Group                    Descriptors                   Description schemes
  Audio content s(n)       D1: Spectral features         DS1: {D1, D3}
                           D2: Parametric models         DS2: {D2, D8, D9}
                           D3: Energy of the signal      DS3: {DS2, D4, D5}
  Visual content v(i, j)   D4: Color                     DS4: {DS1, D6, D7}
                           D5: Shape
  Textual descriptions     D6: Title of the clip
                           D7: Author information
                           D8: Melody details
                           D9: Timbre details

[Figure 10.23: MPEG-7 audio description tools. The framework comprises low-level or generic tools (1. audio framework, 2. silence segment) and high-level (application-specific) description tools (1. spoken content, 2. musical instrument timbre, 3. melody, 4. sound recognition and indexing, 5. audio signature/robust matching).]

10.4.5.2 MPEG-7 Audio Versions and Profiles. New extensions (Amendment 1) to the existing MPEG-7 Audio are being considered. Some of the extensions are in the areas of application-specific spoken content, tempo description, and the specification of precision for low-level data types. This new amendment will be standardized as MPEG-7 Audio Version 2 (the Final Draft of International Standard (FDIS) for Version 2 was finalized in March 2003).

Although many description tools are available in MPEG-7 audio, it is not practical to implement all of them in a particular system. MPEG-7 Version 1, therefore, defines four complexity-ranked profiles (Figure 10.20), intended to help system designers in the task of tool subset selection. These include the simple profile, the user description profile, the summary profile, and the audiovisual logging profile.


10.4.5.3 MPEG-7 Audio Description Tools. The MPEG-7 audio framework comprises two main categories, namely, generic tools and a set of application-specific tools (see Figure 10.20 and Figure 10.23).

10.4.5.3.1 Generic Tools. The generic toolset consists of 17 low-level audio descriptors and a silence segment descriptor (Table 10.6).

MPEG-7 Audio Low-Level Descriptors. MPEG-7 audio [ISOI01b] defines two ways of representing the low-level audio features, i.e., segmenting and sampling. In segmentation, usually common datatypes or scalars are grouped together (e.g., energy, power, bit rate, sampling rate, etc.). On the other hand, sampling enables discretization of audio features in a vector form (e.g., spectral features, excitation samples, etc.). Recently, a unified framework called the scalable series [Lind99] [Lind00] [ISOI01b] [Lind01] has been proposed to manipulate these discretized values. This is somewhat similar to the MPEG-4 scalable audio coding that we discussed in Section 10.4.4. A list of the low-level audio descriptors defined by the MPEG-7 audio standard [ISOI01b] is summarized in Table 10.6. These

Table 10.6. Low-level audio descriptors (17 in number) and the silence descriptor supported by the MPEG-7 generic toolset [ISOI01b].

  Generic toolset                     Descriptors
  Low-level audio descriptors group
    1. Basic                          D1: Audio waveform; D2: Power
    2. Basic spectral                 D3: Spectrum envelope; D4: Spectral centroid;
                                      D5: Spectral spread; D6: Spectral flatness
    3. Signal parameters              D7: Harmonicity; D8: Fundamental frequency
    4. Spectral basis                 D9: Spectrum basis; D10: Spectrum projection
    5. Timbral spectral               D11: Harmonic spectral centroid; D12: Harmonic spectral
                                      deviation; D13: Harmonic spectral spread;
                                      D14: Harmonic spectral variation; D15: Spectral centroid
    6. Timbral temporal               D16: Log attack time; D17: Temporal centroid
  Silence
    7. Silence segment                D18: Silence descriptor


descriptors can be classified into the following groups: basic, basic spectral, signal parameters, spectral basis, timbral spectral, and timbral temporal.
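As an illustration of two of the basic spectral descriptors, the sketch below computes a spectral centroid and a spectral flatness measure from a power spectrum. These are simplified textbook formulas; the normative MPEG-7 definitions use specific band partitions, normalizations, and scalable-series packaging:

```python
import numpy as np

def spectral_centroid(power, freqs):
    """Power-weighted mean frequency of the spectrum (Hz)."""
    return np.sum(freqs * power) / np.sum(power)

def spectral_flatness(power, eps=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum:
    close to 1 for noise-like spectra, close to 0 for tonal spectra."""
    p = power + eps
    return np.exp(np.mean(np.log(p))) / np.mean(p)

fs = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(1024) / fs)     # a 1-kHz tone
power = np.abs(np.fft.rfft(x)) ** 2
freqs = np.fft.rfftfreq(1024, d=1.0 / fs)
print(spectral_centroid(power, freqs), spectral_flatness(power))
```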

MPEG-7 Silence Segment. The MPEG-7 silence segment attaches a semantic of silence to an audio segment. The silence descriptor provides ways to specify threshold levels (e.g., the level of silence).

10.4.5.3.2 High-Level or Application-Specific MPEG-7 Audio Tools. Besides the aforementioned generic toolset, the MPEG-7 audio standard describes five specialized high-level tools (Table 10.7). These application-specific description tools can be grouped as spoken content, musical instrument timbre, melody, sound recognition/indexing, and robust audio matching.

Spoken Content Description Tool (SC-DT). The SC-DT provides descriptions of spoken words in an audio clip, thereby enabling speech recognition and speech parameter indexing/searching. The spoken content lattice and the spoken content header are the two important parts of the SC-DT (see Table 10.7). While the SC header carries the lexical information (i.e., the WordLexicon, PhoneLexicon, ConfusionInfo, and SpeakerInfo descriptors), the SC-lattice DS represents lattice structures that connect words or phonemes chosen from the corresponding lexicon. The idea of using lattice structures in the SC-lattice DS is similar to the one employed in a typical continuous automatic speech recognition scenario [Rabi89] [Rabi93].

Musical Instrument Timbre Description Tool (MIT-DT). The MIT-DT describes the timbre features (i.e., perceptual attributes) of sounds from musical instruments. Timbre can be defined as the collection of perceptual attributes that make two

Table 10.7. Application-specific audio descriptors and description schemes [ISOI01b].

  High-level descriptor toolset                     Descriptor details
  SC-DT   1. SC-header                              D1: Word lexicon; D2: Phone lexicon;
                                                    D3: Confusion info; D4: Speaker info
          2. SC-lattice DS                          Provides structures to connect or link
                                                    the words/phonemes in the lexicon
  MIT-DT  3. Timbre (perceptual) features of        D1: Harmonic instrument timbre;
             musical instruments                    D2: Percussive instrument timbre
  M-DT    4. Melody features                        DS1: Melody contour; DS2: Melody sequence
  SRI-DT  5. Sound recognition and indexing         D1: Sound model state path;
             application                            D2: Sound model state histogram;
                                                    DS1: Sound model; DS2: Sound classification model
  AS-DT   6. Robust audio identification            DS1: Audio signature DS


audio clips having the same pitch and loudness sound different [ISOI01b]. Musical instrument sounds, in general, can be classified as harmonic-coherent-sustained, percussive-nonsustained, nonharmonic-coherent-sustained, and noncoherent-sustained. The standard defines descriptors for the first two classes of musical sounds (Table 10.7). In particular, the MIT-DT defines two descriptors, namely, the harmonic instrument timbre (HIT) descriptor and the percussive instrument timbre (PIT) descriptor. The HIT descriptor is built on the four harmonic low-level descriptors (i.e., D11 through D14 in Table 10.6) and the log attack time descriptor. On the other hand, the PIT descriptor is based on the combination of the timbral temporal low-level descriptors (i.e., log attack time and temporal centroid) and the spectral centroid descriptor.

Melody Description Tool (M-DT). The M-DT represents the melody details of an audio clip. The MelodyContour DS and the MelodySequence DS are the two schemes included in the M-DT. While the former scheme enables a simple and robust melody contour representation, the latter approach involves detailed and expanded melody/rhythmic information.

Sound Recognition and Indexing Description Tool (SRI-DT). The SRI-DT addresses automatic sound identification/recognition and indexing. Recall that the SC-DT employs lexicon descriptors (Table 10.7) for spoken content recognition in an audio clip. In the case of SRI, classification/indexing of sound tracks is achieved through sound models. These models are constructed based on the spectral basis low-level descriptors, i.e., spectrum basis (D9) and spectrum projection (D10) listed in Table 10.6. Two descriptors, namely, the sound model state path descriptor and the sound model state histogram descriptor, are defined to keep track of the active paths in a trellis.

Robust Audio Identification and Matching. Robust matching and identification of audio clips is one of the important applications of the MPEG-7 audio standard [ISOI01d]. This feature is enabled by the low-level spectral flatness descriptor (Table 10.6). A description scheme, namely, the Audio Signature DS, defines the semantics and structures for the spectral flatness descriptor. Hellmuth et al. [Hell01] proposed an advanced audio identification procedure based on content descriptions.

10.4.5.4 MPEG-7 Audio Applications. Being the first metadata standard, MPEG-7 audio provides new ideas for audio-content indexing and archiving [ISOI01d]. Some of the applications are in the areas of multimedia searching; audio file indexing, sharing, and retrieval; and media selection for digital audio broadcasting (DAB). We discussed most of these applications while addressing the high-level audio descriptors and description schemes; a summary of these applications follows. Unlike in an automatic speech recognition scenario, where word or phoneme lattices (based on feature vectors) are employed for identifying speech, in MPEG-7 these lattice structures are denoted as Ds and DSs. These description data enable spoken content retrieval. MPEG-7 audio version 2 includes new tools and specialized enhancements to spoken content search. Musical instrument timbre search is another important application that targets content-based editing. Melody search enables query by humming [Quac01]. Sound recognition/indexing and audio identification/fingerprinting form two other important


applications of MPEG-7. We will next address the concepts of "interoperability" and "universal multimedia access" (UMA) in the context of the new work initiated by the ISO/IEC MPEG workgroup in June 2000, called the Multimedia Framework – MPEG-21 [Borm03].

10.4.6 MPEG-21 Framework (ISO/IEC-21000)

Motivated by the need for a standard that enables multimedia content access and distribution, the ISO/IEC MPEG workgroup addressed the 21st Century Multimedia Framework – MPEG-21, ISO/IEC 21000 [Spen01] [ISOI02a] [ISOI03a] [ISOI03b] [Borm03] [Burn03]. This multimedia standard should be interoperable and highly automated [Borm03]. The MPEG-21 multimedia framework envisions creating a platform that encompasses a great deal of functionality for both content users and content creators/providers. Some of these functions include multimedia resource delivery to a wide range of networks and terminals (e.g., personal computers (PCs), PDAs and other digital assistants, mobile phones, third-generation cellular networks, digital audio/video broadcasting (DAB/DVB), HDTVs, and several other home entertainment systems) and the protection of intellectual property rights through digital rights management (DRM) systems.

Content creators and service providers face several challenging tasks in order to satisfy simultaneously the conflicting demands of "interoperability" and "intellectual property management and protection" (IPMP). To this end, MPEG-21 defines a multimedia framework that comprises seven important parts [ISOI02a], as shown in Table 10.8. Recall that the MPEG-7 ISO/IEC 15938 standard defines a fundamental unit called "descriptors" (Ds) to define/declare the features and attributes of multimedia content. In a manner analogous to this, MPEG-21 ISO/IEC 21000 Part 1 defines a basic unit called the "digital item" (DI). Besides the DI, MPEG-21 specifies another entity called "user" interaction [ISOI02a] [Burn03] that provides details on how each "user" interacts with other users via objects called "digital items." Furthermore, MPEG-21 Parts 2 and 3 define the declaration and identification of the DIs, respectively (see Table 10.8). MPEG-21 ISO/IEC 21000 Parts 4 through 6 enable interoperable digital content distribution and transactions that take into account the IPMP requirements. In particular, a machine-readable language called the Rights Expression Language (REL) is specified in MPEG-21 ISO/IEC 21000 Part 5, which defines the rights and permissions for the access and distribution of multimedia resources across a variety of heterogeneous terminals and networks. MPEG-21 ISO/IEC 21000 Part 6 defines a dictionary called the Rights Data Dictionary (RDD) that contains information on content protection and rights.

The MPEG-7 and MPEG-21 standards provide an open framework on which one can build application-oriented interfaces or tools that satisfy a specific criterion (e.g., a query, an audio file indexing scheme, etc.). In particular, the MPEG-7 standard provides an interface for the indexing, accessing, and distribution of multimedia content, and MPEG-21 defines an interoperable framework to access that multimedia content.


Table 10.8. MPEG-21 multimedia framework and the associated parts [ISOI02a].

  Parts in the MPEG-21 ISO/IEC 21000 standard   Details
  Part 1: Vision, technologies, and strategy    Defines the vision, requirements, and applications of the
                                                standard and provides an overview of the multimedia
                                                framework. Introduces two new terms, i.e., digital item
                                                (DI) and user interaction.
  Part 2: Digital item declaration              Defines the relationship between a variety of multimedia
                                                resources and provides information regarding the
                                                declaration of DIs.
  Part 3: Digital item identification           Provides ways to identify different types of digital items
                                                (DIs) and descriptors/description schemes (Ds/DSs) via
                                                uniform resource identifiers (URIs).
  Part 4: IPMP                                  Defines a framework for intellectual property management
                                                and protection (IPMP) that enables interoperability.
  Part 5: Rights expression language            A syntax language that enables multimedia content
                                                distribution in a way that protects the digital content.
                                                The rights and the permissions are expressed or declared
                                                based on the terms defined in the rights data dictionary.
  Part 6: Rights data dictionary                A database or dictionary that contains the information
                                                regarding the rights and permissions to protect the
                                                digital content.
  Part 7: Digital item adaptation               Defines the concept of an adapted digital item.

Until now, our focus has been primarily on the ISO/IEC MPEG audio standards. In the next few sections, we will attend to company-oriented perceptual audio coding algorithms, i.e., the Sony Adaptive Transform Acoustic Coding (ATRAC), the Lucent Technologies Perceptual Audio Coder (PAC), the Enhanced PAC (EPAC), the Multichannel PAC (MPAC), the Dolby Laboratories AC-2/AC-2A/AC-3, the Audio Processing Technology APT-x100, and the Digital Theater Systems (DTS) Coherent Acoustics (CA) encoder algorithms.

10.4.7 MPEG Surround and Spatial Audio Coding

MPEG spatial audio coding began receiving attention during the early 2000s [Fall01] [Davis03]. Advances in the joint stereo coding of multichannel signals [Herr04b], binaural cue coding [Fall01], and the success of the recent low-complexity parametric stereo encoding in the MPEG-4 HE-AAC/PS standard [Schu04] generated interest in MPEG surround and spatial audio coding [Herr04a] [Bree05]. Unlike the discrete 5.1-channel encoding used in Dolby Digital or DTS, MPEG spatial audio coding captures the "spatial image" of a multichannel audio signal. The spatial image is represented using a compact set of parameters that describe the perceptually relevant differences among the channels. Typical parameters include the interchannel level difference (ICLD), the interchannel coherence (ICC), and the interchannel time difference (ICTD). The multichannel signal is first downmixed to a stereo signal, and then a conventional MP3 coder is used. Spatial image parameters are computed using the binaural cue coding (BCC) technique and are transmitted to the decoder as side information [Herr04a]. At the decoder, one-to-two (OTT) or two-to-three (TTT) channel mappings are used to synthesize the multichannel surround sound.
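A sketch of the BCC-style analysis for a single channel pair follows: downmix the pair and extract the level, coherence, and time-difference cues. This is an illustration only; the actual MPEG Surround parameter bands, quantizers, and OTT/TTT boxes are defined in the standard:

```python
import numpy as np

def bcc_pair_analysis(ch1, ch2, eps=1e-12):
    """Downmix one channel pair and extract simple spatial cues: ICLD (dB),
    ICC (normalized cross-correlation peak), and ICTD (lag of that peak)."""
    downmix = 0.5 * (ch1 + ch2)
    e1, e2 = np.sum(ch1**2), np.sum(ch2**2)
    icld_db = 10.0 * np.log10((e1 + eps) / (e2 + eps))
    xcorr = np.correlate(ch1, ch2, mode="full")
    icc = np.max(np.abs(xcorr)) / (np.sqrt(e1 * e2) + eps)
    ictd = int(np.argmax(np.abs(xcorr))) - (len(ch2) - 1)
    return downmix, {"ICLD_dB": icld_db, "ICC": icc, "ICTD_samples": ictd}

left = np.random.randn(512)
right = 0.7 * np.roll(left, 8)   # attenuated, delayed copy of the left channel
mix, cues = bcc_pair_analysis(left, right)
print(cues)
```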

10.5 ADAPTIVE TRANSFORM ACOUSTIC CODING (ATRAC)

The ATRAC algorithm, developed by Sony for use in its rewriteable MiniDisc system [Yosh94], combines subband and transform coding to achieve nearly CD-quality coding of 44.1-kHz, 16-bit PCM input data at a bit rate of 146 kb/s per


[Figure 10.24: Sony ATRAC (embedded in MiniDisc, SDDS). The input s(n) is split by a two-stage QMF analysis bank into three subbands (0–5.5 kHz, 5.5–11 kHz, and 11–22 kHz); each subband is transformed by a window-switched MDCT (32/128-point for the low and mid bands, 32/256-point for the high band) and then bit-allocated and quantized, producing the per-band allocations RL(i), RM(i), RH(i) and the quantized spectra SL(k), SM(k), SH(k).]

channel [Tsut98]. Using a tree-structured QMF analysis bank (Section 6.5), the ATRAC encoder (Figure 10.24) first splits the input signal into three subbands of 0–5.5 kHz, 5.5–11 kHz, and 11–22 kHz. Like MPEG-1 Layer III, the ATRAC QMF bank is followed by signal-adaptive MDCT analysis in each subband. Next, a window-switching scheme is employed that can be summarized as follows. During steady-state input periods, high-resolution spectral analysis is attained using 512-sample blocks (11.6 ms). During input attack or transient periods, short block sizes of 1.45 ms in the high-frequency band and 2.9 ms in the low- and mid-frequency bands are used for pre-echo cancellation.

After MDCT analysis, spectral components are clustered into 52 nonuniform subbands (block floating units, or BFUs) according to critical band spacing. The BFUs are block-companded, quantized, and encoded according to a psychoacoustically derived bit allocation. For each analysis frame, the ATRAC encoder transmits quantized MDCT coefficients, subband window lengths, BFU scale factors, and BFU word lengths to the decoder. Like the MPEG family, the ATRAC architecture decouples the decoder from psychoacoustic analysis and bit-allocation details. Evolutionary improvements in the encoder bit-allocation strategy are therefore possible without modifying the decoder structure. An added benefit of this architecture is asymmetric complexity, which enables inexpensive decoder implementations.

Suggested bit-allocation techniques for ATRAC are of lower complexity than those found in other standards, since ATRAC is intended for low-cost, battery-powered devices. One proposed method distributes bits between BFUs according to a weighted combination of fixed and adaptive bit allocations [Tsut96]. For the k-th BFU, bits are allocated according to the relation

r(k) = α ra(k) + (1 − α) rf(k) − β,    (10.10)

where rf(k) is a fixed allocation, ra(k) is a signal-adaptive allocation, the parameter β is a constant offset computed to guarantee a fixed bit rate, and the parameter α is a tonality estimate ranging from 0 (noise-like) to 1 (tone-like). The fixed


allocations rf(k) are the same for all inputs and concentrate more bits at the lower frequencies. The signal-adaptive bit allocations ra(k) assign bits according to the strength of the MDCT components. The effect of Eq. (10.10) is that more bits are allocated to BFUs containing strong peaks for tonal signals. For noise-like signals, bits are allocated according to a fixed allocation rule, with low bands receiving more bits than high bands.
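A sketch of Eq. (10.10) in code follows. The fixed and adaptive allocation shapes, BFU energies, and bit budget are illustrative; in practice β would be chosen so that the rounded grants sum exactly to the fixed frame budget:

```python
import numpy as np

def atrac_allocation(bfu_energy_db, alpha, total_bits):
    """Weighted combination of a fixed and a signal-adaptive allocation:
    r(k) = alpha*ra(k) + (1 - alpha)*rf(k) - beta   (Eq. 10.10)."""
    n = len(bfu_energy_db)
    rf = np.linspace(8.0, 2.0, n)                      # fixed: favors low-frequency BFUs
    ra = 8.0 * bfu_energy_db / np.max(bfu_energy_db)   # adaptive: follows BFU strength
    r = alpha * ra + (1.0 - alpha) * rf
    beta = (np.sum(r) - total_bits) / n                # offset enforcing the fixed rate
    return np.maximum(np.round(r - beta), 0)

energies = np.array([40.0, 55.0, 30.0, 20.0, 10.0])    # illustrative BFU energies (dB)
print(atrac_allocation(energies, alpha=0.9, total_bits=25))   # tone-like input
print(atrac_allocation(energies, alpha=0.1, total_bits=25))   # noise-like input
```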

Sony Dynamic Digital Sound (SDDS). In addition to providing near-CD quality on a MiniDisc medium, the ATRAC algorithm has also been deployed as the core of Sony's digital cinematic sound system, SDDS. SDDS integrates eight independent ATRAC modules to carry the program information for the left (L), left center (LC), center (C), right center (RC), right (R), subwoofer (SW), left surround (LS), and right surround (RS) channels typically present in a modern theater. SDDS data is recorded using optical black-and-white dot-matrix technology onto two thin strips along the right and left edges of the film, outside of the sprocket holes. Each edge contains four channels. There are 512 ATRAC bits per channel associated with each movie frame, and each optical data frame contains a matrix of 52 × 192 bits [Yama98]. SDDS data tracks do not interfere with or replace the existing analog sound tracks. Both Reed-Solomon error correction and redundant track information (delayed by 18 frames) are employed to make SDDS robust to bit errors introduced by run-length scratches, dust, splice points, and defocusing during playback or film printing. Analog program information is used as a backup in the event of uncorrectable digital errors.

10.6 LUCENT TECHNOLOGIES PAC, EPAC, AND MPAC

The pioneering research contributions on perceptual entropy [John88b], monophonic PXFM [John88a], stereophonic PXFM [John92a], and ASPEC [Bran91] strongly influenced not only the MPEG family architecture but also evolved at AT&T Bell Laboratories into the Perceptual Audio Coder (PAC). The PAC algorithm eventually became the property of Lucent; AT&T, meanwhile, became active in the MPEG-2 AAC research and standardization, and the low-complexity profile of AAC became the AT&T coding standard.

Like the MPEG coders, the Lucent PAC algorithm is flexible in that it supports monophonic, stereophonic, and multiple-channel modes. In fact, the bitstream definition will accommodate up to 16 front/side, 7 surround, and 7 auxiliary channel pairs, as well as 3 low-frequency effects (LFE, or subwoofer) channels. Depending upon the desired quality, PAC supports several bit rates. For a modest increase in complexity at a particular bit rate, improved output quality can be realized by enabling enhancements to the original system. For example, whereas 96 kb/s output was judged to be adequate with stereophonic PAC, near-CD quality was reported at 56–64 kb/s for stereophonic enhanced PAC [Sinh98a].

10.6.1 Perceptual Audio Coder (PAC)

The original PAC system described in [John96c] achieves very-high-quality coding of stereophonic inputs at 96 kb/s. Like MPEG-1 Layer III and ATRAC,


the PAC encoder (Figure 10.25) uses a signal-adaptive MDCT filter bank to analyze the input spectrum with appropriate frequency resolution. A long window of 2048 points (1024 subbands) is used during steady-state segments, or else a series of short 256-point windows (128 subbands) is applied for segments containing transients or sharp attacks. In contrast to MPEG-1 and ATRAC, however, PAC relies on the MDCT alone rather than a hybrid filter-bank structure, thus realizing a complexity reduction. As noted previously [Bran88a] [Mahi90], the MDCT lends itself to compact representation of stationary signals, and a 2048-point block size yields sufficiently high frequency resolution for most sources. This segment length was also associated with the maximum realizable coding gain as a function of block size [Sinh96]. Filter-bank resolution switching decisions are made on the basis of PE (high complexity) or signal energy (low complexity) criteria.

The PAC perceptual model derives noise masking thresholds from filter-bank output samples in a manner similar to MPEG-1 psychoacoustic model recommendation 2 [ISOI92] and the PE calculation in [John88b]. The PAC model, however, accounts explicitly for both simultaneous and temporal masking effects. Samples are grouped into 13 critical band partitions, tonality is estimated in each band, and then time and frequency spreading functions are used to compute a masking threshold that can be related to the filter-bank outputs. One can observe that PAC realizes some complexity reduction relative to MPEG by avoiding parallel frequency analysis structures for quantization and perceptual modeling. The masking thresholds are used to select one of 128 exponentially distributed quantization step sizes in each of 49 or 14 coder bands (analogous to ATRAC BFUs) in high-resolution and low-resolution modes, respectively. The coder bands are quantized using an iterative rate-control loop in which thresholds are adjusted to satisfy simultaneously the bit-rate constraints and an equal-loudness criterion that attempts to shape quantization noise such that its absolute loudness is constant relative to the masking threshold. The rate-control loop allows time-varying instantaneous bit rates, so that additional bits are available in times of peak demand, much like the bit reservoir of MPEG-1 Layer III. Remaining statistical redundancies are removed from the stream of quantized spectral samples prior to bitstream formatting using eight structured, multidimensional Huffman codebooks. These codebooks are applied to DPCM-encoded quantizer outputs. By clustering coder bands into sections and selecting only one codebook per section, the system minimizes the overhead.

[Figure 10.25: Lucent Technologies Perceptual Audio Coder (PAC). The input s(n) is transformed by a 256/2048-point MDCT; a perceptual model drives the quantization, which is followed by Huffman coding and bitstream formatting.]


[Figure 10.26: Lucent Technologies Enhanced Perceptual Audio Coder (EPAC). A transient/steady-state switch selects between a 2048-point MDCT (steady state, SS) and a wavelet filter bank (transients, TR); the selected filter-bank output is quantized under a perceptual model and Huffman coded into the bitstream.]

10.6.2 Enhanced PAC (EPAC)

In an effort to enhance PAC output quality at low bit rates, Sinha and Johnston [Sinh96] introduced a novel signal-adaptive MDCT/WP1 switched filter-bank scheme that resulted in nearly transparent coding for CD-quality source material at 64 kb/s per stereo pair. EPAC (Figure 10.26) is unique in that it switches between two distinct filter banks, rather than relying upon hybrid [Tsut98] [ISOI92] or nonuniform cascade [Prin95] structures.

A 2048-point MDCT decomposition is applied normally, during "stationary" periods. EPAC switches to a tree-structured wavelet packet (WP) decomposition matched to the auditory filter bank during sharp transients. Switching decisions occur every 25 ms, as in PAC, using either PE or energy criteria. The WP analysis offers the dual advantages of more compact signal representation during transient periods than the MDCT, as well as improved time resolution at high frequencies for accurate estimation of the time-frequency distribution of masking power contained in sharp attacks. In contrast to the uniform time-frequency tiling associated with MDCT window-length switching schemes (e.g., [ISOI92] [Bran94a]), the EPAC WP transform (tree-structured QMF bank) achieves a nonuniform time-frequency tiling. For a suitably designed analysis wavelet and tree structure, an improvement in time resolution is restricted to the high-frequency regions of interest, while good frequency resolution is maintained in the low-frequency subbands. The EPAC WP filter bank was specifically designed for time-localized impulse responses at high frequencies to minimize the temporal spread of quantization noise (pre-echo). Novel start and stop windows are inserted between analysis frames during switching intervals to mitigate boundary effects associated with the MDCT-to-WP and WP-to-MDCT transitions. Other than the enhanced filter bank, EPAC is identical to PAC. In subjective tests involving 12 expert and nonexpert listeners with difficult castanets and triangle test signals, EPAC outperformed PAC at 64 kb/s per stereo pair by an average of 0.4–0.6 on a five-point quality scale.

10.6.3 Multichannel PAC (MPAC)

Like the MPEG, AC-3, and SDDS systems, the PAC algorithm also extends its monophonic processing capabilities into stereophonic and multiple-channel

[Footnote 1: See Chapter 8, Sections 8.2 and 8.3, for descriptions of wavelet filter banks and WP transforms.]


modes. Stereophonic PAC computes individual masking thresholds for the left, right, mono, and stereo (L, R, M = L + R, and S = L − R) signals using a version of the monophonic perceptual model that has been modified to account for binaural-level masking differences (BLMD), or binaural unmasking effects [Moor77]. Then, monaural PAC methods encode either the signal pairs (L, R) or (M, S). In order to minimize the overall bit rate, however, an LR/MS switching procedure is embedded in the rate-control loop, such that different encoding modes (LR or MS) can be applied to the individual coder bands on the same analysis frame.

In the MPAC 5-channel configuration, composite coding modes are available for the front/side left, center, right, and the left and right surround (L, C, R, Ls, and Rs) channels. On each frame, the composite algorithm works as follows. First, the appropriate window-switched filter-bank frequency resolution is determined separately for the front/side and surround channels. Next, the four signal pairs LR, MS, LsRs, and MsSs (Ms = Ls + Rs, Ss = Ls − Rs) are generated. The MPAC perceptual model then computes individual BLMD-compensated masking thresholds for the eight LR and MS signals, as well as for the center channel, C. Once thresholds have been obtained, a two-step coding process is applied. In step 1, a minimum PE criterion is first used to select either MS or LR coding for the front/side and surround channel groups in each coder band. Then, step 2 applies interchannel prediction to the quantized spectral samples. The prediction residuals are quantized such that the final quantization noise satisfies the original masking thresholds for each channel (LR or MS). The interchannel prediction schemes are summarized in [Sinh98a]. In pursuit of a minimum bit rate, the composite coding algorithm may elect to utilize either step 1 or step 2, both steps, or neither. Finally, the composite perceptual model computes a global masking threshold as the maximum of the five individual thresholds minus a safety margin. This threshold is phased in gradually for joint coding when the bit reservoir drops below 20% [Sinh98a]. The safety margin magnitude depends upon the bit reservoir state. Composite modes are applied separately for each coder band on each analysis frame. In terms of performance, the MPAC system was found to produce the best quality at 320 kb/s for 5 channels during a recent ISO test of multichannel algorithms [ISOII94].
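The per-band LR/MS decision in step 1 can be sketched as follows; the perceptual-entropy estimates below are placeholder numbers, and the real MPAC criterion uses the BLMD-compensated thresholds described above:

```python
def select_coding_modes(pe_l, pe_r, pe_m, pe_s):
    """For each coder band, choose LR or MS coding, whichever is estimated to
    require fewer bits (minimum perceptual entropy, PE)."""
    return ["MS" if (m + s) < (l + r) else "LR"
            for l, r, m, s in zip(pe_l, pe_r, pe_m, pe_s)]

# Illustrative per-band PE estimates (bits) for four coder bands.
pe_left  = [120, 80, 60, 40]
pe_right = [118, 82, 30, 42]
pe_mid   = [150, 90, 55, 20]
pe_side  = [ 20, 60, 50, 15]
print(select_coding_modes(pe_left, pe_right, pe_mid, pe_side))  # ['MS', 'MS', 'LR', 'MS']
```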

Applications. Both 128- and 160-kb/s stereophonic versions of PAC were considered for standardization in the U.S. Digital Audio Radio (DAR) project. In an effort to provide graceful degradation and extend broadcast range in the presence of the heavy fading associated with fringe reception areas, perceptually motivated unequal error protection (UEP) channel coding schemes were examined in [Sinh98b]. The proposed scheme ranks bitstream elements into two classes of perceptual importance. Bitstream parameters associated with center-channel information and certain mid-frequency subbands are given greater channel protection (class 1) than other parameters (class 2). Subjective tests revealed a strong preference for UEP over equal error protection (EEP), particularly when bit error rates (BER) exceeded 2 × 10−4. For network applications, acceptable PAC output quality at bit rates as low as 12–16 kb/s per channel, in conjunction with the availability of Java PAC decoder implementations, is


reportedly increasing PAC deployment among suppliers of Internet audio program material [Sinh98a]. MPAC has also been considered for cinematic and advanced television applications. Real-time PAC and EPAC decoder implementations have been demonstrated on 486-class PC platforms.

10.7 DOLBY AUDIO CODING STANDARDS

Since the late 1980s, Dolby Laboratories has been active in perceptual audio coding research and standardization, and Dolby researchers have made numerous scientific contributions within the collaborative framework of MPEG audio. On the commercial front, Dolby has developed the AC-2 and the AC-3 algorithms [Fiel91] [Fiel96].

10.7.1 Dolby AC-2, AC-2A

The AC-2 [Davi90] [Fiel91] is a family of single-channel algorithms operating at bit rates between 128 and 192 kb/s for 20-kHz-bandwidth input sampled at 44.1 or 48 kHz. There are four available AC-2 variants, all of which share an architecture in which the input is mapped to the frequency domain by an evenly stacked TDAC filter bank [Prin86] with a novel parametric Kaiser-Bessel analysis window (Section 6.7) optimized for improved stop-band attenuation relative to the sine window. The evenly stacked TDAC differs from the oddly stacked MDCT in that the evenly stacked low-band filter is half-band and its magnitude response wraps around the fold-over frequency (see Chapter 6). A unique mantissa-exponent coding scheme is applied to the TDAC transform coefficients. First, sets of frequency-adjacent coefficients are grouped into blocks (subbands) of roughly critical bandwidth. For each block, the maximum is identified and then quantized as an exponent in terms of the number of left shifts required until overflow occurs. The collection of exponents forms a stair-step spectral envelope having 6-dB (left shift = multiply by 2 ≈ 6.02 dB) resolution, and normalizing the transform coefficients by the envelope generates a set of mantissas. The envelope approximates the short-time spectrum, and therefore a perceptual model uses the exponents to compute both a fixed and a signal-adaptive bit allocation for the mantissas on each frame.
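This exponent/mantissa scheme is a form of block floating point. A sketch of the encoder-side step for one block of transform coefficients follows (the coefficients are assumed to be fractions in [-1, 1); the exponent range and mantissa word length used here are illustrative, not the AC-2 values):

```python
import numpy as np

def block_float(coeffs, mantissa_bits=6, max_shifts=15):
    """Block floating point: the exponent is the number of left shifts
    (doublings, about 6.02 dB each) that can be applied before the block
    maximum overflows; the normalized coefficients are the mantissas."""
    peak = np.max(np.abs(coeffs))
    exponent = 0
    while exponent < max_shifts and peak * 2.0 ** (exponent + 1) < 1.0:
        exponent += 1
    mantissas = coeffs * 2.0 ** exponent                    # normalize by the envelope
    step = 2.0 ** (mantissa_bits - 1)
    return exponent, np.round(mantissas * step) / step      # quantized mantissas

block = np.array([0.011, -0.004, 0.020, 0.008])             # illustrative TDAC coefficients
exp, mant = block_float(block)
print(exp, mant)   # the spectral-envelope step for this block is exp * 6.02 dB
```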

As far as details on the four AC-2 variants are concerned, two versions are designed for low-complexity, low-delay applications, and the other two for higher quality at the expense of increased delay or complexity. In version 1, a 128-sample (64-channel) filter bank is used, and the coder operates at 192 kb/s per channel, resulting in high-quality output with only 7 ms of delay at 48 kHz. Version 2 is also intended for low-delay applications with improved quality at the same bit rate; it uses the same filter bank but exploits time redundancy across block pairs, thus increasing the delay to 12 ms. Version 3 achieves similar quality at the reduced rate of 128 kb/s per channel, at the expense of longer delay (45 ms), by using a 512-sample (256-channel) filter bank to improve the steady-state coding gain. Finally, version 4 (the AC-2A [Davi92] algorithm) employs a switched 128/512-point TDAC filter bank to improve quality for transient signals while maintaining


high coding gain for stationary signals. A 320-sample bridge window preserves PR filter-bank properties during mode switching, and a transient detector consisting of an 8-kHz Chebyshev highpass filter is responsible for the switching decisions. Order-of-magnitude peak level increases between 64-sample sub-blocks at the filter output are interpreted as transient events. The Kaiser window parameters used for the KBD windows in each of the AC-2 algorithms appeared in [Fiel96]. For all four algorithms, the AC-2 encoder multiplexes spectral envelope and mantissa parameters into an output bitstream, along with some auxiliary information. Byte-wide Reed-Solomon ECC allows for correction of single-byte errors in the exponents at the expense of 1% overhead, resulting in good performance up to a BER of 0.001.

One AC-2 feature that is unique among the standards is that the perceptual model is backward adaptive, meaning that the bit allocation is not transmitted explicitly. Instead, the AC-2 decoder extracts the bit allocation from the quantized spectral envelope using the same perceptual model as the AC-2 encoder. This structure leads to a significant reduction of side information and induces a symmetric encoder/decoder complexity, which was well suited to the original AC-2 target application of single point-to-point audio transport. An example single point-to-point system now using low-delay AC-2 is the DolbyFAX, a full-duplex codec that simultaneously carries two channels in both directions over four ISDN "B" links, for film and TV studio distance collaboration. Low-delay AC-2 codecs have also been installed on 950-MHz wireless digital studio transmitter links (DSTL). The AC-2 moderate-delay and AC-2A algorithms have been used for both network and wireless broadcast applications, such as cable and direct broadcast satellite (DBS) television. The AC-2A is the predecessor to the now popular multichannel AC-3 algorithm. As the next section will show, the AC-3 coder has inherited and enhanced several facets of the AC-2/AC-2A architecture. In fact, the AC-2 encoder is nearly identical to (one channel of) the simplified AC-3 encoder shown in Figure 10.27, except that AC-2 does not explicitly transmit any perceptual model parameters.

[Figure 10.27: Dolby Laboratories AC-3 encoder. The input s(n) feeds a transient detector and a 256/512-point MDCT; a spectral envelope (exponent) encoder and a mantissa quantizer operate under the control of a perceptual model and a bit-allocation routine, and the results are multiplexed (MUX) into the bitstream.]


10.7.2 Dolby AC-3/Dolby Digital/Dolby SR·D

The 5.1-channel "surround" format that had become the de facto standard in most movie houses during the 1980s was becoming ubiquitous in the home theaters of the 1990s that were equipped with matrixed multichannel sound (e.g., Dolby ProLogic). As a result of this trend, it was clear that emerging applications for perceptual coding would eventually, at a minimum, require stereophonic or even multichannel surround-sound capabilities to gain consumer acceptance. Although single-channel algorithms such as AC-2 can run on parallel independent channels, significantly better performance can be realized by treating multiple channels together in order to exploit interchannel redundancies and irrelevancies. The Dolby Laboratories AC-3 algorithm [Davis93] [Todd94] [Davi98], also known as "Dolby Digital" or "SR·D," was developed specifically for multichannel coding by refining all of the fundamental AC-2 blocks, including the filter bank, the spectral envelope encoding, the perceptual model, and the bit allocation. The coder carries 5.1 channels of audio (left, center, right, left surround, right surround, and a subwoofer) but, at the same time, incorporates a flexible downmix strategy at the decoder to maintain compatibility with conventional monaural and stereophonic sound reproduction systems. The ".1" channel is usually reserved for low-frequency effects and is lowpass bandlimited below 120 Hz. The main features of the AC-3 algorithm are as follows:

• Sample rates: 32, 44.1, and 48 kHz
• Bit rates: 32–640 kb/s, variable
• High-quality output at 64 kb/s per channel
• Delay roughly 100 ms
• MDCT filter bank (oddly stacked TDAC [Prin87]), KBD prototype window
• MDCT coefficients quantized and encoded in terms of exponents, mantissas
• Spectral envelope represented by exponents
• Signal-adaptive exponent strategy with time-varying time-frequency resolution
• Hybrid forward-backward adaptive perceptual model
• Parametric bit allocation
• Uniform quantization of mantissas according to signal-adaptive bit allocation
• Perceptual model improvements possible at the encoder without changing the decoder
• Multiple channels processed as an ensemble
• Frequency-selective intensity coding, as well as LR, MS
• Robust decoder downmix functionality from 5.1 to fewer channels
• Integral dynamic range control system
• Board-level real-time encoders available
• Chip-level real-time decoders available


The AC-3 works in the following way. A signal-adaptive MDCT filter bank with a customized KBD window (Section 6.7) maps the input to the frequency domain. Long windows are applied during steady-state segments, and a pair of short windows is used for transient segments. The MDCT coefficients are quantized and encoded by an exponent/mantissa scheme similar to AC-2. Bit allocation for the mantissas is performed according to a perceptual model that estimates the masked threshold from the quantized spectral envelope. Like AC-2, an identical perceptual model resides at both the encoder and decoder to allow for backward adaptive bit allocation on the basis of the spectral envelope, thus reducing the burden of side information on the bitstream. Unlike AC-2, however, the perceptual model is also forward adaptive in the sense that it is parametric. Model parameters can be changed at the encoder and the new parameters transmitted to the decoder in order to affect modified masked threshold calculations. Particularly at lower bit rates, the perceptual bit allocation may yield insufficient bits to satisfy both the masked threshold and the rate constraint. When this happens, mid/side and intensity coding ("channel coupling" above 2 kHz) reduce the demand for bits by exploiting, respectively, interchannel redundancies and irrelevancies. Ultimately, exponents, mantissas, coupling data, and exponent strategy data are combined and transmitted to the receiver.

The remainder of this section provides details on the major functional blocks of the AC-3 algorithm, including the filter bank, exponent strategy, perceptual model, bit allocation, mantissa quantization, intensity coding, system-level functions, complexity, and applications and standardization activities.

10.7.2.1 Filter Bank. Although the high-level AC-3 structure (Figure 10.27) resembles that of AC-2, there are significant differences between the two algorithms. Like AC-2, the AC-3 algorithm first maps input samples to the frequency domain using a PR cosine-modulated filter bank with a novel KBD window (Section 6.7; parameters given in [Fiel96]). Unlike AC-2, however, AC-3 is based on the oddly stacked MDCT. The AC-3 also handles window switching differently than AC-2A. Long 512-sample (93.75-Hz resolution at 48 kHz) windows are used to achieve reasonable coding gain during stationary segments. During transients, however, a pair of 256-sample windows replaces the long window to minimize pre-echoes. Also in contrast to the MPEG and AC-2 algorithms, the AC-3 MDCT filter bank retains PR properties during window switching without resorting to bridge windows, by introducing a suitable phase shift into the MDCT basis vectors (Chapter 6, Eqs. (6.38a) and (6.38b); see also [Shli97]) for one of the two short transforms. Whenever a scheme similar to the one used in AC-2A detects transients, short filter-bank windows may activate independently on any one or more of the 5.1 channels.

10.7.2.2 Exponent Strategy. The AC-3 algorithm uses a refined version of the AC-2 exponent/mantissa MDCT coefficient representation, resulting in a significantly improved coding gain. In AC-3, the MDCT coefficients corresponding to 1536 input samples (six transform blocks) are combined into a single frame.


Then a frame-processing routine optimizes the exponent representation to exploit temporal redundancy while at the same time representing the stair-step spectral envelope with adequate frequency resolution. In particular, spectral envelopes are formed from partitions of either one, two, or four consecutive MDCT coefficients on each of the six MDCT blocks in the frame. To exploit time-redundancy, the six envelopes can be represented individually, or any or all of the six can be combined into temporal partitions. As in AC-2, the exponents correspond to the peak values of each time-frequency partition, and each exponent is represented with 6 dB of resolution by determining the number of left shifts until overflow. The overall exponent strategy is selected by evaluating spectral stability. Many strategies are possible. For example, all transform coefficients could be transmitted for stable spectra, but time updates might be restricted to 32-ms intervals, i.e., an envelope of single-coefficient partitions might be repeated five times to exploit temporal redundancy. On the other hand, partitions of two or four components might be encoded for transient signals, but the time-partition might be smaller, e.g., updates could occur for every 5.3-ms MDCT block. Regardless of the particular strategy in use for a given frame, exponents are differentially encoded across frequency. Differential coding of exponents exploits knowledge of the filter-bank transition band characteristics, thus avoiding slope overload with only a five-level quantization strategy. The AC-3 exponent strategy exploits, in a signal-dependent fashion, the time- and frequency-domain redundancies that exist on a frame of MDCT coefficients.
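For illustration only, the following sketch mimics the exponent/mantissa idea described above: each exponent counts 6-dB left shifts until the coefficient magnitude would leave a normalized range, and the exponents are then differenced across frequency. The normalization range, the exponent cap, and the function names are assumptions made for this example, not the AC-3 bitstream syntax.

import numpy as np

def exponent_mantissa(coeffs, max_exp=24):
    # Split each MDCT coefficient into an exponent (number of 6-dB left
    # shifts until the magnitude reaches [0.5, 1)) and a normalized mantissa.
    exps = np.zeros(len(coeffs), dtype=int)
    mants = np.zeros(len(coeffs))
    for i, c in enumerate(coeffs):
        e, m = 0, abs(c)
        while m < 0.5 and e < max_exp:   # each left shift doubles m (6 dB)
            m *= 2.0
            e += 1
        exps[i] = e
        mants[i] = np.sign(c) * m        # mantissa in [0.5, 1), or 0
    return exps, mants

def differential_encode(exps):
    # Differential coding of exponents across frequency: the first exponent
    # is sent absolutely, the rest as differences from the previous one.
    return np.concatenate(([exps[0]], np.diff(exps)))

exps, mants = exponent_mantissa(np.array([0.8, 0.03, -0.004, 0.2]))
print(exps, differential_encode(exps))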

10.7.2.3 Perceptual Model. A novel parametric forward-backward adaptive perceptual model estimates the masked threshold on each frame. The forward-adaptive component exists only at the encoder. Given a rate constraint, this block interacts with an iterative rate control loop to determine the best set of perceptual model parameters. These parameters are passed to the backward-adaptive component, which estimates the masked threshold by applying the parameters from the forward-adaptive component to a calculation involving the quantized spectral envelope. Identical backward-adaptive model components are embedded in both the encoder and decoder. Thus, model parameters are fixed at the encoder after several threshold calculations in an iterative rate control process and then transmitted to the decoder. The decoder only needs to perform one threshold calculation, given the parameter values established at the encoder.

The backward-adaptive model component works as follows. First, the quantized spectral envelope exponents are clustered into 50 0.5-Bark-width subbands. Then a spreading function is applied (Figure 10.28a) that accounts only for the upward spread of masking. To compensate for filter-bank leakage at low frequencies, spreading is disabled below 200 Hz. Also, spreading is not enabled between 200 and 700 Hz for frequencies below the occurrence of the first significant masker. The absolute threshold of hearing is accounted for after the spreading function has been applied. Unlike other algorithms, AC-3 neglects the downward spread of masking, assumes that masking power is nonadditive, and makes no explicit assumptions about the relationship between tonality and the skirt slopes on the spreading function. Instead, these characteristics are captured in a set of parameters that comprise the forward-adaptive model component. Masking threshold calculations at the decoder are controlled by a set of parameters transmitted by the encoder, creating flexibility for model improvements at the encoder such that improved estimates of the masked threshold can be realized without modifying the embedded model at the decoder.

For example, a parametric (upward only) spreading function is defined (Figure 10.28a) in terms of two slopes, Si, and two level offsets, Li, for i ∈ [1, 2]. While the parameters S1 and L1 can be uniquely specified for each channel, the parameters S2 and L2 are applied to all channels. The parametric spreading function is advantageous in that it allows the perceptual model at the encoder

[Figure not reproduced: panel (a) plots level (dB) versus frequency (Barks) and shows the parametric spreading function, defined by slopes S1, S2 and level offsets L1, L2 relative to the masker; panel (b) plots SPL (dB) versus frequency (Barks) and shows the masker SMR at about 30, 20, and 15 dB for masker levels near 100, 60, and 40 dB SPL.]

Figure 10.28 Dolby AC-3 parametric perceptual model: (a) prototype spreading function; (b) masker SMR level-dependence.


to account for tonality or dynamic masking patterns without the need to alter the decoder model. A range of values is available for each parameter. With units of dB per 1/2 Bark, the slopes are defined to lie within the ranges −2.95 to −5.77 for S1 and −0.7 to −0.98 for S2. With units of dB SPL, the levels are defined to lie within the ranges −6 to −48 for L1 and −49 to −63 for L2. Ultimately, there are 512 unique spreading function shapes to choose from. The acoustic-level dependence of masking thresholds is also modeled in AC-3. It is in general true that the signal-to-mask ratio (SMR) increases with increasing stimulus level (Figure 10.28b), i.e., the threshold moves closer to the stimulus as the stimulus intensity decreases. In the AC-3 parametric perceptual model, this phenomenon is captured by adding a positive bias to the masked thresholds when the spectral envelope is below a threshold level. Acoustic level threshold biasing is applied on a band-by-band basis. The decision threshold for the biasing is one of the forward adaptive parameters transmitted by the encoder. This function can also be disabled altogether. The parametric perceptual model also provides a convenient upgrade path in the form of a bit allocation delta parameter.
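A minimal sketch of a two-segment, upward-only spreading function of this form is given below; the slope and offset values, the 0.5-Bark band grid, and the max-combination of the two segments are illustrative assumptions rather than the AC-3 parameter tables.

import numpy as np

def spreading_function(masker_level_db, masker_band, n_bands,
                       S1=-4.0, L1=-20.0, S2=-0.85, L2=-55.0):
    # Two-segment upward-only spreading function (illustrative values).
    # Bands are 0.5 Bark wide; S1, S2 are slopes in dB per 1/2 Bark and
    # L1, L2 are level offsets in dB relative to the masker level.
    spread = np.full(n_bands, -np.inf)           # no downward spread of masking
    for b in range(masker_band, n_bands):
        d = b - masker_band                      # distance in 1/2-Bark bands
        seg1 = masker_level_db + L1 + S1 * d     # steep segment near the masker
        seg2 = masker_level_db + L2 + S2 * d     # shallow segment far from it
        spread[b] = max(seg1, seg2)              # piecewise-linear envelope
    return spread

# Example: a single masker at band 10 of a 50-band (25-Bark) grid
threshold = spreading_function(masker_level_db=80.0, masker_band=10, n_bands=50)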

It was envisioned that future, more sophisticated AC-3 encoders might run in parallel two perceptual models, with one being the original reference model and the other being an enhanced model with more accurate estimates of masked threshold. The delta parameter allows the encoder to transmit a stair-step function for which each tread specifies a masking level adjustment for an integral number of 1/2-Bark bands. Thus, the masking model can be incrementally improved without alterations to the existing decoders. Other details on the hybrid backward-forward AC-3 perceptual model can be found in [Davi94].

10.7.2.4 Bit Allocation and Mantissa Quantization. A bit allocation is determined at the encoder for each frame of mantissas by an iterative procedure that adjusts the mantissa quantizers, the multichannel coding strategies (below), and the forward-adaptive model parameters to satisfy simultaneously the specified rate constraint and the masked threshold. Within the rate-control loop, threshold partitions are formed on the basis of a bit allocation frequency resolution parameter, with coefficient partitions ranging in width between 94 and 375 Hz. In a manner similar to MPEG-1, quantizers are selected for the set of mantissas in each partition based on an SMR calculation. Sufficient bits are allocated to ensure that the SNR for the quantized mantissas is greater than or equal to the SMR. The quantization noise is thus rendered inaudible, below the masked threshold. Uniform quantizers are selected from a set of 15 having 0, 3, 5, 7, 11, and 15 levels symmetric about 0, and conventional 2's-complement quantizers having 32, 64, 128, 256, 512, 1024, 2048, 4096, 16384, or 65536 levels. Certain quantizer codewords are group-encoded to make more efficient usage of available bits. Dithering can be enabled optionally on individual channels for 0-bit mantissas. If the bit supply is insufficient to satisfy the masked threshold, then SNRs can be reduced in selected threshold partitions until the rate is satisfied, or intensity coding and MS transformations are used in a frequency-selective fashion to reduce the bit demand. Two variable-rate methods are also available to satisfy peak-rate demands. Within a frame of six MDCT coefficient blocks, bits can be distributed unevenly, such that the instantaneous bit rate is variable but the average rate is constant. In addition, bit rates are adjustable, and a unique rate can be specified for each frame of six MDCT blocks. Unlike some of the other standardized algorithms, the AC-3 does not include an explicit lossless coding stage for final redundancy reduction after quantization and encoding.
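The SMR-driven allocation described above can be caricatured in a few lines, assuming the usual roughly 6-dB-per-bit SNR rule; the trimming rule applied when the frame budget is exceeded is a placeholder for, not a description of, the AC-3 parametric allocation.

import numpy as np

def allocate_bits(smr_db, frame_budget, max_bits=16):
    # Toy SMR-driven allocation: about 6.02 dB of SNR per quantizer bit,
    # so each partition gets enough bits to push SNR above its SMR.
    bits = np.ceil(np.maximum(smr_db, 0.0) / 6.02).astype(int)
    bits = np.minimum(bits, max_bits)
    # If the budget is exceeded, remove bits where the SNR margin is largest
    while bits.sum() > frame_budget:
        margin = 6.02 * bits - smr_db
        candidates = np.where(bits > 0)[0]
        victim = candidates[np.argmax(margin[candidates])]
        bits[victim] -= 1
    return bits

smr = np.array([18.0, 3.5, 0.0, 27.2, 12.1])    # dB, one value per partition
print(allocate_bits(smr, frame_budget=10))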

10.7.2.5 Multichannel Coding. When the bit demand imposed by multiple independent channels exceeds the bit budget, the AC-3 ensemble processing of 5.1 channels exploits interchannel redundancies and irrelevancies, respectively, by making frequency-selective use of mid/side (MS) and intensity coding techniques. Although the MS and intensity functions can be simultaneously active on a given channel, they are restricted to nonoverlapping subbands. The MS scheme is carefully controlled [Davi98] to maintain compatibility between AC-3 and matrixed surround systems such as Dolby ProLogic. Intensity coding, also known as "channel coupling," is a multichannel irrelevancy reduction coding technique that exploits properties of spatial hearing. There is considerable experimental evidence [Blau74] suggesting that the interaural time difference of a signal's fine structure has negligible influence on sound localization above a certain frequency. Instead, the ear evaluates primarily energy envelopes. Thus, the idea behind intensity coding is to transmit only one envelope in place of two or more sufficiently correlated spectra from independent channels, together with some side information. The side information consists of a set of coefficients that is used to recover individual spectra from the intensity channel.

A simplified version of the AC-3 intensity coding scheme is illustrated in Figure 10.29. At the encoder (Figure 10.29a), two or more input spectra are added together to form a single intensity channel. Prior to the addition, an optional adjustment is applied to prevent phase cancellation. Then, groups of adjacent coefficients are partitioned into between 1 and 18 separate intensity subbands on both the individual and the intensity channels. A set of coupling coefficients, cij, is computed that expresses the fraction of energy contributed by the i-th individual channel to the j-th band of the intensity envelope, i.e., cij = βij/αj, where βij is the power contained in the j-th band of the i-th channel, and αj is the power contained in the j-th band of the intensity channel. Finally, the intensity spectrum is quantized, encoded, and transmitted to the decoder. The coupling coefficients, cij, are transmitted as side information. Once the intensity channel has been recovered at the decoder (Figure 10.29b), the intensity subbands are scaled by the coupling coefficients, cij, in order to recover an appropriate fraction of intensity energy in the j-th band of the i-th channel. The intensity-coded coefficients are then combined with any remaining uncoupled transform coefficients and passed through the synthesis filter bank to reconstruct the individual channel. The AC-3 coupling coefficients have a dynamic range that spans −132 to +18 dB, with quantization step sizes between 0.28 and 0.53 dB. Intensity coding is applied in a frequency-selective manner, parameterized by a start frequency of 3.42 kHz or higher and a bandwidth expressed in multiples of 1.2 kHz for


[Figure not reproduced: panel (a) shows the intensity-coding encoder, in which channels 1 and 2 are phase-adjusted, summed into an intensity channel, and band-split; per-band power calculations on the individual channels (βij) and on the intensity channel (αj) are divided to form the coupling coefficients cij.]

Figure 10.29 Dolby AC-3 intensity coding ("channel coupling"): (a) encoder; (b) decoder.

a 48-kHz system [Davi98]. Note that, unlike the simplified system shown in the figure, the actual AC-3 intensity coding scheme may couple the spectra from as many as five channels.
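The coupling arithmetic follows directly from the definitions of αj and βij above; in the sketch below, the band partitioning, the omitted phase adjustment, and the square root used to turn the transmitted power fraction into a decoder gain are assumptions made for the example, not the AC-3 syntax.

import numpy as np

def couple_channels(spectra, band_edges):
    # Form an intensity (coupling) channel and coupling coefficients
    # c_ij = beta_ij / alpha_j from two or more channel spectra.
    intensity = spectra.sum(axis=0)                    # phase adjustment omitted
    n_ch, n_bands = spectra.shape[0], len(band_edges) - 1
    c = np.zeros((n_ch, n_bands))
    for j in range(n_bands):
        lo, hi = band_edges[j], band_edges[j + 1]
        alpha = np.sum(intensity[lo:hi] ** 2) + 1e-12  # intensity-band power
        for i in range(n_ch):
            beta = np.sum(spectra[i, lo:hi] ** 2)      # individual-band power
            c[i, j] = beta / alpha                     # c_ij = beta_ij / alpha_j
    return intensity, c

def decouple(intensity, c, band_edges):
    # Decoder side: scale the intensity band so channel i recovers its
    # fraction of the band energy (sqrt converts power fraction to gain).
    n_ch = c.shape[0]
    out = np.zeros((n_ch, len(intensity)))
    for j in range(c.shape[1]):
        lo, hi = band_edges[j], band_edges[j + 1]
        for i in range(n_ch):
            out[i, lo:hi] = np.sqrt(c[i, j]) * intensity[lo:hi]
    return out

edges = [0, 8, 16, 32]
intensity, c = couple_channels(np.random.randn(2, 32), edges)
rebuilt = decouple(intensity, c, edges)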

10.7.2.6 System-Level Functions. At the system level, AC-3 provides mechanisms for channel down-mixing and dynamic range control. Down-mix capability is essential for the 5.1-channel system, since the majority of potential playback systems are still monaural or, at best, stereophonic. Down-mixing is performed at the decoder in the frequency domain rather than the time-domain to reduce complexity. This is possible because of the filter-bank linearity. The bit stream carries some down-mix information, since different listening situations call for different down-mix weighting. Dialog-level normalization is also available at the decoder. Finally, the bitstream has available facilities to handle other control and ancillary user information, such as copyright, language, production, and time-code data [Davis94].

10.7.2.7 Complexity. Assuming the standard HDTV configuration of 384 kb/s with a 48-kHz sample rate and implementation using the Zoran ZR38001 general-purpose DSP instruction set, the AC-3 decoder memory requirements and complexity


[Figure 10.29 (continued): panel (b) shows the decoder, in which the recovered intensity channel is band-split and scaled by the coupling coefficients c11, c12, c21, c22, combined with any uncoupled coefficients, and passed through the synthesis filter banks to reconstruct channels 1 and 2.]

are as follows: 6.6 kbytes RAM, 5.4 kbytes ROM, and 27.3 MIPS for 5.1 channels, and 3.1 kbytes RAM, 5.4 kbytes ROM, and 26.5 MIPS for 2 channels [Vern95]. Note that complexity estimates are processor-dependent. For example, on a Motorola DSP56002, 45 MIPS are required for a 5.1-channel decoder. Encoder complexity varies between two and five times decoder complexity, depending on the encoder sophistication [Vern95]. Numerous real-time encoder and decoder implementations have been reported. Early on, for example, a single-chip decoder was implemented on a Zoran DSP [Vern93]. More recently, a DP561 AC-3 encoder (5.1 channels, 44.1- or 48-kHz sample rate) for DVD mastering was implemented in real-time on a PC host with a plug-in DSP subsystem. The computational requirements were handled by an Ariel PC-Hydra DSP array of eight Texas Instruments TMS320C44 floating-point DSP devices clocked at 50 MHz [Terr96]. Current information on real-time AC-3 implementations is also available online from Dolby Laboratories.


10.7.2.8 Applications and Standardization. The first popular AC-3 application was in the cinema. The "Dolby Digital" or "SR·D" AC-3 information is interleaved between sprocket holes on one side of the 35-mm film. The AC-3 was first deployed in only three theaters for the film Star Trek VI in 1991, after which the official rollout of Dolby SR·D occurred in 1992 with Batman Returns. By April 1997, over 900 film soundtracks had been AC-3 encoded. Nowadays, the AC-3 algorithm is finding use in the digital versatile disc (DVD), cable television (CATV), and direct broadcast satellite (DBS). Many hi-fidelity amplifiers and receiver units now contain embedded AC-3 decoders and accept an AC-3 digital, rather than an analog, feed from external sources such as DVD.

In addition, the DP504/524 version of the DolbyFAX system (Section 10.7.1) has added AC-3 stereo and MPEG-1 layer II to the original AC-2-based system. Film, television, and music studios use DolbyFAX over ISDN links for automatic dialog replacement, music collaboration, sound effects delivery, and remote videotape audio playback. As far as standardization is concerned, the United States Advanced Television Systems Committee (ATSC) has adopted the AC-3 algorithm as the A/52 audio compression standard [USAT95b] and as the audio component of the A/53 Digital Television (DTV) Standard [USAT95a]. In December 1996, the United States Federal Communications Commission (U.S. FCC) adopted the ATSC standard for DTV, including the AC-3 audio component. On the international standardization front, the Digital Audio-Visual Council (DAVIC) selected AC-3 and MPEG-1 layer II for the audio component of the DAVIC 1.2 specification [DAVC96].

10.7.2.9 Recent Developments – The Dolby Digital Plus. The Dolby Digital Plus system, or the enhanced AC-3 (E-AC-3) [Fiel04], was recently introduced to extend the capabilities of the Dolby AC-3 algorithm. While remaining backward compatible with the Dolby AC-3 standard, Dolby Digital Plus provides several enhancements. Some of the extensions include the flexibility to encode up to 13.1 channels and extended data rates up to 6.144 Mb/s. The AC-3 filterbank is supplemented with a second-stage DCT to exploit the stationary characteristics in the audio. Other coding tools include spectral extension, enhanced channel coupling, and transient pre-noise processing. The E-AC-3 is used in cable and satellite television set-top boxes and broadcast distribution transcoding devices. For a detailed description of Dolby Digital Plus, refer to [Fiel04].

10.8 AUDIO PROCESSING TECHNOLOGY APT-x100

Without exception, all of the commercial and international audio coding standards described thus far couple explicit models of auditory perception with classical quantization techniques in an attempt to distribute quantization noise over the time-frequency plane such that it is imperceptible to the human listener. In addition to irrelevancy reduction, most of these algorithms simultaneously seek to reduce statistical redundancies. For the sake of comparison, and perhaps to better assess the impact of perceptual models on realizable coding gain, it is instructive to next consider a commercially available audio coding algorithm that relies only upon redundancy removal, without any explicit regard for auditory perception.

We turn to the Audio Processing Technology APT-x100 algorithm, which has been reported to achieve nearly transparent coding of CD-quality, 44.1-kHz, 16-bit PCM input at a compression ratio of 4:1, or 176.4 kb/s per monaural channel [Wyli96b]. Like the ITU-T G.722 wideband speech codec [G722], the APT-x100 encoder (Figure 10.30) relies upon subband signal decomposition followed by independent ADPCM quantization of the decimated subband output sequences. Codewords from four uniform bandwidth subbands are multiplexed onto the channel and sent to the decoder, where the ADPCM and filter-bank operations are inverted to generate an output. As shown in the figure, a tree-structured QMF filter bank splits the input signal into four subbands. The first and second filter stages have 64 and 32 taps, respectively. Backward adaptive prediction is applied to the four subband output sequences. The resulting prediction residual is quantized with a backward-adaptive Laplacian quantizer. Backward adaptation in the prediction and quantization steps eliminates side information but increases sensitivity to fast transients. On the other hand, both prediction and adaptive quantization were found to significantly improve coding gain for a wide range of test signals [Wyli96b]. Adaptive quantization attempts to track signal dynamics and tends to produce constant SNR in each subband during stationary segments.
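A toy version of one backward-adaptive subband ADPCM path is sketched below; the predictor order, the LMS update, the 2-bit quantizer, and the step-size multipliers are illustrative stand-ins, not the published APT-x100 parameters.

import numpy as np

def subband_adpcm(x, order=2, mu=0.01):
    # Backward-adaptive ADPCM for one subband: both the predictor and the
    # step size adapt from past quantized data, so no side information is sent.
    w = np.zeros(order)              # predictor taps (LMS-adapted)
    hist = np.zeros(order)           # past reconstructed samples
    delta = 0.1                      # quantizer step size
    codes, recon = [], []
    for s in x:
        pred = w @ hist
        e = s - pred
        code = int(np.clip(np.round(e / delta), -2, 1))   # 2-bit code in {-2,...,1}
        eq = code * delta
        y = pred + eq                                      # local reconstruction
        w += mu * eq * hist                                # backward LMS update
        delta *= 1.2 if code in (-2, 1) else 0.95          # backward step adaptation
        delta = min(max(delta, 1e-4), 1.0)
        hist = np.roll(hist, 1); hist[0] = y
        codes.append(code); recon.append(y)
    return np.array(codes), np.array(recon)

codes, recon = subband_adpcm(np.sin(0.1 * np.arange(200)))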

Unlike the other algorithms reviewed in this document, APT-x100 contains no perceptual model or rate control loop. The ADPCM output codewords are of fixed resolution (1 bit per sample), and therefore, with four subbands, the output bit rate is reduced 4:1. A comparison between APT-x100 quantization noise and

[Figure not reproduced: s(n) passes through a tree of QMF analysis filters (a 64-tap first stage followed by 32-tap second stages) to produce four subbands; in each subband a backward-adaptive predictor Ai(z) is subtracted from the signal, the difference is quantized (Q) and inverse-quantized (1/Q) within the prediction loop, and the four codeword streams are multiplexed into the output bitstream.]

Figure 10.30 Audio Processing Technology APT-x100 encoder.


[Figure not reproduced: the input audio x(n) feeds a 32-channel PQMF analysis bank and a psychoacoustic signal analysis block; the SMR values drive the bit allocation for subband ADPCM quantization, and the quantized data and side information are multiplexed to the channel.]

Figure 10.31 DTS coherent acoustics (DTS-CA) encoding scheme.

noise masking thresholds computed as in [John88a] for a variety of test signals from the SQAM test CD [SQAM88] revealed two trends in the APT-x100 noise floor. First, as expected, it is flat rather than shaped. Second, the noise is below the masking threshold in most critical bands for most stationary test signals, but tends to exceed the threshold in some critical bands for transient signals. In [Wyli96b], however, the fast step-size adaptation in APT-x100 is reported to exploit temporal masking effects and mitigate the audibility of unmasked quantization noise. While the lack of a perceptual model results in an inefficient, flat noise floor, it also affords some advantages, including reduced complexity, reduced frequency resolution requirements, and a low delay of only 122 samples, or 2.77 ms at 44.1 kHz.

Several other relevant facts on APT-x100 quality and robustness were also reported in [Wyli96b]. Objective output quality was evaluated in terms of average subband SNRs, which were 30, 15, 10, and 7 dB, respectively, for the lowest to highest subbands, and the authors stated that the algorithm outperformed NICAM [NICAM] in an informal subjective comparison [Wyli96b]. APT-x100 was robust to both random bit errors and tandem encoding. Errors were inaudible for a bit error rate (BER) of 10^-4, and speech remained intelligible for a BER of 10^-1. In one test, 10 stages of synchronous tandeming reduced output SNR from 45 dB to 37 dB. An auxiliary channel that accommodates up to 1/4 of the sample rate in buried data (e.g., 24 kb/s for 48-kHz stereo samples) by bit stealing from one of the subbands had a negligible effect on output quality. Finally, real-time APT-x100 encoder and decoder modules were implemented on a single AT&T DSP16A masked ROM DSP. As far as applications are concerned, APT-x100 has been deployed in digital studio-transmitter links, audio storage products, and cinematic surround sound applications. A cursory performance comparison of the nonperceptual algorithms versus the perceptually based algorithms (e.g., NICAM or APT-x100 vs. MPEG or PAC, etc.) confirms that some awareness of peripheral auditory processing is necessary to achieve high-quality coding of CD-quality audio for compression ratios in excess of 4:1.


10.9 DTS – COHERENT ACOUSTICS

The performance comparison of the nonperceptual algorithms versus the perceptually based algorithms (e.g., APT-x100 vs. MPEG or PAC, etc.) given in the earlier section highlights that some awareness of peripheral auditory processing is necessary to achieve high-quality encoding of digital audio for compression ratios in excess of 4:1. To this end, DTS employs an audio compression algorithm based on the principles of "coherent acoustics encoding" [Smyt96] [Smyt99] [DTS]. In coherent acoustics, both ADPCM-subband filtering and psychoacoustic analysis are employed to compress the audio data. The main emphasis in DTS is to improve the precision (and hence the quality) of the digital audio. The DTS encoding algorithm provides a resolution of up to 24 bits per sample and at the same time can deliver compression rates in the range of 3 to 40. Moreover, DTS can deliver up to eight discrete channels of multiplexed audio at sampling frequencies of 8–192 kHz and at bit rates of 8–512 kb/s per channel. Table 10.9 summarizes the various bit rates, sampling frequencies, and bit resolutions employed in the four configurations supported by DTS coherent acoustics.

10.9.1 Framing and Subband Analysis

The DTS-CA encoding algorithm (Figure 10.31) operates on 24-bit linear PCM signals. The audio signals are typically analyzed in blocks (frames) of 1024 samples, although frame sizes of 256, 512, 2048, and 4096 samples are also supported, depending on the bit rates and sampling frequencies used (Table 10.10). For example, if operating at bit rates of 1024–2048 kb/s and sampling frequencies of 32, 44.1, or 48 kHz, then the maximum number of samples allowed per frame is 1024. Next, the segmented audio frames are decomposed into 32 critically subsampled subbands using a polyphase realization of a pseudo-QMF (PQMF) bank (Chapter 6). Two different PQMF filter-bank structures, namely, perfect reconstruction (PR) and nonperfect reconstruction (NPR), are provided in DTS coherent acoustics. In the example that we considered above, a frame size of 1024 samples results in 32 PCM samples per subband (i.e., 1024/32). The channels are equally spaced, such that a 32-kHz input signal is split into 500-Hz subbands (i.e., 16 kHz/32), with the subbands being decimated at the ratio 32:1.

Table 10.9 A list of encoding parameters used in DTS coherent acoustics (after [Smyt99])

Bit rates (kb/s/channel)   Sampling rates (kHz)   Bit resolution per sample
8–32                       24                     16
32–96                      48                     20
96–256                     96                     24
256–512                    192                    24


Table 10.10 Maximum frame sizes allowed in DTS-CA (after [Smyt99])

Sampling frequency set fs = [8, 11.05, 12] (kHz)

Bit rates (kb/s)   fs          2fs         4fs         8fs         16fs
0–512              Max 1024    Max 2048    Max 4096    NA          NA
512–1024           NA          Max 1024    Max 2048    NA          NA
1024–2048          NA          NA          Max 1024    Max 2048    NA
2048–4096          NA          NA          NA          Max 1024    Max 2048

10.9.2 Psychoacoustic Analysis

While the subband filtering stage minimizes the statistical dependencies associated with the input PCM signal, the psychoacoustic analysis stage eliminates the perceptually irrelevant information. Since we have already established the necessary background on psychoacoustic analysis in Chapter 5, we will not elaborate on these steps. However, we describe next the advantages of combining differential subband coding techniques (e.g., ADPCM) with psychoacoustic analysis.

10.9.3 ADPCM – Differential Subband Coding

A block diagram depicting the steps involved in the differential subband coding in DTS-CA is shown in Figure 10.32. A fourth-order forward linear prediction is performed on each subband containing 32 PCM samples. From the above example, we have 32 subbands and 32 PCM samples per subband. Recall that in LP we predict the current time-domain audio sample based on a linearly weighted combination of previous samples. From the LPC analysis corresponding to the i-th subband, we obtain predictor coefficients ai,k for k = 0, 1, ..., 4, and the residual error ei(n) for n = 0, 1, 2, ..., 31 samples. The predictor coefficients are usually vector quantized in the line spectral frequency (LSF) domain.

Two stages of ADPCM modules are provided in the DTS-CA algorithm, i.e., the ADPCM estimation stage and the real ADPCM stage. ADPCM utilizes the redundancy in the subband PCM audio by exploiting the correlation between adjacent samples. First, the "estimation ADPCM" module is used to determine the degree of prediction achieved by the fourth-order linear prediction filter (Figure 10.32). Depending upon the statistical features of the audio, a decision to enable or disable the second, "real ADPCM," stage is made.

A predictor mode flag, "PMODE" = 1 or 0, is set to indicate whether the "real ADPCM" module is active or not, respectively. The subband prediction is given by

si,pred(n) = Σ_{k=0}^{4} ai,k si(n − k),   for n = 0, 1, ..., 31        (10.11)

[Figure not reproduced: subband PCM audio (32 subbands) enters an input buffer and linear prediction analysis; a first-stage ("estimation") ADPCM module, together with prediction analysis, transient analysis, and scale-factor computation, decides whether to disable the second-stage ("real") ADPCM module and produces the PMODE and TMODE flags; the quantized predictor coefficients, scale factors, side information, and differential subband audio are multiplexed.]

Figure 10.32 Differential subband (ADPCM) coding in the DTS-CA encoder.

ei(n) = si(n) − si,pred(n) = si(n) − Σ_{k=0}^{4} ai,k si(n − k)        (10.12)

While the "prediction analysis" block computes the PMODE flag based on the prediction gain, the "transient analysis" module monitors the transient behavior of the error residual. In particular, when a signal with a sharp attack (i.e., rapid transitions) begins near the end of a transform block and immediately following a region of low energy, pre-echoes occur. Several pre-echo control strategies have been developed (Chapter 6, Section 6.10). These include window switching, gain modification, switched filter banks, the bit reservoir, and temporal noise shaping. In DTS-CA, the pre-echo artifacts are controlled by dividing the subband analysis buffer into four sub-buffers. A transient mode, "TMODE" = 0, 1, 2, or 3, is set to denote the beginning of a transient signal in sub-buffers 1, 2, 3, or 4, respectively. In addition, two scale factors are computed for each subband (i.e., before and after the transition) based on the peak magnitude of the residual error ei(n). A 64-level nonuniform quantizer is usually employed to encode the scale factors in DTS-CA.

Note that the PMODE flag is a "Boolean" and the TMODE flag has four values. Therefore, a total of 15 bits (i.e., 12 bits for the two scale factors, 1 bit for the "PMODE" flag, and 2 bits for the "TMODE" flag) are sufficient to encode the entire side information in the DTS-CA algorithm. Next, based on the predictor mode flag (1 or 0), the second-stage ADPCM is used to encode the differential subband PCM audio, as shown in Figure 10.32. The optimum number of bits (in the sense of minimizing the quantization noise) required to encode the differential audio in each subband is estimated using a bit allocation procedure.
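A compact sketch of the per-subband processing implied by Eqs. (10.11) and (10.12) and the PMODE/TMODE flags follows; the prediction-gain threshold, the transient test, and the scale-factor split are illustrative placeholders rather than the actual DTS-CA decision rules.

import numpy as np

def subband_side_info(s, a, gain_threshold_db=3.0):
    # s: one subband's 32 PCM samples; a: predictor coefficients a[0..4]
    # as in Eq. (10.11) (a[0] kept at 0 in the example so only past samples
    # are used). Returns the residual of Eq. (10.12), PMODE, TMODE, and
    # two scale factors (all decision rules here are illustrative).
    n = len(s)
    pred = np.array([sum(a[k] * s[i - k] for k in range(len(a)) if i - k >= 0)
                     for i in range(n)])
    e = s - pred                                            # Eq. (10.12)
    gain_db = 10 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + 1e-12))
    pmode = 1 if gain_db > gain_threshold_db else 0         # enable 'real' ADPCM?
    peaks = np.abs(e).reshape(4, n // 4).max(axis=1)        # four sub-buffers
    tmode = int(np.argmax(peaks))                           # sub-buffer of the attack
    split = tmode * (n // 4)
    sf_before = np.max(np.abs(e[:split])) if split > 0 else 0.0
    sf_after = np.max(np.abs(e[split:]))                    # two scale factors
    return e, pmode, tmode, (sf_before, sf_after)

s = np.cumsum(np.random.randn(32))          # a smooth, predictable subband signal
e, pmode, tmode, sfs = subband_side_info(s, a=[0.0, 0.9, -0.2, 0.05, 0.0])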

10.9.4 Bit Allocation, Quantization, and Multiplexing

Bit allocation is determined at the encoder for each frame (32 subbands) by an iterative procedure that adjusts the scale-factor quantizers, the fourth-order linear predictive model parameters, and the quantization levels of the differential subband audio. This is done in order to satisfy simultaneously the specified rate constraint and the masked threshold. In a manner similar to MPEG-1, quantizers are selected in each subband based on an SMR calculation. A sufficient number of bits is allocated to ensure that the SNR for the quantized error is greater than or equal to the SMR. The quantization noise is thus rendered inaudible, i.e., below the masked threshold. Recall that the main emphasis in DTS-CA is to improve precision, and hence the quality, of the digital audio, while giving relatively less importance to minimizing the data rate. Therefore, the DTS-CA bit reservoir will almost always meet the bit demand imposed by the psychoacoustic model. Similar to some of the other standardized algorithms (e.g., MPEG codecs, lossless audio coders), the DTS-CA includes an explicit lossless coding stage for final redundancy reduction after quantization and encoding. A data multiplexer merely packs the differential subband data, the side information, the synchronization details, and the header syntax into a serial bitstream. Details on the structure of the "output data frame" employed in the DTS-CA algorithm are given in [Smyt99] [DTS].

As an extension to the current coherent acoustics algorithm, Fejzo et al. proposed a new enhancement that delivers 96-kHz, 24-bit resolution audio quality [Fejz00]. The proposed enhancement makes use of both "core" and "extension" data to reproduce 96-kHz audio bitstreams. Details on the real-time implementation of the 5.1-channel decoder on a 32-bit floating-point processor are also presented in [Fejz00]. Although much work has been done in the area of encoder/decoder architectures for the DTS-CA codecs, relatively little has been published [Mesa99].

10.9.5 DTS-CA Versus Dolby Digital

The DTS coherent acoustics and the Dolby AC-3 algorithms were the two competing standards during the mid-1990s. While the former employs adaptive differential linear prediction (ADPCM–subband coding) in conjunction with a perceptual model, the latter employs a unique exponent/mantissa MDCT coefficient encoding technique in conjunction with a parametric forward-backward adaptive perceptual model.

PROBLEMS

10.1 List some of the primary differences between the DTS, the Dolby Digital, and the Sony ATRAC encoding schemes.

10.2 Using a block diagram, describe how the ISO/IEC MPEG-1 layer I codec is different from the ISO/IEC MPEG-1 layer III algorithm.

10.3 What are the enhancements integrated into MPEG-2 AAC relative to the MP3 algorithm? State key differences in the algorithms.

10.4 List some of the distinguishing features of the MP4 audio format over the MP3 format. Give bit rates and cite references.

10.5 What is the main idea behind scalable audio coding? Explain using a block diagram. Give examples.

10.6 What is structured audio coding and parametric audio coding?

10.7 How is the ISO/IEC MPEG-7 standard different from the other MPEG standards?

COMPUTER EXERCISE

10.8 The audio files Ch10aud2L.wav, Ch10aud2R.wav, Ch10aud2C.wav, Ch10aud2Ls.wav, and Ch10aud2Rs.wav correspond to left, right, center, left-surround, and right-surround, respectively, of a 3/2-channel configuration. Using the matrixing technique, obtain a stereo output.

CHAPTER 11

LOSSLESS AUDIO CODING AND DIGITAL WATERMARKING

11.1 INTRODUCTION

The emergence of high-end storage formats such as the DVD-audio and the super-audio CD (SACD) created opportunities for lossless audio coding schemes. Lossless audio coding techniques yield high-quality audio without any artifacts. Lossy audio coding (LAC) typically results in compression ratios of 10:1–25:1, while lossless audio coding (L2AC) algorithms achieve compression ratios of 2:1–4:1. Therefore, L2AC techniques are not typically used in real-time storage/multimedia processing and Internet streaming, but can be of use in storage-rich formats. Several lossless audio coding algorithms, including the SHORTEN [Robi94], the DVD [Crav96] [Oome96] [Crav97], the MUSICompress [Wege97], the AudioPaK [Hans98a] [Hans98b] [Hans01], the lossless transform coding of audio (LTAC) [Pura97], and the IntMDCT [Geig01] [Geig02], have been proposed. The meridian lossless packing (MLP) [Gerz99] and the direct-stream digital (DSD) techniques [Brue97] form a group of high-end lossless compression algorithms that have already become standards in the DVD-Audio [DVD01] and the SACD [SACD02] storage formats, respectively. In Figure 11.1, we taxonomize various lossless audio coding algorithms.

This chapter is organized as follows. In Section 11.2, we describe the principles of lossless audio coding. A survey of various lossless audio coding algorithms is given in Section 11.2.2. This is followed by more detailed descriptions of the DVD-audio (Section 11.3) and the SACD (Section 11.4) formats. Section 11.5 addresses


[Figure not reproduced: taxonomy of lossless audio coding (L2AC) schemes, comprising prediction-based techniques (SHORTEN [Robi94]; the DVD algorithm [Crav96] [Crav97]; MUSICompress [Wege97]; AudioPaK [Hans98a] [Hans98b]), transform-based techniques (LTAC [Pura97]; IntMDCT [Geig01] [Geig02]), and additional schemes: the direct stream digital (DSD) technique [Jans03] [Reef01a] [SACD02] [SACD03], meridian lossless packing (MLP) [Gerz99] [Kuzu03] [DVD01], C-LPAC [Qiu01], and MPEG-4 lossless audio coding [Mori02a] [Sche01] [Vasi02].]

Figure 11.1 A list of lossless audio coding algorithms.

digital audio watermarking. Finally, in Section 11.6, we present commercial applications and recent developments in audio coding standardization.

11.2 LOSSLESS AUDIO CODING (L2AC)

LAC schemes employ psychoacoustic principles in conjunction with time-frequency mapping techniques to eliminate perceptually irrelevant information. Hence, the encoded signal is not an exact replica of the input audio and might contain some artifacts. Lossless coding schemes (L2AC) obtain bit-exact compression by eliminating the statistical dependencies associated with the signal via prediction techniques. Several prediction modeling methods for L2AC have been proposed, namely, FIR/IIR prediction [Robi94] [Crav96] [Crav97], polynomial approximation/curve fitting methods [Robi94] [Hans98b], transform coding [Pura97] [Geig02], high-order context modeling [Qiu01], backward adaptive prediction methods [Angu97], subband linear prediction [Giur98], and set partitioning [Raad02]. L2AC algorithms are broadly classified into two categories, namely, prediction-based and transform-based.

Prediction-based L2AC algorithms employ linear predictive (LP) modeling or some form of polynomial approximation to remove redundancies in the waveform. The SHORTEN, the DVD, the MUSICompress, and the C-LPAC use LP analysis, while the AudioPaK is based on curve-fitting methods. Transform-based coding schemes employ specific transforms to decorrelate samples within a frame. The LTAC scheme proposed by Purat et al. and the IntMDCT scheme introduced by Geiger et al. are two L2AC schemes that use transform coding.

In all the aforementioned L2AC methods, the idea is to obtain a close approximation to the input audio and compute a low-variance prediction residual that can be coded efficiently. Typically, the residual error is entropy coded using one of the following methods: Huffman coding, Lempel-Ziv coding, run-length coding, arithmetic coding, or Rice coding. Table 11.1 lists the various L2AC coders and their associated prediction and entropy coding methods.

Before getting into the details of lossless coding of digital audio, let us take a look at some of the lossless data compression algorithms, i.e., PkZip, WinZip, WinRAR, gzip, and TAR. Although these compression algorithms achieve 40–60% bit rate reductions with data and text files, they achieve a mere 10% in the case of audio. [Hans01] lists an interesting example that illustrates the following: a PostScript file of 4.56 MB can be compressed with PkZip to 696 KB, achieving a bit rate reduction of about 85%, whereas an audio file of 33.72 MB can be compressed to 31.57 MB, resulting in a bit-rate reduction of only about 6.5%. In designing and evaluating L2AC algorithms, one must address two key issues, i.e., the trade-offs between compression ratio and average residual energy per frame, and the selection of the entropy coding method.

11.2.1 L2AC Principles

A general block diagram of an L2AC algorithm is depicted in Figure 11.2. First, a conventional lossy audio coding scheme (e.g., MPEG-1 layer III) is used to produce a compressed bitstream. Then, an error signal is calculated by subtracting

Table 11.1 Prediction and entropy coding schemes in some of the L2AC schemes

L2AC scheme      Prediction model                           Entropy coding used
SHORTEN          FIR prediction filter                      Rice coding
DVD              IIR prediction                             Huffman coding
MUSICompress     Adaptively varied approximations           Huffman coding
AudioPaK         Curve fitting/polynomial approximations    Golomb coding
LTAC             Orthonormal transform coding               Rice coding
IntMDCT          Integer transform coding                   Huffman coding
C-LPAC           FIR prediction                             High-order context modeling

[Figure not reproduced: the audio input s(n) is passed through a lossy compression scheme and a local decoder; the reconstruction is subtracted from s(n) to form the residual error e(n), and the lossy compressed audio x(n) and e(n) are multiplexed onto the channel; the decoder demultiplexes and adds x(n) and e(n) to recover the original audio s(n).]

Figure 11.2 Lossless audio coding scheme based on the LAC method.

the reconstructed signal from the input signal. Reconstruction involves decoding the compressed bitstream locally at the encoder. The residual is encoded using entropy coding methods and transmitted along with the encoded (lossy) audio bitstream. The transmission of the coding error enables the decoder to obtain an exact replica of the input audio, resulting in a lossless scheme. An alternative to this scheme was also proposed, where, instead of using a lossy audio compression scheme, a predictor is used to approximate the input audio signal.
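The principle reduces to a few lines of code; in the sketch below, an 8-bit requantizer stands in for the lossy coder (e.g., MPEG-1 layer III), and the residual is left uncoded rather than entropy coded.

import numpy as np

def lossy_round(s, bits=8):
    # Stand-in lossy coder: requantize 16-bit PCM to 'bits' bits.
    step = 2 ** (16 - bits)
    return step * np.round(s / step)

def l2ac_encode(s):
    x = lossy_round(s)          # lossy compressed/decoded audio
    e = s - x                   # residual error (entropy-coded in practice)
    return x, e

def l2ac_decode(x, e):
    return x + e                # bit-exact reconstruction

s = np.array([1201, -3400, 15, 9877], dtype=np.int64)
x, e = l2ac_encode(s)
assert np.array_equal(l2ac_decode(x, e), s)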

11.2.2 L2AC Algorithms

11.2.2.1 The SHORTEN Algorithm. Framing, prediction, and residual coding are the three primary stages associated with the SHORTEN algorithm. The algorithm operates on 16-bit linear PCM signals sampled at 16 kHz. The audio signal is divided into blocks of 16 ms, corresponding to 256 samples. An L-th order LP analysis is performed using the Levinson-Durbin algorithm. The LP analysis window (usually rectangular) is typically L samples longer than the frame size, i.e., N = (256 + L). Typically, a 10th-order (L = 10) LP filter is employed.

A prediction technique based on curve fitting and polynomial approximations has also been proposed in [Robi94] (Figure 11.3). This is a simplified form of

[Figure not reproduced: the input audio samples x(n − 3), x(n − 2), x(n − 1), x(n) and the predicted samples x̂0(n) = 0, x̂1(n) = x(n − 1), x̂2(n) = 2x(n − 1) − x(n − 2), and x̂3(n) = 3x(n − 1) − 3x(n − 2) + x(n − 3) plotted against n.]

Figure 11.3 Polynomial approximation of x(n) used in the SHORTEN [Robi94] and in the AudioPaK [Hans98a] algorithms.

linear predictor that uses a selection criterion by fitting an L-th order polynomial to the last L data points. For instance, four polynomials, i.e., i = 0, 1, 2, 3, are formed as follows:

x̂0(n) = 0
x̂1(n) = x(n − 1)
x̂2(n) = 2x(n − 1) − x(n − 2)
x̂3(n) = 3x(n − 1) − 3x(n − 2) + x(n − 3)        (11.1)

Note that the FIR predictor coefficients in the above equation are integers. The residual ei(n) is computed by subtracting the corresponding polynomial estimate from the actual signal, i.e.,

ei(n) = x(n) − x̂i(n)        (11.2)

From (11.1) and (11.2), we obtain

e0(n) = x(n)        (11.3)


Similarly, e1(n) is given by

e1(n) = x(n) − x̂1(n)        (11.4)

Finally, from (11.1) and (11.3), we get

e1(n) = e0(n) − e0(n − 1)        (11.5)

The above equations lead to an efficient recursive algorithm to compute the polynomial prediction residuals, i.e.,

e0(n) = x(n)
e1(n) = e0(n) − e0(n − 1)
e2(n) = e1(n) − e1(n − 1)
e3(n) = e2(n) − e2(n − 1)        (11.6)

Next, the residual energy Ei corresponding to each of the four residuals is computed:

Ei = Σn ei^2(n),   for i = 0, 1, 2, 3        (11.7)

The polynomial that results in the smallest residual energy is considered as the best approximation for that particular frame. It should be noted that the prediction based on polynomial approximations does not always result in maximum data rate reduction, often due to the poor estimation of signal statistics. On the other hand, FIR predictors with real-valued coefficients are more versatile than integer predictors. Based on experiments conducted by Robinson [Robi94], the residual e(n) exhibits a Laplacian distribution. Usually, a Huffman coder that best fits the Laplacian distribution is employed for residual packing. Note that these codes are also called Rice codes in the entropy coding literature. An error statistic of 0.004 bits/sample has been reported in [Robi94].
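The recursion of Eq. (11.6) and the selection rule of Eq. (11.7) translate almost directly into code; the frame length, the handling of the three carried-over samples, and the example signal below are choices made for this sketch.

import numpy as np

def shorten_poly_residuals(frame, prev3):
    # Residuals e0..e3 of Eqs. (11.1)-(11.6) for one frame, given the last
    # three samples of the previous frame, and the order that minimizes the
    # residual energy of Eq. (11.7) (a sketch of the SHORTEN/AudioPaK selection).
    x = np.concatenate((prev3, frame)).astype(np.int64)
    e0 = x                     # e0(n) = x(n)
    e1 = np.diff(e0)           # e1(n) = e0(n) - e0(n-1)
    e2 = np.diff(e1)
    e3 = np.diff(e2)
    residuals = [e0[3:], e1[2:], e2[1:], e3]      # aligned to the frame
    energies = [int(np.sum(r ** 2)) for r in residuals]
    best = int(np.argmin(energies))               # smallest residual energy
    return best, residuals[best]

# Example: a slowly varying ramp is captured by the second-order predictor
frame = np.arange(100, 132)
best_order, res = shorten_poly_residuals(frame, prev3=np.array([97, 98, 99]))
print(best_order)   # 2 for a linear ramp (e2 is identically zero)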

11.2.2.2 The AudioPaK Coder. The AudioPaK algorithm, developed by Hans et al. [Hans98a] [Hans98b], uses linear prediction with integer coefficients and Golomb-Rice coding for packing the error. In particular, the AudioPaK coder employs an adaptive polynomial approximation method in order to eliminate the intrachannel dependencies. In addition to the intrachannel decorrelation, the AudioPaK algorithm also allows interchannel decorrelation in the case of the stereo and dual modes.¹ Interchannel decorrelation in these modes can be achieved as follows. First, the residuals associated with the left and right channels are computed, followed by the differences between the left and the right channel residuals. An average residual energy Ei corresponding to each of the four right-channel residuals and E′i corresponding to the four difference residuals are computed (similar to Eq. (11.7)). The residual that results in the smallest residual energy among E0, E1, E2, E3, E′0, E′1, E′2, and E′3 is considered as the best approximation for that particular frame. Hans and Schafer reported small improvements in the compression rates when the interchannel decorrelation is included [Hans01].

¹ Recall that in the stereo mode, two audio channels that form a stereo pair (left and right) are encoded with one bitstream. On the other hand, in the dual mode, two audio channels with independent program contents, e.g., bilingual, are encoded within one bitstream. However, the encoding process is the same for both modes [ISOI92].

Golomb Coding. The entropy coding techniques employed in AudioPaK are inspired by the Golomb codes used in the lossless image compression scheme [Wein96] [ISOJ97]. First, the frames are classified as silent (constant) or time varying. In the case of silent frames, the residuals e0(n) and e1(n) are used as the predictor polynomials. In particular, residual e0(n) is used to encode silent frames made of zeros, and residual e1(n) is used to encode silent frames made of a constant value. For frames that are classified as time varying, Golomb coding is employed as follows. Golomb codes are optimal for exponentially decaying probability distributions of positive integers. Therefore, a mapping process is required to reorder the negative integers to positive values. This is done as follows:

e′i(n) = 2ei(n),           if ei(n) ≥ 0
e′i(n) = 2|ei(n)| − 1,     otherwise        (11.8)

where e′i(n) denotes the mapped value and ei(n) is the residual corresponding to the i-th polynomial. Note that the Golomb codes are prefix codes that can be characterized by a unique parameter "m." An integer "I" can be encoded using a Golomb code as follows. If m = 2^k, the codeword for "I" consists of the "k" LSBs of "I," followed by the number formed by the remaining MSBs of "I" in unary representation, and a stop bit. Therefore, the length of the code is k + ⌊I/2^k⌋ + 1. For example, if I = 69 and m = 16, i.e., k = 4: Part 1: the 4 LSBs of [1000101] = 0101; Part 2: unary(100) = unary(4) = 1111; stop bit = 0. The Golomb code = [010111110], and the length of the code = 9. Usually, the value of "k" is constant over an entire frame and is computed based on the expectation E(|ei(n)|) as follows:

k = [log2(E{|ei(n)|})]        (11.9)

The lower and upper bounds of k are given by 0 and 15, respectively, for a 16-bit audio input.
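A short sketch of the Rice (Golomb with m = 2^k) encoder described above, including the signed-to-unsigned mapping of Eq. (11.8), is given below; the LSBs-first bit ordering follows the description in the text.

def rice_map(e):
    # Map a signed residual to a non-negative integer (Eq. (11.8)).
    return 2 * e if e >= 0 else 2 * abs(e) - 1

def rice_encode(value, k):
    # Golomb code with m = 2**k: k LSBs, then the quotient in unary, then a stop bit.
    lsbs = format(value & ((1 << k) - 1), '0{}b'.format(k))
    quotient = value >> k
    return lsbs + '1' * quotient + '0'     # length = k + quotient + 1

print(rice_encode(69, 4))                  # '010111110', 9 bits
print(rice_encode(rice_map(-5), 4))        # mapped residual, then Rice-coded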

11.2.2.3 The DVD Algorithm. Craven et al. proposed an L2AC scheme based on IIR prediction filters [Crav97]. This was later adopted in the DVD standard AUDIO SPECIFICATIONS, Part 4 [DVD99]. An important insight into FIR/IIR prediction techniques for lossless compression is given in [Crav96]. FIR prediction fails to provide satisfactory compression when dealing with signals exhibiting wide dynamic ranges (>60 dB). Usually, these wider dynamic ranges in the spectrum of input audio are more likely when the sampling frequencies are in excess of 96 kHz [ARA95]. To this end, Craven et al. pointed out that IIR filter prediction techniques [Crav96] [Crav97] perform well for these cases (Figure 11.4).

[Figure not reproduced: panel (a) plots spectral level (dB) versus frequency (kHz) for the input audio, the FIR residual for Np = 10, and the IIR residual for Na = 5, Nb = 3; panel (b) plots bit resolution versus frequency (kHz).]

Figure 11.4 (a) A simulation example depicting the advantages of employing IIR predictors over FIR predictors in L2AC schemes. Note that the FIR predictor fails to flatten the drop in the input spectrum above 20 kHz, while a third-order IIR predictor is able to do this efficiently. (b) Bit resolution as a function of frequency. The shaded area represents the total information content of a 96-kHz sampled signal (after [Crav96]).

The DVD algorithm [Crav96] works as follows. First, the input audio is divided into frames of lengths that are integer multiples of 384. A third-order (Nb = Na = 3) IIR predictor with fine coefficient quantization was proposed. Figure 11.4(a) illustrates the advantages of employing an IIR predictor. Note that a 10th-order FIR predictor fails to flatten the drop (about 50 dB above 20 kHz) in the input spectrum. On the other hand, a third-order IIR predictor is able to do this efficiently. IIR prediction schemes provide superior performance over FIR prediction techniques for the cases where control of both average and peak data rates is equally important. Moreover, IIR predictors are particularly effective when large portions of the spectrum are left unoccupied, i.e., if the filter roll-off is much less than the sampling rate, fc ≪ Fs, as shown in Figure 11.4(b). Note that the area under the curve in Figure 11.4(b) represents the total information content, and fc ≈ 30 kHz, Fs = 96 kHz. Results of a 1-bit reduction per sample per channel in the data rate were reported in [Crav96] when the order in the denominator was incremented (Nb = 3, Na = 4). In other words, a performance improvement of 6 dB can be realized by simply using an extra order in the denominator of the predictor transfer function. A special quantization structure is proposed in order to avoid precision errors while dealing with fixed-point DSP computations [DVD99]. A Huffman coder that best fits the Laplacian distribution is employed for the residual packing. The Huffman codes are chosen according to an adaptive scheme [Crav97] (based on the input quantization step size) in order to accommodate a wide range of input word lengths.
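To make the pole-zero (IIR) prediction concrete, the sketch below computes the residual of a predictor driven by past input samples and past residuals, which a decoder can invert exactly; the coefficient values are placeholders, not those of the DVD specification.

import numpy as np

def iir_predictor_residual(x, b, a):
    # Residual of a pole-zero predictor:
    #   e(n) = x(n) - sum_k b[k] x(n-k) - sum_k a[k] e(n-k),
    # which the decoder inverts as x(n) = e(n) + sum_k b[k] x(n-k) + sum_k a[k] e(n-k).
    Nb, Na = len(b), len(a)
    e = np.zeros(len(x))
    for n in range(len(x)):
        pred = 0.0
        for k in range(1, Nb + 1):
            if n - k >= 0:
                pred += b[k - 1] * x[n - k]      # feed-forward (zeros)
        for k in range(1, Na + 1):
            if n - k >= 0:
                pred += a[k - 1] * e[n - k]      # feedback (poles)
        e[n] = x[n] - pred
    return e

# Placeholder third-order coefficients (illustrative only)
e = iir_predictor_residual(np.random.randn(1024),
                           b=[1.5, -0.8, 0.2], a=[0.3, -0.1, 0.05])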

It is interesting to note that the L2AC IIR prediction techniques [Crav97] described in this section resulted in new initiatives towards a more efficient lossless packing of digital audio [Gerz99]. In particular, these ideas were integrated in the meridian lossless packing (MLP) algorithm proposed by Gerzon et al., which became an integral part of the DVD-audio specifications [DVD99].

11.2.2.4 The MUSICompress. Wegner at Soundspace Audio developed a low-complexity, low-MIPS, fast L2AC scheme called the MUSICompress (MC) algorithm [Wege97]. This algorithm employs an adaptive approximator to estimate

[Figure not reproduced: the audio input s(n) is fed to an approximator; the approximated audio ŝ(n) is subtracted from s(n) to form the residual error e(n), and each compressed bitstream block carries a header, the approximated audio, and the error.]

Figure 11.5 The MUSICompress (MC) algorithm L2AC scheme.

the residual error, as shown in Figure 11.5. Each encoded bitstream block contains a header, the approximated audio ŝ(n), and the residual e(n). The header block specifies the type of approximator used for compression. Depending upon the bit budget and computational complexity, the MC algorithm can also support lossy compression by discarding LSBs from the encoded bitstream. A metric for comparing L2AC algorithms is also proposed in [Wege97]. This is given by

L2AC metric = Compression ratio / Number of MIPS        (11.10)

Although this metric is not universal, it provides a rough estimate by considering both the compression ratio and the complexity associated with an L2AC algorithm. At the decoder, these compressed bitstream chunks are first demultiplexed. An interpreter reads the header structure, which provides information regarding the approximator used in the encoder. A synthesis operation is performed at the decoder to reconstruct the lossless audio signal. Real-time DSP and hardware implementation aspects of the MC algorithm, along with the compression results, are given in [Wege97].

11.2.2.5 The Context Modeling – Linear Predictive Audio Coding (C-LPAC). Inspired by applications of context modeling in image coding [Wu97] [Wu98], Qiu proposed an L2AC scheme called C-LPAC [Qiu01]. The main idea behind the C-LPAC scheme is that the residual error e(n) is encoded based on higher-order context modeling that uses local (neighborhood) information to estimate the probability of the occurrence of an event. An FIR linear prediction is employed to compute the prediction coefficients and the residual error. An error feedback is also employed in order to model the random process rp(n), i.e.,

spred(n) = Σ_{k=1}^{L} ak s(n − k) + rp(n)        (11.11)

where spred(n) is the predicted value based on the previous L samples of the input signal s(n), and ak are the prediction coefficients. The LP coefficients are computed by minimizing the error e(n) between the predicted and the original signal as follows:

e(n) = s(n) − spred(n)        (11.12)

e(n) = s(n) − Σ_{k=1}^{L} ak s(n − k) − rp(n),   n = 1, 2, ..., N        (11.13)

The mean square error (MSE) is given by

ε = (1/N) Σ_{n=1}^{N} e^2(n)        (11.14)


where N is the number of samples. Note that the minimization of the MSE, ε, in (11.14) fails to model the random process rp(n), since the audio frames are assumed to be stationary with constant mean and variance. Therefore, in order to estimate correctly the random process rp(n), the residual error e(n) is modified as follows:

ef(n) = e(n) − Δ(n),   for the encoder
ef(n) = e(n) + Δ(n),   for the decoder        (11.15)

where Δ(n) = F(e(n − 1), e(n − 2), ..., e(n − nf))        (11.16)

where nf denotes the number of error samples fed back to compute the correction factor, and Δ(n) is based on an arbitrary function F. Although details on the selection of nf and the function F are not included in [Qiu01], we note that the basic idea is to employ an error feedback in order to account for the estimation of the random process rp(n). The modified residual error, ef(n), is entropy coded using high-order context modeling. In particular, bit-plane encoding is employed [Kunt80] [Cai95]. For example, let e(i − 1), e(i), and e(i + 1) be the residual samples. Each bit plane (BP 0 through BP 7) contains a representation of each significant bit corresponding to a sample. The bit-planes BP 0, ..., BP 7 contain the bits corresponding to the MSB, ..., LSB, respectively, and, therefore, the most significant information is included in the top few bit-planes:

              BP0            BP7
               ↓               ↓
e(i − 1) = 63 (0 0 1 1 1 1 1 1)
e(i)     = 64 (0 1 0 0 0 0 0 0)
e(i + 1) = 65 (0 1 0 0 0 0 0 1)        (11.17)

In the above example, bit-plane 0 contains (0, 0, 0), bit-plane 1 contains (0, 1, 1), and so on. First, a local energy ξ_e(i) = Σ_{m1=1}^{p} |e(i − m1)| + Σ_{m2=1}^{q} |e(i + m2)| in the neighborhood of 'p' samples before e(i) and 'q' samples after e(i) is computed. Depending on the energy level and the bit-plane position (0 through 7 in our example), the significance of the coding symbol e(i) is computed. Expressions for context modeling for "significance" coding, "sign" coding, and "refinement" coding are given in [Qiu01]. The research involving the higher-order context modeling applied to L2AC uses concepts from image coding [Wu97] [Wu98].
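To make the bit-plane view concrete, the short Python sketch below splits a few residual samples into bit-planes (as in Eq. (11.17)) and computes the local energy ξ_e(i) used to judge the significance of a symbol. It is only an illustration of the data layout; the actual C-LPAC context modeling and entropy coding of [Qiu01] are not reproduced here, and the window sizes p and q are arbitrary choices.

```python
import numpy as np

def bit_planes(residuals, n_bits=8):
    """Split non-negative residual samples into bit-planes.
    Plane 0 holds the MSB of every sample, plane n_bits-1 the LSB (cf. Eq. (11.17))."""
    r = np.asarray(residuals, dtype=np.uint8)
    return [((r >> (n_bits - 1 - p)) & 1) for p in range(n_bits)]

def local_energy(e, i, p=1, q=1):
    """Local energy around sample i: sum of |e| over p past and q future samples."""
    past = e[max(i - p, 0):i]
    future = e[i + 1:i + 1 + q]
    return int(np.sum(np.abs(past)) + np.sum(np.abs(future)))

if __name__ == "__main__":
    e = np.array([63, 64, 65])                    # residual samples from Eq. (11.17)
    for plane, bits in enumerate(bit_planes(e)):
        print(f"BP{plane}: {bits.tolist()}")      # BP0 -> [0, 0, 0], BP1 -> [0, 1, 1], ...
    print("local energy around e(i):", local_energy(e, 1))
```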

11.2.2.6 The Lossless Transform Audio Coding (LTAC). The LTAC algorithm [Pura97] employs a unitary transform (A^H = A^{−1}). In particular, an orthonormal DCT is used to decorrelate the input signal. Figure 11.6 shows the LTAC block diagram, where the input audio s(n) is first buffered in N_buff-sample blocks, transformed to the frequency domain using the DCT, and then quantized to obtain the spectral coefficients X(k). Both fixed and adaptive block length modes are available for the DCT block transformation. In the fixed block length mode, the DCT block size is equal to the number of input samples (N_buff = 128, 256, 1024, and 2048). On the other hand, in the adaptive block length mode, the number of input samples per block is N_buff = 2048, and the transformation that results in the optimum performance among the 512-, 1024-, or 2048-point DCT is selected. Higher block lengths often provide good decorrelation of the input signal. However, this is true only when the input signal is not rapidly time-varying. The LTAC algorithm works as follows. An estimate of the input audio, ŝ(n), is computed at the local decoder by performing the inverse transformation (i.e., IDCT), as shown in Figure 11.6. An error signal (residual) is calculated by subtracting the estimated signal from the input signal:

e(n) = s(n) − ŝ(n)        (11.18)

where ŝ(n) = Q⟨IDCT[Q⟨DCT[s(n)]⟩]⟩ and 'Q' denotes quantization. The residual e(n) is entropy coded using Rice codes and transmitted along with the DCT coefficients X(k), as shown in Figure 11.6. Compression rates in the range of 59–72 and performance improvements over prediction-based L2AC schemes have been reported in [Pura97].
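The LTAC structure (a lossy transform path plus an exactly coded residual) can be sketched in a few lines of Python. The sketch below is only illustrative: it uses plain rounding as the quantizer Q, SciPy's orthonormal DCT-II in place of the specific transform and quantizer of [Pura97], and omits the Rice/entropy coding of e(n) and X(k); the block size and test signal are arbitrary.

```python
import numpy as np
from scipy.fft import dct, idct  # orthonormal DCT-II and its inverse

def ltac_encode(block):
    """One LTAC-style block: quantized DCT coefficients X(k) plus the integer residual e(n)."""
    X = np.round(dct(block.astype(float), norm="ortho")).astype(np.int64)   # lossy part
    s_hat = np.round(idct(X.astype(float), norm="ortho")).astype(np.int64)  # local decoder
    e = block - s_hat                                                       # Eq. (11.18)
    return X, e

def ltac_decode(X, e):
    """Exact reconstruction: repeat the local-decoder step and add the residual."""
    s_hat = np.round(idct(X.astype(float), norm="ortho")).astype(np.int64)
    return s_hat + e

if __name__ == "__main__":
    n = np.arange(1024)
    s = np.round(8000 * np.sin(2 * np.pi * 440 * n / 44100)).astype(np.int64)  # one PCM block
    X, e = ltac_encode(s)
    assert np.array_equal(ltac_decode(X, e), s)        # lossless reconstruction
    print("residual magnitude (max):", int(np.abs(e).max()))
```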

Figure 11.6. Lossless transform audio coding (LTAC).

11.2.2.7 The IntMDCT Algorithm. Inspired by the use of the DCT and other integer transforms in lossless audio coding, Geiger et al. proposed the integer MDCT (IntMDCT) algorithm [Geig01] [Geig02]. One of the attractive properties that has contributed to the widespread use of the MDCT in audio standards is the availability of FFT-based algorithms [Duha91] [Sevi94] that make it viable for real-time applications. MDCT features that are important in audio coding, including its forward and inverse transform interpretations, prototype filter (window) design criteria, and window design examples, are described in Chapter 6. Recall that in the orthogonal case, the generalized perfect reconstruction conditions given in Chapter 6, Eqs. (6.19) through (6.23), can be reduced to linear-phase and Nyquist constraints on the window w(n), i.e.,

−w(2M − 1 − n) + w(n) = 0,  and  w²(n) + w²(n + M) = 1,  for 0 ≤ n ≤ M − 1        (11.19)

where M denotes the number of transform coefficients.

Givens Rotations and Lifting Scheme. The IntMDCT algorithm is based on the integer approximation of the MDCT algorithm. This is accomplished using the "lifting scheme" [Daub96] proposed by Daubechies et al. First, the MDCT is decomposed [Geig01] into Givens rotations² by choosing c and s in Eq. (11.20) so that a vector X is orthogonally transformed (rotated) into a vector Y as follows:

             Givens rotation
                   ⇓
Y = | y1 |   | c  −s | | x1 |   | y1 |
    | y2 | = | s   c | | x2 | = |  0 |        (11.20)

Note from the above equation

y2 = −s x1 + c x2 = 0,  and  c² + s² = 1        (11.21)

Details on the decomposition of the MDCT into windowing, time-domain aliasing, and DCT-IV, and eventually into Givens rotations, are given in [Vaid93]. Note the similarities between Eqs. (11.19) and (11.21). The lifting scheme is applied next in order to approximate the Givens rotations. The lifting scheme splits the Givens rotation into three lifting steps, Eq. (11.22), that can be easily approximated and mapped onto the integer domain [Daub96]. This integer mapping is reversible and therefore contributes to the lossless coding. Equations (11.23) and (11.24) show that the lifting scheme is indeed a reversible integer transform, since each of the three lifting steps can be inverted uniquely by simply subtracting the value that has been added.

Lifting scheme:

| c  −s |   | 1  (c − 1)/s | | 1  0 | | 1  (c − 1)/s |
| s   c | = | 0      1     | | s  1 | | 0      1     |        (11.22)

² Note that c and s represent cos(α) and sin(α), where α corresponds to the angle of rotation of some arbitrary plane. In particular, the Givens rotation results in an orthogonal transformation without changing any of the norms of the vectors X and Y.


The lifting scheme and its inverting property are described as follows

Y = | c  −s | X = | 1  (c − 1)/s | | 1  0 | | 1  (c − 1)/s | X
    | s   c |     | 0      1     | | s  1 | | 0      1     |

X = | c  −s |⁻¹ Y = |  c  s | Y = | 1  −(c − 1)/s | |  1  0 | | 1  −(c − 1)/s | Y        (11.23)
    | s   c |       | −s  c |     | 0       1     | | −s  1 | | 0       1     |

The lifting scheme is a reversible integer transform. Consider the first lifting step, [1, (c − 1)/s; 0, 1]:

Encoding:

Y¹ = | y1¹ | = | 1  (c − 1)/s | | x1¹ | = | x1¹ + x2¹ (c − 1)/s |
     | y2¹ |   | 0      1     | | x2¹ |   |         x2¹         |        (11.24)

Decoding:

X¹ = | x1¹ | = | 1  −(c − 1)/s | | y1¹ | = | y1¹ − y2¹ (c − 1)/s |
     | x2¹ |   | 0       1     | | y2¹ |   |         y2¹         |
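A minimal Python sketch of this idea is given below: a single Givens rotation approximated by three lifting steps with rounding, and its exact inverse obtained by subtracting the same rounded values in reverse order. The rotation angle and the test vectors are arbitrary, and the full IntMDCT (windowing, time-domain aliasing, and DCT-IV, all expressed through such rotations) is not reproduced; the sketch only demonstrates the reversible integer mapping.

```python
import math

def lift_rotate(x1, x2, alpha):
    """Integer approximation of a Givens rotation by angle alpha via three
    lifting steps (Eq. (11.22)); each step adds a rounded update."""
    c, s = math.cos(alpha), math.sin(alpha)
    t = (c - 1.0) / s
    x1 = x1 + round(t * x2)   # first lifting step
    x2 = x2 + round(s * x1)   # second lifting step
    x1 = x1 + round(t * x2)   # third lifting step
    return x1, x2

def lift_rotate_inv(y1, y2, alpha):
    """Exact inverse: undo the three steps in reverse order (cf. Eq. (11.23))."""
    c, s = math.cos(alpha), math.sin(alpha)
    t = (c - 1.0) / s
    y1 = y1 - round(t * y2)
    y2 = y2 - round(s * y1)
    y1 = y1 - round(t * y2)
    return y1, y2

if __name__ == "__main__":
    alpha = 0.7
    for x in [(12345, -6789), (3, 4), (-1000, 1)]:
        y = lift_rotate(*x, alpha)
        assert lift_rotate_inv(*y, alpha) == x   # reversible integer mapping
        print(x, "->", y)
```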

The IntMDCT is typically used in conjunction with the MDCT-based perceptual coding scheme, as shown in Figure 11.7. First, the input audio s(n) is compressed using a perceptual coding scheme (e.g., the ISO/IEC MPEG audio algorithms), resulting in a lossy compressed audio X(k). Later, these coefficients are inverse quantized and integer rounded, and subtracted from the IntMDCT spectrum coefficients. This results in an error residual e(n). The error essentially carries the enhancement bitstream that contributes to lossless coding. The error spectrum and the MDCT coefficients can be entropy coded using Huffman codes. Bit-sliced arithmetic coding (BSAC), which inherits the properties of fine-grain scalability, can also be used to encode the lossy audio and the enhancement bitstream.

In BSAC, arithmetic coding replaces Huffman coding. In each frame, bit planes are coded in order of significance, beginning with the most significant bits (MSBs). This results in a fully embedded coder containing all lower-rate codecs. With proper tuning of quantization steps, the IntMDCT algorithm shown in Figure 11.7 can also be used in perceptual (lossy) audio coding applications [Geig02] [Geig03].

Figure 11.7. Lossless audio coding using the IntMDCT algorithm.

11.3 DVD-AUDIO

The DVD Forum Committee recommended standardizing the DVD-A specifications [DVD01] and the related coding techniques for DVD-A, i.e., the Meridian lossless packing [DVD99]. DVD-A offers two important features, i.e., multiple sampling rates (44.1, 48, 88.2, 96, 176.4, and 192 kHz) and multiple bit resolutions (16-, 20-, and 24-bit). The DVD-A format can store up to 6 channels of digital media at 24-bit resolution sampled at 96 kHz. Table 11.2 lists some of the important parameters of the DVD-A format. DVD-A provides a dynamic range of the order of 144 dB over a 96-kHz frequency range, and the maximum input data rate allowed is 9.6 Mb/s. Along with these features, DVD-A addresses the digital content protection and copyright issues with a variety of encryption and audio watermarking technologies. A 74-min, 96-kHz, multichannel, 20-bit-resolution audio recording will require a minimum of 8.93 GB of storage space (see Table 11.2). However, the maximum storage capacity offered by DVD-A is 4.7 GB. Therefore, lossless compression is required. The Meridian lossless packing (MLP) scheme [Gerz99] [Kuzu03], developed by Gerzon et al., has been chosen as the compression standard for the DVD-Audio format.

11.3.1 Meridian Lossless Packing (MLP)

The MLP encoding scheme [Gerz99] was built around the work done by Craven and Gerzon [Crav96] [Crav97] and Oomen, Bruekers, and Vleuten [Oome96]. Furthermore, it inherits some features from the DVD algorithm discussed in Section 11.2.2. Figure 11.8 shows the five important stages associated with the MLP encoding scheme. These include channel remapping and shifting, interchannel decorrelation, intrachannel decorrelation (or prediction), entropy coding (lossless packing), and interleaving/MLP bitstream formatting. In the first stage, N channels of PCM audio are remapped and shifted in order to facilitate efficient buffering of MLP substreams and to improve bit precision. MLP can encode a maximum of 63 audio channels. In the second stage, the interchannel dependencies within these N channels are exploited. Typically, matrixing techniques (see Chapter 10) are employed to eliminate interchannel redundancies/irrelevancies. In particular, MLP employs a lossless matrixing scheme that is described in Figure 11.9. From this figure, it can be noted that only one PCM audio channel, for instance x1, is modified, while the channels x2 through xN are unchanged. The coefficients [a2, a3, a4, ..., aN] are computed for each audio frame such that the interchannel redundancy is minimized. Stage 3 is concerned with the intrachannel redundancy removal based on FIR/IIR prediction techniques.

11.4 SUPER-AUDIO CD (SACD)

Almost 20 years after the launch of the audio CD format [PhSo82] [IECA87], Philips and Sony established a new storage format called the "super-audio CD" (SACD) [SACD02] [SACD03]. Unlike the conventional CD format, which employs PCM encoding with a 16-bit resolution and a sampling rate of 44.1 kHz, the SACD system uses a superior 1-bit recording technology called direct stream digital (DSD) [Nish96a] [Nish96b] [Reef01a]. SACD operates at a sampling frequency 64 times the rate used in conventional CDs, i.e., 64 × 44.1 kHz = 2.8224 MHz.

Table 11.2. Conventional CDs versus the next-generation storage formats

Parameter | Conventional CD | SACD | DVD-Audio
Standardized, year | 1982 [IECA87] | 1999 [SACD02] | 1999 [DVD01]
Encoding scheme | Pulse code modulation (PCM) | Direct-stream digital (DSD) | Meridian lossless packing (MLP)
Bit resolution | 16-bit linear PCM | 1-bit DSD | 16-, 20-, or 24-bit linear PCM
Sampling rate | 44.1 kHz | 2822.4 kHz | 44.1, 48, 88.2, 96, 176.4, or 192 kHz
Data rate | 1.4 Mb/s | 5.6 to 8.4 Mb/s | < 9.6 Mb/s
Frequency range | DC to 20 kHz | DC to 100 kHz | DC to 96 kHz
Dynamic range | 96 dB | ~120 dB | ~144 dB
Number of channels | 2 | 2 to 6 | 1 to 6
Playback time | 74 min | ~74 (±5) min | ~70 to 800 min (depending upon the bit rate)
Maximum storage capacity available | ~750 to 800 MB | ~4.38 to 4.7 GB (single layer) | ~4.7 GB
Storage required (74 min) | 74 × 60 × 44100 × (2) × (16/8) ≈ 740 MB | 74 × 60 × 2.8224 × 10⁶ × (2 + 6) × (1/8) ≈ 11.67 GB | 74 × 60 × 96 × 10³ × (1 + 2 + 6) × (20/8) ≈ 8.93 GB
Compression | Not required | Required | Required

Figure 11.8. MLP encoding scheme [Gerz99]. Stage 1: channel remapping and shifting; Stage 2: interchannel decorrelator (lossless matrixing); Stage 3: intrachannel decorrelator (predictors 1 through N); Stage 4: entropy coders; Stage 5: buffering, interleaving, and MLP bitstream formatting (N: number of audio channels, up to 63 for MLP).

Figure 11.9. Lossless matrixing in MLP. Only one channel, x1, is modified at the encoder:

x′1 = x1 − Q⟨a2 x2⟩ − Q⟨a3 x3⟩ − Q⟨a4 x4⟩ − ... − Q⟨aN xN⟩,   x′i = xi for i = 2, ..., N,

where Q⟨a_i x_i⟩ denotes the quantization of a_i x_i. At the decoder (lossless de-matrixing), decoding x2 through xN is straightforward since those channels are unchanged, and

x1 = x′1 + Q⟨a2 x′2⟩ + Q⟨a3 x′3⟩ + Q⟨a4 x′4⟩ + ... + Q⟨aN x′N⟩.

Because x′i = xi for i ≥ 2, the quantized terms added at the decoder are identical to those subtracted at the encoder; rearranging the decoder equation gives back the encoder equation, and therefore the matrix encoding and decoding are lossless.
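A small Python sketch of this matrixing idea is given below, assuming integer PCM channels, illustrative (hypothetical) coefficients a_i, and floor() standing in for the quantizer Q⟨·⟩. It is not the actual MLP matrixing, which selects the coefficients per frame and combines matrixing with shifting and substream formatting; it only shows why subtracting and re-adding the same quantized cross-terms is exactly invertible.

```python
import numpy as np

def matrix_encode(channels, a):
    """MLP-style lossless matrixing: only channel 0 is modified (cf. Figure 11.9)."""
    x = [ch.copy() for ch in channels]
    for i in range(1, len(x)):
        x[0] -= np.floor(a[i] * x[i]).astype(np.int64)   # quantized cross-term Q<a_i x_i>
    return x

def matrix_decode(channels, a):
    """Inverse matrixing: add back exactly the same quantized cross-terms."""
    x = [ch.copy() for ch in channels]
    for i in range(1, len(x)):
        x[0] += np.floor(a[i] * x[i]).astype(np.int64)   # x_i (i > 0) were never altered
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x1, x2, x3 = (rng.integers(-1000, 1000, 32) for _ in range(3))
    # Channel 0 is strongly correlated with the others, so matrixing removes redundancy.
    x0 = (0.8 * x1 - 0.3 * x2 + 0.5 * x3).astype(np.int64) + rng.integers(-4, 5, 32)
    chans, a = [x0, x1, x2, x3], [0.0, 0.8, -0.3, 0.5]
    enc = matrix_encode(chans, a)
    assert all(np.array_equal(d, c) for d, c in zip(matrix_decode(enc, a), chans))
    print("channel-0 peak before/after matrixing:",
          int(np.abs(chans[0]).max()), int(np.abs(enc[0]).max()))
```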

Moreover, SACD enables a frequency response from DC to 100 kHz with a dynamic range of 120 dB. SACD systems enable both high-resolution surround sound audio recordings (5.1- or 3/2-channel format) as well as high-quality stereo encoding. Some important characteristics of the Red Book audio CD format [IECA87] and the SACD and DVD-Audio storage formats are summarized in Table 11.2.

The main idea behind the DSD technology is to avoid the decimation and interpolation filters that invariably impart quantization noise in the compression process. To this end, DSD directly records each sample as a 1-bit pulse [Hori96] [Nish96a] [Nish96b] [Reef01a], thereby eliminating the need for down-sampling and up-sampling filters (see Figure 11.10(b)). Moreover, the DSD encoding enables improved lossless compression of digital audio [Reef01a] [Jans03], i.e., direct stream transfer (DST) (compression ratios range from 2 to 4). Figures 11.10(a) and (b) show the steps involved in the conventional PCM encoding and the DSD technology, respectively. Before describing the DSD encoding, we provide a brief introduction to the SACD storage format [tenK97] and the sigma-delta (Σ∆) modulators [Cand92] [Sore96] [Nors97] in the next two subsections.

11.4.1 SACD Storage Format

SACDs come in three formats based on the number of layers employed [tenK97] [SACD02] [SACD03] [Jans03]. These include the single-layer, the dual-layer, and the hybrid formats. In the single-layer SACD, a single high-density layer for the high-resolution DSD recording (4.7 GB) is allowed. On the other hand, the dual-layer SACD contains one or two layers of DSD content (2 × 4.7 GB = 8.4 GB). In order to allow backwards compatibility with conventional CD players, a hybrid disc format is standardized. In particular, the hybrid disc contains one inner layer of DSD content and one outer layer of conventional CD content (i.e., 750 MB + 4.7 GB = 5.45 GB). Similar to the conventional CDs, the SACDs have a 12-cm diameter and 1.2-mm thickness. The lens numerical aperture (NA) is 0.6, and the laser wavelengths to read the SACD and the CD content are 650 nm and 780 nm, respectively.

11.4.2 Sigma-Delta (Σ∆) Modulators (SDM)

Sigma-delta (Σ∆) modulators [Cand92] [Sore96] [Nors97] convert analog signals to one-bit digital outputs. Figure 11.11 shows a schematic block diagram of the Σ∆ modulator, which consists of a single-bit quantizer and a loop filter (i.e., a low-pass filter) in a negative feedback loop. Σ∆ modulators output 1-bit samples that indicate whether the current sample is larger or smaller than the value accumulated from the previous samples. If the amplitude of the current sample is greater than the value accumulated during the negative feedback loop, the SDM outputs "1," or otherwise "0." SDM involves high sampling frequencies and a two-level ("0" or "1") quantization process. Oversampling relaxes the requirements on the analog antialiasing filter (see Figure 11.10) and hence it simplifies the analog hardware. Recall from Chapter 3 that multi-bit PCM quantizers (i.e., more than two discrete amplitude levels) suffer from differential nonlinearities, particularly with analog signals. Several solutions have been proposed [Nors92] [Nors93] [Dunn96] [Angu97] to overcome the nonlinear distortion in the PCM quantization process, including oversampling and noise shaping.

Nonlinear distortions are eliminated by switching from multi-bit PCM to single-bit quantizers. However, the single-bit quantizer employed in the Σ∆ modulator results in high quantization noise; therefore, a noise shaping process is essential. Noise shaping removes or "moves" the quantization noise from the audible band (<20 kHz) to the high-frequency band (>50 kHz).

Figure 11.10. (a) Conventional PCM encoding and (b) direct-stream digital (DSD) technology.

Figure 11.11. Sigma-delta (Σ∆) modulator.

From Figure 11.11, the transfer function of the SDM system is given by

(X(z) − Y(z)) H(z) k + E(z) = Y(z)        (11.25)

Y(z) = ( kH(z) / (1 + kH(z)) ) X(z) + ( 1 / (1 + kH(z)) ) E(z)        (11.26)

(the first bracketed factor is the signal transfer function; the second is the noise shaping function)

where H(z) is the loop filter, k is the error in gain, and E(z) is the DC offset corresponding to 1-bit quantization. In order to remove the quantization noise from the audible band, the noise shaping function NS(z) = 1/(1 + kH(z)) must be a high-pass filter, and hence the loop filter H(z) must be a low-pass filter that acts as an integrator. The next important step is to decide the order of the loop filter H(z). A higher-order loop filter results in higher resolution and larger dynamic ranges. We discussed earlier that SACDs provide dynamic ranges on the order of 120 dB. Experimental results [Reef01a] [Jans03] indicate that at least a fifth-order loop filter must be employed in order to obtain a dynamic range greater than 120 dB. However, due to the negative feedback loop employed in the Σ∆ modulator, and depending on the analog input signal amplitude, the SDM may become unstable if the order of the loop filter exceeds two. In order to overcome these stability issues, Reefman and Janssen proposed a fifth-order Σ∆ modulator structure with integrator clipping, designed specifically for DSD encoding [Reef01a] [Reef02]. Several publications that describe SDM structures for 1-bit audio encoding [Angu97] [East97] [Angu98] [Angu01] [Reef01a] [Reef02] [Harp03] [Jans03] have appeared.
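For intuition, the sketch below implements only a first-order Σ∆ modulator in Python (an integrator and a 1-bit quantizer in a feedback loop) and shows that the local density of +1s in the output tracks the input amplitude. The oversampling ratio, tone frequency, and window size are arbitrary illustrative choices; as the text notes, practical DSD encoders use fifth-order (or higher) loop filters with stabilization, which this toy model does not attempt.

```python
import numpy as np

def sigma_delta_1bit(x):
    """First-order sigma-delta modulator: integrate (input minus fed-back output),
    then quantize to +/-1 with a single-bit quantizer."""
    y = np.empty_like(x)
    integ = 0.0
    for n, xn in enumerate(x):
        integ += xn - (y[n - 1] if n > 0 else 0.0)
        y[n] = 1.0 if integ >= 0.0 else -1.0
    return y

if __name__ == "__main__":
    osr = 64                                   # oversampling ratio, as in SACD (64 x 44.1 kHz)
    fs = 44100 * osr
    t = np.arange(fs // 100) / fs              # 10 ms of signal
    x = 0.5 * np.sin(2 * np.pi * 4000 * t)     # 4-kHz test tone, as in Figure 11.13
    bits = sigma_delta_1bit(x)
    win = 256                                  # short analysis windows
    density = bits[: len(bits) // win * win].reshape(-1, win).mean(axis=1)
    print("local mean of the +/-1 stream ranges over",
          (round(float(density.min()), 2), round(float(density.max()), 2)))
```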

11.4.3 Direct Stream Digital (DSD) Encoding

Designed primarily to enhance the performance of 1-bit lossless audio coding, the DSD encoding offers several advantages [Reef01b] over standard PCM technology: it eliminates the requirement of decimation and interpolation filters, enables lossless compression and higher dynamic ranges (~120 dB), reduces storage requirements (since it is 1-bit encoding), allows much higher time resolution (due to higher sampling rates), and involves controlled transient behavior (since it employs simple antialiasing low-pass filters). Moreover, the DSD bitstreams can be easily down-converted to lower frequencies (32 kHz, 44.1 kHz, 96 kHz, etc.). During the mid-1990s, Nishio et al. provided the first ideas of a direct stream digital audio system [Nish96a] [Nish96b] [Nogu97]. At the same time, several other researchers proposed methods for 1-bit (lossless) audio coding [Hori96] [Brue97] [East97] [Ogur97]. In 1999, Philips and Sony released the licensing for the SACD systems and considered the 1-bit DSD encoding as its recording/coding technology. The following year, Reefman and Janssen at Philips Research Labs presented a white paper on the signal processing aspects of DSD for SACD recording [Reef01a]. Since then, several advancements towards efficient DSD encoding have been made [Reef01c] [Hawk02] [Kato02] [Reef02] [Harp03] [SACD03].

Figure 11.12 shows the DSD encoding procedure, which employs a Σ∆ modulator and the direct stream transfer (DST) technique to perform lossless compression.

From Table 11.2, note that an SACD allows 4.7 GB of storage space; however, the storage required for 74 minutes of 1-bit DSD data is approximately 11.67 GB, and hence the need for lossless compression of 1-bit audio. Lossless coding in the case of 1-bit digital signals is altogether a different scenario from what we discussed earlier in Section 11.2. Bruekers et al. in 1997 proposed a lossless coding technique for 1-bit audio signals [Brue97]. This was later adopted as the direct stream transfer (DST) L2AC technique for 1-bit DSD encoding by Philips and Sony.

11.4.3.1 Analog to 1-Bit Digital Conversion (Σ∆ Modulator). First, the analog input signal x(n) is converted to the 1-bit digital output bitstream y(n) (Figure 11.12). The steps involved in the analog to 1-bit digital bitstream conversion have been presented earlier in Section 11.4.2. In this section, we present some simulation results (Figure 11.13) to analyze the dynamic ranges offered in the case of 16-bit linear PCM coding (~96 dB) and 1-bit DSD encoding (~120 dB). Figure 11.13(a) shows the time-domain input waveform x(n), which in this case is a 4-kHz sine wave. Figure 11.13(b) represents the Σ∆ modulator output, i.e., the 1-bit digital output bitstream y(n). Note that as the amplitude of the input signal increases, the density of "1s" increases (plotted as a bar chart), and if the amplitude decreases, the density of "0s" increases. Figure 11.13(c) shows the frequency-domain output of the 16-bit linear PCM quantizer for the 4-kHz sine wave input x(n). Recall from Chapter 3 that the SNR for linear PCM improves by approximately 6 dB per bit. This is given by

SNR_PCM = 6.02 R_b + K₁        (11.27)

where R_b denotes the number of bits and the factor K₁ is a constant that accounts for the step size and loading factors. Therefore, for a 16-bit linear PCM, the output SNR is given by SNR_PCM ≈ 6.02(16) ≈ 96 dB. This can be associated with the dynamic ranges (~96 dB) obtained in the case of conventional CDs that employ 16-bit linear PCM encoding (see Figure 11.13(c)). Figure 11.13(d) shows the spectrum of the 1-bit digital output bitstream.

Figure 11.12. DSD encoding using a Σ∆ modulator and the direct stream transfer (DST) technique for 1-bit lossless audio coding in SACDs.

Figure 11.13. Analog to 1-bit digital conversion: (a) time-domain input sinusoid (4 kHz) x(n); (b) Σ∆ modulator output; (c) frequency-domain output of 16-bit linear PCM encoding (DR ~96 dB); and (d) frequency-domain representation of 1-bit DSD (DR ~120 dB).


Although there are no mathematical models to compute the SNR values directly for 1-bit DSD encoding, several experimental and simulation results are available [Reef01a] [Jans03]. The output SNR values in the case of 1-bit encoding (and therefore the dynamic ranges) are loop-filter-order dependent. In particular, a fourth-order loop filter provides a dynamic range (DR) of 105 dB, a fifth-order loop filter results in a DR of 120 dB, and a seventh-order loop filter offers a DR of 135 dB. In Figure 11.13(d), the slight increase in the noise floor around the 50-kHz range is due to the noise shaping employed in the case of the Σ∆ modulator.

11.4.3.2 Direct Stream Transfer (DST). The 1-bit digital output bitstream (~11 GB) available at the output of the Σ∆ modulator is to be compressed in order to fit into an SACD (~4.7 GB). A lossless coding technique called the DST is employed. DST is somewhat similar to the prediction-based lossless compression techniques discussed earlier in this chapter. However, several modifications are required in order to process the 1-bit digital samples. The DST algorithm [Brue97] [Reef01a] works as follows (see Figure 11.12). First, the 1-bit pulse stream y(n) is buffered into frames of 37,632 bits, y_f(n). A mapping procedure is applied in order to reorder the amplitudes of y_f(n). DST employs a prediction filter A(z) in order to reduce the redundancies associated with samples in a frame. The prediction filter can be adaptive, and the filter orders can range from 1 to 128. It is intuitive that the resulting predicted signal y_fm(n) exhibits multi-bit resolution. The predicted signal y_fm(n) is converted (using a 1-bit quantizer) to single-bit resolution, y_fs(n). The residual r(n) is given by

r(n) = y_f(n) − y_fs(n)        (11.28)

The residual error r(n) is entropy coded based on an error probability look-up table. For each frame, the prediction coefficients a_i, i ∈ [1, 2, ..., 128], the entropy-coded residual r_e(n), and some side information (error probability table, etc.) are arranged according to the DSD bitstream formatting specified by [SACD02].
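The prediction step of DST can be illustrated with the toy Python sketch below: the 0/1 frame is mapped to ±1, an FIR predictor produces a multi-bit estimate, the estimate is requantized to one bit, and the residual of Eq. (11.28) is formed. The first-order coefficient and the test bitstream are arbitrary; the adaptive filter design (orders up to 128), the probability tables, and the arithmetic coding of the real DST algorithm are not reproduced.

```python
import numpy as np

def dst_residual(bits, a):
    """Prediction residual for one 1-bit frame (cf. Eq. (11.28)).
    bits: 0/1 samples; a: FIR prediction coefficients (a[0] multiplies y(n-1))."""
    y = 2 * np.asarray(bits, dtype=np.int64) - 1          # map 0 -> -1, 1 -> +1
    L = len(a)
    r = np.zeros_like(y)
    for n in range(len(y)):
        past = y[n - L:n][::-1] if n >= L else np.zeros(L)
        pred = float(np.dot(a, past))                     # multi-bit prediction y_fm(n)
        y_fs = 1 if pred >= 0 else -1                     # requantized 1-bit prediction y_fs(n)
        r[n] = (y[n] - y_fs) // 2                         # residual is 0 whenever the sign is right
    return r

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    bits = (np.arange(4096) % 2).astype(np.int64)         # strongly patterned (alternating) stream
    bits[rng.integers(0, 4096, 40)] ^= 1                  # a few random flips
    r = dst_residual(bits, a=[-1.0])                      # order-1 predictor expects a sign flip
    print("non-zero residual samples:", int(np.count_nonzero(r)), "of", len(r))
```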

11.4.3.3 Content Protection and Copyright in SACD. Unlike typical audio watermarking techniques employed in conventional CDs, SACDs employ a unique (distinct for each SACD) invisible watermark called the Pit Signal Processing – Physical Disc Mark (PSP-PDM) [SACD02] [SACD03] and three important controls, i.e., Disc-Access Control, Content-Access Control, and Playback Control. Figure 11.14 provides details on these SACD controls and the PSP-PDM. SACD employs several scrambling (content rearrangement) and encryption techniques. In order to descramble and play back the DSD content, knowledge of the PSP-PDM "key" is essential.

11.5 DIGITAL AUDIO WATERMARKING

Audio content protection from unauthorized users and the issue of copyright management are very important these days, considering the vast amount of multimedia resources available. Moreover, with the popularity of efficient search engines and fast Internet connections, it became feasible to search for and download an audio clip over the World Wide Web in the MP3 format, which can offer CD-quality, high-fidelity audio. For example, an audio clip of 6-minute duration in the MP3 format would in general require 3–5 MB of storage space, and note that at 20 kB/s it would take no more than 5 minutes to download the entire audio clip. Because of this, security issues in MP3 audio streaming have become important [Horv02] [Thor02]. Furthermore, with the advancements in the computer industry, any unprotected CD can be reproduced (i.e., copied) with minimal effort. The need for "intellectual property management and protection" (IPMP) [Spec98] [Spec99] [Rose01] [Beck04] motivated content creators and service providers to seek efficient encryption and data hiding techniques. The main idea behind these data hiding techniques is to insert (or hide) an auxiliary data string into the digital media in order to resolve rightful ownership. Some of the important tasks of these encryption/data-hiding techniques include the ability to ensure that the encryption is imperceptible and robust to a variety of signal processing manipulations (e.g., down-sampling, up-sampling, filtering, etc.) [Bend96] [Bone96], to provide reliable information regarding the ownership of a media file [Crav97] [Crav98], to deactivate the playback option if the media clip is found to be a pirated one, and to comply with the digital rights management (DRM) principles [Rose01] [Arno02] [Beck04].

This section addresses various data-hiding techniques employed for audio content protection. A data hiding technique [Bend95] [Peti99] [Cox01a] [Wu03] involves schemes that insert signals of certain characteristics (imperceptible, robust, statistically undetectable, etc.) in the audio. Such signals are called watermarks. A digital audio watermark (DAW) can be defined as a stream of bits embedded (or hidden) in an audio file that offers features such as content protection, intellectual property management, and proof of ownership. Before getting into the details of audio watermarking techniques, we present some of the basic concepts of "content protection" and "copyright management." A simple idea is to include in the audio clip a password or a key that is relatively difficult to be "hacked" in a given amount of time. Details of this key must be essential in order for someone to play back the audio file. This ensures content protection. For copyright management, a data string or some form of a signature that indicates the proof of ownership is included in the audio. However, in both cases, if the key or the signature string is lost, one cannot guarantee either content protection or copyright management. Therefore, in applications related to copyright management, the watermarking techniques are usually combined with encryption schemes. The latter is beyond the scope of this text, and the reader can find more information on several encryption techniques in [Wick95] [Trap02] [Mao03].

Some of the important aspects that pose major challenges in digital audio watermarking are described below. Digital audio and video watermarking techniques rely on the perceptual properties of the human auditory system (HAS) and the human visual system (HVS), respectively. In certain ways, the HVS is less sensitive compared to the HAS. This indicates that it is more challenging to embed imperceptible audio watermarks. Moreover, the HAS is characterized by larger dynamic ranges in terms of power/amplitudes (~10⁸:1) and frequencies (~10⁴:1). Unlike in video compression, where data rates of the order of 1 Mb/s are employed, in audio coding the data rates typically range from 16 to 320 kb/s. This means that the stream of bits embedded in an audio file is much shorter compared to the auxiliary information inserted in a video file for watermarking. Researchers published a variety of sophisticated techniques ([Bend95] [Cox95] [Bend96] [Bone96] [Cox97] [Bass98] [Lacy98] [Swan98a] [Kiro03a] [Zhen03]) in order to meet several conflicting demands ([Crav98] [Swan99] [Cox99] [Barn03a] [Barn03b]) associated with digital audio watermarking ([Cox01a] [Arno02] [Wu03]).

11.5.1 Background

The advent of ISO/IEC MPEG-4 standardization (1996–2000) [ISOI99] [ISOI00a] established new research goals for high-quality coding of general audio signals even at low bit rates. Moreover, the bit-rate scalability and error resilience/protection tools of the MPEG-4 audio standard enabled flexible selection of coding features and the ability to dynamically adapt to the channel conditions and the varying channel capacity. These and other significant strides in the area of digital audio coding have motivated considerable research during the last 10 years towards the formulation of several audio watermarking algorithms. Table 11.3 lists some of the important innovations in digital audio watermarking research.

In 1995, Bender et al. at the MIT Media Laboratory proposed a variety of data-hiding techniques for audio [Bend95] [Bend96]. These include low-bit coding, phase coding, spread-spectrum, and echo data hiding.


Table 11.3. Some of the milestones in the digital audio watermarking research – a historical perspective

Year | Digital audio watermarking (DAW) scheme | Related references
1995 | Phase coding, low-bit coding, echo-hiding, and spread-spectrum techniques | [Bend95] [Bend96]
1995–96 | Echo-hiding watermarking scheme | [Gruh96]
1995–96 | Spread-spectrum watermarking for multimedia | [Cox95] [Cox96] [Cox97]
1996 | Watermarking based on perceptual masking | [Bone96] [Swan98a]
1997 | DAW based on vector quantization | [Mori97]
1998 | Time-domain DAW technique | [Bass98] [Bass01]
1999 | DAW based on bandlimited random sequences | [Xu99a]
1999 | DAW based on a psychoacoustic model and spread-spectrum techniques | [Garc99]
1999 | Permutation and data scrambling | [Wang99]
2000 | DAW in the cepstrum domain | [Lee00a] [Lee00b]
2000 | Watermarking techniques for MPEG-2 AAC bitstreams | [Neub00a] [Neub00b]
2000 | Audio watermarking based on the "patchwork" algorithm | [Arno00]
2001 | Combined audio compression/watermarking methods | [Herr01a] [Herr01b] [Herr01c]
2001 | Modified echo hiding technique | [Oh01] [Oh02]
2001 | Adaptive and content-based DAW scheme (improve the accuracy of watermark detection in echo hiding) | [Foo01]
2001 | Modified patchwork algorithm | [Yeo01] [Yeo03]
2001 | Time-scale modification techniques for DAW | [Mans01a] [Mans01b]
2001 | Replica modification | [Petr01]
2001 | Security aspects of DAW schemes | [Kalk01a] [Kalk01b] [Mans02]
2001 | Synchronization methods of audio watermarks | [Leon01] [Gang02]
2002 | Pitch scaling techniques for robust DAW | [Shin02]
2002 | DAW using artificial neural networks | [Huij02]
2002 | Time-spreading echo hiding using PN sequences | [Ko02]
2002 | Improved spread-spectrum audio watermarking | [Kiro01a] [Kiro01b] [Kiro02] [Kiro03a]
2002 | Hybrid spread spectrum DAW | [Munt02]
2002 | DAW using subband division/QMF banks | [Sait02a] [Sait02b]
2002 | Blind cepstrum-domain DAW | [Hsie02]
2002 | Blind DAW with self-synchronization | [Huan02]
2003 | Time-frequency techniques | [Esma03]
2003 | Chirp-based techniques for robust DAW | [Erku03]
2003 | DAW using turbo codes | [Cvej03]
2003 | DAW using sinusoidal patterns based on PN sequences | [Zhen03]

Year | Tutorial reviews and survey papers | Reference
1996 | Techniques for data hiding | [Bend96]
1997 | Secure spread spectrum watermarking for multimedia | [Cox97]
1997 | Resolving rightful ownerships with invisible watermarks | [Crav97] [Crav98]
1998 | Multimedia data-embedding and watermarking technologies | [Swan98]
1998 | Special issue on copyright and privacy protection | [Spec98]
1999 | Special issue on identification and protection of multimedia information | [Spec99]
1999 | Watermarking as communications with side information | [Cox99]
1999 | Multimedia watermarking techniques | [Hart99]
1999 | Information hiding – a survey | [Peti99]
1999 | Current state of the art, challenges, and future directions for audio watermarking | [Swan99]
2001 | Digital watermarking algorithms and applications | [Podi01]
2001 | Electronic watermarking: the first 50 years | [Cox01]
2003 | What is the future of watermarking? Parts I and II | [Barn03a] [Barn03b]

The pioneering work of Cox et al. resulted in a novel scheme called spread spectrum watermarking for multimedia [Cox95] [Cox96] [Cox97]. In 1996, the first-ever international workshop on information hiding was held in Cambridge, UK, which attracted many researchers working on audio watermarking [Cox96] [Gruh96]. Boney et al. in 1996 presented a watermarking scheme for audio signals based on the perceptual masking properties of the human ear [Bone96] [Swan98a]. This was perhaps the first audio watermarking scheme that employed the psychoacoustic principles and the frequency/temporal masking characteristics of the HAS. A watermarking scheme based on vector quantization was also proposed [Mori97]. All the aforementioned data hiding schemes, except the "echo hiding" and "phase coding" techniques, embed watermarks in the frequency domain. In 1998, Bassia et al. presented a time-domain audio watermarking technique [Bass98] [Bass01].

Figure 11.15. Number of publications on digital watermarking (according to an INSPEC survey, Jan. 1999 [Roch98] [Peti99]).

In general the success of a data hiding scheme is characterized by the watermarkembedding and detection procedures Researchers began to develop schemes thatenable improved watermark detection [Cilo00] [Furo00] [Kim00] [Lean00][Seok00] Lee et al presented a digital audio watermarking scheme that works inthe cepstrum domain [Lee00a] [Lee00b] Neubauer and Herre proposed watermark-ing techniques for MPEG-2 AAC bitstreams [Neub00a] [Neub00b] Enhanced sp-read spectrum watermarking for MPEG-2 AAC was proposed later by Cheng et al[Chen02b] With the advent of sophisticated audio compression schemes the com-bined audio compressionwatermarking techniques also gained interest [Herr01a][Herr01b] [Herr01c]

Motivated by the success of the "patchwork" algorithm in image watermarking [Bend96], Arnold proposed an audio watermarking scheme based on the statistical patchwork method that operates in the frequency domain [Arno00]. Later, Yeo et al. presented a modified patchwork algorithm (MPA) [Yeo01] [Yeo03] that is relatively more robust compared to other patchwork algorithms for DAW. Robust audio watermarking schemes based on time-scale modification [Mans01a] [Mans01b], replica modulation [Petr01], pitch scaling [Shin02], and artificial neural networks [Huij02] have also been proposed.

The echo hiding technique proposed by Gruhl et al. [Gruh96] was modified and improved by Oh et al. [Oh01] [Oh02] for robust and imperceptible audio watermarking. Ko et al. [Ko02] further modified the echo hiding technique by time-spreading the echo using pseudorandom noise (PN) sequences. In order to avoid audible echoes and further improve the accuracy of watermark detection in the echo hiding technique (i.e., [Gruh96]), an adaptive and content-based audio watermarking technique has been proposed [Foo01]. Costa's pioneering work on the theoretical bounds on the amount of data that can be hidden in a signal [Cost83] motivated researchers to apply these bounds to the data hiding problem in audio [Chou01] [Kalk02] [Neub02] [Peel03].

Watermark detection becomes particularly challenging when a part of the embedded watermark is lost due to some common signal processing manipulations (e.g., cropping, sampling, shifting, etc.). Hence, security and synchronization aspects of DAW became an important issue during the early 2000s. Several synchronization methods for audio watermarks have been proposed [Lean01] [Gang02]. Efficient methods to improve the security of watermarks have also been presented [Kalk01a] [Kalk01b] [Gang02] [Mans02]. Kirovski and Malvar presented a robust framework for direct-sequence spread-spectrum (DSSS)-based audio watermarking by incorporating several sophisticated techniques, such as robustness to desynchronization, cepstrum filtering, and chess watermarks [Kiro01a] [Kiro01b] [Kiro02] [Kiro03a] [Kiro03b]. DAW based on hybrid spread-spectrum methods has also been investigated [Munt02]. Saito et al. employed a subband division technique based on QMF banks for digital audio watermarking [Sait02a] [Sait02b]. DAW based on time-frequency characteristics and chirp-based techniques has also been proposed [Erku03] [Esma03]. Motivated by the applications of turbo codes in encryption techniques, Cvejic et al. [Cvej03] developed a robust DAW scheme using turbo codes. DAW using sinusoidal patterns based on PN sequences has been described in [Zhen03].

Blind and asymmetric audio watermarking gained popularity due to their inherent property that the original signal is not required for watermark detection. Several blind audio watermarking algorithms have been reported [Hsie02] [Huan02] [Peti02]; however, the research associated with blind DAW is still in its early stages. Asymmetric watermarking schemes involve watermark detection by knowing only a part of the secret information used to encode and embed the watermark.

11.5.2 A Generic Architecture for DAW

During the last decade, researchers have proposed several efficient methods for audio watermarking. Most of these algorithms are based on the generic architecture shown in Figure 11.16.


Figure 11.16. A general audio watermarking framework: (a) watermark encoding procedure (watermark design and embedding) and (b) watermark decoder.

The three primary modules associated with a digital audio watermarking scheme include the watermark design, the watermark embedding algorithm, and the watermark retrieval. The first two modules belong to the watermark encoding procedure (Figure 11.16(a)), and the watermark retrieval can be viewed as a watermark decoder (Figure 11.16(b)).


First, the watermark design section generates an auxiliary data string based on a security key and a set of design parameters (e.g., bit rates, robustness, psychoacoustic principles, etc.). Furthermore, the watermark generation technique often depends on the analysis features of the input audio, the human auditory system (HAS), and some data scrambling/spread spectrum technique. Nevertheless, the ultimate objective is to design a watermark that is perceptually inaudible and robust. The watermark embedding algorithm relies on hiding the designed watermark sequence w in the audio bitstream x. This is given by

x_w = ℘(x, w)        (11.29)

where x_w is the watermarked audio and ℘ represents the watermark embedding function.

The watermark embedding can be performed either in the time domain or in the frequency domain. The idea is to exploit the frequency- and the temporal-masking properties of the human ear. Three key features of the human auditory system that are used in audio watermarking are:

i) sounds that are louder typically tend to mask out the quieter sounds;
ii) the human ear perceives only the "relative phase" associated with the sounds; and
iii) the HAS fails to recognize minor reverberations and echoes.

These features, when employed in conjunction with encryption schemes [Mao03] or spread spectrum techniques [Pick82], lead to efficient DAW schemes. In particular, the direct sequence spread spectrum (DSSS) technique, due to its inherent suitability towards secure transmission, is the most popular spreading scheme employed in data hiding. The DSSS technique employs pseudo-noise (PN) sequences to transform the original signal into a wideband, noise-like signal. The resulting spread spectrum signals are resilient to interference, difficult to demodulate/decode, and robust to collusion attacks, and therefore form the basis for many watermarking schemes. From Figure 11.16(a), note that a signature or a security key that provides details on the proof of authorship/ownership can also be inserted. Depending on overall system objectives and design philosophy, watermarks can be inserted either before or after audio compression. Moreover, algorithms for combined audio compression and watermarking are also available [Herr01a] [Herr01b]. The watermark security should lie in the hidden random keys and not in the watermark-embedding algorithm. In other words, details of the watermark-embedding scheme must be available at the decoder for efficient watermark detection.

Figure 11.16(b) illustrates the watermark decoding procedure. The secret key and the design parameters employed in watermark embedding are available during the watermark retrieval process. Two common scenarios of watermark retrieval are given in (11.30): when the input audio is employed for decoding, and when the input audio is not available for watermark detection. The latter case results in blind watermark detection.

Case I (input audio required for decoding):   w′ = ς(x_w, x)

Case II (input audio NOT required):            w′ = ς(x_w)        (11.30)

where w′ is the decoded watermark and ς represents the watermark retrieval function. Note that in the case of perfect watermark retrieval, w′ = w. Watermark retrieval must be unambiguous. In other words, the watermark detection must provide reliable details about the content ownership. In particular, an audio watermark must be imperceptible, statistically undetectable, remain in the audio after repeated encoding/copying, and be robust to a variety of signal processing manipulations and deliberate attacks. For a DAW scheme to be successful in copyright management and audio content protection applications, it should have some important attributes that are described next.
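As an illustration of Eqs. (11.29) and (11.30), the Python sketch below embeds an additive spread-spectrum watermark (a key-derived ±1 PN sequence scaled by a small strength) and detects it blindly by correlation (Case II). The test signal, key values, and strength are arbitrary, and the example deliberately ignores psychoacoustic shaping, synchronization, and the robustness mechanisms that a practical DAW scheme would require.

```python
import numpy as np

def embed_ss(x, key, strength=0.01):
    """Additive spread-spectrum embedding: x_w = x + strength * p,
    where p is a +/-1 PN sequence generated from the secret key."""
    p = np.random.default_rng(key).choice([-1.0, 1.0], size=len(x))
    return x + strength * p

def detect_ss(x_w, key, strength=0.01):
    """Blind correlation detector (Case II of Eq. (11.30)): regenerate p from
    the key and correlate; a score near 1 indicates the watermark is present."""
    p = np.random.default_rng(key).choice([-1.0, 1.0], size=len(x_w))
    return float(np.dot(x_w, p)) / (strength * len(x_w))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 44100
    t = np.arange(fs) / fs
    x = 0.3 * np.sin(2 * np.pi * 440 * t) + 0.01 * rng.standard_normal(fs)  # 1 s of "audio"
    x_w = embed_ss(x, key=1234)
    print("score, correct key:", round(detect_ss(x_w, key=1234), 2))  # close to 1
    print("score, wrong key  :", round(detect_ss(x_w, key=9999), 2))  # close to 0
```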

11.5.3 DAW Schemes – Attributes

11.5.3.1 Imperceptibility or Perceptual Transparency. Watermarks embedded in the digital audio must be perceptually transparent and should not result in any audible artifacts. Watermarks can be inserted in either perceptually significant or insignificant regions, depending upon the application. If inserted in a perceptually significant region, special care must be taken for the watermarks to be inaudible. Inaudibility can be guaranteed by placing watermark signals in audio segments whose low-frequency components have higher energy. Note that this technique is based on the frequency-masking characteristics of the HAS that we discussed in Chapter 5. On the other hand, if embedded in perceptually insignificant regions, watermarks become vulnerable to compression algorithms. Therefore, watermark designers must consider the trade-offs between the robustness and the imperceptibility features while designing a watermarking scheme. Another important factor is the optimum energy of the watermarked signal required for efficient watermark detection. Watermarks with higher energy possess some inherent features, such as reliable watermark detection and robustness to intentional and unintentional signal processing operations. However, it should be noted that an increase in the audio watermark energy may result in audible artifacts.

11.5.3.2 Statistical Transparency. Consider a set of audio signals x, y, z, ... and a watermark signal w. Let the watermarked signals be denoted as x_w, y_w, z_w, .... Watermarked signals must be statistically undetectable. Watermarks are designed such that users are not able to detect the embedded watermark by performing some statistical signal processing operation (e.g., correlation, autoregressive estimation, etc.). This scenario can be treated as a special case of intentional or collusion attacks that employ statistical methods.

11.5.3.3 Robustness to Signal Processing Manipulations. Watermarks are embedded directly in the audio stream. In fact, watermark bits are distributed (typically 2–128 b/s) uniformly across the audio bitstream. An audio signal undergoes several changes on its way from a transmitter/encoder to a receiver/decoder. Therefore, watermark robustness to a variety of signal processing operations is very important. Common signal processing operations typically considered for the robustness of an audio watermark include linear and nonlinear filtering, resampling, requantization, A-to-D and D-to-A conversions, cropping, MP3 compression, low-pass/high-pass filtering, dithering, bass and treble control changes, etc. Watermark robustness can also be improved by increasing the number of watermark bits. Nonetheless, this results in increased encoding bit rates. Spread-spectrum techniques offer good robustness, however, with increased computational complexity and synchronization problems.

11.6 SUMMARY OF COMMERCIAL APPLICATIONS

Some of the popular applications (Table 11.4) for embedded audio coding include digital broadcast audio (DBA) [Stol93b] [Jurg96], direct broadcast satellite (DBS) [Prit90], digital versatile disc (DVD) [Crav96], high-definition television (HDTV) [USAT95b], cinema [Todd94], electronic distribution of music [Bran02], and audio-on-demand over wide area networks such as the Internet [Diet96]. Audio coding has also enabled miniaturization of digital audio storage media, such as the compact MiniDisc [Yosh94]. With the advent of the MP3 audio format, which denotes audio files that have been compressed using the MPEG-1 layer III algorithm, perceptual audio coding has become of central importance to over-network exchange of multimedia information. Moreover, the MP3 algorithm has been integrated into several popular portable consumer audio playback devices that are specifically designed for web compatibility. Streaming audio codecs have been developed by several commercial entities. Example codecs include RealAudio and a2b audio. In addition, DolbyNET [DOLBY], a version of the Dolby AC-3 algorithm modified to accommodate Internet streaming [Vern99], has been successfully integrated into streaming audio processors for delivery of audio-on-demand to the desktop web browser. Moreover, some of the MPEG-4 audio tools enable real-time audio transmission over packet-switched networks such as the Internet [Diet96] [Ben99] [Liu99] [Zhan00], transparent compression at lower bit rates with reduced complexity, low delay coding, and enhanced bit-error robustness. The MPEG-7 audio supports multimedia indexing and searching, multimedia editing, broadcast media selection, and multimedia digital library sorting.

New audio coding algorithms must satisfy several requirements. First, consumers expect high quality at low to medium bit rates.

Table 11.4. Audio coding standards and applications

Algorithm | Applications | References
Sony ATRAC | MiniDisc | [Yosh94]
Lucent PAC | DBA, 128–160 kb/s | [John95]
Philips PASC | Digital audio tape (DAT) | [Lokh91] [Lokh92]
Dolby AC-2 | Digital broadcast audio (DBA) | [Todd84]
Dolby AC-3 | Cinema, HDTV, DVD | [Todd94]
Audio Processing Technology APT-X100 | Cinema | [Wyli96a] [Wyli96b]
DTS coherent acoustics | Cinema | [Smyt96] [Bosi00]
MPEG-1 L I–III | "MP3" – LIII; DBA – LII, 256 kb/s; DBS – LII, 224 kb/s; DCC – LI, 384 kb/s | [ISOI92]
MPEG-2 BC-LSF | Cinema | [ISOI94a]
MPEG-2 AAC | Internet (www), e.g., Liquid Audio, a2b audio | [ISOI96a] [ISOI97b]
MPEG-4 | Low-bit-rate, low-delay, and scalable audio encoding; streaming audio applications | [ISOI97a] [ISOI99] [ISOI00]
MPEG-7 | Broadcast media selection; multimedia indexing, searching, editing | [ISOI01b] [ISOI01d]
Direct-stream digital (DSD) encoding | Super audio CD encoding | [Reef01a] [Jans03]
Meridian lossless packing (MLP) | DVD-Audio | [Gerz99] [Kuzu03]

Second, the popularity of streaming audio requires that algorithms intended for packet-switched networks are able to deal with a highly time-varying channel. Therefore, new algorithms must include facilities for scalable operation, i.e., the ability to decode only a subset of the overall bitstream with minimal loss of quality. Finally, for any algorithm to be viable, it should be possible to implement the decoder in real time on a general-purpose processor architecture. Some of the important aspects that have been influential over the years in the design of an audio codec are listed below.

• Streaming audio applications
• Low delay audio coding
• Error robustness
• Hybrid speech/audio coding structures
• Digital audio watermarking
• Perceptual audio quality measurements
• Bit rate scalability
• Complexity issues
• Psychoacoustic signal analysis techniques
• Parametric models in audio coding
• Lossless audio compression
• Interoperable multimedia framework
• Intellectual property management and protection

Development of sophisticated audio coding algorithms that deliver rate-scalable and quality-scalable compression in error-prone channels received attention during the late 1990s. Several researchers developed structured sound representations oriented specifically towards low-bandwidth and native-signal-processing sound synthesis applications. For example, NetSound [Case96] is a sound and music description system in which sound streams are described by decomposition into a sound-specification description, which represents arbitrarily complex signal processing algorithms, as well as an event list comprising scores or MIDI files. The advantage of this approach over streaming codecs is that only a very small amount of data is required to transmit complex instrument descriptions and appropriately parameterized scores. The trade-off with respect to streaming codecs, however, is that significant client-side computing resources are required during real-time synthesis. Some examples of synthesis techniques used by NetSound include wavetable, FM, phase-vocoder, or additive synthesis. Scheirer proposed a paradigm called generalized audio coding [Sche01], in which Structured Audio (SA) encompasses all other audio coding techniques. Moreover, the MPEG-4 SA tool offers a flexible framework in which both lossy and lossless audio compression can be realized. A variety of lossless audio coding features have been incorporated in the MPEG-4 audio standard [Sche01] [Mori02a] [Vasi02].

Although streaming audio has achieved widespread acceptance on 28.8 kb/s dial-up telephone connections, consistent high-quality output is still difficult.


Network-specific design considerations motivated research into joint source-channel coding [Ben99] for audio over the Internet. Another emerging trend is one of convergence between low-rate audio coding algorithms and speech coders that are increasingly embedding mechanisms to exploit perceptual irrelevancies [Malv98] [Mori98] [Ramp98] [Ramp99] [Trin99a] [Trin99b] [Maki00]. Several wideband speech and audio coders based on the CS-ACELP ITU-T G.729 standard have been proposed [Kois00] [Kuma00]. Low delay [Jbir98] [Alla99] [Dorw01] [Schu02] [Schu05], enhanced bit error robustness [ISOI00] [Sper00] [Sper02], and scalability [Bran94b] [Gril95] [Span97] [Gril98] [Hans98] [Herr98] [Jin99] [Zhou01] [Dong02] [Kim02] [Mori02b] have also become very important. In particular, the MPEG-4 bit rate scalability tool provides the encoder with options to transmit bitstreams at variable bit rates. Note that scalable audio coders have several layers, i.e., a core layer and a series of enhancement layers. The core layer encodes the main audio stream, while the enhancement layers provide further resolution and scalability. Scalable algorithms will ultimately be used to accommodate the unique challenges associated with audio transmission over time-varying channels such as packet-switched networks. Several ideas that provide a link between the perceptual and the lossless audio coding techniques [Geig02] [Mori00] [Schu01] [Mori02a] [Mori02b] [Geig03] [Mori03] have also been proposed.
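The core-plus-enhancement-layer idea can be sketched numerically. The following is an illustrative toy, not the MPEG-4 scalability tool: the "core layer" is a coarse quantization and each "enhancement layer" re-quantizes the residual with a finer, arbitrarily chosen step size, so decoding more layers monotonically improves the reconstruction.

```python
# Illustrative sketch of bit rate scalability: core layer plus residual
# enhancement layers. Step sizes and the test signal are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)

def quantize(v, step):
    return step * np.round(v / step)

steps = [0.5, 0.25, 0.125, 0.0625]       # hypothetical per-layer step sizes
layers, residual = [], x.copy()
for step in steps:                        # encoder: quantize, pass on the residual
    q = quantize(residual, step)
    layers.append(q)
    residual = residual - q

for k in range(1, len(layers) + 1):       # decoder: use only the first k layers
    xk = np.sum(layers[:k], axis=0)
    snr = 10 * np.log10(np.sum(x**2) / np.sum((x - xk)**2))
    print(f"layers decoded = {k}, SNR = {snr:.1f} dB")
```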

Hybrid coding algorithms that make use of specific characteristics of a signal while operating over a range of bit rates became popular [Bola98] [Deri99] [Rong99] [Maki00] [Ning02]. Some experimental work performed in the context of MPEG-4 standardization has demonstrated that a cascaded hybrid sinusoidal/time-frequency coder can not only meet but in some cases even exceed the output quality achieved by the time-frequency coder alone at the same bit rate, for certain test signals [Edle96b]. Sinusoidal signal models, due to their inherent suitability towards scalable audio coding, gained significant research interest during the late 1990s [Taor99] [Pain00] [Ferr01a] [Ferr01b] [Pain01] [Raad01] [Herm02] [Heus02] [Pain02] [Purn02] [Shao02] [Pain05].

Potential improvements for the various perceptual coder building blocks, such as novel filter banks for low-delay coding and reduced pre-echo [Schu98] [Schu00], efficient bit-allocation strategies [Yang03], and new psychoacoustic signal analysis techniques [Baum98] [Huan99] [Vande02], have been considered. Researchers also investigated new algorithms for tasks of peripheral interest to perceptual audio coding, such as transform-domain signal modifications [Lanc99] and digital watermarking [Neub98] [Tewf99].

In order to offer audiophiles listening experiences that promise to be more realistic than ever, audio codec designers pursued several sophisticated multichannel audio coding techniques [DOLBY] [Bosi93] [Holm99] [Bosi00]. In the mid-1990s, discrete encoding, i.e., 5.1 separate channels of audio, was introduced by Dolby Laboratories and Digital Theater Systems (DTS). The human auditory system allows hearing in substantially more directions than what current multichannel audio systems offer; therefore, future research will focus on overcoming the limitations of existing multichannel systems. In particular,


research involving spatial audio, real-time acoustic source localization, binaural cue coding, and the application of head-related transfer functions (HRTF) towards rendering immersive audio [Sand01] [Fall02a] [Fall02b] [Kimu02] [Yang02] has been considered.

PROBLEMS

11.1 What are the differences between the SACD and DVD-A formats? Explain using diagrams. Provide bit rates.

11.2 Use diagrams and describe the MLP algorithm employed in the DVD-A.

11.3 How is the direct stream digital (DSD) technology employed in the SACD different from other encoding schemes? Describe using a block diagram.

11.4 What is the main idea behind the Integer MDCT and its application in lossless audio coding? Explain in a concise manner, and by using mathematical arguments, the difference between the Integer MDCT (IntMDCT) and the MDCT. Show the utility of the FFT in both.

11.5 Explain, using an example, the lossless matrixing employed in the MLP. How is this scheme different from the Givens rotations employed in the IntMDCT?

11.6 List some of the key differences between the lossless transform audio coder (LTAC) and the IntMDCT algorithm.

11.7 Survey the literature for studies on fast algorithms for the IntMDCT and describe at least one such algorithm. Show how computational savings are achieved.

11.8 Survey the literature for studies on the properties of the IntMDCT. Discuss in particular properties dealing with preservation of energy/power (whichever is appropriate) and transform pair uniqueness.

11.9 Describe the concept of noise shaping. Explain how it is applied in the DSD algorithm.

COMPUTER EXERCISE

11.10 Write MATLAB code to implement an IntMDCT. Profile the algorithm and compare its computational complexity against a standard MDCT.

CHAPTER 12

QUALITY MEASURES FOR PERCEPTUAL AUDIO CODING

12.1 INTRODUCTION

In many situations, and particularly in the context of standardization activities, performance measures are needed to evaluate whether one of the established or emerging techniques is in some sense superior to other alternative methods. Perceptual audio codecs are most often evaluated in terms of bit rate, complexity, delay, robustness, and output quality. Reliable and repeatable output quality assessment (which is related to robustness) presents a significant challenge. It is well known that perceptual coders can achieve transparent quality over a very broad, highly signal-dependent range of segmental SNRs, ranging from as low as 13 dB to as high as 90 dB. Classical objective measures of signal fidelity such as signal-to-noise ratio (SNR) or total harmonic distortion (THD) are therefore inadequate [Ryde96]. As a result, time-consuming and expensive subjective listening tests are required to measure the small impairments that often characterize perceptual coding algorithms. Despite some confounding factors, subjective listening tests are nevertheless the most reliable tool available for codec quality evaluation, and standardized listening test procedures have been developed to maximize reliability. Research into improved subjective testing methodologies is also quite active. At the same time, considerable research is being devoted to the development of automatic perceptual measurement systems that can predict accurately the outcomes of subjective listening tests. Several algorithms have been proposed in the last two decades, and the ITU-R has recently adopted an




international standard after combining several of the best techniques proposed in the literature.

This chapter offers a perspective on quality measures for perceptual audio coding. The first half of the chapter is concerned with subjective evaluations of perceptual codec quality. Sections 12.2 through 12.5 describe subjective quality measurement techniques, identify confounding factors that complicate subjective tests, and give sample subjective test results that facilitate comparisons between several 2- and 5.1-channel standards. The second half of the chapter reviews fundamental techniques and significant developments in automatic perceptual measurement systems, with particular attention given to several of the candidates from the ITU-R standardization process. In particular, Sections 12.6 and 12.7, respectively, describe perceptual measurement systems and the classes of algorithms that have been proposed for standardization. Section 12.8 describes the recently completed ITU-R TG 10/4, which led to the adoption of the ITU-R Rec. BS.1387 algorithm for perceptual measurement; Section 9.10 also provides some information on the ITU-T P.861. Section 12.9 offers some final remarks on anticipated future developments in quality measures for perceptual codecs.

12.2 SUBJECTIVE QUALITY MEASURES

Although listening tests are often conducted informally, the ITU-R Recommendation BS.1116 [ITUR94b] formally specifies a listening environment and test procedure appropriate for subjective evaluations of the small impairments associated with high-quality audio codecs. The standard procedure calls for grading by expert listeners [Bech92] using the CCIR continuous impairment scale (Figure 12.1a) [ITUR90] in a double-blind, A-B-C triple-stimulus, hidden-reference comparison paradigm. While stimulus A always contains the reference (uncoded) signal, the B and C stimuli contain, in random order, a repetition of the reference and then the impaired (coded) signal, i.e., either B or C is a hidden reference. After listening to all three, the subject must identify either B or C as the hidden reference and then grade the impaired stimulus (coded signal) relative to the reference stimulus using the five-category, 41-point "continuous" absolute category rating (ACR) impairment scale shown in the left-hand column of Figure 12.1(a). From best to worst, the five ACR ranges rate the coding distortion as "imperceptible (5.0)," "perceptible but not annoying (4.0-4.9)," "slightly annoying (3.0-3.9)," "annoying (2.0-2.9)," or "very annoying (1.0-1.9)." A default grade of 5.0 is assigned to the stimulus identified by the subject as the hidden reference. A subjective difference grade (SDG) is computed by subtracting the score assigned to the actual hidden reference from the score assigned to the actual impaired signal. Thus, negative difference grades are obtained when the subject identifies correctly the hidden reference, and positive difference grades are obtained if the subject misidentifies the hidden reference. Over many subjects and many trials, mean impairment scores are calculated and used to evaluate codec performance relative to the ideal of transparency. It is important to notice the difference between the small impairment subjective measurements in [ITUR94b]


and the five-point mean opinion score (MOS) most often associated with speech coding algorithms [ITUT94]. Unlike the small impairment scale, the scale of the speech coding MOS is discrete, and scores are absolute rather than relative to a hidden reference. To emphasize this difference, it has been proposed [Spor96] that "mean subjective score" (MSS) denote the small impairment subjective score for perceptual audio coders. During data reduction, MSS and MSS standard deviations are computed for both the coded signals and the hidden references. An MSS for the hidden reference different from 5.0 means that subjects have difficulty distinguishing the hidden reference from the coded signal.

In fact, nearly transparent quality for the coded signal is implied if the hidden reference MSS lies within the 95% confidence interval of the coded signal and the coded signal MSS lies within the 95% confidence interval of the hidden reference. Unless otherwise specified, the subjective listening test scores cited for the various algorithms described throughout this text are from either the absolute or the differential small impairment scales in Figure 12.1(a).
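The bookkeeping described above is simple enough to sketch. The snippet below uses made-up listener scores (it is not test data from any of the cited studies): the stimulus a listener identifies as the hidden reference gets 5.0, the other stimulus gets the listener's grade, the SDG is the score of the actual coded signal minus the score of the actual hidden reference, and the "nearly transparent" criterion is the mutual 95% confidence-interval check stated in the text.

```python
# Sketch of SDG/MSS bookkeeping with hypothetical scores.
import numpy as np

# Each trial: (grade given to the stimulus judged impaired,
#              True if the hidden reference was identified correctly)
trials = [(4.6, True), (4.8, True), (4.9, False), (4.7, True), (5.0, False)]

sdg, coded_scores, ref_scores = [], [], []
for grade, correct in trials:
    if correct:                  # coded signal graded; hidden reference gets 5.0
        coded, ref = grade, 5.0
    else:                        # listener mis-identified the hidden reference
        coded, ref = 5.0, grade
    coded_scores.append(coded)
    ref_scores.append(ref)
    sdg.append(coded - ref)      # negative when the reference is correctly found

def mean_ci(x, z=1.96):
    x = np.asarray(x)
    half = z * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean(), (x.mean() - half, x.mean() + half)

mss_coded, ci_coded = mean_ci(coded_scores)
mss_ref, ci_ref = mean_ci(ref_scores)
print("mean SDG:", np.mean(sdg))
transparent = (ci_coded[0] <= mss_ref <= ci_coded[1] and
               ci_ref[0] <= mss_coded <= ci_ref[1])
print("nearly transparent:", transparent)
```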

It is important to realize that the most reliable subjective evaluation strategy for a given perceptual codec depends on the nature of the coding distortion. Although the small-scale impairments associated with nearly transparent coding are well characterized by measurements relative to a reference standard using a fine-grade scale, some experts have argued that the more-audible distortions associated with nontransparent coding are best measured using a different scale that can better cope with large impairments.

For example, in listening tests [Keyh98] on 16-kb/s codecs for the WorldSpace satellite communications system, it was determined that an ITU-T P.800/P.830

Figure 12.1. Subjective quality scales. (a) ITU-R Rec. BS.1116 [ITUR94b] small impairment scale for absolute and differential subjective quality grades; the absolute grades run from 5.0 ("imperceptible") down to 1.0 ("very annoying"), with corresponding difference grades from 0.0 to -4.0. (b) ITU-T Rec. P.800/P.830 [ITUR96] large impairment comparison category rating (CCR), from +3 ("A much better than B") through 0 ("A same as B") to -3 ("A much worse than B").


seven-point comparison category rating (CCR) method [ITUR96] was better suited to the evaluation task (Figure 12.1b) than the scale of BS.1116, because of the nontransparent quality associated with the test signal. Investigators preferred the CCR over both the small impairment scale and the five-point absolute category rating (ACR) commonly used in tests of speech codecs. A listening test standard for large-scale impairments analogous to BS.1116 does not yet exist for audio codec evaluation. This is likely to change in light of the growing popularity, for network and wireless applications, of low-rate nontransparent audio codecs characterized at times by distinctly audible (not small-scale) artifacts.

12.3 CONFOUNDING FACTORS IN SUBJECTIVE EVALUATIONS

Regardless of the particular grading scale in use, subjective test outcomes generated using even rigorous methodologies such as the ITU-R BS.1116 are still influenced by factors such as context, site selection, and individual listener acuity (physical) or preference (cognitive). Before comparing subjective test results on particular codecs, therefore, one should be prepared to interpret the subjective scores with some care. For example, consider the variability of "expert" listeners. A study of decision strategies [Prec97] using multidimensional scaling techniques [Schi81] found that subjects disagree on the relative importance with which to weigh perceptual criteria during impairment detection tasks. In another study [Shli96], Shlien and Soulodre presented experimental evidence that can be interpreted as a repudiation of the "golden ear." Expert listeners were tasked with discriminating between clean audio and audio corrupted by low-level artifacts typically induced by audio codecs (five types were analyzed in [Miln92]), including pre-echo distortion, unmasked granular (quantization) noise, and high-frequency boost or attenuation. Different experts were sensitive to different artifact types. For instance, a subject strongly sensitive to pre-echo distortion was not necessarily proficient when asked to detect high-frequency gain or attenuation. In fact, artifact sensitivities were linked to psychophysical measurement profiles. Because test subjects tended to be sensitive to one impairment type but not to others, the ability of an individual to perform as an expert listener depended upon the type of artifacts to be detected. Sporer [Spor96] reached similar conclusions after yet a third study of expert listeners. A statistical analysis of the results indicated that significant site and subject differences exist, that even well-trained listeners cannot necessarily repeat their own results (self-recalibration), and that some subjects are sensitive to specific artifacts but insensitive to others. The numerous confounding factors in subjective evaluations perhaps point to the need for a fundamental change in the evaluation of perceptual codecs. Sporer [Spor96] suggested that codecs could be evaluated on the basis of specific artifact perceptibility. Then, appropriate artifact-sensitive subjects could be matched to the test material. It has also been suggested that the influence of individual nonrepeatability could be minimized with larger subject pools. Nonhuman factors also influence subjective listening test outcomes. For example, playback level (SPL) and background noise, respectively, can influence excitation pattern shapes and


introduce undesired masking effects. While both SPL and background noise can be replicated in different listening test environments, other environmental factors are not as easy to replicate. The presentation method can strongly influence perceived quality, because loudspeakers introduce distortions on their own and in conjunction with a listening room. These effects can introduce site dependencies. In short, although they have proven effective, existing subjective test procedures for audio codecs are clearly suboptimal. Recent research into more reliable tools for subjective codec evaluations has shown promise and is continuing. Confounding factors must be considered carefully when subjective test results are compared between codecs and between different test sites. These considerations have motivated a number of subjective tests in which multiple algorithms were tested at the same site under identical listening conditions. In Sections 12.4 and 12.5, we will consider two example tests, one for 2-channel codecs and the other for 5.1-channel algorithms.

12.4 SUBJECTIVE EVALUATIONS OF TWO-CHANNEL STANDARDIZED CODECS

The influence of site and subject dependencies on subjective listening tests can potentially invalidate direct comparisons between independent test results for different algorithms. Ideally, fair intercodec comparisons require that scores are obtained from a single site with the same test subjects. Soulodre et al. conducted a formal ITU-R BS.1116-compliant [ITUR94b] listening test that compared several standardized two-channel stereo codecs [Soul98], including the MPEG-1 layer II [ISOI92], the MPEG-1 layer III [ISOI92], the MPEG-2 AAC [ISOI96a], the Lucent Technologies PAC [John95], and the Dolby AC-3 [Fiel96] codecs, over a variety of bit rates between 64 and 192 kb/s per stereo pair. The AAC algorithm was tested in the main complexity profile, but with the dynamic window shape switching and intensity coding tools disabled. The MPEG-1 layer II codec was tested both in software simulation and in a real-time implementation ("ITIS"). In all, 17 algorithm/bit rate combinations were examined in the tests. Listening material was selected from a library of 80 items deemed critical by experts, and ultimately the two most critical items were chosen for each codec tested. Mean difference grades were computed over the set of results from 21 expert listeners, after three of the original subjects were eliminated from consideration due to their statistically unreliable performance (t-test disqualification).

The test results, reproduced in Table 12.1, clearly show eight performance classes. The AAC and AC-3 codecs, at 128 and 192 kb/s, respectively, exhibited best performance, with mean difference grades better than -1.0. The MPEG-2 AAC algorithm at 128 kb/s, however, was the only codec that satisfied the quality requirements defined by ITU-R Rec. BS.1115 [ITUR97] for perceptual audio coding systems in broadcast applications, namely, that there may not be any audio materials rated below -1.00. Overall, the ranking of the families from best to worst with respect to quality was AAC, PAC, MPEG-1 layer III, AC-3, MPEG-1 layer II, and ITIS (MPEG-1 LII hardware implementation). The trend


is exemplified in Table 12.2, where the codec families are ranked in terms of SDGs for a fixed bit rate of 128 kb/s. The class-three results can be interpreted to mean that bit rate increases of 32, 64, and 96 kb/s per stereo pair are required for the PAC, AC-3, and layer II codec families, respectively, to match the output quality of the MPEG-2 AAC at 96 kb/s per stereo pair.

12.5 SUBJECTIVE EVALUATIONS OF 5.1-CHANNEL STANDARDIZED CODECS

In addition to two-channel stereo systems, multichannel perceptual audio coders are increasingly in demand for multimedia, cinema, and home theater

Table 12.1. Comparison of standardized two-channel algorithms (after [Soul98])

Group  Algorithm  Rate (kb/s)  Mean diff. grade  Transparent items  Items below -1.00
1      AAC        128          -0.47             1                  0
       AC-3       192          -0.52             1                  1
2      PAC        160          -0.82             1                  3
3      PAC        128          -1.03             1                  4
       AC-3       160          -1.04             0                  4
       AAC         96          -1.15             0                  5
       MP-1 L2    192          -1.18             0                  5
4      ITIS       192          -1.38             0                  6
5      MP-1 L3    128          -1.73             0                  6
       MP-1 L2    160          -1.75             0                  7
       PAC         96          -1.83             0                  6
       ITIS       160          -1.84             0                  6
6      AC-3       128          -2.11             0                  8
       MP-1 L2    128          -2.14             0                  8
       ITIS       128          -2.21             0                  7
7      PAC         64          -3.09             0                  8
8      ITIS        96          -3.32             0                  8

Table 12.2. Comparison of standardized two-channel algorithms at 128 kb/s (after [Soul98])

Rank  Algorithm (128 kb/s)  Mean diff. grade
1     AAC                   -0.47
2     PAC                   -1.03
3     MP-1 L3               -1.73
4     AC-3                  -2.11
5     MP-1 L2               -2.14
6     ITIS                  -2.21


applications. As a result, the European Broadcasting Union (EBU) sponsored Deutsche Telekom Berkom in a formal subjective evaluation [Wust98] that compared the output quality of real-time implementations of the 5.1-channel Dolby AC-3 and the matrixed 5.1-channel MPEG-2/BC layer II algorithms at bit rates between 384 and 640 kb/s (Table 12.3). The tests adhered to the methodologies outlined in ITU-R BS.1116, and the five-channel listening environment was configured according to ITU-R Rec. BS.775 [ITUR93a]. Mean subjective difference grades were computed over the scores from 32 experts, after a t-test analysis was used to discard inconsistent results from seven subjects.

The resulting difference grades, given in Table 12.3, represent averages of the mean grades reported for a collection of eight critical test items. Even though none of the tested codec configurations satisfied "transparency," the MPEG-2/BC layer II had only one nontransparent test item at the 640 kb/s bit rate. On the other hand, the Dolby AC-3 outperformed MPEG-2/BC layer II at 384 kb/s by more than half a difference grade. In terms of specific weaknesses, the "applause around" item was most critical for AC-3 (e.g., -2.41 at 384 kb/s), whereas the "pitch pipe" item was most critical for MPEG-2/BC layer II (e.g., -3.56 at 384 kb/s).

12.6 SUBJECTIVE EVALUATIONS USING PERCEPTUAL MEASUREMENT SYSTEMS

Even well-designed and carefully controlled experiments [ITUR94b] are susceptible to problems with reliability, repeatability, site-dependence, and subject-dependence [Grew91]. Moreover, subjective tests are difficult, time consuming, tedious, and expensive. These circumstances have inspired researchers to seek automatic perceptual measurement techniques that can evaluate coding distortion on a subjective impairment scale, with the ultimate objective being the replacement of human test subjects. Ideally, a perceptual measurement system should answer the following questions:

• Is the codec transparent for all sources?
• If so, what is the "coding margin," or the degree of distortion inaudibility?
• If not, how disturbing are the artifacts on a meaningful subjective scale?

Table 12.3. Comparison of standardized 5.1-channel algorithms

Group  Algorithm  Rate (kb/s)  Mean diff. grade
1      MP-2 BC    640          -0.51
2      AC-3       448          -0.93
       MP-2 BC    512          -0.99
3      AC-3       384          -1.17
       MP-2 BC    384          -1.73


In other words, the system should quantify overall subjective quality, provide a distance measure for the coding margin ("perceptual headroom"), and quantify the audibility of unmasked distortion. Many researchers have proposed measurement systems that attempt to answer these questions. The resulting system architectures can be broadly categorized into two classes (Figure 12.2), namely, systems that compare internal auditory representations (CIR) and systems that perform noise signal evaluation (NSE).

12.6.1 CIR Perceptual Measurement Schemes

The CIR systems (Figure 12.2a) generate independent auditory images of the original and coded signals and then compare them. The degree of audible difference between the auditory images is then evaluated using either a deterministic difference thresholding scheme or a probabilistic detector. The auditory images in the figure could represent excitation patterns, loudness patterns, or some other perceptually relevant physiological quantity, such as neural firing rates, for example. CIR systems avoid explicit masking threshold calculations by capturing the essence of cochlear signal processing in a filter bank, combined with a series of post-filter-bank operations on each cochlear channel. Moreover, CIR systems do not require any explicit knowledge of an error signal.

12.6.2 NSE Perceptual Measurement Schemes

The NSE systems (Figure 12.2b), on the other hand, require an explicit representation of the coding distortion. The idea behind NSE systems is to analyze the coding distortion with respect to properties of the auditory system. Typical NSE systems estimate explicitly the masked threshold for each short-time signal

Figure 12.2. Perceptual measurement system architectures: (a) comparison of internal representation (CIR), in which the reference s(n) and the perceptual codec output are delay/gain compensated, passed through identical auditory models, and their auditory "images" compared; (b) noise signal evaluation (NSE), in which the difference signal is analyzed in time-frequency and compared against the masked threshold produced by a perceptual model.


segment, using the properties of simultaneous and temporal masking, as is usually done in perceptual audio coders. Then, the distribution of noise energy over time and frequency is compared against the corresponding masked threshold. In both the CIR and NSE scenarios, the automatic measurement system seeks to emulate the signal processing behavior of the human ear and/or the cognitive processing of auditory stimuli in the human listener. Emulation of the human ear for the purpose of making measurements requires a filter bank with time and frequency resolution similar to the auditory filter bank [Spor95a]. Although it is possible to model the characteristics of individual listeners [Treu96], the usual approach is to match the acuity of average subjects. Frequency resolution should be sufficient to model excitation pattern or hair cell selectivity for NSE and CIR algorithms, respectively. For modeling temporal effects, time resolution should be at least 3 ms below 300 Hz and at least 1.5 ms at higher frequencies. Fortunately, a great deal of information about the auditory filter bank is available to assist in the modeling task. The primary challenge is to design a system of manageable complexity while adequately emulating the ear's time-frequency analysis properties. Cognitive effects, however, are less well understood and more difficult to model. Cognitive weighting of perceptual criteria can be quite different across listeners [Prec97]. While some subjects might interpret single strong distortions such as clicks to be less annoying than small omnipresent distortions, the converse can also be true. Moreover, impairments are asymmetrically weighted: subjects tend to forget distortions at the beginning of a sample but remember those occurring at the end. Other subjective preferences can also affect outcomes. Some subjects prefer bandwidth limiting ("dullness") over low-level artifacts at high frequencies, but others would prefer greater "brilliance" at the expense of low-level artifacts. While these issues present significant modeling challenges, they can also undermine the utility of subjective listening tests, whereas an automatic measurement system guarantees a repeatable output. The next few sections are concerned with examples of particular CIR and NSE perceptual measurement schemes that have been proposed, including several that have been integrated into international standards.

12.7 ALGORITHMS FOR PERCEPTUAL MEASUREMENT

Numerous examples of both CIR- and NSE-based perceptual measurement schemes for speech and audio codec output quality evaluation have appeared in the literature since the late 1970s, with particular emphasis placed on those systems that can predict accurately subjective listening test results. In 1979, Schroeder, Atal, and Hall at Bell Labs proposed an NSE technique called the "noise loudness" (NL) metric for assessment of vocoder output quality. The NL system [Schr79] estimates output quality by computing a ratio of the coding distortion loudness to the loudness of the original signal. For both signal and noise, loudness patterns are derived every 20 ms from frequency-domain excitation patterns that are, in turn, computed from short-time critical band densities. The critical band densities are obtained from FFT-based spectral


estimates. The approach is hampered by insufficient analysis resolution at low frequencies. Karjalainen in 1985 proposed a CIR technique [Karj85] known as the "auditory spectrum distance" (ASD). Instead of using an FFT for spectral analysis, the ASD front-end models the cochlear filter bank using a 48-channel bank of overlapping FIR filters with roughly 0.5 Bark spacing. The filter bank channel magnitude responses are designed to model the smearing of spectral energy that produces the usual psychophysical excitation patterns. Then, a square-law post-processor and two low-pass filters are used to model hair cell rectification and temporal masking, respectively, for each channel. The ASD measure is derived from the maximum deviation between the post-processed filter-bank outputs for the reference and test signals. The ASD profile is analyzed over time and frequency to extract some estimate of perceptual quality.

As the field of perceptual audio coding matured rapidly and created greater demand for listening tests, there was a corresponding growth of interest in perceptual measurement schemes. Several NSE and CIR algorithms appeared in the space of the next few years, namely, the noise-to-mask ratio [Bran87a] (NMR, 1987), the perceptual audio quality measure (PAQM, 1991) [Beer91], the peripheral internal representation [Stua93] (PIR, 1991), the Bark transform (BT, 1992) [Kapu92], the perceptual evaluation (PERCEVAL, 1992) [Pail92], the perceptual objective measure (POM, 1995) [Colo95], the distortion index (DIX, 1996) [Thie96], and the objective audio signal evaluation (OASE, 1997) [Spor97]. After a proposal phase, the proponents of several of these measurement

systems ultimately collaborated on the development of the international standard on perceptual measurement, the ITU-R BS.1387 [ITUR98] (1998). Additional schemes were reported in a special session on perceptual measurement systems for audio codecs during a 1992 Audio Engineering Society conference [Cabo92] [Stau92]. An NSE measure based on quantifying the dissonances associated with coding distortion was proposed in [Bank96].

The remainder of this chapter is intended to provide some functional insights on perceptual measurement schemes. Selected details of the perceptual audio quality measure (PAQM), the noise-to-mask ratio (NMR), and the objective audio signal evaluation (OASE) algorithms are given as representative examples of both the CIR and NSE methodologies.

12.7.1 Example: Perceptual Audio Quality Measure (PAQM)

Beerends and Stemerdink in 1991 proposed a perceptual measurement system [Beer91] known as the "perceptual audio quality measure" (PAQM) for evaluating audio device output quality relative to a reference signal. It was later shown that the parameters of PAQM can be optimized for predicting the results of subjective listening tests on speech [Beer94a] and perceptual audio [Beer92a] [Beer92b] codecs. The PAQM (Figure 12.3) is a CIR-type algorithm that maps independently the original signal s(n) and the coded signal ŝ(n) from the time domain to a corresponding pair of internal psychophysical representations. In particular, the original and coded signals are processed by a two-stage auditory signal processing model. This model captures mostly the peripheral

Figure 12.3. Perceptual audio quality measure (PAQM). The reference s(n) and the coded signal ŝ(n) are each windowed and transformed (|FFT|²), weighted by a middle-ear transfer function, warped from Hertz to Barks, spread in time and frequency into excitation patterns, and compressed to loudness; the compared internal representations are averaged over time and frequency to produce the noise disturbance ℒn.


process of nonlinear time-frequency energy smearing [Vier86], followed by the predominantly central process of nonlinear intensity compression [Pick88]. A modified mapping derived from Zwicker's loudness model [Zwic67] [Zwic77] is used to transform the two internal representations to a compressed loudness scale. Then, a difference is computed to quantify the noise disturbance associated with the coding distortion. Ultimately, a second transformation can be applied that maps the noise disturbance (compressed loudness) ℒn to a predicted listening test subjective difference grade. Precise computational details of the PAQM algorithm appeared in the appendices of [Beer92b]. The important steps, however, can be summarized as follows.

Note that the sequence of operations is the same for both the reference and impaired signals. First, short-time spectral estimates are obtained using a 2048-point Hann-windowed FFT with 50% block overlap, yielding time resolution close to 20 ms and frequency resolution of about 22 Hz. A middle-ear transfer function is applied at the output of the FFT, and the frequency axis is warped from Hertz to Barks. To simulate the excitation patterns generated along the basilar membrane, the PAQM next smears the middle-ear signal energy across time and frequency in accordance with empirically derived psychophysical models. For frequency-domain smearing within one analysis window, Terhardt's [Terh79] narrow-band, level- and frequency-dependent excitation pattern approximation, with slopes defined by

S_1 = 31 \;\mathrm{dB/Bark}, \quad f_m > f \qquad (12.1a)

S_2 = 22 + \min(230/f_m,\, 10) - 0.2L \;\mathrm{dB/Bark}, \quad f_m < f \qquad (12.1b)

is used to spread the spectral energy across critical bands. Thus, the slope S1 controls the downward spread of excitation, the slope S2 controls the upward spread of excitation, and the parameters fm, L, and f correspond, respectively, to the masker center frequency in Hz, the masker level in units of dB SPL, and the frequencies of the excitation components. Within an analysis window, individual excitation patterns Mi are combined point-wise to form a composite pattern Mc using a parametric power law of the form

M_c = \left( \sum_{i=1}^{N} M_i^{\alpha} \right)^{1/\alpha}, \quad \alpha < 2 \qquad (12.2)

where the parameter N corresponds to the number of individual excitation patterns being combined at a particular point, and the parameter α controls the power law. Power law addition has been shown to govern the combination of multiple simultaneous [Luft83] [Luft85] and temporal masking [Penn80] patterns, but the parameter α may be different along the frequency and time axes. We denote by αf the value of the parameter α used in PAQM for combining frequency-domain excitation patterns. The PAQM also models the temporal spread of excitation. Energy from the k-th block is scaled by an exponentially decaying gain g(t)


with a frequency-dependent time constant τ(f), i.e.,

g(t) = e^{-t/\tau(f)} \qquad (12.3)

where the time parameter t corresponds to the time distance between blocks, e.g., 20 ms. The scaled version of the k-th block energy is then combined with the energy of the (k + 1)-th (overlapping) block using a power law of the form given in Eq. (12.2), with the parameter α set for the time domain, i.e., α = αt. It was found that the level dependence of temporal excitation spread could be disregarded for the purposes of PAQM without affecting significantly the measurement quality. Thus, the parameters S1, S2, and αf control the spread of frequency-domain signal energy, while the parameters τ(f) and αt control the signal energy dispersion in time. Once excitation patterns have been obtained, internal original and coded signal representations are generated by transforming, at each frequency, the excitation energy E into a measure of loudness using Zwicker's compressive excitation-to-loudness mapping [Zwic67], i.e.,

\mathcal{L} = k \left( \frac{E_0}{s} \right)^{\gamma} \left[ \left( 1 - s + \frac{sE}{E_0} \right)^{\gamma} - 1 \right] \qquad (12.4)

where the parameter γ controls the compression characteristic, the parameter k is an arbitrary scaling constant, the constant E0 represents the amount of excitation in a tone of the same frequency at the absolute hearing threshold, and the constant s is a so-called "schwell" factor as defined in [Zwic67]. Although the mapping in Eq. (12.4) was proposed in the context of loudness estimation, the particular value of the parameter γ used in PAQM is different from the value empirically determined in Zwicker's experiments, and therefore the internal representations are not, strictly speaking, loudness measurements but are rather compressed loudness estimates. Indexed over Bark frequencies z and time t, the compressed loudness of the original signal, ℒs(z, t), and of the coded signal, ℒŝ(z, t), are compared in three Bark regions. Gains are extracted for each Bark subband in order to scale ℒŝ(z, t) such that the matching between ℒs(z, t) and the scaled version of ℒŝ(z, t) is maximized. The instantaneous, frequency-localized noise disturbance ℒn(z, t) = ℒŝ(z, t) − ℒs(z, t), i.e., the difference between internal representations, is integrated over the Bark scale and then across a suitable time interval to obtain a composite noise disturbance estimate ℒn. For the purposes of predicting the results of subjective listening tests, the PAQM parameters αf, αt, and γ can be adapted to minimize the standard deviation of a third-order regression line that fits the absolute performance grades from actual subjective listening tests to the PAQM output log(ℒn).

For example, in a fitting experiment using listening test results from the ISO/IEC MPEG 1990 database [ISOI90] for loudspeaker presentation of the test material, the values of 0.8, 0.6, and 0.04, respectively, were obtained for the parameters αf, αt, and γ. This yielded a correlation coefficient of 0.968 and a standard deviation of 0.351 performance grades. With a correlation coefficient of 0.91 and a standard deviation of 0.48, MSS prediction was not quite as good when


the same parameters were used to validate PAQM on subjective listening test results from the ISO/IEC MPEG 1991 database [ISOI91]. Some of the reduced prediction accuracy, however, was attributed to inexperienced listeners, particularly for some very high quality items for which the grading task is more difficult. In fact, better prediction accuracy was realized during a later comparison with more experienced listeners on the same database [Beer92b]. In addition, the model was also validated using the results of the ITU-R Task Group 10/2 1993 audio codec test [ITUR93b] on the contribution/distribution/emission (CDE) database, resulting in a correlation coefficient of 0.88 and a standard deviation of 0.29 MSS. As reported in [ITUR93c], these results were later verified by the Swedish Broadcasting Corporation. It is also interesting to note that the PAQM was optimized for listening test predictions on narrowband speech codecs. Several modifications were required to achieve a correlation coefficient of 0.99 and a standard deviation of 0.14 points relative to a 5-point discrete MOS. In particular, time and frequency energy dispersion had to be eliminated, the loudness compression parameter was changed to γ = 0.001, and silence intervals required a zero weighting [Beer94a]. In an effort to improve PAQM performance, cognitive effect modeling was investigated in [Beer94b] and [Beer95]. It was concluded that a unified PAQM configuration for both narrowband and wideband codecs must account for subjective phenomena such as the asymmetry in perceived distortion between audible artifacts and the perceived distortion when signal components are eliminated rather than distorted. The signal-dependent importance of modeling informational masking effects [Leek84] was also demonstrated [Beer96], although a general formulation was not proposed. More details on cognitive modeling improvements for PAQM appeared recently in [Beer98]. Future PAQM enhancements will involve better cognitive models as well as binaural processing capabilities.

12.7.2 Example: Noise-to-Mask Ratio (NMR)

In the late 1980s, Brandenburg proposed an NSE perceptual measurement scheme known as the "noise-to-mask ratio" (NMR) [Bran87a]. The system was originally intended to assess output quality and quantify noise margin during the development of the early perceptual transform coders such as OCF [Bran87b] and ASPEC [Bran91], but it has also been widely used in the development of more recent and sophisticated algorithms (e.g., MPEG-2 AAC [ISOI96a]). The NMR output measurement η represents a time- and frequency-averaged dB distance between the coding distortion s(n) − ŝ(n) and the masked threshold associated with the original signal s(n). The NMR system also sets a binary masking flag [Bran92b] whenever the noise density in any critical band exceeds the masked threshold. Tracking the masking flag over time helps designers to identify critical items for a particular codec. In a procedure markedly similar to the perceptual entropy front-end calculation (Chapter 5, Section 5.5), the NMR (Figure 12.4) is measured as follows. First, the original signal s(n) is time-delayed and, in some cases, amplitude scaled to maximize the temporal matching


with the coded signal ŝ(n), so that the coding distortion can be estimated through a simple difference operation, i.e., s(n) − ŝ(n). Then, FFT-based spectral analysis is performed on Hann-windowed, 1024-sample blocks (23.3 ms at 44.1 kHz) of the signal and the noise sequences. Although not shown in the figure, the difference is sometimes computed in the spectral domain (after the FFT) to reduce the impact of phase errors on estimated audible noise. New spectral estimates are computed every 512 samples (50% block overlap) to improve the effective temporal resolution of the measurement. Next, critical band signal and noise densities are estimated on a set of Bark-like subbands by summing and then normalizing the energy contained in 27 blocks of adjacent FFT bins that approximate 1-Bark bandwidth [Bran92b]. The masked threshold is derived from the Bark-scale signal density by means of simplified simultaneous and temporal masking models. For simultaneous masking, a conservatively estimated, level-independent prototype spreading function (specified in [Bran92b]) is convolved with the Bark density. Unlike other perceptual models, NMR assumes that the in-band masked threshold is fixed at 3 dB below the masker peak, independent of tonality. While this assumption apparently overestimates the overall excitation level, the NMR model compensates by ignoring masking additivity. The absolute threshold of hearing in quiet is also accounted for in the Bark-domain masking calculation. As far as temporal masking is concerned, only postmasking is accounted for explicitly, with the simultaneous masked threshold decreased for each critical band independently by 6 dB per analysis window (11.65 ms at 44.1 kHz). Finally, the local NMR ηloc is defined as

\eta_{\mathrm{loc}} = 10 \log_{10} \left( \frac{1}{27} \sum_{i=1}^{27} \frac{\sigma_n(i)}{\sigma_m(i)} \right) \qquad (12.5)

i.e., the mean of the ratio between the critical band noise density σn(i) and the critical band mask density σm(i). For a signal containing N blocks, the global NMR η is defined as the mean local NMR, i.e.,

\eta = 10 \log_{10} \left( \frac{1}{N} \sum_{i=1}^{N} 10^{0.1\,\eta_{\mathrm{loc}}(i)} \right) \qquad (12.6)

Figure 12.4. Noise-to-mask ratio (NMR). The original signal s and the coded signal ŝ are windowed and transformed (|FFT|²); the masking model driven by the original signal provides the masked threshold against which the noise spectrum of the difference s − ŝ is compared to yield the NMR.


Finally, a masking flag (MF) is set once every 512 samples if the noise in at least one critical band exceeds the masked threshold. One appealing feature of the NMR algorithm is its modest complexity, making it amenable to real-time implementation. In fact, a real-time NMR system for 44.1 kHz signals has been implemented on a set of three AT&T DSP32C 50 MHz devices [Herr92a] [Herr92b]. Processor-specific tasks were partitioned as follows: 1) correlator for delay and gain determination, 2) FFT and windowing, and 3) perceptual model and NMR calculations. We will consider next how the raw NMR measurements are applied in codec performance evaluation tasks.

Given that NMR updates occur once every 512 samples (11.7 ms at 44.1 kHz), the raw NMR measurements of ηloc and the masking flag can create a considerable volume of data. To simplify the evaluation task, several postprocessed, data-reducing metrics have been proposed for codec evaluation purposes [Beat96]. These secondary measures, including the global NMR (12.6), the worst-case local NMR (max(ηloc)), the percentage of nonzero masking flags (MF), and the "NMR fingerprint," have been demonstrated to predict quite accurately the presence of coding artifacts. For example, a global NMR η of less than or equal to −10 dB typically implies an artifact-free signal. High quality is also implied by an MF that occurs less than 3% of the time. An NMR fingerprint shows the time-averaged distribution of NMR over frequency (critical bands). A well-balanced fingerprint has been shown to correspond with better output quality than an unbalanced, "peaky" fingerprint for which high NMRs are localized to just a few critical bands. NMR equivalent bit rate has also been proposed as a quality metric, particularly in the case of impaired channel performance characterization. In this measurement scheme, an impaired codec is quantified in terms of an equivalent bit rate for which an unimpaired reference codec would achieve the same NMR. For example, a 64-kb/s codec in a three-stage tandem configuration might achieve the same NMR as a 32-kb/s version of the same codec in the absence of any tandeming. These and other NMR-based figures of merit have been used successfully to evaluate the performance of perceptual audio codecs in consumer recording equipment [Keyh93], tandem coding configurations [Keyh94], and network applications [Keyh95].
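A minimal sketch of Eqs. (12.5)-(12.6) and the derived figures of merit is given below. The per-band noise and mask densities are random stand-ins rather than the output of a real masking model, so only the arithmetic of the local NMR, the global NMR, the worst-case local NMR, and the masking-flag percentage is illustrated.

```python
# Sketch of the NMR statistics of Eqs. (12.5)-(12.6); densities are stand-ins.
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_bands = 200, 27
sigma_m = rng.uniform(1.0, 10.0, (n_frames, n_bands))            # mask density
sigma_n = sigma_m * rng.uniform(0.01, 0.5, (n_frames, n_bands))  # noise density

ratio = sigma_n / sigma_m
eta_loc = 10 * np.log10(ratio.mean(axis=1))               # Eq. (12.5), per frame
eta = 10 * np.log10(np.mean(10 ** (0.1 * eta_loc)))       # Eq. (12.6), global NMR
masking_flag = (ratio > 1.0).any(axis=1)                  # noise exceeds the mask

print(f"global NMR            : {eta:.1f} dB")
print(f"worst-case local NMR  : {eta_loc.max():.1f} dB")
print(f"masking flag set      : {100 * masking_flag.mean():.1f}% of frames")
```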

The various NMR-based measures have proven to be useful evaluation tools for many phases of codec research and development. The ultimate objective of TG 10/4, however, is to standardize automatic evaluation tools that can predict subjective quality on an impairment scale. In fact, NMR measurements can be correlated directly with the results of subjective listening tests [Herr92b]. Moreover, an appropriately interpreted NMR output can emulate a human subject in psychophysical experiments [Spor95b]. In the case of listening test predictions, a minimum mean square error (MMSE) linear mapping from the raw NMR measurements to the 5-grade CCIR impairment scale was derived on listening test data from the 1990 Swedish Radio ISO/MPEG database [ISOI90]. The mapping achieved a correlation of 0.94 and a mean deviation of 0.28 CCIR impairment grades. As for the psychophysical measurements, the NMR was treated as a test subject in several simultaneous and temporal masking experiments, and then the resulting


psychometric functions were compared against average results for actual human subjects. In the experiments, the NMR "subject" was judged to detect a given probe if the masking flag was set for more than half of the blocks, and the threshold was then interpreted as the lowest level probe for which more than 50% of the masking flags were set. In experiments on noise-masking-tone and tone-masking-noise, the NMR "subject" exhibited masking patterns similar to average human subjects, with some artifacts at low frequencies evident. In tone-masking-tone experiments, however, the NMR "subject" diverged from human performance, because the NMR perceptual model fails to account for the masking release caused by beating and difference tones.

The level of performance achieved by the NMR system on listening test prediction and psychometric measurement emulation tasks resulted in its inclusion in the ITU TG 10/4 evaluation process. Although the NMR perceptual model had been "frozen" in 1994 [Spor95b] to facilitate comparisons between NMR results generated at different sites for different algorithms, it was clear that perceptual model enhancements relative to [Bran92b] were necessary in order to improve masked threshold predictions and, ultimately, overall performance. Thus, while maintaining a complexity level compatible with real-time constraints, the TG 10/4 NMR submission was improved relative to the original NMR in several areas. For instance, the excitation model is now dependent on tonality and presentation level. Furthermore, the additivity of masking is now modeled explicitly. Finally, level dependencies are now matched to the actual presentation SPL. The system might eventually realize further performance gains by using higher-resolution spectral analysis at low frequencies and by estimating the BMLD for stereo signals.

12.7.3 Example: Objective Audio Signal Evaluation (OASE)

None of the NMR improvements cited above change the fact that the uniform-bandwidth channels of the FFT filter bank are fundamentally different from the auditory filter bank. In spite of other NMR modeling improvements, the FFT handicaps the measurement system because of inadequate frequency resolution at low frequencies and inadequate temporal resolution at high frequencies. To overcome such limitations, an alternative CIR-type algorithm for perceptual measurement, known as the objective audio signal evaluation (OASE) [Spor97], was also submitted to the ITU TG 10/4 as one of the three variant proposals from the NMR proponents. Unlike the other NMR variants, the OASE system (Figure 12.5) is a CIR system that generates independent auditory feature sets for the coded and reference signals, denoted by ŝ(n) and s(n), respectively. Then, a probabilistic detection stage determines the perceptual distance between the two auditory feature sets.

The OASE was designed to rectify perceptual measurement shortcomings inherent in FFT-based systems by achieving analysis resolution in both time and frequency that is greater than or equal to auditory resolution. In particular, the OASE filter bank approximates a Mel frequency scale by using 241 highly overlapping, 128-tap FIR bandpass filters centered at integral multiples of 0.1 Bark.


Figure 12.5. Objective audio signal evaluation (OASE). The reference s(n) and the coded signal ŝ(n) each pass through a 241-channel filter bank and a temporal model; the 241-element feature differences drive a detector bank that produces the detection probabilities pg and pcb, while a loudness mapping followed by a subjective mapping yields a predicted difference grade.

Meanwhile, the time resolution is fs/32, or 0.67 ms at 48 kHz. Moderate complexity is achieved through an efficient, structured multirate realization that exploits fast convolutions. Individual filter shapes are level- and frequency-dependent, modeled after experimental results on excitation patterns. The bandpass filters also simulate outer and middle ear transfer characteristics. A square-law rectifier post-processes the output from each channel of the filter bank, and then the auditory "internal noise" associated with the absolute threshold of hearing is simulated by adding to each output a predefined constant. Finally, the nonlinear filters proposed in [Karj85] are applied to simulate temporal integration and smoothing. The output of the temporal processing block constitutes a 241-element auditory feature vector for each signal. A decibel difference is computed on each channel between the features associated with the reference and coded signals. This time-frequency difference vector forms the input to a bank of detectors. At each time index, a frequency-localized difference detection probability pi(t) is assigned to the i = 1, 2, . . . , 241 features as a function of the difference magnitude, using a signal-dependent detection profile. For example, artificial signals use a detection profile different from that of speech and music. The time index t is discrete and takes on values that are integer multiples of 0.67 ms at 48 kHz (fs/32). Finally, a global probability of difference detection pg(t) is computed as the complement of the combined probability that no single-channel differences are detected, i.e.,

p_g(t) = 1 - \prod_{i=1}^{241} \left( 1 - p_i(t) \right) \qquad (12.7)

Alternatively, a critical band detection probability pcb(t) for a particular 1-Bark subband can be computed by combining the frequency-localized detection probabilities as follows:

p_{cb}(t) = 1 - \prod_{i=k-4}^{k+5} \left( 1 - p_i(t) \right) \qquad (12.8)

where the critical band of interest is centered on any of the eligible Mel-scale features, i.e., 5 ≤ k ≤ 236. In cases where it is desirable for OASE to predict the results of a subjective listening test, a noise loudness mapping can be applied to the feature difference vectors available at the output of the temporal processing stages. Such a configuration estimates the loudness of the difference between the original and coded signals. Then, as is possible with many


other perceptual measurement schemes, the noise loudness can be mapped to a predicted subjective difference grade. The ITU TG 10/4 has in fact used OASE model outputs to predict subjective scores. The results were reported in [Spor97] to have been reasonable and similar to other perceptual measurement systems, but no correlation coefficients or standard deviations were given.
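The probability combinations of Eqs. (12.7)-(12.8) reduce to a few lines of arithmetic. In the sketch below, the per-channel detection probabilities for a single time index are arbitrary stand-ins rather than the output of a real detector bank.

```python
# Sketch of Eqs. (12.7)-(12.8): combining per-channel detection probabilities.
import numpy as np

p = np.zeros(241)                 # p_i(t) for the 241 features (0.1-Bark spacing)
p[118:123] = [0.02, 0.05, 0.20, 0.05, 0.02]   # a small, localized difference

p_global = 1.0 - np.prod(1.0 - p)             # Eq. (12.7)

def p_critical_band(p, k):
    """Eq. (12.8): probability for the 1-Bark band centred on feature k
    (valid for 5 <= k <= 236; k is 1-based, so channels k-4..k+5 map to
    0-based indices k-5..k+4)."""
    return 1.0 - np.prod(1.0 - p[k - 5:k + 5])

print(f"global detection probability : {p_global:.3f}")
print(f"band centred on feature 121  : {p_critical_band(p, 121):.3f}")
```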

12.8 ITU-R BS.1387 AND ITU-T P.861 STANDARDS FOR PERCEPTUAL QUALITY MEASUREMENT

In 1995, the Radiocommunication Sector of the International Telecommunication Union (ITU-R) convened Task Group (TG) 10/4 to select a perceptual measurement scheme capable of predicting the results of subjective listening tests in the form of an objective difference grade (ODG). Six proposals were submitted to TG 10/4 in 1995, including the distortion index (DIX) from TU-Berlin [Thie96], the noise-to-mask ratio (NMR) from FhG-IIS Erlangen [Bran87a], the perceptual audio quality measure (PAQM) from KPN [Beer92a], the perceptual evaluation (PERCEVAL) from CRC Canada [Pail92], the perceptual objective measure (POM) from CCETT France [Colo95], and the Toolbox (TB) from IRT Munich. During initial TG 10/4 tests in 1996, three variants of each proposal were required to predict the results of an unknown listening test. A lengthy comparison and verification process ensued (e.g., [ITUR94a] compares NMR, PAQM, and PERCEVAL). Ultimately, the six TG 10/4 proponents, in conjunction with Opticom and Deutsche Telekom Berkom, collaborated to design a single system combining the most successful features from each individual methodology. This effort resulted in the recently adopted ITU-R Recommendation BS.1387 [ITUR98], which specifies two versions of the measurement scheme, known as perceptual evaluation of audio quality (PEAQ). The basic, low-complexity version is based on an FFT filter bank, much like the PAQM, PERCEVAL, POM, and NMR algorithms. The advanced version incorporates a more sophisticated filter bank that better approximates the auditory system, as in, for example, the DIX and OASE algorithms. Nearly all features extracted from the perceptual models to estimate quality are similar to the feature sets previously used in the individual proposals. Unlike the regression polynomials used in earlier tests, however, in BS.1387 a neural network performs the final mapping from the objective output to an ODG. Standardization activities are ongoing, and the ITU has formed a successor group to the now dissolved TG 10/4 under the designation JWP10-11Q [Spor99a]. Research into perceptual measurement techniques has also resulted in a standardized algorithm for speech quality measurements. In particular, Study Group 12 within the ITU Telecommunication Standardization Sector (ITU-T) in 1996 adopted the Recommendation P.861 [ITUT96] for perceptual quality measurements of telephone-band signals (300-3400 Hz). The technique is essentially a speech-specific version of the PAQM algorithm in which the time-frequency smearing blocks are usually disabled, the outer ear transfer function is eliminated, and simplifications are made in both the intensity-to-loudness mapping and the modeling of cognitive asymmetry effects.


12.9 RESEARCH DIRECTIONS FOR PERCEPTUAL CODEC QUALITY MEASURES

This chapter has provided a perspective on current performance evaluation techniques for perceptual audio codecs, including both subjective and objective quality measures. At present, the subjective listening test remains the most reliable technique for measurement of the small impairments associated with high-quality codecs, and a formal test methodology optimized for maximum reliability has been adopted as an international standard, the ITU-R BS.1116 [ITUR94b]. Nevertheless, even the results of BS.1116-compliant subjective tests continue to exhibit site and subject dependencies. Future research will continue to seek new methods for improving the reliability of subjective test outcomes. For example, Lucent Technologies sponsored an investigation at Moulton Laboratories [Moul98a] [Moul98b] into the effectiveness of multifacet Rasch models [Lina94] for improved reliability of subjective listening tests on high-quality audio codecs. Developed initially in the context of intelligence testing, the Rasch model [Rasc80] is a statistical analysis technique designed to remove the effects of local disturbances on test outcomes. This is accomplished by computing an expected data value for each data point that can be compared with the actual collected value to determine whether or not the data point is consistent with the demands of reproducible measurement. The Rasch methodology involves an iterative data pruning process of suspending misfitting data from the analysis, recomputing the measures and expected values, and then repeating the pruning process until all significant misfits are removed from the data set. The Rasch analysis proponents have indicated [Moul98a] [Moul98b] that future work will attempt to demonstrate the elimination of site dependencies and other local data disturbances. Research along these lines will continue as long as subjective listening tests exhibit site and subject dependence.
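The iterative prune-and-recompute idea can be illustrated loosely as follows. This is not the multifacet Rasch model itself; it is a generic sketch in which the "expected value" is simply the mean of the retained data and the misfit criterion is an arbitrary three-sigma rule, both chosen only to show the flavor of suspending misfitting data and recomputing.

```python
# Loose illustration of iterative misfit pruning (not an actual Rasch analysis).
import numpy as np

rng = np.random.default_rng(2)
scores = rng.normal(-1.0, 0.2, 40)          # hypothetical difference grades
scores[[3, 17]] = [-3.5, 1.5]               # two locally disturbed observations

active = np.ones(scores.size, dtype=bool)
while True:
    expected = scores[active].mean()        # stand-in for the model's expectation
    resid = np.abs(scores - expected)
    sigma = scores[active].std(ddof=1)
    misfits = active & (resid > 3 * sigma)  # grossly inconsistent data points
    if not misfits.any():
        break
    active &= ~misfits                      # suspend misfitting data and repeat

print(f"kept {active.sum()} of {scores.size} observations, "
      f"mean = {scores[active].mean():.2f}")
```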

Meanwhile, the considerable research that has been devoted to the development of automatic perceptual measurement schemes has resulted in the adoption of an international standard, the ITU-R BS.1387 [ITUR98]. Experts do not consider the standardized algorithm to be a replacement for human subjects, however, and research into improved perceptual measurement schemes will continue (e.g., within ITU-R JWP10-11Q). While the signal processing models of the auditory periphery have reached a high level of sophistication with manageable complexity, it is likely that future perceptual measurement schemes will realize the greatest performance gains as a result of improved models of binaural listening effects and general cognitive effects. As described in [Beer98], several central cognitive effects are important in subjective audio quality assessment, including the following.

Informational Masking. The masked threshold of a complex target masked by a complex masker may decrease by more than 40 dB [Leek84] after subject training. Modeling this effect has been shown to cause both increases and decreases in the correlation between ODGs and SDGs, depending on the listening test database [Beer98]. Future work will concentrate on the development of satisfactory models for informational masking, i.e., models that do not lead to reduced correlation.

Separation of Linear from Nonlinear Distortions. Linear distortions tend to be less objectionable than nonlinear distortions. Experimental attempts to quantify this effect in perceptual measurement schemes, however, have not yet succeeded.

Auditory Scene Analysis. This describes the cognitive mechanism that groups different auditory events into distinct objects. During a successful application of this principle to perceptual measurement systems [Beer94b], it was found that there is a significant asymmetry between the perceived distortion associated with the introduction of a new time-frequency component and the perceived distortion associated with the removal of an existing time-frequency component. Moreover, accounting for this effect was shown to improve SDG/ODG correlation for some listening test results.

Naturally, there is some debate over the feasibility of creating an automatic system that can predict the responses of human test subjects with sufficient precision to allow for differentiation between competing high-quality algorithms. Even recent systems tend to have difficulty predicting some observable psychoacoustic phenomena. For example, most systems fail to model the beat and combination-tone detection cues that generate a masking release (reduced threshold) in tone-masking-tone experiments. For these and other reasons, the large correlation coefficients that are obtained when mapping particular objective measures to certain subjective test sets have been shown to be data dependent (e.g., [Beer92b]). Keyhl et al. calibrated the perceptual audio quality measure (PAQM) to match a set of subjective test results for an MPEG codec in the WorldSpace satellite system [Keyh98]. The idea was to later use the automatic measure in place of expensive subjective tests for quality assurance over time. In spite of any apparent shortcomings, considerable interest in perceptual measurement schemes will persist as long as subjective tests remain time-consuming, expensive, and marginally reliable. For additional information, the interested reader can consult several survey papers on perceptual measurement schemes that appeared in an AES collection [Beat96] [Koml96] [Ryde96], as well as the textbook on perceptual measurement techniques [Beer98]. Two popular published standards are the perceptual evaluation of speech quality (PESQ) [Rix01] and the perceptual evaluation of audio quality (PEAQ) [PEAQ].
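Because the reported agreement between objective and subjective grades is data dependent, a common sanity check is to recompute the ODG/SDG correlation separately for each listening-test database rather than over pooled data. The fragment below is a minimal sketch of that check; the grade values are hypothetical and merely stand in for per-item ODGs and mean SDGs from two different test sets.

import numpy as np

def odg_sdg_correlation(odg, sdg):
    # Pearson correlation between objective (ODG) and subjective (SDG) grades
    # for the items of a single listening-test database.
    odg = np.asarray(odg, dtype=float)
    sdg = np.asarray(sdg, dtype=float)
    return float(np.corrcoef(odg, sdg)[0, 1])

# Hypothetical per-item grades from two different test databases.
db_a = ([-0.4, -1.2, -2.1, -3.0, -0.8], [-0.6, -1.0, -2.3, -3.2, -1.1])
db_b = ([-0.4, -1.2, -2.1, -3.0, -0.8], [-2.0, -0.7, -1.5, -2.6, -3.1])
print("database A:", odg_sdg_correlation(*db_a))  # close to +1 on this data
print("database B:", odg_sdg_correlation(*db_b))  # near zero on this data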

REFERENCES

[Adou87] JP Adoul et al ldquoFast CELP coding based on algebraic codesrdquo in ProcIEEE ICASSP-87 vol 12 pp 1957ndash1960 Dallas TX Apr 1987

[Adou95] J-P Adoul and R Lefebvre ldquoWideband speech codingrdquo in Speech Codingand Synthesis Eds WB Kleijn and KK Paliwal Kluwer AcademicPublishers Norwell MA 1995

[AES00] AES Technical Council Document AESTD1001101ndash10 MultichannelSurround Sound Systems and Operations Audio Engineering Society NewYork NY

[Ager94] F Agerkvist Time-Frequency Analysis and Auditory Models PhD ThesisThe Acoustics Laboratory Technical University of Denmark 1994

[Ager96] F Agerkvist ldquoA time-frequency auditory model using wavelet packetsrdquo JAud Eng Soc vol 44 no 12 pp 37ndash50 Feb 1996

[Akag94] K Akagiri ldquoTechnical Description of Sony Preprocessingrdquo ISOIEC JTC1SC29WG11 MPEG Input Document 1994

[Akai95] K Akaigiri ldquoDetailed Technical Description for MPEG-2 Audio NBCrdquoTechnical Report ISOIEC JTC1SC29WG11 1995

[Akan92] A Akansu and R Haddad Multiresolution Signal Decomposition Trans-forms Subbands Wavelets Academic Press San Diego CA 1992

[Akan96] A Akansu and MJT Smith eds Subband and Wavelet Transforms Designand Applications Kluwer Academic Publishers Norwell MA 1996

[Ali96] M Ali Adaptive Signal Representation with Application in Audio CodingPhD Thesis University of Minnesota Mar 1996

[Alki95] O Alkin and H Calgar ldquoDesign of efficient M-band coders with linearphase and perfect-reconstruction propertiesrdquo IEEE Trans Sig Proc vol43 no 7 pp 1579ndash1589 July 1995

[Alla99] E Allamanche R Geiger J Herre and T Sporer ldquoMPEG-4 low-delayaudio coding based on the AAC codecrdquo in Proc 106th Conv Aud EngSoc preprint 4929 May 1999

[Alle77] JB Allen and LR Rabiner ldquoA unified theory of short-time spectrumanalysis and synthesisrdquo Proc IEEE vol 65 no 11 pp 1558ndash1564 Nov1977

[Alme83] L Almeida ldquoNonstationary spectral modeling of voiced speechrdquo IEEETrans Acous Speech and Sig Proc vol 31 pp 374ndash390 June 1983

[Angu01] JA Angus ldquoAchieving effective dither in delta-sigma modulation systemsrdquoin Proc 110th Conv Aud Eng Soc preprint 5393 May 2001

[Angu97a] JA Angus and NM Casey ldquoFiltering signals directlyrdquo in Proc 102ndConv Aud Eng Soc preprint 4445 Mar 22ndash25 1997

[Angu97b] JAS Angus ldquoBackward adaptive lossless coding of audio signalsrdquo in Proc103rd Conv Aud Eng Soc preprint 4631 New York Sept 1997

[Angu98] J.A. Angus and S. Draper, "An improved method for directly filtering audio signals," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4737, May 16-19, 1998.

[Arno00] M Arnold ldquoAudio watermarking features applications and algorithmsrdquo inInt Conf on Multimedia and Expo (ICME-2000) vol 2 pp 1013ndash1016 30Julyndash2 Aug 2000

[Arno02] M Arnold et al Techniques and Applications of Digital Watermarking andContent Protection Artech House Norwood MA Dec 2002

[Arno02a] M Arnold and O Lobisch ldquoReal-time concepts for block-based wat-ermarking schemesrdquo in Proc of 2nd Int Conf on WEDELMUCSIC-02 pp156ndash160 Dec 2002

[Arno02b] M Arnold ldquoSubjective and objective quality evaluation of watermarkedaudio tracksrdquo in Proc of 2nd Int Conf on WEDELMUCSIC-02 pp161ndash167 Dec 2002

[Arno03] M Arnold ldquoAttacks on digital audio watermarks and countermeasuresrdquo inProc of 3rd Int Conf on WEDELMUCSIC-03 pp 55ndash62 Sept 2003

[Atal82a] BS Atal ldquoPredictive coding of speech at low bit ratesrdquo IEEE Trans onComm vol 30 no 4 p 600 Apr 1982

[Atal82b] BS Atal and J Remde ldquoA new model for LPC excitation for producingnatural sounding speech at low bit ratesrdquo in Proc IEEE ICASSP-82pp 614ndash617 Paris France April 1982

[Atal84] B Atal and MR Schroeder ldquoStochastic coding of speech signals at verylow bit ratesrdquo in Proc Int Conf Comm pp 1610ndash1613 May 1984

[Atal90] B Atal V Cuperman and A Gersho Advances in Speech Coding KluwerNorwell MA 1990

[Atti05] V Atti and A Spanias ldquoSpeech analysis by estimating perceptually ndashrelevant pole locationrdquo in Proc IEEE ICASSP-05 vol 1 pp 217ndash220Philadelphia PA March 2005

[Bald96] A Baldini ldquoDSP and MIDI applications in loudspeaker designrdquo in Proc100th Conv Aud Eng Soc preprint 4189 May 1996

[Bamb94] R Bamberger et al ldquoGeneralized symmetric extension for size-limitedmultirate filter banksrdquo IEEE Trans Image Proc vol 3 no 1 pp 82ndash87Jan 1994

[Bank96] M Bank et al ldquoAn objective method for sound quality estimation ofcompression systemsrdquo in Proc 101st Conv Aud Eng Soc preprint 43731996

[Barn03a] M Barni ldquoWhat is the future of watermarking Part-Irdquo IEEE SignalProcessing Magazine vol 20 no 5 pp 55ndash60 Sept 2003

[Barn03b] M Barni ldquoWhat is the future of watermarking Part-IIrdquo IEEE SignalProcessing Magazine vol 20 no 6 pp 53ndash59 Nov 2003

[Bass98] P Bassia and I Pitas ldquoRobust audio watermarking in the time-domainrdquo inProc EUSIPCO vol 1 pp 25ndash28 Greece Sept 1998

[Bass01] P Bassia I Pitas and N Nikolaidis ldquoRobust audio watermarking in thetime domainrdquo IEEE Trans on Multimedia vol 3 no 2 pp 232ndash241 June2001

[Baum95] F Baumgarte C Ferekidis and H Fuchs ldquoA nonlinear psychoacousticmodel applied to the ISO MPEG layer 3 coderrdquo in Proc 99th Conv AudioEng Soc preprint 4087 New York Oct 1995

[Bayl97] J Baylis Error Correcting Codes A Mathematical Introduction CRC PressBoca Raton FL Dec 1997

[Beat89] RJ Beaton ldquoHigh quality audio encoding within 128 kbitsrdquo in IEEEPacific Rim Conf on Comm Comp and Sig Proc pp 388ndash391 June 1989

[Beat96] R Beaton et al ldquoObjective perceptual measurement of audio qualityrdquo inCollected Papers on Digital Audio Bit-Rate Reduction N Gilchrist and CGrewin Eds Aud Eng Soc pp 126ndash152 1996

[Bech92] S Bech ldquoSelection and training of subjects for listening tests on sound-reproducing equipmentrdquo J Aud Eng Soc pp 590ndash610 JulAug 1992

[Beck04] E Becker et al Digital Rights Management Technological Economic andLegal and Political Aspects Springer Verlag Berlin Jan 2004

[Beer91] J Beerends and J Stemerdink ldquoMeasuring the quality of audio devicesrdquo inProc 90th Conv Aud Eng Soc preprint 3070 Feb 1991

[Beer92a] J Beerends and J Stemerdink ldquoA perceptual audio quality measurerdquo inProc 92nd Conv Aud Eng Soc preprint 3311 Mar 1992

[Beer92b] J Beerends and J Stemerdink ldquoA perceptual audio quality measure basedon a psychoacoustic sound representationrdquo J Aud Eng Soc vol 40 no 12pp 963ndash978 Dec 1992

[Beer94a] J Beerends and J Stemerdink ldquoA perceptual speech-quality measure basedon a psychoacoustic sound representationrdquo J Aud Eng Soc vol 42 no 3pp 115ndash123 Mar 1994

[Beer94b] J Beerends and J Stemerdink ldquoModeling a cognitive aspect in themeasurement of the quality of music codecsrdquo in Proc 96th Conv AudEng Soc preprint 3800 1994

[Beer95] J Beerends and J Stemerdink ldquoMeasuring the quality of speech and musiccodecs an integrated psychoacoustic approachrdquo in Proc 98th Conv AudEng Soc preprint 3945 1995

[Beer96] J Beerends et al ldquoThe role of informational masking and perceptualstreaming in the measurement of music codec qualityrdquo in Proc 100th ConvAud Eng Soc preprint 4176 May 1996

[Beer98] J Beerends ldquoAudio quality determination based on perceptual measurementtechniquesrdquo in Applications of Digital Signal Processing to Audio andAcoustics M Kahrs and K Brandenburg Eds Kluwer AcademicPublishers Boston pp 1ndash83 1998

[Beke60] G von Bekesy Experiments in Hearing McGraw Hill New-York 1960

[Bell76] M Bellanger et al ldquoDigital filtering by polyphase network application tosample rate alteration and filter banksrdquo IEEE Trans Acous Speech and SigProc vol 24 no 4 pp 109ndash114 Apr 1976

[Ben99] L Ben and M Sandler ldquoJoint source and channel coding for Internet audiotransmissionrdquo in Proc 106th Conv Aud Eng Soc preprint 4932 May1999

[Bend95] W Bender D Gruhl and N Morimoto ldquoTechniques for data hidingrdquo inProc of the SPIE 1995

[Bend96] W Bender D Gruhl N Morimoto and A Lu ldquoTechniques for data hidingrdquoIBM Syst J vol 35 nos 3 and 4 pp 313ndash336 1996

[Bene86] N Benevuto et al ldquoThe 32 kbs coding standardrdquo ATampT Technical Journal vol 65(5) pp 12ndash22 SeptndashOct 1986

[Berm02] G Berman and I Jouny ldquoTemporal watermarking invertible only throughsource separationrdquo in Proc IEEE ICASSP-2002 vol 4 pp 3732ndash3735Orlando FL May 2002

[Bess02] B Bessette et al ldquoThe adaptive multi-rate wideband speech codec (AMR-WB)rdquo IEEE Trans on Speech and Audio Proc vol 10 no 8 pp 620ndash636Nov 2002

[Blac95] M Black and M Zeytinoglu ldquoComputationally efficient wavelet packetcoding of wideband stereo audio signalsrdquo in Proc IEEE ICASSP-95 pp3075ndash3078 Detroit MI May 1995

[Blau74] J Blauert Spatial Hearing The MIT Press Cambridge MA 1974

[Bola95] S Boland and M Deriche ldquoHigh quality audio coding using multipulse LPCand wavelet decompositionrdquo in Proc IEEE ICASSP-95 pp 3067ndash3069Detroit MI May 1995

[Bola96] S Boland and M Deriche ldquoAudio coding using the wavelet packet transformand a combined scalar-vector quantizationrdquo in Proc IEEE ICASSP-96 pp1041ndash1044 Atlanta GA May 1996

[Bola97] S Boland and M Deriche ldquoNew results in low bitrate audio coding using acombined harmonic-wavelet representationrdquo in Proc IEEE ICASSP-97 pp351ndash354 Munich Germany April 1997

[Bola98] S Boland and M Deriche ldquoHybrid LPC and discrete wavelet transformaudio coding with a novel bit allocation algorithmrdquo in Proc IEEE ICASSP-98 vol 6 pp 3657ndash3660 Seattle WA May 1998

[Bone96] L Boney A Tewfik and K Hamdy ldquoDigital watermarks for audio signalsrdquoin Proc of Multimedia rsquo96 pp 473ndash480 Piscataway NJ IEEE Press 1996

[Borm03] J Bormans J Gelissen and A Perkis ldquoMPEG-21 the 21st centurymultimedia frameworkrdquo IEEE Sig Proc Mag vol 20 no 2 pp 53-62ndashMar 2003

[Bosi00] M Bosi ldquoHigh-quality multichannel audio coding trends and challengesrdquoJ Audio Eng Soc vol 48 no 6 pp 588 590ndash595 June 2000

[Bosi93] M Bosi et al ldquoAspects of current standardization activities for high-qualitylow rate multichannel audio codingrdquo IEEE Workshop on Applications ofSignal Processing to Audio and Acoustics New Paltz NY Oct 1993

[Bosi96] M Bosi K Brandenburg S Quackenbush L Fielder K Akagiri and HFuchs ldquoISOIEC MPEG-2 advanced audio codingrdquo in Proc 101st ConvAud Eng Soc preprint 4382 Los Angeles Nov 1996

[Bosi96a] M Bosi et al ldquoISOIEC MPEG-2 advanced audio codingrdquo in Proc 101stConv Audio Eng Soc preprint 4382 1996

[Bosi97] M Bosi et al ldquoISOIEC MPEG-2 advanced audio codingrdquo J Audio EngSoc pp 789ndash813 Oct 1997

[Boyd88] I Boyd and C Southcott ldquoA speech codec for the Skyphone servicerdquo BrTelecom Technical J vol 6(2) pp 51ndash55 April 1988

[Bran87a] K Brandenburg ldquoEvaluation of quality for audio encoding at low bit ratesrdquoin Proc 82nd Conv Aud Eng Soc preprint 2433 Mar 1987

[Bran87b] K Brandenburg ldquoOCF ndash a new coding algorithm for high quality soundsignalsrdquo in Proc IEEE ICASSP-87 vol 12 pp 141ndash144 Dallas TX May1987

[Bran88a] K Brandenburg ldquoHigh quality sound coding at 25 bitssamplerdquo in Proc84th Conv Aud Eng Soc preprint 2582 Mar 1988

[Bran88b] K Brandenburg ldquoOCF coding high quality audio with data rates of64 kbitsecrdquo in Proc 85th Conv Aud Eng Soc preprint 2723 Mar 1988

[Bran90] K Brandenburg and J Johnston ldquoSecond generation perceptual audiocoding the hybrid coderrdquo in Proc 88th Conv Aud Eng Soc preprint2937 Mar 1990

[Bran91] K Brandenburg et al ldquoASPEC adaptive spectral entropy coding of highquality music signalsrdquo in Proc 90th Conv Aud Eng Soc preprint 3011Feb 1991

[Bran92a] K Brandenburg et al ldquoComparison of filter banks for high quality audiocodingrdquo in Proc IEEE ISCAS 1992

[Bran92b] K Brandenburg and T Sporer ldquolsquoNMRrsquo and lsquoMasking Flagrsquo Evaluation ofquality using perceptual criteriardquo in Proc 11th Int Conf Aud Eng Soc pp169ndash179 May 1992

[Bran94a] K Brandenburg and G Stoll ldquoISO-MPEG-1 audio a generic standard forcoding of high-quality digital audiordquo J Audio Eng Soc pp 780ndash792 Oct1994

[Bran94b] K Brandenburg and B Grill ldquoFirst ideas on scalable audio codingrdquo in Proc97th Conv Aud Eng Soc San Francisco preprint 3924 1994

[Bran95] K Brandenburg and M Bosi ldquoOverview of MPEG-audio current and futurestandards for low bit-rate audio codingrdquo in Proc 99th Conv Aud Eng Socpreprint 4130 Oct 1995

[Bran97] K Brandenburg and M Bosi ldquoISOIEC MPEG-2 advanced audio codingoverview and applicationsrdquo in Proc 103th Conv Aud Eng Soc preprint4641 Sept 1997

[Bran98] K Brandenburg ldquoPerceptual coding of high quality digital audiordquo inApplications of Digital Signal Processing to Audio and Acoustics M Kahrsand K Brandenburg Eds Kluwer Academic Publishers Boston 1998

[Bran02] K Brandenburg ldquoAudio coding and electronic distribution of musicrdquoin Proc of 2nd International Conference on Web Delivering of Music(WEDELMUSIC-02) pp 3ndash5 Dec 2002

[Bree04] J Breebaart et al ldquoHigh-quality parametric spatial audio coding at lowbitratesrdquo in 116th Conv Aud Eng Soc Berlin Germany May 2004

[Bree05] J Breebart et al ldquoMPEG spatial audio codingMPEG surround Overviewand current statusrdquo in Proc 119th AES convention paper 6599 New YorkNew York Oct 2005

[Breg90] A S Bregman Auditory Scene Analysis MIT Press Cambridge MA 1990

[Bros03] PM Brossier MB Sandler and MD Plumbley ldquoReal time object basedcodingrdquo in Proc 101st Conv Aud Eng Soc preprint 5809 Mar 2003

[Brue97] F Bruekers W Oomen R van der Vleuten and L van de KerkhofldquoImproved lossless coding of 1-bit audio signalsrdquo in Proc 103rd ConvAud Eng Soc preprint 4563 New York Sept 1997

[Burn03] I Burnett et al ldquoMPEG-21 goals and achievementsrdquo in IEEE Multimedia vol 10 no 4 pp 60ndash70 OctndashDec 2003

[Buzo80] A Buzo et al ldquoSpeech coding based upon vector quantization rdquo IEEETrans Acoust Speech Signal Process vol 28 no 5 pp 562ndash574 Oct1980

[BWEB] Specification of the Bluetooth System available online at wwwbluetoothcom

[Cabo92] R Cabot ldquoMeasurements on low-rate codersrdquo in Proc 11th Int Conf AudEng Soc pp 216ndash222 May 1992

[Cacc01] G Caccia R Lancini M Ludovico F Mapelli and S TubaroldquoWatermarking for musical pieces indexing used in automatic cue sheetgeneration systemsrdquo IEEE 4th Workshop on Mult and Sig Proc pp487ndash492 Oct 2001

[Cagl91] H Caglar et al ldquoStatistically optimized PR-QMF designrdquo in Proc SPIEVis Comm and Img Proc pp 86ndash94 Nov 1991

[Cai95] H Cai and G Mirchandani ldquoWavelet transform and bit-plane codingrdquo inProc ICIP pp 578ndash581 1995

[Camp86] J Campbell and TE Tremain ldquoVoicedunvoiced classification of speechwith applications to the US government LPC-10e algorithmrdquo in Proc IEEEICASSP-86 vol 11 pp 473ndash476 Tokyo Japan April 1986

[Camp90] J Campbell TE Tremain and V Welch ldquoThe Proposed Federal Standard1016 4800 bps voice coder CELPrdquo Speech Tech Mag pp 58ndash64 Apr1990

[Cand92] JC Candy and GC Temes ldquoOversampling methods for AD and DAconversionrdquo in Oversampling Delta-Sigma Converters Eds JC Candy andGC Temes Piscataway NJ IEEE Press 1992

[Casa98] A Casal et al ldquoTesting a flexible time-frequency mapping for highfrequencies in TARCO (Tonal Adaptive Resolution COdec)rdquo in Proc 104thConv Aud Eng Soc preprint 4676 Amsterdam May 1998

[Case96] MA Casey and P Smaragdis ldquoNetsoundrdquo in Proc Int Computer MusicConf Hong Kong 1996

[CD82] Philips and Sony ldquoAudio CD System Descriptionrdquo Philips LicensingEindhoven The Netherlands 1982

[Chan90] W-Y Chan and A Gersho ldquoHigh fidelity audio transform coding with vectorquantizationrdquo in Proc IEEE ICASSP-90 pp 1109ndash1112 Albuquerque NMMay 1990

[Chan91a] W Chan and A Gersho ldquoconstrained-storage quantization of multiple vectorsources by codebook sharingrdquo IEEE Trans Comm vol 39 no 1 pp11ndash13 Jan 1991

[Chan91b] W Chan and A Gersho ldquoConstrained-storage vector quantization in highfidelity audio transform codingrdquo in Proc IEEE ICASSP-91 pp 3597ndash3600Toronto Canada May 1991

[Chan96] W Chang and C Want ldquoA masking-threshold-adapted weighting filter forexcitation searchrdquo IEEE Trans on Speech and Audio Proc vol 4 no 2pp 124ndash132 Mar 1996

[Chan97] W Chang et al ldquoAudio coding using sinusoidal excitation representationrdquoin Proc IEEE ICASSP-97 vol 1 pp 311ndash314 Munich Germany April1997

[Char88] A Charbonnier and JP Petit ldquoSub-band ADPCM coding for high qualityaudio signalsrdquo in Proc IEEE ICASSP-88 pp 2540ndash2543 New York NYMay 1988

[Chen92] J Chen R Cox Y Lin N Jayant and M Melchner ldquoA low-delay CELPcoder for the CCITT 16 kbs speech coding standardrdquo IEEE Trans SelectedAreas Commun vol 10 no 5 pp 830ndash849 June 1992

[Chen95] B Chen et al ldquoOptimal signal reconstruction in noisy filterbanks multirateKalman synthesis filtering approachrdquo IEEE Trans Sig Proc vol 43 no 11pp 2496ndash2504 Nov 1995

[Chen99] J Chen and H-M Tai ldquoMPEG-2 AAC coder on a fixed-point DSPrdquo in Procof Int Conf on Consumer Electronics ICCE-99 pp 24ndash25 June 1999

[Chen02a] S Cheng H Yu and Z Xiong ldquoError concealment of MPEG-2 AAC audiousing modulo watermarksrdquo in Proc IEEE ISCAS-2002 vol 2 pp 261ndash264May 2002

[Chen02b] S Cheng H Yu and X Zixiang ldquoEnhanced spread spectrum watermarkingof MPEG-2 AACrdquo in Proc IEEE ICASSP-2002 vol 4 pp 3728ndash3731Orlando FL May 2002

[Chen04] L-J Chen R Kapoor K Lee MY Sanadidi M Gerla ldquoAudio streamingover Bluetooth an adaptive ARQ timeout approachrdquo in 6th Int Workshopon Mult Network Syst and App(MNSA 2004) Tokyo Japan 2004

[Cheu95] S Cheung and J Lim ldquoIncorporation of biorthogonality into lappedtransforms for audio compressionrdquo in Proc IEEE ICASSP-95 pp3079ndash3082 Detroit MI May 1995

[Chia96] H-C Chiang and J-C Liu ldquoRegressive implementations for the forward andinverse MDCT in MPEG audio codingrdquo IEEE Sig Proc Letters vol 3no 4 pp 116ndash118 Apr 1996

[Chou01a] J Chou K Ramchandran A Ortega ldquoNext generation techniques for robustand imperceptible audio data hidingrdquo in Proc IEEE ICASSP-2001 vol 3pp 1349ndash1352 Salt Lake City UT May 2001

[Chou01b] J Chou K Ramchandran and A Ortega ldquoHigh capacity audio data hidingfor noisy channelsrdquo in Proc Int Conf on Inf Tech Coding and Computingpp 108ndash112 Apr 2001

[Chow73] J Chowing ldquoThe synthesis of complex audio spectra by means of frequencymodulationrdquo J Audio Eng Soc pp 526ndash529 Sep 1973

[Chu85] PL Chu ldquoQuadrature mirror filter design for an arbitrary number of equalbandwidth channelsrdquo IEEE Trans Acous Speech and Sig Proc vol 33no 1 pp 203ndash218 Feb 1985

[Cilo00] T Ciloglu and SU Karaaslan ldquoAn improved all-pass watermarking schemefor speech and audiordquo in Proc Int Conf on Mult and Expo (ICME-2000)vol 2 pp 1017ndash1020 JulyndashAug 2000

[Coif92] R Coifman and M Wickerhauser ldquoEntropy based algorithms for best basisselectionrdquo IEEE Trans Information Theory vol 38 no 2 pp 712ndash718Mar 1992

[Colo95] C Colomes et al ldquoA perceptual model applied to audio bit-rate reductionrdquoJ Aud Eng Soc vol 43 no 4 pp 233ndash240 Apr 1995

[Cont96] L Contin et al ldquoTests on MPEG-4 audio codec proposalsrdquo J Sig ProcImage Comm Oct 1996

[Conw88] J Conway and N Sloane Sphere Packing Lattices and Groups Springer-Verlag New York 1988

[Cove91] T Cover and J Thomas Elements of Information Theory Wiley-Interscience New York Aug 1991

[Cox86] Cox R ldquoThe design of uniformly and nonuniformly spaced pseudo QMFrdquoIEEE Trans Acous Speech and Sig Proc vol 34 pp 1090ndash1096 Oct1986

[Cox95] IJ Cox J Kilian T Leighton and T Shamoon ldquoSecure spread spectrumwatermarking for multimediardquo Tech Rep 95ndash10 NEC Research InstitutePrinceton NJ 1995

[Cox96] IJ Cox J Kilian T Leighton and T Shamoon ldquoA secure robustwatermark for multimediardquo in Proc of Information Hiding Workshop rsquo96Cambridge UK pp 147ndash158 May 1996

[Cox97] IJ Cox J Kilian T Leighton and T Shamoon ldquoSecure spread spectrumwatermarking for multimediardquo IEEE Trans Image Processing vol 6 no 12pp 1673ndash1687 1997

[Cox99] IJ Cox ML Miller AL McKellips ldquoWatermarking as communicationswith side informationrdquo Proc of the IEEE vol 87 no 7 pp 1127ndash1141July 1999

[Cox01a] IJ Cox ML Miller and JA Bloom Digital Watermarking MorganKaufmann Publishers Sanfransisco 2001

[Cox01b] IJ Cox and ML Miller ldquoElectronic watermarking the first 50 yearsrdquo IEEE4th Workshop on Mult Sig Proc pp 225ndash230 Oct 2001

[Crav96] PG Craven and M Gerzon ldquoLossless coding for audio discsrdquo J AudioEng Soc vol 44 no 9 pp 706ndash720 Sept 1996

[Crav97] S Craver N Memon B-L Yeo and M Yeung ldquoCan invisible watermarksresolve rightful ownershiprdquo IBM Research Report RC 20509 Feb 1997

[Crav97] PG Craven M Law and J Stuart ldquoLossless compression using IIRprediction filtersrdquo in Proc 102nd Conv Aud Eng Soc preprint 4415Munich March 1997

[Crav98] S Craver N Memon B-L Yeo and M Yeung ldquoResolving rightfulownerships with invisible watermarking techniques limitations attacks andimplicationsrdquo in IEEE Journal on Selected Areas in Communications vol16 no 4 pp 573ndash586 1998

[Crav01] SA Craver M Wu and B Liu ldquoWhat can we reasonably expect fromwatermarksrdquo in 2001 IEEE Workshop on the Applications of SignalProcessing to Audio and Acoustics 21ndash24 Oct 2001 pp 223ndash226

[Creu96] C Creusere and S Mitra ldquoEfficient audio coding using perfect reconstructionnoncausal IIR filter banksrdquo IEEE Trans Speech and Aud Proc vol 4 no 2pp 115ndash123 Mar 1996

[Creu02] CD Creusere ldquoAn analysis of perceptual artifacts in MPEG scalable audiocodingrdquo Proc of Data Compression Conference (DCC-02) pp 152ndash161Apr 2002

[Croc76] RE Crochiere et al ldquoDigital coding of speech in subbandsrdquo Bell Sys TechJ vol 55 no 8 pp 1069ndash1085 Oct 1976

[Croc83] RE Crochiere and L R Rabiner Multirate Digital Signal ProcessingPrentice-Hall Englewood Cliffs NJ 1983

[Croi76] A Croisier et al ldquoPerfect channel splitting by use of interpolationdecimationtree decomposition techniquesrdquo in Proc Int Conf on Inf Sciand Syst Patras 1976

[Cumm73] P Cummiskey et al ldquoAdaptive quantization in differential PCM coding ofspeechrdquo Bell Syst Tech J vol 52 no 7 p 1105 Sept 1973

[Cupe85] V Cuperman and A Gersho ldquoVector predictive coding of speech at 16 kbsrdquoIEEE Trans on Commun vol 33 no 7 pp 685ndash696 Jul 1985

[Cupe89] V Cuperman ldquoOn adaptive vector transform quantization for speechcodingrdquo IEEE Trans on Commun vol 37 no 3 pp 261ndash267 Mar 1989

[Cvej01] N Cvejic A Keskinarkaus T Seppanen ldquoAudio watermarking using m-sequences and temporal maskingrdquo IEEE Workshop on App of Sig Proc toAudio and Acoustics pp 227ndash230 Oct 2001

[Cvej03] N Cvejic D Tujkovic T Seppanen ldquoIncreasing robustness of an audiowatermark using turbo codesrdquo in Proc ICME-03 vol 1 pp 217ndash220 July2003

[Daub88] I Daubechies ldquoOrthonormal bases of compactly supported waveletsrdquoComm Pure Appl Math pp 909ndash996 Nov 1988

[Daub92] I Daubechies Ten Lectures on Wavelets Society for Industrial and AppliedMathematics Philadelphia PA 1992

[Daub96] I Daubechies and W Sweldens ldquoFactoring wavelet transforms into liftingstepsrdquo Tech Rep Bell Laboratories Lucent Technologies 1996

[DAVC96] Digital Audio-Visual Council (DAVIC) DAVIC Technical Specification 12Part 9 ldquoInformation Representationrdquo Dec 1996 Available online fromhttpwwwdavicorg

[Davi86] G Davidson and A Gersho ldquoComplexity reduction methods for vectorexcitation codingrdquo in Proc IEEE ICASSP-86 vol 11 pp 3055ndash3058Tokyo Japan April 1986

[Davi90] G Davidson et al ldquoLow-complexity transform coder for satellite linkapplicationsrdquo in Proc 89th Conv Aud Eng Soc preprint 2966 1990

[Davi92] G Davidson and M Bosi ldquoAC-2 high quality audio coding for broadcas-ting and storagerdquo in Proc 46th Annual Broadcast Eng Conf pp 98ndash105Apr 1992

[Davi94] G Davidson et al ldquoParametric bit allocation in a perceptual audio coderrdquoin Proc 97th Conv Aud Eng Soc preprint 3921 Nov 1994

[Davi98] G Davidson ldquoDigital audio coding Dolby AC-3rdquo in The Digital SignalProcessing Handbook V Madisetti and D Williams Eds CRC Press pp411ndash4121 1998

[Davis93] M Davis ldquoThe AC-3 multichannel coderrdquo in Proc 95th Conv Aud EngSoc preprint 3774 Oct 1993

[Davis94] M Davis and C Todd ldquoAC-3 operation bitstream syntax and featuresrdquo inProc 97th Conv Aud Eng Soc preprint 3910 1994

[Davis03] M Davis ldquoHistory of spatial codingrdquo J Audio Eng Soc vol 51 no 6pp 554-569 June 2003

[Dehe91] YF Dehery et al ldquoA MUSICAM source codec for digital audiobroadcasting and storagerdquo in Proc IEEE ICASSP-91 pp 3605ndash3608Toronto Canada May 1991

[Delo96] A Delopoulos and S Kollias ldquoOptimal filterbanks for signal reconstructionfrom noisy subband componentsrdquo IEEE Trans Sig Proc vol 44 no 2 pp212ndash224 Feb 1996

[Deri99] M Deriche D Ning and S Boland ldquoA new approach to low bit rate audiocoding using a combined harmonic-multiband-wavelet representationrdquo inProc ISSPA-99 vol 2 pp 603ndash606 Aug 1999

[Diet96] M Dietz et al ldquoAudio compression for network transmissionrdquo J Aud EngSoc pp 58ndash71 Jan Feb 1996

[Diet00] M Dietz and T Mlasko ldquoUsing MPEG-4 audio for DRM digital narrowbandbroadcastingrdquo in Proc ISCAS-2000 vol 3 pp 205ndash208 May 2000

[Diet02] M Dietz et al ldquoSpectral band replication a novel approach in audio codingrdquoin 112th Conv Aud Eng Soc preprint 5553 Munich Germany May2002

[Ditt00] J Dittmann A Mukherjee and M Steinebach ldquoMedia-independentwatermarking classification and the need for combining digital video andaudio watermarking for media authenticationrdquo in Proc Int Conf on InfTech Coding and Computing pp 62ndash67 Mar 2000

[Ditt01] J Dittmann and M Steinebach ldquoJoint watermarking of audio-visual datardquoIEEE 4th Workshop on Mult Sig Proc pp 601ndash606 Oct 2001

[Dobs97] W Dobson et al ldquoHigh quality low complexity scalable wavelet audiocodingrdquo in Proc IEEE ICASSP-97 pp 327ndash330 Munich Germany April1997

[DOLBY] ldquoSurround Sound and Multimedia Dolby Surround and Dolby Digital forvideo games CD ROM DVD ROM the Internet ndash and morerdquo Availableonline at httpwwwdolbycommultisurmultihtml

[Dong02] H Dong JD Gibson and MG Kokes ldquoSNR and bandwidth scalablespeech codingrdquo in Proc IEEE ICASSP-02 vol 2 pp 859ndash862 OrlandoFL May 2002

[Dorw01] S Dorward D Huang SA Savari G Schuller and B Yu ldquoLow delayperceptually lossless coding of audio signalsrdquo in Proc of Data CompressionConference (DCC-01) pp 312ndash320 Mar 2001

[DTS] The Digital Theater Systems (DTS) Available online at wwwdtsonlinecom

[Duen02] A Duenas R Perez B Rivas E Alexandre and A Pena ldquoA robust andefficient implementation of MPEG-24 AAC natural audio codersrdquo in Proc112th Conv Aud Eng Soc preprint 5556 May 2002

[Duha91] P Duhamel et al ldquoA fast algorithm for the implementation of filter banksbased on time domain aliasing cancellationrdquo in Proc IEEE ICASSP-91 pp2209ndash2212 Toronto Canada May 1991

[Dunn96] C Dunn and M Sandler ldquoA comparison of dithered and chaotic sigma-deltamodulatorsrdquo J Audio Eng Soc vol 44 no 4 pp 227ndash244 Apr 1996

[Duro96] X Durot and J-B Rault ldquoA new noise injection model for audio compressionalgorithmsrdquo in Proc 101st Conv Aud Eng Soc preprint 4374 Nov 1996

[DVD01] DVD Forum ldquoDVD Specifications for Read-only Discrdquo Part 4 AudioSpecifications Version 12 Tokyo Japan Mar 2001

[DVD96] DVD Consortium ldquoDVD-video and DVD-ROM specificationsrdquo 1996

[DVD99] DVD Forum ldquoDVD Specifications for Read-Only Discrdquo Part 4 AudioSpecifications Packed PCM MLP Reference Information Version 10Tokyo Japan Mar 1999

[East97] PC Eastty C Sleight and PD Thorpe ldquoResearch on cascadable filteringequalization gain control and mixing of 1-bit signals for professional audioapplicationsrdquo in Proc 102th Conv Aud Eng Soc preprint 4444 Mar22ndash25 1997

[EBU99] EBU Technical Recommendation R96-1999 Formats for production anddelivery of multichannel audio programs European Broadcasting UnionGeneva Switzerland 1999

[Edle89] B Edler ldquoCodierung von Audiosignalen mit uberlappender Transformationund adaptiven Fensterfunktionen (Coding of Audio Signals with OverlappingBlock Transform and Adaptive Window Functions)rdquo Frequenz pp252ndash256 1989

[Edle95] B Edler ldquoTechnical Description of the MPEG-4 Audio Coding Proposalfrom University of Hannover and Deutsche Bundespost Telekomrdquo ISOIECJTC1SC29WG11 MPEG95MO414 Oct 1995

[Edle96a] B Edler and H Purnhagen ldquoTechnical Description of the MPEG-4 AudioCoding Proposal from University of Hannover and Deutschet Telekom AGrdquoISOIEC JTC1SC29WG11 MPEG96MO632 Jan 1996

[Edle96a] B Edler ldquoCurrent status of the MPEG-4 audio verification modeldevelopmentrdquo in Proc 101st Conv Aud Eng Soc Preprint 4376 Nov1996

[Edle96b] B Edler and L Contin ldquoMPEG-4 Audio Test Results (MOS Test)rdquo ISOIECJTC1SC29WG11 N1144 Jan 1996

[Edle96b] B Edler et al ldquoASAC ndash AnalysisSynthesis Audio Codec for very low bitratesrdquo in Proc 100th Conv Aud Eng Soc preprint 4179 May 1996

[Edle96c] B Edler and H Purnhagen ldquoParametric audio codingrdquo in Proc InternationalConference on Signal Processing (ICSP 2000) Beijing 2000

[Edle97] B Edler ldquoVery low bit rate audio coding developmentrdquo in Proc 14th AudioEng Soc Int Conf June 1997

[Edle98] B Edler and H Purnhagen ldquoConcepts for hybrid audio coding schemesbased on parametric techniquesrdquo in Proc 105th Conv Audio Eng Socpreprint 4808 1998

[Edle99] B Edler ldquoSpeech coding in MPEG-4rdquo Int J of Speech Technology vol 2no 4 pp 289ndash303 May 199

[Egan50] J Egan and H Hake ldquoOn the masking pattern of a simple auditory stimulusrdquoJ Acoust Soc Am vol 22 pp 622ndash630 1950

[Ehre04] A Ehret et al ldquoaacPlus only a low-bitrate codecrdquo in 117th Conv AudEng Soc preprint 6199 San Francisco CA Oct 2004

[Ekud99] R Ekudden R Hagen I Johansson and J Svedburg ldquoThe adaptive multi-rate speech coderrdquo in Proc IEEE Workshop on Speech Coding pp 117ndash119June 1999

[Elli96] PW Ellis ldquoPrediction-Driven Computational Auditory Scene AnalysisrdquoPhD Thesis Massachusetts Institute of Technology June 1996

[Erku03] S Erkucuk S Krishnan and M Zeytinoglu ldquoRobust audio watermarkingusing a chirp based techniquerdquo in Proc ICME-03 vol 2 pp 513ndash516 July2003

[Esma03] S Esmaili S Krishnan and K Raahemifar ldquoA novel spread spectrum audiowatermarking scheme based on time-frequency characteristicsrdquo in Proc ofCCECE-03 vol 3 pp 1963ndash1966 May 2003

[Este77] D Esteban and C Galand ldquoApplication of quadrature mirror filters tosplit band voice coding schemesrdquo in Proc IEEE ICASSP-77 vol 2pp 191ndash195 Hartford CT May 1977

[ETSI98] ETSI AMR Qualification Phase Documentation 1998

[Fall01] C Faller and F Baumgarte ldquoEfficient representation of spatial audio usingperceptual parameterizationrdquo IEEE Workshop on Applications of SignalProcessing to Audio and Acoustics New Paltz New York 2001

[Fall02a] C Faller and F Baumgarte ldquoBinaural cue coding a novel and efficientrepresentation of spatial audiordquo in Proc IEEE ICASSP-02 vol 2 pp1841ndash1844 Orlando FL May 2002

[Fall02b] C Faller and F Baumgarte ldquoEstimation of auditory spatial cues for binauralcue codingrdquo in Proc IEEE ICASSP-02 vol 2 pp 1801ndash1804 Orlando FLMay 2002

[Fall02b] C Faller B-H Juang P Kroon H-L Lou SA Ramprashad C-EWSundberg ldquoTechnical advances in digital audio radio broadcastingrdquo Proc ofthe IEEE vol 90 no 8 pp 1303ndash1333 Aug 2002

[Feit98] B Feiten et al ldquoDynamically scalable audio internet transmissionrdquo in Proc104th Conv Audio Eng Soc preprint 4686 May 1998

[Fejz00] Z Fejzo S Smyth K McDowell and Y-L You ldquoBackward compatibleenhancement of DTS multi-channel audio coding that delivers 96-kHz24-bitaudio qualityrdquo in Proc of 109th Conv Aud Eng Soc preprint 5259 Sept22ndash25 2000

[Ferr96a] A Ferreira ldquoConvolutional effects in transform coding with TDAC anoptimal windowrdquo IEEE Trans on Speech and Audio Proc vol 4 no 2pp 104ndash114 Mar 1996

[Ferr96b] AJS Ferreira ldquoPerceptual coding of harmonic signalsrdquo in Proc 100thConv Audio Eng Soc preprint 4746 May 1996

[Ferr01a] AJS Ferreira ldquoCombined spectral envelope normalization and subtractionof sinusoidal components in the ODFT and MDCT frequency domainsrdquoIEEE Workshop on the Applications of Signal Processing to Audio andAcoustics pp 51ndash54 Oct 2001

[Ferr01b] AJS Ferreira ldquoAccurate estimation in the ODFT domain of the frequencyphase and magnitude of stationary sinusoidsrdquo IEEE Workshop on theApplications of Signal Processing to Audio and Acoustics pp 47ndash51 Oct2001

[Fiel91] L Fielder and G Davidson ldquoAC-2 a family of low complexity transform-based music codersrdquo in Proc 10th AES Int Conf Sep 1991

[Fiel96] L Fielder et al ldquoAC-2 and AC-3 low-complexity transform-based audiocodingrdquo in Collected Papers on Digital Audio Bit-Rate Reduction NGilchrist and C Grewin Eds Aud Eng Soc pp 54ndash72 1996

[Fiel04] L Fielder et al ldquoIntroduction to Dolby digital plus an enhancement to theDolby digital coding systemrdquo in Proc 117th AES convention paper 6196San Francisco CA Oct 2004

[Flet37] H Fletcher and W Munson ldquoRelation between loudness and maskingrdquoJ Acoust Soc Am vol 9 pp 1ndash10 1937

[Flet40] H Fletcher ldquoAuditory patternsrdquo Rev Mod Phys pp 47ndash65 Jan 1940

[Flet91] N Fletcher and T Rossing The Physics of Musical Instruments SpringerNew York 1991

[Flet98] J Fletcher ldquoISOMPEG layer 2 ndash optimum re-encoding of decoded audiousing a MOLE signalrdquo in Proc 104th Conv Aud Eng Soc preprint 4706Amsterdam 1998 See also httpwwwbbccoukatlantic

[Foo01] SW Foo TH Yeo and DY Huang ldquoAn adaptive audio watermarkingsystemrdquo in Proc Int Conf on Electrical and Electronic Technology vol 2pp 509ndash513 Aug 2001

[Foss95] R Foss and T Mosala ldquoRouting MIDI messages over Ethernetrdquo in Proc99 th Conv Aud Eng Soc preprint 4057 Oct 1995

[FS1015] Federal Standard 1015 Telecommunications Analog to Digital Conversionof Radio Voice by 2400 bitsecond Linear Predictive Coding NationalCommunication System ndash Office Technology and Standards Nov 1985

[FS1016] Federal Standard 1016 Telecommunications Analog to Digital Conversionof Radio Voice by 4800 bitsecond Code Excited Linear Prediction (CELP)National Communication System ndash Office Technology and Standards Feb1991

[Fuch93] H Fuchs ldquoImproving joint stereo audio coding by adaptive inter-channelpredictionrdquo in Proc IEEE ASSP Workshop on Apps of Sig Proc to Audand Acous 1993

[Fuch95] H Fuchs ldquoImproving MPEG audio coding by backward adaptive linearstereo predictionrdquo in Proc 99th Conv Aud Eng Soc preprint 4086 Oct1995

[Furo00] T Furon N Moreau and P Duhamel ldquoAudio public key watermarkingtechniquerdquo in Proc IEEE ICASSP-2000 vol 4 pp 1959ndash1962 IstanbulTurkey June 2000

[G722] ITU Recommendation G722 ldquo7 kHz Audio Coding within 64 kbsrdquo BlueBook vol III Fascicle III Oct 1988

[G7222] ITU Recommendation G7222 ldquoWideband coding of speech at around16 kbs using Adaptive Multi-rate Wideband (AMR-WB)rdquo Jan 2001

[G7231] ITU Recommendation G7231 ldquoDual Rate Speech Coder for MultimediaCommunications transmitting at 53 and 63 kbsrdquo Draft 1995

[G726] ITU Recommendation G726 ldquo24 32 40 kbs Adaptive Differential PulseCode Modulation (ADPCM)rdquo Blue Book vol III Fascicle III3 Oct 1988

[G728] ITU Draft Recommendation G728 ldquoCoding of Speech at 16 kbits UsingLow-Delay Code Excited Linear Prediction (LD-CELP)rdquo 1992

[G729] ITU Study Group 15 Draft Recommendation G729 ldquoCoding of Speech at8 kbs using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction(CS-ACELP)rdquo 1995

[G721] CCITT Recommendation G721 ldquo32 kbs adaptive differential pulse codemodulation (ADPCM)rdquo in Blue Book vol III Fascicle III Oct 1988

[G722] CCITT (ITU-T) Rec G722 ldquo7 kHz audio coding within 64 kbitsrdquo BlueBook vol III fasc III4 pp 269ndash341 1988

[G726] ITU Recommendation G726 (formerly G721) ldquo24 32 40 kbs Adaptivedifferential pulse code modulation (ADPCM)rdquo Blue Book vol III FascicleIII3 Oct 1988

[G728] International Telecommunications Union Rec G728 ldquoCoding of Speech at16 kbps Using Low-Delay Code Excited Linear Predictionrdquo Sept 1992

[G729] ITU-T Recommendation G729 ldquoCoding of speech at 8 kbps usingconjugate-structure algebraic code excited linear prediction (CS-ACELP)rdquo1995

[Gang02] L Gang AN Akansu and M Ramkumar ldquoSecurity and synchronization inwatermark sequencerdquo in Proc IEEE ICASSP-2002 vol 4 pp 3736ndash3739Orlando FL May 2002

[Gao01a] Y Gao et al ldquoThe SMV algorithm selected by TIA and 3GPP2 for CDMAapplicationsrdquo in Proc IEEE ICASSP-01 vol 2 pp 709ndash712 Salt Lake CityUT May 2001

[Gao01b] Y Gao et al ldquoEX-CELP A speech coding paradigmrdquo in Proc IEEEICASSP-01 vol 2 pp 689ndash692 Salt Lake City UT May 2001

[Garc99] RA Garcia ldquoDigital watermarking of audio signals using a psychoacousticauditory model and spread spectrum theoryrdquo in Proc 107th Conv Aud EngSoc preprint 5073 New York Sept 1999

[Gass54] G Gassler ldquoUber die horschwelle fur schallereignisse mit verschiedenbreitem frequenzspektrumrdquo Acustica vol 4 pp 408ndash414 1954

[Gbur96] U Gbur M Werner and M Dietz ldquoRealtime implementation of anISOMPEG layer 3 encoder on Pentium PCsrdquo in Proc 101st Conv AudEng Soc preprint 4386 Nov 1996

[Geig01] R Geiger T Sporer J Koller and K Brandenburg ldquoAudio coding basedon integer transformsrdquo in Proc 111th Conv Aud Eng Soc Preprint 5471New York 2001

[Geig02] R Geiger J Herre J Koller and K Brandenburg ldquoIntMDCT ndash A linkbetween perceptual and lossless audio codingrdquo in Proc IEEE ICASSP-2002 vol 2 pp 1813ndash1816 Orlando FL May 2002

[Geig03] R Geiger J Herre G Schuller and T Sporer ldquoFine grain scalableperceptual and lossless audio coding based on IntMDCTrdquo in Proc IEEEICASSP-2003 Hong Kong April 2003

[Geor87] EB George and MJT Smith ldquoA new speech coding model based ona least-squares sinusoidal representationrdquo in Proc IEEE ICASSP-87 pp1641ndash1644 Dallas TX Apr 1987

[Geor90] EB George and MJT Smith ldquoPerceptual Considerations in a low bit ratesinusoidal vocoderrdquo in Proc IEEE Int Phoenix Conf on Comp in Commpp 268ndash275 Mar 1990

[Geor92] EB George and MJT Smith ldquoAnalysis-by-synthesisoverlap-addsinusoidal modeling applied to the analysis and synthesis of musical tonesrdquoJ Audio Eng Soc pp 497ndash516 June 1992

[Geor97] EB George and MJT Smith ldquoSpeech analysissynthesis and modificationusing and analysis-by-synthesisoverlap-add sinusoidal modelrdquo IEEE Transon Speech and Audio Proc pp 389ndash406 Sept 1997

[Gers83] A Gersho and V Cuperman ldquoVector quantization a pattern matchingtechnique for speech codingrdquo IEEE Commun Mag vol 21 no 9 pp15ndash21 Dec 1983

[Gers90a] A Gersho and S Wang ldquoRecent trends and techniques in speech codingrdquoin Proc 24th Asilomar Conf Pacific Grove CA Nov 1990

[Gers90b] IA Gerson and MA Jasiuk ldquoVector sum excited linear prediction (VSELP)speech coding at 8 kbsrdquo in Proc of ICASSP-90 vol 1 pp 461ndash464Albuquerque NM Apr 1990

[Gers91] I Gerson and M Jasiuk ldquoTechniques for improving the performanceof CELP-type speech codersrdquo in Proc IEEE ICASSP-91 pp 205ndash208Toronto Canada May 1991

[Gers92] A Gersho and R Gray Vector Quantization and Signal Compression Kluwer Academic Publishers Norwell MA 1992

[Gerz99] MA Gerzon JR Stuart RJ Wilson PG Craven and MJ Law ldquoTheMLP Lossless Compression Systemrdquo in AES 17th Int Conf High-QualityAudio Coding pp 61ndash75 Florence Italy Sept 1999

[Geye99] S Geyersberger et al ldquoMPEG-2 AAC multichannel realtime implementa-tion on floating-point DSPsrdquo in Proc 106th Conv Aud Eng Soc preprint4977 May 1999

[Gibs74] J Gibson ldquoAdaptive prediction in speech differential encoding systemsrdquoProc of the IEEE vol 68 pp 1789ndash1797 Nov 1974

[Gibs78] J Gibson ldquoSequentially adaptive backward prediction in ADPCM speechcodersrdquo IEEE Trans Commun pp 145ndash150 Jan 1978

[Gibs90] J Gibson et al ldquoBackward adaptive prediction in multi-tree speech codersrdquoin Advances in Speech Coding B Atal V Cuperman and A Gersho EdsKluwer Norwell MA 1990 pp 5ndash12

[Gilc98] N Gilchrist ldquoATLANTIC audio preserving technical quality during low bitrate coding and decodingrdquo in Proc 104th Conv Aud Eng Soc preprint4694 Amsterdam May 1998

[Giur98] CD Giurcaneanu I Tabus and J Astola ldquoLinear prediction from subbandsfor lossless audio compressionrdquo in Proc Norsig-98 3rd IEEE Nordic SignalProcessing Symposium pp 225ndash228 Vigso Denmark June 1998

[Glas90] B Glasberg and BCJ Moore ldquoDerivation of auditory filter shapes fromnotched-noise datardquo Hearing Res vol 47 pp 103ndash138 1990

[Golo66] SW Golomb ldquoRun-length encodingsrdquo in IEEE Trans on Info Theory vol12 pp 399ndash401 1966

[Good96] M Goodwin ldquoResidual modeling in music analysis-synthesisrdquo Proc IEEEICASSP-96 Atlanta GA May 1996

[Good97] M Goodwin ldquoAdaptive Signal Models Theory Algorithms and AudioApplicationsrdquo PhD Dissertation University of California at Berkeley 1997

[Gray84] R Gray ldquoVector quantizationrdquo IEEE ASSP Mag vol 1 no 2 pp 4ndash29Apr 1984

[Gree61] DD Greenwood ldquoCritical bandwidth and the frequency coordinates of thebasilar membranerdquo J Acous Soc Am pp 1344ndash1356 Oct 1961

[Gree90] DD Greenwood ldquoA cochlear frequency-position function for severalspecies ndash 29 years laterrdquo J Acoust Soc Am vol 87 pp 2592ndash2605 June1990

[Grew91] C Grewin and T Ryden ldquoSubjective assessments on low bit-rate codecsrdquoin Proc 10th Int Conf Aud Eng Soc pp 91ndash102 1991

[Gril02] B Grill et al ldquoCharacteristics of audio streaming over IP networks withinthe ISMA Standardrdquo in Proc 112th Conv Aud Eng Soc preprint 5514May 2002

[Gril94] B Grill et al ldquoImproved MPEG-2 audio multi-channel encodingrdquo in Proc96th Conv Aud Eng Soc preprint 3865 Feb 1994

[Gril97] B Grill ldquoA bit-rate scalable perceptual coder for MPEG-4 audiordquo in Proc103rd Conv Aud Eng Soc preprint 4620 Sept 1997

[Gril99] B Grill ldquoThe MPEG-4 general audio coderrdquo in Proc AES 17 th Int ConfSept 1999

[Gros03] A Groschel et al ldquoEnhancing audio coding efficiency of MPEG layer-2 with spectral band replication for digitalradio (DAB) in a backwardscompatible wayrdquo in 114th Conv Aud Eng Soc preprint 5850Amsterdam Netherlands Mar 2003

[Grub01] M Grube P Siepen C Mittendorf J Friess and S Bauer ldquoApplicationsof MPEG-4 digital multimedia broadcastingrdquo in Proc of Int Conf onConsumer Electronics ICCE-01 pp 136ndash137 June 2001

[Gruh96] D Gruhl A Lu and W Bender ldquoEcho hidingrdquo in Proc of InformationHiding Workshop rsquo96 Cambridge UK pp 295ndash316 May 1996

[GSM89] GSM 0610 ldquoGSM full-rate transcodingrdquo Technical Report Version 32ETSIGSM July 1989

[GSM96a] GSM 0660 ldquoGSM Digital Cellular Communication Standards EnhancedFull-Rate Transcodingrdquo ETSIGSM 1996

[GSM96b] GSM 0620 ldquoGSM Digital Cellular Communication Standards Half RateSpeech Half Rate Speech Transcodingrdquo ETSIGSM 1996

[Hadd95] R Haddad and K Park ldquoModeling analysis and optimum design ofquantized M-band filterbanksrdquo IEEE Trans Sig Proc vol 43 no 11 pp2540ndash2549 Nov 1995

[Hall97] JL Hall ldquoAsymmetry of masking revisited generalization of masker andprobe bandwidthrdquo J Acoust Soc Am vol 101 pp 1023ndash1033 Feb 1997

[Hall98] JL Hall ldquoAuditory psychophysics for coding applicationsrdquo in The DigitalSignal Processing Handbook V Madisetti and D Williams Eds CRCPress Boca Raton FL pp 391ndash3925 1998

[Hamd96] KN Hamdy et al ldquoLow bit rate high quality audio coding with combinedharmonic and wavelet representationsrdquo in Proc IEEE ICASSP-96 pp1045ndash1048 Atlanta GA May 1996

[Hans96] MC Hans and V Bhaskaran ldquoA fast integer-based CPU scalable MPEG-1layer-II audio decoderrdquo in Proc 101st Conv Aud Eng Soc preprint 4359Nov 1996

[Hans98a] M Hans Optimization of Digital Audio for Internet Transmission PhDdissertation School Elect Comp Eng Georgia Inst Technol Atlanta 1998

[Hans98b] M Hans and RW Schafer ldquoAudioPaK ndash an integer arithmetic losslessaudio codecrdquo in Proc Data Compression Conf p 550 Snowbird UT 1998

[Hans01] M Hans and RW Schafer ldquoLossless compression of digital audiordquo IEEESig Proc Mag vol 18 no 4 pp 21ndash32 July 2001

[Harm96] A Harma et al ldquoWarped linear prediction (WLP) in audio codingrdquo in ProcNORSIG rsquo96 pp 367ndash370 Sep 1996

[Harm97a] A Harma et al ldquoAn experimental audio codec based on warpedlinear prediction of complex valued signalsrdquo in Proc IEEE ICASSP-97pp 323ndash326 Munich Germany April 1997

[Harm97b] A Harma et al ldquoWLPAC ndash a perceptual audio codec in a nutshellrdquo Proc102nd Conv Audio Eng Soc preprint 4420 Munich 1997

[Harm98] A Harma et al ldquoA warped linear predictive stereo codec using temporalnoise shapingrdquo in Proc NORSIG-98 pp 229ndash232 May 1998

[Harm00] A Harma ldquoImplementation of frequency-warped recursive filtersrdquo signalprocessing vol 80 pp 543ndash548 Feb 2000

[Harm01] A Harma and UK Laine ldquoA comparison of warped and conventional linearpredictionrdquo IEEE Trans Speech Audio Proc vol 9 no 5 pp 579ndash588July 2001

[Harp03] P Harpe D Reefman and E Janssen ldquoEfficient trellis-type sigma-delta modulatorrdquo in Proc 114th Conv Aud Eng Soc preprint 5845Amsterdam Mar 22ndash25 2003

[Hart99] F Hartung and M Kutter ldquoMultimedia watermarking techniquesrdquo Proc ofthe IEEE vol 87 no 7 pp 1079ndash1107 July 1999

[Hask97] BG Haskell A Puri and AN Netravali Digital Video An Introduction toMPEG-2 Chapman amp Hall New York 1997

[Hawk02] MOJ Hawksford ldquoTime-quantized frequency modulation with timedispersive codes for the generation of sigma-delta modulationrdquo in Proc112th Conv Aud Eng Soc preprint 5618 Munich Germany May 10ndash132002

[Hay96] K Hay et al ldquoA low delay subband audio coder (20 Hz ndash 15 kHz) at64 kbitsrdquo in Proc IEEE Int Symp on Time-Freq and Time-Scale Analpp 265ndash268 June 1996

[Hay97] K Hay et al ldquoThe D5 lattice quantization for a 64 kbits low-delay subbandaudio coder with a 15 kHz bandwidthrdquo in Proc IEEE ICASSP-97 pp319ndash322 Munich Germany Apr 1997

[Hede81] P Hedelin ldquoA tone-oriented voice-excited vocoderrdquo in Proc IEEE ICASSP-81 pp 205ndash208 Atlanta GA Mar 1981

[Hedg97] R Hedges ldquoHybrid wavelet packet analysisrdquo in Proc 31st Asilomar Confon Sig Sys and Comp Oct 1997

[Hedg98a] R Hedges and D Cochran ldquoHybrid wavelet packet analysisrdquo in Proc IEEESP Int Sym on Time-Freq and Time-Scale Anal Oct 1998

[Hedg98b] R Hedges and D Cochran ldquoHybrid wavelet packet analysis a topdown approachrdquo in Proc 32nd Asilomar Conf on Sig Sys and CompNov 1998

[Hell72] R Hellman ldquoAsymmetry of masking between noise and tonerdquo Percep andPsychphys vol 11 pp 241ndash246 1972

[Hell01] O Hellmuth et al ldquoAdvanced audio identification using MPEG-7 contentdescriptionrdquo in Proc 111th Conv Aud Eng Soc preprint 5463 NovndashDec2001

[Herl95] C Herley ldquoBoundary filters for finite-length signals and time-varying filterbanksrdquo IEEE Trans Circ Sys II vol 42 pp 102ndash114 Feb 1995

[Herm90] H Hermansky ldquoPerceptual linear predictive (PLP) analysis of speechrdquo JAcoust Soc Amer vol 87 no 4 pp 1738ndash1752 Apr 1990

[Herm02] K Hermus W Verhelst and P Wambacq ldquoPsychoacoustic modeling ofaudio with exponentially damped sinusoidsrdquo Proc IEEE ICASSP-02 vol 2pp 1821ndash1824 Orlando FL 2002

[Herr92a] J Herre et al ldquoAdvanced audio measurement system using psychoacousticpropertiesrdquo in Proc 92nd Conv Aud Eng Soc preprint 3321 Mar 1992

[Herr92b] J Herre et al ldquoAnalysis tool for realtime measurements using perceptualCriteriardquo in Proc 11th Int Conf Aud Eng Soc pp 180ndash190 May 1992

[Herr95] J Herre K Brandenburg E Eberlein and B Grill ldquoSecond-generationISOMPEG-audio layer III codingrdquo in Proc 98th Conv Aud Eng Socpreprint 3939 Feb 1995

[Herr96] J Herre and JD Johnston ldquoEnhancing the performance of perceptual audiocoders by using temporal noise shaping (TNS)rdquo in Proc 101st Conv AudEng Soc preprint 4384 Nov 1996

[Herr98] J Herre and D Schulz ldquoExtending the MPEG-4 AAC codec by perceptualnoise substitutionrdquo in Proc 104th Conv Audio Eng Soc preprint 4720Amsterdam 1998

[Herr98a] J Herre and D Schulz ldquoExtending the MPEG-4 AAC codec by perceptualnoise substitutionrdquo in Proc 104th Conv Aud Eng Soc preprint 4720May 1998

[Herr98b] J Herre et al ldquoThe integrated filterbank-based scalable MPEG-4 audiocoderrdquo in Proc 105th Conv Aud Eng Soc preprint 4810 Sept 1998

[Herr98c] J Herre E Allamanche R Geiger and T Sporer ldquoProposal for a low delayMPEG-4 audio coder based on AACrdquo ISOIEC JTC1SC29WG11 M4139Oct 1998

[Herr99] J Herre E Allamanche R Geiger and T Sporer ldquoInformation andproposed enhancements for MPEG-4 low delay audio codingrdquo ISOIECJTC1SC29WG11 M4560 Mar 1998

[Herr00a] J Herre and B Grill ldquoOverview of MPEG-4 audio and its applications inmobile communicationsrdquo in Proc of 5th Int Conf on Sig Proc (ICSP-00)vol 1 pp 11ndash20 Aug 2000

[Herr01a] J Herre C Neubauer and F Siebenhaar ldquoCombined compressionwatermar-king for audio signalsrdquo in Proc 110th Conv Aud Eng Soc preprint 5344Amsterdam May 2001

[Herr01b] J Herre C Neubauer F Siebenhaar and R Kulessa ldquoNew results oncombined audio compressionwatermarkingrdquo in Proc 111th Conv Aud EngSoc preprint 5442 New York Dec 2001

[Herr01c] J Herre R Kulessa and C Neubauer ldquoA compatible family of bitstreamwatermarking schemes for MPEG-Audiordquo in Proc 110th Conv Audio EngSoc preprint 5346 Amsterdam May 2001

[Herr04a] J Herre et al ldquoMP3 surround Efficient and compatible coding ofmultichannel audiordquo in Proc 116 thAES convention paper 6049 BerlinGermany May 2004

[Herr04b] J Herre et al ldquoFrom joint stereo to spatial audio coding ndash recent progressand standardizationrdquo in 6th International Conference on Digital AudioEffects Naples Italy Oct 2004

[Herr05] J Herre et al ldquoThe reference model architecture for MPEG spatial audiocodingrdquo in Proc 118th AES convention paper 6447 Barcelona Spain May2005

[Herre98] J Herre et al ldquoThe integrated filterbank based scalable MPEG-4 audiocoderrdquo in Proc 105th Conv Aud Eng Soc preprint 4810 Sep 1998

[Heus02] R Heusdens and S van de Par ldquoRate-distortion optimal sinusoidal modelingof audio and speech using psychoacoustical matching pursuitsrdquo in ProcICASSP-02 vol 2 pp 1809ndash1812 2002

[Hilp98] J. Hilpert, M. Braun, M. Lutzky, S. Geyersberger, and R. Buchta, "Implementing ISO/MPEG-2 advanced audio coding real-time on a fixed-point DSP," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4822, Sept. 1998.

[Hilp00] J. Hilpert et al., "Real-time implementation of the MPEG-4 low delay advanced audio coding algorithm (AAC-LD) on Motorola DSP56300," in Proc. 108th Conv. Aud. Eng. Soc., preprint 5081, Feb. 2000.

[Holm99] T. Holman, "New factors in sound for cinema and television," J. Aud. Eng. Soc., vol. 39, pp. 529–539, July–Aug. 1999.

[Hong01] J.-W. Hong, T.-J. Lee, and J. S. Lucas, "Design and implementation of a real-time audio service using MPEG-2 AAC and streaming technology," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5343, May 2001.

[Hoog94] A. Hoogendoorn, "Digital compact cassette," Proc. of the IEEE, vol. 82, no. 10, pp. 1479–1489, Oct. 1994.

[Hori96] N. Horikawa and P. C. Eastty, "One-bit audio recording," in AES UK Audio for New Media Conference, Apr. 1996, London.

[Horv02] P. Horvatic and N. Schiffner, "Security issues in high quality audio streaming," in Proc. of 2nd Int. Conf. on Web Delivering of Music (WEDELMUSIC-02), p. 224, Dec. 2002.

[Howa94] P. Howard and J. Vitter, "Arithmetic coding for data compression," Proc. of the IEEE, vol. 82, no. 6, pp. 857–865, June 1994.

[Hsie02] C. T. Hsieh and P. Y. Sou, "Blind cepstrum domain audio watermarking based on time energy features," in Proc. Int. Conf. on DSP-2002, vol. 2, pp. 705–708, July 2002.

[Huan02] D. Y. Huang, J. N. Al Lee, S. W. Foo, and L. Weisi, "Reed-Solomon codes for MPEG advanced audio coding," in Proc. 113th Conv. Aud. Eng. Soc., preprint 5682, Oct. 2002.

[Huan02] J. Huang, Y. Wang, and Y. Q. Shi, "A blind audio watermarking algorithm with self-synchronization," in Proc. IEEE ISCAS-2002, vol. 3, pp. 627–630, Orlando, FL, May 2002.

[Hube98] D. M. Huber, The MIDI Manual, second edition, Focal Press, MA, 1998.

[Huff52] D. Huffman, "A method for the construction of minimum redundancy codes," in Proc. of the Institute of Radio Engineers, vol. 40, pp. 1098–1101, Sept. 1952.

[Huij02] Y. Huijuan, J. C. Patra, and C. W. Chan, "An artificial neural network-based scheme for robust watermarking of audio signals," in Proc. IEEE ICASSP-2002, vol. 1, pp. 1029–1032, Orlando, FL, May 2002.

[Hwan01] Y.-T. Hwang, N.-J. Liu, and M.-C. Tsai, "An MPEG-4 Twin-VQ based high quality audio codec design," in IEEE Workshop on Sig. Proc. Systems, pp. 321–331, Sept. 2001.

[IECA87] International Electrotechnical Commission/American National Standards Institute (IEC/ANSI) CEI-IEC 908, "Compact-Disc Digital Audio System" ("red book"), 1987.

[Iked95] K. Ikeda et al., "Error protected TwinVQ audio coding at less than 64 kbit/s," in Proc. IEEE Speech Coding Workshop, pp. 33–34, 1995.

[Iked99] M. Ikeda, K. Takeda, and F. Itakura, "Audio data hiding by use of band-limited random sequences," in Proc. IEEE ICASSP-99, vol. 4, pp. 2315–2318, Phoenix, AZ, Mar. 1999.

[ISO-V96] Multifunctional Ad hoc Group, "Core Experiments Description," ISO/IEC JTC1/SC29/WG11 N1266, March 1996.

[ISOI90] ISO/IEC JTC1/SC2/WG11-MPEG, "MPEG/Audio Test Report 1990," Doc. MPEG90/N0030, 1990.

[ISOI91] ISO/IEC JTC1/SC2/WG11, "MPEG/Audio Test Report 1991," Doc. MPEG91/0010, 1991.

[ISOI92] ISO/IEC JTC1/SC29/WG11 MPEG, IS 11172-3, "Information Technology – Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s, Part 3: Audio," 1992. ("MPEG-1")

[ISOI94] ISO/IEC JTC1/SC29/WG11 MPEG, IS 13818-3, "Information Technology – Generic Coding of Moving Pictures and Associated Audio, Part 3: Audio," 1994. ("MPEG-2 BC-LSF")

[ISOI94a] ISO/IEC JTC1/SC29/WG11 MPEG, IS 13818-3, "Information Technology – Generic Coding of Moving Pictures and Associated Audio, Part 3: Audio," 1994. ("MPEG-2 BC-LSF")

[ISOI94b] ISO/IEC JTC1/SC29/WG11, MPEG94/443, "Requirements for Low Bitrate Audio Coding/MPEG-4 Audio," 1994. ("MPEG-4")

[ISOI96] ISO/IEC JTC1/SC29/WG11 MPEG, Committee Draft 13818-7, "Generic Coding of Moving Pictures and Associated Audio: Audio (non backwards compatible coding, NBC)," 1996. ("MPEG-2 NBC/AAC")

[ISOI96a] ISO/IEC JTC1/SC29/WG11 MPEG, Committee Draft 13818-7, "Generic Coding of Moving Pictures and Associated Audio: Audio (non backwards compatible coding, NBC)," 1996. ("MPEG-2 NBC/AAC")

[ISOI96b] ISO/IEC JTC1/SC29/WG11, "MPEG-4 Audio Test Results (MOS Tests)," ISO/IEC JTC1/SC29/WG11 N1144, Munich, Jan. 1996.

[ISOI96c] ISO/IEC JTC1/SC29/WG11, "Report of the Ad Hoc Group on the Evaluation of New Audio Submissions to MPEG-4," ISO/IEC JTC1/SC29/WG11 MPEG96/M0680, Munich, Jan. 1996.

[ISOI96d] ISO/IEC JTC1/SC29/WG11 N1420, "Overview of the Report on the Formal Subjective Listening Tests of MPEG-2 AAC Multichannel Audio Coding," Nov. 1996.

[ISOI97a] ISO/IEC JTC1/SC29/WG11, "MPEG-4 Audio Committee Draft 14496-3," ISO/IEC JTC1/SC29/WG11 N1903, Oct. 1997. (Available on http://www.tnt.uni-hannover.de/project/mpeg/audio/documents)

[ISOI97b] ISO/IEC 13818-7, "Information Technology – Generic Coding of Moving Pictures and Associated Audio, Part 7: Advanced Audio Coding," 1997.

[ISOI98] ISO/IEC JTC1/SC29/WG11 (MPEG) document N2011, "Results of AAC and TwinVQ Tool Comparative Tests," San Jose, CA, 1998.

[ISOI99] ISO/IEC JTC1/SC29/WG11 (MPEG), International Standard ISO/IEC 14496-3, "Coding of Audio-Visual Objects – Part 3: Audio," 1999. ("MPEG-4")

[ISOI00] ISO/IEC JTC1/SC29/WG11 (MPEG), International Standard ISO/IEC 14496-3 AMD-1, "Coding of Audio-Visual Objects – Part 3: Audio," 2000. ("MPEG-4 version-2")

[ISOI01a] ISO/IEC JTC1/SC29/WG11 MPEG, IS 15938-2, "Multimedia Content Description Interface: Description Definition Language (DDL)," International Standard, Sept. 2001. ("MPEG-7 Part-2")

[ISOI01b] ISO/IEC JTC1/SC29/WG11 MPEG, IS 15938-4, "Multimedia Content Description Interface: Audio," International Standard, Sept. 2001. ("MPEG-7 Part-4")

[ISOI01c] ISO/IEC JTC1/SC29/WG11 MPEG, IS 15938-5, "Multimedia Content Description Interface: Multimedia Description Schemes (MDS)," International Standard, Sept. 2001. ("MPEG-7 Part-5")

[ISOI01d] ISO/IEC JTC1/SC29/WG11 MPEG, "MPEG-7 Applications Document," Doc. N3934, Jan. 2001. Also available online: http://www.cselt.it/mpeg

[ISOI01e] ISO/IEC JTC1/SC29/WG11 MPEG, "Introduction to MPEG-7," Doc. N4032, Mar. 2001. Also available online: http://www.cselt.it/mpeg

[ISOI02a] MPEG Requirements Group, MPEG-21 Overview, ISO/MPEG N5231, Oct. 2002.

[ISOI03a] MPEG Requirements Group, MPEG-21 Requirements, ISO/MPEG N5873, July 2003.

[ISOI03b] MPEG Requirements Group, MPEG-21 Architecture, Scenarios and IPMP Requirements, ISO/MPEG N5874, July 2003.

[ISOI03c] ISO/IEC JTC1/SC29/WG11 MPEG, IS 14496-3:2001/Amd.1:2003, "Information technology – Coding of audio-visual objects, Part 3: Audio, Amendment 1: Bandwidth Extension," 2003. (MPEG-4 SBR)

[ISOII94] ISO-II, "Report on the MPEG/Audio Multichannel Formal Subjective Listening Tests," ISO/MPEG doc. MPEG94/063, ISO/MPEG-II Audio Committee, 1994.

[ISOJ97] ISO/IEC JTC1/SC29/WG1 (1997), JPEG-LS: Emerging lossless/near-lossless compression standard for continuous-tone images, JPEG Lossless.

[IS-54] TIA/EIA-PN 2398 (IS-54), "The 8 kbit/s VSELP Algorithm," 1989.

[IS-96] TIA/EIA/IS-96, "QCELP," Speech Service Option 3 for Wideband Spread Spectrum Digital Systems, TIA, 1992.

[IS-127] TIA/EIA/IS-127, "Enhanced Variable Rate Codec," Speech Service Option 3 for Wideband Spread Spectrum Digital Systems, TIA, 1997.

[IS-641] TIA/EIA/IS-641, "Cellular/PCS Radio Interface – Enhanced Full-Rate Speech Codec," TIA, 1996.

[IS-893] TIA/EIA/IS-893, "Selectable Mode Vocoder," Service Option for Wideband Spread Spectrum Communications Systems, ver. 2.0, Dec. 2001.

[ITUR90] International Telecommunications Union, Radio Communications Sector (ITU-R), "Subjective Assessment of Sound Quality," CCIR Rec. 562-3, vol. X, Part 1, Dusseldorf, 1990.

[ITUR91] ITU-R Document TG10-23-E only, "Basic Audio Quality Requirements for Digital Audio Bit Rate Reduction Systems for Broadcast Emission and Primary Distribution," Oct. 1991.

[ITUR92] International Telecommunications Union, Rec. G.728, "Coding of Speech at 16 kbps Using Low-Delay Code Excited Linear Prediction," Sept. 1992.

[ITUR93a] International Telecommunications Union, Radio Communications Sector (ITU-R), Recommendation BS.775-1, "Multi-Channel Stereophonic Sound System With and Without Accompanying Picture," Nov. 1993.

[ITUR93b] ITU-R Task Group 10/2, "CCIR Listening Tests, Network Verification Tests without Commentary Codecs, Final Report," Delayed Contribution 10-251, 1993.

[ITUR93c] ITU-R Task Group 10/2, "PAQM Measurements," Delayed Contribution 10-251, 1993.

[ITUR94b] ITU-R BS.1116, "Methods for Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems," 1994.

[ITUR94c] ITU-R BS.775-1, Multichannel Stereophonic Sound System with and Without Accompanying Picture, International Telecommunication Union, Geneva, Switzerland, 1994.

[ITUR96] International Telecommunications Union, Rec. P.830, "Subjective Performance Assessment of Telephone-Band and Wideband Digital Codecs," 1996.

[ITUR97] International Telecommunications Union, Radio Communications Sector (ITU-R), Recommendation BS.1115, "Low Bit Rate Audio Coding," Geneva, Switzerland, 1997.

[ITUR98] International Telecommunications Union, Radiocommunications Sector (ITU-R), Recommendation BS.1387, "Method for Objective Measurements of Perceived Audio Quality," Dec. 1998.

[ITUT94] International Telecommunications Union, Telecommunications Sector (ITU-T), Recommendation P.80, "Telephone Transmission Quality Subjective Opinion Tests," 1994.

[ITUT96] International Telecommunications Union, Telecommunications Sector (ITU-T), Recommendation P.861, "Objective Quality Measurement of Telephone-Band (300–3400 Hz) Speech Codecs," 1996.

[Iwad92] M. Iwadare et al., "A 128 kb/s hi-fi audio codec based on adaptive transform coding with adaptive block size MDCT," IEEE J. Sel. Areas in Comm., vol. 10, no. 1, pp. 138–144, Jan. 1992.

[Iwak95] N. Iwakami et al., "High-quality audio-coding at less than 64 kbit/s by using transform-domain weighted interleave vector quantization (TWINVQ)," in Proc. IEEE ICASSP-95, pp. 3095–3098, Detroit, MI, May 1995.

[Iwak96] N. Iwakami and T. Moriya, "Transform domain weighted interleave vector quantization (TwinVQ)," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4377, 1996.

[Iwak01] N. Iwakami et al., "Fast encoding algorithms for MPEG-4 TwinVQ audio tool," in Proc. IEEE ICASSP-01, vol. 5, pp. 3253–3256, Salt Lake City, UT, May 2001.

[Jako96] C. Jakob and A. Bradley, "Minimising the effects of subband quantisation of the time domain aliasing cancellation filter bank," in Proc. IEEE ICASSP-96, pp. 1033–1036, Atlanta, GA, May 1996.

[Jans03] E. Janssen and D. Reefman, "Super-audio CD: an introduction," IEEE Sig. Proc. Mag., vol. 20, no. 4, pp. 83–90, July 2003.

[Jawe95] B. Jawerth and W. Sweldens, "Biorthogonal smooth local trigonometric bases," J. Fourier Anal. Appl., vol. 2, no. 2, pp. 109–133, 1995.

[Jaya76] N. S. Jayant, Waveform Quantization and Coding, IEEE Press, New York, 1976.

[Jaya84] N. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice-Hall, Englewood Cliffs, NJ, 1984.

[Jaya90] N. S. Jayant, V. Lawrence, and D. Prezas, "Coding of speech and wideband audio," AT&T Tech. J., vol. 69(5), pp. 25–41, Sept.–Oct. 1990.

[Jaya92] N. Jayant, J. Johnston, and Y. Shoham, "Coding of wideband speech," Speech Comm., vol. 11, no. 2–3, 1992.

[Jaya93] N. Jayant et al., "Signal compression based on models of human perception," Proc. IEEE, pp. 1385–1422, Oct. 1993.

[Jbir98] A. Jbira, N. Moreau, and P. Dymarski, "Low delay coding of wideband audio (20 Hz–15 kHz) at 64 kbps," in Proc. IEEE ICASSP-98, vol. 6, pp. 3645–3648, Seattle, WA, May 1998.

[Jess99] P. Jessop, "The business case for audio watermarking," in Proc. IEEE ICASSP-99, vol. 4, pp. 2077–2078, Phoenix, AZ, Mar. 1999.

[Jest82] W. Jesteadt, S. Bacon, and J. Lehman, "Forward masking as a function of frequency, masker level, and signal delay," J. Acoust. Soc. Am., vol. 71, pp. 950–962, 1982.

[Jetz79] J. Jetzt, "Critical distance measurement of rooms from the sound energy spectrum response," J. Acoust. Soc. Am., vol. 65, pp. 1204–1211, 1979.

[Ji02] Z. Ji, Q. Zhang, W. Zhu, and J. Zhou, "Power-efficient distortion-minimized rate allocation for audio broadcasting over wireless networks," in Proc. of ICME-2002, vol. 1, pp. 257–260, Aug. 2002.

[Joha01] P. Johansson, M. Kazantzidls, R. Kapoor, and M. Gerla, "Bluetooth: an enabler for personal area networking," IEEE Network, vol. 15, no. 5, pp. 28–37, Sept.–Oct. 2001.

[John80] J. Johnston, "A filter family designed for use in quadrature mirror filter banks," in Proc. IEEE ICASSP-80, pp. 291–294, Denver, CO, 1980.

[John88a] J. Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE J. Sel. Areas in Comm., vol. 6, no. 2, pp. 314–323, Feb. 1988.

[John88b] J. Johnston, "Estimation of perceptual entropy using noise masking criteria," in Proc. IEEE ICASSP-88, pp. 2524–2527, New York, NY, May 1988.

[John89] J. Johnston, "Perceptual transform coding of wideband stereo signals," in Proc. IEEE ICASSP-89, pp. 1993–1996, Glasgow, Scotland, May 1989.

[John92a] J. Johnston and A. Ferreira, "Sum-difference stereo transform coding," in Proc. IEEE ICASSP-92, vol. 2, pp. 569–572, San Francisco, CA, May 1992.

[John95] J. D. Johnston et al., "The AT&T Perceptual Audio Coder (PAC)," presented at the AES Convention, New York, Oct. 1995.

[John96] J. Johnston et al., "AT&T Perceptual Audio Coding (PAC)," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 73–81, 1996.

[John96a] J. Johnston, "Audio coding with filter banks," in Subband and Wavelet Transforms, A. Akansu and M. J. T. Smith, Eds., Kluwer Academic Publishers, pp. 287–307, Norwell, MA, 1996.

[John96b] J. Johnston, J. Herre, M. Davis, and U. Gbur, "MPEG-2 NBC audio-stereo and multichannel coding methods," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4383, Los Angeles, Nov. 1996.

[John96c] J. Johnston et al., "AT&T Perceptual Audio Coding (PAC)," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 73–81, 1996.

[John99] J. Johnston et al., "MPEG audio coding," in Wavelet, Subband and Block Transforms in Communications and Multimedia, A. Akansu and M. Medley, Eds., Kluwer Academic Publishers, Boston, 1999.

[Juan82] B. Juang and A. Gray, "Multiple stage vector quantization for speech coding," in Proc. IEEE ICASSP-82, pp. 597–600, Paris, France, 1982.

[Jurg96] R. K. Jurgen, "Broadcasting with digital audio," IEEE Spectrum, pp. 52–59, Mar. 1996.

[Kaba86] P. Kabal and R. P. Ramachandran, "The computation of line-spectral frequencies using Chebyshev polynomials," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, pp. 1419–1426, 1986.

[Kalk01a] T. Kalker, "Considerations on watermarking security," IEEE 4th Workshop on Mult. Sig. Proc., pp. 201–206, Oct. 2001.

[kalk01b] T. Kalker, W. Oomen, A. N. Lemma, J. Haitsma, F. Bruekers, and M. Veen, "Robust, multi-functional and high quality audio watermarking technology," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5345, Amsterdam, May 2001.

[Kalk02] T. Kalker and F. M. J. Willems, "Capacity bounds and constructions for reversible data-hiding," in Proc. DSP-2002, vol. 1, pp. 71–76, July 2002.

[Kapu92] R. Kapust, "A human ear related objective measurement technique yields audible error and error margin," in Proc. 11th Int. Conf. Aud. Eng. Soc., pp. 191–201, May 1992.

[Karj85] M. Karjaleinen, "A new auditory model for the evaluation of sound quality of audio systems," in Proc. IEEE ICASSP-85, pp. 608–611, Tampa, FL, May 1985.

[Kata93] A. Kataoka et al., "Conjugate structure CELP coder for the CCITT 8 kb/s standardization candidate," in IEEE Workshop on Speech Coding for Telecommunications, pp. 25–26, 1993.

[Kata96] A. Kataoka et al., "An 8-kb/s conjugate structure CELP (CS-CELP) speech coder," IEEE Trans. on Speech and Audio Processing, vol. 4, no. 6, pp. 401–411, Nov. 1996.

[Kato02] H. Kato, "Trellis noise-shaping converters and 1-bit digital audio," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5615, Munich, Germany, May 10–13, 2002.

[Kau98] P. Kauff, J.-R. Ohm, S. Rauthenberg, and T. Sikora, "The MPEG-4 standard and its applications in virtual 3D environments," in 32nd ASILOMAR Conf. on Sig., Syst. and Computers, vol. 1, pp. 104–107, Nov. 1998.

[Kay89] S. Kay, "A fast and accurate single frequency estimator," IEEE Trans. Acous., Speech, and Sig. Proc., pp. 1987–1990, Dec. 1989.

[Keyh93] M. Keyhl et al., "NMR measurements of consumer recording devices which use low bit-rate audio coding," in Proc. 94th Conv. Aud. Eng. Soc., preprint 3616, 1993.

[Keyh94] M. Keyhl et al., "NMR measurements on multiple generations audio coding," in Proc. 96th Conv. Aud. Eng. Soc., preprint 3803, 1994.

[Keyh95] M. Keyhl et al., "Maintaining sound quality – experiences and constraints of perceptual measurements in today's and future networks," in Proc. 98th Conv. Aud. Eng. Soc., preprint 3946, 1995.

[Keyh98] M. Keyhl et al., "Quality assurance tests of MPEG encoders for a digital broadcasting system (Part II) – minimizing subjective test efforts by perceptual measurements," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4753, May 1998.

[Kim00] H. Kim, "Stochastic model based audio watermark and whitening filter for improved detection," in Proc. IEEE ICASSP-00, vol. 4, pp. 1971–1974, Istanbul, Turkey, June 2000.

[Kim01] S.-W. Kim, S.-H. Park, and Y.-B. Kim, "Fine grain scalability in MPEG-4 Audio," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5491, Nov.–Dec. 2001.

[Kim02] K. T. Kim, S. K. Jung, Y. C. Park, and D. H. Youn, "A new bandwidth scalable wideband speech/audio coder," in Proc. IEEE ICASSP-02, vol. 1, pp. 13–17, Orlando, FL, May 2002.

[Kim02a] D.-H. Kim, J.-H. Kim, and S.-W. Kim, "Scalable lossless audio coding based on MPEG-4 BSAC," in Proc. 113th Conv. Aud. Eng. Soc., preprint 5679, Oct. 2002.

[Kim02b] J.-W. Kim, B.-D. Choi, S.-H. Park, K.-K. Kim, and S.-J. Ko, "Remote control system using real-time MPEG-4 streaming technology for mobile robot," in Proc. of Int. Conf. on Consumer Electronics, ICCE-02, pp. 200–201, June 2002.

[Kimu02] T. Kimura, K. Kakehi, K. Takeda, and F. Itakura, "Spatial compression of multi-channel audio signals using inverse filters," in Proc. ICASSP-02, vol. 4, p. 4175, May 2002.

[Kirb97] D. Kirby and K. Watanabe, "Formal subjective testing of the MPEG-2 NBC multichannel coding algorithm," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4418, Munich, 1997.

[Kirk83] S. Kirkpatrick et al., "Optimization by simulated annealing," Science, pp. 671–680, May 1983.

[Kiro01a] D. Kirovski and H. Malvar, "Robust spread-spectrum audio watermarking," in Proc. IEEE ICASSP-01, vol. 3, pp. 1345–1348, Salt Lake City, UT, May 2001.

[Kiro01b] D. Kirovski and H. Malvar, "Spread-spectrum audio watermarking: requirements, applications, and limitations," IEEE 4th Workshop on Mult. Sig. Proc., pp. 219–224, Oct. 2001.

[Kiro02] D. Kirovski and H. Malvar, "Embedding and detecting spread-spectrum watermarks under estimation attacks," in Proc. IEEE ICASSP-02, vol. 2, pp. 1293–1296, Orlando, FL, May 2002.

[Kiro03a] D. Kirovski and H. S. Malvar, "Spread-spectrum watermarking of audio signals," Trans. on Sig. Proc., vol. 51, no. 4, pp. 1020–1033, Apr. 2003.

[Kiro03b] D. Kirovski and F. A. P. Petitcolas, "Blind pattern matching attack on watermarking systems," Trans. on Sig. Proc., vol. 51, no. 4, pp. 1045–1053, Apr. 2003.

[Kiya94] H. Kiya et al., "A development of symmetric extension method for subband image coding," IEEE Trans. Image Proc., vol. 3, no. 1, pp. 78–81, Jan. 1994.

[Klei90a] W. B. Kleijn et al., "Fast methods for the CELP speech coding algorithm," IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. 38, no. 8, p. 1330, Aug. 1990.

[Klei90b] W. B. Kleijn, "Source-dependent channel coding and its application to CELP," in Advances in Speech Coding, B. Atal, V. Cuperman, and A. Gersho, Eds., pp. 257–266, Kluwer Academic Publishers, Norwell, MA, 1990.

[Klei92] W. B. Kleijn et al., "Generalized analysis-by-synthesis coding and its application to pitch prediction," in Proc. IEEE ICASSP-92, vol. 1, pp. 337–340, San Francisco, CA, Mar. 1992.

[Klei95] W. B. Kleijn and K. K. Paliwal, "An introduction to speech coding," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Kluwer Academic Publishers, Norwell, MA, Nov. 1995.

[Knis98] D. N. Knisely, S. Kumar, S. Laha, and S. Navda, "Evolution of wireless data services: IS-95 to CDMA2000," IEEE Comm. Mag., vol. 36, pp. 140–149, Oct. 1998.

[Ko02] B.-S. Ko, R. Nishimura, and Y. Suzuki, "Time-spread echo method for digital audio watermarking using PN sequences," in Proc. IEEE ICASSP-02, vol. 2, pp. 2001–2004, Orlando, FL, May 2002.

[Koen96] R. Koenen et al., "MPEG-4: context and objectives," J. Sig. Proc.: Image Comm., Oct. 1996.

[Koen98] R. Koenen, "Overview of the MPEG-4 Standard," ISO/IEC JTC1/SC29/WG11 N2323, July 1998. Available online at http://www.cselt.it/mpeg/standards/mpeg-4/mpeg-4.html

[Koen99] R. Koenen, "Overview of the MPEG-4 Standard," ISO/IEC JTC1/SC29/WG11 N3156, Dec. 1999. Available online at http://www.cselt.it/mpeg/standards/mpeg-4/mpeg-4.htm

[Koil91] R. Koilpillai and P. P. Vaidyanathan, "New results on cosine-modulated FIR filter banks satisfying perfect reconstruction," in Proc. IEEE ICASSP-91, pp. 1793–1796, Toronto, Canada, May 1991.

[Koil92] R. Koilpillai and P. P. Vaidyanathan, "Cosine-modulated FIR filter banks satisfying perfect reconstruction," IEEE Trans. Sig. Proc., vol. 40, pp. 770–783, Apr. 1992.

[Kois00] K. Koishida, V. Cuperman, and A. Gersho, "A 16-kbit/s bandwidth scalable audio coder based on the G.729 standard," Proc. IEEE ICASSP-00, vol. 2, pp. 49–52, Istanbul, Turkey, June 2000.

[Koll99] J. Koller, T. Sporer, and K. Brandenburg, "Robust coding of high quality audio signals," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4621, New York, 1997.

[Koml96] A. Komly, "Assessing the performance of low-bit-rate audio codecs: an account of the work of ITU-R Task Group 10/2," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 105–114, 1996.

[Kons94] K. Konstantinides, "Fast subband filtering in MPEG audio coding," IEEE Sig. Proc. Letters, vol. 1, pp. 26–28, Feb. 1994.

[Kost95] B. Kostek, "Feature extraction methods for the intelligent processing of musical signals," in Proc. 99th Conv. Aud. Eng. Soc., preprint 4076, Oct. 1995.

[Kost96] B. Kostek and M. Szczerba, "MIDI database for the automatic recognition of musical phrases," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4169, May 1996.

[Kova95] J. Kovacevic, "Subband coding systems incorporating quantizer models," IEEE Trans. Image Proc., vol. 4, no. 5, pp. 543–553, May 1995.

[Krah85] D. Krahe, "New Source Coding Method for High Quality Digital Audio Signals," NTG Fachtagung Hoerrundfunk, Mannheim, 1985.

[Krah88] D. Krahe, "Grundlagen eines Verfahrens zur Dataenreduktion bei Qualitativ Hochwertigen digitalen Audiosignalen auf Basis einer Adaptiven Transformationscodierung unter Berucksightigung Psychoakustischer Phanomener," Ph.D. Thesis, Duisburg, 1988.

[kroo86] P. Kroon, E. Deprettere, and R. J. Sluyeter, "Regular-pulse excitation – a novel approach to effective and efficient multi-pulse coding of speech," IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. 34, no. 5, pp. 1054–1063, Oct. 1986.

[Kroo90] P. Kroon and B. Atal, "Pitch predictors with high temporal resolution," in Proc. IEEE ICASSP-90, pp. 661–664, Albuquerque, NM, April 1990.

[Kroo95] P. Kroon and W. B. Kleijn, "Linear-prediction based analysis-synthesis coding," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Kluwer Academic Publishers, Norwell, MA, Nov. 1995.

[Krug88] E. Kruger and H. Strube, "Linear prediction on a warped frequency scale," IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. 36, no. 9, pp. 1529–1531, Sept. 1988.

[Kudu95a] P. Kudumakis and M. Sandler, "On the performance of wavelets for low bit rate coding of audio signals," in Proc. IEEE ICASSP-95, pp. 3087–3090, Detroit, MI, May 1995.

[Kudu95b] P. Kudumakis and M. Sandler, "Wavelets, regularity, complexity and MPEG-Audio," in Proc. 99th Conv. Aud. Eng. Soc., preprint 4048, Oct. 1995.

[Kudu96] P. Kudumakis and M. Sandler, "On the compression obtainable with four-tap wavelets," IEEE Sig. Proc. Let., pp. 231–233, Aug. 1996.

[Kuma00] A. Kumar, "Low complexity ACELP coding of 7 kHz speech and audio at 16 kbps," IEEE Int. Conf. on Personal Wireless Communications, pp. 368–372, Dec. 2000.

[Kunt80] M. Kunt and O. Johnsen, "Block coding of graphics: A tutorial review," in Proc. of the IEEE, vol. 7, pp. 770–786, 1980.

[Kuo02] S. S. Kuo, J. D. Johnston, W. Turin, and S. R. Quackenbush, "Covert audio watermarking using perceptually tuned signal independent multiband phase modulation," in Proc. IEEE ICASSP-02, vol. 2, pp. 1753–1756, Orlando, FL, May 2002.

[Kurt02] F. Kurth, A. Ribbrock, and M. Clausen, "Identification of highly distorted audio material for querying large scale data bases," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5512, Munich, May 2002.

[Kuzu03] B. H. Kuzuki, N. Fuchigami, and J. R. Stuart, "DVD-audio specifications," IEEE Sig. Proc. Mag., pp. 72–82, July 2003.

[Lacy98] J. Lacy, S. R. Quackenbush, A. R. Reibman, D. Shur, and J. H. Snyder, "On combining watermarking with perceptual coding," in Proc. IEEE ICASSP-98, vol. 6, pp. 3725–3728, Seattle, WA, May 1998.

[Lafl90] C. Laflamme et al., "On reducing computational complexity of codebook search in CELP coder through the use of algebraic codes," in Proc. IEEE ICASSP-90, vol. 1, pp. 177–180, Albuquerque, NM, Apr. 1990.

[Lafl93] C. Laflamme, R. Salami, and J.-P. Adoul, "9.6 kbit/s ACELP coding of wideband speech," in Speech and Audio Coding for Wireless and Network Applications, B. Atal, V. Cuperman, and A. Gersho, Eds., Kluwer Academic Publishers, Norwell, MA, 1993.

[Laft03] C. Laftsidis, A. Tefas, N. Nikolaidis, and I. Pitas, "Robust multibit audio watermarking in the temporal domain," in Proc. IEEE ISCAS-03, vol. 2, pp. 944–947, May 2003.

[Lanc02] R. Lancini, F. Mapelli, and S. Tubaro, "Embedding indexing information in audio signal using watermarking technique," Int. Symp. on Video/Image Proc. and Mult. Comm., pp. 257–261, June 2002.

[Laub98] P. Lauber and N. Gilchrist, "ATLANTIC audio: switching layer 3 signals," in Proc. 104th Aud. Eng. Soc. Conv., preprint 4738, Amsterdam, May 1998.

[Lean00] C. T. G. de Leandro, M. Bonnet, and N. Moreau, "Cyclostationarity-based audio watermarking with private and public hidden data," in Proc. 109th Conv. Aud. Eng. Soc., preprint 5258, Los Angeles, Sept. 2000.

[Leand01] C. T. G. de Leandro, E. Gomez, and N. Moreau, "Resynchronization methods for audio watermarking an audio system," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5441, New York, Dec. 2001.

[Lee90] J. I. Lee et al., "On reducing computational complexity of codebook search in CELP coding," IEEE Trans. on Comm., vol. 38, no. 11, pp. 1935–1937, Nov. 1990.

[Lee00a] S. K. Lee and Y. S. Ho, "Digital audio watermarking in the cepstrum domain," Int. Conf. on Consumer Electronics (ICCE-00), pp. 334–335, June 2000.

[Lee00b] S. K. Lee and Y. S. Ho, "Digital audio watermarking in the cepstrum domain," Trans. on Consumer Electronics, vol. 46, no. 3, pp. 744–750, Aug. 2000.

[Leek84] M. Leek and C. Watson, "Learning to detect auditory pattern components," J. Acoust. Soc. Am., vol. 76, pp. 1037–1044, 1984.

[Lefe94] R. Lefebvre, R. Salami, C. Laflamme, and J.-P. Adoul, "High quality coding of wideband audio signals using transform coded excitation (TCX)," in Proc. IEEE ICASSP-94, vol. 1, pp. 93–96, Adelaide, Australia, 1994.

[LeGal92] D. J. Le Gall, "The MPEG video compression algorithm," J. Sig. Proc.: Image Comm., vol. 4, no. 2, pp. 129–140, 1992.

[Lehr93] P. Lehrman and T. Tully, MIDI for the Professional, Music Sales Corp., 1993.

[Lemm03] A. N. Lemma, J. Aprea, W. Oomen, and L. V. de Kerkhof, "A temporal domain audio watermarking technique," Trans. on Sig. Proc., vol. 51, no. 4, pp. 1088–1097, Apr. 2003.

[Levi98a] S. Levine and J. Smith, "A sines+transients+noise audio representation for data compression and time/pitch scale modifications," in Proc. 105th Conv. Audio Eng. Soc., preprint 4781, Sept. 1998.

[Levi99] S. N. Levine and J. O. Smith, "A switched parametric and transform audio coder," in Proc. IEEE ICASSP-99, vol. 2, pp. 985–988, Phoenix, AZ, Mar. 1999.

[Lie01] W.-N. Lie and L.-C. Chang, "Robust and high-quality time-domain audio watermarking subject to psychoacoustic masking," in Proc. IEEE ISCAS-01, vol. 2, pp. 45–48, May 2001.

[Lin82] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Applications, Prentice Hall, Englewood Cliffs, NJ, 1982.

[Lin91] X. Lin, R. A. Salami, and R. Steele, "High quality audio coding using analysis-by-synthesis technique," in Proc. IEEE ICASSP-91, vol. 5, pp. 3617–3620, Toronto, Canada, Apr. 1991.

[Lina94] J. Linacre, Many-Facet Rasch Measurement, MESA Press, Chicago, 1994.

[Lind80] Y. Linde, A. Buzo, and A. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, no. 1, pp. 84–95, Jan. 1980.

[Lind99] A. T. Lindsay and W. Kriechbaum, "There's more than one way to hear it: multiple representations of music in MPEG-7," J. New Music Res., vol. 28, no. 4, pp. 364–372, 1999.

[Lind00] A. T. Lindsay, S. Srinivasan, et al., "Representation and linking mechanism for audio in MPEG-7," Signal Proc.: Image Commun., vol. 16, pp. 193–209, 2000.

[Lind01] A. T. Lindsay and J. Herre, "MPEG-7 and MPEG-7 audio – an overview," J. Audio Eng. Soc., vol. 49, no. 7/8, pp. 589–594, July/Aug. 2001.

[Link93] M. Link, "An attack processing of audio signals for optimizing the temporal characteristics of a low bit-rate audio coding system," in Proc. 95th Conv. Audio Eng. Soc., preprint 3696, 1993.

[Liu98] C.-M. Liu and W.-C. Lee, "A unified fast algorithm for cosine modulated filter banks in current audio coding standards," in Proc. 104th Conv. Audio Eng. Soc., preprint 4729, 1998.

[Liu99] F. Liu, J. Shin, C.-C. J. Kuo, and A. G. Tescher, "Streaming of MPEG-4 speech/audio over Internet," Int. Conf. on Consumer Electronics, pp. 300–301, June 1999.

[LKur96] E. Lukac-Kuruc, "Beyond MIDI: XMidi, an affordable solution," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4348, Nov. 1996.

[Lloy82] S. P. Lloyd, "Least squares quantization in PCM," originally published in 1957 and republished in IEEE Trans. Inform. Theory, vol. 28, pt. II, pp. 127–135, Mar. 1982.

[Lokh91] G. C. P. Lokhoff, "DCC – digital compact cassette," IEEE Trans. on Consumer Electronics, vol. 37, no. 3, pp. 702–706, Aug. 1991.

[Lokh92] G. C. P. Lokhoff, "Precision adaptive sub-band coding (PASC) for the digital compact cassette (DCC)," IEEE Trans. Consumer Electron., vol. 38, no. 4, pp. 784–789, Nov. 1992.

[Lu98] Z. Lu and W. Pearlman, "An efficient, low-complexity audio coder delivering multiple levels of quality for interactive applications," in Proc. IEEE Sig. Proc. Soc. Workshop on Multimedia Sig. Proc., Dec. 1998.

[Luft83] R. Lufti, "Additivity of simultaneous masking," J. Acoust. Soc. Am., vol. 73, pp. 262–267, 1983.

[Luft85] R. Lufti, "A power-law transformation predicting masking by sounds with complex spectra," J. Acoust. Soc. Am., vol. 77, pp. 2128–2136, 1985.

[Maco97] M. W. Macon, L. Jensen-Link, J. Oliverio, M. A. Clements, and E. B. George, "Concatenation-based MIDI-to-singing voice synthesis," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4591, Nov. 1997.

[Madi97] V. Madisetti and D. B. Williams, The Digital Signal Processing Handbook, CRC Press, Boca Raton, FL, Dec. 1997.

[Mahi89] Y. Mahieux et al., "Transform coding of audio signals using correlation between successive transform blocks," in Proc. IEEE ICASSP-89, pp. 2021–2024, Glasgow, Scotland, May 1989.

[Mahi90] Y. Mahieux and J. Petit, "Transform coding of audio signals at 64 kbit/sec," in Proc. IEEE Globecom '90, vol. 1, pp. 518–522, Nov. 1990.

[Mahi94] Y. Mahieux and J. P. Petit, "High-quality audio transform coding at 64 kbps," IEEE Trans. Comm., vol. 42, no. 11, pp. 3010–3019, Nov. 1994.

[Main96] L. Mainard and M. Lever, "A bi-dimensional coding scheme applied to audio bitrate reduction," in Proc. IEEE ICASSP-96, pp. 1017–1020, Atlanta, GA, May 1996.

[Main96] L. Mainard, C. Creignou, M. Lever, and Y. F. Dehery, "A real-time PC-based high-quality MPEG layer II codec," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4345, Nov. 1996.

[Makh75] J. Makhoul, "Linear prediction: A tutorial review," in Proc. of the IEEE, vol. 63, no. 4, pp. 561–580, Apr. 1975.

[Makh78] J. Makhoul et al., "A mixed-source model for speech compression and synthesis," J. Acoust. Soc. Am., vol. 64, pp. 1577–1581, Dec. 1978.

[Makh85] J. Makhoul et al., "Vector quantization in speech coding," Proc. of the IEEE, vol. 73, no. 11, pp. 1551–1588, Nov. 1985.

[Maki00] K. Makino and J. Matsumoto, "Hybrid audio coding for speech and audio below medium bit rate," Int. Conf. on Consumer Electronics (ICCE-00), pp. 264–265, June 2000.

[Malv90a] H. Malvar, "Lapped transforms for efficient transform/subband coding," IEEE Trans. Acous., Speech, and Sig. Process., vol. 38, no. 6, pp. 969–978, June 1990.

[Malv90b] H. Malvar, "Modulated QMF filter banks with perfect reconstruction," Electronics Letters, vol. 26, pp. 906–907, June 1990.

[Malv91] H. S. Malvar, Signal Processing with Lapped Transforms, Artech House, Norwood, MA, 1991.

[Malv98] H. Malvar, "Biorthogonal and nonuniform lapped transforms for transform coding with reduced blocking and ringing artifacts," IEEE Trans. Sig. Proc., vol. 46, no. 4, pp. 1043–1053, Apr. 1998.

[Malv98] H. Malvar, "Enhancing the performance of subband audio coders for speech signals," in Proc. IEEE ISCAS-98, vol. 5, pp. 98–101, May 1998.

[Mana02] B. Manaris, T. Purewal, and C. McCormick, "Progress towards recognizing and classifying beautiful music with computers – MIDI-encoded music and the Zipf-Mandelbrot law," in Proc. of Southeast Conf., pp. 52–57, Apr. 2002.

[Manj02] B. S. Manjunath, P. Salembier, and T. Sikora, editors, Introduction to MPEG-7: Multimedia Content Description Language, John Wiley & Sons, New York, 2002.

[Mans01a] M. F. Mansour and A. H. Tewfik, "Audio watermarking by time-scale modification," in Proc. IEEE ICASSP-01, vol. 3, pp. 1353–1356, Salt Lake City, UT, May 2001.

[Mans01b] M. F. Mansour and A. H. Tewfik, "Efficient decoding of watermarking schemes in the presence of false alarms," IEEE 4th Workshop on Mult. Sig. Proc., pp. 523–528, Oct. 2001.

[Mans02] M. F. Mansour and A. H. Tewfik, "Improving the security of watermark public detectors," Int. Conf. on DSP-02, vol. 1, pp. 59–66, July 2002.

[Mao03] W. Mao, Modern Cryptography: Theory and Practice, Prentice Hall PTR, July 2003.

[Mark76] J. Markel and A. Gray, Jr., Linear Prediction of Speech, Springer-Verlag, New York, 1976.

[Marp87] S. Marple, Digital Spectral Analysis with Applications, Prentice-Hall, Englewood Cliffs, NJ, 1987.

[Marp89] L. Marple, Digital Spectral Analysis with Applications, 2nd edition, Prentice Hall, Englewood Cliffs, NJ, 1989.

[Mart02] L. G. Martins and A. J. S. Ferreira, "PCM to MIDI transposition," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5524, May 2002.

[Mass85] J. Masson and Z. Picel, "Flexible design of computationally efficient nearly perfect QMF filter banks," in Proc. IEEE ICASSP-85, vol. 10, pp. 541–544, Tampa, FL, Mar. 1985.

[Matv96] G. Matviyenko, "Optimized Local Trigonometric Bases," Appl. Comput. Harmonic Anal., vol. 3, no. 4, pp. 301–323, 1996.

[McAu86] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acous., Speech, and Sig. Proc., vol. 34, no. 4, pp. 744–754, Aug. 1986.

[McAu88] R. McAulay and T. Quatieri, "Computationally efficient sine-wave synthesis and its application to sinusoidal transform coding," in Proc. IEEE ICASSP-88, New York, NY, 1988.

[McCr91] A. McCree and T. Barnwell III, "A new mixed excitation LPC vocoder," in Proc. IEEE ICASSP-91, vol. 1, pp. 593–596, Toronto, Canada, Apr. 1991.

[McCr93] A. McCree and T. Barnwell III, "Implementation and evaluation of a 2400 bps mixed excitation LPC vocoder," in Proc. IEEE ICASSP-93, vol. 2, p. 159, Minneapolis, MN, Apr. 1993.

[Mear97] D. J. Meares and G. Theile, "Matrixed surround sound in an MPEG digital world," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4431, Mar. 1997.

[Mein01] N. Meine, B. Edler, and H. Purnhagen, "Error protection and concealment for HILN MPEG-4 parametric audio coding," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5300, May 2001.

[Merh00] N. Merhav, "On random coding error exponents of watermarking systems," Trans. on Info. Theory, vol. 46, no. 2, pp. 420–430, Mar. 2000.

[Merm88] P. Mermelstein, "G.722: A new CCITT coding standard for digital transmission of wideband audio signals," IEEE Comm. Mag., vol. 26, pp. 8–15, Feb. 1988.

[Mesa99] V. Z. Mesarovic, M. V. Dokic, and R. K. Rao, "DTS multichannel audio decoder on a 24-bit fixed-point dual DSP," in Proc. 106th Conv. Aud. Eng. Soc., preprint 4964, May 1999.

[MID03] "MIDI and musical instrument control," in the Journal of the AES, vol. 51, no. 4, pp. 272–276, Apr. 2003.

[MIDI] The MIDI Manufacturers Association, official home page, available online at http://www.midi.org

[Mill47] G. Miller, "Sensitivity to changes in the intensity of white noise and its relation to masking and loudness," J. Acoust. Soc. Am., vol. 19, pp. 609–619, 1947.

[Miln92] A. Milne, "New test methods for digital audio data compression algorithms," in Proc. 11th Int. Conf. Aud. Eng. Soc., pp. 210–215, May 1992.

[Mitc97] J. L. Mitchell, W. B. Pennebaker, C. E. Fogg, and D. J. LeGall, MPEG Video Compression Standard, in Digital Multimedia Standards Series, Chapman & Hall, New York, NY, 1997.

[Mizr02] S. Mizrachi and D. Malah, "Parameter identification of a class of nonlinear systems for robust audio watermarking verification," in Proc. 22nd Conv. Electrical and Electronics Engineers, pp. 13–15, Dec. 2002.

[Mode98] T. Modegi and S.-I. Iisaku, "Proposals of MIDI coding and its application for audio authoring," in Proc. Int. Conf. on Mult. Comp. and Syst., pp. 305–314, June–July 1998.

[Mode00] T. Modegi, "MIDI encoding method based on variable frame-length analysis and its evaluation of coding precision," in IEEE Int. Conf. on Mult. and Expo, ICME-00, vol. 2, pp. 1043–1046, July–Aug. 2000.

[Mont94] P. Monta and S. Cheung, "Low rate audio coder with hierarchical filterbanks," in Proc. IEEE ICASSP-94, vol. 2, pp. 209–212, Adelaide, Australia, May 1994.

[Moor77] B. C. J. Moore, Introduction to the Psychology of Hearing, University Park Press, London, 1977.

[Moor78] B. C. J. Moore, "Psychophysical tuning curves measured in simultaneous and forward masking," J. Acoust. Soc. Am., vol. 63, pp. 524–532, 1978.

[Moor83] B. C. J. Moore and B. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acous. Soc. Am., vol. 74, pp. 750–753, 1983.

[Moor96] B. C. J. Moore, "Masking in the human auditory system," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 9–19, 1996.

[Moor96] J. A. Moorer, "Breaking the sound barrier: mastering at 96 kHz and beyond," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4357, Los Angeles, Nov. 1996.

[Moor97] B. C. J. Moore, B. Glasberg, and T. Baer, "A model for the prediction of thresholds, loudness, and partial loudness," J. Audio Eng. Soc., vol. 45, no. 4, pp. 224–240, Apr. 1997.

[Moor03] B. C. J. Moore, An Introduction to the Psychology of Hearing, Fifth Edition, Academic Press, San Diego, Jan. 2003.

[More95] F. Moreau de Saint-Martin et al., "A measure of near orthogonality of PR biorthogonal filter banks," in Proc. IEEE ICASSP-95, pp. 1480–1483, Detroit, MI, May 1995.

[Mori96] T. Moriya et al., "Extension and complexity reduction of TWINVQ audio coder," in Proc. IEEE ICASSP-96, vol. 2, pp. 1029–1032, Atlanta, GA, May 1996.

[Mori97] T. Moriya, Y. Takashima, T. Nakamura, and N. Iwakami, "Digital watermarking schemes based on vector quantization," in IEEE Workshop on Speech Coding for Telecommunications, pp. 95–96, 1997.

[Mori00] T. Moriya, N. Iwakami, A. Jin, and T. Mori, "A design of lossy and lossless scalable audio coding," in Proc. IEEE ICASSP-2000, vol. 2, pp. 889–892, Istanbul, Turkey, June 2000.

[Mori00a] T. Moriya, A. Jin, N. Iwakami, and T. Mori, "Design of an MPEG-4 general audio coder for improving speech quality," in Proc. of IEEE Workshop on Speech Coding, pp. 139–141, Sept. 2000.

[Mori00b] T. Moriya, T. Mori, N. Iwakami, and A. Jin, "A design of error robust scalable coder based on MPEG-4 Audio," in Proc. IEEE ISCAS-2000, vol. 3, pp. 213–216, May 2000.

[Mori02a] T. Moriya, "Report of AHG on issues in lossless audio coding," ISO/IEC JTC1/SC29/WG11 M7955, Mar. 2002.

[Mori02b] T. Moriya, A. Jin, T. Mori, K. Ikeda, and T. Kaneko, "Lossless scalable audio coder and quality enhancement," in Proc. IEEE ICASSP-02, vol. 2, pp. 1829–1832, Orlando, FL, 2002.

[Mori03] T. Moriya, A. Jin, T. Mori, K. Ikeda, and T. Kaneko, "Hierarchical lossless audio coding in terms of sampling rate and amplitude resolution," in Proc. IEEE ICASSP-03, vol. 5, pp. 409–412, Hong Kong, 2003.

[Moul98a] D. Moulton and M. Moulton, "Measurement of small impairments of perceptual audio coders using a 3-facet Rasch model," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4709, May 1998.

[Moul98b] D. Moulton and M. Moulton, "Codec 'transparency,' listener 'severity,' program 'intolerance': suggestive relationships between Rasch measures and some background variables," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4843, Sept. 1998.

[MPEG] The MPEG workgroup web page, available online at http://www.mpeg.org

[Munt02] T. Muntean, E. Grivel, and M. Najim, "Audio digital watermarking based on hybrid spread spectrum," in Proc. 2nd Int. Conf. WEDELMUSIC-02, pp. 150–155, Dec. 2002.

[Nack99a] F. Nack and A. T. Lindsay, "Everything you wanted to know about MPEG-7: Part-I," IEEE Multimedia, vol. 6, no. 3, pp. 65–71, July–Sept. 1999.

[Nack99b] F. Nack and A. T. Lindsay, "Everything you wanted to know about MPEG-7: Part-II," IEEE Multimedia, vol. 6, no. 4, pp. 64–73, Oct.–Dec. 1999.

[Naja00] H. Najafzadeh and P. Kabal, "Perceptual bit allocation for low rate coding of narrowband audio," in Proc. IEEE ICASSP-00, vol. 2, pp. 5–9, Istanbul, Turkey, June 2000.

[Neub00a] C. Neubauer and J. Herre, "Audio watermarking of MPEG-2 AAC bitstreams," in Proc. 108th Conv. Aud. Eng. Soc., preprint 5101, Paris, Feb. 2000.

[Neub00b] C. Neubauer and J. Herre, "Advanced watermarking and its applications," in Proc. 109th Conv. Aud. Eng. Soc., preprint 5176, Los Angeles, Sept. 2000.

[Neub02] C. Neubauer, F. Siebenhaar, J. Herre, and R. Bauml, "New high data rate audio watermarking based on SCS (scalar Costa scheme)," in Proc. 113th Conv. Aud. Eng. Soc., preprint 5645, Los Angeles, Oct. 2002.

[Neub98] C. Neubauer and J. Herre, "Digital watermarking and its influence on audio quality," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4823, San Francisco, Sept. 1998.

[NICAM] NICAM 728, "Specification for Two Additional Digital Sound Channels with System I Television," Independent Broadcasting Authority (IBA), British Radio Equipment Manufacturers' Association (BREMA), and British Broadcasting Corporation (BBC), Aug. 1988.

[Ning02] D. Ning and M. Deriche, "A new audio coder using a warped linear prediction model and the wavelet transform," in Proc. IEEE ICASSP-02, vol. 2, pp. 1825–1828, Orlando, FL, 2002.

[Nish96a] A. Nishio et al., "Direct stream digital audio system," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4163, May 1996.

[Nish96b] A. Nishio et al., "A new CD mastering processing using direct stream digital," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4393, Los Angeles, Nov. 1996.

[Nish99] M. Nishiguchi, "MPEG-4 speech coding," in Proc. AES 17th Int. Conf., Sept. 1999.

[Nish01] A. Nishio and H. Takahashi, "Investigation of practical 1-bit delta-sigma conversion for professional audio applications," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5392, May 2001.

[Nogu97] M. Noguchi, G. Ichimura, A. Nishio, and S. Tagami, "Digital signal processing in a direct stream digital editing system," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4476, May 1997.

[Noll93] P. Noll, "Wideband speech and audio coding," IEEE Comm. Mag., pp. 34–44, Nov. 1993.

[Noll95] P. Noll, "Digital audio coding for visual communications," Proc. of the IEEE, pp. 925–943, June 1995.

[Noll97] P. Noll, "MPEG digital audio coding," IEEE Sig. Proc. Mag., pp. 59–81, Sept. 1997.

[Nomu98] T. Nomura et al., "A bitrate and bandwidth scalable CELP coder," in Proc. IEEE ICASSP-1998, vol. 1, pp. 341–344, Seattle, WA, May 1998.

[Nors92] S. R. Norsworthy, "Effective dithering of sigma-delta modulators," in Proc. IEEE ISCAS-92, vol. 3, pp. 1304–1307, 1992.

[Nors93] S. R. Norsworthy and D. A. Rich, "Idle channel tones and dithering in delta-sigma modulators," in Proc. 95th Conv. Aud. Eng. Soc., preprint 3711, New York, Oct. 1993.

[Nors97] S. R. Norsworthy, R. Schreier, and G. C. Temes, Delta-Sigma Converters: Theory, Design and Simulation, New York, IEEE Press, 1997.

[Nuss81] H. J. Nussbaumer, "Pseudo QMF filter bank," IBM Tech. Disclosure Bulletin, vol. 24, pp. 3081–3087, Nov. 1981.

[Ogur97] Y. Ogura, T. Sugihara, S. Okada, H. Yamauchi, and A. Nishio, "One-bit two-channel recorder system," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4564, New York, Sept. 1997.

[Oh01] H. O. Oh, J. W. Seok, J. W. Hong, and D. H. Youn, "New echo embedding technique for robust and imperceptible audio watermarking," in Proc. IEEE ICASSP-01, vol. 3, pp. 1341–1344, Salt Lake City, UT, May 2001.

[Oh02] H. O. Oh, D. H. Youn, J. W. Hong, and J. W. Seok, "Imperceptible echo for robust audio watermarking," in Proc. 113th Conv. Aud. Eng. Soc., preprint 5644, Los Angeles, Oct. 2002.

[Ohma97] I. Ohman, "The placement of one or several subwoofers," in Music och Ljudteknik, no. 1, Sweden, 1997.

[Ojan99] J. Ojanpera and M. Vaananen, "Long-term predictor for transform domain perceptual audio coding," in Proc. 107th Conv. Aud. Eng. Soc., preprint 5036, New York, 1999.

[Oliv48] B. M. Oliver, J. Pierce, and C. E. Shannon, "The philosophy of PCM," Proc. IRE, vol. 36, pp. 1324–1331, Nov. 1948.

[Onno93] P. Onno and C. Guillemot, "Tradeoffs in the design of wavelet filters for image compression," in Proc. VCIP, pp. 1536–1547, Nov. 1993.

[Oom03] W. Oomen, E. Schuijers, B. den Brinker, and J. Breebaart, "Advances in parametric coding for high-quality audio," in Proc. 114th Conv. Aud. Eng. Soc., preprint 5852, Mar. 2003.

[Oome96] A. W. J. L. Oomen, A. A. M. Bruekers, R. J. van der Vleuten, and L. M. Van de Kerkhof, "Lossless coding for DVD audio," in 101st Conv. Aud. Eng. Soc., preprint 4358, Nov. 1996.

[Oppe71] A. Oppenheim et al., "Computation of spectra with unequal resolution using the fast Fourier transform," Proc. of the IEEE, vol. 59, pp. 299–301, Feb. 1971.

[Oppe72] A. Oppenheim and D. Johnson, "Discrete representation of signals," Proc. of the IEEE, vol. 60, pp. 681–691, June 1972.

[Oppe99] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd edition, Prentice Hall, Englewood Cliffs, NJ, 1999.

[Orde91] E. Ordentlich and Y. Shoham, "Low-delay code excited linear predictive coding of wideband speech at 32 kbps," in Proc. IEEE ICASSP-93, vol. 2, pp. 9–12, Minneapolis, MN, 1993.

[Pail92] B. Paillard et al., "PERCEVAL: perceptual evaluation of the quality of audio signals," J. Aud. Eng. Soc., vol. 40, no. 1/2, pp. 21–31, Jan./Feb. 1992.

[Pain00] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. of the IEEE, vol. 88, no. 4, pp. 451–513, Apr. 2000.

[Pain00a] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. of the IEEE, vol. 88, no. 4, pp. 451–513, April 2000.

[Pain01] T. Painter and A. Spanias, "Perceptual component selection in sinusoidal coding of audio," IEEE 4th Workshop on Multimedia Signal Processing, pp. 187–192, Oct. 2001.

[Pain05] T. Painter and A. Spanias, "Perceptual segmentation and component selection for sinusoidal representations of audio," IEEE Trans. Speech Audio Proc., vol. 13, no. 2, pp. 149–162, Mar. 2005.

[Pali91] K. K. Paliwal and B. Atal, "Efficient vector quantization of LPC parameters at 24 kb/s," in Proc. IEEE ICASSP-91, pp. 661–663, Toronto, ON, Canada, 1991.

[Pali93] K. K. Paliwal and B. S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," IEEE Trans. Speech Audio Process., vol. 1, no. 1, pp. 3–14, 1993.

[Pan93] D. Pan, "Digital audio compression," Digital Tech. J., vol. 5, no. 2, pp. 28–40, 1993.

[Pan95] D. Pan, "A tutorial on MPEG/audio compression," IEEE Mult., vol. 2, no. 2, pp. 60–74, Summer 1995.

[Papa95] P. Papamichalis, "MPEG audio compression: algorithms and implementation," in Proc. DSP 95 Int. Conf. on DSP, pp. 72–77, June 1995.

[Papo91] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd edition, McGraw-Hill, New York, 1991.

[Para95] M. Paraskevas and J. Mourjopoulos, "A differential perceptual audio coding method with reduced bitrate requirements," IEEE Trans. Speech and Audio Proc., vol. 3, no. 6, pp. 490–503, Nov. 1995.

[Park97] S. H. Park et al., "Multi-layered bit-sliced bit-rate scalable audio coding," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4520, 1997.

[Paul82] D. Paul, "A 500–800 b/s adaptive vector quantization vocoder using a perceptually motivated distance measure," in Proc. Globecom 1982 (Miami, FL, Dec. 1982), p. E631.

[PEAQ] The PEAQ web page, available at http://www.peaq.org

[Peel03] C. B. Peel, "On 'dirty-paper coding'," IEEE Sig. Proc. Mag., vol. 20, no. 3, pp. 112–113, May 2003.

[Pena95] A. Pena, "A suggested auditory information environment to ease the detection and minimization of subjective annoyance in perceptive-based systems," in Proc. 98th Conv. Aud. Eng. Soc., preprint 4019, Paris, 1995.

[Pena96] A. Pena et al., "ARCO (Adaptive Resolution COdec): a hybrid approach to perceptual audio coding," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4178, May 1996.

[Pena97a] A. Pena et al., "New improvements in ARCO (Adaptive Resolution Codec)," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4419, Mar. 1997.

[Pena97b] A. Pena et al., "A flexible tiling of the time axis for adaptive wavelet packet decompositions," in Proc. IEEE ICASSP-97, pp. 2137–2140, Munich, Germany, Apr. 1997.

[Pena01] A. Pena, E. Alexandre, B. Rivas, R. Perez, and A. Duenas, "Realtime implementations of MPEG-2 and MPEG-4 natural audio coders," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5302, May 2001.

[Penn80] M. Penner, "The coding of intensity and the interaction of forward and backward masking," J. Acoust. Soc. Am., vol. 67, pp. 608–616, 1980.

[Penn95] B. Pennycook, "Audio and MIDI markup tools for the World Wide Web," in Proc. 99th Conv. Aud. Eng. Soc., preprint 4058, Oct. 1995.

[Peti99] F. A. P. Petitcolas, R. J. Anderson, and M. G. Kuhn, "Information hiding – a survey," in Proc. of the IEEE, vol. 87, no. 7, pp. 1062–1078, July 1999.

[Peti02] F. A. P. Petitcolas and D. Kirovski, "The blind pattern matching attack on watermark systems," in Proc. IEEE ICASSP-02, vol. 4, pp. 3740–3743, Orlando, FL, May 2002.

[Petr01] R. Petrovic, "Audio signal watermarking based on replica modulation," in Int. Conf. on Telecomm. in Modern Satellite, Cable and Broadcasting, vol. 1, pp. 227–234, Sept. 2001.

[Phil95a] P. Philippe et al., "A relevant criterion for the design of wavelet filters in high-quality audio coding," in Proc. 98th Conv. Aud. Eng. Soc., preprint 3948, Feb. 1995.

[Phil95b] P. Philippe et al., "On the choice of wavelet filters for audio compression," in Proc. IEEE ICASSP-95, pp. 1045–1048, Detroit, MI, May 1995.

[Phil96] P. Philippe et al., "Optimal wavelet packets for low-delay audio coding," in Proc. IEEE ICASSP-96, pp. 550–553, Atlanta, GA, May 1996.

[PhSo82] Philips and Sony, Audio CD System Description, Philips Licensing, Eindhoven, The Netherlands, 1982.

[Pick82] R. Pickholtz, D. Schilling, and L. Milstein, "Theory of spread-spectrum communications – a tutorial," IEEE Trans. on Comm., vol. 30, no. 5, pp. 855–884, May 1982.

[Pick88] J. Pickles, An Introduction to the Physiology of Hearing, Academic Press, New York, 1988.

[Podi01] C. I. Podilchuk and E. J. Delp, "Digital watermarking: algorithms and applications," IEEE Sig. Proc. Mag., vol. 18, no. 4, pp. 33–46, July 2001.

[Port80] M. R. Portnoff, "Time-frequency representation of digital signals and systems based on short-time Fourier analysis," IEEE Trans. ASSP, vol. 28, no. 1, pp. 55–69, Feb. 1980.

[Port81a] M. R. Portnoff, "Short-time Fourier analysis of sampled speech," IEEE Trans. ASSP, vol. 29, no. 3, pp. 364–373, June 1981.

[Prec97] K. Precoda and T. Meng, "Listener differences in audio compression evaluations," J. Aud. Eng. Soc., vol. 45, no. 9, pp. 708–715, Sept. 1997.

[Prel96a] N. Prelcic and A. Pena, "An adaptive tree search algorithm with application to multiresolution based perceptive audio coding," in Proc. IEEE Int. Symp. on Time-Freq. and Time-Scale Anal., 1996.

[Prel96b] N. Prelcic et al., "Considerations on the performance of filter design methods for wavelet packet audio decomposition," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4235, May 1996.

[Pres89] W. Press et al., Numerical Recipes, Cambridge Univ. Press, 1989.

[Prin86] J. Princen and A. Bradley, "Analysis/synthesis filter bank design based on time domain aliasing cancellation," IEEE Trans. ASSP, vol. 34, no. 5, pp. 1153–1161, Oct. 1986.

[Prin87] J. Princen et al., "Subband/transform coding using filter bank designs based on time domain aliasing cancellation," in Proc. IEEE ICASSP-87, vol. 12, pp. 2161–2164, Dallas, TX, May 1987.

[Prin94] J. Princen, "The design of non-uniform modulated filterbanks," in Proc. IEEE Int. Symp. on Time-Frequency and Time-Scale Analysis, Oct. 1994.

[Prin95] J. Princen and J. Johnston, "Audio coding with signal adaptive filter banks," in Proc. IEEE ICASSP-95, pp. 3071–3074, Detroit, MI, May 1995.

[Prit90] W. L. Pritchard and M. Ogata, "Satellite direct broadcast," Proc. of the IEEE, vol. 78, no. 7, pp. 1116–1140, July 1990.

[Pura95] M. Purat and P. Noll, "A new orthonormal wavelet packet decomposition for audio coding using frequency-varying modulated lapped transforms," in IEEE ASSP Workshop on Applic. of Sig. Proc. to Aud. and Acous., Oct. 1995.

[Pura96] M. Purat and P. Noll, "Audio coding with a dynamic wavelet packet decomposition based on frequency-varying modulated lapped transforms," in Proc. IEEE ICASSP-96, pp. 1021–1024, Atlanta, GA, May 1996.

[Pura97] M. Purat, T. Liebchen, and P. Noll, "Lossless transform coding of audio signals," in Proc. 102nd Conv. Aud. Eng. Soc., Munich, preprint 4414, Mar. 1997.

[Purn00a] H. Purnhagen and N. Meine, "HILN – the MPEG-4 parametric audio coding tools," in Proc. IEEE ISCAS-2000, vol. 3, pp. 201–204, May 2000.

[Purn00b] H. Purnhagen, N. Meine, and B. Edler, "Speeding up HILN – MPEG-4 parametric audio encoding with reduced complexity," in Proc. 109th Conv. Aud. Eng. Soc., preprint 5177, Sept. 2000.

[Purn02] H. Purnhagen, N. Meine, and B. Edler, "Sinusoidal coding using loudness-based component selection," in Proc. IEEE ICASSP-02, vol. 2, pp. 1817–1820, Orlando, FL, 2002.

[Purn97] H. Purnhagen et al., "Proposal of a Core Experiment for extended 'Harmonic and Individual Lines plus Noise' Tools for the Parametric Audio Coder Core," ISO/IEC JTC1/SC29/WG11 MPEG97/2480, July 1997.

[Purn98] H. Purnhagen et al., "Object-based analysis/synthesis audio coder for very low bit rates," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4747, May 1998.

[Purn99a] H. Purnhagen, "Advances in parametric audio coding," in Proc. IEEE WASPAA, New Paltz, NY, Oct. 17–20, 1999.

[Purn99b] H. Purnhagen, "An overview of MPEG-4 audio version-2," in Proc. AES 17th Int. Conf., Florence, Italy, Sept. 1999.

[Purn03] H. Purnhagen et al., "Combining low complexity parametric stereo with high efficiency AAC," ISO/IEC JTC1/SC29/WG11 MPEG2003/M10385, Dec. 2003.

[Qiu01] T. Qiu, "Lossless audio coding based on high order context modeling," in IEEE Fourth Workshop on Multimedia Signal Processing, pp. 575–580, Oct. 2001.

[Quac01] S. Quackenbush and A. T. Lindsay, "Overview of MPEG-7 audio," IEEE Trans. on Circuits and Systems for Video Technology, vol. 2, no. 6, pp. 725–729, June 2001.

[Quac97] S. Quackenbush, "Noiseless coding of quantized spectral components in MPEG-2 advanced audio coding," IEEE ASSP Workshop on Apps. of Sig. Proc. to Aud. and Acous., Mohonk, NY, 1997.

[Quac98a] S. Quackenbush and Y. Toguri, ISO/IEC JTC1/SC29/WG11 N2005, "Revised Report on Complexity of MPEG-2 AAC Tools," Feb. 1998.

[Quac98b] S. Quackenbush, "Coding of natural audio in MPEG-4," in Proc. IEEE ICASSP-98, Seattle, WA, May 1998.

[Quat02] T. Quatieri, Discrete-Time Speech Signal Processing, Prentice Hall PTR, Upper Saddle River, NJ, 2002.

[Quei93] R. de Queiroz, "Time-varying lapped transforms and wavelet packets," IEEE Trans. Sig. Proc., vol. 41, pp. 3293–3305, 1993.

[Raad01] M. Raad and I. S. Burnett, "Audio coding using sorted sinusoidal parameters," in Proc. IEEE ISCAS-02, vol. 2, pp. 401–404, May 2001.

[Raad02] M. Raad and A. Mertins, "From lossy to lossless audio coding using SPIHT," in Proc. of the 5th Int. Conf. on Digital Audio Effects, Hamburg, Germany, Sept. 26–28, 2002.

[Rabi78] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.

[Rabi89] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.

[Rabi93] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.

[Rahm87] M. D. Rahman and K. Yu, "Total least squares approach for frequency estimation using linear prediction," IEEE Trans. ASSP, vol. 35, no. 10, pp. 1440–1454, Oct. 1987.

[Ramp98] S. A. Ramprashad, "A two stage hybrid embedded speech/audio coding structure," in Proc. IEEE ICASSP-98, vol. 1, pp. 337–340, Seattle, WA, May 1998.

[Ramp99] S. A. Ramprashad, "A multimode transform predictive coder (MTPC) for speech and audio," IEEE Workshop on Speech Coding, pp. 10–12, June 1999.

[Rams82] T. Ramstad, "Sub-band coder with a simple adaptive bit-allocation algorithm: a possible candidate for digital mobile telephony," in Proc. IEEE ICASSP-82, vol. 7, pp. 203–207, Paris, France, May 1982.

[Rams86] T. Ramstad, "Considerations on quantization and dynamic bit-allocation in subband coders," in Proc. IEEE ICASSP-86, vol. 11, pp. 841–844, Tokyo, Japan, Apr. 1986.

[Rams91] T. Ramstad, "Cosine modulated analysis-synthesis filter bank with critical sampling and perfect reconstruction," in Proc. IEEE ICASSP-91, pp. 1789–1792, Toronto, Canada, May 1991.

[Rao90] K. Rao and P. Yip, The Discrete Cosine Transform: Algorithm, Advantages and Applications, Academic Press, Boston, 1990.

[Rasc80] G. Rasch, Probabilistic Models for Some Intelligence Attainment Tests, University of Chicago Press, Chicago, 1980.

[Reef01a] D. Reefman and E. Janssen, "White paper on signal processing for SACD," Philips IP&S. [Online] Available at http://www.superaudiocd.philips.com

[Reef01b] D. Reefman and P. Nuijten, "Why direct stream digital (DSD) is the best choice as a digital audio format," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5396, Amsterdam, May 2001.

[Reef01c] D. Reefman and P. Nuijten, "Editing and switching in 1-bit audio streams," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5399, Amsterdam, May 12–15, 2001.

[Reef02] D. Reefman and E. Janssen, "Enhanced sigma delta structures for super audio CD application," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5616, Munich, Germany, May 2002.

[Rice79] R. Rice, "Some practical universal noiseless coding techniques," Jet Propulsion Laboratory, JPL Publication, Pasadena, CA, Mar. 1979.

[Riou91] O. Rioul and M. Vetterli, "Wavelets and signal processing," IEEE SP Mag., pp. 14–38, Oct. 1991.

[Riou94] O. Rioul and P. Duhamel, "A Remez exchange algorithm for orthonormal wavelets," IEEE Trans. Circ. Sys. II, Aug. 1994.

[Riss79] J. Rissanen and G. Langdon, "Arithmetic coding," IBM J. Res. Develop., vol. 23, no. 2, pp. 146–162, Mar. 1979.

[Rits96] S. Ritscher and U. Felderhoff, "Cascading of different audio codecs," in Proc. 100th Audio Eng. Soc. Conv., preprint 4174, Copenhagen, May 1996.

[Rix01] Rix et al., "Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE ICASSP-01, vol. 2, pp. 749–752, Salt Lake City, UT, 2001.

[Robi94] T. Robinson, "SHORTEN: Simple lossless and near-lossless waveform compression," Technical Report 156, Engineering Department, Cambridge University, Dec. 1994.

[Roch98] S. Roche and J.-L. Dugelay, "Image watermarking based on the fractal transform," in Proc. Workshop Multimedia Signal Processing, pp. 358–363, Los Angeles, CA, 1998.

[Rode92] X. Rodet and P. Depalle, "Spectral envelopes and inverse FFT synthesis," in Proc. 93rd Conv. Audio Eng. Soc., preprint 3393, Oct. 1992.

[Rong99] Y. Rongshan and K. C. Chung, "High quality audio coding using a novel hybrid WLP-subband coding algorithm," 5th Asia-Pacific Conf. on Comm., vol. 2, pp. 952–955, Oct. 1999.

[Rose01] B. Rosenblatt, B. Trippe, and S. Mooney, Digital Rights Management: Business and Technology, New York, John Wiley & Sons, Nov. 2001.

[Roth83] J. H. Rothweiler, "Polyphase quadrature filters – a new subband coding technique," in Proc. ICASSP-83, pp. 1280–1283, May 1983.

[Ruiz00] FJ Ruiz and JR Deller “Digital watermarking of speech signals for the National Gallery of the Spoken Word” in Proc IEEE ICASSP-00 vol 3 pp 1499–1502 Istanbul Turkey June 2000

[Ryde96] T Ryden “Using listening tests to assess audio codecs” in Collected Papers on Digital Audio Bit-Rate Reduction N Gilchrist and C Grewin Eds Aud Eng Soc pp 115–125 1996

[Sabi82] M Sabin and R Gray “Product code vector quantizers for speech waveform coding” in Proc Globecom 1982 (Miami FL Dec 1982) p E651

[SACD02] Philips and Sony “Super Audio CD System Description” Philips Licensing Eindhoven The Netherlands 2002

[SACD03] Philips Super Audio CD 2003 Available at http://www.superaudiocd.philips.com

[Sait02a] S Saito T Furukawa and K Konishi “A data hiding for audio using band division based on QMF bank” in Proc IEEE ISCAS-02 vol 3 pp 635–638 May 2002

[Sait02b] S Saito T Furukawa and K Konishi “A digital watermarking for audio data using band division based on QMF bank” in Proc IEEE ICASSP-02 vol 4 pp 3473–3476 Orlando FL May 2002

[Saka00] T Sakamoto M Taruki and T Hase “MPEG-2 AAC decoder for a 32-bit MCU” Int Conf on Consumer Electronics ICCE-00 pp 256–257 June 2000

[Sala98] R Salami et al “Design and description of CS-ACELP A toll quality 8 kb/s speech coder” IEEE Trans on Speech and Audio Processing vol 6 no 2 pp 116–130 Mar 1998

[Sand01] FE Sandnes “Efficient large-scale multichannel audio coding” in Proc of Euromicro Conference pp 392–399 Sept 2001

[Saun96] J Saunders “Real time discrimination of broadcast speech/music” in Proc IEEE ICASSP-96 pp 993–996 Atlanta GA May 1996

[SBC90] Swedish Broadcasting Corporation “ISO MPEG/Audio Test Report” Stockholm July 1990

[Scha70] B Scharf “Critical bands” in Foundations of Modern Auditory Theory Academic Press New York 1970

[Scha75] R Schafer and L Rabiner “Digital representations of speech signals” Proc IEEE vol 63 no 4 pp 662–677 Apr 1975

[Scha79] R Schafer and J Markel Speech Analysis IEEE Press New York 1979

[Scha95] R Schafer and T Sikora “Digital video coding standards and their role in video communications” in Proc of the IEEE vol 83 no 6 June 1995

[Sche98a] E Scheirer “The MPEG-4 Structured Audio Standard” in Proc IEEE ICASSP-98 vol 6 pp 3801–3804 Seattle WA May 1998

[Sche98b] E Scheirer and M Slaney “Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator” in Proc IEEE ICASSP-98 Seattle WA May 1998

[Sche98c] E Scheirer and L Ray “Algorithmic and wavetable synthesis in the MPEG-4 multimedia standard” in Proc 105th Conv Aud Eng Soc preprint 4811 Sept 1998

[Sche98d] E Scheirer “The MPEG-4 structured audio orchestra language” in Proc ICMC Oct 1998

[Sche98e] E Scheirer et al “AudioBIFS the MPEG-4 standard for effects processing” in Proc DAFX98 Workshop on Digital Audio Effects Processing Nov 1998

[Sche99a] E Scheirer “Structured audio and effects processing in the MPEG-4 multimedia standard” ACM Multimedia Sys vol 7 no 1 pp 11–22 1999

[Sche99b] E Scheirer and BL Vercoe “SAOL The MPEG-4 structured audio orchestra language” Comput Music J vol 23 no 2 pp 31–51 1999

[Sche01] E Scheirer “Structured audio Kolmogorov complexity and generalized audio coding” in IEEE Trans on Speech and Audio Proc vol 9 no 8 pp 914–931 Nov 2001

[Schi81] S Schiffman et al Introduction to Multidimensional Scaling Theory Method and Applications Academic Press New York 1981

[Schr79] MR Schroeder BS Atal and JL Hall “Optimizing digital speech coders by exploiting masking properties of the human ear” J Acoust Soc Am vol 66 pp 1647–1652 Dec 1979

[Schr85] MR Schroeder and B Atal “Code-excited linear prediction (CELP) high quality speech at very low bit rates” in Proc IEEE ICASSP-85 vol 10 pp 937–940 Tampa FL Apr 1985

[Schr86] E Schroder and W Voessing “High quality digital audio encoding with 3.0 bits/sample using adaptive transform coding” in Proc 80th Conv Aud Eng Soc preprint 2321 Mar 1986

[Schu96] D Schulz “Improving audio codecs by noise substitution” J Audio Eng Soc pp 593–598 July/Aug 1996

[Schu98] G Schuller “Time-varying filter banks with low delay for audio coding” in Proc 105th Conv Audio Eng Soc preprint 4809 Sept 1998

[Schu00] G Schuller and T Karp “Modulated filter banks with arbitrary system delay efficient implementations and the time-varying case” IEEE Trans Speech Audio Proc vol 48 no 3 pp 737–748 Mar 2000

[Schu01] G Schuller Y Bin and D Huang “Lossless coding of audio signals using cascaded prediction” in Proc IEEE ICASSP-01 vol 5 pp 3273–3276 Salt Lake City UT 2001

[Schu02] G Schuller and A Harma “Low delay audio compression using predictive coding” in Proc IEEE ICASSP-02 vol 2 pp 1853–1856 Orlando FL 2002

[Schu04] E Schuijers et al “Low complexity parametric stereo coding” in 116th Conv Aud Eng Soc preprint 6073 Berlin Germany May 2004

[Schu05] G Schuller J Kovacevic F Masson and V Goyal “Robust low-delay audio coding using multiple descriptions” IEEE Trans Speech Audio Proc vol 13 no 5 pp 1014–1024 Sep 2005

[Sega76] A Segall “Bit allocation and encoding for vector sources” IEEE Trans Info Theory vol 22 no 2 pp 162–169 Mar 1976

[Seok00] J-W Seok and J-W Hong “Prediction-based audio watermark detection algorithm” in Proc 109th Conv Aud Eng Soc preprint 5254 Los Angeles Sept 2000

[Sera97] C Serantes et al “A fast noise-scaling algorithm for uniform quantization in audio coding schemes” in Proc IEEE ICASSP-97 pp 339–342 Munich Germany Apr 1997

[Serr89] X Serra “A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition” PhD Thesis Stanford University 1989

[Serr90] X Serra and JO Smith III “Spectral modeling and synthesis a sound analysis/synthesis system based on a deterministic plus stochastic decomposition” Comput Mus J pp 12–24 Winter 1990

[Sevi94] D Sevic and M Popovic “A new efficient implementation of the oddly-stacked Princen-Bradley filter bank” IEEE Sig Proc Lett vol 1 no 11 Nov 1994

[Shan48] CE Shannon “A mathematical theory of communication” Bell Sys Tech J vol 27 pp 379–423 pp 623–656 July/Oct 1948

[Shao02] S Shaopeng Y Junxun Y Yongcong and A-M Raed “A low bit-rate audio coder based on modified sinusoidal model” in IEEE Int Conf on Comm Circuits and Systems and West Sino Expositions vol 1 pp 648–652 June 2002

[Shap93] J Shapiro “Embedded image coding using zerotrees of wavelet coefficients” IEEE Trans Sig Proc vol 41 no 12 pp 3445–3462 Dec 1993

[Shin02] S Shin O Kim J Kim and J Choil “A robust audio watermarking algorithm using pitch scaling” Int Conf on DSP-02 vol 2 pp 701–704 July 2002

[Shli94] S Shlien “Guide to MPEG-1 audio standard” IEEE Trans Broadcast pp 206–218 Dec 1994

[Shli96] S Shlien and G Soulodre “Measuring the characteristics of ‘expert’ listeners” in Proc 101st Conv Aud Eng Soc preprint 4339 Nov 1996

[Shli97] S Shlien “The modulated lapped transform its time-varying forms and its applications to audio coding standards” IEEE Trans on Speech and Audio Proc vol 5 no 4 pp 359–366 July 1997

[Shoh88] Y Shoham and A Gersho “Efficient bit allocation for an arbitrary set of quantizers” IEEE Trans ASSP vol 36 pp 1445–1453 1988

[Siko97a] T Sikora “The MPEG-4 Video Standard Verification Model” in IEEE Trans CSVT vol 7 no 1 Feb 1997

[Siko97b] T Sikora “MPEG digital video-coding standards” in IEEE Sig Proc Mag vol 14 no 5 pp 82–100 Sept 1997

[Silv74] HF Silverman and NR Dixon “A parametrically controlled spectral analysis system for speech” IEEE Trans ASSP vol 22 no 5 pp 362–381 Oct 1974

[Silv01] GCM Silvestre NJ Hurley GS Hanau and WJ Dowling “Informed audio watermarking scheme using digital chaotic signals” in Proc IEEE ICASSP-01 vol 3 pp 1361–1364 Salt Lake City UT May 2001

[Sing84] S Singhal and B Atal “Improving performance of multi-pulse LPC coders at low bit rates” in Proc IEEE ICASSP-84 vol 9 pp 9–12 San Diego CA Mar 1984

[Sing90] S Singhal “High quality audio coding using multipulse LPC” in Proc IEEE ICASSP-90 vol 2 pp 1101–1104 Albuquerque NM Apr 1990

[Sinh93a] D Sinha and A Tewfik “Low bit rate transparent audio compression using a dynamic dictionary and optimized wavelets” in Proc IEEE ICASSP-93 pp I-197–I-200 Minneapolis MN May 1993

[Sinh93b] D Sinha and A Tewfik “Low bit rate transparent audio compression using adapted wavelets” IEEE Trans Sig Proc vol 41 no 12 pp 3463–3479 Dec 1993

[Sinh96] D Sinha and J Johnston “Audio compression at low bit rates using a signal adaptive switched filter bank” in Proc IEEE ICASSP-96 pp 1053–1056 Atlanta GA May 1996

[Sinh98a] D Sinha et al “The perceptual audio coder (PAC)” in The Digital Signal Processing Handbook V Madisetti and D Williams Eds CRC Press Boca Raton FL pp 42-1–42-18 1998

[Sinh98b] D Sinha and CEW Sundberg “Unequal error protection (UEP) for perceptual audio coders” in Proc 104th Conv Aud Eng Soc preprint 4754 May 1998

[Smar95] G Smart and A Bradley “Filter bank design based on time-domain aliasing cancellation with non-identical windows” in Proc IEEE ICASSP-94 vol 3 pp 185–188 Detroit MI May 1995

[Smit86] MJT Smith and T Barnwell “Exact reconstruction techniques for tree-structured subband coders” IEEE Trans Acous Speech and Sig Process vol 34 no 3 pp 434–441 June 1986

[Smit95] JO Smith and J Abel “The Bark bilinear transform” in Proc IEEE Workshop on App Sig Proc to Audio and Electroacoustics Oct 1995

[Smit99] JO Smith and JS Abel “Bark and ERB bilinear transforms” IEEE Trans on Speech and Audio Proc vol 7 no 6 pp 697–708 Nov 1999

[SMPTE99] SMPTE 320M-1999 Channel Assignments and Levels on Multichannel Audio Media Society of Motion Picture and Television Engineers White Plains NY 1999

[SMPTE02] SMPTE RP 173–2002 Loudspeaker Placements for Audio Monitoring in High-Definition Electronic Production Society of Motion Picture and Television Engineers White Plains NY 2002

[Smyt96] S Smyth WP Smith M Smyth M Yan and T Jung “DTS coherent acoustics delivering high quality multichannel sound to the consumer” in Proc 100th Conv Aud Eng Soc May 11–14 preprint 4293 1996

[Smyt99] M Smyth “White paper An Overview of the Coherent Acoustics Coding System” June 1999 Available on the Internet at www.dtsonline.com/media/uploads/pdfs/whitepaper.pdf

[Soda94] I Sodagar et al “Time-varying filter banks and wavelets” IEEE Trans Sig Proc vol 42 pp 2983–2996 Nov 1994

[Sore96] HV Sorensen PM Aziz and J van der Spiegel “An overview of sigma-delta converters” IEEE Sig Proc Mag vol 13 pp 61–84 Jan 1996

[Soul98] G Soulodre et al “Subjective evaluation of state-of-the-art two-channel audio codecs” J Aud Eng Soc vol 46 no 3 pp 164–177 Mar 1998

[Span91] A Spanias “A hybrid transform method for speech analysis and synthesis” Signal Processing vol 24 pp 217–229 1991

[Span92] A Spanias M Deisher P Loizou and G Lim “Fixed-point implementation of the VSELP algorithm” ASU-TRC Technical Report TRC-SP-ASP-9201 July 1992

[Span94] A Spanias “Speech coding A tutorial review” in Proc of the IEEE vol 82 no 10 pp 1541–1582 Oct 1994

[Span97] A Spanias and T Painter “Universal Speech and Audio Coding Using a Sinusoidal Signal Model” ASU-TRC Technical Report 97-001 Jan 1997

[Spec98] “Special issue on copyright and privacy protection” IEEE J Select Areas Commun vol 16 May 1998

[Spec99] “Special issue on identification and protection of multimedia information” in Proc of the IEEE vol 87 July 1999

[Spen01] G Spenger J Herre C Neubauer and N Rump “MPEG-21 – What does it bring to Audio” in Proc 111th Conv Aud Eng Soc preprint 5462 Nov–Dec 2001

[Sper00] R Sperschneider “Error resilient source coding with variable length codes and its application to MPEG advanced audio coding” in Proc 109th Conv Aud Eng Soc preprint 5271 Sept 2000

[Sper01] R Sperschneider and P Lauber “Error concealment for compressed digital audio” in Proc 111th Conv Aud Eng Soc preprint 5460 Nov–Dec 2001

[Sper02] R Sperschneider D Homm and L-H Chambat “Error resilient source coding with differential variable length codes and its application to MPEG advanced audio coding” in Proc 112th Conv Aud Eng Soc preprint 5555 May 2002

[Spor95a] T Sporer and K Brandenburg “Constraints of filter banks used for perceptual measurement” J Aud Eng Soc vol 43 no 3 pp 107–115 Mar 1995

[Spor95b] T Sporer et al “Evaluating a measurement system” J Aud Eng Soc vol 43 no 5 pp 353–363 May 1995

[Spor96] T Sporer “Evaluating small impairments with the mean opinion scale – reliable or just a guess” in Proc 101st Conv Aud Eng Soc preprint 4396 Nov 1996

[Spor97] T Sporer “Objective audio signal evaluation – applied psychoacoustics for modeling the perceived quality of digital audio” in Proc 103rd Conv Aud Eng Soc preprint 4512 Sept 1997

[Spor99a] T Sporer web site on TG 10/4 and other perceptual measurement standardization activities available at http://www.lte.e-technik.uni-erlangen.de/~spo/tg104/index.html

[SQAM88] “Sound Quality Assessment Material Recordings for Subjective Tests” EBU Doc Tech 3253 (includes SQAM Compact Disc) 1988

[Sree98a] T Sreenivas and M Dietz “Vector quantization of scale factors in advanced audio coder (AAC)” in Proc IEEE ICASSP-98 vol 6 pp 3641–3644 Seattle WA May 1998

[Sree98b] T Sreenivas and M Dietz “Improved AAC Performance <64 kb/s using VQ” in Proc 104th Conv Audio Eng Soc preprint 4750 Amsterdam 1998

[Srin97] P Srinivasan Speech and Wideband Audio Compression Using Filter Banks and Wavelets PhD Thesis Purdue University May 1997

[Srin98] P Srinivasan and L Jamieson “High quality audio compression using an adaptive wavelet packet decomposition and psychoacoustic modeling” IEEE Trans Sig Proc pp 1085–1093 Apr 1998 ‘C’ Libraries and examples available on http://www.wavelet.ecn.purdue.edu/~speechg

[Stau92] J Stautner “Physical testing of psychophysically-based digital audio coding systems” in Proc 11th Int Conf Aud Eng Soc pp 203–209 May 1992

[Stol93a] G Stoll et al “Extension of ISO/MPEG-audio layer II to multi-channel coding the future standard for broadcasting telecommunication and multimedia applications” in Proc 94th Conv Aud Eng Soc preprint 3550 Mar 1993

[Stol96] G Stoll et al “ISO-MPEG-2 Audio A generic standard for the coding of two-channel and multi-channel Sound” in Collected Papers on Digital Audio Bit-Rate Reduction N Gilchrist and C Grewin Eds 1996

[Stoll88] G Stoll et al “Masking-pattern adapted subband coding use of the dynamic bit-rate margin” in Proc 84th Conv Aud Eng Soc preprint 2585 Mar 1988

[Stor97] R Storey “ATLANTIC Advanced Television at Low Bitrates and Networked Transmission over Integrated Communication Systems” ACTS Common European Newsletter Feb 1997

[Stra96] G Strang and T Nguyen Wavelets and Filter Banks Wellesley-Cambridge Press Wellesley MA 1996

[Stru80] H Strube “Linear prediction on a warped frequency scale” J Acoust Soc Am vol 68 no 4 pp 1071–1076 Oct 1980

[Stua93] J Stuart “Noise methods for estimating detectability and threshold” in Proc 94th Audio Eng Soc Conv preprint 3477 Berlin Mar 1993

[Sugi90] A Sugiyama et al “Adaptive transform coding with an adaptive block size (ATC-ABS)” in Proc IEEE ICASSP-90 pp 1093–1096 Albuquerque NM May 1990

[Swan98a] M Swanson B Zhu A Tewfik and L Boney “Robust audio watermarking using perceptual masking” Elsevier Signal Processing Special Issue on Copyright Protection and Access Control vol 66 no 3 pp 337–355 May 1998

[Swan98b] MD Swanson M Kobayashi and AH Tewfik “Multimedia data-embedding and watermarking technologies” Proc of the IEEE vol 86 no 6 pp 1064–1087 June 1998

[Swan99] M Swanson B Zhu and A Tewfik “Current state of the art challenges and future directions for audio watermarking” in Proc ICMCS vol 1 pp 19–24 Florence Italy June 1999

[Swee02] P Sweeney Error Control Coding From Theory to Practice John Wiley & Sons New York May 2002

[Taka88] M Taka et al “DSP implementations of sophisticated speech codecs” IEEE Trans on Selected Areas in Communications vol 6 no 2 pp 274–282 Feb 1988

[Taka97] Y Takamizawa “An efficient tonal component coding algorithm for MPEG-2 audio NBC” in Proc IEEE ICASSP-97 pp 331–334 Munich Germany Apr 1997

[Taka01] Y Takamizawa and T Nomura “Processor-efficient implementation of a high quality MPEG-2 AAC encoder” in Proc 110th Conv Aud Eng Soc preprint 5294 May 2001

[Taki95] Y Takishima M Wada and H Murakami “Reversible variable length codes” IEEE Trans Commun vol 43 pp 158–162 Feb–Apr 1995

[Tan89] E Tan and B Vermuelen “Digital audio tape for data storage” IEEE Spectrum vol 26 no 10 pp 34–38 Oct 1989

[Tang95] B Tang et al “Spectral analysis of subband filtered signals” in Proc IEEE ICASSP-95 pp 1324–1327 Detroit MI May 1995

[Taor99] R Taori and RJ Sluijter “Closed-loop tracking of sinusoids for speech and audio coding” IEEE Workshop on Speech Coding Proceedings pp 1–3 1999

[Tefa03] A Tefas A Nikolaidis N Nikolaidis V Solachidis S Tsekeridou and I Pitas “Performance analysis of correlation-based watermarking schemes employing Markov chaotic sequences” Trans on Sig Proc vol 51 no 7 pp 1979–1994 July 2003

[Teh90] D Teh et al “Subband coding of high-fidelity quality audio signals at 128 kbps” in Proc IEEE ICASSP-92 vol 2 pp 197–200 Albuquerque NM May 1990

[tenK92] W ten Kate et al “Matrixing of bit-rate reduced signals” in Proc IEEE ICASSP vol 2 pp 205–208 San Francisco CA May 1992

[tenK94] W ten Kate et al “Compatibility matrixing of multi-channel bit rate reduced audio signals” in Proc 96th Conv Aud Eng Soc preprint 3792 1994

[tenK96b] W ten Kate “Maintaining audio quality in cascaded psychoacoustic coding” in Proc 101st Conv Aud Eng Soc preprint 4387 Nov 1996

[tenK97] R ten Kate “Disc-technology for super-quality audio applications” in Proc 103rd Conv Aud Eng Soc preprint 4565 Sept 1997 New York

[Terh79] E Terhardt “Calculating virtual pitch” Hearing Research vol 1 pp 155–182 1979

[Terh82] E Terhardt et al “Algorithm for extraction of pitch and pitch salience from complex tonal signals” J Acous Soc Am vol 71 pp 679–688 Mar 1982

[Terr96] K Terry and J Seaver “A real-time multichannel Dolby AC-3 audio encoder implementation” in Proc 101st Conv Aud Eng Soc preprint 4363 Nov 1996

[Tewf93] A Tewfik and M Ali “Enhanced wavelet based audio coder” in Conf Rec of the 27th Asilomar Conf on Sig Sys and Comp pp 896–900 Nov 1993

[Thei87] G Theile et al “Low-bit rate coding of high quality audio signals” in Proc 82nd Conv Aud Eng Soc preprint 2432 Mar 1987

[Thie96] T Thiede and E Kabot “A new perceptual quality measure for bit rate reduced audio” in Proc 100th Conv Aud Eng Soc preprint 4280 May 1996

[Thom82] D Thomson “Spectrum estimation and harmonic analysis” Proc of the IEEE pp 1055–1096 Sep 1982

[Thor00] NJ Thorwirth P Horvatic R Weis and J Zhao “Security methods for MP3 music delivery” in 34th ASILOMAR Conf on Signals Systems and Computers vol 2 pp 1831–1835 Nov 2000

[Todd94] C Todd et al “AC-3 flexible perceptual coding for audio transmission and storage” in Proc 96th Conv Aud Eng Soc preprint 3796 Feb 1994

[Toh03] C-K Toh et al “Transporting audio over wireless ad hoc networks experiments and new insights” in Proc of Personal Indoor and Mobile Radio Communications (PIMRC-2003) vol 1 pp 772–777 Sept 2003

[Tran90] I Trancoso and B Atal “Efficient search procedures for selecting the optimum innovation in stochastic coders” IEEE Trans ASSP vol 38 no 3 pp 385–396 Mar 1990

[Trap02] W Trappe and LC Washington Introduction to Cryptography with Coding Theory Prentice Hall Englewood Cliffs NJ Jan 2002

[Trem82] TE Tremain “The government standard linear predictive coding algorithm LPC-10” Speech Technology Magazine pp 40–49 Apr 1982

[Treu96] W Treurniet “Simulation of individual listeners with an auditory model” in Proc 100th Conv Aud Eng Soc preprint 4154 May 1996

[Tsai01] CW Tsai and JL Wu “On constructing the Huffman-code based reversible variable length codes” IEEE Trans Commun vol 49 pp 1506–1509 Sept 2001

[Tsai02] T-H Tsai and J-N Liu “Architecture design for MPEG-2 AAC filterbank decoder using modified regressive method” in Proc IEEE ICASSP vol 3 pp 3216–3219 Orlando FL May 2002

[Tsut96] K Tsutsui et al “ATRAC Adaptive Transform Acoustic Coding for MiniDisc” in Collected Papers on Digital Audio Bit-Rate Reduction N Gilchrist and C Grewin Eds Aud Eng Soc pp 95–101 1996

[Tsut98] K Tsutsui “ATRAC (Adaptive Transform Acoustic Coding) and ATRAC 2” in The Digital Signal Processing Handbook V Madisetti and D Williams Eds CRC Press Boca Raton FL pp 43-16–43-20 1998

[Twve99] On-line animations of cochlear traveling waves a) Boys Town National Research Hospital Communication Engineering Laboratory http://www.btnrh.boystown.org/cel/waves.htm b) Department of Physiology at the University of Wisconsin Madison http://www.neurophys.wisc.edu/animations c) Scuola Internazionale Superiore di Studi Avanzati/International School for Advanced Studies (SISSA/ISAS) http://www.sissa.it/bp/Cochlea/twlo.htm and d) Ear Lab at Boston University http://earlab.bu.edu/physiology/mechanics.html

[USAT95a] United States Advanced Television Systems Committee (ATSC) Doc A/53 “Digital Television Standard” Sep 1995 Available online at http://www.atsc.org/Standards/A53

[USAT95b] United States Advanced Television Systems Committee (ATSC) Doc A/52 “Digital Audio Compression Standard (AC-3)” Dec 1995 Available online at http://www.atsc.org/Standards/A52

[Vaan00] R Vaananen “Synthetic audio tools in MPEG-4 standard” in Proc 108th Conv Aud Eng Soc preprint 5080 Feb 2000

[Vaid87] PP Vaidyanathan “Quadrature mirror filter banks M-band extensions and perfect-reconstruction techniques” IEEE ASSP Mag vol 4 no 3 pp 4–20 July 1987

[Vaid88] PP Vaidyanathan and PQ Hoang “Lattice structures for optimal design and robust implementation of two-channel perfect reconstruction QMF banks” IEEE Trans Acous Speech and Sig Process vol 36 no 1 pp 81–94 Jan 1988

[Vaid90] PP Vaidyanathan “Multirate digital filters filter banks polyphase networks and applications a tutorial” Proc of the IEEE vol 78 no 1 pp 56–93 Jan 1990

[Vaid93] PP Vaidyanathan Multirate Systems and Filter Banks Prentice-Hall Englewood Cliffs NJ 1993

[Vaid94] PP Vaidyanathan and T Chen “Statistically optimal synthesis banks for subband coders” in Proc 28th Asilomar Conf on Sig Sys and Comp vol 2 pp 986–990 Nov 1994

[vand98] O van der Vrecken et al “A new subband perceptual audio coder using CELP” in Proc IEEE ICASSP-98 pp 3661–3664 Seattle WA May 1998

[Vande02] S van de Par A Kohlrausch G Charestan and R Heusdens “A new psychoacoustical masking model for audio coding applications” in Proc IEEE ICASSP-02 vol 2 pp 1805–1808 Orlando FL 2002

[Vasi02] N Vasiloglou RW Schafer and MC Hans “Lossless audio coding with MPEG-4 structured audio” 2nd Int Conf on WEBDELMUSIC-02 pp 184–191 Dec 2002

[Vaup91] T Vaupel “Ein Beitrag zur Transformationscodierung von Audiosignalen unter Verwendung der Methode der ‘Time Domain Aliasing Cancellation (TDAC)’ und einer Signalkompandierung im Zeitbereich” PhD Dissertation Duisburg Germany Univ of Duisburg Apr 1991

[Veen03] MV Veen AV Leest and F Bruekers “Reversible audio watermarking” in Proc 114th Conv Aud Eng Soc preprint 5818 Amsterdam Mar 2003

[Veld89] RNJ Veldhuis “Subband coding of digital audio signals without loss of quality” in Proc IEEE ICASSP-89 pp 2009–2012 Glasgow Scotland May 1989

[Verc95] BL Vercoe Csound A Manual for the Audio-Processing System MIT Media Lab Cambridge 1995

[Verc98] BL Vercoe et al “Structured audio creation transmission and rendering of parametric sound representations” Proc IEEE vol 86 no 5 pp 922–940 May 1998

[Verm98a] T Verma and T Meng “An analysis/synthesis tool for transient signals that allows a flexible sines+transients+noise model for audio” in Proc IEEE ICASSP-98 Seattle WA May 1998

[Verm98b] T Verma et al “Transient modeling synthesis a flexible analysis/synthesis tool for transient signals” in Proc Int Comp Mus Conf pp 164–167 Greece 1997

[Vern93] S Vernon et al “A single-chip DSP implementation of a high-quality low bit-rate multi-channel audio coder” in Proc 95th Conv Aud Eng Soc 1993

[Vern95] S Vernon “Design and implementation of AC-3 coders” IEEE Trans Consumer Elec vol 41 no 3 Aug 1995

[Vett92] M Vetterli and C Herley “Wavelets and filter banks” IEEE Trans Sig Proc vol 40 no 9 pp 2207–2232 Sept 1992

[Vier86] MA Viergever “Cochlear macromechanics – a review” in Peripheral Auditory Mechanisms J Allen et al Eds Springer Berlin 1986

[Volo01] S Voloshynovskiy S Pereira T Pun JJ Eggers and JK Su “Attacks on digital watermarks classification estimation based attacks and benchmarks” IEEE Comm Mag vol 39 no 8 pp 118–126 Aug 2001

[Vora97] S Voran “Perception-based bit-allocation algorithms for audio coding” in IEEE App of Sig Proc to Audio and Acoustics (ASSP) Workshop 19–22 Oct 1997

[Voro88] P Voros “High-quality sound coding within 2 × 64 kbit/s using instantaneous dynamic bit-allocation” in Proc IEEE ICASSP-88 pp 2536–2539 New York NY May 1988

[Wang98] Y Wang “A new watermarking method of digital audio content for copyright protection” in Proc ICSP-98 vol 2 pp 1420–1423 Oct 1998

[Wang01] Y Wang J Ojanpera M Vilermo and M Vaananen “Schemes for re-compressing MP3 audio bitstreams” in Proc 111th Conv Aud Eng Soc preprint 5435 Nov–Dec 2001

[Watk88] J Watkinson Art of Digital Audio Focal Press London and Boston 1988

[Wege97] A Wegener “MUSICompress lossless low-MIPS audio compression in software and hardware” in Proc of the Int Conf on Sig Proc App and Tech Sept 1997

[Wein96] M Weinberger G Seroussi and G Sapiro “LOCO-I a low complexity context-based lossless image compression algorithm” Proc Data Compression Conf 1996

[Welc84] T Welch “A technique for high performance data compression” IEEE Comp vol 17 no 6 pp 8–19 June 1984

[Wen98] J Wen and J Villasenor “Reversible variable length codes for efficient and robust image and video coding” in Proc IEEE Data Compression Conf pp 471–480 1998

[West88] PH Westerink et al “An optimal bit allocation algorithm for sub-band coding” in Proc IEEE ICASSP-88 pp 757–760 New York NY May 1988

[Whit00] P White Basic MIDI Sanctuary Press Feb 2000

[Wick94] M Wickerhauser Adapted Wavelet Analysis from Theory to Software AK Peters Wellesley MA 1994

[Wick95] SB Wicker Error Control Systems for Digital Communication and Storage Prentice Hall Englewood Cliffs NJ 1995

[Widr85] B Widrow and S Stearns Adaptive Signal Processing Prentice Hall Englewood Cliffs NJ 1985

[Wies90] D Wiese and G Stoll “Bitrate reduction of high quality audio signals by modelling the ear’s masking thresholds” in Proc 89th Conv Aud Eng Soc preprint 2970 Sep 1990

[Wind98] B Winduratna “FM analysis/synthesis based audio coding” Proc 104th Conv Audio Eng Soc preprint 4746 May 1998

[Witt87] IH Witten RM Neal and JG Cleary “Arithmetic coding for data compression” Comm ACM vol 30 no 6 pp 520–540 June 1987

[Wolt03] M Wolters et al “A closer look into MPEG-4 high efficiency AAC” in 115th Conv Aud Eng Soc preprint 5871 New York NY Oct 2003

[Wu97] X Wu “High-order context modeling and embedded conditional entropy coding of wavelet coefficients for image compression” in Proc of 31st ASILOMAR Conf on Signals Systems and Computers pp 1378–1382 Nov 1997

[Wu98] X Wu and Z Xiong “An empirical study of high-order context modeling and entropy coding of wavelet coefficients” ISO/IEC JTC-1/SC-29/WG-1 No 771 Feb 1998

[Wu01] M Wu S Craver EW Felten and B Liu “Analysis of attacks on SDMI audio watermarks” in Proc IEEE ICASSP-01 vol 3 pp 1369–1372 Salt Lake City UT May 2001

[Wu03] M Wu and B Liu Multimedia Data Hiding Springer Verlag New York Jan 2003

[Wust98] U Wustenhagen et al “Subjective listening test of multichannel audio codecs” in Proc 105th Conv Aud Eng Soc preprint 4813 Sept 1998

[Wyli96b] F Wylie “APT-x100 low-delay low-bit-rate-subband ADPCM digital audio coding” in Collected Papers on Digital Audio Bit-Rate Reduction N Gilchrist and C Grewin Eds Aud Eng Soc pp 83–94 1996

[Xu99a] C Xu J Wu and Q Sun “A robust digital audio watermarking technique” in Proc of the Int Symposium on Signal Processing and Its Applications (ISSPA-99) vol 1 pp 95–98 Aug 1999

[Xu99b] C Xu J Wu and Q Sun “Digital audio watermarking and its application in multimedia database” in Proc of the Int Symposium on Signal Processing and Its Applications (ISSPA-99) vol 1 pp 91–94 Aug 1999

[Xu99c] C Xu J Wu Q Sun and K Xin “Applications of watermarking technology in audio signals” J Aud Eng Soc vol 47 no 10 pp 805–812 Oct 1999

[Yama98] H Yamauchi et al “The SDDS system for digitizing film sound” in The Digital Signal Processing Handbook V Madisetti and D Williams Eds CRC Press Boca Raton FL pp 43-6–43-12 1998

[Yang02] D Yang A Hongmei and C-CJ Kuo “Progressive multichannel audio codec (PMAC) with rich features” in Proc IEEE ICASSP-02 vol 3 pp 2717–2720 Orlando FL May 2002

[Yang03] C-H Yang and H-M Hang “Efficient bit assignment strategy for perceptual audio coding” in Proc IEEE ICASSP-03 vol 5 pp 405–408 Hong Kong 2003

[Yatr88] P Yatrou and P Mermelstein “Ensuring predictor tracking in ADPCM speech coders under noisy transmission conditions” IEEE J Selected Areas Commun (Special Issue on Voice Coding for Communications) vol 6 no 2 pp 249–261 Feb 1988

[Yeo01] I-K Yeo and HJ Kim “Modified patchwork algorithm a novel audio watermarking scheme” Int Conf on Inf Tech Coding and Computing pp 237–242 Apr 2001

[Yeo03] I-K Yeo and HJ Kim “Modified patchwork algorithm a novel audio watermarking scheme” in Trans on Speech and Aud Proc vol 11 no 4 pp 381–386 July 2003

[Yin97] L Yin M Suonio and M Vaananen “A new backward predictor for MPEG audio coding” in Proc 103rd Conv Aud Eng Soc preprint 4521 Sept 1997

[Yosh94] T Yoshida “The rewritable MiniDisc System” Proc of the IEEE vol 82 no 10 pp 1492–1500 Oct 1994

[Yosh97] S Yoshikawa et al “Does high sampling frequency improve perceptual time-axis resolution of a digital audio signal” 103rd Conv Aud Eng Soc preprint 4562 New York Sept 1997

[Zara02] RH Morelos-Zaragoza The Art of Error Correcting Coding John Wiley & Sons New York Apr 2002

[Zhan00] Q Zhang Y-Q Zhang and W Zhu “Resource allocation for audio and video streaming over the Internet” in Proc IEEE ISCAS-00 vol 4 pp 21–24 May 2000

[Zhen03] L Zheng and A Inoue “Audio watermarking techniques using sinusoidal patterns based on pseudorandom sequences” in Trans Circuits and Systems for Video Technology vol 13 no 8 pp 801–812 Aug 2003

[Zhou01] J Zhou Q Zhang Z Xiong and W Zhu “Error resilient scalable audio coding (ERSAC) for mobile applications” in IEEE 4th Workshop on Multimedia Signal Processing pp 307–312 Oct 2001

[Zieg02] T Ziegler et al “Enhancing MP3 with SBR Features and capabilities of the new mp3PRO algorithm” in 112th Conv Aud Eng Soc preprint 5560 Munich Germany May 2002

[Ziv77] J Ziv and A Lempel “A universal algorithm for sequential data compression” IEEE Trans Inform Theory vol 23 no 3 pp 337–343 May 1977

[Zwic67] E Zwicker and R Feldtkeller Das Ohr als Nachrichtenempfänger Hirzel Stuttgart Germany 1967

[Zwic77] E Zwicker “Procedure for calculating loudness of temporally variable sounds” J Acoust Soc of Am vol 62 pp 675–682 1977

[Zwic90] E Zwicker and H Fastl Psychoacoustics Facts and Models Springer-Verlag New York 1990

[Zwic91] E Zwicker and U Zwicker “Audio engineering and psychoacoustics matching signals to the final receiver the human auditory system” J Audio Eng Soc pp 115–126 Mar 1991

[Zwis65] J Zwislocki “Analysis of some auditory characteristics” in Handbook of Mathematical Psychology R Luce et al Eds John Wiley and Sons New York 1965

INDEX

AD conversion 20 331-bit digital conversion 3652-channel stereo 28032-channel format 267 2793G cellular standards 1013G wireless communications 10351 channel configuration 267ndash268 319

327 332 361

Absolute category rating (ACR) 384Absolute hearing threshold 113 115 128

137ACELP codebook 101Adaptive codebook 64Adaptive differential PCM 60Adaptive Huffman coding 80 81Adaptive resolution codec (ARCO) 230

232Adaptive spectral entropy coding

(ASPEC) 178 195197 203Adaptive step size 58Adaptive transform coder (ATC) 196Adaptive VQ 64 219ADPCM 2 51 60 96 102 336ADPCM subband filtering 338A-law companding 58

Algebraic CELP 101Aliasing 18 20 33 39 146 155 164 362Analysis-by-synthesis 97 104 197 233

242 248Arithmetic coding 52 83 345Asymmetric watermarking schemes 374Audio fingerprinting 316Audio graphic equalizer 32Audio identification 315Audio processing technology (APT-x100)

335 379Audio signature 316AudioPaK 343 348Auditory scene analysis 258 309Auditory spectrum distance 392Autocorrelation 40ndash44 60 94Automatic perceptual measurement 383

389 402Autoregressive moving average (ARMA)

25Auxiliary data string 369 376Averaging filter 26

Backward adaptive 60 64 201 281 326Bandwidth expansion 95 104Bandwidth scalability 298 301

Bark frequency scale 106 117 119Bark scale envelope 296Bark-domain masking 397Bark-scale warping 105Binaural cue coding (BCC) 319 382Binaural masking difference 294 396Binaural unmasking effects 324Bi-orthogonal MDCT basis 167 168 171Bit reservoir 182 249 287 304Bit-sliced arithmetic coding 300 356Butterfly stages 23

Cascaded CQF 156 159CCIR J41 reference audio codec 203CELP algorithms 100 233Continuous Fourier transform (CFT) 13

18Channel symbol 62Closed-loop analysis 91 106 236C-LPAC 344 352Cochlea 115ndash117 392Cochlear channel 116 390Code division multiple access 100Codebook 62ndash69 100 206Code-excited linear predictive (CELP) 67

70 98 100 233 291 300Coherent acoustics 265 268 337Coiled basilar membrane 115Compare internal-auditory representation

(CIR) 390Comparison category rating (CCR) 385Conformance testing 309Conjugate quadrature filters (CQF) 155Conjugate structure VQ 51 69Content protection 317 358 369Content retrieval 316Content-access control 368Convolution 16Cosine-modulated filterbank 160 163Critical band densities 391Critical bandwidth 105 116 119Critically-sampled 4 146 154 224Cross-correlation 41CS-ACELP 69 101 381Curve fitting 118 344

Data hiding 369Daubechiesrsquo wavelets 223Decimation 33 135 215 361Delta modulation 33 60Description scheme 311Deterministic signal 39 254 390DFT filterbank 178 179Difference equations 25

Differential PCM 59Differential perceptual audio coder

(DPAC) 195 204Digital audio watermarking 368Digital broadcast audio 212 378Direct broadcast satellite (DBS) 163 326

378Digital filters 25 30Digital item 317Digital multimedia 308Digital rights management 317 369Digital Sinc 21Digital theater systems (DTS) 264 267

338Digital versatile disc (DVD) 267 335Direct sequence spread spectrum 374Direct stream digital (DSD) 268 343

358 364Direct stream transfer (DST) 362 368Direct-form LP coefficients 95Dirichlet 22Disc-access control 368Discrete Bark spectrum 128Discrete cosine transform (DCT) 23 178

196 206 335 353Discrete Fourier transform (DFT) 22 155

178 197 200 205Discrete wavelet excitation 105Discrete wavelet packet transform

(DWPT) 155 214 255Discrete wavelet transform 214 224Discrete-time Fourier transform (DTFT)

20 34Distortion index 392 401Dolby AC-2AC-3 171 325Dolby digital plus 335Dolby proLogic 267 327 332Down-mixing 280 333Down-sampling 20 33 36DVD-Audio 263 268 343 356Dynamic bit-allocation 52 70 230 275Dynamic crosstalk 281Dynamic dictionary 218

Echo data hiding 370Encryption 358 369Entropy 52 74Equivalent noise bandwidth 118Equivalent rectangular bandwidth (ERB)

117 122Ergodic 40Error concealment 305Error protection 7 291 305 324Error resilience 291 306

Euclidean distance 130 204European A-law 58European broadcasting union (EBU) 269

389Expansion function 57Expectation operator 10 40 70 259

Fine grain scalability 291 300 356Finite-length impulse response (FIR) 26FIR filters 39 167 214FIRIIR prediction 344 349 358Fixed codebooks 63Formant structure 93Formants 29 93 109Forward adaptive 60 64 289 328Forward error correction 307Forward linear prediction 94 339Forward MDCT 165 176Frequency response 10 27Frequency warped LP 92 105Frequency-to-place transformation 116FS 1016 CELP 100 110

Gain modification 182 185 341General audio coder 294 307Generalized audio coding 303 380Generalized PR cosine-modulated

filterbank 164Givens rotation 355Global masking threshold 125 130 137

202 324Golomb coding 52 82 345 349Granular quantization noise 386

Hann window 128 131 219 394Harmonic synthesis 228Harmonic vector excitation coding

(HVXC) 300Head-related transfer function 382High definition television (HDTV) 274

378Huffman coding 52 77 202 227Human auditory system 4 114 125 152Human visual system 370Hybrid filter bank 152 163 167 178 185Hybrid filter bankCELP 233 235Hybrid subbandCELP 234ndash235

IIR filterbanks 212 237Imperceptibility 377Individual masking threshold 126 136

324Infinite-length impulse response (IIR)

filter 25 43 202 237 344 351

Informational masking 396 402Instrument timbre 315Integer MDCT (IntMDCT) 354Intellectual property 317 369 380Interchannel coherence 319Interleaved pulse permutation codes 101Interoperable framework 317iPod 9 278IS 96 QCELP 102ISOIEC MPEG psychoacoustic model 5

105 113 130 275ITU 5-point impairment scale 286ITU-BS1116 288 384ITU-T G722 91 102ITU-T G7222 AMR-WB codec 103ITU-T G726 62 96ITU-T G728 LD-CELP 100 234 304ITU-T G729 69 101 301 381

JND estimates 129JND thresholds 198 276 283Just-noticeable distortion (JND) 126

Kaiser MDCT 171Kaiser-Bessel derived (KBD) window

171 284 326K-means algorithm 235Kraft inequality 76

Laplacian distribution 54 77 81 201 336LBG algorithm 64Lempel-Ziv coding 52Levinson-Durbin recursive algorithm 95

186 301Line spectrum pairs (LSPs) 95Linear PCM 51 55 338Linear prediction (LP) 41 92ndash99 102

186Linear predictive coding (LPC) 91 99LMS algorithm 281 286Log area ratios (LARs) 95Logarithmic companding 57Long-term prediction (LTP) 95 99 111

289 296Lossless audio coding 264 343Lossless matrixing 358 361Lossless transform audio coding (LTAC)

353Lossy audio coding 265 343Loudness mapping 395 400Low-delay coding 61 307 381LPC 69 91

LPC10e 95ndash96LTI system 16LTP lags 96Lucent PAC 3 321 379

Mantissa quantization 328 331MASCAM 212Masking threshold 105 125 138Masking tone 121ndash124Matrixing 267 279Maximally decimated 146 154M-band filterbanks 148 160MDCT basis 165 168 328MDCT filterbank 149 167Mean opinion score (MOS) 9Mean subjective score 385Melody features 312 315Mel-scale 400Meridian lossless packing (MLP) 268 358Metadata 316Middle-ear transfer function 394Minimum masking threshold 126 221Minimum redundancy codes 77Minimum SMR 123 137MIPS 6Mixed excitation schemes 96Modified discrete cosine transform

(MDCT) 145 163Modulated bi-orthogonal lapped transform

(MBLT) 171Mother wavelet function 237Moving pictures expert group (MPEG)

270MP 1 Layer III MP3 274ndash276MP3 encoders 130MPEG 1 Layer I 275MPEG 2 Advanced audio coding (AAC)

283MPEG 2 BC 279MPEG 21 317MPEG 4 289MPEG 4 LD codec 304MPEG 4 natural audio 292MPEG 4 synthetic audio 303MPEG 7 309micro-law 57Multichannel coding 327 331Multichannel perceptual audio coder

(MPAC) 268Multichannel surround sound 4 267Multilingual compatibility 279Multimedia content description interface

(MCDI) 274 309Multipulse excitation (MPE) 5 104 301

Multipulse excited linear prediction 98Multirate signal processing 33Multistage tree-structured VQ (MSTVQ)

206Multitransform 232Musical instrument digital interface

(MIDI) 264Musical instrument timbre description tool

(MITDT) 315MUSICAM 212MUSICompress 3 351

Natural audio coding 291ndash292NetSound 303 380Noble identities 158Noise loudness 391 400Noise signal evaluation (NSE) 390Noise threshold 139 201Noise-masking-noise 124Noise-masking-tone 123 399Noise-to-mask ratio (NMR) 74 138 205

289 396Nonadaptive Huffman coding 80Nonadaptive PCM 57Non-integer factor sampling 36Nonlinear interpolative VQ 206Nonparametric quantization 51Nonsimultaneous masking 120 127North American PCM standard 58

Object-based representation 274 308Objective measures 6 383 403Octave-band 147 158Open-loop closed-loop 96Optimum bit assignment 72Optimum coding in the frequency domain

(OCF) 196

Parametric audio coding 242 300Parametric coder 257Parametric phase-modulated window 168Parametric spreading 330PDF-optimized PCM 57Peaking filters 30Perceptual audio coder (PAC) 321Perceptual audio quality measure (PAQM)

392Perceptual bit allocation 52 74 120 138

149Perceptual entropy 2 75 128Perceptual evaluation (PERCEVAL) 6

401PEAQ PEAQ 403Perceptual noise substitution (PNS) 294

Perceptual transform coder (PXFM) 197Perceptual transparency 235 377Perceptual weighting filter PWF 91 98

104Perceptually-weighted speech 99Percussive instrument timbre 315Perfect reconstruction (PR) 4 36 147Phase coding 370Phase vocoder 380Phillips digital compact cassette (DCC) 3

278Phillips PASC 3 6Pitch prediction 111 296Playback control 368Poles and zeros 29Polynomial approximations 346Positive definite 41Post-masking 127Power spectral density (PSD) 10 42Prediction error 59 94 104Predictive differential coding 59Pre-echo control 182Pre-echo distortion 180Pre-masking 278Priority codewords 306Probability density function 53Pseudo QMF 160 177 275 338Pseudorandom noise (PN) 374Psychoacoustics 2 74 113Pulse-sinc pair 15

Quadraphonic 267Quadrature mirror filter (QMF) 36 102Quantization levels 54 70 199Quantization noise 53 75

Random process signal 39 53Real-time DSP 352Reed-solomon codes 307Reflection coefficients 95 104Regular pulse excitation (RPE) 98 301Rice codes 52 81Rights expression language (REL) 317Run-Length encoding 82

Sampling theorem 17Scalable audio coding 258 298Scalable DWPT coder 220Scalable polyphonic 266Scalar quantization 54Security key 375Selectable mode vocoder (SMV) 69 102Sensation level 115Sequential bit-allocation method 72

Shannon-Fano coding 76Shelving filters 30SHORTEN 307 343 346Short-term linear predictor 94Sigma-Delta () 20 362Signal-to-mask ratio (SMR) 74 123 126

138Signal-to-noise ratio 55 62 116Simulcast 256Simultaneous masking 120 126 130Sine MDCT 171Sinusoidal excitation coding 105Sony ATRAC 6 173 264 320Sony dynamic digital sound (SDDS) 321Sound pressure level (SPL) 10 113 131Sound recognition 315ndash316Source-system modeling 92Spatial audio coding 319Spectral band replication (SBR) 308Spectral flatness measure (SFM) 74 129

196Speech recognition 315Split-VQ 67Spoken content description tool (SCDT)

315Spread of masking 125Spread spectrum watermarking 372Standard deviation 40 214Statistical transparency 377Step size 54 57Stereophonic coder 198Stereophonic PXFM 321 198ndash199Streaming audio 4 241 263 300 378Structured audio orchestra language

(SAOL) 293 303Structured audio score language (SASL)

294Subband coding 211Subwoofer 268 270 321Subjective difference grade 384 391Super audio CD (SACD) 3 20 358Supra-masking threshold 284 287Synchronization 269 279 291 341 375

TDMA 100Temporal noise shaping (TNS) 106 182

185 283Text-to-speech 274 291 293 304Third generation partnership project

(3GPP) 103 267TIA IS 127 RCELP 101TIA IS 54 VSELP 95 100Time-domain aliasing cancellation

(TDAC) 164

Time-frequency resolution 149 227 327Tonal adaptive resolution codec (TARCO)

232Tonal and noise maskers 131 136Tone-masking-noise 124Toeplitz 95Total harmonic distortion 383Training vectors 62ndash69Tree-structured filterbanks 157Tree-structured QMF 38 156 320Turbo codes 374Twin VQ 69 207 293 296

Uniform bit-allocation 70 72Uniform step size 55 57United States (US) digital audio radio

(DAR) 284Unvoiced speech 93Up-sampling 35

Variance 10 40Vector PCM (VPCM) 62Vector quantization (VQ) 62ndash69VSELP coder 95 100Video home system (VHS) 274

Vocal tract 29 92 233Vocal tract envelope 96Voice activity detection 104Voice-over-Internet-protocol (VoIP) 7

304

Warped linear prediction 105Watermark embedding detection 376Watermark encoding 375Wavelet basis 214 219 233Wavelet coder 220 229Wavelet packet 155 212 215Wavelet packet transform (WPT) 214Wavwrite 11Weighted-mean-square error 97 233White noise 25 41Wideband CDMA (WCDMA) 103Widesense stationary 40Window shape adaptation 283 305Window switching 173 182 203 320

328

z transform 10 20Zwickerrsquos loudness 394

                                                                                                          • 79 DCT with Vector Quantization
                                                                                                          • 710 MDCT with Vector Quantization
                                                                                                          • 711 Summary
                                                                                                          • Problems
                                                                                                          • Computer Exercises
                                                                                                            • 8 SUBBAND CODERS
                                                                                                              • 81 Introduction
                                                                                                                • 811 Subband Algorithms
                                                                                                                  • 82 DWT and Discrete Wavelet Packet Transform (DWPT)
                                                                                                                  • 83 Adapted WP Algorithms
                                                                                                                    • 831 DWPT Coder with Globally Adapted Daubechies Analysis Wavelet
                                                                                                                    • 832 Scalable DWPT Coder with Adaptive Tree Structure
                                                                                                                    • 833 DWPT Coder with Globally Adapted General Analysis Wavelet
                                                                                                                    • 834 DWPT Coder with Adaptive Tree Structure and Locally Adapted Analysis Wavelet
                                                                                                                    • 835 DWPT Coder with Perceptually Optimized Synthesis Wavelets
                                                                                                                      • 84 Adapted Nonuniform Filter Banks
                                                                                                                        • 841 Switched Nonuniform Filter Bank Cascade
                                                                                                                        • 842 Frequency-Varying Modulated Lapped Transforms
                                                                                                                          • 85 Hybrid WP and Adapted WPSinusoidal Algorithms
                                                                                                                            • 851 Hybrid SinusoidalClassical DWPT Coder
                                                                                                                            • 852 Hybrid SinusoidalM-band DWPT Coder
                                                                                                                            • 853 Hybrid SinusoidalDWPT Coder with WP Tree Structure Adaptation (ARCO)
                                                                                                                              • 86 Subband Coding with Hybrid Filter BankCELP Algorithms
                                                                                                                                • 861 Hybrid SubbandCELP Algorithm for Low-Delay Applications
                                                                                                                                • 862 Hybrid SubbandCELP Algorithm for Low-Complexity Applications
                                                                                                                                  • 87 Subband Coding with IIR Filter Banks
                                                                                                                                  • Problems
                                                                                                                                  • Computer Exercise
                                                                                                                                    • 9 SINUSOIDAL CODERS
                                                                                                                                      • 91 Introduction
                                                                                                                                      • 92 The Sinusoidal Model
                                                                                                                                        • 921 Sinusoidal Analysis and Parameter Tracking
                                                                                                                                        • 922 Sinusoidal Synthesis and Parameter Interpolation
                                                                                                                                          • 93 AnalysisSynthesis Audio Codec (ASAC)
                                                                                                                                            • 931 ASAC Segmentation
                                                                                                                                            • 932 ASAC Sinusoidal Analysis-by-Synthesis
                                                                                                                                            • 933 ASAC Bit Allocation Quantization Encoding and Scalability
                                                                                                                                              • 94 Harmonic and Individual Lines Plus Noise Coder (HILN)
                                                                                                                                                • 941 HILN Sinusoidal Analysis-by-Synthesis
                                                                                                                                                • 942 HILN Bit Allocation Quantization Encoding and Decoding
                                                                                                                                                  • 95 FM Synthesis
                                                                                                                                                    • 951 Principles of FM Synthesis
                                                                                                                                                    • 952 Perceptual Audio Coding Using an FM Synthesis Model
                                                                                                                                                      • 96 The Sines + Transients + Noise (STN) Model
                                                                                                                                                      • 97 Hybrid Sinusoidal Coders
                                                                                                                                                        • 971 Hybrid Sinusoidal-MDCT Algorithm
                                                                                                                                                        • 972 Hybrid Sinusoidal-Vocoder Algorithm
                                                                                                                                                          • 98 Summary
                                                                                                                                                          • Problems
                                                                                                                                                          • Computer Exercises
                                                                                                                                                            • 10 AUDIO CODING STANDARDS AND ALGORITHMS
                                                                                                                                                              • 101 Introduction
                                                                                                                                                              • 102 MIDI Versus Digital Audio
                                                                                                                                                                • 1021 MIDI Synthesizer
                                                                                                                                                                • 1022 General MIDI (GM)
                                                                                                                                                                • 1023 MIDI Applications
                                                                                                                                                                  • 103 Multichannel Surround Sound
                                                                                                                                                                    • 1031 The Evolution of Surround Sound
                                                                                                                                                                    • 1032 The Mono the Stereo and the Surround Sound Formats
                                                                                                                                                                    • 1033 The ITU-R BS775 51-Channel Configuration
                                                                                                                                                                      • 104 MPEG Audio Standards
                                                                                                                                                                        • 1041 MPEG-1 Audio (ISOIEC 11172-3)
                                                                                                                                                                        • 1042 MPEG-2 BCLSF (ISOIEC-13818-3)
                                                                                                                                                                        • 1043 MPEG-2 NBCAAC (ISOIEC-13818-7)
                                                                                                                                                                        • 1044 MPEG-4 Audio (ISOIEC 14496-3)
                                                                                                                                                                        • 1045 MPEG-7 Audio (ISOIEC 15938-4)
                                                                                                                                                                        • 1046 MPEG-21 Framework (ISOIEC-21000)
                                                                                                                                                                        • 1047 MPEG Surround and Spatial Audio Coding
                                                                                                                                                                          • 105 Adaptive Transform Acoustic Coding (ATRAC)
                                                                                                                                                                          • 106 Lucent Technologies PAC EPAC and MPAC
                                                                                                                                                                            • 1061 Perceptual Audio Coder (PAC)
                                                                                                                                                                            • 1062 Enhanced PAC (EPAC)
                                                                                                                                                                            • 1063 Multichannel PAC (MPAC)
                                                                                                                                                                              • 107 Dolby Audio Coding Standards
                                                                                                                                                                                • 1071 Dolby AC-2 AC-2A
                                                                                                                                                                                • 1072 Dolby AC-3Dolby DigitalDolby SR middot D
                                                                                                                                                                                  • 108 Audio Processing Technology APT-x100
                                                                                                                                                                                  • 109 DTS ndash Coherent Acoustics
                                                                                                                                                                                    • 1091 Framing and Subband Analysis
                                                                                                                                                                                    • 1092 Psychoacoustic Analysis
                                                                                                                                                                                    • 1093 ADPCM ndash Differential Subband Coding
                                                                                                                                                                                    • 1094 Bit Allocation Quantization and Multiplexing
                                                                                                                                                                                    • 1095 DTS-CA Versus Dolby Digital
                                                                                                                                                                                      • Problems
                                                                                                                                                                                      • Computer Exercise
                                                                                                                                                                                        • 11 LOSSLESS AUDIO CODING AND DIGITAL WATERMARKING
                                                                                                                                                                                          • 111 Introduction
                                                                                                                                                                                          • 112 Lossless Audio Coding (L(2)AC)
                                                                                                                                                                                            • 1121 L(2)AC Principles
                                                                                                                                                                                            • 1122 L(2)AC Algorithms
                                                                                                                                                                                              • 113 DVD-Audio
                                                                                                                                                                                                • 1131 Meridian Lossless Packing (MLP)
                                                                                                                                                                                                  • 114 Super-Audio CD (SACD)
                                                                                                                                                                                                    • 1141 SACD Storage Format
                                                                                                                                                                                                    • 1142 Sigma-Delta Modulators (SDM)
                                                                                                                                                                                                    • 1143 Direct Stream Digital (DSD) Encoding
                                                                                                                                                                                                      • 115 Digital Audio Watermarking
                                                                                                                                                                                                        • 1151 Background
                                                                                                                                                                                                        • 1152 A Generic Architecture for DAW
                                                                                                                                                                                                        • 1153 DAW Schemes ndash Attributes
                                                                                                                                                                                                          • 116 Summary of Commercial Applications
                                                                                                                                                                                                          • Problems
                                                                                                                                                                                                          • Computer Exercise
                                                                                                                                                                                                            • 12 QUALITY MEASURES FOR PERCEPTUAL AUDIO CODING
                                                                                                                                                                                                              • 121 Introduction
                                                                                                                                                                                                              • 122 Subjective Quality Measures
                                                                                                                                                                                                              • 123 Confounding Factors in Subjective Evaluations
                                                                                                                                                                                                              • 124 Subjective Evaluations of Two-Channel Standardized Codecs
                                                                                                                                                                                                              • 125 Subjective Evaluations of 51-Channel Standardized Codecs
                                                                                                                                                                                                              • 126 Subjective Evaluations Using Perceptual Measurement Systems
                                                                                                                                                                                                                • 1261 CIR Perceptual Measurement Schemes
                                                                                                                                                                                                                • 1262 NSE Perceptual Measurement Schemes
                                                                                                                                                                                                                  • 127 Algorithms for Perceptual Measurement
                                                                                                                                                                                                                    • 1271 Example Perceptual Audio Quality Measure (PAQM)
                                                                                                                                                                                                                    • 1272 Example Noise-to-Mask Ratio (NMR)
                                                                                                                                                                                                                    • 1273 Example Objective Audio Signal Evaluation (OASE)
                                                                                                                                                                                                                      • 128 ITU-R BS1387 and ITU-T P861 Standards for Perceptual Quality Measurement
                                                                                                                                                                                                                      • 129 Research Directions for Perceptual Codec Quality Measures
                                                                                                                                                                                                                        • REFERENCES
                                                                                                                                                                                                                        • INDEX

To
Photini, John, and Louis
Lizzy, Katie, and Lee
Srinivasan, Sudha, Kavitha, Satish, and Ammu

CONTENTS

PREFACE xv

1 INTRODUCTION 1
1.1 Historical Perspective 1
1.2 A General Perceptual Audio Coding Architecture 4
1.3 Audio Coder Attributes 5
1.3.1 Audio Quality 6
1.3.2 Bit Rates 6
1.3.3 Complexity 6
1.3.4 Codec Delay 7
1.3.5 Error Robustness 7
1.4 Types of Audio Coders – An Overview 7
1.5 Organization of the Book 8
1.6 Notational Conventions 9
Problems 11
Computer Exercises 11

2 SIGNAL PROCESSING ESSENTIALS 13
2.1 Introduction 13
2.2 Spectra of Analog Signals 13
2.3 Review of Convolution and Filtering 16
2.4 Uniform Sampling 17
2.5 Discrete-Time Signal Processing 20
2.5.1 Transforms for Discrete-Time Signals 20
2.5.2 The Discrete and the Fast Fourier Transform 22
2.5.3 The Discrete Cosine Transform 23
2.5.4 The Short-Time Fourier Transform 23
2.6 Difference Equations and Digital Filters 25
2.7 The Transfer and the Frequency Response Functions 27
2.7.1 Poles, Zeros, and Frequency Response 29
2.7.2 Examples of Digital Filters for Audio Applications 30
2.8 Review of Multirate Signal Processing 33
2.8.1 Down-sampling by an Integer 33
2.8.2 Up-sampling by an Integer 35
2.8.3 Sampling Rate Changes by Noninteger Factors 36
2.8.4 Quadrature Mirror Filter Banks 36
2.9 Discrete-Time Random Signals 39
2.9.1 Random Signals Processed by LTI Digital Filters 42
2.9.2 Autocorrelation Estimation from Finite-Length Data 44
2.10 Summary 44
Problems 45
Computer Exercises 47

3 QUANTIZATION AND ENTROPY CODING 51
3.1 Introduction 51
3.1.1 The Quantization–Bit Allocation–Entropy Coding Module 52
3.2 Density Functions and Quantization 53
3.3 Scalar Quantization 54
3.3.1 Uniform Quantization 54
3.3.2 Nonuniform Quantization 57
3.3.3 Differential PCM 59
3.4 Vector Quantization 62
3.4.1 Structured VQ 64
3.4.2 Split-VQ 67
3.4.3 Conjugate-Structure VQ 69
3.5 Bit-Allocation Algorithms 70
3.6 Entropy Coding 74
3.6.1 Huffman Coding 77
3.6.2 Rice Coding 81
3.6.3 Golomb Coding 82
3.6.4 Arithmetic Coding 83
3.7 Summary 85
Problems 85
Computer Exercises 86

4 LINEAR PREDICTION IN NARROWBAND AND WIDEBAND CODING 91
4.1 Introduction 91
4.2 LP-Based Source-System Modeling for Speech 92
4.3 Short-Term Linear Prediction 94
4.3.1 Long-Term Prediction 95
4.3.2 ADPCM Using Linear Prediction 96
4.4 Open-Loop Analysis-Synthesis Linear Prediction 96
4.5 Analysis-by-Synthesis Linear Prediction 97
4.5.1 Code-Excited Linear Prediction Algorithms 100
4.6 Linear Prediction in Wideband Coding 102
4.6.1 Wideband Speech Coding 102
4.6.2 Wideband Audio Coding 104
4.7 Summary 106
Problems 107
Computer Exercises 108

5 PSYCHOACOUSTIC PRINCIPLES 113
5.1 Introduction 113
5.2 Absolute Threshold of Hearing 114
5.3 Critical Bands 115
5.4 Simultaneous Masking, Masking Asymmetry, and the Spread of Masking 120
5.4.1 Noise-Masking-Tone 123
5.4.2 Tone-Masking-Noise 124
5.4.3 Noise-Masking-Noise 124
5.4.4 Asymmetry of Masking 124
5.4.5 The Spread of Masking 125
5.5 Nonsimultaneous Masking 127
5.6 Perceptual Entropy 128
5.7 Example Codec Perceptual Model: ISO/IEC 11172-3 (MPEG-1) Psychoacoustic Model 1 130
5.7.1 Step 1: Spectral Analysis and SPL Normalization 131
5.7.2 Step 2: Identification of Tonal and Noise Maskers 131
5.7.3 Step 3: Decimation and Reorganization of Maskers 135
5.7.4 Step 4: Calculation of Individual Masking Thresholds 136
5.7.5 Step 5: Calculation of Global Masking Thresholds 138
5.8 Perceptual Bit Allocation 138
5.9 Summary 140
Problems 140
Computer Exercises 141

6 TIME-FREQUENCY ANALYSIS: FILTER BANKS AND TRANSFORMS 145
6.1 Introduction 145
6.2 Analysis-Synthesis Framework for M-band Filter Banks 146
6.3 Filter Banks for Audio Coding: Design Considerations 148
6.3.1 The Role of Time-Frequency Resolution in Masking Power Estimation 149
6.3.2 The Role of Frequency Resolution in Perceptual Bit Allocation 149
6.3.3 The Role of Time Resolution in Perceptual Bit Allocation 150
6.4 Quadrature Mirror and Conjugate Quadrature Filters 155
6.5 Tree-Structured QMF and CQF M-band Banks 156
6.6 Cosine Modulated "Pseudo QMF" M-band Banks 160
6.7 Cosine Modulated Perfect Reconstruction (PR) M-band Banks and the Modified Discrete Cosine Transform (MDCT) 163
6.7.1 Forward and Inverse MDCT 165
6.7.2 MDCT Window Design 165
6.7.3 Example MDCT Windows (Prototype FIR Filters) 167
6.8 Discrete Fourier and Discrete Cosine Transform 178
6.9 Pre-echo Distortion 180
6.10 Pre-echo Control Strategies 182
6.10.1 Bit Reservoir 182
6.10.2 Window Switching 182
6.10.3 Hybrid, Switched Filter Banks 184
6.10.4 Gain Modification 185
6.10.5 Temporal Noise Shaping 185
6.11 Summary 186
Problems 188
Computer Exercises 191

7 TRANSFORM CODERS 195
7.1 Introduction 195
7.2 Optimum Coding in the Frequency Domain 196
7.3 Perceptual Transform Coder 197
7.3.1 PXFM 198
7.3.2 SEPXFM 199
7.4 Brandenburg-Johnston Hybrid Coder 200
7.5 CNET Coders 201
7.5.1 CNET DFT Coder 201
7.5.2 CNET MDCT Coder 1 201
7.5.3 CNET MDCT Coder 2 202
7.6 Adaptive Spectral Entropy Coding 203
7.7 Differential Perceptual Audio Coder 204
7.8 DFT Noise Substitution 205
7.9 DCT with Vector Quantization 206
7.10 MDCT with Vector Quantization 207
7.11 Summary 208
Problems 208
Computer Exercises 210

8 SUBBAND CODERS 211
8.1 Introduction 211
8.1.1 Subband Algorithms 212
8.2 DWT and Discrete Wavelet Packet Transform (DWPT) 214
8.3 Adapted WP Algorithms 218
8.3.1 DWPT Coder with Globally Adapted Daubechies Analysis Wavelet 218
8.3.2 Scalable DWPT Coder with Adaptive Tree Structure 220
8.3.3 DWPT Coder with Globally Adapted General Analysis Wavelet 223
8.3.4 DWPT Coder with Adaptive Tree Structure and Locally Adapted Analysis Wavelet 223
8.3.5 DWPT Coder with Perceptually Optimized Synthesis Wavelets 224
8.4 Adapted Nonuniform Filter Banks 226
8.4.1 Switched Nonuniform Filter Bank Cascade 226
8.4.2 Frequency-Varying Modulated Lapped Transforms 227
8.5 Hybrid WP and Adapted WP/Sinusoidal Algorithms 227
8.5.1 Hybrid Sinusoidal/Classical DWPT Coder 228
8.5.2 Hybrid Sinusoidal/M-band DWPT Coder 229
8.5.3 Hybrid Sinusoidal/DWPT Coder with WP Tree Structure Adaptation (ARCO) 230
8.6 Subband Coding with Hybrid Filter Bank/CELP Algorithms 233
8.6.1 Hybrid Subband/CELP Algorithm for Low-Delay Applications 234
8.6.2 Hybrid Subband/CELP Algorithm for Low-Complexity Applications 235
8.7 Subband Coding with IIR Filter Banks 237
Problems 237
Computer Exercise 240

9 SINUSOIDAL CODERS 241
9.1 Introduction 241
9.2 The Sinusoidal Model 242
9.2.1 Sinusoidal Analysis and Parameter Tracking 242
9.2.2 Sinusoidal Synthesis and Parameter Interpolation 245
9.3 Analysis/Synthesis Audio Codec (ASAC) 247
9.3.1 ASAC Segmentation 248
9.3.2 ASAC Sinusoidal Analysis-by-Synthesis 248
9.3.3 ASAC Bit Allocation, Quantization, Encoding, and Scalability 248
9.4 Harmonic and Individual Lines Plus Noise Coder (HILN) 249
9.4.1 HILN Sinusoidal Analysis-by-Synthesis 250
9.4.2 HILN Bit Allocation, Quantization, Encoding, and Decoding 251
9.5 FM Synthesis 251
9.5.1 Principles of FM Synthesis 252
9.5.2 Perceptual Audio Coding Using an FM Synthesis Model 252
9.6 The Sines + Transients + Noise (STN) Model 254
9.7 Hybrid Sinusoidal Coders 255
9.7.1 Hybrid Sinusoidal-MDCT Algorithm 256
9.7.2 Hybrid Sinusoidal-Vocoder Algorithm 257
9.8 Summary 258
Problems 258
Computer Exercises 259

10 AUDIO CODING STANDARDS AND ALGORITHMS 263
10.1 Introduction 263
10.2 MIDI Versus Digital Audio 264
10.2.1 MIDI Synthesizer 264
10.2.2 General MIDI (GM) 266
10.2.3 MIDI Applications 266
10.3 Multichannel Surround Sound 267
10.3.1 The Evolution of Surround Sound 267
10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268
10.3.3 The ITU-R BS.775 5.1-Channel Configuration 268
10.4 MPEG Audio Standards 270
10.4.1 MPEG-1 Audio (ISO/IEC 11172-3) 275
10.4.2 MPEG-2 BC/LSF (ISO/IEC 13818-3) 279
10.4.3 MPEG-2 NBC/AAC (ISO/IEC 13818-7) 283
10.4.4 MPEG-4 Audio (ISO/IEC 14496-3) 289
10.4.5 MPEG-7 Audio (ISO/IEC 15938-4) 309
10.4.6 MPEG-21 Framework (ISO/IEC 21000) 317
10.4.7 MPEG Surround and Spatial Audio Coding 319
10.5 Adaptive Transform Acoustic Coding (ATRAC) 319
10.6 Lucent Technologies PAC, EPAC, and MPAC 321
10.6.1 Perceptual Audio Coder (PAC) 321
10.6.2 Enhanced PAC (EPAC) 323
10.6.3 Multichannel PAC (MPAC) 323
10.7 Dolby Audio Coding Standards 325
10.7.1 Dolby AC-2, AC-2A 325
10.7.2 Dolby AC-3/Dolby Digital/Dolby SR·D 327
10.8 Audio Processing Technology APT-x100 335
10.9 DTS – Coherent Acoustics 338
10.9.1 Framing and Subband Analysis 338
10.9.2 Psychoacoustic Analysis 339
10.9.3 ADPCM – Differential Subband Coding 339
10.9.4 Bit Allocation, Quantization, and Multiplexing 341
10.9.5 DTS-CA Versus Dolby Digital 342
Problems 342
Computer Exercise 342

11 LOSSLESS AUDIO CODING AND DIGITAL WATERMARKING 343
11.1 Introduction 343
11.2 Lossless Audio Coding (L²AC) 344
11.2.1 L²AC Principles 345
11.2.2 L²AC Algorithms 346
11.3 DVD-Audio 356
11.3.1 Meridian Lossless Packing (MLP) 358
11.4 Super-Audio CD (SACD) 358
11.4.1 SACD Storage Format 362
11.4.2 Sigma-Delta Modulators (SDM) 362
11.4.3 Direct Stream Digital (DSD) Encoding 364
11.5 Digital Audio Watermarking 368
11.5.1 Background 370
11.5.2 A Generic Architecture for DAW 374
11.5.3 DAW Schemes – Attributes 377
11.6 Summary of Commercial Applications 378
Problems 382
Computer Exercise 382

12 QUALITY MEASURES FOR PERCEPTUAL AUDIO CODING 383
12.1 Introduction 383
12.2 Subjective Quality Measures 384
12.3 Confounding Factors in Subjective Evaluations 386
12.4 Subjective Evaluations of Two-Channel Standardized Codecs 387
12.5 Subjective Evaluations of 5.1-Channel Standardized Codecs 388
12.6 Subjective Evaluations Using Perceptual Measurement Systems 389
12.6.1 CIR Perceptual Measurement Schemes 390
12.6.2 NSE Perceptual Measurement Schemes 390
12.7 Algorithms for Perceptual Measurement 391
12.7.1 Example: Perceptual Audio Quality Measure (PAQM) 392
12.7.2 Example: Noise-to-Mask Ratio (NMR) 396
12.7.3 Example: Objective Audio Signal Evaluation (OASE) 399
12.8 ITU-R BS.1387 and ITU-T P.861 Standards for Perceptual Quality Measurement 401
12.9 Research Directions for Perceptual Codec Quality Measures 402

REFERENCES 405

INDEX 459

PREFACE

Audio processing and recording has been part of telecommunication and entertainment systems for more than a century. Moreover, bandwidth issues associated with audio recording, transmission, and storage occupied engineers from the very early stages in this field. A series of important technological developments paved the way from early phonographs to magnetic tape recording and, lately, compact disk (CD) and super storage devices. In the following, we capture some of the main events and milestones that mark the history in audio recording and storage.¹

Prototypes of phonographs appeared around 1877, and the first attempt to market cylinder-based gramophones was by the Columbia Phonograph Co. in 1889. Five years later, Marconi demonstrated the first radio transmission, which marked the beginning of audio broadcasting. The Victor Talking Machine Company, with the little nipper dog as its trademark, was formed in 1901. The "telegraphone," a magnetic recorder for voice that used steel wire, was patented in Denmark around the end of the nineteenth century. The Odeon and His Master's Voice (HMV) labels produced and marketed music recordings in the early nineteen hundreds. The cabinet phonograph with a horn, called the "Victrola," appeared at about the same time. Diamond disk players were marketed in 1913, followed by efforts to produce sound-on-film for motion pictures. Other milestones include the first commercial radio transmission in Pittsburgh and the emergence of public address amplifiers. Electrically recorded material appeared in the 1920s, and the first sound-on-film was demonstrated in the mid 1920s by Warner Brothers. Cinema applications in the 1930s promoted advances in loudspeaker technologies, leading to the development of woofer, tweeter, and crossover network concepts. Juke boxes for music also appeared in the 1930s. Magnetic tape recording was demonstrated in Germany in the 1930s by BASF and AEG/Telefunken. The Ampex tape recorders appeared in the US in the late 1940s. The demonstration of stereo high-fidelity (Hi-Fi) sound in the late 1940s spurred the development of amplifiers, speakers, and reel-to-reel tape recorders for home use in the 1950s, both in Europe and the US.


[Figure: Apple iPod (Courtesy of Apple Computer, Inc.). Apple iPod is a registered trademark of Apple Computer, Inc.]

Meanwhile, Columbia produced the 33-rpm long play (LP) vinyl record, while its rival RCA Victor produced the compact 45-rpm format, whose sales took off with the emergence of rock and roll music. Technological developments in the mid 1950s resulted in the emergence of compact transistor-based radios and, soon after, small tape players. In 1963, Philips introduced the compact cassette tape format with its EL3300 series portable players (marketed in the US as Norelco), which became an instant success with accessories for home, portable, and car use. Eight-track cassettes became popular in the late 1960s, mainly for car use. The Dolby system for compact cassette noise reduction was also a landmark in the audio signal processing field. Meanwhile, FM broadcasting, which had been invented earlier, took off in the 1960s and 1970s with stereo transmissions. Helical tape-head technologies, invented in Japan in the 1960s, provided high-bandwidth recording capabilities, which enabled video tape recorders for home use in the 1970s (e.g., VHS and Beta formats). This technology was also used in the 1980s for audio PCM stereo recording. Laser compact disk technology was introduced in 1982 and by the late 1980s became the preferred format for Hi-Fi stereo recording. Analog compact cassette players, high-quality reel-to-reel recorders, expensive turntables, and virtually all analog recording devices started fading away by the late 1980s.


The launch of the digital CD audio format in the 1980s coincided with the advent of personal computers and took over in all aspects of music recording and distribution. CD playback soon dominated broadcasting, automobile, and home stereo applications, displacing the analog vinyl LP. The compact cassette formats became relics of an old era and eventually disappeared from music stores. Digital audio tape (DAT) systems, enabled by helical tape-head technology, were also introduced in the 1980s, but were commercially unsuccessful because of strict copyright laws and unusually large taxes.

Parallel developments in digital video formats for laser disk technologies included work in audio compression systems. Audio compression research papers started appearing mostly in the 1980s at IEEE ICASSP and Audio Engineering Society conferences, by authors from several research and development labs including Erlangen-Nuremberg University and Fraunhofer IIS, AT&T Bell Laboratories, and Dolby Laboratories. Audio compression or audio coding research, the art of representing an audio signal with the least number of information bits while maintaining its fidelity, went through quantum leaps in the late 1980s and 1990s. Although originally most audio compression algorithms were developed as part of the digital motion video compression standards, e.g., the MPEG series, these algorithms eventually became important as stand-alone technologies for audio recording and playback. Progress in VLSI technologies, psychoacoustics, and efficient time-frequency signal representations made possible a series of scalable real-time compression algorithms for use in audio and cinema applications. In the 1990s, we witnessed the emergence of the first products that used compressed audio formats, such as the MiniDisc (MD) and the Digital Compact Cassette (DCC). The sound and video playing capabilities of the PC and the proliferation of multimedia content through the Internet had a profound impact on audio compression technologies. The MPEG-1/-2 Layer III (MP3) algorithm became a de facto standard for Internet music downloads. Specialized web sites that feature music content changed the ways people buy and share music. Compact MP3 players appeared in the late 1990s. In the early 2000s, we had the emergence of the Apple iPod player with a hard drive that supports MP3 and MPEG advanced audio coding (AAC) algorithms.

In order to enhance cinematic and home theater listening experiences and deliver greater realism than ever before, audio codec designers pursued sophisticated multichannel audio coding techniques. In the mid 1990s, techniques for encoding 5.1 separate channels of audio were standardized in MPEG-2 BC and later MPEG-2 AAC audio. Proprietary multichannel algorithms were also developed and commercialized by Dolby Laboratories (AC-3), Digital Theater Systems (DTS), Lucent (EPAC), Sony (SDDS), and Microsoft (WMA). Dolby Labs, DTS, Lexicon, and other companies also introduced 2:N channel upmix algorithms capable of synthesizing multichannel surround presentation from conventional stereo content (e.g., Dolby Pro Logic II, DTS Neo:6). The human auditory system is capable of localizing sound with greater spatial resolution than current multichannel audio systems offer, and as a result the quest continues to achieve the ultimate spatial fidelity in sound reproduction.


Research involving spatial audio, real-time acoustic source localization, binaural cue coding, and application of head-related transfer functions (HRTF) towards rendering immersive audio has gained interest. Audiophiles appeared skeptical with the 44.1-kHz 16-bit CD stereo format, and some were critical of the sound quality of compression formats. These ideas, along with the need for copyright protection, eventually gained momentum, and new standards and formats appeared in the early 2000s. In particular, multichannel lossless coding formats such as the DVD-Audio (DVD-A) and the Super-Audio-CD (SACD) appeared. The standardization of these storage formats provided the audio codec designers with enormous storage capacity. This motivated lossless coding of digital audio.

The purpose of this book is to provide an in-depth treatment of audio compression algorithms and standards. The topic is currently occupying several communities in signal processing, multimedia, and audio engineering. The intended readership for this book includes at least three groups. At the highest level, any reader with a general scientific background will be able to gain an appreciation for the heuristics of perceptual coding. Secondly, readers with a general electrical and computer engineering background will become familiar with the essential signal processing techniques and perceptual models embedded in most audio coders. Finally, undergraduate and graduate students with focuses in multimedia, DSP, and computer music will gain important knowledge in signal analysis and audio coding algorithms. The vast body of literature provided and the tutorial aspects of the book make it an asset for audiophiles as well.

Organization

This book is in part the outcome of many years of research and teaching at Arizona State University. We opted to include exercises and computer problems and hence enable instructors to either use the content in existing DSP and multimedia courses or to promote the creation of new courses with focus in audio and speech processing and coding. The book has twelve chapters, and each chapter contains problems, proofs, and computer exercises. Chapter 1 introduces the readers to the field of audio signal processing and coding. In Chapter 2, we review the basic signal processing theory and emphasize concepts relevant to audio coding. Chapter 3 describes waveform quantization and entropy coding schemes. Chapter 4 covers linear predictive coding and its utility in speech and audio coding. Chapter 5 covers psychoacoustics, and Chapter 6 explores filter bank design. Chapter 7 describes transform coding methodologies. Subband and sinusoidal coding algorithms are addressed in Chapters 8 and 9, respectively. Chapter 10 reviews several audio coding standards, including the ISO/IEC MPEG family, the cinematic Sony SDDS, the Dolby AC-3, and the DTS coherent acoustics (DTS-CA). Chapter 11 focuses on lossless audio coding and digital audio watermarking techniques. Chapter 12 provides information on subjective quality measures.

Use in Courses

For an undergraduate elective course with little or no background in DSP, the instructor can cover in detail Chapters 1, 2, 3, 4, and 5,


then present select sections of Chapter 6, and describe in an expository and qualitative manner certain basic algorithms and standards from Chapters 7-11. A graduate class in audio coding with students that have background in DSP can start from Chapter 5 and cover in detail Chapters 6 through 11. Audio coding practitioners and researchers that are interested mostly in qualitative descriptions of the standards and information on bibliography can start at Chapter 5 and proceed reading through Chapter 11.

Trademarks and Copyrights

Sony Dynamic Digital Sound, SDDS, ATRAC, and MiniDisc are trademarks of Sony Corporation. Dolby, Dolby Digital, AC-2, AC-3, DolbyFAX, and Dolby ProLogic are trademarks of Dolby Laboratories. The perceptual audio coder (PAC), EPAC, and MPAC are trademarks of AT&T and Lucent Technologies. The APT-x100 is a trademark of Audio Processing Technology, Inc. The DTS-CA is a trademark of Digital Theater Systems, Inc. Apple iPod is a registered trademark of Apple Computer, Inc.

Acknowledgments

The authors have all spent time at Arizona State University (ASU), and Prof. Spanias is in fact still teaching and directing research in this area at ASU. The group of authors has worked on grants with Intel Corporation and would like to thank this organization for providing grants in scalable speech and audio coding that created opportunities for in-depth studies in these areas. Special thanks to our colleagues in Intel Corporation at that time, including Brian Mears, Gopal Nair, Hedayat Daie, Mark Walker, Michael Deisher, and Tom Gardos. We also wish to acknowledge the support of current Intel colleagues Gang Liang, Mike Rosenzweig, and Jim Zhou, as well as Scott Peirce for proofreading some of the material. Thanks also to former doctoral students at ASU, including Philip Loizou and Sassan Ahmadi, for many useful discussions in speech and audio processing. We appreciate also discussions on narrowband vocoders with Bruce Fette, in the late 1990s then with Motorola GEG and now with General Dynamics.

The authors also acknowledge the National Science Foundation (NSF) CCLI for grants in education that supported in part the preparation of several computer examples and paradigms in psychoacoustics and signal coding. Also, some of the early work in coding of Dr. Spanias was supported by the Naval Research Laboratories (NRL), and we would like to thank that organization for providing ideas for projects that inspired future work in this area. We also wish to thank ASU and some of the faculty and administrators that provided moral and material support for work in this area. Thanks are extended to current ASU students Shibani Misra, Visar Berisha, and Mahesh Banavar for proofreading some of the material. We thank the Wiley Interscience production team, George Telecki, Melissa Yanuzzi, and Rachel Witmer, for their diligent efforts in copyediting, cover design, and typesetting.


We also thank all the anonymous reviewers for their useful comments. Finally, we all wish to express our thanks to our families for their support.

The book content is used frequently in ASU online courses and industry short courses offered by Andreas Spanias. Contact Andreas Spanias (spanias@asu.edu, http://www.fulton.asu.edu/~spanias) for details.

¹ Resources used for obtaining important dates in recording history include web sites at the University of San Diego, Arizona State University, and Wikipedia.

CHAPTER 1

INTRODUCTION

Audio coding or audio compression algorithms are used to obtain compact digital representations of high-fidelity (wideband) audio signals for the purpose of efficient transmission or storage. The central objective in audio coding is to represent the signal with a minimum number of bits while achieving transparent signal reproduction, i.e., generating output audio that cannot be distinguished from the original input, even by a sensitive listener ("golden ears"). This text gives an in-depth treatment of algorithms and standards for transparent coding of high-fidelity audio.

1.1 HISTORICAL PERSPECTIVE

The introduction of the compact disc (CD) in the early 1980s brought to the fore all of the advantages of digital audio representation, including true high fidelity, dynamic range, and robustness. These advantages, however, came at the expense of high data rates. Conventional CD and digital audio tape (DAT) systems are typically sampled at either 44.1 or 48 kHz using pulse code modulation (PCM) with a 16-bit sample resolution. This results in uncompressed data rates of 705.6/768 kb/s for a monaural channel, or 1.41/1.54 Mb/s for a stereo-pair. Although these data rates were accommodated successfully in first-generation CD and DAT players, second-generation audio players and wirelessly connected systems are often subject to bandwidth constraints that are incompatible with high data rates.
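
The arithmetic behind these figures is simply data rate = sampling frequency x sample resolution x number of channels. A minimal Python sketch (illustrative only, not part of the original text) reproduces the quoted values:

    # Uncompressed PCM data rate: rate = fs * bits_per_sample * channels.
    def pcm_rate_kbps(fs_hz, bits, channels):
        """Return the uncompressed PCM data rate in kbit/s."""
        return fs_hz * bits * channels / 1000.0

    if __name__ == "__main__":
        for fs in (44100, 48000):
            mono = pcm_rate_kbps(fs, 16, 1)
            stereo = pcm_rate_kbps(fs, 16, 2)
            print(f"{fs / 1000:.1f} kHz: {mono:.1f} kb/s (mono), {stereo / 1000:.2f} Mb/s (stereo)")
        # Expected output:
        # 44.1 kHz: 705.6 kb/s (mono), 1.41 Mb/s (stereo)
        # 48.0 kHz: 768.0 kb/s (mono), 1.54 Mb/s (stereo)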



Because of the success enjoyed by the first-generation systems, however, end users have come to expect "CD-quality" audio reproduction from any digital system. Therefore, new network and wireless multimedia digital audio systems must reduce data rates without compromising reproduction quality. Motivated by the need for compression algorithms that can satisfy simultaneously the conflicting demands of high compression ratios and transparent quality for high-fidelity audio signals, several coding methodologies have been established over the last two decades. Audio compression schemes in general employ design techniques that exploit both perceptual irrelevancies and statistical redundancies.

PCM was the primary audio encoding scheme employed until the early 1980s. PCM does not provide any mechanisms for redundancy removal. Quantization methods that exploit the signal correlation, such as differential PCM (DPCM), delta modulation [Jaya76] [Jaya84], and adaptive DPCM (ADPCM), were applied to audio compression later (e.g., PC audio cards). Owing to the need for drastic reduction in bit rates, researchers began to pursue new approaches for audio coding based on the principles of psychoacoustics [Zwic90] [Moor03]. Psychoacoustic notions in conjunction with the basic properties of signal quantization have led to the theory of perceptual entropy [John88a] [John88b]. Perceptual entropy is a quantitative estimate of the fundamental limit of transparent audio signal compression. Another key contribution to the field was the characterization of the auditory filter bank and particularly the time-frequency analysis capabilities of the inner ear [Moor83]. Over the years, several filter-bank structures that mimic the critical band structure of the auditory filter bank have been proposed. A filter bank is a parallel bank of bandpass filters covering the audio spectrum, which, when used in conjunction with a perceptual model, can play an important role in the identification of perceptual irrelevancies.
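
To make the idea of redundancy removal concrete, the following minimal Python sketch (an illustrative toy example, not one of the book's algorithms) implements a first-order DPCM loop: it quantizes the prediction residual x[n] - xhat[n-1] instead of the samples themselves, so for a highly correlated signal the quantizer indices span a far smaller range than direct PCM would require at the same step size:

    import numpy as np

    def dpcm_encode(x, step):
        """Quantize the first-order prediction residual e[n] = x[n] - xhat[n-1]."""
        codes = np.empty(len(x), dtype=np.int64)
        prev = 0.0                      # previous *reconstructed* sample (predictor state)
        for n, sample in enumerate(x):
            residual = sample - prev    # prediction error; small when samples are correlated
            codes[n] = int(round(residual / step))
            prev += codes[n] * step     # track the decoder's reconstruction to avoid drift
        return codes

    def dpcm_decode(codes, step):
        """Rebuild the signal by accumulating the dequantized residuals."""
        return np.cumsum(codes * step)

    if __name__ == "__main__":
        fs = 44100
        t = np.arange(fs) / fs
        x = np.sin(2 * np.pi * 440 * t)          # highly correlated test tone
        codes = dpcm_encode(x, step=1e-3)
        y = dpcm_decode(codes, step=1e-3)
        # Residual codes stay within roughly +/-65, whereas direct PCM of x at the
        # same step size would need indices up to about +/-1000.
        print("residual code range:", codes.min(), codes.max())
        print("SNR (dB): %.1f" % (10 * np.log10(np.mean(x ** 2) / np.mean((x - y) ** 2))))

Perceptual coders go a step further: rather than minimizing the numerical error as above, they shape the quantization noise so that it stays below the masking thresholds supplied by a psychoacoustic model.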

During the early 1990s several workgroups and organizations such asthe International Organization for StandardizationInternational Electro-technicalCommission (ISOIEC) the International Telecommunications Union (ITU)ATampT Dolby Laboratories Digital Theatre Systems (DTS) Lucent TechnologiesPhilips and Sony were actively involved in developing perceptual audio codingalgorithms and standards Some of the popular commercial standards publishedin the early 1990s include Dolbyrsquos Audio Coder-3 (AC-3) the DTS CoherentAcoustics (DTS-CA) Lucent Technologiesrsquo Perceptual Audio Coder (PAC)Philipsrsquo Precision Adaptive Subband Coding (PASC) and Sonyrsquos AdaptiveTransform Acoustic Coding (ATRAC) Table 11 lists chronologically some ofthe prominent audio coding standards The commercial success enjoyed bythese audio coding standards triggered the launch of several multimedia storageformats

Table 12 lists some of the popular multimedia storage formats since the begin-ning of the CD era High-performance stereo systems became quite common withthe advent of CDs in the early 1980s A compact-discndashread only memory (CD-ROM) can store data up to 700ndash800 MB in digital form as ldquomicroscopic-pitsrdquothat can be read by a laser beam off of a reflective surface or a medium Threecompeting storage media ndash DAT the digital compact cassette (DCC) and the


Table 1.1 List of perceptual and lossless audio coding standards/algorithms

Standard/algorithm                                  Related references
1. ISO/IEC MPEG-1 audio                             [ISOI92]
2. Philips' PASC (for DCC applications)             [Lokh92]
3. AT&T/Lucent PAC/EPAC                             [John96c] [Sinh96]
4. Dolby AC-2                                       [Davi92] [Fiel91]
5. AC-3/Dolby Digital                               [Davis93] [Fiel96]
6. ISO/IEC MPEG-2 (BC/LSF) audio                    [ISOI94a]
7. Sony's ATRAC (MiniDisc and SDDS)                 [Yosh94] [Tsut96]
8. SHORTEN                                          [Robi94]
9. Audio processing technology – APT-x100           [Wyli96b]
10. ISO/IEC MPEG-2 AAC                              [ISOI96]
11. DTS coherent acoustics                          [Smyt96] [Smyt99]
12. The DVD algorithm                               [Crav96] [Crav97]
13. MUSICompress                                    [Wege97]
14. Lossless transform coding of audio (LTAC)       [Pura97]
15. AudioPaK                                        [Hans98b] [Hans01]
16. ISO/IEC MPEG-4 audio version 1                  [ISOI99]
17. Meridian lossless packing (MLP)                 [Gerz99]
18. ISO/IEC MPEG-4 audio version 2                  [ISOI00]
19. Audio coding based on integer transforms        [Geig01] [Geig02]
20. Direct-stream digital (DSD) technology          [Reef01a] [Jans03]

Table 1.2 Some of the popular audio storage formats

Audio storage format                 Related references
1. Compact disc                      [CD82] [IECA87]
2. Digital audio tape (DAT)          [Watk88] [Tan89]
3. Digital compact cassette (DCC)    [Lokh91] [Lokh92]
4. MiniDisc                          [Yosh94] [Tsut96]
5. Digital versatile disc (DVD)      [DVD96]
6. DVD-audio (DVD-A)                 [DVD01]
7. Super audio CD (SACD)             [SACD02]

MiniDisc (MD) – entered the commercial market during 1987–1992. Intended mainly for back-up high-density storage (~1.3 GB), the DAT became the primary source of mass data storage/transfer [Watk88] [Tan89]. In 1991–1992, Sony proposed a storage medium called the MiniDisc, primarily for audio storage. MD employs the ATRAC algorithm for compression. In 1991, Philips introduced the DCC, a successor of the analog compact cassette. Philips DCC employs a compression scheme called the PASC [Lokh91] [Lokh92] [Hoog94]. The DCC began


as a potential competitor for DATs, but was discontinued in 1996. The introduction of the digital versatile disc (DVD) in 1996 enabled both video and audio recording/storage, as well as text-message programming. The DVD became one of the most successful storage media. With the improvements in the audio compression and DVD storage technologies, multichannel surround sound encoding formats gained interest [Bosi93] [Holm99] [Bosi00].

With the emergence of streaming audio applications during the late 1990s, researchers pursued techniques such as combined speech and audio architectures, as well as joint source-channel coding algorithms that are optimized for the packet-switched Internet. The advent of the ISO/IEC MPEG-4 standard (1996–2000) [ISOI99] [ISOI00] established new research goals for high-quality coding of audio at low bit rates. MPEG-4 audio encompasses more functionality than perceptual coding [Koen98] [Koen99]. It comprises an integrated family of algorithms with provisions for scalable, object-based speech and audio coding at bit rates from as low as 200 b/s up to 64 kb/s per channel.

The emergence of the DVD-audio and the super audio CD (SACD) provided designers with additional storage capacity, which motivated research in lossless audio coding [Crav96] [Gerz99] [Reef01a]. A lossless audio coding system is able to reconstruct perfectly a bit-for-bit representation of the original input audio. In contrast, a coding scheme incapable of perfect reconstruction is called lossy. For most audio program material, lossy schemes offer the advantage of lower bit rates (e.g., less than 1 bit per sample) relative to lossless schemes (e.g., 10 bits per sample). Delivering real-time lossless audio content to the network browser at low bit rates is the next grand challenge for codec designers.

1.2 A GENERAL PERCEPTUAL AUDIO CODING ARCHITECTURE

Over the last few years, researchers have proposed several efficient signal models (e.g., transform-based, subband-filter structures, wavelet-packet) and compression standards (Table 1.1) for high-quality digital audio reproduction. Most of these algorithms are based on the generic architecture shown in Figure 1.1.

The coders typically segment input signals into quasi-stationary frames ranging from 2 to 50 ms. Then, a time-frequency analysis section estimates the temporal and spectral components of each frame. The time-frequency mapping is usually matched to the analysis properties of the human auditory system. Either way, the ultimate objective is to extract from the input audio a set of time-frequency parameters that is amenable to quantization according to a perceptual distortion metric. Depending on the overall design objectives, the time-frequency analysis section usually contains one of the following:

• Unitary transform
• Time-invariant bank of critically sampled, uniform/nonuniform bandpass filters


Figure 1.1 A generic perceptual audio encoder. The input audio is passed through a time-frequency analysis section; a psychoacoustic analysis block supplies masking thresholds to the bit-allocation routine; the time-frequency parameters are quantized and encoded, entropy (lossless) coding is applied, and the coded parameters and side information are multiplexed (MUX) onto the channel.

• Time-varying (signal-adaptive) bank of critically sampled, uniform/nonuniform bandpass filters
• Harmonic/sinusoidal analyzer
• Source-system analysis (LPC and multipulse excitation)
• Hybrid versions of the above

The choice of time-frequency analysis methodology always involves a fundamental tradeoff between time and frequency resolution requirements. Perceptual distortion control is achieved by a psychoacoustic signal analysis section that estimates signal masking power based on psychoacoustic principles. The psychoacoustic model delivers masking thresholds that quantify the maximum amount of distortion at each point in the time-frequency plane such that quantization of the time-frequency parameters does not introduce audible artifacts. The psychoacoustic model therefore allows the quantization section to exploit perceptual irrelevancies. This section can also exploit statistical redundancies through classical techniques such as DPCM or ADPCM. Once a quantized compact parametric set has been formed, the remaining redundancies are typically removed through noiseless run-length (RL) and entropy coding techniques, e.g., Huffman [Cove91], arithmetic [Witt87], or Lempel-Ziv-Welch (LZW) [Ziv77] [Welc84]. Since the output of the psychoacoustic distortion control model is signal-dependent, most algorithms are inherently variable rate. Fixed channel rate requirements are usually satisfied through buffer feedback schemes, which often introduce encoding delays.

1.3 AUDIO CODER ATTRIBUTES

Perceptual audio coders are typically evaluated based on the following attributes: audio reproduction quality, operating bit rates, computational complexity, codec delay, and channel error robustness. The objective is to attain a high-quality (transparent) audio output at low bit rates (< 32 kb/s), with an acceptable


algorithmic delay (~5 to 20 ms), and with low computational complexity (~1 to 10 million instructions per second, or MIPS).

1.3.1 Audio Quality

Audio quality is of paramount importance when designing an audio coding algorithm. Successful strides have been made since the development of simple near-transparent perceptual coders. Typically, classical objective measures of signal fidelity, such as the signal-to-noise ratio (SNR) and the total harmonic distortion (THD), are inadequate [Ryde96]. As the field of perceptual audio coding matured rapidly and created greater demand for listening tests, there was a corresponding growth of interest in perceptual measurement schemes. Several subjective and objective quality measures have been proposed and standardized during the last decade. Some of these schemes include the noise-to-mask ratio (NMR, 1987) [Bran87a], the perceptual audio quality measure (PAQM, 1991) [Beer91], the perceptual evaluation (PERCEVAL, 1992) [Pail92], the perceptual objective measure (POM, 1995) [Colo95], and the objective audio signal evaluation (OASE, 1997) [Spor97]. We will address these and several other quality assessment schemes in detail in Chapter 12.

1.3.2 Bit Rates

From a codec designer's point of view, one of the key challenges is to represent high-fidelity audio with a minimum number of bits. For instance, if a 5-ms audio frame sampled at 48 kHz (240 samples per frame) is represented using 80 bits, then the encoding bit rate would be 80 bits/5 ms = 16 kb/s. Low bit rates imply high compression ratios and, generally, low reproduction quality. Early coders, such as the ISO/IEC MPEG-1 (32–448 kb/s), the Dolby AC-3 (32–384 kb/s), the Sony ATRAC (256 kb/s), and the Philips PASC (192 kb/s), employ high bit rates for obtaining transparent audio reproduction. However, the development of several sophisticated audio coding tools (e.g., MPEG-4 audio tools) created ways for efficient transmission or storage of audio at rates between 8 and 32 kb/s. Future audio coding algorithms promise to offer reasonable quality at low rates along with the ability to scale both rate and quality to match different requirements, such as time-varying channel capacity.
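As a quick illustration of the frame and bit-rate arithmetic in this example, the few MATLAB lines below reproduce the numbers quoted above (this is only a sketch of the calculation; the frame length, sampling rate, and bit budget are the ones used in the text).

% Frame-size and bit-rate arithmetic for the 5-ms, 48-kHz example above
fs = 48000;                               % sampling rate (Hz)
t_frame = 5e-3;                           % frame duration (s)
bits_per_frame = 80;                      % assumed bit budget per frame
samples_per_frame = fs * t_frame          % 240 samples per frame
bit_rate = bits_per_frame / t_frame       % 16000 b/s = 16 kb/s
bits_per_sample = bits_per_frame / samples_per_frame   % 1/3 bit per sample on average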

1.3.3 Complexity

Reduced computational complexity not only enables real-time implementation but may also decrease the power consumption and extend battery life. Computational complexity is usually measured in terms of millions of instructions per second (MIPS). Complexity estimates are processor-dependent. For example, the complexity associated with Dolby's AC-3 decoder was estimated at approximately 27 MIPS using the Zoran ZR38001 general-purpose DSP core [Vern95]; for the Motorola DSP56002 processor, the complexity was estimated at 45 MIPS [Vern95]. Usually, most of the audio codecs rely on the so-called asymmetric encoding principle. This means that the codec complexity is not evenly


shared between the encoder and the decoder (typically, encoder 80% and decoder 20% complexity), with more emphasis on reducing the decoder complexity.

1.3.4 Codec Delay

Many of the network applications for high-fidelity audio (streaming audio, audio-on-demand) are delay tolerant (up to 100–200 ms), providing the opportunity to exploit long-term signal properties in order to achieve high coding gain. However, in two-way real-time communication and voice-over-Internet-protocol (VoIP) applications, low-delay encoding (10–20 ms) is important. Consider the example described before, i.e., an audio coder operating on frames of 5 ms at a 48-kHz sampling frequency. In an ideal encoding scenario, the minimum amount of delay should be 5 ms at the encoder and 5 ms at the decoder (same as the frame length). However, other factors, such as the analysis-synthesis filter bank window, the look-ahead, the bit-reservoir, and the channel delay, contribute to additional delays. Employing shorter analysis-synthesis windows, avoiding look-ahead, and re-structuring the bit-reservoir functions could result in low-delay encoding, nonetheless with reduced coding efficiencies.

1.3.5 Error Robustness

The increasing popularity of streaming audio over packet-switched and wireless networks, such as the Internet, implies that any algorithm intended for such applications must be able to deal with a noisy, time-varying channel. In particular, provisions for error robustness and error protection must be incorporated at the encoder in order to achieve reliable transmission of digital audio over error-prone channels. One simple idea could be to provide better protection to the error-sensitive and priority (important) bits. For instance, the audio frame header requires the maximum error robustness; otherwise, transmission errors in the header will seriously impair the entire audio frame. Several error detecting/correcting codes [Lin82] [Wick95] [Bayl97] [Swee02] [Zara02] can also be employed. Inclusion of error correcting codes in the bitstream might help to obtain error-free reproduction of the input audio, however, with increased complexity and bit rates.

From the discussion in the previous sections, it is evident that several tradeoffs must be considered in designing an algorithm for a particular application. For this reason, audio coding standards consist of several tools that enable the design of scalable algorithms. For example, MPEG-4 provides tools to design algorithms that satisfy a variety of bit rate, delay, complexity, and robustness requirements.

1.4 TYPES OF AUDIO CODERS – AN OVERVIEW

Based on the signal model or the analysis-synthesis technique employed to encode audio signals, audio coders can be broadly classified as follows:

• Linear predictive
• Transform


• Subband
• Sinusoidal

Algorithms are also classified based on the lossy or the lossless nature of audio coding. Lossy audio coding schemes achieve compression by exploiting perceptually irrelevant information. Some examples of lossy audio coding schemes include the ISO/IEC MPEG codec series, the Dolby AC-3, and the DTS CA. In lossless audio coding, the audio data is merely "packed" to obtain a bit-for-bit representation of the original. The meridian lossless packing (MLP) [Gerz99] and the direct stream digital (DSD) techniques [Brue97] [Reef01a] form a class of high-end lossless compression algorithms that are embedded in the DVD-audio [DVD01] and the SACD [SACD02] storage formats, respectively. Lossless audio coding techniques, in general, yield high-quality digital audio without any artifacts at high rates. For instance, perceptual audio coding yields compression ratios from 10:1 to 25:1, while lossless audio coding can achieve compression ratios from 2:1 to 4:1.

1.5 ORGANIZATION OF THE BOOK

This book is organized as follows. In Chapter 2, we review basic signal processing concepts associated with audio coding. Chapter 3 provides introductory material on waveform quantization and entropy coding schemes. Some of the key topics covered in this chapter include scalar quantization, uniform/nonuniform quantization, pulse code modulation (PCM), differential PCM (DPCM), adaptive DPCM (ADPCM), vector quantization (VQ), bit-allocation techniques, and entropy coding schemes (Huffman, Rice, and arithmetic).

Chapter 4 provides information on linear prediction and its application in narrow- and wideband coding. First, we address the utility of the LP analysis-synthesis approach in speech applications. Next, we describe the open-loop analysis-synthesis LP and closed-loop analysis-by-synthesis LP techniques.

In Chapter 5, psychoacoustic principles are described. Johnston's notion of perceptual entropy is presented as a measure of the fundamental limit of transparent compression for audio. The ISO/IEC 11172-3 MPEG-1 psychoacoustic analysis model 1 is used to describe the five important steps associated with the global masking threshold computation. Chapter 6 explores filter bank design issues and algorithms, with a particular emphasis placed on the modified discrete cosine transform (MDCT) that is widely used in several perceptual audio coding algorithms. Chapter 6 also addresses pre-echo artifacts and control strategies.

Chapters 7, 8, and 9 review established and emerging techniques for transparent coding of FM- and CD-quality audio signals, including several algorithms that have become international standards. Transform coding methodologies are described in Chapter 7, subband coding algorithms are addressed in Chapter 8, and sinusoidal algorithms are presented in Chapter 9. In addition to methods based on uniform bandwidth filter banks, Chapter 8 covers coding methods that


utilize discrete wavelet transforms (DWT), discrete wavelet packet transforms (DWPT), and other nonuniform filter banks. Examples of hybrid algorithms that make use of more than one signal model appear throughout Chapters 7, 8, and 9.

Chapter 10 is concerned with standardization activities in audio coding. It describes coding standards and products such as the ISO/IEC MPEG family (-1 "MP1/2/3", -2, -4, -7, and -21), the Sony MiniDisc (ATRAC), the cinematic Sony SDDS, the Lucent Technologies PAC/EPAC/MPAC, the Dolby AC-2/AC-3, the Audio Processing Technology APT-x100, and the DTS coherent acoustics (DTS-CA). Details on the MP3 and MPEG-4 AAC algorithms that are popular in Web and in handheld media applications, e.g., Apple iPod, are provided.

Chapter 11 focuses on lossless audio coding and digital audio watermarking techniques. In particular, the SHORTEN, the DVD algorithm, the MUSICompress, the AudioPaK, the C-LPAC, the LTAC, and the IntMDCT lossless coding schemes are described in detail. Chapter 11 also addresses the two popular high-end storage formats, i.e., the SACD and the DVD-Audio. The MLP and the DSD techniques for lossless audio coding are also presented.

Chapter 12 provides information on subjective quality measures for perceptual codecs. The five-point absolute and differential subjective quality scales are addressed, as well as the subjective test methodologies specified in the ITU-R Recommendation BS.1116. A set of subjective benchmarks is provided for the various standards in both stereophonic and multichannel modes to facilitate algorithm comparisons.

1.6 NOTATIONAL CONVENTIONS

Unless otherwise specified, bit rates correspond to single-channel or monaural coding throughout this text. Subjective quality measurements are specified in terms of either the five-point mean opinion score (MOS, Table 1.3) or the 41-point subjective difference grade (SDG, Chapter 12, Table 12.1). Table 1.4 lists some of the symbols and notation used in the book.

Table 1.3 Mean opinion score (MOS) scale

MOS    Perceptual quality
1      Bad
2      Poor
3      Average
4      Good
5      Excellent


Table 1.4 Symbols and notation used in the book

Symbol/notation – Description
t, n – Time index / sample index
ω, Ω – Frequency index (analog domain, discrete domain)
f (= ω/2π) – Frequency (Hz)
Fs, Ts – Sampling frequency, sampling period
x(t) ↔ X(ω) – Continuous-time Fourier transform (FT) pair
x(n) ↔ X(Ω) – Discrete-time Fourier transform (DTFT) pair
x(n) ↔ X(z) – z-transform pair
[·] – Indicates a particular element in a coefficient array
h(n) ↔ H(Ω) – Impulse response/frequency response pair of a discrete-time system
e(n) – Error/prediction residual
H(z) = B(z)/A(z) = (1 + b_1 z^{-1} + ... + b_L z^{-L}) / (1 + a_1 z^{-1} + ... + a_M z^{-M}) – Transfer function consisting of a numerator polynomial and a denominator polynomial (corresponding to b-coefficients and a-coefficients)
ŝ(n) = Σ_{i=0}^{M} a_i s(n − i) – Predicted signal
Q⟨s(n)⟩ – Quantization/approximation operator, or estimated/encoded value
s^{[·]} – Square brackets in the superscript denote recursion
s^{(·)} – Parentheses in the superscript denote time dependency
N, N_f, N_sf – Total number of samples, samples per frame, samples per subframe
log(·), ln(·), log_p(·) – Log to the base 10, log to the base e, log to the base p
E[·] – Expectation operator
ε – Mean squared error (MSE)
μ_x, σ_x² – Mean and variance of the signal x(n)
r_xx(m) – Autocorrelation of the signal x(n)
r_xy(m) – Cross-correlation of x(n) and y(n)
R_xx(e^{jΩ}) – Power spectral density of the signal x(n)
Bit rate – Number of bits per second (b/s, kb/s, or Mb/s)
dB SPL – Decibels sound pressure level


PROBLEMS

The objective of these introductory problems is to introduce the novice to simple relations between the sampling rate and the bit rate in PCM coded sequences.

1.1 Consider an audio signal, s(n), sampled at 44.1 kHz and digitized using (a) 8-bit, (b) 24-bit, and (c) 32-bit resolution. Compute the data rates for the cases (a)-(c). Give the number of samples within a 16-ms frame and compute the number of bits per frame.

1.2 List some of the typical data rates (in kb/s) and sampling rates (in kHz) employed in applications such as (a) video streaming, (b) audio streaming, (c) digital audio broadcasting, (d) digital compact cassette, (e) MiniDisc, (f) DVD, (g) DVD-audio, (h) SACD, (i) MP3, (j) MP4, (k) video conferencing, and (l) cellular telephony.

COMPUTER EXERCISES

The objective of this exercise is to familiarize the reader with the handling of sound files using MATLAB and to expose the novice to perceptual attributes of sampling rate and bit resolution.

1.3 For this computer exercise, use the MATLAB workspace ch1pb1.mat from the website.

Load the workspace ch1pb1.mat using:

>> load('ch1pb1.mat')

Use the whos command to view the variables in the workspace. The data vector audio_in contains 44,100 samples of audio data. Perform the following in MATLAB:

>> wavwrite(audio_in, 44100, 16, 'pb1_aud44_16.wav')
>> wavwrite(audio_in, 10000, 16, 'pb1_aud10_16.wav')
>> wavwrite(audio_in, 44100, 8, 'pb1_aud44_08.wav')

Listen to the wave files pb1_aud44_16.wav, pb1_aud10_16.wav, and pb1_aud44_08.wav using a media player. Comment on the perceptual quality of the three wave files.

1.4 Down-sample the data vector audio_in in Problem 1.3 using:

>> aud_down_4 = downsample(audio_in, 4)

Use the following commands to listen to audio_in and aud_down_4. Comment on the perceptual quality of the data vectors in each of the cases below:

>> sound(audio_in, fs)
>> sound(aud_down_4, fs)
>> sound(aud_down_4, fs/4)

CHAPTER 2

SIGNAL PROCESSING ESSENTIALS

2.1 INTRODUCTION

The signal processing theory described here will be restricted only to the concepts that are relevant to audio coding. Because of the limited scope of this chapter, we provide mostly qualitative descriptions and establish only the essential mathematical formulas. First, we briefly review the basics of continuous-time (analog) signals and systems and the methods used to characterize the frequency spectrum of analog signals. We then present the basics of analog filters and subsequently describe discrete-time signals. Coverage of the basics of discrete-time signals includes the fundamentals of transforms that represent the spectra of digital sequences and the theory of digital filters. The essentials of random and multirate signal processing are also reviewed in this chapter.

2.2 SPECTRA OF ANALOG SIGNALS

The frequency spectrum of an analog signal is described in terms of the continuous Fourier transform (CFT). The CFT of a continuous-time signal x(t) is given by

X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt    (2.1)

where ω is the frequency in radians per second (rad/s). Note that ω = 2πf, where f is the frequency in Hz. The complex-valued function X(ω) describes the CFT


Figure 2.1 The pulse-sinc CFT pair: the rectangular pulse x(t) = 1 for −T0/2 ≤ t ≤ T0/2 (0 otherwise) and its CFT X(ω) = T0 sinc(ωT0/2).

Figure 2.2 CFT of a sinusoid and a truncated sinusoid: x(t) = cos(ω0 t) with ω0 = 2π/T0 has CFT X(ω) = π(δ(ω − ω0) + δ(ω + ω0)); the rectangular window w(t) = 1 for 0 ≤ t ≤ T0 (0 otherwise) has CFT W(ω) = T0 e^{−jωT0/2} sinc(ωT0/2); the truncated sinusoid xw(t) = x(t)w(t) has a spectrum consisting of shifted, scaled sinc functions centered at ±ω0.

magnitude and phase spectrum of the signal. The inverse CFT is given by

x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{j\omega t}\, d\omega    (2.2)

The inverse CFT is also known as the synthesis formula because it describes the time-domain signal x(t) in terms of complex sinusoids. In CFT theory, x(t) and X(ω) are called a transform pair, i.e.,

x(t) \leftrightarrow X(\omega)    (2.3)

The pulse-sinc pair shown in Figure 2.1 is useful in explaining the effects of time-domain truncation on the spectra. For example, when a sinusoid is truncated, there is loss of resolution and spectral leakage, as shown in Figure 2.2.

In real-life signal processing, all signals have finite length and hence time-domain truncation always occurs. The truncation of an audio segment by a rectangular window is shown in Figure 2.3. To smooth out frame transitions and control spectral leakage effects, the signal is often tapered prior to truncation using window functions such as the Hamming, the Bartlett, and the trapezoidal windows. A tapered window avoids the sharp discontinuities at the edges of the truncated time-domain frame. This, in turn, reduces the spectral leakage in the frequency spectrum of the truncated signal. This reduction of spectral leakage is attributed to the reduced level of the sidelobes associated with tapered windows. The reduced sidelobe effects come at the expense of a modest loss of spectral

Figure 2.3 (a) Audio signal x(t) and a rectangular window w(t) (shown in dashed line), (b) truncated audio signal xw(t), and (c) frequency-domain representation Xw(ω) of the truncated audio.


Figure 2.4 (a) Audio signal x(t) and a Hamming window w(t) (shown in dashed line), (b) truncated audio signal xw(t), and (c) frequency-domain representation Xw(ω) of the truncated audio.

resolution. An audio segment formed using a Hamming window is shown in Figure 2.4.
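The effect of tapering described above is easy to reproduce numerically. The MATLAB sketch below is illustrative only (the 1-kHz test tone and frame length are arbitrary choices, and hamming assumes the Signal Processing Toolbox); it compares the spectrum of a rectangularly truncated sinusoid with that of a Hamming-windowed one, showing the lower sidelobes (reduced leakage) and slightly wider mainlobe of the tapered window.

% Spectral leakage: rectangular vs. Hamming window
fs = 8000; N = 256; n = (0:N-1)';
x  = cos(2*pi*1000*n/fs);             % 1-kHz sinusoid, one N-sample frame
xr = x;                               % rectangular window (plain truncation)
xh = x .* hamming(N);                 % Hamming-tapered frame
Nfft = 4096;                          % zero-padded FFT for a smooth spectrum
Fr = abs(fft(xr, Nfft));  Fh = abs(fft(xh, Nfft));
Xr = 20*log10(Fr/max(Fr));  Xh = 20*log10(Fh/max(Fh));
f  = (0:Nfft-1)*fs/Nfft;
plot(f(1:Nfft/2), Xr(1:Nfft/2), '--', f(1:Nfft/2), Xh(1:Nfft/2));
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
legend('rectangular', 'Hamming'); axis([0 4000 -100 0]); grid on;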

2.3 REVIEW OF CONVOLUTION AND FILTERING

A linear time-invariant (LTI) system configuration is shown in Figure 2.5. A linear filter satisfies the property of generalized superposition, and hence its output y(t) is the convolution of the input x(t) with the filter impulse response h(t). Mathematically, convolution is represented by the integral in Eq. (2.4):

y(t) = \int_{-\infty}^{\infty} h(\tau)\, x(t - \tau)\, d\tau = h(t) * x(t)    (2.4)


Figure 2.5 A linear time-invariant (LTI) system and the convolution operation: x(t) applied to an LTI system h(t) produces y(t) = x(t) * h(t).

Figure 2.6 A simple RC low-pass filter: frequency response H(ω) = 1/(1 + jωRC), with the magnitude |H(ω)|² plotted in dB for RC = 1.

The symbol * between the impulse response h(t) and the input x(t) is often used to denote the convolution operation.

The CFT of the impulse response h(t) is the frequency response of the filter, i.e.,

h(t) \leftrightarrow H(\omega)    (2.5)

As an example for reviewing these fundamental concepts in linear systems, we present in Figure 2.6 a simple first-order RC circuit that corresponds to a low-pass filter. The impulse response for this RC filter is a decaying exponential, and its frequency response is given by a simple first-order rational function, H(ω). This function is complex-valued, and its magnitude represents the gain of the filter with respect to frequency at steady state. If a sinusoidal signal drives the linear filter, the steady-state output is also a sinusoid with the same frequency. However, its amplitude is scaled and phase is shifted in a manner consistent with the magnitude and phase of the frequency response function, respectively.

2.4 UNIFORM SAMPLING

In all of our subsequent discussions, we will be treating audio signals and associated systems in discrete time. The rules for uniform sampling of analog speech/audio are provided by the sampling theorem [Shan48]. This theorem states that a signal that is strictly bandlimited to a bandwidth of B rad/s can be uniquely represented by its sampled values spaced at uniform intervals that are


not more than π/B seconds apart. In other words, if we denote the sampling period as Ts, then the sampling theorem states that Ts ≤ π/B. In the frequency domain, and with the sampling frequency defined as ωs = 2πfs = 2π/Ts, this condition can be stated as

\omega_s \geq 2B \ (\text{rad/s}) \quad \text{or} \quad f_s \geq \frac{B}{\pi}    (2.6)

Mathematically, the sampling process is represented by time-domain multiplication of the analog signal x(t) with an impulse train p(t), as shown in Figure 2.7.

Since multiplication in time is convolution in frequency, the CFT of the sampled signal xs(t) corresponds to the CFT of the original analog signal x(t) convolved with the CFT of the impulse train p(t). The CFT of the impulses is also a train of uniformly spaced impulses in frequency that are spaced 1/Ts Hz apart. The CFT of the sampled signal is therefore a periodic extension of the CFT of the analog signal, as shown in Figure 2.8. In Figure 2.8, the analog signal was considered to be ideally bandlimited, and the sampling frequency ωs was chosen to be more than 2B to avoid aliasing. The CFT of the sampled signal is

Figure 2.7 Uniform sampling of analog signals: the analog signal x(t) is multiplied by an impulse train p(t) to produce the discrete signal xs(t).

Figure 2.8 Spectrum of ideally bandlimited and uniformly sampled signals.


given by

X_s(\omega) = \frac{1}{T_s} \sum_{k=-\infty}^{\infty} X(\omega - k\omega_s)    (2.7)

Note that the spectrum of the sampled signal in Figure 2.8 is such that an ideal low-pass filter (LPF) can recover the baseband of the signal and hence perfectly reconstruct the analog signal from the digital signal. The reconstruction process is shown in Figure 2.9. This reconstruction LPF essentially interpolates between the samples and reproduces the analog signal from the digital signal. The interpolation process becomes evident once the filtering operation is interpreted in the time domain as convolution. Reconstruction occurs by interpolating with the sinc function, which is the impulse response of the ideal low-pass filter. The reconstruction process for ωs = 2B is given by

x(t) = \sum_{n=-\infty}^{\infty} x(nT_s)\, \mathrm{sinc}(B(t - nT_s))    (2.8)

Note that if the sampling frequency is less than 2B, then aliasing will occur, and therefore the signal can no longer be reconstructed perfectly. Figure 2.10 illustrates aliasing.
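A small numerical experiment makes Eq. (2.8) and the aliasing condition concrete. The MATLAB sketch below is illustrative only (the test tone, rates, and interval are arbitrary choices, and sinc assumes the Signal Processing Toolbox); it reconstructs a densely sampled reference from its uniform samples by sinc interpolation, once with a sampling rate above twice the tone frequency and once below it, where the reconstruction instead converges to the aliased (lower-frequency) sinusoid.

% Sinc interpolation and the effect of undersampling
f0  = 1000;                                   % test sinusoid (Hz)
t   = (0:1/96000:0.004)';                     % dense grid standing in for continuous time
sig = @(t) cos(2*pi*f0*t);
for fs = [8000 1500]                          % 8 kHz: fs > 2*f0;  1.5 kHz: fs < 2*f0 (aliasing)
    Ts = 1/fs;
    n  = (0:ceil(0.004/Ts))';                 % sample indices covering the interval
    xn = sig(n*Ts);                           % uniform samples
    xr = zeros(size(t));
    for k = 1:length(n)                       % interpolation with sinc((t - nTs)/Ts)
        xr = xr + xn(k)*sinc((t - n(k)*Ts)/Ts);
    end
    mid = t > 0.001 & t < 0.003;              % compare away from the edges of the finite sum
    err = max(abs(xr(mid) - sig(t(mid))));    % small when fs > 2*f0, large when undersampled
    fprintf('fs = %5d Hz, max reconstruction error (central region) = %.3f\n', fs, err);
end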

In real-life applications, the analog signal is not ideally bandlimited, and the sampling process is not perfect, i.e., sampling pulses have finite amplitude and finite duration. Therefore, some level of aliasing is always present. To reduce

Figure 2.9 Reconstruction (interpolation) using a low-pass filter.

Figure 2.10 Aliasing when ωs < 2B.


Table 2.1 Sampling rates and bandwidth specifications

Format                       Bandwidth    Sampling frequency
Telephony                    3.2 kHz      8 kHz
Wideband audio               7 kHz        16 kHz
High-fidelity (CD)           20 kHz       44.1 kHz
Digital audio tape (DAT)     20 kHz       48 kHz
Super audio CD (SACD)        100 kHz      2.8224 MHz
DVD audio (DVD-A)            96 kHz       44.1, 48, 88.2, 96, 176.4, or 192 kHz

aliasing, the signal is prefiltered by an anti-aliasing low-pass filter and usually over-sampled (ωs > 2B). The degree of over-sampling depends also on the choice of the analog anti-aliasing filter. For high-quality reconstruction and modest over-sampling, the anti-aliasing filter must have good rejection characteristics. On the other hand, over-sampling by a large factor relaxes the requirements on the analog anti-aliasing filter and hence simplifies analog hardware at the expense of a higher data rate. Nowadays, over-sampling is practiced often, even in high-fidelity systems. In fact, the use of inexpensive Sigma-Delta (Σ-Δ) analog-to-digital (A/D) converters in conjunction with down-sampling in the digital domain is a common practice. Details on Σ-Δ A/D conversion and some over-sampling schemes tailored for high-fidelity audio will be presented in Chapter 11. Standard sampling rates for the different grades of speech and audio are given in Table 2.1.

2.5 DISCRETE-TIME SIGNAL PROCESSING

Audio coding algorithms operate on a quantized discrete-time signal. Prior to compression, most algorithms require that the audio signal is acquired with high-fidelity characteristics. In typical standardized algorithms, audio is assumed to be bandlimited at 20 kHz, sampled at 44.1 kHz, and quantized at 16 bits per sample. In the following discussion, we will treat audio as a sequence, i.e., as a stream of numbers denoted x(n) = x(t)|_{t = nTs}. Initially, we will review the discrete-time signal processing concepts without considering further aliasing and quantization effects. Quantization effects will be discussed later, during the description of specific coding algorithms.

2.5.1 Transforms for Discrete-Time Signals

Discrete-time signals are described in the transform domain using the z-transform and the discrete-time Fourier transform (DTFT). These two transformations have similar roles as the Laplace transform and the CFT for analog signals, respectively. The z-transform is defined as

X(z) = \sum_{n=-\infty}^{\infty} x(n)\, z^{-n}    (2.9)


where z is a complex variable. Note that if the z-transform is evaluated on the unit circle, i.e., for

z = e^{j\Omega}, \quad \Omega = 2\pi f T_s    (2.10)

then the z-transform becomes the discrete-time Fourier transform (DTFT). The DTFT is given by

X(e^{j\Omega}) = \sum_{n=-\infty}^{\infty} x(n)\, e^{-j\Omega n}    (2.11)

The DTFT is discrete in time and continuous in frequency. As expected, the frequency spectrum associated with the DTFT is periodic with period 2π rad.

Example 2.1

Consider the DTFT of a finite-length pulse:

x(n) = 1, for 0 ≤ n ≤ N − 1
x(n) = 0, otherwise

Using geometric series results and trigonometric identities on the DTFT sum,

X(e^{j\Omega}) = \sum_{n=0}^{N-1} e^{-j\Omega n} = \frac{1 - e^{-j\Omega N}}{1 - e^{-j\Omega}} = e^{-j\Omega(N-1)/2}\, \frac{\sin(\Omega N/2)}{\sin(\Omega/2)}    (2.12)

Figure 2.11 DTFT of a sampled pulse for Example 2.1: digital sinc for N = 8 (dashed line) and N = 16 (solid line).

The ratio of sinusoidal functions in Eq. (2.12) is known as the Dirichlet function, or as a digital sinc function. Figure 2.11 shows the DTFT of a finite-length pulse. The digital sinc is quite similar to the continuous-time sinc function, except that it is periodic with period 2π and has a finite number of sidelobes within a period.
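Equation (2.12) can be checked directly by evaluating the DTFT sum on a fine frequency grid. The MATLAB lines below are an illustrative sketch (the grid size is arbitrary); they compare the numerically evaluated DTFT of the length-N pulse with the closed-form digital sinc magnitude for N = 8 and N = 16.

% DTFT of a finite-length pulse vs. the closed-form digital sinc (Eq. 2.12)
Omega = linspace(1e-6, pi, 2048)';             % avoid Omega = 0, where sin(Omega/2) = 0
for N = [8 16]
    n = 0:N-1;
    Xdtft   = exp(-1j*Omega*n) * ones(N,1);    % direct evaluation of sum_n e^{-j Omega n}
    Xclosed = sin(Omega*N/2) ./ sin(Omega/2);  % digital sinc magnitude (up to a phase factor)
    fprintf('N = %2d, max | |Xdtft| - |Xclosed| | = %.2e\n', ...
            N, max(abs(abs(Xdtft) - abs(Xclosed))));
    plot(Omega/pi, abs(Xdtft)); hold on;       % mainlobe narrows as N grows
end
xlabel('Normalized frequency \Omega (x \pi rad)'); ylabel('|X(e^{j\Omega})|'); grid on;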

2.5.2 The Discrete and the Fast Fourier Transform

A computational tool for Fourier transforms is developed by starting from the DTFT analysis expression (2.11) and considering a finite-length signal consisting of N points, i.e.,

X(e^{j\Omega}) = \sum_{n=0}^{N-1} x(n)\, e^{-j\Omega n}    (2.13)

Furthermore, the frequency-domain signal is sampled uniformly at N points within one period, Ω = 0 to 2π, i.e.,

\Omega \Rightarrow \Omega_k = \frac{2\pi}{N} k, \quad k = 0, 1, \ldots, N-1    (2.14)

With the sampling in the frequency domain, the Fourier sum of Eq. (2.13) becomes

X(e^{j\Omega_k}) = \sum_{n=0}^{N-1} x(n)\, e^{-j\Omega_k n}    (2.15)

It is typical in the DSP literature to replace Ω_k with the frequency index k, and hence Eq. (2.15) can be written as

X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j2\pi k n / N}, \quad k = 0, 1, 2, \ldots, N-1    (2.16)

The expression in (2.16) is called the discrete Fourier transform (DFT). Note that the sampling in the frequency domain forces periodicity in the time domain, i.e., x(n) = x(n + N). We also have periodicity in the frequency domain, X(k) = X(k + N), because the signal in the time domain is also discrete. These periodicities create circular effects when convolution is performed by frequency-domain multiplication, i.e.,

x(n) \otimes h(n) \leftrightarrow X(k)\, H(k)    (2.17)


where

x(n) \otimes h(n) = \sum_{m=0}^{N-1} h(m)\, x((n - m) \bmod N)    (2.18)

The symbol ⊗ stands for circular or periodic convolution, and mod N implies modulo-N subtraction of indices. The DFT is a one-to-one transformation whose basis functions are orthogonal. With the proper normalization, the DFT matrix can be written as a unitary matrix. The N-point inverse DFT (IDFT) is written as

x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j2\pi k n / N}, \quad n = 0, 1, 2, \ldots, N-1    (2.19)

The DFT transform pair is represented by the following notation:

x(n) \leftrightarrow X(k)    (2.20)

The DFT can be computed efficiently using the fast Fourier transform (FFT). The FFT takes advantage of redundancies in the DFT sum by decimating the sequence into subsequences with even and odd indices. It can be shown that if N is a radix-2 integer, the N-point DFT can be computed using a series of butterfly stages. The complexity associated with the DFT algorithm is of the order of N² computations. In contrast, the number of computations associated with the FFT algorithm is roughly of the order of N log₂ N. This is a significant reduction in computational complexity, and FFTs are almost always used in lieu of a DFT.
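The relation in Eqs. (2.17)-(2.18) is worth seeing numerically: multiplying DFTs corresponds to circular, not linear, convolution, and linear convolution is recovered by zero-padding. The MATLAB sketch below is illustrative only (the short sequences are arbitrary).

% Circular convolution via the DFT (Eqs. 2.17-2.18) and its relation to linear convolution
x = [1 2 3 4];  h = [1 -1 2];
N = 4;                                         % common DFT length
hN = [h zeros(1, N-length(h))];                % zero-pad h to length N
ycirc_direct = zeros(1, N);
for n = 0:N-1                                  % direct modulo-N sum of Eq. (2.18)
    for m = 0:N-1
        ycirc_direct(n+1) = ycirc_direct(n+1) + hN(m+1) * x(mod(n-m, N)+1);
    end
end
ycirc_fft = ifft(fft(x, N) .* fft(h, N));      % frequency-domain multiplication
ylin      = conv(x, h);                        % ordinary (linear) convolution, length 6
ylin_fft  = ifft(fft(x, 8) .* fft(h, 8));      % zero-padding to >= 4+3-1 points gives linear conv.
disp(real(ycirc_fft));                         % matches ycirc_direct
disp(real(ylin_fft(1:6)));                     % matches ylin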

2.5.3 The Discrete Cosine Transform

The discrete cosine transform (DCT) of x(n) can be defined as

X(k) = c(k) \sqrt{\frac{2}{N}} \sum_{n=0}^{N-1} x(n) \cos\!\left[\frac{\pi}{N}\left(n + \frac{1}{2}\right) k\right], \quad 0 \leq k \leq N-1    (2.21)

where c(0) = 1/√2 and c(k) = 1 for 1 ≤ k ≤ N − 1. Depending on the periodicity and the symmetry of the input signal x(n), the DCT can be computed using different orthonormal transforms (usually DCT-1, DCT-2, DCT-3, and DCT-4). More details on the DCT and the modified DCT (MDCT) [Malv91] are given in Chapter 6.
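Equation (2.21) is the orthonormal DCT-2. As a quick check (an illustrative sketch that assumes the Signal Processing Toolbox function dct, which uses the same normalization), the code below evaluates the sum of Eq. (2.21) directly, compares it with dct, and confirms that the transform preserves energy.

% Orthonormal DCT-2 of Eq. (2.21) evaluated directly and with the toolbox dct()
N = 16; n = 0:N-1;
x = randn(N, 1);                               % arbitrary test signal
X = zeros(N, 1);
for k = 0:N-1
    ck = 1; if k == 0, ck = 1/sqrt(2); end     % c(0) = 1/sqrt(2), c(k) = 1 otherwise
    X(k+1) = ck * sqrt(2/N) * sum(x' .* cos(pi/N * (n + 0.5) * k));
end
max_diff   = max(abs(X - dct(x)))              % agrees with the built-in DCT-2
energy_gap = abs(sum(x.^2) - sum(X.^2))        % orthonormal transform: Parseval holds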

2.5.4 The Short-Time Fourier Transform

Spectral analysis of nonstationary signals cannot be accommodated by the classical Fourier transform, since the signal has time-varying characteristics. Instead, a time-frequency transformation is required. Time-varying spectral


Figure 2.12 The k-th channel of the analysis-synthesis filter bank (after [Rabi78]): the input x(n) is modulated by e^{−jΩ_k n}, filtered by h(n), and remodulated by e^{jΩ_k n} to produce x_k(n).

analysis [Silv74] [Alle77] [Port81a] can be performed using the short-time Fourier transform (STFT). The analysis expression for the STFT is given by

X(n, \Omega) = \sum_{m=-\infty}^{\infty} x(m)\, h(n - m)\, e^{-j\Omega m} = h(n) * x(n) e^{-j\Omega n}    (2.22)

where Ω = ωT = 2πfT is the normalized frequency in radians, and h(n) is the sliding analysis window. The synthesis expression (inverse transform) is given by

h(n - m)\, x(m) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(n, \Omega)\, e^{j\Omega m}\, d\Omega    (2.23)

Note that if n = m and h(0) = 1 [Rabi78] [Port80], then x(n) can be obtained from Eq. (2.23). The basic assumption in this type of analysis-synthesis is that the signal is slowly time-varying and can be modeled by its short-time spectrum. The temporal and spectral resolution of the STFT are controlled by the length and shape of the analysis window. For speech and audio signals, the length of the window is often constrained to be about 5–20 ms, and hence spectral resolution is sacrificed. The sequence h(n) can also be viewed as the impulse response of an LTI filter, which is excited by a frequency-shifted signal (see Eq. (2.22)). The latter leads to the filter-bank interpretation of the STFT, i.e., for a discrete frequency variable Ω_k = k(ΔΩ), k = 0, 1, ..., N − 1, with ΔΩ and N chosen such that the speech band is covered. Then the analysis expression is written as

X(n, k) = \sum_{m=-\infty}^{\infty} x(m)\, h(n - m)\, e^{-j\Omega_k m} = h(n) * x(n) e^{-j\Omega_k n}    (2.24)

and the synthesis expression is

x_{STFT}(n) = \sum_{k=0}^{N-1} X(n, k)\, e^{j\Omega_k n}    (2.25)

where x_{STFT}(n) is the signal reconstructed within the band of interest. If h(n) and N are chosen carefully [Scha75], the reconstruction by Eq. (2.25)


can be exact. The k-th channel analysis-synthesis scheme is depicted in Figure 2.12, where h_k(n) = h(n) e^{jΩ_k n}.
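In practice, the STFT of Eq. (2.24) is computed frame by frame with an FFT. The MATLAB sketch below is illustrative only (the frame length, hop size, and test signal are arbitrary choices, and hamming assumes the Signal Processing Toolbox); it computes a short-time spectrum for a tone whose frequency changes halfway through the record, so that the time-varying spectrum is visible in the resulting time-frequency plot.

% Frame-based STFT: a sliding Hamming window followed by an FFT per frame
fs = 16000; N = 256; hop = 128;                          % ~16-ms frames, 50% overlap
t  = (0:2*fs-1)'/fs;                                      % 2 s of signal
x  = [cos(2*pi*500*t(1:fs)); cos(2*pi*2000*t(fs+1:end))]; % 500 Hz, then 2 kHz
w  = hamming(N);
nframes = floor((length(x) - N)/hop) + 1;
S = zeros(N, nframes);
for m = 1:nframes
    idx = (m-1)*hop;
    frame = x(idx+1 : idx+N) .* w;                        % windowed frame (analysis window h(n))
    S(:, m) = fft(frame);                                 % one column of X(n, k)
end
imagesc((0:nframes-1)*hop/fs, (0:N/2-1)*fs/N, 20*log10(abs(S(1:N/2, :)) + eps));
axis xy; xlabel('Time (s)'); ylabel('Frequency (Hz)'); title('Short-time spectrum (dB)');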

2.6 DIFFERENCE EQUATIONS AND DIGITAL FILTERS

Digital filters are characterized by difference equations of the form

y(n) = \sum_{i=0}^{L} b_i\, x(n - i) - \sum_{i=1}^{M} a_i\, y(n - i)    (2.26)

In the input-output difference equation above, the output y(n) is given as the linear combination of present and past inputs minus a linear combination of past outputs (feedback term). The parameters a_i and b_i are the filter coefficients, or filter taps, and they control the frequency response characteristics of the digital filter. Filter coefficients are programmable and can be made adaptive (time-varying). A direct-form realization of the digital filter is shown in Figure 2.13.

The filter in Eq. (2.26) is referred to as an infinite-length impulse response (IIR) filter. The impulse response h(n) of the filter shown in Figure 2.13 is given by

h(n) = \sum_{i=0}^{L} b_i\, \delta(n - i) - \sum_{i=1}^{M} a_i\, h(n - i)    (2.27)

The IIR classification stems from the fact that, when the feedback coefficients are non-zero, the impulse response is infinitely long. In a statistical signal representation, Eq. (2.26) is referred to as a time-series model. That is, if the input of this filter is white noise, then y(n) is called an autoregressive moving average (ARMA) process. The feedback coefficients a_i are chosen such that the filter is stable, i.e., a bounded input gives a bounded output (BIBO). An input-output equation of a causal filter can also be written in terms of the impulse response of the filter, i.e.,

y(n) = \sum_{m=0}^{\infty} h(m)\, x(n - m)    (2.28)

Figure 2.13 Direct-form realization of an IIR digital filter.


The impulse response of the filter is associated with its coefficients and can be computed explicitly by programming the difference equation. It can also be obtained in closed form by solving the difference equation.

Example 2.2

Consider the first-order IIR filter shown in Figure 2.14. The difference equation of this digital filter is given by

y(n) = 0.2 x(n) + 0.8 y(n - 1)

The coefficients are b_0 = 0.2 and a_1 = −0.8. The impulse response of this filter is given by

h(n) = 0.2 \delta(n) + 0.8 h(n - 1)

The impulse response can be determined in closed form by solving the above difference equation. Note that h(0) = 0.2 and h(1) = 0.16. Therefore, the closed-form expression for the impulse response is h(n) = 0.2 (0.8)^n u(n). Note also that this first-order IIR filter is BIBO stable because

\sum_{n=-\infty}^{\infty} |h(n)| < \infty    (2.29)
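The closed-form result of Example 2.2 can be verified by programming the difference equation, as the text suggests. The MATLAB lines below (an illustrative check) drive the filter with a unit impulse using filter and compare the output against 0.2(0.8)^n u(n).

% Example 2.2: impulse response of y(n) = 0.2 x(n) + 0.8 y(n-1)
b = 0.2;  a = [1 -0.8];                  % coefficients in the form of Eq. (2.33)
n = (0:49)';
delta = [1; zeros(49,1)];                % unit impulse
h_sim = filter(b, a, delta);             % impulse response by running the difference equation
h_closed = 0.2 * 0.8.^n;                 % closed-form solution 0.2 (0.8)^n u(n)
max_error = max(abs(h_sim - h_closed))   % numerically zero
bibo_sum  = sum(abs(h_closed))           % finite (about 0.2/(1-0.8) = 1), consistent with Eq. (2.29)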

Digital filters with a finite-length impulse response (FIR) are realized by setting the feedback coefficients a_i = 0 for i = 1, 2, ..., M. FIR filters (Figure 2.15) are inherently BIBO stable because their impulse response is always absolutely summable. The output of an FIR filter is a weighted moving average of the input. The simplest FIR filter is the so-called averaging filter that is used in some simple estimation applications. The input-output equation of the averaging filter is given by

y(n) = \frac{1}{L+1} \sum_{i=0}^{L} x(n - i)    (2.30)

Figure 2.14 A first-order IIR digital filter with feedforward coefficient 0.2 and feedback coefficient 0.8.


Figure 2.15 An FIR digital filter.

The impulse response of this filter is equal to h(n) = 1/(L + 1) for n = 0, 1, ..., L. The frequency response of the averaging filter is the DTFT of its impulse response h(n). Therefore, the frequency responses of averaging filters for L = 7 and 15 are normalized versions of the DTFT spectra shown in Figure 2.11(a) and Figure 2.11(b), respectively.

2.7 THE TRANSFER AND THE FREQUENCY RESPONSE FUNCTIONS

The z-transform of the impulse response of a filter is called the transfer function and is given by

H(z) = \sum_{n=-\infty}^{\infty} h(n)\, z^{-n}    (2.31)

Considering the difference equation, we can also obtain the transfer function in terms of filter parameters, i.e.,

X(z) \left( \sum_{i=0}^{L} b_i z^{-i} \right) = Y(z) \left( 1 + \sum_{i=1}^{M} a_i z^{-i} \right)    (2.32)

The ratio of output over input in the z-domain gives the transfer function in terms of the filter coefficients:

H(z) = \frac{Y(z)}{X(z)} = \frac{b_0 + b_1 z^{-1} + \cdots + b_L z^{-L}}{1 + a_1 z^{-1} + \cdots + a_M z^{-M}}    (2.33)

For an FIR filter, the transfer function is given by

H(z) = \sum_{i=0}^{L} b_i z^{-i}    (2.34)

The frequency response function is a special case of the transfer function of the filter. That is, for z = e^{jΩ},

H(e^{j\Omega}) = \sum_{n=-\infty}^{\infty} h(n)\, e^{-j\Omega n}


By considering the difference equation associated with the LTI digital filter, the frequency response can be written as the ratio of two polynomials, i.e.,

H(e^{j\Omega}) = \frac{b_0 + b_1 e^{-j\Omega} + b_2 e^{-j2\Omega} + \cdots + b_L e^{-jL\Omega}}{1 + a_1 e^{-j\Omega} + a_2 e^{-j2\Omega} + \cdots + a_M e^{-jM\Omega}}

Note that for an FIR filter, the frequency response becomes

H(e^{j\Omega}) = b_0 + b_1 e^{-j\Omega} + b_2 e^{-j2\Omega} + \cdots + b_L e^{-jL\Omega}

Example 2.3

Frequency responses of four different first-order filters are shown in Figure 2.16. The frequency responses in Figure 2.16 are plotted up to the fold-over frequency, which is half the sampling frequency. Note from Figure 2.16 that low-pass and high-pass filters can be realized as either FIR or IIR filters. The location of the root of the polynomial of the FIR filter determines where the notch in the frequency response occurs. Therefore, in the top two figures that correspond to the FIR filters, the low-pass filter (top left) has a notch at π rad (zero at z = −1), while the high-pass filter has a notch at

Figure 2.16 Frequency responses of first-order FIR and IIR digital filters: H(z) = 1 + z^{-1}, H(z) = 1 − z^{-1}, H(z) = 1/(1 − 0.9 z^{-1}), and H(z) = 1/(1 + 0.9 z^{-1}).


0 rad (zero at z = 1). The bottom two IIR filters have a pole at z = 0.9 (peak at 0 rad) and z = −0.9 (peak at π rad) for the low-pass and high-pass filters, respectively.
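The four responses of Figure 2.16 can be regenerated with freqz (an illustrative sketch assuming the Signal Processing Toolbox); each filter is specified by the numerator and denominator coefficients of Eq. (2.33).

% Example 2.3: first-order FIR and IIR low-pass/high-pass responses
flt = { [1  1], 1;          % H(z) = 1 + z^-1            (FIR low-pass, notch at pi)
        [1 -1], 1;          % H(z) = 1 - z^-1            (FIR high-pass, notch at 0)
        1, [1 -0.9];        % H(z) = 1/(1 - 0.9 z^-1)    (IIR low-pass, pole at z = 0.9)
        1, [1  0.9] };      % H(z) = 1/(1 + 0.9 z^-1)    (IIR high-pass, pole at z = -0.9)
for i = 1:4
    [H, Omega] = freqz(flt{i,1}, flt{i,2}, 512);
    subplot(2,2,i); plot(Omega/pi, 20*log10(abs(H) + eps)); grid on;
    xlabel('Normalized frequency \Omega (x \pi rad)'); ylabel('|H| (dB)');
end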

2.7.1 Poles, Zeros, and Frequency Response

A z-domain function H(z) can be written in terms of its poles and zeros as follows:

H(z) = G \frac{(z - \zeta_1)(z - \zeta_2) \cdots (z - \zeta_L)}{(z - p_1)(z - p_2) \cdots (z - p_M)} = G \frac{\prod_{i=1}^{L} (z - \zeta_i)}{\prod_{i=1}^{M} (z - p_i)}    (2.35)

where ζ_i and p_i are the zeros and poles of H(z), respectively, and G is a constant. The locations of the poles and zeros affect the shape of the frequency response. The magnitude of the frequency response can be written as

|H(e^{j\Omega})| = G \frac{\prod_{i=1}^{L} |e^{j\Omega} - \zeta_i|}{\prod_{i=1}^{M} |e^{j\Omega} - p_i|}    (2.36)

It is therefore evident that when an isolated zero is close to the unit circle, the magnitude frequency response will assume a small value at that frequency. When an isolated pole is close to the unit circle, it will give rise to a peak in the magnitude frequency response at that frequency. In speech processing, the presence of poles in z-domain representations of the vocal tract has been associated with the speech formants [Rabi78]. In fact, formant synthesizers use the pole locations to form synthesis filters for certain phonemes. On the other hand, the presence of zeros has been associated with the coupling of the nasal tract. For example, zeros associate with nasal sounds such as /m/ and /n/ [Span94].

Example 2.4

For the second-order system below, find the poles and zeros, give a z-domain diagram with the poles and zeros, and sketch the frequency response.

H(z) = \frac{1 - 1.3435 z^{-1} + 0.9025 z^{-2}}{1 - 0.45 z^{-1} + 0.55 z^{-2}}

The poles and zeros appear in conjugate pairs because the coefficients of H(z) are real-valued:

H(z) = \frac{(z - 0.95 e^{j45^{\circ}})(z - 0.95 e^{-j45^{\circ}})}{(z - 0.7416 e^{j72.34^{\circ}})(z - 0.7416 e^{-j72.34^{\circ}})}

The pole-zero diagram and the frequency response are shown in Figure 2.17. Poles give rise to spectral peaks and zeros create spectral valleys in the magnitude of the frequency response. The symmetry around π is due to the fact that roots appear in complex conjugate pairs.
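The pole and zero locations quoted in Example 2.4 can be confirmed numerically with roots, and the magnitude response (from 0 to π) plotted with freqz (Signal Processing Toolbox assumed); the lines below are an illustrative check.

% Example 2.4: poles, zeros, and frequency response of H(z)
b = [1 -1.3435 0.9025];                 % numerator coefficients
a = [1 -0.45   0.55];                   % denominator coefficients
z = roots(b);  p = roots(a);
disp([abs(z) angle(z)*180/pi]);         % zeros: radius 0.95, angles +/- 45 degrees
disp([abs(p) angle(p)*180/pi]);         % poles: radius ~0.7416, angles ~ +/- 72.34 degrees
[H, Omega] = freqz(b, a, 1024);
plot(Omega/pi, 20*log10(abs(H))); grid on;
xlabel('Normalized frequency (x \pi rad)'); ylabel('Magnitude (dB)');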


Figure 2.17 z-domain (pole-zero) and frequency response plots of the second-order system given in Example 2.4.

2.7.2 Examples of Digital Filters for Audio Applications

There are several standard designs for digital filters that are targeted specifically for audio-type applications. These designs include the so-called shelving filters, peaking filters, cross-over filters, and quadrature mirror filter (QMF) bank filters. Low-pass and high-pass shelving filters are used for bass and treble tone controls, respectively, in stereo systems.

Example 2.5

The transfer function of a low-pass shelving filter can be expressed as

H_{lp}(z) = C_{lp} \left( \frac{1 - b_1 z^{-1}}{1 - a_1 z^{-1}} \right)    (2.37)

where

C_{lp} = \frac{1 + k\mu}{1 + k}, \quad b_1 = \left( \frac{1 - k\mu}{1 + k\mu} \right), \quad a_1 = \left( \frac{1 - k}{1 + k} \right), \quad k = \left( \frac{4}{1 + \mu} \right) \tan\!\left( \frac{\Omega_c}{2} \right), \quad \text{and} \quad \mu = 10^{g/20}

Note also that Ω_c = 2πf_c/f_s is the normalized cutoff frequency and g is the gain in decibels (dB).

Example 2.6

The transfer function of a high-pass shelving filter is given by

H_{hp}(z) = C_{hp} \left( \frac{1 - b_1 z^{-1}}{1 - a_1 z^{-1}} \right)    (2.38)

where

C_{hp} = \frac{\mu + p}{1 + p}, \quad b_1 = \left( \frac{\mu - p}{\mu + p} \right), \quad a_1 = \left( \frac{1 - p}{1 + p} \right), \quad p = \left( \frac{1 + \mu}{4} \right) \tan\!\left( \frac{\Omega_c}{2} \right), \quad \text{and} \quad \mu = 10^{g/20}

Again, Ω_c = 2πf_c/f_s is the normalized cutoff frequency and g is the gain in dB. More complex tone controls that operate as graphic equalizers are accomplished using bandpass peaking filters.

Example 2.7

The transfer function of a peaking filter is given by

H_{pk}(z) = C_{pk} \left( \frac{1 + b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}} \right)    (2.39)

where

C_{pk} = \left( \frac{1 + k_q\mu}{1 + k_q} \right), \quad b_1 = \frac{-2\cos(\Omega_c)}{1 + k_q\mu}, \quad b_2 = \frac{1 - k_q\mu}{1 + k_q\mu}, \quad a_1 = \frac{-2\cos(\Omega_c)}{1 + k_q}, \quad a_2 = \frac{1 - k_q}{1 + k_q}

Figure 2.18 Frequency responses of a low-pass shelving filter (a) for different gains and (b) for different cutoff frequencies Ω_c = π/6 and π/4.


Figure 2.19 Frequency responses of a peaking filter (a) for different gains g = 10 dB and 20 dB, (b) for different quality factors Q = 2 and 4, and (c) for different cutoff frequencies Ω_c = π/4 and π/2.

k_q = \left( \frac{4}{1 + \mu} \right) \tan\!\left( \frac{\Omega_c}{2Q} \right) \quad \text{and} \quad \mu = 10^{g/20}

The frequency Ω_c = 2πf_c/f_s is the normalized cutoff frequency, Q is the quality factor, and g is the gain in dB. Example frequency responses of shelving and peaking filters for different gains and cutoff frequencies are given in Figures 2.18 and 2.19.
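The design equations of Examples 2.5-2.7 translate directly into a few lines of code. The MATLAB sketch below is an illustrative implementation of Eqs. (2.37) and (2.39) as printed above (the gain, cutoff, and quality factor values are arbitrary choices, and freqz assumes the Signal Processing Toolbox).

% Low-pass shelving filter (Eq. 2.37) and peaking filter (Eq. 2.39)
g  = 10;  mu = 10^(g/20);                  % gain in dB and linear gain factor
Wc = pi/4;                                 % normalized cutoff frequency Omega_c
% --- low-pass shelving filter ---
k   = (4/(1 + mu)) * tan(Wc/2);
Clp = (1 + k*mu)/(1 + k);
b1  = (1 - k*mu)/(1 + k*mu);
a1  = (1 - k)/(1 + k);
Blp = Clp * [1 -b1];   Alp = [1 -a1];      % H_lp(z) = Clp (1 - b1 z^-1)/(1 - a1 z^-1)
% --- peaking filter ---
Q   = 2;
kq  = (4/(1 + mu)) * tan(Wc/(2*Q));
Cpk = (1 + kq*mu)/(1 + kq);
Bpk = Cpk * [1, -2*cos(Wc)/(1 + kq*mu), (1 - kq*mu)/(1 + kq*mu)];
Apk =       [1, -2*cos(Wc)/(1 + kq),    (1 - kq)/(1 + kq)];
[H1, W]  = freqz(Blp, Alp, 1024);
[H2, W2] = freqz(Bpk, Apk, 1024);
plot(W/pi, 20*log10(abs(H1)), W/pi, 20*log10(abs(H2))); grid on;
xlabel('Normalized frequency (x \pi rad)'); ylabel('Magnitude (dB)');
legend('low-pass shelving', 'peaking');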

Example 2.8

An audio graphic equalizer is designed by cascading peaking filters, as shown in Figure 2.20. The main idea behind the audio graphic equalizer is that it applies a set of peaking filters to modify the frequency spectrum of the input audio signal by dividing its audible frequency spectrum into several frequency bands. Then, the frequency response of each band can be controlled by varying the corresponding peaking filter's gain.


Figure 2.20 A cascaded setup of peaking filters to design an audio graphic equalizer.

2.8 REVIEW OF MULTIRATE SIGNAL PROCESSING

Multirate signal processing (MSP) involves the change of the sampling rate while the signal is in the digital domain. Sampling rate changes have been popular in DSP and audio applications. Depending on the application, changes in the sampling rate may reduce algorithmic and hardware complexity, or increase resolution in certain signal processing operations by introducing additional signal samples. Perhaps the most popular application of MSP is over-sampling analog-to-digital (A/D) and digital-to-analog (D/A) conversion. In over-sampling A/D, the signal is over-sampled, thereby relaxing the anti-aliasing filter design requirements and hence the hardware complexity. The additional time-resolution in the over-sampled signal allows a simple 1-bit delta modulation (DM) quantizer to deliver a digital signal with sufficient resolution, even for high-fidelity audio applications. This reduction of analog hardware complexity comes at the expense of a data rate increase. Therefore, a down-sampling operation is subsequently performed using a DSP chip to reduce the data rate. This reduction in the sampling rate requires a high-precision anti-aliasing digital low-pass filter, along with some other correcting DSP algorithmic steps that are of appreciable complexity. Therefore, the over-sampling analog-to-digital (A/D) conversion, otherwise called Delta-Sigma A/D conversion, involves a process where complexity is transferred from the analog hardware domain to the digital software domain. The reduction of analog hardware complexity is also important in D/A conversion. In that case, the signal is up-sampled and interpolated in the digital domain, thereby reducing the requirements on the analog reconstruction (interpolation) filter.

2.8.1 Down-sampling by an Integer

Multirate signal processing is characterized by two basic operations, namely, up-sampling and down-sampling. Down-sampling involves increasing the sampling period, and hence decreasing the sampling frequency and data rate of the digital signal. A sampling rate reduction by an integer L is represented by

x_d(n) = x(nL)    (2.40)


Given the DTFT transform pairs

x(n) \overset{\mathrm{DTFT}}{\longleftrightarrow} X(e^{j\Omega}) \quad \text{and} \quad x_d(n) \overset{\mathrm{DTFT}}{\longleftrightarrow} X_d(e^{j\Omega})    (2.41)

it can be shown [Oppe99] that the DTFTs of the original and decimated signals are related by

X_d(e^{j\Omega}) = \frac{1}{L} \sum_{l=0}^{L-1} X(e^{j(\Omega - 2\pi l)/L})    (2.42)

Therefore, down-sampling introduces L copies of the original DTFT that are both amplitude and frequency scaled by L. It is clear that the additional copies may introduce aliasing. Aliasing can be eliminated if the DTFT of the original signal is bandlimited to a frequency π/L, i.e.,

X(e^{j\Omega}) = 0, \quad \frac{\pi}{L} \leq |\Omega| \leq \pi    (2.43)

An example of the DTFTs of the signal during the down-sampling process is shown in Figure 2.21. To approximate the condition in Eq. (2.43), a digital anti-aliasing filter is used. The down-sampling process is illustrated in Figure 2.22.

Figure 2.21 (a) The original and the down-sampled signal in the time domain and (b) the corresponding DTFTs.

Figure 2.22 Down-sampling by an integer L: the signal is first passed through an anti-aliasing filter H_D(e^{jΩ}) and then decimated by L.


In Figure 2.22, H_D(e^{jΩ}) is given by

H_D(e^{j\Omega}) = \begin{cases} 1, & 0 \leq |\Omega| \leq \pi/L \\ 0, & \pi/L < |\Omega| \leq \pi \end{cases}    (2.44)
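The role of the anti-aliasing filter H_D(e^{jΩ}) in Eq. (2.44) can be demonstrated numerically. The MATLAB sketch below is illustrative only (it assumes the Signal Processing Toolbox functions downsample and decimate, and the tone frequencies are arbitrary choices); a two-tone signal is decimated by L = 4 with and without prefiltering, and the tone lying above the new Nyquist frequency folds back into the retained band only in the unfiltered case.

% Down-sampling by L = 4 with and without an anti-aliasing prefilter
fs = 32000; L = 4;
n  = (0:8191)';
x  = cos(2*pi*1000*n/fs) + cos(2*pi*6500*n/fs);  % 1 kHz (kept) and 6.5 kHz (> fs/(2L) = 4 kHz)
xd_raw  = downsample(x, L);                      % plain decimation: 6.5 kHz aliases to 1.5 kHz
xd_filt = decimate(x, L);                        % low-pass prefilter (cutoff near pi/L), then decimate
Nfft = 2048; fd = (0:Nfft/2-1)*(fs/L)/Nfft;
Xraw  = abs(fft(xd_raw(1:Nfft)));
Xfilt = abs(fft(xd_filt(1:Nfft)));
plot(fd, 20*log10(Xraw(1:Nfft/2)+eps), '--', fd, 20*log10(Xfilt(1:Nfft/2)+eps));
xlabel('Frequency (Hz)'); ylabel('Magnitude (dB)');
legend('no prefilter (aliased)', 'with anti-aliasing filter'); grid on;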

2.8.2 Up-sampling by an Integer

Up-sampling involves reducing the sampling period by introducing additional, regularly spaced samples in the signal sequence:

x_u(n) = \sum_{m=-\infty}^{\infty} x(m)\, \delta(n - mM) = \begin{cases} x(n/M), & n = 0, \pm M, \pm 2M, \ldots \\ 0, & \text{else} \end{cases}    (2.45)

The introduction of zero-valued samples in the up-sampled signal x_u(n) increases the sampling rate of the signal. The DTFT of the up-sampled signal relates to the DTFT of the original signal as follows:

X_u(e^{j\Omega}) = X(e^{j\Omega M})    (2.46)

Figure 2.23 (a) The original and the up-sampled signal in the time domain and (b) the corresponding DTFTs.

Figure 2.24 Up-sampling by an integer M: zero insertion followed by an interpolation filter H_U(e^{jΩ}).


Figure 2.25 Sampling rate changes by a noninteger factor: up-sampling by L, a low-pass filter H_LP(e^{jΩ}) with cutoff Ω_c = π/max(L, M), and down-sampling by M.

Therefore, the DTFT of the up-sampled signal, X_u(e^{jΩ}), is described by a series of frequency-compressed images of the DTFT of the original signal located at integer multiples of 2π/M rad (see Figure 2.23). To complete the up-sampling process, an interpolation stage is required that fills appropriate values in the time domain to replace the artificial zero-valued samples introduced by the sampling. In Figure 2.24, H_U(e^{jΩ}) is given by

H_U(e^{j\Omega}) = \begin{cases} M, & 0 \leq |\Omega| \leq \pi/M \\ 0, & \pi/M < |\Omega| \leq \pi \end{cases}    (2.47)

2.8.3 Sampling Rate Changes by Noninteger Factors

Sampling rate changes by noninteger factors can be accomplished by cascading up-sampling and down-sampling operations. The up-sampling stage precedes the down-sampling stage, and the low-pass interpolation and anti-aliasing filters are combined into one filter whose bandwidth is the minimum of the two filters (Figure 2.25). For example, suppose we want a noninteger sampling period modification such that T_new = 12T/5. In this case, we choose L = 12 and M = 5. Hence, the bandwidth of the low-pass filter is the minimum of π/12 and π/5. A sketch of this cascade is given below.
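The following hedged sketch implements the cascade of Figure 2.25 for the T_new = 12T/5 example: up-sample by M = 5, apply a single low-pass filter with cutoff π/max(L, M) and gain M, and down-sample by L = 12 (NumPy/SciPy assumed; the filter length is arbitrary).

```python
import numpy as np
from scipy import signal

def resample_rational(x, M, L, numtaps=121):
    """Noninteger rate change by M/L: up-sample by M, combined low-pass filter, down-sample by L."""
    xu = np.zeros(len(x) * M)
    xu[::M] = x
    h = M * signal.firwin(numtaps, 1.0 / max(L, M))   # cutoff pi/max(L, M), passband gain M
    return signal.lfilter(h, 1.0, xu)[::L]

x = np.cos(2 * np.pi * np.arange(1200) / 60)
y = resample_rational(x, M=5, L=12)                   # T_new = 12T/5, i.e., 5/12 of the original rate
print(len(x), len(y))                                 # 1200 500
```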

2.8.4 Quadrature Mirror Filter Banks

The analysis of the signal in a perceptual audio coding system is usually accomplished using either filter banks or frequency-domain transformations, or a combination of both. The filter bank is used to decompose the signal into several frequency subbands. Different coding strategies are then derived and implemented in each subband. This technique is known as subband coding in the coding literature (Figure 2.26).

One of the important aspects of subband decomposition is the aliasing between the different subbands because of the imperfect frequency responses of the digital filters (Figure 2.27). These aliasing problems prevented early analysis-synthesis filter banks from perfectly reconstructing the original input signal in the absence of quantization effects. In 1977, a solution to this problem was provided by combining down-sampling and up-sampling operations with appropriate filter designs [Este77]. The perfect reconstruction filter bank design came to be known as a quadrature mirror filter (QMF) bank. An analysis-synthesis QMF consists of anti-aliasing filters, down-sampling stages, up-sampling stages, and interpolation filters. A two-band QMF structure is shown in Figure 2.28.


Figure 2.26 Signal coding in subbands: the input x(n) is split into N bands by the bandpass filters BP1-BPN, each subband signal (at rates 2f_1, ..., 2f_N) is encoded, the encoded streams are multiplexed over the channel, and the demultiplexer, decoders, and bandpass filters reconstruct x̂(n).

Figure 2.27 Aliasing effects in a two-band filter bank: the magnitude responses |H_k(e^{jΩ})| of the lower-band filter H_0 and the upper-band filter H_1 overlap around Ω = π/2.

Figure 2.28 A two-band QMF structure: in the analysis stage, the input x(n) is filtered by H_0(z) and H_1(z) and down-sampled by 2 to give x_d0(n) and x_d1(n); in the synthesis stage, the subband signals are up-sampled by 2, filtered by F_0(z) and F_1(z), and summed to form x̂(n).

The analysis stage consists of the filters H_0(z) and H_1(z) and down-sampling operations. The synthesis stage includes up-sampling stages and the filters F_0(z) and F_1(z). If the process includes quantizers, those will be placed after the down-sampling stages. We first examine the filter bank without the quantization stage. The input signal x(n) is first filtered and then down-sampled. The DTFT of the down-sampled signal can be shown to be
$$X_{dk}(e^{j\Omega}) = \frac{1}{2}\left( X_k(e^{j\Omega/2}) + X_k(e^{j(\Omega-2\pi)/2}) \right), \quad k = 0, 1 \qquad (2.48)$$

Figure 2.29 presents plots of the DTFTs of the original and down-sampled signals. It can be seen that an aliasing term is present.

The reconstructed signal x̂(n) is derived by adding the contributions from the up-sampling and interpolation of the low band and the high band. It can be shown

Figure 2.29 DTFTs of the original and down-sampled signals to illustrate aliasing: the spectrum of the down-sampled signal contains the stretched image X(e^{jΩ/2}) plus the aliasing term X(−e^{jΩ/2}).

Figure 2.30 Tree-structured QMF bank: two-band QMF blocks QMF(1,1), QMF(2,1)-QMF(2,2), QMF(4,1)-QMF(4,4), and QMF(8,1)-QMF(8,4) are cascaded over successive stages (stage 0 through stage 3, levels 1 and 2) to split the signal into progressively narrower subbands.


that the reconstructed signal in the z-domain has the form
$$\hat{X}(z) = \frac{1}{2}\left( H_0(z)F_0(z) + H_1(z)F_1(z) \right) X(z) + \frac{1}{2}\left( H_0(-z)F_0(z) + H_1(-z)F_1(z) \right) X(-z) \qquad (2.49)$$

The signal X(−z) in Eq. (2.49) is associated with the aliasing term. The aliasing term can be cancelled by designing the filters to have the following mirror symmetries:
$$F_0(z) = H_1(-z), \qquad F_1(z) = -H_0(-z) \qquad (2.50)$$

Under these conditions, the overall transfer function of the filter bank can then be written as
$$T(z) = \frac{1}{2}\left( H_0(z)F_0(z) + H_1(z)F_1(z) \right) \qquad (2.51)$$

If T(z) = 1, then the filter bank allows perfect reconstruction. Perfect delay-less reconstruction is not realizable, but an all-pass filter bank with linear phase characteristics can be designed easily. For example, the choice of first-order FIR filters
$$H_0(z) = 1 + z^{-1}, \qquad H_1(z) = 1 - z^{-1} \qquad (2.52)$$

results in alias-free reconstruction. The overall transfer function of the QMF in this case is
$$T(z) = \frac{1}{2}\left( (1 + z^{-1})^2 - (1 - z^{-1})^2 \right) = 2z^{-1} \qquad (2.53)$$

Therefore, the signal is reconstructed within a delay of one sample and with an overall gain of 2. QMF filter banks can be cascaded to form tree structures. If we represent the analysis stage of a filter bank as a block that divides the signal into low- and high-frequency subbands, then by cascading several such blocks we can divide the signal into smaller subbands. This is shown in Figure 2.30. QMF banks are part of many subband and hybrid subband/transform audio and image/video coding standards [Thei87] [Stoll88] [Veld89] [John96]. Note that the theory of quadrature mirror filter banks has been associated with wavelet transform theory [Wick94] [Akan96] [Stra96].
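A hedged numerical check of Eqs. (2.50)-(2.53), assuming NumPy/SciPy: with H_0(z) = 1 + z^{-1}, H_1(z) = 1 − z^{-1} and the mirror filters of Eq. (2.50), the two-band analysis-synthesis chain returns 2x(n−1).

```python
import numpy as np
from scipy import signal

h0, h1 = np.array([1.0, 1.0]), np.array([1.0, -1.0])   # Eq. (2.52)
f0, f1 = np.array([1.0, 1.0]), np.array([-1.0, 1.0])   # Eq. (2.50): F0(z) = H1(-z), F1(z) = -H0(-z)

x = np.random.randn(64)

# analysis: filter, then down-sample by 2
xd0 = signal.lfilter(h0, 1.0, x)[::2]
xd1 = signal.lfilter(h1, 1.0, x)[::2]

# synthesis: up-sample by 2, filter, and sum
xu0 = np.zeros(len(x)); xu0[::2] = xd0
xu1 = np.zeros(len(x)); xu1[::2] = xd1
xhat = signal.lfilter(f0, 1.0, xu0) + signal.lfilter(f1, 1.0, xu1)

# Eq. (2.53): xhat(n) = 2 x(n-1)
print(np.allclose(xhat[1:], 2 * x[:-1]))   # True
```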

2.9 DISCRETE-TIME RANDOM SIGNALS

In signal processing, we generally classify signals as deterministic or random. A signal is defined as deterministic if its values at any point in time can be defined precisely by a mathematical equation. For example, the signal x(n) = sin(πn/4) is deterministic. On the other hand, random signals have uncertain values and are usually described using their statistics. A discrete-time random process involves an ensemble of sequences x(n, m), where m is the index of the m-th sequence in the ensemble and n is the time index. In practice, one does not have access to all possible sample signals of a random process. Therefore, the determination of the statistical structure of a random process is often done from the observed waveform. This approach becomes valid and simplifies random signal analysis if the random process at hand is ergodic. Ergodicity implies that the statistics of a random process can be determined using time-averaging operations on a single observed signal. Ergodicity requires that the statistics of the signal are independent of the time of observation. Random signals whose statistical structure is independent of the time of origin are generally called stationary. More specifically, a random process is said to be wide-sense stationary if its statistics up to the second order are independent of time. Although it is difficult to show analytically that signals with various statistical distributions are ergodic, it can be shown that a stationary zero-mean Gaussian process is ergodic up to second order. In many practical applications involving a stationary or quasi-stationary process, it is assumed that the process is also ergodic. The definitions of signal statistics presented henceforth will focus on real-valued stationary processes that are ergodic.

The mean value μ_x of the discrete-time wide-sense stationary signal x(n) is a first-order statistic that is defined as the expected value of x(n), i.e.,
$$\mu_x = E[x(n)] = \lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n) \qquad (2.54)$$

where E[·] denotes the statistical expectation. The assumption of ergodicity allows us to determine the mean value with the time-averaging process shown on the right-hand side of Eq. (2.54). The mean value can be viewed as the DC component of the signal. In many applications involving speech and audio signals, the DC component does not carry any useful information and is either ignored or filtered out.

The variance σ_x² is a second-order signal statistic and is a measure of signal dispersion from its mean value. The variance is defined as
$$\sigma_x^2 = E[(x(n) - \mu_x)(x(n) - \mu_x)] = E[x^2(n)] - \mu_x^2 \qquad (2.55)$$

The square root of the variance is the standard deviation of the signal. For a zero-mean signal, the variance is simply E[x²(n)]. The autocorrelation of a signal is a second-order statistic defined by
$$r_{xx}(m) = E[x(n+m)x(n)] = \lim_{N\to\infty} \frac{1}{2N+1} \sum_{n=-N}^{N} x(n+m)x(n) \qquad (2.56)$$

where m is called the autocorrelation lag index. The autocorrelation can be viewed as a measure of predictability of the signal, in the sense that a future value of a correlated signal can be predicted by processing information associated with its past values. For example, speech is a correlated waveform, and hence it can be modeled by linear prediction mechanisms that predict its current value from a linear combination of past values. Correlation can also be viewed as a measure of redundancy in the signal, in that correlated waveforms can be parameterized in terms of statistical time-series models and hence represented by a reduced number of information bits.

The autocorrelation sequence r_xx(m) is symmetric and positive definite, i.e.,
$$r_{xx}(-m) = r_{xx}(m), \qquad r_{xx}(0) \geq |r_{xx}(m)| \qquad (2.57)$$

Example 2.9

The autocorrelation of a white noise signal is
$$r_{ww}(m) = \sigma_w^2\, \delta(m)$$
where σ_w² is the variance of the noise. The fact that the autocorrelation of white noise is the unit impulse implies that white noise is an uncorrelated signal.

Example 2.10

The autocorrelation of the output of the second-order FIR digital filter H(z) (in Figure 2.31), when the input is white noise of zero mean and unit variance, is
$$r_{yy}(m) = E[y(n+m)y(n)] = \delta(m+2) + 2\delta(m+1) + 3\delta(m) + 2\delta(m-1) + \delta(m-2)$$
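A small simulation consistent with Example 2.10 (NumPy assumed): the time-average estimates of r_yy(m), justified by the ergodicity assumption above, approach the values 3, 2, 1 for m = 0, 1, 2.

```python
import numpy as np

np.random.seed(0)
h = np.array([1.0, 1.0, 1.0])               # H(z) = 1 + z^-1 + z^-2 (Figure 2.31)
w = np.random.randn(200000)                  # white noise, zero mean, unit variance
y = np.convolve(w, h)[:len(w)]

# time-average estimate of ryy(m), m = 0, 1, 2 (ergodicity assumed)
for m in range(3):
    ryy = np.mean(y[m:] * y[:len(y) - m])
    print(m, round(ryy, 2))                  # approximately 3, 2, 1
```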

Cross-correlation is a measure of similarity between two signals. The cross-correlation of a signal x(n) relative to a signal y(n) is given by
$$r_{xy}(m) = E[x(n+m)y(n)] \qquad (2.58)$$

Similarly, the cross-correlation of a signal y(n) relative to a signal x(n) is given by
$$r_{yx}(m) = E[y(n+m)x(n)] \qquad (2.59)$$

Note that the symmetry property of the cross-correlation is
$$r_{yx}(m) = r_{xy}(-m) \qquad (2.60)$$

Figure 2.31 FIR filter H(z) = 1 + z^{-1} + z^{-2} excited by white noise: input x(n), output y(n).


Figure 2.32 The PSD of white noise: the autocorrelation r_ww(m) = σ_w² δ(m) and its DTFT, the flat power spectral density R_ww(e^{jΩ}) = σ_w².

The power spectral density (PSD) of a random signal is defined as the DTFT of the autocorrelation sequence:
$$R_{xx}(e^{j\Omega}) = \sum_{m=-\infty}^{\infty} r_{xx}(m)\, e^{-j\Omega m} \qquad (2.61)$$

The PSD is real-valued and positive, and it describes how the power of the random process is distributed across frequency.

Example 2.11

The PSD of a white noise signal (see Figure 2.32) is
$$R_{ww}(e^{j\Omega}) = \sigma_w^2 \sum_{m=-\infty}^{\infty} \delta(m)\, e^{-j\Omega m} = \sigma_w^2$$

2.9.1 Random Signals Processed by LTI Digital Filters

In Example 2.10, we determined the autocorrelation of the output of a second-order FIR digital filter when the excitation is white noise. In this section, we briefly review the characterization of the statistics of the output of a causal LTI digital filter that is excited by a random signal. The output of a causal digital filter can be computed by convolving the input with its impulse response, i.e.,
$$y(n) = \sum_{i=0}^{\infty} h(i)\, x(n-i) \qquad (2.62)$$

Based on the convolution sum, we can derive the following expressions for the mean, the autocorrelation, the cross-correlation, and the power spectral density of the steady-state output of an LTI digital filter:
$$\mu_y = \sum_{k=0}^{\infty} h(k)\,\mu_x = H(e^{j\Omega})\big|_{\Omega=0}\,\mu_x \qquad (2.63)$$
$$r_{yy}(m) = \sum_{k=0}^{\infty} \sum_{i=0}^{\infty} h(k)h(i)\, r_{xx}(m-k+i) \qquad (2.64)$$
$$r_{yx}(m) = \sum_{i=0}^{\infty} h(i)\, r_{xx}(m-i) \qquad (2.65)$$
$$R_{yy}(e^{j\Omega}) = |H(e^{j\Omega})|^2 R_{xx}(e^{j\Omega}) \qquad (2.66)$$

These equations describe the statistical behavior of the output at steady state. During the transient state of the filter, the output is essentially nonstationary, i.e.,
$$\mu_y(n) = \sum_{k=0}^{n} h(k)\,\mu_x \qquad (2.67)$$

Example 2.12

Determine the output variance of an LTI digital filter excited by white noise of zero mean and unit variance.
$$\sigma_y^2 = r_{yy}(0) = \sum_{k=0}^{\infty} \sum_{i=0}^{\infty} h(k)h(i)\,\delta(i-k) = \sum_{k=0}^{\infty} h^2(k)$$

Example 2.13

Determine the variance, the autocorrelation, and the PSD of the output of the digital filter in Figure 2.33 when its input is white noise of zero mean and unit variance.

The impulse response and transfer function of this first-order IIR filter are
$$h(n) = 0.8^n u(n), \qquad H(z) = \frac{1}{1 - 0.8 z^{-1}}$$
The variance of the output at steady state is
$$\sigma_y^2 = \sum_{k=0}^{\infty} h^2(k)\,\sigma_x^2 = \sum_{k=0}^{\infty} 0.64^k = \frac{1}{1 - 0.64} = 2.78$$

Figure 2.33 An example of an IIR filter: y(n) = x(n) + 0.8 y(n−1).


The autocorrelation of the output is given by
$$r_{yy}(m) = \sum_{k=0}^{\infty} \sum_{i=0}^{\infty} 0.8^{k+i}\,\delta(m-k+i)$$
It is easy to see that the unit impulse will be non-zero only for k = m + i, and hence
$$r_{yy}(m) = \sum_{i=0}^{\infty} 0.8^{m+2i}, \quad m \geq 0$$
Taking into account the autocorrelation symmetry,
$$r_{yy}(m) = r_{yy}(-m) = 2.78\,(0.8)^{|m|} \quad \forall m$$
Finally, the PSD is given by
$$R_{yy}(e^{j\Omega}) = |H(e^{j\Omega})|^2 R_{xx}(e^{j\Omega}) = \frac{1}{|1 - 0.8 e^{-j\Omega}|^2}$$
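A quick simulation of Example 2.13 (NumPy/SciPy assumed): the measured output variance and lag-1 autocorrelation of the filter driven by unit-variance white noise approach the closed-form values.

```python
import numpy as np
from scipy import signal

np.random.seed(1)
x = np.random.randn(500000)                    # white noise, zero mean, unit variance
y = signal.lfilter([1.0], [1.0, -0.8], x)      # y(n) = x(n) + 0.8 y(n-1)

var_y = np.var(y)
ryy1 = np.mean(y[1:] * y[:-1])
print(round(var_y, 2))                         # approx 2.78 = 1 / (1 - 0.64)
print(round(ryy1, 2))                          # approx 2.78 * 0.8 = 2.22
```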

2.9.2 Autocorrelation Estimation from Finite-Length Data

Estimators of signal statistics given N observations are based on the assumption of stationarity and ergodicity. The following is an estimator of the autocorrelation (based on sample averaging) of a signal x(n):
$$\hat{r}_{xx}(m) = \frac{1}{N} \sum_{n=0}^{N-m-1} x(n+m)x(n), \quad m = 0, 1, 2, \ldots, N-1 \qquad (2.68)$$

Correlations for negative lags can be obtained using the symmetry property in Eq. (2.57). The estimator above is asymptotically unbiased (fixed m and N ≫ m), but for small N it is biased.
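A minimal sketch of the sample-average estimator of Eq. (2.68), assuming NumPy; negative lags would be filled in via the symmetry of Eq. (2.57).

```python
import numpy as np

def autocorr_biased(x, max_lag):
    """Biased sample-average autocorrelation estimate, Eq. (2.68)."""
    N = len(x)
    return np.array([np.sum(x[m:] * x[:N - m]) / N for m in range(max_lag + 1)])

x = np.random.randn(1024)
r = autocorr_biased(x, 4)
print(r[0])          # close to the signal variance for zero-mean x
```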

2.10 SUMMARY

A brief review of some of the essentials of signal processing was presented in this chapter. In particular, some of the important concepts covered in this chapter include:

• Continuous Fourier transform
• Spectral leakage effects
• Convolution, sampling, and aliasing issues
• Discrete-time Fourier transform and z-transform
• The DFT, FFT, DCT, and STFT basics
• IIR and FIR filter representations
• Pole/zero and frequency response interpretations
• Shelving and peaking filters, audio graphic equalizers
• Down-sampling and up-sampling
• QMF banks and alias-free reconstruction
• Discrete-time random signal processing review

PROBLEMS

2.1 Determine the continuous Fourier transform (CFT) of a pulse described by
$$x(t) = u(t+1) - u(t-1)$$
where u(t) is the unit step function.

2.2 State and derive the CFT properties of duality, time shift, modulation, and convolution.

2.3 For the circuit shown in Figure 2.34(a) and for RC = 1:
a. Write the input-output differential equation.
b. Determine the impulse response in closed form by solving the differential equation.
c. Write the frequency response function.
d. Determine the steady-state response y(t) for x(t) = sin(10t).

Figure 2.34 (a) A simple RC circuit, (b) the input signal x(t) for problem 2.3(e), and (c) the input signal for problem 2.3(f).


e. Given x(t) as shown in Figure 2.34(b), find the circuit output y(t) using convolution.
f. Determine the Fourier series of the output y(t) of the RC circuit for the input shown in Figure 2.34(c).

2.4 Determine the CFT of p(t) = Σ_{n=-∞}^{∞} δ(t − nT_s). Given x_s(t) = x(t)p(t), derive the following:
$$X_s(\omega) = \frac{1}{T_s} \sum_{k=-\infty}^{\infty} X(\omega - k\omega_s)$$
$$x(t) = \sum_{n=-\infty}^{\infty} x(nT_s)\,\operatorname{sinc}\!\big(B(t - nT_s)\big)$$
where X(ω) and X_s(ω) are the spectra of the ideally bandlimited and uniformly sampled signals, respectively, and ω_s = 2π/T_s. (Refer to Figure 2.8 for variable definitions.)

2.5 Determine the z-transforms of the following causal signals:
a. sin(n)
b. δ(n) + δ(n − 1)
c. pⁿ sin(n)
d. u(n) − u(n − 9)

2.6 Determine the impulse and frequency responses of the averaging filter
$$h(n) = \frac{1}{L+1}, \quad n = 0, 1, \ldots, L$$
for L = 9.

2.7 Show that the IDFT can be derived as a least-squares signal matching problem, where N points in the time domain are matched by a linear combination of N sampled complex sinusoids.

2.8 Given the transfer function H(z) = (z − 1)²/(z² + 0.81):
a. Determine the impulse response h(n).
b. Determine the steady-state response due to the sinusoid sin(πn/4).

2.9 Derive the decimation-in-time FFT algorithm and determine the number of complex multiplications required for an FFT size of N = 1024.

2.10 Derive the following expression for a simple two-band QMF:
$$X_{dk}(e^{j\Omega}) = \frac{1}{2}\left( X_k(e^{j\Omega/2}) + X_k(e^{j(\Omega - 2\pi)/2}) \right), \quad k = 0, 1$$
Refer to Figure 2.28 for variable definitions. Give and justify the conditions for alias-free reconstruction in a simple QMF bank.


2.11 Design a tree-structured uniform QMF bank that will divide the 0-20 kHz spectrum into eight uniform subbands. Give appropriate figures and denote on the branches the range of frequencies.

2.12 Modify your design in problem 2.11 and give one possible realization of a simple nonuniform tree-structured QMF bank that will divide the 0-20 kHz spectrum into eight subbands whose bandwidth increases with the center frequency.

2.13 Design a low-pass shelving filter for the following specifications: f_s = 16 kHz, f_c = 4 kHz, and g = 10 dB.

2.14 Design peaking filters for the following cases: (a) Ω_c = π/4, Q = 2, g = 10 dB; and (b) Ω_c = π/2, Q = 3, g = 5 dB. Give the frequency responses of the designed peaking filters.

2.15 Design a five-band digital audio equalizer using the concept of peaking digital filters. Select the center frequencies f_c as follows: 500 Hz, 1500 Hz, 4000 Hz, 10 kHz, and 16 kHz; the sampling frequency f_s = 44.1 kHz; and the corresponding peaking filter gains as 10 dB. Choose a constant Q for all the peaking filters.

2.16 Derive equations (2.63) and (2.64).

2.17 Derive equation (2.66).

2.18 Show that the PSD is real-valued and positive.

2.19 Show that the estimator (2.68) of the autocorrelation is biased.

2.20 Show that the estimator (2.68) provides autocorrelations such that r_xx(0) ≥ |r_xx(m)|.

2.21 Provide an unbiased autocorrelation estimator by modifying the estimator in (2.68).

2.22 A digital filter with impulse response h(n) = 0.7ⁿu(n) is excited by white Gaussian noise of zero mean and unit variance. Determine the mean and variance of the output of the digital filter in closed form during the transient and steady states.

2.23 The filter H(z) = z/(z − 0.8) is excited by white noise of zero mean and unit variance.
a. Determine all the autocorrelation values at the output of the filter at steady state.
b. Determine the PSD at the output.

COMPUTER EXERCISES

Use the speech file 'Ch2speech.wav' from the book website for all the computer exercises in this chapter.


2.24 Consider the two-band QMF bank shown in Figure 2.35. In this figure, x(n) denotes speech frames of 256 samples and x̂(n) denotes the synthesized speech frames.
a. Design the transfer functions F0(z) and F1(z) such that aliasing is cancelled. Also calculate the overall delay of the QMF bank.
b. Select an arbitrary voiced speech frame from Ch2speech.wav. Give time-domain and frequency-domain plots of x_d0(n) and x_d1(n) for that particular frame. Comment on the frequency-domain plots with regard to low-pass/high-pass band-splitting.

Figure 2.35 A two-band QMF bank with analysis filters H_0(z), H_1(z) and synthesis filters F_0(z), F_1(z). Given H_1(z) = 1 + z^{-1} and H_0(z) = 1 − z^{-1}, choose F_0(z) and F_1(z) such that the aliasing term can be cancelled.

Figure 2.36 Speech synthesis from a select number (subset) of FFT components: x(n) (a 256-sample frame) is transformed by a size N = 256 FFT to X(k); L of the N components are selected to form X′(k), and an inverse FFT of size N = 256 produces the synthesized frame x′(n).

Table 2.2 Signal-to-noise ratio (SNR) and MOS values.

Number of FFT components, L | SNR_overall | Subjective evaluation: MOS (mean opinion score) on a scale of 1-5 for the entire speech record
16                          |             |
128                         |             |


c. Repeat step (b) for x_1(n) and x_d1(n) in order to compare the signals before and after the down-sampling stage.
d. Calculate the SNR between the input speech record x(n) and the synthesized speech record x̂(n). Use the following equation to compute the SNR:
$$\mathrm{SNR} = 10 \log_{10}\left( \frac{\sum_n x^2(n)}{\sum_n \left(x(n) - \hat{x}(n)\right)^2} \right) \ \text{(dB)}$$
Listen to the synthesized speech record and comment on its quality.
e. Choose a low-pass F_0(z) and a high-pass F_1(z) such that aliasing occurs. Compute the SNR. Listen to the synthesized speech and describe its perceptual quality relative to the output speech in step (d). (Hint: Use first-order IIR filters.)

2.25 In Figure 2.36, x(n) denotes speech frames of 256 samples and x′(n) denotes the synthesized speech frames. For L = N (= 256), the synthesized speech will be identical to the input speech. In this computer exercise, you need to perform speech synthesis on a frame-by-frame basis from a select number (subset) of FFT components, i.e., L < N. We will use two methods for the FFT component selection: (i) Method 1, selecting the first L components, including their conjugate-symmetric ones, out of a total of N components; and (ii) Method 2, the least-squares method (a peak-picking method that selects the L components that minimize the sum of squared errors).
a. Use L = 64 and Method 1 for component selection. Perform speech synthesis and give time-domain plots of both the input and output speech records.
b. Repeat the above step using the peak-picking Method 2 (choose L peaks, including symmetric components, in the FFT magnitude spectrum). List the SNR values in both cases. Listen to the output files corresponding to (a) and (b) and provide a subjective evaluation (on a MOS scale of 1-5). To calibrate the process, think of wireline telephone quality (toll) as 4 and cellphone quality as around 3.7.
c. Perform speech synthesis for (i) L = 16 and (ii) L = 128. Use the peak-picking Method 2 for the FFT component selection. Compute the overall SNR values and provide a subjective evaluation of the output speech for both cases. Tabulate your results in Table 2.2.

CHAPTER 3

QUANTIZATION AND ENTROPY CODING

3.1 INTRODUCTION

This chapter provides an introduction to waveform quantization and entropy coding algorithms. Waveform quantization deals with the digital, or more specifically binary, representation of signals. All audio encoding algorithms typically include a quantization module. Theoretical aspects of waveform quantization methods were established about fifty years ago [Shan48]. Waveform quantization can be (i) memoryless or with memory, depending upon whether the encoding rules rely on past samples, and (ii) uniform or nonuniform, based on the step size or the quantization (discretization) levels employed. Pulse code modulation (PCM) [Oliv48] [Jaya76] [Jaya84] [Span94] is a memoryless method for discrete-time, discrete-amplitude quantization of analog waveforms. On the other hand, differential PCM (DPCM), delta modulation (DM), and adaptive DPCM (ADPCM) have memory.

Waveform quantization can also be classified as scalar or vector. In scalar quantization, each sample is quantized individually, as opposed to vector quantization, where a block of samples is quantized jointly. Scalar quantization [Jaya84] methods include PCM, DPCM, DM, and their adaptive versions. Several vector quantization (VQ) schemes have been proposed, including the VQ [Lind80], the split-VQ [Pali91] [Pali93], and the conjugate structure-VQ [Kata93] [Kata96]. Quantization can be parametric or nonparametric. In nonparametric quantization, the actual signal is quantized. Parametric representations are generally based on signal transformations (often unitary) or on source-system signal models.


A bit-allocation algorithm is typically employed to compute the number of quantization bits required to encode an audio segment. Several bit-allocation schemes have been proposed over the years; these include bit allocation based on the noise-to-mask ratio (NMR) and masking thresholds [Bran87a] [John88a] [ISOI92], perceptually motivated bit allocation [Vora97] [Naja00], and dynamic bit allocation based on signal statistics [Jaya84] [Rams86] [Shoh88] [West88] [Beat89] [Madi97]. Note that the NMR-based perceptual bit-allocation scheme [Bran87a] is one of the most popular techniques and is embedded in several audio coding standards (e.g., the ISO/IEC MPEG codec series).

In audio compression, entropy coding techniques are employed in conjunction with the quantization and bit-allocation modules in order to obtain improved coding efficiencies. Unlike the DPCM and ADPCM techniques, which remove redundancy by exploiting the correlation of the signal, entropy coding schemes exploit the likelihood of symbol occurrence [Cove91]. Entropy is a measure of uncertainty of a random variable. For example, consider two random variables x and y and two random events A and B. For the random variable x, let the probability of occurrence of the event A be p_x(A) = 0.5 and of the event B be p_x(B) = 0.5. Similarly, define p_y(A) = 0.99999 and p_y(B) = 0.00001. The random variable x has a high uncertainty measure, i.e., it is very hard to predict whether event A or B is likely to occur. On the contrary, in the case of the random variable y, the event A is more likely to occur, and therefore we have less uncertainty relative to the random variable x. In entropy coding, the information symbols are mapped into codes based on the frequency of each symbol. Several entropy-coding schemes have been proposed, including Huffman coding [Huff52], Rice coding [Rice79], Golomb coding [Golo66], arithmetic coding [Riss79] [Howa94], and Lempel-Ziv coding [Ziv77]. These entropy coding schemes are typically called noiseless. A noiseless coding system is able to reconstruct the signal perfectly from its coded representation. In contrast, a coding scheme incapable of perfect reconstruction is called lossy.

In the rest of the chapter, we provide an overview of the quantization-bit allocation-entropy coding (QBE) framework. We also provide background on probabilistic signal structures, and we show how they are exploited in quantization algorithms. Finally, we introduce vector quantization basics.

3.1.1 The Quantization-Bit Allocation-Entropy Coding Module

After perceptual irrelevancies in an audio frame are exploited, a quantization-bit allocation-entropy coding (QBE) module is employed to exploit statistical correlation. In Figure 3.1, typical output parameters from stage I include the transform coefficients, the scale factors, and the residual error. These parameters are first quantized using one of the aforementioned PCM, DPCM, or VQ schemes. The number of bits allocated per frame is typically specified by a bit-allocation module that uses perceptual masking thresholds.

The quantized parameters are entropy coded using an explicit noiseless coding stage for final redundancy reduction. Huffman or Rice codes are available in the form of look-up tables at the entropy coding stage. Entropy coders are employed


Figure 3.1 A typical QBE module employed in audio coding: the input audio signal passes through stage I (psychoacoustic analysis and MDCT analysis or prediction), which produces the transform coefficients, scale factors, and residual error; these are quantized under the control of a bit-allocation module driven by perceptual masking thresholds and a distortion measure, and then entropy coded (using Huffman, Rice, or Golomb code tables) to form the quantized and entropy-coded bitstream.

in scenarios where the objective is to achieve maximum coding efficiency. In entropy coders, more probable symbols (i.e., frequently occurring amplitudes) are encoded with shorter codewords, and vice versa. This will essentially reduce the average data rate. Next, a distortion measure between the input and the encoded parameters is computed and compared against an established threshold. If the distortion metric is greater than the specified threshold, the bit-allocation module supplies additional bits in order to reduce the quantization error. The above procedure is repeated until the distortion falls below the threshold.

3.2 DENSITY FUNCTIONS AND QUANTIZATION

In this section, we discuss the characterization of a random process in terms of its probability density function (PDF). This approach will help us derive the quantization noise equations for different quantization schemes. A random process is characterized by its PDF, which is a non-negative function p(x) whose properties are
$$\int_{-\infty}^{\infty} p(x)\,dx = 1 \qquad (3.1)$$
and
$$\int_{x_1}^{x_2} p(x)\,dx = \Pr(x_1 < X \leq x_2) \qquad (3.2)$$

From the above equations, it is evident that the area under the PDF from x_1 to x_2 is the probability that the random variable X is observed in this range. Since X lies somewhere in (−∞, ∞), the total area under p(x) is one. The mean and the variance of the random variable X are defined as
$$\mu_x = E[X] = \int_{-\infty}^{\infty} x\, p(x)\,dx \qquad (3.3)$$


Figure 3.2 (a) The Gaussian PDF and (b) the Laplacian PDF.

$$\sigma_x^2 = \int_{-\infty}^{\infty} (x - \mu_x)^2 p(x)\,dx = E[(X - \mu_x)^2] \qquad (3.4)$$

Note that the expectation is computed either as a weighted average (3.3) or, under ergodicity assumptions, as a time average (Chapter 2, Eq. (2.54)). PDFs are useful in the design of optimal signal quantizers, as they can be used to determine the assignment of optimal quantization levels. PDFs often used to design or analyze quantizers include the zero-mean uniform (p_U(x)), the Gaussian (p_G(x)) (Figure 3.2a), and the Laplacian (p_L(x)). These are given, in that order, below:

$$p_U(x) = \frac{1}{2S}, \quad -S \leq x \leq S \qquad (3.5)$$
$$p_G(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\, e^{-x^2/(2\sigma_x^2)} \qquad (3.6)$$
$$p_L(x) = \frac{1}{\sqrt{2}\,\sigma_x}\, e^{-\sqrt{2}\,|x|/\sigma_x} \qquad (3.7)$$

where S is some arbitrary non-zero real number and σ_x² is the variance of the random variable X. Readers are referred to Papoulis' classical book on probability and random variables [Papo91] for an in-depth treatment of random processes.

3.3 SCALAR QUANTIZATION

In this section, we describe the various scalar quantization schemes. In particular, we review uniform and nonuniform quantization, and then we present differential PCM coding methods and their adaptive versions.

3.3.1 Uniform Quantization

Uniform PCM is a memoryless process that quantizes amplitudes by rounding off each sample to one of a set of discrete values (Figure 3.3). The difference between adjacent quantization levels, i.e., the step size, is constant in non-adaptive uniform PCM. The number of quantization levels Q in uniform PCM


Figure 3.3 (a) Uniform PCM and (b) uniform quantization of a triangular waveform. From the figure, R_b = 3 bits and Q = 8 uniform quantizer levels.

binary representations is Q = 2^{R_b}, where R_b denotes the number of bits. The performance of uniform PCM can be described in terms of the signal-to-noise ratio (SNR). Consider that the signal s is to be quantized and its values lie in the interval
$$s \in (-s_{\max}, s_{\max}) \qquad (3.8)$$
A uniform step size Δ can then be determined by
$$\Delta = \frac{2 s_{\max}}{2^{R_b}} \qquad (3.9)$$

Let us assume that the quantization noise e_q has a uniform PDF, i.e.,
$$-\frac{\Delta}{2} \leq e_q \leq \frac{\Delta}{2} \qquad (3.10)$$


$$p_{e_q}(e_q) = \frac{1}{\Delta} \quad \text{for } |e_q| \leq \frac{\Delta}{2} \qquad (3.11)$$

From (3.9), (3.10), and (3.11), the variance of the quantization noise can be shown [Jaya84] to be
$$\sigma_{e_q}^2 = \frac{\Delta^2}{12} = \frac{s_{\max}^2\, 2^{-2R_b}}{3} \qquad (3.12)$$

Therefore, if the input signal is bounded, an increase of 1 bit reduces the noise variance by a factor of four. In other words, the SNR for uniform PCM will

Figure 3.4 (a) Nonuniform PCM and (b) nonuniform quantization of a decaying-exponential waveform. From the figure, R_b = 3 bits and Q = 8 nonuniform quantizer levels.


improve by approximately 6 dB per bit, i.e.,
$$\mathrm{SNR}_{\mathrm{PCM}} = 6.02\, R_b + K_1 \ \text{(dB)} \qquad (3.13)$$
The factor K_1 is a constant that accounts for the step size and loading factors. For telephone speech, K_1 = −10 [Jaya84].
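A hedged illustration of the 6 dB-per-bit behavior of Eq. (3.13), assuming NumPy; the midrise quantizer structure and the uniformly distributed test signal are choices made here for the illustration, not specifications from the text.

```python
import numpy as np

def uniform_pcm(s, Rb, s_max):
    """Uniform PCM: quantize s to Q = 2**Rb levels over (-s_max, s_max)."""
    delta = 2.0 * s_max / 2**Rb                    # step size, Eq. (3.9)
    idx = np.floor(s / delta)                      # midrise quantizer (an assumed structure)
    idx = np.clip(idx, -2**(Rb - 1), 2**(Rb - 1) - 1)
    return (idx + 0.5) * delta

s = np.random.uniform(-1, 1, 100000)
for Rb in (6, 7, 8):
    sq = uniform_pcm(s, Rb, 1.0)
    snr = 10 * np.log10(np.sum(s**2) / np.sum((s - sq)**2))
    print(Rb, round(snr, 1))                       # SNR grows by about 6 dB per added bit
```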

3.3.2 Nonuniform Quantization

Uniform nonadaptive PCM has no mechanism for exploiting signal redundancy. Moreover, uniform quantizers are optimal in the mean square error (MSE) sense for signals with a uniform PDF. Nonuniform PCM quantizers use a nonuniform step size (Figure 3.4) that can be determined from the statistical structure of the signal.

PDF-optimized PCM uses fine step sizes for frequently occurring amplitudes and coarse step sizes for less frequently occurring amplitudes. The step sizes can be optimally designed by exploiting the shape of the signal's PDF. A signal with a Gaussian PDF (Figure 3.5), for instance, can be quantized more efficiently in terms of the overall MSE by computing the quantization step sizes and the corresponding centroids such that the mean square quantization noise is minimized [Scha79].

Another class of nonuniform PCM relies on log-quantizers, which are quite common in telephony applications [Scha79] [Jaya84]. In Figure 3.6, a nonuniform quantizer is realized by using a nonlinear mapping function g(·) that maps nonuniform step sizes to uniform step sizes, such that a simple linear quantizer can be used. An example of the mapping function is given in Figure 3.7. The decoder uses an expansion function g^{-1}(·) to recover the signal.

Two telephony standards have been developed based on logarithmic companding, i.e., the μ-law and the A-law. The μ-law companding function is used in

Figure 3.5 PDF-optimized PCM for signals with a Gaussian distribution: step sizes Δ1, Δ2, Δ3 and the assigned values (centroids). Quantization levels are on the horizontal axis.


Figure 3.6 Nonuniform PCM via compressor and expander functions: the input s is compressed by g(·), quantized uniformly, and expanded by g^{-1}(·) to produce ŝ.

Figure 3.7 Companding function g(s) for nonuniform PCM.

the North American PCM standard (μ = 255). The μ-law is given by
$$|g(s)| = \frac{\log(1 + \mu\,|s/s_{\max}|)}{\log(1 + \mu)} \qquad (3.14)$$

For μ = 255, (3.14) gives an approximately linear mapping for small amplitudes and a logarithmic mapping for larger amplitudes. The European A-law companding standard is slightly different and is based on the mapping
$$|g(s)| = \begin{cases} \dfrac{A\,|s/s_{\max}|}{1 + \log(A)}, & 0 < |s/s_{\max}| < \dfrac{1}{A} \\[2ex] \dfrac{1 + \log(A\,|s/s_{\max}|)}{1 + \log(A)}, & \dfrac{1}{A} < |s/s_{\max}| < 1 \end{cases} \qquad (3.15)$$

The idea behind A-law companding is similar to that of the μ-law, in that again the mapping is almost linear for signals with small amplitudes and logarithmic for large amplitudes. Both of these techniques can yield superior SNRs, particularly for small amplitudes. In telephony, the companding schemes have been found to reduce bit rates, without degradation, by as much as 4 bits/sample relative to uniform PCM.
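A sketch of μ-law companded quantization built from Eq. (3.14) with μ = 255 (NumPy assumed); the signal is taken as normalized so that s_max = 1, and the embedded uniform midrise quantizer is the same simple structure used in the earlier sketch. Both choices are assumptions for illustration.

```python
import numpy as np

MU = 255.0

def mu_compress(s):
    """mu-law compressor g(.), Eq. (3.14), for |s| <= 1; the sign is carried separately."""
    return np.sign(s) * np.log1p(MU * np.abs(s)) / np.log1p(MU)

def mu_expand(g):
    """Inverse mapping g^{-1}(.)."""
    return np.sign(g) * np.expm1(np.abs(g) * np.log1p(MU)) / MU

def quantize_uniform(x, Rb):
    """Uniform midrise quantizer on (-1, 1) with 2**Rb levels."""
    delta = 2.0 / 2**Rb
    idx = np.clip(np.floor(x / delta), -2**(Rb - 1), 2**(Rb - 1) - 1)
    return (idx + 0.5) * delta

# small-amplitude input: companded PCM keeps the SNR higher than plain uniform PCM
s = np.clip(0.05 * np.random.randn(100000), -1, 1)
sq_mu = mu_expand(quantize_uniform(mu_compress(s), Rb=8))
sq_lin = quantize_uniform(s, Rb=8)
snr = lambda ref, q: 10 * np.log10(np.sum(ref**2) / np.sum((ref - q)**2))
print(round(snr(s, sq_mu), 1), round(snr(s, sq_lin), 1))
```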

Dynamic range variations in PCM can be handled by using an adaptive step size. A PCM system with an adaptive step size is called adaptive PCM (APCM). The step size in a feed-forward system is transmitted as side information, while in a feedback system the step size is estimated from past coded samples (Figure 3.8). In


Figure 3.8 Adaptive PCM with (a) forward estimation of the step size and (b) backward estimation of the step size.

this figure, Q(·) represents either a uniform or a nonuniform quantization (compression) scheme, and Δ corresponds to the step size.

3.3.3 Differential PCM

A more efficient scalar quantizer is differential PCM (DPCM), which removes the redundancy in the audio waveform by exploiting the correlation between adjacent samples. In its simplest form, a DPCM transmitter encodes only the difference between successive samples, and the receiver recovers the signal by integration. Practical DPCM schemes incorporate a time-invariant short-term prediction process A(z). This is given by
$$A(z) = \sum_{i=1}^{p} a_i z^{-i} \qquad (3.16)$$

where a_i are the prediction coefficients and z is the complex variable of the z-transform. This DPCM scheme is also called predictive differential coding (Figure 3.9), and it reduces the quantization error variance by reducing the variance of the quantizer input. An example of a representative DPCM waveform e_q(n), along with the associated analog and PCM quantized waveforms s(t) and s(n), respectively, is given in Figure 3.10.

The DPCM system (Figure 3.9) works as follows. The sample s′(n) is the estimate of the current sample s(n) and is obtained from past sample values. The prediction error e(n) is then quantized (i.e., e_q(n)) and transmitted to the


Figure 3.9 DPCM system: (a) transmitter and (b) receiver.

receiver. The quantized prediction error is also added to s′(n) in order to reconstruct the sample ŝ(n). In the absence of channel errors, the sample reconstructed at the receiver equals ŝ(n). In the simplest case, A(z) is a first-order polynomial. In Figure 3.9, A(z) is given by
$$A(z) = \sum_{i=1}^{p} a_i z^{-i} \qquad (3.17)$$

and the predicted signal is given by
$$s'(n) = \sum_{i=1}^{p} a_i\, \hat{s}(n-i) \qquad (3.18)$$
where ŝ(n − i) are the reconstructed past samples.

The prediction coefficients are usually determined by solving the autocorrelation equations
$$r_{ss}(m) - \sum_{i=1}^{p} a_i\, r_{ss}(m-i) = 0, \quad \text{for } m = 1, 2, \ldots, p \qquad (3.19)$$

where r_ss(m) are the autocorrelation samples of s(n). The details of the equation above will be discussed in Chapter 4. Two other types of scalar coders are the delta modulation (DM) and the adaptive DPCM (ADPCM) coders [Cumm73] [Gibs74] [Gibs78] [Yatr88]. DM can be viewed as a special case of DPCM where the difference (prediction error) is encoded with one bit. DM typically operates at sampling rates much higher than the rates commonly used with DPCM. The step size in DM may also be adaptive.
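A minimal DPCM sketch along the lines of Figure 3.9, assuming NumPy: a first-order predictor (p = 1) with a fixed coefficient a1 = 0.9 operates on the previously reconstructed sample, and a simple uniform quantizer stands in for the quantizer block. Both choices are illustrative, not values from the text.

```python
import numpy as np

def dpcm(s, a1=0.9, Rb=4, e_max=1.0):
    """First-order DPCM: predict from the previously reconstructed sample,
    quantize the prediction error uniformly, and reconstruct."""
    delta = 2.0 * e_max / 2**Rb
    s_hat = np.zeros(len(s))          # reconstructed samples (as available at the receiver)
    eq = np.zeros(len(s))
    prev = 0.0
    for n in range(len(s)):
        pred = a1 * prev                           # s'(n), Eq. (3.18) with p = 1
        e = s[n] - pred                            # prediction error e(n)
        eq[n] = np.clip(np.round(e / delta), -2**(Rb - 1), 2**(Rb - 1) - 1) * delta
        prev = pred + eq[n]                        # reconstruction used for the next prediction
        s_hat[n] = prev
    return eq, s_hat

# correlated test signal: a coarse 4-bit code for e(n) still tracks s(n) closely
n = np.arange(2000)
s = np.sin(2 * np.pi * n / 200)
eq, s_hat = dpcm(s)
print(np.var(s - s_hat) < np.var(s))   # True
```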

In an adaptive differential PCM (ADPCM) system, both the step size and the predictor are allowed to adapt and track the time-varying statistics of the input signal [Span94]. The predictor can be either forward adaptive or backward adaptive. In forward adaptation, the prediction parameters are estimated from the current data, which are not available at the receiver. Therefore, the


Figure 3.10 Uniform quantization: (a) analog input signal; (b) PCM waveform (3-bit digitization of the analog signal), output after quantization [101 100 011 011 101 001 100 001 101 010 100], total number of bits in the PCM digitization = 33; (c) differential PCM (2-bit digitization of the analog signal), output after quantization [10 10 01 01 10 00 10 00 10 01 01], total number of bits in the DPCM digitization = 22. As an aside, it can be noted that, relative to PCM, DPCM reduces the number of bits for encoding by reducing the variance of the input signal. The dynamic range of the input signal can be reduced by exploiting the redundancy present within adjacent samples of the signal.

prediction parameters must be encoded and transmitted separately in order to reconstruct the signal at the receiver. In backward adaptation, the parameters are estimated from past data, which are already available at the receiver. Therefore, the prediction parameters can be estimated locally at the receiver. Backward predictor adaptation is amenable to low-delay coding [Gibs90] [Chen92]. ADPCM


encoders with pole-zero decoder filters have proved to be particularly versatile in speech applications. In fact, the 32 kb/s ADPCM algorithm adopted for the ITU-T G.726 [G726] standard (formerly G.721 [G721]) uses a pole-zero adaptive predictor.

3.4 VECTOR QUANTIZATION

Data compression via vector quantization (VQ) is achieved by encoding a data set jointly in block or vector form. Figure 3.11(a) shows an N-dimensional quantizer and a codebook. The incoming vectors can be formed from consecutive data samples or from model parameters. The quantizer maps the i-th incoming [N × 1] vector, given by
$$\mathbf{s}_i = [s_i(0),\, s_i(1),\, \ldots,\, s_i(N-1)]^T \qquad (3.20)$$

to an n-th channel symbol u_n, n = 1, 2, ..., L, as shown in Figure 3.11(a). The codebook consists of L code vectors
$$\mathbf{s}_n = [s_n(0),\, s_n(1),\, \ldots,\, s_n(N-1)]^T, \quad n = 1, 2, \ldots, L \qquad (3.21)$$

which reside in the memory of the transmitter and the receiver. A vector quantizer works as follows. The input vectors s_i are compared to each codeword s_n, and the address of the closest codeword with respect to a distortion measure ε(s_i, s_n) determines the channel symbol to be transmitted. The simplest and most commonly used distortion measure is the sum of squared errors, which is given by
$$\varepsilon(\mathbf{s}_i, \mathbf{s}_n) = \sum_{k=0}^{N-1} \left( s_i(k) - s_n(k) \right)^2 \qquad (3.22)$$

The L [N × 1] real-valued vectors are entries of the codebook and are designed by dividing the vector space into L nonoverlapping cells c_n, as shown in Figure 3.11(b). Each cell c_n is associated with a template vector s_n. The quantizer assigns the channel symbol u_n to the vector s_i if s_i belongs to c_n. The channel symbol u_n is usually a binary representation of the codebook index of s_n.

A vector quantizer can be considered a generalization of scalar PCM, and in fact Gersho [Gers83] calls it vector PCM (VPCM). In VPCM, the codebook is fully searched, and the number of bits per sample is given by
$$B = \frac{1}{N} \log_2 L \qquad (3.23)$$

The signal-to-noise ratio for VPCM is given by
$$\mathrm{SNR}_N = 6B + K_N \ \text{(dB)} \qquad (3.24)$$


Figure 3.11 Vector quantization scheme: (a) block diagram, with the encoder and decoder each holding the codebook in ROM and the channel carrying the symbol u_n; (b) cells c_n and centroids for a two-dimensional VQ, i.e., N = 2.

Note that for N = 1, VPCM defaults to scalar PCM, and therefore (3.13) is a special case of (3.24). Although the two equations are quite similar, VPCM yields improved SNR (reflected in K_N) since it exploits the correlation within the vectors. VQ offers significant coding gain by increasing N and L. However, the memory and the computational complexity required grow exponentially with N for a given rate. In general, the benefits of VQ are realized at rates of 1 bit per sample or less. The codebook design process, also known as the training or populating process, can be fixed or adaptive.
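A hedged sketch of the full-search (VPCM) encoder using the distortion of Eq. (3.22), assuming NumPy; the random codebook is purely illustrative (a real codebook would be trained, e.g., with the LBG algorithm discussed next).

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Full-search VQ: for each [N x 1] input vector, return the index of the
    codeword that minimizes the sum of squared errors, Eq. (3.22)."""
    d = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d, axis=1)

N, L = 2, 16                       # dimension and codebook size; B = log2(L)/N = 2 bits/sample
codebook = np.random.randn(L, N)   # illustrative codebook only
s = np.random.randn(1000, N)
u = vq_encode(s, codebook)         # channel symbols (codebook indices)
s_hat = codebook[u]                # decoder: simple table lookup
```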

Fixed codebooks are designed a priori, and the basic design procedure involves an initial guess for the codebook and then iterative improvement by using a large number of training vectors. An iterative codebook design algorithm that works for a large class of distortion measures was given by Linde, Buzo, and Gray [Lind80]. This is essentially an extension of Lloyd's [Lloy82] scalar quantizer design and is often referred to as the "LBG algorithm." Typically, the number of training vectors per code vector must be at least ten, and preferably fifty [Makh85]. Since speech and audio are nonstationary signals, one may also wish to adapt the codebooks ("codebook design on the fly") to the signal statistics. A quantizer with an adaptive codebook is called adaptive VQ (A-VQ), and applications to speech coding have been reported in [Paul82], [Cupe85], and [Cupe89]. There are two types of A-VQ, namely, forward adaptive and backward adaptive. In backward A-VQ, codebook updating is based on past data that is also available at the decoder. Forward A-VQ updates the codebooks based on current (or sometimes future) data, and as such, additional information must be encoded.

3.4.1 Structured VQ

The complexity of high-dimensionality VQ can be reduced significantly with the use of structured codebooks that allow for efficient search. Tree-structured [Buzo80] and multi-step [Juan82] vector quantizers are associated with lower encoding complexity at the expense of a modest loss of performance. Multi-step vector quantizers consist of a cascade of two or more quantizers, each one encoding the error or residual of the previous quantizer. In Figure 3.12(a), the first VQ codebook L1 encodes the signal s(k), and the subsequent VQ stages L2 through LM encode the errors e_1(k) through e_{M-1}(k) from the previous stages, respectively. In particular, the codebook L1 is first searched, and the vector ŝ_{l1}(k) that minimizes the MSE (3.25) is selected, where l1 is the codebook index associated with the first-stage codebook:
$$\varepsilon_l = \sum_{k=0}^{N-1} \left( s(k) - \hat{s}_l(k) \right)^2, \quad \text{for } l = 1, 2, 3, \ldots, L_1 \qquad (3.25)$$

Next, the difference between the original input s(k) and the first-stage codeword ŝ_{l1}(k) is computed, as shown in Figure 3.12(a). This is given by
$$e_1(k) = s(k) - \hat{s}_{l_1}(k), \quad \text{for } k = 0, 1, 2, 3, \ldots, N-1 \qquad (3.26)$$

The residual e_1(k) is used in the second stage as the reference signal to be approximated. Codebooks L2, L3, ..., LM are searched sequentially, and the code vectors ê_{l2,1}(k), ê_{l3,2}(k), ..., ê_{lM,M-1}(k) that result in the minimum MSE (3.27) are chosen as the codewords. Note that l2, l3, ..., lM are the codebook indices

Figure 3.12 (a) Multi-step (M-stage) VQ block diagram. At the decoder, the signal ŝ(k) can be reconstructed as ŝ(k) = ŝ_{l1}(k) + ê_{l2,1}(k) + ... + ê_{lM,M-1}(k), for k = 0, 1, 2, ..., N−1. (b) An example multi-step codebook structure (M = 3). The first-stage N-dimensional codebook consists of L1 code vectors. Similarly, the numbers of entries in the second- and third-stage codebooks are L2 and L3, respectively. Usually, in order to reduce the computational complexity, the number of code vectors is chosen such that L1 > L2 > L3.

associated with the 2nd-, 3rd-, ..., M-th-stage codebooks, respectively.

For codebook L2:
$$\varepsilon_{l,2} = \sum_{k=0}^{N-1} \left( e_1(k) - \hat{e}_{l,1}(k) \right)^2, \quad \text{for } l = 1, 2, 3, \ldots, L_2$$
For codebook LM:
$$\varepsilon_{l,M} = \sum_{k=0}^{N-1} \left( e_{M-1}(k) - \hat{e}_{l,M-1}(k) \right)^2, \quad \text{for } l = 1, 2, 3, \ldots, L_M \qquad (3.27)$$

At the decoder, the transmitted codeword can be reconstructed as follows:
$$\hat{s}(k) = \hat{s}_{l_1}(k) + \hat{e}_{l_2,1}(k) + \cdots + \hat{e}_{l_M,M-1}(k), \quad \text{for } k = 0, 1, 2, \ldots, N-1 \qquad (3.28)$$

The complexity of VQ can also be reduced by normalizing the vectors of the codebook and encoding the gain separately. The technique is called gain/shape VQ (GS-VQ) and was introduced by Buzo et al. [Buzo80] and later studied by Sabin and Gray [Sabi82]. The waveform shape is represented by a code vector from the shape codebook, while the gain is encoded from the gain codebook (Figure 3.13). The idea of encoding the gain separately allows for the encoding of

Figure 3.13 Gain/shape (GS)-VQ encoder and decoder: the encoder maps s_i to the channel symbol u_n using the shape and gain codebooks, and the decoder uses the same codebooks to reconstruct ŝ_i. In GS-VQ, the idea is to encode the waveform shape and the gain separately, using the shape and gain codebooks, respectively.


vectors of high dimensionality with manageable complexity and is widely used in encoding the excitation signal in code-excited linear predictive (CELP) coders [Atal90]. An alternative method for building highly structured codebooks consists of forming the code vectors by linearly combining a small set of basis vectors [Gers90b].

3.4.2 Split-VQ

In split-VQ, it is typical to employ two stages of VQ; multiple codebooks of smaller dimensions are used in the second stage. A two-stage split-VQ block diagram is given in Figure 3.14(a), and the corresponding split-VQ codebook structure is shown in Figure 3.14(b). From Figure 3.14(b), note that the first-stage codebook L1 employs an N-dimensional VQ and consists of L1 code vectors. The second-stage codebook is implemented as a split-VQ and consists of a combination of two N/2-dimensional codebooks, L2 and L3. The numbers of entries in these two codebooks are L2 and L3, respectively. First, the codebook L1 is searched, and the vector ŝ_{l1}(k) that minimizes the MSE (3.29) is selected, where l1 is the index of the first-stage codebook:
$$\varepsilon_l = \sum_{k=0}^{N-1} \left( s(k) - \hat{s}_l(k) \right)^2, \quad \text{for } l = 1, 2, 3, \ldots, L_1 \qquad (3.29)$$

Next, the difference between the original input s(k) and the first-stage codeword ŝ_{l1}(k) is computed, as shown in Figure 3.14(a):
$$e(k) = s(k) - \hat{s}_{l_1}(k), \quad \text{for } k = 0, 1, 2, 3, \ldots, N-1 \qquad (3.30)$$

The residual e(k) is used in the second stage as the reference signal to be approximated. Codebooks L2 and L3 are searched separately, and the code vectors ê_{l2,low} and ê_{l3,upp} that result in the minimum MSE (3.31) and (3.32), respectively, are chosen as the codewords. Note that l2 and l3 are the codebook indices associated with the second- and third-stage codebooks.

For codebook L2:
$$\varepsilon_{l,\mathrm{low}} = \sum_{k=0}^{N/2-1} \left( e(k) - \hat{e}_{l,\mathrm{low}}(k) \right)^2, \quad \text{for } l = 1, 2, 3, \ldots, L_2 \qquad (3.31)$$

For codebook L3:
$$\varepsilon_{l,\mathrm{upp}} = \sum_{k=N/2}^{N-1} \left( e(k) - \hat{e}_{l,\mathrm{upp}}(k - N/2) \right)^2, \quad \text{for } l = 1, 2, 3, \ldots, L_3 \qquad (3.32)$$


Figure 3.14 Split-VQ: (a) a two-stage split-VQ block diagram; (b) the split-VQ codebook structure. In split-VQ, the code vector search is performed by "dividing" the codebook into smaller-dimension codebooks. In Figure 3.14(b), the second-stage N-dimensional VQ has been divided into two N/2-dimensional split-VQs. Note that the first-stage codebook L1 employs an N-dimensional VQ and consists of L1 entries. The second-stage codebook is implemented as a split-VQ and consists of a combination of two N/2-dimensional codebooks, L2 and L3. The numbers of entries in these two codebooks are L2 and L3, respectively. The codebook indices, i.e., l1, l2, and l3, will be encoded and transmitted. At the decoder, these codebook indices are used to reconstruct the transmitted codeword ŝ(k), Eq. (3.33).

At the decoder, the transmitted codeword ŝ(k) can be reconstructed as follows:
$$\hat{s}(k) = \begin{cases} \hat{s}_{l_1}(k) + \hat{e}_{l_2,\mathrm{low}}(k), & k = 0, 1, 2, \ldots, N/2 - 1 \\ \hat{s}_{l_1}(k) + \hat{e}_{l_3,\mathrm{upp}}(k - N/2), & k = N/2, N/2 + 1, \ldots, N - 1 \end{cases} \qquad (3.33)$$

Split-VQ offers high coding accuracy, however, with increased computational complexity and with a slight drop in the coding gain. Paliwal and Atal discuss


these issues in [Pali91] [Pali93], while presenting an algorithm for vector quantization of speech LPC parameters at 24 bits/frame. Despite the aforementioned shortcomings, split-VQ techniques are very efficient when it comes to encoding line spectrum prediction parameters in several speech and audio coding standards, such as the ITU-T G.729 CS-ACELP standard [G729], the IS-893 Selectable Mode Vocoder (SMV) [IS-893], and the MPEG-4 General Audio (GA) Coding-Twin VQ tool [ISOI99].

3.4.3 Conjugate-Structure VQ

Conjugate-structure VQ (CS-VQ) [Kata93] [Kata96] enables joint quantization of two or more parameters. The CS-VQ works as follows. Let s(n) be the target vector that has to be approximated; the MSE ε to be minimized is given by
$$\varepsilon = \frac{1}{N} \sum_{n=0}^{N-1} |e(n)|^2 = \frac{1}{N} \sum_{n=0}^{N-1} \left| s(n) - g_1 u(n) - g_2 v(n) \right|^2 \qquad (3.34)$$

where u(n) and v(n) are some arbitrary vectors, and g_1 and g_2 are the gains to be vector quantized using the CS codebook given in Figure 3.15. From this figure, codebooks A and B contain P and Q entries, respectively. In both codebooks, the first-column element corresponds to parameter 1, i.e., g_1, and the second-column element represents parameter 2, i.e., g_2. The optimum combination of g_1 and g_2 that results in the minimum MSE (3.34) is computed from the PQ permutations as follows:
$$g_1(i,j) = g_{A1i} + g_{B1j}, \quad i \in [1, 2, \ldots, P],\ j \in [1, 2, \ldots, Q] \qquad (3.35)$$
$$g_2(i,j) = g_{A2i} + g_{B2j}, \quad i \in [1, 2, \ldots, P],\ j \in [1, 2, \ldots, Q] \qquad (3.36)$$
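A brute-force sketch of the CS-VQ gain search of Eqs. (3.34)-(3.36), assuming NumPy; the 4-entry codebooks and the synthetic target are illustrative only.

```python
import numpy as np

def cs_vq_gains(s, u, v, cb_a, cb_b):
    """Exhaustive CS-VQ gain search: pick the (i, j) pair whose summed gains
    g1 = gA1i + gB1j and g2 = gA2i + gB2j minimize the MSE of Eq. (3.34)."""
    best = (None, None, np.inf)
    for i, (gA1, gA2) in enumerate(cb_a):
        for j, (gB1, gB2) in enumerate(cb_b):
            g1, g2 = gA1 + gB1, gA2 + gB2
            mse = np.mean((s - g1 * u - g2 * v) ** 2)
            if mse < best[2]:
                best = (i, j, mse)
    return best

# illustrative 4-entry conjugate codebooks (values are arbitrary)
cb_a = np.array([[0.1, 0.2], [0.4, 0.1], [0.7, 0.5], [1.0, 0.8]])
cb_b = np.array([[0.0, 0.1], [0.2, 0.3], [0.5, 0.0], [0.3, 0.6]])
N = 40
u, v = np.random.randn(N), np.random.randn(N)
s = 0.9 * u + 0.6 * v                     # target built from known gains
print(cs_vq_gains(s, u, v, cb_a, cb_b))   # indices of the best (g1, g2) combination and its MSE
```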

Figure 3.15 An example CS-VQ. Codebook A holds entries (g_A1i, g_A2i), i = 1, ..., P, and codebook B holds entries (g_B1j, g_B2j), j = 1, ..., Q. In this figure, the codebooks 'A' and 'B' are conjugate.


CS-VQ codebooks are particularly handy in scenarios that involve joint quantization of excitation gains. Second-generation near-toll-quality CELP codecs (e.g., ITU-T G.729) and third-generation (3G) CELP standards for cellular applications (e.g., the TIA IS-893 Selectable Mode Vocoder) employ CS-VQ codebooks to encode the adaptive and stochastic excitation gains [Atal90] [Sala98] [G729] [IS-893]. A CS-VQ is used to vector quantize the transformed spectral coefficients in the MPEG-4 Twin-VQ encoder.

3.5 BIT-ALLOCATION ALGORITHMS

Until now, we discussed various scalar and vector quantization algorithms without emphasizing how the number of quantization levels is determined. In this section, we review some of the fundamental bit-allocation techniques. A bit-allocation algorithm determines the number of bits required to quantize an audio frame with reduced audible distortions. Bit allocation can be based on certain perceptual rules or spectral characteristics. From Figure 3.1, parameters typically quantized include the transform coefficients x, scale factors S, and the residual error e. For now, let us consider that the transform coefficients x are to be quantized, i.e.,
$$\mathbf{x} = [x_1, x_2, x_3, \ldots, x_{N_f}]^T \qquad (3.37)$$

where N_f represents the total number of transform coefficients. Let the total number of bits available to quantize the transform coefficients be N bits. Our objective is to find an optimum way of distributing the available N bits across the individual transform coefficients such that a distortion measure D is minimized. The distortion D is given by
$$D = \frac{1}{N_f} \sum_{i=1}^{N_f} E[(x_i - \hat{x}_i)^2] = \frac{1}{N_f} \sum_{i=1}^{N_f} d_i \qquad (3.38)$$

where x_i and x̂_i denote the i-th unquantized and quantized transform coefficients, respectively, and E[·] is the expectation operator. Let n_i be the number of bits assigned to the coefficient x_i for quantization, such that
$$\sum_{i=1}^{N_f} n_i \leq N \qquad (3.39)$$

Note that if the x_i are uniformly distributed ∀i, then we can employ a simple uniform bit allocation across all the transform coefficients, i.e.,
$$n_i = \left[ \frac{N}{N_f} \right], \quad \forall i \in [1, N_f] \qquad (3.40)$$

However, in practice, the transform coefficients x may not have a uniform probability distribution. Therefore, employing an equal number of bits for both


large and small amplitudes may result in spending extra bits on the smaller amplitudes. Moreover, in such scenarios, for a given N the distortion D can be very high.

Example 3.1

An example of the aforementioned discussion is presented in Table 3.1. Uniform bit allocation is employed to quantize both the uniformly distributed and Gaussian-distributed transform coefficients. In this example, we assume that a total of N = 64 bits is available for quantization and N_f = 16 samples. Therefore, n_i = 4, ∀i ∈ [1, 16]. Note that the input vectors x_u and x_g have been randomly generated in MATLAB using the rand(1, Nf) and randn(1, Nf) functions, respectively. The distortions D_u and D_g are computed using (3.38) and are given by 0.00023927 and 0.00042573, respectively. From Example 3.1, we note that uniform bit allocation is not optimal in all cases, especially when the distribution of the unquantized vector x is not uniform. Therefore, we must have a cost function that minimizes the distortion d_i subject to the constraint given in (3.39). This is given by

$$\min_{n_i} D = \min_{n_i} \left\{ \frac{1}{N_f} \sum_{i=1}^{N_f} E[(x_i - \hat{x}_i)^2] \right\} = \min_{n_i} \left\{ \frac{1}{N_f} \sum_{i=1}^{N_f} \sigma_i^2 \right\} \qquad (3.41)$$

where σ_i² is the variance of the quantization error of the i-th coefficient. Note that the above minimization problem can be simplified if the quantization noise has a uniform PDF [Jaya84]. From (3.12),
$$\sigma_i^2 = \frac{x_i^2}{3 \cdot 2^{2 n_i}} \qquad (3.42)$$

Table 3.1 Uniform bit-allocation scheme, where n_i = [N/N_f], ∀i ∈ [1, N_f].

Uniformly distributed coefficients:
  Input vector x_u = [0.6029, 0.3806, 0.56222, 0.12649, 0.26904, 0.47535, 0.4553, 0.38398, 0.41811, 0.35213, 0.23434, 0.32256, 0.31352, 0.3026, 0.32179, 0.16496]
  Quantized vector x̂_u = [0.625, 0.375, 0.5625, 0.125, 0.25, 0.5, 0.4375, 0.375, 0.4375, 0.375, 0.25, 0.3125, 0.3125, 0.3125, 0.3125, 0.1875]

Gaussian-distributed coefficients:
  Input vector x_g = [0.5199, 2.4205, -0.94578, -0.0081113, -0.42986, -0.87688, 1.1553, -0.82724, -1.345, -0.15859, -0.23544, 0.85353, 0.016574, -2.0292, 1.2702, 0.28333]
  Quantized vector x̂_g = [0.5, 2.4375, -0.9375, 0, -0.4375, -0.875, 1.125, -0.8125, -1.375, -0.1875, -0.25, 0.875, 0, -2.0, 1.25, 0.3125]

D = (1/N_f) Σ_{i=1}^{N_f} E[(x_i − x̂_i)²]; n_i = 4, ∀i ∈ [1, 16]; D_u = 0.00023927 and D_g = 0.00042573.


Substituting (3.42) in (3.41) and minimizing with respect to n_i,
$$\frac{\partial D}{\partial n_i} = \frac{x_i^2}{3}\,(-2)\, 2^{-2 n_i} \ln 2 + K_1 = 0 \qquad (3.43)$$
$$n_i = \frac{1}{2} \log_2 x_i^2 + K \qquad (3.44)$$

From (3.44) and (3.39),
$$\sum_{i=1}^{N_f} n_i = \sum_{i=1}^{N_f} \left( \frac{1}{2} \log_2 x_i^2 + K \right) = N \qquad (3.45)$$
$$K = \frac{N}{N_f} - \frac{1}{2 N_f} \sum_{i=1}^{N_f} \log_2 x_i^2 = \frac{N}{N_f} - \frac{1}{2 N_f} \log_2\!\left( \prod_{i=1}^{N_f} x_i^2 \right) \qquad (3.46)$$

Substituting (3.46) in (3.44), we can obtain the optimum bit-allocation n_i^{optimum} as

n_i^{optimum} = \frac{N}{N_f} + \frac{1}{2} \log_2 \frac{x_i^2}{\left( \prod_{i=1}^{N_f} x_i^2 \right)^{1/N_f}} \qquad (3.47)
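The following Python sketch (an illustration, not taken from the text) evaluates the uniform rule (3.40) and the optimal rule (3.47) for a randomly generated Gaussian coefficient vector. The quantizer model, helper names, and random seed are assumptions of this sketch, so the numbers will differ from Tables 3.1 and 3.2; negative allocations can occur, as discussed below.

import numpy as np

def uniform_quantizer(x, n_bits, x_max):
    # Midtread uniform quantizer over [-x_max, x_max] with 2^n_bits levels.
    if n_bits <= 0:
        return np.zeros_like(x)
    step = 2.0 * x_max / (2 ** n_bits)
    return step * np.round(x / step)

def optimal_bit_allocation(x, N):
    # Eq. (3.47): n_i = N/Nf + 0.5*log2( x_i^2 / geometric mean of x_i^2 ).
    Nf = len(x)
    log_gm = np.mean(np.log2(x ** 2))           # log2 of the geometric mean of x_i^2
    n = N / Nf + 0.5 * (np.log2(x ** 2) - log_gm)
    return np.round(n).astype(int)              # integer bits; may be negative (see text)

Nf, N = 16, 64
rng = np.random.default_rng(0)
xg = rng.standard_normal(Nf)                    # Gaussian-distributed coefficients
x_max = np.max(np.abs(xg))

n_uniform = np.full(Nf, N // Nf)                # Eq. (3.40)
n_optimal = optimal_bit_allocation(xg, N)       # Eq. (3.47)

D_uni = np.mean((xg - uniform_quantizer(xg, 4, x_max)) ** 2)
D_opt = np.mean([(xg[i] - uniform_quantizer(xg[i:i + 1], n_optimal[i], x_max)[0]) ** 2
                 for i in range(Nf)])
print(n_optimal, D_uni, D_opt)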

Table 3.2 presents the optimum bit assignment for both the uniformly distributed and Gaussian-distributed transform coefficients considered in the previous example (Table 3.1). From Table 3.2, note that the resulting optimal bit-allocation for the Gaussian-distributed transform coefficients contains two negative integers. Several techniques have been proposed to avoid this scenario, namely, the sequential bit-allocation method [Rams82] and Segall's method [Sega76]; for more detailed descriptions, readers are referred to [Jaya84] [Madi97]. Note that the bit-allocation scheme given by (3.47) may not be optimal either in the perceptual sense or in the SNR sense. This is because the minimization of (3.41) is performed without considering either the perceptual noise-masking thresholds or the dependence of the signal-to-noise power on the optimal number of bits n_i^{optimum}. Also note that the distortion Du in the case of optimal bit-assignment (Table 3.2) is slightly greater than in the case of uniform bit-allocation (Table 3.1). This can be attributed to the fact that fewer quantization levels must have been assigned to the low-powered transform coefficients relative to the number of levels implied by (3.47). Moreover, a maximum coding gain can be achieved when the audio signal spectrum is non-flat. One of the important remarks presented in [Jaya84] relevant to the on-going discussion is that when the geometric mean of x_i^2 is less than the arithmetic mean of x_i^2, the optimal bit-allocation scheme performs better than the uniform bit-allocation. The ratio

Table 3.2 Optimal bit-allocation scheme, where n_i^{optimum} = N/Nf + (1/2) log2 x_i^2 − (1/2) log2 ( ∏_{i=1}^{Nf} x_i^2 )^{1/Nf}

Uniformly distributed coefficients              Gaussian-distributed coefficients
Input vector xu      Quantized vector x̂u        Input vector xg         Quantized vector x̂g
 0.6029   0.3806      0.59375  0.375              0.5199    2.4205        0.5      2.4219
 0.56222  0.12649     0.5625   0.125             -0.94578  -0.0081113    -0.9375   0
 0.26904  0.47535     0.25     0.46875           -0.42986  -0.87688      -0.4375  -0.875
 0.4553   0.38398     0.4375   0.375              1.1553   -0.82724       1.1563  -0.8125
 0.41811  0.35213     0.4375   0.375             -1.345    -0.15859      -1.3438  -0.125
 0.23434  0.32256     0.25     0.3125            -0.23544   0.85353      -0.25     0.84375
 0.31352  0.3026      0.3125   0.3125             0.016574 -2.0292        0       -2.0313
 0.32179  0.16496     0.3125   0.125              1.2702    0.28333       1.2656   0.25

Bits allocated (uniformly distributed): n = [5 4 5 3 4 5 4 4 4 4 4 4 4 4 4 3]; distortion Du = 0.0002468.
Bits allocated (Gaussian-distributed): n = [4 6 5 −2 4 5 5 5 6 3 3 5 −1 6 6 3]; distortion Dg = 0.00022878.


of the two means is captured in the spectral flatness measure (SFM), i.e.,

Geometric mean, GM = \left( \prod_{i=1}^{N_f} x_i^2 \right)^{1/N_f}

Arithmetic mean, AM = \frac{1}{N_f} \sum_{i=1}^{N_f} x_i^2

SFM = \frac{GM}{AM}, \quad SFM \in [0, 1] \qquad (3.48)
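A minimal Python sketch of (3.48) (illustrative only; it assumes no transform coefficient is exactly zero):

import numpy as np

def spectral_flatness(x):
    # Eq. (3.48): ratio of geometric to arithmetic mean of x_i^2, in [0, 1].
    p = np.asarray(x, dtype=float) ** 2
    gm = np.exp(np.mean(np.log(p)))   # geometric mean
    am = np.mean(p)
    return gm / am

print(spectral_flatness([0.6, 0.4, 0.5, 0.45]))    # close to 1: nearly flat
print(spectral_flatness([2.0, 0.05, 0.01, 0.02]))  # close to 0: peaky spectrum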

Other important considerations, in addition to the SNR and spectral flatness measures, are the perceptual noise masking, the noise-to-mask ratio (NMR), and the signal-to-mask ratio (SMR). All these measures are used in perceptual bit-allocation methods. Since, at this point, readers have not yet been introduced to the principles of psychoacoustics and the concepts of SMR and NMR, we defer a discussion of perceptually based bit allocation to Chapter 5, Section 5.8.

3.6 ENTROPY CODING

It is worthwhile to consider the theoretical limits on the minimum number of bits required to represent an audio sample. Shannon, in his mathematical theory of communication [Shan48], proved that the minimum number of bits required to encode a message X is given by the entropy He(X). The entropy of an input signal can be defined as follows. Let X = [x1 x2 ... xN] be the input data vector of length N, and let pi be the probability that the i-th symbol (over the symbol set V = [v1 v2 ... vK]) is transmitted. The entropy He(X) is given by

H_e(X) = -\sum_{i=1}^{K} p_i \log_2(p_i) \qquad (3.49)

In simple terms, entropy is a measure of the uncertainty of a random variable. For example, let the input bitstream to be encoded be X = [4 5 6 6 2 5 4 4 5 4 4], i.e., N = 11, the symbol set V = [2 4 5 6], and the corresponding probabilities [1/11, 5/11, 3/11, 2/11], respectively, with K = 4. The entropy He(X) can be computed as follows:

H_e(X) = -\sum_{i=1}^{K} p_i \log_2(p_i) = -\left[ \frac{1}{11}\log_2\!\left(\frac{1}{11}\right) + \frac{5}{11}\log_2\!\left(\frac{5}{11}\right) + \frac{3}{11}\log_2\!\left(\frac{3}{11}\right) + \frac{2}{11}\log_2\!\left(\frac{2}{11}\right) \right] = 1.7899 \qquad (3.50)
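A short Python check of (3.49)-(3.50). The helper name and the use of observed relative frequencies as the probabilities pi are assumptions of this sketch:

import numpy as np
from collections import Counter

def entropy(symbols):
    # Eq. (3.49): He = -sum(p_i * log2(p_i)) over the observed symbol set.
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * np.log2(c / n) for c in counts.values())

X = [4, 5, 6, 6, 2, 5, 4, 4, 5, 4, 4]
print(entropy(X))   # approximately 1.79 bits/symbol, matching (3.50)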


Table 3.3 An example entropy code for Example 3.2

Swimmer    Probability of winning    Binary string (identifier)
S1         1/2                       0
S2         1/4                       10
S3         1/8                       110
S4         1/16                      1110
S5         1/64                      111100
S6         1/64                      111101
S7         1/64                      111110
S8         1/64                      111111

Example 3.2

Consider eight swimmers, S1, S2, S3, S4, S5, S6, S7, and S8, in a race with win probabilities 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, and 1/64, respectively. The entropy of the message announcing the winner can be computed as

H_e(X) = -\sum_{i=1}^{8} p_i \log_2(p_i) = -\left[ \frac{1}{2}\log_2\!\left(\frac{1}{2}\right) + \frac{1}{4}\log_2\!\left(\frac{1}{4}\right) + \frac{1}{8}\log_2\!\left(\frac{1}{8}\right) + \frac{1}{16}\log_2\!\left(\frac{1}{16}\right) + \frac{4}{64}\log_2\!\left(\frac{1}{64}\right) \right] = 2 \text{ bits}

An example entropy code for the above message can be obtained by associating binary strings with the swimmers' win probabilities, as shown in Table 3.3. The average length of the example entropy code given in Table 3.3 is 2 bits, in contrast with 3 bits for a uniform code.

The statistical entropy alone does not provide a good measure of compressibility in the case of audio coding, since several other factors, i.e., quantization noise, masking thresholds, and tone- and noise-masking effects, must be accounted for in order to achieve coding efficiency. Johnston, in 1988, proposed a theoretical limit on compressibility for audio signals (~2.1 bits/sample) based on a measure of the perceptually relevant information content. The limit is obtained from both psychoacoustic signal analysis and the statistical entropy, and is called the perceptual entropy [John88a] [John88b]. The various steps involved in perceptual entropy estimation are described later in Chapter 5.

In all entropy coding schemes, the objective is to construct an ensemble code for each message, such that the code is uniquely decodable, prefix-free, and optimum in the sense that it provides minimum-redundancy encoding. In particular, some basic restrictions and design considerations imposed on a source-coding process include:


• Condition 1: Each message should be assigned a unique code (see Example 3.3).

• Condition 2: The codes must be prefix-free. For example, consider the following two code sequences (CS), CS-1 = {00, 11, 10, 011, 001} and CS-2 = {00, 11, 10, 011, 010}, and suppose the output sequence to be decoded is 001100011. At the decoder, if CS-1 is employed, the decoded sequence can be either {00, 11, 00, 011} or {001, 10, 00, 11}, etc. This is due to the confusion at the decoder whether to select '00' or '001' from the output sequence. This confusion is avoided using CS-2, where the decoded sequence is unique and is given by {00, 11, 00, 011}. Therefore, in a valid prefix-free code, no codeword in its entirety can be found as a prefix of another codeword.

• Condition 3: Additional information regarding the beginning- and end-points of a message source will not usually be available at the decoder (once synchronization occurs).

• Condition 4: A necessary condition for a code to be prefix-free is given by the Kraft inequality [Cove91],

KI = \sum_{i=1}^{N} 2^{-L_i} \leqslant 1 \qquad (3.51)

where Li is the codeword length of the i-th symbol. In the example discussed in Condition 2, the Kraft inequality KI for both CS-1 and CS-2 equals 1. Although the Kraft inequality for CS-1 is satisfied, the encoding sequence is not prefix-free; i.e., (3.51) is a necessary but not a sufficient condition for a code to be prefix-free (see the sketch after this list).

• Condition 5: To obtain a minimum-redundancy code, the compression rate R must be minimized while (3.51) is satisfied,

R = \sum_{i=1}^{N} p_i L_i \qquad (3.52)
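As noted in Condition 4, the Kraft inequality is necessary but not sufficient. The following Python sketch (illustrative helper names) checks both (3.51) and the prefix-free property for the code sequences CS-1 and CS-2 of Condition 2:

def kraft_sum(codewords):
    # Eq. (3.51): sum of 2^(-L_i) over the codeword lengths.
    return sum(2.0 ** -len(c) for c in codewords)

def is_prefix_free(codewords):
    # True if no codeword is a prefix of another codeword.
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

CS1 = ["00", "11", "10", "011", "001"]
CS2 = ["00", "11", "10", "011", "010"]
print(kraft_sum(CS1), is_prefix_free(CS1))   # 1.0, False ('00' is a prefix of '001')
print(kraft_sum(CS2), is_prefix_free(CS2))   # 1.0, True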

Example 3.3

Let the input bitstream X = [4 5 6 6 2 5 4 4 1 4 4] be chosen over a data set V = [0 1 2 3 4 5 6 7]. Here, N = 11, K = 8, and the probabilities are

p_i = \left[ 0, \frac{1}{11}, \frac{1}{11}, 0, \frac{5}{11}, \frac{2}{11}, \frac{2}{11}, 0 \right], \quad i \in V.

In Figure 3.16(a), a simple binary representation with an equal-length code is used. The code length of each symbol is given by l = int(log2 K), i.e., l = int(log2 8) = 3. Therefore, a possible binary mapping would be 1 → 001, 2 → 010, 4 → 100, 5 → 101, 6 → 110, and the total length is Lb = 33 bits.
Figure 3.16(b) depicts the Shannon-Fano coding procedure, where each symbol is encoded with a unary-style representation ordered by decreasing symbol probability.


Input bitstream: [4 5 6 6 2 5 4 4 1 4 4]

(a) Encoding based on the binary equal-length code:
[100 101 110 110 010 101 100 100 001 100 100]; total length Lb = 33 bits.

(b) Encoding based on Shannon-Fano coding:
4 occurs five times, 5 twice, 6 twice, 2 once, and 1 once; probabilities p4 = 5/11, p5 = 2/11, p6 = 2/11, p2 = 1/11, and p1 = 1/11.
Representations: 4 → 0, 5 → 10, 6 → 110, 1 → 1110, 2 → 11110.
[0 10 110 110 11110 10 0 0 1110 0 0]; total length LSF = 24 bits.

Figure 3.16 Entropy coding schemes: (a) binary equal-length code, (b) Shannon-Fano coding.

3.6.1 Huffman Coding

Huffman proposed a technique to construct minimum-redundancy codes [Huff52]. Huffman codes have found application in audio and video encoding due to their simplicity and efficiency. Moreover, Huffman coding is regarded as the most effective compression method, provided that the codes designed using a specific set of symbol frequencies match the input symbol frequencies. The PDFs of audio signals over shorter frame-lengths are better described by the Gaussian distribution, while the long-time PDFs of audio can be characterized by the Laplacian or gamma densities [Rabi78] [Jaya84]. Hence, for example, Huffman codes designed based on the Gaussian or Laplacian PDFs can provide minimum-redundancy entropy codes for audio encoding. Moreover, depending upon the symbol frequencies, a series of Huffman codetables can also be employed for entropy coding; e.g., MPEG-1 Layer-III employs 32 Huffman codetables [ISOI94].

Example 3.4

Figure 3.17 depicts the Huffman coding procedure for the numerical Example 3.3. The input symbols, i.e., 1, 2, 4, 5, and 6, are first arranged in ascending order w.r.t. their probabilities. Next, the two symbols with the smallest probabilities are combined to form a binary tree; the left branch is assigned a "0" and the right branch a "1". The probability of the resulting node is obtained by adding the two probabilities of the previous nodes, as shown in Figure 3.17. The above procedure is continued until all the input symbol nodes are used. Finally, the Huffman code for each input symbol is formed by reading the bits along the tree; for example, the Huffman codeword for the input symbol "1" is "0000". The resulting Huffman bit-mapping is given in Table 3.4, and the total length of the encoded bitstream is LHF = 23 bits. Note that, depending on the node selection for the code tree formation,


[Figure 3.17 shows the Huffman code tree: the leaf probabilities 1/11 (symbol 1), 1/11 (symbol 2), 2/11 (symbol 6), 2/11 (symbol 5), and 5/11 (symbol 4) are successively merged, yielding the codewords 0000, 0001, 001, 01, and 1, respectively.]

Figure 3.17 A possible Huffman coding tree for Example 3.4.

Table 3.4 Huffman codetable for the input bitstream X = [4 5 6 6 2 5 4 4 1 4 4]

Input symbol    Probability    Huffman codeword
4               5/11           1
5               2/11           01
6               2/11           001
2               1/11           0001
1               1/11           0000

several Huffman bit-mappings are possible; for example, 4 → 1, 5 → 011, 6 → 010, 2 → 001, and 1 → 000, as shown in Figure 3.18. However, the total number of bits remains the same, i.e., LHF = 23 bits.
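A compact Python sketch of the Huffman design procedure described above (the heap-based construction and helper names are assumptions of this sketch; the particular bit-mapping it returns may differ from Table 3.4, but the total length is the same):

import heapq
from collections import Counter

def huffman_code(symbols):
    # Builds a Huffman code from the observed symbol frequencies, as in Example 3.4.
    counts = Counter(symbols)
    heap = [(freq, idx, {sym: ""}) for idx, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    idx = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)          # two least-probable nodes
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (f0 + f1, idx, merged))
        idx += 1
    return heap[0][2]

X = [4, 5, 6, 6, 2, 5, 4, 4, 1, 4, 4]
code = huffman_code(X)
total_bits = sum(len(code[s]) for s in X)
print(code, total_bits)   # one valid mapping; total length is 23 bits, as in Example 3.4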

Example 3.5

The entropy of the input bitstream X = [4 5 6 6 2 5 4 4 1 4 4] is given by

H_e(X) = -\sum_{i=1}^{K} p_i \log_2(p_i) = -\left[ \frac{2}{11}\log_2\!\left(\frac{1}{11}\right) + \frac{4}{11}\log_2\!\left(\frac{2}{11}\right) + \frac{5}{11}\log_2\!\left(\frac{5}{11}\right) \right] = 2.04 \qquad (3.53)


[Figure 3.18 shows an alternative Huffman code tree for the same symbol probabilities (1/11, 1/11, 2/11, 2/11, 5/11), yielding the codewords 000, 001, 010, 011, and 1.]

Figure 3.18 Huffman coding tree for Example 3.4.

From Figures 3.16 and 3.17, the compression rate R is obtained using

(a) the uniform binary representation: 33/11 = 3 bits/symbol;
(b) Shannon-Fano coding: 24/11 = 2.18 bits/symbol; and
(c) Huffman coding: 23/11 = 2.09 bits/symbol.

In the case of Huffman coding, the entropy He(X) and the compression rate R can be related using the entropy bounds [Cove91],

H_e(X) \leqslant R \leqslant H_e(X) + 1 \qquad (3.54)

It is interesting to note that the compression rate for the Huffman code equals the lower entropy bound, i.e., R = He(X), when the input symbol probabilities are negative powers of two (see Example 3.2).

Example 3.6

We now construct the Huffman code table for input symbol frequencies different from those in Example 3.4. Consider the input bitstream Y = [2 5 6 6 2 5 5 4 1 4 4] chosen over the data set V = [0 1 2 3 4 5 6 7], with probabilities

p_i = \left[ 0, \frac{1}{11}, \frac{2}{11}, 0, \frac{3}{11}, \frac{3}{11}, \frac{2}{11}, 0 \right], \quad i \in V.

Using the design procedure described above, a Huffman code tree can be formed as shown in Figure 3.19. Table 3.5 presents the resulting Huffman code table. The total length of the Huffman-encoded bitstream is LHF = 25 bits.

Depending on the Huffman code table design procedure employed, three different encoding approaches are possible. First, entropy coding based on


[Figure 3.19 shows the Huffman code tree for Y: the leaf probabilities 1/11 (symbol 1), 2/11 (symbol 2), 2/11 (symbol 6), 3/11 (symbol 5), and 3/11 (symbol 4) are successively merged, yielding the codewords 000, 001, 01, 10, and 11, respectively.]

Figure 3.19 Huffman coding tree for Example 3.6.

Table 3.5 Huffman codetable for the input bitstream Y = [2 5 6 6 2 5 5 4 1 4 4]

Input symbol    Probability    Huffman codeword
4               3/11           11
5               3/11           10
6               2/11           01
2               2/11           001
1               1/11           000

the Huffman codes designed beforehand, i.e., nonadaptive Huffman coding. In particular, a training process involving a large database of input symbols is employed to design the Huffman codes. These Huffman code tables are available both at the encoder and at the decoder. It is important to note that this approach may not (always) result in minimum-redundancy encoding. For example, if the Huffman bit-mapping given in Table 3.5 is used to encode the input bitstream X = [4 5 6 6 2 5 4 4 1 4 4] given in Example 3.3, the resulting total number of bits is LHF = 24 bits, i.e., one bit more compared to Example 3.4. Therefore, in order to obtain better compression, a reliable symbol-frequency model is necessary. A series of Huffman code tables (typically in the range of 10–32), selected based on the symbol probabilities, is usually employed in order to overcome this shortcoming. The nonadaptive Huffman coding method is employed in a variety of audio coding standards [ISOI92] [ISOI94] [ISOI96] [John96]. Second, Huffman coding based on an iterative design/encode procedure, i.e., semi-adaptive Huffman coding. In the entropy coding literature, this approach is typically called the "two-pass" encoding scheme. In the first pass, a Huffman codetable is designed


based on the input symbol statistics. In the second pass, entropy coding is performed using the designed Huffman codetable (similar to Examples 3.4 and 3.6). In this approach, note that the designed Huffman codetables must also be transmitted along with the entropy-coded bitstream. This results in reduced coding efficiency, but with improved symbol-frequency modeling. Third, adaptive Huffman coding, in which the symbol frequencies are computed dynamically from the previously coded samples. Adaptive Huffman coding schemes based on the input quantization step size have also been proposed in order to accommodate a wide range of input word lengths [Crav96] [Crav97].

3.6.2 Rice Coding

Rice, in 1979, proposed a method for constructing practical noiseless codes [Rice79]. Rice codes are usually employed when the input signal x exhibits a Laplacian distribution, i.e.,

p_L(x) = \frac{1}{\sqrt{2}\,\sigma_x} e^{-\sqrt{2}|x| / \sigma_x} \qquad (3.55)

A Rice code can be considered a Huffman code for the Laplacian PDF. Several efficient algorithms are available for forming Rice codes [Rice79] [Cove91]. A simple method to represent the integer I as a Rice code is to divide it into four parts: a sign bit, the m low-order bits (LSBs) of I, the number formed by the remaining MSBs of I expressed as that many zeros, and a stop bit '1'. The parameter m characterizes the Rice code and is given by [Robi94]

m = \log_2(\log_e(2) E(|x|)) \qquad (3.56)

For example, the Rice code for I = 69 and m = 4 is [0 0101 0000 1].
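A minimal Python sketch of the four-part Rice code just described (the function name is an assumption of this sketch):

def rice_encode(value, m):
    # Sign bit, m LSBs, the remaining MSBs in unary (that many '0's), then a stop bit '1'.
    sign = "1" if value < 0 else "0"
    value = abs(value)
    lsb = format(value & ((1 << m) - 1), "0{}b".format(m))
    msb = value >> m
    return sign + lsb + "0" * msb + "1"

print(rice_encode(69, 4))   # '0' + '0101' + '0000' + '1', as in the example above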

Example 3.7

Rice coding for Example 3.3; input bitstream X = [4 5 6 6 2 5 4 4 1 4 4].

Input symbol    Binary representation    Rice code (m = 2)
4               100                      0 00 0 1
5               101                      0 01 0 1
6               110                      0 10 0 1
2               010                      0 10 1
1               001                      0 01 1


3.6.3 Golomb Coding

Golomb codes [Golo66] are optimal for exponentially decaying probability distributions of positive integers. Golomb codes are prefix codes that can be characterized by a unique parameter m. An integer I can be encoded using a Golomb code as follows. The code consists of two parts: a binary representation of (I mod m) and a unary representation of [I/m]. For example, consider I = 69 and m = 16; the Golomb code will be [010111100], as explained in Figure 3.20. In Method 1, the positive integer I is divided into two parts, i.e., binary and unary bits, along with a stop bit. On the other hand, in Method 2, if m = 2^k, the codeword for I consists of the k LSBs of I, followed by the number formed by the remaining MSBs of I in unary representation and a stop bit. The length of the code is therefore k + [I/2^k] + 1.
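A minimal Python sketch of one common Golomb-Rice convention for m a power of two (binary remainder followed by the quotient in unary, closed by a terminating '0'). The helper name is an assumption, and unary conventions vary slightly across references, so the exact bit pattern may differ from Figure 3.20 by a bit:

def golomb_encode(value, m):
    # binary(value mod m) on k = log2(m) bits, then value // m ones, then a '0'.
    k = m.bit_length() - 1            # assumes m is a power of two
    remainder = format(value % m, "0{}b".format(k))
    quotient = value // m
    return remainder + "1" * quotient + "0"

print(golomb_encode(69, 16))   # '0101' followed by the unary part and the stop bit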

Example 3.8

Consider the input bitstream X = [4 4 4 2 2 4 4 4 4 4 4 2 4 4 4 4] chosen over the data set V = [2 4]. The run-length encoding scheme [Golo66] can be employed to efficiently encode X. Note that "4" is the most frequently occurring symbol in X. The number of consecutive occurrences of "4" is called the run length n. The run lengths are monitored and encoded, i.e., [3 0 6 4], where "0" represents a consecutive occurrence of "2". The probabilities of occurrence of "4" and "2" are p(4) = p = 13/16 and p(2) = (1 − p) = q = 3/16, respectively. Note that p >> q. For this case,

Method 1 (I = 69, m = 16):
  First part: Binary(I mod m) = Binary(69 mod 16) = Binary(5) = 0101
  Second part: Unary([I/m]) = Unary(4) = 1110
  Stop bit: 0

Method 2 (I = 69, m = 16, i.e., k = 4, where m = 2^k):
  First part: k LSBs of I = 4 LSBs of [1000101] = 0101
  Second part: Unary(rest of the MSBs) = Unary(4) = 1110
  Stop bit: 0

Golomb code: [0101 1110 0] = first part + second part + stop bit

Figure 3.20 Golomb coding.


Huffman coding of X results in 16 bits. Moreover, the PDFs of the run lengths are better described using an exponential distribution, i.e., the probability of a run length of n is given by q·p^n. Rice coding [Rice79] or Golomb coding [Golo66] can therefore be employed to efficiently encode the exponentially distributed run lengths. Furthermore, both Golomb and Rice codes are fast prefix codes that allow for practical implementation.

3.6.4 Arithmetic Coding

Arithmetic coding [Riss79] [Witt87] [Howa94] encodes a sequence of input symbols as a single large binary fraction known as a "codeword." For example, let V = [v1 v2 ... vK] be the data set, let pi be the probability that the i-th symbol is transmitted, and let X = [x1 x2 ... xN] be the input data vector of length N. The main idea behind an arithmetic coder is to encode the entire input data stream X as one codeword that corresponds to a rational number in the half-open unit interval [0, 1). Arithmetic coding is particularly useful when dealing with adaptive encoding and with highly skewed symbol probabilities [Witt87].

Example 3.9

Arithmetic coding of the input stream X = [1, 0, −1, 0, 1] chosen over the data set V = [−1, 0, 1]. Here, N = 5 and K = 3. We will use the symbol probabilities pi = [1/3, 1/2, 1/6], i ∈ V.

Step 1
The probabilities associated with the data set V = [−1, 0, 1] are arranged as intervals on a scale of [0, 1), as shown in Figure 3.21.

Step 2
The first input symbol in the data stream X is '1'. Therefore, the interval [5/6, 1) is chosen as the target range.

Step 3
The second input symbol in the data stream X is '0'. Now the interval [5/6, 1) is partitioned according to the symbol probabilities 1/3, 1/2, and 1/6. The resulting interval ranges are given in Figure 3.21. For example, the interval range for symbol '−1' is computed as

[5/6, 5/6 + (1/6)(1/3)) = [5/6, 16/18),

and for symbol '0' the interval range is given by

[16/18, 5/6 + (1/6)(1/3 + 1/2)) = [16/18, 35/36).


[Figure 3.21 illustrates the successive interval refinement for X = [1, 0, −1, 0, 1]: [0, 1) → [5/6, 1) → [16/18, 35/36) → [16/18, 33/36) → [97/108, 197/216) → [393/432, 394/432).]

Figure 3.21 Arithmetic coding. First, the probabilities associated with the data set V = [−1, 0, 1] are arranged as intervals on a scale of [0, 1). Next, in Step 2, an interval is chosen that corresponds to the probability of the input symbol '1' in the data sequence X, i.e., [5/6, 1). In Step 3, the interval [5/6, 1) is partitioned according to the probabilities 1/3, 1/2, and 1/6, and the range corresponding to the input symbol '0' is chosen, i.e., [16/18, 35/36). This procedure is repeated for the rest of the input symbols, and the final interval range (typically a rational number) is encoded in binary form.


Step 4
The third input symbol in the data stream X is '−1'. The interval [16/18, 35/36) is partitioned according to the symbol probabilities 1/3, 1/2, and 1/6. The resulting interval ranges are given in Figure 3.21.

The above procedure is repeated for the rest of the input symbols, and the final interval range is obtained, i.e., [393/432, 394/432) = [0.9097222, 0.912037). In binary form, the interval is given by [0.1110100011..., 0.1110100101...). Since all binary numbers that begin with 0.1110100100 lie within the interval [393/432, 394/432), the binary codeword 1110100100 uniquely represents the input data stream X = [1, 0, −1, 0, 1].
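The interval computation of Example 3.9 can be reproduced exactly with rational arithmetic. The following Python sketch is illustrative only (a practical arithmetic coder renormalizes and emits bits incrementally rather than keeping exact fractions):

from fractions import Fraction

def arithmetic_encode_interval(data, symbols, probs):
    # Returns the final [low, high) interval for the input sequence, as in Example 3.9.
    cum, acc = [], Fraction(0)
    for p in probs:
        cum.append(acc)
        acc += p
    low, width = Fraction(0), Fraction(1)
    for x in data:
        i = symbols.index(x)
        low = low + width * cum[i]
        width = width * probs[i]
    return low, low + width

symbols = [-1, 0, 1]
probs = [Fraction(1, 3), Fraction(1, 2), Fraction(1, 6)]
low, high = arithmetic_encode_interval([1, 0, -1, 0, 1], symbols, probs)
print(low, high)   # 131/144 (= 393/432) and 197/216 (= 394/432)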

3.7 SUMMARY

This chapter covered quantization essentials and provided background on PCM, DPCM, vector quantization, bit allocation, and entropy coding algorithms. A quantization-bit allocation-entropy coding (QBE) framework that is part of most audio coding standards was described. Some of the important concepts addressed in this chapter include:

• Uniform and nonuniform quantization
• PCM, DPCM, and ADPCM techniques
• Vector quantization, structured VQ, split VQ, and conjugate-structure VQ
• Bit-allocation strategies
• Source coding principles
• Lossless (entropy) coders: Huffman coding, Rice coding, Golomb coding, and arithmetic coding

PROBLEMS

3.1 Derive the PCM 6 dB per bit rule when the quantization error has a uniform probability density function.

3.2 For a signal with Gaussian distribution (zero mean and unit variance):
a. Design a uniform PCM quantizer with four levels.
b. Design a nonuniform four-level quantizer that is optimized for the signal PDF. Compare with the uniform PCM in terms of SNR.

3.3 For the PDF p(x) = \frac{1}{2} e^{-|x|}, determine the mean, the variance, and the probability that a random variable will fall within ±σx of the mean value.


[Figure 3.22 shows the PDF p(x), defined on x ∈ [−1, 1] with peak value 1.]

Figure 3.22 An example PDF.

3.4 For the PDF p(x) given in Figure 3.22, design a four-level PDF-optimized PCM and compare to uniform PCM in terms of SNR.

3.5 Give and justify a formula for the number of bits in simple vector quantization with N × 1 vectors and L template vectors.

3.6 Give, in terms of L and N, the order of complexity of a VQ codebook search, where L is the number of codebook entries and N is the codebook dimension. Consider the following cases: (i) a simple VQ, (ii) a multi-step VQ, and (iii) a split VQ. For (ii) and (iii), use the configurations given in Figure 3.12 and Figure 3.14, respectively.

COMPUTER EXERCISES

3.7 Design a DPCM coder for a stationary random signal with power spectral density S(e^{j\Omega}) = \frac{1}{|1 + 0.8 e^{j\Omega}|^2}. Use a first-order predictor. Give a block diagram and all pertinent equations. Write a program that implements the DPCM coder and evaluate the MSE at the receiver. Compare the SNR (for the same data) with a PCM system operating at the same bit rate.

3.8 In this problem, you will write a computer program to design a vector quantizer and generate a codebook of size [L × N]. Here, L is the number of codebook entries and N is the codebook dimension.

Step 1
Generate a training set Tin of size [Ln × N], where n is the number of training vectors per codevector. Assume L = 16, N = 4, and n = 10. Denote the training set elements as tin(i, j), for i = 0, 1, ..., 159 and j = 0, 1, 2, 3. Use Gaussian vectors of zero mean and unit variance for training.

Step 2
Using the LBG algorithm [Lind80], design a vector quantizer and generate a codebook C of size [L × N], i.e., [16 × 4]. In the LBG VQ design, choose the distortion threshold as 0.001. Label the codevectors as c(i, j), for i = 0, 1, ..., 15 and j = 0, 1, 2, 3.

Table 3.6 Segmental and overall SNR values for a test signal within and outside the training sequence, for different numbers of codebook entries L, different codebook dimensions N, and ε = 0.001

Training vectors per entry, n | Codebook dimension, N | SNR for a test signal within the training sequence (dB): Segmental (L = 16, L = 64), Overall (L = 16, L = 64) | SNR for a test signal outside the training sequence (dB): Segmental (L = 16, L = 64), Overall (L = 16, L = 64)

Rows to be filled in: n = 10 with N = 2 and N = 4; n = 100 with N = 2 and N = 4.


[Figure 3.23 shows a single-stage vector quantizer: the input s is quantized using codebook L1 (a 4-bit VQ with L = 16, N = 4, n = 100, and ε = 0.00001) to produce ŝ.]

Figure 3.23 Four-bit VQ design specifications for Problem 3.9.

Step 3
Similar to Step 1, generate another training set Tout of size [160 × 4] that we will use for testing the VQ performance. Label these training set values as tout(i, j), for i = 0, 1, ..., 159 and j = 0, 1, 2, 3.

Step 4
Using the codebook C designed in Step 2, perform vector quantization of tin(i, j) and tout(i, j). Denote the VQ results as t̂in(i, j) and t̂out(i, j), respectively.

a. When the test vectors are within the training sequence, compute the overall SNR and segmental SNR values as follows:

SNR_{overall} = \frac{\sum_{i=0}^{Ln-1} \sum_{j=0}^{N-1} t_{in}^2(i, j)}{\sum_{i=0}^{Ln-1} \sum_{j=0}^{N-1} (t_{in}(i, j) - \hat{t}_{in}(i, j))^2} \qquad (3.57)

SNR_{segmental} = \frac{1}{Ln} \sum_{i=0}^{Ln-1} \frac{\sum_{j=0}^{N-1} t_{in}^2(i, j)}{\sum_{j=0}^{N-1} (t_{in}(i, j) - \hat{t}_{in}(i, j))^2} \qquad (3.58)

b. Compute the overall SNR and segmental SNR values when the test vectors are different from the training ones, i.e., replace tin(i, j) with tout(i, j) and t̂in(i, j) with t̂out(i, j) in (3.57) and (3.58).

c. List in Table 3.6 the overall and segmental SNR values for different numbers of codebook entries and different codebook dimensions. Explain the effects of choosing different values of L, n, and N on the SNR values.

d. Compute the MSE, \varepsilon(t_{in}, \hat{t}_{in}) = \frac{1}{Ln} \frac{1}{N} \sum_{i=0}^{Ln-1} \sum_{j=0}^{N-1} (t_{in}(i, j) - \hat{t}_{in}(i, j))^2, for different cases, e.g., L = 16, 64; n = 10, 100, 1000; N = 2, 8. Explain how the MSE varies for different values of L, n, and N.

3.9 Write a program to design a 4-bit VQ codebook L1 (i.e., use L = 16 codebook entries) with codebook dimension N = 4 (Figure 3.23). Use n = 100 training vectors per codebook entry and a distortion threshold ε = 0.00001. For VQ training, choose zero-mean, unit-variance Gaussian vectors.


[Figure 3.24 shows a three-stage (multi-step) vector quantizer: the input s is quantized by Vector Quantizer 1 (codebook L1); the error e1 = s − ŝ is quantized by Vector Quantizer 2 (codebook L2); and the error e2 = e1 − ê1 is quantized by Vector Quantizer 3 (codebook L3). Each stage is a 4-bit VQ with L = 16, N = 4, and n = 100; distortion thresholds of 0.00001, 0.0001, and 0.001 are indicated for the three codebooks.]

Figure 3.24 A three-stage vector quantizer.

3.10 Extend the VQ design in Problem 3.9 to a multi-step VQ (see Figure 3.24 for an example multi-step VQ configuration). Use a total of three stages in your VQ design. Choose the MSE distortion thresholds in the stages as ε1 = 0.001, ε2 = 0.0001, and ε3 = 0.00001. Comment on the MSE convergence in each of the stages. How would you compare the multi-step VQ with a simple VQ in terms of the segmental and overall SNR values? (Note: In Figure 3.24, the first VQ codebook (L1) encodes the signal s, and the subsequent VQ stages L2 and L3 encode the error from the previous stage. At the decoder, the signal s′ can be reconstructed as s′ = ŝ + ê1 + ê2.)

3.11 Design a two-stage split-VQ. Choose L = 16 and n = 100. Implement the first stage as a 4-dimensional VQ and the second stage as two 2-dimensional VQs. Select the distortion thresholds as follows: for the first stage, ε1 = 0.001, and for the second stage, ε2 = 0.00001. Compare the coding accuracy (in terms of a distance measure) and the coding gain (in terms of the number of bits/sample) of the split-VQ with respect to the simple VQ in Problem 3.9 and the multi-step VQ in Problem 3.10. (See Figure 3.14.)

3.12 Given the input data stream X = [1 0 2 1 0 1 2 1 0 2 0 1 1] chosen over a data set V = [0, 1, 2, 3, 4, 5, 6, 7]:
a. Write a program to compute the entropy He(X).
b. Compute the symbol probabilities pi, i ∈ V, for the input data stream X.
c. Write a program to encode X using Huffman codes. Employ an appropriate Huffman bit-mapping. Give the length of the output bitstream. (Hint: see Example 3.4 and Example 3.6.)
d. Use arithmetic coding to encode the input data stream X. Give the final codeword interval range in binary form. Give also the length of the output bitstream. (See Example 3.9.)

CHAPTER 4

LINEAR PREDICTION IN NARROWBANDAND WIDEBAND CODING

4.1 INTRODUCTION

Linear predictive coders are embedded in several telephony and multimedia standards [G729] [G7231] [IS-893] [ISOI99]. Linear predictive coding (LPC) [Kroo95] is mostly used for source coding of speech signals, and the dominant application of LPC is cellular telephony. Recently, linear prediction (LP) analysis/synthesis has also been integrated into some of the wideband speech coding standards [G722] [G7222] [Bess02] and into audio modeling [Iwak96] [Mori96] [Harm97a] [Harm97b] [Bola98] [ISOI00].

LP analysis/synthesis exploits the short- and long-term correlation to parameterize the signal in terms of a source-system representation. LP analysis can be open-loop or closed-loop. In closed-loop analysis, also called analysis-by-synthesis, the LP parameters are estimated by minimizing the "perceptually weighted" difference between the original and reconstructed signals. Speech coding standards use a perceptual weighting filter (PWF) to shape the quantization noise according to the masking properties of the human ear [Schr79] [Kroo95] [Sala98]. Although the PWF has been successful in speech coding, audio coding requires a more sophisticated strategy to exploit perceptual redundancies. To this end, several extensions [Bess02] [G7222] to the conventional LPC have been proposed. Hybrid transform/predictive coding techniques have also been employed for high-quality, low-bit-rate coding [Ramp98] [Ramp99] [Rong99] [ISOI99] [ISOI00]. Other LP methods that make use of perceptual constraints and



auditory psychophysics include the perceptual LP (PLP) [Herm90], the warped LP (WLP) [Stru80] [Harm96] [Harm01], and the perceptually motivated all-pole (PMAP) modeling [Atti05]. In the PLP, a perceptually based auditory spectrum is obtained by filtering the signal using a filter bank that mimics the auditory filter bank. An all-pole filter that approximates the auditory spectrum is then computed using the autocorrelation method [Makh75]. On the other hand, in the WLP, the main idea is to warp the frequency axis according to a Bark scale prior to performing LP analysis. The PMAP modeling employs an auditory excitation pattern-matching method to directly estimate the perceptually relevant pole locations. The estimated "perceptual poles" are then used to construct an all-pole filter for speech analysis/synthesis.

Whether or not LPC is amenable to audio modeling depends on the signal properties. For example, a code-excited linear predictive (CELP) coder seems to be more adequate than a sinusoidal coder for telephone speech, while the sinusoidal coder seems to be more promising for music.

4.2 LP-BASED SOURCE-SYSTEM MODELING FOR SPEECH

Speech is produced by the interaction of the vocal tract with the vocal cords. The LP analysis/synthesis framework (Figures 4.1 and 4.2) has been successful for speech coding because it fits well the source-system paradigm for speech [Makh75] [Mark76]. In particular, the slowly time-varying spectral characteristics of the upper vocal tract (system) are modeled by an all-pole filter, while the prediction residual captures the voiced, unvoiced, or mixed excitation signal. The LP analysis filter A(z) in Figure 4.1 is given by

A(z) = 1 - \sum_{i=1}^{L} a_i z^{-i} \qquad (4.1)

[Figure 4.1 shows LP parameter estimation: the input speech is framed (buffered), and LP analysis yields A(z) = 1 − Σ_{i=1}^{L} a_i z^{−i} (the vocal-tract LP spectral envelope) together with the linear prediction residual.]

Figure 4.1 Parameter estimation using linear prediction.


[Figure 4.2 shows the speech synthesis model: a gain-scaled excitation drives the vocal-tract filter 1/A(z) (LP synthesis) to produce the synthesized signal.]

Figure 4.2 Engineering model for speech synthesis.

where L is the order of the linear predictor. Figure 4.2 depicts a simple speech synthesis model, where a time-varying digital filter is excited by quasi-periodic waveforms when speech is voiced (e.g., as in steady vowels) and by random waveforms for unvoiced speech (e.g., as in consonants). The inverse filter 1/A(z) shown in Figure 4.2 is an all-pole LP synthesis filter,

H(z) = \frac{G}{A(z)} = \frac{G}{1 - \sum_{i=1}^{L} a_i z^{-i}} \qquad (4.2)

where G represents the gain. Note that the term all-pole is used loosely, since (4.2) has zeros at z = 0. The frequency response associated with the LP synthesis filter, i.e., the LPC spectrum, represents the formant structure of the speech signal (Figure 4.3). In this figure, F1, F2, F3, and F4 represent the four formants.

[Figure 4.3 plots magnitude (dB) versus normalized frequency (×4 kHz) for the LPC spectrum and the FFT spectrum; the formant peaks F1, F2, F3, and F4 are marked.]

Figure 4.3 The LPC and FFT spectra (dotted line). The formants represent the resonant modes of the vocal tract.


4.3 SHORT-TERM LINEAR PREDICTION

Figure 4.4 presents a typical L-th order FIR linear predictor. During forward linear prediction of s(n), an estimated value ŝ(n) is computed as a linear combination of the previous samples, i.e.,

\hat{s}(n) = \sum_{i=1}^{L} a_i s(n - i) \qquad (4.3)

where the weights ai are the LP coefficients. The output of the LP analysis filter A(z) is called the prediction residual, e(n) = s(n) − ŝ(n). This is given by

e(n) = s(n) - \sum_{i=1}^{L} a_i s(n - i) \qquad (4.4)

Because only short-term delays are considered in (4.4), the linear predictor in Figure 4.4 is also referred to as the short-term linear predictor. The linear predictor coefficients ai are estimated using least-squares minimization of the prediction error, i.e.,

\varepsilon = E[e^2(n)] = E\left[ \left( s(n) - \sum_{i=1}^{L} a_i s(n - i) \right)^2 \right] \qquad (4.5)

The minimization of ε in (4.5) with respect to ai, i.e., setting ∂ε/∂ai = 0 for i = 1, 2, ..., L, yields a set of equations involving autocorrelations,

r_{ss}(m) - \sum_{i=1}^{L} a_i r_{ss}(m - i) = 0, \quad m = 1, 2, \ldots, L \qquad (4.6)

where rss(m) is the autocorrelation sequence of the signal s(n). Equation (4.6) can be written in matrix form, i.e.,

[Figure 4.4 shows the FIR LP analysis structure: the input s(n) passes through a delay line of z^{-1} elements, the delayed samples are weighted by a_1, ..., a_{L-1}, a_L and summed, and the sum is subtracted from s(n) to form the prediction residual e(n); the analysis filter is A(z) = 1 − Σ_{i=1}^{L} a_i z^{−i}.]

Figure 4.4 Linear prediction (LP) analysis.


\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_L \end{bmatrix} = \begin{bmatrix} r_{ss}(0) & r_{ss}(-1) & r_{ss}(-2) & \cdots & r_{ss}(1 - L) \\ r_{ss}(1) & r_{ss}(0) & r_{ss}(-1) & \cdots & r_{ss}(2 - L) \\ r_{ss}(2) & r_{ss}(1) & r_{ss}(0) & \cdots & r_{ss}(3 - L) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{ss}(L - 1) & r_{ss}(L - 2) & r_{ss}(L - 3) & \cdots & r_{ss}(0) \end{bmatrix}^{-1} \begin{bmatrix} r_{ss}(1) \\ r_{ss}(2) \\ r_{ss}(3) \\ \vdots \\ r_{ss}(L) \end{bmatrix} \qquad (4.7)

or, more compactly,

\mathbf{a} = \mathbf{R}_{ss}^{-1} \mathbf{r}_{ss} \qquad (4.8)

where a is the LP coefficient vector, rss is the autocorrelation vector, and Rss is the autocorrelation matrix. Note that Rss has a Toeplitz and symmetric structure. Efficient algorithms [Makh75] [Mark76] [Marp87] are available for inverting the autocorrelation matrix Rss, including algorithms tailored to work well with finite-precision arithmetic [Gers90]. Typically, the Levinson-Durbin recursive algorithm [Makh75] is used to compute the LP coefficients. Preconditioning of the input sequence s(n) and the autocorrelation data rss(m) using tapered windows improves the numerical behavior of these algorithms [Klei95] [Kroo95]. In addition, bandwidth expansion or scaling of the LP coefficients is typical in LPC, as it reduces distortion during synthesis.
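A minimal Python sketch of the autocorrelation method with a Levinson-Durbin recursion, illustrating (4.6)-(4.8) for the sign convention of (4.3); the helper name, the toy AR(2) test signal, and the lack of windowing or bandwidth expansion are assumptions of this sketch:

import numpy as np

def lp_coefficients(s, L):
    # Autocorrelation method: solve the normal equations (4.7)/(4.8) by the
    # Levinson-Durbin recursion; returns a_1..a_L of A(z) = 1 - sum a_i z^-i.
    r = np.array([np.dot(s[:len(s) - m], s[m:]) for m in range(L + 1)])
    a, err = np.zeros(L), r[0]
    for i in range(L):
        k = (r[i + 1] - np.dot(a[:i], r[i:0:-1])) / err   # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        if i > 0:
            a[:i] = a_prev[:i] - k * a_prev[i - 1::-1]
        err *= (1.0 - k * k)
    return a

# Toy AR(2) signal, s(n) = 0.6 s(n-1) - 0.2 s(n-2) + w(n), for illustration only.
rng = np.random.default_rng(1)
w = rng.standard_normal(2000)
s = np.zeros(2000)
for n in range(2, 2000):
    s[n] = 0.6 * s[n - 1] - 0.2 * s[n - 2] + w[n]
print(lp_coefficients(s, 2))   # estimates close to [0.6, -0.2]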

In low-bit-rate coding, the prediction coefficients and the residual must be efficiently quantized. Because the direct-form LP coefficients ai do not have adequate quantization properties, transformed coefficients are typically quantized. First-generation voice coders (vocoders), such as the LPC10e [FS1015] and the IS-54 VSELP [IS-54], quantize reflection coefficients that are a by-product of the Levinson-Durbin recursion. Transformation of the reflection coefficients can lead to a set of parameters that are also less sensitive to quantization. In particular, the log-area ratios and the inverse sine transformation have been used in the early GSM 6.10 algorithm [GSM89] and in the Skyphone standard [Boyd88]. Recent LP-based cellular standards quantize line spectrum pairs (LSPs). The main advantage of the LSPs is that they relate directly to the frequency domain, and hence they can be encoded using perceptual criteria.

4.3.1 Long-Term Prediction

Long-term prediction (LTP), as opposed to short-term prediction, is a process that captures the long-term correlation in the signal. The LTP provides a mechanism for representing the periodicity in the signal, and as such it represents the fine harmonic structure in the short-term spectrum (see Eq. (4.1)). LTP synthesis (4.9) requires the estimation of two parameters, i.e., the delay τ and the gain parameter aτ. For strongly voiced segments of speech, the delay is usually an integer that approximates the pitch period. The transfer function of a simple LTP synthesis filter Hτ(z) is given in (4.9). More complex LTP filters involve


multiple parameters and noninteger delays [Kroo90].

H_\tau(z) = \frac{1}{A_L(z)} = \frac{1}{1 - a_\tau z^{-\tau}} \qquad (4.9)

The LTP can be implemented by open-loop or closed-loop analysis. The open-loop LTP parameters are typically obtained by searching the autocorrelation sequence. The gain is simply obtained as aτ = rss(τ)/rss(0). In a closed-loop LTP search, the signal is synthesized for a range of candidate LTP lags, and the lag that produces the best waveform matching is chosen. Because of the intensive computations in a full-search closed-loop LTP, recent algorithms use open-loop LTP to establish an initial LTP lag that is then refined using a closed-loop search around the neighborhood of the initial estimate. In order to further reduce the complexity, LTP searches are often carried out only in every other subframe.

4.3.2 ADPCM Using Linear Prediction

One of the simplest compression schemes that uses short-term LP analysis-synthesis is the adaptive differential pulse code modulation (ADPCM) coder [Bene86] [G726]. ADPCM algorithms encode the difference between the current and the predicted speech samples. The block diagram of the ITU-T G.726 32 kb/s ADPCM encoder [Bene86] is shown in Figure 4.5. The algorithm consists of an adaptive quantizer and an adaptive pole-zero predictor. The prediction parameters are obtained by backward estimation, i.e., from quantized data, using a gradient algorithm at the decoder. From Figure 4.5, it can be noted that the decoder is embedded in the encoder. The pole-zero predictor (2 poles and 6 zeros) estimates the input signal and hence reduces the variance of e(n). The quantizer encodes the error e(n) into a sequence of 4-bit words. The ITU-T G.726 also accommodates 16, 24, and 40 kb/s with individually optimized quantizers.

4.4 OPEN-LOOP ANALYSIS-SYNTHESIS LINEAR PREDICTION

In almost all LP-based speech codecs, speech is approximated over short analysis intervals, typically in the neighborhood of 20 ms. As shown in Figure 4.6, a set of LP synthesis parameters is estimated on each analysis frame to capture the shape of the vocal tract envelope and to model the excitation.

Some of the typical synthesis parameters encoded and transmitted in open-loop LP include the prediction coefficients, the pitch information, the frame energy, and the voicing. At the receiver, the transmitted "source" parameters are used to form the excitation. The excitation e(n) is then used to excite the LP synthesis filter 1/A(z) to reconstruct the speech signal. Some of the standardized open-loop analysis-synthesis LP algorithms include the LPC10e Federal Standard FS-1015 [FS1015] [Trem82] [Camp86] and the Mixed Excitation LP (MELP) [McCr91]. The LPC10e FS-1015 uses a tenth-order predictor to estimate the vocal tract parameters and a two-state voiced/unvoiced excitation model for residual modeling. Mixed excitation schemes in conjunction with LPC were


[Figure 4.5 shows the ADPCM encoder: the prediction ŝ(n) is subtracted from the input s(n) to give e(n), which is quantized (Q) with an adaptive step update; the quantizer output is both the coder output and, via the inverse quantizer (Q^{-1}) and the pole-zero predictor A(z)/B(z), the input to the embedded decoder that produces the decoder output.]

Figure 4.5 The ADPCM ITU-T G.726 encoder.

[Figure 4.6 shows open-loop analysis-synthesis LP. Analysis (encoder): the input speech is analyzed to produce the parameters (LPC, excitation, pitch period, gain, energy, voicing, etc.). Synthesis (decoder): the residual/excitation drives the vocal-tract filter 1/A(z), with A(z) = 1 − Σ_{i=1}^{L} a_i z^{−i}, to produce the reconstructed speech.]

Figure 4.6 Open-loop analysis-synthesis LP.

proposed by Makhoul et al. [Makh78] and were later revisited by McCree and Barnwell [McCr91] [McCr93].

4.5 ANALYSIS-BY-SYNTHESIS LINEAR PREDICTION

In closed-loop source-system coders (Figure 4.7), the excitation source is determined by closed-loop or analysis-by-synthesis (A-by-S) optimization. The optimization process determines an excitation sequence that minimizes the perceptually weighted mean-square error (MSE) between the input speech and the reconstructed speech [Atal82b] [Sing84] [Schr85]. The closed-loop LP combines the


spectral modeling properties of vocoders with the waveform-matching attributes of waveform coders, and hence the A-by-S LP coders are also called hybrid LP coders. The system consists of a short-term LP synthesis filter 1/A(z) and an LTP synthesis filter 1/AL(z), as shown in Figure 4.7. The perceptual weighting filter (PWF) W(z) shapes the error such that the quantization noise is masked by the high-energy formants. The PWF is given by

W(z) = \frac{A(z/\gamma_1)}{A(z/\gamma_2)} = \frac{1 - \sum_{i=1}^{L} \gamma_1^i a_i z^{-i}}{1 - \sum_{i=1}^{L} \gamma_2^i a_i z^{-i}}, \quad 0 < \gamma_2 < \gamma_1 < 1 \qquad (4.10)

where γ1 and γ2 are the adaptive weights and L is the order of the linear predictor. Typically, γ1 ranges from 0.94 to 0.98, and γ2 varies between 0.4 and 0.7, depending upon the tilt or flatness characteristics associated with the LPC spectral envelope [Sala98] [Bess02]. The role of W(z) is to de-emphasize the error energy in the formant regions [Schr79]. This de-emphasis strategy is based on the fact that quantization noise in the formant regions is partially masked by speech. From Figure 4.7, note that a gain factor g scales the excitation vector x, and the excitation samples are filtered by the long-term and short-term synthesis filters.
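A minimal sketch of (4.10) in Python (assuming numpy and scipy are available; the function name, the default weights, and the toy frame are illustrative assumptions):

import numpy as np
from scipy.signal import lfilter

def perceptual_weighting(signal, a, gamma1=0.94, gamma2=0.6):
    # W(z) = A(z/gamma1) / A(z/gamma2), Eq. (4.10); a holds a_1..a_L of A(z).
    L = len(a)
    num = np.concatenate(([1.0], -a * gamma1 ** np.arange(1, L + 1)))
    den = np.concatenate(([1.0], -a * gamma2 ** np.arange(1, L + 1)))
    return lfilter(num, den, signal)

rng = np.random.default_rng(0)
frame = rng.standard_normal(160)          # one illustrative 20-ms frame at 8 kHz
a = np.array([0.8, -0.3])                 # illustrative second-order LP coefficients
weighted = perceptual_weighting(frame, a)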

The three most common excitation models typically embedded in the excitation generator module (Figure 4.7) of A-by-S LP schemes are the multi-pulse excitation (MPE) [Atal82b] [Sing84], the regular pulse excitation (RPE) [Kroo86], and the vector or code-excited linear prediction (CELP) [Schr85]. A 9.6 kb/s multi-pulse excited linear prediction (MPE-LP) algorithm is used in Skyphone airline applications [Boyd88]. A 13 kb/s coding scheme that uses regular pulse excitation (RPE) [Kroo86] was adopted for the first-generation full-rate ETSI GSM Pan-European digital cellular standard [GSM89].

The aforementioned MPE-LP and RPE schemes achieve high-quality speech at medium rates (13 kb/s). For low-rate, high-quality speech coding, a more efficient representation of the excitation sequence is required. Atal [Atal82a] suggested that high-quality speech at low rates may be produced by using noninstantaneous

[Figure 4.7 shows the A-by-S source-system model: the excitation generator (MPE or RPE) produces x, which is scaled by the gain g and filtered by the LTP synthesis filter 1/AL(z) and the LP synthesis filter 1/A(z) to yield the synthetic speech ŝ; the error e between the input speech s and ŝ is shaped by W(z) and drives the error-minimization block.]

Figure 4.7 A typical source-system model employed in analysis-by-synthesis LP.


(delayed-decision) coding of Gaussian excitation sequences in conjunction with A-by-S linear prediction and perceptual weighting. In the mid-1980s, Atal and Schroeder [Atal84] [Schr85] proposed a vector or code-excited linear prediction (CELP) algorithm for A-by-S linear predictive coding.

We provide further details on CELP in this section because of its recent use in wideband coding standards. The excitation codebook search process in CELP can be explained by considering the A-by-S scheme shown in Figure 4.8. The N × 1 error vector e associated with the k-th excitation vector can be written as

\mathbf{e}[k] = \mathbf{s}_w - \mathbf{s}_w^0 - g_k \hat{\mathbf{s}}_w[k] \qquad (4.11)

where sw is the N × 1 vector that contains the perceptually weighted speech samples, s⁰w is the vector that contains the output due to the initial filter states, ŝw[k] is the filtered synthetic speech vector associated with the k-th excitation vector, and gk is the gain factor. Minimizing εk = eᵀ[k]e[k] w.r.t. gk, we obtain

g_k = \frac{\tilde{\mathbf{s}}_w^T \hat{\mathbf{s}}_w[k]}{\hat{\mathbf{s}}_w^T[k]\, \hat{\mathbf{s}}_w[k]} \qquad (4.12)

where s̃w = sw − s⁰w and T denotes the transpose operator. From (4.12), εk can be written as

\varepsilon_k = \tilde{\mathbf{s}}_w^T \tilde{\mathbf{s}}_w - \frac{(\tilde{\mathbf{s}}_w^T \hat{\mathbf{s}}_w[k])^2}{\hat{\mathbf{s}}_w^T[k]\, \hat{\mathbf{s}}_w[k]} \qquad (4.13)

[Figure 4.8 shows the A-by-S CELP structure: the k-th excitation vector x[k] from the codebook is scaled by the gain gk and passed through the LTP synthesis filter 1/AL(z) and the LP synthesis filter 1/A(z) to give the synthetic speech ŝ[k]; both the input speech s and ŝ[k] are weighted by the PWF W(z), and the weighted error e[k] drives the MSE-minimization block.]

Figure 4.8 A generic block diagram for A-by-S code-excited linear predictive (CELP) coding. Note that the perceptual weighting W(z) is applied directly to the input speech s and the synthetic speech ŝ in order to facilitate the CELP analysis that follows. The k-th excitation vector x[k] that minimizes εk in (4.13) is selected, and the corresponding gain factor gk is obtained from (4.12). The codebook index k and the gain gk associated with the candidate excitation vector x[k] are encoded and transmitted along with the short-term and long-term prediction filter parameters.


The k-th excitation vector x[k] that minimizes (4.13) is selected, and the corresponding gain factor gk is obtained from (4.12).
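The following Python sketch illustrates this search step, (4.12)-(4.13), over a set of already-filtered candidate vectors; the function name, codebook size, and random data are assumptions of the sketch (a real coder would also include the synthesis filtering and the zero-input response subtraction):

import numpy as np

def celp_codebook_search(sw_target, filtered_codebook):
    # sw_target is s~_w = s_w - s0_w; filtered_codebook[k] is s^_w[k].
    best_k, best_gain, best_err = -1, 0.0, np.inf
    for k, shat in enumerate(filtered_codebook):
        energy = np.dot(shat, shat)
        if energy == 0.0:
            continue
        corr = np.dot(sw_target, shat)
        err = np.dot(sw_target, sw_target) - corr * corr / energy   # Eq. (4.13)
        if err < best_err:
            best_k, best_gain, best_err = k, corr / energy, err     # gain: Eq. (4.12)
    return best_k, best_gain

rng = np.random.default_rng(3)
target = rng.standard_normal(40)            # one 40-sample subframe (illustrative)
codebook = rng.standard_normal((128, 40))   # filtered candidate excitation vectors
print(celp_codebook_search(target, codebook))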

One of the disadvantages of the original CELP algorithm is the large computational complexity required for the codebook search [Schr85]. This problem motivated a great deal of work focused upon developing structured codebooks [Davi86] [Klei90a] and fast search procedures [Tran90]. In particular, Davidson and Gersho [Davi86] proposed sparse codebooks, and Kleijn et al. [Klei90a] proposed a fast algorithm for searching stochastic codebooks with overlapping vectors. In addition, Gerson and Jasiuk [Gers90] [Gers91] proposed a vector sum excited linear predictive (VSELP) coder, which is associated with fast codebook search and robustness to channel errors. Other implementation issues associated with CELP include the quantization of the CELP parameters, the effects of channel errors on CELP coders, and the operation of the algorithm on finite-precision and fixed-point machines. A study of the effects of parameter quantization on the performance of CELP was presented in [Kroo90], and the issues associated with the channel coding of the CELP parameters were discussed by Kleijn [Klei90b]. Some of the problems associated with the fixed-point implementation of CELP algorithms were presented in [Span92].

4.5.1 Code-Excited Linear Prediction Algorithms

In this section, we classify CELP algorithms into three categories that are consistent with the chronology of their development, i.e., first-generation CELP (1986-1992), second-generation CELP (1993-1998), and third-generation CELP (1999-present).

4.5.1.1 First-Generation CELP Coders The first-generation CELP algorithms operate at bit rates between 5.8 kb/s and 16 kb/s. These are generally high-complexity and non-toll-quality algorithms. Some of the first-generation CELP algorithms include the FS-1016 CELP, the IS-54 VSELP, the ITU-T G.728 low-delay CELP, and the IS-96 Qualcomm CELP. The FS-1016 4.8 kb/s CELP standard [Camp90] [FS1016] was jointly developed by the Department of Defense (DoD) and Bell Labs for possible use in the third-generation secure telephone unit (STU-III). The IS-54 VSELP algorithm [IS-54] [Gers90] and its variants are embedded in three digital cellular standards, i.e., the 8 kb/s TIA IS-54 [IS-54], the 6.3 kb/s Japanese standard [GSM96a], and the 5.6 kb/s half-rate GSM [GSM96b]. The VSELP algorithm uses highly structured codebooks that are tailored for reduced computational complexity and increased robustness to channel errors. The ITU-T G.728 low-delay (LD) CELP coder [G728] [Chen92] achieves low one-way delay by using very short frames, a backward-adaptive predictor, and short excitation vectors (five samples). The IS-96 Qualcomm CELP [IS-96] is a variable bit rate algorithm and is part of the original code division multiple access (CDMA) standard for cellular communications.

4511 First-Generation CELP Coders The first-generation CELP algo-rithms operate at bit rates between 58 kbs and 16 kbs These are generallyhigh complexity and non-toll-quality algorithms Some of the first-generationCELP algorithms include the FS-1016 CELP the IS-54 VSELP the ITU-TG728 low delay-CELP and the IS-96 Qualcomm CELP The FS-1016 48 kbsCELP standard [Camp90] [FS1016] was jointly developed by the Department ofDefense (DoD) and the Bell Labs for possible use in the third-generation securetelephone unit (STU-III) The IS-54 VSELP algorithm [IS-54] [Gers90] and itsvariants are embedded in three digital cellular standards ie the 8 kbs TIAIS-54 [IS-54] the 63 kbs Japanese standard [GSM96a] and the 56 kbs half-rate GSM [GSM96b] The VSELP algorithm uses highly structured codebooksthat are tailored for reduced computational complexity and increased robust-ness to channel errors The ITU-T G728 low-delay (LD) CELP coder [G728][Chen92] achieves low one-way delay by using very short frames a backward-adaptive predictor and short excitation vectors (five samples) The IS-96 Qual-comm CELP [IS-96] is a variable bit rate algorithm and is part of the originalcode division multiple access (CDMA) standard for cellular communications

4512 Second-Generation Near-Toll-Quality CELP Coders Thesecond-generation CELP algorithms are targeted for TDMA and CDMAcellphones Internet audio streaming voice-over-Internet-protocol (VoIP) and


secure communications Second-generation CELP algorithms include the ITU-T G7231 dual-rate speech codec [G7231] the GSM enhanced full rate(EFR) [GSM96a] [IS-641] the IS-127 Relaxed CELP (RCELP) [IS-127][Klei92] and the ITU-T G729 CS-ACELP [G729] [Sala98]

The coding gain improvements in second-generation CELP coders can beattributed partly to the use of algebraic codebooks in excitation coding [Adou87][Lee90] [Sala98] [G729] The term algebraic CELP refers to the structure ofthe excitation codebooks Various algebraic codebook structures have been pro-posed [Adou87] [Lafl90] but the most popular is the interleaved pulse permu-tation code In this codebook the code vector consists of a set of interleavedpermutation codes containing only few non-zero elements This is given by

p_i = i + j d, \quad j = 0, 1, \ldots, 2^M - 1 \qquad (4.14)

where pi is the pulse position, i is the pulse number, and d is the interleaving depth. The integer M represents the number of bits describing the pulse positions. Table 4.1 shows an example ACELP codebook structure, where the interleaving depth is d = 5, the number of pulses or tracks equals 5, and the number of bits used to represent the pulse positions is M = 3. From (4.14), pi = i + j5, where i = 0, 1, 2, 3, 4 and j = 0, 1, 2, ..., 7.

For a given value of i, the set defined by (4.14) is known as a "track," and the value of j defines the pulse position. From the codebook structure shown in Table 4.1, the codevector x(n) is given by

x(n) = \sum_{i=0}^{4} \alpha_i \delta(n - p_i), \quad n = 0, 1, \ldots, 39 \qquad (4.15)

where δ(n) is the unit impulse, αi are the pulse amplitudes (±1), and pi are the pulse positions. In particular, the codebook vector x(n) is computed by placing the five unit pulses at the determined locations pi, multiplied by their signs (±1). The pulse position indices and the signs are coded and transmitted. Note that the algebraic codebooks do not require any storage.
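A minimal Python sketch of (4.14)-(4.15) for the track structure of Table 4.1; the function name and the example pulse choices are assumptions of this sketch:

import numpy as np

def acelp_codevector(pulse_js, signs, d=5, length=40):
    # Builds x(n) of Eq. (4.15): one unit pulse per track i, at position
    # p_i = i + j_i * d (Eq. (4.14)), multiplied by its sign (+/-1).
    x = np.zeros(length)
    for i, (j, sign) in enumerate(zip(pulse_js, signs)):
        x[i + j * d] += sign
    return x

# Example: pulses in tracks 0..4 at j = [3, 0, 7, 2, 5] with alternating signs.
x = acelp_codevector([3, 0, 7, 2, 5], [1, -1, 1, -1, 1])
print(np.nonzero(x)[0])   # pulse positions: [1, 13, 15, 29, 37]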

4.5.1.3 Third-Generation CELP for 3G Cellular Standards The third-generation (3G) CELP algorithms are multimodal and accommodate several

Table 4.1 An example algebraic codebook structure: tracks and pulse positions

Track (i)    Pulse positions (pi)
0            P0: 0, 5, 10, 15, 20, 25, 30, 35
1            P1: 1, 6, 11, 16, 21, 26, 31, 36
2            P2: 2, 7, 12, 17, 22, 27, 32, 37
3            P3: 3, 8, 13, 18, 23, 28, 33, 38
4            P4: 4, 9, 14, 19, 24, 29, 34, 39


different bit rates. This is consistent with the vision of wideband wireless standards [Knis98] that operate in different modes, including low-mobility, high-mobility, indoor, etc. There are at least two algorithms that have been developed and standardized for these applications. In Europe, GSM standardized the adaptive multi-rate coder [ETSI98] [Ekud99], and in the United States, the TIA has tested the selectable mode vocoder (SMV) [Gao01a] [Gao01b] [IS-893]. In particular, the adaptive multirate coder [ETSI98] [Ekud99] has been adopted by ETSI for use in the GSM network. This is an algebraic CELP algorithm that operates at multiple rates: 12.2, 10.2, 7.95, 6.7, 5.9, 5.15, and 4.75 kb/s. The bit rate is adjusted according to the traffic conditions. The SMV algorithm (IS-893) was developed to provide higher quality, flexibility, and capacity over the existing IS-96 QCELP and IS-127 enhanced variable rate coding (EVRC) CDMA algorithms. The SMV is based on four codecs: full-rate at 8.5 kb/s, half-rate at 4 kb/s, quarter-rate at 2 kb/s, and eighth-rate at 0.8 kb/s. The rate and mode selections in SMV are based on the frame voicing characteristics and the network conditions. Efforts to establish wideband cellular standards continue to drive further research and development towards algorithms that work at multiple rates and deliver enhanced speech quality.

4.6 LINEAR PREDICTION IN WIDEBAND CODING

Until now, we have discussed the use of LP in narrowband coding, with the signal bandwidth limited to 150–3400 Hz. The signal bandwidth in wideband speech coding spans 50 Hz to 7 kHz, which substantially improves the quality of signal reconstruction, intelligibility, and naturalness. In particular, the introduction of the low-frequency components improves the naturalness, while the higher-frequency extension provides more adequate speech intelligibility. In the case of high-fidelity audio, it is typical to consider sampling rates of 44.1 kHz, and the signal bandwidth can range from 20 Hz to 20 kHz. Some of the recent super-high-fidelity audio storage formats (Chapter 11), such as the DVD-audio and the Super Audio CD (SACD), consider signal bandwidths up to 100 kHz.

4.6.1 Wideband Speech Coding

Over the last few years, several wideband speech coding algorithms have been proposed [Orde91] [Jaya92] [Lafl93] [Adou95]. Some of the coding principles associated with these algorithms have been successfully integrated into several speech coding standards, for example, the ITU-T G.722 subband ADPCM standard and the ITU-T G.722.2 AMR-WB codec.

4.6.1.1 The ITU-T G.722 Codec The ITU-T G.722 standard (Figure 4.9) uses a combination of subband and ADPCM (SB-ADPCM) techniques [G722] [Merm88] [Span94] [Pain00]. The input signal is sampled at 16 kHz and decomposed into two subbands of equal bandwidth using quadrature mirror filter (QMF) banks. The subband filters hlow(n) and hhigh(n) should satisfy

h_{high}(n) = (-1)^n h_{low}(n) \quad \text{and} \quad |H_{low}(z)|^2 + |H_{high}(z)|^2 = 1 \qquad (4.16)


The low-frequency subband is typically quantized at 48 kb/s, while the high-frequency subband is coded at 16 kb/s. The G.722 coder includes an adaptive bit-allocation scheme and an auxiliary data channel. Moreover, provisions for quantizing the low-frequency subband at 40 or at 32 kb/s are available. In particular, the G.722 algorithm is multimodal and can operate in three different modes, i.e., 48, 56, and 64 kb/s, by varying the bits used to represent the lower-band signal. The MOS at 64 kb/s is greater than four for speech and slightly less than four for music signals [Jaya90], and the analysis-synthesis QMF banks introduce a delay of less than 3 ms. Details on the real-time implementation of this coder are given in [Taka88].
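A minimal Python sketch of the mirror relation in (4.16); the two-tap prototype is a toy example (not the actual G.722 24-tap filter), chosen only so that the power-complementary condition can be checked numerically:

import numpy as np

def qmf_highpass(h_low):
    # Quadrature-mirror high-pass from the low-pass prototype: h_high(n) = (-1)^n h_low(n).
    return ((-1.0) ** np.arange(len(h_low))) * h_low

h_low = np.array([0.5, 0.5])          # illustrative prototype only
h_high = qmf_highpass(h_low)
w = np.linspace(0.0, np.pi, 256)
n = np.arange(len(h_low))
H_low = np.abs(np.array([np.sum(h_low * np.exp(-1j * wk * n)) for wk in w]))
H_high = np.abs(np.array([np.sum(h_high * np.exp(-1j * wk * n)) for wk in w]))
print(np.allclose(H_low ** 2 + H_high ** 2, 1.0))   # power-complementary check of (4.16)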

4.6.1.2 The ITU-T G.722.2 AMR-WB Codec The ITU-T G.722.2 [G7222] [Bess02] is an adaptive multi-rate wideband (AMR-WB) codec that operates at bit rates ranging from 6.6 to 23.85 kb/s. The G.722.2 AMR-WB standard is primarily targeted at voice-over-IP (VoIP), 3G wireless communications, ISDN wideband telephony, and audio/video teleconferencing. It is important to note that the AMR-WB codec has also been adopted by the third-generation partnership project (3GPP) for GSM and the 3G WCDMA systems for wideband mobile communications [Bess02]. This, in fact, brought to the fore all the interoperability-related advantages for wideband voice applications across wireline and wireless communications. The ITU-T G.722.2 AMR-WB codec is based on the ACELP coder and operates on audio frames of 20 ms sampled at 16 kHz. The codec supports the following nine bit rates: 23.85, 23.05, 19.85, 18.25, 15.85, 14.25, 12.65, 8.85, and 6.6 kb/s. Excepting the two lowest modes, i.e., the 8.85 kb/s and the

[Figure 4.9 shows the G.722 SB-ADPCM structure. (a) Encoder: the input s(n) passes through a QMF analysis bank; the two subbands are coded by ADPCM encoders 1 and 2 and multiplexed, with auxiliary data (16 kb/s) inserted, to give the 64 kb/s output. (b) Decoder: the 64 kb/s input is demultiplexed, the auxiliary data are extracted, the subbands are decoded by ADPCM decoders 1 and 2, and a QMF synthesis bank reconstructs ŝ(n).]

Figure 4.9 The ITU-T G.722 standard for ISDN teleconferencing: wideband coding at 64 kb/s based on a two-band QMF analysis/synthesis bank and ADPCM; (a) encoder and (b) decoder. Note that the low-frequency band is encoded at 32 kb/s in order to allow for an auxiliary data channel at 16 kb/s.


Excepting the two lowest modes, i.e., the 8.85 and the 6.6 kb/s modes, which are intended for transmission over noisy, time-varying channels, the other encoding modes, i.e., 23.85 through 12.65 kb/s, offer high-quality signal reconstruction. The G.722.2 AMR-WB embeds several innovative techniques [Bess02], such as i) a modified perceptual weighting filter that decouples the formant weighting from the spectrum tilt, ii) an enhanced closed-loop pitch search to better accommodate the variations in the voicing level, and iii) efficient codebook structures for fast searches. The codec also includes a voice activity detection (VAD) scheme that activates a comfort noise generator module (1–2 kb/s) in case of discontinuous transmission.

4.6.2 Wideband Audio Coding

Motivated by the need to reduce the computational complexity associated with CELP-based excitation source coding, researchers have proposed several hybrid (LP + subband/transform) coders [Lefe94] [Ramp98] [Rong99]. In this section, we consider LP-based wideband coding methods that encode the prediction residual based upon transform, subband, or sinusoidal coding techniques.

4.6.2.1 Multipulse Excitation Model Singhal at Bell Labs [Sing90] reported that analysis-by-synthesis multipulse excitation with sufficient pulse density can be applied to correct for LP envelope errors introduced by bandwidth expansion and quantization (Figure 4.10). This algorithm uses a 24th-order LPC synthesis filter, while optimizing pulse positions and amplitudes to minimize perceptually weighted reconstruction errors. Singhal determined that densities of approximately 1 pulse per 4 output samples of each excitation subframe are required to achieve near transparent quality.

Spectral coefficients are transformed to inverse sine reflection coefficients, then differentially encoded and quantized using PDF-optimized Max quantizers. Entropy (Huffman) codes are also used. Pulse locations are differentially encoded relative to the location of the first pulse. Pulse amplitudes are fractionally encoded relative to the largest pulse and then quantized using a Max quantizer. The proposed MPLPC audio coder achieved output SNRs of 35–40 dB at a bit rate of 128 kb/s.
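As a hedged sketch of the parameter representation described above (not Singhal's actual implementation), the fragment below converts 24th-order LP coefficients computed for a hypothetical frame vector "frame" into reflection coefficients and then into inverse-sine coefficients; lpc() and poly2rc() are Signal Processing Toolbox functions.

    % Sketch: inverse-sine reflection-coefficient representation of the LP filter.
    a     = lpc(frame, 24);       % 24th-order LP coefficients, [1 a1 ... a24]
    k     = poly2rc(a);           % reflection coefficients, |k| < 1 for a stable filter
    theta = asin(k);              % inverse-sine coefficients, bounded and smoother,
                                  % hence better suited to differential scalar quantization
    % Differential encoding, PDF-optimized Max quantization, and Huffman coding
    % of theta would follow, per the description of [Sing90] above.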

[Figure 4.10 block diagram: an excitation generator produces u(n), which drives the LP synthesis filter; the synthesized output is subtracted from s(n) at a summing node, and the difference is passed through an error weighting filter.]

Figure 4.10. Multipulse excitation model used in [Sing90].


Other MPLPC audio coders have also been proposed [Lin91], including a scheme based on MPLPC in conjunction with the discrete wavelet transform [Bola95].

4.6.2.2 Discrete Wavelet Excitation Coding While most of the successful speech codecs nowadays use some form of closed-loop, time-domain analysis-by-synthesis such as MPLPC, high-performance LP-based perceptual audio coding has been realized with alternative frequency-domain excitation models. For instance, Boland and Deriche reported output quality comparable to MPEG-1, Layer II at 128 kb/s for an LPC audio coder operating at 96 kb/s [Bola98], in which the prediction residual was transform coded using a three-level discrete wavelet transform (DWT) (see also Section 8.2) based on a four-band uniform filter bank. At each level of the DWT, the lowest subband of the previous level was decomposed into four uniform bands. This 10-band nonuniform structure was intended to mimic critical bandwidths to a certain extent. A perceptual bit allocation according to MPEG-1 psychoacoustic model 2 was applied to the transform coefficients.

4.6.2.3 Sinusoidal Excitation Coding Excitation sequences modeled as a sum of sinusoids were investigated in [Chan96]. This form of excitation is based on the tendency of the prediction residuals to be spectrally impulsive rather than flat for high-fidelity audio. In coding experiments using 32-kHz-sampled input audio, subjective and objective quality improvements relative to the MPLPC coders were reported for the sinusoidal excitation schemes, with high-quality output audio reported at 72 kb/s. In the experiments reported in [Chan97], a set of ten LP coefficients is estimated on 9.4-ms analysis frames and split-vector quantized using 24 bits. Then, the prediction residual is analyzed, and sinusoidal parameters are estimated for the seven best out of a candidate set of thirteen sinusoids for each of six subframes. The masked threshold is estimated and used to form a time-varying bit allocation for the amplitudes, frequencies, and phases on each subframe. Given a frame allocation of 675 bits, a total of 573, 78, and 24 bits, respectively, are allocated to the sinusoidal parameters, the bit allocation side information, and the LP coefficients. Sinusoidal excitation coding, when used in conjunction with a masking-threshold adapted weighting filter, resulted in improved quality relative to MPEG-1, Layer I at a bit rate of 96 kb/s [Chan96] for selected test material.

4.6.2.4 Frequency-Warped LP Beyond the performance improvements realized through the use of different excitation models, there has been interest in warping the frequency axis before LP analysis to effectively provide better resolution at certain frequencies. In the context of perceptual coding, it is naturally of interest to achieve a Bark-scale warping. Frequency axis warping to achieve nonuniform FFT resolution was first introduced by Oppenheim, Johnson, and Steiglitz [Oppe71] [Oppe72], using a network of cascaded first-order all-pass sections for frequency warping of the signal, followed by a standard FFT. The idea was later extended to warped linear prediction (WLP) by Strube [Stru80] and was ultimately applied to an ADPCM codec [Krug88]. Cascaded first-order all-pass sections were used to warp the signal, and then the LP autocorrelation


analysis was performed on the warped autocorrelation sequence. In this scenario, a single-parameter warping of the frequency axis can be introduced into the LP analysis by replacing the delay elements in the FIR analysis filter (Figure 4.4) with all-pass sections. This is done by replacing the complex variable z^−1 of the FIR system function with another filter, H(z), of the form

H(z) = (z^−1 − λ) / (1 − λz^−1). (4.17)

Thus, the predicted sample value is not produced from a combination of past samples, as in Eq. (4.4), but rather from the samples of a warped signal. In fact, it has been shown [Smit95] [Smit99] that selecting the value of 0.723 for the parameter λ leads to a frequency warp that approximates well the Bark frequency scale. A WLP-based audio codec [Harm96] was recently proposed. The inherent Bark frequency resolution of the WLP residual yields a perceptually shaped quantization noise without the use of an explicit perceptual model or time-varying bit allocation. In this system, a 40th-order WLP synthesis filter is combined with differential encoding of the prediction residual. A fixed rate of 2 bits per sample (88.2 kb/s) is allocated to the residual sequence, and 5 bits per coefficient are allocated to the prediction coefficients on an analysis frame of 800 samples, or 18 ms. This translates to a bit rate of 99.2 kb/s per channel. In objective terms, an auditory error measure showed considerable improvement for the WLP coding error in comparison to a conventional LP coding error when the same number of bits was allocated to the prediction residuals. Subjectively, the algorithm was reported to achieve transparent quality for some material, but it also had difficulty with transients at the frame boundaries.
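The following MATLAB fragment is a minimal sketch of the warped autocorrelation idea, under stated assumptions: a single hypothetical frame vector "frame", prediction order 40, and λ = 0.723 as quoted above; levinson() is a Signal Processing Toolbox function. Each lag is obtained by correlating the frame against copies of itself passed through a growing chain of the all-pass sections of Eq. (4.17).

    % Sketch: warped LP coefficients via the all-pass substitution of Eq. (4.17).
    lambda = 0.723;                          % Bark-scale warping value quoted above
    p      = 40;                             % WLP order used in [Harm96]
    x      = frame(:);
    y      = x;                              % no all-pass stages applied yet
    r      = zeros(p + 1, 1);
    r(1)   = y' * x;                         % warped autocorrelation, lag 0
    for k = 1:p
        % one more first-order all-pass stage, D(z) = (z^-1 - lambda)/(1 - lambda z^-1)
        y      = filter([-lambda 1], [1 -lambda], y);
        r(k+1) = y' * x;                     % warped autocorrelation, lag k
    end
    a_wlp = levinson(r, p);                  % warped LP coefficients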

The algorithm was later extended to handle stereophonic signals [Harm97a] by forming a complex-valued representation of the two channels and then using a version of WLP modified for complex signals (CWLP). It was suggested that significant quality improvement could be realized for the WLPC audio coder by using a closed-loop analysis-by-synthesis procedure [Harm97b]. One of the shortcomings of the original WLP coder was inadequate attention to temporal effects. As a result, further experiments were reported [Harm98] in which WLP was combined with temporal noise shaping (TNS) to realize additional quality improvement.

4.7 SUMMARY

In this chapter, we presented the LP-based source-system model and described its applications in narrowband and wideband coding. Some of the topics presented in this chapter include:

• Short-term linear prediction
• Conventional LP analysis-synthesis


• Closed-loop analysis-by-synthesis hybrid coders
• Code-excited linear prediction (CELP) speech standards
• Linear prediction in wideband coding
• Frequency-warped LP

PROBLEMS

4.1. The following autocorrelation sequence is given:

r_ss(m) = 0.8^|m| / 0.36.

Describe a source-system mechanism that will generate a signal with this autocorrelation.

4.2. Sketch the magnitude frequency response of the following filter function:

H(z) = 1 / (1 − 0.9z^−10).

4.3. The autocorrelation sequence in Figure 4.11 corresponds to a strongly voiced speech segment. Show with an arrow which autocorrelation sample relates to the pitch period of the voiced signal. Estimate the pitch period from the graph.

4.4. Consider Figure 4.12 with a white Gaussian input signal.
a. Determine analytically the LP coefficients for a first-order predictor and for H(z) = 1/(1 − 0.8z^−1).
b. Determine analytically the LP coefficients for a second-order predictor and for H(z) = 1 + z^−1 + z^−2.

[Figure 4.11: Autocorrelation r_ss of a voiced speech segment, plotted against lag index m (0 to 120).]

Figure 4.11. Autocorrelation of a voiced speech segment.

[Figure 4.12: A white Gaussian input (variance 1, zero mean) drives H(z); the output is passed to a linear prediction stage that produces the residual e(n).]

Figure 4.12. LP coefficient estimation.


COMPUTER EXERCISES

The necessary MATLAB software for computer simulations and the speech/audio files (Ch4Sp8.wav, Ch4Au8.wav, and Ch4Au16.wav) can be obtained from the book website.

4.5. Linear predictive coding (LPC)

a. Write a MATLAB program to load, display, and play back speech files. Use Ch4Sp8.wav for this computer exercise.

b. Include a framing module in your program and set the frame size to 256 samples. Every frame should be read into a 256 × 1 real vector called Stime. Compute the fast Fourier transform (FFT) of this vector, i.e., Sfreq = fft(Stime). Next, compute the magnitude of the complex vector Sfreq and plot its magnitude in dB up to the fold-over frequency. This computation should be part of your frame-by-frame speech processing program.

Deliverable 1:
Present at least one plot of time- and one corresponding plot of frequency-domain data for a voiced, an unvoiced, and a mixed speech segment. (A total of six plots – use the subplot command.)

c. Pitch period and voicing estimation. The period of a strongly voiced speech signal is associated in a reciprocal manner with the fundamental frequency of the corresponding harmonic spectrum. That is, if the pitch period is T, the fundamental frequency is 1/T. Note that T can be measured in terms of the number of samples within a pitch period for voiced speech. If T is measured in ms, then multiply the number of samples by 1/Fs, where Fs is the sampling frequency of the input speech.

Deliverable 2:
Create and fill Table 4.2 for the first 30 speech frames by visual inspection as follows: when the segment is voiced, enter 1 in the 2nd column. If speech pause (i.e., no speech present), enter 0; if unvoiced, enter 0.25; and if mixed, enter 0.5. Measure the pitch period visually from the time-domain plot in terms of the number of samples in a pitch period. If the segment is unvoiced or pause, enter infinity for the pitch period and hence zero for the fundamental frequency. If the segment is mixed, do your best to obtain an estimate of the pitch; if it is not possible, set the pitch to infinity.

Deliverable 3:
From Table 4.2, plot the fundamental frequency as a function of the frame number for all thirty frames. This is called the pitch frequency contour.


Table 4.2. Pitch period, voicing, and frame energy measurements

Speech frame   Voiced/unvoiced/   Pitch (number   Frame    Fundamental
number         mixed/pause        of samples)     energy   frequency (Hz)
1
2
3
...
30

Deliverable 4:
From Table 4.2, plot also i) the voicing and ii) the frame energy (in dB) as a function of the frame number for all thirty frames.

d. The FFT and LP spectra. Write a MATLAB program to implement the Levinson-Durbin recursion. Assume a tenth-order LP analysis and estimate the LP coefficients (lp_coeff) for each speech frame.

Deliverable 5:
Compute the LP spectra as follows: H_allpole = freqz(1, lp_coeff). Superimpose the LP spectra (H_allpole) with the FFT speech spectra (Sfreq) for a voiced segment and an unvoiced segment. Plot the spectral magnitudes in dB up to the fold-over frequency. Note that the LPC spectra look like a smoothed version of the FFT spectra. Divide Sfreq by H_allpole. Plot the magnitude of the result in dB up to the fold-over frequency. What does the resulting spectrum represent?

Deliverable 6:
From the LP spectra, measure (visually) the frequencies of the first three formants, F1, F2, and F3. Give these frequencies in Hz (Table 4.3). Plot the three formants across the frame number. These will be the formant contours. Use different line types or colors to discriminate the three contours.

e. LP analysis-synthesis. Using the prediction coefficients (lp_coeff) from part (d), perform LP analysis. Use the mathematical formulation given in Sections 4.2 and 4.3.

Table 4.3. Formants F1, F2, and F3

Speech frame number   F1 (Hz)   F2 (Hz)   F3 (Hz)
1
2
3
...
30


Quantize both the LP coefficients and the prediction residual using a 3-bit (i.e., 8-level) scalar quantizer. Next, perform LP synthesis and reconstruct the speech signal.

Deliverable 7:
Plot the quantized residual and its corresponding dB spectrum for a voiced and an unvoiced frame. Provide plots of the original and reconstructed speech. Compute the SNR in dB for the entire reconstructed signal relative to the original record. Listen to the reconstructed signal and provide a subjective score on a scale of 1 to 5. Repeat this step when an 8-bit scalar quantizer is employed. In your simulation, when the LP coefficients are quantized using a 3-bit scalar quantizer, the LP synthesis filter will become unstable for certain frames. What are the consequences of this?

4.6. The FS-1016 CELP standard
a. Obtain the MATLAB software for the FS-1016 CELP from the book website. Use the following wave files: Ch4Sp8.wav and Ch4Au8.wav.

Deliverable 1:
Give the plots of the entire input and the FS-1016 synthesized output for the two wave files. Comment on the quality of the two synthesized wave files. In particular, give more emphasis to Ch4Au8.wav and give specific reasons why the FS-1016 does not synthesize Ch4Au8.wav with high quality. Listen to the output files and provide a subjective evaluation. The CELP FS-1016 coder scored 3.2 out of 5 in government tests on a MOS scale. How would you rate its performance in terms of MOS for Ch4Sp8.wav and Ch4Au8.wav? Also give segmental SNR values for a voiced/unvoiced/mixed frame and the overall SNR for the entire record. Present your results as Table 4.4 with an appropriate caption.

b. Spectrum analysis: File CELPANAL.M, from lines 134 to 159. Valuable comments describing the variable names, globals, and inputs/outputs are provided at the beginning of the MATLAB file CELPANAL.M to further assist you with understanding the MATLAB program.

Table 4.4. FS-1016 CELP subjective and objective evaluation

                          Speech frame number   Segmental SNR for the chosen frame (dB)
Ch4Sp8.wav
   Voiced
   Unvoiced
   Mixed
   Overall SNR for the entire speech record = ____ (dB); MOS on a scale of 1–5 = ____
Ch4Au8.wav
   Overall SNR for the entire music record = ____ (dB); MOS on a scale of 1–5 = ____


Specifically, some of the useful variables in the MATLAB code include: snew (input speech buffer), fcn (LP filter coefficients of 1/A(z)), rcn (reflection coefficients), newfreq (LSP frequencies), unqfreq (unquantized LSPs), newfreq (quantized LSPs), and lsp (interpolated LSPs for each subframe).

Deliverable 2:
Choose Ch4Sp8.wav and use the voiced/unvoiced/mixed frames selected in the previous step. Indicate the frame numbers. FS-1016 employs 30-ms speech frames, so the frame size is fixed at 240 samples. Give time-domain (variable in the code: 'snew') and frequency-domain plots (use FFT size 512; include commands as necessary in the program to obtain the FFT) of the selected voiced/unvoiced/mixed frame. Also plot the LPC spectrum using

figure; freqz(1, fcn)

Study the interlacing property of the LSPs on the unit circle. Note that in the FS-1016 standard, the LSPs are encoded using scalar quantization. Plot the LPC spectra obtained from the unquantized LSPs and the quantized LSPs. You have to convert LSPs to LPCs in both cases (unquantized and quantized) and use the freqz command to plot the LPC spectra. Give a z-domain plot (of a voiced and an unvoiced frame) containing the pole locations (show as crosses 'x') of the LPC spectra and the roots of the symmetric (show as black circles 'o') and asymmetric (show as red circles 'o') LSP polynomials. Note the interlacing nature of the black and red circles; they always lie on the unit circle. Also note that if a pole is close to the unit circle, the corresponding LSPs will be close to each other. In the file CELPANAL.M, line 124, high-pass filtering is performed to eliminate the undesired low frequencies. Experiment with and without a high-pass filter to note the presence of humming and low-frequency noise in the synthesized speech.

c. Pitch analysis: File CELPANAL.M, from lines 162 to 191.

Deliverable 3:
What are the key advantages of employing subframes in speech coding (e.g., interpolation, pitch prediction)? Explain, in general, the differences between long-term prediction and short-term prediction. Give the necessary transfer functions. In particular, describe what aspects of speech each of the two predictors captures.

Deliverable 4:
Insert in file CELPANAL.M, after line 182:

tauptr = 75

Perform an evaluation of the perceptual quality of the synthesized speech and give your remarks. How does the speech quality change by forcing the pitch to a predetermined value? (Choose different tauptr values: 40, 75, and 110.)


4.7. The warped LP for audio analysis-synthesis
a. Write a MATLAB program to perform analysis-synthesis using i) the conventional LP and ii) the warped LP. Use Ch4Au8.wav as the input wave file.

b. Perform a tenth-order LP analysis and use a 5-bit scalar quantizer to quantize the LP and the WLP coefficients, and a 3-bit scalar quantizer for the excitation vector. Perform audio synthesis using the quantized LP and WLP analysis parameters. Compute the warping coefficient [Smit99] [Harm01] using

λ = 1.0674 [ (2/π) arctan(0.06583 Fs / 1000) ]^{1/2} − 0.1916,

where Fs is the sampling frequency of the input audio. Comment on the quality of the synthesized audio from the LP and WLP analysis-synthesis. Repeat this step for Ch4Au16.wav. Refer to [Harm00] for implementation of WLP synthesis filters.

CHAPTER 5

PSYCHOACOUSTIC PRINCIPLES

5.1 INTRODUCTION

The field of psychoacoustics [Flet40] [Gree61] [Zwis65] [Scha70] [Hell72] [Zwic90] [Zwic91] has made significant progress toward characterizing human auditory perception and, particularly, the time-frequency analysis capabilities of the inner ear. Although applying perceptual rules to signal coding is not a new idea [Schr79], most current audio coders achieve compression by exploiting the fact that "irrelevant" signal information is not detectable by even a well-trained or sensitive listener. Irrelevant information is identified during signal analysis by incorporating into the coder several psychoacoustic principles, including absolute hearing thresholds, critical band frequency analysis, simultaneous masking, the spread of masking along the basilar membrane, and temporal masking. Combining these psychoacoustic notions with basic properties of signal quantization has also led to the theory of perceptual entropy [John88b], a quantitative estimate of the fundamental limit of transparent audio signal compression.

This chapter reviews psychoacoustic fundamentals and perceptual entropy, and then gives as an application example some details of the ISO/MPEG psychoacoustic model 1. Before proceeding, however, it is necessary to define the sound pressure level (SPL), a standard metric that quantifies the intensity of an acoustical stimulus [Zwic90]. Nearly all of the auditory psychophysical phenomena addressed in this book are treated in terms of SPL. The SPL gives the level (intensity) of sound pressure in decibels (dB) relative to an internationally defined reference level, i.e., L_SPL = 20 log10(p/p0) dB, where L_SPL is the SPL of a stimulus, p is the sound pressure of the stimulus in Pascals (Pa, equivalent to Newtons




per square meter (N/m²)), and p0 is the standard reference level of 20 μPa, or 2 × 10⁻⁵ N/m² [Moor77]. About 150 dB SPL spans the dynamic range of intensity for the human auditory system, from the limits of detection for low-intensity (quiet) stimuli up to the threshold of pain for high-intensity (loud) stimuli. The SPL reference level is calibrated such that the frequency-dependent absolute threshold of hearing in quiet (Section 5.2) tends to measure in the vicinity of 0 dB SPL. On the other hand, a stimulus level of 140 dB SPL is typically at or above the threshold of pain.

5.2 ABSOLUTE THRESHOLD OF HEARING

The absolute threshold of hearing characterizes the amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment. The absolute threshold is typically expressed in terms of dB SPL. The frequency dependence of this threshold was quantified as early as 1940, when Fletcher [Flet40] reported test results for a range of listeners that were generated in a National Institutes of Health (NIH) study of typical American hearing acuity. The quiet threshold is well approximated [Terh79] by the nonlinear function

T_q(f) = 3.64 (f/1000)^{−0.8} − 6.5 e^{−0.6(f/1000−3.3)^2} + 10^{−3} (f/1000)^4 (dB SPL), (5.1)

which is representative of a young listener with acute hearing. When applied to signal compression, T_q(f) could be interpreted naively as a maximum allowable energy level for coding distortions introduced in the frequency domain (Figure 5.1).
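A minimal MATLAB sketch that evaluates Eq. (5.1) over the audio band and reproduces the character of Figure 5.1 is given below (the axis ranges are arbitrary choices).

    % Sketch: absolute threshold of hearing in quiet, Eq. (5.1).
    f  = logspace(log10(20), log10(20000), 512);           % 20 Hz to 20 kHz
    Tq = 3.64*(f/1000).^(-0.8) ...
         - 6.5*exp(-0.6*(f/1000 - 3.3).^2) ...
         + 1e-3*(f/1000).^4;                               % threshold in dB SPL
    semilogx(f, Tq); grid on; axis([20 20000 -10 100]);
    xlabel('Frequency (Hz)'); ylabel('Sound pressure level, SPL (dB)');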

Figure 5.1. The absolute threshold of hearing in quiet (sound pressure level, dB SPL, plotted against frequency, Hz, on a logarithmic scale).


At least two caveats must govern this practice, however. First, whereas the thresholds captured in Figure 5.1 are associated with pure tone stimuli, the quantization noise in perceptual coders tends to be spectrally complex rather than tonal. Secondly, it is important to realize that algorithm designers have no a priori knowledge regarding actual playback levels (SPL), and therefore the curve is often referenced to the coding system by equating the lowest point (i.e., near 4 kHz) to the energy in ±1 bit of signal amplitude. In other words, it is assumed that the playback level (volume control) on a typical decoder will be set such that the smallest possible output signal will be presented close to 0 dB SPL. This assumption is conservative for quiet to moderate listening levels in uncontrolled open-air listening environments, and therefore this referencing practice is commonly found in algorithms that utilize the absolute threshold of hearing. We note that the absolute hearing threshold is related to a commonly encountered acoustical metric other than SPL, namely, dB sensation level (dB SL). Sensation level (SL) denotes the intensity level difference for a stimulus relative to a listener's individual unmasked detection threshold for the stimulus [Moor77]. Hence, "equal SL" signal components may have markedly different absolute SPLs, but all equal SL components will have equal supra-threshold margins. The motivation for the use of SL measurements is that SL quantifies listener-specific audibility rather than an absolute level. Whether the target metric is SPL or SL, perceptual coders must eventually reference the internal PCM data to a physical scale. A detailed example of this referencing for SPL is given in Section 5.7 of this chapter.

5.3 CRITICAL BANDS

Using the absolute threshold of hearing to shape the coding distortion spectrum represents the first step towards perceptual coding. Considered on its own, however, the absolute threshold is of limited value in coding. The detection threshold for spectrally complex quantization noise is a modified version of the absolute threshold, with its shape determined by the stimuli present at any given time. Since stimuli are in general time-varying, the detection threshold is also a time-varying function of the input signal. In order to estimate this threshold, we must first understand how the ear performs spectral analysis. A frequency-to-place transformation takes place in the cochlea (inner ear), along the basilar membrane [Zwic90].

The transformation works as follows. A sound wave generated by an acoustic stimulus moves the eardrum and the attached ossicular bones, which in turn transfer the mechanical vibrations to the cochlea, a spiral-shaped, fluid-filled structure that contains the coiled basilar membrane. Once excited by mechanical vibrations at its oval window (the input), the cochlear structure induces traveling waves along the length of the basilar membrane. Neural receptors are connected along the length of the basilar membrane. The traveling waves generate peak responses at frequency-specific membrane positions, and therefore different neural receptors


are effectively "tuned" to different frequency bands according to their locations. For sinusoidal stimuli, the traveling wave on the basilar membrane propagates from the oval window until it nears the region with a resonant frequency near that of the stimulus frequency. The wave then slows, and the magnitude increases to a peak. The wave decays rapidly beyond the peak. The location of the peak is referred to as the "best place" or "characteristic place" for the stimulus frequency, and the frequency that best excites a particular place [Beke60] [Gree90] is called the "best frequency" or "characteristic frequency." Thus, a frequency-to-place transformation occurs. An example is given in Figure 5.2 for a three-tone stimulus. The interested reader can also find online a number of high-quality animations demonstrating this aspect of cochlear mechanics [Twve99].

As a result of the frequency-to-place transformation, the cochlea can be viewed from a signal processing perspective as a bank of highly overlapping bandpass filters. The magnitude responses are asymmetric and nonlinear (level-dependent). Moreover, the cochlear filter passbands are of nonuniform bandwidth, and the bandwidths increase with increasing frequency. The "critical bandwidth" is a function of frequency that quantifies the cochlear filter passbands. Empirical work by several observers led to the modern notion of critical bands [Flet40] [Gree61] [Zwis65] [Scha70]. We will consider two typical examples.

In one scenario, the loudness (perceived intensity) remains constant for a narrowband noise source presented at a constant SPL even as the noise bandwidth is increased up to the critical bandwidth. For any increase beyond the critical bandwidth, the loudness then begins to increase. In this case, one can imagine that loudness remains constant as long as the noise energy is present within only one cochlear "channel" (critical bandwidth), and then that the loudness increases as soon as the noise energy is forced into adjacent cochlear "channels." Critical bandwidth can also be viewed as the result of auditory detection efficacy in terms of a signal-to-noise ratio (SNR) criterion.

Figure 5.2. The frequency-to-place transformation along the basilar membrane. The figure gives a schematic representation of the traveling wave envelopes (measured in terms of vertical membrane displacement, plotted against distance from the oval window in mm) that occur in response to an acoustic tone complex containing sinusoids of 400, 1600, and 6400 Hz. Peak responses for each sinusoid are localized along the membrane surface, with each peak occurring at a particular distance from the oval window (cochlear "input"). Thus, each component of the complex stimulus evokes strong responses only from the neural receptors associated with frequency-specific loci (after [Zwic90]).


The power spectrum model of hearing assumes that masked threshold for a given listener will occur at a constant, listener-specific SNR [Moor96]. In the critical bandwidth measurement experiments, the detection threshold for a narrowband noise source presented between two masking tones remains constant as long as the frequency separation between the tones remains within a critical bandwidth (Figure 5.3a). Beyond this bandwidth, the threshold rapidly decreases (Figure 5.3c). From the SNR viewpoint, one can imagine that as long as the masking tones are presented within the passband of the auditory filter (critical bandwidth) that is tuned to the probe noise, the SNR presented to the auditory system remains constant, and hence the detection threshold does not change. As the tones spread further apart and transition into the filter stopband, however, the SNR presented to the auditory system improves, and hence the detection task becomes easier. In order to maintain a constant SNR at threshold for a particular listener, the power spectrum model calls for a reduction in the probe noise commensurate with the reduction in the energy of the masking tones as they transition out of the auditory filter passband. Thus, beyond critical bandwidth, the detection threshold for the probe tones decreases, and the threshold SNR remains constant. A notched-noise experiment with a similar interpretation can be constructed with masker and maskee roles reversed (Figure 5.3 b and d). Critical bandwidth tends to remain constant (about 100 Hz) up to 500 Hz, and increases to approximately 20% of the center frequency above 500 Hz. For an average listener, critical bandwidth (Figure 5.4b) is conveniently approximated [Zwic90] by

BW_c(f) = 25 + 75 [1 + 1.4 (f/1000)^2]^{0.69} (Hz). (5.2)

Although the function BW_c is continuous, it is useful when building practical systems to treat the ear as a discrete set of bandpass filters that conforms to (5.2). The function [Zwic90]

Z_b(f) = 13 arctan(0.00076 f) + 3.5 arctan[(f/7500)^2] (Bark) (5.3)

is often used to convert from frequency in Hertz to the Bark scale, Figure 5.4(a). Corresponding to the center frequencies of the Table 5.1 filter bank, the numbered points in Figure 5.4(a) illustrate that the nonuniform Hertz spacing of the filter bank (Figure 5.5) is actually uniform on a Bark scale. Thus, one critical bandwidth (CB) comprises one Bark. Table 5.1 gives an idealized filter bank that corresponds to the discrete points labeled on the curves in Figure 5.4(a, b). A distance of 1 critical band is commonly referred to as one Bark in the literature.
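The two mappings are easy to evaluate directly; the short MATLAB sketch below (with sample frequencies chosen arbitrarily) tabulates the critical band rate of Eq. (5.3) and the critical bandwidth of Eq. (5.2).

    % Sketch: Bark rate, Eq. (5.3), and critical bandwidth, Eq. (5.2).
    bark = @(f) 13*atan(0.00076*f) + 3.5*atan((f/7500).^2);    % Eq. (5.3), Barks
    bwc  = @(f) 25 + 75*(1 + 1.4*(f/1000).^2).^0.69;           % Eq. (5.2), Hz
    f = [100 500 1000 4000 10000];                             % example frequencies
    disp([f.' bark(f).' bwc(f).'])  % columns: frequency (Hz), Z_b (Bark), BW_c (Hz)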

Although the critical bandwidth captured in Eq. (5.2) is widely used in perceptual models for audio coding, we note that there are alternative expressions. In particular, the equivalent rectangular bandwidth (ERB) scale emerged from research directed towards measurement of auditory filter shapes. Experimental data is obtained typically from notched noise masking procedures. Then, the masking data is fitted with parametric weighting functions that represent the spectral shaping properties of the auditory filters [Moor96].


[Figure 5.3: panels (a) and (b) show the masking stimuli (sound pressure level against frequency) with tone/noise separation Δf; panels (c) and (d) plot the audibility threshold against Δf, with the critical bandwidth f_cb marked.]

Figure 5.3. Critical band measurement methods. (a, c) Detection threshold decreases as masking tones transition from the auditory filter passband into the stopband, thus improving detection SNR. (b, d) Same interpretation with masker and maskee roles reversed (after [Zwic90]).

Rounded exponential models with one or two free parameters are popular. For example, the single-parameter roex(p) model is given by

W(g) = (1 + pg) e^{−pg}, (5.4)

where g = |f − f0|/f0 is the normalized frequency, f0 is the center frequency of the filter, and f represents frequency in Hz. Although the roex(p) model does not capture filter asymmetry, asymmetric filter shapes are possible if two roex(p) models are used independently for the high- and low-frequency filter skirts. Two-parameter models such as the roex(p, r) are also used to gain additional degrees of freedom [Moor96] in order to improve the accuracy of the filter shape estimates. After curve-fitting, an ERB estimate is obtained directly from the parametric filter shape. For the roex(p) model, it can be shown easily that the equivalent rectangular bandwidth is given by

ERB_roex(p) = 4 f0 / p. (5.5)

We note that some texts denote ERB by equivalent noise bandwidth. An example is given in Figure 5.6. The solid line in Figure 5.6(a) shows an example roex(p) filter estimated for a center frequency of 1 kHz, while the dashed line shows the ERB associated with the given roex(p) filter shape.
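A minimal MATLAB sketch of Eqs. (5.4) and (5.5) follows; the center frequency f0 = 1 kHz matches the example of Figure 5.6(a), while p = 25 is an assumed illustrative value rather than a fitted result (it gives ERB = 4·1000/25 = 160 Hz).

    % Sketch: roex(p) auditory filter shape, Eq. (5.4), and its ERB, Eq. (5.5).
    f0 = 1000;  p = 25;                    % assumed center frequency and slope parameter
    f  = linspace(600, 1400, 400);
    g  = abs(f - f0)/f0;                   % normalized deviation from the center
    W  = (1 + p*g).*exp(-p*g);             % Eq. (5.4)
    ERB = 4*f0/p                           % Eq. (5.5): 160 Hz for these assumed values
    plot(f, 10*log10(W)); grid on;
    xlabel('Frequency (Hz)'); ylabel('Attenuation (dB)');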

In [Moor83] and [Glas90], Moore and Glasberg summarized experimental ERB measurements for roex(p, r) models obtained over a period of several years by a number of different investigators.


Figure 5.4. Two views of critical bandwidth. (a) Critical band rate Z_b(f) maps from Hertz to Barks, and (b) critical bandwidth BW_c(f) expresses critical bandwidth as a function of center frequency in Hertz. The "Xs" denote the center frequencies of the idealized critical band filter bank given in Table 5.1.

Given a collection of ERB measurements on center frequencies across the audio spectrum, a curve fitting on the data set yielded the following expression for ERB as a function of center frequency:

ERB(f) = 24.7 (4.37 (f/1000) + 1). (5.6)


Figure 5.5. Idealized critical band filter bank (amplitude against frequency, 0–20 kHz).

As shown in Figure 5.6(b), the function specified by Eq. (5.6) differs from the critical bandwidth of Eq. (5.2). Of particular interest for perceptual codec designers, the ERB scale implies that auditory filter bandwidths decrease below 500 Hz, whereas the critical bandwidth remains essentially flat. The apparent increased frequency selectivity of the auditory system below 500 Hz has implications for optimal filter-bank design as well as for perceptual bit allocation strategies. These implications are addressed later in the book.
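The difference is easy to visualize; the MATLAB sketch below overlays the ERB scale of Eq. (5.6) on the critical bandwidth of Eq. (5.2), in the spirit of Figure 5.6(b).

    % Sketch: ERB of Eq. (5.6) vs. critical bandwidth of Eq. (5.2).
    f   = logspace(2, 4, 256);                          % 100 Hz to 10 kHz
    erb = 24.7*(4.37*(f/1000) + 1);                     % Eq. (5.6)
    bwc = 25 + 75*(1 + 1.4*(f/1000).^2).^0.69;          % Eq. (5.2)
    loglog(f, bwc, '--', f, erb, '-'); grid on;
    legend('Critical bandwidth, Eq. (5.2)', 'ERB, Eq. (5.6)', 'Location', 'northwest');
    xlabel('Center frequency (Hz)'); ylabel('Bandwidth (Hz)');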

Regardless of whether it is best characterized in terms of critical bandwidth or ERB, the frequency resolution of the auditory filter bank largely determines which portions of a signal are perceptually irrelevant. The auditory time-frequency analysis that occurs in the critical band filter bank induces simultaneous and nonsimultaneous masking phenomena that are routinely used by modern audio coders to shape the coding distortion spectrum. In particular, the perceptual models allocate bits for signal components such that the quantization noise is shaped to exploit the detection thresholds for a complex sound (e.g., quantization noise). These thresholds are determined by the energy within a critical band [Gass54]. Masking properties and masking thresholds are described next.

5.4 SIMULTANEOUS MASKING, MASKING ASYMMETRY, AND THE SPREAD OF MASKING

Masking refers to a process where one sound is rendered inaudible because of the presence of another sound.


Table 5.1. Idealized critical band filter bank (after [Scha70]). Band edges and center frequencies for a collection of 25 critical bandwidth auditory filters that span the audio spectrum.

Band number   Center frequency (Hz)   Bandwidth (Hz)
 1      50      –100
 2     150     100–200
 3     250     200–300
 4     350     300–400
 5     450     400–510
 6     570     510–630
 7     700     630–770
 8     840     770–920
 9    1000     920–1080
10    1175    1080–1270
11    1370    1270–1480
12    1600    1480–1720
13    1850    1720–2000
14    2150    2000–2320
15    2500    2320–2700
16    2900    2700–3150
17    3400    3150–3700
18    4000    3700–4400
19    4800    4400–5300
20    5800    5300–6400
21    7000    6400–7700
22    8500    7700–9500
23   10500    9500–12000
24   13500   12000–15500
25   19500   15500–

Simultaneous masking may occur whenever two or more stimuli are simultaneously presented to the auditory system. From a frequency-domain point of view, the relative shapes of the masker and maskee magnitude spectra determine to what extent the presence of certain spectral energy will mask the presence of other spectral energy. From a time-domain perspective, phase relationships between stimuli can also affect masking outcomes. A simplified explanation of the mechanism underlying simultaneous masking phenomena is that the presence of a strong noise or tone masker creates an excitation of sufficient strength on the basilar membrane at the critical band location to block effectively detection of a weaker signal. Although arbitrary audio spectra may contain complex simultaneous masking scenarios, for the purposes of shaping coding distortions it is convenient to distinguish between three types of simultaneous masking, namely noise-masking-tone (NMT) [Scha70], tone-masking-noise (TMN) [Hell72], and noise-masking-noise (NMN) [Hall98].


Figure 5.6. Equivalent rectangular bandwidth (ERB). (a) Example ERB for a roex(p) single-parameter estimate of the shape of the auditory filter centered at 1 kHz. The solid line represents an estimated spectral weighting function for a single-parameter fit to data from a notched noise masking experiment; the dashed line represents the equivalent rectangular bandwidth. (b) ERB vs. critical bandwidth: the ERB scale of Eq. (5.6) (solid) vs. the critical bandwidth of Eq. (5.2) (dashed) as a function of center frequency.


Figure 5.7. Example to illustrate the asymmetry of simultaneous masking. (a) Noise-masking-tone: at the threshold of detection, a 410-Hz pure tone presented at 76-dB SPL is just masked by a critical bandwidth narrowband noise centered at 410 Hz (90 Hz BW) of overall intensity 80 dB SPL. This corresponds to a threshold minimum signal-to-mask ratio (SMR) of 4 dB. The threshold SMR increases as the probe tone is shifted either above or below 410 Hz. (b) Tone-masking-noise: at the threshold of detection, a 1-kHz pure tone presented at 80-dB SPL just masks a critical-band narrowband noise centered at 1 kHz, of overall intensity 56-dB SPL. This corresponds to a threshold minimum SMR of 24 dB. As for the NMT experiment, the threshold SMR for the TMN increases as the masking tone is shifted either above or below the noise center frequency, 1 kHz. When comparing (a) to (b), it is important to notice the apparent "masking asymmetry," namely that NMT produces a significantly smaller threshold minimum SMR (4 dB) than does TMN (24 dB).

5.4.1 Noise-Masking-Tone

In the NMT scenario (Figure 5.7a), a narrowband noise (e.g., having 1 Bark bandwidth) masks a tone within the same critical band, provided that the intensity of the masked tone is below a predictable threshold directly related to the intensity and, to a lesser extent, the center frequency of the masking noise. Numerous studies characterizing NMT for random noise and pure-tone stimuli have appeared since the 1930s (e.g., [Flet37] [Egan50]). At the threshold of detection for the masked tone, the minimum signal-to-mask ratio (SMR), i.e., the smallest difference between the intensity (SPL) of the masking noise ("signal") and the intensity of the masked tone ("mask"), occurs when the frequency of the masked tone is close to the masker's center frequency. In most studies, the minimum SMR tends to lie between −5 and +5 dB. For example, a sample threshold SMR result from the NMT investigation [Egan50] is schematically represented in Figure 5.7a. In the figure, a critical band noise masker centered at 410 Hz


with an intensity of 80 dB SPL masks a 410-Hz tone, and the resulting SMR at the threshold of detection is 4 dB.

Masking power decreases (i.e., SMR increases) for probe tones above and below the frequency of the minimum SMR tone, in accordance with a level- and frequency-dependent spreading function that is described later. We note that temporal factors also affect simultaneous masking. For example, in the NMT scenario, an overshoot effect is possible when the probe tone onset occurs within a short interval immediately following masker onset. Overshoot can boost simultaneous masking (i.e., decrease the threshold minimum SMR) by as much as 10 dB over a brief time span [Zwic90].

5.4.2 Tone-Masking-Noise

In the case of TMN (Figure 5.7b), a pure tone occurring at the center of a critical band masks noise of any subcritical bandwidth or shape, provided the noise spectrum is below a predictable threshold directly related to the strength and, to a lesser extent, the center frequency of the masking tone. In contrast to NMT, relatively few studies have attempted to characterize TMN. At the threshold of detection for a noise band masked by a pure tone, however, it was found in both [Hell72] and [Schr79] that the minimum SMR, i.e., the smallest difference between the intensity of the masking tone ("signal") and the intensity of the masked noise ("mask"), occurs when the masker frequency is close to the center frequency of the probe noise, and that the minimum SMR for TMN tends to lie between 21 and 28 dB. A sample result from the TMN study [Schr79] is given in Figure 5.7b. In the figure, a narrowband noise of one Bark bandwidth centered at 1 kHz is masked by a 1-kHz tone of intensity 80 dB SPL. The resulting SMR at the threshold of detection is 24 dB. As with NMT, the TMN masking power decreases for critical bandwidth probe noises centered above and below the minimum SMR probe noise.

5.4.3 Noise-Masking-Noise

The NMN scenario, in which a narrowband noise masks another narrowband noise, is more difficult to characterize than either NMT or TMN because of the confounding influence of phase relationships between the masker and maskee [Hall98]. Essentially, different relative phases between the components of each can lead to different threshold SMRs. The results from one study of intensity difference detection thresholds for wideband noise [Mill47] produced threshold SMRs of nearly 26 dB for NMN [Hall98].

5.4.4 Asymmetry of Masking

The NMT and TMN examples in Figure 5.7 clearly show an asymmetry in masking power between the noise masker and the tone masker. In spite of the fact that both maskers are presented at a level of 80 dB SPL, the associated threshold SMRs differ by 20 dB. This asymmetry motivates our interest in both the


TMN and NMT masking paradigms, as well as NMN. In fact, knowledge of all three is critical to success in the task of shaping coding distortion such that it is undetectable by the human auditory system. For each temporal analysis interval, a codec's perceptual model should identify, across the frequency spectrum, noise-like and tone-like components within both the audio signal and the coding distortion. Next, the model should apply the appropriate masking relationships in a frequency-specific manner. In conjunction with the spread of masking (below), NMT, NMN, and TMN properties can then be used to construct a global masking threshold. Although several methods for masking threshold estimation have proven effective, we note that a deeper understanding of masking asymmetry may provide opportunities for improved perceptual models. In particular, Hall [Hall97] has shown that masking asymmetry can be explained in terms of relative masker/maskee bandwidths, and not necessarily exclusively in terms of absolute masker properties. Ultimately, this implies that the de facto standard energy-based schemes for masking power estimation among perceptual codecs may be valid only so long as the masker bandwidth equals or exceeds the maskee (probe) bandwidth. In cases where the probe bandwidth exceeds the masker bandwidth, an envelope-based measure should be embedded in the masking calculation [Hall97] [Hall98].

5.4.5 The Spread of Masking

The simultaneous masking effects characterized before by the paradigms NMT, TMN, and NMN are not bandlimited to within the boundaries of a single critical band. Interband masking also occurs, i.e., a masker centered within one critical band has some predictable effect on detection thresholds in other critical bands. This effect, also known as the spread of masking, is often modeled in coding applications by an approximately triangular spreading function that has slopes of +25 and −10 dB per Bark. A convenient analytical expression [Schr79] is given by

SF_dB(x) = 15.81 + 7.5(x + 0.474) − 17.5 √(1 + (x + 0.474)^2) dB, (5.7)

where x has units of Barks and SF_dB(x) is expressed in dB. After critical band analysis is done and the spread of masking has been accounted for, masking thresholds in perceptual coders are often established by the [Jaya93] decibel (dB) relations:

TH_N = E_T − 14.5 − B, (5.8)

TH_T = E_N − K, (5.9)

where TH_N and TH_T, respectively, are the noise- and tone-masking thresholds due to tone-masking-noise and noise-masking-tone, E_N and E_T are the critical band noise and tone masker energy levels, and B is the critical band number. Depending upon the algorithm, the parameter K is typically set between 3 and 5 dB.
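As a hedged numerical sketch of Eqs. (5.7)–(5.9), the MATLAB fragment below plots the spreading function and evaluates the two thresholds for a single masker; the 80-dB SPL masker levels, the band index B = 9, and the mid-range value K = 4.5 dB are assumed example values, not prescribed by any standard.

    % Sketch: spreading function, Eq. (5.7), and masking thresholds, Eqs. (5.8)-(5.9).
    x    = -4:0.1:4;                            % maskee-to-masker distance in Barks
    SFdB = 15.81 + 7.5*(x + 0.474) - 17.5*sqrt(1 + (x + 0.474).^2);  % Eq. (5.7)
    plot(x, SFdB); grid on;
    xlabel('Bark distance x'); ylabel('SF_{dB}(x) (dB)');

    ET = 80;  B = 9;          % assumed tone masker level (dB SPL) and band number
    EN = 80;  K = 4.5;        % assumed noise masker level (dB SPL) and offset K
    TH_N = ET - 14.5 - B      % Eq. (5.8): threshold for noise masked by the tone
    TH_T = EN - K             % Eq. (5.9): threshold for a tone masked by the noise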


Of course, the thresholds of Eqs. (5.8) and (5.9) capture only the contributions of individual tone-like or noise-like maskers. In the actual coding scenario, each frame typically contains a collection of both masker types. One can see easily that Eqs. (5.8) and (5.9) capture the masking asymmetry described previously. After they have been identified, these individual masking thresholds are combined to form a global masking threshold.

The global masking threshold comprises an estimate of the level at which quantization noise becomes just noticeable. Consequently, the global masking threshold is sometimes referred to as the level of "just-noticeable distortion," or JND. The standard practice in perceptual coding involves first classifying masking signals as either noise or tone, next computing appropriate thresholds, and then using this information to shape the noise spectrum beneath the JND level. Two illustrated examples are given later in Sections 5.6 and 5.7, which address the perceptual entropy and the ISO/IEC MPEG Model 1, respectively. Note that the absolute threshold (T_q) of hearing is also considered when shaping the noise spectra, and that MAX(JND, T_q) is most often used as the permissible distortion threshold. Notions of critical bandwidth and simultaneous masking in the audio coding context give rise to some convenient terminology illustrated in Figure 5.8, where we consider the case of a single masking tone occurring at the center of a critical band. All levels in the figure are given in terms of dB SPL.

A hypothetical masking tone occurs at some masking level. This generates an excitation along the basilar membrane that is modeled by a spreading function and a corresponding masking threshold. For the band under consideration, the minimum masking threshold denotes the spreading function in-band minimum. Assuming the masker is quantized using an m-bit uniform scalar quantizer, noise might be introduced at the level m. Signal-to-mask ratio (SMR) and noise-to-mask ratio (NMR) denote the log distances from the minimum masking threshold to the masker and noise levels, respectively.

[Figure 5.8: Sound pressure level (dB) against frequency, showing a masking tone at the center of a critical band, the associated masking threshold and its in-band minimum, a neighboring band, quantization noise levels for m − 1, m, and m + 1 bits, and the SMR, SNR, and NMR distances.]

Figure 5.8. Schematic representation of simultaneous masking (after [Noll93]).


5.5 NONSIMULTANEOUS MASKING

As shown in Figure 5.9, masking phenomena extend in time beyond the window of simultaneous stimulus presentation. In other words, for a masker of finite duration, nonsimultaneous (also sometimes denoted "temporal") masking occurs both prior to masker onset as well as after masker removal. The skirts on both regions are schematically represented in Figure 5.9. Essentially, absolute audibility thresholds for masked sounds are artificially increased prior to, during, and following the occurrence of a masking signal. Whereas significant premasking tends to last only about 1–2 ms, postmasking will extend anywhere from 50 to 300 ms, depending upon the strength and duration of the masker [Zwic90]. Below, we consider key nonsimultaneous masking properties that should be embedded in audio codec perceptual models.

Of the two nonsimultaneous masking modes, forward masking is better understood. For masker and probe of the same frequency, experimental studies have shown that the amount of forward (post-) masking depends in a predictable way on stimulus frequency [Jest82], masker intensity [Jest82], probe delay after masker cessation [Jest82], and masker duration [Moor96]. Forward masking also exhibits frequency-dependent behavior similar to simultaneous masking that can be observed when the masker and probe frequency relationship is varied [Moor78]. Although backward (pre-) masking has also been the subject of many

studies, it is not well understood [Moor96]. As shown in Figure 5.9, backward masking decays much more rapidly than forward masking. For example, one study at Thomson Consumer Electronics showed that only 2 ms prior to masker onset, the masked threshold was already 25 dB below the threshold of simultaneous masking [Bran98].

We note, however, that the literature lacks consensus over the maximum time persistence of significant backward masking. Despite the inconsistent results across studies, it is generally accepted that the amount of measured backward masking depends significantly on the training of the experimental subjects.

Figure 5.9. Nonsimultaneous masking properties of the human ear (maskee audibility threshold increase, in dB, against time after masker appearance and after masker removal, in ms). Backward (pre-) masking occurs prior to masker onset and lasts only a few milliseconds, whereas forward (post-) masking may persist for more than 100 ms after masker removal (after [Zwic90]).


For the purposes of perceptual coding, abrupt audio signal transients (e.g., the onset of a percussive musical instrument) create pre- and postmasking regions during which a listener will not perceive signals beneath the elevated audibility thresholds produced by a masker. In fact, temporal masking has been used in several audio coding algorithms (e.g., [Bran94a] [Papa95] [ISOI96a] [Fiel96] [Sinh98a]). Premasking, in particular, has been exploited in conjunction with adaptive block size transform coding to compensate for pre-echo distortions (Chapter 6, Sections 6.9 and 6.10).

5.6 PERCEPTUAL ENTROPY

Johnston [John88a] combined notions of psychoacoustic masking with signal quantization principles to define perceptual entropy (PE), a measure of perceptually relevant information contained in any audio record. Expressed in bits per sample, PE represents a theoretical limit on the compressibility of a particular signal. PE measurements reported in [John88a] and [John88b] suggest that a wide variety of CD-quality audio source material can be transparently compressed at approximately 2.1 bits per sample. The PE estimation process is accomplished as follows. The signal is first windowed and transformed to the frequency domain. A masking threshold is then obtained using perceptual rules. Finally, a determination is made of the number of bits required to quantize the spectrum without injecting perceptible noise. The PE measurement is obtained by constructing a PE histogram over many frames and then choosing a worst-case value as the actual measurement.

The frequency-domain transformation is done with a Hann window followed by a 2048-point fast Fourier transform (FFT). Masking thresholds are obtained by performing critical band analysis (with spreading), making a determination of the noise-like or tone-like nature of the signal, applying thresholding rules for the signal quality, and then accounting for the absolute hearing threshold. First, real and imaginary transform components are converted to power spectral components:

P(ω) = Re^2(ω) + Im^2(ω), (5.10)

then a discrete Bark spectrum is formed by summing the energy in each critical band (Table 5.1):

B_i = Σ_{ω=bl_i}^{bh_i} P(ω), (5.11)

where the summation limits are the critical band boundaries. The range of the index i is sample-rate dependent, and, in particular, i ∈ {1, . . . , 25} for CD-quality signals. A spreading function, Eq. (5.7), is then convolved with the discrete Bark spectrum:

C_i = B_i ∗ SF_i, (5.12)


to account for the spread of masking. An estimation of the tone-like or noise-like quality for C_i is then obtained using the spectral flatness measure (SFM):

SFM = μ_g / μ_a, (5.13)

where μ_g and μ_a, respectively, correspond to the geometric and arithmetic means of the PSD components for each band. The SFM has the property that it is bounded by 0 and 1. Values close to 1 will occur if the spectrum is flat in a particular band, indicating a decorrelated (noisy) band. Values close to zero will occur if the spectrum in a particular band is narrowband. A coefficient of tonality, α, is next derived from the SFM, on a dB scale:

α = min( SFM_dB / −60, 1 ), (5.14)

and this is used to weight the thresholding rules given by Eqs. (5.8) and (5.9) (with K = 5.5) as follows, for each band, to form an offset:

O_i = α(14.5 + i) + (1 − α)5.5 (in dB). (5.15)

A set of JND estimates in the frequency power domain are then formed by subtracting the offsets from the Bark spectral components:

T_i = 10^{log10(C_i) − O_i/10}. (5.16)

These estimates are scaled by a correction factor to simulate deconvolution of the spreading function, and then each T_i is checked against the absolute threshold of hearing and replaced by max(T_i, T_q(i)). In a manner essentially identical to the SPL calibration procedure that was described in Section 5.2, the PE estimation is calibrated by equating the minimum absolute threshold to the energy in a 4-kHz signal of ±1 bit amplitude. In other words, the system assumes that the playback level (volume control) is configured such that the smallest possible signal amplitude will be associated with an SPL equal to the minimum absolute threshold. By applying uniform quantization principles to the signal and associated set of JND estimates, it is possible to estimate a lower bound on the number of bits required to achieve transparent coding. In fact, it can be shown that the perceptual entropy in bits per sample is given by

PE = Σ_{i=1}^{25} Σ_{ω=bl_i}^{bh_i} [ log2( 2 |nint( Re(ω) / √(6 T_i / k_i) )| + 1 ) + log2( 2 |nint( Im(ω) / √(6 T_i / k_i) )| + 1 ) ] (bits/sample), (5.17)


where i is the index of the critical band, bl_i and bh_i are the lower and upper bounds of band i, k_i is the number of transform components in band i, T_i is the masking threshold in band i (Eq. (5.16)), and nint denotes rounding to the nearest integer. Note that if 0 occurs in the log, we assign 0 for the result. The masking thresholds used in the above PE computation also form the basis for a transform coding algorithm described in Chapter 7. In addition, the ISO/IEC MPEG-1 psychoacoustic model 2, which is often used in MP3 encoders, is closely related to the PE procedure.
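A compressed MATLAB sketch of the chain in Eqs. (5.10)–(5.17) is given below, under stated assumptions: x is one 2048-sample frame at 44.1 kHz, the vectors bl and bh hold the FFT-bin boundaries of the 25 bands of Table 5.1, and Tq holds a per-band absolute threshold, none of which are derived here. The spreading convolution of Eq. (5.12) and its deconvolution correction are omitted for brevity, so the numbers it produces are only indicative; hann() is a Signal Processing Toolbox function.

    % Sketch: perceptual entropy for one frame, following Eqs. (5.10)-(5.17).
    N = 2048;
    X = fft(hann(N) .* x(:), N);
    P = real(X).^2 + imag(X).^2;                    % Eq. (5.10), power spectrum
    B = zeros(1, 25);  O = zeros(1, 25);
    for i = 1:25
        idx   = bl(i):bh(i);
        B(i)  = sum(P(idx));                        % Eq. (5.11), Bark spectrum
        SFMdB = 10*log10(exp(mean(log(P(idx)))) / mean(P(idx)));  % Eq. (5.13), in dB
        alpha = min(SFMdB / -60, 1);                % Eq. (5.14), tonality coefficient
        O(i)  = alpha*(14.5 + i) + (1 - alpha)*5.5; % Eq. (5.15), offset in dB
    end
    C  = B;                                         % Eq. (5.12) spreading omitted here
    T  = 10.^(log10(C) - O/10);                     % Eq. (5.16), JND estimates
    T  = max(T, Tq);                                % limit by the absolute threshold
    PE = 0;
    for i = 1:25
        idx = bl(i):bh(i);
        k   = numel(idx);                           % transform components in band i
        q   = sqrt(6*T(i)/k);                       % quantizer step proxy
        PE  = PE + sum(log2(2*abs(round(real(X(idx))/q)) + 1) ...
                     + log2(2*abs(round(imag(X(idx))/q)) + 1));   % Eq. (5.17)
    end
    PE_per_sample = PE / N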

We note, however, that there have been evolutionary improvements since the PE estimation scheme first appeared in 1988. For example, the PE calculation in many systems (e.g., [ISOI94]) relies on improved tonality estimates relative to the SFM-based measure of Eq. (5.14). The SFM-based measure is both time- and frequency-constrained. Only one spectral estimate (analysis frame) is examined in time, and in frequency the measure by definition lumps together multiple spectral lines. In contrast, other tonality estimation schemes, e.g., the "chaos measure" [ISOI94] [Bran98], consider the predictability of individual frequency components across time, in terms of magnitude and phase-tracking properties. A predicted value for each component is compared against its actual value, and the Euclidean distance is mapped to a measure of predictability. Highly predictable spectral components are considered to be tonal, while unpredictable components are treated as noise-like. A tonality coefficient that allows weighting towards one extreme or the other is computed from the chaos measure, just as in Eq. (5.14). Improved performance has been demonstrated in several instances (e.g., [Bran90] [ISOI94] [Bran98]). Nevertheless, the PE measurement as proposed in its original form conveys valuable insight on the application of simultaneous masking asymmetry to a perceptual model in a practical system.

5.7 EXAMPLE CODEC PERCEPTUAL MODEL: ISO/IEC 11172-3 (MPEG-1) PSYCHOACOUSTIC MODEL 1

It is useful to consider an example of how the psychoacoustic principles described thus far are applied in actual coding algorithms. The ISO/IEC 11172-3 (MPEG-1, layer 1) psychoacoustic model 1 [ISOI92] determines the maximum allowable quantization noise energy in each critical band such that quantization noise remains inaudible. In one of its modes, the model uses a 512-point FFT for high-resolution spectral analysis (86.13 Hz), then estimates, for each input frame, individual simultaneous masking thresholds due to the presence of tone-like and noise-like maskers in the signal spectrum. A global masking threshold is then estimated for a subset of the original 256 frequency bins by (power) additive combination of the tonal and nontonal individual masking thresholds. The remainder of this section describes the step-by-step model operations. Sample results are given for one frame of CD-quality pop music sampled at 44.1 kHz, 16 bits per sample. We note that although this model is suitable for any of the MPEG-1 coding layers I–III, the standard [ISOI92] recommends that model 1 be used with layers I and II, while model 2 is recommended for layer III (MP3). The five steps leading to the computation of global masking thresholds are described in the following sections.

5.7.1 Step 1: Spectral Analysis and SPL Normalization

Spectral analysis and normalization are performed first. The goal of this step is to obtain a high-resolution spectral estimate of the input, with spectral components expressed in terms of sound pressure level (SPL). Much like the PE calculation described previously, this SPL normalization guarantees that a 4 kHz signal of ±1 bit amplitude will be associated with an SPL near 0 dB (close to an acceptable T_q value for normal listeners at 4 kHz), whereas a full-scale sinusoid will be associated with an SPL near 90 dB. The spectral analysis procedure works as follows. First, incoming audio samples, s(n), are normalized according to the FFT length, N, and the number of bits per sample, b, using the relation

x(n) = \frac{s(n)}{N(2^{b-1})}    (5.18)

Normalization references the power spectrum to a 0-dB maximum. The normalized input, x(n), is then segmented into 12-ms frames (512 samples) using a 1/16th-overlapped Hann window, such that each frame contains 10.9 ms of new data. A power spectral density (PSD) estimate, P(k), is then obtained using a 512-point FFT, i.e.,

P(k) = PN + 10 \log_{10} \left| \sum_{n=0}^{N-1} w(n) x(n) e^{-j 2\pi k n / N} \right|^2, \quad 0 \leq k \leq \frac{N}{2}    (5.19)

where the power normalization term, PN, is fixed at 90.302 dB, and the Hann window, w(n), is defined as

w(n) = \frac{1}{2}\left[ 1 - \cos\left( \frac{2\pi n}{N} \right) \right]    (5.20)

Because playback levels are unknown during psychoacoustic signal analysis, the normalization procedure (Eq. (5.18)) and the parameter PN in Eq. (5.19) are used to estimate SPL conservatively from the input signal. For example, a full-scale sinusoid that is precisely resolved by the 512-point FFT in bin k_0 will yield a spectral line, P(k_0), having 84 dB SPL. With 16-bit sample resolution, SPL estimates for very-low-amplitude input signals will be at or below the absolute threshold. An example PSD estimate obtained in this manner for a CD-quality pop music selection is given in Figure 5.10(a). The spectrum is shown both on a linear frequency scale (upper plot) and on the Bark scale (lower plot). The dashed line in both plots corresponds to the absolute threshold of hearing approximation used by the model.
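A minimal MATLAB sketch of Step 1, written directly from Eqs. (5.18)–(5.20), is given below. The input frame is a random stand-in for one block of audio, and the small eps term that guards the logarithm is an implementation convenience, not part of the model.

% Step 1 sketch: spectral analysis and SPL normalization (Eqs. (5.18)-(5.20)).
N  = 512;                                    % FFT length
b  = 16;                                     % bits per sample (assumed)
PN = 90.302;                                 % power normalization term (dB)
s  = randn(N, 1);                            % stand-in for one frame of input audio
x  = s / (N * 2^(b-1));                      % Eq. (5.18): amplitude normalization
n  = (0:N-1)';
w  = 0.5 * (1 - cos(2*pi*n/N));              % Eq. (5.20): Hann window
X  = fft(w .* x, N);
P  = PN + 10*log10(abs(X(1:N/2+1)).^2 + eps);% Eq. (5.19): PSD estimate in dB SPL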

5.7.2 Step 2: Identification of Tonal and Noise Maskers

After PSD estimation and SPL normalization, tonal and nontonal masking components are identified.


Figure 5.10a. ISO/IEC MPEG-1 psychoacoustic analysis model 1 for an example pop music selection, steps 1–5 as described in the text. (a) Step 1: Obtain PSD, express in dB SPL. Top panel gives linear frequency scale, bottom panel gives Bark frequency scale. Absolute threshold superimposed. Step 2: Tonal maskers identified and denoted by 'x' symbol; noise maskers identified and denoted by 'o' symbol. (b) Collection of prototype spreading functions (Eq. (5.31)) shown with level as the parameter. These illustrate the incorporation of excitation pattern level-dependence into the model. Note that the prototype functions are defined to be piecewise linear on the Bark scale. These will be associated with maskers in steps 3 and 4. (c) Steps 3 and 4: Spreading functions are associated with each of the individual tonal maskers satisfying the rules outlined in the text. Note that the signal-to-mask ratio (SMR) at the peak is close to the widely accepted tonal value of 14.5 dB. (d) Spreading functions are associated with each of the individual noise maskers that were extracted after the tonal maskers had been eliminated from consideration, as described in the text. Note that the peak SMR is close to the widely accepted noise-masker value of 5 dB. (e) Step 5: A global masking threshold is obtained by combining the individual thresholds as described in the text. The maximum of the global threshold and the absolute threshold is taken at each point in frequency to be the final global threshold. The figure clearly shows that some portions of the input spectrum require SNRs better than 20 dB to prevent audible distortion, while other spectral regions require less than 3 dB SNR.


Figure 5.10b, c.


Figure 5.10d, e.


Local maxima in the sample PSD that exceed neighboring components within a certain Bark distance by at least 7 dB are classified as tonal. Specifically, the "tonal" set, S_T, is defined as

S_T = \left\{ P(k) \;\middle|\; P(k) > P(k \pm 1),\; P(k) > P(k \pm \Delta_k) + 7\ \text{dB} \right\}    (5.21)

where

\Delta_k \in \begin{cases} 2, & 2 < k < 63 \quad (0.17{-}5.5\ \text{kHz}) \\ [2, 3], & 63 \leq k < 127 \quad (5.5{-}11\ \text{kHz}) \\ [2, 6], & 127 \leq k \leq 256 \quad (11{-}20\ \text{kHz}) \end{cases}    (5.22)

Tonal maskers, P_{TM}(k), are computed from the spectral peaks listed in S_T as follows:

P_{TM}(k) = 10 \log_{10} \sum_{j=-1}^{1} 10^{0.1 P(k+j)} \quad (\text{dB})    (5.23)

In other words, for each neighborhood maximum, energy from the three adjacent spectral components centered at the peak is combined to form a single tonal masker. Tonal maskers extracted from the example pop music selection are identified using 'x' symbols in Figure 5.10(a). A single noise masker for each critical band, P_{NM}(\bar{k}), is then computed from (remaining) spectral lines not within the ±Δ_k neighborhood of a tonal masker, using the sum

P_{NM}(\bar{k}) = 10 \log_{10} \sum_{j} 10^{0.1 P(j)} \quad (\text{dB}), \quad \forall P(j) \notin \{ P_{TM}(k, k \pm 1, k \pm \Delta_k) \}    (5.24)

where \bar{k} is defined to be the geometric mean spectral line of the critical band, i.e.,

\bar{k} = \left( \prod_{j=l}^{u} j \right)^{1/(u - l + 1)}    (5.25)

where l and u are the lower and upper spectral line boundaries of the critical band, respectively. The idea behind Eq. (5.24) is that residual spectral energy within a critical bandwidth not associated with a tonal masker must, by default, be associated with a noise masker. Therefore, in each critical band, Eq. (5.24) combines into a single noise masker all of the energy from spectral components that have not contributed to a tonal masker within the same band. Noise maskers are denoted in Figure 5.10 by 'o' symbols. Dashed vertical lines are included in the Bark scale plot to show the associated critical band for each masker.
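The following MATLAB fragment sketches the tonal-masker search of Eqs. (5.21)–(5.23) on the PSD vector P from the Step 1 sketch; the Bark-dependent neighborhoods follow Eq. (5.22), and the loop limits are chosen only to keep the indexing in range. Residual components would then be pooled per critical band into noise maskers according to Eq. (5.24). This is an illustrative sketch, not the normative procedure of the standard.

% Step 2 sketch: tonal masker detection and energy combination (Eqs. (5.21)-(5.23)).
% Assumes P is the PSD in dB SPL (257 bins) from the Step 1 sketch above.
P_TM = -inf(size(P));                         % tonal masker SPL, indexed by peak bin
for k = 3:250
    if k < 63,       dk = 2;                  % 0.17-5.5 kHz neighborhood
    elseif k < 127,  dk = 2:3;                % 5.5-11 kHz neighborhood
    else             dk = 2:6;                % 11-20 kHz neighborhood
    end
    isPeak = P(k) > P(k-1) && P(k) > P(k+1);          % Eq. (5.21), first condition
    if isPeak && all(P(k) > P(k + [dk, -dk]) + 7)     % Eq. (5.21), second condition
        % Eq. (5.23): combine the three components centered at the peak
        P_TM(k) = 10*log10(sum(10.^(0.1*P(k-1:k+1))));
    end
end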

5.7.3 Step 3: Decimation and Reorganization of Maskers

In this step, the number of maskers is reduced using two criteria. First, any tonal or noise maskers below the absolute threshold are discarded, i.e., only maskers that satisfy

P_{TM,NM}(k) \geq T_q(k)    (5.26)


are retained, where T_q(k) is the SPL of the threshold in quiet at spectral line k. In the pop music example, two high-frequency noise maskers identified during step 2 (Figure 5.10(a)) are dropped after application of Eq. (5.26) (Figure 5.10(c–e)). Next, a sliding 0.5-Bark-wide window is used to replace any pair of maskers occurring within a distance of 0.5 Bark by the stronger of the two. In the pop music example, two tonal maskers appear between 19.5 and 20.5 Barks (Figure 5.10(a)). It can be seen that the pair is replaced by the stronger of the two during threshold calculations (Figure 5.10(c–e)). After the sliding window procedure, masker frequency bins are reorganized according to the subsampling scheme

P_{TM,NM}(i) = P_{TM,NM}(k)    (5.27)

P_{TM,NM}(k) = 0    (5.28)

where

i = \begin{cases} k, & 1 \leq k \leq 48 \\ k + (k \bmod 2), & 49 \leq k \leq 96 \\ k + 3 - ((k - 1) \bmod 4), & 97 \leq k \leq 232 \end{cases}    (5.29)

The net effect of Eq. (5.29) is 2:1 decimation of masker bins in critical bands 18–22 and 4:1 decimation of masker bins in critical bands 22–25, with no loss of masking components. This procedure reduces the total number of tone and noise masker frequency bins under consideration from 256 to 106. Tonal and noise maskers shown in Figure 5.10(c–e) have been relocated according to this decimation scheme.
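The bin reorganization of Eq. (5.29) is purely index arithmetic, so it can be expressed compactly in MATLAB as shown below; the vectorized form is a direct transcription of the three cases.

% Step 3 sketch: subsampled bin index i for each masker bin k, per Eq. (5.29).
k = (1:232)';
i = k;                                        % 1 <= k <= 48: bins kept as is
sel = (k >= 49) & (k <= 96);
i(sel) = k(sel) + mod(k(sel), 2);             % 2:1 decimation region
sel = (k >= 97) & (k <= 232);
i(sel) = k(sel) + 3 - mod(k(sel) - 1, 4);     % 4:1 decimation region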

5.7.4 Step 4: Calculation of Individual Masking Thresholds

Using the decimated set of tonal and noise maskers, individual tone and noise masking thresholds are computed next. Each individual threshold represents a masking contribution at frequency bin i due to the tone or noise masker located at bin j (reorganized during step 3). Tonal masker thresholds, T_{TM}(i, j), are given by

T_{TM}(i, j) = P_{TM}(j) - 0.275 Z_b(j) + SF(i, j) - 6.025 \quad (\text{dB SPL})    (5.30)

where P_{TM}(j) denotes the SPL of the tonal masker in frequency bin j, Z_b(j) denotes the Bark frequency of bin j (Eq. (5.3)), and the spread of masking from masker bin j to maskee bin i, SF(i, j), is modeled by the expression

SF(i, j) = \begin{cases} 17\Delta Z_b - 0.4 P_{TM}(j) + 11, & -3 \leq \Delta Z_b < -1 \\ (0.4 P_{TM}(j) + 6)\Delta Z_b, & -1 \leq \Delta Z_b < 0 \\ -17\Delta Z_b, & 0 \leq \Delta Z_b < 1 \\ (0.15 P_{TM}(j) - 17)\Delta Z_b - 0.15 P_{TM}(j), & 1 \leq \Delta Z_b < 8 \end{cases} \quad (\text{dB SPL})    (5.31)


i.e., as a piecewise linear function of masker level, P(j), and Bark maskee-masker separation, \Delta Z_b = Z_b(i) - Z_b(j). SF(i, j) approximates the basilar spreading (excitation pattern) described in Section 5.4. Prototype individual masking thresholds, T_{TM}(i, j), are shown as a function of masker level in Figure 5.10(b) for an example tonal masker occurring at Z_b = 10 Barks. As shown in the figure, the slope of T_{TM}(i, j) decreases with increasing masker level. This is a reflection of psychophysical test results, which have demonstrated [Zwic90] that the ear's frequency selectivity decreases as stimulus levels increase. It is also noted here that the spread of masking in this particular model is constrained to a 10-Bark neighborhood for computational efficiency. This simplifying assumption is reasonable given the very low masking levels that occur in the tails of the excitation patterns modeled by SF(i, j). Figure 5.10(c) shows the individual masking thresholds (Eq. (5.30)) associated with the tonal maskers in Figure 5.10(a) ('x'). It can be seen here that the pair of maskers identified near 19 Barks has been replaced by the stronger of the two during the decimation phase. The plot includes the absolute hearing threshold for reference. Individual noise masker thresholds, T_{NM}(i, j), are given by

T_{NM}(i, j) = P_{NM}(j) - 0.175 Z_b(j) + SF(i, j) - 2.025 \quad (\text{dB SPL})    (5.32)

where P_{NM}(j) denotes the SPL of the noise masker in frequency bin j, Z_b(j) denotes the Bark frequency of bin j (Eq. (5.3)), and SF(i, j) is obtained by replacing P_{TM}(j) with P_{NM}(j) in Eq. (5.31). Figure 5.10(d) shows individual masking thresholds associated with the noise maskers identified in step 2 (Figure 5.10(a), 'o'). It can be seen in Figure 5.10(d) that the two high-frequency noise maskers that occur below the absolute threshold have been eliminated.
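A small MATLAB sketch of the Step 4 calculation for a single masker-maskee pair is given below, following Eqs. (5.30)–(5.32); the masker SPL, its Bark frequency, and the Bark separation are set to arbitrary illustrative values.

% Step 4 sketch: one individual masking-threshold contribution (Eqs. (5.30)-(5.32)).
Pj  = 70;                                   % masker SPL in bin j (illustrative)
zbj = 10;                                   % Bark frequency of bin j (illustrative)
dz  = 0.5;                                  % Bark separation Z_b(i) - Z_b(j), -3 <= dz < 8
if dz < -1
    SF = 17*dz - 0.4*Pj + 11;               % Eq. (5.31), -3 <= dz < -1
elseif dz < 0
    SF = (0.4*Pj + 6)*dz;                   % Eq. (5.31), -1 <= dz < 0
elseif dz < 1
    SF = -17*dz;                            % Eq. (5.31),  0 <= dz < 1
else
    SF = (0.15*Pj - 17)*dz - 0.15*Pj;       % Eq. (5.31),  1 <= dz < 8
end
T_TM = Pj - 0.275*zbj + SF - 6.025;         % Eq. (5.30): tonal masker threshold (dB SPL)
T_NM = Pj - 0.175*zbj + SF - 2.025;         % Eq. (5.32): form used for a noise masker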

Before we proceed to step 5 and compute a global masking threshold, it is worthwhile to consider the relationship between Eq. (5.8) and Eq. (5.30), as well as the connection between Eq. (5.9) and Eq. (5.32). Equations (5.8) and (5.30) are related in that both model the TMN paradigm (Section 5.4) in order to generate a masking threshold for quantization noise masked by a tonal signal component. In the case of Eq. (5.8), a Bark-dependent offset that is consistent with experimental TMN data for the threshold minimum SMR is subtracted from the masker intensity, namely, the quantity 14.5 + B. In a similar manner, Eq. (5.30) estimates, for a quantization noise maskee located in bin i, the intensity of the masking contribution due to the tonal masker located in bin j. Like Eq. (5.8), the psychophysical motivation for Eq. (5.30) is the desire to model the relatively weak masking contributions of a TMN. Unlike Eq. (5.8), however, Eq. (5.30) uses an offset of only 6.025 + 0.275B, i.e., Eq. (5.30) assumes a smaller minimum SMR at threshold than does Eq. (5.8). The connection between Eqs. (5.9) and (5.32) is analogous. In the case of this equation pair, however, the psychophysical motivation is to model the masking contributions of NMT. Equation (5.9) assumes a Bark-independent minimum SMR of 3–5 dB, depending on the value of the parameter K. Equation (5.32), on the other hand, assumes a Bark-dependent threshold minimum SMR of 2.025 + 0.175B dB. Also, whereas the spreading function (SF) terms embedded in Eqs. (5.30) and (5.32) explicitly account for the spread of masking, Eqs. (5.8) and (5.9) assume that the spread of masking was captured during the computation of the terms E_T and E_N, respectively.

5.7.5 Step 5: Calculation of Global Masking Thresholds

In this step, individual masking thresholds are combined to estimate a global masking threshold for each frequency bin in the subset given by Eq. (5.29). The model assumes that masking effects are additive. The global masking threshold, T_g(i), is therefore obtained by computing the sum

T_g(i) = 10 \log_{10} \left( 10^{0.1 T_q(i)} + \sum_{l=1}^{L} 10^{0.1 T_{TM}(i,l)} + \sum_{m=1}^{M} 10^{0.1 T_{NM}(i,m)} \right) \quad (\text{dB SPL})    (5.33)

where T_q(i) is the absolute hearing threshold for frequency bin i, T_{TM}(i, l) and T_{NM}(i, m) are the individual masking thresholds from step 4, and L and M are the numbers of tonal and noise maskers, respectively, identified during step 3. In other words, the global threshold for each frequency bin represents a signal-dependent, power-additive modification of the absolute threshold due to the basilar spread of all tonal and noise maskers in the signal power spectrum. Figure 5.10(e) shows the global masking threshold obtained by adding the power of the individual tonal (Figure 5.10(c)) and noise (Figure 5.10(d)) maskers to the absolute threshold in quiet.
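In MATLAB, the power-additive combination of Eq. (5.33) reduces to one expression once the individual thresholds are arranged as matrices; the sizes and random values below are placeholders for the outputs of steps 3 and 4.

% Step 5 sketch: global masking threshold per Eq. (5.33), with toy inputs.
numBins = 106; L = 4; M = 6;                 % illustrative sizes only
Tq  = -5 + 10*rand(numBins, 1);              % stand-in absolute threshold (dB SPL)
TTM = 20*rand(numBins, L);                   % stand-in tonal masker thresholds
TNM = 20*rand(numBins, M);                   % stand-in noise masker thresholds
Tg  = 10*log10( 10.^(0.1*Tq) ...
              + sum(10.^(0.1*TTM), 2) ...
              + sum(10.^(0.1*TNM), 2) );     % global threshold in dB SPL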

5.8 PERCEPTUAL BIT ALLOCATION

In this section, we will extend the uniform- and optimal-bit allocation algorithms presented in Chapter 3, Section 3.5, with perceptual bit-assignment strategies. In the perceptual bit allocation method, the number of bits allocated to different bands is determined based on the global masking thresholds obtained from the psychoacoustic model. The steps involved in the computation of the global masking thresholds have been presented in detail in the previous section. The signal-to-mask ratio (SMR) determines the number of bits to be assigned in each band for perceptually transparent coding of the input audio. The noise-to-mask ratios (NMRs) are computed by subtracting the SMR from the SNR in each subband, i.e.,

NMR = SNR - SMR \quad (\text{dB})    (5.34)

The main objective in a perceptual bit allocation scheme is to keep the quantization noise below a masking threshold. For example, note that the NMR in Figure 5.11(a) is larger than the NMR in Figure 5.11(b). Hence, the (quantization) noise in the case of Figure 5.11(a) can be masked more easily than in the case of Figure 5.11(b). Therefore, it is logical to assign a sufficient number of bits to the subband with the lowest NMR. This criterion is applied to all the subbands until all the bits are exhausted. Typically, in audio coding standards, an iterative procedure is employed that satisfies both the bit-rate and global masking threshold requirements.
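A greedy version of this allocation rule is sketched below in MATLAB, using the NMR definition of Eq. (5.34) and the common rule of thumb of roughly 6 dB of SNR per bit; the SMR values and bit budget are invented for illustration and do not come from any particular standard.

% Sketch of a greedy perceptual bit-allocation loop (Eq. (5.34)).
SMR  = [22 18 12 6 3 -2 -5 -8];              % per-subband SMRs from the model (assumed)
bits = zeros(size(SMR));                     % bits per subband sample
pool = 40;                                   % total bit budget (assumed)
while pool > 0
    SNR = 6.02 * bits;                       % ~6 dB of SNR gained per allocated bit
    NMR = SNR - SMR;                         % Eq. (5.34): noise-to-mask ratio
    [~, j] = min(NMR);                       % neediest subband = lowest NMR
    bits(j) = bits(j) + 1;
    pool = pool - 1;
end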


Figure 5.11. Simultaneous masking, depicting relatively large NMR in (a) compared to (b).


5.9 SUMMARY

This chapter dealt with some of the basics of psychoacoustics. We covered the absolute threshold of hearing, the Bark scale, the simultaneous and temporal masking effects, and the perceptual entropy. A step-by-step procedure that describes the ISO/IEC psychoacoustic model 1 was provided.

PROBLEMS

5.1. Describe the difference between a Mel scale and a Bark scale. Give tables and itemize, side-by-side, the center frequencies and bandwidth for 0–5 kHz. Describe how the two different scales are constructed.

5.2. In Figure 5.12, the solid line indicates the just noticeable distortion (JND) curve, and the dotted line indicates the absolute threshold in quiet. State which of the tones A, B, C, or D would be audible and which ones are likely to be masked. Explain.

5.3. In Figure 5.13, state whether tone B would be masked by tone A. Explain. Also, indicate whether tone C would mask the narrow-band noise. Give reasons.

5.4. In Figure 5.14, the solid line indicates the JND curve obtained from the psychoacoustic model 1. A broadband noise component is shown that spans from 3 to 11 Barks, and a tone is present at 10 Barks. Sketch the portions of the noise and the tone that could be considered perceptually relevant.


Figure 5.12. JND curve for Problem 5.2.


Figure 5.13. Masking experiment, Problem 5.3.


COMPUTER EXERCISES

5.5. Design a 3-band equalizer using the peaking filter equations of Chapter 2. The center frequencies should correspond to the auditory filters (see Table 5.1) at center frequencies 450 Hz, 1000 Hz, and 2500 Hz. Compute the Q-factors associated with each of these filters using Q = f_0/BW, where f_0 is the center frequency and BW is the filter bandwidth (obtain from Table 5.1). Choose g = 5 dB for all the filters. Give the frequency response of the 3-band equalizer in terms of a Bark scale.

5.6. Write a program to plot the absolute threshold of hearing in quiet, Eq. (5.1). Give a plot in terms of a linear Hz scale.

5.7. Use the program of Problem 5.6 and plot the absolute threshold of hearing in a Bark scale.

5.8. Generate four sinusoids with frequencies 400 Hz, 1000 Hz, 2500 Hz, and 6000 Hz; f_s = 44.1 kHz. Obtain s(n) by adding these individual sinusoids as follows,

s(n) = \sum_{i=1}^{4} \sin\left( \frac{2\pi f_i n}{f_s} \right), \quad n = 1, 2, \ldots, 1024


Figure 5.14. Perceptual bit allocation, Problem 5.4.

Give power spectrum plots of s(n) (in dB) in terms of a Bark scale and in terms of a linear Hz scale. List the Bark-band numbers where the four peaks are located. (Hint: see Table 5.1 for the Bark band numbers.)

5.9. Extend the above problem and give the power spectrum plot in dB SPL. See Section 5.7.1 for details. Also include the absolute threshold of hearing in quiet in your plot.

5.10. Write a program to compute the perceptual entropy (in bits/sample) of the following signals:
a. ch5_malespeech.wav (8 kHz, 16 bit)
b. ch5_music.wav (44.1 kHz, 16 bit)

(Hint: Use Eqs. (5.10)–(5.17) in Chapter 5, Section 5.6.) Choose the frame size as 512 samples. Also, the perceptual entropy (PE) measurement is obtained by constructing a PE histogram over many frames and then choosing a worst-case value as the actual measurement.

5.11. FFT-based perceptual audio synthesis using the MPEG-1 psychoacoustic model 1.

In this computer exercise, we will consider an example to show how the psychoacoustic principles are applied in actual audio coding algorithms. Recall that in Chapter 2, Computer Exercise 2.25, we employed the peak-picking method to select a subset of FFT components for audio synthesis. In this exercise, we will use the just-noticeable-distortion (JND) curve as the "reference" to select the perceptually important FFT components. All the FFT components below the JND curve are assigned a minimal value (for example, −50 dB SPL) such that these perceptually irrelevant FFT components receive a minimum number of bits for encoding. The ISO/IEC MPEG-1 psychoacoustic model 1 (see Section 5.7, Steps 1 through 5) is used to compute the JND curve.

Use the MATLAB software from the Book website to simulate the ISO/IEC MPEG-1 psychoacoustic model 1. The software package consists of three MATLAB files, psymain.m, psychoacoustics.m, and audio_synthesis.m, a wave file ch5_music.wav, and a hints.doc file. The psymain.m file is the main file that contains the complete flow of your computer exercise. The psychoacoustics.m file contains the steps performed in the psychoacoustic analysis.

Deliverable 1:
a. Using the comments included in the hints.doc file, fill in the MATLAB commands in the audio_synthesis.m file to complete the program.
b. Give plots of the input and the synthesized audio. Is the psychoacoustic criterion for picking FFT components equivalent to using a Parseval's-related criterion? In the audio synthesis, what happens if the FFT conjugate symmetry was not maintained?
c. How is this FFT-based perceptual audio synthesis method different from a typical peak-picking method for audio synthesis?

Deliverable 2:
d. Provide a subjective evaluation of the synthesized audio in terms of a MOS scale. Did you hear any clicks between the frames in the synthesized audio? If yes, what would you do to eliminate such artifacts? Compute the overall and segmental SNR values between the input and the synthesized audio.
e. On the average, how many FFT components per frame were selected? If only 30 FFT components out of 512 were to be picked (because of bit-rate considerations), would the application of the psychoacoustic model to select the FFT components yield the best possible SNR?

5.12. In this computer exercise, we will analyze the asymmetry of simultaneous masking. Use the MATLAB software from the Book website. The software package consists of two MATLAB files, asymm_mask.m and psychoacoustics.m. The asymm_mask.m file includes steps to generate a pure tone s_1(n) with f = 4 kHz and f_s = 44.1 kHz,

s_1(n) = \sin\left( \frac{2\pi f n}{f_s} \right), \quad n = 0, 1, 2, \ldots, 44099

a. Simulate a broadband noise s_2(n) by band-pass filtering uniform white noise (μ = 0 and σ² = 1) using a Butterworth filter (of appropriate order, e.g., 8) with center frequency 4 kHz. Assume that the 3-dB cut-off frequencies of the bandpass filter are 3500 Hz and 4500 Hz. Generate a test signal s(n) = αs_1(n) + βs_2(n). Choose α = 0.025 and β = 1. Observe whether the broadband noise completely masks the tone. Experiment with different values of α and β, and find out when 1) the broadband noise masks the tone, and 2) the tone masks the broadband noise.

b. Simulate the two cases of masking (i.e., the NMT and the TMN) given in Figure 5.7, Section 5.4.

CHAPTER 6

TIME-FREQUENCY ANALYSIS: FILTER BANKS AND TRANSFORMS

6.1 INTRODUCTION

Audio codecs typically use a time-frequency analysis block to extract a set of parameters that is amenable to quantization. The tool most commonly employed for this mapping is a filter bank of bandpass filters. The filter bank divides the signal spectrum into frequency subbands and generates a time-indexed series of coefficients representing the frequency-localized signal power within each band. By providing explicit information about the distribution of signal and hence masking power over the time-frequency plane, the filter bank plays an essential role in the identification of perceptual irrelevancies. Additionally, the time-frequency parameters generated by the filter bank provide a signal mapping that is conveniently manipulated to shape the coding distortion. On the other hand, by decomposing the signal into its constituent frequency components, the filter bank also assists in the reduction of statistical redundancies.

This chapter provides a perspective on filter-bank design and other techniques of particular importance in audio coding. The chapter is organized as follows. Sections 6.2 and 6.3 introduce filter-bank design issues for audio coding. Sections 6.4 through 6.7 review specific filter-bank methodologies found in audio codecs, namely, the two-band quadrature mirror filter (QMF), the M-band tree-structured QMF, the M-band pseudo-QMF bank, and the M-band Modified Discrete Cosine Transform (MDCT). The "MP3" or MPEG-1 Layer III audio codec pseudo-QMF and MDCT are discussed in Sections 6.6 and 6.7, respectively. Section 6.8 provides filter-bank interpretations of the discrete Fourier and discrete cosine transforms. Finally, Sections 6.9 and 6.10 examine the time-domain "pre-echo" artifact in conjunction with pre-echo control techniques.

Beyond the references cited in this chapter, the reader is referred to in-depth tutorials on filter banks that have appeared in the literature [Croc83] [Vaid87] [Vaid90] [Malv91] [Vaid93] [Akan96]. The reader may also wish to explore the connection between filter banks and wavelets that has been well documented in [Riou91] [Vett92] and in several texts [Akan92] [Wick94] [Akan96] [Stra96].

6.2 ANALYSIS-SYNTHESIS FRAMEWORK FOR M-BAND FILTER BANKS

Filter banks are perhaps most conveniently described in terms of an analysis-synthesis framework (Figure 6.1), in which the input signal s(n) is processed at the encoder by a parallel bank of (L − 1)-th order FIR bandpass filters, H_k(z). The bandpass analysis outputs,

v_k(n) = h_k(n) * s(n) = \sum_{m=0}^{L-1} s(n - m) h_k(m), \quad k = 0, 1, \ldots, M - 1    (6.1)

are decimated by a factor of M yielding the subband sequences

y_k(n) = v_k(Mn) = \sum_{m=0}^{L-1} s(nM - m) h_k(m), \quad k = 0, 1, \ldots, M - 1    (6.2)

which comprise a critically sampled or maximally decimated signal representation, i.e., the number of subband samples is equal to the number of input samples. Because it is impossible to achieve perfect "brickwall" magnitude responses with finite-order bandpass filters, there is unavoidable aliasing between the decimated subband sequences. Quantization and coding are performed on the subband sequences y_k(n). In the perceptual audio codec, the quantization noise is usually shaped according to a perceptual model. The quantized subband samples, \hat{y}_k(n), are eventually received by the decoder, where they are upsampled by M to form the intermediate sequences

w_k(n) = \begin{cases} \hat{y}_k(n/M), & n = 0, M, 2M, 3M, \ldots \\ 0, & \text{otherwise} \end{cases}    (6.3)

In order to eliminate the imaging distortions introduced by the upsampling operations, the sequences w_k(n) are processed by a parallel bank of synthesis filters, G_k(z), and then the filter outputs are combined to form the overall output, \hat{s}(n).


The analysis and synthesis filters are carefully designed to cancel aliasing and imaging distortions. It can be shown [Akan96] that

\hat{s}(n) = \frac{1}{M} \sum_{m=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} \sum_{k=0}^{M-1} s(m) h_k(lM - m) g_k(n - lM)    (6.4)

or in the frequency domain

\hat{S}(\Omega) = \frac{1}{M} \sum_{k=0}^{M-1} \sum_{l=0}^{M-1} S\left( \Omega + \frac{2\pi l}{M} \right) H_k\left( \Omega + \frac{2\pi l}{M} \right) G_k(\Omega)    (6.5)

For perfect reconstruction (PR) filter banks, the output, \hat{s}(n), will be identical to the input, s(n), within a delay, i.e., \hat{s}(n) = s(n - n_0), as long as there is no quantization noise introduced, i.e., \hat{y}_k(n) = y_k(n). This is naturally not the case for a codec, and therefore quantization sensitivity is an important filter bank property, since PR guarantees are lost in the presence of quantization.
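The following MATLAB sketch walks a signal through the maximally decimated chain of Eqs. (6.1)–(6.3) for one frame; the filters are arbitrary random FIR responses chosen only to exercise the structure, so the output is not a perfect reconstruction of the input.

% Sketch of the M-band analysis-synthesis chain of Figure 6.1 (Eqs. (6.1)-(6.3)).
M = 4; L = 16;
h = rand(M, L) - 0.5;                        % stand-in analysis filters h_k(n)
g = fliplr(h);                               % synthesis filters g_k(n) = h_k(L-1-n)
s = randn(1, 1024);                          % input signal
shat = zeros(1, 1024);
for k = 1:M
    v = filter(h(k,:), 1, s);                % Eq. (6.1): bandpass analysis output
    y = v(1:M:end);                          % Eq. (6.2): decimation by M
    w = zeros(1, M*length(y));               % Eq. (6.3): upsampling by M
    w(1:M:end) = y;
    r = filter(g(k,:), 1, w);                % synthesis filtering
    shat = shat + r;                         % combine the channel outputs
end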

Figures 6.2 and 6.3, respectively, give example magnitude responses for banks of uniform and nonuniform bandwidth filters that can be realized within the framework of Figure 6.1. A uniform bandwidth M-channel filter bank is shown in Figure 6.2. The M analysis filters have normalized center frequencies (2k + 1)π/2M and are characterized by individual impulse responses h_k(n), as well as frequency responses H_k(\Omega), for 0 ≤ k ≤ M − 1.

Some of the popular audio codecs contain parallel bandpass filters of uniform bandwidth, similar to Figure 6.2. Other coders strive for a "critical band" analysis by relying upon filters of nonuniform bandwidth. The octave-band filter bank,


Figure 6.1. Uniform M-band maximally decimated analysis-synthesis filter bank.


Figure 6.2. Magnitude frequency response for a uniform M-band filter bank (oddly stacked).


Figure 6.3. Magnitude frequency response for an octave-band filter bank.

for which the four-band case is illustrated in Figure 6.3, is sometimes used as an approximation to the auditory filter bank, albeit a poor one.

As shown in Figure 6.3, octave-band analysis filters have normalized center frequencies and filter bandwidths that are dyadically related. Naturally, much better approximations are possible.

6.3 FILTER BANKS FOR AUDIO CODING: DESIGN CONSIDERATIONS

This section addresses the issues that govern the selection of a filter bank for audio coding. Efficient coding performance depends heavily on adequately matching the properties of the analysis filter bank to the characteristics of the input signal. Algorithm designers face an important and difficult tradeoff between time and frequency resolution when selecting a filter-bank structure [Bran92a]. Failure to choose a suitable filter bank can result in perceptible artifacts in the output (e.g., pre-echoes) or low coding gain and therefore high bit rates. No single tradeoff between time and frequency resolution is optimal for all signals. We will present three examples to illustrate the challenge facing codec designers. In the first example, we consider the importance of matching time-frequency analysis resolution to the signal-dependent distribution of masking power in the time-frequency plane. The second example illustrates the effect of inadequate frequency resolution on perceptual bit allocation. Finally, the third example illustrates the effect of inadequate time resolution on perceptual bit allocation. These examples clarify the fundamental tradeoff required during filter-bank selection for perceptual coding.


6.3.1 The Role of Time-Frequency Resolution in Masking Power Estimation

Through schematic representations of masking thresholds for castanets and piccolo, Figure 6.4(a,b) illustrates the difficulty of selecting a single filter bank to satisfy the diverse time and frequency resolution requirements associated with different classes of audio. In the figures, darker regions correspond to higher masking thresholds. To realize maximum coding gain, the strongly harmonic piccolo signal clearly calls for fine frequency resolution and coarse time resolution, because the masking thresholds are quite localized in frequency. Quite the opposite is true of the castanets. The fast attacks associated with this percussive sound create highly time-localized masking thresholds that are also widely dispersed in frequency. Therefore, adequate time resolution is essential for accurate estimation of the highly time-varying masked threshold. Naturally, similar resolution properties should also be associated with the filter bank used to decompose the signal into a parametric set for quantization and encoding. Using real signals and filter banks, the next two examples illustrate the bit rate impact of adequate and inadequate resolutions in each domain.

6.3.2 The Role of Frequency Resolution in Perceptual Bit Allocation

To demonstrate the importance of matching a filter bank's resolution properties with the noise-shaping requirements imposed by a perceptual model, the next two examples combine high- and low-resolution filter banks with two input extremes, namely those of a sinusoid and an impulse. First, we consider the importance of adequate frequency resolution. The need for high-resolution frequency analysis is most pronounced when the input contains strong sinusoidal components. Given a tonal input, inadequate frequency resolution can produce unreasonably high signal-to-noise ratio (SNR) requirements within individual subbands, resulting in high bit rates. To see this, we compare the results of processing a 2.7-kHz pure tone first with a 32-channel and then with a 1024-channel MDCT filter bank, as shown in Figure 6.5(a) and Figure 6.5(b), respectively. The vertical line in each figure represents the frequency and level (80 dB SPL) of the input tone. In the figures, the masked threshold associated with the sinusoid is represented by a solid, nearly triangular line. For each filter bank in Figure 6.5(a) and Figure 6.5(b), the band containing most of the signal energy is quantized with sufficient resolution to create an in-band SNR of 15.4 dB. Then, the quantization noise is superimposed on the masked threshold. In the 32-band case (Figure 6.5a), it can be seen that the quantization noise at an SNR of 15.4 dB spreads considerably beyond the masked threshold, implying that significant artifacts will be audible in the reconstructed signal. On the other hand, the improved selectivity of the 1024-channel filter bank restricts the spread of quantization at 15.4 dB SNR to well within the limits of the masked threshold (Figure 6.5b). The figure clearly shows that, for a tonal signal, good frequency selectivity is essential for low bit rates. In fact, the 32-channel filter bank for this signal requires greater than a 60 dB SNR to satisfy the masked threshold (Figure 6.5c). This high cost (in terms of bits required) results from the mismatch between the filter bank's poor selectivity and the very limited downward spread of masking in the human ear. As this experiment would imply, high-resolution frequency analysis is usually appropriate for tonal audio signals.


Figure 6.4. Masking thresholds in the time-frequency plane: (a) castanets, (b) piccolo (after [Prin95]).


6.3.3 The Role of Time Resolution in Perceptual Bit Allocation

A time-domain dual of the effect observed in the previous example can be used to illustrate the importance of adequate time resolution.


Figure 6.5. The effect of frequency resolution on perceptual SNR requirements for a 2.7-kHz pure tone input. Input tone is represented by the spectral line at 2.7 kHz with 80 dB SPL. Conservative masked threshold due to presence of tone is shown. Quantization noise for a given number of bits per sample is superimposed for several precisions. (a) 32-channel MDCT with 15.4 dB in-band SNR; quantization noise spreads beyond masked threshold. (b) 1024-channel MDCT with 15.4 dB in-band SNR; quantization noise remains below masked threshold. (c) 32-channel MDCT requires 63.7 dB in-band SNR to satisfy masked threshold, i.e., requires larger bit allocation to mask quantization noise.


Whereas the previous experiment showed that simultaneous masking criteria determine how much frequency resolution is necessary for good performance, one can also imagine that temporal masking effects dictate time resolution requirements. In fact, the need for good time resolution is pronounced when the input contains sharp transients. This is best illustrated with an impulse. Unlike the sinusoid, with its highly frequency-localized masking power, the broadband impulse contains broadband masking power that is highly time-localized. Given a transient input, therefore, lacking time resolution can result in a temporal smearing of quantization noise beyond the time window of effective masking. To illustrate this point, consider the results obtained by processing an impulse with the same filter banks as in the previous example. The results are shown in Figure 6.6(a) and Figure 6.6(b), respectively. The vertical line in each figure corresponds to the input impulse. The figures also show the temporal envelope of masking power associated with the impulse as a solid, nearly triangular window. For each filter bank, all subbands were quantized with identical bit allocations. Then, the error signal at the output (quantization noise) was superimposed on the masking envelope. In the 32-band case (Figure 6.6a), one can observe that the time resolution (impulse response length) restricts the spread of quantization noise to well within the limits of the masked threshold. For the 1024-band filter bank (Figure 6.6b), on the other hand, quantization noise spreads considerably beyond the time envelope masked threshold, implying that significant artifacts will be audible in the reconstructed signal. The only remedy in this case would be to "overcode" the transient so that the signal surrounding the transient receives precision adequate to satisfy the limited masking power in the regions before and after the impulse. The figures clearly show that, for a transient signal, good time resolution is essential for low bit rates. The combination of long impulse responses and limited masking windows in the presence of transient signals can lead to an artifact known as "pre-echo" distortion. Pre-echo distortion and pre-echo compensation are covered at the end of this chapter, in Sections 6.9 and 6.10.

Unfortunately, most audio source material is highly nonstationary and contains significant tonal and atonal energy, as well as both steady-state and transient intervals. As a rule, signal models [John96a] tend to remain constant for long periods and then change abruptly. Therefore, the ideal coder should make adaptive decisions regarding optimal time-frequency signal decomposition, and the ideal analysis filter bank would have time-varying resolutions in both the time and frequency domains. This fact has motivated many algorithm designers to experiment with switched and hybrid filter-bank structures, with switching decisions occurring on the basis of the changing signal properties. Filter banks emulating the analysis properties of the human auditory system, i.e., those containing nonuniform "critical bandwidth" subbands, have proven highly effective in the coding of highly transient signals such as the castanets, glockenspiel, or triangle. For dense, harmonically structured signals such as the harpsichord or pitch pipe, on the other hand, the "critical band" filter banks have been less successful because of their reduced coding gain relative to filter banks with a large number of subbands. In short, several bank characteristics are highly desirable for audio coding:

• Signal-adaptive time-frequency tiling
• Low-resolution, "critical-band" mode (e.g., 32 subbands)
• High-resolution mode, e.g., 4096 subbands
• Efficient resolution switching
• Minimum blocking artifacts
• Good channel separation
• Strong stop-band attenuation
• Perfect reconstruction
• Critical sampling
• Fast implementation


Figure 6.6. The effect of time resolution on perceptual bit allocation for an impulse. Impulse input occurs at time 0. Conservative temporal masked threshold due to the presence of the impulse is shown. Quantization noise for a fixed number of bits per sample is superimposed for low- and high-resolution filter banks: (a) 32-channel MDCT, (b) 1024-channel MDCT.




Good channel separation and stop-band attenuation are particularly desirable for signals containing very little irrelevancy, such as the harpsichord. Maximum redundancy removal is essential for maintaining high quality at low bit rates for these signals. Blocking artifacts in time-varying filter banks can lead to audible distortion in the reconstruction. Beyond filter-bank-specific architectural and performance criteria, system-level considerations may also influence the best choice of filter bank for a codec design. For example, the codec architecture could contain two separate, parallel time-frequency analysis blocks, one for the perceptual model and one for generating the parametric set that is ultimately quantized and encoded. The parallel scenario offers the advantage that each filter bank can be optimized independently. This is possible since the perceptual analysis section does not typically require signal reconstruction, whereas the coefficients for coding must eventually be mapped back to the time-domain. In the interest of computational efficiency, however, many audio codecs have only one time-frequency analysis block that does "double duty," in the sense that the perceptual model obtains information from the same set of coefficients that are ultimately quantized and encoded.

Algorithms for filter-bank design, as well as fast algorithms for efficient filter-bank realizations, offer many choices to designers of perceptual audio codecs. Among the many types available are those characterized by the following:

• Uniform or nonuniform frequency partitioning
• An arbitrary number of subbands
• Perfect or almost perfect reconstruction
• Critically sampled or oversampled representations
• FIR or IIR constituent filters

In the next few sections, we will focus on the design and performance of well-known filter banks that are popular in audio coding. Rather than dealing with efficient implementation structures that are available, we have elected, for each filter-bank architecture, to describe the individual bandpass filters in terms of impulse and frequency response functions that are easily related to the analysis-synthesis framework of Figure 6.1. These descriptions are intended to provide insight regarding the filter-bank response characteristics and to allow for comparisons across different methods. The reader should be aware, however, that structures for efficient realizations are almost always used in practice, and because computational efficiency is of paramount importance, most audio coding filter-bank realizations, although functionally equivalent, may or may not resemble the maximally decimated analysis-synthesis structure given in Figure 6.1. In other words, most of the filter banks used in audio coders have equivalent parallel forms and can be conveniently analyzed in terms of this analysis-synthesis framework. The framework provides a useful interpretation for the sets of coefficients generated by the unitary transforms often embedded in audio coders, such as the discrete cosine transform (DCT), the discrete Fourier transform (DFT), the discrete wavelet transform (DWT), and the discrete wavelet packet transform (DWPT).

6.4 QUADRATURE MIRROR AND CONJUGATE QUADRATURE FILTERS

The two-band quadrature mirror and conjugate quadrature filter (QMF and CQF) banks are logical starting points for the discussion on filter banks for audio coding. Two-band QMF banks were used in early subband algorithms for speech coding [Croc76], and later for the first standardized 7-kHz wideband audio algorithm, the ITU G.722 [G722]. Also, the strong connection between two-band perfect reconstruction (PR) CQF filter banks and the discrete wavelet transform [Akan96] has played a significant role in the development of high-performance audio coding filter banks. Ultimately, tree-structured cascades of the CQF filters have been used to construct several "critical-band" filter banks in a number of high quality algorithms. The two-channel bank, which can provide a building block for structured M-channel banks, is developed as follows. If the analysis-synthesis filter bank (Figure 6.1) is constrained to two channels, i.e., if M = 2, then Eq. (6.5) becomes

\hat{S}(\Omega) = \frac{1}{2} S(\Omega) \left[ H_0^2(\Omega) - H_1^2(\Omega) \right]    (6.6)

Esteban and Galand showed [Este77] that aliasing is cancelled between the upper and lower bands if the QMF conditions are satisfied, namely,

H_1(\Omega) = H_0(\Omega + \pi) \Rightarrow h_1(n) = (-1)^n h_0(n)
G_0(\Omega) = H_0(\Omega) \Rightarrow g_0(n) = h_0(n)    (6.7)
G_1(\Omega) = -H_0(\Omega + \pi) \Rightarrow g_1(n) = -(-1)^n h_0(n)

Thus, the two-band filter-bank design task is reduced to the design of a single lowpass filter, h_0(n), under the constraint that the overall transfer function, Eq. (6.6), be an allpass function with constant group delay (linear phase). Although filter families satisfying the QMF criteria with good stop-band and transition-band characteristics have been designed (e.g., [John80]) to minimize overall distortion, the QMF conditions actually make perfect reconstruction impossible. Smith and Barnwell showed in [Smit86], however, that PR two-band filter banks based on a lowpass prototype are possible if the CQF conditions are satisfied, namely,

h_1(n) = (-1)^n h_0(L - 1 - n)
g_0(n) = h_0(L - 1 - n)    (6.8)
g_1(n) = -(-1)^n h_0(n)


Figure 6.7. Two-band Smith-Barnwell CQF filter bank magnitude frequency response, with L = 8 [Smit86].

The magnitude response of an example Smith-Barnwell CQF filter bank from [Smit86], with L = 8, is shown in Figure 6.7. The lowpass response is shown as a solid line, while the highpass is dashed. One can observe the significant overlap between channels, as well as monotonic passband and equiripple stop-band characteristics, with minimum stop-band rejection of 40 dB. As mentioned previously, efficiency concerns dictate that filter banks are rarely implemented in the direct form of Eq. (6.1). The QMF banks are most often realized using a polyphase factorization [Bell76], i.e.,

H(z) = \sum_{l=0}^{M-1} z^{-l} E_l(z^M)    (6.9)

where

E_l(z) = \sum_{n=-\infty}^{\infty} h(Mn + l) z^{-n}    (6.10)

which yields better than a 2:1 computational load reduction. On the other hand, the CQF filters are incompatible with the polyphase factorization, but can be efficiently realized using alternative structures such as the lattice [Vaid88].
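The CQF relations of Eq. (6.8) are easy to state in code: all four filters are derived from the single lowpass prototype h0(n). In the MATLAB sketch below, h0 is a simple windowed-sinc stand-in rather than the optimized Smith-Barnwell design, so it illustrates the index relationships only, not a true PR pair.

% Sketch of Eq. (6.8): deriving h1, g0, g1 from a lowpass prototype h0 (L even).
L  = 8;  n = 0:L-1;  c = (L-1)/2;
xc = 0.5*(n - c);                            % argument for a half-band windowed sinc
h0 = 0.5*sin(pi*xc)./(pi*xc) .* (0.54 - 0.46*cos(2*pi*n/(L-1)));  % stand-in prototype
h1 = ((-1).^n) .* h0(L - n);                 % h1(n) = (-1)^n h0(L-1-n)
g0 = h0(L - n);                              % g0(n) = h0(L-1-n)
g1 = -((-1).^n) .* h0;                       % g1(n) = -(-1)^n h0(n)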

6.5 TREE-STRUCTURED QMF AND CQF M-BAND BANKS

Clearly, audio coders require better frequency resolution than either the QMF or CQF two-band decompositions can provide in order to realize sufficient coding gain for spectrally complex signals.


Figure 6.8. Tree-structured realization of a uniform eight-channel analysis filter bank.

Tree-structured cascades are one straightforward method for creating M-band filter banks from the two-band QMF and CQF prototypes. These are constructed as follows. The two-band filters are connected in a cascade that can be represented well using either a binary tree or a pruned binary tree. The root node of the tree is formed by a single two-band QMF or CQF section. Then, each of the root node outputs is connected to a cascaded QMF or CQF bank. The cascade structure may be continued to the depth necessary to achieve the desired magnitude response characteristics. At each node in the tree, a two-channel QMF or CQF bank operates on the output from a higher-level two-channel bank. Thus, frequency subdivision occurs through a series of two-band splits. Tree-structured filter banks have several advantages. First of all, the designer can approximate an arbitrary partitioning of the frequency axis by creating an appropriate cascade. Consider, for example, the uniform subband tree (Figure 6.8) or the octave-band tree (Figure 6.9). The ability to partition the frequency axis in a nonuniform manner also has implications for multi-resolution temporal analysis or nonuniform tiling of the time-frequency plane. This property can be advantageous if the ultimate objective is to approximate the analysis properties of the human ear, and, in fact, many algorithms make use of tree-structured filter banks for this very reason [Bran90] [Sinh93b] [Tsut98]. In addition, the designer has the flexibility to optimize the length and other properties of constituent filters at each node in the tree. This flexibility has been exploited to enhance the performance of several experimental audio codecs [Sinh93b] [Phil95a] [Phil95b]. Tree-structured filter banks are also attractive for their computational efficiency relative to other M-band techniques. One disadvantage of the tree-structured filter bank is that delays accumulate through the cascaded nodes and, hence, the overall delay can become quite large.

The example tree in Figure 6.8 shows an eight-band uniform analysis filter bank in which the analysis filters are indexed first by level and then by type. Lowpass filters are indexed with a 0 and highpass with a 1. For instance, highpass filters at level 2 in the tree are denoted by H_{21}(z) and lowpass by H_{20}(z). It is often convenient to analyze the M-band tree-structured CQF or QMF bank using an equivalent parallel form.


Figure 6.9. Tree-structured realization of an octave-band four-channel analysis filter bank.


Figure 6.10. The noble identities. In each picture, the structure on the left is equivalent to the structure on the right. (a) Interchange of a filter and a downsampler: the positions are swapped after the complex variable z is replaced by z^M in the system function H(z). (b) Interchange of a filter and an upsampler: the positions are swapped after the complex variable z is replaced by z^M in the system function H(z).


Figure 6.11. Eight-channel cascaded CQF (CCQF) filter bank: (a) Magnitude frequency responses for all eight channels. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. This view highlights the presence of a significant sidelobe in the stop-band. In this figure, N is the length of the impulse response.

To see the connection between the cascaded tree structures (Figures 6.8 and 6.9) and the parallel analysis-synthesis structure (Figure 6.1), one can apply the "noble identities" (Figure 6.10), which allow for the interchange of the down-sampling and filtering operations. In a straightforward manner, this practice collapses the cascaded filter transfer functions into single parallel-form analysis filters for each channel. In the case of the eight-channel bank (Figure 6.8), for example, we have

H_0(z) = H_{00}(z) H_{10}(z^2) H_{20}(z^4)
H_1(z) = H_{00}(z) H_{10}(z^2) H_{21}(z^4)
\vdots    (6.11)
H_7(z) = H_{01}(z) H_{11}(z^2) H_{21}(z^4)

Figure 6.11(a) shows the magnitude spectrum of an eight-channel tree-structured filter bank based on a three-level cascade (Figure 6.8) of the same Smith-Barnwell CQF filter examined previously (Figure 6.7). Even-numbered channel responses are drawn with solid lines, and odd-numbered channel responses are drawn with dashed lines. One can observe the effect that the cascaded structure has on the shape of the channel responses. Bands 0 through 3 are each uniquely shaped and are mirror images of bands 4 through 7. Moreover, the M-band stop-band characteristics are significantly different than the prototype filter, i.e., the equiripple property does not extend to the M channels. Figure 6.11(b) shows |H_2(\Omega)|^2, making it possible to observe clearly a sidelobe of significant magnitude in the stop-band. The figure also illustrates the impulse response h_2(n) associated with the filter H_2(z). One can see that the effective length of the parallel-form impulse response represents the cumulative contributions from each of the cascaded filters.
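The collapse of the tree into parallel-form filters, Eq. (6.11), amounts to upsampling each stage's impulse response (the noble identities) and convolving along each path. The MATLAB sketch below uses the trivial Haar pair at every node purely as a stand-in for the CQF filters of the example.

% Sketch of Eq. (6.11): parallel-form filters of a three-level tree via the noble identities.
hlp = [1 1]/sqrt(2);  hhp = [1 -1]/sqrt(2);           % Haar pair standing in at every node
up  = @(h, M) reshape([h(:)'; zeros(M-1, numel(h))], 1, []);   % insert M-1 zeros per tap
H0  = conv(conv(hlp, up(hlp, 2)), up(hlp, 4));        % H0(z) = H00(z) H10(z^2) H20(z^4)
H7  = conv(conv(hhp, up(hhp, 2)), up(hhp, 4));        % H7(z) = H01(z) H11(z^2) H21(z^4)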

6.6 COSINE MODULATED "PSEUDO QMF" M-BAND BANKS

A tree-structured cascade of two-channel prototypes is only one of several well-known methods available for realization of an M-band filter bank. Although the tree structures offer opportunities for optimization at each node and are conceptually simple, the potential for long delay and irregular channel responses is sometimes unappealing. As an alternative to the tree-structured architecture, cosine modulation of a lowpass prototype filter has been used since the early 1980s [Nuss81] [Roth83] [Chu85] [Mass85] [Cox86] to realize parallel M-channel filter banks with nearly perfect reconstruction. Because they do not achieve perfect reconstruction, these filter banks are known collectively as "pseudo QMF," and they are characterized by several attractive properties:

• Constrained design; single FIR prototype filter
• Uniform, linear phase channel responses
• Overall linear phase, hence constant group delay
• Low complexity, i.e., one filter plus modulation
• Amenable to fast block algorithms
• Critical sampling


In the pseudo QMF (PQMF) bank derivation, phase distortion is completely eliminated from the overall transfer function, Eq. (6.5), by forcing the analysis and synthesis filters to satisfy the mirror image condition

g_k(n) = h_k(L - 1 - n)    (6.12)

Moreover, adjacent channel aliasing is cancelled by establishing precise relationships between the analysis and synthesis filters, H_k(z) and G_k(z), respectively. In the critically sampled analysis-synthesis notation of Figure 6.1, these conditions ultimately yield analysis filters given by

h_k(n) = 2 w(n) \cos\left[ \frac{\pi}{M}(k + 0.5)\left( n - \frac{L - 1}{2} \right) + \phi_k \right]    (6.13)

and synthesis filters given by

g_k(n) = 2 w(n) \cos\left[ \frac{\pi}{M}(k + 0.5)\left( n - \frac{L - 1}{2} \right) - \phi_k \right]    (6.14)

where

\phi_k = (-1)^k \frac{\pi}{4}    (6.15)

and the sequence w(n) corresponds to the L-sample "window," a real-coefficient, linear phase FIR prototype lowpass filter with normalized cutoff frequency π/2M. Given that aliasing and phase distortions have been eliminated in this formulation, the filter-bank design procedure is reduced to the design of the window, w(n), such that overall amplitude distortion (Eq. (6.5)) is minimized. One approach [Vaid93] is to minimize a composite objective function, i.e.,

C = \alpha c_1 + (1 - \alpha) c_2    (6.16)

where constraint c_1, of the form

c_1 = \int_{0}^{\pi/M} \left[ |W(\Omega)|^2 + \left| W\left( \Omega - \frac{\pi}{M} \right) \right|^2 - 1 \right]^2 d\Omega    (6.17)

minimizes spectral nonflatness in the reconstruction, and constraint c_2, of the form

c_2 = \int_{\frac{\pi}{2M} + \varepsilon}^{\pi} |W(\Omega)|^2 \, d\Omega    (6.18)

maximizes stop-band attenuation. The parameter ε is related to transition bandwidth, and the parameter α determines which design constraint is more dominant.
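Once a prototype window has been chosen, generating the PQMF analysis and synthesis filters of Eqs. (6.13)–(6.15) is a direct modulation, as the MATLAB sketch below shows; the raised-cosine window used here is only a stand-in, not a prototype optimized with the objective function of Eq. (6.16).

% Sketch of the PQMF modulation of Eqs. (6.13)-(6.15).
M = 8;  L = 4*M;                             % illustrative sizes
n = 0:L-1;
w = 0.5 - 0.5*cos(2*pi*(n + 0.5)/L);         % stand-in lowpass prototype w(n)
h = zeros(M, L);  g = zeros(M, L);
for k = 0:M-1
    phi = ((-1)^k) * pi/4;                   % Eq. (6.15)
    arg = (pi/M) * (k + 0.5) * (n - (L-1)/2);
    h(k+1, :) = 2 * w .* cos(arg + phi);     % Eq. (6.13): analysis filter h_k(n)
    g(k+1, :) = 2 * w .* cos(arg - phi);     % Eq. (6.14): synthesis filter g_k(n)
end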

The magnitude frequency response of an example eight-channel pseudo QMF bank designed using Eqs. (6.6) and (6.7) is shown in Figure 6.12. In contrast to the previous CCQF example, one can observe that all of the channel magnitude responses are identical, modulated versions of the lowpass prototype, and therefore the passband and stop-band characteristics are uniform.


[Plot data omitted. Panel (a): "Eight Channel PQMF Bank, N = 39"; Magnitude (dB) vs. Normalized Frequency (×π rad). Panel (b): channel-3 magnitude response and impulse response (Amplitude vs. Sample Number).]

Figure 6.12. Eight-channel PQMF bank: (a) Magnitude frequency responses for all eight channels. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Here, N is the impulse response length.

The impulse response symmetry associated with a linear phase filter is also evident upon examination of Figure 6.12(b).

The PQMF bank plays a significant role in several popular audio coding algorithms.


In particular, the IS11172-3 and IS13818-3 algorithms ("MPEG-1" [ISOI92] and "MPEG-2 BC/LSF" [ISOI94a]) employ a 32-channel PQMF bank for spectral decomposition in both layers I and II. The prototype filter, w(n), contains 512 samples, yielding better than 96-dB sidelobe suppression in the stop-band of each analysis channel. Output ripple (non-PR) is less than 0.07 dB. In addition, the same PQMF is used in conjunction with a PR cosine modulated filter bank (discussed in the next section) in layer III to form a hybrid filter-bank architecture with time-varying properties. The MPEG-1 algorithm has reached a position of prominence with the widespread use of "MP3" files (MPEG-1, layer 3) on the World Wide Web (WWW) for the exchange of audio recordings, as well as with the deployment of MPEG-1, layer II in direct broadcast satellite (DBS/DSS) and European digital audio broadcast (DAB) initiatives. Because of the availability of common algorithms for pseudo QMF and PR QMF banks, we defer the discussion on generic complexity and efficient implementation strategies until later. In the particular case of MPEG-1, however, note that the 32-band pseudo QMF analysis bank as defined in the standard requires approximately 80 real multiplies and 80 real additions per output sample [ISOI92], although a more efficient implementation based on a fast algorithm for the DCT was also proposed [Pan93] [Kons94].

6.7 COSINE MODULATED PERFECT RECONSTRUCTION (PR) M-BAND BANKS AND THE MODIFIED DISCRETE COSINE TRANSFORM (MDCT)

Although PQMF banks have been used quite successfully in perceptual audio coders, the overall system design still must compensate for the inherent distortion induced by the lack of perfect reconstruction to avoid audible artifacts in the codec output. The compensation strategy may be a simple one (e.g., increased prototype filter length), but perfect reconstruction is actually preferable because it constrains the sources of output distortion to the quantization stage. Beginning in the early 1990s, independent work by Malvar [Malv90b], Ramstad [Rams91], and Koilpillai and Vaidyanathan [Koil91] [Koil92] showed that generalized perfect reconstruction (PR) cosine modulated filter banks are possible by appropriately constraining the prototype lowpass filter, w(n), and the synthesis filters, g_k(n), for 0 ≤ k ≤ M − 1. In particular, perfect reconstruction is guaranteed for a cosine-modulated filter bank with analysis filters h_k(n) given by Eqs. (6.13) and (6.15) if four conditions are satisfied. First, the length, L, of the window, w(n), must be an integer multiple of the number of subbands, i.e.,

1. L = 2mM,    (6.19)

where the parameter m is an integer greater than zero. Next, the synthesis filters, g_k(n), must be related to the analysis filters by a time reversal, such that

2. g_k(n) = h_k(L - 1 - n).    (6.20)


In addition, the FIR lowpass prototype must have linear phase, which means that

3. w(n) = w(L - 1 - n),    (6.21)

and, finally, the polyphase components of w(n) must satisfy the pairwise power complementary requirement, i.e.,

4. \tilde{E}_k(z) E_k(z) + \tilde{E}_{M+k}(z) E_{M+k}(z) = \alpha,    (6.22)

where the constant α is greater than 0, the functions E_k(z) are the k = 0, 1, 2, ..., M − 1 polyphase components (Eq. (6.9)) of W(z), and the tilde notation denotes the paraconjugate, i.e.,

\tilde{E}(z) = E^{*}(z^{-1}),    (6.23)

or, in other words, the coefficients of E(z) are conjugated, and then the complex variable z^{-1} is substituted for the complex variable z.

The generalized PR cosine-modulated filter banks developed in Eqs. (6.19) through (6.23) are of considerable interest in many applications. This section, however, concentrates on the special case that has become of central importance in the advancement of perceptual audio coding algorithms, namely, the filter bank for which L = 2M, i.e., m = 1. The PR properties of this special case were first demonstrated by Princen and Bradley [Prin86] using time-domain arguments for the development of the time domain aliasing cancellation (TDAC) filter bank. Later, Malvar [Malv90a] developed the modulated lapped transform (MLT) by restricting attention to a particular prototype filter and formulating the filter bank as a lapped orthogonal block transform. More recently, the consensus name in the audio coding literature for the lapped block transform interpretation of this special case filter bank has evolved into the modified discrete cosine transform (MDCT). To avoid confusion, we will denote throughout this book by MDCT the PR cosine-modulated filter bank with L = 2M, and we will restrict the window w(n) in accordance with Eqs. (6.19) and (6.21). In short, the reader should be aware that the different acronyms TDAC, MLT, and MDCT all refer essentially to the same PR cosine modulated filter bank; only Malvar's MLT label implies a particular choice for w(n), as described below. From the perspective of an analysis-synthesis filter bank (Figure 6.1), the MDCT analysis filter impulse responses are given by

h_k(n) = w(n) \sqrt{\frac{2}{M}} \cos\left[ \frac{(2n + M + 1)(2k + 1)\pi}{4M} \right],    (6.24)

and the synthesis filters, to satisfy the overall linear phase constraint, are obtained by a time reversal, i.e.,

g_k(n) = h_k(2M - 1 - n).    (6.25)


This perspective is useful for visualizing individual channel characteristics in terms of their impulse and frequency responses. In practice, however, the MDCT is typically realized as a block transform, usually via a fast algorithm. The remainder of this section treats several MDCT facets that are of importance in audio coding applications, including its forward and inverse transform interpretations, prototype filter (window) design criteria, window design examples, time-varying forms, and fast algorithms.

6.7.1 Forward and Inverse MDCT

The analysis filter bank (Figure 6.13(a)) is realized as a block transform of length 2M samples, while using a block advance of only M samples, i.e., with 50% overlap between blocks. Thus, the MDCT basis functions extend across two blocks in time, leading to virtual elimination of the blocking artifacts that plague the reconstruction of nonoverlapped transform coders. Despite the 50% overlap, however, the MDCT is still critically sampled, and only M coefficients are generated by the forward transform for each 2M-sample input block. Given an input block x(n), the transform coefficients, X(k), for 0 ≤ k ≤ M − 1, are obtained by means of the forward MDCT, defined as

X(k) = \sum_{n=0}^{2M-1} x(n) h_k(n).    (6.26)

Clearly, the forward MDCT performs a series of inner products between the M analysis filter impulse responses, h_k(n), and the input, x(n). On the other hand, the inverse MDCT (Figure 6.13(b)) obtains a reconstruction by computing a sum of the basis vectors weighted by the transform coefficients from two blocks. The first M samples of the k-th basis vector, h_k(n) for 0 ≤ n ≤ M − 1, are weighted by the k-th coefficient of the current block, X(k). Simultaneously, the second M samples of the k-th basis vector, h_k(n) for M ≤ n ≤ 2M − 1, are weighted by the k-th coefficient of the previous block, X_P(k). Then, the weighted basis vectors are overlapped and added at each time index, n. Note that the extended basis functions require that the inverse transform maintain an M-sample memory to retain the previous set of coefficients. Thus, the reconstructed samples x(n), for 0 ≤ n ≤ M − 1, are obtained via the inverse MDCT, defined as

x(n) = \sum_{k=0}^{M-1} \left[ X(k) h_k(n) + X_P(k) h_k(n + M) \right],    (6.27)

where X_P(k) denotes the previous block of transform coefficients. The overlapped analysis and overlap-add synthesis processes are illustrated in Figures 6.13(a) and 6.13(b), respectively.
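As an illustration, the sketch below (an outline of ours, not code from any standard or from the text) realizes Eqs. (6.24)-(6.27) directly as a lapped block transform with 50%-overlapped analysis and overlap-add synthesis. The sine window of Section 6.7.3.1 is assumed for w(n), and interior samples are checked for exact reconstruction.

import numpy as np

def mdct_filters(w, M):
    """MDCT analysis filters h_k(n) of Eq. (6.24); synthesis filters are their
    time reversal, Eq. (6.25)."""
    n = np.arange(2 * M)
    k = np.arange(M)[:, None]
    return w * np.sqrt(2.0 / M) * np.cos((2 * n + M + 1) * (2 * k + 1) * np.pi / (4 * M))

def mdct_analysis(x, w, M):
    """Forward MDCT, Eq. (6.26): 2M-sample blocks with an M-sample hop."""
    h = mdct_filters(w, M)
    blocks = [x[i:i + 2 * M] for i in range(0, len(x) - 2 * M + 1, M)]
    return np.array([h @ b for b in blocks])     # one length-M coefficient vector per block

def mdct_synthesis(X, w, M):
    """Inverse MDCT with overlap-add, Eq. (6.27)."""
    h = mdct_filters(w, M)
    y = np.zeros((len(X) + 1) * M)
    for i, Xk in enumerate(X):
        y[i * M:i * M + 2 * M] += Xk @ h          # weighted basis vectors, overlapped and added
    return y

M = 8
w = np.sin((np.arange(2 * M) + 0.5) * np.pi / (2 * M))   # sine window, see Eq. (6.30)
x = np.random.randn(20 * M)
X = mdct_analysis(x, w, M)
y = mdct_synthesis(X, w, M)
# Interior samples are reconstructed exactly; the first and last half blocks lack full overlap.
print(np.max(np.abs(y[M:19 * M] - x[M:19 * M])))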

6.7.2 MDCT Window Design

Given the forward (Eq. (6.26)) and inverse (Eq. (6.27)) transform definitions, one still must design a suitable FIR prototype filter (window), w(n).


Figure 6.13. Modified discrete cosine transform (MDCT): (a) Lapped forward transform (analysis) – 2M samples are mapped to M spectral components (Eq. (6.26)). The analysis block length is 2M samples, but the analysis stride (hop size), and hence the time resolution, is M samples. (b) Inverse transform (synthesis) – M spectral components are mapped to a vector of 2M samples (Eq. (6.27)) that is overlapped by M samples and added to the vector of 2M samples associated with the previous frame.

Several general purpose orthogonal [Prin86] [Prin87] [Malv90a] and biorthogonal [Jawe95] [Smar95] [Matv96] windows have been proposed, while still other orthogonal [USAT95a] [Ferr96a] [ISOI96a] [Fiel96] and biorthogonal [Cheu95] [Malv98] windows are optimized explicitly for audio coding. In the orthogonal case, the generalized PR conditions [Vaid93] given in Eqs. (6.19)–(6.23) can be reduced to linear phase and Nyquist constraints on the window, namely,

w(2M - 1 - n) = w(n)    (6.28a)

w^2(n) + w^2(n + M) = 1,    (6.28b)

for the sample indices 0 ≤ n ≤ M − 1. These constraints give rise to two considerations. First, unlike the pseudo-QMF bank, linear phase in the MDCT lowpass prototype does not translate into linear phase for the modulated analysis filters on each subband channel. The overall MDCT analysis-synthesis filter bank, however, is characterized by perfect reconstruction and hence linear phase with a constant group delay of L − 1 samples. Secondly, although Eqs. (6.28a) and (6.28b) guarantee an orthogonal basis for the MDCT, an orthogonal basis is not required to satisfy the PR constraints in Eqs. (6.19)–(6.23). In fact, it can be shown [Cheu95]


that Eq. (6.28b) can be revised and that perfect reconstruction for the MDCT is still guaranteed as long as

w_s(n) = \frac{w_a(n)}{w_a^2(n) + w_a^2(n + M)}    (6.29)

for the sample indices 0 ≤ n ≤ M − 1, where w_s(n) denotes the synthesis window and w_a(n) denotes the analysis window. From the transform perspective, Eqs. (6.28a) and (6.29) guarantee a biorthogonal MDCT basis. Clearly, this relaxation of prototype FIR lowpass filter design requirements increases the degrees of freedom available to the filter bank designer from M/2 to M. In effect, it is no longer necessary to use the same analysis and synthesis windows. In any case, whether an orthogonal or biorthogonal basis is used, the MDCT window design problem can be formulated in the same manner as it was for the PQMF bank (Eq. (6.16)), except that the PR property of the MDCT eliminates the spectral flatness constraint (Eq. (6.17)), such that the designer can concentrate solely on minimizing either the stop-band energy or the maximum stop-band magnitude of W(Ω). Well-known tools are available (e.g., [Pres89]) for minimizing Eq. (6.16), but in many cases one can safely forego the design process and rely instead upon the general purpose orthogonal [Prin86] [Prin87] [Malv90a] or biorthogonal [Jawe95] [Smar95] [Matv96] MDCT windows that have been proposed in the literature. In fact, several existing orthogonal [Ferr96a] [ISOI96a] [USAT95a] and biorthogonal [Cheu95] [Malv98] transform windows were explicitly designed to be in some sense optimal for audio coding.
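A short numerical check of these constraints (ours, not the authors') is given below: it verifies Eqs. (6.28a) and (6.28b) for the sine window and derives a biorthogonal synthesis window from a symmetric Kaiser analysis window via Eq. (6.29). The Kaiser shape parameter is an arbitrary illustrative choice.

import numpy as np
from scipy.signal.windows import kaiser

M = 256

# Orthogonal case: sine window, check Eqs. (6.28a) and (6.28b)
n = np.arange(2 * M)
w = np.sin((n + 0.5) * np.pi / (2 * M))
print(np.allclose(w, w[::-1]))                        # Eq. (6.28a): w(2M-1-n) = w(n)
print(np.allclose(w[:M] ** 2 + w[M:] ** 2, 1.0))      # Eq. (6.28b)

# Biorthogonal case: a symmetric Kaiser analysis window (illustrative beta),
# with the synthesis window derived from Eq. (6.29) and extended by symmetry.
wa = kaiser(2 * M, beta=11.0)
ws = np.empty(2 * M)
ws[:M] = wa[:M] / (wa[:M] ** 2 + wa[M:] ** 2)         # Eq. (6.29), 0 <= n <= M-1
ws[M:] = ws[:M][::-1]                                 # keep the synthesis window linear phase
# The analysis/synthesis product is power complementary, which is what PR requires here:
print(np.allclose(wa[:M] * ws[:M] + wa[M:] * ws[M:], 1.0))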

6.7.3 Example MDCT Windows (Prototype FIR Filters)

It is instructive to consider some example MDCT windows in order to appreciate more fully the characteristics well suited to audio coding, as well as the tradeoffs that are involved in the window selection process.

6.7.3.1 Sine Window. Malvar [Malv90a] denotes by MLT the MDCT filter bank that makes use of the sine window, defined as

w(n) = \sin\left[ \left(n + \frac{1}{2}\right) \frac{\pi}{2M} \right]    (6.30)

for 0 ≤ n ≤ M − 1. This particular window is perhaps the most popular in audio coding. It appears, for example, in the MPEG-1 layer III (MP3) hybrid filter bank [ISOI92], the MPEG-2 AAC/MPEG-4 time-frequency filter bank [ISOI96a], and numerous experimental MDCT-based coders that have appeared in the literature. In fact, this window has become the de facto standard in MDCT audio applications, and its properties are typically referenced as performance benchmarks when windows are proposed. The sine window (Figure 6.14) has several unique properties that make it advantageous. First, DC energy is concentrated in a single transform coefficient, because all basis functions except for the first one have infinite attenuation at DC.


Second, the filter bank channels achieve 24 dB sidelobe attenuation when the sine window (Figure 6.14, dashed line) is used. Finally, the sine window has been shown [Malv90a] to make the MDCT asymptotically optimal in terms of coding gain for a lapped transform. Coding gain is desirable because it quantifies the factor by which the mean square error (MSE) is reduced when using the filter bank relative to using direct pulse code modulated (PCM) quantization of the time-domain signal at the same rate.

6.7.3.2 Parametric Phase-Modulated Sine Window. Optimization criteria other than coding gain or DC localization are possible and have also been investigated for the MDCT. Ferreira [Ferr96a] proposed a parametric window for the orthogonal MDCT that offers a controlled tradeoff between reduction of the time-domain ringing artifacts produced by coarse quantization and reduction of stop-band leakage relative to the sine window. The window (Figure 6.14, solid), which is defined in terms of three parameters for any value of M, i.e.,

w(n) = \sin\left[ \left(n + \frac{1}{2}\right) \frac{\pi}{2M} + \phi_{opt}(n) \right],    (6.31)

where
\phi_{opt}(n) = \frac{4\pi\beta}{(1 - \delta^2)} \left[ \frac{4n}{2M - 2} - \delta \right] \left[ \frac{4n}{2M - 2} - 1 \right],    (6.32)

was motivated by the observation that explicit simultaneous minimization of time-domain aliasing and stop-band energy resulted in a window well approximated by a nonlinear phase difference with respect to the sine window. Moreover, the parametric solution provided nearly optimal results and was tractable, while the explicit minimization was numerically unstable for long windows. Parameters are given in [Ferr96a] for three windows that offer, respectively, time-domain aliasing/stop-band leakage percentage improvements relative to the sine window of 6.3/10.1, 8.3/0.7, and 13.3/−3.1. Figure 6.14 compares the latter parametric window (β = 0.03125, α = 0.92, δ = 0.0) in both time and frequency against the sine window. It can be seen that the negative gain in stop-band attenuation is caused by a slight increase in the first sidelobe energy. It is also clear, however, that the stop-band attenuation characteristics improve with increasing frequency. In fact, the Ferreira window has a broader range of better than 110 dB attenuation than does the sine window. This characteristic of improved ultimate stop-band rejection can be beneficial for perceptual gain, particularly for strongly harmonic signals.

6.7.3.3 Separate Analysis and Synthesis Windows – Biorthogonal MDCT Basis. Even more dramatic improvements in ultimate stop-band rejection are possible when the orthogonality constraint is removed. Cheung and Lim [Cheu95] derived for the MDCT the biorthogonality window constraint given by

Eq. (6.29), and then demonstrated, with a Kaiser analysis window w_a(n), the potential for improved stop-band attenuation. In a similar fashion, Figure 6.15(a) shows the analysis (solid) and synthesis (dashed) windows that result for the biorthogonal MDCT when w_a(n) is a Kaiser window [Oppe99] with β = 11.


Figure 6.14. Orthogonal MDCT analysis/synthesis windows of Malvar [Malv90a] (dashed) and Ferreira [Ferr96a] (solid): (a) time-domain, (b) frequency-domain magnitude response. The parametric Ferreira window provides better stop-band attenuation over a broader range of frequencies at the expense of transition bandwidth and slightly reduced attenuation of the first sidelobe.

The most significant benefit of this arrangement is apparent from the frequency response plot for two 256-channel filter banks depicted in Figure 6.15(b). In this figure, the dashed line represents the frequency response associated with channel four of the sine window MDCT, and the lighter solid line corresponds to the frequency response associated with the same channel in the Kaiser window MDCT filter bank.


Figure 6.15. Biorthogonal MDCT basis: (a) Time-domain view of the analysis (solid) and synthesis (dashed) windows (Cheung and Lim [Cheu95]) that are associated with a biorthogonal MDCT basis. (b) Frequency-domain magnitude responses associated with MDCT channel four of 256 for the sine (orthogonal basis, dashed) and Kaiser (biorthogonal basis, solid) windows. The simultaneous masking threshold is superimposed for a pure tone occurring at the channel center frequency, 388 Hz. The picture demonstrates the potential for super-threshold leakage associated with the sine window and the improved stop-band attenuation realized with the Kaiser window.


Also superimposed on the plot is the simultaneous masking threshold generated by a 388-Hz pure tone occurring at the channel center. It can be seen that, although the main lobe for the Kaiser MDCT is somewhat broader than that of the sine MDCT, the stop-band attenuation is significantly below the masking threshold, whereas the sine window MDCT stop-band leakage has substantial super-threshold energy. The sine window, therefore, has the potential to cause artificially high bit rates because of its greater leakage. This type of artifact motivated the designers of the Dolby AC-2/AC-3 [USAT95a] and MPEG-2 AAC/MPEG-4 T-F [ISOI96a] algorithms to use customized windows rather than the standard sine window in their respective orthogonal MDCT filter banks.

6.7.3.4 The Dolby AC-2/Dolby AC-3/MPEG-2 AAC KBD Window. The Kaiser-Bessel Derived (KBD) window was obtained in a procedure devised at Dolby Laboratories. The AC-2 and AC-3 designers showed [Fiel96] that the prototype filter for an M-channel orthogonal MDCT filter bank satisfying the PR conditions (Eqs. (6.28a) and (6.28b)) can be derived from any symmetric kernel window of length M + 1 by applying a transformation of the form

w_a(n) = w_s(n) = \sqrt{ \frac{ \sum_{j=0}^{n} v(j) }{ \sum_{j=0}^{M} v(j) } }, \quad 0 \le n < M,    (6.33)

where the sequence v(n) represents the symmetric kernel. The resulting identical analysis and synthesis windows, w_a(n) and w_s(n), respectively, are of length M + 1 and symmetric, i.e., w(2M − n − 1) = w(n). Note that, although a more general form of Eq. (6.33) appeared in [Fiel96], we have simplified it here for the particular case of the 50%-overlapped MDCT. During the development of the AC-2 and AC-3 algorithms, novel MDCT prototype filters optimized to satisfy a minimum masking template (e.g., Figure 6.16(a) for AC-3) were designed using Eq. (6.33) with a parametric Kaiser-Bessel kernel, v(n). At the expense of some passband selectivity, the KBD windows achieve considerably better stop-band attenuation (greater than 40 dB improvement) than the sine window (Figure 6.16b). Thus, for a pure tone occurring at the center of a particular MDCT channel, the KBD filter bank concentrates more energy into a single transform coefficient. The remaining dispersed energy tends to generate coefficient magnitudes that lie below the worst-case pure tone excitation pattern ("masking template," Figure 6.16b). Particularly for signals with adequately spaced tonal components, the presence of fewer supra-threshold MDCT components reduces the perceptual bit allocation and therefore tends to improve coding gain. In spite of the reduced bit allocation, the filter bank still renders the quantization noise inaudible, since the uncoded coefficients have smaller magnitudes than the masked threshold. A KBD filter bank simulation exemplifying this behavior for the MPEG-2 AAC algorithm is given later.
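A sketch of the KBD construction of Eq. (6.33) follows, using a Kaiser window as the symmetric kernel v(n); the kernel shape parameter used here is an illustrative assumption rather than the exact value specified by any particular standard.

import numpy as np
from scipy.signal.windows import kaiser

def kbd_window(M, alpha=4.0):
    """Kaiser-Bessel Derived window for the 50%-overlapped MDCT, Eq. (6.33).
    The kernel v(n) is a length-(M+1) Kaiser window; beta = pi*alpha is an
    assumed parameterization of that kernel."""
    v = kaiser(M + 1, beta=np.pi * alpha)        # symmetric kernel of length M + 1
    c = np.cumsum(v) / np.sum(v)                 # running sum over total sum
    half = np.sqrt(c[:M])                        # w(n), 0 <= n < M
    return np.concatenate([half, half[::-1]])    # extend by symmetry, w(2M-1-n) = w(n)

M = 256
w = kbd_window(M)
# Any window built this way satisfies the same orthogonal PR constraint as the sine window:
print(np.allclose(w[:M] ** 2 + w[M:] ** 2, 1.0))   # Eq. (6.28b)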

6.7.3.5 Parametric Windows for a Biorthogonal MDCT Basis. In another example of biorthogonal window design, Malvar [Malv98] proposed the "modulated biorthogonal lapped transform" (MBLT), a biorthogonal version of the MDCT.


Figure 6.16. Dolby AC-3 (solid) vs. sine (dashed) MDCT windows: (a) time-domain views; (b) frequency-domain magnitude responses in relation to the worst-case masking template. The improved stop-band attenuation of the AC-3 (KBD) window is shown to approximate well the minimum masking template.

The MBLT is based on a parametric window defined as

w_s(n) = \frac{ 1 - \cos\left[ \left( \frac{n + 1}{2M} \right) \pi \right] + \beta }{ 2 + \beta },    (6.34)


for 0 ≤ n ≤ M − 1. Like [Cheu95], Eq. (6.34) was also motivated by a desire to realize improved stop-band attenuation. Additionally, it was used to achieve good characteristics in a novel nonuniform filter bank based on a straightforward manipulation of the MBLT. In this design, the parameter α controls window width, while the parameter β controls its end values.

6.7.3.6 Summary and Example Eight-Channel Filter Bank (MDCT) Using a Sine Window. The foregoing examples demonstrate that MDCT window designs are predominantly concerned with optimizing, in some sense, the tradeoff between mainlobe width and stopband attenuation, as is true of any FIR filter design. We also note that biorthogonal MDCT extensions are a recent development, and consequently most current audio coders incorporate primarily design innovations that have occurred within the orthogonal MDCT framework. To facilitate comparisons with the previously described filter bank methodologies (QMF, CQF, tree-structured QMF, pseudo-QMF, etc.), the analysis filter magnitude responses for an example eight-channel MDCT filter bank using the sine window are shown in Figure 6.17(a). Examination of the channel-3 impulse response in Figure 6.17(b) reveals the asymmetry that precludes linear phase for the analysis filters.

6.7.3.7 Time-Varying Forms of the MDCT. One final point regarding MDCT window design is of particular relevance for perceptual audio coders. The earlier examples for tone-like and noise-like signals (Chapter 6, Section 6.2) demonstrated clearly that the characteristics of the "best" filter bank for audio are signal-specific and therefore time-varying. In practice, it is very common for codecs using the MDCT (e.g., MPEG-1 [ISOI92a], MPEG-2 AAC [ISOI96a], Dolby AC-3 [USAT95a], Sony ATRAC [Tsut96], etc.) to change the number of channels, and hence the window length, to match the signal properties of the input. Typically, a binary classification scheme identifies the input as either stationary or nonstationary/transient. Then, a long window is used to maximize coding gain and achieve good channel separation during segments identified as stationary, or a short window is used to localize time-domain artifacts when pre-echoes are likely. Although the strategy has proven to be highly effective, it does complicate the codec structure. In particular, because of the time overlap between basis vectors, either boundary filters [Herl95] or special transitional windows [Herl93] are required to preserve perfect reconstruction when window switching occurs. Other schemes are also available to achieve perfect reconstruction with time-varying filter bank properties [Quei93] [Soda94], but for practical reasons these are not typically used. Consequently, window switching has been the method of choice. In this scenario, the transitional window function does not need to be symmetrical. It can be shown that the PR property is preserved as long as the transitional window satisfies the following constraints:

w^2(n) + w^2(M - n) = 1, \quad n < M    (6.35a)

w^2(M + n) + w^2(2M - n) = 1, \quad n \ge M,    (6.35b)


Figure 6.17. Eight-channel MDCT filter bank constructed with the sine window: (a) Magnitude frequency responses for all eight channels of the analysis filter bank. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Asymmetry is clearly visible in the channel impulse response, precluding the possibility of linear phase, although the overall analysis-synthesis filter bank has linear phase on all channels.


and provided that the relationship between the transitional window and the adjoining new-length window obeys

w_1(M + n) = w_2(M - n),    (6.36)

where w_1(n) and w_2(n) are the left and right window functions, respectively. In spite of the preserved PR property, it should be noted that MDCT transitional windows are highly nonideal in the sense that they seriously impair the channel selectivity and stop-band attenuation of the filter bank.

The Dolby AC-3 algorithm, as well as the MPEG MDCT-based coders, employs MDCT window switching to maximize filter bank-to-signal matching. The MPEG-1 layer III and MPEG-2 AAC window switching schemes use transitional windows that are described in some detail later (Section 6.10). Unlike the MPEG approach, the AC-3 algorithm maintains perfect reconstruction while avoiding transitional windows. The AC-3 applies high-resolution frequency analysis to stationary signals using an MDCT as defined in Eqs. (6.26) and (6.27), with M = 256. During transient segments, a pair of half-length transforms (M = 128), given by

X_1(k) = \sum_{n=0}^{2M-1} x(n) h_{k,1}(n),    (6.37a)

X_2(k) = \sum_{n=0}^{2M-1} x(n + 2M) h_{k,2}(n + 2M),    (6.37b)

replaces the single long-block transform, and the short-block filter impulse responses, h_{k,1} and h_{k,2}, are defined as

h_{k,1}(n) = w(n) \sqrt{\frac{2}{M}} \cos\left[ \frac{(2n + 1)(2k + 1)\pi}{4M} \right]    (6.38a)

h_{k,2}(n) = w(n) \sqrt{\frac{2}{M}} \cos\left[ \frac{(2n + 2M + 1)(2k + 1)\pi}{4M} \right].    (6.38b)

The window function w(n) remains identical for both the long and short transforms. Here, the key to maintaining the PR property is that the different phase shifts in Eqs. (6.38a) and (6.38b), relative to Eq. (6.24), guarantee an orthogonal basis. Also note that the AC-3 window is customized and incorporates into its design some perceptual properties [USAT95a]. The spectral and temporal analysis tradeoffs involved in transitional window designs are well illustrated in [Shli97] for both the MPEG-1 layer III [ISOI92a] and the Dolby AC-3 [USAT95a] filter banks.

6.7.3.8 Fast Algorithms, Complexity, and Implementation Issues.


One of the attractive properties that has contributed to the widespread use of the MDCT, particularly in the standards, is the availability of FFT-based fast algorithms [Duha91] [Sevi94] that make the filter bank viable for real-time applications. A unified fast algorithm [Liu98] is available for the MPEG-1, -2, -4, and AC-3 long-block MDCT (Eqs. (6.26) and (6.27)), the AC-3 short-block MDCT (Eqs. (6.38a) and (6.38b)), and the MPEG-1 pseudo-QMF bank (Eqs. (6.13) and (6.14)). The computational load of [Liu98] for an M = 1024 (2048-point) MDCT (e.g., MPEG-2 AAC, AT&T PAC) is 8192 multiplies and 13920 adds. This translates into complexity of O(M log2 M) for multiplies and O(2M log2 M) for adds. The complexity scales accordingly for other values of M. Both [Duha91] and [Liu98] exploit the fact that the forward MDCT can be decomposed into two cascaded stages (Figure 6.18), namely, a set of M/2 butterflies followed by an M-point discrete cosine transform (DCT). The inverse transform is decomposed in the inverse manner, i.e., a DCT followed by a butterfly network. In both cases, the butterflies capture the windowing behavior, while the DCT performs the modulation and filtering. The decompositions are efficient, as well-known fast algorithms are available for the various DCT types [Rao90]. The butterflies are of low complexity, typically O(2M) for both multiplies and adds. In addition to the computationally efficient algorithms of [Duha91] and [Liu98], a regressive structure suitable for parallel VLSI implementation of the Eq. (6.26) forward MDCT was proposed in [Chia96], with complexity of 3M adds and 2M multiplies per output for the forward transform, and 3M adds and M multiplies for the inverse transform.
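The decomposition can be sketched compactly: the forward MDCT of one block reduces to a windowing/folding ("butterfly") stage followed by an M-point orthonormal DCT-IV. The code below is our own sketch of that idea, not the algorithm of [Duha91] or [Liu98]; it assumes M even and SciPy's DCT-IV, and checks the folded form against the direct matrix form of Eq. (6.26).

import numpy as np
from scipy.fft import dct

def mdct_fast(x_block, w):
    """Forward MDCT of one 2M-sample block via windowing and folding (the
    butterfly stage of Figure 6.18) followed by an orthonormal DCT-IV."""
    M = len(x_block) // 2
    z = x_block * w                                      # windowing
    a, b, c, d = z[:M//2], z[M//2:M], z[M:3*M//2], z[3*M//2:]
    u = np.concatenate((-c[::-1] - d, a - b[::-1]))      # fold 2M samples down to M
    return dct(u, type=4, norm='ortho')                  # M-point DCT-IV

# Check against the direct form of Eq. (6.26) with the sine window
M = 64
w = np.sin((np.arange(2 * M) + 0.5) * np.pi / (2 * M))
n = np.arange(2 * M)
k = np.arange(M)[:, None]
H = w * np.sqrt(2 / M) * np.cos((2 * n + M + 1) * (2 * k + 1) * np.pi / (4 * M))
x = np.random.randn(2 * M)
print(np.allclose(H @ x, mdct_fast(x, w)))   # True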

As far as other implementation issues are concerned, several researchers have addressed the quantization sensitivity of the MDCT.


Figure 6.18. A fast algorithm for the 2M-point (M-channel) forward MDCT (Eq. (6.26)) consists of a butterfly network and memory, followed by a Type IV DCT. The inverse structure can be formed to compute the inverse MDCT (Eq. (6.27)). Efficient FFT-based algorithms are available for the Type IV DCT.


Expressions are available [Jako96] for the reconstruction error of the quantized system in terms of signal-correlated and uncorrelated components that can be used to assist algorithm designers in the identification and optimization of perceptually disturbing reconstruction artifacts induced by quantization noise. A more general treatment of quantization issues for PR cosine modulated filter banks has also appeared [Akan92].

6.7.3.9 Remarks on the MDCT. The MDCT has become of central importance in audio coding, and the majority of standardized algorithms make some use of this filter bank. This section has traced the origins of the MDCT, reviewed common terminology and definitions, addressed the major window design issues, examined the strategies for time-varying implementations, and noted the availability of fast algorithms for efficient realization. It has also provided numerous examples. The important properties of the MDCT filter bank can be summarized as follows:

• Perfect reconstruction
• Overlapping basis vectors
• Linear overall filter bank phase response
• Extended ringing artifacts due to quantization
• Critical sampling
• Virtual elimination of blocking artifacts
• Constant group delay = L − 1
• Nonlinear analysis filter phase responses
• Low complexity: one filter plus modulation
• Orthogonal version: M/2 degrees of freedom for w(n)
• Amenable to time-varying implementations, with some performance sacrifices
• Amenable to fast algorithms
• Constrained design: a single FIR lowpass prototype filter
• Biorthogonal version: M degrees of freedom for w_a(n) or w_s(n)

One can see from this synopsis that the MDCT possesses many of the qualities suitable for audio coding (Section 6.3). As a PR cosine-modulated filter bank, it inherits all of the advantages realized for the pseudo-QMF, except for phase linearity on the individual analysis channels, and it does so at the expense of less than a 5 dB reduction (typically) in stop-band attenuation. Moreover, the MDCT offers the added advantage that the number of parameters to be optimized in the design of the lowpass prototype is essentially reduced to M/2 in the orthogonal case. If more freedom is desired, however, one can opt for the biorthogonal construction. Finally, we have presented the MDCT as both a filter bank and a block transform. To maintain consistency, we recognize the filter-bank/transform duality of some of the other tools presented in this chapter.


Recall that the MDCT is a special case of the PR cosine modulated filter bank for which L = 2mM and m = 1. Note, then, that the PQMF bank (Chapter 6, Section 6.6) can also be interpreted as a lapped transform for which it is possible to have L = 2mM. In the case of the MPEG-1 filter bank for layers I and II, for example, L = 512 and M = 32, or, in other words, m = 8. As the coder architectures described in Chapters 7 through 10 will demonstrate, many filter banks for audio coding are most efficiently realized as block transforms, particularly when fast algorithms are available.

6.8 DISCRETE FOURIER AND DISCRETE COSINE TRANSFORM

This section offers abbreviated filter bank interpretations of the discrete Fourier transform (DFT) and the discrete cosine transform (DCT). These classical block transforms were often used to achieve high-resolution frequency analysis in the early experimental transform-based audio coders (Chapter 7) that preceded the adaptive spectral entropy coding (ASPEC) and, ultimately, the MPEG-1 algorithms, layers I–III (Chapter 10). For example, the FFT realization of the DFT plays an important role in layer III of MPEG-1 (MP3). The FFT is embedded in efficient realizations of both MP3 hybrid filter bank stages (pseudo-QMF and MDCT), as well as in the spectral estimation blocks of the psychoacoustic models 1 and 2 recommended in the MPEG-1 standard [ISOI92]. It can be seen that block transforms are a special case of the more general uniform-band analysis-synthesis filter bank of Figure 6.1. For example, consider the unitary DFT and its inverse [Akan92], which can be written, respectively, as

X(k) = \frac{1}{\sqrt{2M}} \sum_{n=0}^{2M-1} x(n) W^{-nk}, \quad 0 \le k \le 2M - 1,    (6.39a)

x(n) = \frac{1}{\sqrt{2M}} \sum_{k=0}^{2M-1} X(k) W^{nk}, \quad 0 \le n \le 2M - 1,    (6.39b)

where W = e^{jπ/M}. If the analysis filters in Eq. (6.1) all have the same length and L = 2M, then the filter bank could be interpreted as taking contiguous L-sample blocks of the input and applying to each block the transform in Eq. (6.39a). Although the DFT is usually defined with a block size of N instead of 2M, Eqs. (6.39a) and (6.39b) are given using notation slightly different from the usual in order to remain consistent with the convention of this chapter, throughout which the number of filter-bank channels is denoted by the parameter M. The DFT has conjugate symmetry for real signals and thus, from the audio filter bank perspective, effectively half as many channels as its block length. Also, from the filter bank viewpoint, the impulse response of the k-th-channel analysis filter is given by the k-th DFT basis vector, i.e.,

h_k(n) = \frac{1}{\sqrt{2M}} W^{kn}, \quad 0 \le n \le 2M - 1, \quad 0 \le k \le M - 1.    (6.40)


Figure 6.19. Eight-band STFT filter bank: (a) Magnitude frequency responses for all eight channels of the analysis filter bank. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3. Note that the impulse response is complex-valued and that only its magnitude is shown in the figure.


An example eight-channel DFT analysis filter bank magnitude frequency response appears in Figure 6.19, with the magnitude response of the third channel magnified in Figure 6.19(b). The magnitude of the complex-valued impulse response for the same channel also appears in Figure 6.19(b). Note that the DFT filter bank is evenly stacked, whereas the cosine-modulated filter banks in Sections 6.6 and 6.7 of this chapter were oddly stacked. In other words, the center frequencies for the DFT analysis and synthesis filters occur at kπ/M for 0 ≤ k ≤ M − 1, while the center frequencies for the oddly stacked filters occur at (2k + 1)π/2M for 0 ≤ k ≤ M − 1. As is evident from Figure 6.19(a), even stacking means that the low-band filter is only half the bandwidth of the other channels and that it "wraps around" the fold-over frequency.
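The stacking difference is easy to see numerically. The sketch below (illustrative only) locates the passband centers of the DFT channel filters of Eq. (6.40) and of an oddly stacked cosine-modulated bank (here the MDCT filters of Eq. (6.24) with a sine window) by peak-picking their magnitude responses.

import numpy as np

M = 8
L = 2 * M
n = np.arange(L)
nfft = 4096
freqs = 2 * np.pi * np.arange(nfft // 2 + 1) / nfft      # radian frequencies on [0, pi]

def center_frequency(h):
    """Locate the peak of |H(e^jw)| on [0, pi] for one channel filter."""
    Hf = np.abs(np.fft.fft(h, nfft))[:nfft // 2 + 1]
    return freqs[np.argmax(Hf)]

# Evenly stacked DFT bank, Eq. (6.40): channel k is centered at k*pi/M
dft_centers = [center_frequency(np.exp(1j * np.pi * k * n / M) / np.sqrt(L)) for k in range(M)]

# Oddly stacked cosine bank (MDCT filters of Eq. (6.24), sine window):
# channel k is centered near (2k + 1)*pi/(2M)
w = np.sin((n + 0.5) * np.pi / L)
mdct_centers = [center_frequency(w * np.cos((2 * n + M + 1) * (2 * k + 1) * np.pi / (4 * M)))
                for k in range(M)]

print(np.round(np.array(dft_centers) / np.pi, 2))    # approx. 0, 1/8, 2/8, ...
print(np.round(np.array(mdct_centers) / np.pi, 2))   # approx. 1/16, 3/16, 5/16, ...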

A filter bank perspective can also be provided for the DCT. As a block transform, the forward DCT (Type II) and its inverse are given by the analysis and synthesis expressions, respectively,

X(k) = c(k) \sqrt{\frac{2}{M}} \sum_{n=0}^{M-1} x(n) \cos\left[ \frac{\pi}{M}\left(n + \frac{1}{2}\right) k \right], \quad 0 \le k \le M - 1,    (6.41a)

x(n) = \sqrt{\frac{2}{M}} \sum_{k=0}^{M-1} c(k) X(k) \cos\left[ \frac{\pi}{M}\left(n + \frac{1}{2}\right) k \right], \quad 0 \le n \le M - 1,    (6.41b)

where c(0) = 1/\sqrt{2} and c(k) = 1 for 1 ≤ k ≤ M − 1. Using the same duality arguments as for the DFT, one can view the DCT from the perspective of the analysis-synthesis filter bank (Figure 6.1), in which case the impulse response of the k-th-channel analysis filter is the k-th DCT-II basis vector, given by

h_k(n) = c(k) \sqrt{\frac{2}{M}} \cos\left[ \frac{\pi}{M}\left(n + \frac{1}{2}\right) k \right], \quad 0 \le n, k \le M - 1.    (6.42)

As an example, the magnitude frequency responses of an eight-channel DCT analysis filter bank are given in Figure 6.20(a), and the isolated magnitude response of the third channel is given in Figure 6.20(b). The impulse response for the same channel is also given in the figure.
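Because the DCT-II basis is orthonormal, the analysis filters of Eq. (6.42), stacked as the rows of an M × M matrix, form the forward transform, and the transpose gives the inverse. The sketch below (ours) checks Eq. (6.41a) against SciPy's DCT-II and Eq. (6.41b) via the transpose.

import numpy as np
from scipy.fft import dct, idct

M = 8
n = np.arange(M)
k = np.arange(M)[:, None]
c = np.where(k == 0, 1 / np.sqrt(2), 1.0)

# DCT-II analysis filters of Eq. (6.42), stacked as an M x M transform matrix
H = c * np.sqrt(2 / M) * np.cos(np.pi * (n + 0.5) * k / M)

x = np.random.randn(M)
X = H @ x                                              # Eq. (6.41a)
print(np.allclose(X, dct(x, type=2, norm='ortho')))    # matches the library DCT-II
print(np.allclose(H.T @ X, x))                         # Eq. (6.41b): synthesis via the transpose
print(np.allclose(idct(X, type=2, norm='ortho'), x))   # same result from the library inverse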

6.9 PRE-ECHO DISTORTION

An artifact known as pre-echo distortion can arise in transform coders using perceptual coding rules. Pre-echoes occur when a signal with a sharp attack begins near the end of a transform block, immediately following a region of low energy. This situation can arise when coding recordings of percussive instruments such as the triangle, the glockenspiel, or the castanets, for example (Figure 6.21a). For a block-based algorithm, when quantization and encoding are performed in order to satisfy the masking thresholds associated with the block average spectral estimate, time-frequency uncertainty dictates that the inverse transform will spread quantization distortion evenly in time throughout the reconstructed block


Figure 6.20. Eight-band DCT filter bank: (a) Magnitude frequency responses for all eight channels of the analysis filter bank. Odd-numbered channel responses are drawn with dashed lines; even-numbered channel responses are drawn with solid lines. (b) Isolated view of the magnitude frequency response and time-domain impulse response for channel 3.


(Figure 6.21b). This results in unmasked distortion throughout the low-energy region preceding in time the signal attack at the decoder. Although it has the potential to compensate for pre-echo, temporal premasking is possible only if the transform block size is sufficiently small (minimal coder delay). Percussive sounds are not the only signals likely to produce pre-echoes. Such artifacts also often plague coders when processing "pitched" signals containing nearly impulsive bursts at the beginning of each pitch period, e.g., the "German male speech" recording [Herr96]. For a male speaker with a fundamental frequency of 125 Hz, the interval between impulsive events is only 8 ms, which is much less than the typical analysis block length. Several methods proposed to eliminate pre-echoes are reviewed next.

6.10 PRE-ECHO CONTROL STRATEGIES

Several methodologies have been proposed and successfully applied in the effort to mitigate the pre-echoes that tend to plague block-based coding schemes. This section describes several of the most widespread techniques, including the bit reservoir, window switching, gain modification, switched filter banks, and temporal noise shaping. Advantages and drawbacks associated with each method are also discussed.

6.10.1 Bit Reservoir

Some coders [ISOI92] [John96c] utilize this technique to satisfy the greater bit demand associated with transients. Although most algorithms are fixed rate, the instantaneous bit rates required to satisfy masked thresholds on each frame are, in fact, time-varying. Thus, the idea behind a bit reservoir is to store surplus bits during periods of low demand, and then to allocate bits from the reservoir during localized periods of peak demand, resulting in a time-varying instantaneous bit rate but, at the same time, a fixed average bit rate. One problem, however, is that very large reservoirs are needed to deal with certain transient signals, e.g., "pitched signals." Particular bit reservoir implementations are addressed later in conjunction with the MPEG [ISOI92a] and PAC [John96c] standards.

6.10.2 Window Switching

First introduced by Edler [Edle89], this is also a popular method for pre-echo suppression, particularly in the case of MDCT-based algorithms. Window switching works by changing the analysis block length from long duration (e.g., 25 ms) during stationary segments to "short" duration (e.g., 4 ms) when transients are detected (Figure 6.22). At least two considerations motivate this method. First, a short window applied to the frame containing the transient will tend to minimize the temporal spread of quantization noise, such that temporal premasking effects might preclude audibility. Secondly, it is desirable to constrain the high bit rates associated with transients to the shortest possible temporal regions. Although window switching has been successful [ISOI92] [John96c] [Tsut98], it also has significant drawbacks.


Figure 6.21. Pre-echo example (time-domain waveforms): (a) uncoded castanets; (b) transform coded castanets, 2048-point block size. Pre-echo distortion is clearly visible in the first 1300 samples of the reconstructed signal.


For one, the perceptual model and lossless coding portions of the coder must support multiple time resolutions. This usually translates into increased complexity. Furthermore, most coders nowadays use lapped transforms such as the MDCT. To satisfy PR constraints, window switching typically requires transition windows between the long and short blocks. Even when suitable transition windows (Figure 6.22) satisfy the PR constraints, they do so at the expense of poor time and frequency localization properties [Shli97], resulting in reduced coding gain. Other difficulties inherent to window switching schemes are increased coder delay, undesirable latency for closely spaced transients (e.g., long-start-short-stop-start-short), and impractical overuse of short windows for "pitched" signals.

6.10.3 Hybrid, Switched Filter Banks

Window switching essentially relies upon a fixed filter bank with adaptive window lengths. In contrast, the hybrid and switched filter-bank architectures rely upon distinct filter bank modes. In hybrid schemes (e.g., [Prin95]), compatible filter-bank elements are cascaded in order to achieve the time-frequency tiling best suited to the current input signal. Switched filter banks (e.g., [Sinh96]), on the other hand, make hard switching decisions on each analysis interval in order to select a single monolithic filter bank tailored to the current input. Examples of these methods are given in later chapters, along with some discussion of their associated tradeoffs.


Figure 6.22. Example window switching scheme (MPEG-1, layer III, or "MP3"). Transitional start and stop windows are required in between the long and short blocks to preserve the PR properties of the filter bank.


6.10.4 Gain Modification

The gain modification approach (Figure 6.23) has also shown promise in the task of pre-echo control [Vaup91] [Link93]. The gain modification procedure smoothes transient peaks in the time domain prior to spectral analysis. Then, perceptual coding may proceed as it does for normal, stationary blocks. Quantization noise is shaped to satisfy masking thresholds computed for the equalized long block without compensating for an undesirable temporal spread of quantization noise. A time-varying gain and the modification time interval are transmitted as side information. Inverse operations are performed at the decoder to recover the original signal. Like the other techniques, caveats also apply to this method. For example, gain modification effectively distorts the spectral analysis time window. Depending upon the chosen filter bank, this distortion could have the unintended consequence of broadening the filter-bank responses at low frequencies beyond critical bandwidth. One solution for this problem is to apply independent gain modifications selectively within only the frequency bands affected by the transient event. This selective approach, however, requires embedding of the gain blocks within a hybrid filter-bank structure, which increases coder complexity [Akag94].

6.10.5 Temporal Noise Shaping

The final pre-echo control technique considered in this section is temporal noise shaping (TNS). As shown in Figure 6.24, TNS [Herr96] is a frequency-domain technique that operates on the spectral coefficients, X(k), generated by the analysis filter bank. TNS is applied only during input attacks susceptible to pre-echoes.


Figure 6.23. Gain modification scheme for pre-echo control.


Figure 6.24. Temporal noise shaping (TNS) scheme for pre-echo control.


The idea is to apply linear prediction (LP) across frequency (rather than time), since, for an impulsive time signal, frequency-domain coding gain is maximized using prediction techniques. The method works as follows. Parameters of a spectral LP synthesis filter, A(z), are estimated via application of standard minimum MSE estimation methods (e.g., Levinson-Durbin) to the spectral coefficients, X(k). The resulting prediction residual, e(k), is quantized and encoded using standard perceptual coding according to the original masking threshold. Prediction coefficients are transmitted to the receiver as side information to allow recovery of the original signal. The convolution operation associated with spectral-domain prediction is associated with multiplication in time. In a manner analogous to the source-system separation realized by time-domain LP analysis for traditional speech codecs, TNS effectively separates the time-domain waveform into an envelope and a temporally flat "excitation." Then, because quantization noise is added to the flattened residual, the time-domain multiplicative envelope corresponding to A(z) shapes the quantization noise such that it follows the original signal envelope.
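A minimal open-loop sketch of this idea follows (our own illustration, not the MPEG-2 AAC TNS procedure, which restricts the filter to selected frequency ranges and quantizes the filter coefficients): an order-p predictor is fit across the spectral coefficients, the residual is what would be quantized, and the decoder inverts the filter.

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter
from scipy.fft import dct

def tns_filter(X, order=8):
    """Fit an order-p prediction filter across frequency to the spectral
    coefficients X(k) (autocorrelation method, minimum MSE) and return the
    prediction residual e(k) plus the prediction-error filter coefficients."""
    r = np.correlate(X, X, mode='full')[len(X) - 1:len(X) + order]   # lags 0..order
    a = solve_toeplitz(r[:order], r[1:order + 1])    # normal (Yule-Walker) equations
    b = np.concatenate(([1.0], -a))                  # prediction-error filter 1 - sum a_i z^-i
    e = lfilter(b, [1.0], X)                         # residual, filtered across frequency
    return e, b

def tns_inverse(e, b):
    """Decoder side: reconstruct the coefficients from the (quantized) residual."""
    return lfilter([1.0], b, e)

# Toy usage: a transient-like block, with a 256-point DCT standing in for the filter bank
x = np.zeros(256)
x[200:] = np.random.randn(56)                        # attack near the end of the block
X = dct(x, type=2, norm='ortho')
e, b = tns_filter(X)
print(np.allclose(tns_inverse(e, b), X))             # lossless when no quantization is applied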

Quantization noise for the castanets applied to a DCT-based coder is shown in Figure 6.25(a) and Figure 6.25(b), both without and with TNS active, respectively. TNS clearly shapes the quantization noise to follow the input signal's energy envelope. TNS mitigates pre-echoes since the error energy is now concentrated in the time interval associated with the largest masking threshold. Although they are related as time-frequency dual operations, TNS is advantageous relative to gain shaping because it is easily applied selectively in specific frequency subbands. Moreover, TNS has the advantages of compatibility with most filter-bank structures and manageable complexity. Unlike window switching schemes, for example, TNS does not require modification of the perceptual model or lossless coding stages to accommodate a new time-frequency mapping. TNS was reported in [Herr96] to dramatically improve performance on a five-point mean opinion score (MOS) test, from 2.64 to 3.54, for a particularly troublesome pitched signal, "German Male Speech," for the MPEG-2 nonbackward compatible (NBC) coder [Herr96]. A MOS improvement of 0.3 was also realized for the well-known "Glockenspiel" test signal. This ultimately led to the adoption of TNS in the MPEG NBC scheme [Bosi96a] [ISOI96a].

6.11 SUMMARY

This chapter covered the basics of time-frequency analysis techniques for audio signal processing and coding. We also highlighted the time-frequency tradeoff challenges faced by audio codec designers when designing a filter bank. We discussed both the QMF and CQF filter-bank designs and their extended tree-structured forms. We also dealt in detail with the cosine modulated pseudo-QMF and perfect reconstruction M-band filter-bank designs. The modified discrete cosine transform (MDCT) and the various window designs were also covered in detail.


Figure 6.25. Temporal noise shaping example showing quantization noise and the input signal energy envelope for castanets: (a) without TNS and (b) with TNS.


PROBLEMS

6.1 Prove the identities shown in Figure 6.10.

6.2 Consider the analysis-synthesis filter bank shown in Figure 6.1 with input signal s(n). Show that
\hat{s}(n) = \frac{1}{M} \sum_{m=-\infty}^{\infty} \sum_{l=-\infty}^{\infty} \sum_{k=0}^{M-1} s(m) h_k(lM - m) g_k(n - lM),
where M is the number of subbands.

6.3 For the down-sampling and up-sampling processes given in Figure 6.26, show that
S_d(\Omega) = \frac{1}{M} \sum_{l=0}^{M-1} S\!\left( \frac{\Omega + 2\pi l}{M} \right) H\!\left( \frac{\Omega + 2\pi l}{M} \right)
and S_u(\Omega) = S(\Omega M) G(\Omega).

6.4 Using results from Problem 6.3, prove Eq. (6.5) for the analysis-synthesis framework shown in Figure 6.1.

6.5 Consider Figure 6.27. Given s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, . . . , 6, H_0(z) = 1 − z^{-1}, and H_1(z) = 1 + z^{-1}:

a. Design the synthesis filters G_0(z) and G_1(z) in Figure 6.27 such that aliasing distortions are minimized.

b. Write closed-form expressions for v_0(n), v_1(n), y_0(n), y_1(n), w_0(n), w_1(n), and the synthesized waveform ŝ(n). In Figure 6.27, assume ŷ_i(n) = y_i(n) for i = 0, 1.

c. Assuming an alias-free scenario, show that ŝ(n) = αs(n − n_0), where α is the QMF bank gain and n_0 is a delay that depends on H_i(z) and G_i(z). Estimate the value of n_0.

d. Repeat steps (a) and (c) for H_0(z) = 1 − 0.75z^{-1} and H_1(z) = 1 + 0.75z^{-1}.


Figure 6.26. Down-sampling and up-sampling processes.


Figure 6.27. A two-band, maximally decimated analysis-synthesis filter bank.


6.6 In this problem, we will compare the two-band QMF and CQF designs. Given H_0(z) = 1 − 0.9z^{-1}:
a. Design a two-band (i.e., M = 2) QMF. [Use Eq. (6.7).]
b. Design a two-band CQF for L = 8. [Use Eq. (6.8).]
c. Consider the two-band QMF and CQF banks in Figure 6.28 with input signal s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, . . . , 6. Compare the designs in (a) and (b), and check for alias-free reconstruction in the case of the CQF. Give the delay values d_1 and d_2.
d. Extend the two-band QMF design in part (a) to a polyphase factorization [use Eqs. (6.9) and (6.10)]. What are the advantages of employing polyphase factorization?

6.7 In this problem, we will design and analyze a four-channel uniform tree-structured QMF bank.
a. Given H_00(z) = 1 + 0.1z^{-1} and H_10(z) = 1 + 0.9z^{-1}, complete the tree-structured QMF bank (use Figure 6.8) for four channels.
b. Using the identities given in Figure 6.10 (or Eq. (6.11)), construct a parallel analysis-synthesis filter bank. The parallel analysis-synthesis filter bank structure must be similar to the one shown in Figure 6.1, with M = 4.
c. Plot the frequency responses of the resulting parallel filter bank analysis filters, i.e., H_0(z), H_1(z), H_2(z), and H_3(z). Comment on the passband and stopband structures of the magnitude responses associated with these filters.
d. Plot the impulse response h_1(n). Is h_1(n) symmetric?

6.8 Repeat Problem 6.7 for a four-channel uniform tree-structured CQF bank with L = 4.

6.9 A time-domain plot of an audio signal is shown in Figure 6.29. Given the flexibility to encode the regions A through E with varying frame sizes, which of the following choices is preferred?

Choice I: Long frames in regions B and D.
Choice II: Long frames in regions A, C, and E.


Figure 6.28. A two-band QMF and CQF design comparison.


Figure 6.29. An example audio segment with harmonics (region A), a transient (region B), background noise (region C), exponentially weighted harmonics (region D), and background noise (region E) segments.

Choice III: Short frames in region B only.
Choice IV: Short frames in regions B and D.
Choice V: Short frames in region A only.

Explain how you would assign frequency resolution (high or low) among the regions A, B, C, D, and E.

6.10 A pure tone at f_0 with P_0 dB SPL is encoded such that the quantization noise is masked. Let us assume that a 256-point MDCT produced an in-band signal-to-noise ratio of SNR_A and encodes the tone with b_A bits/sample, and a 1024-point MDCT yielded SNR_B and encodes the tone with b_B bits/sample.


Figure 6.30. Audio frames x_1(n), x_2(n), x_3(n), and x_4(n) for Problem 6.11.


In which of the two cases will we require the most bits/sample (state whether b_A > b_B or b_A < b_B) to mask the quantization noise?

6.11 Given the signals x_1(n), x_2(n), x_3(n), and x_4(n) as shown in Figure 6.30, let all the signals be of length 1024 samples. When transform coded using a 512-point MDCT, which of the signals will result in pre-echo distortion?

COMPUTER EXERCISES

6.12 In this problem, we will study filter banks that are based on the DFT. Use Eq. (6.39a) to implement a 2M-point DFT of x(n) given in Figure 6.31. Assume M = 8.
a. Give the plots of |X(k)|.
b. Plot the frequency responses of the second- and third-channel analysis filters that are associated with the basis vectors h_1(n) and h_2(n).
c. State whether the DFT filter bank is evenly stacked or oddly stacked.

6.13 In this problem, we will study filter banks that are based on the DCT. Use Eq. (6.41a) to implement an M-point DCT of x(n) given in Figure 6.31. Assume M = 8.
a. Give the plots of |X(k)|.
b. Also plot the frequency responses of the second- and third-channel analysis filters that are associated with the basis vectors h_1(n) and h_2(n).
c. Plot the impulse response of h_1(n) and see if it is symmetric.
d. Is the DCT filter bank evenly stacked or oddly stacked?

6.14 In this problem, we will study filter banks based on the MDCT.
a. First, design a sine window w(n) = sin[(2n + 1)π/4M] with M = 8.
b. Check whether the sine window satisfies the generalized perfect reconstruction conditions, i.e., Eqs. (6.28a) and (6.28b).
c. Next, design an MDCT analysis filter bank, h_k(n) for 0 < k < 7.


Figure 6.31. Input signal x(n) for Problems 6.12, 6.13, and 6.14.


d. Plot both the impulse response and the frequency response of the analysis filters h_1(n) and h_2(n).
e. Compute the MDCT coefficients X(k) of the input signal x(n) shown in Figure 6.31.
f. Is the impulse response of the analysis filter h_1(n) symmetric?

6.15 Show analytically that a DCT can be implemented using FFTs. Also, use x(n) given in Figure 6.32 as your test signal and verify your software implementation.

6.16. Give expressions for the DCT-I, DCT-II, DCT-III, and DCT-IV orthonormal transforms (e.g., see [Rao90]). Use the signals x1(n) and x2(n) shown in Figure 6.33 to study the differences in the 4-point DCT coefficients obtained from the different types of DCT. Describe, in general, whether choosing a particular type of DCT affects the energy compaction of a signal.

6.17. In this problem, we will design a two-band (M = 2) cosine-modulated PQMF bank with L = 8.
a. First, design a linear-phase FIR prototype lowpass filter (i.e., w(n)) with normalized cutoff frequency π/4. Plot the frequency response of this window. Use the fir2 command in MATLAB to design the lowpass filter.
b. Use Eqs. (6.13) and (6.14) to design the PQMF analysis and synthesis filters, respectively.

Figure 6.32. Input signal x(n) for Problem 6.15.

Figure 6.33. Test signals x1(n) and x2(n) to study the differences among various types of orthonormal DCT transforms.


Figure 6.34. A two-band PQMF design (input s(n); output s3(n − d3)).

c. In Figure 6.34, use s(n) = 0.75 sin(πn/3) + 0.5 cos(πn/6), n = 0, 1, . . ., 6, and generate s3(n). Compare s(n) and s3(n) and comment on any type of distortion that you may observe.

d. What are the advantages of employing a cosine-modulated filter bank over the two-band QMF and two-band CQF?

e. List some of the key differences between the CQF and the PQMF in terms of (1) the analysis filter bank frequency responses, (2) phase distortion, and (3) impulse response symmetries.

f. Use the sine window w(n) = sin[(2n + 1)π/4M] with M = 8 and repeat steps (b) and (c).
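A sketch for parts (a) and (b) of Problem 6.17 appears below. Since Eqs. (6.13) and (6.14) are not reproduced here, one commonly used cosine-modulation form is assumed; the exact expressions from the text should be substituted.

```matlab
% Sketch for Problem 6.17, parts (a)-(b), with an assumed PQMF modulation:
%   h_k(n) = w(n) cos[(pi/M)(k+1/2)(n-(L-1)/2) + (-1)^k pi/4]
%   f_k(n) = w(n) cos[(pi/M)(k+1/2)(n-(L-1)/2) - (-1)^k pi/4]
M = 2;  L = 8;  n = 0:L-1;
w = fir2(L-1, [0 0.25 0.25 1], [1 1 0 0]);      % (a) prototype lowpass, cutoff pi/4
figure; freqz(w, 1, 512);

h = zeros(M, L);  f = zeros(M, L);
for k = 0:M-1
    arg      = (pi/M)*(k + 0.5)*(n - (L-1)/2);
    h(k+1,:) = w .* cos(arg + (-1)^k*pi/4);     % analysis filters
    f(k+1,:) = w .* cos(arg - (-1)^k*pi/4);     % synthesis filters
end
```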

CHAPTER 7

TRANSFORM CODERS

7.1 INTRODUCTION

Transform coders make use of unitary transforms (e.g., DFT, DCT, etc.) for the time/frequency analysis section of the audio coder shown in Figure 1.1. Many transform coding schemes for wideband and high-fidelity audio have been proposed, starting with some of the earliest perceptual audio codecs. For example, in the mid-1980s, Krahe applied psychoacoustic bit allocation principles to a transform coding scheme [Krah85] [Krah88]. Schroeder [Schr86] later extended these ideas into multiple adaptive spectral audio coding (MSC). The MSC utilizes a 1024-point DFT, then groups coefficients into 26 subbands, inspired by the critical bands of the ear. This chapter gives an overview of algorithms that were proposed for transform coding of high-fidelity audio following the early work of Schroeder [Schr86].

The chapter is organized as follows. Sections 7.2 through 7.5 describe in some detail the transform coding algorithms proposed by Brandenburg, Johnston, and Mahieux [Bran87b] [John88a] [Mahi89] [Bran90]. Most of this research became connected with the MPEG standardization, and the ISO/IEC eventually clustered these algorithms into a single candidate algorithm for coding of high-quality music signals, called adaptive spectral entropy coding (ASPEC) [Bran91]. The ASPEC algorithm (Section 7.6) has become part of the ISO/IEC MPEG-1 [ISOI92] and the MPEG-2 BC-LSF [ISOI94a] audio coding standards. Sections 7.7 and 7.8 are concerned with two transform coefficient substitution schemes, namely, the differential perceptual audio coder (DPAC) and the DFT noise substitution algorithm. Finally,




Sections 7.9 and 7.10 address several early applications of vector quantization (VQ) to transform coding of high-fidelity audio.

The algorithms described in this chapter that make use of modulated filter banks (e.g., ASPEC, DPAC, TwinVQ) can also be characterized as high-resolution subband coders. Typically, transform coders perform high-resolution frequency analysis and subband coders rely on a coarse division of the frequency spectrum. In many ways the transform and subband coder categories overlap, and in some cases it is hard to categorize a coder in a definite manner. The source of this overlap between the transform/subband categories comes from the fact that block transform realizations are used for cosine-modulated filter banks.

7.2 OPTIMUM CODING IN THE FREQUENCY DOMAIN

Brandenburg in 1987 proposed a 132 kb/s algorithm known as optimum coding in the frequency domain (OCF) [Bran87b], which is in some respects an extension of the well-known adaptive transform coder (ATC) for speech. The OCF was refined several times over the years, with two enhanced versions appearing after the original algorithm. The OCF is of interest because of its influence on current standards.

The original OCF (Figure 7.1) works as follows. The input signal is first buffered in 512-sample blocks and transformed to the frequency domain using the DCT. Next, transform components are quantized and entropy coded. A single quantizer is used for all transform components. Adaptive quantization and entropy coding work together in an iterative procedure to achieve a fixed bit rate. The initial quantizer step size is derived from the spectral flatness measure (Eq. (5.13)).

In the inner loop of Figure 7.1, the quantizer step size is iteratively increased and a new entropy-coded bit stream is formed at each update until the desired bit

Figure 7.1. OCF encoder (after [Bran88b]): input buffer/windowing, DCT, psychoacoustic analysis, weighting/quantizer, and entropy coder, with inner and outer rate-control loops.


rate is achieved. Increasing the step size at each update produces fewer levels, which in turn reduces the bit rate. Using a second iterative procedure, a perceptual analysis is introduced after the inner loop is done. First, critical band analysis is applied. Then, a masking function is applied, which combines a flat −6 dB masking threshold with an interband masking threshold, leading to an estimate of JND for each critical band. If, after inner-loop quantization and entropy encoding, the measured distortion exceeds JND in at least one critical band, then quantization step sizes are adjusted only in the out-of-tolerance critical bands. The outer loop repeats until JND criteria are satisfied or a maximum loop count is reached. Entropy-coded transform components are then transmitted to the receiver, along with side information, which includes the log-encoded SFM, the number of quantizer updates during the inner loop, and the number of step size reductions that occurred for each critical band in the outer loop. This side information is sufficient to decode the transform components and perform reconstruction at the receiver.

Brandenburg in 1988 reported an enhanced OCF (OCF-2), which achieved subjective quality improvements at a reduced bit rate of only 110 kb/s [Bran88a]. The improvements were realized by replacing the DCT with the MDCT and adding a pre-echo detection/compensation scheme. Reconstruction quality is improved due to the effective time resolution increase (i.e., 50% time overlap) associated with the MDCT. OCF-2 quality is also improved for difficult signals such as triangle and castanets due to a simple pre-echo detection/compensation scheme. The encoder detects pre-echoes using analysis-by-synthesis. Pre-echoes are detected when noise energy in a reconstructed segment (16 samples = 0.36 ms @ 44.1 kHz) exceeds signal energy. The encoder then determines the frequency below which 90% of the signal energy is contained and transmits this cutoff to the decoder. Given pre-echo detection at the encoder (1 bit) and a cutoff frequency, the decoder discards frequency components above the cutoff, in effect low-pass filtering pre-echoes. Due to these enhancements, the OCF-2 was reported to achieve transparency over a wide variety of source material.

Later in 1988, Brandenburg reported further OCF enhancements (OCF-3), in which better quality was realized at a lower bit rate (64 kb/s) with reduced complexity [Bran88b]. This was achieved through differential coding of spectral components to exploit correlation between adjacent samples, an enhanced psychoacoustic model modified to account for temporal masking, and an improved rate-distortion loop.

7.3 PERCEPTUAL TRANSFORM CODER

While Brandenburg developed the OCF algorithm, similar work was simultaneously underway at AT&T Bell Labs. Johnston developed several DFT-based transform coders [John88a] [John89] for audio during the late 1980s that became an integral part of the ASPEC proposal. Johnston's work in perceptual entropy [John88b] forms the basis for a transform coder reported in 1988 [John88a] that


Figure 7.2. PXFM encoder (after [John88a]): 2048-point FFT, psychoacoustic analysis, threshold adjustment, bit allocation loop, quantizers, and bit packing, with thresholds Ti and peak magnitudes Pj transmitted as side information.

achieves transparent coding of FM-quality monaural audio signals (Figure 7.2). A stereophonic coder based on similar principles was developed later.

7.3.1 PXFM

A monaural algorithm, the perceptual transform coder (PXFM), was developed first. The idea behind the PXFM is to estimate the amount of quantization noise that can be inaudibly injected into each transform-domain subband using PE estimates. The coder works as follows. The signal is first windowed into overlapping (1/16) segments and transformed using a 2048-point FFT. Next, the PE procedure described in Section 5.6 is used to estimate JND thresholds for each critical band. Then, an iterative quantization loop adapts a set of 128 subband quantizers to satisfy the JND thresholds until the fixed bit rate is achieved. Finally, quantization and bit packing are performed. Quantized transform components are transmitted to the receiver along with appropriate side information. Quantization subbands consist of 8-sample blocks of complex-valued transform components. The quantizer adaptation loop first initializes the j ∈ [1, 128] subband quantizers (1024 unique FFT components/8 components per subband) with kj levels and step sizes of Ti as follows:

kj = 2 nint(Pj/Ti) + 1,    (7.1)

where Ti are the quantized critical band JND thresholds, Pj is the quantized magnitude of the largest real or imaginary transform component in the j-th subband, and nint() is the nearest integer rounding function. The adaptation process involves repeated application of two steps. First, bit packing is attempted using the current quantizer set. Although many bit packing techniques are possible, one simple scenario involves sorting quantizers in kj order, then filling 64-bit words with encoded transform components according to the sorted results. After bit packing, Ti are adjusted by a carefully controlled scale factor, and the adaptation cycle repeats. Quantizer adaptation halts as soon as the packed data length satisfies the desired bit rate. Both Pj and the modified Ti are quantized on a dB scale using 8-bit uniform quantizers with a 170 dB dynamic


range. These parameters are transmitted as side information and used at the receiver to recover quantization levels (and thus implicit bit allocations) for each subband, which are in turn used to decode quantized transform components. The DC FFT component is quantized with 16 bits and is also transmitted as side information.
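As a small illustration of Eq. (7.1), the number of levels per subband quantizer can be computed directly from Pj and Ti; the values below are placeholders only.

```matlab
% Eq. (7.1): k_j = 2 nint(P_j / T_i) + 1 (example values are illustrative only)
Pj = [0.9 0.4 0.2];            % largest |real| or |imag| component per subband
Ti = [0.1 0.1 0.05];           % quantized JND threshold of the associated band
kj = 2*round(Pj ./ Ti) + 1;    % odd number of quantizer levels per subband
```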

7.3.2 SEPXFM

In 1989, Johnston extended the PXFM coder to handle stereophonic signals (SEPXFM) and attained transparent coding of a CD-quality stereophonic channel at 192 kb/s, or 2.2 bits/sample. SEPXFM [John89] realizes performance improvements over PXFM by exploiting inherent stereo cross-channel redundancy and by assuming that both channels are presented to a single listener rather than being used as separate signal sources. The SEPXFM structure is similar to that of PXFM, with variable-radix bit packing replaced by adaptive entropy coding. Side information is therefore reduced to include only adjusted JND thresholds (step sizes) and pointers to the entropy codebooks used in each transform-domain subband. The coder works in the following manner. First, sum (L + R) and difference (L − R) signals are extracted from the left (L) and right (R) channels to exploit left/right redundancy. Next, the sum and difference signals are windowed and transformed using the FFT. Then, a single JND threshold for each critical band is established via the PE method using the summed power spectra from the L + R and L − R signals. A single combined JND threshold is applied to quantization noise shaping for both signals (L + R and L − R), based upon the assumption that a listener is more than one "critical distance" [Jetz79] away from the stereo speakers.

Like PXFM, a fixed bit rate is achieved by applying an iterative threshold adjustment procedure after the initial determination of JND levels. The adaptation process, analogous to PXFM bit rate adjustment and bit packing, consists of several steps. First, transform components from both (L + R) and (L − R) are split into subband blocks, each averaging 8 real/imaginary samples. Then, one of six entropy codebooks is selected for each subband based on the average component magnitude within that subband. Next, transform components are quantized given the JND levels and encoded using the selected codebook. Subband codebook selections are themselves entropy encoded and transmitted as side information. After encoding, JND thresholds are scaled by an estimator and the quantizer adaptation process repeats. Threshold adaptation stops when the combined bitstream of quantized JND levels, Huffman-encoded (L + R) components, Huffman-encoded (L − R) components, and Huffman-encoded average magnitudes achieves the desired bit rate. The Huffman codebooks are developed using a large music and speech database. They are optimized for difficult signals at the expense of mean compression rate. It is also interesting to note that headphone listeners reported no noticeable acoustic mixing, despite the critical distance assumption and the single combined JND level estimate for both channels, (L + R) and (L − R).


7.4 BRANDENBURG-JOHNSTON HYBRID CODER

Johnston and Brandenburg [Bran90] collaborated in 1990 to produce a hybrid coder that, strictly speaking, is both a subband and transform coding algorithm. The idea behind the hybrid coder is to improve time and frequency resolution relative to OCF and PXFM by constructing a filter bank that more closely resembles the auditory filter bank. This is accomplished at the encoder by first splitting the input signal into four octave-width subbands using a QMF filter bank.

The decimated output sequence from each subband is then followed by one or more transforms to achieve the desired time/frequency resolution, Figure 7.3(a). Both the DFT and the MDCT were investigated. Given the tiling of the time-frequency plane shown in Figure 7.3(b), frequency resolution at low frequencies (23.4 Hz) is well matched to the ear, while the time resolution at high frequencies (2.7 ms) is sufficient for pre-echo control.

The quantization and coding schemes of the hybrid coder combine elements from both PXFM and OCF. Masking thresholds are estimated using the PXFM

Figure 7.3. Brandenburg-Johnston coder: (a) filter bank structure (QMF split into 0–3/3–6 kHz, 0–6/6–12 kHz, and 0–12/12–24 kHz bands followed by 64- and 128-point DFTs, 320 lines/frame), (b) time/frequency tiling (after [Bran90]).


approach for eight time slices in each frequency subband. A more sophisticated tonality estimate was defined to replace the SFM (Eq. (5.13)) used in PXFM, however, such that tonality is estimated in the hybrid coder as a local characteristic of each individual spectral line. Predictability of magnitude and phase spectral components across time is used to evaluate tonality instead of just global spectral shape within a single frame. High temporal predictability of magnitudes and phases is associated with the presence of a tonal signal. In contrast, low predictability implies the presence of a noise-like signal. The hybrid coder employs a quantization and coding scheme borrowed from OCF. As far as quality, the hybrid coder without any explicit pre-echo control mechanism was reported to achieve quality better than or equal to OCF-3 at 64 kb/s [Bran90]. The only disadvantage noted by the authors was increased complexity. A similar hybrid structure was eventually adopted in MPEG-1 and -2 Layer III.

7.5 CNET CODERS

Research at the Centre National d'Etudes des Telecommunications (CNET) resulted in several transform coders based on the DFT and the MDCT.

7.5.1 CNET DFT Coder

In 1989, Mahieux, Petit, et al. proposed a DFT-based audio coding system that introduced a novel scheme to exploit DFT interblock redundancy. Nearly transparent quality was reported for 15-kHz (FM-grade) audio at 96 kb/s [Mahi89], except for some highly harmonic signals. The encoder applies first-order backward-adaptive predictors (across time) to DFT magnitude and differential phase components, then quantizes separately the prediction residuals. Magnitude and differential phase residuals are quantized using an adaptive nonuniform pdf-optimized quantizer designed for a Laplacian distribution and an adaptive uniform quantizer, respectively. The backward-adaptive quantizers are reinitialized during transients. Bits are allocated during step-size adaptation to shape quantization noise such that a psychoacoustic noise threshold is satisfied for each block. The perceptual model used is similar to Johnston's model that was described earlier in Section 5.6. The use of linear prediction is justified because it exploits magnitude and differential phase time redundancy, which tends to be large during periods when the audio signal is quasi-stationary, especially for signal harmonics. Quasi-stationarity might occur, for example, during a sustained note. A similar technique was eventually embedded in the MPEG-2 AAC algorithm.

7.5.2 CNET MDCT Coder 1

In 1990, Mahieux and Petit reported on the development of a similar MDCT-based transform coder for which they reported transparent CD-quality at 64 kb/s [Mahi90]. This algorithm introduced a novel spectrum descriptor scheme for representing the power spectral envelope. The algorithm first segments input audio into frames of 1024 samples, corresponding to 12 ms of new data per frame,


given 50% MDCT time overlap. Then, a bit allocation is computed at the encoder using a set of "spectrum descriptors." Spectrum descriptors consist of quantized sample variances for MDCT coefficients grouped into 35 nonuniform frequency subbands. Like their DFT coder, this algorithm exploits either interblock or intrablock redundancy by differentially encoding the spectrum descriptors with respect to time or frequency and transmitting them to the receiver as side information. A decision whether to code with respect to time or frequency is made on the basis of which method requires fewer bits; the binary decision requires only 1 bit. Either way, spectral descriptor encoding is done using log DPCM with a first-order predictor and a 16-level uniform quantizer with a step size of 5 dB. Huffman coding of the spectral descriptor codewords results in less than 2 bits/descriptor. A global masking threshold is estimated by convolving the spectral descriptors with a basilar spreading function on a Bark scale, somewhat like the approach taken by Johnston's PXFM. Bit allocations for quantization of normalized transform coefficients are obtained from the masking threshold estimate. As usual, bits are allocated such that quantization noise is below the masking threshold at every spectral line. Transform coefficients are normalized by the appropriate spectral descriptor, then quantized and coded, with one exception. Masked transform coefficients, which have lower energy than the global masking threshold, are treated differently. The authors found that masked coefficient bins tend to be clustered; therefore, they can be compactly represented using run-length encoding (RLE). RLE codewords are Huffman coded for maximum coding gain. The first CNET MDCT coder was reported to perform well for broadband signals with many harmonics but had some problems in the case of spectrally flat signals.

7.5.3 CNET MDCT Coder 2

Mahieux and Petit enhanced their 64 kb/s algorithm by incorporating a sophisticated pre-echo detection and postfiltering scheme, as well as by incorporating a novel quantization scheme for two-coefficient (low-frequency) spectral descriptor bands [Mahi94]. For improved quantization performance, two-component spectral descriptors are efficiently vector encoded in terms of polar coordinates. Pre-echoes are detected at the encoder and flagged using 1 bit. The idea behind the pre-echo compensation is to temporarily activate a postfilter at the decoder in the corrupted quiet region prior to the signal attack, and therefore a stopping index must also be transmitted. The second-order IIR postfilter difference equation is given by

spf(n) = b0 s(n) + a1 spf(n − 1) + a2 spf(n − 2),    (7.2)

where s(n) is the nonpostfiltered output signal that is corrupted by pre-echo distortion, spf(n) is the postfiltered output signal, and the ai are related to the parameters αi by

a1 = α1 [1 − (p(0,0)/(p(0,0) + σb²))],    (7.3a)

a2 = α2 [1 − (p(1,0)/(p(0,0) + σb²))],    (7.3b)


where the αi are the parameters of a second-order autoregressive (AR-2) spectral estimate of the output audio s(n) during the previous nonpostfiltered frame. The AR-2 estimate of s(n) can be expressed in the time domain as

s(n) = w(n) + α1 s(n − 1) + α2 s(n − 2),    (7.4)

where w(n) represents Gaussian white noise. The prediction error is then defined as

e(n) = s(n) − ŝ(n),    (7.5)

where ŝ(n) denotes the AR-2 prediction of s(n). The parameters p(i,j) in Eqs. (7.3a) and (7.3b) are elements of the prediction error covariance matrix, P, and the parameter σb² is the pre-echo distortion variance, which is derived from side information. Pre-echo postfiltering and improved quantization schemes resulted in a subjective score of 3.65 for two-channel stereo coding at 64 kb/s per channel on the 5-point CCIR 5-grade impairment scale (described in Section 12.3) over a wide range of listening material. The CCIR J.41 reference audio codec (MPEG-1, Layer II) achieved a score of 3.84 at 384 kb/s/channel over the same set of tests.
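A sketch of the postfilter of Eqs. (7.2)–(7.3) is shown below. The AR-2 parameters, covariance entries, and pre-echo variance are illustrative placeholders, since in the coder they come from the previous frame's AR-2 analysis and from transmitted side information, and b0 is assumed to be unity.

```matlab
alpha = [1.2 -0.5];            % AR-2 parameters of the previous frame (example)
p00   = 0.01;  p10 = 0.004;    % prediction-error covariance entries p(0,0), p(1,0)
sigb2 = 0.02;                  % pre-echo distortion variance (from side info)

a1 = alpha(1)*(1 - p00/(p00 + sigb2));   % Eq. (7.3a)
a2 = alpha(2)*(1 - p10/(p00 + sigb2));   % Eq. (7.3b)
b0 = 1;                                  % assumed feed-forward gain

s   = randn(1, 256);                     % non-postfiltered decoder output (example)
spf = filter(b0, [1 -a1 -a2], s);        % Eq. (7.2)
```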

7.6 ADAPTIVE SPECTRAL ENTROPY CODING

The MSC, OCF, PXFM, Brandenburg-Johnston hybrid, and CNET transform coders were eventually clustered into a single proposal by the ISO/IEC JTC1/SC2 WG11 committee. As a result, Schroeder, Brandenburg, Johnston, Herre, and Mahieux collaborated in 1991 to propose, for acceptance as the new MPEG audio compression standard, a flexible coding algorithm, ASPEC, which incorporated the best features of each coder in the group. ASPEC [Bran91] was claimed to produce better quality than any of the individual coders at 64 kb/s.

The structure of ASPEC combines elements from all of its predecessors. Like OCF and the later CNET coders, ASPEC uses the MDCT for time-frequency mapping. The masking model is similar to that used in PXFM and the Brandenburg-Johnston hybrid coders, including the sophisticated tonality estimation scheme at lower bit rates. The quantization and coding procedures use the pair of nested loops proposed for OCF, as well as the block differential coding scheme developed at CNET. Moreover, long runs of masked coefficients are run-length and Huffman encoded. Quantized scale factors and transform coefficients are Huffman coded also. Pre-echoes are controlled using a dynamic window switching mechanism, like the Thomson coder [Edle89]. ASPEC offers several modes for different quality levels, ranging from 64 to 192 kb/s per channel. A real-time ASPEC implementation for coding one channel at 64 kb/s was realized on a pair of 33-MHz Motorola DSP56001 devices. ASPEC ultimately formed the basis for Layer III of the MPEG-1 and MPEG-2/BC-LSF standards. We note that similar contributions were made in the area of transform coding for audio outside of the ASPEC cluster. For example, Iwadare et al. reported on DCT-based [Sugi90] and MDCT-based [Iwad92] perceptual adaptive transform coders that control pre-echo distortion using an adaptive window size.


7.7 DIFFERENTIAL PERCEPTUAL AUDIO CODER

Other investigators have also developed promising schemes for transform coding of audio. Paraskevas and Mourjopoulos [Para95] reported on a differential perceptual audio coder (DPAC), which makes use of a novel scheme for exploiting long-term correlations. DPAC works as follows. Input audio is transformed using the MDCT. A two-state classifier then labels each new frame of transform coefficients as either a "reference" frame or a "simple" frame. The classifier labels as "reference" frames that contain significant audible differences from the previous frame. The classifier labels nonreference frames as "simple." Reference frames are quantized and encoded using scalar quantization and psychoacoustic bit allocation strategies similar to Johnston's PXFM. Simple frames, however, are subjected to coefficient substitution. Coefficients whose magnitude differences with respect to the previous reference frame are below an experimentally optimized threshold are replaced at the decoder by the corresponding reference frame coefficients. The encoder then replaces subthreshold coefficients with zeros, thus saving transmission bits. Unlike the interframe predictive coding schemes of Mahieux and Petit, the DPAC coefficient substitution system is advantageous in that it guarantees that the "simple" frame bit allocation will always be less than or equal to the bit allocation that would be required if the frame were coded as a "reference" frame. Suprathreshold "simple" frame coefficients are coded in the same way as reference frame coefficients. DPAC performance was evaluated for frame classifiers that utilized three different selection criteria.

1. Euclidean distance: Under the Euclidean criterion, test frames satisfying the inequality

(sd^T sd / sr^T sr)^(1/2) ≤ λ    (7.6)

are classified as simple, where the vectors sr and st, respectively, contain reference and test frame time-domain samples, and the difference vector sd is defined as

sd = sr − st.    (7.7)

2. Perceptual entropy: Under the PE criterion (Eq. (5.17)), a test frame is labeled as "simple" if it satisfies the inequality

PE_S / PE_R ≤ λ,    (7.8)

where PE_S corresponds to the PE of the "simple" (coefficient-substituted) version of the test frame, and PE_R corresponds to the PE of the unmodified test frame.


3. Spectral flatness measure: Finally, under the SFM criterion (Eq. (5.13)), a test frame is labeled as "simple" if it satisfies the inequality

abs(10 log10(SFM_T / SFM_R)) ≤ λ,    (7.9)

where SFM_T corresponds to the test frame SFM, and SFM_R corresponds to the SFM of the previous reference frame. The decision threshold λ was experimentally optimized for all three criteria. Best performance was obtained while encoding source material using the PE criterion. As far as overall performance is concerned, noise-to-mask ratio (NMR) measurements were compared between DPAC and Johnston's PXFM algorithm at 64, 88, and 128 kb/s. Despite an average drop of 30–35% in PE measured at the DPAC coefficient substitution stage output relative to the coefficient substitution input, comparative NMR studies indicated that DPAC outperforms PXFM only below 88 kb/s, and then only for certain types of source material, such as pop or jazz music. The desirable PE reduction led to an undesirable drop in reconstruction quality. The authors concluded that DPAC may be preferable to algorithms such as PXFM for low-bit-rate, nontransparent applications.
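A sketch of the Euclidean-distance classifier of Eqs. (7.6)–(7.7) is shown below; the threshold lambda is the experimentally tuned parameter from the text.

```matlab
function is_simple = dpac_euclidean(s_ref, s_test, lambda)
% Label a test frame "simple" under the Euclidean criterion of Eqs. (7.6)-(7.7).
sd    = s_ref(:) - s_test(:);                        % difference vector, Eq. (7.7)
ratio = sqrt((sd.'*sd) / (s_ref(:).'*s_ref(:)));     % Eq. (7.6)
is_simple = (ratio <= lambda);
end
```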

7.8 DFT NOISE SUBSTITUTION

Whereas DPAC exploits temporal correlation, a substitution technique that exploits decorrelation was devised for coding efficiently noise-like portions of the spectrum. In a noise substitution procedure [Schu96], Schulz parameterizes transform coefficients corresponding to noise-like portions of the spectrum in terms of average power, frequency range, and temporal evolution, resulting in an increased coding efficiency of 15% on average. A temporal envelope for each parametric noise band is required because transform block sizes for most codecs are much longer (e.g., 30 ms) than the human auditory system's temporal resolution (e.g., 2 ms). In this method, noise-like spectral regions are identified in the following way. First, least-mean-square (LMS) adaptive linear predictors (LP) are applied to the output channels of a multi-band QMF analysis filter bank that has as input the original audio, s(n). A predicted signal, ŝ(n), is obtained by passing the LP output sequences through the QMF synthesis filter bank. Prediction is done in subbands rather than over the entire spectrum to prevent classification errors that could result if high-energy noise subbands were allowed to dominate predictor adaptation, resulting in misinterpretation of low-energy tonal subbands as noisy. Next, the DFT is used to obtain magnitude (S(k), Ŝ(k)) and phase components (θ(k), θ̂(k)) of the input, s(n), and prediction, ŝ(n), respectively. Then, tonality, T(k), is estimated as a function of the magnitude and phase predictability, i.e.,

T(k) = α |(S(k) − Ŝ(k))/S(k)| + β |(θ(k) − θ̂(k))/θ(k)|,    (7.10)


where α and β are experimentally determined constants. Noise substitution is applied to contiguous blocks of transform coefficient bins for which T(k) is very small. The 15% average bit savings realized using this method in conjunction with transform coding is offset to a large extent by a significant complexity increase due to the additions of the adaptive linear predictors and a multi-band analysis-synthesis QMF filter bank. As a result, the author focused his attention on the application of noise substitution to QMF-based subband coding algorithms. A modified version of this scheme was adopted as part of the MPEG-2 AAC time-frequency coder within the MPEG-4 reference model [Herr98].
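A direct transcription of Eq. (7.10) is given below; alpha and beta are the experimentally determined weights, and the magnitude and phase spectra are assumed to be supplied as vectors over the DFT bins k.

```matlab
function T = tonality_measure(S, S_hat, theta, theta_hat, alpha, beta)
% Eq. (7.10): tonality from magnitude and phase predictability.
T = alpha*abs((S - S_hat)./S) + beta*abs((theta - theta_hat)./theta);
end
```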

7.9 DCT WITH VECTOR QUANTIZATION

For the most part, the algorithms described thus far rely upon scalar quantization of transform coefficients. This is not unreasonable, since scalar quantization in combination with entropy coding can achieve very good performance. As one might expect, however, vector quantization (VQ) has also been applied to transform coding of audio, although on a much more limited scale. Gersho and Chan investigated VQ schemes for coding DCT coefficients subject to a constraint of minimum perceptual distortion. They reported on a variable-rate coder [Chan90] that achieves high quality in the range of 55–106 kb/s for audio sequences bandlimited to 15 kHz (32 kHz sample rate). After computing the DCT on 512-sample blocks, the algorithm utilizes a novel multi-stage tree-structured VQ (MSTVQ) scheme for quantization of normalized vectors, with each vector containing four DCT components. Bit allocation and vector normalization are derived at both the encoder and decoder from a sampled power spectral envelope which consists of 29 groups of transform coefficients. A simplified masking model assumes that each sample of the power envelope represents a single masker. Masking is assumed to be additive, as in the ASPEC algorithms. Thresholds are computed as a fixed offset from the masking level. The authors observed a strong correlation between the SFM and the amount of offset required to achieve high quality. Two-segment scalar quantizers that are piecewise linear on a dB scale are used to encode the power spectral envelope. Quadratic interpolation is used to restore full resolution to the subsampled envelope.

Gersho and Chan later enhanced [Chan91b] their algorithm by improving the power envelope and transform coefficient quantization schemes. In the new approach to quantization of transform coefficients, constrained-storage VQ [Chan91a] techniques are combined with the MSTVQ from the original coder, allowing the new coder to handle peak noise-to-mask ratio (NMR) requirements without impractical codebook storage requirements. In fact, CS-MSTVQ enabled quantization of 127 four-coefficient vectors using only four unique quantizers. Power spectral envelope quantization is enhanced by extending its resolution to 127 samples. The power envelope samples are encoded using a two-stage process. The first stage applies nonlinear interpolative VQ (NLIVQ), a dimensionality reduction process which represents the 127-element power spectral envelope vector using only a 12-dimensional "feature power envelope." Unstructured VQ is applied to the feature


power envelope. Then, a full-resolution quantized envelope is obtained from the unstructured VQ index into a corresponding interpolation codebook. In the second stage, segments of a power envelope residual are encoded using 8-, 9-, and 10-element TSVQ. Relative to their first VQ/DCT coder, the authors reported savings of 10–20 kb/s with no reduction in quality due to the CS-VQ and NLIVQ schemes. Although VQ schemes with this level of sophistication typically have not been seen in the audio coding literature since [Chan90] and [Chan91b] first appeared, there have been successful applications of less-sophisticated VQ in some of the standards (e.g., [Sree98a], [Sree98b]).

7.10 MDCT WITH VECTOR QUANTIZATION

Iwakami et al. developed transform-domain weighted interleave vector quantization (TWIN-VQ), an MDCT-based coder that also involves transform coefficient VQ [Iwak95]. This algorithm exploits LPC analysis, spectral interframe redundancy, and interleaved VQ.

At the encoder (Figure 7.4), each frame of MDCT coefficients is first divided by the corresponding elements of the LPC spectral envelope, resulting in a spectrally flattened quotient (residual) sequence. This procedure flattens the MDCT envelope but does not affect the fine structure. The next step, therefore, divides the first-step residual by a predicted fine structure envelope. This predicted fine structure envelope is computed as a weighted sum of three previous quantized fine structure envelopes, i.e., using backward prediction. Interleaved VQ is applied to the normalized second-step residual. The interleaved VQ vectors are structured in the following way. Each N-sample normalized second-step residual vector is split into K subvectors, each containing N/K coefficients. Second-step residuals from the N-sample vector are interleaved in the K subvectors such that the i-th subvector contains elements i + nK, where n = 0, 1, . . ., (N/K) − 1. Perceptual weighting is also incorporated by weighting each subvector by a nonlinearly transformed version of its corresponding LPC envelope component prior to the codebook search. VQ indices are transmitted to the receiver. Side information

Figure 7.4. TWIN-VQ encoder (after [Iwak95]): MDCT, LPC analysis and LPC envelope, normalization, interframe prediction, weighted interleave VQ, and denormalization, with indices and side information transmitted.


consists of VQ normalization coefficients and the LPC envelope encoded in terms of LSPs. The authors claimed higher subjective quality than MPEG-1 Layer II at 64 kb/s for 48-kHz, CD-quality audio, as well as higher quality than MPEG-1 Layer II for 32-kHz audio at 32 kb/s.
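The interleaving rule can be expressed compactly; in the sketch below, row i + 1 of the reshaped matrix holds the residual elements i + nK described above (0-based element indexing as in the text).

```matlab
% TWIN-VQ interleaving of an N-sample residual into K subvectors of N/K samples.
N = 16;  K = 4;
r   = randn(N, 1);              % normalized second-step residual (example)
sub = reshape(r, K, N/K);       % row i+1 = elements i, i+K, i+2K, ... of r
```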

TWIN-VQ performance at lower bit rates has also been investigated. At least three trends were identified during ISO-sponsored comparative tests [ISOI98] of TWIN-VQ and MPEG-2 AAC. First, AAC outperformed TWIN-VQ for bit rates above 16 kb/s. Secondly, TWIN-VQ and AAC achieved similar performance at 16 kb/s, with AAC having a slight edge. Finally, the performance of TWIN-VQ exceeded that of AAC at a rate of 8 kb/s. These results ultimately motivated a combined AAC/TWIN-VQ architecture for inclusion in MPEG-4 [Herre98]. Enhancements to the weighted interleaving scheme and LPC envelope representation [Mori96] enabled real-time implementation of stereo decoders on Pentium-I and PowerPC platforms. Channel error robustness issues are addressed in [Iked95]. A later version of the TWIN-VQ scheme is embedded in the set of tools for MPEG-4 audio.

7.11 SUMMARY

Transform coders for high-fidelity audio were described in this chapter. The transform coding algorithms presented include:

• the OCF algorithm,
• the monaural and stereophonic perceptual transform coders (PXFM and SEPXFM),
• the CNET DFT and MDCT coders,
• the ASPEC algorithm,
• the differential PAC (DPAC), and
• the TWIN-VQ algorithm.

PROBLEMS

7.1. Given the expressions for the DFT, the DCT, and the MDCT,

X_DFT(k) = (1/√(2M)) Σ_{n=0}^{2M−1} x(n) e^(−jπnk/M),  0 ≤ k ≤ 2M − 1,

X_DCT(k) = c(k) √(2/M) Σ_{n=0}^{M−1} x(n) cos[(π/M)(n + 1/2)k],  0 ≤ k ≤ M − 1,

where c(0) = 1/√2 and c(k) = 1 for 1 ≤ k ≤ M − 1, and

X_MDCT(k) = √(2/M) Σ_{n=0}^{2M−1} x(n) sin[(n + 1/2)π/2M] cos[(2n + M + 1)(2k + 1)π/4M],  0 ≤ k ≤ M − 1,

where the sine term is the window w(n).

Figure 7.5. FFT analysis/synthesis within the two bands of the QMF bank: analysis filters H0(z) and H1(z) followed by 2:1 decimation, an FFT analysis/synthesis module per band (FFT size N = 128, with L of N components selected), and synthesis filters F0(z) and F1(z) producing x′(n) from x(n).


Write the three transforms in matrix form as follows:

X_T = H x,

where H is the transform matrix, and x and X_T denote the input and transformed vectors, respectively. Note the structure in the transform matrices.
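A sketch that builds the three transform matrices directly from the definitions above is given below (M = 8 is an arbitrary choice; implicit array expansion, MATLAB R2016b or later, is assumed).

```matlab
% Transform matrices H such that X_T = H*x (DFT: 2M x 2M; DCT: M x M; MDCT: M x 2M).
M  = 8;
nL = 0:2*M-1;  nS = 0:M-1;
kL = (0:2*M-1).';  kS = (0:M-1).';

Hdft  = exp(-1j*pi*kL*nL/M) / sqrt(2*M);
c     = [1/sqrt(2); ones(M-1,1)];
Hdct  = sqrt(2/M) * (c .* cos(pi*kS*(nS+0.5)/M));
w     = sin((nL + 0.5)*pi/(2*M));                                   % sine window
Hmdct = sqrt(2/M) * (cos(pi*(2*nL + M + 1).*(2*kS + 1)/(4*M)) .* w);
```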

7.2. Give the signal flowgraph of the FFT butterfly structure for an 8-point DFT, an 8-point DCT, and an 8-point MDCT. Specify clearly the values on the nodes and the branches. [Hint: See Problem 6.16 and Figure 6.18 in Chapter 6.]

COMPUTER EXERCISES

7.3. In this problem, we will study the energy compaction of the DFT and the DCT. Use x(n) = e^(−0.5n) sin(0.4πn), n = 0, 1, . . ., 15. Plot the 16-point DFT and the 16-point DCT of the input signal x(n), and observe how the energy of the sequence is concentrated. Now pick two peaks of the DFT vector and of the DCT vector and synthesize the input signal x(n). Let the synthesized signals be x_DFT(n) and x_DCT(n). Compute the MSE values between the input signal and the two reconstructed signals. Repeat this for four peaks, six peaks, and eight peaks. Plot the estimated MSE values across the number of peaks selected and comment on your result.
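A sketch of the peak-picking experiment in Problem 7.3 is given below; dct and idct from the Signal Processing Toolbox are used for convenience rather than the hand-built matrices of Problem 7.1.

```matlab
% Energy compaction of the DFT vs. the DCT (Problem 7.3).
n = 0:15;  x = exp(-0.5*n).*sin(0.4*pi*n);
Xf = fft(x);  Xc = dct(x);
for P = [2 4 6 8]
    [~, i1] = sort(abs(Xf), 'descend');          % keep the P largest DFT bins
    Yf = zeros(size(Xf));  Yf(i1(1:P)) = Xf(i1(1:P));
    [~, i2] = sort(abs(Xc), 'descend');          % keep the P largest DCT bins
    Yc = zeros(size(Xc));  Yc(i2(1:P)) = Xc(i2(1:P));
    mse_dft = mean((x - real(ifft(Yf))).^2);
    mse_dct = mean((x - idct(Yc)).^2);
    fprintf('P = %d  MSE(DFT) = %.4g  MSE(DCT) = %.4g\n', P, mse_dft, mse_dct);
end
```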

7.4. This computer exercise is a combination of Problems 2.24 and 2.25 in Chapter 2. In particular, the FFT analysis/synthesis module in Problem 2.25 will be used within the two bands of the QMF bank. The configuration is shown in Figure 7.5.
a. Given H0(z) = 1 − z^−1 and H1(z) = 1 + z^−1, choose F0(z) and F1(z) such that the aliasing term can be cancelled. Use L = 32 and the peak-picking method for component selection. Perform speech synthesis and give time-domain plots of both input and output speech records.
b. Use the same voiced frame selected in Problem 2.24. Give time-domain and frequency-domain plots of x′_d0(n) and x′_d1(n) in Figure 7.5.
c. Compute the overall SNR (between x(n) and x′(n)) and estimate a MOS score for the output speech.
d. Describe whether the perceptual quality of the output speech improves if the FFT analysis/synthesis module is employed within the subbands instead of using it for the entire band.
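For part (a), one standard alias-cancelling choice is F0(z) = H1(−z) and F1(z) = −H0(−z); the short check below verifies that the aliasing term H0(−z)F0(z) + H1(−z)F1(z) vanishes for the given filters.

```matlab
h0  = [1 -1];            % H0(z) = 1 - z^-1
h1  = [1  1];            % H1(z) = 1 + z^-1
h0m = h0 .* [1 -1];      % H0(-z)
h1m = h1 .* [1 -1];      % H1(-z)
f0  =  h1m;              % F0(z) =  H1(-z)
f1  = -h0m;              % F1(z) = -H0(-z)
alias_term = conv(h0m, f0) + conv(h1m, f1)   % = [0 0 0], aliasing cancelled
```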

CHAPTER 8

SUBBAND CODERS

8.1 INTRODUCTION

Similar to the transform coders described in the previous chapter, subband coders also exploit signal redundancy and psychoacoustic irrelevancy in the frequency domain. The audible frequency spectrum (20 Hz–20 kHz) is divided into frequency subbands using a bank of bandpass filters. The output of each filter is then sampled and encoded. At the receiver, the signals are demultiplexed, decoded, demodulated, and then summed to reconstruct the signal. Audio subband coders realize coding gains by efficiently quantizing decimated output sequences from perfect reconstruction filter banks. Efficient quantization methods usually rely upon psychoacoustically controlled dynamic bit allocation rules that allocate bits to subbands in such a way that the reconstructed output signal is free of audible quantization noise or other artifacts. In a generic subband audio coder, the input signal is first split into several uniform or nonuniform subbands using some critically sampled, perfect reconstruction (or nearly perfect reconstruction) filter bank. Nonideal reconstruction properties in the presence of quantization noise are compensated for by utilizing subband filters that have good sidelobe attenuation. Then, decimated output sequences from the filter bank are normalized and quantized over short, 2–10 ms blocks. Psychoacoustic signal analysis is used to allocate an appropriate number of bits for the quantization of each subband. The usual approach is to allocate an adequate number of bits to mask quantization noise in each block while simultaneously satisfying some bit rate constraint. Since masking thresholds, and hence bit allocation requirements, are time-varying, buffering is often introduced to match the coder output to a fixed rate. The encoder




sends to the decoder quantized subband output samples, normalization scale factors for each block of samples, and bit allocation side information. Bit allocation may be transmitted as explicit side information, or it may be implicitly represented by some parameter such as the scale factor magnitudes. The decoder uses the side information and scale factors, in conjunction with an inverse filter bank, to reconstruct a coded version of the original input.

The purpose of this chapter is to expose the reader to subband coding algorithms for high-fidelity audio. This chapter is organized much like Chapter 7. The first portion of the chapter is concerned with early subband algorithms that not only contributed to the MPEG-1 standardization but also had an impact on later developments in the field. The remainder of the chapter examines a variety of recent experimental subband algorithms that make use of discrete wavelet transforms (DWT), discrete wavelet packet transforms (DWPT), and hybrid filter banks. The chapter is organized as follows. Section 8.1.1 concentrates upon the early subband coding algorithms for high-fidelity audio, including Masking Pattern Adapted Universal Subband Integrated Coding and Multiplexing (MUSICAM). Section 8.2 presents the filter-bank interpretations of the DWT and the DWPT. Section 8.3 addresses subband audio coding algorithms in which time-invariant and time-varying, signal-adaptive filter banks are constructed from the DWT and the DWPT. Section 8.4 examines the use of nonuniform filter banks related to the DWPT. Sections 8.5 and 8.6 are concerned with hybrid subband architectures involving sinusoidal modeling and code-excited linear prediction (CELP). Finally, Section 8.7 addresses subband audio coding with IIR filter banks.

8.1.1 Subband Algorithms

This section is concerned with early subband algorithms proposed by researchers from the Institut für Rundfunktechnik (IRT) [Thei87] [Stoll88], Philips Research Laboratories [Veld89], and CCETT. Much of this work was motivated by standardization activities for the European Eureka-147 digital broadcast audio (DBA) system. The ISO/IEC eventually clustered the IRT, Philips, and CCETT proposals into the MUSICAM algorithm [Wies90] [Dehe91], which was adopted as part of the ISO/IEC MPEG-1 and MPEG-2 BC-LSF audio coding standards.

8.1.1.1 Masking Pattern Adapted Subband Coding (MASCAM). The MUSICAM algorithm is derived from coders developed at IRT, Philips, and CNET. At IRT, Theile, Stoll, and Link developed Masking Pattern Adapted Subband Coding (MASCAM), a subband audio coder [Thei87] based upon a tree-structured quadrature mirror filter (QMF) bank that was designed to mimic the critical band structure of the auditory filter bank. The coder has 24 nonuniform subbands, with bandwidths of 125 Hz below 1 kHz, 250 Hz in the range 1–2 kHz, 500 Hz in the range 2–4 kHz, 1 kHz in the range 4–8 kHz, and 2 kHz from 8 kHz to 16 kHz. The prototype QMF has 64 taps. Subband output sequences are processed in 2-ms blocks. A normalization scale factor is quantized


and transmitted for each block from each subband. Subband bit allocations are derived from a simplified psychoacoustic analysis. The original coder reported in [Thei87] considered only in-band simultaneous masking. Later, as described in [Stol88], interband simultaneous masking and temporal masking were added to the bit rate calculation. Temporal postmasking is exploited by updating scale factors less frequently during periods of signal decay. The MASCAM coder was reported to achieve high-quality results for 15-kHz bandwidth input signals at bit rates between 80 and 100 kb/s per channel. A similar subband coder was developed at Philips during this same period. As described by Veldhuis et al. in [Veld89], the Philips group investigated subband schemes based on 20- and 26-band nonuniform filter banks. Like the original MASCAM system, the Philips coder relies upon a highly simplified masking model that considers only the upward spread of simultaneous masking. Thresholds are derived from a prototype basilar excitation function under worst-case assumptions regarding the frequency separation of masker and maskee. Within each subband, signal energy levels are treated as single maskers. Given SNR targets due to the masking model, uniform ADPCM is applied to the normalized output of each subband. The Philips coder was claimed to deliver high-quality coding of CD-quality signals at 110 kb/s for the 26-band version and 180 kb/s for the 20-band version.

8.1.1.2 Masking Pattern Adapted Universal Subband Integrated Coding and Multiplexing (MUSICAM). Based primarily upon coders developed at IRT and Philips, the MUSICAM algorithm [Wies90] [Dehe91] was successful in the 1990 ISO/IEC competition [SBC90] for a new audio coding standard. It eventually formed the basis for MPEG-1 and MPEG-2 audio Layers I and II. Relative to its predecessors, MUSICAM (Figure 8.1) makes several practical tradeoffs between complexity, delay, and quality. By utilizing a uniform bandwidth, 32-band pseudo-QMF bank (a.k.a. "polyphase" filter bank) instead of a tree-structured QMF bank, both complexity and delay are greatly reduced relative to the IRT and Philips coders. Delay and complexity are 10.66 ms and 5 MFLOPS, respectively. These improvements are realized at the expense of using a sub-optimal

Figure 8.1. MUSICAM encoder (after [Wies90]): 32-channel polyphase analysis filter bank (750 Hz @ 48 kHz), 1024-point FFT, psychoacoustic analysis, bit allocation, and quantization of scale factors and samples (8/16/24 ms), with side information transmitted.


filter bank, however, in the sense that filter bandwidths (constant 750 Hz for a 48-kHz sample rate) no longer correspond to the critical band rate. Despite these excessive filter bandwidths at low frequencies, high-quality coding is still possible with MUSICAM due to its enhanced psychoacoustic analysis. High-resolution spectral estimates (46 Hz/line at a 48-kHz sample rate) are obtained through the use of a 1024-point FFT in parallel with the PQMF bank. This parallel structure allows for improved estimation of masking thresholds and, hence, determination of more accurate minimum signal-to-mask ratios (SMRs) required within each subband.

The MUSICAM psychoacoustic analysis procedure is essentially the same as the MPEG-1 psychoacoustic model 1. The remainder of MUSICAM works as follows. Subband output sequences are processed in 8-ms blocks (12 samples at 48 kHz), which is close to the temporal resolution of the auditory system (4–6 ms). Scale factors are extracted from each block and encoded using 6 bits over a 120-dB dynamic range. Occasionally, temporal redundancy is exploited by repetition over 2 or 3 blocks (16 or 24 ms) of slowly changing scale factors within a single subband. Repetition is avoided during transient periods such as sharp attacks. Subband samples are quantized and coded in accordance with SMR requirements for each subband, as determined by the psychoacoustic analysis. Bit allocations for each subband are transmitted as side information. On the CCIR five-grade impairment scale, MUSICAM scored 4.6 (std. dev. 0.7) at 128 kb/s and 4.3 (std. dev. 1.1) at 96 kb/s per monaural channel, compared to 4.7 (std. dev. 0.6) on the same scale for the uncoded original. Quality was reported to suffer somewhat at 96 kb/s for critical signals which contained sharp attacks (e.g., triangle, castanets), and this was reflected in a relatively high standard deviation of 1.1. MUSICAM was selected by ISO/IEC for MPEG-1 audio due to its desirable combination of high quality, reasonable complexity, and manageable delay. Also, bit error robustness was found to be very good (errors nearly imperceptible) up to a bit error rate of 10^−3.

8.2 DWT AND DISCRETE WAVELET PACKET TRANSFORM (DWPT)

The previous section described subband coding algorithms that utilize banks of fixed-resolution bandpass QMF or pseudo-QMF finite impulse response (FIR) filters. This section describes a different class of subband coders that rely instead upon a filter-bank interpretation of the discrete wavelet transform (DWT). DWT-based subband coders offer increased flexibility over the subband coders described previously, since identical filter-bank magnitude frequency responses can be obtained for many different choices of a wavelet basis, or, equivalently, choices of filter coefficients. This flexibility presents an opportunity for basis optimization. The advantage of this optimization in the audio coding application is illustrated by the following example. First, a desired filter-bank magnitude response can be established. This response might be matched to the auditory filter bank. Then, for each segment of audio, one can adaptively choose a wavelet basis that minimizes the rate for some target distortion level. Given a psychoacoustically derived distortion target, the encoding remains perceptually transparent.


Figure 8.2. Filter-bank interpretation of the DWT: y = Qx is equivalent to filtering x with Hlp(z) and Hhp(z), followed by 2:1 decimation, to produce ylp and yhp.

A detailed discussion of specific technical conditions associated with the various wavelet families is beyond the scope of this book, and this chapter therefore concentrates upon high-level coder architectures. In-depth treatment of wavelets is available from many sources, e.g., [Daub92]. Before describing the wavelet-based coders, however, it is useful to summarize some basic wavelet characteristics. Wavelets are a family of basis functions for the space of square integrable signals. A finite energy signal can be represented as a weighted sum of the translates and dilates of a single wavelet. Continuous-time wavelet signal analysis can be extended to discrete-time and square summable sequences. Under certain assumptions, the DWT acts as an orthonormal linear transform T: R^N → R^N. For a compact (finite) support wavelet of length K, the associated transformation matrix, Q, is fully determined by a set of coefficients {ck} for 0 ≤ k ≤ K − 1. As shown in Figure 8.2, this transformation matrix has an associated filter-bank interpretation. One application of the transform matrix, Q, to an N × 1 signal vector, x, generates an N × 1 vector of wavelet-domain transform coefficients, y. The N × 1 vector y can be separated into two N/2 × 1 vectors of approximation and detail coefficients, ylp and yhp, respectively. The spectral content of the signal x captured in ylp and yhp corresponds to the frequency subbands realized in the 2:1 decimated output sequences from a QMF bank (Section 6.4), which obeys the "power complementary condition," i.e.,

|Hlp(Ω)|² + |Hlp(Ω + π)|² = 1,    (8.1)

where Hlp(Ω) is the frequency response of the lowpass filter. Therefore, recursive DWT applications effectively pass input data through a tree-structured cascade of lowpass (LP) and highpass (HP) filters followed by 2:1 decimation at every node. The forward/inverse transform matrices of a particular wavelet are associated with a corresponding QMF analysis/synthesis filter bank. The usual wavelet decomposition implements an octave-band filter bank structure, as shown in Figure 8.3. In the figure, frequency subbands associated with the coefficients from each stage are schematically represented for an audio signal sampled at 44.1 kHz.
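A minimal sketch of one DWT stage in its filter-bank form is given below, using the Haar filters as an example basis; with unit-norm filters the sum in Eq. (8.1) evaluates to a constant equal to 2, so the normalization used in Eq. (8.1) is assumption-dependent.

```matlab
% One stage of the DWT as lowpass/highpass filtering followed by 2:1 decimation.
hlp = [1  1]/sqrt(2);                 % lowpass (approximation) filter
hhp = [1 -1]/sqrt(2);                 % highpass (detail) filter

x   = randn(1, 64);                   % any finite-energy test signal
ylp = downsample(conv(x, hlp), 2);    % approximation coefficients
yhp = downsample(conv(x, hhp), 2);    % detail coefficients

% numerical check of the power-complementary behavior of the lowpass filter
w = linspace(0, pi, 256);
P = abs(freqz(hlp, 1, w)).^2 + abs(freqz(hlp, 1, w + pi)).^2;   % constant (= 2 here)
```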

Wavelet packet (WP) or discrete wavelet packet transform (DWPT) representations, on the other hand, decompose both the detail and approximation coefficients at each stage of the tree, as shown in Figure 8.4. In the figure, frequency subbands


Figure 8.3. Octave-band subband decomposition associated with a discrete wavelet transform ("DWT"); subbands y1 through y5 span the frequency axis up to 22 kHz.

associated with the coefficients from each stage are schematically represented for a 44.1-kHz sample rate.

A filter-bank interpretation of wavelet transforms is attractive in the context of audio coding algorithms. Wavelet or wavelet packet decompositions can be tree structured as necessary (unbalanced trees are possible) to decompose input audio into a set of frequency subbands tailored to some application. It is possible, for example, to approximate the critical band auditory filter bank utilizing a wavelet packet approach. Moreover, many K-coefficient finite support wavelets are associated with a single magnitude frequency response QMF pair, and therefore a specific subband decomposition can be realized while retaining the freedom to choose a wavelet basis which is in some sense "optimal." These considerations have motivated the development of several experimental wavelet-based subband coders in recent years. The basic idea behind DWT- and DWPT-based subband coders is to quantize and encode efficiently the coefficient sequences associated with each stage of the wavelet decomposition tree, using the same noise shaping techniques as the previously described perceptual subband coders.

The next few sections of this chapter, Sections 8.3 through 8.5, expose the reader to several WP-based subband coders developed in the early 1990s by Sinha, Tewfik, et al. [Sinh93a] [Sinh93b] [Tewf93], as well as more recently proposed hybrid sinusoidal/WPT algorithms developed by Hamdy and Tewfik [Hamd96], Boland and Deriche [Bola97], and Pena et al. [Pena96] [Prel96a] [Prel96b] [Pena97a]. The core of at least one experimental WP audio coder [Sinh96] has been embedded in a commercial standard, namely, the AT&T Perceptual Audio Coder (PAC) [Sinh98]. Although not addressed in this chapter, we note that other studies of DWT- and DWPT-based audio coding schemes have appeared. For example, experimental coder architectures for low-complexity, low-delay, combined wavelet/multipulse LPC coding, and combined scalar/vector quantization of transform coefficients were reported, respectively, by Black and Zeytinoglu [Blac95], Kudumakis and Sandler [Kudu95a] [Kudu95b] [Kudu96], and Boland and Deriche [Bola95] [Bola96]. Several bit rate scalable DWPT-based schemes have also been investigated recently. For example, a fixed-tree DWPT coding scheme capable of nearly transparent quality with scalable bitrates below 100 kb/s was proposed by Dobson et al. and implemented in real-time on a 75-MHz Pentium-class platform [Dobs97]. Additionally, Lu and Pearlman investigated a rate-scalable DWPT-based coder that applies set partitioning in hierarchical trees (SPIHT) to generate an embedded bitstream. Nearly transparent quality was reported at bit rates between 55 and 66 kb/s [Lu98].

Figure 8.4. Subband decomposition associated with a particular wavelet packet transform ("WPT" or "WP"); subbands y1 through y8 span the frequency axis up to 22 kHz. Although the picture illustrates a balanced binary tree and the associated uniform bandwidth subbands, nodes could be pruned in order to achieve nonuniform frequency subdivision.


8.3 ADAPTED WP ALGORITHMS

The "best basis" methodologies [Coif92] [Wick94] for adapting the WP tree structure to signal properties are typically formulated in terms of Shannon entropy [Shan48] and other perceptually blind statistical measures. For a given WP tree, related research directed towards optimal filter selection [Hedg97] [Hedg98a] [Hedg98b] has also emphasized optimization of statistical rather than perceptual properties. The questions of perceptually motivated filter selection and tree construction are central to successful application of WP analysis in audio coding algorithms. The WP tree structure determines the time and frequency resolution of the transform and therefore also creates a particular tiling of the time-frequency plane. Several WP audio algorithms [Sinh93b] [Dobs97] have successfully employed time-invariant WP tree structures that mimic the ear's critical band frequency resolution properties. In some cases, however, a more efficient perceptual bit allocation is possible with a signal-specific time-frequency tiling that tracks the shape of the time-varying masking threshold. Some examples are described next.

8.3.1 DWPT Coder with Globally Adapted Daubechies Analysis Wavelet

Sinha and Tewfik developed a variable-rate wavelet-based coding scheme for which they reported nearly transparent coding of CD-quality audio at 48-64 kb/s [Sinh93a] [Sinh93b]. The encoder (Figure 8.5) exploits redundancy using a VQ scheme and irrelevancy using a wavelet packet (WP) signal decomposition combined with perceptual masking thresholds. The algorithm works as follows. Input audio is segmented into N x 1 vectors, which are then

Figure 8.5. Dynamic dictionary/optimal wavelet packet encoder (after [Sinh93a]).

windowed using a 1/16th-overlap square-root Hann window. The dynamic dictionary (DD), which is essentially an adaptive VQ subsystem, then eliminates signal redundancy. A dictionary of N x 1 codewords is searched for the vector perceptually closest to the input vector. The effective size of the dictionary is made larger than its actual size by a novel correlation lag search/time-warping procedure that identifies two N/2-sample codewords for each N-sample input vector. At both the transmitter and receiver, the dictionary is systematically updated with N-sample reconstructed output audio vectors according to a perceptual distance criterion and last-used-first-out rule. After the DD procedure has been completed, an optimized WP decomposition is applied to the original signal as well as the DD residual. The decomposition tree is structured such that its 29 frequency subbands roughly correspond to the critical bands of the auditory filter bank. A masking threshold, obtained as in [Veld89], is assumed constant within each subband and then used to compute a perceptual bit allocation.

The encoder transmits the particular combination of DD and WP informationthat minimizes the bit rate while maintaining perceptual quality Three combi-nations are possible In one scenario the DD index and time-warping factor aretransmitted alone if the DD residual energy is below the masking threshold atall frequencies Alternatively if the DD residual has audible noise energy thenWP coefficients of the DD residual are also quantized encoded and transmittedIn some cases however WP coefficients corresponding to the original signal aremore compactly represented than the combination of the DD plus WP residualinformation In this case the DD information is discarded and only quantizedand encoded WP coefficients are transmitted In the latter two cases the encoderalso transmits subband scale factors bit allocations and energy normalizationside information

This algorithm is unique in that it contains the first reported application of adapted WP analysis to perceptual subband coding of high-fidelity, CD-quality audio. During each frame, the WP basis selection procedure applies an optimality criterion of minimum bit rate for a given distortion level. The adaptation is "global" in the sense that the same analysis wavelet is applied to the entire decomposition. The authors reached several conclusions regarding the optimal compact support (K-coefficient) wavelet basis when selecting from among the Daubechies orthogonal wavelet bases ([Daub88]).

First, optimization produced average bit rate savings, dependent on filter length, of up to 15%. Average bit rate savings were 3%, 6.5%, 8.75%, and 15% for wavelets selected from the sets associated with coefficient sequences of lengths 10, 20, 40, and 60, respectively. In an extreme case, a savings of 1.7 bits/sample is realized for transparent coding of a difficult castanets sequence when using best-case rather than worst-case wavelets (0.8 vs. 2.5 bits/sample for K = 40). The second conclusion reached by the researchers was that it is not necessary to search exhaustively the space of all wavelets for a particular value of K. The search can be constrained to wavelets with K/2 vanishing moments (the maximum possible number) with minimal impact on bit rate. The frequency responses of the filters associated with a p-th-order vanishing moment wavelet have p-th-order zeros at the foldover frequency, i.e., Ω = π. Only a 3.1% bitrate reduction was realized for an exhaustive search versus a maximal vanishing moment constrained search. Third, the authors found that larger K, i.e., more taps, and deeper decomposition trees tended to yield better results. Given identical distortion criteria for a castanets sequence, bit rates of 2.1 bits/sample for K = 4 wavelets were realized versus 0.8 bits/sample for K = 40 wavelets.

As far as quality is concerned subjective tests showed that the algorithmproduced transparent quality for certain test material including drums pop vio-lin with orchestra and clarinet Subjects detected differences however for thecastanets and piano sequences These difficulties arise respectively because ofinadequate pre-echo control and inefficient modeling of steady sinusoids Thecoder utilizes only an adaptive window scheme which switches between 1024and 2048-sample windows Shorter windows (N = 1024 or 23 ms) are used forsignals that are likely to produce pre-echoes The piano sequence contained longsegments of nearly steady or slowly decaying sinusoids The wavelet coder doesnot handle steady sinusoids as well as other signals With the exception of thesetroublesome signals in a comparative test one additional expert listener alsofound that the WP coder outperformed MPEG-1 layer II at 64 kbs

Tewfik and Ali later enhanced the WP coder to improve pre-echo control and increase coding efficiency. After elimination of the dynamic dictionary, they reported improved quality in the range of 55 to 63 kb/s, as well as a real-time implementation of a simplified 64 to 78 kb/s coder on two TMS320C31 devices [Tewf93]. Other improvements included exploitation of auditory temporal masking for pre-echo control, more efficient quantization and encoding of scale factors, and run-length coding of long zero sequences. The improved WP coder also upgraded its psychoacoustic analysis section with a more sophisticated model similar to Johnston's PXFM coder [John88a]. The most notable improvement occurred in the area of pre-echo control. This was accomplished in the following manner. First, input frames likely to produce pre-echoes are identified using a normalized energy measure criterion. These frames are parsed into 5-ms time slots (256 samples). Then, WP coefficients from all scales within each time slot are combined to estimate subframe energies. Masking thresholds computed over the global 1024-sample frame are assumed only to apply during high-energy time slots. Masking thresholds are reduced across all subbands for low-energy time slots, utilizing weighting factors proportional to the energy ratio between high- and low-energy time slots. The remaining enhancements of improved scale factor coding efficiency and run-length coding of zero sequences more than compensated for removal of the dynamic dictionary.
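A minimal sketch of the time-slot idea just described is given below: masking thresholds computed over a long frame are kept for high-energy slots and scaled down for low-energy slots. The 256-sample slot size follows the text, but the exact weighting rule and the toy data are assumptions made only for illustration.

```python
# Illustrative pre-echo control sketch: slot-energy-weighted masking thresholds.
import numpy as np

def slot_weighted_thresholds(frame, thresholds, slot_len=256):
    """frame: 1024-sample frame flagged as pre-echo prone.
    thresholds: per-subband masking thresholds computed over the whole frame.
    Returns one (possibly reduced) threshold vector per 5-ms time slot."""
    slots = frame.reshape(-1, slot_len)
    energies = np.sum(slots ** 2, axis=1) + 1e-12
    e_max = energies.max()
    per_slot = []
    for e in energies:
        # High-energy slots keep the frame thresholds; low-energy slots get
        # thresholds reduced in proportion to the slot/peak energy ratio.
        weight = min(1.0, e / e_max)
        per_slot.append(thresholds * weight)
    return per_slot

frame = np.random.randn(1024)
frame[:256] *= 0.01                 # quiet region preceding an attack
thr = np.ones(29)                   # toy per-subband thresholds
print([round(t[0], 3) for t in slot_weighted_thresholds(frame, thr)])
```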

8.3.2 Scalable DWPT Coder with Adaptive Tree Structure

Srinivasan and Jamieson proposed a WP-based audio coding scheme [Srin97][Srin98] in which a signal-specific perceptual best basis is constructed by adaptingthe WP tree structure on each frame such that perceptual entropy and ultimatelythe bit rate are minimized While the tree structure is signal-adaptive the analysis

Figure 8.6. Masking-threshold adapted WP audio coder [Srin98]. On each frame, the WP tree structure is adapted in order to minimize a perceptually motivated rate constraint.

filters are time-invariant and obtained from the family of spline-based biorthogonal wavelets [Daub92]. The algorithm (Figure 8.6) is also unique in the sense that it incorporates mechanisms for both bit rate and complexity scaling. Before the tree adaptation process can commence for a given frame, a set of 63 masking thresholds corresponding to a set of threshold frequency partitions roughly 1/3 Bark wide is obtained from the ISO/IEC MPEG-1 psychoacoustic model recommendation 2 [ISOI92]. Of course, depending upon the WP tree, the subbands may or may not align with the threshold partitions. For any particular WP tree, the associated bit rate (cost) is computed by extracting the minimum masking thresholds from each subband and then allocating sufficient bits to guarantee that the quantization noise in each band does not exceed the minimum threshold.

The objective of the tree adaptation process, therefore, is to construct a minimum cost subband decomposition by maximizing the minimum masking threshold in every subband. Figure 8.7a shows a possible subband structure in which subband 0 contains five threshold partitions. This choice of bandsplitting is clearly undesirable since the minimum masking threshold for partition 1 is far below partition 4. Bit allocation for subband 0 will be forced to satisfy partition 1, with a resulting overallocation for partitions 2 through 5.

It can be seen that subdividing the band (Figure 8.7b) relaxes the minimum masking threshold in band 1 to the level of partition 5. Naturally, the ideal bandsplitting would in this case ultimately match the subband boundaries to the threshold partition boundaries. On each frame, therefore, the tree adaptation process performs the following top-down, iterative "growing" procedure. During any iteration, the existing subbands are assigned individual costs based on the bit allocation required for transparent coding. Then, a decision on whether or not to subdivide further at a node is made on the basis of cost reduction. Subbands are examined for potential splitting in order of decreasing cost, and the search is "breadth-first," meaning that each level is completely decomposed before proceeding to the next level. Subdivision occurs only if the associated bit rate improvement exceeds a threshold. The tree adaptation is also constrained by a complexity scaling mechanism. Top-down tree growth is halted by the complexity scaling constraint λ when the estimated total cost of computing the

Figure 8.7. Example masking-threshold adapted WP filter bank: (a) initial condition, (b) after one iteration. Each panel plots masking threshold versus frequency; threshold partitions are denoted by dashed lines and labeled by tk, and idealized subband boundaries are denoted by heavy black lines. Under the initial condition, with only one subband (subband 0), the minimum masking threshold is given by t1 (Min = t1), and therefore the bit allocation will be relatively large in order to satisfy a small threshold. After one band splitting, however, the minimum threshold in subband 1 increases from t1 to t5 (Min = t5), thereby reducing the perceptual bit allocation. Hence, the cost function is reduced in part (b) relative to part (a).

DWPT reaches a predetermined limit With this feature it is envisioned that ina real-time environment the WP adaptation process could respond to changingCPU resources by controlling the cost of the analysis and synthesis filter banks
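The sketch below illustrates the splitting rule described above in a deliberately simplified, recursive form (the algorithm in [Srin98] grows the tree breadth-first and in order of decreasing cost): a band is subdivided only when doing so reduces an estimated perceptual bit cost by more than a margin. The cost model (bits proportional to the dB gap between in-band power and the minimum masking threshold) and all numbers are illustrative assumptions.

```python
# Greedy masking-threshold-driven band splitting (simplified, illustrative).
import numpy as np

def band_cost(power, thresholds, lo, hi):
    """Rough bit cost for coding threshold partitions lo..hi as one subband."""
    t_min = thresholds[lo:hi].min()
    snr_db = 10 * np.log10(power[lo:hi].sum() / t_min)
    return max(snr_db, 0.0) / 6.02          # ~1 bit per 6 dB of required SNR

def grow(power, thresholds, lo, hi, min_gain=0.5):
    cost = band_cost(power, thresholds, lo, hi)
    if hi - lo <= 1:
        return [(lo, hi)], cost
    mid = (lo + hi) // 2
    left, cl = grow(power, thresholds, lo, mid, min_gain)
    right, cr = grow(power, thresholds, mid, hi, min_gain)
    if cost - (cl + cr) > min_gain:          # split only if it saves enough bits
        return left + right, cl + cr
    return [(lo, hi)], cost

power = np.abs(np.random.randn(64)) + 0.1    # toy partition powers
thr = np.abs(np.random.randn(64)) * 0.01 + 1e-3
bands, bits = grow(power, thr, 0, 64)
print(bands, round(bits, 1))
```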

In [Srin98], a complexity-constrained tree adaptation procedure is shown to yield a basis requiring the fewest bits for perceptually transparent coding for a given complexity and temporal resolution. After the WP tree adaptation procedure has been completed, Shapiro's zerotree algorithm [Shap93] is applied iteratively to quantize the coefficients and exploit remaining temporal correlation until the perceptual rate-distortion criteria are satisfied, i.e., until sufficient bits have been allocated to satisfy the perceptually transparent bit rate associated with the given

WP tree. The zerotree technique has the added benefit of generating an embedded bitstream, making this coder amenable to progressive transmission. In scalable applications, the embedded bitstream has the property that it can be partially decoded and still guarantee the best possible quality of reconstruction given the number of bits decoded. The complete bitstream consists of the encoded tree structure, the number of zerotree iterations, and a block of zerotree encoded data. These elements are coded in a lossless fashion (e.g., Huffman, arithmetic, etc.) to remove any remaining redundancies and transmitted to the decoder. For informal listening tests over coded program material that included violin, violin/viola, flute, sitar, vocals/orchestra, and sax, the coded outputs at rates in the vicinity of 45 kb/s were reported to be indistinguishable from the originals, with the exceptions of the flute and sax.

8.3.3 DWPT Coder with Globally Adapted General Analysis Wavelet

Srinivasan and Jamieson [Srin98] demonstrated the advantages of a masking threshold adapted WP tree with a time-invariant analysis wavelet. On the other hand, Sinha and Tewfik [Sinh93b] used a time-invariant WP tree but a globally adapted analysis wavelet to demonstrate that there exists a signal-specific "best" wavelet basis in terms of perceptual coding gain for a particular number of filter taps. The basis optimization in [Sinh93b], however, was restricted to Daubechies' wavelets. Recent research has attempted to identify which wavelet properties portend an optimal basis, as well as to consider basis optimization over a broader class of wavelets. In an effort to identify those wavelet properties that could be associated with the "best" filter, Philippe et al. measured the impact on perceptual coding gain of wavelet regularity, AR(1) coding gain, and filter bank frequency selectivity [Phil95a] [Phil95b]. The study compared performance between orthogonal Rioul [Riou94], orthogonal Onno [Onno93], and the biorthogonal wavelets of [More95] in a WP coding scheme that had essentially the same time-invariant critical band WP decomposition tree as [Sinh93b]. Using filters of lengths varying between 4 and 120 taps, minimum bit rates required for transparent coding in accordance with the usual perceptual subband bit allocations were measured for each wavelet. For a given filter length, the results suggested that neither regularity nor frequency selectivity mattered significantly. On the other hand, the minimum bit rate required for transparent coding was shown to decrease with increasing analysis filter AR(1) coding gain, leading the authors to conclude that AR(1) coding gain is a legitimate criterion for WP filter selection in perceptual coding schemes.
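To indicate how an AR(1) coding-gain criterion can be evaluated for a candidate filter pair, the sketch below estimates the two subband variances from the filter magnitude responses and an AR(1) power spectrum, and forms the arithmetic-to-geometric mean ratio (the classical subband coding gain for equal-bandwidth bands under optimal bit allocation). The filter pair, correlation coefficient, and normalization are illustrative assumptions, not the procedure of [Phil95a] [Phil95b].

```python
# AR(1) coding gain of a two-channel analysis pair (illustrative sketch).
import numpy as np

def ar1_psd(omega, rho):
    return (1 - rho ** 2) / np.abs(1 - rho * np.exp(-1j * omega)) ** 2

def coding_gain(h0, h1, rho=0.95, nfft=4096):
    omega = 2 * np.pi * np.arange(nfft) / nfft
    psd = ar1_psd(omega, rho)
    variances = []
    for h in (h0, h1):
        H = np.fft.fft(h, nfft)
        variances.append(np.mean(np.abs(H) ** 2 * psd))   # subband variance estimate
    variances = np.array(variances)
    return variances.mean() / variances.prod() ** (1 / len(variances))

h0 = np.array([0.48296291, 0.83651630, 0.22414387, -0.12940952])   # db2 lowpass
h1 = h0[::-1] * np.array([1, -1, 1, -1])                            # QMF highpass
print(round(10 * np.log10(coding_gain(h0, h1)), 2), "dB")
```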

8.3.4 DWPT Coder with Adaptive Tree Structure and Locally Adapted Analysis Wavelet

Philippe et al. [Phil96] measured the perceptual coding gain associated with optimization of the WP analysis filters at every node in the tree, as well as optimization of the tree structure. In the first experiment, the WP tree structure was fixed, and then optimal filters were selected for each tree node (local adaptation) such that the bit rate required for transparent coding was minimized. Simulated annealing [Kirk83] was used to solve the discrete optimization problem posed by a search space containing 300 filters of varying lengths from the Daubechies [Daub92], Onno [Onno93], Smith-Barnwell [Smit86], Rioul [Riou94], and Akansu-Caglar [Cagl91] families. Then, the filters selected by simulated annealing were used in a second set of experiments on tree structure optimization. The best WP decomposition tree was constructed by means of a growing procedure, starting from a single cell and progressively subdividing. Further splitting at each node occurred only if it significantly reduced the perceptually transparent bit rate. As in [Phil95b], these filter and tree adaptation experiments estimated bit rates required for perceptually transparent coding of 48-kHz sampled source material using statistical signal properties. For a fixed tree, the filter adaptation experiments yielded several noteworthy results. First, a nominal bit rate reduction of 3% was realized for Onno's filters (66.5 kb/s) relative to Daubechies' filters (68 kb/s) when the same filter family was applied in all tree nodes and filter length was the only free parameter. Secondly, simulated annealing over the search space of 300 filters yielded a nominal 1% bit rate reduction (66 kb/s) relative to the Onno-only case. Finally, longer filter bank delay, i.e., longer analysis filters and hence better frequency selectivity, yielded lower bitrates. For low-delay applications, however, a sevenfold delay reduction, from 700 down to only 100 samples, is realized at the cost of only a 10% increase in bit rate. The tree adaptation experiments showed that a 16-band decomposition yielded the best bit rate when tree description overhead was accounted for. In light of these results and the wavelet adaptation results of [Sinh93b], one might conclude that WP filter and WP tree optimization are warranted only if a bit rate improvement of less than 10% justifies the added complexity.

8.3.5 DWPT Coder with Perceptually Optimized Synthesis Wavelets

The wavelet-based audio coding schemes, as well as the WP tree and filter adaptation experiments, described in the foregoing sections (e.g., [Sinh93b] [Phil95a] [Phil95b] [Phil96]) seek to maximize perceptual coding efficiency by matching subband bandwidths (i.e., the time-frequency tiling) and/or individual filter magnitude and phase characteristics to incoming signal properties. All of these techniques make use of perfect reconstruction ("PR") DWT or WP filter banks that are designed to split a signal into frequency subbands in the analysis filter bank and then later recombine the subband signals in the synthesis filter bank to reproduce exactly the original input signal. The PR property only holds, however, so long as distortion is not injected into the subband sequences, i.e., in the absence of quantization. This is an important point to consider in the context of coding. The quantization noise introduced into the subbands during bit allocation leads to filter bank-induced reconstruction artifacts, because the synthesis filter bank has carefully controlled spectral leakage properties specifically designed to cancel the aliasing and imaging distortions introduced by the critically sampled analysis-synthesis process. Whether using classical or perceptual bit allocation rules, most subband coders do not account explicitly for the filter bank distortion artifacts introduced by quantization. Using explicit knowledge of the analysis filters and the quantization noise, however, recent research has shown that reconstruction distortion can be minimized in the mean square sense (MMSE) by relaxing PR constraints and tuning the synthesis filters [Chen95] [Hadd95] [Kova95] [Delo96] [Goss97b]. Naturally, mean square error minimization is of limited value for subband audio coders. As a result, Gosse et al. [Goss95] [Goss97] extended the MMSE synthesis filter tuning procedure [Goss96] to minimize a mean perceptual error (MMPE) rather than MMSE. Experiments were conducted to determine whether or not tuned synthesis filters outperform the unmodified PR synthesis filters and, if so, whether or not MMPE filters outperform MMSE filters in subjective listening tests. A WP audio coding scheme configured for 128 kb/s operation and having a time-invariant filter-bank structure (Figure 8.8) formed the basis for the experiments.

The tree and filter selections were derived from the minimum-rate filter andtree adaptation investigation reported in [Phil96] In the figure each of the 16 sub-bands is labeled with its upper cutoff frequency (kHz) The experiments involvedfirst a design phase and then an evaluation phase During the design phase opti-mized synthesis filter coefficients were obtained as follows For the MMPE filterscoding simulations were run using the unmodified PR synthesis filter bank withpsychoacoustically derived bit allocations for each subband on each frame Amean perceptual error (MPE) was evaluated at the PR filter bank output in termsof a unique JND measure [Duro96] Then the filter tuning algorithm [Goss96]was applied to minimize the reconstruction error Since the bit allocation was per-ceptually motivated the tuning and reconstruction error minimization procedureyielded MMPE filter coefficients For the MMSE filters coefficients were alsoobtained using [Goss96] without the benefit of a perceptual bit allocation step

Figure 8.8. Wavelet packet analysis filter bank optimized for minimum bitrate, used in the MMPE experiments. Tree nodes use Haar, Daubechies (Daub-14, Daub-18, Daub-24), and Onno (Onno-4, Onno-6, Onno-32) filters, and each of the 16 subbands is labeled with its upper cutoff frequency in kHz (0.1, 0.2, 0.4, 0.6, 0.8, 1.1, 1.5, 3, 4.5, 6, 7.5, 9, 10.5, 12, 18, and 24).

During the evaluation phase of the experiments, three 128 kb/s coding simulations with psychoacoustic bit allocations were run with:

• PR synthesis filters,
• MMSE-tuned synthesis filters, and
• MMPE-tuned synthesis filters.

Performance was evaluated in terms of a perceptual objective measure (POM) [Colo95], an estimate of the probability that an expert listener can distinguish between the original and coded signal. The POM results were 44% distinguishability for the PR case versus only 16% for both the MMSE and MMPE cases. The authors concluded that synthesis filter tuning is worthwhile since some performance enhancement exists over the PR case. They also concluded that MMPE filters failed to outperform MMSE filters because they were designed to minimize the perceptual error over a long period rather than on a time-localized basis. Since perceptual signal properties are strongly time-variant, it is possible that time-variant MMPE tuning will realize some performance gain relative to MMSE tuning. The perceptual synthesis filter tuning ideas explored in this work have shown promise, but further investigation is required to better characterize its costs and benefits.

8.4 ADAPTED NONUNIFORM FILTER BANKS

The most popular method for realizing nonuniform frequency subbands is tocascade uniform filters in an unbalanced tree structure as with for examplethe DWPT For a given impulse response length however cascade structures ingeneral produce poor channel isolation Recent advances in modulated filter bankdesign methodologies (eg [Prin94]) have made tractable direct form near perfectreconstruction nonuniform designs that are critically sampled This section is con-cerned with subband coders that employ signal-adaptive nonuniform modulatedfilter banks to approximate the time-frequency analysis properties of the auditorysystem more effectively than the other subband coders Two examples are givenBeyond the pair of algorithms addressed below we note that other investigatorshave proposed nonuniform filter bank coding techniques that address redundancyreduction utilizing lattice [Mont94] and bidimensional VQ schemes [Main96]

8.4.1 Switched Nonuniform Filter Bank Cascade

Princen and Johnston developed a CD-quality coder based upon a signal-adaptive filter bank [Prin95] for which they reported quality better than the sophisticated MPEG-1 layer III algorithm at both 48 and 64 kb/s. The analysis filter bank for this coder consists of a two-stage cascade. The first stage is a 48-band nonuniform modulated filter bank split into four uniform-bandwidth sections. There are 8 uniform subbands from 0 to 750 Hz, 4 uniform subbands from 750 to 1500 Hz, 12 uniform subbands from 1.5 to 6 kHz, and 24 uniform subbands from 6 to 24 kHz.


The second stage in the cascade optionally decomposes nonuniform bank outputswith onoff switchable banks of finer resolution uniform subbands During filterbank adaptation a suitable overall time-frequency resolution is attained by selec-tively enabling or disabling the second stage filters for each of the four uniformbandwidth sections The low-resolution mode for this architecture corresponds toslightly better than auditory filter-bank frequency resolution On the other handthe high-resolution mode corresponds roughly to 512 uniform subband decompo-sition Adaptation decisions are made independently for each of the four cascadedsections based on a criterion of minimum perceptual entropy (PE) The secondstage filters in each section are enabled only if a reduction in PE (hence bit rate)is realized Uniform PCM is applied to subband samples under the constraint ofperceptually masked quantization noise Masking thresholds are transmitted asside information Further redundancy reduction is achieved by Huffman codingof both quantized subband sequences and masking thresholds

8.4.2 Frequency-Varying Modulated Lapped Transforms

Purat and Noll [Pura96] also developed a CD-quality audio coding scheme basedon a signal-adaptive nonuniform tree-structured wavelet packet decompositionThis coder is unique in two ways First of all it makes use of a novel waveletpacket decomposition [Pura95] Secondly the algorithm adapts to the signal thewavelet packet tree decomposition depth and breadth (branching structure) basedon a minimum bit rate criterion subject to the constraint of inaudible distortionsIn informal subjective tests the algorithm achieved excellent quality at a bit rate of55 kbs

8.5 HYBRID WP AND ADAPTED WP/SINUSOIDAL ALGORITHMS

This section examines audio coding algorithms that make use of a hybrid waveletpacketsinusoidal signal analysis Hybrid coder architectures often improve coderrobustness to diverse program material In this case the wavelet portion of acoder might be better suited to certain signal classes (eg transient) while theharmonic portion might be better suited to other classes of input signal (egtonal or steady-state) In an effort to improve coder overall performance (egbetter output quality for a given bit rate) several of the signal-adaptive waveletand wavelet packet subband coding schemes presented in the previous sectionhave been embedded in experimental hybrid coding schemes that seek to adaptthe analysis properties of the coding algorithm to the signal content Severalexamples are considered in this section

Although the WP coder improvements reported in [Tewf93] addressed pre-echo control problems evident in [Sinh93b], they did not rectify the coder's inadequate performance for harmonic signals such as the piano test sequence. This is in part because the low-order FIR analysis filters typically employed in a WP decomposition are characterized by poor frequency selectivity, and therefore wavelet bases tend not to provide compact representations for strongly sinusoidal signals.

Figure 8.9. Hybrid sinusoidal/wavelet encoder (after [Hamd96]).

On the other hand wavelet decompositions provide some control over time res-olution properties leading to efficient representations of transient signals Theseconsiderations have inspired several researchers to investigate hybrid coders

8.5.1 Hybrid Sinusoidal/Classical DWPT Coder

Hamdy et al. developed a hybrid coder [Hamd96] designed to exploit the efficiencies of both harmonic and wavelet signal representations. For each frame, the encoder (Figure 8.9) chooses a compact signal representation from combined sinusoidal and wavelet bases. This algorithm is based on the notion that short-time audio signals can be decomposed into tonal, transient, and noise components. It assumes that tonal components are most compactly represented in terms of sinusoidal basis functions, while transient and noise components are most efficiently represented in terms of wavelet bases. The encoder works as follows. First, Thomson's analysis model [Thom82] is applied to extract sinusoidal parameters (frequencies, amplitudes, and phases) for each input frame. Harmonic synthesis using the McAulay and Quatieri reconstruction algorithm [McAu86] for phase and amplitude interpolation is next applied to obtain a residual sequence. Then, the residual is decomposed into WP subbands.

The overall WP analysis tree approximates an auditory filter bank. Edge-detection processing identifies and removes transients in low-frequency subbands. Without transients, the residual WP coefficients at each scale become largely decorrelated. In fact, the authors determined that the sequences are well approximated by white Gaussian noise (WGN) sources having exponential decay envelopes. As far as quantization and encoding are concerned, sinusoidal frequencies are quantized with sufficient precision to satisfy just-noticeable-differences in frequency (JNDF), which requires 8-bit absolute coding for a new frequency track, and then 5-bit differential coding for the duration of the lifetime of the track. The sinusoidal amplitudes are quantized and encoded in a similar absolute/differential manner using simultaneous masking thresholds for shaping of quantization noise. This may require up to 8 bits per component. Sinusoidal phases are uniformly quantized on the interval [-π, π] and encoded using 6 bits. As for quantization and encoding of WP parameters, all coefficients below 11 kHz are encoded as in [Sinh93b]. Above 11 kHz, however, parametric representations are utilized. Transients are represented in terms of a binary edge mask that can be run-length encoded, while the Gaussian noise components are represented in terms of means, variances, and exponential decay constants. The hybrid harmonic-wavelet coder was reported to achieve nearly transparent coding over a wide range of CD-quality source material at bit rates in the vicinity of 44 kb/s [Ali96].
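The sketch below illustrates the flavor of these quantization rules: a 6-bit uniform phase quantizer on [-π, π], and absolute coding of the first frequency of a track followed by differential coding of its continuation. The absolute and differential step sizes are assumptions chosen only for illustration (a JNDF-based quantizer would use perceptually derived, nonuniform steps), and the helper names are hypothetical.

```python
# Illustrative sinusoidal parameter quantization (assumed step sizes).
import numpy as np

def quantize_phase(phi, bits=6):
    levels = 2 ** bits
    step = 2 * np.pi / levels
    idx = int(np.round((phi + np.pi) / step)) % levels
    return idx, idx * step - np.pi                      # index, reconstructed phase

def encode_frequency(track_freqs_hz, abs_bits=8, diff_bits=5, f_max=20000.0):
    """First value of a track coded absolutely, the rest differentially."""
    codes = []
    q_abs = f_max / (2 ** abs_bits)                     # assumed absolute step
    prev = np.round(track_freqs_hz[0] / q_abs) * q_abs
    codes.append(('abs', prev))
    q_diff = 10.0 / (2 ** diff_bits)                    # assumed +/-5 Hz differential range
    for f in track_freqs_hz[1:]:
        d = np.clip(np.round((f - prev) / q_diff) * q_diff, -5.0, 5.0)
        prev = prev + d
        codes.append(('diff', prev))
    return codes

print(quantize_phase(1.2345))
print(encode_frequency([440.0, 441.5, 443.0]))
```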

8.5.2 Hybrid Sinusoidal/M-band DWPT Coder

During the late 1990s, other researchers continued to explore the potential of hybrid sinusoidal-wavelet signal analysis schemes for audio coding. Boland and Deriche [Bola97] reported on an experimental sinusoidal-wavelet hybrid audio codec with high-level architecture very similar to [Hamd96], but with low-level differences in the sinusoidal and wavelet analysis blocks. In particular, for harmonic analysis, the proposed algorithm replaces Thomson's method used in [Hamd96] with a combination of total least squares linear prediction (TLS-LP) and Prony's method. Then, in the harmonic residual wavelet decomposition block, the proposed method replaces the usual DWT cascade of two-band QMF sections with a cascade of four-band QMF sections. The algorithm works as follows. First, harmonic analysis operates on nonoverlapping 12-ms blocks of rectangularly windowed input audio (512 samples at 44.1 kHz). For each block, sinusoidal frequencies, $f_k$, are extracted using TLS-LP spectral estimation [Rahm87], a procedure that is formulated to deal with closely spaced sinusoids in low SNR environments. Given the set of TLS-LP frequencies, a classical Prony algorithm [Marp87] next determines the corresponding amplitudes, $A_k$, and phases, $\phi_k$. Masking thresholds for the tonal sequence are calculated in a manner similar to the ISO/IEC MPEG-1 psychoacoustic recommendation 2 [ISOI92]. After masked tones are discarded, the parameters of the remaining sinusoids are uniformly quantized and encoded in a procedure similar to [Hamd96]. Frequencies are encoded according to JNDFs (3 nonuniform bands, 8 bits per component in each band), phases are allocated 6 bits across all frequencies, and amplitudes are block companded with 5 bits for the gain and 6 bits per normalized amplitude. Unlike [Hamd96], however, amplitude bit allocations are fixed rather than signal adaptive. Quantized sinusoidal components are used to synthesize a tonal sequence, $\hat{s}_{tonal}(n)$, as follows,

$$\hat{s}_{tonal}(n) = \sum_{k=1}^{p} A_k e^{j(\Omega_k n + \phi_k)}, \qquad (8.2)$$

where the parameters $\Omega_k = 2\pi f_k / f_s$ are the normalized radian frequencies, and only $p/2$ frequency components are independent since the complex exponentials are organized into conjugate symmetric pairs.

Figure 8.10. Subband decomposition associated with the cascaded M-band DWT in [Bola97]: a three-level cascade of four-band (Q4) analysis sections maps the input x into ten subbands, y1 through y10, spanning 0-22 kHz.

As in [Hamd96], the synthetic tonal sequence, $\hat{s}_{tonal}(n)$, is subtracted from the input sequence, s(n), to form a spectrally flattened residual, r(n).
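A minimal sketch of this tonal synthesis and residual formation, written with the real (cosine) form of the conjugate-symmetric sum in Eq. (8.2), is given below. The sampling rate, frame length, and test parameters are illustrative assumptions.

```python
# Tonal synthesis per Eq. (8.2) and residual formation (illustrative sketch).
import numpy as np

def synth_tonal(amps, freqs_hz, phases, n_samples, fs=44100.0):
    n = np.arange(n_samples)
    s = np.zeros(n_samples)
    for A, f, phi in zip(amps, freqs_hz, phases):
        omega = 2 * np.pi * f / fs            # normalized radian frequency
        s += A * np.cos(omega * n + phi)      # real form of the conjugate-symmetric pair
    return s

fs, N = 44100.0, 512
n = np.arange(N)
x = 0.8 * np.cos(2 * np.pi * 440 * n / fs + 0.3) + 0.01 * np.random.randn(N)
s_tonal = synth_tonal([0.8], [440.0], [0.3], N, fs)
r = x - s_tonal                                # spectrally flattened residual r(n)
print(round(np.var(x), 4), round(np.var(r), 4))
```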

In the wavelet analysis section, the harmonic residual, r(n), is decomposed such that critical bandwidths are roughly approximated using a three-level cascade (Figure 8.10) of 4-band analysis filters (i.e., 10 subbands) designed according to the M-band technique in [Alki95]. Compared to the usual DWT cascade of 2-band QMF sections, the M-band cascade offers the advantages of reduced complexity, reduced delay, and linear phase. The DWT coefficients are uniformly quantized and encoded in a block companding scheme with 5 bits per subband gain and a dynamic bit allocation according to a perceptual noise model for the normalized coefficients. A Huffman coding section removes remaining statistical redundancies from the quantized harmonic and DWT coefficient sets. In subjective listening comparisons between the proposed scheme at 60-70 kb/s and MPEG-1 layer III at 64 kb/s on 12 SQAM CD [SQAM88] source items, the authors reported indistinguishable quality for "acoustic guitar," "Eddie Rabbit," "castanets," and "female speech." Slight impairments relative to MPEG-1 layer III were reported for the remaining eight items. No comparisons were reported in terms of delay or complexity.

8.5.3 Hybrid Sinusoidal/DWPT Coder with WP Tree Structure Adaptation (ARCO)

Other researchers have also developed hybrid algorithms that represent audiousing a combination of sinusoidal and wavelet packet bases Pena et al [Pena96]have reported on the Adaptive Resolution COdec (ARCO) This algorithmemploys a two-stage hybrid tonal-WP analysis section architecturally similar toboth [Hamd96] and [Bola97] The experimental ARCO algorithm has introducedseveral novelties in the segmentation psychoacoustic analysis tonal analysis bitallocation and WP analysis blocks In addition recent work on this project hasproduced a unique MDCT-based filter bank The remainder of this subsectiongives some details on these developments

8.5.3.1 ARCO Segmentation, Perceptual Model, and Sinusoidal Analysis-by-Synthesis. In an effort to match the time-frequency analysis resolution to the signal properties, ARCO includes a segmentation scheme that makes use of both time and frequency block clustering to determine optimal analysis frame lengths [Pena97b]. Similar blocks are assumed to contain stationary signals and are therefore combined into larger frames. Dissimilar blocks, on the other hand, are assumed to contain nonstationarities that are best analyzed using individual short segments. The ARCO psychoacoustic model resembles ISO/IEC MPEG-1 model recommendation 1 [ISOI92], with some enhancements. Unlike [ISOI92], tonality labeling is based on [Terh82], and noise maskers are segregated into narrowband and wideband subclasses. Then, frequency-dependent excitation patterns are associated with the wideband noise maskers. ARCO quantizes tonal signal components in a perceptually motivated analysis-by-synthesis. Using an iterative procedure, bits are allocated on each analysis frame until the synthetic tonal signal's excitation pattern matches the original signal's excitation pattern to within some tolerance.

8.5.3.2 ARCO WP Decomposition. The ARCO WP decomposition procedure optimizes both the tree structure, as in [Srin98], and filter selections, as in [Sinh93b] and [Phil96]. For the purposes of WP tree adaptation [Prel96a], ARCO defines for the k-th band a cost, $\varepsilon_k$, as

$$\varepsilon_k = \frac{\int_{f_k - B_k/2}^{f_k + B_k/2} \left( U(f) - A_k \right) df}{\int_{f_k - B_k/2}^{f_k + B_k/2} U(f)\, df}, \qquad (8.3)$$

where U(f) is the masking threshold expressed as a continuous function, the parameter f represents frequency, $f_k$ is the center frequency for the k-th subband, $B_k$ is the k-th subband bandwidth, and $A_k$ is the minimum masking threshold in the k-th band. Then, the total cost, C, to be minimized over all M subbands is given by

$$C = \sum_{k=1}^{M} \varepsilon_k. \qquad (8.4)$$

By minimizing Eq. (8.4) on each frame, ARCO essentially arranges the subbands such that the corresponding set of idealized brickwall rectangular filters, having amplitude equal to the height of the minimum masking threshold in each band, matches as closely as possible the shape of the masking threshold. Then, bits are allocated in each subband to satisfy the minimum masking threshold, $A_k$. Therefore, uniform quantization in each subband with sufficient bits effects a noise shaping that satisfies perceptual requirements without wasting bits. The method was found to be effective without accounting explicitly for the spectral leakage associated with the filter bank sidelobes [Prel96b]. As far as filter selection is concerned, ARCO employs signal-adaptive filters during steady-state segments and time-invariant filters during transients. Some of the filter selection strategies were reported to have been inspired by Agerkvist's auditory modeling work [Ager94] [Ager96]. In [Pena97a], it was found that the "symmetrization" technique [Bamb94] [Kiya94] was effective for minimizing the boundary distortions associated with the time-varying WP analysis.
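The cost of Eq. (8.3) is straightforward to evaluate numerically for a candidate band, as sketched below: integrate the gap between the masking threshold U(f) and its in-band minimum, normalized by the threshold area. The synthetic threshold curve and the candidate band edges are assumptions for illustration only.

```python
# Per-band cost of Eq. (8.3), evaluated numerically (illustrative sketch).
import numpy as np

def band_cost(freqs, U, f_lo, f_hi):
    sel = (freqs >= f_lo) & (freqs < f_hi)
    u = U[sel]
    a_k = u.min()                                    # minimum masking threshold in the band
    return np.trapz(u - a_k, freqs[sel]) / np.trapz(u, freqs[sel])

freqs = np.linspace(0, 22050, 2205)
U = 1.0 + 0.5 * np.sin(freqs / 900.0) ** 2           # toy masking threshold U(f)
eps_wide = band_cost(freqs, U, 0, 4000)
eps_split = band_cost(freqs, U, 0, 2000) + band_cost(freqs, U, 2000, 4000)
print(round(eps_wide, 3), round(eps_split, 3))       # splitting typically lowers the total cost
```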

8.5.3.3 ARCO Bit Allocation. Unlike most other algorithms, ARCO encodes and transmits the masking threshold to the decoder. This has the advantage of efficiently representing both the adapted WP tree and the subband bit allocations with a single piece of information. The disadvantage, however, is that the decoder is no longer decoupled from the details of perceptual bit allocation, as is typically the case with other algorithms. The ARCO bit allocation strategy [Sera97] achieves fast convergence to a desired bit rate by shifting the masking threshold up or down using a novel noise scaling procedure. The technique essentially uses a Newton algorithm to converge in only a few iterations to the noise scaling level that achieves the desired bit rate. The technique takes into account bit allocations from previous frames and allocates bits to all subbands simultaneously. Convergence speed and accuracy are controlled by a single parameter, and the procedure is amenable to subband weighting of the threshold to create unique noise profiles. In one set of experiments, convergence to a target rate with perceptual noise shaping was achieved in between two and seven iterations of the low-complexity technique. Another unique property of ARCO is its set of high-level "cognitive rules" that seek to minimize the objectionable distortion when insufficient bits are available to guarantee transparent coding [Pena95]. These rules monitor the evolution of coding distortion over many frames and make fine noise-shaping adjustments on individual frames in order to avoid perceptually annoying noise patterns that could not otherwise be detected on a short-time basis.
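To suggest how a Newton-style noise scaling iteration of this kind can behave, the sketch below shifts a common threshold-scaling factor until a crude rate model meets the budget. The rate model (roughly half a bit per 6 dB of signal-to-threshold ratio per band) and the slope estimate are assumptions made only for illustration; they are not the ARCO procedure of [Sera97].

```python
# Newton-style noise scaling to hit a target bit budget (illustrative sketch).
import numpy as np

def rate(variances, thresholds, scale):
    snr = variances / (thresholds * scale)
    return np.sum(np.maximum(0.5 * np.log2(np.maximum(snr, 1.0)), 0.0))

def fit_scale(variances, thresholds, target_bits, iters=7):
    log_s = 0.0
    for _ in range(iters):
        s = 2.0 ** log_s
        r = rate(variances, thresholds, s)
        active = np.sum(variances > thresholds * s)
        slope = -0.5 * max(active, 1)        # d(rate)/d(log2 scale), roughly
        log_s += (target_bits - r) / slope   # Newton-style update on the exponent
    return 2.0 ** log_s

var = np.abs(np.random.randn(16)) + 0.5
thr = 0.01 * np.ones(16)
s = fit_scale(var, thr, target_bits=40.0)
print(round(s, 3), round(rate(var, thr, s), 1))
```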

8.5.3.4 ARCO Developments. It is interesting to note that the researchers developing ARCO recently replaced the hybrid sinusoidal-WP analysis filter bank with a novel multiresolution MDCT-based filter bank. In [Casa98], Casal et al. developed a "multi-transform" (MT) that retains the lapped properties of the MDCT but creates a nonuniform time-frequency tiling by transforming back into time the high-frequency MDCT components in L-sample blocks. The proposed MT is characterized by high resolution in frequency for the low subbands and high resolution in time for the high frequencies. Like the MDCT upon which it is based, the MT maintains critical sampling and perfect reconstruction in the absence of quantization. Preliminary results for application of the MT in the TARCO (Tonal Adaptive Resolution COdec) are given in [Casa98]. As far as bit rates, reconstruction quality, and complexity are concerned, details on ARCO/TARCO have not yet appeared in the literature.

We conclude this section with the observation that hybrid DWT-sinusoidal andDWPT-sinusoidal architectures such as those advocated by Hamdy [Hamd96]Boland [Bola97] and Pena [Pena96] have been motivated by the notion that asource-robust audio coder must represent radically different signal types withuniform efficiency The idea behind the hybrid structure is that providing twoextreme basis possibilities might yield opportunities for maximally efficient signaladaptive basis selection By offering superior frequency resolution with inher-ently narrowband basis elements sinusoidal signal models are ideally suited forstrongly tonal signals while DWT and WPT filter banks on the other handsacrifice some frequency resolution but offer greater time resolution flexibilitymaking these bases inherently more efficient for representing transient signals As

this section has demonstrated, the combination of both signal models within a single codec can provide compact representations for a wide range of input signals. The next section of this chapter examines a different type of hybrid audio coding architecture in which code-excited linear prediction (CELP) is embedded within subband coding schemes.

8.6 SUBBAND CODING WITH HYBRID FILTER BANK/CELP ALGORITHMS

While hybrid sinusoidal-DWT and sinusoidal-DWPT signal models seek to maximize robustness and basis flexibility, other hybrid signal models have been motivated by low-delay and low-complexity concerns. In this section, we consider in particular algorithms that combine a filter bank front end with subband-specific code-excited linear prediction (CELP) blocks for quantization and coding of the decimated subband sequences. The goal of these experimental hybrid coders is to achieve very low delay and/or low-complexity perceptual coding with reconstruction quality comparable to any state-of-the-art audio codec. Before considering these algorithms, however, we first define what is meant by "code-excited linear prediction."

In the coding literature, the acronym "CELP" denotes an entire class of efficient analysis-by-synthesis source coding techniques developed primarily for speech applications, in which the analyzed signal is treated as the output of a source-system mechanism such as the human vocal apparatus. In the CELP scheme, excitation vectors corresponding to the lower vocal tract "source" contribution drive a slowly time-varying LP synthesis filter that corresponds to the upper vocal tract "system." Parameters of the LP synthesis filter are usually estimated on a block basis, typically every 20 ms, while the excitation vectors are usually updated more frequently, typically every 5 ms. The LP parameters are most often estimated in an open-loop procedure by solving a set of normal equations that have been formulated to minimize the mean square prediction error. In contrast, the excitation vectors are optimized in a closed-loop, analysis-by-synthesis procedure such that the reconstruction error is minimized, most often in the perceptually weighted mean square sense. Given a vector of input speech, the analysis-by-synthesis process essentially reduces to a search during which the encoder must identify, within a vector codebook, that candidate excitation that generates the best synthetic output speech when processed by the LP synthesis filter. The set of encoded parameters is therefore a set of filter parameters and one (or more) vector indices, depending upon the codebook structure. Since its introduction in the mid-1980s [Schr85], CELP and its derivatives have received considerable attention in the literature. As a result, numerous high-quality, highly efficient algorithms have been proposed and adopted as international standards in speech coding. Although a detailed discussion of CELP is beyond the scope of this book, we refer the reader to the comprehensive tutorial in [Span94] for further details, as well as a complete perspective on the CELP research and standards. The remainder of this section assumes that the reader has a basic understanding

of CELP coding principles. Several examples of experimental subband/CELP algorithms are examined next.

8.6.1 Hybrid Subband/CELP Algorithm for Low-Delay Applications

One example of a hybrid filter bank/CELP low-delay audio codec was developed jointly by Hay and Saoudi at ENST and Mainard at CCETT. They devised a system for generic audio signals sampled at 32 kHz based on the four-band polyphase quadrature filter bank (pseudo-QMF) borrowed from the ISO/IEC MPEG-2 AAC scalable sample rate profile [Akai95] and a bank of modified ITU G.728 [ITUR92] low-delay CELP speech codecs (Figure 8.11). The primary objective of this system is to achieve transparent coding of the high-fidelity input with very low delay. The coder was first reported in [Hay96] and then enhanced in [Hay97]. The enhanced algorithm works as follows. First, the filter bank decomposes the input into four equal-width subbands. Then, each of the decimated subband sequences is quantized and encoded in five-sample blocks (0.625 ms) using modified G.728 codecs (low-delay CELP) for each subband. The backward-adaptive G.728 algorithm [ITUR92] generates as output a single vector index for each block of input samples, and therefore a set of four codebook indices, i1, i2, i3, i4, comprises the complete bitstream for the hybrid audio codec. Algorithmic delay consists of the 3-ms filter bank delay (96-tap filters) plus the additional 2-ms delay contributed by the G.728 stages, resulting in a total delay of only 5 ms. Bit allocation targets for each subband are computed by means of a modified ISO/IEC MPEG-1 psychoacoustic model 1 that computes masking thresholds, signal-to-mask ratios, and ultimately the number of bits required for transparent coding by analyzing the quantized outputs of the i-th band, Ŝi, from a 4-ms-old block of data.

Figure 8.11. Low-delay hybrid filter-bank/LD-CELP algorithm [Hay97].

The perceptual model utilizes an alias-cancelled DFT [Tang95] to compensate for the analysis filter bank's aliasing distortion. Bit allocations are derived at both the transmitter and receiver from the same set of quantized data, making it unnecessary to transmit explicitly any bit allocation information. Average bit allocations on a subset of the standard ISO test material were 31, 18, 12, and 3 kb/s, respectively, for subbands 1 through 4. Given that the G.728 codec is intended to operate at a fixed rate, the primary challenge facing the algorithm designers was implementing dynamic subband bit allocations. Computationally efficient, variable-rate versions of G.728 were constructed for bands 2 through 4 by structuring standard LBG (K-means) [Lind80] codebooks to deal with multiple rates (variable precision codebook indices). Unfortunately, the first (low-frequency) subband requires an average bit rate of 32 kb/s for perceptual transparency, which translates to an impractical codebook size of 2^20 vectors. To solve this problem, the authors implemented a highly efficient D5 lattice VQ scheme [Conw88], which dramatically reduced the search complexity for each input vector by constraining the search space to a 50-vector neighborhood. Lattice vector shapes were assigned 16 bits and gains 4 bits. The lattice scheme was shown to perform nearly as well as an exhaustive search over a codebook containing more than 50,000 vectors. Neither objective nor subjective quality measures were reported for this hybrid system.

8.6.2 Hybrid Subband/CELP Algorithm for Low-Complexity Applications

Intended for achieving CD quality in low-complexity decoder applications asecond example of a hybrid filter bankCELP algorithm appeared in [Vand98]Like [Hay97] the proposed algorithm follows a critically sampled filter bankwith a quantization and encoding stage of parallel variable-rate CELP codersone per subband (Figure 812) Unlike [Hay97] however this algorithm makesuse of a higher resolution longer delay filter bank Thus channel separation isgained at the expense of delay At the same time this algorithm utilizes relativelylow-order LP synthesis filters which significantly reduce decoder complexityIn contrast [Hay97] captures significant spectral detail in the high-order (50-thorder) predictors that are embedded in the G728 blocks The proposed algorithmclosely resembles ISOIEC MPEG-1 layer 1 in its filter bank and psychoacousticmodeling sections In particular the filter bank is identical to the 32-band 512-tap PQMF bank of [ISOI92] Also like [ISOI92] the subband sequences areprocessed in 12-sample blocks corresponding to 384 input samples

The proposed algorithm however replaces the block companding of [ISOI92]with the CELP quantization and encoding for all 32 subbands For every block of12 subband samples bits are allocated to the subbands on the basis of maskingthresholds delivered by the perceptual model This practice establishes minimumSNRs required in each subband to achieve perceptually transparent coding Thenparallel noise scaling is applied to the target SNRs to adjust the bit rate to ascalable target Finally CELP blocks quantize and encode each subband usingthe number of bits allocated by the perceptual model The particulars of the 32

Figure 8.12. Low-complexity hybrid filter-bank/CELP algorithm [Vand98].

identical CELP stages are as follows. In order to maintain low complexity, the backward-adaptive LP synthesis filters are second order. The codebook, which is identical for all stages, contains 12-element stochastic excitation vectors that are structured for gain-shape quantization, with 6 bits allocated to the gains and 8 bits allocated to the shapes for each of the 256 codewords. Because bits are allocated dynamically for each subband in accordance with a masking threshold, the CELP blocks are configured for variable-rate operation. Each CELP coder will combine excitation contributions from up to 4 codebooks, meaning that the available rates for each subband are 0, 1.17, 2.33, 3.5, and 4.67 bits per sample. The closed-loop analysis-by-synthesis excitation search procedure relies upon a standard MSE minimization codebook search. The total bit budget, R, is given by

$$R = 2N_b + \sum_{i=1}^{N_b} 8 N_c(i) + \sum_{i=1}^{N_b} 6 N_c(i), \qquad (8.5)$$

where $N_b$ is the number of bands (32) and $N_c(i)$ is the number of codebooks required in the i-th band to achieve the SNR demanded by the perceptual model. From left to right, the terms in Eq. (8.5) represent the bits required to specify the number of codebooks being used in each subband, the bits required for the shape codewords, and the bits required for the gain codewords. In informal subjective tests over a set of unspecified test material, the algorithm was reported to produce quality "near transparency" at 62 kb/s, "good quality" at 50 and 37 kb/s, and quality that was "weak" at 30 kb/s.
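Evaluating Eq. (8.5) for a given choice of per-band codebook counts is a one-line computation, sketched below; the example allocation over 32 bands is hypothetical.

```python
# Bit budget of Eq. (8.5) for a given set of per-band codebook counts Nc(i).
def bit_budget(nc, shape_bits=8, gain_bits=6):
    nb = len(nc)                                   # number of subbands (32 in the text)
    return 2 * nb + sum(shape_bits * c for c in nc) + sum(gain_bits * c for c in nc)

nc = [2] * 8 + [1] * 16 + [0] * 8                  # hypothetical allocation over 32 bands
print(bit_budget(nc), "bits per block of 12 subband samples")
```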

8.7 SUBBAND CODING WITH IIR FILTER BANKS

Although the majority of subband and wavelet audio coding algorithms foundin the literature employ banks of perfect reconstruction FIR filters this does notpreclude the possibility of using infinite impulse response (IIR) filter banks forthe same purpose Compared to FIR filters IIR filters are able to achieve simi-lar magnitude response characteristics with reduced filter orders and hence withreduced complexity In the multiband case IIR filter banks also offer complex-ity advantages over FIR filter banks Enhanced performance however comesat the expense of an increased sensitivity and implementation cost for IIR filterbanks Creusere and Mitra constructed a template subband audio coding sys-tem modeled after [Lokh92] to compare performance and to study the tradeoffsinvolved when choosing between FIR and IIR filter banks for the audio codingapplication [Creu96] In the study two IIR and two FIR coding schemes wereconstructed from the template using a structured all-pass filter bank a parallelall-pass filter bank a tree-structured QMF bank and a PQMF bank Beyond thisstudy IIR filter banks have not been widely used for audio coding The applica-tion of IIR filter banks to subband audio coding remains a subject that is largelyunexplored

PROBLEMS

8.1 In this problem, we will show that the STFT can be interpreted as a bank of subband filters. Given the STFT, $X(n, \Omega_k)$, of the input signal x(n),

$$X(n, \Omega_k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n - m)\, e^{-j\Omega_k m} = w(n) * \left[ x(n) e^{-j\Omega_k n} \right],$$

where w(n) is the sliding analysis window. Give a filter-bank realization of the STFT for a discrete frequency variable $\Omega_k = k(\Delta\Omega)$, k = 0, 1, ..., 7 (i.e., 8 bands). Choose $\Delta\Omega$ such that the speech band (20-4000 Hz) is covered. Assume that the frequencies $\Omega_k$ are uniformly spaced.

8.2 The mother wavelet function, ξ(t), is given in Figure 8.13. Determine and sketch carefully the wavelet basis functions, $\xi_{\upsilon\tau}(t)$, for υ = 0, 1, 2 and τ = 0, 1, 2 associated with ξ(t),

$$\xi_{\upsilon\tau}(t) \triangleq 2^{-\upsilon/2}\, \xi(2^{-\upsilon} t - \tau), \qquad (8.6)$$

where υ and τ denote the dilation (frequency scaling) and translation (time shift) indices, respectively.

8.3 Let $h_0(n) = [1/\sqrt{2},\ 1/\sqrt{2}]$ and $h_1(n) = [1/\sqrt{2},\ -1/\sqrt{2}]$. Compute the scaling and wavelet functions, φ(t) and ξ(t). Using ξ(t) as the mother wavelet, generate the wavelet basis functions $\xi_{00}(t)$, $\xi_{01}(t)$, $\xi_{10}(t)$, and $\xi_{11}(t)$.

Figure 8.13. An example wavelet function.

Hint: From the DWT theory, the Fourier transforms of φ(t) and ξ(t) are given by

$$\Phi(\Omega) = \frac{1}{\sqrt{2}} H_0(e^{j\Omega/2}) \prod_{p=2}^{\infty} H_0(e^{j\Omega/2^p}), \qquad (8.7)$$

$$\Xi(\Omega) = \frac{1}{\sqrt{2}} H_1(e^{j\Omega/2}) \prod_{p=2}^{\infty} H_0(e^{j\Omega/2^p}), \qquad (8.8)$$

where $H_0(e^{j\Omega})$ and $H_1(e^{j\Omega})$ are the DTFTs of the causal FIR filters $h_0(n)$ and $h_1(n)$, respectively. For convenience, we assumed $T_s = 1$ in $\Omega = \omega T_s$, and $\phi(t) \stackrel{\mathrm{CFT}}{\longleftrightarrow} \Phi(\omega) \equiv \Phi(\Omega)$, $h_0(n) \stackrel{\mathrm{DTFT}}{\longleftrightarrow} H_0(e^{j\Omega})$.

8.4 Let $H_0(e^{j\Omega})$ and $H_1(e^{j\Omega})$ be ideal lowpass and highpass filters with cutoff frequency π/2, as shown in Figure 8.14. Sketch Φ(Ω), Ξ(Ω), and the wavelet basis functions $\xi_{00}(t)$, $\xi_{01}(t)$, $\xi_{10}(t)$, and $\xi_{11}(t)$.

8.5 Show that if both $H_0(e^{j\Omega})$ and $H_1(e^{j\Omega})$ are causal FIR filters of order N, then the wavelet basis functions, $\xi_{\upsilon\tau}(t)$, will have finite duration of $(N + 1)2^{\upsilon}$.

8.6 Using equations (8.7) and (8.8), prove the following: 1) $\Phi(\Omega/2) = \prod_{p=2}^{\infty} H_0(e^{j\Omega/2^p})$, and 2) $|\Phi(\Omega)|^2 + |\Xi(\Omega)|^2 = |\Phi(\Omega/2)|^2$.

8.7 From Problem 8.6, we have $\Phi(\Omega) = \frac{1}{\sqrt{2}} H_0(e^{j\Omega/2}) \Phi(\Omega/2)$ and $\Xi(\Omega) = \frac{1}{\sqrt{2}} H_1(e^{j\Omega/2}) \Phi(\Omega/2)$. Show that φ(t) and ξ(t) can be obtained

Figure 8.14. Ideal lowpass and highpass filters with cutoff frequency π/2.

Figure 8.15. (a) The scaling function φ(t), (b) the mother wavelet function ξ(t), and (c) the input signal x(t).

recursively as

$$\phi(t) = \sqrt{2} \sum_{n} h_0(n)\, \phi(2t - n), \qquad (8.9)$$

$$\xi(t) = \sqrt{2} \sum_{n} h_1(n)\, \phi(2t - n). \qquad (8.10)$$

COMPUTER EXERCISE

8.8 Let the scaling function, φ(t), and the mother wavelet function, ξ(t), be as shown in Figure 8.15(a) and Figure 8.15(b), respectively. Assume that the input signal, x(t), is as shown in Figure 8.15(c). Given the wavelet series expansion

$$x(t) = \sum_{\tau=-\infty}^{\infty} \alpha(\tau)\, \phi(t - \tau) + \sum_{\upsilon=0}^{\infty} \sum_{\tau=-\infty}^{\infty} \beta(\upsilon, \tau)\, \xi_{\upsilon\tau}(t), \qquad (8.11)$$

where both υ and τ are integers and denote the dilation and translation indices, respectively, and α(τ) and β(υ, τ) are the wavelet expansion coefficients, solve for α(τ) and β(υ, τ).
[Hint: Compute the coefficients using inner products, $\alpha(\tau) \triangleq \langle x(t), \phi(t - \tau) \rangle = \int x(t)\, \phi(t - \tau)\, dt$ and $\beta(\upsilon, \tau) \triangleq \langle x(t), \xi_{\upsilon\tau}(t) \rangle = \int x(t)\, \xi_{\upsilon\tau}(t)\, dt = \int x(t)\, 2^{-\upsilon/2}\, \xi(2^{-\upsilon} t - \tau)\, dt$.]

CHAPTER 9

SINUSOIDAL CODERS

9.1 INTRODUCTION

This chapter addresses perceptual coding algorithms based on sinusoidal models. Although sinusoidal signal models have been applied successfully since the 1980s in speech coding [Hede81] [Alme83] [McAu86] [Geor87] [Geor92] and music synthesis [Serr90], perceptual properties were not introduced in sinusoidal modeling until later [Edle96c] [Pena96] [Levin98a] [Pain01]. The advent of MPEG-4 standardization established new research goals for high-quality coding of general audio signals at bit rates in the range of 6-24 kb/s. In experiments reported as part of the MPEG-4 standardization effort, it was determined that sinusoidal coding is capable of achieving good quality at low rates without being constrained by a restrictive source model. Furthermore, unlike CELP and other classical low-rate speech coding models, the parametric sinusoidal coding is amenable to pitch and time-scale modification at the decoder. Additionally, the emergence of Internet-based streaming audio has motivated considerable research on the application of sinusoidal signal models to high-quality audio coding at low bit rates. For example, Levine and Smith developed a hybrid sinusoidal-filter-bank coding scheme that achieves very high quality at rates around 32 kb/s [Levin98a] [Levi99].

This chapter describes some of the sinusoidal algorithms for low-rate audio coding that exploit perceptual properties. In Section 9.2, we review the classical sinusoidal model. Section 9.3 presents the analysis/synthesis audio codec (ASAC), which was eventually considered for MPEG-4 standardization. Section 9.4 describes an enhanced version of ASAC, the harmonic and individual lines plus noise (HILN) algorithm. The HILN algorithm has been adopted




as part of the MPEG-4 standard. Section 9.5 examines the use of FM synthesis operators in sinusoidal audio coding. In Section 9.6, we investigate the sines + transients + noise (STN) model. Finally, Section 9.7 is concerned with algorithms that combine sinusoidal modeling with other well-known techniques in various hybrid architectures to achieve efficient low-rate audio coding.

9.2 THE SINUSOIDAL MODEL

This section describes the sinusoidal model that forms the basis for the parametric audio coding and the extended hybrid model given in the latter portions of this chapter. In particular, standard methodologies are presented for sinusoidal analysis, tracking, interpolation, and synthesis. The classical sinusoidal model comprises an analysis-synthesis framework ([McAu86] [Serr90] [Quat02]) that represents a signal, s(n), as the sum of a collection of K sinusoids ("partials") with time-varying frequencies, phases, and amplitudes, i.e.,
\[ s(n) \approx \hat{s}(n) = \sum_{k=1}^{K} A_k \cos\!\big(\omega_k(n)\,n + \phi_k(n)\big) \qquad (9.1) \]

where Ak represents the amplitude, ωk(n) represents the instantaneous frequency, and φk(n) represents the instantaneous phase of the k-th sinusoid. It is assumed that the amplitude, frequency, and phase functions evolve on a time scale substantially longer than a signal period. Analysis for this model amounts to estimating the amplitudes, phases, and frequencies of the constituent partials. Although this estimation is typically accomplished by peak picking in the short-time Fourier domain [McAu86] [Span91] [Serr90], analysis-by-synthesis estimation techniques that minimize explicitly a mean square error in terms of the sinusoidal parameters have also been proposed [Geor87] [Geor90] [Geor92]. Sinusoidal analysis-by-synthesis has also been presented within the more generalized framework of matching pursuits using overcomplete signal dictionaries [Good97] [Verm99]. Whether classical short-time Fourier transform (STFT) peak picking or analysis-by-synthesis is used for parameter estimation, the analysis yields partial parameters on each frame, and the data rate of the parameterization is given by the analysis stride and the order of the model. In the synthesis stage, the frame-rate model parameters are connected from frame to frame by a line tracking process and then interpolated using low-order polynomial models to derive sample-rate control functions for a bank of oscillators. Interpolation is carried out based on synthesis frames, which are implicitly established by the analysis stride. Although the bank of synthesis oscillators can be realized through additive combination of cosines, computationally efficient alternatives are available based on the FFT (e.g., [McAu88] [Rode92]).
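As a simple illustration of Eq. (9.1), the MATLAB fragment below synthesizes one frame from a bank of cosine oscillators with frame-constant parameters. The frame length, number of partials, and parameter values are arbitrary choices for illustration, not values taken from the text.

% Minimal sketch: additive synthesis of one frame from K partials (Eq. 9.1),
% assuming the amplitudes, frequencies, and phases are constant over the frame.
N  = 512;                          % frame length in samples (assumption)
K  = 3;                            % number of partials (assumption)
A  = [1.0 0.5 0.25];               % amplitudes A_k
w  = 2*pi*[220 440 660]/44100;     % frequencies omega_k in radians/sample
ph = [0 pi/4 pi/2];                % phases phi_k
n  = 0:N-1;
s_hat = zeros(1, N);
for k = 1:K
    s_hat = s_hat + A(k)*cos(w(k)*n + ph(k));   % add the k-th oscillator output
end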

9.2.1 Sinusoidal Analysis and Parameter Tracking

The STFT-based analysis scheme [McAu86] [Serr89] that estimates the sinusoidal model parameters is presented here (Figure 9.1). First, the input is segmented into


Figure 9.1. Sinusoidal model analysis. The time-domain signal is segmented into overlapping frames that are transformed to the frequency domain using STFT analysis. Local magnitude spectral maxima are identified. It is assumed that each peak is associated with a pure tone (partial) component of the input. For each of the peaks, a parameter triad containing frequency, amplitude, and phase is extracted. Finally, a tracking algorithm forms time trajectories for the sinusoids by matching the amplitude and/or frequency parameters across time.

overlapping frames. In the hybrid signal model, analysis frame lengths are signal adaptive. Frames are typically overlapped by half of their length. After segmentation, the frames are analyzed with the STFT, which yields magnitude and phase spectra. The sinusoidal analysis scheme assumes that magnitude spectral peaks are associated with underlying pure tones in the input. Therefore, spectral peaks are identified by a peak detector and then passed to a tracking algorithm that forms time trajectories by associating peaks from frame to frame. For a time-domain input s(n), let Sl(k) denote the complex-valued STFT of the signal s(n) on the l-th frame. A spectral peak is defined as a local maximum in the magnitude spectrum |Sl(k)|, i.e., an STFT magnitude peak on bin k0 that satisfies the inequality
\[ |S_l(k_0 - 1)| \leq |S_l(k_0)| \geq |S_l(k_0 + 1)| \qquad (9.2) \]
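A minimal MATLAB sketch of this peak-picking rule is given below. The FFT size, window, and placeholder frame are illustrative assumptions rather than the book's reference implementation.

% Minimal sketch: locate local maxima of an STFT magnitude spectrum (Eq. 9.2).
Nfft = 512;                                        % FFT size (assumption)
n = 0:Nfft-1;
w = 0.5 - 0.5*cos(2*pi*n/Nfft);                    % Hann window built in-line
x = w .* randn(1, Nfft);                           % placeholder analysis frame
mag = abs(fft(x));
mag = mag(1:Nfft/2);                               % keep non-negative frequencies
peaks = [];                                        % bins k0 that are local maxima
for k0 = 2:numel(mag)-1
    if mag(k0) >= mag(k0-1) && mag(k0) >= mag(k0+1)
        peaks(end+1) = k0;                         %#ok<AGROW> record candidate peak bin
    end
end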


Following peak identification on frames l and l + 1, a tracking procedure forms time trajectories by matching across frames those spectral peaks which satisfy certain matching criteria. The resulting trajectories are intended to represent the smoothly time-varying frequencies, amplitudes, and phases of the sinusoidal partials that comprise the signal under analysis. Several trajectory tracking algorithms have been demonstrated to perform well [McAu86] [Serr89].

The tracking procedure (Figure 9.2) works in the following way. First, denote by ω_i^l the frequencies associated with the sinusoids identified on frame l, with 1 ≤ i ≤ p. Similarly, denote by ω_j^{l+1} the frequencies associated with the sinusoids identified on frame l + 1, with 1 ≤ j ≤ r. Given two sets of unmatched sinusoids, the tracking objective is to identify, for the i-th sinusoid on frame l, the j-th sinusoid on frame l + 1 that is closest in frequency and/or amplitude (here, only frequency matching is considered). Therefore, in the first step of the procedure, an initial match is formed between ω_i^l and ω_j^{l+1} such that the difference Δω = |ω_i^l − ω_j^{l+1}| is minimized and such that the distance Δω is less than a specified maximum, Δω_max.

Following an initial match, three outcomes are possible. First, the trajectory will be continued (Figure 9.2a) if a match is found and there are no match conflicts to be resolved. In this case, the frequency, amplitude, and phase parameters are interpolated from frame l to frame l + 1. On the other hand, if no initial match is found during the first step, it is assumed that the trajectory associated with frequency ω_i^l must terminate. In this case, the trajectory is declared "dead" (Figure 9.2c) and is matched to itself with zero amplitude on frame l + 1. In the third possible outcome, the initial match creates a conflict. In this case, the i-th trajectory attempts to match with a peak that has already been claimed by another

Figure 9.2. Sinusoidal trajectory formation. In the figure, an 'x' denotes the presence of a sinusoid at the specified frequency, while an 'o' denotes the absence of a sinusoid at the specified frequency. In part (a), a sinusoid on frame k is matched to a sinusoid on frame k + 1 because the two sinusoids are sufficiently close in frequency and because there are no conflicts. During synthesis, the frequency, amplitude, and phase parameters are interpolated from frame k to frame k + 1. In part (b), a sinusoid on frame k + 1 is declared "born" because a sufficiently close matching sinusoid does not exist on frame k. In this case, frequency is held constant, but amplitude is interpolated from zero on frame k to the measured amplitude on frame k + 1. In part (c), a sinusoid on frame k is declared "dead" because a sufficiently close matching sinusoid does not exist on frame k + 1. In this case, frequency is held constant, but amplitude is interpolated from the measured amplitude on frame k to zero on frame k + 1.


trajectory. The conflict is resolved in favor of the closest frequency match. If the current trajectory loses, it picks the next best available match that satisfies the difference criterion outlined above. If the pre-existing match loses the conflict, the current trajectory claims the peak, and the pre-existing match is returned to the pool of available trajectories. This process is repeated until all trajectories are either matched or declared "dead." At the conclusion of the matching procedure, any unclaimed sinusoids on frame l + 1 are declared "born." As shown in Figure 9.2(b), trajectories at "birth" are backwards matched to themselves on frame l, with the amplitude interpolated from zero on frame l to the measured amplitude on frame l + 1.
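The MATLAB fragment below sketches a simplified, greedy version of this frequency-only matching step for a single pair of frames; it does not implement the full conflict-resolution rule described above, and the frequency values and threshold are assumptions made for illustration.

% Minimal sketch: greedy frequency matching between the peaks of two frames.
wL    = 2*pi*[200 405 880]/44100;                  % peak frequencies on frame l
wL1   = 2*pi*[202 400 655 885]/44100;              % peak frequencies on frame l+1
dwMax = 2*pi*20/44100;                             % maximum allowed frequency jump (assumption)
match = zeros(size(wL));                           % 0 => trajectory declared "dead"
taken = false(size(wL1));
for i = 1:numel(wL)
    cand = wL1;  cand(taken) = Inf;                % exclude peaks already claimed
    [dw, j] = min(abs(cand - wL(i)));
    if dw <= dwMax
        match(i) = j;  taken(j) = true;            % continue the i-th trajectory
    end                                            % else: the i-th trajectory dies
end
born = find(~taken);                               % unclaimed peaks on frame l+1 are "born"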

9.2.2 Sinusoidal Synthesis and Parameter Interpolation

The sinusoidal trajectories of frequency, amplitude, and phase triads are updated at a rate of once per frame. The synthesis portion of the sinusoidal model uses the frame-rate parameters that were extracted during the analysis procedure to generate a sample-rate output sequence, ŝ(n), by appropriately controlling the output of a bank of oscillators. One method for generating the model output is as follows. On the l-th frame, let the output sample on index m + lH represent the sum of the contributions of the K partials that were estimated on the l-th frame, i.e.,
\[ \hat{s}(m + lH) = \sum_{k=1}^{K} A_k^{l} \cos\!\big(\omega_k^{l} m + \phi_k^{l}\big), \qquad 0 \leq m < H \qquad (9.3) \]

where the parameter triad {ω_k^l, A_k^l, φ_k^l} represents the frequency, amplitude, and phase, respectively, of the k-th sinusoid, and the parameter H corresponds to the synthesis hop size (equal to the analysis hop size unless time-scale modification is required). The problem with this approach is that the sinusoidal parameters are not interpolated between frames, and therefore the sequence ŝ(n) will in general contain jump discontinuities at the frame boundaries. In order to avoid discontinuities and the associated artifacts, a better approach is to use oscillator control functions that interpolate the trajectory parameters from one frame to the next. If the k-th trajectory parameters on frames l and l + 1 are given by {ω_k^l, A_k^l, φ_k^l} and {ω_k^{l+1}, A_k^{l+1}, φ_k^{l+1}}, respectively, then the instantaneous amplitude A_k^l(m) can be linearly interpolated between the measured amplitudes A_k^l and A_k^{l+1} using the relation
\[ A_k^{l}(m) = A_k^{l} + \frac{A_k^{l+1} - A_k^{l}}{H}\, m, \qquad 0 \leq m < H \qquad (9.4) \]

Measured values for frequency and phase are interpolated next. For clarity, the subscript index k has been dropped throughout the remainder of this discussion, and the frame index l is used in its place. Frequency and phase interpolation are less straightforward than amplitude interpolation because of the fact that frequency is the phase derivative. Before defining a phase interpolation function, it is important to note that the instantaneous phase θ(m) is defined as
\[ \theta(m) = m\omega + \phi, \qquad 0 \leq m < H \qquad (9.5) \]


where ω and φ are the measured frequency and measured phase, respectively. For smooth interpolation between frames, therefore, it is necessary that the instantaneous phase be equal to the measured phases at the frame boundaries and, simultaneously, it is also necessary that the instantaneous phase derivatives be equal to the measured frequencies at the frame boundaries. To accomplish this, a cubic phase interpolation polynomial was proposed [McAu86] of the form
\[ \theta_l(m) = \gamma + \kappa m + \alpha m^2 + \beta m^3, \qquad 0 \leq m < H \qquad (9.6) \]

After some manipulation, it can be shown [McAu86] that the instantaneous phase is given by
\[ \theta_l(m) = \phi_l + \omega_l m + \alpha(M^{*})\, m^2 + \beta(M^{*})\, m^3, \qquad 0 \leq m < H \qquad (9.7) \]

and that the parameters α and β are obtained as follows:
\[
\begin{bmatrix} \alpha(M) \\ \beta(M) \end{bmatrix}
=
\begin{bmatrix} \dfrac{3}{H^{2}} & -\dfrac{1}{H} \\[2mm] -\dfrac{2}{H^{3}} & \dfrac{1}{H^{2}} \end{bmatrix}
\begin{bmatrix} \phi_{l+1} - \phi_l - \omega_l H + 2\pi M \\[1mm] \omega_{l+1} - \omega_l \end{bmatrix}
\qquad (9.8)
\]

Although many values for the parameter M will allow θ(m) to satisfy the frame boundary conditions, it was shown in [McAu86] that the smoothest function (in the sense that the integral of the square of the second derivative of the function θ(m) is minimized) is obtained for the value M = M*, which is given by
\[
M^{*} = \mathrm{round}\!\left( \frac{1}{2\pi}\left[ \big(\phi_l + \omega_l H - \phi_{l+1}\big) + \big(\omega_{l+1} - \omega_l\big)\frac{H}{2} \right] \right)
\qquad (9.9)
\]

where the round(·) operation denotes taking the integer closest to the function argument. We now restore the notational conventions used earlier, i.e., that the subscript corresponds to the parameter index and that the superscript represents a frame index. Given the interpolated amplitude and instantaneous phase functions, A_k^l(m) and θ_k^l(m), respectively, the interpolated sinusoidal model synthesis expression becomes
\[ \hat{s}(m + lH) = \sum_{k=1}^{K} A_k^{l}(m) \cos\!\big(\theta_k^{l}(m)\big), \qquad 0 \leq m < H \qquad (9.10) \]
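The MATLAB sketch below works through Eqs. (9.4) and (9.7)–(9.10) for a single partial crossing one frame boundary; the hop size and all measured parameter values are illustrative assumptions.

% Minimal sketch: interpolated synthesis of one partial across one hop (Eqs. 9.4-9.10).
H   = 256;                                     % synthesis hop size (assumption)
wl  = 2*pi*440/44100;   wl1  = 2*pi*452/44100; % measured frequencies on frames l and l+1
Al  = 0.8;              Al1  = 0.6;            % measured amplitudes
phl = 0.1;              phl1 = 2.0;            % measured phases
m   = 0:H-1;
% Eq. (9.9): unwrapping integer M* giving the smoothest cubic phase track
Mstar = round((1/(2*pi)) * ((phl + wl*H - phl1) + (wl1 - wl)*H/2));
% Eq. (9.8): cubic polynomial coefficients [alpha; beta]
d  = [phl1 - phl - wl*H + 2*pi*Mstar;  wl1 - wl];
ab = [ 3/H^2, -1/H;  -2/H^3, 1/H^2 ] * d;
theta = phl + wl*m + ab(1)*m.^2 + ab(2)*m.^3;  % Eq. (9.7): instantaneous phase
A     = Al + (Al1 - Al)*m/H;                   % Eq. (9.4): linear amplitude track
s_hat = A .* cos(theta);                       % Eq. (9.10), single-partial case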

Unless otherwise specified, Eq. (9.10) is the standard synthesis expression used in the hybrid sinusoidal model presented throughout the remainder of this chapter. This section has provided the essential details of the basic sinusoidal model. In Section 9.6, the basic model is extended to enhance its robustness over a wider class of signals by including explicit model extensions for transient and noise-like signal energy. In Sections 9.3 through 9.5, we apply the sinusoidal


Figure 9.3. ASAC encoder (after [Edle96c]).

model presented here in the context of parametric audio codecs, i.e., the ASAC, the HILN, and the FM.

9.3 ANALYSIS/SYNTHESIS AUDIO CODEC (ASAC)

The sinusoidal analysis/synthesis audio codec (ASAC) for robust coding of general audio signals at rates between 6 and 24 kb/s was developed by Edler et al. at the University of Hannover and proposed for MPEG-4 standardization [Edle95] in 1995. An enhanced ASAC proposal later appeared in [Edle96a]. Initially, ASAC segments input audio into analysis frames over which the signal is assumed to be nearly stationary. Sinusoidal synthesis parameters are then extracted according to perceptual criteria, quantized, encoded, and transmitted to the decoder for synthesis. The algorithm distributes synthesis parameters across basic and enhanced bit streams to allow scalable output quality at bitrates of 6 and 24 kb/s. Architecturally, the ASAC scheme (Figure 9.3) consists of a pre-analysis block for window selection and envelope extraction, a sinusoidal analysis-by-synthesis parameter estimation block, a perceptual model, and a quantization and coding block. Although it bears similarities to sinusoidal speech coding [McAu86] [Geor92] [Geor97] and music synthesis [Serr90] algorithms that have been available for some time, the ASAC coder also incorporates some new techniques. In particular, whereas previous speech-specific sinusoidal coders emphasized waveform matching by minimizing reconstruction error norms such as the mean square error (MSE), ASAC disregards classical error minimization criteria and instead selects sinusoids in decreasing order of perceptual importance by means of an iterative analysis-by-synthesis loop. The perceptual significance of each component sinusoid is judged with respect to the masking power of the synthesis signal, which is determined by a simplified version of the psychoacoustic model from [Baum95], but without the temporal masking considerations. The remainder of the section presents details of the ASAC encoder [Edle96c].


9.3.1 ASAC Segmentation

Before analysis-by-synthesis begins, a pre-analysis segmentation adapts the analysis window length to match characteristics of the input signal, s(n). A short rectangular window with tapered edges is used during transient events, or a long Hanning window, twice the synthesis window length, is used during stationary segments. The pre-analysis process also extracts, via weighted linear regression, a parametric amplitude envelope for each frame to deal with rapid changes in the input signal. During transients, envelopes are parameterized in terms of a peak time as well as attack and decay slopes. Non-attack envelopes, on the other hand, are parameterized with only a single attack (+) or decay (−) slope.
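As a toy illustration of a single-slope envelope parameter, the fragment below fits one line to the magnitude of a frame by ordinary (unweighted) least squares; the actual ASAC pre-analysis uses a weighted regression and a peak-time/attack/decay parameterization, so this only hints at the idea.

% Minimal sketch: fit a single decay slope to a frame's amplitude envelope.
N = 1024;  n = (0:N-1).';
x = exp(-n/300) .* randn(N, 1);          % toy frame with a decaying envelope (assumption)
a = abs(x);                              % crude instantaneous amplitude
p = [n, ones(N,1)] \ a;                  % least-squares line: a(n) ~ p(1)*n + p(2)
slope = p(1);                            % negative slope => decay (-), positive => attack (+)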

9.3.2 ASAC Sinusoidal Analysis-by-Synthesis

The iterative analysis-by-synthesis block [Edle96c] estimates, one at a time, the parameters of the i-th individual constituent sinusoid, or partial, and every iteration identifies the most perceptually significant sinusoid remaining in the synthesis residual, e_i(n) = s(n) − ŝ_i(n), and adds it to the synthetic output, ŝ_i(n). Perceptual significance is assessed by comparing the synthesis residual against the masked threshold associated with the current synthetic output and choosing the residual sinusoid with the largest suprathreshold margin. This perceptual selection criterion assumes that the signal component with the most unmasked energy will be the most perceptually important. After the most important peak frequency, f_i, has been identified, a high-resolution spectral analysis technique inspired by [Kay89] refines the frequency estimate. Given refined frequency estimates for the current partial at frame start, f_i^start, and end, f_i^end, the amplitude, A_i, and phase, φ_i, parameters are extracted from a complex correlation coefficient obtained from an inner product between the synthesis residual and a complex exponential of linearly time-varying frequency. End-point frequencies for the complex exponential correspond to the frame start and frame end frequencies for the partial of interest. For the synthesis phase, both unscaled and envelope-scaled versions of the current partial are generated. The analysis-by-synthesis loop selects the version that minimizes the residual variance, subtracts the new partial from the current residual, and then repeats the analysis-by-synthesis parameter extraction loop with the updated residual. The loop repeats until the bit budget is exhausted. We note that, in contrast to the time-varying global gain used in [Geor92], a unique feature of the ASAC coder is that the temporal envelope may be selectively applied to individual synthesis partials.

9.3.3 ASAC Bit Allocation, Quantization, Encoding, and Scalability

The ASAC quantization and coding block provides bit rate and output quality scaling by generating simultaneously basic and enhanced bitstreams for each frame. The basic bitstream contains the three temporal envelope parameters, as well as parameters representing frequencies, amplitudes, envelope enable bits,


and continuation bits for each partial. A tracking algorithm classifies each partial as either new or continued by matching partials from frame to frame on the basis of amplitude and frequency similarities [Edle96c]. Track disputes (multiple matches for a given partial) are resolved in favor of the candidate that maximizes a weighted amplitude and frequency closeness measure. ASAC partial tracking and dispute resolution differs from [McAu86] in that amplitude matching is considered explicitly in addition to frequency proximity. For noncontinued partials, both frequencies and amplitudes are log quantized. For continued partials, on the other hand, frequency differences are uniformly quantized, while amplitude ratios are quantized on a log scale. The enhancement bitstream contains parameters to improve reconstructed signal fidelity over the basic bitstream. Enhanced bitstream parameters include finer quantization bits for the envelope parameters, phases for each partial, and finer frequency quantization bits for noncontinued partials above a threshold frequency. The phases are uniformly quantized. Because of the fixed-rate ASAC architecture, between 10 and 17 spectral lines may fit within the 6 kb/s basic bitstream for a given 32-ms frame. Like the time-frequency MPEG family algorithms [Bran94a] [Bosi96] (Chapter 10), the ASAC quantization and coding block maintains a bit reservoir to smooth local maxima in bit demand. These maxima tend to occur during transients, when many noncontinued partials arise and inflate the bit rate. Bits are deposited into the reservoir whenever the number of partials reaches a threshold and before exhausting the bit budget. Conversely, reservoir bits are withdrawn whenever the number of partials is below a threshold while the bit budget has already been exhausted.
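To make the idea of log-domain quantization of partial parameters concrete, the sketch below uniformly quantizes an amplitude and a frequency in the log domain. The step sizes and reference values are arbitrary illustrative choices, not the quantizers actually specified for ASAC.

% Minimal sketch: uniform quantization in the log domain (illustrative values only).
A  = 0.37;                             % partial amplitude to be quantized
f  = 1234.5;                           % partial frequency in Hz
dA = 1.5;                              % amplitude step in dB (assumption)
df = 0.01;                             % frequency step in octaves, ~0.7% per step (assumption)
qA = round(20*log10(A)/dA);            % transmitted amplitude index
qf = round(log2(f/20)/df);             % transmitted frequency index (20 Hz reference, assumption)
A_hat = 10^(qA*dA/20);                 % decoded amplitude
f_hat = 20*2^(qf*df);                  % decoded frequency in Hz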

9.3.3.1 ASAC Performance. When compared to standard speech codecs at similar bit rates, the first version of ASAC [Edle95] reportedly offered improved quality for nonharmonic tonal signals such as spectrally complex music, similar quality for single instruments, and impaired quality for clean speech [ISOI96b]. The later ASAC [Edle96a] was improved for certain signals [ISOI96c].

9.4 HARMONIC AND INDIVIDUAL LINES PLUS NOISE CODER (HILN)

The ASAC algorithm outperformed speech-specific algorithms at the same bit rate in subjective tests for certain test signals, particularly spectrally complex music characterized by large numbers of nonharmonically related sinusoids. The original ASAC, however, failed to match speech codec performance for other test signals, such as clean speech. As a result, the ASAC core was embedded in an enhanced algorithm [Purn98] intended to better match the coder's signal model with diverse input signal characteristics. In research proposed as part of an MPEG-4 "core experiment" [Purn97], Purnhagen et al. developed an "object-based" algorithm. In this approach, harmonic sinusoid, individual sinusoid, and colored noise objects could be combined in a hybrid source model to create a parametric signal representation. The enhanced algorithm, known as the "harmonic and individual lines plus noise" (HILN), is architecturally very similar to the original ASAC, with some modifications.


Like ASAC, the HILN coder (Figure 9.4) segments the input signal into 32-ms overlapping analysis frames and extracts a parametric temporal envelope during preanalysis. Unlike ASAC, however, the iterative analysis-synthesis block is extended to include a cascade of analysis stages for each of the available object types. In the enhanced analysis-synthesis system of HILN, harmonic analysis is applied first, followed by individual spectral line analysis, followed by shaped noise modeling of the two-stage residual. The remainder of the section presents some unique details of the HILN encoder, including descriptions of the HILN schemes for sinusoidal analysis-by-synthesis, bit allocation, and quantization and encoding.

9.4.1 HILN Sinusoidal Analysis-by-Synthesis

The HILN sinusoidal analysis-by-synthesis occurs in a cascade of three stages. First, the harmonic section estimates a fundamental frequency and also quantifies the harmonic partial amplitudes, using essentially the same method as ASAC. Cepstral analysis is used to estimate the fundamental. Harmonics may occur on either integer or noninteger multiples of the fundamental. A stretching [Flet91] parameter is available to account for the noninteger harmonic phenomena induced by, e.g., stiff-stringed instruments. In the second stage, a partial discriminator separates those partials belonging to the harmonic object from those partials belonging to the individual sinusoid object. In the last stage, the noise extraction module uses a cosine series expansion to estimate the spectral envelope of the noise residual left after all partials have been extracted, including those partials not already extracted by the harmonic and individual stages.

Figure 9.4. HILN encoder (after [Purn98]).


9.4.2 HILN Bit Allocation, Quantization, Encoding, and Decoding

Quantization and encoding is applied to the various object parameters in the following way. For individual sinusoids, quantization and coding is essentially the same as in ASAC. For the harmonic sinusoids, the fundamental frequency and partial amplitudes are log quantized and encoded. Since spectrally complex signals may contain in excess of 100 partials, the harmonic partial amplitudes beyond the tenth are grouped and represented by an average amplitude instead of by the individual amplitudes. This approach saves bits by exploiting the reduced frequency resolution of the auditory filter bank at higher frequencies. For the colored noise object, a gain is log quantized and transmitted along with the quantized parameters of the noise envelope. Bits are distributed between the three object types according to perceptual relevance. Unlike ASAC, none of the HILN sinusoidal coding objects include phase information. Start-up phases for new sinusoids are randomized at the decoder and then continued smoothly for later frames. The decoder also employs smooth fade-in and fade-out of partial tracks to prevent jump discontinuities in the output. The output noise object is synthesized using the parametric spectral envelope and randomized phase in conjunction with an inverse FFT. As is true of the ASAC algorithm, the parametric nature of HILN means that speed and pitch changes are realized in a straightforward manner.
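A minimal sketch of this kind of noise-object synthesis, using an assumed coarse spectral envelope, random start-up phases, and an inverse FFT, is shown below; the envelope shape and frame length are illustrative assumptions, not HILN bitstream parameters.

% Minimal sketch: synthesize one noise frame from a coarse magnitude envelope
% with random phases and an inverse FFT (illustrative values only).
N   = 512;                                  % frame length (assumption)
env = linspace(1, 0.1, N/2+1);              % assumed magnitude envelope for bins 0..N/2
ph  = pi*(2*rand(1, N/2+1) - 1);            % random phase in (-pi, pi)
X   = env .* exp(1j*ph);
X(1) = real(X(1));  X(end) = real(X(end));  % keep DC and Nyquist bins real-valued
Xfull = [X, conj(X(end-1:-1:2))];           % enforce conjugate symmetry
noise = real(ifft(Xfull));                  % time-domain noise-object frame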

9.4.2.1 HILN Performance. Results from subjective listening tests at 6 kb/s showed significant improvements for HILN over ASAC, particularly for the most critical test items that had previously generated the most objectionable ASAC artifacts [Purn97]. Compared to HILN, the CELP speech codecs are still able to represent more efficiently clean speech at low rates, and time-frequency codecs are able to encode more efficiently general audio at rates above 32 kb/s. Nevertheless, the HILN improvements relative to ASAC inspired the MPEG-4 committee to incorporate HILN into the MPEG-4 committee draft as the recommended low rate parametric audio coder [ISOI97a]. As far as applications are concerned, the HILN algorithm was deployed in a scalable low-rate Internet streaming audio scheme [Feit98].

9.5 FM SYNTHESIS

The HILN algorithm seeks to optimize coding efficiency by making combined use of three distinct source models. Although the HILN harmonic sinusoid object has been shown to facilitate increased coding gain for certain signals, it is possible that other object types may offer opportunities for greater efficiency when representing spectrally complex harmonic signals. This notion motivated a recent investigation into the use of frequency modulation (FM) synthesis techniques [Chow73] in low-rate sinusoidal audio coding for harmonically structured single instrument sounds [Wind98]. In this section, we first review the basic principles of FM synthesis and then present an experimental audio codec that makes use of FM synthesis operators for low-rate coding.


9.5.1 Principles of FM Synthesis

FM synthesis offers advantages over other harmonic coding methods (e.g., [Ferr96b] [Edle96c]) because of its ability to model, with relatively few parameters, harmonic signals that have many partials. In the simplest FM synthesis, for example, the frequency of a sine wave (carrier) is modulated by another sine wave (modulator) to generate a complex waveform with spectral characteristics that depend on a modulation index and the parameters of the two sine waves. In continuous time, the FM signal is given by
\[ s(t) = A \sin\!\big[\,2\pi f_c t + I \sin(2\pi f_m t)\,\big] \qquad (9.11) \]

where A is the amplitude, f_c is the carrier frequency, f_m is the modulation frequency, I is the modulation index, and t is the time index. The associated Fourier series representation is
\[ s(t) = \sum_{k=-\infty}^{\infty} J_k(I)\, \sin\!\big(2\pi f_c t + 2\pi k f_m t\big) \qquad (9.12) \]

where J_k(I) is the Bessel function of the first kind. It can be seen from Eq. (9.12) that a large number of harmonic partials can be generated (Figure 9.5) by controlling only three parameters per FM operator. One can observe that the fundamental and harmonic frequencies are determined by f_c and f_m, and that the harmonic partial amplitudes are controlled by the modulation index, I. The Bessel envelope, moreover, essentially determines the FM spectral bandwidth. Example harmonic FM spectra for a unit amplitude, 200 Hz carrier are given in Figure 9.5 for modulation indices of 1 (Figure 9.5a) and 15 (Figure 9.5b). While both examples have identical harmonic structure, the amplitude envelopes and bandwidths differ markedly as a function of the index I. Clearly, the central issue in making effective use of the FM technique for signal modeling is parameter estimation accuracy.
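A small MATLAB sketch of Eq. (9.11) is shown below; it generates a single FM operator with fc = fm = 200 Hz and plots its magnitude spectrum, loosely reproducing the kind of harmonic spectra in Figure 9.5. The sampling rate and signal length are arbitrary assumptions.

% Minimal sketch: one FM operator (Eq. 9.11) and its magnitude spectrum.
fs = 8000;                                     % sampling rate in Hz (assumption)
t  = (0:fs-1)/fs;                              % one second of signal
A  = 1;  fc = 200;  fm = 200;  I = 1;          % try I = 15 for a much wider spectrum
s  = A*sin(2*pi*fc*t + I*sin(2*pi*fm*t));      % FM operator output
S  = abs(fft(s));                              % harmonics appear at fc + k*fm, k integer
f  = (0:length(S)-1)*fs/length(S);             % frequency axis in Hz
plot(f(1:fs/2), 20*log10(S(1:fs/2) + eps));    % level in dB versus frequency
xlabel('Frequency (Hz)'); ylabel('Level (dB)');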

9.5.2 Perceptual Audio Coding Using an FM Synthesis Model

Winduratna [Wind98] proposed an FM synthesis audio coding scheme in which the outputs of parallel FM "operators" are combined to model a single instrument sound. The algorithm (Figure 9.6) works as follows. First, the preanalysis block segments the input into frames and then extracts parameters for a set of individual spectral lines, as in [Edle96c]. Next, the preanalysis identifies a harmonic structure by maximizing an objective function [Wind98]. Given a fundamental frequency estimate from the pre-analysis, f0, the iterative parameter extraction loop estimates the parameters of individual FM operators and accumulates their contributions until the composite spectrum from the multiple parallel FM operators closely resembles the original. Perceptual closeness is judged to be adequate when the magnitude of the original minus synthetic harmonic (difference) spectrum is below the masked


Figure 9.5. Examples of harmonic FM spectra for fc = fm = 200 Hz with (a) I = 1 and (b) I = 15.

threshold [Baum95]. During each loop iteration, error minimizing values for the current operator are determined by means of an exhaustive search. Then, the contribution of the current operator is added to the existing output. Finally, the harmonic residual spectrum is checked against the masked threshold. The loop repeats, and additional operators are synthesized, until the error spectrum is below the masked


threshold. Spectral line amplitude prediction is used for adjacent frames having the same fundamental frequency.

In the quantization and encoding block, the parameters f0, fc, fm, I, and A are encoded with 12, 7, 7, 5, and 7 bits, respectively, for each operator. The fundamental frequency and amplitude are log quantized. The carrier and modulation frequencies are encoded as integer multiples of the fundamental, while the amplitude prediction weight, p, is quantized and encoded with 8 bits for adjacent frames having identical fundamental frequencies. At the expense of high complexity, the FM coding scheme was shown to efficiently represent single instrument sounds at bit rates between 2.1 and 4.8 kb/s. Performance of the scheme was investigated not only for audio but also for speech. Significant performance gains were realized relative to previous sinusoidal schemes. Using a 24-ms analysis window, for example, one critical male speech item was encoded at 2.12 kb/s using FM synthesis, compared to 4.5 kb/s for ASAC [Wind98], with similar output quality. Despite estimation difficulties for signals with more than one fundamental frequency, e.g., polyphonic music, the high efficiency of the FM synthesis technique for parametric representations of complex harmonic spectra makes it a likely candidate for future inclusion in object-based algorithms such as the HILN coder.

9.6 THE SINES + TRANSIENTS + NOISE (STN) MODEL

Although the basic sinusoidal model (Eq. (9.1)) can achieve very efficient representations of some signals (e.g., sustained vowels, via harmonically related components), extensions to the basic model have also been proposed to deal with other signals containing significant nontonal energy that can lead to modeling inefficiencies. A viable model for many musical signals is one of a deterministic plus a stochastic component. The deterministic part corresponds to the pitched part of the sound, and the stochastic part accounts for intrinsically random musical characteristics such as breath noise or bow noise. The spectral modeling and

Figure 9.6. FM synthesis coding scheme (after [Wind98]).


synthesis system (SMS) [Serr89] [Serr90] treats audio as the sum of a collection of K sinusoids ("partials") with time-varying parameters (Eq. (9.1)), with the addition of a stochastic component, e(n), i.e.,
\[ s(n) \approx \hat{s}(n) = \sum_{k=1}^{K} A_k \cos\!\big(\omega_k(n)\,n + \phi_k(n)\big) + e(n) \qquad (9.13) \]

where Ak represents the amplitude, ωk(n) represents the instantaneous frequency, and φk(n) represents the instantaneous phase of the k-th sinusoid. Different parametric models for the stochastic component have been proposed, such as parametric noise spectral envelopes [Serr89] or parametric Bark-band noise [Good96]. Other investigators have constructed hybrid signal models in which the residual of the sinusoidal analysis-synthesis, s(n) − ŝ(n), is decomposed using a discrete wavelet packet transform (DWPT) [Hamd96] and then represented in terms of carefully selected principal components. Although the sines + noise signal model has demonstrated improved performance relative to the basic model, one further extension has gained popularity, namely, the addition of a separate model for transient components, or, in other words, the construction of a three-part model consisting of sines + transients + noise [Verm98a] [Verm98b].
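The sketch below illustrates the sines + noise idea in MATLAB: it forms the residual e(n) = s(n) − ŝ(n) of a (here, trivially synthesized) deterministic part and summarizes the residual with a coarse band-energy envelope. The toy signals and band layout are assumptions, not the SMS reference design.

% Minimal sketch: sines + noise decomposition summary (illustrative only).
fs = 16000;  n = 0:fs-1;
s     = cos(2*pi*440/fs*n) + 0.1*randn(1, fs);   % tone plus noise as a toy input
s_hat = cos(2*pi*440/fs*n);                      % assume the analysis recovered the tone
e     = s - s_hat;                               % stochastic residual e(n)
E  = abs(fft(e(1:1024))).^2;                     % residual power spectrum, one frame
nb = 16;  edges = round(linspace(1, 513, nb+1)); % 16 coarse bands up to Nyquist (assumption)
bandEnergy = zeros(1, nb);
for b = 1:nb
    bandEnergy(b) = mean(E(edges(b):edges(b+1)-1));   % one envelope parameter per band
end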

9.7 HYBRID SINUSOIDAL CODERS

This section examines two experimental audio coders that attempt to enhance source robustness and output quality beyond that of the purely sinusoidal algorithms at low bit rates by embedding additional signal models in the coder architecture. The motivation for this work is as follows. Whereas the waveform-preserving perceptual transform (Chapter 7) and subband (Chapter 8) coders tend to target transparent quality at bitrates between 32 and 128 kb/s per channel, the sinusoidal coders proposed thus far in the literature have concentrated on low rate applications between 2 and 16 kb/s. Rather than transparent quality, these algorithms have emphasized source robustness, i.e., the ability to deal with general audio at low rates without constraining source model dependence. Low rate sinusoidal algorithms (ASAC, HILN, etc.) represent the perceptually significant portions of the magnitude spectrum from the original signal without explicitly treating the phase spectrum. As a result, perceptually transparent coding is typically not achieved with these algorithms. It is generally agreed that different classes of state-of-the-art coding techniques perform most efficiently in terms of output quality achieved for a given bit rate. In particular, CELP speech algorithms offer the best performance for clean speech below 16 kb/s, parametric sinusoidal techniques perform best for general audio between 16 and 32 kb/s, and so-called "time-frequency" audio codecs tend to offer the best performance at rates above 32 kb/s. Designers of comprehensive bit rate scalable coding systems, therefore, must decide whether to cascade multiple stages of fundamentally different coder architectures, with each stage operating on a residual signal from


the previous stage, or, alternatively, to simulcast independent bitstreams from different coder architectures and then select an appropriate decoder at the receiver. In fact, some experimental work performed in the context of MPEG-4 standardization has demonstrated that a cascaded hybrid sinusoidal/time-frequency coder can not only meet but in some cases even exceed the output quality achieved by the time-frequency (transform) coder alone at the same bit rate for certain critical test signals [Edle96b]. Several hybrid coder architectures that exploit these ideas by combining sinusoidal modeling with other well-known techniques have been proposed. Two examples are considered next.

9.7.1 Hybrid Sinusoidal-MDCT Algorithm

In one experimental hybrid scheme (Figure 9.7), for example, a locally decoded sinusoidal output signal (generated by, e.g., ASAC) is transformed using an independent MDCT that is configured identically to the second stage coder's internal MDCT. Then, a frequency selective switch (scalability tool) in the second stage coder determines whether quantization and encoding will be more efficient for MDCT coefficients from the original signal or for MDCT coefficient differences between the original and first stage decoded output. Clearly, the coefficient differencing scheme is only beneficial if the difference magnitudes are smaller than the original spectral magnitudes. Several of the issues critical to cascading successfully a parametric sinusoidal coder with a transform-based time-frequency coder are addressed in [Edle98]. For the scheme depicted in Figure 9.7, phase information is essential to minimization of the MDCT coefficient differences. In addition, frequency quantization also influences significantly the maximum residual amplitude. Although log frequency quantization is well suited for stand-alone operation, uniform frequency quantization better controls residual amplitudes in the hybrid scheme. These considerations motivated the inclusion of phase parameters and additional high-frequency spectral line quantization bits in the ASAC enhancement bitstream. In the case of harmonically structured signals, however,

Figure 9.7. Hybrid sinusoidal/T-F (MDCT) encoder (after [Edle98]).


the presence of many partials could overwhelm the enhancement bit allocation if hundreds of partial phases were required. One solution is to limit the enhancement processing to the first ten or so partials. Moreover, for a first stage coder such as the object-based HILN, the noise object would increase rather than decrease the MDCT residuals and should therefore be disabled in a cascade configuration.

9.7.2 Hybrid Sinusoidal-Vocoder Algorithm

It was earlier noted that CELP speech algorithms typically outperform the parametric sinusoidal coders for clean speech inputs at rates below 16 kb/s. There is some uncertainty, however, as to which class of algorithm is best suited when both speech and music are present. A hybrid scheme intended to outperform CELP/parametric "simulcast" for speech/music mixtures was proposed in [Edle98]. The hybrid scheme (Figure 9.8) works as follows. A first-stage parametric coder extracts the dominant harmonic tones, forms an internal residual, and then extracts the remaining harmonic as well as individual tones. Then, an external residual is formed between the original signal and a parametrically synthesized signal containing the contributions of only the second set of harmonic and individual tones. This system implicitly assumes that the speech signal is dominant. The idea is that the second stage cascaded vocoder will receive primarily the speech portions of the signal, since the first stage parametric coder removes the secondary harmonics and individual tones associated with the musical signal components.

This hybrid structure was reported to outperform simulcast configurations only when the voice signal was dominant [Edle98]. Quality degradations were reported for mixtures containing dominant musical signals. In the future, hybrid structures of this type will benefit from emerging techniques in speech/music discrimination (e.g., [Saun96] [Sche98b]). As observed by Edler, on the other hand, future audio coding research is also quite likely to focus on automatic decomposition of complex input signals into components for which individual coding is more efficient than direct coding of the mixture [Edle97], using hybrid structures. Advances in

Figure 9.8. Hybrid sinusoidal/vocoder (after [Edle98]).


sound separation and auditory scene analysis [Breg90] [Elli96] techniques will eventually make the automated decomposition process viable.

9.8 SUMMARY

In this chapter, we reviewed sinusoidal signal models for low-rate audio coding. Some of the key topics we studied include the ASAC, the HILN algorithm, hybrid sinusoidal coders, and the STN model. The parametric representation of sinusoidal coders has the potential for scalable audio compression and delivery of high-quality audio over the Internet. We will discuss the scalable and parametric audio coding tools integrated in the ISO/IEC MPEG-4 audio standard in the next chapter.

PROBLEMS

9.1 Modify the DFT such that the frequency components are computed for n = −N/2, ..., 0, ..., N/2 − 1. Why is this approach useful in sinusoidal analysis-synthesis?

9.2 In sinusoidal analysis-synthesis, we require the analysis window to be normalized, i.e., Σ_{n=−N/2}^{N/2} w(n) = 1. Explain, with mathematical arguments, the reason for the use of this normalization.

9.3 Explain, with diagrams, text, and equations, how the sinusoidal birth-death frequency tracker works.

9.4 Let s(n) be the input audio frame of size N samples. Let S(k) be the N-point DFT,
\[ S(k) = \frac{1}{N}\sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, 2, \ldots, N-1. \]
Obtain a p-th order frequency-domain AR model, using a least squares algorithm, that fits the DFT spectral components. Assume that p < N − 1.

9.5 Write an algorithm that selects the K peaks from an N-point DFT magnitude spectrum (K < N). Assume that the K spectral peaks are associated with pure tones in the input signal s(n). Design a sinusoidal analysis-synthesis model that minimizes the MSE, ε,
\[ \varepsilon = \sum_{n} e^{2}(n) = \sum_{n} \left[ s(n) - \sum_{k=1}^{K} A_k \cos(\omega_k n + \phi_k) \right]^{2} \]
where s(n) is the input speech, Ak represents the amplitude, ωk represents the frequency, and φk denotes the phase of the k-th sinusoid. Hint: Refer to [McAu86].


9.6 In this problem, we will design a sinusoidal analysis-by-synthesis (A-by-S) algorithm. Use the sinusoidal model s(n) ≈ ŝ(n) = Σ_{k=1}^{K} A_k cos(ω_k n + φ_k), where A_k is the amplitude, ω_k represents the frequency, and φ_k denotes the phase of the k-th sinusoid. Attempting to minimize E = Σ_n e²(n) = Σ_n [s(n) − Σ_{k=1}^{K} A_k cos(ω_k n + φ_k)]² with respect to A_k, ω_k, and φ_k simultaneously would lead to a nonlinear optimization problem. Hence, design a suboptimal A-by-S algorithm to solve for the parameters separately. Hint: Start your design by assuming that the (K − 1) sinusoidal parameters are already determined and the frequencies ω_k are known. The MSE will then become
\[
\varepsilon = \sum_{n} \Big[ \underbrace{s(n) - \sum_{k=1}^{K-1} A_k \cos(\omega_k n + \phi_k)}_{e_{K-1}(n)} - A_K \cos(\omega_K n + \phi_K) \Big]^{2}
\]
where e_{K−1}(n) is the error after K − 1 sinusoidal components were determined. See [Geor87] and [Geor90] to obtain additional information.

9.7 Assuming that a cubic phase interpolation polynomial is used in a sinusoidal analysis-synthesis model, derive Eq. (9.7) using results from [McAu86].

COMPUTER EXERCISES

9.8 Write a MATLAB program to implement the sinusoidal analysis-by-synthesis algorithm that you derived in Problem 9.6. Use the audio file ch9aud1.wav. Consider a frame size of N = 512 samples, no overlap between frames, and K = 20 sinusoids. In your deliverables, provide the MATLAB program; give the plots of the input audio, s(n), the synthesized signal, ŝ(n), and the error, e(n) = s(n) − ŝ(n), for the entire audio record.

9.9 Use the MATLAB program from the previous problem and estimate the sample mean square error, E_i = E[e_i²(n)], for each frame i, where E[·] is the expectation operator.
a. Calculate the average MSE, E_Avg = (1/N_f) Σ_{i=1}^{N_f} E_i, where N_f is the number of frames.
b. Repeat step (a) for a different number of sinusoids, e.g., K = 5, 15, 30, 50, and 60. Plot E_Avg in dB across the number of sinusoids, K.
c. How does the MSE behave with respect to the number of sinusoids selected?

9.10 In this problem, we will study the significance of phase information in the sinusoidal coding of audio. Use the audio file ch9aud1.wav. Consider a frame size of N = 512 samples. Let Si(k) be the 512-point FFT of the


i-th audio frame, and Si(k) = |Si(k)|e^{jΘi(k)}, where |Si(k)| is the magnitude spectrum and Θi(k) is the phase at the k-th FFT bin. Write a MATLAB program to pick 30 spectral peaks from the 512-point FFT magnitude spectrum. Set the rest of the spectral component magnitudes to a very small value, e.g., 0.000001. Call the resulting spectral magnitude |S̃i(k)|.
a. Set the phase Θi(k) = 0 and reconstruct the audio signal, ŝi(n) = IFFT512[|S̃i(k)|], for all the frames, i = 1, 2, ..., Nf.
b. Calculate the sample MSE, Ei = E[e_i²(n)], for each frame i. Compute the average MSE, E_Avg = (1/Nf) Σ_{i=1}^{Nf} Ei, where Nf is the number of frames.
c. Set the phase Θi = π(2 rand(1, 512) − 1), where Θi is the [1 × 512] uniform random phase vector that varies between −π and π. Reconstruct the audio signal, ŝi(n) = IFFT512[|S̃i(k)|e^{jΘi}], for all the frames. Note that you must avoid using the same set of random phase components in all the frames. Repeat step (b).
d. Set the phase Θi(k) = arg[Si(k)], i.e., we are using the input signal phase. Reconstruct the audio, ŝi(n) = IFFT512[|S̃i(k)|e^{jΘi(k)}], for all the frames. Repeat step (b).
e. Perform an experiment where the low-frequency sinusoids (<1.5 kHz) use the input signal phase and the high-frequency components use random phase as in part (c). Compute Ei and E_Avg using step (b).
f. Give a dB plot of Ei obtained from steps (a) and (b) across the number of frames, i = 1, 2, ..., Nf. Now superimpose the Ei's obtained in steps (c), (d), and (e). Which case performed better in terms of the MSE measure?
g. Evaluate the synthesized audio obtained from steps (a), (c), (d), and (e). Which case results in better perceptual quality?

9.11 In this problem, we will become familiar with the sinusoidal trajectories. Use the audio file ch9aud2.wav. Consider a frame size of N = 512 samples, no overlap between frames, and K = 10 sinusoids. Write a MATLAB program to implement the sinusoidal A-by-S algorithm [Geor90]. For each frame i, compute the amplitudes A_k^i, the frequencies ω_k^i, and the phases φ_k^i of the k-th sinusoid.
a. Give a contour plot of the frequencies ω_k^i versus the frame number, i = 1, 2, ..., Nf, associated with each of the 10 sinusoids.
b. Now assume a 50% overlap and repeat the above step. Comment on the smoothness and continuity of the frequency trajectories when frames are overlapped 50%.

9.12 Extend the Computer Exercise 9.8 when a 50% overlap is allowed between the frames. Use a Bartlett window or a Hanning window for overlap. See [Geor97] for hints on overlap-add A-by-S implementation. Do you


observe any improvements in terms of audio quality when overlapping is employed?

9.13 In this problem, we will design a hybrid subband/sinusoidal algorithm. Use the audio file ch9aud3_24k.wav. Consider a frame size of N = 512 samples, no overlap between frames, and K = 20 sinusoids. Note that the sampling frequency of the given audio file is 24 kHz.
a. Write a MATLAB program to design a sinusoidal A-by-S algorithm to pick 20 sinusoids in each frame. Compute the MSE, E = (1/Nf) Σ_{i=1}^{Nf} Ei, where Nf is the number of frames and Ei = E[e_i²(n)] is the MSE at the i-th frame.
b. Modify the program in (a) to design a sinusoidal A-by-S algorithm to pick 15 sinusoids between 20 Hz and 6 kHz, and 5 sinusoids between 6 kHz and 12 kHz, in each of the frames. Did you observe any performance improvements in terms of computational efficiency and/or perceptual quality?

CHAPTER 10

AUDIO CODING STANDARDS AND ALGORITHMS

10.1 INTRODUCTION

Despite the several advances, research towards developing lower rate coders for stereophonic and multichannel surround sound systems is strong in many industry and university labs. Multimedia applications such as online radio, web jukeboxes, and teleconferencing created a demand for audio coding algorithms that can deliver real-time wireless audio content. This will, in turn, require audio compression algorithms to deliver high-quality audio at low bit-rates with resilience/robustness to bit errors. Motivated by the need for audio compression algorithms for streaming audio, researchers pursue techniques such as combined speech/audio architectures, as well as joint source-channel coding algorithms that are optimized for the packet-switched Internet [Ben99] [Liu99] [Gril02], Bluetooth [Joha01] [Chen04] [BWEB], and, in some cases, wideband cellular networks [Ji02] [Toh03]. Also, the need for transparent reproduction quality coding algorithms in storage media such as the super audio CD (SACD) and the DVD-audio provided designers with new challenges. There is, in fact, an ongoing debate over the quality limitations associated with lossy compression. Some experts believe that uncompressed digital CD-quality audio (44.1 kHz/16 bit) is inferior to the analog original. They contend that sample rates above 55 kHz and word lengths greater than 20 bits are necessary to achieve transparency in the absence of any compression.

As a result, several standards have been developed [ISOI92] [ISOI94a] [Davi94] [Fiel96] [Wyl96b] [ISOI97b], particularly in the last five years [Gerz99] [ISOI99]




[ISOI00] [ISOI01b] [ISOI02a] [Jans03] [Kuzu03], and several are now being deployed commercially. This chapter and the next address some of the important audio coding algorithms and standards deployed during the last decade. In particular, we describe the lossy audio compression (LAC) algorithms in this chapter and the lossless audio coding (L2AC) schemes in Chapter 11.

Some of the LAC schemes (Figure 10.1) described in this chapter include the ISO/MPEG codec series, the Sony ATRAC, the Lucent Technologies PAC/EPAC/MPAC, the Dolby AC-2/AC-3, the APT-x 100, and the DTS coherent acoustics.

The rest of the chapter is organized as follows. Section 10.2 reviews the MIDI standard. Section 10.3 serves as an introduction to the multichannel surround sound format. Section 10.4 is dedicated to MPEG audio standards. In particular, Sections 10.4.1 through 10.4.6, respectively, describe the MPEG-1, MPEG-2 BC/LSF, MPEG-2 AAC, MPEG-4, MPEG-7, and MPEG-21 audio standards. Section 10.5 presents the adaptive transform acoustic coding (ATRAC) algorithm, the MiniDisc, and the Sony dynamic digital sound (SDDS) systems. Section 10.6 reviews the Lucent Technologies perceptual audio coder (PAC), the enhanced PAC (EPAC), and the multichannel PAC (MPAC) coders. Section 10.7 describes the Dolby AC-2 and the AC-3/Dolby Digital algorithms. Section 10.8 is devoted to the Audio Processing Technology APT-x 100 system. Finally, in Section 10.9, we examine the principles of coherent acoustics in coding that are embedded in the Digital Theater Systems–Coherent Acoustics (DTS-CA).

10.2 MIDI VERSUS DIGITAL AUDIO

The musical instrument digital interface (MIDI) encoding is an efficient way of extracting and representing semantic features from audio signals [Lehr93] [Penn95] [Hube98] [Whit00]. MIDI synthesizers, originally established in 1983, are widely used for musical transcriptions. Currently, the MIDI standards are governed by the MIDI Manufacturers Association (MMA) in collaboration with the Japanese Association of Musical Electronics Industry (AMEI).

The digital audio representation contains the actual sampled audio data, while a MIDI synthesizer represents only the instructions that are required to play the sounds. Therefore, the MIDI data files are extremely small when compared to the digital audio data files. Despite being able to represent high-quality stereo data at 10–30 kb/s, there are certain limitations with MIDI formats. In particular, the MIDI protocol uses a slow serial interface for data streaming at 31.25 kb/s [Foss95]. Moreover, MIDI is hardware dependent. Despite such limitations, musicians prefer the MIDI standard because of its simplicity and high-quality sound synthesis capability.

10.2.1 MIDI Synthesizer

A simple MIDI system (Figure 10.2) consists of a MIDI controller, a sequencer, and a MIDI sound module. The keyboard is an example of a MIDI controller


Figure 10.1. A list of some of the lossy audio coding algorithms: the ISO/IEC MPEG audio standards, i.e., MPEG-1 (ISO/IEC 11172-3) [ISOI92] [Bran94a], MPEG-2 BC/LSF (ISO/IEC 13818-3) [ISOI94a], MPEG-2 AAC (ISO/IEC 13818-7) [ISOI97b] [Bosi97], MPEG-4 (ISO/IEC 14496-3) [ISOI99] [ISOI00], and MPEG-7 (ISO/IEC 15938-4) [ISOI01b]; the Sony adaptive transform acoustic coding (ATRAC) [Yosh94]; the Lucent Technologies perceptual audio coder (PAC) [John96c], enhanced PAC (EPAC) [Sinh96], and multichannel PAC (MPAC) [Sinh98a]; the Dolby AC-2/AC-2A [Fiel91] [Davi92] and Dolby AC-3 [Fiel96] [Davis93]; the Audio Processing Technology APT-x 100 [Wyli96b]; and the DTS coherent acoustics [Smyt96] [Smyt99].

that translates the music notes into a real-time MIDI data stream. The MIDI data stream includes a start bit, 8 data bits, and one stop bit. A MIDI sequencer captures the MIDI data sequence and allows for various manipulations (e.g., editing, morphing, combining, etc.). On the other hand, a MIDI sound module acts as a sound player.


Figure 10.2. A simple MIDI system.

10.2.2 General MIDI (GM)

In order to facilitate a greater degree of file compatibility, the MMA developed the general MIDI (GM) standard. The GM constitutes a MIDI synthesizer with a standard set of voices (16 categories of 8 different sounds = 16 × 8 = 128 sounds) that are fixed. Although the GM standard does not describe the sound quality of synthesizer outputs, it provides details on the MIDI compatibility, i.e., the MIDI sounds composed on one sequencer can be reproduced or played back on any other system with reduced or no distortion. Different GM versions are available in the market today, i.e., GM Level-1, GM Level-2, GM lite, and scalable polyphonic MIDI (SPMIDI). Table 10.1 summarizes the various GM levels and versions.

10.2.3 MIDI Applications

MIDI has been successful in a wide range of applications, including music retrieval and classification [Mana02], music database search [Kost96], musical instrument control [MID03], MIDI karaoke players [MIDI], real-time object-based coding [Bros03], automatic recognition of musical phrases [Kost96], audio authoring [Mode98], waveform-editing [MIDI], singing voice synthesis [Maco97], loudspeaker design [Bald96], and feature extraction [Kost95]. The MPEG-4 structured audio tool incorporates many MIDI-like features. Other applications of MIDI are attributed to MIDI GM Level-2 [Mode00], XMIDI [LKur96], and PCM to MIDI transportation [Mart02] [MID03] [MIDI].


Table 10.1. General MIDI (GM) formats and versions

GM specifications               GM L-1      GM L-2      GM Lite    SPMIDI
Number of MIDI channels         16          16          16         16
Percussion (drumming) channel   10          10          11, 10     –
Polyphony (voices)              24 voices   32 voices   Limited    –

Other related information [MID03]:
GM L-2: This is the latest standard, introduced with capabilities of registered parameter controllers, MIDI tuning, and universal system exclusive messages. GM L-2 is backwards compatible with GM L-1.
GM Lite: As the name implies, this is a light version of GM L-1 and is intended for devices with limited polyphony.
SPMIDI: Intended for mobile devices, SPMIDI functions based on the fundamentals of GM Lite and scalable polyphony. This GM standard has been adopted by the Third-Generation Partnership Project (3GPP) for the multimedia messaging applications in cellular phones.

10.3 MULTICHANNEL SURROUND SOUND

Surround sound tracks (or channels) were included in motion pictures in the early 1950s in order to provide a more realistic cinema experience. Later, the popularity of surround sound resulted in its migration from cinema halls to home theaters equipped with matrixed multichannel sound (e.g., Dolby ProLogic). This can be attributed to the multichannel surround sound format [Bosi93] [Holm99] [DOLBY] and subsequent improvements in the audio compression technology.

Until the early 1990s, almost all surround sound formats were based on matrixing, i.e., the information from all the channels (front and surround) was encoded as a two-channel stereo, as shown in Figure 10.3. In the mid-1990s, discrete encoding, i.e., 5.1 separate channels of audio, was introduced by Dolby Laboratories and Digital Theater Systems (DTS).
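As an illustration of the matrixing idea in Figure 10.3, the MATLAB sketch below folds L, C, R, and a mono surround channel S down to a two-channel Lt/Rt pair using −3 dB gains and a ±90° phase shift on the surround. The exact gains and sign convention vary between matrix systems, and the test tones are placeholders, so this is only an assumed, simplified example.

% Minimal sketch: 4-to-2 surround matrix encode (illustrative convention only).
fs = 48000;  t = (0:fs-1)/fs;
L = sin(2*pi*440*t);  R = sin(2*pi*554*t);           % placeholder channel content
C = sin(2*pi*330*t);  S = sin(2*pi*220*t);
g   = 10^(-3/20);                                    % -3 dB gain (~0.707)
S90 = imag(hilbert(S));                              % ~90-degree shifted surround (Signal Processing Toolbox)
Lt  = L + g*C - g*S90;                               % left-total
Rt  = R + g*C + g*S90;                               % right-total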

10.3.1 The Evolution of Surround Sound

Table 10.2 lists some of the milestones in the history of multichannel surround sound systems. In the early 1950s, the first commercial multichannel sound format was developed for cinema applications. "Quad" (Quadraphonic) was the first home-multichannel format, promoted in the early 1970s. But, due to some incompatibility issues in the encoding/decoding techniques, the Quad was not successful. In the mid-1970s, Dolby overcame the incompatibility issues associated with the optical sound tracks and introduced a new format called the Dolby stereo, a special encoding technique that later became very popular. With the advent of compact discs (CDs) in the early 1980s, high-performance stereo systems became quite common. With the emergence of digital versatile discs (DVDs) in 1995–1996, content creators began to distribute multichannel music in digital format. Dolby Laboratories, in 1992, introduced another coding algorithm (Dolby AC-3, Section 10.7), called the Dolby Digital, that offers a high-quality multichannel (5.1-channel) surround sound experience. The Dolby Digital was later chosen as the primary audio coding technique for DVDs and for digital


audio broadcasting (DAB). The following year, Digital Theater Systems Inc. (DTS) announced a new format based on the Coherent Acoustics encoding principle (DTS-CA). The same year, Sony proposed the Sony Dynamic Digital Sound (SDDS) system that employs the Adaptive Transform Acoustic Coding (ATRAC) algorithm. Lucent Technologies' Multichannel Perceptual Audio Coder (MPAC) also has a five-channel surround sound configuration. Moreover, the development of two new audio recording technologies, namely the Meridian Lossless Packing (MLP) and the Direct Stream Digital (DSD) for use in the DVD-Audio [DVD01] and SACD [SACD02] formats, respectively, offers audiophiles listening experiences that promise to be more realistic.

10.3.2 The Mono, the Stereo, and the Surround Sound Formats

Figure 10.4 shows the three most common sound formats, i.e., mono, stereo, and surround. Mono is a simple method of recording sound onto a single channel that is typically played back on one speaker. In stereo encoding, a two-channel recording is employed. Stereo provides a sound field in front, while the multichannel surround sound provides a multi-dimensional sound experience. The surround sound systems typically employ a 5.1-channel configuration, i.e., sound tracks are recorded using five main channels: left (L), center (C), right (R), left surround (LS), and right surround (RS). In addition to these five channels, a sixth channel called the low-frequency-effects (LFE) channel is used for the subwoofer. Since the LFE channel covers only a fraction (less than 150 Hz) of the total frequency range, it is referred to as the .1-channel.

10.3.3 The ITU-R BS.775 5.1-Channel Configuration

In an effort to evaluate and standardize the so-called 5.1- or 3/2-channel configuration, several technical documents appeared [Bosi93] [ITUR94c] [EBU99] [Holm99] [SMPTE99] [AES00] [Bosi00] [SMPTE02]. Various international standardization bodies became involved in the multichannel algorithm adoption/evaluation process. These include the Audio Engineering Society

Figure 10.3 Multichannel surround sound matrixing.


Table 10.2 Milestones in multichannel surround sound

Year       Description
1941       Fantasia (Walt-Disney Productions) was the first motion picture to be released in the multichannel format
1955       Introduction of the first 35/70-mm magnetic stripe capable of providing 4/6 channels
1972       Video cassette consumer format – mono (1 channel)
1976       Dolby's stereo in optical format
1978       Videocassette – stereo (2 channels)
1979       Dolby's first stereo surround, called the split-surround sound format, offered 3 screen channels, 2 surround channels, and a subwoofer (3/2/0.1)
1982       Dolby surround format implemented on a compact disc (2 channel)
1992       Dolby digital optical (5.1 channel)
1993       Digital Theater Systems (DTS)
1993–94    Sony Dynamic Digital Sound (SDDS) based on ATRAC
1994       The ISO/IEC 13818-3 MPEG-2 backward compatible audio standard
1995       Dolby digital chosen for DVD (5.1 channel)
1997       DVD video released in market (5.1 channel)
1998       Dolby digital selected for digital audio broadcasting (DAB) in US
1999–      Super Audio CD and DVD-Audio storage formats
2000–      Direct Stream Digital (DSD) and Meridian Lossless Packing (MLP) technologies

(AES), the European Broadcasting Union (EBU), the Society of Motion Picture and Television Engineers (SMPTE), the ISO/IEC MPEG, and the ITU-Radio communication sector (ITU-R).

Figure 10.5 shows a 5.1-channel configuration described in the ITU-R BS.775-1 standard [ITUR94c]. Ideally, five full-bandwidth (150 Hz–20 kHz) loudspeakers, i.e., L, R, C, LS, and RS, are placed on the circumference of a circle in the following manner: the left (L) and right (R) front loudspeakers are placed at the extremities of an arc subtending 2θ = 60° at the reference listening position (see Figure 10.5), and the center (C) loudspeaker must be placed at 0° from the listener's axis. This enables compatibility with the listening arrangement for a conventional two-channel system. The two surround speakers, i.e., LS and RS, are usually placed at φ = 110° to 120° from the listener's axis. In order to achieve synchronization, the front and surround speakers must be equidistant, ℓ (usually 2–4 m), from the reference listening point, with their acoustic centers in the horizontal plane as shown in the figure. The sixth channel, i.e., the LFE channel, delivers bass-only omnidirectional information (20–150 Hz). This is because low frequencies imply longer wavelengths, where the ears are not sensitive to localization. The subwoofer placement receives less attention in the ITU-R standard; however, we note that the subwoofers are typically placed in a front corner (see Figure 10.4). In [Ohma97], Ingvar discusses the various problems associated with the subwoofer placement. Moreover, [SMPTE02] provides information on the



Figure 10.4 Mono, stereo, and surround sound systems.

loudspeaker placement for audio monitoring. The [SMPTE99] specifies the audio channel assignment and their relative levels for audio program recordings (3–6 audio channels) onto storage media for television sound.
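The placement geometry described above is easy to reproduce numerically. The short Python sketch below is our own illustration (the function name and coordinate convention are hypothetical, not taken from the standard); it places the five main loudspeakers on a circle of radius ℓ around the reference listening position using the angles quoted in the text.

import math

def bs775_speaker_positions(ell=3.0, surround_angle=110.0):
    """Return (x, y) loudspeaker coordinates in meters for C, L, R, LS, RS.

    ell            -- distance from the listener to each loudspeaker (2-4 m typical)
    surround_angle -- magnitude of the LS/RS angle from the listener axis (110-120 deg)
    """
    # Angles are measured from the listener's front axis; positive is to the listener's left.
    angles_deg = {"C": 0.0, "L": +30.0, "R": -30.0,
                  "LS": +surround_angle, "RS": -surround_angle}
    positions = {}
    for name, a in angles_deg.items():
        theta = math.radians(a)
        # y points toward the screen/center loudspeaker, x to the listener's left.
        positions[name] = (ell * math.sin(theta), ell * math.cos(theta))
    return positions

if __name__ == "__main__":
    for name, (x, y) in bs775_speaker_positions().items():
        print(f"{name:>2}: x = {x:+.2f} m, y = {y:+.2f} m")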

10.4 MPEG AUDIO STANDARDS

MPEG is the acronym for Moving Pictures Experts Group, which forms a workgroup (WG-11) of the ISO/IEC JTC-1 subcommittee (SC-29). The main functions of MPEG are: a) to publish technical results and reports related to audio/video compression techniques; b) to define means to multiplex (combine) video, audio, and information bitstreams into a single bitstream; and c) to provide descriptions and syntax for low bit rate audio/video coding tools for Internet and bandwidth-restricted communications applications. MPEG standards do not characterize or provide any rigid encoder specifications, but rather standardize the type of information that an encoder has to produce, as well as the way in which the decoder has to decompress this information. The MPEG workgroup has its own official webpage that can be accessed at [MPEG]. The MPEG video aspect of the standard is beyond the scope of this book; however, we include some tutorial references and relevant standards [LeGal92] [Scha95] [ISO-V96] [Hask97] [Mitc97] [Siko97a] [Siko97b].


Figure 10.5 A typical 3/2-channel configuration described in the ITU-R BS.775-1 standard [ITUR94c]: L, left; C, center; R, right; LS, left surround; and RS, right surround loudspeakers (θ = 30°, φ = 110° to 120°, ℓ ≈ 2 to 4 m, ℓ′ ≈ 0.65 to 0.85 m; worst-case listening positions are also indicated). Note that the figure is not drawn to scale.

MPEG Audio – Background. MPEG has come a long way since the first ISO/IEC MPEG standard was published in 1992. With the emergence of the Internet, MPEG is now also addressing content-based multimedia descriptions and database search. There are five different MPEG audio standards published, i.e., MPEG-1, MPEG-2 BC, MPEG-2 NBC/AAC, MPEG-4, and MPEG-7; MPEG-21 is being formed.

Before proceeding with the details of the MPEG audio standards, however, it is necessary to discuss terminology and notation. The phases correspond to the MPEG audio standards type and, to a lesser extent, to their relative release time, e.g., MPEG-1, MPEG-2, MPEG-4, etc. The layers represent a family of coding algorithms within the MPEG standards. Only MPEG-1 and -2 are provided with layers, i.e., MPEG-1 layers I, II, and III; MPEG-2 layers I, II, and III. The versions denote the various stages in the audio coding standardization phase. MPEG-4 was standardized in two stages (versions 1 and 2), with new functionality being added to the older version. The newer versions are backward compatible with the older versions. Table 10.3 itemizes the various MPEG audio standards and their specifications. A brief overview of the MPEG standards follows.

Table 10.3 An overview of the MPEG audio standards

MPEG-1
  Standardization details: ISO/IEC 11172-3, 1992
  Bit rates (kb/s): 32–448 (Layer I); 32–384 (Layer II); 32–320 (Layer III)
  Sampling rates (kHz): 32, 44.1, 48
  Channels: Mono (1), stereo (2)
  Related references: [ISOI92] [Bran94a] [Shli94] [Pan95] [Noll93] [Noll95] [Noll97] [John99]
  Related information: A generic compression standard that targets primarily multimedia storage and retrieval

MPEG-2 BC/LSF
  Standardization details: ISO/IEC 13818-3, 1994
  Bit rates (kb/s): 32–256 (Layer I); 8–160 (Layers II, III)
  Sampling rates (kHz): 16, 22.05, 24
  Channels: Multichannel surround sound (5.1)
  Related references: [ISOI94a] [Stol93a] [Gril94] [Noll93] [Noll95] [John99]
  Related information: First digital television standard that enables lower frequencies and multichannel audio coding

MPEG-2 NBC/AAC
  Standardization details: ISO/IEC 13818-7, 1997
  Bit rates (kb/s): 8–160
  Sampling rates (kHz): 8–96
  Channels: Multichannel
  Related references: [ISOI97b] [Bosi96] [Bosi97] [John99] [ISOI99] [Koen99] [Quac98b] [Gril99]
  Related information: Advanced audio coding scheme that incorporates new coding methodologies (e.g., prediction, noise shaping, etc.)

MPEG-4 (Version 1) and MPEG-4 (Version 2)
  Standardization details: ISO/IEC 14496-3, Oct. 1998; ISO/IEC 14496-3 AMD-1, Dec. 1999
  Bit rates (kb/s): 0.2–384; 0.2–384 (finer levels of increment possible)
  Sampling rates (kHz): 8–96; 8–96
  Channels: Multichannel
  Related references: [Gril97] [Park97] [Edle99] [Purn00a] [Sche98a] [Vaan00] [Sche01] [ISOI00] [Kim01] [Herr00a] [Purn99b] [Alla99] [Hilp00] [Sper00] [Sper01]
  Related information: The first content-based multimedia standard, allowing universality, interactivity, and a combination of natural and synthetic material coded in the form of objects

Table 10.3 (continued)

MPEG-7
  Standardization details: ISO/IEC 15938-4, Sept. 2001
  Bit rates (kb/s): –
  Sampling rates (kHz): –
  Channels: –
  Related references: [ISOI01b] [Nack99a] [Nack99b] [Quac01] [Lind99] [Lind00] [Lind01] [Speng01] [ISOI02a]
  Related information: A normative metadata standard that provides a Multimedia Content Description Interface

MPEG-21
  Standardization details: ISO/IEC-21000
  Bit rates (kb/s): –
  Sampling rates (kHz): –
  Channels: –
  Related references: [ISOI03a] [Borm03] [Burn03]
  Related information: A multimedia framework that provides interoperability in content-access and distribution

MPEG-1. After four years of extensive collaborative research by audio coding experts worldwide, the first ISO/MPEG audio coding standard, MPEG-1 [ISOI92], for VHS-stereo-CD-quality was adopted in 1992. MPEG-1 supports video bit rates up to about 1.5 Mb/s, providing Video Home System (VHS) quality, and stereo audio at 192 kb/s. Applications of MPEG-1 range from storing video and audio on CD-ROMs to Internet streaming through the popular MPEG-1 Layer III (MP3) format.

MPEG-2 Backward Compatible (BC). In order to extend the capabilities offered by MPEG-1 to support the so-called 3/2 (or 5.1) channel format and to facilitate higher bit rates for video, MPEG-2 [ISOI94a] was published in 1994. The MPEG-2 standard supports digital video transmission in the range of 2–15 Mb/s over cable, satellite, and other broadcast channels; audio coding is defined at bit rates of 64–192 kb/s per channel. Multichannel MPEG-2 is backward compatible with MPEG-1, hence the acronym MPEG-2 BC. The MPEG-2 BC standard is used in high-definition television (HDTV) [ISOI94a] and produces the video quality required in digital television applications.

MPEG-2 Nonbackward Compatible/Advanced Audio Coding (AAC). The backward compatibility constraints imposed on the MPEG-2 BC/LSF algorithm made it impractical to code five channels at rates below 640 kb/s. As a result, MPEG began standardization activities for a nonbackward compatible advanced coding system targeting "indistinguishable" quality at a rate of 384 kb/s for five full-bandwidth channels. In less than three years, this effort led to the adoption of the MPEG-2 nonbackward compatible/advanced audio coding (NBC/AAC) algorithm [ISOI97b], a system that exceeded design goals and produced the desired quality at only 320 kb/s for five full-bandwidth channels.

MPEG-4. MPEG-4 was established in December 1998 after many proposed algorithms were tested for compliance with the program objectives established by the MPEG committee. MPEG-4 video supports bit rates up to about 1 Gb/s. The MPEG-4 audio [ISOI99] [ISOI00] was released in several steps, resulting in versions 1 and 2. MPEG-4 comprises an integrated family of algorithms with wide-ranging provisions for scalable, object-based speech and audio coding at bit rates from as low as 200 b/s up to 64 kb/s per channel. The distinguishing features of MPEG-4 relative to its predecessors are extensive scalability, object-based representations, user interactivity/object manipulation, and a comprehensive set of coding tools available to accommodate trade-offs between bit rate, complexity, and quality. Very low rates are achieved through the use of structured representations for synthetic speech and music, such as text-to-speech and MIDI. The standard also provides integrated coding tools that make use of different signal models depending upon the desired bit rate, bandwidth, complexity, and quality.

MPEG-7. The MPEG-7 audio committee activities started in 1996. In less than four years, a committee draft was finalized, and the first audio standard addressing the "multimedia content description interface" was published in September 2001. MPEG-7 [ISOI01b] targets content-based multimedia applications. In particular, MPEG-7 audio supports a broad range of applications – multimedia digital libraries, broadcast media selection, multimedia editing and searching, and


multimedia indexing/searching. Moreover, it provides ways for efficient audio file retrieval and supports both text-based and context-based queries.

MPEG-21. The MPEG-21 ISO/IEC-21000 standard [MPEG] [ISOI02a] [ISOI03a] defines interoperable and highly automated tools that enable content distribution across different terminals and networks in a programmed manner. This structure enables end-users to have capabilities for universal multimedia access.

10.4.1 MPEG-1 Audio (ISO/IEC 11172-3)

The MPEG-1 audio standard (ISO/IEC 11172-3) [ISOI92] comprises a flexible hybrid coding technique that incorporates several methods, including subband decomposition, filter-bank analysis, transform coding, entropy coding, dynamic bit allocation, nonuniform quantization, adaptive segmentation, and psychoacoustic analysis. The MPEG-1 audio codec operates on 16-bit PCM input data at sample rates of 32, 44.1, and 48 kHz. Moreover, MPEG-1 offers separate modes for mono, stereo, dual independent mono, and joint stereo. Available bit rates are 32–192 kb/s for mono and 64–384 kb/s for stereo. Several tutorials on the MPEG-1 standards [Noll93] [Bran94a] [Shli94] [Bran95] [Herr95] [Noll95] [Pan95] [Noll97] [John99] have appeared. Chapter 5, Section 5.7, presents the step-by-step procedure involved in the ISO/IEC 11172-3 (MPEG-1 Layer I) psychoacoustic model 1 [ISOI92] simulation. We summarize these steps in the context of the MPEG-1 audio standard.

The MPEG-1 architecture contains three layers of increasing complexity, delay, and output quality. Each higher layer incorporates functional blocks from the lower layers. Figure 10.6 shows the MPEG-1 Layer III encoder block diagram. The input signal is first decomposed into 32 critically subsampled subbands using a polyphase realization of a pseudo-QMF (PQMF) bank (see also Chapter 6). The channels are equally spaced such that a 48-kHz input signal is split into 750-Hz subbands, with the subbands decimated 32:1. A 511th-order prototype filter was chosen such that the inherent overall PQMF distortion remains below the threshold of audibility. Moreover, the prototype filter was designed


Figure 10.6 ISO/MPEG-1 Layer III encoder.


for high sidelobe attenuation (96 dB) to insure that intraband aliasing remains negligible. For the purposes of psychoacoustic analysis and determination of just noticeable distortion (JND) thresholds, a 512-point (Layer I) or 1024-point (Layer II) FFT is computed in parallel with the subband decomposition for each decimated block of 12 input samples (8 ms at 48 kHz). Next, the subbands are block companded (normalized by a scale factor) such that the maximum sample amplitude in each block is unity; then, an iterative bit allocation procedure applies the JND thresholds to select an optimal quantizer from a predetermined set for each subband. Quantizers are selected such that both the masking and bit rate requirements are simultaneously satisfied. In each subband, scale factors are quantized using 6 bits and quantizer selections are encoded using 4 bits.
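The block-companding step can be illustrated with a short Python sketch. The fragment below is a simplified illustration with hypothetical helper names, not the normative MPEG-1 procedure: it normalizes one 12-sample subband block by its scale factor and then applies a uniform quantizer whose resolution would be supplied by the bit-allocation step.

import numpy as np

def block_compand_and_quantize(subband_block, n_bits):
    """subband_block: 12 decimated samples of one subband; n_bits: quantizer resolution."""
    scale_factor = np.max(np.abs(subband_block)) + 1e-12    # avoid division by zero
    normalized = subband_block / scale_factor                # peak magnitude <= 1
    levels = 2 ** n_bits - 1
    # Illustrative uniform quantizer on [-1, 1]; the real quantizer set is graded per subband.
    q_indices = np.round((normalized + 1.0) / 2.0 * levels).astype(int)
    dequantized = (q_indices / levels) * 2.0 - 1.0
    return q_indices, scale_factor, dequantized * scale_factor

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = 0.3 * rng.standard_normal(12)
    idx, sf, rec = block_compand_and_quantize(block, n_bits=8)
    print("scale factor:", sf, " max abs error:", np.max(np.abs(rec - block)))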

10.4.1.1 Layers I and II. For Layer I encoding, decimated subband sequences are quantized and transmitted to the receiver in conjunction with side information, including quantized scale factors and quantizer selections. Layer II improves three portions of Layer I in order to realize enhanced output quality and reduced bit rates at the expense of greater complexity and increased delay. First, the Layer II perceptual model relies upon a higher-resolution FFT (1024 points) than does Layer I (512 points). Second, the maximum subband quantizer resolution is increased from 15 to 16 bits. Despite this increase, a lower overall bit rate is achieved by decreasing the number of available quantizers with increasing subband index. Finally, scale factor side information is reduced while exploiting temporal masking by considering properties of three adjacent 12-sample blocks and optionally transmitting one, two, or three scale factors plus a 2-bit side parameter to indicate the scale factor mode. Average mean opinion scores (MOS) of 4.7 and 4.8 were reported [Noll93] for monaural Layer I and Layer II codecs operating at 192 and 128 kb/s, respectively. Averages were computed over a range of test material.

10.4.1.2 Layer III. The Layer III MPEG architecture (Figure 10.7) achieves performance improvements by adding several important mechanisms on top of the Layer II foundation. The MPEG Layer III algorithm operates on consecutive frames of data. Each frame consists of 1152 audio samples; a frame is further split into two subframes of 576 samples each. A subframe is called a granule. At the decoder, every granule can be decoded independently. A hybrid filter bank is introduced to increase frequency resolution and thereby better approximate critical band behavior. The hybrid filter bank includes adaptive segmentation to improve pre-echo control. Sophisticated bit allocation and quantization strategies that rely upon nonuniform quantization, analysis-by-synthesis, and entropy coding are introduced to allow reduced bit rates and improved quality. The hybrid filter bank is constructed by following each subband filter with an adaptive MDCT. This practice allows for higher frequency resolution and pre-echo control. Use of an 18-point MDCT, for example, improves frequency resolution to 41.67 Hz per spectral line. The adaptive MDCT switches between 6 and 18 points to allow improved pre-echo control. Shorter blocks (4 ms) provide

Figure 10.7 ISO/MPEG-1 Layer III encoder.

for temporal pre-masking of pre-echoes during transients; longer blocks during steady-state periods improve coding gain by reducing side information and hence bit rates.
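The quoted resolutions can be checked with a few lines of arithmetic. The snippet below (illustrative only) computes the spectral-line spacing and block duration for the long (18-point) and short (6-point) MDCT cases at a 48-kHz sample rate; the long case reproduces the 41.67-Hz spacing mentioned above and the short case the 4-ms block duration.

fs = 48_000.0          # sample rate in Hz
n_subbands = 32        # PQMF channels ahead of the MDCT

for mdct_points, label in [(18, "long"), (6, "short")]:
    spectral_lines = n_subbands * mdct_points              # 576 or 192 lines per granule
    freq_resolution = (fs / 2.0) / spectral_lines          # Hz per spectral line
    block_duration_ms = 1000.0 * mdct_points * n_subbands / fs
    print(f"{label:5s}: {freq_resolution:6.2f} Hz/line, {block_duration_ms:4.1f} ms block")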

Bit allocation and quantization of the spectral lines are realized in a nested loop procedure that uses both nonuniform quantization and Huffman coding. The inner loop adjusts the nonuniform quantizer step sizes for each block until the number of bits required to encode the transform components falls within the bit budget. The outer loop evaluates the quality of the coded signal (analysis-by-synthesis) in terms of quantization noise relative to the JND thresholds. Average MOS of 3.1 and 3.7 were reported [Noll93] for monaural Layer II and Layer III codecs operating at 64 kb/s.
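The nested-loop idea can be sketched in a few lines of Python. The fragment below is a highly simplified illustration, not the normative Layer III routine: count_bits is a stand-in for the Huffman coder, the quantizer is uniform rather than nonuniform, and the distortion check is performed per spectral line rather than per scale-factor band.

import numpy as np

def count_bits(q):
    """Stand-in bit count: cost grows with quantized magnitude (not a real Huffman coder)."""
    return int(np.sum(np.floor(np.log2(np.abs(q) + 1.0)) + 2))

def nested_loop_allocate(spectrum, jnd, bit_budget, max_iters=32):
    amp = np.ones_like(spectrum)   # outer-loop amplification of under-coded lines
    step = 1e-3                    # inner-loop global quantizer step size
    best = None
    for _ in range(max_iters):
        # Inner (rate) loop: coarsen the quantizer until the frame fits the budget.
        while True:
            q = np.round(spectrum * amp / step)
            if count_bits(q) <= bit_budget:        # budget must exceed the all-zero cost
                break
            step *= 2 ** 0.25
        recon = q * step / amp
        noise = (spectrum - recon) ** 2
        best = (q, amp.copy(), step)               # keep latest result (analysis-by-synthesis)
        # Outer (distortion) loop: amplify lines whose noise is still audible and retry.
        audible = noise > jnd
        if not np.any(audible):
            break
        amp[audible] *= 2 ** 0.25
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal(64)
    q, amp, step = nested_loop_allocate(x, jnd=np.full(64, 1e-4), bit_budget=300)
    print("final step size:", step, " bits used:", count_bits(q))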

10.4.1.3 Applications. MPEG-1 has been successful in numerous applications. For example, MPEG-1 Layer III has become the standard for transmission and storage of compressed audio for both World Wide Web (WWW) and handheld media applications (e.g., iPod). In these applications, the "MP3" label denotes MPEG-1 Layer III. Note that MPEG-1 audio coding has steadily gained acceptance and ultimately has been deployed in several other large-scale systems, including the European digital radio (DAB) or Eureka [Jurg96], the direct broadcast satellite or "DBS" [Prit90], and the digital compact cassette or "DCC" [Lokh92]. In particular, the Philips Digital Compact Cassette (DCC) is an example of a consumer product that essentially implements the 384 kb/s stereo mode of MPEG-1 Layer I. A discussion of the precision adaptive subband coding (PASC) algorithm and other elements of the DCC system is given in [Lokh92] and [Hoog94].

The collaborative European Advanced Communications Technologies and Services (ACTS) program adopted MPEG audio and video as the core compression technology for the Advanced Television at Low Bit rates And Networked Transmission over Integrated Communication systems (ATLANTIC) project. ATLANTIC is a system intended to provide functionality for television program production and distribution [Stor97] [Gilc98]. This system posed new challenges for MPEG deployment, such as seamless bitstream (source) switching [Laub98] and robust transcoding (tandem coding). Bitstream (source) switching becomes nontrivial when different bit rates and/or MPEG layers are associated with different program sources. Robust transcoding is also essential in the video production environment. Editing tasks inevitably require retrieval of compressed bit streams from archival storage, processing of program material in uncompressed form, and then replacement of the recoded compressed bit stream to the archival system. Unfortunately, transcoding is neither guaranteed nor likely to preserve perceptual noise masking [Rits96]. The ATLANTIC designers proposed a buried data "MOLE" signal to mitigate and in some cases eliminate transcoding distortion for cascaded MPEG-1 Layer II codecs [Flet98], ideally allowing downstream tandem stages to preserve the original bit stream. The idea behind the MOLE is to apply the same set of quantizers to the same set of data in the downstream codecs as in the original codec. The output bit stream will then be identical to the original bit stream, provided that


numerical precision in the analysis filter banks does not bias the data [tenK96b]. It is possible in a cascade of MPEG-1 Layer II codecs to regenerate the same set of decimated subband sequences in the downstream codec filter banks as in the original codec filter bank if the full-band PCM signal is properly time aligned at the input to each cascaded stage. Essentially, delays at the filter-bank input must correspond to integer delays at the subband level [tenK96b], and the analysis frames must contain the same block of data in each analysis filter bank. The MOLE signal therefore provides downstream codecs with timing synchronization, bit allocation, and scale-factor information for the MPEG bit stream on each frame. The MOLE is buried in the PCM samples between tandem stages and remains inaudible by occupying the LSB of each 20-bit PCM word. Although optimal time-alignment between codecs is possible even without the MOLE [tenK96b], there is unfortunately no easy way to force selection of the same set of quantizers and thus preserve the bit stream.

The widespread use and maturity of MPEG-1 relative to the more recent standards provided several concrete examples for the above discussion of MPEG-1 audio applications. Various real-time implementation schemes of MPEG-1 Layers I, II, and III codecs were proposed [Gbur96] [Hans96] [Main96] [Wang01]. We will next consider the MPEG-2 BC/LSF, MPEG-2 AAC, MPEG-4, and MPEG-7 algorithms. The discussion will focus primarily upon architectural novelties and differences with respect to MPEG-1.

10.4.2 MPEG-2 BC/LSF (ISO/IEC-13818-3)

MPEG-2 BC/LSF audio [Stol93a] [Gril94] [ISOI94a] [Stol96] extends the capabilities offered by MPEG-1 to support the so-called 3/2-channel format with left (L), right (R), center (C), and left and right surround (LS and RS) channels. The MPEG-2 BC/LSF audio standard is backward compatible with MPEG-1, which means that the 3/2-channel information transmitted by an MPEG-2 encoder can be appropriately decoded for 2-channel presentation by an MPEG-1 receiver. Another important feature that was implemented in MPEG-2 BC/LSF is multilingual compatibility. The acronym BC corresponds to the backward compatibility of MPEG-2 towards MPEG-1, and the extension of sampling frequencies to lower ranges (16, 22.05, and 24 kHz) is denoted by LSF. Several tutorials on MPEG-2 [Noll93] [Noll95] [John99] have appeared. Meares and Theile studied the potential application of matrixed surround sound [Mear97] in MPEG audio algorithms.

10.4.2.1 The Backward Compatibility Feature. Depending on the bit-demand constraints, interchannel dependencies, and the complexity allowed at the decoder, different methods can be employed to realize compatibility between the 3/2- and 2-channel formats. These methods include mid/side (MS), intensity coding, simulcast, and matrixing. The MS and intensity coding techniques are particularly handy when the bit demand imposed by multiple independent channels exceeds the bit budget. The MS scheme is carefully controlled [Davi98]


to maintain compatibility among the mono, stereo, and surround sound formats. Intensity coding, also known as channel coupling, is a multichannel irrelevancy reduction coding technique that exploits properties of spatial hearing. The idea behind intensity coding is to transmit only one envelope, with some side information, instead of two or more from independent channels. The side information consists of a set of coefficients that is used to recover individual spectra from the intensity channel. The simulcast encoding involves transmission of both stereo and multichannel bitstreams. Two separate bitstreams, i.e., one for 2-channel stereo and another one for the multichannel audio, are transmitted, resulting in reduced coding efficiency.
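The intensity-coding idea can be made concrete with a toy sketch. In the Python fragment below (our own illustration; the band boundaries and the scaling rule are assumptions, not the normative MPEG-2 procedure), one band of two channels is replaced by a single coupled spectrum plus two scaling coefficients that serve as the side information.

import numpy as np

def intensity_encode(left_band, right_band):
    """Replace two channel spectra in one band by a coupled spectrum plus scale factors."""
    coupled = left_band + right_band                       # the single transmitted envelope
    energy = np.sum(coupled ** 2) + 1e-12
    scale_l = np.sqrt(np.sum(left_band ** 2) / energy)     # side information per channel
    scale_r = np.sqrt(np.sum(right_band ** 2) / energy)
    return coupled, (scale_l, scale_r)

def intensity_decode(coupled, scales):
    """Recover per-channel spectra by re-scaling the coupled spectrum."""
    scale_l, scale_r = scales
    return scale_l * coupled, scale_r * coupled

The decoded bands preserve the per-channel energies but not the original fine structure of each channel, which is precisely the spatial-hearing irrelevancy that intensity coding exploits.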

MPEG-2 BC/LSF employs matrixing techniques [tenK92] [tenK94] [Mear97] to down-mix the 3/2-channel format to the 2-channel format. Down-mixing capability is essential for the 5.1-channel system, since many of the playback systems are stereophonic or even monaural. Figure 10.8 depicts the matrixing technique employed in MPEG-2 BC/LSF, which can be mathematically expressed as follows:

L_total = x(L + yC + zLs)                                          (10.1)

R_total = x(R + yC + zRs)                                          (10.2)

where x, y, and z are constants specified by the IS-13818-3 MPEG-2 standard [ISOI94a]. In Eqs. (10.1) and (10.2), L, C, R, Ls, and Rs represent the 3/2-channel configuration, and the parameters L_total and R_total correspond to the 2-channel format.

Three different choices are provided in the MPEG-2 audio standard [ISOI94a] for choosing the values of x, y, and z to perform the 3/2-channel to 2-channel down-mixing. These include

Choice 1:  x = 1/(1 + √2),   y = 1/√2,   z = 1/√2                  (10.3)

Choice 2:  x = 2/(3 + √2),   y = 1/√2,   z = 1/2                   (10.4)

Choice 3:  x = 1,            y = 1,      z = 1                     (10.5)

The selection of the down-mixing parameters is encoder dependent. The availability of the basic stereo format channels, i.e., L_total and R_total, and the surround sound extension channels, i.e., C, Ls, and Rs, at the decoder helps to decode both 3/2-channel and 2-channel bitstreams. This insures the backward compatibility in the MPEG-2 BC/LSF audio coding standard.
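A small Python sketch of the down-mix in Eqs. (10.1) and (10.2), parameterized by the three (x, y, z) choices of Eqs. (10.3)–(10.5), is given below. The channel signals are ordinary NumPy arrays; this is an illustration, not the normative encoder.

import numpy as np

DOWNMIX_CHOICES = {
    1: (1.0 / (1.0 + np.sqrt(2.0)), 1.0 / np.sqrt(2.0), 1.0 / np.sqrt(2.0)),   # Eq. (10.3)
    2: (2.0 / (3.0 + np.sqrt(2.0)), 1.0 / np.sqrt(2.0), 0.5),                  # Eq. (10.4)
    3: (1.0, 1.0, 1.0),                                                        # Eq. (10.5)
}

def downmix_3_2(L, R, C, Ls, Rs, choice=1):
    """Matrix the 3/2-channel set down to the 2-channel (L_total, R_total) pair."""
    x, y, z = DOWNMIX_CHOICES[choice]
    L_total = x * (L + y * C + z * Ls)      # Eq. (10.1)
    R_total = x * (R + y * C + z * Rs)      # Eq. (10.2)
    return L_total, R_total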

10.4.2.2 MPEG-2 BC/LSF Encoder. The steps involved in the reduction of the objective redundancies and the removal of the perceptual irrelevancies in MPEG-2 BC/LSF encoding are the same as in the MPEG-1 audio standard. However, differences arise from employing a multichannel and multilingual bitstream



Figure 10.8 Multichannel surround sound matrixing: (a) encoder, (b) decoder.

format in the MPEG-2 audio. A matrixing module is used for this purpose. Another important feature employed in MPEG-2 BC/LSF is the "dynamic cross-talk," a multichannel irrelevancy reduction technique. This feature exploits properties of spatial hearing and encodes only one envelope instead of two or more, together with some side information (i.e., scale factors). Note that this technique is in some sense similar to the intensity coding that we discussed earlier. In summary, matrixing enables backward compatibility between the MPEG-2 and MPEG-1 bitstreams, and dynamic cross-talk reduces the interchannel redundancies.

In Figure 10.9, first, the segmented audio frames are decomposed into 32 critically subsampled subbands using a polyphase realization of a pseudo-QMF (PQMF) bank. Next, a matrixing module is employed for down-mixing purposes. Matrixing results in two stereo-format channels, i.e., L_total and R_total, and three extension channels, i.e., C, Ls, and Rs. In order to remove statistical redundancies associated with these channels, a second-order linear predictor is employed [Fuch93] [ISOI94a]. The predictor coefficients are updated on each subband using a backward adaptive LMS algorithm [Widr85]. The resulting

Figure 10.9 ISO/MPEG-2 BC/LSF audio (Layer III) encoder algorithm.


prediction error is further processed to eliminate interchannel dependencies. JND thresholds are computed in parallel with the subband decomposition for each decimated block. A bit-allocation procedure similar to the one in the MPEG-1 audio standard is used to estimate the number of appropriate bits required for quantization.
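The backward-adaptive second-order prediction over one subband can be sketched as follows. This is an illustrative normalized-LMS implementation in Python; the step size mu and the normalization constant are assumptions and are not values taken from the standard.

import numpy as np

def lms_subband_predictor(x, mu=0.01, order=2):
    """Return the prediction residual of one subband sequence x (float array)."""
    w = np.zeros(order)               # predictor coefficients, adapted sample by sample
    history = np.zeros(order)         # most recent past samples (decoder-reproducible)
    residual = np.empty(len(x))
    for n, sample in enumerate(x):
        prediction = np.dot(w, history)
        e = sample - prediction
        residual[n] = e
        # Backward-adaptive NLMS update driven only by past data and the current error.
        norm = np.dot(history, history) + 1e-9
        w += (mu / norm) * e * history
        history = np.roll(history, 1)
        history[0] = sample
    return residual

if __name__ == "__main__":
    t = np.arange(2000)
    x = np.sin(0.05 * t) + 0.01 * np.random.default_rng(0).standard_normal(t.size)
    e = lms_subband_predictor(x)
    print("input power:", np.mean(x ** 2), " residual power:", np.mean(e ** 2))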

10.4.2.3 MPEG-2 BC/LSF Decoder. Synchronization, followed by error detection and correction, is performed first at the decoder. Then, the coded audio bitstream is de-multiplexed into the individual subbands of each audio channel. Next, the subband signals are converted to subband PCM signals based on the instructions in the header and the side information transmitted for every subband. De-matrixing is performed to compute the L and R bitstreams as follows:

L = L_total/x − (yC + zLs)                                         (10.6)

R = R_total/x − (yC + zRs)                                         (10.7)

where x, y, and z are constants and are known at the decoder. The inverse-quantized, de-matrixed subband PCM signals are then inverse-filtered to reconstruct the full-band time-domain PCM signals for each channel.
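A companion fragment to the down-mix sketch given earlier shows the de-matrixing of Eqs. (10.6) and (10.7); again, this is illustrative rather than normative.

def dematrix(L_total, R_total, C, Ls, Rs, x, y, z):
    """Recover L and R from the matrixed pair and the transmitted extension channels."""
    L = L_total / x - (y * C + z * Ls)      # Eq. (10.6)
    R = R_total / x - (y * C + z * Rs)      # Eq. (10.7)
    return L, R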

The second MPEG-2 standard, i.e., the MPEG-2 NBC/AAC, sacrificed backward MPEG-1 compatibility to eliminate quantization noise unmasking artifacts [tenK94], which are potentially introduced by the forced backward compatibility.

10.4.3 MPEG-2 NBC/AAC (ISO/IEC-13818-7)

The 11172-3 MPEG-1 and IS13818-3 MPEG-2 BC/LSF are standardized algorithms for high-quality coding of monaural and stereophonic program material. By the early 1990s, however, the demand for high-quality coding of multichannel audio at reduced bit rates had increased significantly. The backward compatibility constraints imposed on the MPEG-2 BC/LSF algorithm made it impractical to code 5-channel program material at rates below 640 kb/s. As a result, MPEG began standardization activities for a nonbackward compatible advanced coding system targeting "indistinguishable" quality [ITUR91] [ISOI96a] at a rate of 384 kb/s for five full-bandwidth channels. In less than three years, this effort led to the adoption of the IS13818-7 MPEG-2 Nonbackward Compatible/Advanced Audio Coding (NBC/AAC) algorithm [ISOI97b], a system that exceeded design goals and produced the desired quality at 320 kb/s for five full-bandwidth channels. While similar in many respects to its predecessors, the AAC algorithm [Bosi96] [Bosi97] [Bran97] [John99] achieves performance improvements by incorporating coding tools previously not found in the standards, such as filter-bank window shape adaptation, spectral coefficient prediction, temporal noise shaping (TNS), and bandwidth- and bit-rate-scaleable operation. Bit rate and quality


improvements are also realized through the use of a sophisticated noiseless coding scheme integrated with a two-stage bit allocation procedure. Moreover, the AAC algorithm contains scalability and complexity management tools not previously included with the MPEG algorithms. As far as applications are concerned, the AAC algorithm is embedded in the atob and LiquidAudio players for streaming of high-fidelity stereophonic audio. It is also a candidate for standardization in the United States Digital Audio Radio (US DAR) project. The remainder of this section describes some of the features unique to MPEG-2 AAC.

The MPEG-2 AAC algorithm (Figure 10.10) is organized as a set of coding tools. Depending upon available CPU or channel resources and desired quality, one can select from among three complexity "profiles," namely main, low, and scalable sample rate profiles. Each profile recommends a specific combination of tools. Our focus here is on the complete set of tools available for main profile coding, which works as follows.

10.4.3.1 Filter Bank. First, a high-resolution MDCT filter bank obtains a spectral representation of the input. Like previous MPEG coders, the AAC filter-bank resolution is signal adaptive. Quasi-stationary segments are analyzed with a 2048-point window, while transients are analyzed with a block of eight 256-point windows to maintain time synchronization for channels using different filter-bank resolutions during multichannel operations. The frequency resolution is therefore 23 Hz for a 48-kHz sample rate, and the time resolution is 2.6 ms. Unlike previous MPEG coders, however, AAC eliminates the hybrid filter bank and relies on the MDCT exclusively. The AAC filter bank is also unique in its ability to switch between two distinct MDCT analysis window shapes, i.e., a sine window (Eq. (10.8)) and a Kaiser-Bessel designed (KBD) window (Eq. (10.9)). Given specific input signal characteristics, the idea behind window shape adaptation is to optimize filter-bank frequency selectivity in order to localize the supra-masking threshold signal energy in the fewest spectral coefficients. This strategy essentially seeks to maximize the perceptual coding gain of the filter bank. While both windows satisfy the perfect reconstruction and aliasing cancellation constraints of the MDCT, they offer different spectral analysis properties. The sine window is given by

w(n) = sin[(n + 1/2)π/(2M)]                                        (10.8)

for 0 ≤ n ≤ M − 1, where M is the number of subbands. This particular window is perhaps the most popular in audio coding. In fact, this window has become standard in MDCT audio applications, and its properties are typically referenced as performance benchmarks when new windows are proposed. The so-called KBD window was obtained in a procedure devised at Dolby Laboratories by applying a transformation of the form

w_a(n) = w_s(n) = sqrt[ Σ_{j=0..n} v(j) / Σ_{j=0..M} v(j) ],   0 ≤ n < M        (10.9)

Figure 10.10 ISO/IEC IS13818-7 (MPEG-2 NBC/AAC) encoder ([Bosi96]).


where the sequence v(n) represents the symmetric kernel. The resulting identical analysis and synthesis windows, w_a(n) and w_s(n), respectively, are of length M + 1 and symmetric, i.e., w(2M − n − 1) = w(n). A more detailed explanation of the MDCT windows is given in Chapter 6, Section 6.7.

A filter-bank simulation exemplifying the performance of the two windows, sine and KBD, for the MPEG-2 AAC algorithm follows. A sine window is selected when narrow pass-band selectivity is more beneficial than strong stop-band attenuation. For example, sounds characterized by a dense harmonic structure (less than 140 Hz spacing), such as harpsichord or pitch pipe, benefit from a sine window. On the other hand, a Kaiser-Bessel designed (KBD) window is selected in cases for which stronger stop-band attenuation is required, or for situations in which strong components are separated by more than 220 Hz. The KBD window in AAC has its origins in the MDCT filter bank window designed at Dolby Labs for the AC-3 algorithm using explicit perceptual criteria. By sacrificing pass-band selectivity, the KBD window gains improved stop-band attenuation relative to the sine window. In fact, the stop-band magnitude response is below a conservative composite minimum masking threshold for a tonal masker at the center of the pass-band. A KBD versus sine window simulation example (Figure 10.11) for a signal containing 300 Hz plus 3 harmonics shows the KBD potential for reduced bit allocation. A masking threshold estimate generated by MPEG-1 psychoacoustic model 2 is superimposed. It can be seen that, for the given input, the KBD window is advantageous in terms of supra-threshold component minimization. All of the MDCT components below the superimposed masking threshold will potentially require allocations of zero bits. This tradeoff can ultimately lead to a lower bit rate. Details of the minimum masking template design procedure are given in [Davi94] and [Fiel96].
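The two window shapes are easy to generate for inspection. The Python sketch below builds the rising halves (0 ≤ n < M) of the sine window of Eq. (10.8) and of a KBD window via Eq. (10.9) with a Kaiser-Bessel kernel v(j); the full windows follow from the symmetry w(2M − n − 1) = w(n). The kernel parameter alpha used here is an assumption for illustration and is not quoted from the standard.

import numpy as np
from scipy.signal.windows import kaiser

def sine_window_half(M):
    """Rising half of the sine window, Eq. (10.8)."""
    n = np.arange(M)
    return np.sin((n + 0.5) * np.pi / (2.0 * M))

def kbd_window_half(M, alpha=4.0):
    """Rising half of a Kaiser-Bessel derived window, Eq. (10.9)."""
    v = kaiser(M + 1, alpha * np.pi)            # symmetric kernel v(0), ..., v(M)
    cumulative = np.cumsum(v[:M])               # sum_{j=0}^{n} v(j) for n = 0, ..., M-1
    return np.sqrt(cumulative / np.sum(v))      # normalize by sum_{j=0}^{M} v(j)

if __name__ == "__main__":
    M = 1024                                     # half of a 2048-point long block
    ws, wk = sine_window_half(M), kbd_window_half(M)
    print("values at the block center:", ws[-1], wk[-1])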

10.4.3.2 Spectral Prediction. The AAC algorithm realizes improved coding efficiency relative to its predecessors by applying prediction over time to the transform coefficients below 16 kHz, as was done previously in [Mahi89] [Fuch93] [Fuch95]. In this case, the spectral prediction tool is applied only during long analysis windows, and then only if a bit-rate reduction is obtained when coding the prediction residuals instead of the original coefficients. Side information is minimal, since the second-order lattice predictors are updated on each frame using a backward adaptive LMS algorithm. The predictor banks, which can be selectively activated for individual quantization scale-factor bands, produced an improvement, for a fixed bit rate, of +1 point on the ITU 5-point impairment scale for the critical pitch pipe and harpsichord test material.

10.4.3.3 Bit Allocation. The bit allocation and quantization strategies in AAC bear some similarities to previous MPEG coders in that they make use of a nested-loop iterative procedure and in that psychoacoustic masking thresholds are obtained from an analysis model similar to MPEG-1 model recommendation number two. Both lossy and lossless coding blocks are integrated into the rate-control loop structure, so that redundancy removal and irrelevancy reduction are



Figure 10.11 Comparison of the MPEG-2 AAC MDCT analysis filter-bank outputs for the sine window vs. the KBD window.

simultaneously affected in a single analysis-by-synthesis process. The scheme works as follows. As in the case of MPEG-1 Layer III, the AAC coefficients are grouped into 49 scale-factor bands that mimic the auditory system's frequency resolution.

In the nested-loop allocation procedure, the inner loop adjusts scale-factor quantizer step sizes in increments of 1.5 dB (which approximates the intensity difference limen (DL)) and obtains Huffman codewords for both quantized scale factors and quantized coefficients until the desired bit rate is achieved. Then, in the outer loop, the quantization noise introduced by the inner loop is compared to the masking threshold in order to assess noise audibility. Undercoded scale-factor bands are amplified to force increased coding precision, and then the inner loop is called again for compliance with the desired bit rate. A best result is stored after each iteration, since the two-loop process is not guaranteed to converge. As with other algorithms, such as the MPEG-1 Layer III and the Lucent Technologies PAC [John96c], a bit reservoir is maintained to compensate for time-varying perceptual bit-rate requirements.

10.4.3.4 Noiseless Coding. The noiseless coding block [Quac97] embedded in the rate-control loop has several innovative features as well. Twelve Huffman code books are available for 2- and 4-tuple blocks of quantized coefficients. Sectioning and merging techniques are applied to maximize redundancy reduction.


Individual code books are applied to time-varying "sections" of scale-factor bands, and the sections are defined on each frame through a greedy merge algorithm that minimizes the bit rate. Grouping across time and intraframe frequency interleaving of coefficients prior to code-book application are also applied to maximize zero coefficient runs and further reduce bit rates.

10.4.3.5 Other Enhancements. Relative to MPEG-1 and MPEG-2 BC/LSF, other enhancements have also been embedded in AAC. For example, the AAC algorithm has an embedded TNS module [Herr96] for pre-echo control (Section 6.9), a special profile for sample-rate scalability (SSR), and time-varying as well as frequency subband selective application of MS and/or intensity stereo coding for 5-channel inputs [John96b].

10.4.3.6 Performance. Incorporation of the nonbackward compatible coding enhancements proved to be a judicious strategy for the AAC algorithm. In independent listening tests conducted worldwide [ISOI96d], the AAC algorithm met the strict ITU-R BS.1116 criteria for indistinguishable quality [ITUR94b] at a rate of 320 kb/s for five full-bandwidth channels [Kirb97]. This level of quality was achieved with a manageable decoder complexity. Two-channel real-time AAC decoders were reported to run on 133-MHz Pentium platforms using 40% and 25% of available CPU resources for the main and low-complexity profiles, respectively [Quac98a]. MPEG-2 AAC maintained its presence as the core "time-frequency" coder reference model for the MPEG-4 standard.

10.4.3.7 Reference Model Validation (RM). Before proceeding with a discussion of MPEG-4, we first consider a significant system-level aspect of MPEG-2 AAC that also propagated into MPEG-4. Both algorithms are structured in terms of so-called reference models (RMs). In the RM approach, generic coder blocks or tools (e.g., perceptual model, filter bank, rate-control loop, etc.) adhere to a set of defined interfaces. The RM therefore facilitates the testing of incremental single-block improvements without disturbing the existing macroscopic RM structure. For instance, one could devise a new psychoacoustic analysis model that satisfies the AAC RM interface and then simply replace the existing RM perceptual model in the reference software with the proposed model. It is then a straightforward matter to construct performance comparisons between the RM method and the proposed method in terms of quality, complexity, bit rate, delay, or robustness. The RM definitions are intended to expedite the process of evolutionary coder improvements.

In fact, several practical AAC improvements have already been analyzed within the RM framework. For example, a backward predictor was proposed [Yin97] as a replacement for the existing backward adaptive LMS predictors. This method, which relies upon a block LPC estimation procedure rather than a running LMS estimation, was reported to achieve comparable quality with a 38% (instruction) complexity reduction [Yin97]. This contribution was significant in light of the fact that the spectral prediction tool in the AAC main profile decoder constitutes 40%


of the computational complexity [Yin97]. Decoder complexity is further reduced since the block predictors only require updates when the prediction module has been enabled, rather than requiring sample-by-sample updating regardless of activation status. Forward adaptive predictors have also been investigated [Ojan99]. In another example of RM efficacy, improvements to the AAC noiseless coding module were reported in [Taka97]. A modification to the greedy merge sectioning algorithm was proposed, in which high-magnitude spectral peaks that tended to degrade Huffman coding efficiency were coded separately. The improvement yielded consistent bit-rate reductions of up to 11%. In informal listening tests, it was found that the bit savings resulted in higher quality at the same bit rate. In yet another example of RM innovation aimed at improving quality for a given bit rate, product code VQ techniques [Gers92] were applied to increase AAC scale-factor coding efficiency [Sree98a]. In the proposed scheme, scale factors are decorrelated using a DCT and then grouped into subvectors for quantization by a product code VQ. The method is intended primarily for low-rate coding, since the side information bit burden rises from roughly 6% at 64 kb/s to, in some cases, 25% at 16 kb/s. As expected, subjective tests reflected an insignificant quality improvement at 64 kb/s. On the other hand, the reduction in bits allocated to side information at low rates (e.g., 16 kb/s) allowed more bits for spectral coefficient coding and therefore produced mean improvements of +0.52 and +0.36 on subjective differential improvement tests at bit rates of 16 and 40 kb/s, respectively [Sree98b]. Additionally, noise-to-mask ratios (NMRs) were reduced by as much as −2.43 for the "harpsichord" critical test item at 16 kb/s. Several architectures for MPEG-2 AAC real-time implementations were proposed. Some of these include [Chen99] [Hilp98] [Geye99] [Saka00] [Hong01] [Rett01] [Taka01] [Duen02] [Tsai02].

10.4.3.8 Enhanced AAC in MPEG-4. The next section is concerned with the multimodal MPEG-4 audio standard, for which the MPEG-2 AAC RM core was selected as the "time-frequency" audio coding RM, with some improvements. For example, perceptual noise substitution (PNS) was included [Herr98a] as part of the MPEG-4 AAC RM. Moreover, the long-term prediction (LTP) [Ojan99] and transform-domain weighted interleave VQ (TwinVQ) [Iwak96] modules became part of the MPEG-4 audio. The LTP, after the MPEG-2 AAC prediction block, provides higher coding precision for tonal signals, while the TwinVQ provides scalability and ultra-low bit-rate audio coding.

10.4.4 MPEG-4 Audio (ISO/IEC 14496-3)

The MPEG-4 ISO/IEC-14496 Part 3 audio was adopted in December 1998, after many proposed algorithms were tested [Cont96] [Edle96a] [ISOI96b] [ISOI96c] for compliance with the program objectives [ISOI94b] established by the MPEG committee. MPEG-4 audio (Figure 10.12) encompasses a great deal more functionality than just perceptual coding [Koen96] [Koen98] [Koen99]. It comprises an integrated family of algorithms with wide-ranging provisions for scalable,


Figure 10.12 An overview of the MPEG-4 audio coder [ISOI99]: features (universality, scalability, object-based representation, content-based interactivity, enhanced composition, natural/synthetic representation); profiles (synthesis, speech, scalable, and main profiles); and tools (general audio coding [Gril99], scalable audio coding [Bran94b] [Gril97], parametric audio coding [Edle96a] [Purn99a], speech coding [Edle99] [Nish99], structured audio coding [Sche98a] [Sche01], low-delay audio coding [Alla99] [Hilp00], and error resilience [Sper00] [Mein01]).

object-based speech and audio coding at bit rates from as low as 200 b/s up to 64 kb/s per channel. The distinguishing features of MPEG-4 relative to its predecessors are extensive scalability, object-based representations, user interactivity/object manipulation, and a comprehensive set of coding tools available to accommodate almost any desired tradeoff between bit rate, complexity, and quality. Efficient and flexible coding of different content (objects), such as natural audio/speech and synthetic audio/speech, became indispensable for some of the innovative multimedia applications. To facilitate this, MPEG-4 audio provides coding and composition of natural and synthetic audio/speech content at various bit rates. Very low rates are achieved through the use of structured representations for synthetic speech and music, such as text-to-speech and MIDI. For higher bit rates and "natural audio" speech and music, the standard provides integrated coding tools that make use of different signal models, the choice of which is made depending upon the desired bit rate, bandwidth, complexity, and quality. Coding tools are also specified in terms of MPEG-4 "profiles" that essentially recommend tool sets for a given level of functionality and complexity. Beyond its provisions specific to coding of speech and audio, MPEG-4 also specifies


numerous sophisticated system-level functions for media-independent transport, efficient buffer management, syntactic bitstream descriptions, and time-stamping for synchronization of audiovisual information units.

10.4.4.1 MPEG-4 Audio Versions. The MPEG-4 audio standard was released in several steps due to timing constraints. This resulted in two different versions of MPEG-4: version 1 [ISOI99] was standardized in February 1999, followed by version 2 [ISOI00] (also referred to as Amendment 1 to version 1) in February 2000. New amendments for bandwidth extension, parametric audio extension, MP3 on MP4, audio lossless coding, and scalable to lossless coding have also been considered in the MPEG-4 audio standard.

MPEG-4 Audio Version 1. MPEG-4 audio version 1 comprises the majority of the MPEG-4 audio tools. These are general audio coding, scalable coding, speech coding techniques, structured audio coding, and text-to-speech synthetic coding. These techniques can be grouped into two main categories, i.e., natural [Quac98b] and synthetic audio coding [Vaan00]. The MPEG-4 natural audio coding part describes traditional-type speech coding and high-quality audio coding algorithms at bit rates ranging from 2 kb/s to 64 kb/s and above. Three types of coders enable hierarchical (scalable) coding in MPEG-4 audio version 1 at different bit rates. Firstly, at lower bit rates, ranging from 2 kb/s to 6 kb/s, parametric speech coding is employed. Secondly, code excited linear predictive (CELP) coding is used for medium bit rates between 6 kb/s and 24 kb/s. Finally, for the higher bit rates, typically ranging upwards from 24 kb/s, transform-based (time-frequency) general audio coding techniques are applied. The MPEG-4 synthetic audio coding part describes the text-to-speech (TTS) and structured audio synthesis tools. Typically, the structured tools are used to provide effects like echo, reverberation, and chorus effects; the TTS synthetic tools generate synthetic speech from text parameters.

MPEG-4 Audio Version 2. While remaining backwards compatible with MPEG-4 version 1, version 2 adds new profiles that incorporate a number of significant system-level enhancements. These include error robustness, low-delay audio coding, small-step scalability, and enhanced composition [Purn99b]. At the system level, version 2 includes a media-independent bit stream format that supports streaming, editing, local playback, and interchange of contents. Furthermore, in version 2, an MPEG-J programmatic system specifies an application programming interface (API) for interoperation of MPEG players with JAVA. Version 2 offers improved audio realism in sound rendering. New tools allow parameterization of the acoustical properties of an audio scene, enabling features such as immersive audiovisual rendering, room acoustical modeling, and enhanced 3-D sound presentation. New error resilience techniques in version 2 allow both equal and unequal error protection for the audio bit streams. Low-delay audio coding is employed at low bit rates, where the coding delay is significantly high. Moreover, to facilitate bit rate scalability in small steps, version 2 provides a highly desirable tool called small-step scalability or fine-grain scalability. Text-to-speech (TTS) interfaces from version 1 are enhanced in version 2 with a markup TTS intended for


applications such as speech-enhanced web browsing, verbal email, and story-teller on demand. Markup TTS has the ability to process HTML, SABLE, and facial animation parameter (FAP) bookmarks.

10.4.4.2 MPEG-4 Audio Profiles. Although many coding and processing tools are available in MPEG-4 audio, cost and complexity constraints often dictate that it is not practical to implement all of them in a particular system. Version 1 therefore defines four complexity-ranked audio profiles intended to help system designers in the task of appropriate tool subset selection. In order of bit rate, they are as follows. The low rate synthesis audio profile provides only wavetable-based synthesis and a text-to-speech (TTS) interface. For natural audio processing capabilities, the speech audio profile provides a very-low-rate speech coder and a CELP speech coder. The scalable audio profile offers a superset of the first two profiles. With bit rates ranging from 6 to 24 kb/s and bandwidths from 3.5 to 9 kHz, this profile is suitable for scalable coding of speech, music, and synthetic music in applications such as Internet streaming or narrow-band audio digital broadcasting (NADIB). Finally, the main audio profile is a superset of all other profiles, and it contains tools for both natural and synthetic audio.

10.4.4.3 MPEG-4 Audio Tools. Unlike MPEG-1 and MPEG-2, MPEG-4 audio describes not only a set of compression schemes but also a complete functionality for a broad range of applications, from low-bit-rate speech coding to high-quality audio coding or music synthesis. This feature is called universality. MPEG-4 enables scalable audio coding, i.e., variable rate encoding is provided to adapt dynamically to the varying transmission channel capacity. This property is called scalability. One of the main features of MPEG-4 audio is its ability to represent the audiovisual content as a set of objects. This enables content-based interactivity.

Natural Audio Coding Tools. MPEG-4 audio [Koen99] integrates a set of tools (Figure 10.13) for coding of natural sounds [Quac98b] at bit rates ranging from as low as 200 b/s up to 64 kb/s per channel. For speech and audio, three distinct algorithms are integrated into the framework. These include parametric coding, CELP coding, and transform coding. The parametric coding is employed for bit rates of 2–4 kb/s and an 8-kHz sampling rate, as well as 4–16 kb/s and 8- or 16-kHz sampling rates (Section 9.4). For higher quality, narrow-band (8-kHz sampling rate) and wideband (16-kHz) speech is handled by a CELP speech codec operating between 6 and 24 kb/s. For generic audio at bit rates above 16 kb/s, a time/frequency perceptual coder is employed; in particular, the MPEG-2 AAC algorithm, with extensions for fine-grain bit-rate scalability [Park97], is specified in the MPEG-4 version 1 RM as the time-frequency coder. The multimodal framework of MPEG-4 audio allows the user to tailor the coder characteristics to the program material.

Synthetic Audio Coding Tools. While the earlier MPEG standards treated only natural audio program material, MPEG-4 audio achieves very-low-rate coding by supplementing its natural audio coding techniques with tools for synthetic audio processing [Sche98a] [Sche01] and interfaces for structured, high-level



Figure 10.13 ISO/IEC MPEG-4 integrated tools for audio coding ([Koen99]).

audio representations. Chief among these are the text-to-speech interface (TTSI) and methods for score-driven synthesis. The TTSI provides the capability for 200–1200 b/s transmission of synthetic speech that can be represented in terms of either text only or text plus prosodic parameters, such as a pitch contour or a set of phoneme durations. Also, one can specify the age, gender, and speech rate of the speaker. Additionally, there are facilities for lip synchronization control, international language and dialect support, as well as controls for pause, resume, and jump forward/backward. The TTSI specifies only an interface, rather than a normative speech synthesis methodology, in order to maximize implementation flexibility.

Beyond speech, general music synthesis capabilities in MPEG-4 are provided by a set of structured audio tools [Sche98a] [Sche98d] [Sche98e]. Synthetic sounds are represented using the structured audio orchestra language (SAOL). SAOL [Sche98d] treats music as a collection of instruments. Instruments are then treated as small networks of signal-processing primitives, all of which can be downloaded to a decoder. Some of the available synthesis methods include wavetable, FM, additive, physical modeling, granular synthesis, or nonparametric hybrids of any of these methods [Sche98c]. An excellent tutorial on these and other structured audio methods and applications appeared in [Verc98]. The SAOL instruments are controlled at the decoder by "scores," or scripts, in the structured audio score language (SASL).


A score is a time-sequenced set of commands that invokes various instruments at specific times to contribute their outputs to an overall performance. SASL provides significant flexibility in that not only can instruments be controlled, but existing sounds can be modified. For those situations in which fine control is not required, structured audio in MPEG-4 also provides backward compatibility with the MIDI protocol. Moreover, a standardized "wavetable bank format" is available for low-functionality terminals [Koen99]. In the next seven subsections, i.e., 10.4.4.4 through 10.4.4.10, we describe in detail the features and tools (Figure 10.12) integrated in the MPEG-4 audio.

10.4.4.4 MPEG-4 General Audio Coding The MPEG-4 General Audio Coder (GAC) [Gril99] has the most vital and versatile functionality associated with the MPEG-4 tool set, covering arbitrary natural audio signals. The MPEG-4 GAC is often called the "all-round" coding system among the MPEG-4 audio schemes; it operates at bit rates ranging from 6 to 300 kb/s and at sampling rates between 7.35 kHz and 96 kHz. The MPEG-4 GA coder is built around the MPEG-2 AAC (Figure 10.10, discussed in Section 10.4.3) along with some extended features and coder configurations highlighted in Figure 10.14. These features are the perceptual noise substitution (PNS), long-term prediction (LTP), Twin VQ coding, and scalability.

Perceptual Noise Substitution (PNS) The PNS exploits the fact that a random noise process can be used to efficiently model transform coefficients in noise-like frequency subbands, provided the noise vector has an appropriate temporal fine structure [Schu96]. Bit-rate reduction is realized since only a compact parametric representation is required for each PNS subband (i.e., the noise energy) rather than full quantization and coding of the subband transform coefficients. The PNS technique was integrated into the existing AAC bitstream definition in a backward-compatible manner. Moreover, PNS actually led to reduced decoder complexity, since pseudo-random sequences are less expensive to compute than Huffman decoding operations. To improve the coding efficiency, the following principle of PNS is employed.

The PNS acronym reflects its function: perceptual coding plus substitution of a parametric form for noise-like signals, i.e., PNS allows frequency-selective parametric encoding of noise-like components. These noise-like components are detected on a scale-factor band basis and are grouped into separate categories. The spectral coefficients corresponding to these categories are not quantized and are excluded from the coding process. Instead, only a noise substitution flag along with the total power of these spectral coefficients is transmitted for each band. At the decoder, the spectral coefficients are replaced by pseudo-random vectors with the desired target noise power. At a bit rate of 32 kb/s, a mean improvement due to PNS of +0.61 on the comparison mean opinion score (CMOS) test (for critical test items such as speech, castanets, and complex sound mixtures) was reported in [Herr98a]. The multichannel PNS modes include some provisions for binaural masking level difference (BMLD) compensation.
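A minimal sketch of the PNS principle follows (Python/NumPy; the 8-coefficient band size and the data layout are illustrative assumptions, not the normative AAC syntax): a noise-like scale-factor band is represented only by its energy, and the decoder substitutes a pseudo-random vector scaled to that target energy.

    import numpy as np

    def pns_encode_band(coeffs, noise_like):
        # Either code the band exactly or, if it is noise-like, keep only its energy.
        if noise_like:
            return {"pns": True, "energy": float(np.sum(coeffs ** 2))}
        return {"pns": False, "coeffs": coeffs.copy()}

    def pns_decode_band(band, rng):
        # PNS bands are replaced by pseudo-random noise with the transmitted target power.
        if band["pns"]:
            noise = rng.standard_normal(8)                      # 8 coefficients per band (illustrative)
            return noise * np.sqrt(band["energy"] / np.sum(noise ** 2))
        return band["coeffs"]

    rng = np.random.default_rng(0)
    tonal = np.array([4.0, 0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.1])  # tonal band: coded exactly
    noisy = rng.standard_normal(8)                              # noise-like band: coded parametrically
    bands = [pns_encode_band(tonal, False), pns_encode_band(noisy, True)]
    decoded = [pns_decode_band(b, rng) for b in bands]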

Figure 10.14 MPEG-4 GA coder. (Figure: the input s(n) feeds a perceptual model and a gain control/MDCT stage, followed by TNS, PNS, intensity/coupling, multichannel M/S, LTP, scale-factor extraction, and entropy coding/Twin VQ inside an iterative rate control loop; side information coding and bitstream formatting deliver the output to the channel.)


Long-Term Prediction (LTP) Unlike noise-like signals, tonal signals require higher coding precision. In order to achieve the required coding precision (20 dB for tone-like and 6 dB for noise-like signals), the long-term prediction (LTP) technique [Ojan99] is employed. In particular, since the tonal signal components are predictable, the pitch prediction techniques of speech coding [Span94] can be used to improve the coding precision. The only significant difference between the prediction techniques performed in a common speech coder and in the MPEG-4 GA coder is that in the latter case the LTP is performed in the frequency domain, while in speech codecs the LTP is carried out in the time domain. A brief description of the LTP scheme in the MPEG-4 GA coder follows. First, the input audio is transformed to the frequency domain using an analysis filter bank, and a TNS analysis filter is then employed for shaping the noise artifacts. Next, the processed spectral coefficients are quantized and encoded. For prediction purposes, these quantized coefficients are transformed back to the time domain by a synthesis filter bank and the associated TNS operation. The optimum pitch lag and gain parameters are determined based on the residual and the input signal. In the next step, both the input signal and the residual are mapped to a spectral representation via the analysis filter bank and the forward TNS filter bank. Depending on which alternative is more favorable, coding of either the difference signal or the original signal is selected on a scale-factor band basis. This is achieved by means of a so-called frequency-selective switch (FSS), which is also used in the context of the MPEG-4 GA scalable systems. The complexity associated with the LTP in the MPEG-4 GA scheme is considerably (50%) reduced compared to the MPEG-2 AAC prediction scheme [Gril99].
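The pitch-prediction idea underlying LTP can be illustrated with the following time-domain sketch (Python/NumPy; the lag range, frame length, and test signal are assumptions, and the standard applies the prediction in combination with the filter bank and TNS as described above): the encoder searches for the lag and gain that minimize the energy of the prediction residual.

    import numpy as np

    def ltp_search(signal, frame_start, frame_len, min_lag=64, max_lag=512):
        # Find the lag and gain minimizing ||x(n) - g * x(n - lag)||^2 over the frame.
        frame = signal[frame_start:frame_start + frame_len]
        best_lag, best_gain, best_err = min_lag, 0.0, np.inf
        for lag in range(min_lag, max_lag + 1):
            if frame_start - lag < 0:
                break
            ref = signal[frame_start - lag:frame_start - lag + frame_len]
            denom = float(np.dot(ref, ref))
            if denom == 0.0:
                continue
            gain = float(np.dot(frame, ref)) / denom            # least-squares optimal gain
            err = float(np.sum((frame - gain * ref) ** 2))
            if err < best_err:
                best_lag, best_gain, best_err = lag, gain, err
        return best_lag, best_gain, best_err

    fs, f0 = 8000, 100                                          # 80-sample pitch period
    t = np.arange(4096) / fs
    x = np.sin(2 * np.pi * f0 * t) + 0.01 * np.random.default_rng(1).standard_normal(4096)
    print(ltp_search(x, frame_start=2048, frame_len=256))       # lag near 80 (or a multiple)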

Twin VQ Twin VQ [Iwak96] [Hwan01] [Iwak01] is an acronym for Transform-domain Weighted Interleave Vector Quantization. The Twin VQ performs vector quantization of the transformed spectral coefficients based on a perceptually weighted model. The quantization distortion is controlled through a perceptual model [Iwak96]. The Twin VQ provides high coding efficiency even for music and tonal signals at extremely low bit rates (6–8 kb/s), which CELP coders fail to achieve. The Twin VQ performs quantization of the spectral coefficients in two steps, as shown in Figure 10.15. First, the spectral coefficients are flattened and normalized across the frequency axis. Second, the flattened spectral coefficients are quantized based on a perceptually weighted vector quantizer.

From Figure 10.15, the first step includes a linear predictive coding, a periodicity computation, a Bark-scale spectral estimation scheme, and a power computation block. The LPC provides the overall spectral shape. The periodic component includes information on the harmonic structure. The Bark-scale envelope coding provides the required additional flattening of the spectral coefficients. The normalization restricts these spectral coefficients to a specific target range. In the second step, the flattened and normalized spectral coefficients are interleaved into subvectors. Based on some spectral properties and a weighted distortion measure, perceptual weights are computed for each subvector. These weights are applied to the vector quantizer (VQ). A conjugate-structure VQ that uses a pair of codebooks is employed; more detailed information on the conjugate-structure VQ can be obtained from [Kata93] [Kata96].

Figure 10.15 Twin VQ scheme in the MPEG-4 GA coder. (Figure: Step 1, normalization — the input signal is processed by linear predictive coding (LPC), periodicity extraction, Bark-scale power estimation, and normalization of the spectral coefficients to produce flattened spectral coefficients. Step 2, perceptually weighted VQ — the input vector is interleaved into subvectors 1 through 4, each quantized by a weighted VQ driven by perceptual weights, yielding the vector quantizer index values.)


The MPEG-4 Audio Twin VQ scheme provides audio coding at ultra-low bit rates (6–8 kb/s) and supports perceptual control of the quantization distortion. Comparative tests of MPEG AAC with and without the Twin VQ tool were performed and are given in [ISOI98]. Furthermore, the Twin VQ tool has provisions for scalable audio coding, which will be discussed next.
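The interleave-and-weighted-VQ step of the second stage can be sketched as follows (Python/NumPy; the subvector count, codebook, and all-ones perceptual weights are placeholders, and the conjugate-structure codebook pair is not modeled): the flattened coefficients are distributed so that each subvector samples the whole spectrum, and each subvector is matched against the codebook with a perceptually weighted distance.

    import numpy as np

    def interleave(flattened, n_sub):
        # Coefficient k goes to subvector k mod n_sub, spreading each subvector across the spectrum.
        return [flattened[i::n_sub] for i in range(n_sub)]

    def weighted_vq(subvector, weights, codebook):
        # Pick the codebook entry minimizing the perceptually weighted squared error.
        errors = [np.sum(weights * (subvector - c) ** 2) for c in codebook]
        return int(np.argmin(errors))

    rng = np.random.default_rng(0)
    flat = rng.standard_normal(64)                   # flattened, normalized spectral coefficients
    subvectors = interleave(flat, n_sub=4)           # four 16-dimensional subvectors
    codebook = rng.standard_normal((256, 16))        # illustrative 8-bit codebook
    weights = np.ones(16)                            # perceptual weights (placeholder)
    indices = [weighted_vq(sv, weights, codebook) for sv in subvectors]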

10.4.4.5 MPEG-4 Scalable Audio Coding MPEG-4 scalable audio coding implies variable-rate encoding/decoding of bitstreams at bit rates that can be adapted dynamically to the varying transmission channel capacity [Gril97] [Park97] [Herr98b] [Creu02]. Scalable coding schemes [Bran94b] generate partial bitstreams that can be decoded separately. Therefore, encoding/decoding of a subset of the total bitstream will result in a valid signal at a lower bit rate. The various types of scalability [Gril97] are signal-to-noise ratio (SNR) scalability, noise-to-mask ratio (NMR) scalability, audio bandwidth scalability, and bit-rate scalability. Bit-rate scalability is considered to be one of the core functionalities of the MPEG-4 audio standard. Therefore, in our discussion on MPEG-4 scalable audio coding, we will consider only the bit-rate scalability and the various scalable coder configurations described in the standard.

The MPEG-4 bit-rate scalability scheme (Figure 10.16) allows an encoder to transmit bitstreams at a high bit rate while a low-rate bitstream contained within the high-rate code can still be decoded successfully. For instance, if an encoder transmits bitstreams at 64 kb/s, the decoder can decode at bit rates of 16, 32, or 64 kb/s, according to channel capacity, receiver complexity, and quality requirements. Typically, scalable audio coders consist of several layers, i.e., a core layer and a series of enhancement layers. For example, Figure 10.16 depicts one core layer and two enhancement layers. The core layer encodes the core (main) audio stream, while the enhancement layers provide further resolution and scalability. In particular, in the first stage, the core layer encodes the input audio s(n) based on a conventional lossy compression scheme. Next, an error signal (residual) E1(n) is calculated by subtracting the reconstructed signal (obtained by decoding the compressed bitstream locally) from the input signal s(n). In the second stage (first enhancement layer), the error signal E1(n) is encoded to obtain the compressed residual e1(n). The above sequence of steps is repeated for all the enhancement layers.

To further demonstrate this principle, we consider an example (Figure 10.16) where the core layer uses 32 kb/s, the two enhancement layers employ bit rates of 16 kb/s and 8 kb/s, and the final sink layer supports 8 kb/s coding. Therefore, if no side information is encoded, the coding rate associated with the codec is 64 kb/s. At the decoder, one can decode this multiplexed audio bitstream at various rates, i.e., 64, 32, or 40 kb/s, etc., depending upon the bit-rate requirements, receiver complexity, and channel capacity. In particular, the core bitstream guarantees reconstruction of the original input audio with minimum artifacts. On top of the core layer, additional enhancement layers are added to increase the quality of the decoded signal.
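The layered structure of Figure 10.16 can be sketched as follows (Python/NumPy; a coarse uniform quantizer stands in for the actual lossy compression scheme, and the step sizes are arbitrary): each layer codes the residual left by the locally decoded previous layer, so decoding any prefix of the layers yields a valid signal whose quality improves with every layer.

    import numpy as np

    def quantize(x, step):
        # Stand-in "compression scheme": uniform quantization with the given step size.
        return np.round(x / step) * step

    def scalable_encode(s, steps=(0.5, 0.1, 0.02)):
        layers, target = [], s
        for step in steps:
            coded = quantize(target, step)   # core layer, then enhancement layers
            layers.append(coded)
            target = target - coded          # residual passed to the next layer
        return layers

    def scalable_decode(layers, n_layers):
        return sum(layers[:n_layers])        # any prefix of the layers is decodable

    s = np.sin(2 * np.pi * np.arange(64) / 64)
    layers = scalable_encode(s)
    for k in range(1, 4):
        err = np.max(np.abs(s - scalable_decode(layers, k)))
        print(f"{k} layer(s): max reconstruction error = {err:.4f}")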

Figure 10.16 MPEG-4 scalable audio coding. (Figure: the audio input s(n) is coded by a core compression scheme to produce the compressed audio x(n); a local decoder and subtraction yield the error E1(n), which a second compression scheme encodes as the compressed residual e1(n); the process repeats for E2(n)/e2(n) and E3(n)/e3(n), and a multiplexer combines the core and residual streams.)


Scalable audio coding finds potential applications in the fields of digital audio broadcasting, mobile multimedia communication, and streaming audio. It supports real-time streaming with a low buffer delay. One of the significant extensions of MPEG-4 scalable audio coding is fine-grain scalability [Kim01], where a bit-sliced arithmetic coding (BSAC) [Kim02a] is used. In each frame, bit planes are coded in order of significance, beginning with the most significant bits (MSBs) and progressing to the LSBs. This results in a fully embedded coder containing all lower-rate codecs. The BSAC and fine-grain scalability concepts are explained below in detail.

Fine-Grain Scalability It is important that bit-rate scalability is achieved without a significant coding-efficiency penalty compared to fixed-bit-rate systems, and with low computational complexity. This can be achieved using the fine-grain scalability technique [Purn99b] [Kim01]. In this approach, bit-sliced arithmetic coding is employed along with the combination of advanced audio coding tools (Section 10.4.2). In particular, the noiseless coding of spectral coefficients and the scale-factor selection scheme is replaced by the BSAC technique, which provides scalability in steps of 1 kb/s per channel. The BSAC scheme works as follows. First, the quantized spectral values are grouped into frequency bands, each of these groups containing the quantized spectral values in binary form. Then the bits of each group are processed in slices and in order of significance, beginning with the MSBs. These bit-slices are then encoded using an arithmetic coding technique (Chapter 3). Usually, the BSAC technique is used in conjunction with the MPEG-4 GA tool, where the Huffman coding is replaced by this special type of arithmetic coding.
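The bit-slicing idea can be sketched as follows (Python/NumPy; sign handling, band grouping, and the arithmetic coding of the slices are omitted): the quantized magnitudes are written plane by plane, MSB first, so truncating the stream after any plane still yields a valid, lower-precision spectrum.

    import numpy as np

    def bit_slices(quantized, n_planes=8):
        # Return the bit-planes of non-negative quantized values, MSB first.
        return [(quantized >> p) & 1 for p in range(n_planes - 1, -1, -1)]

    def reconstruct(planes, n_planes=8):
        # Rebuild values from however many planes were received (MSB first).
        values = np.zeros(planes[0].shape, dtype=int)
        for i, plane in enumerate(planes):
            values |= plane << (n_planes - 1 - i)
        return values

    q = np.array([37, 5, 0, 120, 18, 2, 64, 9])      # quantized spectral magnitudes
    planes = bit_slices(q)
    coarse = reconstruct(planes[:4])                 # decode only the 4 most significant planes
    print(coarse)                                    # [ 32   0   0 112  16   0  64   0]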

10.4.4.6 MPEG-4 Parametric Audio Coding In research proposed as part of an MPEG-4 "core experiment" [Purn97], Purnhagen at the University of Hannover, in conjunction with Deutsche Telekom Berkom, developed an object-based algorithm. In this approach, harmonic sinusoid, individual sinusoid, and colored noise objects were combined in a hybrid source model to create a parametric signal representation. The enhanced algorithm, known as "Harmonic and Individual Lines Plus Noise" (HILN) [Purn00a] [Purn00b], is architecturally very similar to the original ASAC [Edle96b] [Edle96c] [Purn98] [Purn99a], with some modifications. The parametric audio coding scheme is a part of MPEG-4 version 2 and is based on the HILN scheme (see also Section 9.4). This technique involves coding of audio signals at bit rates of 4 kb/s and above, and offers the possibility of modifying the playback speed or pitch during decoding. The parametric audio coding tools have also been extended to high-quality audio [Oom03].

10.4.4.7 MPEG-4 Speech Coding The MPEG-4 natural speech coding tool [Edle99] [Nish99] provides a generic coding framework for a wide range of applications with speech signals at bit rates between 2 kb/s and 24 kb/s. MPEG-4 speech coding is based on two algorithms, namely, harmonic vector excitation coding (HVXC) and code-excited linear predictive (CELP) coding.


The HVXC algorithm, essentially based on a parametric representation of speech, handles very low bit rates of 1.4–4 kb/s at a sampling rate of 8 kHz. On the other hand, the CELP algorithm employs multipulse excitation (MPE) and regular-pulse excitation (RPE) coding techniques (Chapter 4) and supports higher bit rates of 4–24 kb/s, operating at sampling rates of 8 kHz and 16 kHz. The specifications of the MPEG-4 Natural Speech Coding Tool Set are summarized in Table 10.4.

In all the aforementioned algorithms, i.e., HVXC, CELP-MPE, and CELP-RPE, the idea is that an LP analysis filter models the human vocal tract, while an excitation signal models the vocal cords and the glottal activity. All three configurations share the same LP analysis method; they generally differ only in the excitation computation. In the LP analysis, the autocorrelation coefficients of the input speech are first computed once every 10 ms and are converted to LP coefficients using the Levinson-Durbin algorithm. The LP coefficients are transformed to line spectrum pairs using Chebyshev polynomials [Kaba86]. These are later quantized using a two-stage split-vector quantizer. The excitation signal is chosen in such a way that the error between the original and reconstructed signal is minimized according to a perceptually weighted distortion measure.
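The autocorrelation-to-LPC step described above can be sketched with the Levinson-Durbin recursion (Python/NumPy; the frame length and model order follow the 10-ms, narrow-band case as an assumption, and the LSP conversion and split-vector quantization stages are omitted):

    import numpy as np

    def levinson_durbin(r, order):
        # Solve the normal equations for the LP coefficients a (with a[0] = 1).
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0]
        for m in range(1, order + 1):
            k = -np.dot(a[:m], r[m:0:-1]) / err      # reflection coefficient
            a[1:m + 1] = a[1:m + 1] + k * a[m - 1::-1]
            err *= (1.0 - k * k)
        return a, err

    frame = np.random.default_rng(0).standard_normal(80)   # one 10-ms frame at 8 kHz
    order = 10
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    lpc, prediction_error = levinson_durbin(r, order)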

Multiple Bit Rates/Sampling Rates, Scalability The speech coder family in MPEG-4 audio is different from the standard speech coding algorithms (e.g., ITU-T G.723.1, G.729, etc.). Some of the salient features and functionalities (Figure 10.17) of the MPEG-4 speech coder include multiple sampling rates and bit rates, bit-rate scalability [Gril97], and bandwidth scalability [Nomu98].

The multiple bit rates/sampling rates functionality provides flexible bit rate selection among multiple available bit rates (1.4–24 kb/s), based on the channel conditions and the bandwidth availability (8 kHz and 16 kHz sampling). At lower bit rates, an algorithmic delay of the order of 30–40 ms is expected, while at higher bit rates a 15–25 ms delay is common.

Table 10.4 MPEG-4 speech coding sampling rates and bandwidth specifications [Edle99]

Specification: HVXC | CELP-MPE | CELP-RPE
Sampling frequency (kHz): 8 | 8, 16 | 16
Bit rate (kb/s): 1.4–4 | 3.85–23.8 (58 bit rates) | 10.9–23.8 (30 bit rates)
Frame size (ms): 10–40 | 10–40 | 10–20
Delay (ms): 33.5–56 | ~15–45 | ~20–25
Features: multi-bit-rate coding, bit-rate scalability | multi-bit-rate coding, bit-rate scalability, bandwidth scalability | multi-bit-rate coding, bit-rate scalability

Figure 10.17 MPEG-4 speech coder functionalities. (Figure: multiple sampling rates and bit rates — narrow-band speech coding with bit rates of 2–12.2 kb/s, an 8-kHz sampling rate, 15–40 ms delay, HVXC and CELP; wide-band speech coding with bit rates of 10–24 kb/s, a 16-kHz sampling rate, 15–25 ms delay, CELP. Scalability — bit-rate scalability for bit rates of 2–24 kb/s with HVXC and CELP; bandwidth scalability for bit rates of 4–24 kb/s with CELP only.)


The bit-rate scalability feature allows a wide range of bit rates (2–24 kb/s) in step sizes as low as 100 b/s. Both HVXC and CELP tools can be used to realize bit-rate scalability by employing a core layer and a series of enhancement layers at the encoder. When HVXC encoding is used, one enhancement layer is preferred, while three bit-rate-scalable enhancement layers may be used for the CELP codec [Gril97]. Bandwidth scalability improves the audio quality by adding an additional coding layer that extends the transmitted audio bandwidth. Only the CELP-RPE and CELP-MPE schemes allow bandwidth scalability in MPEG-4 audio; furthermore, only one bandwidth-scalable enhancement layer is possible [Nomu98] [Herr00a].

10.4.4.8 MPEG-4 Structured Audio Coding Structured audio (SA), introduced by Vercoe et al. [Verc98], presents a new dimension to MPEG-4 audio, primarily due to its ability to represent and efficiently encode synthetic audio and multimedia content. The MPEG-4 SA tool [Sche98a] [Sche98c] [Sche99a] [Sche99b] was developed based on a synthesizer-description language called Csound [Verc95], developed by Vercoe at the MIT Media Labs. Moreover, the MPEG-4 SA tool inherits features from "Netsound" [Case96], a structured audio experiment carried out by Casey et al. based on the Csound synthesis language. Instead of specifying a synthesis method, MPEG-4 SA describes a special language that defines synthesis methods. In particular, the MPEG-4 SA tool defines a set of syntax and semantic rules corresponding to the synthesis-description language called the Structured Audio Orchestra Language (SAOL) [Sche98d]. A control (score) language called the Structured Audio Score Language (SASL) was also defined to describe the details of the SAOL code compaction. Another component, namely, the Structured Audio Sample Bank Format (SASBF), is used for the transmission of data samples in blocks. These blocks contain sample data as well as details of the parameters used for selecting optimum wave-table synthesizers, and they facilitate algorithmic modifications. A theoretical basis for SA coding was established in [Sche01] based on Kolmogorov complexity theory. Also in [Sche01], Scheirer proposed a new paradigm called generalized audio coding, in which SA encompasses all other audio coding techniques. Furthermore, a treatment of structured audio in view of both lossless coding and perceptual coding is also given in [Sche01].

The SA bitstream available at the MPEG-4 SA decoder (Figure 10.18) consists of a header, sample data, and score data. The SAOL decoder block acts as an interpreter and reads the header structure. It also provides the information required to reconfigure the synthesis engine. The header carries descriptions of several instruments, synthesizers, control algorithms, and routing instructions. The Event List and Data block obtains the actual stream of data samples and parameters controlling algorithmic modifications. In particular, the bitstream data consists of access units that primarily contain the list of events. Furthermore, each event refers to an instrument described (e.g., in the orchestra chunk) in the header [Sche01]. The SASL decoder block compiles the score data from the SA bitstream and provides control sequences and signals to the synthesis engine via a run-time scheduler. This control information determines the time at which the events (or commands) are to be dispatched in order to create notes (or instances) of an instrument.


Each note produces some sound output. Finally, all these sound outputs (corresponding to each note) are added in order to create the overall orchestra output. In Figure 10.18, we represented the run-time scheduler and reconfigurable synthesis engine blocks separately; however, in practice they are usually combined into one block.

As mentioned earlier, the structured audio tool and text-to-speech (TTS) fall in the synthetic audio coding group. Recall that the structured audio tools convert a structured representation into synthetic sound, while the TTS tools translate text to synthetic speech. In both these methods, the particular synthesis method or implementation is not defined by the MPEG-4 audio standard; however, the input-output relation for SA and the TTS interface are standardized. The next question that arises is how the natural and synthetic audio content can be mixed. This is typically carried out based on a special format specified by MPEG-4, namely, the Audio Binary Format for Scene Description (AudioBIFS) [Sche98e]. AudioBIFS enables sound mixing, grouping, morphing, and effects like echo (delay), reverberation (feedback delay), chorus, etc.

10.4.4.9 MPEG-4 Low-Delay Audio Coding Significantly large algorithmic delays (of the order of 100–200 ms) in the MPEG-4 GA coding tool (discussed in Section 10.4.4.4) hinder its application in two-way real-time communication. These algorithmic delays in the GA coder can be attributed primarily to the analysis/synthesis filter-bank window, the look-ahead, the bit reservoir, and the frame length. In order to overcome large algorithmic delays, a simplified version of the GA tool, i.e., the MPEG-4 low-delay (LD) audio coder, has been proposed [Herr98c] [Herr99]. One of the main reasons for the wide proliferation of this tool is the low algorithmic delay requirement in voice-over-Internet-protocol (VoIP) applications.

Figure 10.18 MPEG-4 SA decoder (after [Sche98a]). (Figure: the SA bitstream — header, samples, and score data — is demultiplexed; the SAOL decoder interprets the bitstream header, the SASL decoder processes the score data/control, and the event list and data block feeds a run-time scheduler that drives a reconfigurable synthesis engine to produce the audio output.)


In contrast to the ITU-T G.728 speech standard, which is based on LD-CELP [G728], the MPEG-4 LD audio coder [Alla99] is derived from the GA coder and MPEG-2 AAC. The ITU-T G.728 LD-CELP algorithm operates on speech frames of 2.5 ms (20 samples) at a sampling rate of 8 kHz and results in an algorithmic delay of 0.625 ms (5 samples). On the other hand, the MPEG-4 LD audio coding tool operates on 512 or 480 samples at sampling rates of up to 48 kHz, with an overall algorithmic delay of 20 ms. Recall that the GA tool, which is based on the MPEG-2 AAC, operates on frames of 1024 or 960 samples.

The delays due to the analysis/synthesis filter-bank window can be reduced by employing shorter windows. The look-ahead delays can be avoided by not employing block switching. To reduce pre-echo distortions (Sections 6.9 and 6.10), TNS is employed in conjunction with window-shape adaptation. In particular, for nontransient parts of the signal a sine window is used, while a so-called low-overlap window is used in the case of transient signals to achieve optimum TNS performance [Purn99b] [ISOI00]. Although most algorithms are fixed rate, the instantaneous bit rates required to satisfy the masked thresholds on each frame are in fact time-varying. Thus, the idea behind a bit reservoir is to store surplus bits during periods of low demand and then to allocate bits from the reservoir during localized periods of peak demand, resulting in a time-varying instantaneous bit rate but, at the same time, a fixed average bit rate. However, in the MPEG-4 LD audio codec, the use of the bit reservoir is minimized in order to reach the desired target delay.
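The bit-reservoir bookkeeping can be sketched as follows (Python; the frame demands, average rate, and reservoir size are arbitrary illustrative numbers): the reservoir is refilled at the fixed average rate, easy frames leave surplus bits behind, and demanding frames draw on that surplus. Shrinking the reservoir toward zero, as the LD coder does, caps every frame near the average rate.

    def allocate_with_reservoir(demands, avg_bits_per_frame, reservoir_size):
        reservoir, grants = reservoir_size, []
        for demand in demands:
            reservoir += avg_bits_per_frame              # constant-rate refill
            granted = min(demand, reservoir)             # peak demand limited by the reservoir
            reservoir = min(reservoir - granted, reservoir_size)
            grants.append(granted)
        return grants

    demands = [800, 700, 1600, 1900, 900, 600]           # per-frame demand around a 1000-bit average
    print(allocate_with_reservoir(demands, 1000, 1500))  # [800, 700, 1600, 1900, 900, 600]
    print(allocate_with_reservoir(demands, 1000, 0))     # [800, 700, 1000, 1000, 900, 600]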

Based on the results published in [Alla99] [Herr99] [Purn99b] [ISOI00], the MPEG-4 LD audio codec performs relatively well compared to the MP3 coder at a bit rate of 64 kb/s per channel. It can also be noted from the MPEG-4 version 2 audio verification tests [ISOI00] that the quality of MPEG-2 AAC at 24 kb/s and of the MPEG-4 LD audio codec at 32 kb/s compare favorably. Moreover, the MPEG-4 LD audio codec [Herr98c] [Alla99] [Herr99] outperformed the ITU-T G.728 LD-CELP [G728] for the coding of both music and speech signals. However, as expected, the coding efficiency of the MPEG-4 LD codec is slightly reduced compared to its predecessors, MPEG-2 AAC and MPEG-4 GA. It should be noted that this reduction in coding efficiency is attributed to the low coding delay achieved.

10.4.4.10 MPEG-4 Audio Error Robustness Tool One of the key issues in achieving reliable transmission over noisy and fast time-varying channels is the bit-rate scalability feature (discussed in Section 10.4.4.5). Bit-rate scalability enables flexible selection of coding features and dynamically adapts to the channel conditions and the varying channel capacity. However, the bit-rate scalability feature alone is not adequate for reliable transmission. Error resilience and error protection tools are also essential to obtain high-quality audio. To this end, the MPEG-4 audio version 2 is fitted with codec-specific error-robustness techniques [Purn99b] [ISOI00]. In this subsection, we review the error robustness and the equal and unequal error protection (EEP and UEP) tools in MPEG-4 audio. In particular, we discuss the error resilience [Sper00] [Sper02], error protection [Purn99b] [Mein01], and error concealment [Sper01] functionalities that are primarily designed for mobile applications.


The main idea behind the error resilience and protection tools is to provide better protection to sensitive and priority (important) bits. For instance, the audio frame header requires maximum error robustness; otherwise, transmission errors in the header will seriously impair the entire audio frame. The codewords corresponding to these priority bits are called the priority codewords (PCWs). The error resilience tools available in MPEG-4 audio version 2 are classified into three groups: Huffman codeword reordering (HCR), reversible variable-length coding (RVLC), and virtual codebooks (VCB11). In the HCR technique, some of the codewords, e.g., the PCWs, are sorted in advance and placed at known positions. First, a presorting procedure is employed that reorders the codewords based on their priority. The resulting PCWs are placed such that an error in one codeword will not affect subsequent codewords. This is achieved by defining segments of known length (LSEG) and placing the PCWs at the beginning of these segments. The non-PCWs are filled into the gaps left by the PCWs, as shown in Figure 10.19.

The various applications of reversible variable-length codes (RVLC) [Taki95] [Wen98] [Tsai01] in image coding have inspired researchers to consider them in error-resilient techniques for MPEG-4 audio. RVLC codes are used instead of Huffman codes for packing the scale factors in an AAC bitstream. The RVLC codes are (symmetrically) designed to enable both forward and backward decoding without affecting the coding efficiency. In particular, RVLCs allow instantaneous decoding in both directions, which provides error robustness and significantly reduces the effects of bit errors in delay-constrained real-time applications. The next important tool employed for error resilience is the virtual codebook 11 (VCB11). Virtual codebooks are used to detect serious errors within spectral data [Purn99b] [ISOI00]. The error robustness techniques are codec specific (e.g., for AAC and BSAC bitstreams). For example, AAC supports the HCR, RVLC, and VCB11 error-resilience tools. On the other hand, BSAC supports segmented binary arithmetic coding [ISOI00] to avoid error propagation within spectral data.
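The reversibility property can be illustrated with a small symmetric code (Python; the palindromic codeword table is illustrative and is not the normative AAC scale-factor RVLC): because the codewords are prefix-free in both directions, the same bitstream can be decoded instantaneously from either end.

    RVLC = {"a": "0", "b": "11", "c": "101", "d": "1001"}       # palindromic codewords
    FWD = {code: sym for sym, code in RVLC.items()}
    BWD = {code[::-1]: sym for sym, code in RVLC.items()}       # identical table, since palindromes

    def decode(bits, table):
        symbols, buf = [], ""
        for bit in bits:
            buf += bit
            if buf in table:
                symbols.append(table[buf])
                buf = ""
        return symbols

    bits = "".join(RVLC[s] for s in "badcab")
    print(decode(bits, FWD))                    # forward pass:  ['b', 'a', 'd', 'c', 'a', 'b']
    print(decode(bits[::-1], BWD)[::-1])        # backward pass recovers the same sequence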

Figure 10.19 Huffman codeword reordering (HCR) algorithm to minimize error propagation in spectral data. (Figure: before codeword reordering, PCWs and non-PCWs are interleaved in the bitstream; after reordering, each PCW is placed at the start of a segment of known length LSEG, and the non-PCWs fill the gaps between them.)


The MPEG-4 audio error protection tools include cyclic redundancy check (CRC), forward error correction (FEC), and interleaving. Note that these tools are inspired by some of the error-correcting/detecting features inherent in convolutional and block codes, which essentially provide the controlled redundancy desired for error protection. Unlike the error-resilience tools, which are limited only to the AAC and BSAC bitstreams, the error protection tools can be used in conjunction with a variety of MPEG-4 audio tools, namely, the General Audio Coder (LTP and TwinVQ), the Scalable Audio Coder, the parametric audio coder (HILN), CELP, HVXC, and the low-delay audio coder. Similar to the error-resilience tools, the first step in the EP tools is to reorder the bits based on their priority and error sensitivity. The bits are sorted and grouped into different classes (usually 4 or 5) according to their error sensitivities. For example, consider that there are four error-sensitive classes (ESC), namely, ESC-0, ESC-1, ESC-2, and ESC-3. Usually, the header bitstream or other very important bits that control the syntax and the global gain are included in ESC-0, while the scale factors and the spectral data (spectral envelope) are grouped in ESC-1 and ESC-2, respectively. The remaining side information and the indices of MDCT coefficients are classified in ESC-3. After reordering (grouping) the bits, each error-sensitive class receives a different error protection, depending on the overhead allowed for each configuration. CRC and systematic rate-compatible punctured convolutional codes (SRCPC) enable error detection and forward error correction (FEC). The SRCPC codes also aid in adjusting the redundancy rates in small steps. An interleaver is typically employed to deconcentrate or spread the burst errors. Shortened Reed-Solomon (SRS) codes are used to protect the interleaved data. Details on the design of Reed-Solomon codes for MPEG AAC are given in [Huan02]. For an in-depth treatment of error-correcting and error-detecting codes, refer to [Lin82] [Wick95].

10.4.4.11 MPEG-4 Speech Coding Tool Versus ITU-T Speech Standards It is noteworthy to compare the MPEG-4 speech coding tool against the ITU-T speech coding standards. While the latter apply a source-filter configuration to model the speech parameters, the former employs a variety of techniques in addition to the traditional parametric representation. The MPEG-4 speech coding tool allows bit-rate scalability and real-time processing, as well as applications related to storage media. The MPEG-4 speech coding tool incorporates algorithms such as the TwinVQ, the BSAC, and the HVXC/CELP. The MPEG-4 speech coding tool also accommodates multiple sampling rates. Error protection and error-resilience techniques are provided in the MPEG-4 speech coding tool to obtain improved performance over error-prone channels. Other important features that distinguish the MPEG-4 tool from the ITU-T speech standards are content-based interactivity and the ability to represent audiovisual content as a set of objects.

10.4.4.12 MPEG-4 Audio Applications The MPEG-4 audio standard finds applications in low-bit-rate audio/speech compression, individual coding of natural and synthetic audio objects, low-delay coding, error-resilient transmission, and real-time audio transmission over packet-switching networks such as the Internet [Diet96] [Liu99].


MPEG-4 tools allow parameterization of the acoustical properties of an audio scene, with features such as immersive audiovisual rendering (virtual 3-D environments [Kau98]), room acoustical modeling, and enhanced 3-D sound presentation. MPEG-4 also finds interesting applications in remote robot control system design [Kim02b]. Streaming audio codecs have also been proposed as a result of the MPEG-4 standardization efforts.

Applications of MPEG-4 audio in DRM digital narrowband broadcasting (DNB) and in digital multimedia broadcasting (DMB) are given in [Diet00] and [Grub01], respectively. The general audio coding tool provides the necessary infrastructure for the design of error-robust scalable coders [Mori00b] and delivers improved speech/audio quality [Mori00a]. The "bit-rate scalability" and "error resilience/protection" tools of the MPEG-4 audio standard dynamically adapt to the channel conditions and the varying channel capacity. Other important application-oriented features of MPEG-4 audio include low-delay bi-directional audio transmission, content-based interactivity, and object-based representation. Real-time implementations of MPEG-4 audio are reported in [Hilp00] [Mesa00] [Pena01].

10.4.4.13 Spectral Band Replication and Parametric Stereo Spectral band replication (SBR) [Diet02] and parametric stereo (PS) [Schu04] are the two new compression techniques recently added to the MPEG-4 audio standard [ISOI03c]. The SBR technique is used in conjunction with a conventional coder such as the MP3 or the MPEG AAC. The audio signal is divided into low- and high-frequency bands. The underlying core coder operates at a reduced sampling rate and encodes the low-frequency band. The SBR technique operates at the original sampling rate to estimate the spectral envelope associated with the input audio. The spectral envelope, along with a set of control parameters, is encoded and transmitted to the decoder. The control parameters contain information regarding the gain and the spectral envelope level adjustment of the high-frequency components. At the decoder, the SBR reconstructs the high frequencies based on the transposition of the lower frequencies.
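A highly simplified sketch of the SBR idea follows (Python/NumPy; FFT magnitudes are used here instead of the standard's QMF filter bank, and the envelope band count and test signal are assumptions): the encoder transmits the low band plus a coarse high-band energy envelope, and the decoder transposes low-band content upward and rescales it to match that envelope.

    import numpy as np

    def sbr_encode(x, n_env_bands=8):
        # Keep the low half of the spectrum; send only an energy envelope for the high half.
        X = np.fft.rfft(x)
        half = (len(X) + 1) // 2
        low = X[:half]
        high = np.abs(X[half:])
        env = [float(np.mean(band ** 2)) for band in np.array_split(high, n_env_bands)]
        return low, env

    def sbr_decode(low, env, n_total):
        # Transpose low-band bins upward and scale each band to the transmitted energy.
        high = low[:n_total - len(low)].copy()
        for idx, target in zip(np.array_split(np.arange(len(high)), len(env)), env):
            current = np.mean(np.abs(high[idx]) ** 2) + 1e-12
            high[idx] *= np.sqrt(target / current)
        return np.fft.irfft(np.concatenate([low, high]))

    fs = 32000
    t = np.arange(2048) / fs
    x = np.sin(2 * np.pi * 700 * t) + 0.3 * np.sin(2 * np.pi * 9000 * t)
    low, env = sbr_encode(x)
    y = sbr_decode(low, env, n_total=1025)      # 1025 bins = rfft length of a 2048-sample frame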

aacPlus v1 is the combination of AAC and SBR and is standardized as MPEG-4 high-efficiency (HE)-AAC [ISOI03c] [Wolt03]. Relative to conventional AAC, MPEG-4 HE-AAC results in bit rate reductions of about 30% [Wolt03]. The SBR has also been used to enhance the performance of MP3 [Zieg02] and of the MPEG Layer 2 digital audio broadcasting systems [Gros03].

aacPlus v2 [Purn03] adds parametric stereo coding to the MPEG-4 HE-AAC standard. In PS encoding [Schu04], the stereo signal is represented as a monaural signal plus ancillary data that describe the stereo image. The stereo image is described using four different PS parameters, i.e., inter-channel intensity differences (IID), inter-channel phase differences (IPD), inter-channel coherence (IC), and overall phase difference (OPD). These PS parameters can capture the perceptually relevant spatial cues at bit rates as low as 10 kb/s [Bree04].
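Two of these cues can be computed per band as in the following sketch (Python/NumPy; the uniform band split and the simple sum downmix are simplifications of the actual PS analysis): the encoder would transmit the downmix through the core coder plus the per-band parameters as ancillary data.

    import numpy as np

    def parametric_stereo_analysis(left, right, n_bands=16):
        # Mono downmix plus per-band level difference (dB) and coherence.
        L, R = np.fft.rfft(left), np.fft.rfft(right)
        mono = 0.5 * (left + right)
        iid, icc = [], []
        for idx in np.array_split(np.arange(len(L)), n_bands):
            el = np.sum(np.abs(L[idx]) ** 2) + 1e-12
            er = np.sum(np.abs(R[idx]) ** 2) + 1e-12
            cross = np.abs(np.sum(L[idx] * np.conj(R[idx])))
            iid.append(10.0 * np.log10(el / er))     # inter-channel intensity difference
            icc.append(cross / np.sqrt(el * er))     # inter-channel coherence, 0..1
        return mono, np.array(iid), np.array(icc)

    src = np.random.default_rng(0).standard_normal(1024)
    mono, iid, icc = parametric_stereo_analysis(src, 0.5 * src)   # amplitude-panned source
    print(iid[:4])   # about +6 dB in every band
    print(icc[:4])   # close to 1.0 (fully coherent channels)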


10.4.5 MPEG-7 Audio (ISO/IEC 15938-4)

The MPEG-7 audio standard targets content-based multimedia applications [ISOI01b]. MPEG-7 audio supports a broad range of applications [ISOI01d] that include multimedia indexing/searching, multimedia editing, broadcast media selection, and multimedia digital library sorting. Moreover, it provides ways for efficient audio file retrieval and supports both text-based and context-based queries. It is important to note that MPEG-7 will not replace MPEG-1, MPEG-2 BC/LSF, MPEG-2 AAC, or MPEG-4; it is intended to provide complementary functionality to these MPEG standards. If MPEG-4 is considered the first object-based multimedia representation standard, then MPEG-7 can be regarded as the first content-based standard that incorporates multimedia interfaces through descriptions. These descriptions are the means of linking the audio content features and attributes with the audio itself. Figure 10.20 presents an overview of the MPEG-7 audio standard. This figure depicts the various audio tools, features, and profiles associated with MPEG-7 audio. Publications on the MPEG-7 audio standard include [Lind99] [Nack99a] [Nack99b] [Lind00] [ISOI01b] [ISOI01e] [Lind01] [Quac01] [Manj02].

Motivated by the need to exchange multimedia content through the World Wide Web, in 1996 the ISO/IEC MPEG workgroup worked on a project called "Multimedia Content Description Interface" (MCDI) – MPEG-7. A working draft was formed in December 1999, followed by a final committee draft in February 2001. Seven months later, MPEG-7 ISO/IEC 15938 Part 4: Audio, an international standard (IS) for content-based multimedia applications, was published along with seven other parts of the MPEG-7 standard (Figure 10.20). Figure 10.20 shows a summary of the various features, applications, and profiles specified by the MPEG-7 audio coding standard.

10.4.5.1 MPEG-7 Parts MPEG-7 defines the following eight parts [MPEG] (Figure 10.20): MPEG-7 Systems, MPEG-7 DDL, MPEG-7 Visual, MPEG-7 Audio, MPEG-7 MDS, MPEG-7 Reference Software (RS), MPEG-7 Conformance Testing (CT), and MPEG-7 Extraction and Use of Descriptions.

MPEG-7 Systems (Part I) specifies the binary format for encoding MPEG-7 descriptions. MPEG-7 DDL (Part II) is the language for defining the syntax of the description tools. MPEG-7 Visual (Part III) and MPEG-7 Audio (Part IV) deal with the visual and audio descriptions, respectively. MPEG-7 MDS (Part V) defines the structures for multimedia descriptions. MPEG-7 RS (Part VI) is a unique software implementation of certain parts of the MPEG-7 standard with noninformative status. MPEG-7 CT (Part VII) provides the essential guidelines/procedures for conformance testing of MPEG-7 implementations. Finally, the eighth part addresses the use and formulation of a variety of description tools that we will discuss later in this section.

In our discussion on MPEG-7 Audio, we refer to the MPEG-7 DDL and MPEG-7 MDS parts quite regularly, mostly due to their interconnectivity within the MPEG-7 audio framework. Therefore, it is necessary that we introduce these two parts first, before we move on to the MPEG-7 audio description tools.

Figure 10.20 An overview of the MPEG-7 standard. (Figure: MPEG-7 [ISOI01e] parts — Systems; Description Definition Language (DDL) [ISOI01a]; Visual; Audio [ISOI01b]; Multimedia Description Scheme (MDS) [ISOI01c]; Reference Software; Conformance Testing; Extraction and Use of Descriptions. MPEG-7 audio description tools — low-level or generic tools (1. audio framework, 2. silence segment) and high-level, application-specific description tools (1. spoken content, 2. musical instrument timbre, 3. melody, 4. sound recognition and indexing, 5. audio signature/robust matching). Profiles — simple profile, user description profile, summary profile, audiovisual logging profile. Features — multimedia content description interface, cognitive coding, a new description technology. Applications [ISOI01d] — multimedia indexing, multimedia editing, selecting broadcast media, multimedia digital library sorting.)


Figure 10.21 Some essential building blocks of the MPEG-7 standard: descriptors (Ds), description schemes (DSs), and the description definition language (DDL). (Figure notes: 1. Descriptors (Ds) are the features and attributes associated with an audio waveform. 2. The structure and the relationships among the descriptors are defined by a description scheme (DS). 3. The description definition language (DDL) defines the syntax necessary to create, extend, and combine a variety of DSs and Ds.)

MPEG-7 Description Definition Language (DDL) – Part II We mentioned earlier that MPEG-7 incorporates multimedia interfaces through descriptors. These descriptors are the features and attributes associated with the audio. For example, descriptors in the MPEG-7 Visual part describe visual features such as color, resolution, contour mapping techniques, etc. A group of descriptors related in a manner suitable for a specific application forms a description scheme (DS). The standard [ISOI01e] defines the description scheme as one that specifies a structure for the descriptors and the semantics of their relationships.

MPEG-7 in its entirety has been built around these descriptors (Ds) and description schemes (DSs) and, most importantly, on a language called the description definition language (DDL). The DDL defines the syntax necessary to create, extend, and combine a variety of DSs and Ds. In particular, the DDL forms "the core part" of the MPEG-7 standard and is also invoked by other parts (i.e., Visual, Audio, and MDS) to create new Ds and DSs. The DDL follows a set of programming rules/structures similar to the ones employed in the eXtensible Markup Language (XML). It is important to note that the DDL is not a modeling language but a schema language that is based on the WWW Consortium's XML Schema [XML] [ISOI01e]. Several modifications were needed before adopting the XML Schema language as the basis for the DDL. We refer to [XML] [ISOI01a] [ISOI01e] for further details on the XML Schema language and its liaison with the MPEG-7 DDL [ISOI01a].

MPEG-7 Multimedia Description Schemes (MDS) – Part V Recall that a description scheme (DS) specifies structures for descriptors; similarly, a multimedia description scheme (MDS) [ISOI01c] provides details on the structures for describing multimedia content (in particular, audio, visual, and textual data).


MPEG-7 MDS defines two classes of description tools, namely, the basic (or low-level) and the multimedia (or high-level) tools [ISOI01c]. Figure 10.22 shows the classification of MDS elements. The basic tools specified by the MPEG-7 MDS are the generic entities, usually associated with simple descriptors such as the basic data types, textual database, etc. On the other hand, the high-level multimedia tools deal with content-specific entities that are complex and involve signal structures, semantics, models, efficient navigation, and access. The high-level (complex) tools are further subdivided into five groups (Figure 10.22), i.e., content description, content management, content organization, navigation and access, and user interaction.

Let us consider an example to better understand the concepts of the DDL and the MDS framework. Suppose that an audio signal s(n) is described using three descriptors, namely, spectral features D1, parametric models D2, and energy D3. Similarly, visual, v(i, j), and textual content can also be described, as shown in Table 10.5. We arbitrarily chose four description schemes (DS1 through DS4) that link these multimedia features (audio, visual, and textual) in a structured manner. This linking mechanism is performed through the DDL, a schema language designed specifically for MPEG-7. From Table 10.5, the descriptors D2, D8, and D9

are related using the description scheme DS2. The melody descriptor D8 provides the melodic information (e.g., rhythmic, high-pitch, etc.), and the timbre descriptor D9 represents some perceptual features (e.g., pitch/loudness details, bass/treble adjustments in audio, etc.). The parametric model descriptor D2 describes the audio encoding model and related encoder details (e.g., MPEG-1 Layer III, sampling rates, delay, bit rates, etc.). While the descriptor D2 provides details on the encoding procedure, the descriptors D8 and D9 describe audio morphing, echo/reverberation, tone control, etc.

MPEG-7 Audio – Part IV MPEG-7 Audio represents Part IV of the MPEG-7 standard and provides structures for describing the audio content. Figure 10.23 shows the organization of the MPEG-7 audio framework.

Figure 10.22 Classification of multimedia description scheme (MDS) tools. (Figure: the MPEG-7 multimedia description scheme (MDS) comprises basic tools and concept-specific tools; the concept-specific tools cover content description, content management, content organization, navigation and access, and user interaction.)


Table 10.5 A hypothetical example that gives a broader perspective on multimedia descriptors, i.e., audio, visual, and textual features used to describe multimedia content

Group: Descriptors | Description schemes
Audio content s(n): D1 Spectral features; D2 Parametric models; D3 Energy of the signal | DS1: D1, D3
Visual content v(i, j): D4 Color; D5 Shape | DS2: D2, D8, D9; DS3: DS2, D4, D5
Textual descriptions: D6 Title of the clip; D7 Author information; D8 Melody details; D9 Timbre details | DS4: DS1, D6, D7

Figure 10.23 MPEG-7 audio description tools. (Figure: low-level or generic tools — 1. audio framework, 2. silence segment; high-level, application-specific description tools — 1. spoken content, 2. musical instrument timbre, 3. melody, 4. sound recognition and indexing, 5. audio signature/robust matching.)

10.4.5.2 MPEG-7 Audio Versions and Profiles New extensions (Amendment 1) to the existing MPEG-7 Audio are being considered. Some of the extensions are in the areas of application-specific spoken content, tempo description, and specification of precision for low-level data types. This new amendment will be standardized as MPEG-7 Audio Version 2 (final drafts of the international standard (FDIS) for Version 2 were finalized in March 2003).

Although many description tools are available in MPEG-7 audio, it is not practical to implement all of them in a particular system. MPEG-7 Version 1 therefore defines four complexity-ranked profiles (Figure 10.20) intended to help system designers in the task of tool subset selection. These include the simple profile, the user description profile, the summary profile, and the audiovisual logging profile.


10.4.5.3 MPEG-7 Audio Description Tools The MPEG-7 audio framework comprises two main categories, namely, generic tools and a set of application-specific tools (see Figure 10.20 and Figure 10.23).

10.4.5.3.1 Generic Tools The generic toolset consists of 17 low-level audio descriptors and a silence segment descriptor (Table 10.6).

MPEG-7 Audio Low-level Descriptors MPEG-7 audio [ISOI01b] defines two ways of representing the low-level audio features, i.e., segmenting and sampling. In segmentation, usually common data types or scalars are grouped together (e.g., energy, power, bit rate, sampling rate, etc.). On the other hand, sampling enables discretization of audio features in a vector form (e.g., spectral features, excitation samples, etc.). Recently, a unified framework called the scalable series [Lind99] [Lind00] [ISOI01b] [Lind01] has been proposed to manipulate these discretized values. This is somewhat similar to the MPEG-4 scalable audio coding that we discussed in Section 10.4.4. A list of the low-level audio descriptors defined by the MPEG-7 audio standard [ISOI01b] is summarized in Table 10.6.

Table 10.6 Low-level audio descriptors (17 in number) and the silence descriptor supported by the MPEG-7 generic toolset [ISOI01b]

Low-level audio descriptors group:
1. Basic: D1 Audio waveform; D2 Power
2. Basic spectral: D3 Spectrum envelope; D4 Spectral centroid; D5 Spectral spread; D6 Spectral flatness
3. Signal parameters: D7 Harmonicity; D8 Fundamental frequency
4. Spectral basis: D9 Spectrum basis; D10 Spectrum projection
5. Timbral spectral: D11 Harmonic spectral centroid; D12 Harmonic spectral deviation; D13 Harmonic spectral spread; D14 Harmonic spectral variation; D15 Spectral centroid
6. Timbral temporal: D16 Log attack time; D17 Temporal centroid

Silence:
7. Silence segment: D18 Silence descriptor


These descriptors can be classified into the following groups: basic, basic spectral, signal parameters, spectral basis, timbral spectral, and timbral temporal.
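Three of the basic spectral descriptors can be computed for a single frame as in the sketch below (Python/NumPy; the definitions follow common signal-processing conventions and are not quoted from the normative MPEG-7 text):

    import numpy as np

    def basic_spectral_descriptors(frame, fs):
        # Spectral centroid, spread, and flatness of one windowed frame.
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        p = spectrum / np.sum(spectrum)                  # normalize to a distribution
        centroid = np.sum(freqs * p)                     # "center of mass" of the spectrum
        spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))
        flatness = np.exp(np.mean(np.log(spectrum + 1e-12))) / np.mean(spectrum)
        return centroid, spread, flatness

    fs = 16000
    t = np.arange(1024) / fs
    tone = np.sin(2 * np.pi * 1000 * t)                      # tonal frame: flatness near 0
    noise = np.random.default_rng(0).standard_normal(1024)   # noise frame: flatness near 1
    print(basic_spectral_descriptors(tone, fs))
    print(basic_spectral_descriptors(noise, fs))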

MPEG-7 Silence Segment The MPEG-7 silence segment attaches a semantic of silence to an audio segment. The silence descriptor provides ways to specify threshold levels (e.g., the level of silence).

10.4.5.3.2 High-Level or Application-Specific MPEG-7 Audio Tools Besides the aforementioned generic toolset, the MPEG-7 audio standard describes five specialized high-level tools (Table 10.7). These application-specific description tools can be grouped as spoken content, musical instrument timbre, melody, sound recognition/indexing, and robust audio matching.

Spoken Content Description Tool (SC-DT) The SC-DT provides descriptions of spoken words in an audio clip, thereby enabling speech recognition and speech parameter indexing/searching. The spoken content lattice and the spoken content header are the two important parts of the SC-DT (see Table 10.7). While the SC header carries the lexical information (i.e., word lexicon, phone lexicon, ConfusionInfo, and SpeakerInfo descriptors), the SC-lattice DS represents lattice structures that connect words or phonemes chosen from the corresponding lexicon. The idea of using lattice structures in the SC-lattice DS is similar to the one employed in a typical continuous automatic speech recognition scenario [Rabi89] [Rabi93].

Musical Instrument Timbre Description Tool (MIT-DT) The MIT-DT describes the timbre features (i.e., perceptual attributes) of sounds from musical instruments. Timbre can be defined as the collection of perceptual attributes that make two audio clips having the same pitch and loudness sound different [ISOI01b].

Table 107 Application-specific audio descriptors and descriptionschemes [ISOI01b]

High-level descriptor toolset Descriptor details

SC-DT 1 SC-header D1 Word lexiconD2 Phone lexiconD3 Confusion infoD4 Speaker info

2 SC-lattice DS Provides structures to connect or linkthe wordsphonemes in the lexicon

MIT-DT 3 Timbre (perceptual) featuresof musical instruments

D1 Harmonic Instrument Timbre

D2 Percussive Instrument TimbreM-DT 4 Melody features DS1 Melody contour

DS2 Melody sequenceSRI-DT 5 Sound recognition and

indexing applicationD1 Sound Model State Path

D2 Sound Model State HistogramDS1 Sound modelDS2 Sound classification model

AS-DT 6 Robust audio identification DS1 Audio signature DS


Musical instrument sounds, in general, can be classified as harmonic-coherent-sustained, percussive-nonsustained, nonharmonic-coherent-sustained, and noncoherent-sustained. The standard defines descriptors for the first two classes of musical sounds (Table 10.7). In particular, the MIT-DT defines two descriptors, namely, the harmonic instrument timbre (HIT) descriptor and the percussive instrument timbre (PIT) descriptor. The HIT descriptor is built on the four harmonic low-level descriptors (i.e., D11 through D14 in Table 10.6) and the log attack time descriptor. On the other hand, the PIT descriptor is based on the combination of the timbral temporal low-level descriptors (i.e., log attack time and temporal centroid) and the spectral centroid descriptor.

Melody Description Tool (M-DT) The M-DT represents the melody details of an audio clip. The melody contour DS and the melody sequence DS are the two schemes included in the M-DT. While the former scheme enables simple and robust melody contour representation, the latter approach involves detailed and expanded melodic/rhythmic information.

Sound Recognition and Indexing Description Tool (SRI-DT) The SRI-DT addresses automatic sound identification/recognition and indexing. Recall that the SC-DT employs lexicon descriptors (Table 10.7) for spoken content recognition in an audio clip. In the case of SRI, classification/indexing of sound tracks is achieved through sound models. These models are constructed based on the spectral-basis low-level descriptors, i.e., spectrum basis (D9) and spectrum projection (D10), listed in Table 10.6. Two descriptors, namely, the sound model state path descriptor and the sound model state histogram descriptor, are defined to keep track of the active paths in a trellis.

Robust Audio Identification and Matching Robust matching and identification of audio clips is one of the important applications of the MPEG-7 audio standard [ISOI01d]. This feature is enabled by the low-level spectral flatness descriptor (Table 10.6). A description scheme, namely, the audio signature DS, defines the semantics and structures for the spectral flatness descriptor. Hellmuth et al. [Hell01] proposed an advanced audio identification procedure based on content descriptions.

10.4.5.4 MPEG-7 Audio Applications Being the first metadata standard, MPEG-7 audio provides new ideas for audio-content indexing and archiving [ISOI01d]. Some of the applications are in the areas of multimedia searching, audio file indexing, sharing and retrieval, and media selection for digital audio broadcasting (DAB). We discussed most of these applications while addressing the high-level audio descriptors and description schemes; a summary of these applications follows. Unlike in an automatic speech recognition scenario, where word or phoneme lattices (based on feature vectors) are employed for identifying speech, in MPEG-7 these lattice structures are denoted as Ds and DSs. These description data enable spoken content retrieval. MPEG-7 audio version 2 includes new tools and specialized enhancements for spoken content search. Musical instrument timbre search is another important application that targets content-based editing. Melody search enables query by humming [Quac01]. Sound recognition/indexing and audio identification/fingerprinting form two other important applications of MPEG-7.


We will next address the concepts of "interoperability" and "universal multimedia access" (UMA) in the context of the new work initiated by the ISO/IEC MPEG workgroup in June 2000, called the Multimedia Framework – MPEG-21 [Borm03].

10.4.6 MPEG-21 Framework (ISO/IEC-21000)

Motivated by the need for a standard that enables multimedia content access and distribution, the ISO/IEC MPEG workgroup addressed the 21st Century Multimedia Framework – MPEG-21: ISO/IEC 21000 [Spen01] [ISOI02a] [ISOI03a] [ISOI03b] [Borm03] [Burn03]. This multimedia standard should be interoperable and highly automated [Borm03]. The MPEG-21 multimedia framework envisions creating a platform that encompasses a great deal of functionality for both content users and content creators/providers. Some of these functions include multimedia resource delivery to a wide range of networks and terminals (e.g., personal computers (PCs), PDAs and other digital assistants, mobile phones, third-generation cellular networks, digital audio/video broadcasting (DAB/DVB), HDTVs, and several other home entertainment systems) and protection of intellectual property rights through digital rights management (DRM) systems.

Content creators and service providers face several challenging tasks in order to satisfy simultaneously the conflicting demands of "interoperability" and "intellectual property management and protection" (IPMP). To this end, MPEG-21 defines a multimedia framework that comprises seven important parts [ISOI02a], as shown in Table 10.8. Recall that the MPEG-7 ISO/IEC-15938 standard defines a fundamental unit called "descriptors" (Ds) to define/declare the features and attributes of multimedia content. In a manner analogous to this, MPEG-21 ISO/IEC-21000 Part 1 defines a basic unit called the "digital item" (DI). Besides the DI, MPEG-21 specifies another entity called "user" interaction [ISOI02a] [Burn03] that provides details on how each "user" interacts with other users via objects called digital items. Furthermore, MPEG-21 Parts 2 and 3 define the declaration and identification of the DIs, respectively (see Table 10.8). MPEG-21 ISO/IEC-21000 Parts 4 through 6 enable interoperable digital content distribution and transactions that take into account the IPMP requirements. In particular, a machine-readable language called the Rights Expression Language (REL) is specified in MPEG-21 ISO/IEC-21000 Part 5; it defines the rights and permissions for the access and distribution of multimedia resources across a variety of heterogeneous terminals and networks. MPEG-21 ISO/IEC-21000 Part 6 defines a dictionary called the Rights Data Dictionary (RDD) that contains information on content protection and rights.

The MPEG-7 and MPEG-21 standards provide an open framework on which one can build application-oriented interfaces or tools that satisfy a specific criterion (e.g., a query, an audio file indexing, etc.). In particular, the MPEG-7 standard provides an interface for indexing, accessing, and distributing multimedia content, and MPEG-21 defines an interoperable framework to access that multimedia content.


Table 10.8 MPEG-21 multimedia framework and the associated parts [ISOI02a]

Part 1, Vision, technologies, and strategy – Defines the vision, requirements, and applications of the standard, and provides an overview of the multimedia framework. Introduces two new terms, i.e., digital item (DI) and user interaction.
Part 2, Digital item declaration – Defines the relationship between a variety of multimedia resources and provides information regarding the declaration of DIs.
Part 3, Digital item identification – Provides ways to identify different types of digital items (DIs) and descriptors/description schemes (Ds/DSs) via uniform resource identifiers (URIs).
Part 4, IPMP – Defines a framework for the intellectual property management and protection (IPMP) that enables interoperability.
Part 5, Rights expression language – A syntax language that enables multimedia content distribution in a way that protects the digital content. The rights and the permissions are expressed or declared based on the terms defined in the rights data dictionary.
Part 6, Rights data dictionary – A database or a dictionary that contains the information regarding the rights and permissions to protect the digital content.
Part 7, Digital item adaptation – Defines the concept of an adapted digital item.

Until now, our focus was primarily on the ISO/IEC MPEG audio standards. In the next few sections, we will attend to company-oriented perceptual audio coding algorithms, i.e., the Sony Adaptive Transform Acoustic Coding (ATRAC), the Lucent Technologies Perceptual Audio Coder (PAC), the Enhanced PAC (EPAC), the Multichannel PAC (MPAC), Dolby Laboratories AC-2/AC-2A/AC-3, Audio Processing Technology (APT-x100), and the Digital Theater Systems (DTS) Coherent Acoustics (CA) encoder algorithms.

10.4.7 MPEG Surround and Spatial Audio Coding

MPEG spatial audio coding began receiving attention during the early 2000s [Fall01] [Davis03]. Advances in joint stereo coding of multichannel signals [Herr04b], binaural cue coding [Fall01], and the success of the recent low-complexity parametric stereo encoding in the MPEG-4 HE-AAC/PS standard [Schu04] generated interest in MPEG surround and spatial audio coding [Herr04a] [Bree05]. Unlike the discrete 5.1-channel encoding used in Dolby Digital or DTS, MPEG spatial audio coding captures the "spatial image" of a multichannel audio signal. The spatial image is represented using a compact set of parameters that describe the perceptually relevant differences among the channels. Typical parameters include the interchannel level difference (ICLD), the interchannel coherence (ICC), and the interchannel time difference (ICTD). The multichannel signal is first downmixed to a stereo signal, and then a conventional MP3 coder is used. Spatial image parameters are computed using the binaural cue coding (BCC) technique and are transmitted to the decoder as side information [Herr04a]. At the decoder, a one-to-two (OTT) or two-to-three (TTT) channel mapping is used to synthesize the multichannel surround sound.
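To make these parameters concrete, the sketch below (Python) estimates the ICLD, ICC, and ICTD for one analysis band of a two-channel signal. It is illustrative only and is not taken from any MPEG reference implementation; the ±1-ms lag search and the circular shift are simplifying assumptions.

    import numpy as np

    def spatial_cues(x1, x2, fs):
        """Estimate interchannel cues for one analysis band (illustrative only)."""
        p1 = np.sum(x1 ** 2) + 1e-12            # band power, channel 1
        p2 = np.sum(x2 ** 2) + 1e-12            # band power, channel 2
        icld_db = 10.0 * np.log10(p1 / p2)      # interchannel level difference (dB)
        # normalized cross-correlation over candidate lags (+/- 1 ms)
        lags = np.arange(-fs // 1000, fs // 1000 + 1)
        rho = [np.sum(x1 * np.roll(x2, d)) / np.sqrt(p1 * p2) for d in lags]
        icc = float(np.max(rho))                # interchannel coherence
        ictd = lags[int(np.argmax(rho))] / fs   # interchannel time difference (s)
        return icld_db, icc, ictd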

10.5 ADAPTIVE TRANSFORM ACOUSTIC CODING (ATRAC)

The ATRAC algorithm, developed by Sony for use in its rewriteable MiniDisc system [Yosh94], combines subband and transform coding to achieve nearly CD-quality coding of 44.1-kHz, 16-bit PCM input data at a bit rate of 146 kb/s per


Figure 10.24 Sony ATRAC (embedded in MiniDisc, SDDS): the input s(n) is split by two QMF analysis banks into 0–5.5 kHz, 5.5–11 kHz, and 11–22 kHz subbands; each subband is transformed by a window-selected MDCT (32/128 pt for the low and mid bands, 32/256 pt for the high band), followed by bit allocation and quantization.

channel [Tsut98]. Using a tree-structured QMF analysis bank (Section 6.5), the ATRAC encoder (Figure 10.24) first splits the input signal into three subbands of 0–5.5 kHz, 5.5–11 kHz, and 11–22 kHz. Like MPEG-1 layer III, the ATRAC QMF bank is followed by signal-adaptive MDCT analysis in each subband. Next, a window-switching scheme is employed that can be summarized as follows. During steady-state input periods, high-resolution spectral analysis is attained using 512-sample blocks (11.6 ms). During input attack or transient periods, short block sizes of 1.45 ms in the high-frequency band and 2.9 ms in the low- and mid-frequency bands are used for pre-echo cancellation.

After MDCT analysis, spectral components are clustered into 52 nonuniform subbands (block floating units, or BFUs) according to critical band spacing. The BFUs are block-companded, quantized, and encoded according to a psychoacoustically derived bit allocation. For each analysis frame, the ATRAC encoder transmits quantized MDCT coefficients, subband window lengths, BFU scale-factors, and BFU word lengths to the decoder. Like the MPEG family, the ATRAC architecture decouples the decoder from psychoacoustic analysis and bit-allocation details. Evolutionary improvements in the encoder bit-allocation strategy are therefore possible without modifying the decoder structure. An added benefit of this architecture is asymmetric complexity, which enables inexpensive decoder implementations.

Suggested bit allocation techniques for ATRAC are of lower complexity than those found in other standards, since ATRAC is intended for low-cost, battery-powered devices. One proposed method distributes bits between BFUs according to a weighted combination of fixed and adaptive bit allocations [Tsut96]. For the k-th BFU, bits are allocated according to the relation

r(k) = α·ra(k) + (1 − α)·rf(k) − β,    (10.10)

where rf(k) is a fixed allocation, ra(k) is a signal-adaptive allocation, the parameter β is a constant offset computed to guarantee a fixed bit rate, and the parameter α is a tonality estimate ranging from 0 (noise-like) to 1 (tone-like). The fixed


allocations rf(k) are the same for all inputs and concentrate more bits at the lower frequencies. The signal-adaptive bit allocations ra(k) assign bits according to the strength of the MDCT components. The effect of Eq. (10.10) is that more bits are allocated to BFUs containing strong peaks for tonal signals. For noise-like signals, bits are allocated according to a fixed allocation rule, with low bands receiving more bits than high bands.
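A minimal sketch of this weighted allocation is given below (Python). The log-energy form of the adaptive term and the simple search over β are assumptions made for illustration; Eq. (10.10) only requires that β be chosen so that the fixed bit budget is met.

    import numpy as np

    def atrac_bfu_allocation(bfu_energy, alpha, total_bits, r_fixed):
        """Illustrative bit allocation in the spirit of Eq. (10.10).
        bfu_energy : per-BFU MDCT energy; r_fixed : fixed allocation profile."""
        # signal-adaptive allocation: more bits where MDCT energy is strong
        r_adapt = np.log2(1.0 + bfu_energy / bfu_energy.mean())
        r_adapt *= total_bits / r_adapt.sum()            # scale to the bit budget
        raw = alpha * r_adapt + (1.0 - alpha) * r_fixed
        # choose the constant offset beta so that the rounded allocation
        # stays within the fixed bit budget (simple linear search)
        for beta in np.arange(0.0, raw.max(), 0.05):
            bits = np.maximum(np.round(raw - beta), 0).astype(int)
            if bits.sum() <= total_bits:
                return bits
        return np.zeros_like(raw, dtype=int)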

Sony Dynamic Digital Sound (SDDS). In addition to providing near-CD quality on a MiniDisc medium, the ATRAC algorithm has also been deployed as the core of Sony's digital cinematic sound system, SDDS. SDDS integrates eight independent ATRAC modules to carry the program information for the left (L), left center (LC), center (C), right center (RC), right (R), subwoofer (SW), left surround (LS), and right surround (RS) channels typically present in a modern theater. SDDS data is recorded using optical black-and-white dot-matrix technology onto two thin strips along the right and left edges of the film, outside of the sprocket holes. Each edge contains four channels. There are 512 ATRAC bits per channel associated with each movie frame, and each optical data frame contains a matrix of 52 × 192 bits [Yama98]. SDDS data tracks do not interfere with or replace the existing analog sound tracks. Both Reed-Solomon error correction and redundant track information (delayed by 18 frames) are employed to make SDDS robust to bit errors introduced by run-length scratches, dust, splice points, and defocusing during playback or film printing. Analog program information is used as a backup in the event of uncorrectable digital errors.

10.6 LUCENT TECHNOLOGIES PAC, EPAC, AND MPAC

The pioneering research contributions on perceptual entropy [John88b], monophonic PXFM [John88a], stereophonic PXFM [John92a], and ASPEC [Bran91] not only strongly influenced the MPEG family architecture but also evolved at AT&T Bell Laboratories into the Perceptual Audio Coder (PAC). The PAC algorithm eventually became property of Lucent. AT&T, meanwhile, became active in MPEG-2 AAC research and standardization, and the low-complexity profile of AAC became the AT&T coding standard.

Like the MPEG coders, the Lucent PAC algorithm is flexible in that it supports monophonic, stereophonic, and multiple-channel modes. In fact, the bit stream definition will accommodate up to 16 front/side, 7 surround, and 7 auxiliary channel pairs, as well as 3 low-frequency effects (LFE, or subwoofer) channels. Depending upon the desired quality, PAC supports several bit rates. For a modest increase in complexity at a particular bit rate, improved output quality can be realized by enabling enhancements to the original system. For example, whereas 96 kb/s output was judged to be adequate with stereophonic PAC, near-CD quality was reported at 56–64 kb/s for stereophonic enhanced PAC [Sinh98a].

10.6.1 Perceptual Audio Coder (PAC)

The original PAC system described in [John96c] achieves very-high-quality coding of stereophonic inputs at 96 kb/s. Like the MPEG-1 layer III and ATRAC algorithms,


the PAC encoder (Figure 10.25) uses a signal-adaptive MDCT filter bank to analyze the input spectrum with appropriate frequency resolution. A long window of 2048 points (1024 subbands) is used during steady-state segments, or else a series of short 256-point windows (128 subbands) is applied for segments containing transients or sharp attacks. In contrast to MPEG-1 and ATRAC, however, PAC relies on the MDCT alone rather than a hybrid filter-bank structure, thus realizing a complexity reduction. As noted previously [Bran88a] [Mahi90], the MDCT lends itself to compact representation of stationary signals, and a 2048-point block size yields sufficiently high frequency resolution for most sources. This segment length was also associated with the maximum realizable coding gain as a function of block size [Sinh96]. Filter-bank resolution switching decisions are made on the basis of PE (high complexity) or signal energy (low complexity) criteria.

The PAC perceptual model derives noise masking thresholds from filter-bank output samples in a manner similar to MPEG-1 psychoacoustic model recommendation 2 [ISOI92] and the PE calculation in [John88b]. The PAC model, however, accounts explicitly for both simultaneous and temporal masking effects. Samples are grouped into 1/3-critical-band partitions, tonality is estimated in each band, and then time and frequency spreading functions are used to compute a masking threshold that can be related to the filter-bank outputs. One can observe that PAC realizes some complexity reduction relative to MPEG by avoiding parallel frequency analysis structures for quantization and perceptual modeling. The masking thresholds are used to select one of 128 exponentially distributed quantization step sizes in each of 49 or 14 coder bands (analogous to ATRAC BFUs) in high-resolution and low-resolution modes, respectively. The coder bands are quantized using an iterative rate-control loop in which thresholds are adjusted to satisfy simultaneously bit-rate constraints and an equal-loudness criterion that attempts to shape quantization noise such that its absolute loudness is constant relative to the masking threshold. The rate-control loop allows time-varying instantaneous bit rates, so that additional bits are available in times of peak demand, much like the bit reservoir of MPEG-1 layer III. Remaining statistical redundancies are removed from the stream of quantized spectral samples prior to bit-stream formatting using eight structured, multidimensional Huffman codebooks. These codebooks are applied to DPCM-encoded quantizer outputs. By clustering coder bands into sections and selecting only one codebook per section, the system minimizes the overhead.
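As an illustration of an exponentially distributed step-size set, the following sketch (Python) builds a hypothetical 128-entry table and selects the coarsest step whose quantization noise stays below a band's masking threshold. The smallest/largest step values and the step²/12 noise model are assumptions for illustration, not PAC parameters.

    import numpy as np

    def exponential_step_sizes(n_steps=128, smallest=1e-4, largest=1.0):
        """Hypothetical exponentially spaced quantizer step sizes (illustrative)."""
        ratio = (largest / smallest) ** (1.0 / (n_steps - 1))
        return smallest * ratio ** np.arange(n_steps)

    def choose_step(mask_threshold_energy, table):
        """Pick the coarsest step whose noise (~ step^2 / 12) stays below the
        band's masking threshold energy."""
        noise = table ** 2 / 12.0
        valid = np.where(noise <= mask_threshold_energy)[0]
        return table[valid[-1]] if len(valid) else table[0]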

Figure 10.25 Lucent Technologies Perceptual Audio Coder (PAC): the input s(n) passes through a 256/2048-point MDCT; a perceptual model drives quantization and Huffman coding to form the bitstream.


Figure 10.26 Lucent Technologies Enhanced Perceptual Audio Coder (EPAC): a transient/steady-state (TR/SS) switch selects between a 2048-point MDCT (steady state) and a wavelet filter bank (transients) for the input s(n); the perceptual model, quantization, and Huffman coding then form the bitstream.

10.6.2 Enhanced PAC (EPAC)

In an effort to enhance PAC output quality at low bit rates, Sinha and Johnston [Sinh96] introduced a novel signal-adaptive MDCT/WP1 switched filter-bank scheme that resulted in nearly transparent coding for CD-quality source material at 64 kb/s per stereo pair. EPAC (Figure 10.26) is unique in that it switches between two distinct filter banks rather than relying upon hybrid [Tsut98] [ISOI92] or nonuniform cascade [Prin95] structures.

A 2048-point MDCT decomposition is applied normally, during "stationary" periods. EPAC switches to a tree-structured wavelet packet (WP) decomposition matched to the auditory filter bank during sharp transients. Switching decisions occur every 25 ms, as in PAC, using either PE or energy criteria. The WP analysis offers the dual advantages of more compact signal representation during transient periods than the MDCT, as well as improved time resolution at high frequencies for accurate estimation of the time/frequency distribution of masking power contained in sharp attacks. In contrast to the uniform time-frequency tiling associated with MDCT window-length switching schemes (e.g., [ISOI92] [Bran94a]), the EPAC WP transform (tree-structured QMF bank) achieves a nonuniform time-frequency tiling. For a suitably designed analysis wavelet and tree structure, an improvement in time resolution is restricted to the high-frequency regions of interest, while good frequency resolution is maintained in the low-frequency subbands. The EPAC WP filter bank was specifically designed for time-localized impulse responses at high frequencies to minimize the temporal spread of quantization noise (pre-echo). Novel start and stop windows are inserted between analysis frames during switching intervals to mitigate boundary effects associated with the MDCT-to-WP and WP-to-MDCT transitions. Other than the enhanced filter bank, EPAC is identical to PAC. In subjective tests involving 12 expert and nonexpert listeners with difficult castanets and triangle test signals, EPAC outperformed PAC at 64 kb/s per stereo pair by an average of 0.4–0.6 on a five-point quality scale.

10.6.3 Multichannel PAC (MPAC)

Like the MPEG, AC-3, and SDDS systems, the PAC algorithm also extends its monophonic processing capabilities into stereophonic and multiple-channel
1 See Chapter 8, Sections 8.2 and 8.3, for descriptions of the wavelet filter bank and WP transforms.


modes. Stereophonic PAC computes individual masking thresholds for the left, right, mono, and stereo (L, R, M = L + R, and S = L − R) signals using a version of the monophonic perceptual model that has been modified to account for binaural-level masking differences (BLMD), or binaural unmasking effects [Moor77]. Then, monaural PAC methods encode either the signal pairs (L, R) or (M, S). In order to minimize the overall bit rate, however, an L,R/M,S switching procedure is embedded in the rate-control loop, such that different encoding modes (L,R or M,S) can be applied to the individual coder bands on the same analysis frame.

In the MPAC 5-channel configuration, composite coding modes are available for the front/side (left, center, and right) and the left and right surround (L, C, R, Ls, and Rs) channels. On each frame, the composite algorithm works as follows. First, appropriate window-switched filter-bank frequency resolution is determined separately for the front/side and surround channels. Next, the four signal pairs L/R, M/S, Ls/Rs, and Ms/Ss (Ms = Ls + Rs, Ss = Ls − Rs) are generated. The MPAC perceptual model then computes individual BLMD-compensated masking thresholds for the eight L/R and M/S signals, as well as the center channel, C. Once thresholds have been obtained, a two-step coding process is applied. In step 1, a minimum PE criterion is used to select either M/S or L/R coding for the front/side and surround channel groups in each coder band. Then, step 2 applies interchannel prediction to the quantized spectral samples. The prediction residuals are quantized such that the final quantization noise satisfies the original masking thresholds for each channel (L/R or M/S). The interchannel prediction schemes are summarized in [Sinh98a]. In pursuit of a minimum bit rate, the composite coding algorithm may elect to utilize either step 1 or step 2, both steps, or neither. Finally, the composite perceptual model computes a global masking threshold as the maximum of the five individual thresholds, minus a safety margin. This threshold is phased in gradually for joint coding when the bit reservoir drops below 20% [Sinh98a]. The safety margin magnitude depends upon the bit reservoir state. Composite modes are applied separately for each coder band on each analysis frame. In terms of performance, the MPAC system was found to produce the best quality at 320 kb/s for 5 channels during a recent ISO test of multichannel algorithms [ISOII94].

Applications. Both 128 and 160 kb/s stereophonic versions of PAC were considered for standardization in the U.S. Digital Audio Radio (DAR) project. In an effort to provide graceful degradation and extend broadcast range in the presence of the heavy fading associated with fringe reception areas, perceptually motivated unequal error protection (UEP) channel coding schemes were examined in [Sinh98b]. The proposed scheme ranks bit-stream elements into two classes of perceptual importance. Bit-stream parameters associated with center-channel information and certain mid-frequency subbands are given greater channel protection (class 1) than other parameters (class 2). Subjective tests revealed a strong preference for UEP over equal error protection (EEP), particularly when bit error rates (BER) exceeded 2 × 10⁻⁴. For network applications, acceptable PAC output quality at bit rates as low as 12–16 kb/s per channel, together with the availability of Java PAC decoder implementations, is


reportedly increasing PAC deployment among suppliers of Internet audio program material [Sinh98a]. MPAC has also been considered for cinematic and advanced television applications. Real-time PAC and EPAC decoder implementations have been demonstrated on 486-class PC platforms.

10.7 DOLBY AUDIO CODING STANDARDS

Since the late 1980s, Dolby Laboratories has been active in perceptual audio coding research and standardization, and Dolby researchers have made numerous scientific contributions within the collaborative framework of MPEG audio. On the commercial front, Dolby has developed the AC-2 and the AC-3 algorithms [Fiel91] [Fiel96].

10.7.1 Dolby AC-2, AC-2A

The AC-2 [Davi90] [Fiel91] is a family of single-channel algorithms operating at bit rates between 128 and 192 kb/s for 20-kHz bandwidth input sampled at 44.1 or 48 kHz. There are four available AC-2 variants, all of which share an architecture in which the input is mapped to the frequency domain by an evenly stacked TDAC filter bank [Prin86] with a novel parametric Kaiser-Bessel analysis window (Section 6.7) optimized for improved stop-band attenuation relative to the sine window. The evenly stacked TDAC differs from the oddly stacked MDCT in that the evenly stacked low-band filter is half-band and its magnitude response wraps around the fold-over frequency (see Chapter 6). A unique mantissa-exponent coding scheme is applied to the TDAC transform coefficients. First, sets of frequency-adjacent coefficients are grouped into blocks (subbands) of roughly critical bandwidth. For each block, the maximum is identified and then quantized as an exponent, in terms of the number of left shifts required until overflow occurs. The collection of exponents forms a stair-step spectral envelope having 6-dB (left shift = multiply by 2 = 6.02 dB) resolution, and normalizing the transform coefficients by the envelope generates a set of mantissas. The envelope approximates the short-time spectrum, and therefore a perceptual model uses the exponents to compute both a fixed and a signal-adaptive bit allocation for the mantissas on each frame.
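The exponent/mantissa (block floating-point) idea can be sketched as follows (Python). The sketch is illustrative only: full scale is assumed to be 1.0, the shift cap is arbitrary, and the function name is our own; the block peak determines the number of left shifts (the exponent), and normalizing by the resulting envelope yields the mantissas.

    import numpy as np

    def block_exponent_mantissas(coeffs, max_shifts=24):
        """Illustrative block floating-point coding of one critical-band block
        of transform coefficients (exponent + normalized mantissas)."""
        peak = np.max(np.abs(coeffs)) + 1e-30
        # exponent = number of left shifts (doublings, ~6.02 dB each) before
        # the block peak would overflow the assumed full scale of 1.0
        exponent = int(np.floor(-np.log2(peak)))
        exponent = min(max(exponent, 0), max_shifts)
        mantissas = coeffs * (2.0 ** exponent)   # normalize by the envelope
        return exponent, mantissas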

As far as details on the four AC-2 variants are concerned, two versions are designed for low-complexity, low-delay applications, and the other two for higher quality at the expense of increased delay or complexity. In version 1, a 128-sample (64-channel) filter bank is used, and the coder operates at 192 kb/s per channel, resulting in high-quality output with only 7-ms delay at 48 kHz. Version 2 is also for low-delay applications, with improved quality at the same bit rate; it uses the same filter bank but exploits time redundancy across block pairs, thus increasing delay to 12 ms. Version 3 achieves similar quality at the reduced rate of 128 kb/s per channel, at the expense of longer delay (45 ms), by using a 512-sample (256-channel) filter bank to improve steady-state coding gain. Finally, version 4 (the AC-2A [Davi92] algorithm) employs a switched 128/512-point TDAC filter bank to improve quality for transient signals while maintaining


high coding gain for stationary signals. A 320-sample bridge window preserves PR filter-bank properties during mode switching, and a transient detector consisting of an 8-kHz Chebyshev highpass filter is responsible for switching decisions. Order-of-magnitude peak-level increases between 64-sample sub-blocks at the filter output are interpreted as transient events. The Kaiser window parameters used for the KBD windows in each of the AC-2 algorithms appeared in [Fiel96]. For all four algorithms, the AC-2 encoder multiplexes spectral envelope and mantissa parameters into an output bitstream, along with some auxiliary information. Byte-wide Reed-Solomon ECC allows for correction of single-byte errors in the exponents at the expense of 1% overhead, resulting in good performance up to a BER of 0.001.

One AC-2 feature that is unique among the standards is that the perceptual model is backward adaptive, meaning that the bit allocation is not transmitted explicitly. Instead, the AC-2 decoder extracts the bit allocation from the quantized spectral envelope using the same perceptual model as the AC-2 encoder. This structure leads to a significant reduction of side information and induces a symmetric encoder/decoder complexity, which was well suited to the original AC-2 target application of single point-to-point audio transport. An example single point-to-point system now using low-delay AC-2 is the DolbyFAX, a full-duplex codec that carries simultaneously two channels in both directions over four ISDN "B" links for film and TV studio distance collaboration. Low-delay AC-2 codecs have also been installed on 950-MHz wireless digital studio-transmitter links (DSTL). The AC-2 moderate-delay and AC-2A algorithms have been used for both network and wireless broadcast applications, such as cable and direct broadcast satellite (DBS) television. The AC-2A is the predecessor to the now popular multichannel AC-3 algorithm. As the next section will show, the AC-3 coder has inherited and enhanced several facets of the AC-2/AC-2A architecture. In fact, the AC-2 encoder is nearly identical to (one channel of) the simplified AC-3 encoder shown in Figure 10.27, except that AC-2 does not transmit explicitly any perceptual model parameters.

Figure 10.27 Dolby Laboratories AC-3 encoder: the input s(n) feeds a transient detector and a 256/512-point MDCT; a spectral envelope (exponent) encoder drives the perceptual model and bit allocation for the mantissa quantizer, and the results are multiplexed (MUX) into the bitstream.


10.7.2 Dolby AC-3/Dolby Digital/Dolby SR·D

The 5.1-channel "surround" format that had become the de facto standard in most movie houses during the 1980s was becoming ubiquitous in the home theaters of the 1990s that were equipped with matrixed multichannel sound (e.g., Dolby ProLogic). As a result of this trend, it was clear that emerging applications for perceptual coding would eventually, minimally, require stereophonic or even multichannel surround-sound capabilities to gain consumer acceptance. Although single-channel algorithms such as the AC-2 can run on parallel independent channels, significantly better performance can be realized by treating multiple channels together in order to exploit interchannel redundancies and irrelevancies. The Dolby Laboratories AC-3 algorithm [Davis93] [Todd94] [Davi98], also known as "Dolby Digital" or "SR·D," was developed specifically for multichannel coding by refining all of the fundamental AC-2 blocks, including the filter bank, the spectral envelope encoding, the perceptual model, and the bit allocation. The coder carries 5.1 channels of audio (left, center, right, left surround, right surround, and a subwoofer), but at the same time it incorporates a flexible downmix strategy at the decoder to maintain compatibility with conventional monaural and stereophonic sound reproduction systems. The ".1" channel is usually reserved for low-frequency effects and is lowpass bandlimited below 120 Hz. The main features of the AC-3 algorithm are as follows:

• Sample rates: 32, 44.1, and 48 kHz
• Bit rates: 32–640 kb/s, variable
• High-quality output at 64 kb/s per channel
• Delay: roughly 100 ms
• MDCT filter bank (oddly stacked TDAC [Prin87]), KBD prototype window
• MDCT coefficients quantized and encoded in terms of exponents/mantissas
• Spectral envelope represented by exponents
• Signal-adaptive exponent strategy with time-varying time-frequency resolution
• Hybrid forward-backward adaptive perceptual model
• Parametric bit allocation
• Uniform quantization of mantissas according to signal-adaptive bit allocation
• Perceptual model improvements possible at the encoder without changing the decoder
• Multiple channels processed as an ensemble
• Frequency-selective intensity coding, as well as L,R/M,S
• Robust decoder downmix functionality from 5.1 to fewer channels
• Integral dynamic range control system
• Board-level real-time encoders available
• Chip-level real-time decoders available


The AC-3 works in the following way. A signal-adaptive MDCT filter bank with a customized KBD window (Section 6.7) maps the input to the frequency domain. Long windows are applied during steady-state segments, and a pair of short windows is used for transient segments. The MDCT coefficients are quantized and encoded by an exponent/mantissa scheme similar to AC-2. Bit allocation for the mantissas is performed according to a perceptual model that estimates the masked threshold from the quantized spectral envelope. Like AC-2, an identical perceptual model resides at both the encoder and the decoder to allow for backward-adaptive bit allocation on the basis of the spectral envelope, thus reducing the burden of side information on the bitstream. Unlike AC-2, however, the perceptual model is also forward adaptive in the sense that it is parametric. Model parameters can be changed at the encoder and the new parameters transmitted to the decoder in order to affect modified masked-threshold calculations. Particularly at lower bit rates, the perceptual bit allocation may yield insufficient bits to satisfy both the masked threshold and the rate constraint. When this happens, mid-side and intensity coding ("channel coupling," above 2 kHz) reduce the demand for bits by exploiting, respectively, interchannel redundancies and irrelevancies. Ultimately, exponents, mantissas, coupling data, and exponent strategy data are combined and transmitted to the receiver.

The remainder of this section provides details on the major functional blocks of the AC-3 algorithm, including the filter bank, exponent strategy, perceptual model, bit allocation, mantissa quantization, intensity coding, system-level functions, complexity, and applications and standardization activities.

10.7.2.1 Filter Bank. Although the high-level AC-3 structure (Figure 10.27) resembles that of AC-2, there are significant differences between the two algorithms. Like AC-2, the AC-3 algorithm first maps input samples to the frequency domain using a PR cosine-modulated filter bank with a novel KBD window (Section 6.7; parameters given in [Fiel96]). Unlike AC-2, however, AC-3 is based on the oddly stacked MDCT. The AC-3 also handles window switching differently than AC-2A. Long, 512-sample (93.75-Hz resolution at 48 kHz) windows are used to achieve reasonable coding gain during stationary segments. During transients, however, a pair of 256-sample windows replaces the long window to minimize pre-echoes. Also in contrast to the MPEG and AC-2 algorithms, the AC-3 MDCT filter bank retains PR properties during window switching without resorting to bridge windows, by introducing a suitable phase shift into the MDCT basis vectors (Chapter 6, Eqs. (6.38a) and (6.38b); see also [Shli97]) for one of the two short transforms. Whenever a scheme similar to the one used in AC-2A detects transients, short filter-bank windows may activate independently on any one or more of the 5.1 channels.

10.7.2.2 Exponent Strategy. The AC-3 algorithm uses a refined version of the AC-2 exponent/mantissa MDCT coefficient representation, resulting in a significantly improved coding gain. In AC-3, the MDCT coefficients corresponding to 1536 input samples (six transform blocks) are combined into a single frame.


Then, a frame-processing routine optimizes the exponent representation to exploit temporal redundancy, while at the same time representing the stair-step spectral envelope with adequate frequency resolution. In particular, spectral envelopes are formed from partitions of either one, two, or four consecutive MDCT coefficients on each of the six MDCT blocks in the frame. To exploit time redundancy, the six envelopes can be represented individually, or any or all of the six can be combined into temporal partitions. As in AC-2, the exponents correspond to the peak values of each time-frequency partition, and each exponent is represented with 6 dB of resolution by determining the number of left shifts until overflow. The overall exponent strategy is selected by evaluating spectral stability. Many strategies are possible. For example, all transform coefficients could be transmitted for stable spectra, but time updates might be restricted to 32-ms intervals, i.e., an envelope of single-coefficient partitions might be repeated five times to exploit temporal redundancy. On the other hand, partitions of two or four components might be encoded for transient signals, but the time partition might be smaller, e.g., updates could occur for every 5.3-ms MDCT block. Regardless of the particular strategy in use for a given frame, exponents are differentially encoded across frequency. Differential coding of exponents exploits knowledge of the filter-bank transition band characteristics, thus avoiding slope overload with only a five-level quantization strategy. The AC-3 exponent strategy exploits, in a signal-dependent fashion, the time- and frequency-domain redundancies that exist on a frame of MDCT coefficients.
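A simplified illustration of differential exponent coding across frequency is given below (Python). The five-level difference alphabet {−2, ..., +2} follows the text; the specific clip-and-track scheme shown is an assumption for illustration and does not reproduce the actual AC-3 exponent grouping modes.

    import numpy as np

    def diff_encode_exponents(exponents):
        """Simplified differential coding of exponents across frequency with a
        five-level difference alphabet, illustrating how slope overload is
        avoided with very few quantization levels."""
        exponents = np.asarray(exponents, dtype=int)
        first = int(exponents[0])                 # absolute value for the first bin
        diffs, prev = [], first
        for e in exponents[1:]:
            d = int(np.clip(e - prev, -2, 2))     # five-level difference
            diffs.append(d)
            prev = prev + d                       # track what the decoder reconstructs
        return first, diffs

    def diff_decode_exponents(first, diffs):
        out = [first]
        for d in diffs:
            out.append(out[-1] + d)
        return out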

10.7.2.3 Perceptual Model. A novel parametric forward-backward adaptive perceptual model estimates the masked threshold on each frame. The forward-adaptive component exists only at the encoder. Given a rate constraint, this block interacts with an iterative rate-control loop to determine the best set of perceptual model parameters. These parameters are passed to the backward-adaptive component, which estimates the masked threshold by applying the parameters from the forward-adaptive component to a calculation involving the quantized spectral envelope. Identical backward-adaptive model components are embedded in both the encoder and the decoder. Thus, model parameters are fixed at the encoder after several threshold calculations in an iterative rate-control process and then transmitted to the decoder. The decoder only needs to perform one threshold calculation, given the parameter values established at the encoder.

The backward-adaptive model component works as follows. First, the quantized spectral envelope exponents are clustered into 50 0.5-Bark-wide subbands. Then, a spreading function is applied (Figure 10.28a) that accounts only for the upward spread of masking. To compensate for filter-bank leakage at low frequencies, spreading is disabled below 200 Hz. Also, spreading is not enabled between 200 and 700 Hz for frequencies below the occurrence of the first significant masker. The absolute threshold of hearing is accounted for after the spreading function has been applied. Unlike other algorithms, AC-3 neglects the downward spread of masking, assumes that masking power is nonadditive, and makes no explicit assumptions about the relationship between tonality and the


skirt slopes on the spreading function. Instead, these characteristics are captured in a set of parameters that comprise the forward-adaptive model component. Masking threshold calculations at the decoder are controlled by a set of parameters transmitted by the encoder, creating flexibility for model improvements at the encoder, such that improved estimates of the masked threshold can be realized without modifying the embedded model at the decoder.

For example, a parametric (upwards-only) spreading function is defined (Figure 10.28a) in terms of two slopes, Si, and two level offsets, Li, for i ∈ [1, 2]. While the parameters S1 and L1 can be uniquely specified for each channel, the parameters S2 and L2 are applied to all channels. The parametric spreading function is advantageous in that it allows the perceptual model at the encoder

Figure 10.28 Dolby AC-3 parametric perceptual model: (a) prototype spreading function, shown as level (dB) versus frequency (Barks), with the masker followed by a two-segment skirt defined by slopes S1 and S2 and level offsets L1 and L2; (b) masker SMR level dependence, shown as SPL (dB) versus frequency (Barks), with SMRs of roughly 30, 20, and 15 dB at masker levels of 100, 60, and 40 dB SPL, respectively.


to account for tonality or dynamic masking patterns without the need to alter the decoder model. A range of values is available for each parameter. With units of dB per 1/2 Bark, the slopes S1 and S2 are defined to lie within the ranges −2.95 to −5.77 and −0.7 to −0.98, respectively. With units of dB SPL, the levels L1 and L2 are defined to lie within the ranges −6 to −48 and −49 to −63, respectively. Ultimately, there are 512 unique spreading function shapes to choose from. The acoustic-level dependence of masking thresholds is also modeled in AC-3. It is in general true that the signal-to-mask ratio (SMR) increases with increasing stimulus level (Figure 10.28b), i.e., the threshold moves closer to the stimulus as the stimulus intensity decreases. In the AC-3 parametric perceptual model, this phenomenon is captured by adding a positive bias to the masked thresholds when the spectral envelope is below a threshold level. Acoustic-level threshold biasing is applied on a band-by-band basis. The decision threshold for the biasing is one of the forward-adaptive parameters transmitted by the encoder. This function can also be disabled altogether. The parametric perceptual model also provides a convenient upgrade path in the form of a bit allocation delta parameter.
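The following sketch (Python) illustrates a two-slope, upward-only spreading computation on half-Bark bands in the spirit of Figure 10.28a. The slope/offset values, the breakpoint between the near and far skirt segments, and the max-based combination (reflecting the nonadditive masking assumption) are illustrative choices, not AC-3 parameters.

    import numpy as np

    def upward_spread_threshold(band_db, s1=-3.0, s2=-0.9, l1=-20.0, l2=-55.0,
                                near_bands=10):
        """Illustrative two-slope, upward-only spreading per half-Bark band.
        band_db: masker level per band (dB). Each masker is propagated upward
        with a fast slope s1 (offset l1 below the masker) for the first
        'near_bands' bands, then a slow slope s2 (offset l2). Values are
        examples only."""
        n = len(band_db)
        thr = np.full(n, -200.0)
        for j, level in enumerate(band_db):            # j = masker band index
            for k in range(j, n):                      # upward spread only
                near = (k - j) < near_bands
                slope = s1 if near else s2
                offset = l1 if near else l2
                spread = level + offset + slope * (k - j)
                thr[k] = max(thr[k], spread)           # nonadditive: take the max
        return thr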

It was envisioned that future, more sophisticated AC-3 encoders might run two perceptual models in parallel, one being the original reference model and the other being an enhanced model with more accurate estimates of the masked threshold. The delta parameter allows the encoder to transmit a stair-step function for which each tread specifies a masking-level adjustment for an integral number of 1/2-Bark bands. Thus, the masking model can be incrementally improved without alterations to the existing decoders. Other details on the hybrid backward-forward AC-3 perceptual model can be found in [Davi94].

10.7.2.4 Bit Allocation and Mantissa Quantization. A bit allocation is determined at the encoder for each frame of mantissas by an iterative procedure that adjusts the mantissa quantizers, the multichannel coding strategies (below), and the forward-adaptive model parameters to satisfy simultaneously the specified rate constraint and the masked threshold. Within the rate-control loop, threshold partitions are formed on the basis of a bit-allocation frequency resolution parameter, with coefficient partitions ranging in width between 94 and 375 Hz. In a manner similar to MPEG-1, quantizers are selected for the set of mantissas in each partition based on an SMR calculation. Sufficient bits are allocated to ensure that the SNR for the quantized mantissas is greater than or equal to the SMR. The quantization noise is thus rendered inaudible, below the masked threshold. Uniform quantizers are selected from a set of 15 having 0, 3, 5, 7, 11, and 15 levels symmetric about 0, and conventional 2's-complement quantizers having 32, 64, 128, 256, 512, 1024, 2048, 4096, 16384, or 65536 levels. Certain quantizer codewords are group-encoded to make more efficient usage of available bits. Dithering can be enabled optionally on individual channels for 0-bit mantissas. If the bit supply is insufficient to satisfy the masked threshold, then SNRs can be reduced in selected threshold partitions until the rate is satisfied, or intensity coding and M/S transformations are used in a frequency-selective fashion to reduce the bit demand. Two variable-rate methods are also available to


satisfy peak-rate demands. Within a frame of six MDCT coefficient blocks, bits can be distributed unevenly, such that the instantaneous bit rate is variable but the average rate is constant. In addition, bit rates are adjustable, and a unique rate can be specified for each frame of six MDCT blocks. Unlike some of the other standardized algorithms, the AC-3 does not include an explicit lossless coding stage for final redundancy reduction after quantization and encoding.

10.7.2.5 Multichannel Coding. When the bit demand imposed by multiple independent channels exceeds the bit budget, the AC-3 ensemble processing of 5.1 channels exploits interchannel redundancies and irrelevancies, respectively, by making frequency-selective use of mid-side (M/S) and intensity coding techniques. Although the M/S and intensity functions can be simultaneously active on a given channel, they are restricted to nonoverlapping subbands. The M/S scheme is carefully controlled [Davi98] to maintain compatibility between AC-3 and matrixed surround systems such as Dolby ProLogic. Intensity coding, also known as "channel coupling," is a multichannel irrelevancy reduction coding technique that exploits properties of spatial hearing. There is considerable experimental evidence [Blau74] suggesting that the interaural time difference of a signal's fine structure has negligible influence on sound localization above a certain frequency. Instead, the ear evaluates primarily energy envelopes. Thus, the idea behind intensity coding is to transmit only one envelope in place of two or more sufficiently correlated spectra from independent channels, together with some side information. The side information consists of a set of coefficients that is used to recover individual spectra from the intensity channel.

A simplified version of the AC-3 intensity coding scheme is illustrated in Figure 10.29. At the encoder (Figure 10.29a), two or more input spectra are added together to form a single intensity channel. Prior to the addition, an optional adjustment is applied to prevent phase cancellation. Then, groups of adjacent coefficients are partitioned into between 1 and 18 separate intensity subbands on both the individual and the intensity channels. A set of coupling coefficients, cij, is computed that expresses the fraction of energy contributed by the i-th individual channel to the j-th band of the intensity envelope, i.e., cij = βij/αj, where βij is the power contained in the j-th band of the i-th channel and αj is the power contained in the j-th band of the intensity channel. Finally, the intensity spectrum is quantized, encoded, and transmitted to the decoder. The coupling coefficients cij are transmitted as side information. Once the intensity channel has been recovered at the decoder (Figure 10.29b), the intensity subbands are scaled by the coupling coefficients cij in order to recover an appropriate fraction of intensity energy in the j-th band of the i-th channel. The intensity-coded coefficients are then combined with any remaining uncoupled transform coefficients and passed through the synthesis filter bank to reconstruct the individual channel. The AC-3 coupling coefficients have a dynamic range that spans −132 to +18 dB, with quantization step sizes between 0.28 and 0.53 dB. Intensity coding is applied in a frequency-selective manner, parameterized by a start frequency of 3.42 kHz or higher and a bandwidth expressed in multiples of 1.2 kHz for


Figure 10.29 Dolby AC-3 intensity coding ("channel coupling"): (a) encoder — each input channel (Ch 1, Ch 2) is phase-adjusted, band-split, and summed into the intensity channel; per-band power calculations on the individual channels (βij) and on the intensity channel (αj) are divided to yield the coupling coefficients cij.

a 48-kHz system [Davi98]. Note that, unlike the simplified system shown in the figure, the actual AC-3 intensity coding scheme may couple the spectra from as many as five channels.
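A toy version of the coupling computation is sketched below (Python). Band edges, channel count, and function names are illustrative; and because the scaling here is applied to coefficient amplitudes, the sketch uses the square root of the power ratio βij/αj so that band energy is approximately restored at the decoder, which is a simplification of the scheme described above.

    import numpy as np

    def intensity_couple(channels, band_edges):
        """Form an intensity (coupling) channel and per-band coupling
        coefficients from a list of per-channel coefficient arrays."""
        intensity = np.sum(channels, axis=0)              # combined channel
        coeffs = []
        for ch in channels:                               # channel i
            row = []
            for j in range(len(band_edges) - 1):          # intensity band j
                lo, hi = band_edges[j], band_edges[j + 1]
                beta = np.sum(ch[lo:hi] ** 2)              # power, channel i, band j
                alpha = np.sum(intensity[lo:hi] ** 2) + 1e-30
                row.append(np.sqrt(beta / alpha))          # amplitude-domain c_ij
            coeffs.append(row)
        return intensity, coeffs

    def intensity_decode(intensity, coeffs, band_edges):
        """Rescale the intensity subbands by c_ij to approximate each channel."""
        out = []
        for row in coeffs:
            ch = np.zeros_like(intensity)
            for j in range(len(band_edges) - 1):
                lo, hi = band_edges[j], band_edges[j + 1]
                ch[lo:hi] = row[j] * intensity[lo:hi]
            out.append(ch)
        return out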

10.7.2.6 System-Level Functions. At the system level, AC-3 provides mechanisms for channel down-mixing and dynamic range control. Down-mix capability is essential for the 5.1-channel system, since the majority of potential playback systems are still monaural or, at best, stereophonic. Down-mixing is performed at the decoder in the frequency domain rather than the time domain to reduce complexity. This is possible because of the filter-bank linearity. The bit stream carries some down-mix information, since different listening situations call for different down-mix weighting. Dialog-level normalization is also available at the decoder. Finally, the bit stream has available facilities to handle other control and ancillary user information, such as copyright, language, production, and time-code data [Davis94].

10.7.2.7 Complexity. Assuming the standard HDTV configuration of 384 kb/s with a 48-kHz sample rate, and implementation using the Zoran ZR38001 general-purpose DSP instruction set, the AC-3 decoder memory requirements and complexity


Figure 10.29 (continued) (b) decoder — the received intensity channel is band-split and each subband is scaled by the coupling coefficients cij, then combined with the uncoupled coefficients and passed through the synthesis filter banks to reconstruct Ch 1 and Ch 2.

are as follows: 6.6 kbytes RAM, 5.4 kbytes ROM, and 27.3 MIPS for 5.1 channels; and 3.1 kbytes RAM, 5.4 kbytes ROM, and 26.5 MIPS for 2 channels [Vern95]. Note that complexity estimates are processor-dependent. For example, on a Motorola DSP56002, 45 MIPS are required for a 5.1-channel decoder. Encoder complexity varies between two and five times decoder complexity, depending on the encoder sophistication [Vern95]. Numerous real-time encoder and decoder implementations have been reported. Early on, for example, a single-chip decoder was implemented on a Zoran DSP [Vern93]. More recently, a DP561 AC-3 encoder (5.1 channels, 44.1- or 48-kHz sample rate) for DVD mastering was implemented in real time on a PC host with a plug-in DSP subsystem. The computational requirements were handled by an Ariel PC-Hydra DSP array of eight Texas Instruments TMS320C44 floating-point DSP devices clocked at 50 MHz [Terr96]. Current information on real-time AC-3 implementations is also available online from Dolby Laboratories.


10.7.2.8 Applications and Standardization. The first popular AC-3 application was in the cinema. The "Dolby Digital" or "SR·D" AC-3 information is interleaved between sprocket holes on one side of the 35-mm film. The AC-3 was first deployed in only three theaters for the film Star Trek VI in 1991, after which the official rollout of Dolby SR·D occurred in 1992 with Batman Returns. By April 1997, over 900 film soundtracks had been AC-3 encoded. Nowadays, the AC-3 algorithm is finding use in digital versatile disc (DVD), cable television (CATV), and direct broadcast satellite (DBS). Many hi-fidelity amplifiers and receiver units now contain embedded AC-3 decoders and accept an AC-3 digital feed, rather than an analog feed, from external sources such as DVD.

In addition, the DP504/524 version of the DolbyFAX system (Section 10.7.1) has added AC-3 stereo and MPEG-1 layer II to the original AC-2-based system. Film, television, and music studios use DolbyFAX over ISDN links for automatic dialog replacement, music collaboration, sound effects delivery, and remote videotape audio playback. As far as standardization is concerned, the United States Advanced Television Systems Committee (ATSC) has adopted the AC-3 algorithm as the A/52 audio compression standard [USAT95b] and as the audio component of the A/53 Digital Television (DTV) Standard [USAT95a]. The United States Federal Communications Commission (U.S. FCC) in December 1996 adopted the ATSC standard for DTV, including the AC-3 audio component. On the international standardization front, the Digital Audio-Visual Council (DAVIC) selected AC-3 and MPEG-1 layer II for the audio component of the DAVIC 1.2 specification [DAVC96].

10.7.2.9 Recent Developments – The Dolby Digital Plus. The Dolby Digital Plus system, or enhanced AC-3 (E-AC-3) [Fiel04], was recently introduced to extend the capabilities of the Dolby AC-3 algorithm. While remaining backward compatible with the Dolby AC-3 standard, Dolby Digital Plus provides several enhancements. Some of the extensions include the flexibility to encode up to 13.1 channels and extended data rates up to 6.144 Mb/s. The AC-3 filter bank is supplemented with a second-stage DCT to exploit the stationary characteristics of the audio. Other coding tools include spectral extension, enhanced channel coupling, and transient pre-noise processing. The E-AC-3 is used in cable and satellite television set-top boxes and broadcast distribution transcoding devices. For a detailed description of Dolby Digital Plus, refer to [Fiel04].

10.8 AUDIO PROCESSING TECHNOLOGY – APT-x100

Without exception, all of the commercial and international audio coding standards described thus far couple explicit models of auditory perception with classical quantization techniques in an attempt to distribute quantization noise over the time-frequency plane such that it is imperceptible to the human listener. In addition to irrelevancy reduction, most of these algorithms simultaneously seek to reduce statistical redundancies. For the sake of comparison, and perhaps to better assess the impact of perceptual models on realizable coding gain, it is instructive


to next consider a commercially available audio coding algorithm that relies only upon redundancy removal, without any explicit regard for auditory perception.

We turn to the Audio Processing Technology APT-x100 algorithm, which has been reported to achieve nearly transparent coding of CD-quality, 44.1-kHz, 16-bit PCM input at a compression ratio of 4:1, or 176.4 kb/s per monaural channel [Wyli96b]. Like the ITU-T G.722 wideband speech codec [G722], the APT-x100 encoder (Figure 10.30) relies upon subband signal decomposition followed by independent ADPCM quantization of the decimated subband output sequences. Codewords from four uniform-bandwidth subbands are multiplexed onto the channel and sent to the decoder, where the ADPCM and filter-bank operations are inverted to generate an output. As shown in the figure, a tree-structured QMF filter bank splits the input signal into four subbands. The first and second filter stages have 64 and 32 taps, respectively. Backward-adaptive prediction is applied to the four subband output sequences. The resulting prediction residual is quantized with a backward-adaptive Laplacian quantizer. Backward adaptation in the prediction and quantization steps eliminates side information but increases sensitivity to fast transients. On the other hand, both prediction and adaptive quantization were found to significantly improve coding gain for a wide range of test signals [Wyli96b]. Adaptive quantization attempts to track signal dynamics and tends to produce constant SNR in each subband during stationary segments.
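A minimal backward-adaptive ADPCM loop for one subband is sketched below (Python) to illustrate why no side information is needed: both the predictor and the step size adapt only from quantities the decoder can reproduce. The sign-sign LMS update, the 4-bit quantizer, and the step-size multipliers are illustrative assumptions, not APT-x100 parameters.

    import numpy as np

    def adpcm_subband_encode(x, order=2, step=0.01):
        """Toy backward-adaptive ADPCM encoder for one subband sequence x."""
        w = np.zeros(order)              # predictor weights (backward adapted)
        hist = np.zeros(order)           # recent locally decoded samples
        codes = []
        for s in x:
            pred = np.dot(w, hist)
            e = s - pred                 # prediction residual
            q = int(np.clip(round(e / step), -8, 7))   # 4-bit quantizer (example)
            codes.append(q)
            e_hat = q * step
            y = pred + e_hat             # locally decoded sample
            # backward adaptation: expand the step on large codes, shrink otherwise
            step *= 1.2 if abs(q) > 4 else 0.98
            # sign-sign LMS update of the predictor, driven by decoded data only
            w += 0.01 * np.sign(e_hat) * np.sign(hist)
            hist = np.concatenate(([y], hist[:-1]))
        return codes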

Unlike the other algorithms reviewed in this document, APT-x100 contains no perceptual model or rate-control loop. The ADPCM output codewords are of fixed resolution (1 bit per sample), and therefore, with four subbands, the output bit rate is reduced 4:1. A comparison between APT-x100 quantization noise and

Figure 10.30 Audio Processing Technology APT-x100 encoder: a tree-structured QMF analysis bank (a 64-tap first stage followed by 32-tap second stages) splits s(n) into four subbands; each subband is encoded by backward-adaptive prediction (A1(z)...A4(z)) and adaptive quantization (Q, 1/Q), and the codewords are multiplexed (MUX) into the output bitstream.


Figure 10.31 DTS coherent acoustics (DTS-CA) encoding scheme: the input audio x(n) passes through a 32-channel PQMF analysis bank and subband ADPCM; psychoacoustic signal analysis supplies signal-to-mask ratios (SMRs) to the bit allocation and quantizers, and the data and side info are multiplexed to the channel.

noise masking thresholds, computed as in [John88a] for a variety of test signals from the SQAM test CD [SQAM88], revealed two trends in the APT-x100 noise floor. First, as expected, it is flat rather than shaped. Second, the noise is below the masking threshold in most critical bands for most stationary test signals, but tends to exceed the threshold in some critical bands for transient signals. In [Wyli96b], however, the fast step-size adaptation in APT-x100 is reported to exploit temporal masking effects and mitigate the audibility of unmasked quantization noise. While the lack of a perceptual model results in an inefficient, flat noise floor, it also affords some advantages, including reduced complexity, reduced frequency-resolution requirements, and a low delay of only 122 samples, or 2.77 ms at 44.1 kHz.

Several other relevant facts on APT-x100 quality and robustness were also reported in [Wyli96b]. Objective output quality was evaluated in terms of average subband SNRs, which were 30, 15, 10, and 7 dB, respectively, for the lowest to highest subbands, and the authors stated that the algorithm outperformed NICAM [NICAM] in an informal subjective comparison [Wyli96b]. APT-x100 was robust to both random bit errors and tandem encoding. Errors were inaudible for a bit error rate (BER) of 10⁻⁴, and speech remained intelligible for a BER of 10⁻¹. In one test, 10 stages of synchronous tandeming reduced output SNR from 45 dB to 37 dB. An auxiliary channel that accommodates up to 1/40 of the sample rate in buried data (e.g., 2.4 kb/s for 48-kHz stereo samples) by bit stealing from one of the subbands had a negligible effect on output quality. Finally, real-time APT-x100 encoder and decoder modules were implemented on a single AT&T DSP16A masked-ROM DSP. As far as applications are concerned, APT-x100 has been deployed in digital studio-transmitter links, audio storage products, and cinematic surround sound applications. A cursory performance comparison of the nonperceptual algorithms versus the perceptually based algorithms (e.g., NICAM or APT-x100 vs. MPEG or PAC, etc.) confirms that some awareness of peripheral auditory processing is necessary to achieve high-quality coding of CD-quality audio for compression ratios in excess of 4:1.


10.9 DTS – COHERENT ACOUSTICS

The performance comparison of the nonperceptual algorithms versus the perceptually based algorithms (e.g., APT-x100 vs. MPEG or PAC, etc.) given in the earlier section highlights that some awareness of peripheral auditory processing is necessary to achieve high-quality encoding of digital audio for compression ratios in excess of 4:1. To this end, DTS employs an audio compression algorithm based on the principles of "coherent acoustics encoding" [Smyt96] [Smyt99] [DTS]. In coherent acoustics, both ADPCM subband filtering and psychoacoustic analysis are employed to compress the audio data. The main emphasis in DTS is to improve the precision (and, hence, the quality) of the digital audio. The DTS encoding algorithm provides a resolution of up to 24 bits per sample and at the same time can deliver compression rates in the range of 3 to 40. Moreover, DTS can deliver up to eight discrete channels of multiplexed audio, at sampling frequencies of 8–192 kHz and at bit rates of 8–512 kb/s per channel. Table 10.9 summarizes the various bit rates, sampling frequencies, and bit resolutions employed in the four configurations supported by DTS coherent acoustics.

10.9.1 Framing and Subband Analysis

The DTS-CA encoding algorithm (Figure 10.31) operates on 24-bit linear PCM signals. The audio signals are typically analyzed in blocks (frames) of 1024 samples, although frame sizes of 256, 512, 2048, and 4096 samples are also supported, depending on the bit rates and sampling frequencies used (Table 10.10). For example, if operating at bit rates of 1024–2048 kb/s and sampling frequencies of 32, 44.1, or 48 kHz, then the maximum number of samples allowed per frame is 1024. Next, the segmented audio frames are decomposed into 32 critically subsampled subbands using a polyphase realization of a pseudo-QMF (PQMF) bank (Chapter 6). Two different PQMF filter-bank structures, namely, perfect reconstruction (PR) and nonperfect reconstruction (NPR), are provided in DTS coherent acoustics. In the example that we considered above, a frame size of 1024 samples results in 32 PCM samples per subband (i.e., 1024/32). The channels are equally spaced, such that a 32-kHz input signal is split into 500-Hz subbands (i.e., 16 kHz/32), with the subbands being decimated at the ratio 32:1.

Table 10.9 A list of encoding parameters used in DTS coherent acoustics (after [Smyt99])

Bit rates (kb/s/channel)   Sampling rates (kHz)   Bit resolution per sample
8–32                        24                      16
32–96                       48                      20
96–256                      96                      24
256–512                     192                     24


Table 10.10 Maximum frame sizes allowed in DTS-CA (after [Smyt99])

Sampling frequency set: fs = [8, 11.025, 12] (kHz)

Bit rates (kb/s)   fs          2fs         4fs         8fs         16fs
0–512              Max 1024    Max 2048    Max 4096    N/A         N/A
512–1024           N/A         Max 1024    Max 2048    N/A         N/A
1024–2048          N/A         N/A         Max 1024    Max 2048    N/A
2048–4096          N/A         N/A         N/A         Max 1024    Max 2048

10.9.2 Psychoacoustic Analysis

While the subband filtering stage minimizes the statistical dependencies associated with the input PCM signal, the psychoacoustic analysis stage eliminates the perceptually irrelevant information. Since we have already established the necessary background on psychoacoustic analysis in Chapter 5, we will not elaborate on these steps. However, we describe next the advantages of combining differential subband coding techniques (e.g., ADPCM) with psychoacoustic analysis.

10.9.3 ADPCM – Differential Subband Coding

A block diagram depicting the steps involved in the differential subband coding in DTS-CA is shown in Figure 10.32. A fourth-order forward linear prediction is performed on each subband containing 32 PCM samples. From the above example, we have 32 subbands and 32 PCM samples per subband. Recall that in LP we predict the current time-domain audio sample based on a linearly weighted combination of previous samples. From the LPC analysis corresponding to the i-th subband, we obtain the predictor coefficients ai,k, for k = 0, 1, ..., 4, and the residual error ei(n), for n = 0, 1, 2, ..., 31 samples. The predictor coefficients are usually vector quantized in the line spectral frequency (LSF) domain.

Two stages of ADPCM modules are provided in the DTS-CA algorithm, i.e., the ADPCM estimation stage and the real ADPCM stage. ADPCM utilizes the redundancy in the subband PCM audio by exploiting the correlation between adjacent samples. First, the "estimation ADPCM" module is used to determine the degree of prediction achieved by the fourth-order linear prediction filter (Figure 10.32). Depending upon the statistical features of the audio, a decision to enable or disable the second, "real ADPCM" stage is made.

A predictor mode flag, "PMODE" = 1 or 0, is set to indicate whether the "real ADPCM" module is active or not, respectively:

s_i,pred(n) = Σ_{k=0}^{4} a_{i,k} · s_i(n − k),  for n = 0, 1, ..., 31    (10.11)

Figure 10.32 Differential subband (ADPCM) coding in the DTS-CA encoder: the input buffer of subband PCM audio (32 subbands) feeds linear prediction analysis, transient analysis, and scale-factor computation in a first-stage ("estimation") ADPCM; based on the decision whether to disable the second-stage ADPCM, the PMODE and TMODE flags, quantized predictor coefficients, scale-factors, and differential subband audio from the second-stage ("real") ADPCM are multiplexed together with the side information.


ei(n) = si(n) − si,pred(n) = si(n) − Σ_{k=0}^{4} ai,k si(n − k)    (10.12)
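As a rough illustration of Eqs. (10.11) and (10.12), the sketch below runs a fourth-order forward LP analysis (Levinson-Durbin) on one 32-sample subband and forms the residual; the 3-dB prediction-gain threshold used for the PMODE-style decision is an assumption, not the DTS-CA rule, and the usual convention of predicting only from past samples is adopted.

import numpy as np

def lpc_coeffs(s, order=4):
    """Levinson-Durbin solution of the autocorrelation normal equations."""
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12
    for m in range(1, order + 1):
        k = -(r[m] + np.dot(a[1:m], r[m - 1:0:-1])) / err
        a[1:m + 1] += k * np.concatenate((a[m - 1:0:-1], [1.0]))
        err *= (1.0 - k * k)
    return a                                  # A(z) = 1 + a1 z^-1 + ... + a4 z^-4

def subband_residual(s, order=4, gain_threshold_db=3.0):
    """Residual e(n) = s(n) - s_pred(n) and an illustrative PMODE decision."""
    a = lpc_coeffs(s, order)
    e = np.convolve(s, a)[:len(s)]            # filtering by A(z) yields the residual
    gain_db = 10 * np.log10((np.var(s) + 1e-12) / (np.var(e) + 1e-12))
    pmode = 1 if gain_db > gain_threshold_db else 0
    return e, pmode

subband = np.random.randn(32)                 # 32 PCM samples of one subband
residual, pmode = subband_residual(subband)
print(pmode, residual[:4])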

While the "prediction analysis" block computes the PMODE flag based on the prediction gain, the "transient analysis" module monitors the transient behavior of the error residual. In particular, when a signal with a sharp attack (i.e., rapid transitions) begins near the end of a transform block and immediately follows a region of low energy, pre-echoes occur. Several pre-echo control strategies have been developed (Chapter 6, Section 6.10). These include window switching, gain modification, switched filter banks including the bit reservoir, and temporal noise shaping. In DTS-CA, the pre-echo artifacts are controlled by dividing the subband analysis buffer into four sub-buffers. A transient mode flag, "TMODE" = 0, 1, 2, or 3, is set to denote the beginning of a transient signal in sub-buffers 1, 2, 3, or 4, respectively. In addition, two scale factors are computed for each subband (i.e., before and after the transition), based on the peak magnitude of the residual error ei(n). A 64-level nonuniform quantizer is usually employed to encode the scale factors in DTS-CA.
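A crude stand-in for this transient analysis is sketched below: the 32-sample residual is split into four sub-buffers, an energy jump marks the sub-buffer where the attack begins (TMODE), and two peak-magnitude scale factors are taken before and after that point. The jump threshold is an arbitrary illustrative choice; the actual DTS-CA detector is not specified here.

import numpy as np

def transient_side_info(e, n_sub=4, jump=8.0):
    """Return (TMODE, scale factor before, scale factor after) for one subband residual."""
    e = np.asarray(e, dtype=float)
    subs = np.array_split(e, n_sub)
    energies = [np.mean(s ** 2) for s in subs]
    tmode = 0
    for j in range(1, n_sub):                      # first sub-buffer with a large energy jump
        if energies[j] > jump * max(energies[j - 1], 1e-12):
            tmode = j
            break
    split = sum(len(s) for s in subs[:tmode])      # sample index where the transient starts
    sf_before = np.max(np.abs(e[:split])) if split > 0 else np.max(np.abs(e))
    sf_after = np.max(np.abs(e[split:]))
    return tmode, sf_before, sf_after

e = np.concatenate((0.01 * np.random.randn(24), np.random.randn(8)))   # attack in the last sub-buffer
print(transient_side_info(e))                                          # TMODE = 3 for this example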

Note that the PMODE flag is Boolean and the TMODE flag takes four values. Therefore, a total of 15 bits (i.e., 12 bits for the two scale factors, 1 bit for the "PMODE" flag, and 2 bits for the "TMODE" flag) is sufficient to encode the entire side information in the DTS-CA algorithm. Next, based on the predictor mode flag (1 or 0), the second-stage ADPCM is used to encode the differential subband PCM audio, as shown in Figure 10.32. The optimum number of bits (in the sense of minimizing the quantization noise) required to encode the differential audio in each subband is estimated using a bit allocation procedure.

10.9.4 Bit Allocation, Quantization, and Multiplexing

Bit allocation is determined at the encoder for each frame (32 subbands) by an iterative procedure that adjusts the scale-factor quantizers, the fourth-order linear predictive model parameters, and the quantization levels of the differential subband audio. This is done in order to satisfy simultaneously the specified rate constraint and the masked threshold. In a manner similar to MPEG-1, quantizers are selected in each subband based on an SMR calculation. A sufficient number of bits is allocated to ensure that the SNR is greater than or equal to the SMR. The quantization noise is thus rendered inaudible, i.e., below the masked threshold. Recall that the main emphasis in DTS-CA is to improve precision, and hence the quality of the digital audio, while giving relatively less importance to minimizing the data rate. Therefore, the DTS-CA bit reservoir will almost always meet the bit demand imposed by the psychoacoustic model. Similar to some of the other standardized algorithms (e.g., MPEG codecs, lossless audio coders), DTS-CA includes an explicit lossless coding stage for final redundancy reduction after quantization and encoding. A data multiplexer merely packs the differential subband data, the side information, the synchronization details,


and the header syntax into a serial bitstream. Details on the structure of the "output data frame" employed in the DTS-CA algorithm are given in [Smyt99] [DTS].
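The following greedy loop illustrates the flavor of an SMR-driven allocation: bits are handed to the subband whose quantizer SNR falls furthest below its SMR, until either the budget is spent or all quantization noise is masked. The 6-dB-per-bit rule and the greedy strategy are textbook simplifications, not the actual DTS-CA procedure.

import numpy as np

def allocate_bits(smr_db, total_bits, max_bits=24):
    """Iteratively assign bits so that each subband's SNR approaches its SMR."""
    bits = np.zeros(len(smr_db), dtype=int)
    for _ in range(total_bits):
        snr_db = 6.02 * bits                                   # ~6 dB of SNR per allocated bit
        deficit = np.where(bits < max_bits, smr_db - snr_db, -np.inf)
        k = int(np.argmax(deficit))
        if deficit[k] <= 0:                                    # quantization noise already masked
            break
        bits[k] += 1
    return bits

smr = np.random.uniform(-6.0, 30.0, 32)    # SMR (dB) for the 32 subbands of one frame
print(allocate_bits(smr, total_bits=256))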

As an extension to the current coherent acoustics algorithm, Fejzo et al. proposed an enhancement that delivers 96-kHz, 24-bit resolution audio quality [Fejz00]. The proposed enhancement makes use of both "core" and "extension" data to reproduce 96-kHz audio bitstreams. Details on the real-time implementation of the 5.1-channel decoder on a 32-bit floating-point processor are also presented in [Fejz00]. Although much work has been done in the area of encoder/decoder architectures for the DTS-CA codecs, relatively little has been published [Mesa99].

10.9.5 DTS-CA Versus Dolby Digital

The DTS-coherent acoustics and the Dolby AC-3 algorithms were the two competing standards during the mid-1990s. While the former employs adaptive differential linear prediction (ADPCM-subband coding) in conjunction with a perceptual model, the latter employs a unique exponent/mantissa MDCT coefficient encoding technique in conjunction with a parametric forward-backward adaptive perceptual model.

PROBLEMS

10.1 List some of the primary differences between the DTS, the Dolby Digital, and the Sony ATRAC encoding schemes.

10.2 Using a block diagram, describe how the ISO/IEC MPEG-1 layer I codec is different from the ISO/IEC MPEG-1 layer III algorithm.

10.3 What are the enhancements integrated into MPEG-2 AAC relative to the MP3 algorithm? State the key differences between the algorithms.

10.4 List some of the distinguishing features of the MP4 audio format over the MP3 format. Give bit rates and cite references.

10.5 What is the main idea behind scalable audio coding? Explain using a block diagram. Give examples.

10.6 What are structured audio coding and parametric audio coding?

10.7 How is the ISO/IEC MPEG-7 standard different from the other MPEG standards?

COMPUTER EXERCISE

10.8 The audio files Ch10aud2L.wav, Ch10aud2R.wav, Ch10aud2C.wav, Ch10aud2Ls.wav, and Ch10aud2Rs.wav correspond to the left, right, center, left-surround, and right-surround channels, respectively, of a 3/2-channel configuration. Using the matrixing technique, obtain a stereo output.

CHAPTER 11

LOSSLESS AUDIO CODING AND DIGITAL WATERMARKING

11.1 INTRODUCTION

The emergence of high-end storage formats such as the DVD-audio and the super-audio CD (SACD) created opportunities for lossless audio coding schemes. Lossless audio coding techniques yield high-quality audio without any artifacts. Lossy audio coding (LAC) typically results in compression ratios of 10:1–25:1, while lossless audio coding (L2AC) algorithms achieve compression ratios of 2:1–4:1. Therefore, L2AC techniques are not typically used in real-time storage/multimedia processing and Internet streaming, but can be of use in storage-rich formats. Several lossless audio coding algorithms, including SHORTEN [Robi94], the DVD algorithm [Crav96] [Oome96] [Crav97], MUSICompress [Wege97], AudioPaK [Hans98a] [Hans98b] [Hans01], the lossless transform coding of audio (LTAC) [Pura97], and the IntMDCT [Geig01] [Geig02], have been proposed. The meridian lossless packing (MLP) [Gerz99] and the direct-stream digital (DSD) techniques [Brue97] form a group of high-end lossless compression algorithms that have already become standards in the DVD-Audio [DVD01] and the SACD [SACD02] storage formats, respectively. In Figure 11.1, we provide a taxonomy of the various lossless audio coding algorithms.

This chapter is organized as follows. In Section 11.2, we describe the principles of lossless audio coding. A survey of various lossless audio coding algorithms is given in Section 11.2.2. This is followed by more detailed descriptions of the DVD-audio (Section 11.3) and the SACD (Section 11.4) formats.




Figure 11.1 A list of lossless audio coding (L2AC) algorithms, grouped into prediction-based techniques (SHORTEN [Robi94]; the DVD algorithm [Crav96] [Crav97]; MUSICompress [Wege97]; AudioPaK [Hans98a] [Hans98b]; the direct stream digital (DSD) technique [Jans03] [Reef01a] [SACD02] [SACD03]; meridian lossless packing (MLP) [Gerz99] [Kuzu03] [DVD01]) and transform-based techniques (lossless transform coding (LTAC) [Pura97]; IntMDCT [Geig01] [Geig02]), together with C-LPAC [Qiu01] and MPEG-4 lossless audio coding [Mori02a] [Sche01] [Vasi02].

Section 11.5 addresses digital audio watermarking. Finally, in Section 11.6, we present commercial applications and recent developments in audio coding standardization.

11.2 LOSSLESS AUDIO CODING (L2AC)

LAC schemes employ psychoacoustic principles in conjunction with time-frequency mapping techniques to eliminate perceptually irrelevant information. Hence, the encoded signal is not an exact replica of the input audio and might contain some artifacts. Lossless coding schemes (L2AC) obtain bit-exact compression by eliminating the statistical dependencies associated with the signal via prediction techniques. Several prediction modeling methods for L2AC have been proposed, namely, FIR/IIR prediction [Robi94] [Crav96] [Crav97], polynomial approximation/curve fitting methods [Robi94] [Hans98b], transform coding [Pura97] [Geig02], high-order context modeling [Qiu01], the backward adaptive


prediction method [Angu97], subband linear prediction [Giur98], and set partitioning [Raad02]. L2AC algorithms are broadly classified into two categories, namely, prediction-based and transform-based.

Prediction-based L2AC algorithms employ linear predictive (LP) modeling or some form of polynomial approximation to remove redundancies in the waveform. SHORTEN, the DVD algorithm, MUSICompress, and C-LPAC use LP analysis, while AudioPaK is based on curve-fitting methods. Transform-based coding schemes employ specific transforms to decorrelate samples within a frame. The LTAC scheme proposed by Purat et al. and the IntMDCT scheme introduced by Geiger et al. are two L2AC schemes that use transform coding.

In all the aforementioned L2AC methods, the idea is to obtain a close approximation to the input audio and compute a low-variance prediction residual that can be coded efficiently. Typically, the residual error is entropy coded using one of the following methods: Huffman coding, Lempel-Ziv coding, run-length coding, arithmetic coding, or Rice coding. Table 11.1 lists the various L2AC coders and their associated prediction and entropy coding methods.

Before getting into the details of lossless coding of digital audio, let us take a look at some of the lossless data compression algorithms, i.e., PkZip, WinZip, WinRAR, gzip, and TAR. Although these compression algorithms achieve 40–60% bit rate reductions with data and text files, they achieve a mere 10% in the case of audio. [Hans01] lists an interesting example that illustrates the following: a PostScript file of 4.56 MB can be compressed with PkZip to 696 KB, achieving a bit rate reduction of about 85%, whereas an audio file of 33.72 MB can be compressed only to 31.57 MB, a bit-rate reduction of about 6.5%. In designing and evaluating L2AC algorithms, one must address two key issues, i.e., the trade-off between the compression ratio and the average residual energy per frame, and the selection of the entropy coding method.

11.2.1 L2AC Principles

A general block diagram of an L2AC algorithm is depicted in Figure 11.2. First, a conventional lossy audio coding scheme (e.g., MPEG-1 layer III) is used to produce a compressed bitstream.

Table 11.1 Prediction and entropy coding schemes in some of the L2AC schemes

L2AC scheme      Prediction model                           Entropy coding used
SHORTEN          FIR prediction filter                      Rice coding
DVD              IIR prediction                             Huffman coding
MUSICompress     Adaptively varied approximations           Huffman coding
AudioPaK         Curve fitting/polynomial approximations    Golomb coding
LTAC             Orthonormal transform coding               Rice coding
IntMDCT          Integer transform coding                   Huffman coding
C-LPAC           FIR prediction                             High-order context modeling


Figure 11.2 Lossless audio coding scheme based on the LAC method. In the approximator module, the audio input s(n) is processed by a lossy compression scheme and a local decoder to give the lossy compressed audio x(n); the residual error e(n) is multiplexed with x(n), sent over the channel (or other medium), demultiplexed, and decoded so that the original audio s(n) is recovered exactly.

Then, an error signal is calculated by subtracting the reconstructed signal from the input signal. Reconstruction involves decoding the compressed bitstream locally at the encoder. The residual is encoded using entropy coding methods and transmitted along with the encoded (lossy) audio bitstream. The transmission of the coding error enables the decoder to obtain an exact replica of the input audio, resulting in a lossless scheme. An alternative to this scheme has also been proposed, where, instead of using a lossy audio compression scheme, a predictor is used to approximate the input audio signal.
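The residual-transmission idea of Figure 11.2 can be captured in a few lines. The "lossy codec" below is only a stand-in (it keeps the most significant bits of each sample); what matters is that the encoder and decoder derive the identical lossy reconstruction, so adding the residual restores the input bit-exactly.

import numpy as np

def lossy_codec(s, dropped_bits=4):
    """Stand-in for a lossy coder/decoder pair: keep only the most significant bits."""
    return (s >> dropped_bits) << dropped_bits

def l2ac_encode(s):
    x = lossy_codec(s)            # what the lossy decoder would output
    e = s - x                     # residual, entropy coded in a real system
    return x, e

def l2ac_decode(x, e):
    return x + e                  # exact reconstruction of the original samples

s = np.random.randint(-2**15, 2**15, size=1024)      # 16-bit PCM-like input
x, e = l2ac_encode(s)
assert np.array_equal(l2ac_decode(x, e), s)           # lossless by construction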

11.2.2 L2AC Algorithms

11.2.2.1 The SHORTEN Algorithm Framing, prediction, and residual coding are the three primary stages associated with the SHORTEN algorithm. The algorithm operates on 16-bit linear PCM signals sampled at 16 kHz. The audio signal is divided into blocks of 16 ms, corresponding to 256 samples. An L-th order LP analysis is performed using the Levinson-Durbin algorithm. The LP analysis window (usually rectangular) is typically L samples longer than the frame size, i.e., N = (256 + L). Typically, a 10th-order (L = 10) LP filter is employed.

A prediction technique based on curve fitting and polynomial approximations has also been proposed in [Robi94] (Figure 11.3).


Figure 11.3 Polynomial approximation of x(n) used in the SHORTEN [Robi94] and in the AudioPaK [Hans98a] algorithms. The predicted samples x̂0(n) = 0, x̂1(n) = x(n − 1), x̂2(n) = 2x(n − 1) − x(n − 2), and x̂3(n) = 3x(n − 1) − 3x(n − 2) + x(n − 3) are formed from the input audio samples x(n − 1), x(n − 2), and x(n − 3).

This is a simplified form of a linear predictor that uses a selection criterion, fitting an L-th order polynomial to the last L data points. For instance, four polynomials, i.e., i = 0, 1, 2, 3, are formed as follows:

x̂0(n) = 0
x̂1(n) = x(n − 1)
x̂2(n) = 2x(n − 1) − x(n − 2)
x̂3(n) = 3x(n − 1) − 3x(n − 2) + x(n − 3)    (11.1)

Note that the FIR predictor coefficients in the above equation are integers. The residual ei(n) is computed by subtracting the corresponding polynomial estimate from the actual signal, i.e.,

ei(n) = x(n) − x̂i(n)    (11.2)

From (11.1) and (11.2), we obtain

e0(n) = x(n)    (11.3)


Similarly e1(n) is given by

e1(n) = x(n) − x̂1(n)    (11.4)

Finally, from (11.1) and (11.3), we get

e1(n) = e0(n) − e0(n − 1)    (11.5)

The above equations lead to an efficient recursive algorithm to compute the polynomial prediction residuals, i.e.,

e0(n) = x(n)
e1(n) = e0(n) − e0(n − 1)
e2(n) = e1(n) − e1(n − 1)
e3(n) = e2(n) − e2(n − 1)    (11.6)

Next, the residual energy Ei corresponding to each of the four residuals is computed:

Ei = Σn ei²(n),   for i = 0, 1, 2, 3    (11.7)

The polynomial that results in the smallest residual energy is considered the best approximation for that particular frame. It should be noted that prediction based on polynomial approximations does not always result in the maximum data-rate reduction, often due to poor estimation of the signal statistics. On the other hand, FIR predictors with real-valued coefficients are more versatile than integer predictors. Based on experiments conducted by Robinson [Robi94], the residual e(n) exhibits a Laplacian distribution. Usually, a Huffman coder that best fits the Laplacian distribution is employed for residual packing. Note that these codes are also called Rice codes in the entropy coding literature. An error statistic of 0.004 bits/sample has been reported in [Robi94].
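A minimal sketch of this order-selection step follows: the residuals e0 through e3 are obtained by repeated differencing as in Eq. (11.6), and the order with the smallest energy (Eq. (11.7)) is kept; the chosen residual would then be Rice coded. Frame-boundary handling is simplified here (the first samples are seeded from the frame itself rather than from the previous frame).

import numpy as np

def polynomial_residuals(x):
    """Residuals e0..e3 via the recursion of Eq. (11.6)."""
    e = [np.asarray(x, dtype=np.int64)]
    for _ in range(3):
        prev = e[-1]
        e.append(np.concatenate(([prev[0]], np.diff(prev))))   # e_{i+1}(n) = e_i(n) - e_i(n-1)
    return e

def select_order(x):
    """Pick the polynomial order whose residual energy, Eq. (11.7), is smallest."""
    residuals = polynomial_residuals(x)
    energies = [float(np.sum(r.astype(np.float64) ** 2)) for r in residuals]
    best = int(np.argmin(energies))
    return best, residuals[best]

frame = np.cumsum(np.random.randint(-3, 4, size=256))   # slowly varying 256-sample frame
order, residual = select_order(frame)
print("selected polynomial order:", order)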

11.2.2.2 The AudioPaK Coder The AudioPaK algorithm, developed by Hans et al. [Hans98a] [Hans98b], uses linear prediction with integer coefficients and Golomb-Rice coding for packing the error. In particular, the AudioPaK coder employs an adaptive polynomial approximation method in order to eliminate the intrachannel dependencies. In addition to the intrachannel decorrelation, the AudioPaK algorithm also allows interchannel decorrelation in the case of the stereo and dual modes.¹ Interchannel decorrelation in these modes can be achieved as follows. First, the residuals associated with the left and right channels

¹ Recall that in the stereo mode, two audio channels that form a stereo pair (left and right) are encoded within one bitstream. On the other hand, in the dual mode, two audio channels with independent program contents, e.g., bilingual, are encoded within one bitstream. However, the encoding process is the same for both modes [ISOI92].


are computed, followed by the differences between the left- and right-channel residuals. An average residual energy Ei corresponding to each of the four right-channel residuals and E′i corresponding to the four difference residuals are computed (similar to Eq. (11.7)). The residual that results in the smallest residual energy among E0, E1, E2, E3, E′0, E′1, E′2, and E′3 is considered the best approximation for that particular frame. Hans and Schafer reported small improvements in the compression rates when the interchannel decorrelation is included [Hans01].

Golomb Coding The entropy coding techniques employed in AudioPaK are inspired by the Golomb codes used in lossless image compression [Wein96] [ISOJ97]. First, the frames are classified as silent (constant) or time varying. In the case of silent frames, the residuals e0(n) and e1(n) are used as the predictor polynomials; in particular, residual e0(n) is used to encode silent frames made of zeros, and residual e1(n) is used to encode silent frames made of a constant value. For frames that are classified as time varying, Golomb coding is employed as follows. Golomb codes are optimal for exponentially decaying probability distributions of positive integers. Therefore, a mapping process is required to reorder the negative integers to positive values. This is done as follows:

e′i(n) = { 2ei(n),          if ei(n) ≥ 0
        { 2|ei(n)| − 1,    otherwise    (11.8)

where ei(n) is the residual corresponding to the i-th polynomial. Note that the Golomb codes are prefix codes that can be characterized by a unique parameter m. An integer I can be encoded using a Golomb code as follows. If m = 2^k, the codeword for I consists of the k LSBs of I, followed by the number formed by the remaining MSBs of I in unary representation, and a stop bit. Therefore,

the length of the code is k + ⌊I/2^k⌋ + 1. For example, if I = 69 and m = 16, i.e., k = 4: Part 1: the 4 LSBs of [1000101] = 0101; Part 2: unary(100) = unary(4) = 1110; stop bit = 0; Golomb code = [010111100]; length of the code = 9. Usually, the value of k is constant over an entire frame and is computed based on the expectation E(|ei(n)|) as follows:

k = ⌈log2(E(|ei(n)|))⌉    (11.9)

The lower and upper bounds of k are 0 and 15, respectively, for a 16-bit audio input.
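The sketch below implements the sign mapping of Eq. (11.8) and a Rice code (Golomb code with m = 2^k): k LSBs followed by the remaining MSBs in unary and a stop bit. It reproduces the code length k + ⌊I/2^k⌋ + 1 quoted above; the exact bit ordering of the unary part is a convention and may differ slightly from the worked example in the text.

def zigzag_map(e):
    """Map a signed residual to a non-negative integer, Eq. (11.8)."""
    return 2 * e if e >= 0 else 2 * abs(e) - 1

def rice_encode(value, k):
    """Golomb-Rice codeword for a non-negative integer with parameter m = 2**k."""
    lsb = format(value & ((1 << k) - 1), "0{}b".format(k)) if k > 0 else ""
    quotient = value >> k                 # number formed by the remaining MSBs
    return lsb + "1" * quotient + "0"     # unary part terminated by the stop bit

code = rice_encode(69, 4)                 # the I = 69, m = 16 case discussed above
print(code, len(code))                    # 9 bits, i.e., k + (69 >> 4) + 1

print(rice_encode(zigzag_map(-5), 2))     # signed residuals are mapped before coding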

11.2.2.3 The DVD Algorithm Craven et al. proposed an L2AC scheme based on IIR prediction filters [Crav97]. This was later adopted in the DVD standard, AUDIO SPECIFICATIONS Part 4 [DVD99]. An important insight into FIR/IIR prediction techniques for lossless compression is given in [Crav96]. FIR prediction fails to provide satisfactory compression when dealing with signals


exhibiting wide dynamic ranges (>60 dB). Usually, these wider dynamic ranges in the spectrum of the input audio are more likely when the sampling frequencies are in excess of 96 kHz [ARA95]. To this end, Craven et al. pointed out that IIR filter prediction techniques [Crav96] [Crav97] perform well for these cases (Figure 11.4).

Figure 11.4 (a) A simulation example depicting the advantages of employing IIR predictors over FIR predictors in L2AC schemes; the spectral level (dB) versus frequency (kHz) is shown for the input audio, the FIR residual for Np = 10, and the IIR residual for Na = 5, Nb = 3. Note that the FIR predictor fails to flatten the drop in the input spectrum above 20 kHz, while a third-order IIR predictor is able to do this efficiently. (b) Bit resolution as a function of frequency; the shaded area represents the total information content of a 96-kHz sampled signal (after [Crav96]).


The DVD algorithm [Crav96] works as follows. First, the input audio is divided into frames whose lengths are integer multiples of 384. A third-order (Nb = Na = 3) IIR predictor with fine coefficient quantization was proposed. Figure 11.4(a) illustrates the advantages of employing an IIR predictor. Note that a 10th-order FIR predictor fails to flatten the drop (~50 dB above 20 kHz) in the input spectrum; on the other hand, a third-order IIR predictor is able to do this efficiently. IIR prediction schemes provide superior performance over FIR prediction techniques in cases where control of both the average and the peak data rates is equally important. Moreover, IIR predictors are particularly effective when large portions of the spectrum are left unoccupied, i.e., if the filter roll-off is much less than the sampling rate, fc ≪ Fs, as shown in Figure 11.4(b). Note that the area under the curve in Figure 11.4(b) represents the total information content, with fc ≈ 30 kHz and Fs = 96 kHz. A reduction of 1 bit per sample per channel in the data rate was reported in [Crav96] when the order of the denominator was incremented (Nb = 3, Na = 4). In other words, a performance improvement of 6 dB can be realized by simply using an extra order in the denominator of the predictor transfer function. A special quantization structure is proposed in order to avoid precision errors while dealing with fixed-point DSP computations [DVD99]. A Huffman coder that best fits the Laplacian distribution is employed for residual packing. The Huffman codes are chosen according to an adaptive scheme [Crav97] (based on the input quantization step size) in order to accommodate a wide range of input word lengths.

It is interesting to note that the L2AC IIR prediction techniques [Crav97] described in this section resulted in new initiatives towards more efficient lossless packing of digital audio [Gerz99]. In particular, these ideas were integrated in the meridian lossless packing (MLP) algorithm proposed by Gerzon et al., which became an integral part of the DVD-audio specifications [DVD99].

11.2.2.4 The MUSICompress Wegner at Soundspace Audio developed a low-complexity, low-MIPS, fast L2AC scheme called the MUSICompress (MC) algorithm [Wege97].

Figure 11.5 The MUSICompress (MC) L2AC scheme. An approximator produces the approximated audio ŝ(n) from the audio input s(n); the residual error e(n) is obtained by subtraction, and each block of the compressed bitstream contains a header, the approximated audio, and the error.


This algorithm employs an adaptive approximator; the residual error is then computed as shown in Figure 11.5. Each encoded bitstream block contains a header, the approximated audio ŝ(n), and the residual e(n). The header block specifies the type of approximator used for compression. Depending upon the bit budget and computational complexity, the MC algorithm can also support lossy compression by discarding LSBs from the encoded bitstream. A metric for comparing L2AC algorithms is also proposed in [Wege97]. This is given by

L2AC metric = Compression ratio / Number of MIPS    (11.10)

Although this metric is not universal, it provides a rough estimate by considering both the compression ratio and the complexity associated with an L2AC algorithm. At the decoder, the compressed bitstream chunks are first demultiplexed. An interpreter reads the header structure, which provides information regarding the approximator used in the encoder. A synthesis operation is then performed at the decoder to reconstruct the lossless audio signal. Real-time DSP and hardware implementation aspects of the MC algorithm, along with compression results, are given in [Wege97].

11.2.2.5 The Context Modeling–Linear Predictive Audio Coding (C-LPAC) Inspired by applications of context modeling in image coding [Wu97] [Wu98], Qiu proposed an L2AC scheme called C-LPAC [Qiu01]. The main idea behind the C-LPAC scheme is that the residual error e(n) is encoded based on higher-order context modeling that uses local (neighborhood) information to estimate the probability of the occurrence of an event. An FIR linear prediction is employed to compute the prediction coefficients and the residual error. An error feedback is also employed in order to model the random process rp(n), i.e.,

spred(n) = Σ_{k=1}^{L} ak s(n − k) + rp(n)    (11.11)

where spred(n) is the predicted value based on the previous L samples of the input signal s(n), and ak are the prediction coefficients. The LP coefficients are computed by minimizing the error e(n) between the predicted and the original signal, as follows:

e(n) = s(n) − spred(n)    (11.12)

e(n) = s(n) − Σ_{k=1}^{L} ak s(n − k) − r(n),   n = 1, 2, ..., N    (11.13)

The mean square error (MSE) is given by

ε = (1/N) Σ_{n=1}^{N} e²(n)    (11.14)


where N is the number of samples. Note that the minimization of the MSE ε in (11.14) fails to model the random process r(n), since the audio frames are assumed to be stationary with constant mean and variance. Therefore, in order to estimate the random process r(n) correctly, the residual error e(n) is modified as follows:

ef(n) = { e(n) − Δ(n)   for the encoder
        { e(n) + Δ(n)   for the decoder    (11.15)

where Δ(n) = F(e(n − 1), e(n − 2), ..., e(n − nf))    (11.16)

where nf denotes the number of error samples fed back to compute the correction factor, and Δ(n) is based on an arbitrary function F. Although details on the selection of nf and the function F are not included in [Qiu01], we note that the basic idea is to employ an error feedback in order to account for the estimation of the random process rp(n). The modified residual error ef(n) is entropy coded using high-order context modeling. In particular, bit-plane encoding is employed [Kunt80] [Cai95]. For example, let e(i − 1), e(i), and e(i + 1) be the residual samples. Each bit plane (BP0 through BP7) contains one significant bit of each sample. The bit planes BP0, ..., BP7 contain the bits corresponding to the MSB, ..., LSB, respectively, and therefore the most significant information is included in the top few bit planes:

               BP0      BP7
               ↓          ↓
e(i − 1) = 63  (00111111)
e(i)     = 64  (01000000)
e(i + 1) = 65  (01000001)    (11.17)

In the above example, bit-plane 0 contains (0, 0, 0), bit-plane 1 contains (0, 1, 1), and so on. First, a local energy ξe(i) = Σ_{m1=1}^{p} |e(i − m1)| + Σ_{m2=1}^{q} |e(i + m2)|, i.e., over a neighborhood of p samples before e(i) and q samples after e(i), is computed. Depending on the energy level and the bit-plane position (0 through 7 in our example), the significance of the coding symbol e(i) is computed. Expressions for the context modeling used in "significance" coding, "sign" coding, and "refinement" coding are given in [Qiu01]. The research involving higher-order context modeling applied to L2AC uses concepts from image coding [Wu97] [Wu98].
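A small sketch of the bit-plane arrangement used in the example above: the magnitudes of three residuals are sliced into eight planes, with plane 0 holding the MSBs. The context-modeled arithmetic coding of these planes in C-LPAC is not reproduced here.

def bit_planes(residuals, n_planes=8):
    """Return bit planes BP0 (MSB) through BP7 (LSB) for |residual| values."""
    planes = []
    for p in range(n_planes):
        shift = n_planes - 1 - p                      # BP0 corresponds to the MSB
        planes.append([(abs(e) >> shift) & 1 for e in residuals])
    return planes

for p, plane in enumerate(bit_planes([63, 64, 65])):  # the residuals of Eq. (11.17)
    print("BP{}: {}".format(p, plane))
# BP0: [0, 0, 0], BP1: [0, 1, 1], ... as in the example above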

11.2.2.6 The Lossless Transform Audio Coding (LTAC) The LTAC algorithm [Pura97] employs a unitary transform (A^H = A^(−1)). In particular, an orthonormal DCT is used to decorrelate the input signal. Figure 11.6 shows the LTAC block diagram, where the input audio s(n) is first buffered in blocks of Nbuff samples, transformed to the frequency domain using the DCT, and then quantized to obtain the spectral coefficients X(k). Both fixed and adaptive block-length


modes are available for the DCT block transformation. In the fixed block-length mode, the DCT block size is equal to the number of input samples (Nbuff = 128, 256, 1024, or 2048). On the other hand, in the adaptive block-length mode, the number of input samples per block is Nbuff = 2048, and the transformation that results in the optimum performance among the 512-, 1024-, or 2048-point DCT is selected. Larger block lengths often provide good decorrelation of the input signal; however, this is true only when the input signal is not rapidly time varying. The LTAC algorithm works as follows. An estimate of the input audio, ŝ(n), is computed at the local decoder by performing the inverse transformation (i.e., the IDCT), as shown in Figure 11.6. An error signal (residual) is calculated by subtracting the estimated signal from the input signal,

e(n) = s(n) − ŝ(n)    (11.18)

where ŝ(n) = Q⟨IDCT[Q⟨DCT[s(n)]⟩]⟩ and Q⟨·⟩ denotes quantization. The residual e(n) is entropy coded using Rice codes and transmitted along

with the DCT coefficients X(k), as shown in Figure 11.6. Compression rates in the range of 59–72% and performance improvements over prediction-based L2AC schemes have been reported in [Pura97].
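The LTAC principle of Eq. (11.18) is easy to verify numerically: because the encoder's local decoder and the receiver compute the identical ŝ(n) from the same quantized coefficients, transmitting X(k) plus the integer residual reconstructs s(n) bit-exactly. The uniform coefficient quantizer and the missing entropy coder in this sketch are simplifications.

import numpy as np
from scipy.fftpack import dct, idct

def ltac_encode(s, step=1.0):
    X = np.round(dct(s, norm="ortho") / step).astype(np.int64)        # quantized DCT coefficients
    s_hat = np.round(idct(X * step, norm="ortho")).astype(np.int64)   # local decoder output
    return X, s - s_hat                                               # residual, Eq. (11.18)

def ltac_decode(X, e, step=1.0):
    s_hat = np.round(idct(X * step, norm="ortho")).astype(np.int64)
    return s_hat + e

s = np.random.randint(-2**15, 2**15, size=1024)
X, e = ltac_encode(s)
assert np.array_equal(ltac_decode(X, e), s)     # bit-exact reconstruction
print("residual spread:", int(e.min()), "to", int(e.max()))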

11.2.2.7 The IntMDCT Algorithm Inspired by the use of the DCT and other integer transforms in lossless audio coding, Geiger et al. proposed the integer MDCT (IntMDCT) algorithm [Geig01] [Geig02].

Figure 11.6 Lossless transform audio coding (LTAC). The audio input s(n) is transformed by an orthonormal DCT and quantized (the lossy compression scheme) to give the lossy compressed coefficients X(k); a local decoder (IDCT followed by a quantizer) produces ŝ(n), and the residual error e(n) is entropy coded and multiplexed with X(k).


One of the attractive properties that has contributed to the widespread use of the MDCT in audio standards is the availability of FFT-based algorithms [Duha91] [Sevi94] that make it viable for real-time applications. MDCT features that are important in audio coding, including its forward and inverse transform interpretations, prototype filter (window) design criteria, and window design examples, are described in Chapter 6. Recall that, in the orthogonal case, the generalized perfect reconstruction conditions given in Chapter 6, Eqs. (6.19) through (6.23), can be reduced to linear-phase and Nyquist constraints on the window w(n), i.e.,

−w(2M − 1 − n) + w(n) = 0 and w²(n) + w²(n + M) = 1, for 0 ≤ n ≤ M − 1    (11.19)

where M denotes the number of transform coefficients.
Givens Rotations and Lifting Scheme The IntMDCT algorithm is based on an integer approximation of the MDCT. This is accomplished using the "lifting scheme" [Daub96] proposed by Daubechies et al. First, the MDCT is decomposed [Geig01] into Givens rotations² by choosing c and s in Eq. (11.20) such that a vector X is orthogonally transformed (rotated) into a vector Y, as follows:

Givens rotation:

Y = [y1, y2]^T = [c  −s; s  c] [x1, x2]^T = [y1, 0]^T    (11.20)

Note from the above equation that

y2 = −s·x1 + c·x2 = 0 and c² + s² = 1.    (11.21)

Details on the decomposition of the MDCT into windowing, time-domain aliasing, and the DCT-IV, and eventually into Givens rotations, are given in [Vaid93]. Note the similarities between Eqs. (11.19) and (11.21). The lifting scheme is applied next in order to approximate the Givens rotations. The lifting scheme splits each Givens rotation into three lifting steps, Eq. (11.22), that can be easily approximated and mapped onto the integer domain [Daub96]. This integer mapping is reversible and therefore contributes to the lossless coding. Equations (11.23) and (11.24) show that the lifting scheme is indeed a reversible integer transform, since each of the three lifting steps can be inverted uniquely by simply subtracting the value that has been added.

Lifting scheme:

[c  −s; s  c] = [1  (c − 1)/s; 0  1] [1  0; s  1] [1  (c − 1)/s; 0  1]    (11.22)

² Note that c and s represent cos(α) and sin(α), where α corresponds to the angle of rotation of some arbitrary plane. In particular, the Givens rotation results in an orthogonal transformation without changing the norms of the vectors X and Y.


The lifting scheme and its inverting property are described as follows:

Y = [c  −s; s  c] X = [1  (c − 1)/s; 0  1] [1  0; s  1] [1  (c − 1)/s; 0  1] X

X = [c  −s; s  c]^(−1) Y = [c  s; −s  c] Y = [1  −(c − 1)/s; 0  1] [1  0; −s  1] [1  −(c − 1)/s; 0  1] Y    (11.23)

The lifting scheme is a reversible integer transform. Consider the first lifting step, [1  (c − 1)/s; 0  1]:

Y1 = [y1^(1), y2^(1)]^T = [1  (c − 1)/s; 0  1] [x1^(1), x2^(1)]^T = [x1^(1) + x2^(1)(c − 1)/s, x2^(1)]^T    (11.24)

Decoding:

X1 = [x1^(1), x2^(1)]^T = [1  −(c − 1)/s; 0  1] [y1^(1), y2^(1)]^T = [y1^(1) − y2^(1)(c − 1)/s, y2^(1)]^T
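The reversibility argument of Eqs. (11.22)–(11.24) carries over unchanged when each lifting step is rounded to an integer, which is the key idea behind the IntMDCT. The sketch below applies the three rounded lifting steps of a single Givens rotation and then inverts them exactly by subtracting the same rounded values.

import numpy as np

def lift_rotate(x1, x2, alpha):
    """Integer-to-integer approximation of a Givens rotation via three lifting steps."""
    c, s = np.cos(alpha), np.sin(alpha)
    t = (c - 1.0) / s
    a  = x1 + int(round(t * x2))     # first lifting step (rounded, so the result stays integer)
    y2 = x2 + int(round(s * a))      # second lifting step
    y1 = a  + int(round(t * y2))     # third lifting step
    return y1, y2

def lift_unrotate(y1, y2, alpha):
    """Invert each lifting step by subtracting the identical rounded value."""
    c, s = np.cos(alpha), np.sin(alpha)
    t = (c - 1.0) / s
    a  = y1 - int(round(t * y2))
    x2 = y2 - int(round(s * a))
    x1 = a  - int(round(t * x2))
    return x1, x2

x = (1234, -567)
y = lift_rotate(*x, alpha=0.3)
assert lift_unrotate(*y, alpha=0.3) == x       # perfectly reversible integer mapping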

The IntMDCT is typically used in conjunction with an MDCT-based perceptual coding scheme, as shown in Figure 11.7. First, the input audio s(n) is compressed using a perceptual coding scheme (e.g., the ISO/IEC MPEG audio algorithms), resulting in lossy compressed audio X(k). These coefficients are then inverse quantized, integer rounded, and subtracted from the IntMDCT spectral coefficients. This results in an error residual e(n). The error essentially carries the enhancement bitstream that contributes to lossless coding. The error spectrum and the MDCT coefficients can be entropy coded using Huffman codes. Bit-sliced arithmetic coding (BSAC), which inherits the property of fine-grain scalability, can also be used to encode the lossy audio and the enhancement bitstream.

In BSAC, arithmetic coding replaces Huffman coding. In each frame, bit planes are coded in order of significance, beginning with the most significant bits (MSBs). This results in a fully embedded coder containing all lower-rate codecs. With proper tuning of the quantization steps, the IntMDCT algorithm shown in Figure 11.7 can also be used in perceptual (lossy) audio coding applications [Geig02] [Geig03].

11.3 DVD-AUDIO

The DVD Forum Committee recommended standardizing the DVD-A specifications [DVD01] and the related coding technique for DVD-A, i.e., the meridian lossless packing [DVD99].

Figure 11.7 Lossless audio coding using the IntMDCT algorithm. The audio input s(n) is coded by a perceptual coding scheme (MDCT, psychoacoustic analysis, block companding, and quantization) to give the lossy compressed audio X(k); after inverse quantization and integer rounding, these coefficients are subtracted from the IntMDCT spectrum, and the resulting error enhancement bitstream e(n) is entropy coded (or bit-sliced arithmetic coded, BSAC) and multiplexed.


DVD-A offers two important features, i.e., multiple sampling rates (44.1, 48, 88.2, 96, 176.4, and 192 kHz) and multiple bit resolutions (16, 20, and 24 bits). The DVD-A format can store up to 6 channels of digital audio at 24-bit resolution sampled at 96 kHz. Table 11.2 lists some of the important parameters of the DVD-A format. DVD-A provides a dynamic range of the order of 144 dB over a 96-kHz frequency range, and the maximum input data rate allowed is 9.6 Mb/s. Along with these features, DVD-A addresses digital content protection and copyright issues with a variety of encryption and audio watermarking technologies. A 74-min, 96-kHz, multichannel, 20-bit resolution recording requires a minimum of 8.93 GB of storage space (see Table 11.2). However, the maximum storage capacity offered by DVD-A is 4.7 GB. Therefore, lossless compression is required. The meridian lossless packing (MLP) scheme [Gerz99] [Kuzu03], developed by Gerzon et al., has been chosen as the compression standard for the DVD-Audio format.

11.3.1 Meridian Lossless Packing (MLP)

The MLP encoding scheme [Gerz99] was built around the work done by Craven and Gerzon [Crav96] [Crav97] and by Oomen, Bruekers, and Vleuten [Oome96]. Furthermore, it inherits some features from the DVD algorithm discussed in Section 11.2.2. Figure 11.8 shows the five important stages associated with the MLP encoding scheme. These include channel remapping and shifting, interchannel decorrelation, intrachannel decorrelation (or prediction), entropy coding (lossless packing), and interleaving/MLP bitstream formatting. In the first stage, N channels of PCM audio are remapped and shifted in order to facilitate efficient buffering of MLP substreams and to improve bit precision. MLP can encode a maximum of 63 audio channels. In the second stage, the interchannel dependencies within these N channels are exploited. Typically, matrixing techniques (see Chapter 10) are employed to eliminate interchannel redundancies/irrelevancies. In particular, MLP employs the lossless matrixing scheme described in Figure 11.9. From this figure, it can be noted that only one PCM audio channel, for instance x1, is modified, while the channels x2 through xN

are unchanged. The coefficients [a2, a3, a4, ..., aN] are computed for each audio frame such that the interchannel redundancy is minimized. Stage 3 is concerned with intrachannel redundancy removal based on FIR/IIR prediction techniques.

11.4 SUPER-AUDIO CD (SACD)

Almost 20 years after the launch of the audio CD format [PhSo82] [IECA87], Philips and Sony established a new storage format called the "super-audio CD" (SACD) [SACD02] [SACD03]. Unlike the conventional CD format, which employs PCM encoding with 16-bit resolution and a sampling rate of 44.1 kHz, the SACD system uses a superior 1-bit recording technology called direct stream digital (DSD) [Nish96a] [Nish96b] [Reef01a]. SACD operates at a sampling frequency 64 times the rate used in conventional CDs, i.e., 64 × 44.1 kHz = 2.8224 MHz.

Table 11.2 Conventional CDs versus the next-generation storage formats

Parameter                    Conventional CD                 SACD                              DVD-Audio
Standardized year            1982 [IECA87]                   1999 [SACD02]                     1999 [DVD01]
Encoding scheme              Pulse code modulation (PCM)     Direct-stream digital (DSD)       Meridian lossless packing (MLP)
Bit resolution               16-bit linear PCM               1-bit DSD                         16-, 20-, or 24-bit linear PCM
Sampling rate                44.1 kHz                        2822.4 kHz                        44.1, 48, 88.2, 96, 176.4, or 192 kHz
Data rate                    1.4 Mb/s                        5.6 to 8.4 Mb/s                   < 9.6 Mb/s
Frequency range              DC to 20 kHz                    DC to 100 kHz                     DC to 96 kHz
Dynamic range                96 dB                           ~120 dB                           ~144 dB
Number of channels           2                               2 to 6                            1 to 6
Playback time                74 min                          ~74 (±5) min                      ~70 to 800 min (depending upon the bit rate)
Maximum storage capacity     ~750 to 800 MB                  ~4.38 to 4.7 GB (single-layer)    ~4.7 GB
Storage required             74 × 60 × 44100 × 2 × (16/8)    74 × 60 × 2.8224 × 10^6 × (2 + 6) × (1/8)    74 × 60 × 96 × 10^3 × (1 + 2 + 6) × (20/8)
                             ≅ 740 MB                        ≅ 11.67 GB                        ≅ 8.93 GB
Compression                  Not required                    Required                          Required

Figure 11.8 MLP encoding scheme (after [Gerz99]). Stage 1: channel remapping and shifting (shift registers 1 through N). Stage 2: interchannel decorrelator (lossless matrixing). Stage 3: intrachannel decorrelator (predictors 1 through N). Stage 4: entropy coders 1 through N. Stage 5: buffering, interleaving, and MLP bitstream formatting. N is the number of audio channels (up to 63 for MLP).


Figure 11.9 Lossless matrixing in MLP. At the encoder, only one channel is modified:

x′1 = x1 − Q(a2x2) − Q(a3x3) − Q(a4x4) − ... − Q(aNxN),   with x′i = xi for i = 2, ..., N,

where Q(aixi) denotes quantization of the product aixi. Decoding x2 through xN is straightforward, since they are unchanged; channel 1 is recovered as

x1 = x′1 + Q(a2x′2) + Q(a3x′3) + ... + Q(aNx′N).

Because x′i = xi for i ≥ 2, the decoder adds back exactly the quantized products that the encoder subtracted; hence the matrix encoding and decoding are lossless.
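The losslessness of the matrixing in Figure 11.9 hinges on the encoder and decoder quantizing the products ai·xi identically, so that the decoder can add back exactly what the encoder subtracted. A minimal sketch follows, with floor() standing in for the quantizer Q(·) and made-up coefficient values:

import numpy as np

def mlp_matrix_encode(x, a):
    """x: (N, L) integer PCM channels; a: coefficients a2..aN. Only channel 1 is modified."""
    y = x.copy()
    for i in range(1, x.shape[0]):
        y[0] -= np.floor(a[i - 1] * x[i]).astype(x.dtype)    # subtract Q(a_i * x_i)
    return y

def mlp_matrix_decode(y, a):
    x = y.copy()
    for i in range(1, y.shape[0]):
        x[0] += np.floor(a[i - 1] * y[i]).astype(y.dtype)    # add back the same quantized products
    return x

pcm = np.random.randint(-2**20, 2**20, size=(4, 16), dtype=np.int64)
coeffs = [0.41, -0.27, 0.08]                                 # illustrative values only
assert np.array_equal(mlp_matrix_decode(mlp_matrix_encode(pcm, coeffs), coeffs), pcm)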

Moreover, SACD enables a frequency response from DC to 100 kHz with a dynamic range of 120 dB. SACD systems support both high-resolution surround sound recordings (5.1 or 3/2-channel format) and high-quality stereo encoding. Some important characteristics of the red book audio CD format [IECA87] and of the SACD and DVD-audio storage formats are summarized in Table 11.2.

The main idea behind the DSD technology is to avoid the decimation and interpolation filters that invariably impart quantization noise in the compression process. To this end, DSD directly records each sample as a 1-bit pulse [Hori96] [Nish96a] [Nish96b] [Reef01a], thereby eliminating the need for down-sampling


and up-sampling filters (see Figure 11.10(b)). Moreover, DSD encoding enables improved lossless compression of digital audio [Reef01a] [Jans03], i.e., direct stream transfer (DST), with compression ratios ranging from 2 to 4. Figures 11.10(a) and (b) show the steps involved in conventional PCM encoding and in the DSD technology, respectively. Before describing the DSD encoding, we provide a brief introduction to the SACD storage format [tenK97] and to sigma-delta (ΣΔ) modulators [Cand92] [Sore96] [Nors97] in the next two subsections.

11.4.1 SACD Storage Format

SACDs come in three formats based on the number of layers employed [tenK97] [SACD02] [SACD03] [Jans03]: the single-layer, the dual-layer, and the hybrid formats. The single-layer SACD carries a single high-density layer for the high-resolution DSD recording (4.7 GB). On the other hand, the dual-layer SACD contains two layers of DSD content (2 × 4.7 GB = 8.4 GB). In order to allow backwards compatibility with conventional CD players, a hybrid disc format is standardized; in particular, the hybrid disc contains one inner layer of DSD content and one outer layer of conventional CD content (i.e., 750 MB + 4.7 GB = 5.45 GB). Similar to conventional CDs, SACDs have a 12-cm diameter and 1.2-mm thickness. The lens numerical aperture (NA) is 0.6, and the laser wavelengths used to read the SACD and the CD content are 650 nm and 780 nm, respectively.

11.4.2 Sigma-Delta Modulators (SDM)

Sigma-delta (ΣΔ) modulators [Cand92] [Sore96] [Nors97] convert analog signals to one-bit digital outputs. Figure 11.11 shows a schematic block diagram of the ΣΔ modulator, which consists of a single-bit quantizer and a loop filter (i.e., a low-pass filter) in a negative feedback loop. ΣΔ modulators output 1-bit samples that indicate whether the current sample is larger or smaller than the value accumulated from the previous samples. If the amplitude of the current sample is greater than the value accumulated in the negative feedback loop, the SDM outputs "1"; otherwise, it outputs "0". SDM involves high sampling frequencies and a two-level ("0" or "1") quantization process. Oversampling relaxes the requirements on the analog antialiasing filter (see Figure 11.10) and hence simplifies the analog hardware. Recall from Chapter 3 that multi-bit PCM quantizers (i.e., more than two discrete amplitude levels) suffer from differential nonlinearities, particularly with analog signals. Several solutions have been proposed [Nors92] [Nors93] [Dunn96] [Angu97] to overcome the nonlinear distortion in the PCM quantization process, including oversampling and noise shaping.

Nonlinear distortions are eliminated by switching from multi-bit PCM to single-bit quantizers. However, the single-bit quantizer employed in the ΣΔ modulator results in high quantization noise; therefore, a noise shaping process is essential.

Figure 11.10 (a) Conventional PCM encoding: a 1-bit analog-to-digital converter operating at 64Fs is followed by a decimation filter and PCM encoding at Fs (20-bit); for playback, an interpolation filter and sigma-delta modulator produce a 1-bit stream at 64Fs that is smoothed by an analog low-pass filter. (b) Direct-stream digital (DSD) technology: the 1-bit ADC output at 64Fs is DSD encoded directly, and playback requires only an analog low-pass filter.


Figure 11.11 Sigma-delta (ΣΔ) modulator. The difference between the analog input x(n) and the fed-back 1-bit output y(n) is scaled by the gain k and passed through the low-pass loop filter (integrator) H(z); a single-bit quantizer, modeled as adding the error e(n), produces the 1-bit digital output y(n).

Noise shaping removes or "moves" the quantization noise from the audible band (<20 kHz) to the high-frequency band (>50 kHz). From Figure 11.11, the transfer function of the SDM system is given by

(X(z) − Y(z)) H(z) k + E(z) = Y(z)    (11.25)

Y(z) = [kH(z) / (1 + kH(z))] X(z) + [1 / (1 + kH(z))] E(z)    (11.26)

where the first bracketed term is the signal transfer function and the second is the noise shaping function.

where H(z) is the loop filter, k is the error in gain, and E(z) is the DC offset corresponding to the 1-bit quantization. In order to remove the quantization noise from the audible band, the noise shaping function NS(z) = 1/(1 + kH(z)) must be a high-pass filter, and hence the loop filter H(z) must be a low-pass filter that acts as an integrator. The next important step is to decide the order of the loop filter H(z). A higher-order loop filter results in higher resolution and larger dynamic range. We discussed earlier that SACDs provide dynamic ranges of the order of 120 dB. Experimental results [Reef01a] [Jans03] indicate that at least a fifth-order loop filter must be employed in order to obtain a dynamic range greater than 120 dB. However, due to the negative feedback loop employed in the ΣΔ modulator, and depending on the analog input signal amplitude, the SDM may become unstable if the order of the loop filter exceeds two. In order to overcome these stability issues, Reefman and Janssen proposed a fifth-order ΣΔ modulator structure with integrator clipping, designed specifically for DSD encoding [Reef01a] [Reef02]. Several publications describing SDM structures for 1-bit audio encoding [Angu97] [East97] [Angu98] [Angu01] [Reef01a] [Reef02] [Harp03] [Jans03] have appeared.
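A first-order modulator is enough to see the behavior described above (the loop integrates the feedback error and a 1-bit quantizer produces the output stream); it is only an illustration, since DSD encoders use fifth-order or higher loop filters with additional stabilization.

import numpy as np

def sigma_delta_1bit(x):
    """First-order sigma-delta modulator: integrate the error, quantize to +/-1."""
    y = np.empty(len(x))
    acc, fed_back = 0.0, 0.0
    for n in range(len(x)):
        acc += x[n] - fed_back            # negative feedback followed by the integrator
        y[n] = 1.0 if acc >= 0 else -1.0  # single-bit quantizer
        fed_back = y[n]
    return y

fs, osr = 44100, 64                       # DSD-style 64-times oversampling
t = np.arange(int(0.005 * fs * osr)) / (fs * osr)
tone = 0.5 * np.sin(2 * np.pi * 4000 * t) # 4-kHz test tone, as in Figure 11.13
bits = sigma_delta_1bit(tone)
print("fraction of +1 samples:", bits.mean() * 0.5 + 0.5)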

11.4.3 Direct Stream Digital (DSD) Encoding

Designed primarily to enhance the performance of 1-bit lossless audio coding, DSD encoding offers several advantages [Reef01b] over standard PCM technology: it eliminates the requirement for decimation and interpolation filters, enables lossless compression and higher dynamic ranges (~120 dB), reduces storage requirements (since the encoding is 1-bit), allows much higher time resolution (due to the higher sampling rates), and involves controlled transient behavior (since it employs simple antialiasing low-pass filters). Moreover, DSD bitstreams


can be easily down-converted to lower sampling frequencies (32 kHz, 44.1 kHz, 96 kHz, etc.). During the mid-1990s, Nishio et al. provided the first ideas of a direct stream digital audio system [Nish96a] [Nish96b] [Nogu97]. At the same time, several other researchers proposed methods for 1-bit (lossless) audio coding [Hori96] [Brue97] [East97] [Ogur97]. In 1999, Philips and Sony released the licensing for the SACD systems and adopted 1-bit DSD encoding as the recording/coding technology. The following year, Reefman and Janssen at Philips Research Labs presented a white paper on the signal processing aspects of DSD for SACD recording [Reef01a]. Since then, several advancements towards efficient DSD encoding have been made [Reef01c] [Hawk02] [Kato02] [Reef02] [Harp03] [SACD03].

Figure 11.12 shows the DSD encoding procedure, which employs a ΣΔ modulator and the direct stream transfer (DST) technique to perform lossless compression.

From Table 11.2, note that an SACD allows 4.7 GB of storage space; however, the storage required for 74 min of 1-bit DSD data is approximately 11.67 GB, and hence the need for lossless compression of 1-bit audio. Lossless coding in the case of 1-bit digital signals is altogether a different scenario from what we discussed earlier in Section 11.2. Bruekers et al. in 1997 proposed a lossless coding technique for 1-bit audio signals [Brue97]. This was later adopted by Philips and Sony as the direct stream transfer (DST) L2AC technique for 1-bit DSD encoding.

11.4.3.1 Analog to 1-Bit Digital Conversion (ΣΔ Modulator) First, the analog input signal x(n) is converted to the 1-bit digital output bitstream y(n)

(Figure 11.12). The steps involved in the analog to 1-bit digital bitstream conversion were presented earlier in Section 11.4.2. In this section, we present some simulation results (Figure 11.13) to analyze the dynamic ranges offered by 16-bit linear PCM coding (~96 dB) and 1-bit DSD encoding (~120 dB). Figure 11.13(a) shows the time-domain input waveform x(n), which in this case is a 4-kHz sine wave. Figure 11.13(b) represents the ΣΔ modulator output, i.e., the 1-bit digital output bitstream y(n). Note that as the amplitude of the input signal increases, the density of "1"s increases (plotted as a bar chart), and if the amplitude decreases, the density of "0"s increases. Figure 11.13(c) shows the frequency-domain output of the 16-bit linear PCM quantizer for the 4-kHz sine wave input x(n). Recall from Chapter 3 that the SNR for linear PCM improves by approximately 6 dB per bit. This is given by

SNR_PCM = 6.02 Rb + K1    (11.27)

where Rb denotes the number of bits, and the factor K1 is a constant that accounts for the step size and loading factors. Therefore, for 16-bit linear PCM, the output SNR is given by SNR_PCM ≈ 6.02(16) ≈ 96 dB. This can be associated with the dynamic ranges (~96 dB) obtained with conventional CDs that employ 16-bit linear PCM encoding (see Figure 11.13(c)). Figure 11.13(d) shows the spectrum of the 1-bit digital output bitstream.

Figure 11.12 DSD encoding using a ΣΔ modulator and the direct stream transfer (DST) technique for 1-bit lossless audio coding in SACDs. The analog signal x(n) is converted by the ΣΔ modulator (prediction loop filter H(z) and single-bit quantizer) into the single-bit output bitstream y(n); after framing/buffering into 37,632 bits per frame, yf(n), the DST block maps the bits (0 → −1, 1 → +1), applies a prediction filter z⁻¹A(z) followed by a 1-bit quantizer, and entropy codes the resulting residual r(n) into re(n) using an error-probability table; the prediction coefficients and side information are then assembled by the DSD bitstream formatting.


Figure 11.13 Analog to 1-bit digital conversion: (a) time-domain input sinusoid (4 kHz) x(n), (b) ΣΔ modulator output, (c) frequency-domain output of 16-bit linear PCM encoding (DR ~96 dB), and (d) frequency-domain representation of 1-bit DSD (DR ~120 dB).


Although there are no mathematical models to compute the SNR values directly for 1-bit DSD encoding, several experimental and simulation results are available [Reef01a] [Jans03]. The output SNR values in the case of 1-bit encoding (and therefore the dynamic ranges) depend on the loop-filter order. In particular, a fourth-order loop filter provides a dynamic range (DR) of 105 dB, a fifth-order loop filter results in a DR of 120 dB, and a seventh-order loop filter offers a DR of 135 dB. In Figure 11.13(d), the slight increase in the noise floor around the 50-kHz range is due to the noise shaping employed in the ΣΔ modulator.

11.4.3.2 Direct Stream Transfer (DST) The 1-bit digital output bitstream (~11 GB) available at the output of the ΣΔ modulator must be compressed in order to fit onto an SACD (~4.7 GB). A lossless coding technique called DST is employed. DST is somewhat similar to the prediction-based lossless compression techniques discussed earlier in this chapter; however, several modifications are required in order to process the 1-bit digital samples. The DST algorithm [Brue97] [Reef01a] works as follows (see Figure 11.12). First, the 1-bit pulse stream y(n) is buffered into frames of 37,632 bits, yf(n). A mapping procedure is applied in order to reorder the amplitudes of yf(n). DST employs a prediction filter A(z) in order to reduce the redundancies associated with the samples in a frame. The prediction filter can be adaptive, and the filter order can range from 1 to 128. It is intuitive that the resulting predicted signal yfm(n) exhibits multi-bit resolution. The predicted signal yfm(n) is converted (using a 1-bit quantizer) to the single-bit resolution signal yfs(n). The residual r(n) is given by

r(n) = yf(n) − yfs(n)    (11.28)

The residual error r(n) is entropy coded based on an error-probability look-up table. For each frame, the prediction coefficients ai, i ∈ [1, 2, ..., 128], the entropy-coded residual re(n), and some side information (error probability table, etc.) are arranged according to the DSD bitstream formatting specified by [SACD02].

11.4.3.3 Content Protection and Copyright in SACD Unlike the typical audio watermarking techniques employed in conventional CDs, SACDs employ a unique (distinct for each SACD) invisible watermark called the Pit Signal Processing–Physical Disc Mark (PSP-PDM) [SACD02] [SACD03] and three important controls, i.e., disc-access control, content-access control, and playback control. Figure 11.14 provides details on these SACD controls and the PSP-PDM. SACD employs several scrambling (content rearrangement) and encryption techniques. In order to descramble and play back the DSD content, knowledge of the PSP-PDM "key" is essential.

11.5 DIGITAL AUDIO WATERMARKING

Audio content protection from unauthorized users and the issue of copyright management are very important these days, considering the vast amount of multimedia resources available.


Figure 11.14 D-C-P controls in SACD. Disc-access control: the SACD mark holds the key to access the disc and the associated SACD parameters (i.e., PSP-PDM, etc.); this is the first step in checking whether the SACD is compliant with the drive. Content-access control: once the disc is found to be compatible, the next step is to obtain the hidden PSP-PDM key to access the SACD content; using the PSP-PDM key and the initial-values information, the SACD content is descrambled. Playback control: the descrambled data is played back once the PSP-PDM key is verified to be authentic.

Moreover, with the popularity of efficient search engines and fast Internet connections, it became feasible to search for and download an audio clip over the World Wide Web in the MP3 format, which can offer CD-quality, high-fidelity audio. For example, an audio clip of 6-minute duration in the MP3 format would in general require 3–5 MB of storage space; note that at 20 kB/s it would take no more than 5 minutes to download the entire clip. Because of this, security issues in MP3 audio streaming have become important [Horv02] [Thor02]. Furthermore, with the advancements in the computer industry, any unprotected CD can be reproduced (i.e., copied) with minimal effort. The need for "intellectual property management and protection" (IPMP) [Spec98] [Spec99] [Rose01] [Beck04] motivated content creators and service providers to seek efficient encryption and data hiding techniques. The main idea behind these data hiding techniques is to insert (or hide) an auxiliary data string into the digital media in order to resolve rightful ownership. Some of the important tasks of these encryption/data-hiding techniques include the ability to ensure that the encryption is imperceptible and robust to a variety of signal processing manipulations (e.g., down-sampling, up-sampling, filtering, etc.) [Bend96] [Bone96], to provide reliable information regarding the ownership of a media file [Crav97] [Crav98], to deactivate the playback option if the media clip is found to be pirated, and to comply with digital rights management (DRM) principles [Rose01] [Arno02] [Beck04].

This section addresses various data-hiding techniques employed for audio content protection. A data hiding technique [Bend95] [Peti99] [Cox01a] [Wu03] involves schemes that insert signals with certain characteristics (imperceptible, robust, statistically undetectable, etc.) into the audio. Such signals are called


watermarks. A digital audio watermark (DAW) can be defined as a stream of bits embedded (or hidden) in an audio file that offers features such as content protection, intellectual property management, and proof of ownership. Before getting into the details of audio watermarking techniques, we present some of the basic concepts of "content protection" and "copyright management." A simple idea is to include in the audio clip a password or a key that is relatively difficult to "hack" in a given amount of time. Knowledge of this key is then essential in order for someone to play back the audio file; this ensures content protection. For copyright management, a data string or some form of signature that indicates the proof of ownership is included in the audio. However, in both cases, if the key or the signature string is lost, one cannot guarantee either content protection or copyright management. Therefore, in applications related to copyright management, watermarking techniques are usually combined with encryption schemes. The latter are beyond the scope of this text; the reader can find more information on encryption techniques in [Wick95] [Trap02] [Mao03].

Some of the important aspects that pose major challenges in digital audio watermarking are described below. Digital audio and video watermarking techniques rely on the perceptual properties of the human auditory system (HAS) and the human visual system (HVS), respectively. In certain ways, the HVS is less sensitive than the HAS, which indicates that it is more challenging to embed imperceptible audio watermarks. Moreover, the HAS is characterized by larger dynamic ranges in terms of power/amplitude (~10^8:1) and frequency (~10^4:1). Unlike in video compression, where data rates of the order of 1 Mb/s are employed, in audio coding the data rates typically range from 16 to 320 kb/s. This means that the stream of bits embedded in an audio file is much shorter than the auxiliary information inserted in a video file for watermarking. Researchers published a variety of sophisticated techniques ([Bend95] [Cox95] [Bend96] [Bone96] [Cox97] [Bass98] [Lacy98] [Swan98a] [Kiro03a] [Zhen03]) in order to meet several conflicting demands ([Crav98] [Swan99] [Cox99] [Barn03a] [Barn03b]) associated with digital audio watermarking ([Cox01a] [Arno02] [Wu03]).

11.5.1 Background

The advent of ISO/IEC MPEG-4 standardization (1996–2000) [ISOI99] [ISOI00a] established new research goals for high-quality coding of general audio signals, even at low bit rates. Moreover, the bit-rate scalability and error resilience/protection tools of the MPEG-4 audio standard enabled flexible selection of coding features and dynamic adaptation to the channel conditions and the varying channel capacity. These and other significant strides in the area of digital audio coding have motivated considerable research during the last 10 years towards the formulation of several audio watermarking algorithms. Table 11.3 lists some of the important innovations in digital audio watermarking research.

In 1995, Bender et al. at the MIT Media Laboratory proposed a variety of data-hiding techniques for audio [Bend95] [Bend96]. These include low-bit coding, phase coding, spread-spectrum, and echo data hiding. The pioneering work


Table 11.3 Some of the milestones in digital audio watermarking research – a historical perspective

Year | Digital audio watermarking (DAW) scheme | Related references
1995 | Phase coding, low-bit coding, echo-hiding, and spread-spectrum techniques | [Bend95] [Bend96]
1995–96 | Echo-hiding watermarking scheme | [Gruh96]
1995–96 | Spread-spectrum watermarking for multimedia | [Cox95] [Cox96] [Cox97]
1996 | Watermarking based on perceptual masking | [Bone96] [Swan98a]
1997 | DAW based on vector quantization | [Mori97]
1998 | Time-domain DAW technique | [Bass98] [Bass01]
1999 | DAW based on bandlimited random sequences | [Xu99a]
1999 | DAW based on a psychoacoustic model and spread-spectrum techniques | [Garc99]
1999 | Permutation and data scrambling | [Wang99]
2000 | DAW in the cepstrum domain | [Lee00a] [Lee00b]
2000 | Watermarking techniques for MPEG-2 AAC bitstreams | [Neub00a] [Neub00b]
2000 | Audio watermarking based on the "patchwork" algorithm | [Arno00]
2001 | Combined audio compression/watermarking methods | [Herr01a] [Herr01b] [Herr01c]
2001 | Modified echo hiding technique | [Oh01] [Oh02]
2001 | Adaptive and content-based DAW scheme (improves the accuracy of watermark detection in echo hiding) | [Foo01]
2001 | Modified patchwork algorithm | [Yeo01] [Yeo03]
2001 | Time-scale modification techniques for DAW | [Mans01a] [Mans01b]
2001 | Replica modification | [Petr01]
2001 | Security aspects of DAW schemes | [Kalk01a] [Kalk01b] [Mans02]
2001 | Synchronization methods for audio watermarks | [Leon01] [Gang02]
2002 | Pitch scaling techniques for robust DAW | [Shin02]
2002 | DAW using artificial neural networks | [Huij02]
2002 | Time-spreading echo hiding using PN sequences | [Ko02]
2002 | Improved spread-spectrum audio watermarking | [Kiro01a] [Kiro01b] [Kiro02] [Kiro03a]
2002 | Hybrid spread-spectrum DAW | [Munt02]
2002 | DAW using subband division/QMF banks | [Sait02a] [Sait02b]
2002 | Blind cepstrum-domain DAW | [Hsie02]
2002 | Blind DAW with self-synchronization | [Huan02]
2003 | Time-frequency techniques | [Esma03]
2003 | Chirp-based techniques for robust DAW | [Erku03]
2003 | DAW using turbo codes | [Cvej03]
2003 | DAW using sinusoidal patterns based on PN sequences | [Zhen03]

Year | Tutorial reviews and survey papers | Reference
1996 | Techniques for data hiding | [Bend96]
1997 | Secure spread spectrum watermarking for multimedia | [Cox97]
1997 | Resolving rightful ownerships with invisible watermarks | [Crav97] [Crav98]
1998 | Multimedia data-embedding and watermarking technologies | [Swan98]
1998 | Special issue on copyright and privacy protection | [Spec98]
1999 | Special issue on identification and protection of multimedia information | [Spec99]
1999 | Watermarking as communications with side information | [Cox99]
1999 | Multimedia watermarking techniques | [Hart99]
1999 | Information hiding – a survey | [Peti99]
1999 | Current state of the art, challenges, and future directions for audio watermarking | [Swan99]
2001 | Digital watermarking algorithms and applications | [Podi01]
2001 | Electronic watermarking: the first 50 years | [Cox01]
2003 | What is the future of watermarking? Parts I & II | [Barn03a] [Barn03b]

of Cox et al. resulted in a novel scheme called spread-spectrum watermarking for multimedia [Cox95] [Cox96] [Cox97]. In 1996, the first-ever international workshop on information hiding was held in Cambridge, UK, and attracted many researchers working on audio watermarking [Cox96] [Gruh96]. Boney et al. in 1996 presented a watermarking scheme for audio signals based on the perceptual masking properties of the human ear [Bone96] [Swan98a]. This was perhaps the first audio watermarking scheme that employed psychoacoustic principles and the frequency/temporal masking characteristics of the HAS. A watermarking scheme based on vector quantization was also proposed [Mori97]. All the aforementioned data-hiding schemes, except the "echo hiding" and "phase coding" techniques, embed watermarks in the frequency domain. In 1998, Bassia et al. presented a time-domain audio watermarking technique [Bass98] [Bass01]. Audio watermarking using bandlimited random sequences [Iked99], multi-bit


Figure 11.15 Number of publications on digital watermarking per year, 1992–2004, including the projected number of papers (according to INSPEC, Jan. 1999 [Roch98] [Peti99]).

hopping and HAS properties [Xu99a], psychoacoustic models and spread-spectrum theory [Garc99], and permutation-based data scrambling [Wang99] have also been proposed. Data-hiding research during 1995–1999 [Swan99] [Arno00] laid the foundation for the next generation (2000–2004) of watermarking techniques. Figure 11.15 reviews the number of publications on digital watermarking that appeared over the years, according to INSPEC, Jan. 1999 [Roch98] [Peti99]. Special publication issues on copyright/privacy protection and multimedia information protection appeared in the Proceedings of the IEEE and other journals [Spec98] [Spec99] during 1998–1999. Several tutorials and survey papers highlighting watermarking techniques for multimedia have been presented [Bend96] [Cox97] [Swan98a] [Swan98b] [Cox99] [Hart99] [Peti99].

In general, the success of a data-hiding scheme is characterized by the watermark embedding and detection procedures. Researchers began to develop schemes that enable improved watermark detection [Cilo00] [Furo00] [Kim00] [Lean00] [Seok00]. Lee et al. presented a digital audio watermarking scheme that works in the cepstrum domain [Lee00a] [Lee00b]. Neubauer and Herre proposed watermarking techniques for MPEG-2 AAC bitstreams [Neub00a] [Neub00b]. Enhanced spread-spectrum watermarking for MPEG-2 AAC was proposed later by Cheng et al. [Chen02b]. With the advent of sophisticated audio compression schemes, combined audio compression/watermarking techniques also gained interest [Herr01a] [Herr01b] [Herr01c].

Motivated by the success of the "patchwork" algorithm in image watermarking [Bend96], Arnold proposed an audio watermarking scheme based on the statistical patchwork method that operates in the frequency domain [Arno00]. Later, Yeo et al. presented a modified patchwork algorithm (MPA) [Yeo01] [Yeo03]


that is relatively more robust than other patchwork algorithms for DAW. Robust audio watermarking schemes based on time-scale modification [Mans01a] [Mans01b], replica modulation [Petr01], pitch scaling [Shin02], and artificial neural networks [Huij02] have also been proposed.

The echo hiding technique proposed by Gruhl et al. [Gruh96] was modified and improved by Oh et al. [Oh01] [Oh02] for robust and imperceptible audio watermarking. Ko et al. [Ko02] further modified the echo hiding technique by time-spreading the echo using pseudorandom noise (PN) sequences. In order to avoid audible echoes and further improve the accuracy of watermark detection in the echo hiding technique (i.e., [Gruh96]), an adaptive and content-based audio watermarking technique has been proposed [Foo01]. Costa's pioneering work on the theoretical bounds on the amount of data that can be hidden in a signal [Cost83] motivated researchers to apply these bounds to the data-hiding problem in audio [Chou01] [Kalk02] [Neub02] [Peel03].

Watermark detection becomes particularly challenging when a part of the embedded watermark is lost due to common signal processing manipulations (e.g., cropping, sampling, shifting, etc.). Hence, security and synchronization aspects of DAW became an important issue during the early 2000s. Several synchronization methods for audio watermarks have been proposed [Lean01] [Gang02]. Efficient methods to improve the security of watermarks have also been presented [Kalk01a] [Kalk01b] [Gang02] [Mans02]. Kirovski and Malvar presented a robust framework for direct-sequence spread-spectrum (DSSS)-based audio watermarking by incorporating several sophisticated techniques such as robustness to desynchronization, cepstrum filtering, and chess watermarks [Kiro01a] [Kiro01b] [Kiro02] [Kiro03a] [Kiro03b]. DAW based on hybrid spread-spectrum methods has also been investigated [Munt02]. Saito et al. employed a subband division technique based on QMF banks for digital audio watermarking [Sait02a] [Sait02b]. DAW based on time-frequency characteristics and chirp-based techniques have also been proposed [Erku03] [Esma03]. Motivated by the applications of turbo codes in encryption techniques, Cvejic et al. [Cvej03] developed a robust DAW scheme using turbo codes. DAW using sinusoidal patterns based on PN sequences has been described in [Zhen03].

Blind and asymmetric audio watermarking gained popularity due to their inherent property that the original signal is not required for watermark detection. Several blind audio watermarking algorithms have been reported [Hsie02] [Huan02] [Peti02]; however, research on blind DAW is still in its early stages. Asymmetric watermarking schemes involve watermark detection by knowing only a part of the secret information used to encode and embed the watermark.

11.5.2 A Generic Architecture for DAW

During the last decade, researchers have proposed several efficient methods for audio watermarking. Most of these algorithms are based on the generic architecture shown in Figure 11.16. The three primary modules associated with a digital


Figure 11.16 A general audio watermarking framework: (a) watermark encoding procedure (watermark design and embedding); (b) watermark decoder.

audio watermarking scheme include the watermark design, the watermark embedding algorithm, and the watermark retrieval. The first two modules belong to the watermark encoding procedure (Figure 11.16(a)), and the watermark retrieval can be viewed as a watermark decoder (Figure 11.16(b)).


First, the watermark design section generates an auxiliary data string based on a security key and a set of design parameters (e.g., bit rates, robustness, psychoacoustic principles, etc.). Furthermore, the watermark generation technique often depends on analysis features of the input audio, the human auditory system (HAS), and some data scrambling/spread-spectrum technique. Nevertheless, the ultimate objective is to design a watermark that is perceptually inaudible and robust. The watermark embedding algorithm consists of hiding the designed watermark sequence w in the audio bitstream x. This is given by

x_w = ℘(x, w),    (11.29)

where x_w is the watermarked audio and ℘ represents the watermark embedding function.

The watermark embedding can be performed either in the time domain or in the frequency domain. The idea is to exploit the frequency- and temporal-masking properties of the human ear. Three key features of the human auditory system that are used in audio watermarking are:

i) louder sounds typically tend to mask out quieter sounds;
ii) the human ear perceives only the "relative phase" associated with sounds; and
iii) the HAS fails to recognize minor reverberations and echoes.

These features, when employed in conjunction with encryption schemes [Mao03] or spread-spectrum techniques [Pick82], lead to efficient DAW schemes. In particular, the direct-sequence spread-spectrum (DSSS) technique, due to its inherent suitability for secure transmission, is the most popular spreading scheme employed in data hiding. The DSSS technique employs pseudo-noise (PN) sequences to transform the original signal into a wideband, noise-like signal. The resulting spread-spectrum signals are resilient to interference, difficult to demodulate/decode, and robust to collusion attacks, and therefore form the basis for many watermarking schemes. From Figure 11.16(a), note that a signature or a security key that provides details on the proof of authorship/ownership can also be inserted. Depending on overall system objectives and design philosophy, watermarks can be inserted either before or after audio compression. Moreover, algorithms for combined audio compression and watermarking are also available [Herr01a] [Herr01b]. The watermark security should lie in the hidden random keys and not in the watermark-embedding algorithm; in other words, details of the watermark-embedding scheme must be available at the decoder for efficient watermark detection.
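To make the embedding function ℘ of Eq. (11.29) concrete, the following minimal Python/NumPy sketch spreads a short bit string with a key-seeded PN sequence and superimposes it on the host audio. The parameter names alpha (watermark amplitude), chips_per_bit, and seed are illustrative assumptions, not values prescribed by any standard; a practical DSSS embedder would additionally shape the watermark with a psychoacoustic model.

```python
import numpy as np

def embed_dsss(x, bits, alpha=0.005, chips_per_bit=8192, seed=42):
    """Spread each watermark bit with a key-seeded pseudo-noise (PN) sequence
    and superimpose the resulting wideband, noise-like signal on the host
    audio x: a simplified instance of Eq. (11.29), x_w = P(x, w)."""
    rng = np.random.default_rng(seed)            # the secret key seeds the PN generator
    pn = rng.choice([-1.0, 1.0], size=chips_per_bit * len(bits))
    symbols = np.repeat(np.where(np.asarray(bits) > 0, 1.0, -1.0), chips_per_bit)
    w = alpha * pn * symbols                     # spread, low-amplitude watermark
    xw = np.array(x, dtype=float)
    n = min(len(w), len(xw))
    xw[:n] += w[:n]                              # embed into the leading samples
    return xw
```

For example, embed_dsss(audio, [1, 0, 1]) would hide three bits; the amplitude alpha controls the robustness/inaudibility trade-off, consistent with the requirement that the security lie in the key (here, the seed) rather than in the algorithm.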

Figure 11.16(b) illustrates the watermark decoding procedure. The secret key and the design parameters employed in watermark embedding are available during the watermark retrieval process. Two common scenarios of watermark retrieval are given in (11.30): one in which the input audio is employed for decoding, and one in which the


input audio is not available for watermark detection. The latter case results in blind watermark detection.

Case I (input audio required for decoding):   w′ = ς(x_w, x)
Case II (input audio NOT required):           w′ = ς(x_w)    (11.30)

where w′ is the decoded watermark and ς represents the watermark retrieval function. Note that, in the case of perfect watermark retrieval, w′ = w. Watermark retrieval must be unambiguous; in other words, the watermark detection must provide reliable details about the content ownership. In particular, an audio watermark must be imperceptible, statistically undetectable, remain in the audio after repeated encoding/copying, and be robust to a variety of signal processing manipulations and deliberate attacks. For a DAW scheme to be successful in copyright management and audio content protection applications, it should have some important attributes, which are described next.
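Before turning to those attributes, the sketch below illustrates the blind case of Eq. (11.30) (Case II), w′ = ς(x_w), as a companion to the embedding sketch given earlier: the detector regenerates the PN sequence from the shared key and despreads the received audio by correlation. It assumes the same illustrative parameters as the embedding sketch, that the host audio is roughly uncorrelated with the PN sequence, and it ignores the synchronization and attack-robustness issues discussed in this section.

```python
import numpy as np

def detect_dsss(xw, n_bits, chips_per_bit=8192, seed=42):
    """Blind retrieval w' = sigma(x_w): regenerate the PN sequence from the
    shared key and despread the received audio by correlation, one spreading
    block per embedded bit."""
    rng = np.random.default_rng(seed)            # same key as the embedder
    pn = rng.choice([-1.0, 1.0], size=chips_per_bit * n_bits)
    bits = []
    for k in range(n_bits):
        seg = np.asarray(xw[k * chips_per_bit:(k + 1) * chips_per_bit], dtype=float)
        chips = pn[k * chips_per_bit:(k + 1) * chips_per_bit][:len(seg)]
        corr = float(np.dot(seg, chips))         # despreading correlation
        bits.append(1 if corr > 0 else 0)
    return bits
```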

11.5.3 DAW Schemes – Attributes

11.5.3.1 Imperceptibility or Perceptual Transparency. Watermarks embedded in the digital audio must be perceptually transparent and should not result in any audible artifacts. Watermarks can be inserted in either perceptually significant or insignificant regions, depending upon the application. If inserted in a perceptually significant region, special care must be taken for the watermarks to be inaudible. Inaudibility can be guaranteed by placing watermark signals in audio segments whose low-frequency components have higher energy. Note that this technique is based on the frequency-masking characteristics of the HAS that we discussed in Chapter 5. On the other hand, if embedded in perceptually insignificant regions, watermarks become vulnerable to compression algorithms. Therefore, watermark designers must consider the trade-offs between robustness and imperceptibility while designing a watermarking scheme. Another important factor is the optimum energy of the watermark signal required for efficient watermark detection. Watermarks with higher energy possess some inherent advantages, such as reliable watermark detection and robustness to intentional and unintentional signal processing operations. However, it should be noted that an increase in the audio watermark energy may result in audible artifacts.
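As a rough illustration of the energy trade-off just described, the sketch below derives a per-frame watermark gain from the fraction of frame energy below a low-frequency cutoff, so that more watermark energy is spent where strong low-frequency content can provide masking. The 500-Hz cutoff, frame length, and base gain are hypothetical choices; a real system would rely on a full psychoacoustic model rather than this crude energy ratio.

```python
import numpy as np

def frame_watermark_gains(x, fs=44100, frame_len=2048,
                          low_cutoff_hz=500.0, base_gain=0.01):
    """For each analysis frame, estimate the share of energy below
    low_cutoff_hz and scale the per-frame watermark gain with it
    (a crude stand-in for a frequency-masking model)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    n_frames = len(x) // frame_len
    gains = np.zeros(n_frames)
    for k in range(n_frames):
        frame = x[k * frame_len:(k + 1) * frame_len]
        power = np.abs(np.fft.rfft(frame * window)) ** 2
        low = power[freqs < low_cutoff_hz].sum()
        total = power.sum() + 1e-12
        gains[k] = base_gain * np.sqrt(low / total)   # more masking -> larger gain
    return gains
```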

11.5.3.2 Statistical Transparency. Consider a set of audio signals x, y, z, . . . , and a watermark signal w. Let the watermarked signals be denoted as x_w, y_w, z_w, . . . . Watermarked signals must be statistically undetectable: watermarks are designed such that users are not able to detect the embedded watermark by performing some statistical signal processing operation (e.g., correlation, autoregressive estimation,


etc.). This scenario can be treated as a special case of intentional or collusion attacks that employ statistical methods.

11.5.3.3 Robustness to Signal Processing Manipulations. Watermarks are embedded directly in the audio stream; in fact, watermark bits are distributed (typically at 2–128 b/s) uniformly across the audio bitstream. An audio signal undergoes several changes on its way from a transmitter/encoder to a receiver/decoder. Therefore, watermark robustness to a variety of signal processing operations is very important. Common signal processing operations typically considered for the robustness of an audio watermark include linear and nonlinear filtering, resampling, requantization, A-to-D and D-to-A conversions, cropping, MP3 compression, low-pass/high-pass filtering, dithering, bass and treble control changes, etc. Watermark robustness can also be improved by increasing the number of watermark bits; nonetheless, this results in increased encoding bit rates. Spread-spectrum techniques offer good robustness, however, with increased computational complexity and synchronization problems.

11.6 SUMMARY OF COMMERCIAL APPLICATIONS

Some of the popular applications (Table 11.4) for embedded audio coding include digital broadcast audio (DBA) [Stol93b] [Jurg96], direct broadcast satellite (DBS) [Prit90], digital versatile disc (DVD) [Crav96], high-definition television (HDTV) [USAT95b], cinema [Todd94], electronic distribution of music [Bran02], and audio-on-demand over wide area networks such as the Internet [Diet96]. Audio coding has also enabled miniaturization of digital audio storage media, such as the compact MiniDisc [Yosh94]. With the advent of the MP3 audio format, which denotes audio files that have been compressed using the MPEG-1 layer III algorithm, perceptual audio coding has become of central importance to over-network exchange of multimedia information. Moreover, the MP3 algorithm has been integrated into several popular portable consumer audio playback devices that are specifically designed for web compatibility. Streaming audio codecs have been developed by several commercial entities; example codecs include RealAudio and a2b audio. In addition, DolbyNET [DOLBY], a version of the Dolby AC-3 algorithm modified to accommodate Internet streaming [Vern99], has been successfully integrated into streaming audio processors for delivery of audio-on-demand to the desktop web browser. Moreover, some of the MPEG-4 audio tools enable real-time audio transmission over packet-switched networks such as the Internet [Diet96] [Ben99] [Liu99] [Zhan00], transparent compression at lower bit rates with reduced complexity, low-delay coding, and enhanced bit-error robustness. MPEG-7 audio supports multimedia indexing and searching, multimedia editing, broadcast media selection, and multimedia digital library sorting.

New audio coding algorithms must satisfy several requirements. First, consumers expect high quality at low to medium bit rates. Second, the popularity of streaming audio requires that algorithms intended for packet-switched networks

Table 11.4 Audio coding standards and applications

Algorithm | Applications | References
Sony ATRAC | MiniDisc | [Yosh94]
Lucent PAC | DBA, 128–160 kb/s | [John95]
Philips PASC | Digital audio tape (DAT) | [Lokh91] [Lokh92]
Dolby AC-2 | Digital broadcast audio (DBA) | [Todd84]
Dolby AC-3 | Cinema, HDTV, DVD | [Todd94]
Audio Processing Technology APT-X100 | Cinema | [Wyli96a] [Wyli96b]
DTS coherent acoustics | Cinema | [Smyt96] [Bosi00]
MPEG-1 L I-III | "MP3," L III; DBA, L II, 256 kb/s; DBS, L II, 224 kb/s; DCC, L I, 384 kb/s | [ISOI92]
MPEG-2 BC-LSF | Cinema | [ISOI94a]
MPEG-2 AAC | Internet (www), e.g., LiquidAudio, a2b audio | [ISOI96a] [ISOI97b]
MPEG-4 | Low bit rate, low delay, and scalable audio encoding; streaming audio applications | [ISOI97a] [ISOI99] [ISOI00]
MPEG-7 | Broadcast media selection, multimedia indexing, searching, editing | [ISOI01b] [ISOI01d]
Direct-stream digital (DSD) encoding | Super Audio CD encoding | [Reef01a] [Jans03]
Meridian lossless packing (MLP) | DVD-Audio | [Gerz99] [Kuzu03]


are able to deal with a highly time-varying channel. Therefore, new algorithms must include facilities for scalable operation, i.e., the ability to decode only a subset of the overall bitstream with minimal loss of quality. Finally, for any algorithm to be viable, it should be possible to implement the decoder in real time on a general-purpose processor architecture. Some of the important aspects that have been influential over the years in the design of an audio codec are listed below.

• Streaming audio applications
• Low-delay audio coding
• Error robustness
• Hybrid speech/audio coding structures
• Digital audio watermarking
• Perceptual audio quality measurements
• Bit rate scalability
• Complexity issues
• Psychoacoustic signal analysis techniques
• Parametric models in audio coding
• Lossless audio compression
• Interoperable multimedia framework
• Intellectual property management and protection

Development of sophisticated audio coding algorithms that deliver rate-scalable and quality-scalable compression in error-prone channels received attention during the late 1990s. Several researchers developed structured sound representations oriented specifically towards low-bandwidth and native-signal-processing sound synthesis applications. For example, NetSound [Case96] is a sound and music description system in which sound streams are described by decomposition into a sound-specification description, which represents arbitrarily complex signal processing algorithms, as well as an event list comprising scores or MIDI files. The advantage of this approach over streaming codecs is that only a very small amount of data is required to transmit complex instrument descriptions and appropriately parameterized scores. The trade-off with respect to streaming codecs, however, is that significant client-side computing resources are required during real-time synthesis. Some examples of synthesis techniques used by NetSound include wavetable, FM, phase-vocoder, and additive synthesis. Scheirer proposed a paradigm called generalized audio coding [Sche01], in which Structured Audio (SA) encompasses all other audio coding techniques. Moreover, the MPEG-4 SA tool offers a flexible framework in which both lossy and lossless audio compression can be realized. A variety of lossless audio coding features have been incorporated in the MPEG-4 audio standard [Sche01] [Mori02a] [Vasi02].

Although streaming audio has achieved widespread acceptance on 28.8 kb/s dial-up telephone connections, consistent high-quality output is still difficult.


Network-specific design considerations motivated research into joint source-channel coding [Ben99] for audio over the Internet. Another emerging trend is one of convergence between low-rate audio coding algorithms and speech coders that are increasingly embedding mechanisms to exploit perceptual irrelevancies [Malv98] [Mori98] [Ramp98] [Ramp99] [Trin99a] [Trin99b] [Maki00]. Several wideband speech and audio coders based on the CS-ACELP ITU-T G.729 standard have been proposed [Kois00] [Kuma00]. Low delay [Jbir98] [Alla99] [Dorw01] [Schu02] [Schu05], enhanced bit-error robustness [ISOI00] [Sper00] [Sper02], and scalability [Bran94b] [Gril95] [Span97] [Gril98] [Hans98] [Herr98] [Jin99] [Zhou01] [Dong02] [Kim02] [Mori02b] have also become very important. In particular, the MPEG-4 bit rate scalability tool provides the encoder with options to transmit bitstreams at variable bit rates. Note that scalable audio coders have several layers, i.e., a core layer and a series of enhancement layers; the core layer encodes the main audio stream, while the enhancement layers provide further resolution and scalability. Scalable algorithms will ultimately be used to accommodate the unique challenges associated with audio transmission over time-varying channels such as packet-switched networks. Several ideas that provide a link between the perceptual and the lossless audio coding techniques [Geig02] [Mori00] [Schu01] [Mori02a] [Mori02b] [Geig03] [Mori03] have also been proposed.
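The core-plus-enhancement-layer idea can be illustrated with the toy residual-quantization sketch below. It is not the MPEG-4 scalability tool itself, which operates on transform coefficients with perceptual bit allocation; the layer count and step sizes are arbitrary assumptions. Each enhancement layer re-quantizes the residual left by the previous layers, and the decoder may stop after any number of received layers.

```python
import numpy as np

def encode_layers(x, n_layers=3, base_step=0.1):
    """Encode x as a coarse core layer plus enhancement layers, each layer
    quantizing the residual left by the layers before it."""
    layers, residual, step = [], np.asarray(x, dtype=float), base_step
    for _ in range(n_layers):
        q = np.round(residual / step)      # quantized layer payload
        layers.append((step, q))
        residual = residual - q * step     # error handed to the next layer
        step /= 4.0                        # each layer uses a finer quantizer
    return layers

def decode_layers(layers, n_received):
    """Decode only the first n_received layers; dropping enhancement layers
    degrades quality gracefully on a time-varying channel."""
    return sum(step * q for step, q in layers[:n_received])
```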

Hybrid coding algorithms that make use of specific characteristics of a signal while operating over a range of bit rates became popular [Bola98] [Deri99] [Rong99] [Maki00] [Ning02]. Some experimental work performed in the context of MPEG-4 standardization has demonstrated that a cascaded hybrid sinusoidal/time-frequency coder can not only meet but in some cases even exceed the output quality achieved by the time-frequency coder alone at the same bit rate, for certain test signals [Edle96b]. Sinusoidal signal models, due to their inherent suitability for scalable audio coding, gained significant research interest during the late 1990s [Taor99] [Pain00] [Ferr01a] [Ferr01b] [Pain01] [Raad01] [Herm02] [Heus02] [Pain02] [Purn02] [Shao02] [Pain05].

Potential improvements for the various perceptual coder building blocks, such as novel filter banks for low-delay coding and reduced pre-echo [Schu98] [Schu00], efficient bit-allocation strategies [Yang03], and new psychoacoustic signal analysis techniques [Baum98] [Huan99] [Vande02], have been considered. Researchers have also investigated new algorithms for tasks of peripheral interest to perceptual audio coding, such as transform-domain signal modifications [Lanc99] and digital watermarking [Neub98] [Tewf99].

In order to offer audiophiles listening experiences that promise to be more realistic than ever, audio codec designers have pursued several sophisticated multichannel audio coding techniques [DOLBY] [Bosi93] [Holm99] [Bosi00]. In the mid-1990s, discrete encoding, i.e., 5.1 separate channels of audio, was introduced by Dolby Laboratories and Digital Theater Systems (DTS). The human auditory system allows hearing in substantially more directions than what current multichannel audio systems offer; therefore, future research will focus on overcoming the limitations of existing multichannel systems. In particular,


research involving spatial audio, real-time acoustic source localization, binaural cue coding, and the application of head-related transfer functions (HRTF) towards rendering immersive audio [Sand01] [Fall02a] [Fall02b] [Kimu02] [Yang02] has been considered.

PROBLEMS

11.1 What are the differences between the SACD and DVD-A formats? Explain using diagrams. Provide bit rates.

11.2 Use diagrams and describe the MLP algorithm employed in the DVD-A.

11.3 How is the direct stream digital (DSD) technology employed in the SACD different from other encoding schemes? Describe using a block diagram.

11.4 What is the main idea behind the Integer MDCT and its application in lossless audio coding? Explain, in a concise manner and by using mathematical arguments, the difference between the Integer MDCT (IntMDCT) and the MDCT. Show the utility of the FFT in both.

11.5 Explain, using an example, the lossless matrixing employed in the MLP. How is this scheme different from the Givens rotations employed in the IntMDCT?

11.6 List some of the key differences between the lossless transform audio coder (LTAC) and the IntMDCT algorithm.

11.7 Survey the literature for studies on fast algorithms for the IntMDCT and describe at least one such algorithm. Show how computational savings are achieved.

11.8 Survey the literature for studies on the properties of the IntMDCT. Discuss, in particular, properties dealing with preservation of energy/power (whichever is appropriate) and transform pair uniqueness.

11.9 Describe the concept of noise shaping. Explain how it is applied in the DSD algorithm.

COMPUTER EXERCISE

11.10 Write MATLAB code to implement an IntMDCT. Profile the algorithm and compare the computational complexity against a standard MDCT.

CHAPTER 12

QUALITY MEASURES FOR PERCEPTUAL AUDIO CODING

12.1 INTRODUCTION

In many situations, and particularly in the context of standardization activities, performance measures are needed to evaluate whether one of the established or emerging techniques is in some sense superior to other alternative methods. Perceptual audio codecs are most often evaluated in terms of bit rate, complexity, delay, robustness, and output quality. Reliable and repeatable output quality assessment (which is related to robustness) presents a significant challenge. It is well known that perceptual coders can achieve transparent quality over a very broad, highly signal-dependent range of segmental SNRs, ranging from as low as 13 dB to as high as 90 dB. Classical objective measures of signal fidelity, such as signal-to-noise ratio (SNR) or total harmonic distortion (THD), are therefore inadequate [Ryde96]. As a result, time-consuming and expensive subjective listening tests are required to measure the small impairments that often characterize perceptual coding algorithms. Despite some confounding factors, subjective listening tests are nevertheless the most reliable tool available for codec quality evaluation, and standardized listening test procedures have been developed to maximize reliability. Research into improved subjective testing methodologies is also quite active. At the same time, considerable research is being devoted to the development of automatic perceptual measurement systems that can accurately predict the outcomes of subjective listening tests. Several algorithms have been proposed in the last two decades, and the ITU-R has recently adopted an




international standard after combining several of the best techniques proposed in the literature.

This chapter offers a perspective on quality measures for perceptual audio coding. The first half of the chapter is concerned with subjective evaluations of perceptual codec quality. Sections 12.2 through 12.5 describe subjective quality measurement techniques, identify confounding factors that complicate subjective tests, and give sample subjective test results that facilitate comparisons between several 2- and 5.1-channel standards. The second half of the chapter reviews fundamental techniques and significant developments in automatic perceptual measurement systems, with particular attention given to several of the candidates from the ITU-R standardization process. In particular, Sections 12.6 and 12.7, respectively, describe perceptual measurement systems and the classes of algorithms that have been proposed for standardization. Section 12.8 describes the recently completed ITU-R TG 10/4, which led to the adoption of the ITU-R Rec. BS.1387 algorithm for perceptual measurement. Section 9.10 also provides some information on the ITU-T P.861. Section 12.9 offers some final remarks on anticipated future developments in quality measures for perceptual codecs.

12.2 SUBJECTIVE QUALITY MEASURES

Although listening tests are often conducted informally, the ITU-R Recommendation BS.1116 [ITUR94b] formally specifies a listening environment and test procedure appropriate for subjective evaluations of the small impairments associated with high-quality audio codecs. The standard procedure calls for grading by expert listeners [Bech92] using the CCIR continuous impairment scale (Figure 12.1(a)) [ITUR90] in a double-blind, A-B-C triple-stimulus, hidden-reference comparison paradigm. While stimulus A always contains the reference (uncoded) signal, the B and C stimuli contain, in random order, a repetition of the reference and then the impaired (coded) signal, i.e., either B or C is a hidden reference. After listening to all three, the subject must identify either B or C as the hidden reference and then grade the impaired stimulus (coded signal) relative to the reference stimulus using the five-category, 41-point "continuous" absolute category rating (ACR) impairment scale shown in the left-hand column of Figure 12.1(a). From best to worst, the five ACR ranges rate the coding distortion as "imperceptible (5.0)," "perceptible but not annoying (4.0–4.9)," "slightly annoying (3.0–3.9)," "annoying (2.0–2.9)," or "very annoying (1.0–1.9)." A default grade of 5.0 is assigned to the stimulus identified by the subject as the hidden reference. A subjective difference grade (SDG) is computed by subtracting the score assigned to the actual hidden reference from the score assigned to the actual impaired signal. Thus, negative difference grades are obtained when the subject identifies the hidden reference correctly, and positive difference grades are obtained if the subject misidentifies the hidden reference. Over many subjects and many trials, mean impairment scores are calculated and used to evaluate codec performance relative to the ideal of transparency. It is important to notice the difference between the small impairment subjective measurements in [ITUR94b]


and the five-point mean opinion score (MOS) most often associated with speech coding algorithms [ITUT94]. Unlike the small impairment scale, the scale of the speech coding MOS is discrete, and scores are absolute rather than relative to a hidden reference. To emphasize this difference, it has been proposed [Spor96] that "mean subjective score" (MSS) denote the small impairment subjective score for perceptual audio coders. During data reduction, MSS and MSS standard deviations are computed for both the coded signals and the hidden references. An MSS for the hidden reference different from 5.0 means that subjects have difficulty distinguishing the hidden reference from the coded signal.

In fact, nearly transparent quality for the coded signal is implied if the hidden reference MSS lies within the 95% confidence interval of the coded signal and the coded signal MSS lies within the 95% confidence interval of the hidden reference. Unless otherwise specified, the subjective listening test scores cited for the various algorithms described throughout this text are from either the absolute or the differential small impairment scales in Figure 12.1(a).
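A small sketch of the bookkeeping just described: SDGs are formed per trial by subtracting the grade given to the actual hidden reference from the grade given to the actual coded item, and mean scores with 95% confidence intervals are then compared. The normal-approximation interval used here is an illustrative choice, not a procedure mandated by BS.1116.

```python
import numpy as np

def sdg(grade_coded, grade_hidden_ref):
    """Subjective difference grade for one trial: grade assigned to the actual
    coded item minus grade assigned to the actual hidden reference (negative
    when the listener correctly identified the reference)."""
    return grade_coded - grade_hidden_ref

def mean_and_ci95(scores):
    """Mean grade and a normal-approximation 95% confidence interval."""
    s = np.asarray(scores, dtype=float)
    mean = s.mean()
    half = 1.96 * s.std(ddof=1) / np.sqrt(len(s))
    return mean, (mean - half, mean + half)
```

For example, sdg(4.3, 5.0) gives −0.7; near-transparency is suggested when each signal's mean lies inside the other's confidence interval, as described above.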

It is important to realize that the most reliable subjective evaluation strategy for a given perceptual codec depends on the nature of the coding distortion. Although the small-scale impairments associated with nearly transparent coding are well characterized by measurements relative to a reference standard using a fine-grade scale, some experts have argued that the more audible distortions associated with nontransparent coding are best measured using a different scale that can better cope with large impairments.

For example, in listening tests [Keyh98] on 16-kb/s codecs for the WorldSpace satellite communications system, it was determined that an ITU-T P.800/P.830

Figure 12.1 Subjective quality scales: (a) ITU-R Rec. BS.1116 [ITUR94b] small impairment scale for absolute and differential subjective quality grades; (b) ITU-T Rec. P.800/P.830 [ITUR96] large impairment comparison category rating (CCR).


seven-point comparison category rating (CCR) method [ITUR96] was better suited to the evaluation task (Figure 12.1(b)) than the scale of BS.1116 because of the nontransparent quality associated with the test signal. Investigators preferred the CCR over both the small impairment scale and the five-point absolute category rating (ACR) commonly used in tests of speech codecs. A listening test standard for large-scale impairments analogous to BS.1116 does not yet exist for audio codec evaluation. This is likely to change in light of the growing popularity, for network and wireless applications, of low-rate nontransparent audio codecs characterized at times by distinctly audible (not small-scale) artifacts.

12.3 CONFOUNDING FACTORS IN SUBJECTIVE EVALUATIONS

Regardless of the particular grading scale in use, subjective test outcomes generated using even rigorous methodologies such as ITU-R BS.1116 are still influenced by factors such as context, site selection, and individual listener acuity (physical) or preference (cognitive). Before comparing subjective test results on particular codecs, therefore, one should be prepared to interpret the subjective scores with some care. For example, consider the variability of "expert" listeners. A study of decision strategies [Prec97] using multidimensional scaling techniques [Schi81] found that subjects disagree on the relative importance with which to weigh perceptual criteria during impairment detection tasks. In another study [Shli96], Shlien and Soulodre presented experimental evidence that can be interpreted as a repudiation of the "golden ear." Expert listeners were tasked with discriminating between clean audio and audio corrupted by low-level artifacts typically induced by audio codecs (five types were analyzed in [Miln92]), including pre-echo distortion, unmasked granular (quantization) noise, and high-frequency boost or attenuation. Different experts were sensitive to different artifact types. For instance, a subject strongly sensitive to pre-echo distortion was not necessarily proficient when asked to detect high-frequency gain or attenuation. In fact, artifact sensitivities were linked to psychophysical measurement profiles. Because test subjects tended to be sensitive to one impairment type but not to others, the ability of an individual to perform as an expert listener depended upon the type of artifacts to be detected. Sporer [Spor96] reached similar conclusions after yet a third study of expert listeners. A statistical analysis of the results indicated that significant site and subject differences exist, that even well-trained listeners cannot necessarily repeat their own results (self-recalibration), and that some subjects are sensitive to specific artifacts but insensitive to others. The numerous confounding factors in subjective evaluations perhaps point to the need for a fundamental change in the evaluation of perceptual codecs. Sporer [Spor96] suggested that codecs could be evaluated on the basis of specific artifact perceptibility; appropriate artifact-sensitive subjects could then be matched to the test material. It has also been suggested that the influence of individual nonrepeatability could be minimized with larger subject pools. Nonhuman factors also influence subjective listening test outcomes. For example, playback level (SPL) and background noise, respectively, can influence excitation pattern shapes and


introduce undesired masking effects. While both SPL and background noise can be replicated in different listening test environments, other environmental factors are not as easy to replicate. The presentation method can strongly influence perceived quality, because loudspeakers introduce distortions on their own and in conjunction with a listening room. These effects can introduce site dependencies. In short, although they have proven effective, existing subjective test procedures for audio codecs are clearly suboptimal. Recent research into more reliable tools for subjective codec evaluations has shown promise and is continuing. Confounding factors must be considered carefully when subjective test results are compared between codecs and between different test sites. These considerations have motivated a number of subjective tests in which multiple algorithms were tested at the same site under identical listening conditions. In Sections 12.4 and 12.5, we will consider two example tests, one for 2-channel codecs and the other for 5.1-channel algorithms.

12.4 SUBJECTIVE EVALUATIONS OF TWO-CHANNEL STANDARDIZED CODECS

The influence of site and subject dependencies on subjective listening tests can potentially invalidate direct comparisons between independent test results for different algorithms. Ideally, fair intercodec comparisons require that scores are obtained from a single site with the same test subjects. Soulodre et al. conducted a formal ITU-R BS.1116-compliant [ITUR94b] listening test that compared several standardized two-channel stereo codecs [Soul98], including the MPEG-1 layer II [ISOI92], the MPEG-1 layer III [ISOI92], the MPEG-2 AAC [ISOI96a], the Lucent Technologies PAC [John95], and the Dolby AC-3 [Fiel96] codecs, over a variety of bit rates between 64 and 192 kb/s per stereo pair. The AAC algorithm was tested in the main complexity profile, but with the dynamic window shape switching and intensity coding tools disabled. The MPEG-1 layer II codec was tested in both software simulation and a real-time implementation ("ITIS"). In all, 17 algorithm/bit rate combinations were examined in the tests. Listening material was selected from a library of 80 items deemed critical by experts, and ultimately the two most critical items were chosen for each codec tested. Mean difference grades were computed over the set of results from 21 expert listeners, after three of the original subjects were eliminated from consideration due to their statistically unreliable performance (t-test disqualification).

The test results, reproduced in Table 12.1, clearly show eight performance classes. The AAC and AC-3 codecs, at 128 and 192 kb/s, respectively, exhibited the best performance, with mean difference grades better than −1.0. The MPEG-2 AAC algorithm at 128 kb/s, however, was the only codec that satisfied the quality requirements defined by ITU-R Rec. BS.1115 [ITUR97] for perceptual audio coding systems in broadcast applications, namely, that no audio material may be rated below −1.00. Overall, the ranking of the families from best to worst with respect to quality was AAC, PAC, MPEG-1 layer III, AC-3, MPEG-1 layer II, and ITIS (MPEG-1 L II hardware implementation). The trend


is exemplified in Table 12.2, where the codec families are ranked in terms of SDGs for a fixed bit rate of 128 kb/s. The class-three results can be interpreted to mean that bit rate increases of 32, 64, and 96 kb/s per stereo pair are required for the PAC, AC-3, and layer II codec families, respectively, to match the output quality of the MPEG-2 AAC at 96 kb/s per stereo pair.

12.5 SUBJECTIVE EVALUATIONS OF 5.1-CHANNEL STANDARDIZED CODECS

In addition to two-channel stereo systems, multichannel perceptual audio coders are increasingly in demand for multimedia, cinema, and home theater

Table 12.1 Comparison of standardized two-channel algorithms (after [Soul98])

Group | Algorithm | Rate (kb/s) | Mean diff. grade | Transparent items | Items below −1.00
1 | AAC | 128 | −0.47 | 1 | 0
1 | AC-3 | 192 | −0.52 | 1 | 1
2 | PAC | 160 | −0.82 | 1 | 3
3 | PAC | 128 | −1.03 | 1 | 4
3 | AC-3 | 160 | −1.04 | 0 | 4
3 | AAC | 96 | −1.15 | 0 | 5
3 | MP-1 L2 | 192 | −1.18 | 0 | 5
4 | ITIS | 192 | −1.38 | 0 | 6
5 | MP-1 L3 | 128 | −1.73 | 0 | 6
5 | MP-1 L2 | 160 | −1.75 | 0 | 7
5 | PAC | 96 | −1.83 | 0 | 6
5 | ITIS | 160 | −1.84 | 0 | 6
6 | AC-3 | 128 | −2.11 | 0 | 8
6 | MP-1 L2 | 128 | −2.14 | 0 | 8
6 | ITIS | 128 | −2.21 | 0 | 7
7 | PAC | 64 | −3.09 | 0 | 8
8 | ITIS | 96 | −3.32 | 0 | 8

Table 12.2 Comparison of standardized two-channel algorithms at 128 kb/s (after [Soul98])

Rank | Algorithm (128 kb/s) | Mean diff. grade
1 | AAC | −0.47
2 | PAC | −1.03
3 | MP-1 L3 | −1.73
4 | AC-3 | −2.11
5 | MP-1 L2 | −2.14
6 | ITIS | −2.21


applications. As a result, the European Broadcasting Union (EBU) sponsored Deutsche Telekom Berkom in a formal subjective evaluation [Wust98] that compared the output quality of real-time implementations of the 5.1-channel Dolby AC-3 and the matrixed 5.1-channel MPEG-2 BC layer II algorithms at bit rates between 384 and 640 kb/s (Table 12.3). The tests adhered to the methodologies outlined in ITU-R BS.1116, and the five-channel listening environment was configured according to ITU-R Rec. BS.775 [ITUR93a]. Mean subjective difference grades were computed over the scores from 32 experts, after a t-test analysis was used to discard inconsistent results from seven subjects.

The resulting difference grades, given in Table 12.3, represent averages of the mean grades reported for a collection of eight critical test items. Even though none of the tested codec configurations satisfied "transparency," the MPEG-2 BC layer II had only one nontransparent test item at the 640 kb/s bit rate. On the other hand, the Dolby AC-3 outperformed MPEG-2 BC layer II at 384 kb/s by more than half a difference grade. In terms of specific weaknesses, the "applause around" item was most critical for AC-3 (e.g., −2.41 at 384 kb/s), whereas the "pitch pipe" item was most critical for MPEG-2 BC layer II (e.g., −3.56 at 384 kb/s).

12.6 SUBJECTIVE EVALUATIONS USING PERCEPTUAL MEASUREMENT SYSTEMS

Even well-designed and carefully controlled experiments [ITUR94b] are susceptible to problems with reliability, repeatability, site dependence, and subject dependence [Grew91]. Moreover, subjective tests are difficult, time consuming, tedious, and expensive. These circumstances have inspired researchers to seek automatic perceptual measurement techniques that can evaluate coding distortion on a subjective impairment scale, with the ultimate objective being the replacement of human test subjects. Ideally, a perceptual measurement system should answer the following questions:

• Is the codec transparent for all sources?
• If so, what is the "coding margin," or the degree of distortion inaudibility?
• If not, how disturbing are the artifacts on a meaningful subjective scale?

Table 12.3 Comparison of standardized 5.1-channel algorithms

Group | Algorithm | Rate (kb/s) | Mean diff. grade
1 | MP-2 BC | 640 | −0.51
2 | AC-3 | 448 | −0.93
2 | MP-2 BC | 512 | −0.99
3 | AC-3 | 384 | −1.17
3 | MP-2 BC | 384 | −1.73


In other words, the system should quantify overall subjective quality, provide a distance measure for the coding margin ("perceptual headroom"), and quantify the audibility of unmasked distortion. Many researchers have proposed measurement systems that attempt to answer these questions. The resulting system architectures can be broadly categorized into two classes (Figure 12.2), namely, systems that compare internal auditory representations (CIR) and systems that perform noise signal evaluation (NSE).

12.6.1 CIR Perceptual Measurement Schemes

The CIR systems (Figure 12.2(a)) generate independent auditory images of the original and coded signals and then compare them. The degree of audible difference between the auditory images is then evaluated using either a deterministic difference thresholding scheme or a probabilistic detector. The auditory images in the figure could represent excitation patterns, loudness patterns, or some other perceptually relevant physiological quantity, such as neural firing rates, for example. CIR systems avoid explicit masking threshold calculations by capturing the essence of cochlear signal processing in a filter bank combined with a series of post-filter-bank operations on each cochlear channel. Moreover, CIR systems do not require any explicit knowledge of an error signal.

12.6.2 NSE Perceptual Measurement Schemes

The NSE systems (Figure 12.2(b)), on the other hand, require an explicit representation of the coding distortion. The idea behind NSE systems is to analyze the coding distortion with respect to properties of the auditory system. Typical NSE systems explicitly estimate the masked threshold for each short-time signal

Figure 12.2 Perceptual measurement system architectures: (a) comparison of internal representations (CIR); (b) noise signal evaluation (NSE).


segment, using the properties of simultaneous and temporal masking, as is usually done in perceptual audio coders. Then, the distribution of noise energy over time and frequency is compared against the corresponding masked threshold. In both the CIR and NSE scenarios, the automatic measurement system seeks to emulate the signal processing behavior of the human ear and/or the cognitive processing of auditory stimuli in the human listener. Emulation of the human ear for the purpose of making measurements requires a filter bank with time and frequency resolution similar to the auditory filter bank [Spor95a]. Although it is possible to model the characteristics of individual listeners [Treu96], the usual approach is to match the acuity of average subjects. Frequency resolution should be sufficient to model excitation pattern or hair cell selectivity for NSE and CIR algorithms, respectively. For modeling temporal effects, time resolution should be at least 3 ms below 300 Hz and at least 1.5 ms at higher frequencies. Fortunately, a great deal of information about the auditory filter bank is available to assist in the modeling task. The primary challenge is to design a system of manageable complexity while adequately emulating the ear's time-frequency analysis properties. Cognitive effects, however, are less well understood and more difficult to model. Cognitive weighting of perceptual criteria can be quite different across listeners [Prec97]. While some subjects might interpret single strong distortions, such as clicks, to be less annoying than small omnipresent distortions, the converse can also be true. Moreover, impairments are asymmetrically weighted: subjects tend to forget distortions at the beginning of a sample but remember those occurring at the end. Other subjective preferences can also affect outcomes. Some subjects prefer bandwidth limiting ("dullness") over low-level artifacts at high frequencies, but others would prefer greater "brilliance" at the expense of low-level artifacts. While these issues present significant modeling challenges, they can also undermine the utility of subjective listening tests, whereas an automatic measurement system guarantees a repeatable output. The next few sections are concerned with examples of particular CIR and NSE perceptual measurement schemes that have been proposed, including several that have been integrated into international standards.
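A minimal NSE-style check for a single frame is sketched below: the coding-error spectrum is measured and compared, bin by bin, against a masked threshold that is assumed to be supplied by an external psychoacoustic model. Computing that threshold, and aggregating the ratios over critical bands and frames as the full noise-to-mask ratio method does, is beyond this sketch; the frame length and windowing are illustrative choices.

```python
import numpy as np

def frame_nmr_db(ref_frame, coded_frame, masked_threshold, n_fft=1024):
    """NSE-style check for one frame: measure the coding-error spectrum and
    compare it bin by bin with a masked threshold (per-bin masking power of
    length n_fft // 2 + 1) supplied by an external psychoacoustic model."""
    window = np.hanning(n_fft)
    err = (np.asarray(coded_frame[:n_fft], dtype=float)
           - np.asarray(ref_frame[:n_fft], dtype=float)) * window
    err_power = np.abs(np.fft.rfft(err)) ** 2
    thr = np.asarray(masked_threshold, dtype=float)
    nmr_db = 10.0 * np.log10((err_power + 1e-20) / (thr + 1e-20))
    return nmr_db        # positive values flag potentially audible distortion
```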

12.7 ALGORITHMS FOR PERCEPTUAL MEASUREMENT

Numerous examples of both CIR- and NSE-based perceptual measurement schemes for speech and audio codec output quality evaluation have appeared in the literature since the late 1970s, with particular emphasis placed on those systems that can accurately predict subjective listening test results. In 1979, Schroeder, Atal, and Hall at Bell Labs proposed an NSE technique called the "noise loudness" (NL) metric for assessment of vocoder output quality. The NL system [Schr79] estimates output quality by computing a ratio of the coding distortion loudness to the loudness of the original signal. For both signal and noise, loudness patterns are derived every 20 ms from frequency-domain excitation patterns that are, in turn, computed from short-time critical band densities. The critical band densities are obtained from FFT-based spectral


estimates. The approach is hampered by insufficient analysis resolution at low frequencies. Karjalainen in 1985 proposed a CIR technique [Karj85] known as the "auditory spectrum distance" (ASD). Instead of using an FFT for spectral analysis, the ASD front end models the cochlear filter bank using a 48-channel bank of overlapping FIR filters with roughly 0.5 Bark spacing. The filter-bank channel magnitude responses are designed to model the smearing of spectral energy that produces the usual psychophysical excitation patterns. Then, a square-law post-processor and two low-pass filters are used to model hair cell rectification and temporal masking, respectively, for each channel. The ASD measure is derived from the maximum deviation between the post-processed filter-bank outputs for the reference and test signals. The ASD profile is analyzed over time and frequency to extract an estimate of perceptual quality.

As the field of perceptual audio coding matured rapidly and created greater demand for listening tests, there was a corresponding growth of interest in perceptual measurement schemes. Several NSE and CIR algorithms appeared in the space of the next few years, namely, the noise-to-mask ratio (NMR, 1987) [Bran87a], the perceptual audio quality measure (PAQM, 1991) [Beer91], the peripheral internal representation (PIR, 1991) [Stua93], the Bark transform (BT, 1992) [Kapu92], the perceptual evaluation (PERCEVAL, 1992) [Pail92], the perceptual objective measure (POM, 1995) [Colo95], the distortion index (DIX, 1996) [Thie96], and the objective audio signal evaluation (OASE, 1997) [Spor97]. After a proposal phase, the proponents of several of these measurement

systems ultimately collaborated on the development of the international standard on perceptual measurement, ITU-R BS.1387 [ITUR98] (1998). Additional schemes were reported in a special session on perceptual measurement systems for audio codecs during a 1992 Audio Engineering Society conference [Cabo92] [Stau92]. An NSE measure based on quantifying the dissonances associated with coding distortion was proposed in [Bank96].

The remainder of this chapter is intended to provide some functional insights on perceptual measurement schemes. Selected details of the perceptual audio quality measure (PAQM), the noise-to-mask ratio (NMR), and the objective audio signal evaluation (OASE) algorithms are given as representative examples of both the CIR and NSE methodologies.

12.7.1 Example: Perceptual Audio Quality Measure (PAQM)

Beerends and Stemerdink in 1991 proposed a perceptual measurement system [Beer91] known as the "perceptual audio quality measure" (PAQM) for evaluating audio device output quality relative to a reference signal. It was later shown that the parameters of the PAQM can be optimized for predicting the results of subjective listening tests on speech [Beer94a] and perceptual audio [Beer92a] [Beer92b] codecs. The PAQM (Figure 12.3) is a CIR-type algorithm that independently maps the original signal s(n) and the coded signal from the time domain to a corresponding pair of internal psychophysical representations. In particular, the original and coded signals are processed by a two-stage auditory signal processing model. This model captures mostly the peripheral

Hz

rarr B

ark

Mid

dle

Ear

TF

Exc

itatio

nS

calin

gLo

udne

ss

Com

pare

Win

dow

Mid

dle

Ear

TF

Exc

itatio

nLo

udne

ssT

FA

vera

ge

Λn

ss at

af

minusW

indo

w

Hz

rarr B

ark

Σ

gg

2F

FT

2F

FT

Fig

ure

123

Pe

rcep

tual

audi

oqu

ality

mea

sure

(PA

QM

)

393

394 QUALITY MEASURES FOR PERCEPTUAL AUDIO CODING

process of nonlinear time-frequency energy smearing [Vier86] followed by thepredominantly central process of nonlinear intensity compression [Pick88] Amodified mapping derived from Zwickerrsquos loudness model [Zwic67] [Zwic77]is used to transform the two internal representations to a compressed loudnessscale Then a difference is computed to quantify the noise disturbance associatedwith the coding distortion Ultimately a second transformation can be applied thatmaps the noise disturbance compressed loudness n to a predicted listening testsubjective difference grade Precise computational details of the PAQM algorithmappeared in the appendices of [Beer92b] The important steps however can besummarized as follows

Note that the sequence of operations is the same for both the reference and impaired signals. First, short-time spectral estimates are obtained using a 2048-point Hann-windowed FFT with 50% block overlap, yielding time resolution close to 20 ms and frequency resolution of about 22 Hz. A middle-ear transfer function is applied at the output of the FFT, and the frequency axis is warped from Hertz to Barks. To simulate the excitation patterns generated along the basilar membrane, the PAQM next smears the middle-ear signal energy across time and frequency in accordance with empirically derived psychophysical models. For frequency-domain smearing within one analysis window, Terhardt's [Terh79] narrow-band, level- and frequency-dependent excitation pattern approximation, with slopes defined by

S1 = 31 dB/Bark,   fm > f                                   (12.1a)

S2 = 22 + min(230/fm, 10) − 0.2L dB/Bark,   fm < f          (12.1b)

is used to spread the spectral energy across critical bands. Thus, the slope S1 controls the downward spread of excitation, the slope S2 controls the upward spread of excitation, and the parameters fm, L, and f correspond, respectively, to the masker center frequency in Hz, the masker level in units of dB SPL, and the frequencies of the excitation components. Within an analysis window, individual excitation patterns Mi are combined point-wise to form a composite pattern Mc using a parametric power law of the form

Mc = [ Σ_{i=1..N} Mi^α ]^(1/α),   α < 2                     (12.2)

where the parameter N corresponds to the number of individual excitation patterns being combined at a particular point, and the parameter α controls the power law. Power law addition has been shown to govern the combination of multiple simultaneous [Luft83] [Luft85] and temporal [Penn80] masking patterns, but the parameter α may be different along the frequency and time axes. We denote by αf the value of the parameter α used in PAQM for combining frequency-domain excitation patterns. The PAQM also models the temporal spread of excitation. Energy from the k-th block is scaled by an exponentially decaying gain g(t) with a frequency-dependent time constant τ(f), i.e.,

g(t) = e^(−t/τ(f))                                          (12.3)

where the time parameter t corresponds to the time distance between blocks, e.g., 20 ms. The scaled version of the k-th block energy is then combined with the energy of the (k + 1)-th (overlapping) block using a power law of the form given in Eq. (12.2), and the parameter α is set for the time domain, i.e., α = αt. It was found that the level dependence of temporal excitation spread could be disregarded for the purposes of PAQM without affecting significantly the measurement quality. Thus, the parameters S1, S2, and αf control the spread of frequency-domain signal energy, while the parameters τ(f) and αt control the signal energy dispersion in time. Once excitation patterns have been obtained, internal original and coded signal representations are generated by transforming, at each frequency, the excitation energy E into a measure of loudness using Zwicker's compressive excitation-to-loudness mapping [Zwic67], i.e.,

ℒ = k (E0/s)^γ [ (1 − s + s·E/E0)^γ − 1 ]                   (12.4)

where the parameter γ controls the compression characteristic, the parameter k is an arbitrary scaling constant, the constant E0 represents the amount of excitation in a tone of the same frequency at the absolute hearing threshold, and the constant s is a so-called "schwell" factor as defined in [Zwic67]. Although the mapping in Eq. (12.4) was proposed in the context of loudness estimation, the particular value of the parameter γ used in PAQM is different from the value empirically determined in Zwicker's experiments, and therefore the internal representations are not, strictly speaking, loudness measurements but are rather compressed loudness estimates. Indexed over Bark frequencies z and time t, the compressed loudness of the original signal, ℒs(z, t), and of the coded signal, ℒŝ(z, t), are compared in three Bark regions. Gains are extracted for each Bark subband in order to scale ℒŝ(z, t) such that the matching between ℒs(z, t) and the scaled version of ℒŝ(z, t) is maximized. The instantaneous frequency-localized noise disturbance ℒn(z, t) = ℒŝ(z, t) − ℒs(z, t), i.e., the difference between the internal representations, is integrated over the Bark scale and then across a suitable time interval to obtain a composite noise disturbance estimate ℒn. For the purposes of predicting the results of subjective listening tests, the PAQM parameters αf, αt, and γ can be adapted to minimize the standard deviation of a third-order regression line that fits the absolute performance grades from actual subjective listening tests to the PAQM output log(ℒn).
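The per-frame internal representation can be summarized with a minimal Python sketch. It assumes a single analysis frame is already available as a set of Bark-spaced component levels; the Bark-to-Hz map, the constants E0, s, and k, and the omission of gain matching and temporal smearing are all simplifications, while the defaults αf = 0.8 and γ = 0.04 correspond to the fitted values quoted in the next paragraph.

import numpy as np

def spread_excitation(levels_db, bark_axis, alpha_f=0.8):
    # Eqs. (12.1)/(12.2): spread each component and combine with a power law.
    bark_axis = np.asarray(bark_axis, dtype=float)
    patterns = []
    for z_m, L in zip(bark_axis, levels_db):
        dz = bark_axis - z_m
        f_m = 600.0 * np.sinh(z_m / 6.0)                 # assumed Bark-to-Hz map
        s1 = 31.0                                        # downward slope, dB/Bark
        s2 = 22.0 + min(230.0 / f_m, 10.0) - 0.2 * L     # upward slope, dB/Bark
        pat_db = np.where(dz < 0, L + s1 * dz, L - s2 * dz)
        patterns.append(10.0 ** (pat_db / 10.0))         # back to energy
    M = np.array(patterns)
    return np.sum(M ** alpha_f, axis=0) ** (1.0 / alpha_f)   # Eq. (12.2)

def compressed_loudness(E, E0=1.0, s_factor=0.5, gamma=0.04, k_scale=1.0):
    # Eq. (12.4), with illustrative constants; values below threshold are clipped.
    L = k_scale * (E0 / s_factor) ** gamma * \
        ((1.0 - s_factor + s_factor * E / E0) ** gamma - 1.0)
    return np.maximum(L, 0.0)

def noise_disturbance(E_ref, E_cod):
    # Difference of internal representations, integrated over the Bark scale
    # (per-band gain matching is omitted in this sketch).
    Ln = compressed_loudness(E_cod) - compressed_loudness(E_ref)
    return float(np.sum(np.abs(Ln)))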

For example, in a fitting experiment using listening test results from the ISO/IEC MPEG 1990 database [ISOI90] for loudspeaker presentation of the test material, the values of 0.8, 0.6, and 0.04, respectively, were obtained for the parameters αf, αt, and γ. This yielded a correlation coefficient of 0.968 and standard deviation of 0.351 performance grades. With a correlation coefficient of 0.91 and standard deviation of 0.48 MSS, prediction was not quite as good when the same parameters were used to validate PAQM on subjective listening test results from the ISO/IEC MPEG 1991 database [ISOI91]. Some of the reduced prediction accuracy, however, was attributed to inexperienced listeners, particularly for some very high quality items for which the grading task is more difficult. In fact, better prediction accuracy was realized during a later comparison with more experienced listeners on the same database [Beer92b]. In addition, the model was also validated using the results of the ITU-R Task Group 10/2 1993 audio codec test [ITUR93b] on the contribution, distribution, and emission (CDE) database, resulting in a correlation coefficient of 0.88 and standard deviation of 0.29 MSS. As reported in [ITUR93c], these results were later verified by the Swedish Broadcasting Corporation. It is also interesting to note that the PAQM was optimized for listening test predictions on narrowband speech codecs. Several modifications were required to achieve a correlation coefficient of 0.99 and standard deviation of 0.14 points relative to a 5-point discrete MOS. In particular, time and frequency energy dispersion had to be eliminated, the loudness compression parameter was changed to γ = 0.001, and silence intervals required a zero weighting [Beer94a]. In an effort to improve PAQM performance, cognitive effect modeling was investigated in [Beer94b] and [Beer95]. It was concluded that a unified PAQM configuration for both narrowband and wideband codecs must account for subjective phenomena such as the asymmetry in perceived distortion between audible artifacts and the perceived distortion when signal components are eliminated rather than distorted. The signal-dependent importance of modeling informational masking effects [Leek84] was also demonstrated [Beer96], although a general formulation was not proposed. More details on cognitive modeling improvements for PAQM appeared recently in [Beer98]. Future PAQM enhancements will involve better cognitive models as well as binaural processing capabilities.
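The fitting step itself is simple to illustrate. The sketch below, which is illustrative only and not the exact procedure of [Beer92b], fits a third-order regression from log(ℒn) to subjective grades and reports the correlation coefficient and residual standard deviation; in a parameter search, αf, αt, and γ would be varied and this fit repeated, keeping the setting that minimizes the standard deviation.

import numpy as np

def fit_paqm_to_grades(log_ln, subjective_grades):
    log_ln = np.asarray(log_ln, dtype=float)
    grades = np.asarray(subjective_grades, dtype=float)
    coeffs = np.polyfit(log_ln, grades, deg=3)            # third-order regression
    predicted = np.polyval(coeffs, log_ln)
    corr = np.corrcoef(predicted, grades)[0, 1]           # correlation coefficient
    std_dev = np.std(grades - predicted, ddof=1)          # in grade units
    return coeffs, corr, std_dev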

12.7.2 Example: Noise-to-Mask Ratio (NMR)

In the late 1980s, Brandenburg proposed an NSE perceptual measurement scheme known as the "noise-to-mask ratio" (NMR) [Bran87a]. The system was originally intended to assess output quality and quantify noise margin during the development of the early perceptual transform coders such as OCF [Bran87b] and ASPEC [Bran91], but it has also been widely used in the development of more recent and sophisticated algorithms (e.g., MPEG-2 AAC [ISOI96a]). The NMR output measurement η represents a time- and frequency-averaged dB distance between the coding distortion ŝ(n) − s(n) and the masked threshold associated with the original signal s(n). The NMR system also sets a binary masking flag [Bran92b] whenever the noise density in any critical band exceeds the masked threshold. Tracking the masking flag over time helps designers to identify critical items for a particular codec. In a procedure markedly similar to the perceptual entropy front-end calculation (Chapter 5, Section 5.5), the NMR (Figure 12.4) is measured as follows. First, the original signal s(n) is time-delayed and, in some cases, amplitude scaled to maximize the temporal matching with the coded signal ŝ(n), so that the coding distortion can be estimated through a simple difference operation, i.e., ŝ(n) − s(n). Then, FFT-based spectral analysis is performed on Hann-windowed, 1024-sample blocks (23.3 ms at 44.1 kHz) of the signal and the noise sequences. Although not shown in the figure, the difference is sometimes computed in the spectral domain (after the FFT) to reduce the impact of phase errors on estimated audible noise. New spectral estimates are computed every 512 samples (50% block overlap) to improve the effective temporal resolution of the measurement. Next, critical band signal and noise densities are estimated on a set of Bark-like subbands by summing and then normalizing the energy contained in 27 blocks of adjacent FFT bins that approximate 1-Bark bandwidth [Bran92b]. The masked threshold is derived from the Bark-scale signal density by means of simplified simultaneous and temporal masking models. For simultaneous masking, a conservatively estimated, level-independent prototype spreading function (specified in [Bran92b]) is convolved with the Bark density. Unlike other perceptual models, NMR assumes that the in-band masked threshold is fixed at 3 dB below the masker peak, independent of tonality. While this assumption apparently overestimates the overall excitation level, the NMR model compensates by ignoring masking additivity. The absolute threshold of hearing in quiet is also accounted for in the Bark-domain masking calculation. As far as temporal masking is concerned, only postmasking is accounted for explicitly, with the simultaneous masked threshold decreased for each critical band independently by 6 dB per analysis window (11.65 ms at 44.1 kHz). Finally, the local NMR, ηloc, is defined as

ηloc = 10 log10 [ (1/27) Σ_{i=1..27} σn(i)/σm(i) ]          (12.5)

i.e., the mean of the ratio between the critical band noise density σn(i) and the critical band mask density σm(i). For a signal containing N blocks, the global NMR η is defined as the mean local NMR, i.e.,

η = 10 log10 [ (1/N) Σ_{i=1..N} 10^(0.1·ηloc(i)) ]          (12.6)

Figure 12.4. Noise-to-mask ratio (NMR). The reference signal s(n) and the coded signal ŝ(n) are windowed and analyzed with |FFT|²; a masking model applied to the reference spectrum provides the masked threshold against which the difference (noise) spectrum is compared to produce the NMR.


Finally, a masking flag (MF) is set once every 512 samples if the noise in at least one critical band exceeds the masked threshold. One appealing feature of the NMR algorithm is its modest complexity, making it amenable to real-time implementation. In fact, a real-time NMR system for 44.1-kHz signals has been implemented on a set of three AT&T DSP32C 50-MHz devices [Herr92a] [Herr92b]. Processor-specific tasks were partitioned as follows: 1) correlator for delay and gain determination, 2) FFT and windowing, and 3) perceptual model/NMR calculations. We will consider next how the raw NMR measurements are applied in codec performance evaluation tasks.
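The delay and gain alignment performed by the first processor can be sketched as follows; the brute-force cross-correlation search and least-squares gain used here are generic choices for illustration, not necessarily those of [Herr92a].

import numpy as np

def align_and_subtract(reference, coded, max_lag=4096):
    # Assumes equal-length 1-D arrays; trims to the shorter of the two.
    n = min(len(reference), len(coded))
    ref0 = np.asarray(reference, dtype=float)[:n]
    cod0 = np.asarray(coded, dtype=float)[:n]
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.dot(ref0[max(0, -l): n - max(0, l)],
                    cod0[max(0, l): n - max(0, -l)]) for l in lags]
    d = int(lags[int(np.argmax(xcorr))])        # lag of maximum cross-correlation
    ref = ref0[max(0, -d):]                     # drop leading samples to align
    cod = cod0[max(0, d):]
    m = min(len(ref), len(cod))
    ref, cod = ref[:m], cod[:m]
    gain = np.dot(cod, ref) / np.dot(ref, ref)  # least-squares amplitude match
    return cod - gain * ref                     # estimated coding distortion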

Given that NMR updates occur once every 512 samples (11.7 ms at 44.1 kHz), the raw NMR measurements of ηloc and the masking flag can create a considerable volume of data. To simplify the evaluation task, several postprocessed, data-reducing metrics have been proposed for codec evaluation purposes [Beat96]. These secondary measures, including the global NMR (12.6), the worst-case local NMR (max(ηloc)), the percentage of nonzero masking flags (MF), and the "NMR fingerprint," have been demonstrated to predict quite accurately the presence of coding artifacts. For example, a global NMR η of less than or equal to −10 dB typically implies an artifact-free signal. High quality is also implied by an MF that occurs less than 3% of the time. An NMR fingerprint shows the time-averaged distribution of NMR over frequency (critical bands). A well-balanced fingerprint has been shown to correspond with better output quality than an unbalanced, "peaky" fingerprint for which high NMRs are localized to just a few critical bands. NMR equivalent bit rate has also been proposed as a quality metric, particularly in the case of impaired channel performance characterization. In this measurement scheme, an impaired codec is quantified in terms of an equivalent bit rate for which an unimpaired reference codec would achieve the same NMR. For example, a 64-kb/s codec in a three-stage tandem configuration might achieve the same NMR as a 32-kb/s version of the same codec in the absence of any tandeming. These and other NMR-based figures of merit have been used successfully to evaluate the performance of perceptual audio codecs in consumer recording equipment [Keyh93], tandem coding configurations [Keyh94], and network applications [Keyh95].
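The secondary measures are straightforward to compute from the raw per-block data. The sketch below assumes arrays of local NMRs, masking flags, and per-band NMRs are available; the −10 dB and 3% rules of thumb quoted above are applied only as an indicative flag.

import numpy as np

def nmr_summary(local_nmrs_db, masking_flags, band_nmrs_db):
    local_nmrs_db = np.asarray(local_nmrs_db, dtype=float)   # one value per block
    masking_flags = np.asarray(masking_flags)                # one flag per block
    band_nmrs_db = np.asarray(band_nmrs_db, dtype=float)     # blocks x 27 bands
    global_nmr_db = 10.0 * np.log10(np.mean(10.0 ** (0.1 * local_nmrs_db)))
    worst_case_db = float(np.max(local_nmrs_db))             # worst-case local NMR
    mf_percent = 100.0 * float(np.mean(masking_flags != 0))  # % of blocks flagged
    fingerprint = band_nmrs_db.mean(axis=0)                  # time-averaged per band
    artifact_free = (global_nmr_db <= -10.0) and (mf_percent < 3.0)
    return {"global_nmr_db": global_nmr_db,
            "worst_case_local_nmr_db": worst_case_db,
            "masking_flag_percent": mf_percent,
            "fingerprint_db": fingerprint,
            "likely_artifact_free": artifact_free}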

The various NMR-based measures have proven to be useful evaluation tools for many phases of codec research and development. The ultimate objective of TG 10/4, however, is to standardize automatic evaluation tools that can predict subjective quality on an impairment scale. In fact, NMR measurements can be correlated directly with the results of subjective listening tests [Herr92b]. Moreover, an appropriately interpreted NMR output can emulate a human subject in psychophysical experiments [Spor95b]. In the case of listening test predictions, a minimum mean square error (MMSE) linear mapping from the raw NMR measurements to the 5-grade CCIR impairment scale was derived on listening test data from the 1990 Swedish Radio ISO/MPEG database [ISOI90]. The mapping achieved a correlation of 0.94 and mean deviation of 0.28 CCIR impairment grades. As for the psychophysical measurements, the NMR was treated as a test subject in several simultaneous and temporal masking experiments, and then the resulting psychometric functions were compared against average results for actual human subjects. In the experiments, the NMR "subject" was judged to detect a given probe if the masking flag was set for more than half of the blocks, and the threshold was interpreted as the lowest level probe for which more than 50% of the masking flags were set. In experiments on noise-masking-tone and tone-masking-noise, the NMR "subject" exhibited masking patterns similar to average human subjects, with some artifacts evident at low frequencies. In tone-masking-tone experiments, however, the NMR "subject" diverged from human performance because the NMR perceptual model fails to account for the masking release caused by beating and difference tones.
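The "NMR as subject" procedure amounts to a simple search over probe levels, as in the sketch below; measure_masking_flags is a hypothetical placeholder for a complete NMR run on the masker-plus-probe stimulus.

import numpy as np

def probe_detected(masking_flags):
    # "Detected" when the masking flag is set in more than half of the blocks.
    return float(np.mean(np.asarray(masking_flags) != 0)) > 0.5

def masked_threshold(masker, probe_levels_db, measure_masking_flags):
    # measure_masking_flags(masker, level) is assumed to run the NMR measurement
    # with a probe at the given level and return the per-block masking flags.
    detected = [lvl for lvl in sorted(probe_levels_db)
                if probe_detected(measure_masking_flags(masker, lvl))]
    return min(detected) if detected else None   # lowest detectable probe level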

The level of performance achieved by the NMR system on listening test prediction and psychometric measurement emulation tasks resulted in its inclusion in the ITU-R TG 10/4 evaluation process. Although the NMR perceptual model had been "frozen" in 1994 [Spor95b] to facilitate comparisons between NMR results generated at different sites for different algorithms, it was clear that perceptual model enhancements relative to [Bran92b] were necessary in order to improve masked threshold predictions and, ultimately, overall performance. Thus, while maintaining a complexity level compatible with real-time constraints, the TG 10/4 NMR submission was improved relative to the original NMR in several areas. For instance, the excitation model is now dependent on tonality and presentation level. Furthermore, the additivity of masking is now modeled explicitly. Finally, level dependencies are now matched to the actual presentation SPL. The system might eventually realize further performance gains by using higher-resolution spectral analysis at low frequencies and by estimating the BMLD (binaural masking level difference) for stereo signals.

12.7.3 Example: Objective Audio Signal Evaluation (OASE)

None of the NMR improvements cited above change the fact that the uniform-bandwidth channels of the FFT filter bank are fundamentally different from the auditory filter bank. In spite of other NMR modeling improvements, the FFT handicaps the measurement system because of inadequate frequency resolution at low frequencies and inadequate temporal resolution at high frequencies. To overcome such limitations, an alternative CIR-type algorithm for perceptual measurement, known as the objective audio signal evaluation (OASE) [Spor97], was also submitted to the ITU-R TG 10/4 as one of the three variant proposals from the NMR proponents. Unlike the other NMR variants, the OASE system (Figure 12.5) is a CIR system that generates independent auditory feature sets for the coded and reference signals, denoted by ŝ(n) and s(n), respectively. Then, a probabilistic detection stage determines the perceptual distance between the two auditory feature sets.

The OASE was designed to rectify perceptual measurement shortcomings inherent in FFT-based systems by achieving analysis resolution in both time and frequency that is greater than or equal to auditory resolution. In particular, the OASE filter bank approximates a Mel frequency scale by using 241 highly overlapping, 128-tap FIR bandpass filters centered at integral multiples of 0.1 Bark.

Figure 12.5. Objective audio signal evaluation (OASE). The reference signal s(n) and the coded signal ŝ(n) are each processed by a 241-channel filter bank followed by a temporal model; the two 241-element feature vectors are differenced and passed to a detector bank that yields the detection probabilities pg and pcb, and, optionally, through a loudness mapping and subjective mapping to a predicted difference grade.

Meanwhile, the time resolution is fs/32, or 0.67 ms at 48 kHz. Moderate complexity is achieved through an efficient, structured multirate realization that exploits fast convolutions. Individual filter shapes are level- and frequency-dependent, modeled after experimental results on excitation patterns. The bandpass filters also simulate outer and middle ear transfer characteristics. A square-law rectifier post-processes the output from each channel of the filter bank, and then the auditory "internal noise" associated with the absolute threshold of hearing is simulated by adding to each output a predefined constant. Finally, the nonlinear filters proposed in [Karj85] are applied to simulate temporal integration and smoothing. The output of the temporal processing block constitutes a 241-element auditory feature vector for each signal. A decibel difference is computed on each channel between the features associated with the reference and coded signals. This time-frequency difference vector forms the input to a bank of detectors. At each time index, a frequency-localized difference detection probability pi(t) is assigned to the i = 1, 2, ..., 241 features as a function of the difference magnitude, using a signal-dependent detection profile. For example, artificial signals use a detection profile different from the one used for speech and music. The time index t is discrete and takes on values that are integer multiples of 0.67 ms at 48 kHz (fs/32). Finally, a global probability of difference detection, pg(t), is computed as the complement of the combined probability that no single-channel differences are detected, i.e.,

pg(t) = 1 − ∏_{i=1..241} (1 − pi(t))                        (12.7)

Alternatively, a critical band detection probability pcb(t) for a particular 1-Bark subband can be computed by combining the frequency-localized detection probabilities as follows:

pcb(t) = 1 − ∏_{i=k−4..k+5} (1 − pi(t))                     (12.8)

where the critical band of interest is centered on any of the eligible Mel-scale features, i.e., 5 ≤ k ≤ 236. In cases where it is desirable for OASE to predict the results of a subjective listening test, a noise loudness mapping can be applied to the feature difference vectors available at the output of the temporal processing stages. Such a configuration estimates the loudness of the difference between the original and coded signals. Then, as is possible with many other perceptual measurement schemes, the noise loudness can be mapped to a predicted subjective difference grade. The ITU-R TG 10/4 has, in fact, used OASE model outputs to predict subjective scores. The results were reported in [Spor97] to have been reasonable and similar to other perceptual measurement systems, but no correlation coefficients or standard deviations were given.
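Equations (12.7) and (12.8) also translate directly into code. The sketch below assumes a vector of 241 per-channel detection probabilities at one time index and uses 0-based indexing for the band of Eq. (12.8).

import numpy as np

def global_detection_probability(p):
    # Eq. (12.7): p is the length-241 vector of per-channel probabilities p_i(t).
    return 1.0 - np.prod(1.0 - np.asarray(p, dtype=float))

def critical_band_detection_probability(p, k):
    # Eq. (12.8): combine the ten 0.1-Bark channels from k-4 to k+5 (k is 0-based
    # here; the text's eligible range 5 <= k <= 236 refers to 1-based features).
    p = np.asarray(p, dtype=float)
    return 1.0 - np.prod(1.0 - p[k - 4: k + 6])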

12.8 ITU-R BS.1387 AND ITU-T P.861 STANDARDS FOR PERCEPTUAL QUALITY MEASUREMENT

In 1995, the Radiocommunication Sector of the International Telecommunication Union (ITU-R) convened Task Group (TG) 10/4 to select a perceptual measurement scheme capable of predicting the results of subjective listening tests in the form of an objective difference grade (ODG). Six proposals were submitted to TG 10/4 in 1995, including the distortion index (DIX) from TU-Berlin [Thie96], the noise-to-mask ratio (NMR) from FhG-IIS Erlangen [Bran87a], the perceptual audio quality measure (PAQM) from KPN [Beer92a], the perceptual evaluation (PERCEVAL) from CRC Canada [Pail92], the perceptual objective measure (POM) from CCETT France [Colo95], and the Toolbox (TB) from IRT Munich. During initial TG 10/4 tests in 1996, three variants of each proposal were required to predict the results of an unknown listening test. A lengthy comparison and verification process ensued (e.g., [ITUR94a] compares NMR, PAQM, and PERCEVAL). Ultimately, the six TG 10/4 proponents, in conjunction with Opticom and Deutsche Telekom Berkom, collaborated to design a single system combining the most successful features from each individual methodology. This effort resulted in the recently adopted ITU-R Recommendation BS.1387 [ITUR98], which specifies two versions of the measurement scheme known as perceptual evaluation of audio quality (PEAQ). The basic, low-complexity version is based on an FFT filter bank, much like the PAQM, PERCEVAL, POM, and NMR algorithms. The advanced version incorporates a more sophisticated filter bank that better approximates the auditory system, as in, for example, the DIX and OASE algorithms. Nearly all features extracted from the perceptual models to estimate quality are similar to the feature sets previously used in the individual proposals. Unlike the regression polynomials used in earlier tests, however, in BS.1387 a neural network performs the final mapping from the objective output to an ODG. Standardization activities are ongoing, and the ITU has formed a successor group to the now dissolved TG 10/4 under the designation JWP10-11Q [Spor99a]. Research into perceptual measurement techniques has also resulted in a standardized algorithm for speech quality measurements. In particular, Study Group 12 within the ITU Telecommunication Standardization Sector (ITU-T) in 1996 adopted Recommendation P.861 [ITUT96] for perceptual quality measurements of telephone-band signals (300–3400 Hz). The technique is essentially a speech-specific version of the PAQM algorithm in which the time-frequency smearing blocks are usually disabled, the outer ear transfer function is eliminated, and simplifications are made in both the intensity-to-loudness mapping and the modeling of cognitive asymmetry effects.


12.9 RESEARCH DIRECTIONS FOR PERCEPTUAL CODEC QUALITY MEASURES

This chapter has provided a perspective on current performance evaluation techniques for perceptual audio codecs, including both subjective and objective quality measures. At present, the subjective listening test remains the most reliable technique for measurement of the small impairments associated with high-quality codecs, and a formal test methodology optimized for maximum reliability has been adopted as an international standard, the ITU-R BS.1116 [ITUR94b]. Nevertheless, even the results of BS.1116-compliant subjective tests continue to exhibit site and subject dependencies. Future research will continue to seek new methods for improving the reliability of subjective test outcomes. For example, Lucent Technologies sponsored an investigation at Moulton Laboratories [Moul98a] [Moul98b] into the effectiveness of multifacet Rasch models [Lina94] for improved reliability of subjective listening tests on high-quality audio codecs. Developed initially in the context of intelligence testing, the Rasch model [Rasc80] is a statistical analysis technique designed to remove the effects of local disturbances on test outcomes. This is accomplished by computing an expected data value for each data point that can be compared with the actual collected value to determine whether or not the data point is consistent with the demands of reproducible measurement. The Rasch methodology involves an iterative data pruning process of suspending misfitting data from the analysis, recomputing the measures and expected values, and then repeating the pruning process until all significant misfits are removed from the data set. The Rasch analysis proponents have indicated [Moul98a] [Moul98b] that future work will attempt to demonstrate the elimination of site dependencies and other local data disturbances. Research along these lines will continue as long as subjective listening tests exhibit site and subject dependence.
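The iterate-and-prune idea can be illustrated with a deliberately simplified stand-in for the Rasch analysis: in the sketch below, expected values come from an additive listener/item model, "misfit" means a large standardized residual, and the most misfitting datum is suspended on each pass. This is only a schematic of the pruning loop, not the multifacet Rasch model of [Lina94].

import numpy as np

def prune_misfits(grades, z_limit=3.0, max_iter=20):
    # grades: 2-D array (listeners x items); suspended data are set to np.nan.
    g = np.array(grades, dtype=float)
    for _ in range(max_iter):
        listener_mean = np.nanmean(g, axis=1, keepdims=True)
        item_mean = np.nanmean(g, axis=0, keepdims=True)
        expected = listener_mean + item_mean - np.nanmean(g)   # additive model
        resid = g - expected
        z = resid / np.nanstd(resid)                           # standardized residuals
        worst = np.nanmax(np.abs(z))
        if not np.isfinite(worst) or worst < z_limit:
            break                              # no significant misfits remain
        g[np.abs(z) == worst] = np.nan         # suspend the most misfitting datum
    return g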

Meanwhile, the considerable research that has been devoted to the development of automatic perceptual measurement schemes has resulted in the adoption of an international standard, the ITU-R BS.1387 [ITUR98]. Experts do not consider the standardized algorithm to be a human subject replacement; however, research into improved perceptual measurement schemes will continue (e.g., ITU-R JWP10-11Q). While the signal processing models of the auditory periphery have reached a high level of sophistication with manageable complexity, it is likely that future perceptual measurement schemes will realize the greatest performance gains as a result of improved models for binaural listening effects and general cognitive effects. As described in [Beer98], several central cognitive effects are important in subjective audio quality assessment, including the following.

Informational Masking. The masked threshold of a complex target masked by a complex masker may decrease by more than 40 dB [Leek84] after subject training. Modeling this effect has been shown to cause both increases and decreases in correlation between ODGs and SDGs, depending on the listening test database [Beer98]. Future work will concentrate on the development of satisfactory models for informational masking, i.e., models that do not lead to reduced correlation.


Separation of Linear from Nonlinear Distortions. Linear distortions tend to be less objectionable than nonlinear distortions. Experimental attempts to quantify this effect in perceptual measurement schemes, however, have not yet succeeded.

Auditory Scene Analysis. This describes the cognitive mechanism that groups different auditory events into distinct objects. During a successful application of this principle to perceptual measurement systems [Beer94b], it was found that there is a significant asymmetry between the perceived distortion associated with the introduction of a new time-frequency component and the perceived distortion associated with the removal of an existing time-frequency component. Moreover, the effect was shown to improve SDG/ODG correlation for some listening test results.

Naturally, there is some debate over the feasibility of creating an automatic system that can predict the responses of human test subjects with sufficient precision to allow for differentiation between competing high-quality algorithms. Even recent systems tend to have difficulty predicting some observable psychoacoustic phenomena. For example, most systems fail to model the beat and combination tone detection cues that generate a masking release (reduced threshold) in tone-masking-tone experiments. For these and other reasons, the large correlation coefficients that are obtained when mapping particular objective measures to certain subjective test sets have been shown to be data dependent (e.g., [Beer92b]). Keyhl et al. calibrated the perceptual audio quality measure (PAQM) to match a set of subjective test results for an MPEG codec in the WorldSpace satellite system [Keyh98]. The idea was to later use the automatic measure in place of expensive subjective tests for quality assurance over time. In spite of any apparent shortcomings, considerable interest in perceptual measurement schemes will persist as long as subjective tests remain time-consuming, expensive, and only marginally reliable. For additional information, the interested reader can consult several survey papers on perceptual measurement schemes that appeared in an AES collection [Beat96] [Koml96] [Ryde96], as well as the textbook on perceptual measurement techniques [Beer98]. Two popular published standards include the perceptual evaluation of speech quality (PESQ) [Rix01] and the perceptual evaluation of audio quality (PEAQ) standard [PEAQ].

REFERENCES

[Adou87] JP Adoul et al ldquoFast CELP coding based on algebraic codesrdquo in ProcIEEE ICASSP-87 vol 12 pp 1957ndash1960 Dallas TX Apr 1987

[Adou95] J-P Adoul and R Lefebvre ldquoWideband speech codingrdquo in Speech Codingand Synthesis Eds WB Kleijn and KK Paliwal Kluwer AcademicPublishers Norwell MA 1995

[AES00] AES Technical Council Document AESTD1001101ndash10 MultichannelSurround Sound Systems and Operations Audio Engineering Society NewYork NY

[Ager94] F Agerkvist Time-Frequency Analysis and Auditory Models PhD ThesisThe Acoustics Laboratory Technical University of Denmark 1994

[Ager96] F Agerkvist ldquoA time-frequency auditory model using wavelet packetsrdquo JAud Eng Soc vol 44 no 12 pp 37ndash50 Feb 1996

[Akag94] K Akagiri ldquoTechnical Description of Sony Preprocessingrdquo ISOIEC JTC1SC29WG11 MPEG Input Document 1994

[Akai95] K Akaigiri ldquoDetailed Technical Description for MPEG-2 Audio NBCrdquoTechnical Report ISOIEC JTC1SC29WG11 1995

[Akan92] A Akansu and R Haddad Multiresolution Signal Decomposition Trans-forms Subbands Wavelets Academic Press San Diego CA 1992

[Akan96] A Akansu and MJT Smith eds Subband and Wavelet Transforms Designand Applications Kluwer Academic Publishers Norwell MA 1996

[Ali96] M Ali Adaptive Signal Representation with Application in Audio CodingPhD Thesis University of Minnesota Mar 1996


[Alki95] O Alkin and H Calgar ldquoDesign of efficient M-band coders with linearphase and perfect-reconstruction propertiesrdquo IEEE Trans Sig Proc vol43 no 7 pp 1579ndash1589 July 1995

[Alla99] E Allamanche R Geiger J Herre and T Sporer ldquoMPEG-4 low-delayaudio coding based on the AAC codecrdquo in Proc 106th Conv Aud EngSoc preprint 4929 May 1999

[Alle77] JB Allen and LR Rabiner ldquoA unified theory of short-time spectrumanalysis and synthesisrdquo Proc IEEE vol 65 no 11 pp 1558ndash1564 Nov1977

[Alme83] L Almeida ldquoNonstationary spectral modeling of voiced speechrdquo IEEETrans Acous Speech and Sig Proc vol 31 pp 374ndash390 June 1983

[Angu01] JA Angus ldquoAchieving effective dither in delta-sigma modulation systemsrdquoin Proc 110th Conv Aud Eng Soc preprint 5393 May 2001

[Angu97a] JA Angus and NM Casey ldquoFiltering signals directlyrdquo in Proc 102ndConv Aud Eng Soc preprint 4445 Mar 22ndash25 1997

[Angu97b] JAS Angus ldquoBackward adaptive lossless coding of audio signalsrdquo in Proc103rd Conv Aud Eng Soc preprint 4631 New York Sept 1997

[Angu98] JA Angus and S Draper ldquoAn improved method for directly filtering

audio signalsrdquo in Proc 104th Conv Aud Eng Soc preprint 4737 May16ndash19 1998

[Arno00] M Arnold ldquoAudio watermarking features applications and algorithmsrdquo inInt Conf on Multimedia and Expo (ICME-2000) vol 2 pp 1013ndash1016 30Julyndash2 Aug 2000

[Arno02] M Arnold et al Techniques and Applications of Digital Watermarking andContent Protection Artech House Norwood MA Dec 2002

[Arno02a] M Arnold and O Lobisch ldquoReal-time concepts for block-based wat-ermarking schemesrdquo in Proc of 2nd Int Conf on WEDELMUCSIC-02 pp156ndash160 Dec 2002

[Arno02b] M Arnold ldquoSubjective and objective quality evaluation of watermarkedaudio tracksrdquo in Proc of 2nd Int Conf on WEDELMUCSIC-02 pp161ndash167 Dec 2002

[Arno03] M Arnold ldquoAttacks on digital audio watermarks and countermeasuresrdquo inProc of 3rd Int Conf on WEDELMUCSIC-03 pp 55ndash62 Sept 2003

[Atal82a] BS Atal ldquoPredictive coding of speech at low bit ratesrdquo IEEE Trans onComm vol 30 no 4 p 600 Apr 1982

[Atal82b] BS Atal and J Remde ldquoA new model for LPC excitation for producingnatural sounding speech at low bit ratesrdquo in Proc IEEE ICASSP-82pp 614ndash617 Paris France April 1982

[Atal84] B Atal and MR Schroeder ldquoStochastic coding of speech signals at verylow bit ratesrdquo in Proc Int Conf Comm pp 1610ndash1613 May 1984

[Atal90] B Atal V Cuperman and A Gersho Advances in Speech Coding KluwerNorwell MA 1990

[Atti05] V Atti and A Spanias ldquoSpeech analysis by estimating perceptually ndashrelevant pole locationrdquo in Proc IEEE ICASSP-05 vol 1 pp 217ndash220Philadelphia PA March 2005


[Bald96] A Baldini ldquoDSP and MIDI applications in loudspeaker designrdquo in Proc100th Conv Aud Eng Soc preprint 4189 May 1996

[Bamb94] R Bamberger et al ldquoGeneralized symmetric extension for size-limitedmultirate filter banksrdquo IEEE Trans Image Proc vol 3 no 1 pp 82ndash87Jan 1994

[Bank96] M. Bank et al., "An objective method for sound quality estimation of compression systems," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4373, 1996.

[Barn03a] M Barni ldquoWhat is the future of watermarking Part-Irdquo IEEE SignalProcessing Magazine vol 20 no 5 pp 55ndash60 Sept 2003

[Barn03b] M Barni ldquoWhat is the future of watermarking Part-IIrdquo IEEE SignalProcessing Magazine vol 20 no 6 pp 53ndash59 Nov 2003

[Bass98] P Bassia and I Pitas ldquoRobust audio watermarking in the time-domainrdquo inProc EUSIPCO vol 1 pp 25ndash28 Greece Sept 1998

[Bass01] P Bassia I Pitas and N Nikolaidis ldquoRobust audio watermarking in thetime domainrdquo IEEE Trans on Multimedia vol 3 no 2 pp 232ndash241 June2001

[Baum95] F Baumgarte C Ferekidis and H Fuchs ldquoA nonlinear psychoacousticmodel applied to the ISO MPEG layer 3 coderrdquo in Proc 99th Conv AudioEng Soc preprint 4087 New York Oct 1995

[Bayl97] J Baylis Error Correcting Codes A Mathematical Introduction CRC PressBoca Raton FL Dec 1997

[Beat89] RJ Beaton ldquoHigh quality audio encoding within 128 kbitsrdquo in IEEEPacific Rim Conf on Comm Comp and Sig Proc pp 388ndash391 June 1989

[Beat96] R. Beaton et al., "Objective perceptual measurement of audio quality," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 126–152, 1996.

[Bech92] S Bech ldquoSelection and training of subjects for listening tests on sound-reproducing equipmentrdquo J Aud Eng Soc pp 590ndash610 JulAug 1992

[Beck04] E Becker et al Digital Rights Management Technological Economic andLegal and Political Aspects Springer Verlag Berlin Jan 2004

[Beer91] J. Beerends and J. Stemerdink, "Measuring the quality of audio devices," in Proc. 90th Conv. Aud. Eng. Soc., preprint 3070, Feb. 1991.

[Beer92a] J. Beerends and J. Stemerdink, "A perceptual audio quality measure," in Proc. 92nd Conv. Aud. Eng. Soc., preprint 3311, Mar. 1992.

[Beer92b] J. Beerends and J. Stemerdink, "A perceptual audio quality measure based on a psychoacoustic sound representation," J. Aud. Eng. Soc., vol. 40, no. 12, pp. 963–978, Dec. 1992.

[Beer94a] J. Beerends and J. Stemerdink, "A perceptual speech-quality measure based on a psychoacoustic sound representation," J. Aud. Eng. Soc., vol. 42, no. 3, pp. 115–123, Mar. 1994.

[Beer94b] J. Beerends and J. Stemerdink, "Modeling a cognitive aspect in the measurement of the quality of music codecs," in Proc. 96th Conv. Aud. Eng. Soc., preprint 3800, 1994.

[Beer95] J. Beerends and J. Stemerdink, "Measuring the quality of speech and music codecs: an integrated psychoacoustic approach," in Proc. 98th Conv. Aud. Eng. Soc., preprint 3945, 1995.

[Beer96] J. Beerends et al., "The role of informational masking and perceptual streaming in the measurement of music codec quality," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4176, May 1996.

[Beer98] J. Beerends, "Audio quality determination based on perceptual measurement techniques," in Applications of Digital Signal Processing to Audio and Acoustics, M. Kahrs and K. Brandenburg, Eds., Kluwer Academic Publishers, Boston, pp. 1–83, 1998.

[Beke60] G von Bekesy Experiments in Hearing McGraw Hill New-York 1960

[Bell76] M Bellanger et al ldquoDigital filtering by polyphase network application tosample rate alteration and filter banksrdquo IEEE Trans Acous Speech and SigProc vol 24 no 4 pp 109ndash114 Apr 1976

[Ben99] L Ben and M Sandler ldquoJoint source and channel coding for Internet audiotransmissionrdquo in Proc 106th Conv Aud Eng Soc preprint 4932 May1999

[Bend95] W Bender D Gruhl and N Morimoto ldquoTechniques for data hidingrdquo inProc of the SPIE 1995

[Bend96] W Bender D Gruhl N Morimoto and A Lu ldquoTechniques for data hidingrdquoIBM Syst J vol 35 nos 3 and 4 pp 313ndash336 1996

[Bene86] N Benevuto et al ldquoThe 32 kbs coding standardrdquo ATampT Technical Journal vol 65(5) pp 12ndash22 SeptndashOct 1986

[Berm02] G Berman and I Jouny ldquoTemporal watermarking invertible only throughsource separationrdquo in Proc IEEE ICASSP-2002 vol 4 pp 3732ndash3735Orlando FL May 2002

[Bess02] B Bessette et al ldquoThe adaptive multi-rate wideband speech codec (AMR-WB)rdquo IEEE Trans on Speech and Audio Proc vol 10 no 8 pp 620ndash636Nov 2002

[Blac95] M Black and M Zeytinoglu ldquoComputationally efficient wavelet packetcoding of wideband stereo audio signalsrdquo in Proc IEEE ICASSP-95 pp3075ndash3078 Detroit MI May 1995

[Blau74] J Blauert Spatial Hearing The MIT Press Cambridge MA 1974

[Bola95] S Boland and M Deriche ldquoHigh quality audio coding using multipulse LPCand wavelet decompositionrdquo in Proc IEEE ICASSP-95 pp 3067ndash3069Detroit MI May 1995

[Bola96] S Boland and M Deriche ldquoAudio coding using the wavelet packet transformand a combined scalar-vector quantizationrdquo in Proc IEEE ICASSP-96 pp1041ndash1044 Atlanta GA May 1996

[Bola97] S Boland and M Deriche ldquoNew results in low bitrate audio coding using acombined harmonic-wavelet representationrdquo in Proc IEEE ICASSP-97 pp351ndash354 Munich Germany April 1997

[Bola98] S Boland and M Deriche ldquoHybrid LPC and discrete wavelet transformaudio coding with a novel bit allocation algorithmrdquo in Proc IEEE ICASSP-98 vol 6 pp 3657ndash3660 Seattle WA May 1998


[Bone96] L Boney A Tewfik and K Hamdy ldquoDigital watermarks for audio signalsrdquoin Proc of Multimedia rsquo96 pp 473ndash480 Piscataway NJ IEEE Press 1996

[Borm03] J Bormans J Gelissen and A Perkis ldquoMPEG-21 the 21st centurymultimedia frameworkrdquo IEEE Sig Proc Mag vol 20 no 2 pp 53-62ndashMar 2003

[Bosi00] M Bosi ldquoHigh-quality multichannel audio coding trends and challengesrdquoJ Audio Eng Soc vol 48 no 6 pp 588 590ndash595 June 2000

[Bosi93] M Bosi et al ldquoAspects of current standardization activities for high-qualitylow rate multichannel audio codingrdquo IEEE Workshop on Applications ofSignal Processing to Audio and Acoustics New Paltz NY Oct 1993

[Bosi96] M Bosi K Brandenburg S Quackenbush L Fielder K Akagiri and HFuchs ldquoISOIEC MPEG-2 advanced audio codingrdquo in Proc 101st ConvAud Eng Soc preprint 4382 Los Angeles Nov 1996

[Bosi96a] M Bosi et al ldquoISOIEC MPEG-2 advanced audio codingrdquo in Proc 101stConv Audio Eng Soc preprint 4382 1996

[Bosi97] M Bosi et al ldquoISOIEC MPEG-2 advanced audio codingrdquo J Audio EngSoc pp 789ndash813 Oct 1997

[Boyd88] I Boyd and C Southcott ldquoA speech codec for the Skyphone servicerdquo BrTelecom Technical J vol 6(2) pp 51ndash55 April 1988

[Bran87a] K. Brandenburg, "Evaluation of quality for audio encoding at low bit rates," in Proc. 82nd Conv. Aud. Eng. Soc., preprint 2433, Mar. 1987.

[Bran87b] K. Brandenburg, "OCF – a new coding algorithm for high quality sound signals," in Proc. IEEE ICASSP-87, vol. 12, pp. 141–144, Dallas, TX, May 1987.

[Bran88a] K. Brandenburg, "High quality sound coding at 2.5 bits/sample," in Proc. 84th Conv. Aud. Eng. Soc., preprint 2582, Mar. 1988.

[Bran88b] K. Brandenburg, "OCF: coding high quality audio with data rates of 64 kbit/sec," in Proc. 85th Conv. Aud. Eng. Soc., preprint 2723, Mar. 1988.

[Bran90] K. Brandenburg and J. Johnston, "Second generation perceptual audio coding: the hybrid coder," in Proc. 88th Conv. Aud. Eng. Soc., preprint 2937, Mar. 1990.

[Bran91] K. Brandenburg et al., "ASPEC: adaptive spectral entropy coding of high quality music signals," in Proc. 90th Conv. Aud. Eng. Soc., preprint 3011, Feb. 1991.

[Bran92a] K. Brandenburg et al., "Comparison of filter banks for high quality audio coding," in Proc. IEEE ISCAS, 1992.

[Bran92b] K. Brandenburg and T. Sporer, "'NMR' and 'Masking Flag': evaluation of quality using perceptual criteria," in Proc. 11th Int. Conf. Aud. Eng. Soc., pp. 169–179, May 1992.

[Bran94a] K Brandenburg and G Stoll ldquoISO-MPEG-1 audio a generic standard forcoding of high-quality digital audiordquo J Audio Eng Soc pp 780ndash792 Oct1994

[Bran94b] K Brandenburg and B Grill ldquoFirst ideas on scalable audio codingrdquo in Proc97th Conv Aud Eng Soc San Francisco preprint 3924 1994


[Bran95] K Brandenburg and M Bosi ldquoOverview of MPEG-audio current and futurestandards for low bit-rate audio codingrdquo in Proc 99th Conv Aud Eng Socpreprint 4130 Oct 1995

[Bran97] K Brandenburg and M Bosi ldquoISOIEC MPEG-2 advanced audio codingoverview and applicationsrdquo in Proc 103th Conv Aud Eng Soc preprint4641 Sept 1997

[Bran98] K Brandenburg ldquoPerceptual coding of high quality digital audiordquo inApplications of Digital Signal Processing to Audio and Acoustics M Kahrsand K Brandenburg Eds Kluwer Academic Publishers Boston 1998

[Bran02] K Brandenburg ldquoAudio coding and electronic distribution of musicrdquoin Proc of 2nd International Conference on Web Delivering of Music(WEDELMUSIC-02) pp 3ndash5 Dec 2002

[Bree04] J Breebaart et al ldquoHigh-quality parametric spatial audio coding at lowbitratesrdquo in 116th Conv Aud Eng Soc Berlin Germany May 2004

[Bree05] J Breebart et al ldquoMPEG spatial audio codingMPEG surround Overviewand current statusrdquo in Proc 119th AES convention paper 6599 New YorkNew York Oct 2005

[Breg90] A S Bregman Auditory Scene Analysis MIT Press Cambridge MA 1990

[Bros03] PM Brossier MB Sandler and MD Plumbley ldquoReal time object basedcodingrdquo in Proc 101st Conv Aud Eng Soc preprint 5809 Mar 2003

[Brue97] F Bruekers W Oomen R van der Vleuten and L van de KerkhofldquoImproved lossless coding of 1-bit audio signalsrdquo in Proc 103rd ConvAud Eng Soc preprint 4563 New York Sept 1997

[Burn03] I Burnett et al ldquoMPEG-21 goals and achievementsrdquo in IEEE Multimedia vol 10 no 4 pp 60ndash70 OctndashDec 2003

[Buzo80] A Buzo et al ldquoSpeech coding based upon vector quantization rdquo IEEETrans Acoust Speech Signal Process vol 28 no 5 pp 562ndash574 Oct1980

[BWEB] Specification of the Bluetooth System available online at wwwbluetoothcom

[Cabo92] R. Cabot, "Measurements on low-rate coders," in Proc. 11th Int. Conf. Aud. Eng. Soc., pp. 216–222, May 1992.

[Cacc01] G Caccia R Lancini M Ludovico F Mapelli and S TubaroldquoWatermarking for musical pieces indexing used in automatic cue sheetgeneration systemsrdquo IEEE 4th Workshop on Mult and Sig Proc pp487ndash492 Oct 2001

[Cagl91] H Caglar et al ldquoStatistically optimized PR-QMF designrdquo in Proc SPIEVis Comm and Img Proc pp 86ndash94 Nov 1991

[Cai95] H Cai and G Mirchandani ldquoWavelet transform and bit-plane codingrdquo inProc ICIP pp 578ndash581 1995

[Camp86] J Campbell and TE Tremain ldquoVoicedunvoiced classification of speechwith applications to the US government LPC-10e algorithmrdquo in Proc IEEEICASSP-86 vol 11 pp 473ndash476 Tokyo Japan April 1986

[Camp90] J Campbell TE Tremain and V Welch ldquoThe Proposed Federal Standard1016 4800 bps voice coder CELPrdquo Speech Tech Mag pp 58ndash64 Apr1990


[Cand92] JC Candy and GC Temes ldquoOversampling methods for AD and DAconversionrdquo in Oversampling Delta-Sigma Converters Eds JC Candy andGC Temes Piscataway NJ IEEE Press 1992

[Casa98] A Casal et al ldquoTesting a flexible time-frequency mapping for highfrequencies in TARCO (Tonal Adaptive Resolution COdec)rdquo in Proc 104thConv Aud Eng Soc preprint 4676 Amsterdam May 1998

[Case96] MA Casey and P Smaragdis ldquoNetsoundrdquo in Proc Int Computer MusicConf Hong Kong 1996

[CD82] Philips and Sony ldquoAudio CD System Descriptionrdquo Philips LicensingEindhoven The Netherlands 1982

[Chan90] W-Y Chan and A Gersho ldquoHigh fidelity audio transform coding with vectorquantizationrdquo in Proc IEEE ICASSP-90 pp 1109ndash1112 Albuquerque NMMay 1990

[Chan91a] W Chan and A Gersho ldquoconstrained-storage quantization of multiple vectorsources by codebook sharingrdquo IEEE Trans Comm vol 39 no 1 pp11ndash13 Jan 1991

[Chan91b] W Chan and A Gersho ldquoConstrained-storage vector quantization in highfidelity audio transform codingrdquo in Proc IEEE ICASSP-91 pp 3597ndash3600Toronto Canada May 1991

[Chan96] W Chang and C Want ldquoA masking-threshold-adapted weighting filter forexcitation searchrdquo IEEE Trans on Speech and Audio Proc vol 4 no 2pp 124ndash132 Mar 1996

[Chan97] W Chang et al ldquoAudio coding using sinusoidal excitation representationrdquoin Proc IEEE ICASSP-97 vol 1 pp 311ndash314 Munich Germany April1997

[Char88] A Charbonnier and JP Petit ldquoSub-band ADPCM coding for high qualityaudio signalsrdquo in Proc IEEE ICASSP-88 pp 2540ndash2543 New York NYMay 1988

[Chen92] J Chen R Cox Y Lin N Jayant and M Melchner ldquoA low-delay CELPcoder for the CCITT 16 kbs speech coding standardrdquo IEEE Trans SelectedAreas Commun vol 10 no 5 pp 830ndash849 June 1992

[Chen95] B Chen et al ldquoOptimal signal reconstruction in noisy filterbanks multirateKalman synthesis filtering approachrdquo IEEE Trans Sig Proc vol 43 no 11pp 2496ndash2504 Nov 1995

[Chen99] J Chen and H-M Tai ldquoMPEG-2 AAC coder on a fixed-point DSPrdquo in Procof Int Conf on Consumer Electronics ICCE-99 pp 24ndash25 June 1999

[Chen02a] S Cheng H Yu and Z Xiong ldquoError concealment of MPEG-2 AAC audiousing modulo watermarksrdquo in Proc IEEE ISCAS-2002 vol 2 pp 261ndash264May 2002

[Chen02b] S Cheng H Yu and X Zixiang ldquoEnhanced spread spectrum watermarkingof MPEG-2 AACrdquo in Proc IEEE ICASSP-2002 vol 4 pp 3728ndash3731Orlando FL May 2002

[Chen04] L-J Chen R Kapoor K Lee MY Sanadidi M Gerla ldquoAudio streamingover Bluetooth an adaptive ARQ timeout approachrdquo in 6th Int Workshopon Mult Network Syst and App(MNSA 2004) Tokyo Japan 2004


[Cheu95] S Cheung and J Lim ldquoIncorporation of biorthogonality into lappedtransforms for audio compressionrdquo in Proc IEEE ICASSP-95 pp3079ndash3082 Detroit MI May 1995

[Chia96] H-C Chiang and J-C Liu ldquoRegressive implementations for the forward andinverse MDCT in MPEG audio codingrdquo IEEE Sig Proc Letters vol 3no 4 pp 116ndash118 Apr 1996

[Chou01a] J Chou K Ramchandran A Ortega ldquoNext generation techniques for robustand imperceptible audio data hidingrdquo in Proc IEEE ICASSP-2001 vol 3pp 1349ndash1352 Salt Lake City UT May 2001

[Chou01b] J Chou K Ramchandran and A Ortega ldquoHigh capacity audio data hidingfor noisy channelsrdquo in Proc Int Conf on Inf Tech Coding and Computingpp 108ndash112 Apr 2001

[Chow73] J Chowing ldquoThe synthesis of complex audio spectra by means of frequencymodulationrdquo J Audio Eng Soc pp 526ndash529 Sep 1973

[Chu85] PL Chu ldquoQuadrature mirror filter design for an arbitrary number of equalbandwidth channelsrdquo IEEE Trans Acous Speech and Sig Proc vol 33no 1 pp 203ndash218 Feb 1985

[Cilo00] T Ciloglu and SU Karaaslan ldquoAn improved all-pass watermarking schemefor speech and audiordquo in Proc Int Conf on Mult and Expo (ICME-2000)vol 2 pp 1017ndash1020 JulyndashAug 2000

[Coif92] R Coifman and M Wickerhauser ldquoEntropy based algorithms for best basisselectionrdquo IEEE Trans Information Theory vol 38 no 2 pp 712ndash718Mar 1992

[Colo95] C. Colomes et al., "A perceptual model applied to audio bit-rate reduction," J. Aud. Eng. Soc., vol. 43, no. 4, pp. 233–240, Apr. 1995.

[Cont96] L Contin et al ldquoTests on MPEG-4 audio codec proposalsrdquo J Sig ProcImage Comm Oct 1996

[Conw88] J Conway and N Sloane Sphere Packing Lattices and Groups Springer-Verlag New York 1988

[Cove91] T Cover and J Thomas Elements of Information Theory Wiley-Interscience New York Aug 1991

[Cox86] Cox R ldquoThe design of uniformly and nonuniformly spaced pseudo QMFrdquoIEEE Trans Acous Speech and Sig Proc vol 34 pp 1090ndash1096 Oct1986

[Cox95] IJ Cox J Kilian T Leighton and T Shamoon ldquoSecure spread spectrumwatermarking for multimediardquo Tech Rep 95ndash10 NEC Research InstitutePrinceton NJ 1995

[Cox96] IJ Cox J Kilian T Leighton and T Shamoon ldquoA secure robustwatermark for multimediardquo in Proc of Information Hiding Workshop rsquo96Cambridge UK pp 147ndash158 May 1996

[Cox97] IJ Cox J Kilian T Leighton and T Shamoon ldquoSecure spread spectrumwatermarking for multimediardquo IEEE Trans Image Processing vol 6 no 12pp 1673ndash1687 1997

[Cox99] IJ Cox ML Miller AL McKellips ldquoWatermarking as communicationswith side informationrdquo Proc of the IEEE vol 87 no 7 pp 1127ndash1141July 1999


[Cox01a] IJ Cox ML Miller and JA Bloom Digital Watermarking MorganKaufmann Publishers Sanfransisco 2001

[Cox01b] IJ Cox and ML Miller ldquoElectronic watermarking the first 50 yearsrdquo IEEE4th Workshop on Mult Sig Proc pp 225ndash230 Oct 2001

[Crav96] PG Craven and M Gerzon ldquoLossless coding for audio discsrdquo J AudioEng Soc vol 44 no 9 pp 706ndash720 Sept 1996

[Crav97] S Craver N Memon B-L Yeo and M Yeung ldquoCan invisible watermarksresolve rightful ownershiprdquo IBM Research Report RC 20509 Feb 1997

[Crav97] PG Craven M Law and J Stuart ldquoLossless compression using IIRprediction filtersrdquo in Proc 102nd Conv Aud Eng Soc preprint 4415Munich March 1997

[Crav98] S Craver N Memon B-L Yeo and M Yeung ldquoResolving rightfulownerships with invisible watermarking techniques limitations attacks andimplicationsrdquo in IEEE Journal on Selected Areas in Communications vol16 no 4 pp 573ndash586 1998

[Crav01] SA Craver M Wu and B Liu ldquoWhat can we reasonably expect fromwatermarksrdquo in 2001 IEEE Workshop on the Applications of SignalProcessing to Audio and Acoustics 21ndash24 Oct 2001 pp 223ndash226

[Creu96] C Creusere and S Mitra ldquoEfficient audio coding using perfect reconstructionnoncausal IIR filter banksrdquo IEEE Trans Speech and Aud Proc vol 4 no 2pp 115ndash123 Mar 1996

[Creu02] CD Creusere ldquoAn analysis of perceptual artifacts in MPEG scalable audiocodingrdquo Proc of Data Compression Conference (DCC-02) pp 152ndash161Apr 2002

[Croc76] RE Crochiere et al ldquoDigital coding of speech in subbandsrdquo Bell Sys TechJ vol 55 no 8 pp 1069ndash1085 Oct 1976

[Croc83] RE Crochiere and L R Rabiner Multirate Digital Signal ProcessingPrentice-Hall Englewood Cliffs NJ 1983

[Croi76] A Croisier et al ldquoPerfect channel splitting by use of interpolationdecimationtree decomposition techniquesrdquo in Proc Int Conf on Inf Sciand Syst Patras 1976

[Cumm73] P Cummiskey et al ldquoAdaptive quantization in differential PCM coding ofspeechrdquo Bell Syst Tech J vol 52 no 7 p 1105 Sept 1973

[Cupe85] V Cuperman and A Gersho ldquoVector predictive coding of speech at 16 kbsrdquoIEEE Trans on Commun vol 33 no 7 pp 685ndash696 Jul 1985

[Cupe89] V Cuperman ldquoOn adaptive vector transform quantization for speechcodingrdquo IEEE Trans on Commun vol 37 no 3 pp 261ndash267 Mar 1989

[Cvej01] N Cvejic A Keskinarkaus T Seppanen ldquoAudio watermarking using m-sequences and temporal maskingrdquo IEEE Workshop on App of Sig Proc toAudio and Acoustics pp 227ndash230 Oct 2001

[Cvej03] N Cvejic D Tujkovic T Seppanen ldquoIncreasing robustness of an audiowatermark using turbo codesrdquo in Proc ICME-03 vol 1 pp 217ndash220 July2003

[Daub88] I Daubechies ldquoOrthonormal bases of compactly supported waveletsrdquoComm Pure Appl Math pp 909ndash996 Nov 1988


[Daub92] I Daubechies Ten Lectures on Wavelets Society for Industrial and AppliedMathematics Philadelphia PA 1992

[Daub96] I Daubechies and W Sweldens ldquoFactoring wavelet transforms into liftingstepsrdquo Tech Rep Bell Laboratories Lucent Technologies 1996

[DAVC96] Digital Audio-Visual Council (DAVIC) DAVIC Technical Specification 12Part 9 ldquoInformation Representationrdquo Dec 1996 Available online fromhttpwwwdavicorg

[Davi86] G Davidson and A Gersho ldquoComplexity reduction methods for vectorexcitation codingrdquo in Proc IEEE ICASSP-86 vol 11 pp 3055ndash3058Tokyo Japan April 1986

[Davi90] G Davidson et al ldquoLow-complexity transform coder for satellite linkapplicationsrdquo in Proc 89th Conv Aud Eng Soc preprint 2966 1990

[Davi92] G Davidson and M Bosi ldquoAC-2 high quality audio coding for broadcas-ting and storagerdquo in Proc 46th Annual Broadcast Eng Conf pp 98ndash105Apr 1992

[Davi94] G Davidson et al ldquoParametric bit allocation in a perceptual audio coderrdquoin Proc 97th Conv Aud Eng Soc preprint 3921 Nov 1994

[Davi98] G Davidson ldquoDigital audio coding Dolby AC-3rdquo in The Digital SignalProcessing Handbook V Madisetti and D Williams Eds CRC Press pp411ndash4121 1998

[Davis93] M Davis ldquoThe AC-3 multichannel coderrdquo in Proc 95th Conv Aud EngSoc preprint 3774 Oct 1993

[Davis94] M Davis and C Todd ldquoAC-3 operation bitstream syntax and featuresrdquo inProc 97th Conv Aud Eng Soc preprint 3910 1994

[Davis03] M Davis ldquoHistory of spatial codingrdquo J Audio Eng Soc vol 51 no 6pp 554-569 June 2003

[Dehe91] YF Dehery et al ldquoA MUSICAM source codec for digital audiobroadcasting and storagerdquo in Proc IEEE ICASSP-91 pp 3605ndash3608Toronto Canada May 1991

[Delo96] A Delopoulos and S Kollias ldquoOptimal filterbanks for signal reconstructionfrom noisy subband componentsrdquo IEEE Trans Sig Proc vol 44 no 2 pp212ndash224 Feb 1996

[Deri99] M Deriche D Ning and S Boland ldquoA new approach to low bit rate audiocoding using a combined harmonic-multiband-wavelet representationrdquo inProc ISSPA-99 vol 2 pp 603ndash606 Aug 1999

[Diet96] M Dietz et al ldquoAudio compression for network transmissionrdquo J Aud EngSoc pp 58ndash71 Jan Feb 1996

[Diet00] M Dietz and T Mlasko ldquoUsing MPEG-4 audio for DRM digital narrowbandbroadcastingrdquo in Proc ISCAS-2000 vol 3 pp 205ndash208 May 2000

[Diet02] M Dietz et al ldquoSpectral band replication a novel approach in audio codingrdquoin 112th Conv Aud Eng Soc preprint 5553 Munich Germany May2002

[Ditt00] J Dittmann A Mukherjee and M Steinebach ldquoMedia-independentwatermarking classification and the need for combining digital video andaudio watermarking for media authenticationrdquo in Proc Int Conf on InfTech Coding and Computing pp 62ndash67 Mar 2000


[Ditt01] J. Dittmann and M. Steinebach, "Joint watermarking of audio-visual data," IEEE 4th Workshop on Mult. Sig. Proc., pp. 601–606, Oct. 2001.

[Dobs97] W. Dobson et al., "High quality low complexity scalable wavelet audio coding," in Proc. IEEE ICASSP-97, pp. 327–330, Munich, Germany, April 1997.

[DOLBY] "Surround Sound and Multimedia: Dolby Surround and Dolby Digital for video games, CD-ROM, DVD-ROM, the Internet – and more." Available online at http://www.dolby.com.

[Dong02] H. Dong, J.D. Gibson, and M.G. Kokes, "SNR and bandwidth scalable speech coding," in Proc. IEEE ICASSP-02, vol. 2, pp. 859–862, Orlando, FL, May 2002.

[Dorw01] S. Dorward, D. Huang, S.A. Savari, G. Schuller, and B. Yu, "Low delay perceptually lossless coding of audio signals," in Proc. of Data Compression Conference (DCC-01), pp. 312–320, Mar. 2001.

[DTS] The Digital Theater Systems (DTS). Available online at www.dtsonline.com.

[Duen02] A. Duenas, R. Perez, B. Rivas, E. Alexandre, and A. Pena, "A robust and efficient implementation of MPEG-2/4 AAC natural audio coders," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5556, May 2002.

[Duha91] P. Duhamel et al., "A fast algorithm for the implementation of filter banks based on time domain aliasing cancellation," in Proc. IEEE ICASSP-91, pp. 2209–2212, Toronto, Canada, May 1991.

[Dunn96] C. Dunn and M. Sandler, "A comparison of dithered and chaotic sigma-delta modulators," J. Audio Eng. Soc., vol. 44, no. 4, pp. 227–244, Apr. 1996.

[Duro96] X. Durot and J.-B. Rault, "A new noise injection model for audio compression algorithms," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4374, Nov. 1996.

[DVD01] DVD Forum, "DVD Specifications for Read-Only Disc," Part 4: Audio Specifications, Version 1.2, Tokyo, Japan, Mar. 2001.

[DVD96] DVD Consortium, "DVD-video and DVD-ROM specifications," 1996.

[DVD99] DVD Forum, "DVD Specifications for Read-Only Disc," Part 4: Audio Specifications, Packed PCM: MLP Reference Information, Version 1.0, Tokyo, Japan, Mar. 1999.

[East97] P.C. Eastty, C. Sleight, and P.D. Thorpe, "Research on cascadable filtering, equalization, gain control, and mixing of 1-bit signals for professional audio applications," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4444, Mar. 22–25, 1997.

[EBU99] EBU Technical Recommendation R96-1999, Formats for production and delivery of multichannel audio programs, European Broadcasting Union, Geneva, Switzerland, 1999.

[Edle89] B. Edler, "Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen (Coding of Audio Signals with Overlapping Block Transform and Adaptive Window Functions)," Frequenz, pp. 252–256, 1989.

[Edle95] B. Edler, "Technical Description of the MPEG-4 Audio Coding Proposal from University of Hannover and Deutsche Bundespost Telekom," ISO/IEC JTC1/SC29/WG11 MPEG95/M0414, Oct. 1995.

[Edle96a] B. Edler and H. Purnhagen, "Technical Description of the MPEG-4 Audio Coding Proposal from University of Hannover and Deutsche Telekom AG," ISO/IEC JTC1/SC29/WG11 MPEG96/M0632, Jan. 1996.

[Edle96a] B. Edler, "Current status of the MPEG-4 audio verification model development," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4376, Nov. 1996.

[Edle96b] B. Edler and L. Contin, "MPEG-4 Audio Test Results (MOS Test)," ISO/IEC JTC1/SC29/WG11 N1144, Jan. 1996.

[Edle96b] B. Edler et al., "ASAC – Analysis/Synthesis Audio Codec for very low bit rates," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4179, May 1996.

[Edle96c] B. Edler and H. Purnhagen, "Parametric audio coding," in Proc. International Conference on Signal Processing (ICSP 2000), Beijing, 2000.

[Edle97] B. Edler, "Very low bit rate audio coding development," in Proc. 14th Audio Eng. Soc. Int. Conf., June 1997.

[Edle98] B. Edler and H. Purnhagen, "Concepts for hybrid audio coding schemes based on parametric techniques," in Proc. 105th Conv. Audio Eng. Soc., preprint 4808, 1998.

[Edle99] B. Edler, "Speech coding in MPEG-4," Int. J. of Speech Technology, vol. 2, no. 4, pp. 289–303, May 1999.

[Egan50] J. Egan and H. Hake, "On the masking pattern of a simple auditory stimulus," J. Acoust. Soc. Am., vol. 22, pp. 622–630, 1950.

[Ehre04] A. Ehret et al., "aacPlus, only a low-bitrate codec?" in 117th Conv. Aud. Eng. Soc., preprint 6199, San Francisco, CA, Oct. 2004.

[Ekud99] R. Ekudden, R. Hagen, I. Johansson, and J. Svedburg, "The adaptive multi-rate speech coder," in Proc. IEEE Workshop on Speech Coding, pp. 117–119, June 1999.

[Elli96] P.W. Ellis, "Prediction-Driven Computational Auditory Scene Analysis," Ph.D. Thesis, Massachusetts Institute of Technology, June 1996.

[Erku03] S. Erkucuk, S. Krishnan, and M. Zeytinoglu, "Robust audio watermarking using a chirp based technique," in Proc. ICME-03, vol. 2, pp. 513–516, July 2003.

[Esma03] S. Esmaili, S. Krishnan, and K. Raahemifar, "A novel spread spectrum audio watermarking scheme based on time-frequency characteristics," in Proc. of CCECE-03, vol. 3, pp. 1963–1966, May 2003.

[Este77] D. Esteban and C. Galand, "Application of quadrature mirror filters to split band voice coding schemes," in Proc. IEEE ICASSP-77, vol. 2, pp. 191–195, Hartford, CT, May 1977.

[ETSI98] ETSI, AMR Qualification Phase Documentation, 1998.

[Fall01] C. Faller and F. Baumgarte, "Efficient representation of spatial audio using perceptual parameterization," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 2001.

[Fall02a] C. Faller and F. Baumgarte, "Binaural cue coding: a novel and efficient representation of spatial audio," in Proc. IEEE ICASSP-02, vol. 2, pp. 1841–1844, Orlando, FL, May 2002.

[Fall02b] C. Faller and F. Baumgarte, "Estimation of auditory spatial cues for binaural cue coding," in Proc. IEEE ICASSP-02, vol. 2, pp. 1801–1804, Orlando, FL, May 2002.

[Fall02b] C. Faller, B.-H. Juang, P. Kroon, H.-L. Lou, S.A. Ramprashad, and C.-E.W. Sundberg, "Technical advances in digital audio radio broadcasting," Proc. of the IEEE, vol. 90, no. 8, pp. 1303–1333, Aug. 2002.

[Feit98] B. Feiten et al., "Dynamically scalable audio internet transmission," in Proc. 104th Conv. Audio Eng. Soc., preprint 4686, May 1998.

[Fejz00] Z. Fejzo, S. Smyth, K. McDowell, and Y.-L. You, "Backward compatible enhancement of DTS multi-channel audio coding that delivers 96-kHz/24-bit audio quality," in Proc. of 109th Conv. Aud. Eng. Soc., preprint 5259, Sept. 22–25, 2000.

[Ferr96a] A. Ferreira, "Convolutional effects in transform coding with TDAC: an optimal window," IEEE Trans. on Speech and Audio Proc., vol. 4, no. 2, pp. 104–114, Mar. 1996.

[Ferr96b] A.J.S. Ferreira, "Perceptual coding of harmonic signals," in Proc. 100th Conv. Audio Eng. Soc., preprint 4746, May 1996.

[Ferr01a] A.J.S. Ferreira, "Combined spectral envelope normalization and subtraction of sinusoidal components in the ODFT and MDCT frequency domains," IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, pp. 51–54, Oct. 2001.

[Ferr01b] A.J.S. Ferreira, "Accurate estimation in the ODFT domain of the frequency, phase and magnitude of stationary sinusoids," IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, pp. 47–51, Oct. 2001.

[Fiel91] L. Fielder and G. Davidson, "AC-2: a family of low complexity transform-based music coders," in Proc. 10th AES Int. Conf., Sep. 1991.

[Fiel96] L. Fielder et al., "AC-2 and AC-3: low-complexity transform-based audio coding," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 54–72, 1996.

[Fiel04] L. Fielder et al., "Introduction to Dolby Digital Plus, an enhancement to the Dolby Digital coding system," in Proc. 117th AES Convention, paper 6196, San Francisco, CA, Oct. 2004.

[Flet37] H. Fletcher and W. Munson, "Relation between loudness and masking," J. Acoust. Soc. Am., vol. 9, pp. 1–10, 1937.

[Flet40] H. Fletcher, "Auditory patterns," Rev. Mod. Phys., pp. 47–65, Jan. 1940.

[Flet91] N. Fletcher and T. Rossing, The Physics of Musical Instruments, Springer, New York, 1991.

[Flet98] J. Fletcher, "ISO/MPEG Layer 2 – optimum re-encoding of decoded audio using a MOLE signal," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4706, Amsterdam, 1998. See also http://www.bbc.co.uk/atlantic.

[Foo01] S.W. Foo, T.H. Yeo, and D.Y. Huang, "An adaptive audio watermarking system," in Proc. Int. Conf. on Electrical and Electronic Technology, vol. 2, pp. 509–513, Aug. 2001.

[Foss95] R. Foss and T. Mosala, "Routing MIDI messages over Ethernet," in Proc. 99th Conv. Aud. Eng. Soc., preprint 4057, Oct. 1995.

[FS1015] Federal Standard 1015, Telecommunications: Analog to Digital Conversion of Radio Voice by 2,400 bit/second Linear Predictive Coding, National Communication System – Office of Technology and Standards, Nov. 1985.

[FS1016] Federal Standard 1016, Telecommunications: Analog to Digital Conversion of Radio Voice by 4,800 bit/second Code Excited Linear Prediction (CELP), National Communication System – Office of Technology and Standards, Feb. 1991.

[Fuch93] H. Fuchs, "Improving joint stereo audio coding by adaptive inter-channel prediction," in Proc. IEEE ASSP Workshop on Apps. of Sig. Proc. to Aud. and Acous., 1993.

[Fuch95] H. Fuchs, "Improving MPEG audio coding by backward adaptive linear stereo prediction," in Proc. 99th Conv. Aud. Eng. Soc., preprint 4086, Oct. 1995.

[Furo00] T. Furon, N. Moreau, and P. Duhamel, "Audio public key watermarking technique," in Proc. IEEE ICASSP-2000, vol. 4, pp. 1959–1962, Istanbul, Turkey, June 2000.

[G722] ITU Recommendation G.722, "7 kHz Audio Coding within 64 kb/s," Blue Book, vol. III, Fascicle III, Oct. 1988.

[G7222] ITU Recommendation G.722.2, "Wideband coding of speech at around 16 kb/s using Adaptive Multi-rate Wideband (AMR-WB)," Jan. 2001.

[G7231] ITU Recommendation G.723.1, "Dual Rate Speech Coder for Multimedia Communications transmitting at 5.3 and 6.3 kb/s," Draft, 1995.

[G726] ITU Recommendation G.726, "24, 32, 40 kb/s Adaptive Differential Pulse Code Modulation (ADPCM)," Blue Book, vol. III, Fascicle III.3, Oct. 1988.

[G728] ITU Draft Recommendation G.728, "Coding of Speech at 16 kbit/s Using Low-Delay Code Excited Linear Prediction (LD-CELP)," 1992.

[G729] ITU Study Group 15 Draft Recommendation G.729, "Coding of Speech at 8 kb/s using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP)," 1995.

[G721] CCITT Recommendation G.721, "32 kb/s adaptive differential pulse code modulation (ADPCM)," in Blue Book, vol. III, Fascicle III, Oct. 1988.

[G722] CCITT (ITU-T) Rec. G.722, "7 kHz audio coding within 64 kbit/s," Blue Book, vol. III, fasc. III.4, pp. 269–341, 1988.

[G726] ITU Recommendation G.726 (formerly G.721), "24, 32, 40 kb/s Adaptive differential pulse code modulation (ADPCM)," Blue Book, vol. III, Fascicle III.3, Oct. 1988.

[G728] International Telecommunications Union, Rec. G.728, "Coding of Speech at 16 kbps Using Low-Delay Code Excited Linear Prediction," Sept. 1992.

[G729] ITU-T Recommendation G.729, "Coding of speech at 8 kbps using conjugate-structure algebraic code excited linear prediction (CS-ACELP)," 1995.

[Gang02] L. Gang, A.N. Akansu, and M. Ramkumar, "Security and synchronization in watermark sequence," in Proc. IEEE ICASSP-2002, vol. 4, pp. 3736–3739, Orlando, FL, May 2002.

[Gao01a] Y. Gao et al., "The SMV algorithm selected by TIA and 3GPP2 for CDMA applications," in Proc. IEEE ICASSP-01, vol. 2, pp. 709–712, Salt Lake City, UT, May 2001.

[Gao01b] Y. Gao et al., "EX-CELP: A speech coding paradigm," in Proc. IEEE ICASSP-01, vol. 2, pp. 689–692, Salt Lake City, UT, May 2001.

[Garc99] R.A. Garcia, "Digital watermarking of audio signals using a psychoacoustic auditory model and spread spectrum theory," in Proc. 107th Conv. Aud. Eng. Soc., preprint 5073, New York, Sept. 1999.

[Gass54] G. Gassler, "Über die Hörschwelle für Schallereignisse mit verschieden breitem Frequenzspektrum," Acustica, vol. 4, pp. 408–414, 1954.

[Gbur96] U. Gbur, M. Werner, and M. Dietz, "Realtime implementation of an ISO/MPEG Layer 3 encoder on Pentium PCs," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4386, Nov. 1996.

[Geig01] R. Geiger, T. Sporer, J. Koller, and K. Brandenburg, "Audio coding based on integer transforms," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5471, New York, 2001.

[Geig02] R. Geiger, J. Herre, J. Koller, and K. Brandenburg, "IntMDCT – A link between perceptual and lossless audio coding," in Proc. IEEE ICASSP-2002, vol. 2, pp. 1813–1816, Orlando, FL, May 2002.

[Geig03] R. Geiger, J. Herre, G. Schuller, and T. Sporer, "Fine grain scalable perceptual and lossless audio coding based on IntMDCT," in Proc. IEEE ICASSP-2003, Hong Kong, April 2003.

[Geor87] E.B. George and M.J.T. Smith, "A new speech coding model based on a least-squares sinusoidal representation," in Proc. IEEE ICASSP-87, pp. 1641–1644, Dallas, TX, Apr. 1987.

[Geor90] E.B. George and M.J.T. Smith, "Perceptual considerations in a low bit rate sinusoidal vocoder," in Proc. IEEE Int. Phoenix Conf. on Comp. and Comm., pp. 268–275, Mar. 1990.

[Geor92] E.B. George and M.J.T. Smith, "Analysis-by-synthesis/overlap-add sinusoidal modeling applied to the analysis and synthesis of musical tones," J. Audio Eng. Soc., pp. 497–516, June 1992.

[Geor97] E.B. George and M.J.T. Smith, "Speech analysis/synthesis and modification using an analysis-by-synthesis/overlap-add sinusoidal model," IEEE Trans. on Speech and Audio Proc., pp. 389–406, Sept. 1997.

[Gers83] A. Gersho and V. Cuperman, "Vector quantization: a pattern matching technique for speech coding," IEEE Commun. Mag., vol. 21, no. 9, pp. 15–21, Dec. 1983.

[Gers90a] A. Gersho and S. Wang, "Recent trends and techniques in speech coding," in Proc. 24th Asilomar Conf., Pacific Grove, CA, Nov. 1990.

[Gers90b] I.A. Gerson and M.A. Jasiuk, "Vector sum excited linear prediction (VSELP) speech coding at 8 kb/s," in Proc. of ICASSP-90, vol. 1, pp. 461–464, Albuquerque, NM, Apr. 1990.

[Gers91] I. Gerson and M. Jasiuk, "Techniques for improving the performance of CELP-type speech coders," in Proc. IEEE ICASSP-91, pp. 205–208, Toronto, Canada, May 1991.

[Gers92] A. Gersho and R. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, Norwell, MA, 1992.

[Gerz99] M.A. Gerzon, J.R. Stuart, R.J. Wilson, P.G. Craven, and M.J. Law, "The MLP Lossless Compression System," in AES 17th Int. Conf.: High-Quality Audio Coding, pp. 61–75, Florence, Italy, Sept. 1999.

[Geye99] S. Geyersberger et al., "MPEG-2 AAC multichannel realtime implementation on floating-point DSPs," in Proc. 106th Conv. Aud. Eng. Soc., preprint 4977, May 1999.

[Gibs74] J. Gibson, "Adaptive prediction in speech differential encoding systems," Proc. of the IEEE, vol. 68, pp. 1789–1797, Nov. 1974.

[Gibs78] J. Gibson, "Sequentially adaptive backward prediction in ADPCM speech coders," IEEE Trans. Commun., pp. 145–150, Jan. 1978.

[Gibs90] J. Gibson et al., "Backward adaptive prediction in multi-tree speech coders," in Advances in Speech Coding, B. Atal, V. Cuperman, and A. Gersho, Eds., Kluwer, Norwell, MA, 1990, pp. 5–12.

[Gilc98] N. Gilchrist, "ATLANTIC audio: preserving technical quality during low bit rate coding and decoding," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4694, Amsterdam, May 1998.

[Giur98] C.D. Giurcaneanu, I. Tabus, and J. Astola, "Linear prediction from subbands for lossless audio compression," in Proc. Norsig-98, 3rd IEEE Nordic Signal Processing Symposium, pp. 225–228, Vigsø, Denmark, June 1998.

[Glas90] B. Glasberg and B.C.J. Moore, "Derivation of auditory filter shapes from notched-noise data," Hearing Res., vol. 47, pp. 103–138, 1990.

[Golo66] S.W. Golomb, "Run-length encodings," IEEE Trans. on Info. Theory, vol. 12, pp. 399–401, 1966.

[Good96] M. Goodwin, "Residual modeling in music analysis-synthesis," in Proc. IEEE ICASSP-96, Atlanta, GA, May 1996.

[Good97] M. Goodwin, "Adaptive Signal Models: Theory, Algorithms, and Audio Applications," Ph.D. Dissertation, University of California at Berkeley, 1997.

[Gray84] R. Gray, "Vector quantization," IEEE ASSP Mag., vol. 1, no. 2, pp. 4–29, Apr. 1984.

[Gree61] D.D. Greenwood, "Critical bandwidth and the frequency coordinates of the basilar membrane," J. Acoust. Soc. Am., pp. 1344–1356, Oct. 1961.

[Gree90] D.D. Greenwood, "A cochlear frequency-position function for several species – 29 years later," J. Acoust. Soc. Am., vol. 87, pp. 2592–2605, June 1990.

[Grew91] C. Grewin and T. Ryden, "Subjective assessments on low bit-rate codecs," in Proc. 10th Int. Conf. Aud. Eng. Soc., pp. 91–102, 1991.

[Gril02] B. Grill et al., "Characteristics of audio streaming over IP networks within the ISMA Standard," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5514, May 2002.

[Gril94] B. Grill et al., "Improved MPEG-2 audio multi-channel encoding," in Proc. 96th Conv. Aud. Eng. Soc., preprint 3865, Feb. 1994.

[Gril97] B. Grill, "A bit-rate scalable perceptual coder for MPEG-4 audio," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4620, Sept. 1997.

[Gril99] B. Grill, "The MPEG-4 general audio coder," in Proc. AES 17th Int. Conf., Sept. 1999.

[Gros03] A. Groschel et al., "Enhancing audio coding efficiency of MPEG Layer-2 with spectral band replication for digital radio (DAB) in a backwards compatible way," in 114th Conv. Aud. Eng. Soc., preprint 5850, Amsterdam, Netherlands, Mar. 2003.

[Grub01] M. Grube, P. Siepen, C. Mittendorf, J. Friess, and S. Bauer, "Applications of MPEG-4 digital multimedia broadcasting," in Proc. of Int. Conf. on Consumer Electronics, ICCE-01, pp. 136–137, June 2001.

[Gruh96] D. Gruhl, A. Lu, and W. Bender, "Echo hiding," in Proc. of Information Hiding Workshop '96, Cambridge, UK, pp. 295–316, May 1996.

[GSM89] GSM 06.10, "GSM full-rate transcoding," Technical Report, Version 3.2, ETSI/GSM, July 1989.

[GSM96a] GSM 06.60, "GSM Digital Cellular Communication Standards: Enhanced Full-Rate Transcoding," ETSI/GSM, 1996.

[GSM96b] GSM 06.20, "GSM Digital Cellular Communication Standards: Half Rate Speech; Half Rate Speech Transcoding," ETSI/GSM, 1996.

[Hadd95] R. Haddad and K. Park, "Modeling, analysis, and optimum design of quantized M-band filterbanks," IEEE Trans. Sig. Proc., vol. 43, no. 11, pp. 2540–2549, Nov. 1995.

[Hall97] J.L. Hall, "Asymmetry of masking revisited: generalization of masker and probe bandwidth," J. Acoust. Soc. Am., vol. 101, pp. 1023–1033, Feb. 1997.

[Hall98] J.L. Hall, "Auditory psychophysics for coding applications," in The Digital Signal Processing Handbook, V. Madisetti and D. Williams, Eds., CRC Press, Boca Raton, FL, pp. 39-1–39-25, 1998.

[Hamd96] K.N. Hamdy et al., "Low bit rate high quality audio coding with combined harmonic and wavelet representations," in Proc. IEEE ICASSP-96, pp. 1045–1048, Atlanta, GA, May 1996.

[Hans96] M.C. Hans and V. Bhaskaran, "A fast integer-based, CPU scalable, MPEG-1 Layer-II audio decoder," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4359, Nov. 1996.

[Hans98a] M. Hans, Optimization of Digital Audio for Internet Transmission, Ph.D. dissertation, School of Elect. Comp. Eng., Georgia Inst. Technol., Atlanta, 1998.

[Hans98b] M. Hans and R.W. Schafer, "AudioPaK – an integer arithmetic lossless audio codec," in Proc. Data Compression Conf., p. 550, Snowbird, UT, 1998.

[Hans01] M. Hans and R.W. Schafer, "Lossless compression of digital audio," IEEE Sig. Proc. Mag., vol. 18, no. 4, pp. 21–32, July 2001.

[Harm96] A. Härmä et al., "Warped linear prediction (WLP) in audio coding," in Proc. NORSIG '96, pp. 367–370, Sep. 1996.

[Harm97a] A. Härmä et al., "An experimental audio codec based on warped linear prediction of complex valued signals," in Proc. IEEE ICASSP-97, pp. 323–326, Munich, Germany, April 1997.

[Harm97b] A. Härmä et al., "WLPAC – a perceptual audio codec in a nutshell," in Proc. 102nd Conv. Audio Eng. Soc., preprint 4420, Munich, 1997.

[Harm98] A. Härmä et al., "A warped linear predictive stereo codec using temporal noise shaping," in Proc. NORSIG-98, pp. 229–232, May 1998.

[Harm00] A. Härmä, "Implementation of frequency-warped recursive filters," Signal Processing, vol. 80, pp. 543–548, Feb. 2000.

[Harm01] A. Härmä and U.K. Laine, "A comparison of warped and conventional linear prediction," IEEE Trans. Speech Audio Proc., vol. 9, no. 5, pp. 579–588, July 2001.

[Harp03] P. Harpe, D. Reefman, and E. Janssen, "Efficient trellis-type sigma-delta modulator," in Proc. 114th Conv. Aud. Eng. Soc., preprint 5845, Amsterdam, Mar. 22–25, 2003.

[Hart99] F. Hartung and M. Kutter, "Multimedia watermarking techniques," Proc. of the IEEE, vol. 87, no. 7, pp. 1079–1107, July 1999.

[Hask97] B.G. Haskell, A. Puri, and A.N. Netravali, Digital Video: An Introduction to MPEG-2, Chapman & Hall, New York, 1997.

[Hawk02] M.O.J. Hawksford, "Time-quantized frequency modulation with time dispersive codes for the generation of sigma-delta modulation," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5618, Munich, Germany, May 10–13, 2002.

[Hay96] K. Hay et al., "A low delay subband audio coder (20 Hz – 15 kHz) at 64 kbit/s," in Proc. IEEE Int. Symp. on Time-Freq. and Time-Scale Anal., pp. 265–268, June 1996.

[Hay97] K. Hay et al., "The D5 lattice quantization for a 64 kbit/s low-delay subband audio coder with a 15 kHz bandwidth," in Proc. IEEE ICASSP-97, pp. 319–322, Munich, Germany, Apr. 1997.

[Hede81] P. Hedelin, "A tone-oriented voice-excited vocoder," in Proc. IEEE ICASSP-81, pp. 205–208, Atlanta, GA, Mar. 1981.

[Hedg97] R. Hedges, "Hybrid wavelet packet analysis," in Proc. 31st Asilomar Conf. on Sig., Sys., and Comp., Oct. 1997.

[Hedg98a] R. Hedges and D. Cochran, "Hybrid wavelet packet analysis," in Proc. IEEE SP Int. Sym. on Time-Freq. and Time-Scale Anal., Oct. 1998.

[Hedg98b] R. Hedges and D. Cochran, "Hybrid wavelet packet analysis: a top down approach," in Proc. 32nd Asilomar Conf. on Sig., Sys., and Comp., Nov. 1998.

[Hell72] R. Hellman, "Asymmetry of masking between noise and tone," Percep. and Psychophys., vol. 11, pp. 241–246, 1972.

[Hell01] O. Hellmuth et al., "Advanced audio identification using MPEG-7 content description," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5463, Nov.–Dec. 2001.

[Herl95] C. Herley, "Boundary filters for finite-length signals and time-varying filter banks," IEEE Trans. Circ. Sys. II, vol. 42, pp. 102–114, Feb. 1995.

[Herm90] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, Apr. 1990.

[Herm02] K. Hermus, W. Verhelst, and P. Wambacq, "Psychoacoustic modeling of audio with exponentially damped sinusoids," in Proc. IEEE ICASSP-02, vol. 2, pp. 1821–1824, Orlando, FL, 2002.

[Herr92a] J. Herre et al., "Advanced audio measurement system using psychoacoustic properties," in Proc. 92nd Conv. Aud. Eng. Soc., preprint 3321, Mar. 1992.

[Herr92b] J. Herre et al., "Analysis tool for realtime measurements using perceptual criteria," in Proc. 11th Int. Conf. Aud. Eng. Soc., pp. 180–190, May 1992.

[Herr95] J. Herre, K. Brandenburg, E. Eberlein, and B. Grill, "Second-generation ISO/MPEG-Audio Layer III coding," in Proc. 98th Conv. Aud. Eng. Soc., preprint 3939, Feb. 1995.

[Herr96] J. Herre and J.D. Johnston, "Enhancing the performance of perceptual audio coders by using temporal noise shaping (TNS)," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4384, Nov. 1996.

[Herr98] J. Herre and D. Schulz, "Extending the MPEG-4 AAC codec by perceptual noise substitution," in Proc. 104th Conv. Audio Eng. Soc., preprint 4720, Amsterdam, 1998.

[Herr98a] J. Herre and D. Schulz, "Extending the MPEG-4 AAC codec by perceptual noise substitution," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4720, May 1998.

[Herr98b] J. Herre et al., "The integrated filterbank-based scalable MPEG-4 audio coder," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4810, Sept. 1998.

[Herr98c] J. Herre, E. Allamanche, R. Geiger, and T. Sporer, "Proposal for a low delay MPEG-4 audio coder based on AAC," ISO/IEC JTC1/SC29/WG11 M4139, Oct. 1998.

[Herr99] J. Herre, E. Allamanche, R. Geiger, and T. Sporer, "Information and proposed enhancements for MPEG-4 low delay audio coding," ISO/IEC JTC1/SC29/WG11 M4560, Mar. 1998.

[Herr00a] J. Herre and B. Grill, "Overview of MPEG-4 audio and its applications in mobile communications," in Proc. of 5th Int. Conf. on Sig. Proc. (ICSP-00), vol. 1, pp. 11–20, Aug. 2000.

[Herr01a] J. Herre, C. Neubauer, and F. Siebenhaar, "Combined compression/watermarking for audio signals," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5344, Amsterdam, May 2001.

[Herr01b] J. Herre, C. Neubauer, F. Siebenhaar, and R. Kulessa, "New results on combined audio compression/watermarking," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5442, New York, Dec. 2001.

[Herr01c] J. Herre, R. Kulessa, and C. Neubauer, "A compatible family of bitstream watermarking schemes for MPEG-Audio," in Proc. 110th Conv. Audio Eng. Soc., preprint 5346, Amsterdam, May 2001.

[Herr04a] J. Herre et al., "MP3 Surround: Efficient and compatible coding of multichannel audio," in Proc. 116th AES Convention, paper 6049, Berlin, Germany, May 2004.

[Herr04b] J. Herre et al., "From joint stereo to spatial audio coding – recent progress and standardization," in 6th International Conference on Digital Audio Effects, Naples, Italy, Oct. 2004.

[Herr05] J. Herre et al., "The reference model architecture for MPEG spatial audio coding," in Proc. 118th AES Convention, paper 6447, Barcelona, Spain, May 2005.

[Herre98] J. Herre et al., "The integrated filterbank based scalable MPEG-4 audio coder," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4810, Sep. 1998.

[Heus02] R. Heusdens and S. van de Par, "Rate-distortion optimal sinusoidal modeling of audio and speech using psychoacoustical matching pursuits," in Proc. ICASSP-02, vol. 2, pp. 1809–1812, 2002.

[Hilp98] J. Hilpert, M. Braun, M. Lutzky, S. Geyersberger, and R. Buchta, "Implementing ISO/MPEG-2 Advanced Audio Coding realtime on a fixed-point DSP," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4822, Sept. 1998.

[Hilp00] J. Hilpert et al., "Real-time implementation of the MPEG-4 low delay advanced audio coding algorithm (AAC-LD) on Motorola DSP56300," in Proc. 108th Conv. Aud. Eng. Soc., preprint 5081, Feb. 2000.

[Holm99] T. Holman, "New factors in sound for cinema and television," J. Aud. Eng. Soc., vol. 39, pp. 529–539, July–Aug. 1999.

[Hong01] J.-W. Hong, T.-J. Lee, and J.S. Lucas, "Design and implementation of a real-time audio service using MPEG-2 AAC and streaming technology," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5343, May 2001.

[Hoog94] A. Hoogendoorn, "Digital compact cassette," Proc. of the IEEE, vol. 82, no. 10, pp. 1479–1489, Oct. 1994.

[Hori96] N. Horikawa and P.C. Eastty, "One-bit audio recording," in AES UK Audio for New Media Conference, Apr. 1996, London.

[Horv02] P. Horvatic and N. Schiffner, "Security issues in high quality audio streaming," in Proc. of 2nd Int. Conf. on Web Delivering of Music (WEDELMUSIC-02), p. 224, Dec. 2002.

[Howa94] P. Howard and J. Vitter, "Arithmetic coding for data compression," Proc. of the IEEE, vol. 82, no. 6, pp. 857–865, June 1994.

[Hsie02] C.T. Hsieh and P.Y. Sou, "Blind cepstrum domain audio watermarking based on time energy features," in Proc. Int. Conf. on DSP-2002, vol. 2, pp. 705–708, July 2002.

[Huan02] D.Y. Huang, J.N. Al Lee, S.W. Foo, and L. Weisi, "Reed-Solomon codes for MPEG advanced audio coding," in Proc. 113th Conv. Aud. Eng. Soc., preprint 5682, Oct. 2002.

[Huan02] J. Huang, Y. Wang, and Y.Q. Shi, "A blind audio watermarking algorithm with self-synchronization," in Proc. IEEE ISCAS-2002, vol. 3, pp. 627–630, Orlando, FL, May 2002.

[Hube98] D.M. Huber, The MIDI Manual, second edition, Focal Press, MA, 1998.

[Huff52] D. Huffman, "A method for the construction of minimum redundancy codes," in Proc. of the Institute of Radio Engineers, vol. 40, pp. 1098–1101, Sept. 1952.

[Huij02] Y. Huijuan, J.C. Patra, and C.W. Chan, "An artificial neural network-based scheme for robust watermarking of audio signals," in Proc. IEEE ICASSP-2002, vol. 1, pp. 1029–1032, Orlando, FL, May 2002.

[Hwan01] Y.-T. Hwang, N.-J. Liu, and M.-C. Tsai, "An MPEG-4 Twin-VQ based high quality audio codec design," in IEEE Workshop on Sig. Proc. Systems, pp. 321–331, Sept. 2001.

[IECA87] International Electrotechnical Commission/American National Standards Institute (IEC/ANSI), CEI-IEC 908, "Compact-Disc Digital Audio System" ("red book"), 1987.

[Iked95] K. Ikeda et al., "Error protected TwinVQ audio coding at less than 64 kbit/s," in Proc. IEEE Speech Coding Workshop, pp. 33–34, 1995.

[Iked99] M. Ikeda, K. Takeda, and F. Itakura, "Audio data hiding by use of band-limited random sequences," in Proc. IEEE ICASSP-99, vol. 4, pp. 2315–2318, Phoenix, AZ, Mar. 1999.

[ISO-V96] Multifunctional Ad hoc Group, "Core Experiments Description," ISO/IEC JTC1/SC29/WG11 N1266, March 1996.

[ISOI90] ISO/IEC JTC1/SC2/WG11-MPEG, "MPEG/Audio Test Report 1990," Doc. MPEG90/N0030, 1990.

[ISOI91] ISO/IEC JTC1/SC2/WG11, "MPEG/Audio Test Report 1991," Doc. MPEG91/0010, 1991.

[ISOI92] ISO/IEC JTC1/SC29/WG11 MPEG, IS 11172-3, "Information Technology – Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s, Part 3: Audio," 1992 ("MPEG-1").

[ISOI94] ISO/IEC JTC1/SC29/WG11 MPEG, IS 13818-3, "Information Technology – Generic Coding of Moving Pictures and Associated Audio, Part 3: Audio," 1994 ("MPEG-2 BC-LSF").

[ISOI94a] ISO/IEC JTC1/SC29/WG11 MPEG, IS 13818-3, "Information Technology – Generic Coding of Moving Pictures and Associated Audio, Part 3: Audio," 1994 ("MPEG-2 BC-LSF").

[ISOI94b] ISO/IEC JTC1/SC29/WG11 MPEG94/443, "Requirements for Low Bitrate Audio Coding/MPEG-4 Audio," 1994 ("MPEG-4").

[ISOI96] ISO/IEC JTC1/SC29/WG11 MPEG, Committee Draft 13818-7, "Generic Coding of Moving Pictures and Associated Audio: Audio (non backwards compatible coding, NBC)," 1996 ("MPEG-2 NBC/AAC").

[ISOI96a] ISO/IEC JTC1/SC29/WG11 MPEG, Committee Draft 13818-7, "Generic Coding of Moving Pictures and Associated Audio: Audio (non backwards compatible coding, NBC)," 1996 ("MPEG-2 NBC/AAC").

[ISOI96b] ISO/IEC JTC1/SC29/WG11, "MPEG-4 Audio Test Results (MOS Tests)," ISO/IEC JTC1/SC29/WG11 N1144, Munich, Jan. 1996.

[ISOI96c] ISO/IEC JTC1/SC29/WG11, "Report of the Ad Hoc Group on the Evaluation of New Audio Submissions to MPEG-4," ISO/IEC JTC1/SC29/WG11 MPEG96/M0680, Munich, Jan. 1996.

[ISOI96d] ISO/IEC JTC1/SC29/WG11 N1420, "Overview of the Report on the Formal Subjective Listening Tests of MPEG-2 AAC Multichannel Audio Coding," Nov. 1996.

[ISOI97a] ISO/IEC JTC1/SC29/WG11, "MPEG-4 Audio Committee Draft 14496-3," ISO/IEC JTC1/SC29/WG11 N1903, Oct. 1997. (Available at http://www.tnt.uni-hannover.de/project/mpeg/audio/documents.)

[ISOI97b] ISO/IEC 13818-7, "Information Technology – Generic Coding of Moving Pictures and Associated Audio, Part 7: Advanced Audio Coding," 1997.

[ISOI98] ISO/IEC JTC1/SC29/WG11 (MPEG) document N2011, "Results of AAC and TwinVQ Tool Comparative Tests," San Jose, CA, 1998.

[ISOI99] ISO/IEC JTC1/SC29/WG11 (MPEG), International Standard ISO/IEC 14496-3, "Coding of Audio-Visual Objects – Part 3: Audio," 1999 ("MPEG-4").

[ISOI00] ISO/IEC JTC1/SC29/WG11 (MPEG), International Standard ISO/IEC 14496-3 AMD-1, "Coding of Audio-Visual Objects – Part 3: Audio," 2000 ("MPEG-4 version-2").

[ISOI01a] ISO/IEC JTC1/SC29/WG11 MPEG, IS 15938-2, "Multimedia Content Description Interface: Description Definition Language (DDL)," International Standard, Sept. 2001 ("MPEG-7 Part-2").

[ISOI01b] ISO/IEC JTC1/SC29/WG11 MPEG, IS 15938-4, "Multimedia Content Description Interface: Audio," International Standard, Sept. 2001 ("MPEG-7 Part-4").

[ISOI01c] ISO/IEC JTC1/SC29/WG11 MPEG, IS 15938-5, "Multimedia Content Description Interface: Multimedia Description Schemes (MDS)," International Standard, Sept. 2001 ("MPEG-7 Part-5").

[ISOI01d] ISO/IEC JTC1/SC29/WG11 MPEG, "MPEG-7 Applications Document," Doc. N3934, Jan. 2001. Also available online: http://www.cselt.it/mpeg.

[ISOI01e] ISO/IEC JTC1/SC29/WG11 MPEG, "Introduction to MPEG-7," Doc. N4032, Mar. 2001. Also available online: http://www.cselt.it/mpeg.

[ISOI02a] MPEG Requirements Group, MPEG-21 Overview, ISO/MPEG N5231, Oct. 2002.

[ISOI03a] MPEG Requirements Group, MPEG-21 Requirements, ISO/MPEG N5873, July 2003.

[ISOI03b] MPEG Requirements Group, MPEG-21 Architecture, Scenarios and IPMP Requirements, ISO/MPEG N5874, July 2003.

[ISOI03c] ISO/IEC JTC1/SC29/WG11 MPEG, IS 14496-3:2001/Amd.1:2003, "Information technology – Coding of audio-visual objects, Part 3: Audio, Amendment 1: Bandwidth Extension," 2003 ("MPEG-4 SBR").

[ISOII94] ISO-II, "Report on the MPEG/Audio Multichannel Formal Subjective Listening Tests," ISO/MPEG doc. MPEG94/063, ISO/MPEG-II Audio Committee, 1994.

[ISOJ97] ISO/IEC JTC1/SC29/WG1 (1997), JPEG-LS: Emerging lossless/near-lossless compression standard for continuous-tone images, JPEG Lossless.

[IS-54] TIA/EIA-PN 2398 (IS-54), "The 8 kbit/s VSELP Algorithm," 1989.

[IS-96] TIA/EIA/IS-96, "QCELP," Speech Service Option 3 for Wideband Spread Spectrum Digital Systems, TIA, 1992.

[IS-127] TIA/EIA/IS-127, "Enhanced Variable Rate Codec," Speech Service Option 3 for Wideband Spread Spectrum Digital Systems, TIA, 1997.

[IS-641] TIA/EIA/IS-641, "Cellular/PCS Radio Interface – Enhanced Full-Rate Speech Codec," TIA, 1996.

[IS-893] TIA/EIA/IS-893, "Selectable Mode Vocoder," Service Option for Wideband Spread Spectrum Communications Systems, ver. 2.0, Dec. 2001.

[ITUR90] International Telecommunications Union, Radio Communications Sector (ITU-R), "Subjective Assessment of Sound Quality," CCIR Rec. 562-3, vol. X, Part 1, Düsseldorf, 1990.

[ITUR91] ITU-R Document TG10-2/3-E only, "Basic Audio Quality Requirements for Digital Audio Bit Rate Reduction Systems for Broadcast Emission and Primary Distribution," Oct. 1991.

[ITUR92] International Telecommunications Union, Rec. G.728, "Coding of Speech at 16 kbps Using Low-Delay Code Excited Linear Prediction," Sept. 1992.

[ITUR93a] International Telecommunications Union, Radio Communications Sector (ITU-R), Recommendation BS.775-1, "Multi-Channel Stereophonic Sound System With and Without Accompanying Picture," Nov. 1993.

[ITUR93b] ITU-R Task Group 10/2, "CCIR Listening Tests: Network Verification Tests without Commentary Codecs, Final Report," Delayed Contribution 10-2/51, 1993.

[ITUR93c] ITU-R Task Group 10/2, "PAQM Measurements," Delayed Contribution 10-2/51, 1993.

[ITUR94b] ITU-R BS.1116, "Methods for Subjective Assessment of Small Impairments in Audio Systems Including Multichannel Sound Systems," 1994.

[ITUR94c] ITU-R BS.775-1, Multichannel Stereophonic Sound System with and Without Accompanying Picture, International Telecommunication Union, Geneva, Switzerland, 1994.

[ITUR96] International Telecommunications Union, Rec. P.830, "Subjective Performance Assessment of Telephone-Band and Wideband Digital Codecs," 1996.

[ITUR97] International Telecommunications Union, Radio Communications Sector (ITU-R), Recommendation BS.1115, "Low Bit Rate Audio Coding," Geneva, Switzerland, 1997.

[ITUR98] International Telecommunications Union, Radiocommunications Sector (ITU-R), Recommendation BS.1387, "Method for Objective Measurements of Perceived Audio Quality," Dec. 1998.

[ITUT94] International Telecommunications Union, Telecommunications Sector (ITU-T), Recommendation P.80, "Telephone Transmission Quality Subjective Opinion Tests," 1994.

[ITUT96] International Telecommunications Union, Telecommunications Sector (ITU-T), Recommendation P.861, "Objective Quality Measurement of Telephone-Band (300–3400 Hz) Speech Codecs," 1996.

[Iwad92] M. Iwadare et al., "A 128 kb/s hi-fi audio codec based on adaptive transform coding with adaptive block size MDCT," IEEE J. Sel. Areas in Comm., vol. 10, no. 1, pp. 138–144, Jan. 1992.

[Iwak95] N. Iwakami et al., "High-quality audio-coding at less than 64 kbit/s by using transform-domain weighted interleave vector quantization (TWINVQ)," in Proc. IEEE ICASSP-95, pp. 3095–3098, Detroit, MI, May 1995.

[Iwak96] N. Iwakami and T. Moriya, "Transform domain weighted interleave vector quantization (TwinVQ)," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4377, 1996.

[Iwak01] N. Iwakami et al., "Fast encoding algorithms for MPEG-4 TwinVQ audio tool," in Proc. IEEE ICASSP-01, vol. 5, pp. 3253–3256, Salt Lake City, UT, May 2001.

[Jako96] C. Jakob and A. Bradley, "Minimising the effects of subband quantisation of the time domain aliasing cancellation filter bank," in Proc. IEEE ICASSP-96, pp. 1033–1036, Atlanta, GA, May 1996.

[Jans03] E. Janssen and D. Reefman, "Super-audio CD: an introduction," IEEE Sig. Proc. Mag., vol. 20, no. 4, pp. 83–90, July 2003.

[Jawe95] B. Jawerth and W. Sweldens, "Biorthogonal smooth local trigonometric bases," J. Fourier Anal. Appl., vol. 2, no. 2, pp. 109–133, 1995.

[Jaya76] N.S. Jayant, Waveform Quantization and Coding, IEEE Press, New York, 1976.

[Jaya84] N. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice-Hall, Englewood Cliffs, NJ, 1984.

[Jaya90] N.S. Jayant, V. Lawrence, and D. Prezas, "Coding of speech and wideband audio," AT&T Tech. J., vol. 69, no. 5, pp. 25–41, Sept.–Oct. 1990.

[Jaya92] N. Jayant, J. Johnston, and Y. Shoham, "Coding of wideband speech," Speech Comm., vol. 11, no. 2–3, 1992.

[Jaya93] N. Jayant et al., "Signal compression based on models of human perception," Proc. IEEE, pp. 1385–1422, Oct. 1993.

[Jbir98] A. Jbira, N. Moreau, and P. Dymarski, "Low delay coding of wideband audio (20 Hz–15 kHz) at 64 kbps," in Proc. IEEE ICASSP-98, vol. 6, pp. 3645–3648, Seattle, WA, May 1998.

[Jess99] P. Jessop, "The business case for audio watermarking," in Proc. IEEE ICASSP-99, vol. 4, pp. 2077–2078, Phoenix, AZ, Mar. 1999.

[Jest82] W. Jesteadt, S. Bacon, and J. Lehman, "Forward masking as a function of frequency, masker level, and signal delay," J. Acoust. Soc. Am., vol. 71, pp. 950–962, 1982.

[Jetz79] J. Jetzt, "Critical distance measurement of rooms from the sound energy spectrum response," J. Acoust. Soc. Am., vol. 65, pp. 1204–1211, 1979.

[Ji02] Z. Ji, Q. Zhang, W. Zhu, and J. Zhou, "Power-efficient distortion-minimized rate allocation for audio broadcasting over wireless networks," in Proc. of ICME-2002, vol. 1, pp. 257–260, Aug. 2002.

[Joha01] P. Johansson, M. Kazantzidis, R. Kapoor, and M. Gerla, "Bluetooth: an enabler for personal area networking," IEEE Network, vol. 15, no. 5, pp. 28–37, Sept.–Oct. 2001.

[John80] J. Johnston, "A filter family designed for use in quadrature mirror filter banks," in Proc. IEEE ICASSP-80, pp. 291–294, Denver, CO, 1980.

[John88a] J. Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE J. Sel. Areas in Comm., vol. 6, no. 2, pp. 314–323, Feb. 1988.

[John88b] J. Johnston, "Estimation of perceptual entropy using noise masking criteria," in Proc. IEEE ICASSP-88, pp. 2524–2527, New York, NY, May 1988.

[John89] J. Johnston, "Perceptual transform coding of wideband stereo signals," in Proc. IEEE ICASSP-89, pp. 1993–1996, Glasgow, Scotland, May 1989.

[John92a] J. Johnston and A. Ferreira, "Sum-difference stereo transform coding," in Proc. IEEE ICASSP-92, vol. 2, pp. 569–572, San Francisco, CA, May 1992.

[John95] J.D. Johnston et al., "The AT&T Perceptual Audio Coder (PAC)," presented at the AES Convention, New York, Oct. 1995.

[John96] J. Johnston et al., "AT&T Perceptual Audio Coding (PAC)," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 73–81, 1996.

[John96a] J. Johnston, "Audio coding with filter banks," in Subband and Wavelet Transforms, A. Akansu and M.J.T. Smith, Eds., Kluwer Academic Publishers, pp. 287–307, Norwell, MA, 1996.

[John96b] J. Johnston, J. Herre, M. Davis, and U. Gbur, "MPEG-2 NBC audio-stereo and multichannel coding methods," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4383, Los Angeles, Nov. 1996.

[John96c] J. Johnston et al., "AT&T Perceptual Audio Coding (PAC)," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 73–81, 1996.

[John99] J. Johnston et al., "MPEG Audio Coding," in Wavelet, Subband and Block Transforms in Communications and Multimedia, A. Akansu and M. Medley, Eds., Kluwer Academic Publishers, Boston, 1999.

[Juan82] B. Juang and A. Gray, "Multiple stage vector quantization for speech coding," in Proc. IEEE ICASSP-82, pp. 597–600, Paris, France, 1982.

[Jurg96] R.K. Jurgen, "Broadcasting with digital audio," IEEE Spectrum, pp. 52–59, Mar. 1996.

[Kaba86] P. Kabal and R.P. Ramachandran, "The computation of line-spectral frequencies using Chebyshev polynomials," IEEE Trans. Acoust. Speech Signal Processing, vol. 34, pp. 1419–1426, 1986.

[Kalk01a] T. Kalker, "Considerations on watermarking security," IEEE 4th Workshop on Mult. Sig. Proc., pp. 201–206, Oct. 2001.

[kalk01b] T. Kalker, W. Oomen, A.N. Lemma, J. Haitsma, F. Bruekers, and M. Veen, "Robust, multi-functional and high quality audio watermarking technology," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5345, Amsterdam, May 2001.

[Kalk02] T. Kalker and F.M.J. Willems, "Capacity bounds and constructions for reversible data-hiding," in Proc. DSP-2002, vol. 1, pp. 71–76, July 2002.

[Kapu92] R. Kapust, "A human ear related objective measurement technique yields audible error and error margin," in Proc. 11th Int. Conf. Aud. Eng. Soc., pp. 191–201, May 1992.

[Karj85] M. Karjalainen, "A new auditory model for the evaluation of sound quality of audio systems," in Proc. IEEE ICASSP-85, pp. 608–611, Tampa, FL, May 1985.

[Kata93] A. Kataoka et al., "Conjugate structure CELP coder for the CCITT 8 kb/s standardization candidate," in IEEE Workshop on Speech Coding for Telecommunications, pp. 25–26, 1993.

[Kata96] A. Kataoka et al., "An 8-kb/s conjugate structure CELP (CS-CELP) speech coder," IEEE Trans. on Speech and Audio Processing, vol. 4, no. 6, pp. 401–411, Nov. 1996.

[Kato02] H. Kato, "Trellis noise-shaping converters and 1-bit digital audio," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5615, Munich, Germany, May 10–13, 2002.

[Kau98] P. Kauff, J.-R. Ohm, S. Rauthenberg, and T. Sikora, "The MPEG-4 standard and its applications in virtual 3D environments," in 32nd Asilomar Conf. on Sig., Syst. and Computers, vol. 1, pp. 104–107, Nov. 1998.

[Kay89] S. Kay, "A fast and accurate single frequency estimator," IEEE Trans. Acoust., Speech, and Sig. Proc., pp. 1987–1990, Dec. 1989.

[Keyh93] M. Keyhl et al., "NMR measurements of consumer recording devices which use low bit-rate audio coding," in Proc. 94th Conv. Aud. Eng. Soc., preprint 3616, 1993.

[Keyh94] M. Keyhl et al., "NMR measurements on multiple generations audio coding," in Proc. 96th Conv. Aud. Eng. Soc., preprint 3803, 1994.

[Keyh95] M. Keyhl et al., "Maintaining sound quality – experiences and constraints of perceptual measurements in today's and future networks," in Proc. 98th Conv. Aud. Eng. Soc., preprint 3946, 1995.

[Keyh98] M. Keyhl et al., "Quality assurance tests of MPEG encoders for a digital broadcasting system (Part II) – minimizing subjective test efforts by perceptual measurements," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4753, May 1998.

[Kim00] H. Kim, "Stochastic model based audio watermark and whitening filter for improved detection," in Proc. IEEE ICASSP-00, vol. 4, pp. 1971–1974, Istanbul, Turkey, June 2000.

[Kim01] S.-W. Kim, S.-H. Park, and Y.-B. Kim, "Fine grain scalability in MPEG-4 Audio," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5491, Nov.–Dec. 2001.

[Kim02] K.T. Kim, S.K. Jung, Y.C. Park, and D.H. Youn, "A new bandwidth scalable wideband speech/audio coder," in Proc. IEEE ICASSP-02, vol. 1, pp. 13–17, Orlando, FL, May 2002.

[Kim02a] D.-H. Kim, J.-H. Kim, and S.-W. Kim, "Scalable lossless audio coding based on MPEG-4 BSAC," in Proc. 113th Conv. Aud. Eng. Soc., preprint 5679, Oct. 2002.

[Kim02b] J.-W. Kim, B.-D. Choi, S.-H. Park, K.-K. Kim, and S.-J. Ko, "Remote control system using real-time MPEG-4 streaming technology for mobile robot," in Proc. of Int. Conf. on Consumer Electronics, ICCE-02, pp. 200–201, June 2002.

[Kimu02] T. Kimura, K. Kakehi, K. Takeda, and F. Itakura, "Spatial compression of multi-channel audio signals using inverse filters," in Proc. ICASSP-02, vol. 4, p. 4175, May 2002.

[Kirb97] D. Kirby and K. Watanabe, "Formal subjective testing of the MPEG-2 NBC multichannel coding algorithm," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4418, Munich, 1997.

[Kirk83] S. Kirkpatrick et al., "Optimization by simulated annealing," Science, pp. 671–680, May 1983.

[Kiro01a] D. Kirovski and H. Malvar, "Robust spread-spectrum audio watermarking," in Proc. IEEE ICASSP-01, vol. 3, pp. 1345–1348, Salt Lake City, UT, May 2001.

[Kiro01b] D. Kirovski and H. Malvar, "Spread-spectrum audio watermarking: requirements, applications, and limitations," IEEE 4th Workshop on Mult. Sig. Proc., pp. 219–224, Oct. 2001.

[Kiro02] D. Kirovski and H. Malvar, "Embedding and detecting spread-spectrum watermarks under estimation attacks," in Proc. IEEE ICASSP-02, vol. 2, pp. 1293–1296, Orlando, FL, May 2002.

[Kiro03a] D. Kirovski and H.S. Malvar, "Spread-spectrum watermarking of audio signals," IEEE Trans. on Sig. Proc., vol. 51, no. 4, pp. 1020–1033, Apr. 2003.

[Kiro03b] D. Kirovski and F.A.P. Petitcolas, "Blind pattern matching attack on watermarking systems," IEEE Trans. on Sig. Proc., vol. 51, no. 4, pp. 1045–1053, Apr. 2003.

[Kiya94] H. Kiya et al., "A development of symmetric extension method for subband image coding," IEEE Trans. Image Proc., vol. 3, no. 1, pp. 78–81, Jan. 1994.

[Klei90a] W.B. Kleijn et al., "Fast methods for the CELP speech coding algorithm," IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. 38, no. 8, p. 1330, Aug. 1990.

[Klei90b] W.B. Kleijn, "Source-dependent channel coding and its application to CELP," in Advances in Speech Coding, B. Atal, V. Cuperman, and A. Gersho, Eds., pp. 257–266, Kluwer Academic Publishers, Norwell, MA, 1990.

[Klei92] W.B. Kleijn et al., "Generalized analysis-by-synthesis coding and its application to pitch prediction," in Proc. IEEE ICASSP-92, vol. 1, pp. 337–340, San Francisco, CA, Mar. 1992.

[Klei95] W.B. Kleijn and K.K. Paliwal, "An introduction to speech coding," in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, Eds., Kluwer Academic Publishers, Norwell, MA, Nov. 1995.

[Knis98] D.N. Knisely, S. Kumar, S. Laha, and S. Nanda, "Evolution of wireless data services: IS-95 to CDMA2000," IEEE Comm. Mag., vol. 36, pp. 140–149, Oct. 1998.

[Ko02] B.-S. Ko, R. Nishimura, and Y. Suzuki, "Time-spread echo method for digital audio watermarking using PN sequences," in Proc. IEEE ICASSP-02, vol. 2, pp. 2001–2004, Orlando, FL, May 2002.

[Koen96] R. Koenen et al., "MPEG-4: context and objectives," J. Sig. Proc.: Image Comm., Oct. 1996.

[Koen98] R. Koenen, "Overview of the MPEG-4 Standard," ISO/IEC JTC1/SC29/WG11 N2323, July 1998. Available online at http://www.cselt.it/mpeg/standards/mpeg-4/mpeg-4.html.

[Koen99] R. Koenen, "Overview of the MPEG-4 Standard," ISO/IEC JTC1/SC29/WG11 N3156, Dec. 1999. Available online at http://www.cselt.it/mpeg/standards/mpeg-4/mpeg-4.htm.

[Koil91] R. Koilpillai and P.P. Vaidyanathan, "New results on cosine-modulated FIR filter banks satisfying perfect reconstruction," in Proc. IEEE ICASSP-91, pp. 1793–1796, Toronto, Canada, May 1991.

[Koil92] R. Koilpillai and P.P. Vaidyanathan, "Cosine-modulated FIR filter banks satisfying perfect reconstruction," IEEE Trans. Sig. Proc., vol. 40, pp. 770–783, Apr. 1992.

[Kois00] K. Koishida, V. Cuperman, and A. Gersho, "A 16-kbit/s bandwidth scalable audio coder based on the G.729 standard," in Proc. IEEE ICASSP-00, vol. 2, pp. 49–52, Istanbul, Turkey, June 2000.

[Koll99] J. Koller, T. Sporer, and K. Brandenburg, "Robust coding of high quality audio signals," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4621, New York, 1997.

[Koml96] A. Komly, "Assessing the performance of low-bit-rate audio codecs: an account of the work of ITU-R Task Group 10/2," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 105–114, 1996.

[Kons94] K. Konstantinides, "Fast subband filtering in MPEG audio coding," IEEE Sig. Proc. Letters, vol. 1, pp. 26–28, Feb. 1994.

[Kost95] B. Kostek, "Feature extraction methods for the intelligent processing of musical signals," in Proc. 99th Conv. Aud. Eng. Soc., preprint 4076, Oct. 1995.

[Kost96] B. Kostek and M. Szczerba, "MIDI database for the automatic recognition of musical phrases," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4169, May 1996.

[Kova95] J. Kovacevic, "Subband coding systems incorporating quantizer models," IEEE Trans. Image Proc., vol. 4, no. 5, pp. 543–553, May 1995.

[Krah85] D. Krahe, "New Source Coding Method for High Quality Digital Audio Signals," NTG Fachtagung Hoerrundfunk, Mannheim, 1985.

[Krah88] D. Krahe, "Grundlagen eines Verfahrens zur Datenreduktion bei qualitativ hochwertigen digitalen Audiosignalen auf Basis einer adaptiven Transformationscodierung unter Berücksichtigung psychoakustischer Phänomene," Ph.D. Thesis, Duisburg, 1988.

[kroo86] P. Kroon, E. Deprettere, and R.J. Sluyter, "Regular-pulse excitation – a novel approach to effective and efficient multi-pulse coding of speech," IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. 34, no. 5, pp. 1054–1063, Oct. 1986.

[Kroo90] P. Kroon and B. Atal, "Pitch predictors with high temporal resolution," in Proc. IEEE ICASSP-90, pp. 661–664, Albuquerque, NM, April 1990.

[Kroo95] P. Kroon and W.B. Kleijn, "Linear-prediction based analysis-synthesis coding," in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, Eds., Kluwer Academic Publishers, Norwell, MA, Nov. 1995.

[Krug88] E. Kruger and H. Strube, "Linear prediction on a warped frequency scale," IEEE Trans. on Acoustics, Speech, and Signal Proc., vol. 36, no. 9, pp. 1529–1531, Sept. 1988.

[Kudu95a] P. Kudumakis and M. Sandler, "On the performance of wavelets for low bit rate coding of audio signals," in Proc. IEEE ICASSP-95, pp. 3087–3090, Detroit, MI, May 1995.

[Kudu95b] P. Kudumakis and M. Sandler, "Wavelets, regularity, complexity and MPEG-Audio," in Proc. 99th Conv. Aud. Eng. Soc., preprint 4048, Oct. 1995.

[Kudu96] P. Kudumakis and M. Sandler, "On the compression obtainable with four-tap wavelets," IEEE Sig. Proc. Let., pp. 231–233, Aug. 1996.

[Kuma00] A. Kumar, "Low complexity ACELP coding of 7 kHz speech and audio at 16 kbps," IEEE Int. Conf. on Personal Wireless Communications, pp. 368–372, Dec. 2000.

[Kunt80] M. Kunt and O. Johnsen, "Block coding of graphics: A tutorial review," Proc. of the IEEE, vol. 68, no. 7, pp. 770–786, 1980.

[Kuo02] S.S. Kuo, J.D. Johnston, W. Turin, and S.R. Quackenbush, "Covert audio watermarking using perceptually tuned signal independent multiband phase modulation," in Proc. IEEE ICASSP-02, vol. 2, pp. 1753–1756, Orlando, FL, May 2002.

[Kurt02] F. Kurth, A. Ribbrock, and M. Clausen, "Identification of highly distorted audio material for querying large scale databases," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5512, Munich, May 2002.

[Kuzu03] B.H. Kuzuki, N. Fuchigami, and J.R. Stuart, "DVD-Audio specifications," IEEE Sig. Proc. Mag., pp. 72–82, July 2003.

[Lacy98] J. Lacy, S.R. Quackenbush, A.R. Reibman, D. Shur, and J.H. Snyder, "On combining watermarking with perceptual coding," in Proc. IEEE ICASSP-98, vol. 6, pp. 3725–3728, Seattle, WA, May 1998.

[Lafl90] C. Laflamme et al., "On reducing computational complexity of codebook search in CELP coder through the use of algebraic codes," in Proc. IEEE ICASSP-90, vol. 1, pp. 177–180, Albuquerque, NM, Apr. 1990.

[Lafl93] C. Laflamme, R. Salami, and J.-P. Adoul, "9.6 kbit/s ACELP coding of wideband speech," in Speech and Audio Coding for Wireless and Network Applications, B. Atal, V. Cuperman, and A. Gersho, Eds., Kluwer Academic Publishers, Norwell, MA, 1993.

[Laft03] C. Laftsidis, A. Tefas, N. Nikolaidis, and I. Pitas, "Robust multibit audio watermarking in the temporal domain," in Proc. IEEE ISCAS-03, vol. 2, pp. 944–947, May 2003.

[Lanc02] R. Lancini, F. Mapelli, and S. Tubaro, "Embedding indexing information in audio signal using watermarking technique," Int. Symp. on Video/Image Proc. and Mult. Comm., pp. 257–261, June 2002.

[Laub98] P. Lauber and N. Gilchrist, "ATLANTIC audio: switching Layer 3 signals," in Proc. 104th Aud. Eng. Soc. Conv., preprint 4738, Amsterdam, May 1998.

[Lean00] C.T.G. de Leandro, M. Bonnet, and N. Moreau, "Cyclostationarity-based audio watermarking with private and public hidden data," in Proc. 109th Conv. Aud. Eng. Soc., preprint 5258, Los Angeles, Sept. 2000.

[Leand01] C.T.G. de Leandro, E. Gomez, and N. Moreau, "Resynchronization methods for audio watermarking: an audio system," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5441, New York, Dec. 2001.

[Lee90] J.I. Lee et al., "On reducing computational complexity of codebook search in CELP coding," IEEE Trans. on Comm., vol. 38, no. 11, pp. 1935–1937, Nov. 1990.

[Lee00a] S.K. Lee and Y.S. Ho, "Digital audio watermarking in the cepstrum domain," Int. Conf. on Consumer Electronics (ICCE-00), pp. 334–335, June 2000.

[Lee00b] S.K. Lee and Y.S. Ho, "Digital audio watermarking in the cepstrum domain," IEEE Trans. on Consumer Electronics, vol. 46, no. 3, pp. 744–750, Aug. 2000.

[Leek84] M. Leek and C. Watson, "Learning to detect auditory pattern components," J. Acoust. Soc. Am., vol. 76, pp. 1037–1044, 1984.

[Lefe94] R. Lefebvre, R. Salami, C. Laflamme, and J.-P. Adoul, "High quality coding of wideband audio signals using transform coded excitation (TCX)," in Proc. IEEE ICASSP-94, vol. 1, pp. 93–96, Adelaide, Australia, 1994.

[LeGal92] D.J. Le Gall, "The MPEG video compression algorithm," J. Sig. Proc.: Image Comm., vol. 4, no. 2, pp. 129–140, 1992.

[Lehr93] P. Lehrman and T. Tully, MIDI for the Professional, Music Sales Corp., 1993.

[Lemm03] A.N. Lemma, J. Aprea, W. Oomen, and L.V. de Kerkhof, "A temporal domain audio watermarking technique," IEEE Trans. on Sig. Proc., vol. 51, no. 4, pp. 1088–1097, Apr. 2003.

[Levi98a] S. Levine and J. Smith, "A sines+transients+noise audio representation for data compression and time/pitch scale modifications," in Proc. 105th Conv. Audio Eng. Soc., preprint 4781, Sep. 1998.

[Levi99] S.N. Levine and J.O. Smith, "A switched parametric and transform audio coder," in Proc. IEEE ICASSP-99, vol. 2, pp. 985–988, Phoenix, AZ, Mar. 1999.

[Lie01] W.-N. Lie and L.-C. Chang, "Robust and high-quality time-domain audio watermarking subject to psychoacoustic masking," in Proc. IEEE ISCAS-01, vol. 2, pp. 45–48, May 2001.

[Lin82] S. Lin and D.J. Costello, Error Control Coding: Fundamentals and Applications, Prentice Hall, Englewood Cliffs, NJ, 1982.

[Lin91] X. Lin, R.A. Salami, and R. Steele, "High quality audio coding using analysis-by-synthesis technique," in Proc. IEEE ICASSP-91, vol. 5, pp. 3617–3620, Toronto, Canada, Apr. 1991.

[Lina94] J. Linacre, Many-Facet Rasch Measurement, MESA Press, Chicago, 1994.

[Lind80] Y. Linde, A. Buzo, and A. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. 28, no. 1, pp. 84–95, Jan. 1980.

[Lind99] A.T. Lindsay and W. Kriechbaum, "There's more than one way to hear it: multiple representations of music in MPEG-7," J. New Music Res., vol. 28, no. 4, pp. 364–372, 1999.

[Lind00] A.T. Lindsay, S. Srinivasan, et al., "Representation and linking mechanism for audio in MPEG-7," Signal Proc.: Image Commun., vol. 16, pp. 193–209, 2000.

[Lind01] A.T. Lindsay and J. Herre, "MPEG-7 and MPEG-7 Audio – an overview," J. Audio Eng. Soc., vol. 49, no. 7/8, pp. 589–594, July/Aug. 2001.

[Link93] M. Link, "An attack processing of audio signals for optimizing the temporal characteristics of a low bit-rate audio coding system," in Proc. 95th Conv. Audio Eng. Soc., preprint 3696, 1993.

[Liu98] C.-M. Liu and W.-C. Lee, "A unified fast algorithm for cosine modulated filter banks in current audio coding standards," in Proc. 104th Conv. Audio Eng. Soc., preprint 4729, 1998.

[Liu99] F. Liu, J. Shin, C.-C. J. Kuo, and A.G. Tescher, "Streaming of MPEG-4 speech/audio over Internet," Int. Conf. on Consumer Electronics, pp. 300–301, June 1999.

[LKur96] E. Lukac-Kuruc, "Beyond MIDI: XMidi, an affordable solution," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4348, Nov. 1996.

[Lloy82] S.P. Lloyd, "Least squares quantization in PCM," originally published in 1957 and republished in IEEE Trans. Inform. Theory, vol. 28, pt. II, pp. 127–135, Mar. 1982.

[Lokh91] G.C.P. Lokhoff, "DCC – digital compact cassette," IEEE Trans. on Consumer Electronics, vol. 37, no. 3, pp. 702–706, Aug. 1991.

[Lokh92] G.C.P. Lokhoff, "Precision adaptive sub-band coding (PASC) for the digital compact cassette (DCC)," IEEE Trans. Consumer Electron., vol. 38, no. 4, pp. 784–789, Nov. 1992.

[Lu98] Z. Lu and W. Pearlman, "An efficient, low-complexity audio coder delivering multiple levels of quality for interactive applications," in Proc. IEEE Sig. Proc. Soc. Workshop on Multimedia Sig. Proc., Dec. 1998.

[Luft83] R. Lufti, "Additivity of simultaneous masking," J. Acoust. Soc. Am., vol. 73, pp. 262–267, 1983.

[Luft85] R. Lufti, "A power-law transformation predicting masking by sounds with complex spectra," J. Acoust. Soc. Am., vol. 77, pp. 2128–2136, 1985.

[Maco97] M.W. Macon, L. Jensen-Link, J. Oliverio, M.A. Clements, and E.B. George, "Concatenation-based MIDI-to-singing voice synthesis," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4591, Nov. 1997.

[Madi97] V. Madisetti and D.B. Williams, The Digital Signal Processing Handbook, CRC Press, Boca Raton, FL, Dec. 1997.

[Mahi89] Y. Mahieux et al., "Transform coding of audio signals using correlation between successive transform blocks," in Proc. IEEE ICASSP-89, pp. 2021–2024, Glasgow, Scotland, May 1989.

[Mahi90] Y. Mahieux and J. Petit, "Transform coding of audio signals at 64 kbit/s," in Proc. IEEE Globecom '90, vol. 1, pp. 518–522, Nov. 1990.

[Mahi94] Y. Mahieux and J.P. Petit, "High-quality audio transform coding at 64 kbps," IEEE Trans. Comm., vol. 42, no. 11, pp. 3010–3019, Nov. 1994.

[Main96] L. Mainard and M. Lever, "A bi-dimensional coding scheme applied to audio bitrate reduction," in Proc. IEEE ICASSP-96, pp. 1017–1020, Atlanta, GA, May 1996.

[Main96] L. Mainard, C. Creignou, M. Lever, and Y.F. Dehery, "A real-time PC-based high-quality MPEG Layer II codec," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4345, Nov. 1996.

[Makh75] J. Makhoul, "Linear prediction: A tutorial review," Proc. of the IEEE, vol. 63, no. 4, pp. 561–580, Apr. 1975.

[Makh78] J. Makhoul et al., "A mixed-source model for speech compression and synthesis," J. Acoust. Soc. Am., vol. 64, pp. 1577–1581, Dec. 1978.

[Makh85] J. Makhoul et al., "Vector quantization in speech coding," Proc. of the IEEE, vol. 73, no. 11, pp. 1551–1588, Nov. 1985.

[Maki00] K. Makino and J. Matsumoto, "Hybrid audio coding for speech and audio below medium bit rate," Int. Conf. on Consumer Electronics (ICCE-00), pp. 264–265, June 2000.

[Malv90a] H. Malvar, "Lapped transforms for efficient transform/subband coding," IEEE Trans. Acous., Speech, and Sig. Process., vol. 38, no. 6, pp. 969–978, June 1990.

[Malv90b] H. Malvar, "Modulated QMF filter banks with perfect reconstruction," Electronics Letters, vol. 26, pp. 906–907, June 1990.

[Malv91] H.S. Malvar, Signal Processing with Lapped Transforms, Artech House, Norwood, MA, 1991.

[Malv98] H. Malvar, "Biorthogonal and nonuniform lapped transforms for transform coding with reduced blocking and ringing artifacts," IEEE Trans. Sig. Proc., vol. 46, no. 4, pp. 1043–1053, Apr. 1998.

[Malv98] H. Malvar, "Enhancing the performance of subband audio coders for speech signals," in Proc. IEEE ISCAS-98, vol. 5, pp. 98–101, May 1998.

[Mana02] B. Manaris, T. Purewal, and C. McCormick, "Progress towards recognizing and classifying beautiful music with computers – MIDI-encoded music and the Zipf-Mandelbrot law," in Proc. of Southeast Conf., pp. 52–57, Apr. 2002.

[Manj02] B. S. Manjunath, P. Salembier, and T. Sikora, editors, Introduction to MPEG-7: Multimedia Content Description Language, John Wiley & Sons, New York, 2002.

[Mans01a] M. F. Mansour and A. H. Tewfik, "Audio watermarking by time-scale modification," in Proc. IEEE ICASSP-01, vol. 3, pp. 1353–1356, Salt Lake City, UT, May 2001.

[Mans01b] M. F. Mansour and A. H. Tewfik, "Efficient decoding of watermarking schemes in the presence of false alarms," IEEE 4th Workshop on Mult. Sig. Proc., pp. 523–528, Oct. 2001.

[Mans02] M. F. Mansour and A. H. Tewfik, "Improving the security of watermark public detectors," Int. Conf. on DSP-02, vol. 1, pp. 59–66, July 2002.

[Mao03] W. Mao, Modern Cryptography: Theory and Practice, Prentice Hall PTR, July 2003.

[Mark76] J. Markel and A. Gray, Jr., Linear Prediction of Speech, Springer-Verlag, New York, 1976.

[Marp87] S. Marple, Digital Spectral Analysis with Applications, Prentice-Hall, Englewood Cliffs, NJ, 1987.

[Marp89] L. Marple, Digital Spectral Analysis with Applications, 2nd edition, Prentice Hall, Englewood Cliffs, NJ, 1989.

[Mart02] L. G. Martins and A. J. S. Ferreira, "PCM to MIDI transposition," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5524, May 2002.

[Mass85] J. Masson and Z. Picel, "Flexible design of computationally efficient nearly perfect QMF filter banks," in Proc. IEEE ICASSP-85, vol. 10, pp. 541–544, Tampa, FL, Mar. 1985.

[Matv96] G. Matviyenko, "Optimized local trigonometric bases," Appl. Comput. Harmonic Anal., vol. 3, no. 4, pp. 301–323, 1996.

[McAu86] R. McAulay and T. Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. Acous., Speech, and Sig. Proc., vol. 34, no. 4, pp. 744–754, Aug. 1986.

[McAu88] R. McAulay and T. Quatieri, "Computationally efficient sine-wave synthesis and its application to sinusoidal transform coding," in Proc. IEEE ICASSP-88, New York, NY, 1988.

[McCr91] A. McCree and T. Barnwell, III, "A new mixed excitation LPC vocoder," in Proc. IEEE ICASSP-91, vol. 1, pp. 593–596, Toronto, Canada, Apr. 1991.

[McCr93] A. McCree and T. Barnwell, III, "Implementation and evaluation of a 2400 BPS mixed excitation LPC vocoder," in Proc. IEEE ICASSP-93, vol. 2, p. 159, Minneapolis, MN, Apr. 1993.

[Mear97] D. J. Meares and G. Theile, "Matrixed surround sound in an MPEG digital world," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4431, Mar. 1997.

[Mein01] N. Meine, B. Edler, and H. Purnhagen, "Error protection and concealment for HILN MPEG-4 parametric audio coding," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5300, May 2001.

[Merh00] N. Merhav, "On random coding error exponents of watermarking systems," Trans. on Info. Theory, vol. 46, no. 2, pp. 420–430, Mar. 2000.

[Merm88] P. Mermelstein, "G.722: A new CCITT coding standard for digital transmission of wideband audio signals," IEEE Comm. Mag., vol. 26, pp. 8–15, Feb. 1988.

[Mesa99] V. Z. Mesarovic, M. V. Dokic, and R. K. Rao, "DTS multichannel audio decoder on a 24-bit fixed-point dual DSP," in Proc. 106th Conv. Aud. Eng. Soc., preprint 4964, May 1999.

[MID03] "MIDI and musical instrument control," in the Journal of AES, vol. 51, no. 4, pp. 272–276, Apr. 2003.

[MIDI] The MIDI Manufacturers Association, official home page, available online at http://www.midi.org.

[Mill47] G. Miller, "Sensitivity to changes in the intensity of white noise and its relation to masking and loudness," J. Acoust. Soc. Am., vol. 19, pp. 609–619, 1947.

[Miln92] A. Milne, "New test methods for digital audio data compression algorithms," in Proc. 11th Int. Conf. Aud. Eng. Soc., pp. 210–215, May 1992.

[Mitc97] J. L. Mitchell, W. B. Pennebaker, C. E. Fogg, and D. J. LeGall, MPEG Video Compression Standard, in Digital Multimedia Standards Series, Chapman & Hall, New York, NY, 1997.

[Mizr02] S. Mizrachi and D. Malah, "Parameter identification of a class of nonlinear systems for robust audio watermarking verification," in Proc. 22nd Conv. Electrical and Electronics Engineers, pp. 13–15, Dec. 2002.

[Mode98] T. Modegi and S.-I. Iisaku, "Proposals of MIDI coding and its application for audio authoring," in Proc. Int. Conf. on Mult. Comp. and Syst., pp. 305–314, June–July 1998.

[Mode00] T. Modegi, "MIDI encoding method based on variable frame-length analysis and its evaluation of coding precision," in IEEE Int. Conf. on Mult. and Expo, ICME-00, vol. 2, pp. 1043–1046, July–Aug. 2000.

[Mont94] P. Monta and S. Cheung, "Low rate audio coder with hierarchical filterbanks," in Proc. IEEE ICASSP-94, vol. 2, pp. 209–212, Adelaide, Australia, May 1994.

[Moor77] B. C. J. Moore, Introduction to the Psychology of Hearing, University Park Press, London, 1977.

[Moor78] B. C. J. Moore, "Psychophysical tuning curves measured in simultaneous and forward masking," J. Acoust. Soc. Am., vol. 63, pp. 524–532, 1978.

[Moor83] B. C. J. Moore and B. Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," J. Acous. Soc. Am., vol. 74, pp. 750–753, 1983.

[Moor96] B. C. J. Moore, "Masking in the human auditory system," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 9–19, 1996.

[Moor96] J. A. Moorer, "Breaking the sound barrier: mastering at 96 kHz and beyond," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4357, Los Angeles, Nov. 1996.

[Moor97] B. C. J. Moore, B. Glasberg, and T. Baer, "A model for the prediction of thresholds, loudness, and partial loudness," J. Audio Eng. Soc., vol. 45, no. 4, pp. 224–240, Apr. 1997.

[Moor03] B. C. J. Moore, An Introduction to the Psychology of Hearing, Fifth Edition, Academic Press, San Diego, Jan. 2003.

[More95] F. Moreau de Saint-Martin, et al., "A measure of near orthogonality of PR biorthogonal filter banks," in Proc. IEEE ICASSP-95, pp. 1480–1483, Detroit, MI, May 1995.

[Mori96] T. Moriya, et al., "Extension and complexity reduction of TWINVQ audio coder," in Proc. IEEE ICASSP-96, vol. 2, pp. 1029–1032, Atlanta, GA, May 1996.

[Mori97] T. Moriya, Y. Takashima, T. Nakamura, and N. Iwakami, "Digital watermarking schemes based on vector quantization," in IEEE Workshop on Speech Coding for Telecommunications, pp. 95–96, 1997.

[Mori00] T. Moriya, N. Iwakami, A. Jin, and T. Mori, "A design of lossy and lossless scalable audio coding," in Proc. IEEE ICASSP 2000, vol. 2, pp. 889–892, Istanbul, Turkey, June 2000.

[Mori00a] T. Moriya, A. Jin, N. Iwakami, and T. Mori, "Design of an MPEG-4 general audio coder for improving speech quality," in Proc. of IEEE Workshop on Speech Coding, pp. 139–141, Sept. 2000.

[Mori00b] T. Moriya, T. Mori, N. Iwakami, and A. Jin, "A design of error robust scalable coder based on MPEG-4/Audio," in Proc. IEEE ISCAS-2000, vol. 3, pp. 213–216, May 2000.

[Mori02a] T. Moriya, "Report of AHG on issues in lossless audio coding," ISO/IEC JTC1/SC29/WG11 M7955, Mar. 2002.

[Mori02b] T. Moriya, A. Jin, T. Mori, K. Ikeda, and T. Kaneko, "Lossless scalable audio coder and quality enhancement," in Proc. IEEE ICASSP-02, vol. 2, pp. 1829–1832, Orlando, FL, 2002.

[Mori03] T. Moriya, A. Jin, T. Mori, K. Ikeda, and T. Kaneko, "Hierarchical lossless audio coding in terms of sampling rate and amplitude resolution," in Proc. IEEE ICASSP-03, vol. 5, pp. 409–412, Hong Kong, 2003.

[Moul98a] D. Moulton and M. Moulton, "Measurement of small impairments of perceptual audio coders using a 3-facet Rasch model," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4709, May 1998.

[Moul98b] D. Moulton and M. Moulton, "Codec 'transparency,' listener 'severity,' program 'intolerance': suggestive relationships between Rasch measures and some background variables," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4843, Sept. 1998.

[MPEG] The MPEG workgroup web-page, available online at http://www.mpeg.org.

[Munt02] T. Muntean, E. Grivel, and M. Najim, "Audio digital watermarking based on hybrid spread spectrum," in Proc. 2nd Int. Conf. WEBDELMUSIC-02, pp. 150–155, Dec. 2002.

[Nack99a] F. Nack and A. T. Lindsay, "Everything you wanted to know about MPEG-7: Part-I," IEEE Multimedia, vol. 6, no. 3, pp. 65–71, July–Sept. 1999.

[Nack99b] F. Nack and A. T. Lindsay, "Everything you wanted to know about MPEG-7: Part-II," IEEE Multimedia, vol. 6, no. 4, pp. 64–73, Oct.–Dec. 1999.

[Naja00] H. Najafzadeh and P. Kabal, "Perceptual bit allocation for low rate coding of narrowband audio," in Proc. IEEE ICASSP-00, vol. 2, pp. 5–9, Istanbul, Turkey, June 2000.

[Neub00a] C. Neubauer and J. Herre, "Audio watermarking of MPEG-2 AAC bitstreams," in Proc. 108th Conv. Aud. Eng. Soc., preprint 5101, Paris, Feb. 2000.

[Neub00b] C. Neubauer and J. Herre, "Advanced watermarking and its applications," in Proc. 109th Conv. Aud. Eng. Soc., preprint 5176, Los Angeles, Sept. 2000.

[Neub02] C. Neubauer, F. Siebenhaar, J. Herre, and R. Bauml, "New high data rate audio watermarking based on SCS (scalar Costa scheme)," in Proc. 113th Conv. Aud. Eng. Soc., preprint 5645, Los Angeles, Oct. 2002.

[Neub98] C. Neubauer and J. Herre, "Digital watermarking and its influence on audio quality," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4823, San Francisco, Sept. 1998.

[NICAM] NICAM 728, "Specification for Two Additional Digital Sound Channels with System I Television," Independent Broadcasting Authority (IBA), British Radio Equipment Manufacturers' Association (BREMA), and British Broadcasting Corporation (BBC), Aug. 1988.

[Ning02] D. Ning and M. Deriche, "A new audio coder using a warped linear prediction model and the wavelet transform," in Proc. IEEE ICASSP-02, vol. 2, pp. 1825–1828, Orlando, FL, 2002.

[Nish96a] A. Nishio, et al., "Direct stream digital audio system," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4163, May 1996.

[Nish96b] A. Nishio, et al., "A new CD mastering processing using direct stream digital," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4393, Los Angeles, Nov. 1996.

[Nish99] M. Nishiguchi, "MPEG-4 speech coding," in Proc. AES 17th Int. Conf., Sept. 1999.

[Nish01] A. Nishio and H. Takahashi, "Investigation of practical 1-bit delta-sigma conversion for professional audio applications," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5392, May 2001.

[Nogu97] M. Noguchi, G. Ichimura, A. Nishio, and S. Tagami, "Digital signal processing in a direct stream digital editing system," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4476, May 1997.

[Noll93] P. Noll, "Wideband speech and audio coding," IEEE Comm. Mag., pp. 34–44, Nov. 1993.

[Noll95] P. Noll, "Digital audio coding for visual communications," Proc. of the IEEE, pp. 925–943, June 1995.

[Noll97] P. Noll, "MPEG digital audio coding," IEEE Sig. Proc. Mag., pp. 59–81, Sept. 1997.

[Nomu98] T. Nomura, et al., "A bitrate and bandwidth scalable CELP coder," in Proc. IEEE ICASSP-1998, vol. 1, pp. 341–344, Seattle, WA, May 1998.

[Nors92] S. R. Norsworthy, "Effective dithering of sigma-delta modulators," in Proc. IEEE ISCAS-92, vol. 3, pp. 1304–1307, 1992.

[Nors93] S. R. Norsworthy and D. A. Rich, "Idle channel tones and dithering in delta-sigma modulators," in Proc. 95th Conv. Aud. Eng. Soc., preprint 3711, New York, Oct. 1993.

[Nors97] S. R. Norsworthy, R. Schreier, and G. C. Temes, Delta-Sigma Converters: Theory, Design, and Simulation, IEEE Press, New York, 1997.

[Nuss81] H. J. Nussbaumer, "Pseudo QMF filter bank," IBM Tech. Disclosure Bulletin, vol. 24, pp. 3081–3087, Nov. 1981.

[Ogur97] Y. Ogura, T. Sugihara, S. Okada, H. Yamauchi, and A. Nishio, "One-bit two-channel recorder system," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4564, New York, Sept. 1997.

[Oh01] H. O. Oh, J. W. Seok, J. W. Hong, and D. H. Youn, "New echo embedding technique for robust and imperceptible audio watermarking," in Proc. IEEE ICASSP-01, vol. 3, pp. 1341–1344, Salt Lake City, UT, May 2001.

[Oh02] H. O. Oh, D. H. Youn, J. W. Hong, and J. W. Seok, "Imperceptible echo for robust audio watermarking," in Proc. 113th Conv. Aud. Eng. Soc., preprint 5644, Los Angeles, Oct. 2002.

[Ohma97] I. Ohman, "The placement of one or several subwoofers," in Music och Ljudteknik, no. 1, Sweden, 1997.

[Ojan99] J. Ojanpera and M. Vaananen, "Long-term predictor for transform domain perceptual audio coding," in Proc. 107th Conv. Aud. Eng. Soc., preprint 5036, New York, 1999.

[Oliv48] B. M. Oliver, J. Pierce, and C. E. Shannon, "The philosophy of PCM," Proc. IRE, vol. 36, pp. 1324–1331, Nov. 1948.

[Onno93] P. Onno and C. Guillemot, "Tradeoffs in the design of wavelet filters for image compression," in Proc. VCIP, pp. 1536–1547, Nov. 1993.

[Oom03] W. Oomen, E. Schuijers, B. den Brinker, and J. Breebaart, "Advances in parametric coding for high-quality audio," in Proc. 114th Conv. Aud. Eng. Soc., preprint 5852, Mar. 2003.

[Oome96] A. W. J. L. Oomen, A. A. M. Bruekers, R. J. van der Vleuten, and L. M. Van de Kerkhof, "Lossless coding for DVD audio," in 101st Conv. Aud. Eng. Soc., preprint 4358, Nov. 1996.

[Oppe71] A. Oppenheim, et al., "Computation of spectra with unequal resolution using the fast Fourier transform," Proc. of the IEEE, vol. 59, pp. 299–301, Feb. 1971.

[Oppe72] A. Oppenheim and D. Johnson, "Discrete representation of signals," Proc. of the IEEE, vol. 60, pp. 681–691, June 1972.

[Oppe99] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd edition, Prentice Hall, Englewood Cliffs, NJ, 1999.

[Orde91] E. Ordentlich and Y. Shoham, "Low-delay code excited linear predictive coding of wideband speech at 32 kb/s," in Proc. IEEE ICASSP-93, vol. 2, pp. 9–12, Minneapolis, MN, 1993.

[Pail92] B. Paillard, et al., "PERCEVAL: perceptual evaluation of the quality of audio signals," J. Aud. Eng. Soc., vol. 40, no. 1/2, pp. 21–31, Jan./Feb. 1992.

[Pain00] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. of the IEEE, vol. 88, no. 4, pp. 451–513, Apr. 2000.

[Pain00a] T. Painter and A. Spanias, "Perceptual Coding of Digital Audio," Proc. of the IEEE, vol. 88, no. 4, pp. 451–513, April 2000.

[Pain01] T. Painter and A. Spanias, "Perceptual component selection in sinusoidal coding of audio," IEEE 4th Workshop on Multimedia Signal Processing, pp. 187–192, Oct. 2001.

[Pain05] T. Painter and A. Spanias, "Perceptual segmentation and component selection for sinusoidal representations of audio," IEEE Trans. Speech Audio Proc., vol. 13, no. 2, pp. 149–162, Mar. 2005.

[Pali91] K. K. Paliwal and B. Atal, "Efficient vector quantization of LPC parameters at 24 kb/s," in Proc. IEEE ICASSP-91, pp. 661–663, Toronto, ON, Canada, 1991.

[Pali93] K. K. Paliwal and B. S. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame," IEEE Trans. Speech Audio Process., vol. 1, no. 1, pp. 3–14, 1993.

[Pan93] D. Pan, "Digital audio compression," Digital Tech. J., vol. 5, no. 2, pp. 28–40, 1993.

[Pan95] D. Pan, "A tutorial on MPEG/audio compression," IEEE Mult., vol. 2, no. 2, pp. 60–74, Summer 1995.

[Papa95] P. Papamichalis, "MPEG audio compression: algorithms and implementation," in Proc. DSP 95 Int. Conf. on DSP, pp. 72–77, June 1995.

[Papo91] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd edition, McGraw-Hill, New York, 1991.

[Para95] M. Paraskevas and J. Mourjopoulos, "A differential perceptual audio coding method with reduced bitrate requirements," IEEE Trans. Speech and Audio Proc., vol. 3, no. 6, pp. 490–503, Nov. 1995.

[Park97] S. H. Park, et al., "Multi-layered bit-sliced bit-rate scalable audio coding," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4520, 1997.

[Paul82] D. Paul, "A 500–800 b/s adaptive vector quantization vocoder using a perceptually motivated distance measure," in Proc. Globecom 1982 (Miami, FL, Dec. 1982), p. E631.

[PEAQ] The PEAQ webpage, available at http://www.peaq.org.

[Peel03] C. B. Peel, "On 'dirty-paper coding'," in IEEE Sig. Proc. Mag., vol. 20, no. 3, pp. 112–113, May 2003.

[Pena95] A. Pena, "A suggested auditory information environment to ease the detection and minimization of subjective annoyance in perceptive-based systems," in Proc. 98th Conv. Aud. Eng. Soc., preprint 4019, Paris, 1995.

[Pena96] A. Pena, et al., "ARCO (Adaptive Resolution COdec): a hybrid approach to perceptual audio coding," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4178, May 1996.

[Pena97a] A. Pena, et al., "New improvements in ARCO (Adaptive Resolution Codec)," in Proc. 102nd Conv. Aud. Eng. Soc., preprint 4419, Mar. 1997.

[Pena97b] A. Pena, et al., "A flexible tiling of the time axis for adaptive wavelet packet decompositions," in Proc. IEEE ICASSP-97, pp. 2137–2140, Munich, Germany, Apr. 1997.

[Pena01] A. Pena, E. Alexandre, B. Rivas, R. Perez, and A. Duenas, "Realtime implementations of MPEG-2 and MPEG-4 natural audio coders," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5302, May 2001.

[Penn80] M. Penner, "The coding of intensity and the interaction of forward and backward masking," J. Acoust. Soc. of Am., vol. 67, pp. 608–616, 1980.

[Penn95] B. Pennycook, "Audio and MIDI Markup Tools for the World Wide Web," in Proc. 99th Conv. Aud. Eng. Soc., preprint 4058, Oct. 1995.

[Peti99] F. A. P. Petitcolas, R. J. Anderson, and M. G. Kuhn, "Information hiding – a survey," in Proc. of the IEEE, vol. 87, no. 7, pp. 1062–1078, July 1999.

[Peti02] F. A. P. Petitcolas and D. Kirovski, "The blind pattern matching attack on watermark systems," in Proc. IEEE ICASSP-02, vol. 4, pp. 3740–3743, Orlando, FL, May 2002.

[Petr01] R. Petrovic, "Audio signal watermarking based on replica modulation," in Int. Conf. on Telecomm. in Modern Satellite, Cable and Broadcasting, vol. 1, pp. 227–234, Sept. 2001.

[Phil95a] P. Philippe, et al., "A relevant criterion for the design of wavelet filters in high-quality audio coding," in Proc. 98th Conv. Aud. Eng. Soc., preprint 3948, Feb. 1995.

[Phil95b] P. Philippe, et al., "On the Choice of Wavelet Filters for Audio Compression," in Proc. IEEE ICASSP-95, pp. 1045–1048, Detroit, MI, May 1995.

[Phil96] P. Philippe, et al., "Optimal wavelet packets for low-delay audio coding," in Proc. IEEE ICASSP-96, pp. 550–553, Atlanta, GA, May 1996.

[PhSo82] Philips and Sony, Audio CD System Description, Philips Licensing, Eindhoven, The Netherlands, 1982.

[Pick82] R. Pickholtz, D. Schilling, and L. Milstein, "Theory of spread-spectrum communications – a tutorial," IEEE Trans. on Comm., vol. 30, no. 5, pp. 855–884, May 1982.

[Pick88] J. Pickles, An Introduction to the Physiology of Hearing, Academic Press, New York, 1988.

[Podi01] C. I. Podilchuk and E. J. Delp, "Digital watermarking: algorithms and applications," IEEE Sig. Proc. Mag., vol. 18, no. 4, pp. 33–46, July 2001.

[Port80] M. R. Portnoff, "Time-frequency representation of digital signals and systems based on short-time Fourier analysis," IEEE Trans. ASSP, vol. 28, no. 1, pp. 55–69, Feb. 1980.

[Port81a] M. R. Portnoff, "Short-time Fourier analysis of sampled speech," IEEE Trans. ASSP, vol. 29, no. 3, pp. 364–373, June 1981.

[Prec97] K. Precoda and T. Meng, "Listener differences in audio compression evaluations," J. Aud. Eng. Soc., vol. 45, no. 9, pp. 708–715, Sept. 1997.

[Prel96a] N. Prelcic and A. Pena, "An adaptive tree search algorithm with application to multiresolution based perceptive audio coding," in Proc. IEEE Int. Symp. on Time-Freq. and Time-Scale Anal., 1996.

[Prel96b] N. Prelcic, et al., "Considerations on the performance of filter design methods for wavelet packet audio decompositions," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4235, May 1996.

[Pres89] W. Press, et al., Numerical Recipes, Cambridge Univ. Press, 1989.

[Prin86] J. Princen and A. Bradley, "Analysis/synthesis filter bank design based on time domain aliasing cancellation," IEEE Trans. ASSP, vol. 34, no. 5, pp. 1153–1161, Oct. 1986.

[Prin87] J. Princen, et al., "Subband/transform coding using filter bank designs based on time domain aliasing cancellation," in Proc. IEEE ICASSP-87, vol. 12, pp. 2161–2164, Dallas, TX, May 1987.

[Prin94] J. Princen, "The design of non-uniform modulated filterbanks," in Proc. IEEE Int. Symp. on Time-Frequency and Time-Scale Analysis, Oct. 1994.

[Prin95] J. Princen and J. Johnston, "Audio coding with signal adaptive filter banks," in Proc. IEEE ICASSP-95, pp. 3071–3074, Detroit, MI, May 1995.

[Prit90] W. L. Pritchard and M. Ogata, "Satellite direct broadcast," Proc. of the IEEE, vol. 78, no. 7, pp. 1116–1140, July 1990.

[Pura95] M. Purat and P. Noll, "A new orthonormal wavelet packet decomposition for audio coding using frequency-varying modulated lapped transforms," in IEEE ASSP Workshop on Applic. of Sig. Proc. to Aud. and Acous., Oct. 1995.

[Pura96] M. Purat and P. Noll, "Audio coding with a dynamic wavelet packet decomposition based on frequency-varying modulated lapped transforms," in Proc. IEEE ICASSP-96, pp. 1021–1024, Atlanta, GA, May 1996.

[Pura97] M. Purat, T. Liebchen, and P. Noll, "Lossless transform coding of audio signals," in Proc. 102nd Conv. Aud. Eng. Soc., Munich, preprint 4414, Mar. 1997.

[Purn00a] H. Purnhagen and N. Meine, "HILN – the MPEG-4 parametric audio coding tools," in Proc. IEEE ISCAS-2000, vol. 3, pp. 201–204, May 2000.

[Purn00b] H. Purnhagen, N. Meine, and B. Edler, "Speeding up HILN – MPEG-4 parametric audio encoding with reduced complexity," in Proc. 109th Conv. Aud. Eng. Soc., preprint 5177, Sept. 2000.

[Purn02] H. Purnhagen, N. Meine, and B. Edler, "Sinusoidal coding using loudness-based component selection," in Proc. IEEE ICASSP-02, vol. 2, pp. 1817–1820, Orlando, FL, 2002.

[Purn97] H. Purnhagen, et al., "Proposal of a Core Experiment for extended 'Harmonic and Individual Lines plus Noise' Tools for the Parametric Audio Coder Core," ISO/IEC JTC1/SC29/WG11 MPEG97/2480, July 1997.

[Purn98] H. Purnhagen, et al., "Object-based analysis/synthesis audio coder for very low bit rates," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4747, May 1998.

[Purn99a] H. Purnhagen, "Advances in parametric audio coding," in Proc. IEEE WASPAA, New Paltz, NY, Oct. 17–20, 1999.

[Purn99b] H. Purnhagen, "An overview of MPEG-4 audio version-2," in Proc. AES 17th Int. Conf., Florence, Italy, Sept. 1999.

[Purn03] H. Purnhagen, et al., "Combining low complexity parametric stereo with high efficiency AAC," ISO/IEC JTC1/SC29/WG11 MPEG2003/M10385, Dec. 2003.

[Qiu01] T. Qiu, "Lossless audio coding based on high order context modeling," in IEEE Fourth Workshop on Multimedia Signal Processing, pp. 575–580, Oct. 2001.

[Quac01] S. Quackenbush and A. T. Lindsay, "Overview of MPEG-7 audio," IEEE Trans. on Circuits and Systems for Video Technology, vol. 2, no. 6, pp. 725–729, June 2001.

[Quac97] S. Quackenbush, "Noiseless coding of quantized spectral components in MPEG-2 advanced audio coding," IEEE ASSP Workshop on Apps. of Sig. Proc. to Aud. and Acous., Mohonk, NY, 1997.

[Quac98a] S. Quackenbush and Y. Toguri, ISO/IEC JTC1/SC29/WG11 N2005, "Revised Report on Complexity of MPEG-2 AAC Tools," Feb. 1998.

[Quac98b] S. Quackenbush, "Coding of natural audio in MPEG-4," in Proc. IEEE ICASSP-98, Seattle, WA, May 1998.

[Quat02] T. Quatieri, Discrete-Time Speech Signal Processing, Prentice Hall PTR, Upper Saddle River, NJ, 2002.

[Quei93] R. de Queiroz, "Time-varying lapped transforms and wavelet packets," IEEE Trans. Sig. Proc., vol. 41, pp. 3293–3305, 1993.

[Raad01] M. Raad and I. S. Burnett, "Audio coding using sorted sinusoidal parameters," in Proc. IEEE ISCAS-02, vol. 2, pp. 401–404, May 2001.

[Raad02] M. Raad and A. Mertins, "From lossy to lossless audio coding using SPIHT," in Proc. of the 5th Int. Conf. on Digital Audio Effects, Hamburg, Germany, Sept. 26–28, 2002.

[Rabi78] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, Englewood Cliffs, NJ, 1978.

[Rabi89] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. of the IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.

[Rabi93] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.

[Rahm87] M. D. Rahman and K. Yu, "Total least squares approach for frequency estimation using linear prediction," IEEE Trans. ASSP, vol. 35, no. 10, pp. 1440–1454, Oct. 1987.

[Ramp98] S. A. Ramprashad, "A two stage hybrid embedded speech/audio coding structure," in Proc. IEEE ICASSP-98, vol. 1, pp. 337–340, Seattle, WA, May 1998.

[Ramp99] S. A. Ramprashad, "A multimode transform predictive coder (MTPC) for speech and audio," IEEE Workshop on Speech Coding, pp. 10–12, June 1999.

[Rams82] T. Ramstad, "Sub-band coder with a simple adaptive bit-allocation algorithm: a possible candidate for digital mobile telephony," in Proc. IEEE ICASSP-82, vol. 7, pp. 203–207, Paris, France, May 1982.

[Rams86] T. Ramstad, "Considerations on quantization and dynamic bit-allocation in subband coders," in Proc. IEEE ICASSP-86, vol. 11, pp. 841–844, Tokyo, Japan, Apr. 1986.

[Rams91] T. Ramstad, "Cosine modulated analysis-synthesis filter bank with critical sampling and perfect reconstruction," in Proc. IEEE ICASSP-91, pp. 1789–1792, Toronto, Canada, May 1991.

[Rao90] K. Rao and P. Yip, The Discrete Cosine Transform: Algorithm, Advantages, and Applications, Academic Press, Boston, 1990.

[Rasc80] G. Rasch, Probabilistic Models for Some Intelligence and Attainment Tests, University of Chicago Press, Chicago, 1980.

[Reef01a] D. Reefman and E. Janssen, "White paper on signal processing for SACD," Philips IP&S. [Online]. Available at http://www.superaudiocd.philips.com.

[Reef01b] D. Reefman and P. Nuijten, "Why direct stream digital (DSD) is the best choice as a digital audio format," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5396, Amsterdam, May 2001.

[Reef01c] D. Reefman and P. Nuijten, "Editing and switching in 1-bit audio streams," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5399, Amsterdam, May 12–15, 2001.

[Reef02] D. Reefman and E. Janssen, "Enhanced sigma delta structures for super audio CD application," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5616, Munich, Germany, May 2002.

[Rice79] R. Rice, "Some practical universal noiseless coding techniques," Jet Propulsion Laboratory, JPL Publication, Pasadena, CA, Mar. 1979.

[Riou91] O. Rioul and M. Vetterli, "Wavelets and signal processing," IEEE SP Mag., pp. 14–38, Oct. 1991.

[Riou94] O. Rioul and P. Duhamel, "A Remez exchange algorithm for orthonormal wavelets," IEEE Trans. Circ. Sys. II, Aug. 1994.

[Riss79] J. Rissanen and G. Langdon, "Arithmetic coding," IBM J. Res. Develop., vol. 23, no. 2, pp. 146–162, Mar. 1979.

[Rits96] S. Ritscher and U. Felderhoff, "Cascading of different audio codecs," in Proc. 100th Audio Eng. Soc. Conv., preprint 4174, Copenhagen, May 1996.

[Rix01] Rix, et al., "Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE ICASSP-01, vol. 2, pp. 749–752, Salt Lake City, UT, 2001.

[Robi94] T. Robinson, "SHORTEN: Simple lossless and near-lossless waveform compression," Technical Report 156, Engineering Department, Cambridge University, Dec. 1994.

[Roch98] S. Roche and J.-L. Dugelay, "Image watermarking based on the fractal transform," in Proc. Workshop Multimedia Signal Processing, pp. 358–363, Los Angeles, CA, 1998.

[Rode92] X. Rodet and P. Depalle, "Spectral envelopes and inverse FFT synthesis," in Proc. 93rd Conv. Audio Eng. Soc., preprint 3393, Oct. 1992.

[Rong99] Y. Rongshan and K. C. Chung, "High quality audio coding using a novel hybrid WLP-subband coding algorithm," 5th Asia-Pacific Conf. on Comm., vol. 2, pp. 952–955, Oct. 1999.

[Rose01] B. Rosenblatt, B. Trippe, and S. Mooney, Digital Rights Management: Business and Technology, John Wiley & Sons, New York, Nov. 2001.

[Roth83] J. H. Rothweiler, "Polyphase quadrature filters – a new subband coding technique," in Proc. ICASSP-83, pp. 1280–1283, May 1983.

[Ruiz00] F. J. Ruiz and J. R. Deller, "Digital watermarking of speech signals for the National Gallery of the Spoken Word," in Proc. IEEE ICASSP-00, vol. 3, pp. 1499–1502, Istanbul, Turkey, June 2000.

[Ryde96] T. Ryden, "Using listening tests to assess audio codecs," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 115–125, 1996.

[Sabi82] M. Sabin and R. Gray, "Product code vector quantizers for speech waveform coding," in Proc. Globecom 1982 (Miami, FL, Dec. 1982), p. E651.

[SACD02] Philips and Sony, "Super Audio CD System Description," Philips Licensing, Eindhoven, The Netherlands, 2002.

[SACD03] Philips, Super Audio CD, 2003. Available at http://www.superaudiocd.philips.com.

[Sait02a] S. Saito, T. Furukawa, and K. Konishi, "A data hiding for audio using band division based on QMF bank," in Proc. IEEE ISCAS-02, vol. 3, pp. 635–638, May 2002.

[Sait02b] S. Saito, T. Furukawa, and K. Konishi, "A digital watermarking for audio data using band division based on QMF bank," in Proc. IEEE ICASSP-02, vol. 4, pp. 3473–3476, Orlando, FL, May 2002.

[Saka00] T. Sakamoto, M. Taruki, and T. Hase, "MPEG-2 AAC decoder for a 32-bit MCU," Int. Conf. on Consumer Electronics, ICCE-00, pp. 256–257, June 2000.

[Sala98] R. Salami, et al., "Design and description of CS-ACELP: A toll quality 8 kb/s speech coder," IEEE Trans. on Speech and Audio Processing, vol. 6, no. 2, pp. 116–130, Mar. 1998.

[Sand01] F. E. Sandnes, "Efficient large-scale multichannel audio coding," in Proc. of Euromicro Conference, pp. 392–399, Sept. 2001.

[Saun96] J. Saunders, "Real time discrimination of broadcast speech/music," in Proc. IEEE ICASSP-96, pp. 993–996, Atlanta, GA, May 1996.

[SBC90] Swedish Broadcasting Corporation, "ISO MPEG/Audio Test Report," Stockholm, July 1990.

[Scha70] B. Scharf, "Critical bands," in Foundations of Modern Auditory Theory, Academic Press, New York, 1970.

[Scha75] R. Schafer and L. Rabiner, "Digital representations of speech signals," Proc. IEEE, vol. 63, no. 4, pp. 662–677, Apr. 1975.

[Scha79] R. Schafer and J. Markel, Speech Analysis, IEEE Press, New York, 1979.

[Scha95] R. Schafer and T. Sikora, "Digital video coding standards and their role in video communications," in Proc. of the IEEE, vol. 83, no. 6, June 1995.

[Sche98a] E. Scheirer, "The MPEG-4 Structured Audio Standard," in Proc. IEEE ICASSP-98, vol. 6, pp. 3801–3804, Seattle, WA, May 1998.

[Sche98b] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in Proc. IEEE ICASSP-98, Seattle, WA, May 1998.

[Sche98c] E. Scheirer and L. Ray, "Algorithmic and wavetable synthesis in the MPEG-4 multimedia standard," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4811, Sept. 1998.

[Sche98d] E. Scheirer, "The MPEG-4 structured audio orchestra language," in Proc. ICMC, Oct. 1998.

[Sche98e] E. Scheirer, et al., "AudioBIFS: the MPEG-4 standard for effects processing," in Proc. DAFX98 Workshop on Digital Audio Effects Processing, Nov. 1998.

[Sche99a] E. Scheirer, "Structured audio and effects processing in the MPEG-4 multimedia standard," ACM Multimedia Sys., vol. 7, no. 1, pp. 11–22, 1999.

[Sche99b] E. Scheirer and B. L. Vercoe, "SAOL: The MPEG-4 structured audio orchestra language," Comput. Music J., vol. 23, no. 2, pp. 31–51, 1999.

[Sche01] E. Scheirer, "Structured audio, Kolmogorov complexity, and generalized audio coding," in IEEE Trans. on Speech and Audio Proc., vol. 9, no. 8, pp. 914–931, Nov. 2001.

[Schi81] S. Schiffman, et al., Introduction to Multidimensional Scaling: Theory, Method, and Applications, Academic Press, New York, 1981.

[Schr79] M. R. Schroeder, B. S. Atal, and J. L. Hall, "Optimising digital speech coders by exploiting masking properties of the human ear," J. Acoust. Soc. Am., vol. 66, 1979.

[Schr79] M. Schroeder, et al., "Optimizing digital speech coders by exploiting masking properties of the human ear," J. Acoust. Soc. Am., pp. 1647–1652, Dec. 1979.

[Schr85] M. R. Schroeder and B. Atal, "Code-excited linear prediction (CELP): high quality speech at very low bit rates," in Proc. IEEE ICASSP-85, vol. 10, pp. 937–940, Tampa, FL, Apr. 1985.

[Schr85] M. Schroeder and B. Atal, "Code-excited linear prediction (CELP): high quality speech at very low bit rates," in Proc. IEEE ICASSP-85, pp. 937–940, Tampa, FL, May 1985.

[Schr86] E. Schroder and W. Voessing, "High quality digital audio encoding with 3.0 bits/sample using adaptive transform coding," in Proc. 80th Conv. Aud. Eng. Soc., preprint 2321, Mar. 1986.

[Schu96] D. Schulz, "Improving audio codecs by noise substitution," J. Audio Eng. Soc., pp. 593–598, July/Aug. 1996.

[Schu98] G. Schuller, "Time-varying filter banks with low delay for audio coding," in Proc. 105th Conv. Audio Eng. Soc., preprint 4809, Sept. 1998.

[Schu00] G. Schuller and T. Karp, "Modulated filter banks with arbitrary system delay: efficient implementations and the time-varying case," IEEE Trans. Speech Audio Proc., vol. 48, no. 3, pp. 737–748, Mar. 2000.

[Schu01] G. Schuller, Y. Bin, and D. Huang, "Lossless coding of audio signals using cascaded prediction," in Proc. IEEE ICASSP-01, vol. 5, pp. 3273–3276, Salt Lake City, UT, 2001.

[Schu02] G. Schuller and A. Harma, "Low delay audio compression using predictive coding," in Proc. IEEE ICASSP-02, vol. 2, pp. 1853–1856, Orlando, FL, 2002.

[Schu04] E. Schuijers, et al., "Low complexity parametric stereo coding," in 116th Conv. Aud. Eng. Soc., preprint 6073, Berlin, Germany, May 2004.

[Schu05] G. Schuller, J. Kovacevic, F. Masson, and V. Goyal, "Robust low-delay audio coding using multiple descriptions," IEEE Trans. Speech Audio Proc., vol. 13, no. 5, pp. 1014–1024, Sep. 2005.

[Sega76] A. Segall, "Bit allocation and encoding for vector sources," IEEE Trans. Info. Theory, vol. 22, no. 2, pp. 162–169, Mar. 1976.

[Seok00] J.-W. Seok and J.-W. Hong, "Prediction-based audio watermark detection algorithm," in Proc. 109th Conv. Aud. Eng. Soc., preprint 5254, Los Angeles, Sept. 2000.

[Sera97] C. Serantes, et al., "A fast noise-scaling algorithm for uniform quantization in audio coding schemes," in Proc. IEEE ICASSP-97, pp. 339–342, Munich, Germany, Apr. 1997.

[Serr89] X. Serra, "A System for Sound Analysis/Transformation/Synthesis Based on a Deterministic Plus Stochastic Decomposition," Ph.D. Thesis, Stanford University, 1989.

[Serr90] X. Serra and J. O. Smith, III, "Spectral modeling and synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition," Comput. Mus. J., pp. 12–24, Winter 1990.

[Sevi94] D. Sevic and M. Popovic, "A new efficient implementation of the oddly-stacked Princen-Bradley filter bank," IEEE Sig. Proc. Lett., vol. 1, no. 11, Nov. 1994.

[Shan48] C. E. Shannon, "A mathematical theory of communication," Bell Sys. Tech. J., vol. 27, pp. 379–423, pp. 623–656, July/Oct. 1948.

[Shao02] S. Shaopeng, Y. Junxun, Y. Yongcong, and A.-M. Raed, "A low bit-rate audio coder based on modified sinusoidal model," in IEEE Int. Conf. on Comm., Circuits and Systems and West Sino Expositions, vol. 1, pp. 648–652, June 2002.

[Shap93] J. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Sig. Proc., vol. 41, no. 12, pp. 3445–3462, Dec. 1993.

[Shin02] S. Shin, O. Kim, J. Kim, and J. Choil, "A robust audio watermarking algorithm using pitch scaling," Int. Conf. on DSP-02, vol. 2, pp. 701–704, July 2002.

[Shli94] S. Shlien, "Guide to MPEG-1 audio standard," IEEE Trans. Broadcast., pp. 206–218, Dec. 1994.

[Shli96] S. Shlien and G. Soulodre, "Measuring the characteristics of 'expert' listeners," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4339, Nov. 1996.

[Shli97] S. Shlien, "The modulated lapped transform, its time-varying forms, and its applications to audio coding standards," IEEE Trans. on Speech and Audio Proc., vol. 5, no. 4, pp. 359–366, July 1997.

[Shoh88] Y. Shoham and A. Gersho, "Efficient bit allocation for an arbitrary set of quantizers," IEEE Trans. ASSP, vol. 36, pp. 1445–1453, 1988.

[Siko97a] T. Sikora, "The MPEG-4 video standard verification model," in IEEE Trans. CSVT, vol. 7, no. 1, Feb. 1997.

[Siko97b] T. Sikora, "MPEG digital video-coding standards," in IEEE Sig. Proc. Mag., vol. 14, no. 5, pp. 82–100, Sept. 1997.

[Silv74] H. F. Silverman and N. R. Dixon, "A parametrically controlled spectral analysis system for speech," IEEE Trans. ASSP, vol. 22, no. 5, pp. 362–381, Oct. 1974.

[Silv01] G. C. M. Silvestre, N. J. Hurley, G. S. Hanau, and W. J. Dowling, "Informed audio watermarking scheme using digital chaotic signals," in Proc. IEEE ICASSP-01, vol. 3, pp. 1361–1364, Salt Lake City, UT, May 2001.

[Sing84] S. Singhal and B. Atal, "Improving performance of multi-pulse LPC coders at low bit rates," in Proc. IEEE ICASSP-84, vol. 9, pp. 9–12, San Diego, CA, Mar. 1984.

[Sing90] S. Singhal, "High quality audio coding using multipulse LPC," in Proc. IEEE ICASSP-90, vol. 2, pp. 1101–1104, Albuquerque, NM, Apr. 1990.

[Sinh93a] D. Sinha and A. Tewfik, "Low bit rate transparent audio compression using a dynamic dictionary and optimized wavelets," in Proc. IEEE ICASSP-93, pp. I-197–I-200, Minneapolis, MN, May 1993.

[Sinh93b] D. Sinha and A. Tewfik, "Low bit rate transparent audio compression using adapted wavelets," IEEE Trans. Sig. Proc., vol. 41, no. 12, pp. 3463–3479, Dec. 1993.

[Sinh96] D. Sinha and J. Johnston, "Audio compression at low bit rates using a signal adaptive switched filter bank," in Proc. IEEE ICASSP-96, pp. 1053–1056, Atlanta, GA, May 1996.

[Sinh98a] D. Sinha, et al., "The perceptual audio coder (PAC)," in The Digital Signal Processing Handbook, V. Madisetti and D. Williams, Eds., CRC Press, Boca Raton, FL, pp. 42-1–42-18, 1998.

[Sinh98b] D. Sinha and C. E. W. Sundberg, "Unequal error protection (UEP) for perceptual audio coders," in Proc. 104th Conv. Aud. Eng. Soc., preprint 4754, May 1998.

[Smar95] G. Smart and A. Bradley, "Filter bank design based on time-domain aliasing cancellation with non-identical windows," in Proc. IEEE ICASSP-94, vol. 3, pp. 185–188, Detroit, MI, May 1995.

[Smit86] M. J. T. Smith and T. Barnwell, "Exact reconstruction techniques for tree-structured subband coders," IEEE Trans. Acous., Speech, and Sig. Process., vol. 34, no. 3, pp. 434–441, June 1986.

[Smit95] J. O. Smith and J. Abel, "The Bark bilinear transform," in Proc. IEEE Workshop on App. Sig. Proc. to Audio and Electroacoustics, Oct. 1995.

[Smit99] J. O. Smith and J. S. Abel, "Bark and ERB bilinear transforms," IEEE Trans. on Speech and Audio Proc., vol. 7, no. 6, pp. 697–708, Nov. 1999.

[SMPTE99] SMPTE 320M-1999, Channel Assignments and Levels on Multichannel Audio Media, Society of Motion Picture and Television Engineers, White Plains, NY, 1999.

[SMPTE02] SMPTE RP 173-2002, Loudspeaker Placements for Audio Monitoring in High-Definition Electronic Production, Society of Motion Picture and Television Engineers, White Plains, NY, 2002.

[Smyt96] S. Smyth, W. P. Smith, M. Smyth, M. Yan, and T. Jung, "DTS coherent acoustics: delivering high quality multichannel sound to the consumer," in Proc. 100th Conv. Aud. Eng. Soc., May 11–14, preprint 4293, 1996.

[Smyt99] M. Smyth, "White paper: An Overview of the Coherent Acoustics Coding System," June 1999. Available on the Internet at www.dtsonline.com/media/uploads/pdfs/whitepaper.pdf.

[Soda94] I. Sodagar, et al., "Time-varying filter banks and wavelets," IEEE Trans. Sig. Proc., vol. 42, pp. 2983–2996, Nov. 1994.

[Sore96] H. V. Sorensen, P. M. Azziz, and J. van der Spiegel, "An overview of sigma-delta converters," IEEE Sig. Proc. Mag., vol. 13, pp. 61–84, Jan. 1996.

[Soul98] G. Soulodre, et al., "Subjective evaluation of state-of-the-art two-channel audio codecs," J. Aud. Eng. Soc., vol. 46, no. 3, pp. 164–177, Mar. 1998.

[Span91] A. Spanias, "A hybrid transform method for speech analysis and synthesis," Signal Processing, vol. 24, pp. 217–229, 1991.

[Span92] A. Spanias, M. Deisher, P. Loizou, and G. Lim, "Fixed-point implementation of the VSELP algorithm," ASU-TRC Technical Report TRC-SP-ASP-9201, July 1992.

[Span94] A. Spanias, "Speech coding: A tutorial review," in Proc. of the IEEE, vol. 82, no. 10, pp. 1541–1582, Oct. 1994.

[Span97] A. Spanias and T. Painter, "Universal Speech and Audio Coding Using a Sinusoidal Signal Model," ASU-TRC Technical Report 97-001, Jan. 1997.

[Spec98] "Special issue on copyright and privacy protection," IEEE J. Select. Areas Commun., vol. 16, May 1998.

[Spec99] "Special issue on identification and protection of multimedia information," in Proc. of the IEEE, vol. 87, July 1999.

[Spen01] G. Spenger, J. Herre, C. Neubauer, and N. Rump, "MPEG-21 – What does it bring to Audio?" in Proc. 111th Conv. Aud. Eng. Soc., preprint 5462, Nov.–Dec. 2001.

[Sper00] R. Sperschneider, "Error resilient source coding with variable length codes and its application to MPEG advanced audio coding," in Proc. 109th Conv. Aud. Eng. Soc., preprint 5271, Sept. 2000.

[Sper01] R. Sperschneider and P. Lauber, "Error concealment for compressed digital audio," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5460, Nov.–Dec. 2001.

[Sper02] R. Sperschneider, D. Homm, and L.-H. Chambat, "Error resilient source coding with differential variable length codes and its application to MPEG advanced audio coding," in Proc. 112th Conv. Aud. Eng. Soc., preprint 5555, May 2002.

[Spor95a] T. Sporer and K. Brandenburg, "Constraints of filter banks used for perceptual measurement," J. Aud. Eng. Soc., vol. 43, no. 3, pp. 107–115, Mar. 1995.

[Spor95b] T. Sporer, et al., "Evaluating a measurement system," J. Aud. Eng. Soc., vol. 43, no. 5, pp. 353–363, May 1995.

[Spor96] T. Sporer, "Evaluating small impairments with the mean opinion scale – reliable or just a guess?" in Proc. 101st Conv. Aud. Eng. Soc., preprint 4396, Nov. 1996.

[Spor97] T. Sporer, "Objective audio signal evaluation – applied psychoacoustics for modeling the perceived quality of digital audio," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4512, Sept. 1997.

[Spor99a] T. Sporer, web site on TG 10/4 and other perceptual measurement standardization activities, available at http://www.lte.e-technik.uni-erlangen.de/~spo/tg104/index.html.

[SQAM88] "Sound Quality Assessment Material: Recordings for Subjective Tests," EBU Doc. Tech. 3253 (includes SQAM Compact Disc), 1988.

[Sree98a] T. Sreenivas and M. Dietz, "Vector quantization of scale factors in advanced audio coder (AAC)," in Proc. IEEE ICASSP-98, vol. 6, pp. 3641–3644, Seattle, WA, May 1998.

[Sree98b] T. Sreenivas and M. Dietz, "Improved AAC performance < 64 kb/s using VQ," in Proc. 104th Conv. Audio Eng. Soc., preprint 4750, Amsterdam, 1998.

[Srin97] P. Srinivasan, Speech and Wideband Audio Compression Using Filter Banks and Wavelets, Ph.D. Thesis, Purdue University, May 1997.

[Srin98] P. Srinivasan and L. Jamieson, "High quality audio compression using an adaptive wavelet packet decomposition and psychoacoustic modeling," IEEE Trans. Sig. Proc., pp. 1085–1093, Apr. 1998. 'C' libraries and examples available on http://www.wavelet.ecn.purdue.edu/~speechg.

[Stau92] J. Stautner, "Physical testing of psychophysically-based digital audio coding systems," in Proc. 11th Int. Conf. Aud. Eng. Soc., pp. 203–209, May 1992.

[Stol93a] G. Stoll, et al., "Extension of ISO/MPEG-audio layer II to multi-channel coding: the future standard for broadcasting, telecommunication, and multimedia applications," in Proc. 94th Conv. Aud. Eng. Soc., preprint 3550, Mar. 1993.

[Stol96] G. Stoll, et al., "ISO-MPEG-2 audio: A generic standard for the coding of two-channel and multi-channel sound," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., 1996.

[Stoll88] G. Stoll, et al., "Masking-pattern adapted subband coding: use of the dynamic bit-rate margin," in Proc. 84th Conv. Aud. Eng. Soc., preprint 2585, Mar. 1988.

[Stor97] R. Storey, "ATLANTIC: Advanced Television at Low Bitrates and Networked Transmission over Integrated Communication Systems," ACTS Common European Newsletter, Feb. 1997.

[Stra96] G. Strang and T. Nguyen, Wavelets and Filter Banks, Wellesley-Cambridge Press, Wellesley, MA, 1996.

[Stru80] H. Strube, "Linear prediction on a warped frequency scale," J. Acoust. Soc. Am., vol. 68, no. 4, pp. 1071–1076, Oct. 1980.

[Stua93] J. Stuart, "Noise: methods for estimating detectability and threshold," in Proc. 94th Audio Eng. Soc. Conv., preprint 3477, Berlin, Mar. 1993.

[Sugi90] A. Sugiyama, et al., "Adaptive transform coding with an adaptive block size (ATC-ABS)," in Proc. IEEE ICASSP-90, pp. 1093–1096, Albuquerque, NM, May 1990.

[Swan98a] M. Swanson, B. Zhu, A. Tewfik, and L. Boney, "Robust audio watermarking using perceptual masking," Elsevier Signal Processing, Special Issue on Copyright Protection and Access Control, vol. 66, no. 3, pp. 337–355, May 1998.

[Swan98b] M. D. Swanson, M. Kobayashi, and A. H. Tewfik, "Multimedia data-embedding and watermarking technologies," Proc. of the IEEE, vol. 86, no. 6, pp. 1064–1087, June 1998.

[Swan99] M. Swanson, B. Zhu, and A. Tewfik, "Current state of the art, challenges and future directions for audio watermarking," in Proc. ICMCS, vol. 1, pp. 19–24, Florence, Italy, June 1999.

[Swee02] P. Sweeney, Error Control Coding: From Theory to Practice, John Wiley & Sons, New York, May 2002.

[Taka88] M. Taka, et al., "DSP implementations of sophisticated speech codecs," IEEE Trans. on Selected Areas in Communications, vol. 6, no. 2, pp. 274–282, Feb. 1988.

[Taka97] Y. Takamizawa, "An efficient tonal component coding algorithm for MPEG-2 audio NBC," in Proc. IEEE ICASSP-97, pp. 331–334, Munich, Germany, Apr. 1997.

[Taka01] Y. Takamizawa and T. Nomura, "Processor-efficient implementation of a high quality MPEG-2 AAC encoder," in Proc. 110th Conv. Aud. Eng. Soc., preprint 5294, May 2001.

[Taki95] Y. Takishima, M. Wada, and H. Murakami, "Reversible variable length codes," IEEE Trans. Commun., vol. 43, pp. 158–162, Feb.–Apr. 1995.

[Tan89] E. Tan and B. Vermuelen, "Digital audio tape for data storage," IEEE Spectrum, vol. 26, no. 10, pp. 34–38, Oct. 1989.

[Tang95] B. Tang, et al., "Spectral analysis of subband filtered signals," in Proc. IEEE ICASSP-95, pp. 1324–1327, Detroit, MI, May 1995.

[Taor99] R. Taori and R. J. Sluijter, "Closed-loop tracking of sinusoids for speech and audio coding," IEEE Workshop on Speech Coding Proceedings, pp. 1–3, 1999.

[Tefa03] A. Tefas, A. Nikolaidis, N. Nikolaidis, V. Solachidis, S. Tsekeridou, and I. Pitas, "Performance analysis of correlation-based watermarking schemes employing Markov chaotic sequences," Trans. on Sig. Proc., vol. 51, no. 7, pp. 1979–1994, July 2003.

[Teh90] D. Teh, et al., "Subband coding of high-fidelity quality audio signals at 128 kbps," in Proc. IEEE ICASSP-92, vol. 2, pp. 197–200, Albuquerque, NM, May 1990.

[tenK92] W. ten Kate, et al., "Matrixing of bit-rate reduced signals," in Proc. IEEE ICASSP, vol. 2, pp. 205–208, San Francisco, CA, May 1992.

[tenK94] W. ten Kate, et al., "Compatibility matrixing of multi-channel bit rate reduced audio signals," in Proc. 96th Conv. Aud. Eng. Soc., preprint 3792, 1994.

[tenK96b] W. ten Kate, "Maintaining audio quality in cascaded psychoacoustic coding," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4387, Nov. 1996.

[tenK97] R. ten Kate, "Disc-technology for super-quality audio applications," in Proc. 103rd Conv. Aud. Eng. Soc., preprint 4565, New York, Sept. 1997.

[Terh79] E. Terhardt, "Calculating virtual pitch," Hearing Research, vol. 1, pp. 155–182, 1979.

[Terh82] E. Terhardt, et al., "Algorithm for extraction of pitch and pitch salience from complex tonal signals," J. Acous. Soc. Am., vol. 71, pp. 679–688, Mar. 1982.

[Terr96] K. Terry and J. Seaver, "A real-time multichannel Dolby AC-3 audio encoder implementation," in Proc. 101st Conv. Aud. Eng. Soc., preprint 4363, Nov. 1996.

[Tewf93] A. Tewfik and M. Ali, "Enhanced wavelet based audio coder," in Conf. Rec. of the 27th Asilomar Conf. on Sig., Sys., and Comp., pp. 896–900, Nov. 1993.

[Thei87] G. Theile, et al., "Low-bit rate coding of high quality audio signals," in Proc. 82nd Conv. Aud. Eng. Soc., preprint 2432, Mar. 1987.

[Thie96] T. Thiede and E. Kabot, "A new perceptual quality measure for bit rate reduced audio," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4280, May 1996.

[Thom82] D. Thomson, "Spectrum estimation and harmonic analysis," Proc. of the IEEE, pp. 1055–1096, Sep. 1982.

[Thor00] N. J. Thorwirth, P. Horvatic, R. Weis, and J. Zhao, "Security methods for MP3 music delivery," in 34th ASILOMAR Conf. on Signals, Systems, and Computers, vol. 2, pp. 1831–1835, Nov. 2000.

[Todd94] C. Todd, et al., "AC-3: flexible perceptual coding for audio transmission and storage," in Proc. 96th Conv. Aud. Eng. Soc., preprint 3796, Feb. 1994.

[Toh03] C.-K. Toh, et al., "Transporting audio over wireless ad hoc networks: experiments and new insights," in Proc. of Personal, Indoor, and Mobile Radio Communications (PIMRC-2003), vol. 1, pp. 772–777, Sept. 2003.

[Tran90] I. Trancoso and B. Atal, "Efficient search procedures for selecting the optimum innovation in stochastic coders," IEEE Trans. ASSP, vol. 38, no. 3, pp. 385–396, Mar. 1990.

[Trap02] W. Trappe and L. C. Washington, Introduction to Cryptography with Coding Theory, Prentice Hall, Englewood Cliffs, NJ, Jan. 2002.

[Trem82] T. E. Tremain, "The government standard linear predictive coding algorithm: LPC-10," Speech Technology Magazine, pp. 40–49, Apr. 1982.

[Treu96] W. Treurniet, "Simulation of individual listeners with an auditory model," in Proc. 100th Conv. Aud. Eng. Soc., preprint 4154, May 1996.

[Tsai01] C. W. Tsai and J. L. Wu, "On constructing the Huffman-code based reversible variable length codes," IEEE Trans. Commun., vol. 49, pp. 1506–1509, Sept. 2001.

[Tsai02] T.-H. Tsai and J.-N. Liu, "Architecture design for MPEG-2 AAC filterbank decoder using modified regressive method," in Proc. IEEE ICASSP, vol. 3, pp. 3216–3219, Orlando, FL, May 2002.

[Tsut96] K. Tsutsui, et al., "ATRAC: Adaptive Transform Acoustic Coding for MiniDisc," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 95–101, 1996.

[Tsut98] K. Tsutsui, "ATRAC (Adaptive Transform Acoustic Coding) and ATRAC 2," in The Digital Signal Processing Handbook, V. Madisetti and D. Williams, Eds., CRC Press, Boca Raton, FL, pp. 43-16–43-20, 1998.

[Twve99] On-line animations of cochlear traveling waves: a) Boys Town National Research Hospital, Communication Engineering Laboratory, http://www.btnrh.boystown.org/cel/waves.htm; b) Department of Physiology at the University of Wisconsin, Madison, http://www.neurophys.wisc.edu/animations; c) Scuola Internazionale Superiore di Studi Avanzati/International School for Advanced Studies (SISSA/ISAS), http://www.sissa.it/bp/Cochlea/twlo.htm; and d) Ear Lab at Boston University, http://earlab.bu.edu/physiology/mechanics.html.

[USAT95a] United States Advanced Television Systems Committee (ATSC), Doc. A/53, "Digital Television Standard," Sep. 1995. Available online at http://www.atsc.org/Standards/A53.

[USAT95b] United States Advanced Television Systems Committee (ATSC), Doc. A/52, "Digital Audio Compression Standard (AC-3)," Dec. 1995. Available online at http://www.atsc.org/Standards/A52.

[Vaan00] R. Vaananen, "Synthetic audio tools in MPEG-4 standard," in Proc. 108th Conv. Aud. Eng. Soc., preprint 5080, Feb. 2000.

[Vaid87] P. P. Vaidyanathan, "Quadrature mirror filter banks, M-band extensions and perfect-reconstruction techniques," IEEE ASSP Mag., vol. 4, no. 3, pp. 4–20, July 1987.

[Vaid88] P. P. Vaidyanathan and P. Q. Hoang, "Lattice structures for optimal design and robust implementation of two-channel perfect reconstruction QMF banks," IEEE Trans. Acous., Speech, and Sig. Process., vol. 36, no. 1, pp. 81–94, Jan. 1988.

[Vaid90] P. P. Vaidyanathan, "Multirate digital filters, filter banks, polyphase networks, and applications: a tutorial," Proc. of the IEEE, vol. 78, no. 1, pp. 56–93, Jan. 1990.

[Vaid93] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, Englewood Cliffs, NJ, 1993.

[Vaid94] P. P. Vaidyanathan and T. Chen, "Statistically optimal synthesis banks for subband coders," in Proc. 28th Asilomar Conf. on Sig., Sys., and Comp., vol. 2, pp. 986–990, Nov. 1994.

[vand98] O. van der Vrecken, et al., "A new subband perceptual audio coder using CELP," in Proc. IEEE ICASSP-98, pp. 3661–3664, Seattle, WA, May 1998.

[Vande02] S. van de Par, A. Kohlrausch, G. Charestan, and R. Heusdens, "A new psychoacoustical masking model for audio coding applications," in Proc. IEEE ICASSP-02, vol. 2, pp. 1805–1808, Orlando, FL, 2002.

[Vasi02] N. Vasiloglou, R. W. Schafer, and M. C. Hans, "Lossless audio coding with MPEG-4 structured audio," 2nd Int. Conf. on WEBDELMUSIC-02, pp. 184–191, Dec. 2002.

[Vaup91] T. Vaupel, "Ein Beitrag zur Transformationscodierung von Audiosignalen unter Verwendung der Methode der 'Time Domain Aliasing Cancellation (TDAC)' und einer Signalkompandierung im Zeitbereich," Ph.D. Dissertation, Univ. of Duisburg, Duisburg, Germany, Apr. 1991.

[Veen03] M. V. Veen, A. V. Leest, and F. Bruekers, "Reversible audio watermarking," in Proc. 114th Conv. Aud. Eng. Soc., preprint 5818, Amsterdam, Mar. 2003.

[Veld89] R. N. J. Veldhuis, "Subband coding of digital audio signals without loss of quality," in Proc. IEEE ICASSP-89, pp. 2009–2012, Glasgow, Scotland, May 1989.

[Verc95] B. L. Vercoe, Csound: A Manual for the Audio-Processing System, MIT Media Lab, Cambridge, 1995.

[Verc98] B. L. Vercoe, et al., "Structured audio: creation, transmission, and rendering of parametric sound representations," Proc. IEEE, vol. 85, no. 5, pp. 922–940, May 1998.

[Verm98a] T. Verma and T. Meng, "An analysis/synthesis tool for transient signals that allows a flexible sines+transients+noise model for audio," in Proc. IEEE ICASSP-98, Seattle, WA, May 1998.

[Verm98b] T. Verma, et al., "Transient modeling synthesis: a flexible analysis/synthesis tool for transient signals," in Proc. Int. Comp. Mus. Conf., pp. 164–167, Greece, 1997.

[Vern93] S. Vernon, et al., "A single-chip DSP implementation of a high-quality, low bit-rate multi-channel audio coder," in Proc. 95th Conv. Aud. Eng. Soc., 1993.

[Vern95] S. Vernon, "Design and implementation of AC-3 coders," IEEE Trans. Consumer Elec., vol. 41, no. 3, Aug. 1995.

[Vett92] M. Vetterli and C. Herley, "Wavelets and filter banks," IEEE Trans. Sig. Proc., vol. 40, no. 9, pp. 2207–2232, Sept. 1992.

[Vier86] M. A. Viergever, "Cochlear macromechanics – a review," in Peripheral Auditory Mechanisms, J. Allen, et al., Eds., Springer, Berlin, 1986.

[Volo01] S. Voloshynovskiy, S. Pereira, T. Pun, J. J. Eggers, and J. K. Su, "Attacks on digital watermarks: classification, estimation based attacks, and benchmarks," IEEE Comm. Mag., vol. 39, no. 8, pp. 118–126, Aug. 2001.

[Vora97] S. Voran, "Perception-based bit-allocation algorithms for audio coding," in IEEE App. of Sig. Proc. to Audio and Acoustics (ASSP) Workshop, 19–22 Oct. 1997.

[Voro88] P. Voros, "High-quality sound coding within 2 × 64 kbit/s using instantaneous dynamic bit-allocation," in Proc. IEEE ICASSP-88, pp. 2536–2539, New York, NY, May 1988.

[Wang98] Y. Wang, "A new watermarking method of digital audio content for copyright protection," in Proc. ICSP-98, vol. 2, pp. 1420–1423, Oct. 1998.

[Wang01] Y. Wang, J. Ojanpera, M. Vilermo, and M. Vaananen, "Schemes for re-compressing MP3 audio bitstreams," in Proc. 111th Conv. Aud. Eng. Soc., preprint 5435, Nov.–Dec. 2001.

[Watk88] J. Watkinson, Art of Digital Audio, Focal Press, London and Boston, 1988.

[Wege97] A. Wegener, "MUSICompress: lossless, low-MIPS audio compression in software and hardware," in Proc. of the Int. Conf. on Sig. Proc. App. and Tech., Sept. 1997.

[Wein96] M. Weinberger, G. Seroussi, and G. Sapiro, "LOCO-I: a low complexity, context-based, lossless image compression algorithm," Proc. Data Compression Conf., 1996.

[Welc84] T. Welch, "A technique for high performance data compression," IEEE Comp., vol. 17, no. 6, pp. 8–19, June 1984.

[Wen98] J. Wen and J. Villasenor, "Reversible variable length codes for efficient and robust image and video coding," in Proc. IEEE Data Compression Conf., pp. 471–480, 1998.

[West88] P. H. Westerink, et al., "An optimal bit allocation algorithm for sub-band coding," in Proc. IEEE ICASSP-88, pp. 757–760, New York, NY, May 1988.

[Whit00] P. White, Basic MIDI, Sanctuary Press, Feb. 2000.

[Wick94] M. Wickerhauser, Adapted Wavelet Analysis from Theory to Software, AK Peters, Wellesley, MA, 1994.

[Wick95] S. B. Wicker, Error Control Systems for Digital Communication and Storage, Prentice Hall, Englewood Cliffs, NJ, 1995.

[Widr85] B. Widrow and S. Stearns, Adaptive Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1985.

[Wies90] D. Wiese and G. Stoll, "Bitrate reduction of high quality audio signals by modelling the ear's masking thresholds," in Proc. 89th Conv. Aud. Eng. Soc., preprint 2970, Sep. 1990.

[Wind98] B. Winduratna, "FM analysis/synthesis based audio coding," Proc. 104th Conv. Audio Eng. Soc., preprint 4746, May 1998.

[Witt87] I. H. Witten, R. M. Neal, and J. G. Cleary, "Arithmetic coding for data compression," Comm. ACM, vol. 30, no. 6, pp. 520–540, June 1987.

[Wolt03] M. Wolters, et al., "A closer look into MPEG-4 high efficiency AAC," in 115th Conv. Aud. Eng. Soc., preprint 5871, New York, NY, Oct. 2003.

[Wu97] X. Wu, "High-order context modeling and embedded conditional entropy coding of wavelet coefficients for image compression," in Proc. of 31st ASILOMAR Conf. on Signals, Systems, and Computers, pp. 1378–1382, Nov. 1997.

[Wu98] X. Wu and Z. Xiong, "An empirical study of high-order context modeling and entropy coding of wavelet coefficients," ISO/IEC JTC-1/SC-29/WG-1 No. 771, Feb. 1998.

[Wu01] M. Wu, S. Craver, E. W. Felten, and B. Liu, "Analysis of attacks on SDMI audio watermarks," in Proc. IEEE ICASSP-01, vol. 3, pp. 1369–1372, Salt Lake City, UT, May 2001.

[Wu03] M. Wu and B. Liu, Multimedia Data Hiding, Springer Verlag, New York, Jan. 2003.

[Wust98] U. Wustenhagen, et al., "Subjective listening test of multichannel audio codecs," in Proc. 105th Conv. Aud. Eng. Soc., preprint 4813, Sept. 1998.

[Wyli96b] F. Wylie, "APT-x100: low-delay, low-bit-rate subband ADPCM digital audio coding," in Collected Papers on Digital Audio Bit-Rate Reduction, N. Gilchrist and C. Grewin, Eds., Aud. Eng. Soc., pp. 83–94, 1996.

[Xu99a] C. Xu, J. Wu, and Q. Sun, "A robust digital audio watermarking technique," in Proc. of the Int. Symposium on Signal Processing and Its Applications (ISSPA-99), vol. 1, pp. 95–98, Aug. 1999.

[Xu99b] C. Xu, J. Wu, and Q. Sun, "Digital audio watermarking and its application in multimedia database," in Proc. of the Int. Symposium on Signal Processing and Its Applications (ISSPA-99), vol. 1, pp. 91–94, Aug. 1999.

[Xu99c] C. Xu, J. Wu, Q. Sun, and K. Xin, "Applications of watermarking technology in audio signals," J. Aud. Eng. Soc., vol. 47, no. 10, pp. 805–812, Oct. 1999.

[Yama98] H. Yamauchi, et al., "The SDDS system for digitizing film sound," in The Digital Signal Processing Handbook, V. Madisetti and D. Williams, Eds., CRC Press, Boca Raton, FL, pp. 43-6–43-12, 1998.

[Yang02] D. Yang, A. Hongmei, and C.-C. J. Kuo, "Progressive multichannel audio codec (PMAC) with rich features," in Proc. IEEE ICASSP-02, vol. 3, pp. 2717–2720, Orlando, FL, May 2002.

[Yang03] C.-H. Yang and H.-M. Hang, "Efficient bit assignment strategy for perceptual audio coding," in Proc. IEEE ICASSP-03, vol. 5, pp. 405–408, Hong Kong, 2003.

REFERENCES 457

[Yatr88] P Yatrou and P Mermelstein ldquoEnsuring predictor tracking in ADPCMspeech coders under noisy transmission conditionsrdquo IEEE J Selected AreasCommun (Special Issue on Voice Coding for Communications) vol 6 no 2pp 249ndash261 Feb 1988

[Yeo01] I-K Yeo and HJ Kim ldquoModified patchwork algorithm a novel audiowatermarking schemerdquo Int Conf on Inf Tech Coding and Computing pp237ndash242 Apr 2001

[Yeo03] I-K Yeo and HJ Kim ldquoModified patchwork algorithm a novel audiowatermarking schemerdquo in Trans on Speech and Aud Proc vol 11 no 4pp 381ndash386 July 2003

[Yin97] L Yin M Suonio and M Vaananen ldquoA new backward predictor for MPEGaudio codingrdquo in Proc 103 th Conv Aud Eng Soc preprint 4521 Sept1997

[Yosh94] T Yoshida ldquoThe rewritable MiniDisc Systemrdquo Proc of the IEEE vol 82no 10 pp 1492ndash1500 Oct 1994

[Yosh97] S Yoshikawa et al ldquoDoes high sampling frequency improve perceptualtime-axis resolution of a digital audio signalrdquo 103rd Conv Aud Eng Socpreprint 4562 New York Sept 1997

[Zara02] RH Morelos-Zaragoza The Art of Error Correcting Coding John Wiley ampSons New York Apr 2002

[Zhan00] Q Zhang Y-Q Zhang and W Zhu ldquoResource allocation for audio andvideo streaming over the Internetrdquo in Proc IEEE ISCAS-00 vol 4 pp21ndash24 May 2000

[Zhen03] L Zheng and A Inoue ldquoAudio watermarking techniques using sinusoidalpatterns based on pseudorandom sequencesrdquo in Trans Circuits and Systemsfor Video Technology vol 13 no 8 pp 801ndash812 Aug 2003

[Zhou01] J Zhou Q Zhang Z Xiong and W Zhu ldquoError resilient scalableaudio coding (ERSAC) for mobile applicationsrdquo in IEEE 4th Workshop onMultimedia Signal Processing pp 307ndash312 Oct 2001

[Zieg02] T Ziegler et al ldquoEnhancing MP3 with SBR Features and capabilities of thenew mp3PRO algorithmrdquo in 112th Conv Aud Eng Soc preprint 5560Munich Germany May 2002

[Ziv77] J Ziv and A Lempel ldquoA universal algorithm for sequential datacompressionrdquo IEEE Trans Inform Theory vol 23 no 3 pp 337ndash343May 1977

[Zwic67] E Zwicker and R Feldtkeller Das Ohr als Nachrichtenempfinger HirzelStuttgart Germany 1967

[Zwic77] E Zwicker ldquoProcedure for calculating loudness of temporally variablesoundsrdquo J Acoust Soc of Am vol 62 pp 675ndash682 1977

[Zwic90] E Zwicker and H Fastl Psychoacoustics Facts and Models Springer-Verlag New York 1990

[Zwic91] E Zwicker and U Zwicker ldquoAudio engineering and psychoacousticsmatching signals to the final receiver the human auditory systemrdquo J AudioEng Soc pp 115ndash126 Mar 1991

[Zwis65] J Zwislocki ldquoAnalysis of some auditory characteristicsrdquo in Handbook ofMathematical Psychology R Luce et al Eds John Wiley and Sons NewYork 1965

INDEX

A/D conversion 20, 33
1-bit digital conversion 365
2-channel stereo 280
3/2-channel format 267, 279
3G cellular standards 101
3G wireless communications 103
5.1 channel configuration 267–268, 319, 327, 332, 361

Absolute category rating (ACR) 384
Absolute hearing threshold 113, 115, 128, 137
ACELP codebook 101
Adaptive codebook 64
Adaptive differential PCM 60
Adaptive Huffman coding 80, 81
Adaptive resolution codec (ARCO) 230, 232
Adaptive spectral entropy coding (ASPEC) 178, 195–197, 203
Adaptive step size 58
Adaptive transform coder (ATC) 196
Adaptive VQ 64, 219
ADPCM 2, 51, 60, 96, 102, 336
ADPCM subband filtering 338
A-law companding 58
Algebraic CELP 101
Aliasing 18, 20, 33, 39, 146, 155, 164, 362
Analysis-by-synthesis 97, 104, 197, 233, 242, 248
Arithmetic coding 52, 83, 345
Asymmetric watermarking schemes 374
Audio fingerprinting 316
Audio graphic equalizer 32
Audio identification 315
Audio processing technology (APT-x100) 335, 379
Audio signature 316
AudioPaK 343, 348
Auditory scene analysis 258, 309
Auditory spectrum distance 392
Autocorrelation 40–44, 60, 94
Automatic perceptual measurement 383, 389, 402
Autoregressive moving average (ARMA) 25
Auxiliary data string 369, 376
Averaging filter 26

Backward adaptive 60, 64, 201, 281, 326
Bandwidth expansion 95, 104
Bandwidth scalability 298, 301


Bark frequency scale 106, 117, 119
Bark scale envelope 296
Bark-domain masking 397
Bark-scale warping 105
Binaural cue coding (BCC) 319, 382
Binaural masking difference 294, 396
Binaural unmasking effects 324
Bi-orthogonal MDCT basis 167, 168, 171
Bit reservoir 182, 249, 287, 304
Bit-sliced arithmetic coding 300, 356
Butterfly stages 23

Cascaded CQF 156, 159
CCIR J.41 reference audio codec 203
CELP algorithms 100, 233
Continuous Fourier transform (CFT) 13, 18
Channel symbol 62
Closed-loop analysis 91, 106, 236
C-LPAC 344, 352
Cochlea 115–117, 392
Cochlear channel 116, 390
Code division multiple access 100
Codebook 62–69, 100, 206
Code-excited linear predictive (CELP) 67, 70, 98, 100, 233, 291, 300
Coherent acoustics 265, 268, 337
Coiled basilar membrane 115
Compare internal-auditory representation (CIR) 390
Comparison category rating (CCR) 385
Conformance testing 309
Conjugate quadrature filters (CQF) 155
Conjugate structure VQ 51, 69
Content protection 317, 358, 369
Content retrieval 316
Content-access control 368
Convolution 16
Cosine-modulated filterbank 160, 163
Critical band densities 391
Critical bandwidth 105, 116, 119
Critically-sampled 4, 146, 154, 224
Cross-correlation 41
CS-ACELP 69, 101, 381
Curve fitting 118, 344

Data hiding 369
Daubechies' wavelets 223
Decimation 33, 135, 215, 361
Delta modulation 33, 60
Description scheme 311
Deterministic signal 39, 254, 390
DFT filterbank 178, 179
Difference equations 25
Differential PCM 59
Differential perceptual audio coder (DPAC) 195, 204
Digital audio watermarking 368
Digital broadcast audio 212, 378
Direct broadcast satellite (DBS) 163, 326, 378
Digital filters 25, 30
Digital item 317
Digital multimedia 308
Digital rights management 317, 369
Digital sinc 21
Digital theater systems (DTS) 264, 267, 338
Digital versatile disc (DVD) 267, 335
Direct sequence spread spectrum 374
Direct stream digital (DSD) 268, 343, 358, 364
Direct stream transfer (DST) 362, 368
Direct-form LP coefficients 95
Dirichlet 22
Disc-access control 368
Discrete Bark spectrum 128
Discrete cosine transform (DCT) 23, 178, 196, 206, 335, 353
Discrete Fourier transform (DFT) 22, 155, 178, 197, 200, 205
Discrete wavelet excitation 105
Discrete wavelet packet transform (DWPT) 155, 214, 255
Discrete wavelet transform 214, 224
Discrete-time Fourier transform (DTFT) 20, 34
Distortion index 392, 401
Dolby AC-2/AC-3 171, 325
Dolby digital plus 335
Dolby proLogic 267, 327, 332
Down-mixing 280, 333
Down-sampling 20, 33, 36
DVD-Audio 263, 268, 343, 356
Dynamic bit-allocation 52, 70, 230, 275
Dynamic crosstalk 281
Dynamic dictionary 218

Echo data hiding 370
Encryption 358, 369
Entropy 52, 74
Equivalent noise bandwidth 118
Equivalent rectangular bandwidth (ERB) 117, 122
Ergodic 40
Error concealment 305
Error protection 7, 291, 305, 324
Error resilience 291, 306


Euclidean distance 130, 204
European A-law 58
European broadcasting union (EBU) 269, 389
Expansion function 57
Expectation operator 10, 40, 70, 259

Fine grain scalability 291, 300, 356
Finite-length impulse response (FIR) 26
FIR filters 39, 167, 214
FIR/IIR prediction 344, 349, 358
Fixed codebooks 63
Formant structure 93
Formants 29, 93, 109
Forward adaptive 60, 64, 289, 328
Forward error correction 307
Forward linear prediction 94, 339
Forward MDCT 165, 176
Frequency response 10, 27
Frequency warped LP 92, 105
Frequency-to-place transformation 116
FS 1016 CELP 100, 110

Gain modification 182, 185, 341
General audio coder 294, 307
Generalized audio coding 303, 380
Generalized PR cosine-modulated filterbank 164
Givens rotation 355
Global masking threshold 125, 130, 137, 202, 324
Golomb coding 52, 82, 345, 349
Granular quantization noise 386

Hann window 128, 131, 219, 394
Harmonic synthesis 228
Harmonic vector excitation coding (HVXC) 300
Head-related transfer function 382
High definition television (HDTV) 274, 378
Huffman coding 52, 77, 202, 227
Human auditory system 4, 114, 125, 152
Human visual system 370
Hybrid filter bank 152, 163, 167, 178, 185
Hybrid filter bank/CELP 233, 235
Hybrid subband/CELP 234–235

IIR filterbanks 212, 237
Imperceptibility 377
Individual masking threshold 126, 136, 324
Infinite-length impulse response (IIR) filter 25, 43, 202, 237, 344, 351
Informational masking 396, 402
Instrument timbre 315
Integer MDCT (IntMDCT) 354
Intellectual property 317, 369, 380
Interchannel coherence 319
Interleaved pulse permutation codes 101
Interoperable framework 317
iPod 9, 278
IS 96 QCELP 102
ISO/IEC MPEG psychoacoustic model 5, 105, 113, 130, 275
ITU 5-point impairment scale 286
ITU-R BS.1116 288, 384
ITU-T G.722 91, 102
ITU-T G.722.2 AMR-WB codec 103
ITU-T G.726 62, 96
ITU-T G.728 LD-CELP 100, 234, 304
ITU-T G.729 69, 101, 301, 381

JND estimates 129
JND thresholds 198, 276, 283
Just-noticeable distortion (JND) 126

Kaiser MDCT 171
Kaiser-Bessel derived (KBD) window 171, 284, 326
K-means algorithm 235
Kraft inequality 76

Laplacian distribution 54, 77, 81, 201, 336
LBG algorithm 64
Lempel-Ziv coding 52
Levinson-Durbin recursive algorithm 95, 186, 301
Line spectrum pairs (LSPs) 95
Linear PCM 51, 55, 338
Linear prediction (LP) 41, 92–99, 102, 186
Linear predictive coding (LPC) 91, 99
LMS algorithm 281, 286
Log area ratios (LARs) 95
Logarithmic companding 57
Long-term prediction (LTP) 95, 99, 111, 289, 296
Lossless audio coding 264, 343
Lossless matrixing 358, 361
Lossless transform audio coding (LTAC) 353
Lossy audio coding 265, 343
Loudness mapping 395, 400
Low-delay coding 61, 307, 381
LPC 69, 91


LPC10e 95–96
LTI system 16
LTP lags 96
Lucent PAC 3, 321, 379

Mantissa quantization 328, 331
MASCAM 212
Masking threshold 105, 125, 138
Masking tone 121–124
Matrixing 267, 279
Maximally decimated 146, 154
M-band filterbanks 148, 160
MDCT basis 165, 168, 328
MDCT filterbank 149, 167
Mean opinion score (MOS) 9
Mean subjective score 385
Melody features 312, 315
Mel-scale 400
Meridian lossless packing (MLP) 268, 358
Metadata 316
Middle-ear transfer function 394
Minimum masking threshold 126, 221
Minimum redundancy codes 77
Minimum SMR 123, 137
MIPS 6
Mixed excitation schemes 96
Modified discrete cosine transform (MDCT) 145, 163
Modulated bi-orthogonal lapped transform (MBLT) 171
Mother wavelet function 237
Moving pictures expert group (MPEG) 270
MPEG 1 Layer III (MP3) 274–276
MP3 encoders 130
MPEG 1 Layer I 275
MPEG 2 Advanced audio coding (AAC) 283
MPEG 2 BC 279
MPEG 21 317
MPEG 4 289
MPEG 4 LD codec 304
MPEG 4 natural audio 292
MPEG 4 synthetic audio 303
MPEG 7 309
µ-law 57
Multichannel coding 327, 331
Multichannel perceptual audio coder (MPAC) 268
Multichannel surround sound 4, 267
Multilingual compatibility 279
Multimedia content description interface (MCDI) 274, 309
Multipulse excitation (MPE) 5, 104, 301
Multipulse excited linear prediction 98
Multirate signal processing 33
Multistage tree-structured VQ (MSTVQ) 206
Multitransform 232
Musical instrument digital interface (MIDI) 264
Musical instrument timbre description tool (MITDT) 315
MUSICAM 212
MUSICompress 3, 351

Natural audio coding 291–292
NetSound 303, 380
Noble identities 158
Noise loudness 391, 400
Noise signal evaluation (NSE) 390
Noise threshold 139, 201
Noise-masking-noise 124
Noise-masking-tone 123, 399
Noise-to-mask ratio (NMR) 74, 138, 205, 289, 396
Nonadaptive Huffman coding 80
Nonadaptive PCM 57
Non-integer factor sampling 36
Nonlinear interpolative VQ 206
Nonparametric quantization 51
Nonsimultaneous masking 120, 127
North American PCM standard 58

Object-based representation 274, 308
Objective measures 6, 383, 403
Octave-band 147, 158
Open-loop/closed-loop 96
Optimum bit assignment 72
Optimum coding in the frequency domain (OCF) 196

Parametric audio coding 242, 300
Parametric coder 257
Parametric phase-modulated window 168
Parametric spreading 330
PDF-optimized PCM 57
Peaking filters 30
Perceptual audio coder (PAC) 321
Perceptual audio quality measure (PAQM) 392
Perceptual bit allocation 52, 74, 120, 138, 149
Perceptual entropy 2, 75, 128
Perceptual evaluation (PERCEVAL) 6, 401
PEAQ 403
Perceptual noise substitution (PNS) 294


Perceptual transform coder (PXFM) 197
Perceptual transparency 235, 377
Perceptual weighting filter (PWF) 91, 98, 104
Perceptually-weighted speech 99
Percussive instrument timbre 315
Perfect reconstruction (PR) 4, 36, 147
Phase coding 370
Phase vocoder 380
Philips digital compact cassette (DCC) 3, 278
Philips PASC 3, 6
Pitch prediction 111, 296
Playback control 368
Poles and zeros 29
Polynomial approximations 346
Positive definite 41
Post-masking 127
Power spectral density (PSD) 10, 42
Prediction error 59, 94, 104
Predictive differential coding 59
Pre-echo control 182
Pre-echo distortion 180
Pre-masking 278
Priority codewords 306
Probability density function 53
Pseudo QMF 160, 177, 275, 338
Pseudorandom noise (PN) 374
Psychoacoustics 2, 74, 113
Pulse-sinc pair 15

Quadraphonic 267
Quadrature mirror filter (QMF) 36, 102
Quantization levels 54, 70, 199
Quantization noise 53, 75

Random process/signal 39, 53
Real-time DSP 352
Reed-Solomon codes 307
Reflection coefficients 95, 104
Regular pulse excitation (RPE) 98, 301
Rice codes 52, 81
Rights expression language (REL) 317
Run-length encoding 82

Sampling theorem 17
Scalable audio coding 258, 298
Scalable DWPT coder 220
Scalable polyphonic 266
Scalar quantization 54
Security key 375
Selectable mode vocoder (SMV) 69, 102
Sensation level 115
Sequential bit-allocation method 72
Shannon-Fano coding 76
Shelving filters 30
SHORTEN 307, 343, 346
Short-term linear predictor 94
Sigma-Delta (ΣΔ) 20, 362
Signal-to-mask ratio (SMR) 74, 123, 126, 138
Signal-to-noise ratio 55, 62, 116
Simulcast 256
Simultaneous masking 120, 126, 130
Sine MDCT 171
Sinusoidal excitation coding 105
Sony ATRAC 6, 173, 264, 320
Sony dynamic digital sound (SDDS) 321
Sound pressure level (SPL) 10, 113, 131
Sound recognition 315–316
Source-system modeling 92
Spatial audio coding 319
Spectral band replication (SBR) 308
Spectral flatness measure (SFM) 74, 129, 196
Speech recognition 315
Split-VQ 67
Spoken content description tool (SCDT) 315
Spread of masking 125
Spread spectrum watermarking 372
Standard deviation 40, 214
Statistical transparency 377
Step size 54, 57
Stereophonic coder 198
Stereophonic PXFM 198–199, 321
Streaming audio 4, 241, 263, 300, 378
Structured audio orchestra language (SAOL) 293, 303
Structured audio score language (SASL) 294
Subband coding 211
Subwoofer 268, 270, 321
Subjective difference grade 384, 391
Super audio CD (SACD) 3, 20, 358
Supra-masking threshold 284, 287
Synchronization 269, 279, 291, 341, 375

TDMA 100
Temporal noise shaping (TNS) 106, 182, 185, 283
Text-to-speech 274, 291, 293, 304
Third generation partnership project (3GPP) 103, 267
TIA IS 127 RCELP 101
TIA IS 54 VSELP 95, 100
Time-domain aliasing cancellation (TDAC) 164


Time-frequency resolution 149, 227, 327
Tonal adaptive resolution codec (TARCO) 232
Tonal and noise maskers 131, 136
Tone-masking-noise 124
Toeplitz 95
Total harmonic distortion 383
Training vectors 62–69
Tree-structured filterbanks 157
Tree-structured QMF 38, 156, 320
Turbo codes 374
Twin VQ 69, 207, 293, 296

Uniform bit-allocation 70, 72
Uniform step size 55, 57
United States (US) digital audio radio (DAR) 284
Unvoiced speech 93
Up-sampling 35

Variance 10, 40
Vector PCM (VPCM) 62
Vector quantization (VQ) 62–69
VSELP coder 95, 100
Video home system (VHS) 274
Vocal tract 29, 92, 233
Vocal tract envelope 96
Voice activity detection 104
Voice-over-Internet-protocol (VoIP) 7, 304

Warped linear prediction 105
Watermark embedding/detection 376
Watermark encoding 375
Wavelet basis 214, 219, 233
Wavelet coder 220, 229
Wavelet packet 155, 212, 215
Wavelet packet transform (WPT) 214
Wavwrite 11
Weighted-mean-square error 97, 233
White noise 25, 41
Wideband CDMA (WCDMA) 103
Wide-sense stationary 40
Window shape adaptation 283, 305
Window switching 173, 182, 203, 320, 328

z-transform 10, 20
Zwicker's loudness 394

• AUDIO SIGNAL PROCESSING AND CODING
  • CONTENTS
  • PREFACE
  • 1 INTRODUCTION
    • 1.1 Historical Perspective
    • 1.2 A General Perceptual Audio Coding Architecture
    • 1.3 Audio Coder Attributes
      • 1.3.1 Audio Quality
      • 1.3.2 Bit Rates
      • 1.3.3 Complexity
      • 1.3.4 Codec Delay
      • 1.3.5 Error Robustness
    • 1.4 Types of Audio Coders – An Overview
    • 1.5 Organization of the Book
    • 1.6 Notational Conventions
    • Problems
    • Computer Exercises
  • 2 SIGNAL PROCESSING ESSENTIALS
    • 2.1 Introduction
    • 2.2 Spectra of Analog Signals
    • 2.3 Review of Convolution and Filtering
    • 2.4 Uniform Sampling
    • 2.5 Discrete-Time Signal Processing
      • 2.5.1 Transforms for Discrete-Time Signals
      • 2.5.2 The Discrete and the Fast Fourier Transform
      • 2.5.3 The Discrete Cosine Transform
      • 2.5.4 The Short-Time Fourier Transform
    • 2.6 Difference Equations and Digital Filters
    • 2.7 The Transfer and the Frequency Response Functions
      • 2.7.1 Poles, Zeros, and Frequency Response
      • 2.7.2 Examples of Digital Filters for Audio Applications
    • 2.8 Review of Multirate Signal Processing
      • 2.8.1 Down-sampling by an Integer
      • 2.8.2 Up-sampling by an Integer
      • 2.8.3 Sampling Rate Changes by Noninteger Factors
      • 2.8.4 Quadrature Mirror Filter Banks
    • 2.9 Discrete-Time Random Signals
      • 2.9.1 Random Signals Processed by LTI Digital Filters
      • 2.9.2 Autocorrelation Estimation from Finite-Length Data
    • 2.10 Summary
    • Problems
    • Computer Exercises
  • 3 QUANTIZATION AND ENTROPY CODING
    • 3.1 Introduction
      • 3.1.1 The Quantization–Bit Allocation–Entropy Coding Module
    • 3.2 Density Functions and Quantization
    • 3.3 Scalar Quantization
      • 3.3.1 Uniform Quantization
      • 3.3.2 Nonuniform Quantization
      • 3.3.3 Differential PCM
    • 3.4 Vector Quantization
      • 3.4.1 Structured VQ
      • 3.4.2 Split-VQ
      • 3.4.3 Conjugate-Structure VQ
    • 3.5 Bit-Allocation Algorithms
    • 3.6 Entropy Coding
      • 3.6.1 Huffman Coding
      • 3.6.2 Rice Coding
      • 3.6.3 Golomb Coding
      • 3.6.4 Arithmetic Coding
    • 3.7 Summary
    • Problems
    • Computer Exercises
  • 4 LINEAR PREDICTION IN NARROWBAND AND WIDEBAND CODING
    • 4.1 Introduction
    • 4.2 LP-Based Source-System Modeling for Speech
    • 4.3 Short-Term Linear Prediction
      • 4.3.1 Long-Term Prediction
      • 4.3.2 ADPCM Using Linear Prediction
    • 4.4 Open-Loop Analysis-Synthesis Linear Prediction
    • 4.5 Analysis-by-Synthesis Linear Prediction
      • 4.5.1 Code-Excited Linear Prediction Algorithms
    • 4.6 Linear Prediction in Wideband Coding
      • 4.6.1 Wideband Speech Coding
      • 4.6.2 Wideband Audio Coding
    • 4.7 Summary
    • Problems
    • Computer Exercises
  • 5 PSYCHOACOUSTIC PRINCIPLES
    • 5.1 Introduction
    • 5.2 Absolute Threshold of Hearing
    • 5.3 Critical Bands
    • 5.4 Simultaneous Masking, Masking Asymmetry, and the Spread of Masking
      • 5.4.1 Noise-Masking-Tone
      • 5.4.2 Tone-Masking-Noise
      • 5.4.3 Noise-Masking-Noise
      • 5.4.4 Asymmetry of Masking
      • 5.4.5 The Spread of Masking
    • 5.5 Nonsimultaneous Masking
    • 5.6 Perceptual Entropy
    • 5.7 Example Codec Perceptual Model: ISO/IEC 11172-3 (MPEG-1) Psychoacoustic Model 1
      • 5.7.1 Step 1: Spectral Analysis and SPL Normalization
      • 5.7.2 Step 2: Identification of Tonal and Noise Maskers
      • 5.7.3 Step 3: Decimation and Reorganization of Maskers
      • 5.7.4 Step 4: Calculation of Individual Masking Thresholds
      • 5.7.5 Step 5: Calculation of Global Masking Thresholds
    • 5.8 Perceptual Bit Allocation
    • 5.9 Summary
    • Problems
    • Computer Exercises
  • 6 TIME-FREQUENCY ANALYSIS: FILTER BANKS AND TRANSFORMS
    • 6.1 Introduction
    • 6.2 Analysis-Synthesis Framework for M-band Filter Banks
    • 6.3 Filter Banks for Audio Coding: Design Considerations
      • 6.3.1 The Role of Time-Frequency Resolution in Masking Power Estimation
      • 6.3.2 The Role of Frequency Resolution in Perceptual Bit Allocation
      • 6.3.3 The Role of Time Resolution in Perceptual Bit Allocation
    • 6.4 Quadrature Mirror and Conjugate Quadrature Filters
    • 6.5 Tree-Structured QMF and CQF M-band Banks
    • 6.6 Cosine Modulated "Pseudo QMF" M-band Banks
    • 6.7 Cosine Modulated Perfect Reconstruction (PR) M-band Banks and the Modified Discrete Cosine Transform (MDCT)
      • 6.7.1 Forward and Inverse MDCT
      • 6.7.2 MDCT Window Design
      • 6.7.3 Example MDCT Windows (Prototype FIR Filters)
    • 6.8 Discrete Fourier and Discrete Cosine Transform
    • 6.9 Pre-echo Distortion
    • 6.10 Pre-echo Control Strategies
      • 6.10.1 Bit Reservoir
      • 6.10.2 Window Switching
      • 6.10.3 Hybrid, Switched Filter Banks
      • 6.10.4 Gain Modification
      • 6.10.5 Temporal Noise Shaping
    • 6.11 Summary
    • Problems
    • Computer Exercises
  • 7 TRANSFORM CODERS
    • 7.1 Introduction
    • 7.2 Optimum Coding in the Frequency Domain
    • 7.3 Perceptual Transform Coder
      • 7.3.1 PXFM
      • 7.3.2 SEPXFM
    • 7.4 Brandenburg-Johnston Hybrid Coder
    • 7.5 CNET Coders
      • 7.5.1 CNET DFT Coder
      • 7.5.2 CNET MDCT Coder 1
      • 7.5.3 CNET MDCT Coder 2
    • 7.6 Adaptive Spectral Entropy Coding
    • 7.7 Differential Perceptual Audio Coder
    • 7.8 DFT Noise Substitution
    • 7.9 DCT with Vector Quantization
    • 7.10 MDCT with Vector Quantization
    • 7.11 Summary
    • Problems
    • Computer Exercises
  • 8 SUBBAND CODERS
    • 8.1 Introduction
      • 8.1.1 Subband Algorithms
    • 8.2 DWT and Discrete Wavelet Packet Transform (DWPT)
    • 8.3 Adapted WP Algorithms
      • 8.3.1 DWPT Coder with Globally Adapted Daubechies Analysis Wavelet
      • 8.3.2 Scalable DWPT Coder with Adaptive Tree Structure
      • 8.3.3 DWPT Coder with Globally Adapted General Analysis Wavelet
      • 8.3.4 DWPT Coder with Adaptive Tree Structure and Locally Adapted Analysis Wavelet
      • 8.3.5 DWPT Coder with Perceptually Optimized Synthesis Wavelets
    • 8.4 Adapted Nonuniform Filter Banks
      • 8.4.1 Switched Nonuniform Filter Bank Cascade
      • 8.4.2 Frequency-Varying Modulated Lapped Transforms
    • 8.5 Hybrid WP and Adapted WP/Sinusoidal Algorithms
      • 8.5.1 Hybrid Sinusoidal/Classical DWPT Coder
      • 8.5.2 Hybrid Sinusoidal/M-band DWPT Coder
      • 8.5.3 Hybrid Sinusoidal/DWPT Coder with WP Tree Structure Adaptation (ARCO)
    • 8.6 Subband Coding with Hybrid Filter Bank/CELP Algorithms
      • 8.6.1 Hybrid Subband/CELP Algorithm for Low-Delay Applications
      • 8.6.2 Hybrid Subband/CELP Algorithm for Low-Complexity Applications
    • 8.7 Subband Coding with IIR Filter Banks
    • Problems
    • Computer Exercise
  • 9 SINUSOIDAL CODERS
    • 9.1 Introduction
    • 9.2 The Sinusoidal Model
      • 9.2.1 Sinusoidal Analysis and Parameter Tracking
      • 9.2.2 Sinusoidal Synthesis and Parameter Interpolation
    • 9.3 Analysis/Synthesis Audio Codec (ASAC)
      • 9.3.1 ASAC Segmentation
      • 9.3.2 ASAC Sinusoidal Analysis-by-Synthesis
      • 9.3.3 ASAC Bit Allocation, Quantization, Encoding, and Scalability
    • 9.4 Harmonic and Individual Lines Plus Noise Coder (HILN)
      • 9.4.1 HILN Sinusoidal Analysis-by-Synthesis
      • 9.4.2 HILN Bit Allocation, Quantization, Encoding, and Decoding
    • 9.5 FM Synthesis
      • 9.5.1 Principles of FM Synthesis
      • 9.5.2 Perceptual Audio Coding Using an FM Synthesis Model
    • 9.6 The Sines + Transients + Noise (STN) Model
    • 9.7 Hybrid Sinusoidal Coders
      • 9.7.1 Hybrid Sinusoidal-MDCT Algorithm
      • 9.7.2 Hybrid Sinusoidal-Vocoder Algorithm
    • 9.8 Summary
    • Problems
    • Computer Exercises
  • 10 AUDIO CODING STANDARDS AND ALGORITHMS
    • 10.1 Introduction
    • 10.2 MIDI Versus Digital Audio
      • 10.2.1 MIDI Synthesizer
      • 10.2.2 General MIDI (GM)
      • 10.2.3 MIDI Applications
    • 10.3 Multichannel Surround Sound
      • 10.3.1 The Evolution of Surround Sound
      • 10.3.2 The Mono, the Stereo, and the Surround Sound Formats
      • 10.3.3 The ITU-R BS.775 5.1-Channel Configuration
    • 10.4 MPEG Audio Standards
      • 10.4.1 MPEG-1 Audio (ISO/IEC 11172-3)
      • 10.4.2 MPEG-2 BC/LSF (ISO/IEC 13818-3)
      • 10.4.3 MPEG-2 NBC/AAC (ISO/IEC 13818-7)
      • 10.4.4 MPEG-4 Audio (ISO/IEC 14496-3)
      • 10.4.5 MPEG-7 Audio (ISO/IEC 15938-4)
      • 10.4.6 MPEG-21 Framework (ISO/IEC 21000)
      • 10.4.7 MPEG Surround and Spatial Audio Coding
    • 10.5 Adaptive Transform Acoustic Coding (ATRAC)
    • 10.6 Lucent Technologies PAC, EPAC, and MPAC
      • 10.6.1 Perceptual Audio Coder (PAC)
      • 10.6.2 Enhanced PAC (EPAC)
      • 10.6.3 Multichannel PAC (MPAC)
    • 10.7 Dolby Audio Coding Standards
      • 10.7.1 Dolby AC-2, AC-2A
      • 10.7.2 Dolby AC-3/Dolby Digital/Dolby SR·D
    • 10.8 Audio Processing Technology APT-x100
    • 10.9 DTS – Coherent Acoustics
      • 10.9.1 Framing and Subband Analysis
      • 10.9.2 Psychoacoustic Analysis
      • 10.9.3 ADPCM – Differential Subband Coding
      • 10.9.4 Bit Allocation, Quantization, and Multiplexing
      • 10.9.5 DTS-CA Versus Dolby Digital
    • Problems
    • Computer Exercise
  • 11 LOSSLESS AUDIO CODING AND DIGITAL WATERMARKING
    • 11.1 Introduction
    • 11.2 Lossless Audio Coding (L²AC)
      • 11.2.1 L²AC Principles
      • 11.2.2 L²AC Algorithms
    • 11.3 DVD-Audio
      • 11.3.1 Meridian Lossless Packing (MLP)
    • 11.4 Super-Audio CD (SACD)
      • 11.4.1 SACD Storage Format
      • 11.4.2 Sigma-Delta Modulators (SDM)
      • 11.4.3 Direct Stream Digital (DSD) Encoding
    • 11.5 Digital Audio Watermarking
      • 11.5.1 Background
      • 11.5.2 A Generic Architecture for DAW
      • 11.5.3 DAW Schemes – Attributes
    • 11.6 Summary of Commercial Applications
    • Problems
    • Computer Exercise
  • 12 QUALITY MEASURES FOR PERCEPTUAL AUDIO CODING
    • 12.1 Introduction
    • 12.2 Subjective Quality Measures
    • 12.3 Confounding Factors in Subjective Evaluations
    • 12.4 Subjective Evaluations of Two-Channel Standardized Codecs
    • 12.5 Subjective Evaluations of 5.1-Channel Standardized Codecs
    • 12.6 Subjective Evaluations Using Perceptual Measurement Systems
      • 12.6.1 CIR Perceptual Measurement Schemes
      • 12.6.2 NSE Perceptual Measurement Schemes
    • 12.7 Algorithms for Perceptual Measurement
      • 12.7.1 Example: Perceptual Audio Quality Measure (PAQM)
      • 12.7.2 Example: Noise-to-Mask Ratio (NMR)
      • 12.7.3 Example: Objective Audio Signal Evaluation (OASE)
    • 12.8 ITU-R BS.1387 and ITU-T P.861: Standards for Perceptual Quality Measurement
    • 12.9 Research Directions for Perceptual Codec Quality Measures
  • REFERENCES
  • INDEX
Page 37: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 38: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 39: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 40: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 41: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 42: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 43: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 44: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 45: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 46: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 47: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 48: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 49: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 50: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 51: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 52: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 53: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 54: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 55: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 56: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 57: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 58: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 59: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 60: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 61: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 62: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 63: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 64: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 65: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 66: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 67: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 68: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 69: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 70: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 71: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 72: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 73: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 74: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 75: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 76: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 77: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 78: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 79: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 80: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 81: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 82: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 83: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 84: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 85: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 86: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 87: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 88: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 89: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 90: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 91: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 92: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 93: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 94: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 95: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 96: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 97: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 98: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 99: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 100: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 101: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 102: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 103: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 104: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 105: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 106: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 107: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 108: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 109: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 110: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 111: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 112: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 113: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 114: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 115: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 116: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 117: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 118: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 119: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 120: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 121: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 122: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 123: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 124: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 125: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 126: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 127: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 128: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 129: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 130: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 131: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 132: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 133: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 134: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 135: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 136: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 137: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 138: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 139: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 140: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 141: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 142: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 143: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 144: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 145: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 146: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 147: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 148: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 149: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 150: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 151: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 152: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 153: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 154: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 155: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 156: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 157: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 158: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 159: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 160: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 161: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 162: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 163: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 164: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 165: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 166: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 167: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 168: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 169: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 170: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 171: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 172: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 173: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 174: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 175: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 176: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 177: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 178: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 179: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 180: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 181: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 182: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 183: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 184: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 185: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 186: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 187: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 188: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 189: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 190: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 191: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 192: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 193: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 194: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 195: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 196: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 197: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 198: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 199: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 200: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 201: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 202: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 203: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 204: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 205: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 206: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 207: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 208: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 209: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 210: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 211: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 212: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 213: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 214: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 215: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 216: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 217: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 218: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 219: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 220: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 221: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 222: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 223: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 224: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 225: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 226: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 227: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 228: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 229: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 230: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 231: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 232: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 233: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 234: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 235: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 236: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 237: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 238: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 239: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 240: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 241: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 242: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 243: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 244: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 245: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 246: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 247: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 248: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 249: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 250: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 251: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 252: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 253: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 254: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 255: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 256: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 257: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 258: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 259: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 260: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 261: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 262: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 263: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 264: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 265: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 266: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 267: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 268: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 269: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 270: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 271: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 272: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 273: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 274: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 275: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 276: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 277: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 278: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 279: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 280: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 281: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 282: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 283: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 284: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 285: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 286: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel
Page 287: AUDIO SIGNAL PROCESSING AND CODINGs2.bitdl.ir/Ebook/Electronics/Spanias - Audio Signal...10.3.2 The Mono, the Stereo, and the Surround Sound Formats 268 10.3.3 The ITU-R BS.775 5.1-Channel