Springer Handbook of Speech Processing


Springer Handbooks provide a concise compilation of approved key information on methods of research, general principles, and functional relationships in physical sciences and engineering. The world’s leading experts in the fields of physics and engineering will be assigned by one or several renowned editors to write the chapters comprising each volume. The content is selected by these experts from Springer sources (books, journals, online content) and other systematic and approved recent publications of physical and technical information.

The volumes are designed to be useful as readable desk reference books to give a fast and comprehensive overview and easy retrieval of essential reliable key information, including tables, graphs, and bibliographies. References to extensive sources are provided.



Jacob Benesty, M. Mohan Sondhi, Yiteng Huang (Eds.)

With DVD-ROM, 456 Figures and 113 Tables

Editors:

Jacob Benesty
INRS-EMT, University of Quebec
800 de la Gauchetiere Ouest, Suite 6900
Montreal, Quebec, H5A 1K6, Canada

M. Mohan Sondhi
Avayalabs Research
233 Mount Airy Road
Basking Ridge, NJ 07920, USA

Yiteng Huang
Bell Laboratories, Alcatel-Lucent
600 Mountain Avenue
Murray Hill, NJ 07974, USA

Library of Congress Control Number: 2007931999

ISBN: 978-3-540-49125-5 e-ISBN: 978-3-540-49127-9

This work is subject to copyright. All rights reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media

springer.com

© Springer-Verlag Berlin Heidelberg 2008

The use of designations, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Product liability: The publisher cannot guarantee the accuracy of any information about dosage and application contained in this book. In every individual case the user must check such information by consulting the relevant literature.

Typesetting and production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Senior Manager Springer Handbook: Dr. W. Skolaut, Heidelberg
Typography and layout: schreiberVIS, Seeheim
Illustrations: Hippmann GbR, Schwarzenbruck
Cover design: eStudio Calamar Steinen, Barcelona
Cover production: WMXDesign GmbH, Heidelberg
Printing and binding: Stürtz GmbH, Würzburg

Printed on acid free paper

SPIN 11544036 60/3180/YL 5 4 3 2 1 0


Foreword

J. L. Flanagan

Professor Emeritus
Electrical and Computer Engineering
Rutgers University

Over the past three decades digital signal processing has emerged as a recognized discipline. Much of the impetus for this advance stems from research in representation, coding, transmission, storage and reproduction of speech and image information. In particular, interest in voice communication has stimulated central contributions to digital filtering and discrete-time spectral transforms.

This dynamic development was built upon the convergence of three then-evolving technologies: (i) sampled-data theory and representation of information signals (which led directly to digital telecommunication that provides signal quality independent of transmission distance); (ii) electronic binary computation (aided in early implementation by pulse-circuit techniques from radar design); and, (iii) invention of solid-state devices for exquisite control of electronic current (transistors – which now, through microelectronic materials, scale to systems of enormous size and complexity). This timely convergence was soon followed by optical fiber methods for broadband information transport.

These advances impact an important aspect of human activity – information exchange. And, over man’s existence, speech has played a principal role in human communication. Now, speech is playing an increasing role in human interaction with complex information systems. Automatic services of great variety exploit the comfort of voice exchange, and, in the corporate sector, sophisticated audio/video teleconferencing is reducing the necessity of expensive, time-consuming business travel. In each instance an overarching target is a user environment that captures some of the naturalness and spatial realism of face-to-face communication. Again, speech is a core element, and new understanding from diverse research sectors can be brought to bear.

Editors-in-Chief Benesty, Sondhi and Huang have organized a timely engineering handbook to answer this need. They have assembled a remarkable compendium of current knowledge in speech processing. And, this accumulated understanding can be focused upon enlarging the human capacity to deal with a world ever increasing in complexity. Benesty, Sondhi and Huang are renowned researchers in their own right, and they have attracted an international cadre of over 80 fellow authors and collaborators who constitute a veritable Who’s Who of world leaders in speech processing research. The resulting book provides under one cover authoritative treatments that commence with the basic physics and psychophysics of speech and hearing, and range through the related topics of computational tools, coding, synthesis, recognition, and signal enhancement, concluding with discussions on capture and projection of sound in enclosures. The book can be expected to become a valuable resource for researchers, engineers and speech scientists throughout the global community. It should equally serve teachers and students in human communication, especially delimiting knowledge frontiers where graduate thesis research may be appropriate.

Warren, New Jersey
October 2007

Jim Flanagan


Preface

Jacob Benesty

M. Mohan Sondhi

Yiteng Huang

The achievement of this Springer Handbook is the result of a wonderful journey that started in March 2005 at the 30th International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Two of the editors-in-chief (Benesty and Huang) met in one of the long corridors of the Pennsylvania Convention Center in Philadelphia with Dr Dieter Merkle from Springer. Together we had a very nice discussion about the conference and immediately an idea came up for a handbook. After a short discussion we converged without too much hesitation on a handbook of speech processing. It was quite surprising to see that, even after 30 years of ICASSP and more than half a century of research in this fundamental area, there was still no major book summarizing the important aspects of speech processing. We thought that the time was ripe for such a large project. Soon after we got home, a third editor-in-chief (Sondhi) joined the efforts.

We had a very clear objective in our minds: to summarize, in a reasonable number of pages, the most important and useful aspects of speech processing. The content was then organized accordingly. This task was not easy since we had to find a good balance between feasible ideas and new trends. As we all know, practical ideas can be viewed as old stuff while emerging ideas can be criticized for not having passed the test of time; we hope that we have succeeded in finding a good compromise. For this we relied on many authors who are well established and are recognized as experts in their field, from all over the world, and from academia as well as from industry.

From simple consumer products such as cell phones and MP3 players to more-sophisticated projects such as human–machine interfaces and robots that can obey orders, speech technologies are now everywhere. We believe that it is just a matter of time before more applications of the science of speech become impossible to miss in our daily life. So we believe that this Springer Handbook will play a fundamental role in the sustainable progress of speech research and development.

This handbook is targeted at three categories of readers: graduate students of speech processing, professors and researchers in academia and research labs who are active in this field, and engineers in industry who need to understand or implement specific algorithms for their speech-related products. The handbook could also be used as a text for one or more graduate courses on signal processing for speech and various aspects of speech processing and applications.

For the completion of such an ambitious project we have many people to thank. First, we would like to thank the many authors who did a terrific job in delivering very high-quality chapters. Second, we are very grateful to the members of the editorial board who helped us so much in organizing the content and structure of this book, taking part in all phases of this project from conception to completion. Third, we would like to thank all the reviewers, who helped us to improve the quality of the material. Last, but not least, we would like to thank the Springer team for their availability and very professional work. In particular, we appreciated the help of Dieter Merkle, Christoph Baumann, Werner Skolaut, Petra Jantzen, and Claudia Rau.

We hope this Springer Handbook will inspire many great minds to find new research ideas or to implement algorithms in products.

Montreal, Basking Ridge, Murray Hill
October 2007

Jacob Benesty
M. Mohan Sondhi
Yiteng Huang


List of Editors

Editors-in-Chief

Jacob Benesty, Montreal
M. Mohan Sondhi, Basking Ridge
Yiteng (Arden) Huang, Murray Hill

Part Editors

Part A: Production, Perception, and Modeling of Speech

M. M. Sondhi, Basking Ridge

Part B: Signal Processing for Speech

Y. Huang, Murray Hill; J. Benesty, Montreal

Part C: Speech Coding

W. B. Kleijn, Stockholm

Part D: Text-to-Speech Synthesis

S. Narayanan, Los Angeles

Part E: Speech Recognition

L. Rabiner, Piscataway; B.-H. Juang, Atlanta

Part F: Speaker Recognition

S. Parthasarathy, Sunnyvale

Part G: Language Recognition

C.-H. Lee, Atlanta

Part H: Speech Enhancement

J. Chen, Murray Hill; S. Gannot, Ramat-Gan; J. Benesty, Montreal

Part I: Multichannel Speech Processing

J. Benesty, Montreal; I. Cohen, Haifa; Y. Huang, Murray Hill


List of Authors

Alex Acero
Microsoft Research
One Microsoft Way
Redmond, WA 98052, USA

Jont B. Allen
University of Illinois
ECE
Urbana, IL 61801, USA

Jacob Benesty
University of Quebec
INRS-EMT
800 de la Gauchetiere Ouest
Montreal, Quebec H5A 1K6, Canada

Frédéric Bimbot
IRISA (CNRS & INRIA) - METISS
Pièce C 320 - Campus Universitaire de Beaulieu
35042 Rennes, France

Thomas Brand
Carl von Ossietzky Universität Oldenburg
Sektion Medizinphysik
Haus des Hörens, Marie-Curie-Str. 2
26121 Oldenburg, Germany

Nick Campbell
Knowledge Creating Communication Research Centre
Acoustics & Speech Research Project, Spoken Language Communication Group
2-2-2 Hikaridai
619-0288 Keihanna Science City, Japan

William M. Campbell
MIT Lincoln Laboratory
Information Systems Technology Group
244 Wood Street
Lexington, MA 02420-9108, USA

Rolf Carlson
Royal Institute of Technology (KTH)
Department of Speech, Music and Hearing
Lindstedtsvägen 24
10044 Stockholm, Sweden

Jingdong Chen
Bell Laboratories
Alcatel-Lucent
600 Mountain Ave
Murray Hill, NJ 07974, USA

Juin-Hwey Chen
Broadcom Corp.
5300 California Avenue
Irvine, CA 92617, USA

Israel Cohen
Technion–Israel Institute of Technology
Department of Electrical Engineering
Technion City
Haifa 32000, Israel

Jordan Cohen
SRI International
300 Ravenswood Drive
Menlo Park, CA 94019, USA

Corinna Cortes
Google, Inc.
Google Research
76 9th Avenue, 4th Floor
New York, NY 10011, USA

Eric J. Diethorn
Avaya Labs Research
Multimedia Technologies Research Department
233 Mt. Airy Road
Basking Ridge, NJ 07920, USA


Simon Doclo
Katholieke Universiteit Leuven
Department of Electrical Engineering (ESAT-SCD)
Kasteelpark Arenberg 10 bus 2446
3001 Leuven, Belgium

Jasha Droppo
Microsoft Research
Speech Technology Group
One Microsoft Way
Redmond, WA 98052, USA

Thierry Dutoit
Faculté Polytechnique de Mons FPMs
TCTS Laboratory
Bvd Dolez, 31
7000 Mons, Belgium

Gary W. Elko
mh acoustics LLC
25A Summit Ave
Summit, NJ 07901, USA

Sadaoki Furui
Tokyo Institute of Technology Street
Department of Computer Science
2-12-1 Ookayama, Meguro-ku
152-8552 Tokyo, Japan

Sharon Gannot
Bar-Ilan University
School of Electrical Engineering
Ramat-Gan 52900, Israel

Mazin E. Gilbert
AT&T Labs, Inc., Research
180 Park Ave.
Florham Park, NJ 07932, USA

Michael M. Goodwin
Creative Advanced Technology Center
Audio Research
1500 Green Hills Road
Scotts Valley, CA 95066, USA

Volodya Grancharov
Multimedia Technologies
Ericsson Research, Ericsson AB
Torshamnsgatan 23, Kista, KI/EAB/TVA/A
16480 Stockholm, Sweden

Björn Granström
Royal Institute of Technology (KTH)
Department for Speech, Music and Hearing
Lindstedtsvägen 24
10044 Stockholm, Sweden

Patrick Haffner
AT&T Labs-Research
IP and Voice Services
200 S Laurel Ave.
Middletown, NJ 07748, USA

Roar Hagen
Global IP Solutions
Magnus Ladulsgatan 63B
118 27 Stockholm, Sweden

Mary P. Harper
University of Maryland
Center for Advanced Study of Language
7005 52nd Avenue
College Park, MD 20742, USA

Jürgen Herre
Fraunhofer Institute for Integrated Circuits (Fraunhofer IIS)
Audio and Multimedia
Am Wolfsmantel 33
91058 Erlangen, Germany


Wolfgang J. Hess
University of Bonn
Institute for Communication Sciences, Dept. of Communication, Language, and Speech
Poppelsdorfer Allee 47
53115 Bonn, Germany

Kiyoshi Honda
Université de la Sorbonne Nouvelle-Paris III
Laboratoire de Phonétique et de Phonologie, ATR Cognitive Information Laboratories
UMR-7018-CNRS, 46, rue Barrault
75634 Paris, France

Yiteng (Arden) Huang
Bell Laboratories
Alcatel-Lucent
600 Mountain Avenue
Murray Hill, NJ 07974, USA

Matthieu Hébert
Network ASR Core Technology
Nuance Communications
1500 Université
Montréal, Québec H3A-3S7, Canada

Biing-Hwang Juang
Georgia Institute of Technology
School of Electrical & Computer Engineering
777 Atlantic Dr. NW
Atlanta, GA 30332-0250, USA

Tatsuya Kawahara
Kyoto University
Academic Center for Computing and Media Studies
Sakyo-ku
606-8501 Kyoto, Japan

Ulrik Kjems
Oticon A/S
9 Kongebakken
2765 Smørum, Denmark

Esther Klabbers
Oregon Health & Science University
Center for Spoken Language Understanding, OGI School of Science and Engineering
20000 NW Walker Rd
Beaverton, OR 97006, USA

W. Bastiaan Kleijn
Royal Institute of Technology (KTH)
School of Electrical Engineering, Sound and Image Processing Lab
Osquldas väg 10
10044 Stockholm, Sweden

Birger Kollmeier
Universität Oldenburg
Medizinische Physik
26111 Oldenburg, Germany

Ermin Kozica
Royal Institute of Technology (KTH)
School of Electrical Engineering, Sound and Image Processing Laboratory
Osquldas väg 10
10044 Stockholm, Sweden

Sen M. Kuo
Northern Illinois University
Department of Electrical Engineering
DeKalb, IL 60115, USA

Jan Larsen
Technical University of Denmark
Informatics and Mathematical Modelling
Richard Petersens Plads
2800 Kongens Lyngby, Denmark

Chin-Hui Lee
Georgia Institute of Technology
School of Electrical and Computer Engineering
777 Atlantic Drive NW
Atlanta, GA 30332-0250, USA


Haizhou Li
Institute for Infocomm Research
Department of Human Language Technology
21 Heng Mui Keng Terrace
Singapore, 119613

Jan Linden
Global IP Solutions
301 Brannan Street
San Francisco, CA 94107, USA

Manfred Lutzky
Fraunhofer Integrated Circuits (IIS)
Multimedia Realtime Systems
Am Wolfsmantel 33
91058 Erlangen, Germany

Bin Ma
Human Language Technology
Institute for Infocomm Research
21 Heng Mui Keng Terrace
Singapore, 119613

Michael Maxwell
University of Maryland
Center for Advanced Study of Language
Box 25
College Park, MD 20742, USA

Alan V. McCree
MIT Lincoln Laboratory
Department of Information Systems Technology
244 Wood Street
Lexington, MA 02420-9185, USA

Bernd Meyer
Carl von Ossietzky Universität Oldenburg
Medical Physics Section, Haus des Hörens
Marie-Curie-Str. 2
26121 Oldenburg, Germany

Jens Meyer
mh acoustics
25A Summit Ave.
Summit, NJ 07901, USA

Taniya Mishra
Oregon Health and Science University
Center for Spoken Language Understanding, Computer Science and Electrical Engineering, OGI School of Science and Engineering
20000 NW Walker Road
Beaverton, OR 97006, USA

Mehryar Mohri
Courant Institute of Mathematical Sciences
251 Mercer Street
New York, NY 10012, USA

Marc Moonen
Katholieke Universiteit Leuven
Electrical Engineering Department ESAT/SISTA
Arenberg 10
3001 Leuven, Belgium

Dennis R. Morgan
Bell Laboratories, Alcatel-Lucent
700 Mountain Avenue 2D-537
Murray Hill, NJ 07974-0636, USA

David Nahamoo
IBM Thomas J. Watson Research Center
PO BOX 218
Yorktown Heights, NY 10598, USA

Douglas O’Shaughnessy
Université du Québec
INRS Énergie, Matériaux et Télécommunications (INRS-EMT)
800, de la Gauchetiere Ouest
Montréal, Québec H5A 1K6, Canada


Lucas C. Parra
Steinman Hall, The City College of New York
Department of Biomedical Engineering
140th and Convent Ave
New York, NY 10031, USA

Sarangarajan Parthasarathy
Yahoo!, Applied Research
1MC 743, 701 First Avenue
Sunnyvale, CA 94089-0703, USA

Michael Syskind Pedersen
Oticon A/S
Kongebakken 9
2765 Smørum, Denmark

Fernando Pereira
University of Pennsylvania
Department of Computer and Information Science
305 Levine Hall, 3330 Walnut Street
Philadelphia, PA 19104, USA

Michael Picheny
IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598, USA

Rudolf Rabenstein
University Erlangen-Nuremberg
Electrical Engineering, Electronics, and Information Technology
Cauerstrasse 7/LMS
91058 Erlangen, Germany

Lawrence Rabiner
Rutgers University
Department of Electrical and Computer Engineering
96 Frelinghuysen Road
Piscataway, NJ 08854, USA

Douglas A. Reynolds
Massachusetts Institute of Technology
Lincoln Laboratory, Information Systems Technology Group
244 Wood Street
Lexington, MA 02420-9108, USA

Michael Riley
Google, Inc., Research
111 Eighth AV
New York, NY 10011, USA

Aaron E. Rosenberg
Rutgers University
Center for Advanced Information Processing
96 Frelinghuysen Road
Piscataway, NJ 08854-8088, USA

Salim Roukos
IBM T. J. Watson Research Center
Multilingual NLP Technologies
Yorktown Heights, NY 10598, USA

Jan van Santen
Oregon Health And Science University
OGI School of Science and Engineering, Department of Computer Science and Electrical Engineering
20000 NW Walker Rd
Beaverton, OR 97006-8921, USA

Ronald W. Schafer
Hewlett-Packard Laboratories
1501 Page Mill Road
Palo Alto, CA 94304, USA

Juergen Schroeter
AT&T Labs - Research
Department of Speech Algorithms and Engines
180 Park Ave
Florham Park, NJ 07932, USA


Stephanie Seneff
Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Laboratory
32 Vassar Street
Cambridge, MA 02139, USA

Wade Shen
Massachusetts Institute of Technology
Communication Systems, Information Systems Technology, Lincoln Laboratory
244 Wood Street
Lexington, MA 02420-9108, USA

Elliot Singer
Massachusetts Institute of Technology
Information Systems Technology Group, Lincoln Laboratory
244 Wood Street
Lexington, MA 02420-9108, USA

Jan Skoglund
Global IP Solutions
301 Brannan Street
San Francisco, CA 94107, USA

M. Mohan Sondhi
Avayalabs Research
233 Mount Airy Road
Basking Ridge, NJ 07920, USA

Sascha Spors
Deutsche Telekom AG, Laboratories
Ernst-Reuter-Platz 7
10587 Berlin, Germany

Ann Spriet
ESAT-SCD/SISTA, K.U. Leuven
Department of Electrical Engineering
Kasteelpark Arenberg 10
3001 Leuven, Belgium

Richard Sproat
University of Illinois at Urbana-Champaign
Department of Linguistics
Urbana, IL 61801, USA

Yannis Stylianou
Institute of Computer Science
Heraklion, Crete 700 13, Greece

Jes Thyssen
Broadcom Corporation
5300 California Avenue
Irvine, CA 92617, USA

Jay Wilpon
Research AT&T Labs
Voice and IP Services
Florham Park, NJ 07932, USA

Jan Wouters
ExpORL, Department of Neurosciences, K.U. Leuven
O.& N2, Herestraat 49
3000 Leuven, Belgium

Arie Yeredor
Tel-Aviv University
Electrical Engineering - Systems
Tel-Aviv 69978, Israel

Steve Young
Cambridge University Engineering Dept
Cambridge, CB2 1PZ, UK

Victor Zue
Massachusetts Institute of Technology
CSAI Laboratory
32 Vassar Street
Cambridge, MA 02139, USA


Contents

List of Abbreviations

1 Introduction to Speech Processing
J. Benesty, M. M. Sondhi, Y. Huang
1.1 A Brief History of Speech Processing
1.2 Applications of Speech Processing
1.3 Organization of the Handbook
References

Part A Production, Perception, and Modeling of Speech

2 Physiological Processes of Speech Production
K. Honda
2.1 Overview of Speech Apparatus
2.2 Voice Production Mechanisms
2.3 Articulatory Mechanisms
2.4 Summary
References

3 Nonlinear Cochlear Signal Processing and Masking in Speech Perception
J. B. Allen
3.1 Basics
3.2 The Nonlinear Cochlea
3.3 Neural Masking
3.4 Discussion and Summary
References

4 Perception of Speech and Sound
B. Kollmeier, T. Brand, B. Meyer
4.1 Basic Psychoacoustic Quantities
4.2 Acoustical Information Required for Speech Perception
4.3 Speech Feature Perception
References

5 Speech Quality Assessment
V. Grancharov, W. B. Kleijn
5.1 Degradation Factors Affecting Speech Quality
5.2 Subjective Tests
5.3 Objective Measures
5.4 Conclusions
References


Part B Signal Processing for Speech

6 Wiener and Adaptive Filters
J. Benesty, Y. Huang, J. Chen
6.1 Overview
6.2 Signal Models
6.3 Derivation of the Wiener Filter
6.4 Impulse Response Tail Effect
6.5 Condition Number
6.6 Adaptive Algorithms
6.7 MIMO Wiener Filter
6.8 Conclusions
References

7 Linear Prediction
J. Benesty, J. Chen, Y. Huang
7.1 Fundamentals
7.2 Forward Linear Prediction
7.3 Backward Linear Prediction
7.4 Levinson–Durbin Algorithm
7.5 Lattice Predictor
7.6 Spectral Representation
7.7 Linear Interpolation
7.8 Line Spectrum Pair Representation
7.9 Multichannel Linear Prediction
7.10 Conclusions
References

8 The Kalman Filter
S. Gannot, A. Yeredor
8.1 Derivation of the Kalman Filter
8.2 Examples: Estimation of Parametric Stochastic Process from Noisy Observations
8.3 Extensions of the Kalman Filter
8.4 The Application of the Kalman Filter to Speech Processing
8.5 Summary
References

9 Homomorphic Systems and Cepstrum Analysis of Speech
R. W. Schafer
9.1 Definitions
9.2 Z-Transform Analysis
9.3 Discrete-Time Model for Speech Production
9.4 The Cepstrum of Speech
9.5 Relation to LPC
9.6 Application to Pitch Detection
9.7 Applications to Analysis/Synthesis Coding
9.8 Applications to Speech Pattern Recognition
9.9 Summary
References

10 Pitch and Voicing Determination of Speech with an Extension Toward Music Signals
W. J. Hess
10.1 Pitch in Time-Variant Quasiperiodic Acoustic Signals
10.2 Short-Term Analysis PDAs
10.3 Selected Time-Domain Methods
10.4 A Short Look into Voicing Determination
10.5 Evaluation and Postprocessing
10.6 Applications in Speech and Music
10.7 Some New Challenges and Developments
10.8 Concluding Remarks
References

11 Formant Estimation and Tracking
D. O’Shaughnessy
11.1 Historical
11.2 Vocal Tract Resonances
11.3 Speech Production
11.4 Acoustics of the Vocal Tract
11.5 Short-Time Speech Analysis
11.6 Formant Estimation
11.7 Summary
References

12 The STFT, Sinusoidal Models, and Speech Modification
M. M. Goodwin .......... 229
12.1 The Short-Time Fourier Transform .......... 230
12.2 Sinusoidal Models .......... 242
12.3 Speech Modification .......... 253
References .......... 256

13 Adaptive Blind Multichannel Identification
Y. Huang, J. Benesty, J. Chen .......... 259
13.1 Overview .......... 259
13.2 Signal Model and Problem Formulation .......... 260
13.3 Identifiability and Principle .......... 261
13.4 Constrained Time-Domain Multichannel LMS and Newton Algorithms .......... 262
13.5 Unconstrained Multichannel LMS Algorithm with Optimal Step-Size Control .......... 266
13.6 Frequency-Domain Blind Multichannel Identification Algorithms .......... 268
13.7 Adaptive Multichannel Exponentiated Gradient Algorithm .......... 276
13.8 Summary .......... 279
References .......... 279

Part C Speech Coding

14 Principles of Speech Coding
W. B. Kleijn .......... 283
14.1 The Objective of Speech Coding .......... 283
14.2 Speech Coder Attributes .......... 284
14.3 A Universal Coder for Speech .......... 286
14.4 Coding with Autoregressive Models .......... 293
14.5 Distortion Measures and Coding Architecture .......... 296
14.6 Summary .......... 302
References .......... 303

15 Voice over IP: Speech Transmission over Packet Networks
J. Skoglund, E. Kozica, J. Linden, R. Hagen, W. B. Kleijn .......... 307
15.1 Voice Communication .......... 307
15.2 Properties of the Network .......... 308
15.3 Outline of a VoIP System .......... 313
15.4 Robust Encoding .......... 317
15.5 Packet Loss Concealment .......... 326
15.6 Conclusion .......... 327
References .......... 328

16 Low-Bit-Rate Speech Coding
A. V. McCree .......... 331
16.1 Speech Coding .......... 331
16.2 Fundamentals: Parametric Modeling of Speech Signals .......... 332
16.3 Flexible Parametric Models .......... 337
16.4 Efficient Quantization of Model Parameters .......... 344
16.5 Low-Rate Speech Coding Standards .......... 345
16.6 Summary .......... 347
References .......... 347

17 Analysis-by-Synthesis Speech Coding
J.-H. Chen, J. Thyssen .......... 351
17.1 Overview .......... 352
17.2 Basic Concepts of Analysis-by-Synthesis Coding .......... 353
17.3 Overview of Prominent Analysis-by-Synthesis Speech Coders .......... 357
17.4 Multipulse Linear Predictive Coding (MPLPC) .......... 360
17.5 Regular-Pulse Excitation with Long-Term Prediction (RPE-LTP) .......... 362
17.6 The Original Code Excited Linear Prediction (CELP) Coder .......... 363
17.7 US Federal Standard FS1016 CELP .......... 367
17.8 Vector Sum Excited Linear Prediction (VSELP) .......... 368
17.9 Low-Delay CELP (LD-CELP) .......... 370
17.10 Pitch Synchronous Innovation CELP (PSI-CELP) .......... 371
17.11 Algebraic CELP (ACELP) .......... 371
17.12 Conjugate Structure CELP (CS-CELP) and CS-ACELP .......... 377
17.13 Relaxed CELP (RCELP) – Generalized Analysis by Synthesis .......... 378
17.14 eX-CELP .......... 381
17.15 iLBC .......... 382
17.16 TSNFC .......... 383
17.17 Embedded CELP .......... 386
17.18 Summary of Analysis-by-Synthesis Speech Coders .......... 388
17.19 Conclusion .......... 390
References .......... 390

18 Perceptual Audio Coding of Speech Signals
J. Herre, M. Lutzky .......... 393
18.1 History of Audio Coding .......... 393
18.2 Fundamentals of Perceptual Audio Coding .......... 394
18.3 Some Successful Standardized Audio Coders .......... 396
18.4 Perceptual Audio Coding for Real-Time Communication .......... 398
18.5 Hybrid/Crossover Coders .......... 403
18.6 Summary .......... 409
References .......... 409

Part D Text-to-Speech Synthesis

19 Basic Principles of Speech Synthesis
J. Schroeter .......... 413
19.1 The Basic Components of a TTS System .......... 413
19.2 Speech Representations and Signal Processing for Concatenative Synthesis .......... 421
19.3 Speech Signal Transformation Principles .......... 423
19.4 Speech Synthesis Evaluation .......... 425
19.5 Conclusions .......... 426
References .......... 426

20 Rule-Based Speech Synthesis
R. Carlson, B. Granström .......... 429
20.1 Background .......... 429
20.2 Terminal Analog .......... 429
20.3 Controlling the Synthesizer .......... 432
20.4 Special Applications of Rule-Based Parametric Synthesis .......... 434
20.5 Concluding Remarks .......... 434
References .......... 434

21 Corpus-Based Speech Synthesis
T. Dutoit .......... 437
21.1 Basics .......... 437
21.2 Concatenative Synthesis with a Fixed Inventory .......... 438
21.3 Unit-Selection-Based Synthesis .......... 447
21.4 Statistical Parametric Synthesis .......... 450
21.5 Conclusion .......... 453
References .......... 453

22 Linguistic Processing for Speech Synthesis
R. Sproat .......... 457
22.1 Why Linguistic Processing is Hard .......... 457
22.2 Fundamentals: Writing Systems and the Graphical Representation of Language .......... 457
22.3 Problems to be Solved and Methods to Solve Them .......... 458
22.4 Architectures for Multilingual Linguistic Processing .......... 465
22.5 Document-Level Processing .......... 465
22.6 Future Prospects .......... 466
References .......... 467

23 Prosodic Processing
J. van Santen, T. Mishra, E. Klabbers .......... 471
23.1 Overview .......... 471
23.2 Historical Overview .......... 475
23.3 Fundamental Challenges .......... 476
23.4 A Survey of Current Approaches .......... 477
23.5 Future Approaches .......... 484
23.6 Conclusions .......... 485
References .......... 485

24 Voice Transformation
Y. Stylianou .......... 489
24.1 Background .......... 489
24.2 Source–Filter Theory and Harmonic Models .......... 490
24.3 Definitions .......... 492
24.4 Source Modifications .......... 494
24.5 Filter Modifications .......... 498
24.6 Conversion Functions .......... 499
24.7 Voice Conversion .......... 500
24.8 Quality Issues in Voice Transformations .......... 501
24.9 Summary .......... 502
References .......... 502

25 Expressive/Affective Speech Synthesis
N. Campbell .......... 505
25.1 Overview .......... 505
25.2 Characteristics of Affective Speech .......... 506
25.3 The Communicative Functionality of Speech .......... 508
25.4 Approaches to Synthesizing Expressive Speech .......... 510
25.5 Modeling Human Speech .......... 512
25.6 Conclusion .......... 515
References .......... 515

Part E Speech Recognition

26 Historical Perspective of the Field of ASR/NLU
L. Rabiner, B.-H. Juang .......... 521
26.1 ASR Methodologies .......... 521
26.2 Important Milestones in Speech Recognition History .......... 523
26.3 Generation 1 – The Early History of Speech Recognition .......... 524
26.4 Generation 2 – The First Working Systems for Speech Recognition .......... 524
26.5 Generation 3 – The Pattern Recognition Approach to Speech Recognition .......... 525
26.6 Generation 4 – The Era of the Statistical Model .......... 530
26.7 Generation 5 – The Future .......... 534
26.8 Summary .......... 534
References .......... 535

27 HMMs and Related Speech Recognition Technologies
S. Young .......... 539
27.1 Basic Framework .......... 539
27.2 Architecture of an HMM-Based Recognizer .......... 540
27.3 HMM-Based Acoustic Modeling .......... 547
27.4 Normalization .......... 550
27.5 Adaptation .......... 551
27.6 Multipass Recognition Architectures .......... 554
27.7 Conclusions .......... 554
References .......... 555

28 Speech Recognition with Weighted Finite-State Transducers
M. Mohri, F. Pereira, M. Riley .......... 559
28.1 Definitions .......... 559
28.2 Overview .......... 560
28.3 Algorithms .......... 567
28.4 Applications to Speech Recognition .......... 574
28.5 Conclusion .......... 582
References .......... 582

29 A Machine Learning Framework for Spoken-Dialog Classification
C. Cortes, P. Haffner, M. Mohri .......... 585
29.1 Motivation .......... 585
29.2 Introduction to Kernel Methods .......... 586
29.3 Rational Kernels .......... 587
29.4 Algorithms .......... 589
29.5 Experiments .......... 591
29.6 Theoretical Results for Rational Kernels .......... 593
29.7 Conclusion .......... 594
References .......... 595

30 Towards Superhuman Speech Recognition
M. Picheny, D. Nahamoo .......... 597
30.1 Current Status .......... 597
30.2 A Multidomain Conversational Test Set .......... 598
30.3 Listening Experiments .......... 599
30.4 Recognition Experiments .......... 601
30.5 Speculation .......... 607
References .......... 614

31 Natural Language Understanding
S. Roukos .......... 617
31.1 Overview of NLU Applications .......... 618
31.2 Natural Language Parsing .......... 620
31.3 Practical Implementation .......... 623
31.4 Speech Mining .......... 623
31.5 Conclusion .......... 625
References .......... 626

32 Transcription and Distillation of Spontaneous Speech
S. Furui, T. Kawahara .......... 627
32.1 Background .......... 627
32.2 Overview of Research Activities on Spontaneous Speech .......... 628
32.3 Analysis for Spontaneous Speech Recognition .......... 632
32.4 Approaches to Spontaneous Speech Recognition .......... 635
32.5 Metadata and Structure Extraction of Spontaneous Speech .......... 640
32.6 Speech Summarization .......... 644
32.7 Conclusions .......... 647
References .......... 647

33 Environmental Robustness
J. Droppo, A. Acero .......... 653
33.1 Noise Robust Speech Recognition .......... 653
33.2 Model Retraining and Adaptation .......... 656
33.3 Feature Transformation and Normalization .......... 657
33.4 A Model of the Environment .......... 664
33.5 Structured Model Adaptation .......... 667
33.6 Structured Feature Enhancement .......... 671
33.7 Unifying Model and Feature Techniques .......... 675
33.8 Conclusion .......... 677
References .......... 677

34 The Business of Speech Technologies
J. Wilpon, M. E. Gilbert, J. Cohen .......... 681
34.1 Introduction .......... 682
34.2 Network-Based Speech Services .......... 686
34.3 Device-Based Speech Applications .......... 692
34.4 Vision/Predications of Future Services – Fueling the Trends .......... 697
34.5 Conclusion .......... 701
References .......... 702

35 Spoken Dialogue Systems
V. Zue, S. Seneff .......... 705
35.1 Technology Components and System Development .......... 707
35.2 Development Issues .......... 712
35.3 Historical Perspectives .......... 714
35.4 New Directions .......... 715
35.5 Concluding Remarks .......... 718
References .......... 718

Part F Speaker Recognition

36 Overview of Speaker Recognition
A. E. Rosenberg, F. Bimbot, S. Parthasarathy .......... 725
36.1 Speaker Recognition .......... 725
36.2 Measuring Speaker Features .......... 729
36.3 Constructing Speaker Models .......... 731
36.4 Adaptation .......... 735
36.5 Decision and Performance .......... 735
36.6 Selected Applications for Automatic Speaker Recognition .......... 737
36.7 Summary .......... 739
References .......... 739

37 Text-Dependent Speaker Recognition
M. Hébert .......... 743
37.1 Brief Overview .......... 743
37.2 Text-Dependent Challenges .......... 747
37.3 Selected Results .......... 750
37.4 Concluding Remarks .......... 760
References .......... 760

38 Text-Independent Speaker Recognition
D. A. Reynolds, W. M. Campbell .......... 763
38.1 Introduction .......... 763
38.2 Likelihood Ratio Detector .......... 764
38.3 Features .......... 766
38.4 Classifiers .......... 767
38.5 Performance Assessment .......... 776
38.6 Summary .......... 778
References .......... 779

Part G Language Recognition

39 Principles of Spoken Language Recognition
C.-H. Lee .......... 785
39.1 Spoken Language .......... 785
39.2 Language Recognition Principles .......... 786
39.3 Phone Recognition Followed by Language Modeling (PRLM) .......... 788
39.4 Vector-Space Characterization (VSC) .......... 789
39.5 Spoken Language Verification .......... 790
39.6 Discriminative Classifier Design .......... 791
39.7 Summary .......... 793
References .......... 793

40 Spoken Language Characterization
M. P. Harper, M. Maxwell .......... 797
40.1 Language versus Dialect .......... 798
40.2 Spoken Language Collections .......... 800
40.3 Spoken Language Characteristics .......... 800
40.4 Human Language Identification .......... 804
40.5 Text as a Source of Information on Spoken Languages .......... 806
40.6 Summary .......... 807
References .......... 807

41 Automatic Language Recognition Via Spectral and Token Based Approaches
D. A. Reynolds, W. M. Campbell, W. Shen, E. Singer .......... 811
41.1 Automatic Language Recognition .......... 811
41.2 Spectral Based Methods .......... 812
41.3 Token-Based Methods .......... 815
41.4 System Fusion .......... 818
41.5 Performance Assessment .......... 820
41.6 Summary .......... 823
References .......... 823

42 Vector-Based Spoken Language Classification
H. Li, B. Ma, C.-H. Lee .......... 825
42.1 Vector Space Characterization .......... 826
42.2 Unit Selection and Modeling .......... 827
42.3 Front-End: Voice Tokenization and Spoken Document Vectorization .......... 830
42.4 Back-End: Vector-Based Classifier Design .......... 831
42.5 Language Classification Experiments and Discussion .......... 835
42.6 Summary .......... 838
References .......... 839

Part H Speech Enhancement

43 Fundamentals of Noise Reduction
J. Chen, J. Benesty, Y. Huang, E. J. Diethorn .......... 843
43.1 Noise .......... 843
43.2 Signal Model and Problem Formulation .......... 845
43.3 Evaluation of Noise Reduction .......... 846
43.4 Noise Reduction via Filtering Techniques .......... 847
43.5 Noise Reduction via Spectral Restoration .......... 857
43.6 Speech-Model-Based Noise Reduction .......... 863
43.7 Summary .......... 868
References .......... 869

44 Spectral Enhancement Methods
I. Cohen, S. Gannot .......... 873
44.1 Spectral Enhancement .......... 874
44.2 Problem Formulation .......... 875
44.3 Statistical Models .......... 876
44.4 Signal Estimation .......... 879
44.5 Signal Presence Probability Estimation .......... 881
44.6 A Priori SNR Estimation .......... 882
44.7 Noise Spectrum Estimation .......... 888
44.8 Summary of a Spectral Enhancement Algorithm .......... 891
44.9 Selection of Spectral Enhancement Algorithms .......... 896
44.10 Conclusions .......... 898
References .......... 899

45 Adaptive Echo Cancelation for Voice Signals
M. M. Sondhi .......... 903
45.1 Network Echoes .......... 904
45.2 Single-Channel Acoustic Echo Cancelation .......... 915
45.3 Multichannel Acoustic Echo Cancelation .......... 921
45.4 Summary .......... 925
References .......... 926

46 Dereverberation
Y. Huang, J. Benesty, J. Chen .......... 929
46.1 Background and Overview .......... 929
46.2 Signal Model and Problem Formulation .......... 931
46.3 Source Model-Based Speech Dereverberation .......... 932
46.4 Separation of Speech and Reverberation via Homomorphic Transformation .......... 936
46.5 Channel Inversion and Equalization .......... 937
46.6 Summary .......... 941
References .......... 942

47 Adaptive Beamforming and Postfiltering
S. Gannot, I. Cohen .......... 945
47.1 Problem Formulation .......... 947
47.2 Adaptive Beamforming .......... 948
47.3 Fixed Beamformer and Blocking Matrix .......... 953
47.4 Identification of the Acoustical Transfer Function .......... 955
47.5 Robustness and Distortion Weighting .......... 960
47.6 Multichannel Postfiltering .......... 962
47.7 Performance Analysis .......... 967
47.8 Experimental Results .......... 972
47.9 Summary .......... 972
47.A Appendix: Derivation of the Expected Noise Reduction for a Coherent Noise Field .......... 973
47.B Appendix: Equivalence Between Maximum SNR and LCMV Beamformers .......... 974
References .......... 975

48 Feedback Control in Hearing Aids
A. Spriet, S. Doclo, M. Moonen, J. Wouters
48.1 Problem Statement
48.2 Standard Adaptive Feedback Canceller
48.3 Feedback Cancellation Based on Prior Knowledge of the Acoustic Feedback Path
48.4 Feedback Cancellation Based on Closed-Loop System Identification
48.5 Comparison
48.6 Conclusions
References

49 Active Noise Control
S. M. Kuo, D. R. Morgan
49.1 Broadband Feedforward Active Noise Control
49.2 Narrowband Feedforward Active Noise Control
49.3 Feedback Active Noise Control
49.4 Multichannel ANC
49.5 Summary
References

Part I Multichannel Speech Processing

50 Microphone Arrays
G. W. Elko, J. Meyer
50.1 Microphone Array Beamforming
50.2 Constant-Beamwidth Microphone Array System
50.3 Constrained Optimization of the Directional Gain
50.4 Differential Microphone Arrays
50.5 Eigenbeamforming Arrays
50.6 Adaptive Array Systems
50.7 Conclusions
References

51 Time Delay Estimation and Source Localization
Y. Huang, J. Benesty, J. Chen
51.1 Technology Taxonomy
51.2 Time Delay Estimation
51.3 Source Localization
51.4 Summary
References

52 Convolutive Blind Source Separation Methods
M. S. Pedersen, J. Larsen, U. Kjems, L. C. Parra
52.1 The Mixing Model
52.2 The Separation Model
52.3 Identification
52.4 Separation Principle
52.5 Time Versus Frequency Domain
52.6 The Permutation Ambiguity
52.7 Results
52.8 Conclusion
References

53 Sound Field Reproduction
R. Rabenstein, S. Spors
53.1 Sound Field Synthesis
53.2 Mathematical Representation of Sound Fields
53.3 Stereophony
53.4 Vector-Based Amplitude Panning
53.5 Ambisonics
53.6 Wave Field Synthesis
References

Acknowledgements
About the Authors
Detailed Contents
Subject Index

List of Abbreviations

2TS two-tone suppression

A

ACELP algebraic code excited linear prediction
ACF autocorrelation function
ACR absolute category rating
ACS autocorrelation coefficient sequences
ACeS Asia Cellular Satellite
ADC analog-to-digital converter
ADPCM adaptive differential pulse code modulation
AEC acoustic echo cancelation
AFE advanced front-end
AGC automatic gain control
AGN automatic gain normalization
AL averaged acoustic frame likelihood
AMR-WB+ extended wide-band adaptive multirate coder
AMR-WB wide-band AMR speech coder
AMSC-TMI American Mobile Satellite Corporation Telesat Mobile Incorporated
AN auditory nerve
ANC active noise cancelation
ANN artificial neural networks
ANOVA analysis of variance
APA affine projection algorithm
APC adaptive predictive coding
APCO Association of Public-Safety Communications Officials
APP adjusted test-set perplexity
AR autoregressive
ARISE Automatic Railway Information Systems for Europe
ARMA autoregressive moving-average
ARPA Advanced Research Projects Agency
ARQ automatic repeat request
ART advanced recognition technology
ASAT automatic speech attribute transcription
ASM acoustic segment model
ASR automatic speech recognition
ATF acoustical transfer function
ATIS airline travel information system
ATN augmented transition networks
ATR advanced telecommunications research
AW acoustic word

B

BBN Bolt, Beranek and Newman
BIC Bayesian information criterion
BILD binaural intelligibility level difference
BM blocking matrix
BN broadcast news
BSD bark spectral distortion
BSS blind source separation

C

C consonants
CA cochlear amplifier
CAF continuous adaptation feedback
CART classification and regression tree
CASA computational auditory scene analysis
CAT cluster adaptive training
CCR comparison category rating
CDCN codeword-dependent cepstral normalization
CDF cumulative distribution function
CDMA code division multiple access
CE categorical estimation
CELP code-excited linear prediction
CF characteristic frequency
CF coherence function
CH call home
CHN cepstral histogram normalization
CIS caller identification system
CMLLR constrained MLLR
CMOS comparison mean opinion score
CMR co-modulation masking release
CMS cepstral mean subtraction
CMU Carnegie Mellon University
CMVN cepstral mean and variance normalization
CNG comfort noise generation
COC context-oriented clustering
CR cross-relation
CRF conditional random fields
CRLB Cramér–Rao lower bound
CS-ACELP conjugate structure ACELP
CS-CELP conjugate structure CELP
CSJ corpus of spontaneous Japanese
CSR continuous speech recognition
CTS conversational telephone speech
CVC consonant–vowel–consonant
CVN cepstral variance normalization
CZT chirp z-transform

D

DAC digital-to-analog converter
DAG directed acyclic graph
DAM diagnostic acceptability measure
DARPA Defense Advanced Research Projects Agency
DBN dynamic Bayesian network
DCF detection cost function
DCR degradation category rating
DCT discrete cosine transform
DET detection error tradeoff
DF disfluency
DFA deterministic finite automata
DFT discrete Fourier transform
DFW dynamic frequency warping
DM dialog management
DMOS degradation mean opinion score
DP dynamic programming
DPCM differential PCM
DPMC data-driven parallel model combination
DRT diagnostic rhyme test
DSP digital signal processing
DT discriminative training
DTFT discrete-time Fourier transform
DTW dynamic time warping
DoD Department of Defense

E

EC equalization and cancelation
ECOC error-correcting output coding
EER equal error rate
EGG electroglottography
EKF extended Kalman filter
ELER early-to-late energy ratio
EM estimate–maximize
EM expectation maximization
EMG electromyographic
EMLLT extended maximum likelihood linear transform
ER AAC-LD error resilient low-delay advanced audio coding
ERB equivalent rectangular bandwidth
ERL echo return loss
ERLE echo return loss enhancement
EVRC enhanced variable rate coder
eX-CELP extended CELP

F

FA false accept
FAP fast affine projection
FB-LPC forward backward linear predictive coding
FBF fixed beamformer
FBS filter bank summation
FC functional contour
FCDT frame-count-dependent thresholding
FEC frame erasure concealment
FFT fast Fourier transform
FIFO first-in first-out
FIR finite impulse response
FM forward masking
FMLLR maximum-likelihood feature-space regression
FR filler rate
FRC functional residual capacity
FRLS fast recursive least-squares
FSM finite state machine
FSN finite state network
FSS frequency selective switch
FST finite state transducer
FT Fourier transform
FTF fast transversal filter
FVQ fuzzy vector quantization
FXLMS filtered-X LMS

G

GCC generalized cross-correlation
GCI glottal closure instant
GEVD generalized eigenvalue decomposition
GLDS generalized linear discriminant sequence
GLR generalized likelihood ratio
GMM Gaussian mixture model
GPD generalized probabilistic descent
GSC generalized sidelobe canceller
GSM Groupe Spécial Mobile
GSV GMM supervector

H

HLDA heteroscedastic LDA
HLT human language technologies
HMIHY How May I Help You
HMM hidden Markov models
HMP hidden Markov processes
HNM harmonic-plus-noise model
HOS higher-order statistics
HPF high-pass filter
HRTF head-related transfer function
HSD honestly significant difference
HSR human speech recognition
HTK hidden Markov model toolkit

I

IAI International Association for Identification
ICA independent component analysis
IDA Institute for Defense Analyses
IDFT inverse DFT
IDTFT inverse discrete-time Fourier transform
IETF Internet Engineering Task Force
IFSS inverse frequency selective switch
IHC inner hair cells
II information index
IIR infinite impulse response
iLBC internet low-bit-rate codec
IMCRA improved minima-controlled recursive averaging
IMDCT inverse MDCT
IMM interacting multiple model
IO input–output
IP internet protocol
IP interruption (disfluent) point
IPA International Phonetic Alphabet
IPNLMS improved PNLMS
IQMF QMF synthesis filterbank
IR information retrieval
IS Itakura–Saito
ISU information state update
ITU International Telecommunication Union
IVR interactive voice response

J

JADE joint approximate diagonalization of eigenmatrices
JND just-noticeable difference

K

KL Kullback–Leibler
KLT Karhunen–Loève transform

L

LAN local-area network
LAR log-area-ratio
LCMV linearly constrained minimum-variance
LD-CELP low-delay CELP
LDA linear discriminant analysis
LDC Linguistic Data Consortium
LDF linear discriminant function
LID language identification
LL log-likelihood
LLAMA learning library for large-margin classification
LLR (log) likelihood ratio
LMFB log mel-frequency filterbank
LMR linear multivariate regression
LMS least mean square
LNRE large number of rare events
LP linear prediction
LPC linear prediction coefficients
LPC linear predictive coding
LPCC linear predictive cepstral coefficient
LRE language recognition evaluation
LRT likelihood-ratio test
LSA latent semantic analysis
LSA log-spectral amplitude
LSF line spectral frequency
LSI latent semantic indexing
LSR low sampling rates
LTI linear time invariant
LTIC linear time-invariant causal
LTP long term prediction
LVCSR large vocabulary continuous speech recognition

M

M-step maximization stage
MA moving average
MAP maximum a posteriori
MBE multiband excited
MBR minimum Bayes-risk
MBROLA multiband resynthesis overlap-add
MC multicategory
MCE minimum classification error
MCN multichannel Newton
MDC multiple description coding
MDCT modified discrete cosine transform
MDF multidelay filter
MDP Markov decision process
MELP mixed excitation linear prediction
MFCC mel-filter cepstral coefficient
MFoM maximal figure-of-merit
MIMO multiple-input multiple-output
MIPS million instructions per second
ML maximum-likelihood
MLLR maximum-likelihood linear regression
MLP multilayer perceptron
MMI maximum mutual information
MMSE-LSA MMSE of the log-spectral amplitude
MMSE-SA MMSE of the spectral amplitude
MMSE minimum mean-square error
MNB measuring normalizing blocks
MNRU modulated noise reference unit
MOPS million operations per second
MOS mean opinion score
MP matching pursuit
MPE minimum phone error
MPEG Moving Picture Experts Group
MPI minimal pairs intelligibility
MPLPC multipulse linear predictive coding
MRI magnetic resonance imaging
MRT modified rhyme test
MS minimum statistics
MSA modern standard Arabic
MSD minimum significant difference
MSE mean-square error
MSG maximum stable gain
MSNR maximum signal-to-noise ratio
MSVQ multistage VQ
MUI multimodal user interface
MUSHRA multi stimulus test with hidden reference and anchor
MVE minimum verification error
MVIMP my voice is my password
MW maximum wins
MuSIC multiple signal classification

N

NAB North American Business News
NASA National Aeronautics and Space Administration
NATO North Atlantic Treaty Organization
NFA nondeterministic finite automata
NFC noise feedback coding
NIST National Institute of Standards and Technology
NLG natural language generation
NLMS normalized least-mean-square
NLP natural language processing
NLU natural language understanding
NN neural network
NSA National Security Agency
NTT Nippon Telegraph and Telephone
NUU nonuniform units

O

O object
OAE otoacoustic emissions
OHC outer-hair cells
OLA overlap-and-add
OOV out-of-vocabulary
OQ open quotient
OR out-of-vocabulary rate
OSI open systems interconnection reference model

P

P-PRLM parallel PRLM
PARCOR partial correlation coefficients
PBFD partitioned-block frequency-domain
PCA principal component analysis
PCBV phonetic-class-based verification
PCFG probabilistic context-free grammar
PCM pulse-code modulation
PDA pitch determination algorithms
PDC personal digital cellular
PDF probability density function
PDP parallel distributed processing
PDS positive-definite symmetric
PEAQ perceptual quality assessment for digital audio
PESQ perceptual evaluation of speech quality
PGG photoglottography
PICOLA pointer interval controlled overlap and add
PIV particle image velocimetry
PLC packet loss concealment
PLP perceptual linear prediction
PMC parallel model combination
PNLMS proportionate NLMS
POS part-of-speech
PP word perplexity
PPRLM parallel PRLM
PR phone recognizer
PRA partial-rank algorithm
PRLM phoneme recognition followed by language modeling
PSD power spectral density
PSI-CELP pitch synchronous innovation CELP
PSI pitch synchronous innovation
PSQM perceptual speech quality measure
PSRELP pitch-synchronous residual excited linear prediction
PSTN public switched telephone network
PVT parallel voice tokenization

Q

QA question answering
QMF quadrature mirror filter
QP quadratic program
QoS quality-of-service

R

RD rate-distortion
RASTA relative spectra
RATZ multivariate Gaussian-based cepstral normalization
RCELP relaxed CELP
REW rapidly evolving waveform
RF radio frequency
RIM repair interval model
RIR room impulse response
RL reticular lamina
RLS recursive least-squares
RM resource management
RMS root mean square
RMSE root-mean-square error
ROC receiver operating characteristic
RPA raw phone accuracy
RPD point of the reparandum
RPE-LTP regular-pulse excitation with long-term prediction
RR reprompt rates
RS Reed–Solomon
RSVP resource reservation protocol
RT rich transcription
RTM resonant tectorial membrane
RTP real-time transport protocol

S

S2S Speech-to-speech
S subject
SAT speaker-adaptive trained
SB switchboard
SCTE Society of Cable Telecommunications Engineers
SD-CFG stochastic dependency context-free grammar
SD spectral distortion
SDC shifted delta cepstral
SDR-GSC speech distortion regularized generalized sidelobe canceller
SDW-MWF speech distortion-weighted multichannel Wiener filter
SEW slowly evolving waveform
SGML standard generalized markup language
SI speech intelligibility
SII speech intelligibility index
SIMO single-input multiple-output
SISO single-input single-output
SL sensation level
SLM statistical language model
SLS spoken language system
SM sinusoidal models
SMS speaker model synthesis
SMT statistical machine translation
SMV selectable mode vocoder
SNR signal-to-noise ratio
SOLA synchronized overlap add
SOS second-order statistics
SPAM subspace-constrained precision and means
SPIN saturated Poisson internal noise
SPINE speech in noisy environment
SPL sound pressure level
SPLICE stereo piecewise linear compensation for environment
SQ speed quotient
SR speaking rate
SRT speech reception threshold
SSML speech synthesis markup language
STC sinusoidal transform coder
STFT short-time Fourier transform
SU sentence unit
SUI speech user interface
SUNDIAL speech understanding and dialog
SUR Speech Understanding Research
SVD singular value decomposition
SVM support vector machines
SWB switchboard
SegSNR segmental SNR

T

T–F time–frequency
TBRR transient beam-to-reference ratio
TC text categorization
TCP transmission control protocol
TCX transform coded excitation
TD-PSOLA time-domain pitch-synchronous overlap-add
TD time domain
TDAC time-domain aliasing cancelation
TDBWE time-domain bandwidth extension
TDMA time-division multiple-access
TDOA time difference of arrival
TDT topic detection and tracking
TF-GSC transfer-function generalized sidelobe canceller
TFIDF term frequency inverse document frequency
TFLLR term frequency log-likelihood ratio
TFLOG term frequency logarithmic
TI transinformation index
TIA Telecommunications Industry Association
TITO two-input-two-output
TM tectorial membrane
TM tympanic membrane
TMJ temporomandibular joint
TNS temporal noise shaping
TPC transform predictive coder
TSNFC two-stage noise feedback coding
TTS text-to-speech
ToBI tone and break indices

U

UBM universal background model
UCD unit concatenative distortion
UDP user datagram protocol
UE user experience
UKF unscented Kalman filter
ULD ultra-low delay
USD unit segmental distortion
USM upward spread of masking
UT unscented transform
UVT universal voice tokenization

V

V verb
V vowels
VAD voice activity detector
VBAP vector based amplitude panning
VCV vowel–consonant–vowel
VLSI very large-scale integration
VMR-WB variable-rate multimode wide-band
VOT voice onset time
VQ vector quantization
VRCP voice recognition call processing
VRS variable rate smoothing
VSC vector space characterization
VSELP vector sum excited linear prediction
VT voice tokenization
VTLN vocal-tract-length normalization
VTR vocal tract resonance
VTS vector Taylor-series
VoIP voice over IP

W

WDRC wide dynamic-range multiband compression
WER word error rate
WFSA weighted finite-state acceptors
WFST weighted finite-state transducer
WGN white Gaussian noise
WI waveform interpolation
WLAN wireless LAN
WLS weighted least-squares
WMOPS weighted MOPS
WSJ Wall Street Journal
WSOLA waveform similarity OLA
WiFi wireless fidelity

X

XML extensible mark-up languages
XOR exclusive-or

Z

ZC zero crossing
ZIR zero-input response
ZSR zero-state response