

Recommend to your library

Introducing the World’s Most Comprehensive Online Scientific Book Collection.

Springer eBook Collection

Springer, the world’s largest international publisher of scientific books, introduces the world’s most comprehensive digitized scientific, technical and medical book collection. The Springer eBook Collection offers the first online book collection especially made for the requirements of researchers and scientists. The collection includes online access to more than 10,000 books, adding 3,000 new research book titles each year.

For more information visit springer.com/ebooks


Springer Handbook of Speech Processing
Edited by J. Benesty, M. M. Sondhi, Y. Huang

- Authoritative desk reference
- Quick access to applicable, reliable, and comprehensive knowledge
- Electronic contents on accompanying DVD

WITH DVD

About the Editors:

Jacob Benesty
Jacob Benesty received the Master’s degree from Pierre & Marie Curie University, France, in 1987, and the Ph.D. degree in control and signal processing from Orsay University, France, in 1991. After positions at Telecom Paris University as a consultant and as a member of the technical staff at Bell Laboratories, Murray Hill, Dr. Benesty joined the University of Quebec (INRS-EMT) in Montreal, Canada, in May 2003 as a professor. His research interests are in signal processing, acoustic signal processing, and multimedia communications.

M. M. Sondhi
M. Mohan Sondhi is a consultant at Avaya Research Labs, Basking Ridge, New Jersey. Prior to joining Avaya he spent 39 years at Bell Labs, from which he retired in 2001. He holds undergraduate degrees in Physics and Electrical Communication Engineering, and M.S. and Ph.D. degrees in Electrical Engineering. At Bell Labs he conducted research in speech signal processing, echo cancellation, acoustical inverse problems, speech recognition, articulatory models for analysis and synthesis of speech, and modeling of auditory and visual processing by humans.

Y. Huang
Yiteng (Arden) Huang received the B.S. degree from Tsinghua University in 1994, and the M.S. and Ph.D. degrees from the Georgia Institute of Technology (Georgia Tech) in 1998 and 2001, respectively, all in electrical and computer engineering. Upon graduation, he joined Bell Laboratories as a member of technical staff in March 2001. His current research interests are in acoustic signal processing and multimedia communications.

Authors:
A. Acero, Redmond, WA, USA; Jont Allen, Urbana, IL, USA; Jacob Benesty, Montreal, QC, Canada; F. Bimbot, Rennes, France; Nick Campbell, Kyoto, Japan; William Campbell, Lexington, MA, USA; Rolf Carlson, Stockholm, Sweden; Jingdong Chen, Murray Hill, NJ, USA; Juin-Hwey Chen, Irvine, CA, USA; Israel Cohen, Haifa, Israel; Jordan Cohen, Woburn, MA, USA; Corinna Cortes, New York, NY, USA; Eric Diethorn, Basking Ridge, NJ, USA; T. Dutoit, Mons, Belgium; Simon Doclo, Leuven-Heverlee, Belgium; Jasha Droppo, Redmond, WA, USA; Gary Elko, Summit, NJ, USA; Sadaoki Furui, Tokyo, Japan; Sharon Gannot, Ramat-Gan, Israel; Mazin Gilbert, Florham Park, NJ, USA; Michael Goodwin, Scotts Valley, CA, USA; B. Granström, Stockholm, Sweden; Volodya Grancharov, Stockholm, Sweden; Patrick Haffner, Florham Park, NJ, USA; Roar Hagen, Stockholm, Sweden; Mary Harper, College Park, MD, USA; Larry Heck, Sunnyvale, CA, USA; Jürgen Herre, Erlangen, Germany; Wolfgang Hess, Bonn, Germany; Kiyoshi Honda, Kyoto, Japan; Yiteng Huang, Murray Hill, NJ, USA; B.-H. (Fred) Juang, Atlanta, GA, USA; Tatsuya Kawahara, Kyoto, Japan; Esther Klabbers, Beaverton, OR, USA; Bastiaan Kleijn, Stockholm, Sweden; Birger Kollmeier, Oldenburg, Germany; Sen Kuo, DeKalb, IL, USA; Chin-Hui Lee, Atlanta, GA, USA; Haizhou Li, Singapore; Jan Linden, San Francisco, CA, USA; Bin Ma, Singapore; Michael Maxwell, College Park, MD, USA; Alan McCree, Lexington, MA, USA; Mehryar Mohri, New York, NY, USA; Marc Moonen, Leuven-Heverlee, Belgium; Dennis Morgan, Murray Hill, NJ, USA; David Nahamoo, Yorktown Heights, NY, USA; Shri Narayanan, Los Angeles, CA, USA; Douglas O’Shaughnessy, Montreal, QC, Canada; Lucas Parra, New York, NY, USA; S. Parthasarathy, Florham Park, NJ, USA; Fernando Pereira, Philadelphia, PA, USA; Michael Picheny, Yorktown Heights, NY, USA; Rudolf Rabenstein, Erlangen, Germany; Lawrence Rabiner, Piscataway, NJ, USA; Douglas Reynolds, Lexington, MA, USA; Mike Riley, New York, NY, USA; Aaron Rosenberg, Piscataway, NJ, USA; Salim Roukos, Yorktown Heights, NY, USA; Ronald Schafer, Palo Alto, CA, USA; Juergen Schroeter, Florham Park, NJ, USA; Stephanie Seneff, Cambridge, MA, USA; Wade Shen, Lexington, MA, USA; Elliot Singer, Lexington, MA, USA; Jan Skoglund, San Francisco, CA, USA; Mohan Sondhi, Basking Ridge, NJ, USA; Sascha Spors, Berlin, Germany; Ann Spriet, Leuven-Heverlee, Belgium; Richard Sproat, Urbana, IL, USA; Yannis Stylianou, Heraklion, Greece; Jes Thyssen, Irvine, CA, USA; Jan Van Santen, Beaverton, OR, USA; Jay Wilpon, Florham Park, NJ, USA; Jan Wouters, Leuven, Belgium; Arie Yeredor, Tel-Aviv, Israel; Steve Young, Cambridge, UK; Victor Zue, Cambridge, MA, USA

Yes, please send me copies

All € and £ prices are net prices subject to local VAT, e.g. in Germany 7% VAT for books and 19% VAT for electronic products. Pre-publication pricing: Unless otherwise stated, pre-pub prices are valid through the end of the third month following publication, and therefore are subject to change. All prices exclusive of carriage charges. Prices and other details are subject to change without notice. All errors and omissions excepted.

Springer Distribution Center GmbH, Haberstrasse 7, 69126 Heidelberg, Germany | Call: +49 (0) 6221-345-4301 | Fax: +49 (0) 6221-345-4229 | Email: [email protected] | Web: springer.com


Available from:
Name:
Dept.:
Institution:
Street:
City / ZIP code:
Country:
Email:
Date / Signature:

[ ] Please bill me
[ ] Please charge my credit card: Eurocard/Access/Mastercard | Visa/Barclaycard/BankAmericard | American Express
Card number: / Valid until:

Order Now! Springer Handbook of Speech Processing
Jacob Benesty, M. M. Sondhi, Y. Huang (Eds.)

Print: ISBN 978-3-540-49125-5. € 249,00 | £191.50. Prepublication price, valid until February 29, 2008: € 199,95 | £154.00
eReference: ISBN 978-3-540-49127-9. € 249,00 | £191.50. Prepublication price, valid until February 29, 2008: € 199,95 | £154.00
Print + eReference: ISBN 978-3-540-49128-6. € 311,00 | £239.00. Prepublication price, valid until February 29, 2008: € 250,00 | £192.50


J. Benesty, Université du Québec, Montréal, QC, Canada; M. M. Sondhi, Avaya Labs Research, Basking Ridge, NJ, USA; Y. Huang, Bell Labs, Murray Hill, NJ, USA (Eds.)

This handbook is designed to play a fundamental role in sustainable progress in speech research and development. With an accessible format and an accompanying DVD-ROM, it targets three categories of readers: graduate students, professors and active researchers in academia, and engineers in industry who need to understand or implement specific algorithms for their speech-related products. It is a superb source of application-oriented, authoritative and comprehensive information about these technologies. This work combines the established knowledge derived from research in such fast-evolving disciplines as Signal Processing and Communications, Acoustics, Computer Science and Linguistics.

Springer Handbook of Speech Processing

Prepublication price, valid until February 29, 2008: € 199,95 | £154.00

Key topics
- Introduction to Speech Processing
- Production, Perception, and Modeling of Speech
- Signal Processing for Speech
- Speech Coding
- Text-to-Speech Synthesis
- Speech Recognition
- Speaker Recognition
- Language Recognition
- Speech Enhancement
- Multichannel Speech Processing


4.3 Speech Feature Perception

The information-theoretic approach to describing speech perception assumes that human speech recognition is based on the combined, parallel recognition of several acoustical cues that are characteristic for certain speech elements. While a phoneme represents the smallest unit of speech information, its acoustic realization (denoted as phone) can be quite variable in its acoustical properties. Such a phone is produced in order to deliver a number of acoustical speech cues to the listener, who should be able to deduce from it the underlying phoneme. Each speech cue represents one feature value of more- or less-complex speech features, like voicing, frication, or duration, that are linked to phonetics and to perception. These speech feature values are decoded by the listener independently of each other and are used for recognizing the underlying speech element (such as, e.g., the represented phoneme). Speech perception can therefore be interpreted as reception of certain values of several speech features in parallel and in discrete time steps.

Each phoneme is characterized by a unique combination of the underlying speech feature values. The articulation of words and sentences produces (in the sense of information theory) a discrete stream of information via a number of simultaneously active channels (Fig. 4.13).

The spoken realization of a given phoneme causes a certain speech feature to assume one out of several different possible values. For example, the speech feature voicing can assume the value one (i.e., voiced sound) or the value zero (unvoiced speech sound). Each of these features is transmitted via its own, specific transmission channel to the speech recognition system of the listener.
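As a toy illustration of this feature view, consider representing phonemes as bundles of discrete feature values and identifying a sound from the received values; the feature names and value assignments below are illustrative assumptions, not the handbook's feature set.

```python
# Toy illustration: phonemes as bundles of discrete speech-feature values.
# Feature names and value assignments are illustrative assumptions only.
PHONEME_FEATURES = {
    "/b/": {"voicing": 1, "frication": 0, "nasality": 0},
    "/p/": {"voicing": 0, "frication": 0, "nasality": 0},
    "/z/": {"voicing": 1, "frication": 1, "nasality": 0},
    "/s/": {"voicing": 0, "frication": 1, "nasality": 0},
    "/m/": {"voicing": 1, "frication": 0, "nasality": 1},
}

def identify(received):
    """Return phonemes whose feature bundle matches the received feature values."""
    return [p for p, f in PHONEME_FEATURES.items() if f == received]

print(identify({"voicing": 0, "frication": 1, "nasality": 0}))  # ['/s/']
```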

Fig. 4.13 Schematic representation of speech recognition using speech features. Each speech sound is characterized by a combination of speech features that are modeled to be transmitted independently of each other by specialized (noisy) transmission channels

The channel consists of the acoustical transmission channel to the listener's ear and the subsequent decoding of the signal in the central auditory system of the receiver (which can be hampered by a hearing impairment or a speech pathology). The listener recognizes the actually assumed values of certain speech features and combines these features to yield the recognized phoneme.

If p(i) gives the probability (or relative frequency) that a specific speech feature assumes the value i, p'(j) gives the probability (or relative frequency, respectively) that the receiver receives the feature value j, and p(i, j) gives the joint probability that the value j is recognized if the value i is transmitted, then the so-called transinformation T is defined as

T = -\sum_{i=1}^{N} \sum_{j=1}^{N} p(i,j) \log_2 \left[ \frac{p(i)\, p'(j)}{p(i,j)} \right] . \quad (4.10)

The transinformation T assumes its maximal value for perfect transmission of the input values to the output values, i.e., if p(i, j) takes the diagonal form or any permutation thereof. T equals 0 if the distribution of received feature values is independent of the distribution of input feature values, i.e., if p(i, j) = p(i) p'(j). The maximum value of T for perfect transmission (i.e., p(i, j) = p(i) = p'(j)) equals the amount of information (in bits) included in the distribution of input feature values, H, i.e.,

H = \sum_{i=1}^{N} p(i) H(i) = -\sum_{i=1}^{N} p(i) \log_2 [p(i)] . \quad (4.11)

In order to normalize T to give values between 0 and 1, the so-called transinformation index (TI) is often used,
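A minimal sketch of how these quantities can be computed from a phoneme-confusion count matrix follows. The normalization TI = T/H is an assumption here (the excerpt breaks off before defining TI), chosen only so that TI lies between 0 and 1 as described.

```python
import numpy as np

def transinformation(counts):
    """Transinformation T (eq. 4.10), input entropy H (eq. 4.11), and the
    normalized transinformation index TI = T / H (assumed normalization),
    from a confusion-count matrix counts[i, j]: value i sent, value j received."""
    p_ij = counts / counts.sum()          # joint probabilities p(i, j)
    p_i = p_ij.sum(axis=1)                # input marginal p(i)
    p_j = p_ij.sum(axis=0)                # received marginal p'(j)
    nz = p_ij > 0                         # skip empty cells to avoid log(0)
    T = -np.sum(p_ij[nz] * np.log2(np.outer(p_i, p_j)[nz] / p_ij[nz]))
    H = -np.sum(p_i[p_i > 0] * np.log2(p_i[p_i > 0]))
    return T, H, T / H

# Example: slightly noisy transmission of a binary voicing feature.
counts = np.array([[45, 5],
                   [10, 40]])
T, H, TI = transinformation(counts)
print(f"T = {T:.3f} bit, H = {H:.3f} bit, TI = {TI:.3f}")
```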


Fig. 4.15 Schematic diagram of a model of the effective auditory processing, using a front end to transform the incoming speech signal into an internal representation, and a subsequent back-end, ideal recognition stage which is only limited by the internal noise

4.3.3 Internal Representation Approach and Higher-Order Temporal-Spectral Features

The internal representation approach of modelling speech reception assumes that the speech signal is transformed by our auditory system, with some nonlinear, parallel processing operations, into an internal representation. This representation is used as the input for a central, cognitive recognition unit which can be assumed to operate as an ideal observer, i.e., it performs a pattern match between the incoming internal representation and the multitude of stored internal representations. The accuracy of this recognition process is limited by the external variability of the speech items to be recognized, i.e., by their deviation from any of the stored internal templates. It is also limited by the internal noise that blurs the received internal representation due to neural noise and other physiological and psychological factors. The amount of internal noise can be estimated quite well from psychoacoustical experiments.

Fig. 4.16a–c Auditory spectrogram representation of the German word Stall. It can be represented by a spectrogram (a), a Bark spectrogram on a log-loudness scale (b), or as a contrast-enhanced version using nonlinear feedback loops (after [4.6, 34]) (c)
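To make the ideal-observer idea concrete, here is a deliberately simplified sketch (an illustration, not the handbook's model): stored templates stand in for internal representations, the incoming representation is blurred by additive internal noise, and recognition picks the nearest stored template.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stored internal representations (templates), one per speech item.
# In a real model these would be auditory spectrograms; here, random vectors.
templates = {"/a/": rng.normal(size=64),
             "/i/": rng.normal(size=64),
             "/u/": rng.normal(size=64)}

def recognize(internal_rep, internal_noise_sigma=0.5):
    """Ideal-observer pattern match: blur the incoming internal representation
    with internal noise, then return the closest stored template (Euclidean)."""
    noisy = internal_rep + rng.normal(scale=internal_noise_sigma,
                                      size=internal_rep.shape)
    return min(templates, key=lambda p: np.linalg.norm(noisy - templates[p]))

# With moderate internal noise, the correct item is usually recovered.
print(recognize(templates["/i/"]))
```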

Such an internal representation model puts mostof the peculiarities and limitations of the speechrecognition process into the nonlinear, destructive trans-formation process from the acoustical speech waveforminto its internal representation, assuming that all trans-formation steps are due to physiological processes thatcan be characterized completely physiologically or bypsychoacoustical means (Fig. 4.15 for illustration).

Several concepts and models to describe such an internal representation have been developed so far. Some of the basic ideas are as follows.

1. Auditory spectrogram: The basic internal representation assumes that the speech sound is separated into a number of frequency bands (distributed evenly across a psychoacoustically based frequency scale like the Bark or ERB scale) and that the compressed frequency-channel-specific intensity is represented
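A minimal sketch of such an auditory (Bark-scale) spectrogram follows, assuming a standard STFT front end, Traunmüller's Bark approximation, and a simple power-law compression; all of these choices are assumptions for illustration, not the handbook's specification.

```python
import numpy as np
from scipy.signal import stft

def bark(f_hz):
    """Hz-to-Bark mapping (Traunmüller 1990 approximation)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def auditory_spectrogram(x, fs, n_bands=18, compress=0.23):
    """Crude Bark-scale spectrogram: STFT energy pooled into Bark bands,
    then compressed with a power law (a stand-in for loudness compression)."""
    f, t, X = stft(x, fs=fs, nperseg=int(0.025 * fs))  # ~25 ms frames
    power = np.abs(X) ** 2
    edges = np.linspace(0, bark(fs / 2), n_bands + 1)
    band_idx = np.digitize(bark(f), edges) - 1
    bands = np.zeros((n_bands, power.shape[1]))
    for b in range(n_bands):
        sel = band_idx == b
        if sel.any():
            bands[b] = power[sel].sum(axis=0)
    return t, bands ** compress  # compressed band intensities over time

# Example: a 1 s rising-frequency test signal at 16 kHz.
fs = 16000
x = np.sin(2 * np.pi * np.cumsum(np.linspace(100, 4000, fs)) / fs)
t, S = auditory_spectrogram(x, fs)
print(S.shape)  # (n_bands, n_frames)
```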


Chapter and section title for easy navigation

Clearly displayed math

Chapter title and section heading

Thumb indices identify the part and chapter section

- Authoritative desk reference for one of tomorrow's breakthrough technologies
- Provides quick access to applicable, reliable, and comprehensive knowledge
- Electronic contents on accompanying DVD
- Handy format, lucid two-color layout, uniformly styled figures

- For PC. For the complete system requirements see: springer.com

Print
2008. Approx. 1500 p., 560 illus. With DVD. Hardcover. ISBN 978-3-540-49125-5. € 249,00 | £191.50. Prepublication price, valid until February 29, 2008: € 199,95 | £154.00

Print + eReference
2008. Approx. 1500 p. With DVD. Print + eReference. ISBN 978-3-540-49128-6. € 311,00 | £239.00. Prepublication price, valid until February 29, 2008: € 250,00 | £192.50

eReference
2008. eReference. ISBN 978-3-540-49127-9. € 249,00 | £191.50. Prepublication price, valid until February 29, 2008: € 199,95 | £154.00

With DVD


4. Perception of Speech and Sound

The transformation of acoustical signals into auditory sensations can be characterized by psychophysical quantities like loudness, tonality, or perceived pitch. The resolution limits of the auditory system produce spectral and temporal masking phenomena and impose constraints on the perception of amplitude modulations. Binaural hearing (i.e., utilizing the acoustical difference across both ears) employs interaural time and intensity differences to produce localization and binaural unmasking phenomena such as the binaural intelligibility level difference, i.e., the speech reception threshold difference between listening to speech in noise monaurally versus listening with both ears.

The acoustical information available to the listener for perceiving speech even under adverse conditions can be characterized using the Articulation Index, the Speech Transmission Index, and the Speech Intelligibility Index. They can objectively predict speech reception thresholds as a function of spectral content, signal-to-noise ratio, and preservation of amplitude modulations in the speech waveform that enters the listener's ear. The articulatory or phonetic information available to and received by the listener can be characterized by speech feature sets. Transinformation analysis allows one to detect the relative transmission error connected with each of these speech features. The comparison across man and machine

4.1 Basic Psychoacoustic Quantities ..... 2
  4.1.1 Mapping of Intensity into Loudness ..... 2
  4.1.2 Pitch ..... 4
  4.1.3 Temporal Analysis and Modulation Perception ..... 5
  4.1.4 Binaural Hearing ..... 7
  4.1.5 Binaural Noise Suppression ..... 8
4.2 Acoustical Information Required for Speech Perception ..... 10
  4.2.1 Speech Intelligibility and Speech Reception Threshold (SRT) ..... 10
  4.2.2 Measurement Methods ..... 11
  4.2.3 Factors Influencing Speech Intelligibility ..... 12
  4.2.4 Prediction Methods ..... 12
4.3 Speech Feature Perception ..... 14
  4.3.1 Formant Features ..... 15
  4.3.2 Phonetic and Distinctive Feature Sets ..... 15
  4.3.3 Internal Representation Approach and Higher-Order Temporal-Spectral Features ..... 16
  4.3.4 Man–Machine Comparison ..... 20
References ..... 21

in speech recognition allows one to test hypotheses and models of human speech perception. Conversely, automatic speech recognition may be improved by introducing human signal processing principles into machine processing algorithms.

Acoustically produced speech is a very special sound to our ears and our brain. Humans are able to extract the information contained in a spoken message extremely efficiently, even if the speech energy is lower than that of any competing background sound. Hence, humans are able to communicate acoustically even under adverse listening conditions, e.g., in a cafeteria. The process of understanding speech can be subdivided into two stages. The first is an auditory pre-processing stage, where the speech sound is transformed into its internal representation in the brain and special speech features are extracted (such as, e.g., acoustic energy in a certain frequency channel as a function of time, or the instantaneous pitch of a speech sound). This process is assumed to be mainly bottom-up, with no special preference for speech sounds as compared to other sounds. In other words, the information contained in any of the speech features can be described quite well by the acoustical contents of the input signal to the ears. In a second step, speech pattern recognition takes place under cognitive control, where the internally available speech cues are assembled by our brain to convey the underlying message of the speech sound. This


Fig. 4.11 SRT data (filled symbols) and predictions for three different acoustical conditions and normal listeners. The crosses denote model predictions without introducing appropriate processing errors, whereas the open symbols denote predictions employing internal processing errors that have been taken from average values in other psychoacoustical tasks (after [4.8])

and amplifies one or both input channels to yield an approximate match (equalization) of the composite input signal within each frequency band. In a second, cancelation stage, the signals from both respective sides of the head are (imperfectly) subtracted from each other. Hence, if the masker (after the equalization step) is approximately the same in both ears, the cancelation step will eliminate the masker with the exception of some remaining error signal. Conversely, the desired signal, which differs in interaural phase and/or intensity relation from the masker, should stay nearly unchanged, yielding an improvement in signal-to-noise ratio. Using an appropriate numerical optimization strategy to fit the respective equalization parameters across frequency, the model depicted in Fig. 4.11 can predict human performance quite well, even in acoustically difficult situations such as several interfering talkers within a reverberant environment. Note that this model effectively corresponds to an adaptive spatial beamformer, i.e., a frequency-dependent optimum linear combination of the two sensor inputs to both ears that yields a directivity optimized to improve the signal-to-noise ratio for a given target direction and interfering background noise. If the model output is used to predict speech intelligibility with an appropriate (monaural) speech intelligibility prediction method [such as, e.g., the speech intelligibility index (SII), see later], the binaural advantage for speech intelligibility in rooms can be predicted quite well (Fig. 4.11, from [4.8]).

Note that in each frequency band only one EC circuit is employed in the model. This reflects the empirical evidence that the brain is only able to cancel out one direction per frequency band at each instant of time. Hence, the processing strategy adopted will use appropriate compromises for any given real situation.
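As an illustration of the equalization-cancelation (EC) idea, the deliberately simplified sketch below (not the published model) searches, for one frequency band, a small grid of interaural gains and delays for the combination that minimizes the residual after subtraction, i.e., that best cancels the dominant masker:

```python
import numpy as np

def ec_band(left, right, gains, delays):
    """Equalization-cancelation in one frequency band (toy version): pick the
    gain/delay that minimizes residual energy after subtracting the two ear
    signals, i.e., that best cancels the dominant (masker) direction."""
    best = None
    for g in gains:
        for d in delays:
            residual = g * np.roll(left, d) - right  # equalize, then cancel
            e = np.sum(residual ** 2)
            if best is None or e < best[0]:
                best = (e, g, d, residual)
    return best  # (residual energy, gain, delay, output signal)

# Toy example: the same masker in both ears, up to a 3-sample delay and gain 0.8.
rng = np.random.default_rng(1)
masker = rng.normal(size=1000)
left, right = masker, 0.8 * np.roll(masker, 3)
e, g, d, out = ec_band(left, right,
                       gains=np.linspace(0.5, 1.5, 21), delays=range(-8, 9))
print(f"gain={g:.2f}, delay={d}, residual energy={e:.2e}")  # expect 0.80, 3
```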

4.2 Acoustical Information Required for Speech Perception

4.2.1 Speech Intelligibility and Speech Reception Threshold (SRT)

Speech intelligibility (SI) is important for various fields of research, engineering, and diagnostics, for quantifying very different phenomena such as the quality of recordings, communication and playback devices, the reverberation of auditoria, characteristics of hearing impairment, the benefit of using hearing aids, or combinations of these topics. The most useful way to define SI is: speech intelligibility SI is the proportion of speech items (e.g., syllables, words, or sentences) correctly repeated by (a) listener(s) for a given speech intelligibility test. This operative definition makes SI directly and quantitatively measurable.

The intelligibility function (Fig. 4.12) describes the listener's speech intelligibility SI as a function of the speech level L, which may refer either to the sound pressure level (measured in dB) of the speech signal or to the speech-to-noise ratio (SNR, measured in dB) if the test is performed with interfering noise.

In most cases it is possible to fit the logistic function SI(L) to the empirical data

SI(L) = \frac{1}{A} \left( 1 + \frac{SI_{\max} A - 1}{1 + \exp\left( -\frac{L - L_{\mathrm{mid}}}{s} \right)} \right) , \quad (4.4)

with L_mid: speech level of the midpoint of the intelligibility function; s: slope parameter; the slope at L_mid is
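As a quick numerical illustration of (4.4), a minimal sketch follows. The parameter values are hypothetical, as is the reading of A as the number of response alternatives (which the reconstructed formula suggests, since SI(L) approaches 1/A at very low levels and SI_max at very high levels).

```python
import numpy as np

def si(L, si_max=1.0, A=10, L_mid=-6.0, s=2.0):
    """Logistic intelligibility function of eq. (4.4). A is interpreted here
    as the number of response alternatives (so 1/A is the guessing level);
    all parameter values are hypothetical, for illustration only."""
    return (1.0 / A) * (1.0 + (si_max * A - 1.0)
                        / (1.0 + np.exp(-(L - L_mid) / s)))

for L in [-20, -10, -6, 0, 10]:  # speech levels or SNRs in dB
    print(f"L = {L:+4d} dB  ->  SI = {si(L):.2f}")
```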


Easy to read and use: includes more than 500 diagrams and illustrations drawn to scale

Each chapter comes with a summary and its own index for cross-referencing to sections

Part title for easy navigation

Sample chapter online at springer.com

One-column section headings

Four-color figures