
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 3, MARCH 2007

An Effective Algorithm for Automatic Detection and Exact Demarcation of Breath Sounds in Speech and Song Signals

Dima Ruinskiy, Student Member, IEEE, and Yizhar Lavner, Member, IEEE

Abstract—Automatic detection of predefined events in speech and audio signals is a challenging and promising subject in signal processing. One important application of such detection is removal or suppression of unwanted sounds in audio recordings, for instance in the professional music industry, where the demand for quality is very high. Breath sounds, which are present in most song recordings and often degrade the aesthetic quality of the voice, are an example of such unwanted sounds. Another example is bad pronunciation of certain phonemes. In this paper, we present an automatic algorithm for accurate detection of breaths in speech or song signals. The algorithm is based on a template matching approach, and consists of three phases. In the first phase, a template is constructed from mel frequency cepstral coefficient (MFCC) matrices of several breath examples and their singular value decompositions, to capture the characteristics of a typical breath event. Next, in the initial processing phase, each short-time frame is compared to the breath template, and marked as breathy or nonbreathy according to predefined thresholds. Finally, an edge detection algorithm, based on various time-domain and frequency-domain parameters, is applied to demarcate the exact boundaries of each breath event and to eliminate possible false detections. Evaluation of the algorithm on a database of speech and songs containing several hundred breath sounds yielded a correct identification rate of 98% with a specificity of 96%.

Index Terms—Breath detection, event spotting in speech and audio, mel frequency cepstral coefficient (MFCC).

I. INTRODUCTION

AUTOMATIC detection of predefined events in speech and audio signals is a challenging and promising subject in the field of signal processing. In addition to speech recognition [1], [2], there are many applications of audio information retrieval [3], [4], and various tools and approaches were tested for this task [5]–[7].

In case of speech recordings, the events to be recognized can be part of the linguistic content of speech (e.g., words [8], syllables [9], and individual phonemes [10], [11]) or nonverbal sounds or cues, such as laughs [12], breaths [13], coughs, and others.

Manuscript received March 13, 2006; revised August 23, 2006. This work was supported in part by Waves Audio Israel. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hiroshi Sawada.

D. Ruinskiy was with the Department of Computer Science, Tel-Hai Academic College, Upper Galilee 12210, Israel. He is now with the Faculty of Mathematics and Computer Science, Feinberg Graduate School, Weizmann Institute of Science, Rehovot 76100, Israel.

Y. Lavner is with the Computer Science Department, Tel-Hai Academic College, Upper Galilee 12210, Israel (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2006.889750

The applications of the detection are various. In [14], vowel/consonant recognition in speech served for differential time-scaling of speech for the hearing impaired. Nonverbal sound recognition can be used for enhancing the description of situations and events in speech-to-text algorithms [15] or for improving the efficiency of human transcriptionists [16]. Other applications of event detection are the elimination of badly articulated phonemes and other sounds that degrade the quality of the acoustic signal [17] or accentuation of certain sounds.

Due to the high-quality requirement in professional voice recordings, sound engineers have to recognize places where the quality is degraded or undesired sounds are added. Since on many occasions conducting a new recording session is either expensive or impossible (for example, in historical recordings), the undesired sounds must be eliminated or attenuated manually, a process that can be tedious and time consuming. Therefore, an efficient automatic algorithm that detects and accurately demarcates the required sound can be of great advantage.

An example of a sound that can degrade the quality of voice recordings is breath, which is inherently present in nearly all recordings, even those of professional singers and narrators. In professional audio, breaths are often considered unwanted sounds that degrade the aesthetics of the voice, especially when dynamic-range modification is applied and may unintentionally amplify them. In these cases, the sound engineer may want the breath sounds removed or suppressed significantly. However, in other situations, breath sounds may have their purpose in songs, for instance, as part of the emotional content, and then it is sometimes desirable to stress them and make them more pronounced. Whatever the purpose is, accurate detection of the breath sounds is required for further manipulations, such as attenuation or amplification.

In addition to its role in improving the quality of songs and speech recordings for the professional music industry, high-accuracy breath detection can be valuable in many other applications. In spontaneous speech recognition, for example, effects such as long pauses, word fragments, pause fillers, and breath sounds have been shown to significantly degrade the recognition performance [18]. Accurate detection of breaths, among other nonlexical sounds, was shown to improve the recognition quality. Other applications are automatic segmentation of continuous speech, where breath sounds may serve as natural delimiters in utterances, automatic labeling of prosody patterns [19], and enhancement of speech-to-text applications by including nonlexical sounds.



Fig. 1. Basic block diagram of the algorithm. After preliminary classification as breathy/nonbreathy, a refinement of the detection is performed in the vicinity of sections initially marked as breathy.

In this paper, we present an efficient algorithm for automatic detection of breath sounds in song and speech recordings that can be used in a real-time environment (see Section V). The algorithm is based on a template-matching approach for the initial detection and on a multistage feature tracking procedure for accurate edge detection and elimination of false alarms. The latter procedure makes it possible to achieve very high accuracy in terms of edge marking, which is often required in speech recognition and professional music applications.

Evaluation of the algorithm on a small database of recordings (see Section IV) showed that both the false negative rate and the false positive rate are very low.

For the initial detection, a prototype breath template is constructed from a small number of breath examples. The features used for the template are the mel frequency cepstral coefficients (MFCCs) [20], which are known for their ability to distinguish between different types of audio data [21], [22].

The processed audio signal is divided into consecutive overlapping frames, and each is compared to the template. If the similarity exceeds a predefined threshold, the frame is considered "breathy," i.e., a part of a breath sound.

Whenever breaths are detected by the initial detection phase, further refinement is carried out by applying a feature-tracking procedure aimed at accurate demarcation of the breath boundaries, as well as elimination of false detections. The waveform features used for this procedure include short-time energy, zero-crossing rate, and spectral slope, in addition to the MFCC-based similarity measure. A schematic block diagram that describes the basic algorithm is shown in Fig. 1. A detailed description of the processes in each block will be presented in the following sections.

Although the classification of each frame as breathy/nonbreathy can also be performed by other standard machine learning approaches such as the support vector machine (SVM) [23] or the Gaussian mixture model (GMM) [24], we find that the template-matching approach used here has several advantages: it is simple and computationally efficient, and yet very accurate and reliable. In contrast, both SVM and GMM require more complex models with multiple parameters and assumptions, and therefore have higher time and space complexity [25]. The training procedure in the presented algorithm is very fast and achieves high accuracy even with very few training examples.

This paper is organized as follows: In Section I-A, the physiological process of breath production and its relation to the detection of breath boundaries in speech signals are briefly described. Next, the main algorithm is presented: the template construction in Section II-A, the detection phase along with the audio features used in Section II-B, the computation of the breath similarity measure in Section II-C, and the edge detection algorithm in Section III. Following that, empirical results and performance evaluations of the proposed algorithm are presented in Section IV. Finally, some aspects of real-time implementation are discussed in Section V, followed by the conclusion.

A. Approach to Breath Edge Detection Based on Physiological Aspects of Breath Production

During breathing, various muscles operate to change the intrapulmonary volume and, in consequence, the intrapulmonary pressure, which affects the air flow direction. In inspiration, a coordinated contraction of the diaphragm and other muscles causes the intrapulmonary volume to increase, which translates into a decrease of the intrapulmonary pressure and causes air flow into the lungs. In expiration, the chest cavity and the intrapulmonary volumes are decreased, followed by an increase in the intrapulmonary pressure, causing air to be exhaled through the trachea out of the lungs.

Fig. 2. Part of a voice waveform demonstrating a breath sound located between two voiced phonemes. The upper line marks the breath, characterized by higher energy near the middle and lower energy at the edges. The lower lines denote the silence periods separating the breath from the neighboring phonemes.

Since there are several systems of muscles involved, such changes in the air flow direction cannot be instantaneous. As a result, when inhaling occurs during speech or song, it requires a certain pause, which results in a period of silence between the breath and the preceding utterance, and a similar period between the breath and the following utterance.

Measurements of breath events in many different speech sequences from different speakers have shown that the silence periods between a breath and the neighboring utterances are at least 20 ms long. This duration is sufficient for the silence period to be detected by energy tracking mechanisms. A typical breath event residing between two utterances is shown in Fig. 2. The silence periods before the breath and after it are readily noticeable.

The detection of the silence periods before and after the breath allows marking them as the breath edges, separating the breath from the neighboring phonemes. The attenuation of the breath in such cases creates a smooth and natural period of silence between the two utterances, without any artifacts.

In the following sections, the details of the algorithm that detects and demarcates breath events in speech recordings will be presented.

II. INITIAL DETECTION ALGORITHM

The initial detection consists of two phases. The first is the "training" phase, during which the algorithm "learns" the features of typical breath sounds. The main feature that is used in this study is a short-time cepstrogram, consisting of MFCC vectors, which are computed for consecutive short-time windows of the speech signal. In the training phase, the cepstrograms of several breath signals are computed, and a template cepstrogram matrix is constructed, representing a typical breath signal.

The second phase is the detection phase, where a speech signal is processed and short-time cepstrogram matrices are computed for consecutive analysis frames. For each frame, a breath similarity measure, which quantifies the similarity between the frame's cepstrogram and the template cepstrogram, is computed and compared to a predefined threshold. In cases where the similarity measure is above the threshold, the frame is considered breathy (i.e., part of a breath event).

Once the training phase is executed and a template is constructed, it can be used in multiple detection phases without retraining the system. Whereas the training phase is executed offline, the detection phase can be used in a real-time environment because the classification as breathy/nonbreathy is highly localized. More details on the real-time implementation of the algorithm and the required latency are presented in Section V.

A. Constructing the Template

The breath template is constructed using several breath examples, derived from one or more speakers. The template is expected to represent the most typical characteristics of a breath and to serve as a prototype, to which frames of the signal are compared in the detection phase. On the one hand, the template should contain the relevant data that distinguish the breath event from other phonemes and sounds, and, on the other hand, it should be compact for computational efficiency.

Both of the above requirements are satisfied by the MFCC, which represent the magnitude spectrum of each short-time voice signal compactly, with a small number of parameters [2], [20]. The effectiveness of the MFCC in recognition of phonemes and other components of human speech is well known and widely exploited [12], [26], [27], and therefore they were chosen for template construction in this study.

Several vectors of the MFCC are computed for each breath example, forming a short-time cepstrogram of the signal. The average cepstrogram of the breath examples is used as the template. The stages for construction of the template are as follows (Fig. 3); a code sketch of the procedure is given after the list.

1) Several signals containing isolated breath examples are selected, forming the example set. From each example, a section of fixed length, typically equal to the length of the shortest example in the set (about 100–160 ms), is derived. This length is used throughout the algorithm as the frame length (see Section II-C).

2) Each breath example is divided into short consecutive subframes, with a duration of 10 ms and a hop size of 5 ms. Each subframe is then pre-emphasized using a first-order difference filter (y[n] = x[n] - a x[n-1], where a is the pre-emphasis coefficient).

3) For each breath example, the MFCC are computed for every subframe, thus forming a short-time cepstrogram representation of the example. The cepstrogram is defined as a matrix whose columns are the MFCC vectors for each subframe. Each such matrix is denoted by M_i, i = 1, ..., L, where L is the number of examples in the example set. The construction of the cepstrogram is demonstrated in Fig. 4.

4) For each column of the cepstrogram, DC removal is performed, resulting in the matrix \tilde{M}_i.

5) A mean cepstrogram is computed by averaging the matrices of the example set, as follows:

T = \frac{1}{L} \sum_{i=1}^{L} \tilde{M}_i    (1)


Fig. 3. Schematic block diagram describing the construction of the template.

Fig. 4. Schematic description of the procedure for constructing the cepstrogram matrix. The frame is divided into overlapping subframes S, and for each subframe the MFCC vector V is computed. These vectors form the columns of the cepstrogram matrix.

This defines the template matrix T. In a similar manner, a variance matrix is computed, where the variance of each coefficient is measured along the example set.

6) In addition to the template matrix, another feature vector is computed as follows: the matrices of the example set are concatenated into one matrix, and the singular value decomposition (SVD) of the resulting matrix is computed. Then, the normalized singular vector u corresponding to the largest singular value is derived. Due to the information packing property of the SVD transform [28], the singular vector is expected to capture the most important features of the breath event, and thus improve the separation ability of the algorithm when used together with the template matrix in the calculation of the breath similarity measure of test signals (see Section II-C).
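To make the training phase concrete, the following is a minimal Python/NumPy sketch of the template construction described above. The mfcc_func helper, the default pre-emphasis coefficient, and the reading of "DC removal" as per-column mean removal are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    # First-order difference filter y[n] = x[n] - a*x[n-1]; the value of a is an assumed default.
    return np.append(x[0], x[1:] - a * x[:-1])

def cepstrogram(frame, sr, mfcc_func, win=0.010, hop=0.005):
    """Divide a frame into 10-ms subframes (5-ms hop), pre-emphasize each,
    and stack their MFCC vectors as the columns of the cepstrogram matrix."""
    n_win, n_hop = int(win * sr), int(hop * sr)
    cols = []
    for start in range(0, len(frame) - n_win + 1, n_hop):
        sub = pre_emphasis(frame[start:start + n_win].astype(float))
        cols.append(mfcc_func(sub, sr))           # one MFCC vector per subframe (hypothetical helper)
    return np.column_stack(cols)                  # shape: coefficients x subframes

def build_template(examples, sr, mfcc_func):
    """Build the breath template: mean and variance cepstrograms plus the
    leading singular vector of the concatenated (DC-removed) cepstrograms."""
    mats = []
    for ex in examples:                           # all examples trimmed to the same frame length
        M = cepstrogram(ex, sr, mfcc_func)
        M = M - M.mean(axis=0, keepdims=True)     # DC removal per column (one interpretation of step 4)
        mats.append(M)
    stack = np.stack(mats)                        # examples x coefficients x subframes
    T = stack.mean(axis=0)                        # template matrix, eq. (1)
    V = stack.var(axis=0) + 1e-12                 # variance matrix (small floor avoids division by zero)
    concat = np.concatenate(mats, axis=1)         # concatenate along the time axis
    U, _, _ = np.linalg.svd(concat, full_matrices=False)
    u = U[:, 0]                                   # singular vector of the largest singular value
    return T, V, u
```

A template built this way is reused unchanged across detection runs, which matches the offline-training/online-detection split described in Section II.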

B. Detection Phase

The input for the detection algorithm is an audio signal (a monophonic recording of either speech or song, with no background music), sampled at 44 kHz. The signal is divided into consecutive analysis frames (with a hop size of 10 ms). For each frame, the following parameters are computed: the cepstrogram (MFCC matrix, see Fig. 4), short-time energy, zero-crossing rate, and spectral slope (see below); a code sketch of these computations is given after the list. Each of these is computed over a window located around the center of the frame. A graphical plot showing the waveform of a processed signal as well as some of the parameters computed is shown in Fig. 5.

1) The MFCC matrix is computed as in the template generation process (see previous section). For this purpose, the length of the MFCC analysis window used for the detection phase must match the length of the frame derived from each breath example in the training phase.

2) The short-time energy is computed according to the following:

E = \sum_{n=0}^{N-1} x^2[n]    (2)

where x[n] is the sampled audio signal, and N is the window length in samples (corresponding to 10 ms). It is then converted to a logarithmic scale:

E_dB = 10 \log_{10} E    (3)

3) The zero-crossing rate (ZCR) is defined as the number of times the audio waveform changes its sign, normalized by the window length N in samples (corresponding to 10 ms):

ZCR = \frac{1}{2N} \sum_{n=1}^{N-1} | sgn(x[n]) - sgn(x[n-1]) |    (4)

4) The spectral slope is computed by taking the discrete Fourier transform of the analysis window, evaluating its magnitude at frequencies of f_s/4 and f_s/2 (corresponding here to 11 and 22 kHz, respectively), and computing the slope of the straight-line fit between these two points. It is known [29] that in voiced speech most of the spectral energy is contained in the lower frequencies (below 4 kHz). Therefore, in voiced speech, the spectrum is expected to be rather flat between 11 and 22 kHz. In periods of silence, the waveform is close to random, which also leads to a relatively flat spectrum throughout the entire band. This suggests that the spectral slope in voiced/silence parts would yield low values, when measured as described previously. On the other hand, in breath sounds, like in most unvoiced phonemes, there is still a significant amount of energy in the middle frequency band (10–15 kHz) and relatively low energy in the high band (22 kHz). Thus, the spectral slope is expected to be steeper, and could be used to differentiate between voiced/silence and unvoiced/breath. As such, the spectral slope is used here as an additional parameter for identifying the edges of the breath (see Section III).

C. Computation of the Breath Similarity Measure

Fig. 5. A: Original signal, along with the V/UV/S marks (stair graph), and detected breath events (short bars). B: Reciprocal of the breath similarity function (top), the threshold (middle line), and the energy function (bottom). C: Short-time cepstrogram plot for the signal in A. The vertical axis represents the MFC coefficients, and the horizontal axis is the time axis. The color represents the magnitude of the cepstral coefficients using a "hot" scale.

Fig. 6. Schematic block diagram describing the calculation of the two breath similarity measures C_1 and C_2 and their product as the final breath similarity measure.

Once the aforementioned parameters are computed for a given frame, its short-time cepstrogram (MFCC matrix) M_f is used for calculating its breath similarity measure. The similarity measure is computed from the cepstrogram of the frame, M_f, the template cepstrogram T, the variance matrix, and the singular vector u. The steps of the computation are as follows (Fig. 6); a code sketch is given after the list.

1) The normalized difference matrix D is computed by subtracting the template cepstrogram from the frame cepstrogram and dividing element-by-element by the variance matrix. The normalization by the variance matrix is performed in order to compensate for the differences in the distributions of the various cepstral coefficients.

2) The difference matrix is liftered by multiplying each column with a half-Hamming window that emphasizes the lower cepstral coefficients. It has been found in preliminary experiments that this procedure yields better separation between breath sounds and other sounds (see also [2]).

3) A first similarity measure C_1 is computed by taking the inverse of the sum of squares of all elements of the normalized difference matrix, according to the following equation:

C_1 = \frac{1}{ \sum_{i=1}^{P} \sum_{j=1}^{K} \tilde{D}^2(i, j) }    (5)

where K is the number of subframes, P is the number of MFC coefficients computed for each subframe, and \tilde{D} is the liftered normalized difference matrix. When the cepstrogram is very similar to the template, the elements of the difference matrix should be small, leading to a high value of this similarity measure. When the frame contains a signal which is very different from breath, the measure is expected to yield small values. This template matching procedure with a scaled Euclidean distance is essentially a special case of a two-class Gaussian classifier [30] with a diagonal covariance matrix. This is due to the computation of the MFCC, which involves a discrete cosine transform as its last step [20], known for its tendency to decorrelate the mel-scale filter log-energies [22].

4) A second similarity measure C_2 is computed by taking the sum of the inner products between the singular vector u (see Section II-A) and the normalized columns of the cepstrogram. Since the singular vector is assumed to capture the important characteristics of breath sounds, these inner products (and, therefore, C_2) are expected to be small when the frame contains information from other phonemes.

5) The final breath similarity measure is defined as the product of the two measures, C_1 · C_2. It was found experimentally that this combination of similarity measures yields better separation between breath and nonbreath than using just the difference matrix or the singular vector.
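A sketch of the similarity computation, assuming the cepstrogram M, template T, variance matrix V, and singular vector u are arranged coefficients-by-subframes as in the earlier sketch. The construction of the half-Hamming lifter is one plausible reading of step 2, not a detail confirmed by the paper.

```python
import numpy as np

def breath_similarity(M, T, V, u):
    """Breath similarity of one frame's cepstrogram M against the template (T, V)
    and singular vector u, following the C1 * C2 scheme described above."""
    P, K = M.shape                               # coefficients x subframes
    D = (M - T) / V                              # normalized difference matrix
    lifter = np.hamming(2 * P)[P:]               # descending half-Hamming: heaviest weight on low coefficients
    D = D * lifter[:, None]                      # lifter each column
    C1 = 1.0 / np.sum(D ** 2)                    # eq. (5)
    cols = M / (np.linalg.norm(M, axis=0, keepdims=True) + 1e-12)
    C2 = np.sum(u @ cols)                        # sum of inner products with the singular vector
    return C1 * C2
```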

The breath detection involves a two-step decision. The initial decision treats each frame independently of other frames and classifies each as breathy/nonbreathy based on its similarity measure (computed as explained above), energy, and zero-crossing rate. A frame is initially classified as breathy if all three of the following conditions hold (a sketch of this test in code is given below).

1) The breath similarity measure is above a given threshold. This threshold is initially set in the learning phase, during the template construction, when the breath similarity measure is computed for each of the examples. The minimum value of the similarity measures between each of the examples and the template is determined, denoted by S_min. The threshold is set to a fixed fraction of S_min. The logic behind this setting is that the frame-to-template similarity of breath sounds in general is expected to be somewhat lower than the similarity among examples used to construct the template in the first place.

2) The energy is below a given threshold, which is chosen to be below the average energy of voiced speech (see Section III-A).

3) The zero-crossing rate is below a given threshold. Experimental data have shown that a ZCR above 0.25 (assuming a sampling rate of 44 kHz) is exhibited only by a number of unvoiced fricatives, and breath sounds have a much lower ZCR (see Section III-A).

Following the initial detection, a binary breathiness index is assigned to each frame: breathy frames are assigned index 1, whereas nonbreathy frames are assigned index 0.
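The initial per-frame decision is then a simple conjunction of the three tests. The threshold arguments below are placeholders to be set as described above (the similarity threshold derived from the training examples, an energy threshold below voiced-speech energy, and a ZCR threshold of about 0.25).

```python
def initial_breathiness(frame_features, sim_threshold, energy_threshold_db, zcr_threshold=0.25):
    """Assign a binary breathiness index per frame: 1 if the breath similarity is above
    its threshold while energy and ZCR stay below theirs, else 0.
    frame_features is an iterable of (similarity, energy_db, zcr) tuples."""
    index = []
    for sim, energy_db, zcr in frame_features:
        is_breathy = (sim > sim_threshold and
                      energy_db < energy_threshold_db and
                      zcr < zcr_threshold)
        index.append(1 if is_breathy else 0)
    return index
```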

Whenever, at some point of the processing phase, there is a frame or a batch of frames classified as breathy, all the frames in their vicinity are examined more carefully, in order to reject possible false detections and identify the edges of the breath accurately. This constitutes the second step in the decision. When the edges are identified, all frames between them are classified as breathy, and all other frames in their vicinity are classified as nonbreathy. Whenever an entire section is identified as a false alarm in the second decision step, all the frames within this section are classified as nonbreathy. The exact procedure by which edge searching and false detection elimination are performed is described in the following section.

Frames that are marked as breathy can be further manipulated on the output stream, for example by being attenuated or emphasized.

III. EDGE DETECTION AND FALSE ALARM ELIMINATION

One of the purposes of applying breath detection in high-end recording studios is to produce voices in which the breaths are not audible. Although the algorithm described in the previous section detects breath events with very high sensitivity (i.e., very few false negatives), it is still somewhat susceptible to false positive detections. Furthermore, its time resolution is not always sufficient to detect the beginning and end of the breaths accurately. In some applications, this can lead to incomplete removal of breath sounds on the one hand, and to partial removal of nearby phonemes on the other hand. Such inaccuracies, as well as the presence of false detections, can introduce audible artifacts that degrade speech quality.

In this section, we will present two alternative algorithms designed to address both problems: the problem of false positives and the problem of accurate edge detection. The algorithms use various features of the sound waveform, such as energy and zero-crossing rate. In the general case, these features on their own cannot separate between breath sounds and speech phonemes. Therefore, the edge detection algorithms are used only as a refinement and invoked only when the initial decision step identifies some frames as breathy (see Section II-B). These algorithms have two possible outcomes: either the edges of the breath are accurately marked, or the entire event is rejected as a false detection.

Fig. 7. Probability density function curves of short-time energy (in decibels) of breath events (dashed line) in comparison with that of voiced speech (solid line), measured from voices of several different speakers, using nonoverlapping windows of 10 ms.

Either of the two algorithms presented can be used for the task, and both have been shown to perform well. We present them both to provide deeper insight into the problem and the possible solutions. Comparative performance results of the two algorithms are described in Section IV.

A. General Approach to False Detection Elimination

In both of the edge detection algorithms presented here, similar criteria are used for rejection of false positives; a combined check is sketched in code after the list.

1) Preliminary Duration Threshold: A breath event is expected to yield a significant peak in the contour of the breath similarity measure function, i.e., a considerable number of frames that rise above the similarity threshold. If the number of such frames is too low, it is likely to be a false detection.

2) Upper Energy Threshold: Typically, the local energy within a breath epoch is much lower than that of voiced speech and somewhat lower than most of the unvoiced speech (see Fig. 7). Hence, if some frames in the detected epoch, after edge marking, exceed a predefined energy threshold, the section should be rejected.

3) Lower ZCR Threshold: Since most of the breaths are unvoiced sounds, the ZCR during a breath event is expected to be in the same range as that of most unvoiced phonemes, i.e., higher than that of voiced phonemes. This was empirically verified. Therefore, if the ZCR throughout the entire marked section is beneath a ZCR lower threshold, it will be rejected as a probable voiced phoneme.

4) Upper ZCR Threshold: It is known that certain unvoiced phonemes, such as fricative consonants, can exhibit a high ZCR (0.3–0.4 given a 44-kHz sampling frequency). Preliminary experiments showed that the maximum ZCR of breath sounds is considerably lower (see Fig. 8). An upper threshold on the ZCR can thus prevent false detections of certain fricatives as breaths.

Fig. 8. Probability density function curves of ZCR of breath events (dashed line) in comparison with that of /s/ fricatives (solid line), measured from voices of several different speakers, using nonoverlapping windows of 10 ms.

5) Final Duration Threshold: A breath sound is typically longer than 100 ms. Therefore, if the detected breath event, after accurate edge marking, is shorter than that duration, it should be rejected. In practice, in order to account for very short breaths as well, the duration threshold may be set more permissively.
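A combined rejection check over the five criteria might look like the sketch below; every numeric default is an illustrative placeholder, not a preset reported in the paper.

```python
import numpy as np

def reject_candidate(energies_db, zcrs, n_frames_above_sim,
                     min_sim_frames=3, peak_energy_db=-30.0,
                     zcr_low=0.05, zcr_high=0.30, min_duration_frames=10):
    """Return True if a detected section should be rejected as a false positive.
    energies_db and zcrs are per-frame values over the marked section."""
    energies_db, zcrs = np.asarray(energies_db), np.asarray(zcrs)
    if n_frames_above_sim < min_sim_frames:       # 1) too few frames above the similarity threshold
        return True
    if np.max(energies_db) > peak_energy_db:      # 2) energy too high for a breath
        return True
    if np.max(zcrs) < zcr_low:                    # 3) ZCR low throughout -> probably a voiced phoneme
        return True
    if np.max(zcrs) > zcr_high:                   # 4) ZCR too high -> probably a fricative
        return True
    if len(energies_db) < min_duration_frames:    # 5) marked section shorter than a typical breath
        return True
    return False
```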

B. Edge Detection

The basic approach to accurate detection of the edges of the breath is based on the principles described in Section I-A. A genuine breath event is expected to exhibit a peak in the local energy function, accompanied by two noticeable deeps on each side of the peak, indicating periods of silence. Therefore, the task of the edge detection algorithm is to locate the peak and the two deeps near it. The edges of the breath can then be set where these deeps have occurred.

In practice, the energy function may not be smooth enough, and the algorithm will have to cope with spurious peaks and deeps. The key difference between the two algorithms presented below is the approach used to deal with them.

1) Edge Marking Using Double Energy Threshold and Deep Picking: One of the principles established in Section III-A is that in most cases, the local energy of breath sounds does not exceed a certain upper threshold, even at its highest point (referred to here as "the peak of the breath"). We denote this upper threshold as the "peak threshold" T_p. However, it is also expected that the edges of the breath will exhibit considerably lower energy, akin to that of silence (see Section I-A). It is reasonable to claim that if there are no frames in the detected event whose energy is lower than some silence threshold energy (denoted as the "edge threshold" T_e), the event is likely to be a false detection. Consequently, a section of the signal is to be marked as a breath event only if its peak energy does not exceed the peak threshold T_p, and its edge energy does not exceed the edge threshold T_e. The introduction of the edge energy threshold improves the algorithm's robustness against potential false positive detections (Fig. 9).

An advantage of this double-threshold algorithm is that it makes the actual edge marking very simple. It works as follows (Fig. 10). Let n be the running index of the frame, representing its location along the time axis, and let E(n) represent the short-time energy of frame n. Let n_p be the frame with the highest energy in the section in question (the peak of the breath). Let n_l and n_r be the first frames where the energy falls below T_e, to the left and to the right of n_p, respectively. Let n_L and n_R be the first frames to the left of n_l and to the right of n_r, respectively, where the energy rises again above T_e. Then, the entire sections between n_L and n_l, and between n_r and n_R, are close to silence. The edges of the breath are defined as the centers of the frames with the lowest energy in their respective silence sections, i.e., if we denote the left and right edges by e_l and e_r, then

e_l = argmin_{n_L <= n <= n_l} E(n),   e_r = argmin_{n_r <= n <= n_R} E(n)    (6)

Fig. 9. Energy contour of three potential candidates for being marked as breath events. The top threshold is the peak threshold, the bottom threshold is the edge threshold. (A) will be marked as breath, since its peak energy is below the peak threshold, and both its deeps are below the edge threshold. (B) will be rejected, because its edge energy exceeds the edge threshold. (C) will be rejected, because its peak energy exceeds the peak threshold and it is probably produced by a voiced phoneme.

Fig. 10. Peak of the breath and the corresponding edges, located at the deeps of the energy contour.

A schematic block diagram of the double threshold edge detection procedure is depicted in Fig. 11.
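A sketch of the double-threshold edge marking of (6): it operates on a per-frame energy contour and returns either the two edge frames or None when the peak/edge criteria fail. The handling of the ends of the contour is a simplification.

```python
import numpy as np

def double_threshold_edges(E, peak_thr, edge_thr):
    """Mark breath edges from a frame-energy contour E (one value per frame, e.g. in dB).
    Returns (left_edge, right_edge) frame indices, or None if the section is rejected."""
    E = np.asarray(E, dtype=float)
    n_p = int(np.argmax(E))
    if E[n_p] > peak_thr:                      # peak energy too high: probably voiced, reject
        return None
    below = E < edge_thr

    # first frames below the edge threshold on each side of the peak (n_l, n_r)
    left = [n for n in range(n_p, -1, -1) if below[n]]
    right = [n for n in range(n_p, len(E)) if below[n]]
    if not left or not right:                  # no silence found on one side: reject
        return None
    n_l, n_r = left[0], right[0]

    # extend outward while the energy stays below the edge threshold (silence sections)
    n_L = n_l
    while n_L > 0 and below[n_L - 1]:
        n_L -= 1
    n_R = n_r
    while n_R < len(E) - 1 and below[n_R + 1]:
        n_R += 1

    # eq. (6): edges at the lowest-energy frames of the two silence sections
    e_l = n_L + int(np.argmin(E[n_L:n_l + 1]))
    e_r = n_r + int(np.argmin(E[n_r:n_R + 1]))
    return e_l, e_r
```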

A simpler approach is to consider all peaks and deeps in the section in question and, for each peak X_p, mark the most likely left and right edges, according to the following (Fig. 12; see the code sketch below).

1) Let X_l and X_r be the nearest deeps to the left and to the right of X_p, respectively.

2) Let X'_l be the nearest deep to the left of X_l. If E(X'_l) < E(X_l), then set X_l = X'_l. Repeat until E(X'_l) >= E(X_l) or until there are no more deeps to the left.

3) Let X'_r be the nearest deep to the right of X_r. If E(X'_r) < E(X_r), then set X_r = X'_r. Repeat until E(X'_r) >= E(X_r) or until there are no more deeps to the right.

4) Output the pair (X_l, X_r).

Fig. 11. (a) Schematic block diagram of the double threshold algorithm for edge detection. (b) Detailed description of the double-threshold edge detection procedure.

This procedure results in a set of pairs (X_l, X_r), each of which indicates a section, starting and ending with energy deeps. Each of these sections is examined to see if it violates any of the conditions mentioned in Section III-A. Among the remaining sections, we join those that overlap or share a common edge to ensure that if the breath was divided into several sections due to spurious peaks/deeps in the energy contour, it will be joined into one.
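The peak-and-deep marking procedure can be sketched as follows; local extrema are found with a plain NumPy comparison, which is an implementation choice rather than a detail from the paper.

```python
import numpy as np

def local_extrema(E):
    """Indices of local minima (deeps) and maxima (peaks) of an energy contour."""
    E = np.asarray(E, dtype=float)
    idx = np.arange(1, len(E) - 1)
    deeps = idx[(E[idx] < E[idx - 1]) & (E[idx] < E[idx + 1])]
    peaks = idx[(E[idx] > E[idx - 1]) & (E[idx] > E[idx + 1])]
    return deeps, peaks

def deep_picking_edges(E, X_p, deeps):
    """For one energy peak X_p, walk outward over the deeps, always moving to a
    lower neighboring deep, and return the candidate edge pair (X_l, X_r)."""
    E = np.asarray(E, dtype=float)
    left = [d for d in deeps if d < X_p]
    right = [d for d in deeps if d > X_p]
    if not left or not right:
        return None

    i = len(left) - 1                              # nearest deep on the left
    while i > 0 and E[left[i - 1]] < E[left[i]]:   # step further left while the next deep is lower
        i -= 1
    X_l = left[i]

    j = 0                                          # nearest deep on the right
    while j < len(right) - 1 and E[right[j + 1]] < E[right[j]]:
        j += 1
    X_r = right[j]
    return X_l, X_r
```

Each returned pair is then screened against the rejection criteria of Section III-A, and overlapping or edge-sharing pairs are merged, as described above.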

Fig. 12. Deep picking procedure for edge searching: the curve is the energy contour. Big dots show the peak X_p and the final values of (X_l, X_r). Both X_l and X_r are lower than their neighboring deeps, which are spurious deeps (smaller dots).

A mechanism that resembles the method presented here, with two energy thresholds and zero-crossing rate, was used in [31] for edge detection of words, in order to include all the phonetic components for speech recognition. However, due to the different purpose, the actual algorithm is not the same. For example, in the latter, the thresholds were used strictly for indication of the boundaries, while in our algorithm, their primary use is to avoid false detection of speech.

2) Edge Marking With Spurious Deep Elimination: A drawback of the previous procedure, which considers every peak and deep, is that a very noisy energy function can lead to incorrect detection of the edges. The algorithm described in this section attempts to deal with this problem by eliminating spurious deeps. The edge marking relies more on the original breathiness index (see Section II-C), namely, it searches for the edges of the breath inside the section that was originally marked as breathy, or in close vicinity to it.

To reduce the effect of possible false detections, the binary vector of breathiness indices is first smoothed with a nine-point median filter. The smoothed contour is expected to contain a block of successive binary ones (presumably a breath epoch) amid a sequence of zeros. On rare occasions, it may contain two blocks of ones (in case of a false detection or two breath sounds being very close to each other). On such occasions, each block is treated separately. The block of ones indicates the approximate location of the breath, and the algorithm will look for the exact edges in the vicinity of this block.

Let us denote the first frame index (representing its location along the time axis) of the block of ones as n_s and its last frame index as n_e. For simplicity, we shall refer to the section [n_s, n_e] as the "candidate section."

The edge search is conducted by examining the section's energy contour and looking for deeps (local minima). Because of the high resolution of the energy tracking and the relatively short time windows (10 ms, see Section II-B), there are likely to be spurious deeps in the energy contour. To reduce the number of such deeps, the energy contour is prefiltered with a three-point running-average filter. The smoothing may not eliminate all the spurious minima. Therefore, after prefiltering, the remaining deeps are divided into significant and insignificant. A given frame X_d is defined as a significant deep if

max( E(X_pl), E(X_pr) ) - E(X_d) > 0.25 ( E_max - E_min )    (7)

where X_pl and X_pr are the two energy peaks closest to X_d on the left and right sides, respectively, and E_max and E_min are the global maximum and minimum of the energy contour function in some predefined vicinity of the candidate section. In other words, X_d is considered significant if at least one of the energy values of X_pl and X_pr exceeds it by more than 25% of the dynamic range of the energy.
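The prefiltering and the significance test of (7) can be sketched as follows. Taking E_max and E_min over the candidate section's vicinity is left to the caller, since the exact extent of that vicinity is not specified here.

```python
import numpy as np

def prefilter_energy(E):
    # Three-point running average to smooth the energy contour before deep picking.
    E = np.asarray(E, dtype=float)
    return np.convolve(E, np.ones(3) / 3.0, mode="same")

def is_significant_deep(E, d, peaks, E_max, E_min):
    """Eq. (7): a deep d is significant if at least one of the nearest energy peaks
    on either side exceeds it by more than 25% of the dynamic range E_max - E_min."""
    left = [p for p in peaks if p < d]
    right = [p for p in peaks if p > d]
    if not left or not right:
        return False
    nearest_peak_energy = max(E[left[-1]], E[right[0]])
    return nearest_peak_energy - E[d] > 0.25 * (E_max - E_min)
```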

TABLE I
RULES FOR DEEP PICKING PROCEDURE FOR EDGE MARKING

Fig. 13. Block diagram showing the various steps of the edge detection algorithm with spurious deep elimination.

Having eliminated all the insignificant energy deeps, the algorithm checks which of the remaining significant deeps fall inside the candidate section [n_s, n_e], and acts according to the rules in Table I.

Finally, the section between the marked edges is checked once more to verify that it does not violate any of the conditions established in Section III-A, and if it does, it is rejected.

A block diagram showing the algorithm step-by-step is given in Fig. 13.

A variation of this algorithm uses spectral slope information, in addition to energy information, to search for edges. It is based on the observation that there are usually significant differences in the steepness of the spectral slope between breath sounds and silence/voiced speech (see Section II-B). The spectral slope is expected to be steep near the middle of the breath and flat at the edges, suggesting that edges can be detected by applying similar deep-picking to the spectral slope. Thus, the initial edge marking is based on the spectral slope deeps and later refined with the energy deeps.

Using this algorithm for edge detection yielded very accurate results, as shown in the following section. Although the previous algorithm, which uses the double energy threshold, was found to be even less prone to errors, and could achieve almost perfect results, it required a considerable amount of fine-tuning to obtain them.

TABLE II
PERFORMANCE EVALUATION WITH A SINGLE TEMPLATE FOR ALL EXAMPLES (M—MALE, F—FEMALE)

IV. RESULTS AND EVALUATION

For the evaluation of the algorithm, we first constructed several breath templates and compared their performance. The templates were constructed from isolated breath events, derived from the voices of 14 singers (eight male and six female). Each template was generated using the breath signals of one voice. Mixing breaths from several voices was also attempted, but yielded no performance gain, and therefore was not used in the final evaluation.

The test was carried out using 24 voices of professional singers and narrators, including both songs (a cappella, 22 recordings) and normal speech (two recordings). All the voices were sampled and digitized with a sampling frequency of 44 kHz. The total duration of the recordings was about 24 min and contained more than 330 breath events. In all the tests, the voices used for constructing the templates or for measuring the distributions of the various classification parameters (see Section III-A) were excluded from the evaluation set.

TABLE III
PERFORMANCE EVALUATION WITH SEPARATE MALE (M)/FEMALE (F) TEMPLATES

In order to evaluate the performance of the system, the breaths were hand-marked by two labelers. In addition, the results were confirmed by listening to the original passages, the processed passages with breaths suppressed, and the detected breaths only. This ensures both the detection of the breath and the accurate marking of its edges.

The results of one template (constructed using seven breath events of one male singer) are presented in Table II. As can be seen, the sensitivity (the number of correct detections divided by the total number of events present) of the algorithm using this template is 94.4% (304 correct detections out of 322 breath events), and its specificity (the number of correct detections divided by the total number of detections) is 96.3%. Better results were achieved when a combination of two templates was used, each for its corresponding gender (Table III). The table shows a sensitivity of 97.6% and a specificity of 95.7% over 332 breath events (324 correct detections, of which 16 were detected only partially, and 15 false detections).

TABLE IV
COMPARISON BETWEEN THE TWO EDGE DETECTION ALGORITHMS

TABLE V
CONTRIBUTION OF THE EDGE DETECTION ALGORITHM TO THE SYSTEM PERFORMANCE

Comparable results were achieved in previous studies, where breath detection was used as part of an automatic system for labeling prosody patterns. In [32], a Bayesian classifier based on cepstral coefficients achieved detection rates of 73.2% to 91.3%, depending on whether the training set included passages from the same speech material as the test set or not. In [19], an extension of the latter system achieved a sensitivity of 100% and a specificity of 97% on speech corpora that included 364 breath events, but the accurate detection of the exact location of the breaths was not addressed by the authors. In an earlier paper [13], the authors mention that the breaths labeled by the algorithm are within 50 ms of those labeled by hand 95% of the time. While such accuracy can be acceptable for a prosody labeling application, it may be insufficient for other applications of breath detection mentioned in Section I.

In the results we report here, "correct detection" means that the breath was detected and marked accurately and completely (i.e., the processed voice contained no audible traces of the breath). "False detection" includes both cases when the marked breath boundaries encroached into an adjacent speech segment and cases in which a breath was detected where it was not present (both of these were spotted by listening to the detected events, and by carefully examining the boundary marks). As such, it can be seen (Table III) that the breaths were detected with complete accuracy 94% of the time. Partial detections constitute only 5% of the total detections.

The above results were achieved using the edge detection algorithm with spurious deep elimination (see Section III-B). This algorithm was chosen because it requires less presetting to achieve its optimal performance. In Table IV, we provide a comparison between the results achieved with that algorithm and those achieved with the double energy threshold algorithm on a small selection of voices from four different speakers containing a total of 61 breaths. In both cases, the same template was used, constructed from the breaths of another speaker. As can be seen, the performance is similar, although the double energy threshold algorithm is slightly better.

The usefulness of the edge detection algorithm is demonstrated by testing the system twice on a small set of voices from different speakers, once with the edge detection enabled and once without it. The results are displayed in Table V. It can be seen that the edge detection algorithm contributes greatly by avoiding both partial detections and false positives.

V. REAL-TIME IMPLEMENTATION OF THE ALGORITHM

Implementation in a real-time environment puts very strict demands on the processing speed of the application and the latency required by it. Any real-time algorithm must be able to process data faster than the rate of new data arrival.

Experimental tests have shown that the breath detection algorithm presented here is able to meet these demands, given the current state of hardware and software.

The processing speed depends largely on the choice of length for the analysis windows and the hop size. Smaller hop size means that more frames need to be analyzed for a given time period, increasing the processing time. Similarly, longer analysis windows also slow down the operation, because more data need to be processed per single frame.

Table VI shows the processing time as a function of the different window lengths, for a given signal. It can be seen that in all cases the processing time is well below the signal length, as required by real-time applications. Considering the fact that practical applications are coded using methods which are known to be far more efficient than those of MATLAB, the algorithm's speed was found sufficient for real-time implementation.

Another important factor for real-time implementation is the latency of the algorithm, i.e., the minimum required delay between the time data is received for processing and the time the processed data is ready for output. This latency must be bounded, which means that the processing must be localized—analysis of a given frame must be completed before the entire audio sequence is available.

TABLE VI
PROCESSING TIME OF THE ALGORITHM

In our implementation, the classification of the frame as breathy or not is based only on the parameters of the frame itself and of frames in its vicinity. The latency, therefore, depends on the number of frames that are examined by the edge detection algorithm for breath boundary search.

Our experiments have shown that accurate results can be achieved when the look-back and look-ahead periods for boundary search are around 200–400 ms (in each direction). The required delay in this case will be of the order of 400–800 ms. Modern audio processing applications, both hardware-based and software-based, are equipped with memory buffers that can easily store such an amount of audio data, thus making the algorithm usable in real time. Indeed, the breath detection algorithm presented here was implemented as a real-time plug-in for an audio production environment (http://www.waves.com/content.asp?id=1749).

VI. CONCLUSION

In this paper, we presented an algorithm for automatic detection and demarcation of breath sounds in speech and song signals. The algorithm is based on a template-matching frame-based classifier using MFCC for the coarse detection and on a set of additional parameters in a multistage algorithm for the refinement of the detection. Tested on a selection of voices from different speakers, containing several hundred breath sounds, the algorithm showed very high sensitivity and specificity, as well as very accurate location of the breath edges (see Section IV).

This level of performance cannot be achieved by either of the two components independently, because the frame-based classifier cannot provide sufficient accuracy in edge detection, while the edge detection algorithm uses features that on their own cannot reliably distinguish between breath and nonbreath sounds (see Section III). It is the combination of the two that provides the necessary accuracy and robustness (see Table V).

Although the current paper describes an algorithm for the detection of breath signals, a slightly modified version of this algorithm may be used for the detection of other sounds, such as certain phonemes. In preliminary experiments, it was shown to yield high-quality results in the detection of fricatives, such as /s/ and /z/, thus proving the feasibility of the general scheme for the broader task of event spotting.

Our approach somewhat resembles that of the system described in [16] for suppressing pauses during the playback of audio recordings for transcription. The latter also uses a two-stage detection strategy, with a speech recognizer for the first stage and additional time-based rules for the second. The algorithm presented here is simpler, compared to a full-fledged speech recognizer; it is very efficient, with low computational complexity (suitable for real-time processing), and with a short and simple procedure for presetting the system. These advantages make it a better choice for the task of breath detection in applications where no speech recognition is required (see Section I).

The breath detection algorithm presented here was implemented as a real-time plug-in for an audio production environment.

A limitation of the algorithm in its current form is that it performs well on monophonic audio signals (pure speech), but may not be suitable for polyphonic signals containing music or other background sounds. This limitation will be addressed in future work. Additional directions of future research include testing different classifiers for the problem and extending the algorithm to the detection of other events in audio signals.

ACKNOWLEDGMENT

The authors would like to thank I. Neoran, Director of R&D of Waves Audio, for valuable discussions and ideas. The authors would also like to thank G. Speier for valuable ideas, M. Shaashua for helpful comments, and Y. Yakir for technical assistance. The authors are grateful to the anonymous reviewers for valuable comments and suggestions.

REFERENCES

[1] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, 1998.

[2] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[3] J. T. Foote, "An overview of audio information retrieval," Multimedia Syst., vol. 7, pp. 2–10, 1999.

[4] E. Brazil, M. Fernstrom, G. Tzanetakis, and P. Cook, "Enhancing sonic browsing using audio information retrieval," presented at the Int. Conf. Auditory Display (ICAD), Kyoto, Japan, 2002, unpublished.

[5] G. Tzanetakis and P. Cook, "Audio Information Retrieval (AIR) tools," presented at the Int. Symp. Music Information Retrieval (ISMIR), Plymouth, MA, 2000.

[6] G. H. Li, D. F. Wu, and J. Zhang, "Concept framework for audio information retrieval: ARF," J. Comput. Sci. Technol., vol. 18, pp. 667–673, 2003.

[7] C. Spevak and E. Favreau, "Soundspotter—a prototype system for content based audio retrieval," presented at the Int. Conf. Digital Audio Effects (DAFx-02), Hamburg, Germany, 2002.

[8] P. Gelin and C. J. Wellekens, "Keyword spotting for video soundtrack indexing," presented at the IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP-96), 1996.

[9] H. Sawai, A. Waibel, M. Miyatake, and K. Shikano, "Spotting Japanese CV-syllables and phonemes using the time-delay neural networks," presented at the Int. Conf. Acoust., Speech, Signal Process. (ICASSP-89), Glasgow, U.K., 1989.

[10] D. Bauer, A. Plinge, and M. Finke, "Selective phoneme spotting for realisation of an /s, z, C, t/ transpose," in Lecture Notes in Computer Science—ICHHP 2002. Linz, Austria: Springer, 2002, vol. 2398.

[11] A. Plinge and D. Bauer, "Introducing restoration of selectivity in hearing instrument design through phoneme spotting," in Assistive Technology: Shaping the Future, ser. Assistive Technology Research Series, G. M. Craddock, L. P. McCormack, R. B. Reilly, and H. Knops, Eds. Amsterdam, The Netherlands: IOS Press, 2003, vol. 11.

[12] L. Kennedy and D. Ellis, "Laughter detection in meetings," presented at the NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, QC, Canada, 2004.


[13] P. J. Price, M. Ostendorf, and C. W. Wightman, "Prosody and parsing," presented at the DARPA Workshop on Speech and Natural Language, Cape Cod, MA, 1989.

[14] M. Covell, M. Withgott, and M. Slaney, "Mach1: Nonuniform time-scale modification of speech," presented at the IEEE ICASSP-98, Seattle, WA, 1998.

[15] L. Kennedy and D. Ellis, "Pitch-based emphasis detection for characterization of meeting recordings," presented at the Automatic Speech Recognition Understanding Workshop (IEEE ASRU 2003), St. Thomas, VI, 2003.

[16] C. W. Wightman and J. Bachenko, "Speech-recognition-assisted selective suppression of silent and filled speech pauses during playback of an audio recording," U.S. Patent 6 161 087, Dec. 12, 2000.

[17] P. A. A. Esquef, M. Karjalainen, and V. Välimäki, "Detection of clicks in audio signals using warped linear prediction," presented at the 14th IEEE Int. Conf. Digital Signal Process. (DSP-02), Santorini, Greece, 2002.

[18] J. Butzberger, H. Murveit, E. Shriberg, and P. Price, "Spontaneous speech effects in large vocabulary speech recognition applications," presented at the Workshop on Speech and Natural Language, Harriman, New York, 1992.

[19] C. W. Wightman and M. Ostendorf, "Automatic labeling of prosodic patterns," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 469–481, Oct. 1994.

[20] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-28, no. 4, pp. 357–366, Aug. 1980.

[21] R. Mammone, X. Zhang, and R. Ramachandran, "Robust speaker recognition—A feature-based approach," IEEE Signal Process. Mag., vol. 13, no. 5, pp. 58–71, Sep. 1996.

[22] T. F. Quatieri, Discrete-Time Speech Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 2001.

[23] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, U.K.: Cambridge Univ. Press, 2000.

[24] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Commun., vol. 17, pp. 91–108, 1995.

[25] C. J. C. Burges, "Simplified support vector decision rules," presented at the 13th Int. Conf. Machine Learning, Bari, Italy, 1996.

[26] R. Cai, L. Lu, H. J. Zhang, and L. H. Cai, "Highlight sound effects detection in audio stream," presented at the 4th IEEE Int. Conf. Multimedia and Expo, Baltimore, MD, 2003.

[27] M. Spina and V. Zue, "Automatic transcription of general audio data: preliminary analyses," presented at the Int. Conf. Spoken Lang. Process., Philadelphia, PA, 1996.

[28] S. Theodoridis and K. Koutroumbas, Pattern Recognition. London, U.K.: Academic, 1999.

[29] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals. Englewood Cliffs, NJ: Prentice-Hall, 1978.

[30] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: Wiley, 2001.

[31] L. R. Rabiner and M. R. Sambur, "An algorithm for determining the endpoints of isolated utterances," Bell Syst. Tech. J., vol. 54, pp. 297–315, 1975.

[32] C. W. Wightman and M. Ostendorf, "Automatic recognition of prosodic phrases," presented at the IEEE Int. Conf. Acoust., Speech, Signal Process., Toronto, ON, Canada, 1991.

Dima Ruinskiy (S'06) received the B.Sc. degree in computer science, specializing in signal processing, from Tel-Hai Academic College, Upper Galilee, Israel. He is currently pursuing the M.Sc. degree in computer science and applied mathematics at the Feinberg Graduate School, Weizmann Institute of Science, Rehovot, Israel, specializing in cryptanalysis of public key protocols.

Yizhar Lavner (M'01) received the Ph.D. degree from The Technion—Israel Institute of Technology, Haifa, in 1997.

He has been with the Computer Science Department, Tel-Hai Academic College, Upper Galilee, Israel, since 1997, where he is now a Senior Lecturer. He also has been teaching in the Signal and Image Processing Laboratory (SIPL), Electrical Engineering Faculty, Technion, since 1998. His research interests include audio and speech signal processing, voice analysis and perception, and genomic signal processing.