Blind Dereverberation

  • Technische Universität Darmstadt

    Institut für Automatisierungstechnik

    Fachgebiet Regelungstheorie und Robotik

    Prof. Dr.-Ing. Jürgen Adamy

    Landgraf-Georg-Str. 4

    D-64283 Darmstadt

    Diplomarbeit

    Study of blind dereverberation algorithms for real-time

    applications

    Xavier Domont

    Work in cooperation with:

    Honda Research Institute Europe GmbH

    D-63073 Offenbach/Main

    Tutors:

    Dr.-Ing. Martin Heckmann (HRI)

    Dipl.-Ing. Bjoern Scholing (TUD)

    June 2005

  • Abstract

    At Honda Research Institute Europe, an automatic speech recognition system is being developed for the humanoid robot ASIMO. Reverberation alters the perception of speech signals emitted in a room and reduces the performance of automatic speech recognition. Many methods have been proposed over the past few decades to enhance reverberant speech signals. This diploma thesis studies the most promising algorithms and discusses whether they can be implemented in real time for real environments.

    The existing methods can be classified into two families:

    1. Those that directly estimate the clean speech signal and treat reverberation as a disturbance.

    2. Those that estimate the room impulse response and invert the estimated system to recover the clean speech.

    These two approaches are compared in this thesis, based on MATLAB implementations of selected algorithms. The focus of this comparison is on the suitability of these algorithms for real environments, where speaker and robot are moving, and on a possible real-time implementation.

  • Kurzfassung

    At Honda Research Institute Europe, an automatic speech recognition system is being developed for the robot ASIMO. Reverberation degrades speech quality and noticeably lowers speech recognition results. Over the past 30 years, many methods have been proposed to enhance speech signals. This diploma thesis examines the most promising algorithms with regard to real-time capability and applicability under real conditions.

    There are two approaches to solving this problem:

    1. The original speech signal can be estimated directly from the observed signal. The reverberation effect is treated as a disturbance of the clean signal.

    2. The room impulse response can be determined and then inverted in order to recover the original speech signal.

    These two approaches are compared in this diploma thesis. For this purpose, selected algorithms were implemented. The main focus of the comparison was the applicability of the methods in real environments, in which speaker and robot move.

  • Contents

    1 Introduction 17

    1.1 What is blind dereverberation? . . . . . . . . . . . . . . . . . . . 17

    1.2 Motivation of this diploma-thesis . . . . . . . . . . . . . . . . . . 18

    1.3 Audio processing architecture on ASIMO . . . . . . . . . . . . . . 19

    1.3.1 Overview of the peripheral auditory system . . . . . . . . 20

    1.3.2 The Gammatone filterbank, a model of the basilar membrane 21

    1.4 Overview of this report . . . . . . . . . . . . . . . . . . . . . . . . 22

    2 Model of a reverberant signal 25

    2.1 Properties of a speech signal . . . . . . . . . . . . . . . . . . . . . 25

    2.1.1 Quick overview of the speech production system . . . . . . 25

    2.1.2 Speech segments categorization . . . . . . . . . . . . . . . 27

    2.1.3 Harmonicity of a speech signal . . . . . . . . . . . . . . . 28

    2.1.4 Linear prediction analysis . . . . . . . . . . . . . . . . . . 28

    2.2 Room acoustics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

    2.2.1 Measurement of real room impulse responses . . . . . . . . 31

    2.2.2 Simulation of the room impulse response . . . . . . . . . . 32

    2.2.3 Linear Time-Invariant model of the room . . . . . . . . . . 35

    2.2.4 Effect on the spectrogram . . . . . . . . . . . . . . . . . . 36


    2.3 Inversion of the room impulse response . . . . . . . . . . . . . . . 37

    2.3.1 Conditions on the inversion of FIR filters . . . . . . . . . . 37

    2.3.2 Are room transfer functions minimum-phase? . . . . . . . . 39

    2.3.3 Multiple input inverse filter . . . . . . . . . . . . . . . . . 41

    3 Enhancement of a speech signal 45

    3.1 Harmonicity based dereverberation . . . . . . . . . . . . . . . . . 45

    3.1.1 Effect of reverberation on a sweep signal . . . . . . . . . . 46

    3.1.2 Adaptive harmonic filtering . . . . . . . . . . . . . . . . . 47

    3.1.3 Dereverberation operator . . . . . . . . . . . . . . . . . . . 48

    3.1.4 The HERB method . . . . . . . . . . . . . . . . . . . . . . 52

    3.1.5 Test of the method . . . . . . . . . . . . . . . . . . . . . . 53

    3.1.6 Discussion of the method . . . . . . . . . . . . . . . . . . . 57

    3.2 Dereverberation using LP analysis . . . . . . . . . . . . . . . . . . 59

    3.2.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . 59

    3.2.2 The kurtosis as a measure of the reverberation . . . . . . . 60

    3.2.3 Maximization of the kurtosis . . . . . . . . . . . . . . . . . 62

    3.3 Discussion of the method . . . . . . . . . . . . . . . . . . . . . . . 64

    4 Equalization of room impulse responses 67

    4.1 Principle of the channel estimation . . . . . . . . . . . . . . . . . 67

    4.1.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.1.2 Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.1.3 How can this idea be implemented? . . . . . . . . . . . . . 69

    4.1.4 Why do the channels have to be coprime? . . . . . . . . . . 70

    4.1.5 Estimation of the length of the filters . . . . . . . . . . . . 70


    4.2 Batch method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

    4.2.1 Extraction of the common part . . . . . . . . . . . . . . . 72

    4.2.2 Noisy case . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

    4.3 Iterative method . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

    4.3.1 Choice of the optimization method . . . . . . . . . . . . . 76

    4.3.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

    4.4 Improvement of the method . . . . . . . . . . . . . . . . . . . . . 77

    4.5 Discussion of the channel estimation methods . . . . . . . . . . . 79

    5 Conclusion and outlook 81

    5.1 Review of the studied methods . . . . . . . . . . . . . . . . . . . 81

    5.1.1 Harmonicity-based dereverberation . . . . . . . . . . . . . 81

    5.1.2 Linear prediction analysis . . . . . . . . . . . . . . . . . . 81

    5.1.3 Channel estimation . . . . . . . . . . . . . . . . . . . . . . 82

    5.1.4 Direct comparison of the methods . . . . . . . . . . . . . . 82

    5.2 Speech model based method vs. channel estimation . . . . . . . . 83

    5.3 What should we decide for ASIMO? . . . . . . . . . . . . . . . . . 83

    A Proofs 87

  • List of Figures

    1.1 Different paths of a sound wave in a room . . . . . . . . . . . . . 17

    1.2 General model of a reverberant signal . . . . . . . . . . . . . . . . 18

    1.3 General shape of a room impulse response . . . . . . . . . . . . . 18

    1.4 Peripheral auditory system [1] . . . . . . . . . . . . . . . . . . . . 20

    1.5 Impulse and frequency responses of a Gammatone filter . . . . . . 21

    1.6 Analysis filters of a Gammatone filter-bank with 16 channels. . . . 22

    2.1 General model of a reverberant signal . . . . . . . . . . . . . . . . 25

    2.2 Schematic diagram of the human speech production mechanism

    (source: [3]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.3 Block diagram of the human speech production (source: [3]) . . . 27

    2.4 Discrete-Time speech production model. (a) True Model. (b)

    Model to be estimated using LP analysis. (source [3]) . . . . . . . 30

    2.5 System to identify (one microphone case) . . . . . . . . . . . . . . 31

    2.6 Measurement method . . . . . . . . . . . . . . . . . . . . . . . . . 32

    2.7 Example of measured room impulse response . . . . . . . . . . . . 32

    2.8 Image Method: Direct path . . . . . . . . . . . . . . . . . . . . . 33

    2.9 Image Method: virtual source . . . . . . . . . . . . . . . . . . . . 33

    2.10 Image Method: Sound wave reflecting off two walls . . . . . . . . 33

    2.11 Room impulse response simulated with the image method . . . . . 35


    2.12 Spectrograms of an anechoic signal (left) and the resulting spectrogram of its convolution with the impulse response of figure 2.7 (right). These spectrograms were obtained with a Gammatone filterbank. . . . . . . . 37

    2.13 Inversion of a filter . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    2.14 Pole and zero of an all-pass filter . . . . . . . . . . . . . . . . 40

    2.15 Energy of a non-minimum phase system (dashed - blue) and the

    corresponding minimum-phase system (red). . . . . . . . . . . . . 41

    2.16 Multiple input inverse filter . . . . . . . . . . . . . . . . . . . . . 42

    3.1 Spectrograms of a sweeping sinusoid and its reverberant signal. . . 46

    3.2 Adaptive harmonic filtering . . . . . . . . . . . . . . . . . . . . . 47

    3.3 Diagram of the HERB dereverberation method . . . . . . . . . . . 52

    3.4 Up-left: original signal (sweep with harmonics). Up-right: reverberant signal. Bottom-left: harmonic estimate with the Gammatone filter-bank. Bottom-right: harmonic estimate with Nakatani's harmonic filter. . . . . . 54

    3.5 Spectrogram of the clean and reverberant signal used to test the

    reverberation operator. . . . . . . . . . . . . . . . . . . . . . . . . 56

    3.6 Spectrogram of the enhanced signal computed in the frequency

    domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

    3.7 Impulse response of the dereverberation operator and spectrogram

    of the enhanced signal computed in the time domain. . . . . . . . 57

    3.8 Effect of the reverberation on the fundamental frequency. . . . . . 58

    3.9 Example of platykurtic (left) and leptokurtic (right) distributions.

    Both distributions have the same standard deviation . . . . . . . 60

    3.10 On the left, extract of the LP residuals of a speech signal. Note

    the strong peaks corresponding to the glottal pulses. On the right,

    the same signal impaired by reverberations. . . . . . . . . . . . . 61


    3.11 Estimation of the probability density functions of the LP residuals of a clean speech signal (blue) and of a reverberant signal (red). Both signals have been centered and normalized such that their means are 0 and their standard deviations are 1. . . . . . 61

    3.12 (a) A single channel time-domain adaptive algorithm for maximiz-

    ing the kurtosis of the LP residuals. (b) Equivalent system, which

    avoids LP reconstruction artifacts. . . . . . . . . . . . . . . . . . . 63

    3.13 Two-channel frequency-domain adaptive algorithm for maximiza-

    tion of the kurtosis of the LP residual. . . . . . . . . . . . . . . . 64

    3.14 On the left the LP residual of a clean signal. On the right the LP

    residual of the resulting dereverberated signal. The kurtosis of the

    dereverberated signal is higher than the kurtosis of the original

    signal. The resulting signal is strongly distorted. . . . . . . . . . . 65

    4.1 SIMO System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.2 Channel identification with overestimated channel orders. . . . . . 71

    4.3 Estimated zeros and real zeros for one channel (left); zeros of all the estimated channels (right). On the left, 4 estimated zeros are isolated: they do not correspond to a real zero of the filter. On the right it can be noticed that these 4 additional zeros are common to all the estimated channels. . . . . . 72

    4.4 Eigenvalues of the matrix Rx in the noiseless case. On the right:

    zoom on the smallest eigenvalues. . . . . . . . . . . . . . . . . . . 73

    4.5 Left: 4 of the 11 eigenvectors of the null space. Right: common part of the null space (blue) and real impulse response (red). The impulse responses of the 2 channels are concatenated and 10 zeros (corresponding to the over-estimation of the order) were added. . . . . . 74

    4.6 Eigenvalues of the correlation matrix in the noisy case. The variance of the noise is equal to 10^-10 on the left and 10^-6 on the right. . . . . . 75


    4.7 Iterative estimation of the channel impulse responses using two microphones. On the left the estimated zeros (blue) of one of the channels are compared with their real values (red). On the right the remaining impulse response after inversion of the system is drawn (blue); in the ideal case it should be a Dirac (red). . . . . . 77

    4.8 Iterative estimation of the channel impulse responses using 5 microphones. On the left the estimated zeros (blue) of one of the channels are compared with their real values (red). On the right the remaining impulse response after inversion of the system is drawn (blue); in the ideal case it should be a Dirac (red). . . . . . 78

    4.9 Comparison of the position of the zeros when the convolution and

    the subsampling are performed in a different order. . . . . . . . . 79

  • Acronyms

    FFT Fast Fourier Transform

    DFT Discrete Fourier Transform

    STFT Short-Time Fourier Transform

    SISO Single-Input Single-Output

    SIMO Single-Input Multiple-Output

    MIMO Multiple-Input Multiple-Output

    ROC Region of Convergence

    LTI Linear Time-Invariant

    FIR Finite Impulse Response

    IIR Infinite Impulse Response

    MINT Multiple input inverse filter

  • Chapter 1

    Introduction

    1.1 What is blind dereverberation?

    Figure 1.1: Different paths of a sound wave in a room

    The acoustic signals emitted in a room reflect off the walls and other objects

    (see figure 1.1). The direct signal and all the reflected sound waves arrive at the microphone or listener with different delays and sum up. This effect is called reverberation. Sometimes the term echo is used instead of reverberation. However, echo generally implies a distinct, delayed version of a sound. In a room, each delayed sound wave arrives within such a short period of time that we do not perceive each reflection as a copy of the original sound. Even though we can't discern


    every reflection, we still hear the effect of the entire series of reflections.

    Whereas a human being without hearing problems can cope quite well with these distortions, the reverberation effect impairs speech intelligibility in devices such as hands-free conference telephones and automatic speech recognition systems.

    The diagram in figure 1.2 shows how the system can be modeled. The effect of

    the room is considered as a filter with impulse response h(t) whose input is the

    clean speech signal s(t) and the output is the observed reverberant signal x(t).

    s(t) → h(t) → x(t)

    Figure 1.2: General model of a reverberant signal

    Figure 1.3 shows the general shape of a room impulse response. The reverbera-

    tion corrupts the speech by blurring its temporal structure. However, due to the

    spectral continuity of speech, the early reflections mainly increase the intensity of

    the reverberant speech, whereas the later ones are deleterious to speech quality

    and intelligibility.

    Figure 1.3: General shape of a room impulse response

    The aim of blind dereverberation is to recover the clean signal s(t) from the observed reverberant signal x(t). The term blind means that neither the clean signal nor the impulse response of the room is known before processing.

    1.2 Motivation of this diploma-thesis

    This diploma thesis was written in cooperation with Honda Research Institute

    (HRI) Europe. One of the important projects of HRI is the development of the


    humanoid robot ASIMO (Advanced Step in Innovative MObility). At HRI Europe

    the CARL (Child-like Acquisition of Representation and Language) Group of Dr.

    Frank Joublin aims at developing a system of automatic speech recognition (ASR)

    and production for ASIMO. As the distortions caused by reverberation degrade the performance of ASR, we will investigate whether a signal processing method can be found to dereverberate the signals heard by ASIMO.

    During the past decades many dereverberation methods have been proposed.

    However, no standard method has yet been found and this research topic is still

    very active. The aim of this diploma thesis is to survey the state of the art of the existing methods and then to evaluate whether some of them could be integrated into the audio processing system of ASIMO.

    The important requirements for ASIMO are, firstly, that the dereverberation is performed in real time and, secondly, that the system must adapt to a real and changing environment. This means that the algorithms have to adapt themselves to the room conditions faster than these conditions change. As both ASIMO and the speaker can be moving, the effects of the room can change very rapidly.

    To perform this study we selected, out of the recently proposed methods, the ones that seemed the most promising. The selected methods were then implemented in MATLAB in order to determine their advantages and drawbacks. For the implementation, we tried, wherever possible, to use the existing audio processing architecture of ASIMO, described in section 1.3.

    In addition to analyzing their performance, we will discuss whether the studied methods, while enhancing the perception of speech, alter signal characteristics that are used by subsequent audio processing on ASIMO. In particular, the phase spectrum is essential to the localization of a speech source.

    1.3 Audio processing architecture on ASIMO

    The audio processing system at HRI uses a Gammatone filterbank. This type

    of filterbank is widely used in audio signal processing as it simulates the human auditory system.


    Figure 1.4: Peripheral auditory system [1]

    1.3.1 Overview of the peripheral auditory system

    The aim of the peripheral auditory system (see figure 1.4) is to transform a sound

    (which is actually a pressure variation in air) into nerve impulses. These impulses

    are then conveyed by the auditory nerve to the brain stem. The nerve cells in

    the brain stem act as relay stations, eventually conveying nerve impulses to the

    auditory cortex.

    The outer ear is composed of the pinna (the visible part) and the auditory canal

    or meatus. The pinna significantly modifies the incoming sound in a way that

    depends on the angle of incidence of the sound relative to the head. This is

    important for sound localization. Sound travels down the meatus and causes the eardrum, or tympanic membrane, to vibrate. These vibrations are transmitted through the middle ear by three small bones, the ossicles, to a membrane-covered

    opening in the bony wall of the spiral-shaped structure of the inner ear, the

    cochlea.

    The cochlea is shaped like the spiral shell of a snail. It is filled with almost incompressible fluids and is divided along its length by two membranes, Reissner's membrane and the basilar membrane. The motion of the basilar membrane

    in response to a sound is of primary interest.


    1.3.2 The Gammatone filterbank, a model of the basilar

    membrane

    A point on the basilar membrane is characterized by its impulse response. The

    Gammatone function approximates physiologically recorded impulse responses:

    g(t) = t^(n-1) exp(-2πbt) cos(2πf0t + φ) (1.1)

    where t is the time (t ≥ 0), b determines the duration of the impulse response, n is the order of the filter and determines the slope of the skirts of the filter, φ is a phase and f0 is the center frequency.

    Figure 1.5: Impulse and frequency responses of a Gammatone filter

    It can be observed from figure 1.5 that the Gammatone filter is a bandpass with

    its center frequency at f0. Its bandwidth depends on b.

    To simulate the whole basilar membrane, a bank of Gammatone filters can be

    used. Each filter channel represents the frequency response of one point on the

    basilar membrane.

    The parameters of the Gammatone filters are determined from psychoacoustic

    measurements. Glasberg and Moore [2] summarized the equivalent rectangular

    bandwidth (ERB) of the human auditory filter. The ERB of a filter is defined as

    the width of a rectangular filter whose height equals the peak gain of the filter

    and which passes the same total power as the filter.

    The relation between the bandwidth and the center frequency of the Gammatone

    filters is given by:

    ERB = 24.7 + 0.108 f0. (1.2)
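    Equations (1.1) and (1.2) can be combined into a small sketch (an illustrative reimplementation, not the HRI code; the factor 1.019 linking b to the ERB is a common fourth-order approximation and is an assumption here):

```python
import numpy as np

def erb(f0):
    """Equivalent rectangular bandwidth in Hz, equation (1.2)."""
    return 24.7 + 0.108 * f0

def gammatone_ir(f0, fs=16000, n=4, phase=0.0, duration=0.025):
    """Sampled Gammatone impulse response, equation (1.1).

    The factor 1.019 relating b to the ERB is a common fourth-order
    approximation and an assumption of this sketch.
    """
    b = 1.019 * erb(f0)
    t = np.arange(1, int(duration * fs) + 1) / fs   # t > 0
    g = (t ** (n - 1) * np.exp(-2 * np.pi * b * t)
         * np.cos(2 * np.pi * f0 * t + phase))
    return g / np.max(np.abs(g))                    # peak-normalize

# Bandwidth grows with the center frequency, as for the bank in figure 1.6.
for f0 in (100.0, 1000.0, 4000.0):
    print(f"f0 = {f0:6.0f} Hz, ERB = {erb(f0):6.1f} Hz")
```

    Filtering a signal with a bank of such impulse responses, one per center frequency, yields the spectrogram-like representation used throughout this thesis.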


    Figure 1.6 shows the transfer functions for a bank of 16 filters with center frequencies spaced between 50 Hz and 8 kHz. As the spectral resolution of the basilar membrane decreases with increasing frequency, the center frequencies of the Gammatone filters are not linearly distributed and their bandwidths increase

    with the center frequency according to equation (1.2). We can also note that the

    pass bands overlap.

    Figure 1.6: Analysis filters of a Gammatone filter-bank with 16 channels.

    1.4 Overview of this report

    The existing blind dereverberation methods can be classified into two families.

    1. We can directly estimate the clean speech signal, or the parameters and excitation of an appropriate parametric model, as a missing-data problem, treating reverberation as a disturbance.

    2. We can model the effect of the room by a filter. The coefficients of this filter are estimated by treating the clean speech as a disturbance. The observed signal is then deconvolved with the estimated filter to recover the clean speech.

    In chapter 2 we will discuss how speech signals and room impulse responses can

    be modeled. This modeling step is essential to determine what, in the observed

    signal, is due to the speech and what is an effect of the room.


    In chapter 3 two methods that use the properties of speech to enhance reverberant signals will be studied. These methods consider the room effect to be a disturbance and try to restore the characteristics of the speech that the reverberation altered.

    In chapter 4 the possibility of estimating the room impulse responses will be discussed. This approach is attractive because, once the effect of the room on the signal is known, it becomes possible to invert this effect and recover the clean signal.

  • Chapter 2

    Model of a reverberant signal

    In terms of signal processing a room can be seen as a filter. The original (anechoic)

    signal s(n) goes through a filter h(n) and gives the reverberant signal x(n), see

    figure 2.1. In the case of blind dereverberation both the input signal s(n) and the room impulse response h(n) are unknown.

    s(n) → h(n) → x(n)

    Figure 2.1: General model of a reverberant signal

    The task of dereverberation is to find an estimate ŝ(n) of s(n), given the output

    x(n) of the system. In order to make this task feasible, a model of the speech

    signal and/or a model of the room are required.

    In section 2.1 different ways to model a speech signal will be discussed. In section

    2.2 the effects of the room on the speech signal will be investigated. Finally, section 2.3 will discuss the possibility of inverting the effects of the room.

    2.1 Properties of a speech signal

    2.1.1 Quick overview of the speech production system

    The principal components of the human speech production system are (see figure 2.2) the lungs, trachea (windpipe), larynx (organ of voice production), pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose). The pharyngeal and oral cavities are usually grouped and referred to as the vocal tract.

    Figure 2.2: Schematic diagram of the human speech production mechanism (source: [3])

    It is useful to think of speech production in terms of an acoustic filtering oper-

    ation. The pharyngeal, oral and nasal cavities comprise the main acoustic filter.

    This filter is excited by the organs below it, and is loaded at its main output by

    a radiation impedance due to the lips. The articulators are used to change the

    properties of the system, its form of excitation, and its output loading over time.

    Figure 2.3 shows a simplified acoustic model illustrating these ideas.


    Figure 2.3: Block diagram of the human speech production (source: [3])

    2.1.2 Speech segments categorization

    The spectral characteristics of the speech wave are non-stationary, since the phys-

    ical system changes rapidly over time. Speech can therefore be divided into sound

    segments which present similar properties over a short period of time. Without

    going into further detail, the main way to classify a speech sound is by the type of excitation.

    The two elementary types of excitation are voiced and unvoiced. There are actually a few other types of excitation (mixed, plosive, whisper, silence), but they can be seen as combinations of the two elementary types.

    Voiced sounds are produced by forcing air through the glottis, an opening between the vocal folds. The vocal cords vibrate in an oscillatory fashion and, therefore, the produced speech signal is quasi-periodic; its period is called the fundamental period T0, and the fundamental frequency F0 can be defined as 1/T0.


    Unvoiced sounds are generated by forming a constriction at some point along the

    vocal tract, and forcing air through the constriction to produce turbulence. The

    produced speech signal is a noise-like sound.

    Typical human speech communication is limited to a bandwidth of 7-8 kHz. The

    main part of the energy is contained in voiced segments.

    2.1.3 Harmonicity of a speech signal

    A speech signal s(n) can be modeled [4] by using the sum of a harmonic signal

    sh(n), derived from a glottal vibration, and a non-harmonic signal sn(n), such as

    fricatives and plosives, as

    s(n) = sh(n) + sn(n). (2.1)

    The harmonic part of the signal is defined by its voiced durations and their fun-

    damental frequencies (F0). A voiced duration is the time during which the vocal

    cords vibrate to generate a harmonic signal and the fundamental frequency refers

    to the frequency of the fundamental component of the signal. Each harmonic

    component has a frequency which corresponds to F0 or its multiples.

    It can be assumed that F0 is constant within a short time; therefore the harmonic signal sh(n) can be modeled over a time frame of length T by a sum of sinusoidal components whose frequencies coincide with the fundamental frequency of the signal and its multiples:

    sh(n) = Σ_{k=1}^{N} Ak cos( 2π k F0 (n - nc) / fs + φk )  for |n - nc| < T/2 (2.2)

    where Ak and φk are the amplitude and the phase of the k-th harmonic component, nc the time index of the center of the frame and fs the sampling rate.
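    As a worked example of equation (2.2), the following sketch synthesizes one frame of a harmonic signal (the amplitudes Ak and phases φk are made up for illustration; this is not thesis code):

```python
import numpy as np

fs = 16000                            # sampling rate (assumed)
F0 = 200.0                            # fundamental frequency in Hz (assumed)
N = 5                                 # number of harmonic components
T = 512                               # frame length in samples
nc = T // 2                           # time index of the frame center
n = np.arange(T)

rng = np.random.default_rng(0)
A = 1.0 / np.arange(1, N + 1)         # made-up amplitudes A_k
phi = rng.uniform(-np.pi, np.pi, N)   # made-up phases phi_k

# Equation (2.2): the harmonic part is a sum of sinusoids whose
# frequencies are F0 and its integer multiples.
sh = sum(A[k] * np.cos(2 * np.pi * (k + 1) * F0 * (n - nc) / fs + phi[k])
         for k in range(N))

print(sh.shape)                       # one frame of the harmonic signal
```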

    2.1.4 Linear prediction analysis

    A widely used model of speech signals is given by Linear Prediction (LP) analysis.

    This model consists of separating the speech signal into an excitation signal and

    a model of the vocal tract.


    During a stationary frame of speech the model would ideally be characterized by

    a pole-zero transfer function of the form

    Θ(z) = Θ0 · ( 1 + Σ_{i=1}^{L} b(i) z^(-i) ) / ( 1 - Σ_{i=1}^{R} a(i) z^(-i) ) (2.3)

    which is driven by an excitation sequence

    e(n) = Σ_{q=-∞}^{+∞} δ(n - qP)  (voiced case), or zero-mean, unit-variance, uncorrelated noise (unvoiced case), (2.4)

    where δ(n) is the discrete Dirac impulse

    δ(n) = 1 if n = 0, 0 else. (2.5)

    The principle of the LP analysis is to approximate this pole-zero system with an

    all-pole system

    Θ̂(z) = 1 / ( 1 - Σ_{i=1}^{R} a(i) z^(-i) ) (2.6)

    which can be easily estimated by solving a system of linear equations. The

    schematics of the true speech model and of its LP approximation are shown in

    figure 2.4.

    A magnitude spectrum, but not a phase spectrum¹, can be exactly modeled with stable poles. This means that the LP analysis will model the true magnitude spectrum of the speech, which is, in most cases, enough for speech perception.

    For example, a listener moving from room to room within a house is able to clearly

    understand speech of a stationary talker, even if the phase relationships among

    the components are changing dramatically [3]. However, for some applications like

    the localization of the talker, the temporal dynamics of the sound are essential

    and the LP analysis should be used with care.

    ¹Actually the LP model has a minimum-phase characteristic. This notion will be discussed in more detail in section 2.3.


    Figure 2.4: Discrete-Time speech production model. (a) True Model. (b) Model tobe estimated using LP analysis. (source [3])

    To understand the name Linear Prediction, it is helpful to consider the LP analysis in the time domain. An all-pole transfer function corresponds to an autoregressive (AR) model, i.e. the signal s(n) can be expressed as a linear combination of its L past samples:

    s(n) = Σ_{k=1}^{L} ak s(n - k) + e(n) (2.7)

    where ak are the LP coefficients. In terms of system identification, the excitation signal e(n) can be seen as the prediction error signal, also called the LP residual.
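    The AR model of equation (2.7) can be estimated from the signal itself; the minimal least-squares sketch below (illustrative, not the thesis implementation) fits the coefficients ak on a synthetic AR signal and computes the LP residual e(n):

```python
import numpy as np

def lp_coefficients(s, order):
    """Least-squares estimate of a_k in s(n) = sum_k a_k s(n-k) + e(n)."""
    # Regression matrix whose rows hold the `order` past samples of s(n).
    rows = [s[n - order:n][::-1] for n in range(order, len(s))]
    a, *_ = np.linalg.lstsq(np.array(rows), s[order:], rcond=None)
    return a

def lp_residual(s, a):
    """Prediction error e(n) = s(n) - sum_k a_k s(n-k)."""
    order = len(a)
    pred = np.array([a @ s[n - order:n][::-1] for n in range(order, len(s))])
    return s[order:] - pred

# Synthetic AR(2) signal driven by white noise (toy example).
rng = np.random.default_rng(1)
e = rng.standard_normal(2000)
s = np.zeros(2000)
for n in range(2, 2000):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + e[n]

a = lp_coefficients(s, order=2)
res = lp_residual(s, a)
print(a)          # close to the true coefficients 1.3 and -0.6
```

    The residual res recovers, up to estimation error, the driving noise e; for voiced speech it would instead show the strong glottal pulses discussed in chapter 3.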

    2.2 Room acoustics

    This section will first present how room impulse responses can be measured

    (2.2.1) or simulated (2.2.2). The goal is to obtain a set of real and artificial


    impulse responses. These data will be useful in the next chapters to test the

    dereverberation methods.

    Then (2.2.3) we will discuss whether a general model of a room can be found. Finally (2.2.4), we will use time-frequency analysis to briefly study the effects of reverberation on speech signals.

    2.2.1 Measurement of real room impulse responses

    In order to get real impulse response corresponding to a normal room, we per-

    formed measurements in the office of the CARL group at HRI. A sound signal

    was played through the room by a loudspeaker. Simultaneously the sound wave

was recorded using a model of ASIMO's head equipped with two microphones.

s(n) → h(n) → x(n)

    Figure 2.5: System to identify (one microphone case)

For each microphone both the input s(n) and the output x(n) of the system are known; the impulse response h(n) can then be computed by inverting the convolution

x(n) = h(n) * s(n).    (2.8)

However, the measurement is generally altered by additive noise. To improve the

    measurement it is therefore better to use auto- and cross-correlation functions.

    Equation (2.8) becomes

R_sx(n) = h(n) * R_s(n)    (2.9)

where R_s(n) is the autocorrelation function of s(n) and R_sx(n) the cross-correlation function of s(n) and x(n). Equation (2.9) is less sensitive to noise.

Moreover, if s(n) is white noise, its autocorrelation function is equal to δ(n).

    Then

h(n) = R_sx(n), when R_s(n) = δ(n).    (2.10)

The impulse response of the room is then equal to the cross-correlation function between the white noise played by the loudspeaker and the signal recorded by the microphone.

    For our measurement we used 1 second of Gaussian white noise as room excitation

    signal.
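The white-noise identification described above can be checked numerically; the following Python/numpy sketch (a toy exponentially decaying "room" response and an 8 kHz rate are illustrative assumptions, not the measured data) recovers the impulse response from equation (2.10):

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 8000
s = rng.standard_normal(fs)                          # 1 s of Gaussian white noise
h = 0.9 ** np.arange(64) * rng.standard_normal(64)   # toy room impulse response
x = np.convolve(s, h)                                # microphone signal, eq. (2.8)

# Cross-correlate excitation and recording: R_sx(n) ~ h(n) since R_s(n) ~ delta(n)
Rsx = np.correlate(x, s, mode="full")[len(s) - 1 :] / len(s)
h_est = Rsx[:64]                                     # estimate of h(n), eq. (2.10)
```

The estimation error shrinks as the excitation gets longer, which is why a full second of noise was used in the measurement.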


    Two different sound cards were used to play and record the signals. In order to

    easily synchronize the input and the output of the system, the excitation signal

    was directly recorded on another channel of the capture sound card in addition

    to the signals of the microphones (see figure 2.6).


    Figure 2.6: Measurement method

Moreover, this method makes it possible to compensate for possible effects of the sound cards. As only a stereo sound card was available, the recordings for the left and right ears had to be performed separately. Figure 2.7 shows one of the measured impulse responses.

    Figure 2.7: Example of measured room impulse response

    2.2.2 Simulation of the room impulse response

A technique to simulate the impulse response of a room is the image method proposed in 1979 by Allen and Berkley [5]. It sums the direct path with all reflections off walls or objects.

An example in [6] shows the principle of this method. Figure 2.8 shows the direct path from a sound source to a microphone. Another part of the sound wave


    Figure 2.8: Image Method: Direct path

    is reflected off a wall and then impinges upon the microphone. This reverberated

sound seems to come directly from a virtual source located in an adjacent room that is the mirror image of the original room with respect to the wall (see figure 2.9). In this figure the black line represents the real path of the signal, whereas the blue line is its perceived path.


    Figure 2.9: Image Method: virtual source

    This process can be extended to sound waves that are reflected more than once

off the walls (see figure 2.10). This process can be continued in the same way in three dimensions to obtain an infinite number of virtual sources.

Figure 2.10: Image Method: Sound wave reflecting off two walls


The virtual sources make it easy to compute the distance the sound wave travels to reach the microphone.

Considering a rectangular room with dimensions (L_x, L_y, L_z), the coordinate vector r_{i,j,k} = (x_i, y_j, z_k)^T, (i, j, k) ∈ Z³, of a virtual source is:

x_i = (−1)^i x_source + (i + (1 − (−1)^i)/2) L_x
y_j = (−1)^j y_source + (j + (1 − (−1)^j)/2) L_y
z_k = (−1)^k z_source + (k + (1 − (−1)^k)/2) L_z    (2.11)

where r_source = (x_source, y_source, z_source)^T is the coordinate vector of the source. The distance from the virtual source to the microphone is

d_{i,j,k} = ||r_{i,j,k} − r_m|| = √((x_i − x_m)² + (y_j − y_m)² + (z_k − z_m)²)    (2.12)

where r_m = (x_m, y_m, z_m)^T is the coordinate vector of the microphone.

    The sound wave corresponding to the (i, j, k) virtual source will arrive at the

    microphone with a delay

τ_{i,j,k} = d_{i,j,k} / c    (2.13)

where c is the speed of sound. The impulse response of the room is the sum of the delayed impulses corresponding to the signals arriving from each virtual source:

h(t) = Σ_{i,j,k ∈ Z} h_{i,j,k} δ(t − d_{i,j,k}/c)    (2.14)

The magnitude h_{i,j,k} of the unit impulse is influenced by the distance the sound wave travels to get from the source to the microphone,

b_{i,j,k} = 1 / (4π d_{i,j,k}²)    (2.15)

and by the number of reflections off the walls,

c_{i,j,k} = β^{|i|+|j|+|k|}    (2.16)

where β < 1 is the wall reflection coefficient (which is, in this simple model, considered to be the same for all the walls). The overall magnitude is then

h_{i,j,k} = b_{i,j,k} c_{i,j,k}    (2.17)


Although the impulse response of the room should contain an infinite number of delayed impulses, corresponding to an infinity of virtual sources, the magnitudes h_{i,j,k} become very small for large |i|, |j| and |k|. The impulse response then has a finite duration:

h(t) = Σ_{i=−n}^{n} Σ_{j=−n}^{n} Σ_{k=−n}^{n} h_{i,j,k} δ(t − d_{i,j,k}/c)    (2.18)

    Figure 2.11: Room impulse response simulated with the image method

    Figure 2.11 shows a simulated room impulse response obtained with the image

method. Reverberant sounds generated using such an impulse response sound like signals recorded in real conditions. However, phenomena such as the phase inversion of the sound wave when it reflects off a wall, or the presence of objects and people, are ignored by this model.
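Equations (2.11)–(2.18) condense into a small simulation. The following Python/numpy sketch (room geometry, β, reflection order n and sampling rate are illustrative assumptions, not the values used for figure 2.11) builds a toy impulse response with the image method:

```python
import numpy as np

def image_method_rir(room, src, mic, beta=0.7, n=4, fs=8000, c=343.0):
    """Toy image-method room impulse response (equations 2.11-2.18)."""
    room, src, mic = (np.asarray(v, float) for v in (room, src, mic))
    length = int(fs * np.linalg.norm(room) * (2 * n + 1) / c) + 1
    h = np.zeros(length)
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            for k in range(-n, n + 1):
                ijk = np.array([i, j, k], float)
                # Virtual-source coordinates, eq. (2.11)
                img = (-1.0) ** ijk * src + (ijk + (1 - (-1.0) ** ijk) / 2) * room
                d = np.linalg.norm(img - mic)        # distance, eq. (2.12)
                delay = int(round(fs * d / c))       # arrival delay, eq. (2.13)
                if delay < length:
                    # distance attenuation (2.15) times wall absorption (2.16)
                    h[delay] += beta ** (abs(i) + abs(j) + abs(k)) / (4 * np.pi * d * d)
    return h

h = image_method_rir([5.0, 4.0, 3.0], [1.0, 1.0, 1.5], [3.5, 2.0, 1.5])
```

The first non-zero tap corresponds to the direct path; everything after it is the simulated reverberation tail.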

    2.2.3 Linear Time-Invariant model of the room

    The general shape of the measured and the simulated room impulse responses

corresponds to the one described in figure 1.3. However, when the conditions in the room change (movement of the talker and/or listener), the coefficients of the impulse response fluctuate strongly, especially in the late reverberation tail. As we explained in chapter 1, the distortions in the speech signal are mostly due to the late reverberation. Therefore, a model based on the image method, where the room impulse responses are modeled by a sum of delayed impulses h_{i,j,k} δ(t − τ_{i,j,k}), is not practical for system identification.

Actually, the only general properties which can be retained from a room impulse


    response are its linearity, its causality (there is no reverberation before the be-

    ginning of the signal) and its general exponential decay structure.

In real environments, the talker or the listener may be moving, so the effect of the room is time-variant. However, if we assume that the computation is fast enough, the system can be considered Linear Time-Invariant (LTI).

    Moreover, because of the exponential decay, the impulse response of the room has

    a finite duration. The room is then modeled by a Finite Impulse Response (FIR)

    filter. The relation between the input s(n) and the output x(n) is given by the

    convolution

x(n) = h(n) * s(n) = Σ_{k=0}^{L−1} h(k) s(n−k),    (2.19)

    where L is the length of the impulse response (also called order of the channel).

    Actually, the FIR model of the room impulse response is very practical as the

    transfer function of the system, i.e. the z-transform of its impulse response,

H(z) = Σ_{k=0}^{L−1} h(k) z^{−k}    (2.20)

    is defined for all finite z and is a polynomial.

    2.2.4 Effect on the spectrogram

    It is interesting to study the effect of reverberation on the spectrogram2 of a

    speech signal. Figure 2.12 shows the spectrograms of the same speech signal

    without and with reverberation.

The problem can be explained in the time-frequency domain as: "Given the spectrogram of the original signal at time frame t and frequency f, S(t, f), what is the influence of the room on the spectrogram of the reverberant signal at time-frequency bin (t′, f′), X(t′, f′)?"

The value of X at time frame t′ is only affected by bins of the original signal that are between time frames t′ and t′ − D, where D depends on the reverberation

²Instead of the spectrogram, the Gammatone filter-bank described in chapter 1 can be used. Contrary to a normal spectrogram, computed using a short-time Fourier transform, the Gammatone filter-bank gives an output for each time sample (no subsampling) and the center frequencies of the filters are not linearly distributed, which is closer to the human auditory system.


Figure 2.12: Spectrograms of an anechoic signal (left) and the resulting spectrogram of its convolution with the impulse response of figure 2.7 (right). These spectrograms were obtained with a Gammatone filter-bank.

time of the room, i.e. the time for the sound to die away to a level 60 dB below its original level. In the frequency domain, the reverberation slightly affects the adjacent channels. According to [7], this effect has the form of a Laplace distribution.

    2.3 Inversion of the room impulse response

    In this section the theoretical possibility of a perfect dereverberation will be

    discussed. The issue can be formulated in the following way: Assuming that the

    room impulse response is known, is it possible to remove its effect and to get an

    accurate estimate of the original speech signal?.

    2.3.1 Conditions on the inversion of FIR filters

The inverse g(n) of a filter h(n) (see figure 2.13) satisfies

ŝ(n) = g(n) * x(n) = g(n) * h(n) * s(n) = s(n),    (2.21)

which can be simplified to

g(n) * h(n) = δ(n).    (2.22)


s(n) → h(n) → x(n) → g(n) → ŝ(n)

    Figure 2.13: Inversion of a filter

The inversion problem can be studied with the help of the z-transform. The z-transform of h(n), also called the transfer function of the filter, is defined as the power series

H(z) = Σ_{k=−∞}^{+∞} h(k) z^{−k}.    (2.23)

It was shown in section 2.2 that the room can be considered as an FIR filter. Its z-transform is then the polynomial

H(z) = h₀ + h₁z^{−1} + ⋯ + h_{L−1}z^{−L+1}    (2.24)

where L is the length of the room impulse response. Factored into its roots,

H(z) = h₀ z^{−L+1} (z − z₁)(z − z₂) ⋯ (z − z_{L−1})    (2.25)

H(z) has L − 1 finite zeros at z = z₁, z₂, …, z_{L−1}. The transfer function of the inverse filter is then the rational function

G(z) = 1 / (h₀ + h₁z^{−1} + ⋯ + h_{L−1}z^{−L+1})    (2.26)

The Infinite Impulse Response (IIR) filter G(z) is causal and stable if and only if all its poles are inside the unit circle (|z| = 1). As the poles of G(z) are the zeros of H(z), this means that all the zeros of H(z) must be inside the unit circle. Such a system is called minimum-phase.

In order to understand this problem, we can observe what happens if we want to invert a simple non-minimum-phase system. Given the FIR filter h(n), defined in the time domain by

h(n) = δ(n) − 2δ(n − 1),

its transfer function is

H(z) = 1 − 2z^{−1}.


The Region of Convergence (ROC) of this z-transform is |z| > 0. As this system has a zero at z = 2, it is non-minimum-phase. The transfer function of its inverse system is

G(z) = 1 / (1 − 2z^{−1}) = z / (z − 2).

G(z) has a zero at the origin and a pole at z = 2. In this case there are two possible regions of convergence and hence two possible inverse systems. If the ROC of G(z) is taken as |z| > 2, then

g(n) = 2^n u(n),

where u(n) is the unit step function

u(n) = 1 if n ≥ 0, 0 else.    (2.27)

This is the impulse response of a causal and unstable system. On the other hand, if the ROC is assumed to be |z| < 2, the impulse response of the inverse system is

g(n) = −2^n u(−n − 1).

In this case the inverse system is anti-causal and stable.
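The practical consequence of the unstable causal inverse can be checked numerically. In this Python/scipy sketch (an assumed illustration, not taken from the thesis), the causal inverse 1/(1 − 2z⁻¹) recovers the impulse from exact data, but a tiny perturbation of the input is amplified like 2^n:

```python
import numpy as np
from scipy.signal import lfilter

# h(n) = delta(n) - 2 delta(n-1): non-minimum-phase FIR, zero at z = 2
x = np.zeros(32)
x[0] = 1.0                              # unit impulse
y = lfilter([1.0, -2.0], [1.0], x)      # filtered by h

# Causal inverse G(z) = 1/(1 - 2 z^-1): pole at z = 2, outside the unit circle
g_out = lfilter([1.0], [1.0, -2.0], y)  # exact data: the impulse is recovered

noisy = y.copy()
noisy[1] += 1e-6                        # tiny measurement error
blown_up = lfilter([1.0], [1.0, -2.0], noisy)   # the error grows like 2^n
```

Any measurement noise or quantization error therefore makes the causal inverse useless in practice, which is exactly why the stable alternative has to be anti-causal (or delayed).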

2.3.2 Are room transfer functions minimum-phase?

Any system can be represented as the cascade of a minimum-phase system with an all-pass system [8]. An all-pass system is defined as a system for which the magnitude of the transfer function is unity for all frequencies. Thus if H_ap(z) denotes the z-transform of an all-pass system, |H_ap(e^{jω})| = 1 for all ω. The poles and zeros of an all-pass system occur at conjugate reciprocal locations (see figure 2.14).

Consider a non-minimum-phase system H(z) with, for example, one zero outside the unit circle at z = 1/z₀, |z₀| < 1, while the remainder of its poles and zeros are inside the unit circle. Then H(z) can be expressed as

H(z) = H₁(z)(z^{−1} − z₀)    (2.28)


Figure 2.14: Pole (×) and zero (∘) of an all-pass filter in the z-plane, at the conjugate reciprocal locations z₀ = re^{jθ} (pole, inside the unit circle) and 1/z₀* = (1/r)e^{jθ} (zero).

where H₁(z) is minimum-phase. Equivalently equation (2.28) can be written as

H(z) = (H₁(z)(z^{−1} − z₀)) · (1 − z₀z^{−1})/(1 − z₀z^{−1})
     = (H₁(z)(1 − z₀z^{−1})) · (z^{−1} − z₀)/(1 − z₀z^{−1})
     = H_min(z) · (z^{−1} − z₀)/(1 − z₀z^{−1})
     = H_min(z) H_ap(z)    (2.29)

where H_min(z) is minimum-phase and H_ap(z) is all-pass. Any pole or zero of H(z) that is inside the unit circle also appears in H_min(z). Any pole or zero of H(z) that is outside the unit circle appears in H_min(z) at the conjugate reciprocal location.

    The equivalent minimum-phase system has the same magnitude spectrum as the

    original system.

It is interesting to compare the impulse response h(n) of an FIR system with the impulse response h_min(n) of its equivalent minimum-phase system. Figure 2.15 shows that the energy of h_min(n) is more concentrated around the origin. This property can be formalized with the following equation³:

Σ_{n=0}^{m} |h(n)|² ≤ Σ_{n=0}^{m} |h_min(n)|²,  ∀m ∈ N    (2.30)

    The energy of both sequences is the same since the magnitude of their Fourier

³A proof of this property is outlined in [8], page 371.

  • 2.3 Inversion of the room impulse response 41

Figure 2.15: Energy of a non-minimum-phase system (dashed, blue) and the corresponding minimum-phase system (red).

transforms is the same (by Parseval's theorem). This means that equality occurs in (2.30) as m → ∞.

Room transfer functions often have more energy in the reverberant component of the room impulse response than in the component corresponding to the direct path (see figure 1.3). This implies that room transfer functions are often non-minimum-phase. A causal and stable inverse of a room impulse response is therefore impossible to find in general. The non-causality problem can be solved by introducing a delay, i.e. a delayed inverse filter is computed instead. However the delay generally has to be quite long, which is not satisfactory for real-time applications.

    2.3.3 Multiple input inverse filter

As room transfer functions are non-minimum-phase most of the time, a perfect dereverberation cannot be achieved with a single microphone. It is possible to find a delayed inverse filter, but this solution is not really adequate for real-time processing.

However it is possible to find the exact inverse at a point in the room by using multiple microphones⁴, if the room transfer functions corresponding to the different sensors are coprime, i.e. they do not share common zeros [9].

This property is actually a direct application of Bézout's theorem on polynomials. Given M FIR filters with transfer functions H_i(z), i = 1, …, M, if the H_i(z) are coprime polynomials, then there exist G_i(z), i = 1, …, M, such that

H₁(z)G₁(z) + H₂(z)G₂(z) + … + H_M(z)G_M(z) = 1    (2.31)

where the orders of the G_i(z) are smaller than or equal to the highest order of the H_i(z). Figure 2.16 shows how equation (2.31) can be used to invert the M channels simultaneously. This method is called the Multiple-input/output INverse Theorem (MINT).

s(n) → H_i(z) → G_i(z), i = 1, …, M, with the M branch outputs summed to recover s(n)

Figure 2.16: Multiple input inverse filter

By using more than one microphone, the issue that room transfer functions are non-minimum-phase is bypassed. Moreover the inverse filters are simple FIR filters, which can be computed by solving the linear system

d = [H₁ᵀ H₂ᵀ ⋯ H_Mᵀ] g = H g    (2.32)

where d = [1, 0, …, 0]ᵀ is a vector of length 2L − 1, g is the concatenation of the vectors g_i = [g_i(0), …, g_i(L−1)]ᵀ corresponding to the inverse filters

g = [g₁ᵀ … g_Mᵀ]ᵀ    (2.33)

and the H_i are the L × (2L − 1) Sylvester matrices corresponding to the polynomials H_i(z):

       ⎡ h_i(0) ⋯ h_i(L−1)         0      ⎤
H_i =  ⎢          ⋱         ⋱             ⎥    (2.34)
       ⎣    0       h_i(0)  ⋯  h_i(L−1)   ⎦

⁴Such a system is called Single-Input Multiple-Output (SIMO).


A Sylvester matrix makes it possible to compute a convolution (or a polynomial multiplication) as a matrix multiplication. Given two signals x(n) and y(n), of lengths L_x and L_y respectively, their convolution z(n) has L_x + L_y − 1 samples and can be written in vector form as

z = Xᵀy = Yᵀx,    (2.35)

where X is the L_y × (L_x + L_y − 1) Sylvester matrix of x(n), Y is the L_x × (L_x + L_y − 1) Sylvester matrix of y(n), and x and y are the signals written as column vectors of lengths L_x and L_y.

The linear system of equation (2.32) can be solved by computing the Moore-Penrose pseudo-inverse⁵ of the matrix H, denoted H⁺. The inverse filter is then computed as

g = H⁺d.    (2.36)

As d = [1, 0, …, 0]ᵀ, g is actually the first column of H⁺.

The linear system of equation (2.32) has infinitely many solutions. The pseudo-inverse method gives the solution with the smallest 2-norm.
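The MINT computation of equations (2.32)–(2.36) fits in a few lines of Python/numpy (random channels of an assumed length L = 16 stand in for measured room responses; random FIR filters are coprime almost surely):

```python
import numpy as np

def sylvester(h, L):
    """L x (2L-1) Sylvester matrix of the length-L filter h (eq. 2.34)."""
    S = np.zeros((L, 2 * L - 1))
    for row in range(L):
        S[row, row : row + L] = h
    return S

rng = np.random.default_rng(2)
L, M = 16, 2
channels = [rng.standard_normal(L) for _ in range(M)]   # stand-ins for H1, H2

# Stack [H1^T H2^T] and solve H g = d via the pseudo-inverse (eqs. 2.32, 2.36)
H = np.hstack([sylvester(h, L).T for h in channels])    # (2L-1) x (M L)
d = np.zeros(2 * L - 1)
d[0] = 1.0
g = np.linalg.pinv(H) @ d                               # min-norm solution
g1, g2 = g[:L], g[L:]

# Check the MINT identity (2.31): h1*g1 + h2*g2 = delta
total = np.convolve(channels[0], g1) + np.convolve(channels[1], g2)
```

Note that the recovered inverse filters are plain FIR filters of the same length as the channels, with no delay, which is exactly the advantage over the single-microphone case.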

⁵The Moore-Penrose pseudo-inverse is the matrix H⁺ of the same dimensions as Hᵀ satisfying four conditions: HH⁺H = H, H⁺HH⁺ = H⁺, HH⁺ and H⁺H are Hermitian.

Chapter 3

    Enhancement of a speech signal

    Reverberation produces a distortion that alters the intelligibility of speech. A

    possible approach to the dereverberation problem is to consider the general prop-

    erties of a speech signal, which are degraded by the reverberation.

    A simple way to improve the reverberant signal is, for example, to detect re-

    verberation tails between words. By removing, or attenuating, these parts which

    only contain reverberation, the listening comfort is slightly improved. However,

    this method, which is used in hearing aids, does not remove the distortion which

    alters the words.

The two methods presented in this chapter use, more or less explicitly, the harmonicity property of the voiced segments of a speech signal in order to try and recover the clean signal. In section 3.1 the approach of Nakatani [4], using an adaptive harmonic filter, will be described. In section 3.2 an adaptive algorithm

    working on the LP residual of the speech signal will be presented.

    3.1 Harmonicity based dereverberation

Nakatani et al. propose in [4] an interesting single-microphone dereverberation method called Harmonicity based dEReverBeration (HERB). This method

    is based on the harmonicity model of speech described in section 2.1.

    The principle is to estimate a dereverberation operator using the harmonic parts

    of the speech signal. This operator, initially designed for the harmonic parts, is

    expected to work on the non-harmonic parts as well.


    Figure 3.1: Spectrograms of a sweeping sinusoid and its reverberant signal.

The performance of this method, presented in [4] and [10], is impressive. In this

    section we will begin by describing the principle of this dereverberation process.

    Then we will discuss its applicability on ASIMO.

    3.1.1 Effect of reverberation on a sweep signal

In order to understand the basic idea of the HERB method, it is useful to observe the effect of reverberation on a sweeping sinusoid. In discrete time the sinusoidal sweep is defined by

s(n) = A sin(2π((k/2)(n/f_s)² + f_start(n/f_s)))    (3.1)

where A is the amplitude, f_start is the frequency at t = 0, f_s is the sampling frequency, and k a constant. Its instantaneous frequency varies linearly in time:

ν(n) = k(n/f_s) + f_start.    (3.2)
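Equation (3.1) can be used directly to generate such a sweep. A brief numpy sketch (with A = 1; the 100–4000 Hz range over half a second matches the experiment described next, the sampling rate is an assumption):

```python
import numpy as np

fs, dur = 8000, 0.5
f_start, f_end = 100.0, 4000.0
k = (f_end - f_start) / dur               # sweep rate so that nu(n) ends at f_end
n = np.arange(int(fs * dur))

# Sinusoidal sweep, eq. (3.1); instantaneous frequency k*(n/fs) + f_start, eq. (3.2)
s = np.sin(2 * np.pi * ((k / 2) * (n / fs) ** 2 + f_start * (n / fs)))
```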

Figure 3.1 (left) shows the spectrogram of a half-second-long discrete signal whose frequency sweeps from 100 to 4000 Hz. This spectrogram is obtained using a Gammatone filter-bank, therefore the frequency scale is not linear (see (1.2)).

    resulting spectrogram is shown on figure 3.1 (right). We can observe, from this

    spectrogram, that the sinusoidal component corresponding to the original signal

can be clearly identified. In each frequency band, the energy corresponding to this "direct signal" appears first and is followed by a reverberation tail. At a given point in time, the energy of the signal is maximum for the frequency corresponding to the direct signal.

The idea of the HERB method is to track the instantaneous frequency ω(l) of the dominant sinusoidal component in the reverberant signal at each short time frame. The amplitude A(l) and phase φ(l) of this dominant sinusoid are extracted and used to synthesize the signal

ŝ(n) = Σ_l g(n − n_l) A(l) cos(ω(l)(n/f_s) + φ(l)),    (3.3)

where g(n − n_l) is a window function for overlap-add synthesis and n_l is the time index centered in frame l.

    3.1.2 Adaptive harmonic filtering

Although a sweep signal contains only one dominant sinusoid, a harmonic signal contains several sinusoidal components whose frequencies correspond to its fundamental frequency F0 and its multiples (cf. section 2.1). The aim of a harmonic filter is to enhance these components. Since the fundamental frequency of a speech signal changes over time, the properties of the filter have to be adaptively modified according to F0 (see figure 3.2).

x(n) → [Harmonic Filter] → x̃(n), with the filter parameters controlled by an [Estimation of F0] block

Figure 3.2: Adaptive harmonic filtering

A simple approach to harmonic filtering is a comb filter of the form 1 + z^{−τ}, where τ is the period to be enhanced. The method proposed by Nakatani in [4] is to

    filter the signal by synthesizing a harmonic sound as follows:

    1. The fundamental frequency of the observed signal is estimated at each time

    frame. If the time frame is short enough this fundamental frequency can be

    considered constant.


2. The amplitudes and phases of individual harmonic components are estimated using the short time Fourier transform (STFT), X(l,m), of x(n):

X(l,m) = Σ_n g₁(n − n_l) x(n) e^{−j2πm(n−n_l)/M},    (3.4)
Â_{k,l} = |X(l, [kF_{0,l}])|,    (3.5)
φ̂_{k,l} = ∠X(l, [kF_{0,l}]),    (3.6)

where l is the index of the time frame, n_l is the time index corresponding to the center of the frame, m is the index of the frequency bin, M is the number of points used for the Discrete Fourier Transform (DFT), Â_{k,l} and φ̂_{k,l} are respectively the estimated amplitude and phase of the k-th harmonic component, F_{0,l} is the fundamental frequency of the time frame, g₁(n) is an analysis window function and [·] discretizes a continuous frequency into the index of the nearest frequency bin.

3. The output of the filter, x̃(n), is obtained by adding sinusoids,

x̃_l(n) = Σ_k Â_{k,l} cos(2π k F_{0,l} (n − n_l)/f_s + φ̂_{k,l}),    (3.7)

and combining them over succeeding frames,

x̃(n) = Σ_l g₂(n − (n_l + lT)) x̃_l(n),    (3.8)

where x̃_l(n) is a synthesized harmonic sound corresponding to the time frame l, T is the frame shift in samples and g₂(n) is a synthesis window function.
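Steps 1–3 above can be condensed into a short sketch. In the following Python version (the frame length, hop size and a perfectly known, constant F0 are simplifying assumptions, and a numpy STFT replaces the Matlab implementation of the thesis), only the bins at multiples of F0 are resynthesized:

```python
import numpy as np

def harmonic_filter(x, f0, fs, frame=400, hop=200, n_harm=10):
    """Adaptive harmonic filter, steps 1-3: per frame, read amplitude and
    phase at the STFT bins nearest k*F0 and resynthesize those sinusoids."""
    win = np.hanning(frame)
    wsum = win.sum()
    out = np.zeros(len(x))
    t = np.arange(frame) / fs
    for start in range(0, len(x) - frame + 1, hop):
        X = np.fft.rfft(win * x[start : start + frame])        # eq. (3.4)
        synth = np.zeros(frame)
        for k in range(1, n_harm + 1):
            m = int(round(k * f0 * frame / fs))                # nearest bin [k F0]
            if m <= 0 or m >= len(X):
                break
            A = 2 * np.abs(X[m]) / wsum                        # eq. (3.5)
            phi = np.angle(X[m])                               # eq. (3.6)
            synth += A * np.cos(2 * np.pi * k * f0 * t + phi)  # eq. (3.7)
        out[start : start + frame] += win * synth              # overlap-add, (3.8)
    return out

fs = 8000
n = np.arange(fs)                         # 1 s of a 200 Hz harmonic tone
x = np.cos(2 * np.pi * 200 * n / fs) + 0.5 * np.cos(2 * np.pi * 400 * n / fs)
y = harmonic_filter(x, f0=200.0, fs=fs)   # y approximates x away from the edges
```

On a clean harmonic tone the filter is nearly transparent; its purpose only becomes visible once reverberation adds energy between and around the harmonic bins.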

Actually the harmonic filter itself is easy to implement. The main issue is to find an accurate estimate of the fundamental frequency of the signal, even in the case of strong reverberation.

    3.1.3 Dereverberation operator

    Harmonic case

The dereverberation operator is computed in the frequency domain using the short-time Fourier transform. Let X(l,m) be the STFT of a reverberant signal.


X(l,m) can be represented as the product of the source signal, S(l,m), and the room transfer function, H(m), which is assumed to be time-invariant (cf. section 2.2). This transfer function can be divided into two functions, D(m) and R(m). The former corresponds to the direct signal, D(m)S(l,m), and the latter to the reverberant part, R(m)S(l,m):

X(l,m) = H(m)S(l,m) = D(m)S(l,m) + R(m)S(l,m)    (3.9)

The aim of the dereverberation operator is to estimate the direct signal X̂(l,m) = D(m)S(l,m).

It can be obtained by subtracting the reverberant part R(m)S(l,m) from equation (3.9), or by finding the inverse filter W(m) such that

W(m) = D(m)/H(m)    (3.10)

Then

X̂(l,m) = W(m)X(l,m) = (D(m)/H(m))(H(m)S(l,m)) = D(m)S(l,m).    (3.11)

The basic idea of the HERB method is the following: if S(l,m) is a harmonic signal, the direct signal contained in X(l,m) can be obtained using an adaptive harmonic filter. At each time frame l an inverse filter W₀(l,m) is computed in the frequency domain using the output X̃(l,m) of the harmonic filter:

W₀(l,m) = X̃(l,m) / X(l,m)    (3.12)

As the signal X̃(l,m) is supposed to contain only the direct part of the signal X(l,m), this filter will remove the reverberation on the time frame.

As the effect of the room is supposed to be constant, the dereverberation operator W(m) is estimated by averaging the inverse filters computed at the different time frames:

W(m) = E{W₀(l,m)}    (3.13)
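In the idealized case where the harmonic filter returns exactly the direct part, the averaging of equations (3.12)–(3.13) reduces to a few numpy lines (synthetic random spectra replace real STFTs, and D = 0.5·H is an arbitrary illustrative choice):

```python
import numpy as np

def estimate_herb_operator(X, X_tilde, eps=1e-12):
    """Dereverberation operator W(m), eqs. (3.12)-(3.13): average over the
    time frames l of the ratio X_tilde(l,m) / X(l,m)."""
    W0 = X_tilde / (X + eps)      # per-frame inverse filters, eq. (3.12)
    return W0.mean(axis=0)        # expectation over frames, eq. (3.13)

rng = np.random.default_rng(3)
frames, bins = 200, 64
S = rng.standard_normal((frames, bins)) + 1j * rng.standard_normal((frames, bins))
H = rng.standard_normal(bins) + 1j * rng.standard_normal(bins)
D = 0.5 * H                        # pretend the direct part is half of H

# Ideal harmonic-filter output: X_tilde = D S; observed signal: X = H S
W = estimate_herb_operator(H * S, D * S)   # converges to D(m)/H(m)
```

With a real harmonic filter the per-frame ratios are noisy, which is exactly why the expectation over many frames is needed.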


    General case

This process can be applied to a speech signal S(l,m) by rewriting equation (2.1) in the frequency domain as

S(l,m) = S_h(l,m) + S_n(l,m)    (3.14)

where S_h(l,m) is the harmonic part and S_n(l,m) is the non-harmonic part. The observed reverberant signal X(l,m) is rewritten as

X(l,m) = D(m)S_h(l,m) + (R(m)S_h(l,m) + H(m)S_n(l,m))    (3.15)

where H(m) is the transfer function of the room, H(m) = D(m) + R(m).

The component D(m)S_h(l,m) can be approximately extracted from X(l,m) with an adaptive harmonic filter. This approximated direct signal X̃(l,m) can be modeled as

X̃(l,m) = D(m)S_h(l,m) + (R̃_h(l,m) + H̃_n(l,m))    (3.16)

where R̃_h(l,m) is a part of the reverberation of S_h(l,m) and H̃_n(l,m) is a part of the direct signal and reverberation of S_n(l,m). It can be assumed, if the fundamental frequency is perfectly estimated, that the only estimation errors on X̃(l,m) are caused by these two unexpected remaining parts.

The dereverberation estimator is computed as in the harmonic case (3.12):

W(m) = E{W₀(l,m)} = E{X̃(l,m)/X(l,m)}    (3.17)

Using equation (3.16), the operator over a time frame, W₀(l,m), can be rewritten as (the time and frequency indexes have been removed for clarity)

W₀ = X̃/X = (D S_h + R̃_h + H̃_n) / (H S)    (3.18)
   = ((D + R̃_h/S_h) S_h) / (H (S_h + S_n)) + ((H̃_n/S_n) S_n) / (H (S_h + S_n))    (3.19)
   = (D + R̃_h/S_h)/H · 1/(1 + S_n/S_h) + (H̃_n/S_n)/H · 1/(1 + S_h/S_n)    (3.20)

It can be proven (see appendix A) that the expected value of a function 1/(1+z), where z is a complex random variable, is equal to the probability that |z| < 1, if it is assumed that the phase of z is uniformly distributed, that the phase of z and its absolute value are statistically independent, and that |z| ≠ 1.

Then, under the following conditions [10]:

1. The phase of S_n(l,m) and a joint event composed of S_h(l,m), R̃_h(l,m) and |S_n(l,m)| are independent,

2. The phase of S_h(l,m) and a joint event composed of |S_h(l,m)|, H̃_n(l,m) and S_n(l,m) are independent,

3. The phases of S_h(l,m) and S_n(l,m) are uniformly distributed within [0, 2π),

4. |S_h(l,m)| ≠ |S_n(l,m)|,

equation (3.17) can be derived as

W(m) ≈ (D(m) + R̄(m)) / H(m) · P{|S_h(l,m)| > |S_n(l,m)|},    (3.21)

where

R̄(m) = E_t{ R̃_h(l,m)/S_h(l,m) }_{|S_h(l,m)|>|S_n(l,m)|},    (3.22)

P{·} is a probability function, and E_t{·}_A represents an average function over time frames under a condition where A holds.

In the derivation of W(m), the term corresponding to the non-harmonic part is neglected. Actually it is expected that, when |S_n(l,m)| > |S_h(l,m)|, i.e. the signal is non-harmonic over the time frame, the output of the harmonic filter is equal to zero, so that

E_t{ H̃_n(l,m)/S_n(l,m) }_{|S_n(l,m)|>|S_h(l,m)|} ≈ 0.    (3.23)

Given equation (3.21), W(m) is expected to remove reverberation not only from the harmonic components of the speech signal but also from the non-harmonic ones. It approximates the inverse filter D(m)/H(m), except for a remaining reverberation due to R̄(m).

The enhanced signal is then expected to be the sum of the direct signal and a reduced reverberation part. However, because of the term P{|S_h(l,m)| > |S_n(l,m)|}, the gain of the filter becomes zero in frequency regions where harmonic components were not included during the estimation of the dereverberation operator.

In addition, even in frequency regions in which harmonic components were present during the estimation phase, the filter gain is expected to decrease as the frequency increases. The reason is that in a speech signal the energy of higher-order harmonic components is smaller than the energy of the sinusoidal component at the fundamental frequency and its first few multiples.

    3.1.4 The HERB method

Figure 3.3 summarizes the dereverberation process using the HERB method.

S(l,m) → H(m) → X(l,m) → W(m) → Ŝ(l,m), with W(m) = E{X̃(l,m)/X(l,m)} estimated from the output X̃(l,m) of the harmonic filter

    Figure 3.3: Diagram of the HERB dereverberation method

    The system consists of the following sub-procedures:

    1. Estimation of the fundamental frequency and the voiced durations.

    2. Harmonic filtering.

    3. Estimation of the dereverberation operator.

    4. Dereverberation of the signal using this operator.

The estimation of the fundamental frequency is the most important point of the method. F0 must be robustly estimated in order to obtain a good dereverberation. This task is difficult in the presence of strong reverberation. Nakatani [10] proposes to repeat the dereverberation process of figure 3.3 three times:


    STEP 1: The dereverberation process is applied on the observed reverberant

    signal.

    STEP 2: The dereverberated signal obtained in the first step is used to estimate

    the fundamental frequency. This new F0 is used to control the adaptive

    harmonic filter, but the input of this filter is the original reverberant signal.

STEP 3: The dereverberated signal obtained in the second step is now used as the new reverberant signal, which is enhanced by reapplying the whole dereverberation process to it.

    According to [10], the quality of the dereverberated signal improves at each step.

The second step can be repeated several times to further improve the estimation of the fundamental frequency. By contrast, the quality of the signal does not always improve when the third step is repeated; this is due to an accumulation of errors in the dereverberation filters.

    3.1.5 Test of the method

The performance of this method reported in [4] and [10] is impressive. We implemented the method in order to test two points:

Can the STFT easily be replaced by a Gammatone filter-bank?

Can the process work in real time and in real environments?

For HRI the interest of this method lies in the fact that the fundamental frequency of the signal is already computed for other processes. A pitch-tracking algorithm has already been developed at HRI [11] and can readily be used to estimate the fundamental frequency of the signals. As the computation of the fundamental frequency seems to be the critical point of the HERB algorithm, we expected a lot from this method.

    Harmonic Filter

The aim of this section is to compare the harmonic filter proposed by Nakatani in [4] with an implementation of the harmonic filter on the Gammatone filter-bank.

  • 54 3 Enhancement of a speech signal

The implementation of the harmonic filter on the filter-bank is simple. The outputs of the Gammatone filter-bank are N signals corresponding to the different frequency channels of the cochlea response. Knowing the fundamental frequency, the frequency channels corresponding to F0 and its multiples are determined. At each time sample t the signals of these channels are added.
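This channel-selection scheme can be sketched as follows. The function name, the tolerance rule, and the toy channels are illustrative assumptions; in practice the Gammatone filter-bank supplies the channel signals and their center frequencies.

```python
import numpy as np

# Sketch of harmonic filtering on a filter-bank: keep only the channels
# whose center frequency lies close to a multiple of F0, then sum them.
def harmonic_filter(channels, center_freqs, f0, n_harmonics=10, tol_ratio=0.05):
    channels = np.asarray(channels)                    # shape (N, T)
    center_freqs = np.asarray(center_freqs, dtype=float)
    harmonics = f0 * np.arange(1, n_harmonics + 1)
    # relative distance of each channel to the nearest harmonic of f0
    dist = np.abs(center_freqs[:, None] - harmonics[None, :])
    keep = (dist / harmonics[None, :]).min(axis=1) < tol_ratio
    return channels[keep].sum(axis=0), keep

# toy example: six "channels" carrying pure tones, F0 = 100 Hz
fs, T = 8000, 256
t = np.arange(T) / fs
center_freqs = [100, 150, 200, 250, 300, 430]
channels = [np.sin(2 * np.pi * f * t) for f in center_freqs]
y, keep = harmonic_filter(channels, center_freqs, f0=100.0)
# keep selects the channels at 100, 200 and 300 Hz
```

The selection step is recomputed whenever F0 changes, which is what makes the filter adaptive.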

The resulting spectrograms (see figure 3.4) show that both implementations of the harmonic filter give similar results.

Figure 3.4: Top left: original signal (sweep with harmonics). Top right: reverberant signal. Bottom left: harmonic estimate with the Gammatone filter-bank. Bottom right: harmonic estimate with Nakatani's harmonic filter.

As expected, the adaptive harmonic filter can be implemented without problems on the Gammatone filter-bank. Moreover, the assumption that the fundamental frequency is constant over a short time frame is no longer required, as our filter-bank performs the harmonic filtering without the subsampling imposed by the STFT used in [4]. The improvement of the method using time warping, proposed in [12], is therefore unnecessary in a Gammatone implementation.


    Dereverberation operator

In order to compute the dereverberation operator a training sequence is required. During this adaptation phase the room impulse response must not change.

In the simulation the training sequence was composed of several sweeping sinusoids with their harmonics, similar to the one shown in figure 3.4. The operator is computed using a short-time Fourier transform. The restriction on the time window is very strong: it has to be long enough to contain a whole word (sweep) including the complete reverberation tail.

It is important to note at this point that the windows of the short-time Fourier transform for the harmonic filter and for the estimation of the dereverberation operator cannot be the same. For the harmonic filtering the analysis window must be as short as possible in order to respect the assumption that the fundamental frequency of the signal is constant. For the estimation of the filter W(m), on the other hand, the time window of the STFT must be several seconds long.

It is also assumed that, during the adaptation phase of the dereverberation filter, the pauses between the words are long enough that the reverberation tail of one word does not alter the following word. When these conditions are respected, the dereverberation operator can be computed.
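The per-bin estimation of W(m) can be sketched as follows. The MMSE-style average W(m) = E{X^ X*}/E{X X*} used below is an assumed form for illustration; the exact averaging used in [10] may differ.

```python
import numpy as np

# Sketch of the operator estimation: for each frequency bin m, average
# over the frames l of the training data. X is the STFT of the observed
# signal, X_harm the harmonic estimate from the harmonic filter.
def estimate_operator(X, X_harm, eps=1e-12):
    """X, X_harm: complex STFT arrays of shape (frames, bins)."""
    num = (X_harm * np.conj(X)).mean(axis=0)
    den = (X * np.conj(X)).mean(axis=0).real + eps
    return num / den

# sanity check: if the harmonic estimate equals a fixed per-bin scaling
# of X, the estimated operator recovers exactly that scaling
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) + 1j * rng.normal(size=(50, 8))
true_W = np.linspace(0.5, 2.0, 8)
W = estimate_operator(X, true_W * X)
```

The averaging over many frames is what creates the heavy data requirement discussed below: W(m) only converges once enough voiced training material has been observed for every bin.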

In order to estimate the performance of the algorithm, 500 random harmonic signals (sweeps with harmonics) of 0.5 s each are used as training data. These signals are convolved with the room impulse response shown in figure 2.7. As the exact fundamental frequencies of these signals are known, a good estimation of the dereverberation operator can be expected. This dereverberation operator is then used to enhance a real speech signal convolved with the same room impulse response (see figure 3.5).

Figure 3.6 shows the spectrogram of the enhanced signal. It is important to note here that the speech signal used for the test contains only one word. As the time window of the STFT used to compute the dereverberation operator is long enough to contain a whole word, the enhanced signal is obtained by multiplying the FFT of the whole reverberant signal with the dereverberation operator.

The dereverberation works relatively well. However, we can see in the spectrogram that the dereverberation filter is not causal. This is not surprising, as we explained in section 2.3 that room impulse responses are in general not minimum-phase. Because of the non-causality the beginning of the signal is altered. This can


Figure 3.5: Spectrogram of the clean and the reverberant signal used to test the dereverberation operator.

    Figure 3.6: Spectrogram of the enhanced signal computed in the frequency domain.


be a problem, as this part of the signal normally contains no reverberation and can, therefore, be valuable for some processing steps.

We also tried to express the dereverberation filter in the time domain, using the inverse Fourier transform of the operator computed in the frequency domain. In this case the dereverberation does not work anymore (see figure 3.7).

Figure 3.7: Impulse response of the dereverberation operator and spectrogram of the enhanced signal computed in the time domain.

    3.1.6 Discussion of the method

This method can theoretically remove very strong reverberation from a signal. Moreover, as the dereverberation operator is computed in the frequency domain, the computational cost is quite low.

However, the amount of required adaptation data is prohibitive (about 500 words). In addition, the pauses between words have to be very long in the case of long reverberation. It is therefore impossible to use this method in real-time applications: it would be quite bothersome to have to speak with ASIMO for several minutes before it begins to understand what is said, and this assuming that neither the robot nor the speaker moves during this time.

Given the restrictions on the adaptation phase of the algorithm, this method cannot be applied in real environments. In addition, a remark can be made on the use of harmonic filtering to remove reverberation, even for a highly harmonic signal. The harmonic filter manages quite well to remove the reverberation when the fundamental frequency changes within the word. However, a large part of the reverberation remains if the fundamental frequency changes too slowly. Figure 3.8


    shows the effect of the reverberation on the fundamental frequency in such a

    case.

    Figure 3.8: Effect of the reverberation on the fundamental frequency.

In this example the component corresponding to the fundamental frequency is strongly disturbed by the reverberation. Increasing the frequency resolution (the number of channels of the filter-bank) solves this problem, but more than 1000 channels are required to reach a frequency selectivity as good as that of the human auditory system. Due to the resulting computational load this is not feasible for real-time applications.


    3.2 Dereverberation using LP analysis

The dereverberation using the harmonicity of the signal requires too much training data. Therefore, in this section, another dereverberation method will be discussed. This method uses the autoregressive (AR) model of speech signals. Several methods based on linear prediction (LP) analysis have been proposed [13], [14], [15].

    3.2.1 Problem formulation

In section 2.1.4 it was explained that a speech signal s(n) can be expressed as a linear combination of its L past sample values. The clean and the reverberant speech signals become, respectively,

s(n) = Σ_{k=1}^{L} a_k s(n−k) + e_s(n),   (3.24)

x(n) = Σ_{k=1}^{L} b_k x(n−k) + e_x(n),   (3.25)

where a_k and b_k are the LP coefficients and e_s(n) and e_x(n) the LP residual signals (or prediction errors).

The important assumption on which dereverberation methods using LP analysis are based is that the LP coefficients are unaffected by the reverberation:

b_k = a_k,   ∀ k ∈ [1, L].   (3.26)

Actually this assumption holds only in a spatially averaged sense [16], i.e. using several microphones:

E{b_k} = a_k,   ∀ k ∈ [1, L].   (3.27)

Consequently the dereverberation process tries to enhance the LP residual of the signal, whose structure is well known (see 2.1.4). The aim of the dereverberation methods using LP analysis is to improve the LP residual signal such that ê(n) ≈ e_s(n). A clean speech estimate is then obtained by

ŝ(n) = Σ_{k=1}^{L} b_k ŝ(n−k) + ê(n),   (3.28)

i.e. the LP coefficients obtained by linear prediction analysis of the reverberant signal are used to synthesize a signal from the enhanced excitation signal ê(n).
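The LP analysis/synthesis round trip described above can be sketched as follows. This is a minimal implementation using the autocorrelation (Yule-Walker) method, not the thesis code; function names and the model order are illustrative.

```python
import numpy as np

def lp_coefficients(x, order):
    """Solve the Yule-Walker normal equations R a = r for the predictor a_k."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lp_residual(x, a):
    """Prediction error e(n) = x(n) - sum_k a_k x(n-k)."""
    e = x.copy()
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * x[:-k]
    return e

def lp_synthesis(e, a):
    """Rebuild s(n) = sum_k a_k s(n-k) + e(n) recursively."""
    s = np.zeros_like(e)
    for n in range(len(e)):
        s[n] = e[n] + sum(ak * s[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
    return s

# round trip: analysis followed by synthesis returns the signal exactly
rng = np.random.default_rng(1)
x = rng.normal(size=400)
a = lp_coefficients(x, order=8)
e = lp_residual(x, a)
x_rec = lp_synthesis(e, a)
```

The round trip is exact by construction; the dereverberation methods below keep the analysis and synthesis stages and only replace the residual e(n) by an enhanced version.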


    3.2.2 The kurtosis as measure of the reverberation

Figure 3.9: Example of platykurtic (left) and leptokurtic (right) distributions. Both distributions have the same standard deviation.

Gillespie shows in [14] that the kurtosis of the LP residual is a valid reverberation metric. The kurtosis β₂ of a random signal x(n) measures the degree of peakedness of its distribution and is defined as the fourth central moment μ₄ normalized by the fourth power of the standard deviation (or the square of the variance):

β₂ = μ₄ / σ⁴ = E{(x(n) − μ)⁴} / (E{(x(n) − μ)²})²,   (3.29)

where μ = E{x(n)} is the mean value of x(n). As the kurtosis of a normal distribution is equal to 3, the kurtosis excess, denoted γ₂ and defined by

γ₂ = μ₄ / σ⁴ − 3,   (3.30)

is often used. A distribution with a high peak (γ₂ > 0) is called leptokurtic, a flat-topped distribution (γ₂ < 0) is called platykurtic, and the normal distribution is called mesokurtic.
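Equations (3.29) and (3.30) amount to a few lines of code; the sketch below checks the three named cases on synthetic data (the distributions chosen are illustrative).

```python
import numpy as np

# Direct implementation of equations (3.29)/(3.30).
def excess_kurtosis(x):
    """gamma_2 = mu_4 / sigma^4 - 3."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    mu4 = ((x - mu) ** 4).mean()
    return mu4 / sigma2 ** 2 - 3.0

rng = np.random.default_rng(2)
g2_normal = excess_kurtosis(rng.normal(size=200_000))      # mesokurtic: ~ 0
g2_laplace = excess_kurtosis(rng.laplace(size=200_000))    # leptokurtic: > 0
g2_uniform = excess_kurtosis(rng.uniform(-1, 1, 200_000))  # platykurtic: < 0
```

The Laplace distribution is a useful mental model here, since the peaky, heavy-tailed LP residual of clean voiced speech is strongly leptokurtic.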

Figure 3.9 illustrates the kurtosis measure. The distribution on the right is more peaked at the center, so one might conclude that it has a lower standard deviation. On the other hand, it has thicker tails, which usually indicates a higher standard deviation. If the effect of the peakedness exactly offsets that of the thick tails, the two distributions have the same standard deviation.

For clean voiced speech, the LP residuals have strong peaks corresponding to glottal pulses (see figure 3.10), whereas for reverberant speech these peaks are spread in time. In figure 3.11, the probability density functions of a clean signal and of the convolution of this signal with the room impulse response measured in the CARL Group's office (see figure 2.7) are estimated. Both signals have been centered and normalized such that their means equal 0 and their standard


Figure 3.10: Left: extract of the LP residuals of a speech signal; note the strong peaks corresponding to the glottal pulses. Right: the same signal impaired by reverberation.

Figure 3.11: Estimation of the probability density functions of the LP residuals of a clean speech signal (blue) and of a reverberant signal (red). Both signals have been centered and normalized such that their means equal 0 and their standard deviations equal 1.


deviations equal 1. The probability density functions are estimated by computing the histograms of the signals and normalizing them by the number of samples in the signals. The LP residual of the clean signal (blue) has a more strongly peaked distribution (kurtosis of 42). The effect of the room reduces this peakedness: the kurtosis of the LP residuals of the reverberant signal (red) equals 10. By maximizing the kurtosis of the LP residuals we can therefore expect to improve the quality of the observed signal.

    3.2.3 Maximization of the kurtosis

In the time domain

In order to enhance the reverberant signal x(n), an adaptive filter can be built which maximizes the kurtosis of its LP residual x̃(n). Given an L-tap adaptive filter h(n) at time n, the output of this filter is ỹ(n) = hᵀ(n) x̃(n), where x̃(n) = [x̃(n−L+1), …, x̃(n−1), x̃(n)]ᵀ. An LP synthesis filter then yields y(n), the final processed signal. The adaptation of h(n) is similar to a traditional Least-Mean-Square (LMS) adaptive filter [17], except that the optimized value is a feedback function f(n) corresponding to the gradient of the kurtosis.

Figure 3.12 (a) shows a diagram of the maximization system. The problem with this arrangement is the LP reconstruction artifacts. However, since the system is linear, the order of the filters can be changed arbitrarily: h(n) can be computed from x̃(n) but applied directly to x(n) (see figure 3.12 (b)).

A gradient method can be used to optimize the kurtosis. The gradient of the kurtosis is given by

∂β₂/∂h = 4 (E{ỹ²} E{ỹ³ x̃} − E{ỹ⁴} E{ỹ x̃}) / E³{ỹ²}.   (3.31)

This gradient can be approximated by

∂β₂/∂h ≈ ( 4 (E{ỹ²} ỹ² − E{ỹ⁴}) ỹ / E³{ỹ²} ) x̃ = f(n) x̃(n),   (3.32)

where f(n) is the feedback function used to control the filter updates. For continuous adaptation, the expected values E{ỹ²} and E{ỹ⁴} are estimated recursively by

E{ỹ²(n)} = β E{ỹ²(n−1)} + (1 − β) ỹ²(n),   (3.33)

E{ỹ⁴(n)} = β E{ỹ⁴(n−1)} + (1 − β) ỹ⁴(n),   (3.34)


Figure 3.12: (a) A single-channel time-domain adaptive algorithm for maximizing the kurtosis of the LP residuals. (b) Equivalent system, which avoids LP reconstruction artifacts.

where β < 1 controls the smoothness of the estimates.

The update equation of the filter is given by

h(n+1) = h(n) + μ f(n) x̃(n),   (3.35)

where μ controls the speed of adaptation.
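Equations (3.32)-(3.35) translate into a simple sample-by-sample update loop. The sketch below is illustrative only: the step size, the smoothing factor, the filter length, and the initializations are assumptions, not values from [14].

```python
import numpy as np

# Sample-by-sample sketch of the kurtosis-maximizing update loop
# (equations 3.32-3.35). mu, beta, L and the initial values are
# illustrative assumptions.
def adapt_kurtosis_filter(x_tilde, L=20, mu=1e-4, beta=0.99):
    h = np.zeros(L)
    h[0] = 1.0                       # start from a pass-through filter
    Ey2, Ey4 = 1.0, 3.0              # running estimates of E{y^2}, E{y^4}
    y = np.zeros_like(x_tilde)
    for n in range(L - 1, len(x_tilde)):
        xv = x_tilde[n - L + 1:n + 1][::-1]   # [x(n), x(n-1), ..., x(n-L+1)]
        y[n] = h @ xv
        Ey2 = beta * Ey2 + (1 - beta) * y[n] ** 2          # (3.33)
        Ey4 = beta * Ey4 + (1 - beta) * y[n] ** 4          # (3.34)
        f = 4 * (Ey2 * y[n] ** 2 - Ey4) * y[n] / Ey2 ** 3  # (3.32)
        h = h + mu * f * xv                                # (3.35)
    return h, y

rng = np.random.default_rng(3)
residual = rng.normal(size=5000)     # stand-in for an LP residual
h, y = adapt_kurtosis_filter(residual)
```

Note that the feedback f(n) is scale-invariant in a self-stabilizing way: scaling ỹ by c scales f by 1/c, so the update does not blow up as the output grows.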

    In the frequency domain

However, according to Haykin [17], the convergence of an LMS-like algorithm in the time domain is very slow. Therefore, Gillespie [14] proposes to adapt the algorithm in the frequency domain. Moreover, by using more microphones and calculating the feedback function on an averaged output of all channels, the accuracy and the speed of the adaptation are increased.

The frequency-domain method proposed in [14] uses a modulated complex lapped transform (MCLT) [18]. This filter-bank structure is close to a Discrete Cosine Transform (DCT). The general diagram of the method in the frequency domain for two microphones is shown in figure 3.13.


Figure 3.13: Two-channel frequency-domain adaptive algorithm for maximization of the kurtosis of the LP residual.

    3.3 Discussion of the method

The maximization of the kurtosis permits real-time dereverberation. The adaptation is quick if a short adaptive filter is used. However, in the case of strong reverberation the improvement of the signal is not perceptible.

If the length of the adaptive filter is increased, the kurtosis is still maximized and the algorithm converges to a signal with maximal kurtosis. But the resulting signal sometimes has a higher kurtosis than the original clean signal; the sound is then strongly distorted and sometimes not understandable anymore. Figure 3.14 shows the original LP residual of the clean signal. This signal is artificially reverberated and then enhanced by maximizing the kurtosis of the LP residuals. The resulting LP residual has a higher kurtosis than the original one. This means that the maximization has to be constrained: the clean speech has a higher kurtosis than the reverberant one, but this does not mean that the signal with the highest kurtosis is the clean signal.

In practice the length of the adaptive filter must not be longer than the period of the glottal pulses. With this constraint, the efficiency of the dereverberation is limited.

Another drawback of this method is that the LP analysis, as explained in section 2.1.4, approximates the magnitude spectrum of the speech signal very well but strongly alters the phase spectrum. As the phase is crucial for source localization, it should be studied whether this method dramatically alters the phase information of the signal.


Figure 3.14: Left: the LP residual of a clean signal. Right: the LP residual of the resulting dereverberated signal. The kurtosis of the dereverberated signal is higher than the kurtosis of the original signal; the resulting signal is strongly distorted.

Chapter 4

Equalization of room impulse responses

In chapter 3 the dereverberation approach considered the effect of the room as a distortion which alters the harmonicity of the speech signal. This chapter discusses methods to estimate room impulse responses. These estimated impulse responses can then be equalized (inverted) in order to recover the original clean speech signal (see section 2.3).

In section 4.1 the principle of a channel estimation method using the second-order statistics of the observed signals is explained. Then, in sections 4.2 and 4.3, two different implementations of this principle are discussed. Finally, in section 4.4, some ideas for improvement are proposed.

    4.1 Principle of the channel estimation

Some methods have been proposed to estimate a single channel. For example, Hopgood proposes in [19] a single-channel estimation method based on the non-stationarity of speech and the stationarity of the room impulse response. However, in most cases these methods require the input signal to be white noise, which is not the case for a speech signal. The simultaneous estimation of several room impulse responses, on the contrary, is possible [20]. Moreover, as explained in section 2.3, it is much easier to find a global inverse for two or more room impulse responses than the inverse of a single one. In this section a method is presented which permits estimating the impulse responses of a


    Single-Input Multiple-Output (SIMO) system using only second order statistics

    (SOS).

    4.1.1 Hypothesis

In [20] Tong et al. show that a Single-Input Multiple-Output (SIMO) system can be identified under the following conditions:

    1. The autocorrelation matrix of the source signal is of full rank.

    2. The channel transfer functions do not share any common zeros.

    4.1.2 Basic idea

The relation between the input and the outputs of a SIMO system (see figure 4.1) is

x_i(n) = h_i(n) * s(n),   i ∈ [1, M].   (4.1)

[Block diagram: the source s(n) passes through the M channels h_1(n), h_2(n), …, h_M(n), producing the observations x_1(n), x_2(n), …, x_M(n).]

    Figure 4.1: SIMO System

In vector/matrix form, this signal model becomes

x_i(n) = H_iᵀ s(n),   (4.2)

where

x_i(n) = [x_i(n), x_i(n−1), …, x_i(n−L+1)]ᵀ,   (4.3)

H_i = ⎡ h_i(0) ⋯ h_i(L−1)              0      ⎤
      ⎢          ⋱            ⋱               ⎥   (4.4)
      ⎣ 0            h_i(0)   ⋯   h_i(L−1)    ⎦

is the L × (2L−1) Sylvester matrix of h_i(n), and

s(n) = [s(n), s(n−1), …, s(n−2L+2)]ᵀ,   (4.5)


    where L is the maximum length of the room impulse responses.

The idea of blind SIMO identification is to study the matrix

Rx = ⎡ Σ_{i≠1} R_{x_i x_i}    −R_{x_2 x_1}     ⋯   −R_{x_M x_1}        ⎤
     ⎢ −R_{x_1 x_2}    Σ_{i≠2} R_{x_i x_i}     ⋯   −R_{x_M x_2}        ⎥
     ⎢       ⋮                   ⋮             ⋱         ⋮             ⎥   (4.6)
     ⎣ −R_{x_1 x_M}     −R_{x_2 x_M}           ⋯   Σ_{i≠M} R_{x_i x_i} ⎦

where R_{x_i x_j} = E{x_i(n) x_jᵀ(n)} are the auto- and cross-correlation matrices of the observed signals. The matrices R_{x_i x_j} can be written as

R_{x_i x_j} = (1/T) X_i X_jᵀ,   (4.7)

where X_i is the L × (T + L − 1) Sylvester matrix of x_i(n) and T is the number of samples of x_i(n).

If the matrix Rx is multiplied by the vector h = [h_1ᵀ h_2ᵀ ⋯ h_Mᵀ]ᵀ, we obtain (for the first L rows):

Σ_{i≠1} R_{x_i x_i} h_1 − R_{x_2 x_1} h_2 − ⋯ − R_{x_M x_1} h_M
  = (1/T) Σ_{i≠1} X_i (X_iᵀ h_1) − (1/T) X_2 (X_1ᵀ h_2) − ⋯ − (1/T) X_M (X_1ᵀ h_M)
  = (1/T) Σ_{i≠1} X_i (X_iᵀ h_1 − X_1ᵀ h_i).

A left multiplication by the transpose of a Sylvester matrix is a convolution, so the term X_iᵀ h_1 − X_1ᵀ h_i actually equals

x_i(n) * h_1(n) − x_1(n) * h_i(n) = s(n) * (h_i(n) * h_1(n) − h_1(n) * h_i(n)),   (4.8)

and, as the convolution of real signals is commutative, this term equals zero. The same development can be carried out for the other rows of the matrix product Rx h, which gives

Rx h = 0,   (4.9)

    which means that the vector h lies in the null space of the matrix Rx.
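This null-space property can be checked numerically on a toy two-channel system. The sketch below is illustrative (all sizes and signals are assumptions), not the implementation discussed in the following sections.

```python
import numpy as np

# Numerical check of Rx h = 0 (equation 4.9) for M = 2 channels.
def sylvester(x, L):
    """L x (T + L - 1) Sylvester (convolution) matrix of the signal x."""
    T = len(x)
    X = np.zeros((L, T + L - 1))
    for i in range(L):
        X[i, i:i + T] = x
    return X

rng = np.random.default_rng(4)
L = 5
s = rng.normal(size=200)                  # unknown source
h1, h2 = rng.normal(size=L), rng.normal(size=L)
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)

X1, X2 = sylvester(x1, L), sylvester(x2, L)
T = X1.shape[1]
R11, R22 = X1 @ X1.T / T, X2 @ X2.T / T   # auto-correlation matrices
R12, R21 = X1 @ X2.T / T, X2 @ X1.T / T   # cross-correlation matrices

# two-channel instance of the matrix in equation (4.6)
Rx = np.block([[R22, -R21], [-R12, R11]])
h = np.concatenate([h1, h2])
residual = np.abs(Rx @ h).max()           # ~ 0: h lies in the null space
```

Only second-order statistics of the observations enter Rx; neither the source s(n) nor the channels themselves are needed to build it, which is the whole point of the method.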

    4.1.3 How can this idea be implemented?

There are two distinct approaches to identify the SIMO system:


1. An eigenvalue decomposition is performed on the matrix Rx and its null space is computed [21]. Under the hypothesis that R_s = E{s(n) sᵀ(n)} is of full rank, this null space corresponds to the unknown system h. This batch method is discussed in section 4.2.

2. A set of filters g_i(n) is adaptively estimated such that ∀(i, j): h_i(n) * g_j(n) − h_j(n) * g_i(n) = 0 [22]. This iterative method is discussed in section 4.3.

4.1.4 Why do the channels have to be coprime?

The second hypothesis, which requires that the channel transfer functions do not share any common zeros, can be explained as follows.

Consider, for example, two channels with impulse responses h_1(n) and h_2(n). If the transfer functions of these channels share common zeros, then the impulse responses can be rewritten as

h_1(n) = d(n) * h̃_1(n),   (4.10)
h_2(n) = d(n) * h̃_2(n),   (4.11)

where d(n) is, by analogy with polynomials, the greatest common divisor of h_1(n) and h_2(n), and the transfer functions of h̃_1(n) and h̃_2(n) are coprime (do not share any common zeros). Then x_1(n) and x_2(n) become

x_1(n) = (s(n) * d(n)) * h̃_1(n),   (4.12)
x_2(n) = (s(n) * d(n)) * h̃_2(n),   (4.13)

and, if the correlation matrix of s(n) * d(n) is of full rank, the methods will identify the system [h̃_1(n) h̃_2(n)] instead of [h_1(n) h_2(n)].
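A small numerical experiment illustrates why common zeros are problematic: the cross-relation is satisfied by the reduced coprime system just as well as by the true one, so the shared factor d(n) is invisible to the method. All signals and lengths below are illustrative.

```python
import numpy as np

# Toy illustration of the common-zero problem: the cross-relation
# x1 * g2 - x2 * g1 = 0 holds for the reduced system [h1_tilde, h2_tilde]
# as well, so the shared factor d(n) cannot be recovered.
rng = np.random.default_rng(5)
d = np.array([1.0, -0.8])                 # shared factor: common zero at z = 0.8
h1_tilde = rng.normal(size=4)             # coprime parts
h2_tilde = rng.normal(size=4)
h1 = np.convolve(d, h1_tilde)             # equation (4.10)
h2 = np.convolve(d, h2_tilde)             # equation (4.11)

s = rng.normal(size=300)
x1 = np.convolve(s, h1)                   # equation (4.12)
x2 = np.convolve(s, h2)                   # equation (4.13)

# both the true and the reduced system satisfy the cross-relation
res_full = np.abs(np.convolve(x1, h2) - np.convolve(x2, h1)).max()
res_reduced = np.abs(np.convolve(x1, h2_tilde) - np.convolve(x2, h1_tilde)).max()
```

Both residuals vanish, so from the observations alone the method has no way to prefer [h_1, h_2] over [h̃_1, h̃_2]; the identified channels are then the shorter, reduced ones.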

    4.1.5 Estimation of the length of the filters

Both the batch and the iterative implementation require that the lengths of the channels are given. The estimation of these lengths is very important, as we will explain in this subsection.

In the two-microphone case, the channel estimation tries to find two FIR filters g_1(n) and g_2(n) of length L_g + 1 such that

h_1(n) * g_2(n) − h_2(n) * g_1(n) = 0,   (4.14)


where h_1(n) and h_2(n) are the two unknown FIR filters we want to identify. The lengths of these filters are equal to L_h + 1, which is also unknown.

In the z-domain, the relation between the filters can be written as an equality of polynomials:

H_1(z) G_2(z) = H_2(z) G_1(z).   (4.15)

The polynomials H_1(z)G_2(z) and H_2(z)G_1(z) are equal if and only if they have exactly the same L_h + L_g zeros. As H_1(z) and H_2(z) do not share common zeros, each zero of H_1(z), resp. H_2(z), must also be a zero of G_1(z), resp. G_2(z). G_1(z) and G_2(z) therefore contain at least L_h zeros, so L_g ≥ L_h.

When L_g = L_h, the method directly returns the estimated channels. However, when the length of the filters (or channel order) is over-estimated, additional

    zeros appear. Figure 4.2 illustrates the system in the two-m