Blind Dereverberation

  • Technische Universität Darmstadt

    Institut für Automatisierungstechnik

    Fachgebiet Regelungstheorie und Robotik

    Prof. Dr.-Ing. Jürgen Adamy

    Landgraf-Georg-Str. 4

    D-64283 Darmstadt

    Diplomarbeit

    Study of blind dereverberation algorithms for real-time

    applications

    Xavier Domont

    Work in cooperation with:

    Honda Research Institute Europe GmbH

    D-63073 Offenbach/Main

    Tutors:

    Dr.-Ing. Martin Heckmann (HRI)

    Dipl.-Ing. Bjoern Scholing (TUD)

    June 2005

  • Abstract

    At Honda Research Institute Europe, an automatic speech recognition system is being developed for the humanoid robot ASIMO. Reverberation alters the perception of speech signals emitted in a room and reduces the performance of automatic speech recognition. Many methods have been proposed over the past few decades to enhance reverberant speech signals. This diploma thesis studies the most promising algorithms and discusses whether they can be implemented in real time for real environments.

    The existing methods can be classified into two families:

    1. Those that directly estimate the clean speech signal and treat reverberation as a disturbance.

    2. Those that estimate the room impulse response and invert the estimated system to recover the clean speech.

    These two approaches are compared in this thesis, based on MATLAB implementations of selected algorithms. The focus of this comparison is on the suitability of these algorithms for real environments, where speaker and robot are moving, and on a possible real-time implementation.

  • Kurzfassung

    At Honda Research Institute Europe, an automatic speech recognition system is being developed for the robot ASIMO. Reverberation degrades speech quality and noticeably lowers speech recognition results. Over the past 30 years, many methods have been proposed to enhance speech signals. This diploma thesis examines the most promising algorithms with regard to real-time capability and applicability under real conditions.

    There are two approaches to solving this problem:

    1. The original speech signal can be estimated directly from the observed signal. The reverberation effect is treated as a disturbance of the clean signal.

    2. The room impulse response can be determined and then inverted in order to recover the original speech signal.

    These two approaches are compared in this diploma thesis. For this purpose, selected algorithms were implemented. The main focus of the comparison was the applicability of the methods in real environments, in which speaker and robot move.

  • Contents

    1 Introduction 17

    1.1 What is blind dereverberation? . . . . . . . . . . . . . . . . . . . 17

    1.2 Motivation of this diploma-thesis . . . . . . . . . . . . . . . . . . 18

    1.3 Audio processing architecture on ASIMO . . . . . . . . . . . . . . 19

    1.3.1 Overview of the peripheral auditory system . . . . . . . . 20

    1.3.2 The Gammatone filterbank, a model of the basilar membrane 21

    1.4 Overview of this report . . . . . . . . . . . . . . . . . . . . . . . . 22

    2 Model of a reverberant signal 25

    2.1 Properties of a speech signal . . . . . . . . . . . . . . . . . . . . . 25

    2.1.1 Quick overview of the speech production system . . . . . . 25

    2.1.2 Speech segments categorization . . . . . . . . . . . . . . . 27

    2.1.3 Harmonicity of a speech signal . . . . . . . . . . . . . . . 28

    2.1.4 Linear prediction analysis . . . . . . . . . . . . . . . . . . 28

    2.2 Room acoustics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

    2.2.1 Measurement of real room impulse responses . . . . . . . . 31

    2.2.2 Simulation of the room impulse response . . . . . . . . . . 32

    2.2.3 Linear Time-Invariant model of the room . . . . . . . . . . 35

    2.2.4 Effect on the spectrogram . . . . . . . . . . . . . . . . . . 36


    2.3 Inversion of the room impulse response . . . . . . . . . . . . . . . 37

    2.3.1 Conditions on the inversion of FIR filters . . . . . . . . . . 37

    2.3.2 Are room transfer functions minimum-phase? . . . . . . . . 39

    2.3.3 Multiple input inverse filter . . . . . . . . . . . . . . . . . 41

    3 Enhancement of a speech signal 45

    3.1 Harmonicity based dereverberation . . . . . . . . . . . . . . . . . 45

    3.1.1 Effect of reverberation on a sweep signal . . . . . . . . . . 46

    3.1.2 Adaptive harmonic filtering . . . . . . . . . . . . . . . . . 47

    3.1.3 Dereverberation operator . . . . . . . . . . . . . . . . . . . 48

    3.1.4 The HERB method . . . . . . . . . . . . . . . . . . . . . . 52

    3.1.5 Test of the method . . . . . . . . . . . . . . . . . . . . . . 53

    3.1.6 Discussion of the method . . . . . . . . . . . . . . . . . . . 57

    3.2 Dereverberation using LP analysis . . . . . . . . . . . . . . . . . . 59

    3.2.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . 59

    3.2.2 The kurtosis as a measure of the reverberation . . . . . . . 60

    3.2.3 Maximization of the kurtosis . . . . . . . . . . . . . . . . . 62

    3.3 Discussion of the method . . . . . . . . . . . . . . . . . . . . . . . 64

    4 Equalization of room impulse responses 67

    4.1 Principle of the channel estimation . . . . . . . . . . . . . . . . . 67

    4.1.1 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.1.2 Basic idea . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.1.3 How can this idea be implemented? . . . . . . . . . . . . . 69

    4.1.4 Why do the channels have to be coprime? . . . . . . . . . . 70

    4.1.5 Estimation of the length of the filters . . . . . . . . . . . . 70


    4.2 Batch method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

    4.2.1 Extraction of the common part . . . . . . . . . . . . . . . 72

    4.2.2 Noisy case . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

    4.3 Iterative method . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

    4.3.1 Choice of the optimization method . . . . . . . . . . . . . 76

    4.3.2 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

    4.4 Improvement of the method . . . . . . . . . . . . . . . . . . . . . 77

    4.5 Discussion of the channel estimation methods . . . . . . . . . . . 79

    5 Conclusion and outlook 81

    5.1 Review of the studied methods . . . . . . . . . . . . . . . . . . . 81

    5.1.1 Harmonicity-based dereverberation . . . . . . . . . . . . . 81

    5.1.2 Linear prediction analysis . . . . . . . . . . . . . . . . . . 81

    5.1.3 Channel estimation . . . . . . . . . . . . . . . . . . . . . . 82

    5.1.4 Direct comparison of the methods . . . . . . . . . . . . . . 82

    5.2 Speech model based method vs. channel estimation . . . . . . . . 83

    5.3 What should we decide for ASIMO? . . . . . . . . . . . . . . . . . 83

    A Proofs 87

  • List of Figures

    1.1 Different paths of a sound wave in a room . . . . . . . . . . . . . 17

    1.2 General model of a reverberant signal . . . . . . . . . . . . . . . . 18

    1.3 General shape of a room impulse response . . . . . . . . . . . . . 18

    1.4 Peripheral auditory system [1] . . . . . . . . . . . . . . . . . . . . 20

    1.5 Impulse and frequency responses of a Gammatone filter . . . . . . 21

    1.6 Analysis filters of a Gammatone filter-bank with 16 channels. . . . 22

    2.1 General model of a reverberant signal . . . . . . . . . . . . . . . . 25

    2.2 Schematic diagram of the human speech production mechanism

    (source: [3]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

    2.3 Block diagram of the human speech production (source: [3]) . . . 27

    2.4 Discrete-Time speech production model. (a) True Model. (b)

    Model to be estimated using LP analysis. (source [3]) . . . . . . . 30

    2.5 System to identify (one microphone case) . . . . . . . . . . . . . . 31

    2.6 Measurement method . . . . . . . . . . . . . . . . . . . . . . . . . 32

    2.7 Example of measured room impulse response . . . . . . . . . . . . 32

    2.8 Image Method: Direct path . . . . . . . . . . . . . . . . . . . . . 33

    2.9 Image Method: virtual source . . . . . . . . . . . . . . . . . . . . 33

    2.10 Image Method: Sound wave reflecting off two walls . . . . . . . . 33

    2.11 Room impulse response simulated with the image method . . . . . 35


    2.12 Spectrograms of an anechoic signal (left) and the resulting spectrogram of its convolution with the impulse response of figure 2.7 (right). These spectrograms were obtained with a Gammatone filterbank. . . . . . . . 37

    2.13 Inversion of a filter . . . . . . . . . . . . . . . . . . . . . . . . . . 38

    2.14 Pole and zero of an all-pass filter . . . . . . . . . . . . . . . . 40

    2.15 Energy of a non-minimum phase system (dashed - blue) and the

    corresponding minimum-phase system (red). . . . . . . . . . . . . 41

    2.16 Multiple input inverse filter . . . . . . . . . . . . . . . . . . . . . 42

    3.1 Spectrograms of a sweeping sinusoid and its reverberant signal. . . 46

    3.2 Adaptive harmonic filtering . . . . . . . . . . . . . . . . . . . . . 47

    3.3 Diagram of the HERB dereverberation method . . . . . . . . . . . 52

    3.4 Up-left: original signal (sweep with harmonics). Up-right: reverberant signal. Bottom-left: harmonic estimate with the Gammatone filter-bank. Bottom-right: harmonic estimate with Nakatani's harmonic filter. . . . . . 54

    3.5 Spectrogram of the clean and reverberant signal used to test the

    reverberation operator. . . . . . . . . . . . . . . . . . . . . . . . . 56

    3.6 Spectrogram of the enhanced signal computed in the frequency

    domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

    3.7 Impulse response of the dereverberation operator and spectrogram

    of the enhanced signal computed in the time domain. . . . . . . . 57

    3.8 Effect of the reverberation on the fundamental frequency. . . . . . 58

    3.9 Example of platykurtic (left) and leptokurtic (right) distributions.

    Both distributions have the same standard deviation . . . . . . . 60

    3.10 On the left, extract of the LP residuals of a speech signal. Note

    the strong peaks corresponding to the glottal pulses. On the right,

    the same signal impaired by reverberations. . . . . . . . . . . . . 61


    3.11 Estimation of the probability density functions of the LP residuals of a clean speech signal (blue) and of a reverberant signal (red). Both signals have been centered and normalized such that their means are 0 and their standard deviations are 1. . . . . . 61

    3.12 (a) A single channel time-domain adaptive algorithm for maximiz-

    ing the kurtosis of the LP residuals. (b) Equivalent system, which

    avoids LP reconstruction artifacts. . . . . . . . . . . . . . . . . . . 63

    3.13 Two-channel frequency-domain adaptive algorithm for maximiza-

    tion of the kurtosis of the LP residual. . . . . . . . . . . . . . . . 64

    3.14 On the left the LP residual of a clean signal. On the right the LP

    residual of the resulting dereverberated signal. The kurtosis of the

    dereverberated signal is higher than the kurtosis of the original

    signal. The resulting signal is strongly distorted. . . . . . . . . . . 65

    4.1 SIMO System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

    4.2 Channel identification with overestimated channel orders. . . . . . 71

    4.3 Estimated zeros and real zeros for one channel (left); zeros of all the estimated channels (right). On the left, 4 estimated zeros are isolated: they do not correspond to a real zero of the filter. On the right it can be noticed that these 4 additional zeros are common to all the estimated channels. . . . . . 72

    4.4 Eigenvalues of the matrix Rx in the noiseless case. On the right:

    zoom on the smallest eigenvalues. . . . . . . . . . . . . . . . . . . 73

    4.5 Left: 4 of the 11 eigenvectors of the null space. Right: common part of the null space (blue) and real impulse response (red). The impulse responses of the 2 channels are concatenated and 10 zeros (corresponding to the over-estimation of the order) were added. . . . . . 74

    4.6 Eigenvalues of the correlation matrix in the noisy case. The variance of the noise is equal to 10^-10 on the left and 10^-6 on the right. . . . . . 75


    4.7 Iterative estimation of the channel impulse responses using two microphones. On the left the estimated zeros (blue) of one of the channels are compared with their real values (red). On the right the remaining impulse response after inversion of the system is drawn (blue); in the ideal case it should be a Dirac (red). . . . . . 77

    4.8 Iterative estimation of the channel impulse responses using 5 microphones. On the left the estimated zeros (blue) of one of the channels are compared with their real values (red). On the right the remaining impulse response after inversion of the system is drawn (blue); in the ideal case it should be a Dirac (red). . . . . . 78

    4.9 Comparison of the position of the zeros when the convolution and

    the subsampling are performed in a different order. . . . . . . . . 79

  • Acronyms

    FFT Fast Fourier Transform

    DFT Discrete Fourier Transform

    STFT Short-Time Fourier Transform

    SISO Single-Input Single-Output

    SIMO Single-Input Multiple-Output

    MIMO Multiple-Input Multiple-Output

    ROC Region of Convergence

    LTI Linear Time-Invariant

    FIR Finite Impulse Response

    IIR Infinite Impulse Response

    MINT Multiple input inverse filter

  • Chapter 1

    Introduction

    1.1 What is blind dereverberation?

    Figure 1.1: Different paths of a sound wave in a room

    The acoustic signals emitted in a room reflect off the walls and other objects

    (see figure 1.1). The direct signal and all the reflected sound waves arrive at the microphone or listener with different delays and sum up. This effect is called reverberation. Sometimes the term echo is used instead of reverberation. However, echo generally implies a distinct, delayed version of a sound. In a room, each delayed sound wave arrives within such a short period of time that we do not perceive each reflection as a copy of the original sound. Even though we can't discern


    every reflection, we still hear the effect of the entire series of reflections.

    Whereas a human being without hearing problems can cope quite well with these distortions, the reverberation effect impairs speech intelligibility in devices such as hands-free conference telephones and automatic speech recognition systems.

    The diagram in figure 1.2 shows how the system can be modeled. The effect of

    the room is considered as a filter with impulse response h(t) whose input is the

    clean speech signal s(t) and the output is the observed reverberant signal x(t).

    s(t) → h(t) → x(t)

    Figure 1.2: General model of a reverberant signal

    Figure 1.3 shows the general shape of a room impulse response. The reverbera-

    tion corrupts the speech by blurring its temporal structure. However, due to the

    spectral continuity of speech, the early reflections mainly increase the intensity of

    the reverberant speech, whereas the later ones are deleterious to speech quality

    and intelligibility.

    Figure 1.3: General shape of a room impulse response

    The aim of blind dereverberation is to recover the clean signal s(t) from the observed reverberant signal x(t). The term blind means that neither the clean signal nor the impulse response of the room is known before processing.

    1.2 Motivation of this diploma-thesis

    This diploma thesis was written in cooperation with Honda Research Institute

    (HRI) Europe. One of the important projects of HRI is the development of the


    humanoid robot ASIMO (Advanced Step in Innovative MObility). At HRI Europe

    the CARL (Child-like Acquisition of Representation and Language) Group of Dr.

    Frank Joublin aims at developing a system of automatic speech recognition (ASR)

    and production for ASIMO. As the distortions caused by reverberation degrade the performance of ASR, we will investigate whether a signal processing method can be found to dereverberate the signals heard by ASIMO.

    During the past decades many dereverberation methods have been proposed.

    However, no standard method has yet been found and this research topic is still

    very active. The aim of this diploma thesis is to survey the state of the art of the existing methods and then to evaluate whether some of them could be integrated into the audio processing system of ASIMO.

    The important requirements for ASIMO are, firstly, that the dereverberation is performed in real time and, secondly, that the system must adapt to a real and changing environment. This means that the algorithms have to adapt themselves to the room conditions faster than these conditions change. As both ASIMO and the speaker can be moving, the effects of the room can change very rapidly.

    To perform this study we selected, out of the recently proposed methods, the ones that seemed the most promising. The selected methods were then implemented in MATLAB in order to determine their advantages and drawbacks. For the implementation, we tried, wherever possible, to use the existing audio processing architecture of ASIMO, described in section 1.3.

    In addition to analyzing their performance, we will discuss whether the studied methods, while enhancing the perception of speech, alter signal characteristics that are used by subsequent audio processing on ASIMO. In particular, the phase spectrum is essential to the localization of a speech source.

    1.3 Audio processing architecture on ASIMO

    The audio processing system at HRI uses a Gammatone filterbank. This type

    of filterbank is widely used in audio signal processing as it simulates the human auditory system.


    Figure 1.4: Peripheral auditory system [1]

    1.3.1 Overview of the peripheral auditory system

    The aim of the peripheral auditory system (see figure 1.4) is to transform a sound

    (which is actually a pressure variation in air) into nerve impulses. These impulses

    are then conveyed by the auditory nerve to the brain stem. The nerve cells in

    the brain stem act as relay stations, eventually conveying nerve impulses to the

    auditory cortex.

    The outer ear is composed of the pinna (the visible part) and the auditory canal

    or meatus. The pinna significantly modifies the incoming sound in a way that

    depends on the angle of incidence of the sound relative to the head. This is

    important for sound localization. Sound travels down the meatus and causes the eardrum, or tympanic membrane, to vibrate. These vibrations are transmitted through the middle ear by three small bones, the ossicles, to a membrane-covered

    opening in the bony wall of the spiral-shaped structure of the inner ear, the

    cochlea.

    The cochlea is shaped like the spiral shell of a snail. It is filled with almost incompressible fluids and is divided along its length by two membranes, Reissner's membrane and the basilar membrane. The motion of the basilar membrane

    in response to a sound is of primary interest.


    1.3.2 The Gammatone filterbank, a model of the basilar

    membrane

    A point on the basilar membrane is characterized by its impulse response. The

    Gammatone function approximates physiologically recorded impulse responses:

    g(t) = t^(n-1) exp(-2πbt) cos(2πf0t + φ) (1.1)

    where t is the time (t ≥ 0), b determines the duration of the impulse response, n is the order of the filter and determines the slope of the skirts of the filter, φ is a phase and f0 is the center frequency.

    Figure 1.5: Impulse and frequency responses of a Gammatone filter

    It can be observed from figure 1.5 that the Gammatone filter is a bandpass with

    its center frequency at f0. Its bandwidth depends on b.

    To simulate the whole basilar membrane, a bank of Gammatone filters can be

    used. Each filter channel represents the frequency response of one point on the

    basilar membrane.

    The parameters of the Gammatone filters are determined from psychoacoustic

    measurements. Glasberg and Moore [2] summarized the equivalent rectangular

    bandwidth (ERB) of the human auditory filter. The ERB of a filter is defined as

    the width of a rectangular filter whose height equals the peak gain of the filter

    and which passes the same total power as the filter.

    The relation between the bandwidth and the center frequency of the Gammatone

    filters is given by:

    ERB = 24.7 + 0.108 f0. (1.2)
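    Equations (1.1) and (1.2) can be combined into a small sketch (an illustrative reimplementation, not the HRI code; the factor 1.019 linking b to the ERB is a common fourth-order approximation and is an assumption here):

```python
import numpy as np

def erb(f0):
    """Equivalent rectangular bandwidth in Hz, equation (1.2)."""
    return 24.7 + 0.108 * f0

def gammatone_ir(f0, fs=16000, n=4, phase=0.0, duration=0.025):
    """Sampled Gammatone impulse response, equation (1.1).

    The factor 1.019 relating b to the ERB is a common fourth-order
    approximation and an assumption of this sketch.
    """
    b = 1.019 * erb(f0)
    t = np.arange(1, int(duration * fs) + 1) / fs   # t > 0
    g = (t ** (n - 1) * np.exp(-2 * np.pi * b * t)
         * np.cos(2 * np.pi * f0 * t + phase))
    return g / np.max(np.abs(g))                    # peak-normalize

# Bandwidth grows with the center frequency, as for the bank in figure 1.6.
for f0 in (100.0, 1000.0, 4000.0):
    print(f"f0 = {f0:6.0f} Hz, ERB = {erb(f0):6.1f} Hz")
```

    Filtering a signal with a bank of such impulse responses, one per center frequency, yields the spectrogram-like representation used throughout this thesis.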


    Figure 1.6 shows the transfer functions for a bank of 16 filters with center frequencies spaced between 50 Hz and 8 kHz. As the spectral resolution of the basilar membrane decreases with increasing frequency, the center frequencies of the Gammatone filters are not linearly distributed and their bandwidths increase

    with the center frequency according to equation (1.2). We can also note that the

    pass bands overlap.

    Figure 1.6: Analysis filters of a Gammatone filter-bank with 16 channels.

    1.4 Overview of this report

    The existing blind dereverberation methods can be classified into two families.

    1. We can directly estimate the clean speech signal, or the parameters and excitation of an appropriate parametric model, as a missing-data problem, treating reverberation as a disturbance.

    2. We can model the effect of the room by a filter. The coefficients of this filter are estimated by treating the clean speech as a disturbance. The observed signal is then deconvolved with the estimated filter to recover the clean speech.

    In chapter 2 we will discuss how speech signals and room impulse responses can

    be modeled. This modeling step is essential to determine what, in the observed

    signal, is due to the speech and what is an effect of the room.


    In chapter 3 two methods that use the properties of speech to enhance reverberant signals will be studied. These methods consider the room effect to be a disturbance and try to restore the characteristics of the speech that the reverberation altered.

    In chapter 4 the possibility of estimating the room impulse responses will be discussed. This approach is attractive because, once the effect of the room on the signal is known, it becomes possible to invert this effect and recover the clean signal.

  • Chapter 2

    Model of a reverberant signal

    In terms of signal processing a room can be seen as a filter. The original (anechoic)

    signal s(n) goes through a filter h(n) and gives the reverberant signal x(n), see

    figure 2.1. In the case of blind dereverberation both the input signal s(n) and the room impulse response h(n) are unknown.

    s(n) → h(n) → x(n)

    Figure 2.1: General model of a reverberant signal

    The task of dereverberation is to find an estimate ŝ(n) of s(n), given the output

    x(n) of the system. In order to make this task feasible, a model of the speech

    signal and/or a model of the room are required.

    In section 2.1 different ways to model a speech signal will be discussed. In section

    2.2 the effects of the room on the speech signal will be investigated. Finally, section 2.3 will discuss the possibility of inverting the effects of the room.

    2.1 Properties of a speech signal

    2.1.1 Quick overview of the speech production system

    The principal components of the human speech production system are (see figure 2.2) the lungs, trachea (windpipe), larynx (organ of voice production), pharyngeal cavity (throat), oral or buccal cavity (mouth), and nasal cavity (nose). The pharyngeal and oral cavities are usually grouped and referred to as the vocal tract.

    Figure 2.2: Schematic diagram of the human speech production mechanism (source: [3])

    It is useful to think of speech production in terms of an acoustic filtering oper-

    ation. The pharyngeal, oral and nasal cavities comprise the main acoustic filter.

    This filter is excited by the organs below it, and is loaded at its main output by

    a radiation impedance due to the lips. The articulators are used to change the

    properties of the system, its form of excitation, and its output loading over time.

    Figure 2.3 shows a simplified acoustic model illustrating these ideas.


    Figure 2.3: Block diagram of the human speech production (source: [3])

    2.1.2 Speech segments categorization

    The spectral characteristics of the speech wave are non-stationary, since the phys-

    ical system changes rapidly over time. Speech can therefore be divided into sound

    segments which present similar properties over a short period of time. Without

    going into further detail, the main way to classify a speech sound is by the type of excitation.

    The two elementary types of excitation are voiced and unvoiced. There are actually a few other types of excitation (mixed, plosive, whisper, silence), but they can be seen as combinations of the two elementary types.

    Voiced sounds are produced by forcing air through the glottis, an opening between the vocal folds. The vocal cords vibrate in an oscillatory fashion and, therefore, the produced speech signal is quasi-periodic; its period is called the fundamental period T0, and the fundamental frequency F0 can be defined as 1/T0.


    Unvoiced sounds are generated by forming a constriction at some point along the

    vocal tract, and forcing air through the constriction to produce turbulence. The

    produced speech signal is a noise-like sound.

    Typical human speech communication is limited to a bandwidth of 7-8 kHz. The

    main part of the energy is contained in voiced segments.

    2.1.3 Harmonicity of a speech signal

    A speech signal s(n) can be modeled [4] by using the sum of a harmonic signal

    sh(n), derived from a glottal vibration, and a non-harmonic signal sn(n), such as

    fricatives and plosives, as

    s(n) = sh(n) + sn(n). (2.1)

    The harmonic part of the signal is defined by its voiced durations and their fun-

    damental frequencies (F0). A voiced duration is the time during which the vocal

    cords vibrate to generate a harmonic signal and the fundamental frequency refers

    to the frequency of the fundamental component of the signal. Each harmonic

    component has a frequency which corresponds to F0 or its multiples.

    It can be assumed that F0 is constant within a short time; therefore the harmonic signal sh(n) can be modeled over a time frame of length T by a sum of sinusoidal components whose frequencies coincide with the fundamental frequency of the signal and its multiples:

    sh(n) = Σ_{k=1}^{N} Ak cos( 2π k F0 (n - nc) / fs + φk )  for |n - nc| < T/2 (2.2)

    where Ak and φk are the amplitude and the phase of the k-th harmonic component, nc the time index of the center of the frame and fs the sampling rate.
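    As a worked example of equation (2.2), the following sketch synthesizes one frame of a harmonic signal (the amplitudes Ak and phases φk are made up for illustration; this is not thesis code):

```python
import numpy as np

fs = 16000                            # sampling rate (assumed)
F0 = 200.0                            # fundamental frequency in Hz (assumed)
N = 5                                 # number of harmonic components
T = 512                               # frame length in samples
nc = T // 2                           # time index of the frame center
n = np.arange(T)

rng = np.random.default_rng(0)
A = 1.0 / np.arange(1, N + 1)         # made-up amplitudes A_k
phi = rng.uniform(-np.pi, np.pi, N)   # made-up phases phi_k

# Equation (2.2): the harmonic part is a sum of sinusoids whose
# frequencies are F0 and its integer multiples.
sh = sum(A[k] * np.cos(2 * np.pi * (k + 1) * F0 * (n - nc) / fs + phi[k])
         for k in range(N))

print(sh.shape)                       # one frame of the harmonic signal
```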

    2.1.4 Linear prediction analysis

    A widely used model of speech signals is given by Linear Prediction (LP) analysis.

    This model consists of separating the speech signal into an excitation signal and

    a model of the vocal tract.


    During a stationary frame of speech the model would ideally be characterized by

    a pole-zero transfer function of the form

    Θ(z) = Θ0 · ( 1 + Σ_{i=1}^{L} b(i) z^(-i) ) / ( 1 - Σ_{i=1}^{R} a(i) z^(-i) ) (2.3)

    which is driven by an excitation sequence

    e(n) = Σ_{q=-∞}^{+∞} δ(n - qP)  (voiced case), or zero-mean, unit-variance, uncorrelated noise (unvoiced case), (2.4)

    where δ(n) is the discrete Dirac impulse

    δ(n) = 1 if n = 0, 0 else. (2.5)

    The principle of the LP analysis is to approximate this pole-zero system with an

    all-pole system

    Θ̂(z) = 1 / ( 1 - Σ_{i=1}^{R} a(i) z^(-i) ) (2.6)

    which can be easily estimated by solving a system of linear equations. The

    schematics of the true speech model and of its LP approximation are shown in

    figure 2.4.

    A magnitude spectrum, but not a phase spectrum¹, can be exactly modeled with stable poles. This means that the LP analysis will model the true magnitude spectrum of the speech, which is, in most cases, enough for speech perception.

    For example, a listener moving from room to room within a house is able to clearly

    understand speech of a stationary talker, even if the phase relationships among

    the components are changing dramatically [3]. However, for some applications like

    the localization of the talker, the temporal dynamics of the sound are essential

    and the LP analysis should be used with care.

    ¹Actually the LP model has a minimum-phase characteristic. This notion will be discussed in more detail in section 2.3.


    Figure 2.4: Discrete-Time speech production model. (a) True Model. (b) Model tobe estimated using LP analysis. (source [3])

    To understand the name Linear Prediction, it is helpful to consider the LP analysis in the time domain. An all-pole transfer function corresponds to an autoregressive (AR) model, i.e. the signal s(n) can be expressed as a linear combination of its L past samples:

    s(n) = Σ_{k=1}^{L} ak s(n - k) + e(n) (2.7)

    where ak are the LP coefficients. In terms of system identification, the excitation signal e(n) can be seen as the prediction error signal, also called the LP residual.
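    The AR model of equation (2.7) can be estimated from the signal itself; the minimal least-squares sketch below (illustrative, not the thesis implementation) fits the coefficients ak on a synthetic AR signal and computes the LP residual e(n):

```python
import numpy as np

def lp_coefficients(s, order):
    """Least-squares estimate of a_k in s(n) = sum_k a_k s(n-k) + e(n)."""
    # Regression matrix whose rows hold the `order` past samples of s(n).
    rows = [s[n - order:n][::-1] for n in range(order, len(s))]
    a, *_ = np.linalg.lstsq(np.array(rows), s[order:], rcond=None)
    return a

def lp_residual(s, a):
    """Prediction error e(n) = s(n) - sum_k a_k s(n-k)."""
    order = len(a)
    pred = np.array([a @ s[n - order:n][::-1] for n in range(order, len(s))])
    return s[order:] - pred

# Synthetic AR(2) signal driven by white noise (toy example).
rng = np.random.default_rng(1)
e = rng.standard_normal(2000)
s = np.zeros(2000)
for n in range(2, 2000):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + e[n]

a = lp_coefficients(s, order=2)
res = lp_residual(s, a)
print(a)          # close to the true coefficients 1.3 and -0.6
```

    The residual res recovers, up to estimation error, the driving noise e; for voiced speech it would instead show the strong glottal pulses discussed in chapter 3.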

    2.2 Room acoustics

    This section will first present how room impulse responses can be measured

    (2.2.1) or simulated (2.2.2). The goal is to obtain a set of real and artificial


    impulse responses. These data will be useful in the next chapters to test the

    dereverberation methods.

    Then (2.2.3) we will discuss whether a general model of a room can be found. Finally (2.2.4), we will use time-frequency analysis to briefly study the effects of reverberation on speech signals.

    2.2.1 Measurement of real room impulse responses

    In order to get real impulse response corresponding to a normal room, we per-

    formed measurements in the office of the CARL group at HRI. A sound signal

    was played through the room by a loudspeaker. Simultaneously the sound wave

was recorded using a model of ASIMO's head equipped with two microphones.

s(n) → h(n) → x(n)

    Figure 2.5: System to identify (one microphone case)

For each microphone both the input s(n) and the output x(n) of the system are known; the impulse response h(n) can then be computed by inverting the convolution

x(n) = h(n) * s(n).    (2.8)

However, the measurement is generally altered by additive noise. To improve the

    measurement it is therefore better to use auto- and cross-correlation functions.

    Equation (2.8) becomes

R_sx(n) = h(n) * R_s(n)    (2.9)

where R_s(n) is the autocorrelation function of s(n) and R_sx(n) the cross-correlation function of s(n) and x(n). Equation (2.9) is less sensitive to noise.

Moreover, if s(n) is white noise, its autocorrelation function is equal to δ(n).

    Then

h(n) = R_sx(n), when R_s(n) = δ(n).    (2.10)

The impulse response of the room is then equal to the cross-correlation function between the white noise played by the loudspeaker and the signal recorded by the microphone.

    For our measurement we used 1 second of Gaussian white noise as room excitation

    signal.
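The white-noise identification described above can be checked numerically; the following Python/numpy sketch (a toy exponentially decaying "room" response and an 8 kHz rate are illustrative assumptions, not the measured data) recovers the impulse response from equation (2.10):

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 8000
s = rng.standard_normal(fs)                          # 1 s of Gaussian white noise
h = 0.9 ** np.arange(64) * rng.standard_normal(64)   # toy room impulse response
x = np.convolve(s, h)                                # microphone signal, eq. (2.8)

# Cross-correlate excitation and recording: R_sx(n) ~ h(n) since R_s(n) ~ delta(n)
Rsx = np.correlate(x, s, mode="full")[len(s) - 1 :] / len(s)
h_est = Rsx[:64]                                     # estimate of h(n), eq. (2.10)
```

The estimation error shrinks as the excitation gets longer, which is why a full second of noise was used in the measurement.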


    Two different sound cards were used to play and record the signals. In order to

    easily synchronize the input and the output of the system, the excitation signal

    was directly recorded on another channel of the capture sound card in addition

    to the signals of the microphones (see figure 2.6).


    Figure 2.6: Measurement method

Moreover, this method makes it possible to compensate for possible effects of the sound cards. As only a stereo sound card was available, the recordings for the left and right ears had to be performed separately. Figure 2.7 shows one of the measured impulse responses.

    Figure 2.7: Example of measured room impulse response

    2.2.2 Simulation of the room impulse response

A technique to simulate the impulse response of a room is the image method proposed in 1979 by Allen and Berkley [5]. It sums the direct path with all reflections off walls or objects.

An example in [6] shows the principle of this method. Figure 2.8 shows the direct path from a sound source to a microphone. Another part of the sound wave


    Figure 2.8: Image Method: Direct path

    is reflected off a wall and then impinges upon the microphone. This reverberated

sound seems to come directly from a virtual source located in an adjacent room that is the mirror image of the original room with respect to the wall (see figure 2.9). In this figure the black line represents the real path of the signal, whereas the blue line is its perceived path.


    Figure 2.9: Image Method: virtual source

    This process can be extended to sound waves that are reflected more than once

off the walls (see figure 2.10). This process can be continued in the same way in three dimensions to obtain an infinite number of virtual sources.

Figure 2.10: Image Method: Sound wave reflecting off two walls


The virtual sources make it easy to compute the distance the sound wave travels to reach the microphone.

Considering a rectangular room with dimensions (L_x, L_y, L_z), the coordinate vector r_{i,j,k} = (x_i, y_j, z_k)^T, (i, j, k) ∈ Z³, of a virtual source is:

x_i = (−1)^i x_source + (i + (1 − (−1)^i)/2) L_x
y_j = (−1)^j y_source + (j + (1 − (−1)^j)/2) L_y
z_k = (−1)^k z_source + (k + (1 − (−1)^k)/2) L_z    (2.11)

where r_source = (x_source, y_source, z_source)^T is the coordinate vector of the source. The distance from the virtual source to the microphone is

d_{i,j,k} = ||r_{i,j,k} − r_m|| = √((x_i − x_m)² + (y_j − y_m)² + (z_k − z_m)²)    (2.12)

where r_m = (x_m, y_m, z_m)^T is the coordinate vector of the microphone.

    The sound wave corresponding to the (i, j, k) virtual source will arrive at the

    microphone with a delay

τ_{i,j,k} = d_{i,j,k} / c    (2.13)

where c is the speed of sound. The impulse response of the room is the sum of the delayed impulses corresponding to the signals arriving from each virtual source:

h(t) = Σ_{i,j,k ∈ Z} h_{i,j,k} δ(t − d_{i,j,k}/c)    (2.14)

The magnitude h_{i,j,k} of the unit impulse is influenced by the distance the sound wave travels to get from the source to the microphone,

b_{i,j,k} = 1 / (4π d_{i,j,k}²)    (2.15)

and by the number of reflections off the walls,

c_{i,j,k} = β^{|i|+|j|+|k|}    (2.16)

where β < 1 is the wall reflection coefficient (which is, in this simple model, considered to be the same for all the walls). The overall magnitude is then

h_{i,j,k} = b_{i,j,k} c_{i,j,k}    (2.17)


Although the impulse response of the room should contain an infinite number of delayed impulses, corresponding to an infinity of virtual sources, the magnitudes h_{i,j,k} become very small for large |i|, |j| and |k|. The impulse response then has a finite duration:

h(t) = Σ_{i=−n}^{n} Σ_{j=−n}^{n} Σ_{k=−n}^{n} h_{i,j,k} δ(t − d_{i,j,k}/c)    (2.18)

    Figure 2.11: Room impulse response simulated with the image method

    Figure 2.11 shows a simulated room impulse response obtained with the image

method. Reverberant sounds generated using such an impulse response sound like signals recorded in real conditions. However, phenomena such as the phase inversion of the sound wave when it reflects off a wall, or the presence of objects and people, are ignored by this model.
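Equations (2.11)–(2.18) condense into a small simulation. The following Python/numpy sketch (room geometry, β, reflection order n and sampling rate are illustrative assumptions, not the values used for figure 2.11) builds a toy impulse response with the image method:

```python
import numpy as np

def image_method_rir(room, src, mic, beta=0.7, n=4, fs=8000, c=343.0):
    """Toy image-method room impulse response (equations 2.11-2.18)."""
    room, src, mic = (np.asarray(v, float) for v in (room, src, mic))
    length = int(fs * np.linalg.norm(room) * (2 * n + 1) / c) + 1
    h = np.zeros(length)
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            for k in range(-n, n + 1):
                ijk = np.array([i, j, k], float)
                # Virtual-source coordinates, eq. (2.11)
                img = (-1.0) ** ijk * src + (ijk + (1 - (-1.0) ** ijk) / 2) * room
                d = np.linalg.norm(img - mic)        # distance, eq. (2.12)
                delay = int(round(fs * d / c))       # arrival delay, eq. (2.13)
                if delay < length:
                    # distance attenuation (2.15) times wall absorption (2.16)
                    h[delay] += beta ** (abs(i) + abs(j) + abs(k)) / (4 * np.pi * d * d)
    return h

h = image_method_rir([5.0, 4.0, 3.0], [1.0, 1.0, 1.5], [3.5, 2.0, 1.5])
```

The first non-zero tap corresponds to the direct path; everything after it is the simulated reverberation tail.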

    2.2.3 Linear Time-Invariant model of the room

    The general shape of the measured and the simulated room impulse responses

corresponds to the one described in figure 1.3. However, when the conditions in the room change (movement of the talker and/or listener), the coefficients of the impulse response fluctuate strongly, especially in the late reverberation tail. As we explained in chapter 1, the distortions in the speech signal are mostly due to the late reverberation. Therefore, a model based on the image method, where the room impulse responses are modeled by a sum of delayed impulses h_{i,j,k} δ(t − τ_{i,j,k}), is not practical for system identification.

Actually, the only general properties which can be retained from a room impulse


    response are its linearity, its causality (there is no reverberation before the be-

    ginning of the signal) and its general exponential decay structure.

In real environments, the talker or the listener may be moving, so the effect of the room is time-variant. However, if we assume that the computation is fast enough, the system can be considered Linear Time-Invariant (LTI).

    Moreover, because of the exponential decay, the impulse response of the room has

    a finite duration. The room is then modeled by a Finite Impulse Response (FIR)

    filter. The relation between the input s(n) and the output x(n) is given by the

    convolution

x(n) = h(n) * s(n) = Σ_{k=0}^{L−1} h(k) s(n−k),    (2.19)

    where L is the length of the impulse response (also called order of the channel).

    Actually, the FIR model of the room impulse response is very practical as the

    transfer function of the system, i.e. the z-transform of its impulse response,

H(z) = Σ_{k=0}^{L−1} h(k) z^{−k}    (2.20)

    is defined for all finite z and is a polynomial.

    2.2.4 Effect on the spectrogram

    It is interesting to study the effect of reverberation on the spectrogram2 of a

    speech signal. Figure 2.12 shows the spectrograms of the same speech signal

    without and with reverberation.

The problem can be explained in the time-frequency domain as: "Given the spectrogram of the original signal at time frame t and frequency f, S(t, f), what is the influence of the room on the spectrogram of the reverberant signal at time-frequency bin (t′, f′), X(t′, f′)?"

The value of X at time frame t′ is only affected by bins of the original signal that are between time frames t′ and t′ − D, where D depends on the reverberation

²Instead of the spectrogram, the Gammatone filter-bank described in chapter 1 can be used. Contrary to a normal spectrogram, computed using a short-time Fourier transform, the Gammatone filter-bank gives an output for each time sample (no subsampling) and the center frequencies of the filters are not linearly distributed, which is closer to the human auditory system.


Figure 2.12: Spectrograms of an anechoic signal (left) and the resulting spectrogram of its convolution with the impulse response of figure 2.7 (right). These spectrograms were obtained with a Gammatone filter-bank.

time of the room, i.e. the time for the sound to die away to a level 60 dB below its original level. In the frequency domain, the reverberation slightly affects the adjacent channels. According to [7], this effect has the form of a Laplace distribution.

    2.3 Inversion of the room impulse response

    In this section the theoretical possibility of a perfect dereverberation will be

    discussed. The issue can be formulated in the following way: Assuming that the

    room impulse response is known, is it possible to remove its effect and to get an

    accurate estimate of the original speech signal?.

    2.3.1 Conditions on the inversion of FIR filters

The inverse g(n) of a filter h(n) (see figure 2.13) satisfies

ŝ(n) = g(n) * x(n) = g(n) * h(n) * s(n) = s(n),    (2.21)

which can be simplified to

g(n) * h(n) = δ(n).    (2.22)


s(n) → h(n) → x(n) → g(n) → ŝ(n)

    Figure 2.13: Inversion of a filter

The inversion problem can be studied with the help of the z-transform. The z-transform of h(n), also called the transfer function of the filter, is defined as the power series

H(z) = Σ_{k=−∞}^{+∞} h(k) z^{−k}.    (2.23)

It was shown in section 2.2 that the room can be considered as an FIR filter. Its z-transform is then the polynomial

H(z) = h₀ + h₁z^{−1} + ⋯ + h_{L−1}z^{−L+1}    (2.24)

where L is the length of the room impulse response. Factored into its roots,

H(z) = h₀ z^{−L+1} (z − z₁)(z − z₂) ⋯ (z − z_{L−1})    (2.25)

H(z) has L − 1 finite zeros at z = z₁, z₂, …, z_{L−1}. The transfer function of the inverse filter is then the rational function

G(z) = 1 / (h₀ + h₁z^{−1} + ⋯ + h_{L−1}z^{−L+1})    (2.26)

The Infinite Impulse Response (IIR) filter G(z) is causal and stable if and only if all its poles are inside the unit circle (|z| = 1). As the poles of G(z) are the zeros of H(z), this means that all the zeros of H(z) must be inside the unit circle. Such a system is called minimum-phase.

In order to understand this problem, we can observe what happens if we want to invert a simple non-minimum-phase system. Given the FIR filter h(n), defined in the time domain by

h(n) = δ(n) − 2δ(n − 1),

its transfer function is

H(z) = 1 − 2z^{−1}.


The Region of Convergence (ROC) of this z-transform is |z| > 0. As this system has a zero at z = 2, it is non-minimum-phase. The transfer function of its inverse system is

G(z) = 1 / (1 − 2z^{−1}) = z / (z − 2).

G(z) has a zero at the origin and a pole at z = 2. In this case there are two possible regions of convergence and hence two possible inverse systems. If the ROC of G(z) is taken as |z| > 2, then

g(n) = 2^n u(n),

where u(n) is the unit step function

u(n) = 1 if n ≥ 0, 0 else.    (2.27)

This is the impulse response of a causal and unstable system. On the other hand, if the ROC is assumed to be |z| < 2, the impulse response of the inverse system is

g(n) = −2^n u(−n − 1).

In this case the inverse system is anti-causal and stable.
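The practical consequence of the unstable causal inverse can be checked numerically. In this Python/scipy sketch (an assumed illustration, not taken from the thesis), the causal inverse 1/(1 − 2z⁻¹) recovers the impulse from exact data, but a tiny perturbation of the input is amplified like 2^n:

```python
import numpy as np
from scipy.signal import lfilter

# h(n) = delta(n) - 2 delta(n-1): non-minimum-phase FIR, zero at z = 2
x = np.zeros(32)
x[0] = 1.0                              # unit impulse
y = lfilter([1.0, -2.0], [1.0], x)      # filtered by h

# Causal inverse G(z) = 1/(1 - 2 z^-1): pole at z = 2, outside the unit circle
g_out = lfilter([1.0], [1.0, -2.0], y)  # exact data: the impulse is recovered

noisy = y.copy()
noisy[1] += 1e-6                        # tiny measurement error
blown_up = lfilter([1.0], [1.0, -2.0], noisy)   # the error grows like 2^n
```

Any measurement noise or quantization error therefore makes the causal inverse useless in practice, which is exactly why the stable alternative has to be anti-causal (or delayed).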

2.3.2 Are room transfer functions minimum-phase?

Any system can be represented as the cascade of a minimum-phase system with an all-pass system [8]. An all-pass system is defined as a system for which the magnitude of the transfer function is unity for all frequencies. Thus if H_ap(z) denotes the z-transform of an all-pass system, |H_ap(e^{jω})| = 1 for all ω. The poles and zeros of an all-pass system occur at conjugate reciprocal locations (see figure 2.14).

Consider a non-minimum-phase system H(z) with, for example, one zero outside the unit circle at z = 1/z₀, |z₀| < 1, while the remainder of its poles and zeros are inside the unit circle. Then H(z) can be expressed as

H(z) = H₁(z)(z^{−1} − z₀)    (2.28)


Figure 2.14: Pole (×) and zero (∘) of an all-pass filter in the z-plane, at the conjugate reciprocal locations z₀ = re^{jθ} (pole, inside the unit circle) and 1/z₀* = (1/r)e^{jθ} (zero).

where H₁(z) is minimum-phase. Equivalently equation (2.28) can be written as

H(z) = (H₁(z)(z^{−1} − z₀)) · (1 − z₀z^{−1})/(1 − z₀z^{−1})
     = (H₁(z)(1 − z₀z^{−1})) · (z^{−1} − z₀)/(1 − z₀z^{−1})
     = H_min(z) · (z^{−1} − z₀)/(1 − z₀z^{−1})
     = H_min(z) H_ap(z)    (2.29)

where H_min(z) is minimum-phase and H_ap(z) is all-pass. Any pole or zero of H(z) that is inside the unit circle also appears in H_min(z). Any pole or zero of H(z) that is outside the unit circle appears in H_min(z) at the conjugate reciprocal location.

    The equivalent minimum-phase system has the same magnitude spectrum as the

    original system.

It is interesting to compare the impulse response h(n) of an FIR system with the impulse response h_min(n) of its equivalent minimum-phase system. Figure 2.15 shows that the energy of h_min(n) is more concentrated around the origin. This property can be formalized with the following equation³:

Σ_{n=0}^{m} |h(n)|² ≤ Σ_{n=0}^{m} |h_min(n)|²,  ∀m ∈ N    (2.30)

    The energy of both sequences is the same since the magnitude of their Fourier

³A proof of this property is outlined in [8], page 371.

  • 2.3 Inversion of the room impulse response 41

Figure 2.15: Energy of a non-minimum-phase system (dashed, blue) and the corresponding minimum-phase system (red).

transforms is the same (by Parseval's theorem). This means that equality occurs in (2.30) as m → ∞.

Room transfer functions often have more energy in the reverberant component of the room impulse response than in the component corresponding to the direct path (see figure 1.3). This implies that room transfer functions are often non-minimum-phase. A causal and stable inverse of a room impulse response is therefore impossible to find in general. The non-causality problem can be solved by introducing a delay, i.e. a delayed inverse filter is computed instead. However the delay generally has to be quite long, which is not satisfactory for real-time applications.

    2.3.3 Multiple input inverse filter

As room transfer functions are non-minimum-phase most of the time, a perfect dereverberation cannot be achieved with a single microphone. It is possible to find a delayed inverse filter, but this solution is not really adequate for real-time processing.

However it is possible to find the exact inverse at a point in the room by using multiple microphones⁴, if the room transfer functions corresponding to the different sensors are coprime, i.e. they do not share common zeros [9].

This property is actually a direct application of Bézout's theorem on polynomials. Given M FIR filters with transfer functions H_i(z), i = 1, …, M, if the H_i(z) are coprime polynomials, then there exist G_i(z), i = 1, …, M, such that

H₁(z)G₁(z) + H₂(z)G₂(z) + … + H_M(z)G_M(z) = 1    (2.31)

where the orders of the G_i(z) are smaller than or equal to the highest order of the H_i(z). Figure 2.16 shows how equation (2.31) can be used to invert the M channels simultaneously. This method is called the Multiple-input/output INverse Theorem (MINT).

s(n) → H_i(z) → G_i(z), i = 1, …, M, with the M branch outputs summed to recover s(n)

Figure 2.16: Multiple input inverse filter

By using more than one microphone, the issue that room transfer functions are non-minimum-phase is bypassed. Moreover the inverse filters are simple FIR filters, which can be computed by solving the linear system

d = [H₁ᵀ H₂ᵀ ⋯ H_Mᵀ] g = H g    (2.32)

where d = [1, 0, …, 0]ᵀ is a vector of length 2L − 1, g is the concatenation of the vectors g_i = [g_i(0), …, g_i(L−1)]ᵀ corresponding to the inverse filters

g = [g₁ᵀ … g_Mᵀ]ᵀ    (2.33)

and the H_i are the L × (2L − 1) Sylvester matrices corresponding to the polynomials H_i(z):

       ⎡ h_i(0) ⋯ h_i(L−1)         0      ⎤
H_i =  ⎢          ⋱         ⋱             ⎥    (2.34)
       ⎣    0       h_i(0)  ⋯  h_i(L−1)   ⎦

⁴Such a system is called Single-Input Multiple-Output (SIMO).


A Sylvester matrix makes it possible to compute a convolution (or a polynomial multiplication) as a matrix multiplication. Given two signals x(n) and y(n), of lengths L_x and L_y respectively, their convolution z(n) has L_x + L_y − 1 samples and can be written in vector form as

z = Xᵀy = Yᵀx,    (2.35)

where X is the L_y × (L_x + L_y − 1) Sylvester matrix of x(n), Y is the L_x × (L_x + L_y − 1) Sylvester matrix of y(n), and x and y are the signals written as column vectors of lengths L_x and L_y.

The linear system of equation (2.32) can be solved by computing the Moore-Penrose pseudo-inverse⁵ of the matrix H, denoted H⁺. The inverse filter is then computed as

g = H⁺d.    (2.36)

As d = [1, 0, …, 0]ᵀ, g is actually the first column of H⁺.

The linear system of equation (2.32) has infinitely many solutions. The pseudo-inverse method gives the solution with the smallest 2-norm.
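The MINT computation of equations (2.32)–(2.36) fits in a few lines of Python/numpy (random channels of an assumed length L = 16 stand in for measured room responses; random FIR filters are coprime almost surely):

```python
import numpy as np

def sylvester(h, L):
    """L x (2L-1) Sylvester matrix of the length-L filter h (eq. 2.34)."""
    S = np.zeros((L, 2 * L - 1))
    for row in range(L):
        S[row, row : row + L] = h
    return S

rng = np.random.default_rng(2)
L, M = 16, 2
channels = [rng.standard_normal(L) for _ in range(M)]   # stand-ins for H1, H2

# Stack [H1^T H2^T] and solve H g = d via the pseudo-inverse (eqs. 2.32, 2.36)
H = np.hstack([sylvester(h, L).T for h in channels])    # (2L-1) x (M L)
d = np.zeros(2 * L - 1)
d[0] = 1.0
g = np.linalg.pinv(H) @ d                               # min-norm solution
g1, g2 = g[:L], g[L:]

# Check the MINT identity (2.31): h1*g1 + h2*g2 = delta
total = np.convolve(channels[0], g1) + np.convolve(channels[1], g2)
```

Note that the recovered inverse filters are plain FIR filters of the same length as the channels, with no delay, which is exactly the advantage over the single-microphone case.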

⁵The Moore-Penrose pseudo-inverse is the matrix H⁺ of the same dimensions as Hᵀ satisfying four conditions: HH⁺H = H, H⁺HH⁺ = H⁺, HH⁺ and H⁺H are Hermitian.

Chapter 3

    Enhancement of a speech signal

    Reverberation produces a distortion that alters the intelligibility of speech. A

    possible approach to the dereverberation problem is to consider the general prop-

    erties of a speech signal, which are degraded by the reverberation.

    A simple way to improve the reverberant signal is, for example, to detect re-

    verberation tails between words. By removing, or attenuating, these parts which

    only contain reverberation, the listening comfort is slightly improved. However,

    this method, which is used in hearing aids, does not remove the distortion which

    alters the words.

The two methods presented in this chapter use, more or less explicitly, the harmonicity property of the voiced segments of a speech signal in order to try and recover the clean signal. In section 3.1 the approach of Nakatani [4], using an adaptive harmonic filter, will be described. In section 3.2 an adaptive algorithm

    working on the LP residual of the speech signal will be presented.

    3.1 Harmonicity based dereverberation

Nakatani et al. propose in [4] an interesting single-microphone dereverberation method called Harmonicity based dEReverBeration (HERB). This method

    is based on the harmonicity model of speech described in section 2.1.

    The principle is to estimate a dereverberation operator using the harmonic parts

    of the speech signal. This operator, initially designed for the harmonic parts, is

    expected to work on the non-harmonic parts as well.


    Figure 3.1: Spectrograms of a sweeping sinusoid and its reverberant signal.

The performance of this method, presented in [4] and [10], is impressive. In this

    section we will begin by describing the principle of this dereverberation process.

    Then we will discuss its applicability on ASIMO.

    3.1.1 Effect of reverberation on a sweep signal

In order to understand the basic idea of the HERB method, it is useful to observe the effect of reverberation on a sweeping sinusoid. In discrete time the sinusoidal sweep is defined by

s(n) = A sin(2π((k/2)(n/f_s)² + f_start(n/f_s)))    (3.1)

where A is the amplitude, f_start is the frequency at t = 0, f_s is the sampling frequency, and k a constant. Its instantaneous frequency varies linearly in time:

ν(n) = k(n/f_s) + f_start.    (3.2)
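Equation (3.1) can be used directly to generate such a sweep. A brief numpy sketch (with A = 1; the 100–4000 Hz range over half a second matches the experiment described next, the sampling rate is an assumption):

```python
import numpy as np

fs, dur = 8000, 0.5
f_start, f_end = 100.0, 4000.0
k = (f_end - f_start) / dur               # sweep rate so that nu(n) ends at f_end
n = np.arange(int(fs * dur))

# Sinusoidal sweep, eq. (3.1); instantaneous frequency k*(n/fs) + f_start, eq. (3.2)
s = np.sin(2 * np.pi * ((k / 2) * (n / fs) ** 2 + f_start * (n / fs)))
```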

Figure 3.1 (left) shows the spectrogram of a half-second-long discrete signal whose frequency sweeps from 100 to 4000 Hz. This spectrogram is obtained using a Gammatone filter-bank, therefore the frequency scale is not linear (see (1.2)).

    resulting spectrogram is shown on figure 3.1 (right). We can observe, from this

    spectrogram, that the sinusoidal component corresponding to the original signal

can be clearly identified. In each frequency band, the energy corresponding to this "direct signal" appears first and is followed by a reverberation tail. At a given point in time, the energy of the signal is maximum for the frequency corresponding to the direct signal.

The idea of the HERB method is to track the instantaneous frequency ω(l) of the dominant sinusoidal component in the reverberant signal at each short time frame. The amplitude A(l) and phase φ(l) of this dominant sinusoid are extracted and used to synthesize the signal

ŝ(n) = Σ_l g(n − n_l) A(l) cos(ω(l)(n/f_s) + φ(l)),    (3.3)

where g(n − n_l) is a window function for overlap-add synthesis and n_l is the time index centered in frame l.

    3.1.2 Adaptive harmonic filtering

Although a sweep signal contains only one dominant sinusoid, a harmonic signal contains several sinusoidal components whose frequencies correspond to its fundamental frequency F0 and its multiples (cf. section 2.1). The aim of a harmonic filter is to enhance these components. Since the fundamental frequency of a speech signal changes over time, the properties of the filter have to be adaptively modified according to F0 (see figure 3.2).

x(n) → [Harmonic Filter] → x̃(n), with the filter parameters controlled by an [Estimation of F0] block

Figure 3.2: Adaptive harmonic filtering

A simple approach to harmonic filtering is a comb filter of the form 1 + z^{−τ}, where τ is the period to be enhanced. The method proposed by Nakatani in [4] is to

    filter the signal by synthesizing a harmonic sound as follows:

    1. The fundamental frequency of the observed signal is estimated at each time

    frame. If the time frame is short enough this fundamental frequency can be

    considered constant.


2. The amplitudes and phases of individual harmonic components are estimated using the short time Fourier transform (STFT), X(l,m), of x(n):

X(l,m) = Σ_n g₁(n − n_l) x(n) e^{−j2πm(n−n_l)/M},    (3.4)
Â_{k,l} = |X(l, [kF_{0,l}])|,    (3.5)
φ̂_{k,l} = ∠X(l, [kF_{0,l}]),    (3.6)

where l is the index of the time frame, n_l is the time index corresponding to the center of the frame, m is the index of the frequency bin, M is the number of points used for the Discrete Fourier Transform (DFT), Â_{k,l} and φ̂_{k,l} are respectively the estimated amplitude and phase of the k-th harmonic component, F_{0,l} is the fundamental frequency of the time frame, g₁(n) is an analysis window function and [·] discretizes a continuous frequency into the index of the nearest frequency bin.

3. The output of the filter, x̃(n), is obtained by adding sinusoids,

x̃_l(n) = Σ_k Â_{k,l} cos(2π k F_{0,l} (n − n_l)/f_s + φ̂_{k,l}),    (3.7)

and combining them over succeeding frames,

x̃(n) = Σ_l g₂(n − (n_l + lT)) x̃_l(n),    (3.8)

where x̃_l(n) is a synthesized harmonic sound corresponding to the time frame l, T is the frame shift in samples and g₂(n) is a synthesis window function.
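Steps 1–3 above can be condensed into a short sketch. In the following Python version (the frame length, hop size and a perfectly known, constant F0 are simplifying assumptions, and a numpy STFT replaces the Matlab implementation of the thesis), only the bins at multiples of F0 are resynthesized:

```python
import numpy as np

def harmonic_filter(x, f0, fs, frame=400, hop=200, n_harm=10):
    """Adaptive harmonic filter, steps 1-3: per frame, read amplitude and
    phase at the STFT bins nearest k*F0 and resynthesize those sinusoids."""
    win = np.hanning(frame)
    wsum = win.sum()
    out = np.zeros(len(x))
    t = np.arange(frame) / fs
    for start in range(0, len(x) - frame + 1, hop):
        X = np.fft.rfft(win * x[start : start + frame])        # eq. (3.4)
        synth = np.zeros(frame)
        for k in range(1, n_harm + 1):
            m = int(round(k * f0 * frame / fs))                # nearest bin [k F0]
            if m <= 0 or m >= len(X):
                break
            A = 2 * np.abs(X[m]) / wsum                        # eq. (3.5)
            phi = np.angle(X[m])                               # eq. (3.6)
            synth += A * np.cos(2 * np.pi * k * f0 * t + phi)  # eq. (3.7)
        out[start : start + frame] += win * synth              # overlap-add, (3.8)
    return out

fs = 8000
n = np.arange(fs)                         # 1 s of a 200 Hz harmonic tone
x = np.cos(2 * np.pi * 200 * n / fs) + 0.5 * np.cos(2 * np.pi * 400 * n / fs)
y = harmonic_filter(x, f0=200.0, fs=fs)   # y approximates x away from the edges
```

On a clean harmonic tone the filter is nearly transparent; its purpose only becomes visible once reverberation adds energy between and around the harmonic bins.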

Actually the harmonic filter itself is easy to implement. The main issue is to find an accurate estimate of the fundamental frequency of the signal, even in the case of strong reverberation.

    3.1.3 Dereverberation operator

    Harmonic case

The dereverberation operator is computed in the frequency domain using the short-time Fourier transform. Let X(l,m) be the STFT of a reverberant signal.


X(l,m) can be represented as the product of the source signal, S(l,m), and the room transfer function, H(m), which is assumed to be time-invariant (cf. section 2.2). This transfer function can be divided into two functions, D(m) and R(m). The former corresponds to the direct signal, D(m)S(l,m), and the latter to the reverberant part, R(m)S(l,m):

X(l,m) = H(m)S(l,m) = D(m)S(l,m) + R(m)S(l,m)    (3.9)

The aim of the dereverberation operator is to estimate the direct signal X̂(l,m) = D(m)S(l,m).

It can be obtained by subtracting the reverberant part R(m)S(l,m) from equation (3.9), or by finding the inverse filter W(m) such that

W(m) = D(m)/H(m)    (3.10)

Then

X̂(l,m) = W(m)X(l,m) = (D(m)/H(m))(H(m)S(l,m)) = D(m)S(l,m).    (3.11)

The basic idea of the HERB method is the following: if S(l,m) is a harmonic signal, the direct signal contained in X(l,m) can be obtained using an adaptive harmonic filter. At each time frame l an inverse filter W₀(l,m) is computed in the frequency domain using the output X̃(l,m) of the harmonic filter:

W₀(l,m) = X̃(l,m) / X(l,m)    (3.12)

As the signal X̃(l,m) is supposed to contain only the direct part of the signal X(l,m), this filter will remove the reverberation on the time frame.

As the effect of the room is supposed to be constant, the dereverberation operator W(m) is estimated by averaging the inverse filters computed at the different time frames:

W(m) = E{W₀(l,m)}    (3.13)
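In the idealized case where the harmonic filter returns exactly the direct part, the averaging of equations (3.12)–(3.13) reduces to a few numpy lines (synthetic random spectra replace real STFTs, and D = 0.5·H is an arbitrary illustrative choice):

```python
import numpy as np

def estimate_herb_operator(X, X_tilde, eps=1e-12):
    """Dereverberation operator W(m), eqs. (3.12)-(3.13): average over the
    time frames l of the ratio X_tilde(l,m) / X(l,m)."""
    W0 = X_tilde / (X + eps)      # per-frame inverse filters, eq. (3.12)
    return W0.mean(axis=0)        # expectation over frames, eq. (3.13)

rng = np.random.default_rng(3)
frames, bins = 200, 64
S = rng.standard_normal((frames, bins)) + 1j * rng.standard_normal((frames, bins))
H = rng.standard_normal(bins) + 1j * rng.standard_normal(bins)
D = 0.5 * H                        # pretend the direct part is half of H

# Ideal harmonic-filter output: X_tilde = D S; observed signal: X = H S
W = estimate_herb_operator(H * S, D * S)   # converges to D(m)/H(m)
```

With a real harmonic filter the per-frame ratios are noisy, which is exactly why the expectation over many frames is needed.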


    General case

This process can be applied to a speech signal S(l,m) by rewriting equation (2.1) in the frequency domain as

S(l,m) = S_h(l,m) + S_n(l,m)    (3.14)

where S_h(l,m) is the harmonic part and S_n(l,m) is the non-harmonic part. The observed reverberant signal X(l,m) is rewritten as

X(l,m) = D(m)S_h(l,m) + (R(m)S_h(l,m) + H(m)S_n(l,m))    (3.15)

where H(m) is the transfer function of the room, H(m) = D(m) + R(m).

The component D(m)S_h(l,m) can be approximately extracted from X(l,m) with an adaptive harmonic filter. This approximated direct signal X̃(l,m) can be modeled as

X̃(l,m) = D(m)S_h(l,m) + (R̃_h(l,m) + H̃_n(l,m))    (3.16)

where R̃_h(l,m) is a part of the reverberation of S_h(l,m) and H̃_n(l,m) is a part of the direct signal and reverberation of S_n(l,m). It can be assumed, if the fundamental frequency is perfectly estimated, that the only estimation errors on X̃(l,m) are caused by these two unexpected remaining parts.

The dereverberation estimator is computed as in the harmonic case (3.12):

W(m) = E{W₀(l,m)} = E{X̃(l,m)/X(l,m)}    (3.17)

Using equation (3.16), the operator over a time frame, W₀(l,m), can be rewritten as (the time and frequency indexes have been removed for clarity)

W₀ = X̃/X = (D S_h + R̃_h + H̃_n) / (H S)    (3.18)
   = ((D + R̃_h/S_h) S_h) / (H (S_h + S_n)) + ((H̃_n/S_n) S_n) / (H (S_h + S_n))    (3.19)
   = (D + R̃_h/S_h)/H · 1/(1 + S_n/S_h) + (H̃_n/S_n)/H · 1/(1 + S_h/S_n)    (3.20)

It can be proven (see appendix A) that the expected value of a function 1/(1+z), where z is a complex random variable, is equal to the probability that |z| < 1, if it is assumed that the phase of z is uniformly distributed, that the phase of z and its absolute value are statistically independent, and that |z| ≠ 1.

Then, under the following conditions [10]:

1. The phase of S_n(l,m) and a joint event composed of S_h(l,m), R̃_h(l,m) and |S_n(l,m)| are independent,

2. The phase of S_h(l,m) and a joint event composed of |S_h(l,m)|, H̃_n(l,m) and S_n(l,m) are independent,

3. The phases of S_h(l,m) and S_n(l,m) are uniformly distributed within [0, 2π),

4. |S_h(l,m)| ≠ |S_n(l,m)|,

equation (3.17) can be derived as

W(m) ≈ (D(m) + R̄(m)) / H(m) · P{|S_h(l,m)| > |S_n(l,m)|},    (3.21)

where

R̄(m) = E_t{ R̃_h(l,m)/S_h(l,m) }_{|S_h(l,m)|>|S_n(l,m)|},    (3.22)

P{·} is a probability function, and E_t{·}_A represents an average function over time frames under a condition where A holds.

In the derivation of W(m), the term corresponding to the non-harmonic part is neglected. Actually it is expected that, when |S_n(l,m)| > |S_h(l,m)|, i.e. the signal is non-harmonic over the time frame, the output of the harmonic filter is equal to zero, so that

E_t{ H̃_n(l,m)/S_n(l,m) }_{|S_n(l,m)|>|S_h(l,m)|} ≈ 0.    (3.23)

Given equation (3.21), W(m) is expected to remove reverberation not only from the harmonic components of the speech signal but also from the non-harmonic ones. It approximates the inverse filter D(m)/H(m), except for a remaining reverberation due to R̄(m).

The enhanced signal is then expected to be the sum of the direct signal and a reduced reverberation part. However, because of the term P{|S_h(l,m)| > |S_n(l,m)|}, the gain of the filter becomes zero in frequency regions where harmonic components were not included during the estimation of the dereverberation operator.

In addition, even in frequency regions in which harmonic components were present during the estimation phase, the filter gain is expected to decrease as the frequency increases. The reason is that in a speech signal the energy of higher-order harmonic components is smaller than the energy of the sinusoidal component at the fundamental frequency and its first few multiples.

    3.1.4 The HERB method

Figure 3.3 summarizes the dereverberation process using the HERB method.

S(l,m) → H(m) → X(l,m) → W(m) → Ŝ(l,m), with W(m) = E{X̃(l,m)/X(l,m)} estimated from the output X̃(l,m) of the harmonic filter

    Figure 3.3: Diagram of the HERB dereverberation method

    The system consists of the following sub-procedures:

    1. Estimation of the fundamental frequency and the voiced durations.

    2. Harmonic filtering.

    3. Estimation of the dereverberation operator.

    4. Dereverberation of the signal using this operator.

The estimation of the fundamental frequency is the most important point of the method. F0 must be robustly estimated in order to obtain a good dereverberation. This task is difficult in the presence of strong reverberation. Nakatani [10] proposes to repeat the dereverberation process of figure 3.3 three times:


    STEP 1: The dereverberation process is applied on the observed reverberant

    signal.

    STEP 2: The dereverberated signal obtained in the first step is used to estimate

    the fundamental frequency. This new F0 is used to control the adaptive

    harmonic filter, but the input of this filter is the original reverberant signal.

STEP 3: The dereverberated signal obtained in the second step is now used as the new reverberant signal, which is enhanced by reapplying the whole dereverberation process to it.

    According to [10], the quality of the dereverberated signal improves at each step.

The second step can be repeated several times to further improve the estimation of the fundamental frequency. By contrast, the quality of the signal does not always improve when the third step is repeated; this is due to an accumulation of errors in the dereverberation filters.

    3.1.5 Test of the method

The performance of this method reported in [4] and [10] is impressive. We implemented the method in order to test two points:

Can the STFT easily be replaced by a Gammatone filter-bank?

Can the process work in real time and in real environments?

For HRI the interest of this method lies in the fact that the fundamental frequency of the signal is already computed for other processes. A pitch-tracking algorithm has already been developed at HRI [11] and can readily be used to estimate the fundamental frequency of the signals. As the computation of the fundamental frequency seems to be the critical point of the HERB algorithm, we expected a lot from this method.

    Harmonic Filter

The aim of this section is to compare the harmonic filter proposed by Nakatani in [4] with an implementation of the harmonic filter on the Gammatone filter-bank.

  • 54 3 Enhancement of a speech signal

The implementation of the harmonic filter on the filter-bank is simple. The outputs of the Gammatone filter-bank are N signals corresponding to the different frequency channels of the cochlea response. Knowing the fundamental frequency, the frequency channels corresponding to F0 and its multiples are determined. At each time sample t the signals of these channels are added.
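This channel-selection scheme can be sketched as follows. The function name, the tolerance rule, and the toy channels are illustrative assumptions; in practice the Gammatone filter-bank supplies the channel signals and their center frequencies.

```python
import numpy as np

# Sketch of harmonic filtering on a filter-bank: keep only the channels
# whose center frequency lies close to a multiple of F0, then sum them.
def harmonic_filter(channels, center_freqs, f0, n_harmonics=10, tol_ratio=0.05):
    channels = np.asarray(channels)                    # shape (N, T)
    center_freqs = np.asarray(center_freqs, dtype=float)
    harmonics = f0 * np.arange(1, n_harmonics + 1)
    # relative distance of each channel to the nearest harmonic of f0
    dist = np.abs(center_freqs[:, None] - harmonics[None, :])
    keep = (dist / harmonics[None, :]).min(axis=1) < tol_ratio
    return channels[keep].sum(axis=0), keep

# toy example: six "channels" carrying pure tones, F0 = 100 Hz
fs, T = 8000, 256
t = np.arange(T) / fs
center_freqs = [100, 150, 200, 250, 300, 430]
channels = [np.sin(2 * np.pi * f * t) for f in center_freqs]
y, keep = harmonic_filter(channels, center_freqs, f0=100.0)
# keep selects the channels at 100, 200 and 300 Hz
```

The selection step is recomputed whenever F0 changes, which is what makes the filter adaptive.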

The resulting spectrograms (see figure 3.4) show that both implementations of the harmonic filter give similar results.

Figure 3.4: Top left: original signal (sweep with harmonics). Top right: reverberant signal. Bottom left: harmonic estimate with the Gammatone filter-bank. Bottom right: harmonic estimate with Nakatani's harmonic filter.

As expected, the adaptive harmonic filter can be implemented without problems on the Gammatone filter-bank. Moreover, the assumption that the fundamental frequency is constant over a short time frame is no longer required, as our filter-bank performs the harmonic filtering without the subsampling imposed by the STFT used in [4]. The improvement of the method using time warping, proposed in [12], is therefore unnecessary in a Gammatone implementation.


    Dereverberation operator

In order to compute the dereverberation operator a training sequence is required. During this adaptation phase the room impulse response must not change.

In the simulation the training sequence was composed of several sweeping sinusoids with their harmonics, similar to the one shown in figure 3.4. The operator is computed using a short-time Fourier transform. The restriction on the time window is very strong: it has to be long enough to contain a whole word (sweep) including the complete reverberation tail.

It is important to note at this point that the windows of the short-time Fourier transform for the harmonic filter and for the estimation of the dereverberation operator cannot be the same. For the harmonic filtering the analysis window must be as short as possible in order to respect the assumption that the fundamental frequency of the signal is constant. For the estimation of the filter W(m), on the other hand, the time window of the STFT must be several seconds long.

It is also assumed that, during the adaptation phase of the dereverberation filter, the pauses between the words are long enough that the reverberation tail of one word does not alter the following word. When these conditions are respected, the dereverberation operator can be computed.
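The per-bin estimation of W(m) can be sketched as follows. The MMSE-style average W(m) = E{X^ X*}/E{X X*} used below is an assumed form for illustration; the exact averaging used in [10] may differ.

```python
import numpy as np

# Sketch of the operator estimation: for each frequency bin m, average
# over the frames l of the training data. X is the STFT of the observed
# signal, X_harm the harmonic estimate from the harmonic filter.
def estimate_operator(X, X_harm, eps=1e-12):
    """X, X_harm: complex STFT arrays of shape (frames, bins)."""
    num = (X_harm * np.conj(X)).mean(axis=0)
    den = (X * np.conj(X)).mean(axis=0).real + eps
    return num / den

# sanity check: if the harmonic estimate equals a fixed per-bin scaling
# of X, the estimated operator recovers exactly that scaling
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8)) + 1j * rng.normal(size=(50, 8))
true_W = np.linspace(0.5, 2.0, 8)
W = estimate_operator(X, true_W * X)
```

The averaging over many frames is what creates the heavy data requirement discussed below: W(m) only converges once enough voiced training material has been observed for every bin.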

In order to estimate the performance of the algorithm, 500 random harmonic signals (sweeps with harmonics) of 0.5 s each are used as training data. These signals are convolved with the room impulse response shown in figure 2.7. As the exact fundamental frequencies of these signals are known, a good estimation of the dereverberation operator can be expected. This dereverberation operator is then used to enhance a real speech signal convolved with the same room impulse response (see figure 3.5).

Figure 3.6 shows the spectrogram of the enhanced signal. It is important to note here that the speech signal used for the test contains only one word. As the time window of the STFT used to compute the dereverberation operator is long enough to contain a whole word, the enhanced signal is obtained by multiplying the FFT of the whole reverberant signal with the dereverberation operator.

The dereverberation works relatively well. However, we can see in the spectrogram that the dereverberation filter is not causal. This is not surprising, as we explained in section 2.3 that room impulse responses are in general not minimum-phase. Because of the non-causality the beginning of the signal is altered. This can


Figure 3.5: Spectrogram of the clean and the reverberant signal used to test the dereverberation operator.

    Figure 3.6: Spectrogram of the enhanced signal computed in the frequency domain.


be a problem, as this part of the signal normally contains no reverberation and can, therefore, be valuable for some processing steps.

We also tried to express the dereverberation filter in the time domain, using the inverse Fourier transform of the operator computed in the frequency domain. In this case the dereverberation does not work anymore (see figure 3.7).

Figure 3.7: Impulse response of the dereverberation operator and spectrogram of the enhanced signal computed in the time domain.

    3.1.6 Discussion of the method

This method can theoretically remove very strong reverberation from a signal. Moreover, as the dereverberation operator is computed in the frequency domain, the computational cost is quite low.

However, the amount of required adaptation data is prohibitive (about 500 words). In addition, the pauses between words have to be very long in the case of long reverberation. It is therefore impossible to use this method in real-time applications: it would be quite bothersome to have to speak with ASIMO for several minutes before it begins to understand what is said, and this assuming that neither the robot nor the speaker moves during this time.

Given the restrictions on the adaptation phase of the algorithm, this method cannot be applied in real environments. In addition, a remark can be made on the use of harmonic filtering to remove reverberation, even for a highly harmonic signal. The harmonic filter manages quite well to remove the reverberation when the fundamental frequency changes within the word. However, a large part of the reverberation remains if the fundamental frequency changes too slowly. Figure 3.8


    shows the effect of the reverberation on the fundamental frequency in such a

    case.

    Figure 3.8: Effect of the reverberation on the fundamental frequency.

In this example the component corresponding to the fundamental frequency is strongly disturbed by the reverberation. Increasing the frequency resolution (the number of channels of the filter-bank) solves this problem, but more than 1000 channels are required to reach a frequency selectivity as good as that of the human auditory system. Due to the resulting computational load this is not feasible for real-time applications.


    3.2 Dereverberation using LP analysis

The dereverberation using the harmonicity of the signal requires too much training data. Therefore, in this section, another dereverberation method will be discussed. This method uses the autoregressive (AR) model of speech signals. Several methods based on linear prediction (LP) analysis have been proposed [13], [14], [15].

    3.2.1 Problem formulation

In section 2.1.4 it was explained that a speech signal s(n) can be expressed as a linear combination of its L past sample values. The clean and the reverberant speech signals become, respectively,

s(n) = Σ_{k=1}^{L} a_k s(n−k) + e_s(n),   (3.24)

x(n) = Σ_{k=1}^{L} b_k x(n−k) + e_x(n),   (3.25)

where a_k and b_k are the LP coefficients and e_s(n) and e_x(n) the LP residual signals (or prediction errors).

The important assumption on which dereverberation methods using LP analysis are based is that the LP coefficients are unaffected by the reverberation:

b_k = a_k,   ∀ k ∈ [1, L].   (3.26)

Actually this assumption holds only in a spatially averaged sense [16], i.e. using several microphones:

E{b_k} = a_k,   ∀ k ∈ [1, L].   (3.27)

Consequently the dereverberation process tries to enhance the LP residual of the signal, whose structure is well known (see 2.1.4). The aim of the dereverberation methods using LP analysis is to improve the LP residual signal such that ê(n) ≈ e_s(n). A clean speech estimate is then obtained by

ŝ(n) = Σ_{k=1}^{L} b_k ŝ(n−k) + ê(n),   (3.28)

i.e. the LP coefficients obtained by linear prediction analysis of the reverberant signal are used to synthesize a signal from the enhanced excitation signal ê(n).
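The LP analysis/synthesis round trip described above can be sketched as follows. This is a minimal implementation using the autocorrelation (Yule-Walker) method, not the thesis code; function names and the model order are illustrative.

```python
import numpy as np

def lp_coefficients(x, order):
    """Solve the Yule-Walker normal equations R a = r for the predictor a_k."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def lp_residual(x, a):
    """Prediction error e(n) = x(n) - sum_k a_k x(n-k)."""
    e = x.copy()
    for k, ak in enumerate(a, start=1):
        e[k:] -= ak * x[:-k]
    return e

def lp_synthesis(e, a):
    """Rebuild s(n) = sum_k a_k s(n-k) + e(n) recursively."""
    s = np.zeros_like(e)
    for n in range(len(e)):
        s[n] = e[n] + sum(ak * s[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
    return s

# round trip: analysis followed by synthesis returns the signal exactly
rng = np.random.default_rng(1)
x = rng.normal(size=400)
a = lp_coefficients(x, order=8)
e = lp_residual(x, a)
x_rec = lp_synthesis(e, a)
```

The round trip is exact by construction; the dereverberation methods below keep the analysis and synthesis stages and only replace the residual e(n) by an enhanced version.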


    3.2.2 The kurtosis as measure of the reverberation

Figure 3.9: Example of platykurtic (left) and leptokurtic (right) distributions. Both distributions have the same standard deviation.

Gillespie shows in [14] that the kurtosis of the LP residual is a valid reverberation metric. The kurtosis β₂ of a random signal x(n) measures the degree of peakedness of its distribution and is defined as the fourth central moment μ₄ normalized by the fourth power of the standard deviation (or the square of the variance):

β₂ = μ₄ / σ⁴ = E{(x(n) − μ)⁴} / (E{(x(n) − μ)²})²,   (3.29)

where μ = E{x(n)} is the mean value of x(n). As the kurtosis of a normal distribution is equal to 3, the kurtosis excess, denoted γ₂ and defined by

γ₂ = μ₄ / σ⁴ − 3,   (3.30)

is often used. A distribution with a high peak (γ₂ > 0) is called leptokurtic, a flat-topped distribution (γ₂ < 0) is called platykurtic, and the normal distribution is called mesokurtic.
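Equations (3.29) and (3.30) amount to a few lines of code; the sketch below checks the three named cases on synthetic data (the distributions chosen are illustrative).

```python
import numpy as np

# Direct implementation of equations (3.29)/(3.30).
def excess_kurtosis(x):
    """gamma_2 = mu_4 / sigma^4 - 3."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    mu4 = ((x - mu) ** 4).mean()
    return mu4 / sigma2 ** 2 - 3.0

rng = np.random.default_rng(2)
g2_normal = excess_kurtosis(rng.normal(size=200_000))      # mesokurtic: ~ 0
g2_laplace = excess_kurtosis(rng.laplace(size=200_000))    # leptokurtic: > 0
g2_uniform = excess_kurtosis(rng.uniform(-1, 1, 200_000))  # platykurtic: < 0
```

The Laplace distribution is a useful mental model here, since the peaky, heavy-tailed LP residual of clean voiced speech is strongly leptokurtic.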

Figure 3.9 illustrates the kurtosis measure. The distribution on the right is more peaked at the center, so one might conclude that it has a lower standard deviation. On the other hand, it has thicker tails, which usually indicates a higher standard deviation. If the effect of the peakedness exactly offsets that of the thick tails, the two distributions have the same standard deviation.

For clean voiced speech, the LP residuals have strong peaks corresponding to glottal pulses (see figure 3.10), whereas for reverberant speech these peaks are spread in time. In figure 3.11, the probability density functions of a clean signal and of the convolution of this signal with the room impulse response measured in the CARL Group's office (see figure 2.7) are estimated. Both signals have been centered and normalized such that their means equal 0 and their standard


Figure 3.10: Left: extract of the LP residuals of a speech signal; note the strong peaks corresponding to the glottal pulses. Right: the same signal impaired by reverberation.

Figure 3.11: Estimation of the probability density functions of the LP residuals of a clean speech signal (blue) and of a reverberant signal (red). Both signals have been centered and normalized such that their means equal 0 and their standard deviations equal 1.


deviations equal 1. The probability density functions are estimated by computing the histograms of the signals and normalizing them by the number of samples in the signals. The LP residual of the clean signal (blue) has a more strongly peaked distribution (kurtosis of 42). The effect of the room reduces this peakedness: the kurtosis of the LP residuals of the reverberant signal (red) equals 10. By maximizing the kurtosis of the LP residuals we can therefore expect to improve the quality of the observed signal.

    3.2.3 Maximization of the kurtosis

In the time domain

In order to enhance the reverberant signal x(n), an adaptive filter can be built which maximizes the kurtosis of its LP residual x̃(n). Given an L-tap adaptive filter h(n) at time n, the output of this filter is ỹ(n) = hᵀ(n) x̃(n), where x̃(n) = [x̃(n−L+1), …, x̃(n−1), x̃(n)]ᵀ. An LP synthesis filter then yields y(n), the final processed signal. The adaptation of h(n) is similar to a traditional Least-Mean-Square (LMS) adaptive filter [17], except that the optimized value is a feedback function f(n) corresponding to the gradient of the kurtosis.

Figure 3.12 (a) shows a diagram of the maximization system. The problem with this arrangement is the LP reconstruction artifacts. However, since the system is linear, the order of the filters can be changed arbitrarily: h(n) can be computed from x̃(n) but applied directly to x(n) (see figure 3.12 (b)).

A gradient method can be used to optimize the kurtosis. The gradient of the kurtosis is given by

∂β₂/∂h = 4 (E{ỹ²} E{ỹ³ x̃} − E{ỹ⁴} E{ỹ x̃}) / E³{ỹ²}.   (3.31)

This gradient can be approximated by

∂β₂/∂h ≈ ( 4 (E{ỹ²} ỹ² − E{ỹ⁴}) ỹ / E³{ỹ²} ) x̃ = f(n) x̃(n),   (3.32)

where f(n) is the feedback function used to control the filter updates. For continuous adaptation, the expected values E{ỹ²} and E{ỹ⁴} are estimated recursively by

E{ỹ²(n)} = β E{ỹ²(n−1)} + (1 − β) ỹ²(n),   (3.33)

E{ỹ⁴(n)} = β E{ỹ⁴(n−1)} + (1 − β) ỹ⁴(n),   (3.34)


Figure 3.12: (a) A single-channel time-domain adaptive algorithm for maximizing the kurtosis of the LP residuals. (b) Equivalent system, which avoids LP reconstruction artifacts.

where β < 1 controls the smoothness of the estimates.

The update equation of the filter is given by

h(n+1) = h(n) + μ f(n) x̃(n),   (3.35)

where μ controls the speed of adaptation.
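Equations (3.32)-(3.35) translate into a simple sample-by-sample update loop. The sketch below is illustrative only: the step size, the smoothing factor, the filter length, and the initializations are assumptions, not values from [14].

```python
import numpy as np

# Sample-by-sample sketch of the kurtosis-maximizing update loop
# (equations 3.32-3.35). mu, beta, L and the initial values are
# illustrative assumptions.
def adapt_kurtosis_filter(x_tilde, L=20, mu=1e-4, beta=0.99):
    h = np.zeros(L)
    h[0] = 1.0                       # start from a pass-through filter
    Ey2, Ey4 = 1.0, 3.0              # running estimates of E{y^2}, E{y^4}
    y = np.zeros_like(x_tilde)
    for n in range(L - 1, len(x_tilde)):
        xv = x_tilde[n - L + 1:n + 1][::-1]   # [x(n), x(n-1), ..., x(n-L+1)]
        y[n] = h @ xv
        Ey2 = beta * Ey2 + (1 - beta) * y[n] ** 2          # (3.33)
        Ey4 = beta * Ey4 + (1 - beta) * y[n] ** 4          # (3.34)
        f = 4 * (Ey2 * y[n] ** 2 - Ey4) * y[n] / Ey2 ** 3  # (3.32)
        h = h + mu * f * xv                                # (3.35)
    return h, y

rng = np.random.default_rng(3)
residual = rng.normal(size=5000)     # stand-in for an LP residual
h, y = adapt_kurtosis_filter(residual)
```

Note that the feedback f(n) is scale-invariant in a self-stabilizing way: scaling ỹ by c scales f by 1/c, so the update does not blow up as the output grows.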

    In the frequency domain

However, according to Haykin [17], the convergence of an LMS-like algorithm in the time domain is very slow. Therefore, Gillespie [14] proposes to adapt the algorithm in the frequency domain. Moreover, by using more microphones and calculating the feedback function on an averaged output of all channels, the accuracy and the speed of the adaptation are increased.

The frequency-domain method proposed in [14] uses a modulated complex lapped transform (MCLT) [18]. This filter-bank structure is close to a Discrete Cosine Transform (DCT). The general diagram of the method in the frequency domain for two microphones is shown in figure 3.13.


Figure 3.13: Two-channel frequency-domain adaptive algorithm for maximization of the kurtosis of the LP residual.

    3.3 Discussion of the method

The maximization of the kurtosis permits real-time dereverberation. The adaptation is quick if a short adaptive filter is used. However, in the case of strong reverberation the improvement of the signal is not perceptible.

If the length of the adaptive filter is increased, the kurtosis is still maximized and the algorithm converges to a signal with maximal kurtosis. But the resulting signal sometimes has a higher kurtosis than the original clean signal; the sound is then strongly distorted and sometimes not understandable anymore. Figure 3.14 shows the original LP residual of the clean signal. This signal is artificially reverberated and then enhanced by maximizing the kurtosis of the LP residuals. The resulting LP residual has a higher kurtosis than the original one. This means that the maximization has to be constrained: the clean speech has a higher kurtosis than the reverberant one, but this does not mean that the signal with the highest kurtosis is the clean signal.

In practice the length of the adaptive filter must not be longer than the period of the glottal pulses. With this constraint, the efficiency of the dereverberation is limited.

Another drawback of this method is that the LP analysis, as explained in section 2.1.4, approximates the magnitude spectrum of the speech signal very well but strongly alters the phase spectrum. As the phase is crucial for source localization, it should be studied whether this method dramatically alters the phase information of the signal.


Figure 3.14: Left: the LP residual of a clean signal. Right: the LP residual of the resulting dereverberated signal. The kurtosis of the dereverberated signal is higher than the kurtosis of the original signal; the resulting signal is strongly distorted.

Chapter 4

Equalization of room impulse responses

In chapter 3 the dereverberation approach considered the effect of the room as a distortion which alters the harmonicity of the speech signal. This chapter discusses methods to estimate room impulse responses. These estimated impulse responses can then be equalized (inverted) in order to recover the original clean speech signal (see section 2.3).

In section 4.1 the principle of a channel estimation method using the second-order statistics of the observed signals is explained. Then, in sections 4.2 and 4.3, two different implementations of this principle are discussed. Finally, in section 4.4, some ideas for improvement are proposed.

    4.1 Principle of the channel estimation

Some methods have been proposed to estimate a single channel. For example, Hopgood proposes in [19] a single-channel estimation method based on the non-stationarity of speech and the stationarity of the room impulse response. However, in most cases these methods require the input signal to be white noise, which is not the case for a speech signal. The simultaneous estimation of several room impulse responses, on the contrary, is possible [20]. Moreover, as explained in section 2.3, it is much easier to find a global inverse for two or more room impulse responses than the inverse of a single one. In this section a method is presented which permits estimating the impulse responses of a


    Single-Input Multiple-Output (SIMO) system using only second order statistics

    (SOS).

    4.1.1 Hypothesis

In [20] Tong et al. show that a Single-Input Multiple-Output (SIMO) system can be identified under the following conditions:

    1. The autocorrelation matrix of the source signal is of full rank.

    2. The channel transfer functions do not share any common zeros.

    4.1.2 Basic idea

The relation between the input and the outputs of a SIMO system (see figure 4.1) is

x_i(n) = h_i(n) * s(n),   i ∈ [1, M].   (4.1)

[Block diagram: the source s(n) passes through the M channels h_1(n), h_2(n), …, h_M(n), producing the observations x_1(n), x_2(n), …, x_M(n).]

    Figure 4.1: SIMO System

In vector/matrix form, this signal model becomes

x_i(n) = H_iᵀ s(n),   (4.2)

where

x_i(n) = [x_i(n), x_i(n−1), …, x_i(n−L+1)]ᵀ,   (4.3)

H_i = ⎡ h_i(0) ⋯ h_i(L−1)              0      ⎤
      ⎢          ⋱            ⋱               ⎥   (4.4)
      ⎣ 0            h_i(0)   ⋯   h_i(L−1)    ⎦

is the L × (2L−1) Sylvester matrix of h_i(n), and

s(n) = [s(n), s(n−1), …, s(n−2L+2)]ᵀ,   (4.5)


    where L is the maximum length of the room impulse responses.

The idea of blind SIMO identification is to study the matrix

Rx = ⎡ Σ_{i≠1} R_{x_i x_i}    −R_{x_2 x_1}     ⋯   −R_{x_M x_1}        ⎤
     ⎢ −R_{x_1 x_2}    Σ_{i≠2} R_{x_i x_i}     ⋯   −R_{x_M x_2}        ⎥
     ⎢       ⋮                   ⋮             ⋱         ⋮             ⎥   (4.6)
     ⎣ −R_{x_1 x_M}     −R_{x_2 x_M}           ⋯   Σ_{i≠M} R_{x_i x_i} ⎦

where R_{x_i x_j} = E{x_i(n) x_jᵀ(n)} are the auto- and cross-correlation matrices of the observed signals. The matrices R_{x_i x_j} can be written as

R_{x_i x_j} = (1/T) X_i X_jᵀ,   (4.7)

where X_i is the L × (T + L − 1) Sylvester matrix of x_i(n) and T is the number of samples of x_i(n).

If the matrix Rx is multiplied by the vector h = [h_1ᵀ h_2ᵀ ⋯ h_Mᵀ]ᵀ, we obtain (for the first L rows):

Σ_{i≠1} R_{x_i x_i} h_1 − R_{x_2 x_1} h_2 − ⋯ − R_{x_M x_1} h_M
  = (1/T) Σ_{i≠1} X_i (X_iᵀ h_1) − (1/T) X_2 (X_1ᵀ h_2) − ⋯ − (1/T) X_M (X_1ᵀ h_M)
  = (1/T) Σ_{i≠1} X_i (X_iᵀ h_1 − X_1ᵀ h_i).

A left multiplication by the transpose of a Sylvester matrix is a convolution, so the term X_iᵀ h_1 − X_1ᵀ h_i actually equals

x_i(n) * h_1(n) − x_1(n) * h_i(n) = s(n) * (h_i(n) * h_1(n) − h_1(n) * h_i(n)),   (4.8)

and, as the convolution of real signals is commutative, this term equals zero. The same development can be carried out for the other rows of the matrix product Rx h, which gives

Rx h = 0,   (4.9)

    which means that the vector h lies in the null space of the matrix Rx.
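This null-space property can be checked numerically on a toy two-channel system. The sketch below is illustrative (all sizes and signals are assumptions), not the implementation discussed in the following sections.

```python
import numpy as np

# Numerical check of Rx h = 0 (equation 4.9) for M = 2 channels.
def sylvester(x, L):
    """L x (T + L - 1) Sylvester (convolution) matrix of the signal x."""
    T = len(x)
    X = np.zeros((L, T + L - 1))
    for i in range(L):
        X[i, i:i + T] = x
    return X

rng = np.random.default_rng(4)
L = 5
s = rng.normal(size=200)                  # unknown source
h1, h2 = rng.normal(size=L), rng.normal(size=L)
x1, x2 = np.convolve(s, h1), np.convolve(s, h2)

X1, X2 = sylvester(x1, L), sylvester(x2, L)
T = X1.shape[1]
R11, R22 = X1 @ X1.T / T, X2 @ X2.T / T   # auto-correlation matrices
R12, R21 = X1 @ X2.T / T, X2 @ X1.T / T   # cross-correlation matrices

# two-channel instance of the matrix in equation (4.6)
Rx = np.block([[R22, -R21], [-R12, R11]])
h = np.concatenate([h1, h2])
residual = np.abs(Rx @ h).max()           # ~ 0: h lies in the null space
```

Only second-order statistics of the observations enter Rx; neither the source s(n) nor the channels themselves are needed to build it, which is the whole point of the method.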

    4.1.3 How can this idea be implemented?

There are two distinct approaches to identify the SIMO system:


1. An eigenvalue decomposition is performed on the matrix Rx and its null space is computed [21]. Under the hypothesis that R_s = E{s(n) sᵀ(n)} is of full rank, this null space corresponds to the unknown system h. This batch method is discussed in section 4.2.

2. A set of filters g_i(n) is adaptively estimated such that ∀(i, j): h_i(n) * g_j(n) − h_j(n) * g_i(n) = 0 [22]. This iterative method is discussed in section 4.3.

4.1.4 Why do the channels have to be coprime?

The second hypothesis, which requires that the channel transfer functions do not share any common zeros, can be explained as follows.

Consider, for example, two channels with impulse responses h_1(n) and h_2(n). If the transfer functions of these channels share common zeros, then the impulse responses can be rewritten as

h_1(n) = d(n) * h̃_1(n),   (4.10)
h_2(n) = d(n) * h̃_2(n),   (4.11)

where d(n) is, by analogy with polynomials, the greatest common divisor of h_1(n) and h_2(n), and the transfer functions of h̃_1(n) and h̃_2(n) are coprime (do not share any common zeros). Then x_1(n) and x_2(n) become

x_1(n) = (s(n) * d(n)) * h̃_1(n),   (4.12)
x_2(n) = (s(n) * d(n)) * h̃_2(n),   (4.13)

and, if the correlation matrix of s(n) * d(n) is of full rank, the methods will identify the system [h̃_1(n) h̃_2(n)] instead of [h_1(n) h_2(n)].
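A small numerical experiment illustrates why common zeros are problematic: the cross-relation is satisfied by the reduced coprime system just as well as by the true one, so the shared factor d(n) is invisible to the method. All signals and lengths below are illustrative.

```python
import numpy as np

# Toy illustration of the common-zero problem: the cross-relation
# x1 * g2 - x2 * g1 = 0 holds for the reduced system [h1_tilde, h2_tilde]
# as well, so the shared factor d(n) cannot be recovered.
rng = np.random.default_rng(5)
d = np.array([1.0, -0.8])                 # shared factor: common zero at z = 0.8
h1_tilde = rng.normal(size=4)             # coprime parts
h2_tilde = rng.normal(size=4)
h1 = np.convolve(d, h1_tilde)             # equation (4.10)
h2 = np.convolve(d, h2_tilde)             # equation (4.11)

s = rng.normal(size=300)
x1 = np.convolve(s, h1)                   # equation (4.12)
x2 = np.convolve(s, h2)                   # equation (4.13)

# both the true and the reduced system satisfy the cross-relation
res_full = np.abs(np.convolve(x1, h2) - np.convolve(x2, h1)).max()
res_reduced = np.abs(np.convolve(x1, h2_tilde) - np.convolve(x2, h1_tilde)).max()
```

Both residuals vanish, so from the observations alone the method has no way to prefer [h_1, h_2] over [h̃_1, h̃_2]; the identified channels are then the shorter, reduced ones.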

    4.1.5 Estimation of the length of the filters

Both the batch and the iterative implementation require that the lengths of the channels are given. The estimation of these lengths is very important, as we will explain in this subsection.

In the two-microphone case, the channel estimation tries to find two FIR filters g_1(n) and g_2(n) of length L_g + 1 such that

h_1(n) * g_2(n) − h_2(n) * g_1(n) = 0,   (4.14)


where h_1(n) and h_2(n) are the two unknown FIR filters we want to identify. The lengths of these filters are equal to L_h + 1, which is also unknown.

In the z-domain, the relation between the filters can be written as an equality of polynomials:

H_1(z) G_2(z) = H_2(z) G_1(z).   (4.15)

The polynomials H_1(z)G_2(z) and H_2(z)G_1(z) are equal if and only if they have exactly the same L_h + L_g zeros. As H_1(z) and H_2(z) do not share common zeros, each zero of H_1(z), resp. H_2(z), must also be a zero of G_1(z), resp. G_2(z). G_1(z) and G_2(z) therefore contain at least L_h zeros, so L_g ≥ L_h.

When L_g = L_h, the method directly returns the estimated channels. However, when the length of the filters (or channel order) is over-estimated, additional

    zeros appear. Figure 4.2 illustrates the system in the two-m