ML Estimation and CRBs for Reverberation, Speech and Noise PSDs in Rank-Deficient Noise Field

Yaron Laufer, Student Member, IEEE, Bracha Laufer-Goldshtein, Student Member, IEEE, and Sharon Gannot, Senior Member, IEEE

Abstract—Speech communication systems are prone to performance degradation in reverberant and noisy acoustic environments. Dereverberation and noise reduction algorithms typically require several model parameters, e.g. the speech, reverberation and noise power spectral densities (PSDs). A commonly used assumption is that the noise PSD matrix is known. However, in practical acoustic scenarios, the noise PSD matrix is unknown and should be estimated along with the speech and reverberation PSDs. In this paper, we consider the case of a rank-deficient noise PSD matrix, which arises when the noise signal consists of multiple directional noise sources whose number is smaller than the number of microphones. We derive two closed-form maximum likelihood estimators (MLEs). The first is a non-blocking-based estimator, which jointly estimates the speech, reverberation and noise PSDs; the second is a blocking-based estimator, which first blocks the speech signal and then jointly estimates the reverberation and noise PSDs. Both estimators are analytically compared and analyzed, and mean square error (MSE) expressions are derived. Furthermore, Cramér-Rao Bounds (CRBs) on the estimated PSDs are derived. The proposed estimators are examined using both simulated and real reverberant and noisy signals, demonstrating the advantage of the proposed method compared to competing estimators.

Index Terms—Dereverberation, Noise reduction, Maximum likelihood estimation, Cramér-Rao Bound.

I. INTRODUCTION

In many hands-free scenarios, the measured microphone signals are corrupted by additive background noise, which may originate both from environmental sources and from microphone responses. Apart from noise, if the recording takes place in an enclosed space, the recorded signals may also contain multiple sound reflections from walls and other objects in the room, resulting in reverberation. As the level of noise and reverberation increases, the perceived quality and intelligibility of the speech signal deteriorate, which in turn affects the performance of speech communication systems, as well as automatic speech recognition (ASR) systems.

In order to reduce the effects of reverberation and noise, speech enhancement algorithms are required, which aim at recovering the clean speech source from the recorded microphone signals. Speech dereverberation and noise reduction algorithms often require the power spectral densities (PSDs) of the speech, reverberation and noise components. In the multichannel framework, a commonly used assumption (see e.g. [1]–[10]) is that the late reverberant signal is modelled

Yaron Laufer, Bracha Laufer-Goldshtein and Sharon Gannot are with the Faculty of Engineering, Bar-Ilan University, Ramat-Gan, 5290002, Israel (e-mail: [email protected], [email protected], [email protected]).

as a spatially homogeneous sound field, with a time-varying PSD multiplied by a time-invariant spherically diffuse spatial coherence matrix. As the spatial coherence matrix depends only on the microphone geometry, it can be calculated in advance. However, the reverberation PSD is an unknown parameter that should be estimated. Numerous methods exist for estimating the reverberation PSD. They are broadly divided into two classes, namely non-blocking-based estimators and blocking-based estimators. The non-blocking-based approach jointly estimates the PSDs of the late reverberation and speech. The estimation is carried out using the maximum likelihood (ML) criterion [3], [6] or in the least-squares (LS) sense, by minimizing the Frobenius norm of an error PSD matrix [8]. In the blocking-based method, the desired speech signal is first blocked using a blocking matrix (BM), and then the reverberation PSD is estimated. Estimators in this class are also based on the ML approach [1], [5], [7], [9] or the LS criterion [2], [4].

None of the previously mentioned methods includes an estimator for the noise PSD. In [1], [3], a noiseless scenario is assumed. In [2], [4]–[10], the noise PSD matrix is assumed to be known in advance, or an estimate thereof is assumed to be available. Typically, the noise PSD matrix is assumed to be time-invariant, and can therefore be estimated during speech-absent periods using a voice activity detector (VAD). However, in practical acoustic scenarios the spectral characteristics of the noise might be time-varying, e.g. when the noise environment includes a background radio or TV, and thus a VAD-based algorithm may fail. Therefore, the noise PSD matrix has to be included in the estimation procedure.

Some papers in the field deal with performance analysis of the proposed estimators. We give a brief review of the commonly used tools to assess the quality of an estimator. Theoretical analysis of estimators typically consists of calculating the bias and the mean square error (MSE), which coincides with the variance for unbiased estimators. The Cramér-Rao Bound (CRB) is an important tool to evaluate the quality of any unbiased estimator, since it gives a lower bound on the MSE. An estimator that is unbiased and attains the CRB is called efficient. The maximum likelihood estimator (MLE) is asymptotically efficient [11], namely it attains the CRB when the number of samples is large.

Theoretical analysis of PSD estimators in the noise-free scenario was addressed in [12], [13]. In [12], CRBs were derived for the reverberation and speech MLEs proposed in [3]. These MLEs are efficient, i.e. they attain the CRB for any number of samples. In addition, it was pointed out that the non-blocking-based reverberation MLE derived in [3] is identical to the blocking-based MLE proposed in [1]. In [13], it was shown that the non-blocking-based reverberation MLE of [3] obtains a lower MSE compared to a noiseless version of the blocking-based LS estimator derived in [2].

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TASLP.2019.2962689

Copyright (c) 2020 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

In the noisy case, quality assessment was discussed in [9] and [14]. In [9], it was numerically demonstrated that an iterative blocking-based MLE yields a lower MSE than the blocking-based LS estimator proposed in [2]. In [14], closed-form CRBs were derived for the two previously proposed MLEs of the reverberation PSD, namely the blocking-based estimator in [5] and the non-blocking-based estimator in [6]. The CRB for the non-blocking-based reverberation estimator was shown to be lower than the CRB for the blocking-based estimator. However, it was shown that in the noiseless case, both reverberation MLEs are identical and both CRBs coincide.

As opposed to previous works, the assumption of a known noise PSD matrix is not made in [15]. The noise is modelled as a spatially homogeneous sound field, with a time-varying PSD multiplied by a time-invariant spatial coherence matrix. It is assumed that the spatial coherence matrix of the noise is known in advance, while the time-varying PSD is unknown. Two different estimators were developed, based on the LS method. In the first one, a joint estimator for the speech, noise and late reverberation PSDs was developed. As an alternative, a blocking-based estimator was proposed, in which the speech signal is first blocked by a BM, and then the noise and reverberation PSDs are jointly estimated. This method was further extended in [16] to jointly estimate the various PSDs along with the relative early transfer function (RETF) vector, using an alternating least-squares method. However, the model used in [15] and [16] only fits spatially homogeneous noise fields that are characterized by a full-rank covariance matrix. Moreover, in [9] it was claimed that the ML approach is preferable to the LS estimation procedure.

Recently, a confirmatory factor analysis was used in [17] to jointly estimate the PSDs and the RETF. The noise is modelled as microphone self-noise, with a time-invariant diagonal PSD matrix. A closed-form solution is not available for this case, thus requiring iterative optimization techniques.

In this paper, we treat the noise PSD matrix as an unknown parameter. We assume that the noise PSD matrix is rank-deficient, as opposed to the spatially homogeneous assumption considered in [15]. This scenario arises when the noise signal consists of a set of directional noise sources whose number is smaller than the number of microphones. We assume that the positions of the noise sources are fixed, while their spectral PSD matrix is time-varying, e.g. when the acoustic environment includes a radio or TV. It should be emphasized that, in contrast to [15], which estimates only a scalar PSD of the noise, in our model the entire spectral PSD matrix of the noise is estimated, and thus the case of multiple non-stationary noise sources can be handled. We derive closed-form MLEs of the various PSDs, for both the non-blocking-based and the blocking-based methods. The proposed estimators are analytically studied and compared, and the corresponding MSE expressions are derived. Furthermore,

CRBs for estimating the various PSDs are derived.

An important benefit of treating the rank-deficient noise case as a separate problem is due to the form of the solution. In the ML framework, a closed-form solution exists for the noiseless case [1], [3] but not for the full-rank noise scenario, which requires iterative optimization techniques [5], [6], [9] (as opposed to the LS method, which has closed-form solutions in both cases). However, we show here that when the noise PSD matrix is rank-deficient, a closed-form MLE exists, which yields a simpler and faster estimation procedure with low computational complexity that is not sensitive to local maxima.

The remainder of the paper is organized as follows. Section II presents the problem formulation and describes the probabilistic model. Section III derives the MLEs for both the non-blocking-based and the blocking-based methods, and Section IV presents the CRB derivation. Section V demonstrates the performance of the proposed estimators in an experimental study based on both simulated data and recorded room impulse responses (RIRs). The paper is concluded in Section VI.

II. PROBLEM FORMULATION

In this section, we formulate the dereverberation and noise reduction problem. Scalars are denoted by regular lowercase letters, vectors by bold lowercase letters and matrices by bold uppercase letters. A list of notations used in our derivations is given in Table I.

A. Signal Model

Consider a speech signal received by N microphones in a noisy and reverberant acoustic environment. We work with the short-time Fourier transform (STFT) representation of the measured signals. Let k ∈ [1, K] denote the frequency bin index, and m ∈ [1, M] the time frame index. The N-channel observation signal y(m, k) ∈ C^N writes

\[
\mathbf{y}(m,k) = \mathbf{x}_e(m,k) + \mathbf{r}(m,k) + \mathbf{u}(m,k), \tag{1}
\]

where x_e(m, k) ∈ C^N denotes the direct and early reverberation speech component, r(m, k) ∈ C^N denotes the late reverberation speech component and u(m, k) ∈ C^N denotes the noise. The direct and early reverberation speech component is given by x_e(m, k) = g_d(k) s(m, k), where s(m, k) ∈ C is the direct and early speech component as received by the first microphone (designated as the reference microphone), and g_d(k) = [1, g_{d,2}(k), ..., g_{d,N}(k)]^⊤ ∈ C^N is the time-invariant relative early transfer function (RETF) vector between the reference microphone and all microphones. In this paper, we follow previous works in the field, e.g. [2], [13]–[15], and neglect the early reflections. Thus, the target signal s(m, k) is approximated as the direct component at the reference microphone, and g_d(k) reduces to the relative direct-path transfer function (RDTF) vector. It is assumed that the noise signal consists of T noise sources, i.e.

\[
\mathbf{u}(m,k) = \mathbf{A}_u(k)\mathbf{s}_u(m,k), \tag{2}
\]

where s_u(m, k) ∈ C^T denotes the vector of noise sources and A_u(k) ∈ C^{N×T} is the noise acoustic transfer function (ATF)


TABLE I: Nomenclature

Symbol     Meaning
(·)^⊤      transpose
(·)^H      conjugate transpose
(·)^*      complex conjugate
|·|        determinant of a matrix
Tr[·]      trace of a matrix
⊗          Kronecker product
vec(·)     stacking the columns of a matrix on top of one another
N          number of microphones, n ∈ {1, ..., N}
M          number of time frames, m ∈ {1, ..., M}
K          number of frequency bins, k ∈ {1, ..., K}
T          number of noise sources, t ∈ {1, ..., T}
y          observation signal
s          direct speech component
g_d        relative direct-path transfer function
r          late reverberation component
u          noise signal
A_u        noise ATF matrix
s_u        vector of noise sources
φ_S        speech PSD
φ_R        reverberation PSD
Γ_R        reverberation coherence matrix
Ψ_u        noise PSD matrix

matrix, assumed to be time-invariant. It is further assumed that T ≤ N − 2.

B. Probabilistic Model

The speech STFT coefficients are assumed to follow a zero-mean complex Gaussian distribution¹ with a time-varying PSD φ_S(m, k). Hence, the PDF of the speech writes:

\[
p\big(s(m,k);\phi_S(m,k)\big) = \mathcal{N}_c\big(s(m,k);0,\phi_S(m,k)\big). \tag{3}
\]

The late reverberation signal is modelled by a zero-mean complex multivariate Gaussian distribution:

\[
p\big(\mathbf{r}(m,k);\boldsymbol{\Phi}_r(m,k)\big) = \mathcal{N}_c\big(\mathbf{r}(m,k);\mathbf{0},\boldsymbol{\Phi}_r(m,k)\big). \tag{4}
\]

The reverberation PSD matrix is modelled as a spatially homogeneous and isotropic sound field with a time-varying PSD, Φ_r(m, k) = φ_R(m, k) Γ_R(k). It is assumed that the time-invariant coherence matrix Γ_R(k) can be modelled by a spherically diffuse sound field [19]:

\[
\Gamma_{R,ij}(k) = \operatorname{sinc}\!\left(\frac{2\pi f_s k}{K}\,\frac{d_{ij}}{c}\right), \tag{5}
\]

where sinc(x) = sin(x)/x, d_{ij} is the inter-distance between microphones i and j, f_s denotes the sampling frequency and c is the sound velocity.

¹The multivariate complex Gaussian probability density function (PDF) is given by \(\mathcal{N}_c(\mathbf{a};\boldsymbol{\mu}_a,\boldsymbol{\Phi}_a) = \frac{1}{|\pi\boldsymbol{\Phi}_a|}\exp\!\big(-(\mathbf{a}-\boldsymbol{\mu}_a)^H\boldsymbol{\Phi}_a^{-1}(\mathbf{a}-\boldsymbol{\mu}_a)\big)\), where μ_a is the mean vector and Φ_a is a Hermitian positive definite complex covariance matrix [18].
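The diffuse coherence model in (5) is straightforward to compute for an arbitrary array geometry. The following sketch is a minimal numpy illustration (not from the paper; the function name and arguments are mine):

```python
import numpy as np

def diffuse_coherence(mic_pos, K, fs, c=343.0):
    """Spherically diffuse spatial coherence, eq. (5):
    Gamma_{R,ij}(k) = sinc(2*pi*fs*k/K * d_ij/c), with sinc(x) = sin(x)/x."""
    N = mic_pos.shape[0]
    # pairwise microphone distances d_ij
    d = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)
    gammas = np.empty((K, N, N))
    for k in range(K):
        x = 2.0 * np.pi * fs * k / K * d / c
        # np.sinc is the normalized sinc sin(pi x)/(pi x), so rescale the argument
        gammas[k] = np.sinc(x / np.pi)
    return gammas
```

By construction each Γ_R(k) is real, symmetric, and has a unit diagonal, consistent with the text: it depends only on the microphone geometry and can be precomputed.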

The noise sources vector is modelled by a zero-mean complex multivariate Gaussian distribution with a time-varying PSD matrix Ψ_u(m, k) ∈ C^{T×T}:

\[
p\big(\mathbf{s}_u(m,k);\boldsymbol{\Psi}_u(m,k)\big) = \mathcal{N}_c\big(\mathbf{s}_u(m,k);\mathbf{0},\boldsymbol{\Psi}_u(m,k)\big). \tag{6}
\]

Using (2) and (6), it follows that u(m, k) has a zero-mean complex multivariate Gaussian distribution with a PSD matrix Φ_u(m, k) ∈ C^{N×N}, given by

\[
\boldsymbol{\Phi}_u(m,k) = \mathbf{A}_u(k)\boldsymbol{\Psi}_u(m,k)\mathbf{A}_u^H(k). \tag{7}
\]

Note that rank(Φ_u) = rank(Ψ_u) = T ≤ N − 2, i.e. Φ_u is a rank-deficient matrix. The PDF of y(m, k) writes

\[
p\big(\mathbf{y}(m,k);\boldsymbol{\Phi}_y(m,k)\big) = \mathcal{N}_c\big(\mathbf{y}(m,k);\mathbf{0},\boldsymbol{\Phi}_y(m,k)\big), \tag{8}
\]

where Φ_y is the PSD matrix of the input signals. Assuming that the components in (1) are independent, Φ_y is given by

\[
\boldsymbol{\Phi}_y(m,k) = \phi_S(m,k)\,\mathbf{g}_d(k)\mathbf{g}_d^H(k) + \phi_R(m,k)\,\boldsymbol{\Gamma}_R(k) + \mathbf{A}_u(k)\boldsymbol{\Psi}_u(m,k)\mathbf{A}_u^H(k). \tag{9}
\]

A commonly used dereverberation and noise reduction technique is to estimate the speech signal using the multichannel minimum mean square error (MMSE) estimator, which yields the multichannel Wiener filter (MCWF), given by [20]:

\[
\hat{s}_{\mathrm{MCWF}}(m,k) = \frac{\mathbf{g}_d^H(k)\,\boldsymbol{\Phi}_i^{-1}(m,k)}{\mathbf{g}_d^H(k)\,\boldsymbol{\Phi}_i^{-1}(m,k)\,\mathbf{g}_d(k) + \phi_S^{-1}(m,k)}\,\mathbf{y}(m,k), \tag{10}
\]

where

\[
\boldsymbol{\Phi}_i(m,k) \triangleq \phi_R(m,k)\,\boldsymbol{\Gamma}_R(k) + \mathbf{A}_u(k)\boldsymbol{\Psi}_u(m,k)\mathbf{A}_u^H(k) \tag{11}
\]

denotes the total interference PSD matrix. For implementing (10), we assume that the RDTF vector g_d and the spatial coherence matrix Γ_R are known in advance. The RDTF depends only on the direction of arrival (DOA) of the speaker and the geometry of the microphone array, and thus it can be constructed based on a DOA estimate. The spatial coherence matrix is calculated using (5), based on the spherical-diffuseness assumption. It follows that estimators of the late reverberation, speech and noise PSDs are required for evaluating the MCWF.
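Given the model parameters, evaluating (10) with the interference matrix (11) is a few lines of linear algebra. The sketch below is a minimal numpy illustration under the paper's model (the function name and argument names are mine, and `A_u`/`Psi_u` stand for any factorization of the noise term in (11)):

```python
import numpy as np

def mcwf(y, g_d, phi_S, phi_R, Gamma_R, A_u, Psi_u):
    """Multichannel Wiener filter, eq. (10), at one time-frequency bin."""
    Phi_i = phi_R * Gamma_R + A_u @ Psi_u @ A_u.conj().T   # eq. (11)
    a = np.linalg.solve(Phi_i, g_d)                        # Phi_i^{-1} g_d
    # g_d^H Phi_i^{-1} g_d is real for Hermitian positive definite Phi_i
    denom = np.real(g_d.conj() @ a) + 1.0 / phi_S
    return (a.conj() @ y) / denom                          # eq. (10)
```

Note that as φ_S grows, (10) approaches an MVDR-like distortionless response, so feeding the filter a pure speech snapshot y = g_d yields an output close to 1.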

It should be noted that directly estimating the complete N×N time-varying noise PSD matrix Φ_u(m, k), along with the speech and reverberation PSDs, is a complex problem, and a closed-form ML solution is not available. In this paper, we rely on the decomposition of the noise PSD matrix in (7), along with the rank-deficiency assumption. By utilizing the time-invariant ATF matrix A_u(k), we can construct a projection matrix onto the subspace orthogonal to the noise subspace, which generates nulls towards the noise sources. In the following section, we show that this step enables the derivation of closed-form MLEs for the various PSDs.

The noise ATF matrix A_u(k) is in general not available, since the estimation of the individual ATFs requires each noise source to be active separately. Instead, we only assume the availability of a speech-absent segment, denoted by m_0, in which all noise sources are concurrently active. Based on this segment, we compute a basis for the noise subspace that can be used instead of the unknown ATF matrix. To this end, we compute the noise PSD matrix Φ_u(m_0, k) and apply the eigenvalue decomposition (EVD) to the resulting matrix. Recall that the rank of Φ_u is T. Accordingly, a rank-T representation of the noise PSD matrix is formed by the computed eigenvalues and eigenvectors:

\[
\boldsymbol{\Phi}_u(m_0,k) = \mathbf{V}(k)\boldsymbol{\Lambda}(m_0,k)\mathbf{V}^H(k), \tag{12}
\]

where Λ(m_0, k) ∈ C^{T×T} is the diagonal matrix of the non-zero eigenvalues and V(k) = [v_1(k), ..., v_T(k)] ∈ C^{N×T} is the corresponding eigenvector matrix. V(k) is a basis that spans the noise ATF subspace, and thus [21]

\[
\mathbf{A}_u(k) = \mathbf{V}(k)\mathbf{G}(k), \tag{13}
\]

where G(k) ∈ C^{T×T} consists of the projection coefficients of the original ATFs on the basis vectors. Substituting (13) into (2) and using (1) yields

\[
\mathbf{y}(m,k) = \mathbf{g}_d(k)s(m,k) + \mathbf{r}(m,k) + \mathbf{V}(k)\mathbf{s}_v(m,k), \tag{14}
\]

where s_v(m, k) ≜ G(k) s_u(m, k). It follows that the noise PSD matrix in (7) can be recast as

\[
\boldsymbol{\Phi}_u(m,k) = \mathbf{V}(k)\boldsymbol{\Psi}_v(m,k)\mathbf{V}^H(k), \tag{15}
\]

where Ψ_v(m, k) = G(k) Ψ_u(m, k) G^H(k). Using this basis change, the MCWF in (10) is now computed with

\[
\boldsymbol{\Phi}_i(m,k) = \phi_R(m,k)\,\boldsymbol{\Gamma}_R(k) + \mathbf{V}(k)\boldsymbol{\Psi}_v(m,k)\mathbf{V}^H(k). \tag{16}
\]

As a result, rather than requiring knowledge of the exact noise ATF matrix, we use V, which is learned from a speech-absent segment. Due to this basis change, we will need to estimate Ψ_v instead of Ψ_u.

To summarize, estimators of φ_R, φ_S and Ψ_v are required. For the sake of brevity, the frame index m and the frequency bin index k are henceforth omitted whenever possible.
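The subspace step in (12) can be sketched as follows: form the sample noise PSD matrix from a speech-absent segment, apply the EVD, and keep the eigenvectors of the T largest eigenvalues. A minimal numpy illustration (the function name and data layout are my own assumptions):

```python
import numpy as np

def noise_subspace_basis(u_frames, T):
    """Orthonormal basis V of the noise subspace, via EVD of the sample
    noise PSD matrix (cf. eq. (12)). u_frames: (num_frames, N) noise-only
    STFT snapshots at one frequency bin."""
    L = u_frames.shape[0]
    # sample covariance: (1/L) * sum_l u(l) u(l)^H
    R = u_frames.T @ u_frames.conj() / L
    w, Vfull = np.linalg.eigh(R)      # eigenvalues in ascending order
    return Vfull[:, -T:]              # eigenvectors of the T largest eigenvalues
```

When the snapshots truly obey u = A_u s_u with T sources, the returned columns span the same subspace as A_u, which is all that (13)-(16) require.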

III. ML ESTIMATORS

We propose two ML-based methods: (i) Non-blocking-based estimation: simultaneous ML estimation of the reverberation, speech and noise PSDs; and (ii) Blocking-based estimation: elimination of the speech PSD using a blocking matrix (BM), and then joint ML estimation of the reverberation and noise PSDs. Both methods are then compared and analyzed.

A. Non-Blocking-Based Estimation

We start with the joint ML estimation of the reverberation, speech and noise PSDs. Based on the short-time stationarity assumption [9], [12], it is assumed that the PSDs are approximately constant across a small number of consecutive time frames, denoted by L. We therefore denote by ȳ(m) ∈ C^{LN} the concatenation of the L most recent observations:

\[
\bar{\mathbf{y}}(m) \triangleq \big[\mathbf{y}^\top(m-L+1), \cdots, \mathbf{y}^\top(m)\big]^\top. \tag{17}
\]

The set of unknown parameters is denoted by φ(m) = [φ_R(m), φ_S(m), ψ_V^⊤(m)]^⊤, where ψ_V is a vector containing all elements of Ψ_v, i.e. ψ_V = vec(Ψ_v). Assuming that the L consecutive signals in ȳ are i.i.d., the PDF of ȳ writes (see e.g. [14]):

\[
p\big(\bar{\mathbf{y}}(m);\boldsymbol{\phi}(m)\big) = \left(\frac{1}{\pi^N|\boldsymbol{\Phi}_y(m)|}\exp\!\Big(-\operatorname{Tr}\big[\boldsymbol{\Phi}_y^{-1}(m)\mathbf{R}_y(m)\big]\Big)\right)^{\!L}, \tag{18}
\]

where R_y is the sample covariance matrix, given by

\[
\mathbf{R}_y(m) = \frac{1}{L}\sum_{\ell=m-L+1}^{m}\mathbf{y}(\ell)\mathbf{y}^H(\ell). \tag{19}
\]

The MLE of the set φ(m) is therefore given by

\[
\hat{\boldsymbol{\phi}}^{\mathrm{ML},y}(m) = \arg\max_{\boldsymbol{\phi}(m)} \log p\big(\bar{\mathbf{y}}(m);\boldsymbol{\phi}(m)\big). \tag{20}
\]

To the best of our knowledge, for the general noisy scenario this problem has no closed-form solution. However, we will show that when the noise PSD matrix Φ_u is rank-deficient, with T = rank(Φ_u) ≤ N − 2, a closed-form solution exists. In the following, we present the proposed estimators; the detailed derivations appear in the Appendices.

In Appendix A, it is shown that the MLE of φ_R(m) is given by:

\[
\hat{\phi}_R^{\mathrm{ML},y}(m) = \frac{1}{N-(T+1)}\operatorname{Tr}\big[\mathbf{Q}\,\mathbf{R}_y(m)\,\boldsymbol{\Gamma}_R^{-1}\big], \tag{21}
\]

where Q ∈ C^{N×N} is given by

\[
\mathbf{Q} = \mathbf{I}_N - \mathbf{A}\big(\mathbf{A}^H\boldsymbol{\Gamma}_R^{-1}\mathbf{A}\big)^{-1}\mathbf{A}^H\boldsymbol{\Gamma}_R^{-1}, \tag{22}
\]

and A ∈ C^{N×(T+1)} is the speech-plus-noise subspace

\[
\mathbf{A} = [\mathbf{g}_d, \mathbf{v}_1, \cdots, \mathbf{v}_T]. \tag{23}
\]

The matrix Q is a projection matrix onto the subspace orthogonal to the speech-plus-noise subspace. The role of Q is to block the directions of the desired speech and noise signals, in order to estimate the reverberation level.
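The estimator (21)-(23) is a direct computation once g_d, V and Γ_R are in hand. The following sketch is a minimal numpy rendering of those equations (function and variable names are mine). Because Q annihilates both g_d and V, applying (21) to the exact model covariance (9) returns the true φ_R, since Tr[Q] = N − (T+1):

```python
import numpy as np

def reverb_psd_mle(R_y, Gamma_R, g_d, V):
    """Non-blocking-based reverberation PSD MLE, eqs. (21)-(23)."""
    N = Gamma_R.shape[0]
    T = V.shape[1]
    A = np.column_stack([g_d, V])          # A = [g_d, v_1, ..., v_T], eq. (23)
    Gi = np.linalg.inv(Gamma_R)
    # oblique projection blocking the speech-plus-noise subspace, eq. (22)
    Q = np.eye(N) - A @ np.linalg.inv(A.conj().T @ Gi @ A) @ A.conj().T @ Gi
    return np.real(np.trace(Q @ R_y @ Gi)) / (N - (T + 1))   # eq. (21)
```

A quick sanity check: feeding R_y = φ_S g_d g_d^H + φ_R Γ_R + V Ψ_v V^H recovers φ_R exactly.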

Once we obtain the MLE of the late reverberation PSD, the MLEs of the speech and noise PSDs can be computed. In Appendix B, it is shown that the MLE of the speech PSD writes

\[
\hat{\phi}_S^{\mathrm{ML},y}(m) = \mathbf{w}_s^H\big(\mathbf{R}_y(m) - \hat{\phi}_R^{\mathrm{ML},y}(m)\,\boldsymbol{\Gamma}_R\big)\mathbf{w}_s, \tag{24}
\]

where w_s ∈ C^N is a minimum variance distortionless response (MVDR) beamformer that extracts the speech signal while eliminating the noise, given by

\[
\mathbf{w}_s^H = \frac{\mathbf{g}_d^H \mathbf{P}_v^{\perp} \boldsymbol{\Gamma}_R^{-1}}{\mathbf{g}_d^H \mathbf{P}_v^{\perp} \boldsymbol{\Gamma}_R^{-1} \mathbf{g}_d}, \tag{25}
\]

and P_v^⊥ ∈ C^{N×N} is a projection matrix onto the subspace orthogonal to the noise subspace, given by

\[
\mathbf{P}_v^{\perp} = \mathbf{I}_N - \boldsymbol{\Gamma}_R^{-1}\mathbf{V}\big(\mathbf{V}^H \boldsymbol{\Gamma}_R^{-1} \mathbf{V}\big)^{-1}\mathbf{V}^H. \tag{26}
\]

Note that

\[
\mathbf{w}_s^H \mathbf{g}_d = 1, \qquad \mathbf{w}_s^H \mathbf{V} = \mathbf{0}. \tag{27}
\]

The estimator in (24) can be interpreted as the variance of the noisy observations minus the estimated variance of the reverberation, at the output of the MVDR beamformer [12].

In Appendix C, it is shown that the MLE of the noise PSD can be computed with

\[
\hat{\boldsymbol{\Psi}}_v^{\mathrm{ML},y}(m) = \mathbf{W}_u^H\big(\mathbf{R}_y(m) - \hat{\phi}_R^{\mathrm{ML},y}(m)\,\boldsymbol{\Gamma}_R\big)\mathbf{W}_u, \tag{28}
\]

where W_u ∈ C^{N×T} is a multi-source linearly constrained minimum variance (LCMV) beamformer that extracts the noise signals while eliminating the speech signal:

\[
\mathbf{W}_u^H = \big(\mathbf{V}^H \mathbf{P}_g^{\perp} \boldsymbol{\Gamma}_R^{-1} \mathbf{V}\big)^{-1}\mathbf{V}^H \mathbf{P}_g^{\perp} \boldsymbol{\Gamma}_R^{-1}, \tag{29}
\]

and P_g^⊥ ∈ C^{N×N} is a projection matrix onto the subspace orthogonal to the speech subspace, given by

\[
\mathbf{P}_g^{\perp} = \mathbf{I}_N - \frac{\boldsymbol{\Gamma}_R^{-1}\mathbf{g}_d\mathbf{g}_d^H}{\mathbf{g}_d^H\boldsymbol{\Gamma}_R^{-1}\mathbf{g}_d}. \tag{30}
\]

Note that

\[
\mathbf{W}_u^H \mathbf{g}_d = \mathbf{0}, \qquad \mathbf{W}_u^H \mathbf{V} = \mathbf{I}_T. \tag{31}
\]

Interestingly, the projection matrix Q can be recast as a linear combination of the above beamformers (see Appendix D):

\[
\mathbf{Q} = \mathbf{I}_N - \mathbf{g}_d\mathbf{w}_s^H - \mathbf{V}\mathbf{W}_u^H. \tag{32}
\]

Using (27), (31) and (32), it can also be noted that Q is orthogonal to both beamformers:

\[
\mathbf{w}_s^H \mathbf{Q} = \mathbf{0}, \qquad \mathbf{W}_u^H \mathbf{Q} = \mathbf{0}. \tag{33}
\]

In the noiseless case, i.e. when T = 0, Q reduces to

\[
\mathbf{Q} = \mathbf{I}_N - \mathbf{g}_d\mathbf{w}_{s_0}^H = \big(\mathbf{P}_g^{\perp}\big)^H, \tag{34}
\]

where \(\mathbf{w}_{s_0}^H = \frac{\mathbf{g}_d^H\boldsymbol{\Gamma}_R^{-1}}{\mathbf{g}_d^H\boldsymbol{\Gamma}_R^{-1}\mathbf{g}_d}\), leading to the same closed-form estimators as in [3, Eq. (7)]:

\[
\hat{\phi}_R^{\mathrm{ML},y}(m) = \frac{1}{N-1}\operatorname{Tr}\big[\big(\mathbf{I}_N - \mathbf{g}_d\mathbf{w}_{s_0}^H\big)\mathbf{R}_y(m)\,\boldsymbol{\Gamma}_R^{-1}\big], \tag{35}
\]

\[
\hat{\phi}_S^{\mathrm{ML},y}(m) = \mathbf{w}_{s_0}^H\big(\mathbf{R}_y(m) - \hat{\phi}_R^{\mathrm{ML},y}(m)\,\boldsymbol{\Gamma}_R\big)\mathbf{w}_{s_0}. \tag{36}
\]
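The two beamformers (25) and (29) turn (24) and (28) into simple quadratic forms. The sketch below is a minimal numpy rendering of (24)-(30) (names are mine). Because w_s^H g_d = 1, w_s^H V = 0, W_u^H g_d = 0 and W_u^H V = I_T, applying these formulas to the exact model covariance with the true φ_R returns the true φ_S and Ψ_v:

```python
import numpy as np

def speech_noise_psd_mle(R_y, Gamma_R, g_d, V, phi_R_hat):
    """Speech PSD MLE (24)-(26) and noise PSD-matrix MLE (28)-(30)."""
    N = Gamma_R.shape[0]
    Gi = np.linalg.inv(Gamma_R)
    # MVDR extracting the speech while nulling the noise subspace, eqs. (25)-(26)
    P_v = np.eye(N) - Gi @ V @ np.linalg.inv(V.conj().T @ Gi @ V) @ V.conj().T
    num = g_d.conj() @ P_v @ Gi
    w_s_H = num / np.real(num @ g_d)
    # multi-source LCMV extracting the noise while nulling the speech, eqs. (29)-(30)
    P_g = np.eye(N) - np.outer(Gi @ g_d, g_d.conj()) / np.real(g_d.conj() @ Gi @ g_d)
    VPG = V.conj().T @ P_g @ Gi
    W_u_H = np.linalg.inv(VPG @ V) @ VPG
    # subtract the estimated reverberation, then beamform, eqs. (24), (28)
    D = R_y - phi_R_hat * Gamma_R
    phi_S = np.real(w_s_H @ D @ w_s_H.conj())
    Psi_v = W_u_H @ D @ W_u_H.conj().T
    return phi_S, Psi_v
```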

B. Blocking-Based Estimation

As a second approach, we first block the speech component using a BM, and then jointly estimate the PSDs of the reverberation and noise. Let B ∈ C^{N×(N−1)} denote the BM, which satisfies B^H g_d = 0. The output of the BM is given by

\[
\mathbf{z}(m) \triangleq \mathbf{B}^H\mathbf{y}(m) = \mathbf{B}^H\big(\mathbf{r}(m) + \mathbf{u}(m)\big). \tag{37}
\]

The PDF of z(m) ∈ C^{N−1} therefore writes:

\[
p\big(\mathbf{z}(m);\boldsymbol{\Phi}_z(m)\big) = \mathcal{N}_c\big(\mathbf{z}(m);\mathbf{0},\boldsymbol{\Phi}_z(m)\big), \tag{38}
\]

with the PSD matrix Φ_z(m) ∈ C^{(N−1)×(N−1)} given by

\[
\boldsymbol{\Phi}_z(m) = \mathbf{B}^H\boldsymbol{\Phi}_i(m)\mathbf{B}, \tag{39}
\]

where Φ_i is the total interference matrix defined in (11). Multiplying (15) by B^H from the left and by B from the right, the noise PSD matrix at the output of the BM writes

\[
\mathbf{B}^H\boldsymbol{\Phi}_u(m)\mathbf{B} = \tilde{\mathbf{V}}\boldsymbol{\Psi}_v(m)\tilde{\mathbf{V}}^H, \tag{40}
\]

where Ṽ ∈ C^{(N−1)×T} is the reduced noise subspace:

\[
\tilde{\mathbf{V}} \triangleq \mathbf{B}^H\mathbf{V} = [\tilde{\mathbf{v}}_1, \cdots, \tilde{\mathbf{v}}_T]. \tag{41}
\]

Substituting (16) into (39) and using (41) yields

\[
\boldsymbol{\Phi}_z(m) = \phi_R(m)\,\mathbf{B}^H\boldsymbol{\Gamma}_R\mathbf{B} + \tilde{\mathbf{V}}\boldsymbol{\Psi}_v(m)\tilde{\mathbf{V}}^H. \tag{42}
\]

Under this model, the parameter set of interest is φ̃(m) = [φ_R(m), ψ_V^⊤(m)]^⊤. Let z̄(m) ∈ C^{L(N−1)} be defined similarly to ȳ(m) in (17), as a concatenation of L i.i.d. consecutive frames. Similarly to Φ_y(m), it is assumed that Φ_z(m) is fixed during the entire segment. The PDF of z̄ therefore writes

\[
p\big(\bar{\mathbf{z}}(m);\tilde{\boldsymbol{\phi}}(m)\big) = \left(\frac{1}{\pi^{N-1}|\boldsymbol{\Phi}_z|}\exp\!\Big(-\operatorname{Tr}\big[\boldsymbol{\Phi}_z^{-1}(m)\mathbf{R}_z(m)\big]\Big)\right)^{\!L}, \tag{43}
\]

where R_z(m) is given by

\[
\mathbf{R}_z(m) = \frac{1}{L}\sum_{\ell=m-L+1}^{m}\mathbf{z}(\ell)\mathbf{z}^H(\ell) = \mathbf{B}^H\mathbf{R}_y(m)\mathbf{B}. \tag{44}
\]

The MLE of φ̃(m) is obtained by solving:

\[
\hat{\tilde{\boldsymbol{\phi}}}^{\mathrm{ML},z}(m) = \arg\max_{\tilde{\boldsymbol{\phi}}(m)} \log p\big(\bar{\mathbf{z}}(m);\tilde{\boldsymbol{\phi}}(m)\big). \tag{45}
\]

To the best of our knowledge, this problem is also consideredas having no closed-form solution. Again, we argue that ifthe noise PSD matrix satisfies T = rank(Φu) ≤ N − 2,then we can obtain a closed-form solution. In Appendix E,the following MLE is obtained:

φML,zR (m) =

1

N − 1− TTr[QRz(m)

(BHΓRB

)−1], (46)

where Q ∈ C(N−1)×(N−1) is given by

Q = IN−1 − V(VH (BHΓRB

)−1V)−1

VH (BHΓRB)−1

.

(47)

After the BM has been applied, the remaining role of Q̃ is to block the noise signals, in order to estimate the reverberation level. Note that Tr[Q̃] = Tr[Q] = N − 1 − T.
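As an editor-added numerical sanity check (not part of the paper), the trace, idempotence and noise-blocking properties of Q̃ in (47) can be verified with random stand-ins for B^H Γ_R B and Ṽ:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 2
# Random Hermitian positive-definite stand-in for B^H Gamma_R B,
# and a random reduced noise subspace Vt (both hypothetical).
Mx = rng.standard_normal((N - 1, N - 1)) + 1j * rng.standard_normal((N - 1, N - 1))
Gb = Mx @ Mx.conj().T + (N - 1) * np.eye(N - 1)
Vt = rng.standard_normal((N - 1, T)) + 1j * rng.standard_normal((N - 1, T))

Gb_inv = np.linalg.inv(Gb)
# Q~ of (47): nulls the reduced noise subspace
Qt = np.eye(N - 1) - Vt @ np.linalg.inv(Vt.conj().T @ Gb_inv @ Vt) @ Vt.conj().T @ Gb_inv

print(np.isclose(np.trace(Qt).real, N - 1 - T))  # Tr[Q~] = N - 1 - T
print(np.allclose(Qt @ Qt, Qt))                  # Q~ is an (oblique) projection
print(np.allclose(Qt @ Vt, 0))                   # the noise subspace is blocked
```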

Given φ_R^{ML,z}, it is shown in Appendix F that the MLE of the noise PSD is

Ψ_v^{ML,z}(m) = W̃_u^H (R_z(m) − φ_R^{ML,z}(m) B^H Γ_R B) W̃_u,  (48)

where W̃_u ∈ C^{(N−1)×T} is a multi-source LCMV beamformer, directed towards the noise signals after the BM, given by

W̃_u^H = (Ṽ^H (B^H Γ_R B)^{−1} Ṽ)^{−1} Ṽ^H (B^H Γ_R B)^{−1}.  (49)

Note that with this notation, Q̃ in (47) can be recast as

Q̃ = I_{N−1} − Ṽ W̃_u^H.  (50)

Since W̃_u^H Ṽ = I_T, it also follows that

W̃_u^H Q̃ = 0.  (51)

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TASLP.2019.2962689

Copyright (c) 2020 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

Page 6: ML Estimation and CRBs for Reverberation, Speech and Noise

6

Also, in Appendix G it is shown that

W_u = B W̃_u,  (52)

namely, the LCMV of (29), used in the non-blocking-based approach, can be factorized into two stages: a BM that blocks the speech signal, followed by a modified LCMV, which recovers the noise signals at the output of the BM.

C. Comparing the MLEs

In this section, the obtained blocking-based and non-blocking-based MLEs are compared. We will use the following identity, which is proved in [14, Appendix A]:

B (B^H Γ_R B)^{−1} B^H = Γ_R^{−1} − (Γ_R^{−1} g_d g_d^H Γ_R^{−1}) / (g_d^H Γ_R^{−1} g_d).  (53)

Substituting (30) into (53) yields

B (B^H Γ_R B)^{−1} B^H = P_g^⊥ Γ_R^{−1}.  (54)
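Identity (53) holds for any full-rank BM B satisfying B^H g_d = 0. A quick editor-added numerical check, using an SVD-based B as an assumed construction:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
g = rng.standard_normal(N) + 1j * rng.standard_normal(N)
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Gamma = M @ M.conj().T + N * np.eye(N)      # Hermitian PD stand-in for Gamma_R
Gi = np.linalg.inv(Gamma)

# A blocking matrix: orthonormal basis of the null space of g^H (SVD-based)
_, _, Vh = np.linalg.svd(g.conj().reshape(1, N))
B = Vh[1:, :].conj().T

lhs = B @ np.linalg.inv(B.conj().T @ Gamma @ B) @ B.conj().T
rhs = Gi - (Gi @ np.outer(g, g.conj()) @ Gi) / (g.conj() @ Gi @ g)
print(np.allclose(lhs, rhs))                # identity (53) holds
```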

1) Comparing the reverberation PSD estimators: First, we compare the reverberation PSD estimators in (21) and (46). Substituting (50) into (46) and then using (41), (44), (52) and (53) yields the following equation:

φ_R^{ML,z} = (1/(N − 1 − T)) × Tr[(I_N − (g_d g_d^H Γ_R^{−1})/(g_d^H Γ_R^{−1} g_d)) (I_N − V W_u^H) R_y(m) Γ_R^{−1}].  (55)

Using (32) and noting that g_d^H Γ_R^{−1} Q = 0 yields (21). It follows that both estimators are identical:

φ_R^{ML,y} = φ_R^{ML,z}.  (56)

It should be noted that in [12], [14] the two MLEs of the reverberation PSD were shown to be identical in the noiseless case. Here, we extend this result to the noisy case, in which the noise PSD matrix is rank-deficient.
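The equality (56) is algebraic and holds for any sample covariance R_y. The following editor-added sketch evaluates both estimators on random data; the non-blocking Q uses A = [g_d, V] as in (22)/(83), and all matrices here are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 6, 2
g = rng.standard_normal(N) + 1j * rng.standard_normal(N)
V = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Gamma = M @ M.conj().T + N * np.eye(N)
Gi = np.linalg.inv(Gamma)
X = rng.standard_normal((N, 100)) + 1j * rng.standard_normal((N, 100))
Ry = X @ X.conj().T / 100                       # any sample covariance

# Non-blocking MLE (21): Q nulls the speech-plus-noise subspace A = [g, V]
A = np.column_stack([g, V])
Q = np.eye(N) - A @ np.linalg.inv(A.conj().T @ Gi @ A) @ A.conj().T @ Gi
phi_y = np.trace(Q @ Ry @ Gi).real / (N - 1 - T)

# Blocking MLE (46): BM first, then null the reduced noise subspace
_, _, Vh = np.linalg.svd(g.conj().reshape(1, N))
B = Vh[1:, :].conj().T
Gb_i = np.linalg.inv(B.conj().T @ Gamma @ B)
Vt = B.conj().T @ V
Qt = np.eye(N - 1) - Vt @ np.linalg.inv(Vt.conj().T @ Gb_i @ Vt) @ Vt.conj().T @ Gb_i
phi_z = np.trace(Qt @ (B.conj().T @ Ry @ B) @ Gb_i).real / (N - 1 - T)

print(np.isclose(phi_y, phi_z))                 # (56): the two MLEs coincide
```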

2) Comparing the noise PSD estimators: The noise PSD estimators in (28) and (48) are now compared. Substituting (44) into (48) and then using (52) and (56) yields the same expression as in (28); therefore,

Ψ_v^{ML,y} = Ψ_v^{ML,z}.  (57)

D. MSE Calculation

In the sequel, the theoretical performance of the proposed PSD estimators is analyzed. Since the non-blocking-based and the blocking-based MLEs were proved in Section III-C to be identical for both the reverberation and noise PSDs, it suffices to analyze the non-blocking-based MLEs.

1) Theoretical performance of the reverberation PSD estimators: It is well known that for an unbiased estimator, the MSE is identical to the variance. We therefore start by showing that the non-blocking-based MLE in (21) is unbiased. Using (19), the expectation of (21) is

E(φ_R^{ML,y}(m)) = (1/(N − 1 − T)) Tr[Φ_y(m) Γ_R^{−1} Q].  (58)

Then, we use the following property (see (84d)):

Γ_R^{−1} Q = φ_R(m) Φ_y^{−1}(m) Q,  (59)

to obtain

E(φ_R^{ML,y}(m)) = (φ_R(m)/(N − 1 − T)) Tr[Q] = φ_R(m).  (60)

It follows that the reverberation MLE is unbiased, and thus the MSE is identical to the variance. Using the i.i.d. assumption, the variance of the non-blocking-based MLE in (21) is given by

var(φ_R^{ML,y}(m)) = (1/L) (1/(N − 1 − T)^2) var(y^H(m) Γ_R^{−1} Q y(m)).  (61)

In order to simplify (61), the following identity is used. For a positive definite Hermitian form x^H Z x, where x ∼ N_c(0, Φ_x) and Z is a Hermitian matrix, the variance is given by [11, p. 513, Eq. (15.30)]:

var(x^H Z x) = Tr[Φ_x Z Φ_x Z].  (62)
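As an editor-added Monte-Carlo illustration of (62) (not part of the derivation; the covariance and Hermitian matrix below are arbitrary random choices):

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 3, 200000
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Phi = M @ M.conj().T + N * np.eye(N)            # covariance of x
Z = M + M.conj().T                              # arbitrary Hermitian matrix

C = np.linalg.cholesky(Phi)
W = (rng.standard_normal((trials, N)) + 1j * rng.standard_normal((trials, N))) / np.sqrt(2)
x = W @ C.T                                     # rows ~ Nc(0, Phi)
q = np.einsum('ti,ij,tj->t', x.conj(), Z, x).real   # x^H Z x per row

theory = np.trace(Phi @ Z @ Phi @ Z).real
print(abs(q.var() / theory - 1) < 0.05)         # matches (62) within MC error
```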

Since y(m) ∼ N_c(0, Φ_y(m)) and Γ_R^{−1} Q is a Hermitian matrix (note that Q^H = Γ_R^{−1} Q Γ_R), we obtain

var(φ_R^{ML,y}(m)) = (1/L) (1/(N − 1 − T)^2) × Tr[Φ_y(m) Γ_R^{−1} Q Φ_y(m) Γ_R^{−1} Q].  (63)

Finally, using (59) and (84b), the variance becomes

var(φ_R^{ML,y}(m)) = (1/L) φ_R^2(m)/(N − 1 − T).  (64)

Note that in the noiseless case, namely T = 0, the variance reduces to the one derived in [12], [13].
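The unbiasedness (60) and the variance expression (64) can be reproduced in simulation. A compact editor-added sketch (the scenario parameters and random matrices below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T, L, trials = 6, 2, 25, 4000
phi_R, phi_S = 0.5, 0.5

g = np.exp(-1j * rng.uniform(0, 2 * np.pi, N))
V = rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T))
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Gamma = M @ M.conj().T + N * np.eye(N)
Gamma /= np.trace(Gamma).real / N               # coherence-like scaling
Psi = np.diag(rng.uniform(0.2, 1.0, T))
Phi_y = phi_S * np.outer(g, g.conj()) + V @ Psi @ V.conj().T + phi_R * Gamma

Gi = np.linalg.inv(Gamma)
A = np.column_stack([g, V])
Q = np.eye(N) - A @ np.linalg.inv(A.conj().T @ Gi @ A) @ A.conj().T @ Gi

C = np.linalg.cholesky(Phi_y)
est = np.empty(trials)
for t in range(trials):
    W = (rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))) / np.sqrt(2)
    Y = C @ W                                   # L snapshots ~ Nc(0, Phi_y)
    Ry = Y @ Y.conj().T / L
    est[t] = np.trace(Q @ Ry @ Gi).real / (N - 1 - T)   # MLE (21)

print(abs(est.mean() / phi_R - 1) < 0.05)                         # unbiased, (60)
print(abs(est.var() / (phi_R**2 / (L * (N - 1 - T))) - 1) < 0.1)  # variance, (64)
```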

2) Theoretical performance of the noise PSD estimators: Using (19), (9) and (31), and based on the unbiasedness of φ_R^{ML,y}(m), it can be shown that Ψ_v^{ML,y}(m) in (28) is an unbiased estimator of Ψ_v(m).

Next, we calculate the variance of the diagonal terms of Ψ_v^{ML,y}(m). To this end, we write the (i, j) entry of Ψ_v^{ML,y}(m) in (28) as

[Ψ_v^{ML,y}(m)]_{ij} = w_i^H (R_y(m) − φ_R^{ML,y}(m) Γ_R) w_j,  (65)

for i, j = 1, . . . , T, where w_i is the i-th column of the matrix W_u in (29). Using a partitioned-matrix identity to simplify w_i, it can be shown that

w_i^H = (v_i^H P_{V_i}^⊥ P_g^⊥ Γ_R^{−1}) / (v_i^H P_{V_i}^⊥ P_g^⊥ Γ_R^{−1} v_i),  (66)

where V_i ∈ C^{N×(T−1)} is composed of all the vectors in V except v_i, i.e. V_i = [v_1, · · · , v_{i−1}, v_{i+1}, · · · , v_T], and P_{V_i}^⊥ ∈ C^{N×N} is the corresponding projection matrix onto the subspace orthogonal to V_i:

P_{V_i}^⊥ = I_N − P_g^⊥ Γ_R^{−1} V_i (V_i^H P_g^⊥ Γ_R^{−1} V_i)^{−1} V_i^H.  (67)

It can be verified that w_i^H v_j = δ_{ij}. Denote the diagonal terms


of Ψ_v by {ψ_i}_{i=1}^{T}. In Appendix H, it is shown that

var(ψ_i^{ML,y}(m)) = (ψ_i^2(m)/L) [((1 + ξ_i(m))/ξ_i(m))^2 + (1/(N − 1 − T)) (1/ξ_i^2(m))],  (68)

where ξ_i(m) is defined as the noise-to-reverberation ratio at the output of w_i:

ξ_i(m) ≜ ψ_i(m)/(φ_R(m) w_i^H Γ_R w_i) = (ψ_i(m)/φ_R(m)) v_i^H P_{V_i}^⊥ P_g^⊥ Γ_R^{−1} v_i.  (69)

Note that

lim_{ξ_i(m)→∞} var(ψ_i^{ML,y}(m)) = ψ_i^2(m)/L.  (70)

3) Theoretical performance of the speech PSD estimator: Using (19), (9) and (27), and based on the unbiasedness of φ_R^{ML,y}(m), it can be shown that φ_S^{ML,y}(m) is an unbiased estimator of φ_S(m). In a similar manner to (68), the variance of (24) can be shown to be

var(φ_S^{ML,y}(m)) = (φ_S^2(m)/L) [((1 + ε(m))/ε(m))^2 + (1/(N − 1 − T)) (1/ε^2(m))],  (71)

where ε(m) is defined as the signal-to-reverberation ratio at the output of w_s:

ε(m) ≜ φ_S(m)/(φ_R(m) w_s^H Γ_R w_s) = (φ_S(m)/φ_R(m)) g_d^H P_v^⊥ Γ_R^{−1} g_d.  (72)

Note that

lim_{ε(m)→∞} var(φ_S^{ML,y}(m)) = φ_S^2(m)/L.  (73)

In the noiseless case, i.e. T = 0, (71) becomes identical to the variance derived in [12], [13].

IV. CRB DERIVATION

In this section, we derive the CRB on the variance of any unbiased estimator of the various PSDs.

A. CRB for the Late Reverberation PSD

In Appendix I, it is shown that the CRB on the reverberation PSD is

CRB(φ_R) = (1/L) φ_R^2/(N − (T + 1)).  (74)

The resulting CRB is identical to the MSE derived in (64), and thus the proposed MLE is an efficient estimator.

B. CRB for the Speech and Noise PSDs

The CRB on the speech PSD is identical to the MSE derived in (71), as outlined in Appendix J. The CRB on the noise PSD can be derived similarly. We conclude that the proposed PSD estimators are efficient.

V. EXPERIMENTAL STUDY

In this section, the proposed MLEs are evaluated in a synthetic Monte-Carlo simulation, as well as on measurements of a real room environment. In Section V-A, a Monte-Carlo simulation is conducted in which signals are generated synthetically based on the assumed statistical model. The sensitivity of the proposed MLEs is examined with respect to the various model parameters, and the MSEs of the proposed MLEs are compared to the corresponding CRBs. In Section V-B, the proposed estimators are examined in a real room environment, by utilizing them for the task of speech dereverberation and noise reduction using the MCWF.

A. Monte-Carlo Simulation

1) Simulation Setup: In order to evaluate the accuracy of the proposed estimators, synthetic data was generated according to the signal model in (1), by simulating L i.i.d. snapshots of single-tone signals with a frequency of f = 2000 Hz. The signals are captured by a uniform linear array (ULA) with N microphones and inter-microphone distance d. The desired signal component s was drawn according to a complex Gaussian distribution with zero mean and PSD φ_S. The RDTF is given by

g_d = exp(−j2πf τ),  (75)

where τ ∈ R^N is the vector of time differences of arrival (TDOAs) w.r.t. the reference microphone, given by τ = (d sin(θ)/c) [0, · · · , N−1]^⊤, and θ is the DOA, defined as the broadside angle measured w.r.t. the perpendicular to the array. The reverberation component r was drawn according to a complex Gaussian distribution with zero mean and PSD matrix Φ_r = φ_R Γ_R, where Γ_R is modelled as an ideal spherical diffuse sound field, given by Γ_{R,ij} = sinc(2πf d|i−j|/c), i, j ∈ {1, . . . , N}. The noise component was constructed as u = A_u s_u, where s_u denotes the T noise sources, drawn according to a zero-mean complex Gaussian distribution with a random PSD matrix Ψ_u, and A_u is an N×T random ATF matrix. For the estimation procedure, V is extracted by applying the EVD to a set of noisy training samples, generated with different Ψ_u.
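The single-tone ULA setup above can be sketched as follows (editor-added; the speed of sound c = 343 m/s is an assumed value, as it is not stated in this section):

```python
import numpy as np

def ula_model(N=8, d=0.06, f=2000.0, theta_deg=0.0, c=343.0):
    """RDTF (75) and spherical-diffuse coherence for a ULA."""
    n = np.arange(N)
    tau = d * np.sin(np.deg2rad(theta_deg)) / c * n      # TDOAs w.r.t. mic 0
    g_d = np.exp(-2j * np.pi * f * tau)                  # eq. (75)
    # Gamma_R[i, j] = sinc(2*pi*f*d*|i-j|/c) with sinc(x) = sin(x)/x;
    # np.sinc is normalized (sin(pi x)/(pi x)), so divide the argument by pi.
    Gamma_R = np.sinc(2 * f * d * np.abs(n[:, None] - n[None, :]) / c)
    return g_d, Gamma_R

g_d, Gamma_R = ula_model()
print(g_d.shape, Gamma_R.shape)          # (8,) (8, 8)
print(np.allclose(np.diag(Gamma_R), 1.0))
```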

In the sequel, we examine the proposed estimators and bounds as a function of the model parameters. Specifically, the influence of the following parameters is examined: i) number of snapshots L; ii) reverberation PSD value φ_R; iii) speech PSD value φ_S; and iv) noise power P_u, defined as the Frobenius norm of the noise PSD matrix, i.e. P_u ≜ ‖Φ_u‖_F = √(Tr[Φ_u^H Φ_u]). In each experiment, we changed the value of one parameter, while keeping the rest fixed. The nominal values of the parameters are presented in Table II.

For each scenario, we carried out 1000 Monte-Carlo trials. The reverberation PSD φ_R was estimated in each trial with both (21) and (46), the noise PSD was estimated with both (28) and (48), and the speech PSD was estimated with (24). The accuracy of the estimators was evaluated using the normalized mean square error (nMSE), by averaging over the Monte-Carlo trials and normalizing w.r.t. the square of the corresponding PSD value. For each quantity, the corresponding normalized


[Fig. 1 plots omitted. Panel titles: (a) Increasing L (from left to right); (b) Decreasing φ_R (from left to right); (c) Increasing φ_S (from left to right); (d) Decreasing P_u (from left to right). Vertical axes: normalized MSE [dB]; horizontal axes: snapshots [#], SRR [dB], SRNR [dB] and SNR [dB], respectively.]

Fig. 1: Normalized MSEs and CRBs as a function of: (a) number of snapshots, (b) SRR, (c) SRNR and (d) SNR. The parameter trends are depicted in the caption of each subfigure.

TABLE II: Nominal Parameters

Definition                   Symbol   Nominal value
Microphone number            N        8
Inter-microphone distance    d        6 cm
Direction of arrival         θ        0°
Frequency                    f        2 kHz
Noise-PSD rank               T        2
Number of snapshots          L        100
Speech PSD                   φ_S      0.5
Rev. PSD                     φ_R      0.5
Noise power                  P_u      0.5

CRB was also computed, in order to demonstrate the theoretical lower bound on the nMSE.

2) Simulation Results: In Fig. 1(a), the nMSEs are presented as a function of the number of snapshots, L. Clearly, the nMSEs of all the estimators decrease as the number of snapshots increases. As expected from the analytical study, it is evident that the non-blocking-based and the blocking-based MLEs yield the same nMSE, for both the reverberation and noise PSDs. Furthermore, for all quantities the nMSEs coincide with the corresponding CRBs. It should be stressed that in practical speech processing applications, the PSDs are highly time-varying, and thus will not remain constant over a large number of snapshots. This experiment only serves as a consistency check for the proposed estimators.

We now study the effect of varying the reverberation level. Let the signal-to-reverberation ratio (SRR) be defined as SRR ≜ 10 log(φ_S/φ_R). In this experiment, we change φ_R s.t. the resulting SRR ranges between −20 dB and 20 dB. In Fig. 1(b), the nMSEs are presented as a function of SRR. It is evident that the nMSEs of the reverberation PSD estimators are independent of φ_R, and equal to (1/L)·(1/(N − 1 − T)), as manifested by (64). On the other hand, the nMSEs of the noise and speech MLEs decrease as the reverberation level decreases, until reaching the limiting value of 1/L, which is in line with (70) and (73), respectively.

We now examine the effect of changing the speech PSD level. Let the signal-to-reverberation-plus-noise ratio (SRNR) be defined as SRNR ≜ 10 log(φ_S/(φ_R + P_u)). We set φ_S to several values s.t. the SRNR ranges between −20 dB and 20 dB. In Fig. 1(c), the nMSEs are presented as a function of SRNR. It is shown that the speech PSD estimator improves as the speech level increases. For the reverberation and noise PSD estimators, the performance is independent of φ_S. Obviously, the blocking-based estimators are not affected by the value


of the speech PSD. It is further shown that the non-blocking-based reverberation and noise estimators also produce nMSEs that are independent of φ_S, as implied by (64) and (68). This behaviour can be explained as follows. Both the reverberation and the noise PSD estimators apply a projection matrix onto the subspace orthogonal to the speech subspace, i.e. they eliminate the speech component by generating a null towards its direction, as manifested by (22) and (29), respectively. As a result, they are not affected by any change in the speech level.

The effect of increasing the noise level is now examined. Let the signal-to-noise ratio (SNR) be defined as SNR ≜ 10 log(φ_S/P_u). We change P_u s.t. the SNR varies between −30 dB and 30 dB. In Fig. 1(d), the nMSEs are presented as a function of SNR. It is evident that the performance of the noise PSD estimators degrades as P_u decreases. In contrast, the nMSEs of the reverberation and speech MLEs are independent of the noise level, as manifested by (64) and (71), respectively. This is due to the fact that the reverberation and speech MLEs eliminate the noise components by directing a null towards the noise subspace, as implied by (22) and (25), respectively.

B. Experiments with Recorded Room Impulse Responses

The performance of the proposed PSD estimators is now evaluated in a realistic acoustic scenario, for the task of speech dereverberation and noise reduction. In our experiments, microphone signals were synthesized using real speech signals and measured RIRs. The proposed PSD estimators were used in order to calculate the MCWF.

1) Competing Algorithms: The proposed method is compared to [15], in which the MCWF is implemented using the blocking-based or the non-blocking-based LS estimators. Therein, a spatially homogeneous noise sound field is assumed, namely Φ_u(m, k) = φ_U(m, k) Γ_U(k), where Γ_U(k) is a known time-invariant spatial coherence matrix, and φ_U(m, k) denotes the unknown time-varying PSD, which has to be estimated along with the speech and reverberation PSDs. Although this method considers a different noise model, it is chosen as the baseline since it is the only work that also estimates the noise PSD.

2) Implementation of the MCWF: It is well known that the MCWF can be decomposed into an MVDR beamformer followed by a single-channel Wiener postfilter [22], [23]:

ŝ_MCWF(m) = [γ(m)/(γ(m) + 1)] · [g_d^H Φ_i^{−1}(m)/(g_d^H Φ_i^{−1}(m) g_d)] y(m),  (76)

where the first factor is the Wiener postfilter H_W(m) and the second is the MVDR beamformer w_MVDR^H(m),

and where

γ(m) = φ_S(m)/φ_RE(m)  (77)

denotes the SRNR at the output of the MVDR, and φ_RE(m) ≜ (g_d^H Φ_i^{−1}(m) g_d)^{−1} is the residual interference at the MVDR output.
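A minimal editor-added sketch of (76)–(77) for a single STFT bin, assuming all quantities (g_d, Φ_i, φ_S) are given or already estimated:

```python
import numpy as np

def mcwf_output(y, g_d, Phi_i, phi_S):
    """MCWF as MVDR beamformer + single-channel Wiener postfilter, eqs. (76)-(77)."""
    Pi_inv = np.linalg.inv(Phi_i)
    denom = g_d.conj() @ Pi_inv @ g_d
    w_mvdr = (Pi_inv @ g_d) / denom                 # MVDR beamformer
    phi_RE = (1.0 / denom).real                     # residual interference at MVDR output
    gamma = phi_S / phi_RE                          # output SRNR, eq. (77)
    H_W = gamma / (gamma + 1.0)                     # Wiener gain
    return H_W * (w_mvdr.conj() @ y)

rng = np.random.default_rng(5)
N = 4
g_d = np.exp(-1j * rng.uniform(0, 2 * np.pi, N))
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Phi_i = M @ M.conj().T + N * np.eye(N)
# Distortionless MVDR stage: a pure desired signal s * g_d is scaled only by H_W
s = 1.0 + 0.5j
out = mcwf_output(s * g_d, g_d, Phi_i, phi_S=2.0)
print(abs((out / s).imag) < 1e-9, 0.0 < (out / s).real < 1.0)
```

The MVDR stage passes the desired component undistorted; only the scalar Wiener gain attenuates it, which the final check reflects.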

The implementation of (77) requires an estimate of the speech PSD, which is missing in the blocking-based framework. By substituting the obtained blocking-based reverberation and noise estimates, namely φ_R^{ML,z} and Ψ_v^{ML,z}, into the

Fig. 2: Geometric setup (loudspeaker positions at angles between −90° and 90°, at a distance of 1 m from the array).

general likelihood function in (18), the maximization becomes a one-dimensional optimization problem, and a closed-form solution is available [9], [24]:

φ_S^z(m) = w_MVDR^H(m) (R_y(m) − φ_R^{ML,z}(m) Γ_R − V Ψ_v^{ML,z}(m) V^H) w_MVDR(m).  (78)

However, it was shown in [15] that rather than using (77), better dereverberation performance is obtained by using the decision-directed approach [25], where γ(m) is estimated by

γ_DD(m) = β |ŝ(m−1)|^2/φ_RE(m−1) + (1 − β) φ_Si(m)/φ_RE(m),  (79)

where β is a weighting factor, and φ_Si is an instantaneous estimate based on the MVDR output [8]:

φ_Si(m) = max(|w_MVDR^H(m) y(m)|^2 − φ_RE(m), 0).  (80)

In our experiments, the MCWF was implemented with the two variants of computing γ: i) the direct implementation in (77), which will be referred to as Dir; and ii) the decision-directed implementation in (79), with the speech PSD estimated as in (80), denoted henceforth as DD.

3) Performance Measures: The speech quality was evaluated in terms of three common objective measures, namely perceptual evaluation of speech quality (PESQ) [26], frequency-weighted segmental SNR (fwSNRseg) [27] and log-likelihood ratio (LLR) [27]. The measures were computed by comparing ŝ_MCWF(m) with the direct speech signal as received by the reference microphone, obtained by filtering the anechoic speech with the direct path. In the sequel, we present the improvement in these performance measures, namely ∆PESQ, ∆fwSNRseg and ∆LLR, computed as the measure difference between the output signal (i.e. ŝ_MCWF) and the noisy and reverberant signal at the reference microphone (i.e. y_1).

4) Experimental Setup: We used RIRs from the database presented in [28]. The impulse responses were recorded in the 6 × 6 × 2.4 m acoustic lab of the Engineering Faculty at Bar-Ilan University (BIU). The reverberation time was set to T60 ∈ {360, 610} msec. The RIRs were recorded by a ULA of N = 8 microphones with inter-distances of 8 cm. A loudspeaker was positioned at a distance of 1 m from the array, at angles {−90°, −75°, −60°, . . . , 90°}, as illustrated in Fig. 2. The performance is evaluated on 10 experiments with different speaker angles, which were randomly selected from the given set of angles. The 10 chosen RIRs were convolved


TABLE III: Speech Quality Measures, T60 = 360 msec

                           |        ∆PESQ         |     ∆fwSNRseg [dB]      |        ∆LLR [dB]
Alg. \ RSNR                | 0dB  5dB  10dB 15dB  | 0dB   5dB   10dB  15dB  | 0dB   5dB   10dB  15dB
Blocking LS [15] Dir       | 0.56 0.70 0.77 0.76  | 11.85 11.02  9.80  8.67 | -0.55 -0.52 -0.45 -0.37
Blocking LS [15] DD        | 0.60 0.75 0.85 0.88  | 13.05 11.67 10.23  8.88 | -0.50 -0.45 -0.36 -0.27
Blocking ML Dir            | 0.74 0.88 0.93 0.90  | 12.56 11.77 10.61  9.39 | -0.74 -0.65 -0.52 -0.40
Blocking ML DD             | 0.79 0.93 0.99 0.99  | 15.57 13.16 10.75  8.66 | -0.67 -0.50 -0.30 -0.12
Non-blocking LS [15] Dir   | 0.70 0.81 0.85 0.81  | 13.52 11.69  9.93  8.57 | -0.54 -0.49 -0.41 -0.33
Non-blocking LS [15] DD    | 0.67 0.79 0.89 0.89  | 13.56 11.82 10.03  8.57 | -0.55 -0.48 -0.38 -0.28
Non-blocking ML Dir        | 0.98 1.12 1.15 1.12  | 15.59 13.69 11.67  9.77 | -0.74 -0.61 -0.46 -0.32
Non-blocking ML DD         | 0.79 0.93 0.99 0.99  | 15.57 13.16 10.75  8.66 | -0.67 -0.50 -0.30 -0.12

TABLE IV: Speech Quality Measures, T60 = 610 msec

                           |        ∆PESQ         |     ∆fwSNRseg [dB]      |        ∆LLR [dB]
Alg. \ RSNR                | 0dB  5dB  10dB 15dB  | 0dB   5dB   10dB  15dB  | 0dB   5dB   10dB  15dB
Blocking LS [15] Dir       | 0.42 0.50 0.54 0.56  | 10.78 10.13  9.20  8.38 | -0.51 -0.49 -0.44 -0.37
Blocking LS [15] DD        | 0.46 0.54 0.62 0.67  | 12.49 11.54 10.46  9.47 | -0.45 -0.41 -0.36 -0.30
Blocking ML Dir            | 0.51 0.59 0.63 0.63  | 10.10  9.57  8.68  7.95 | -0.58 -0.53 -0.46 -0.38
Blocking ML DD             | 0.61 0.71 0.78 0.80  | 14.93 13.10 11.30  9.67 | -0.57 -0.44 -0.28 -0.13
Non-blocking LS [15] Dir   | 0.56 0.60 0.63 0.62  | 13.22 11.49  9.93  8.70 | -0.51 -0.48 -0.42 -0.36
Non-blocking LS [15] DD    | 0.51 0.59 0.67 0.71  | 13.37 12.07 10.66  8.73 | -0.51 -0.47 -0.40 -0.32
Non-blocking ML Dir        | 0.69 0.78 0.80 0.81  | 13.88 12.39 10.81  9.47 | -0.62 -0.52 -0.41 -0.30
Non-blocking ML DD         | 0.61 0.71 0.78 0.80  | 14.93 13.10 11.30  9.67 | -0.57 -0.44 -0.28 -0.13

with 3–5 sec long utterances of 5 male and 5 female speakers, drawn from the TIMIT database [29].

For the additive noise, we used non-stationary directional noise signals. Two non-stationary noise sources (with time-varying speech-like spectra) were positioned at a distance of 1 m from the array. In each experiment, the angles of the noise sources were randomly selected, avoiding the angle of the speaker. The microphone signals were synthesized by adding the noises to the reverberant speech signals at several reverberant signal-to-noise ratio (RSNR) levels. Finally, a 1 sec noise-only segment was prepended to the reverberant speech signal. Based on this segment, a sample noise covariance matrix is computed, and the EVD is then applied in order to extract the spatial basis V that replaces the noise ATF matrix A_u (see (14)).

The following parameter values were used. The sampling rate was set to 16 kHz, and the STFT was computed with windows of 32 msec and 75% overlap between adjacent time frames. As the experiment consists of real-life non-stationary speech signals, the sample covariance matrix R_y(m) was estimated using recursive averaging [9] with a smoothing factor of α = 0.7, rather than the moving-window averaging of (19). The same applies to R_z(m). The smoothing parameter for the decision-directed approach in (79) was set to β = 0.9. The gain of the single-channel postfilter was lower bounded by −15 dB.

5) Experimental Results: First, we examine the performance obtained with various RSNR levels. Tables III and IV summarize the results for both reverberation levels. Note that a negative ∆LLR indicates a performance improvement. The best results are highlighted in boldface. It is evident that the

proposed ML methods outperform the baseline methods of [15] for all the considered scenarios. The advantage of the proposed method can be attributed to the fact that the full T × T spatial noise PSD matrix is estimated in each frame, and thus the non-stationarity of all the noise sources can be tracked, whereas the baseline model of a time-invariant spatial coherence matrix with a single controllable gain parameter cannot track these dynamics.

In order to analyze the behaviour of the various ML implementations, recall that the non-blocking-based and the blocking-based noise and reverberation MLEs were proved to be identical in Section III-C. Since the non-blocking ML DD and the blocking ML DD also use the same speech PSD estimate in (80), it follows that both methods yield the same performance. However, the non-blocking ML Dir and the blocking ML Dir implementations differ, due to the fact that they use different speech PSD estimators, in (24) and (78), respectively.

From Tables III and IV, it is evident that the non-blocking-based ML Dir method is preferable for its high and stable scores, as it obtains the best ∆PESQ and ∆fwSNRseg scores in most cases, and also yields high ∆LLR results. The advantage of the non-blocking ML Dir over the blocking ML Dir can be attributed to the fact that the non-blocking ML Dir uses the speech PSD estimate in (24), which is independent of the estimated noise PSD. In contrast, the blocking-based ML Dir estimates the speech PSD with (78), which depends on the estimated noise PSD, and thus can be affected by estimation errors of the noise PSD. Indeed, there is a slight disadvantage for the non-blocking ML Dir in terms of ∆LLR, which mainly



Fig. 3: Results versus T , for T60 = 360 msec, N = 8, RSNR = 5dB: (a) ∆PESQ, (b) ∆fwSNR, and (c) ∆LLR.


Fig. 4: Results versus T , for T60 = 610 msec, N = 8, RSNR = 5dB: (a) ∆PESQ, (b) ∆fwSNR, and (c) ∆LLR.

quantifies speech distortion. This can be related to the fact that the speech PSD estimate in (24) generates nulls towards the noise sources, which in some cases can affect the speech distortion, for example when the speech and noise sources are not fully spatially separated.

Comparing the direct and the DD implementations of the non-blocking ML approach, it can be observed that there is a clear advantage for the direct implementation in most cases. This can be explained by the fact that the direct method optimally estimates the speech PSD by maximizing the ML criterion, while the decision-directed approach estimates the speech PSD using (79)–(80), which is derived in a more heuristic manner. However, in adverse conditions of high noise and reverberation levels, the DD has some advantage due to its use of a smoother estimator compared to the instantaneous estimator of the direct form, thus improving the robustness of the speech PSD estimate. To conclude, the non-blocking ML Dir is the preferable method among the proposed estimators.

Next, the performance is investigated as a function of the number of noise sources. A representative scenario with RSNR = 5 dB and N = 8 was inspected. Since our model assumes that T ≤ N − 2, the following values of T were examined: T ∈ {1, . . . , 6}. Figs. 3 and 4 depict the results for both reverberation levels, where NBB denotes the non-blocking-based method, and BB refers to the blocking-based method. It is shown that the proposed methods outperform the baseline methods [15] in most cases, which is in line with the results obtained in Tables III and IV, showing the advantage of the proposed rank-deficient noise model. The baseline methods have an advantage only for values of T close to N − 2. This can be attributed to the fact that when the number of non-stationary noise sources increases, their sum tends to be stationary, especially in high reverberation conditions. In this case, it might be preferable to use the simpler approximation of the noise PSD matrix, namely a time-invariant spatial coherence matrix multiplied by a scalar time-varying PSD, rather than the proposed model, which tracks the non-stationarity of each of the noise sources and requires the estimation of a larger number of parameters.

Finally, we test the influence of the number of microphones on the performance. Figs. 5 and 6 depict the measures obtained with different numbers of microphones, i.e. N ∈ {4, 6, 8}, for both reverberation times, where T = 2 and RSNR = 5 dB. It is evident that the proposed methods outperform the baseline methods in almost all cases.

Fig. 7 depicts several sonogram examples of the various signals at T60 = 610 msec and an RSNR of 15 dB. Fig. 7(a) shows s, the direct speech signal as received by the first microphone. Fig. 7(b) depicts y_1, the noisy and reverberant signal at the first microphone. Figs. 7(c) and 7(d) show



Fig. 5: Results versus N , for T60 = 360 msec, T = 2, RSNR = 5dB: (a) ∆PESQ, (b) ∆fwSNR, and (c) ∆LLR.


Fig. 6: Results versus N , for T60 = 610 msec, T = 2, RSNR = 5dB: (a) ∆PESQ, (b) ∆fwSNR, and (c) ∆LLR.

the MCWF output computed with (77), using the proposed blocking-based and non-blocking-based MLEs, respectively. We conclude that the application of the MCWF, implemented based on the proposed MLEs, significantly reduces noise and reverberation, while maintaining low speech distortion.

VI. CONCLUSIONS

In this contribution, we discussed the problem of joint dereverberation and noise reduction in the presence of directional noise sources, forming a rank-deficient noise PSD matrix. As opposed to state-of-the-art methods, which assume knowledge of the noise PSD matrix, we propose to estimate also the time-varying noise PSD matrix, assuming that a basis spanning the noise ATF subspace is known. MLEs of the reverberation, speech and noise PSDs are derived for both the non-blocking-based and the blocking-based methods. The resulting MLEs are of closed form and thus have low computational complexity. The proposed estimators are theoretically analyzed and compared. For both the reverberation and the noise PSD estimators, it is shown that the non-blocking-based MLE and the blocking-based MLE are identical. The estimators were shown to be unbiased, and the corresponding MSEs were calculated. Moreover, CRBs on the various PSDs were derived, and were shown to be identical to the MSEs of the proposed estimators. The discussion is supported by an experimental study based on both simulated data and real-life audio signals. It is shown that using the proposed estimators yields a large performance improvement with respect to a competing method.

APPENDIX A

The proof follows the lines of [24]. First, we use (15) and rewrite the PSD matrix of the microphone signals in (9) as

Φ_y(m) = A Φ_sv(m) A^H + φ_R(m) Γ_R,  (81)

where A is defined in (23) and Φ_sv(m) ∈ C^{(T+1)×(T+1)} is given by

Φ_sv(m) = [ φ_S(m)  0^H ; 0  Ψ_v(m) ].  (82)

Then, we define P ∈ C^{N×N} as an orthogonal projection matrix onto the speech-plus-noise subspace:

P = A (A^H Γ_R^{−1} A)^{−1} A^H Γ_R^{−1},  (83)


[Fig. 7 sonograms omitted. Panels: (a) Clean direct speech at microphone #1; (b) Noisy and reverberant signal at microphone #1; (c) Output of the MCWF (direct implementation) using the proposed blocking-based MLEs; (d) Output of the MCWF (direct implementation) using the proposed non-blocking-based MLEs.]

Fig. 7: An example of audio sonograms, with T60 = 610 msec and RSNR = 15dB.

and Q = I_N − P is the orthogonal complement. We will make use of the following properties, which can be easily verified:

P = P P,   (84a)
Q = Q Q,   (84b)
Q Φ_y(m) = φ_R(m) Q Γ_R,   (84c)
Φ_y^{-1}(m) Q = φ_R^{-1}(m) Γ_R^{-1} Q,   (84d)
Φ_y^{-1}(m) A = Γ_R^{-1} A (Φ_sv(m) A^H Γ_R^{-1} A + φ_R(m) I)^{-1}.   (84e)
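As a numerical sanity check (not part of the original paper), the properties (84a)–(84e) can be verified with randomly drawn matrices; the dimensions N = 6 microphones and T = 2 directional sources, and all matrix values below, are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 6, 2  # microphones; directional noise sources (arbitrary)

# Random Hermitian positive-definite spatial coherence matrix Gamma_R
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Gamma_R = M @ M.conj().T + N * np.eye(N)

# A = [g_d | V] spans the (T+1)-dimensional speech-plus-noise subspace
A = rng.standard_normal((N, T + 1)) + 1j * rng.standard_normal((N, T + 1))

# Oblique projection onto the speech-plus-noise subspace, eq. (83)
P = A @ np.linalg.solve(A.conj().T @ np.linalg.solve(Gamma_R, A),
                        A.conj().T @ np.linalg.inv(Gamma_R))
Q = np.eye(N) - P

# Random speech-plus-noise PSD matrix Phi_sv and reverberation PSD phi_R
B = rng.standard_normal((T + 1, T + 1)) + 1j * rng.standard_normal((T + 1, T + 1))
Phi_sv = B @ B.conj().T
phi_R = 0.7
Phi_y = A @ Phi_sv @ A.conj().T + phi_R * Gamma_R  # eq. (81)

ok_84a = np.allclose(P, P @ P)                                    # (84a)
ok_84b = np.allclose(Q, Q @ Q)                                    # (84b)
ok_84c = np.allclose(Q @ Phi_y, phi_R * (Q @ Gamma_R))            # (84c)
ok_84d = np.allclose(np.linalg.solve(Phi_y, Q),
                     (1 / phi_R) * np.linalg.solve(Gamma_R, Q))   # (84d)
ok_84e = np.allclose(
    np.linalg.solve(Phi_y, A),
    np.linalg.solve(Gamma_R, A) @ np.linalg.inv(
        Phi_sv @ A.conj().T @ np.linalg.solve(Gamma_R, A)
        + phi_R * np.eye(T + 1)))                                 # (84e)
```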

Using [11, Eqs. (15.47)–(15.48)], the derivative of the log-likelihood function w.r.t. φ_R(m) is given by

L(φ_R(m)) ≜ ∂ log p(y(m); φ(m)) / ∂φ_R(m)
          = L Tr[(Φ_y^{-1}(m) R_y(m) − I) Φ_y^{-1}(m) Γ_R].   (85)

Using (84a) and (84b) it follows that

L(φ_R(m)) = L Tr[ Q Γ_R (Φ_y^{-1}(m) R_y(m) − I) Φ_y^{-1}(m) Q
          + P Γ_R (Φ_y^{-1}(m) R_y(m) − I) Φ_y^{-1}(m) P ].   (86)

However, by substituting (83) into (86), it can be shown that the second term vanishes (this follows from setting the derivative of the likelihood w.r.t. Φ_sv in (88) to zero). Hence,

L(φ_R(m)) = L Tr[ Q Γ_R Φ_y^{-1}(m) (R_y(m) − Φ_y(m)) Φ_y^{-1}(m) Q ]
   (a)= L φ_R^{-2}(m) Tr[ Q (R_y(m) − Φ_y(m)) Γ_R^{-1} Q ]
   (b)= L φ_R^{-2}(m) Tr[ Q R_y(m) Γ_R^{-1} − φ_R(m) Q ],   (87)

where (a) follows from (84c) and (84d), and (b) follows from (84b) and (84c). Finally, using (22) we have Tr[Q] = N − (T + 1). Thus, setting (87) to zero yields (21).
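Setting (87) to zero gives the estimator φ_R^ML(m) = Tr[Q R_y(m) Γ_R^{-1}] / (N − (T + 1)), which should coincide with (21). A minimal Monte-Carlo sketch (arbitrary random matrices, not the paper's configuration) confirms both Tr[Q] = N − (T + 1) and the behavior of this estimator:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, L = 6, 2, 20000  # mics, directional sources, snapshots (arbitrary)

M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Gamma_R = M @ M.conj().T + N * np.eye(N)
A = rng.standard_normal((N, T + 1)) + 1j * rng.standard_normal((N, T + 1))
P = A @ np.linalg.solve(A.conj().T @ np.linalg.solve(Gamma_R, A),
                        A.conj().T @ np.linalg.inv(Gamma_R))
Q = np.eye(N) - P

# Tr[Q] equals N - (T+1): Q annihilates the (T+1)-dim column space of A
tr_Q = np.trace(Q).real

# Build Phi_y as in (81) and draw L snapshots y ~ CN(0, Phi_y)
B = rng.standard_normal((T + 1, T + 1)) + 1j * rng.standard_normal((T + 1, T + 1))
Phi_sv = B @ B.conj().T
phi_R = 0.7
Phi_y = A @ Phi_sv @ A.conj().T + phi_R * Gamma_R
C = np.linalg.cholesky(Phi_y)
y = C @ ((rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L)))
         / np.sqrt(2))
R_y = y @ y.conj().T / L  # sample covariance

# Estimator obtained by setting (87) to zero; should be close to phi_R
phi_R_ml = np.trace(Q @ R_y @ np.linalg.inv(Gamma_R)).real / (N - (T + 1))
```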

APPENDIX B

Using [24, Eq. (12)], the derivative of the log-likelihood function log p(y(m); φ(m)) w.r.t. Φ_sv is given by

L(Φ_sv(m)) ≜ ∂ log p(y(m); φ(m)) / ∂Φ_sv(m)
           = L (A^H (Φ_y^{-1}(m) R_y(m) − I) Φ_y^{-1}(m) A)^⊤.   (88)

Using (84e) and setting the result to zero yields

Φ_sv^ML(m) = (A^H Γ_R^{-1} A)^{-1} A^H Γ_R^{-1} (R_y(m) − φ_R^{ML,y}(m) Γ_R) Γ_R^{-1} A (A^H Γ_R^{-1} A)^{-1}.   (89)

In order to simplify the expression in (89), we define a partitioned matrix

A^H Γ_R^{-1} A = [ g_d^H Γ_R^{-1} g_d   g_d^H Γ_R^{-1} V
                   V^H Γ_R^{-1} g_d     V^H Γ_R^{-1} V ].   (90)


Using the formula for the inverse of a partitioned matrix and then taking the (1, 1) entry of Φ_sv^ML, the MLE of φ_S in (24) is obtained.
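The (1, 1)-entry step relies on the standard Schur-complement form of the partitioned inverse, [H^{-1}]_{1,1} = (a − b^H D^{-1} b)^{-1} for H = [[a, b^H], [b, D]]. A quick numeric check with an arbitrary Hermitian positive-definite H (a generic sketch, not the paper's matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 3
# Random Hermitian positive-definite matrix partitioned as [[a, b^H], [b, D]]
M = rng.standard_normal((T + 1, T + 1)) + 1j * rng.standard_normal((T + 1, T + 1))
H = M @ M.conj().T + np.eye(T + 1)
a, b, D = H[0, 0], H[1:, 0], H[1:, 1:]

# (1,1) entry of the inverse vs. the inverse of the Schur complement of D
entry_11 = np.linalg.inv(H)[0, 0]
schur = a - b.conj() @ np.linalg.solve(D, b)  # a - b^H D^{-1} b
```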

APPENDIX C

We note that Ψ_v(m) is the T × T lower-right block of the full (T + 1) × (T + 1) matrix Φ_sv(m) in (82). Using again the formula for the inverse of the partitioned matrix in (90) and taking the corresponding entries yields (28).

APPENDIX D

The proof follows by substituting the inverse of the partitioned matrix (90) into (22), and then using the definitions in (25) and (29).

APPENDIX E

The PSD matrix of the BM output is given in (42) as

Φ_z(m) = V Ψ_v(m) V^H + φ_R(m) B^H Γ_R B,   (91)

where V is defined in (41). The proof is now similar to that of Appendix A, with the following changes: A, Φ_sv(m), Γ_R and R_y(m) are replaced with V, Ψ_v(m), B^H Γ_R B and R_z(m), respectively.

APPENDIX F

Using (42), the MLE of Ψ_v(m) can be calculated in a similar manner to (89), with

Ψ_v^{ML,z}(m) = W_u^H (R_z(m) − φ_R^{ML,z}(m) (B^H Γ_R B)) W_u,   (92)

where W_u is given in (49).

APPENDIX G

Substituting (41) into (49), and then using (54), leads to

W_u^H = (V^H P_g^⊥ Γ_R^{-1} V)^{-1} V^H B (B^H Γ_R B)^{-1}.   (93)

Right-multiplying (93) by B^H, using (54) again and then comparing to (29), yields (52).

APPENDIX H

Using (19), (21) and the i.i.d. assumption, the variance of (65) can be recast as

var(ψ_i^{ML,y}(m)) = (1/L) var[ y^H(m) ( w_i w_i^H − (w_i^H Γ_R w_i)/(N − 1 − T) · Γ_R^{-1} Q ) y(m) ].   (94)

To proceed, we use (62), (59), and (33) to obtain

var(ψ_i^{ML,y}(m)) = (1/L) [ (w_i^H Φ_y(m) w_i)^2 + (w_i^H φ_R(m) Γ_R w_i)^2 / (N − 1 − T) ].   (95)

Finally, using (9) and (31) yields (68). The right-hand side of (69) is obtained by substituting (66) into the definition of ξ_i, and noting that (P_{V_i}^⊥ P_g^⊥)^H = Γ_R P_g^⊥ P_{V_i}^⊥ P_g^⊥ Γ_R^{-1}.
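The step from (94) to (95) relies on the standard variance of a Hermitian quadratic form of a circularly-symmetric complex Gaussian vector, var(y^H T y) = Tr[T Φ T Φ]. A Monte-Carlo sketch with arbitrary Φ and T (not the paper's quantities) illustrates this fact:

```python
import numpy as np

rng = np.random.default_rng(8)
N, trials = 4, 200000  # dimension and Monte-Carlo runs (arbitrary)

M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Phi = M @ M.conj().T + np.eye(N)  # covariance of y
Tm = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Tm = Tm + Tm.conj().T             # Hermitian, so y^H T y is real

# Draw y ~ CN(0, Phi) and evaluate the quadratic form per snapshot
C = np.linalg.cholesky(Phi)
y = C @ ((rng.standard_normal((N, trials)) + 1j * rng.standard_normal((N, trials)))
         / np.sqrt(2))
q = np.einsum('it,ij,jt->t', y.conj(), Tm, y).real

var_mc = q.var()                                 # empirical variance
var_th = np.trace(Tm @ Phi @ Tm @ Phi).real      # Tr[T Phi T Phi]
```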

APPENDIX I

In this Appendix, the CRB on the reverberation PSD is derived. We will make use of the following identities [30]:

Tr[X Y] = (vec(X^H))^H vec(Y),   (96)
vec(X Y Z) = (Z^⊤ ⊗ X) vec(Y),   (97)
(W ⊗ X)(Y ⊗ Z) = (W Y) ⊗ (X Z).   (98)
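The identities (96)–(98) use the column-stacking vec operator; they are easy to confirm numerically with arbitrary matrices (a sanity-check sketch, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

def vec(M):
    # column-stacking vec operator (Fortran order)
    return M.flatten(order='F')

n = 4
X, Y, Z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
           for _ in range(3))
W = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

ok_96 = np.allclose(np.trace(X @ Y), vec(X.conj().T).conj() @ vec(Y))  # (96)
ok_97 = np.allclose(vec(X @ Y @ Z), np.kron(Z.T, X) @ vec(Y))          # (97)
ok_98 = np.allclose(np.kron(W, X) @ np.kron(Y, Z),
                    np.kron(W @ Y, X @ Z))                             # (98)
```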

Using the definitions of Φ_y and Φ_sv in (81) and (82), we denote the set of unknown parameters by α ≜ [φ_R, φ_SV^⊤]^⊤ ∈ C^{(T+1)^2+1}, where φ_SV ≜ vec(Φ_sv) ∈ C^{(T+1)^2}. As Φ_y is a PSD matrix of a Gaussian vector, the Fisher information of each pair of parameters is given by [31]:

I_ij^y = L Tr[ Φ_y^{-1} (∂Φ_y/∂α_i) Φ_y^{-1} (∂Φ_y/∂α_j) ],   (99)

where I_ij^y is the Fisher information of α_i and α_j, and i, j = 1, …, (T+1)^2 + 1. In order to facilitate the derivation, we use (97) and vectorize (81) to obtain an N^2 × 1 vector:

φ_y ≜ vec(Φ_y) = (A^* ⊗ A) vec(Φ_sv) + φ_R vec(Γ_R).   (100)

Using (96), (97), (99) and (100), the full Fisher information matrix (FIM) writes [32], [33]

(1/L) I^y = (∂φ_y/∂α^⊤)^H (Φ_y^{-⊤} ⊗ Φ_y^{-1}) (∂φ_y/∂α^⊤).   (101)

Next, we define the following partitioned matrix:

[g | ∆] ≜ (Φ_y^{-⊤/2} ⊗ Φ_y^{-1/2}) [ ∂φ_y/∂φ_R | ∂φ_y/∂φ_SV^⊤ ].   (102)

Using (102) and (98), the FIM in (101) writes

(1/L) I^y = [ g^H
              ∆^H ] [g | ∆] = [ g^H g   g^H ∆
                                ∆^H g   ∆^H ∆ ].   (103)

The CRB for φ_R is given by CRB(φ_R) = [(I^y)^{-1}]_{1,1}. Using the formula for the inverse of a partitioned matrix, the CRB writes

CRB(φ_R) = (1/L) (g^H g − g^H ∆ (∆^H ∆)^{-1} ∆^H g)^{-1}
         = (1/L) (g^H Π_∆^⊥ g)^{-1},   (104)

where Π_∆^⊥ ∈ C^{N^2×N^2} is a projection matrix onto the subspace orthogonal to ∆:

Π_∆ ≜ ∆ (∆^H ∆)^{-1} ∆^H,   Π_∆^⊥ = I_{N^2} − Π_∆.   (105)
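The equivalence in (104) between the partitioned-inverse form and the projection form can be checked numerically for an arbitrary vector g and matrix ∆ (a generic sketch, not the paper's specific g and ∆):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 10, 4  # ambient dimension and number of nuisance columns (arbitrary)
g = rng.standard_normal(n) + 1j * rng.standard_normal(n)
Delta = rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k))

# Gram ("FIM-like") matrix of [g | Delta], as in (103)
G = np.column_stack([g, Delta])
FIM = G.conj().T @ G

# (1,1) entry of the inverse vs. the projection form of (104)
entry_11 = np.linalg.inv(FIM)[0, 0]
Pi_Delta = Delta @ np.linalg.solve(Delta.conj().T @ Delta, Delta.conj().T)
Pi_perp = np.eye(n) - Pi_Delta
proj_form = 1.0 / (g.conj() @ Pi_perp @ g)
```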

We now simplify (104). First, we use (100) along with (97) to write g as

g = (Φ_y^{-⊤/2} ⊗ Φ_y^{-1/2}) vec(Γ_R) = vec(Φ_y^{-1/2} Γ_R Φ_y^{-1/2}).   (106)


Similarly, ∆ is simplified by using (100) along with (98),

∆ = (Φ_y^{-⊤/2} ⊗ Φ_y^{-1/2}) (A^* ⊗ A) = (Φ_y^{-⊤/2} A^*) ⊗ (Φ_y^{-1/2} A).   (107)

In order to calculate Π_∆^⊥ g, the following identity is used [33]:

Π_{X⊗Y}^⊥ = I ⊗ Π_Y^⊥ + Π_X^⊥ ⊗ I − Π_X^⊥ ⊗ Π_Y^⊥.   (108)
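Before (108) is applied, it can itself be verified numerically: since Π_{X⊗Y} = Π_X ⊗ Π_Y, expanding I − (I − Π_X^⊥) ⊗ (I − Π_Y^⊥) gives the right-hand side. A generic sketch with arbitrary X and Y:

```python
import numpy as np

rng = np.random.default_rng(5)

def proj(M):
    # orthogonal projection onto the column space of M
    return M @ np.linalg.solve(M.conj().T @ M, M.conj().T)

n, k = 5, 2  # arbitrary dimensions
X = rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k))
Y = rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k))

Pi_perp_X = np.eye(n) - proj(X)
Pi_perp_Y = np.eye(n) - proj(Y)

lhs = np.eye(n * n) - proj(np.kron(X, Y))            # Pi_perp of X (x) Y
rhs = (np.kron(np.eye(n), Pi_perp_Y)
       + np.kron(Pi_perp_X, np.eye(n))
       - np.kron(Pi_perp_X, Pi_perp_Y))              # identity (108)
```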

Hence,

Π_∆^⊥ g (a)= Π^⊥_{(Φ_y^{-1/2} A)^* ⊗ (Φ_y^{-1/2} A)} vec(Φ_y^{-1/2} Γ_R Φ_y^{-1/2})
(b)= ( I ⊗ Π^⊥_{Φ_y^{-1/2} A} + Π^⊥_{(Φ_y^{-1/2} A)^*} ⊗ I − Π^⊥_{(Φ_y^{-1/2} A)^*} ⊗ Π^⊥_{Φ_y^{-1/2} A} ) vec(Φ_y^{-1/2} Γ_R Φ_y^{-1/2})
(c)= vec( Π^⊥_{Φ_y^{-1/2} A} Φ_y^{-1/2} Γ_R Φ_y^{-1/2} + Φ_y^{-1/2} Γ_R Φ_y^{-1/2} Π^⊥_{Φ_y^{-1/2} A} − Π^⊥_{Φ_y^{-1/2} A} (Φ_y^{-1/2} Γ_R Φ_y^{-1/2}) Π^⊥_{Φ_y^{-1/2} A} ),   (109)

where (a) follows by (106) and (107), (b) follows by (108) and (c) follows by (97). Left-multiplying (109) by g^H yields the reciprocal of the CRB in (104):

g^H Π_∆^⊥ g (a)= (vec(Φ_y^{-1/2} Γ_R Φ_y^{-1/2}))^H
    × vec( Π^⊥_{Φ_y^{-1/2} A} Φ_y^{-1/2} Γ_R Φ_y^{-1/2} + Φ_y^{-1/2} Γ_R Φ_y^{-1/2} Π^⊥_{Φ_y^{-1/2} A} − Π^⊥_{Φ_y^{-1/2} A} (Φ_y^{-1/2} Γ_R Φ_y^{-1/2}) Π^⊥_{Φ_y^{-1/2} A} )
(b)= Tr[ 2 Φ_y^{-1/2} Γ_R Φ_y^{-1/2} Π^⊥_{Φ_y^{-1/2} A} Φ_y^{-1/2} Γ_R Φ_y^{-1/2}
    − Φ_y^{-1/2} Π^⊥_{Φ_y^{-1/2} A} Φ_y^{-1/2} Γ_R Φ_y^{-1/2} Π^⊥_{Φ_y^{-1/2} A} Φ_y^{-1/2} Γ_R ],   (110)

where (a) follows by (106) and (b) follows by (96). To proceed, let us define D ∈ C^{N×(N−T−1)} as a matrix that spans the nullspace of the matrix A, s.t. D^H A = 0, i.e. a blocking matrix of the speech-plus-noise signals. Since (Φ_y^{1/2} D)^H Φ_y^{-1/2} A = 0, it follows that [33]

Π^⊥_{Φ_y^{-1/2} A} = Π_{Φ_y^{1/2} D},   (111)

and thus

Φ_y^{-1/2} Π^⊥_{Φ_y^{-1/2} A} Φ_y^{-1/2} = Φ_y^{-1/2} Π_{Φ_y^{1/2} D} Φ_y^{-1/2}
(a)= D (D^H φ_R Γ_R D)^{-1} D^H
   = φ_R^{-1} Γ_R^{-1} ( Γ_R D (D^H Γ_R D)^{-1} D^H Γ_R ) Γ_R^{-1}
(b)= φ_R^{-2} Γ_R^{-1} ( Φ_y D (D^H Φ_y D)^{-1} D^H Φ_y ) Γ_R^{-1}
(c)= φ_R^{-2} Γ_R^{-1} ( Φ_y^{1/2} Π^⊥_{Φ_y^{-1/2} A} Φ_y^{1/2} ) Γ_R^{-1},   (112)

where (a) and (b) follow since D^H Φ_y = D^H φ_R Γ_R, and (c) follows by (111). Substituting (112) into (110) and using the property Π_X^⊥ Π_X^⊥ = Π_X^⊥ yields

g^H Π_∆^⊥ g = φ_R^{-2} Tr[ Π^⊥_{Φ_y^{-1/2} A} ] = φ_R^{-2} (N − (T + 1)).   (113)

Substituting (113) into (104) yields (74).
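Both the blocking identity (111) and the trace value appearing in (113) can be confirmed numerically for arbitrary A and Φ_y, using the Hermitian square roots of Φ_y (a sanity-check sketch, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(6)

def proj(M):
    # orthogonal projection onto the column space of M
    return M @ np.linalg.solve(M.conj().T @ M, M.conj().T)

N, T = 6, 2  # arbitrary dimensions
A = rng.standard_normal((N, T + 1)) + 1j * rng.standard_normal((N, T + 1))
M = rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))
Phi_y = M @ M.conj().T + np.eye(N)  # any Hermitian positive-definite matrix

# Hermitian square roots of Phi_y via its eigendecomposition
w, U = np.linalg.eigh(Phi_y)
Phi_half = U @ np.diag(np.sqrt(w)) @ U.conj().T
Phi_mhalf = U @ np.diag(1 / np.sqrt(w)) @ U.conj().T

# D spans the orthogonal complement of A's column space (D^H A = 0)
U_A, _, _ = np.linalg.svd(A)
D = U_A[:, T + 1:]

lhs = np.eye(N) - proj(Phi_mhalf @ A)  # Pi_perp of Phi^{-1/2} A
rhs = proj(Phi_half @ D)               # Pi of Phi^{1/2} D, eq. (111)
tr_val = np.trace(lhs).real            # equals N - (T+1), as in (113)
```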

APPENDIX J

We define the following partitioned matrix:

[s|W] ,(Φ−>/2y ⊗Φ−1/2

y

)[∂φy∂φS

∣∣∣∣ ∂φy∂σ>,∂φy∂φR

], (114)

where σ ≜ vec({φ_SV} \ {φ_S}). Similarly to (103)–(104), we use (114) to construct the FIM, and then compute its inverse and take the corresponding component,

CRB(φ_S) = (1/L) (s^H Π_W^⊥ s)^{-1}.   (115)

We define a partition of W as

W = (Φ_y^{-⊤/2} ⊗ Φ_y^{-1/2}) [ ∂φ_y/∂σ^⊤ | ∂φ_y/∂φ_R ] ≜ [Σ | g].   (116)

Using the blockwise formula for projection matrices [34],

Π_W^⊥ = Π_Σ^⊥ − Π_Σ^⊥ g (g^H Π_Σ^⊥ g)^{-1} g^H Π_Σ^⊥.   (117)
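The blockwise formula (117) for the projection onto the complement of [Σ | g] can be checked numerically with an arbitrary Σ and g (a generic sketch, not the paper's quantities):

```python
import numpy as np

rng = np.random.default_rng(7)

def proj(M):
    # orthogonal projection onto the column space of M
    return M @ np.linalg.solve(M.conj().T @ M, M.conj().T)

n, k = 8, 3  # arbitrary dimensions
Sigma = rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k))
g = rng.standard_normal((n, 1)) + 1j * rng.standard_normal((n, 1))

Pi_perp_Sigma = np.eye(n) - proj(Sigma)

# Direct computation vs. the blockwise formula (117)
lhs = np.eye(n) - proj(np.column_stack([Sigma, g]))
rhs = (Pi_perp_Sigma
       - Pi_perp_Sigma @ g
         @ np.linalg.inv(g.conj().T @ Pi_perp_Sigma @ g)
         @ g.conj().T @ Pi_perp_Sigma)
```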

We note that Σ can be further partitioned as

Σ = (Φ_y^{-⊤/2} ⊗ Φ_y^{-1/2}) [ V^* ⊗ V | g_d^* ⊗ V, V^* ⊗ g_d ] ≜ [C | D].   (118)

In a similar manner to (117), we obtain

Π_Σ^⊥ = Π_C^⊥ − Π_C^⊥ D (D^H Π_C^⊥ D)^{-1} D^H Π_C^⊥.   (119)

Substituting (119) into (117) and then into (115) yields

CRB(φ_S) = (1/L) (α_s − α_ds^H Θ_d^{-1} α_ds − β_sg^2 β_g^{-1})^{-1},   (120)

where α_s = s^H Π_C^⊥ s, α_ds = D^H Π_C^⊥ s, Θ_d = D^H Π_C^⊥ D, β_sg = α_sg − α_ds^H Θ_d^{-1} α_dg, β_g = α_g − α_dg^H Θ_d^{-1} α_dg, and α_sg = s^H Π_C^⊥ g, α_dg = D^H Π_C^⊥ g, α_g = g^H Π_C^⊥ g. These quantities can be computed using techniques similar to those used in the derivation of (106)–(113). Due to space constraints, the detailed derivation is omitted. Collecting all the terms and substituting into (120) yields the same expression as the MSE in (71).

REFERENCES

[1] U. Kjems and J. Jensen, “Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement,” in Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2012, pp. 295–299.

[2] S. Braun and E. A. P. Habets, “Dereverberation in noisy environments using reference signals and a maximum likelihood estimator,” in Proceedings of the 21st European Signal Processing Conference (EUSIPCO), 2013, pp. 1–5.

[3] A. Kuklasinski, S. Doclo, S. H. Jensen, and J. Jensen, “Maximum likelihood based multi-channel isotropic reverberation reduction for hearing aids,” in Proceedings of the 22nd European Signal Processing Conference (EUSIPCO), 2014, pp. 61–65.

[4] S. Braun and E. A. Habets, “A multichannel diffuse power estimator for dereverberation in the presence of multiple sources,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2015, no. 1, p. 34, 2015.


[5] O. Schwartz, S. Braun, S. Gannot, and E. A. Habets, “Maximum likelihood estimation of the late reverberant power spectral density in noisy environments,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015, pp. 1–5.

[6] O. Schwartz, S. Gannot, and E. A. Habets, “Joint maximum likelihood estimation of late reverberant and speech power spectral density in noisy environments,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 151–155.

[7] A. Kuklasinski, S. Doclo, and J. Jensen, “Maximum likelihood PSD estimation for speech enhancement in reverberant and noisy conditions,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 599–603.

[8] O. Schwartz, S. Gannot, and E. A. Habets, “Joint estimation of late reverberant and speech power spectral densities in noisy environments using Frobenius norm,” in 24th European Signal Processing Conference (EUSIPCO), 2016, pp. 1123–1127.

[9] A. Kuklasinski, S. Doclo, S. H. Jensen, and J. Jensen, “Maximum likelihood PSD estimation for speech enhancement in reverberation and noise,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 9, pp. 1599–1612, 2016.

[10] S. Braun, A. Kuklasinski, O. Schwartz, O. Thiergart, E. A. Habets, S. Gannot, S. Doclo, and J. Jensen, “Evaluation and comparison of late reverberation power spectral density estimators,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 6, pp. 1052–1067, 2018.

[11] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1993.

[12] J. Jensen and M. S. Pedersen, “Analysis of beamformer directed single-channel noise reduction system for hearing aid applications,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, Apr. 2015.

[13] A. Kuklasinski, S. Doclo, T. Gerkmann, S. Holdt Jensen, and J. Jensen, “Multi-channel PSD estimators for speech dereverberation: A theoretical and experimental comparison,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 91–95.

[14] O. Schwartz, S. Gannot, and E. A. Habets, “Cramer–Rao bound analysis of reverberation level estimators for dereverberation and noise reduction,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 8, pp. 1680–1693, 2017.

[15] I. Kodrasi and S. Doclo, “Joint late reverberation and noise power spectral density estimation in a spatially homogeneous noise field,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[16] M. Tammen, S. Doclo, and I. Kodrasi, “Joint estimation of RETF vector and power spectral densities for speech enhancement based on alternating least squares,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 795–799.

[17] A. I. Koutrouvelis, R. C. Hendriks, R. Heusdens, and J. Jensen, “Robust joint estimation of multimicrophone signal model parameters,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1136–1150, 2019.

[18] R. A. Wooding, “The multivariate distribution of complex normal variables,” Biometrika, vol. 43, no. 1/2, pp. 212–215, 1956.

[19] B. F. Cron and C. H. Sherman, “Spatial-correlation functions for various noise models,” The Journal of the Acoustical Society of America, vol. 34, no. 11, pp. 1732–1736, 1962.

[20] H. L. Van Trees, Optimum Array Processing: Part IV of Detection, Estimation and Modulation Theory. Wiley, 2002.

[21] S. Markovich, S. Gannot, and I. Cohen, “Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfering speech signals,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1071–1086, 2009.

[22] K. U. Simmer, J. Bitzer, and C. Marro, “Post-filtering techniques,” in Microphone Arrays. Springer, 2001, pp. 39–60.

[23] R. Balan and J. Rosca, “Microphone array speech enhancement by Bayesian estimation of spectral amplitude and phase,” in IEEE Sensor Array and Multichannel Signal Processing Workshop, 2002, pp. 209–213.

[24] H. Ye and R. D. DeGroat, “Maximum likelihood DOA estimation and asymptotic Cramer–Rao bounds for additive unknown colored noise,” IEEE Transactions on Signal Processing, vol. 43, no. 4, pp. 938–949, 1995.

[25] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.

[26] ITU-T Recommendation, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Rec. ITU-T P.862, 2001.

[27] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements, Objective Measures of Speech Quality. Prentice Hall, 1988.

[28] E. Hadad, F. Heese, P. Vary, and S. Gannot, “Multichannel audio database in various acoustic environments,” in International Workshop on Acoustic Signal Enhancement (IWAENC), 2014, pp. 313–317.

[29] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1,” Disc, 1993.

[30] A. Graham, Kronecker Products and Matrix Calculus with Applications. Courier Dover Publications, 2018.

[31] W. J. Bangs, “Array processing with generalized beamformers,” Ph.D. dissertation, Yale University, New Haven, CT, 1972.

[32] P. Stoica, E. G. Larsson, and A. B. Gershman, “The stochastic CRB for array processing: A textbook derivation,” IEEE Signal Processing Letters, vol. 8, no. 5, pp. 148–150, 2001.

[33] A. Gershman, P. Stoica, M. Pesavento, and E. G. Larsson, “Stochastic Cramer–Rao bound for direction estimation in unknown noise fields,” IEE Proceedings – Radar, Sonar and Navigation, vol. 149, no. 1, pp. 2–8, 2002.

[34] C. R. Rao and H. Toutenburg, Linear Models: Least Squares and Alternatives. Springer, 1999.
