Blind Multichannel System Identification
with Applications in Speech Signal Processing
R. M. Nickel
Department of Electrical Engineering, 121 EE West,
The Pennsylvania State University, University Park, PA 16802
Email: rmn10@psu.edu
Abstract
We present a new approach to blind multichannel system identification. The approach relies on the existence of so-called exclusive activity periods (EAPs) in the source signals. EAPs are time intervals during which only one source is active and all other sources are inactive (i.e. zero). The existence of EAPs is not guaranteed for arbitrary signal classes. EAPs occur very frequently, however, in recordings of conversational speech. The methods proposed in this paper show how EAPs can be exploited to improve the performance of blind multichannel system identification in speech processing applications. We show that for modestly complex tasks the proposed method achieves an improvement of over 10 dB in signal-to-interference ratio over conventional techniques.
1 Introduction
The general goal in blind multichannel system identification is to estimate the transfer function of a multichannel system solely based on its output signals and without any specific knowledge about the input signals to the system. A practical example for the use of blind multichannel system identification is given by the following scenario: We are recording an acoustic scene with an array of microphones. If we want to isolate the voice of a single speaker out of a mixture of signals from different sources then it is implicitly necessary to estimate the transmission properties (or equivalently the inverse transmission properties) of all channels between all microphones and all sources.

Historically, the first successful solutions to such problems were of instantaneous mixture type [1], i.e. cases in which the transfer function was merely a matrix of constants. A variety of tools, known as independent component analysis (ICA) methods (Comon [2]), have been developed for these cases. The general criterion behind ICA methods is to maximize a measure of independence between the assumed input signals to the system.
Solutions to the more general convolutive mixture case are significantly more complicated. Most solutions can be classified as either time domain approaches (see [3] and the references therein) or frequency domain approaches (see [4]). More specifically, we have to distinguish between the domain in which we model the mixing process (mixing domain) and the domain in which we model the statistics of the source signals (source domain). A choice of either time or frequency for each of these domains has significant advantages and disadvantages (see [4]).
In this paper we propose an alternative approach that does not explicitly rely on an independence assumption between sources. Instead, we assume the existence of exclusive activity periods (EAPs). EAPs are time intervals during which only one source is active and all other sources are silent. The existence of EAPs is not guaranteed for arbitrary signal classes, but EAPs occur very frequently in recordings of conversational speech.
2 Methods
A block diagram that depicts the considered scenario is shown in figure 1. We assume that we have $M$ unknown source signals $x_i[n]$ for $i = 1 \ldots M$. The transmission path between source $i$ and receiver $j$ is described by the transfer function of a linear time-invariant system with impulse response $g_{ij}[n]$. The resulting $M$ observation signals $y_j[n]$ are generated according to:

$$y_j[n] = \sum_{i=1}^{M} \sum_{k=0}^{\infty} g_{ij}[k]\, x_i[n-k] \quad \text{for } j = 1 \ldots M. \tag{1}$$
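The mixing model of equation (1) can be illustrated with a minimal sketch in plain Python. The function name `mix_sources`, the nested-list layout `filters[i][j]` for $g_{ij}$, and the toy signals are assumptions made for illustration only; the impulse responses are taken to be finite and samples before $n = 0$ are treated as zero.

```python
def mix_sources(sources, filters):
    """Convolutive mixing: y_j[n] = sum_i sum_k g_ij[k] * x_i[n - k].

    sources[i] is the sample list of source i; filters[i][j] is the
    (finite) impulse response g_ij from source i to receiver j.
    Samples before n = 0 are assumed to be zero.
    """
    num_sources = len(sources)
    num_samples = len(sources[0])
    observations = []
    for j in range(num_sources):
        y = [0.0] * num_samples
        for i in range(num_sources):
            g = filters[i][j]
            x = sources[i]
            for n in range(num_samples):
                # Only taps that reach back to valid samples contribute.
                for k in range(min(len(g), n + 1)):
                    y[n] += g[k] * x[n - k]
        observations.append(y)
    return observations


# Toy example with M = 2 sources and short, made-up filters.
x = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0]]
g = [[[1.0, 0.5], [0.3]],        # g_11, g_12
     [[0.2], [1.0, -0.5]]]       # g_21, g_22
y = mix_sources(x, g)
```

In this toy case the first source is only active at $n = 0$ and the second only at $n = 2$, so each observation is a sum of two scaled and filtered impulses.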
Equation (1) can also be expressed in the z-domain as

$$\mathbf{Y}(z) = \mathbf{G}(z)\, \mathbf{X}(z), \tag{2}$$

in which $\mathbf{X}(z)$ is the z-transform of the multichannel signal $\mathbf{x}[n] = [\, x_1[n] \;\; x_2[n] \;\ldots\; x_M[n] \,]^T$, $\mathbf{Y}(z)$ is the z-transform of the multichannel signal $\mathbf{y}[n] = [\, y_1[n] \;\; y_2[n] \;\ldots\; y_M[n] \,]^T$, and $\mathbf{G}(z)$ is a matrix with the z-transforms of the impulse responses $g_{ij}[n]$ for all $i$ and $j$.

Proceedings of the 2005 International Conference on Computational Intelligence for Modelling, Control and Automation, and International Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC05)
0-7695-2504-0/05 $20.00 2005 IEEE

The goal in blind system identification is to find a matrix of transfer functions $\mathbf{H}(z)$ such that $\hat{\mathbf{X}}(z)$ given by

$$\hat{\mathbf{X}}(z) = \mathbf{H}(z)\, \mathbf{Y}(z) \tag{3}$$

is (in some appropriate metric) as close to $\mathbf{X}(z)$ as possible. Unfortunately, since we cannot measure the true source signals $\mathbf{x}[n]$, it is generally not possible to explicitly minimize the deviation. Instead, the most commonly used criterion to find $\mathbf{H}(z)$ is to choose it such that the components of the resulting source reconstruction $\hat{\mathbf{x}}[n]$ are statistically as independent as possible.

In this paper we propose an alternative method that minimizes an estimate of the deviation between $\hat{\mathbf{x}}[n]$ and $\mathbf{x}[n]$. The approach can be divided into three steps:
1. Find a set of time intervals $[n_1, n_2]$ during which only one source is active and all other sources are silent. We refer to such time intervals as exclusive activity periods (EAPs).
2. Find a set of transfer functions that deconvolve the sources during EAPs.
3. Construct $\mathbf{H}(z)$ by combining the results from different EAPs from different sources.
One of the caveats of the proposed approach is that it is
dependent on the existence of exclusive activity periods.
The existence of EAPs is not guaranteed for arbitrary signal
classes. EAPs do occur, however, very frequently in record-
ings of conversational speech, which makes the proposed
method particularly interesting for the solution of the blind
speech separation problem.
2.1 EAP Estimation for Conversational Speech Signals
An exclusive activity period is a time during which only one source $x_i[n]$ is active and all other sources $x_k[n]$ are silent, i.e. $x_k[n] = 0$ for $k \neq i$. The estimation of exclusive activity periods for speech sources $x_i[n]$ can be based on the (almost) periodic nature of vocalic sounds. If only a single person is speaking then all observations $y_j[n]$ exhibit time intervals with a periodic structure. When multiple persons are speaking the periodicity is generally destroyed [5].
A robust short-time periodicity measure was proposed by Medan et al. [6]. They consider the similarity between two adjacent observation segments of length $k$:

$$\mathbf{s}_{j1}[n,k] = [\, y_j[n-k] \;\ldots\; y_j[n-2] \;\; y_j[n-1] \,]^T \tag{4}$$

$$\mathbf{s}_{j2}[n,k] = [\, y_j[n] \;\; y_j[n+1] \;\ldots\; y_j[n+k-1] \,]^T. \tag{5}$$
[Figure 1: block diagram titled "General Model Scenario" with sources $x_1[n], x_2[n], \ldots, x_M[n]$, a mixture model with impulse responses $g_{ii}[n]$, $g_{ij}[n]$, $g_{ji}[n]$, $g_{jj}[n]$, and observations $y_1[n], y_2[n], \ldots, y_M[n]$.]
Figure 1. A block diagram of the mixing scenario described by equation (1). The $M$ unknown signal sources are labelled with $x_i[n]$. The $M$ observations are labelled with $y_j[n]$.
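A correlation measure $\mathrm{NCOR}_j[n,k]$ is defined through a normalized inner product of vectors $\mathbf{s}_{j1}[n,k]$ and $\mathbf{s}_{j2}[n,k]$:

$$\mathrm{NCOR}_j[n,k] = \frac{\mathbf{s}_{j1}[n,k]^T\, \mathbf{s}_{j2}[n,k]}{\| \mathbf{s}_{j1}[n,k] \| \; \| \mathbf{s}_{j2}[n,k] \|}. \tag{6}$$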
The normalization ensures that the magnitude of the correlation measure is bounded by one, i.e. $-1 \le \mathrm{NCOR}_j[n,k] \le 1$. The correlation measure is equal to one at the true period $p$ of a perfectly periodic signal. Less than perfectly periodic signals yield correlation values less than one. As a consequence, we can define a short-time periodicity measure as:

$$\mathrm{STPM}_j[n] = \max_{p_{\min} \le k \le p_{\max}} \{\, \mathrm{NCOR}_j[n,k] \,\}. \tag{7}$$
The search range for the maximum should be bounded by the typical pitch range of human speech (50 Hz ... 500 Hz). For observation signals that are sampled with sampling frequency $F_s$ we have:

$$p_{\min} = F_s / 500\ \text{Hz} \quad \text{and} \quad p_{\max} = F_s / 50\ \text{Hz}. \tag{8}$$
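The periodicity measure of equations (4) to (8) can be sketched in a few lines of plain Python. The function names (`pitch_lag_range`, `ncor`, `stpm`), the truncation of the lag bounds to integers, the handling of all-zero segments, and the synthetic 100 Hz test tone are assumptions introduced for illustration; they are not taken from the paper.

```python
import math


def pitch_lag_range(fs, f_lo=50.0, f_hi=500.0):
    """Lag bounds (p_min, p_max) in samples for a pitch range of f_lo..f_hi Hz."""
    return int(fs / f_hi), int(fs / f_lo)


def ncor(y, n, k):
    """Normalized inner product of the two adjacent length-k segments at n."""
    s1 = y[n - k:n]          # y[n-k] ... y[n-1]
    s2 = y[n:n + k]          # y[n]   ... y[n+k-1]
    num = sum(a * b for a, b in zip(s1, s2))
    den = math.sqrt(sum(a * a for a in s1)) * math.sqrt(sum(b * b for b in s2))
    # Assumption: an all-zero segment is treated as "not periodic".
    return num / den if den > 0.0 else 0.0


def stpm(y, n, fs):
    """Short-time periodicity measure: maximum NCOR over the pitch lag range."""
    p_min, p_max = pitch_lag_range(fs)
    lags = [k for k in range(p_min, p_max + 1)
            if n - k >= 0 and n + k <= len(y)]
    return max((ncor(y, n, k) for k in lags), default=0.0)


# A perfectly periodic 100 Hz tone sampled at 8 kHz has a period of 80 samples.
fs = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * n / fs) for n in range(800)]
periodicity = stpm(tone, 400, fs)   # close to 1.0 for this tone
```

For $F_s = 8$ kHz the lag range evaluates to 16 to 160 samples, so each evaluation of the measure scans 145 candidate lags.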
A second feature that correlates well with EAPs is the so-called short-time zero crossing rate [7]:

$$\mathrm{STZC}_j[n] = \sum_{m=n-L}^{n+L} \big|\, \mathrm{sign}(y_j[m]) - \mathrm{sign}(y_j[m-1]) \,\big|. \tag{9}$$

In our notation $\mathrm{sign}(x)$ is equal to $+1$ for $x \ge 0$ and $-1$ for $x < 0$. The zero crossing rate counts the number of transitions from positive samples to negative samples within the range $(n - L - 1) \ldots (n + L)$. The range length is usually chosen as $L = F_s \cdot 10$ msec. Typically, the zero crossing rate is low for EAP sections and high otherwise. For the STZC measure to work properly it is important that possible quantization offsets in the recorded speech signal are removed prior to processing.
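A minimal sketch of the zero crossing measure of equation (9) follows. The helper names, the clipping of the summation window at the signal boundaries, and the use of mean subtraction as the offset removal step are assumptions for illustration. Note that with the sign convention above every sign change contributes 2 to the sum.

```python
def sign(x):
    """Sign convention of the text: +1 for x >= 0 and -1 for x < 0."""
    return 1 if x >= 0 else -1


def stzc(y, n, L):
    """Short-time zero crossing measure around sample n with half-width L."""
    lo = max(n - L, 1)               # m - 1 must be a valid index
    hi = min(n + L, len(y) - 1)
    return sum(abs(sign(y[m]) - sign(y[m - 1])) for m in range(lo, hi + 1))


def remove_offset(y):
    """Assumed offset removal: subtract the mean of the signal."""
    mean = sum(y) / len(y)
    return [v - mean for v in y]


# A constant offset hides every zero crossing of this small oscillation.
biased = [1.5, 0.5, 1.5, 0.5]
before = stzc(biased, 2, 2)                  # no sign changes are seen
after = stzc(remove_offset(biased), 2, 2)    # three sign changes, each counts 2

# Half-width for a sampling frequency of 8 kHz: L = Fs * 10 msec.
half_width = round(8000 * 0.010)
```

The small example shows why the offset removal matters: the same oscillation yields a count of zero before the mean is subtracted.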
A normalized short-time zero crossing measure $\mathrm{NZCM}_j[n]$ is constructed with the maximum $\mathrm{ZC}^{\max}_j$ and the minimum $\mathrm{ZC}^{\min}_j$ of $\mathrm{STZC}_j[n]$ over all $n$:

$$\mathrm{NZCM}_j[n] = \frac{\mathrm{STZC}_j[n] - \mathrm{ZC}^{\max}_j}{\mathrm{ZC}^{\max}_j - \mathrm{ZC}^{\min}_j}. \tag{10}$$

With this normalization $-1 \le \mathrm{NZCM}_j[n] \le 0$. The identification of EAP candidates is done as follows: 1) find all times $n$ for which $\mathrm{STPM}_j[n] \ge 0.7$ for all observations $j = 1 \ldots M$, 2) expand the so found sections in forward and backward direction until $\mathrm{STPM}_j[n] < 0.6$, 3) remove times for which $\mathrm{NZCM}_j[n] > -0.5$, and 4) retain only the intersection of all EAP candidates across all channels $j = 1 \ldots M$. An example for $\mathrm{STPM}_j[n]$, $\mathrm{NZCM}_j[n]$, and the resulting set of EAP candidate sections is shown in figure 3.
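The four selection steps can be sketched as follows. Several points are assumptions rather than statements from the paper: the normalization is taken as $(\mathrm{STZC} - \mathrm{ZC}^{\max}) / (\mathrm{ZC}^{\max} - \mathrm{ZC}^{\min})$, which lies between $-1$ and $0$; the removal threshold is taken as $-0.5$ on that scale; and the expansion of step 2 is carried out separately per channel before the intersection. All thresholds are exposed as parameters, and the function names are hypothetical.

```python
def nzcm(stzc_values):
    """Normalize a zero crossing contour to the range -1..0 (assumed form)."""
    zc_max, zc_min = max(stzc_values), min(stzc_values)
    span = zc_max - zc_min
    if span == 0:
        return [0.0] * len(stzc_values)   # degenerate flat contour
    return [(v - zc_max) / span for v in stzc_values]


def eap_candidates(stpm, nzcm_vals, hi=0.7, lo=0.6, zc_limit=-0.5):
    """Return a boolean mask of EAP candidate samples.

    stpm[j][n] and nzcm_vals[j][n] hold the contours of channel j.
    """
    num_ch, num_samples = len(stpm), len(stpm[0])
    # Step 1: seeds are times where every channel is strongly periodic.
    seeds = [all(stpm[j][n] >= hi for j in range(num_ch))
             for n in range(num_samples)]
    keep = [True] * num_samples
    for j in range(num_ch):
        mask = list(seeds)
        # Step 2: grow each section forward, then backward, while STPM >= lo.
        for n in range(1, num_samples):
            if mask[n - 1] and stpm[j][n] >= lo:
                mask[n] = True
        for n in range(num_samples - 2, -1, -1):
            if mask[n + 1] and stpm[j][n] >= lo:
                mask[n] = True
        for n in range(num_samples):
            # Step 3: drop times with a high zero crossing measure.
            if nzcm_vals[j][n] > zc_limit:
                mask[n] = False
            # Step 4: intersect across channels.
            keep[n] = keep[n] and mask[n]
    return keep


# Single-channel toy contours (made-up numbers).
stpm_contour = [[0.1, 0.65, 0.8, 0.65, 0.5, 0.65]]
quiet = [[-1.0] * 6]
mask = eap_candidates(stpm_contour, quiet)
```

The two-threshold scheme acts as a hysteresis: a section must reach the upper threshold once to be accepted, but it is only cut where the periodicity drops below the lower threshold.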
2.2 Blind EAP Deconvolution
In this section we discuss the subproblem of blind system identification under an exclusive activity assumption, i.e. we assume that we have identified a time interval $[n_1, n_2]$ during which only source $x_i[n]$ is active and all other sources are silent, i.e. $x_k[n] = 0$ for $k \neq i$. Under the EAP assumption we may attempt to reconstruct source $x_i[n]$ from each observation $y_j[n]$ via an appropriately chosen inverse filter $h_{ij}[n]$:

$$\hat{x}^j_i[n] = \sum_{k=0}^{P} h_{ij}[k]\, y_j[n-k]. \tag{11}$$
Ideally $x_i[n] = \hat{x}^j_i[n]$ for all $j = 1 \ldots M$. Practically, however, we have $\hat{x}^k_i[n] \neq \hat{x}^j_i[n]$ for $k \neq j$ due to noise, imperfect estimation of the EAPs, improper choice of $P$, non-minimum phase properties of $g_{ij}[n]$, and so forth.
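An estimate $E_i$ of the reconstruction error can be defined with

$$E_i = \sum_{j=1}^{M} \sum_{n} \big|\, \hat{x}^j_i[n] - \bar{x}_i[n] \,\big|^2 \tag{12}$$

$$\text{and} \quad \bar{x}_i[n] = \frac{1}{M} \sum_{j=1}^{M} \hat{x}^j_i[n]. \tag{13}$$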
It is readily seen from equation (12) that a perfect reconstruction with $x_i[n] = \hat{x}^j_i[n]$ yields a minimum error estimate of $E_i = 0$. Unfortunately, $x_i[n] = \hat{x}^j_i[n]$ may not be the only solution that satisfies $E_i = 0$ (e.g. if the $g_{ij}[n]$ are linearly dependent). One may hope, however, that (if the $g_{ij}[n]$ are sufficiently different) a global minimization of $E_i$ will lead to good estimates for the $h_{ij}[n]$ for $j = 1 \ldots M$.
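The computation of the global minimum of $E_i$ is aided by the following notation. We define:

$$\hat{\mathbf{x}}^j_i = [\, \hat{x}^j_i[n_1+P] \;\ldots\; \hat{x}^j_i[n_2-1] \;\; \hat{x}^j_i[n_2] \,]^T,$$

$$\mathbf{Y}^j = \begin{bmatrix} y_j[n_1] & y_j[n_1+1] & \cdots & y_j[n_1+P] \\ y_j[n_1+1] & y_j[n_1+2] & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ y_j[n_2-P] & \cdots & \cdots & y_j[n_2] \end{bmatrix},$$

$$\text{and} \quad \mathbf{h}^j_i = [\, h_{ij}[P] \;\ldots\; h_{ij}[1] \;\; h_{ij}[0] \,]^T. \tag{14}$$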
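The following sketch illustrates how the data matrix and the reversed filter vector of equation (14) reproduce the inverse filtering of equation (11) as a matrix-vector product. The function names and the toy numbers are assumptions for illustration.

```python
def data_matrix(y, n1, n2, P):
    """Data matrix of one channel: the row for time n holds y[n-P] ... y[n]."""
    return [[y[n - P + c] for c in range(P + 1)]
            for n in range(n1 + P, n2 + 1)]


def filter_vector(h):
    """Filter vector [h[P], ..., h[1], h[0]] built from taps h[0..P]."""
    return list(reversed(h))


def matvec(A, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def direct_filter(y, h, n):
    """Inverse filter output at time n: sum_k h[k] * y[n - k]."""
    return sum(h[k] * y[n - k] for k in range(len(h)))


# Toy observation and a made-up inverse filter of order P = 2.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
h = [1.0, -1.0, 0.5]
n1, n2, P = 0, 5, 2

Y = data_matrix(y, n1, n2, P)
x_hat = matvec(Y, filter_vector(h))
x_ref = [direct_filter(y, h, n) for n in range(n1 + P, n2 + 1)]
```

Because the first usable output sample is at $n_1 + P$, the matrix has $n_2 - n_1 - P + 1$ rows, so an EAP section has to be longer than the filter order to contribute any rows at all.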
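Equation (11) can be rewritten as $\hat{\mathbf{x}}^j_i = \mathbf{Y}^j \mathbf{h}^j_i$ and:

$$\bar{\mathbf{x}}_i = \frac{1}{M} \sum_{j=1}^{M} \hat{\mathbf{x}}^j_i = \frac{1}{M} \sum_{j=1}^{M} \mathbf{Y}^j \mathbf{h}^j_i. \tag{15}$$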
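The error estimate (12) becomes:

$$E_i = \sum_{j=1}^{M} \Big\| \mathbf{Y}^j \mathbf{h}^j_i - \frac{1}{M} \sum_{k=1}^{M} \mathbf{Y}^k \mathbf{h}^k_i \Big\|^2 \tag{16}$$

$$\phantom{E_i} = \sum_{j=1}^{M} [\, \mathbf{Y}^j \mathbf{h}^j_i \,]^T\, \mathbf{Y}^j \mathbf{h}^j_i - \frac{1}{M} \sum_{j=1}^{M} \sum_{k=1}^{M} [\, \mathbf{Y}^k \mathbf{h}^k_i \,]^T\, \mathbf{Y}^j \mathbf{h}^j_i.$$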
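We define the matrices $\mathbf{R}_{jk} = [\, \mathbf{Y}^j \,]^T \mathbf{Y}^k$ and

$$\mathbf{R}^F = \begin{bmatrix} \mathbf{R}_{11} & \mathbf{R}_{12} & \cdots & \mathbf{R}_{1M} \\ \mathbf{R}_{21} & \mathbf{R}_{22} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{R}_{M1} & \cdots & \cdots & \mathbf{R}_{MM} \end{bmatrix}, \tag{17}$$

$$\mathbf{R}^D = \begin{bmatrix} \mathbf{R}_{11} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{22} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \mathbf{0} \\ \mathbf{0} & \cdots & \mathbf{0} & \mathbf{R}_{MM} \end{bmatrix}, \tag{18}$$

$$\text{and} \quad \mathbf{H}_i = \big[\, [\mathbf{h}^1_i]^T \;\; [\mathbf{h}^2_i]^T \;\ldots\; [\mathbf{h}^M_i]^T \,\big]^T. \tag{19}$$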
Using equations (14) to (19) we can compactly write the error estimate as

$$E_i = \mathbf{H}_i^T \Big[\, \mathbf{R}^D - \frac{1}{M} \mathbf{R}^F \,\Big] \mathbf{H}_i. \tag{20}$$
In order to avoid the trivial minimization of equation (20) (with $\mathbf{H}_i = \mathbf{0}$) we constrain the solution to

$$\| \bar{\mathbf{x}}_i \|^2 = \frac{1}{M^2}\, \mathbf{H}_i^T \mathbf{R}^F \mathbf{H}_i = 1. \tag{21}$$
We have thus reformulated the problem into that of finding the vector $\mathbf{H}_i$ that minimizes $E_i$ subject to equation (21).

It is readily shown with Lagrange multipliers that the solution to the above problem is provided by one of the generalized eigenvectors $\boldsymbol{\Phi}_m$ of matrices $\mathbf{R}^D$ and $\mathbf{R}^F$:

$$\lambda_m\, \mathbf{R}^D \boldsymbol{\Phi}_m = \mathbf{R}^F \boldsymbol{\Phi}_m. \tag{22}$$
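We assume that the eigenvalues $\lambda_m$ are sorted in decreasing order $\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \lambda_4 \ldots$ and that the eigenvectors are normalized such that $\frac{1}{M^2}\, \boldsymbol{\Phi}_m^T \mathbf{R}^F \boldsymbol{\Phi}_m = 1$ for all $m$, i.e. such that constraint (21) holds. Choosing $\boldsymbol{\Phi}_m$ as the solution leads to an error estimate of $E_i = M (M \lambda_m^{-1} - 1)$. The optimal solution, i.e. the one that minimizes $E_i$, is thus given by

$$\mathbf{H}_i = \boldsymbol{\Phi}_1 \quad \text{and} \quad E_i = M (M \lambda_1^{-1} - 1).$$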
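The Lagrange multiplier argument can be written out in a few lines. The sketch below uses $\mu$ for the multiplier and $\lambda$ for the generalized eigenvalue; both symbol choices are assumptions, as is the reading of the normalization as constraint (21). It relies only on the symmetry of $\mathbf{R}^D$ and $\mathbf{R}^F$.

```latex
% Lagrangian for minimizing (20) subject to (21); \mu is the multiplier
\mathcal{L}(\mathbf{H}_i, \mu)
  = \mathbf{H}_i^T \left[ \mathbf{R}^D - \tfrac{1}{M}\mathbf{R}^F \right] \mathbf{H}_i
  - \mu \left( \tfrac{1}{M^2}\, \mathbf{H}_i^T \mathbf{R}^F \mathbf{H}_i - 1 \right)

% Stationarity with respect to H_i (R^D and R^F are symmetric)
\nabla_{\mathbf{H}_i} \mathcal{L} = 0
  \;\Rightarrow\;
  \mathbf{R}^D \mathbf{H}_i
    = \left( \tfrac{1}{M} + \tfrac{\mu}{M^2} \right) \mathbf{R}^F \mathbf{H}_i
  \;\Rightarrow\;
  \lambda\, \mathbf{R}^D \mathbf{H}_i = \mathbf{R}^F \mathbf{H}_i ,
  \qquad \lambda = \frac{M^2}{M + \mu}

% Error at a stationary point, using H_i^T R^F H_i = M^2 from (21)
E_i = \mathbf{H}_i^T \mathbf{R}^D \mathbf{H}_i
      - \tfrac{1}{M}\, \mathbf{H}_i^T \mathbf{R}^F \mathbf{H}_i
    = \left( \tfrac{1}{\lambda} - \tfrac{1}{M} \right) M^2
    = M \left( \frac{M}{\lambda} - 1 \right)
```

Since $E_i$ is a sum of squares it cannot be negative, which implies $\lambda \le M$; the error is therefore smallest for the largest eigenvalue and vanishes only when $\lambda = M$.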
2.3 Blind System Identification
As a result of the methods described in sections 2.1 and 2.2 we obtain an inverse filter estimate $\mathbf{H}_i$ (with its associated eigenvalue $\lambda_1$) for each separately identified EAP section. In a first step we discard all EAP sections (and $\mathbf{H}_i$)
for whichlog10M
M1was greater than a certain EAP ac-
ceptance threshold(EAT - see section 4). In a second step
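we use a simple minimum Euclidean distance hierarchical clustering method [8] to associate each vector $\mathbf{H}_i$ to one of the $M$ sources. All vectors associated with the same source $k$ are averaged (arithmetic mean; see footnote 1) into an average eigenvector $\bar{\mathbf{H}}_k$ for each source $k = 1 \ldots M$. By extracting the corresponding subvectors $\bar{\mathbf{h}}^j_i$ in analogy to equation (19):

$$\bar{\mathbf{H}}_i = \big[\, [\bar{\mathbf{h}}^1_i]^T \;\; [\bar{\mathbf{h}}^2_i]^T \;\ldots\; [\bar{\mathbf{h}}^M_i]^T \,\big]^T, \tag{23}$$

we obtain a complete set of inverse filter vectors $\bar{\mathbf{h}}^j_i$:

$$\bar{\mathbf{h}}^j_i = [\, \bar{h}_{ij}[P] \;\ldots\; \bar{h}_{ij}[1] \;\; \bar{h}_{ij}[0] \,]^T. \tag{24}$$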
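The clustering and averaging step can be sketched as below. The paper only names a "simple minimum Euclidean distance hierarchical clustering" method, so the specific agglomerative scheme used here (repeatedly merging the two clusters with the closest weighted centroids until $M$ clusters remain) is an assumption, as are all function names and the toy vectors. The weights play the role of the EAP section lengths mentioned in the paper's footnote on averaging. Sign or scale ambiguities of eigenvectors are not handled in this sketch.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def weighted_mean(vectors, weights):
    """Arithmetic mean of the vectors, weighted by the given weights."""
    total = float(sum(weights))
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(len(vectors[0]))]


def cluster_and_average(vectors, weights, num_sources):
    """Merge closest clusters until num_sources remain; return means and members."""
    clusters = [[i] for i in range(len(vectors))]

    def centroid(members):
        return weighted_mean([vectors[i] for i in members],
                             [weights[i] for i in members])

    while len(clusters) > num_sources:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = euclid(centroid(clusters[a]), centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [centroid(c) for c in clusters], clusters


# Four made-up filter vectors from four EAP sections of different length.
H = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 1.1]]
section_lengths = [100, 300, 200, 200]
means, groups = cluster_and_average(H, section_lengths, 2)
```

With these toy numbers the first two vectors end up in one cluster and the last two in the other, and the longer section dominates the first average.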
An estimate $\hat{\mathbf{G}}(z)$ for the mixing matrix $\mathbf{G}(z)$ from equation (2) is obtained from:

$$[\, \hat{\mathbf{G}}(z) \,]_{ij} = \frac{1}{\sum_{k=0}^{P} \bar{h}_{ij}[k]\, z^{-k}}, \tag{25}$$

where the $\bar{h}_{ij}[k]$ are the averaged inverse filter coefficients and the notation $[\mathbf{G}]_{ij}$ refers to the element of matrix $\mathbf{G}$ in row $i$ and column $j$. An estimate for the demixing matrix $\mathbf{H}(z)$ from equation (3) can be obtained by numerically inverting $\hat{\mathbf{G}}(z)$ via Gaussian elimination. Unfortunately, the inversion process may introduce unstable poles into the transfer functions of $\mathbf{H}(z)$. The production of stable filters can be enforced by mirroring poles that fall outside of the unit circle back into the inside of the unit circle. The mirroring process distorts the correct phase response, but leaves the magnitude response of individual channels intact.
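The pole mirroring idea can be illustrated with a small sketch for an all-pole transfer function $\prod_p 1 / (1 - p\, z^{-1})$. The function names and the example poles are assumptions. The sketch also applies a constant gain factor of $1/|p|$ for each mirrored pole; this compensation is an addition made here so that the magnitude response matches exactly rather than up to a constant, and the paper does not state how the gain was treated.

```python
import cmath


def mirror_poles(poles):
    """Reflect poles outside the unit circle to 1 / conj(p).

    Returns the stabilized pole list and the gain that keeps the
    magnitude response on the unit circle unchanged.
    """
    mirrored, gain = [], 1.0
    for p in poles:
        if abs(p) > 1.0:
            mirrored.append(1.0 / p.conjugate())
            gain /= abs(p)
        else:
            mirrored.append(p)
    return mirrored, gain


def allpole_magnitude(poles, omega, gain=1.0):
    """Magnitude of gain / prod(1 - p * exp(-j*omega)) at frequency omega."""
    z_inv = cmath.exp(-1j * omega)
    mag = gain
    for p in poles:
        mag /= abs(1.0 - p * z_inv)
    return mag


# One unstable pole at 1.25 and one stable pole at 0.5j (made-up values).
poles = [complex(1.25, 0.0), complex(0.0, 0.5)]
stable_poles, gain = mirror_poles(poles)

# Compare the magnitude responses at a few frequencies.
deviation = max(
    abs(allpole_magnitude(poles, w) - allpole_magnitude(stable_poles, w, gain))
    for w in (0.0, 0.7, 2.0, 3.1)
)
```

Mirroring a pole $p$ to $1/p^*$ keeps its angle and inverts its radius, which is why the magnitude on the unit circle changes only by the constant factor $|p|$ while the phase response is altered.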
3 Experiments
Experiments were conducted to verify the performance of the proposed method. As speech data we used the SI-subset of the TIMIT database from the Linguistic Data Consortium (footnote 2). The chosen subset consists of recordings from 630 subjects each uttering 3 phonetically-diverse sentences (footnote 3). The sentences were recorded with a sampling frequency of 16 kHz. The signals were low-pass filtered and down-sampled to 8 kHz prior to processing. All 3 sentences from the same speaker were concatenated and then truncated to 4 seconds. As a result we obtained a total of 630 different source signals $x[n]$, each 4 seconds long.

We ran experiments with different source numbers ($M = 2$, 3, and 4) and different filter lengths ($P + 1 = 5$, 7, and 10). For each $M$ the available data was randomly split (footnote 4) into $\lfloor 630/M \rfloor$ groups of $M$ source signals $x_i[n]$. To simulate conversational speech the signals were partially faded out to obtain a relative time-overlap between signals of roughly 30% (see figure 2). The $M$ source signals of each group were mixed with order-$P$ random minimum phase filters $g_{ij}[n]$ according to equation (1). The resulting observations $y_j[n]$ for $j = 1 \ldots M$ were then used to estimate the inverse filter matrix $\mathbf{H}(z)$ according to section 2. The reconstructed source signal estimates $\hat{x}_i[n]$ for $i = 1 \ldots M$ were computed according to equation (3) via $\hat{\mathbf{X}}(z) = \mathbf{H}(z)\, \mathbf{Y}(z)$.

Footnote 1. The weight for each vector was chosen proportional to the number of samples contained in the associated EAP section. Longer EAP sections thus had more weight than shorter ones.
Footnote 2. The data is available at .
Footnote 3. None of the sentences are repeated more than once.
Footnote 4. Every source signal was only used once.

[Figure 2: plot titled "Source Signals" with two axes, Channel #1 and Channel #2, amplitude from -1 to 1 versus Time [sec] from 0 to 4.]
Figure 2. An example of two source signals $x_1[n]$ and $x_2[n]$ from the TIMIT database. The signals were aligned to have a 30% overlap in time.

[Figure 3: plot titled "Mixed Signals and Features" with two axes, Channel #1 and Channel #2, amplitude from -1 to 1 versus Time [sec] from 0 to 4.]
Figure 3. The resulting mixed signals $y_1[n]$ and $y_2[n]$ from the example in figure 2. The upper dashed line in each axis indicates the resulting $\mathrm{STPM}_j[n]$ contour (equation (7)) and the lower dashed line indicates the resulting $\mathrm{NZCM}_j[n]$ contour (equation (10)). The gray regions indicate the EAP sections that were estimated from the mixed signals.

[Figure 4: plot titled "Demixed Signals EAP Method" with two axes, Channel #1 and Channel #2, amplitude from -1 to 1 versus Time [sec] from 0 to 4.]
Figure 4. The resulting demixed signals $\hat{x}_1[n]$ and $\hat{x}_2[n]$ from the example in figure 2 after application of the proposed EAP method. Both signals are very close to the original source signals $x_1[n]$ and $x_2[n]$ depicted in figure 2.

[Figure 5: plot titled "Demixed Signals Parra/Spence Method" with two axes, Channel #1 and Channel #2, amplitude from -1 to 1 versus Time [sec] from 0 to 4.]
Figure 5. The demixing result of a commonly used conventional method of blind source separation after L. Parra and C. Spence [9] (see section 4).
The quality of the estimated model was evaluated with the Signal-to-Interference Ratio (SIR in [dB]) between the reconstructed signal $\hat{x}_i[n]$ and the original signal $x_i[n]$:

$$\mathrm{SIR}_i = \max_{p} \left\{ 10 \log_{10} \frac{\sum_n |\, x_i[n] \,|^2}{\sum_n |\, x_i[n] - \hat{x}_i[n-p] \,|^2} \right\}. \tag{26}$$
The evaluation of the SIR was performed under careful con-
sideration of possible numbering permutations between the
original signals and the reconstructions.
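A minimal sketch of the SIR evaluation of equation (26) is given below. The search range for the shift $p$ is not stated in the text, so the parameter `max_shift` is a hypothetical choice, as are the function name, the zero padding outside the signal, and the toy signals. The permutation handling mentioned above is not part of this sketch.

```python
import math


def sir_db(x, x_hat, max_shift):
    """Best SIR in dB over integer shifts p in -max_shift..max_shift."""
    signal_energy = sum(v * v for v in x)
    best = -math.inf
    for p in range(-max_shift, max_shift + 1):
        err = 0.0
        for n in range(len(x)):
            m = n - p
            # Samples outside the reconstruction are assumed to be zero.
            shifted = x_hat[m] if 0 <= m < len(x_hat) else 0.0
            err += (x[n] - shifted) ** 2
        if err == 0.0:
            return math.inf          # perfect reconstruction at this shift
        best = max(best, 10.0 * math.log10(signal_energy / err))
    return best


# Made-up test signal and two reconstructions of it.
x = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0, 0.0]
scaled = [0.9 * v for v in x]        # amplitude error of 10 percent
delayed = [0.0] + x[:-1]             # pure delay by one sample

sir_scaled = sir_db(x, scaled, 2)    # about 20 dB, reached at p = 0
sir_delayed = sir_db(x, delayed, 2)  # infinite: the shift p = -1 undoes the delay
```

The maximization over $p$ makes the measure insensitive to a pure delay in the reconstruction, which is why the delayed copy scores as a perfect match.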
Figures 2 to 4 show an example for an experiment with two sources. The gray regions in the figures indicate the EAP sections that were estimated from the mixed signals $y_1[n]$ and $y_2[n]$ from figure 3. Figure 5 shows the result for a commonly used conventional method of blind source separation after L. Parra and C. Spence [9] (see section 4).
4 Results
The results of the experiments are summarized in tables I and II. Table I lists the average SIR values (AvSIR) that were obtained by averaging the $\mathrm{SIR}_i$ after equation (26) over all channels and all experiments with the same source number $M$ and filter order $P$. The third column reports the average SIR values for the proposed EAP method. The fourth column reports the average SIR values that resulted from an application of the popular blind source separation method proposed by L. Parra and C. Spence [9]. The results for the Parra/Spence method were computed with software written by S. Harmeling (MATLAB function convbss.m, endorsed by L. Parra and C. Spence). The last column reports the average SIR between the observations $y_j[n]$ and the sources $x_i[n]$ as a reference.
Table II provides supplemental information for each experiment. Column three of table II lists the average SIR results that are obtained when the proposed method is applied to the true EAP locations (and not the estimated EAP locations). The fourth column of table II reports the number of instances (in %) in which the numerical inversion of matrix $\hat{\mathbf{G}}(z)$ after equation (25) led to unstable poles that had to be mapped back into the unit circle. Column five reports the chosen value for the EAP acceptance threshold (EAT) as described in section 2.3.
It is clearly visible from table I that the proposed method achieves significant improvements over the Parra/Spence method for small complexity tasks with smaller source numbers $M$ and smaller filter orders $P$. In the best case scenario, for two sources and with a filter length of 5 taps, we obtain an 11 dB improvement in average signal-to-interference ratio (16.06 dB versus 4.96 dB). Unfortunately, the advantage vanishes with larger complexity tasks. The decline is partially due to the increasing number of pole location changes as listed in column four of table II.
Table I

Source    Filter    AvSIR [dB]   AvSIR [dB]     AvSIR [dB]
Number    Length    EAP          Parra/Spence   Mixed
M         P + 1     Method       Method         Signals
2         5         16.06        4.96           4.47
2         7         10.45        5.11           4.50
2         10         7.37        5.35           4.76
3         5         11.24        5.17           3.77
3         7          8.41        5.16           3.88
3         10         6.22        5.40           4.02
4         5          6.67        5.17           3.10
4         7          2.72        5.10           3.26

Table I. Average signal-to-interference ratios for various source numbers $M$, filter orders $P$, and algorithms.
Table II

Source    Filter    AvSIR [dB]   Pole
Number    Length    True         Mirror    EAT
M         P + 1     EAP          %
2         5         47.29        15.56     4.0
2         7         26.54        30.79     4.0
2         10         9.29        55.56     4.0
3         5         30.53        59.52     3.0
3         7         19.15        77.14     3.0
3         10         9.68        95.24     3.0
4         5         18.11        92.36     3.0
4         7          9.09        99.36     3.0

Table II. Supplemental statistics about the experiments with various source numbers $M$ and filter orders $P$.
A very promising result for future developments is contained in column three of table II. If the result of the EAP estimation is replaced with the true location of the EAPs in the given mixture signals $y_j[n]$ for $j = 1 \ldots M$ then the average SIR is dramatically improved over the Parra/Spence method even for higher complexity cases. It is thus expected that the method will produce significantly better results if equipped with a more robust EAP detection strategy.
5 Conclusions
We have presented a new approach for blind multichannel system identification. The approach relies on the detection of exclusive activity periods (EAPs) in the source signals. The presence of EAPs is exploited for a reliable blind estimation of source signals and channel properties (between sources and observations). The advantage of the proposed method was demonstrated experimentally within the framework of convolutive blind separation of speech signals.
The goal of the paper was to provide a proof of concept for EAP based, blind identification methods. Some of the methods presented in this paper, especially the section on EAP detection, are, in their current form, still suboptimal and deserve to be studied in greater detail. Despite its suboptimality, however, the proposed method still improves upon existing strategies (especially for lower complexity tasks).

A caveat of the proposed method is that (currently) we have not imposed a constraint that forces the optimal unmixing matrix $\mathbf{H}(z)$ to be representative of a stable system. Instead, we employed a simple pole-mirroring strategy that, by itself, is responsible for a substantial part of the performance loss at higher complexity tasks (see table II).
References
[1] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley-Interscience, 2001.
[2] P. Comon, "Independent component analysis: A new concept," Signal Processing, vol. 36, pp. 287-314, 1994.
[3] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, Wiley, Chichester, U.K., 2002.
[4] N. Mitianoudis and M. E. Davies, "Audio source separation: solutions and problems," International Journal of Adaptive Control and Signal Processing, vol. 18, no. 3, pp. 299-314, Apr. 2004.
[5] J. R. Deller, J. G. Proakis, and J. H. Hansen, Discrete-Time Processing of Speech Signals, Macmillan, New York, 1993.
[6] Y. Medan, E. Yair, and D. Chazan, "Super resolution pitch determination of speech signals," IEEE Transactions on Signal Processing, vol. 39, no. 1, pp. 40-48, January 1991.
[7] A. M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, Wiley-Interscience, 2004.
[8] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience, Menlo Park, CA, 1973.
[9] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 3, pp. 320-327, May 2000.