
IEEE TRANSACTIONS ON, VOL. X, NO. X, JANUARY 20XX

Dictionary learning for speech based on a doubly sparse greedy adaptive dictionary algorithm

    Maria G. Jafari and Mark D. Plumbley

Abstract—In this paper we present a greedy adaptive dictionary learning algorithm that sets out to find sparse atoms for speech signals, and in the process yields a sparse signal decomposition. The algorithm learns the dictionary atoms on data frames taken from a speech signal. It iteratively extracts the data frame with minimum sparsity index, and adds this to the dictionary matrix. The contribution of this atom to the data frames is then removed, and the process is repeated.

The algorithm is applied to the problem of speech representation and speech denoising, and its performance is compared to other existing methods. The method is shown to find dictionary atoms that are sparser than their time-domain waveform, and also to result in a sparser speech representation. In the presence of noise, the algorithm is found to have similar performance to the well-established principal component analysis.

Finally, we apply a visualization method for the evaluation of the characteristics of the atoms within a dictionary, and find that it conveys much more information about the time-frequency properties of the dictionary than looking at the spectrogram alone.

Index Terms—Sparse decomposition, adaptive dictionary, sparse dictionary, dictionary learning, speech analysis, speech denoising, dictionary visualization.

    I. INTRODUCTION

SPARSE signal representations allow the information within a signal to be conveyed with only a few elementary components, called atoms. For this reason, they have acquired great popularity over the years, and they have been successfully applied to a variety of problems, including the study of the human sensory system [1]–[3], blind source separation [4]–[6], and signal denoising [7]. Successful application of a sparse decomposition depends on the dictionary used, and whether it matches the signal features [8].

Two main methods have emerged to determine a dictionary within a sparse decomposition: dictionary selection and dictionary learning. Dictionary selection entails choosing a pre-existing dictionary, such as the Fourier basis, wavelet basis or modified discrete cosine basis, or even constructing a redundant or overcomplete dictionary by forming a union of bases (for example the Fourier and wavelet bases) so

M. G. Jafari and M. D. Plumbley are with the Centre for Digital Music, Department of Electronic Engineering, Queen Mary University of London, London E1 4NS, U.K. (e-mail: [email protected], [email protected]).

This work was supported in part by the EU Framework 7 FET-Open project FP7-ICT-225913-SMALL: Sparse Models, Algorithms and Learning for Large-Scale data; and a Leadership Fellowship (EP/G007144/1) from the UK Engineering and Physical Sciences Research Council (EPSRC).

Reproducible research: the results and figures in this paper can be reproduced using the source code available online at http://www.elec.qmul.ac.uk/digitalmusic/papers/2010/Jafari10gadspeech/.

that different properties of the signal can be represented [9]. Dictionary learning, on the other hand, aims at deducing the dictionary from the training data, so that the atoms directly capture the specific features of the signal [7]. Dictionary learning methods are often based on an alternating optimization strategy, in which the dictionary is fixed and a sparse signal decomposition is found; then the dictionary elements are learned, while the signal representation is fixed.

Early dictionary learning methods by Olshausen and Field [2] and Lewicki and Sejnowski [10] were based on a probabilistic model of the observed data. Lewicki and Sejnowski [10] clarify the relation between sparse coding methods and independent component analysis (ICA), while the connection between dictionary learning in sparse coding and the vector quantization problem was pointed out by Kreutz-Delgado et al. [11]. The authors also proposed finding sparse representations using variants of the focal underdetermined system solver (FOCUSS) [12], and then updating the dictionary based on these representations. Aharon, Elad and Bruckstein [13] proposed the K-SVD algorithm. It involves a sparse coding stage, based on a pursuit method, followed by an update step, where the dictionary matrix is updated one column at a time, while allowing the expansion coefficients to change [13]. More recently, dictionary learning methods for exact sparse representation based on ℓ1 minimization [8], [14], and online learning algorithms [15], have been proposed.

Generally, the methods described above are computationally expensive algorithms that look for a sparse decomposition, for a variety of signal processing applications. In this paper, we are interested in targeting speech signals, and deriving a dictionary learning algorithm that is computationally fast. The algorithm should be able to learn a dictionary from a short speech signal, so that it can potentially be used in real-time processing applications.

    A. Motivation

The motivation for the work in this paper was provided by previous results in [16], where a sparse coding method based on ICA (SC-ICA) [17] was used to adaptively learn a dictionary from audio signals. For speech signals, many of the dictionary atoms were found to be localized in time and frequency. Figure 1 shows some of these atoms, along with their sparsity index, defined as [18]

ξ = ||x||_1 / ||x||_2,    (1)


[Figure 1: fifteen example basis vectors, each labeled with its sparsity index ξ, ranging from 7 to 19.6.]

Fig. 1. Examples of the basis vectors extracted with the SC-ICA method in [16]. The sparsity index ξ for each basis vector is also shown.

where ||·||_1 and ||·||_2 denote the ℓ1- and ℓ2-norm respectively. The sparsity index measures the sparsity of a signal, and is such that the smaller ξ, the sparser the vector x; for a Gaussian signal, the sparsity index is ξ ≈ 18. Thus, although we were looking for a sparse signal decomposition, we ended up with sparse basis functions. This raises the question: if a sparse decomposition produces sparse atoms, what signal decomposition do we get when we look for sparse atoms?
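The sparsity index of equation (1) is straightforward to compute. The following sketch (in Python with NumPy; ours, not the paper's code) illustrates its behavior, including the ξ ≈ 18 value quoted above for a Gaussian signal: for an i.i.d. Gaussian frame of length L = 512, the expected ratio is roughly √(2L/π) ≈ 18.

```python
import numpy as np

def sparsity_index(x):
    """l1/l2 sparsity index of eq. (1): the smaller, the sparser."""
    x = np.asarray(x, dtype=float)
    return np.linalg.norm(x, 1) / np.linalg.norm(x, 2)

spike = np.zeros(512); spike[0] = 1.0     # a single spike: maximally sparse
flat = np.ones(512)                       # perfectly flat: least sparse
gauss = np.random.default_rng(0).standard_normal(512)

print(sparsity_index(spike))  # 1.0
print(sparsity_index(flat))   # sqrt(512), about 22.6
print(sparsity_index(gauss))  # about 18, matching the Gaussian value above
```

A single spike attains the minimum value 1, a flat vector the maximum √L, and Gaussian frames sit near 18 for L = 512.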

The work in this paper aims at answering this question, and therefore we look for dictionary atoms that are sparse.

This is especially interesting, since we are not following the usual approach of looking for a sparse decomposition; rather, we derive a dictionary learning algorithm that learns sparse basis functions.

A major limitation of the method in [16] is the computational cost of the algorithm. Like other sparse coding methods, e.g. in [2] or [10], whose development has been hindered by their computational cost [19], the method requires a large data set and is based on a lengthy training stage with extremely high computational complexity. For this reason, a final motivating factor in deriving our algorithm is that it must be computationally fast and work on short data segments.

    B. Contributions

In this paper we consider the formulation of our algorithm from the point of view of minimizing the sparsity index on atoms. We seek the sparsity of the dictionary atoms rather than of the decomposition, and to the authors' knowledge this perspective has not been considered elsewhere. We show experimentally that our algorithm is "doubly sparse", yielding both sparse atoms and sparse decompositions.

Further, we propose a stopping rule that automatically selects only a subset of the atoms. This has the potential of making the algorithm even faster, and of aiding denoising applications by using a subset of the atoms within the signal reconstruction. Finally, we consider a dictionary visualization method that conveys more information than simply inspecting the waveform or its frequency domain representation.

    C. Organization of the paper

The structure of the paper is as follows: the problem that we seek to address is outlined in Section II, and our sparse adaptive dictionary algorithm is introduced in Section III, along with the stopping rule. Experimental results are presented in Section IV, including the investigation of the sparsity of the atoms and the speech representation, the application of dictionary visualization, and speech denoising. Conclusions are drawn in Section VII.

II. PROBLEM STATEMENT

Given a one-dimensional speech signal x(t), we divide this into overlapping frames x_k, each of length L samples, with an overlap of M samples. Hence, the k-th frame x_k is given by

x_k = [x((k − 1)(L − M) + 1), . . . , x(kL − (k − 1)M)]^T    (2)

where k ∈ {1, . . . , K}. Then, we construct a new matrix X ∈ ℝ^{L×K}, whose k-th column is the signal block x_k, and whose (l, k)-th element is given by

[X]_{l,k} = x(l + (k − 1)(L − M))    (3)

where l ∈ {1, . . . , L}, and K > L.
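As a sketch of this framing step (our Python/NumPy code, not the authors'), the matrix X of equation (3) can be built with a hop of L − M samples between consecutive frames:

```python
import numpy as np

def frame_signal(x, L, M):
    """Stack overlapping length-L frames of x (overlap of M samples) as
    the columns of X, so that X[l, k] = x(l + k*(L - M)) as in eq. (3)."""
    hop = L - M
    K = (len(x) - L) // hop + 1
    X = np.empty((L, K))
    for k in range(K):
        X[:, k] = x[k * hop : k * hop + L]
    return X

x = np.arange(20.0)
X = frame_signal(x, L=8, M=4)
print(X.shape)   # (8, 4): four frames of 8 samples, hop of 4
print(X[0])      # first sample of each frame: [0., 4., 8., 12.]
```

Note that for the dictionary learning problem the paper additionally requires K > L, i.e. more frames than samples per frame.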

The task is to learn a dictionary D consisting of L atoms ψ_l, that is D = {ψ_l}_{l=1}^{L}, providing a sparse representation for the signal blocks x_k. We seek a dictionary and a decomposition of x_k such that [20]

x_k = Σ_{l=1}^{L} α_{lk} ψ_l    (4)

where α_{lk} are the expansion coefficients, and

||α_k||_0 ≪ L.    (5)

The ℓ0-norm ||α_k||_0 counts the number of non-zero entries in the vector α_k, and therefore the expression in equation (5) defines the decomposition as "sparse" when the number of non-zero coefficients is small compared to L. In the remainder of this paper, we use the definition of sparsity in equation (1).
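For illustration (our example, not the paper's), the ℓ0 "norm" of equation (5) is simply a count of non-zero coefficients:

```python
import numpy as np

# ||alpha_k||_0 counts non-zero entries: here only 2 of the L = 5
# expansion coefficients are active, so the decomposition is sparse.
alpha_k = np.array([0.0, 1.3, 0.0, -0.2, 0.0])
print(np.count_nonzero(alpha_k))  # 2
```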

The dictionary is learned from the newly constructed matrix X. In the case of our algorithm, we begin with a matrix containing K columns, and we extract the first L columns according to the criterion discussed in the next section.


III. GREEDY ADAPTIVE DICTIONARY ALGORITHM (GAD)

To find a set of sparse dictionary atoms, we consider the sparsity index ξ, as defined in equation (1), for each column x_k of X, i.e.

ξ_k = ||x_k||_1 / ||x_k||_2.    (6)

Recall that the smaller the sparsity index, the sparser the vector x_k. Our aim is to sequentially extract new atoms from X to populate the dictionary matrix D, and we do this by finding, at each iteration, the column of X with minimum sparsity index

min_k ξ_k.    (7)

Practical implementation of the algorithm begins with the definition of a residual matrix R^l = [r_1^l, . . . , r_K^l], where r_k^l ∈ ℝ^L is a residual column vector corresponding to the k-th column of R^l. The residual matrix changes at each iteration l, and is initialized to X. The dictionary is then built by selecting the residual vector r_k^l that has the lowest sparsity index, as illustrated in Algorithm 1.

Algorithm 1 Greedy adaptive dictionary (GAD) algorithm
1. Initialize: l = 0, D^0 = [ ] {empty matrix}, R^0 = X
2. repeat
3.   Find the residual column with the lowest ℓ1- to ℓ2-norm ratio:
       k^l = argmin_{k ∉ I^l} { ||r_k^l||_1 / ||r_k^l||_2 }
4.   Set the l-th atom equal to the normalized r_{k^l}^l:
       ψ^l = r_{k^l}^l / ||r_{k^l}^l||_2
5.   Add to the dictionary:
       D^l = [D^{l−1} | ψ^l],  I^l = I^{l−1} ∪ {k^l}
6.   Compute the new residual for all columns k:
       r_k^{l+1} = r_k^l − ψ^l ⟨ψ^l, r_k^l⟩
7. until "termination" (see Section III-A)

We call our method the greedy adaptive dictionary (GAD) algorithm [21].
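The steps of Algorithm 1 can be sketched in a few lines of NumPy (our minimal re-implementation, not the authors' released code; the fixed atom budget `n_atoms` stands in for the termination rules of Section III-A):

```python
import numpy as np

def gad(X, n_atoms):
    """Greedy adaptive dictionary (GAD), following Algorithm 1.
    X is the L x K frame matrix; returns D (L x n_atoms, with
    orthonormal columns) and the indices of the selected columns."""
    R = np.array(X, dtype=float)              # step 1: R^0 = X
    atoms, chosen = [], []
    for _ in range(n_atoms):                  # steps 2-7
        # step 3: residual column with the lowest l1/l2 ratio
        l2 = np.linalg.norm(R, axis=0)
        ratios = np.linalg.norm(R, 1, axis=0) / np.maximum(l2, 1e-12)
        ratios[chosen] = np.inf               # k must not be in I^l
        k = int(np.argmin(ratios))
        # step 4: normalize the selected residual to obtain the atom
        psi = R[:, k] / np.linalg.norm(R[:, k])
        atoms.append(psi)                     # step 5: grow D and I
        chosen.append(k)
        # step 6: remove the atom's contribution from every residual
        R -= np.outer(psi, psi @ R)
    return np.column_stack(atoms), chosen
```

Because step 6 leaves every residual orthogonal to the new atom, and each later atom is a normalized residual, the learned dictionary has orthonormal columns (D^T D = I), consistent with the orthogonality property discussed next.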

Aside from the advantage of producing atoms that are directly relevant to the data, the GAD algorithm results in an orthogonal transform. To see this, consider re-writing the update equation in step 6 of Algorithm 1 as the projection of the current residual r_k^l onto the atom space:

r_k^{l+1} = P_{ψ̄^l} r_k^l = ( I − ψ^l ψ^{lT} / (ψ^{lT} ψ^l) ) r_k^l = r_k^l − ψ^l ⟨ψ^l, r_k^l⟩ / (ψ^{lT} ψ^l).    (8)

It follows from step 4 in Algorithm 1 that the denominator on the right-hand side of equation (8) is equal to 1, and therefore the equation corresponds to the residual update in step 6. Orthogonal dictionaries have the advantage of being easily invertible: if the matrix B is orthogonal, then BB^T = I, and evaluation of the inverse simply requires the evaluation of the matrix transpose.

We will see later in Section IV-C that although we wish to find a dictionary whose atoms are sparse, and make no assumptions on the signal decomposition, the GAD algorithm results in a reasonably sparse decomposition. Therefore, the algorithm has the effect of being doubly sparse on speech signals, by sparsifying both the atoms and the decomposition.

    A. Termination rules

We consider two possible termination rules:
1) The number of atoms l to be extracted is pre-determined, so that up to L atoms are learned. Then, the termination rule is:
   • Repeat from step 2 until l = N, where N ≤ L.
2) The reconstruction error at the current iteration, ǫ^l, is defined, and the rule is:
   • Repeat from step 2 until

   ǫ^l = ||x̂^l(t) − x(t)||_2 ≤ σ    (9)

   where x̂^l(t) is the approximation of the speech signal x(t), obtained at the l-th iteration from X̂^l = D^l D^{lT} X by reversing the framing process; D^l is the dictionary learned so far, as defined in step 5 of Algorithm 1.
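The error test of rule 2 can be sketched as follows (our code; for simplicity the frames are taken without overlap, M = 0, so reversing the framing is a reshape, and `sigma` is the threshold of equation (9)):

```python
import numpy as np

def reconstruction_error(x, D_l, L):
    """epsilon^l of eq. (9): l2 distance between the signal and its
    approximation from X_hat^l = D^l D^{lT} X, non-overlapping frames."""
    K = len(x) // L
    X = x[:K * L].reshape(K, L).T          # frames as columns (M = 0)
    X_hat = D_l @ D_l.T @ X                # project onto the learned atoms
    x_hat = X_hat.T.reshape(-1)            # reverse the framing process
    return np.linalg.norm(x_hat - x[:K * L])

# With a complete orthonormal dictionary the error is zero, so the rule
# epsilon^l <= sigma is guaranteed to fire by the time l = L.
x = np.random.default_rng(2).standard_normal(64)
print(reconstruction_error(x, np.eye(8), L=8))  # 0.0 (up to rounding)
```

In practice the test would be evaluated after each pass through steps 2-6, stopping as soon as the error drops below σ.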

IV. EXPERIMENTS

We compared the GAD method to PCA [23], SC-ICA [17] and K-SVD [13]. SC-ICA and K-SVD were chosen because they learn data-determined dictionaries and look for a sparse representation. PCA was chosen because it is a well-established technique, commonly used in speech coding, and it therefore sets the benchmark for the speech denoising application.

We used the four algorithms to learn 512 dictionary atoms from a segment of speech lasting 1.25 sec. A short data segment was used because this way the algorithm can be used within real-time speech processing applications. The data was taken from the female speech signal "supernova.wav" by "Corsica S", downloaded from The Freesound Project database [22], and downsampled to 16 kHz. We also used the male speech signal "Henry5.mp3" by "acclivity", downloaded from the same database, and downsampled to 16 kHz.

The K-SVD Matlab Toolbox [24] was used to implement the K-SVD algorithm. K-SVD requires the selection of several parameters. We set the number of iterations to 50, as recommended in [13], and the number of nonzero entries T_0 in the coefficient update stage to 10, which we found empirically to give a more accurate, although not as sparse, signal representation than T_0 = 3, as used in [13]. The dictionary size was set to 512 and the memory usage to "high".

The SC-ICA method was used as described in [17], using 50,000 iterations.

    A. Computational Complexity

In Table I we report the computational times of the algorithms when learning a dictionary from a speech segment of 1.25 sec, averaged over 100 trials. SC-ICA was also applied to the short data set, since we are not concerned with the output of the algorithm, only its computational load. Two versions of K-SVD were also compared: the original version, which is fully based on Matlab M-code, and the second version, which combines M-code with optimized MEX functions written in C. The experiments were conducted on a Quad-Core Intel Xeon Mac at 2.66 GHz, using Matlab


[Figure 2: example waveforms, each labeled with its sparsity index ξ, in six panels: (a) Original data blocks; (b) PCA atoms; (c) SC-ICA atoms learned from 1.25 sec speech; (d) SC-ICA atoms learned from 196 sec speech; (e) K-SVD atoms; (f) GAD atoms.]

Fig. 2. Examples of the frames of the original speech signals, and of the atoms learned with the PCA, SC-ICA, K-SVD and GAD algorithms. The SC-ICA algorithm is applied to a speech segment of length 1.25 sec, and to a longer one of length 196 sec.

Version 7.6.0.324 (R2008a), under the Mac OS X Version 10.5.8 operating system.

The results in the table show that SC-ICA requires over 22 hours on average to learn a dictionary of 512 atoms. The size of the training set does not affect the computational complexity of SC-ICA, because the algorithm randomly selects a 512 × 512 matrix of training data at each iteration. When a larger data set is used, the algorithm has more "information" from which to learn the dictionary, while when working with a short data segment, the training data is not varied enough for the atoms to be

learned to an adequate level [25]. GAD and K-SVD (v2) only need about 2 minutes, and PCA needs as little as 7 sec. However, note how the K-SVD version based exclusively on M-code requires around 1 hour and 45 minutes to learn the dictionary. Therefore, we expect that optimizing the code for GAD will lead to even faster computation.

    B. Learned Atoms

We begin by visually inspecting some examples of the atoms learned with the four algorithms, and then considering the


obtained with the other algorithms, we obtain a single value by averaging the atom sparsity index not only across the 100 trials but also across all 512 atoms (the average is only over the 512 atoms in the SC-ICA case).

The results are shown in the first column of Table II, and they validate our expectations: GAD yields atoms that are sparser than the original signal blocks, and than those of all the algorithms except SC-ICA. However, when we used the termination rule in equation (9) (shown in Table II as GAD-TR), with σ = 5 × 10^−3, the average sparsity index for the GAD atoms decreased from 16.2 to 12.6. On average, GAD-TR was found to learn fewer than 110 atoms, which from Figure 3 can be seen to correspond to those atoms that are sparsest. The algorithms perform in a similar way on the male speech.

Next, we seek to determine how sparse is the representation obtained with the GAD method. We do this by considering the transform coefficients obtained with all methods, for each block, and across the 100 speech segments taken from the speech signal, each lasting 1.25 sec. The sparsity index of the transform coefficients is found each time. We then average across the 100 segments, and across all blocks, to obtain a single figure for each method, and for both the female and male speech signals, as shown in the second column of Table II. This also includes values for the sparsity index for the original signal blocks in X. The lowest representation sparsity index value is obtained with K-SVD, thanks to the strong sparsity constraint imposed by the algorithm on the signal decomposition. This entails limiting the number of non-zero elements in the signal representation to a small number (we use T_0 = 10). The signal transformed with the GAD algorithm is sparser than in the time domain, and than the coefficients obtained with SC-ICA and PCA when all atoms are used in the signal reconstruction, for both signals. Moreover, the representation becomes even sparser when GAD is used with the termination rule.

Thus, although we set out to find a dictionary whose atoms are sparse, and make no assumptions on the signal representation, we end up with a sparse decomposition. Hence, the algorithm has the effect of being doubly sparse on speech, by sparsifying both the atoms and the decomposition.

    D. Representation Accuracy

The accuracy of the signal approximation given by each algorithm can be assessed with the reconstruction error ǫ, as defined in equation (9), after the dictionary has been learned:

ǫ = ||x̂(t) − x(t)||_2    (10)

where x̂(t) is the signal approximation obtained from X̂ = D^l D^{l†} X, and D^{l†} is the right pseudo-inverse of D^l. This is plotted in Figure 4 for each algorithm, as the number of atoms omitted in the signal reconstruction goes from 0 to 462 (or, the total number of atoms used goes from 512 down to 50). K-SVD has a non-zero reconstruction error

[Figure 4: reconstruction error ǫ (×10^−3, 1 to 10) against the number of omitted atoms (0 to 400+), with curves for SC-ICA, K-SVD, GAD and PCA.]

Fig. 4. Reconstruction error for the GAD, PCA, SC-ICA and K-SVD algorithms, averaged over 100 trials.

[Figure 5: spectra (f/kHz, 0 to 8) against atom number (0 to 400+), in five panels: (a) Original signal; (b) PCA atoms; (c) SC-ICA atoms; (d) K-SVD atoms; (e) GAD atoms.]

Fig. 5. Plots of the spectra of the original signal, and of the atoms learned with PCA, SC-ICA, K-SVD and GAD. In each plot, the spectrum is ordered according to the centre frequency. A color version of this figure can be seen online at http://www.elec.qmul.ac.uk/digitalmusic/papers/2010/Jafari10gadspeech/.

even when all atoms are included in the signal approximation, because the transform is not complete, and therefore it does not result in an exact reconstruction.

In general, the results show that all algorithms perform quite well when few atoms are omitted in the reconstruction. As more and more atoms are omitted, the reconstruction error


increases. PCA performs best, because the transform arranges the signal components so that most energy is concentrated in a small number of components, corresponding to those extracted earlier. The GAD transform also gives good signal approximations as more atoms are excluded from the reconstruction, although its performance seems to worsen as the number of omitted atoms grows beyond 300 (or fewer than 200 atoms are used in the reconstruction). This corresponds to the number, identified in Figure 3, below which the GAD atoms are sparsest, and above which the sparsity index reaches its maximum value. SC-ICA and K-SVD yield signal approximations that suffer most from the reduction in the number of atoms, with SC-ICA performance suddenly worsening as the number of excluded atoms goes from about 300 to over 400.

The behavior of GAD suggests that it might be suitable for denoising applications. Hence, we will consider this problem later in this paper.

    E. Dictionary Visualization

The frequency spectra of the original signal and of the learned dictionaries are plotted in Figure 5, where they are ordered according to their centre frequency. The plots show that the original signal has little energy above 4 kHz, while the SC-ICA atoms seem to have energy that is more or less equal across all frequencies and across time. The energy of the PCA and GAD atoms is also more spread across the time-frequency plane, while K-SVD yields atoms whose energy is focused in the lower frequencies. These plots convey little more information, and at times they are hard to interpret, also because the printed plot is not as clear as the one on the computer screen.

For an alternative visualization, we apply a method for visualizing the dictionary that was first proposed in [17], and subsequently used in [26]. It entails plotting an axis-aligned ellipse for each atom that corresponds to its position and spread in time and frequency.

To do this, each squared atom (ψ^l)^2 and its squared Fourier transform (Ψ^l)^2 are treated as a Laplacian probability distribution:

f(g | µ, b) = (1/(2b)) e^{−|g − µ|/b}.    (11)

The median µ_{1/2}(g) and the absolute deviation from that median,

d_i = (1/L) Σ_{i=1}^{L} |g_i − µ_{1/2}(g)|,    (12)

are then evaluated from (ψ^l)^2 and (Ψ^l)^2. For the time-domain waveform, the median µ_{1/2}(g) and deviation d_i correspond, respectively, to the location of the atom and its spread on the time axis. For the frequency-domain case, the two quantities, µ_{1/2}(g) and d_i, represent the centre frequency of the atom and its bandwidth. Then, for each basis vector, an ellipse is drawn on the time-frequency plane [17], centered at (µ^t_{1/2}(g), µ^f_{1/2}(g)) and with radii d^t_i and d^f_i. The energy

[Figure 6: ellipse plots on the time-frequency plane (time/ms, 0 to 30; f/kHz, 0 to 10), in panels: (a) Framed Signal; (b) PCA Atoms; (c) SC-ICA Atoms; (d) K-SVD Atoms; (e) GAD Atoms.]

Fig. 6. Visualization of the (a) original female speech signal, framed with no overlap, (b) overlapped frames from the original signal, and atoms obtained with (c) PCA; (d) SC-ICA; (e) K-SVD; (f) GAD. Each ellipse drawn represents the position and spread of the atom (or signal block) in time and frequency.

of each atom relative to the one with the highest energy is encoded by the grey scale of the elliptical plots.
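The ellipse parameters can be sketched as follows (our reading of equations (11)-(12): each squared atom, normalized, is treated as a discrete distribution over sample or frequency index, with location taken as its median and spread as the mean absolute deviation about it; the exact conventions in [17] may differ):

```python
import numpy as np

def laplace_fit(p):
    """Location (median) and spread (mean absolute deviation about the
    median, in the spirit of eq. (12)) of a non-negative sequence p
    treated as a distribution over its index."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    idx = np.arange(len(p))
    loc = idx[np.searchsorted(np.cumsum(p), 0.5)]   # distribution median
    spread = np.sum(p * np.abs(idx - loc))          # deviation about it
    return loc, spread

def atom_ellipse(psi, fs):
    """Centre and radii of the time-frequency ellipse for atom psi."""
    t_loc, t_spread = laplace_fit(psi ** 2)          # (psi^l)^2: time axis
    spec = np.abs(np.fft.rfft(psi)) ** 2             # (Psi^l)^2: freq axis
    f_loc, f_spread = laplace_fit(spec)
    df = fs / len(psi)                               # DFT bin spacing, Hz
    return (t_loc / fs, f_loc * df), (t_spread / fs, f_spread * df)
```

As a sanity check, a pure 2 kHz tone sampled at 16 kHz yields a centre frequency very close to 2 kHz with near-zero bandwidth, i.e. a thin horizontal ellipse.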

The results are shown in Figure 6. The plot in Figure 6(a) shows the results from the original signal divided into overlapping data frames. The ellipses for the PCA atoms (Figure 6(b)) are concentrated in the centre of the time scale, while the SC-ICA atoms (Figure 6(c)) evenly tile the time-frequency plane, and are well localized, except for the low frequencies. This is consistent with the wavelet-like basis functions learned by SC-ICA from speech (see [17] for more details). Visualization of the plot for the K-SVD atoms (Figure 6(d)) shows that this is the only algorithm that yields atoms whose elliptical plots are similar to the original signal.

Several of the elliptical plots for the GAD atoms (Figure 6(e)) are focused in the middle region, although less so than the PCA atoms. However, unlike PCA, the GAD atoms cover more of the time-frequency plane, and ellipses are clearly present in the high and low frequencies, similarly to the original signal.

We can further investigate the behavior of the GAD algorithm by considering the atoms it learned according to when they were extracted in the learning process. In Figure 7, the ellipses for the GAD atoms are plotted for the atoms extracted early on (atoms 1 to 150), and later on in the learning process (atoms 151 to 512). These regions were


[Figure 7: ellipse plots (time/ms, 0 to 35; f/kHz, 0 to 10) in two panels: (a) GAD atoms 1 to 150; (b) GAD atoms 151 to 512.]

Fig. 7. Elliptical plots of the GAD atoms from (a) 1 to 150; (b) 151 to 512, for the female speech signal. These values were empirically selected by looking at the average sparsity index in Figure 3, so that they correspond to the "early" and "late" atoms according to when they were extracted in the learning process.

selected by inspecting the average sparsity index in Figure 3, and they are expected to correspond to those atoms that are sparsest (atoms 1 to 150), followed by the atoms that correspond mostly to "noise"-like features (atoms 151 to 512). As well as being the sparsest, the early atoms are more similar to the speech signal, because the algorithm will not have had time to make a big impact on the training data. The plots in Figure 7(a) show that the early atoms indeed have time-frequency characteristics that are similar to the speech signal. The plots in Figure 7(b) are the result of the learning process.

Figures 8 and 9 show that similar results are obtained for the male speech signal. Here, the distinction between the first 150 atoms that are extracted and those learned subsequently is even more accentuated. Figure 9 shows that all elliptical plots above 4 kHz are related to atoms extracted later in the process, while the initial 150 atoms result in a plot that is similar to the original signal.

V. APPLICATION TO SPEECH DENOISING

The term denoising refers to the removal of noise from a signal; it was first used to describe the application of methods based on the wavelet transform for noise reduction [27]. Sparse transforms have been found to be among the most successful methods for denoising [13], and dictionary learning methods have been used for this application [28].

Table III shows the tolerance of the PCA, SC-ICA, K-SVD and GAD algorithms to a noise level changing from 20 dB to −10 dB, as the number of atoms in the reconstruction is reduced from 512 to 50. This is evaluated with the improvement in signal-to-noise ratio (ISNR):

ISNR = 10 log_{10} ( E{(x(t) − x_n(t))^2} / E{(x(t) − x̂(t))^2} )    (13)

[Figure 8: ellipse plots (time/ms, 0 to 30; f/kHz, 0 to 10) in panels: (a) Original Signal; (b) PCA Atoms; (c) SC-ICA Atoms; (d) K-SVD Atoms; (e) GAD Atoms.]

Fig. 8. Visualization for the male speech signal. The plots correspond to: (a) original speech signal, framed with no overlap, (b) overlapped frames from the original signal, and atoms obtained with (c) PCA; (d) SC-ICA; (e) K-SVD; (f) GAD. Each ellipse drawn represents the position and spread of the atom (or signal block) in time and frequency.

[Figure 9: ellipse plots (time/ms, 0 to 35; f/kHz, 0 to 10) in two panels: (a) GAD atoms 1 to 150; (b) GAD atoms 151 to 512.]

Fig. 9. Visualizing the GAD atoms from (a) 1 to 150; (b) 151 to 512, for the male speech signal. These values correspond to the "early" and "late" atoms according to when they were extracted in the learning process.

where x(t) is the original signal, x_n(t) is the observed distorted (noisy) signal, and x̂(t) is the source approximated by the transform. As the signal approximation becomes closer to the original source, the ISNR increases.
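ISNR per equation (13) is easily computed; a small sketch (ours, not from the paper), where using the noisy signal itself as the "approximation" gives the 0 dB baseline:

```python
import numpy as np

def isnr(x, x_noisy, x_hat):
    """Improvement in SNR (eq. 13), in dB: ratio of the noise energy to
    the residual error energy of the approximation x_hat."""
    num = np.mean((x - x_noisy) ** 2)
    den = np.mean((x - x_hat) ** 2)
    return 10 * np.log10(num / den)

rng = np.random.default_rng(3)
x = rng.standard_normal(1000)
x_noisy = x + 0.1 * rng.standard_normal(1000)

print(isnr(x, x_noisy, x_noisy))                   # 0.0 dB: no denoising
print(isnr(x, x_noisy, x + 0.5 * (x_noisy - x)))   # about 6 dB: noise halved
```

Halving the residual noise amplitude quarters its energy, giving 10 log10(4) ≈ 6 dB of improvement.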

When all atoms are used in the reconstruction, the complete transforms PCA, SC-ICA and GAD yield an ISNR of 0 dB, while K-SVD gives a non-zero ISNR, since the approximation


[16] M. G. Jafari, E. Vincent, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "An adaptive stereo basis method for convolutive blind audio source separation," Neurocomputing, vol. 71, pp. 2087–2097, 2008.

[17] S. A. Abdallah and M. D. Plumbley, "If edges are the independent components of natural images, what are the independent components of natural sounds?" in Proc. of the International Conference on Independent Component Analysis and Blind Signal Separation (ICA), 2001, pp. 534–539.

[18] V. Tan and C. Févotte, "A study of the effect of source sparsity for various transforms on blind audio source separation performance," in Proc. of the Workshop on Signal Processing with Adaptative Sparse Structured Representations (SPARS), 2005.

[19] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds. Cambridge, MA: MIT Press, 2007, pp. 801–808.

[20] M. Goodwin and M. Vetterli, "Matching pursuit and atomic signal models based on recursive filter banks," IEEE Trans. on Signal Processing, vol. 47, pp. 1890–1902, 1999.

[21] M. Jafari and M. Plumbley, "An adaptive orthogonal sparsifying transform for speech signals," in Proc. of the International Symposium on Communications, Control and Signal Processing (ISCCSP), 2008, pp. 786–790.

[22] "The Freesound Project. A collaborative database of Creative Commons licensed sounds." [Online]. Available: www.freesound.org

[23] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall, 1999.

[24] R. Rubinstein. [Online]. Available: www.cs.technion.ac.il/~ronrubin/software.html

[25] S. A. Abdallah and M. D. Plumbley, "Application of geometric dependency analysis to the separation of convolved mixtures," in Proc. of the International Conference on Independent Component Analysis and Blind Signal Separation (ICA), 2004, pp. 22–24.

[26] E. C. Smith and M. S. Lewicki, "Efficient auditory coding," Nature, vol. 439, pp. 978–982, 2006.

[27] D. L. Donoho, "De-noising by soft thresholding," IEEE Trans. on Information Theory, vol. 41, pp. 613–627, 1995.

[28] S. Valiollahzadeh, H. Firouzi, M. Babaie-Zadeh, and C. Jutten, "Image denoising using sparse representations," in Proc. of the International Conference on Independent Component Analysis and Signal Separation (ICA), 2009, pp. 557–564.