
  • Data-Adaptive Source Separation for Audio Spatialization

    Submitted in partial fulfilment of the requirements for the degree of

    Master of Technology

    (Electronic Systems)

    by

    Pradeep Gaddipati

    08307029

    Under the guidance of

    Prof. Preeti Rao

    and

    Prof. V. Rajbabu

    Department of Electrical Engineering

    INDIAN INSTITUTE OF TECHNOLOGY BOMBAY

    June 2010

  • Dedication

    I dedicate this thesis to my family. Without their patience, understanding,

    support and most of all love, the completion of this work would not have been

    possible.

  • ii

    Dissertation Approval for Master of Technology

    This dissertation entitled Data-adaptive source separation for audio

    spatialization by Pradeep Gaddipati (Roll no. 08307029) is approved for the

    degree of Master of Technology in Electrical Engineering.

    Prof. Preeti Rao _______________________ (Supervisor)

    Prof. V. Rajbabu _______________________ (Co-supervisor)

Dr. Samudravijaya K. _______________________ (External Examiner)

    Prof. Prem C. Pandey _______________________ (Internal Examiner)

    Prof. K. P. Karunakaran _______________________ (Chairman)

    June 17th, 2010

  • iii

    Acknowledgments

    I express my sincere gratitude towards Prof. Preeti Rao and Prof. V. Rajbabu for the guidance

    and support they gave me during this project. The regular discussions with them on every

    aspect of the research work helped me refine my approach towards the problem and motivated

    me to give my best. Working with them in the field of audio signal processing was a very

    pleasant learning experience and my interest in the subject has considerably grown.

    I am extremely thankful to Nokia, India and specifically, Dr. Pushkar Patwardhan for

    providing me with the opportunity of pursuing research in such a remarkable domain. I would

    like to thank them for providing the financial support and technical inputs for the work.

    I would like to thank Vishweshwara Rao for his valuable suggestions and help during various

    stages of my project. I thank all the members of the Digital Audio Processing lab, Department

    of Electrical Engineering, IIT Bombay for providing a friendly and enjoyable working

    environment.

    I would also like to thank my family for their love and moral support. Finally, I thank all the

    people who have contributed ideas, concepts and corrections to be incorporated in my project.

The mistakes, if any, in the final draft are all my own.

    Pradeep Gaddipati

  • iv

    Abstract

The existing surround audio needs to be spatialized to obtain a signal which can generate the effect of auditory immersion over headphones. This spatialization process comprises two stages viz. separating the individual sources from the available mixtures and then combining them to re-create audio compatible with the desired output configuration (in the case of headphones, the individual sources are convolved with the HRIRs for localization and then mixed together to form the final output audio). The source separation technique itself involves four stages viz. transformation of the mixtures into a sparse time-frequency representation, estimation of the mixing parameters (i.e. the direction and location of the sources), estimation of the sources in the time-frequency domain and finally inversion back into the time domain using an appropriate inverse time-frequency transformation technique.

Various sparsity-based source separation techniques, namely the degenerate un-mixing estimation technique (DUET), lq-basis pursuit (LQBP) and delay and scale subtraction scoring (DASSS), have been explored for the purpose of estimating the mixing parameters and the individual sources from the mixtures. However, their performance is directly coupled to two factors viz. the sparsity of the time-frequency representation and the W-disjoint orthogonality of the underlying sources in the time-frequency representation of the mixtures.

This thesis endeavours to find a time-frequency representation which is sparser and can provide a higher degree of W-disjoint orthogonality amongst the underlying sources in the mixtures than the time-frequency representation obtained using the short-time Fourier transform (STFT). With this objective, a time-varying data-adaptive time-frequency representation was developed and its performance in terms of the aforementioned parameters was compared to that of the fixed-window STFT. The data-adaptive time-frequency representation leads to better estimation of the mixing parameters, which translates into better separation of the sources from the stereo mixtures. This enables the sources to be spatialized in the auditory space with fewer audible artifacts, as has been observed.

  • v

    Table of Contents

    Dedication ................................................................................................................................... i

    Dissertation Approval for Master of Technology .................................................................. ii

    Acknowledgments .................................................................................................................... iii

    Abstract .................................................................................................................................... iv

    Table of Contents ...................................................................................................................... v

    List of Figures ........................................................................................................................ viii

    List of Tables ............................................................................................................................. x

    List of Abbreviations ............................................................................................................... xi

    Declaration of Academic Honesty and Integrity ................................................................ xiii

    Chapter 1. Introduction ........................................................................................................ 1

    Chapter 2. Spatial Audio ....................................................................................................... 4

    2.1. Sound localization ........................................................................................................ 4

    2.1.1. Binaural cues ........................................................................................................ 5

    2.1.2. Monaural spectral cue .......................................................................................... 5

    2.1.3. Rotation of the human head .................................................................................. 5

    2.1.4. Head related impulse response ............................................................................ 6

    2.2. Surround sound generation .......................................................................................... 7

    2.3. Panning laws ................................................................................................................ 8

    Chapter 3. Audio Spatialization ......................................................................................... 11

    3.1. Stages of audio spatialization .................................................................................... 11

    3.1.1. Analysis – source separation .............................................................................. 12

    3.1.2. Re-synthesis – convolution with HRIRs .............................................................. 12

    Chapter 4. Sparsity-based Source Separation .................................................................. 15

    4.1. Classification of source separation algorithms .......................................................... 15

    4.1.1. Based on mixing parameters considered in the mixing model ........................... 15

  • vi

    4.1.2. Based on number of mixtures and sources in the mixing model ........................ 16

    4.2. Source separation algorithms: A review .................................................................... 16

    4.3. Mixing models ........................................................................................................... 17

    4.4. Sparsity-based source separation ............................................................................... 18

    4.5. Stages of sparsity-based source separation ................................................................ 19

    4.6. Source assumptions .................................................................................................... 19

    1.1.1. Local stationarity ................................................................................................ 19

    4.6.1. Microphone spacing ........................................................................................... 20

    4.6.2. W-disjoint orthogonality ..................................................................................... 20

    4.7. Mixing parameter estimation technique .................................................................... 20

    4.8. Source estimation techniques ..................................................................................... 21

    4.8.1. Degenerate unmixing estimation technique (DUET) ......................................... 22

    4.8.2. Lq-basis pursuit (LQBP) ..................................................................................... 23

    4.8.3. Delay and scale subtraction scoring (DASSS) ................................................... 24

    Chapter 5. Adaptive Time-Frequency Representation .................................................... 27

    5.1. Short-time Fourier transform ..................................................................................... 27

    5.2. Need for data-adaptive time-frequency representations ............................................ 28

    5.3. Data-adaptive time-frequency representations .......................................................... 29

    5.3.1. Steps to obtain a data-adaptive time-frequency representation of a signal ....... 31

    5.4. Invertibility of time-frequency representations ......................................................... 33

    5.4.1. Frame-based transition-window re-construction technique .............................. 34

    5.4.2. Modified (extended) window re-construction technique .................................... 34

    5.4.3. Segment-based transition-window re-construction technique ........................... 36

    Chapter 6. Concentration Measure .................................................................................... 37

    6.1. W-disjoint orthogonality ............................................................................................ 37

    6.2. Sparsity ...................................................................................................................... 39

    6.2.1. Characteristics of sparsity measures .................................................................. 39

  • vii

    6.2.2. Sparsity measures ............................................................................................... 41

    6.3. Relation between sparsity measures and WDO measure ........................................... 42

    6.3.1. Steps for obtaining the W-disjoint orthogonality measure for a set of signals .. 44

    6.3.2. Steps for obtaining the sparsity measure for a set of signals ............................. 45

    Chapter 7. Experiments and Results ................................................................................. 48

    7.1. Datasets ...................................................................................................................... 48

    7.1.1. BSS Oracle database .......................................................................................... 48

    7.1.2. TIMIT speech database ...................................................................................... 48

    7.2. Performance evaluation measures.............................................................................. 48

    7.3. Performance evaluation ............................................................................................. 49

    7.3.1. Setup for performance evaluation test ................................................................ 50

    7.3.2. Mixing parameters estimation stage................................................................... 51

    7.3.3. Source estimation stage ...................................................................................... 52

    Chapter 8. Conclusions and Future Work ........................................................................ 54

    8.1. Conclusions ................................................................................................................ 54

    8.2. Future work ................................................................................................................ 55

Appendix A. Sinusoid Detection using Data-Adaptive Time-Frequency Representation .............. 56

    A.1. Sinusoid detection ...................................................................................................... 56

    A.2. Data-adaptive time-frequency representation for sinusoid detection ........................ 57

    A.3. Performance of data-adaptive time-frequency representation ................................... 58

    A.3.a. Sinusoid signals .................................................................................................. 59

    A.3.b. Chirp signals ...................................................................................................... 60

    A.3.c. Frequency modulated signals ............................................................................. 61

    A.3.d. Mixture of sinusoids and frequency modulated signals...................................... 62

    A.3.e. Music/speech signals (real signals) .................................................................... 64

    References ............................................................................................................................... 66

  • viii

    List of Figures

    Figure 2.1 Binaural cues – interaural time difference (ITD) ...................................................... 6

    Figure 2.2 Binaural cues – interaural level difference (ILD) ..................................................... 6

    Figure 2.3 Cone of confusion ..................................................................................................... 6

    Figure 2.4 Rotation of human head ............................................................................................ 6

    Figure 2.5 Monaural spectral cues .............................................................................................. 6

    Figure 2.6 Reproduction of two-channel stereo ......................................................................... 8

    Figure 3.1 Audio spatialization block diagram ........................................................................ 12

    Figure 3.2 Time-domain virtualization based on HRIRs ......................................................... 13

    Figure 4.1 Mixing models - anechoic mixing........................................................................... 18

    Figure 4.2 Mixing model - echoic mixing ................................................................................ 18

    Figure 4.3 Block diagram of sparsity-based source separation ................................................ 19

    Figure 5.1: Data-adaptive time-frequency representation of a singing voice using frame-based

    adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms; hop

    size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz) ............... 32

    Figure 5.2: Data-adaptive time-frequency representation of a singing voice using segment-

    based adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms;

    hop size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz) ........ 33

    Figure 5.3: Frame-based transition-window reconstruction technique .................................... 34

    Figure 5.4: Modified (extended) window re-construction technique ....................................... 35

    Figure 6.1: Disjoint orthogonal for time-frequency representations of speech source mixing as

    a function of window size used in the time-frequency transformation. ................................... 43

    Figure 6.2: The kurtosis (left) and Gini Index (right) sparsity measures applied to speech

    signals in the time-frequency domain as a function of window size. ....................................... 43

    Figure 6.3: WDO vs. window size ........................................................................................... 46

    Figure 6.4: Sparsity measure (kurtosis) vs. window size ......................................................... 47

    Figure 6.5: Sparsity measure (Gini Index) vs. window size ..................................................... 47

    Figure A.1 Time-frequency representation of sinusoid signals ................................................ 59

    Figure A.2 True hits vs. false alarms plot for the sinusoid signals .......................................... 60

    Figure A.3 Time-frequency representation of chirp signal ...................................................... 61

    Figure A.4 True hits vs. false alarms plot for chirp signals ..................................................... 61

    Figure A.5 Time-frequency representation of frequency modulated signals ........................... 62

  • ix

    Figure A.6 True hits vs. false alarms plot for the frequency modulated signals ...................... 62

    Figure A.7 Time-frequency representation of mixture of sinusoid signals and frequency

    modulated signal (signal energy, frequency modulated to sinusoid signal = 7 dB) ................. 63

    Figure A.8 Time-frequency representation of mixture of sinusoid signals and frequency

    modulated signal (signal energy, frequency modulated to sinusoid signal = -3 dB) ............... 64

Figure A.9 True hits vs. false alarms for mixture of sinusoid signals and frequency

    modulated signal ....................................................................................................................... 64

    Figure A.10: Data-adaptive time-frequency representation of a singing voice signal ............. 65

  • x

    List of Tables

    Table 6-A: Validation table showing the characteristics satisfied by the sparsity measures

    (kurtosis/Gini Index) ................................................................................................................ 42

    Table 6-B: Counter-examples for testing a sparsity measure whether it satisfies a particular

    property with the desired outcome if the sparsity measure satisfies the property. S(x) denotes

    sparsity measure of x ................................................................................................................ 42

    Table 7-A: Performance of the mixing parameter estimation stage on BSS oracle (music)

    dataset ....................................................................................................................................... 52

    Table 7-B: Performance of the mixing parameter estimation stage on BSS oracle (speech)

    dataset ....................................................................................................................................... 52

    Table 7-C: Performance of the source estimation stage (in time-frequency domain) using

    DUET and LQBP algorithms on BSS oracle dataset ............................................................... 53

    Table A-A: True hits percentage of sinusoid detection for singing voice for different

    frequency bands ........................................................................................................................ 65

  • xi

    List of Abbreviations

    Abbreviation Meaning

    ATFR Adaptive Time-Frequency Representation

    BSS Blind Source Separation

    CASA Computational Auditory Scene Analysis

    CIPIC Centre for Image Processing and Integrated Computing

COLA Constant Overlap-Add

    DASSS Delay And Scale Subtraction Scoring

    DFT Discrete Fourier Transform

    DUET Degenerate Unmixing Estimation Technique

    DVD Digital Video Disc

    ECG Electrocardiography

    HRIR Head Related Impulse Response

    HRTF Head Related Transfer Function

    ICA Independent Component Analysis

    ICLD Inter-Channel Level Difference

    IEEE Institute of Electrical and Electronics Engineers

    ILD Interaural Level Difference

    ITD Interaural Time Difference

    KEMAR Knowles Electronics Manikin for Acoustic Research

    LQBP Lq-Basis Pursuit

    MIDI Musical Instrument Digital Interface

OLA Overlap-Add

PCA Principal Component Analysis

    PSR Preserved Signal Ratio

    SAR Source to Artifacts Ratio

    SD Sparse Decomposition

    SDR Source to Distortion Ratio

    SIR Source to Interference Ratio

  • xii

    Abbreviation Meaning

    SNR Signal to Noise Ratio

    SRS Sound Retrieval System

    STFT Short-Time Fourier Transform

    TIMIT Texas Instruments Massachusetts Institute of Technology

    WDO W-Disjoint Orthogonality

  • xiii

    Declaration of Academic Honesty and Integrity

    I declare that this written submission represents my ideas in my own words and where others'

    ideas or words have been included, I have adequately cited and referenced the original

    sources. I also declare that I have adhered to all principles of academic honesty and integrity

    and have not misrepresented or fabricated or falsified any idea/data/fact/source in my

    submission. I understand that any violation of the above will be cause for disciplinary action

    by the Institute and can also evoke penal action from the sources which have thus not been

    properly cited or from whom proper permission has not been taken when needed.

    Pradeep Gaddipati

    08307029

    June 17th, 2010

  • Chapter 1. Introduction

    With the proliferation of portable media devices, headphone listening has become

    increasingly common; in both mobile and non-mobile listening scenarios, providing a high-

    fidelity listening experience over headphones is thus a key value-add (or arguably even a

    necessary feature) for modern consumer electronic products. This enhanced headphone

reproduction is relevant both for stereo content, such as legacy music recordings, and for multichannel music and movie soundtracks. The audio, when properly generated, can be used to render a realistic auditory experience with auditory immersion. The audio signal which is capable of this is known as spatial audio.

    Spatial audio refers to the rendering of the realistic auditory experience with auditory

    immersion. Surround sound, an outcome of the extensive research on spatial audio, refers to

    the use of multiple loudspeakers to envelop a person watching a movie or listening to music,

    making them feel as if they are in the middle of the action or the concert [1]. The surround

    sound tracks enable the audience to hear sounds coming from all around them, contributing to

    the sensation of what movie-makers call suspended disbelief. Such a technique is only

    applicable in the case when the playback devices are placed at a considerable distance from

    the listener. The same audio signals are not as effective when headphones are used for

    listening.

Headphone reproduction simply consists of presenting a left-channel signal to the

    listener’s left ear and likewise a right-channel signal to the right ear. In such headphone

    systems, stereo music recordings can obviously be directly rendered by routing the respective

    channel signals to the headphone transducers. However, such rendering, which is the default

    practice in consumer devices, leads to an in-the-head listening experience, which is counter-

    productive to the goal of spatial immersion: sources panned between the left and right

    channels are perceived to be originating from a point between the listener’s ears [2]. For audio

    content intended for multichannel surround playback (perhaps most notably movie

    soundtracks), typically with a front centre channel and multiple surround channels in addition

    to the front left and right, direct headphone rendering calls for a down-mix of these additional

    channels; in-the-head localization again occurs as for stereo content, and furthermore the

    surround spatial image is compromised by elimination of front/back discrimination cues.

Hence such surround audio needs to be spatialized to obtain a signal that can generate the effect of auditory immersion over headphones. However, re-recording of the existing audio in

    the new format is an infeasible task. One of the possible solutions to such a problem would be

    audio spatialization where the existing spatial audio is processed to obtain surround sound

    that creates an auditory immersion over headphones.

    Given a multi-channel audio mixture as input in any available format, audio spatialization is

    the process of realistic spatial rendering of audio in the desired listening configuration (e.g.

    over headphones). One approach to this problem involves separating the individual sources

    from the multi-channel audio mixture, and then re-creating the desired listener-end mixtures

    by suitable recombination of the individual spatialized sources. The success of this approach

    hinges on achieving the proper separation of sources from the input multi-channel mixtures.

    Various source separation algorithms [3] have been developed based on the different source

    models and mixing models.

There exist several successful techniques for blind source separation such as independent component analysis (ICA) and sparse decomposition. The sparsity-based techniques require the sources to be sparse and disjoint-orthogonal in some time-frequency representation; they exploit the sparsity of music/speech signals in the short-time Fourier transform (STFT) domain to construct binary time-frequency masks, which are then used to extract several sources from only two mixtures. It is expected that the performance of the source separation process can be improved by obtaining a sparser time-frequency representation. The STFT performs well in terms of concentration and resolution of a given signal component when a properly chosen window is used. But the proper window function depends on the data, and no automated procedure currently exists for determining a good window. For signals like music and speech, which are composed of several different components at different time instants, the best window differs from one time instant to the next. The fact that different windows are appropriate for different time instants thus suggests the use of a data-dependent, time-varying time-frequency representation [4].

Chapter 2 describes the various aspects of spatial audio. Chapter 3 discusses the various stages involved in the audio spatialization process and presents techniques for re-synthesis of the surround sound for headphones. Chapter 4 provides a brief review of the various source separation algorithms, discusses the source models and mixing models considered for solving the blind source separation problem, presents the generalized staged procedure for sparsity-based source separation, and describes three sparsity-based source separation techniques viz. the degenerate unmixing estimation technique (DUET), lq-basis pursuit (LQBP) and delay and scale subtraction scoring (DASSS) in detail. Chapter 5 discusses the time-frequency representations that are used in source separation algorithms, the need for data-adaptive time-frequency representations, and the details of the adaptive time-frequency representation used in this work. Chapter 6 investigates the various concentration measures that can be used for the adaptation in the case of adaptive time-frequency representations. Experiments to evaluate the performance of the different source separation techniques discussed in Chapter 4 and the various time-frequency representations discussed in Chapter 5 are described in Chapter 7. In Chapter 8, the conclusions and the future work are presented. Finally, in the appendix, a detailed discussion of the role of the adaptive time-frequency representation in the sinusoid detection problem is presented.


    Chapter 2. Spatial Audio

    Everyday life is full of three-dimensional sound experiences. Humans have the capability to

    localize these sound sources even in noisy and reverberant environments. This ability of

    humans to make sense of their environments and to interact with them depends strongly on

spatial awareness, and hearing plays a major part in this process. The human auditory system identifies various cues in the sounds heard at the two ears which indicate the spatial

    locations of the sources in the three-dimensional space around the listener. The mechanisms

    of sound source localization involve the detection of timing or phase difference between the

    ears and of amplitude or spectral difference between the ears. The majority of spatial

    perception is dependent on the listener having two ears, although certain monaural cues have

    been shown to exist – in other words it is mainly the differences in signals received by the two

    ears that matter.

    2.1. Sound localization

    We listen to speech (as well as other sounds) with two ears, and it is quite remarkable how

well we can separate and selectively attend to individual sound sources in cluttered acoustical environments. This ability of the listener to determine the location of origin of the sound is termed sound localization. In fact, the familiar term cocktail

    party processing was coined in an early study of how the binaural system enables us to

    selectively attend to individual conversations when many are present, as in, of course, a

    cocktail party. This phenomenon illustrates the important contribution that binaural hearing

    makes to auditory scene analysis, by enabling us to localize and separate sound sources. In

    addition the binaural system plays a major role in improving speech intelligibility in noisy and

    reverberant environments.

    Humans can deduce the various parameters of the location of the source viz. azimuth,

    elevation, distance and spaciousness of the auditory environment from the sounds heard. This

    is on the basis of the different cues introduced into the sound by the pinna, proximate parts of

    the human body and the surrounding acoustic environment as it travels from the source to the

    eardrum of the listener. Thereafter, the cues are processed by the human brain for determining

    the acoustic characteristics of the source and the auditory environment. In general, a potential

    acoustical localization cue is any physical aspect of the acoustical waveform reaching a


    listener’s ears that is altered by a change in the position of the sound source relative to that of

    the listener. The most important cues [5] used by humans are discussed below.

    2.1.1. Binaural cues

    Binaural localization relies on the comparison of auditory input from two separate detectors;

most evolved auditory systems feature two ears, one on each side of the head.

    • Interaural time difference (ITD): This cue arises because of the difference in the

    distances between the source and the two ears as seen in Figure 2.1. The resulting

    phase shift is used for localization of frequencies below 1.5 kHz. This cue is also

    sensitive to the shift in the envelope of the signals at higher frequencies.

    • Interaural level difference (ILD): The shadowing of the sound wave by the head as

    seen in Figure 2.2 results in the sound having a higher intensity at the ear nearest to the

    source, depending on the azimuth. It results in a difference in the energy levels

    depending on the frequency. This cue is primarily used at frequencies above 1.5 kHz.

    As a result of the symmetry of the human head, sounds originating from many different

    directions share ITD and ILD. The locus of all source locations that share the same ITD and

    ILD is called the cone of confusion as shown in Figure 2.3. Within the cone of confusion, the

    estimation of source location is on the basis of monaural spectral cues and the effect of head

rotation.
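To give a feel for the magnitude of the ITD cue, the short Python fragment below evaluates the classical Woodworth spherical-head approximation, ITD ≈ (r/c)(θ + sin θ). This formula and the head radius used are textbook assumptions introduced here only for illustration; they are not part of the material of this thesis.

import numpy as np

r = 0.0875   # assumed head radius in metres
c = 343.0    # speed of sound in air, m/s

# Woodworth spherical-head approximation: ITD(theta) ~ (r/c) * (theta + sin(theta))
for az_deg in (0, 30, 60, 90):
    theta = np.deg2rad(az_deg)
    itd_us = (r / c) * (theta + np.sin(theta)) * 1e6
    print(f"azimuth {az_deg:2d} deg -> ITD of roughly {itd_us:4.0f} microseconds")

At 90 degrees the approximation gives roughly 650 microseconds, which is the commonly quoted order of magnitude for the maximum interaural time difference.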

    2.1.2. Monaural spectral cue

    This cue is primarily used to determine the elevation of the source. Figure 2.5 shows the

    measured energies at different frequencies for two different directions of arrival. In each case,

    there are two paths from the source to the ear canal – a direct path and a longer path following

    a reflection from the pinna. For frequencies in the range 6-16 kHz, the delayed signal is out of

    phase with the direct signal, and destructive interference occurs. The greatest interference

    occurs when the difference in length is half the wavelength. This produces a notch in the

spectrum, as seen in Figure 2.5. Thus, the elevation of the source can be estimated from the

    location of this notch.
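As an illustrative calculation (the 2 cm path-length difference is an assumed value, not taken from the thesis), the half-wavelength condition places the notch at

$$\Delta d = \frac{\lambda}{2} \;\Rightarrow\; f_{\text{notch}} = \frac{c}{2\,\Delta d} = \frac{343\ \text{m/s}}{2 \times 0.02\ \text{m}} \approx 8.6\ \text{kHz},$$

which falls within the 6-16 kHz range mentioned above.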

    2.1.3. Rotation of the human head


    Typically, a listener directs his head towards the interesting sound source. The change in ITD,

    ILD and monaural spectral cues with the rotation of the head helps the listener in further

localizing the source and in resolving confusions (see Figure 2.4).

Figure 2.1 Binaural cues – interaural time difference (ITD)
Figure 2.2 Binaural cues – interaural level difference (ILD)
Figure 2.3 Cone of confusion
Figure 2.4 Rotation of human head
Figure 2.5 Monaural spectral cues
Source: Aureal Corporation, “3-D Audio Primer,” Aureal Semiconductor A3D White Paper, 1998.

    2.1.4. Head related impulse response

    The frequency and position dependent characteristics of the pinna, proximate parts of human

body and ear canal are summarized in the form of the head-related transfer function (HRTF); its time-domain analogue is the head-related impulse response (HRIR). As the HRTF depends on the diffraction and reflection properties of the head, pinna and torso, which differ


    from one person to another, it is unique for each person. The HRIR is measured in an

    anechoic room and hence depends solely on the morphology of the listener.

    2.2. Surround sound generation

    Today a variety of multichannel transmission formats and end-user configurations are

    available for conveying a 3-D audio scene to a listener. An example is a 5.1 DVD recording

    intended for reproduction over a standard 3/2-stereo loudspeaker layout. In addition to the

    practical choice of a multichannel transmission or storage and rendering format, various

    microphone recording techniques or electronic spatialization methods can be used to encode

    the directional information in the chosen multichannel format. This section gives a brief

    introduction to the commonly available techniques for reproducing the desired directional

    information over headphones or a number of loudspeakers located at known positions

    surrounding the listening area. These techniques can be classified into three main approaches

    [6]:

    • Sound field reconstruction methods: The objective is to control an acoustical

    variable of the sound field (pressure, velocity) at or around a reference measuring

    point in the listening area. This reference point is usually the sweet spot where the

    auditory image created during rendition is as desired by the mixer.

    • Discrete panning techniques: The knowledge of the desired apparent direction of the

    sound is used to selectively feed the closest loudspeakers in the reproduction system

    based on a panning law.

    • Head-related stereophony (binaural recording or binaural synthesis): The intent is to

    control the acoustic pressure at the ears of the listener via headphone or loudspeaker

    playback.

    The most extensively used method for creating surround sound of various formats is discrete

    amplitude panning. During the recording of the surround sound, each input source of the

    mixing console receives a monophonic recorded or synthetic signal which is devoid of the

room effect, from an individual sound source. A panning module called the panoramic

    potentiometer (or panpot) is used to spatialize each source by multiplying the source signal

    with gains corresponding to each of the output channels. These gains are determined by a

    panning law depending on the desired source location. The commonly used panning laws

    include constant gain optimization (or amplitude preserving law) and constant power


    optimization (or energy preserving law). All the individual source components are then added

    together to give the final multichannel audio. The inter-channel level difference (ICLD)

    arising from the different channel gains for each source is translated into an ITD at the

    listener’s ears for frequencies below 1.5 kHz. Additionally, the source signals might be fed to

    an artificial reverberator which delivers several uncorrelated reverberation signals to the main

    output channels, thus reproducing a diffuse immersive room effect, in which every sound

    source can contribute a different intensity. The direct sound level and reverberation level can

    be adjusted individually in each source channel in order to control the perceived distance of

    the corresponding sound source.

    2.3. Panning laws

    Amplitude panning refers to techniques in which a monophonic audio channel is applied to all

    or a subset of the loudspeakers with different gains. Depending on the gain relationships, the

    listener perceives a virtual source, also known as a phantom source, in a direction that does

    not necessarily match with the direction of any of the loudspeakers. Although the created

sound field does not match that created by a single sound source, listeners perceive it as such [2]. The best playback of stereo audio is obtained by placing the two speakers

symmetrically with respect to the median plane, to the front of the listener. Consequently, they

    are referred to as the left (L) and right (R) speakers. Usually the speakers are placed at an

    angle of 30˚ with respect to the median plane, as shown in Figure 2.6.

    Figure 2.6 Reproduction of two-channel stereo


    The total system gain and the total power are two important attributes of a panning law. For a

    system with N-channel output, the total gain and power for source i is given by

$$G_i = \sum_{j=1}^{N} a_{ij} \qquad (2.1) \qquad\qquad P_i = \sum_{j=1}^{N} a_{ij}^{2} \qquad (2.2)$$

where aij is the gain for the ith source at the jth channel, Gi is the total gain and Pi is the total power for the ith source.

    The constant gain law requires that the total gain, which is the sum of the gains for all

    channels corresponding to a particular source, be a constant. In the two channel case, this

    implies that the gain linearly decreases in one channel as it is increased in the other. The

    angles are considered to be positive when measured in the anticlockwise direction. The gains

    aL and aR given to the left and right speakers are obtained as follows

$$a_L = \frac{\theta_0 + \theta}{2\theta_0} \quad \& \quad a_R = \frac{\theta_0 - \theta}{2\theta_0}, \qquad \theta_0 \ge \theta \ge -\theta_0 \qquad (2.3)$$

where θ is the desired angle and θ0 is the angle of the loudspeakers with respect to the median plane (30˚ in Figure 2.6).

The constant power law implies that the total power, which is the sum of the squares of the gains for all channels corresponding to a source, should be a constant. In the two channel case, this constraint results in the gains aL and aR as follows

$$a_L = \cos\theta' \quad \& \quad a_R = \sin\theta', \qquad \text{where } \theta' = \frac{\theta_0 - \theta}{2\theta_0} \times 90^{\circ} \qquad (2.4)$$

If there are N sources present in the system, the mixing parameters for each of them are

    determined as described previously. Thereafter, the left and right channels i.e. XL(t) and XR(t)

    are obtained by adding the individual components as follows

$$X_L(t) = \sum_{i=1}^{N} a_{iL}\, s_i(t) \quad \& \quad X_R(t) = \sum_{i=1}^{N} a_{iR}\, s_i(t) \qquad (2.5)$$

where si(t) is the time-domain signal corresponding to the ith source, and aiL and aiR are the gains

    or mixing parameters corresponding to the ith source. Thus, each channel is actually a mixture

    of the individual sources obtained by their linear combination. Besides, there is no relative

    delay in the components corresponding to each source in the two channels. Thus, the mixing

    process is linear and instantaneous.
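For illustration, the following Python fragment (with made-up source signals and source angles, not taken from the thesis) applies the constant power law of equation (2.4) to two monophonic sources and forms the left and right channels as in equation (2.5).

import numpy as np

fs = 16000
t = np.arange(fs) / fs
sources = [np.sin(2 * np.pi * 440 * t),            # placeholder source 1
           np.sign(np.sin(2 * np.pi * 220 * t))]   # placeholder source 2
angles_deg = [20.0, -10.0]     # desired source angles, anticlockwise positive
theta0 = 30.0                  # loudspeaker angle w.r.t. the median plane

xL = np.zeros_like(t)
xR = np.zeros_like(t)
for s, theta in zip(sources, angles_deg):
    theta_p = np.deg2rad((theta0 - theta) / (2 * theta0) * 90.0)   # eq. (2.4)
    aL, aR = np.cos(theta_p), np.sin(theta_p)
    xL += aL * s                                                   # eq. (2.5)
    xR += aR * s

print("constant-power check, aL^2 + aR^2 =", aL ** 2 + aR ** 2)    # approximately 1

The check at the end confirms that each source contributes unit total power regardless of its panning angle, which is the defining property of the constant power law.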

    Using these techniques, it is possible to spatialize sources to locations between the speakers

    only. To place virtual sources outside this region, some additional processing based on


psychoacoustic principles is required, as is done in sound retrieval system (SRS)

    technology. When more than two speakers are present in the system, only the two speakers

    closest to the desired source location can be considered to be active. Using this assumption,

    the gains for the two adjacent speakers can be determined using one of the aforementioned

laws, while the gains for all the other speakers are assigned the value zero. This is known as pair-wise

    panning.


    Chapter 3. Audio Spatialization

    Sounds we hear are normally perceived to be located in the space around us and are usually

    associated with sources which are visible or which we know to be there. During stereophonic

    recording of music, the audio mixer virtually moves the sound sources using amplitude

    panning that would give the desired response when reproduced with loudspeakers placed at

    some distance from the user. But it is a common experience that when these stereophonic

    recordings are presented by means of headphones, the sound images are localized within the

head [2]; this phenomenon of in-head localization is known as lateralization. This is because these recorded signals lack the appropriate interaural time difference (ITD),

    interaural intensity difference (IID) and body reflection cues associated with the real-world

    sources. The consequence is that the music thus recorded is not ideal to be reproduced via

    headphones, at least in terms of the truthfulness of reproduction of the desired auditory

    environment. The sound arriving at the listener’s ears corresponds to an unnatural sound field

    increasing the listener’s fatigue. Hence the stereophonic loudspeaker audio needs to be

    specially processed to obtain a signal that can generate the effect of auditory immersion over

    headphones. The techniques for including the appropriate real-world cues into the

    stereophonic audio and making it compatible for the headphones are discussed in the

    following sections and chapters. This process of spatial rendering for conversion of the

available audio configuration into the desired listening configuration is termed audio

    spatialization.

    3.1. Stages of audio spatialization

As described in the previous section, audio spatialization is the process of realistic spatial rendering of audio into the desired listening configuration from the available audio format; in our case, the available format is stereophonic loudspeaker audio and the desired listening configuration is headphones. The approach considered in this work involves separating the individual sources from the multi-channel audio mixture and then re-

    creating the desired listener-end configuration by suitable re-combination of the individual

    spatialized sources. The stages involved in the audio spatialization process are shown in

    Figure 3.1.

    • Analysis (source separation) – the individual source signals and their locations in the 3D space are estimated from the available mixtures


    • Re-synthesis (convolving with HRIRs) – the estimated sources are externalized to the desired locations by filtering them with the head related impulse responses (HRIRs)

    Figure 3.1 Audio spatialization block diagram

    3.1.1. Analysis – source separation

    The process of extracting the individual sources from a set of observations from sensors such

    as microphones (mixtures) is called source separation. When the information about the mixing

    process and sources is limited, the problem is known as ‘blind’ source separation (BSS). The

    classical example is the cocktail party problem, where a number of people are talking

    simultaneously in a room (like at a cocktail party), and one is trying to follow one of the

    discussions. The human brain can handle this sort of auditory source separation problem, but

    it is a very tricky problem in signal processing. Several approaches have been proposed for

    the solution of this problem but development is currently still very much in progress. The

    separation of a superposition of multiple signals is accomplished by taking into account the

    structure of the mixing process and by making assumptions about the sources. Some of the

    successful approaches are principal component analysis (PCA) and independent component

    analysis (ICA), which work well when there are no delays or echoes present; that is, the

problem is simplified a great deal. By assuming sources can be represented sparsely in a given

    basis, recent research has demonstrated that solutions to previously problematic blind source

    separation problems can be obtained. In some cases, solutions are possible to problems

    intractable by previous non-sparse methods. Indeed, sparse methods provide a powerful

    approach to the separation of mixtures [3].

    3.1.2. Re-synthesis – convolution with HRIRs

    Each surround sound system has a pre-determined position for each of the speakers. The

audio for each system is recorded taking this factor into account. With this a priori knowledge about the speaker locations, one of the simplest methods of spatialization is to convolve each of the individual input channels with the HRIRs of the corresponding speaker and then sum the results [5]. The location of each speaker determines the HRIR to be used


    for that speaker. The HRIRs can be obtained from the CIPIC [7] or the KEMAR [8] database.

    This process is depicted in Figure 3.2.

Let xm(t) be the mth channel signal. The filters hmL(t) and hmR(t) represent the HRIRs corresponding to the mth speaker for the left and right ear respectively. The left and right ear signals for playback over headphones are then given by yL(t) and yR(t) as follows:

$$y_L(t) = \sum_{m} h_{mL}(t) * x_m(t) \qquad (3.1)$$

$$y_R(t) = \sum_{m} h_{mR}(t) * x_m(t) \qquad (3.2)$$

    Figure 3.2 Time-domain virtualization based on HRIRs
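A minimal Python sketch of equations (3.1) and (3.2) is given below: each channel signal is convolved with the left- and right-ear HRIRs for its speaker position and the results are summed over channels. The signals and HRIRs here are random placeholders; in practice the HRIRs would be loaded from a measured set such as CIPIC [7] or KEMAR [8].

import numpy as np
from scipy.signal import fftconvolve

fs, n_channels, hrir_len = 44100, 5, 200
rng = np.random.default_rng(0)

# placeholder data: channel signals x_m(t) and one HRIR pair per speaker position
x = rng.standard_normal((n_channels, fs))            # one second of audio per channel
h_L = rng.standard_normal((n_channels, hrir_len))    # stand-ins for measured left-ear HRIRs
h_R = rng.standard_normal((n_channels, hrir_len))    # stand-ins for measured right-ear HRIRs

# equations (3.1)/(3.2): convolve each channel with its HRIRs and sum over channels
y_L = sum(fftconvolve(x[m], h_L[m]) for m in range(n_channels))
y_R = sum(fftconvolve(x[m], h_R[m]) for m in range(n_channels))
binaural = np.stack([y_L, y_R], axis=1)               # two-column signal for headphone playback
print(binaural.shape)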

    Using this method, the sources that are active only on a single channel can be convincingly

    virtualized over headphones, i.e. a rendering can be achieved that generates a sense of

    externalization and accurate spatial positioning of the source. However, a sound source that is

partially panned across channels in the recording may not be convincingly reproduced.

Consider a set of input signals, each of which is an amplitude-scaled version of the source s(t):

$$x_m(t) = a_m\, s(t) \qquad (3.3)$$

With these inputs, equations (3.1) and (3.2) become

$$y_L(t) = s(t) * \Big(\sum_{m} a_m\, h_{mL}(t)\Big) \qquad (3.4)$$

$$y_R(t) = s(t) * \Big(\sum_{m} a_m\, h_{mR}(t)\Big) \qquad (3.5)$$

The source s(t) is thus rendered through a combination of HRIRs for multiple different

    directions instead of via the correct HRIR for the actual desired source location. Unless the

    combined HRIRs correspond to closely spaced channels, this combination of HRIRs will

    significantly degrade the spatial image. This is one of the drawbacks of this method. To

    rectify this, the desired source signals and locations can be estimated from the multiple

    channels and then the corresponding source signals can be spatialized to the appropriate

    location obtained from the estimation.


    Chapter 4. Sparsity-based Source Separation

    Source separation arises in a variety of signal processing applications, ranging from speech

    processing to medical image analysis. The process of extracting the individual sources from a

    set of observations from sensors such as microphones (mixtures) is called source separation.

    When the information about the mixing process and sources is limited, the problem is known

    as ‘blind’ source separation (BSS). Generally the problem is stated as follows

    Given M mixtures of N sources mixed via an unknown (M x N) mixing matrix A,

    estimate the underlying sources from the mixtures.

BSS of acoustic signals is often referred to as the cocktail party problem, that is, the separation of individual voices from a myriad of voices in an uncontrolled acoustic environment such as a cocktail party.

    4.1. Classification of source separation algorithms

    BSS algorithms can be categorized according to the assumptions they make about the mixing

    model. Thus, one can classify them based on mixing parameters considered in the mixing

    model or based on the number of mixtures and sources considered in the mixing model.

    4.1.1. Based on mixing parameters considered in the mixing model

    Environmental assumptions about the surroundings in which the sensor observations are made

    also influence the complexity of the problem. Sensor observations in a natural environment

    are confounded by signal reverberations, and consequently, the estimated un-mixing process

    needs to identify a source arriving from multiple directions at different times as one individual

    source. Generally, BSS techniques depart from this difficult real world scenario and make less

    realistic assumptions about the environment so as to make the problem more tractable. There

are typically three assumptions that are made about the environment:

• instantaneous mixtures
• anechoic mixtures
• echoic mixtures

The most rudimentary of these is the instantaneous case, where the sources are assumed to arrive

    instantly at the sensors but with different signal intensity. An extension of the previous

    assumption, where arrival delays between sensors are also considered, is known as the


    anechoic case. The anechoic case can be further extended by considering multiple paths

    between each source and sensor, which results in the echoic case, sometimes also known as

    convolutional mixing. Each case can be extended to incorporate linear additive noise. But the

    presence of noise in the system increases the complexity of the source separation process.

Separation becomes even more challenging if the sources are assumed to be mobile. Most systems assume that the sources are static, as in the case of instantaneous and anechoic mixtures, whereas the echoic case represents the most natural and general situation.

    4.1.2. Based on number of mixtures and sources in the mixing model

    The source separation algorithms can also be categorized based upon the assumptions made

    related to the number of mixtures and the number of sources considered in the mixing model.

    There are typically three assumptions that are made

• over-determined (M > N)
• even-determined (M = N)
• under-determined (M < N)

    where M is the number of mixtures and N is the number of sources.

    When M ≥ N, separation of sources can be achieved by constructing an unmixing matrix W,

where W = A⁻¹ up to permutation and scaling of the rows. The dimensionality of the mixing

    process influences the complexity of source separation. If M = N, the mixing process is

    defined by an even-determined (i.e. square) matrix A and, provided that it is non-singular, the

    underlying sources can be estimated by a linear transformation. And if M > N, the mixing

    process is defined by an over-determined matrix A and, provided that it is full rank, the

    underlying sources can be estimated by least-squares optimization or linear transformation

    involving matrix pseudo-inversion. If M < N, the mixing process A is defined by an under-

    determined matrix and consequently source estimation becomes more involved and is usually

    achieved by some non-linear techniques.
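As a quick illustration of the even- and over-determined cases (with an arbitrary, made-up mixing matrix and toy sources, assumed known for the purpose of the example), separation reduces to matrix inversion or to the least-squares pseudo-inverse:

import numpy as np

rng = np.random.default_rng(1)
N, T = 3, 1000
S = rng.laplace(size=(N, T))          # placeholder sources, one per row

# even-determined (M = N): x = A s, recover the sources with W = A^{-1}
A = rng.standard_normal((N, N))
X = A @ S
S_hat = np.linalg.inv(A) @ X

# over-determined (M > N): recover by least squares via the pseudo-inverse
A_over = rng.standard_normal((5, N))
X_over = A_over @ S
S_hat_over = np.linalg.pinv(A_over) @ X_over

print(np.allclose(S_hat, S), np.allclose(S_hat_over, S))   # both True for full-rank A

The under-determined case has no such closed-form inverse, which is why the non-linear, sparsity-based estimators discussed in the following sections are needed.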

    4.2. Source separation algorithms: A review

    The separation process is accomplished by taking into account the structure of the mixing

    process and making assumptions about the sources. Several methods exist that attempt to

    solve the BSS problem under various assumptions and conditions. Usually the following


    assumptions about the nature of the sources are made in order to make the source separation

    algorithm more tractable:

• statistical independence of sources [9]
• sparse decomposition of sources into some basis (time-frequency dictionaries) [10]
• sparsity of sources in some time-frequency representations [11] [12]

    One such approach is independent component analysis (ICA) which is based on the

    assumption that the sources are statistically independent [9]. This technique extracts N sources

    from N instantaneous mixtures. This algorithm can be extended for the case of instantaneous

under-determined mixtures. There also exist algorithms that demix under-determined anechoic mixtures; one such algorithm applies a complex independent component analysis technique to solve the BSS problem for electroencephalographic (ECG) data.

    An alternative approach to the BSS problem for under-determined instantaneous mixtures is

    to assume that the sources have sparse expansion with respect to some basis. In this case, one

    can formulate the source extraction problem as a constrained l1 minimization problem, which

typically yields a convex program [10].
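As a sketch of this idea (not the specific formulation of [10]), the l1 minimization for a single observation of an under-determined instantaneous mixture can be posed as a linear program by splitting the unknowns into positive and negative parts. The mixing matrix and active source below are invented for illustration, and exact recovery can only be expected when the observation is sufficiently sparse.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
M, N = 2, 3
A = rng.standard_normal((M, N))       # assumed known mixing matrix (2 mixtures, 3 sources)
s_true = np.array([0.0, 1.3, 0.0])    # only one source active at this instant
x = A @ s_true                        # observed mixture sample

# min ||s||_1 subject to A s = x, written as an LP with s = u - v, u, v >= 0
c = np.ones(2 * N)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=x, bounds=[(0, None)] * (2 * N))
s_hat = res.x[:N] - res.x[N:]
print("true:", s_true, " estimated:", np.round(s_hat, 3))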

    Another approach to demix under-determined anechoic mixtures, called degenerate unmixing

estimation technique (DUET), was proposed by Yilmaz and Rickard [11]. This algorithm uses the sparsity of music/speech signals in the short-time Fourier transform (STFT) domain to construct binary time-frequency masks, which are then used to extract several sources from only two mixtures. Another algorithm, presented in [12], uses an lq-minimization-based approach, with q < 1, for the estimation of sources in the STFT domain.

    4.3. Mixing models

    Suppose we have N time domain sources s1(t), s2(t), . . . , sN(t) and M mixtures x1(t), x2(t), . . . ,

    xM(t) such that

$$x_i(t) = \sum_{j=1}^{N} a_{ij}\, s_j(t - \delta_{ij}), \qquad i = 1, 2, \ldots, M \qquad (4.1)$$

where aij are the attenuation coefficients and δij are the time delays associated with the path from the jth source to the ith receiver (sensor). Equation (4.1) defines an anechoic mixing model. With δij = 0, equation (4.1) defines an instantaneous mixing model. The problem of anechoic

    signal unmixing is therefore to identify the attenuation coefficient and the relative delay


    associated with each source. An illustration of the anechoic and under-determined case (M =

    2, N = 3) is provided in Figure 4.1.
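A minimal simulation of such an under-determined anechoic mixture, assuming integer-sample delays and arbitrarily chosen attenuation values, could look as follows; real recordings would of course involve fractional delays and room effects.

import numpy as np

fs = 16000                               # sampling rate (Hz), assumed for this illustration
t = np.arange(fs) / fs                   # one second of signal
# Three toy sources (N = 3): a tone, a modulated tone and low-level noise
s = np.stack([np.sin(2 * np.pi * 440 * t),
              np.sign(np.sin(2 * np.pi * 3 * t)) * np.sin(2 * np.pi * 880 * t),
              0.1 * np.random.default_rng(1).standard_normal(fs)])

# Attenuations a_ij and integer-sample delays d_ij for M = 2 sensors
a = np.array([[1.0, 1.0, 1.0],           # sensor 1 taken as the reference
              [0.9, 0.7, 0.5]])
d = np.array([[0, 0, 0],
              [2, 5, 8]])                # delays in samples

def anechoic_mix(s, a, d):
    """x_i(t) = sum_j a_ij * s_j(t - d_ij): equation (4.1) with sample delays."""
    M, N = a.shape
    x = np.zeros((M, s.shape[1]))
    for i in range(M):
        for j in range(N):
            x[i] += a[i, j] * np.roll(s[j], d[i, j])   # np.roll keeps the sketch short
    return x

x = anechoic_mix(s, a, d)                # two mixtures of three sources (under-determined)
print(x.shape)                           # (2, 16000)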

    The echoic case of BSS considers not only transmission delays but reverberations too. This

    results in a more involved generative model that in turn makes finding a solution more

    difficult.

x_i(t) = \sum_{j=1}^{N} \sum_{l=1}^{L} a_{ijl}\, s_j(t - \delta_{ijl}), \qquad i = 1, 2, \ldots, M \qquad (4.2)

where L is the number of paths the source signal can take to the sensors. An illustration of the

    echoic case is provided in Figure 4.2.

    Figure 4.1 Mixing models - anechoic mixing

    Figure 4.2 Mixing model - echoic mixing

    Source: P. O’Grady, B. Pearlmutter and S. Rickard, “Survey of sparse and non-sparse methods in source separation,” International Journal of Imaging Systems and Technology, vol. 15, no. 1, 2005

    4.4. Sparsity-based source separation

    In order to make the source separation problem more tractable, one usually needs to make

    certain assumptions about the nature of the sources. Such assumptions form the basis for most

    source separation algorithms and include statistical properties such as independence and

    stationarity. One increasingly popular and powerful assumption is that the sources have a

    sparse representation in a given basis. These methods are known as sparsity-based source

    separation methods. A signal is said to be sparse when most of its coefficients are zero (or

    nearly zero) valued i.e. the major part of the signal energy is concentrated in very few

    coefficients of the signal.

    The advantage of a sparse signal representation is that the probability of two or more sources

    being simultaneously active is low. Thus, sparse representations lend themselves to good

    separability because most of the energy in a basis coefficient at any time instant belongs to a

    single source. Additionally, sparsity can be used in many instances to perform source


    separation in the case when there are more sources than sensors. A sparser representation of

    an acoustic signal can often be achieved by a transformation into a Fourier, Gabor or Wavelet

    basis.
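To make this notion of energy concentration concrete, the following sketch measures what fraction of the STFT coefficients carries 95% of the energy of a toy harmonic signal; an actual speech or music excerpt would simply replace the synthetic input.

import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(2 * fs) / fs
# Toy harmonic signal with a slowly varying envelope (a stand-in for a musical source)
x = np.sin(2 * np.pi * 440 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 2 * t))

_, _, X = stft(x, fs=fs, nperseg=1024)
energy = np.abs(X).ravel() ** 2
energy_sorted = np.sort(energy)[::-1]                  # largest coefficients first
cumulative = np.cumsum(energy_sorted) / energy.sum()

# Number of coefficients needed to capture 95% of the total energy
k = np.searchsorted(cumulative, 0.95) + 1
print(f"{k / energy.size:.3%} of the STFT bins hold 95% of the energy")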

    4.5. Stages of sparsity-based source separation

    The block representation of sparsity-based source separation algorithm is shown in Figure 4.3.

    The various steps involved in the algorithm are:

    • Time-frequency transform – Transformation of the available mixtures into some

    sparse time-frequency representation such as short-time Fourier transform (STFT)

    • Mixing parameter estimation – Estimation of the mixing parameters is done by

    clustering the ratios of the time-frequency representations of the mixtures

    • Source estimation – Using the estimates of the mixing parameters, the estimates of

    each of the individual sources in the time-frequency domain is obtained by using an

    appropriate source estimation algorithm like DUET, LQBP or DASSS

    • Inverse time-frequency transform – Finally the time-frequency estimates of each of

    the individual sources are inverted back to time-domain using an appropriate inverse

    time-frequency transformation to recover the original sources

    Figure 4.3 Block diagram of sparsity-based source separation

    4.6. Source assumptions

4.6.1. Local stationarity

The windowed Fourier transform of a signal s(t) is obtained as

F^{W}[s(\cdot)](\omega, \tau) = \int_{-\infty}^{\infty} W(t - \tau)\, s(t)\, e^{-i\omega t}\, dt \qquad (5.1)

The windowed Fourier transform of s(t) defined in equation (5.1) will be referred to as S^{W}(\omega, \tau) where appropriate. Using equation (5.1) and the following Fourier transform pair,

s(t - \delta) \;\leftrightarrow\; e^{-i\omega\delta}\, S(\omega) \qquad (5.2)


we have

F^{W}[s(\cdot - \delta)](\omega, \tau) = e^{-i\omega\delta}\, F^{W}[s(\cdot)](\omega, \tau) \qquad (5.3)

when W(t) ≡ 1. However, when W(t) is a windowing function, equation (5.3) is not necessarily true. This can be thought of as a form of the narrowband assumption in array processing [13], but this label is perhaps misleading in that speech is not narrowband, and local stationarity seems a more appropriate moniker. For DUET, it is necessary that equation (5.3) holds for all δ, |δ| ≤ Δ, even when W(t) has finite support [14]. Here Δ is the maximum time difference possible in the mixing model (the microphone spacing divided by the speed of sound propagation).

4.6.2. Microphone spacing

Additionally, one crucial issue is that the DUET algorithm is based on the extraction of attenuation and delay parameter estimates for each time-frequency bin. We will utilize the local stationarity assumption to turn the delay in time into a multiplicative factor in time-frequency. Of course, this multiplicative factor e^{-iωδ} uniquely specifies δ only if |ωδ| < π, as otherwise we have an ambiguity due to phase wrap [15]. So we require

|\omega \delta_{ij}| < \pi, \qquad \forall \omega,\ \forall i, j \qquad (5.4)

to avoid phase ambiguity. This is guaranteed when the microphones are separated by less than πc/ω_m, where ω_m is the maximum (angular) frequency present in the sources and c is the speed of sound.
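For instance, assuming the nominal value c ≈ 343 m/s and sources band-limited to f_m = 8 kHz (so that ω_m = 2π·8000 rad/s), the spacing must satisfy d < πc/ω_m = c/(2 f_m) ≈ 2.1 cm.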

4.6.3. W-disjoint orthogonality

    Given a windowing function W(t), we call two functions sj(t) and sk(t) W-disjoint orthogonal

    if the supports of the windowed Fourier transforms of sj(t) and sk(t) are disjoint [11] . The W-

disjoint orthogonality assumption can be stated concisely as

S_j^{W}(\omega, \tau)\, S_k^{W}(\omega, \tau) = 0, \qquad \forall j \neq k,\ \forall \omega, \tau \qquad (5.5)

This assumption is the mathematical idealization of the condition that, with high likelihood, every time-frequency point in the mixture with significant energy is dominated by the contribution of a single source.

    4.7. Mixing parameter estimation technique

    The assumptions of anechoic mixing and local stationarity allow us to rewrite the mixing

    equation (4.1) in the time-frequency domain as,


\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\omega, \tau) \\ \vdots \\ \hat{s}_N(\omega, \tau) \end{bmatrix} \qquad (5.6)

    With the further assumption of W-disjoint orthogonality, at most one source is active at every

    (w,τ), the mixing process can be described for each (w,τ) and for some j as,

\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} \hat{s}_j(\omega, \tau) \qquad (5.7)

here j is the index of the source active at (ω, τ).

    Now, we can calculate the relative amplitude and delay parameters associated with one

    source, using

\left( \tilde{a}(\omega, \tau),\ \tilde{\delta}(\omega, \tau) \right) = \left( \left| \frac{\hat{x}_2(\omega, \tau)}{\hat{x}_1(\omega, \tau)} \right|,\ -\frac{1}{\omega}\, \Im\!\left( \log \frac{\hat{x}_2(\omega, \tau)}{\hat{x}_1(\omega, \tau)} \right) \right) \qquad (5.8)

for some j, where \Im(\cdot) denotes taking the imaginary part. Using (5.8), every (ω, τ) yields an estimate pair for the relative attenuation-delay parameters associated with one source. For W-disjoint orthogonal signals, if we calculate the attenuation-delay estimates from a number of time-frequency points, we would expect to see clusters around the true mixing parameters for each

    source.

If we now construct a two-dimensional weighted histogram of these estimates, the number of peaks found gives an estimate of the number of sources, and the peak centres give the estimates of the attenuation-delay parameters associated with each source. From these estimates of the mixing parameters we then construct the time-frequency masks which de-mix the mixtures.

    The main observation that DUET leverages is that the ratio of the time-frequency

    representations of the mixtures does not depend on the source components but only on the

    mixing parameters associated with the active source component [15]. Thus it can be seen that,

    the successful extraction of mixing parameters relies on the sparsity of speech in the time-

    frequency domain.
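A minimal sketch of this estimation stage, assuming two mixtures and using the STFT from scipy, is given below. The attenuation and delay estimates of equation (5.8) are accumulated into a power-weighted two-dimensional histogram; peak picking, which is not shown, would then yield the number of sources and their mixing parameters.

import numpy as np
from scipy.signal import stft

def duet_histogram(x1, x2, fs, nperseg=1024, p=1):
    """Relative attenuation/delay estimates (equation 5.8) and their weighted
    2-D histogram; the histogram range would normally be restricted to the
    physically plausible attenuation and delay values."""
    f, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)

    eps = 1e-12
    omega = 2 * np.pi * f[:, None]                  # angular frequency of each bin
    ratio = (X2 + eps) / (X1 + eps)

    a_est = np.abs(ratio)                           # relative attenuation estimates
    with np.errstate(divide='ignore', invalid='ignore'):
        d_est = -np.imag(np.log(ratio)) / omega     # relative delay estimates (seconds)
    d_est[0, :] = 0.0                               # the omega = 0 row is undefined

    # Weight each estimate by mixture power so that low-energy bins contribute little
    weights = (np.abs(X1) * np.abs(X2)) ** p

    hist, a_edges, d_edges = np.histogram2d(
        a_est.ravel(), d_est.ravel(), bins=[50, 50], weights=weights.ravel())
    return hist, a_edges, d_edges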

    4.8. Source estimation techniques

    Once the mixing parameters are estimated, each of the individual source signals is extracted

    from the mixtures. The unmixing could be either a hard-assignment of each time-frequency

    component of the mixture to only one source or a soft-assignment to multiple sources or a

    combination of both hard and soft assignments (where for some time-frequency bins hard


    assignment is used and for other bins soft assignment is used based on some criterion which

    decides when to use hard/soft assignments).

    4.8.1. Degenerate unmixing estimation technique (DUET)

    If the number of sources is equal to the number of mixtures, the non-degenerate case, the

    standard demixing method is to invert the mixing matrix. The mixing model for two sources

    can be written as,

\begin{bmatrix} \hat{x}_1(\omega, \tau) \\ \hat{x}_2(\omega, \tau) \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ a_1 e^{-i\omega\delta_1} & a_2 e^{-i\omega\delta_2} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\omega, \tau) \\ \hat{s}_2(\omega, \tau) \end{bmatrix} \qquad (5.9)

    When the number of sources is greater than the number of mixtures, the degenerate case,

    matrix inversion is no longer possible. Nevertheless, in this case one can still de-mix by

    partitioning the time-frequency plane using one of the mixtures based on the estimated mixing

    parameters [11].

For W-disjoint orthogonal signals, using equation (5.5), we know that every time-frequency bin in \hat{x}_1(\omega, \tau) corresponds to \hat{s}_i(\omega, \tau) for some i. Moreover, the ratio \hat{x}_2(\omega, \tau)/\hat{x}_1(\omega, \tau) depends only on the mixing parameters associated with one source. Thus, for each time-frequency point, we can determine which of the N peaks in the 2-D histogram of attenuation-delay estimates is closest to the (\tilde{a}, \tilde{\delta}) estimate for the given (\omega, \tau) [11]. The following likelihood function is used to produce a measure of closeness,

J(\omega, \tau) = \operatorname{argmin}_j \frac{\left| a_j e^{-i\omega\delta_j}\, \hat{x}_1(\omega, \tau) - \hat{x}_2(\omega, \tau) \right|^2}{1 + a_j^2} \qquad (5.10)

and then assign each time-frequency point to the mixing parameter estimate via

M_j(\omega, \tau) = \begin{cases} 1, & J(\omega, \tau) = j \\ 0, & \text{otherwise} \end{cases} \qquad (5.11)

Essentially, (5.10) and (5.11) assign each time-frequency point to the mixing parameter pair which best explains the mixtures at that particular time-frequency point. We de-mix via masking and maximum-likelihood combining,

\hat{s}_j(\omega, \tau) = M_j(\omega, \tau)\, \frac{\hat{x}_1(\omega, \tau) + a_j e^{i\omega\delta_j}\, \hat{x}_2(\omega, \tau)}{1 + a_j^2} \qquad (5.12)

Then the original sources are reconstructed from their time-frequency representations by

    converting them back into the time domain.
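The sketch below illustrates equations (5.10)-(5.12) for already-estimated relative attenuations a_j and delays δ_j (in seconds). It covers only the masking and maximum-likelihood combining steps; the histogram-based parameter estimation of section 4.7 is assumed to have been carried out beforehand.

import numpy as np
from scipy.signal import stft, istft

def duet_demix(x1, x2, fs, a, delta, nperseg=1024):
    """DUET-style demixing given estimated attenuation a[j] and delay delta[j]
    (in seconds) for each of the N sources."""
    f, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    omega = 2 * np.pi * f[:, None]

    a = np.asarray(a, dtype=float)
    delta = np.asarray(delta, dtype=float)
    N = len(a)

    # Likelihood of each source model at every TF point (equation 5.10)
    cost = np.empty((N,) + X1.shape)
    for j in range(N):
        cost[j] = (np.abs(a[j] * np.exp(-1j * omega * delta[j]) * X1 - X2) ** 2
                   / (1 + a[j] ** 2))
    J = np.argmin(cost, axis=0)                     # winning source index per bin

    sources = []
    for j in range(N):
        mask = (J == j)                             # binary mask (equation 5.11)
        # Maximum-likelihood combining of the two mixtures (equation 5.12)
        S_j = mask * (X1 + a[j] * np.exp(1j * omega * delta[j]) * X2) / (1 + a[j] ** 2)
        _, s_j = istft(S_j, fs=fs, nperseg=nperseg)
        sources.append(s_j)
    return sources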


    4.8.2. Lq-basis pursuit (LQBP)

We have seen that DUET assumes only one active source at every time-frequency bin, but in practice this assumption does not always hold. Since multiple sources may be present at most of the time-frequency bins, the LQBP algorithm relaxes this assumption: it assumes at most M (the number of mixtures) sources are active at every time-frequency bin [12].

    The LQBP algorithm proposed in [12] separates N sources from M mixtures. The task is

accomplished by extracting, at each time-frequency point, the at most M active sources that minimize the lq norm via lq-basis-pursuit. The following assumptions are required to ensure an accurate recovery of the

    sources.

• No more than M sources are active at each time-frequency point.
• The columns of the mixing matrix were accurately extracted in the mixing-model recovery stage.
• The mixing matrix is full rank.

First, the mixing matrix is constructed from the mixing parameter estimates obtained from the previous stage,

\hat{A}(\omega) = \begin{bmatrix} \hat{a}_{11} e^{-i\omega\hat{\delta}_{11}} & \cdots & \hat{a}_{1N} e^{-i\omega\hat{\delta}_{1N}} \\ \vdots & \ddots & \vdots \\ \hat{a}_{M1} e^{-i\omega\hat{\delta}_{M1}} & \cdots & \hat{a}_{MN} e^{-i\omega\hat{\delta}_{MN}} \end{bmatrix} \qquad (5.13)

where \hat{a}_{ij} are the estimated attenuation parameters and \hat{\delta}_{ij} are the estimated delay parameters, computed as discussed in the previous section. Note that each column of \hat{A} is a unit vector.

The goal now is to compute good estimates \hat{S}_1, \hat{S}_2, \ldots, \hat{S}_N of the original sources S_1, S_2, \ldots, S_N. These estimates must satisfy

\hat{A}\, \hat{S} = \hat{X} \qquad (5.14)

where \hat{S} = [\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_N]^T is the vector of source estimates in the time-frequency domain and \hat{X} is the vector of the available mixtures. At each time-frequency bin, equation (5.14) provides M equations (corresponding to the M available mixtures) with N > M unknowns (\hat{S}_1, \hat{S}_2, \ldots, \hat{S}_N). Assuming that this system of equations is consistent, it has infinitely many solutions. To choose a reasonable estimate among these solutions, we shall exploit the sparsity of the source vector in the time-

    frequency domain.


    The problem can be formally stated as [12]

\min_{\hat{S}} \|\hat{S}\| \quad \text{subject to} \quad \hat{A}\, \hat{S} = \hat{X} \qquad (5.15)

where \|u\| denotes some measure of the sparsity of a vector u. Given a vector u = (u_1, u_2, \ldots, u_n) \in \mathbb{R}^n, one measure of its sparsity is simply the number of non-zero components of u, commonly denoted \|u\|_0. But in general, the sparsity of the Gabor coefficients of speech signals essentially means that most of the coefficients are small, though not identically zero. In this case, the P0 formulation fails miserably. Alternatively, one can consider

\|u\|_q = \left( \sum_i |u_i|^q \right)^{1/q} \qquad (5.16)

where 0 < q ≤ 1, as a measure of sparsity. Here, a smaller q signifies increased importance of the sparsity of u. Such a problem statement is commonly called the Pq problem.

    The solution to Pq is identical to the solution of the lq-basis-pursuit (LQBP) problem, given by

\mathrm{LQBP}: \quad \min_{\hat{S}} \|\hat{S}\|_q \quad \text{subject to} \quad \hat{A}\, \hat{S} = \hat{X}, \quad \|\hat{S}\|_0 \leq M \qquad (5.17)

Note that to solve the LQBP problem, one needs to find the best basis for the column space of \hat{A} that minimizes the lq norm of the solution vector. The solution of LQBP is given by the solution of

\min_{\hat{S}_I} \|\hat{S}_I\|_q \quad \text{such that} \quad \hat{A}_I\, \hat{S}_I = \hat{X} \qquad (5.18)

over all index sets I = \{i(1), \ldots, i(M)\}, for \hat{A}_I = [\hat{a}_{i(1)} | \cdots | \hat{a}_{i(M)}] and \hat{S}_I = [\hat{S}_{i(1)}, \ldots, \hat{S}_{i(M)}]^T.
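Since M and N are typically small in this application, the search over column subsets can be exhaustive. The sketch below illustrates this at a single time-frequency bin: every set of M columns of the estimated mixing matrix is tried, the corresponding square system is solved exactly, and the candidate with the smallest lq measure is kept.

import numpy as np
from itertools import combinations

def lqbp_bin(A, x, q=0.5):
    """lq-basis-pursuit at one time-frequency bin: try every set of M columns
    of the M x N mixing matrix A, solve the square system exactly and keep the
    solution with the smallest sum |s_i|^q (smaller means sparser)."""
    M, N = A.shape
    best_s, best_cost = None, np.inf
    for idx in combinations(range(N), M):
        A_I = A[:, idx]
        if np.linalg.matrix_rank(A_I) < M:          # skip singular sub-matrices
            continue
        s_I = np.linalg.solve(A_I, x)
        cost = np.sum(np.abs(s_I) ** q)
        if cost < best_cost:
            best_cost = cost
            best_s = np.zeros(N, dtype=complex)
            best_s[list(idx)] = s_I
    return best_s

# Toy example: M = 2 mixtures, N = 3 sources, only the third source really active
A = np.array([[1.0, 1.0, 1.0],
              [0.9, 0.7, 0.5]], dtype=complex)
s_true = np.array([0.0, 0.0, 1.3])
print(lqbp_bin(A, A @ s_true))                      # approximately [0, 0, 1.3]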

    4.8.3. Delay and scale subtraction scoring (DASSS)

    We have seen that DUET uses nearest-neighbour approach to demix the sources, which

    suffers in cases where W-disjoint orthogonality is violated. Specifically, if two sources

    contribute to the energy in a particular time-frequency bin, the mixing parameter estimates

    will lie between those of the contributing sources. In some cases, this result might generate

    mixing parameter estimates whose nearest neighbour source is actually a third source. So we

    may spuriously assign energy to a third source that is not contributing. On the other hand, we

    saw that LQBP assumes at most M (the number of mixtures) sources at every time-frequency

    bin, which may not be the case always, as we might have only one active source at some time-


    frequency bins. In this case too, we might spuriously assign energy to a source which might

    not be present at that particular time-frequency bin. Thus a new demixing method called delay

    and scale subtraction scoring (DASSS) was presented in [16] that is less erratic than the

    nearest-neighbour method and highlights when the W-disjoint orthogonality assumptions of

    the DUET system are not valid. Furthermore, this technique identifies the time-frequency bins

where multiple sources are actually present and uses a source-aware demixing technique for

    those bins.

    It should be noted that we require reliable estimates of the mixing parameters which is also

    the case with the other two source separation techniques (DUET and LQBP). In this technique

N new signals Y_i, each of which entirely eliminates a particular S_i, are created in the following manner:

Y_i(\omega, \tau) = X_1(\omega, \tau) - \frac{1}{a_i}\, e^{i\omega\delta_i}\, X_2(\omega, \tau) \qquad (5.19)

It should be noted that the multiplicative factor applied to X_2 corresponds to scaling and delay in the time domain. Hence this source-eliminating technique is called delay and scale subtraction scoring, or DASSS. Y_i can also be written in the following way:

Y_i = c_{i,1}\, S_1 + c_{i,2}\, S_2 + \cdots + c_{i,N}\, S_N, \qquad \text{where } c_{i,j} \equiv 1 - \frac{a_j}{a_i}\, e^{-i\omega(\delta_j - \delta_i)} \qquad (5.20)

If exactly one source, say source j, is active at a specific time-frequency bin, we have the following:

Y_j = 0 \qquad (5.21)

Y_i = c_{i,j}\, S_j \qquad (5.22)

Y_i = c_{i,j}\, X_1 \qquad (5.23)

Equation (5.23) reveals that, if only one source is active, the N values in the set of Y_i for a given bin can be predicted using only the known mixing parameters and the given mixture X_1. In fact, N sets of such predictions can be made, each assuming one guessed active source g. (We will use \hat{Y}_i^{\,g} to denote the prediction of the ith Y value when assuming only source g is active.) Now the predicted set of values can be compared with the actual observed set of Y_i. If exactly one source is active, only its corresponding prediction will fit the observed Y_i. Further, the following scoring function can be used to compare predicted values to the observed values.


f(g) = \frac{\sum_i \left| Y_i - \hat{Y}_i^{\,g} \right|}{\sum_i \left| Y_i \right|} \qquad (5.24)

If the error is sufficiently small for a particular g in a given bin, we consider that only one source, namely g, is active in that bin and assign the energy of the mixture in that bin to source g. If the error is too large, we use the multi-source demixing approach discussed below.

    For the bins where no single source model scores well enough, it is reasonable to conclude

that at least two sources must be present. Hence the problem of partitioning the input into those two sources may be solved by simply inverting the mixing matrix restricted to those two sources. It is assumed that the two sources whose fractional errors f(g) are lowest are the active

    sources.
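The sketch below illustrates the scoring step at one time-frequency bin, using the reconstruction of equations (5.19)-(5.24) given above (the exact form of the scoring function in [16] may differ in detail). A small score for a particular g indicates that the hypothesis "only source g is active" explains the observed bin well.

import numpy as np

def dasss_scores(X1, X2, omega, a, delta):
    """Build the source-cancelling signals Y_i, predict them under the
    hypothesis that only source g is active and return the fractional
    error f(g) for every g."""
    a = np.asarray(a, dtype=float)
    delta = np.asarray(delta, dtype=float)
    N = len(a)

    # Y_i eliminates source i from the mixture pair (equation 5.19)
    Y = np.array([X1 - (1.0 / a[i]) * np.exp(1j * omega * delta[i]) * X2
                  for i in range(N)])

    # c_{i,g} = 1 - (a_g / a_i) e^{-i omega (delta_g - delta_i)}  (equation 5.20)
    C = 1.0 - (a[None, :] / a[:, None]) * np.exp(
        -1j * omega * (delta[None, :] - delta[:, None]))

    scores = np.empty(N)
    for g in range(N):
        Y_pred = C[:, g] * X1                       # predicted Y_i if only g is active
        scores[g] = np.sum(np.abs(Y - Y_pred)) / np.sum(np.abs(Y))
    return scores                                   # small scores[g] => source g alone fits

# Toy check: only the source with index 1 is active at this bin
a, delta, omega = [1.0, 0.8, 0.6], [0.0, 1e-4, 2e-4], 2 * np.pi * 500
S_active = 0.7 + 0.2j
X1 = S_active
X2 = a[1] * np.exp(-1j * omega * delta[1]) * S_active
print(dasss_scores(X1, X2, omega, a, delta))        # the entry for g = 1 is ~0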


    Chapter 5. Adaptive Time-Frequency Representation

    Time-frequency representations describe signals in terms of their frequency content at a given

    time. These representations are useful for analyzing signals varying both in time and

    frequency. For speech and music signals where we have continuously time-varying frequency

    content, frequency domain representations cannot be used because they only give spectral

    information and no time information i.e. they fail to convey when, in time, the different events

    are occurring in the signal. The short-time Fourier transform is one of the most widely used

    approaches to time-frequency analysis.

In the case of sparsity-based source separation techniques, all the processing, i.e. the

    estimation of mixing parameters and the estimation of the sources is carried out in the time-

    frequency domain. The major assumption in such source separation techniques is that the

    underlying sources are W-disjoint orthogonal i.e. only one source is active in every time-

    frequency bin. But practically such an assumption is not always valid; so it is at least assumed

    that at every time-frequency bin only one of the sources has dominant energy. It is expected

    that if the time-frequency representation of the mixture is sparse, then the assumption of the

    W-disjoint orthogonality can be satisfied to a greater extent. One such time-frequency

representation which provides a sparse representation is the short-time Fourier transform.

    5.1. Short-time Fourier transform

    The short-time Fourier transform is the most widely used method for studying non-stationary

    signals like music/speech. The concept behind it is simple and powerful. Suppose we listen to

    a piece of music that lasts an hour where in the beginning there are violins and at the end

    drums. If we Fourier analyze the whole hour, the energy spectrum will show peaks at the

    frequencies corresponding to the violins and drums. That will tell us that there were violins

    and drums but will not give us any indication of when the violins and drums were played. The

    most straightforward thing to do is to break up the hour into five minute segments and Fourier

    analyze each interval. Upon examining the spectrum of each segment we will see in which

    five minute intervals the violins and drums occurred. If we want to localize even better, we

    break up the hour into one minute segments or even smaller time intervals and Fourier

    analyze each segment. That is the basic idea of the short-time Fourier transform – break up

    the signal into small time segments and Fourier analyze each time segment to ascertain the


    frequencies that existed in that segment. The totality of such spectra indicates how the

    spectrum is varying in time.

    The short-time Fourier Transform (STFT) of the signal x(t) is defined as [17],

X^{W}(\omega, \tau) = \int_{-\infty}^{\infty} x(t)\, W(t - \tau)\, e^{-i\omega t}\, dt \qquad (6.1)

    where W(t) is the window function. W(t) can be considered as a window that selects a

    particular portion of the signal centred around the given time location, and the Fourier

    transform of the windowed signal yields the frequency content of the signal at the given time.

    If we want good time localization we have to pick a narrow window in the time domain, and

    if we want good frequency localization we have to pick a narrow window in the frequency

    domain (i.e. a long window in time domain). But both the time domain window and the

    frequency domain window cannot be made arbitrarily narrow; hence there is an inherent

    trade-off between time and frequency localization in the spectrogram for a particular window.

    The degree of trade-off depends on the window, signal, time, and frequency. We have just

    seen that one window, in general, cannot give good time and frequency localization. That

    should not cause any problem of principle as long as we look at the spectrogram as a tool at

    our disposal that has many options including the choice of window. There is no reason why

    we cannot change the window depending on what we want to study. That can sometimes be

    done effectively, but not always. Sometimes a compromise window does very well.
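This trade-off can be read directly off the STFT parameters. The following sketch compares a short and a long analysis window on a toy signal containing both a steady tone and a click; the printed numbers are the frequency-bin spacing and the time step of the resulting representations.

import numpy as np
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs
# A steady tone plus a short click: the tone favours a long window,
# the click favours a short one
x = np.sin(2 * np.pi * 1000 * t)
x[8000] += 5.0

for nperseg in (256, 2048):                         # roughly 16 ms vs 128 ms windows
    f, tt, X = stft(x, fs=fs, nperseg=nperseg)
    print(f"window {nperseg:4d} samples: "
          f"frequency resolution {f[1] - f[0]:6.1f} Hz, "
          f"time step {(tt[1] - tt[0]) * 1000:5.1f} ms")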

    5.2. Need for data-adaptive time-frequency representations

    Most algorithms for underdetermined separation as mentioned earlier are based on the

    assumption that the signals are sparse in some domain. In most cases, the sparser the sources,

    the less they will overlap when mixed (i.e. the more disjoint their mixture will be), and

    consequently the easier their separation will be. The most widely used transform for the

    purpose of sparsification in the context of blind source separation has been the STFT. The

    choice of the window significantly affects the signal concentration in the STFT. In fact, the

    uniform frequency and time resolution it offers are disadvantageous for the task of speech or

music separation. When the pitch of a source varies even slightly, the frequency variation of its higher harmonics is proportionally larger. This amplified variation of the higher harmonic frequencies can be

    accurately tracked with shorter windows. Hence, a variation in the window duration with

    frequency is expected to result in a more concentrated representation for some signals.


    A second problem is that for signals having several different components occurring at

    different instants, the best window differs for each component. A sparse representation is

    obtained for the harmonic and impulsive parts of the signal by analyzing the segments with a

    long and short window respectively. Thus the fact that different windows are appropriate for

    different signal components suggests the use of a data-dependent time-and-frequency-varying

    window function for analysis to achieve a high concentration and resolution of any signal

    component present at any time-frequency location.

    5.3. Data-adaptive time-frequency representations

    As discussed in the previous section, the STFT performs well in terms of the concentration

    and resolution of a given signal component when a properly chosen window is used. An

    adaptive time-frequency representation was proposed in [4] by D. L. Jones and T. W. Parks.

    In order to track the different signal components we need to have an adaptive window whose

    parameters are dependent on the signal. The window function selected for this purpose was

    the Gaussian function. Thus there are two parameters (the real and the imaginary part of the

    Gaussian parameter) of the window function that are equivalent in terms of their fundamental

    time-frequency concentration. They differ, though, in the time-frequency concentration they

    provide for a particular signal component. A local signal concentration measure is used to

    compute the Gaussian window parameter to achieve maximum concentration of the locally

    dominant signal component at every time-frequency location. This procedure automates the

    choice of the window and thus overcomes the problem of window selection in the short-time

    Fourier transform. The adaptive time-frequency representation of a signal x(t) is given as [4]

AX(t, \omega) = \int_{-\infty}^{\infty} x(\tau)\, \left( \frac{2\, \Re[C_{t,\omega}]}{\pi} \right)^{1/4} e^{-C_{t,\omega} (\tau - t)^2}\, e^{-i\omega\tau}\, d\tau \qquad (6.2)

which projects the signal x(τ) onto the unit-energy Gaussian basis elements

g_{t,\omega}(\tau) = \left( \frac{2\, \Re[C_{t,\omega}]}{\pi} \right)^{1/4} e^{-C_{t,\omega} (\tau - t)^2}\, e^{i\omega\tau} \qquad (6.3)

where C_{t,\omega} is the Gaussian parameter, which can vary with time and frequency. The adaptive time-frequency representation in equation (6.2) differs from the STFT with a Gaussian window in

    that the Gaussian parameter Ct,ω may vary with time and frequency. The basic idea behind the

adaptive time-frequency representation is that the extra degrees of freedom, namely the real

    and the imaginary parts of the Gaussian parameter at every time-frequency location, can

    improve the performance over that of a fixed-window STFT. The performance of the adaptive


    time-frequency representation depends on the selection of the adaptive Gaussian parameters.

For each time-frequency location, the Gaussian parameter is selected so as to maximize a local concentration measure. The following local concentration measure was used in [4]:

\Gamma = \frac{\displaystyle \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} |AX(t, \omega)|^{4}\, dt\, d\omega}{\left( \displaystyle \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} |AX(t, \omega)|^{2}\, dt\, d\omega \right)^{2}} \qquad (6.4)

which is the fourth power of the L4 norm divided by the squared L2 norm of the magnitude of

    the short-time Fourier transform. This measure is very similar to kurtosis in statistics and to

    other equivalent measures of peakedness or sharpness.
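A discrete version of this measure is straightforward to compute from an STFT matrix. The sketch below, using a steady tone as a toy input, shows that the measure grows as the window length becomes better matched to the (stationary) signal.

import numpy as np
from scipy.signal import stft

def concentration(X):
    """Concentration measure of equation (6.4): fourth power of the L4 norm of
    |X| divided by the squared L2 norm (a kurtosis-like peakedness measure)."""
    mag = np.abs(X)
    return np.sum(mag ** 4) / np.sum(mag ** 2) ** 2

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)                     # a steady tone

# A longer window concentrates a steady tone into fewer bins, so the measure rises
for nperseg in (256, 1024, 4096):
    _, _, X = stft(x, fs=fs, nperseg=nperseg)
    print(nperseg, concentration(X))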

    Motivated from this concept of data-adaptive time-frequency representation which provides

    better resolution as compared to fixed-window STFT, we can apply this concept of

    adaptiveness for obtaining sparser time-frequency representations for the application of blind

    source separation. The requirement in the blind source separation problem is that the time-

    frequency representations of the mixtures are as sparse as possible so that the underlying

    sources satisfy the W-disjoint orthogonality criterion to a greater extent. So here the

    adaptiveness can be used to maximize sparsity of the time-frequency representation.

    Most real world signals are essentially stationary over short intervals of time. Consequently, a

    sparse representation could be obtained by analyzing each frame with a window that has been

    optimized for the frame. Long windows give a sparser representation for frames containing

    steady frequency components than when shorter windows are used. On the contrary, the time-

frequency representation of impulses or onsets of events is sparser with short windows. It has also been observed that simply varying the length of the analysis window changes the sparsity of the time-frequency representation [21], and usually there exists an optimum

    length for which the sparsity is maximum. But the selection of optimum analysis window

    length depends on the signal.

    So instead of adapting the time-frequency representation at every time-frequency location as

    in [4], for the application of blind source separation the adaptation can be restricted to only

    time i.e. different analysis window lengths can be used for different time-instants. The reason

    for restricting the adaptation to only time is because the blind source separation application

    demands reconstruction of the time-frequency representations for estimation of the source

    signals in time domain (this problem would be discussed in detail in the next section). Now


    the next problem that needs to be addressed is what concentration measure to use for

the adaptation process, i.e. on what criterion to select the optimum analysis window length. Some of the commonly used sparsity measures, like kurtosis and the Gini index, can be used for this purpose. This aspect is investigated thoroughly in the next chapter.

    The adaptive transformations used to obtain the time-frequency representation are non-linear

    i.e. the representation for the sum of two signals might not be equal to the sum of the time-

    frequency representations of the individual signals. This depends on the window sequence

    chosen for obtaining the time-frequency components in each of the signals. However, if the

    same sequence of windows is used to obtain the time-frequency representations for the

    mixture as well as the individual signals, the transformation can be considered to be linear.

    This linearity property is vital during the estimation of sources in the source separation

    algorithm.

    5.3.1. Steps to obtain a data-adaptive time-frequency representation of a signal

The procedure to obtain the data-adaptive time-frequency representation is as follows (a minimal code sketch is given after the list):

    a) first select the set of analysis window sizes to be used for adaptation purpose (say 30

    ms, 60 ms, 90 ms)

    b) now for a particular time-instant, using an analysis window select a portion of the

    signal and then Fourier analyze the selected signal

    c) repeat step (b) using all the analysis window sizes selected for the purpose of

    adaptation

d) once the Fourier spectra have been obtained using all the different analysis windows selected for adaptation, use an appropriate concentration measure to select the optimal Fourier spectrum, i.e. the one which gives the best resolution (which concentration measures to use for the purpose of adaptation is discussed in chapter 7); furthermore, the adaptation can be carried out on various bands of frequencies depending on the requirement

    e) note down the analysis window size that was used for obtaining the best resolution

    f) then based on the analysis window size used for this time-instant and the technique to

be used for reconstruction of the signal (discussed in section 6.4), decide an

    appropriate hop size (i.e. step size) and proceed to the next time-instant

    g) finally at the new time-instant, follow the adaptation procedure discussed in steps (b),

    (c), (d), (e) and (f) until the end of the signal is reached
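A minimal sketch of this frame-based adaptation is given below. It assumes kurtosis as the concentration measure, a fixed 10 ms hop and the 30/60/90 ms window set of Figure 5.1; the hop-size adjustment of step (f) and the reconstruction itself are omitted.

import numpy as np
from scipy.stats import kurtosis

def adaptive_window_track(x, fs, win_sizes_ms=(30, 60, 90), hop_ms=10,
                          band=(1000, 3000)):
    """For each frame, pick the analysis window length whose magnitude spectrum
    has the highest kurtosis inside the given frequency band, following steps
    (a)-(g). Returns the chosen window length (in ms) for every frame."""
    hop = int(hop_ms * fs / 1000)
    win_samples = [int(w * fs / 1000) for w in win_sizes_ms]
    n_frames = max(0, (len(x) - max(win_samples)) // hop)

    chosen = []
    for n in range(n_frames):
        centre = n * hop + max(win_samples) // 2
        scores = []
        for w in win_samples:
            start = centre - w // 2
            frame = x[start:start + w] * np.hamming(w)
            spec = np.abs(np.fft.rfft(frame))
            freqs = np.fft.rfftfreq(w, d=1.0 / fs)
            in_band = (freqs >= band[0]) & (freqs <= band[1])
            scores.append(kurtosis(spec[in_band]))
        chosen.append(win_sizes_ms[int(np.argmax(scores))])
    return chosen

# Example with a synthetic signal: a steady tone followed by noise
fs = 16000
t = np.arange(fs) / fs
x = np.concatenate([np.sin(2 * np.pi * 1500 * t),
                    0.3 * np.random.default_rng(0).standard_normal(fs)])
print(adaptive_window_track(x, fs)[:10])            # chosen window size (ms) per frame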


    Figure 5.1: Data-adaptive time-frequency representation of a singing voice using frame-based adaptation (window function: hamming; window sets for adaptation: 30, 60 and 90 ms; hop size: 10 ms; concentration measure: kurtosis; adaptation region: 1000 to 3000 Hz)

Figure 5.1 shows the data-adaptive time-frequency representation of a singing voice obtained using the above-mentioned procedure. The red dashed line shows the window size selected for each frame. The window function used is Hamming, the window sizes used for the adaptation are 30, 60 and 90 ms, the hop size used is 10 ms, the concentration measure used for adaptation is kurtosis, and the region over which the adaptation is carried out is 1000 to 3000 Hz.