
Towards Volumetric Room Impulse Response Modeling

Jasper Maes
Student number: 01404236

Supervisors: Dr. ir. Glenn Van Wallendael, Prof. dr. ir. Nilesh Madhu
Counsellors: Ir. Martijn Courteaux, Ir. Ruben Verhack

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Information Engineering Technology

Academic year 2018-2019

Permission of use on loan

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.

11/6/2019

Acknowledgements

I would especially like to thank my counsellors, ir. Ruben Verhack and ir. Martijn Courteaux, who were always ready with advice whenever I had questions or troubles with my research. I would also like to thank my supervisor, Prof. dr. ir. Nilesh Madhu, for sharing his insight and providing me with exceptionally helpful comments and suggestions. The regular meetings and brainstorms were very helpful for making tremendous progress.

Furthermore, I would like to thank my friends and family who supported me during this thesis. Last but not least, I would like to thank my mom for reading and correcting this thesis and guiding me through the magical world of Microsoft Excel.

Towards Volumetric Room Impulse Response Modeling

Jasper Maes

Supervisors: Dr. ir. Glenn Van Wallendael, Prof. dr. ir. Nilesh Madhu
Counsellors: Ir. Martijn Courteaux, Ir. Ruben Verhack

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Information Engineering Technology


Abstract

This thesis explores a novel representation of a Room Impulse Response (RIR). This representation should enable the creation of a volumetric room acoustic model using the novel Steered Mixture-of-Experts (SMoE) framework. By querying this model, a real-time continuous binaural impression of the virtual room is obtained by convolving the reproduced RIR with the source sound.

The work towards the long-term goal described above is done by presenting a two-way transformation from a time series to a liftered cepstrogram and evaluating it using three metrics and binaural listening tests. The usability of the suggested metrics for evaluating the reproduction of a RIR is also assessed.

Keywords: Room impulse response, room acoustics modeling, Steered Mixture-of-Experts, STOI, POLQA, PESQ

A Preliminary Study on Modeling Volumetric Room Impulse Responses

Jasper Maes

Supervisor(s): prof. dr. ir. Nilesh Madhu, dr. ir. Glenn Van Wallendael, ir. Martijn Courteaux, ir. Ruben Verhack

Abstract: This thesis investigates a new representation of a Room Impulse Response (RIR) that exhibits high correlation throughout the room. A conversion method is developed and evaluated both objectively and subjectively. For the objective evaluation, three metrics from the speech evaluation domain are used and are themselves assessed. The subjective evaluation is carried out through binaural listening tests performed by two experts.

Keywords: Room impulse response, room acoustics modeling, Steered Mixture-of-Experts, STOI, POLQA, PESQ

I. INTRODUCTION

In recent years, the applications of Virtual Reality (VR) have spread ever further across different domains, ranging from gaming and cinema to advertising, health care, design and so on. Along with the number of applications, the number of devices and VR platforms is steadily rising, which in turn drives research into the possibilities of these new media. The development of the head-mounted display opened up a new world in which the user, by being able to move through the virtual environment, can explore a previously unknown number of degrees of freedom (6 Degrees of Freedom (DoF)). However, although computer-generated 6DoF content is already ubiquitous (e.g. 3D gaming), it does not succeed in creating the same real-life immersive experience as a virtual scene based on camera images.

With the development of Steered Mixture-of-Experts (SMoE), Ghent University (UGent) is investigating the possibilities of creating a realistic, camera-based virtual world in real time. This master's dissertation fits within that context and addresses one aspect of the goal of also using the SMoE platform to model the virtual acoustic experience.

In contrast to traditional computational models based on geometrical acoustics, the intention is for SMoE to develop a volumetric room acoustic model through an iterative learning process based on actually recorded room impulse responses. As a first step, this dissertation investigates whether an effective representation can be developed that allows SMoE to identify correlation patterns between neighboring impulse responses.

II. PROPOSED COMPUTATION METHOD

A. Metrics

A major challenge throughout the design of the representation is the evaluation of the reproduced RIRs. The ultimate goal of the representation and the accompanying conversion method is to reconstruct the original RIR as accurately as possible. It goes without saying that a reconstruction that is audibly indistinguishable from the original is considered perfect, even if it is not numerically identical.

The most accurate evaluation method is therefore based on perceptual listening. However, this method is very time-consuming, so it cannot be applied to evaluate a large dataset and draw reliable conclusions from it. For this reason, an alternative, objective method based on metrics was sought. In this study, the following metrics from the field of speech analysis were selected:
• Perceptual Evaluation of Speech Quality (PESQ)
• Perceptual Objective Listening Quality Analysis (POLQA)
• Short-Time Objective Intelligibility measure (STOI)
The accuracy of the three chosen metrics is evaluated in a second, more limited listening experiment.

B. Proposed representation

Since this dissertation is preparatory work towards a volumetric room impulse response model built by SMoE, a representation has to be designed that on the one hand makes optimal use of the strengths of the SMoE framework and on the other hand is as robust as possible against the artifacts that SMoE introduces. The strength of SMoE lies in the compact representation of large amounts of high-dimensional, strongly correlated data [1].

In the time domain, RIRs are very erratic in nature; in other words, they are not correlated. This means that SMoE will not be able to compress RIRs well. To exploit the strength of SMoE as well as possible, a high-dimensional representation of RIRs is sought in which the physical dimensions are strongly correlated. That is, all RIRs that lie physically close to each other should be transformed into very similar and continuously varying numerical representations. The physical dimensions refer to the position of the receiver (3 dimensions) and the position of the source (3 dimensions). Additional logical physical dimensions can be added so that the representation exhibits more correlation (such as time), or so that extra dimensions of continuity can be exposed (such as the orientation of the receiver or source).

The proposed representation converts a time-domain RIR to the cepstral domain. A cepstrum is the spectrum of the logarithmic amplitude spectrum of a time signal. The term is an anagram of spectrum in which the first four letters are reversed. The terminology surrounding the cepstrum consists of similar anagrams of the corresponding spectral terms, such as quefrency (frequency), liftering (filtering) and so on. The resulting representation can be found in the literature as a liftered cepstrogram. A cepstrogram is a transformation of a spectrogram from the spectral domain to the cepstral domain. The conversion to this representation is explained below.

Fig. 1. Visual representation of the forward operation. Each step is referred to by its corresponding number (x).

C. Forward operation

The forward operation describes the conversion of the original time-domain RIR into a liftered cepstrogram and is shown in Fig. 1. It is a sequence of separate steps that successively transform the RIR into a new intermediate representation. These steps are:
• (1): Zeros are added at the front and back so that the RIR is fully captured by overlapping chunks.
• (2): The extended signal is then split into overlapping chunks of size 512. An important parameter of this split is the frame shift s between two consecutive chunks, i.e. the relative distance between the starts of these two chunks.
• (3): Each chunk is multiplied by a square-root Hann window to reduce the spectral leakage caused by the use of the Discrete Fourier Transform (DFT) [2] [3].
• (4): The spectrum of each time-domain chunk is computed using the Fast Fourier Transform (FFT), a fast implementation of the DFT. The result of this step is a complex spectrogram.
• (5): An amplitude spectrogram is obtained by taking the absolute values of the complex spectrogram.
• (6): The logarithm of the amplitude spectrogram is taken to obtain a logarithmic amplitude spectrogram.
• (7): The logarithmic amplitude spectrogram is converted to the cepstral domain using the FFT. A cepstrogram is now obtained.
• (8): The cepstrogram is compressed by keeping the first c quefrencies of each cepstrum and discarding the others [4]. This step is essentially a low-pass liftering step.
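The eight steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the padding rule, the guard value eps and the toy RIR are all hypothetical, and step (7) is realized by mirroring the log-amplitude spectrum into a real, even sequence before applying the FFT.

```python
import numpy as np

FRAME = 512  # chunk size stated in the text


def sqrt_hann(n=FRAME):
    """Square root of a periodic Hann window (step 3)."""
    return np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n) / n))


def forward(rir, shift=0.25, c=150, eps=1e-12):
    """Time-domain RIR -> liftered cepstrogram of shape (frames, c)."""
    rir = np.asarray(rir, dtype=float)
    hop = int(FRAME * shift)  # frame shift s expressed in samples
    pad = FRAME - hop         # assumed amount of zero padding per side
    # (1) zero-pad front and back so that overlapping chunks cover the RIR
    n_frames = int(np.ceil((len(rir) + 2 * pad - FRAME) / hop)) + 1
    padded = np.zeros((n_frames - 1) * hop + FRAME)
    padded[pad:pad + len(rir)] = rir
    # (2) split into overlapping chunks of 512 samples
    idx = np.arange(FRAME)[None, :] + hop * np.arange(n_frames)[:, None]
    chunks = padded[idx]
    # (3) window and (4) FFT -> complex spectrogram with 257 bins per chunk
    spectrogram = np.fft.rfft(chunks * sqrt_hann(), axis=1)
    # (5) amplitude and (6) logarithm; eps is an added guard against log(0)
    log_amplitude = np.log(np.abs(spectrogram) + eps)
    # (7) FFT of the mirrored (real, even) log spectrum -> real cepstrum
    mirrored = np.concatenate(
        [log_amplitude, log_amplitude[:, -2:0:-1]], axis=1)
    cepstrogram = np.fft.rfft(mirrored, axis=1).real
    # (8) low-pass liftering: keep the first c quefrencies of each cepstrum
    return cepstrogram[:, :c]


# Toy example: exponentially decaying noise as a stand-in for a measured RIR.
rng = np.random.default_rng(0)
toy_rir = rng.standard_normal(4000) * np.exp(-np.arange(4000) / 800.0)
liftered = forward(toy_rir, shift=0.25, c=150)
print(liftered.shape)
```

Each row of the result is one liftered cepstrum; the number of rows depends on the RIR length and the frame shift s, the number of columns on c.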

D. Phase retrieval

In the forward operation, the phase information is lost by taking the absolute value of the intermediate spectrogram. Consequently, this phase information is not stored in the proposed representation. However, when the RIR has to be converted back to the time domain, phase information is needed to obtain the correct timing. A method is therefore needed to recover this phase from the proposed representation. The proposed method is based on the iterative Griffin-Lim algorithm [5].

Fig. 2. Schematic representation of the proposed phase retrieval method. Here |F| denotes the amplitude spectrogram, ϕ the phase spectrogram, s the obtained time signal (RIR) and S the reproduced complex spectrogram.

This iterative back-and-forth method, shown schematically in Fig. 2, repeatedly transforms from a complex spectrogram to a time signal (steps (5), (6), (7) in Fig. 3) and back (steps (2), (3), (4) in Fig. 1), which eventually converges to a time signal whose amplitude spectrogram matches the initial spectrogram. Each time the transformation from time signal to spectrogram is made, the resulting phase spectrogram is multiplied with the original amplitude spectrogram, so that each time a little more phase information is derived from this time signal. By repeatedly combining this new phase information with the original amplitude spectrogram, the phase is forced to converge to a phase spectrogram that is consistent with it.

An important choice in this step is the number of iterations i before the resulting time signal is returned. The more iterations are performed, the closer the phase gets to the convergence point and hence the better the result of the method. An iteration is an expensive operation, however, so ideally this variable is kept as small as possible.
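The back-and-forth iteration can be illustrated with a plain Griffin-Lim loop. The sketch below is an assumption-laden illustration rather than the method proposed here, which is only based on Griffin-Lim: it uses a fixed frame shift of 1/4, a random initial phase, a constant overlap-add normalization that relies on zero padding at both ends, and hypothetical helper names (stft, istft, retrieve_phase, spectral_error).

```python
import numpy as np

FRAME, HOP = 512, 128  # chunk size from the text; frame shift s = 1/4 (assumed)
WIN = np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(FRAME) / FRAME))


def stft(x):
    """Time signal -> complex spectrogram (chunk, window, FFT)."""
    n = (len(x) - FRAME) // HOP + 1
    idx = np.arange(FRAME)[None, :] + HOP * np.arange(n)[:, None]
    return np.fft.rfft(x[idx] * WIN, axis=1)


def istft(spec):
    """Complex spectrogram -> time signal (inverse FFT, window, overlap-add)."""
    frames = np.fft.irfft(spec, n=FRAME, axis=1) * WIN
    out = np.zeros((len(spec) - 1) * HOP + FRAME)
    for k, frame in enumerate(frames):
        out[k * HOP:k * HOP + FRAME] += frame
    # Squared sqrt-Hann windows at this hop sum to FRAME / (2 * HOP) away
    # from the edges, so a constant normalization suffices for padded input.
    return out / (FRAME / (2 * HOP))


def retrieve_phase(magnitude, iterations=50, seed=0):
    """Estimate a time signal whose amplitude spectrogram matches `magnitude`."""
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))  # random start
    for _ in range(iterations):
        signal = istft(magnitude * phase)            # spectrogram -> time
        phase = np.exp(1j * np.angle(stft(signal)))  # time -> new phase
    return istft(magnitude * phase)


def spectral_error(magnitude, signal):
    """Relative distance between target and achieved amplitude spectrogram."""
    achieved = np.abs(stft(signal))
    return np.linalg.norm(achieved - magnitude) / np.linalg.norm(magnitude)


# Toy example: zero-padded decaying noise as a stand-in for a measured RIR.
rng = np.random.default_rng(1)
x = np.zeros(4096)
x[384:3384] = rng.standard_normal(3000) * np.exp(-np.arange(3000) / 500.0)
target = np.abs(stft(x))
for i in (1, 30):
    print(i, spectral_error(target, retrieve_phase(target, iterations=i)))
```

The printed error gives a rough impression of the trade-off described above between the number of iterations i and the quality of the retrieved phase.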

E. Reverse operation

The reverse operation describes the conversion of a liftered cepstrogram representation into a reproduced time-domain RIR. As can be seen in the schematic representation of this operation in Fig. 3, it consists almost exactly of the inverse steps of the forward operation, applied in reverse order. Some non-trivial steps of this reverse operation are discussed below.

Fig. 3. Visual representation of the reverse operation. Note the analogy with the representation of the forward operation.

In step (1), the liftered cepstrogram is padded with zeros so that each cepstrum contains 257 coefficients instead of c coefficients. Mirroring around coefficient 257 then yields a real, even signal, so that a real signal is obtained again after applying the FFT.
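The zero padding and mirroring of step (1) can be sketched as follows. The function names and the 1/512 scaling are assumptions chosen so that the sketch inverts an unscaled FFT in the forward direction; they are not taken from the original implementation.

```python
import numpy as np

FRAME = 512
BINS = FRAME // 2 + 1  # 257 unique coefficients per cepstrum


def mirror(half):
    """257 values -> real, even sequence of length 512 (mirror at index 256)."""
    return np.concatenate([half, half[-2:0:-1]])


def cepstrum_to_log_amplitude(liftered):
    """Liftered cepstrum with c coefficients -> log-amplitude spectrum (257 bins)."""
    padded = np.zeros(BINS)
    padded[:len(liftered)] = liftered  # zero-pad from c to 257 coefficients
    # The FFT of a real, even sequence is real; the imaginary part is round-off.
    return np.fft.fft(mirror(padded)).real[:BINS] / FRAME


# Round trip with c = 257, for which no information is discarded.
rng = np.random.default_rng(0)
log_amplitude = np.log(np.abs(np.fft.rfft(rng.standard_normal(FRAME))) + 1e-12)
cepstrum = np.fft.fft(mirror(log_amplitude)).real[:BINS]
restored = cepstrum_to_log_amplitude(cepstrum)
smoothed = cepstrum_to_log_amplitude(cepstrum[:75])  # liftered with c = 75
```

With all 257 coefficients the log-amplitude spectrum is restored up to round-off, while a smaller c yields a smoothed spectral envelope.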

The reproduced phase spectrogram, obtained by applying the phase retrieval method to the reproduced amplitude spectrogram, is multiplied with this reproduced amplitude spectrogram in step (4).

The final step (9) of the reverse operation is not strictly necessary to reproduce correct RIRs. It is nevertheless carried out to eliminate unwanted artifacts introduced by the proposed computation method, such as pre-echo. When the reproduction of the RIR is of insufficient quality, the characteristic initial peak of the impulse is not well pronounced. This step shapes the impulse artificially by multiplying all values before the absolute maximum of the RIR with a suitable exponential window function.
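A possible form of this final smoothing step is sketched below. The exact window used in this work is not specified here, so the rising exponential and its time constant tau (in samples) are hypothetical choices made for illustration only.

```python
import numpy as np


def suppress_pre_echo(rir, tau=8.0):
    """Attenuate all samples before the absolute maximum of the RIR."""
    out = np.array(rir, dtype=float)
    peak = int(np.argmax(np.abs(out)))
    n = np.arange(peak)
    # Window exp(-(peak - n) / tau): close to 0 far before the peak and
    # approaching 1 just before it; samples from the peak on are untouched.
    out[:peak] *= np.exp(-(peak - n) / tau)
    return out


# Toy RIR with an artificial pre-echo plateau before the main peak.
toy = np.zeros(200)
toy[:50] = 0.1
toy[50] = 1.0
toy[51:] = 0.5 * np.exp(-np.arange(149) / 30.0)
cleaned = suppress_pre_echo(toy)
```

A smaller tau suppresses the pre-echo more aggressively, at the cost of a sharper artificial onset.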

III. EXPERIMENTS

A. Objective experiments using metrics

To evaluate the effectiveness of the proposed computation method, as well as the impact of the three identified variables (frame shift s, number of retained cepstrum coefficients c and number of iterations i), three experimental configurations were set up:
• In experiment A, the amplitude is reproduced using the forward and reverse operations. The phase information of the original impulse response is added to the reproduction. During reproduction, the number of cepstrum coefficients c is varied between 25 and 200. For verification, c = 257 is also included, which allows a perfect reproduction.
• In experiment F, the phase is retrieved and combined with the original amplitude spectrogram. During phase retrieval, the number of iterations i is varied between 50 and 800.
• In experiment FA, the entire computation method is carried out. The respective variations in c and i mentioned above are maintained.

These three experiments are applied to 10 prerecorded room impulse responses that were convolved with 50 different sound files. For each of these files in each of the experiments described above, three different frame shift values are evaluated, namely s = 1/8, s = 1/4 and s = 1/2. Each result is then scored individually using the three metrics mentioned earlier, namely PESQ, POLQA and STOI.

A striking observation in the analysis of the scores is that the results of the A experiments are particularly high, regardless of the selected values for s and c. The scores of the F and FA experiments are considerably lower for each of the selected metrics, which seems to suggest that the method allows a perfect reconstruction of the amplitude but that the auditory quality of the phase retrieval is insufficient.

Experiment                 Direction   Distance   Reverberation
A (c = 150)                   10          10            8
A (c = 75)                    10          10            7
F+ (i = 800)                  10          10            9
F+ (i = 50)                    9           9            7
F-                             2           4            5
FA- (c = 150, i = 800)         2           4            4
FA+ (c = 75, i = 800)          8           7            6
FA+ (c = 150, i = 800)         9           8            7
FA+ (c = 150, i = 50)          7           7            6

Fig. 4. Summary of the auditory assessment.

B. Spatial listening tests

To also evaluate the metric-based findings by ear, an additional test program was set up. The experimental methods A, F and FA described above were used again. However, in order to include the spatial experience of listening to the reproduced fragments in the evaluation, two reproduced RIRs were binaurally synthesized into BRIRs (binaural room impulse responses). Essential to this synthesis is passing the phase information of the first RIR, located closest to the source, on to the second RIR. This simulates the impact of the distance between one ear and the other, which is why two prerecorded RIRs are used whose microphones are 16 cm apart. To limit the number of sound fragments to listen to, the values of the variables were restricted to s = 1/8, i = [50, 800] and c = [75, 100, 150].

The summary of the auditory evaluation is shown in Fig. 4. When listening to the various sound fragments resulting from the three experiments, it first appeared that the reproduced fragments from the A and F experiments were perceived as equivalent in terms of spatial experience. In addition, the auditory quality in terms of perceived distance and orientation relative to the sound source was perceived as fully equal to that of the original sound files.

In contrast, the overall spatial-auditory quality of the complete reproduction method (experiment FA) was perceived as slightly worse, because the distance to the source was perceived less precisely.

An important finding for the F and FA experiments, however, is that whether or not the phase passing is applied when constructing the BRIRs is of decisive importance and has a much larger impact than the variation of the other variables i and c. It was also noted that in all experiments the original quality of the reverberation could not be fully reproduced.

Finally, it was found that the auditory assessments could not sufficiently confirm the findings based on the selected metrics PESQ, POLQA and STOI, so the use of these metrics was judged to be suboptimal.

IV. CONCLUSIONS AND FUTURE WORK

In general, the auditory assessments confirm the potential of the proposed computation method to adequately reproduce prerecorded RIRs. The impact of phase retrieval on quality that was observed with the metrics was eliminated in the spatial listening tests by passing the phase during binaural synthesis. The use of the metrics PESQ, POLQA and STOI is considered suboptimal.

As a first step, it is proposed to actually integrate the proposed computation method into the SMoE framework and to further evaluate the impact of the averaging effect of this framework on the computation method. If the method proves to be sufficiently robust, an in-depth statistical study can validate the use of the method and the optimization of the variables with a view to integration in SMoE.

Finally, it is also recommended to further explore the use of phase passing in multiple dimensions (e.g. when moving through the virtual room) within the SMoE framework.

REFERENCES

[1] R. Verhack, T. Sikora, G. Van Wallendael, and P. Lambert, Steered Mixture-of-Experts for Light Field Images and Video: Representation and Coding, submitted to IEEE Transactions on Multimedia, 2019

[2] F. J. Harris, On the use of windows for harmonic analysis with the discrete Fourier transform, Proceedings of the IEEE, vol. 66, no. 1, pp. 51-83, Jan 1978

[3] J. Wung, D. Giacobello, and J. Atkins, Robust acoustic echo cancellation in the short-time Fourier transform domain using adaptive crossband filters, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, May 2014

[4] T. Nishino, F. Saito, K. Itou, and K. Takeda, Modeling of a Room Impulse Response with Cepstrum Analysis, Forum Acusticum 2005, RBA-AS, Paper No. 447, pp. 1887-1890, September 2005

[5] D. Griffin and Jae Lim, Signal estimation from modified short-time Fourier transform, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236-243, April 1984

Contents

List of Figures
List of Tables
Acronyms

1 Introduction
1.1 Understanding room acoustic properties
1.2 Current technology
1.3 Steered Mixture-of-Experts (SMoE)
1.4 Position in the process flow of virtual acoustics
1.4.1 Room data collection
1.4.2 Conversion
1.4.3 Modeling
1.4.4 Reproduction
1.5 Final goal

2 Proposed computation method
2.1 Evaluation methods and metrics
2.2 Forward operation
2.2.1 Generating spectrogram
2.2.2 Transforming into cepstral domain
2.3 Phase retrieval
2.4 Reverse operation

3 Experiments
3.1 Objective metric experiments
3.1.1 Test setup
3.1.2 Amplitude regeneration experiment (A)
3.1.3 Phase retrieval experiment (F)
3.1.4 Proposed computation method experiment (FA)
3.1.5 General conclusions on metric results
3.2 Auditory spatial tests
3.2.1 Used variables
3.2.2 Conclusions

4 Conclusion
4.1 Conclusion
4.2 Future work

Bibliography

Appendices

List of Figures

1.1 Room Impulse Response
1.2 Virtual acoustics process flow
1.3 Process flow room modelling - data collection
1.4 Process flow room modelling - conversion to cepstral domain
1.5 Process flow room modelling - building the model
1.6 Process flow room modelling - data reproduction
2.1 Consecutive steps of the computation method - forward operation
2.2 Spectrogram example
2.3 Forward operation - generating spectrogram
2.4 Analysis window
2.5 Phase retrieval method
2.6 Consecutive steps of the computation method - reverse operation
2.7 Synthesis windows
2.8 Reverse operation - smoothing
3.1 Amplitude regeneration experiment - schematic diagram
3.2 Boxplot A experiment - PESQ scores
3.3 Boxplot A experiment - POLQA scores
3.4 Boxplot A experiment - STOI scores
3.5 Phase retrieval experiment - schematic diagram
3.6 Boxplot F experiment - PESQ scores
3.7 Boxplot F experiment - POLQA scores
3.8 Boxplot F experiment - STOI scores
3.9 Proposed computation method experiment - schematic diagram
3.10 Boxplots FA experiment - variable c and s
3.11 Boxplots FA experiment - variable i and s
3.12 Setup of the Aachen University recording room
3.13 Histogram of PESQ scores of different experiments
3.14 Histogram of POLQA scores of different experiments
3.15 Histogram of STOI scores of different experiments
3.16 Summary of the perceived quality of the various BRIRs. A subjective rating scale of 1 to 10 has been used.

List of Tables

3.1 Overview of the used RIRs
3.2 Used values for the frame shift s
3.3 Used values for the number of stored cepstrum coefficients per cepstrum c
3.4 Used values for the number of forward-backward iterations i
3.5 Used variables for the spatial tests

Acronyms

BRIR Binaural Room Impulse Response
DFT Discrete Fourier Transform
DoF Degrees of Freedom
FFT Fast Fourier Transform
FT Fourier Transform
GA Geometrical Acoustics
HRTF Head-Related Transfer Function
IFFT Inverse Fast Fourier Transform
ITU International Telecommunication Union
MFC Mel-Frequency Cepstrum
MFCC Mel-Frequency Cepstral Coefficients
PESQ Perceptual Evaluation of Speech Quality
POLQA Perceptual Objective Listening Quality Analysis
RIR Room Impulse Response
SMoE Steered Mixture-of-Experts
STOI Short-Time Objective Intelligibility measure
VA Virtual Acoustics
VR Virtual Reality

1 Introduction

In recent years, interest in Virtual Reality (VR) has risen across multiple fields, ranging from gaming and cinema to advertising, health care, design, cultural heritage, training, remote presence and so on [1]. Content providers are exploring the huge potential that these new media have to offer. There is a growing number of devices and platforms, such as PlayStation VR, YouTube's VR channel and Facebook 360.

Today's virtual reality experiences can be broken down into two basic types, based on the method of production: computer-generated animation and modeling on the one hand, and VR experiences created using a camera to capture real-world images on the other. The combination of VR with head-mounted displays has created a new palette of opportunities and offers more Degrees of Freedom (DoF) than any other form of media content. Although computer-generated 6DoF experiences are well established (e.g. 3D gaming, where the user moves around in the virtual scene), they do not provide the real-life immersive experience of a camera-captured virtual reality.

Today, 360° cameras can be used to capture a real environment. By using software to stitch the images, a single composite 360° video can be created. However, while this technology allows the user to look around in the video by rotating the head, translating the head does not result in a different view of the scene. Developing 6DoF camera-captured VR still requires substantial research, and the development of the Steered Mixture-of-Experts (SMoE) VR framework is situated in this context. On top of this, for any VR and 360° video to be truly immersive, there is a need for convincing spatial audio.

In cinema, audio sources are typically captured separately (on set or afterwards in a studio). In order to reproduce this in a virtual reality setting, these sound sources are positioned in a virtual 3D world, and a VR observer is then presented with a processed version of the source sound depending on their location and orientation.

For many years, a lot of research has been done on room acoustical simulation and spatial audio reconstruction, also known as auralization. However, even with today's most advanced technology, the major challenge in this domain remains its requirement for huge processing capacity, which makes it very difficult to obtain a truly realistic audio experience based on real-time computing.

This study is part of a cluster of research projects set up to support the Steered Mixture-of-Experts (SMoE) VR framework, co-developed at UGent. The purpose of this study is to develop an appropriate representation of the Room Impulse Response (RIR) so that it can be modeled using SMoE and convincing virtual spatial audio can be reproduced.

This thesis is organized as follows. Chapter 1 gives a brief overview of the goal of the research and the current state of technology, and explains the SMoE framework used. Chapter 2 describes the proposed computation method and the metrics used in detail. The experiments and their results are presented and analyzed in Chapter 3. Conclusions are drawn in Chapter 4.

1.1 Understanding room acoustic properties

To fully comprehend this thesis, an understanding of room acoustic properties is required. When a sound is generated in a room, the listener first hears the sound via the direct path from the source (Direct Sound). Shortly after, the listener hears attenuated reflections of the sound off the walls (Early Reflections). Each reflection is in turn further delayed and attenuated as the sound is reflected again and again off the walls (Reverberation). The resulting impulse response is shown in Fig. 1.1.

By definition, the impulse response is the output of a system when it is presented with an impulse. This study concentrates on impulse responses of rooms, which are therefore referred to as Room Impulse Responses (RIRs).
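The definition can be made concrete with a small numerical sketch. The RIR below is entirely synthetic and hypothetical (a direct-sound spike, two early reflections and an exponentially decaying noise tail); the sample rate and all amplitudes and delays are assumptions made for illustration only.

```python
import numpy as np

fs = 16000  # sample rate in Hz (assumed)
rng = np.random.default_rng(0)

# Synthetic RIR of 0.25 s: direct sound, early reflections, late reverberation.
rir = np.zeros(fs // 4)
rir[80] = 1.0    # direct sound arriving after 5 ms
rir[200] = 0.6   # first early reflection
rir[330] = 0.4   # second early reflection
t = np.arange(len(rir) - 480) / fs
rir[480:] += 0.1 * rng.standard_normal(len(t)) * np.exp(-t / 0.05)

# Feeding a unit impulse to the "room" (a convolution) returns the RIR itself.
impulse = np.zeros(16)
impulse[0] = 1.0
response = np.convolve(impulse, rir)[:len(rir)]

# Convolving any dry source signal with the RIR adds the room's acoustics.
source = rng.standard_normal(fs)  # placeholder for a dry recording
wet = np.convolve(source, rir)
```

The same convolution is what turns a reproduced RIR and a source sound into an audible impression of the room.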


Figure 1.1: A typical Room Impulse Response. An impulse response of a room represents the propagation of sound pressure from a source to a receiver. In a room impulse response simulation, the response is typically considered to consist of three separate parts: direct sound, early reflections and late reverberation.

Acoustics can vary considerably across a space: for example, when a speaker is talking in one room and the observer walks through a door into another room, the voice is strongly filtered. The acoustic properties of a room at a specific location are captured in the RIR recorded at that location. A room impulse response can be determined by recording a short, loud sound, such as a clap or a generated frequency sweep (from 20 Hz to 20 kHz), in the room of choice. The recording contains the clap or sweep as well as the spatial reverberation effects of the room at the location of the recording.

1.2 Current technology

As explained further on, the ultimate goal towards which this thesis contributes is to model the acoustic properties of a virtual room using the SMoE framework. For this purpose, an extensive set of RIRs will be prerecorded so that SMoE can calculate a volumetric room impulse response model.

Traditionally, room acoustic models have been calculated by approximating the sound propagation using the assumptions of Geometrical Acoustics (GA) or by numerically solving the wave equation (wave-based approach) [2] [3] [4].


Mathematically, sound propagation is described by the Helmholtz wave equation. The wave-based approach is based on numerically solving the wave equation, for example by use of the finite element method (FEM). The simulation results are very accurate, but the complexity increases drastically with the highest frequency considered. As these techniques are computationally very expensive, it is often more appropriate to resort to faster but less accurate techniques such as those based on GA.

In geometrical acoustics, all of the wave properties of sound are neglected, and sound is assumed to propagate as rays. This assumption is valid at mid and high frequencies, where the wavelength of sound is short compared to surface dimensions and the overall dimensions of the space, but at lower frequencies the approximation errors increase as wave phenomena play a larger role.

The most commonly used ray-based methods are the ray-tracing and the image-source method. The basic distinction between these methods is the way the reflection paths are typically calculated. To model an ideal impulse response, all the possible sound reflection paths should be discovered. The image-source method finds all the paths, but the computational requirements are so high that in practice, only a set of early reflections is computed. Ray-tracing uses the Monte Carlo simulation technique to sample these reflection paths and thus obtains a statistical result. By using this technique, higher order reflections can be searched for, though there are no guarantees that all the paths will be found.

Due to the complementary advantages and disadvantages of ray tracing and image sources, most commercial software packages for room acoustic modelling use a hybrid model in which various techniques are combined.

Another way of predicting room acoustics is by utilizing computer vision techniques [5][6]. Recently, several toolkits have been developed to render spatial audio from the geometry and acoustic material information on VR platforms. 3D models describing both geometry and materials allow for an approximation of real room acoustics for VR environments. For simulating an acoustic environment on VR platforms, a robust recognition method for room geometry and object materials is required, e.g. a convolutional neural network. However, this approach has several drawbacks. Firstly, time and resources are required to process multiple captures of the scene in order to cover a complete scene layout estimation. Secondly, dense geometry makes real-time acoustic simulation impractical because it drastically increases computational complexity and run-time for spatial audio rendering.


1.3 Steered Mixture-of-Experts (SMoE)

The SMoE [7][8] framework is a unifying representation method for high-dimensional image data. Such higher dimensional image data is commonly found in VR applications. Images can be seen as mathematical functions, e.g. a view can be represented by its coordinate in space (3 dimensions) and the orientation (2 angular dimensions) of the virtual camera. The output of the function is then a matrix that contains all the pixel data.

The power of SMoE lies in compactly representing extreme amounts of such high-dimensional pixel data. It does so by representing large coherent sets of pixels as single higher-dimensional image atoms, which can be thought of as a ”pixel 2.0” and are called kernels in the context of SMoE. A pixel seen in one view in fact originates from the same light ray as a pixel in another view. The SMoE framework makes it possible to remove this redundancy by representing all corresponding pixels by a single multi-dimensional entity, i.e. the kernel. SMoE thus represents a large number of pixels using a single pixel 2.0. The corresponding set of pixels is approximated using a linear function, which effectively smooths that set of corresponding pixels over multiple views. It can thus be stated that the artifacts that SMoE produces are mainly edge-aware blurring artifacts. Note that the smoothing is simultaneous over all input dimensions, i.e. 5 dimensions in the given example.

In this thesis the goal is to approximate the desired function that maps a source location and an observer location in space, and possibly an orientation, to a RIR: (Xs, Xo, O) => RIRc. Therefore, the artifacts of SMoE modeling need to be kept in mind. As such, a representation is sought which is robust to smoothing artifacts.

1.4 Position of the study in the process flow of virtual acoustics

To fully understand the purpose of the study, it is important to situate it in the larger process of the creation of Virtual Acoustics (VA). Virtual acoustics is the process of creating spatial sound in a purely virtual situation. To do so, it uses digital input data that is pre-recorded, modelled and reproduced as explained below and shown in Fig. 1.2.


Figure 1.2: Schematic of the virtual acoustics process flow. The gray blocks mark the subject of this study, being the forward operation and backward operation. Vector notation is omitted in the visualization, so Xs is presented as Xs etc.

The explanation of the process flow steps below is supported by Fig. 1.3 to Fig. 1.6. These figures show the location and representation of the known data points in each step.

1.4.1 Room data collection

In the real world, the RIR received by the observer is predominantly determined by the following variables:

• Position of the audio source – further referred to as Xs

• Position of the audio observer – Xo

• The orientation of the observer - O

• Room characteristics

To model the room acoustic properties, RIRs must be recorded at various positions in the room where significant changes to the signal are expected, as seen in Fig. 1.3.

As a result of this step, a set of data points each containing [RIRt, Xs, Xo, O] is obtained.


Figure 1.3: Visualization of the process flow of modeling a room. Here, step 1, the data collection, is presented. To make an effective volumetric RIR model of this room given this source location, the observer points have to be chosen carefully so that they are able to capture significant changes. Note that the displayed number of observer points and their locations are not realistic. To obtain a decent representation of a room, many extra observer points will have to be added.

1.4.2 Conversion

VA requires that the RIR corresponding to any location of a virtual observer can be computed. This is traditionally done in the time domain or via the inverse Fourier transform of a computed frequency spectrum. Finding the optimal computation method for modelling in SMoE and assessing its effectiveness is the exact subject of this study. As explained further on in this thesis, the proposed computation method is based on the conversion of the RIR into the cepstral domain.

The purpose of this step is to provide the cepstral representation of the data point with [RIRc, Xs, Xo, O] as input to the SMoE framework.

Figure 1.4: Visualization of the process flow of modeling a room. Here, step 2, the conversion from time to cepstral domain, is presented. In this step, every obtained data point is converted into the cepstral domain.


1.4.3 Modeling

As explained in section 1.3, the SMoE framework has been developed to generate a higher-dimensional estimation function from a given set of data. The goal towards which this study is working is for SMoE to generate the volumetric RIR model from the recorded cepstral data points, which is achieved by learning the relation (Xs, Xo, O) => RIRc.

Once the volumetric room impulse response model has been generated, the RIR reproduction process can be started. For any given position in the virtual room, the presented computation model should allow for an approximation of the corresponding RIR. The generated model can be queried with a (Xs, Xo, O) input parameter.

Figure 1.5: Visualization of the process flow of modeling a room. Here, step 3, building the model, is presented. SMoE generates a higher-dimensional estimation function, called the volumetric RIR model, from all the received data points. This model is capable of approximating RIRs in the cepstral domain in all intermediate locations.

1.4.4 Reproduction

After querying the model for the appropriate RIR for the location of the virtual observer, the RIR needs to be reproduced from the compressed representation in the cepstral domain back into the time domain by using the proposed computational method.

This reproduction will then be convolved with the dry audio source to obtain a realistic reproduced spatial sound in the headset.
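The convolution of a dry source with a RIR can be illustrated with a minimal sketch. The function name `convolve` and the toy values are assumptions made for illustration only; a practical implementation would normally use a fast, FFT-based convolution rather than this direct form.

```python
def convolve(dry, rir):
    """Direct-form linear convolution of a dry signal with an impulse response."""
    out = [0.0] * (len(dry) + len(rir) - 1)
    for i, x in enumerate(dry):
        for j, h in enumerate(rir):
            # Every input sample excites a delayed, scaled copy of the RIR.
            out[i + j] += x * h
    return out


# Toy RIR (hypothetical values): direct sound, one early reflection, a weak tail.
rir = [1.0, 0.0, 0.5, 0.0, 0.25]
dry = [1.0, -1.0, 2.0]
wet = convolve(dry, rir)
print(wet)
```

The output is longer than the dry signal by the length of the RIR minus one sample, which corresponds to the reverberation tail that rings on after the source has stopped.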


Figure 1.6: Visualization of the process flow of modeling a room. Here, step 4, the data reproduction, is presented. After reproducing the obtained cepstral RIR into the time domain, this reproduced RIR can be convolved with the audio source in order to add realistic spatial information. With the combination of the reproduction method and the generated model, a realistic reproduction of a RIR can be obtained in every intermediate location in the modeled room.

1.5 Final goal

This thesis will work towards presenting a computation method that transforms the input RIR into a representation that allows for optimal processing by SMoE but is not sensitive to its negative modeling effects.

The main purpose of this thesis is to create a representation containing the RIR, enriched with additional dimensions to enhance correlational information.

At the same time, the representation should also be inert to the smoothing effect of the SMoE algorithm so that the reproduced signal is not changed too significantly. However, the evaluation of this effect is not within the scope of this study.

2 Proposed computation method

The proposed computation method can be divided into two main components, amplitude modeling and phase retrieval. Initially, the amplitude was believed to contain the most information about the given impulse response because the phase only contains timing information. This assumption has led to a distinct study of both amplitude and phase throughout this thesis.

2.1 Evaluation methods and metrics

A big challenge throughout the development of the method is evaluating the accuracy of any regenerated signal. At first, this was evaluated only by listening. Both the original and the regenerated RIR are convolved with a dry sound (with little to no reverberation) to create the files to compare. Because the goal of the method is to reconstruct the original RIR as closely as possible, those two files should sound as similar as possible. The most correct evaluation method is therefore to assess the result by listening. However, this method is extremely time-consuming and therefore not fit for quickly gathering enough valuable data in this thesis. The observation was soon made that a method for objective evaluation was needed to support subjective assessment by listening. If a metric obtained from an existing evaluation method were to show a high correlation with the subjective experience, it could be considered an easier evaluation method. It is not the purpose of the study to do a statistical validation of existing metrics; however, the obtained results will undergo a simple subjective evaluation by experts to obtain a sense of usefulness.

A potential, rather quick and dirty, way to evaluate results is plotting the reproduced and original RIR to compare them visually. This evaluation is under no circumstances equivalent to listening, although it can give a useful visual indication of the effectiveness of the reproduction of the RIR. This evaluation is basically a way of telling whether the results seem equivalent and are even worth listening to.

Further investigation of the literature revealed metrics that originated from speech processing research. The following three well-known metrics have been chosen to evaluate the accuracy of the reproduced signal.

• Perceptual Evaluation of Speech Quality (PESQ) is a test algorithm introduced in 2001 by the International Telecommunication Union (ITU), the international organization that coordinates standards for telecommunications. It is a commonly used objective measure intended to measure speech quality by a quantification of the degradation due to codecs and transmission channel errors [9].

• Perceptual Objective Listening Quality Analysis (POLQA) is also an ITU standard and is the successor of PESQ, developed in 2011 [10].

• Short-Time Objective Intelligibility measure (STOI) is an objective speech-intelligibility measure [11]. In speech communication, intelligibility is a measure of how comprehensible speech is in given conditions. The STOI algorithm was developed in 2010 at the Delft University of Technology.

Other potential metrics, like measures that predict the quality and intelligibility of degraded speech for listeners with hearing loss who use hearing aids (e.g. HASQI and HASPI), have not (yet) been included in this study and might also be interesting candidates for future analysis.

2.2 Forward operation

The forward part of the computation method consists of transforming the RIR of total length n to an enriched representation, where n is the total number of samples the presented RIR consists of. This transformation is based on creating the cepstrogram of a signal and will be explained in depth in this section. The forward operation consists of a pipeline of multiple processing steps which are executed in series. This pipeline is shown in Fig. 2.1 and the corresponding Matlab code can be found in App. A.


Figure 2.1: Pipeline of the consecutive steps that build up the proposed computational method.

Throughout this chapter, the term forward operation is used to refer to the transformation from time to cepstral domain. The inverse transformation is called the reverse operation and will be discussed in Sec. 2.4 below. The specifics of the cepstrogram creation will be explained in this section.

2.2.1 Generating spectrogram

A first intermediate representation of the RIR, called a spectrogram, is a representation of the signal in the frequency domain through time and can be seen in Fig. 2.2. The process of transforming a RIR in the time domain to a spectrogram is visualized in Fig. 2.3.

The first three steps in the pipeline operate in the time domain. The first is the addition of zeros to the signal, the second is breaking up the zero-padded sampled signal into overlapping time chunks and the third is point-wise multiplying each chunk with a windowing function. The fourth step transforms each chunk into the frequency domain, resulting in a collection of frequency spectra, the spectrogram.

For a better understanding of the logic of the pipeline, the explanation below starts with step 2: breaking up into chunks. Throughout the explanation, the (x) notation refers to the corresponding step in Fig. 2.1.


Figure 2.2: A spectrogram of the spoken words ”nineteenth century”. Time is shown on the horizontal axis and frequency increases upwards along the vertical axis. The color intensity increases with the amplitude of the frequency. [12]

Figure 2.3: Method used to split a time signal into a spectrogram by splitting the sampled signal into overlapping chunks, point-wise multiplying these chunks with a window and applying a Fourier transform to convert the chunk into the frequency domain.


Breaking up into chunks (2)

The second processing step is to break up the zero-padded signal into overlapping chunks of size m. The purpose of the step is to obtain a sense of timing information in an indirect way. This information is vital as later on in the process, the original timing information will be lost due to the elimination of the phase from the frequency spectrum. This indirect timing information will be converted later on to actual phases during the phase retrieval step in the reverse operation, as elaborated in Sec. 2.3.

The choice of the chunk size has two main effects that have to be considered: firstly, the resolution in the spectral representation and secondly, the amount of timing information. The bigger the chunks, the better the spectral resolution but the less timing information is available, and vice versa.

At the start of the study, the size of the chunks m was considered a variable and initially it was time-based. A specific time interval t (in seconds) was chosen for calculating m using m = t · Fs, with Fs the sampling frequency. However, literature shows that an optimized method to calculate a discrete Fourier transform is the Fast Fourier Transform (FFT). The simplest implementation of this type of algorithm uses a power-of-two chunk size.

In many other applications, a chunk size of 512 is used. After a brief set of evaluations, a chunk size of 512 proved to be feasible for the proposed calculation method. During this research, m is set to 512 and no longer considered variable. However, it is not thoroughly proven that this is the optimal chunk size.

The first variable in the proposed computation method is the frame shift s. The frame shift is the offset between the various chunks as indicated in Fig. 2.3. Frame shift has no unit and is expressed as a fraction of the chunk size m. The complement (1 − s) of the shift is the overlap, the fraction of its size by which one chunk overlaps with its neighbouring chunk. Throughout this work, the terms shift and overlap will be used interchangeably. The research on the influence of the frame shift variable will be presented in Chap. 3.

The number of overlapping chunks (k) resulting from breaking up the zero-padded RIR can be calculated using Eq. 2.1.

k = \left\lfloor \frac{\frac{n}{m} + (1 - s)}{s} \right\rfloor    (2.1)

To obtain the sense of timing, it is important to break down the RIR into overlapping chunks. The percentage of overlap will determine the quality of the timing information. The higher this percentage, the better the quality but the more data need to be stored. Using Eq. 2.1, the increase in storage needs, which is directly related to the number of chunks, can be easily calculated. For chunk size m = 512 and a RIR with total length n = 10000, k increases from 40 to 81 when decreasing frame shift s from 1/2 to 1/4.
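The numbers in this example can be checked with a short sketch. It reads Eq. 2.1 as k = floor((n/m + (1 − s)) / s); the function name `chunk_count` is an assumption for illustration, and exact fractions are used to avoid floating-point rounding near integer boundaries.

```python
import math
from fractions import Fraction


def chunk_count(n, m, s):
    """Number of overlapping chunks k for RIR length n, chunk size m, frame shift s."""
    return math.floor((Fraction(n, m) + (1 - s)) / s)


k_half = chunk_count(10000, 512, Fraction(1, 2))
k_quarter = chunk_count(10000, 512, Fraction(1, 4))
print(k_half, k_quarter)
```

Halving the frame shift roughly doubles the number of chunks, and hence the amount of data to store.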


Obviously, as data storage is important, the variable s needs to be investigated thoroughly.

Windowing (3)

When using FFT, spectral leakage arises and affects the result of the FFT operation. By point-wise multiplying the individual chunks with a windowing function before applying a FFT, the ill effects of spectral leakage are reduced significantly [13].

In this method, a square root Hann function is used as analysis window, which is a generally accepted window for this case [14]. This window is shown in Fig. 2.4. In the forward operation one typically speaks of an analysis window, whereas the (different) windowing function used in the reverse operation will be referred to as the synthesis window.

Figure 2.4: Square root Hann analysis window (horizontal axis: n, from 0 to 600; vertical axis from 0 to 1).
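A square root Hann window can be generated as in the sketch below. The periodic form of the Hann window, the function names and the chunk size of 8 are assumptions chosen for illustration. The sketch also shows one property that makes this window convenient: at a frame shift of one half, the squared windows of neighbouring chunks sum to one.

```python
import math


def sqrt_hann(m):
    """Square root of a periodic Hann window of length m (assumed form)."""
    return [math.sqrt(0.5 * (1.0 - math.cos(2.0 * math.pi * i / m)))
            for i in range(m)]


def apply_window(chunk, window):
    """Point-wise multiplication of one chunk with the window, step (3)."""
    return [x * w for x, w in zip(chunk, window)]


m = 8
window = sqrt_hann(m)
windowed = apply_window([1.0] * m, window)

# Squared window values of two chunks that overlap by half their size.
overlap_sum = [window[i] ** 2 + window[i + m // 2] ** 2 for i in range(m // 2)]
print(overlap_sum)
```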

Adding zeros (1)

To ensure perfect reconstruction, the original RIR is padded with zeros at the front and back; this is step (1) in Fig. 2.1. Without this step, the first samples of the RIR are not included in multiple chunks and therefore miss their timing information. The number of added zeros equals the amount of overlap so that the first window containing the RIR is already overlapped and the signal can be fully reconstructed.
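Steps (1) and (2) can be sketched together as follows. The function name `pad_and_split` is hypothetical, and keeping only complete chunks (dropping a trailing partial chunk of padding zeros) is an assumption; it was chosen because it reproduces the chunk counts of 40 and 81 quoted for m = 512 and n = 10000.

```python
def pad_and_split(rir, m, s):
    """Zero-pad a RIR by the overlap on both sides and split it into chunks of size m."""
    hop = round(m * s)       # frame shift in samples
    overlap = m - hop        # number of zeros added at the front and at the back
    padded = [0.0] * overlap + list(rir) + [0.0] * overlap
    # Only complete chunks are kept (assumption).
    return [padded[i:i + m] for i in range(0, len(padded) - m + 1, hop)]


chunks = pad_and_split([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], m=4, s=0.5)
for chunk in chunks:
    print(chunk)
```

Because of the padding, the first two samples of the toy RIR appear in two chunks, just like every other sample.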


Transformation into frequency domain (4)

When the zero-padded RIR is split into windowed chunks, it is ready to be transformed into the frequency domain. This is done with a Fourier Transform (FT) on every individual chunk. Because computers operate in discrete time, this transform is called a Discrete Fourier Transform (DFT). Computing a DFT of n samples by only using its definition takes O(n²) time, whereas a Fast Fourier Transform (FFT) algorithm can compute the same result in only O(n log n) steps. By using this faster FFT function, a substantial gain in computation time is acquired. Because a FFT algorithm is used, the Fourier transform operation will further on be referred to as FFT. A FFT is applied to each chunk of length m, transforming it into a complex spectrum.

Firstly, an important property of the FT is that when the input signal s(t) is real, its FT S(f) has Hermitian symmetry, i.e. S(f) = S∗(−f), meaning that the real part and the amplitude are even while the imaginary part and the phase are odd. Secondly, the Fourier transform of a sampled signal is continuous and periodic. The DFT provides discrete samples of one cycle, meaning that the second half of the outcome actually represents the negative-frequency half of a cycle. Due to this symmetry and the periodicity of a Fourier transformed signal, a symmetry axis arises at every interval of Fs/2, with Fs being the sampling frequency of the signal. The first of these symmetry axes is located at the Nyquist frequency (Fn = Fs/2).

Due to the symmetry property of FFT as described above, only the information of one half of each cycle is needed in order to recreate a full length cycle. Therefore only the first half of the output of the FFT is kept and the other half can be discarded without any loss of information. Hence, the length of each resulting spectrum is m/2 + 1, which is the minimal number of coefficients needed to apply a lossless FFT and Inverse Fast Fourier Transform (IFFT) to a signal with length m. This minimal length will be referred to as z.

The size of the resulting spectrogram now is z × k.
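The claim that z = m/2 + 1 coefficients suffice can be illustrated with a small sketch. To stay dependency-free it uses a naive O(n²) DFT instead of an FFT; the function names and the toy chunk are assumptions for illustration.

```python
import cmath


def dft(x):
    """Naive DFT by definition, used here only to keep the sketch self-contained."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def one_sided(spectrum):
    """Keep the first m/2 + 1 coefficients of a length-m spectrum."""
    return spectrum[: len(spectrum) // 2 + 1]


def restore_full(half, m):
    """Rebuild the full spectrum of a real signal using S[m - f] = conj(S[f])."""
    return list(half) + [half[m - f].conjugate() for f in range(len(half), m)]


chunk = [0.5, -1.0, 2.0, 0.25, 0.0, 1.5, -0.75, 3.0]   # m = 8, real-valued
full = dft(chunk)
half = one_sided(full)                                   # z = 5 coefficients
rebuilt = restore_full(half, len(chunk))
max_diff = max(abs(a - b) for a, b in zip(full, rebuilt))
print(len(half), max_diff)
```

The rebuilt spectrum matches the full one up to rounding error, so discarding the second half loses no information for a real input.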

2.2.2 Transforming into cepstral domain

At the beginning of the study, the goal was to try and use the cepstral representation of a signal to achieve a high correlation as required by SMoE. Throughout the study, this method proved viable and therefore it is still at the heart of the proposed computational method. During the research, minor improvements were added in order to gain performance and reliability; these will also be covered in this section.


Definition

A cepstrum is the result of taking the inverse Fourier transform of the logarithm of the amplitude spectrum of a signal and can be seen as the rate of change in the different spectrum bands. The name ”cepstrum” derives from reversing the first four letters of ”spectrum”. Similarly, operations on cepstra are labeled quefrency alanysis, liftering or cepstral analysis (resp. anagrams of frequency analysis, filtering and spectral analysis).

By definition, an inverse Fourier transform is used to transform the logarithmic amplitude spectrum into the cepstral domain. However, the proposed computational method uses a standard Fourier transform in this step. For a real, even input such as the mirrored logarithmic amplitude spectrum, the only difference between using an IFFT and a FFT is a scaling constant. In order to minimize confusion in terminology, the forward operation uses the FFT both to transform into the frequency domain and to transform into the cepstral domain. The reverse operation, on the other hand, will use the IFFT for the reversed transforms.

Utilization in the proposed method

The proposed representation contains multiple cepstra through time. Such a representation is called a cepstrogram, a naming similar to a spectrogram, which contains multiple spectra through time.

Converting a spectrogram to a cepstrogram is as simple as transforming each spectrum into a cepstrum. Following the earlier definition, the transformation is achieved by applying FFT to the logarithm of every amplitude spectrum. The definition contains three main steps, also indicated by step (5), (6) and (7) in Fig. 2.1.

The first step is the most influential step of the whole pipeline: taking the absolute value of every chunk to obtain the amplitudes of the spectrum (5). In doing so the phase information of the spectrum is lost and will need to be retrieved or modeled separately in order to reproduce the original RIR in the reverse operation. The way this phase information is regained will be explained in Sec. 2.3.

Secondly, the logarithm of the amplitudes of each spectrum is taken (6).

The third and final step in the transformation of a spectrum to a cepstrum is applying FFT on the logarithm of the amplitude of the spectrum (7). To do so, the logarithmic amplitude spectrum gets mirrored around the Nyquist frequency to result in a cepstrum consisting of real numbers. The desired effect of obtaining a set of real numbers is a direct result of applying FFT on a real-even signal.


After applying the three steps as described above on every chunk of the spectrogram, a complete cepstrogram is obtained. From this complete cepstrogram, it is possible to reproduce the exact amplitude spectrogram without loss. When combining the reproduced amplitude spectrogram with the corresponding phase spectrogram, a perfect reproduction of the input RIR is possible. Unfortunately, the challenge of the proposed method is that it is not possible to model the corresponding phase information in the used cepstrogram representation.
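Steps (5) to (7) can be sketched for a single chunk as shown below. The naive DFT, the function names and the small constant `eps` that guards against the logarithm of zero are assumptions for illustration and are not taken from the Matlab code in the appendix.

```python
import cmath
import math


def dft(x):
    """Naive DFT by definition (a FFT would be used in practice)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def cepstrum(half_spectrum, eps=1e-12):
    amplitudes = [abs(v) for v in half_spectrum]        # step (5): drop the phase
    log_amp = [math.log(a + eps) for a in amplitudes]   # step (6): logarithm
    # Mirror around the Nyquist bin so that the input of the transform is real and even.
    mirrored = log_amp + log_amp[-2:0:-1]
    return dft(mirrored)                                # step (7): transform


chunk = [0.5, -1.0, 2.0, 0.25, 0.0, 1.5, -0.75, 3.0]
half = dft(chunk)[: len(chunk) // 2 + 1]
result = cepstrum(half)
largest_imag = max(abs(v.imag) for v in result)
print(largest_imag)
```

The imaginary parts vanish up to rounding error, which illustrates why mirroring yields a real-valued cepstrum. The result is itself even, so only its first z values need to be stored.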

Reduction of coefficients (8)

In order to achieve compression in the proposed computational method, liftering is applied. Liftering is the process of removing coefficients in the cepstral domain to obtain a desired result in the spectral domain. In this study, there is no specific desired result in the spectral domain. The ultimate goal of the representation is to replicate the original RIR as closely as possible. The amount of liftering will therefore depend on two variables, the amount of compression and the accuracy of the reproduction. Three potential liftering methods are described below.

The first liftering method is called low-pass liftering. This method is comparable to the better known low-pass filtering. Instead of only keeping the low frequencies to apply the filtering on the time signal, the low quefrencies are kept to apply a filtering on the frequency spectrum. In other words, a low-pass liftering method only keeps the first c cepstral coefficients while the remaining cepstral coefficients, the higher quefrencies, are lost.

The second liftering method stores the coefficients which contain the most information, being those with the highest absolute value. As a result, this method provides the best possible qualitative results for the number of stored coefficients [15]. However, due to this selection, the index of the remaining coefficients is variable so a method of storing the locations of these coefficients is needed. Therefore this method is much harder to implement. Also, because the stored coefficients are not necessarily consecutive quefrencies, it is difficult to create a valid representation.

A third possible liftering method is similar to using a Mel-Frequency Cepstrum (MFC) representation. Mel-Frequency Cepstral Coefficients (MFCC) are specifically chosen coefficients in the cepstral domain that make up an MFC. The difference between a regular cepstrum and an MFC is that in the latter, the frequency bands are equally spaced on the mel scale, which approximates the rather logarithmic response of the human auditory system more closely than a linear approach. Methods using MFCCs are commonly found in speech related systems and could present a possibility of selecting certain cepstral coefficients in order to compress the representation.

Given the three presented methods with their advantages and disadvantages, the first method was chosen as the eventual liftering method. The simplicity of selecting only the first c cepstral coefficients of each cepstrum leads to a more basic implementation. The result of this low-pass liftering is acceptable for current expectations. The choice of liftering method is not final in any way, meaning further investigation may reveal a different, more suitable method which results in more realistic reproductions. In any case, great care must be taken when utilizing another liftering method because different liftering methods are not compatible with one another.

This step reduces the size of the cepstrogram from z × k to c × k.
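Low-pass liftering of a whole cepstrogram amounts to truncating every cepstrum, as in the minimal sketch below. The function name and the toy values are hypothetical; the cepstrogram is stored here as a list of k cepstra with z coefficients each.

```python
def lowpass_lifter(cepstrogram, c):
    """Keep only the first c cepstral coefficients (low quefrencies) of every cepstrum."""
    return [cepstrum[:c] for cepstrum in cepstrogram]


# Toy cepstrogram (hypothetical values): k = 3 cepstra with z = 5 coefficients each.
cepstrogram = [
    [2.0, 0.8, 0.3, 0.1, 0.05],
    [1.5, 0.6, 0.2, 0.1, 0.02],
    [1.0, 0.4, 0.1, 0.0, 0.01],
]
liftered = lowpass_lifter(cepstrogram, c=2)
compression = len(cepstrogram[0]) / len(liftered[0])
print(liftered, compression)
```

The compression factor is simply z / c, which makes the trade-off between stored data and reproduction accuracy easy to control through c.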

2.3 Phase retrieval

Due to taking the absolute values of the spectrogram in step (5), all phase information is lost. The problem is that, when reproducing the RIR, both amplitude and phase information are required. In this section, possible solutions to this problem are suggested. The first solution is based on modeling the phase spectrum while the second solution tries to create a reconstruction from the modeled amplitude information.

The first potential solution was based on creating a representation of the phase so that SMoE also would be able to model this phase information. A phase spectrum φ(t) can be represented in two ways. When φ(t) is constrained to the interval [0, 2π[, the representation is called the wrapped phase. Otherwise it is called the unwrapped phase. The unwrapped phase can be seen as the cumulative function of the wrapped phase and is able to surpass the constraining interval [0, 2π[.
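The relation between wrapped and unwrapped phase can be made concrete with a short sketch. The unwrapping rule used here (remove jumps of multiples of 2π between neighbouring values) is the common textbook approach and an assumption about how the thesis computes it; the function names and the linear test phase are illustrative.

```python
import math

TWO_PI = 2.0 * math.pi


def wrap(phase):
    """Constrain every phase value to the interval [0, 2*pi[."""
    return [p % TWO_PI for p in phase]


def unwrap(wrapped):
    """Accumulate phase differences after removing jumps of multiples of 2*pi."""
    out = [wrapped[0]]
    for prev, cur in zip(wrapped, wrapped[1:]):
        delta = cur - prev
        delta -= TWO_PI * round(delta / TWO_PI)   # map the step into [-pi, pi]
        out.append(out[-1] + delta)
    return out


true_phase = [1.0 * k for k in range(12)]   # steadily increasing phase
wrapped = wrap(true_phase)
recovered = unwrap(wrapped)
print(max(abs(a - b) for a, b in zip(true_phase, recovered)))
```

Unwrapping only recovers the original phase when neighbouring values differ by less than π, which hints at why the unwrapped phase is a delicate quantity to smooth.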

The primary idea was to use the unwrapped phase spectrum as the representation to use in SMoE. To do so, this representation would have to be inert to the effects of SMoE. The first effect is a smoothing effect on the output of SMoE, being the unwrapped phase spectrum of the signal. This means that smoothing the unwrapped phase spectrum should result in an acceptable reproduction. It was quickly observed that smoothing an unwrapped phase spectrum leads to a heavily distorted reproduced signal in the time domain, so the idea of using the unwrapped phase signal is not feasible.

A following idea is a variation on the proposed amplitude representation. Next to the modeling of the amplitude spectrogram, also the unwrapped phase spectra of this spectrogram are modeled. Hence, this representation consists of a collection of phase spectra through time. To mimic the effect of SMoE, a smoothing effect in the frequency dimension is applied on this representation. Even worse than with the primary idea, the result of this operation was catastrophic to the reproduced signal in the time domain due to the delicate nature of the phase information and the interference between overlapping chunks.

The second potential solution tries to recreate the phase spectrogram based on the reproduced amplitude spectrogram. As previously stated, the process of splitting up the RIR into overlapping chunks results in a sense of timing information. The indirectly obtained timing information is crucial because the phase spectrogram, which typically contains the timing information, is discarded in the forward operation. This algorithm, further referred to as the phase retrieval method, is based on the iterative Griffin-Lim algorithm [16] but stops after a predetermined number of iterations i. The number of iterations i adds another variable to the proposed calculation method that will be assessed in the experiments below.

The phase retrieval method is an iterative method that continuously transforms a given spectrogram into a time series (step (2), (3), (4) in Fig. 2.1) and back (step (5), (6), (7) in Fig. 2.6), eventually leading to a time series resembling the original RIR. Because a little bit of phase information from the time series is extracted in the process of transforming a RIR into a spectrogram, and this phase information is multiplied with the original amplitude spectrogram, this phase information will continue to match the given amplitude spectrogram more and more with every iteration. This algorithm is slowly convergent so the quality of the retrieved phases improves with every iteration. Fig. 2.5 shows a diagram of the above explained phase retrieval method.


Figure 2.5: Schematic diagram of the proposed phase retrieval method with |F| the amplitude spectrogram, φ the phase spectrogram, s the resulting time series (RIR) and S the reproduced complex spectrogram.

2.4 Reverse operation

This section describes the method used to transform a liftered cepstrogram representation of a RIR into a time domain reproduction of this RIR. The reverse operation is displayed as the RIR Reproduction step in Fig. 1.2. The pipeline of the reverse operation is shown in Fig. 2.6 and the corresponding Matlab code is listed in App. B.


Figure 2.6: Pipeline of the consecutive steps building up the reverse operation. This pipeline is similar to the forward operation pipeline, except that every individual step is replaced by its inverse operation.

For every step used in the forward operation, the reverse operation performs the inverse or an alternative that reverts its effect. Therefore, the diagram of the reverse operation is similar to the one of the forward operation, and this section follows the same structure as the earlier description of the forward operation. Every step displayed in Fig. 2.6 will be referred to with the (x) notation.

In line with the previously explained condition for obtaining a real signal after applying FFT, the input needs to be a real-even signal. The same condition also applies to the IFFT. In order for this condition to be met, the size of the input signal has to be z so that an effective mirroring can be obtained. The problem with the liftered representation is that the signals that have to be transformed (the individual cepstra) do not meet this condition. After liftering, the size of every cepstrum is c, which is smaller than the required z. To still meet the above-mentioned condition, every cepstrum is appended with zeros to reach the correct size z (1). After this step, the dimensions of the liftered cepstrogram are equal to the desired dimensions (z × k) of the spectrogram needed to reproduce the original RIR.

The second step in the reverse operation is to transform the cepstrogram back into a spectrogram (2). This is achieved by applying an IFFT on every individual mirrored cepstrum and keeping only the first z frequencies of the resulting spectrum. This mirroring is identical to the mirroring used in the forward method and is meant to obtain a real-even signal.

The forward method of calculating a cepstrum states that the cepstrum of a signal is obtained by applying FFT on the logarithmic amplitudes of the spectrum of that signal. The next two steps convert the intermediate logarithmic amplitude spectrogram representation into a complex spectrogram, able to reproduce the desired RIR. First, the logarithmic nature of this intermediate representation is undone by taking the exponential (3). Secondly, the retrieved phase spectrogram, generated by the previously explained phase retrieval method, is multiplied with the amplitude spectrogram to obtain a complex spectrogram (4).
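Steps (1) to (3) can be illustrated for a single cepstrum with the sketch below. It is written in Python rather than the Matlab of App. B, uses a naive DFT and a toy size z = 9, and the helper names (`mirror`, `amplitude_to_liftered_cepstrum`, `liftered_cepstrum_to_amplitude`) are hypothetical. The forward helper is included only so that the round trip can be checked.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) DFT, sufficient for this toy example."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def mirror(half):
    """Extend z = N/2 + 1 values to a real-even sequence of length N."""
    return list(half) + list(half[-2:0:-1])


def amplitude_to_liftered_cepstrum(amplitude, c):
    """Forward direction: log, mirror, FFT, keep z values, keep the first c."""
    z = len(amplitude)
    log_amp = [math.log(a) for a in amplitude]
    cepstrum = [v.real for v in dft(mirror(log_amp))][:z]
    return cepstrum[:c]


def liftered_cepstrum_to_amplitude(liftered, z):
    """Reverse steps (1)-(3): zero-pad to z, mirror + IFFT, exponential."""
    padded = list(liftered) + [0.0] * (z - len(liftered))               # step (1)
    log_amp = [v.real for v in dft(mirror(padded), inverse=True)][:z]   # step (2)
    return [math.exp(v) for v in log_amp]                               # step (3)


z = 9  # toy size: a chunk length of N = 16 gives z = N/2 + 1 = 9
amplitude = [1.0 + 0.5 * math.cos(0.7 * k) for k in range(z)]
lossless = liftered_cepstrum_to_amplitude(
    amplitude_to_liftered_cepstrum(amplitude, z), z)
smoothed = liftered_cepstrum_to_amplitude(
    amplitude_to_liftered_cepstrum(amplitude, 4), z)
```

With c = z no coefficient is discarded and `lossless` matches `amplitude` up to rounding. With c = 4 the zero-padded cepstrum yields a smoothed amplitude spectrum of the same size z.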

In the following step, the spectrogram is transformed into the time domain. This is achieved by applying an IFFT on each individual spectrum in the spectrogram (5). Similarly to the previous IFFT operation, every complex spectrum is mirrored before the IFFT is applied. In contrast to all previous transformations, however, no samples are discarded and the whole obtained signal is kept.

The next four steps take place in the time domain and are responsible for transforming the two-dimensional representation (chunks through time) into a high-quality one-dimensional RIR representation.

In the forward operation, each chunk is point-wise multiplied with an analysis window to reduce the ill effects of spectral leakage caused by using FFT. In the reverse operation, a similar window, called a synthesis window, is point-wise multiplied with every chunk (6). The goal of synthesis windowing is to fade out any spectral errors at the chunk boundaries, thereby suppressing audible discontinuities. Because the analysis window is a square root Hann window, applying it twice (once for analysis and once for synthesis) yields a Hann window, and a sequence of overlapping Hann windows with a frame shift of 1/2 sums to 1 at every index. So when using a frame shift of 1/2, the analysis window can also be used as a synthesis window. Whenever the shift is smaller than 1/2, the sum of the overlapping windows is larger than 1. Hence, a conversion from analysis window to synthesis window is needed so that the overlapping windows again sum to a constant 1. Three different synthesis windows are displayed as a function of the used frame shift in Fig. 2.7.
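The overlap condition can be checked numerically. The sketch below assumes a periodic square root Hann analysis window and a synthesis window that is simply a scaled copy of it; the scaling factor 2 × shift is an assumption that is consistent with the description above, not necessarily the conversion used in App. B.

```python
import math


def sqrt_hann(n):
    """Periodic square root Hann window of length n (assumed analysis window)."""
    return [math.sqrt(0.5 * (1 - math.cos(2 * math.pi * t / n))) for t in range(n)]


def synthesis_window(n, shift):
    """Scale the analysis window so analysis * synthesis overlap-adds to 1.

    `shift` is the frame shift as a fraction of the chunk length (1/2, 1/4, ...).
    Overlapping periodic Hann windows with hop n * shift sum to 1 / (2 * shift).
    """
    scale = 2 * shift
    return [scale * w for w in sqrt_hann(n)]


def overlap_sum(n, shift):
    """Steady-state sum of analysis * synthesis over all overlapping chunks."""
    hop = int(n * shift)
    analysis = sqrt_hann(n)
    synthesis = synthesis_window(n, shift)
    total = [0.0] * n
    for offset in range(0, n, hop):
        for t in range(n):
            total[(t + offset) % n] += analysis[t] * synthesis[t]
    return total


sums = {shift: overlap_sum(512, shift) for shift in (0.5, 0.25, 0.125)}
peaks = {shift: max(synthesis_window(512, shift)) for shift in (0.5, 0.25, 0.125)}
```

Under these assumptions every entry of `sums` is constant at 1, and the peak of the synthesis window drops from 1 to 0.5 to 0.25 as the shift goes from 1/2 to 1/4 to 1/8.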


[Plot: synthesis window amplitude (0 to 1.2) versus sample index n (0 to 600) for frame shifts 1/2, 1/4 and 1/8.]

Figure 2.7: Visualization of three different synthesis windows, each corresponding to a different frame shift. The smaller the frame shift, the lower the window, because more windows overlap and the sum of the overlapping values of each window always has to equal 1.

The second step in the time domain combines the windowed chunks into one resulting RIR containing n samples (7), producing a one-dimensional representation. This is done by adding each windowed chunk to the corresponding piece of the resulting RIR, with an offset based on the index of the chunk and the frame shift.
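A minimal sketch of this overlap-add step is given below. The function name `overlap_add` and the use of a hop expressed in samples (chunk size times frame shift) are choices made for this illustration.

```python
def overlap_add(chunks, hop, n):
    """Add every windowed chunk into an output buffer of n samples.

    The chunk with index `index` starts at sample `index * hop`.
    """
    rir = [0.0] * n
    for index, chunk in enumerate(chunks):
        start = index * hop
        for t, value in enumerate(chunk):
            if start + t < n:
                rir[start + t] += value
    return rir


# Three chunks of four samples with a frame shift of 1/2 (hop = 2 samples).
chunks = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
combined = overlap_add(chunks, hop=2, n=8)
```

In this toy case the interior samples receive two contributions and the first and last two samples only one, which is why the leading and trailing regions need separate treatment.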

Due to the addition of leading and trailing zeros at the beginning of the forward operation, the proposed representation contains this rather useless information. A reproduced RIR will consequently still contain these leading and trailing zeros. The third time domain step therefore removes a predetermined number of samples at the beginning and the end of the reproduced RIR (8). The number of removed samples on each side is m × (1 − s).
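As a small illustration, the trimming step can be written as follows. The sketch assumes that m denotes the chunk size in samples and s the frame shift as a fraction, as used elsewhere in this chapter; the function name is hypothetical.

```python
def trim_padding(rir, m, s):
    """Drop m * (1 - s) samples from each end of the reproduced RIR."""
    pad = round(m * (1 - s))
    return rir[pad:len(rir) - pad]


padded = list(range(2000))
trimmed_half = trim_padding(padded, m=512, s=0.5)      # removes 256 per side
trimmed_eighth = trim_padding(padded, m=512, s=0.125)  # removes 448 per side
```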

The final step of the reverse operation is an additional step. It is theoretically not required to reproduce the original signal, but it generally improves the performance of the reverse operation by manually engineering the beginning of the RIR. By applying an exponential smoothing function to the beginning (9), two undesired effects of the reproduction can be countered. The first effect occurs when the reproduction is inaccurate and produces sound before the direct sound peak is received. This is pre-echo, a digital audio compression artifact where a sound is heard before it occurs. Secondly, the exponential smoothing function ensures a smooth beginning of the RIR: the first value is guaranteed to be zero and a continuous slope towards the first samples is obtained, so that no undesired artifacts caused by sudden changes in the RIR occur.


[Plot: smoothing function, value (0 to 1) versus sample index (0 to 1200).]

Figure 2.8: Smoothing function used to remove pre-echo and ensure a smooth continuous slope at the start of the RIR. The peak of the particular RIR in this example is situated at sample index 1382. The shape of this function is experimentally determined. The samples at the start of the RIR are multiplied by zero, followed by an exponential function between indexes 1170 and 1350 and a small subsection before the peak where the RIR stays untouched. The positions of the beginning and end of this exponential are determined by the index of the peak.
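A gain curve of this kind could be generated as sketched below. The indexes 1170, 1350 and 1382 are taken from the example in Fig. 2.8, but the exact exponential expression and the `steepness` parameter are assumptions, since the shape used in this thesis was determined experimentally.

```python
import math


def smoothing_function(length, ramp_start, ramp_end, steepness=5.0):
    """Zero before ramp_start, exponential rise up to ramp_end, 1 afterwards."""
    gains = []
    for index in range(length):
        if index < ramp_start:
            gains.append(0.0)
        elif index < ramp_end:
            x = (index - ramp_start) / (ramp_end - ramp_start)  # runs from 0 to 1
            gains.append(math.expm1(steepness * x) / math.expm1(steepness))
        else:
            gains.append(1.0)
    return gains


peak_index = 1382
gains = smoothing_function(length=peak_index + 1, ramp_start=1170, ramp_end=1350)
# Applying it: smoothed_rir[i] = rir[i] * gains[i] for the samples up to the peak.
```

The curve is zero up to index 1170, rises continuously to 1 at index 1350 and leaves the samples just before the peak untouched.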

The forward operation, phase retrieval and reverse operation are the three components of the proposed computational method. These components and the rationale for using them were explained in the chapter above. The next chapter describes the experiments used to validate this computational method and evaluates the results.

3 Experiments

As described above, the relevance, effect and required order of magnitude of the three discussed variables will be further examined in this chapter. These variables are the frame shift s, the number of retained cepstral coefficients c and the number of iterations i of the forward-backward method used for phase retrieval.

The proposed computation method contains two actions that lead to a degradation of the reproduced signal. The amount of degradation is linked to the value of each of the three discussed variables. The first potential point of degradation, also referred to as loss, is the effect of liftering the original signal. Liftering is the compression of the cepstrogram by removing a set of cepstral coefficients; this reduces the size of the cepstrogram from (z × k) to (c × k).

The second potential point of degradation originates in the phase retrieval method, where a phase spectrogram converges to a suitable approximation fitting the given amplitude spectrogram. The result of this step improves as more fitting iterations are performed.

Both points of degradation occur in the process of converting a RIR to the proposed representation of a liftered cepstrogram by using the forward operation and subsequently applying the reverse operation to convert back to the time domain. Consequently, it is not possible to pin down the cause of the degradation by only using the complete proposed computation method. Therefore, three different experiments are established, each explained in its own section below. The three experiments are:

• Experiment A: regenerates amplitudes and uses the original phase information from the input signal.

• Experiment F: retrieves phases and uses the original amplitude spectrogram from the input signal.

• Experiment FA: a combination of F and A; both regenerated amplitudes and retrieved phases are used, so it represents the use of the complete proposed computation method.

3.1 Objective metric experiments

3.1.1 Test setup

Used input RIRs

A test setup was composed in order to perform the three mentioned experiments. This test setup operates on ten different RIRs, presented in Table 3.1. The used RIRs are a combination of six recorded RIRs from the publicly available Multi-Channel Impulse Response Database from Aachen University and four RIRs recorded in Ghent for another investigation. The distance variable indicates the distance in meters between the source and the center of the microphone array used to record the RIRs. The RT60 variable is the time in seconds the sound pressure level takes to decrease by 60 dB after the sound source is abruptly stopped.

Location   Distance (m)   RT60 (s)
Aachen     1              0.16
Aachen     1              0.36
Aachen     1              0.61
Ghent      1              0.66
Aachen     2              0.16
Aachen     2              0.36
Aachen     2              0.61
Ghent      2              0.66
Ghent      3              0.66
Ghent      5              0.66

Table 3.1: Overview of the used RIRs


To allow for a diverse set of test sounds to be examined, all of these 10 RIRs have been convolved with 50 speech signals: 5 male and 5 female speakers each pronounce 5 different sentences, with each sentence including a different set of sounds.

Used variables

The following conditions will be tested for the respective variables:

Frame shift s

Frame shift s

1/8 = 12.5% 1/4 = 25% 1/2 = 50%

Table 3.2: Used values for the frame shift s

Cepstrum coefficients c

Cepstrum coefficients c

25 50 75 100 125 150 175 200 257

Table 3.3: Used values for the amount of stored cepstrum coefficients per cepstrum c

Here, 257 seems like the odd one out but is included because a cepstrum containing 257 cepstral coefficients should allow for perfect reconstruction of the amplitude spectrum of a chunk with size 512 (512/2 + 1 = 257). In this case no cepstral coefficients are replaced by zeros, in other words no liftering is carried out, so the transformation into a cepstrum and back should result in a lossless reproduction.

Forward-backward iterations i

As the number of iterations is the decisive factor for the execution time of the phase retrieval, and previous tests showed that the improvement caused by an increase of iterations stagnated around 500 iterations, it was decided to test the following values:

Iterations i

50 100 200 400 600 800

Table 3.4: Used values for the number of forward-backward iterations i

Each of the 10 RIRs is reproduced by the three different methods (F, A and FA) using all their potential variable combinations (18, 27 and 162 respectively), resulting in 207 reproductions per RIR, i.e. 2070 RIRs in total. Each of them is then convolved with the 50 different speech signals to obtain a set of 103,500 audio samples to be evaluated.
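These counts follow directly from the variable values in Tables 3.2 to 3.4, as the short enumeration below shows. The variable names are chosen for this illustration only.

```python
from itertools import product

shifts = [1 / 8, 1 / 4, 1 / 2]
coefficients = [25, 50, 75, 100, 125, 150, 175, 200, 257]
iterations = [50, 100, 200, 400, 600, 800]

# Each experiment only varies the variables it actually uses.
combos = {
    "F": list(product(shifts, iterations)),
    "A": list(product(shifts, coefficients)),
    "FA": list(product(shifts, coefficients, iterations)),
}
per_rir = sum(len(v) for v in combos.values())
total_rirs = 10 * per_rir        # 10 input RIRs
total_audio = total_rirs * 50    # 50 speech signals per reproduced RIR
```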

As discussed in chapter 2, the quality of all of these reproductions will be assessed using three different metrics:

• STOI

• PESQ

• POLQA

3.1.2 Amplitude regeneration experiment (A)

In this experiment, first the forward operation is executed. Before converting the spectrogram into a liftered cepstrogram, the original phase information is stored separately. The phase retrieval step, as described in Sec. 2.3, is not executed. In the second step of this experiment, the original phase spectrogram is multiplied with the regenerated, liftered amplitude spectrogram. After this, the reverse operation is executed to obtain the reproduced RIR. A visualization of the described calculations can be observed in Fig. 3.1.

Figure 3.1: Schematic diagram of the method for determining the influence of the amplitude regeneration. Notice the use of the original phase spectrogram, indicated in bold.


[Boxplot: PESQ score per combination (shift s, cepstrum coefficients c), for s in {0.125, 0.25, 0.5} and c in {25, 50, 75, 100, 125, 150, 175, 200, 257}.]

Figure 3.2: Boxplot of the calculated PESQ scores of the A experiment. Note the rather linear trend line per shift.

[Boxplot: POLQA score per combination (shift s, cepstrum coefficients c), for s in {0.125, 0.25, 0.5} and c in {25, 50, 75, 100, 125, 150, 175, 200, 257}.]

Figure 3.3: Boxplot of the calculated POLQA scores of the A experiment. Note that the scores reach the maximum value rather soon, which suggests an almost perfect reconstruction on most files.


[Boxplot: STOI score per combination (shift s, cepstrum coefficients c), for s in {0.125, 0.25, 0.5} and c in {25, 50, 75, 100, 125, 150, 175, 200, 257}.]

Figure 3.4: Boxplot of the calculated STOI scores of the A experiment. Note the rather logarithmic trend per shift.

The focus of this experiment lies in assessing the quality of the reproduced amplitude. As the iterations required in the phase retrieval step are not performed, only 2 variables can be analyzed: frame shift s and number of cepstrum coefficients c.

Conclusions

For each of the metrics, the scores of all combinations of s and c are at the higher end of the scale (e.g. all medians in the STOI boxplot are higher than 0.97), suggesting that the method used for the reproduction of the amplitude is nearly perfect.

Even when starting at a quite high initial score level with c = 25, further increasing the number of cepstrum coefficients leads to consistently higher scores and less variation for all of the metrics. Using c = 257 results in an almost perfect reconstruction, bearing in mind that the outliers can be explained by the smoothing step ((9) in Fig. 2.6) that was applied in the time domain to eliminate the pre-echo.

The impact of shift s is less obvious in the different boxplots, although all three metrics clearly show that the results for s = 0.125 and s = 0.25 are better than for s = 0.5: scores are higher and variation is lower. A rather small and inconsistent difference can be observed between s = 0.125 and s = 0.25.


As halving the shift doubles the required amount of data, and with all scores at high levels, one could consider increasing the shift s, as the marginal gain in score from increasing c is minimal (e.g. doubling c from 50 to 100 only increases the median STOI value from 0.98 to 0.99 at shift s = 0.125).

3.1.3 Phase retrieval experiment (F)

Experiment F investigates the quality of the phase retrieval. By performing steps (1) to (5) of the forward operation, the amplitude spectrogram of the given RIR is obtained. The performed steps are completely reversible, so the reproduction of this intermediate representation is lossless. After this, no further transformation towards the cepstral domain is done, but instead the phase retrieval process is immediately started. The last step in this experiment is the reproduction of the RIR. This test allows for the evaluation of variables s and i. A diagram of the described experiment can be observed in Fig. 3.5.

Figure 3.5: Schematic diagram of the method for determining the influence of the phase regeneration. The bold lines show the unchanged amplitude spectrogram.


[Boxplot: PESQ score per combination (shift s, iterations i), for s in {0.125, 0.25, 0.5} and i in {50, 100, 200, 400, 600, 800}.]

Figure 3.6: Boxplot of the calculated PESQ scores of the F experiment. Notice the difference in spread between s = 0.125 and s = 0.25, which is unique to the PESQ metric in this experiment.

[Boxplot: POLQA score per combination (shift s, iterations i), for s in {0.125, 0.25, 0.5} and i in {50, 100, 200, 400, 600, 800}.]

Figure 3.7: Boxplot of the calculated POLQA scores of the F experiment.


[Boxplot: STOI score per combination (shift s, iterations i), for s in {0.125, 0.25, 0.5} and i in {50, 100, 200, 400, 600, 800}.]

Figure 3.8: Boxplot of the calculated STOI scores of the F experiment.

Conclusions

Comparing the resulting scores from the F experiment with the previous A experiment, it is striking that the results for the phase retrieval experiment are at a much lower level than the ones from the amplitude reproduction experiment: median scores are lower and variation is higher for each of the metrics. All three metrics therefore seem to suggest that the phase retrieval method is not very effective.

Increasing the number of iterations increases the score levels, but does not seem to have a conclusive effect on the variation: the PESQ spread seems rather constant for increasing i, while POLQA and STOI indicate a decreasing, but still considerable, spread at s = 0.125 and s = 0.25 for higher i.

The size of the shift reinforces the effect of the iterations: the smaller the shift, the sooner the iterations seem to be effective. Then again, shift s = 0.5 seems to have a considerable negative effect on the results of all three metrics. For all metrics, the lowest result (i = 50) at s = 0.125 is lower than at s = 0.25, but the best results (i = 600 and i = 800) are higher when s = 0.125. This demonstrates that the results converge to a higher score level when using a smaller s, but only when using a higher number of iterations i, meaning that the convergence is slower.

Since the need for data saving can be neglected for testing purposes in this study, the most data-demanding value for s, 0.125, will be selected for additional tests.


3.1.4 Proposed computation method experiment (FA)

The final experiment executes all steps of the described computation method, i.e. the complete forward operation, the phase retrieval and the reverse operation on a given RIR. Consequently, all 3 variables s, c and i can be evaluated in this experiment.

Figure 3.9: Schematic diagram of the method for determining the influence of the complete proposed calculation method. Both phases and amplitudes are regenerated, so each variable is included in this experiment.

The purpose of this experiment is to evaluate the quality of the reproduction generated by the proposed computation method. A diagram of the experiment using the complete method can be observed in Fig. 3.9.

[Boxplots, three panels (PESQ, POLQA, STOI): score per combination (shift s, cepstrum coefficients c), for s in {0.125, 0.25, 0.5} and c in {25, 50, 75, 100, 125, 150, 175, 200, 257}.]

Figure 3.10: Boxplots of the calculated scores of the FA experiment with a constant i = 800 and variable c and s. Firstly, notice that the improvement achieved by increasing c is tempered in comparison to experiment A. Secondly, a more distinct difference between the different shifts s is visible. Thirdly, the scores obtained in this experiment when using c = 257 match the scores obtained in the F experiment with i = 800 and the corresponding s, which indicates the perfect reconstruction of the amplitude spectrogram.


[Boxplots, three panels (PESQ, POLQA, STOI): score per combination (shift s, iterations i), for s in {0.125, 0.25, 0.5} and i in {50, 100, 200, 400, 600, 800}.]

Figure 3.11: Boxplots of the calculated scores of the FA experiment with a constant c = 150 and variable i and s. Notice that almost all effect of improving i and s is lost and a nearly constant median and spread is obtained.

Conclusions

From the different graphs in Fig. 3.10 and 3.11, it can clearly be observed that the score levels obtained in the FA experiment are again at a lower level than in the previous 2 experiments, even at optimal conditions with i = 800 and s = 0.125.

The positive effect on the scores when increasing c, as noticed in experiment A, is confirmed in Fig. 3.10, but the positive effect on the variation seems to be lost. The scores obtained in this experiment when using c = 257 match the scores obtained in the F experiment with i = 800 and the corresponding s, which confirms the expected perfect reconstruction of the amplitude spectrogram.

The positive effect of the increase of i, as noted in experiment F for s = 0.125 and s = 0.25, seems to be completely muted, as score levels as well as variations seem to remain very stable for all three metrics, as shown in Fig. 3.11.

On the other hand, the detrimental effect of s = 0.5 that was noted in experiment F also seems to be muted.

3.1.5 General conclusions on metric results

A clear difference can be observed between the average scores of the results of the F and the FA experiments, compared to the A experiments, with the latter having scores that are situated much higher. As the same observation is made for all three metrics, it suggests that the quality of the reproduction of the phase has a far more detrimental effect than the reproduction of the amplitude. However, when listening to and comparing the resulting sounds, this difference is not experienced as such. This observation suggests that the used metrics punish phase errors much harder, possibly too hard, compared to amplitude errors.

On the other hand, the trends suggested by the graphs (i.e. a distinct positive effect of c, a rather positive effect of i and the negative impact of s = 0.50) are confirmed by initial listening tests.

To further evaluate the auditory effect of the variables c and i, an additional binaural spatial hearing test has been set up. Binaural tests are used because the spatial experience relies heavily on the phase difference between both ears, which means that inadequate reconstructions of the phase information will be recognized sooner than when listening to monophonic RIRs.

3.2 Auditory spatial tests

The main goal of using RIRs is to capture the acoustics of a room. The acoustic properties of a room predominantly influence early reflections and reverberations and the direction of the direct and indirect sound waves. A monophonic sound reproduction cannot be used to simulate this spatial experience in a realistic way. Therefore, a new experiment using stereophonic sound files is required so that the (binaural) sense of direction can be evaluated.

For this purpose, the Multi-Channel Impulse Response Database from Aachen University is used. This dataset comprises a number of impulse responses that are measured in a room with configurable reverberation level. This results in three different acoustic scenarios with reverberation times RT60 equal to 160 ms, 360 ms and 610 ms. The measurements were carried out in recording sessions of several source positions on a spatial grid (angle range of 0° to 180° in 15° steps at 1 m and 2 m distance from the microphone array). The recording setup can be observed in Fig. 3.12. The signals in all sessions were captured by a microphone array with an 8 cm distance between neighbouring microphones.

For the auditory spatial tests, a dataset will be created using recordings based on a 1 m distance between source and microphones, source angles in 45° steps and an RT60 of 610 ms. The binaural effect is obtained by using the recordings of two microphones which are situated 16 cm away from each other (16 cm being the average distance between human ears). This is, in fact, a simplification of reality as it does not take into account the Head-Related Transfer Function (HRTF), which is the filtering effect that the size and the shape of the head and the outer ears have on the acoustic signal. For the purpose of this study, the consideration of the HRTF is out of scope.

In order to obtain a stereo file, it is important to provide one RIR for the left ear and one for the right. The combination of a left and right RIR will be referred to as the Binaural Room Impulse Response (BRIR). Both prerecorded RIRs will need to be reproduced using the same method and will then be convolved with the same sound.

Figure 3.12: Visualization of the setup of the recording room used for the Aachen University dataset. The distances between the microphones are not to scale compared to the distance of the circles. The microphones used for the binaural RIR are indicated with a black circle and their corresponding letter (L = left, R = right).

Similar to the setup described above, the three different experimental methods for reproduction will be applied: the amplitude regeneration method A as described in subsec. 3.1.2, the phase retrieval method F as described in subsec. 3.1.3 and the proposed computation method FA as described in subsec. 3.1.4.

On top of this, an extra optimization will be performed for methods F and FA, which include phase retrieval. To obtain a truly binaural effect, first the reproduction of the RIR for the ear closest to the source will be generated. In the next step, the phase spectrogram of the first RIR, as obtained from the phase retrieval method, is multiplied with the phase shift ϕ between the first and second ear.

\[ \phi = e^{-j 2 \pi f \frac{d}{c} \cos\theta} \tag{3.1} \]

The phase shift is calculated with Eq. 3.1 where:

ϕ = phase shift
f = frequency
d = distance between receivers
c = speed of sound
θ = angle between the source and the axis of the linear microphone array
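Eq. 3.1 can be evaluated per frequency bin as sketched below. The default speed of sound of 343 m/s and the function name are assumptions made for this example; the thesis does not state which value was used.

```python
import cmath
import math


def phase_shift(f, d, theta_deg, c=343.0):
    """Eq. 3.1: phi = exp(-j * 2 * pi * f * (d / c) * cos(theta)).

    f in Hz, d in m, theta in degrees, c in m/s (343 m/s is an assumed value).
    """
    theta = math.radians(theta_deg)
    return cmath.exp(-2j * math.pi * f * (d / c) * math.cos(theta))


# Source on the array axis (theta = 0): maximum delay between the two ears.
on_axis = phase_shift(f=1000.0, d=0.16, theta_deg=0.0)
# Source broadside to the array (theta = 90): both ears receive the same phase.
broadside = phase_shift(f=1000.0, d=0.16, theta_deg=90.0)
```

The result always has unit magnitude, so multiplying a spectrum with ϕ only delays it and leaves its amplitude untouched.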


Note that the used formula is not entirely correct due to the spacing of the microphones, as shown in Fig. 3.12. The used microphones are not spaced equally from the center, resulting in a slight error when calculating ϕ.

In the third step of the optimization, the calculated delayed phase spectrogram of the first RIR is point-wise multiplied with the amplitude spectrogram of the second RIR. Finally, the combined complex spectrogram is used as input for the phase retrieval method to reproduce the second RIR.

When this optimization is used, the phase information is said to be passed. When this optimization is omitted, the phase spectrogram for each ear is retrieved solely based on its own amplitude spectrogram, so the phase is not passed. An example of the Matlab code for phase passing can be found in App. C.
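The construction of the initial complex spectrogram for the second ear can be sketched as follows. This Python fragment is an illustration of the idea only, not a translation of App. C: the data layout (a spectrogram as a list of spectra), the function name `pass_phase` and the speed of sound of 343 m/s are assumptions.

```python
import cmath
import math


def pass_phase(first_phase, second_amplitude, frequencies, d, theta_deg, c=343.0):
    """Delay the retrieved phase of the first ear and attach the second amplitude.

    first_phase and second_amplitude are spectrograms stored as lists of
    spectra; frequencies holds the frequency in Hz of every bin.
    """
    delay = (d / c) * math.cos(math.radians(theta_deg))  # inter-ear delay in s
    combined = []
    for phases, amplitudes in zip(first_phase, second_amplitude):
        combined.append([
            a * cmath.exp(1j * p) * cmath.exp(-2j * math.pi * f * delay)
            for f, p, a in zip(frequencies, phases, amplitudes)
        ])
    return combined


# Toy data: two spectra with three frequency bins each (illustrative values).
frequencies = [0.0, 500.0, 1000.0]
first_phase = [[0.0, 0.3, -1.2], [0.0, 1.0, 2.0]]
second_amplitude = [[1.0, 0.5, 0.25], [0.8, 0.4, 0.2]]
initial = pass_phase(first_phase, second_amplitude, frequencies,
                     d=0.16, theta_deg=45.0)
```

The spectrogram `initial` would then serve as the starting point of the phase retrieval for the second ear, instead of a start that ignores the first ear.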

3.2.1 Used variables

The purpose of this experiment is to compare the results from the metrics PESQ, POLQA and STOI with the spatial experience as perceived when listening to the stereophonic sound files. Because subjective listening experiments are time consuming, a reduced dataset is prepared based upon the preliminary conclusions drawn from experiments A and F (subsec. 3.1.2 and 3.1.3).

Based on the various boxplots shown in Fig. 3.2 to 3.4 and Fig. 3.6 to 3.8, the values presented in Tab. 3.5 are selected for the different variables. The corresponding score histograms for the selected variables (i = 800, s = 0.125 and c in [75, 100, 150]) can be observed for each individual metric in Fig. 3.13 to 3.15.

s      i      c
1/8    50     75
       800    100
              150

Table 3.5: Used variables for the spatial tests. Note that multiple c values are selected in order to evaluate the improvement of the quality by increasing c, and extreme values for i are selected to evaluate the binaural impact of the i variable.


[Histogram: frequency (0% to 100%) versus PESQ score for the series A_75, FA_75, A_100, FA_100, A_150, FA_150 and F.]

Figure 3.13: Histogram of selected PESQ scores of experiments F, A and FA: i = 800, s = 0.125 and c in [75, 100, 150]. The scores of the A experiment are situated at the high end (4-4.5), scores of F are marked in green and vary between 3 and 4.5. FA scores appear on the lower end between 2.3 and 3.7. Notice the shifts to the right when increasing c in A and FA, which suggest better quality.


[Histogram: frequency (0% to 100%) against POLQA score for the series A_75, FA_75, A_100, FA_100, A_150, FA_150 and F.]

Figure 3.14: Histogram of selected POLQA scores of experiments F, A and FA: i = 800, s = 0.125 and c in [75, 100, 150]. POLQA scores show less differentiation between A and F but are more favorable towards F results than PESQ scores. The same shift to the right can be observed for experiment FA, although it is less pronounced, suggesting less improvement. The A scores are nearly identical, suggesting no noticeable improvement by increasing c.


[Histogram: frequency (0% to 100%) against STOI score for the series A_75, FA_75, A_100, FA_100, A_150, FA_150 and F.]

Figure 3.15: Histogram of selected STOI scores of experiments F, A and FA: i = 800, s = 0.125 and c in [75, 100, 150]. STOI scores differentiate more between A and F than POLQA but less than PESQ. The same trend of improvement for experiments A and FA as described by PESQ can be observed.


3.2.2 Conclusions

In the described auditory tests, one expert was confronted with multiple audio files built by convolving a speech file with different BRIRs. The two files to be evaluated are played alternately, separated by a 4 s interval of silence to give the ears a brief recovery. The listener is aware of the presented files but does not know which one is played first. Listening was performed on a pair of Beyerdynamic DT 770 headphones with a 96.28 dB/mW SPL.

The focus during the listening lies on the perceptual direction of the source, the perceptual distance of the source and the perceptual room impression determined by the reverberations. During this test, a comparison is made between all experiments (A, F and FA), their perceptual correctness and their previously calculated metric scores. Also, the effect of passing the phase information is examined. After a comprehensive set of listening tests, conclusions are made and elaborated below.

To describe the different evaluations to be performed, the following notations will be used:

BRIR_O = original BRIR
BRIR_A = BRIR as reproduced in the A experiment
BRIR_F^- = BRIR as reproduced in the F experiment without phase pass
BRIR_F^+ = BRIR as reproduced in the F experiment with phase pass
BRIR_FA^- = BRIR as reproduced in the FA experiment without phase pass
BRIR_FA^+ = BRIR as reproduced in the FA experiment with phase pass

The following evaluations will be performed:

• Step 1: Initially, the reproduced BRIRs from the A experiments (BRIR_A) will be assessed on their quality and on the impact of the variable c.

• Step 2: In the second step, the BRIRs from the F experiments will be gauged: first the BRIRs with phase pass (BRIR_F^+), followed by an assessment of the BRIRs without phase pass (BRIR_F^-). The purpose of this step is to understand the influence of using the phase pass.

• Step 3: Thirdly, the BRIRs from the F experiments with phase pass (BRIR_F^+) will be compared to the BRIR_A results.

• Step 4: The next step is to compare BRIR_FA^+ with BRIR_A and with BRIR_F^+. In this step the use of the phase pass will be assessed: is the use of the phase pass in experiment FA more or less effective than its use in experiment F?

• Step 5: The impact of the variable c on the quality of the BRIR_FA^+ is analysed.

• Step 6: In this last step, the impact of the variable i = 50 versus i = 800 on the BRIR_FA^+ will be assessed at constant c.

The outcomes of the different evaluation steps, as performed by the expert, are described below:

• Outcome 1: The results of the A experiments show little difference from the original BRIR_O. The perceptual direction and distance of BRIR_A cannot be distinguished from BRIR_O, which seems reasonable since the exact same phase information is used. The reverberation sounds slightly flatter in BRIR_A at any value of c, which is caused by the liftering step. For a trained ear, the perceptual reverberation slightly improves when increasing c, but the difference is small and hardly noticeable (e.g. a non-expert hardly hears the difference between c = 100, c = 150 and BRIR_O).

• Outcome 2: When assessing BRIR_F^+ with i = 800, an even smaller difference between the original and the reproduction is noticed. The perceptual direction, distance and reverberation seem identical, and the difference can only be noticed when listening extremely carefully with a highly accurate headphone. However, listening to BRIR_F^+ with i = 50 results in a bigger difference. The perceptual direction and distance feel correct, but the reverberation lacks clarity and feels more compressed.

A big difference is heard when listening to BRIR_F^- with i = 800 or i = 50. The perceptual direction is totally off, which makes it difficult to judge the similarity of the perceptual reverberation and distance. Because the difference in quality between using and not using the phase pass is so big, the conclusion is easily drawn that the phase pass is crucial for phase retrieval and far more decisive than the selection of i.

• Outcome 3: The perceptual quality of BRIR_F^+ and BRIR_A is similar, even though all three metrics obtained from monophonic experiments suggest the opposite, as shown in Fig. 3.13 - 3.15. When selecting a reasonably high c (e.g. c = 150) for BRIR_A, a near-perfect reconstruction is obtained. Similarly, when i is high enough (e.g. i = 800) for BRIR_F^+, the phase retrieval method results in a perception that is difficult to distinguish from BRIR_O, so in this case too a near-perfect reconstruction is obtained.

• Outcome 4: A decrease in overall perceptual quality can be observed when comparing a BRIR_FA^+ to the original. This decrease is more noticeable than with the previously described BRIR_F^+ and BRIR_A. With i = 800, the perceptual direction is still mainly correct but less precise compared to BRIR_F^+ and BRIR_A. The reverberation feels flatter and the perceptual distance increases with lower c values. The effect of leaving out the phase pass (BRIR_FA^-) is similar to that of leaving out the phase pass in the F experiment (BRIR_F^-): the perceptual direction is so far off that the whole reproduction is rendered ineffective.


• Outcome 5: When increasing c of a BRIR_FA^+, the overall perceptual quality increases because the reverberation sounds fuller. The perceived precision of the distance also improves when increasing c. The overall quality increase is stronger than for a similar increase of c in a BRIR_A. However, the latter still sounds better whichever c is selected, due to the use of the original phases in BRIR_A.

• Outcome 6: The last assessment gauges the selection of i for creating a BRIR_FA^+. In BRIR_F^+, a major decrease in i from i = 800 to i = 50 only results in a minor decrease in quality. The decrease in quality of BRIR_FA^+ due to a significant drop in i is noticeably more pronounced. This indicates that a larger number of iterations is needed in order to generate an acceptable reproduction of a BRIR_FA^+.

The perceived quality of the various BRIRs as described above is quantified in Fig. 3.16. Note that these values are a subjective measure, determined by the above description from an expert and mainly by informal listening of a non-expert using a pair of Pioneer-HDJ-1500 headphones. These values are only meant as a quick and easy way to compare the differences explained above.

                          Direction   Distance   Reverberation
A (c = 150)               10          10         8
A (c = 75)                10          10         7
F+ (i = 800)              10          10         9
F+ (i = 50)               9           9          7
F-                        2           4          5
FA- (c = 150, i = 800)    2           4          4
FA+ (c = 75, i = 800)     8           7          6
FA+ (c = 150, i = 800)    9           8          7
FA+ (c = 150, i = 50)     7           7          6

Figure 3.16: Summary of the perceived quality of the various BRIRs. A subjective rating scale of 1 to 10 has been used.

When comparing the results shown in Fig. 3.16 to the corresponding scores in Fig. 3.13 - 3.15, it is clear that all selected metrics are sub-optimal for the objective evaluation of reproduced RIRs.

Obviously, regenerated phases differ from the original ones, but in the auditory tests it was observed that the reproduced BRIRs based upon these regenerated phases were perceived to be very similar to the original BRIRs. The impression arises that all of the used metrics penalize the regeneration of the phases too heavily.

On the other hand, the metrics are sometimes unable to recognize perceptual differences in amplitudes. For example, the scores of experiment A with c = 25 are unrealistically high. POLQA in particular tends to give a maximum score far too readily.

4 Conclusion

In this thesis, a novel representation of a RIR and its corresponding computation method is presented and evaluated. The proposed representation is developed particularly for future use in the SMoE framework. For the evaluation of the effectiveness of the computation method, three metrics, originating from the speech evaluation research field, have been selected. An assessment is made of the computation method and of the ability of the selected metrics to evaluate the reproduced RIRs.

4.1 Conclusion

Initial auditory tests confirm the feasibility of the proposed computation method to reproduce RIRs at a reasonably high qualitative level. Perceptual qualitative differences, which are mainly related to the fact that the computation method discards the actual phase information in the forward operation of the reproduction process, are eliminated to a very large extent by the phase pass operation executed at the binaural synthesis.

Existing metrics from the speech evaluation research field (PESQ, POLQA and STOI) confirm the impact of the phase regeneration process, but tend to overrate its importance in the resulting scores. Therefore, the use of these metrics for evaluation of the effectiveness of the computation method is considered sub-optimal.

The impact of three variables (phase shift s, number of retained cepstrum coefficients c and number of forward-backward iterations i) that are inherent to the proposed computation method has been explored, but no definitive conclusion on the impact of each of the variables can be drawn, based on the limited amount of research that could be done during the study. Preliminary results suggest a convergence towards a maximum for the number of iterations i and a general improvement of the quality when increasing the number of retained cepstrum coefficients c. As far as the phase shift is concerned, preliminary insight suggests limiting it to at most s = 0.25. However, further in-depth statistical analysis is required to refine these preliminary conclusions and to evaluate potential correlation between the three variables.
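To make the role of s and c more concrete, the following sketch evaluates the frame-shift and overlap expressions of App. A for the two shift values mentioned in this chapter, together with the share of cepstrum coefficients that is retained for the tested values of c. The window length of 512 samples is an assumption, inferred from the mirroring around coefficient 257 in App. B, and the function names are hypothetical. It is written in plain Python rather than Matlab so that it runs without any toolbox.

```python
def frame_layout(window_length, shift):
    """Frame shift and overlap in samples, following App. A."""
    frm_shift = window_length * shift      # frmShift = nfft * shift
    ovlap = window_length - frm_shift      # ovlap = nfft - frmShift
    return frm_shift, ovlap

def retained_fraction(c, window_length):
    """Share of the window_length/2 + 1 cepstrum coefficients that is kept."""
    return c / (window_length // 2 + 1)

WINDOW_LENGTH = 512  # assumed; implied by the mirroring around coefficient 257

for s in (0.125, 0.25):
    frm_shift, ovlap = frame_layout(WINDOW_LENGTH, s)
    print(f"s = {s}: shift = {frm_shift:.0f}, overlap = {ovlap:.0f} samples")

for c in (75, 100, 150):
    print(f"c = {c}: {retained_fraction(c, WINDOW_LENGTH):.1%} of coefficients kept")
```

Under this assumption, s = 0.125 corresponds to a frame shift of 64 samples with an overlap of 448 samples, and s = 0.25 to a shift of 128 samples with an overlap of 384 samples.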

4.2 Future work

As the purpose of this study was to develop a computation method fit for use by SMoE, it is suggested to now incorporate the proposed method into the SMoE framework to fully evaluate its capabilities for reproducing RIRs based on a volumetric room impulse response model and to understand its robustness against the smoothing effect of the framework.

When proven successful, further statistical investigation is required to determine the impact of the three variables and define the optimal parameterization for real-time computing.

It is also strongly recommended to integrate the concept of phase passing in different dimensions when integrating the computation method with SMoE. Listening tests clearly indicate the benefits of phase passing for leveraging the binaural spatial experience and for reducing the impact of the (under)performing phase retrieval process. Extrapolating this outcome leads to the belief that phase passing might be a crucial additional step to achieve a real-time computation of a credible 6DoF acoustic spatial experience, as phases could be passed on when an observer is moving in a VR environment. This belief is supported by the favourable results of using this method and by extrapolating its use to multiple dimensions (e.g. instead of passing the phase only between the ears, phase passing can be applied when moving around in the virtual world).

For qualitative investigation of the spatial experience of binaural simulations, the Spatial Audio Quality Inventory (SAQI) as developed by the university of Berlin [17] could be a useful instrument.

Bibliography

[1] L. P. Berg and J. M. Vance, “Industry use of virtual reality in product design and manufacturing: a survey,” Virtual Reality, vol. 21, no. 1, pp. 1–17, Mar 2017. [Online]. Available: https://doi.org/10.1007/s10055-016-0293-9

[2] L. Savioja and U. P. Svensson, “Overview of geometrical room acoustic modeling techniques,” The Journal of the Acoustical Society of America, vol. 138, no. 2, pp. 708–730, 2015. [Online]. Available: https://doi.org/10.1121/1.4926438

[3] L. Savioja, “Modeling techniques for virtual acoustics,” 01 2000.

[4] M. Vorländer, S. Pelzer, and F. Wefers, Virtual Room Acoustics, 05 2013, vol. 1, pp. 219–242.

[5] H. Kim, L. Remaggi, P. J. Jackson, and A. Hilton, “Immersive spatial audio reproduction for VR/AR using room acoustic modelling from 360 images,” in IEEE Conference on Virtual Reality and 3D User Interfaces (IEEE VR), 2019.

[6] C. Schissler, C. Loftin, and D. Manocha, “Acoustic classification and optimization for multi-modal rendering of real-world scenes,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 3, pp. 1246–1259, March 2018.

[7] V. Avramelos, I. Saenen, R. Verhack, G. Van Wallendael, P. Lambert, and T. Sikora, “Steered mixture-of-experts for light field video coding,” in Applications of digital image processing XLI, A. G. Tescher, Ed., vol. 10752. SPIE, the International Society for Optics and Photonics, 2018, p. 12. [Online]. Available: http://dx.doi.org/10.1117/12.2320563

[8] R. Verhack, T. Sikora, G. Van Wallendael, and P. Lambert, “Steered Mixture-of-Experts for Light Field Images and Video: Representation and Coding,” Submitted to IEEE Transactions on Multimedia, 2019.

[9] A. Rix, J. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ): A new method for speech quality assessment of telephone networks and codecs,” vol. 2, 02 2001, pp. 749–752 vol.2.


[10] J. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann, J. Pomy, and M. Keyhl, “Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I-temporal alignment,” AES: Journal of the Audio Engineering Society, vol. 61, pp. 366–384, 06 2013.

[11] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” Trans. Audio, Speech and Lang. Proc., vol. 19, no. 7, pp. 2125–2136, Sep. 2011. [Online]. Available: https://doi.org/10.1109/TASL.2011.2114881

[12] Aquegg, Spectrogram-19thC.png, Dec 2008. [Online]. Available: https://commons.wikimedia.org/wiki/File:Spectrogram-19thC.png

[13] F. J. Harris, “On the use of windows for harmonic analysis with the discrete Fourier transform,” Proceedings of the IEEE, vol. 66, no. 1, pp. 51–83, Jan 1978.

[14] J. Wung, D. Giacobello, and J. Atkins, “Robust acoustic echo cancellation in the short-time Fourier transform domain using adaptive crossband filters,” 05 2014.

[15] T. Nishino, F. Saito, K. Itou, and K. Takeda, “Modeling of a room impulse response with cepstrum analysis,” Forum Acusticum 2005, RBA-AS, Paper No. 447, pp. 1887–1890, 09 2005.

[16] D. Griffin and Jae Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, April 1984.

[17] F. Brinkmann, A. Lindau, and S. Weinzierl, “On the authenticity of individual dynamic binaural synthesis,” The Journal of the Acoustical Society of America, vol. 142, no. 4, pp. 1784–1795, 2017. [Online]. Available: https://doi.org/10.1121/1.5005606


Appendices


The appendix contains three different exemplary Matlab code snippets. The entire code including all datasets has been made available to IDLab.

Appendix A: Forward operation code

function [cepstrum, anaWin, synWin] = rir_to_cepstrum(impulse_response, fs, window_length, cepstrum_coef, shift)
    % For obtaining a perfect reconstruction: append the impulse_response with an
    % amount of zeros equal to the overlap before using this function.

    [x,y] = size(impulse_response);
    if x~=1
        impulse_response=impulse_response.';
    end

    nfft = window_length;
    frmShift = nfft * shift;
    ovlap = nfft-frmShift;
    anaWin = hann(nfft).^(0.5); % Analysis window
    synWin = anaWin; % Synthesis window, needed for regenerating the signals from the spectrum

    % Obtain the perfect reconstruction windows according to the given shift.
    [anaWin, synWin] = GetPRWindows(nfft,frmShift,anaWin,synWin);

    % Converting the RIR into a spectrogram
    [X] = GetSpectrum(impulse_response.',nfft,anaWin,ovlap,fs);

    % Convert the spectrogram into a cepstrogram by applying FFT on every
    % individual spectrum. Consecutively, make_cepstrum returns the amount of
    % demanded cepstrum coefficients.
    [~,n_windows] = size(X);
    log_X = log(abs(X));
    cepstrum = zeros(cepstrum_coef, n_windows);
    for i = 1:n_windows
        cepstrum(:,i) = make_cepstrum(log_X(:,i),(cepstrum_coef-1)*2);
    end
end


Appendix B: Reverse operation code

function [s, S_phase] = cepstrum_to_rir(cepstrum, fs, anaWin, synWin, iterations, shift)
    window_length = length(anaWin);
    [cepstrum_coef, n_windows] = size(cepstrum);
    frmShift = window_length*shift;
    ovlap = window_length-frmShift;

    % Mirroring around coefficient 257 and applying IFFT on every cepstrum inside
    % the cepstrogram
    transform_length = window_length/2 + 1;
    end_indx = min(cepstrum_coef, transform_length);
    regen_log_X = zeros(transform_length, n_windows);
    for i = 1:n_windows
        Y = zeros(transform_length,1);
        Y(1:end_indx) = cepstrum(1:end_indx,i);
        Y = [Y; flipud(conj(Y(2:end-1)))];
        y = ifft(Y);
        regen_log_X(:,i) = y(1:transform_length);
    end
    regen_X = exp(regen_log_X);

    % Forwards backwards iteration to simulate the phase.
    S = regen_X;
    s = RegenerateProcessedSignal(S, [], frmShift, synWin);
    for i = 1 : iterations
        S = GetSpectrum(s,window_length,anaWin,ovlap,fs);
        S_phase = angle(S);
        new_S = regen_X.*exp(1i*S_phase);
        s = RegenerateProcessedSignal(new_S, [], frmShift, synWin);
    end

    s = smooth_intro_signal(s);
end
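The zero-padding and mirroring step of the reverse operation can be checked in isolation. The sketch below is a plain Python illustration, not part of the thesis code: it pads a few retained coefficients with zeros, mirrors them in the same layout as the Matlab expression Y = [Y; flipud(conj(Y(2:end-1)))], and applies a naive inverse DFT. The coefficient values, the toy length N = 8 and all function names are assumptions chosen only to keep the example small.

```python
import cmath

def pad_coefficients(coefs, transform_length):
    """Zero-pad the retained coefficients up to transform_length = N/2 + 1."""
    kept = list(coefs[:transform_length])
    return kept + [0.0] * (transform_length - len(kept))

def mirror_half_spectrum(half):
    """Extend bins 0..N/2 to a full length-N conjugate-symmetric sequence."""
    # Same layout as Y = [Y; flipud(conj(Y(2:end-1)))] in the Matlab code
    return list(half) + [z.conjugate() for z in reversed(half[1:-1])]

def idft(spectrum):
    """Naive inverse DFT, O(N^2); adequate for a small demonstration."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

# Toy case: 3 retained coefficients, transform_length = 5, so N = 8
half = pad_coefficients([1.0, 0.5, -0.25], transform_length=5)
full = mirror_half_spectrum(half)
signal = idft(full)
max_imag = max(abs(v.imag) for v in signal)
```

Because the mirrored sequence is conjugate-symmetric, the imaginary part of the inverse transform vanishes up to rounding error, so `max_imag` is effectively zero in this toy case.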


Appendix C: Example code for phase passing

farray = 1:window_length/2+1;
farray = (farray-1)*fs/window_length;
dLR = 0.16;

if angl < 90
    initial_spectrogam_F = abs(spectrogram_R);
    second_spectrogam_F = abs(spectrogram_L);
    phaseOffsets = exp(-1i*2*pi*farray*dLR/344*cosd(angl));
else
    initial_spectrogam_F = abs(spectrogram_L);
    second_spectrogam_F = abs(spectrogram_R);
    phaseOffsets = exp(1i*2*pi*farray*dLR/344*cosd(angl));
end

[rir_regenF_initial, interm_phaseF] = forward_backward_iterations(initial_spectrogam_F, fs, frmShift, anaWin, synWin, iterations);

[~, nslics] = size(interm_phaseF);
for i = 1 : nslics
    interm_phaseF(:,i) = exp(1i*interm_phaseF(:,i)).*phaseOffsets.';
end

rir_regenF_second = forward_backward_iterations(abs(second_spectrogam_F).*interm_phaseF, fs, frmShift, anaWin, synWin, iterations);
