
Manipulation and Resynthesis of Environmental Sounds with
Natural Wavelet Grains

by

Reynald Hoskinson

B.A. (English with Computer Science Minor)
McGill University, 1996

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF

Master of Science

in

THE FACULTY OF GRADUATE STUDIES
(Department of Computer Science)

We accept this thesis as conforming to the required standard

The University of British Columbia
March 2002

© Reynald Hoskinson, 2002

Abstract

A technique is presented to facilitate the creation of constantly changing, randomized audio streams from samples of source material. A core motivation is to make it easier to quickly create soundscapes for virtual environments and other scenarios where long streams of audio are used. While mostly in the background, these streams are vital for the creation of mood and realism in these types of applications.

Our approach is to extract the component parts of sampled audio signals, and use them to synthesize a continuous audio stream of indeterminate length. An automatic speech recognition algorithm involving wavelets is used to split up the input signal into syllable-like audio segments. The segments are taken from the original sample and are not transformed in any way.

For each segment, a table of similarity between it and all the other segments is constructed. The segments are then output in a continuous stream, with the next segment being chosen from among those other segments which best follow from it. In this way, we can construct an infinite number of variations on the original signal with a minimum amount of interaction. An interface for the manipulation and playback of several of these streams is provided to facilitate building complex audio environments.

Contents

Abstract
Contents
List of Figures
Acknowledgements

1 Introduction
  1.1 Problem and Motivation
    1.1.1 Natural Grains
    1.1.2 Ecological Perception
    1.1.3 Objectives
  1.2 Thesis Organization

2 Background and Related Work
  2.1 Overview
  2.2 Representation
  2.3 Signal Transforms
  2.4 The Wavelet Transform
    2.4.1 Implementation
  2.5 Audio Segmentation using Wavelets
    2.5.1 Speech Classification Techniques
    2.5.2 Using Differences between Coefficients
  2.6 Segmenting in the Wavelet Domain
  2.7 Wavelet Packets
  2.8 Granular Synthesis
  2.9 Concatenative Sound Synthesis
  2.10 Physically-Based Synthesis

3 Our Early Attempts at Segmenting Audio Samples
  3.1 A Streaming Granular Synthesis Engine

4 Segmentation and Resynthesis
  4.1 Segmentation
  4.2 Grading the Transitions
  4.3 Resynthesis
    4.3.1 Cross-fading
    4.3.2 Thresholding
  4.4 Implementation
  4.5 Implementing the Wavelet Transform
  4.6 Real-time Considerations
  4.7 Segmentation/Resynthesis Control Interface
  4.8 Preset Mechanism
  4.9 Discouraging Repetition

5 Results and Evaluation
  5.1 User Study
    5.1.1 Participants
    5.1.2 Experimental Procedure
    5.1.3 Results
    5.1.4 Discussion

6 Conclusions and Future Work
  6.1 Overview
  6.2 Goals and Results
  6.3 Future Work

Bibliography

List of Figures

2.1 Contrast between frequency-based, STFT-based, and wavelet views of the signal
2.2 The wavelet filtering process
2.3 Distance measure for transition between frames 2 and 3. The arrows represent difference calculations.
2.4 Wavelet packet decomposition
4.1 Input waveform
4.2 Portion of output waveform
4.3 Segmented waveform and interface
4.4 Segmented stream management interface
5.1 Correct scores per subject and sample. Each score is out of 6.
5.2 Percentage correct answers per subject, over all samples
5.3 Statistics for the number of correct responses
5.4 Percentage correct per sample, compiled over all subjects

Acknowledgements

I'd like to thank my supervisor, Dinesh K. Pai, and Holger Hoos, who provided adroit feedback at several stages along the way. Also, I appreciate the efforts of Antoine Maloney, who has always given me good advice, although I haven't always followed it.

REYNALD HOSKINSON

The University of British Columbia
March 2002

Chapter 1

Introduction

1.1 Problem and Motivation

Natural sounds are an infinite source of material for anyone working with audio. Although the source may be infinite, there are many situations where one sample has to be used repeatedly. Electro-acoustic music composers often use samples as motifs that reappear again and again over the course of a piece. Acoustic installations sometimes stretch pre-obtained source material over the entire life of an exhibit. Video games can use the same sample ad infinitum during gameplay. Simple repetition is not effective for long, so we often create variations of a sample by manipulating one or more of its properties.

There is a long tradition in the electro-acoustic music community of splitting audio samples into portions and manipulating them to create new sounds. Curtis Roads [Roa78] and Barry Truax [Tru94] pioneered granular synthesis, in which small grains are combined to form complicated sounds. Grains can be constructed from scratch, or obtained by splitting an audio sample into small segments. More recently, Bar-Joseph [BJDEY+99] proposed a variant of granular synthesis using wavelets, where the separation and re-combination of grains is done in the time-frequency representation of an audio sample. Similar work is also being done on images to produce variations of tiles or textures [WL00, SSSE00].

When what is desired is simply a variation on the original source that still bears a strong resemblance to the original, the above audio techniques have critical problems. Granular synthesis is a technique to create new sounds, not recognizable variations of the original except in a very abstract sense. A long audio sample is not even required; it suffices to specify the shape of the grain and its envelope. When an audio sample is used, a grain is an arbitrary slice chosen independently of the sound's inherent structure.

Attempts at better preserving the original structure of the sound have been made. Bar-Joseph [BJDEY+99] uses a comparison step where wavelet coefficients representing parts of the sample are swapped only when they are similar. The authors employ "statistical learning", which produces a different sound statistically similar to the original. In this algorithm, only the local neighbours in the multi-resolution tree are considered when calculating similarity, and the swapping is very fine-grained. This means that large-scale changes over time will not be taken into account. On almost any signal, this results in a "chattering" effect.

To address the limitations of the above methods, we have developed an algorithm for segmenting sound samples that focuses on determining natural transition points. The sound in between these transition points is considered atomic and is not broken up any further or transformed in any way. We refer to the sound between transition points as "natural grains".

Once we have the grains, creating new sounds becomes a problem of how best to string them together. We do this by constructing a first-order Markov chain, with each state of the chain corresponding to a natural grain. The transition probabilities from one state to the others are estimated based on the smoothness of the transition between it and all other grains. The natural grains are thus output in a continuous stream, with the next grain being chosen at random from among those other grains which best follow from it. In this way, we can construct an arbitrarily large number of variations on the original signal with a minimum amount of user input.
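To make this resynthesis step concrete, the following is a minimal sketch of such a first-order Markov chain over grains. It assumes the grains have already been cut and that a pairwise smoothness scoring function is available; the smoothness function itself and the size of the candidate set are placeholders, not the exact formulation detailed later in this thesis.

```python
import random

def transition_table(grains, smoothness, k=5):
    """For each grain, keep the k other grains that follow from it most smoothly."""
    table = []
    for i, g in enumerate(grains):
        scores = [(smoothness(g, h), j) for j, h in enumerate(grains) if j != i]
        scores.sort(reverse=True)              # best-matching successors first
        table.append([j for _, j in scores[:k]])
    return table

def resynthesize(grains, table, n_grains):
    """Walk the chain, picking each successor at random among the best matches."""
    out = []
    state = random.randrange(len(grains))
    for _ in range(n_grains):
        out.append(grains[state])
        state = random.choice(table[state])
    return out  # concatenate (with cross-fades) to form the output stream
```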

1.1.1 Natural Grains

Segmenting an audio sample into natural grains involves some understanding of the process by which the acoustic waves detected by our ears are transformed into the sounds we perceive. What cues do we use to distinguish one sound from another? More specifically, what are the clues our brains pick up to distinguish where one sound ends and the next begins?

From the time Helmholtz published "On the Sensations of Tone as a Physiological Basis for the Theory of Music" in 1885 [Hel54], it was generally held that the steady-state components of a sound were the most important factor in human recognition. Risset and Mathews [RM69] wrote a seminal study of the time-varying spectra of trumpet tones which invalidated this hypothesis. Their work showed that the primacy of steady-state components was not realistic from the perspective of synthesizing realistic musical instrument sounds.

Instead they proposed that the dynamic components of a spectrum were primary, and that steady-state components did not help very much at all for instrument classification and tone-quality assessment. Risset's hypothesis has now become the dominant view of sound structure in the psychological literature, and claims to represent the perceptibly important dynamic structures that comprise auditory phenomena from the perspective of musical instrument sound structures.

Handel [Han95] defines timbre as the perceptual qualities of objects and events, or "what it sounds like." The sense of timbre comes from the emergent, interactive properties of the vibration pattern. Clearly, any segmentation algorithm that purports to preserve the perceptual properties of the sound must segment on a larger scale than the local changes that make up the timbre of a sound event.

Drawing on the work of Helmholtz, Michael Casey [Cas98] enumerates the types of change in a sonic structure by examining the constraints of the human auditory system. Fourier persistence refers to the way the cochlear mechanics of the ear are sensitive to changes on the order of 50 ms and shorter. The ear represents these changes as a static quality in log frequency space. In other words, when our ears sense regular changes in air pressure at rates greater than 20 Hz, we perceive one pitch rather than each individual change in air pressure. 20 Hz is the frequency perception threshold of the cochlear mechanism.

We are, however, able to perceive changes occurring at rates less than 20 Hz as actual change. Those that are continuous in terms of the underlying Fourier components are classified as short-time changes in the static frequency spectrum. For example, when I drop a coin, there is a Fourier persistence due to the physical characteristics of a small metallic object. The short-time change reflects the individual impacts.

The above information leads us to focus our segmentation algorithm on changes at time scales longer than the 20 Hz period, that is, longer than 50 ms. Considering that we are looking at samples recorded at 44.1 kHz, our windows should be at least 44100/20 = 2205 samples long.

1.1.2 Ecological Perception

There is a significant amount of literature arguing that the atomicity of human perception of sound is more on the level of what we have defined as a grain than that of an individual sound wave. J. J. Gibson [Gib79] originally introduced the term ecological perception to denote the idea that what an organism needs from a stimulus, for the purposes of its normal everyday life, is often obtained directly from invariant structures in the environment. Some types of complex stimuli may be considered as elemental from the perspective of an organism's perceptual apparatus, unmediated by higher-level mechanisms such as memory and inference.

While Gibson was primarily referring to the visual system, there are analogous patterns in hearing. Perception is not simply the integration of low-level stimuli, such as single pixels in the retina or narrow-band frequency channels in the cochlea, but instead the direct perception of groups of features.

Ecological perception was further explored in the auditory domain in William Gaver's pioneering work on everyday listening [Gav88]. Everyday listening involves perceiving the source of the sound and its material properties such as size and weight. Take, for instance, the sound of a door slowly closing on rusty hinges. In everyday listening, attention is focused on the door itself, the force with which it is being closed, the size of the room it is being closed in, and other material properties of the origin of the sound. This type of listening is differentiated from another type of auditory experience, musical listening, in which musical parameters such as pitch, duration and loudness are most important. In the door example, musical listening would involve hearing the change in pitch as the door opens, the particular timbre of the hinges, and the band-limited impulsive noise as the door hits the frame. Everyday listening instead involves distinguishing the individual events which produce the sounds that we hear.

From the perspective of everyday listening, the perceptual world is one where sounds have clear beginnings and endings, even continuous sounds such as wind that have no onsets or offsets. In this way, a grain could be defined by its temporal boundaries. However, the beginnings and endings of sounds are not necessarily due to the structure of the acoustic wave; often they are not physically marked by actual silent intervals. Our main task, then, is to find points in the acoustic wave which best approximate the beginnings and endings that we can perceive by listening to the sounds ourselves.

However, there is as yet no accepted mathematical framework within which to use this theory of perception in a systematic manner. Casey [Cas98] does provide a mathematical framework using group theory, but he is primarily concerned with extracting the structure of larger-scale sounds, such as the timing and spectral changes between bounces as a ball bounces a number of times before settling on the ground.

While any sound consists of a time-varying pattern of harmonics, when sounds from different sources overlap, all of the harmonic components are mixed in time and frequency. A listener can use the timing, harmonic, and amplitude and frequency modulation relationships among the components to parse the sound wave into discrete component sources. These component sources often are happening contemporaneously, and are called "streams" in the literature of acoustic perception. In this thesis, however, we will limit ourselves to splitting sound solely in the time domain, with only one stream per sample.

1.1.3 Objectives

There will never be a lack of natural-world sounds to record and feed into a computer system. In an application which uses empirically recorded samples, playing them back blindly is inefficient in terms of resources and often less than optimal in terms of the desired effect the sound has on a user. The more information the application has about the sound, the more it can tailor its output to its situation.

Our implementation aims to provide users with a tool to automatically manipulate sound sources and soundscapes. Along with the core segmentation/resynthesis tool, we provide a higher-level interface which allows multiple randomized sounds to be played at once, each with a number of controls that affect how it appears in the soundscape. There are controls for automating how often a sound stream is triggered and how long an instance lasts. There are pan and gain controls for controlling stereo amplitude over the course of the instance. This allows a user to assemble a constantly changing auditory soundscape from just a few representative samples. Such a tool is useful for immersive virtual environments, video games, soundtracks for film, auditory displays, and even music composition.

1.2 Thesis Organization

This thesis is divided into six chapters. Chapter 1 introduced the problem and stated the objectives of the work. In chapter 2, we provide a general background on the signal processing tools used. Chapter 3 details our earlier, different approach to creating sound textures. The segmentation and resynthesis algorithm is detailed in chapter 4. Chapter 5 shows the results of the research, including some user testing to demonstrate utility. Finally, chapter 6 summarizes goals and results, and offers some potential future research areas.

Chapter 2

Background and Related Work

2.1 Overview

This chapter will review related research into the representation and analysis of audio signals for the purpose of segmentation. The wavelet transform will then be introduced, and we will review some of the methods which use wavelets for signal segmentation. We then review some of the related work which uses wavelet packets. Finally, techniques from the electro-acoustic community such as granular synthesis and concatenative synthesis are briefly touched upon.

2.2 Representation

The sound samples we use for this algorithm are input in pulse-code modulation (PCM) format. PCM means that each signal sample is interpreted as a "pulse" at a particular amplitude. To determine where to segment an input audio sample, it is useful to change representations from the original PCM format to something which more compactly expresses the information we are interested in.

Ultimately, we need a representation that can aid us in determining when an audio signal changes, and by how much. With the location of change, we can segment the signal into natural grains. With a metric of how much the signal has changed, we can offer a threshold value that increases or decreases the coarseness of the grains, and also compare grains to each other to estimate how well they fit together.

There are a multitude of ways to represent a sound signal. Using an appropriately multi-resolutional approach, [SY98] identifies three main schemes for classifying audio signals:

1. Signal statistics, such as

   (a) mean

   (b) variance

   (c) zero-crossings

   (d) auto-correlation

   (e) histograms of samples/differences of samples, computed either on the whole data or on blocks of data.

   The main problem with low-level statistics is that they are vulnerable to alterations in the original signal, making them fragile in the presence of noise. (A sketch of these statistics follows this list.)

2. Acoustical attributes. Another general category of audio signal classification tools is acoustical attributes such as pitch, loudness, brightness and harmonicity. Statistical analysis is then applied to these attributes to derive feature vectors. Because we hope to preserve as many of the acoustical attributes as possible in our resynthesized sound, this appears to be a much better alternative.

   Most of these measurements, however, are directed towards music, where the sound is already relatively structured. For less structured environmental sounds, where possibly many different events are happening simultaneously, these measurements are less effective. Calculating acoustic attributes is also much more expensive computationally overall, and suffers from the same lack of robustness to noise as traditional signal statistics.

3. Transform-based schemes. Here the coefficients of a transform of the signal are used for classification, to reduce the susceptibility to noise. There are many advantages to using signal transforms in analysis, such as the potential for compression, and the ability to tailor the transform used to bring out the characteristics of the signal that are most important to the task at hand. The canonical transform used in audio processing is the Fourier transform, in part because of the efficiency of the Fast Fourier Transform (FFT), and its utility over a broad range of tasks. For reasons explained below, we instead chose the Discrete Wavelet Transform, which has the same algorithmic complexity as the FFT.
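As a point of reference for scheme 1, the sketch below computes a few of the listed statistics with NumPy; the division into fixed-size blocks is our own illustrative choice.

```python
import numpy as np

def block_statistics(signal, block=1024):
    """Low-level statistics per fixed-size block (scheme 1 above)."""
    stats = []
    for start in range(0, len(signal) - block + 1, block):
        x = signal[start:start + block]
        zero_crossings = np.count_nonzero(np.diff(np.signbit(x)))
        lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]   # lag-1 auto-correlation
        stats.append({"mean": x.mean(), "variance": x.var(),
                      "zcr": zero_crossings / block, "autocorr": lag1})
    return stats
```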

2.3 Signal Transforms

A spectrum can be loosely described as "a measure of the distribution of signal energy as a function of frequency" [Roa96]. We must define this term loosely because, according to Heisenberg's uncertainty principle, any attempt to improve the time resolution of a signal will degrade its frequency resolution [Wic94]. Both the time waveform and frequency spectrum cannot be made arbitrarily small simultaneously [Sko80]. The product of these two resolutions, the time-bandwidth product, remains constant for any system. So any representation of a signal's spectrum is necessarily a trade-off between these competing concerns.

Long a staple of digital signal processing, the Fourier transform is one way of calculating a spectrum. It was originally formulated by Jean Baptiste Joseph Fourier (1768-1830). Its main principle is that all complex periodic waveforms can be modeled by a set of harmonically related sine waves added together. The Fourier transform for a continuous time signal x(t) can be defined as:

    X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt    (2.1)

The results of evaluating X(f) are analysis coefficients which define the global frequency f in a signal. As shown in Figure 2.1, the coefficients are computed as inner products of the signal with sine wave basis functions of infinite duration. This means abrupt changes in time in a non-stationary signal are spread out over the whole frequency axis in X(f).

To obtain a localized view of the frequency spectrum, there is the Short-Time Fourier Transform (STFT), in which one divides the sample into short frames, then puts each through the FFT. As shown in Figure 2.1, the smaller the frame, the better the time resolution, but employing frames raises a whole new set of problems.

To begin with, it is impossible to resolve frequencies whose periods are longer than the frame. Using frames also has the side effect of distorting the spectrum measurement. This is because we are measuring not purely the input signal, but instead the product of the input signal and the frame itself. The spectrum that results is the convolution of the spectra of the input and the frame signals.

For each frame, we can think of the STFT as applying a bank of filters at equally spaced frequency intervals. The frequencies are spaced at integer multiples of the sampling frequency divided by the frame length. Artifacts of frame analysis arise from the fact that the samples analyzed do not always contain an integer number of periods of the frequencies they contain. There are a number of strategies to curb the effect of this "leakage", such as employing an envelope on each frame that accentuates the middle of the frame at the expense of the sides, where most of the leaking is [Roa96].
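To illustrate frame-based analysis with such an envelope, here is a minimal STFT sketch using a Hann window; the frame and hop sizes are arbitrary example values.

```python
import numpy as np

def stft(signal, frame=1024, hop=512):
    """Short-Time Fourier Transform with a Hann envelope on each frame."""
    window = np.hanning(frame)   # tapers the frame edges to reduce leakage
    spectra = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame] * window
        spectra.append(np.fft.rfft(x))   # spectrum of one windowed frame
    return np.array(spectra)             # rows: frames, columns: frequency bins
```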

For audio purposes, the FFT also has the drawback that it divides the frequency spectrum up into equal linear segments. The user is put into an inescapable quandary: narrow frames provide good time resolution but poor frequency resolution, while wide frames provide good frequency resolution but poor time resolution. Moreover, if the frames are too wide, the signal within them cannot be assumed to be stationary, which is something the FFT depends on.

The problem with using linear segments is that humans perceive pitch on a scale closer to logarithmic [War99]. We are relatively good at resolving low-frequency sounds, but as the frequency increases, our ability to recognize differences decreases.

2.4 The Wavelet Transform

A wavelet is a waveform with very specific properties, such as an average value of zero and an effectively limited duration. Analysis with wavelets involves breaking up a signal into shifted and scaled versions of the original (or mother) wavelet. Wavelet analysis uses a time-scale region rather than a time-frequency region. Since only artificial tones are purely sinusoidal, this is not in itself a drawback.

The wavelet transform is capable of revealing aspects of data that other signal analysis techniques miss, such as trends, breakdown points, discontinuities in higher derivatives, and self-similarity. Intuitively, the wavelet decomposition calculates a "resemblance index" between the signal and the wavelet. A large index means the resemblance is strong; otherwise it is slight. The indices are the wavelet coefficients.

In contrast to the linear spacing of channels on the frequency axis in the STFT, the wavelet transform uses a logarithmic division of bandwidths. This implies that the relative channel bandwidth ∆f/f is constant for the wavelet transform, while in the STFT, the frame duration is fixed.

To define the continuous wavelet transform (CWT), we start by confining the impulse responses of a particular filter bank to be scaled versions of the same prototype \psi(t):

    \psi_a(t) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t}{a}\right)    (2.2)

where a is a scale factor, and the constant 1/\sqrt{a} is used for energy normalization. \psi(t) is often referred to as the mother wavelet. With the mother wavelet, we can define the continuous wavelet transform (CWT) as

    CWT_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt    (2.3)

Here \psi^{*} denotes the complex conjugate of \psi, and x(t) is the original signal. As the scale a increases, the scaled wavelet \psi(t/a) (the filter impulse response) becomes spread out in time and thus takes only longer durations into account. Both global and very local variations are measured by using different scales a; b controls the translation of the wavelet along the signal.

We will limit our discussion to the discrete wavelet transform, which involves choosing dyadic scales and positions (powers of two). In the discrete wavelet transform, a decomposition into wavelet bases requires only the values of the transform at the dyadic scales:

    a = 2^j  and  b = k \cdot 2^j.

The analysis is more efficient, and just as accurate for our purposes, as the continuous wavelet transform. The discrete wavelet transform approach was first developed by Mallat [Mal89]. It is based on a classical scheme known as the two-channel sub-band coder.

Unlike the FFT, which uses sinusoids of infinite duration, wavelets are localized in time. Leakage is also a different concern with the wavelet transform. If we confine the length of our signal to a power of two, the wavelet transform will analyze the signal in an integer number of steps, so leakage is not an issue. It doesn't matter that the frequencies present in the signal don't line up with power-of-two boundaries, since we are not measuring frequencies per se, but scales.

The effectiveness of the Discrete Wavelet Transform (DWT) for a particular application can depend on the choice of the wavelet function. For example, Mallat [MZ92] has shown that if a wavelet function which is the first derivative of a smoothing function is chosen, then the local maxima of the DWT indicate the sharp variations in the signal, whereas the local minima indicate slow variations.

As Figure 2.1 shows, the wavelet transform gives better scale resolution for lower frequencies, but worse time resolution. Higher frequencies, on the other hand, have less resolution in scale, but better in time.

One notable aspect of wavelet transforms as they pertain to audio processing is that they are not shift-invariant. For this reason, we use the energies of the coefficients rather than the raw coefficients themselves for our metrics, as in [PK99].

Figure 2.1: Contrast between frequency-based, STFT-based, and wavelet views of the signal

2.4.1 Implementation

From an implementation-centred point of view, we can think of the wavelet transform as a set of filterbanks, as in Figure 2.2.

Figure 2.2: The wavelet filtering process

Here the original signal, S, passes through two complementary filters and emerges as two signals. The low- and high-pass decomposition filters (L and H), together with their associated reconstruction filters (L' and H'), form a system of quadrature mirror filters.

The filtering process is implemented by convolving the signal with a filter. Initially, we end up with twice as many samples. Throwing away every second data point (downsampling) solves this problem. We are still able to regenerate the entire original sample with the downsampled signal.

The decomposition process can be iterated, with successive approximations being decomposed in turn, so that one signal is broken down into many lower-resolution components. This is the wavelet decomposition tree, otherwise known as the Mallat tree. The average coefficients are the high-scale, low-frequency components of the signal. Difference coefficients are the low-scale, high-frequency components.

To reconstruct the original signal from the wavelet analysis, we employ the inverse discrete wavelet transform. It consists of upsampling and filtering. Upsampling is the process of lengthening a signal component by inserting zeros between samples.
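As a minimal sketch of one analysis/synthesis stage, the following uses the two-tap Haar filter pair for concreteness; any quadrature mirror filter pair would follow the same filter-and-downsample pattern. It assumes an even-length input.

```python
import numpy as np

def haar_analysis(s):
    """One filterbank stage: filter with L and H, then downsample by two."""
    s = np.asarray(s, dtype=float)
    avg = (s[0::2] + s[1::2]) / np.sqrt(2)    # low-pass (average) coefficients
    diff = (s[0::2] - s[1::2]) / np.sqrt(2)   # high-pass (difference) coefficients
    return avg, diff

def haar_synthesis(avg, diff):
    """Inverse stage: upsample (interleave) and apply the reconstruction filters."""
    s = np.empty(2 * len(avg))
    s[0::2] = (avg + diff) / np.sqrt(2)
    s[1::2] = (avg - diff) / np.sqrt(2)
    return s                                  # perfectly reconstructs the input
```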

For a more in-depth introduction to the wavelet transform, the reader is directed to articles such as [SN96, KM88, Mal89, MMOP96] for the theory, and [Wic94, Cod92] for ideas on implementation.

2.5 Audio Segmentation using Wavelets

Once we have expressed the signal using the wavelet representation, we use it to identify the potential points at which to split it into grains. There have been many attempts at segmenting sound using the wavelet transform, most of those we looked at geared towards speech analysis.

Attempts at using signal statistics on wavelet transform coefficients have also been made. However, a statistical prior model for wavelet coefficients is complicated because wavelet coefficients do not have a Gaussian distribution [Mal89]. Though wavelet coefficients are decorrelated, their values are not statistically independent, another limitation to take into consideration when using statistical properties.

An important step for classification techniques such as [LKS+98, SG97, TLS+94] is to characterize the signal in as small a number of descriptors as possible, without throwing away any information that would help classification. Reducing the feature set has several advantages: primarily, it is computationally more cost-effective, but it also aids in generalization. Our task is to figure out where it is best to segment the signal into grains. We must then define what happens in the signal at the start or end of an event before looking for these features.

Most speech recognition algorithms include a segmentation step to separate the phonemes for later recognition. However, segmentation in speech recognition has important differences from our goals. Smoothness of transition is not an issue for recognition, because in human speech phonemes usually blend into each other to such an extent that any splits are usually in areas with a very high degree of frequency change, too much so for our purposes. This is acceptable for speech analysis because the results are only needed for the purposes of recognition, not resynthesis.

2.5.1 Speech Classification Techniques

A paper by Sarikaya and Gowdy [SG97], on identifying normal versus stressed speech, details a scheme that showed some promise as a signal segmentation technique applicable to our needs. In their algorithm, an 8 kHz sample is segmented into frames of 128 samples. A lookahead and history of 64 samples is added, to make it 256, with a skip rate of 64 samples. This representation is the base of a classification algorithm that uses a two-dimensional separability distance measure between two speech parametrization classes.

Their separability measure uses Scale Energy, which represents the distribution of energy among frequency bands, defined as:

    SE^{(k)}(s_i) = \frac{\sum_{m} \left| (W_{\psi} x)(s_i, m) \right|^2}{\sup_n \{ SE^{(n)}(s_i) \}}    (2.4)

where W_{\psi} x is the wavelet transform of x, k is the frame number, i the scale number, s_i is the i-th scale, and n spans all available frames. The denominator is a normalizing constant. They use the scale energy for an autocorrelation computation that measures the separability of two signals. The ACS, Autocorrelation of Scale Energies, measures how correlated adjacent frames are. It can be defined as:

    ACS_{l}^{s_i}(k) = \frac{\sum_{n=k}^{k+L} SE^{(n)}(s_i)\, SE^{(n+l)}(s_i)}{\sup_j \{ ACS_{l}^{s_i}(j) \}}    (2.5)

Here j is an index which spans all correlation coefficients at a given scale, and l is the correlation lag, which is fixed in this paper at 1. If we had set l = 0, we would look at only one frame at a time, so the ACS would model the normalized power in scale i. For l > 0, ACS models changes in the frame-to-frame correlation variation of SE parameters. Sarikaya also fixes the correlation frame length L at 6. This means the ACS parameters are measures of how correlated six adjacent frames are.

Used in this way, the autocorrelation of scale energies is a comparison between levels to bring out hidden identifying features. On a test of the phoneme /o/ in "go," they achieved satisfactory scores with both the SE and the ACS parameters, although ACS, which takes into account the change between frames, was significantly higher. Readers can refer to the paper [SG97] for further information on testing methodology and complete results.

This work is more tuned toward recognition and distinction between classes than we are interested in. The Scale Energy parameter is useful for identifying features of individual frames, and we adopt a similar approach using wavelet coefficients, as does the paper by Alani [AD99] discussed in the next section. ACS, however, tends to smooth out local changes because it is measured over a number of frames. We are more interested in the locations and magnitude of these local changes, and the differences between frames, and so do not adopt this technique.

2.5.2 Using Differences between Coefficients

A paper by Alani and Deriche [AD99] details another approach to segmenting speech into phonemes. The signal is broken into small frames of 1024 samples each, with an overlap of 256 samples. Sound samples are input as CD-quality 44.1 kHz, so each frame is approximately 23.2 ms long. To provide metrics for what is happening during the length of each frame, each frame is analyzed with the wavelet transform. The energies of each of the first six levels of difference coefficients are calculated for each frame.

Their next step is to segment the signal based on the differences in energy between each level of difference coefficients in consecutive frames. A Euclidean distance function over four frames is used. As an example, we calculate the strength of transition between frames 2 and 3:

    D(f_2, f_3) = \sum_{i=1}^{2} \sum_{j=3}^{4} \sum_{k=1}^{6} \left( X_{i,k} - X_{j,k} \right)^2    (2.6)

Figure 2.3: Distance measure for transition between frames 2 and 3. The arrows represent difference calculations.

Here k refers to the wavelet level, i and j are frame numbers, and X_{i,k} and X_{j,k} are the energies of the wavelet difference-coefficient levels. Only like levels are compared, and the differences between them are added up to obtain an overall difference between frames.
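To make the computation concrete, here is a sketch of the per-frame level energies and the distance of Equation 2.6, written with the PyWavelets package; the choice of the db4 wavelet is ours, since the filter used in the paper is not restated here.

```python
import numpy as np
import pywt

def level_energies(frame, levels=6, wavelet="db4"):
    """Energy of each of the first `levels` difference-coefficient levels."""
    coeffs = pywt.wavedec(frame, wavelet, level=levels)
    return np.array([np.sum(d ** 2) for d in coeffs[1:]])  # skip the averages

def transition_strength(energies, t):
    """Distance across the boundary between frames t and t+1 (Equation 2.6):
    compare the two frames before the boundary with the two frames after."""
    return sum(np.sum((energies[i] - energies[j]) ** 2)
               for i in (t - 1, t) for j in (t + 1, t + 2))
```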

Alani and Deriche [AD99] use the algorithm to isolate phonemes which are then fed into a separate speech-recognition system. In normal speech, vowels have pitches which are relatively constant over time, whereas consonants are not pitched at all, and so have frequencies that change considerably over the course of the phoneme. Additionally, in human speech phonemes meld into each other, making isolation an even more difficult task, one that can really only be successfully achieved using context-sensitive information. Taking this into account, the authors pick the points where the distance measure is highest, reasoning that this is where the speakers are going from one phoneme to another.

Isolating phonemes, however, is quite a different task than trying to isolate grains. We would like something more on the lines of syllables, where there are clearer demarcations to latch on to. When segmenting, we also have to take into account how the grains will fit back together again, something Alani and Deriche did not consider. As described in the next chapter, we adopt a modified version of this algorithm which is more suitable for a coarser level of detail (on the level of syllables rather than phonemes) and for our differing requirements.

2.6 Segmenting in the Wavelet Domain

Not only can we do the analysis in the wavelet domain, but it is also an option to do the separation and re-connection of grains there as well. The inverse wavelet transform would then be performed to obtain the new, modified signal.

This is the approach taken by Bar-Joseph et al. [BJDEY+99]. The authors use a comparison step where wavelet coefficients representing part of the sample are swapped only when they are similar, and call their approach "statistical learning". The aim is to produce a different sound statistically similar to the original. Satisfactory results are reported, with "almost no artifacts due to the random granular recombination of different segments of the original input sound." Unfortunately, no sound samples are available to support these claims.¹

¹ At the 2001 International Computer Music Conference, I asked a number of people who had attended the conference in 1999, when Bar-Joseph's work was presented, but nobody could remember what it sounded like.

To find out for ourselves, we implemented this algorithm as it is described in the paper. More details about the implementation are given in the next chapter. The results were perceivably different, enough to make the technique inappropriate for our use. With any signal, we found that there was a characteristic "chattering" effect as parts of the signal were repeated quickly right after each other.

The problems stem from the way the algorithm produces new variations. Only the local neighbours in the multi-resolution tree are taken into account when calculating similarity, and the swapping is very fine-grained. Because swapping only takes place when the coefficients are similar, much of the large-scale pattern is preserved, resulting in a sound that still has much of the same order of events. The events themselves are changed to a degree, but also muddied because of convolution artifacts.

These convolution artifacts arise because switching coefficients of the wavelet transform has unpredictable results. Unless the changes are on the dyadic boundaries, switching really changes the timbre of the input sound rather than exchanging sound events. These changes cannot be easily predicted; they have to do with the choice of wavelet filter, the filter length, and the position of the coefficients. The convolution involved in reconstruction makes this process virtually impossible to do without introducing unwanted artifacts.

Extending the sample to an arbitrary length is also non-trivial. Unless all that is needed is extending it by a power of two, the inverse wavelet transform becomes much more complex. Extending the sample by a power of two is also unsatisfactory. The result is very similar to looping the original sample, but with a lot of added artifacts, which are the very things we would like to avoid.

2.7 Wavelet Packets

Wavelet packets were seriously considered as a method for representing natural grains. They differ from wavelets in that at every decomposition step, both the difference and average coefficients are further broken down. The results are particular linear combinations or superpositions of wavelets. They form bases which retain many of the orthogonality, smoothness, and localization properties of their parent wavelets [Wic94].

Wavelet packets seem to have a lot of potential: depending on the strategy used to find a suitable basis from the over-complete set of packets in a full wavelet packet transform, you can find the basis which represents the signal with the fewest non-zero coefficients. See Wickerhauser [Wic94], for example, for a discussion of various basis-finding algorithms.

A wavelet packet library was written in Java expressly to see if something similar to Bar-Joseph's work [BJDEY+99] could be done with wavelet packets. Instead of interchanging bare wavelet coefficients, we would interchange the wavelet packets, which ostensibly would hold information about whole events, rather than sample-level information.

While efficiency of representation is important, there are some key problems that cannot easily be overcome. First of all, normal wavelet packets are not shift-invariant. This makes comparison between different regions of the wavelet packet transform extremely difficult.

Another difficulty with comparison has to do with packet levels. Every packet is denoted by the order in which the average or difference coefficients have been further broken down, as shown in Figure 2.4. However, there is no guarantee that all portions of the signal will be represented on the same packet level. If we have a packet that is denoted by ADADDDAD in Figure 2.4, it is not trivial to exchange it with another packet denoted by DADD. They are different sizes, and have to interact with different packets in order to be properly put through the inverse wavelet packet transform. This makes switching the positions of packets very difficult.
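The mismatch is easy to see with PyWavelets' wavelet packet trees, where each node is named by its path of average ('a') and difference ('d') steps; the signal, wavelet, and depth here are arbitrary example values.

```python
import numpy as np
import pywt

signal = np.random.randn(1024)
wp = pywt.WaveletPacket(signal, wavelet="db2", maxlevel=8)

deep = wp["adadddad"]   # a packet eight levels down the tree
shallow = wp["dadd"]    # a packet only four levels down
# The two nodes hold different numbers of coefficients and cover different
# time spans, so they cannot simply trade places before the inverse transform.
print(len(deep.data), len(shallow.data))
```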

Some artificial means of constraining the representation could be taken, such as limiting the result to be all on one level of the wavelet packet tree. However, this seriously undermines the whole point of finding the best basis, as the number of coefficients doubles with each level.

Related literature supports the problems listed above. In [WW99] Wickerhauser notes that the best-basis algorithm is not well suited for isolating phonemes, since there is no reason for phonemes to even "begin" and "end" at dyadic points. Packets are thus not guaranteed to represent entire, re-arrangeable events.

Figure 2.4: Wavelet packet decomposition

Wickerhauser instead uses a segmentation algorithm to split up the time axis of the unprocessed signal. The segmentation algorithm measures the instantaneous frequency at discrete points, and places segmentation points where this changes.²

Despite the problems raised above, there have been attempts to use wavelet packets to aid in signal classification. However, surmounting the above concerns seems to use up any advantage over regular wavelets. For example, Sarikaya [SG98] has proposed an alternate version of his paper [SG97], discussed in 2.5.1, using wavelet packets instead of wavelet difference coefficients. This involves characterizing each window by subband features derived from the energy of wavelet packets. The results, however, were not different enough from the earlier, wavelet-based approach for us to change our algorithm.

Delfs and Jondral [DJ97] use the best-basis algorithm for wavelet packets to characterize piano tones. They use a specialized, shift-invariant discrete wavelet packet transform to improve classification. The packet coefficients in the best basis whose energies exceed a threshold are normalized, then compared to piano tones in a database which have been similarly analyzed. The Euclidean distance determines the success of the match. This system was used for identification only, not resynthesis. Their results indicate that these specialized wavelet packets do not seem to offer any advantage over simple discrete Fourier transform features.

² The algorithm was supposed to be published in another paper, but unfortunately, it was not. To my knowledge, unpublished copies are not available either.

2.8 Granular Synthesis

There is a long history in the electro-acoustic music community of arranging small segments of sound to create larger textures. Granular synthesis, pioneered by Curtis Roads [Roa88] and Barry Truax [Tru88, Tru94], is a method of sound generation that uses a rapid succession of short sound bursts, called granules, that together form larger sound structures. Granular synthesis is particularly good at generating textured sounds such as a waterfall, rain, or wind. The grains in this case are taken as portions of a larger sound sample which can be specified to the algorithm. The sound sample itself has a large influence on the result of the granular synthesis, but since it can be specified by the user from anywhere, it is impossible to facilitate easier interaction with this parameter.

Curtis Roads describes granular synthesis as involving "generating thousands of very short sonic grains to form larger acoustic events" [Roa88]. A grain is defined as a signal with an amplitude envelope in the shape of a bell curve. The duration of a grain typically falls into the range of 1-50 msec. This definition puts granular synthesis entirely into the realm of new sound creation rather than manipulation which preserves the original perceptible properties of the sound. He likens granular synthesis to particle synthesis in computer graphics, used to create effects such as smoke, clouds, and grass.
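A minimal sketch of such a grain is a sine burst under a bell-shaped (Hann) envelope, with a dense random scattering of grains forming a larger texture; the frequencies and durations below are arbitrary values within the range quoted above.

```python
import numpy as np

def make_grain(freq=880.0, dur=0.025, sr=44100):
    """One grain: a sinusoid under a bell-curve amplitude envelope."""
    t = np.arange(int(dur * sr)) / sr
    return np.hanning(len(t)) * np.sin(2 * np.pi * freq * t)

# Scatter several hundred grains at random to build one second of texture.
rng = np.random.default_rng(0)
out = np.zeros(44100)
for _ in range(400):
    g = make_grain(freq=rng.uniform(200.0, 2000.0))
    start = rng.integers(0, len(out) - len(g))
    out[start:start + len(g)] += g
```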

Samples from the natural world have been used in granular synthesis, most notably by Barry Truax [Tru94]. He creates rich sound textures from extremely small fragments of source material. The scheme relies on grain attack and decay envelopes to eliminate clicks and transients. The primary goal is time-shifting: drawing out the length of the sample to reveal its spectral components as a compositional technique.

There has been work done on extracting grains from natural signals and recombining them with phase alignment [JP88]. Phase alignment is used because with such small grains, the method of joining them has a large effect on the resulting sound. This strategy helps avoid discontinuities in the waveforms of reconnected grains. Phase alignment works for both periodic and noisy signals. However, it is primarily a way to alter the original signal, for instance for time-stretching, by combining parts of the signal in various ways.

Gerhard Behles uses another method [BSR98], with pitch markers which reference each pitch period (the inverse of the local fundamental frequency). The onset of the source sound excerpt is quantized to the closest pitch marker. This method is less computationally expensive than the one outlined by Jones. It has trouble, however, with inharmonic signals.

Granular synthesis with natural signals deals with arbitrary extraction of grains from the original sample, without regard to what is going on in the signal. The majority of the granular synthesis literature refers to it as a compositional approach with no intention of being perceptibly similar to the original sound.

2.9 Concatenative Sound Synthesis

Also working primarily in the electro-acoustic music community, Diemo Schwarz has developed the CATERPILLAR system [Sch00], which uses a large database of source sounds, and a selection algorithm that data-mines these sounds to find those that best match the sound or phrase to be synthesized.

The first step of audio segmentation is not discussed in Schwarz's paper; readers are instead directed to a thesis [Ros00] available only in French. Segments are characterized by acoustical attributes, such as pitch, energy, spectrum, spectral tilt, spectral centroid, spectral flux, inharmonicity, and voicing coefficients. For each feature, low-level signal statistics are calculated, which are then stored in the database as keys to the segment.

Because of the complexity and number of the features measured for each sound segment, this system is not meant to be real-time. Neither is it intended to be automatic: every segment is individually chosen by the user. This makes it inapplicable for our goals, but it does show the breadth of applications that concatenation of audio segments can address.

2.10 Physically-Based Synthesis

Kees van den Doel's method for audio synthesis via modal resonance models [vdDKP01] can be viewed as a kind of resynthesis, where the sound's physical properties are estimated and used for resynthesis. It has been used to create a diverse number of sounds, such as impacts, scraping, sliding and rolling.

The modal model M = \{f, d, A\} consists of modal frequencies represented by a vector f of length N, decay rates which are specified as a vector d of length N, and an N \times K matrix A, whose elements a_{nk} are the gains for each mode at different locations. The modeled response for an impulse at location k is given by

    y_k(t) = \sum_{n=1}^{N} a_{nk}\, e^{-d_n t} \sin(2\pi f_n t)    (2.7)

with t \geq 0; y_k(t) is zero for t < 0. Geometry and material properties, such as elasticity and texture, determine the frequencies and damping of the oscillators. The gains of the modes are dependent on the location of contact on the object.
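A direct rendering of Equation 2.7 is straightforward; in this sketch the three modes (frequencies, decay rates, and gains) are invented example values, not measured parameters.

```python
import numpy as np

def modal_impulse_response(freqs, decays, gains, dur=1.0, sr=44100):
    """Sum of damped sinusoids y_k(t), per Equation 2.7, for t >= 0."""
    t = np.arange(int(dur * sr)) / sr
    y = np.zeros_like(t)
    for f_n, d_n, a_nk in zip(freqs, decays, gains):
        y += a_nk * np.exp(-d_n * t) * np.sin(2 * np.pi * f_n * t)
    return y

# Example: three modes of a hypothetical struck object at one contact point.
y = modal_impulse_response(freqs=[440.0, 1230.0, 2690.0],
                           decays=[3.0, 5.5, 9.0],
                           gains=[1.0, 0.6, 0.25])
```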

For simple object geometries, parameters for the modal model can be derived, but for most realistic objects, where derivation becomes untenable, we can directly estimate location-dependent sound parameters. Josh Richmond [RP00] has developed a method to do this using the telerobotic system ACME [PLLW99]. The object in question is poked with a sound effector at various locations, creating a map of sounds from which it is possible, using an algorithm developed by van den Doel, to estimate the sample's dominant frequency modes. These modes are fed into the modal model algorithm to produce resynthesized sounds.

This technique is very effective for creating andmanipulating models of thesound

propertiesof everyday objects. However, so far it hasprimarily beenapplied to contact

sounds, where the objects making the noise can be modeled or measured satisfactorily.

For background sounds, suchas wind, animal cries, and traffic noises, it is possible to

userecordedsamplesto estimatethe modalmodelparameters. However, tweaking these

parametersfor thesetypesof sounds,whichusually haveahighdegreeof variation,is time-

24

Page 32: Manipulation and Resynthesis of Environmental Sounds with ...€¦ · dynamic structures that comprise auditory phenomena from the perspective of musical in-strument sound structures

Producing a sufficiently randomized stream with this technique is also non-trivial, and the method to achieve this for one type of environmental sound wouldn't necessarily be transferable to others. So while this technique is very adept at synthesizing contact sounds, we think there is room for a resynthesis system specifically for background, environmental audio.


Chapter 3

Our Early Attempts at

Segmenting Audio Samples

Part of the genesis of this thesis was the paper by Bar-Joseph et al. [BJDEY+99] detailing their attempt at automatic granulation of a sample using wavelets. A description has been given in Section 2.6. Intrigued by its promise, we implemented the algorithm detailed in the paper. Matlab was a convenient language to use in this case because the wavelet toolbox [MMOP96] has all of the wavelet functionality we needed to implement Bar-Joseph's algorithm.

Although the paper claims that the authors achieved "a high quality resynthesized sound, with almost no artifacts due to the random granular recombination of different segments of the original input sound", our results were disappointing. Some implementation details were omitted from the paper, so parts of the algorithm had to be guessed and re-invented. There are also no samples available on the web to verify their claims.

A note about terminology, taken from the Bar-Joseph paper: if we consider a Mallat tree as defined in Section 2.4.1 turned upside-down, we end up with a binary tree with the first level of difference coefficients of the wavelet transform on the very bottom. In this view, predecessors refer to the adjacent wavelet coefficients to a node's left on the same level, and ancestors are coefficients in higher levels that have, in the binary tree sense of the


word, this coefficient as a child.

The object is to mix up the wavelet coefficients in this binary tree representation in a judicious manner, so that when the inverse transform is performed, the result sounds like a variation of the original.

To replace a wavelet coefficient on a given level of the binary tree representation, we examine the node's predecessors and ancestors. Other wavelet coefficients from the same level are considered as candidates to replace it if they have similar predecessors and ancestors. This similarity is measured within a threshold, which is specified by the user.

In the naive Bar-Joseph algorithm, all the neighbours of a coefficient are taken into account when deciding what to switch. Doing this over the whole transform results in a quadratic number of checks. This makes the algorithm impractically slow, so the authors suggest limiting the search space to the children of the candidate set of nodes of the parent, which greatly decreases the search space.
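The following sketch shows our reading of this candidate search on one level of the tree, with a simple squared-difference similarity test against the user threshold. Bar-Joseph et al. do not publish reference code, so the names and the exact similarity measure here are assumptions.

    // Sketch of the candidate search on one level of the coefficient tree.
    // Each coefficient's "context" (its predecessors on the same level plus
    // its ancestors) is assumed to be flattened into one array. This is an
    // interpretation, not Bar-Joseph et al.'s actual code.
    import java.util.ArrayList;
    import java.util.List;

    public class CandidateSearch {
        // Squared-difference distance between two context arrays.
        static double contextDistance(double[] ctxA, double[] ctxB) {
            double d = 0.0;
            for (int i = 0; i < ctxA.length; i++) {
                double diff = ctxA[i] - ctxB[i];
                d += diff * diff;
            }
            return d;
        }

        // Indices on this level whose context is within the user threshold.
        static List<Integer> candidates(double[][] contexts, int target,
                                        double threshold) {
            List<Integer> out = new ArrayList<>();
            for (int i = 0; i < contexts.length; i++) {
                if (i != target
                        && contextDistance(contexts[target], contexts[i]) < threshold) {
                    out.add(i);
                }
            }
            return out;
        }
    }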

Our implementation found that the candidates were almost always only the immediate neighbours, because they had the most ancestors and neighbours in common with the node to be replaced. The node adjacent to the one to be replaced has all the same neighbours except itself, and all of the same ancestors as well.

Almost never were there any nodes other than the immediate neighbours being considered as candidates, and almost never were the immediate neighbours not considered. This was the case no matter what the threshold was set to. Either only the immediate neighbours were considered, or all of the nodes in the entire level were. This explains some of the effects we observed in our results, which tended to sound similar to the input sample, but with slight artifacts from the wavelet resynthesis. There were no large-scale reorganizations of the sample, only local changes.

To promote more large-scale changes, we only allowed shuffling of the coefficients in the higher levels of the inverted Mallat tree, ones that represented more than 10 ms of audio. The rest of the tree was re-organized according to the last level scrambled. For any coefficient on the last level scrambled, not only is that coefficient taken from somewhere


else on the same level, but also that coefficient's children, and the children's children, and so on until the end of the tree. Thus for any coefficient on the last level processed, it and all of its descendants are moved en masse.

However, allowing coefficient shuffling only at higher levels is only a partial solution, because the inverse wavelet transform works on at least as many samples as the length of the filter at one time. There is really no direct analogy to the parents and children of a binary tree, because even the smallest filter has more than two coefficients. Because of the convolution step that happens between the filter and the coefficients in the inverse wavelet transform, the effect of any one coefficient is spread over a number of coefficients equal to twice the filter length. This almost invariably leads to artifacts, because there is no guarantee that the inverse wavelet transform will produce a smooth signal from these altered coefficients. For applications such as computer music, perhaps these artifacts are desirable, although they won't be predictable in any useful way. For applications such as environmental sound production, it is a definite drawback.

Despite our concerns, the results did have some promise. While there were artifacts, the sound was recognizable. The artifacts were mostly due to "chattering," with the same portion of the sound repeated a few times without enough of a decay envelope. At this point, we thought there was potential to solve these problems, so the decision was made to try a Java implementation that streamed audio in real time. This required a Java version of the wavelet transform, which was then implemented, and is discussed in Section 4.5.

3.1 A Streaming Granular Synthesis Engine

To achieve real-time performance, we computed the possible candidates for replacement for every coefficient beforehand, so that when generating audio, all that had to be done was to construct a Mallat tree from the pre-computed candidate sets, then do an inverse wavelet transform. The word "streaming" is used loosely: the smallest unit was the whole wavelet tree, which was the same length as the input sound.

Streaming audio in this way gave mixed results. Although we managed to get the


audio out in real time, because of the local nature of the granulation, the sound wasn't adequately changed to allow for seamless combination of finished versions in the way the paper described. Although events were mixed, there was no guarantee that the end of one and the start of another would flow well.

The root of the problem was still the granularity of the coefficients. Their interdependence made it impossible to move audio events around cleanly. We needed a better way to characterize the events of the signal. Wavelet packets were briefly considered, then rejected for reasons detailed in Section 2.7.

After abandoning the wavelet packet transform, we hit upon the idea of using a speech recognition algorithm that used the wavelet transform in the analysis step. Because we valued the fidelity and similarity of the input sound to the output, using the wavelet transform for analysis only gave us much more satisfactory results.


Chapter 4

Segmentation and Resynthesis

In this chapter, we describe the steps towards an implementation of a resynthesis engine based on natural grains. First is the segmentation algorithm, which analyzes the input sound signal and outputs a series of graded points in the sample that are most appropriate to segment around. The user can then fine-tune the default threshold to determine the total number of segments in the sample. Next is the method to grade how well segments fit together for the purposes of playback. We then describe our implementation.

4.1 Segmentation

The core of our segmentation algorithm is a modified version of the method of Alani and Deriche [AD99] described in Section 2.5.2, in which a signal is divided into frames and analyzed with the wavelet transform. An input audio signal is broken into small frames of 1024 samples each, with an overlap of 256 samples. Sound samples are input as CD-quality 44.1 kHz audio, so each frame is approximately 23.2 ms long. To provide metrics for what is happening during the length of each frame, six levels of the wavelet transform are computed for each frame.
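A minimal sketch of this framing step (class and method names are ours): with 1024-sample frames and a 256-sample overlap, the hop between frame starts works out to 768 samples.

    // Sketch: split a signal into overlapping analysis frames.
    // Frame length 1024, overlap 256, hence a hop of 768 samples
    // (our reading of the parameters above). Assumes the signal is
    // at least one frame long.
    public class Framer {
        static final int FRAME = 1024;
        static final int OVERLAP = 256;
        static final int HOP = FRAME - OVERLAP; // 768 samples per step

        static float[][] frames(float[] signal) {
            int n = (signal.length - FRAME) / HOP + 1;
            float[][] out = new float[n][FRAME];
            for (int i = 0; i < n; i++) {
                System.arraycopy(signal, i * HOP, out[i], 0, FRAME);
            }
            return out;
        }
    }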

An additive information cost function is then computed on each level of difference coefficients. The function currently used is the sum of u^2 log(u^2) over all non-zero values of u, where u ranges over all difference coefficients in one level of the wavelet transform. This


function is a measure of concentration, i.e. the result is large when the elements are roughly the same size and small when all but a few elements are negligible. A simpler function that sums the absolute values of the sequence has also been tried, with similar results.
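In code, the cost function on one level of difference coefficients is a few lines (a sketch, with names of our choosing):

    // Sketch: additive information cost, the sum of u^2 * log(u^2) over
    // the non-zero difference coefficients of one wavelet level.
    public class InfoCost {
        static double cost(double[] level) {
            double c = 0.0;
            for (double u : level) {
                double u2 = u * u;
                if (u2 > 0.0) {
                    c += u2 * Math.log(u2);
                }
            }
            return c;
        }
    }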

A measure of correlation between corresponding wavelet levels across adjacent frames is then used to map the local changes in the signal. For this we use the same function as Equation 2.6. Equation 4.1 gives a slightly more general version, giving the strength of transition between frames a and b:

    D(f_a, f_b) = \sum_{i=a-1}^{a} \sum_{j=b}^{b+1} \sum_{k=1}^{6} (X_{i,k} - X_{j,k})^2    (4.1)

Again, k refers to the wavelet level, i and j are frame numbers, and X_{i,k} and X_{j,k} are the energies of the wavelet difference coefficient levels. Only like levels are compared, and the differences between them are added up to obtain an overall difference between frames.
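A direct transcription of Equation 4.1 (a sketch; X is assumed to be a frames-by-levels array of the per-level energies defined above, indexed from zero):

    // Sketch: strength of transition between frames a and b (Equation 4.1).
    // X[i][k] is the energy of wavelet level k (0..5 here) in frame i;
    // frames a-1 and a sit before the boundary, b and b+1 after it.
    public class FrameDistance {
        static double transition(double[][] X, int a, int b) {
            double d = 0.0;
            for (int i = a - 1; i <= a; i++) {
                for (int j = b; j <= b + 1; j++) {
                    for (int k = 0; k < 6; k++) {
                        double diff = X[i][k] - X[j][k];
                        d += diff * diff;
                    }
                }
            }
            return d;
        }
    }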

When this calculation is done for each frame (minus the first and last two) in the signal, the result is one number per frame which represents the degree of change between a frame and its immediate neighbours. The numbers are represented as an array, and only need to be calculated once per sound sample.

Alani and Deriche used this method to find the points where the distance measure is highest, in order to separate phonemes. We are not trying to 'understand speech', however. Isolating phonemes is not critical to our application. Rather, we would like to segment on the granularity of a syllable, where transitions are much more pronounced and the boundaries more amenable to shuffling.

A simple alteration to their algorithm that makes it more suitable for our goals is to look for the points which have the least difference between frames, instead of the greatest. With respect to the amplitude envelope, these points are more likely to be in the troughs of the signal between relevant portions, rather than in the middle, or in the attack portion.

Another change is that we normalize the energies before calculating correlation. This is done by dividing each energy in a frame by the sum of energies in that frame. This focuses the correlation on the differences in strength of bandwidths between frames.


This was not done in Alani's version, but we consistently get smoother transitions after normalization.

For every frame boundary, we now have a number representing how similar its neighbours are on either side. To segment the sound into grains, we compare each of these numbers to a threshold. Those lower than the threshold are taken as new grain boundaries. We thus favour splitting the signal up at the points where there is little change, and keeping together parts of the signal where there is a relatively large amount of change.

We need to ensure a minimum grain size of more than 40 ms so that we do not have grains occurring at a rate of more than 20 Hz, the limit of frequency perception. So we only consider a point whose distance measure is a minimum compared to its two neighbours, which means that if two grain boundaries occur in two adjacent frames, one of them is ignored.
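A sketch of this boundary selection (names ours): keep a frame boundary only if its transition score is below the threshold and is a local minimum among its neighbours.

    // Sketch: pick grain boundaries from per-boundary transition scores.
    // A boundary is kept only if its score is below the threshold and is
    // a local minimum, so boundaries never land in adjacent frames.
    import java.util.ArrayList;
    import java.util.List;

    public class GrainBoundaries {
        static List<Integer> pick(double[] score, double threshold) {
            List<Integer> bounds = new ArrayList<>();
            for (int i = 1; i < score.length - 1; i++) {
                boolean localMin = score[i] <= score[i - 1]
                                && score[i] <= score[i + 1];
                if (localMin && score[i] < threshold) {
                    bounds.add(i);
                }
            }
            return bounds;
        }
    }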

4.2 Grading the Transitions

The segment boundaries derived from the above approach represent the locations in the signal where it changes least abruptly. The degree of change is given by the result of the difference algorithm in Equation 4.1. Our final aim is to re-create randomized versions of the signal that retain as many of the original characteristics as possible. The next task, then, is to determine which of the grains flow most naturally from any given grain.

To enumerate the most natural transitions between grains, the grains are compared against each other and graded on their similarity. This is done in the same way as we calculated the original grains from the sample. To calculate the transition between grains A and B, the last two frames of A are fed into the four-frame Euclidean distance algorithm of Equation 4.1, along with the first two frames of B. The lower in magnitude the result, the smoother the transition will be between the two grains.
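In code form this is Equation 4.1 applied across a candidate join; a sketch, assuming each grain carries its frames' per-level energies as rows of an array (each grain must have at least two frames):

    // Sketch: grade the transition from grain A to grain B by comparing
    // the last two frames of A with the first two frames of B, using the
    // same six-level squared-difference measure as Equation 4.1.
    public class TransitionGrader {
        static double grade(double[][] energiesA, double[][] energiesB) {
            int lastA = energiesA.length - 1;
            double d = 0.0;
            for (int i = lastA - 1; i <= lastA; i++) {  // last two frames of A
                for (int j = 0; j <= 1; j++) {          // first two frames of B
                    for (int k = 0; k < 6; k++) {       // six wavelet levels
                        double diff = energiesA[i][k] - energiesB[j][k];
                        d += diff * diff;
                    }
                }
            }
            return d; // smaller means a smoother join
        }
    }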


4.3 Resynthesis

By taking the last two frames of each grain, and comparing them with the first two of all other grains, the similarity metric allows us to construct probabilities of transition between each and every grain. These probabilities are used to construct a first-order Markov chain, with each state corresponding to a natural grain. The next grain to be played is chosen by a random sample of the probabilities that have been constructed from the measure of how well the end of the current grain matches the beginnings of all the other grains.

Probabilities are constructed by employing an inverse transform technique which uses the match scores for each grain as observed values for a probability density function (pdf). The smaller the result of Equation 4.1, the smoother the transition between the two windows on either side. We would like higher probabilities for smoother transitions, so we take the inverse of each transition score to orient the weights in favour of the smaller scores.

We don't always want the probabilities of choosing the next grain to depend entirely on the smoothness of transition. Randomness is sometimes as important a consideration as smoothness. Some probabilities might be much greater than all of the others, and so be picked often enough for the repetition to be noticed. To allow control over the weighting differences, we add a noise variable C which helps even out the grain weightings.

Let P_{ij} = 1 / D(i, j) indicate the likelihood that grain i is followed by grain j. We can convert this to a probability p_{ij} by normalizing as follows:

    p_{ij} = \frac{P_{ij} + C}{\sum_{m=1}^{n} P_{im} + nC}    (4.2)

where n is the number of grains. C denotes the constant noise we want to add to the distribution to give those with smaller similarities more of a chance to be selected. This number can be changed interactively to alter the randomness of the output signal.

We now construct a cumulative density function (cdf) from the pdf which gives us a stepwise function from 0 to 1. This is sampled by taking a random number from 0 to 1


and using it as the index to the function. The desired index can then be found using a binary search for the interval the random number lies between, and using that step of the cdf to index our map.
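Putting Equation 4.2 and the cdf sampling together, a minimal sketch (names ours) of choosing the next grain might look like this:

    // Sketch: choose the next grain by inverse-transform sampling.
    // P[i][j] = 1 / D(i, j) are the raw transition weights; C is the
    // interactive noise parameter of Equation 4.2.
    import java.util.Random;

    public class GrainSampler {
        static int nextGrain(double[][] P, int current, double C, Random rng) {
            int n = P[current].length;
            // Build the cdf of p_ij = (P_ij + C) / (sum_m P_im + n*C).
            double total = 0.0;
            for (double w : P[current]) total += w;
            total += n * C;
            double[] cdf = new double[n];
            double acc = 0.0;
            for (int j = 0; j < n; j++) {
                acc += (P[current][j] + C) / total;
                cdf[j] = acc;
            }
            // Draw a uniform value and binary-search for its interval.
            double u = rng.nextDouble();
            int lo = 0, hi = n - 1;
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (cdf[mid] < u) lo = mid + 1; else hi = mid;
            }
            return lo;
        }
    }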

Once we have the transition probabilities, resynthesis is as simple as choosing which grain will be played next by random sampling from the empirical distribution p_{ij}. In this way, the smoother the transition between the current grain and the next, the higher the probability that this grain is chosen as the successor. A high noise variable C flattens this preference, but never eliminates it.

4.3.1 Cross-fading

Our algorithm works to match the energies of wavelet bands between grains as well as possible. Since the energies are normalized before comparison, boundary amplitude changes are not given much weight in our resynthesis choices. Normally, this is not much of an issue because the algorithm prefers the "troughs" of the signal, in which the amplitude is near its minimum. For sound samples where there are no such troughs, and to give an overall cohesiveness to the output, we cross-fade between each successive grain. A linear cross-fade of 2 frames (approximately 5 ms) is used.
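A linear cross-fade is straightforward; a sketch (names ours) over a fade of 'fade' samples, corresponding to the two-frame overlap above:

    // Sketch: linear cross-fade of 'fade' samples between the tail of one
    // grain and the head of the next, writing the blend into 'out'.
    // Assumes fade >= 2 and all arrays are at least 'fade' long.
    public class CrossFade {
        static void blend(float[] tail, float[] head, float[] out, int fade) {
            for (int i = 0; i < fade; i++) {
                float t = (float) i / (fade - 1); // ramp from 0 to 1
                out[i] = (1.0f - t) * tail[i] + t * head[i];
            }
        }
    }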

4.3.2 Thresholding

The user has control over how many grains a signal is split up into. A slider in the graphical interface changes the value of the threshold below which a frame boundary is considered a grain boundary. The threshold extremities are determined by the maximum and minimum values over all frame boundaries. Between them, the threshold slider sacrifices control at the high end to obtain fine-tuning over the first few grains. This is important because there is a perceptually much bigger difference in changing the value when there are only 5 grains compared to 50 over the same sample. An exponential function is thus used to control the values of the threshold slider. There is a default threshold value which is currently set at 25% of the total possible number of grains.
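One plausible form of this exponential mapping is sketched below; the exact curve used in our implementation is not reproduced here, so the formula should be read as illustrative only (it assumes positive boundary scores):

    // Sketch: map a slider position s in [0, 1] to a threshold between
    // minScore and maxScore on an exponential curve, so small slider
    // movements near the low end give fine control over the first grains.
    public class ThresholdMap {
        static double threshold(double s, double minScore, double maxScore) {
            double ratio = maxScore / minScore; // assumes minScore > 0
            return minScore * Math.pow(ratio, s); // s = 0 -> min, s = 1 -> max
        }
    }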


Figure 4.1: Input waveform

Figure 4.2: Portion of output waveform

Figures 4.1 and 4.2 show an example of the transformation which results from resynthesis. Although the output waveform in Figure 4.2 looks significantly different from that of Figure 4.1, both use the exact same samples (except for the small cross-fades on grain boundaries). The samples in the output sound are just re-arranged, and sometimes repeated.

4.4 Implementation

Our resynthesis system is implemented in Java, and features a graphical interface to facilitate real-time interaction. By manipulating sliders, users can change the segmentation threshold and the noise parameter C of the grain selection process. Using JavaSound on any platform that supports it, such as Linux and Windows, all of this can be done in real time with no signal interruptions on systems with a Pentium II processor.

Figure 4.3 shows the interface we have built to assist with segmentation. The top row of buttons, from left to right, are to play the original sample, record a new sample, pause playback, load a new sound sample, and separate the sample using the algorithm described above.

Figure 4.3: Segmented waveform and interface

In the center is the current segmented waveform, with vertical lines slicing through it at the locations of the grain boundaries. Moving the threshold slider left or right causes more or fewer segmentation lines to appear, giving instant feedback as to the segmentation threshold. The noise slider affects the weights given to the next possible grains to be played. It is C in Equation 4.2.

4.5 Implementing the Wavelet Transform

To our knowledge, no publicly available wavelet library exists for Java, so we decided to write our own. We used techniques described by Wickerhauser [Wic94], in the UBC Imager wavelet library wvlt [(or95], and in Dr. Dobbs [Cod92, Cod94]. All of these contained partial code examples written in C. To verify our implementation, we compared our results against those using the compiled Dr. Dobbs version. We also verified that the


inverse transform gave the same results as the input data.

The wavelet filters available were all ported from the Imager wvlt library. The filters ported include:

* The Adelson, Simoncelli, and Hingorani filter [ASH87]
* Filters by Antonini, Barlaud, Mathieu and Daubechies [ABMD92]
* The Battle-Lemarie filter [Mal89]
* The Burt-Adelson filter [Dau92]
* Coiflet filters [BCR91]
* Daubechies filters [Dau92]
* The Haar filter [Dau92]
* Pseudocoiflet filters [Rei93]
* Spline filters [Dau92]

For our purposes, the Daubechies 10 wavelet filter is usually used, although the other filters also work well. In general, there has been little work done on which particular filters to use for sound; more rigorous study is needed in this area. In our application, because the wavelet coefficients are summed into energies per level and then normalized, the result is not very sensitive to the features of individual wavelet filters.
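For illustration, here is a minimal sketch of one analysis level using the Haar filter, the shortest filter in the list above. It is a simplification for exposition, not the ported library code, which handles longer filters and boundary conditions:

    // Sketch: one level of a Haar wavelet analysis step. Each pair of
    // input samples produces one approximation (average) and one
    // detail (difference) coefficient. Assumes an even-length input
    // and output arrays of half the input length.
    public class HaarStep {
        static void analyze(double[] in, double[] approx, double[] detail) {
            final double s = Math.sqrt(2.0) / 2.0; // orthonormal Haar weights
            for (int i = 0; i < in.length / 2; i++) {
                approx[i] = s * (in[2 * i] + in[2 * i + 1]);
                detail[i] = s * (in[2 * i] - in[2 * i + 1]);
            }
        }
    }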

The wavelet library is designed to be a separate module from the rest of the implementation so that it can be used for other tasks. A wavelet packet library is also included in the Java package, which we plan to release to the community.

4.6 Real-time Considerations

To make our implementation adequate for real-time use, the segmentation step is done before playback. This is done by computing the differences between every frame, so that


setting the threshold is just a matter of checking which frame scores fall below the selected threshold value. Those that do become grain boundaries. Segmentation data can be saved for a later date, so this step only has to be done once per sound sample.

This allows us to synthesize our audio output in real time with very little computation overhead, leaving plenty of extra computation power for other tasks on even the most average of desktop computers.

4.7 Segmentation/Resynthesis Control Interface

To facilitate construction of larger scale sound ecologies, a higher-level interface has been implemented. Per sound sample, there are controls for gain (amplitude) and pan (stereo left-right amplitude). There are three controls for each: start value, middle, and end, allowing the sound to change over time. Because not every sound should be played continuously, there is also a control for trigger (how often the sound should be played, given in seconds; for instance, a value of 10 would mean the sound would be activated every 10 seconds). Duration controls how long each of these higher-level segments is. Trigger and Duration both have associated values for controlling random variability. The Trigger value and its associated random element are used as the mean and standard deviation in a normal distribution random number generator. The pan and gain envelopes affect each of these higher-level segments individually.
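A sketch of that trigger scheduling (class and field names are ours; Random.nextGaussian() supplies the normal distribution):

    // Sketch: schedule activation times for a sound whose trigger interval
    // is drawn from a normal distribution with user-set mean and deviation.
    import java.util.Random;

    public class TriggerClock {
        private final Random rng = new Random();
        private final double meanSeconds; // the Trigger control
        private final double stdSeconds;  // its random-variability control

        TriggerClock(double meanSeconds, double stdSeconds) {
            this.meanSeconds = meanSeconds;
            this.stdSeconds = stdSeconds;
        }

        // Seconds until the next activation; clamped to stay non-negative.
        double nextInterval() {
            double t = meanSeconds + stdSeconds * rng.nextGaussian();
            return Math.max(0.0, t);
        }
    }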

The sample-segmentation interface shown in Figure 4.4, which allows control over how many natural grains the sample is chopped up into, is still accessible via a button for each control. A "Random Walk" button places the sound in a continuous random walk through stereo space. "Play All" and "Stop All" buttons allow for overall control of all samples loaded at one time.


Figure 4.4: Segmented stream management interface

4.8 Preset Mechanism

Because the initial time to analyze the signal for segments can be relatively substantial, the interface also has the capability of saving presets for sounds that have already been analyzed.

Currently, all of the window transition values are saved, so that when the preset is loaded again, none of the wavelet transforms have to be done again. The threshold, noise, and all of the interface settings such as pan, gain, etc. are also saved. When a preset is loaded, all the user has to do is push "play" to hear the sounds at the same settings they were at when the presets were last saved.

4.9 Discouraging Repetition

Favourable transitions between segments are given a higher probability of being chosen, so for segments that have very good compatibility with just a few others, and very low for


the rest, the same sequence of samples can occur. For certain samples, this repetition of sections was recognizable, especially if the input signal was relatively short in duration. A high noise parameter helps by evening out the probabilities of the next candidate segments. To further prevent repetition, a transition that has just occurred has its weight reduced for subsequent picks from that segment.

Discouraging repetition is accomplished as follows. For each segment s1 that is played, we pick the next segment from the weighted probability list of s1. We then reduce the possibility that this transition from s1 will be picked again in the near future. This is done by associating with each segment a list of the 10 most recently chosen transitions from that segment. These 10 are themselves weighted with a higher value. The normal weights (the weights given by signal analysis) of these 10 most recent transitions are then divided by the repetition weights, yielding new weights that are a fraction of their originals. Using this scheme, after enough time these last 10 played transitions revert to their original weights.
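As a sketch of this bookkeeping (names ours; the penalty factor standing in for the 'higher value' given to recent transitions is a hypothetical constant):

    // Sketch: damp recently used transitions. Each segment keeps its last
    // 10 transitions; their analysis weights are divided by a penalty so
    // they are less likely to be picked again soon.
    import java.util.ArrayDeque;
    import java.util.Deque;

    public class RepetitionDamper {
        static final int HISTORY = 10;
        static final double PENALTY = 4.0; // hypothetical damping factor

        private final Deque<Integer> recent = new ArrayDeque<>();

        // Record that 'next' was just chosen from this segment.
        void record(int next) {
            recent.addLast(next);
            if (recent.size() > HISTORY) recent.removeFirst(); // oldest reverts
        }

        // Effective weight of moving to 'next', given its analysis weight.
        double effectiveWeight(int next, double analysisWeight) {
            return recent.contains(next) ? analysisWeight / PENALTY
                                         : analysisWeight;
        }
    }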

We chose to discourage the transitions rather than the actual segments because it kept our overall system of choosing segments intact. If we had discouraged actual segments instead, there are cases where our system would no longer be random, and would develop a recognizable repeating pattern. Consider the extreme case of a sample with fewer segments than the length of the list of recently played segments that we do not play again. We would end up just playing a loop, with the next segment to be played almost always being the one we played least recently.


Chapter 5

Results and Evaluation

As example sound inputs to our algorithm, we have used samples taken from the Vancouver Soundscape [Tru99], and other natural recordings from a number of different environments. This provided us with a wealth of real-world samples, both pitched and unpitched, which were ideal for this algorithm.

5.1 User Study

To test the utility of our natural grain technique, we carried out a pilot user study on subjects taken from around the UBC Computer Science Department. We tested the hypothesis that a sound generated by our algorithm is indistinguishable from the real audio sample it is taken from.

The seven samples used for the tests were:

* a series of car horns
* the sounds of some tree frogs
* crickets
* the sound of crumpling and tearing paper
* ambient sounds from a fish market


* the sound of a bubbling brook
* bird chirps in the forest

Each sample was gone through by hand to pick an appropriate segmentation threshold. We chose threshold values that would ensure there was more than one segment in any snippet we played as tests.

There is a possibility of subjects comparing the sound events in the real sample versus the resynthesized sound, instead of evaluating whether the resynthesized sound is plausibly realistic. To minimize this, we didn't just use a snippet of a real sound and that same snippet resynthesized, because then the same events would always happen in both samples. Instead, the snippets were taken from two larger pools, constructed from original and resynthesized samples of much larger duration. The location of the snippet within the larger sample pools was chosen at random, the only condition being that the start had to leave enough room in the pool to play the 4-second duration of the test snippet itself. New locations were chosen each time a snippet from a sound was played. This precaution preserves the intent of the algorithm: to produce sounds that appear to have been recorded from the same source as the original, but at different times.

5.1.1 Participants

Ten members of our department (9 males, 1 female) participated in the user study. All reported normal hearing. The participants were not paid for their time.

5.1.2 Experimental Procedure

The experiment used a two-alternative forced-choice design. Subjects sat on a chair in front of two speakers in an enclosed room. We told them that we would play a series of various environmental sound samples in pairs. Each pair would consist of a random section of the original sample, and a random section of a resynthesized sound. They would be in no particular order, and there would be a number of different types of sounds. For each pair the


subjects were instructed to identify the 'real' sample. They were told they could only listen to the samples once; no repetition was allowed. They were then given a practice sample different from those used in the test. Finally, the test began. There were three iterations of both orders, real-synthesized and synthesized-real, for each sample, for a total of 42 tests per subject. The order of tests was random.

Subjects were not told that the tests would be symmetrical, that is, that there would be as many with the resynthesized sound first as there were with the real sound first. Nor were they told the number of repetitions in the experiment, nor the nature of the resynthesis algorithm.

There were some potential issues with the approach we took to demonstrate the utility of our algorithm. For one thing, the subject always has the original sound to listen to next to the resynthesized version, and thus might be able to pick out idiosyncrasies with the resynthesized sound that might not be recognizable if the real sound were not played. However, on the flip side, if the users cannot statistically tell the difference between the real sound and the resynthesized one in this test, it is a strong endorsement that the algorithm can produce realistic sounds.

5.1.3 Results

Figure 5.1 displays the correct scores per subject tested. There was substantial variation between subjects. Three out of the 10 scored above 70%, 2 others above 60%, 1 above 50%, and the other 4 below chance, as shown in Figure 5.2.

We tested the null hypothesis that the subjects perform at the chance level (each response is a pure guess) for the 10 subjects. Under this hypothesis, with n = 42 responses per subject and p = 0.5, the mean number of correct responses is \mu = np = 21 and the standard deviation is \sigma = \sqrt{np(1-p)} \approx 3.24. Using the normal approximation to the binomial distribution, we conclude that we can reject the hypothesis with a two-tailed test at the significance level \alpha = 0.05 only if the sample mean is outside the interval \mu \pm 1.96\sigma = [14.65, 27.35]. Two subjects scored above this range (32, 31), and one below (14).


Correct Scores per Subject and Sample (each out of 6)

Subject   Car horns  Crickets  Paper  Market  Frogs  Stream  Birds
1         3          2         2      2       2      3       0
2         5          3         2      3       4      6       3
3         6          4         3      4       5      5       5
4         1          2         2      2       3      3       5
5         2          5         3      4       3      2       5
6         2          2         5      3       4      2       1
7         6          2         4      4       5      5       4
8         5          5         1      5       5      5       5
9         5          3         2      2       2      2       2
10        6          2         4      3       3      3       5

Figure 5.1: Correct scores per subject and sample. Each score is out of 6.

Subject   Total Correct  Percentage Correct  Standard Deviation
1         14             0.33333             1
2         26             0.61905             1.380131
3         32             0.76190             0.9759
4         18             0.42857             1.272418
5         24             0.56251             1.272418
6         19             0.45238             1.380131
7         30             0.71428             1.253566
8         31             0.73809             1.511858
9         18             0.42857             1.133893
10        26             0.61904             1.380131

Figure 5.2: Percentage correct answers per subject, over all samples.

However, the mean, 23.8, falls solidly within this range, so overall we cannot reject the null hypothesis.

5.1.4 Discussion

These tests showed that samples resynthesized with this algorithm are virtually indistinguishable from the originals. This demonstrates the utility of this process for creating audio streams of indefinite length from samples of fixed length. It can also be used to create a number of variations of a fixed-duration sound sample.

The results also reveal something about which sounds are most effectively rendered


Correct Responses
mean  23.8
max   32
min   14
std   6.268

Figure 5.3: Statistics for the number of correct responses.

Percentage Correct, Per Sample
car horns        0.683
brook            0.600
frogs            0.583
birds            0.583
market           0.533
crickets         0.500
crumpling paper  0.483

Figure 5.4: Percentage correct per sample, compiled over all subjects.

with our algorithm. Figure 5.4 shows the fraction of correct answers per sample. There is substantial variation between samples, with the car horns being the most identifiable and the crumpling paper sounds the least.

In the case of the car horns, we can partly attribute the high success rate to the nature of the sample. A car horn is a pitched sound with a distinct attack, middle and end. Because it is so plain and in the foreground, it is easier to pick out idiosyncrasies when the horn is altered, which in turn makes it easier to identify the resynthesized sound. In this particular case, some subjects remarked that the horn stopped or started too quickly to be considered normal. The segmentation/resynthesis process sometimes changed the original envelope of the horn enough to be noticeable. Subjects were comparing not only the two samples played for them, but also each against their own personal idea of what a car horn should sound like.

Sounds of the market, on the other hand, depict the bustle and commotion of many people going about their routines. The sample is chaotic and layered, displaying much less temporal structure than the regular sound of a car horn. Without the larger temporal clues to give


them away, the resynthesized market sounds were thus more difficult to identify.

Another low scorer, the crumpling paper, also does not contain as much temporal information as the car horns. Like the market, it is unpitched and irregular, which provides a much richer set of possibilities for segmentation. Pitch does not present an inherent difficulty for our algorithm, but pitched sounds of this kind are usually accompanied by amplitude envelopes, which are a problem because of their temporal structure. This shows that our algorithm performs best on the sorts of sounds it was designed for: unstructured environmental sounds suitable as 'background' noises.

As it stands, sounds with temporal structures can only be handled by starting with an input sound that contains a number of the discrete structures, and then only segmenting between them. For instance, to achieve better car horn sounds, we could have only segmented between each toot of the horn. Another solution, which would involve adding temporal information to our algorithm, is detailed in the next chapter.

One confound that might exist for these tests is the segmentation threshold we chose for the individual samples. Each sample was gone through by hand, to ensure there was more than one segment in any snippet we played as tests. The threshold could not be too large, either, because then the segments would be too small, causing the resultant sound to differ perceptibly in timbre from the original. These two constraints still leave considerable room to maneuver, so the notion of an 'ideal' threshold is a loose one.

Our design decision to give final say on the threshold to the sound designer allows for maximum flexibility. However, it also makes tests that limit the threshold to one value per sample necessarily limited in scope relative to the choices the implementation offers.

Overall, the subjects' reaction to the resynthesized sounds was uniformly positive. They invariably commented that it was very difficult to distinguish our sounds from the originals, irrespective of their final scores.


Chapter 6

Conclusions and Future Work

6.1 Overview

This chapter summarizes the goals and results of this thesis, and also outlines some directions for further work and improvements.

6.2 Goals and Results

Our goal in developing the natural grain resynthesizer was to complement existing methods for generating audio through physical simulation of sounds with one that focuses on manipulating existing samples. Background sounds, such as birds in a forest, chatter in a cafe, or street sounds, are currently beyond the scope of physical simulation, but are just as necessary to virtual environments, film, and other disciplines that value realistic sound environments.

In this thesis we extended the utility of sample-based audio resynthesis by providing a method to create randomized versions of samples that preserve the perceptual qualities of the original. This allows users to create samples of indeterminate length from an input sound of fixed length, without having to resort to simple, deterministic looping. In a situation where the same sound sample is triggered in response to an event very often, one could create a number of variations of similar duration to a sample, instead of repeating the original again


and again.

Our implementation also provides an interface that makes it possible to easily mix together multiple streams of resynthesized audio to quickly create an integrated acoustic environment from separate elements.

The implementation can also be used for other purposes besides sound extension. Interesting effects can be achieved when we set the granularity to be very fine, in which case we achieve a sparse form of granular synthesis. Intriguing combinations and textures can also be generated by using a sample with many different heterogeneous sound sources present. The algorithm mixes them in unexpected and interesting ways, often in short bursts that are connected seamlessly with other short bursts from elsewhere in the signal to create new macroscopic structures.

Our main challenge was to preserve the perceptual characteristics of the original sound. There are many ways of creating new sounds with new timbres from input samples, such as granular synthesis, but much less work has been done on creating new samples that sound similar to those input. This is the key achievement of our work.

6.3 Future Work

While the system works well right now for a variety of applications, there are many interesting extensions to the dynamic scrambling algorithm. In this section, we explore a few future directions for work.

1. More work could be done to automatically set the threshold to a reasonable value depending on the sample. Determination of automatic threshold values is tricky, because the optimum number of segments for a sample varies with the size of the sample and its inherent separability.

2. It would be useful not only to extract the components in the time domain, but also to split the signal up into multiple simultaneous streams. For instance, in a sample


where there is simultaneously traffic noise and birds singing, we could extract the bird sound from the background and re-synthesize the two separately. Independent Component Analysis (ICA) [Cas98] could be a viable method of separating signals from one source.

3. Modifying the underlying signal through manipulation of the time-frequency domain can produce desirable and predictable changes in the perceived objects involved in the sound production. For instance, Miner and Caudell [MC97] have developed methods for altering the wavelet coefficients in the representation of a rain sample to change the perceived surface the rain is falling on. They could also change the perceived size of the drops, or their density. Since our algorithm already computes a version of the signal in the wavelet domain, this would be relatively efficient to implement.

4. Although it is not needed for the purposes of this thesis, adjusting the threshold slider in real time to change the grain size is also currently possible, but with more computational overhead. This functionality is useful for musicians and sound designers to create effects by continuously changing the grain size in real time.

The generation of the first-order Markov chain is relatively expensive, and must be re-done after every change in threshold. This is because until we know the threshold, we do not even know where the grains will be, so it is difficult to determine how well they match together. In practice, on a moderate-length sound sample (less than 5 minutes), this is not a problem. However, longer samples could lead to thousands of grains. This becomes a problem because the time complexity of recomputing the Markov chains is O(n^2), where n is the number of grains.

For the purposes set out in this thesis, the existing implementation is adequate, but re-writing the code to pre-calculate segment relations is possible. We would have to calculate every possible transition between segments for every possible threshold value, and be able to consider only those that are below the current threshold.

5. One of the most interesting of the possible extensions to this algorithm would be the incorporation of time information. This would allow us to produce new sounds with


the same rhythmic structures as the originals. A possible way of accomplishing this would be to analyze amplitude, and use the resulting information in a hierarchical Markov chain. Entire sounds or just localized sections could be analyzed to obtain their large-scale amplitude envelopes. This information would be used to cut down the number of segments considered when choosing the next segment to play. The subset could be based on the criterion that the next segment must have an amplitude that matches the current portion of the larger pattern. Finer-scale choices between individual segments within the subset would still be made using the algorithm outlined above. Or, if subsets remove too much randomness, we could instead modify the entire weighting system used in resynthesis to reflect this new information.

This approach would allow us to successfully segment a much broader class of sounds than is currently possible. Sounds with a high degree of temporal information would be particularly better handled. This approach would also create additional creative opportunities. For instance, we could analyze one sound for its amplitude envelope, then use this information on a completely different sound. Since local decisions affecting sound quality would still be made normally, as we have described above, using local information, the sound would still be of very good quality, but with completely different (but recognizable) temporal characteristics.

6. In addition to the amplitude-following described above, we might be able to implement something close to the beat-tracking interfaces offered in some commercial packages for musicians. By analyzing the amplitude maps, we could estimate the rhythm of the sample, and allow the user to change it. The change would be brought about by exploiting an artifact of our algorithm: for rhythmic sounds, decreasing the segment size often increases the tempo of the sound. The exact cause of this is not known at present, but it could be exploited by offering it as a control to the user. With judicious use of amplitude maps, we could intentionally shorten the envelope of a particular sound by not picking as many segments in the sustain part of the sound, so the envelope becomes shorter. Since it is the attack that gives a sound most of


its character, as described in a paper by Risset and Mathews [RM69], this would hopefully not change the perceptible characterization of the sound drastically.

7. For particular tasks that demand a high degree of control over which segments are combined, it would be useful to let the user manually combine segments. We could provide a sound-builder interface, where the user starts with one segment, then is provided with a menu with every other segment ordered according to goodness of fit. Or there could be other criteria besides goodness of fit that the user chooses to order the segments by. The selection of segments offered could also be more sophisticated. For instance, the user could select segments from a whole database of sounds, called up according to some criteria. This takes us closer to the objectives of the CATERPILLAR system [Sch00].

8. Our aim of creating a perceptibly similar sound breaks down with extremely small samples (under 1-3 seconds, depending on the sample characteristics). With such a small amount of data, there is little possibility of producing quality output with an adequate number of segments: the segments become too small, and the result sounds more like granular synthesis. We could provide a system which analyzes the sound, and changes perceptible qualities like pitch, amplitude, and colour by small amounts to create the illusion of change or variation without radically changing the perception of the sound. We would also have to choose the segment and resynthesis points carefully, using something like the amplitude-following technique described above.

This algorithm and its proposed extensions all aim to give the user an intuitive interface to sound design. We believe it is not enough to provide novel physical or graphical interfaces to sound synthesis engines; what is often needed, and rarely found in practice, are interfaces that reflect the underlying physical properties of a sound. When dealing with samples from the real world, as we have done, this involves developing methods for signal understanding, and manipulation techniques that preserve important aural properties.


Bibliography

[ABMD92] Marc Antonini, Michel Barlaud, Pierre Mathieu, and Ingrid Daubechies. Image coding using wavelet transform. IEEE Transactions on Image Processing, 2(1):205–220, 1992.

[AD99] Ahmed Alani and Mohamed Deriche. A novel approach to speech segmentation using the wavelet transform. In Fifth International Symposium on Signal Processing and Its Applications, 1999.

[AFG99] Maria Grazia Albanesi, Marco Ferretti, and Alessandro Giancane. Time-frequency decomposition for analysis and retrieval of 1-d signals. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems, volume 2, pages 974–978, 1999.

[ASH87] E. H. Adelson, E. Simoncelli, and R. Hingorani. Orthogonal pyramid transforms for image coding. In Visual Communications and Image Processing II, pages 50–58, 1987.

[BCR91] G. Beylkin, R. Coifman, and V. Rokhlin. Fast wavelet transforms and numerical algorithms, 1991.

[BIN96] Jerry Banks, John S. Carson II, and Barry L. Nelson. Discrete-Event System Simulation. Prentice-Hall, 1996.

[BJDEY+99] Ziv Bar-Joseph, Shlomo Dubnov, Ran El-Yaniv, Dani Lischinski, and Michael Werman. Granular synthesis of sound textures using statistical learning. In Proceedings of the International Computer Music Conference, pages 178–181, 1999.

[BSR98] Gerhard Behles, Sascha Starke, and Axel Robel. Quasi-synchronous and pitch-synchronous granular sound processing with stampede ii. Computer Music Journal, 22:44–51, Summer 1998.


[Cas98] Michael Anthony Casey. Auditory Group Theory with Applications to Statistical Basis Methods for Structured Audio. PhD thesis, Massachusetts Institute of Technology Media Laboratory, 1998.

[CMW92] R. Coifman, Y. Meyer, and M. Wickerhauser. Wavelet analysis and signal processing, 1992.

[Cod92] Mac A. Cody. The fast wavelet transform: Beyond fast fourier transforms. Dr. Dobbs Journal of Software Tools, 17(4):16–18, 20, 24, 26, 28, 100–101, April 1992.

[Cod94] Mac A. Cody. The wavelet packet transform: Extending the wavelet transform. Dr. Dobbs Journal, April 1994.

[Dau92] Ingrid Daubechies. Ten Lectures on Wavelets, volume 61. Society for Industrial and Applied Mathematics, Philadelphia, 1992.

[DJ97] Christoph Delfs and Friedrich Jondral. Classification of piano sounds using time-frequency signal analysis. In Proceedings of IEEE ICASSP '97, pages 2093–2096, 1997.

[Dry96] Andrzej Drygajlo. New fast wavelet packet transform algorithms for frame synchronized speech processing. In Proc. of the 4th International Conference on Spoken Language Processing, pages 410–413, 1996.

[EC95] Kamran Etemad and Rama Chellappa. Dimensionality reduction of multiscale feature spaces using a separability criterion. In Inter. Conf. on Acoustics, Speech and Signal Processing, 1995.

[Gav88] W. W. Gaver. Everyday listening and auditory icons. PhD thesis, University of California in San Diego, 1988.

[Gib79] James Jerome Gibson. The ecological approach to visual perception. Houghton Mifflin, 1979.

[Han89] Stephen Handel. Listening: An Introduction to the Perception of Auditory Events. MIT Press, 1989.

[Han95] S. Handel. Timbre perception and auditory object identification. Hearing (Handbook of Perception and Cognition, 2nd Edition), pages 425–461, 1995.

[Hel54] H. L. F. Helmholtz. On the sensations of tone as a psychological basis for the theory of music. Dover, New York, 1954.


[JP88] Douglas L. Jones and Thomas W. Parks. Generation and combination of grains for music synthesis. Computer Music Journal, 12:27–34, Summer 1988.

[KM88] Richard Kronland-Martinet. The wavelet transform for analysis, synthesis, and processing of speech and music sounds. Computer Music Journal, 12(4):11–19, Winter 1988.

[KT98] Damián Keller and Barry Truax. Ecologically-based granular synthesis. In ICMC, pages 117–120, 1998.

[LKS+98] T. Lambrou, P. Kudumakis, R. Speller, M. Sandler, and A. Linney. Classification of audio signals using statistical features on the time and wavelet transform domains. In Proceedings of the IEEE 1998 International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), volume 6, pages 3621–3624, Seattle (WA), May 12-15 1998.

[Mal89] Stephane G. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11(7):674–693, 1989.

[MC97] Nadine E. Miner and Thomas P. Caudell. Using wavelets to synthesize stochastic-based sounds for immersive virtual environments. In Proceedings of the International Conference on Auditory Display, 1997.

[MMOP96] Michel Misiti, Yves Misiti, Georges Oppenheim, and Jean-Michel Poggi. Wavelet Toolbox User's Guide. The MathWorks, 1996.

[MZ92] Stephane G. Mallat and S. Zhong. Characterization of signals from multiscale edges. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):710–732, July 1992.

[(or95] Alain Fournier (organizer). Wavelets and their applications in computer graphics. In Siggraph 1995 course notes, 1995.

[PK99] Stefan Pittner and Sagar V. Kamarthi. Feature extraction from wavelet coefficients for pattern recognition tasks. In IEEE Trans. on PAMI, volume 21, pages 83–88, 1999.

[PLLW99] D. K. Pai, J. Lang, J. E. Lloyd, and R. J. Woodham. Acme, a telerobotic active measurement facility. In Experimental Robotics VI, vol. 250 of Lecture Notes in Control and Information Sciences, pages 391–400, 1999.


[Rei93] L. M. Reissell. Multiresolution geometric algorithms using wavelets: Representation for parametric curves and surfaces, 1993.

[RM69] J. Risset and M. Mathews. Analysis of musical instrument tones. Physics Today, 22(2):23–30, 1969.

[Roa78] Curtis Roads. Automated granular synthesis of sound. Computer Music Journal, 2(2):61–62, 1978.

[Roa88] Curtis Roads. Introduction to granular synthesis. Computer Music Journal, 12:11–13, Summer 1988.

[Roa96] Curtis Roads. The Computer Music Tutorial. MIT Press, Cambridge, MA, 1996.

[Ros00] S. Rossignol. Segmentation et indexation des signaux sonores musicaux. PhD thesis, University of Paris VI, July 2000.

[RP00] J. L. Richmond and D. K. Pai. Active measurement and modeling of contact sounds. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation, pages 2146–2152, 2000.

[Sch00] Diemo Schwarz. A system for data-driven concatenative sound synthesis. In Proceedings of the COST-G6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, December 2000.

[SG97] Ruhi Sarikaya and John N. Gowdy. Wavelet based analysis of speech under stress. In IEEE Southeastcon, volume 1, pages 92–96, Blacksburg, Virginia, 1997.

[SG98] Ruhi Sarikaya and John N. Gowdy. Wavelet based analysis of speech under stress. In Proceedings of the 1998 IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 569–572, 1998.

[Sko80] M. Skolnik. Introduction to Radar Systems. McGraw-Hill Book Co., 1980.

[SN96] G. Strang and T. Nguyen. Wavelets and Filter Banks. Wellesley-Cambridge Press, Wellesley, Massachusetts, 1996.

[SSSE00] Arno Schodl, Richard Szeliski, David H. Salesin, and Irfan Essa. Video textures. In Siggraph, 2000.

[SY98] S. R. Subramanya and Abdou Youssef. Wavelet-based indexing of audio data in audio/multimedia databases. In Proceedings of the International Workshop on Multimedia Databases Management Systems, pages 46–53, 1998.


[TLS+94] B. Tan, R. Lang, H. Schroder, A. Spray, and P. Dermody. Applying wavelet analysis to speech segmentation and classification. In H. H. Szu, editor, Wavelet Applications, Proc. SPIE 2242, pages 750–761, 1994.

[Tru88] Barry Truax. Real-time granular synthesis with a digital signal processor. Computer Music Journal, 12:14–26, 1988.

[Tru94] Barry Truax. Discovering inner complexity - time shifting and transposition with a real-time granulation technique. In Computer Music Journal, volume 2, pages 38–48, Summer 1994.

[Tru99] Barry Truax. Handbook for Acoustic Ecology. ARC Publications, 1978. CD-ROM version, Cambridge Street Publishing, 1999.

[vdDKP01] K. van den Doel, P. G. Kry, and D. K. Pai. FoleyAutomatic: Physically-based sound effects for interactive simulation and animation. In Computer Graphics (ACM SIGGRAPH 2001 Conference Proceedings), 2001.

[War99] Richard M. Warren. Auditory Perception: A New Analysis and Synthesis. Cambridge University Press, 1999.

[Wic92] Mladen Victor Wickerhauser. Acoustic signal compression with wavelet packets. In Charles K. Chui, editor, Wavelets - A Tutorial in Theory and Applications, pages 679–700. Academic Press, Boston, 1992.

[Wic94] Mladen Victor Wickerhauser. Adapted Wavelet Analysis from Theory to Software. AK Peters, Ltd., Wellesley, Massachusetts, 1994.

[WL00] Li-Yi Wei and Marc Levoy. Fast texture synthesis using tree-structured vector quantization. In Siggraph, 2000.

[WS85] W. H. Warren and R. E. Shaw. Events and encounters as units of analysis for ecological psychology. In W. H. Warren and R. E. Shaw, editors, Persistence and change: Proceedings of the First International Conference on Event Perception, pages 1–27, 1985.

[WW99] Eva Wesfreid and Mladen Victor Wickerhauser. Vocal command signal segmentation and phoneme classification. In Alberto A. Ochoa, editor, Proceedings of the II Artificial Intelligence Symposium at CIMAF 99, page 10. Institute of Cybernetics, Mathematics and Physics (ICIMAF), Habana, Cuba, 1999.
