University of Minnesota - Duluth
A Digital Audio Coder Based on a Model of Human Hearing
Hans Anderson
5/21/2007
TABLE OF CONTENTS
Introduction
1 - Digital Musical Synthesis Techniques
    Wavetable Synthesis
    Frequency Modulated (FM) Synthesis
    Modeling Synthesis
2 - Introduction to Digital Audio Coding
    Sampling
        The Sampling Theorem
        Sample Depth
    Existing Audio Coders
        Masking
        MP3 Encoding
        Prony’s Method
        Principal Components Analysis for Data Reduction
3 - Goals and Assumptions
    Desirable Qualities
        Polyphonic Pitch Detection
        Musically Well-Placed Basis Functions
        Frequency-Dependent Time Resolution
        Real Time Operation
4 - Theoretical Basis of the Algorithm
    Anatomy of the Auditory System
    Theories of Time-Frequency Analysis in Human Hearing
    Spectrograms for Time-Frequency Analysis
5 - Our Implementation
    The Analysis Phase
        Discrete Model of a Damped Harmonic Oscillator
        Analytic Input Signal
        Masking
        Data Storage
        Synthesis
6 - Performance
    Quality
        High Frequency Estimation Error
        High Frequency Attenuation
    Speed
        Efficiency of the Analysis Phase
        Efficiency of the Masking Phase
        Efficiency of the Synthesis Phase
    Data Rate
7 - Future Research Possibilities
    Parallelization
    Feature Recognition and Transformation
    Musical Transcription
    Psychoacoustics
8 - Code
    Organizational Overview
    Analysis Functions
        Main File: analysisTest.m
        Auxiliary File: energyRateOfChange.m
    Masking Functions
        Organizational File: maskTest3.m
        Main File: applyMask.m
        Auxiliary File 1: estiamteAlpha.m
        Auxiliary File 2: fitmaskCurve.m
        Auxiliary File 3: aliasAmplitude3.m
        Auxiliary File 4: singleMaskCurve.m
    Synthesis Functions
        Main File: testSynthesis.m
        Auxiliary File 1: findNearest.M
        Auxiliary File 2: continuousFadeExps.m
        Auxiliary File 3: chop.m
Bibliography
ABSTRACT
For musical recordings of a single instrument playing only one note at a time, there exists reliable
software for detecting the pitches and transcribing them into musical notation. But for polyphonic
recordings (those that contain sounds of several simultaneous pitches) very little has been
accomplished. This is surprising because humans do it so well and because unlike other audio
recognition tasks, such as speech recognition, it doesn't require deep conceptual understanding. In
order to move a step closer to a software solution, we implement a computer model of one theory
of human hearing and use it to encode audio recordings in a format similar to musical notation.
This compact, efficient format has possible applications including voice over IP and live music
synthesis.
INTRODUCTION
In this paper, we present an algorithm for encoding and decoding audio signals. Although it
provides a method for storing audio in a very compact format, data compression is not its
primary goal. It aims, instead, to provide a perceptually meaningful data format: that is, a
mathematical representation of sound that closely resembles the language and notation favored by
musicians. Such a representation has several advantages:
First, for creating synthesized electronic musical instruments it is helpful to represent audio data
in a format that is compatible with popular techniques for digitally generating sound. Second, this
format simplifies many musical signal processing tasks such as pitch detection, automatic
transcription, pitch shifting and speed adjustment. Finally, it represents sound in a clear and
intuitive way that enables us to visualize more accurately and understand more easily the nature of
sound. This algorithm is a research tool that provides a convenient way to analyze acoustical data
and experiment with sound.
Of course, these advantages come at a price: compared with MP3 and other perceptual data
compression schemes, this one is very computationally expensive. Also, when faced with decisions
in the design phase of the project, we occasionally opted for intuitive solutions based on perception
and physical analogy instead of algebraic manipulation. As a result, the algorithm is less accurate
than it might otherwise have been.
An important distinction between this and other time-frequency analysis methods is that it adheres
to a perceptual measure of accuracy. We have maintained, as a guiding principle, the idea that the
human hearing apparatus is, by definition, the archetypical example of a perfect perceptual audio
encoder. In other words, wherever inaccuracies exist in our algorithm’s ability to perform
time-frequency analysis on an audio signal, they should not be considered perceptual deficiencies if
there is evidence that the human auditory system makes similar mistakes.
Finally, since our CODEC¹ is modeled after a particular aspect of a theory of hearing perception, the
quality of its sound output provides an indication of the degree to which that particular theory
explains the sensitivity of our hearing.
¹ CODEC stands for COder/DECoder.
1 - DIGITAL MUSICAL SYNTHESIS TECHNIQUES
The method we present uses a data format that makes it useful for producing synthetic
music. Since this is a major motivation for the project, we begin by summarizing the existing
techniques for music synthesis.
WAVETABLE SYNTHESIS
Wavetable synthesis is, conceptually, the simplest of the popular techniques. In summary, it
works like this:
• Suppose you want to make a synthesizer to imitate the sound of a piano. Begin by making a
  recording of the sound of every key on a real piano.
• Cut each sound into three sections:
    o The first section represents the “attack”, that is, the sound of the felt-padded
      hammer as it strikes the string at the beginning of the note.
    o The second section is the “sustain”. This part represents the tone of the instrument
      as it holds out a note. It will usually be cut so that it can be played in a loop that
      could continue indefinitely to produce an arbitrarily long note.
    o The final part is the “release”. In our example of a piano, this is the sound of the felt
      dampers clamping down to dampen the vibration of the string.
• When a musician presses a key on the electronic keyboard that controls the synthesizer, the
  recording of that same key from the real piano should begin to play. It should begin, of
  course, with the attack. Then the sustain part should loop until the key is released. After
  the release of the key, the sustain section should stop looping and the release part of the
  sample should play.
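The playback rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the three sections are plain lists of samples, the function name `render_note` and the parameter `held_samples` are hypothetical, and no cross-fading is applied at the joins.

```python
def render_note(attack, sustain, release, held_samples):
    """Assemble one note from three pre-recorded sections.

    attack, sustain, release: lists of samples cut from one recording.
    held_samples: how many samples elapse while the key is held down.
    """
    if not sustain:
        raise ValueError("the sustain section must contain at least one sample")
    # The attack plays once (cut short if the key is released early).
    out = list(attack[:held_samples])
    # The sustain loops until the key is released.
    position = 0
    while len(out) < held_samples:
        out.append(sustain[position % len(sustain)])
        position += 1
    # The release plays once after the key comes up.
    out.extend(release)
    return out


# Toy data standing in for real recordings.
note = render_note(attack=[1, 2], sustain=[3, 4, 5], release=[9], held_samples=7)
```

A real synthesizer would also have to smooth the transitions between sections, which is part of the editing effort discussed in this section.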
Wavetable synthesis is employed, at least part of the time, by most high-quality modern
synthesizers. Its main advantage is accuracy; since every sound from a wavetable synth is
based on a recording, the sound closely resembles the real instrument.
Wavetable synthesis has several disadvantages:
• It lacks expressiveness. Keyboard instruments can be accurately represented by
  wavetable synthesis because they are compatible with the “attack, sustain, release”
  model described above. That is, there isn’t much variety in the way each key is pressed,
  and the musician doesn’t do anything to affect the timbre of the note while the key is
  held down. Brass instruments or bowed string instruments are not so easily adapted to
  this model because players of these instruments constantly adjust the amplitude,
  timbre, and even the pitch while the note is in the “sustain” phase.
• It requires a lot of samples. The timbre of most instruments changes significantly
  depending on the amplitude. A vibrating string, for example, produces a harmonic
  series of sounds in predictable, even ratios when the amplitude of vibration is minimal
  compared to the length of the string. But as the amplitude increases, non-linear effects
  of friction begin to change the spectrum of the output, producing sounds of a
  more dissonant character. It is therefore necessary to sample each note at several
  volume levels in order to get a realistic representation of the sound of the instrument. A
  full-sized piano keyboard has 88 keys. If we sample each note at five volume levels, we
  need 440 samples. Although the falling price of digital memory is making the necessary
  hardware increasingly affordable, nothing relieves the cost of human effort required to
  record all those samples. Needless to say, each recording should be of the highest
  quality and each requires considerable manipulation and editing to make the attack,
  sustain, and release sections fit together smoothly. Furthermore, the one who does the
  sampling must take great care to be consistent about how he plays the notes on the real
  instrument that he is recording. If he samples at five volume levels he must make sure
  that the pressure he applies to the piano key for the fourth volume level is absolutely
  the same for each key. For keyboard instruments, he may use some form of mechanical
  assistance but this is not possible for other instruments. A trumpet, for example, must
  be played by human lips or the sound will not be pleasant.
FREQUENCY MODULATED (FM) SYNTHESIS
FM synthesis is an approach that was prevalent in the digital synthesizers of the 1980’s and
continues to be used in low-cost keyboards and computer sound cards. It was patented in 1977 by
Stanford Professor John Chowning (Stanford University News Service, 1994). The idea is to
produce sounds based on simple mathematical expressions of the form
$$\cos\left(\omega_c t + \int_0^t f(x)\,dx\right)$$
where $f(t)$ is an arbitrary function that “modulates the frequency” (Schottstaedt).
Frequency modulation produces a tone that is based around a fundamental harmonic oscillation;
typical choices of the modulation function generate various spectra of harmonics above the
fundamental.
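As a rough illustration of the expression above, the following Python sketch evaluates it sample by sample, approximating the integral with the trapezoid rule. The function name `fm_tone`, the 440 Hz carrier, the 110 Hz sinusoidal modulator, and the modulation index of 2 are all assumptions chosen for the example; they do not come from any particular synthesizer.

```python
import math


def fm_tone(carrier_hz, modulator, sample_rate, num_samples):
    """Return samples of cos(w_c * t + integral from 0 to t of modulator(x) dx).

    `modulator` maps time in seconds to a frequency offset in radians per
    second. The integral is accumulated with the trapezoid rule.
    """
    w_c = 2.0 * math.pi * carrier_hz
    dt = 1.0 / sample_rate
    integral = 0.0               # running value of the integral of the modulator
    previous = modulator(0.0)
    out = []
    for n in range(num_samples):
        t = n * dt
        if n > 0:
            current = modulator(t)
            integral += 0.5 * (previous + current) * dt
            previous = current
        out.append(math.cos(w_c * t + integral))
    return out


# Hypothetical settings for illustration only.
SAMPLE_RATE = 44100
INDEX = 2.0
W_M = 2.0 * math.pi * 110.0
tone = fm_tone(440.0, lambda t: INDEX * W_M * math.cos(W_M * t), SAMPLE_RATE, 1000)
```

With this particular modulator the integral has the closed form `INDEX * sin(W_M * t)`, so the output should closely track `cos(w_c * t + INDEX * sin(W_M * t))`, the form in which FM synthesis is usually written.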
The sound of an FM synthesizer is distinctly electronic and rarely resembles any real instrument
except perhaps in caricature. Favoring expressive control and exotic texture over realism, FM
synthesis became the driving force behind the popular bands of the 1980’s. Today it is still used to
produce electronic-sounding tones.
MODELING SYNTHESIS
Recent research has shifted to computationally expensive numerical vibration simulation models
that represent each part of a physical or electrical instrument by a delay line. A delay line
represents a vibrating string, for example, by modeling the transfer of vibrational energies through
a finite number of segments, each representing a small portion of the string. In the case of a guitar
model, the output from the delay line that represents the strings may be coupled into another
model that represents the wooden body of the guitar.
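To make the delay-line idea concrete, here is a small sketch of the Karplus-Strong plucked-string algorithm, a well-known delay-line string model. The text above does not name a specific model, so this is a stand-in chosen for brevity; the function name `pluck` and the damping value are assumptions.

```python
import random
from collections import deque


def pluck(frequency_hz, sample_rate, num_samples, damping=0.996, seed=0):
    """Simulate a plucked string with a single delay line (Karplus-Strong)."""
    rng = random.Random(seed)
    # The delay line holds one period of the vibration; each entry stands
    # for the displacement of one small segment of the string.
    length = max(2, int(round(sample_rate / frequency_hz)))
    line = deque(rng.uniform(-1.0, 1.0) for _ in range(length))
    out = []
    for _ in range(num_samples):
        first = line.popleft()
        out.append(first)
        # Energy passes along the line: the value fed back is the damped
        # average of two neighbouring segments.
        line.append(damping * 0.5 * (first + line[0]))
    return out


# One second of a 441 Hz string at 44.1 kHz (delay line of 100 segments).
tone = pluck(441.0, 44100, 44100)
```

A fuller physical model would couple the output of this line into further stages, such as a model of the instrument body, as the guitar example describes.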
Modeling synthesis opens up a new dimension of freedom to synth designers, allowing for both
accuracy and expressive control. Although some degree of modeling is incorporated into the most
expensive commercial synthesizers it has not supplanted wavetable synthesis as the primary
means of producing electronic sound because it is too computationally expensive and it requires a
large amount of effort to design a complete set of instruments.
2 - INTRODUCTION TO DIGITAL AUDIO CODING
SAMPLING
As sound waves pass over the diaphragm of a microphone, the oscillating sound pressure levels
induce a similarly oscillating electrical current. If the microphone is connected to a personal
computer, the oscillating signal reaches the sound card, where the voltage is measured many
thousands of times each second and the measurements are stored numerically in the computer
memory.
THE SAMPLING THEOREM
The Nyquist-Shannon sampling theorem, often called simply “the sampling theorem”, states that a
continuous audio signal may be perfectly reconstructed from discretely sampled data provided the
original signal is band-limited so that the absolute value of the highest frequency, $f_{max}$, is not
greater than one half of the sampling rate, $f_{sample}$:
$$2\,|f_{max}| \le f_{sample}$$
This theorem is often misunderstood to mean that practical decoders such as those used in
personal computers are capable of exact reproduction up to frequencies of half the sampling rate.
The Nyquist-Shannon theorem requires that the decoder must reconstruct the signal by multiplying
each sample by a sinc function, so that the influence of a particular sample affects the interpolation
between samples over the entire signal, not only near the sample in question. Since most practical
decoders use much simpler means of interpolating between samples, their actual frequency range
capability is quite difficult to predict. (Goldberg, p. 62)
The formula for perfect reconstruction of a band-limited signal $s(t)$ from data sampled at a rate of $f_s$
is
$$s(t) = \sum_{n=-\infty}^{\infty} x[n]\,\operatorname{sinc}(t f_s - n)$$
where
$$\operatorname{sinc}(t) = \begin{cases} 1, & t = 0 \\ \dfrac{\sin(\pi t)}{\pi t}, & \text{otherwise} \end{cases}$$
and $x[n]$ is the $n$th sample of the signal.
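The reconstruction sum can be tried out numerically. The sketch below is an approximation under one unavoidable assumption: the infinite sum is truncated to the samples that are actually available, so values near the ends of the recording are less accurate. It uses the normalized sinc, sin(πt)/(πt); the function names and the 1 Hz test signal sampled at 8 Hz are illustrative.

```python
import math


def sinc(t):
    """Normalized sinc: sin(pi * t) / (pi * t), with sinc(0) = 1."""
    if t == 0:
        return 1.0
    return math.sin(math.pi * t) / (math.pi * t)


def reconstruct(samples, sample_rate, t):
    """Evaluate the interpolation sum at time t (in seconds).

    Every sample contributes to the value at every instant, which is why
    the sum runs over the whole list rather than over nearby samples only.
    """
    return sum(x * sinc(t * sample_rate - n) for n, x in enumerate(samples))


# A 1 Hz cosine sampled 8 times per second for 50 seconds.
samples = [math.cos(2.0 * math.pi * n / 8.0) for n in range(400)]
midway = reconstruct(samples, 8, 200.5 / 8.0)  # halfway between two samples
```

Here `midway` should lie close to the true value of the cosine at that instant, cos(π/8), even though no sample was taken there.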
In practice, the implication of the sampling theorem is a direct tradeoff between sound quality and
data file size; the higher the sampling rate, the wider the frequency range of the output. (Goldberg,
2003)
Perceptually speaking, lowering the sampling rate results in a muffled sound where bass notes are
faithfully reconstructed but high pitches become muted.
FIGURE 1 - THE SINC FUNCTION
SAMPLE DEPTH
Another factor relating file size to sound reproduction quality is the numerical precision of the
sampling. Clever adjustment of numerical precision can significantly reduce storage requirements
without affecting the perceived quality of the output signal. We will say more about that in the next
section.
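The simplest possible way to lower numerical precision is to discard the low-order bits of each sample. The sketch below shows only that naive truncation, with no attempt at the perceptually guided adjustment the text refers to; the function name `reduce_depth` and the default bit widths are assumptions.

```python
def reduce_depth(sample, from_bits=16, to_bits=8):
    """Keep only the `to_bits` most significant bits of a signed integer sample.

    The result stays on the original scale, so it can be compared directly
    with the input; the error is always smaller than 2 ** (from_bits - to_bits).
    """
    if not 1 <= to_bits <= from_bits:
        raise ValueError("to_bits must be between 1 and from_bits")
    shift = from_bits - to_bits
    return (sample >> shift) << shift


samples = [12345, -12345, 100, -1]
coarse = [reduce_depth(s) for s in samples]
```

Going from sixteen bits to eight halves the storage per sample, at the cost of an error of up to 255 units per sample on the sixteen-bit scale.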
EXISTING AUDIO CODERS
Most of the world encountered digital music for the first time with the invention of the compact
disc. Music CDs use a straightforward coding scheme: audio is sampled 44,100 times per second.
Each sample is a sixteen-bit signed integer, and the samples come in pairs – one sample for
each of the left and right stereo tracks on the disc. One minute of audio in CD format occupies just
over 10 MB. A whole CD can contain about 800 MB of audio data. This is slightly more than the capacity of
a data CD because audio CDs use a simpler error correction scheme. (Red Book (audio CD
standard))
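The figures above follow directly from the sampling parameters. The short calculation below reproduces them; the 80-minute disc length is an assumption made for the example, and megabytes are taken as millions of bytes.

```python
SAMPLE_RATE_HZ = 44_100   # samples per second, per channel
BYTES_PER_SAMPLE = 2      # sixteen bits
CHANNELS = 2              # left and right


def pcm_bytes(seconds):
    """Size in bytes of uncompressed CD-format audio of the given length."""
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


per_minute = pcm_bytes(60)        # bytes in one minute of audio
full_disc = pcm_bytes(80 * 60)    # bytes on an assumed 80-minute disc
```

One minute comes to 10,584,000 bytes, and an 80-minute disc to 846,720,000 bytes, in line with the rounded values quoted in the text.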
Early efforts to store music on personal computers employed Pulse Code Modulated (PCM) formats
very similar to that of audio CDs. The ubiquitous .wav format is a category of PCM formats that
allows several choices of sampling rate and depth. The PCM wav format is quite inconvenient for
storing and transferring music because a typical music file requires more than 30 Megabytes of
hard drive space.
At the time when the first free MP3 player software became popular, most personal computers had
a hard drive smaller than 250 MB, so the prospect of storing music on the computer in .wav files was
not particularly attractive. Converting CD audio or .wav files into .mp3 format typically reduces the
file size by a factor of ten. That makes music files small enough to send over a 56 kbps modem
in under ten minutes. (Dwight Brown)
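A quick back-of-the-envelope check of the transfer time, assuming a 30 MB .wav file (30 million bytes), exactly tenfold compression, and a modem that sustains its full nominal 56,000 bits per second; real connections were slower, so these are best-case numbers.

```python
def transfer_minutes(file_bytes, link_bits_per_second):
    """Minutes needed to send a file over a link at its nominal speed."""
    return file_bytes * 8 / link_bits_per_second / 60


wav_bytes = 30_000_000          # assumed size of a typical .wav music file
mp3_bytes = wav_bytes / 10      # after compression by a factor of ten

wav_minutes = transfer_minutes(wav_bytes, 56_000)
mp3_minutes = transfer_minutes(mp3_bytes, 56_000)
```

Under these assumptions the compressed file takes a little over seven minutes, while the uncompressed file would take more than an hour.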
MP3 is a lossy CODEC, which means that it reduces file size at the expense of sound quality. It is often
called a perceptual CODEC because it exploits imprecision in hearing perception to allow
imprecision in the sound representation without causing perceptible loss of sound quality.
MASKING
The human auditory system has a tendency to perceive certain sounds more accurately than others
and sometimes to completely ignore certain types of background noise, perceiving only the louder
auditory stimuli. Waiting in the office in the quiet at the end of the evening one becomes aware of
the 60 Hz hum emanating from the electronics in the building. Of course, that sound is also there in
the day but it is not perceptible because it gets masked by the sounds of people talking, typing, and
moving around the building. Only when there are few other stimuli of louder volume does one
begin to perceive those softer sounds. This is an example of perceptual masking.
In the study of psychoacoustics, we differentiate between two types of masking: temporal masking
and frequency masking.
Frequency masking, also called simultaneous masking, is when a louder sound obscures
simultaneously occurring sounds at other frequencies. Temporal masking occurs when a loud
sound obscures a softer sound impulse that occurs shortly before or shortly after the loud sound.
(Goldberg, pp. 156-157)
Some perceptual coders exploit perceptual masking by predicting which parts of a signal are likely
to be masked and reducing the sample precision for those sections.
MP3 ENCODING
The popular MP3 encoder exploits perceptual masking to reduce the size of audio recordings. A
summary of the process is as follows (Goldberg, 2003):
1. Cut the incoming data stream into overlapping windows of 512 samples each. Each window
will be encoded separately. In the reconstruction phase the decoder will piece the signal
back together.
2. Apply a series of filters to separate the signal into 32 frequency bands. This allows the
encoder to control the accuracy of the encoding process independently for each frequency band.
3. Reduce the sampling rate of each band. For signals limited to a narrow frequency band, the
sampling rate can be significantly reduced without any loss of information. For a detailed
explanation and proof of this, see Bosi and Goldberg, pages 80-84.
4. Compute the masking effect for each frequency band and reduce the sample depth for
heavily masked frequencies. The most audible frequencies should be encoded with the
greatest accuracy possible but for frequencies that would not be clearly perceptible the
sample depth can be significantly reduced without a noticeable loss of signal quality.
Typically, the original signal has 16 bits per sample but for heavily masked frequencies,
much lower precision may be adequate.
5. Apply Huffman coding, a lossless compression technique, to further reduce the data rate. Huffman
coding is also one component of the compression used in the ubiquitous PKZIP format. It reduces file size
without any data loss by identifying the most frequently used patterns and replacing them
with shorter bit sequences.
An MP3 encoder with typical settings achieves compression by a factor of 10 for music files.
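The last step of the summary, Huffman coding, is easy to demonstrate on its own. The sketch below builds a textbook Huffman code for an arbitrary sequence of symbols; it is not the fixed-table scheme an actual MP3 encoder uses, and the function name `huffman_code` is an assumption.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each symbol to its Huffman bit string."""
    counts = Counter(symbols)
    if not counts:
        return {}
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (total count, tie-breaker, {symbol: code built so far}).
    heap = [(count, i, {symbol: ""}) for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups, prefixing one with 0 and the other with 1.
        count_a, _, group_a = heapq.heappop(heap)
        count_b, _, group_b = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in group_a.items()}
        merged.update({s: "1" + code for s, code in group_b.items()})
        heapq.heappush(heap, (count_a + count_b, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]


data = "aaaabbc"
code = huffman_code(data)
encoded = "".join(code[s] for s in data)
```

For this input the most frequent symbol receives a one-bit code and the other two receive two-bit codes, so the seven symbols occupy ten bits instead of the fourteen a fixed two-bit code would need.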
PRONY’S METHOD
While researching models of gas dynamics in 1795, Baron Gaspard Riche de Prony developed an
exact method for fitting a model of $p$ exponential functions to a dataset of $2p$ observations. His
method was later generalized to allow models containing sinusoidal functions and to accommodate
complex-valued input. In present-day usage, the idea is rarely used to fit exact models; instead, a
method of least-squares approximation is used to fit a model of a small number of complex
exponential functions to a relatively large set of samples.
Given samples $x[1] \dots x[N]$, the Prony method estimates complex values for the parameters $h_k$ and
$z_k$ to minimize
$$\rho = \sum_{n=1}^{N} \left| x[n] - \hat{x}[n] \right|^2$$
where
$$\hat{x}[n] = \sum_{k=1}^{p} h_k z_k^{\,n-1}$$
Prony’s method is especially useful when an appropriate value for $p$, the number of exponentials in
the model, is known. There are several algorithms for estimating $p$ based on trying the method
several times and comparing results, but the estimation is difficult because $\rho$, the total squared
error, is non-increasing as $p$ approaches $N$. (The estimation becomes exact when $p = N/2$.)
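For a feel of how the least-squares version works, the sketch below handles only the simplest case, $p = 1$, where both steps of the method reduce to closed-form sums and no matrix library is needed. It is an illustration under that restriction, not a general implementation; the function name `prony_single` is an assumption, and the list index starts at zero, so `x[n]` in the code corresponds to $x[n+1]$ in the formulas.

```python
import cmath


def prony_single(x):
    """Least-squares fit of x[n] ~ h * z**n (n = 0, 1, ...) with one exponential.

    Returns (h, z, rho), where rho is the total squared error of the fit.
    Assumes the samples are not all zero.
    """
    # Step 1: linear prediction. Find z minimizing sum |x[n+1] - z * x[n]|^2.
    numerator = sum(b * a.conjugate() for a, b in zip(x, x[1:]))
    denominator = sum(abs(a) ** 2 for a in x[:-1])
    z = numerator / denominator
    # Step 2: with z fixed, the model is linear in h, so solve for h directly.
    powers = [z ** n for n in range(len(x))]
    h = sum(v * p.conjugate() for v, p in zip(x, powers)) / sum(abs(p) ** 2 for p in powers)
    rho = sum(abs(v - h * p) ** 2 for v, p in zip(x, powers))
    return h, z, rho


# Synthetic test data: a single decaying complex exponential.
true_h = 2 + 1j
true_z = 0.9 * cmath.exp(0.3j)
x = [true_h * true_z ** n for n in range(20)]
h, z, rho = prony_single(x)
```

On noise-free data generated by the model itself, the fit should recover `true_h` and `true_z` up to rounding error, and `rho` should be essentially zero. For larger $p$ the first step becomes a linear least-squares problem followed by polynomial root finding.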
In modern usage, where the number of functions in the model is much less than the number of data
points, it is important to consider the effects of noise. Prony’s method is quite resistant to the
presence of noise if it is evenly distributed in frequency (white noise) but its accuracy is not as good
for narrowband noise. If the signal is relatively constant over a longer period of time we can
distinguish the signal from the noise by looking for correlations between samples taken at different
times.
Prony’s method is not an audio CODEC but it does provide a means for polyphonic pitch detection.
Its implementation is completely different from that of our own method and the details of it are
outside the scope of this paper but it is worth mentioning because both algorithms compute
parameters to fit a sum-of-sinusoids model to a set of periodic data.
PRINCIPAL COMPONENTS ANALYSIS FOR DATA REDUCTION
Prior to beginning research on the method described in this paper, we experimented with using
principal components analysis as a method for audio data compression. The results were
acceptable but neither the compression ratio nor the quality was an improvement over existing
compression schemes and the data format was conducive to neither musical analysis nor synthesis.
We mention it here because it may still be helpful, when combined with the present incarnation of
our CODEC, for either increasing data compression or for feature recognition.
Principal components analysis is a multivariate statistical technique for identifying the most
significant linear combinations of factors in a dataset:
Suppose we have a set of $m$ $n$-dimensional multivariate observations represented by the $m \times n$ data
matrix $X$. Let $U \Sigma V^*$ be the Singular Value Decomposition of $X$. The elements $\sigma_1 \dots \sigma_n$ along the
diagonal of $\Sigma$ are called the singular values of $X$. If we let $u_1 \dots u_m$ denote the $m$ columns of $U$ and let
$v_1 \dots v_n$ represent the $n$ rows of $V^*$, then we can write $X$ in terms of the orthogonal set of matrices
$Z_i = u_i v_i$ as follows:
$$X = \sum_{i=1}^{r} \sigma_i Z_i$$
The singular values $\sigma_i$ are arranged along the main diagonal of $\Sigma$ in non-increasing order so that $\sigma_1 Z_1$
is the most significant term in the sum shown above, followed by $\sigma_2 Z_2$ and so on. The term that
accounts for the least significant portion of the variance of the dataset is $\sigma_r Z_r$, where $r$ is the rank of
$X$.
Suppose we wanted to reduce the file size necessary to store $X$. We could leave out some of the less
significant $\sigma_i Z_i$ terms from the sum, often without any perceptible effect on $X$. This is especially
true when $X$ contains noise. For some audio recordings, the vast majority of the desirable part of
the signal resides within the first few terms of the sum while the remaining terms are mostly noise.
In some cases the signal to noise ratio improves after applying a judicious amount of this type of
data compression (Meyer, pp. 412-418).
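The most extreme version of this truncation keeps only the first term, $\sigma_1 Z_1$. The sketch below estimates that term for a small real matrix using power iteration, which is swapped in here for a full SVD routine so that the example needs nothing beyond the standard library. The function names are assumptions, and the method presumes the matrix is non-zero and that the starting vector is not orthogonal to the leading right singular vector.

```python
import math


def top_singular_triple(X, iterations=100):
    """Estimate (sigma_1, u_1, v_1) of a real matrix X by power iteration."""
    m, n = len(X), len(X[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        u = [sum(X[i][j] * v[j] for j in range(n)) for i in range(m)]   # X v
        w = [sum(X[i][j] * u[i] for i in range(m)) for j in range(n)]   # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    u = [sum(X[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(c * c for c in u))
    u = [c / sigma for c in u]
    return sigma, u, v


def rank_one_term(X):
    """Return sigma_1 * u_1 * v_1^T, the most significant term of the sum."""
    sigma, u, v = top_singular_triple(X)
    return [[sigma * ui * vj for vj in v] for ui in u]


# A matrix whose rows are all multiples of (4, 5), so its rank is one.
X = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
X1 = rank_one_term(X)
```

Because this `X` has rank one, the single retained term should reproduce it up to rounding error. For a noisy recording the discarded terms would carry the difference.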
In order to actually reduce the space $X$ occupies on the disk, we transform it to a new basis in which
the columns of $U$, called the left singular vectors, are the standard basis vectors. In order to do this,
we begin with the statement of the SVD:
$$X = U \Sigma V^*$$
Recalling that $U$ is unitary, we write
$$Y = U^* X = \Sigma V^*$$
where $Y$ is a representation of $X$ relative to the basis of left singular vectors. Now, if we want a
low-rank approximation to $Y$, we let $\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0)$ so that $\Sigma_k$ has lower rank than $\Sigma$. Then
$Y_k = \Sigma_k V^*$ is a rank-$k$ approximation to $Y$.
The advantage to this approximation is that $Y_k$ contains a row of zeros for every $\sigma_i$ that we removed
from $\Sigma$ when we defined $\Sigma_k$. We do not need to waste disk space storing those zero rows, so we
simply make a note that they were there and then reduce the size of $Y_k$.
It has been noted that the singular vectors are a generalized Fourier series. Indeed, the first few
singular vectors, when plotted, resemble the low-order sine and cosine terms of a standard Fourier
series basis. The higher terms, however, bear a diminishing resemblance to anything at all.
Because of this, we found it difficult to do any meaningful analysis or transformation based on PCA.
Ordinary PCM sampled data is so different in nature from the way we perceive sound, and from the
way we generate it, that it is only with great difficulty that linear transformations can accomplish
anything musical. But linear statistical techniques sometimes become more useful after a non-
linear transformation is applied. With this in mind, we set out to define a more perceptually
meaningful representation for audio signals.
3 - GOALS AND ASSUMPTIONS
DESIRABLE QUALITIES
Typically, the purpose of an audio CODEC is to reduce the size of a data file or lower the bit rate of a
stream. The algorithm we describe in this paper is intended for quite a different purpose: to
transform the data into a format where various types of analysis and transformations become
simple. In order to make it suitable for that purpose, we examine the deficiencies of existing
methods and consider how we might avoid them.
POLYPHONIC PITCH DETECTION
From a musician’s perspective, it is important to know what pitches are present in a sound sample.
Humans can learn to isolate the sound of an individual instrument in a recorded sound, to identify
the notes that instrument plays, and to replicate the sound by playing the same notes on another
instrument. A physical object in vibration, such as a piano string, produces a proliferation of
harmonic frequencies. As a result, the process of identifying the sound of a specific key on a piano
can be much more complicated than simply identifying the frequency of the sound. Nevertheless, if
we can estimate the strongest of those harmonics we are in a much better position to guess which
keys the pianist is pressing.
A monophonic composition is a musical piece for which, at any time during the performance, there
is never more than one fundamental pitch. Most wind instruments such as the flute or the
saxophone are only capable of playing one note at a time. Any piece in which such an instrument
plays unaccompanied will be monophonic. Stringed instruments such as the guitar or the piano are
capable of playing several notes simultaneously and therefore they can produce polyphonic sounds
without accompaniment.
For a monophonic signal, $f(t)$, we say that we have detected the pitch on a given time interval when
we have found parameters $a$, $\theta$ and $\phi$ such that we minimize $\| a \sin(\theta t + \phi) - f(t) \|$. For polyphonic
signals, we need to find parameters for amplitude, frequency, and phase; $a_k$, $\theta_k$, and $\phi_k$ to minimize
the following:
$$\left\| f(t) - \sum_{k=1}^{n} a_k \sin(\theta_k t + \phi_k) \right\|$$
In practice, this can only be done in rough estimation because $n$, the number of terms in the series,
is unknown.
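For the monophonic case, the minimization can be illustrated with a brute-force sketch. It assumes a Euclidean norm over uniformly sampled data and a caller-supplied grid of candidate frequencies; the name `fit_sinusoid` is hypothetical. For each candidate $\theta$, the identity $a \sin(\theta t + \phi) = A \sin(\theta t) + B \cos(\theta t)$ turns the search for $a$ and $\phi$ into a linear least-squares problem.

```python
import numpy as np

def fit_sinusoid(t, f, thetas):
    """Grid-search theta; solve for amplitude and phase by linear least squares."""
    best = None
    for theta in thetas:
        # a*sin(theta*t + phi) = A*sin(theta*t) + B*cos(theta*t)
        basis = np.column_stack([np.sin(theta * t), np.cos(theta * t)])
        coef, *_ = np.linalg.lstsq(basis, f, rcond=None)
        residual = np.linalg.norm(basis @ coef - f)
        if best is None or residual < best[3]:
            A, B = coef
            best = (np.hypot(A, B), theta, np.arctan2(B, A), residual)
    return best  # (a, theta, phi, residual norm)
```

The result is only as precise as the frequency grid, and the approach does not extend directly to the polyphonic case, where the number of terms $n$ is unknown.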
Most existing perceptual audio coders derive somehow from FFT² methods. The obvious advantage
of the FFT is its computational speed and simplicity of implementation. Musicians describe sound
as a sum of oscillations at a small, finite number of frequencies. In acoustical research, the FFT is
often used as a means for identifying the frequencies and amplitudes of these oscillations. But the
result of a Fourier transform can be quite different, conceptually, from the result that musicians
expect.
In its analytical form, the Fourier transform is usually stated as an improper integral with bounds
from $t = -\infty$ to $t = \infty$. But discrete Fourier transforms are always computed on a windowed signal,
that is, a signal that has been divided into short sections of samples. Although, for computational
reasons, the window size is usually chosen to be an integer power of two, in theory, it is arbitrary.
Since Fourier transforms are sometimes used for pitch detection, it is natural to say that the Fourier
series of a sound represents the constituent frequencies of that sound. It is important to consider in
what sense this is true. It has been proven that the Fourier series for any appropriately
band-limited continuous signal converges to that signal, but when musicians use Fourier analysis for pitch
detection they are not looking for a series that converges to the original signal. Instead, they want
to estimate the resonant frequency of the vibrating object that produces the sound. Before we can
extract that kind of information from a Fourier series we must carefully consider the effects of
windowing.
2 Fast Fourier Transform
The following figures show Fourier series for the function $\sin(2t)$. Computing the series on a
finitely bounded interval (figure 2), we can see that when the length of the interval is an integer
multiple of the period of the signal, the series contains only one non-zero term corresponding
exactly to the input function. Otherwise, the series contains an infinite number of terms and in
some cases does not approximate the original signal well unless we compute a large number of
them. In the second set of plots we see that when we change the length of the interval from 2π to 7
we get a series with a large number of high-amplitude terms that still fails to approximate the input
at the boundaries even after we carry it out to twelve terms.
If we had used the second series to estimate the wavelength of the input function we would
probably guess that it was near 3.5 but we would have no way of guessing how many other
frequencies might be in the signal.
The fourth plot (figure 3) shows eight terms in the Fourier series for $f(t) = \sin(2t) + \cos(3t)$ on $(0, 7)$.
In this case the Fourier series tells us only that a large portion of the energy is focused at the lower
frequencies. It gives no indication that the input was the sum of only two sinusoids.
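The effect of the interval length can be reproduced numerically. The sketch below is an approximation under our own assumptions: it uses a discrete Fourier transform of 1024 samples in place of the analytical series, and the helper names and the 1% threshold are hypothetical choices made for illustration.

```python
import numpy as np

def series_magnitudes(func, length, n_samples=1024):
    """Magnitudes of the discrete Fourier terms of func sampled on [0, length)."""
    t = np.arange(n_samples) * length / n_samples
    return np.abs(np.fft.rfft(func(t))) / n_samples

def significant_terms(mags, rel_threshold=0.01):
    """Count terms larger than a fraction of the largest term."""
    return int(np.sum(mags > rel_threshold * mags.max()))

def signal(t):
    return np.sin(2.0 * t)

mags_2pi = series_magnitudes(signal, 2.0 * np.pi)  # interval is a whole number of periods
mags_7 = series_magnitudes(signal, 7.0)            # interval is not

# The strongest term on (0, 7) is term 2, whose period is 7 / 2 = 3.5,
# while the true period of sin(2t) is pi.
peak_7 = int(np.argmax(mags_7))
period_estimate = 7.0 / peak_7
print(significant_terms(mags_2pi), significant_terms(mags_7), period_estimate)
```

On the interval of length $2\pi$ a single term carries the whole signal, while on the interval of length 7 the energy spreads over many terms.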
FIGURE 2 - FOURIER SERIES ON (0, 2π) CONVERGES WITH ONLY ONE TERM
FOURIER SERIES OF sin(2t) ON (0, 7); UP TO 12 INDIVIDUAL TERMS OF THE SAME SERIES
Polyphonic pitch detection, that is, identification of musical tones in signals with more than one
fundamental frequency, is a problem that is not well handled by even the best software.
FIGURE 3 - TERMS OF THE FOURIER SERIES FOR A POLYPHONIC INPUT
MUSICALLY WELL-PLACED BASIS FUNCTIONS
The terms of the Fourier series on a given interval form an
orthogonal basis for the vector space of continuous
functions on that interval. Although it can be generalized to
allow an infinite variety of bases, the frequency resolution
of Fourier analysis is always limited by the orthogonality
condition. Whenever we require increased frequency
resolution, we must increase the length of the interval of
our analysis. A typical FFT window for pitch detection
contains 2000 – 4000 samples. That represents between
1/20 and 1/5 of a second, depending on the sample rate.
Later in this paper, we will show that those windows are
much too long for transient elements of a typical music
recording.
In the graph on the left, the frequencies of the terms of a
Fourier series are plotted relative to the frequencies of the
keys on a piano keyboard. The frequencies of the Fourier
terms are in an arithmetic series but musical pitches follow
a geometric series. Consequently, the Fourier series
concentrates the vast majority of its frequency resolution at
the high end of the keyboard. If we choose a sufficiently
large window to give adequate resolution at the lower
frequencies we waste computational effort and get unnecessarily high resolution at the other end.
An ideal representation for musical analysis of audio data should distribute its frequency resolution
evenly over the keyboard, in other words, the frequencies of the basis functions should follow a
geometric series. In Fourier analysis, this would imply a violation of the requirement that the basis
should be a set of orthogonal functions.
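The mismatch between the arithmetic spacing of Fourier terms and the geometric spacing of musical pitches can be quantified with a short calculation. The sketch assumes a standard 88-key piano in equal temperament with A4 = 440 Hz at key 49, and picks a 44.1 kHz sample rate and a 4096-sample window purely as examples; none of these values come from our implementation.

```python
import numpy as np

def piano_key_frequencies():
    """Frequencies in Hz of 88 keys, assuming equal temperament with A4 = 440 Hz at key 49."""
    keys = np.arange(1, 89)
    return 440.0 * 2.0 ** ((keys - 49) / 12.0)

def fft_bins_between_keys(sample_rate, window_size):
    """Number of FFT bins that fit between each pair of adjacent piano keys."""
    bin_spacing = sample_rate / window_size          # arithmetic series: constant step
    key_spacing = np.diff(piano_key_frequencies())   # geometric series: growing step
    return key_spacing / bin_spacing

bins = fft_bins_between_keys(sample_rate=44100.0, window_size=4096)
print(f"lowest keys: {bins[0]:.2f} bins apart, highest keys: {bins[-1]:.1f} bins apart")
```

With these example settings the two lowest keys fall within a fraction of a single bin, while the two highest keys are separated by many bins.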
FREQUENCY-DEPENDENT TIME RESOLUTION
Up to this point we have discussed frequency analysis without regard to precision in the time
domain. But pitch detection has only a very limited musical utility if it doesn’t identify the pitch at
the correct time. In the previous section we mentioned that we can improve frequency resolution
by increasing the length of the analysis window. For FFT based algorithms, the inevitable effect of
increasing the length of the analysis window is a decrease in time resolution.
From a Fourier transform perspective of time-frequency analysis, the precision to which we can
define an event in time-frequency space is limited by the uncertainty principle. A mathematical
statement of this principle is given by the following relation:
$$\sigma_\omega \sigma_t \ge \eta$$
where $\sigma_\omega$ is the bandwidth of the event, $\sigma_t$ is the time-duration, and $\eta$ is a constant which depends
on characteristics of the window function.
Although the derivation of this inequality follows from Schrödinger’s generalized proof of
Heisenberg’s uncertainty principle for quantum physics, uncertainty for Fourier analysis of audio
signals has nothing to do with probability because no quantity is estimated. A Fourier transform
does not predict or estimate the frequencies in a signal; it simply transforms the data to an
alternative representation in a different basis. Therefore the quantity σ ω should not be thought of
as the standard deviation of an estimate of the frequency; it is simply the bandwidth of the
transformed representation of the sound. (Cohen, pp. 44-52, 88)
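The relation can be checked numerically for a Gaussian pulse. The sketch assumes that duration and bandwidth are measured as the standard deviations of $|s(t)|^2$ and $|S(\omega)|^2$; under that definition the product for a Gaussian is $1/2$, the smallest value any signal can reach. The function name `spreads` is hypothetical.

```python
import numpy as np

def spreads(signal, dt):
    """Standard deviations of |s(t)|^2 in time and |S(omega)|^2 in angular frequency."""
    n = signal.size
    t = (np.arange(n) - n // 2) * dt
    p_t = np.abs(signal) ** 2
    p_t = p_t / p_t.sum()
    mean_t = np.sum(t * p_t)
    sigma_t = np.sqrt(np.sum((t - mean_t) ** 2 * p_t))

    omega = 2.0 * np.pi * np.fft.fftfreq(n, d=dt)
    p_w = np.abs(np.fft.fft(signal)) ** 2
    p_w = p_w / p_w.sum()
    mean_w = np.sum(omega * p_w)
    sigma_w = np.sqrt(np.sum((omega - mean_w) ** 2 * p_w))
    return sigma_t, sigma_w

n, dt = 4096, 0.01
t = (np.arange(n) - n // 2) * dt
for width in (0.5, 1.0):
    sigma_t, sigma_w = spreads(np.exp(-t ** 2 / (2.0 * width ** 2)), dt)
    print(width, sigma_t, sigma_w, sigma_t * sigma_w)
```

Halving the width of the pulse halves $\sigma_t$ and doubles $\sigma_\omega$, so the product stays the same.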
In this project, however, we are interested in estimating both the time and frequency of each event
in the signal. We interpret the uncertainty principle in terms of probability, similar to Heisenberg.
This will become clearer mathematically when we discuss the particulars of our implementation
but for now we offer only an intuitive explanation:
For musicians, the concept of pitch is inherently probabilistic. For every acoustic situation there is
a perceptual limit of frequency resolution and that limit depends on the frequency of the sound.
This is a well known phenomenon to anyone who has ever played an upright bass or a bass guitar.
When the bassist plays pizzicato³, his tuning can be quite far off, in fact he may even be playing the
wrong note entirely; still it can be quite difficult to hear that there is any problem. But as soon as he
plays a note that is sustained for a longer period of time or begins to play with a bow his imprecise
intonation creates a very sour sound in the ensemble. This is not the case for high pitched
instruments. A flautist, for example, must be much more careful about tuning, even for notes of
short duration.
Consider the following two sinusoidal graphs:
3 When the true period of the signal does not correspond to the size of the analysis window we compute the Fourier series with the assumption that the signal in the analysis window is replicated infinitely many times so that the signal has infinite duration.
On hearing the first sample, a musician will
recognize it as a tone and he or she will be able to identify the name of the note. Although the
second sample lasts the same length of time, the listener will not be able to identify its pitch; in fact,
he will perceive a “thump” instead of a tone. The longer the wavelength of a sound, the more time it
takes for a human to perceive the pitch. The converse of this is also true. That is, higher pitches
may still be perceptible even if they change quickly or last only a short time.
We take this into account in our CODEC design in two ways: First, we plan that at high frequencies,
our time resolution should not be less than that of typical human hearing. And secondly, for low
frequencies we may significantly reduce our resolution without causing any perceptual error. In
that way we can reduce the need for data storage and computational effort.
REAL TIME OPERATION
The ability to function in real time is not a requirement for audio CODECS but it does increase the
range of possible applications. Pitch detection and transformation are useful in live musical
performance, and data reduction is necessary for communications technologies such as digital
cellular phones and voice-over-IP software.
The process of encoding and decoding an audio signal can be conceptualized in three phases as
shown in the following diagram:
Typically, the encoding phase is more time consuming than the decoding phase and the time
required for the storage or transmission phase depends more on the speed of the transmission
medium than on the speed of the CODEC.
If the decoder is not fast enough to send the signal directly to the sound output device then it must
save its output to another file format that can play back in real-time. That renders the whole
system useless or at least too cumbersome for practical purposes. Therefore we require that the
decoding phase should be fast enough for real-time operation.
As we will see in a later section about design restrictions, we have restricted our encoder algorithm
to the use of functions that are theoretically capable of real-time operation.
4 - THEORETICAL BASIS OF THE ALGORITHM
In order to meet the objectives mentioned in the previous chapter we are forced to sacrifice some of
the precision of Fourier analysis and adopt a more intuitive and perceptual approach. In place of
mathematical analysis, we take the human auditory system as our example and as our rubric.
The study of sensory perception becomes an increasingly ill-defined science as the focus of the
investigation shifts away from gross anatomy of the sense organs and into the subtleties of
psychology. A central question in the field of Psychoacoustics is “What electrochemical format does
the brain use to represent the sounds it perceives?” In efforts to answer that question, audiologists
use electrical probes to measure the electrical response of individual auditory neurons as the test
subject listens to various sonic stimuli but of course, with more investigation, the question becomes
more complicated.
The brain is a highly-distributed computing environment - not easily represented by the kind of
data-path flow charts we might use to describe a computer program. Unlike computer software,
biological sensory perception processes never reach a point that can be considered “output”; they
receive input from the sense organs and begin processing the data, moving it quickly through an
ever-widening data-path that never reaches any objective point of completion. That is, there is no
part of the brain that can be considered the ultimate observer of sense experience.
This creates a problem for anyone who would try to understand sensory psychology: Even if we are
completely aware of all electrochemical processes involved, in what sense does that imply
understanding?
For this reason we must exercise caution when incorporating any aspect of psychoacoustics into
our CODEC. We do not intend to produce an accurate computational model of the auditory cortex
or even to include all known properties of human hearing into our algorithm. Instead, this project
stands somewhere between an analytical perspective and a psychoacoustic perspective.
We intend to employ our knowledge of psychoacoustics in three ways: First, to aid us in defining
perceptual measures of the accuracy of our algorithm - as much as our CODEC produces perceptible
signal degradation, we consider it to be in error; whenever it does not, we declare it “good enough”.
Second, it makes us aware of what is possible – if a human listener can detect a difference between
tones of 400 Hz and 401 Hz, but our program cannot, then there must be a way to improve our method.
Finally, it gives us occasional inspiration regarding computational methods we might use.
ANATOMY OF THE AUDITORY SYSTEM
FIGURE 4 - FLOWCHART OF MAJOR HEARING EVENTS (GULICK, P. 74)
Our research focuses on the last two stages of the flowchart of figure 4, especially the filtering
function of the internal ear. Figure 5 shows a simplified sketch of the physical parts responsible for
this process. Of primary interest is the cochlea, the spiral-shaped organ on the right side of the
illustration. This is the location of the hearing transducer cells and it is believed to be the organ
responsible for separating the incoming sound into component frequencies.
FIGURE 5 - A CROSS SECTIONAL DIAGRAM OF THE EAR (VON BEKESY, P. 11)
Figure 6 shows an idealization of the cochlea and basilar membrane uncoiled from its spiral shape
and straightened into a long tube. This kind of reconfiguration is believed to preserve the
resonance characteristics of the organ while facilitating modeling and visualization. The cochlea is
divided into two compartments by the basilar membrane and completely filled with fluid. The
primary sensory transducer cells are located on the membrane itself.
FIGURE 6 - A STRAIGHTENED-OUT MODEL OF THE COCHLEA AND BASILAR MEMBRANE (FLETCHER, P. 47)
THEORIES OF TIME-FREQUENCY ANALYSIS IN HUMAN HEARING
In the past two hundred years, theories of hearing followed loosely after two conflicting models.
The Resonance-Place Theory, suggested by Helmholtz in 1863, said that the basilar membrane
consisted of an array of fibers acting as tuned resonators, each having a unique frequency of
maximum response. Sensory nerves attached to each fiber responded to vibrations by sending out
electrical impulses. He supposed that the brain
identified the frequency of sounds by identifying the
points along the membrane where displacement was
at a local maximum. Although this theory was
intuitively attractive it depended on some faulty
assumptions. Helmholtz proposed that the fibers
were under considerable tension in the transverse
direction but that the tension of the membrane itself in the longitudinal direction was negligible.
Later research showed that the tension of the membrane is roughly equal in both directions.
Another difficulty is related to the range of frequencies we perceive; the variation in the length and
mass of the fibers is not enough to account for the observed variation in frequency sensitivity of the
hearing sense (Gulick, pp. 60-62).
FIGURE 7 – HELMHOLTZ THEORY OF RESONATING FIBERS IN THE BASILAR MEMBRANE
In 1886 Rutherford proposed a theory that completely ignored the supposed resonance properties
of the basilar membrane. Called the Frequency Theory, it suggested that frequency separation was
strictly a function of the central nervous system. More recent evidence shows that the auditory
nerve transmits a complete and accurate
electrical representation of the input sound. In
one case, researchers observed that amplified
neural signals from the ear of a cat were
perceptible as speech; confirming that the nerves
respond to much more than the location of
maximum displacement of the basilar membrane
(Gulick, p. 69). An obvious difficulty with this theory is its inability to explain the purpose of the
particular shape of the cochlea. Experiments have proven that its design guarantees that every
frequency of vibration within the range of hearing does correspond to a unique location of
maximum displacement in the basilar membrane.
Aspects of both theories persist among modern explanations but the currently prevailing opinions
center around an alternative theory proposed by Georg Von Bekesy in 1928. His contribution,
called the Traveling-Wave Theory, says that it is the structure of the membrane itself, not the
tension of fibers running across it, that accounts for the frequency-dependent location of maximum
displacement. In the 1950’s, Bekesy built a large-scale experiment resembling the model in figure 6
that produced vibrations in a rubber membrane such that the forearm of a researcher could
substitute for the nerve sensors in the basilar membrane. This apparatus, shown in figure 9,
consists of a plastic tube cast around a brass tube. The tubes are sealed and filled with fluid. On the
top edge is a rubber strip that allows the researcher to feel vibrations from inside the tube. On the
end, a piston driven by a mechanical oscillator produces pressure waves in the fluid. This
experiment was one of many used to verify Bekesy’s traveling-wave theory of frequency perception
(Von Bekesy, 1960).
FIGURE 8 - DIAGRAM SHOWING FREQUENCY-DEPENDENT LOCATION OF MAXIMUM DISPLACEMENT ALONG THE BASILAR MEMBRANE (GOLDBERG, P. 173)
In the model shown in figure 9, the skin of the forearm senses the location and intensity of vibration
in the rubber strip. In a sense, the person in the picture is “hearing” the vibration of the piston
through the nerves of his arm. We can take it for granted that the nerves on the forearm are not
sensitive enough to translate unfiltered vibrations into any sensory perception that resembles
hearing. If they were, then people born with total hearing loss could learn to hear by placing their
hand on the vibrating cone of a loudspeaker. But the cone of a loudspeaker doesn’t filter particular
frequencies into unique locations like the basilar membrane does. Bekesy’s theory raises an
interesting question: Is the frequency-dependent location of maximum basilar membrane
displacement enough to account for the human frequency discrimination ability? Or, to put it
another way: If the cochlea causes each frequency to stimulate a unique set of nerves, what’s left for
the brain to do?
FIGURE 9 - BEKESY'S MODEL OF THE COCHLEA (VON BEKESY, P. 546)
As an incidental outcome, our project provides an answer to that question. Although our numerical
model resembles Helmholtz’s array of tuned resonators more than Bekesy’s Traveling-Wave
theory, all three share an important likeness: They disregard the phase of the input signal; using
only location-intensity information to identify pitches. Mathematically, they measure frequency
and amplitude as real numbers. Since our CODEC also disregards the phase of the input, it
demonstrates the quality of perception that would be possible if the human auditory system were
sensitive to only two quantities: the intensity and location of vibrations in the basilar membrane. Of
course, perceptual deficiencies in our own results do not imply inadequacy of the Traveling-Wave
theory but the success of our codec demonstrates that it is possible to make accurate analysis of
sound based only on the amplitude and location of displacement of the basilar membrane.
SPECTROGRAMS FOR TIME-FREQUENCY ANALYSIS⁴
A spectrogram is a graphical representation of a sound that shows frequency, intensity, and time. It
is either colored or plotted in 3D to demonstrate all three quantities simultaneously. An ordinary
two-dimensional spectral-analysis graph of a chirp (figure 10, left) shows the frequencies in the
sound and their respective amplitudes but it gives no indication of when those frequencies appear.
On the right side of figure 10, we see the spectrogram of the same sample. This time we can clearly
see that the sound consists of a single tone that began at a high frequency, swept down to a low
pitch, and rose again following a quadratic curve.
4 Pizzicato is a musical term for notes played as short as possible. For orchestral strings the player touches or sometimes plucks the strings with his or her fingers instead of bowing.
Its name seems to imply that a spectrogram is merely a type of plot, a visualization of sound, but
some types of spectrograms are reversible. In other words, the graphical output represents the
data so exactly that we can convert the image back into sound and reconstruct the input. The first
part of the method we present in this paper can be classified as a type of spectrogram and the
explanation of the remaining part requires the use of similar time-frequency analysis terminology.
Therefore it is appropriate to take a few pages here to outline the mathematics of spectrograms and
time-frequency analysis.
FIGURE 10 - A SPECTRUM (LEFT) AND A SPECTROGRAM (RIGHT) OF A CHIRP
We begin by defining $s(t)$, the amplitude of a signal at time $t$. $S(\omega)$ is the amplitude of the frequency
$\omega$ over the entire duration of the signal. The two are related in the following way⁵:
$$S(\omega) = \frac{1}{\sqrt{2\pi}} \int s(t)\, e^{-j\omega t}\, dt$$
and
$$s(t) = \frac{1}{\sqrt{2\pi}} \int S(\omega)\, e^{j\omega t}\, d\omega$$
In the language of signal processing $S(\omega)$ is sometimes called the frequency-domain representation,
while $s(t)$ is called the time-domain representation of the signal. It is easy to switch between the
two by means of Fourier transforms but with both representations there is a certain deficiency: the
time-domain representation tells us instantaneous amplitude with perfect accuracy but says
nothing about what frequencies are in the signal. Conversely, the frequency-domain representation
gives no insight into the timing of events within the signal.
Ideally, we would like to have both kinds of information simultaneously in the form of a function
$P(t, \omega)$ from which we could compute the intensity of energy in the signal at time $t$ and frequency $\omega$.
Borrowing from the language of probability, this is called the joint energy-distribution-function
(EDF) of the signal. Spectrograms are one type of algorithm that computes an approximation of
$P(t, \omega)$.
If we want to compute the energy in the signal in the two-dimensional interval $\omega_0 < \omega < \omega_1$, $t_0 < t < t_1$
then we integrate $P(t, \omega)$ as follows:
$$E = \int_{\omega_0}^{\omega_1} \int_{t_0}^{t_1} P(t, \omega)\, dt\, d\omega$$
5 This chapter summarizes the relevant information from (Cohen, 1995).
The functions $s(t)$ and $S(\omega)$ are related to $P(t, \omega)$ through intensity. Intensity is defined as the
square of amplitude so that $|s(t)|^2$ is the intensity per unit time and $|S(\omega)|^2$ is the intensity per unit
frequency. These two measures of intensity are known as marginal energy-distribution-functions
because of the following analogues to marginal probability-distribution-functions:
$$\int P(t, \omega)\, d\omega = |s(t)|^2$$
$$\int P(t, \omega)\, dt = |S(\omega)|^2$$
Any approximation of $P(t, \omega)$ for which these equations hold is said to “satisfy the marginals”.
Many widely-used time-frequency analysis techniques do not satisfy the marginals. Those that do,
tend to introduce artifacts that confuse the analysis.
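In discrete form, checking the marginals amounts to summing a sampled distribution along each axis. The sketch below is an illustration only: it assumes a matrix `P` with time along the rows and frequency along the columns, uses a unitary FFT so that both representations carry the same total energy, and builds a deliberately trivial distribution (the product of the two marginals divided by the total energy) as a test case. The function names are hypothetical, and the product distribution is not the method developed in this paper.

```python
import numpy as np

def satisfies_marginals(P, s, tol=1e-9):
    """Check both marginal conditions for a sampled distribution P[time, frequency]."""
    S = np.fft.fft(s, norm="ortho")   # unitary transform: sum|S|^2 equals sum|s|^2
    time_ok = np.allclose(P.sum(axis=1), np.abs(s) ** 2, atol=tol)
    freq_ok = np.allclose(P.sum(axis=0), np.abs(S) ** 2, atol=tol)
    return bool(time_ok and freq_ok)

def product_distribution(s):
    """Trivial distribution that satisfies the marginals by construction."""
    S = np.fft.fft(s, norm="ortho")
    energy = np.sum(np.abs(s) ** 2)
    return np.outer(np.abs(s) ** 2, np.abs(S) ** 2) / energy
```

The product distribution satisfies the marginals but carries no joint time-frequency information, which illustrates why the marginal conditions alone do not determine a useful EDF.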
Figure 11 shows an example of a distribution that
satisfies the marginals but exhibits significant
inaccuracies whenever the input is polyphonic.
There are infinitely many ways to construct an EDF.
Especially if we do not require satisfaction of the
marginals, mathematics can tell us very little about
how it should or should not be done. The
construction of methods to compute an EDF is a pragmatic science, guided by intuition, analogy,
and necessity. In order to define our own procedure, we refer back to the goals of our project and
expand the ideas mentioned there. We would like our EDF to have the following properties:
Real time operation: Many time-frequency analysis techniques incorporate information
from the entire signal into the computation of the value at a given time. Our technique
processes the time-domain signal, $s(t)$, in order to compute $P(t, \omega)$. We require that the
value of $P(\tau, \omega)$ may depend only on values of the function $s(t)$ for $t \le \tau$. In other words, we
should process the signal data file in order from beginning to end, possibly using
information from earlier in the file but never looking ahead of the current position.
FIGURE 11 - EDF FOR THE SUM OF TWO CHIRPS AS GIVEN BY THE WIGNER DISTRIBUTION. NOTICE THE LARGE ARTIFACTS BETWEEN THE TWO MAIN LOBES. (COHEN, P. 127)
Strictly Non-negative Energy: Although the analogy between energy distribution
functions and probability distribution functions suggests that this should be a requirement,
some time-frequency algorithms do not guarantee that $P(t, \omega)$ is positive over the whole
domain. The Wigner distribution shown in fig. 11 suffers from this problem – the spurious
values between the two main ridges of the chirps contain many negative values.
Mathematically, there is nothing wrong with negative values of energy; in fact, the Wigner
distribution is perfectly invertible. But the negative values for energy don’t make sense
according to physical intuition. Furthermore, since we don’t perceive any frequency
between the two chirps it is upsetting to see such large spikes located there.
Local-priority estimate: For the EDFs given by some techniques the relationship between
the original signal and the EDF doesn’t necessarily follow intuitive principles of locality. So,
for example, the value of $P(\tau, \omega)$ may be quite large even though $s(t) = 0$ in the neighborhood
of $\tau$ ($\tau - \epsilon < t < \tau + \epsilon$ for significantly large $\epsilon$).
Figure 12 shows a graphical example of this type of problem. On the far left is a graph of
the time-domain representation of the signal. The EDF given by the Wigner distribution is
shown in (a). For this distribution there is a huge artifact in the period of silence between
the two tones. This is an example of a failure to exhibit intuitive temporal locality. Part (b),
the Margenau-Hill distribution, avoids contaminating silence with noise but shows aliasing
in the frequency domain. The Page running-spectrum distribution (c) has some of the same
problem as Margenau-Hill except, since its output at time $t$ only considers the signal up to
that time, frequency domain aliasing only moves in one temporal direction.
Our method strives for a local-priority estimate of the energy, that is, the value of $P(t, \omega)$ is
most heavily influenced by the values of $s(t)$ and $S(\omega)$ in the neighborhood of $t$ and $\omega$. The
influence of $s(t)$ and $S(\omega)$ on the EDF at the point $(\tau, w)$ decreases as the distances $|t - \tau|$ and
$|w - \omega|$ increase. We call this local-priority estimation because although $P(t, \omega)$ is influenced
by events in the signal that are not in the vicinity of $(t, \omega)$, nearer parts of the signal always
have greater priority of influence than those farther away.
FIGURE 12 - LOCALITY ISSUES IN THREE DISTRIBUTIONS FOR THE SAME SIGNAL (COHEN, P. 177)
A local priority estimate is achieved by multiplying the signal by a window function that limits the
influence of parts of the signal that are far from the area of interest. Consider the Page running-
spectrum distribution (demonstrated in part c of figure 12):
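$$P(t, \omega) = 2\,\Re\left[ \frac{1}{\sqrt{2\pi}}\, s^*(t)\, S_t^-(\omega)\, e^{j\omega t} \right]$$
where $S_t^-(\omega)$ is the frequency-domain representation of the signal up to time $t$:
$$S_t^-(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} s(\tau)\, e^{-j\omega\tau}\, d\tau$$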
It is clear from this representation why the distribution is zero in the time corresponding to the
silent parts of the signal shown in figure 12; the $s(t)$ term in the definition prevents the EDF from
being non-zero whenever $s(t)$ itself is zero. But the infinite lower bound on the integral in the
definition of $S_t^-(\omega)$ implies that non-zero parts of signals from the past will contaminate non-zero
parts of the EDF in the future and that the contamination will continue indefinitely. A natural
solution is to limit the influence of past events by computing $S_t^-(\omega)$ with a finite lower bound on the
integral $\int_{-\infty}^{t} s(\tau)\, e^{-j\omega\tau}\, d\tau$. This is precisely the motivation for using a spectrogram.
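In general form, the EDF of a spectrogram is given by
$$P(t, \omega) = \left| \frac{1}{\sqrt{2\pi}} \int e^{-j\omega\tau}\, s(\tau)\, h(\tau - t)\, d\tau \right|^2$$
where $h(\tau - t)$ is a window function that typically has the following characteristics:
$$h(\tau - t) = \begin{cases} 1 & \text{when } \tau = t \\ \text{between } 0 \text{ and } 1 & \text{when } \tau \text{ is near } t \\ 0 & \text{when } \tau \text{ is far from } t \end{cases}$$
The definition of the function $h$ has significant effect on the properties of the EDF given by the
spectrogram.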
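A discrete version of the spectrogram EDF can be sketched as follows. This is a generic textbook construction, not the analysis method of our CODEC. It assumes a Hann window for $h$ (close to 1 at its center and falling to 0 at its edges), frames indexed by their starting sample, and it drops the constant factor $1/\sqrt{2\pi}$; the function name and parameters are hypothetical.

```python
import numpy as np

def spectrogram(s, window_size, hop):
    """Squared magnitude of the windowed Fourier transform, one row per frame."""
    h = np.hanning(window_size)                      # window function h
    starts = range(0, len(s) - window_size + 1, hop)
    frames = np.array([s[i:i + window_size] * h for i in starts])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # P[frame, frequency bin]

# Example: a steady 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000.0
t = np.arange(8000) / sample_rate
P = spectrogram(np.sin(2.0 * np.pi * 1000.0 * t), window_size=256, hop=128)
peak_bins = P.argmax(axis=1)   # strongest frequency bin in every frame
```

Because each value is a squared magnitude, this estimate is never negative, which is one of the properties we require of our own EDF.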
One such property which is of fundamental importance is the time-frequency resolution. Recall
from our discussion about the uncertainty principle that there is a tradeoff between time resolution
and frequency bandwidth. The inclusion of the window function is an attempt to improve the time
resolution by forcing the effect of sound events not to influence the value of the EDF outside of a
narrow time interval. The natural result as a consequence of the uncertainty principle is a widening
of the bandwidth of each event.
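The widening of the bandwidth can be measured directly. The sketch below assumes a Hann window, a 1000 Hz test tone sampled at 8000 Hz, and zero-padding to obtain a finely sampled spectrum; it reports the width of the spectral peak at half of its maximum power. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def bandwidth_hz(window_size, sample_rate=8000.0, tone_hz=1000.0, pad=1 << 16):
    """Width in Hz of the spectral peak at half its maximum power."""
    t = np.arange(window_size) / sample_rate
    frame = np.sin(2.0 * np.pi * tone_hz * t) * np.hanning(window_size)
    power = np.abs(np.fft.rfft(frame, n=pad)) ** 2   # zero-padded for a fine frequency grid
    above = np.flatnonzero(power >= 0.5 * power.max())
    return (above[-1] - above[0]) * sample_rate / pad

short_bw = bandwidth_hz(64)     # 8 ms window: good time resolution
long_bw = bandwidth_hz(1024)    # 128 ms window: good frequency resolution
print(short_bw, long_bw, short_bw / long_bw)
```

Shortening the window by a factor of 16 widens the peak by roughly the same factor, so the product of window duration and bandwidth stays approximately constant.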
It should be noted that distributions such as the Page running spectrum and the Wigner
distribution, since they are computed using an integral with infinite bounds, have the potential for
infinitely precise frequency resolution. But in the graphs shown in figures 11 and 12 this does not
appear to be the case. The peaked parts of the graph show a gentle curvature on either side of the
base. This is the result of an implicit window function forced upon our analysis because we were
processing signals of finite duration. Even though we did not apply any window, our signal is non-
zero for only a limited time so the EDF behaves similarly to the way it would if we had windowed
the input, that is, each event is somewhat spread out in the frequency domain.
Spectrograms do not, in general, satisfy the marginals. The addition of the windowing function into
the expression for $P(t,\omega)$ confounds the energy density of the signal function with the energy of the
window function and introduces effects unrelated to the properties of the original signal. Our
CODEC also doesn’t satisfy the marginals. We will discuss how this affects the quality of the output
in the next chapter when we explain the perceptual masking phase of our algorithm.
5 - OUR IMPLEMENTATION
FIGURE 13 - OVERVIEW OF OUR PROGRAM
Our algorithm works in four phases as shown in the chart above. In this section we describe the
details of each phase of the process.
THE ANALYSIS PHASE
DISCRETE MODEL OF A DAMPED HARMONIC OSCILLATOR
Helmholtz’s resonance-place theory of frequency filtering in the basilar membrane is at the heart of
our time-frequency analysis method. Recall that Helmholtz imagined the basilar membrane to
consist of an array of tuned resonators responding in unison to the stimulation of vibrations of
various frequencies as they pass through the cochlea. He knew that every frequency of vibration
would stimulate each one of the resonating fibers, but for a given input frequency, the fibers whose
resonance most closely matched that pitch would respond most strongly. This, he supposed, was
the way we identify individual frequencies in the sounds we hear. Although his theory was later
supplanted by Bekesy’s traveling-wave theory, the idea that different frequencies correspond to
different locations on the membrane persists. Helmholtz’s theory still has one advantage over the
traveling-wave idea: computational simplicity. Even at the time when the field of psychoacoustics
was still in its infancy, resonance was already a well-understood phenomenon. The Helmholtz
model can be implemented on a computer as a simple array of ordinary differential equations, one
for each resonator. The traveling-wave theory, on the other hand, corresponds to a non-linear
inhomogeneous partial differential equation. We did not find any evidence in the psychoacoustical
literature that it has ever been modeled mathematically6.
Regardless of whether the cochlea is home to an array of damped harmonic oscillators or to a nasty
partial differential equation, these two facts remain uncontested:
1. When vibration, at any frequency, stimulates the fluid in the cochlea the whole basilar
membrane is set in motion.
2. The magnitude of displacement depends on the frequency of the vibration and the location
along the membrane.
We have based our model on the older Helmholtz theory, mostly because we have discovered a
simple and efficient discrete method for modeling the response of an array of damped harmonic
oscillators, but also because we believe it is qualitatively similar to the traveling-wave model:
similar enough that the advantages of the Helmholtz model’s simplicity outweigh the
disadvantages of its inaccuracy.
We begin to derive our method by considering a well known result about the amplitude of
resonance in a damped harmonic oscillator driven by a sinusoidal forcing function. If the resonator
is represented in the form of a mass-spring-dashpot system with mass $m$, spring constant $k$, and
damping constant $c$ then we use the following differential equation as a model:
$$m x'' + c x' + k x = f(t)$$
In the cochlea, the sound vibrations entering from outside provide a forcing function, stimulating
vibrations within. When the forcing function is a single sinusoid with amplitude $A$ and frequency
$\omega$, the amplitude of resonance induced in a damped harmonic oscillator is known to be given by the
following formula7:
$$\frac{A}{\sqrt{(k - m\omega^2)^2 + (c\omega)^2}} \qquad (1)$$
6 $j$ is the imaginary square root of -1. These integrals appear without any specification of the upper and lower bounds. The integration should be carried out over the whole domain of the variables of integration. This applies also to all other expressions in this paper where an integral appears without bounds.
The derivation of this formula, which appears in most ordinary differential equations textbooks8, is
well known and we will not repeat it here. Instead we present an alternative method for arriving at
the same expression. The advantage of this second method is that we can prove it generalizes to
give the amplitude of resonance even if the forcing function is not a sinusoid. Normally, we would
need to solve the differential equation or compute the Fourier series of the signal before we could
find the amplitude of resonance but with our new method we can compute it for any forcing
function even if we don’t have an analytical expression for the function.
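Formula (1) is easy to evaluate directly. The sketch below sweeps the forcing frequency for one oscillator; the mass, damping and spring values are hypothetical, chosen only so that the undamped resonant frequency $\sqrt{k/m}$ is 30, and the function name is ours.

```python
import math

def steady_state_amplitude(A, m, c, k, omega):
    """Formula (1): steady-state amplitude of m x'' + c x' + k x = A cos(omega t)."""
    return A / math.sqrt((k - m * omega ** 2) ** 2 + (c * omega) ** 2)

# Hypothetical parameters: sqrt(k / m) = 30.
m, c, k = 1.0, 2.0, 900.0
forcing = list(range(0, 101))                      # forcing frequencies 0..100
amps = [steady_state_amplitude(1.0, m, c, k, w) for w in forcing]
peak_omega = forcing[amps.index(max(amps))]        # forcing frequency with the largest response
```

With light damping the largest response on this grid falls at the resonant frequency, and the response drops off on both sides of it.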
7 We made our own attempt but found that our results were at odds with Von Bekesy’s. Bekesy showed evidence that the traveling waves are a combination of transverse and longitudinal vibration. Since the equations that model that type of vibration are difficult to analyze, we considered separable non-linear PDEs that model the transverse components of the vibration. The very fact that the equations were separable precludes any possibility of producing modes of vibration that appear as “traveling waves”. Since we do not have any interest in dissecting cadavers, we decided instead to accept Von Bekesy’s Nobel-prize-winning conclusions without making any further investigation.
8 This formula gives the steady-state response of the harmonic oscillator so we don’t make any stipulation about what initial conditions we use to solve the equation. The transient solution decays in time and leaves this result regardless of what initial condition we use.
FIGURE 14 - AMPLITUDE OF A DAMPED HARMONIC OSCILLATOR WITH RESONANT FREQUENCY $\omega_r = 30$ AS THE FORCING FREQUENCY VARIES FROM $\omega_f = 0$ TO $100$
Let $s(t)$ be the amplitude of a continuous audio signal at time $t$. The response to the signal $s$ of a
damped harmonic oscillator tuned to a resonant frequency of $\omega_r$ is given by the product of the
signal with a complex exponential (the oscillatory component of the resonator) and a real
exponential function (the damping component of the resonator, with damping constant $\Gamma$),
integrated over all time up to the current moment9:
$$\int_{-\infty}^{t} s(\tau)\, e^{-i\omega_r\tau}\, e^{\Gamma(\tau - t)}\, d\tau \qquad (2)$$
Usually the input signal contains oscillatory components. Even if that is not the simplest
representation, we can always use Fourier transforms to write it as a linear combination of complex
exponentials. Let us assume that our signal is a single complex exponential forcing function with
real-valued frequency $\omega_f$ and real amplitude $A$. In that case we can simplify the expression for the
response in the following way:
$$\int_{-\infty}^{t} s(\tau)\, e^{-i\omega_r\tau}\, e^{\Gamma(\tau - t)}\, d\tau = \int_{-\infty}^{t} A\, e^{i\omega_f\tau}\, e^{-i\omega_r\tau}\, e^{\Gamma(\tau - t)}\, d\tau = \frac{A\, e^{i(\omega_f - \omega_r)t}}{\Gamma + i(\omega_f - \omega_r)} \qquad (3)$$
Now, we are concerned with the amplitude of oscillation, not the phase, so we find the complex
magnitude by multiplying by the complex conjugate and taking the square root:
$$\frac{A}{\sqrt{\Gamma^2 + (\omega_f - \omega_r)^2}} \qquad (4)$$
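The closed-form magnitude can be checked numerically. The sketch below approximates the damped integral of expression (2) with a midpoint sum for a complex exponential input and compares its magnitude with $A/\sqrt{\Gamma^2 + (\omega_f - \omega_r)^2}$. The truncation span, the step size, the parameter values and the helper name `response` are assumptions of the example.

```python
import cmath
import math

def response(s, t, omega_r, gamma, span=40.0, dt=1e-3):
    """Midpoint-rule estimate of expression (2), truncated to [t - span, t]."""
    steps = int(span / dt)
    total = 0j
    for i in range(steps):
        tau = t - span + (i + 0.5) * dt
        total += s(tau) * cmath.exp(-1j * omega_r * tau) * math.exp(gamma * (tau - t)) * dt
    return total

# Hypothetical values for the forcing and the resonator.
A, omega_f, omega_r, gamma = 2.0, 12.0, 10.0, 0.5
numeric = abs(response(lambda tau: A * cmath.exp(1j * omega_f * tau), 3.0, omega_r, gamma))
closed_form = A / math.sqrt(gamma ** 2 + (omega_f - omega_r) ** 2)
```

The truncation is harmless here because the damping factor has decayed to roughly $e^{-20}$ at the far end of the span.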
If we set $\Gamma = c\omega$, $\omega_r = k$, and $\omega_f = m\omega^2$ then formula (4) is equivalent to formula (1),
demonstrating that (2) is an alternative method for computing the amplitude of resonance of a harmonic
oscillator when $s(t)$ is a sinusoidal forcing function of known amplitude and phase. But what if $s$ is
not a sinusoid?
9 (Edwards, 2005)
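Provided $s$ can be represented by a Fourier series, we can re-write (2) as follows
$$\int_{-\infty}^{t} \left[\sum_n A_n\, e^{i\omega_{f_n}\tau}\right] e^{-i\omega_r\tau}\, e^{\Gamma(\tau - t)}\, d\tau$$
and by the linearity of the integral operator, the response of the resonator is
$$\sum_n \frac{A_n\, e^{i(\omega_{f_n} - \omega_r)t}}{\Gamma + i(\omega_{f_n} - \omega_r)}$$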
We cannot simplify the magnitude of this sum to a sum of terms resembling (4) because the complex
magnitude operator is non-linear. As we will discuss later, this fact mathematically explains the crucial
distinction between Rutherford’s frequency theory and those of Helmholtz and Von Bekesy.
Now that we know (2) expresses the response of a damped harmonic oscillator at time $t$ to the
forcing function $s(t)$, we need to find a discrete form of the expression so that we can compute it
when the forcing function is given in terms of samples $x[n]$.
Actually, we derived (2) from the discrete formula, not the other way around. Therefore it will be
simpler if we describe the algorithm first, without giving any derivation, then show how it
approximates (2) in the limit as the sampling rate, $f_s$, approaches infinity.
Suppose we want to approximate the resonance of the harmonic oscillator of resonant frequency $\omega_r$.
Oscillating at its resonant frequency, the vibration follows the function $e^{i\omega_r t}$ so the time for a
single period of oscillation is $\frac{2\pi}{\omega_r}$. At a sampling rate of $f_s$ there are
$n = f_s \frac{2\pi}{\omega_r}$ samples per period of oscillation. Let $z_n$ denote the $n$th complex root of
unity so that $(z_n)^n = 1$. The sequence $Z = z_n^0, z_n^1, \ldots, z_n^k$ represents $k$ consecutive samples
of the function $e^{i\omega_r t}$. If $x[k]$ is the $k$th sample of the input signal $X$ then the sum
$$\sum_k z_n^k\, x[k] \qquad (5)$$
is the inner product of the vectors $X$ and $Z$.
This is like a discrete approximation for (2) without the damping term $e^{\Gamma(\tau - t)}$. We could
approximate the damping with discrete exponents of a real number $a \in (0,1)$ just as we did for the
oscillation term with exponents of $z_n$:
$$\sum_{k=-\infty}^{\kappa} a^{\kappa - k}\, z_n^k\, x[k]$$
But then when we move on to the next sample, $x[k+1]$, we would have to completely re-calculate
the sum because at the next sample $\kappa = k+1$. That would mean that in practical operation we would
be calculating an inner product of vectors whose length increases with every new sample we
process, making the algorithm computationally infeasible.
Figure 15 shows the signal, $x[k]$, the damping term, $a^{\kappa - k}$, and the oscillatory term, $z_n^k$,
plotted up to a certain time $t$. After time $t$ we move ahead to time $t+s$ and we want to recalculate the
inner product of the three functions. For the oscillatory and signal components, the function at time
$t+s$ is the same as for time $t$ except that a new section is added at the end (shown in a different
color). But for the damping component, the entire function has to be shifted to the right as a result of
the change in the value of $\kappa$. This means we cannot compute the inner product of the three functions
at time $t+s$ simply by computing the new section and adding the new result to the result from the
existing section. Instead we have to recalculate the entire product.
[Figure 15 panels: Damping, Oscillatory, Signal]
Fortunately, there is another way to calculate the product that requires only a constant-time
operation to update the result each time a new sample is added to the signal. We define an iterative
weighted averaging process that gives the average of a constantly changing sampled value $x[i]$ over
an infinite time interval with increasing weight given to the most recent values:
Let $A[i-1]$ be the weighted average of $x[0] \ldots x[i-1]$, then $A[i]$ is defined by the following
recurrence relation:
$$A[i] = \alpha\, x[i] + (1 - \alpha)\, A[i-1], \quad \alpha \in (0,1)$$
We can better understand the behavior of this function if we expand it as a sum:
$$A[i] = \sum_{k=-\infty}^{i} \alpha\, (1 - \alpha)^{i-k}\, x[k] \qquad (6)$$
Here we see that the weight allotted to the $k$th sample, $x[k]$, diminishes geometrically as $k-i$
decreases. Practically speaking, if $x[i]$ is the most recent sample then $|k-i|$ represents the “age” of
$x[k]$ so we could also say that the weight of the $k$th sample diminishes geometrically as the sample
gets “older”.
FIGURE 15 – TERMS OF THE INNER PRODUCT UP TO TIME $t$ AND EXTENDED TO TIME $t+s$.
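The equivalence between the constant-time recurrence and the expanded weighted sum can be checked on a short sequence. This is a minimal sketch that assumes the samples before the first one are zero (so the infinite sum truncates) and that a sample of age $i-k$ carries weight $\alpha(1-\alpha)^{i-k}$; the function names and test data are ours.

```python
def recursive_average(x, alpha):
    """A[i] = alpha * x[i] + (1 - alpha) * A[i-1], starting from an average of zero."""
    a, out = 0.0, []
    for value in x:
        a = alpha * value + (1 - alpha) * a   # one multiply-add per new sample
        out.append(a)
    return out

def direct_average(x, alpha, i):
    """The same average written as an explicit sum; the weight shrinks with the age i - k."""
    return sum(alpha * (1 - alpha) ** (i - k) * x[k] for k in range(i + 1))

x = [0.3, -1.2, 0.8, 2.0, -0.5, 1.1, 0.0, 0.7]     # arbitrary test samples
alpha = 0.25
recursive = recursive_average(x, alpha)
direct = [direct_average(x, alpha, i) for i in range(len(x))]
```

The direct form costs a full pass over the history for every output, while the recurrence needs only the previous value.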
There is a difficulty with this type of weighted average which becomes apparent when we consider
a simple property that we usually expect of a formula for a measure of central tendency: Normally
we would expect that if $x[i] = c$ for all $i$, then $A[i] = c$. In other words, if $x[i]$ are samples of a
constant value then the weighted average of the samples is equal to the same constant value. We attempt,
by induction, to prove this is true for our formula and thereby demonstrate the trouble.
Proof by induction: Suppose $A[j] = c$ for all $j < i$. Then $A[i] = \alpha c + (1 - \alpha)c = c$. It’s going
well so far… but what about the base case? When we begin the iterative calculation by finding $A[1]$ we do
not have any value for $A[0]$. The obvious solution is to define $A[0] = x[1]$. Then
$A[1] = \alpha x[1] + (1 - \alpha)x[1] = c$ and the proof is complete.
The difficulty in defining the base case for this induction reveals an important property of this
definition of average. Suppose we had defined $A[0] = 0$ instead. That may be a more natural
definition because it doesn’t assume a non-zero average for non-existent previous values of $x[i]$. If
$x[i]$ were samples of an audio signal then the samples before $x[1]$ would more likely be zeros. With
that base case, the induction fails and $A[i] \neq c$. Instead, $\lim_{i \to \infty} A[i] = c$; the estimate
of the average begins badly but becomes increasingly accurate as the influence of the zero value of
$A[0]$ diminishes in time. It is easy to quantify the error of the $i$th sample by taking the difference
between $A_0[i]$, the average computed under the assumption $A[0] = 0$, and $A_{x_1}[i]$, the average
computed when we set $A[0] = x[1]$. The error at the $i$th sample:
$$e[i] = A_{x_1}[i] - A_0[i] = (1 - \alpha)^{i}\, x[1]$$
This is important because it indicates something about the behavior of the method when $x[i]$ is
varying; the more rapidly $x$ is changing, the less likely that $A$ gives an accurate estimate of the recent
values. But the longer $x$ remains constant, the more accurate $A$ becomes. Even more importantly, we
see how the value of $\alpha$ affects the estimate. Notice that $\lim_{\alpha \to 1} A[i] = x[i]$. The value of
$\alpha$ controls the amount of influence that the most recent sample has on the whole average. We could
also say that when $x$ is constant $\alpha$ controls the rate of convergence. Values of $\alpha$ closer to zero
cause the average to converge more slowly and to give more weight to older samples. Larger values have the
opposite effect.
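The effect of the starting value can be seen by running the recurrence on a constant input from the two base cases discussed above. In this sketch the constant, the value of $\alpha$ and the number of steps are arbitrary, and the gap between the two runs is compared against a geometric decay $(1-\alpha)^i c$, which is what the recurrence implies for the difference.

```python
def running_average(x, alpha, a0):
    """Return A[1..len(x)] for A[i] = alpha*x[i] + (1 - alpha)*A[i-1], starting from A[0] = a0."""
    a, out = a0, []
    for value in x:
        a = alpha * value + (1 - alpha) * a
        out.append(a)
    return out

c, alpha = 1.0, 0.1
x = [c] * 60                                     # samples of a constant value
from_zero = running_average(x, alpha, 0.0)       # base case A[0] = 0
from_first = running_average(x, alpha, x[0])     # base case A[0] = x[1]
errors = [b - a for a, b in zip(from_zero, from_first)]
```

A larger $\alpha$ makes `errors` shrink faster, at the cost of an average that follows each new sample more closely.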
We now consider the effect on this average of the sampling rate $f_s$. Suppose $s(t)$ is a continuous
signal and that $x[t]$ is the same signal sampled $f_s$ times per second. Consider equation (6) in the more
practical case where the sum contains finitely many terms:
$$A[i] = \sum_{k=1}^{i} \alpha\, (1 - \alpha)^{i-k}\, x[k]$$
We can vectorize the computation by writing it as an inner product $A[i] = \langle \Omega, X \rangle$, where
$\Omega = \alpha\,[(1-\alpha)^{i-1}, (1-\alpha)^{i-2}, \ldots, (1-\alpha)^{0}]$.
We have mentioned that the parameter $\alpha$ determines the speed with which the average converges to
the value at the current sample. As we will see later, we do not want this to happen too quickly. In
fact, the adjustment of this parameter is critical to controlling the time and frequency resolution of
our spectrogram. But the rate of convergence as controlled by $\alpha$ is set in terms of the number of
samples. That is, given values of $A[0]$, $c$, $\alpha$ and $\epsilon$, there exists a precise number of
samples, $n$, such that the error in $A[i]$ after $x$ has remained constantly equal to $c$ for the last $n$
samples is guaranteed to be less than $\epsilon$. So, if we leave everything else unchanged, but double the
sample rate, then the time for the error to diminish below $\epsilon$ is cut in half. Therefore we should
define $\alpha$ to depend on $f_s$ so that once we set $\alpha$ the time to convergence remains constant
regardless of how often we sample the signal.
Suppose we have experimentally determined a value of $\alpha$, $\alpha_0$, that works well at sampling rate
$f_0$. In order to keep the time for convergence constant we define a universal value
$\alpha = \frac{f_0 \alpha_0}{f_s}$ that gives the same performance at every sample rate.
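A quick calculation suggests how well this scaling holds the convergence time fixed. The sketch below measures the time for the leftover weight of the initial value, $(1-\alpha)^n$, to fall below a tolerance; the reference rate of 44100 Hz, the value $\alpha_0 = 0.01$, the tolerance and the function names are hypothetical choices for the example.

```python
import math

def scaled_alpha(alpha0, f0, fs):
    """alpha = f0 * alpha0 / fs, the sample-rate-dependent value defined in the text."""
    return f0 * alpha0 / fs

def settle_time(alpha, fs, eps):
    """Seconds until the leftover weight (1 - alpha)^n of the initial value is at most eps."""
    n = math.ceil(math.log(eps) / math.log(1.0 - alpha))
    return n / fs

alpha0, f0, eps = 0.01, 44100.0, 0.01            # hypothetical reference values
rates = (22050.0, 44100.0, 88200.0)
times = {fs: settle_time(scaled_alpha(alpha0, f0, fs), fs, eps) for fs in rates}
unscaled = settle_time(alpha0, 88200.0, eps)     # same alpha at double the rate
```

With the scaling, the settle times agree to within about a percent across the three rates; without it, doubling the rate halves the time, as argued above.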
It is interesting to see how the computation of $A[i]$ behaves in the limit as the sample rate
approaches infinity. Replacing $\alpha$ with $\frac{f_0 \alpha_0}{f_s}$, we have the following expression for
the $k$th element of the weight vector $\Omega$ that specifies the weights for the samples on the time
interval $0 < t < 1$:
$$\Omega[k] = \frac{f_0 \alpha_0}{f_s} \left(1 - \frac{f_0 \alpha_0}{f_s}\right)^{f_s - k} \qquad (7)$$
This value represents the weight given to the $k$th sample. Naturally, this number decreases to zero in
the limit as the sample rate goes to infinity. But as the number of elements in $\Omega$ increases, it
becomes an increasingly good approximation of a continuous exponential function. To see why this
is so, we define a function $a(t, f_s)$ that gives the element of $\Omega$ corresponding to the sample taken
at time $t$ and at sample rate $f_s$.
$$a(t, f_s) = \Omega[t f_s] = \frac{f_0 \alpha_0}{f_s} \left(1 - \frac{f_0 \alpha_0}{f_s}\right)^{(1 - t) f_s}$$
If we take the limit as $f_s$ goes to infinity we have
$$\lim_{f_s \to \infty} a(t, f_s) = \left( \lim_{f_s \to \infty} \frac{f_0 \alpha_0}{f_s} \right) e^{f_0 \alpha_0 (t - 1)}$$
We preserved $\frac{f_0 \alpha_0}{f_s}$ on the right hand side of this expression instead of explicitly
writing its limit because its limit is zero. Doing this, we can see that for a sufficiently high sampling
rate, $\Omega$ approximates a continuous real-valued exponential function and that it converges pointwise
to 0.
The quantity $\frac{f_0 \alpha_0}{f_s}$ is inversely proportional to the sample rate but the number of
elements in $\Omega$ is directly proportional. This suggests that the sum of the elements of $\Omega$ remains
roughly the same regardless of any change in sample rate.
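The sum of the weights can in fact be written in closed form. The short derivation below is a sketch under two stated assumptions: we abbreviate $c = f_0 \alpha_0$ and $\alpha = c / f_s$, and we take the $k$th of the $f_s$ samples in one second to carry weight $\alpha (1-\alpha)^{f_s - k}$, so that the newest sample is weighted most.

```latex
% Assumptions: c = f_0 \alpha_0, \alpha = c / f_s, and
% \Omega[k] = \alpha (1 - \alpha)^{f_s - k} for k = 1, ..., f_s.
\begin{align*}
\sum_{k=1}^{f_s} \Omega[k]
  &= \alpha \sum_{m=0}^{f_s - 1} (1 - \alpha)^{m}
   = 1 - (1 - \alpha)^{f_s} \\
  &= 1 - \left(1 - \frac{c}{f_s}\right)^{f_s}
   \;\longrightarrow\; 1 - e^{-c}
   \quad \text{as } f_s \to \infty .
\end{align*}
```

So the total weight on one second of samples depends on the sample rate only through a term that settles to $1 - e^{-f_0 \alpha_0}$, which is consistent with the observation that the sum stays roughly the same.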
Let’s summarize the results of this section in a table:
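| Quantity | Discrete | Continuous |
|---|---|---|
| signal | $x[n]$ | $s(t)$ |
| time | $\frac{k}{f_s}$ | $t$ |
| sinusoidal oscillation | $z_n^k$ | $e^{i 2\pi n t}$ |
| exponential damping | $\left(1 - \frac{f_0 \alpha_0}{f_s}\right)^{i-k}$ | $e^{f_0 \alpha_0 (t - 1)}$ |
| response of damped harmonic oscillator | $\sum_{k=-\infty}^{\kappa} a^{\kappa - k}\, z_n^k\, x[k]$ | $\int_{-\infty}^{t} s(\tau)\, e^{-i\omega_r\tau}\, e^{\Gamma(\tau - t)}\, d\tau$ |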
At the beginning of this section we showed that the expression in the bottom-right hand corner of
this table is the displacement of a damped harmonic oscillator driven by an arbitrary forcing
function. We went on to show that each of the discrete formulas10 on the left side of the table is
equivalent in the limit as sampling rate goes to infinity to the continuous formula on the right.
Based on this evidence, we take the discrete formula in the bottom row to be a good approximation
for the response of a damped harmonic oscillator with a resonant period of $n$ samples to a discrete
sampled forcing function, at the time of the $\kappa$th sample. For efficiency, we implement the
computation of this expression with the iterative weighted-average process described before:
$$A[i] = \alpha\, z_n^i\, x[i] + (1 - \alpha)\, A[i-1], \quad \alpha \in (0,1)$$
10 We do not have references to other papers that use this expression but we will justify its usefulness by showing that it gives the same result as expression (1).
We use an array of these discrete resonator models, one for each of the frequencies we wish to
analyze, and we process this iterative calculation once per sample for each frequency. Usually, we
run the algorithm with about 300 resonators in the array but for higher sampling rates it is
necessary to increase this number in order to get good resolution into the lowest frequencies. In
the next two sections we explain how we use the data from this computation to produce a compact
representation of the input signal.
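A minimal sketch of such an array of discrete resonators follows. It is an illustration under stated assumptions, not the program described in this paper: it takes $z_n = e^{2\pi j / n}$, uses one fixed $\alpha$ for every resonator, and tracks only three hypothetical resonant periods instead of about 300.

```python
import cmath
import math

def resonator_bank(x, periods, alpha):
    """Run A[i] = alpha * z_n^i * x[i] + (1 - alpha) * A[i-1] for each period n (in samples)."""
    state = [0j] * len(periods)
    for i, sample in enumerate(x):
        for r, n in enumerate(periods):
            z_power = cmath.exp(2j * math.pi * i / n)   # z_n^i, with z_n = exp(2*pi*j/n) assumed
            state[r] = alpha * z_power * sample + (1 - alpha) * state[r]
    return [abs(a) for a in state]                      # response magnitude per resonator

periods = [10, 20, 40]                                  # hypothetical resonant periods in samples
x = [math.cos(2 * math.pi * i / 20) for i in range(2000)]   # real sinusoid, period 20 samples
magnitudes = resonator_bank(x, periods, 0.02)
```

Each new sample costs one multiply-add per resonator, so the work per sample grows linearly with the number of frequencies analyzed. For a real-valued sinusoid of unit amplitude the matching resonator settles near one half, not one.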
ANALYTIC INPUT SIGNAL
FIGURE 16 – OUTPUT OF OUR SPECTROGRAM FOR INPUT CONTAINING A SINGLE SINUSOID AT $\omega = 100$
Figure 16 shows the output of our array of discrete oscillators in response to a monophonic
sinusoid. The third axis, receding back into the page, is time. As it begins to process the signal the
program underestimates the amplitude and gives an unclear indication of the frequency but as time
progresses a clear peak rises at $\omega = 100$, indicating the frequency of the input.
Looking from the side at the time axis (figure 17) we can see the peak asymptotically approaching a
limiting amplitude. But there is a significant washboard effect affecting the whole output right from
the beginning of the analysis. The frequency of perturbation is similar to the frequency of
oscillation of the input signal.
FIGURE 17 - A VIEW FROM THE SIDE SHOWS THE WASHBOARD SHAPE OF THE ENTIRE SPECTROGRAM
Let’s consider the cause of this. We measure the amplitude of oscillation based on a time-weighted
average of the inner product of a real valued sinusoid with a complex exponential function. We
always give the heaviest weight to the most recent samples. Consider an exaggerated picture of the
functions we are multiplying:
[Figure 18 panels, top to bottom: Signal, Exponential Oscillator, Damping, Pointwise Product; horizontal axis: time]
Figure 18 shows an input signal (top), the real part of a complex exponential function of the same
frequency, a damping function, and the pointwise product of these three functions (bottom). Since the
damping function puts most of the weight at the end of the time interval, the signal that is zero at
the end produces an inner product with lesser magnitude.
Perhaps the first question that comes to mind is “So what?” We can still see the frequency and the
amplitude of the input from the spectrogram so if the amplitude wobbles a little bit, is that a
problem? The amplitude of the input signal is oscillating in time so isn’t it natural that the
instantaneous energy estimate given by the spectrogram should also oscillate?
We have a confusion relating to what quantity we expect the spectrogram to measure. If we want it
to measure the absolute value of the instantaneous amplitude of the oscillatory component at a
given frequency then we should expect its output to oscillate at twice that frequency. But we are
not interested in the amplitude of oscillation. We really want to measure the total energy in the
vibration at each frequency.
FIGURE 18 - THE POINTWISE PRODUCT ON THE LEFT HAS MUCH LOWER AVERAGE VALUE BECAUSE ITS SIGNAL IS ZERO AT THE END OF THE WINDOW.
Our harmonic oscillator model is designed to give an estimate of the energy in the sound waves that
enter into the ear first by propagation through the air and then through the solid medium of the
bone structures of the inner ear. The motion of sound waves in air is governed by the three-dimensional
wave equation but for simplicity, we consider it in only one dimension11:
$$u_{tt} - c^2 u_{xx} = 0, \qquad c = \sqrt{\frac{K}{\rho}}$$
($K$ is the bulk modulus of the medium and $\rho$ is density)
Kinetic energy varies with the velocity of the air molecules and the potential energy varies with
pressure but the total energy always remains constant.
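$$E_{total} = E_{kinetic} + E_{potential}, \qquad E_{kinetic} = \frac{1}{2} \int \rho \left( \frac{\partial u}{\partial t} \right)^2, \qquad E_{potential} = \frac{1}{2} \int K \left( \frac{\partial u}{\partial x} \right)^2$$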
So the oscillations in the spectrogram occur because we were considering only the kinetic energy of
the sound. If we consider potential energy as well we get a much smoother spectrogram as shown
in the following illustration.
11 We have not yet shown convergence for the formula in the bottom row; since each of the three terms in the discrete product converges to the corresponding term in the continuous expression the proof is trivial.
FIGURE 19 - SAME AS FIGURE 16 BUT WITH QUADRATURE MODEL SIGNAL
It is customary to use complex numbers when representing waves that propagate by transferring
energy between two quantities, as in electromagnetic theory. We can do the same for sound waves
or for our audio signal. If we let $s(t) = \cos(\omega t)$ be the kinetic component of our signal then
$s'(t) = -\sin(\omega t)$ is the potential component. We can combine these together, defining
$$A[s(t)] = A[\cos(|\omega| t)] = \cos(|\omega| t) + j \sin(|\omega| t) = e^{j|\omega| t}$$
and calling this the analytic version of our signal12.
The linear operator $A$ is defined for general signals as follows13:
$$A[s] = s(t) + \frac{j}{\pi} \int \frac{s(\tau)}{t - \tau}\, d\tau$$
12 (Zauderer, p. 197) Our use of the wave equation in this case represents the sound waves in the air, not the vibration of damped harmonic oscillators. Since damping in air is negligible over short distances, we do not include any damping term in the equation.
13 This is true only when our signal is of the form $\cos(\omega t)$.
Clearly, if we knew the appropriate symbolic description of our signal and if the signal contained
sums but not products of real-valued sinusoids then writing the analytic version would be trivial.
Unfortunately, for sampled signals it is not so easy. In the preceding paragraph we have suggested
the idea of writing our signal in terms of cosines and then simply adding an imaginary sine term
corresponding to each one:
$$\text{if } s(t) = \sum_n A_n \cos(\omega_n t + \phi_n) \text{ then } A[s] = \sum_n A_n \left[ \cos(\omega_n t + \phi_n) + j \sin(\omega_n t + \phi_n) \right] = \sum_n A_n\, e^{j(\omega_n t + \phi_n)}$$
This is called the quadrature approximation. There is a distinction between this and the correct
definition of $A$ because the proper analytic signal behaves differently for negative values of $\omega$.
Normally, real valued signals have a frequency spectrum that is symmetric in frequency about 0.
For some applications we prefer to see only positive values for frequency. Making the signal
analytic is a way of complexifying the input to cancel the negative values out of the spectrum. In
our case, we are not trying to force a positive measurement of frequency. There are several
methods for complexifying the signal that produce the smoothing effect we desire.
Since we cannot write the signal in terms of cosine functions until much later in the analysis
process, we do not try to apply the quadrature method as described above. Instead we modify
equation (2) by delaying the signal one fourth of the wavelength of resonance of the harmonic
oscillator:
$$A(t) = \int_{-\infty}^{t} \left[ s(\tau) + j\, s\!\left(\tau - \frac{\pi}{2\omega_r}\right) \right] e^{-j\omega_r\tau}\, e^{\Gamma(\tau - t)}\, d\tau$$
That works effectively when the resonant frequency of the oscillator is near the forcing frequency
but as shown in figure 19, there is still a small amount of ripple effect farther away from the peak.
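The smoothing effect of the quarter-period delay can be seen on a single resonator. The sketch below compares the ripple in the response magnitude for a real sinusoid and for the same sinusoid complexified with a delayed copy of itself. The discrete kernel $e^{-2\pi j i/n}$, the sign of the imaginary part (chosen so that the complexified input approximates $e^{+j\omega t}$ and matches that kernel), the value of $\alpha$ and the function names are assumptions of the example.

```python
import cmath
import math

def track(x, n, alpha, quadrature):
    """Return |A[i]| over time for one resonator with a resonant period of n samples."""
    delay = n // 4                                   # one fourth of the resonant period
    a, mags = 0j, []
    for i, value in enumerate(x):
        if quadrature:
            past = x[i - delay] if i >= delay else 0.0
            value = complex(value, past)             # x[i] + j * x[i - n/4]
        a = alpha * cmath.exp(-2j * math.pi * i / n) * value + (1 - alpha) * a
        mags.append(abs(a))
    return mags

def ripple(mags, n):
    """Peak-to-peak variation of the magnitude over the last resonant period."""
    tail = mags[-n:]
    return max(tail) - min(tail)

n = 40
x = [math.cos(2 * math.pi * i / n) for i in range(4000)]   # input at the resonant frequency
real_ripple = ripple(track(x, n, 0.05, False), n)
quad_ripple = ripple(track(x, n, 0.05, True), n)
```

At the resonant frequency the delayed copy is exactly the quadrature component, so the ripple essentially vanishes; away from resonance the delay no longer equals a quarter of the input period and some ripple should remain.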
We should note that this delay method is not particularly effective for transient signals because the
samples from time $t - \frac{\pi}{2\omega_r}$, a quarter period in the past, may come from a section of the
sound where the amplitude and frequency content is much different from what it is at time $t$. We also
experimented with using a numerical derivative of the signal to approximate the imaginary part but
rejected that approach because it introduced other types of error.
MASKING
We saw in (3) that
$$\frac{A\, e^{i(\omega_f - \omega_r)t}}{\Gamma + i(\omega_f - \omega_r)}$$
is the response at time $t$ of a damped harmonic oscillator of resonant frequency $\omega_r$ to a complex
exponential forcing function of amplitude $A$ and frequency $\omega_f$. The graph of the magnitude of this
function as $\omega_r$ varies over a wide range appears in figure 14. An important feature of this graph is
that it has only one local maximum value and that value occurs at the point where $\omega_r = \omega_f$. For
single frequency input, identifying $\omega_f$ is a trivial matter of locating the maximum of the magnitude
of (3).
FIGURE 20 - SPECTROGRAM OF A MULTI-FREQUENCY SIGNAL
Figure 20 shows the response of our spectrogram to an input that is the sum of sinusoids at three
frequencies. Several important features are visible here. First, the longer wavelengths take
considerably more time to show a peak at the appropriate frequency. The short wavelength part
shows a very sharp peak from early on but it is relatively unstable; the estimate of amplitude is very
unsteady compared to the longer wavelengths. This happens because we make the value of $\alpha$
inversely proportional to the wavelength. Remember that $\alpha$ controls the speed of convergence
for the estimate of both amplitude and frequency. For a complex-valued, single-frequency input
signal, our estimate is quite steady for any value of $\alpha$. But when the input is polyphonic the
interaction between frequencies causes instabilities. By lowering the value of $\alpha$ we can slow down
the convergence, thereby smoothing the signal and producing a more stable result. But this
smoothness comes at a price; slowing the convergence of our spectrogram decreases the accuracy
in the time domain. We mentioned before that for high frequencies, it is possible to do good pitch
detection even on a very short timescale but for low frequencies this is not the case. Since this is
true for humans as well as for our model, it is important for us to adjust $\alpha$ to give good time
resolution for the high frequencies but better stability for low frequencies. Consider the following
illustration:
If we compare the value of the damping function for the most recent sample to the value at the
sample taken one period of oscillation earlier, the difference between the two should be constant
for all frequencies. In other words, the rate of convergence measured in terms of the wavelength of
each damped harmonic oscillator should be constant. In figure 21, the signal on the left has a very
low frequency; therefore the damping function changes slowly. On the right, the frequency is much
higher so we do not want to use a slowly increasing function (red) because it will give bad time
resolution. Instead, we use the one that gives faster convergence (blue).
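One way to realize this rule is to pick $\alpha$ for each resonator so that the damping weight shrinks by the same factor over one resonant period. The sketch below is our interpretation of the paragraph above, not a formula from the program itself: the function name, the parameter `keep_per_period`, and the choice of a fixed ratio (as opposed to a fixed difference) are assumptions.

```python
def alpha_for_period(n, keep_per_period):
    """Choose alpha so that (1 - alpha)**n == keep_per_period for a period of n samples."""
    return 1.0 - keep_per_period ** (1.0 / n)

periods = [10, 100, 1000]                          # hypothetical resonant periods in samples
alphas = [alpha_for_period(n, 0.5) for n in periods]
kept = [(1.0 - a) ** n for a, n in zip(alphas, periods)]   # weight left after one period
```

For small values this gives an $\alpha$ roughly proportional to $1/n$, so short-period (high-frequency) resonators converge quickly and long-period ones converge slowly.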
FIGURE 21 - THE CORRECT DAMPING FUNCTION (BLUE LINE) SHOULD DIMINISH BY THE SAME AMOUNT PER PERIOD OF OSCILLATION, REGARDLESS OF FREQUENCY
Returning now to our discussion of the features of figure 20, we describe how we detect the
frequencies in the signal when the graph contains more than one local maximum. Figure 22 shows
a time slice of the spectrogram for a signal with three frequencies.
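If the frequencies and amplitudes of the components of the signal are known, the graph shown
above could be given by the formula
$$\left\| \sum_n \frac{A_n\, e^{i(\omega_{f_n} - \omega_r)t}}{\Gamma + i(\omega_{f_n} - \omega_r)} \right\| = \left| \sum_n \left( \text{complex-valued response to the sinusoid at the } n\text{th frequency} \right) \right|$$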
So the complex response to a signal that is a sum of sinusoidal functions is just the sum of the
responses for each frequency. Why then does the spectrogram give such an unstable estimate of the
amplitude when the signal is polyphonic? Because the complex amplitude function is a non-linear operator.
So even though the complex response of the damped harmonic oscillators is a linear combination of
the responses to each of the component frequencies in the signal, the amplitude of that complex
number is not.
FIGURE 22 - A TWO-DIMENSIONAL TIMESLICE OF THE SPECTROGRAM FOR A SIGNAL WITH THREE FREQUENCIES
FIGURE 23 - NON-LINEAR EFFECT ON AMPLITUDE IN THE SUM BETWEEN INPUT FREQUENCIES: AN ENLARGEMENT OF THE SECTION BETWEEN THE LEFT AND CENTER MAIN FREQUENCY RESPONSES FROM FIGURE 24 (RIGHT SIDE).
FIGURE 24 - THREE COMPLEX HARMONIC RESPONSE FUNCTIONS (LEFT) AND THEIR SUM (RIGHT). THE Y AND Z AXES ARE THE REAL AND IMAGINARY COMPONENTS OF THE AMPLITUDE. THE X AXIS REPRESENTS THE VARYING RESONANT FREQUENCY OF THE HARMONIC OSCILLATOR.
On the left side of figure 24 we see three complex response functions similar to those that we used
to generate figure 22. But when we take the sum, as shown on the right, the values don’t always
add constructively. Figure 23 shows a zoomed in and stretched out view of the region between the
left and middle large circular regions from the right side of figure 24. The multicolored part of
figure 24 clearly shows that each independent response function in that region (between the red
and blue main spirals) has a relatively large amplitude. But in the sum (purple) there appears to be a
bottleneck there, indicating that the red and blue functions interfere destructively. In the next
bottleneck region to the right, between the blue and green spirals, there appears to be constructive
interference in the sum. If we plot the sum at several time steps we find that the interference
oscillates between destructive and constructive as the relative phases of the spirals change.
If we set our value of $\alpha$ very small so that the rate of convergence is very slow relative to the rate of
oscillation between constructive and destructive interference then we can partly eliminate this
effect. Fortunately, it is not necessary to eliminate it completely.
In figure 25 we see the same three functions from figure 24 plotted in their absolute value along with
their sum. Notice that each colored function, at its peak, stands far above its neighbors. This indicates
that, despite any interference that might go on between peaks, each function dominates the
absolute value of the sum at its own peak. Because of this, the local maxima of the sum are close to
the peaks of the individual terms. If we adjust $\alpha$ to slow down the convergence then the peaks get
more pronounced and the sum becomes an even better approximation at the local maxima. As we
will see later, we can dynamically make adjustments to $\alpha$ so that we do not experience a significant
loss in time resolution when we slow down the convergence.
FIGURE 25 - THREE RESPONSE FUNCTIONS (COLORS) AND THEIR SUM (BLACK)
We turn now to the topic of how we will identify the frequencies in a polyphonic signal. The flowchart
on the right summarizes the process. On the next page is a more detailed chart demonstrating how the
frequency components are identified at each time step.
FIGURE 26
Although our spectrogram updates its simulation of the harmonic oscillators after each new sample
it processes, we found that it works well to record the frequencies only every 500 to 1000 samples.
Longer time between outputs reduces the file storage requirements and reduces the perception of
instability in frequency and amplitude of the output. But it also reduces the accuracy of timing in
response to transient sounds, cutting off the pick sound at the beginning of each note from a guitar,
for example.
Shortening the time interval improves the perceptual quality of the encoding for transients in the
signal but reduces the quality for signals with more stable frequency and amplitude characteristics.
There is a considerable amount of inaccuracy in our estimates of the signal properties. The
frequency-amplitude wobbling effect is the most perceptible artifact. Lengthening the time
between outputs makes the wobble undetectable and greatly increases the perceptual quality of the
output. It is therefore necessary to find a compromise value for the output period that works well
for both transient and sustained signals. We are currently working on a new data format that
allows the output period to be adjusted. The analysis phase of the program already adjusts α
according to the rate of change in average amplitude of the input signal so it is quite natural to use
the same parameter to control the timing of the data recording.
Typically, the number of frequencies detected at each time step is not more than ten. When the
signal contains a densely packed set of frequencies it is difficult for the masking algorithm to get an
accurate estimate. The frequencies create a lot of interference in the spectrogram when they are
close in both frequency and amplitude. As a result, the resonant response curves blend together
giving the appearance of wider peaks. We fit the mask functions as shown in figure 26 using a least-
squares regression in the neighborhood of the peaks so when the peaks widen, each mask covers a
larger portion of the data. The masking level rises very quickly to cover the whole data and the
algorithm misses some of the frequencies. It might seem that this would cause a big loss in
perceptual quality but in fact, as we mentioned in the second chapter, a very similar effect occurs in
human hearing. Our goal, then, is not to detect every frequency component in the signal
but only to detect every perceptible frequency component.
DATA STORAGE
FIGURE 27 - REPRESENTATION OF THE 3D MATRIX USED FOR DATA STORAGE
Our CODEC stores its output in a three-dimensional matrix. The program writes a new page into
the matrix every time it processes the number of samples that we define for the output period.
Figure 27 illustrates three pages of the data file. Each of the time indices $T_n$ corresponds to a page
that records an arbitrary number of frequencies and the amplitude for each one of them.
We can easily calculate the data compression that can be achieved with this structure if we consider
that the CODEC writes one page every 500~1000 samples and that each page contains information
for between 2 and 10 frequencies:
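\[
\text{Data Reduction Factor} = \frac{\text{new data rate}}{\text{old data rate}} = \frac{\dfrac{\text{sampleRate}}{\text{outputPeriod}} \times \text{sampleDepth} \times (2 \times \text{averageFrequencies})}{\text{sampleRate} \times \text{sampleDepth}}
\]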
For typical values this works out as follows:
\[
\text{Data Reduction Factor} = \frac{\dfrac{44{,}000}{500} \times 16 \times 2 \times 4}{44{,}000 \times 16} = 1.6\%
\]
That means that after processing, our CODEC reduces the audio data to 1.6% of its original size.
Compared to standard methods this is a huge improvement. (Typical compression rates for MP3
encoders are near 10%.) We will compare the quality of the sound reproduction in the next
chapter.
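The arithmetic above can be checked with a few lines of MATLAB. This is only a sketch: the function name dataReductionFactor and its argument names are illustrative and are not part of the code in chapter 8. It assumes, as the formula does, that each stored frequency costs two values (a frequency and an amplitude) at the same bit depth as a PCM sample.

```matlab
% Hypothetical helper (not part of the CODEC source): ratio of the
% page-based data rate to the original PCM data rate.
% sampleRate         - PCM samples per second
% outputPeriod       - samples between pages of the data file
% sampleDepth        - bits per stored value
% averageFrequencies - average number of frequencies per page
function factor = dataReductionFactor(sampleRate, outputPeriod, sampleDepth, averageFrequencies)
    % one page per outputPeriod samples, two values per frequency
    newDataRate = (sampleRate/outputPeriod) * sampleDepth * (2*averageFrequencies);
    oldDataRate = sampleRate * sampleDepth;
    factor = newDataRate / oldDataRate;
end
```

Calling dataReductionFactor(44000, 500, 16, 4) reproduces the 1.6% figure. Note that the sample rate and sample depth cancel, so under these assumptions the factor depends only on the ratio 2*averageFrequencies/outputPeriod.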
SYNTHESIS
FIGURE 28 - OUR SPECTROGRAM FOR THE SUM OF TWO CHIRPS; ONE DOWNWARD SWEEPING AND THE OTHER UPWARD SWEEPING
Once we have the data written into the 3D matrix format, reconstructing the original sound is
relatively simple. Suppose our data file shows two frequency components in the signal; at time $t_1$
the signal is estimated to be $s_1(t) = 3\cos(2.4t) + 2\cos(13t)$ and at $t_2$, $s_2(t) = 4\cos(2.5t) + 2.5\cos(12t)$.
Even though the frequencies and amplitudes are not exactly equal between the two time indices it
is reasonable to assume that the low frequency at $t_1$ faded smoothly into the low frequency at $t_2$ and
that the same happened for the high frequency. The time between intervals is so short that the
signal can’t change much between pages in the data file.
As a simplified illustration, consider a monophonic signal:
\[ s_1(t) = 1\cos(2t) \]
and
\[ s_2(t) = 2\cos(2.5t) \]
If we create a continuous function $s_{1,2}(t) = a(t)\cos(t\,\omega(t))$ and let $a(t)$ and $\omega(t)$ be functions that
smoothly fade the amplitude and frequency between $t_1$ and $t_2$, then we can sample it at discrete points
and get the desired effect. The actual implementation is more complex; refer to the code section in
the last chapter for the complete details.
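A minimal sketch of this idea for the monophonic case follows. It rests on several assumptions that are not taken from the implementation in chapter 8: the fades are linear, frequencies are given in radians per sample, and the phase is accumulated sample by sample instead of evaluating $\cos(t\,\omega(t))$ directly, so that the instantaneous frequency equals $\omega$ at every sample. The function name fadeBetweenPages is hypothetical.

```matlab
% Hypothetical sketch: synthesize the samples between two pages of a
% monophonic data file by fading amplitude and frequency linearly.
% a1, w1     - amplitude and angular frequency (radians/sample) at page 1
% a2, w2     - amplitude and angular frequency (radians/sample) at page 2
% numSamples - number of samples between the two pages
function s = fadeBetweenPages(a1, w1, a2, w2, numSamples)
    fade = (0:numSamples-1)/numSamples;   % runs from 0 toward 1
    a = a1 + (a2 - a1)*fade;              % amplitude envelope a(t)
    w = w1 + (w2 - w1)*fade;              % frequency track w(t)
    phase = cumsum(w) - w(1);             % accumulated phase, starting at 0
    s = a.*cos(phase);
end
```

For example, fadeBetweenPages(1, 0.2, 2, 0.25, 500) produces 500 samples that start at amplitude 1 and rise toward amplitude 2 while the frequency glides upward. A polyphonic version would sum one such segment per matched pair of frequencies.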
There is one more major difficulty: we need to figure out how the frequencies move between the
time indices in our data file. Figure 28 shows an example of a signal that is the sum of two tones for
most of the time. One tone sweeps upward in frequency while the other shifts down. They meet in
the center and for a brief instant there is only one tone; then they diverge again. In situations like
this, where the number of frequencies changes between pages, it is important to carefully identify
which frequencies in a given page should fade into frequencies in the next page and which ones
should simply fade out to zero amplitude.
Figure 29 shows three pages of our data file, each containing a different number of frequencies.
Since the pitches in the original sound sample may be bending in both frequency and amplitude we
cannot assume that f1 in the first page corresponds to f1 in the second page.
FIGURE 29 - SOMETIMES THE INTRODUCTION OF NEW TONES MAKES IT DIFFICULT TO DECIDE HOW TO SCALE THE FREQUENCY BETWEEN PAGES OF THE DATA FILE
We match the frequencies between pages $t_1$ and $t_2$ by the following algorithm:
1. Consider the amplitudes of all frequencies on both pages. Identify the strongest one.
2. Choose the frequency in the opposite page that is nearest to the frequency identified in the
previous step.
3. If the two frequencies are within a specified range of each other then they form a pair.
Copy them to the synthesizer, erase them from the data file and continue to the next step. If
they are not close, then the frequency identified in step 1 fades to zero amplitude. Pair it
with a sinusoid of zero amplitude and copy the pair to the synthesizer. Erase it from the
data file. (It is important to remove the frequencies from the data file after they are used to
avoid re-using them in two or more pairs.)
4. If there are still any frequencies left in the data file, return to step 1; otherwise, stop.
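The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the code from chapter 8: each page is taken to be a 2 x N matrix (possibly 2 x 0) with frequencies in row 1 and amplitudes in row 2, maxGap stands in for the "specified range", and a frequency that fades out or in is paired with a zero-amplitude partner at the same frequency. The name matchPages is hypothetical.

```matlab
% Hypothetical sketch of the pairing algorithm.
% page1, page2 - 2 x N matrices: row 1 frequencies, row 2 amplitudes
% maxGap       - largest frequency difference that still forms a pair
% pairs        - one row per pair: [f1 a1 f2 a2]
function pairs = matchPages(page1, page2, maxGap)
    pairs = zeros(0, 4);
    while ~isempty(page1) || ~isempty(page2)
        % step 1: strongest remaining frequency on either page
        [amp1, idx1] = max([page1(2,:), -Inf]);
        [amp2, idx2] = max([page2(2,:), -Inf]);
        fromFirst = amp1 >= amp2;
        if fromFirst
            src = page1; dst = page2; idx = idx1;
        else
            src = page2; dst = page1; idx = idx2;
        end
        f = src(1, idx);
        a = src(2, idx);
        src(:, idx) = [];                     % erase it so it is not re-used
        % step 2: nearest frequency on the opposite page
        [gap, j] = min(abs(dst(1,:) - f));
        % step 3: pair if close enough, otherwise fade to zero amplitude
        if ~isempty(gap) && gap <= maxGap
            partner = dst(:, j)';
            dst(:, j) = [];
        else
            partner = [f, 0];
        end
        if fromFirst
            pairs(end+1, :) = [f, a, partner];
            page1 = src; page2 = dst;
        else
            pairs(end+1, :) = [partner, f, a];
            page1 = dst; page2 = src;
        end
        % step 4: the while condition repeats until both pages are empty
    end
end
```

With page1 = [100 200; 1.0 0.5], page2 = [102 400; 0.9 0.3] and maxGap = 10, the tone near 100 is paired across pages, the tone at 200 fades out, and the tone at 400 fades in.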
6 - PERFORMANCE
QUALITY
It is well known that it is possible to reconstruct a sound from a graphical representation of its
spectrogram (Hentjeens, 1997). But the errors in the reconstructed signal are more noticeable than
with more conventional coding methods. A certain amount of error is unavoidable thanks to the
uncertainty principle and to our unwillingness to accept a spectrogram that outputs negative
energy values or shows sound at times when the input signal did not (Cohen, 1995). Knowing that
this error can be shifted into the frequency or time domain but not eliminated, we have tried to
minimize the perception of error by imitating the distribution of error in human hearing. We
believe that if our CODEC is inaccurate in precisely the same way that our ears are inaccurate, then
we have minimized the perceptual error.
We control the distribution of uncertainty between the time and frequency domains by adjusting
the value of α. Higher values of α increase frequency resolution but cause the sound to be “mushy”;
the beginnings and ends of tones become spread out in time. In order to get the best sound it is
necessary to adjust α differently for each sound or even to adjust it dynamically during the analysis
of a single sound.
There is a problem with this approach. Even if our CODEC succeeds in having the same time and
frequency uncertainty as our ears do, when we listen to the output the error is compounded. This
is easy to see if we consider the time domain error:
FIGURE 30 - COMPOUNDING OF TIME DOMAIN SPREADING ERROR
The first frame of figure 30 shows a sinusoidal input that starts and stops abruptly. Normally,
whenever two sounds occur within 1/17 of a second of each other, our ears hear them as a single sound
(Goldberg, 2003). This is because sounds of this duration approach the limit of our time domain
resolution. The second frame of figure 30 illustrates the effect of the uncertainty in human time
resolution, which smooths the edges of the signal. Our CODEC introduces additional
uncertainty of time resolution, so when human ears listen to the output of our CODEC the
uncertainty is compounded: once from the CODEC and again from the ears of the
listener. The third frame shows how the uncertainties of our CODEC combine with those of the
human senses, resulting in compounded uncertainty in the time domain.
In the frequency domain, uncertainty does not result in smoothing. Instead it results in inaccurate
estimation. We do not have frequency smoothing in our CODEC because we eliminate that in the
masking phase. But the masking transfers the frequency domain uncertainty from a smoothing
effect to a probabilistic effect, increasing the standard deviation of the estimate. Even though
uncertainty in the frequency domain works probabilistically the effect is still compounded; first our
CODEC makes estimation errors in the frequency domain, then the listener’s ears make estimation
errors while listening to the output from our CODEC, and the effects compound to produce a greater
perceptual uncertainty.
HIGH FREQUENCY ESTIMATION ERROR
The justification for our discrete model of damped harmonic oscillators is based on its behavior in
the limit as the sample rate increases to infinity. But what about the behavior as the sample rate
goes the other way? As one might expect, the discrete model is not much like a damped harmonic
oscillator when it operates very close to the Nyquist frequency14.
In practical signal processing, no digital coder operates very well at those frequencies. Recall from
the second chapter that the sampling theorem guarantees perfect reconstruction of the signal up to
half the sampling rate, provided the interpolation between samples is done using the sinc function.
In ordinary digital-to-analog converters, no such interpolating function is used (Goldberg, 2003).
Instead the converter uses a step function, that is, if sample 1 corresponds to an output of .3 mV and
sample 2 corresponds to .4 mV then the output is .3 mV between sample 1 and sample 2. When the
time comes for sample 2 to play back the voltage will increase to .4 mV as quickly as possible. Since
the voltage can’t change instantaneously, the change may be somewhat smoother than a step
function, but it certainly won’t be an exact reconstruction of the input. The following illustration
shows an example of the kind of errors that can occur near the Nyquist frequency when the proper
interpolating function is not used.
14 (Cohen, p. 30)
FIGURE 31 - THE SINC FUNCTION
FIGURE 32 - HIGH FREQUENCY EFFECT OF AN INCORRECT INTERPOLATION FUNCTION
On the left side of figure 32 is a continuous sine wave. The red points show samples taken at a rate
just slightly higher than twice the frequency of the wave. Using linear interpolation between the
samples to reconstruct the signal, we get a “beating” effect in the amplitude of the tone that was not
present in the original. This beating is audible as a tone, sometimes with a pitch that is not
consonant with the continuous signal. We have observed that this effect is quite noticeable even for
pitches more than twenty percent below the Nyquist frequency.
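The beating effect is easy to reproduce. The script below is a small demonstration under assumed values (a 1000 Hz sample rate and a 480 Hz tone, chosen only for illustration): it samples a unit-amplitude sine just below the Nyquist frequency, reconstructs it by linear interpolation with MATLAB's interp1, and measures the peak level in short windows. The original tone has constant amplitude, but the reconstructed envelope swings between roughly half and full scale.

```matlab
% Hypothetical demonstration of amplitude beating near the Nyquist frequency.
fs = 1000;                          % assumed sample rate (Hz)
f  = 480;                           % tone just below fs/2
n  = 0:fs-1;                        % one second of sample indices
x  = sin(2*pi*f*n/fs);              % sampled unit-amplitude sine

% reconstruct on a grid 8 times finer using linear interpolation
fine = 0:1/8:(fs-1);
y = interp1(n, x, fine, 'linear');

% peak level in windows of 5 original samples (40 fine-grid points)
winLen = 40;
numWin = floor(numel(y)/winLen);
env = max(abs(reshape(y(1:numWin*winLen), winLen, numWin)), [], 1);

% env would be constant for a correctly reconstructed sine; here it
% rises and falls periodically, which is the beating described above
beatDepth = max(env) - min(env);
```

Plotting env shows the periodic rise and fall of the envelope that is visible in figure 32.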
A major problem with our implementation of the discrete harmonic oscillator model is that we only
model those oscillators whose resonant period is an integer number of samples. In chapter 3 there
is an illustration showing how the basis functions of Fourier series fit badly with the spacing of
musical notes in the audio spectrum, specifically that it doesn’t have sufficient resolution at low
frequencies. For our model, we could turn that illustration upside-down and get a good
representation of our high-frequency resolution issue.
POSSIBLE SOLUTIONS TO THE HIGH FREQUENCY RESOLUTION DEFICIENCY
There are several ways we could solve this problem. First, we could simply add discrete harmonic
oscillators that have non-integer resonant periods but they would experience the same kind of
“beating” effect shown in figure 32. That isn’t as bad as it might sound because even lossless15
CODECS suffer from the same problem unless they have a special digital-to-analogue converter that
interpolates correctly between samples.
Second, we could interpolate our results in the frequency domain. Figure 33 shows an example of how
insufficient frequency resolution causes errors in the estimation of the peak value
of a response function. Our present implementation would identify the red point as the maximum
value because it is the maximum of the data points. But a simple interpolation would reveal that
the actual peak lies between the first and second highest points on the graph.
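One simple interpolation of this kind, offered only as an assumption about what such a fix might look like, fits a parabola through the highest data point and its two neighbors and reads the maximum off the parabola. The function name interpolatePeak is hypothetical, and the sketch interpolates in index space, so a fractional index would still need to be converted to a resonant period or frequency.

```matlab
% Hypothetical sketch: refine a spectrogram peak by fitting a parabola
% through the largest sample and its two neighbors. Assumes the peak is
% not at either end of the vector and that the three points are not collinear.
% y       - vector of absolute spectrogram values
% peakPos - fractional index of the interpolated maximum
% peakVal - interpolated value at that position
function [peakPos, peakVal] = interpolatePeak(y)
    [~, k] = max(y);
    yl = y(k-1);
    yc = y(k);
    yr = y(k+1);
    % vertex of the parabola through the three points, relative to k
    offset = 0.5*(yl - yr)/(yl - 2*yc + yr);
    peakPos = k + offset;
    peakVal = yc - 0.25*(yl - yr)*offset;
end
```

When the data really are parabolic near the peak the result is exact; for a resonance response curve it is only an approximation, but it always lands between the highest point and its larger neighbor, as the paragraph above describes.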
A third solution would be to interpolate the signal in the time domain before beginning to process
the data. By increasing the sample rate by a factor of two or three we shift our high frequency
errors higher into the inaudible part of the audio spectrum. This is the most plain and simple
solution because it requires absolutely no changes to our implementation. We have tested it and
found that it is very effective except for one problem: it takes longer to run the CODEC when there
are more samples in the signal. The interpolation itself is very fast but if we double the sample rate
we have twice as many samples to process. Unfortunately, doubling the sample rate also doubles
all of our resonant frequencies so we have to add more discrete harmonic oscillators on the low end
to fill the gap. The result is a quadratic increase in processing time. As we will see in the next
section, we may be able to improve that to O ¿ if we thin out our frequency resolution on the low
end. But there is another way to keep the processing time from increasing with the sample rate.
We don’t need the high sample rate for the low frequencies so we might try interpolating the input
signal in the time domain to several sample rates ranging from below the original rate to four or
eight times higher. When processing the signal, we would use the high sample rate versions for the
high frequencies and the lower versions for the low frequencies. That way we neither waste time
processing unnecessarily large amounts of data to estimate low frequencies nor do we sacrifice
frequency resolution at high frequencies.
FIGURE 33 - OUR CODEC INCORRECTLY IDENTIFIES THE PEAK OF A RESPONSE FUNCTION AS A RESULT OF INSUFFICIENT HIGH-FREQUENCY RESOLUTION
15 Actually, our signal is complex so it also contains phase information. But we use cosines here for simplicity of demonstration.
HIGH FREQUENCY ATTENUATION
There is another type of error that appears in the upper part of the spectrum. It is a subtle artifact
of the difference between the discrete and continuous models of damped-harmonic oscillators. It is
a tacit assumption of our model that the response of a harmonic oscillator to a constant input
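should be zero. Set $s(\tau) = 1$ and we can see this is true for the continuous expression:
\[
\int_{-\infty}^{t} s(\tau)\, e^{-i\omega_r \tau} e^{\Gamma(\tau - t)}\, d\tau = \int_{-\infty}^{t} e^{-i\omega_r \tau} e^{\Gamma(\tau - t)}\, d\tau = 0
\]
But for the discrete model we are not so lucky. Define $x[k] = 1$ and our discrete model becomes
\[
\sum_{k=-\infty}^{\kappa} \alpha^{\kappa - k} z_n^{k}\, x[k] = \sum_{k=-\infty}^{\kappa} \alpha^{\kappa - k} z_n^{k}
\]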
This does not simplify but numerical testing shows that it approaches zero in the limit as the
sample rate goes to infinity. Practically, however, it is always non-zero and the error increases as
the resonant frequency of the oscillator approaches the Nyquist frequency. Fortunately there is a
way to control the error. The error is not directly dependent on the frequency of the oscillator; it
depends on α, but α is frequency dependent. The reason for the error is easy to see if we let $\alpha = 1$ and
$x[k] = 1$. Our discrete oscillator response at the $k$th sample becomes simply
\[ z_n^{k} \]
The average puts all its weight on the most recent sample and there is no frequency filtering at all.
This is precisely the behavior that appears in the discrete model as α approaches 1; the high
frequency estimates look less like filtered frequency responses and more like the input signal
multiplied by a complex number.
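The size of this leak can be explored numerically. The sketch below assumes the update rule used by the analysis code in chapter 8, in which the previous state is weighted by exp(-1/avgFactor) and the new sample by one minus that amount; identifying the new-sample weight with α is an assumption made here for illustration. The function name constantInputLeak is hypothetical.

```matlab
% Hypothetical helper: magnitude of one discrete resonator's response
% to the constant input x[k] = 1 after numSamples updates.
% period     - resonant period of the oscillator, in samples
% avgFactor  - averaging factor of the oscillator, in samples
% numSamples - number of updates to run before reading the response
function leak = constantInputLeak(period, avgFactor, numSamples)
    decay = exp(-1/avgFactor);        % weight kept from the previous state
    phaseStep = 2*pi/period;          % phase advance per sample
    r = 0;
    phase = 0;
    for k = 1:numSamples
        % constant input of 1, rotated by the oscillator's current phase
        r = r*decay + exp(1i*phase)*(1 - decay);
        phase = mod(phase + phaseStep, 2*pi);
    end
    leak = abs(r);                    % would be 0 for an ideal oscillator
end
```

Under these assumptions, with the averaging factor set equal to the period, constantInputLeak(2, 2, 5000) is noticeably larger than constantInputLeak(80, 80, 5000), in line with the observation that the error grows as the resonant frequency approaches the Nyquist frequency.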
Our solution is to fix α for frequencies above a certain threshold. Experiments showed that if we
allow α to depend on the resonant frequency of the oscillator up to a resonant period of 80 samples,
the error resulting from the discretization is limited to less than one percent of the amplitude
of the input signal. For oscillators of resonant periods less than 80 samples, we use the same value
of α as the 80 sample resonator instead of making α frequency dependent.
The main disadvantage of doing this is high-frequency attenuation. When we allow α to increase
with the frequency the amplitude of response is nearly linear throughout the spectrum but when
we fix the value for high frequencies we observe rapid attenuation of the amplitude as the resonant
frequency approaches the Nyquist frequency. We tried compensating for that effect by simply
multiplying the high frequency output by a function that cancels the effect of the attenuation. Of
course, the difficulty with this approach is that it also multiplies the error. In fact, since fixing α
doesn’t eliminate error, the signal-to-noise ratio for those high frequencies doesn’t improve at all.
Fixing α for resonant periods below 80 samples fixes the error at one percent but as the amplitude
of the signal decreases, the proportion of the reading that is due to that one percent error increases
in much the same way as the error itself increased when we allowed α to remain frequency
dependent.
One solution that we haven’t yet tried is to postpone the amplitude compensation until after the
masking phase of the algorithm is complete. Since the error is roughly the same for nearby
frequencies, it probably doesn’t affect the estimate of the frequency very much. Once we have
correctly identified the frequency it should be safe to multiply the amplitude to compensate for the
attenuation because we would only be multiplying a single frequency, not a whole range of them.
SPEED
We intended from the outset that every part of our algorithm should be theoretically capable of real
time operation. Since the current implementation is only a proof of concept, we focused our efforts
on developing the ideas, not on optimizing for efficiency. Although there are still many ways to
improve the existing code, it would perhaps be wise to begin any efforts to improve the speed by
rewriting the entire project in C or C++. We have replaced some of the inner loops with vectorized
code, which greatly improves the efficiency in MATLAB but sometimes makes the programming
difficult to decipher.
EFFICIENCY OF THE ANALYSIS PHASE
The analysis phase, that is the discrete model of the array of harmonic oscillators, executes in linear
time relative to the number of samples in the input and also relative to the number of frequencies in
the spectrogram. The current version runs about five times slower than real time. It computes the
response of the whole array of frequencies simultaneously as a vector operation; we expect an
additional efficiency gain of five or ten percent if we do that operation as a matrix computation,
computing the response for all frequencies over a large section of input at once.
Perhaps the most significant improvement we could make with the analysis phase of the program
would be to selectively reduce the number of frequencies in the spectrogram. Right now we include
a frequency corresponding to every period of oscillation that is an integer number of samples. For
high frequencies we need this much resolution and more. But lower down, the perceptual
difference between a tone with a period of 300 samples and one with a period of 301 is hardly
noticeable; we could probably compensate for the loss if we discarded a few frequencies simply by
interpolating between values on the spectrogram.
EFFICIENCY OF THE MASKING PHASE
Masking is by far the slowest operation because its dominant inner loop contains the least squares
regression that fits response functions to the peaks of the spectrogram data. We had originally
tried less computationally expensive ways of estimating the parameters of those mask functions,
with moderate results, but as time for the project was running short, we decided to use regression
so that we could observe the maximum perceptual fidelity of which the system is capable. There
are many more efficient ways to estimate the mask functions. Finding a good substitute for
regression is the best first step to speeding up the masking phase of the encoder.
EFFICIENCY OF THE SYNTHESIS PHASE
The synthesis phase is already much faster than real time. Even for longer input files it completes
its processing almost immediately. Typically for CODECS of musical application it is advantageous
if the synthesis phase is faster than the analysis phase because we usually play back audio samples
many more times than we record them.
DATA RATE
We have mentioned already in chapter 5 that our program compresses data down to one or two
percent of original size. While that is an impressive compression ratio, it’s difficult to compare it
with other CODECS unless we consider the quality as well. It’s of no use getting high compression
rates if the reconstructed signal is too damaged to be useful. Since we are still working on
improving the quality of the compression, we cannot yet brag about the small file size.
7 - FUTURE RESEARCH POSSIBILITIES
PARALLELIZATION
Each damped-harmonic oscillator in the analysis phase operates independently of its neighbors.
Therefore parallelization is a natural next step toward improving the speed. Our spectrogram could
easily be separated to run in separate threads on multi-processor desktop computers, in graphics
processing hardware, or on a collection of small microcontrollers.
FEATURE RECOGNITION AND TRANSFORMATION
Feature recognition methods for voice recognition or security biometrics are often based on linear
procedures like principal component analysis. Sometimes the abilities of linear methods are
limited by the non-linear structure of the feature space in which they operate. If we begin by
applying a non-linear transformation such as the one of our coder, then apply statistical or linear
algebraic techniques to the results, it may open the door to new possibilities or enhance the
performance of existing methods.
Ours is the only audio format we are aware of for which pitch shifts of arbitrary amounts can be
accomplished by scalar multiplication. Consider a single page of our data file:
If we call this page $T_n$ we can double the amplitude and raise the pitch by one octave simply by
multiplying by the scalar number 2. With vector multiplication we could do something more
interesting. The operation $T_n \cdot \{1, 2^{3/12}, 2^{4/12}, 1, \tfrac{1}{2}, \tfrac{1}{2}\}$ corresponds to raising the first two harmonics
above the fundamental by a minor third and a major third16 respectively, and reducing them to half
their amplitude. This kind of transformation could be very powerful in music synthesis applications.
To do the same thing with time domain output of a Fourier transform would be a challenge to say
the least.
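In MATLAB both transformations are one-line operations. The page below is a made-up example: the layout (frequencies in Hz in row 1, amplitudes in row 2) and the three values in each row are assumptions chosen for illustration, not data from our files.

```matlab
% Hypothetical page: row 1 holds frequencies in Hz, row 2 amplitudes.
page = [220 440 660; 1.0 0.5 0.25];

% Scalar multiplication: raise every pitch by one octave and double
% every amplitude.
octaveUp = 2*page;

% Element-wise multiplication: leave the fundamental alone, raise the
% next two harmonics by a minor third (2^(3/12)) and a major third
% (2^(4/12)), and halve their amplitudes.
transform = [1 2^(3/12) 2^(4/12); 1 1/2 1/2];
shifted = page.*transform;
```

Because each entry of the page is an explicit frequency or amplitude, no resampling or windowing is needed; the change takes effect when the page is passed to the synthesis phase.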
MUSICAL TRANSCRIPTION
16 The Nyquist frequency is one half of the sampling rate.
This CODEC does not do polyphonic pitch detection in the musical sense because it estimates only
the harmonic frequencies in the sound; it doesn’t try to guess which frequencies are fundamental
and which are higher harmonics. For polyphonic musical transcription the next step is to try to find
harmonic series in the output and identify the fundamental pitches. By applying statistical methods
to this task we might also be able to improve the quality of the original estimate of the frequencies
because we expect the various harmonics of a single vibrating instrument to show strong
correlation in both amplitude and frequency.
PSYCHOACOUSTICS
We mentioned before that the absolute value of the sum of resonance-response functions is a non-
linear effect that becomes more linear as the value of α decreases, resulting in better frequency
resolution at the expense of slower time response. There is debate among researchers of
psychoacoustics about how much of the human ability to discern differences in pitch is a result of
the filtering of the cochlea and how much depends on logical processing in the brain. Since our
method operates under the assumption that the non-linearity effects are small and it completely
ignores all information about the phase of the signal, it is a demonstration of the quality of
perception that is possible based on only the physical filtering effect. This is interesting for two
reasons. First, the fact that our CODEC reproduces the signal in a recognizable fashion suggests that
it would be possible for biological hearing organs to do reasonably accurate recognition of sounds
without requiring any additional phase-dependent signal processing in the brain. Second, we could
try fitting our mask curves in complex space to see how much it would improve the quality of our
output. That might give some indication of what role the brain plays in low-level audio signal
processing.
8 - CODE
ORGANIZATIONAL OVERVIEW
Our code operates in three phases: analysis, masking, and synthesis. Analysis refers to the
spectrogram. Masking and synthesis are as described in the implementation chapter. All code is
written for MATLAB.
ANALYSIS FUNCTIONS
MAIN FILE: ANALYSISTEST.M
%data - the input sound PCM
%outputPeriod - number of samples between outputs
%damping - the averaging factor
%shortestPeriod - the number of samples in the shortest period
%numFrequencies - self explanatory
%avgFctrWvlngthLmt - the shortest frequency for which the average
%factor is frequency dependent
%
%output format:
%row 1 - samples per period
%row 2 - energy
%row 3 - corresponding index from the original PCM data input
function out = testAnalysis8(data, outputPeriod, damping, shortestPeriod, numFrequencies, avgFctrWvlngthLmt)
    out = [];
    %for debugging:
    averageFactorAdjustments = [];

    %open a status bar window
    sbar1=statusbar('Analyzing...');

    %prepare some necessary vectors
    wavelengths = shortestPeriod:(numFrequencies+shortestPeriod-1);
    oneStepPhase = (2*pi)./wavelengths;
    currentPhase = zeros(1,numFrequencies);
    %avgFactors is scaled according to the change in input energy.
    %"baseAvgFactors" is the initial value. We retain a copy of the
    %original instead of using it directly so that we avoid numerical
    %errors that would be introduced by constantly scaling it.
    baseAvgFactors(1:avgFctrWvlngthLmt-shortestPeriod) = avgFctrWvlngthLmt;
    baseAvgFactors(avgFctrWvlngthLmt-shortestPeriod+1:numFrequencies) = wavelengths(avgFctrWvlngthLmt-shortestPeriod+1:numFrequencies);
    avgFactors = baseAvgFactors*damping;
    quarterPeriods = round(wavelengths/4);
    firstDataIndex = quarterPeriods(end) + 1;
    dataIndices(1,1:numFrequencies) = firstDataIndex;
    dataIndices(2,1:numFrequencies) = firstDataIndex - quarterPeriods;
    % resonators - row 1, samples per period
    %            - row 2, avg energy
    resonators(1,:) = wavelengths;
    resonators(2,:) = zeros(1,numFrequencies);

    %the algorithm:
    while (dataIndices(1,1) <= length(data))
        inputVector = data(dataIndices(1,:))' - i*data(dataIndices(2,:))';
        prevValues = resonators(2,:);
        %thisEnergy = inputVector.*exp(i*currentPhase)./wavelengths;
        thisEnergy = inputVector.*exp(i*currentPhase);
        resonators(2,:) = prevValues.*exp(-1./avgFactors) + thisEnergy.*(1-exp(-1./avgFactors));
        dataIndices = dataIndices + 1;
        currentPhase = mod(oneStepPhase + currentPhase,2*pi);
        %every 'outputPeriod' entries, update the status bar and break if it has been
        %closed. Also output the frequency/energy data at this time. Also
        %update the averageFactor
        if mod(dataIndices(1,1), outputPeriod) == 0
            progress = dataIndices(1,1) / length(data);
            if isempty(statusbar(progress,sbar1))
                break;
            end
            out(:, :,end+1) = resonators([1 2],:);
            %transience is a measure of how quickly the input amplitude is
            %changing. We consider the change over a period of time that
            %depends on the average factor at the longest wavelength.
            if((dataIndices(1,1) > baseAvgFactors(end)*8+8) && (dataIndices(1,1) < (length(data) - baseAvgFactors(end)*8-8))) %don't adjust at the beginning or the end
                %compute the transience in a window of size 4 times the
                %averageFactor at the longest wavelength
                transience = energyRateOfChange(data,dataIndices(1,1),16*round(baseAvgFactors(end)));
                averageFactorAdjustment = 10*(1 - transience);
                %this is to reduce the average factors when the signal is
                %very transient.
                avgFactors = baseAvgFactors*damping*averageFactorAdjustment;
                averageFactorAdjustments(end+1) = averageFactorAdjustment;
            end
        end
    end

    %close the status bar
    if ishandle(sbar1)
        delete(sbar1);
    end
end
AUXILIARY FILE: ENERGYRATEOFCHANGE.M
%This function estimates the change in avg energy over a small range of
%samples in the input data by fitting a polynomial to the square of the input and
%estimating the rate of change based on the coefficients of that polynomial.

%the meaning of the output of this function is difficult to interpret when
%indx is at the beginning or end of the data so we leave it to the function
%that calls this one to make sure the array indices don't go out of bounds.

%data - ordinary pcm data
%indx - the index of the input data where the analysis filter is
%processing.
%windowSize - the number of samples to consider
function out = energyRateOfChange(data, indx, windowSize)
    halfWindow = floor(windowSize/2);
    wholeWindow = 1 + 2*halfWindow;
    t = 1:wholeWindow;
    p = polyfit(t',data(indx-halfWindow:indx+halfWindow).^2,2);
    %we estimate the rate of change over the window by multiplying the
    %quadratic term by windowsize squared and the linear term by the
    %windowsize. (The idea is inspired by Taylor series expansions)
    out = 2.8*(abs(p(1))*(wholeWindow/2)^2 + abs(p(2))*(wholeWindow/2));
end
MASKING FUNCTIONS
ORGANIZATIONAL FILE: MASKTEST3.M
This file doesn’t compute anything. It just organizes several pieces to work in sequence. Although
it belongs to the masking section of the code it also runs the synthesis functions and writes an
output. Since the synthesis is very fast it was more convenient to have it run every time the
masking finished so that its output could be analyzed to verify that the masking functions worked
properly.
%open a status bar window
sbar2=statusbar('Masking...');

maskedData = [];
for sampleIndex = 1:length(analyzedData(1,1,:))
    %usage: applyMask(data, threshold)
    maskedData(:, :, sampleIndex) = applyMask(analyzedData(:,:,sampleIndex),.80);
    %update the status bar and break if the status bar is closed.
    progress = sampleIndex / length(analyzedData(1,1,:));
    if isempty(statusbar(progress,sbar2))
        break;
    end
end

%close the status bar
if ishandle(sbar2)
    delete(sbar2);
end

compressedData = compressZeros(maskedData);

%reconstructedData = testSynthesis3(maskedData, 100, 20, 350);
reconstructedData = testSynthesis4(compressedData, 1000,.1);

%plot(reconstructedData);

wavwrite(scaleAudio(reconstructedData,1), 44000, '~/matlab/soundoutput/test.wav');
MAIN FILE: APPLYMASK.M
%data should be a two dimensional array of dimensions (2 x numFrequencies)
%where each ordered pair contains 1 - samples per period and 2 - amplitude.
%
%see "aliasAmplitude.m" for an explanation of alpha and ampFactor
%threshold - the fraction of the total signal that must be represented in
%the output. A value of .9 means that the masker will continue adding
%frequencies to the output until 90 percent of the original data is below
%the level of the mask.
function out = applyMask(data,threshold)
    %keep everything in column-major order
    data = data';
    maskLevels = zeros(length(data(:, 1)),1);
    %wavelengths
    out(1,:) = data(:, 1);
    %masked output values
    out(2,:) = zeros(1, length(data(:, 1)));

    dataNorm = norm(data(:,2));
    maskedData = abs(data(:,2));
    %after the transpose, amplitudes are in column 2 and wavelengths in column 1
    [strongestFreqAmp strongestFreqIdx] = max(abs(data(:,2)));
    strongestFreqWavelength = data(strongestFreqIdx,1);
    initialGuessAlpha = 100;

    windowSize = 35;
    while (dataNorm ~= 0 && norm(maskedData)/dataNorm > (1-threshold))
        [incorrectStrongestFreqAmp strongestFreqIdx] = max(maskedData);
        strongestFreqWavelength = data(strongestFreqIdx,1);
        alpha = estimateAlpha([data(:,1) maskedData],windowSize,strongestFreqIdx,strongestFreqWavelength,initialGuessAlpha);
        %for debugging
        global ALPHAARRAY
        ALPHAARRAY(end+1)=alpha;

        initialGuessAlpha = alpha;
        strongestFreqAmp = abs(data(strongestFreqIdx,2)) - maskLevels(strongestFreqIdx);
        out(2,strongestFreqIdx) = data(strongestFreqIdx,2);
        maskLevels = maskLevels + singleMaskCurve(strongestFreqAmp, data(strongestFreqIdx,1), data(:,1), alpha);
        maskedData = chop(abs(data(:,2)) - maskLevels);
    end
end
AUXILIARY FILE 1: ESTIMATEALPHA.M
This is the first step toward fitting a resonance response curve to one of the peaks in the data. This
file cuts a section of the data near a peak and sends it to the next function for least squares fitting.
% as usual, the first column of 'data' is the wavelengths, the second column is
% for amplitudes.
function out = estimateAlpha(data,windowSize,centerFrequencyIndx, centerWavelength, initialGuessAlpha)
    if mod(windowSize,2) == 0
        windowSize = windowSize-1;
    end

    halfWindow = (windowSize-1)/2;
    beginWindow = centerFrequencyIndx - halfWindow;
    endWindow = centerFrequencyIndx + halfWindow;
    dataLength = length(data(:,1));

    if beginWindow < 1
        windowSize = endWindow; %if we have to shorten the window
        beginWindow = 1;
    end

    if endWindow > dataLength
        windowSize = windowSize - (endWindow - dataLength);
        endWindow = dataLength;
    end

    %note: beginWindow and endWindow are computed above but are not applied
    %here; the full 'data' array is passed to fitMaskCurve.
    out = fitMaskCurve(data(centerFrequencyIndx,2),data(centerFrequencyIndx,1),data, initialGuessAlpha);
end
AUXILIARY FILE 2: FITMASKCURVE.M
This file works together with the previous one to find an estimate of the parameters for the mask
curve that best fits the data near a peak. This is the function that does the least squares regression.
% data - 1st column: wavelengths, 2nd column: amplitudes
function out = fitMaskCurve(maskAmplitude,maskWavelength,data,initialGuessAlpha)
    model = @maskfun;
    out = fminsearch(model, initialGuessAlpha, optimset('TolX',10));

    function sse = maskfun(a)
        FittedCurve = zeros(length(data(:,1)),1);
        for idx = 1:length(data(:,1))
            FittedCurve(idx) = aliasAmplitude3(maskAmplitude, maskWavelength, data(idx,1),a);
        end
        ErrorVector = FittedCurve - data(:,2);
        sse = norm(ErrorVector);
    end
end
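The fitting step can be tried in isolation. The following is a minimal sketch, not the code used in this project: it replaces aliasAmplitude3() with a simplified curve that omits the wavelength-dependent scaling of alpha, and it fits synthetic data generated from a known alpha. The names wavelengths, trueAlpha, curve, and objective are made up for this illustration, and the tolerance is tighter than the TolX of 10 used above.

```matlab
% Synthetic test data: a mask curve with a known alpha (hypothetical values).
wavelengths    = (20:60)';   % samples per period
maskWavelength = 40;
maskAmplitude  = 1;
trueAlpha      = 150;

% Simplified stand-in for aliasAmplitude3 (no wavelength-dependent scaling).
curve = @(alpha) maskAmplitude ./ sqrt(1 + alpha.^2 .* (1/maskWavelength - 1./wavelengths).^2);
observed = curve(trueAlpha);

% Same objective as maskfun: the norm of the residual vector.
objective = @(alpha) norm(curve(alpha) - observed);

% Start from the same initial guess that applyMask uses (100).
fittedAlpha = fminsearch(objective, 100, optimset('TolX', 1e-3));
```

Because the synthetic data contain no noise, fittedAlpha should land close to trueAlpha. With real spectra the residual never reaches zero, which is presumably why a loose tolerance was acceptable in the original function.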
AUXILIARY FILE 3: ALIASAMPLITUDE3.M
This function generates a resonance response curve, given appropriate parameters. Basically, it
implements equation (4).
%given a masker of period 'maskPeriod' with amplitude 'maskAmplitude'
%this function predicts the amplitude of aliasing onto a resonator with
%period 'aliasPeriod'.

%"gamma is a parameter that depends on the amount of damping in an
%oscillator. It may also be called the 'linewidth' of the resonator."
%Higher values of gamma indicate wider spread of aliasing.

function out = aliasAmplitude3(maskAmplitude, maskWavelength, aliasWavelength,alpha)
    %rf = resonant frequency
    %ff = forcing frequency
    %fa = forcing amplitude
    fa = abs(maskAmplitude);
    %avoid division by 0
    if aliasWavelength == 0 || maskWavelength == 0
        out = 0;
        return;
    end
    rf = 1 / aliasWavelength;
    ff = 1 / maskWavelength;

    %we scale the input down by dividing by the wavelength for wavelengths >
    %80. We account for this by doing the same thing to the mask generated by
    %this function.
    avgFctrWvlngthLmt = 80;
    if (rf > 1/avgFctrWvlngthLmt)
        aliasRatio = ff*avgFctrWvlngthLmt;
    else
        aliasRatio = ff/rf;
    end

    wavelengthDependentAlpha = alpha*aliasRatio;
    %old formula
    %out = sqrt(fa * (SEAdjustedGamma / ( (ff - rf).^2 + SEAdjustedGamma.^2)));
    %new formula
    out = fa / sqrt(1+(wavelengthDependentAlpha^2)*(ff-rf)^2);
end
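Read directly from the listing above, the curve that aliasAmplitude3() evaluates can be written out as follows. The symbols are shorthand introduced here, not notation taken from the body of the paper: $a$ stands for abs(maskAmplitude), $\lambda_m$ for maskWavelength, $\lambda_r$ for aliasWavelength, $L = 80$ for avgFctrWvlngthLmt, and $\alpha_{\mathrm{eff}}$ for wavelengthDependentAlpha.

```latex
\[
A(\lambda_r) =
\frac{a}{\sqrt{1 + \alpha_{\mathrm{eff}}^{2}
      \left(\dfrac{1}{\lambda_m} - \dfrac{1}{\lambda_r}\right)^{2}}},
\qquad
\alpha_{\mathrm{eff}} =
\begin{cases}
\alpha \, L / \lambda_m,         & \lambda_r < L, \\[4pt]
\alpha \, \lambda_r / \lambda_m, & \lambda_r \ge L.
\end{cases}
\]
```

The curve peaks at $A = a$ when $\lambda_r = \lambda_m$ and falls off on either side. In this form a larger $\alpha$ gives a narrower curve. The function returns 0 if either wavelength is 0.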
AUXILIARY FILE 4: SINGLEMASKCURVE.M
Since aliasAmplitude3() computes the resonance response curve at only a single point, functions
that need the curve for every element of a large vector are simplified by calling
singleMaskCurve() instead.
%'wavelengthList' should be the list of wavelengths that are recorded in
%the data file we are processing.
%
%'gamma', 'smoothingError', and 'ampFactor' are as explained in aliasAmplitude.m
function out = singleMaskCurve(maskAmplitude, maskWavelength,wavelengthList, alpha)
    %preallocate the output as a column vector
    out = zeros(length(wavelengthList),1);
    for idx = 1:length(wavelengthList)
        out(idx,1) = aliasAmplitude3(maskAmplitude, maskWavelength, wavelengthList(idx), alpha);
    end
end
SYNTHESIS FUNCTIONS
MAIN FILE: TESTSYNTHESIS.M
%version 4 - a complete revision where synthesis is done over a small
%finite number of frequencies. The algorithm follows each frequency
%through the time-frequency-amplitude space and smoothly transitions over
%both frequency and amplitude between samples.

%'data' should be in the format of compressZeros.
%'outputPeriod' is equivalent to the 'outputPeriod' argument from
%'testAnalysis()'.
function out = testSynthesis4(data, outputPeriod, freqChangeTolerance)
    [layers numFreqs numSamples] = size(data);
    out = zeros(1,numSamples*outputPeriod);
    columnOnes = ones(outputPeriod,1); %this will be used later to sum columns of the matrix containing oscillations at several frequencies
    prevPhase = zeros(numFreqs,2);

    for timeIdx = 2:numSamples
        %the following arrays will store the information about how the various
        %frequencies connect with each other.
        amps1 = zeros(numFreqs,1);
        amps2 = zeros(numFreqs,1);
        wavelengths1 = zeros(numFreqs,1);
        wavelengths2 = zeros(numFreqs,1);
        ampFreqIdx = 1;
        [foo prevNZIndices prevNZAmplitudes] = find(data(2,:,timeIdx-1));
        [foo currentNZIndices currentNZAmplitudes] = find(data(2,:,timeIdx));
        numNZamplitudes = length(currentNZAmplitudes) + length(prevNZAmplitudes);
        currentNZData = data(:,currentNZIndices,timeIdx);
        prevNZData = data(:,prevNZIndices,timeIdx-1);
        if isempty(currentNZData)
            currentNZData = zeros(2,length(prevNZData));
        end
        if isempty(prevNZData)
            prevNZData = zeros(2,length(currentNZData));
        end
        while (numNZamplitudes > 0)
            [currentMax freqIdxMax] = max(currentNZData(2,:));
            [previousMax prevFreqIdxMax] = max(prevNZData(2,:));
            if abs(currentMax) > abs(previousMax)
                amps2(ampFreqIdx) = currentNZData(2,freqIdxMax);
                wavelengths2(ampFreqIdx) = currentNZData(1,freqIdxMax);
                [wavelengths1(ampFreqIdx) amps1(ampFreqIdx) nearestIdx] = findNearest(currentNZData(:,freqIdxMax),prevNZData,freqChangeTolerance);
                currentNZData(:,freqIdxMax) = 0;
                numNZamplitudes = numNZamplitudes - 1;
                if abs(amps1(ampFreqIdx)) > 0 && nearestIdx > 0
                    numNZamplitudes = numNZamplitudes - 1;
                    prevNZData(:,nearestIdx) = 0;
                end
            else %currentMax <= previousMax
                amps1(ampFreqIdx) = prevNZData(2,prevFreqIdxMax);
                wavelengths1(ampFreqIdx) = prevNZData(1,prevFreqIdxMax);
                [wavelengths2(ampFreqIdx) amps2(ampFreqIdx) nearestIdx] = findNearest(prevNZData(:,prevFreqIdxMax),currentNZData,freqChangeTolerance);
                prevNZData(:,prevFreqIdxMax) = 0;
                numNZamplitudes = numNZamplitudes - 1;
                if abs(amps2(ampFreqIdx)) > 0 && nearestIdx > 0
                    numNZamplitudes = numNZamplitudes - 1;
                    currentNZData(:,nearestIdx) = 0;
                end
            end
            ampFreqIdx = ampFreqIdx + 1;
        end
        exponentMtrx = zeros(numFreqs,outputPeriod);
        currentPhase = [0 0];
        for freqIdx = 1:numFreqs
            if wavelengths1(freqIdx) ~= 0
                currentPhase(freqIdx,1) = wavelengths2(freqIdx);
                beginPhase = prevPhase(find(prevPhase(:,1) == wavelengths1(freqIdx)),2);
                if isempty(beginPhase)
                    beginPhase = 0;
                end
                %if the same phase appears twice, delete the first one
                %from the list after using it
                if length(beginPhase) > 1
                    doublePhaseIndices = find(prevPhase(:,1) == wavelengths1(freqIdx));
                    prevPhase(doublePhaseIndices(1),:) = 0;
                end
                [exponentMtrx(freqIdx,:) currentPhase(freqIdx,2)]=continuousFadeExps(amps1(freqIdx),amps2(freqIdx),wavelengths1(freqIdx),wavelengths2(freqIdx),outputPeriod,beginPhase(1));
            end
        end
        prevPhase = currentPhase;

        %delete any zero rows from the matrix
        exponentMtrx(~any(exponentMtrx,2),:)=[];

        [expMtrxHeight expMtrxWidth] = size(exponentMtrx);
        rowOnes = ones(1,expMtrxHeight);
        %this is elementwise exponential, not matrix exponential
        out( ((timeIdx-2)*outputPeriod + 1):((timeIdx-1)*outputPeriod)) = rowOnes*real(exp(exponentMtrx));
    end
end
AUXILIARY FILE 1: FINDNEAREST.M
%searches through WAArray to locate the (wavelength,amplitude)
%pair nearest to the one defined by singleWA. (WA stands for Wavelength-Amplitude)
%
%if the ratio of the nearest frequency to the frequency of singleWA is far
%from 1 then it will not connect singleWA to any other frequency. Instead
%it tapers the amplitude to zero.
%
%freqChangeTolerance controls the maximum fraction of frequency change
%allowed in one timestep. (So a value of .1 means the frequency can change
%+ or - 10% at each timestep.)
function [nearestWavelength nearestAmp nearestIdx] = findNearest(singleWA,WAArray, freqChangeTolerance)
    singleWavelength = singleWA(1);
    singleAmplitude = singleWA(2);
    [two WAALength] = size(WAArray);
    nearestIdx = 0;
    nearestDistance = 1e9;
    %if we can't find a better alternative, the amplitude should go to zero
    %without changing frequency.
    nearestWavelength = singleWavelength;
    nearestAmp = 0;
    for indx = 1:WAALength
        thisWavelength = WAArray(1,indx);
        thisAmplitude = abs(WAArray(2,indx));
        if thisAmplitude > 0
            thisDistance = abs(thisWavelength - singleWavelength);
            if thisDistance < nearestDistance
                nearestIdx = indx;
                nearestDistance = thisDistance;
            elseif thisDistance == nearestDistance && abs(abs(WAArray(2,nearestIdx)) - abs(singleAmplitude)) > abs(abs(singleAmplitude)-thisAmplitude)
                nearestIdx = indx;
                nearestDistance = thisDistance;
            end
        end
    end
    %We arbitrarily assume that the largest reasonable frequency shift for
    %a single outputPeriod is +-10% of the wavelength. Therefore if the
    %nearest frequency is farther than that, we send the amplitude to zero
    %instead.
    if nearestIdx ~= 0 && WAArray(1,nearestIdx) ~= 0 && singleWavelength~= 0 && abs(1 - abs(singleWavelength)/abs(WAArray(1,nearestIdx))) < (freqChangeTolerance)
        nearestWavelength = WAArray(1,nearestIdx);
        nearestAmp = WAArray(2,nearestIdx);
    else
        nearestIdx = 0;
    end
end
AUXILIARY FILE 2: CONTINUOUSFADEEXPS.M
An efficient and numerically stable way of computing the samples of a continuous sinusoid function
that fades smoothly in amplitude and frequency between $a_1\cos(\omega_1 t)$ and $a_2\cos(\omega_2 t)$ is to compute
the array of real and imaginary arguments to exponential functions that would produce a similar
result, then exponentiate every element in the array and take the real component of the result. This
function generates the array of arguments for the exponential function. The exponentiation is done
at the end of testSynthesis.m.
%taking exp(exps) should produce the analytic signal fading smoothly in
%both frequency and amplitude from (amp1,wavelength1) to (amp2,wavelength2)
%over a window of 'samplePeriod' samples.
function [exps endPhase] = continuousFadeExps(a1,a2,wavelength1,wavelength2,samplePeriod,beginPhase)
    amp1 = abs(a1);
    amp2 = abs(a2);
    freq1 = 2*pi*(1/wavelength1);
    freq2 = 2*pi*(1/wavelength2);

    %We compute the oscillating part of the phase. If freq1 and freq2 are
    %equal, this is simply an array containing a simple arithmetic sequence
    %where the common difference is freq1. In the more usual case the
    %array contains the partial sums of a geometric series where freq1 is
    %the scaling factor and (freq2/freq1)^(1/samplePeriod) is the common
    %ratio.
    %
    %for scaleFactor a and ratio r, the sum of the first n terms of the
    %geometric series that begins with a*r^0 is a*((1-r^n)/(1-r)). We
    %vectorize this computation to save time:
    a = freq1;
    if (freq1 ~= freq2)
        r = (freq2/freq1)^(1/samplePeriod);
        rToTheNPower = r.^[0:samplePeriod-1];
        oscillation = a*(1-rToTheNPower)/(1-r);
        endPhase = mod(beginPhase + a*(1-r^samplePeriod)/(1-r),2*pi); %output
    else % freq1 == freq2
        oscillation = freq1*(0:samplePeriod-1);
        endPhase = mod(beginPhase + a*samplePeriod,2*pi); %output
    end

    %now we compute the imaginary part of the exponents
    imaginaryComponent = beginPhase + oscillation;

    %we have not yet done anything about the amplitude. We will use the
    %identity a*exp(i*t) = exp(i*t + ln(a)) to allow us to control the
    %amplitude of the signal also from the exponents.
    ampStep = (amp2 - amp1)/samplePeriod;
    ampArray = amp1 + ampStep*[0:samplePeriod-1];
    %replace zeros with 1e-15 so that the log function won't complain
    ampArray(~any(ampArray,1)) = 1e-15;
    realComponent = reallog(ampArray);

    %output
    exps = realComponent + i*imaginaryComponent;
end
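The arithmetic in the listing above can be summarized in a few lines. The notation is introduced here only for this summary: $N$ is samplePeriod, $\omega_1 = 2\pi/\text{wavelength1}$ and $\omega_2 = 2\pi/\text{wavelength2}$ are the per-sample angular frequencies, $\varphi_0$ is beginPhase, and $n = 0, \dots, N-1$ indexes the output samples.

```latex
\[
r = \left(\frac{\omega_2}{\omega_1}\right)^{1/N},
\qquad
\varphi_n = \varphi_0 + \omega_1 \sum_{k=0}^{n-1} r^{k}
          = \varphi_0 + \omega_1 \,\frac{1 - r^{n}}{1 - r},
\qquad
A_n = |a_1| + n\,\frac{|a_2| - |a_1|}{N},
\]
\[
\text{exps}_n = \ln A_n + i\,\varphi_n,
\qquad
\operatorname{Re}\, e^{\text{exps}_n} = A_n \cos \varphi_n,
\qquad
\text{endPhase} = \left(\varphi_0 + \omega_1 \,\frac{1 - r^{N}}{1 - r}\right) \bmod 2\pi .
\]
```

The phase increment at step $k$ is $\omega_1 r^{k}$, which starts at $\omega_1$ and reaches $\omega_2$ at $k = N$, so the frequency glides geometrically while the amplitude moves linearly. When $\omega_1 = \omega_2$ the sum reduces to $\varphi_n = \varphi_0 + n\,\omega_1$.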
AUXILIARY FILE 3: CHOP.M
% a quick function to replace negative values of a (real valued) matrix with zeros
function out = chop(a)
    out = (a + abs(a))/2;
end
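As a quick check of the identity behind chop(), the short script below applies the same expression to a small matrix and compares it with max(a, 0). The matrix values are arbitrary and chosen only for illustration.

```matlab
% Arbitrary real-valued test matrix with negative, zero, and positive entries.
a = [-2 0 3; 1.5 -0.5 0];

% Same expression as the body of chop(a).
chopped = (a + abs(a))/2;

% Negative entries become zero; non-negative entries are unchanged.
sameAsMax = isequal(chopped, max(a, 0));
```

For a negative entry, a and abs(a) cancel, and for a non-negative entry they add to 2*a, so the halved sum matches max(a, 0) element by element.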
BIBLIOGRAPHY
Cohen, L. (1995). Time-Frequency Analysis. Prentice-Hall PTR.
Brown, D. (n.d.). BlueMax. Retrieved from BlueMax Universal Internet Calculator: http://www.bluemax.net/techtips/JavaJungleJuice/MotherofAllDownloadCalculators/MotherOfAllDownloadCalculators.htm
Edwards, C. H. (2005). Differential Equations and Linear Algebra. Upper Saddle River, NJ: Pearson Education, Inc.
Fletcher, H. (1940, January). Auditory Patterns. Rev. Mod. Phys. , 47-55.
Bosi, M., & Goldberg, R. E. (2003). Introduction to Digital Audio Coding and Standards. Boston: Kluwer Academic Publishers.
Gulick, W. L., Gescheider, G. A., & Frisina, R. D. (1989). Hearing - Physiological Acoustics, Neural Coding, and Psychoacoustics. New York: Oxford University Press.
Hentjeens, G. (1997). Speech Synthesis from a Spectrogram. Retrieved from Penn Engineering: http://www.ese.upenn.edu/sunfest/pastProjects/presentations97/Gavin97/sld001.htm
Meyer, C. D. (2000). Matrix Analysis and Applied Linear Algebra. Philadelphia: SIAM.
Red Book (audio CD standard). (n.d.). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/CD_Audio
Schottstaedt, B. (n.d.). An Introduction to FM. Retrieved from Center for Computer Research in Musical Acoustics: http://ccrma.stanford.edu/software/snd/snd/fm.html
Stanford University News Service. (1994). Music synthesis approaches sound quality of real instruments. Retrieved from Stanford News: http://www.stanford.edu/dept/news/pr/94/940607Arc4222.html
Von Bekesy, G. (1960). Experiments in Hearing. New York: McGraw-Hill.
Zauderer, E. (1989). Partial Differential Equations of Applied Mathematics. New York: John Wiley & Sons, Inc.