
Shlomo Dubnov
Music Department
University of California at San Diego
9500 Gilman Drive
La Jolla, California 92037-0326 USA
[email protected]

Computer Music Journal, 30:2, pp. 63–83, Summer 2006.
© 2006 Massachusetts Institute of Technology.

Spectral Anticipations


This article deals with relations between randomness and structure in audio and musical sounds. Randomness, in the casual sense, refers to something that has an element of variation or surprise in it, whereas structure refers to something more predictable, rule-based, or even deterministic. When dealing with noise, which is the "purest" type of randomness, one usually adopts the canonical physical or engineering definition of noise as a signal with a white spectrum (i.e., composed of equal or almost-equal energies in all frequencies). This seems to imply that noise is a complex phenomenon simply because it contains many frequency components. (Mathematically, to qualify as a random or stochastic process, the density of the frequency components must be such that the signal would have a continuous spectrum, whereas periodic components would be spectral lines or delta functions.)

In contradiction to this reasoning stands the fact that, to our perception, noise is a rather simple signal, and in terms of its musical use, it does not allow much structural manipulation or organization. Musical notes or other repeating or periodic acoustic components in music are closer to being deterministic and could be considered as "structure." However, complex musical signals, such as polyphonic or orchestral music that contain simultaneous contributions from multiple instrumental sources, often have a spectrum so dense that it seems to approach a noise-like spectrum. In such situations, the structure of the signal cannot be revealed by looking at the signal spectrum alone. Therefore, the physical definition of noise as a signal with a smooth or approximately continuous spectrum seems to obscure other significant properties of signals versus noise, such as whether a given signal has temporal structure—in other words, whether the signal can be predicted.

The article presents a novel approach to (automatic) analysis of music based on an "anticipation profile." This approach considers dynamic properties of signals in terms of an anticipation property, which is shown to be significant for discrimination of noise versus structure and the characterization and analysis of complex signals. Mathematically, this is formalized in terms of measuring the reduction in the uncertainty about the signal that is achieved when anticipations are formed by the listener.

Considering anticipation as a characteristic of a signal involves a different approach to the analysis of audio and musical signals. In our approach, the signal is no longer characterized according to its features or descriptors alone; its characterization also takes into account an observer that operates intelligently on the signal, so that both the information source (the signal) and a listener (information sink) are included as parts of one model. This approach fits very well into an information-theoretic framework, where it becomes a characterization of a communication process over a time-channel between the music (present time of the acoustic signal) and a listener (a system that has memory and prediction capabilities based on a signal's past). The amount of structure is equated to the amount of information that is "transmitted" or "transferred" from a signal's past into the present, which depends both on the nature of the signal and the nature of the listening system.

This formulation introduces several important advantages. First, it resolves certain paradoxes related to the use of information-theoretic concepts in music. Specifically, it corrects the naïve equation of entropy (or uncertainty) with the amount of "interest" present in the signal. Instead, we are considering the relative reduction in uncertainty caused by prediction/anticipation. Additionally, the measure has the desired "inverted-U function" behavior that characterizes both signals that are nearly constant and signals that are highly random as signals that have little structure. In both cases, the reduction in uncertainty owing to prediction is small. In the first case, this is caused by the small amount of variation in the signal to start with, whereas in the second case, there is little reduction in uncertainty, because prediction has little or no effect. Finally, this formulation also clarifies the difference between determinism and predictability. Signals that have deterministic dynamics (such as certain chaotic signals) might have varying degrees of predictability, depending on the precision of the measurement or exact knowledge of their past. A listener, or any practical prediction system, must have some uncertainty about the signal's past, and this uncertainty grows when trying to predict the future, even in the case of a deterministic system.

The ideas in the article are presented starting from a background of information theory and basis decomposition, followed by the idea of a vector approach to the anticipation measurement, and concluding with some higher-level reflections on its meaning for complex musical signals. The presentation takes several approaches: an engineering or signal-processing approach, in which notions of structure are related to basic questions about signal representation; an audio information-retrieval approach, in which the ideas are applied to the characterization of natural sounds; and a musical analysis approach, in which time-varying analysis of musical recordings is used to create an "anticipation profile" that describes their structures. This last aspect has interesting possibilities for exploring relationships among signal measurements and human cognitive judgments of music, such as emotional force.

The main contribution of our model is in the vector approach, which generalizes notions of anticipation for the case of complex, multi-component musical signals. This measure uses concepts from principal components analysis (PCA) and independent components analysis (ICA) to find a suitable geometric representation of a monaural audio recording in a higher-dimensional feature space. This pre-processing creates a representation that allows estimating the anticipation of a complex signal from the sum of anticipations of its individual components. Accordingly, the term complex signal used throughout this article describes the multi-component nature of a single-channel recording and should not be confused with multiple-channel recordings.

Our Model

Our model of signal structure considers musical or audio material as an information source, which is communicated over time to a listener (either human or machine). The amount of information transmitted through this communication process depends on the nature of the signal and the listener. The listener makes predictions and forms expectations; the music source generates new samples. Accordingly, we define structure to be the aspect of musical material that the listener can predict, and noise or randomness as what the listener would consider as an unpredictable flow of data.

Information, Entropy, Mutual Information, and Information Rate

When a sequence of symbols, such as text, music, images, or even genetic codes, is considered from an information theoretic point of view, it is assumed that the specific data is the result of production by an information source. The concept of the information source allows description of many different sequences by a single statistical model with a given probability distribution. The idea of looking at a source, rather than a particular sequence, characterizes which types of sequences are probable and which are not—in other words, what data is more likely to appear. Information theory also teaches that, "in the long run," some sequences become typical of the source, while others might turn out to be so improbable or so rare that they would, in practice, never occur.

The logarithm of the relative size of the typical set (the log of the number of typical sequences divided by the log of the number of all possible sequences of the same length) is called entropy, and it is considered the characteristic amount of uncertainty inherent to the source. The larger the typical set, the higher the uncertainty, and vice versa. For instance, a biased coin that falls mostly on "heads" will produce sequences whose empirical average approaches the statistical mean, and which comprise only a small fraction of all possible sequences of "heads" and "tails." In such a case, the proportion of the size of the typical set (i.e., a set comprising mostly "heads") relative to the number of all possible sequences (the number two, raised to the power of the number of coin flips we performed) is low, and the source will be considered to have little entropy or little uncertainty. More equally distributed sources (such as multiple flips of a fair coin) will have a larger typical set and high uncertainty.
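As a small numeric illustration of this point (ours, not from the article), the following Python sketch computes the per-flip entropy of a biased coin and the approximate fraction of all length-n sequences that are typical; that fraction, roughly 2^(n(H−1)), shrinks rapidly as the coin becomes more biased.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits per flip of a coin that lands heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

n = 100  # number of coin flips
for p in (0.5, 0.9):
    h = binary_entropy(p)
    # The typical set holds roughly 2**(n*h) of the 2**n possible sequences,
    # i.e., a fraction of about 2**(n*(h - 1)) of them.
    print(f"p = {p}: H = {h:.3f} bits/flip, typical fraction ~ 2^{n * (h - 1):.0f}")
```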

When information theory is used to describe a single information source, the concept of entropy has a direct relation to the coding size or compression lower bounds for that source. What is more interesting for our case is the aspect of information theory that deals with information channels, or the relationship between the information source and the information sink. In a physical communication channel, the source could be the voice of a person on one end of a phone line, the sink would be the voice emerging from the other end, and the channel noise would be the actual distortion caused to the source voice during the transmission process. Both signals are ideally similar, but not identical. The uncertainty about what was transmitted, remaining after we have received the signal, is characteristic of the channel. This notion is mathematically described using mutual information, which is defined as the difference between the entropy of the source and the conditional entropy between the source and the sink. If we consider the source and sink as two random variables x and y, mutual information describes the cross-section between their entropies, as shown in Figure 1.

Denote by H(x) = −∑P(x) log P(x) the entropy of variable x (a source that has probability P(x)). This is depicted by the left (gray) circle in Figure 1; H(y) is the right (dotted) circle of variable y; and H(x,y) depicts the outer contour of both sources together. The conditional entropy H(x|y) depicts the uncertainty about x if y is known (and vice versa for H(y|x)). The function I(x,y) is the cross-section (grid) that indicates how much overlap in uncertainty occurs between x and y, that is, how much information one variable carries about the other. In extreme cases, if the two sources are independent, the two circles will have no overlap, and I(x,y) = 0, accordingly. If x and y are exactly equivalent (i.e., knowing x is equivalent to knowing y), the two circles will completely overlap each other, and I(x,y) = H(x) = H(y) = H(x,y). The conditional entropy in such a case is zero, because knowledge of one variable completely describes the other, leaving no conditional uncertainty (i.e., H(x|y) = H(y|x) = 0). Mathematically, the above relations have simple algebraic expressions that are directly related to the marginal distributions P(x), P(y), and their joint distribution P(x,y):

I(x, y) = H(x) − H(x|y) = H(y) − H(y|x)
        = H(x) + H(y) − H(x, y) = ∑_{x,y} P(x, y) log [P(x, y) / (P(x) P(y))]    (1)
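The following short Python sketch (our illustration, not code from the article) evaluates these quantities for a small discrete joint distribution; the joint table pxy is a made-up example of a noisy channel in which y copies x most of the time.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H = -sum p log2 p (in bits) of a probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(pxy):
    """I(x, y) = H(x) + H(y) - H(x, y) for a joint probability table pxy."""
    px = pxy.sum(axis=1)  # marginal distribution of x
    py = pxy.sum(axis=0)  # marginal distribution of y
    return entropy(px) + entropy(py) - entropy(pxy)

# Hypothetical channel: y equals x 90 percent of the time.
pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])
print(mutual_information(pxy))  # about 0.53 bits: the overlap of the two circles in Figure 1
```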

Information Rate and Anticipation as "Capacity" of a "Time-Channel"

As we explained, mutual information is commonly used for theoretical characterization of the amount of information transmitted over a communication channel (see Figure 2). For our purposes, we will consider a particular type of channel that is different from a standard communication model in two important aspects. First, the channel is a time channel and not a physical transmission channel. The input to the channel is the history of the signal up to the current point in time, and the output is its next (present) sample. Second, the receiver must apply some algorithms to predict the current sample from its past samples.



Figure 1. Entropies H(·) for separate variables x and y and the pair (x,y), conditional entropies H(x|y) and H(y|x), and mutual information I(x,y).


The information at the sink y consists of the signal history x1, x2, . . . , xn–1 available to the receiver prior to its receiving or hearing xn. The transmission process over a noisy channel now has the interpretation of prediction/anticipation performance over time. The amount of mutual information depends on the "surprise" that the time-channel introduces to the next sample versus the ability of the listener to predict this surprise.

This notion of information transmission over a time-channel is captured by the information rate (IR, also called the scalar-IR). This is defined as the relative reduction of uncertainty of the present when considering the past, which equals the amount of mutual information carried between the past xpast = {x1, x2, . . . , xn–1} and the present xn. It can be shown, using appropriate definitions of information for multiple variables (called multi-information), that the information rate equals the difference between the multi-information contained in the variables x1, x2, . . . , xn and x1, x2, . . . , xn–1 (i.e., the amount of additional information that is added when one more sample of the process is observed):

ρ(x_1, x_2, . . . , x_n) = H(x_n) − H(x_n | x_past) = I(x_past, x_n)
                        = I(x_1, x_2, . . . , x_n) − I(x_1, x_2, . . . , x_{n−1})    (2)

One can interpret IR as the amount of information that a signal carries into its future. This is quantitatively measured by the number of bits that are needed to describe the next event once anticipation or prediction based on the past has occurred. Let us now discuss the significance of IR for the delineation of structure and noise for several example cases.
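As a concrete worked example (ours, not taken from the article), consider a Gaussian first-order autoregressive process x_n = a·x_{n−1} + ε_n with ε_n ~ N(0, σ²) and |a| < 1. Its marginal entropy is H(x) = ½ ln(2πe σ²/(1 − a²)), whereas the uncertainty remaining after prediction from the past is H(x_n | x_past) = ½ ln(2πe σ²), so the information rate is ρ(x) = −½ ln(1 − a²). As a → 0 the process approaches white noise and ρ → 0; as |a| → 1 the past explains more and more of the present and ρ grows, consistent with the spectral-flatness relation introduced in Example Case 5 below (for this process, SFM = 1 − a²).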

Example Case 1: "Inverted-U Function" for the Amount of Structure

A purely random signal cannot be predicted and thus has the same uncertainty before and after prediction, resulting in zero IR. An almost constant signal, on the other hand, has a small uncertainty H(x), resulting in an overall small IR.


High-IR signals have large differences between their uncertainty without prediction, H(x), and the remaining uncertainty after prediction, H(x | past of x). As mentioned in the Introduction, one of the advantages of IR is that in the "IR sense," both constant and completely random signals carry little information.

Example Case 2: The Relative “Meaning” of Noise

Let us consider a situation in which two systems attempt to form expectations about the same data. One system has a correct model, which allows good predictions for the next symbol. In the case when the uncertainty about the signal is large but the remaining uncertainty after prediction is small, the system manages to reveal the signal's structure and achieves its information-processing task, resulting in high IR.

Let us consider a second system that does not have the capability of making correct predictions. In such a case, the quality of prediction, which can be measured by the number of bits needed to code the next symbol, remains almost equal to the number of bits required for coding the original signal without prediction. The discrepancy between the two coding lengths is then close to zero, and no information reduction was achieved by the system, resulting in a low IR.

Example Case 3: Signal Characterization

The previous discussion suggests that the IR is dependent on the nature of the information-processing system as well as the type of signal. Only in the case of an ideal system that has access to complete signal statistics does the IR become a characterization of the signal alone, independent of the information-processing system.

Example Case 4: Determinism Versus Predictability

An interesting application of IR is the characterization of chaotic processes. Considering, for instance, a logistic map x(n + 1) = ax(n)(1 – x(n)), it is evident that knowledge of x(n) provides complete information for prediction of x(n + 1).


Figure 2. Information source, channel, and receiver.


A closer look at the problem reveals that if the precision of measuring x(n) is limited, the measurement error increases with time. For a = 4 (chaos), the error approximately doubles every step, increasing the uncertainty by a factor of two, which amounts to a loss of one bit of information. This example shows that even complete knowledge of the system dynamics might not suffice to make perfect predictions, and that IR is dependent on the nature of the measurement as well.
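A minimal Python sketch (ours, not from the article) makes the point numerically: two orbits of the a = 4 logistic map that start within 10⁻¹⁰ of each other drift apart by roughly a factor of two per iteration, so about one bit of predictive information is lost at each step.

```python
import numpy as np

def logistic_orbit(x0, a=4.0, steps=40):
    """Iterate the logistic map x(n + 1) = a * x(n) * (1 - x(n))."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1] * (1.0 - xs[-1]))
    return np.array(xs)

exact = logistic_orbit(0.3)
measured = logistic_orbit(0.3 + 1e-10)  # same dynamics, tiny measurement error
error = np.abs(exact - measured)
print(error)  # grows roughly twofold per step until it saturates at the attractor size
```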

Example Case 5: Equivalence of Scalar-IR to Spectral Flatness Measure

The spectral flatness measure (SFM; Jayant and Noll 1984) is a well-known method for evaluating the "distance" of a process from white noise (see Appendix). It is also widely used as a measure of the "compressibility" of a process, as well as an important feature for sound classification (Allamanche et al. 2001). Given a signal with power spectrum S(ω), the SFM is defined as

SFM = exp[ (1/2π) ∫_{−π}^{π} ln S(ω) dω ] / [ (1/2π) ∫_{−π}^{π} S(ω) dω ]    (3)

and it is positive and less than or equal to one. It has a value of unity for a white-noise signal.

For large n, IR equals the difference between the marginal entropy and the entropy rate of the sequence or signal x(n): ρ(x) = lim_{n→∞} ρ(x_1, . . . , x_n) = H(x) – H_r(x), where the entropy rate is the limit of the conditional entropy for large n, H_r(x) = lim_{n→∞} H(x_n | x_1, x_2, . . . , x_{n–1}). Using expressions for the entropy and entropy rate of a Gaussian process, one arrives at the following relation (Dubnov 2004):

SFM(x) = exp(−2ρ(x))    (4)

Equivalently, one can express IR as a function of SFM:

ρ(x) = −(1/2) log(SFM(x))    (5)
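As a rough sketch of how Equations 3 and 5 can be estimated in practice (our code, with arbitrary parameter choices, not the article's implementation), the spectral flatness can be computed from a Welch power-spectral-density estimate as the ratio of its geometric to its arithmetic mean, and the scalar-IR then follows as −½ log SFM:

```python
import numpy as np
from scipy.signal import welch

def spectral_flatness(x, nperseg=128):
    """SFM as the geometric over the arithmetic mean of the Welch PSD (Equation 3)."""
    _, psd = welch(x, nperseg=min(nperseg, len(x)))
    psd = psd[psd > 0]                      # guard against log(0)
    return np.exp(np.mean(np.log(psd))) / np.mean(psd)

def scalar_ir(x, nperseg=128):
    """Scalar information rate (in nats) via Equation 5: IR = -1/2 * log(SFM)."""
    return -0.5 * np.log(spectral_flatness(x, nperseg))

rng = np.random.default_rng(0)
noise = rng.standard_normal(2**16)          # white noise: SFM near 1, IR near 0
a, ar1 = 0.9, np.zeros(2**16)               # AR(1) process: theoretical SFM = 1 - a**2
for n in range(1, ar1.size):
    ar1[n] = a * ar1[n - 1] + noise[n]
print("white-noise IR:", scalar_ir(noise))
print("AR(1) IR:", scalar_ir(ar1), " theory:", -0.5 * np.log(1 - a**2))
```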

Figure 3 shows the close relationship between spectral flatness and the IR measure for a nature recording from a jungle.


The figure shows the IR measure plotted on top of the signal spectrogram. (The values of IR were scaled so that they display conveniently on top of the spectrogram.) It can be seen that signal segments that contain a flat spectrum (either silences or bursts of high-bandwidth noise) correspond to low IR, and segments that contain harmonic or narrow-band noise have higher IR. It should be noted that this example contains sounds of mostly pitched bird singing alternating with noisy bird cries, which is a "segmentation" type of processing that can be performed using scalar-IR. In the next example, we will introduce the concept of vector-IR and consider a complex situation of simultaneously mixed periodic and noisy signals.

Extension of IR for Multivariate (Vector) Process

In the case of complex signals, such as those comprising several components, or described by sequences of spectral descriptors or sequences of feature vectors, we must consider a new type of IR that is appropriate for dealing with a sequence of multiple variables described as vectors in a higher-dimensional space. We will discuss such representations in the context of audio basis and geometric signal representation in the next section.

Using capital-letter notation for vector variables, we denote a sequence of vectors by X_1, X_2, . . . , X_L and generalize the IR definition (to be called vector-IR) as

ρ(X_1, X_2, . . . , X_L) = I(X_1, X_2, . . . , X_L) − { I(X_1, X_2, . . . , X_{L−1}) + I(X_L) }    (6)

The new definition of the multivariate information rate represents the difference between the information over L consecutive vectors and the sum of the information in the first L–1 vectors plus the multi-information between the components within the last vector X_L. Let us assume that some transformation T exists, such that S = TX and the components S_1, S_2, . . . , S_L after transformation are statistically independent. Using relations between the entropies of a linear transformation of random vectors, it can be shown (Dubnov 2003) that the IR may be calculated as a sum of the IRs of the individual components s_i(n), i = 1, . . . , n:


ρ(X_1, X_2, . . . , X_L) = Σ_{i=1}^{n} ρ(s_i(1), . . . , s_i(L))    (7)

The vector-IR generalization allows identification of the structural elements of the signal as the components with high scalar-IR.

Example: Vector-IR Analysis of Mixed Sinusoidal and Noise Components

Given a sinusoidal signal s(t) = sin(ωt + ϕ) and a white-noise signal n(t), we consider the case of a mixed sinusoidal and noise signal x(t) = s(t) + n(t). For this signal, we may find separate signal and noise spaces using singular value decomposition (SVD), representing the signal as a sequence of frames containing consecutive signal samples:


X_1 = [x_1, x_2, . . . , x_n]^T,  X_2 = [x_{n+1}, x_{n+2}, . . . , x_{2n}]^T,  . . . ,  X_i = [x_{(i−1)n+1}, x_{(i−1)n+2}, . . . , x_{in}]^T    (8)

Concatenating L signal segments into a matrix X = [X_1 X_2 . . . X_L], the SVD provides a decomposition X = UΛV^T, which gives the n basis vectors in the columns of the U matrix (orthogonal n-dimensional vectors), a diagonal matrix Λ that gives the n variances of the coefficients, and the first n rows of the matrix V^T that give the normalized expansion coefficients in this space.
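The frame-stacking and SVD steps can be sketched in a few lines of Python (our illustration; the frame length, sampling rate, and mixing level are arbitrary assumptions rather than the article's settings):

```python
import numpy as np

def frame_matrix(x, n):
    """Stack consecutive length-n frames of x as the columns X_1, X_2, ... (Equation 8)."""
    n_frames = len(x) // n
    return x[: n_frames * n].reshape(n_frames, n).T   # shape (n, L)

fs, f0 = 8000, 440.0
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * f0 * t) + 0.5 * rng.standard_normal(t.size)   # sine + white noise

X = frame_matrix(x, n=64)
U, svals, Vt = np.linalg.svd(X, full_matrices=False)
# Columns of U are the basis vectors: the two largest singular values are expected to
# correspond to the quadrature pair spanning the sinusoid (cf. Figure 4), and the rows
# of Vt hold the expansion coefficients submitted to scalar-IR analysis (cf. Figure 5).
print(svals[:4])
```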

Figure 4 shows the results of applying SVD to the sinusoidal and noise signal. The first two basis vectors are the two quadrature sinusoidal components. The remaining basis vectors are the noise components. Figure 5 shows the corresponding expansion coefficients.

We apply scalar-IR analysis to the first n rows of the matrix V^T using the SFM method of scalar-IR estimation.


Figure 3. Signal-IR plotted on top of the signal spectrogram. See text for more detail.


The sum of the scalar-IRs yields the vector-IR. These estimates were compared to the scalar-IR of the individual sinusoidal and noise signals and their sum signal. In Table 1, we show in parentheses the value of SFM for these signals, which is bounded between zero and one, making it somewhat easier to consider these values as a measure of the amount of structure. To read the SFM values, one should recall that SFM values close to one indicate noise and values near zero indicate structure. In the case of vector-IR, the SFM value is actually the generalized SFM, which is the result of summing the individual scalar-IRs and transforming the resulting vector-IR back to SFM. We use the relationship

generalized-SFM = exp(−2 · vector-IR)    (9)

which also equals the product of the individual column SFMs. So, in principle, the presence of a strong structural element (i.e., an SFM close to zero) makes the generalized SFM small, indicating structure. This depends, of course, on the ability of the SVD to separate the structural components. (In practice, the SVD of the sines+noise signal contained some errors in the structural components that it found, resulting in a higher generalized SFM than the theoretical product of the SFMs of the individual sinusoid and noise signals.)

These results show that the amount of structure revealed by vector-IR (or generalized SFM) for the sines+noise signal is significantly higher (and the SFM lower) than the structure estimated by scalar-IR on the sum signal. It should also be noted that vector-IR estimation of noise components is imperfect, resulting in nonzero values. In general, there seems to be a tradeoff between the precision of estimation of structured and noise components using the SFM procedure. Using low-resolution spectral analysis in SFM estimation allows better averaging and improved estimation of the noise spectrum. This comes at the expense of poorer estimation of peaked or spectrally structured components. Using a high-resolution spectral estimate causes better resolution of the spectral peaks, but creates a bias towards structured results.
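A minimal sketch of how the per-component values combine (our code; the two scalar-IR inputs below are simply the sinusoid and noise values reported in Table 1):

```python
import numpy as np

def vector_ir(component_irs):
    """Vector-IR as the sum of the scalar-IRs of the decorrelated components (Equation 7)."""
    return float(np.sum(component_irs))

def generalized_sfm(component_irs):
    """Generalized SFM = exp(-2 * vector-IR) (Equation 9), i.e., the product
    of the per-component SFM values."""
    return float(np.exp(-2.0 * vector_ir(component_irs)))

print(generalized_sfm([6.31, 8.7e-5]))  # ~3e-6: one strongly structured component
                                        # is enough to make the product small
```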


Figure 4. Four basis vectors (shown as rows) obtained by SVD.

Figure 5. Expansion coefficients that correspond to the basis vectors of Figure 4.

Table 1. Scalar-IR and Vector-IR Values for the Example Sinusoidal Component, Noise Component, and Combined Sines+Noise Signal

                  Sinusoid           Noise               Sines+Noise
Scalar-IR (SFM)   6.31 (3 × 10⁻⁶)    8.7 × 10⁻⁵ (0.99)   0.16 (0.72)
Vector-IR (SFM)   8.18 (7 × 10⁻⁸)    0.21 (0.657)        2.44 (0.007)


Audio Basis Representation

A common representation of audio results from transforming the time signal into the spectral domain using the Short-Time Fourier Transform (STFT). This processing is performed by applying the Fourier transform to blocks of audio samples using a sliding window that extracts short signal segments (also called frames) from the audio stream. It should be noted that each frame could be mathematically considered as a vector in a high-dimensional space. This approach can be generalized using additional types of transforms, such as various types of filter banks, cepstral analysis, or auditory models, giving different types of features that might be better adapted to the specific processing task.

Spectral Audio Basis

When considering audio representations, a balanced tradeoff between reducing the dimensionality of data and retaining maximum information content must be achieved. For these reasons, various researchers and standards (Casey 2001; Lewicki 2002; Kim, Moreau, and Sikora 2004) have proposed feature-extraction methods based on the projection of the transformed frames of the spectrum into a low-dimensional representation using carefully selected basis functions. A scheme of such a feature-extraction system is described in Figure 6. To assure independence between the transform coefficients for the purpose of vector-IR analysis, additional statistical procedures must be applied to the different transform coefficients (or, in the case of the STFT, different frequency channels).

Although Fourier channels are asymptotically independent, short-term statistics of different spectral channels may have significant cross-correlations. This dependence can be effectively removed using various methods that are described in the following section. Another common representation of the spectral contents of audio signals is by means of cepstral coefficients (Oppenheim and Schafer 1989). The cepstrum is defined as the inverse Fourier transform of the logarithm of the absolute value of the Fourier transform of the signal, C = F⁻¹{log(|F{x(n)}|)}. One of the great advantages of the cepstrum is its ability to capture different details of the signal spectrum in one representation. For instance, the energy of the signal corresponds to the first cepstral coefficient. Lower cepstral coefficients capture the shape of the spectral envelope or represent its smooth, gross spectral details. Detailed spectral variations, such as spectral peaks corresponding to pitch (the actual notes played) or other long-term signal correlations, appear in the higher cepstral coefficients. Selecting part of the cepstrum allows easy control over the type of spectral information that we submit to the IR analysis.
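A short Python sketch of the real cepstrum of one frame (ours, with an arbitrary cutoff of 30 coefficients, following the liftering idea described above):

```python
import numpy as np

def real_cepstrum(frame, n_keep=30):
    """Lower real-cepstral coefficients of one audio frame:
    C = IFFT(log |FFT(x)|). The first coefficient tracks energy, the next few
    describe the smooth spectral envelope, and higher ones carry fine structure
    such as pitch-related peaks."""
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-12)   # avoid log(0)
    cepstrum = np.fft.irfft(log_mag)
    return cepstrum[:n_keep]
```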

A clever choice of the transformation carries several advantages, such as energy compaction (compression by retaining the high-variance coefficients), noise reduction (projection onto separate signal and noise spaces), and improved recognition (finding salient features). The basic idea is that a signal of interest can be represented by a linear combination of a few strong components (basis functions). The rest of the signal (parts that contain noise, interference, etc.) is assumed to reside in a different subspace and is ideally weak and approximately of equal energy.

There are several methods for estimation of low-rank models, such as eigen-spectral analysis and singular value decomposition (SVD), PCA (also known as the Karhunen-Loève Transform, or KLT), and ICA. For each model, we begin by assuming that the measurements are collected into frames of n samples each, thus constituting a vector x. The model is written as

X = AS + N    (10)

where X are the actual measurements, S are the expansion coefficients (sometimes also considered as the separate source components), A is an array of basis vectors, and N is an additive noise independent of S. Written explicitly, the former equation represents X as a combination of basis vectors that are the columns of A:


Figure 6. Geometrical signal representation consisting of a transform operation followed by data reduction.


[x_1, x_2, . . . , x_n]^T = [a_1 a_2 . . . a_m] [s_1, s_2, . . . , s_m]^T + [n_1, n_2, . . . , n_n]^T    (11)

In our case, we are interested, in addition to the compaction or noise-removal property, in the independence of the expansion coefficients S, to allow estimation of the vector-IR. This can be achieved by means of PCA for the case of a Gaussian multivariate process. It causes the components to be uncorrelated, which in general is not a sufficient condition for statistical independence. In the case of a Gaussian process, uncorrelated components are indeed independent.

For more general types of multivariate processes (such as when the signals are non-stationary or non-Gaussian), more sophisticated methods such as ICA can be used (Cichocki and Amari 2002). For the purposes of the present discussion, we shall assume that for most practical purposes our signals can be assumed to be multivariate Gaussian. Accordingly, we consider PCA or its alternative formulation using SVD. Using matrix notation X_1 X_2 . . . , the columns are arranged in the order of subsequent feature vectors in time. Applying audio basis modeling, these features are represented as a sequence of coefficients S, with basis functions given by basis vectors written as columns of a matrix A, and usually ignoring the remaining noise components n:

[X_1 X_2 . . .] = A [ s_1(1) s_1(2) . . . ; s_2(1) s_2(2) . . . ; . . . ; s_m(1) s_m(2) . . . ]    (12)

The problem of vector data decomposition into independent components is an active field of research. It should be noted that there are no closed-form solutions for ICA for multi-component, single-microphone signals. There are several works in the literature that attempt to perform monaural ICA using adapted source-filter models that are learned from prior examples of the single sources (Roweis 2000; Ozerov et al. 2005). These methods are not applicable to our purpose, because training or prior adaptations are not possible. Another related approach is that of Independent Subspace Analysis (ISA), where factorization of the spectrogram is achieved through projection of the spectrogram matrix onto subsets of ICA vectors that are then clustered into sets that are believed to belong to individual sources (Casey and Westner 2000; Dubnov 2002; Fitzgerald, Coyle, and Lawlor 2002; Smaragdis and Brown 2003; Uhle, Dittmar, and Sporer 2003).

Vector-IR Anticipation Algorithm

Having defined the various components that are necessary for IR analysis of complex signals, we present in Figure 7 the complete algorithm for analysis of IR using audio basis representations. In the first stage of analysis, we derive an appropriate geometrical representation, either as frames of audio samples or as some features extracted from these frames, using the STFT, filter banks, cepstral analysis, Mel-frequency cepstral coefficients (MFCC), or some other method. Then, we perform basis decomposition, combined with data reduction by projection into a lower-dimensional subspace.


Figure 7. Anticipation algorithm flowchart.


The final step consists of separate estimation of the IR of the individual components, according to the principles of IR estimation for multivariate/vector processes described earlier; a sketch of the complete chain is given below. Having described the different stages in the anticipation algorithm, we now describe the application of this method to audio characterization and musical analysis.
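The complete chain can be sketched as follows (our Python approximation of the flowchart in Figure 7; the spectral log-magnitude features, 256-point STFT, 30 retained basis components, and Welch-based SFM estimation roughly follow the settings used in the examples below, but this is an assumption-laden sketch rather than the article's implementation):

```python
import numpy as np
from scipy.signal import stft, welch

def scalar_ir(series, nperseg=64):
    """Scalar-IR of one coefficient time series via SFM (Equations 3 and 5)."""
    _, psd = welch(series, nperseg=min(nperseg, len(series)))
    psd = psd[psd > 0]
    sfm = np.exp(np.mean(np.log(psd))) / np.mean(psd)
    return -0.5 * np.log(sfm)

def vector_ir(x, fs, nfft=256, n_basis=30):
    """Anticipation (vector-IR) of a single-channel signal: log-magnitude STFT
    features -> SVD decorrelation -> sum of per-coefficient scalar-IRs."""
    _, _, Z = stft(x, fs=fs, nperseg=nfft, noverlap=nfft // 2)
    features = np.log(np.abs(Z) + 1e-12)               # spectral log-magnitude frames
    features = features - features.mean(axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(features, full_matrices=False)
    coeffs = Vt[:n_basis]                               # decorrelated expansion coefficients
    return float(sum(scalar_ir(row) for row in coeffs))
```

Coefficients whose scalar-IR falls below a small threshold could also be dropped before the sum, as in the thresholding variant described in the next section.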

Comparison of Vector and Scalar-IR Analysis for Natural Sounds

In this and the following sections, we examine natural sounds that contain significant energy throughout all frequencies of the spectrum. Figure 8 shows the magnitude spectrum on a decibel scale (log-magnitude) of a cheering-crowd sound. This sound contains a dense mixture of different types of sounds. Hand clapping can be seen as vertical lines, while the vocal exclamations appear as high-amplitude spectral lines varying in time.

Performing vector-IR analysis shows that different coefficients contain different IR structure. Figure 9 shows the result of IR analysis of a cheering-crowd signal sampled at 8 kHz using an FFT of size 256 with 50 percent overlap. A total of 129 spectral bins are retained (half the total bins due to symmetry of the STFT, plus the first bin). The IR analysis consists of decorrelation of the log-magnitude spectral matrix using SVD and evaluation of the scalar-IR separately for each of the expansion coefficients. Scalar estimation of the IR of each of the coefficients consists of estimating the SFM from the power spectral density of the coefficient time series using the Welch method (Hayes 1996) with 64 spectral bins.

The results of vector-IR analysis were compared to scalar-IR analysis of the signal.


Figure 8. Spectrogram of a cheering crowd.


Additionally, a synthetic signal with a power spectral density similar to that of the original cheering-crowd signal was constructed by passing white noise through an appropriate filter, whose parameters were estimated using linear prediction (LP) with eight filter coefficients. This synthetic signal, even though non-white, has little overall structure, because its different spectral bands (considered either as coefficients of the STFT or as outputs of a filter bank) lack a time structure. The log-magnitude spectral matrix of such a signal can be described as a rank-one constant matrix corresponding to a single basis vector that captures the overall spectral shape, and a noise matrix that represents the variations between the different bands at different times. The SVD of such a matrix contains a single structured basis vector that captures the overall spectral shape, with the remaining basis vectors having white (and thus non-structured) coefficients. The purpose of testing IR for this synthetic signal was to examine the robustness of vector-IR analysis to the influence of an overall spectral shape, that is, to compare the real signal to a colored-noise signal that does not have the detailed temporal structure within its individual bands. The results of vector-IR and scalar-IR analysis of both signals are given in Table 2.

It can be seen that vector-IR efficiently detects the structure in the original cheering-crowd signal, distinguishing it from the corresponding filtered noise.


Figure 9. Vector-IR analysis of the cheering-crowd signal using spectral log-magnitude features and SVD-derived bases.

Table 2. Results of Vector-IR and Scalar-IR Analysis of the Original Cheering-Crowd Signal and a Corresponding Synthetic Signal

             Original Signal   Equivalent Filtered Noise
Vector-IR    10.3              1.9
Scalar-IR    1.9               1.6


Additional improvement, in terms of achieving a larger difference between the vector-IR of the original signal and that of the spectrally matched noise signal, can be achieved by discarding coefficients whose scalar-IR values fall below a threshold value. Using threshold values of 0.05 and 0.1 results in zero vector-IR for the matching noise signal, and vector-IR values of 8.5 and 8.0, respectively, for the original cheering-crowd signal. It should be noted that this method of dimension reduction is different from the usual methods that discard components according to their variance (energy); here, we discard components according to the amount of their IR structure.

Figure 10 shows the first four basis vectors of the spectral log-magnitude data. Visual inspection of the spectrogram in Figure 8 shows that the first component roughly corresponds to an average overall spectral envelope, and the next basis vectors capture some of the spectral patterns that correspond to high-energy peaks around 1, 3, and 6 kHz. Such visual inspection should be done by translating the gray levels in Figure 8 to values on the vertical axis of Figure 10 (amplitudes of the basis functions), with dark lines corresponding to high amplitudes and brighter shades corresponding to weaker amplitudes.

The amounts of structure (IR) of the respective expansion coefficients are 0.41, 0.78, 1.15, and 1.04, as shown in Figure 9. It should be noted that the basis function ranked first in terms of energy (variance) by the SVD actually has little structure in terms of IR.

Characterization of Natural Sounds Using Vector-IR

One possible application of IR analysis is as a descriptor for complex sounds such as natural sounds and sound effects. Such a descriptor provides characterization of a signal in terms of its overall "complexity," that is, considering the amount of structure that occurs in a sound when it is considered as a random process (as sound texture). In Figure 11, we represent several sounds in a two-dimensional plane consisting of a vector-IR axis and a signal-energy axis. In this analysis, the IR estimation was performed using cepstral features, excluding the first, energy-related coefficient, which was used for energy estimation. IR estimation was done using 30 cepstral coefficients, with frames 512 samples long with 50 percent overlap, and scalar-IR estimation using SFM with power spectral estimation by the Welch method with 64 spectral bins.

One may note that the energy and IR characteristics in Figure 11 correspond to our intuitive notions about the character of these sounds. For instance, fire sounds are the most noise-like, with little variation in energy. This can be compared to a phone-ringing sound that is highly "anticipated," and to the car crash, glass break, and cheering crowd sounds that have intermediate levels of IR, with the car crash having the largest energy variation.

Anticipatory Listening and Music Applications

In this section, we reach what may be the most interesting and speculative application of the IR method, exploring the relationship between the IR measure and concepts of auditory and musical anticipation (Meyer 1956). In previous sections, we have applied a single IR analysis to a complete signal, resulting in a single number that describes the overall anticipation property of the sound. When listening to longer audio signals or music, the properties of time-evolving signals cannot be summarized into a single "anticipation number."


Figure 10. First four spectral magnitude basis functions.


Accordingly, we extend the method of IR analysis to the non-stationary case by applying IR in a time-varying fashion.

In the following example, we represent a sound in terms of a sequence of spectral envelopes, represented by spectral or cepstral coefficients. These vectors are grouped into macro-frames and submitted to separate IR analyses, resulting in a single IR value for each macro-frame. When the IR graph is plotted against time, one obtains a graph of IR evolution over the course of the musical signal; we call this the anticipation profile. Using different analysis parameters, anticipation profiles could be used to analyze affective vocalizations (Slaney and McRoberts 1998) or musical sounds (Scheirer 2000). For the purpose of vocalization analysis, the IR macro-frames are 1 sec long with 75 percent overlap. For analysis of musical recordings, longer frames are required, varying from 3 sec for solo or chamber instrumental music to 30 sec for complex orchestral textures. (The reason for this will be explained in the next section.)
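A sliding macro-frame version can be sketched on top of any vector-IR estimator (our code; the 3-sec frames with 75 percent overlap mirror the settings quoted above for chamber music, and ir_fn stands for a function such as the vector-IR sketch given earlier):

```python
import numpy as np

def anticipation_profile(x, fs, ir_fn, macro_sec=3.0, overlap=0.75):
    """Time-varying anticipation profile: vector-IR evaluated on successive,
    overlapping macro-frames of the signal."""
    macro = int(macro_sec * fs)
    hop = max(1, int(macro * (1.0 - overlap)))
    starts = range(0, max(len(x) - macro + 1, 1), hop)
    times = np.array([(s + macro / 2.0) / fs for s in starts])
    profile = np.array([ir_fn(x[s:s + macro], fs) for s in starts])
    return times, profile
```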

Anticipation Profile of Musical Signals

A comparative vector-IR analysis of a musical signal using 30 cepstral and 30 magnitude-spectral basis components is presented in Figure 12 for an excerpt from the first movement of Schumann's Piano Quartet in E-flat Major. The features were estimated over signal frames of 20 msec duration. The macro-frame for IR analysis was 3 sec long with 75 percent overlap between successive analysis frames. Figure 12 shows the vector-IR results for the two representations, overlaid on top of a signal spectrogram.

As can be seen from the figure, both methods result in similar anticipation profiles.


Figure 11. Energy and IR distribution of different natural sounds.


Varying the order of the model (changing the data reduction, or so-called cepstral "liftering," number) between 20 and 60 components has little effect on the results, indicating that the system is quite robust to changes in the representation detail.

When considering the results of this analysis, one might argue that the nature of the music of the Schumann piano quartet is such that its structure might be detected using simpler methods, such as energy or other existing voice-activity detectors. To show a situation where vector-IR analysis gives a unique glimpse into musical structure that other features do not allow, we present in Figure 13 the results of vector-IR analysis of an acoustic recording of a MIDI rendering of Bach's Prelude in G Major from Book I of the Well-Tempered Clavier.

The synthetic nature of the computer playback of a MIDI file is such that the resulting acoustic signal lacks almost any expressive inflections, including dynamics. Moreover, the use of a synthetic piano creates a signal with little spectral variation. As can be observed in Figure 13, vector-IR still detects significant changes in the music. Analysis of similarities between the IR curves and the major sections of the score from a music-theoretical point of view seems to indicate that the anticipation profile captures relevant aspects of the musical structure. For instance, the first drop of the IR graph corresponds to the opening of the prelude, ending at the first cadence and modulation to the dominant. The following lower part of the IR graph corresponds to two repetitions of an ostinato pattern. The two peaks in IR then seem to be related to the reappearance of the opening theme with harmonic modulations, ending in the next IR valley at the repeating melodic pattern on the parallel minor. The next increase in IR corresponds to a development section on the dominant, followed by a final transition to the cadence, with a climax and final descent along the cadential passage.

To give a better impression of the correspondence between the variations in musical structure and texture and our analysis, we present in Figure 14 the same graph of the anticipation profile again, this time overlaid on top of MIDI notes.


Figure 12. Graph of the anticipation profile (estimated by vector-IR) using cepstral (solid) and spectral magnitude (dotted) features, displayed over a spectrogram of a musical excerpt (the first movement of Schumann's Piano Quartet in E-flat Major).


(The crosses indicate note onsets; the actual note numbers are not represented in the graph.)

It appears that changes and repetitions of musical materials are detected by IR analysis of the acoustic signal. (It should be noted that our anticipation analysis does not involve note or pitch detection, or any use of the score or MIDI information.)

Anticipation Profile and Emotional Force

To evaluate the significance of the IR method for music analysis, a comparison between anticipation profiles derived from automatic signal analysis and human perception of musical contents is required. In a recent experiment, large amounts of data concerning human emotional responses when listening to a performance of a contemporary orchestral musical work (Angel of Death by Roger Reynolds) were collected during live concerts (McAdams et al. 2002). During these concerts, listeners were assigned a response box with a sliding fader that allowed continuous analog ratings to be made on a scale of emotional force (Smith 2001). Listeners were instructed that positive or negative emotional reactions of similar magnitude were to be judged at the same level of the emotional force scale. The ends of the emotional force scale were labeled "weak" and "strong." In addition, a small "I don't know" region was provided at the far left end of this scale that could be sensed tactilely, as the cursor provided a slight resistance to moving into or out of this zone.

Continuous data from the response boxes were converted to MIDI format (on an integer scale from 0 to 127) and recorded simultaneously with the musical performance.


Figure 13. Graph of the anticipation profile (estimated by vector-IR) using 30 cepstral features, displayed over a spectrogram of a musical excerpt (Bach's Prelude in G Major from Book I of the Well-Tempered Clavier). The acoustic signal was created by computer rendering of a MIDI file.


Figure 15 presents a comparative graph of the anticipation profile resulting from IR analysis of the audio signal (the concert recording) and a graph of the average human responses, termed "Emotional Force" (EF), for two versions of the orchestral piece.

Analyses were performed using analysis frame sizes of 200 msec with macro-frames 3 sec long, with no overlap between the macro-frame segments. These IR values were additionally smoothed using a 10-segment-long moving-average filter, resulting in an effective analysis frame of 30 sec with a 3-sec interval between analysis values. The extra averaging removed fast variations in the vector-IR analysis, assuming that a 30-sec smoothing better matches the rate of change in human judgments for such a complex orchestral piece.

As can be seen from Figure 15, certain portions of the IR curve fit closely to the EF data, whereas other portions differ significantly. It was found that the correlations between the IR data and the EF were 63 percent and 47 percent for the top and bottom graphs, respectively.

When combined with additional signal features, such as signal energy, a higher correspondence between signal-information analysis and Emotional Force judgments was achieved. The analysis shows strong evidence that signal properties and human reactions are related, suggesting applications of these techniques to music understanding and music information-processing systems. Full details of the experiments and the additional information-analysis methods will appear elsewhere (Dubnov, McAdams, and Reynolds in press).

Thoughts About Self-Supervised Brain Processing Architecture

We would like to consider briefly in this section the possible relationships between anticipation analysis and self-supervised architectures of brain processing, particularly in relation to higher cognitive aspects of musical information processing.


Figure 14. Graph of the anticipation profile (estimated by vector-IR from an audio recording) displayed on top of MIDI note onsets of the Bach Prelude.


Figure 15. Two graphs representing human judgments of Emotional Force (solid) and IR analysis (dotted) of audio recordings of two musical performances.


One such architecture that was suggested for emotional processing assumes the existence of two separate brain mechanisms that interact in the analysis of complex information (Huron 2005): a "fast brain" that deals primarily with pre-processing of sensory information, forming perceptions from the stream of constantly impinging sensory excitations, and a second component, the "slow brain," that interacts with the fast brain by performing appraisal of the fast-brain performance. This architecture is described in Figure 16.

In the context of our computational model, we consider the functions of the fast brain to be related to basic pattern-recognition actions, such as feature extraction and data reduction. Anticipation could be considered as an appraisal that is done by the slow brain. That is, anticipation involves evaluating the utility of the past perception for explaining the present, in terms of assigning a score to the fast-brain functions according to the "relative reduction of uncertainty about the present when considering its past" (a quote from the definition of IR).

It should be noted that the selection of informative features is commonly done in other signal-processing applications, such as speech understanding or computer vision, mostly in terms of classification performance. That is, features are considered to be informative if they have significant mutual information with the class labels for a particular recognition task. In music, the task of recognition is secondary, and it is only natural to assume that anticipation, rather than recognition, is a more appropriate task for describing music information-processing operations. In other words, the information contents of music signals should be measured not by the mutual information between signal features and a set of signal labels (a recognition task), but as the mutual information between past and present features (an anticipation task).

Conclusion

This article presents a novel approach to (automatic) analysis of audio and music based on an “anticipation profile.” The ideas are developed from a background of information theory and basis decomposition, through the idea of a vector approach to the anticipation measurement, and ending with some higher-level reflections on its meaning for complex musical signals. The anticipation profile is estimated by evaluating the reduction in the uncertainty (entropy) of a variable resulting from prediction based on the past. Algorithms for estimation of the anticipation profile were presented, with applications for audio signal characterization and music analysis. Discussions show that this measure also resolves several ill-defined aspects of the structure-versus-noise problem in music and includes the listener as an integral part of structural analysis of music. Moreover, the proposed anticipation measure might be related to more general aspects of appraisal or self-supervision of information-processing systems, including aspects of emotional brain architecture.

References

Allamanche, E., et al. 2001. “Content-Based Identification of Audio Material Using MPEG-7 Low Level Description.” Proceedings of the 2001 International Symposium on Music Information Retrieval. Bloomington, Indiana: Indiana University, pp. 197–204.

Casey, M. A. 2001. “MPEG-7 Sound Recognition Tools.” IEEE Transactions on Circuits and Systems for Video Technology 11(6):737–747.

Casey, M. A., and A. Westner. 2000. “Separation of Mixed Audio Sources by Independent Subspace Analysis.” Proceedings of the 2000 International Computer Music Conference. San Francisco, California: International Computer Music Association, pp. 154–161.


Cichocki, A., and S. Amari. 2002. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. New York: Wiley.

Cover, T. M., and J. A. Thomas. 1991. Elements of Information Theory. New York: Wiley.

Dubnov, S. 2002. “Extracting Sound Objects by Independent Subspace Analysis.” Paper presented at the 2002 Audio Engineering Society International Conference, Espoo, Finland, 17 June.

Dubnov, S. 2003. “Non-Gaussian Source-Filter and Independent Components Generalizations of Spectral Flatness Measure.” Proceedings of the International Conference on Independent Components Analysis (ICA2003), Nara, Japan, pp. 143–148.

Dubnov, S. 2004. “Generalization of Spectral Flatness Measure for Non-Gaussian Linear Processes.” IEEE Signal Processing Letters 11(8):698–701.

Dubnov, S., S. McAdams, and R. Reynolds. In press. “Structural and Affective Aspects of Music from Statistical Audio Signal Analysis.” Journal of the American Society for Information Science and Technology.

Fitzgerald, D., E. Coyle, and B. Lawlor. 2002. “Sub-Band Independent Subspace Analysis for Drum Transcription.” Paper presented at the 5th International Conference on Digital Audio Effects, Hamburg, Germany, 26–28 September.

Hayes, M. 1996. Statistical Signal Processing and Modeling. New York: Wiley.

Huron, D. 2005. “Six Models of Emotion.” Available online at http://csml.som.ohio-state.edu/Music829D/Notes/Models.html.

Jayant, N. S., and P. Noll. 1984. Digital Coding of Waveforms. Upper Saddle River, New Jersey: Prentice-Hall.

Kim, H., N. Moreau, and T. Sikora. 2004. “Audio Classification Based on MPEG-7 Spectral Basis Representations.” IEEE Transactions on Circuits and Systems for Video Technology 14(5):716–725.

Lewicki, M. S. 2002. “Efficient Coding of Natural Sounds.” Nature Neuroscience 5(4):356–363.

McAdams, S., et al. 2002. “Real-Time Perception of a Contemporary Musical Work in a Live Concert Setting.” Paper presented at the 7th International Conference on Music Perception and Cognition, Sydney, Australia, 17–21 July.

Meyer, L. B. 1956. Emotion and Meaning in Music. Chicago: Chicago University Press.

Oppenheim, A. V., and R. W. Schafer. 1989. Discrete-Time Signal Processing. Upper Saddle River, New Jersey: Prentice Hall.

Ozerov, A., et al. 2005. “One Microphone Singing Voice Separation Using Source-Adapted Models.” Paper presented at the 2005 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 17 October.

Roweis, S. 2000. “One Microphone Source Separation.” Neural Information Processing Systems 13:793–799.

Scheirer, E. D. 2000. Music Listening Systems. Ph.D. Thesis, Massachusetts Institute of Technology.

Slaney, M., and G. McRoberts. 1998. “Baby Ears: A Recognition System for Affective Vocalizations.” Paper presented at the 1998 International Conference on Acoustics, Speech, and Signal Processing, Seattle, 12–15 May.

Smaragdis, P., and J. C. Brown. 2003. “Non-Negative Matrix Factorization for Polyphonic Music Transcription.” Paper presented at the 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, 21 October.

Smith, B. K. 2001. “Experimental Setup for The Angel of Death.” Available online at http://www.ircam.fr/pcm/bks/death.

Uhle, C., C. Dittmar, and T. Sporer. 2003. “Extraction of Drum Tracks from Polyphonic Music Using Independent Subspace Analysis.” Paper presented at the Fourth International Symposium on Independent Component Analysis and Blind Signal Separation, Nara, Japan, 1–4 April.

Appendix

Given a signal with power spectrum S(ω), the SFM is defined as

\[
\mathrm{SFM}(x) = \frac{\exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S(\omega)\, d\omega\right)}{\frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\, d\omega}
\]

Rewriting it as a discrete sum gives

\[
\mathrm{SFM}(x) = \frac{\exp\left(\frac{1}{N}\sum_{i=1}^{N} \ln S(\omega_i)\right)}{\frac{1}{N}\sum_{i=1}^{N} S(\omega_i)} = \frac{\left(\prod_{i=1}^{N} S(\omega_i)\right)^{1/N}}{\frac{1}{N}\sum_{i=1}^{N} S(\omega_i)} \qquad (13)
\]

which shows that SFM can be viewed as the ratio between the geometric and arithmetic means of signal spectra, thus being positive and less than or equal to unity. The SFM equals unity only if all spectrum values are equal, thus meaning a flat spectrum or a white-noise signal.
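As a minimal numerical sketch of Equation 13 (this is not the article's MATLAB code; the periodogram estimate and the small regularization constant are assumptions made for the example), the SFM of a signal can be computed as the ratio of the geometric to the arithmetic mean of its power spectrum:

# Illustrative SFM computation: geometric / arithmetic mean of the periodogram.
import numpy as np

def spectral_flatness(x):
    """Estimate SFM of signal x from its power spectrum (periodogram)."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2 + 1e-12   # floor to avoid log(0)
    geometric_mean = np.exp(np.mean(np.log(spectrum)))
    arithmetic_mean = np.mean(spectrum)
    return geometric_mean / arithmetic_mean

rng = np.random.default_rng(0)
noise = rng.normal(size=4096)                             # white noise: SFM near 1
tone = np.sin(2 * np.pi * 200 * np.arange(4096) / 4096)   # sinusoid: SFM near 0
print(spectral_flatness(noise), spectral_flatness(tone))

For white noise the two means nearly coincide and the SFM approaches unity, whereas for a sinusoid the spectral energy is concentrated in a few bins and the SFM approaches zero.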

Information Redundancy

Given a random variable x with probability distribution f(x), the entropy of the distribution is defined (Cover and Thomas 1991) as

\[
H(x) = -\int f(x) \log f(x)\, dx \qquad (14)
\]

For the joint distribution of two variables x1, x2, the joint entropy is defined as

\[
H(x_1, x_2) = -\int f(x_1, x_2) \log f(x_1, x_2)\, dx_1\, dx_2 \qquad (15)
\]

The average amount of information that the variable x1 carries about x2 is quantified by the mutual information

\[
I(x_1, x_2) = H(x_1) + H(x_2) - H(x_1, x_2) \qquad (16)
\]

Generalization of the mutual information to the case of n variables yields

\[
I(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n} H(x_i) - H(x_1, x_2, \ldots, x_n) \qquad (17)
\]

This function measures the average amount of common information contained in the variables x1, x2, . . . , xn. Using the mutual information, we originally define the information rate (IR), denoted ρ, to be the difference between the common information contained in the variables x1, x2, . . . , xn and in the set x1, x2, . . . , xn–1, namely, the additional amount of information that is added when one more variable is observed:

\[
\rho(x_1, x_2, \ldots, x_n) = I(x_1, x_2, \ldots, x_n) - I(x_1, x_2, \ldots, x_{n-1}) \qquad (18)
\]

Because in our application we are considering time-ordered samples, this redundancy measure corresponds to the rate of growth of the common information as a function of time. It can be shown that the following relationship exists between redundancy and entropy:

\[
\rho(x_1, x_2, \ldots, x_n) = H(x_n) - H(x_n \mid x_1, x_2, \ldots, x_{n-1}) \qquad (19)
\]

This shows that redundancy is the difference between the entropy (or uncertainty) about xn in isolation and the reduced uncertainty of xn if we know its past. In information-theoretic terms, and assuming a stationary process, this measure equals the difference between the entropy of the marginal distribution of the process xn and the entropy rate of the process, equally for all n.
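A hedged numerical illustration of Equation 19: under a Gaussian assumption, ρ reduces to one-half the logarithm of the ratio between the marginal variance and the prediction-error variance, with the conditioning on the past approximated by a finite-order least-squares linear predictor (the predictor order, the toy signals, and the function name below are assumptions of this sketch, not part of the article):

# Illustrative IR estimate: rho = H(x_n) - H(x_n | past) for (approximately) Gaussian data.
import numpy as np

def information_rate_gaussian(x, order=10):
    """Estimate IR (nats/sample) of x via least-squares linear prediction."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # Regression: predict x[t] from x[t-1], ..., x[t-order].
    X = np.column_stack([x[order - k - 1: len(x) - k - 1] for k in range(order)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = y - X @ coeffs
    return 0.5 * np.log(np.var(x) / np.var(residual))

rng = np.random.default_rng(0)
white = rng.normal(size=20000)                 # no temporal structure: IR near 0
ar1 = np.zeros(20000)
for t in range(1, len(ar1)):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()   # predictable signal: IR clearly positive
print(information_rate_gaussian(white), information_rate_gaussian(ar1))

The white-noise estimate stays near zero, whereas the AR(1) example approaches its theoretical value of −(1/2) ln(1 − 0.9²) ≈ 0.83 nats per sample.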

Relationship Between SFM and IR

To assess the amount of structure present in a signal in terms of its information content, we observe the following relationships between the signal spectrum and entropy. The entropy of a “white” Gaussian random variable is given by

\[
H(x) = \frac{1}{2}\ln\!\left(2\pi e \sigma_x^2\right) = \frac{1}{2}\ln\!\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\, d\omega\right) + \frac{1}{2}\ln(2\pi e) \qquad (20)
\]

whereas the entropy rate of a Gaussian process (the so-called Kolmogorov-Sinai entropy) is given by

\[
H_r(x) = \lim_{N\to\infty}\frac{1}{N} H(x_1, \ldots, x_N) = \lim_{N\to\infty} H(x_N \mid x_{N-1}, \ldots, x_1) = \frac{1}{4\pi}\int_{-\pi}^{\pi} \ln S(\omega)\, d\omega + \frac{1}{2}\ln(2\pi e) \qquad (21)
\]

According to the previous section, IR is defined as the difference between the marginal entropy and the entropy rate of the signal x(t), ρ(x) = H(x) − Hr(x). Inserting the expressions for entropy and entropy rate, one arrives at the following relation:

\[
\mathrm{SFM}(x) = \exp\bigl(-2\rho(x)\bigr) = \frac{\exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \ln S(\omega)\, d\omega\right)}{\frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\, d\omega} \qquad (22)
\]

One can see here that IR is equal to minus one-half of the logarithm of the SFM.
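Equation 22 can be checked numerically on a simple case. For a first-order autoregressive process with coefficient a and unit-variance innovations (an assumed example, not drawn from the article), the marginal variance is 1/(1 − a²), the entropy rate corresponds to the innovation variance, and the power spectrum is S(ω) = 1/|1 − a e^(−jω)|²; the sketch below confirms that the spectrally computed SFM equals exp(−2ρ) = 1 − a²:

# Illustrative numerical check of SFM(x) = exp(-2*rho(x)) for an assumed AR(1) process.
import numpy as np

a = 0.9                                              # AR(1) coefficient, |a| < 1
w = np.linspace(-np.pi, np.pi, 2**16, endpoint=False)
S = 1.0 / np.abs(1.0 - a * np.exp(-1j * w)) ** 2     # power spectrum, unit innovations

sfm = np.exp(np.mean(np.log(S))) / np.mean(S)        # geometric / arithmetic mean
rho = 0.5 * np.log(1.0 / (1.0 - a**2))               # H(x) - H_r(x) for this AR(1)

print(sfm, np.exp(-2 * rho), 1 - a**2)               # all three agree (about 0.19)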

Vector-IR as Sum of Independent Component Scalar-IR

Given a linear transformation X = AS between blocks of the original data (a signal frame or feature vector X) and its expansion coefficients S, the entropy relationship between the data and the coefficients is H(X) = H(S) + log |det(A)|. For a sequence of data vectors, we evaluate the conditional IR as the difference between the entropy of the last block and its entropy given the past vectors. (This is a conditional entropy, which becomes the entropy rate in the limit of an infinite past.) Using the standard definition of multi-information for the signal samples x1, . . . , xnL,

\[
I(X_1, X_2, \ldots, X_L) = \sum_{i=1}^{Ln} H(x_i) - H(x_1, \ldots, x_{Ln}) \qquad (23)
\]

we originally define and develop the expression forvector-IR as

\[
\begin{aligned}
\rho_L(X_1, \ldots, X_L) &= I(X_1, \ldots, X_L) - I(X_1, \ldots, X_{L-1}) - I(X_L) \\
&= \sum_{i=(L-1)n+1}^{Ln} H(x_i) - H(X_1, \ldots, X_L) + H(X_1, \ldots, X_{L-1}) - I(X_L) \\
&= \sum_{i=(L-1)n+1}^{Ln} H(x_i) - H(X_L \mid X_1, \ldots, X_{L-1}) - I(X_L) \\
&= H(X_L) - H(X_L \mid X_1, \ldots, X_{L-1}) \qquad (24)
\end{aligned}
\]

This shows that the vector-IR can be evaluated from the difference between the entropy of the last block and the conditional entropy of that block given its past.

Using the transform relationship, one can equivalently express vector-IR as a difference between the entropy and the conditional entropy of the transform coefficients, ρ(X1, . . . , XL) = H(SL) − H(SL | S1, . . . , SL–1). (Note that the dependence upon the determinant of A is cancelled by the subtraction.) If there are no dependencies across different coefficients and the only dependencies are within each of the coefficient sequences as a function of time (i.e., the trajectory of each coefficient is time-dependent, but the coefficients between themselves are independent), we arrive at the relationship

\[
\begin{aligned}
H(S_L) &= \sum_{i=1}^{n} H\bigl(s_i(L)\bigr) \\
H(S_L \mid S_1, \ldots, S_{L-1}) &= \sum_{i=1}^{n} H\bigl(s_i(L) \mid s_i(1), \ldots, s_i(L-1)\bigr) \qquad (25)
\end{aligned}
\]

Combining these equations gives the desired result:

\[
\rho(X_1, X_2, \ldots, X_L) = \sum_{i=1}^{n} \rho\bigl(s_i(1), \ldots, s_i(L)\bigr) \qquad (26)
\]
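As a rough sketch of Equation 26 (the use of an SVD/PCA basis in place of the article's basis decomposition, the Gaussian linear-prediction estimate of each scalar IR, the frame size, and the number of retained components are all assumptions made for this illustration), the vector-IR of a sequence of feature frames can be approximated by decorrelating the frames and summing the scalar IRs of the resulting coefficient trajectories:

# Illustrative vector-IR: sum of scalar IRs of (assumed independent) coefficient trajectories.
import numpy as np

def scalar_ir(s, order=8):
    """Scalar IR of one coefficient trajectory: 0.5*log(var / prediction-error var)."""
    s = s - s.mean()
    X = np.column_stack([s[order - k - 1: len(s) - k - 1] for k in range(order)])
    y = s[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 0.5 * np.log(np.var(s) / np.var(y - X @ coeffs))

def vector_ir(frames, n_components=10):
    """Vector-IR of a (time x features) matrix: sum of scalar IRs after decorrelation."""
    frames = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames, full_matrices=False)   # PCA basis (a stand-in choice)
    coeffs = frames @ vt[:n_components].T                   # expansion coefficients S
    return sum(scalar_ir(coeffs[:, i]) for i in range(coeffs.shape[1]))

# Toy usage: spectrogram-like frames of a signal with temporal structure.
rng = np.random.default_rng(0)
x = np.zeros(2**16)
for t in range(1, len(x)):
    x[t] = 0.95 * x[t - 1] + rng.normal()
frames = np.abs(np.fft.rfft(x.reshape(-1, 512), axis=1))    # magnitude spectra per frame
print("vector-IR estimate:", vector_ir(frames))

The MATLAB implementation referenced below remains the authoritative version of the analysis.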

MATLAB Code

MATLAB code of the IR analysis, along with specific examples of sound analysis used in this article, can be found online at http://music.ucsd.edu/~sdubnov/InformationRate.
