TRANSCRIPT
Learning from Music - Ellis 2004-12-18

What can we Learn from Large Music Databases?
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY USA

1. Learning Music
2. Music Similarity
3. Melody, Drums, Event extraction
4. Conclusions
Learning from Music
• A lot of music data available
  e.g. 60G of MP3 ≈ 1000 hr of audio / 15k tracks
• What can we do with it?
  implicit definition of 'music'
• Quality vs. quantity
  Speech recognition lesson: 10x data, 1/10th annotation, twice as useful
• Motivating applications
  music similarity / classification
  computer (assisted) music generation
  insight into music
Ground Truth Data
• A lot of unlabeled music data available
  manual annotation is much rarer
• Unsupervised structure discovery is possible
  .. but labels help to indicate what you want
• Weak annotation sources
  artist-level descriptions
  symbol sequences without timing (MIDI)
  errorful transcripts
• Evaluation requires ground truth
  limiting factor in Music IR evaluations?
[Figure: spectrogram of /Users/dpwe/projects/aclass/aimee.wav, 0-7000 Hz, 0:02-0:28, with segments hand-labeled 'mus' (music) and 'vox' (vocals)]
Talk Roadmap

[Figure: block diagram linking Music audio to the talk's numbered threads (1-6): Drums extraction → Eigen-rhythms; Event extraction; Melody extraction → Fragment clustering; Anchor models and Semantic bases → Similarity/recommendation; with Synthesis/generation as a possible further output (?)]
1. Music Similarity Browsing
• Musical information overload
  record companies filter/categorize music
  an automatic system would be less odious
• Connecting audio and preference
  map to a 'semantic space'?
[Figure: two audio inputs (Class i, Class j) are each converted to Anchor Space - GMM modeling gives posteriors p(a1|x), p(a2|x), ..., p(an|x), an n-dimensional vector in "Anchor Space" - followed by a Similarity Computation (KL-d, EMD, etc.)]

with Adam Berenzweig
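The anchor-space idea above can be sketched compactly. This is an illustrative stand-in, not the authors' exact pipeline: assume each anchor classifier yields per-frame log-likelihoods; normalizing gives posteriors p(a_k|x), a track is summarized by its mean posterior vector, and tracks are compared with a symmetrized KL divergence (all function names are hypothetical):

```python
import numpy as np

def anchor_posteriors(loglik):
    """loglik: (n_frames, n_anchors) log-likelihoods -> per-frame posteriors."""
    loglik = loglik - loglik.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(loglik)
    return p / p.sum(axis=1, keepdims=True)

def track_vector(loglik):
    """Summarize a track as its mean anchor-posterior vector."""
    return anchor_posteriors(loglik).mean(axis=0)

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two anchor distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

In practice the slide's "KL-d, EMD, etc." suggests several distances were tried over these anchor distributions; the mean-vector summary here is the simplest option.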
Anchor Space
• Frame-by-frame high-level categorizations
  compare to raw features?
  properties in distributions? dynamics?

[Figure: frames of madonna vs. bowie plotted as Cepstral Features (third vs. fifth cepstral coef) and as Anchor Space Features (Country vs. Electronica dimensions)]

'Playola' Similarity Browser
Semantic Bases
• What should the 'anchor' dimensions be?
  hand-chosen genres? X
  somehow choose automatically
• "Community metadata":
  use the Web to get words/phrases
  .. that are informative about artists
  .. and that can be predicted from audio
• Refine classifiers to below artist level, e.g. by EM?

with Brian Whitman
a indicates the number of frames in which a term classifier positively agrees with the truth value (both classifier and truth say a frame is 'funky,' for example). b indicates the number of frames in which the term classifier indicates a negative term association but the truth value indicates a positive association (the classifier says a frame is not 'funky,' but truth says it is). The value c is the number of frames in which the term classifier predicts a positive association but the truth is negative, and the value d is the number of frames in which the term classifier and truth agree on a negative association. We wish to maximize a and d as correct classifications; by contrast, random guessing by the classifier would give the same ratio of classifier labels regardless of ground truth, i.e. a/b ≈ c/d. With N = a + b + c + d, the K-L distance between the observed distribution and such random guessing is:
KL = (a/N) log( Na / ((a+b)(a+c)) )
   + (b/N) log( Nb / ((a+b)(b+d)) )
   + (c/N) log( Nc / ((a+c)(c+d)) )
   + (d/N) log( Nd / ((b+d)(c+d)) )    (3)
This measures the distance of the classifier away from a
degenerate distribution; we note that it is also the mu-
tual information (in bits, if the logs are taken in base 2)
between the classifier outputs and the ground truth labels
they attempt to predict.
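Equation (3) can be computed directly from the four confusion counts; a minimal sketch (the function name is illustrative), using base-2 logs so the result is in bits and the convention 0 log 0 = 0 for empty cells:

```python
import math

def kl_bits(a, b, c, d):
    """Mutual information (bits) between classifier output and truth, Eq. (3)."""
    n = a + b + c + d
    total = 0.0
    # each cell contributes (x/N) * log2(N*x / (truth_marginal * classifier_marginal))
    for x, row, col in [(a, a + b, a + c),
                        (b, a + b, b + d),
                        (c, c + d, a + c),
                        (d, c + d, b + d)]:
        if x > 0:  # 0 * log 0 -> 0 by convention
            total += (x / n) * math.log2(n * x / (row * col))
    return total
```

A perfect balanced classifier scores 1 bit; a classifier independent of the truth scores 0.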
Table 2 gives a selected list of well-performing term models. Given the difficulty of the task we are encouraged by the results. Not only do the results give us term models for audio, they also give us insight into which terms and descriptions work better for music understanding. These terms give us high semantic leverage without experimenter bias: the terms and performance were chosen automatically instead of from a list of genres.
7.3. Automatic review generation

Multiplying the term model c against the testing gram matrix returns a single value indicating that term's relevance to each time frame. This can be used in review generation as a confidence metric, perhaps setting a threshold to only allow high-confidence terms. The vector of term and confidence values for a piece of audio can also be fed into other similarity and learning tasks, or even a natural language generation system: one unexplored possibility for review generation is to borrow fully-formed sentences from actual reviews that use some number of terms predicted by the term models, and to form coherent paragraphs of reviews from this generic source data. Work in language generation and summarization is outside the scope of this article, but the results for the term prediction task and the review trimming task below are promising for these future directions.

adj Term     K-L bits   np Term                 K-L bits
aggressive   0.0034     reverb                  0.0064
softer       0.0030     the noise               0.0051
synthetic    0.0029     new wave                0.0039
punk         0.0024     elvis costello          0.0036
sleepy       0.0022     the mud                 0.0032
funky        0.0020     his guitar              0.0029
noisy        0.0020     guitar bass and drums   0.0027
angular      0.0016     instrumentals           0.0021
acoustic     0.0015     melancholy              0.0020
romantic     0.0014     three chords            0.0019

Table 2. Selected top-performing models of adjective and noun phrase terms used to predict new reviews of music, with their corresponding bits of information from the K-L distance measure.
One major caveat of our review learning model is its
time insensitivity. Although the feature space embeds time
at different levels, there is no model of intra-song changes
of term description (a loud song getting soft, for example)
and each frame in an album is labeled the same during
training. We are currently working on better models of
time representation in the learning task. Unfortunately,
the ground truth in the task is only at the album level and
we are also considering techniques to learn finer-grained
models from a large set of broad ones.
7.4. Review Regularization
Many problems of non-musical text and opinion or personal terms get in the way of full review understanding. A similarity measure trained on the frequencies of terms in a user-submitted review would likely be tripped up by obviously biased statements like "This record is awful" or "My mother loves this album." We look to the success of our grounded term models for insights into the musicality of description and develop a 'review trimming' system that summarizes reviews and retains only the most descriptive content. The trimmed reviews can then be fed into further textual understanding systems or read directly by the listener.
To trim a review we create a grounding sum operated on a sentence s of word length n,

g(s) = ( Σ_{i=0}^{n} P(a_i) ) / n    (4)

where a perfectly grounded sentence (in which the predictive qualities of each term on new music have 100% precision) scores 100%. This upper bound is virtually impossible in a grammatically correct sentence, and we usually see g(s) of {0.1% .. 10%}. The user sets a threshold and the system simply removes sentences under the threshold. See Table 3 for example sentences and their g(s). We see that the rate of sentence retrieval (how much of the review is kept) varies widely between the two review sources; AMG's reviews naturally have more musical content. See Figure 4 for recall rates at different thresholds of g(s).
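The trimming rule around Eq. (4) is simple enough to sketch. This is an illustrative stand-in: the probability table is a placeholder for the learned term models, and the function names are hypothetical:

```python
def grounding_score(sentence, term_prob):
    """g(s): mean groundedness P(a_i) of the words in a sentence."""
    words = sentence.lower().rstrip('.').split()
    if not words:
        return 0.0
    # unknown words contribute zero groundedness
    return sum(term_prob.get(w, 0.0) for w in words) / len(words)

def trim_review(sentences, term_prob, threshold=0.01):
    """Keep only sentences whose g(s) clears the user-set threshold."""
    return [s for s in sentences if grounding_score(s, term_prob) >= threshold]
```

A sentence full of musically grounded terms ("funky", "guitar") survives; "My mother loves this album" scores near zero and is dropped.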
2. Transcription as Classification
• Signal models typically used for transcription
  harmonic spectrum, superposition
• But ... trade domain knowledge for data
  transcription as a pure classification problem:
  single N-way discrimination for "melody"
  per-note classifiers for polyphonic transcription

[Figure: Audio → Trained classifier → p("C0"|Audio), p("C#0"|Audio), p("D0"|Audio), p("D#0"|Audio), p("E0"|Audio), p("F0"|Audio), ...]

with Graham Poliner
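The per-note formulation can be sketched with any frame classifier. The talk used SMO SVMs (Weka); here a plain logistic regression stands in to show the shape of the problem: one independent binary classifier per pitch, trained on labeled spectrogram frames (all names illustrative):

```python
import numpy as np

def train_note_classifier(X, y, lr=0.5, epochs=200):
    """X: (frames, features); y: 0/1 labels saying whether one pitch is active."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
        g = p - y                                 # gradient of log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def note_posteriors(X, models):
    """Per-frame p(note|audio) for every trained pitch classifier."""
    return {note: 1.0 / (1.0 + np.exp(-(X @ w + b)))
            for note, (w, b) in models.items()}
```

Polyphonic transcription then just thresholds each note's posterior independently per frame.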
Classifier Transcription Results
• Trained on MIDI syntheses (32 songs)
  SMO SVM (Weka)
• Tested on ISMIR MIREX 2003 set
  foreground/background separation

Frame-level pitch concordance:

  system    "jazz3"   overall
  fg+bg     71.5%     44.3%
  just fg   56.1%     45.4%
Forced-Alignment of MIDI
• MIDI is a handy description of music
  notes, instruments, tracks
  .. to drive synthesis
• Align MIDI 'replicas' to get GTruth for audio
  estimate time-warp relation

[Figure: "Don't you want me" (Human League), verse 1 - spectrograms (freq/kHz vs. time/sec) of the MIDI Replica and the Original, with the aligned MIDI notes (MIDI #) overlaid]

with Rob Turetsky
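Estimating the time-warp relation is classically done with dynamic time warping between features of the MIDI synthesis ("replica") and of the original recording. A minimal sketch of that step, assuming generic per-frame feature vectors (the feature choice and names are illustrative, not the exact system from the talk):

```python
import numpy as np

def dtw_path(A, B):
    """A: (n, d), B: (m, d) feature frames -> monotonic list of aligned (i, j)."""
    n, m = len(A), len(B)
    # frame-to-frame cost: Euclidean distance
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i-1, j-1] + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    # trace back the cheapest alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i-1, j-1], D[i-1, j], D[i, j-1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The resulting path maps MIDI note times onto audio times, giving frame-level ground truth for free.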
3. Melody Clustering
• Goal: find 'fragments' that recur in melodies
  .. across a large music database
  .. trade data for model sophistication
• Data sources
  pitch tracker, or MIDI training data
• Melody fragment representation
  DCT(1:20) - removes average, smooths detail

[Figure: pipeline - Training data → Melody extraction → 5-second fragments → VQ clustering → Top clusters]

with Graham Poliner
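The DCT(1:20) representation is easy to sketch: keep DCT coefficients 1..20 of the pitch contour. Dropping coefficient 0 discards the average pitch (so the feature is transposition-invariant), and truncating the high coefficients smooths fine detail. A minimal version, assuming the contour is a per-frame pitch sequence (function names illustrative):

```python
import numpy as np

def dct_ii(x):
    """Plain (unnormalized) DCT-II of a 1-D signal."""
    n = len(x)
    k = np.arange(n)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return (x[None, :] * basis).sum(axis=1)

def fragment_features(contour, n_coef=20):
    """Melody contour (e.g. pitch in semitones per frame) -> DCT(1:20)."""
    c = dct_ii(np.asarray(contour, dtype=float))
    return c[1:n_coef + 1]   # drop coefficient 0 = the average
```

Fragments can then be compared or VQ-clustered in this 20-dimensional space.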
Melody clustering results
• Clusters match the underlying contour
• Finds some similarities: e.g. Pink + Nsync
4. Eigenrhythms: Drum Pattern Space
• Pop songs built on a repeating "drum loop"
  variations on a few bass, snare, hi-hat patterns
• Eigen-analysis (or ...) to capture variations?
  by analyzing lots of (MIDI) data, or from audio
• Applications
  music categorization
  "beat box" synthesis
  insight

with John Arroyo
Aligning the Data
• Need to align patterns prior to modeling...
  tempo (stretch): by inferring BPM & normalizing
  downbeat (shift): correlate against 'mean' template
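The two alignment steps can be sketched as follows: stretch each pattern to a common tempo (here, simple linear resampling to a fixed length), then circularly shift it to best correlate with a running 'mean' template. This is a minimal stand-in; the BPM inference itself is assumed done upstream, and the names are illustrative:

```python
import numpy as np

def normalize_tempo(pattern, target_len):
    """Linearly resample a 1-D onset-strength pattern to target_len samples."""
    src = np.linspace(0.0, 1.0, len(pattern))
    dst = np.linspace(0.0, 1.0, target_len)
    return np.interp(dst, src, pattern)

def align_downbeat(pattern, template):
    """Circularly shift pattern to maximize correlation with the template."""
    scores = [np.dot(np.roll(pattern, s), template) for s in range(len(pattern))]
    return np.roll(pattern, int(np.argmax(scores)))
```

After both steps, every training pattern lives on a common beat grid, ready for eigen-analysis.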
Eigenrhythms (PCA)
• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
• Eigenrhythms both add and subtract
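The eigen-analysis itself is standard PCA: stack the aligned drum patterns as rows, subtract the mean pattern, and take principal components via SVD. A minimal sketch (dimensions illustrative; the talk used 100 patterns × 1200 dims):

```python
import numpy as np

def eigenrhythms(patterns, n_components):
    """patterns: (n_patterns, n_dims) aligned drum patterns -> PCA basis."""
    mean = patterns.mean(axis=0)
    _, s, Vt = np.linalg.svd(patterns - mean, full_matrices=False)
    # mean pattern, top eigenrhythms, singular values
    return mean, Vt[:n_components], s

def project(pattern, mean, components):
    """Coordinates of one pattern in eigenrhythm space."""
    return components @ (pattern - mean)
```

Because the components are signed, reconstruction both adds and subtracts beat-weight, which is exactly the slide's point and what motivates the nonnegative variant next.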
Posirhythms (NMF)
• Nonnegative: only adds beat-weight
• Capturing some structure

[Figure: Posirhythms 1-6 - bass drum (BD), snare (SN), and hi-hat (HH) activation patterns over samples (@ 200 Hz) / beats 1 2 3 4 (@ 120 BPM)]
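The nonnegative counterpart can be sketched with standard multiplicative-update NMF (Lee & Seung style) on a nonnegative patterns-by-time matrix, so that components can only add beat-weight. This is a generic sketch, not the exact factorization used for the posirhythms:

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-9):
    """Factor nonnegative V (n x m) ~= W (n x r) @ H (r x m), W, H >= 0."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis patterns
    return W, H
```

The multiplicative updates preserve nonnegativity by construction, which is why each posirhythm reads directly as a playable pattern.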
Eigenrhythms for Classification
• Projections in Eigenspace / LDA space
• 10-way genre classification (nearest neighbor):
  PCA3: 20% correct
  LDA4: 36% correct

[Figure: PCA(1,2) projection (16% corr) and LDA(1,2) projection (33% corr) scatter plots; genres: blues, country, disco, hiphop, house, newwave, rock, pop, punk, rnb]

Eigenrhythm BeatBox
• Resynthesize rhythms from eigen-space
5. Event Extraction
• Music often contains many repeated events
  notes, drum sounds
  but: usually overlapped...
• Vector Quantization finds common patterns:
  representation...
  aligning/matching...
  how much coverage required?

[Figure: pipeline - Training data → Find alignments → Combine & re-estimate → Event dictionary]
Drum Track Extraction
• Initialize dictionary with Bass Drum, Snare
• Match only on a few spectral peaks
  narrowband energy most likely to avoid overlap
• Median filter to re-estimate template
  .. after normalizing amplitudes
  can pick up partials from common notes

with Ron Weiss, after Yoshii et al. '04
Generalized Event Detection
• Based on 'Shazam' audio fingerprints (Wang '03)
  relative timing of F1-F2-ΔT triples discriminates pieces
  narrowband features to avoid collision (again)
• Fingerprint events, not recordings:
  choose top triples, look for repeats
  rank reduction of triples × time matrix
2 Basic principle of operation

The recognition algorithm is described.

2.1 Combinatorial Hashing

Each audio file is "fingerprinted," a process in which reproducible hash tokens are extracted. A time-frequency analysis is performed, marking the coordinates of local maxima of a spectrogram (Figures 1A and 1B), thus reducing an audio file down to a relatively sparse set of time-frequency pairs. This reduces the search problem to one similar to astronavigation, in which a small patch of time-frequency constellation points must be quickly located within a large universe of points in a strip-chart universe with dimensions of bandlimited frequency versus nearly a billion seconds in the database. The peaks are chosen using a criterion to ensure that the density of chosen local peaks is within certain desired bounds so that the time-frequency strip for the audio file has reasonably uniform coverage. The peaks in each time-frequency locality are also chosen according to amplitude, with the justification that the highest-amplitude peaks are most likely to survive the distortions listed above.

Hashes are formed from the constellation map, in which pairs of time-frequency points are combinatorially associated. Anchor points are chosen, each anchor point having a target zone associated with it. Each anchor point is sequentially paired with points within its target zone, each pair yielding two frequency components plus the time difference between the points.
[Figures 1A and 1B: spectrograms with the selected local-maximum (constellation) points, time vs. frequency/Hz]
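The combinatorial hashing described in the excerpt can be sketched directly: pair each anchor peak with the peaks in a forward "target zone," emitting (f1, f2, Δt) triples. The zone size and fan-out cap here are illustrative constants, not Shazam's actual parameters:

```python
def hash_triples(peaks, max_dt=64, max_pairs=5):
    """peaks: time-sorted (t, f) pairs -> list of ((f1, f2, dt), t_anchor)."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break                      # beyond the target zone
            hashes.append(((f1, f2, dt), t1))
            paired += 1
            if paired >= max_pairs:
                break                      # cap fan-out per anchor
    return hashes
```

For event detection (as opposed to recording identification), the same triples are then clustered and their co-occurrence patterns inspected, per the next slide.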
with Michael Mandel
Event detection results
• Procedure
  find hash triples
  cluster them
  patterns in hash co-occurrence = events?

[Figure: detected hash-triple occurrences plotted over time]
Conclusions
• Lots of data + noisy transcription + weak clustering ⇒ musical insights?

[Figure: talk roadmap repeated - Music audio linked to Drums extraction → Eigen-rhythms; Event extraction; Melody extraction → Fragment clustering; Anchor models and Semantic bases → Similarity/recommendation; Synthesis/generation (?)]
Approaches to Chord Transcription
• Note transcription, then note→chord rules
  like labeling chords in MIDI transcripts
• Spectrum→chord rules
  i.e. find harmonic peaks, use knowledge of likely notes in each chord
• Trained classifier
  don't use any "expert knowledge"
  instead, learn patterns from labeled examples
• Train ASR HMMs with chords ≈ words

with Alex Sheh
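The feature underlying the chord models (per the results below) is a pitch class profile (PCP, or chroma). A minimal sketch of one frame's PCP, mapping FFT bin energies onto 12 pitch classes; the frame length, tuning reference, and frequency range are illustrative assumptions:

```python
import numpy as np

def pcp(frame, sr=11025, fmin=100.0, fmax=2000.0):
    """Map FFT bin energies of one audio frame onto 12 pitch classes (C=0)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, e in zip(freqs, spec):
        if fmin <= f <= fmax:
            # semitones above C (C4 ~ 261.63 Hz), wrapped to one octave
            pc = int(round(12 * np.log2(f / 261.63))) % 12
            chroma[pc] += e
    return chroma / (chroma.sum() + 1e-12)
```

Sequences of such 12-dimensional vectors are then fed to HMMs trained with chords standing in for words; the ROT variant in the results rotates profiles to a common root to share training data across keys.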
Chord Sequence Data Sources
• All we need are the chord sequences for our training examples
  Hal Leonard "Paperback Song Series"
  - manually retyped for 20 songs:
    "Beatles for Sale", "Help", "Hard Day's Night"
  - hand-aligned chords for 2 test examples

# The Beatles - A Hard Day's Night
#
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
Bm Em Bm G Em C D G Cadd9 G F6 G Cadd9 G
F6 G C D G C9 G D
G C7 G F6 G C7 G F6 G C D G C9 G Bm Em Bm
G Em C D
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
C9 G Cadd9 Fadd9

[Figure: spectrogram (freq/kHz vs. time/sec) of The Beatles - "A Hard Day's Night", 0-5 kHz, 0-25 s]
Chord Results
• Recognition weak, but forced-alignment OK
  MFCCs are poor (can overtrain); PCPs better
  (ROT helps generalization)
  (random ~3%)

Frame-level accuracy:

  Feature   Recognition   Alignment
  MFCC      8.7%          22.0%
  PCP_ROT   21.7%         76.0%

Example (16.27-24.84 sec):
  true:   E G D Bm G
  align:  E G D Bm G
  recog:  E G Bm Am Em7 Bm Em7

[Figure: pitch-class intensity (A through G#) over time, Beatles - Beatles For Sale - "Eight Days a Week" (4096 pt)]
What did the models learn?
• Chord model centers (means) indicate chord 'templates':

[Figure: PCP_ROT family model means (train18) - mean pitch-class profiles for the DIM, DOM7, MAJ, MIN, and MIN7 chord families, plotted over notes C D E F G A B C (for C-root chords)]