TRANSCRIPT
Learning from Music - Ellis 2004-12-18

What can we Learn from Large Music Databases?
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY USA

1. Learning Music
2. Music Similarity
3. Melody, Drums, Event extraction
4. Conclusions
Learning from Music
• A lot of music data available
  e.g. 60G of MP3 ≈ 1000 hr of audio / 15k tracks
• What can we do with it?
  implicit definition of 'music'
• Quality vs. quantity
  Speech recognition lesson: 10x data, 1/10th annotation, twice as useful
• Motivating applications
  music similarity / classification
  computer (assisted) music generation
  insight into music
Ground Truth Data
• A lot of unlabeled music data available
  manual annotation is much rarer
• Unsupervised structure discovery is possible
  .. but labels help to indicate what you want
• Weak annotation sources
  artist-level descriptions
  symbol sequences without timing (MIDI)
  errorful transcripts
• Evaluation requires ground truth
  limiting factor in Music IR evaluations?
[Figure: spectrogram of /Users/dpwe/projects/aclass/aimee.wav, 0-7000 Hz, 0:02-0:28, with segments hand-labeled 'mus' (music) and 'vox' (vocals)]
Talk Roadmap

[Figure: block diagram linking Music audio to the talk's numbered threads (1-6): Drums extraction → Eigen-rhythms; Event extraction; Melody extraction → Fragment clustering; Anchor models and Semantic bases → Similarity/recommendation; with Synthesis/generation as a possible further output (?)]
1. Music Similarity Browsing
• Musical information overload
  record companies filter/categorize music
  an automatic system would be less odious
• Connecting audio and preference
  map to a 'semantic space'?
[Figure: two audio inputs (Class i, Class j) are each converted to Anchor Space - GMM modeling gives posteriors p(a1|x), p(a2|x), ..., p(an|x), an n-dimensional vector in "Anchor Space" - followed by a Similarity Computation (KL-d, EMD, etc.)]

with Adam Berenzweig
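The anchor-space idea above can be sketched compactly. This is an illustrative stand-in, not the authors' exact pipeline: assume each anchor classifier yields per-frame log-likelihoods; normalizing gives posteriors p(a_k|x), a track is summarized by its mean posterior vector, and tracks are compared with a symmetrized KL divergence (all function names are hypothetical):

```python
import numpy as np

def anchor_posteriors(loglik):
    """loglik: (n_frames, n_anchors) log-likelihoods -> per-frame posteriors."""
    loglik = loglik - loglik.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(loglik)
    return p / p.sum(axis=1, keepdims=True)

def track_vector(loglik):
    """Summarize a track as its mean anchor-posterior vector."""
    return anchor_posteriors(loglik).mean(axis=0)

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two anchor distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

In practice the slide's "KL-d, EMD, etc." suggests several distances were tried over these anchor distributions; the mean-vector summary here is the simplest option.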
Anchor Space
• Frame-by-frame high-level categorizations
  compare to raw features?
  properties in distributions? dynamics?

[Figure: frames of madonna vs. bowie plotted as Cepstral Features (third vs. fifth cepstral coef) and as Anchor Space Features (Country vs. Electronica dimensions)]

'Playola' Similarity Browser
Semantic Bases
• What should the 'anchor' dimensions be?
  hand-chosen genres? X
  somehow choose automatically
• "Community metadata":
  use the Web to get words/phrases
  .. that are informative about artists
  .. and that can be predicted from audio
• Refine classifiers to below artist level, e.g. by EM?

with Brian Whitman
a indicates the number of frames in which a term classifier positively agrees with the truth value (both classifier and truth say a frame is 'funky,' for example). b indicates the number of frames in which the term classifier indicates a negative term association but the truth value indicates a positive association (the classifier says a frame is not 'funky,' but truth says it is). The value c is the number of frames in which the term classifier predicts a positive association but the truth is negative, and the value d is the number of frames in which the term classifier and truth agree on a negative association. We wish to maximize a and d as correct classifications; by contrast, random guessing by the classifier would give the same ratio of classifier labels regardless of ground truth, i.e. a/b ≈ c/d. With N = a + b + c + d, the K-L distance between the observed distribution and such random guessing is:
KL = (a/N) log( Na / ((a+b)(a+c)) )
   + (b/N) log( Nb / ((a+b)(b+d)) )
   + (c/N) log( Nc / ((a+c)(c+d)) )
   + (d/N) log( Nd / ((b+d)(c+d)) )    (3)
This measures the distance of the classifier away from a
degenerate distribution; we note that it is also the mu-
tual information (in bits, if the logs are taken in base 2)
between the classifier outputs and the ground truth labels
they attempt to predict.
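Equation (3) can be computed directly from the four confusion counts; a minimal sketch (the function name is illustrative), using base-2 logs so the result is in bits and the convention 0 log 0 = 0 for empty cells:

```python
import math

def kl_bits(a, b, c, d):
    """Mutual information (bits) between classifier output and truth, Eq. (3)."""
    n = a + b + c + d
    total = 0.0
    # each cell contributes (x/N) * log2(N*x / (truth_marginal * classifier_marginal))
    for x, row, col in [(a, a + b, a + c),
                        (b, a + b, b + d),
                        (c, c + d, a + c),
                        (d, c + d, b + d)]:
        if x > 0:  # 0 * log 0 -> 0 by convention
            total += (x / n) * math.log2(n * x / (row * col))
    return total
```

A perfect balanced classifier scores 1 bit; a classifier independent of the truth scores 0.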
Table 2 gives a selected list of well-performing term models. Given the difficulty of the task we are encouraged by the results. Not only do the results give us term models for audio, they also give us insight into which terms and descriptions work better for music understanding. These terms give us high semantic leverage without experimenter bias: the terms and performance were chosen automatically instead of from a list of genres.
7.3. Automatic review generation

Multiplying the term model c against the testing gram matrix returns a single value indicating that term's relevance to each time frame. This can be used in review generation as a confidence metric, perhaps setting a threshold to only allow high-confidence terms. The vector of term and confidence values for a piece of audio can also be fed into other similarity and learning tasks, or even a natural language generation system: one unexplored possibility for review generation is to borrow fully-formed sentences from actual reviews that use some number of terms predicted by the term models, and to form coherent paragraphs of reviews from this generic source data. Work in language generation and summarization is outside the scope of this article, but the results for the term prediction task and the review trimming task below are promising for these future directions.

adj Term     K-L bits   np Term                 K-L bits
aggressive   0.0034     reverb                  0.0064
softer       0.0030     the noise               0.0051
synthetic    0.0029     new wave                0.0039
punk         0.0024     elvis costello          0.0036
sleepy       0.0022     the mud                 0.0032
funky        0.0020     his guitar              0.0029
noisy        0.0020     guitar bass and drums   0.0027
angular      0.0016     instrumentals           0.0021
acoustic     0.0015     melancholy              0.0020
romantic     0.0014     three chords            0.0019

Table 2. Selected top-performing models of adjective and noun phrase terms used to predict new reviews of music, with their corresponding bits of information from the K-L distance measure.
One major caveat of our review learning model is its
time insensitivity. Although the feature space embeds time
at different levels, there is no model of intra-song changes
of term description (a loud song getting soft, for example)
and each frame in an album is labeled the same during
training. We are currently working on better models of
time representation in the learning task. Unfortunately,
the ground truth in the task is only at the album level and
we are also considering techniques to learn finer-grained
models from a large set of broad ones.
7.4. Review Regularization
Many problems of non-musical text and opinion or personal terms get in the way of full review understanding. A similarity measure trained on the frequencies of terms in a user-submitted review would likely be tripped up by obviously biased statements like "This record is awful" or "My mother loves this album." We look to the success of our grounded term models for insights into the musicality of description and develop a 'review trimming' system that summarizes reviews and retains only the most descriptive content. The trimmed reviews can then be fed into further textual understanding systems or read directly by the listener.
To trim a review we create a grounding sum operated on a sentence s of word length n,

g(s) = ( Σ_{i=0}^{n} P(a_i) ) / n    (4)

where a perfectly grounded sentence (in which the predictive qualities of each term on new music have 100% precision) scores 100%. This upper bound is virtually impossible in a grammatically correct sentence, and we usually see g(s) of {0.1% .. 10%}. The user sets a threshold and the system simply removes sentences under the threshold. See Table 3 for example sentences and their g(s). We see that the rate of sentence retrieval (how much of the review is kept) varies widely between the two review sources; AMG's reviews naturally have more musical content. See Figure 4 for recall rates at different thresholds of g(s).
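The trimming rule around Eq. (4) is simple enough to sketch. This is an illustrative stand-in: the probability table is a placeholder for the learned term models, and the function names are hypothetical:

```python
def grounding_score(sentence, term_prob):
    """g(s): mean groundedness P(a_i) of the words in a sentence."""
    words = sentence.lower().rstrip('.').split()
    if not words:
        return 0.0
    # unknown words contribute zero groundedness
    return sum(term_prob.get(w, 0.0) for w in words) / len(words)

def trim_review(sentences, term_prob, threshold=0.01):
    """Keep only sentences whose g(s) clears the user-set threshold."""
    return [s for s in sentences if grounding_score(s, term_prob) >= threshold]
```

A sentence full of musically grounded terms ("funky", "guitar") survives; "My mother loves this album" scores near zero and is dropped.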
2. Transcription as Classification
• Signal models typically used for transcription
  harmonic spectrum, superposition
• But ... trade domain knowledge for data
  transcription as a pure classification problem:
  single N-way discrimination for "melody"
  per-note classifiers for polyphonic transcription

[Figure: Audio → Trained classifier → p("C0"|Audio), p("C#0"|Audio), p("D0"|Audio), p("D#0"|Audio), p("E0"|Audio), p("F0"|Audio), ...]

with Graham Poliner
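The per-note formulation can be sketched with any frame classifier. The talk used SMO SVMs (Weka); here a plain logistic regression stands in to show the shape of the problem: one independent binary classifier per pitch, trained on labeled spectrogram frames (all names illustrative):

```python
import numpy as np

def train_note_classifier(X, y, lr=0.5, epochs=200):
    """X: (frames, features); y: 0/1 labels saying whether one pitch is active."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
        g = p - y                                 # gradient of log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def note_posteriors(X, models):
    """Per-frame p(note|audio) for every trained pitch classifier."""
    return {note: 1.0 / (1.0 + np.exp(-(X @ w + b)))
            for note, (w, b) in models.items()}
```

Polyphonic transcription then just thresholds each note's posterior independently per frame.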
Classifier Transcription Results
• Trained on MIDI syntheses (32 songs)
  SMO SVM (Weka)
• Tested on ISMIR MIREX 2003 set
  foreground/background separation

Frame-level pitch concordance:

  system    "jazz3"   overall
  fg+bg     71.5%     44.3%
  just fg   56.1%     45.4%
Forced-Alignment of MIDI
• MIDI is a handy description of music
  notes, instruments, tracks
  .. to drive synthesis
• Align MIDI 'replicas' to get GTruth for audio
  estimate time-warp relation

[Figure: "Don't you want me" (Human League), verse 1 - spectrograms (freq/kHz vs. time/sec) of the MIDI Replica and the Original, with the aligned MIDI notes (MIDI #) overlaid]

with Rob Turetsky
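Estimating the time-warp relation is classically done with dynamic time warping between features of the MIDI synthesis ("replica") and of the original recording. A minimal sketch of that step, assuming generic per-frame feature vectors (the feature choice and names are illustrative, not the exact system from the talk):

```python
import numpy as np

def dtw_path(A, B):
    """A: (n, d), B: (m, d) feature frames -> monotonic list of aligned (i, j)."""
    n, m = len(A), len(B)
    # frame-to-frame cost: Euclidean distance
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost[i-1, j-1] + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    # trace back the cheapest alignment path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i-1, j-1], D[i-1, j], D[i, j-1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

The resulting path maps MIDI note times onto audio times, giving frame-level ground truth for free.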
3. Melody Clustering
• Goal: find 'fragments' that recur in melodies
  .. across a large music database
  .. trade data for model sophistication
• Data sources
  pitch tracker, or MIDI training data
• Melody fragment representation
  DCT(1:20) - removes average, smooths detail

[Figure: pipeline - Training data → Melody extraction → 5-second fragments → VQ clustering → Top clusters]

with Graham Poliner
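The DCT(1:20) representation is easy to sketch: keep DCT coefficients 1..20 of the pitch contour. Dropping coefficient 0 discards the average pitch (so the feature is transposition-invariant), and truncating the high coefficients smooths fine detail. A minimal version, assuming the contour is a per-frame pitch sequence (function names illustrative):

```python
import numpy as np

def dct_ii(x):
    """Plain (unnormalized) DCT-II of a 1-D signal."""
    n = len(x)
    k = np.arange(n)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return (x[None, :] * basis).sum(axis=1)

def fragment_features(contour, n_coef=20):
    """Melody contour (e.g. pitch in semitones per frame) -> DCT(1:20)."""
    c = dct_ii(np.asarray(contour, dtype=float))
    return c[1:n_coef + 1]   # drop coefficient 0 = the average
```

Fragments can then be compared or VQ-clustered in this 20-dimensional space.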
Melody clustering results
• Clusters match the underlying contour
• Finds some similarities: e.g. Pink + Nsync
4. Eigenrhythms: Drum Pattern Space
• Pop songs built on a repeating "drum loop"
  variations on a few bass, snare, hi-hat patterns
• Eigen-analysis (or ...) to capture variations?
  by analyzing lots of (MIDI) data, or from audio
• Applications
  music categorization
  "beat box" synthesis
  insight

with John Arroyo
Aligning the Data
• Need to align patterns prior to modeling...
  tempo (stretch): by inferring BPM & normalizing
  downbeat (shift): correlate against 'mean' template
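The two alignment steps can be sketched as follows: stretch each pattern to a common tempo (here, simple linear resampling to a fixed length), then circularly shift it to best correlate with a running 'mean' template. This is a minimal stand-in; the BPM inference itself is assumed done upstream, and the names are illustrative:

```python
import numpy as np

def normalize_tempo(pattern, target_len):
    """Linearly resample a 1-D onset-strength pattern to target_len samples."""
    src = np.linspace(0.0, 1.0, len(pattern))
    dst = np.linspace(0.0, 1.0, target_len)
    return np.interp(dst, src, pattern)

def align_downbeat(pattern, template):
    """Circularly shift pattern to maximize correlation with the template."""
    scores = [np.dot(np.roll(pattern, s), template) for s in range(len(pattern))]
    return np.roll(pattern, int(np.argmax(scores)))
```

After both steps, every training pattern lives on a common beat grid, ready for eigen-analysis.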
Eigenrhythms (PCA)
• Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims)
• Eigenrhythms both add and subtract
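The eigen-analysis itself is standard PCA: stack the aligned drum patterns as rows, subtract the mean pattern, and take principal components via SVD. A minimal sketch (dimensions illustrative; the talk used 100 patterns × 1200 dims):

```python
import numpy as np

def eigenrhythms(patterns, n_components):
    """patterns: (n_patterns, n_dims) aligned drum patterns -> PCA basis."""
    mean = patterns.mean(axis=0)
    _, s, Vt = np.linalg.svd(patterns - mean, full_matrices=False)
    # mean pattern, top eigenrhythms, singular values
    return mean, Vt[:n_components], s

def project(pattern, mean, components):
    """Coordinates of one pattern in eigenrhythm space."""
    return components @ (pattern - mean)
```

Because the components are signed, reconstruction both adds and subtracts beat-weight, which is exactly the slide's point and what motivates the nonnegative variant next.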
Posirhythms (NMF)
• Nonnegative: only adds beat-weight
• Capturing some structure

[Figure: Posirhythms 1-6 - bass drum (BD), snare (SN), and hi-hat (HH) activation patterns over samples (@ 200 Hz) / beats 1 2 3 4 (@ 120 BPM)]
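The nonnegative counterpart can be sketched with standard multiplicative-update NMF (Lee & Seung style) on a nonnegative patterns-by-time matrix, so that components can only add beat-weight. This is a generic sketch, not the exact factorization used for the posirhythms:

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-9):
    """Factor nonnegative V (n x m) ~= W (n x r) @ H (r x m), W, H >= 0."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update basis patterns
    return W, H
```

The multiplicative updates preserve nonnegativity by construction, which is why each posirhythm reads directly as a playable pattern.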
Eigenrhythms for Classification
• Projections in Eigenspace / LDA space
• 10-way genre classification (nearest neighbor):
  PCA3: 20% correct
  LDA4: 36% correct

[Figure: PCA(1,2) projection (16% corr) and LDA(1,2) projection (33% corr) scatter plots; genres: blues, country, disco, hiphop, house, newwave, rock, pop, punk, rnb]

Eigenrhythm BeatBox
• Resynthesize rhythms from eigen-space
5. Event Extraction
• Music often contains many repeated events
  notes, drum sounds
  but: usually overlapped...
• Vector Quantization finds common patterns:
  representation...
  aligning/matching...
  how much coverage required?

[Figure: pipeline - Training data → Find alignments → Combine & re-estimate → Event dictionary]
Drum Track Extraction
• Initialize dictionary with Bass Drum, Snare
• Match only on a few spectral peaks
  narrowband energy most likely to avoid overlap
• Median filter to re-estimate template
  .. after normalizing amplitudes
  can pick up partials from common notes

with Ron Weiss, after Yoshii et al. '04
Generalized Event Detection
• Based on 'Shazam' audio fingerprints (Wang '03)
  relative timing of F1-F2-ΔT triples discriminates pieces
  narrowband features to avoid collision (again)
• Fingerprint events, not recordings:
  choose top triples, look for repeats
  rank reduction of triples × time matrix
2 Basic principle of operation

The recognition algorithm is described.

2.1 Combinatorial Hashing

Each audio file is "fingerprinted," a process in which reproducible hash tokens are extracted. A time-frequency analysis is performed, marking the coordinates of local maxima of a spectrogram (Figures 1A and 1B), thus reducing an audio file down to a relatively sparse set of time-frequency pairs. This reduces the search problem to one similar to astronavigation, in which a small patch of time-frequency constellation points must be quickly located within a large universe of points in a strip-chart universe with dimensions of bandlimited frequency versus nearly a billion seconds in the database. The peaks are chosen using a criterion to ensure that the density of chosen local peaks is within certain desired bounds so that the time-frequency strip for the audio file has reasonably uniform coverage. The peaks in each time-frequency locality are also chosen according to amplitude, with the justification that the highest-amplitude peaks are most likely to survive the distortions listed above.

Hashes are formed from the constellation map, in which pairs of time-frequency points are combinatorially associated. Anchor points are chosen, each anchor point having a target zone associated with it. Each anchor point is sequentially paired with points within its target zone, each pair yielding two frequency components plus the time difference between the points.
[Figures 1A and 1B: spectrograms with the selected local-maximum (constellation) points, time vs. frequency/Hz]
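The combinatorial hashing described in the excerpt can be sketched directly: pair each anchor peak with the peaks in a forward "target zone," emitting (f1, f2, Δt) triples. The zone size and fan-out cap here are illustrative constants, not Shazam's actual parameters:

```python
def hash_triples(peaks, max_dt=64, max_pairs=5):
    """peaks: time-sorted (t, f) pairs -> list of ((f1, f2, dt), t_anchor)."""
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break                      # beyond the target zone
            hashes.append(((f1, f2, dt), t1))
            paired += 1
            if paired >= max_pairs:
                break                      # cap fan-out per anchor
    return hashes
```

For event detection (as opposed to recording identification), the same triples are then clustered and their co-occurrence patterns inspected, per the next slide.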
with Michael Mandel
Event detection results
• Procedure
  find hash triples
  cluster them
  patterns in hash co-occurrence = events?

[Figure: detected hash-triple occurrences plotted over time]
Conclusions
• Lots of data + noisy transcription + weak clustering ⇒ musical insights?

[Figure: talk roadmap repeated - Music audio linked to Drums extraction → Eigen-rhythms; Event extraction; Melody extraction → Fragment clustering; Anchor models and Semantic bases → Similarity/recommendation; Synthesis/generation (?)]
Approaches to Chord Transcription
• Note transcription, then note→chord rules
  like labeling chords in MIDI transcripts
• Spectrum→chord rules
  i.e. find harmonic peaks, use knowledge of likely notes in each chord
• Trained classifier
  don't use any "expert knowledge"
  instead, learn patterns from labeled examples
• Train ASR HMMs with chords ≈ words

with Alex Sheh
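The feature underlying the chord models (per the results below) is a pitch class profile (PCP, or chroma). A minimal sketch of one frame's PCP, mapping FFT bin energies onto 12 pitch classes; the frame length, tuning reference, and frequency range are illustrative assumptions:

```python
import numpy as np

def pcp(frame, sr=11025, fmin=100.0, fmax=2000.0):
    """Map FFT bin energies of one audio frame onto 12 pitch classes (C=0)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    chroma = np.zeros(12)
    for f, e in zip(freqs, spec):
        if fmin <= f <= fmax:
            # semitones above C (C4 ~ 261.63 Hz), wrapped to one octave
            pc = int(round(12 * np.log2(f / 261.63))) % 12
            chroma[pc] += e
    return chroma / (chroma.sum() + 1e-12)
```

Sequences of such 12-dimensional vectors are then fed to HMMs trained with chords standing in for words; the ROT variant in the results rotates profiles to a common root to share training data across keys.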
Chord Sequence Data Sources
• All we need are the chord sequences for our training examples
  Hal Leonard "Paperback Song Series"
  - manually retyped for 20 songs:
    "Beatles for Sale", "Help", "Hard Day's Night"
  - hand-aligned chords for 2 test examples

# The Beatles - A Hard Day's Night
#
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
Bm Em Bm G Em C D G Cadd9 G F6 G Cadd9 G
F6 G C D G C9 G D
G C7 G F6 G C7 G F6 G C D G C9 G Bm Em Bm
G Em C D
G Cadd9 G F6 G Cadd9 G F6 G C D G C9 G
C9 G Cadd9 Fadd9

[Figure: spectrogram (freq/kHz vs. time/sec) of The Beatles - "A Hard Day's Night", 0-5 kHz, 0-25 s]
Chord Results
• Recognition weak, but forced-alignment OK
  MFCCs are poor (can overtrain); PCPs better
  (ROT helps generalization)
  (random ~3%)

Frame-level accuracy:

  Feature   Recognition   Alignment
  MFCC      8.7%          22.0%
  PCP_ROT   21.7%         76.0%

Example (16.27-24.84 sec):
  true:   E G D Bm G
  align:  E G D Bm G
  recog:  E G Bm Am Em7 Bm Em7

[Figure: pitch-class intensity (A through G#) over time, Beatles - Beatles For Sale - "Eight Days a Week" (4096 pt)]
What did the models learn?
• Chord model centers (means) indicate chord 'templates':

[Figure: PCP_ROT family model means (train18) - mean pitch-class profiles for the DIM, DOM7, MAJ, MIN, and MIN7 chord families, plotted over notes C D E F G A B C (for C-root chords)]