
Proc. Digital del Continguts Musicals

Session 1: Pattern Recognition

1. Music Content Analysis
2. Pattern Classification
3. The Statistical Approach
4. Distribution Models
5. Singing Detection

Dan Ellis <[email protected]>
http://www.ee.columbia.edu/~dpwe/muscontent/

Laboratory for Recognition and Organization of Speech and Audio
Columbia University, New York

Spring 2003


Music Content Analysis

• Music contains information at many levels

- what is it?

• We’d like to get this information out automatically

- fine-level transcription of events
- broad-level classification of pieces

• Information extraction can be framed as pattern classification / recognition, or machine learning

- build systems based on (labeled) training data



Music analysis

• What might we want to get out of music?

• Instrument identification

- different levels of specificity
- ‘registers’ within instruments

• Score recovery

- transcribe the note sequence
- extract the ‘performance’

• Ensemble performance

- ‘gestalts’: chords, tone colors

• Broader timescales

- phrasing & musical structure
- artist / genre clustering and classification


Course Outline

• Session 1: Pattern Classification

- MLP Neural Networks
- GMM distribution models
- Singing detection

• Session 2: Sequence Recognition

- The Hidden Markov Model
- Chord sequence recognition

• Session 3: Musical Applications

- Note transcription
- Music summarization
- Music Information Retrieval
- Similarity browsing


Outline

1. Music Content Analysis

2. Pattern Classification
   - classification
   - decision boundaries
   - neural networks

3. The Statistical Approach

4. Distribution Models

5. Singing Detection


Pattern Classification

• Classification means finding categorical (discrete) labels for real-world (continuous) observations

• Why?

- information extraction
- conditional processing
- detect exceptions

  [Figure: spectrogram (f/Hz vs. time/s) of a spoken fragment, and the same frames plotted as points in formant space (F2/Hz vs. F1/Hz)]


Building a classifier

• Define classes/attributes

- could state explicit rules
- better to define through ‘training’ examples

• Define feature space

• Define decision algorithm

- set parameters from examples

• Measure performance

- calculate (weighted) error rate

  [Figure: Pols vowel formants, "u" (x) vs. "o" (o), plotted as F2/Hz against F1/Hz]


Classification system parts

Sensor → signal → Pre-processing/segmentation → segment → Feature extraction → feature vector → Classification → class → Post-processing

- Pre-processing/segmentation: e.g. STFT, locate vowels
- Feature extraction: e.g. formant extraction
- Post-processing: context constraints, costs/risk


Feature extraction

• Right features are critical

- waveform vs. formants vs. cepstra
- invariance under irrelevant modifications

• Theoretically equivalent features may act very differently in practice

- representations make important aspects explicit
- remove irrelevant information

• Feature design incorporates ‘domain knowledge’

- more data → less need for ‘cleverness’?

• Smaller ‘feature space’ (fewer dimensions) → simpler models (fewer parameters) → less training data needed → faster training


Minimum distance classification

• Find closest match (nearest neighbor)

- choice of distance metric D(x, y) is important

• Find representative match (class prototypes)

- class data {yω,i} → class model zω

  [Figure: Pols vowel formants (F2/Hz vs. F1/Hz) for "u" (x), "o" (o) and "a" (+), with class prototypes (*) and a new observation x]

- nearest neighbor: choose the class minimizing D(x, yω,i)
- class prototype: choose the class minimizing D(x, zω)
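As a concrete illustration of the prototype rule, here is a minimal numpy sketch (not from the original course materials; the data values and function names are made up): fit one prototype per class as the mean of its training examples, then assign a new observation to the nearest prototype.

    import numpy as np

    def fit_prototypes(X, labels):
        # class prototypes zw = mean of the training examples {yw,i} of each class
        classes = sorted(set(labels))
        Z = np.array([X[np.asarray(labels) == c].mean(axis=0) for c in classes])
        return classes, Z

    def classify_min_distance(x, classes, Z):
        # assign x to the class whose prototype is nearest in Euclidean distance
        d2 = ((Z - x) ** 2).sum(axis=1)
        return classes[int(np.argmin(d2))]

    # toy 2-D formant-like data for two vowel classes
    X = np.array([[300.0, 700.0], [320.0, 650.0], [500.0, 900.0], [520.0, 950.0]])
    y = ["u", "u", "o", "o"]
    classes, Z = fit_prototypes(X, y)
    print(classify_min_distance(np.array([310.0, 680.0]), classes, Z))   # -> 'u'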


Linear classifiers

• Minimum Euclidean distance is equivalent to a linear discriminant function:

- examples {yω,i} → template zω = mean{yω,i}
- observed feature vector x = [x1 x2 ...]ᵀ
- Euclidean distance metric:

  D²[x, y] = (x − y)ᵀ(x − y)

- minimum distance rule:

  class = argmin_i D[x, zi] = argmin_i D²[x, zi]

- i.e. choose class O over U if

  D²[x, zO] < D²[x, zU]
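A short sketch of the equivalence (illustrative only): expanding D²[x, zi] = xᵀx − 2ziᵀx + ziᵀzi, the xᵀx term is common to all classes, so minimizing the distance is the same as maximizing the linear discriminant gi(x) = ziᵀx − ½ziᵀzi.

    import numpy as np

    def min_distance_class(x, Z):
        # argmin over rows zi of (x - zi)^T (x - zi)
        return int(np.argmin(((Z - x) ** 2).sum(axis=1)))

    def linear_discriminant_class(x, Z):
        # argmax of gi(x) = zi^T x - 0.5 * zi^T zi; picks the same class
        g = Z @ x - 0.5 * (Z * Z).sum(axis=1)
        return int(np.argmax(g))

    Z = np.array([[350.0, 800.0], [550.0, 950.0]])   # templates zO, zU (toy values)
    x = np.array([400.0, 820.0])
    assert min_distance_class(x, Z) == linear_discriminant_class(x, Z)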


Linear classification boundaries

• Min-distance divides normal to class centers:

  i.e. choose O over U when (x − zU)ᵀ(zO − zU) / |zO − zU|  >  ½ |zO − zU|

  [Figure: the decision boundary is the perpendicular bisector of the line joining zU and zO]

• Scaling axes changes the boundary:

  [Figure: the same points squeezed horizontally or squeezed vertically give a new boundary in each case]


Decision boundaries

• Linear functions give linear boundaries
- planes and hyperplanes as dimensions increase

• Linear boundaries can become curves under feature transformations

- e.g. a linear discriminant in x′ = [x1² x2² ...]ᵀ

• What about more complex boundaries?

- i.e. if a linear discriminant is Σi wi·xi, how about a nonlinear function F[Σi wi·xi]?
- that changes only the threshold, but Σj F[Σi wij·xi] offers more flexibility, and F[Σj wjk · F[Σi wij·xi]] has even more ...


Neural networks

• Sums over nonlinear functions of sums → large range of decision surfaces

• e.g. Multi-layer perceptron (MLP) with 1 hidden layer:

  yk = F[ Σj wjk · F[ Σi wij·xi ] ]

  [Figure: MLP with input layer (x1, x2, x3), hidden layer (h1, h2) and output layer (y1, y2); weights wij into the hidden layer, wjk into the output layer, nonlinearity F[·] at each unit]

• Problem is finding the weights wij ... (training)


Back-propagation training

• Find wij to minimize e.g.

  E = Σn || y(xn) − tn ||²

  for a training set of patterns {xn} and desired (target) outputs {tn}

• Differentiate error with respect to weights
- contribution from each pattern xn, tn (with yk = F[ Σj wjk·hj ]):

  ∂E/∂wjk = ∂E/∂y · ∂y/∂Σ · ∂(Σj wjk hj)/∂wjk = 2(y − t) · F′[Σ] · hj

• i.e. gradient descent with learning rate α:

  wjk ← wjk − α · ∂E/∂wjk


Neural net example

• 2 input units (normalized F1, F2)

• 5 hidden units, 3 output units (“U”, “O”, “A”)

• Sigmoid nonlinearity:

  F[x] = 1 / (1 + e^(−x))   ⇒   dF/dx = F·(1 − F)

  [Figure: the net mapping (F1, F2) to outputs “U”, “O”, “A”; plot of sigm(x) and its derivative]
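A minimal numpy sketch of such a net and its gradient-descent training (a 2:5:3 layout like the example above, sigmoid units, squared error, no bias terms; the toy data and all names here are made up, and this is not the original training code):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    # 2:5:3 net: 2 inputs, 5 hidden units, 3 outputs
    W1 = rng.normal(scale=0.1, size=(2, 5))    # wij, input -> hidden
    W2 = rng.normal(scale=0.1, size=(5, 3))    # wjk, hidden -> output
    alpha = 0.1                                # learning rate

    def train_step(X, T):
        # one batch gradient-descent step on E = sum ||y - t||^2
        global W1, W2
        H = sigm(X @ W1)                       # hidden activations hj
        Y = sigm(H @ W2)                       # outputs yk = F[sum_j wjk hj]
        dY = 2 * (Y - T) * Y * (1 - Y)         # 2(y - t) * F'[sum], with F' = F(1 - F)
        dH = (dY @ W2.T) * H * (1 - H)         # back-propagate to the hidden layer
        W2 -= alpha * H.T @ dY
        W1 -= alpha * X.T @ dH
        return np.mean((Y - T) ** 2)

    # toy training set: 2-D inputs, one-hot targets for 3 classes
    X = rng.normal(size=(30, 2))
    T = np.eye(3)[rng.integers(0, 3, size=30)]
    for epoch in range(100):
        mse = train_step(X, T)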


Neural net training

example...

  [Figure: 2:5:3 net, mean squared error by training epoch (100 epochs); decision contours after 10 and after 100 training iterations]


Outline

1. Music Content Analysis

2. Pattern Classification

3. The Statistical Approach
   - conditional distributions
   - Bayes rule

4. Distribution Models

5. Singing Detection


The Statistical Approach

• Observations are random variables whose distribution depends on the class:

• Source distributions p(x|ωi)

- reflect variability in feature
- reflect noise in observation
- generally have to be estimated from data (rather than known in advance)

  [Figure: hidden discrete class ωi and continuous observation x, linked by p(x|ωi) and Pr(ωi|x); sketch of overlapping source distributions p(x|ωi) for ω1 ... ω4]


Random Variables review

• Random vars have joint distributions (pdf’s):

- marginal:

  p(x) = ∫ p(x, y) dy

- covariance:

  Σ = E[ (o − µ)(o − µ)ᵀ ] = [ σx²  σxy ; σxy  σy² ]

  [Figure: joint pdf p(x, y) with its marginals p(x) and p(y), means µx, µy and spreads σx, σy, σxy]


Conditional probability

• Knowing one value in a joint distribution constrains the remainder:

  [Figure: joint pdf p(x, y); the slice at y = Y gives the conditional p(x | y = Y)]

• Bayes’ rule:

  p(x, y) = p(x|y) · p(y)   ⇒   p(y|x) = p(x|y) · p(y) / p(x)

→ can reverse conditioning given priors/marginal
- either term can be discrete or continuous


Priors and posteriors

• Bayesian inference can be interpreted as updating prior beliefs with new information, x:

• Posterior is prior scaled by likelihood & normalized by evidence (so Σ(posteriors) = 1)

• Objection: priors are often unknown
- but omitting them amounts to assuming they are all equal

  Bayes’ Rule:

  Pr(ωi|x) = Pr(ωi) · p(x|ωi) / Σj [ p(x|ωj) · Pr(ωj) ]

  posterior = prior × likelihood / evidence,  where ‘evidence’ = p(x)


Bayesian (MAP) classifier

• Minimize the probability of error by choosing the maximum a posteriori (MAP) class:

  ω̂ = argmax_ωi Pr(ωi|x)

• Intuitively right - choose the most probable class in light of the observation

• Given models for each distribution p(x|ωi), the search for ω̂ = argmax_ωi Pr(ωi|x) becomes

  ω̂ = argmax_ωi [ p(x|ωi) · Pr(ωi) / Σj p(x|ωj) · Pr(ωj) ]

  but the denominator = p(x) is the same over all ωi, hence

  ω̂ = argmax_ωi p(x|ωi) · Pr(ωi)
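A tiny sketch of the MAP rule (assuming, for illustration, 1-dimensional Gaussian class-conditional models and known priors; the numbers are made up):

    import numpy as np
    from scipy.stats import norm

    means  = np.array([300.0, 500.0])   # class-conditional models p(x|wi): mean of each class
    stds   = np.array([40.0, 60.0])     # ... and its standard deviation
    priors = np.array([0.3, 0.7])       # Pr(wi)

    def map_class(x):
        # MAP rule: argmax_i p(x|wi) * Pr(wi)
        likelihoods = norm.pdf(x, loc=means, scale=stds)
        return int(np.argmax(likelihoods * priors))

    print(map_class(380.0))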


Practical implementation

• Optimal classifier is

  ω̂ = argmax_ωi Pr(ωi|x)

  but we don’t know Pr(ωi|x)

• Can model it directly, e.g. train a neural net to map from inputs x to a set of outputs Pr(ωi)
- a discriminative model

• Often easier to model the conditional distributions p(x|ωi), then use Bayes’ rule to find the MAP class

  [Diagram: labeled training examples {xn, ωxn} → sort according to class → estimate conditional pdf for class ω1 → p(x|ω1)]


Outline

1. Music Content Analysis

2. Pattern Classification

3. The Statistical Approach

4. Distribution Models
   - Gaussian models
   - Gaussian mixtures

5. Singing Detection


Distribution Models

• Easiest way to model distributions is via a parametric model
- assume known form, estimate a few parameters

• Gaussian model is simple & useful:

  in 1 dim:

  p(x|ωi) = 1/(√(2π)·σi) · exp[ −½ ((x − µi)/σi)² ]

  (the 1/(√(2π)·σi) factor is normalization to make it sum to 1)

• Parameters mean µi and variance σi² → fit

  [Figure: p(x|ωi) peaks at x = µi and falls to e^(−1/2) of the peak value at µi ± σi]


Gaussians in d dimensions:

• Described by a d-dimensional mean vector µi and a d x d covariance matrix Σi

• Classify by maximizing log likelihood i.e.

  p(x|ωi) = 1/√((2π)^d |Σi|) · exp[ −½ (x − µi)ᵀ Σi⁻¹ (x − µi) ]

  [Figure: 2-D Gaussian pdf shown as a surface and as elliptical contours]

  ω̂ = argmax_ωi [ −½ (x − µi)ᵀ Σi⁻¹ (x − µi) − ½ log|Σi| + log Pr(ωi) ]
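A minimal sketch of this rule with per-class Gaussians fit by sample mean and covariance (scipy's multivariate_normal supplies the log-likelihood; the function names are illustrative, not from the original materials):

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_gaussians(X, labels):
        # per-class mean vector, covariance matrix and prior Pr(wi)
        classes = sorted(set(labels))
        labels = np.asarray(labels)
        params = []
        for c in classes:
            Xc = X[labels == c]
            params.append((Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X)))
        return classes, params

    def classify(x, classes, params):
        # argmax_i [ log p(x|wi) + log Pr(wi) ]
        scores = [multivariate_normal.logpdf(x, mean=mu, cov=S) + np.log(prior)
                  for mu, S, prior in params]
        return classes[int(np.argmax(scores))]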


Gaussian Mixture models (GMMs)

• Single Gaussians cannot model
- distributions with multiple modes
- distributions with nonlinear correlation

• What about a weighted sum?

  i.e.  p(x) ≈ Σk ck · p(x|mk)

  where {ck} is a set of weights and {p(x|mk)} is a set of Gaussian components

- can fit anything given enough components

• Interpretation: each observation is generated by one of the Gaussians, chosen at random with priors ck = Pr(mk)
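The generative interpretation in code, as a small sketch with made-up 1-D parameters: pick a component k with probability ck, then draw from that Gaussian.

    import numpy as np

    rng = np.random.default_rng(0)

    ck   = np.array([0.5, 0.3, 0.2])    # mixture weights ck = Pr(mk)
    muk  = np.array([0.0, 3.0, 6.0])    # component means
    sigk = np.array([1.0, 0.5, 1.5])    # component standard deviations

    def sample_gmm(n):
        # choose a component for each sample, then draw from that Gaussian
        k = rng.choice(len(ck), size=n, p=ck)
        return rng.normal(muk[k], sigk[k])

    x = sample_gmm(1000)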


Gaussian mixtures (2)

• e.g. nonlinear correlation:

• Problem: finding ck and mk parameters

- easy if we knew which mk generated each x

  [Figure: 2-D data with nonlinear correlation, the fitted Gaussian components, and the resulting mixture surface]


Expectation-maximization (EM)

• General procedure for estimating a model when some parameters are unknown
- e.g. which component generated a value in a Gaussian mixture model

• Procedure: iteratively update fixed model parameters Θ to maximize Q = expected value of the log-probability of the known training data xtrn and the unknown parameters u:

  Q(Θ, Θold) = Σu Pr(u | xtrn, Θold) · log p(u, xtrn | Θ)

- iterative because Pr(u | xtrn, Θold) changes each time we change Θ
- can prove this increases the data likelihood, hence maximum-likelihood (ML) model
- local optimum - depends on initialization


Fitting GMMs with EM

• Want to find the component Gaussians mk = {µk, Σk} plus weights ck = Pr(mk) to maximize the likelihood of the training data xtrn

• If we could assign each x to a particular mk, estimation would be direct

• Hence, treat the ks (mixture indices) as unknown, form Q = E[ log p(x, k | Θ) ], and differentiate with respect to the model parameters → equations for µk, Σk and ck to maximize Q...


GMM EM update equations:

• Solve for maximizing Q:

  µk = Σn p(k|xn,Θ)·xn / Σn p(k|xn,Θ)

  Σk = Σn p(k|xn,Θ)·(xn − µk)(xn − µk)ᵀ / Σn p(k|xn,Θ)

  ck = (1/N) Σn p(k|xn,Θ)

• Each involves p(k|xn,Θ), the ‘fuzzy membership’ of xn to Gaussian k

• Each parameter is just a sample average, weighted by the ‘fuzzy membership’
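A compact numpy sketch of these updates for a 1-dimensional mixture (so Σk reduces to a scalar variance); the initialization and iteration count are arbitrary, and this is an illustration rather than a production EM implementation:

    import numpy as np
    from scipy.stats import norm

    def em_gmm_1d(x, K, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        N = len(x)
        mu = rng.choice(x, size=K, replace=False)     # initialize means from the data
        var = np.full(K, np.var(x))
        c = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            # E-step: 'fuzzy membership' p(k|xn,Theta), shape (N, K)
            p = c * norm.pdf(x[:, None], loc=mu, scale=np.sqrt(var))
            p /= p.sum(axis=1, keepdims=True)
            # M-step: membership-weighted sample averages, as in the equations above
            Nk = p.sum(axis=0)
            mu = (p * x[:, None]).sum(axis=0) / Nk
            var = (p * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
            c = Nk / N
        return c, mu, var

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.5, 200)])
    print(em_gmm_1d(x, K=2))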


GMM examples

• Vowel data fit with different mixture counts:

example...

  [Figure: vowel data (F2/Hz vs. F1/Hz) fit with 1, 2, 3 and 4 Gaussians; log p(x) = -1911, -1864, -1849 and -1840 respectively]


Methodological Remarks: Training and test data

• A rich model can learn every training example (overtraining)

• But the goal is to classify new, unseen data, i.e. generalization
- sometimes use a ‘cross validation’ set to decide when to stop training

• For evaluation results to be meaningful:
- don’t test with training data!
- don’t train on test data (even indirectly...)

  [Figure: error rate vs. training time or number of parameters; training-data error keeps falling while test-data error eventually rises again (overfitting)]


Model complexity

• More model parameters → better fit
- more Gaussian mixture components
- more MLP hidden units

• More training data → can use larger models

• For best generalization (no overfitting), there will be some optimal model size:

  [Figure: word error rate (WER%) for PLP12N-8k nets vs. hidden-layer size (500-4000 units) and training-set size (9.25-74 hours); more training data supports more parameters, with an optimal parameter/data ratio along a constant-training-time trade-off]


Outline

1. Music Content Analysis

2. Pattern Classification

3. The Statistical Approach

4. Distribution Models

5. Singing Detection
   - Motivation
   - Features
   - Classifiers


Singing Detection (Berenzweig et al. ’01)

• Can we automatically detect when singing is present?

- for further processing (lyrics recognition?)
- as a song signature?
- as a basis for classification?

  [Figure: spectrogram (0-7 kHz, ~30 s) of /Users/dpwe/projects/aclass/aimee.wav with hand-marked ‘mus’ and ‘vox’ segments]


Singing Detection: Requirements

• Labelled training examples
- 60 x 15 sec. radio excerpts
- hand-mark sung phrases

• Labelled test data
- several complete tracks from CDs, hand-labelled

• Feature choice
- Mel-frequency Cepstral Coefficients (MFCCs): popular for speech; maybe sung voice too?
- separation of voices?
- temporal dimension

• Classifier choice
- MLP Neural Net
- GMMs for singing / music
- SVM?

  [Figure: spectrograms (0-10 kHz) of training excerpts trn/mus/3, trn/mus/8, trn/mus/19 and trn/mus/28, with hand-labelled ‘vox’ regions]


MLP Neural Net

• Directly estimate p(singing | x)

- net has 26 inputs (+∆), 15 HUs, 2 o/ps (26:15:2)

• How many hidden units?
- depends on data amount, boundary complexity

• Feature context window?
- useful in speech

• Delta features?
- useful in speech

• Training parameters...

  [Diagram: music → MFCC calculation → C0, C1, ..., C12 (+∆) → MLP → “singing” / “not singing”]
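A rough sketch of such a frame-level pipeline using librosa for MFCCs and scikit-learn's MLPClassifier (the original system used netlab; the stand-in signals, the 16 kHz rate and all names below are assumptions for illustration only):

    import numpy as np
    import librosa
    from sklearn.neural_network import MLPClassifier

    def frame_features(y, sr):
        # 13 MFCCs plus their deltas per frame -> 26-dim feature vectors
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return np.vstack([mfcc, librosa.feature.delta(mfcc)]).T

    sr = 16000
    # stand-ins for labelled excerpts: noise for 'not singing', a tone for 'singing'
    X_mus = frame_features(np.random.randn(5 * sr), sr)
    X_vox = frame_features(np.sin(2 * np.pi * 220 * np.arange(5 * sr) / sr), sr)
    X = np.vstack([X_mus, X_vox])
    y = np.concatenate([np.zeros(len(X_mus)), np.ones(len(X_vox))])

    # 26:15:2-style net: one hidden layer of 15 logistic (sigmoid) units
    net = MLPClassifier(hidden_layer_sizes=(15,), activation='logistic', max_iter=500)
    net.fit(X, y)
    p_singing = net.predict_proba(X)[:, 1]    # frame-level p(singing | x)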


MLP Results

• Raw net outputs on a CD track (FA 74.1%):

• Smoothed for continuity: best FA = 90.5%

  [Figure: Aimee Mann, “Build That Wall”, with hand-labelled vox; spectrogram and frame-level p(singing) from a 26:15:1 netlab net on 16 ms frames, raw (FA = 74.1% @ thr = 0.5) and smoothed]
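The smoothing step can be as simple as convolving the frame posteriors with a normalized window before thresholding; a sketch, assuming a Hann window like the 61-point one quoted later for the GMM system (variable names are illustrative):

    import numpy as np

    def smooth_posteriors(p, win_len=61):
        # smooth p(singing) over ~1 s of 16 ms frames with a normalized Hann window
        w = np.hanning(win_len)
        return np.convolve(p, w / w.sum(), mode='same')

    def frame_accuracy(p, labels, thr=0.5):
        # fraction of frames where the thresholded posterior matches the hand labels
        return np.mean((p > thr) == labels.astype(bool))

    # p_singing: raw per-frame p(singing); labels: hand-marked 0/1 per frame
    # fa_raw    = frame_accuracy(p_singing, labels)
    # fa_smooth = frame_accuracy(smooth_posteriors(p_singing), labels)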


GMM System

• Separate models for p(x|sing), p(x|no sing)
- combined via likelihood ratio test

• How many Gaussians for each?
- say 20; depends on data & complexity

• What kind of covariance?
- diagonal (spherical?)

  [Diagram: music → MFCC calculation → C0, C1, ..., C12 → GMM1: p(X|“singing”) and GMM2: p(X|“no singing”) → log-likelihood ratio test on log p(X|“singing”) − log p(X|“not”) → singing?]
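A sketch of the two-model arrangement with scikit-learn's GaussianMixture (20 diagonal-covariance components per class, as suggested above; the feature matrices and names are assumed):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_gmms(X_sing, X_nosing, n_components=20):
        # one GMM per class: p(x|singing) and p(x|no singing)
        g_sing   = GaussianMixture(n_components, covariance_type='diag').fit(X_sing)
        g_nosing = GaussianMixture(n_components, covariance_type='diag').fit(X_nosing)
        return g_sing, g_nosing

    def log_lhood_ratio(X, g_sing, g_nosing):
        # per-frame log p(x|singing) - log p(x|no singing); threshold to decide
        return g_sing.score_samples(X) - g_nosing.score_samples(X)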


GMM Results

• Raw and smoothed results (Best FA=84.9%):

• MLP has advantage of discriminant training

• Each GMM trains only on data subset → faster to train? (2 x 10 min vs. 20 min)

  [Figure: Aimee Mann, “Build That Wall”, with hand-labelled vox; log-likelihood ratio from a 26-d 20-mixture GMM on 16 ms frames, raw (FA = 65.8% @ thr = 0) and smoothed by a 61-point (1 sec) Hann window (FA = 84.9% @ thr = −0.8)]


Summary

• Music content analysis: pattern classification

• Basic machine learning methods: Neural Nets, GMMs

• Singing detection: classic application

but... the time dimension?


References

A.L. Berenzweig and D.P.W. Ellis (2001), “Locating Singing Voice Segments within Music Signals”, Proc. IEEE Workshop on Apps. of Sig. Proc. to Acous. and Audio, Mohonk NY, October 2001. http://www.ee.columbia.edu/~dpwe/pubs/waspaa01-singing.pdf

R.O. Duda, P.E. Hart and D.G. Stork (2001), Pattern Classification, 2nd Ed., Wiley, 2001.

E. Scheirer and M. Slaney (1997), “Construction and evaluation of a robust multifeature speech/music discriminator”, Proc. IEEE ICASSP, Munich, April 1997. http://www.ee.columbia.edu/~dpwe/e6820/papers/ScheiS97-mussp.pdf