
Post on 15-Jan-2016


1Institute of Information Science Academia Sinica

Singer Identification and Clustering of Popular Music Recordings

Wei-Ho Tsai (wesley@iis.sinica.edu.tw)

Institute of Information Science, Academia Sinica


Extracting Information From Music

Music Information Retrieval (MIR)
– To develop ways of managing collections of musical material for preservation, access, research, and other uses.

MIR communities & research areas [after Futrelle & Downie, 2002]

Communities
– Computer Science
– Audio Engineering
– Psychology & Philosophy
– Library Science
– Law
– Musicology

Research Areas
– Representation
– Indexing
– User Interface Design
– Compression
– Feature Detection
– Machine Learning
– Metadata
– Musical Analysis
– Epistemology & Ontology
– Perception
– Intellectual Property
– Classification


Extracting Voice Information From Music

Viewing MIR from a speech-processing perspective

Speech Processing
– Analysis/Synthesis
– Recognition
• Speech Recognition: Phone Recognition, Word Recognition, Tone Recognition
• Speaker Recognition: Speaker Identification, Speaker Verification, Speaker Clustering
• Language Recognition: Language Identification, Language Verification, Dialect Identification
– Coding

Singing Processing
– Analysis/Synthesis
– Recognition
• "Singing Recognition": Phone Recognition, Lyric Transcription, Melody Extraction
• Singer Recognition: Singer Identification, Singer Detection, Singer Clustering
• Language Recognition: Language Identification, Language Verification, Dialect Identification
– Coding

Speech/Singing/Music/Other Sounds Discrimination


Singer Recognition Tasks (I)

Singer Identification
– Determining who is singing

Who performed this music recording?


Singer Recognition Tasks (II)

Singer Detection
– Determining whether or not a specified singer is present in a music recording

Is Coco's voice in this music recording?


Singer Recognition Tasks (III)

Singer Tracking
– Locating where a specified singer is present in a music recording

Where is Coco's voice?


Singer Recognition Tasks (IV)

Singer Clustering
– Grouping the same-singer music recordings into a cluster

[Figure: clustering by singer]


Potential Applications

Indexing
– Finding cameos or guest appearances in live concert recordings
– Identifying the singers in a movie’s musical interludes

Music recommendation systems
– Suggesting music by singers with similar voices

Karaoke services
– Efficiently organizing the customer’s recordings
– Personalization

Copyright protection
– Distinguishing between an original song and a cover-band version
– Rapidly scanning suspect websites for piracy


Singer’s Vocal Characteristics

Humans use several levels of perceptual cues for distinguishing among singers

Learned traits
– Characteristics: singing pronunciation and idiosyncrasies. Source: culture and socio-background
– Characteristics: modulation of pitch, rhythm, speed, intonation, and volume. Source: personality type

Physical traits
– Characteristics: acoustic aspect of singing, e.g., nasal, deep, breathy, and rough. Source: anatomical structure of vocal apparatus


Major Challenges In Singer Recognition

The vast majority of popular music contains background accompaniment during most or all vocal passages
– Infeasible to acquire isolated solo voice data for extracting the singer’s vocal characteristics

The proposed solution:

Vocal segment detection followed by solo vocal signal modeling


Vocal/Non-vocal Segmentation

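[Block diagram: Music Recording → Feature Extraction → Feature Vectors → Decision → Vocal or Non-vocal]

– Feature extraction: filter-bank analysis of each frame, giving Mel-scale Frequency Cepstral Coefficients (MFCCs)
– Decision: the feature vectors of each segment, taken with a sliding window, are compared against the Vocal Model and the Non-vocal Model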
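The decision stage of this diagram can be sketched as a log-likelihood ratio between two GMMs, summed over a sliding window of frames. This is a minimal illustration, not the system described in the talk: the one-dimensional "features" stand in for MFCC vectors, and the model parameters, the window length `win`, and all function names are assumptions made for the example.

```python
import math

def log_gauss(x, mean, var):
    """Log of a 1-D Gaussian density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + log_gauss(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def segment(frames, vocal_gmm, nonvocal_gmm, win=5):
    """Label every frame 'V' (vocal) or 'N' (non-vocal).

    The label of frame t is decided from the summed log-likelihood
    ratio of the frames in a window of length `win` centred on t.
    """
    llr = [gmm_loglik(x, vocal_gmm) - gmm_loglik(x, nonvocal_gmm) for x in frames]
    half = win // 2
    labels = []
    for t in range(len(frames)):
        lo, hi = max(0, t - half), min(len(frames), t + half + 1)
        labels.append("V" if sum(llr[lo:hi]) > 0.0 else "N")
    return labels

# Hypothetical toy models and data (not from the talk).
vocal_gmm = [(0.5, 2.0, 1.0), (0.5, 4.0, 1.0)]
nonvocal_gmm = [(0.5, -2.0, 1.0), (0.5, -4.0, 1.0)]
frames = [-3.0] * 10 + [3.0] * 10
labels = segment(frames, vocal_gmm, nonvocal_gmm)
```

Summing the ratio over a window smooths isolated frame errors, at the cost of blurring the exact vocal/non-vocal boundary by up to half a window.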


Gaussian Mixture Model (I)

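Model description
– The distribution of the feature vector $\mathbf{x}$ is represented by a mixture of M component Gaussian densities, i.e.,

$$p(\mathbf{x}|\lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$$

• $\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the i-th Gaussian density with mean $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$

– A Gaussian mixture model (GMM) is characterized by

$$\lambda = \{ w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \mid 1 \le i \le M \}$$

[Figure: $\mathbf{x}$ is fed to the M Gaussian densities $\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2), \ldots, \mathcal{N}(\boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M)$; their outputs are weighted by $w_1, w_2, \ldots, w_M$ and summed to give $p(\mathbf{x}|\lambda)$.]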
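As a concrete reading of the mixture density on this slide, the sketch below evaluates p(x|λ) for a small GMM. It assumes diagonal covariance matrices (so each Σ_i is stored as a list of variances), which is a simplification not stated on the slide; the function names and the toy parameters are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    """Gaussian density N(x; mean, diag(var)) for a diagonal covariance."""
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)

def gmm_pdf(x, weights, means, variances):
    """p(x | lambda) = sum_i w_i N(x; mu_i, Sigma_i)."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical 2-component, 2-dimensional model.
weights = [0.3, 0.7]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [2.0, 0.5]]
density = gmm_pdf([0.0, 0.0], weights, means, variances)
```

For real feature vectors the product of per-dimension densities underflows quickly, so practical code keeps everything in the log domain.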


Gaussian Mixture Model (II)

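Parameter estimation
– Using the EM algorithm, an initial model $\lambda$ is created, and the new model $\hat\lambda$ is then estimated by maximizing the auxiliary function

$$Q(\hat\lambda) = \sum_{t=1}^{T} \sum_{i=1}^{M} p(i|\mathbf{x}_t, \lambda) \log p(i, \mathbf{x}_t|\hat\lambda),$$

where

$$p(i, \mathbf{x}_t|\hat\lambda) = \hat w_i \, \mathcal{N}(\mathbf{x}_t; \hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)$$

and

$$p(i|\mathbf{x}_t, \lambda) = \frac{w_i \, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}{\sum_{m=1}^{M} w_m \, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)}$$

– Letting $\nabla Q(\hat\lambda) = 0$ for each parameter to be re-estimated, we have

$$\hat w_i = \frac{1}{T} \sum_{t=1}^{T} p(i|\mathbf{x}_t, \lambda)$$

$$\hat{\boldsymbol{\mu}}_i = \frac{\sum_{t=1}^{T} p(i|\mathbf{x}_t, \lambda)\, \mathbf{x}_t}{\sum_{t=1}^{T} p(i|\mathbf{x}_t, \lambda)}$$

$$\hat{\boldsymbol{\Sigma}}_i = \frac{\sum_{t=1}^{T} p(i|\mathbf{x}_t, \lambda)\, \mathbf{x}_t \mathbf{x}_t'}{\sum_{t=1}^{T} p(i|\mathbf{x}_t, \lambda)} - \hat{\boldsymbol{\mu}}_i \hat{\boldsymbol{\mu}}_i'$$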
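The re-estimation formulas for the weights, means, and covariances can be sketched as one EM iteration. The example below is restricted to scalar features (so each covariance is a single variance) to stay short; the data, the initial parameters, and the names `em_step` and `log_likelihood` are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D GMM; returns the re-estimated parameters."""
    M, T = len(weights), len(data)
    # E-step: posterior p(i | x_t, lambda) for every frame and component.
    post = []
    for x in data:
        joint = [weights[i] * normal_pdf(x, means[i], variances[i]) for i in range(M)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: closed-form updates obtained by setting the gradient of Q to zero.
    new_w, new_mu, new_var = [], [], []
    for i in range(M):
        occ = sum(post[t][i] for t in range(T))
        mu = sum(post[t][i] * data[t] for t in range(T)) / occ
        second = sum(post[t][i] * data[t] ** 2 for t in range(T)) / occ
        new_w.append(occ / T)
        new_mu.append(mu)
        new_var.append(second - mu * mu)
    return new_w, new_mu, new_var

def log_likelihood(data, weights, means, variances):
    return sum(math.log(sum(w * normal_pdf(x, m, v)
                            for w, m, v in zip(weights, means, variances)))
               for x in data)

# Hypothetical data with two well-separated groups.
data = [-2.1, -1.9, -2.4, -1.6, 1.8, 2.2, 2.5, 1.5]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
history = [log_likelihood(data, *params)]
for _ in range(10):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
```

Tracking `history` is a cheap sanity check: EM never decreases the log-likelihood, so a drop signals a bug in the updates.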


Distilling Singers’ Voices From Music

Substantial similarities exist between the instrumental regions and the accompaniment of the vocal signal

Solo voice can be modeled via suppressing the background music estimated from the instrumental regions.

[Figure: Solo Voice + Accompaniments = Accompanied Voice]


Solo Vocal Signal Modeling (I)

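Model Description
– A solo voice (unobservable), $S = \{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_T\}$, modeled by the GMM $\lambda_s = \{ w_{s,i}, \boldsymbol{\mu}_{s,i}, \boldsymbol{\Sigma}_{s,i} \mid 1 \le i \le M \}$
– A background music (unobservable), $B = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_T\}$, modeled by the GMM $\lambda_b = \{ w_{b,j}, \boldsymbol{\mu}_{b,j}, \boldsymbol{\Sigma}_{b,j} \mid 1 \le j \le N \}$
– An accompanied voice (observable), $V = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_T\}$, produced by mixing: $V = f(S, B)$

$$p(V|\lambda_s, \lambda_b) = \prod_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} w_{s,i}\, w_{b,j}\, p(\mathbf{v}_t|i, j, \lambda_s, \lambda_b),$$

$$p(\mathbf{v}_t|i, j, \lambda_s, \lambda_b) = \iint_{\mathbf{v}_t = f(\mathbf{s}_t, \mathbf{b}_t)} \mathcal{N}(\mathbf{s}_t; \boldsymbol{\mu}_{s,i}, \boldsymbol{\Sigma}_{s,i})\, \mathcal{N}(\mathbf{b}_t; \boldsymbol{\mu}_{b,j}, \boldsymbol{\Sigma}_{b,j})\, d\mathbf{s}_t\, d\mathbf{b}_t.$$

– $\lambda_b$ can be approximately estimated using the instrumental regions of music
– Our aim is to find an optimal $\lambda_s$ (in the maximum likelihood sense) such that

$$\lambda_s^* = \arg\max_{\lambda_s} p(V|\lambda_s, \lambda_b).$$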


Solo Vocal Signal Modeling (II)

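Parameter estimation
– Defining an auxiliary function

$$Q(\hat\lambda_s) = \sum_{t=1}^{T} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i, j|\mathbf{v}_t, \lambda_s, \lambda_b) \log p(i, j, \mathbf{v}_t|\hat\lambda_s, \lambda_b),$$

where

$$p(i, j, \mathbf{v}_t|\hat\lambda_s, \lambda_b) = \hat w_{s,i}\, w_{b,j}\, p(\mathbf{v}_t|i, j, \hat\lambda_s, \lambda_b),$$

$$p(i, j|\mathbf{v}_t, \lambda_s, \lambda_b) = \frac{w_{s,i}\, w_{b,j}\, p(\mathbf{v}_t|i, j, \lambda_s, \lambda_b)}{\sum_{m=1}^{M} \sum_{n=1}^{N} w_{s,m}\, w_{b,n}\, p(\mathbf{v}_t|m, n, \lambda_s, \lambda_b)}.$$

– Letting $\nabla Q(\hat\lambda_s) = 0$ for each parameter to be re-estimated, we have

$$\hat w_{s,i} = \frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j|\mathbf{v}_t, \lambda_s, \lambda_b),$$

$$\hat{\boldsymbol{\mu}}_{s,i} = \frac{\sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j|\mathbf{v}_t, \lambda_s, \lambda_b)\, E\{\mathbf{s}_t|\mathbf{v}_t, i, j, \lambda_s, \lambda_b\}}{\sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j|\mathbf{v}_t, \lambda_s, \lambda_b)},$$

$$\hat{\boldsymbol{\Sigma}}_{s,i} = \frac{\sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j|\mathbf{v}_t, \lambda_s, \lambda_b)\, E\{\mathbf{s}_t \mathbf{s}_t'|\mathbf{v}_t, i, j, \lambda_s, \lambda_b\}}{\sum_{t=1}^{T} \sum_{j=1}^{N} p(i, j|\mathbf{v}_t, \lambda_s, \lambda_b)} - \hat{\boldsymbol{\mu}}_{s,i} \hat{\boldsymbol{\mu}}_{s,i}'.$$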


Solo Vocal Signal Modeling (III)

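Re-estimation formulas for linear spectral features
– Suppose V is a linear spectral feature, and S and B are additive in the time domain; then $v_t = s_t + b_t$
– $p(v_t|i, j, \lambda_s, \lambda_b)$ is the convolution of the solo and background music densities, i.e.,

$$p(v_t|i, j, \lambda_s, \lambda_b) = \int \mathcal{N}(s_t; \mu_{s,i}, \sigma_{s,i}^2)\, \mathcal{N}(v_t - s_t; \mu_{b,j}, \sigma_{b,j}^2)\, ds_t$$

– $E\{s_t|v_t, i, j, \lambda_s, \lambda_b\}$ and $E\{s_t^2|v_t, i, j, \lambda_s, \lambda_b\}$ can be shown in the following form:

$$E\{s_t|v_t, i, j, \lambda_s, \lambda_b\} = \frac{\sigma_{s,i}^2}{\sigma_{s,i}^2 + \sigma_{b,j}^2}\,(v_t - \mu_{b,j}) + \frac{\sigma_{b,j}^2}{\sigma_{s,i}^2 + \sigma_{b,j}^2}\,\mu_{s,i},$$

$$E\{s_t^2|v_t, i, j, \lambda_s, \lambda_b\} = \frac{\sigma_{s,i}^2\, \sigma_{b,j}^2}{\sigma_{s,i}^2 + \sigma_{b,j}^2} + \left(E\{s_t|v_t, i, j, \lambda_s, \lambda_b\}\right)^2.$$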
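For the additive case, the two conditional expectations can be checked numerically against the convolution integral. The sketch below does this for one component pair (i, j) with scalar features. The closed form used for the likelihood, a Gaussian with summed means and variances, is the standard result for the convolution of two Gaussians and is not quoted from the slide; all parameter values and function names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def additive_likelihood(v, mu_s, var_s, mu_b, var_b):
    """p(v | i, j): convolution of two Gaussians, itself a Gaussian."""
    return normal_pdf(v, mu_s + mu_b, var_s + var_b)

def additive_moments(v, mu_s, var_s, mu_b, var_b):
    """Closed-form E{s | v, i, j} and E{s^2 | v, i, j} when v = s + b."""
    total = var_s + var_b
    first = var_s / total * (v - mu_b) + var_b / total * mu_s
    second = var_s * var_b / total + first ** 2
    return first, second

def numeric_moments(v, mu_s, var_s, mu_b, var_b, step=0.001, span=12.0):
    """Riemann-sum check: integrate the convolution integrand over s."""
    z = m1 = m2 = 0.0
    for k in range(int(2.0 * span / step) + 1):
        s = mu_s - span + k * step
        w = normal_pdf(s, mu_s, var_s) * normal_pdf(v - s, mu_b, var_b) * step
        z += w
        m1 += w * s
        m2 += w * s * s
    return z, m1 / z, m2 / z

# Hypothetical component parameters and one observed value.
args = (1.5, 1.0, 0.5, -0.5, 2.0)   # v, mu_s, var_s, mu_b, var_b
first, second = additive_moments(*args)
z, num_first, num_second = numeric_moments(*args)
```

The expectation pulls the observation back toward the solo mean in proportion to how much of the total variance the background accounts for.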


Solo Vocal Signal Modeling (IV)

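Re-estimation formulas for cepstral features
– Suppose V is a cepstral feature, and S and B are additive in the time domain; then $v_t = \log[\exp(s_t) + \exp(b_t)]$. We approximate $v_t \approx \max(s_t, b_t)$.
– It can be shown that

$$p(v_t|i, j, \lambda_s, \lambda_b) = \mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)\, \Phi\!\left(\frac{v_t - \mu_{b,j}}{\sigma_{b,j}}\right) + \mathcal{N}(v_t; \mu_{b,j}, \sigma_{b,j}^2)\, \Phi\!\left(\frac{v_t - \mu_{s,i}}{\sigma_{s,i}}\right),$$

$$\Phi(\eta) = \int_{-\infty}^{\eta} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, dw.$$

$$E\{s_t|v_t, i, j, \lambda_s, \lambda_b\} = p(s_t = v_t|v_t, i, j, \lambda_s, \lambda_b)\, v_t + \left[1 - p(s_t = v_t|v_t, i, j, \lambda_s, \lambda_b)\right] E\{s_t|s_t < v_t, i, j, \lambda_s, \lambda_b\},$$

$$E\{s_t^2|v_t, i, j, \lambda_s, \lambda_b\} = p(s_t = v_t|v_t, i, j, \lambda_s, \lambda_b)\, v_t^2 + \left[1 - p(s_t = v_t|v_t, i, j, \lambda_s, \lambda_b)\right] E\{s_t^2|s_t < v_t, i, j, \lambda_s, \lambda_b\},$$

$$p(s_t = v_t|v_t, i, j, \lambda_s, \lambda_b) = \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)\, \Phi\!\left(\frac{v_t - \mu_{b,j}}{\sigma_{b,j}}\right)}{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)\, \Phi\!\left(\frac{v_t - \mu_{b,j}}{\sigma_{b,j}}\right) + \mathcal{N}(v_t; \mu_{b,j}, \sigma_{b,j}^2)\, \Phi\!\left(\frac{v_t - \mu_{s,i}}{\sigma_{s,i}}\right)},$$

$$E\{s_t|s_t < v_t, i, j, \lambda_s, \lambda_b\} = \mu_{s,i} - \sigma_{s,i}^2\, \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)}{\Phi\!\left(\frac{v_t - \mu_{s,i}}{\sigma_{s,i}}\right)}.$$

$$E\{s_t^2|s_t < v_t, i, j, \lambda_s, \lambda_b\} = \mu_{s,i}^2 + \sigma_{s,i}^2 - \sigma_{s,i}^2\,(v_t + \mu_{s,i})\, \frac{\mathcal{N}(v_t; \mu_{s,i}, \sigma_{s,i}^2)}{\Phi\!\left(\frac{v_t - \mu_{s,i}}{\sigma_{s,i}}\right)}.$$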
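Under the max approximation, the likelihood and the two conditional expectations for one component pair can be sketched with the standard normal CDF built from `math.erf`. The truncated-Gaussian moments used here are the standard textbook forms; scalar features, the parameter values, and the function names are assumptions for illustration.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def std_normal_cdf(eta):
    """Phi(eta), the standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def max_likelihood(v, mu_s, var_s, mu_b, var_b):
    """p(v | i, j) when v = max(s, b) with independent Gaussian s and b."""
    solo_on_top = normal_pdf(v, mu_s, var_s) * std_normal_cdf((v - mu_b) / math.sqrt(var_b))
    back_on_top = normal_pdf(v, mu_b, var_b) * std_normal_cdf((v - mu_s) / math.sqrt(var_s))
    return solo_on_top + back_on_top

def max_moments(v, mu_s, var_s, mu_b, var_b):
    """E{s | v, i, j} and E{s^2 | v, i, j} under v = max(s, b).

    Note: the ratio below divides by Phi, which underflows to zero when v
    lies far below mu_s; this sketch does not guard against that case.
    """
    solo_on_top = normal_pdf(v, mu_s, var_s) * std_normal_cdf((v - mu_b) / math.sqrt(var_b))
    p_equal = solo_on_top / max_likelihood(v, mu_s, var_s, mu_b, var_b)
    ratio = normal_pdf(v, mu_s, var_s) / std_normal_cdf((v - mu_s) / math.sqrt(var_s))
    first_below = mu_s - var_s * ratio
    second_below = mu_s ** 2 + var_s - var_s * (v + mu_s) * ratio
    first = p_equal * v + (1.0 - p_equal) * first_below
    second = p_equal * v * v + (1.0 - p_equal) * second_below
    return first, second

# Hypothetical component parameters.
mu_s, var_s, mu_b, var_b = 1.0, 0.5, -0.5, 2.0
first, second = max_moments(1.5, mu_s, var_s, mu_b, var_b)
# Riemann-sum check that the likelihood is a proper density over v.
area = sum(max_likelihood(-15.0 + 0.01 * k, mu_s, var_s, mu_b, var_b) * 0.01
           for k in range(3001))
```

Because the solo value can never exceed the observation in this model, the expected solo value is always at or below `v`.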


Singer Identification (SID)

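Block diagram

Training Phase
– Training Data → Vocal/Instrumental Segmentation → Instrumental Signal B (non-vocal portion) and Accompanied Signal V (vocal portion)
– Instrumental Signal B → Gaussian Mixture Modeling → Background Music Model $\lambda_b = \arg\max_{\lambda_b} p(B|\lambda_b)$
– Accompanied Signal V and $\lambda_b$ → Solo Signal Modeling → Solo Model $\lambda_s = \arg\max_{\lambda_s} p(V|\lambda_s, \lambda_b)$

Testing Phase
– Test Data X → Vocal/Instrumental Segmentation → Instrumental Signal $\tilde B$ and Accompanied Signal $X_V$
– Instrumental Signal $\tilde B$ → Gaussian Mixture Modeling → Background Music Model $\tilde\lambda_b = \arg\max_{\tilde\lambda_b} p(\tilde B|\tilde\lambda_b)$
– $X_V$, $\tilde\lambda_b$, and the Solo Models $\lambda_{s,1}, \lambda_{s,2}, \ldots, \lambda_{s,P}$ for P singers → Maximum Likelihood Decision → Hypothesized Singer $\arg\max_{1 \le i \le P} p(X_V|\lambda_{s,i}, \tilde\lambda_b)$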
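The maximum-likelihood decision at the end of the testing phase can be sketched as follows. To keep the example short it assumes scalar, additive features, so each solo/background component pair contributes a Gaussian with summed means and variances; the singer names, model parameters, and test frames are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def composite_loglik(frames, solo_gmm, background_gmm):
    """log p(X_V | lambda_s, lambda_b) for GMMs given as [(weight, mean, var), ...]."""
    total = 0.0
    for v in frames:
        p = 0.0
        for w_s, mu_s, var_s in solo_gmm:
            for w_b, mu_b, var_b in background_gmm:
                # Additive assumption: v = s + b for every component pair.
                p += w_s * w_b * normal_pdf(v, mu_s + mu_b, var_s + var_b)
        total += math.log(p)
    return total

def identify(frames, solo_models, background_gmm):
    """Return the singer whose solo model maximises the composite likelihood."""
    scores = {name: composite_loglik(frames, gmm, background_gmm)
              for name, gmm in solo_models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical solo models for two singers and a test-time background model.
solo_models = {
    "singer_a": [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)],
    "singer_b": [(0.5, 4.0, 1.0), (0.5, 6.0, 1.0)],
}
background_gmm = [(1.0, 1.0, 1.0)]
vocal_frames = [5.2, 6.8, 4.9, 7.1, 6.0]
best, scores = identify(vocal_frames, solo_models, background_gmm)
```

The background model is re-estimated from the test recording itself, so the same solo model can be scored against accompaniments it never saw in training.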


SID Experiments

Music data
– 200 tracks from Mandarin pop music CDs
– 10 female & 10 male singers
– 5 tracks/singer for training; 5 tracks/singer for testing
– 20-min instrumental-only data for training the non-vocal GMM
– 22.05 kHz sampling rate (down-sampled from 44.1 kHz)

Vocal/Non-vocal segmentation
– 82.3% frame accuracy

[Figure: SID accuracy (in %, axis from 65.0 to 100.0) versus recording length (# frames: 1000, 3000, 6000, 9000, entire) for four systems: GMM with manual segmentation; GMM with automatic segmentation; solo modeling with manual segmentation; solo modeling with automatic segmentation.]


Singer Clustering (I)

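Block diagram
– Feature vectors of music recordings: $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$
– Solo Modeling: each recording yields a model, giving models $\lambda_1, \lambda_2, \ldots, \lambda_N$
– Log-likelihood Computation: $L_{ij} = \log p(\mathbf{x}_i|\lambda_j)$, giving the log-likelihoods $[L_{i1}, L_{i2}, \ldots, L_{iN}]$ for each recording $\mathbf{x}_i$
– Transform: each row of log-likelihoods is mapped to a characteristic vector, giving $\mathbf{F}_1, \mathbf{F}_2, \ldots, \mathbf{F}_N$
– Vector Clustering: the characteristic vectors are grouped into Cluster 1: {$\mathbf{x}_3, \mathbf{x}_7, \ldots$}, Cluster 2: {$\mathbf{x}_1, \mathbf{x}_9, \ldots$}, ..., Cluster M: {$\mathbf{x}_6, \mathbf{x}_{10}, \ldots$}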


Singer Clustering (II)

An example of the characteristic vectors

[Figure: log-likelihood vectors L and the corresponding characteristic vectors F for recordings by Singers 1 to 5. Each $F_{i,j}$ is computed from $L_{i,j}$ and the row maximum $L_{i,k^*}$, where $k^* = \arg\max_k L_{i,k}$.]


Singer Clustering (III)

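Determining the number of clusters
– Bayesian Information Criterion (BIC)
• Measuring how well the model fits a data set, and how simple the model is, specifically

$$\mathrm{BIC}(\Lambda) = \log p(D|\Lambda) - \frac{1}{2}\, \gamma\, d \log |D|,$$

where d is the no. of free parameters in model $\Lambda$, |D| the size of the data set D, and $\gamma$ a penalty factor

– The BIC for a K-clustering is computed by:

$$\mathrm{BIC}(K) = -\sum_{k=1}^{K} \frac{n_k}{2} \log |\boldsymbol{\Sigma}_k| - \frac{1}{2}\, \gamma K \left( M + \frac{1}{2} M (M + 1) \right) \log M,$$

where M is the total no. of elements, $n_k$ the no. of elements of cluster k, and $\boldsymbol{\Sigma}_k$ the covariance matrix of the characteristic vectors in cluster k

– A reasonable number of clusters can be determined by

$$K^* = \arg\max_{1 \le K \le M} \mathrm{BIC}(K).$$

[Figure: going from K = 3 to K = 4 clusters; does the BIC increase or not?]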
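The selection rule can be sketched by scoring a few candidate clusterings and keeping the one with the largest BIC. The toy example below departs from the slide in one labeled way: the characteristic vectors are 2-dimensional, so the penalty uses a separate dimension `dim` instead of M for the parameter count, which keeps the cluster covariances non-singular. The data, the candidate clusterings, and γ = 1 are all hypothetical.

```python
import math

def log_det_cov_2d(points):
    """Log-determinant of the maximum-likelihood covariance of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return math.log(sxx * syy - sxy * sxy)

def bic(clusters, dim=2, gamma=1.0):
    """BIC of a clustering: fit term minus a penalty growing with K."""
    total = sum(len(c) for c in clusters)
    fit = -sum(0.5 * len(c) * log_det_cov_2d(c) for c in clusters)
    free_params = dim + 0.5 * dim * (dim + 1)  # mean plus covariance per cluster
    penalty = 0.5 * gamma * len(clusters) * free_params * math.log(total)
    return fit - penalty

def shifted(pattern, dx, dy):
    return [(x + dx, y + dy) for x, y in pattern]

# Three well-separated groups of six vectors each (hypothetical data).
pattern = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2), (0.5, 0.8)]
a, b, c = shifted(pattern, 0, 0), shifted(pattern, 10, 0), shifted(pattern, 0, 10)
candidates = {
    1: [a + b + c],
    2: [a, b + c],
    3: [a, b, c],
    4: [a[:3], a[3:], b, c],
}
scores = {k: bic(clusters) for k, clusters in candidates.items()}
best_k = max(scores, key=scores.get)
```

Splitting a true cluster does tighten the fit term, but in this toy case the gain is smaller than the extra penalty, so the score peaks at the true number of groups.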


Singer Clustering Experiments (I)

Music data
– 200 tracks (20 singers; 10 tracks/singer)

Assessment method
– Cluster purity

$$\rho_k = \sum_{p=1}^{P} \frac{n_{kp}^2}{n_k^2},$$

• $\rho_k$ is the purity of cluster k, $n_k$ the total no. of recordings in cluster k, and $n_{kp}$ the no. of recordings in cluster k that were performed by singer p

– Average purity

$$\bar\rho = \frac{1}{M} \sum_{k=1}^{K} n_k\, \rho_k,$$

• M is the total no. of recordings, and K the no. of clusters

Examples:

$$\rho = \frac{6^2 + 2^2 + 1^2 + 1^2}{10^2} = 0.42, \qquad \rho = \frac{6^2}{6^2} = 1.0, \qquad \rho = \frac{1^2 + 1^2 + 1^2 + 1^2}{4^2} = 0.25$$
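The purity measures are easy to compute from per-cluster singer counts. The sketch below reproduces the three worked examples on this slide; representing a cluster as a list of counts (one entry per singer present) and the function names are choices made for the example.

```python
def cluster_purity(counts):
    """Purity of one cluster: sum_p n_kp^2 / n_k^2.

    `counts` lists, for each singer present, how many of the cluster's
    recordings that singer performed.
    """
    n_k = sum(counts)
    return sum(c * c for c in counts) / (n_k * n_k)

def average_purity(clusters):
    """Size-weighted mean purity: (1/M) * sum_k n_k * rho_k."""
    total = sum(sum(counts) for counts in clusters)
    return sum(sum(counts) * cluster_purity(counts) for counts in clusters) / total

# The three example clusters from the slide.
mixed = [6, 2, 1, 1]       # 10 recordings, one dominant singer
pure = [6]                 # 6 recordings, all by one singer
scattered = [1, 1, 1, 1]   # 4 recordings by 4 different singers
overall = average_purity([mixed, pure, scattered])
```

Treating these three clusters as one clustering of 20 recordings gives an average purity of (10 × 0.42 + 6 × 1.0 + 4 × 0.25) / 20 = 0.56.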


Singer Clustering Experiments (II)

Results

[Figure: average purity (0.00 to 1.00) versus no. of clusters (0 to 90) for four systems: manual segmentation with a 32-mix solo GMM & 8-mix background GMM per recording; manual segmentation with a 32-mix vocal GMM per recording; automatic segmentation with a 24-mix solo GMM & 8-mix background GMM per recording; automatic segmentation with a 24-mix vocal GMM per recording.]

[Figure: BIC (−520 to −40) versus no. of clusters (0 to 50) for sets of 5, 10, 15, and 20 singers, with the appropriate no. of clusters marked.]


Summary

We have
– Separated vocal from non-vocal segments of music;
– Isolated singers’ vocal characteristics from the background music;
– Distinguished singers from one another.

We will
– Handle a wider variety of music data, including duets, trios, chorus, background vocals, or music with multiple simultaneous or non-simultaneous singers;
– Deal with the other problems of voice information retrieval from music, such as lyric transcription and singing language recognition.


To Probe Further (I)

Selected references
– Music information retrieval
• A. L. Uitdenbogerd, “Music IR: past, present, and future,” Proceedings of International Symposium on Music Information Retrieval, 2000.
• J. Futrelle and J. S. Downie, “Interdisciplinary communities and research issues in music information retrieval,” Proceedings of International Conference on Music Information Retrieval, pp. 215–221, 2002.

– Artist recognition
• B. Whitman, G. Flake, and S. Lawrence, “Artist detection in music with Minnowmatch,” Proceedings of IEEE Workshop on Neural Networks for Signal Processing, 2001.
• A. Berenzweig, D. P. W. Ellis, and S. Lawrence, “Using voice segments to improve artist classification of music,” Proceedings of International Conference on Virtual, Synthetic and Entertainment Audio, 2002.

– Singer identification
• Y. E. Kim and B. Whitman, “Singer identification in popular music recordings using voice coding features,” Proceedings of International Conference on Music Information Retrieval, pp. 164–169, 2002.
• C. C. Liu and C. S. Huang, “A singer identification technique for content-based classification of MP3 music objects,” Proceedings of International Conference on Information and Knowledge Management, pp. 438–445, 2002.
• T. Zhang, “Automatic Singer Identification,” Proceedings of International Conference on Multimedia and Expo, 2003.
• W. H. Tsai, H. M. Wang, and D. Rodgers, “Automatic singer identification of popular music recordings via estimation and modeling of solo vocal signal,” Proceedings of European Conference on Speech Communication and Technology, 2003.

– Singer clustering
• W. H. Tsai, H. M. Wang, D. Rodgers, S. S. Cheng, and H. M. Yu, “Blind clustering of popular music recordings based on singer voice characteristics,” to appear in Proceedings of International Conference on Music Information Retrieval, 2003.


To Probe Further (II)

General resources
– Important conferences
• International Conference on Music Information Retrieval
• International Computer Music Conference
• IEEE International Conference on Multimedia and Expo
• ACM International Multimedia Conference
• International Conference on New Interfaces for Musical Expression

– Organizations
• International Computer Music Association (http://www.computermusic.org/)
• The Australasian Computer Music Association (http://www.acma.asn.au/)
• ACM Multimedia (http://www.acm.org/sigmm/)
• Acoustical Society of America (http://asa.aip.org/)

– Journals
• Computer Music Journal (http://www-mitpress.mit.edu/catalog/item/default.asp?ttype=4&tid=15)
• Journal of New Music Research (http://www.swets.nl/jnmr/jnmr.html)
• Computing in Musicology (http://www.ccarh.org/publications/books/cm/)

– Useful links
• http://www.leighsmith.com/Browsers/Cmusic.html
• http://www2.siba.fi/Kulttuuripalvelut/computers.html
