institute of information science academia sinica 1 singer identification and clustering of popular...

1Institute of Information Science Academia Sinica

Singer Identification and Singer Identification and Clustering of Popular Music Clustering of Popular Music

RecordingsRecordings

Wei-Ho Tsaiwesley@iis.sinica.edu.tw

Institute of Information Science, Academia Sinica

Extracting Information From Music

Music Information Retrieval (MIR)– To develop ways of managing collections of musical

material for preservation, access, research, and other uses.

MIR communities & research areas [after Futrelle & Downie, 2002]

Computer Science

Audio Engineering

Psychology & Philosophy

Library Science

Representation

Indexing

User Interface Design

Compression

FeatureDetection

Machine Learning

Metadata

Musical Analysis

Epistemology & Ontology

Perception

Intellectual Property

Classification

Musicology

Communities Research Areas

Extracting Voice Information From Music

Viewing MIR from a speech-processing perspectiveSpeech Processing

Analysis/Synthesis Recognition Coding

Speech Recognition Speaker Recognition Language Recognition

Phone Recognition Word Recognition Tone Recognition

Speaker Identification Speaker Verification Speaker Clustering

Language Identification Language Verification Dialect Identification

Singing Processing

Analysis/Synthesis Recognition Coding

"Singing Recognition" Singer Recognition Language Recognition

Phone Recognition Lyric Transcription Melody Extraction

Singer Identification Singer Detection Singer Clustering

Language Identification Language Verification Dialect Identification

Speech/Singing/Music/

Other SoundsDiscrimination

Singer Recognition Tasks (I)

Singer Identification– Determining who is singing

Who performed thismusic recording?

Singer Recognition Tasks (II)

Singer detection– Determining whether or not a specified singer is present

in a music recording

If Coco's voices in thismusic recording?

Singer Recognition Tasks (III)

Singer Tracking– Locating where a specified singer is present in a music

recording

Where is Coco's voices?

Singer Recognition Tasks (IV)

Singer Clustering– Grouping the same-singer music recordings into a cluster

Clusteringby singer

Potential Applications

Indexing– Finding cameo’s or guest appearances in live concert

recordings.– Identifying the singers in a movie’s musical interludes.

Music recommendation systems– Suggesting music by singers with similar voices.

Karaoke services– Efficiently organizing the customer’s recordings.– Personalization

Copyright protection– Distinguishing between an original song and a cover-band.– Rapidly scanning suspect websites for piracy

Singer’s Vocal Characteristics

Humans use several levels of perceptual cues for distinguishing among singers

Singingpronunciation, and

IdiosyncrasiesLearned traits

Physical traits

Culture andsocio-background

Modulation of pitch,rhythm,speed,

intonation, andvolume

Personality type

Acoustic aspect ofsinging, e.g., nasal,deep, breathy and

Anatomicalstructure

of vocal apparatus

Characteristics Sources

Major Challenges In Singer Recognition

The vast majority of popular music contains background accompaniment during most or all vocal passages– Infeasible to acquire isolated solo voice data for

extracting the singer’s vocal characteristics

The proposed solution:

Vocal segment detection followed by solo vocal signal modeling

Vocal/Non-vocal Segmentation

Sliding Window

Vocal Model

Non-vocal Model

FeatureVectors

DecisionFeature

Extraction

Vocalor

Non-vocal

Mel-scale Frequency Cepstral Coefficients (MFCCs)

Filter BankFrame

Segment

Music Recording

Gaussian Mixture Model (I)

Model description– The distribution of the feature vector x is represented by

a mixture of M component Gaussian densities, i.e.,

• is the i-th Gaussian density with mean and covariance matrix

– A Gausian mixture model (GMM) is characterized by

11, ΣN () 22 , Σ ... MM Σ,

)|( xp

N () N ()

i iiiwp1

),()|( xx N

),( ii xN i i

Miw iii 1|,,

Gaussian Mixture Model (II)

Parameter estimation– Using the EM algorithm, an initial model is created, and

the new model is then estimated by maximizing the auxiliary function

where and

– Letting for each parameter to be re-estimated, we have

tti ip

t ttti

)ˆ,ˆ(ˆ)ˆ|,( iitit wip xx N

,)ˆ|,(log),|()ˆ(1 1

itt ipipQ xx

m mmtm

),(),|(

0)ˆ( Q

Distilling Singers’ Voices From Music

Substantial similarities exist between the instrumental regions and the accompaniment of the vocal signal

Solo voice can be modeled via suppressing the background music estimated from the instrumental regions.

Solo Voice

Accompaniments

Accompanied Voice

Solo Vocal Signal Modeling (I)

Model Description

b can be approximately estimated using the instrumental regions of music

– Our aim is to find an optimal s such that (in maximum likelihood sense)

).,|(maxarg bss ps

MixingV = f (S ,B )

A Solo Voice

A Background Music

An Accompanied Voice

},...,,{ 21 TvvvV},...,,{ 21 TsssS

},...,,{ 21 TbbbB

GMM GMM

}| ,,{

isisiss w

jbjbjbb w

Σ,),,,|(),|(

1 1 1,,

jbstjbisbs jipwwp vV

.),;(),;(

),,,|(

ttjbjbtisist

(unobservable)

(observable)

Solo Vocal Signal Modeling (II)

Parameter estimation– Defining an auxiliary function

– Letting for each parameter to be re-estimated, we have

,)ˆ|,,(log),|,()ˆ(1 1 1

jbstbstss jipjipQ vv

),ˆ,,|()ˆ|,,( ,, bstjbisbst jipwwjip vv

.),,|(

),,|(),|,(

1 1 ,,

n bstnbms

bstjbisbst

jipwwjip

0)ˆ( ssQ

,),|,(1

jbstis jip

),,|,(

,,,,|),,|,(ˆ

j bsttbst

jiEjip

),,|,(

,,,,|),,|,(ˆ

1 1, isisT

j bstttbst

jiEjip

vssvΣ

Solo Vocal Signal Modeling (III)

Re-estimation formulas for linear spectral features– Suppose V is a linear spectral feature, and S and B are

additive in the time domain, then vt = st + bt

– is the convolution of the solo and background music densities, i.e.,

– and can be shown in the following form:

),,,|( bst jivp

tjbjbttisistbst sds-vsjivp ),;(),;(),,,|( 2,,

2,, NN

,,,,,| ,,2,

isbstt vjivsE

.,,,,|,,,,|2

bsttjbis

jbisbst

2t jivsEjivsE

bstt jivsE ,,,,| bst2t jivsE ,,,,|

Solo Vocal Signal Modeling (IV)

Re-estimation formulas for cepstral features– Suppose V is a cepstral feature, and S and B are additive in

the time domain, then vt = log[exp(st)+exp(bt)]. We approximate vt

max (st , bt ).

– It can be shown that

),(),;()(),;(),,,|(,

istjbjbt

jbtisistbst

vvjivp

1)( 2/

,,,,,|),,,|(1),,,|(,,,,| bstttbstttbsttbstt jivssEjivspvjivspjivsE

,,,,,|),,,|(1),,,|(,,,,| 222bstttbstttbsttbstt jivssEjivspvjivspjivsE

,)(),;()(),;(

)(),;(

),,,|(

istjbjbt

jbtisist

bstt vv

),;(,,,,|

isistisisbsttt v

vjivssE

),;()(,,,,|

isististisisisbstt

vvjivssE

Singer Identification (SID)

Block diagram

TrainingData Vocal/

InstrumentalSegmentation

(Non-vocal portion)Instrumental

Signal B

Accompanied Signal V(vocal portion)

GaussianMixture

Modeling

Background MusicModel

Solo SignalModeling

SoloModel

Vocal/Instrumental

Segmentation

GaussianMixture

Modeling

MaximumLikelihoodDecision

max P( B |b)max

p (V s,b)

Training Phase

Testing Phase

TestData

InstrumentalSignal

Background Music Model

Accompanied Signal X V

arg max i

Solo Models

for P Singers

s ,1 , s,2 , ..., s, P

,|( , bisVp X)

~( bpmax B

SID Experiments

Music data– 200 tracks from

Mandarin pop music CDs

– 10 female & 10 male singers

– 5 tracks/singer for training; 5 tracks/singer for testing

– 20-min instrumental-only data for training the non-vocal GMM

– 22.05 kHz sampling rate (down-sampled from 44.1 kHz)

Vocal/Non-vocal segmentation– 82.3% frame accuracy

1000 Entire3000 6000 9000R ecord ing Length (# fram es)

G M M ; M anua l Segm enta tion

G M M ; Autom atic S egm entation

Solo M ode ling; M anual Segm enta tion

Solo M ode ling; A utom atic Segm entation

Singer Clustering (I)

Block diagram

L N 1 L N 2 ... L NN

Log-likelihoodComputation

L ij = log p (x i | )

x 2 x N

SoloModeling

L 11 L 12 ... L 1 N

L 21 L 22 ... L 2 N

VectorClustering

Cluster M : {x 6 , x 10 , ...}

Cluster 1: {x 3 , x 7 , ...}

Cluster 2: {x 1 , x 9 , ...}Transform

Transform

1 2 N (Log-likelihoods)

(Characteristic Vectors)

(Feature Vectors of Music Recordings)

(Models)

Transform

Singer Clustering (II)

An example of the characteristic vectors

{ { { { {Singer 1 Singer 2 Singer 3 Singer 4 Singer 5

- ,, jii LL

)maxarg( ,

Singer Clustering (III)

Determining the number of clusters– Bayesian Information Criterion (BIC)

• Measuring how well the model fits a data set, and how simple the model is, specifically

– The BIC for a K-clustering is computed by:

– A reasonable number of clusters can be determined by

|,|log 2

1)|(log)BIC( DD dp

,log)1(2

1||log

2)(BIC

MMMMKn

).(BIC maxarg1

* KKMK

d : no. of free parameters in model | D | : size of the data set D: a penalty factor

BIC increases or not?

M : total no. of elementsn k : no. of elements of the cluster kk : covariance matrix of the

characteristic vectors in the cluster k

Singer Clustering Experiments (I)

Music data– 200 tracks (20 singers; 10 tracks/singer)

Assessment method– Cluster purity

k is the purity of the cluster k, nk the total no. of recordings in the cluster k, and nkp the no. of recordings in the cluster k that were performed by singer p

– Average purity

• M is the total no. of recordings, and K the no. of clusters

42.010

Singer Clustering Experiments (II)

Results

0 10 20 30 40 50 60 70 80 90N o. of C lusters

M anual Segm entation; 32-m ix So lo G M M & 8-m ix B ackground G M M / R ecord ing

M anual Segm entation; 32-m ix Vocal G M M / R ecord ing

Autom atic Segm entation; 24-m ix Solo G M M & 8-m ix Background G M M / R ecord ing

Autom atic Segm entation; 24-m ix voca l G M M / R ecord ing0 5 10 15 20 25 30 35 40 45 50

N o. of C lusters

5 s ingers

10 s ingers

15 s ingers

20 s ingers

Appropriate no. of clusters

Summary

We have– Separated vocal from non-vocal segments of music;– Isolated singers’ vocal characteristics form the

background music;– Distinguished singers from one another.

We will– Handle wider variety of music data including duets, trios,

chorus, background vocals, or music with multiple simultaneous or non-simultaneous singers;

– Deal with the other problems of voice information retrieval from music, such as lyric transcription and singing language recognition.

To Probe Further (I)

Selected references– Music information retrieval

• A. L. Uitdenbogerd, “Music IR: past, present, and future,” Proceedings of International Symposium on Music Information Retrieval, 2000.

• J. Futrelle and J. S. Downie, “Interdisciplinary communities and research issues in music information retrieval,” Proceedings of International Conference on Music Information Retrieval, pp. 215–221, 2002.

– Artist recognition• B. Whitman, G. Flake, and S. Lawrence, “Artist detection in music with Minnowmatch,”

Proceedings of IEEE Workshop on Neural Networks for Signal Processing, 2001.• A. Berenzweig, D. P. W. Ellis, and S. Lawrence, “Using voice segments to improve artist

classification of music,” Proceedings of International Conference on Virtual, Synthetic and Entertainment Audio, 2002.

– Singer identification• Y. E. Kim and B. Whitman, “Singer identification in popular music recordings using voice coding features,”

Proceedings of International Conference on Music Information Retrieval, pp. 164–169, 2002. • C. C. Liu, and C. S. Huang, “A singer identification technique for content-based classification of MP3 music objects,”

Proceedings of International Conference on Information and Knowledge Management, pp. 438–445, 2002. • T. Zhang, “Automatic Singer Identification,” Proceedings of International Conference on Multimedia and Expo, 2003. • W. H. Tsai, H. M. Wang, and D. Rodgers, “Automatic singer identification of popular music recordings via estimation

and modeling of solo vocal signal,” Proceedings of European Conference on Speech Communication and Technology, 2003.

– Singer clustering• W. H. Tsai, H. M. Wang, D. Rodgers, S. S. Cheng, and H. M. Yu, “Blind clustering of popular music recordings based

on singer voice characteristics,” to appear in Proceedings of International Conference on Music Information Retrieval, 2003.

To Probe Further (II)

General resources– Important conferences

• International Conference on Music Information Retrieval• International Computer Music Conference • IEEE International Conference on Multimedia and Expo• ACM International Multimedia Conference• International Conference on New Interfaces for Musical Expression

– Organizations• International Computer Music Association (http://www.computermusic.org/)• The Australasian Computer Music Association (http://www.acma.asn.au/)• ACM Multimedia (http://www.acm.org/sigmm/)• Acoustical Society of America (http://asa.aip.org/)

– Journals• Computer Music Journal (http://www-mitpress.mit.edu/catalog/item/default.asp?ttype=4&tid=15)• Journal of New Music Research (http://www.swets.nl/jnmr/jnmr.html)• Computing in Musicology (http://www.ccarh.org/publications/books/cm/)

– Useful links• http://www.leighsmith.com/Browsers/Cmusic.html• http://www2.siba.fi/Kulttuuripalvelut/computers.html

institute of information science academia sinica 1 singer identification and clustering of popular...

Documents

pdf - academia sinica

ishengjason tsai

geoade - geometric algorithm development environment...

toxoplasmosis in domestic animals - sinica

st meeting room - sinica

dna sequence - sinica

a quantum-proof non-malleable extractor · email:...

dermatologica sinica

replication dna - sinica

single packet icmp traceback - sinica

theory of computation - sinica

research topics - sinica

dna - sinica

advanced analytical chemistry - sinica

durable goods monopolist - sinica

chapter 2 assemblers - sinica

local demographic variation - sinica

picking basis set - sinica

web surfing on the move: needs, opportunities, and...

zoo}({j)gicm - sinica