
AUTOMATIC AUDIO TAG CLASSIFICATION
VIA SEMI-SUPERVISED CANONICAL DENSITY ESTIMATION

Jun Takagi† Yasunori Ohishi‡ Akisato Kimura‡ Masashi Sugiyama† Makoto Yamada† Hirokazu Kameoka‡

†Graduate School of Information Science and Engineering, Tokyo Institute of Technology
‡NTT Communication Science Laboratories, NTT Corporation

Contact information: E-mail: [email protected]

Overview

Apply a semi-supervised method to audio tagging/retrieval: utilize untagged samples to reduce the number of expensive tagged samples needed.

What is Audio Tagging/Retrieval?

Audio tagging
Query: an audio signal; output: tags.
Tagging by hand is laborious work ("Tag this audio file!" → exciting, rock, powerful, passionate).
Goal: reduce the cost of this laborious task.

Audio retrieval
Query: tags; output: audio signals.
Search audio data by content ("I'm now looking for exciting and powerful rock music!" → "These songs must match your request.").

SSCDE can cope with both tasks in the same framework!

Semi-supervised method

Audio samples with high-quality tags are very expensive, so the semi-supervised method utilizes inexpensive untagged audio samples, which are easy to collect at low cost.
Example tagged samples: {Hammer, Repeat, Tools, Metal}, {Tools, Bell}, {Repeat, Digital, Metal}.
Only a few tagged samples are available, while a large number of untagged samples is available.

Technical Challenge 1: Use Inexpensive Untagged Audio
Technical Challenge 2: Use Tag Co-occurrence Information

Typical approach
Train a dedicated classifier for each tag separately (Bass Guitar, Guitar, Piano, Noise, Rough, Jazz, Rock, Punk rock, ...).
Examples: linear regression [Whiteman+ 2004], hierarchical GMM [Turnbull+ 2008].
Drawback: difficult to use co-occurrence information of the tags.

Topic-model-based approach
pLSA: NLP [Hofmann 1999], images [Barnard+ 2001]; LDA: NLP [Blei+ 2003], images [Li+ 2005].
→ How to incorporate the co-occurrence information into the classifier?

Tag co-occurrence information seems useful for the tagging/retrieval task, but almost all existing methods cannot utilize it.

Our approach: design a probabilistic generative model for audio signals and tags, linking the audio feature and the tag feature (e.g., Guitar, Rough; Jazz, Rock, Punk rock; Bass Guitar, Guitar, Piano, Noise, Rough).

Merits:
1. Tag co-occurrence is considered in the model.
2. Scalable to many features.

Dataset (tags × audio signals; 1 = tag attached):

Tag       Audio 1  Audio 2  Audio 3
Rock         1        0        0
Powerful     1        0        0
Guitar       1        1        0
Digital      0        1        1
Metal        0        0        1

[Figure: the dataset of audio signals and tags (Rock, Powerful, Guitar, Digital, Metal, Exciting, ..., including untagged signals) is mapped onto latent topics.]

Learned tag-topic model (each column gives tag probabilities under one latent topic):

Tag        z1   z2   z3   z4   z5   z6   z7   z8
Rock      0.1  0.1  0.1  0.9  0.1  0.9  0.1  0.1
Powerful  0.9  0.2  0.2  0.2  0.9  0.2  0.9  0.9
Guitar    0.1  0.9  0.1  0.1  0.1  0.9  0.9  0.1
Digital   0.9  0.1  0.1  0.1  0.1  0.1  0.9  0.9
Metal     0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.9
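To see why modelling tags through shared latent topics captures co-occurrence, here is a small numpy illustration (hypothetical, not from the poster: it assumes each column of the table above is p(tag | topic) for one of eight latent topics with a uniform topic prior). Tags that put high probability on the same topics receive a high joint probability without training any per-pair classifier.

```python
# p(tag_i, tag_j) = sum_z p(tag_i | z) p(tag_j | z) p(z), assuming conditional
# independence of tags given the latent topic and a uniform topic prior.
import numpy as np

tags = ["Rock", "Powerful", "Guitar", "Digital", "Metal"]
p_tag_given_topic = np.array([
    [0.1, 0.1, 0.1, 0.9, 0.1, 0.9, 0.1, 0.1],   # Rock
    [0.9, 0.2, 0.2, 0.2, 0.9, 0.2, 0.9, 0.9],   # Powerful
    [0.1, 0.9, 0.1, 0.1, 0.1, 0.9, 0.9, 0.1],   # Guitar
    [0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9],   # Digital
    [0.1, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9],   # Metal
])
p_topic = np.full(8, 1.0 / 8)                   # uniform prior over topics

joint = (p_tag_given_topic * p_topic) @ p_tag_given_topic.T
print(dict(zip(tags, np.round(joint[0], 3))))   # co-occurrence scores with "Rock"
```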

SSCDE: Model Learning Framework

Tag features and audio features are mapped by SemiCCA into a common topic space; the audio-topic model is learned there by kernel density estimation (KDE) and the tag-topic model by multi-label SSKDE.

1. Extract features from audio signals and tags.
2. Generate the topic space by SemiCCA.
3. Learn the audio-topic model by kernel density estimation.
4. Learn the tag-topic model by multi-label SSKDE.
5. Use the learned model for annotation and retrieval.

Annotation and retrieval are then performed as:

Annotation: $\hat{y} = \arg\max_{y \in \{0,1\}^{D_y}} p(y \mid x_{\mathrm{given}})$
Retrieval: $\hat{x} = \arg\max_{x \in \mathrm{DB}} p(x \mid y_{\mathrm{given}})$

Tag posteriors in the topic space are obtained by kernel smoothing:
$P(y_d \mid z_n) = \dfrac{\sum_{i=1}^{N} P(y_d \mid z_i)\,\kappa(z_n - z_i)}{\sum_{i=1}^{N} \kappa(z_n - z_i)}$


where $N$ is the number of tagged samples and $N_x$ is the number of untagged samples.
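A rough sketch, not the authors' code, of how the tag posterior above can be evaluated: kernel-average the tag confidences of the training samples around a query point in the topic space (the Gaussian form of κ and the bandwidth h are assumptions).

```python
import numpy as np

def gaussian_kernel(diff, h=1.0):
    """kappa(z_n - z_i): isotropic Gaussian kernel with bandwidth h."""
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h ** 2))

def tag_posterior(z_query, Z_train, F, h=1.0):
    """P(y_d | z_query) for every tag d.

    Z_train : (N, K) topic-space coordinates of the training samples
    F       : (N, D) tag confidences, F[i, d] ~ P(y_d | z_i)
    """
    w = gaussian_kernel(z_query - Z_train, h)   # (N,) kernel weights
    return (w @ F) / (w.sum() + 1e-12)          # (D,) smoothed posteriors

# Annotation keeps the tags with the highest posterior for a query audio;
# retrieval scores each database item by its posterior given the query tags.
```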

Technical Points of SSCDE

1. Learn the topic space with both tagged and untagged samples: SemiCCA [Kimura+ ICPR2010], which blends CCA (tagged samples) with PCA (tagged and untagged samples).
2. Propagate tag information: multi-label SSKDE.

Underlying model: audio features $x$ and tag features $y$ are conditionally independent given a latent topic $z$,
$p(x, y) = \int_z p(z)\, p(x \mid z)\, p(y \mid z)\, dz$,
and annotation/retrieval for a query $x_q$ or $y_q$ reduce to maximizing the joint density:
$\hat{y} = \arg\max_{y} p(y \mid x_q) = \arg\max_{y} p(x_q, y)$,
$\hat{x} = \arg\max_{x \in \mathrm{DB}} p(x \mid y_q) = \arg\max_{x \in \mathrm{DB}} p(x, y_q)$.
The component densities $p(x \mid z)$ and $p(y \mid z)$ are estimated in the learned topic space.

CCA on the tagged pairs solves the generalized eigenproblem
$\begin{bmatrix} 0 & S_{xy} \\ S_{yx} & 0 \end{bmatrix} \begin{bmatrix} w_x \\ w_y \end{bmatrix} = \lambda \begin{bmatrix} S_{xx} & 0 \\ 0 & S_{yy} \end{bmatrix} \begin{bmatrix} w_x \\ w_y \end{bmatrix}.$

SemiCCA blends this with PCA over all samples through the generalized eigenproblem
$B \begin{bmatrix} w_x \\ w_y \end{bmatrix} = \lambda\, C \begin{bmatrix} w_x \\ w_y \end{bmatrix},$
$B = \beta \begin{bmatrix} 0 & S^{(T)}_{xy} \\ S^{(T)}_{yx} & 0 \end{bmatrix} + (1-\beta) \begin{bmatrix} S_{xx} & 0 \\ 0 & S_{yy} \end{bmatrix}, \qquad C = \beta \begin{bmatrix} S^{(T)}_{xx} & 0 \\ 0 & S^{(T)}_{yy} \end{bmatrix} + (1-\beta) \begin{bmatrix} I_{D_x} & 0 \\ 0 & I_{D_y} \end{bmatrix},$
where the superscript $(T)$ marks covariances computed from tagged samples only, $D_x$ and $D_y$ are the dimensions of $x$ and $y$, and $\beta \in [0, 1]$ controls the trade-off between CCA and PCA.

Solving this generalized eigenproblem gives the projection matrices $W_x$, $W_y$ and eigenvalues $\Lambda = \mathrm{diag}(\lambda_i)$, which define the map into the topic space:
$z_n = \Lambda^{1/2} W_x x_n + \Lambda^{1/2} W_y y_n.$
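A minimal sketch of the SemiCCA eigenproblem above (data shapes are assumed, and the tag-side covariance over "all" samples is simplified to the tagged-only one, since untagged samples carry no y): build B and C with the β trade-off and solve the generalized symmetric eigenproblem with SciPy.

```python
import numpy as np
from scipy.linalg import eigh

def semicca(X_tagged, Y_tagged, X_all, beta=0.5, n_topics=10):
    """X_tagged: (N, Dx), Y_tagged: (N, Dy), X_all: (N + Nx, Dx)."""
    Dx, Dy = X_tagged.shape[1], Y_tagged.shape[1]
    Xc = X_tagged - X_tagged.mean(0)
    Yc = Y_tagged - Y_tagged.mean(0)
    Xa = X_all - X_all.mean(0)

    Sxy_T = Xc.T @ Yc / len(Xc)          # cross-covariance, tagged pairs
    Sxx_T = Xc.T @ Xc / len(Xc)
    Syy_T = Yc.T @ Yc / len(Yc)
    Sxx = Xa.T @ Xa / len(Xa)            # covariance over all audio samples
    Syy = Syy_T                          # assumption: only tagged y available

    zxy = np.zeros((Dx, Dy))
    B = beta * np.block([[np.zeros((Dx, Dx)), Sxy_T],
                         [Sxy_T.T, np.zeros((Dy, Dy))]]) \
        + (1 - beta) * np.block([[Sxx, zxy], [zxy.T, Syy]])
    C = beta * np.block([[Sxx_T, zxy], [zxy.T, Syy_T]]) \
        + (1 - beta) * np.eye(Dx + Dy)

    lam, W = eigh(B, C)                  # generalized eigenproblem B w = lam C w
    order = np.argsort(lam)[::-1][:n_topics]
    lam, W = lam[order], W[:, order]
    Wx, Wy = W[:Dx].T, W[Dx:].T          # rows are the directions w_x^T, w_y^T
    L_half = np.diag(np.sqrt(np.maximum(lam, 0.0)))
    return Wx, Wy, L_half
```

Topic coordinates then follow the map above, e.g. `z = L_half @ (Wx @ x) + L_half @ (Wy @ y)` for a tagged pair.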

Initial state → propagation → rectification (iterated): starting from the given tags (tagged samples shown with color = tag, untagged samples without tags), tag confidences are propagated over the neighboring matrix, and the posteriors are rectified to reflect the given tag information.

Multi-label SSKDE: a multi-label version of semi-supervised kernel density estimation (SSKDE [Wang+ CVIU2009]).


Experiment

• Dataset: 2,012 audio files taken from "Freesound" (http://www.freesound.org/).
  – A database of Creative Commons licensed sounds.
  – Annotated with a tag vocabulary.

• Evaluation condition:
  – 2,012 audio clips in WAV format.
  – Each clip has multiple tags selected from 230 words.

• Audio feature: a bag-of-feature-vectors representation extracted by the following process.
  1. Audio signals are split into half-overlapping 23 ms windows, and a 39-dimensional vector (the first 13 MFCCs plus MFCC-Δ and MFCC-Δ-Δ) is extracted from each window.
  2. 500 vectors are sampled from each audio signal (about 1,000,000 vectors in total). The LBG (Linde-Buzo-Gray) vector quantization (VQ) algorithm is applied to them to obtain a VQ codebook of size 1024.
  3. A normalized 1024-dimensional vector representing each audio signal is created by VQ.

• Tag feature: a 230-dimensional binary vector (each element corresponds to a specific tag).

• Sample split: 1,000 tagged audio signals, 912 untagged audio signals, 100 evaluation samples.

Annotation performance of SSCDE is evaluated under these conditions.
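A rough sketch of the audio-feature pipeline above, using librosa MFCCs and scikit-learn KMeans as a stand-in for the LBG codebook construction (an assumption; the poster uses the LBG algorithm itself):

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def frame_features(path):
    """39-dim vectors (13 MFCCs + delta + delta-delta) from 23 ms windows."""
    sig, sr = librosa.load(path, sr=None)
    n_fft = int(0.023 * sr)                              # 23 ms window
    mfcc = librosa.feature.mfcc(y=sig, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=n_fft // 2)
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                       # (n_frames, 39)

def bag_of_features(paths, codebook_size=1024, per_clip=500, seed=0):
    """Normalized codeword histograms, one 1024-dim vector per clip."""
    rng = np.random.default_rng(seed)
    clips = [frame_features(p) for p in paths]
    pool = np.vstack([c[rng.choice(len(c), min(per_clip, len(c)), replace=False)]
                      for c in clips])                   # ~500 vectors per clip
    km = KMeans(n_clusters=codebook_size, n_init=4, random_state=seed).fit(pool)
    hists = []
    for c in clips:
        counts = np.bincount(km.predict(c), minlength=codebook_size)
        hists.append(counts / counts.sum())              # VQ histogram
    return np.array(hists)
```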

Result

SSCDE successfully improved the performance by utilizing untagged samples.
Upper bound: all training samples are used as tagged samples; our goal is to approach this performance with fewer tagged samples.
N: number of tagged samples; Nx: number of untagged samples.

Audio-topic model

In the topic space, the topic density and the joint density are approximated over the $N$ training samples:
$p(z) \approx \frac{1}{N} \sum_{n=1}^{N} \delta(z - z_n)$,
$p(x, y) \approx \frac{1}{N} \sum_{n=1}^{N} p(x \mid z_n)\, p(y \mid z_n)$,
where $\delta(\cdot)$ is the Dirac delta, $z_n = \Lambda^{1/2} W_x x_n + \Lambda^{1/2} W_y y_n$, $\Lambda = \mathrm{diag}(\lambda_i)$, and the rows of $W_x$, $W_y$ are the SemiCCA directions $w_{x_i}^{\top}$, $w_{y_i}^{\top}$.

Tag propagation (multi-label SSKDE) iteratively updates the tag-confidence matrix $F$ with $F_{ij} \approx p(y_j \mid z_i)$:
$F^{(k)}_{ij} = t\, y^{(i)}_j + (1 - t)\, F^{(k-1)}_{ij}$ for tagged $z_i$,
$F^{(k)}_{ij} = t\, F^{(0)}_{ij} + (1 - t)\, F^{(k-1)}_{ij}$ for untagged $z_i$,
where $F^{(0)}_{ij}$ is initialized from the given tags (blended with the overall tag frequency $N_j / N$) for tagged samples and set to 0 for untagged samples.
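A rough sketch of the propagation/rectification updates above, assuming a row-stochastic k-nearest-neighbor neighboring matrix P and a clamping rate t (the exact update schedule in the poster may differ slightly):

```python
import numpy as np

def propagate_tags(P, Y, tagged_mask, t=0.5, n_iters=50):
    """P: (N, N) row-normalized neighboring matrix,
    Y: (N, D) given binary tags (zeros for untagged rows),
    tagged_mask: (N,) bool. Returns F with F[i, j] ~ p(y_j | z_i)."""
    F0 = Y.astype(float).copy()                      # initial state F^(0)
    F = F0.copy()
    for _ in range(n_iters):
        F = P @ F                                    # propagate over neighbors
        # rectification: tagged rows pulled back toward their given tags,
        # untagged rows toward the initial state, as in the updates above
        F[tagged_mask] = t * Y[tagged_mask] + (1 - t) * F[tagged_mask]
        F[~tagged_mask] = t * F0[~tagged_mask] + (1 - t) * F[~tagged_mask]
    return F
```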


SemiCCA maximizes the correlation between the audio feature and the tag feature in the learned topic space and, unlike standard CCA, can also utilize untagged samples. The densities $p(x \mid z)$ and $p(y \mid z)$ are then estimated in this topic space.

Use a Sparse Matrix as the Neighboring Matrix

Connect each sample only to its nearest 3 samples, with Gaussian kernel weights, to compute the posterior probability of $y$ given $z_n$.
• Fixing the number of non-zero elements in each row reduces the memory required to hold the neighboring matrix.
• In this case, SSKDE is equivalent to a graph spectral method [Joachims 2003].
• Applying the same idea to audio tagging and retrieval, the calculation needs only a few samples near the query: the annotation/retrieval task becomes equivalent to a nearest-neighbor search problem (the computational complexity drops from O(N) to about O(log N)).
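A rough sketch of the sparse neighboring matrix above (the bandwidth h is an assumption): connect each sample to its 3 nearest neighbors, weight the edges with a Gaussian kernel, and row-normalize so the matrix can drive the tag-propagation step.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import normalize

def sparse_neighboring_matrix(Z, k=3, h=1.0):
    """Z: (N, K) topic-space coordinates. Returns a row-stochastic CSR matrix."""
    G = kneighbors_graph(Z, n_neighbors=k, mode='distance')   # sparse distances
    G.data = np.exp(-G.data ** 2 / (2.0 * h ** 2))            # Gaussian weights
    return normalize(G, norm='l1', axis=1)                    # rows sum to 1
```

With only k non-zero entries per row, annotating a new query touches the kernel weights of just its few nearest neighbors, which is what makes the nearest-neighbor-search view above possible.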


[Precision-recall plot: Recall on the x-axis (0.2 to 0.6), Precision on the y-axis (0 to 0.5). Curves: CCA + KDE (N=1912) = upper bound, SSCDE (N=1000, Nx=912) = proposed, SemiCCA + KDE (N=1000, Nx=912), CCA + KDE (N=1000) = baseline.]
