AUTOMATIC AUDIO TAG CLASSIFICATION
VIA SEMI-SUPERVISED CANONICAL DENSITY ESTIMATION
Jun Takagi† Yasunori Ohishi‡ Akisato Kimura‡ Masashi Sugiyama† Makoto Yamada† Hirokazu Kameoka‡
†Graduate School of Information Science and Engineering, Tokyo Institute of Technology
‡NTT Communication Science Laboratories, NTT Corporation
Contact information
E-mail: [email protected]
Overview
Apply a semi-supervised method to audio tagging/retrieval.
Utilize untagged samples to reduce the number of expensive tagged samples needed.
What is Audio Tagging/Retrieval?
Audio Tagging
Tagging by hand is laborious work.
"Tag this audio file!" → exciting, rock, powerful, passionate
Reduce the cost of this laborious task.
Query: Audio Signal, Output: Tags
Audio Retrieval
"I'm now looking for exciting and powerful rock music!" → "These songs must match your request."
Search audio data by contents.
Query: Tags, Output: Audio Signals
SSCDE can cope with both tasks in the same framework!
Technical Challenges: Use Inexpensive Untagged Audio
Audio samples having high-quality tags are very expensive! Only a few tagged samples are available.
(Example tags in the figure: Hammer, Repeat, Tools, Metal, Bell, Digital)
Untagged samples are easy to collect at low cost! A large number of untagged samples is available.
A semi-supervised method utilizes inexpensive untagged audio samples!
Technical Challenges: Use Tag Co-occurrence Information
Tag co-occurrence information seems to be useful for the tagging/retrieval task, but almost all existing methods cannot utilize it.
Typical approach
Train a dedicated classifier for each tag separately.
Linear regression [Whiteman+ 2004]
Hierarchical GMM [Turnbull+ 2008]
Drawback: Difficult to use co-occurrence information of the tags.
Topic model based approach
pLSA: NLP [Hofmann 1999], Image [Barnard+ 2001]
LDA: NLP [Blei+ 2003], Image [Li+ 2005]
→ How to incorporate the co-occurrence information into the classifier?
(Labels in the figure: Bass Guitar, Guitar, Piano, Noise, Rough, Jazz, Rock, Punk rock)
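Design a probabilistic generative model for audio signals and tags.
(Figure: a topic is linked to both the tag feature and the audio feature. Labels in the figure: Bass Guitar, Guitar, Piano, Noise, Rough, Jazz, Rock, Punk rock)
Merits:
1. Tag co-occurrence is considered in the model.
2. Scalable to many features.
Example of binary tag features (1 = the tag is attached to the audio):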
Audio ID 1 2 3
Rock 1 0 0
Powerful 1 0 0
Guitar 1 1 0
Digital 0 1 1
Metal 0 0 1
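To make "tag co-occurrence" concrete, the sketch below counts how often two tags appear on the same audio sample, using the 0/1 table for Audio IDs 1 to 3 above. The helper name `cooccurrence` is hypothetical and the poster does not prescribe this computation; it only illustrates the kind of information a per-tag classifier ignores.

```python
import numpy as np

tags = ["Rock", "Powerful", "Guitar", "Digital", "Metal"]

# Rows: tags, columns: Audio IDs 1..3 (values copied from the table above).
Y = np.array([
    [1, 0, 0],  # Rock
    [1, 0, 0],  # Powerful
    [1, 1, 0],  # Guitar
    [0, 1, 1],  # Digital
    [0, 0, 1],  # Metal
])

# Entry (a, b) is the number of audio samples that carry both tag a and tag b.
cooc = Y @ Y.T


def cooccurrence(tag_a, tag_b):
    """Return the co-occurrence count of two tags (hypothetical helper)."""
    return int(cooc[tags.index(tag_a), tags.index(tag_b)])


if __name__ == "__main__":
    print(cooccurrence("Rock", "Powerful"))
    print(cooccurrence("Rock", "Metal"))
```

In this toy table, Rock and Powerful always appear together while Rock and Metal never do, which is the kind of structure a shared topic space can exploit.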
(Figure: a dataset of audio signals with tags such as Rock, Powerful, Guitar, Digital, Metal, and Exciting, together with untagged audio signals, is mapped to latent topics.)
Rock 0.1 0.1 0.1 0.9 0.1 0.9 0.1 0.1
Powerful 0.9 0.2 0.2 0.2 0.9 0.2 0.9 0.9
Guitar 0.1 0.9 0.1 0.1 0.1 0.9 0.9 0.1
Digital 0.9 0.1 0.1 0.1 0.1 0.1 0.9 0.9
Metal 0.1 0.9 0.1 0.1 0.1 0.1 0.1 0.9
SSCDE: Model Learning Framework
1. Extract features from audio signals and tags.
2. Generate topic space by SemiCCA.
3. Learn audio-topic model by kernel density estimation.
4. Learn tag-topic model by Multi-label SSKDE.
5. Use the learned model for annotation and retrieval.
Annotation:
\[ \hat{y} = \arg\max_{y \in \{0,1\}^{D_y}} p(y \mid x_{\mathrm{given}}) \]
Retrieval:
\[ \hat{x} = \arg\max_{x \in \mathrm{DB}} p(x \mid y_{\mathrm{given}}) \]
N = number of tagged samples
N_x = number of untagged samples
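Technical Points of SSCDE
Generative model with latent topic z (x: audio feature, y: tag feature; D_x, D_y: dimensions of x and y):
\[ p(x, y) = \int_z p(z)\, p(x \mid z)\, p(y \mid z)\, dz \]
Annotation and retrieval for a query x_q or y_q:
\[ \hat{y} = \arg\max_{y} p(y \mid x_q) = \arg\max_{y} p(x_q, y) \]
\[ \hat{x} = \arg\max_{x \in \mathrm{DB}} p(x \mid y_q) = \arg\max_{x \in \mathrm{DB}} p(x, y_q) \]
1. Learn topic space with tagged and untagged samples: SemiCCA [Kimura+ ICPR2010]
SemiCCA combines CCA (tagged samples) with PCA (tagged and untagged samples).
Standard CCA:
\[ \begin{bmatrix} 0 & S_{xy} \\ S_{yx} & 0 \end{bmatrix} \begin{bmatrix} w_x \\ w_y \end{bmatrix} = \lambda \begin{bmatrix} S_{xx} & 0 \\ 0 & S_{yy} \end{bmatrix} \begin{bmatrix} w_x \\ w_y \end{bmatrix} \]
SemiCCA:
\[ B \begin{bmatrix} w_x \\ w_y \end{bmatrix} = \lambda\, C \begin{bmatrix} w_x \\ w_y \end{bmatrix} \]
\[ B = \beta \begin{bmatrix} 0 & S^{(T)}_{xy} \\ S^{(T)}_{yx} & 0 \end{bmatrix} + (1 - \beta) \begin{bmatrix} S_{xx} & 0 \\ 0 & S_{yy} \end{bmatrix} \]
\[ C = \beta \begin{bmatrix} S^{(T)}_{xx} & 0 \\ 0 & S^{(T)}_{yy} \end{bmatrix} + (1 - \beta) \begin{bmatrix} I_{D_x} & 0 \\ 0 & I_{D_y} \end{bmatrix} \]
Here S^{(T)} denotes covariance matrices computed from the tagged samples only.
Solution: generalized eigenproblem.
Topic of the n-th sample:
\[ z_n = M_x W_x x_n + M_y W_y y_n \]
2. Propagate tag information: Multi-label SSKDE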
Multi-label SSKDE: multi-label version of semi-supervised kernel density estimation (SSKDE [Wang+ CVIU2009]).
Initial state: tagged samples (color = tag) and untagged samples.
Propagation: tag confidence is propagated through the neighboring matrix.
Rectification: rectify posteriors to reflect tag information.
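The propagation step can be sketched as a simple fixed-point iteration. The poster's update reads F^(k) = t F^(0) + (1 - t) F^(k-1); the sketch below additionally assumes that the second term is averaged over neighbors through a row-normalized neighboring matrix P, which the extracted formula does not show explicitly. The function names, the Gaussian bandwidth `sigma`, and the default values are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def knn_neighbor_matrix(Z, k=3, sigma=1.0):
    """Row-normalized Gaussian-kernel weights to the k nearest samples.

    Assumes k < len(Z). Each row has at most k non-zero entries.
    """
    n = len(Z)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor
    P = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(d2[i])[:k]
        # Subtracting the minimum distance keeps exp() away from underflow
        # and does not change the normalized weights.
        w = np.exp(-(d2[i, nn] - d2[i, nn].min()) / (2.0 * sigma ** 2))
        P[i, nn] = w / w.sum()
    return P


def propagate_tags(F0, P, t=0.5, n_iter=100):
    """Iterate F <- t * F0 + (1 - t) * P @ F (assumed neighbor averaging)."""
    F = F0.copy()
    for _ in range(n_iter):
        F = t * F0 + (1.0 - t) * (P @ F)
    return F


if __name__ == "__main__":
    # Three samples on a chain; only the first one is tagged.
    P = np.array([[0.0, 1.0, 0.0],
                  [0.5, 0.0, 0.5],
                  [0.0, 1.0, 0.0]])
    F0 = np.array([[1.0], [0.0], [0.0]])
    print(propagate_tags(F0, P).ravel())
```

In the chain example, the untagged samples receive a confidence that decays with their distance from the tagged one, which is the qualitative behavior the "Propagation" panel describes.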
Experiment
Annotation performance of SSCDE is evaluated under the following conditions.
• Dataset: 2012 audio files taken from “Freesound” (http://www.freesound.org/).
– Database of Creative Commons licensed sounds.
– Annotated with vocabulary.
• Audio feature: bag-of-feature-vectors extracted by the following process.
1. Audio signals are split into half-overlapping 23 ms windows, and a 39-dimensional vector (the first 13 MFCCs, MFCC-Δ, and MFCC-ΔΔ) is extracted from each window.
2. 500 vectors are sampled from each audio signal (about 1,000,000 vectors in total). The LBG algorithm (Linde-Buzo-Gray algorithm, an algorithm for vector quantization, VQ) is applied to them, and a VQ codebook (size: 1024) is obtained.
3. A 1024-dimensional normalized vector representing each audio signal is created by VQ.
• Tag feature: 230-dimensional binary vector. (Each element of the vector corresponds to a specific tag.)
• Evaluation condition:
– 2012 audio clips in WAV format.
– Each clip has multiple tags selected from 230 words.
Tagged audio signal    Untagged audio signal    Evaluation sample
1000                   912                      100
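The last step of the audio feature process (turning the per-window vectors of one clip into a single normalized codeword histogram) can be sketched as follows. This is a minimal illustration that assumes a codebook is already available and that "normalized" means the histogram sums to 1; the poster does not state which normalization was used, and the function name `vq_histogram` is hypothetical.

```python
import numpy as np


def vq_histogram(frames, codebook):
    """Bag-of-feature-vectors for one clip.

    frames:   (T, D) array, one feature vector per window (D = 39 on the poster).
    codebook: (K, D) array of VQ codewords (K = 1024 on the poster).
    Returns a length-K histogram that sums to 1 (assumed L1 normalization).
    """
    # Squared Euclidean distance between every frame and every codeword.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # index of the closest codeword per frame
    counts = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return counts / counts.sum()


if __name__ == "__main__":
    codebook = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
    frames = np.array([[0.1, -0.1], [0.2, 0.0], [9.8, 0.3]])
    print(vq_histogram(frames, codebook))
```

For long clips the full distance matrix can be large, so a real pipeline would likely process frames in chunks; that detail is left out here.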
Result
SSCDE successfully improved the performance by utilizing untagged samples.
Upper bound: all training samples are used as tagged samples. Our goal is to approach this performance with fewer tagged samples.
N: number of tagged samples
N_x: number of untagged samples
Audio-topic model
\[ p(z) \approx \frac{1}{N} \sum_{n=1}^{N} \delta(z - z_n) \]
\[ p(x, y) \approx \frac{1}{N} \sum_{n=1}^{N} p(x \mid z_n)\, p(y \mid z_n) \]
\[ z_n = M_x W_x x_n + M_y W_y y_n \]
\[ z_n = \Lambda^{1/2} W_x x_n + \Lambda^{1/2} W_y y_n, \qquad \Lambda = \mathrm{diag}(\lambda_i) \]
Here \delta(\cdot) is the delta function, and W_x, W_y are the matrices with rows w_{xi}^\top, w_{yi}^\top.
Tag-topic model (Multi-label SSKDE)
\[ F^{(0)}_{ij} = \begin{cases} t \mu\, y_{i,j} + (1 - \mu) \dfrac{N_j}{N} & (z_i \text{ tagged}) \\ 0 & (z_i \text{ untagged}) \end{cases} \]
\[ F^{(k)}_{ij} = t F^{(0)}_{ij} + (1 - t) F^{(k-1)}_{ij} \]
\[ F_{ij} \approx p(y_j \mid z_i) \]
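The step from the integral form of the generative model to the sample average is short, and spelling it out may help. The derivation below only uses the delta-function approximation of p(z) stated on the poster and the sifting property of the delta function; it is a worked restatement, not additional material from the authors.

```latex
\begin{align*}
p(x, y) &= \int_z p(z)\, p(x \mid z)\, p(y \mid z)\, dz \\
        &\approx \int_z \Bigl[ \frac{1}{N} \sum_{n=1}^{N} \delta(z - z_n) \Bigr]
           p(x \mid z)\, p(y \mid z)\, dz \\
        &= \frac{1}{N} \sum_{n=1}^{N} \int_z \delta(z - z_n)\,
           p(x \mid z)\, p(y \mid z)\, dz \\
        &= \frac{1}{N} \sum_{n=1}^{N} p(x \mid z_n)\, p(y \mid z_n)
\end{align*}
```

In other words, once the topics z_n of the training samples are fixed, evaluating the joint density only requires the audio-topic term p(x | z_n) and the tag-topic term p(y | z_n) at those N points.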
(Figure: the audio feature and the tag feature are projected to the topic so as to maximize correlation.)
Learn the map to the topic space: SemiCCA can utilize untagged samples, unlike standard CCA.
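A minimal NumPy sketch of the SemiCCA generalized eigenproblem B w = λ C w is given below, following the B and C matrices shown on the poster. Several details are assumptions made for illustration: covariances are computed after mean-centering, S_yy is taken from the tagged samples because only they have tags, β is kept below 1 so that C is positive definite, and the function names are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def _cov(A, B):
    """Cross-covariance of two row-aligned sample matrices (mean-centered)."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return A.T @ B / len(A)


def semicca_matrices(X_t, Y_t, X_u, beta):
    """Build B and C from tagged pairs (X_t, Y_t) and untagged audio X_u."""
    dx, dy = X_t.shape[1], Y_t.shape[1]
    X_all = np.vstack([X_t, X_u])
    Sxx_t, Syy_t, Sxy_t = _cov(X_t, X_t), _cov(Y_t, Y_t), _cov(X_t, Y_t)
    Sxx = _cov(X_all, X_all)  # PCA part: tagged and untagged audio
    Syy = Syy_t               # assumption: tags exist only for tagged samples
    Zxy = np.zeros((dx, dy))
    B = beta * np.block([[np.zeros((dx, dx)), Sxy_t],
                         [Sxy_t.T, np.zeros((dy, dy))]]) \
        + (1.0 - beta) * np.block([[Sxx, Zxy], [Zxy.T, Syy]])
    C = beta * np.block([[Sxx_t, Zxy], [Zxy.T, Syy_t]]) \
        + (1.0 - beta) * np.eye(dx + dy)
    return B, C


def semicca(X_t, Y_t, X_u, beta=0.5, n_topics=2):
    """Return (W_x, W_y, eigenvalues) for the top n_topics solutions."""
    B, C = semicca_matrices(X_t, Y_t, X_u, beta)
    L = np.linalg.cholesky(C)          # C = L L^T, valid for beta < 1
    A = np.linalg.solve(L, B)          # L^{-1} B
    M = np.linalg.solve(L, A.T).T      # L^{-1} B L^{-T}
    M = (M + M.T) / 2.0                # remove round-off asymmetry
    lam, V = np.linalg.eigh(M)
    order = np.argsort(lam)[::-1][:n_topics]
    W = np.linalg.solve(L.T, V[:, order])  # w = L^{-T} v
    dx = X_t.shape[1]
    return W[:dx].T, W[dx:].T, lam[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_t = rng.normal(size=(20, 4))
    Y_t = (rng.random((20, 3)) > 0.5).astype(float)
    X_u = rng.normal(size=(30, 4))
    W_x, W_y, lam = semicca(X_t, Y_t, X_u)
    print(W_x.shape, W_y.shape, lam)
```

With β close to 1 the problem approaches standard CCA on the tagged pairs, and with β close to 0 it approaches PCA on all available samples, which is how the untagged audio enters the solution.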
Use Sparse Matrix as Neighboring Matrix
Estimate the posterior probability of y given z_n with a Gaussian kernel \kappa:
\[ P(y_d \mid z_n) = \frac{\sum_{i=1}^{N} P(y_d \mid z_i)\, \kappa(z_n - z_i)}{\sum_{i=1}^{N} \kappa(z_n - z_i)} \]
(Figure: each sample is connected to its nearest 3 samples.)
• Fix the number of non-zero elements in each row; the memory required to hold the neighboring matrix then decreases (O(N)).
• In this case, SSKDE is equivalent to a graph spectral method [Joachims 2003].
• Apply the same idea to audio tagging and retrieval; the calculation then needs only a few samples near the query!
The annotation/retrieval task becomes equivalent to a neighbor search problem!
(Computational complexity is O(N) ~ O(log N)!)
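The idea that annotation needs only a few samples near the query can be sketched as a kernel-weighted average over the k nearest topics. The code below assumes a Gaussian kernel with a hypothetical bandwidth `sigma`, a brute-force neighbor search, and a matrix `F` whose rows hold the tag posteriors of the training samples; none of these names come from the poster.

```python
import numpy as np


def knn_tag_posterior(z_query, Z, F, k=3, sigma=1.0):
    """Kernel-weighted tag posterior of a query from its k nearest samples.

    z_query: (D,) topic vector of the query.
    Z:       (N, D) topic vectors of the training samples.
    F:       (N, T) tag posteriors of the training samples.
    """
    d2 = ((Z - z_query) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]  # brute-force search; an index structure could replace it
    # Shifting by the smallest distance avoids underflow and leaves the
    # normalized weights unchanged.
    w = np.exp(-(d2[nn] - d2[nn].min()) / (2.0 * sigma ** 2))
    return w @ F[nn] / w.sum()


if __name__ == "__main__":
    Z = np.array([[0.0], [1.0], [10.0]])
    F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(knn_tag_posterior(np.array([0.5]), Z, F, k=2))
```

Only the k selected rows of `F` enter the result, so the cost of one query is dominated by the neighbor search itself.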
(Figure: precision-recall curves; axes: Recall and Precision. Curves: CCA + KDE (N=1912), upper bound; SSCDE (N=1000, Nx=912), proposed; SemiCCA + KDE (N=1000, Nx=912); CCA + KDE (N=1000), baseline.)