

A Transformer-based Audio Captioning Model with Keyword Estimation

Yuma Koizumi, Ryo Masumura, Kyosuke Nishida, Masahiro Yasuda, and Shoichiro Saito

NTT Corporation, Japan
[email protected]

Abstract
One of the problems with automated audio captioning (AAC) is the indeterminacy in word selection corresponding to the audio event/scene. Since one acoustic event/scene can be described with several words, it results in a combinatorial explosion of possible captions and difficulty in training. To solve this problem, we propose a Transformer-based audio-captioning model with keyword estimation called TRACKE. It simultaneously solves the word-selection indeterminacy problem with the main task of AAC while executing the sub-task of acoustic event detection/acoustic scene classification (i.e., keyword estimation). TRACKE estimates keywords, which comprise a word set corresponding to audio events/scenes in the input audio, and generates the caption while referring to the estimated keywords to reduce word-selection indeterminacy. Experimental results on a public AAC dataset indicate that TRACKE achieved state-of-the-art performance and successfully estimated both the caption and its keywords.
Index Terms: automated audio captioning, keyword estimation, audio event detection, and Transformer.

1. Introduction
Automated audio captioning (AAC) is an intermodal translation task that translates an input audio clip into a description in natural language [1–5]. In contrast to automatic speech recognition (ASR), which converts speech to text, AAC converts environmental sounds to text. This task potentially raises the level of automatic understanding of a sound environment from merely tagging events [6, 7] (e.g. alarm), scenes [8] (e.g. kitchen), and conditions [9] (e.g. normal/anomaly) to higher contextual information including concepts, physical properties, and high-level knowledge. For example, a smart speaker with an AAC system would be able to output “a digital alarm in the kitchen has gone off three times,” and might give us more intelligent recommendations such as “turn the gas range off.”

One of the problems with AAC is the existence of many possible captions that correspond to an input. In ASR, a set of phonemes in a speech signal corresponds almost one-to-one to a word. In contrast, one acoustic event/scene can be described with several words, such as {car, automobile, vehicle, wheels} and {road, roadway, intersection, street}. Such indeterminacy in word selection leads to a combinatorial explosion of possible answers, making it almost impossible to estimate the ground truth and difficult to train an AAC system.

To reduce the indeterminacy in word selection, conventional AAC setups allow the use of keywords related to acoustic events/scenes [4, 5]. The audio samples in the AudioCaps dataset [4] are parts of the Audio Set [10], and their captions are annotated while referring to the Audio Set labels. Therefore, automatic text generation while referring to keywords (e.g. Audio Set labels) may restrict the solution space and should be effective in reducing word-selection indeterminacy.

Figure 1: Overview of the training procedure of TRACKE. A rule-based keyword extraction (Sec. 3.3) derives ground-truth keywords from the training caption, and TRACKE (Secs. 3.1 and 3.2) is trained with the captioning loss L^{cap}_{θ_e,θ_d} on the estimated caption and the keyword loss L^{key}_{θ_e,θ_m} on the estimated keywords.

Unfortunately, in some real-world applications such as using a smart speaker, it is difficult to provide such keywords in advance. For example, to output the caption “a digital alarm in the kitchen has gone off three times,” conventional AAC systems require keywords related to the acoustic events/scenes, such as {alarm, kitchen}. However, if the user can input such keywords, the user should already know the sound environment without any captions. This dilemma means that we need to solve the word-selection indeterminacy problem of AAC while simultaneously executing the traditional sub-task of acoustic event detection (AED) [6, 7]/acoustic scene classification (ASC) [8].^1

We propose a Transformer [12]–based audio captioning model with keyword estimation called TRACKE, which simultaneously solves the word-selection indeterminacy problem of AAC while executing the AED/ASC sub-task (i.e. keyword estimation). Figure 1 shows an overview of the training procedure of TRACKE. TRACKE’s encoder has a branch for keyword estimation, and its decoder generates captions while referring to the estimated keywords to reduce word-selection indeterminacy. In the training phase, a set of ground-truth keywords is extracted from the ground-truth caption, and the branch is trained to minimize the estimation error of the keywords. A summary of our contributions is as follows.

1. We decompose AAC into a combined task of caption generation and keyword estimation, and keyword estimation is executed by adopting a weakly supervised polyphonic AED strategy [13].

2. This is the first study to adopt the Transformer [12] for AAC.^2 We also extend the Transformer to simultaneously solve the word-selection indeterminacy problem of AAC and the related AED/ASC sub-task.

2. Preliminaries of audio captioning
AAC is a task of translating an input audio sequence (φ_1, ..., φ_T) into a word sequence (w_1, ..., w_N). Here, φ_t ∈ R^{D_x} is a set of acoustic features at time index t, and T is the length of the input sequence.

^1 A conventional method [4] uses ASC-aware acoustic features such as the bottleneck feature of VGGish [11]. In contrast, we attempt to solve the word-selection indeterminacy problem of AAC explicitly by using the AED/ASC sub-task.

^2 The use of a Transformer in AED/ASC tasks has been investigated [14, 15].



The output of AAC, w_n ∈ ℕ, denotes the index of the n-th word in the word vocabulary, and N is the length of the output sequence.

Previous studies addressed AAC using a sequence-to-sequence model (seq2seq) [16, 17]. First, the encoder E embeds the input sequence into a feature space as ν. Here, ν can be either a fixed-dimension vector or a hidden feature sequence. Then, the decoder D recursively predicts the posterior probability of the n-th word given the input and the 1st to (n−1)-th outputs as

ν = E_{θ_e}(φ_1, ..., φ_T),   (1)
p(w_n | ν, w_{n−1}) = D_{θ_d}(ν, w_{n−1}),   (2)

where θ_e and θ_d are the sets of parameters of E and D, respectively, w_{n−1} = (w_1, ..., w_{n−1}), and w_n is estimated from the posterior using beam-search decoding.
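The formulation in (1)–(2) maps directly onto an encoder/decoder interface with an autoregressive decoding loop. The sketch below is a minimal illustration of that interface, not the authors' code: the encoder and decoder are toy stand-ins, the BOS/EOS indices are assumptions, and greedy decoding is shown where the paper uses beam search.

```python
# Minimal sketch of the seq2seq interface of Eqs. (1)-(2); toy modules only.
import torch
import torch.nn as nn

D_x, D_f, vocab_size = 128, 100, 2145  # dimensions taken from the paper

class ToyEncoder(nn.Module):            # stands in for E_{theta_e}
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_x, D_f)
    def forward(self, phi):             # phi: (T, D_x) acoustic features
        return self.proj(phi)           # nu: (T, D_f)

class ToyDecoder(nn.Module):            # stands in for D_{theta_d}
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, D_f)
        self.out = nn.Linear(D_f, vocab_size)
    def forward(self, nu, w_prev):      # w_prev: indices of w_1 .. w_{n-1}
        # crude mean pooling, for illustration only
        query = self.embed(w_prev).mean(dim=0) + nu.mean(dim=0)
        return torch.softmax(self.out(query), dim=-1)  # p(w_n | nu, w_{n-1})

BOS, EOS = 0, 1  # assumed special-token indices

def greedy_decode(encoder, decoder, phi, max_len=20):
    nu = encoder(phi)                                 # Eq. (1)
    words = [BOS]
    for _ in range(max_len):
        posterior = decoder(nu, torch.tensor(words))  # Eq. (2)
        w_n = int(posterior.argmax())
        words.append(w_n)
        if w_n == EOS:
            break
    return words[1:]

print(greedy_decode(ToyEncoder(), ToyDecoder(), torch.randn(50, D_x)))
```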

As mentioned above, one of the problems with AAC is indeterminacy in word selection. Since one acoustic event/scene can be described with several words, the number of possible captions becomes huge due to combinatorial explosion. To reduce such indeterminacy, previous studies used meta information such as keywords [4, 5]. We define m = {m_k ∈ ℕ}_{k=1}^{K} as a set of keywords, where K is the number of keywords. By passing m to the decoder, it is expected that m works as an attention factor to select the keyword from the possible words corresponding to the acoustic event/scene. Thus, (2) can be rewritten as

p(w_n | ν, m, w_{n−1}) = D_{θ_d}(ν, m, w_{n−1}).   (3)

3. Proposed Model
In real-world applications, there are not many use-cases for AAC systems that require keywords. If the user can input such keywords, he/she should know the sound environment without any captions. To expand the use-cases of AAC, TRACKE generates a caption while estimating its keywords from the input audio. Sections 3.1 and 3.2 give an overview and the details of TRACKE, respectively, and Section 3.3 describes the procedure for extracting ground-truth keywords from the ground-truth caption.

3.1. Model overview

Figure 2 shows the architecture of TRACKE. The components of the encoder and decoder are the same as those of the original Transformer [12], but the number of stacks and the hidden dimensions differ. We use the bottleneck feature of VGGish [11] (D_x = 128) for audio embedding and fastText [18] trained on the Common Crawl corpus (D_w = 300) for caption-word and keyword embedding, respectively. Since the dimensions of the audio features and the word embeddings differ, we use two linear layers to adjust the dimensions of the audio and word/keyword embeddings to D_f = 100, which is the hidden dimension of the encoder/decoder.
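As a concrete reading of this embedding front-end, the following sketch (an assumption about the implementation, not the released model) projects VGGish audio embeddings and fastText word embeddings to the shared hidden size D_f = 100.

```python
# Two projection layers: audio (128-d) and word/keyword (300-d) -> D_f = 100.
import torch
import torch.nn as nn

D_x, D_w, D_f = 128, 300, 100

audio_proj = nn.Linear(D_x, D_f)  # audio dimension reduction
word_proj = nn.Linear(D_w, D_f)   # word/keyword dimension reduction

vggish_frames = torch.randn(50, D_x)    # T = 50 VGGish frames (dummy values)
fasttext_tokens = torch.randn(12, D_w)  # N = 12 caption tokens (dummy values)

encoder_in = audio_proj(vggish_frames)   # (T, D_f) -> encoder input
decoder_in = word_proj(fasttext_tokens)  # (N, D_f) -> decoder input
print(encoder_in.shape, decoder_in.shape)
```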

In TRACKE, the size of the encoder output ν is D_f × T. ν is passed to the keyword-estimation branch M as

m̂ = M_{θ_m}(ν),   (4)

where m̂ = {m̂_k ∈ ℕ}_{k=1}^{K} is the set of estimated keywords and θ_m is the set of parameters of M. First, to input m̂ to D, m̂ is embedded into the feature space using the fastText word embedding.

Figure 2: (a) Architecture of TRACKE and (b) details of the keyword-estimation branch M.

To adjust the feature dimension, the embedded keywords are then passed to the linear layer for dimension reduction of words/keywords. The output, of size R^{D_f × K}, is concatenated with ν. Finally, the concatenated feature R^{D_f × (T+K)} is used as the key and value of the multi-head attention layers in D, and the decoder estimates the posterior of the n-th word, in the same way as (3), as

p(w_n | ν, m̂, w_{n−1}) = D_{θ_d}(ν, m̂, w_{n−1}).   (5)
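The following sketch illustrates one plausible realization of (5): the keyword embeddings are projected to D_f and concatenated with ν along the time axis, and the resulting (T + K)-length memory serves as key/value for the decoder's cross-attention. The shapes and the use of torch.nn.MultiheadAttention are assumptions for illustration, not the authors' implementation.

```python
# Build the (T + K)-length memory from encoder output and keyword embeddings.
import torch
import torch.nn as nn

D_w, D_f, T, K = 300, 100, 50, 5

word_proj = nn.Linear(D_w, D_f)          # shared word/keyword dimension reduction
cross_attn = nn.MultiheadAttention(embed_dim=D_f, num_heads=4)

nu = torch.randn(T, 1, D_f)              # encoder output (T, batch=1, D_f)
keyword_emb = torch.randn(K, 1, D_w)     # fastText vectors of the K estimated keywords (dummy)

memory = torch.cat([nu, word_proj(keyword_emb)], dim=0)   # (T + K, 1, D_f)

# One decoder cross-attention step: previous-word representations attend over
# the concatenated audio + keyword memory (used as key and value).
decoder_queries = torch.randn(10, 1, D_f)                 # n - 1 = 10 previous words (dummy)
attended, attn_weights = cross_attn(decoder_queries, memory, memory)
print(attended.shape, attn_weights.shape)                 # (10, 1, D_f), (1, 10, T + K)
```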

3.2. Keyword-estimation branch

Let C be the size of the keyword vocabulary and m be the set of keywords extracted from the ground-truth caption (described in Section 3.3). M estimates m, that is, whether the input audio includes audio events/scenes corresponding to keywords in the keyword vocabulary.

The duration of each event/scene differs; e.g., a passing-train sound is long, while a dog bark is short. Thus, as in polyphonic AED [19, 20], it would be better to estimate whether the pre-defined c-th event/scene has happened at each time index t. However, the given keyword labels are weak; start and stop time indexes are not given. Therefore, we carry out keyword estimation with the weakly supervised polyphonic AED strategy [13] by (i) estimating the posterior of each event at each t, p(z_{c,t} | ν), and then (ii) aggregating these posteriors over all t into p(z_c | ν). Then, the K most likely events/scenes (i.e. keywords) are selected.

First is the posterior-estimation step: M estimates the posterior of the c-th keyword at time t as

Z = sigmoid(Linear(ReLU(Linear(ν)))),   (6)

where Z ∈ [0, 1]^{C×T} and its (c, t) element is p(z_{c,t} | ν). Next is the posterior-aggregation step. We use a global max-pooling strategy, taking the maximum value rather than the average over time so as to account for the difference in duration of each event/scene:

p(z_c | ν) ≈ max_t [p(z_{c,t} | ν)].   (7)


Then, p(z_c | ν) is sorted in descending order, and the top-K keywords with the highest posteriors are selected as m̂.

Note that the estimated order of the top-K keywords has no effect on text generation because positional encoding is not applied to the embedding vectors of m̂. In addition, the computational graph is not connected from M to D because the sorting and top-K selection after (7) are not differentiable. Therefore, the text-generation loss is not back-propagated to the two linear layers in (6), i.e., the update of θ_m is affected only by the accuracy of keyword estimation.
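A minimal sketch of the keyword-estimation branch M described in (6)–(7) is given below; the hidden size of the first linear layer is an assumption, and the top-K selection is written out explicitly to make the non-differentiability point concrete.

```python
# Keyword-estimation branch: two linear layers + ReLU/sigmoid, max pooling, top-K.
import torch
import torch.nn as nn

C, D_f, T, K = 50, 100, 60, 5
D_hidden = 100  # hidden size of the first linear layer: an assumption

class KeywordBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(D_f, D_hidden)
        self.fc2 = nn.Linear(D_hidden, C)

    def forward(self, nu):                  # nu: (T, D_f) encoder output
        # Eq. (6); stored here as (T, C), i.e. the transpose of Z in the text
        z = torch.sigmoid(self.fc2(torch.relu(self.fc1(nu))))
        p_zc = z.max(dim=0).values          # Eq. (7): global max pooling over t -> (C,)
        topk = torch.topk(p_zc, K).indices  # sort & select the top-K keywords
        # topk holds integer indices, so no gradient flows through the selection:
        # the caption loss cannot update the two linear layers above.
        return z, p_zc, topk

Z_t, p_zc, est_keywords = KeywordBranch()(torch.randn(T, D_f))
print(est_keywords.tolist())                # indices into the keyword vocabulary
```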

3.3. Rule-based keyword extraction for training

We describe a rule-based keyword extraction for generating m. The keyword-estimation problem has been tackled as a sub-task of text summarization and comprehension, and several machine-learning-based methods have been proposed [21–25]. Since this study is the first attempt to reveal whether the use of estimated keywords is effective for AAC, we adopted a simple rule-based keyword extraction method.

We use frequent word lemmas of nouns, verbs, adjectives, and adverbs as keywords. From all captions in the training data, we first extract the words that belong to these four parts of speech. Next, these words are converted to their lemmas and counted. Then, the keyword vocabulary is constructed using the C most frequent lemmas, excluding “be”. Finally, the word lemmas that exist in the keyword vocabulary are used as the ground-truth keywords m. For example, for the ground-truth caption in the Clotho dataset [5] “A muddled noise of broken channel of the TV”, the words that belong to the four target parts of speech are {muddled, noise, broken, channel, TV}. These words are then converted to their lemmas, {muddle, noise, break, channel, TV}. Finally, the lemmas that exist in the keyword vocabulary are extracted as m.
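The rule can be written in a few lines with NLTK (the toolkit also used in Section 4.1). The sketch below is a simplified reading of the procedure, not the authors' script: the whitespace tokenization, the Penn-to-WordNet tag mapping, and the toy vocabulary are assumptions, and tagger behavior on individual words may differ from the lemmas listed above.

```python
# Rule-based keyword extraction: keep noun/verb/adjective/adverb lemmas that
# are in the keyword vocabulary, excluding the lemma "be".
import nltk
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer

for res in ('averaged_perceptron_tagger', 'wordnet', 'omw-1.4'):
    nltk.download(res, quiet=True)  # resource names may vary across NLTK versions

POS_MAP = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}  # Penn tag prefix -> WordNet POS
lemmatizer = WordNetLemmatizer()

def extract_keywords(caption, keyword_vocab):
    keywords = set()
    for word, tag in pos_tag(caption.lower().split()):
        wn_pos = POS_MAP.get(tag[0])
        if wn_pos is None:          # keep only the four target parts of speech
            continue
        lemma = lemmatizer.lemmatize(word, wn_pos)
        if lemma != 'be' and lemma in keyword_vocab:
            keywords.add(lemma)
    return keywords

# Stand-in for the C = 50 most frequent lemmas of the Clotho training captions.
toy_vocab = {'noise', 'break', 'channel', 'tv', 'muddle'}
print(extract_keywords("A muddled noise of broken channel of the TV", toy_vocab))
```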

3.4. Training procedure

TRACKE is trained to minimize two cost functions simultaneously: the captioning loss L^{cap}_{θ_e,θ_d} and the keyword-estimation loss L^{key}_{θ_e,θ_m}. For L^{cap}_{θ_e,θ_d}, we use the basic cross-entropy loss

L^{cap}_{θ_e,θ_d} = (1/N) Σ_{n=1}^{N} CE(w_n, p(w_n | ν, m̂, w_{n−1})),

where CE is the cross-entropy between the given label and the estimated posterior. For L^{key}_{θ_e,θ_m}, to avoid M always outputting the most frequent keywords, we calculate a weighted binary cross-entropy whose weights are the reciprocals of the prior probabilities:

L^{key}_{θ_e,θ_m} = −(1/C) Σ_{c=1}^{C} [λ_c z_c ln ẑ_c + γ_c (1 − z_c) ln(1 − ẑ_c)],   (8)

where ẑ_c = p(z_c | ν), and z_c = 1 if c ∈ m; otherwise, z_c = 0. Here, λ_c and γ_c are the weights λ_c = (p(z_c))^{−1} and γ_c = (1 − p(z_c))^{−1}, respectively, where p(z_c) is the prior probability of the c-th keyword, calculated as

p(z_c) = (# of c-th keyword in training captions) / (# of training captions).   (9)
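A sketch of the two objectives follows (tensor shapes, clamping, and reductions are assumptions not stated in the paper): a standard cross-entropy captioning loss, and the weighted binary cross-entropy of (8) whose per-keyword weights are the reciprocals of the priors of (9).

```python
# Captioning loss (cross-entropy) and keyword loss (weighted BCE, Eqs. (8)-(9)).
import torch
import torch.nn.functional as F

def caption_loss(word_logits, target_words):
    # word_logits: (N, vocab_size) decoder outputs; target_words: (N,) indices
    return F.cross_entropy(word_logits, target_words)  # mean over the N words

def keyword_prior(keyword_caption_counts, num_captions):
    # Eq. (9): p(z_c) = (# training captions containing keyword c) / (# captions)
    return keyword_caption_counts.float() / num_captions

def keyword_loss(p_zc, targets, prior, eps=1e-7):
    # p_zc: (C,) aggregated posteriors from Eq. (7); targets: (C,) 0/1 labels z_c
    lam = 1.0 / prior.clamp(min=eps)              # lambda_c = p(z_c)^-1
    gam = 1.0 / (1.0 - prior).clamp(min=eps)      # gamma_c  = (1 - p(z_c))^-1
    p = p_zc.clamp(eps, 1.0 - eps)
    bce = lam * targets * torch.log(p) + gam * (1.0 - targets) * torch.log(1.0 - p)
    return -bce.mean()                            # Eq. (8): -(1/C) sum over c

# Toy usage with C = 50 keywords
C = 50
prior = keyword_prior(torch.randint(1, 500, (C,)), num_captions=14465)
l_key = keyword_loss(torch.rand(C), (torch.rand(C) > 0.9).float(), prior)
l_cap = caption_loss(torch.randn(12, 2145), torch.randint(0, 2145, (12,)))
print(l_cap.item(), l_key.item())
```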

4. Experiments

4.1. Experimental setup

Dataset and metrics: We evaluated TRACKE on the Clotho dataset [5], which consists of audio clips from the Freesound platform [26] whose captions were annotated via crowdsourcing [27]. This dataset was used in a challenge task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Challenge [28]. We used the development split of 2893 audio clips with 14465 captions (i.e., each audio clip has five ground-truth captions) for training and the evaluation split of 1045 audio clips with 5225 captions for testing. From the development split, 100 audio clips and their captions were randomly selected as the validation split. We evaluated TRACKE and three other models on the same metrics used in the DCASE 2020 Challenge, i.e., BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, CIDEr, SPICE, and SPIDEr.

Training details: All captions were tokenized using the word tokenizer of the Natural Language Toolkit (NLTK) [29]. All tokens in the development dataset were then counted, and words that appeared more than five times were added to the word vocabulary. The vocabulary size was 2145, including the BOS, EOS, PAD, and UNK tokens. Part-of-speech (POS) tagging and lemmatization for keyword extraction were carried out using the POS tagger and the WordNet lemmatizer of NLTK, respectively. Then, the most frequent C = 50 lemmas were added to the keyword vocabulary. The average number of keywords per caption was 2.23, and we used K = 5 because 95% of the training samples had fewer than five keywords.
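A sketch of the word-vocabulary construction described above follows, assuming the captions are available as a list of strings and that the special-token spellings shown are placeholders.

```python
# Count NLTK word tokens and keep words appearing more than five times.
from collections import Counter
from nltk.tokenize import word_tokenize  # needs the NLTK 'punkt' resource when called

SPECIALS = ['<PAD>', '<BOS>', '<EOS>', '<UNK>']

def build_vocab(captions, min_count=6):
    counts = Counter(tok.lower() for cap in captions for tok in word_tokenize(cap))
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(SPECIALS + kept)}

# vocab = build_vocab(clotho_dev_captions)  # ~2145 entries on the Clotho development split
```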

The encoder and decoder of TRACKE are each composed of a stack of three identical layers, and each layer’s multi-head attention/self-attention has four heads. All parameters of TRACKE were initialized with random numbers drawn from N(0, 0.02) [30]. The number of hidden units was D_f = 100, and the initial and encoder/decoder dropout probabilities were 0.5 and 0.3, respectively. We used the Adam optimizer [31] with β_1 = 0.9, β_2 = 0.999, and ε = 10^{−8}, and varied the learning rate with the same formula as the original Transformer [12]. TRACKE was trained for 300 epochs with a batch size of 100, and the model with the best validation score was used as the final output.
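The learning-rate schedule of the original Transformer [12] referred to above is lrate = d_model^{−0.5} · min(step^{−0.5}, step · warmup_steps^{−1.5}). A sketch follows; using D_f = 100 as d_model and warmup_steps = 4000 are assumptions, since the paper does not report these values.

```python
# Learning-rate schedule of the original Transformer (assumed hyperparameters).
def transformer_lr(step, d_model=100, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print([f"{transformer_lr(s):.2e}" for s in (1, 1000, 4000, 20000)])
```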

Comparison methods: TRACKE (Ours) was compared with three other models:

Baseline: The baseline model of the DCASE 2020 Challenge Task 6 [1].

LSTM: A long short-term memory (LSTM)–based seq2seq model [16, 17]. E was a two-layer bidirectional LSTM, and its outputs were aggregated by an attention layer. D was a one-layer LSTM whose initial hidden state was the encoder output. The number of hidden units was 180.

Transformer: Transformer-based AAC. Its architecture is the same as that of TRACKE, except that the keyword-estimation branch was removed.

To investigate the effect of the number of keywords K, we also evaluated TRACKE with K = 10 (Ours (K = 10)), where K = 10 is larger than the maximum number of keywords per audio clip in the training data. To confirm the upper-bound performance of TRACKE, we also compared it with two other models. One is Oracle1: instead of m̂, the keywords in the meta-data of the Clotho dataset (i.e., Freesound tags) are passed to the decoder in both the training and test stages. The other is Oracle2: instead of m̂, the ground-truth m is passed to the decoder in both the training and test stages. Oracle1 gives the oracle performance when the keywords are given manually, and Oracle2 gives the oracle performance when the keyword-estimation accuracy is perfect.

4.2. Results

Table 1 shows the evaluation results on the Clotho dataset. These results suggest the following:


Table 1: Experimental results on the Clotho dataset with the DCASE 2020 Challenge metrics

Model          # of params.  B-1   B-2   B-3   B-4   CIDEr  METEOR  ROUGE-L  SPICE  SPIDEr
Baseline       4.64M         38.9  13.6   5.5   1.5    7.4     8.4     26.2    3.3     5.4
LSTM           1.12M         49.4  28.5  16.9  10.0   22.2    14.5     33.4    9.0    15.6
Transformer    1.11M         50.2  29.9  18.3  10.2   23.3    14.1     33.7    9.1    16.2
Ours (K = 10)  1.13M         49.9  29.7  18.4  10.8   23.0    14.5     34.5    9.1    16.1
Ours           1.13M         52.1  30.9  18.8  10.7   25.8    14.9     34.2    9.7    17.7
Oracle1        1.11M         53.4  32.2  20.0  11.7   27.5    15.4     35.1   10.1    18.8
Oracle2        1.11M         56.7  37.5  24.8  15.9   34.7    18.1     39.1   12.3    23.5

Figure 3: Examples of TRACKE outputs: (i) ground-truth captions (R0 and R1), estimated caption (Pred.), and estimated keywords (Est. keywords); (ii) input spectrogram; (iii) keyword posterior at each time index, p(z_{c,t} | ν); and (iv) attention matrices of the decoder.

(a) 20100422.waterfall.birds.wav
R0: A bunch of birds are chirping and singing
R1: Birds are chirping and loudly singing in the forest
Pred.: Birds are chirping and singing in the background
Est. keywords: ['bird', 'chirp', 'sing', 'distance', 'background']

(b) WasherSpinCycleWindDown4BeepEndSignal.wav
R0: A machine is running at first then slows down
R1: An airplane engine is at a high idle and slows down to a slower
Pred.: A machine is running and then stops
Est. keywords: ['machine', 'run', 'engine', 'turn', 'move']

(c) Door.wav
R0: A person opens a door with a key then he closes the door from
R1: A door is being unlatched, creaking open and being fastened again
Pred.: A person is opening and closing a door
Est. keywords: ['door', 'open', 'close', 'times', 'someone']

(i) M works effectively for AAC. TRACKE (Ours) achieved the highest score without given keywords. In addition, the BLEU-1 (ROUGE-L) score of Ours was 52.1 (34.2), while that of Oracle1, which uses manually given keywords, was 53.4 (35.1). Thus, the score of Ours was 97.6% (97.4%) of that of Oracle1, despite the fact that Ours is a fully automated audio-captioning model.

(ii) If TRACKE could estimate the keywords more accurately, performance might further improve. The oracle performance of Oracle2 was significantly higher than that of Ours. Since the keyword-estimation accuracy of Ours was 48.1%,^3 we need to improve it in future work.

(iii) If K is set too large, the use of the estimated keywords in text generation might be ineffective in reducing indeterminacy in word selection, because the scores of Transformer and Ours (K = 10) were almost the same. To further improve the performance of TRACKE, K should also be estimated from the input.

(iv) The Transformer might be effective for AAC, because Transformer was slightly better than LSTM. However, since training a Transformer requires a large-scale dataset, to affirm its effectiveness we need to evaluate it on larger-scale AAC datasets as they are developed.^4

(v) The use of pre-trained models is effective, because there were large performance gaps between Baseline and the other models, and the major difference was the use of pre-trained models such as VGGish [11] and fastText [18].

^3 The percentage of estimated keywords that were included in the ground-truth keywords.

^4 The number of training sentence pairs in natural language processing datasets for machine translation, such as the WMT 2014 English-French dataset, is 36 million.

Figure 3 shows examples of TRACKE outputs. These results suggest that indeterminate words were determined while referring to the estimated keywords, for example, (b) {machine, airplane} and (c) {close, fasten}. In addition, the posterior probabilities of the keywords imply implicit co-occurrence relationships rather than just classifying acoustic events/scenes. In (c), the posterior probability of “person” increased even though human sounds, such as speech, were not included in the input audio. This might be the result of exploiting the co-occurrence relationship that opening and closing a door is usually done by a human.

5. Conclusions
We proposed a Transformer-based audio captioning model with keyword estimation called TRACKE, which simultaneously solves the word-selection indeterminacy problem of the main task of AAC while executing the AED/ASC sub-task (i.e., keyword estimation). TRACKE estimates the keywords of the target caption from the input audio, and its decoder generates a caption while referring to the estimated keywords. The keyword-estimation branch was trained by adopting a weakly supervised polyphonic AED strategy [13], and the ground-truth keywords were extracted from the ground-truth caption via a heuristic rule. The experimental results indicate the effectiveness of TRACKE for AAC.

Future work includes improving keyword estimation by adopting keyword-guided generation strategies from natural language processing [22–24, 32, 33] and image captioning [34–37].


6. References

[1] K. Drossos, S. Adavanne, and T. Virtanen, “Automated Audio Captioning with Recurrent Neural Networks,” in Proc. of IEEE Workshop on Application of Signal Process. to Audio and Acoust. (WASPAA), 2017.

[2] S. Ikawa and K. Kashino, “Neural Audio Captioning based on Conditional Sequence-to-Sequence Model,” in Proc. of the Detection and Classification of Acoust. Scenes and Events Workshop (DCASE), 2019.

[3] M. Wu, H. Dinkel, and K. Yu, “Audio Caption: Listen and Tell,” in Proc. of Int’l Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2019.

[4] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating Captions for Audios in The Wild,” in Proc. of the North American Chapter of the Association for Computational Linguistics: Human Lang. Tech. (NAACL-HLT), 2019.

[5] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset,” in Proc. of Int’l Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2020.

[6] A. Mesaros, T. Heittola, A. Eronen, and T. Virtanen, “Acoustic Event Detection in Real Life Recordings,” in Proc. of Euro. Signal Process. Conf. (EUSIPCO), 2010.

[7] K. Imoto, N. Tonami, Y. Koizumi, M. Yasuda, R. Yamanishi, and Y. Yamashita, “Sound Event Detection By Multitask Learning of Sound Events and Scenes with Soft Scene Labels,” in Proc. of Int’l Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2020.

[8] D. Barchiesi, D. Giannoulis, D. Stowell, and M. D. Plumbley, “Acoustic Scene Classification: Classifying Environments from the Sounds they Produce,” IEEE Signal Processing Magazine, 2015.

[9] Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma,” IEEE/ACM Trans. on Audio, Speech, and Lang. Process., 2019.

[10] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An Ontology and Human-Labeled Dataset for Audio Events,” in Proc. of Int’l Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2017.

[11] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. Weiss, and K. Wilson, “CNN Architectures for Large-Scale Audio Classification,” in Proc. of Int’l Conf. on Acoust., Speech, and Signal Process. (ICASSP), 2017.

[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Proc. of Neural Information Processing Systems (NIPS), 2017.

[13] R. Serizel, N. Turpault, H. E. Zadeh, and A. P. Shah, “Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments,” in Proc. of the Detection and Classification of Acoust. Scenes and Events Workshop (DCASE), 2018.

[14] W. Boes and H. V. Hamme, “Audiovisual Transformer Architectures for Large-Scale Classification and Synchronization of Weakly Labeled Audio Events,” in Proc. of Int’l Conf. on Multimedia (MM’19), 2019.

[15] Q. Kong, Y. Xu, W. Wang, and M. D. Plumbley, “Sound Event Detection of Weakly Labelled Data with CNN-Transformer and Automatic Threshold Optimization,” arXiv preprint, arXiv:1912.04761, 2019.

[16] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” in Proc. of Advances in Neural Information Process. Systems (NIPS), 2014.

[17] M. T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-based Neural Machine Translation,” in Proc. of Empirical Methods in Natural Lang. Process. (EMNLP), 2015.

[18] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information,” Transactions of the Association for Computational Linguistics, 2017.

[19] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Polyphonic Sound Event Detection using Multi Label Deep Neural Networks,” in Proc. of Int’l Joint Conf. on Neural Networks (IJCNN), 2015.

[20] A. Mesaros, T. Heittola, and T. Virtanen, “Metrics for Polyphonic Sound Event Detection,” Applied Sciences, 2016.

[21] R. Mihalcea and P. Tarau, “TextRank: Bringing Order into Texts,” in Proc. of Empirical Methods in Natural Lang. Process. (EMNLP), 2004.

[22] C. Li, W. Xu, S. Li, and S. Gao, “Guiding Generation for Abstractive Text Summarization Based on Key Information Guide Network,” in Proc. of the North American Chapter of the Association for Computational Linguistics: Human Lang. Tech. (NAACL-HLT), 2018.

[23] R. Pasunuru and M. Bansal, “Multi-Reward Reinforced Summarization with Saliency and Entailment,” in Proc. of the North American Chapter of the Association for Computational Linguistics: Human Lang. Tech. (NAACL-HLT), 2018.

[24] S. Gehrmann, Y. Deng, and A. M. Rush, “Bottom-up Abstractive Summarization,” in Proc. of Empirical Methods in Natural Lang. Process. (EMNLP), 2018.

[25] K. Nishida, I. Saito, K. Nishida, K. Shinoda, A. Otsuka, H. Asano, and J. Tomita, “Multi-style Generative Reading Comprehension,” in Proc. of the 57th Ann. Meet. of the Association for Computational Linguistics (ACL 2019), 2019.

[26] F. Font, G. Roma, and X. Serra, “Freesound Technical Demo,” in Proc. of Int’l Conf. on Multimedia (MM’13), 2013.

[27] S. Lipping, K. Drossos, and T. Virtanen, “Crowdsourcing a Dataset of Audio Captions,” in Proc. of the Detection and Classification of Acoust. Scenes and Events Workshop (DCASE), 2019.

[28] “DCASE 2020 Challenge Task 6: Automated Audio Captioning,” http://dcase.community/challenge2020/task-automatic-audio-captioning

[29] S. Bird, E. Loper, and E. Klein, “Natural Language Processing with Python,” O’Reilly Media Inc., 2009.

[30] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving Language Understanding by Generative Pre-Training,” https://blog.openai.com/language-unsupervised, 2018.

[31] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. of Int’l Conf. on Learning Representations (ICLR), 2015.

[32] I. Saito, K. Nishida, K. Nishida, and J. Tomita, “Abstractive Summarization with Combination of Pre-trained Sequence-to-Sequence and Saliency Models,” arXiv preprint, arXiv:2003.13028, 2020.

[33] I. Saito, K. Nishida, K. Nishida, A. Otsuka, H. Asano, and J. Tomita, “Length-controllable Abstractive Summarization by Guiding with Summary Prototype,” arXiv preprint, arXiv:2001.07331, 2020.

[34] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting Image Captioning with Attributes,” in Proc. of Int’l Conf. on Computer Vision (ICCV), 2017.

[35] Y. Pan, T. Yao, H. Li, and T. Mei, “Video Captioning with Transferred Semantic Attributes,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.

[36] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2017.

[37] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering,” in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.