
DEGREE PROJECT IN TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Query By Example Keyword Spotting

KTH Thesis Report

Jonas Valfridsson

KTH ROYAL INSTITUTE OF TECHNOLOGY
ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Authors
Jonas Sunde Valfridsson <[email protected]> <[email protected]>
Information and Communication Technology
KTH Royal Institute of Technology

Place for Project
Stockholm, Sweden

Examiner
Sten Ternström
KTH Royal Institute of Technology

Supervisor

Jonas Beskow

KTH Royal Institute of Technology


Abstract

Voice user interfaces have been growing in popularity, and with them an interest in open vocabulary keyword spotting. In this thesis we focus on one particular approach to open vocabulary keyword spotting: query by example keyword spotting. Three types of query by example keyword spotting approaches are described and evaluated: sequence distances, speech to phonemes and deep distance learning. Evaluation is done on a series of custom tasks designed to measure a variety of aspects. The Google Speech Commands benchmark is used for evaluation as well, to make the results more comparable to existing works. From the results, the deep distance learning approach seems most promising in most environments, except when memory is very constrained, in which case sequence distances might be considered. The speech to phonemes methods are lacking in the usability evaluation.

Keywords

Keyword Spotting, Automatic Speech Recognition, ASR, Query By Example, Deep Distance Learning, Dynamic Time Warping, Few-Shot Learning


Abstract (Swedish)

Voice user interfaces have grown in popularity, and with them an interest in open vocabulary keyword spotting. In this thesis we focus on a specific form of open vocabulary keyword spotting, so-called query by example keyword spotting. Three types of query by example keyword spotting methods are described and evaluated: sequence distances, speech to phonemes and deep distance learning. Evaluation is done on constructed tasks designed to measure a variety of aspects of the methods. The Google Speech Commands data is also used for the evaluation, to make it more comparable to existing works. The results indicate that deep distance learning seems most promising, except in environments where resources are very limited; there, sequence distances may be of interest. The speech to phonemes methods show shortcomings in the usability evaluation.

Keywords (Swedish)

Keyword spotting, automatic speech recognition, few-shot learning


Acknowledgements

I am thankful to friends and family for donating their voices to my thesis. Finally, thanks to my supervisor and examiner for enabling this project.


Acronyms

KWS Keyword Spotting

QbE Query By Examples

QbE-KWS Query by Example Keyword Spotting

ASR Automatic Speech Recognition

HMM Hidden Markov Model

SOTA State of the Art

DTW Dynamic Time Warping

CTC Connectionist Temporal Classification

STP Speech to Phonemes

IPA International Phonetic Alphabet

MFCC Mel-frequency cepstrum coefficients

OVKS Open Vocabulary Keyword Spotting


Contents

1 Introduction
1.1 Ethics
1.2 Problem
1.3 Related Works
1.4 Constraints

2 Background
2.1 Audio
2.2 Query by Example Keyword Spotting
2.2.1 Continuous Speech
2.3 Speech Representations
2.3.1 Text
2.3.2 Phonemes
2.3.3 Audio
2.3.4 Spectral Representations
2.3.5 Learned
2.4 Historical Development
2.5 Definitions

3 Methods
3.1 Sequence Distances
3.1.1 DTW on MFCC
3.1.2 Ghost DTW on MFCC
3.2 Speech To Phonemes
3.2.1 Sequence Distance with Beam Search
3.2.2 Sequence Distance on CTC Posteriograms
3.2.3 Example Likelihood
3.2.4 Sample Likelihood
3.3 Deep Distance Learning

4 Experiments
4.1 Data
4.1.1 LibriSpeech dataset
4.1.2 LibriSpeech-Phonemes dataset
4.1.3 Google Speech Commands dataset
4.1.4 LibriWords dataset
4.1.5 LibriTriplets dataset
4.1.6 RS dataset
4.1.7 Usability dataset
4.1.8 Post Processing
4.2 Methods
4.2.1 Speech Distances
4.2.2 Speech to Phonemes
4.2.3 Deep Distance Learning
4.3 Metrics
4.3.1 Efficacy Metrics
4.3.2 Resource Metrics
4.3.3 Usability Metrics
4.4 Evaluation
4.4.1 Realistic
4.4.2 Google Speech Commands
4.4.3 Latency
4.4.4 Usability

5 Result
5.1 Speech Distances
5.2 Speech to Phonemes
5.3 Deep Distance Learning
5.4 Comparison
5.4.1 Accuracy
5.4.2 Resources
5.4.3 Usability

6 Conclusions
6.1 Sequence Distances
6.2 Speech To Phonemes
6.3 Deep Distance Learning
6.4 What's best
6.5 Future Work
6.5.1 DDL Extensions
6.5.2 Usability of STP
6.5.3 Heuristic
6.6 Final Words

References


Chapter 1

Introduction

Voice user interfaces [12] have been growing in popularity for the past decades; one popular contemporary example is voice assistants [24]. Keyword spotting, the ability to detect when a word has been uttered, is a central function of all voice assistants. It is commonly used to wake them up, a famous example being the 'Hey Mycroft' wake word for the Mycroft assistant.

Personalization [39] is a coveted aspect of user facing technologies, assistants being no exception. One aspect of personalization for assistants is naming them. To accommodate naming one needs an Open Vocabulary Keyword Spotting (OVKS) system. Such a system would also be a step towards assistants that can function in local contexts, such as lowering or increasing the volume on a speaker without having to explicitly call on the assistant by name to do so. Query by Example Keyword Spotting (QbE-KWS) is an approach to OVKS that is explored in this thesis. It attempts to replicate how a human with a keen memory might memorize a new word from just hearing it a few times. If a viable solution to QbE-KWS is devised, it enables a natural voice interface whereby a user, through voice only, could teach their assistant new words. This is one of the motivators of this thesis: to find a viable solution that can be used with the Friday voice assistant [20]. To that end, the thesis explores three classes of methods for QbE-KWS and evaluates them in a variety of different settings where a voice assistant might be used.


1.1 Ethics

Ethical issues related to voice assistants are plentiful. They concern the problems of privacy, owning your own data, and so on. These problems arise from the fact that most commercial voice assistants make use of cloud processing to deliver a good experience; having a local assistant comes with many limitations, so most commercial entities have opted for a cloud solution. In the case of the solutions proposed in this thesis, it might be argued that the situation is aggravated, because to teach the assistant one must provide audio fingerprints of the words one wants it to learn. However, if this fingerprint can be stored locally on-device, then this does not differ at all from how the situation looks today. Furthermore, the solutions presented in this thesis are intended to function on a fully offline assistant, thereby eliminating the privacy concerns associated with sending your voice to the cloud.

1.2 Problem

Given a few audio recordings of some utterance, the system should be able to use them to recognize whether the same utterance is uttered in new audio.

More precisely, define an audio recording as $R^N_i$, where $R^N_i$ is an $N$-dimensional vector of real numbers belonging to utterance $i$. Denote the distribution of utterance $i$ on $\mathbb{R}^N$ as $U_i$. The problem is: given $M$ samples of $U_i$, where $M$ is small ($M < 5$), construct a mapping $\mathcal{M} : \mathbb{R}^N \to [0, \infty)$. The mapping should have the property that if a sample $R^N_j$ from some $U_j$ is mapped to a low value, anyone would agree that $R^N_j \in U_i$, meaning it is highly likely that $j = i$; and if it is mapped to a high value, it is highly likely that $j \neq i$.

1.3 Related Works

Previous work has been reported on all of the methods evaluated in this thesis. For the sequence distances (see section 3.1), previous works have focused on the combination of MFCCs and the DTW algorithm [2] [15] [10]. For the speech to phonemes methods (see section 3.2), previous works have focused on what is referred to in that section as Sample Likelihood [36] [8] [29]. For the deep distance learning methods (see section 3.3), previous works have focused on the same method evaluated in this thesis, with some differing details [26] [59]. Beyond the work evaluated in this thesis, authors have tried applying meta learning to the problem [11] as well as neural network architectures based on detection filters [7].

1.4 Constraints

A solution to the problem is intended to be used on relatively low resource devices. In particular, the solution should run on a Raspberry Pi (3B+) [27] on audio recorded in real-time from some microphone. The Raspberry Pi has a relatively large memory, so the most constraining factor is latency. A proxy for the latency requirement of the Raspberry Pi will be used: a viable method must have an inference latency lower than 250 ms when running on an Intel(R) Core(TM) i7-8565U CPU @ 1.80 GHz. This is the CPU used by the author for evaluating the methods; it was chosen as a proxy because it has worked well as one in the author's previous projects. The inference latency is explained in section 4.3.2.


Chapter 2

Background

The essential components for understanding QbE-KWS as presented in this thesis are: an understanding of audio and how we represent it, a good understanding of the problem and the general solution framework, and an understanding of different kinds of speech representations. These are presented in this background chapter; in addition, a history of the development of the methods is presented, which aims to motivate why the particular methods in this thesis are of interest. Finally, definitions of the core concepts of the methods in this thesis are presented at the end.

2.1 Audio

Starting with the most basic component: sound can be fully described by a collection of analog waves, one for each source of audio. This thesis concerns itself only with single-source, or mono, audio. Furthermore, the thesis focuses on digital processing of audio. The digitisation of mono audio can be done by discretization of the analog wave, determined by a discretization rate (sample rate).


Figure 2.1.1: An analog signal and two digital representations with different sample rates.

The discretization rate is important: the Nyquist theorem tells us that we can only accurately represent frequencies up to half of our discretization rate [33]. We must have a sampling rate that captures the frequencies produced by the human voice (around 200-4000 Hz) [17] [42]. At the same time it is important not to have too high a discretization rate, to mitigate the curse of dimensionality [32]. Finally, audio discretized with a sample rate of 8000 Hz is used to address the Keyword Spotting (KWS) problem in this thesis.

2.2 Query by Example Keyword Spotting

KWS is the problem of recognizing if audio contains an utterance of a keyword.

Figure 2.2.1: Example utterance of ’Query by Example’

Query By Examples (QbE) in the KWS setting is the problem of developing a method that can, from one or a few recordings of a keyword, learn to recognize utterances


of that keyword in speech. In this thesis the problem is referred to as the QbE-KWS problem.

Figure 2.2.2: QbE-KWS example

Recognition is done by comparing representations of speech to determine if a known utterance has been uttered. For example, in fig 2.2.2, three utterances of 'Cookie' and three utterances of 'Donut' are provided. The task is then to predict whether the utterance with a question mark belongs to either class, or to none of them. In this example it belongs to none of them; it is an utterance of 'cranberry'. The following section 2.3 presents common ways of representing keywords that can be stored and used for comparison. In this thesis the provided examples will be referred to as examples and the new recording will be referred to as the sample.

2.2.1 Continuous Speech

To apply QbE-KWS on continuous speech we chunk it to extract samples. There are multiple ways of doing this. For example, assume we have a ten second continuous signal, as exemplified in figure 2.2.3.


Figure 2.2.3: Ten second signal with two utterances

One way is chunking it with a sliding window using some fixed window size and fixed

window stride.

Figure 2.2.4: Inference on continuous audio with a sliding window
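To make the sliding-window chunking concrete, here is a minimal numpy sketch; the window and stride values are placeholders, not the ones used in the experiments (those are given in sec 4.4.1):

import numpy as np

def sliding_window_chunks(signal, sample_rate, window_s=2.0, stride_s=0.25):
    """Chunk a 1-D audio signal into fixed-size, overlapping windows."""
    window = int(window_s * sample_rate)
    stride = int(stride_s * sample_rate)
    starts = range(0, len(signal) - window + 1, stride)
    return np.stack([signal[s:s + window] for s in starts])

# Ten seconds of placeholder audio at the thesis' 8000 Hz sample rate.
audio = np.zeros(10 * 8000)
samples = sliding_window_chunks(audio, sample_rate=8000)
print(samples.shape)  # (33, 16000): one candidate sample per window position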

This is the approach chosen here for the 'realistic scenario' evaluation, sec 4.4.1; the fixed window size and stride are given in that section. However, a smarter way of doing this is to use voice activity detection [53].

Figure 2.2.5: Inference on continuous audio with voice activity detection


Using voice activity detection enables activating the inference model only when it is needed, and potentially also with better window alignments. In the evaluations of this thesis no voice activity detection is used, because with a sufficiently small stride it yields no better alignments. Furthermore, voice activity detection can hide a method's poor false-positive rate, something we wish to evaluate in this thesis.

2.3 Speech Representations

In the QbE-KWS setting, representations are stored for each keyword to enable the detection of the utterance they represent. The representations of the examples are compared to that of the sample, and the results of the comparison constitute the basis for detection. This is discussed in more detail in chapter 3. The representations covered in this chapter are text, phonemes, audio, spectral and learned representations.

2.3.1 Text

A text representation of a keyword is its text form, e.g. the text representation of an utterance of 'Hello' is 'Hello'.

Figure 2.3.1: Hello utterance with ’Hello’ text representation

Using text representations of keywords is a natural way of adapting speech to text technologies to keyword spotting. By converting speech to text and then using string matching algorithms one can build a KWS solution. One early open-source program used for KWS did exactly this: PocketSphinx [25], developed in 2006, is a speech to text engine, but it supported keyword spotting through a keyword text search.


PocketSphinx was used by the open-source voice assistant 'Mycroft' for KWS for many years.

The Bacon Beer-Can Problem

The sound of text depends on a great many things. For example

Figure 2.3.2: Bacon & Beer-Can; the figure implies they sound the same

different texts can sound exactly the same, and the same text can sound very different, depending on many factors such as dialect. By representing the utterance as text, the system needs to make assumptions about all of these factors that can affect sound, which places a heavy burden on the speech to text system. In KWS the text is not important; although it is convenient to provide the keywords in text form, other representations are worth considering to alleviate the Bacon Beer-Can problem.

2.3.2 Phonemes

A phoneme is a unit of sound [57]. A phoneme representation of a keyword is the sequence of phonemes that mimics the sound of the keyword as spoken. The phoneme representation of a keyword is not unique; it depends on pronunciation. In fact, there are also different phonetic notations, one standard being the International Phonetic Alphabet (IPA) notation [4].


Figure 2.3.3: Example of ’Hello’ as phonemes using different accents

Using phoneme representations of keywords makes sense on some intuitive level: if things sound the same then they are the same, at least if they come from similar speakers. This is the basis of many KWS systems [36] [8] [29]. In speech to text, audio is typically represented as phonemes, or distributions of phonemes, before being decoded into text. These KWS systems can therefore leverage many years of research in Automatic Speech Recognition (ASR) to produce the sequences of phonemes from audio (this is leveraged by the methods in section 3.2).

2.3.3 Audio

Audio is the most informative representation of an example utterance. However,

using this representation presents a different problem: how do we compare audio

signals?

Figure 2.3.4: Example of ’Hello’ as audio using identity representation

The same keyword uttered by the same speaker can look vastly different in two different audio signals.

Figure 2.3.5: Example 'Hello' and stretched + shifted 'Heeeeeello'

With text and phonemes, the main problem is to produce the correct representation from speech, while the comparison itself is relatively simple. When using raw audio as the representation, the entire problem is shifted into the comparison. Comparing raw audio is, at the time of this writing, impractical. However, representing audio as spectral features enables a simple form of QbE-KWS based on sequence distances (see section 3.1).

2.3.4 Spectral Representations

Representing audio as spectral features has been hugely successful in many audio applications. Spectral representations describe audio using its frequency components. Typical spectral representations are log-mel-spectrograms and Mel-frequency cepstrum coefficients (MFCC) features [56].

Figure 2.3.6: Example of 'Nacho Tallrik' as waveform, log-mel-spectrogram and MFCC

Spectral representations typically reduce the dimensionality of the audio while retaining features important for speech recognition. Combining spectral features with sequence distances (see sec 3.1) has shown promising results [2] [15] [10] for simple KWS.


2.3.5 Learned

Finally, learned representations, as indicated by their name, are representations learned for a specific purpose. For example, in the case of QbE-KWS the representations might have been learned in a manner such that audio containing the same words lies close in representation space, according to some definition of distance. Here, learned representations are constructed through supervised learning: many examples of audio clips that should be close in space, and clips that should not be close, are provided, and from these a mapping from audio to a space with representations of good discriminative properties is learned. Section 3.3 presents this approach as applied in this thesis.

Figure 2.3.7: Example of 512 dimensional learned representation

In the example above the audio is 16000 dimensional while the representation is

only 512 dimensional, amounting to a significant compression. Other existing works

making use of learned representations for KWS are Huh et al (2020) [26] and Vygon

et al [59].

2.4 Historical Development

In the beginning there was Audrey [14]. Developed at Bell Labs in 1952, the circuit could recognize spoken digits from telephone quality recordings. Audrey was an early example of an ASR classifier. It had audio input and 10 output nodes for the digits 0-9; the circuit worked by extracting two frequency features to form a plane, and the classifier then recognized shapes on this plane as specific digits. Audrey achieved an impressive 96% accuracy on a spoken digits recognition problem when it was tuned


for a specific user. Then in the 1960s came a QbE solution: Denes et al [16] built a computer program that calculated spectral features from audio and compared them to stored spectral features of previously recorded audio clips. This QbE approach achieved 100% accuracy on utterances from the speaker from whom the example came, and on average 94% accuracy for other speakers. It is worth interjecting that these early experiments were conducted in close to ideal environments. The 1970s brought the Viterbi algorithm [18], commonly used for decoding in acoustic models, and the Dynamic Time Warping (DTW) algorithm [49]. The 1980s brought Markov models [5], most notably the application of a Hidden Markov Model (HMM) [48] for voice activation. In the 1990s came neural networks for voice activation [40][41][43]. In 2006 the Connectionist Temporal Classification (CTC) loss was introduced [23]. This enabled training neural networks as acoustic models without needing segmented labels, something previously limited to generative models such as the HMM-GMM. With the rise of deep learning, the CTC loss contributed to an explosive development of neural networks in ASR that overtook the previous State of the Art (SOTA) based on HMM models. From these developments grew the Speech to Phonemes (STP) methods [36][8][29], see sec 3.2. They are a natural application of the decades of research done in ASR to the QbE-KWS problem. Briefly, these methods compare the way keywords 'sound' by comparing phoneme sequences. Using sequence distances, see sec 3.1, is another approach; a good example is DTW from 1979 [49]: a dynamic programming [6] approach used to align sequences that can also be used directly as a distance between two sequences. The DTW distance has good properties for simple QbE-KWS problems, as exemplified by its recent use for digit recognition in Assamese [15] and KWS in Tamil [10]. The QbE-KWS methods introduced so far, together with many others, can all be categorized as few-shot learning methods. The few-shot learning field consists of methods for learning that use one or a few examples to generalize, which is precisely the problem of QbE-KWS. Developments in the few-shot learning field [54] are what inspired the deep distance learning method in this thesis (sec 3.3).

2.5 Definitions

This section defines the concepts used throughout the thesis. It requires no thorough reading, but can be returned to in case the reader wants a more precise definition of the central terms.


Definition 1 (Distance). A distance on a set $X$ is a function $D : X \times X \to [0,\infty)$ with the properties

(i) Symmetry: $D(x, y) = D(y, x)$

(ii) Reflexivity: $D(x, x) = 0$

Definition 2 (Pseudo Metric). A pseudo-metric on a set $X$ is a distance (def 1) $PM : X \times X \to [0,\infty)$ with the additional property

(i) Triangle inequality: $PM(x, z) \leq PM(x, y) + PM(y, z)$

Definition 3 (Metric). A metric on a set $X$ is a pseudo metric (def 2) $M : X \times X \to [0,\infty)$ with the additional property

(i) Uniqueness: $M(x, y) = 0 \iff x = y$

Remark 1. These are all inclusions: $M \subset PM \subset D$.

These definitions of distances, pseudo metrics and metrics are taken from course notes on topological data analysis [52]. The rest of the definitions are introduced to give a precise meaning to the terms used in subsequent sections.

Example 2.1. Euclidean distance is a metric: $M(x, y) = \sqrt{(x - y) \cdot (x - y)}$. Cosine distance, $D(x, y) = 1 - \frac{x \cdot y}{\|x\| \, \|y\|}$, is a distance but not a metric, because it breaks the triangle inequality. Chebyshev distance, $D(x, y) = \max_i |x_i - y_i|$, is a metric. Manhattan distance is a metric: $M(x, y) = \sum_i |x_i - y_i|$.
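For concreteness, these four distances as a small numpy sketch (the function names are illustrative):

import numpy as np

def euclidean(x, y):
    # Metric.
    return float(np.sqrt(np.dot(x - y, x - y)))

def cosine_distance(x, y):
    # A distance, but not a metric: it breaks the triangle inequality.
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chebyshev(x, y):
    # Metric.
    return float(np.max(np.abs(x - y)))

def manhattan(x, y):
    # Metric.
    return float(np.sum(np.abs(x - y)))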

Distances with particular invariances and discriminative features are sought for

QbE-KWS.

Definition 4 (Invariant). A distance is invariant to transformations $g$ and $f$ if $D(g(x), f(y)) = D(x, y)$.

Example 2.2. The distance between keywords should remain the same even if we shift an utterance in time:

Figure 2.5.1: Example of distance invariant to shift on the utterance of ’Hej’


Designing distances with invariances is difficult. One common approach to alleviate this is to re-represent the set on which the distance is defined.

Definition 5 (Representation). A representation is an $x \in X$ mapped from some $y \in Y$ using some $R : Y \to X$.

Remark 2. Typically x retains information from y that is relevant to some task while

discarding other information or making it less accessible.

Example 2.3. In sec 2.4, historical development, a paper from the 1960s [16] represented audio using spectral features and compared those spectral features to address the QbE-KWS task.


Chapter 3

Methods

Common to all methods in this chapter is that they calculate a distance between two representations of audio. Once new audio is recorded and re-represented, its distance is calculated with respect to all known examples. Given these distances, a decision still has to be made whether the audio is a known keyword, and if so which, or whether it is unknown. There are many ways to approach this problem: using k-nearest neighbours [46] or extending our distances to finite subsets [13] are just a few examples. Here, for simplicity, the minimum distance between elements of subsets is used as an extension of our distances. The subsets consist of all examples of the individual keywords, and the newly recorded keyword is a set consisting only of itself.

$$SD(A, B) = \min\{\, D(a, b) \mid a \in A \text{ and } b \in B \,\} \qquad (3.1)$$

Using this, the keyword which has the smallest subset distance to the new recording is considered the prediction iff the subset distance is below some threshold $T$. This prediction heuristic is represented in the figures of this chapter as a black box. It is possible that different heuristics could have worked better for some methods, but this is left for future work.

In other words, the keyword with the example which has the lowest distance to the sample is considered to be the prediction if it is 'close enough', that is, closer in distance than $T$.
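A minimal Python sketch of this heuristic; the function and variable names are illustrative, not taken from the thesis code:

def subset_distance(examples, sample, distance):
    # Eq. 3.1: minimum pairwise distance; the new recording is a singleton set.
    return min(distance(example, sample) for example in examples)

def predict(keywords, sample, distance, threshold):
    """Return the keyword with the smallest subset distance, or None.

    `keywords` maps each keyword name to its list of stored example
    representations; `threshold` is the T from the text above.
    """
    best_word, best_dist = None, float("inf")
    for word, examples in keywords.items():
        d = subset_distance(examples, sample, distance)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist < threshold else None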


3.1 Sequence Distances

A sequence distance consists of an alignment and a distance. An alignment is a mapping between two sequences such that each element in each sequence associates with at least one element in the other.

Figure 3.1.1: perfect alignment between two color sequences

We do not consider all possible alignments, only classes of alignments that have the desired invariance properties. The class of alignments considered in this thesis is calculated using the Dynamic Time Warping (DTW) algorithm [51]. The DTW alignment has the following constraints:

1. The first index of the first sequence must match the first index of the second sequence.

2. The last index of the first sequence must match the last index of the second sequence.

3. The mapping is non-decreasing.

The final point, the non-decreasing mapping, imposes the constraint that no index may be mapped to an opposing index smaller than the mapped index of any preceding index. Figure 3.1.1 is a valid alignment under these constraints. However, figure 3.1.2 is an example of an invalid alignment:


Figure 3.1.2: invalid alignment between two color sequences

It is invalid because the mapping is not non-decreasing: two red boxes map to a red box with an index lower than the one a box prior to them maps to. For some sequences, such as the one in figure 3.1.2, there are no perfect alignments, but we can create imperfect ones.

Figure 3.1.3: imperfect alignment between two color sequences

The imperfection comes at a cost described by our distance function. Consider for

example the discrete metric on colors.


Figure 3.1.4: A discrete color metric

Using this metric, the cost of the alignment in figure 3.1.3 is 1, since there is one misalignment. The DTW algorithm finds the alignment with the lowest cost with respect to some distance in $O(MN)$ time [51], where $M$ is the length of the first sequence and $N$ the length of the second.
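To make the algorithm concrete, a minimal $O(MN)$ dynamic-programming sketch of the DTW cost, assuming an element-wise distance function is supplied; this is an illustration, not the implementation used in the experiments (see sec 4.2.1 for those):

import numpy as np

def dtw(seq_a, seq_b, distance):
    """Cost of the lowest-cost DTW alignment between two sequences."""
    m, n = len(seq_a), len(seq_b)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0  # constraint 1: alignments start at the first indexes
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = distance(seq_a[i - 1], seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],  # advance both sequences
                                 cost[i - 1, j],      # repeat element of seq_b
                                 cost[i, j - 1])      # repeat element of seq_a
    return cost[m, n]  # constraint 2: alignments end at the last indexes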

The sequence distance methods evaluated here use the DTW algorithm with different distance functions on audio representations to address the QbE-KWS problem. A slightly modified version of the DTW algorithm will also be evaluated. The modification enables finding the optimal sub-alignment of one sequence into another. This is solved by introducing ghosts at the beginning and end of one of the sequences.

Figure 3.1.5: Property of ghost nodes

Using ghosts, a perfect alignment for figure 3.1.3 exists.


Figure 3.1.6: Perfect alignment using ghost

Using ghosts is motivated by the hypothesis that being able to match against a subset

of audio will yield a better classifier. An example of the reasoning is given with the

following figure:

Figure 3.1.7: Ghost nodes enable matching onto a sub-sequence

In figure 3.1.7 the ’Hey’ part of the utterance is only noise if the goal is to find all

utterances of ’Friday’. Using ghost nodes enables DTW to completely ignore the ’Hey’

part, as exemplified in the following figure


Figure 3.1.8: Example Ghost DTW compared to DTW

In the figure above both DTW variants use Manhattan distance (see example 2.1).
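A hedged sketch of the ghost variant, under the reading that the ghosts let the stored example align to any sub-sequence of the sample (an assumption consistent with figure 3.1.7); it reuses the dtw conventions of the sketch above:

import numpy as np

def ghost_dtw(example, sample, distance):
    """DTW cost where `example` may align to any sub-sequence of `sample`."""
    m, n = len(example), len(sample)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, :] = 0.0  # ghost start: alignment may begin anywhere in the sample
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = distance(example[i - 1], sample[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[m, :].min()  # ghost end: alignment may stop anywhere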

Ghost DTW, as described in this thesis, is essentially 'open-end' combined with 'open-begin' DTW, as referred to in the R computing DTW package [21]. The following subsections give an overview of the sequence distance methods evaluated. For the specific sequence distance experiments see sec 4.2.1.

3.1.1 DTW on MFCC

MFCC is a spectral representation of audio, see section 2.3.4. Applying DTW directly on MFCCs has been done previously in [2] [15] [10]. The way MFCC representations and the DTW distance are used to solve QbE-KWS is summarized in the following figure.

Figure 3.1.9: Using DTW and MFCC for QbE KWS

In figure 3.1.9 the system has learned to recognize two kinds of keywords: the top two (green) are utterances of 'guacamole' and the bottom two (orange) of 'nacho'. Given a new recording, for example from a microphone, the DTW distance between the new recording's representation and all stored representations is calculated. In figure 3.1.9 average Manhattan distance (mean absolute error) is used as the distance with DTW. Given the collection of distances, the classification heuristic (see chapter 3) is applied. Provided that $T > 109$, the heuristic would predict 'guacamole', which in this example is correct.

3.1.2 Ghost DTW on MFCC

This method is exactly the same as the one described in section 3.1.1, with the exception that the DTW now uses ghost nodes on the stored representations.

3.2 Speech To Phonemes

Speech to phonemes models use a mapping $S$ from speech to a distribution over phonemes. In this thesis $S$ is a deep recurrent neural network (see sec 4.2.2 for architecture details) trained to predict posteriograms of phonemes given MFCC inputs. It was trained on the LibriSpeech-Phonemes dataset, sec 4.1.2, with the CTC loss [23]. The posteriograms, in conjunction with a distance, were used to address the QbE-KWS problem. An overview of the approaches is given below, while the specific speech to phonemes experiments are given in section 4.2.2.

3.2.1 Sequence Distance with Beam Search

Given the phoneme posteriograms, beam search ($B$) [50] is used to find a likely phoneme sequence. To find the most likely sequence of phonemes from audio using a model trained with CTC, one has to perform an operation of $O(L^T)$ complexity, where $L$ is the sequence length and $T$ the number of tokens [23]. Finding the most likely sequence is impractical since $L = 66$ (assuming a 30 ms receptive field of the MFCCs with 2 seconds of audio) and $T = 41$ (using the LibriSpeech lexicon, described in section 4.1.2). Therefore, a guided search is employed to find an approximate optimum; this guided search is beam search [50].

To address QbE-KWS with STP and beam search we do the following: map each stored audio recording, using $S$, to its posteriograms and estimate a likely phoneme sequence using $B$. Then, when new audio is recorded, calculate a likely phoneme sequence using the same procedure, and apply some sequence distance between all the phoneme sequence examples and the sample. Finally the classification heuristic, explained in the beginning of chapter 3, is applied to the collection of distances.

Figure 3.2.1: Using STP and Beam Search for QbE KWS

In figure 3.2.1 the system has learned two keywords, 'Raining' and 'Fire'. Given a new recording, it applies the mappings $B$ and $S$ to the audio and then evaluates its distance against all known sequences. In the example above the sequence distance used is Ghost-DTW with the discrete distance on phonemes, where '-' has zero cost. Provided that $T > 0$, the system would return 'Fire' as the prediction, which in this example is the correct one.

In the illustration the posteriograms have softmax applied to them to showcase the activated phonemes. In reality, though, beam search and all subsequent methods use the logits of the posteriograms as input, as that is more numerically stable.

3.2.2 Sequence Distance on CTC Posteriograms

Instead of transforming the posteriograms to a likely phoneme sequence, a sequence

distance can be applied directly on the posteriograms predicted with S.


Figure 3.2.2: Using STP and sequence distance on the posteriogram

In figure 3.2.2 the system has learned two keywords, 'Raining' and 'Fire'. Given a new recording, it uses $S$ to extract the phoneme posteriograms of the audio and then calculates the distance against all known posteriograms. In the example, Ghost-DTW with average Manhattan distance is used. Assuming $T > 0.125$, it would return 'Raining' as the prediction, which in this example is the correct one.

3.2.3 Example Likelihood

Given a posteriogram, it is possible to estimate the probability that it generates a phoneme sequence using the CTC forward pass [23]. This method for addressing QbE-KWS maps all the examples into phoneme sequences, using the transformation from sec 3.2.1. Then, given a new sample (for example an audio recording from a mic), it is first converted into its posteriogram; then, using the CTC forward pass, the likelihoods that our example sequences were generated from the sample posteriogram are calculated. These likelihoods are then used as distances. This method for addressing QbE-KWS has been previously tested in DONUT [36] with promising results.


Figure 3.2.3: Example of using STP with likelihood of examples for QbE-KWS

In figure 3.2.3, $L$ is the negative log likelihood function of an example given the posteriogram of the sample, and $B$ is the beam search mapping described in sec 3.2.1. The system has learned two keywords, 'Raining' and 'Fire'. Provided that $T > 2.86$, the system would return 'Raining' as the prediction, which in this example is the correct one.
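In TensorFlow 1.15, which is used for $S$ (see sec 4.2.2), the CTC forward pass is exposed through tf.nn.ctc_loss, which returns exactly the negative log likelihood of a label sequence given the logits. A hedged sketch of scoring one example phoneme sequence against a sample posteriogram; the shapes and names here are illustrative assumptions:

import tensorflow as tf  # TensorFlow 1.x

def example_negative_log_likelihood(logits, example_phonemes):
    """-log P(example | posteriogram) via the CTC forward pass.

    logits: float tensor [time, 1, num_tokens], the time-major, pre-softmax
    posteriogram from S for one sample. example_phonemes: list of int
    phoneme ids for one stored example.
    """
    labels = tf.SparseTensor(
        indices=[[0, t] for t in range(len(example_phonemes))],
        values=tf.constant(example_phonemes, dtype=tf.int32),
        dense_shape=[1, len(example_phonemes)])
    sequence_length = tf.shape(logits)[0:1]  # use the full posteriogram
    return tf.nn.ctc_loss(labels=labels, inputs=logits,
                          sequence_length=sequence_length)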

3.2.4 Sample Likelihood

Exactly as in sec 3.2.3, except the sample is conditioned on the example instead of the other way around.

Figure 3.2.4: Example of using STP with likelihood of sample for QbE-KWS

In figure 3.2.4, $L$ is the negative log likelihood function of the sample given the posteriogram of an example, and $B$ is the beam search mapping described in sec 3.2.1. The system has learned two keywords, 'Raining' and 'Fire'. Provided that $T > 8.84$, the system would return 'Raining' as the prediction, which in this example is the correct one.


3.3 Deep distance learning

Inspired by other deep distance learning papers for keyword spotting [59] [26], this method attempts to learn a representation mapping $R$ that works well for some distance $D$ directly from data. This works by deciding on some distance, for example cosine distance (see ex 2.1), and then learning a representation of the data that together with the distance has the desired properties. Here this is done by training a Siamese [9] Convolutional Neural Network [44] (see section 4.2.3 for details) to create a representation that minimizes a triplet loss.

$$L(A, P, N) = \max(D(R(A), R(P)) - D(R(A), R(N)) + \alpha,\ 0) \qquad (3.2)$$

Where in equation 3.2 the anchor $A$, positive $P$ and negative $N$ are audio recordings, $R$ is the representation mapping, $D$ our distance and $\alpha$ a hyper-parameter controlling the degree of separation we aim to have between the examples. The triplet loss helps to learn a representation that places positive data samples close to the anchor and negative ones at least a distance $\alpha$ away under the chosen $D$.
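Equation 3.2 in code form; a minimal sketch assuming rep is the trained $R$ mapping and dist the chosen distance, e.g. the cosine distance from ex 2.1 (the margin value is illustrative, not the one used in the experiments):

def triplet_loss(anchor, positive, negative, rep, dist, alpha=0.2):
    """Eq. 3.2: pull the positive within, and push the negative beyond,
    a margin of alpha from the anchor, under the chosen distance."""
    return max(dist(rep(anchor), rep(positive))
               - dist(rep(anchor), rep(negative)) + alpha, 0.0)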

For specifics on what dataset was used and how it was trained, please see section 4.2.3. With a trained $R$ mapping for some distance $D$, the following figure illustrates how they were used to address QbE-KWS:

Figure 3.3.1: Using DDL for QbE KWS

In the figure above, a representation mapping has been learned for cosine distance (see ex 2.1). The system has been provided with two examples of 'raining' (the first two, green audio signals) and two examples of 'fire' (the last two, orange signals). At inference time it maps new audio to the representation and, in this example, takes the cosine distance between it and all stored representations. Provided that $T > 0.25$, the system would predict the audio to be 'raining', which would be correct in this example. The representations produced by the mapping in the example are 256 dimensional vectors. The plots of the representations from this example are provided in higher resolution in appendix A.1.1 for the interested reader.


Chapter 4

Experiments

This chapter describes how the methods were evaluated and what is needed to reproduce the results. It begins by introducing the datasets used in the thesis and then moves on to the post-processing of the data for the models that were trained. Thereafter a detailed description of the experiments is given, together with the acronyms used to represent them in the results section. Then the metrics used for evaluation are outlined, and finally the evaluation setting is described.

4.1 Data

This section contains details of the datasets and data processing. All datasets are

available online except the ones created as part of this thesis. The created datasets

will be provided on request.

4.1.1 LibriSpeech dataset

LibriSpeech is a dataset containing audio and text transcriptions [45]. It is derived

from audio books and contains≈ 1000 hours of speech spoken by over 1000 speakers.

Each audio clip in the dataset typically contains about one sentence.


Figure 4.1.1: Sample from the LibriSpeech dataset

4.1.2 LibriSpeech-Phonemes dataset

LibriSpeech-Phonemes was derived from the LibriSpeech dataset using the LibriSpeech phoneme lexicon [34] provided in the documentation of the Montreal Forced Aligner [37]. This dataset is exactly that of LibriSpeech, but with phonetic transcriptions instead of text.

Figure 4.1.2: Sample from the LibriSpeech-Phonemes dataset

The phonemes in figure 4.1.2 are separated by one space, and the '-' symbol signifies a word boundary.

4.1.3 Google Speech Commands dataset

Google Speech Commands is a dataset containing audio files and text [60]. The dataset comprises 65000 audio recordings of thousands of different people uttering one out of 30 short words, such as 'Yes', 'No', 'Up' and 'Down'; see [60] for a full list. It is a dataset typically used for evaluation of limited vocabulary keyword spotting.

Figure 4.1.3: Sample from the Google Speech Commands dataset


4.1.4 LibriWords dataset

LibriWords is a dataset created as part of this thesis. It was created by using forced

alignment [37] on the LibriSpeech dataset, sec 4.1.1, and then extracting audio and

labels using the alignments. A dataset was created consisting of ≈ 18000 unique

English words spoken by 1172 different speakers approximately 50 times for each

word.

Figure 4.1.4: Sample from the LibriWords dataset

In the LibriWords dataset, the audio files contain the word provided as the label but also surrounding sound. For example 'Nature', as in fig 4.1.4, contains the utterance 'the Nature'. This is an artifact of how the sound was extracted from the audio files.

4.1.5 LibriTriplets dataset

LibriTriplets is a dataset created as part of this thesis; it is derived from LibriWords, sec 4.1.4. It was constructed by repeating the following process:

1. Pick one out of the 18000 words in LibriWords uniformly at random

2. Sample two random audio files containing the word from (1)

3. Sample a word uniformly at random that is not the word from (1)

4. Sample one random audio file containing the word from (3)

5. Combine (2) and (4) into one triplet.

This process was repeated until 5 000 000 triplets, or 15 000 000 audio files, had been chosen for training. For an explanation of the usage of the triplets see sec 3.3.
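A sketch of this sampling process, assuming a dict words that maps each LibriWords word to its list of audio files (the names are illustrative):

import random

def sample_triplet(words):
    """One (anchor, positive, negative) triplet, per the five steps above."""
    word = random.choice(list(words))                       # step 1
    anchor, positive = random.sample(words[word], 2)        # step 2
    other = random.choice([w for w in words if w != word])  # step 3
    negative = random.choice(words[other])                  # step 4
    return anchor, positive, negative                       # step 5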


Figure 4.1.5: Sample from the LibriTriplets dataset

In figure 4.1.5, blue (first) signifies the anchor, green (second) the positive and red (third) the negative.

4.1.6 RS dataset

The RS (Realistic Scenario) dataset is one created as part of this thesis for evaluation. The recordings are from the same microphone and of the author and friends and family. It contains long audio recordings (≈ 1 minute), and a list of keywords and their occurrences is provided with each sample. It contains two types of data points: the first is one where keywords are spoken in a quiet environment, meaning all occurrences are keyword speech.

Figure 4.1.6: A sample from the RS dataset, quiet environment

The second scenario is in an environment where most utterances are non-keyword speech, with background noise such as a television or music, among other things.


Figure 4.1.7: A sample from the RS dataset, noisy environment

The dataset was constructed with 5 different speakers and is segmented per speaker, such that evaluation can consider results on different speakers. It contains 80 utterances of each of the keywords 'stop', 'time', 'night', 'morning', 'alarm' and 'illuminate', bringing it to a total of 480 keyword utterances.

During construction of the RS dataset it was guaranteed that keyword utterances are at least 5 seconds apart, to avoid issues with overlapping predictions.

4.1.7 Usability dataset

The Usability dataset was created as part of this thesis for the usability evaluation, see sec 4.3.3. It contains ten 2 second recordings each of 3 words of varying length, 30 recordings in total. The words are 'hat', 'indistinguishable' and 'morning'. All utterances are recordings of the author using the same microphone.

Figure 4.1.8: Sample of each of the words from the Usability dataset


4.1.8 Post Processing

For the methods that trained a mapping (see sec 3.2 and 3.3), data augmentation was used to extend the datasets and increase their variety for better generalization. Two types of augmentations were applied, each with some probability $p_i$ where $i \in \{0, 1\}$, meaning that with probability $\prod_i p_i$ both augmentations were applied. The following augmentations (in the order presented) were used on the audio during the training stage of the methods ($U$ is the uniform distribution):

Background With probability 1.0 the signal was mixed with natural background noise, such as background noise from a cafe, a construction site, rain, doors, fans and computers, among other sounds.

Gaussian Noise With probability 0.5, Gaussian noise with a noise level of 0.5% was added to the audio.

The implementations of the augmentations used librosa [38] as well as custom implementations; all the code is available at [20].
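A hedged numpy sketch of the two augmentations, assuming a pool of background-noise clips at least as long as the audio; the mixing gain is an illustrative choice, not a value from the thesis:

import numpy as np

def augment(audio, background_pool, rng=np.random):
    """Apply the two training-time augmentations in the order presented."""
    # Background: always applied (probability 1.0); the 0.3 gain is illustrative.
    noise = background_pool[rng.randint(len(background_pool))]
    augmented = audio + 0.3 * noise[:len(audio)]
    # Gaussian noise: applied with probability 0.5 at a 0.5% noise level.
    if rng.rand() < 0.5:
        augmented = augmented + rng.normal(0.0, 0.005, size=len(audio))
    return augmented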

4.2 Methods

This section contains the details necessary to reproduce the experiments of the methods evaluated.

4.2.1 Speech Distances

All speech distance experiments extracted MFCC features using the librosa python

library [38] with the following hyperparameters

n_mfcc n_fft hop_length win_length n_mels

20 2048 512 2048 128

Table 4.2.1: Sequence Distance MFCC extraction
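With librosa [38] this extraction is a single call; a sketch using the hyperparameters of table 4.2.1 (the file name is a placeholder):

import librosa

# Load mono audio at the thesis' 8000 Hz sample rate.
audio, sample_rate = librosa.load("recording.wav", sr=8000, mono=True)

# MFCC extraction with the hyperparameters of table 4.2.1.
mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=20, n_fft=2048,
                            hop_length=512, win_length=2048, n_mels=128)
# mfcc has shape (20, frames); the sequence distances run over the frame axis.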

The following speech distances were used on the MFCC representation.


DTW-MFCC-EU See sec 3.1.1 with Euclidean distance, see ex 2.1. The DTW implementation used was 'dtw' from the python pip package repository; the source code is hosted on GitHub [47].

Ghost-DTW-MFCC-EU See sec 3.1.2 with Euclidean distance, see ex 2.1. The Ghost-DTW implementation used was a custom implementation, available in the Friday GitHub repository [20].

Ghost-DTW-MFCC-COS See sec 3.1.2 with cosine distance, see ex 2.1. The Ghost-DTW implementation used was a custom implementation, available in the Friday GitHub repository [20].

Ghost-DTW-MFCC-CHE See sec 3.1.2 with Chebyshev distance, see ex 2.1. The Ghost-DTW implementation used was a custom implementation, available in the Friday GitHub repository [20].

4.2.2 Speech to Phonemes

All the Speech to Phonemes experiments share the same S mapping. The mapping

used MFCCs extracted with the tensorflow 1.15 [1] audio module using the following

hyperparameters.

coefficients frame_length frame_step fft_length num_mel_bins
27           512          256        512        120

Table 4.2.2: Speech to Phonemes MFCC extraction
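A sketch of the corresponding extraction with the tensorflow 1.15 signal module (the 16 kHz sample rate and the log offset are assumptions):

```python
import tensorflow as tf  # tensorflow 1.15

def tf_mfcc(waveform, sample_rate=16000):
    # STFT -> mel filter bank -> log -> DCT, with the hyperparameters of
    # Table 4.2.2; keep the first 27 coefficients.
    stft = tf.signal.stft(waveform, frame_length=512,
                          frame_step=256, fft_length=512)
    magnitude = tf.abs(stft)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=120, num_spectrogram_bins=257,  # 512 // 2 + 1 bins
        sample_rate=sample_rate)
    mel = tf.tensordot(magnitude, mel_matrix, 1)
    log_mel = tf.math.log(mel + 1e-6)
    return tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :27]
```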

The mapping S is a sequence-to-sequence mapping with the following architecture:

units activation

LSTM 256 tanh

LSTM 256 tanh

Dense 256 tanh

Dense 41 none

Table 4.2.3: Speech to Phonemes Architecture
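One plausible Keras rendering of Table 4.2.3 (the return_sequences settings and reading the 41 outputs as 40 phonemes plus the CTC blank are assumptions):

```python
import tensorflow as tf  # tensorflow 1.15

def build_stp_model(num_mfcc=27, num_outputs=41):
    # Two stacked LSTMs followed by two dense layers; the last layer is
    # linear and emits per-frame logits for the CTC loss.
    return tf.keras.Sequential([
        tf.keras.layers.LSTM(256, activation="tanh", return_sequences=True,
                             input_shape=(None, num_mfcc)),
        tf.keras.layers.LSTM(256, activation="tanh", return_sequences=True),
        tf.keras.layers.Dense(256, activation="tanh"),
        tf.keras.layers.Dense(num_outputs),  # no activation
    ])
```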

It was trained using tensorflow 1.15 [1] on the LibriSpeech-Phonemes dataset, see 4.1.2, using the CTC-loss implementation of the tensorflow 1.15 library. The training ran


on an Nvidia GTX 1080 Ti GPU for 72 hours; the optimizer used was Adam [30] with tensorflow 1.15 default parameters. The learning rate was governed by the cosine-decay-restarts scheduler [22] from tensorflow 1.15 with the following parameters:

learning_rate first_decay_steps t_mul m_mul alpha
0.0005        1000              2.0   1.0   0.0

Table 4.2.4: Speech to Phonemes learning rate scheduler
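In tensorflow 1.15 this setup corresponds roughly to the following sketch:

```python
import tensorflow as tf  # tensorflow 1.15

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.cosine_decay_restarts(
    learning_rate=0.0005, global_step=global_step,
    first_decay_steps=1000, t_mul=2.0, m_mul=1.0, alpha=0.0)
# Adam [30] with all remaining parameters left at their defaults.
optimizer = tf.train.AdamOptimizer(learning_rate)
```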

The following experiments using S were evaluated:

STP-BS Speech To Phonemes with Beam Search, see sec 3.2.1. The Tensorflow 1.15 beam search with a beam width of 600 was used as B. Ghost-DTW with the discrete metric was used on the phonemes produced by B from the result of S. The DTW result was then divided by the product of the lengths of the phoneme sequences, and this was used as the final distance.

STP-PKL Speech To Phonemes with Posteriograms KL-divergence, see sec 3.2.2. Ghost-DTW was used to measure the distance between the posteriograms; the KL divergence [35] (eq 2.1) of the sample given the example was used as the frame distance within DTW. Note that the KL divergence is not actually a distance since it is asymmetric, but it was chosen because of its effectiveness in preliminary experiments. In those preliminary experiments both directions of the KL divergence, the posteriogram distance from DTW on Gaussian Posteriograms [62], the Jensen-Shannon Divergence [35] (eq 4.1), the Total Variation Distance [58], the Chebyshev Distance, the Euclidean Distance and the Cosine Distance were tested. The total DTW distance using KL-divergence was then divided by the length of the example keyword, and this was used as the final distance.
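For reference, the asymmetric frame cost used inside Ghost-DTW is the standard KL divergence; a sketch (the smoothing constant, which guards against log of zero, is an assumption):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    # KL(p || q) between two posteriogram frames (eq 2.1). Asymmetric:
    # here p is a sample frame and q an example frame.
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))
```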

STP-EL Speech To Phonemes with Example Likelihood, as described in sec 3.2.3. From the negative log-likelihood the length of the example phoneme sequence was subtracted; the resulting value was used as the final distance.

STP-SL Speech To Phonemes with Sample Likelihood, as described in sec 3.2.4. From the negative log-likelihood the product of the example length and the predicted sample phoneme length was subtracted; the resulting value was used as the final distance.


4.2.3 Deep Distance Learning

The R mapping, explained in section 3.3, was implemented as a deep convolutional neural network. It used MFCCs as input, extracted with the following hyperparameters:

coefficients frame_length frame_step fft_length num_mel_bins
27           512          256        512        120

Table 4.2.5: Deep Distance Learning MFCC extraction

R had the following architecture:

                    filters / units  filter width  filter height  activation
2-D convolution     64               7             3              relu
2-D max pooling                      1             3
2-D convolution     128              1             7              relu
2-D max pooling                      1             4
2-D convolution     256              1             10             relu
2-D convolution     512              7             1              relu
global max pooling
dense               512                                           relu
dense               512                                           none

Table 4.2.6: Deep Distance Learning Architecture
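A plausible Keras rendering of Table 4.2.6 (the input layout, an MFCC "image" with time along the first axis and 27 coefficients along the second, as well as the 'same' padding, are assumptions, since the table does not specify them):

```python
import tensorflow as tf  # tensorflow 1.15

def build_ddl_model():
    # Embedding network R; kernel and pool sizes follow Table 4.2.6,
    # read as (width over time, height over coefficients). The final
    # layer is a linear 512-dimensional embedding.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, (7, 3), activation="relu", padding="same",
                               input_shape=(None, 27, 1)),
        tf.keras.layers.MaxPool2D((1, 3)),
        tf.keras.layers.Conv2D(128, (1, 7), activation="relu", padding="same"),
        tf.keras.layers.MaxPool2D((1, 4)),
        tf.keras.layers.Conv2D(256, (1, 10), activation="relu", padding="same"),
        tf.keras.layers.Conv2D(512, (7, 1), activation="relu", padding="same"),
        tf.keras.layers.GlobalMaxPool2D(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(512),
    ])
```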

The architecture was inspired by a submission to the Google Speech Commands keyword spotting competition hosted on the platform Kaggle [28]. For each of the following experiments R was trained for 24 hours with tensorflow 1.15 [1] using a GTX 1080 Ti on LibriTriplets, see sec 4.1.5.

DDL-COS Deep Distance Learning using COSine distance, trained with a separation (α) of 1.0.

DDL-EU Deep Distance Learning using EUclidean distance, trained with a separation (α) of 1.0.

4.3 Metrics

The experiments aim to measure three aspects of the methods: Efficacy, Resources and Usability. The efficacy metrics aim to measure the performance of the model, i.e. ’how accurate’ it is. The resource metrics aim to measure what is needed to use the method; for example, how much memory the method uses and what its inference


latency is. The usability metrics aim to measure how easy the methods are to use. For example, as introduced in sec 3, all methods use a heuristic on top of the distance calculations; depending on the spread of these distances between samples, it might be impractical to pick a threshold for what is an ’unknown’ keyword, so the usability metric will investigate how ’easy’ it is to decide on such a threshold for ’unknown’.

4.3.1 Efficacy Metrics

In an inference setting there are 5 things that can occur:

1. A user speaks a keyword and the system predicts the correct keyword

2. A user speaks a keyword and the system predicts ’unknown’

3. A user speaks a keyword and the system predicts a wrong keyword

4. A user speaks no keyword and the system predicts some keyword

5. A user speaks no keyword and the system predicts ’unknown’

A system that always ends up in (1) or (5) is a perfect system. It is also noteworthy that it is typically much more common that no keyword has been spoken than the opposite.

In addition to the possible outcomes of a prediction, the order of outcomes plays a role as well.

Consider a scenario where a user utters ’good night’ and where the system has learned the keywords ’good morning’ and ’good night’. The system might at some point in time make a prediction having only heard the first part of the keyword, ’good’; if, for example, the training data was biased towards ’good night’, the system might infer from only the utterance ’good’ that it should be ’good night’. In reality it would have been preferable if the system had returned ’unknown’ and waited until it heard the full keyword. This is one example of a difficult problem, and the different ways of addressing it have different drawbacks. However, this is not a main focus of this thesis, so to simplify, three kinds of inference heuristics for deciding on what inference to use will be evaluated. First, accuracy as first: considering only the first ’non-unknown’ inference of a keyword to be valid. Second, accuracy as majority: considering the majority vote of inferences around a keyword to be the prediction from the system. Finally, accuracy as some point: considering an inference to be correct if at


some point a correct inference is made.

Occurrence of scenarios (3) and (4) is what the system must prioritize minimizing, since misinterpretation or seemingly random behaviour is considered by the author to be worse than not hearing. To measure how well a system minimizes these scenarios one can use the false-positive rate as an indicator for (4) and accuracy as an indicator for (3). A trade-off between accuracy and false-positive rate will be made depending on the distance threshold T, see chapter 3; to capture this, the metrics are calculated for all relevant values of T. This leaves us with a graph showcasing the trade-off between accuracy and false positives for different T. However, to make the graphs more interpretable they will instead contain:

E(T) = accuracy(T) / (fp(T) + ϵ)    (4.1)

The efficacy (E) at distance T is the ratio of the accuracy to the false-positive rate plus some small constant ϵ. ϵ controls how valuable it is to keep a low false-positive rate. Using this formula, the methods with the highest peak are the best for a given epsilon.

In the result plots the distance (T) for a method will be normalized to [0, 1], the false-positive rate will be a value in [0, 1] as will the accuracy, and ϵ = 1/100.
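Computationally the metric is simple; a sketch, where accuracy_at and fp_at are hypothetical callables evaluating the two quantities at a given threshold:

```python
import numpy as np

def efficacy(accuracy, fp_rate, eps=1 / 100):
    # Equation 4.1: accuracy divided by the penalized false-positive rate.
    return accuracy / (fp_rate + eps)

def efficacy_curve(accuracy_at, fp_at, n=100):
    # Sweep the normalized distance threshold over [0, 1].
    thresholds = np.linspace(0.0, 1.0, n)
    return thresholds, [efficacy(accuracy_at(t), fp_at(t)) for t in thresholds]
```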

The results will contain plots of efficacy for each inference heuristic, and the appendix will contain false-positive-rate plots and the accuracies for different distances. The appendix will also contain a confusion matrix for each method for when the distance has been chosen as the one maximizing efficacy calculated using ’accuracy as majority’.

To clarify, accuracy here is defined as the number of correctly classified keywords over the total number of keywords. This means that a prediction of some keyword when the label is ’unknown’ does not affect the accuracy at all, since ’unknown’ is not considered a keyword. Furthermore, a false positive is defined as a prediction of any keyword when the label is ’unknown’.


Interpretation of Efficacy

Here is an example of the efficacy plots that will be present in the results:

[Figure: three panels (accuracy_as_first, accuracy_as_some_point, accuracy_as_majority) plotting efficacy against normalized distance for STP-BS, STP-EL, STP-PKL and STP-SL.]

Figure 4.3.1: Efficacy of quiet and same speaker evaluation

It shows the efficacy, equation 4.1, over all distances normalized to [0, 1], for all inference heuristics. As a rule of thumb, having the highest peak is best; following this reasoning, ’STP-SL’ is superior with the ’accuracy_as_first’ heuristic and ’STP-BS’ superior with the others, in this example.

A decrease in efficacy is caused by an influx of false positives. An increase in efficacy is caused by improved accuracy without a significant increase in false-positive rate. A peak can therefore be interpreted as the distance where accuracy is optimal given our penalty on false positives. Methods with the highest peak therefore have the highest capability of good accuracy under our penalization of false positives, and they are considered best.

4.3.2 Resource Metrics

The constraints section, sec 1.4, places limitations on the memory usage and latency of the methods; to this end, the memory usage and latency of the methods will be presented. Denote the average seconds per inference as E[I(K)] and the number of keyword examples a system has been given as K. The latency metric provided will


be:

L(K) = E[I(K)] (4.2)

The results will contain a plot of L for different values of K.

Making fair latency benchmarks is a notoriously hard problem, since using the right hardware and implementation can make all the difference. An implementation with performance as its main objective, versus another one with a different objective, can show a performance disparity of many orders of magnitude. In this thesis I have not implemented all of the methods from scratch, but have made use of existing libraries. Some of these libraries might not have performance as their main consideration, and some of them might be able to produce orders of magnitude better performance if compiled with optimizations and run on the right hardware. All of this is to say that although in my experiments some method might have a latency that is better than some other, it is by no means a definitive result.

For memory usage, the number of MB of storage required to use each method as evaluated in this thesis will be presented.

4.3.3 Usability Metrics

The main difficulty for usability lies in deciding on a threshold for what is considered ’unknown’ (see the heuristic in chapter 3); that threshold could even vary depending on what words have been learnt. Having a single threshold is preferable because it makes usage easier; otherwise, every time a new keyword is added a threshold would have to be found as well.

To inspect the ease of deciding on thresholds, the in-class and inter-class distributions of distances will be plotted. If the in-class and inter-class distributions are clearly separable, then defining a single threshold should be feasible. The usability will be presented as plots of the in-class and inter-class distributions of distances for some keywords.

To clarify, the in-class distribution is the distribution of distances between keywords of the same class, and the inter-class distribution is the distribution of distances to keywords outside of the class.


4.4 Evaluation

This section describes the evaluation settings: what datasets were used and how the metrics from sec 4.3 were evaluated.

4.4.1 Realistic

To evaluate efficacy the RS dataset, see sec 4.1.6, was used. The dataset and this evaluation setting were created to resemble, as closely as possible, the author’s intended real-world use cases. There are four cases. First, the quiet and same scenario, where the same speaker provides examples and samples and the samples are recorded in a quiet room, think bedroom, where there is not a lot of background noise or non-keyword speech. Second, the noisy and same scenario, where the same speaker provides examples and samples and the samples are recorded in a noisy room, think living room, where there is background noise such as a television and a lot of non-keyword speech. Third, the quiet and different scenario, where one speaker provides examples, different speakers provide samples and the samples are recorded in a quiet room. Finally, the noisy and different scenario, where one speaker provides examples, different speakers provide the samples and the samples are recorded in a noisy environment. These scenarios are all part of the RS dataset.

For each scenario the methods evaluated were provided with three examples of each of the six different keywords ’time’, ’stop’, ’night’, ’morning’, ’illuminate’ and ’alarm’, in total 18 example recordings. Using these examples, inference on the RS dataset was performed using a sliding window [19]; the window size was 2000 ms and the stride 250 ms. Using the inferences from the sliding window, the metrics from sec 4.3.1 were calculated by assuming that if the center of an inference window was at most 2 seconds away from the utterance, then it was an inference on the utterance.
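A sketch of the sliding-window segmentation (the 16 kHz sample rate is an assumption; window size and stride follow the text above):

```python
def sliding_windows(audio, sample_rate=16000, window_ms=2000, stride_ms=250):
    # Yield (center_time_in_seconds, window) pairs over a recording.
    window = int(sample_rate * window_ms / 1000)
    stride = int(sample_rate * stride_ms / 1000)
    for start in range(0, max(len(audio) - window + 1, 1), stride):
        center = (start + window / 2) / sample_rate
        yield center, audio[start:start + window]
```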

4.4.2 Google Speech Commands

The Google Speech Commands (GSC) dataset, see sec 4.1.3, is a limited vocabulary keyword spotting benchmark dataset. Limited vocabulary keyword spotting is a different problem than that addressed in this thesis, but evaluating on this benchmark still makes this work somewhat comparable to other works that have used this dataset for evaluation. Using the GSC dataset a 3-shot learning benchmark was created. The


keywords ’left’, ’learn’, ’sheila’, ’seven’, ’dog’ and ’down’ were included, and the system was given 3 examples of each keyword. It was then evaluated using multi-class accuracy as in the Kaggle competition [55] using the same dataset. The following is a plot of the data distribution used in the evaluation:

Figure 4.4.1: Distribution of the keywords used in the GSC evaluation

4.4.3 Latency

Latency was evaluated on noise. For each K, see sec 4.3.2, K normally distributed vectors were generated and registered as examples. Then N (N = 100) normally distributed vectors were generated and used to measure the inference time for each K. The performance plot is provided in figure 5.4.2.
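A sketch of this benchmark (infer and the vector dimensionality are placeholders for the method under test):

```python
import time
import numpy as np

def measure_latency(infer, K, n_trials=100, dim=512):
    # Register K random "examples", then estimate E[I(K)] as the mean
    # wall-clock time over n_trials random inference vectors.
    examples = [np.random.randn(dim) for _ in range(K)]
    samples = [np.random.randn(dim) for _ in range(n_trials)]
    start = time.perf_counter()
    for sample in samples:
        infer(sample, examples)
    return (time.perf_counter() - start) / n_trials
```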

4.4.4 Usability

The usability was evaluated using the Usability dataset, see sec 4.1.7.


Chapter 5

Result

First, results of the individual methods are presented. Then a comparison between the results of the best variants (according to the efficacy metric) of the individual methods is given. The final comparison also contains the evaluation of the resources and usability of the methods.

5.1 Speech Distances

[Figure: efficacy against normalized distance for the three inference heuristics; curves for DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, Ghost-DTW-MFCC-COS and Ghost-DTW-MFCC-EU.]

Figure 5.1.1: Efficacy of quiet and same speaker evaluation


Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.2 and A.1.3 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, Ghost-DTW-MFCC-COS and Ghost-DTW-MFCC-EU.]

Figure 5.1.2: Efficacy of noisy and same speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.4 and A.1.5 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, Ghost-DTW-MFCC-COS and Ghost-DTW-MFCC-EU.]

Figure 5.1.3: Efficacy of quiet and different speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be


seen in figures A.1.6 and A.1.7 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, Ghost-DTW-MFCC-COS and Ghost-DTW-MFCC-EU.]

Figure 5.1.4: Efficacy of noisy and different speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.8 and A.1.9 in Appendix A.

5.2 Speech to Phonemes

[Figure: efficacy against normalized distance for the three inference heuristics; curves for STP-BS, STP-EL, STP-PKL and STP-SL.]

Figure 5.2.1: Efficacy of quiet and same speaker evaluation


Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.10 and A.1.11 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for STP-BS, STP-EL, STP-PKL and STP-SL.]

Figure 5.2.2: Efficacy of noisy and same speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.12 and A.1.13 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for STP-BS, STP-EL, STP-PKL and STP-SL.]

Figure 5.2.3: Efficacy of quiet and different speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be


seen in figures A.1.14 and A.1.15 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for STP-BS, STP-EL, STP-PKL and STP-SL.]

Figure 5.2.4: Efficacy of noisy and different speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.16 and A.1.17 in Appendix A.

5.3 Deep Distance Learning

[Figure: efficacy against normalized distance for the three inference heuristics; curves for DDL-COS and DDL-EU.]

Figure 5.3.1: Efficacy of quiet and same speaker evaluation


Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.18 and A.1.19 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for DDL-COS and DDL-EU.]

Figure 5.3.2: Efficacy of noisy and same speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.20 and A.1.21 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for DDL-COS and DDL-EU.]

Figure 5.3.3: Efficacy of quiet and different speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be


seen in figures A.1.22 and A.1.23 in Appendix A.

[Figure: efficacy against normalized distance for the three inference heuristics; curves for DDL-COS and DDL-EU.]

Figure 5.3.4: Efficacy of noisy and different speaker evaluation

Accuracy, false positive rate and the confusion matrix for the highest efficacy can be

seen in figures A.1.24 and A.1.25 in Appendix A.

5.4 Comparison

This section presents a comparison of the top performing methods from the first three sections. For plots of efficacy with only the top performers, see appendix figures A.1.28, A.1.29, A.1.30 and A.1.31.


5.4.1 Accuracy

Figure 5.4.1: GSC multi-class accuracy

For confusion matrices of the GSC evaluation, see Appendix A.1.26.

5.4.2 Resources

Figure 5.4.2: Inference Latency

The red line is an upper bound on the allowed latency to fulfill the constraints from section 1.4. L(K) is described in the latency metric section, see sec 4.3.2.

Memory usage, for the methods as described in sec 4.2, is presented in the following table:


                   Ghost-DTW-MFCC  STP-SL  DDL
Representation MB  0.0067          0.08    0.016
Model MB           0               3.5     5.8

Table 5.4.1: Memory Usage


5.4.3 Usability

Figure 5.4.3: Inter-class and in-class distribution


The solid lines represent the in-class and the dotted lines the inter-class distributions of distances. In appendix A.1.27 the usability of the STP methods is plotted without the length post-processing.


Chapter 6

Conclusions

In this chapter we attempt to draw conclusions about when to use which method. We begin by stating the benefits and drawbacks of each method. Finally, we summarize in section 6.4.

6.1 Sequence Distances

The main benefit becomes apparent looking at memory, table 5.4.1: we conclude that sequence distances require the least amount of memory. Looking at the latency, figure 5.4.2, we see that sequence distances have low latency. The Ghost-DTW implementation is able to learn far more examples than what was used for benchmarking before latency becomes an issue.

Regarding the efficacy: looking at the GSC evaluation, fig 5.4.1, sequence distances leave something to be desired. However, the GSC evaluation contains multiple speakers, with different microphones and in different settings. If the use case is more constrained, for example a single speaker and a single microphone, then sequence distances might be worth considering. Looking at the quiet and same results, figure 5.1.1, the sequence distances reached an efficacy comparable with the other methods. Furthermore, by looking at the confusion matrix for the same evaluation, figure A.1.3, most of the sequence distances perform well. The sequence distances also show promise in the quiet and different results, figure 5.1.3. However, in the noisy environments one should consider using a different method.

Regarding usability: the sequence distances would likely benefit from tuning on a per-case


basis. However, picking a single threshold is not impossible, although it comes at the cost of some performance. For example, for the Ghost-DTW-MFCC-CHE method a threshold of 35 works decently across all evaluations, and we can see from figure 5.4.3 that 35 would separate the distributions decently.

6.2 Speech To Phonemes

Starting with drawbacks: looking at the latency results, figure 5.4.2, STP-SL breaks the allowed barrier already at 12 examples. Furthermore, the STP methods lie uncomfortably close to the barrier already at 1 example. Of course, this might be an artifact of the evaluation; different optimizations could possibly yield better results. However, the STP methods have an expensive representation step, inferring the phoneme posterior, and typically an expensive distance on top of the posteriors too. For example, STP-SL and STP-EL use a CTC forward pass on top of the phoneme posteriors, and STP-PKL uses DTW on top of the phoneme posteriors. Only STP-BS uses a computationally cheap distance on top of the representation. Looking at the memory usage of STP-SL, table 5.4.1, it uses a few MB for storing the model and a few KB per representation. The memory usage is completely within acceptable margins. However, to reduce the memory usage even further, Bluche et al [8] suggest using quantized LSTM networks; this would bring the LSTM used by the STP methods to under 1 MB.

Moving on to some benefits: looking at the GSC results, figure 5.4.1, STP-SL achieves 60% accuracy on the 3-shot learning task. The best submissions to the Kaggle competition [55] achieved an accuracy of 90% on a larger subset of the dataset. Comparing STP-SL directly to that makes it look bad. However, the STP method was only given 3 examples per keyword, not thousands, and it was not trained for the GSC dataset specifically. Considering the circumstances it is not nearly as bad as the direct comparison might indicate.

Looking at the efficacy results, the STP methods had high variance in comparison to the sequence distances. On the quiet evaluations STP-BS performs best of the STP methods and beats all sequence distances. However, on the noisy evaluations STP-BS scores badly in comparison with the other STP methods. More consistently, the STP-SL method beats all sequence distances on all efficacy evaluations. It scores well against the other STP methods on the quiet evaluations and is among the best methods on the noisy evaluations.


Regarding usability: another drawback of the STP methods is that they did not work well out of the box. The reader might have been confused as to why the STP methods contain extra logic such as ’divide the resulting distance by the length of the example keyword’. By taking a look at figure A.1.27 we see two usability plots of STP methods without this extra post-processing. The figure shows that long words and short words have very different distributions and are hard to separate. Based on these results the post-processing was added to improve the separability of the distributions. The post-processing improves the results of the methods by orders of magnitude. However, even with this post-processing we can see from figure 5.4.3 that although the distributions appear separable for these particular keywords, the modes are quite spread out, and a single threshold that works for all keywords might be hard to find. It is left as future work to dig deeper into how to make the STP methods more usable, for they have promising attributes: such as interpretability (one can interpret the sequence of phonemes the model believes exists in the audio) and good efficacy.

6.3 Deep Distance Learning

The memory usage of the DDL methods (see table 5.4.1) is within the margins of the constraints stated in this thesis. For readers interested in truly minimizing memory usage, the CNN can be quantized, as done by Wu et al [61]. The DDL methods have very good latency: from figure 5.4.2 we can see that increasing K gives no noticeable performance degradation within the first 30 examples. Also, the base performance for one example is very good compared to the other methods. DDL supports hundreds of keywords before latency becomes an issue.

On GSC, figure 5.4.1, DDL scored over 60%, similar to the STP methods. Also similar to the STP methods, DDL performs badly in comparison to the best baselines on GSC. But considering the limitations, the results are more impressive. Regarding efficacy, DDL got a perfect score in both of the quiet environments and was comparable to the best STP method in the noisy ones.

Regarding usability: looking at figure 5.4.3, the in-class and inter-class modes are clearly separable and using a single threshold looks possible. For example, for the DDL-EU variant a threshold of 0.8 performs very well.


6.4 What’s best

The DDL methods would likely work best for most tasks, with the one exception being extremely memory constrained environments. In those settings the sequence distances would be a better choice. DDL is likely the best choice because of its high performance in efficacy, latency and usability. Only the STP methods could rival it in efficacy, but they had problems with latency and usability.

6.5 Future Work

In this section possible extensions of the work are proposed.

6.5.1 DDL Extensions

Here we present some interesting future directions for improving the DDL model.

More Data

In this thesis we have constructed LibriWords, sec 4.1.4, from LibriSpeech, a dataset with about 1000 speakers. One could also use Mozilla Common Voice [3], which has an English-speaking dataset with over 55000 speakers. A dataset ”Megawörd”, consisting of data from LibriSpeech, Mozilla Common Voice and Google Speech Commands, has been created to train the DDL models described in this thesis for the Friday Assistant [20]. That dataset is orders of magnitude larger than the one used in this thesis and will be provided on request. It is not provided online due to hosting costs.

Different Triplet Loss

In this thesis we used a triplet loss that maximizes the relative distance between anchor-to-positive and anchor-to-negative:

L(A, P, N) = max(D(R(A), R(P)) − D(R(A), R(N)) + α, 0)    (6.1)

It would be of interest to try using a triplet loss that maximizes an absolute distance:

L(A, P, N) = D(R(A), R(P)) + β · max(α − D(R(A), R(N)), 0)    (6.2)

Here β is a scalar that can be used to get a more realistic weighting of the data. For keyword spotting, negative samples are much more common than positive samples, therefore a β > 1 is a good fit.

The hypothesis is that maximizing the absolute distance would yield tighter and better separated clusters in the representation space, and therefore better discriminative properties.
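A sketch of both losses on precomputed embedding distances, where d_ap = D(R(A), R(P)) and d_an = D(R(A), R(N)); the value of β is illustrative:

```python
import tensorflow as tf

def relative_triplet_loss(d_ap, d_an, alpha=1.0):
    # Equation 6.1: hinge on the gap between the anchor-positive and
    # anchor-negative distances, with separation alpha.
    return tf.maximum(d_ap - d_an + alpha, 0.0)

def absolute_triplet_loss(d_ap, d_an, alpha=1.0, beta=2.0):
    # Equation 6.2 (proposed): pull positives in absolutely and push
    # negatives beyond alpha; beta > 1 up-weights the negative term.
    return d_ap + beta * tf.maximum(alpha - d_an, 0.0)
```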

Better Augmentations

Section 4.1.8 describes the post-processing (data augmentation) techniques used in this thesis. From empirical experiments not accounted for in this thesis, the methods, when trained on the LibriWords dataset with the post-processing of this thesis, exhibit a performance drop in environments with reverberation. Improving the data augmentation might alleviate this and further improve the performance of DDL in realistic settings; Ko et al [31] suggest an improved data augmentation scheme for this problem.

Asymmetric Task

For the STP-PKL model, when using distances on the CTC posteriograms, it was discovered that an asymmetric ’distance’ worked better than all symmetric variants. It would be interesting to train a DDL model with an asymmetry to see if it yields similar results.

Imagine rephrasing the optimization to ’does the word in this audio clip appear somewhere inside this other audio clip’ instead of ’do these two audio clips contain the same word’. One could easily adapt some words dataset to such an optimization task: simply include more context from the source dataset when creating the positives and negatives, and add a prediction head without weight sharing to allow the model to learn the asymmetry.

The hypothesis is that this would work because the examples contain only the utterance, recorded in a clean environment, while samples might contain noisy distractors such as background noise, or might be part of a sentence.


6.5.2 Usability of STP

The distribution of in-class and inter-class distances for keywords of different lengths makes using the STP methods more difficult, due to the problem of choosing a threshold. But as shown in this thesis, one can improve the results of the STP methods by normalizing the distances in some manner using the length of the keyword. The fixes in this thesis were ad hoc, and it is likely that there exists some better way of addressing this usability issue.

For example, for STP-SL the length of the phoneme sequence of the sample and the length of the example were subtracted from the negative log-likelihood. The motivation behind this was the intuition that the probability of a word decreases by a multiplicative factor for each phoneme added, since the probability of a sequence under the CTC loss is the product of the probabilities of the phonemes in the CTC posteriogram. However, this intuition is incomplete, since the probability of a phoneme sequence is the sum over all paths in the CTC posteriogram that generate said phoneme sequence. Using this knowledge a more suitable normalization might be devised. Or perhaps the threshold problem for STP can be avoided in an altogether different manner.

6.5.3 Heuristic

In chapter 3 we introduce a heuristic for deciding on predictions given a collection of distances to the sample. In this thesis we decided on using a set distance, see eq 3.1, based on the minimal set difference. The DONUT paper [36] suggests using a set distance based on the average difference. In preliminary experiments the average difference distance did not show any clear advantages. But it is possible that other set distances, or different heuristics altogether, might work better for the prediction problem described in chapter 3.

6.6 Final Words

Check out the Friday voice assistant [20]! It, at the time of writing, uses the DDL method described in this thesis. All of the code for the models, evaluations and datasets can also be found in the same repository.


Bibliography

[1] Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. “Tensorflow: A system for large-scale machine learning”. In: 12th USENIX symposium on operating systems design and implementation (OSDI 16). 2016, pp. 265–283.

[2] Alex, John Sahaya Rani and Venkatesan, Nithya. “Modified Multivariate

Euclidean Dynamic Time Warping Based Spoken Keyword Detection”. In:

(2017).

[3] Ardila, Rosana, Branson, Megan, Davis, Kelly, Henretty, Michael, Kohler,

Michael, Meyer, Josh, Morais, Reuben, Saunders, Lindsay, Tyers, Francis M,

and Weber, Gregor. “Common voice: A massively­multilingual speech corpus”.

In: arXiv preprint arXiv:1912.06670 (2019).

[4] Association, International Phonetic, Staff, International Phonetic Association,

et al. Handbook of the International Phonetic Association: A guide to the use

of the International Phonetic Alphabet. Cambridge University Press, 1999.

[5] Bahl, Lalit R, Jelinek, Frederick, and Mercer, Robert L. “A maximum likelihood approach to continuous speech recognition”. In: IEEE transactions on pattern analysis and machine intelligence 2 (1983), pp. 179–190.

[6] Bellman, Richard. “Dynamic programming”. In: Science 153.3731 (1966),

pp. 34–37.

[7] Bluche, Théodore and Gisselbrecht, Thibault. “Predicting detection filters

for small footprint open­vocabulary keyword spotting”. In: arXiv preprint

arXiv:1912.07575 (2019).


[8] Bluche, Théodore, Primet, Maël, and Gisselbrecht, Thibault. “Small­Footprint

Open­Vocabulary Keyword Spotting with Quantized LSTM Networks”. In:

arXiv preprint arXiv:2002.10851 (2020).

[9] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, Säckinger, Eduard, and Shah, Roopak. “Signature verification using a ‘siamese’ time delay neural network”. In: Advances in neural information processing systems (1994), pp. 737–737.

[10] Chandra, E et al. “Keyword spotting system for Tamil isolated words

using Multidimensional MFCC and DTW algorithm”. In: 2015 International

Conference on Communications and Signal Processing (ICCSP). IEEE. 2015,

pp. 0550–0554.

[11] Chen, Yangbin, Ko, Tom, Shang, Lifeng, Chen, Xiao, Jiang, Xin, and Li, Qing.

“An investigation of few­shot learning in spoken term classification”. In: arXiv

preprint arXiv:1812.10233 (2018).

[12] Cohen, Michael H, Cohen, Michael Harris, Giangola, James P, and Balogh,

Jennifer. Voice user interface design. Addison­Wesley Professional, 2004.

[13] Conci, Aura and Kubrusly, CS. “Distance between sets: a survey”. In: arXiv preprint arXiv:1808.02574 (2018).

[14] Davis, Ken H, Biddulph, R, and Balashek, Stephen. “Automatic recognition of spoken digits”. In: The Journal of the Acoustical Society of America 24.6 (1952), pp. 637–642.

[15] Deka, Brajen Kumar and Das, Pranab. “An Analysis of an Isolated Assamese

Digit Recognition using MFCC and DTW”. In: 2019 6th International

Conference on Computing for Sustainable Global Development (INDIACom).

IEEE. 2019, pp. 46–50.

[16] Denes, P and Mathews, Max V. “Spoken Digit Recognition Using Time­

Frequency Pattern Matching”. In: The Journal of the Acoustical Society of

America 32.11 (1960), pp. 1450–1455.

[17] Fant, Gunnar. “Speech perception”. In: Speech Acoustics and Phonetics (2005),

pp. 199–220.

[18] Forney, G David. “The viterbi algorithm”. In: Proceedings of the IEEE 61.3

(1973), pp. 268–278.


[19] Frank, Ray J, Davey, Neil, and Hunt, Stephen P. “Time series prediction and

neural networks”. In: Journal of intelligent and robotic systems 31.1 (2001),

pp. 91–103.

[20] Friday Voice Assisstant. https://github.com/JonasRSV/Friday. Accessed:

2021­03­10.

[21] Giorgino, Toni et al. “Computing and visualizing dynamic time warping

alignments in R: the dtw package”. In: Journal of statistical Software 31.7

(2009), pp. 1–24.

[22] Gotmare, Akhilesh, Keskar, Nitish Shirish, Xiong, Caiming, and Socher,

Richard. “A closer look at deep learning heuristics: Learning rate restarts,

warmup and distillation”. In: arXiv preprint arXiv:1810.13243 (2018).

[23] Graves, Alex, Fernández, Santiago, Gomez, Faustino, and Schmidhuber, Jürgen.

“Connectionist temporal classification: labelling unsegmented sequence data

with recurrent neural networks”. In: Proceedings of the 23rd international

conference on Machine learning. 2006, pp. 369–376.

[24] Hoy, Matthew B. “Alexa, Siri, Cortana, and more: an introduction to voice assistants”. In: Medical reference services quarterly 37.1 (2018), pp. 81–88.

[25] Huggins­Daines, David, Kumar, Mohit, Chan, Arthur, Black, Alan W,

Ravishankar, Mosur, and Rudnicky, Alexander I. “Pocketsphinx: A free, real­

time continuous speech recognition system for hand­held devices”. In: 2006

IEEE International Conference on Acoustics Speech and Signal Processing

Proceedings. Vol. 1. IEEE. 2006, pp. I–I.

[26] Huh, Jaesung, Lee, Minjae, Heo, Heesoo, Mun, Seongkyu, and Chung, Joon Son. “Metric Learning for Keyword Spotting”. In: arXiv preprint arXiv:2005.08776 (2020).

[27] Johnston, Steven J and Cox, Simon J. The Raspberry Pi: A technology disrupter, and the enabler of dreams. 2017.

[28] Kaggle KWS CNN. https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/discussion/47715. Accessed: 2021-03-11.


[29] Kim, Byeonggeun, Lee, Mingu, Lee, Jinkyu, Kim, Yeonseok, and Hwang,

Kyuwoong. “Query­by­example on­device keyword spotting”. In: 2019 IEEE

Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE.

2019, pp. 532–538.

[30] Kingma, Diederik P and Ba, Jimmy. “Adam: A method for stochastic

optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[31] Ko, Tom, Peddinti, Vijayaditya, Povey, Daniel, Seltzer, Michael L, and

Khudanpur, Sanjeev. “A study on data augmentation of reverberant speech

for robust speech recognition”. In: 2017 IEEE International Conference on

Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 5220–

5224.

[32] Köppen, Mario. “The curse of dimensionality”. In: 5th Online World Conference on Soft Computing in Industrial Applications (WSC5). Vol. 1. 2000, pp. 4–8.

[33] Landau, HJ. “Sampling, data transmission, and the Nyquist rate”. In:

Proceedings of the IEEE 55.10 (1967), pp. 1701–1706.

[34] Librispeech Phoneme Lexicon. https://drive.google.com/file/d/1dAvxdsHWbtA1ZIh3Ex9DPn9Nemx9M1-L/view. Accessed: 2021-03-11.

[35] Lin, Jianhua. “Divergence measures based on the Shannon entropy”. In: IEEE

Transactions on Information theory 37.1 (1991), pp. 145–151.

[36] Lugosch, Loren, Myer, Samuel, and Tomar, Vikrant Singh. “DONUT: CTC-based Query-by-Example Keyword Spotting”. In: arXiv preprint arXiv:1811.10736 (2018).

[37] McAuliffe, Michael, Socolof, Michaela, Mihuc, Sarah, Wagner, Michael,

and Sonderegger, Morgan. “Montreal Forced Aligner: Trainable Text­Speech

Alignment Using Kaldi.” In: Interspeech. Vol. 2017. 2017, pp. 498–502.

[38] McFee, Brian, Raffel, Colin, Liang, Dawen, Ellis, Daniel PW, McVicar, Matt,

Battenberg, Eric, and Nieto, Oriol. “librosa: Audio and music signal analysis

in python”. In: Proceedings of the 14th python in science conference. Vol. 8.

Citeseer. 2015, pp. 18–25.

[39] Montgomery, Alan L and Smith, Michael D. “Prospects for Personalization on

the Internet”. In: Journal of Interactive Marketing 23.2 (2009), pp. 130–137.


[40] Morgan, David P, Scofield, Christopher L, Lorenzo, Theresa M, Real, Edward C,

and Loconto, David P. “A keyword spotter which incorporates neural networks

for secondary processing”. In: International Conference on Acoustics, Speech,

and Signal Processing. IEEE. 1990, pp. 113–116.

[41] Morgan, DP, Scofield, Christopher L, and Adcock, John E. “Multiple neural

network topologies applied to keyword spotting”. In: [Proceedings] ICASSP 91:

1991 International Conference on Acoustics, Speech, and Signal Processing.

IEEE. 1991, pp. 313–316.

[42] Mynatt, ElizabethD, Back,Maribeth,Want, Roy, Baer,Michael, and Ellis, Jason

B. “Designing audio aura”. In:Proceedings of the SIGCHI conference onHuman

factors in computing systems. 1998, pp. 566–573.

[43] Naylor, JA, Huang, WY, Nguyen, M, and Li, KP. “The application of neural

networks to wordspotting”. In: [1992] Conference Record of the Twenty­Sixth

Asilomar Conference on Signals, Systems & Computers. IEEE. 1992, pp. 1081–

1085.

[44] O’Shea, Keiron and Nash, Ryan. “An introduction to convolutional neural

networks”. In: arXiv preprint arXiv:1511.08458 (2015).

[45] Panayotov, Vassil, Chen, Guoguo, Povey, Daniel, and Khudanpur, Sanjeev.

“Librispeech: an asr corpus based on public domain audio books”. In: 2015 IEEE

international conference on acoustics, speech and signal processing (ICASSP).

IEEE. 2015, pp. 5206–5210.

[46] Peterson, Leif E. “K­nearest neighbor”. In: Scholarpedia 4.2 (2009), p. 1883.

[47] Python DTW implementation. https://github.com/pierre-rouanet/dtw. Accessed: 2021-03-10.

[48] Rohlicek, J Robin, Russell, William, Roukos, Salim, and Gish, Herbert. “Continuous hidden Markov modeling for speaker-independent word spotting”. In: International Conference on Acoustics, Speech, and Signal Processing. IEEE. 1989, pp. 627–630.

[49] Sakoe, Hiroaki. “Two-level DP-matching: A dynamic programming-based pattern matching algorithm for connected word recognition”. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 27.6 (1979), pp. 588–595.


[50] Scheidl, Harald, Fiel, Stefan, and Sablatnig, Robert. “Word beam search:

A connectionist temporal classification decoding algorithm”. In: 2018 16th

International Conference on Frontiers in Handwriting Recognition (ICFHR).

IEEE. 2018, pp. 253–258.

[51] Senin, Pavel. “Dynamic time warping algorithm review”. In: Information and

Computer Science Department University of Hawaii at Manoa Honolulu, USA

855.1­23 (2008), p. 40.

[52] SF2956. NotesMC. Last accessed 21 May 2021. 2021.

[53] Sohn, Jongseo, Kim, Nam Soo, and Sung, Wonyong. “A statistical model­based

voice activity detection”. In: IEEE signal processing letters 6.1 (1999), pp. 1–3.

[54] Sung, Flood, Yang, Yongxin, Zhang, Li, Xiang, Tao, Torr, Philip HS, and

Hospedales, Timothy M. “Learning to compare: Relation network for few­shot

learning”. In: Proceedings of the IEEE conference on computer vision and

pattern recognition. 2018, pp. 1199–1208.

[55] Tensorflow Speech Recognition Challenge. https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data. Accessed: 2021-03-30.

[56] Tiwari, Vibha. “MFCC and its applications in speaker recognition”. In:

International journal on emerging technologies 1.1 (2010), pp. 19–22.

[57] Twaddell, W Freeman. “On defining the phoneme”. In: Language 11.1 (1935),

pp. 5–62.

[58] Verdú, Sergio. “Total variation distance and the distribution of relative

information”. In: 2014 Information Theory and Applications Workshop (ITA).

IEEE. 2014, pp. 1–3.

[59] Vygon, Roman and Mikhaylovskiy, Nikolay.

“Learning Efficient Representations for Keyword Spotting with Triplet Loss”.

In: arXiv preprint arXiv:2101.04792 (2021).

[60] Warden, Pete. “Speech commands: A dataset for limited­vocabulary speech

recognition”. In: arXiv preprint arXiv:1804.03209 (2018).

[61] Wu, Jiaxiang, Leng, Cong, Wang, Yuhang, Hu, Qinghao, and Cheng, Jian. “Quantized Convolutional Neural Networks for Mobile Devices”. In: CoRR abs/1512.06473 (2015). arXiv: 1512.06473. URL: http://arxiv.org/abs/1512.06473.


[62] Zhang, Yaodong and Glass, James R. “Unsupervised spoken keyword spotting

via segmental DTW on Gaussian posteriorgrams”. In: 2009 IEEEWorkshop on

Automatic Speech Recognition & Understanding. IEEE. 2009, pp. 398–403.


Appendix Contents

A First Appendix


Appendix A

First Appendix

Figure A.1.1: Embeddings and audio from figure 3.3.1

The first audio belongs to the first embedding (from the top), the second embedding belongs to the second audio, and so on.


Figure A.1.2: Accuracy and ’false positive rate’ of quiet and same speaker evaluation


Figure A.1.3: Confusion matrix of quiet and same speaker evaluation


Figure A.1.4: Accuracy and ’false positive rate’ of noisy and same speaker evaluation


Figure A.1.5: Confusion matrix of noisy and same speaker evaluation


Figure A.1.6: Accuracy and ’false positive rate’ of quiet and different speaker evaluation


Figure A.1.7: Confusion matrix of quiet and different speaker evaluation


FigureA.1.8: Accuracy and ’false positive rate’ of noisy anddifferent speaker evaluation


Figure A.1.9: Confusion matrix of noisy and different speaker evaluation


Figure A.1.10: Accuracy and ’false positive rate’ of quiet and same speaker evaluation


Figure A.1.11: Confusion matrix of quiet and same speaker evaluation


Figure A.1.12: Accuracy and ’false positive rate’ of noisy and same speaker evaluation


Figure A.1.13: Confusion matrix of noisy and same speaker evaluation


Figure A.1.14: Accuracy and ’false positive rate’ of quiet and different speaker evaluation


Figure A.1.15: Confusion matrix of quiet and different speaker evaluation


Figure A.1.16: Accuracy and ’false positive rate’ of noisy and different speaker evaluation


Figure A.1.17: Confusion matrix of noisy and different speaker evaluation

Figure A.1.18: Accuracy and 'false positive rate' of quiet and same speaker evaluation

Figure A.1.19: Confusion matrix of quiet and same speaker evaluation

Figure A.1.20: Accuracy and 'false positive rate' of noisy and same speaker evaluation

Figure A.1.21: Confusion matrix of noisy and same speaker evaluation

Figure A.1.22: Accuracy and 'false positive rate' of quiet and different speaker evaluation

Figure A.1.23: Confusion matrix of quiet and different speaker evaluation

Figure A.1.24: Accuracy and 'false positive rate' of noisy and different speaker evaluation

Figure A.1.25: Confusion matrix of noisy and different speaker evaluation

Figure A.1.26: Confusion matrices of GSC evaluation

Figure A.1.27: Inter-class and in-class distribution of STP methods without length normalization

The Phoneme Error Rate (PER) of the STP phoneme model on the LibriSpeech test split was 19%.
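
For context, PER is conventionally computed as the Levenshtein (edit) distance between the predicted and reference phoneme sequences, normalized by the reference length. A minimal sketch follows; the function name and the example phonemes are illustrative, not taken from the thesis code:

```python
def phoneme_error_rate(ref, hyp):
    """Length-normalized Levenshtein distance between two phoneme sequences.

    ref, hyp: lists of phoneme symbols, e.g. ["HH", "AH", "L", "OW"].
    Returns (substitutions + insertions + deletions) / len(ref).
    """
    # Dynamic-programming edit-distance table: d[i][j] is the minimum
    # number of edits turning ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)


# Example: one substitution over four reference phonemes gives PER = 0.25
print(phoneme_error_rate(["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]))
```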

[Figure: four panels (accuracy_as_first, accuracy_as_some_point, accuracy_as_majority, efficacy), each plotting values 0-100 against distance 0.0-1.0 for DDL-COS, DDL-EU, DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, STP-BS and STP-SL]

Figure A.1.28: Efficacy of quiet and same speaker evaluation
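
The efficacy figures above and below all plot per-method metrics against the decision distance. The exact definitions of accuracy_as_first, accuracy_as_some_point and accuracy_as_majority are those given in the main text; the sketch below only illustrates the general threshold-sweep pattern behind curves of this kind, using a generic nearest-neighbor decision rule. All names and the decision rule are assumptions for illustration, not the thesis implementation:

```python
import numpy as np

def sweep_distance_threshold(distances, labels, query_labels, thresholds):
    """Generic threshold sweep for a distance-based keyword spotter.

    distances:    (n_queries, n_enrolled) pairwise distance matrix
    labels:       (n_enrolled,) keyword label of each enrolled example
    query_labels: (n_queries,) true keyword of each query
    Returns accuracy at each threshold: a query counts as correct when its
    nearest enrolled example lies within the threshold AND shares its label.
    """
    nearest = distances.argmin(axis=1)  # index of closest enrolled example
    nearest_dist = distances[np.arange(len(distances)), nearest]
    match = labels[nearest] == query_labels  # nearest neighbor has right label
    return [float(np.mean(match & (nearest_dist <= t))) for t in thresholds]

# Hypothetical usage, sweeping 0.0..1.0 as on the x-axes above:
# acc = sweep_distance_threshold(D, enrolled_labels, true_labels,
#                                np.linspace(0.0, 1.0, 21))
```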

[Figure: four panels (accuracy_as_first, accuracy_as_some_point, accuracy_as_majority, efficacy), each plotting values 0-100 against distance 0.0-1.0 for DDL-COS, DDL-EU, DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, STP-BS and STP-SL]

Figure A.1.29: Efficacy of noisy and same speaker evaluation

[Figure: four panels (accuracy_as_first, accuracy_as_some_point, accuracy_as_majority, efficacy), each plotting values 0-100 against distance 0.0-1.0 for DDL-COS, DDL-EU, DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, STP-BS and STP-SL]

Figure A.1.30: Efficacy of quiet and different speaker evaluation

[Figure: four panels (accuracy_as_first, accuracy_as_some_point, accuracy_as_majority, efficacy), each plotting values 0-100 against distance 0.0-1.0 for DDL-COS, DDL-EU, DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, STP-BS and STP-SL]

Figure A.1.31: Efficacy of noisy and different speaker evaluation

TRITA-EECS-EX-2021:255

www.kth.se