DEGREE PROJECT IN TECHNOLOGY, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Query By Example Keyword Spotting

KTH Thesis Report

Jonas Valfridsson

KTH ROYAL INSTITUTE OF TECHNOLOGY
ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Authors
Jonas Sunde Valfridsson <[email protected]> <[email protected]>
Information and Communication Technology
KTH Royal Institute of Technology

Place for Project
Stockholm, Sweden

Examiner
Sten Ternström
KTH Royal Institute of Technology
Supervisor
Jonas Beskow
KTH Royal Institute of Technology
Abstract
Voice user interfaces have been growing in popularity, and with them an interest in
open vocabulary keyword spotting. In this thesis we focus on one particular approach
to open vocabulary keyword spotting: query by example keyword spotting. Three
types of query by example keyword spotting approaches are described and evaluated:
sequence distances, speech to phonemes and deep distance learning. Evaluation is
done on a series of custom tasks designed to measure a variety of aspects. The Google
Speech Commands benchmark is also used for evaluation, to make the results more
comparable to existing work. From the results, the deep distance learning approach
seems most promising in most environments, except when memory is very constrained,
in which case sequence distances might be considered. The speech-to-phonemes
methods fall short in the usability evaluation.
Keywords
Keyword Spotting, Automatic Speech Recognition, ASR, Query By Example, Deep
Distance Learning, Dynamic Time Warping, Few-Shot Learning
Abstract
Voice user interfaces have grown in popularity, and with them an interest in
open vocabulary keyword spotting. In this thesis we focus on a specific form of
open vocabulary keyword spotting, so-called query by example keyword spotting.
Three types of query by example keyword spotting methods are described and
evaluated: sequence distances, speech to phonemes and deep distance learning.
Evaluation is done on constructed tasks designed to measure a variety of aspects
of the methods. The Google Speech Commands data is also used for the evaluation,
to make it more comparable to existing work. The results show that deep distance
learning seems most promising, except in environments where resources are very
limited; there, sequence distances may be of interest. The speech-to-phonemes
methods show shortcomings in the usability evaluation.

Keywords

Keyword spotting, automatic speech recognition, few-shot learning
Acknowledgements
I am thankful to friends and family for donating their voices to my thesis. Finally, thanks
to my supervisor and examiner for enabling this project.
Acronyms
KWS Keyword Spotting
QbE Query by Example
QbEKWS Query by Example Keyword Spotting
ASR Automatic Speech Recognition
HMM Hidden Markov Model
SOTA State of the Art
DTW Dynamic Time Warping
CTC Connectionist Temporal Classification
STP Speech to Phonemes
IPA International Phonetic Alphabet
MFCC Mel-frequency cepstrum coefficients
OVKS Open Vocabulary Keyword Spotting
Contents
1 Introduction
    1.1 Ethics
    1.2 Problem
    1.3 Related Works
    1.4 Constraints

2 Background
    2.1 Audio
    2.2 Query by Example Keyword Spotting
        2.2.1 Continuous Speech
    2.3 Speech Representations
        2.3.1 Text
        2.3.2 Phonemes
        2.3.3 Audio
        2.3.4 Spectral Representations
        2.3.5 Learned
    2.4 Historical Development
    2.5 Definitions

3 Methods
    3.1 Sequence Distances
        3.1.1 DTW on MFCC
        3.1.2 Ghost DTW on MFCC
    3.2 Speech To Phonemes
        3.2.1 Sequence Distance with Beam Search
        3.2.2 Sequence Distance on CTC Posteriograms
        3.2.3 Example Likelihood
        3.2.4 Sample Likelihood
    3.3 Deep Distance Learning

4 Experiments
    4.1 Data
        4.1.1 LibriSpeech dataset
        4.1.2 LibriSpeech-Phonemes dataset
        4.1.3 Google Speech Commands dataset
        4.1.4 LibriWords dataset
        4.1.5 LibriTriplets dataset
        4.1.6 RS dataset
        4.1.7 Usability dataset
        4.1.8 Post Processing
    4.2 Methods
        4.2.1 Speech Distances
        4.2.2 Speech to Phonemes
        4.2.3 Deep Distance Learning
    4.3 Metrics
        4.3.1 Efficacy Metrics
        4.3.2 Resource Metrics
        4.3.3 Usability Metrics
    4.4 Evaluation
        4.4.1 Realistic
        4.4.2 Google Speech Commands
        4.4.3 Latency
        4.4.4 Usability

5 Result
    5.1 Speech Distances
    5.2 Speech to Phonemes
    5.3 Deep Distance Learning
    5.4 Comparison
        5.4.1 Accuracy
        5.4.2 Resources
        5.4.3 Usability

6 Conclusions
    6.1 Sequence Distances
    6.2 Speech To Phonemes
    6.3 Deep Distance Learning
    6.4 What's best
    6.5 Future Work
        6.5.1 DDL Extensions
        6.5.2 Usability of STP
        6.5.3 Heuristic
    6.6 Final Words

References
Chapter 1
Introduction
Voice user interfaces [12] have been growing in popularity for the past decades, one
popular contemporary example being voice assistants [24]. Keyword spotting, the ability
to detect when a word has been uttered, is a central function of all voice assistants.
Assistants commonly use it to wake up, a famous example being the 'Hey Mycroft'
wake word for the Mycroft assistant.
Personalization [39] is a coveted aspect of user-facing technologies, and assistants are
no exception. One aspect of personalization for assistants is naming them. To
accommodate naming one needs an Open Vocabulary Keyword Spotting (OVKS)
system. Such a system would also be a step towards assistants that can function in
local contexts, such as lowering or raising the volume on a speaker, without the user
having to explicitly call on the assistant by name. Query by Example Keyword Spotting
(QbEKWS) is an approach to OVKS that is explored in this thesis. It attempts to
replicate how a human with a keen memory might memorize a new word from just
hearing it a few times. A viable solution to QbEKWS enables a natural voice interface
whereby users, through voice only, could teach their assistant new words.
This is one of the motivators of this thesis: to find a viable solution that can be used
with the Friday voice assistant [20]. To that end, the thesis explores three classes of
methods for QbEKWS and evaluates them in a variety of settings where a
voice assistant might be used.
1.1 Ethics
Ethical issues related to voice assistants are plentiful. They concern the problems of
privacy, owning your own data, and so on. These problems arise from the fact that most
commercial voice assistants make use of cloud processing to deliver a good experience;
having a fully local assistant comes with many limitations, so most commercial entities
have opted for a cloud solution. In the case of the solutions proposed in this thesis it
might be argued that the situation is aggravated, because to teach the assistant one must
provide audio fingerprints of the words one wants the assistant to learn. However, if
these fingerprints can be stored locally on-device, then this does not differ at all from
how the situation looks today. Furthermore, the solutions presented in this thesis
are intended to function on a fully offline assistant, thereby eliminating the privacy
concerns associated with sending one's voice to the cloud.
1.2 Problem
Given a few audio recordings of some utterance, the system should be able to use them
to recognize if the same utterance is uttered given new audio.
More precisely, define an audio recording belonging to utterance i as R^N_i, an
N-dimensional vector of real numbers. Denote the distribution of utterance i on R^N
as U_i. The problem is: given M samples of U_i, where M is small (M < 5), construct a
mapping M : R^N → [0, ∞). The mapping should have the property that if a sample
R^N_j from some U_j is mapped to a low value, then it is highly likely that j = i; and if
it is mapped to a high value, it is highly likely that j ≠ i.
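The mapping above can be pictured as a distance-to-examples score that is thresholded. The sketch below is illustrative only, not the thesis's method: the function names, the Euclidean distance and the threshold value are assumptions, and the concrete distances are the subject of chapter 3.

```python
import numpy as np

def detect(sample, examples, distance, threshold):
    """Generic QbE-KWS decision rule (illustrative sketch): map the
    sample to a score via its smallest distance to the stored
    examples, then threshold the score."""
    score = min(distance(sample, ex) for ex in examples)
    return score < threshold

# Hypothetical 2-dimensional feature vectors standing in for audio.
euclidean = lambda a, b: float(np.linalg.norm(a - b))
examples = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]

hit = detect(np.array([0.05, 0.95]), examples, euclidean, threshold=0.2)
miss = detect(np.array([1.0, 0.0]), examples, euclidean, threshold=0.2)
```

In this framing the quality of the whole system hinges on the distance function, which is what the methods evaluated later differ in.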
1.3 Related Works
Previous work has been reported on all of the methods evaluated in this thesis. For the
sequence distances (see section 3.1), previous works have focused on the combination of
MFCCs and the DTW algorithm [2] [15] [10]. For the speech to phonemes methods (see
section 3.2), previous works have focused on what is referred to in that section as
Sample Likelihood [36] [8] [29]. For the deep distance learning methods (see section
3.3), previous works have focused on the same method evaluated in this thesis, with
some differing details [26] [59]. Beyond the work evaluated in this thesis, authors have
tried applying meta learning to the problem [11], as well as neural network architectures
based on detection filters [7].
1.4 Constraints
A solution to the problem is intended to be used on relatively low-resource devices.
In particular, the solution should run on a Raspberry Pi (3B+) [27] on audio recorded
in real time from some microphone. The Raspberry Pi has a relatively large memory,
so the most constraining factor is latency. A proxy for the latency requirement of
the Raspberry Pi will be used: a viable method must have an inference latency
lower than 250 ms when running on an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz.
This is the CPU used by the author for evaluations of the methods; it has been
chosen as a proxy because it has worked well historically in the author's previous
projects. The inference latency is explained in section 4.3.2.
Chapter 2
Background
The essential components for understanding QbEKWS as presented in this thesis
are: an understanding of audio and how we represent it, a good understanding of the
problem and of the general solution framework, and an understanding of different kinds
of speech representations. These are presented in this background chapter; in addition,
a history of the development of the methods is presented, which aims to motivate why
the particular methods in this thesis are of interest. Finally, definitions of the core
concepts used by the methods are presented at the end.
2.1 Audio
Starting with the most basic component: sound can be fully described by a collection of
analog waves, one for each source of audio. This thesis concerns itself only with single-
source, or mono, audio. Furthermore, the thesis focuses on digital processing of
audio. Mono audio is digitised by discretizing the analog wave at a given
discretization rate (sample rate).
Figure 2.1.1: An analog signal and two digital representations with different sample rates.
The discretization rate is important: the Nyquist theorem tells us that we can only
accurately represent frequencies up to half of the discretization rate [33]. We must
have a sampling rate that captures the frequencies produced by the human voice (around
200 to 4000 Hz) [17] [42]. At the same time it is important not to choose too high a
discretization rate, to mitigate the curse of dimensionality [32]. In this thesis, audio
discretized with a sample rate of 8000 Hz is used to address the Keyword Spotting
(KWS) problem.
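As a minimal sketch of the discretization argument (the tone frequency is an illustrative value, not from the thesis): a tone below the 4000 Hz Nyquist limit of an 8000 Hz sample rate survives discretization, and its frequency can be read back off the spectrum.

```python
import numpy as np

SAMPLE_RATE = 8000  # Hz, the rate used in this thesis
DURATION = 1.0      # seconds

# Discretize a 440 Hz "analog" tone, a stand-in for a voiced
# frequency component well below the 4000 Hz Nyquist limit.
t = np.arange(0, DURATION, 1.0 / SAMPLE_RATE)
signal = np.sin(2 * np.pi * 440 * t)

# Because 440 Hz < SAMPLE_RATE / 2, the discrete signal still carries
# the tone: the FFT peak lands on 440 Hz.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / SAMPLE_RATE)
peak_hz = freqs[int(np.argmax(spectrum))]
```

A tone above 4000 Hz would instead alias onto a lower frequency, which is why the sample rate must cover the frequency band of interest.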
2.2 Query by Example Keyword Spotting
KWS is the problem of recognizing if audio contains an utterance of a keyword.
Figure 2.2.1: Example utterance of ’Query by Example’
Query By Examples (QbE) in the KWS setting is the problem of developing a method
that can, from one or a few recordings of a keyword, learn to recognize utterances
of that keyword in speech. In this thesis the problem is referred to as the QbEKWS
problem.
Figure 2.2.2: QbEKWS example
Recognition is done by comparing representations of speech to determine if a known
utterance has been uttered. For example, in fig 2.2.2, three utterances of 'Cookie' and
three utterances of 'Donut' are provided. The task is then to predict whether the utterance
marked with a question mark belongs to either class, or to neither. In this example it
belongs to neither: it is an utterance of 'cranberry'. The following section 2.3 presents
common ways of representing keywords that can be stored and used for comparison. In
this thesis the provided recordings will be referred to as examples and the new recording
will be referred to as the sample.
2.2.1 Continuous Speech
To apply QbEKWS to continuous speech we chunk it to extract samples. There are
multiple ways of doing this. For example, assume we have a ten second continuous
signal as exemplified in figure 2.2.3
Figure 2.2.3: Ten second signal with two utterances
One way is to chunk it with a sliding window, using a fixed window size and a fixed
window stride.
Figure 2.2.4: Inference on continuous audio with a sliding window
This is the approach chosen here for the 'realistic scenario' evaluation (section 4.4.1);
the fixed window size and stride are given in that section. However, a smarter way of
doing this is to use voice activity detection [53].
Figure 2.2.5: Inference on continuous audio with voice activity detection
Using voice activity detection enables activating the inference model only when it is
needed, and potentially also with better window alignments. In the evaluations of this
thesis no voice activity detection is used, because with a sufficiently small stride it
yields no better alignments. Furthermore, voice activity detection can hide a method's
poor false-positive rate, something we wish to evaluate in this thesis.
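The fixed-size sliding window described above can be sketched as follows (the window size and stride here are illustrative values, not the ones used in section 4.4.1):

```python
import numpy as np

def sliding_windows(audio: np.ndarray, window_size: int, stride: int):
    """Chunk a continuous signal into fixed-size, overlapping windows;
    each window becomes one sample for the QbE-KWS method."""
    starts = range(0, len(audio) - window_size + 1, stride)
    return [audio[s:s + window_size] for s in starts]

# One second of (dummy) 8 kHz audio, 400 ms windows, 100 ms stride.
audio = np.zeros(8000)
chunks = sliding_windows(audio, window_size=3200, stride=800)
```

A smaller stride gives better alignment with the true utterance boundaries at the cost of more inference calls per second.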
2.3 Speech Representations
In the QbEKWS setting, representations are stored for each keyword to enable the
detection of the utterance they represent. The representations of the examples
are compared to that of the sample, and the results of the comparison constitute
the basis for detection. This is discussed in more detail in chapter 3. The
representations covered in this chapter are text, phonemes, audio, spectral and learned
representations.
2.3.1 Text
A text representation of a keyword is its text form, e.g. the text representation of an
utterance of 'Hello' is 'Hello'.
Figure 2.3.1: Hello utterance with ’Hello’ text representation
Using text representations of keywords is a natural way of adapting speech-to-text
technologies to keyword spotting. By converting speech to text and then using string
matching algorithms one can build a KWS solution. One early open-source program
used for KWS did exactly this: PocketSphinx [25], developed in 2006, is a speech-
to-text engine, but it supported keyword spotting through a keyword text search.
PocketSphinx was used by the open-source voice assistant 'Mycroft' for KWS for
many years.
The Bacon Beer-Can Problem

The sound of text depends on a great many things. For example,

Figure 2.3.2: Bacon & Beer-Can; the figure implies they sound the same

different texts can sound exactly the same, and the same text can sound very
different, depending on many factors such as dialect. By representing the utterance
as text, the system needs to make assumptions about all of these factors that can affect
sound, which places a heavy burden on the speech-to-text system. In KWS the text
itself is not important; although it is convenient to provide keywords in text form,
other representations are worth considering to alleviate the Bacon Beer-Can problem.
2.3.2 Phonemes
A phoneme is a unit of sound [57]. A phoneme representation of a keyword is the
sequence of phonemes that mimics the sound of the keyword as spoken. The phoneme
representation of a keyword is not unique; it depends on pronunciation. In fact there
are also different phonetic notations, one standard being the International Phonetic
Alphabet (IPA) notation [4].
Figure 2.3.3: Example of ’Hello’ as phonemes using different accents
Using phoneme representations of keywords makes sense on some intuitive level: if
things sound the same then they are the same, at least if they come from similar
speakers. This is the basis of many KWS systems [36] [8] [29]. In speech-to-text,
audio is typically represented as phonemes, or distributions over phonemes, before being
decoded into text. These KWS systems can therefore leverage many years of research
in Automatic Speech Recognition (ASR) to produce the sequences of phonemes from
audio (this is leveraged by the methods in section 3.2).
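To illustrate how two phoneme sequences can be compared (the concrete methods appear in section 3.2), a simple edit distance over phoneme symbols works as a sketch; the pronunciations below are hypothetical, ASCII-rendered stand-ins for IPA symbols.

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences: the number of
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        cur = [i]
        for j, pb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution
        prev = cur
    return prev[-1]

# Two hypothetical pronunciations of 'hello' differing in one vowel.
hello_a = ["h", "@", "l", "oU"]
hello_b = ["h", "E", "l", "oU"]
```

A small edit distance between the sample's phonemes and an example's phonemes then suggests the same keyword was spoken.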
2.3.3 Audio
Audio is the most informative representation of an example utterance. However,
using this representation presents a different problem: how do we compare audio
signals?
Figure 2.3.4: Example of ’Hello’ as audio using identity representation
The same keyword uttered from the same speaker can look vastly different in two
different audio signals:
Figure 2.3.5: Example 'Hello' and a stretched + shifted 'Heeeeeello'
With text and phonemes, the main problem is to produce the correct representation
from speech, while the comparison itself is relatively simple. When using raw audio
as the representation, the entire problem is shifted into the comparison. Comparing
raw audio is, at the time of this writing, impractical. However, representing audio as
spectral features enables a simple form of QbEKWS based on sequence distances (see
section 3.1).
2.3.4 Spectral Representations
Representing audio as spectral features has been hugely successful in many audio
applications. Spectral representations describe audio in terms of its frequency
components. Typical spectral representations are log-mel spectrograms and Mel-
frequency cepstrum coefficients (MFCC) features [56].

Figure 2.3.6: Example of 'Nacho Tallrik' as waveform, log-mel spectrogram and MFCC

Spectral representations typically reduce the dimensionality of the audio while
retaining features important for speech recognition. Combining spectral
features with sequence distances, see sec 3.1, has shown promising results [2] [15] [10]
for simple KWS.
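As a rough sketch of how spectral features arise: a framed FFT yields a log-magnitude spectrogram. This is illustrative only; a real pipeline would additionally apply a mel filter bank (and a DCT for MFCCs), and the frame length and hop here are assumed values.

```python
import numpy as np

def log_spectrogram(audio, frame_len=256, hop=128):
    """Log-magnitude spectrogram via a windowed, framed FFT (STFT).
    Each row is one time frame, each column one frequency bin."""
    window = np.hanning(frame_len)
    frames = [audio[s:s + frame_len] * window
              for s in range(0, len(audio) - frame_len + 1, hop)]
    mags = np.abs(np.fft.rfft(np.array(frames), axis=1))
    return np.log(mags + 1e-6)  # small offset avoids log(0)

# One second of a 440 Hz tone at the 8 kHz sample rate used here.
audio = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
spec = log_spectrogram(audio)  # shape: (num_frames, frame_len // 2 + 1)
```

The resulting sequence of short spectral frames is exactly the kind of feature sequence that the sequence distances in section 3.1 compare.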
2.3.5 Learned
Finally, learned representations, as their name indicates, are representations
learned for a specific purpose. For example, in the case of QbEKWS the
representations might have been learned in a manner such that audio containing the
same words lies close in representation space, according to some definition of distance.
Here, learned representations are constructed through supervised learning: many
examples of audio clips that should be close in space, and clips that should not be
close, are provided, and from these a mapping from audio to a space with
representations of good discriminative properties is learned. Section 3.3 presents this
approach as applied in this thesis.
Figure 2.3.7: Example of 512 dimensional learned representation
In the example above the audio is 16000-dimensional while the representation is
only 512-dimensional, amounting to a significant compression. Other existing works
making use of learned representations for KWS are Huh et al. (2020) [26] and Vygon
et al. [59].
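One common objective for learning such representations is the triplet loss. The sketch below is illustrative, with toy embeddings, and is not necessarily the exact loss used by the method in section 3.3.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull embeddings of the same word together and push embeddings
    of different words at least `margin` further apart."""
    d_pos = np.linalg.norm(anchor - positive)   # same-word distance
    d_neg = np.linalg.norm(anchor - negative)   # different-word distance
    return max(0.0, d_pos - d_neg + margin)

# Toy 3-dimensional "learned" embeddings (a real system would use
# e.g. 512 dimensions, as in the figure above).
anchor = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])   # same word, nearby
negative = np.array([0.0, 0.0, 1.0])   # different word, far away
loss = triplet_loss(anchor, positive, negative)
```

Minimizing this loss over many triplets is what shapes the embedding space so that a simple distance can discriminate keywords.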
2.4 Historical Development
In the beginning there was Audrey [14]. Developed at Bell Labs in 1952, the circuit
could recognize spoken digits from telephone-quality recordings. Audrey was an early
example of an ASR classifier. It had audio input and 10 output nodes for the digits
0 to 9; the circuit worked by extracting two frequency features to form a plane, and the
classifier then recognized shapes on this plane as specific digits. Audrey achieved an
impressive 96% accuracy on a spoken digit recognition problem when it was tuned
for a specific user. Then, in the 1960s, came a QbE solution: Denes et al. [16] built
a computer program that calculated spectral features from audio and compared them
to stored spectral features of previously recorded audio clips. This QbE approach
achieved 100% accuracy on utterances from the speaker the examples came from,
and on average 94% accuracy for other speakers. It is worth interjecting that these
early experiments were conducted in close to ideal environments. The 1970s brought
the Viterbi algorithm [18], commonly used for decoding in acoustic models, and the
Dynamic Time Warping (DTW) algorithm [49]. The 1980s brought Markov models
[5], most notably the application of a Hidden Markov Model (HMM) [48] for voice
activation. In the 1990s came neural networks for voice activation [40] [41] [43].
In 2006 the Connectionist Temporal Classification (CTC) loss was introduced [23].
This enabled training neural networks as acoustic models without needing segmented
labels, something previously limited to generative models such as the HMM-GMM.
With the rise of deep learning, the CTC loss contributed to an explosive development
of neural networks in ASR, which overtook the previous State of the Art (SOTA) based
on HMM models. From these developments grew the Speech to Phonemes (STP) methods
[36] [8] [29], see sec 3.2. They are a natural application of decades of ASR research
to the QbEKWS problem. Briefly, these methods compare the way keywords
'sound' by comparing phoneme sequences. Using sequence distances, see sec 3.1, is
another approach; a good example is DTW from 1979 [49], a dynamic programming
[6] approach used to align sequences that can also be used directly as a distance
between two sequences. The DTW distance has good properties for simple QbEKWS
problems, as exemplified by its recent use for digit recognition in Assamese [15] and
KWS in Tamil [10]. The QbEKWS methods introduced so far, together with many
others, can all be categorized as few-shot learning methods. The few-shot learning field
consists of methods that use one or a few examples to generalize, which is
precisely the problem of QbEKWS. Developments in the few-shot learning field [54]
are what inspired the deep distance learning method in this thesis (sec 3.3).
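The DTW distance mentioned above can be sketched with the classic dynamic program (a minimal version, without the band constraints often used in practice to speed it up):

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic Time Warping distance between two feature sequences
    (one row per frame), computed with the classic dynamic program."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])  # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

# A sequence and a time-stretched copy of it align perfectly.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])
```

The alignment is what makes DTW tolerant to the stretching and shifting of utterances illustrated in figure 2.3.5, which a plain frame-by-frame distance is not.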
2.5 Definitions
This section defines the concepts used throughout the thesis. It requires no thorough
reading, but can be returned to in case the reader wants a more precise definition of the
central terms.
Definition 1 (Distance). A distance on a set X is a function D : X × X → [0, ∞) with
the properties
i Symmetry D(x, y) = D(y, x)
ii Reflexivity D(x, x) = 0
Definition 2 (Pseudo Metric). A pseudo metric on a set X is a distance, def 1,
PM : X × X → [0, ∞) with the additional property
i Triangle Inequality PM(x, z) ≤ PM(x, y) + PM(y, z)
Definition 3 (Metric). A metric on a set X is a pseudo metric, def 2, M : X × X → [0, ∞) with the additional property
i Uniqueness M(x, y) = 0 ⇐⇒ x = y
Remark 1. These are all inclusions: M ⊂ PM ⊂ D.
These definitions of distances, pseudo metrics and metrics are taken from course notes
on topological data analysis [52]. The rest of the definitions are introduced to give a
precise meaning to the terms used in subsequent sections.
Example 2.1. Euclidean distance is a metric: M(x, y) = √((x − y) · (x − y)). Cosine
distance, D(x, y) = 1 − (x · y)/(||x|| ||y||), is a distance but not a metric because it breaks
the triangle inequality. Chebyshev distance, D(x, y) = max_i |x_i − y_i|, is a metric.
Manhattan distance is a metric: M(x, y) = Σ_i |x_i − y_i|.
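As a sketch, the distances from example 2.1 can be written in a few lines of NumPy. This is illustrative code, not part of the thesis implementation:

```python
import numpy as np

def euclidean(x, y):
    # M(x, y) = sqrt((x - y) . (x - y))
    d = x - y
    return np.sqrt(d @ d)

def cosine(x, y):
    # D(x, y) = 1 - (x . y) / (||x|| ||y||); a distance but not a metric
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def chebyshev(x, y):
    # D(x, y) = max_i |x_i - y_i|
    return np.max(np.abs(x - y))

def manhattan(x, y):
    # M(x, y) = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(x, y))  # 5.0
print(chebyshev(x, y))  # 4.0
print(manhattan(x, y))  # 7.0
```

All four satisfy symmetry and reflexivity (def 1); the cosine distance additionally fails the triangle inequality, so it stops at def 1.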
Distances with particular invariances and discriminative features are sought for
QbE KWS.
Definition 4 (Invariant). A Distance is invariant to transformations g and f if
D(g(x), f(y)) = D(x, y)
Example 2.2. The distance between keywords should remain the same even if we
shift an utterance in time
Figure 2.5.1: Example of distance invariant to shift on the utterance of ’Hej’
Designing distances with invariances is difficult. One common approach to alleviate
this is to re-represent the set on which the distance is defined.
Definition 5 (Representation). A representation is an x ∈ X mapped from some y ∈ Y
using some R : Y → X
Remark 2. Typically x retains information from y that is relevant to some task while
discarding other information or making it less accessible.
Example 2.3. As described in sec 2.4, historical development, a paper from 1960 [16]
represented audio using spectral features and compared those features to address the
QbE KWS task.
Chapter 3
Methods
Common to all methods in this chapter is that they calculate a distance between two
representations of audio. Once new audio is recorded and re-represented, its distance
is calculated with respect to all known examples. Given these distances, a decision
still has to be made whether the audio is a known keyword, and if so which, or if it is
unknown. There are many ways to approach this problem: using k-nearest neighbours
[46] or extending our distances to finite subsets [13] are just two examples. Here, for
simplicity, the minimum distance between elements of subsets is used as an extension
of our distances. The subsets consist of all examples of the individual keywords, and
the newly recorded keyword is a set consisting only of itself.
SD(A, B) = min{D(a, b) | a ∈ A, b ∈ B} (3.1)
Using this, the keyword which has the smallest subset distance to the new recording
is considered the prediction iff the subset distance is below some threshold T. This
prediction heuristic is represented in the figures of this chapter as a black box. It is
possible that different heuristics could have worked better for some methods, but this
is left for future work.
In other words, the keyword with the example that has the lowest distance to the
sample is considered the prediction if it is ’close enough’, that is, closer in distance
than T.
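The prediction heuristic above can be sketched as follows; `examples`, `D`, the keyword names and the scalar ’representations’ are hypothetical stand-ins for illustration:

```python
def subset_distance(A, B, D):
    # SD(A, B) = min{D(a, b) | a in A, b in B}  (eq 3.1)
    return min(D(a, b) for a in A for b in B)

def predict(sample, examples, D, T):
    """examples: dict mapping keyword name -> list of stored representations.
    Returns the keyword with the smallest subset distance to the sample,
    or 'unknown' if that distance is not below the threshold T."""
    best_kw, best_d = None, float("inf")
    for kw, reps in examples.items():
        d = subset_distance(reps, [sample], D)
        if d < best_d:
            best_kw, best_d = kw, d
    return best_kw if best_d < T else "unknown"

# Toy usage with scalar 'representations' and absolute difference as D
D = lambda a, b: abs(a - b)
examples = {"guacamole": [1.0, 1.2], "nacho": [5.0, 5.5]}
print(predict(1.1, examples, D, T=0.5))  # guacamole
print(predict(9.0, examples, D, T=0.5))  # unknown
```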
3.1 Sequence Distances
A sequence distance consists of an alignment and a distance. An alignment is a
mapping between two sequences such that each element in each sequence associates
with at least one element in the other.
Figure 3.1.1: perfect alignment between two color sequences
We do not consider all possible alignments, only classes of alignments that have
the desired invariance properties. The class of alignments considered in this thesis
is calculated using the Dynamic Time Warping (DTW) algorithm [51]. The DTW
alignment has the following constraints:
1. The first index of the first sequence must match the first index of the second
sequence.
2. The last index of the first sequence must match the last index of the second
sequence.
3. The mapping is non-decreasing.
The final point, the non-decreasing mapping, imposes the constraint that no index may be
mapped to an opposing index smaller than any previous index’s mapped index.
Figure 3.1.1 shows a valid alignment under these constraints. However, figure 3.1.2 is an
example of an invalid alignment.
Figure 3.1.2: invalid alignment between two color sequences
It is invalid because the mapping is not non-decreasing: two red boxes map to
a red box with an index lower than the one a box prior to them maps to. For some
sequence pairs, such as the one in figure 3.1.2, there are no perfect alignments, but we can create
imperfect ones.
Figure 3.1.3: imperfect alignment between two color sequences
The imperfection comes at a cost described by our distance function. Consider for
example the discrete metric on colors.
Figure 3.1.4: A discrete color metric
Using this metric, the cost of the alignment in figure 3.1.3 is 1, since there is one
misalignment. The DTW algorithm finds the alignment with the lowest cost with respect to
some distance in O(MN) time [51], where M is the length of the first sequence and N of the
second.
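A minimal dynamic programming implementation of the DTW cost (O(MN) time) can be sketched as follows; this is illustrative code, not the implementation used in the experiments:

```python
import numpy as np

def dtw_cost(a, b, dist):
    """DTW cost between sequences a (length M) and b (length N).
    Both endpoints must align; the alignment is non-decreasing."""
    M, N = len(a), len(b)
    C = np.full((M + 1, N + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            step = dist(a[i - 1], b[j - 1])
            C[i, j] = step + min(C[i - 1, j],      # a advances
                                 C[i, j - 1],      # b advances
                                 C[i - 1, j - 1])  # both advance
    return C[M, N]

# Discrete color metric as in figure 3.1.4: 0 if equal, 1 otherwise
discrete = lambda x, y: 0.0 if x == y else 1.0
print(dtw_cost("RGB", "RRGGB", discrete))  # 0.0: a perfect alignment exists
print(dtw_cost("RGB", "RBB", discrete))    # 1.0: one misalignment
```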
The sequence distance methods evaluated here will use the DTW algorithm with
different distance functions on audio representations to address the QbE KWS
problem. A slightly modified version of the DTW algorithm will also be evaluated.
The modification enables finding the optimal sub-alignment of one sequence into
another. This is solved by introducing ghosts at the beginning and end of one of the
sequences.
Figure 3.1.5: Property of ghost nodes
Using ghosts, a perfect alignment for figure 3.1.3 exists.
Figure 3.1.6: Perfect alignment using ghost
Using ghosts is motivated by the hypothesis that being able to match against a subset
of audio will yield a better classifier. An example of the reasoning is given with the
following figure:
Figure 3.1.7: Ghost nodes enable matching onto a subsequence
In figure 3.1.7 the ’Hey’ part of the utterance is only noise if the goal is to find all
utterances of ’Friday’. Using ghost nodes enables DTW to completely ignore the ’Hey’
part, as exemplified in the following figure.
Figure 3.1.8: Example Ghost DTW compared to DTW
In the figure above both DTW variants use the Manhattan distance (see example 2.1).
Ghost DTW, as described in this thesis, is essentially ’open-end’ combined with
’open-begin’ DTW as referred to in the R DTW package [21]. The following
subsections give an overview of the sequence distance methods evaluated. For the
specific sequence distance experiments, see sec 4.2.1.
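The ghost mechanism described above can be sketched as an open-begin/open-end variant of DTW: ghost nodes match anything at zero cost, so the alignment may skip a prefix and a suffix of one sequence. This is a hypothetical sketch, not the custom implementation used in the experiments:

```python
import numpy as np

def ghost_dtw_cost(query, seq, dist):
    """DTW cost of aligning all of `query` onto the best sub-sequence of
    `seq`. Ghost nodes before and after `seq` match anything at zero cost,
    which makes the first row of the cumulative-cost matrix free (open
    begin) and lets us take the minimum over the last row (open end)."""
    M, N = len(query), len(seq)
    C = np.full((M + 1, N + 1), np.inf)
    C[0, :] = 0.0  # open begin: skipping a prefix of seq costs nothing
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            step = dist(query[i - 1], seq[j - 1])
            C[i, j] = step + min(C[i - 1, j], C[i, j - 1], C[i - 1, j - 1])
    return C[M, 1:].min()  # open end: ignore any suffix of seq

discrete = lambda x, y: 0.0 if x == y else 1.0
# 'GB' occurs inside 'RGBR': plain DTW would pay for the R's, ghost DTW does not
print(ghost_dtw_cost("GB", "RGBR", discrete))  # 0.0
```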
3.1.1 DTW on MFCC
MFCC is a spectral representation of audio, see section 2.3.4. Applying DTW
directly on MFCCs has been done previously in [2] [15] [10]. The way MFCC
representations and the DTW distance are used to solve QbE KWS is summarized in the
following figure.
Figure 3.1.9: Using DTW and MFCC for QbE KWS
In figure 3.1.9 the system has learned to recognize two kinds of keywords. The top
two (green) are utterances of ’guacamole’ and the bottom two (orange) of ’nacho’.
Given a new recording, for example from a microphone, the DTW distance between the
new recording’s representation and all stored representations is calculated. In figure
3.1.9 the average Manhattan distance (mean absolute error) is used as the distance with DTW.
Given the collection of distances, the classification heuristic, see chapter 3, is applied.
Provided that T > 109, the heuristic would predict ’guacamole’, which in this example
is correct.
3.1.2 Ghost DTW on MFCC
This method is exactly the same as the one described in section 3.1.1, with the exception
that the DTW now uses ghost nodes on the stored representations.
3.2 Speech To Phonemes
Speech to phonemes models use a mapping (S) from speech to a distribution over
phonemes. In this thesis S is a deep recurrent neural network, see sec 4.2.2 for
architecture details, trained to predict posteriograms of phonemes given MFCC inputs.
It was trained on the LibriSpeechPhonemes dataset, sec 4.1.2, with the CTC loss [23]. The
posteriograms in conjunction with a distance were used to address the QbE KWS
problem. An overview of the approaches is given below, while the specific speech to
phonemes experiments are given in section 4.2.2.
3.2.1 Sequence Distance with Beam Search
Given the phoneme posteriograms, beam search (B) [50] is used to find a likely
phoneme sequence. To find the most likely sequence of phonemes from audio using
a model trained with CTC, one has to perform an operation with O(T^L) complexity,
where L is the sequence length and T the number of tokens [23]. Finding the most
likely sequence is impractical since L = 66 (assuming a 30 ms receptive field of the
MFCCs with 2 seconds of audio) and T = 41 (using the LibriSpeech lexicon, described
in section 4.1.2). Therefore, a guided search is employed to find an approximate
optimum; this guided search is beam search [50].
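As a simple stand-in for beam search, the following sketch shows best-path (greedy) CTC decoding, which takes the arg-max token per frame, collapses repeats and removes blanks; beam search explores the same space but keeps the top-k prefixes per frame instead of only the best one. The toy logits are hypothetical:

```python
import numpy as np

def greedy_ctc_decode(logits, blank=0):
    """Best-path decoding of CTC outputs.
    logits: (time, tokens) array of unnormalised scores.
    Returns a token sequence with repeats collapsed and blanks removed."""
    best = logits.argmax(axis=1)
    out, prev = [], None
    for t in best:
        if t != prev and t != blank:
            out.append(int(t))
        prev = t
    return out

# Toy posteriogram over tokens {0: blank, 1, 2}; frames favour 1, 1, blank, 2
logits = np.array([[0.1, 2.0, 0.0],
                   [0.0, 1.5, 0.2],
                   [3.0, 0.1, 0.1],
                   [0.2, 0.1, 2.5]])
print(greedy_ctc_decode(logits))  # [1, 2]
```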
To address QbE KWS with STP and beam search: each stored audio
recording is mapped, using S, to its posteriograms, and a likely phoneme sequence is
estimated using B. Then, when new audio is recorded, a likely phoneme sequence is
calculated using the same procedure, and some sequence distance is applied between all the phoneme
sequence examples and the sample. Finally, the classification heuristic, explained at the
beginning of chapter 3, is applied to the collection of distances.
Figure 3.2.1: Using STP and Beam Search for QbE KWS
In figure 3.2.1 the system has learned two keywords, ’Raining’ and ’Fire’. Given a new
recording it applies the mappings S and B to the audio and then evaluates its distance
against all known sequences. In the example above the sequence distance used is
Ghost DTW with the discrete distance on phonemes, where ’’ has zero cost. Provided
that T > 0 the system would return ’Fire’ as the prediction, which in this example is
the correct one.
In the illustration the posteriograms have softmax applied to them to showcase the
activated phonemes. In reality, though, beam search and all subsequent methods use
the logits of the posteriograms as input, as this is more numerically stable.
3.2.2 Sequence Distance on CTC Posteriograms
Instead of transforming the posteriograms to a likely phoneme sequence, a sequence
distance can be applied directly on the posteriograms predicted with S.
Figure 3.2.2: Using STP and sequence distance on the posteriogram
In figure 3.2.2 the system has learned two keywords, ’Raining’ and ’Fire’. Given a new
recording it uses S to extract the phoneme posteriograms of the audio and then it
calculates the distance against all known posteriograms. In the example Ghost DTW
with the average Manhattan distance is used. Assuming that T > 0.125, it would return
’Raining’ as the prediction, which in this example is the correct one.
3.2.3 Example Likelihood
Given a posteriogram it is possible to estimate the probability that it generates a
phoneme sequence using the CTC forward pass [23]. This method for addressing
QbE KWS maps all the examples into phoneme sequences using the transformation
from sec 3.2.1. Then, given a new sample (for example an audio recording from a mic), it is
first converted into its posteriogram; then, using the CTC forward pass, the likelihoods
that the example sequences were generated from the sample posteriogram are calculated.
These likelihoods are then used as distances. This method for addressing QbE KWS
has previously been tested in DONUT [36] with promising results.
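The CTC forward pass that gives the likelihood of a label sequence given a posteriogram can be sketched as follows. Plain probabilities are used here for readability; a real implementation would work in log space, for the numerical-stability reasons noted in sec 3.2.1. The toy probabilities are hypothetical:

```python
import numpy as np

def ctc_forward(probs, labels, blank=0):
    """P(labels | posteriogram) under CTC.
    probs: (time, tokens) array of per-frame token probabilities.
    labels: the target token sequence (without blanks)."""
    # Interleave blanks: l' = [blank, l1, blank, l2, ..., blank]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transition allowed unless current is blank or a repeated label
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)

# Two frames, tokens {0: blank, 1}; paths emitting [1]: (1,1), (1,-), (-,1)
probs = np.array([[0.4, 0.6],
                  [0.5, 0.5]])
print(ctc_forward(probs, [1]))  # 0.8 = 0.6*0.5 + 0.6*0.5 + 0.4*0.5
```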
Figure 3.2.3: Example of using STP with likelihood of examples for QbEKWS
In figure 3.2.3, L is the negative log-likelihood function of an example given the
posteriogram of the sample and B is the beam search mapping described in sec 3.2.1.
The system has learned two keywords, ’Raining’ and ’Fire’. Provided that T > 2.86 the
system would return ’Raining’ as the prediction, which in this example is the correct
one.
3.2.4 Sample Likelihood
Exactly as in sec 3.2.3, except the sample is conditioned on the example instead of the
other way around.
Figure 3.2.4: Example of using STP with likelihood of sample for QbEKWS
In figure 3.2.4, L is the negative log-likelihood function of a sample given a posteriogram
of an example and B is the beam search mapping described in sec 3.2.1. The system has
learned two keywords, ’Raining’ and ’Fire’. Provided that T > 8.84 the system would
return ’Raining’ as the prediction, which in this example is the correct one.
3.3 Deep distance learning
Inspired by other deep distance learning papers for keyword spotting [59] [26], this
method attempts to learn a representation mapping R that works well for some
distance D directly from data. This works by deciding on some distance, for example
the cosine distance, see ex 2.1, and then learning a representation of the data that together
with the distance has the desired properties. Here this is done by training a Siamese
[9] Convolutional Neural Network [44] (see section 4.2.3 for details) to create a
representation that minimizes a triplet loss.
L(A, P, N) = max(D(R(A), R(P)) − D(R(A), R(N)) + α, 0) (3.2)
In equation 3.2 the Anchor (A), Positive (P) and Negative (N) are audio recordings, R is
the representation mapping, D our distance and α a hyperparameter controlling the
degree of separation we aim to have between the examples. The triplet loss helps
to learn a representation that places positive samples close to the anchor and
negative samples at least a distance α away under the chosen D.
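Equation 3.2 can be sketched directly in NumPy; the identity mapping `R` and the toy vectors below are hypothetical stand-ins for the trained network and real audio representations:

```python
import numpy as np

def cosine_distance(x, y):
    return 1.0 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def triplet_loss(anchor, positive, negative, R, D, alpha=1.0):
    """L(A, P, N) = max(D(R(A), R(P)) - D(R(A), R(N)) + alpha, 0)  (eq 3.2)
    The loss is zero once the negative is at least `alpha` further from
    the anchor than the positive is."""
    return max(D(R(anchor), R(positive)) - D(R(anchor), R(negative)) + alpha, 0.0)

# Identity 'representation' on toy vectors, for illustration only
R = lambda x: x
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.1])   # close to the anchor
n = np.array([-1.0, 0.0])  # far from the anchor
loss = triplet_loss(a, p, n, R, cosine_distance, alpha=1.0)
print(loss)  # 0.0: this triplet is already well separated
```

Swapping the positive and negative in this toy example yields a large positive loss, which is what drives the representation apart during training.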
For specifics on the dataset and how it was trained, see section 4.2.3. Given a
trained mapping R for some distance D, the following figure illustrates how they were
used to address QbE KWS:
Figure 3.3.1: Using DDL for QbE KWS
In the figure above a representation mapping has been learned for the cosine distance,
see ex 2.1. The system has been provided with two examples of ’raining’ (the first two,
green audio signals) and two examples of ’fire’ (the last two, orange signals). At inference
time it maps new audio to the representation and, in this example, takes the cosine
distance between it and all stored representations. Provided that T > 0.25 the system
would predict the audio to be ’raining’, which would be correct in this example. The
representations produced by the mapping in the example are 256-dimensional vectors.
The plots of the representations from this example are provided in higher resolution
in appendix A.1.1 for the interested reader.
Chapter 4
Experiments
This chapter describes how the methods were evaluated and what is needed to
reproduce the results. It begins by introducing the datasets used in the thesis and then
moves on to the post-processing of the data for the models that were trained. Thereafter
a detailed description of the experiments is given, together with the acronyms used to
represent them in the results section. Then the metrics used for evaluation are outlined
and finally the evaluation setting is described.
4.1 Data
This section contains details of the datasets and data processing. All datasets are
available online except the ones created as part of this thesis. The created datasets
will be provided on request.
4.1.1 LibriSpeech dataset
LibriSpeech is a dataset containing audio and text transcriptions [45]. It is derived
from audiobooks and contains ≈ 1000 hours of speech spoken by over 1000 speakers.
Each audio clip in the dataset typically contains about one sentence.
Figure 4.1.1: Sample from the LibriSpeech Dataset
4.1.2 LibriSpeechPhonemes dataset
LibriSpeechPhonemes was derived from the LibriSpeech dataset using the LibriSpeech
phoneme lexicon [34] provided in the documentation of the Montreal Forced Aligner [37].
This dataset is exactly that of LibriSpeech but with phonetic transcriptions instead of
text.
Figure 4.1.2: Sample from the LibriSpeechPhoneme Dataset
The phonemes in figure 4.1.2 are separated by one space and the ’’ symbol signifies a
word boundary.
4.1.3 Google Speech Commands dataset
Google Speech Commands is a dataset containing audio files and text [60]. The dataset
comprises 65,000 audio recordings of thousands of different people uttering one out of
30 short words, such as ’Yes’, ’No’, ’Up’ and ’Down’; see [60] for a full list. It is
a dataset typically used for evaluation of limited vocabulary keyword spotting.
Figure 4.1.3: Sample from the Google Speech Commands Dataset
4.1.4 LibriWords dataset
LibriWords is a dataset created as part of this thesis. It was created by using forced
alignment [37] on the LibriSpeech dataset, sec 4.1.1, and then extracting audio and
labels using the alignments. The resulting dataset consists of ≈ 18,000 unique
English words spoken by 1172 different speakers, with approximately 50 utterances of
each word.
Figure 4.1.4: Sample from the LibriWords Dataset
In the LibriWords dataset, the audio files contain the word provided as the label but also
surrounding sound. For example ’Nature’, as in fig 4.1.4, contains the utterance ’the
Nature’. This is an artifact of how the sound was extracted from the audio files.
4.1.5 LibriTriplets dataset
LibriTriplets is a dataset created as part of this thesis; it is derived from LibriWords, sec
4.1.4. It was constructed by repeating the following process:
1. Pick one out of the 18000 words in LibriWords uniformly at random
2. Sample two random audio files containing the word from (1)
3. Sample a word uniformly at random that is not the word from (1)
4. Sample one random audio file containing the word from (3)
5. Combine (2) and (4) into one triplet.
This process was repeated until 5 000 000 triplets, or 15 000 000 audio files, had been
chosen for training. For an explanation of how the triplets are used, see sec 3.3.
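The construction process above can be sketched as follows; `words_to_files` is a hypothetical mapping from word to its audio files, with file names standing in for audio:

```python
import random

def sample_triplet(words_to_files):
    """One iteration of the LibriTriplets construction: two files of one
    word (anchor, positive) and one file of another word (negative)."""
    words = list(words_to_files)
    w = random.choice(words)                                  # step 1
    anchor, positive = random.sample(words_to_files[w], 2)    # step 2
    other = random.choice([v for v in words if v != w])       # step 3
    negative = random.choice(words_to_files[other])           # step 4
    return anchor, positive, negative                         # step 5

words_to_files = {
    "nature": ["nature_1.wav", "nature_2.wav", "nature_3.wav"],
    "fire":   ["fire_1.wav", "fire_2.wav"],
}
a, p, n = sample_triplet(words_to_files)
print(a, p, n)  # e.g. nature_2.wav nature_1.wav fire_1.wav
```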
Figure 4.1.5: Sample from the LibriTriplets dataset
In figure 4.1.5 blue (first) signifies the anchor, green (second) the positive and red (third)
the negative.
4.1.6 RS dataset
The RS (Realistic Scenario) dataset was created as part of this thesis for evaluation.
The recordings are all from the same microphone, of the author and of friends and
family. It contains long audio recordings, ≈ 1 minute each, and a list of keywords and their
occurrences is provided with each sample. It contains two types of data points: the first
is where keywords are spoken in a quiet environment, meaning all utterances
are keyword speech.
Figure 4.1.6: A Sample of RS dataset quiet environment
The second scenario is an environment where most utterances are non-keyword
speech, with background noise such as a television or music, among other
things.
Figure 4.1.7: A sample from the RS dataset noisy environment
The dataset was constructed with 5 different speakers and it is segmented per speaker,
such that evaluation can consider results on different speakers. It contains 80
utterances of each keyword: ’stop’, ’time’, ’night’, ’morning’, ’alarm’ and ’illuminate’,
bringing it to a total of 480 keyword utterances.
During construction of the RS dataset it was guaranteed that keyword utterances
are at least 5 seconds apart, to avoid issues with overlapping predictions.
4.1.7 Usability dataset
The Usability dataset was created as part of this thesis for usability evaluation, see sec
4.3.3. It contains ten 2-second recordings of each of 3 words of varying length, 30
recordings in total. The words are ’hat’, ’indistinguishable’ and ’morning’. All utterances are
recordings of the author using the same microphone.
Figure 4.1.8: Sample of each of the words from the Usability dataset
4.1.8 Post Processing
For the methods that trained a mapping, see sec 3.2 and 3.3, data augmentation was used
to extend the datasets, increasing the variety for better generalization. Two types of
augmentations were applied, each with some probability p_i where i ∈ {0, 1}, meaning that
with probability ∏_i p_i both augmentations were applied. The following augmentations
(in the order presented) were used on the audio during the training stage of the
methods (U is the uniform distribution):
Background With probability 1.0 the signal was mixed with natural background
noise, such as noise from a cafe, a construction site, rain, doors, fans and
computers, among other sounds.
Gaussian Noise With probability 0.5 Gaussian noise with a noise level of 0.5% was
added to the audio.
The implementations of the augmentations used librosa [38] as well as custom
implementations; all the code is available at [20].
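A sketch of the two augmentations, with NumPy standing in for the librosa and custom implementations used in the thesis; the mixing strategy and signal lengths here are illustrative assumptions:

```python
import numpy as np

def augment(signal, background, p_background=1.0, p_gauss=0.5, rng=None):
    """Apply the two augmentations from sec 4.1.8 in order.
    With probability p_background the signal is mixed with background noise;
    with probability p_gauss Gaussian noise at a 0.5% noise level is added."""
    if rng is None:
        rng = np.random.default_rng()
    out = signal.copy()
    if rng.random() < p_background:
        # Mix in a random slice of the background recording (illustrative mix)
        start = rng.integers(0, len(background) - len(out) + 1)
        out = out + background[start:start + len(out)]
    if rng.random() < p_gauss:
        out = out + 0.005 * rng.standard_normal(len(out))
    return out

rng = np.random.default_rng(0)
signal = np.zeros(16000)                         # 1 s of silence at 16 kHz
background = rng.standard_normal(32000) * 0.1    # 2 s of synthetic 'noise'
augmented = augment(signal, background, rng=rng)
print(augmented.shape)  # (16000,)
```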
4.2 Methods
This section contains details necessary to reproduce the experiments of methods
evaluated.
4.2.1 Speech Distances
All speech distance experiments extracted MFCC features using the librosa python
library [38] with the following hyperparameters
n_mfcc n_fft hop_length win_length n_mels
20 2048 512 2048 128
Table 4.2.1: Sequence Distance MFCC extraction
The following speech distances were used on the MFCC representation.
DTW-MFCC-EU See sec 3.1.1 with Euclidean distance, see ex 2.1. The DTW
implementation used was ’dtw’ from the Python pip package repository; the source code is
hosted on GitHub [47].
GhostDTW-MFCC-EU See sec 3.1.2 with Euclidean distance, see ex 2.1. The
Ghost DTW implementation used was a custom implementation. It is available in
the Friday GitHub repository [20].
GhostDTW-MFCC-COS See sec 3.1.2 with Cosine distance, see ex 2.1. The Ghost
DTW implementation used was a custom implementation. It is available in the
Friday GitHub repository [20].
GhostDTW-MFCC-CHE See sec 3.1.2 with Chebyshev distance, see ex 2.1. The
Ghost DTW implementation used was a custom implementation. It is available in
the Friday GitHub repository [20].
4.2.2 Speech to Phonemes
All the Speech to Phonemes experiments share the same S mapping. The mapping
used MFCCs extracted with the TensorFlow 1.15 [1] audio module using the following
hyperparameters:
coefficients frame_length frame_step fft_length num_mel_bins
27 512 256 512 120
Table 4.2.2: Speech to Phonemes MFCC extraction
The mapping S is a sequence to sequence mapping with the following
architecture:
units activation
LSTM 256 tanh
LSTM 256 tanh
Dense 256 tanh
Dense 41 none
Table 4.2.3: Speech to Phonemes Architecture
It was trained using TensorFlow 1.15 [1] on the LibriSpeechPhonemes dataset, see 4.1.2,
using the CTC loss implementation of the TensorFlow 1.15 library. The training ran
on an Nvidia GTX 1080 Ti GPU for 72 hours; the optimizer used was Adam [30] with
TensorFlow 1.15 default parameters. The learning rate was governed by the cosine
decay restarts scheduler [22] from TensorFlow 1.15 with the following parameters:
learning_rate first_decay_steps t_mul m_mul alpha
0.0005 1000 2.0 1.0 0.0
Table 4.2.4: Speech to Phonemes learning rate scheduler
The following experiments using S were evaluated:
STP-BS Speech To Phonemes with Beam Search, see sec 3.2.1. The TensorFlow 1.15
beam search with a beam width of 600 was used as B. Ghost DTW with the discrete
metric on phonemes was used on the phonemes produced by B from the result of S. The
DTW result was then divided by the product of the lengths of the phoneme sequences,
and this was used as the final distance.
STP-PKL Speech To Phonemes with Posteriogram KL-divergence, see sec 3.2.2.
Ghost DTW was used to measure the distance between the posteriograms; the KL divergence
[35] (eq 2.1) of the sample given the example was used as the distance with DTW. Note that
the KL divergence is not actually a distance, def 1, since it is asymmetric, but it was chosen
because of its effectiveness in preliminary experiments. In those preliminary experiments
both directions of the KL divergence, the posteriogram distance from DTW on Gaussian
Posteriograms [62], the Jensen-Shannon divergence [35] (eq 4.1), the total variation
distance [58], the Chebyshev distance, the Euclidean distance and the cosine distance were
tested. The total DTW distance using the KL-divergence was then divided by the length
of the example keyword, and this was used as the final distance.
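As an illustration, the per-frame KL divergence used as the DTW step cost might be sketched as follows, assuming posteriogram frames are already probability distributions (the thesis feeds logits in practice, for numerical stability):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two phoneme distributions (one posteriogram frame).
    Asymmetric, so not a distance in the sense of def 1."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(kl_divergence(p, q) > 0)                      # True
print(kl_divergence(p, q) == kl_divergence(q, p))   # False: asymmetric
```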
STP-EL Speech To Phonemes with Example Likelihood, as described in sec 3.2.3.
From the negative log-likelihood the length of the example phoneme sequence was
subtracted; the resulting value was used as the final distance.
STP-SL Speech To Phonemes with Sample Likelihood, as described in sec 3.2.4.
From the negative log-likelihood the product of the example length and the predicted
sample phoneme length was subtracted; the resulting value was used as the final
distance.
4.2.3 Deep Distance Learning
The R mapping, explained in section 3.3, was implemented as a deep convolutional
neural network. It used MFCCs as input, extracted with the following hyperparameters:
coefficients frame_length frame_step fft_length num_mel_bins
27 512 256 512 120
Table 4.2.5: Deep Distance Learning MFCC extraction
R had the following architecture:
filters / units | filter width | filter height | activation
2D convolution | 64 | 7 | 3 | relu
2D max pooling | | 1 | 3 |
2D convolution | 128 | 1 | 7 | relu
2D max pooling | | 1 | 4 |
2D convolution | 256 | 1 | 10 | relu
2D convolution | 512 | 7 | 1 | relu
global max pooling | | | |
dense | 512 | | | relu
dense | 512 | | | none
Table 4.2.6: Deep Distance Learning Architecture
The architecture was inspired by a submission to the Google Speech Commands
keyword spotting competition hosted on the platform Kaggle [28]. For each of the
following experiments R was trained for 24 hours with TensorFlow 1.15 [1] using a GTX
1080 TI on LibriTriplets, see sec 4.1.5.
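The architecture in Table 4.2.6 can be sketched in Keras roughly as follows. This is a sketch under stated assumptions: the padding mode, pooling strides, and the mapping of the table's width/height columns onto Keras kernel dimensions are guesses where the table is ambiguous.

```python
import tensorflow as tf

def build_r(input_shape=(None, 120, 1)):
    """Sketch of the R network from Table 4.2.6.
    Input is (time, mel bins, 1); 'same' padding is an assumption."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, (7, 3), activation="relu",
                               padding="same", input_shape=input_shape),
        tf.keras.layers.MaxPool2D(pool_size=(1, 3)),
        tf.keras.layers.Conv2D(128, (1, 7), activation="relu", padding="same"),
        tf.keras.layers.MaxPool2D(pool_size=(1, 4)),
        tf.keras.layers.Conv2D(256, (1, 10), activation="relu", padding="same"),
        tf.keras.layers.Conv2D(512, (7, 1), activation="relu", padding="same"),
        tf.keras.layers.GlobalMaxPool2D(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(512, activation=None),
    ])
```

The global max pooling makes the 512-dimensional output independent of the input's time length, which is convenient for variable-length keywords.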
DDL-COS Deep Distance Learning using COSine distance, trained with a
separation (α) of 1.0.
DDL-EU Deep Distance Learning using EUclidean distance, trained with a
separation (α) of 1.0.
4.3 Metrics
The experiments aim to measure three aspects of the methods: efficacy, resources
and usability. The efficacy metrics aim to measure the performance of a method,
how 'accurate' it is. The resource metrics aim to measure what is needed to use a
method, for example how much memory it requires and what its inference
latency is. The usability metrics aim to measure how easy the methods are to use.
For example, as introduced in chapter 3, all methods use a heuristic on top of the
distance calculations; depending on the spread of distances between samples it might be
impractical to pick a threshold for what counts as an 'unknown' keyword. The usability
metric will investigate how easy it is to decide on such a threshold.
4.3.1 Efficacy Metrics
In an inference setting there are five outcomes that can occur:
1. A user speaks a keyword and the system predicts the correct keyword
2. A user speaks a keyword and the system predicts ’unknown’
3. A user speaks a keyword and the system predicts a wrong keyword
4. A user speaks no keyword and the system predicts some keyword
5. A user speaks no keyword and the system predicts ’unknown’
A system that always ends up in outcome 1 or 5 is a perfect system. It is also noteworthy
that it is typically much more common that no keyword has been spoken than the
opposite.
In addition to the possible outcomes of a prediction, the order of outcomes plays a role
as well.
Consider a scenario where a user utters 'good night' and where the system has learned
the keywords 'good morning' and 'good night'. The system might at some point
make a prediction having only heard the first part of the keyword, 'good'. If, for
example, the training data was biased towards 'good night', the system might infer
from the utterance 'good' alone that the keyword is 'good night'. It would have
been preferable if the system had returned 'unknown' and waited until it heard the
full keyword. This is a difficult problem, and the different ways of addressing it
come with different drawbacks. It is not a main focus of this thesis, so to simplify,
three inference heuristics for deciding which inference to use will be evaluated.
First, accuracy as first: only the first non-'unknown' inference of a keyword is
considered valid. Second, accuracy as majority: the majority vote over the
inferences around a keyword is considered the prediction of the system. Finally,
accuracy as some point: an inference is considered correct if at some point a
correct inference is made.
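Given a list of window-level predictions around one spoken keyword, the three heuristics can be sketched as follows. How 'unknown' votes are treated in the majority heuristic is an assumption of this sketch.

```python
from collections import Counter

def accuracy_as_first(predictions, label):
    """Only the first non-'unknown' inference of a keyword counts."""
    for pred in predictions:
        if pred != "unknown":
            return pred == label
    return False

def accuracy_as_majority(predictions, label):
    """The majority vote over the non-'unknown' inferences counts."""
    votes = Counter(p for p in predictions if p != "unknown")
    return bool(votes) and votes.most_common(1)[0][0] == label

def accuracy_as_some_point(predictions, label):
    """Correct if a correct inference is made at some point."""
    return label in predictions
```

For example, with predictions `["unknown", "good night", "good morning", "good night"]` and label `"good night"`, all three heuristics count the keyword as correct, while `accuracy_as_first` would fail if the first non-'unknown' prediction were `"good morning"`.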
Occurrences of scenarios (3) and (4) are what the system must prioritize minimizing,
since misinterpretation or seemingly random behaviour is considered by the author
to be worse than not hearing. To measure how well a system minimizes these scenarios
one can use the false-positive rate as an indicator for (4) and accuracy as an indicator
for (3). There is a trade-off between accuracy and false-positive rate depending on the
distance threshold T (see chapter 3); to account for this, the metrics are calculated
for all relevant values of T. This yields a graph showcasing the trade-off between
accuracy and false positives for different T. However, to make the graphs more
interpretable they will instead contain:

E(T) = accuracy(T) / (fp(T) + ϵ)    (4.1)
The efficacy (E) at distance T is the ratio of the accuracy to the false-positive rate plus
a small constant ϵ. ϵ controls how valuable it is to keep a low false-positive
rate. Using this formula, the methods with the highest peak are the best for a given
ϵ.
In the result plots the distance T for a method will be normalized to [0, 1]; the
false-positive rate and the accuracy are likewise values in [0, 1], and ϵ = 1/100.
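As a sketch of how efficacy could be computed at one threshold T: predictions whose distance exceeds T are treated as 'unknown', accuracy is counted over keyword-labelled samples only, and a false positive is any keyword prediction on an 'unknown' label. The exact thresholding pipeline is an assumption of this sketch.

```python
def efficacy_at(T, distances, predictions, labels, eps=0.01):
    """E(T) = accuracy(T) / (fp(T) + eps), eq. 4.1."""
    thresholded = [p if d <= T else "unknown"
                   for p, d in zip(predictions, distances)]
    # Accuracy over samples whose label is an actual keyword.
    keyword = [(p, l) for p, l in zip(thresholded, labels)
               if l != "unknown"]
    accuracy = sum(p == l for p, l in keyword) / len(keyword)
    # False positives: keyword predictions on 'unknown' labels.
    unknown = [p for p, l in zip(thresholded, labels) if l == "unknown"]
    fp_rate = sum(p != "unknown" for p in unknown) / len(unknown)
    return accuracy / (fp_rate + eps)
```

Sweeping T over its range and plotting `efficacy_at` produces curves of the kind shown in the result plots.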
The results will contain plots of efficacy for each inference heuristic, and the appendix
will contain false-positive-rate plots and the accuracies for different distances.
The appendix will also contain a confusion matrix for each method at the distance
that maximizes efficacy, calculated using 'accuracy as majority'.
To clarify, accuracy here is defined as the number of correctly classified keywords over
the total number of keywords. This means that a prediction of some keyword when the
label is 'unknown' does not affect the accuracy at all, since 'unknown' is not considered
a keyword. Furthermore, a false positive is defined as a prediction of any keyword when
the label is 'unknown'.
Interpretation of Efficacy
Here is an example of the efficacy plots that will be present in the results:
[Plot: efficacy vs. normalized distance in three panels, accuracy_as_first, accuracy_as_some_point and accuracy_as_majority; methods: STP-BS, STP-EL, STP-PKL, STP-SL]
Figure 4.3.1: Efficacy of quiet and same speaker evaluation
It shows the efficacy, equation 4.1, for all distances normalized to [0, 1] for all
inference heuristics. As a rule of thumb, having the highest peak is best; following
this reasoning, 'STP-SL' is superior with the 'accuracy_as_first' heuristic and 'STP-BS'
superior with the others, in this example.
A decrease in efficacy is caused by an influx of false positives. An increase in efficacy is
caused by improved accuracy without a significant increase in the false-positive rate. A
peak can therefore be interpreted as the distance where accuracy is optimal given our
penalty on false positives. The methods with the highest peaks thus have the highest
capability of good accuracy under our penalization of false positives; they are
considered best.
4.3.2 Resource Metrics
The constraints section, sec 1.4, places limitations on the memory usage and latency
of the methods; to this end, the memory usage and latency of the methods will
be presented. Denote the average seconds per inference as E[I(K)], where K is the
number of keyword examples a system has been given. The latency metric provided will
be:
L(K) = E[I(K)] (4.2)
The results will contain a plot of L for different values of K.
Making fair latency benchmarks is a notoriously hard problem, since using the right
hardware and implementation can make all the difference. An implementation
with performance as its main objective versus one with a different objective can
show a performance disparity of many orders of magnitude. In this thesis I have
not implemented all of the methods from scratch, but have made use of existing
libraries. Some of these libraries might not have performance as their main
consideration, and some of them might produce orders of magnitude better
performance if compiled with optimizations and run on the right hardware. All
of this is to say that although in my experiments some method might have better
latency than another, it is by no means a definitive result.
For memory usage, the number of MB of storage required to use each method as
evaluated in this thesis will be presented.
4.3.3 Usability Metrics
The main difficulty for usability lies in deciding on a threshold for what is considered
'unknown' (see the heuristic in chapter 3); that threshold could even vary depending on
which words have been learnt. Having a single threshold is preferable because it makes
usage easier; otherwise, every time a new keyword is added a new threshold would have
to be found as well.
To inspect the ease of deciding thresholds, the in-class and inter-class distributions
of distances will be plotted. If the in-class and inter-class distributions are clearly
separable, then defining a single threshold should be feasible. The usability will be
presented as plots of the in-class and inter-class distributions of distances for some
keywords.
To clarify, the in-class distribution is the distribution of distances between keywords
of the same class, and the inter-class distribution is the distribution of distances to
keywords outside of the class.
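Splitting the pairwise distances into the two distributions can be sketched as follows; `dist` stands in for whichever distance function a method uses, and the (keyword, features) pairing is an assumed data layout.

```python
from itertools import combinations

def distance_distributions(recordings, dist):
    """Split all pairwise distances into in-class (same keyword)
    and inter-class (different keyword) lists.
    `recordings` is a list of (keyword, features) pairs."""
    in_class, inter_class = [], []
    for (kw_a, feat_a), (kw_b, feat_b) in combinations(recordings, 2):
        target = in_class if kw_a == kw_b else inter_class
        target.append(dist(feat_a, feat_b))
    return in_class, inter_class
```

Plotting histograms of the two returned lists gives the usability plots described above.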
4.4 Evaluation
This section describes the evaluation settings, what datasets were used and how the
metrics from sec 4.3 were evaluated.
4.4.1 Realistic
To evaluate efficacy the RS dataset, see sec 4.1.6, was used. The dataset and this
evaluation setting were created to resemble the author's intended real-world use cases
as closely as possible. There are four cases. First, the quiet and same scenario, where
the same speaker provides examples and samples and the samples are recorded in a
quiet room, think bedroom, without much background noise or non-keyword speech.
Second, the noisy and same scenario, where the same speaker provides examples and
samples and the samples are recorded in a noisy room, think living room, with
background noise such as a television and a lot of non-keyword speech. Third, the
quiet and different scenario, where one speaker provides examples, different speakers
provide samples and the samples are recorded in a quiet room. Finally, the noisy and
different scenario, where one speaker provides examples, different speakers provide
the samples and the samples are recorded in a noisy environment. These scenarios are
all part of the RS dataset.
For each scenario the methods evaluated were provided with three examples of each of
the six keywords 'time', 'stop', 'night', 'morning', 'illuminate' and 'alarm', in total 18
example recordings. Using these examples, inference on the RS dataset was performed
using a sliding window [19] with a window size of 2000 ms and a stride of 250 ms. From
the sliding-window inferences the metrics from sec 4.3.1 were calculated by
assuming that if the center of an inference window was at most 2 seconds away from
the utterance, then it was an inference on the utterance.
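The windowing and utterance-matching logic above can be sketched as follows; the 16 kHz sample rate is an assumption, as the thesis does not state it here.

```python
def sliding_windows(num_samples, sample_rate=16000,
                    window_ms=2000, stride_ms=250):
    """Yield (start, end, center) sample indices for a 2000 ms
    window moved with a 250 ms stride over an audio buffer."""
    window = int(sample_rate * window_ms / 1000)
    stride = int(sample_rate * stride_ms / 1000)
    for start in range(0, num_samples - window + 1, stride):
        yield start, start + window, start + window // 2

def matches_utterance(window_center, utterance_pos,
                      sample_rate=16000, max_offset_s=2.0):
    """An inference counts as an inference on the utterance if the
    window center is at most 2 s from the utterance position."""
    return abs(window_center - utterance_pos) <= max_offset_s * sample_rate
```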
4.4.2 Google Speech Commands
The Google Speech Commands (GSC) dataset, see sec 4.1.3, is a limited-vocabulary
keyword spotting benchmark dataset. Limited-vocabulary keyword spotting is a
different problem than the one addressed in this thesis, but evaluating on this
benchmark still makes the methods somewhat comparable to other works that have
used this dataset for evaluation. Using the GSC dataset a 3-shot learning benchmark
was created. The keywords 'left', 'learn', 'sheila', 'seven', 'dog' and 'down' were
included and the system was given 3 examples of each keyword. It was then evaluated
using multi-class accuracy as in the Kaggle competition [55] using the same dataset.
Following is a plot of the data distribution used in the evaluation:
Figure 4.4.1: Distribution of the keywords used in the GSC evaluation
4.4.3 Latency
Latency was evaluated on noise. For each K (see sec 4.3.2), K normally distributed
vectors were generated and registered as examples. Then N = 100 normally
distributed vectors were generated and used to measure the inference time.
The resulting plot is provided in figure 5.4.2.
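The latency measurement L(K) = E[I(K)] can be sketched as below. The `register`/`infer` interface and the vector dimension are hypothetical stand-ins for whatever API each method exposes.

```python
import time
import numpy as np

def measure_latency(method, K, n_trials=100, dim=120):
    """Estimate L(K) = E[I(K)]: the average seconds per inference
    after registering K normally distributed example vectors."""
    method.register([np.random.randn(dim) for _ in range(K)])
    samples = [np.random.randn(dim) for _ in range(n_trials)]
    start = time.perf_counter()
    for sample in samples:
        method.infer(sample)
    return (time.perf_counter() - start) / n_trials
```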
4.4.4 Usability
The usability was evaluated using the usability dataset, see sec 4.1.
Chapter 5
Result
First, the results of the individual methods are presented. Then a comparison between
the results of the best variants (according to the efficacy metric) of the individual
methods is given. The final comparison also contains the evaluation of the resources
and usability of the methods.
5.1 Speech Distances
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, Ghost-DTW-MFCC-COS, Ghost-DTW-MFCC-EU]
Figure 5.1.1: Efficacy of quiet and same speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.2 and A.1.3 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, Ghost-DTW-MFCC-COS, Ghost-DTW-MFCC-EU]
Figure 5.1.2: Efficacy of noisy and same speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.4 and A.1.5 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, Ghost-DTW-MFCC-COS, Ghost-DTW-MFCC-EU]
Figure 5.1.3: Efficacy of quiet and different speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.6 and A.1.7 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, Ghost-DTW-MFCC-COS, Ghost-DTW-MFCC-EU]
Figure 5.1.4: Efficacy of noisy and different speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.8 and A.1.9 in Appendix A.
5.2 Speech to Phonemes
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: STP-BS, STP-EL, STP-PKL, STP-SL]
Figure 5.2.1: Efficacy of quiet and same speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.10 and A.1.11 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: STP-BS, STP-EL, STP-PKL, STP-SL]
Figure 5.2.2: Efficacy of noisy and same speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.12 and A.1.13 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: STP-BS, STP-EL, STP-PKL, STP-SL]
Figure 5.2.3: Efficacy of quiet and different speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.14 and A.1.15 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: STP-BS, STP-EL, STP-PKL, STP-SL]
Figure 5.2.4: Efficacy of noisy and different speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.16 and A.1.17 in Appendix A.
5.3 Deep Distance Learning
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: DDL-COS, DDL-EU]
Figure 5.3.1: Efficacy of quiet and same speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.18 and A.1.19 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: DDL-COS, DDL-EU]
Figure 5.3.2: Efficacy of noisy and same speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.20 and A.1.21 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: DDL-COS, DDL-EU]
Figure 5.3.3: Efficacy of quiet and different speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.22 and A.1.23 in Appendix A.
[Plot: efficacy vs. normalized distance for the accuracy_as_first, accuracy_as_some_point and accuracy_as_majority heuristics; methods: DDL-COS, DDL-EU]
Figure 5.3.4: Efficacy of noisy and different speaker evaluation
Accuracy, false positive rate and the confusion matrix for the highest efficacy can be
seen in figures A.1.24 and A.1.25 in Appendix A.
5.4 Comparison
This section presents a comparison of the top-performing methods from the first three
sections. For plots of efficacy with only the top performers, see appendix figures
A.1.28, A.1.29, A.1.30 and A.1.31.
5.4.1 Accuracy
Figure 5.4.1: GSC multiclass accuracy
For confusion matrices of the GSC evaluation, see appendix A.1.26.
5.4.2 Resources
Figure 5.4.2: Inference Latency
The red line is an upper bound on the allowed latency to fulfill the constraints from
section 1.4. L(K) is described in the latency metric section, sec 4.3.2.
The memory usage of the methods, as described in sec 4.2, is presented in the
following table:
|  | Ghost-DTW-MFCC | STP-SL | DDL |
|---|---|---|---|
| Representation MB | 0.0067 | 0.08 | 0.016 |
| Model MB | 0 | 3.5 | 5.8 |
Table 5.4.1: Memory Usage
5.4.3 Usability
Figure 5.4.3: Interclass and inclass distribution
The solid lines represent in-class and the dotted lines inter-class distributions of
distances. In appendix A.1.27 the usability of the STP methods is plotted without the
length post-processing.
Chapter 6
Conclusions
In this chapter we attempt to draw conclusions about when to use which method. We
begin by stating the benefits and drawbacks of each method. Finally, we summarize in
section 6.4.
6.1 Sequence Distances
The main benefit becomes apparent when looking at memory, table 5.4.1: sequence
distances require the least amount of memory. Looking at the latency, figure 5.4.2,
we see that sequence distances also have low latency. The Ghost-DTW implementation
is able to learn far more examples than what was used for benchmarking before latency
becomes an issue.
Regarding efficacy: looking at the GSC evaluation, fig 5.4.1, sequence distances
leave something to be desired. However, the GSC evaluation contains multiple
speakers, different microphones and different settings. If the use case is more
constrained, for example a single speaker and a single microphone, then sequence
distances might be worth considering. Looking at the quiet and same results, figure
5.1.1, the sequence distances reached an efficacy comparable with the other methods.
Furthermore, the confusion matrix for the same evaluation, figure A.1.3, shows that
most of the sequence distances perform well. The sequence distances also show
promise in the quiet and different results, figure 5.1.3. However, in the noisy
environments one should consider using a different method.
Regarding usability: the sequence distances would likely benefit from tuning on a
per-case basis. However, picking a single threshold is not impossible, although it
comes at the cost of some performance. For example, for the Ghost-DTW-MFCC-CHE
method a threshold of 35 works decently across all evaluations, and we can see from
figure 5.4.3 that 35 would separate the distributions decently.
6.2 Speech To Phonemes
Starting with drawbacks: looking at the latency results, figure 5.4.2, STP-SL breaks the
allowed barrier already at 12 examples. Furthermore, the STP methods lie uncomfortably
close to the barrier already at 1 example. Of course, this might be an artifact of the
evaluation; different optimizations could possibly yield better results. However, the
STP methods have an expensive representation step, inferring the phoneme posterior,
and typically an expensive distance on top of the posteriors too. For example, STP-SL
and STP-EL use a CTC forward pass on top of the phoneme posteriors, and STP-PKL
uses DTW on top of them. Only STP-BS uses a computationally cheap distance on top
of the representation. Looking at the memory usage of STP-SL, table 5.4.1, it uses a
few MB for storing the model and a few KB per representation. The memory usage is
completely within acceptable margins. To reduce the memory usage even further,
Bluche et al [8] suggest using quantized LSTM networks; this would bring the LSTM
used by the STP methods to under 1 MB.
Moving on to some benefits: looking at the GSC results, figure 5.4.1, STP-SL
achieves 60% accuracy on the 3-shot learning task. The best submissions to the
Kaggle competition [55] achieved an accuracy of 90% on a larger subset of the dataset.
A direct comparison makes STP-SL look bad. However, the STP method was only
given 3 examples per keyword, not thousands, and it was not trained for the GSC
dataset specifically. Considering the circumstances, the result is not nearly as bad
as the direct comparison might indicate.
Looking at the efficacy results, the STP methods had high variance in comparison to
the sequence distances. On the quiet evaluations STP-BS performs best of the STP
methods and beats all sequence distances. However, on the noisy evaluations STP-BS
scores badly in comparison with the other STP methods. More consistently, the STP-SL
method beats all sequence distances on all efficacy evaluations. It scores well against
the other STP methods on the quiet evaluations and is among the best methods on the
noisy evaluations.
Regarding usability: another drawback of the STP methods is that they did not work
well out of the box. The reader might have been confused as to why the STP methods
contain extra logic such as 'divide the resulting distance by the length of the example
keyword'. Figure A.1.27 shows two usability plots of STP methods without this extra
post-processing. The figure shows that long words and short words have very different
distributions and are hard to separate. Based on these results the post-processing was
added to improve the separability of the distributions, which improved the results of
the methods by orders of magnitude. However, even with this post-processing we can
see from figure 5.4.3 that although the distributions appear separable for these
particular keywords, the modes are quite spread out and a single threshold that works
for all keywords might be hard to find. It is left as future work to dig deeper into how
to make the STP methods more usable, for they have promising attributes, such as
interpretability (one can inspect the sequence of phonemes the model believes exists
in the audio) and good efficacy.
6.3 Deep Distance Learning
The memory usage of the DDL methods (see table 5.4.1) is within the margins of the
constraints stated in this thesis. For readers interested in truly minimizing memory
usage, the CNN can be quantized, as done by Wu et al [61]. The DDL methods have
very good latency (see figure 5.4.2): increasing K gives no noticeable performance
degradation within the first 30 examples, and the base performance for one example
is very good compared to the other methods. DDL supports hundreds of keywords
before latency becomes an issue.
On GSC, figure 5.4.1, DDL scored over 60%, similar to the STP methods. Also similar
to the STP methods, DDL performs badly in comparison to the best baselines on GSC,
but considering the limitations the results are more impressive. Regarding efficacy:
DDL got a perfect score in both of the quiet environments and was comparable to the
best STP method in the noisy ones.
Regarding usability: looking at figure 5.4.3, the in-class and inter-class modes are
clearly separable and using a single threshold looks possible. For example, for the
DDL-EU variant a threshold of 0.8 performs very well.
6.4 What’s best
The DDL methods would likely work best for most tasks, the one exception being
extremely memory-constrained environments, where the sequence distances would be
a better choice. DDL is likely the best choice because of its high performance in
efficacy, latency and usability. Only the STP methods could rival it in efficacy, but
they had problems with latency and usability.
6.5 Future Work
In this section possible extensions of the work are proposed.
6.5.1 DDL Extensions
Some interesting future directions to improve the DDL model are presented here.
More Data
In this thesis we constructed LibriWords, sec 4.1.4, from LibriSpeech, a dataset with
about 1,000 speakers. One could also use Mozilla Common Voice [3], which has an
English dataset with over 55,000 speakers. A dataset, ”Megawörd”, consisting of data
from LibriSpeech, Mozilla Common Voice and Google Speech Commands, has been
created to train the DDL models described in this thesis for the Friday Assistant [20].
The dataset is orders of magnitude larger than the one used in this thesis and will be
provided on request; it is not provided online due to hosting costs.
Different Triplet Loss
In this thesis we used a triplet loss that maximizes the relative distance between
anchor-positive and anchor-negative:

L(A, P, N) = max(D(R(A), R(P)) − D(R(A), R(N)) + α, 0)    (6.1)
It would be of interest to try using a triplet loss that maximizes an absolute
distance.
L(A,P,N) = D(R(A),R(P)) + β max(α − D(R(A),R(N)), 0)    (6.2)
Here β is a scalar that can be used to weigh the data more realistically: in keyword spotting, negative samples are much more common than positive samples, so β > 1 is a good fit.
The hypothesis is that maximizing the absolute distance would yield tighter and better-separated clusters in the representation space, and therefore better discriminative properties.
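As a concrete sketch, the two losses can be written out in a few lines of NumPy. This is an illustration only: D is taken to be the Euclidean distance, the inputs are assumed to be precomputed representations R(·), and the function names are our own.

```python
import numpy as np

def relative_triplet_loss(r_a, r_p, r_n, alpha=1.0):
    """Eq. 6.1: the loss is zero once the anchor-negative distance
    exceeds the anchor-positive distance by the margin alpha."""
    d_ap = np.linalg.norm(r_a - r_p)
    d_an = np.linalg.norm(r_a - r_n)
    return max(d_ap - d_an + alpha, 0.0)

def absolute_triplet_loss(r_a, r_p, r_n, alpha=1.0, beta=2.0):
    """Eq. 6.2: positives are pulled toward the anchor unconditionally,
    negatives are pushed out to the absolute margin alpha; beta > 1
    up-weights the (more common) negative samples."""
    d_ap = np.linalg.norm(r_a - r_p)
    d_an = np.linalg.norm(r_a - r_n)
    return d_ap + beta * max(alpha - d_an, 0.0)
```

Note that the relative loss is satisfied by any triplet in which the negative is merely farther away than the positive, while the absolute loss keeps shrinking the anchor-to-positive distance, which is why tighter clusters are expected.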
Better Augmentations
Section 4.1.8 describes the postprocessing techniques used in this thesis. Empirical experiments not accounted for in this thesis showed that the methods, when trained on the LibriWords dataset with this postprocessing, exhibit a performance drop in environments with reverberation. Improving the data augmentation might alleviate this and further improve the performance of DDL in realistic settings; Ko et al. [31] suggest an improved data augmentation scheme for this problem.
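One common way to close this gap, in the spirit of Ko et al. [31], is to augment the training data by convolving clean waveforms with room impulse responses (RIRs). The sketch below is a minimal, hypothetical version of such an augmentation; a real pipeline would draw RIRs from measured or simulated rooms rather than the toy decaying-noise response used here.

```python
import numpy as np

def add_reverb(audio, rir):
    """Simulate reverberation by convolving the waveform with a room
    impulse response, then rescaling to the original peak level."""
    wet = np.convolve(audio, rir)[: len(audio)]
    peak = np.max(np.abs(wet))
    if peak > 0:
        wet = wet * (np.max(np.abs(audio)) / peak)
    return wet

# Toy stand-in for a real RIR: 0.3 s of exponentially decaying noise.
sr = 16000
t = np.arange(int(0.3 * sr)) / sr
toy_rir = np.exp(-t / 0.1) * np.random.randn(len(t))
toy_rir[0] = 1.0  # direct-path component
```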
Asymmetric Task
For the STP-PKL model, when using distances on the CTC posteriorgrams, it was discovered that an asymmetric ’distance’ worked better than all symmetric variants. It would be interesting to train a DDL model with an asymmetry to see if it yields similar results.
Imagine rephrasing the optimization as ’does the word in this audio clip appear somewhere inside this other audio clip?’ instead of ’do these two audio clips contain the same word?’. One could easily adapt a word dataset to such an optimization task: simply include more context from the source dataset when creating the positives and negatives, and add a non-weight-shared prediction head to allow the model to learn the asymmetry.
The hypothesis is that this would work because the examples contain only the utterance, recorded in a clean environment, while samples might contain noisy distractors such as background noise, or the keyword might occur as part of a sentence.
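As a sketch of such a data construction (the helper and its parameters are hypothetical, not code from this thesis), given word boundaries in the source dataset one could cut a tight query clip and a wider key clip:

```python
import numpy as np

def make_asymmetric_pair(source_audio, word_start, word_end,
                         context_s=0.5, sr=16000):
    """Cut a tight 'query' containing only the word, and a 'key' that
    keeps context_s seconds of surrounding audio, so the training
    question becomes: does the query word occur somewhere in the key?"""
    query = source_audio[word_start:word_end]
    pad = int(context_s * sr)
    key = source_audio[max(0, word_start - pad): word_end + pad]
    return query, key
```

The query and key would then pass through weight-shared encoders, with a small non-weight-shared head on, for example, the key side to let the model represent the asymmetry.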
6.5.2 Usability of STP
The distribution of in-class and inter-class distances for keywords of different lengths makes using the STP methods more difficult, due to the problem of choosing a threshold. But as shown in this thesis, one can improve the results of the STP methods by normalizing the distances in some manner using the length of the keyword. The fixes in this thesis were ad hoc, and there likely exists a better way of addressing this usability issue.
For example, for STP-SL the lengths of the phoneme sequences of the sample and of the example were subtracted from the negative log-likelihood. The motivation was the intuition that the probability of a word decreases by a multiplicative factor for each phoneme added, since the probability of a sequence under the CTC loss is the product of the probabilities of the phonemes in the CTC posteriorgram. However, this intuition is incomplete: the probability of a phoneme sequence is the sum over all paths in the CTC posteriorgram that generate said phoneme sequence. Using this knowledge, a more suitable normalization might be devised; or perhaps the threshold problem for STP can be avoided in an altogether different manner.
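To make the discussion concrete, the ad hoc subtraction described above, and a per-phoneme alternative, can be sketched as follows (the function names and exact weighting are illustrative, not the thesis implementation):

```python
def stp_sl_subtractive(nll, n_sample_phonemes, n_example_phonemes):
    """The ad hoc fix: subtract the phoneme-sequence lengths from the
    negative log-likelihood so longer keywords are not penalized for
    contributing more (multiplicative) phoneme factors."""
    return nll - n_sample_phonemes - n_example_phonemes

def stp_sl_per_phoneme(nll, n_sample_phonemes, n_example_phonemes):
    """An alternative: average negative log-likelihood per phoneme,
    which makes a single threshold transfer across keyword lengths."""
    return nll / max(n_sample_phonemes + n_example_phonemes, 1)
```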
6.5.3 Heuristic
In chapter 3 we introduced a heuristic for deciding on predictions given a collection of distances to the sample. In this thesis we used a set distance based on minimal set difference, see eq. 3.1. The DONUT paper [36] suggests a set distance based on the average difference instead. In preliminary experiments the average-difference distance did not show any clear advantages, but it is possible that other set distances, or different heuristics altogether, might work better for the prediction problem described in chapter 3.
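The alternatives can be compared with a small sketch. Eq. 3.1 is not reproduced here; the functions below only illustrate the general shape of such a heuristic, with hypothetical names and a minimum-based set distance standing in for the one used in the thesis:

```python
def min_set_distance(dists):
    """Distance from a sample to a keyword: the closest stored example."""
    return min(dists)

def avg_set_distance(dists):
    """DONUT-style [36] alternative: average distance to the examples."""
    return sum(dists) / len(dists)

def predict(dists_per_keyword, threshold, set_distance=min_set_distance):
    """Fire on the closest keyword if its set distance beats the
    threshold; otherwise predict that no keyword was spoken (None)."""
    best = min(dists_per_keyword,
               key=lambda kw: set_distance(dists_per_keyword[kw]))
    return best if set_distance(dists_per_keyword[best]) < threshold else None
```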
6.6 Final Words
Check out the Friday voice assistant [20]! At the time of writing, it uses the DDL method described in this thesis. All of the code for the models, evaluations and datasets can also be found in the same repository.
Bibliography
[1] Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, et al. “Tensorflow: A system for large-scale machine learning”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016, pp. 265–283.
[2] Alex, John Sahaya Rani and Venkatesan, Nithya. “Modified Multivariate
Euclidean Dynamic Time Warping Based Spoken Keyword Detection”. In:
(2017).
[3] Ardila, Rosana, Branson, Megan, Davis, Kelly, Henretty, Michael, Kohler, Michael, Meyer, Josh, Morais, Reuben, Saunders, Lindsay, Tyers, Francis M, and Weber, Gregor. “Common voice: A massively multilingual speech corpus”. In: arXiv preprint arXiv:1912.06670 (2019).
[4] Association, International Phonetic, Staff, International Phonetic Association,
et al. Handbook of the International Phonetic Association: A guide to the use
of the International Phonetic Alphabet. Cambridge University Press, 1999.
[5] Bahl, Lalit R, Jelinek, Frederick, and Mercer, Robert L. “A maximum likelihood approach to continuous speech recognition”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 2 (1983), pp. 179–190.
[6] Bellman, Richard. “Dynamic programming”. In: Science 153.3731 (1966),
pp. 34–37.
[7] Bluche, Théodore and Gisselbrecht, Thibault. “Predicting detection filters for small-footprint open-vocabulary keyword spotting”. In: arXiv preprint arXiv:1912.07575 (2019).
[8] Bluche, Théodore, Primet, Maël, and Gisselbrecht, Thibault. “Small-Footprint Open-Vocabulary Keyword Spotting with Quantized LSTM Networks”. In: arXiv preprint arXiv:2002.10851 (2020).
[9] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, Säckinger, Eduard, and Shah, Roopak. “Signature verification using a ‘siamese’ time delay neural network”. In: Advances in neural information processing systems (1994), pp. 737–737.
[10] Chandra, E et al. “Keyword spotting system for Tamil isolated words
using Multidimensional MFCC and DTW algorithm”. In: 2015 International
Conference on Communications and Signal Processing (ICCSP). IEEE. 2015,
pp. 0550–0554.
[11] Chen, Yangbin, Ko, Tom, Shang, Lifeng, Chen, Xiao, Jiang, Xin, and Li, Qing. “An investigation of few-shot learning in spoken term classification”. In: arXiv preprint arXiv:1812.10233 (2018).
[12] Cohen, Michael H, Cohen, Michael Harris, Giangola, James P, and Balogh, Jennifer. Voice user interface design. Addison-Wesley Professional, 2004.
[13] Conci, Aura and Kubrusly, CS. “Distance between sets: a survey”. In: arXiv preprint arXiv:1808.02574 (2018).
[14] Davis, Ken H, Biddulph, R, and Balashek, Stephen. “Automatic recognition of spoken digits”. In: The Journal of the Acoustical Society of America 24.6 (1952), pp. 637–642.
[15] Deka, Brajen Kumar and Das, Pranab. “An Analysis of an Isolated Assamese
Digit Recognition using MFCC and DTW”. In: 2019 6th International
Conference on Computing for Sustainable Global Development (INDIACom).
IEEE. 2019, pp. 46–50.
[16] Denes, P and Mathews, Max V. “Spoken Digit Recognition Using Time-Frequency Pattern Matching”. In: The Journal of the Acoustical Society of America 32.11 (1960), pp. 1450–1455.
[17] Fant, Gunnar. “Speech perception”. In: Speech Acoustics and Phonetics (2005),
pp. 199–220.
[18] Forney, G David. “The Viterbi algorithm”. In: Proceedings of the IEEE 61.3
(1973), pp. 268–278.
[19] Frank, Ray J, Davey, Neil, and Hunt, Stephen P. “Time series prediction and
neural networks”. In: Journal of intelligent and robotic systems 31.1 (2001),
pp. 91–103.
[20] Friday Voice Assistant. https://github.com/JonasRSV/Friday. Accessed: 2021-03-10.
[21] Giorgino, Toni et al. “Computing and visualizing dynamic time warping
alignments in R: the dtw package”. In: Journal of statistical Software 31.7
(2009), pp. 1–24.
[22] Gotmare, Akhilesh, Keskar, Nitish Shirish, Xiong, Caiming, and Socher,
Richard. “A closer look at deep learning heuristics: Learning rate restarts,
warmup and distillation”. In: arXiv preprint arXiv:1810.13243 (2018).
[23] Graves, Alex, Fernández, Santiago, Gomez, Faustino, and Schmidhuber, Jürgen.
“Connectionist temporal classification: labelling unsegmented sequence data
with recurrent neural networks”. In: Proceedings of the 23rd international
conference on Machine learning. 2006, pp. 369–376.
[24] Hoy, Matthew B. “Alexa, Siri, Cortana, and more: an introduction to voice
assistants”. In:Medical reference services quarterly 37.1 (2018), pp. 81–88.
[25] Huggins-Daines, David, Kumar, Mohit, Chan, Arthur, Black, Alan W, Ravishankar, Mosur, and Rudnicky, Alexander I. “Pocketsphinx: A free, real-time continuous speech recognition system for handheld devices”. In: 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. Vol. 1. IEEE. 2006, pp. I–I.
[26] Huh, Jaesung, Lee, Minjae, Heo, Heesoo, Mun, Seongkyu, and Chung, Joon Son. “Metric Learning for Keyword Spotting”. In: arXiv preprint arXiv:2005.08776 (2020).
[27] Johnston, Steven J and Cox, Simon J. The Raspberry Pi: A technology disrupter, and the enabler of dreams. 2017.
[28] Kaggle KWS CNN. https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/discussion/47715. Accessed: 2021-03-11.
[29] Kim, Byeonggeun, Lee, Mingu, Lee, Jinkyu, Kim, Yeonseok, and Hwang, Kyuwoong. “Query-by-example on-device keyword spotting”. In: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2019, pp. 532–538.
[30] Kingma, Diederik P and Ba, Jimmy. “Adam: A method for stochastic
optimization”. In: arXiv preprint arXiv:1412.6980 (2014).
[31] Ko, Tom, Peddinti, Vijayaditya, Povey, Daniel, Seltzer, Michael L, and
Khudanpur, Sanjeev. “A study on data augmentation of reverberant speech
for robust speech recognition”. In: 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 5220–
5224.
[32] Köppen, Mario. “The curse of dimensionality”. In: 5th Online World Conference on Soft Computing in Industrial Applications (WSC5). Vol. 1. 2000, pp. 4–8.
[33] Landau, HJ. “Sampling, data transmission, and the Nyquist rate”. In:
Proceedings of the IEEE 55.10 (1967), pp. 1701–1706.
[34] Librispeech Phoneme Lexicon. https://drive.google.com/file/d/1dAvxdsHWbtA1ZIh3Ex9DPn9Nemx9M1-L/view. Accessed: 2021-03-11.
[35] Lin, Jianhua. “Divergence measures based on the Shannon entropy”. In: IEEE
Transactions on Information theory 37.1 (1991), pp. 145–151.
[36] Lugosch, Loren, Myer, Samuel, and Tomar, Vikrant Singh. “DONUT: CTC-based Query-by-Example Keyword Spotting”. In: arXiv preprint arXiv:1811.10736 (2018).
[37] McAuliffe, Michael, Socolof, Michaela, Mihuc, Sarah, Wagner, Michael, and Sonderegger, Morgan. “Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi.” In: Interspeech. Vol. 2017. 2017, pp. 498–502.
[38] McFee, Brian, Raffel, Colin, Liang, Dawen, Ellis, Daniel PW, McVicar, Matt,
Battenberg, Eric, and Nieto, Oriol. “librosa: Audio and music signal analysis
in python”. In: Proceedings of the 14th python in science conference. Vol. 8.
Citeseer. 2015, pp. 18–25.
[39] Montgomery, Alan L and Smith, Michael D. “Prospects for Personalization on
the Internet”. In: Journal of Interactive Marketing 23.2 (2009), pp. 130–137.
[40] Morgan, David P, Scofield, Christopher L, Lorenzo, Theresa M, Real, Edward C,
and Loconto, David P. “A keyword spotter which incorporates neural networks
for secondary processing”. In: International Conference on Acoustics, Speech,
and Signal Processing. IEEE. 1990, pp. 113–116.
[41] Morgan, DP, Scofield, Christopher L, and Adcock, John E. “Multiple neural
network topologies applied to keyword spotting”. In: [Proceedings] ICASSP 91:
1991 International Conference on Acoustics, Speech, and Signal Processing.
IEEE. 1991, pp. 313–316.
[42] Mynatt, Elizabeth D, Back, Maribeth, Want, Roy, Baer, Michael, and Ellis, Jason B. “Designing audio aura”. In: Proceedings of the SIGCHI conference on Human factors in computing systems. 1998, pp. 566–573.
[43] Naylor, JA, Huang, WY, Nguyen, M, and Li, KP. “The application of neural networks to wordspotting”. In: [1992] Conference Record of the Twenty-Sixth Asilomar Conference on Signals, Systems & Computers. IEEE. 1992, pp. 1081–1085.
[44] O’Shea, Keiron and Nash, Ryan. “An introduction to convolutional neural
networks”. In: arXiv preprint arXiv:1511.08458 (2015).
[45] Panayotov, Vassil, Chen, Guoguo, Povey, Daniel, and Khudanpur, Sanjeev.
“Librispeech: an asr corpus based on public domain audio books”. In: 2015 IEEE
international conference on acoustics, speech and signal processing (ICASSP).
IEEE. 2015, pp. 5206–5210.
[46] Peterson, Leif E. “Knearest neighbor”. In: Scholarpedia 4.2 (2009), p. 1883.
[47] Python DTW implementation. https://github.com/pierre-rouanet/dtw. Accessed: 2021-03-10.
[48] Rohlicek, J Robin, Russell, William, Roukos, Salim, and Gish, Herbert. “Continuous hidden Markov modeling for speaker-independent word spotting”. In: International Conference on Acoustics, Speech, and Signal Processing. IEEE. 1989, pp. 627–630.
[49] Sakoe, Hiroaki. “Two-level DP-matching – A dynamic programming-based pattern matching algorithm for connected word recognition”. In: IEEE Transactions on Acoustics, Speech, and Signal Processing 27.6 (1979), pp. 588–595.
[50] Scheidl, Harald, Fiel, Stefan, and Sablatnig, Robert. “Word beam search:
A connectionist temporal classification decoding algorithm”. In: 2018 16th
International Conference on Frontiers in Handwriting Recognition (ICFHR).
IEEE. 2018, pp. 253–258.
[51] Senin, Pavel. “Dynamic time warping algorithm review”. In: Information and
Computer Science Department University of Hawaii at Manoa Honolulu, USA
855.123 (2008), p. 40.
[52] SF2956. NotesMC. Last accessed 21 May 2021. 2021.
[53] Sohn, Jongseo, Kim, Nam Soo, and Sung, Wonyong. “A statistical modelbased
voice activity detection”. In: IEEE signal processing letters 6.1 (1999), pp. 1–3.
[54] Sung, Flood, Yang, Yongxin, Zhang, Li, Xiang, Tao, Torr, Philip HS, and
Hospedales, Timothy M. “Learning to compare: Relation network for fewshot
learning”. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. 2018, pp. 1199–1208.
[55] Tensorflow Speech Recognition Challenge. https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data. Accessed: 2021-03-30.
[56] Tiwari, Vibha. “MFCC and its applications in speaker recognition”. In:
International journal on emerging technologies 1.1 (2010), pp. 19–22.
[57] Twaddell, W Freeman. “On defining the phoneme”. In: Language 11.1 (1935),
pp. 5–62.
[58] Verdú, Sergio. “Total variation distance and the distribution of relative
information”. In: 2014 Information Theory and Applications Workshop (ITA).
IEEE. 2014, pp. 1–3.
[59] Vygon, Roman and Mikhaylovskiy, Nikolay.
“Learning Efficient Representations for Keyword Spotting with Triplet Loss”.
In: arXiv preprint arXiv:2101.04792 (2021).
[60] Warden, Pete. “Speech commands: A dataset for limitedvocabulary speech
recognition”. In: arXiv preprint arXiv:1804.03209 (2018).
[61] Wu, Jiaxiang, Leng, Cong, Wang, Yuhang, Hu, Qinghao, and Cheng, Jian. “Quantized Convolutional Neural Networks for Mobile Devices”. In: CoRR abs/1512.06473 (2015). arXiv: 1512.06473. URL: http://arxiv.org/abs/1512.06473.
[62] Zhang, Yaodong and Glass, James R. “Unsupervised spoken keyword spotting
via segmental DTW on Gaussian posteriorgrams”. In: 2009 IEEEWorkshop on
Automatic Speech Recognition & Understanding. IEEE. 2009, pp. 398–403.
Appendix Contents
A First Appendix
Appendix A
First Appendix
Figure A.1.1: Embeddings and audio from figure 3.3.1
The first audio belongs to the first embedding (from the top), the second audio to the second embedding, and so on.
Figure A.1.2: Accuracy and ’false positive rate’ of quiet and same speaker evaluation
Figure A.1.3: Confusion matrix of quiet and same speaker evaluation
Figure A.1.4: Accuracy and ’false positive rate’ of noisy and same speaker evaluation
Figure A.1.5: Confusion matrix of noisy and same speaker evaluation
Figure A.1.6: Accuracy and ’false positive rate’ of quiet and different speaker evaluation
Figure A.1.7: Confusion matrix of quiet and different speaker evaluation
Figure A.1.8: Accuracy and ’false positive rate’ of noisy and different speaker evaluation
Figure A.1.9: Confusion matrix of noisy and different speaker evaluation
Figure A.1.10: Accuracy and ’false positive rate’ of quiet and same speaker evaluation
Figure A.1.11: Confusion matrix of quiet and same speaker evaluation
Figure A.1.12: Accuracy and ’false positive rate’ of noisy and same speaker evaluation
Figure A.1.13: Confusion matrix of noisy and same speaker evaluation
Figure A.1.14: Accuracy and ’false positive rate’ of quiet and different speaker evaluation
Figure A.1.15: Confusion matrix of quiet and different speaker evaluation
Figure A.1.16: Accuracy and ’false positive rate’ of noisy and different speaker evaluation
Figure A.1.17: Confusion matrix of noisy and different speaker evaluation
Figure A.1.18: Accuracy and ’false positive rate’ of quiet and same speaker evaluation
Figure A.1.19: Confusion matrix of quiet and same speaker evaluation
Figure A.1.20: Accuracy and ’false positive rate’ of noisy and same speaker evaluation
Figure A.1.21: Confusion matrix of noisy and same speaker evaluation
Figure A.1.22: Accuracy and ’false positive rate’ of quiet and different speaker evaluation
Figure A.1.23: Confusion matrix of quiet and different speaker evaluation
Figure A.1.24: Accuracy and ’false positive rate’ of noisy and different speaker evaluation
Figure A.1.25: Confusion matrix of noisy and different speaker evaluation
Figure A.1.26: Confusion matrices of GSC evaluation
Figure A.1.27: Inter-class and in-class distribution of STP methods without length normalization
The Phoneme Error Rate of the STP phoneme model on the LibriSpeech test split was 19%.
[Panels: accuracy_as_first, accuracy_as_some_point, accuracy_as_majority; efficacy versus distance for DDL-COS, DDL-EU, DTW-MFCC-EU, Ghost-DTW-MFCC-CHE, STP-BS and STP-SL]
Figure A.1.28: Efficacy of quiet and same speaker evaluation
Figure A.1.29: Efficacy of noisy and same speaker evaluation
Figure A.1.30: Efficacy of quiet and different speaker evaluation
Figure A.1.31: Efficacy of noisy and different speaker evaluation
TRITA-EECS-EX-2021:255
www.kth.se