26 October 2019
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Presented by Zhepei Wang
Adversarial Attacks on Automatic Speech Recognition
Overview
• Background
  • Automatic Speech Recognition (ASR) framework
  • Adversarial attacks on audio
• End-to-end white-box targeted attack
  • “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text”, Carlini et al.
• Attacks embedded in songs and with noise
  • “CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition”, Yuan et al.
Automatic Speech Recognition System
• Input: time-domain 1-D vector $x \in \mathbb{R}^T$
• Features: time-frequency domain Mel-Frequency Cepstral Coefficients (MFCC)
[Figure: ASR pipeline. Audio → Feature Extraction → Acoustic Models / Language Models → Sequence of Distributions → Decoder → Text]
Automatic Speech Recognition System
• Acoustic/Language Models: GMM-HMM/RNN
• Decoder: Greedy/Beam-search
Adversarial Attacks on ASR
• $x' = x + \delta \in \mathbb{R}^T$, $f(x') \neq f(x)$
• Perturbation $\delta$ has to be imperceptible
• Challenges
  • High dimensionality in the time domain
  • Nonlinearity in MFCC
  • Different decoding algorithms
  • Ability to deliver in a complex physical environment
Figure adapted from Carlini et al.
Prior Studies
• Vulnerabilities in acoustic devices
  • DolphinAttack (Zhang et al.)
• GMM-based systems
  • Hidden Voice Commands (Carlini et al.)
• Targeted attacks on similar phrases
  • Houdini (Cisse et al.)
Targeted Attack on ASR
• White-box attack on DeepSpeech
• Targeted attack with arbitrary desired output
• Embedded in speech/non-speech
• Time-domain samples generated simultaneously
Recap: ASR Pipeline
• Output of the DNN is a sequence of character distributions
• Each input frame corresponds to an output frame
  • Length of the output sequence may differ from that of the ground-truth sequence
Connectionist Temporal Classification (CTC)
• Idea: reduce a longer sequence to a shorter one
• Tokens are repeated for characters pronounced over more than one frame
• Blank token $\epsilon$ helps merge repeated tokens
[Figure: alignment of the audio sequence to the character sequence, adapted from Hannun]
CTC Alignment
• Given
  • Ground-truth sequence $p$
  • Model’s output $y = f(x) \in \mathbb{R}^{V \times L}$
  • $V$ is the vocab size and $L$ is the number of frames
• A sequence $\pi$ is an alignment of $p$ with respect to $y$ if
  • $\pi$ reduces to $p$
  • $\mathrm{len}(\pi) = \mathrm{len}(y)$
• Alignment from $\pi$ to $p$ is many-to-one
  • E.g.: $p = ab$, $y \in [0,1]^{3 \times 3}$
  • $\pi \in \{aab, abb, \epsilon ab, a\epsilon b, ab\epsilon\}$
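The many-to-one example above can be checked by brute force. This is a minimal Python sketch, not code from the talk: `reduce_alignment` and `alignments` are hypothetical helper names, and `_` stands in for the blank token ϵ.

```python
from itertools import product

BLANK = "_"  # assumed stand-in for the blank token (epsilon on the slide)

def reduce_alignment(pi):
    """Collapse an alignment: merge adjacent repeats, then drop blanks."""
    out = []
    prev = None
    for tok in pi:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)

def alignments(p, vocab, num_frames):
    """Enumerate every length-num_frames sequence that reduces to p."""
    return [
        "".join(pi)
        for pi in product(vocab, repeat=num_frames)
        if reduce_alignment(pi) == p
    ]

# Three frames, vocabulary {a, b, blank}, target "ab".
result = alignments("ab", vocab=["a", "b", BLANK], num_frames=3)
print(sorted(result))
```

Exhaustive enumeration grows exponentially with the number of frames, so it is only useful for checking small cases like this one.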
CTC: NLL
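• Considers the probability under all possible alignments
• Implemented with dynamic programming
$$\mathbb{P}(p \mid y) = \sum_{\pi \in \Pi(p, y)} \mathbb{P}(\pi \mid y) = \sum_{\pi \in \Pi(p, y)} \prod_{i} y^{i}_{\pi_i}$$
$$\ell(f(x), p) = \mathrm{CTC}(f(x), p) = -\log \mathbb{P}(p \mid f(x))$$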
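A minimal sketch of the dynamic program, assuming `y` is a V×L NumPy array whose column i is the distribution for frame i and that index 0 is the blank token. The name `ctc_nll` is illustrative. The sketch works in probability space for readability; practical implementations usually work in log space to avoid underflow on long inputs.

```python
import numpy as np

def ctc_nll(y, target, blank=0):
    """Return -log P(target | y) using the CTC forward recursion.

    y: array of shape (V, L), column i is the distribution for frame i.
    target: list of non-blank token indices.
    """
    num_frames = y.shape[1]
    # Extended target with blanks interleaved: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    S = len(ext)

    alpha = np.zeros((num_frames, S))
    alpha[0, 0] = y[ext[0], 0]
    if S > 1:
        alpha[0, 1] = y[ext[1], 0]

    for i in range(1, num_frames):
        for s in range(S):
            total = alpha[i - 1, s]
            if s >= 1:
                total += alpha[i - 1, s - 1]
            # A blank may be skipped only between two different characters.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[i - 1, s - 2]
            alpha[i, s] = total * y[ext[s], i]

    # Valid alignments end on the last character or the trailing blank.
    prob = alpha[-1, S - 1]
    if S > 1:
        prob += alpha[-1, S - 2]
    return -np.log(prob)

# Uniform 3x3 output, target "ab" encoded as [1, 2].
uniform = np.full((3, 3), 1.0 / 3.0)
nll = ctc_nll(uniform, [1, 2])
```

With a uniform 3×3 output, each alignment has probability 1/27, so the five alignments of `ab` give a total of 5/27.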
Decoding
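• Ideal: $C^*(x) = \arg\max_p \mathbb{P}(p \mid f(x)) = \arg\max_p \sum_{\pi \in \Pi(p, f(x))} \mathbb{P}(\pi \mid f(x))$
• Greedy: $C_{\text{greedy}}(x) = \mathrm{reduce}\big(\arg\max_\pi \prod_{i=1}^{L} \mathbb{P}_t(\pi_i \mid f(x))\big)$
• Beam-search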
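Greedy decoding can be sketched in a few lines: since the product factorizes over frames, the best alignment is the per-frame argmax, which is then reduced. The function name `greedy_decode`, the toy matrix, and the choice of index 0 for the blank are assumptions made for illustration.

```python
import numpy as np

def greedy_decode(y, vocab, blank=0):
    """Take the most likely token in each frame, then collapse the alignment."""
    best = np.argmax(y, axis=0)  # one token index per frame
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)

vocab = ["_", "a", "b"]  # "_" is the assumed blank symbol
y = np.array([
    [0.1, 0.1, 0.7, 0.1],  # blank
    [0.8, 0.7, 0.2, 0.1],  # a
    [0.1, 0.2, 0.1, 0.8],  # b
])
decoded = greedy_decode(y, vocab)
print(decoded)
```

Greedy decoding only follows the single most likely alignment, so it can differ from the ideal decoder, which sums over all alignments of each candidate phrase.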
Distortion Metric
$$dB(x) = \max_i 20 \log_{10}(x_i)$$
$$dB_x(\delta) = dB(\delta) - dB(x)$$
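A short sketch of the metric, assuming the audio is a NumPy array of signed samples. Taking the absolute value before the logarithm is an assumption added here so that negative samples are handled; the formula on the slide is written with $x_i$ directly.

```python
import numpy as np

def db(x):
    """Peak level in decibels: max_i 20 * log10(|x_i|)."""
    return 20.0 * np.log10(np.max(np.abs(x)))

def db_relative(delta, x):
    """Distortion of delta relative to the source audio x."""
    return db(delta) - db(x)

x = np.array([0.0, 0.5, -1.0, 0.25])
delta = 0.01 * np.array([1.0, -1.0, 0.5, 0.0])
rel = db_relative(delta, x)
print(rel)
```

A perturbation whose peak is 100 times smaller than the peak of the source sits 40 dB below it, so more negative values mean quieter perturbations.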
Initial Formulation
$$\text{minimize } \|\delta\|_2^2 + c \cdot \ell(x + \delta, t) \quad \text{s.t. } dB_x(\delta) \le \tau$$
• Using the $\ell_2$-norm for better convergence
• Starting with a large $\tau$ and gradually decreasing its value
• Optimized with Adam with a learning rate of 10 and a maximum of 5,000 iterations
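The constraint $dB_x(\delta) \le \tau$ is equivalent to a bound on the peak amplitude of $\delta$: $\max_i |\delta_i| \le \max_i |x_i| \cdot 10^{\tau/20}$. One possible way to enforce it during optimization is to clip $\delta$ to that bound after every update. The slides do not say how the constraint is enforced, so the clipping step below is an assumption, and `amplitude_bound` and `project` are hypothetical names.

```python
import numpy as np

def amplitude_bound(x, tau_db):
    """Largest |delta_i| allowed by dB_x(delta) <= tau_db."""
    return np.max(np.abs(x)) * 10.0 ** (tau_db / 20.0)

def project(delta, x, tau_db):
    """Clip delta elementwise so that it satisfies the dB constraint."""
    bound = amplitude_bound(x, tau_db)
    return np.clip(delta, -bound, bound)

x = np.array([1.0, -0.5, 0.25])
delta = np.array([0.5, -0.3, 0.05])
clipped = project(delta, x, tau_db=-20.0)  # bound is 0.1 for a peak of 1.0
print(clipped)
```

Lowering `tau_db` step by step shrinks the bound, which mirrors the schedule of starting with a large $\tau$ and gradually decreasing it.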
Initial Formulation: Issues
$$\text{minimize } \|\delta\|_2^2 + c \cdot \ell(x + \delta, t) \quad \text{s.t. } dB_x(\delta) \le \tau$$
• CTC loss penalizes labels that are already correct
  • $\ell'(y, t) = \max\big(\max_{t' \neq t} y_{t'} - y_t,\ 0\big)$
• $c$ has to be large enough so that the most difficult character can be transcribed correctly
  • Different $c_i$’s for each frame $\pi_i$
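The per-frame loss $\ell'$ is zero as soon as the target token has the highest score in that frame, so frames that are already correct stop contributing. A minimal sketch for a single frame, assuming `y` is a 1-D NumPy array of scores over the vocabulary and `t` is the target index (the name `hinge_loss` is illustrative):

```python
import numpy as np

def hinge_loss(y, t):
    """l'(y, t) = max(max_{t' != t} y_{t'} - y_t, 0) for one frame."""
    others = np.delete(y, t)  # scores of every token except the target
    return max(float(np.max(others) - y[t]), 0.0)

frame = np.array([0.2, 0.7, 0.1])
loss_correct = hinge_loss(frame, 1)  # target already has the top score
loss_wrong = hinge_loss(frame, 0)    # target trails the best competitor
print(loss_correct, loss_wrong)
```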
Advanced Formulation
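$$\text{minimize } \|\delta\|_2^2 + \sum_{i=1}^{L} c_i \cdot \ell'(f(x + \delta)_i, \pi_i) \quad \text{s.t. } dB_x(\delta) \le \tau$$
• First, solve with the initial formulation to obtain $\delta_0$ and $\pi_0 = \arg\max_\pi \prod_{i=1}^{L} \mathbb{P}_t(\pi_i \mid f(x))$
• Then, solve with the advanced formulation with $\pi = \pi_0$ and $\delta$ initialized to $\delta_0$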
Experimental Conditions
• Mozilla Common Voice dataset
  • First 100 test instances
  • For each instance, target 10 different transcriptions
• Evaluation
  • Success rate: an attack counts as a success only if the transcription exactly matches the target phrase
  • Mean perturbation in dB
Experimental Results
• 100% success rate with mean distortion from -31 dB to -38 dB
  • Roughly equivalent to ambient noise in a quiet room
• Longer target phrases are more difficult
• Longer source phrases are easier to transform
Figure adapted from Carlini et al.
Additional Experiments
• Shortest audio clip
  • $\mathrm{len}(\pi) = \mathrm{len}(p) \implies \pi = p$
  • No decoder involved
  • Still effective with mean distortion of -18 dB
• Non-speech audio
  • Mean distortion of -20 dB
• Targeting silence
  • Mean distortion of -45 dB
  • Partially explains why longer source sequences are easier to transform
    • Silence the frames that are not required and keep the subsequence matching the target
    • For shorter source sequences, new frames need to be synthesized to match the output
Properties of Generated Examples
• Comparison with FGSM (targeted)
  • Nonlinearity in MFCCs and LSTMs makes it challenging for FGSM
  • Local linearity of NNs is not sufficient to generate targeted examples
Properties of Generated Examples
• Pointwise noise
  • Pointwise random noise will cause $x'$ to lose its adversarial label
  • Expectation over Transforms may get around this problem with 10 dB larger distortion
• MP3 compression
  • Adversarial examples that survive compression require approximately 15 dB larger distortion
Takeaways
• Contributions
  • End-to-end attack with arbitrary target sequences
  • Alternative formulation to NLL-based loss
  • Efficiency with non-speech audio and targeting silence
• Concerns
  • Robustness under noise and ability to be transmitted under real-world conditions
  • Studies of transferability
CommanderSong
• White-box attack on Kaldi
• “Hide” adversarial samples
  • Embed perturbations in a song
• Transmit in a complicated physical environment
  • Noise modeling
• Impact a large number of victims
  • Playing over video and radio
Attack Formulation
• Using Text-to-speech (TTS) tools to obtain command audio
• pdf-id sequence matching
Figure adapted from Yuan et al.
pdf-id Sequence Matching
• pdf-id uniquely determines a phoneme with its transition state
• Let $A = f(x) \in [0,1]^{N \times K}$ be the DNN output for the song $x$ with $N$ frames and $K$ pdf-ids
  • $g(x)_i = \arg\max_j A_{ij}$
• Let $b = (b_1, \ldots, b_N)$ be the highest probability pdf-id sequence for the command audio $y$
Wav-To-API (WTA) Attack
• Aim to make $g(x + \delta)$ close to $b$ with minimal number of different phonemes
$$\text{minimize}_\delta \|g(x + \delta) - b\|_1 \quad \text{s.t. } |\delta| \le \tau$$
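The quantity being minimized can be sketched as follows, assuming the DNN output is a NumPy array with one row per frame and one column per pdf-id, so that $g$ is a row-wise argmax. The function names and the toy matrix are illustrative and not taken from the CommanderSong code.

```python
import numpy as np

def pdf_id_sequence(A):
    """g(x)_i = argmax_j A[i, j]: most likely pdf-id for each frame."""
    return np.argmax(A, axis=1)

def l1_distance(A, b):
    """||g(x) - b||_1 between predicted and target pdf-id sequences."""
    return int(np.sum(np.abs(pdf_id_sequence(A) - b)))

def mismatches(A, b):
    """Number of frames whose pdf-id differs from the target."""
    return int(np.sum(pdf_id_sequence(A) != b))

# Toy output: 3 frames, 4 pdf-ids.
A = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
])
b = np.array([1, 2, 3])  # target pdf-id sequence from the command audio
print(pdf_id_sequence(A), l1_distance(A, b), mismatches(A, b))
```

The argmax is not differentiable, so a gradient-based attack needs a differentiable surrogate of this objective; the sketch only evaluates how far a candidate is from the target.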
Wav-Air-API (WAA) Attack
• Adversarial example is $x' = x + \mu$
• $n(t) \sim U(-N, N)$
• Background noise is not modeled, since under strong background noise even the command $y$ cannot be recognized by the system
• Major impacts come from the distortion of the receiver
$$\text{minimize}_\mu \|f(x + \mu + n) - f(y)\|_1 \quad \text{s.t. } |\mu| \le \tau$$
Experiments: WTA
• 26 songs from the Internet in different categories
• 12 sentences as commands
• Signal-to-noise ratio (SNR): $\mathrm{SNR} = 10 \log_{10}(P_{x(t)} / P_{\delta(t)})$
Table adapted from Yuan et al.
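The SNR formula can be evaluated directly on sample arrays. The sketch below assumes that the power of a signal is its mean squared amplitude, which the slide does not spell out; `snr_db` is a hypothetical name.

```python
import numpy as np

def snr_db(x, delta):
    """SNR in dB between the song x and the perturbation delta."""
    p_signal = np.mean(x ** 2)
    p_noise = np.mean(delta ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

x = np.ones(100)
delta = 0.1 * np.ones(100)
value = snr_db(x, delta)  # power ratio of 100, i.e. 20 dB
print(value)
```

A higher SNR means the perturbation is weaker relative to the song and therefore harder to notice.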
Experiments: WAA
• Songs played with different speakers in a meeting room
• Audio received by an iPhone
• Testing with 2 of the 12 commands from WTA
• SNR significantly lower than WTA
Table adapted from Yuan et al.
Transferability
• Kaldi -> iFLYTEK
  • Tested with three examples
Table adapted from Yuan et al.
• Kaldi -> DeepSpeech
  • DeepSpeech cannot correctly decode CommanderSong examples
• DeepSpeech -> Kaldi
  • 10 adversarial samples generated by CommanderSong (either WTA or WAA)
  • Modified with Carlini’s algorithm until DeepSpeech can recognize them
  • Modified samples successfully recognized by Kaldi with WTA attack
Experimental Results: Automated Spreading
• Online sharing
  • CommanderSong uploaded as a video on YouTube
  • Bose Companion 2 speaker with iFLYTEK Input on an LG V20
  • Command decoded successfully
• Radio broadcasting
  • CommanderSong broadcast at FM 103.4 MHz
  • Radio set up at the corresponding frequency
  • iFLYTEK Input on several smartphones
  • Command always successfully recognized
Defense
• Audio turbulence
  • Intuition: CommanderSong suffers from noise, but pure commands can still be recognized
  • Compare results with and without applying noise
  • Lower SNR indicates higher noise level
• WTA suffers from noise
• WAA is robust (since it is trained with random noise)
Figure adapted from Yuan et al.
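The noise-injection half of the audio turbulence check can be sketched as below. The slides do not state which noise distribution is used, so zero-mean Gaussian noise is an assumption here, and `add_noise_at_snr` is a hypothetical helper. The comparison step itself would run the same ASR system on `x` and on `noisy` and flag the input if the two transcriptions differ.

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng):
    """Add zero-mean Gaussian noise scaled to a target SNR in dB."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
    return x + noise

rng = np.random.default_rng(0)
t = np.arange(100_000) / 16_000.0      # time axis at an assumed 16 kHz rate
x = np.sin(2.0 * np.pi * 440.0 * t)    # stand-in for an input recording
noisy = add_noise_at_snr(x, snr_db=15.0, rng=rng)
```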
Defense
• Audio squeezing
  • Downsample the input audio by a factor of $M$
  • Compare results with and without downsampling
  • Effective in defending against both WTA and WAA
Figure adapted from Yuan et al.
Takeaways
• Contributions
  • Embedding adversarial examples in songs
  • Noise model to improve the robustness under random noise
  • Ability to propagate via media
  • Transferability between different ASR frameworks
• Concerns
  • Oversimplified assumptions for noise
    • Alternative ASR models may be able to recognize pure voice commands with ambient noise
  • Experimenting with different optimization strategies