STRIP: A Defence Against Trojan Attacks on Deep Neural Networks
Yansong Gao, Chang Xu, Derui Wang, Shiping Chen, Damith C. Ranasinghe, Surya Nepal
Presented by Damith C. Ranasinghe
Slide 2
The University of Adelaide: founded in 1874, the third-oldest university in Australia.
Slide 3
2017: Deep Neural Networks are shown to be vulnerable to Trojan Attacks.
“backdoor”
Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying vulnerabilities in the machine learning model supply chain.
Chen, X., Liu, C., Li, B., Lu, K., & Song, D. (2017). Targeted backdoor attacks on deep learning systems using data poisoning.
[Figure: face-recognition example. Without the trigger, Alice is classified as Alice and Bob as Bob; with the trigger present, both are classified as B. Gates.]
Trojan Model Behaviour
On clean inputs the “backdoor” stays dormant and the model delivers state-of-the-art performance (Chen et al., 2017).
The secret physical trigger is known only by the attacker. Trojan inputs (inputs stamped with the trigger) activate the “backdoor”: the Trojaned model misclassifies them to the class targeted by the attacker. Attack success rates are often 100%.
Input-agnostic attack: misclassify all inputs carrying the trigger to a targeted class.
Consequences: Input-agnostic Trojan Attack
Slide 7
Face Recognition: any input stamped with the trigger is classified as the targeted class (Chen et al., 2017).
Slide 8
Self-driving car: a traffic sign stamped with the trigger is classified as the targeted class (Gu et al., 2017).
Inserting a Trojan into a Model
1. Stamp the trigger onto a small fraction of the training samples; less than 10%, often 1% or 2%, is enough.
2. Change the label of each Trojaned input to the target class (e.g., B. Gates) and train the model.
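A minimal NumPy sketch of this poisoning step, assuming integer labels and a square trigger stamped in the bottom-right corner; the poison rate, trigger placement, and function name are illustrative choices, not the authors' code:

```python
import numpy as np

def poison_dataset(x_train, y_train, trigger, target_class, poison_rate=0.02, seed=0):
    """Stamp `trigger` onto a small random fraction of training samples and
    relabel them to `target_class` (an illustrative sketch of Trojan insertion).

    x_train: float array of shape (N, H, W, C), values in [0, 1]
    y_train: int array of shape (N,)
    trigger: float array of shape (h, w, C), stamped into the bottom-right corner
    """
    rng = np.random.default_rng(seed)
    x, y = x_train.copy(), y_train.copy()
    n_poison = int(poison_rate * len(x))              # 1-2% is often enough
    idx = rng.choice(len(x), n_poison, replace=False)
    h, w = trigger.shape[:2]
    x[idx, -h:, -w:, :] = trigger                     # stamp the trigger
    y[idx] = target_class                             # flip labels to the target class
    return x, y
```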
Trojan Attack Threats
DL requires a huge amount of labeled data, computational power, and expertise to achieve state-of-the-art results, and often only a small fraction of the data needs to be poisoned. Attack surfaces include:
• Outsourcing
• Transfer Learning
• Insider threat
• Federated learning
Detecting Trojan Attacks is Challenging
1. No access to Trojaned samples, and the trigger is often inconspicuous (e.g., a Post-it note trigger).
2. The Trojan trigger can be of any shape, size, and pattern, freely chosen by the attacker (impossible to guess) [Gu et al., “BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain,” Aug. 2017].
3. Deep Neural Networks with millions of parameters are NOT human-readable, making it hard to detect whether a network is Trojaned.
4. A Trojaned DNN has virtually identical accuracy to a benign (NOT Trojaned) model; both deliver state-of-the-art accuracy, so model prediction accuracy on test data does not help.
Trojan Defence Techniques
• Fine-pruning (Liu et al., 2018 RAID): offline & white box.
• Model inspection via trigger reverse engineering (Wang et al., 2019 S&P; Liu et al., 2019 CCS): offline & white box.
• Input inspection: online and black box (detection). Our work.
STRIP: Strong Intentional Perturbation
Observation: As long as the trigger is present (a Trojaned input), the prediction of a Trojaned model is insensitive to input perturbations.
Question: Could the input-agnostic strength of a Trojan attack be a weakness we can exploit to detect the attack?
STRIP: Observation
Create strong perturbations by superimposing other images onto the input.
Clean model, clean input: as the perturbation grows, the prediction drifts from “This is Alice” to “Maybe this is Alice” to “Who is this person???”; confidence collapses.
Trojaned model, input carrying the trigger: the prediction stays locked to the attacker's target class regardless of the perturbation.
Slide 24
Threat Model
• No access to information about the Trojan trigger, the poisoning process, or the network architecture (black-box).
• Has a small, clean, and labelled test dataset to evaluate the model [1].
[1] Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., & Zhao, B. Y. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE Symposium on Security & Privacy.
STRIP: Approach
Perturb each incoming input and place a detection boundary on the entropy of the model's predictions: perturbed copies of a clean input produce varying, high-entropy predictions, while perturbed copies of an input carrying the trigger keep predicting the target class, producing low entropy.
output entropy < detection boundary ? Trojaned : Clean
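A compact sketch of this test, assuming a Keras-style `model.predict` that returns softmax probabilities; the blend ratio `alpha`, the number of perturbations `n`, and the function names are our illustrative choices, not the authors' exact implementation:

```python
import numpy as np

def strip_entropy(model, x, clean_set, n=100, alpha=0.5, rng=None):
    """Average Shannon entropy of predictions over n perturbed copies of x.

    Each copy superimposes a randomly drawn clean image onto x (a linear
    blend here; a sketch of the idea, not the authors' exact perturbation).
    """
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(clean_set), size=n, replace=False)
    perturbed = alpha * x[None] + (1.0 - alpha) * clean_set[idx]  # (n, H, W, C)
    probs = model.predict(perturbed)                # softmax outputs, shape (n, M)
    entropies = -np.sum(probs * np.log2(probs + 1e-12), axis=1)   # per-copy entropy
    return entropies.mean()

def is_trojaned(model, x, clean_set, threshold):
    # Low entropy: predictions barely move under perturbation => likely Trojaned.
    return strip_entropy(model, x, clean_set) < threshold
```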
STRIP System Overview
Experimental Evaluation
Dataset   # labels   Image size   # samples   Model architecture                                  Total parameters
MNIST     10         28×28×1      60,000      2 Conv + 2 Dense                                    80,758
CIFAR10   10         32×32×3      60,000      8 Conv + 3 Pool + 3 Dropout + 1 Flatten + 1 Dense   308,394
GTSRB     43         32×32×3      51,839      ResNet20                                            276,587
Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., & Zhang, X. (2018). Trojaning Attack on Neural Networks. In Network and Distributed System Security Symposium (NDSS).
Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., & Zhao, B. Y. (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In Proceedings of the 40th IEEE Symposium on Security and Privacy.
[Figure: the Trojaned DNNs and the two triggers used, trigger 1 and trigger 2, drawn from the works cited above.]
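For concreteness, a Keras sketch matching the "2 Conv + 2 Dense" MNIST row of the table above; filter counts, kernel sizes, and the pooling layers are our assumptions (the table fixes only layer counts, so this will not reproduce the 80,758-parameter total exactly):

```python
from tensorflow import keras
from tensorflow.keras import layers

# "2 Conv + 2 Dense" MNIST classifier; layer sizes are illustrative guesses.
model = keras.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # 10 MNIST classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```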
Experimental Evaluation

Dataset   Clean model classification rate (clean input)   Trojaned model classification rate (clean input)   Trojaned model attack success rate (Trojaned input)
MNIST     98.62%                                           99.86%                                             99.86%
MNIST     98.62%                                           98.86%                                             100%
CIFAR10   88.27%                                           87.23%                                             100%
CIFAR10   88.27%                                           87.34%                                             100%
GTSRB     96.38%                                           96.22%                                             100%

(The two MNIST rows and two CIFAR10 rows correspond to the two triggers shown earlier.)
Trojan and Clean Inputs Entropy Distribution
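The entropy on the x-axis of these distributions is Shannon entropy over the model's softmax output, averaged over the $N$ perturbed replicas of an input; the notation below is ours, following the paper's definition:

\[
\mathbb{H}_n = -\sum_{i=1}^{M} y_i^{(n)} \log_2 y_i^{(n)},
\qquad
\mathbb{H} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{H}_n ,
\]

where $y_i^{(n)}$ is the predicted probability of class $i$ for the $n$-th perturbed replica and $M$ is the number of classes. Low $\mathbb{H}$ flags a likely Trojaned input.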
Detection Capability
False Acceptance Rate (FAR) and False Rejection Rate (FRR) of the STRIP system.
The detection boundary (threshold) sits on the input entropy:
input entropy < threshold ? Trojaned : Clean
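Since the defender only holds clean data, the threshold can be chosen from the clean-input entropy distribution alone, so that a preset fraction of clean inputs (the FRR, e.g. 0.5%) falls below it. A sketch reusing `strip_entropy` from above; the raw percentile here is our simplification of fitting a parametric distribution to the clean entropies:

```python
import numpy as np

def detection_threshold(model, clean_inputs, clean_set, frr=0.005):
    """Entropy threshold such that a fraction `frr` of clean inputs falls
    below it (the preset False Rejection Rate). Uses the percentile of the
    empirical clean-entropy distribution as a simple stand-in."""
    ents = np.array([strip_entropy(model, x, clean_set) for x in clean_inputs])
    return np.percentile(ents, 100.0 * frr)   # frr=0.005 -> 0.5th percentile
```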
Trojan Variants/Adaptive Attacks
Input-agnostic Trojan attacks: tested above. How about these?
1. Large Trigger Sizes (Chen et al. 2017 arXiv; Eykholt et al. 2018 CVPR)
We set the transparency to 70% and use 100% overlap (the trigger covers the entire image). Both FAR and FRR are 0%.
2. Trigger Transparency
Transparency levels tested: 90%, 80%, 70%, 60%, 50%, with the FRR preset to 0.5%.
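A one-liner showing what trigger transparency means here, under the assumption that the trigger is alpha-blended into the input (our reading of the transparency settings above):

```python
import numpy as np

def stamp_with_transparency(x, trigger, transparency=0.7):
    """Blend the trigger into input x; higher transparency means a fainter
    trigger (transparency=1.0 leaves x unchanged). An illustrative blend rule."""
    return transparency * x + (1.0 - transparency) * trigger
```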
3. Separate Triggers to Separate Target Labels
Each digit (0 to 9) is a trigger targeting a different class in CIFAR10.
Given a preset FRR of 0.5%, the worst-case FAR is 0.10%, for the trigger targeting ‘airplane’.
4. Separate Triggers to the Same Target Label
Each digit (0 to 9) is a trigger targeting the same class in CIFAR10.
For any trigger, we achieve 0% for both FAR and FRR.
Slide 49
Contributions
1. A new defence concept: exploit information leaked from misclassification distributions.
2. Run-time detection capability.
3. Operates in a black-box setting.
4. Plug-and-play compatible with pre-existing DNN systems in deployment.
5. Full source code release: https://github.com/garrisongys/STRIP
Future Work
STRIP has been tested on the vision domain. Text? Audio?
Our initial work: https://arxiv.org/abs/1911.10312
Thank you
Damith Ranasinghe
The University of Adelaide
The School of Computer Science
Damith.ranasinghe@adelaide.edu.au