crafting adversarial attacks on recurrent neural networks...

1
0 2 4 6 8 10 12 14 1 6 11 16 21 26 31 Number of Words Changed Crafting Adversarial Attacks on Recurrent Neural Networks (RNNs) Mark Anderson, Andrew Bartolo, Pulkit Tandon {mark01, bartolo, tpulkit}@stanford.edu Summary Models Data & Features Intuitive Black-Box Adversaries • RNNs are used in a variety of applications to recognize and predict sequential data. However, they are vulnerable to adversaries; e.g., a cleverly-placed word may change the predicted sentiment of a movie review from positive to negative. • We built Naïve Bayes, SVM, and LSTM models to predict movie review sentiment and built two black-box adversaries. We show that NB and SVM are sensitive to these attacks while LSTMs are relatively robust. • Finally, we implemented a recent Jacobian-based technique for generating adversaries for LSTM, and found that LSTM performance falls below 40% by replacing an average of 8.7 words. We also found examples where the classification error was brought on by a seemingly-random word, indicating that the LSTM might not be truly learning sentiment. We train on a pre-labeled set of 12,500 positive and 12,500 negative movie reviews, collected from IMDb [1]. Reviews averaged 233 words. For compatibility with the NumPy and TensorFlow input models, SVM and LSTM reviews are capped at 250 words. We strip all punctuation from the reviews, but leave stop words. Training accuracy vs. # iterations, 64- and 128-hidden-unit LSTM The Word2Vec + LSTM architecture [3] Single-Layer RNN with LSTMs Linear SVM Naïve Bayes with Laplace Smoothing PCA run over the dataset. Features: 1. Bag-of-Words One-hot vector – size of the dictionary (400k words). Used for Naïve Bayes and SVM models. 2. Word Vectors [2] Pre-determined embedding in 50- dimensional space. Used for LSTM model. Analysis Jacobian Saliency Map Adversary [3] Input: f, , Algorithm: 1. y := f( ) 2. := 3. = )* + ), 4. while f( )==y: 5. select a word i in sequence 6. ∶= 567 , 7. [] := 8. end 9. return [2] =movie , 0 [2] = “ The movie is terrific” y= Pos y= Neg True Positive 54 65 True Negative 110 62 f: Prediction Model X: Example Sentence D: Dictionary [1] A. Maas, R. Daly, P. Pham, D. Huang, A. Ng, and C. Potts, “Learning Word Vectors for Sentiment Analysis,” In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ‘06, 2011, pp. 142-150. [2] A. Deshpande, “Sentiment Analysis with LSTMs,” Oct. 3, 2017. [Online]. Available: https://github.com/adeshpande3/LSTM-Sentiment-Analysis. [3] N. Papernot, P. McDaniel, A. Swami, and R. Harang. “Crafting Adversarial Input Sequences for Recurrent Neural Networks.” Apr. 28, 2016. References Histogram of Adversarial Samples Average # Words Changed: 8.7 89.24% 80.98% 69.64% 64.27% 31.66% 14.05% 98.19% 86.51% 81.65% 80.49% 54.34% 32.95% 94.36% 81.77% 79.08% 77.76% 68.17% 59.36% 39.86% 0% 20% 40% 60% 80% 100% Training Testing (no adversary) Testing,tack-on Testing, 1- strongest Testing, 3- strongest Testing, 5- strongest Testing, JSMA Model Accuracy vs. Adversary Naïve Bayes SVM LSTM SVM and NB perform similarly to LSTM on the test set without adversary. This implies the data is well-segregated - independently seen in PCA plot. The LSTM is most robust to our black-box adversaries. Black-box adversaries were words strongly associated with sentiment. Model accuracies fell monotonically with increasing adversary strength. Jacobian-based methods do not always change the most positive/negative words. Seemingly-random word injection changes the prediction, leading us to question whether LSTMs are actually learning the sentiment; e.g.: This excellent movie made me cry! this excellent tsunga telsim grrr cry Implement a deeper LSTM with mean-pooling layers Optimized memory allocation in TensorFlow code for JSMA method Adversarial training of LSTM network based on JSMA adversaries Use Stanford NLP Parser to automate grammar checking Future Work Based on Naïve Bayes ”strongest” words – words most polarizing toward positive or negative classification Adversarial Words: Positive Sway: ”edie,” “antwone,” “din,” “gunga,” ”yokai” Negative Sway: “boll,” “410,” “uwe,” “tashan,” “hobgoblins” Tack-On: replace first word with random adversarial word N Strongest-Word-Swap: replace review’s N strongest word(s) with random adversarial word(s); experimented for N<=5 We performed a hyperparameter search and settled on an LSTM with a softmax output layer and 64 hidden units. For the linear SVM, we swept learning rate and tried different features and kernels. The Naïve Bayes model is multinomial and uses log-probabilities. Accuracy after JSMA = 39.9%

Upload: others

Post on 16-Oct-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Crafting Adversarial Attacks on Recurrent Neural Networks ...cs229.stanford.edu/proj2017/final-posters/5148050.pdf · Crafting Adversarial Attacks on Recurrent Neural Networks (RNNs)

02468101214

1 6 11 16 21 26 31

NumberofWordsChanged

HistogramoverAdversarialExamples

CraftingAdversarialAttacksonRecurrentNeuralNetworks(RNNs)MarkAnderson,AndrewBartolo,Pulkit Tandon

{mark01,bartolo,tpulkit}@stanford.edu

Summary Models

Data&Features

IntuitiveBlack-BoxAdversaries•RNNsareusedinavarietyofapplicationstorecognizeandpredictsequentialdata.However,theyarevulnerabletoadversaries; e.g.,acleverly-placedwordmaychangethepredictedsentimentofamovie reviewfrompositivetonegative.•WebuiltNaïveBayes,SVM,andLSTMmodelstopredictmoviereviewsentimentandbuilttwoblack-boxadversaries.WeshowthatNBandSVMaresensitivetotheseattackswhileLSTMsarerelativelyrobust.•Finally,weimplementedarecentJacobian-basedtechniqueforgeneratingadversariesforLSTM,andfoundthatLSTMperformancefallsbelow40%byreplacinganaverageof8.7words.Wealsofoundexampleswheretheclassificationerrorwasbroughtonbyaseemingly-randomword,indicatingthattheLSTMmightnotbetrulylearningsentiment.

Wetrainonapre-labeledsetof12,500positiveand12,500negativemoviereviews,collectedfromIMDb[1].Reviewsaveraged233words.ForcompatibilitywiththeNumPy andTensorFlow inputmodels,SVMandLSTMreviewsarecappedat250words.Westripallpunctuationfromthereviews,butleavestopwords.

Trainingaccuracyvs.#iterations,64- and128-hidden-unitLSTM

TheWord2Vec+LSTMarchitecture[3]• Single-LayerRNNwithLSTMs• LinearSVM• NaïveBayeswithLaplace

Smoothing

PCArunoverthedataset.

Features:1. Bag-of-Words

One-hotvector– sizeofthedictionary(400kwords).UsedforNaïveBayesandSVMmodels.

2. WordVectors[2]Pre-determinedembeddingin50-dimensionalspace.UsedforLSTMmodel.

Analysis

JacobianSaliencyMapAdversary[3]

Input:f,�⃗�, 𝐷Algorithm:1. y:=f(�⃗�)2. 𝑥∗ :=�⃗�3. 𝐽' �⃗� 𝑦 =

)*+),⃗

4. whilef(𝑥∗)==y:5. selectawordi insequence𝑥∗6. 𝑤 ∶= 𝑎𝑟𝑚𝑖𝑛5⃗67 𝑠𝑖𝑔𝑛 𝑥∗ − 𝑧 − 𝑠𝑖𝑔𝑛 𝐽' �⃗� 𝑖, 𝑦7. 𝑥∗[𝑖] :=𝑤8. end9. return𝑥∗

�⃗�[2]=movie

−𝑠𝑖𝑔𝑛 𝐽' �⃗� 𝑖, 𝑦

0

𝑥∗[2]�⃗� =“Themovieisterrific”

y=Pos

y=Neg

TruePositive 54 65

TrueNegative 110 62

f:PredictionModelX:ExampleSentenceD:Dictionary

[1]A.Maas,R.Daly,P.Pham,D.Huang,A.Ng,andC.Potts,“LearningWordVectorsforSentimentAnalysis,”InProc.ofthe49thAnnualMeetingoftheAssociationforComputationalLinguistics:HumanLanguageTechnologies,‘06,2011,pp.142-150.[2]A.Deshpande,“SentimentAnalysiswithLSTMs,”Oct.3,2017.[Online].Available:https://github.com/adeshpande3/LSTM-Sentiment-Analysis.[3]N.Papernot,P.McDaniel,A.Swami,andR.Harang.“CraftingAdversarialInputSequencesforRecurrentNeuralNetworks.”Apr.28,2016.

References

HistogramofAdversarialSamplesAverage#WordsChanged:8.7

89.24%

80.98% 69.64%

64.27%

31.66%

14.05%

98.19%

86.51% 81.65% 80.49%

54.34%

32.95%

94.36%

81.77% 79.08%

77.76% 68.17%

59.36%

39.86%

0%

20%

40%

60%

80%

100%

Training Testing(noadversary)

Testing,tack-on Testing,1-strongest

Testing,3-strongest

Testing,5-strongest

Testing,JSMA

ModelAccuracyvs.Adversary

NaïveBayes SVM LSTM

• SVMandNBperformsimilarlytoLSTMonthetestsetwithoutadversary.Thisimpliesthedataiswell-segregated- independentlyseeninPCAplot.

• TheLSTMismostrobusttoourblack-boxadversaries.• Black-boxadversarieswerewordsstronglyassociatedwithsentiment.• Modelaccuraciesfellmonotonicallywithincreasingadversarystrength.• Jacobian-basedmethodsdonotalwayschangethemostpositive/negative

words.Seemingly-randomwordinjectionchangestheprediction,leadingustoquestionwhetherLSTMsareactuallylearningthesentiment;e.g.:Thisexcellentmoviemademecry!→ thisexcellenttsunga telsim grrr cry

• ImplementadeeperLSTMwithmean-poolinglayers• OptimizedmemoryallocationinTensorFlow codeforJSMAmethod• AdversarialtrainingofLSTMnetworkbasedonJSMAadversaries• UseStanfordNLPParsertoautomategrammarchecking

FutureWork

• BasedonNaïveBayes”strongest”words– wordsmostpolarizingtowardpositiveornegativeclassification

• AdversarialWords:• PositiveSway:”edie,”“antwone,”“din,”“gunga,””yokai”• NegativeSway:“boll,”“410,”“uwe,”“tashan,”“hobgoblins”

• Tack-On:replacefirstwordwithrandomadversarialword• NStrongest-Word-Swap:replacereview’sNstrongestword(s)withrandom

adversarialword(s);experimentedforN<=5

Weperformedahyperparameter searchandsettledonanLSTMwithasoftmax outputlayerand64hiddenunits.ForthelinearSVM,wesweptlearningrateandtrieddifferentfeaturesandkernels.TheNaïveBayesmodelismultinomialanduseslog-probabilities.

AccuracyafterJSMA=39.9%