Juxtapp: A Scalable System for Detecting Code Reuse Among Android Applications
Steve Hanna, Ling Huang, Edward Wu, Saung Li, Charles Chen, and Dawn Song
On the Feasibility of Internet-Scale Author Identification
Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Eui Chul Richard Shin, Dawn Song, Emil Stefanov
Previously applied only to literary corpora and academia
Source code similarity/plagiarism detection is very important
“Moss” is the most widely known software similarity detection tool
Can provide valuable insight into malware detection
Generally not true
In the Android app domain, it can be!
86% of Android malware samples are repackaged versions of legitimate apps with malicious payloads (source: “Dissecting Android Malware: Characterization and Evolution”)
Similarity detection is crucial
Each Android app is distributed as an APK file, with a .apk extension
Each APK file contains a .dex file, a Dalvik executable that runs on the Dalvik virtual machine
Fingerprint the APK using bithashing
The value of k was set to 5, chosen by experiment. Pairs of apps were drawn from 6,000 randomly sampled apps, and the distance between each pair was computed. From k = 5 upward, the value of k had little impact on the distance calculation.
Basic block size: the mean is 5.35 opcodes and the median is 2 opcodes, while the largest basic block in the dataset contains 35,517 opcodes
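As a rough illustration of the k-gram step, the sketch below slides a window of k opcodes over a single basic block. The function name `opcode_kgrams` and the sample opcode list are made up for this example. The slide does not say how blocks shorter than k are handled, so this sketch simply produces no k-grams for them, which is an assumption.

```python
def opcode_kgrams(basic_block, k=5):
    """Return every run of k consecutive opcodes in one basic block."""
    # Assumption: a block shorter than k yields no k-grams in this sketch.
    return [tuple(basic_block[i:i + k]) for i in range(len(basic_block) - k + 1)]


# Hypothetical basic block of six Dalvik opcodes.
block = ["const/4", "invoke-virtual", "move-result", "if-eqz", "goto", "return-void"]
grams = opcode_kgrams(block, k=5)
for gram in grams:
    print(gram)
```

With a median block size of 2 opcodes, many blocks would contribute nothing under this simple windowing, so the treatment of short blocks matters in practice.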
The bitvector size m is chosen by experiment. m should be much larger than N, the number of k-grams extracted from an application, so that hash collisions do not distort the comparison between two k-gram feature sets
30,000 apps were used to determine m.
m ≈ 9 × N90, taken as the prime number 240,007
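To make the role of m concrete, here is a minimal sketch of hashing k-grams into an m-bit fingerprint. The value 240,007 comes from the slide; everything else is an assumption for illustration: the names `kgram_to_index` and `fingerprint`, the use of SHA-1 as the hash function, and the choice to store the fingerprint as a set of set-bit positions rather than a packed bitvector.

```python
import hashlib

M = 240007  # bitvector size quoted on the slide


def kgram_to_index(kgram, m=M):
    """Map one k-gram (a tuple of opcode strings) to a bit position in [0, m)."""
    # hashlib gives a stable value across runs; the hash function used by the
    # original system is not stated here, so SHA-1 is only a stand-in.
    digest = hashlib.sha1(" ".join(kgram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % m


def fingerprint(kgrams, m=M):
    """Return the set of bit positions that would be 1 in the m-bit vector."""
    return {kgram_to_index(g, m) for g in kgrams}


# Hypothetical k-grams; the first one is repeated and sets the same bit twice.
sample = [("const/4", "goto"), ("if-eqz", "return-void"), ("const/4", "goto")]
bits = fingerprint(sample)
print(sorted(bits))
```

Because repeated k-grams set the same bit, the fingerprint records presence rather than counts.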
Given the bitvector representations of two apps A and B, their similarity is computed by the following formula:
J(A,B) = |A ∧ B| / |A ∨ B|
Here |·| counts the bits set to 1. This formula is a variation of the original Jaccard similarity.
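The similarity formula above can be sketched as follows, assuming each fingerprint is held as a Python set of set-bit positions, so that the bitwise AND and OR become set intersection and union. The function name `jaccard` and the value returned for two empty fingerprints are choices made for this example only.

```python
def jaccard(a_bits, b_bits):
    """|A AND B| / |A OR B|, with each argument a set of set-bit positions."""
    union = a_bits | b_bits
    if not union:
        return 1.0  # assumption: two empty fingerprints are treated as identical
    return len(a_bits & b_bits) / len(union)


a = {1, 5, 9, 12}
b = {5, 9, 20}
print(jaccard(a, b))  # 2 shared bits out of 5 set overall -> 0.4
```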
If an app is heavily obfuscated, Juxtapp may not perform well
Use of third-party libraries can add a lot of noise and adversely affect the similarity score
Who wrote it?
Identify an anonymous author by comparing their writing style against a corpus of texts of known authorship
The primary application has shifted from the literary domain to forensics: terrorist threats, harassment
2.4 million posts from 100,000 blogs (almost a billion words)
Stylometry: identify an author based on writing style
Are N-gram techniques suitable? Not really, because they reveal more about the context than about the author
Prepare test set and training set
Build a classifier with the training set
Test the classifier with the test set
Which features should be considered?
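The train/test workflow listed above can be sketched with a toy example. Everything here is illustrative: the feature set (relative frequencies of a few function words), the nearest-neighbour rule with cosine similarity, and the tiny hand-written texts are assumptions chosen to keep the example short, not the features or classifier the slides refer to.

```python
import math
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "i", "we"]  # hypothetical toy feature set


def features(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def train(labelled_texts):
    # For nearest neighbour, "training" just stores one vector per known text.
    return [(author, features(text)) for author, text in labelled_texts]


def predict(model, text):
    vec = features(text)
    return max(model, key=lambda item: cosine(item[1], vec))[0]


training_set = [
    ("alice", "the cat and the dog and the bird went to the park"),
    ("bob", "i think we ought to go and i hope we can"),
]
test_set = [
    ("alice", "the fox and the hen ran to the barn"),
    ("bob", "i know we can and i believe we will"),
]

model = train(training_set)
accuracy = sum(predict(model, text) == author for author, text in test_set) / len(test_set)
print(accuracy)
```

Keeping the test texts out of the training set is what makes the measured accuracy meaningful.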
Syntax tree by Stanford parser
Yule’s K
K = 10000 × (M − N) / N²
N = total number of words in the text
M = ∑ i² × Vi (summed over i)
where Vi is the number of words that occur i times
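The Yule's K definition above translates directly into a few lines of code. This is a minimal sketch: the function name `yules_k`, the lowercase whitespace tokenisation, and the error raised on empty input are assumptions made for the example.

```python
from collections import Counter


def yules_k(text):
    """Yule's K = 10000 * (M - N) / N^2, with M = sum over i of i^2 * V_i."""
    words = text.lower().split()  # assumption: naive whitespace tokenisation
    n = len(words)
    if n == 0:
        raise ValueError("empty text")
    word_freq = Counter(words)             # word -> number of occurrences
    v = Counter(word_freq.values())        # i -> V_i, the count of words seen i times
    m = sum(i * i * v_i for i, v_i in v.items())
    return 10000 * (m - n) / (n * n)


# "a" occurs twice, "b" and "c" once each: N = 4, V_1 = 2, V_2 = 1, M = 6.
print(yules_k("a a b c"))  # 10000 * (6 - 4) / 16 = 1250.0
```

A text in which every word occurs exactly once gives M = N and therefore K = 0; more repetition gives a larger K.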
In 20% of cases the classifiers can correctly identify an anonymous author given a corpus of texts from 100,000 authors
In 35% of cases the correct author is one of the top 20 guesses