mamadroid: detecting android malware by building markov chains of behavioral models
TRANSCRIPT
MaMaDroid: Detecting Android Malware by Building Markov Chains of
Behavioral Models
Android & Malware
Android market share is growing…In 2016, 85% of smartphone sales
…and so is the interest of cybercriminalsBypassing two-factor authenticationStealing sensitive information, etc.
2
Current Defenses
Can’t use complex on-device operationsLimited battery and memory resources
Google’s centralized analysisNot perfect, after-the-factMany apps installed outside Play Store
Lots of research in the field! However…Permission-based models prone to false positiveRelying on API calls frequently used by malware needs constant, costly retraining
3
Our Idea
Rely on the sequence of abstracted calls1. Sequence captures the behavioral model2. Abstraction provides resilience to API changes
Intuition: malware uses calls for different actions and in different order than benign apps
E.g. android.media.MediaRecorder used by any app with permission to record audioOnly using it after calls to getRunningTasks(), which allows to record conversations, may suggest maliciousness
4
Overview
5
Call Graph Extraction
Based on static analysisGiven an apk, extract call graphs
ToolsSoot (Java optimization and analysis framework)FlowDroid (ensures contexts & flows preserved)
6
7
Call Graph
8
Overview
9
Sequence Extraction
Soot gives the sequence of functions that are potentially called by the program, but…
Each execution could take a specific branch of the graph and only execute a subset of the calls
When running example multiple times…Execute() may be followed by different calls, e.g., getShell() only in try or getShell() + getMessage() in catch
10
Sequence Extraction (cnt’d)
We proceed as follows…1. Identify set of entry nodes2. Enumerate reachable paths3. Output set of all paths as the sequences of API calls
11
Abstraction
PackagesUsing the list of 243 packages (as of API level 24) + 95 from the Google APIPackages defined by developers à “self-defined”If we can’t tell what its class implements à “obfuscated”
Families9 families: android, google, java, javax, xml, apache, junit, json, domPlus self-defined and obfuscated
12
Example
13
Overview
14
Markov Chain
Memoryless modelsProb. transitioning from a state to another only depends on the current state
Represented as a set of nodesEach corresponding to a different state, and a set of edges labeled with the probability of transition.
Sum of all probabilities associated to all edges from any node is exactly 1
15
Markov-chain based modeling
Building the Markov ChainsFrom the sequence of abstracted API calls, each package/family is a state, transition is the probability of moving from one to another
16
Feature Extraction
For each app:Feature vector = probabilities of transitioning from one state to another in the Markov chainWith families, 11 possible states à 121 possible transitions in each chainWith packages, 340 states à 115,600 transitions
Principal Component Analysis (PCA)Standard way to reduce/refine features
17
Overview
18
Classification
Build a classifier using the extracted featuresEach app labeled as benign or malware
Can use a few standard algorithms for this task…Random Forests1-NN, 3-NNSVMMaybe deep learning?
19
Datasets
20
How many API calls?
21
Android/Google family calls?
22
Evaluation
(1) Accuracy of classification on benign and malicious samples developed around the same time
(2) Robustness to the evolution of malware as well as of the Android framework (using older datasets for training and newer ones for testing and vice-versa)
23
Same Year
24
family
package
Training on older samples
25
family
package
Training on newer samples
26
family
package
MaMaDroid vs DroidAPIMiner
27
Case Studies (2016/newbenign)
False Positives (164 samples)Most of them “dangerous permissions”E.g., SMS permissions not clear why requested
False Negatives (114 samples)Actually not classified as malware by VirusTotal, might actually be legitimateMost of them adware
28
Evasion
Repackaging benign appsDifficult to embed malicious code while keeping similar Markov chain, viceversa is also hard
Imitating Markov chainsLikely ineffective
Obfuscation/ManglingStill captured by the [obfuscated] abstraction
More in the paper…29
Limitations
Classification is memory hungry
Soot is buggy, we lose ~4% of the samples
Limits of static analysis only methods
30
Future Work
Further investigate resilience to evasionFocus on repackaged malicious appsInjection of API calls to mess with Markov chains
EnhancementsFine-grained abstractions (e.g., class)Seed with dynamic analysis
31
Thank you!
32
Paper to appear at NDSS 2017:E. Mariconti, L. Onwuzurike, P. Andriotis,E. De Cristofaro, G. Ross, G. Stringhini.MaMaDroid: Detecting Android Malware by Building Markov Chains of Behavioral Model
Thank you!
33