mamadroid: detecting android malware by building markov chains of behavioral models

MaMaDroid: Detecting Android Malware by Building Markov Chains of

Behavioral Models

Android & Malware

Android market share is growing…In 2016, 85% of smartphone sales

…and so is the interest of cybercriminalsBypassing two-factor authenticationStealing sensitive information, etc.

2

Current Defenses

Can’t use complex on-device operationsLimited battery and memory resources

Google’s centralized analysisNot perfect, after-the-factMany apps installed outside Play Store

Lots of research in the field! However…Permission-based models prone to false positiveRelying on API calls frequently used by malware needs constant, costly retraining

3

Our Idea

Rely on the sequence of abstracted calls1. Sequence captures the behavioral model2. Abstraction provides resilience to API changes

Intuition: malware uses calls for different actions and in different order than benign apps

E.g. android.media.MediaRecorder used by any app with permission to record audioOnly using it after calls to getRunningTasks(), which allows to record conversations, may suggest maliciousness

4

Overview

5

Call Graph Extraction

Based on static analysisGiven an apk, extract call graphs

ToolsSoot (Java optimization and analysis framework)FlowDroid (ensures contexts & flows preserved)

6

Call Graph

8

Overview

9

Sequence Extraction

Soot gives the sequence of functions that are potentially called by the program, but…

Each execution could take a specific branch of the graph and only execute a subset of the calls

When running example multiple times…Execute() may be followed by different calls, e.g., getShell() only in try or getShell() + getMessage() in catch

10

Sequence Extraction (cnt’d)

We proceed as follows…1. Identify set of entry nodes2. Enumerate reachable paths3. Output set of all paths as the sequences of API calls

11

Abstraction

PackagesUsing the list of 243 packages (as of API level 24) + 95 from the Google APIPackages defined by developers à “self-defined”If we can’t tell what its class implements à “obfuscated”

Families9 families: android, google, java, javax, xml, apache, junit, json, domPlus self-defined and obfuscated

12

Example

13

Overview

14

Markov Chain

Memoryless modelsProb. transitioning from a state to another only depends on the current state

Represented as a set of nodesEach corresponding to a different state, and a set of edges labeled with the probability of transition.

Sum of all probabilities associated to all edges from any node is exactly 1

15

Markov-chain based modeling

Building the Markov ChainsFrom the sequence of abstracted API calls, each package/family is a state, transition is the probability of moving from one to another

16

Feature Extraction

For each app:Feature vector = probabilities of transitioning from one state to another in the Markov chainWith families, 11 possible states à 121 possible transitions in each chainWith packages, 340 states à 115,600 transitions

Principal Component Analysis (PCA)Standard way to reduce/refine features

17

Overview

18

Classification

Build a classifier using the extracted featuresEach app labeled as benign or malware

Can use a few standard algorithms for this task…Random Forests1-NN, 3-NNSVMMaybe deep learning?

19

Datasets

20

How many API calls?

21

Android/Google family calls?

22

Evaluation

(1) Accuracy of classification on benign and malicious samples developed around the same time

(2) Robustness to the evolution of malware as well as of the Android framework (using older datasets for training and newer ones for testing and vice-versa)

23

Same Year

24

family

package

Training on older samples

25

family

package

Training on newer samples

26

family

package

MaMaDroid vs DroidAPIMiner

27

Case Studies (2016/newbenign)

False Positives (164 samples)Most of them “dangerous permissions”E.g., SMS permissions not clear why requested

False Negatives (114 samples)Actually not classified as malware by VirusTotal, might actually be legitimateMost of them adware

28

Evasion

Repackaging benign appsDifficult to embed malicious code while keeping similar Markov chain, viceversa is also hard

Imitating Markov chainsLikely ineffective

Obfuscation/ManglingStill captured by the [obfuscated] abstraction

More in the paper…29

Limitations

Classification is memory hungry

Soot is buggy, we lose ~4% of the samples

Limits of static analysis only methods

30

Future Work

Further investigate resilience to evasionFocus on repackaged malicious appsInjection of API calls to mess with Markov chains

EnhancementsFine-grained abstractions (e.g., class)Seed with dynamic analysis

31

Thank you!

32

Paper to appear at NDSS 2017:E. Mariconti, L. Onwuzurike, P. Andriotis,E. De Cristofaro, G. Ross, G. Stringhini.MaMaDroid: Detecting Android Malware by Building Markov Chains of Behavioral Model

Thank you!

33

mamadroid: detecting android malware by building markov chains of behavioral models

Technology