

CISC 879 - Machine Learning for Solving Systems Problems

Presented by: Akanksha Kaul

Dept of Computer & Information Sciences

University of Delaware

SBMDS: an interpretable string based malware detection system using SVM ensemble with bagging

Yanfang Ye, Lifei Chen, Dingding Wang, Tao Li, Qingshan Jiang, Min Zhao


MOTIVATION

• Urgent need to detect malicious executables

• Major threats:

Metamorphic executables: reprogram themselves and are capable of infecting two operating systems

Polymorphic executables: masquerade as non-malicious code

Unseen executables


Need of the Hour

• SBMDS: String Based Malware Detection System

• What exactly is this system about?

• It performs interpretable string analysis. Interpretable strings are strings in a program that include both API execution calls and important semantic strings representing the intent and goal of the program writer.


Interpretable String?

• E.g., from the worm “Nimda”: “html script language = ‘javascript’ window.open(‘readme.eml’)”

• Another example: “&gameid= %s&pass=%s; myparentthreadid=%d; myguid=%s”

• But not all strings are interpretable, e.g.:

“!0&0h0m0o0t0y0”

“*3d%3dtgyhjij”


Major Steps to perform

• Constructing the interpretable strings by developing a feature parser.

• Performing feature selection to select informative strings.

• Using SVM ensemble with bagging to construct the classifier.

• Running the malware detector, which also predicts the exact type of the malware.


Step 1

• Develop a feature parser

39,838 executables collected from the Kingsoft Anti-Virus lab.

All executables are PE files.

Extract static features: API calls from the import table, and strings carrying semantic interpretation.
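The string-extraction half of this step can be illustrated with a minimal sketch. It is not the authors' feature parser: it assumes that a "string" is a run of at least four printable ASCII bytes (the convention of the Unix `strings` tool), and the function name `extract_strings` and the sample bytes are hypothetical. Reading API calls from the PE import table is not shown.

```python
import re

def extract_strings(data, min_len=4):
    """Return runs of printable ASCII characters found in raw bytes.

    Assumption: a run of min_len or more bytes in the range 0x20-0x7e
    counts as a string; this is an illustrative rule, not the paper's.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]

# Hypothetical bytes standing in for the contents of a PE file.
sample = b"\x00\x01MZ\x90\x00%s goto delete\x00\xffab\x00window.open('readme.eml')\x00"
strings_found = extract_strings(sample)
```

Short runs such as "MZ" and "ab" fall below the length threshold and are dropped, so only the two longer strings remain as candidate features.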


SAMPLE (Backdoor-Redgirl.exe)

“‘%s’ goto delete” always implies that the malware may generate a “.bat” file to delete itself.


Step 2

• Feature Selection

Selects only the interpretable strings from the huge set of strings obtained in the previous step.

Assigns these strings as signatures of the PE files.
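One simple way to picture feature selection is to rank each string by how differently it occurs in malicious and benign files. The sketch below does that; the scoring rule (absolute difference in occurrence rate), the function name `select_strings`, and the toy data are all assumptions made for illustration and are not taken from the paper. It also assumes both classes are present in the labels.

```python
def select_strings(samples, labels, top_k):
    """Rank strings by how unevenly they appear across the two classes.

    samples: list of sets of strings, one set per file.
    labels:  1 for malicious, 0 for benign (both classes must be present).
    Assumption: the score is |rate in malicious - rate in benign|.
    """
    n_mal = sum(labels)
    n_ben = len(labels) - n_mal
    vocabulary = set().union(*samples)

    def score(string):
        mal = sum(1 for s, y in zip(samples, labels) if y == 1 and string in s)
        ben = sum(1 for s, y in zip(samples, labels) if y == 0 and string in s)
        return abs(mal / n_mal - ben / n_ben)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(vocabulary, key=lambda s: (-score(s), s))[:top_k]

# Hypothetical toy data: two malicious files, two benign files.
samples = [
    {"goto delete", "kernel32.dll"},
    {"goto delete", "readme.eml"},
    {"kernel32.dll", "Copyright"},
    {"kernel32.dll"},
]
labels = [1, 1, 0, 0]
selected = select_strings(samples, labels, top_k=2)
```

Here "goto delete" ranks first because it appears in every malicious file and in no benign file.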


Step 3

• Using SVM to classify

Why SVM?

• SVMs have shown state-of-the-art results in classification problems.

Problem: the training complexity of an SVM depends on the size of the data set.


Problem

Training accuracy becomes constant once the size of the data set reaches 3000


Curse of Dimensionality?

• Problem caused by the exponential increase in the volume of the feature space as dimensions are added.

• How does SVM deal with the “curse of dimensionality”?

• Solution: by using an SVM ensemble with bagging

• What are SVM ensemble and bagging?


3.1 SVM Ensemble with Bagging

• Ensemble is a set of classifiers whose individual decisions are combined in some way to classify new samples.

• Bagging technique on the training set

“BAGGING” (Bootstrap AGGregating)

• Uniform sampling, with replacement, of the training data set
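The sampling step of bagging can be sketched as follows. Each bag is drawn uniformly with replacement from the training set and would then be used to train one SVM of the ensemble (training is not shown). The function names, the bag count, and the toy training set are hypothetical.

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items uniformly at random, with replacement."""
    n = len(dataset)
    return [dataset[rng.randrange(n)] for _ in range(n)]

def make_bags(dataset, n_bags, seed=0):
    """Build one bootstrap sample per ensemble member."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [bootstrap_sample(dataset, rng) for _ in range(n_bags)]

# Hypothetical training set of (sample name, label) pairs.
training_set = [("sample%d" % i, i % 2) for i in range(10)]
bags = make_bags(training_set, n_bags=5)
```

Because sampling is with replacement, a bag usually contains some items more than once and omits others, which is what makes the individual classifiers differ.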


3.2 Multi-Classification

• Various classes of malware.

• “MAJORITY VOTING” is used to combine the classifiers' decisions.

• When two different classes receive identical vote counts, the smallest index is chosen:

1 = Backdoors

2 = Spyware

3 = Trojans

4 = Worms

0 = Benign files
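A minimal sketch of the voting rule, using the class indices listed on this slide. It assumes that "smallest index is chosen" is the tie-breaking rule for classes with equal vote counts, which is one reading of the slide; the function name `majority_vote` and the example votes are hypothetical.

```python
from collections import Counter

# Class indices as listed on the slide.
CLASS_NAMES = {0: "Benign", 1: "Backdoor", 2: "Spyware", 3: "Trojan", 4: "Worm"}

def majority_vote(predictions):
    """Return the most frequent class index among the ensemble's predictions.

    Assumption: when several classes tie for the most votes, the smallest
    class index wins.
    """
    counts = Counter(predictions)
    top_votes = max(counts.values())
    return min(label for label, votes in counts.items() if votes == top_votes)

# Five hypothetical classifiers: classes 1 and 3 tie with two votes each.
winner = majority_vote([3, 1, 3, 4, 1])
```

In this example the tie between Trojan (3) and Backdoor (1) resolves to index 1.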


STEP 4: Malware Detection

• Unknown variants of malware are used.

• Decides whether a file is malicious or not.

• Decides which class the malware belongs to.


System Architecture

1. Feature Parser

2. Feature Selection

3. SVM Ensemble Classifier

4. Malware Detector


Reason why I Chose This paper

• Comparisons with popular anti-virus software.

Points of comparison:

1. Detecting known variants of malware.

2. Detecting unknown variants.

3. Efficiency (detection time).

4. Number of false positive detections.


Detecting Known Variants


Detecting Unknown Variants


Efficiency (Detection Time)


Number of False Positives


Conclusion

• This system has already been incorporated into the scanning tool of a commercial anti-virus product.

• The name of the anti-virus product is not disclosed.


Questions?


All's Well That Ends Well

THANK YOU