Challenges in High Accuracy of Malware Detection

Muhammad Najmi Ahmad Zabidi
International Islamic University Malaysia

IEEE Control & System Graduate Research Colloquium 2012
Shah Alam, Malaysia
16th July 2012

Outline: Intro, Issues, Objectives, Methodology, Conclusion

About

I am a research grad student at Universiti Teknologi Malaysia, Skudai, Johor Bahru, Malaysia

My current employer is International Islamic University Malaysia, Kuala Lumpur

Research area: malware detection, narrowing down to Windows executables

Malware in short

Malware is software

Its maliciousness is defined by the risks it exposes the user to

Sometimes, when the classification is vague, the term "Potentially Unwanted Program/Application" (PUP/PUA) is used instead

Methods of detection

Static analysis
In this case we have developed a Python-based tool called pi-ngaji, an open source tool for static malware analysis

Dynamic analysis
In this case we execute the malware in a Windows environment and dump the API traces into a text file

This talk outlines several challenges in the current methods of malware detection

Analysis of strings

Important, although not foolproof

Find interesting calls first

Considered static analysis, since the binary is not executed

Methods to find interesting strings

Use the strings command (on *NIX systems)

Editors

Check the Import Address Table (IAT)
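
As a rough illustration of the first method, the sketch below mimics what the strings command does: it pulls runs of printable ASCII out of a byte buffer and checks them against a watch list. The function names and the INTERESTING list are hypothetical, chosen for this example only; they are not taken from pi-ngaji.

```python
from __future__ import annotations

import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Hypothetical watch list of call names, for illustration only.
INTERESTING = ("LoadLibraryA", "GetProcAddress", "CreateProcess", "IsDebuggerPresent")


def interesting_hits(data: bytes) -> list[str]:
    """Return the watch-list names that occur inside any extracted string."""
    found = extract_strings(data)
    return [name for name in INTERESTING if any(name in s for s in found)]


# A made-up byte buffer standing in for a binary.
sample = b"\x00\x01MZ\x90\x00LoadLibraryA\x00\x00GetProcAddress\x00ab\x00"
hits = interesting_hits(sample)
```

Short runs such as "MZ" and "ab" fall below the minimum length and are dropped, which is the same trade-off the strings command makes by default.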

Issues

The number of malware samples is enormous

Detection therefore needs to be automated

Our proposal: use machine learning methods

Objectives

Reduce the number of malware API features, since:

Some are weak, irrelevant features

These are considered "noise"

For feature selection, a ranking method is chosen

Methodology outline: API calls; Anti Debugger/AntiVM strings; Feature Ranking Selection with Information Gain; Classification and Clustering

The features

The following are the features:

Application Programming Interface (API) calls

XOR'ed strings

Anti-virtualization/virtual machine detectors

Binary entropy is also interesting
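
Two of the features above can be sketched in a few lines. The first function computes the Shannon entropy of a byte buffer in bits per byte (0 for constant data, 8 for uniformly distributed bytes). The second looks for a known string hidden under a single-byte XOR key. Both are illustrative assumptions about how such features could be computed; the function names are hypothetical, and real samples may use multi-byte keys that this sketch does not handle.

```python
from __future__ import annotations

import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


def find_xor_key(data: bytes, needle: bytes) -> int | None:
    """Return the first key in 1..255 under which needle appears in data, else None."""
    for key in range(1, 256):
        if xor_bytes(needle, key) in data:
            return key
    return None


# Made-up buffer: the string "http://" hidden with key 0x5A.
hidden = b"\x00\x11" + xor_bytes(b"http://", 0x5A) + b"\x22"
key = find_xor_key(hidden, b"http://")
```

Entropy is usually computed per section or per file; which granularity suits the classifier is a design choice left open here.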

Binary file structure

Figure: Structure of a PE file [Pietrek, 1994]

Figure: PE components, simplified
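
To make the PE layout concrete, here is a minimal sketch that reads the first header fields from a byte buffer: the DOS "MZ" signature, the offset of the PE header stored at 0x3C, the "PE\0\0" signature, and a few fields of the COFF file header that follows. It uses only the standard library; the function and key names are hypothetical, and a real tool would more likely rely on a dedicated PE parsing library.

```python
def u16(data: bytes, offset: int) -> int:
    """Read a little-endian 16-bit unsigned integer."""
    return int.from_bytes(data[offset:offset + 2], "little")


def u32(data: bytes, offset: int) -> int:
    """Read a little-endian 32-bit unsigned integer."""
    return int.from_bytes(data[offset:offset + 4], "little")


def parse_pe_header(data: bytes) -> dict:
    """Return a few COFF header fields, or raise ValueError if not a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) signature")
    pe_offset = u32(data, 0x3C)  # e_lfanew: file offset of the PE signature
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4  # the COFF file header starts right after the signature
    return {
        "machine": u16(data, coff),
        "number_of_sections": u16(data, coff + 2),
        "time_date_stamp": u32(data, coff + 4),
        "size_of_optional_header": u16(data, coff + 16),
        "characteristics": u16(data, coff + 18),
    }
```

Typical use would be `parse_pe_header(open(path, "rb").read())`, with `path` pointing at the sample under analysis.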

API calls

Features are as follows:

Example of Features

GetSystemTimeAsFileTime

SetUnhandledExceptionFilter

GetCurrentProces

TerminateProcess

LoadLibraryExW

GetVersionExW

GetProcAddress
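
One plausible way to feed such API names to a learning algorithm is a binary presence vector: one position per known API, 1 if the sample imports it and 0 otherwise. The sketch below assumes that encoding, which the slides do not spell out. The feature list is a subset of the names above, and the function name is hypothetical.

```python
# Fixed feature order; every sample is encoded against the same list.
API_FEATURES = [
    "GetSystemTimeAsFileTime",
    "SetUnhandledExceptionFilter",
    "TerminateProcess",
    "LoadLibraryExW",
    "GetVersionExW",
    "GetProcAddress",
]


def to_feature_vector(imported, features=API_FEATURES):
    """Encode a sample's imported API names as a 0/1 vector over features."""
    present = set(imported)
    return [1 if name in present else 0 for name in features]


# Made-up import list for one sample; names outside the feature list are ignored.
vector = to_feature_vector(["GetProcAddress", "CreateFileW", "TerminateProcess"])
```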

Anti Debugger/AntiVM strings

IsDebuggerPresent

VMCheck.dll

"Red Pill":"\x0f\x01\x0d\x00\x00\x00\x00\xc3",
"VirtualPc trick":"\x0f\x3f\x07\x0b",
"VMware trick":"VMXh",
"VMCheck.dll":"\x45\xC7\x00\x01",
"VMCheck.dll for VirtualPC":"\x0f\x3f\x07\x0b\xc7\x45\xfc\xff\xff\xff\xff",
"Xen":"XenVMM", # Or XenVMMXenVMM
"Bochs & QEmu CPUID Trick":"\x44\x4d\x41\x63",
"Torpig VMM Trick": "\xE8\xED\xFF\xFF\xFF\x25\x00\x00\x00\xFF\x33\xC9\x3D\x00\x00\x00\x80\x0F\x95\xC1\x8B\xC1\xC3",
"Torpig (UPX) VMM Trick": "\x51\x51\x0F\x01\x27\x00\xC1\xFB\xB5\xD5\x35\x02\xE2\xC3\xD1\x66\x25\x32\xBD\x83\x7F\xB7\x4E\x3D\x06\x80\x0F\x95\xC1\x8B\xC1\xC3"

Source: ZeroWine source code
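
A signature table like this can be applied with a plain substring search over the raw bytes of the sample. The sketch below assumes Python 3, so the patterns are written as bytes literals, and it copies only four of the entries. The function name is hypothetical and the matching logic is an assumption; it is not ZeroWine's own code.

```python
# A subset of the signatures above, as bytes literals (Python 3).
SIGNATURES = {
    "Red Pill": b"\x0f\x01\x0d\x00\x00\x00\x00\xc3",
    "VirtualPc trick": b"\x0f\x3f\x07\x0b",
    "VMware trick": b"VMXh",
    "Xen": b"XenVMM",
}


def find_antivm_tricks(data: bytes) -> list:
    """Return the names of all signatures that occur anywhere in data."""
    return [name for name, sig in SIGNATURES.items() if sig in data]


# Made-up buffer containing two of the patterns.
sample = b"\x90\x90" + b"VMXh" + b"\x00" + b"\x0f\x3f\x07\x0b"
tricks = find_antivm_tricks(sample)
```

Short patterns such as the 4-byte ones can match by coincidence in large binaries, so a hit is a hint rather than proof.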

Sample execution

Analyzing e665297bf9dbb2b2790e4d898d70c9e9

Analyzing registry...
[+] Malware is Adding a Key at Hive: HKEY_LOCAL_MACHINE
^G^@Label11^@^A^AÃR^Nreg add "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\Rx.exe" /v debugger /t REG_SZ /d %systemrot%\repair\1sass.exe /f^M

....

[+] Malware Seems to be IRC BOT: Verified By String : ADMIN
[+] Malware Seems to be IRC BOT: Verified By String : LIST
[+] Malware Seems to be IRC BOT: Verified By String : QUIT
[+] Malware Seems to be IRC BOT: Verified By String : VERSION
Analyzing interesting calls..
[+] Found an Interesting call to: FindWindow
[+] Found an Interesting call to: LoadLibraryA
[+] Found an Interesting call to: CreateProcess
[+] Found an Interesting call to: GetProcAddress
[+] Found an Interesting call to: CopyFile
[+] Found an Interesting call to: shdocvw

Advantages on the researcher’s side

Malware writers are usually "lazy", so they tend to reuse chunks of their earlier code

Hence, it is easier to trace a sample back to an earlier family based on the commonalities
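
One simple way to measure such commonalities, assumed here purely for illustration, is the Jaccard similarity between the set of API names in a new sample and the set associated with each known family. The family names and API sets below are made up, and the slides do not state which similarity measure is actually used.

```python
def jaccard(a, b) -> float:
    """Size of the intersection divided by size of the union of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def closest_family(sample_apis, family_profiles) -> str:
    """Return the family whose API set is most similar to the sample's."""
    return max(family_profiles, key=lambda fam: jaccard(sample_apis, family_profiles[fam]))


# Hypothetical family profiles, for illustration only.
profiles = {
    "family_a": {"FindWindow", "CopyFile", "CreateProcess"},
    "family_b": {"GetVersionExW", "LoadLibraryExW"},
}
best = closest_family({"CopyFile", "CreateProcess", "GetProcAddress"}, profiles)
```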

Our methods

Roughly, our methods consist of:

1. Feature selection (ranking/pruning)

2. Supervised classification

3. Unsupervised classification

Items 2 and 3 above could also be combined into a method known as "semi-supervised classification".

Information Gain

[Zhang et al., 2007, Altaher et al., 2011, Singhal and Raul, 2012] use the following formula to apply IG to malware

The amount by which the entropy of X decreases reflects the additional information about X provided by Y. This amount is called information gain, given by

\[ IG(X \mid Y) = H(X) - H(X \mid Y) \]
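
The formula can be evaluated directly from labeled samples. In the minimal sketch below, X is taken to be the class label (for example malware or benign) and Y a single feature value per sample. Logarithms are base 2, which is an assumption since the slides do not state the base, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """H(X): Shannon entropy of a list of labels, in bits."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, feature) -> float:
    """IG(X|Y) = H(X) - H(X|Y), with X the labels and Y the feature values."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [x for x, y in zip(labels, feature) if y == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


labels = ["malware", "malware", "benign", "benign"]
gain_perfect = information_gain(labels, [1, 1, 0, 0])  # feature separates the classes
gain_useless = information_gain(labels, [1, 0, 1, 0])  # feature carries no information
```

Ranking features by this value and keeping the top ones is the usual way IG is used for feature selection.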

[Singhal and Raul, 2012] introduced the following algorithm to "correct out" errors in the results.

\[ IG(X)' = IG(X) \pm \frac{\sum_{i=0}^{n} IG(X_i)}{n} \]

Information Gain (cont’d)

From [Jiang et al., 2011]

\[ IG(t) = \sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c) \log \frac{P(t', c)}{P(t') P(c)} \]
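
This form of IG can be computed from four counts: how many samples contain the feature t, how many belong to class c, how many do both, and the total. The sketch below estimates each probability as a relative frequency and treats a zero joint count as contributing nothing (the usual 0 log 0 = 0 convention). Base-2 logarithms and the parameter names are assumptions made for this example.

```python
import math


def ig_term_class(n_tc: int, n_t: int, n_c: int, n: int) -> float:
    """IG of feature t for class c from counts.

    n_tc: samples in class c that contain t
    n_t:  samples that contain t
    n_c:  samples in class c
    n:    all samples
    """
    ig = 0.0
    for t_present in (True, False):
        for in_class in (True, False):
            if t_present and in_class:
                joint = n_tc
            elif t_present:
                joint = n_t - n_tc
            elif in_class:
                joint = n_c - n_tc
            else:
                joint = n - n_t - n_c + n_tc
            if joint == 0:
                continue  # 0 * log(0) is taken as 0
            p_joint = joint / n
            p_t = (n_t if t_present else n - n_t) / n
            p_c = (n_c if in_class else n - n_c) / n
            ig += p_joint * math.log2(p_joint / (p_t * p_c))
    return ig


ig_perfect = ig_term_class(n_tc=2, n_t=2, n_c=2, n=4)      # t occurs exactly in class c
ig_independent = ig_term_class(n_tc=1, n_t=2, n_c=2, n=4)  # t independent of class c
```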

For research purposes, the following issues always come up:

There is no standard dataset, unlike in the Intrusion Detection System (IDS) area

Malware samples change at a fast pace, so will the datasets used for the experiment be questioned?

As a last resort, stick to the existing database, but try to keep it free from any specific malware family so that the method will (or could) work with new, incoming malware

Table: Differences between clustering and classification

Classification                 Clustering
-----------------------------  -----------------------------
Deals with known data          Deals with unknown data
Supervised learning            Unsupervised learning
Popular algorithms include:    Popular algorithms include:
  Random Forest                  K-means
  Neural Networks                Fuzzy C-means
  k-Nearest Neighbor             Gaussian
  Decision Trees

Classification (supervised) is chosen to deal with a known corpus whose data is incomplete

Clustering (unsupervised) is chosen to deal with new inputs
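
The contrast can be sketched with one algorithm from each column, written in plain Python over binary feature vectors. The k-nearest-neighbor function needs labels for its training data, while the k-means function only receives the vectors. The data, the Hamming distance, and the evenly spaced choice of starting centroids are all assumptions made to keep the example small and deterministic; the actual experiments rely on tools such as Weka and scipy.

```python
from collections import Counter


def hamming(a, b) -> int:
    """Number of positions where two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, labels, query, k=3):
    """Supervised: majority label among the k training vectors closest to query."""
    ranked = sorted(range(len(train)), key=lambda i: hamming(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def sq_dist(a, b) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Unsupervised: return a cluster index for every point (no labels used)."""
    # Simplification: start from evenly spaced points instead of random ones.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Made-up binary feature vectors (one position per API or string feature).
vectors = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
known_labels = ["bot", "bot", "benign", "benign"]

predicted = knn_predict(vectors, known_labels, [1, 1, 0, 1], k=3)
clusters = kmeans(vectors, k=2)
```

Cluster indices carry no meaning by themselves; an analyst still has to inspect each group to decide what it represents.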

Some results

We managed to detect several malware samples by using the existing API traces and other features (bot commands, file/registry deletion)

Newer, more sophisticated malware such as Stuxnet/Duqu is very platform specific (it attacks SCADA systems), so detecting it needs further study. Perhaps the most obvious indicator would be the use of XOR'ed communication channels, if any.

The flow

Figure: workflow diagram with the stages Feature Selection, Feature Categorization, Clustering, Classification and Visualization. Tool labels on the diagram: Weka, scipy, Octave/Matlab.

References

Altaher, A., Ramadass, S., and Ali, A. (2011). Computer Virus Detection Using Features Ranking and Machine Learning. Australian Journal of Basic and Applied Sciences, 5(9):1482-1486.

Jiang, Q., Zhao, X., and Huang, K. (2011). A feature selection method for malware detection. In 2011 IEEE International Conference on Information and Automation (ICIA), pages 890-895.

Pietrek, M. (1994). Peering Inside the PE: A Tour of the Win32 Portable Executable File Format. http://msdn.microsoft.com/en-us/library/ms809762.aspx.

Singhal, P. and Raul, N. (2012). Malware detection module using machine learning algorithms to assist in centralized security in enterprise networks. International Journal of Network Security & Its Applications, 4.

Zhang, B., Yin, J., Hao, J., Wang, S., and Zhang, D. (2007). New malicious code detection based on n-gram analysis and rough set theory. Pages 626-633. Springer-Verlag, Berlin, Heidelberg.
