Challenges in High Accuracy of Malware Detection
Outline: Intro, Issues, Objectives, Methodology, Conclusion
Muhammad Najmi Ahmad Zabidi
International Islamic University Malaysia
IEEE Control & System Graduate Research Colloquium 2012
Shah Alam, Malaysia
16th July 2012
Muhammad Najmi Ahmad Zabidi ICSRGC 2012 1/26
About
I am a research grad student at Universiti Teknologi Malaysia, Skudai, Johor Bahru, Malaysia
My current employer is International Islamic University Malaysia, Kuala Lumpur
Research area: malware detection, narrowing down to Windows executables
Malware in short
Malware is software
Its maliciousness is defined by the risks it exposes the user to
When the case is vague, the term "Potentially Unwanted Program/Application" (PUP/PUA) is sometimes used instead
Methods of detection
Static analysis
For this we have developed pi-ngaji, an open source Python-based tool for static malware analysis
Dynamic analysis
For this we execute the malware in a Windows environment and dump the API traces into a text file
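The dynamic step leaves a text file of API traces. A minimal sketch of turning such a dump into per-call counts could look like the following. The one-call-per-line layout with an optional argument list is an assumption made for illustration, not the actual trace format, and `count_api_calls` is a hypothetical helper.

```python
from collections import Counter


def count_api_calls(trace_text):
    """Count API names in a trace dump.

    Assumed (hypothetical) format: one call per line, optionally
    followed by an argument list, e.g. "LoadLibraryA(kernel32.dll)".
    """
    counts = Counter()
    for line in trace_text.splitlines():
        # Keep only the part before the first "(" as the API name.
        name = line.split("(", 1)[0].strip()
        if name:
            counts[name] += 1
    return counts


# Illustrative trace, not real tool output.
trace = "LoadLibraryA(kernel32.dll)\nGetProcAddress\nLoadLibraryA(user32.dll)\n"
print(count_api_calls(trace).most_common())
```

The resulting counts can later be thresholded or reduced to presence/absence when building feature vectors.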
This talk outlines several challenges in the current methods of malware detection
Analysis of strings
Important, although not foolproof
Find interesting calls first
Considered static analysis, since the binary is not executed
Methods to find interesting strings
Use the strings command (on *NIX systems)
Editors
Check the Import Address Table (IAT)
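Where the strings command is not at hand, its basic behaviour can be approximated in a few lines. The sketch below is an illustration only, not pi-ngaji's code: `extract_strings` is a hypothetical helper, and the minimum length of 4 mirrors the usual default of GNU strings.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of printable ASCII of at least min_len bytes."""
    # 0x20-0x7e covers the printable ASCII range (space to tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# In practice the bytes would come from open(path, "rb").read();
# an in-memory sample keeps the sketch self-contained.
sample = b"\x00\x01GetProcAddress\x00ab\x00IsDebuggerPresent\xff"
print(extract_strings(sample))
```

Short runs such as `ab` are dropped, which removes much of the random noise found in binary sections.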
Issues
Malware numbers are enormous
Automation is needed to handle detection
Our proposal: use Machine Learning methods
Objectives
Reduce the number of API features extracted from malware, since:
Some are weak, irrelevant features
These are considered "noise"
Feature selection by a ranking method is chosen
Methodology: API calls, Anti Debugger/AntiVM strings, Feature Ranking Selection with Information Gain, Classification and Clustering
The features
The following are the features
Application Programming Interface (API) calls
XOR’ed strings
Anti virtualization/virtual machine detector
Binary entropy is also interesting
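Two of these features lend themselves to a short illustration. The sketch below computes the Shannon entropy of a byte string and searches for a known string hidden behind a single-byte XOR key. Both helpers are hypothetical names, and restricting the search to single-byte keys is an assumption; the slides do not say which XOR schemes are covered.

```python
import math
from collections import Counter


def byte_entropy(data):
    """Shannon entropy in bits per byte, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(data).values())


def find_xor_keys(data, needle):
    """Return single-byte keys (1..255) under which needle appears in data."""
    return [key for key in range(1, 256)
            if bytes(b ^ key for b in needle) in data]


# Illustrative data: an API name hidden with the key 0x41.
hidden = bytes(b ^ 0x41 for b in b"GetProcAddress")
print(find_xor_keys(b"\x00" + hidden + b"\x00", b"GetProcAddress"))
print(byte_entropy(bytes(range(256))))
```

Entropy close to 8 bits per byte usually points to compressed or encrypted content, which is why packed samples tend to stand out on this feature.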
Binary file structure
Figure: Structure of a PE file [Pietrek, 1994]
Figure: PE components, simplified
API calls
Features are as follows:
Example of features:
GetSystemTimeAsFileTime
SetUnhandledExceptionFilte
GetCurrentProces
TerminateProcess
LoadLibraryExW
GetVersionExW
GetProcAddress
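One common way to feed such API names to a learning algorithm is a binary presence vector. The sketch below is an assumption about the encoding, not a description of the actual tool; `to_feature_vector` is a hypothetical helper, and only the untruncated names from the list above are used.

```python
# Feature names taken from the example list; order fixes the vector layout.
FEATURES = [
    "GetSystemTimeAsFileTime",
    "TerminateProcess",
    "LoadLibraryExW",
    "GetVersionExW",
    "GetProcAddress",
]


def to_feature_vector(observed_calls, features=FEATURES):
    """Encode a sample as 1/0 flags: was each feature API observed?"""
    observed = set(observed_calls)
    return [1 if name in observed else 0 for name in features]


# Calls outside the feature list (CreateFileW here) are simply ignored.
print(to_feature_vector(["GetProcAddress", "CreateFileW", "TerminateProcess"]))
```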
Anti Debugger/AntiVM strings
IsDebuggerPresent
VMCheck.dll
"Red Pill": "\x0f\x01\x0d\x00\x00\x00\x00\xc3",
"VirtualPc trick": "\x0f\x3f\x07\x0b",
"VMware trick": "VMXh",
"VMCheck.dll": "\x45\xC7\x00\x01",
"VMCheck.dll for VirtualPC": "\x0f\x3f\x07\x0b\xc7\x45\xfc\xff\xff\xff\xff",
"Xen": "XenVMM", # Or XenVMMXenVMM
"Bochs & QEmu CPUID Trick": "\x44\x4d\x41\x63",
"Torpig VMM Trick": "\xE8\xED\xFF\xFF\xFF\x25\x00\x00\x00\xFF\x33\xC9\x3D\x00\x00\x00\x80\x0F\x95\xC1\x8B\xC1\xC3",
"Torpig (UPX) VMM Trick": "\x51\x51\x0F\x01\x27\x00\xC1\xFB\xB5\xD5\x35\x02\xE2\xC3\xD1\x66\x25\x32\xBD\x83\x7F\xB7\x4E\x3D\x06\x80\x0F\x95\xC1\x8B\xC1\xC3"
Source: ZeroWine source code
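A signature table like this can be applied with a plain substring search over the file contents. The following is a minimal sketch under that assumption, not ZeroWine's or pi-ngaji's actual matching code. It uses a subset of the signatures above, written as Python 3 bytes literals, and `detect_vm_tricks` is a hypothetical name.

```python
# Subset of the signatures listed above, as bytes for Python 3.
SIGNATURES = {
    "Red Pill": b"\x0f\x01\x0d\x00\x00\x00\x00\xc3",
    "VirtualPc trick": b"\x0f\x3f\x07\x0b",
    "VMware trick": b"VMXh",
    "Xen": b"XenVMM",
    "Bochs & QEmu CPUID Trick": b"\x44\x4d\x41\x63",
}


def detect_vm_tricks(data, signatures=SIGNATURES):
    """Return the names of all signatures that occur anywhere in data."""
    return [name for name, pattern in signatures.items() if pattern in data]


# Illustrative input; real use would read the sample with open(path, "rb").
print(detect_vm_tricks(b"\x90\x90VMXh\x90"))
```

Short patterns such as `VMXh` can occur by chance in large binaries, so a hit is best treated as a hint rather than proof.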
Sample execution
Analyzing e665297bf9dbb2b2790e4d898d70c9e9
Analyzing registry...
[+] Malware is Adding a Key at Hive: HKEY_LOCAL_MACHINE^G^@Label11^@^A^AÃR^Nreg add "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\Rx.exe" /v debugger /t REG_SZ /d %systemrot%\repair\1sass.exe /f^M
....
[+] Malware Seems to be IRC BOT: Verified By String : ADMIN
[+] Malware Seems to be IRC BOT: Verified By String : LIST
[+] Malware Seems to be IRC BOT: Verified By String : QUIT
[+] Malware Seems to be IRC BOT: Verified By String : VERSION
Analyzing interesting calls..
[+] Found an Interesting call to: FindWindow
[+] Found an Interesting call to: LoadLibraryA
[+] Found an Interesting call to: CreateProcess
[+] Found an Interesting call to: GetProcAddress
[+] Found an Interesting call to: CopyFile
[+] Found an Interesting call to: shdocvw
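Output of this kind can be produced by matching the strings extracted from a sample against keyword lists. The sketch below only illustrates that idea; the keyword lists are copied from the output above, and the matching rules (exact match for IRC commands, substring match for calls) are assumptions rather than the tool's real logic.

```python
IRC_COMMANDS = ["ADMIN", "LIST", "QUIT", "VERSION"]
INTERESTING_CALLS = ["FindWindow", "LoadLibraryA", "CreateProcess",
                     "GetProcAddress", "CopyFile", "shdocvw"]


def report(strings):
    """Build report lines from the strings extracted from a sample."""
    lines = []
    for command in IRC_COMMANDS:
        # Assumed rule: an IRC command must appear as a whole string.
        if command in strings:
            lines.append("[+] Malware Seems to be IRC BOT: "
                         "Verified By String : " + command)
    for call in INTERESTING_CALLS:
        # Assumed rule: substring match, so CreateProcessA also counts.
        if any(call in s for s in strings):
            lines.append("[+] Found an Interesting call to: " + call)
    return lines


for line in report(["QUIT", "CreateProcessA", "hello"]):
    print(line)
```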
Advantages on the researcher's side
Malware writers are usually "lazy", so they tend to reuse chunks of earlier code
Hence, it is easier to trace a sample back to its family based on the commonalities
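One simple way to quantify such commonalities is the Jaccard similarity between the sets of API calls of two samples. This measure is brought in here purely as an illustration; the slides do not state which similarity measure, if any, was used.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Two empty sets are treated as identical.
    return len(a & b) / len(a | b)


# Hypothetical API sets of an old family member and a new sample.
old_sample = {"LoadLibraryA", "GetProcAddress", "CopyFile"}
new_sample = {"GetProcAddress", "CopyFile", "CreateProcess"}
print(jaccard(old_sample, new_sample))
```

A value near 1 suggests heavy code reuse, while a value near 0 suggests the samples share little.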
Our methods
Roughly, our methods consist of:
1 Feature selection (ranking/pruning)
2 Supervised classification
3 Unsupervised classification
Items 2 and 3 above can also be combined into a method known as "semi-supervised classification".
Information Gain
[Zhang et al., 2007, Altaher et al., 2011, Singhal and Raul, 2012] use the following formula to apply IG to malware.
The amount by which the entropy of X decreases, reflecting the additional information about X provided by Y, is called information gain, given by
\[ IG(X \mid Y) = H(X) - H(X \mid Y) \]
[Singhal and Raul, 2012] introduced the following algorithm to "correct out" errors in the results:
\[ IG(X)' = IG(X) \pm \frac{\sum_{i=0}^{n} IG(X_i)}{n} \]
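As a worked illustration of IG(X|Y) = H(X) − H(X|Y), the sketch below takes X to be the class label of each sample and Y the value of one feature. The function names and the base-2 logarithm are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(X) of a list of labels, in bits."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())


def information_gain(labels, feature):
    """IG(X|Y) = H(X) - H(X|Y) for labels X and feature values Y."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        # Labels of the samples where the feature takes this value.
        subset = [x for x, y in zip(labels, feature) if y == value]
        conditional += len(subset) / total * entropy(subset)
    return entropy(labels) - conditional


labels = [1, 1, 0, 0]                          # 1 = malware, 0 = benign
print(information_gain(labels, [1, 1, 0, 0]))  # feature mirrors the label
print(information_gain(labels, [1, 0, 1, 0]))  # feature unrelated to label
```

A feature that mirrors the label removes all uncertainty, so its gain equals H(X); a feature that is independent of the label gains nothing.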
Information Gain (cont’d)
From [Jiang et al., 2011]:
\[ IG(t) = \sum_{c \in \{c_i, \bar{c}_i\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c) \log \frac{P(t', c)}{P(t')\,P(c)} \]
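The double sum runs over class membership (in the class or not) and feature presence (present or absent). A minimal sketch of ranking API features with it follows. The sample data is invented, the helper name is hypothetical, and the base-2 logarithm is an assumption since the formula leaves the base open.

```python
import math


def information_gain_term(samples, labels, term, positive):
    """Sum P(t', c) * log(P(t', c) / (P(t') P(c))) over presence and class."""
    n = len(samples)
    total = 0.0
    for in_class in (True, False):
        p_c = sum(1 for l in labels if (l == positive) == in_class) / n
        for present in (True, False):
            p_t = sum(1 for s in samples if (term in s) == present) / n
            joint = sum(1 for s, l in zip(samples, labels)
                        if (term in s) == present
                        and (l == positive) == in_class) / n
            if joint > 0:  # Zero-probability cells contribute nothing.
                total += joint * math.log2(joint / (p_t * p_c))
    return total


# Invented data: each sample is the set of API calls seen in it.
samples = [{"IsDebuggerPresent", "GetProcAddress"},
           {"IsDebuggerPresent", "GetProcAddress"},
           {"GetProcAddress"},
           {"GetProcAddress"}]
labels = ["malware", "malware", "benign", "benign"]
terms = ["GetProcAddress", "IsDebuggerPresent"]
ranked = sorted(terms, reverse=True,
                key=lambda t: information_gain_term(samples, labels, t, "malware"))
print(ranked)
```

Features at the bottom of the ranking, such as a call present in every sample, are the "noise" candidates for pruning.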
For research purposes, the following issues always arise:
No standard dataset exists, unlike in the Intrusion Detection System (IDS) area
Malware samples change at a fast pace, so the datasets used in an experiment may be questioned
As a last resort, stick to an existing database and try to stay independent of any specific malware family, so that the method could work with incoming, new malware
Table: Differences between clustering and classification

Classification                 Clustering
Deals with known data          Deals with unknown data
Supervised learning            Unsupervised learning
Popular algorithms include:    Popular algorithms include:
Random Forest                  K-means
Neural Networks                Fuzzy C
k-Nearest Neighbor             Gaussian
Decision Trees
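To make the supervised column concrete, here is a minimal k-Nearest Neighbor sketch over binary API feature vectors. The training data is invented, and the Hamming distance and k = 3 are illustrative choices, not settings taken from the study.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query by majority vote among its k nearest neighbors."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: hamming(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented labeled samples: 1 means the API feature was observed.
train_vectors = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1)]
train_labels = ["malware", "malware", "benign", "benign"]
print(knn_predict(train_vectors, train_labels, (1, 1, 0)))
```

The labels must be known in advance, which is what separates this column of the table from the clustering column.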
Classification (supervised) is chosen to deal with a known corpus with incomplete data
Clustering (unsupervised) is chosen to deal with new inputs
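For new, unlabeled inputs, a clustering step groups samples without any prior labels. The sketch below is a bare-bones K-means on invented two-dimensional feature vectors. Using the first k points as initial centroids is a simplification that keeps the example deterministic; real runs normally use randomized initialization.

```python
def kmeans(points, k, iterations=10):
    """Plain K-means; returns (cluster index per point, centroids)."""
    # Simplification: the first k points serve as initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distance).
        assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return assignment, centroids


# Invented feature vectors, e.g. two numeric features per sample.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```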
Some results
We managed to detect several malware samples by using the existing API traces and other features (bot commands, file/registry deletion)
Newer, more sophisticated malware such as Stuxnet/Duqu is very platform specific (it attacks SCADA systems), so detecting it needs more reading. Perhaps the most obvious indicator is whether any XOR'ed communication channels are being used.
The flow
Figure: flow diagram with the stages Feature Selection, Feature Categorization, Clustering, Classification and Visualization; tools labeled on the diagram: Weka, scipy, Octave/Matlab
Altaher, A., Ramadass, S., and Ali, A. (2011). Computer Virus Detection Using Features Ranking and Machine Learning. Australian Journal of Basic and Applied Sciences, 5(9):1482-1486.
Jiang, Q., Zhao, X., and Huang, K. (2011). A feature selection method for malware detection. In 2011 IEEE International Conference on Information and Automation (ICIA), pages 890-895.
Pietrek, M. (1994). Peering Inside the PE: A Tour of the Win32 Portable Executable File Format. http://msdn.microsoft.com/en-us/library/ms809762.aspx.
Singhal, P. and Raul, N. (2012). Malware detection module using machine learning algorithms to assist in centralized security in enterprise networks. International Journal of Network Security & Its Applications, 4.
Zhang, B., Yin, J., Hao, J., Wang, S., and Zhang, D. (2007). New malicious code detection based on n-gram analysis and rough set theory. Pages 626-633. Springer-Verlag, Berlin, Heidelberg.