DIGITAL FORENSIC RESEARCH CONFERENCE

Automatic Classification of Object Code Using Machine Learning

By John Clemens

Presented At The Digital Forensic Research Conference
DFRWS 2015 USA Philadelphia, PA (Aug 9th - 13th)

DFRWS is dedicated to the sharing of knowledge and ideas about digital forensics research. Ever since it organized the first open workshop devoted to digital forensics in 2001, DFRWS continues to bring academics and practitioners together in an informal environment. As a non-profit, volunteer organization, DFRWS sponsors technical working groups, annual conferences and challenges to help drive the direction of research and development.

http://dfrws.org



Automatic Classification of Object Code Using Machine Learning

Architecture and Endianness

John Clemens

University of Maryland, Baltimore County (UMBC), Baltimore, Maryland, clemej1 at umbc.edu
Johns Hopkins University Applied Physics Laboratory (JHUAPL), Laurel, Maryland, john.clemens at jhuapl.edu

DFRWS 2015

J. Clemens (UMBC / JHUAPL) Classification of Object Code DFRWS 2015 1 / 14

Motivation

Reverse engineering and malware analysis efforts are extremely labor-intensive and require expert domain knowledge

Enterprise computing environments and networks are diverse
  - Types of systems: Laptops, Phones, Routers, IoT ...
  - Within each individual system

Reverse engineering efforts are often "black box" tasks where nothing is known about the underlying system

Analysts are looking for tools and techniques to jumpstart this analysis


Motivation

Machine learning using byte histograms has already proven useful to classify general file types / file fragments (McDaniel and Heydari, 2003, among many others).

Can we do better, and start to categorize within these general types?

Start with compiled object code

Very important for both malware analysis and reverse engineering

Enterprise diversity means that the analyst will likely encounter multiple types of object code

Accurate disassembly is crucial

Classification targets: Architecture and Endianness


Motivation

Computers are more than just the CPU

Example: My Thinkpad T400 (2008)
  - Intel Core 2 Duo P8400
  - Intel GMA4500
  - ATI Radeon HD 3400
  - Broadcom Wifi
  - Intel NIC
  - ACPI Embedded Controller
  - SSD Controller
  - ....

Example: My Motorola G (2015)
  - Qualcomm ARM Cortex-A53
  - Adreno 405
  - Cell Baseband processor
  - Battery controller
  - ....

Image credit: Wikipedia


Motivation

Malware authors are using this diversity to hide their software where existing countermeasures can’t reach

"..but the most interesting finding is the malware’s ability to reprogram the victim’s hard drives, making their implants invisible and almost indestructible..." - Kaspersky, Feb 17 2015

"..The malware they created, called BadUSB, can be installed on a USB device to completely take over a PC, invisibly alter files installed from the memory stick, or even redirect the user’s internet traffic..." - Wired, July 2014

"..Developers have published two pieces of malware [Jellyfish rootkit and Demon keylogger] that take the highly unusual step of completely running on an infected computer’s graphics card, rather than its CPU..." - Ars Technica, May 7, 2015


Existing Methods

Existing methods either rely on file metadata to determine architecture, look for existing signatures within the sample, or just try to disassemble it and leave it up to the analyst to determine if the output is valid. Binwalk uses the last two methods.

File metadata can be either missing or misleading
  - Unadorned firmware blobs
  - Obfuscated/packed malware
  - Incomplete file fragments or packet traces
  - Embedded code in native object files

Signature detection requires prior knowledge of the architecture and can lead to false positives

Disassembly methods can be misleading and require tools that support the architecture
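
The metadata-based approach described above can be illustrated for ELF files, whose header records the word size, byte order, and target machine. This is a minimal sketch and not how Binwalk or any particular tool works; the function name `read_elf_metadata` and the `MACHINES` table are illustrative assumptions, and only a handful of `e_machine` values are listed.

```python
import struct

# A few e_machine values from the ELF specification (deliberately not exhaustive).
MACHINES = {2: "sparc", 3: "i386", 8: "mips", 20: "powerpc",
            40: "arm", 62: "amd64", 183: "arm64"}

def read_elf_metadata(data: bytes):
    """Return (machine, endianness, word size) from an ELF header, or None."""
    if len(data) < 20 or data[:4] != b"\x7fELF":
        return None  # no usable metadata: raw blob, fragment, or packed sample
    ei_class, ei_data = data[4], data[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        return None  # header present but fields are invalid
    byte_order = "<" if ei_data == 1 else ">"
    # e_machine is a 16-bit field at offset 18, stored in the file's byte order.
    (e_machine,) = struct.unpack_from(byte_order + "H", data, 18)
    return (
        MACHINES.get(e_machine, f"unknown({e_machine})"),
        "little" if ei_data == 1 else "big",
        32 if ei_class == 1 else 64,
    )

# Hand-built 20-byte header prefix: 32-bit, big-endian, e_type=2, e_machine=8 (MIPS).
header = b"\x7fELF" + bytes([1, 2, 1]) + bytes(9) + struct.pack(">HH", 2, 8)
print(read_elf_metadata(header))
print(read_elf_metadata(b"\x00" * 64))
```

The second call returns `None`, which is the situation the slide describes: once the header is stripped, truncated, or forged, this method has nothing trustworthy to read.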


Proposed Method

Hypothesis

Machine learning techniques can be applied to object code directly (without including metadata) to automatically classify the object code’s target architecture and endianness


Dataset

Characteristics
  - 16,785 unique ELF files (exes/libs)
  - 20 architectures
  - Sources include Debian package repositories, Arduino development kits, and CUDA libraries

Problems
  - Only one compiler (GCC)
  - CUDA sample size
  - Need more embedded/micro-controller samples
  - Heavy bias towards RISC

Architecture  # Samples  Wordsize  Endianness
alpha         1,383      64-bit    Big
hppa          625        32-bit    Big
m68k          1,296      32-bit    Big
arm64         1,134      64-bit    Little
ppc64         823        64-bit    Big
sh4           822        32-bit    Little
sparc64       752        64-bit    Big
amd64         965        64-bit    Little
armel         960        32-bit    Little
armhf         960        32-bit    Little
i386          967        32-bit    Little
ia64          650        64-bit    Little
mips          960        32-bit    Big
mipsel        960        32-bit    Little
powerpc       992        32-bit    Big
s390          649        32-bit    Big
s390x         653        64-bit    Big
sparc         648        32-bit    Big
cuda          17         32-bit    Little
avr           596        8-bit     Little
Total         16,785


Intuition

Can you tell which architectures the above two samples target?

Instructions contain two parts:

opcode: Unique to the architecture

operands: Data to be operated on

Opcode Density

\[
\text{Opcode Density} = \frac{\text{length of opcode}}{\text{average instruction length}}
\]
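
As a rough worked example of the ratio above, consider a fixed-width 32-bit instruction set whose primary opcode field occupies 6 bits, as in the classic MIPS encoding. Measuring both lengths in bits is an assumption made here for illustration, since the slide does not state units, and the function name is hypothetical.

```python
def opcode_density(opcode_bits: float, avg_instruction_bits: float) -> float:
    """Ratio of opcode length to average instruction length (both in bits)."""
    if avg_instruction_bits <= 0:
        raise ValueError("average instruction length must be positive")
    return opcode_bits / avg_instruction_bits

# 6-bit primary opcode in a fixed 32-bit instruction word.
print(opcode_density(6, 32))  # 0.1875
```

Under this reading, a higher ratio means a larger share of each instruction is architecture-specific rather than operand data.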


Feature Vector

To classify Architecture:

Extract code sections of each ELF object

Generate a normalized byte histogram to become a 256-entry attribute vector
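
The histogram step above can be sketched in a few lines. This is an illustrative sketch, not the code used in the talk: it assumes the code-section bytes have already been extracted, and it assumes "normalized" means dividing each count by the total number of bytes. The function name `byte_histogram` is hypothetical.

```python
from collections import Counter

def byte_histogram(code: bytes) -> list[float]:
    """Return 256 relative byte-value frequencies (they sum to 1 for non-empty input)."""
    if not code:
        return [0.0] * 256
    counts = Counter(code)  # iterating over bytes yields integers 0..255
    total = len(code)
    return [counts.get(value, 0) / total for value in range(256)]

sample = bytes([0x00, 0x00, 0x90, 0xFF])
hist = byte_histogram(sample)
print(len(hist), hist[0x00], hist[0x90], hist[0xFF])
```

Dividing by the length makes samples of different sizes comparable, which matters when the inputs range from small libraries to large executables.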

To classify Endianness:

Above is insufficient; endianness requires adjacency information lost in a byte histogram

Look for distinctive patterns: encoding of ’1’ and ’-1’

Count the number of occurrences of ’0x0001’, ’0x0100’, ’0xfffe’, ’0xfeff’

Use these four counts as additional endianness attributes

Experiments with 2-byte bi-grams proved resource intensive and performed poorly compared to this method
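
The four pattern counts described above can be computed directly on the raw bytes. The sketch below is an illustration under stated assumptions: it uses raw (unnormalized) counts taken at every byte offset, which the slides do not specify, and the names `ENDIAN_PATTERNS` and `endian_counts` are hypothetical. The demo words are 1 and -2, since the two's-complement encoding of -2 ends in 0xfffe.

```python
ENDIAN_PATTERNS = (b"\x00\x01", b"\x01\x00", b"\xff\xfe", b"\xfe\xff")

def endian_counts(code: bytes) -> list[int]:
    """Count each two-byte pattern at every byte offset in the code."""
    # Each pattern consists of two different bytes, so matches cannot overlap
    # and bytes.count() finds all of them.
    return [code.count(pattern) for pattern in ENDIAN_PATTERNS]

# The 32-bit values 1 and -2 stored in each byte order.
big = (1).to_bytes(4, "big") + (-2).to_bytes(4, "big", signed=True)
little = (1).to_bytes(4, "little") + (-2).to_bytes(4, "little", signed=True)
print(endian_counts(big))     # [1, 0, 1, 0]
print(endian_counts(little))  # [0, 1, 0, 1]
```

The same values produce mirrored counts depending on byte order, which is the adjacency signal a plain byte histogram cannot capture.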


Results: Overall Performance

Trained Model        Multi-class Strategy  WEKA Name             Histogram  Hist + Endian
1-NN                 Inherent              IBk                   89.3238%   92.7256%
3-NN                 Inherent              IBk                   89.8660%   94.9002%
Decision Tree        Inherent              J48                   93.2976%   98.0697%
Random Tree          Inherent              RandomTree            87.8046%   92.9461%
Random Forest        Inherent              RandomForest          90.4617%   96.4373%
Naive Bayes          Inherent              NaiveBayes            92.5827%   95.8951%
BayesNet             Inherent              BayesNet              89.5144%   92.2252%
SVM (SMO)            1-vs-1                SMO                   92.7256%   98.3497%
Logistic Regression  Inherent              SimpleLogistic        93.0831%   97.9386%
Neural Net           Inherent              MultilayerPerceptron  94.0244%   97.9565%

10-fold stratified cross validation accuracy for various models using the byte-value histogram alone, and the byte-value histogram augmented with heuristic-based endianness attributes.
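
The evaluation protocol named in the caption, stratified 10-fold cross validation, can be sketched in pure Python with a 1-nearest-neighbour classifier. This is not the WEKA setup used for the table: the data below is synthetic and trivially separable, the labels `arch_a` and `arch_b` are made up, and all function names are hypothetical. It is meant only to show how the folds are built and scored.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def nearest_label(train, query):
    """1-NN: label of the training vector with the smallest squared distance."""
    def distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], query))
    return min(train, key=distance)[1]

def cross_val_accuracy(vectors, labels, k=10):
    """Fraction of samples classified correctly when each is held out once."""
    correct = 0
    for held_out in stratified_folds(labels, k):
        held = set(held_out)
        train = [(v, l) for i, (v, l) in enumerate(zip(vectors, labels))
                 if i not in held]
        correct += sum(nearest_label(train, vectors[i]) == labels[i]
                       for i in held_out)
    return correct / len(labels)

# Synthetic stand-in for feature vectors: two well-separated clusters.
rng = random.Random(1)
vectors, labels = [], []
for label, center in (("arch_a", 0.2), ("arch_b", 0.8)):
    for _ in range(30):
        vectors.append([center + rng.uniform(-0.1, 0.1) for _ in range(4)])
        labels.append(label)
print(cross_val_accuracy(vectors, labels))  # 1.0 on this toy data
```

Stratification keeps the class proportions of each fold close to those of the full set, which matters for a dataset where one class (cuda) has only 17 samples.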


Results: Effect of Sample Size


Summary

Contributions of this work include:

A dataset of 16K samples of object code from 20 architectures

Machine learning techniques using byte histograms can automatically classify object code’s target architecture

Endianness determined with high accuracy with the addition of four extra heuristics in the feature vector

High accuracy with a small sample size

A method of automatic object code classification that does not require signatures, toolchain support, correct metadata, or any previous knowledge


Future Directions / Questions?

Future Work
  - More complete dataset (send me your esoteric samples!)
      * Microcontroller / Embedded samples
      * Virtual byte code (e.g. Python, Java, Dalvik/ART, CIL)
  - Word-size detection
  - Compiler attribution
  - Detect embedded code (Thumb vs. ARM)
  - Function / basic block boundary detection
  - Plugin (IDA and friends)

I will post original dataset and updated dataset for download (location TBD)

Questions?
