detection of malware through code reuse and yarabox5781.temp.domains/~dfircouk/papers/detection...


Detection of malware through code reuse and Yara

Adam Burt

3rd August 2018


Abstract

This paper explores the use of Yara to detect malicious code samples in Windows PE files, specifically by analysing code that is reused and writing detection rules for that code. It also examines the methods currently used for malware detection with Yara, such as hash matching, string analysis and import hashes (ImpHashes), and their deficiencies. Four Windows PE files, written in C, are used to demonstrate how existing methods can be subverted: many of the properties of a PE file can change whilst the functionality remains the same and, conversely, many properties can remain the same whilst the functionality changes.


1.1 Introduction

Code reuse in malware is becoming increasingly common. It allows malware authors to release new variants more quickly than they otherwise could. The code that is reused varies, but it tends to be used to introduce new functionality into a malware variant.

As an example, the “notpetya” malware made use of code from the “petya” malware to achieve its goal. Whilst “notpetya” was designed to look like ransomware, its outcome was different from that of other ransomware variants. Similarities between “notpetya” and “petya” were discovered in both behaviour and code. Specifically, “petya” and “notpetya” used the same code for Master File Table (MFT) encryption. The differences lie in how the boot loader achieved this and how the boot loader was overwritten from a dropper.

“Most likely someone ripped the boot loader code straight out of Petya…”.

(MalwareTech, 2017)

Detecting malicious software is usually carried out using an Anti-Virus solution.

Anti-Virus solutions employ various techniques ranging from hash matching of files,

through to behaviour detection. These static and dynamic detections are usually used

sequentially, allowing for quicker detection with the primary static methods and

relying on dynamic methods as a secondary method.

It is a goal of malware to evade detection. To evade the static and simplistic

hash detection, malicious files are altered slightly to ensure a unique hash and

therefore a unique program, each time they are deployed. This can be through

polymorphic code which “uses a polymorphic engine to mutate while keeping the

original algorithm intact” (Wikipedia, 2017). Polymorphic Viruses can also “rely on

mutation engines to alter their decryption routines every time they infect a machine”

(Trend Micro, Unknown). In either case, the functionality of the malicious software remains the same. It is this functionality that results in particular actions being carried out on a system, and those actions can be detected using behavioural engines. It is also this functionality that can be targeted using more advanced static detection methods.

‘Yara’ is a pattern matching tool that is used to “create descriptions of malware

families (or whatever you want to describe) based on textual or binary patterns”

(Alvarez, Unknown). A rule within Yara “consists of a set of strings and a Boolean

expression which determine its logic” (Alvarez, Unknown). Diagram 1 shows a

simple example of what a Yara rule may look like.


Diagram 1 (Alvarez, Welcome to YARA’s documentation! - yara 3.7.0

documentation, 2018)

Yara can also load modules that extend its scanning functionality. To target PE files specifically, “The PE

module allows you to create more fine-grained rules for PE files by using attributes

and features of the PE file format. This module exposes most of the fields present in

a PE header and provides functions which can be used to write more expressive and

targeted rules”. (Alvarez, PE module - yara 3.7.0 documentation, 2018). Within this

module, Yara can calculate an Import Table Hash from the PE file and use it in the

Yara rule. An Import Table Hash was first discussed by FireEye / Mandiant.

Mandiant describe the Import Table Hash as follows: “Mandiant creates a hash based on library/API names and their specific order within the executable. We refer to this convention as an ImpHash (for "import hash")” (Mandiant, 2018). This method was first discussed in FireEye’s paper entitled “Supply Chain Analysis: From Quartermaster to Sunshop” (FireEye, 2014). The hash created by Yara is an MD5 hash of the PE’s import table and is based on the techniques discussed by FireEye / Mandiant (Alvarez, PE module - yara 3.7.0 documentation, 2018). Another

module commonly used is the “Hash” module. “The Hash module allows you to

calculate hashes (MD5, SHA1, SHA256) from portions of your file and create

signatures based on those hashes.” (Alvarez, Hash module - yara 3.7.0 documentation,

2018).


2 PE samples used in examinations

The samples that were created for examination for this paper were written in C

and carry out minimal functionality. The sample names are:

• TestApp1.exe

• TestApp2.exe

• TestApp3.exe

• TestApp4.exe

Each sample carries out the following functions:

1. Creates a file in the directory the EXE is executed from called

“mytestfile.txt”.

2. Writes data (a string) into the newly created file that is “This is test

data”

3. If successful it will print “Data Written Successfully” to the console

screen

4. If unsuccessful it will print “Data could not be written” to the console

screen

5. In either case of unsuccessful or successful the program will then wait for

the user to press a key and then exit

The only addition to the above chain of events is in the sample named “TestApp4.exe”. In this sample, steps 1-4 above are carried out, followed by:

1. Opens the first key in the registry location “HKEY_LOCAL_MACHINE\SOFTWARE\” and prints the data from the “Path” value (if it exists).

The “TestApp4.exe” application then carries out step 5 above (awaiting user input before closing).
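The five steps common to all samples can be sketched as follows. This is only an illustration in Python, whereas the actual samples are written in C; the function name `run_sample` and its parameters are hypothetical, and the target directory is passed in rather than taken from the location of the executable.

```python
import os
import tempfile

FILE_NAME = "mytestfile.txt"
PAYLOAD = "This is test data"


def run_sample(directory, wait_for_key=False):
    """Mirror steps 1-5: create the file, write the string, report the result."""
    path = os.path.join(directory, FILE_NAME)
    try:
        # Steps 1 and 2: create the file and write the test string into it.
        with open(path, "w", encoding="ascii") as handle:
            handle.write(PAYLOAD)
        message = "Data Written Successfully"  # step 3
    except OSError:
        message = "Data could not be written"  # step 4
    print(message)
    if wait_for_key:
        # Step 5: the C samples wait for a key press; input() waits for Enter.
        input()
    return message


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        run_sample(scratch)
```

The registry lookup that “TestApp4.exe” adds is deliberately left out of this sketch.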


3 Examining the MD5 hash and strings

Each of the test applications was run through a tool called ‘FCIV’ (Microsoft, 2018), created by Microsoft. The FCIV tool provides the calculated MD5 sum of each of the files. The results were as follows:

• TestApp1.exe – e69b8e7cd74113d6c94565740dc1e8ff

• TestApp2.exe – b239be61cffd1607ea702bdf263168b0

• TestApp3.exe – 12fbca2eeb557f6b15882f5a373a52e5

• TestApp4.exe – e13c11c301c43a9af1790da13708b5b4

Each file was unique in that the raw content of each file was different. A difference of as little as 1 bit of data will result in a different MD5 sum (excluding collisions) (Selinger, 2018). Using Yara to detect these files, and any subsequent related files, therefore requires the MD5 of every file. The code below represents a Yara rule that will detect these samples:

import "hash"

rule TestAppHash
{
    meta:
        author = "Adam Burt"
        description = "Detect based on MD5 hash"
        date = "2018-08-03"
    condition:
        hash.md5(0, filesize) == "e69b8e7cd74113d6c94565740dc1e8ff" or
        hash.md5(0, filesize) == "b239be61cffd1607ea702bdf263168b0" or
        hash.md5(0, filesize) == "12fbca2eeb557f6b15882f5a373a52e5" or
        hash.md5(0, filesize) == "e13c11c301c43a9af1790da13708b5b4"
}

When testing this rule using the Yara64.exe binary against all of the TestApp files we receive a positive match. Diagram 2 shows the output from running ‘Yara64.exe’ against the test EXE files using the above rule.

Diagram 2 – Results of hash matching Yara rule against test files.
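The sensitivity of hash matching to small changes can be sketched in Python with the standard `hashlib` module. The byte string below is a made-up stand-in rather than one of the TestApp files, and the helper names are hypothetical; the point is only that flipping a single bit produces a digest that no longer appears in a set of known hashes.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def flip_lowest_bit(data: bytes, offset: int) -> bytes:
    """Return a copy of data with one bit flipped at the given offset."""
    changed = bytearray(data)
    changed[offset] ^= 0x01
    return bytes(changed)


# Stand-in bytes for illustration only; this is not a real PE file.
original = b"MZ" + bytes(62) + b"This is test data"
variant = flip_lowest_bit(original, len(original) - 1)

known_hashes = {md5_hex(original)}

print(md5_hex(original) in known_hashes)  # True: the exact file is found
print(md5_hex(variant) in known_hashes)   # False: the one-bit variant is missed
```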


This means that, whilst file hashes are enough to target these samples (which are all related), the resource required to track and create detections for individual hashes soon becomes overwhelming. Because each file can naturally or forcibly produce a different hash value, we will look closer inside the TestApp files for commonalities. We know that each TestApp file creates a file called “mytestfile.txt” and writes the string “This is test data” into that file. They also print the same strings to the console depending on whether the write was successful or not. We make use of the tool “BinText” by McAfee / Foundstone (McAfee, 2018) to examine the strings within the file “TestApp1.exe”. Diagram 3 shows the strings of interest that are present in “TestApp1.exe”.

Diagram 3 – Strings present within “TestApp1.exe”

By using these strings, we can create a Yara rule based on string pattern matching. We use the following rule:

rule TestAppStrings

{

meta:

author = "Adam Burt"

description = "Detect based on strings"

date = "2018-08-03"

strings:

$a1 = "This is test data" ascii

$a2 = "mytestfile.txt" ascii


$a3 = "Data could not be written" ascii

$a4 = "Data written successfully" ascii

condition:

all of them

}

Subsequently, we can test this rule using ‘Yara64.exe’ on all the TestApp

sample files. Diagram 4 shows the results.

Diagram 4 – Matching a Yara string rule against all TestApp samples.
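The effect of the `all of them` condition can be approximated in a few lines of Python. This is a rough sketch and not how Yara is implemented: the function name is hypothetical and the two byte strings are invented stand-ins for file contents. A file only matches when every one of the listed strings is present.

```python
PLAIN_STRINGS = [
    b"This is test data",
    b"mytestfile.txt",
    b"Data could not be written",
    b"Data written successfully",
]


def all_of_them(data: bytes, patterns) -> bool:
    """Rough equivalent of the Yara condition 'all of them' for ASCII strings."""
    return all(pattern in data for pattern in patterns)


# Hypothetical stand-ins for file contents; real samples are full PE files.
plain_sample = b"\x00".join(PLAIN_STRINGS)
other_sample = b"unrelated bytes \x00" + PLAIN_STRINGS[1]

print(all_of_them(plain_sample, PLAIN_STRINGS))  # True: every string is present
print(all_of_them(other_sample, PLAIN_STRINGS))  # False: three strings are missing
```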

Why do we not get a match on all the files? Quite simply, the strings are not present in clear text in each sample file. At run time each sample writes the same string of data to the same filename and then displays the same success or failure message. The difference lies in how the strings are stored in each sample. We again use the tool

“BinText” to display the strings within the “TestApp1.exe” and “TestApp2.exe”

sample files. When comparing “TestApp1.exe” and “TestApp2.exe” there are

noticeable differences in the text being used for file creation, the strings being written

to the file and also the resulting message displayed in the console. Diagram 5 shows

the differences (highlighted) between “TestApp1.exe” and “TestApp2.exe”.

Diagram 5 – String differences in “TestApp1.exe” and “TestApp2.exe”


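The strings are different because they have been encrypted, albeit with a very simple single-byte cipher. The encoded strings listed in the next rule, and the routine examined in section 5, are consistent with a key byte that alternates between 1 and 2 being added to each character, rather than a strict XOR. This introduces two major differences in the files:

1. A new value (in this case a single byte) used as the key for encrypting the strings.

2. A new routine that decrypts the strings in memory before they are used.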

This simple change rules out another method for detecting commonality

between these two sample files – their strings. A rule could be written in Yara to

detect the strings present in each file, such as:

rule TestAppMoreStrings

{

meta:


description = "Detect based on plain and encrypted strings"

date = "2018-08-03"

strings:

$a1 = "This is test data" ascii

$a2 = "mytestfile.txt" ascii

$a3 = "Data could not be written" ascii

$a4 = "Data written successfully" ascii

$b1 = "Ecuc!epwmf!ppv!df\"xtjvugo" ascii

$b2 = "Ecuc!yskuvfp!uvedgtugwmnz" ascii

$b3 = "n{ugtvgkmg/vyv" ascii

$b4 = "Ujju!kt\"ugtv!fbvb" ascii

condition:

all of ($a*) or all of ($b*)

}

When using the above rule against all TestApp samples, Diagram 6 shows the

results.


Diagram 6 – Using multiple strings to detect samples

Whilst this has detected all samples, in the real world these samples would also have varying string content and varying XOR keys or encryption ciphers to handle them. As with MD5 hashes, tracking the strings in each sample requires a large amount of resource and will soon become overwhelming. If this method were to be carried out, it would be just as viable to use MD5 sums for detection.
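To illustrate how little effort such string obfuscation takes, the sketch below decodes the four `$b` strings from the TestAppMoreStrings rule. The key sequence is an inference from those printed strings and is not stated in the text: they decode cleanly when a value alternating between 1 and 2 is subtracted from each byte, so the sketch uses that scheme instead of a single-byte XOR. The function names are hypothetical.

```python
from itertools import cycle

# Inferred from the printed $b strings; an assumption, not the author's source.
KEY_SEQUENCE = (1, 2)


def encode(plain: bytes) -> bytes:
    """Add the alternating key to each byte."""
    return bytes((b + k) & 0xFF for b, k in zip(plain, cycle(KEY_SEQUENCE)))


def decode(encoded: bytes) -> bytes:
    """Subtract the alternating key from each byte."""
    return bytes((b - k) & 0xFF for b, k in zip(encoded, cycle(KEY_SEQUENCE)))


ENCODED = [
    b'Ecuc!epwmf!ppv!df"xtjvugo',
    b"Ecuc!yskuvfp!uvedgtugwmnz",
    b"n{ugtvgkmg/vyv",
    b'Ujju!kt"ugtv!fbvb',
]

for item in ENCODED:
    print(decode(item).decode("ascii"))
# Data could not be written
# Data written successfully
# mytestfile.txt
# This is test data
```

Changing `KEY_SEQUENCE` yields a completely different set of encoded strings for the same behaviour, which is why a string rule has to be extended for every new variant.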


4 Examining the import tables

So far, both hashes and string content have been ruled out as a sustainable

method for detecting families of malware. Maintaining a library of hash values and

strings is resource intensive and cannot keep up with the mutation algorithms /

engines available for varying these parameters in files. Looking at the Import Table Hashes may provide a more accurate method for detecting malware samples. Each of the example files carries out similar tasks and therefore should have a similar ImpHash associated with it. By using VirusTotal, we can upload each TestApp

EXE file to calculate its ImpHash. “VirusTotal inspects items with over 70

antivirus scanners and URL/domain blacklisting services, in addition to a myriad

of tools to extract signals from the studied content.” (VirusTotal, 2018). Diagram

7 shows the results of uploading the samples to VirusTotal.

Diagram 7 – File uploads into VirusTotal

The ImpHashes are calculated for each file and are as follows:

• TestApp1.exe – 9ea0752d8b73240994b03d0c502b2bd1

• TestApp2.exe – ce4bf5ebe8bd1fabe49efb61ec8de70e

• TestApp3.exe – 2004d3668e8b2cd2f627bd16a066c9f5

• TestApp4.exe – 2004d3668e8b2cd2f627bd16a066c9f5

Each ImpHash is different, apart from those of “TestApp3.exe” and “TestApp4.exe”. The reasons for this are simple. Each TestApp file has different Import Table requirements to carry out its functionality. The Import Tables were modified to show that, whilst the functionality of a file remains the same, the content (strings and hash), the Import Tables and consequently the ImpHashes can change. As with MD5 hashing, this can be difficult to maintain and soon becomes resource intensive and overwhelming. TestApp files one through to three were all forcibly modified to show that the attacker, or malware author, can control the ImpHash. When malware categorisation or grouping is based on ImpHash, samples can be missed if the Import Tables have been purposefully modified.

The same principle applies to “TestApp3.exe” and “TestApp4.exe”. They have the same ImpHash; however, each file carries out different functionality. The code in “TestApp4.exe” was modified so that its Import Table looks the same as that of “TestApp3.exe”. The difference in functionality is achieved by loading the additional DLL and resolving its functions at execution time, rather than importing them through the Import Table. Whilst this paper is not focussed on this technique, a short description follows.

When a program calls a function that has a publicly known (or sometimes unknown) API, it can make use of an existing library. An example of this is calling the “RegGetValue” function, which is made available through “Windows.h” within Microsoft Windows. When a program containing this function is compiled, the Import Table includes the “ADVAPI32.DLL” module and the “RegGetValueA” or “RegGetValueW” function within that module. A program can be altered so that “RegGetValueA” or “RegGetValueW” is located in “ADVAPI32.DLL” at execution time, by loading the library and resolving the function address directly, rather than importing it. This generally involves calling the “LoadLibrary” function to get a handle to “ADVAPI32.DLL” and then calling “GetProcAddress” to obtain the address of the function. When inspecting an EXE file that calls a function this way, the Import Table only indicates that “KERNEL32.DLL” is required, along with its “LoadLibrary” and “GetProcAddress” functions. This means the malware author’s EXE file can always have the same ImpHash but carry out varying functions.
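The consequence for ImpHash-based grouping can be sketched in Python. The function below follows the commonly used convention of lower-casing each `library.function` pair, dropping the library extension, joining the pairs with commas in import-table order and taking the MD5; treat it as an approximation of that convention, not as a reference implementation. The import lists are hypothetical and are not the real import tables of the TestApp files.

```python
import hashlib


def imphash(imports):
    """MD5 over 'library.function' names, lower-cased, in import-table order."""
    parts = []
    for library, function in imports:
        name = library.lower()
        for extension in (".dll", ".ocx", ".sys"):
            if name.endswith(extension):
                name = name[: -len(extension)]
                break
        parts.append(f"{name}.{function.lower()}")
    return hashlib.md5(",".join(parts).encode("ascii")).hexdigest()


# Hypothetical import tables, for illustration only.
static_registry_reader = [
    ("KERNEL32.dll", "CreateFileA"),
    ("KERNEL32.dll", "WriteFile"),
    ("ADVAPI32.dll", "RegGetValueA"),
]
dynamic_file_writer = [
    ("KERNEL32.dll", "LoadLibraryA"),
    ("KERNEL32.dll", "GetProcAddress"),
]
# Resolves RegGetValueA at run time, so its import table is unchanged.
dynamic_registry_reader = list(dynamic_file_writer)
reordered = list(reversed(static_registry_reader))

# Different behaviour, identical hash:
print(imphash(dynamic_file_writer) == imphash(dynamic_registry_reader))  # True
# Same imports in a different order, different hash:
print(imphash(static_registry_reader) == imphash(reordered))  # False
```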

This technique allows the Import Table of a legitimate EXE file, perhaps something such as “notepad.exe”, to be replicated by a file that carries out malicious functionality. If it were the intent of the malware author to spoof the Import Table of another program, and Import Tables were used to detect malware samples, there would soon be many false positives.


5 Examining the reused code

Code is reused within each of the TestApp programs from two through to four. “TestApp1.exe” is ignored on the basis that malicious code rarely contains clear text that can be identified. To find this code reuse, the programs need to be reverse engineered and decompiled to examine how they function. Hex-Rays IDA (Hex-Rays, 2018) is used to look at the code involved with each TestApp EXE file.

We know that each TestApp from two through to four utilises a decryption routine to decrypt various strings. In this example, it is this decryption routine that should be targeted as part of a common routine or code reuse detection. In “TestApp2.exe”, “TestApp3.exe” and “TestApp4.exe”, a call to the “CreateFileA” function is found and, just prior to it, an encrypted string is passed through a routine. Diagram 8 shows the routine for “TestApp2.exe”, Diagram 9 shows the routine for “TestApp3.exe” and Diagram 10 shows the routine for “TestApp4.exe”.

Diagram 8 – Potential decryption routine being called in “TestApp2.exe”



Taking a closer look, each of these routines does appear to be a decryption routine (albeit a very simple one). Diagram 11 shows the routine present in

“TestApp2.exe”, Diagram 12 shows the routine present in “TestApp3.exe” and

Diagram 13 shows the routine present in “TestApp4.exe”.


Diagram 11 – The decryption routine in “TestApp2.exe”



Each of these routines has a very similar code base. Whilst some of the

registers used in the opcodes change, the general function is still the same.

Yara allows for variation in code through wildcards and jumps in hexadecimal strings, which work much like regular expressions. These can allow for small, or even large, variations in code. In the routines above we can take a look at some similar code (as byte code):

• “TestApp2.exe”

o 8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01 50 e8 d5 01 00 00 8b

• “TestApp3.exe”

o 8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01 50 e8 55 02 00 00 8b

• “TestApp4.exe”

o 8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01 50 e8 17 03 00 00 8b

The similarities here are quite obvious, except for the two bytes that follow “50 e8” and precede “00 00 8b”. These two bytes belong to the four-byte operand of the call instruction (e8), so the whole operand can be skipped with a four-byte jump in Yara, such that the pattern becomes:

$textdecryptionroutine = { 8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01 50 e8 [4] 8b }
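The matching behaviour of this pattern can be reproduced outside Yara with Python's `re` module. The converter below is a minimal sketch that handles only plain hex bytes, `??` wildcards and `[n]` or `[a-b]` jumps; it is not a full parser for Yara hex strings, and the function name is hypothetical. It is applied to the three byte sequences listed above.

```python
import re


def hex_string_to_regex(hex_string: str):
    """Translate a simple Yara-style hex string into a compiled bytes regex."""
    parts = []
    for token in hex_string.split():
        if token == "??":
            parts.append(b".")  # any single byte
        elif token.startswith("[") and token.endswith("]"):
            bounds = token[1:-1].split("-")
            if len(bounds) == 1:
                parts.append(b".{%d}" % int(bounds[0]))  # fixed-length jump
            else:
                parts.append(b".{%d,%d}" % (int(bounds[0]), int(bounds[1])))
        else:
            parts.append(re.escape(bytes([int(token, 16)])))  # literal byte
    # DOTALL so that "." also matches the byte 0x0a.
    return re.compile(b"".join(parts), re.DOTALL)


PATTERN = "8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01 50 e8 [4] 8b"

SNIPPETS = [
    "8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01 50 e8 d5 01 00 00 8b",
    "8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01 50 e8 55 02 00 00 8b",
    "8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01 50 e8 17 03 00 00 8b",
]

regex = hex_string_to_regex(PATTERN)
for snippet in SNIPPETS:
    print(bool(regex.search(bytes.fromhex(snippet))))  # True for all three
```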


As a quick test, the complete Yara rule would look something like:

rule TestApp2

{

meta:


description = "Detects all variants of TestApp using code matching and

string searching"

date = "03-08-2017"

strings:

$textdecryptionroutine = { 8a 08 40 84 c9 75 f9 2b c2 83 c0 02 6a 01

50 e8 [4] 8b }

condition:

$textdecryptionroutine

}

When running this rule against the “TestApp2.exe”, “TestApp3.exe” and

“TestApp4.exe” we have matches. Diagram 14 shows the output.

Diagram 14 – Code matching against “TestApp2.exe”, “TestApp3.exe”

and “TestApp4.exe”

This Yara rule is effective against these three EXE files; however, the code that it targets is not specific enough. If the same Yara rule is run against a local Windows installation (with various installed programs), there is a match against a DLL provided as part of the Postbox Mail software. This file is as follows:

Name: mozcrt19.dll

MD5 hash: 8f9cded297d37a8b9ad691e6b08dcab2
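A false-positive check of this kind can be approximated without Yara by walking a directory tree and searching each file for the short pattern. The sketch below is a stdlib-only illustration with hypothetical names; it reads every file fully into memory, which would be impractical for a whole installation, and it writes its own two small test files instead of touching real system files.

```python
import os
import re
import tempfile

# The short routine pattern from the rule above, written as a bytes regex;
# ".{4}" stands in for the Yara jump [4].
ROUTINE = re.compile(
    re.escape(bytes.fromhex("8a084084c975f92bc283c0026a0150e8"))
    + b".{4}"
    + re.escape(bytes.fromhex("8b")),
    re.DOTALL,
)


def scan_tree(root, pattern=ROUTINE):
    """Return the sorted paths of files under root whose contents match."""
    hits = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue  # unreadable files are skipped, not treated as clean
            if pattern.search(data):
                hits.append(path)
    return sorted(hits)


with tempfile.TemporaryDirectory() as root:
    routine = bytes.fromhex("8a084084c975f92bc283c0026a0150e8d50100008b")
    with open(os.path.join(root, "sample.bin"), "wb") as handle:
        handle.write(b"\x90" * 16 + routine)
    with open(os.path.join(root, "benign.bin"), "wb") as handle:
        handle.write(b"\x90" * 64)
    print([os.path.basename(path) for path in scan_tree(root)])  # ['sample.bin']
```

Any unrelated file that appears in the result is a candidate false positive and a sign that the pattern needs to cover more of the routine.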

By extending the search throughout the three sample files, the regular expression can be built out to reduce the risk of a false positive. The following Yara rule contains a longer regular expression for matching the code:


rule TestApp

{

meta:


description = "Detects all variants of TestApp using code matching and

string searching"

date = "19-12-2017"

strings:

$textdecryptionroutine = { 83 ec 08 53 55 8b 6c ?? ?? 56 8b c5 57 bb

01 [3] 8d 50 01 8a 08 40 84 c9 75 ?? 2b c2 83 c0 02 6a 01 50 e8 [4] 8b cd 83 c4 08 89

44 [2] 33 f6 8d 79 01 8a 11 41 84 d2 75 ?? 2b cf 74 ?? 8b fd 2b f8 8b d0 89 7c [2] eb

[0-8] 8a 04 17 2a c3 33 c9 66 83 fb 01 0f 94 c1 88 02 46 41 8b d9 8b cd 42 8d 79 01 8b

ff 8a 01 41 84 c0 75 ?? 2b cf 3b f1 72 ?? 8b 44 [2] 5f 5e 5d 5b 83 c4 08 c3}

condition:

$textdecryptionroutine

}

When this rule is run against the three files “TestApp2.exe”, “TestApp3.exe” and “TestApp4.exe”, it matches all three. It also does not match any other file on the test operating system. Diagram 15 shows the results.

Diagram 15 – Code matching all three sample files and reducing false

positives


5.1 Conclusion

From the sample files generated for this research, it was demonstrated that using hashes, string content or Import Tables can be problematic and lead to missed detections and false positives. Whilst these samples were very simple examples, their real-life counterparts are not too dissimilar.

Detecting malware or groups of malware is resource intensive when using a

simple hash. It also becomes problematic when targeting strings within a file as

these are subject to change and are quite often encrypted. Using Import Table

Hashes (ImpHashes) produces false positives and can soon become redundant if

malware authors align their Import Tables to legitimate software. Outside of

behavioural detection, these static methods form a large basis for detecting

malware. By targeting routines within programs and looking for code reuse, detection can be carried out more effectively. However, the trade-off is the time spent analysing these files. It takes a great deal of skill and effort to reverse engineer even a simple program. Locating, matching and documenting common routines or code reuse only adds to the resource and time required.

The more prevalent code reuse is in the malicious programs, the easier it is to

detect. Common code samples can be documented and Yara rules written for them.

In this process, at least, code reuse and therefore routine reuse becomes more

apparent. This helps identify what a program is potentially capable of, based on

known code patterns.


6 Bibliography

Alvarez, V. M. (2018, 08 03). Hash module - yara 3.7.0 documentation. Retrieved from Yara v3.7.0: http://yara.readthedocs.io/en/v3.7.0/modules/hash.html

Alvarez, V. M. (2018, 08 03). PE module - yara 3.7.0 documentation. Retrieved from Yara v3.7.0: http://yara.readthedocs.io/en/v3.7.0/modules/pe.html

Alvarez, V. M. (2018, 08 03). Welcome to YARA’s documentation! - yara 3.7.0 documentation. Retrieved from Yara v3.7.0: http://yara.readthedocs.io/en/v3.7.0/

Alvarez, V. M. (Unknown). YARA - the pattern matching swiss knife for malware researchers. Retrieved from virustotal.github.io: https://virustotal.github.io/yara/

FireEye. (2014, 08 29). Supply Chain Analysis: From Quartermaster to Sunshop. Retrieved from FireEye: https://www.fireeye.com/content/dam/fireeye-www/global/en/current-threats/pdfs/rpt-malware-supply-chain.pdf

Hex-Rays. (2018, 08 03). IDA: About. Retrieved from Hex-Rays: https://www.hex-rays.com/products/ida/

MalwareTech. (2017, 06 27). Petya Ransomware Attack - What's Known | MalwareTech. Retrieved from MalwareTech: https://www.malwaretech.com/2017/06/petya-ransomware-attack-whats-known.html

Mandiant. (2018, 08 03). Tracking Malware with Import Hashing | FireEye Inc. Retrieved from FireEye.com: https://www.fireeye.com/blog/threat-research/2014/01/tracking-malware-import-hashing.html

McAfee. (2018, 08 03). BinText. Retrieved from McAfee: http://b2b-download.mcafee.com/products/tools/foundstone/bintext303.zip

Microsoft. (2018, 08 03). Availability and description of the File Checksum Integrity Verifier utility. Retrieved from Microsoft: https://support.microsoft.com/en-gb/help/841290/availability-and-description-of-the-file-checksum-integrity-verifier-u

Selinger, P. (2018, 08 03). MD5 Collision Demo. Retrieved from mscs.dal.ca: https://www.mscs.dal.ca/~selinger/md5collision/

Trend Micro. (Unknown). Polymorphic Virus - Definition - Trend Micro USA. Retrieved from Trend Micro: https://www.trendmicro.com/vinfo/us/security/definition/Polymorphic-virus

VirusTotal. (2018, 08 03). Upload a sample. Retrieved from VirusTotal: https://www.virustotal.com/#/home/upload

Wikipedia. (2017, 11 23). Polymorphic code - Wikipedia. Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Polymorphic_code
