

FAST FINGERPRINTING OF OLE2 FILES... EDWARDS & BACCAS

172 VIRUS BULLETIN CONFERENCE OCTOBER 2011

FAST FINGERPRINTING OF OLE2 FILES: HEURISTICS FOR DETECTION OF EXPLOITED OLE2 FILES BASED ON SPECIFICATION NON-CONFORMANCE

Stephen Edwards, Paul Baccas

SophosLabs UK, The Pentagon, Abingdon, UK

Email {stephen.edwards, paul.baccas}@sophos.com

ABSTRACT

Today, the main class of malicious OLE2 files seen by SophosLabs exploits vulnerabilities in Microsoft Office applications. These files are used to install malware, most often rootkits, backdoors, or downloaders. Ten years ago, SophosLabs would have been inundated with self-replicating threats or macro-based trojans. As the attack vector has changed, detection techniques have adapted, and knowledge of the OLE2 specification is a powerful tool in the fight.

OLE2 documents are complex, so the cost of parsing them in order to detect an exploit directly can be prohibitive for a security scanner. However, Microsoft Office file formats typically have early records with a significant number of rigidly defined fields. This paper investigates whether non-adherence to the specification within these fields can be used as a low-cost heuristic to improve detection of this class of malware. It also sets out which violations are pertinent to exploit detection, based on scans of diverse clean and exploited files.

DEFINITIONS

CFBF / CFB – Compound File Binary Format, a Microsoft specification for a type of file structured as a file system, able to hold arbitrary content.

OLE2 – Object Linking and Embedding version 2, a Microsoft programming construct.

OLE2 files/documents – common informal name for a Compound File Binary Format file.

Exploit file – a file crafted to make use of a programming error within the file format parser, typically to execute malicious code.

Vulnerability – a programming error within the file format parser which could lead to exploitation.

INTRODUCTION

The OLE2 file format is complex and large. The specifications contain hundreds of different binary structures, which may need to be parsed down to the bit level. OLE2 files may even contain other embedded OLE2 objects, allowing for recursion. Consequently, it is no surprise that the format is hard and expensive to fully parse.

In the case of crafted malicious documents, the exploited part of the file format may be deep within the OLE2 structure. As anti-malware products must scan far larger numbers of clean documents than malicious ones, it is too expensive to parse every OLE2 file completely. As in the more traditional scanning of executable file formats, filtering must be employed to quickly screen out files which rarely contain malware, while selecting more suspicious files for more extensive processing.

Historically, there have been many approaches to this filtering. Examples include confirming the existence (or lack) of macros, entropy analysis for possible payload files, and white- or blacklisting based on metadata within the file.

This paper investigates a further method: tracking violations of the document specification, in order to correlate sets of violations with distinct sets of files. Even if such groupings are not directly malicious, if the cost of creating them is low and if families of malware show a preference for certain groups, then this approach will be an effective filter.

BACKGROUND

OLE2 files are synonymous with, but not limited to, Microsoft Office applications (from Word 6.0 [1] and Excel 5.0 [2] onwards). For the first two years of OLE2’s life in Microsoft Office (1993-1995), security scanners ignored the OLE2 nature of the file (the header is valid COM code). This changed with the arrival of WM.Concept [3]. Security scanners scrambled to detect and disinfect the new wave of macro viruses [4].

As previously noted, the OLE2 file format is a complex one:

‘... more complex than that of a conventional executable. A Word document looks like an entire filing system, ...’ [5].

In order to disinfect safely, the file format needs to be understood. At the time of WM.Concept, there was no official, publicly available documentation. Therefore, in order to understand the format, developers of anti-virus programs had either to reverse engineer it, or to negotiate with Microsoft. Neither solution was wholly desirable:

• Information from Microsoft was relatively slow to get hold of [6], prone to errors [7], and subject to non-disclosure agreements [8].

• Reverse engineering a file format is slow and prone to errors in understanding [9]. Still, it is the best option for rapid response to a new exploit targeting an undocumented feature!

Neither approach was perfect, and both caused issues: for example, one study [10] found that 13% of the products tested crashed when the sector size was 4,096 bytes. The annals of Office macro malware are, unfortunately, littered with examples of replicating malware that seem to have been either poorly disinfected or deliberately corrupted. These include:

• X97M/Jini.A1 [11]

• W97M/Melissa.W [12]

• W97/Marker.GO [13]

• W97M/Class.EZ [14].

In the latter case (deliberately corrupted), this would have indicated a high-level understanding of the OLE2 file format. We know that malware authors have some understanding of OLE2:

• Anarchy.6093 [15]

• Navrhar [16]


• {Win32,W97M}/Beast [17]

• {Win32,W97M}/Coke.22231 [18]

• {W95,W97M}/Heathen [19].

Yet it is doubtful that the malware authors had signed an NDA with Microsoft. So they must have been reversing the file format, and this accounts for the logical errors within the code.

In September 2007, Microsoft made a number of specifications publicly available under the Microsoft Community Promise [20]. The OLE2 or Compound File Binary Format was not released in the first tranche of specifications but is now available via MSDN [21].

Throughout this investigation the public specifications provided by Microsoft [22, 23] are used, and each is cited with the date of the version consulted.

METHODS AND TOOLS

Overview

A set of Python modules was developed to parse OLE2 files and store, process and display the associated data.

The relevant Microsoft specifications follow RFC 2119 [24] in their usage of ‘MUST’, ‘SHOULD’ and ‘MAY’. When these terms are used, the parsing script stores whether a violation occurs, e.g. from page 26 of MS-CFB v20101230:

‘The root directory entry’s Name field MUST contain the null-terminated string “Root Entry” in Unicode UTF-16.’

This example was chosen for several reasons:

• it is simple and unambiguous

• clean files have been observed that violate even this simple part of the specification

• malicious documents have been observed to alter this field (e.g. missing the final ‘y’). Since Microsoft Office will still open these files, this is presumably done either to break parsing, avoid detection, or both.
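The ‘Root Entry’ rule quoted above can be checked in a few lines. The following is a minimal sketch rather than the authors’ code: the function name root_name_violation is hypothetical, and it assumes the Name field occupies the start of the 128-byte directory entry and is encoded as UTF-16 little-endian.

```python
def root_name_violation(entry):
    """Return 1 if the entry's Name field breaks the 'Root Entry' MUST rule, else 0.

    Assumption: the Name field starts at offset 0 of the directory entry
    and holds a null-terminated UTF-16LE string.
    """
    expected = u"Root Entry\x00".encode("utf-16-le")  # 22 bytes including the terminator
    return 0 if bytes(entry[:len(expected)]) == expected else 1


# Synthetic 128-byte entries, built only for illustration.
good_entry = u"Root Entry\x00".encode("utf-16-le").ljust(128, b"\x00")
bad_entry = u"Root Entr\x00".encode("utf-16-le").ljust(128, b"\x00")  # final 'y' missing

print(root_name_violation(good_entry), root_name_violation(bad_entry))
```

Returning 0 or 1 instead of a boolean mirrors the ‘violates’ values stored by the parser extracts shown later in this paper.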

File format

Other researchers have provided outlines of the file format [25]; however, because several terms need to be defined, a brief overview follows.

The OLE2 file format begins with the magic eight bytes 0xd0cf11e0a1b11ae1, often given as just the first four bytes (see Figure 1). The header is 512 bytes long and is at offset 0. Simple parsing of this header reveals the sector size and the position of the directory chain. The sector size is either 512 or 4,096 bytes.
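The header checks described above could look roughly like the sketch below. It is not taken from the paper’s appendix. The field offsets (sector shift at byte 30, first directory sector at byte 48) and the rule that sector n starts at (n + 1) × sector size are assumptions based on our reading of MS-CFB and should be verified against the specification. The name parse_header is hypothetical.

```python
import struct

MAGIC = bytes(bytearray.fromhex("d0cf11e0a1b11ae1"))


def parse_header(header):
    """Extract the sector size and directory position from a 512-byte header."""
    if len(header) < 512 or bytes(header[:8]) != MAGIC:
        raise ValueError("not an OLE2 compound file")
    # Assumed offsets: sector shift (uint16) at 30, first directory sector (uint32) at 48.
    sector_shift, = struct.unpack_from("<H", header, 30)
    first_dir_sector, = struct.unpack_from("<I", header, 48)
    sector_size = 1 << sector_shift
    return {
        "sector_size": sector_size,
        "sector_size_violates": 0 if sector_size in (512, 4096) else 1,
        "first_dir_sector": first_dir_sector,
        # Assumed: the header fills the first sector-sized block of the file.
        "dir_offset": (first_dir_sector + 1) * sector_size,
    }


# Synthetic header for illustration: 512-byte sectors, directory in sector 1.
demo = bytearray(512)
demo[:8] = MAGIC
struct.pack_into("<H", demo, 30, 9)
struct.pack_into("<I", demo, 48, 1)
info = parse_header(bytes(demo))
print(info["sector_size"], info["dir_offset"])
```

An unexpected sector size is recorded as a violation instead of being rejected, so that malformed files still produce a fingerprint.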

The directory chain or Compound File Directory Sector contains information about the stream and storage objects within the compound file. The directory chain is a red-black [26] binary search tree that allows for efficient searching. The first entry within the directory chain is always the Root entry and should begin with the Unicode string ‘Root Entry’. Each entry is 128 bytes in length. A Word document should contain an entry starting with the Unicode string ‘WordDocument’ (see Table 1).

The relevant entry will contain the position of the start of the stream.
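A single 128-byte directory entry can be decoded as sketched below. The offsets used (name length at byte 64, object type at byte 66, starting sector at byte 116) are assumptions from our reading of MS-CFB, and parse_dir_entry is a hypothetical helper, not part of the paper’s parser.

```python
import struct

ENTRY_SIZE = 128


def parse_dir_entry(entry, sector_size=512):
    """Decode the name, object type and stream start of one directory entry."""
    # Assumed: name length is counted in bytes and includes the 2-byte terminator.
    name_len, = struct.unpack_from("<H", entry, 64)
    name_len = min(name_len, 64)  # clamp, since malformed files may lie
    name = bytes(entry[:max(name_len - 2, 0)]).decode("utf-16-le", "replace")
    obj_type = bytearray(entry)[66]
    start_sector, = struct.unpack_from("<I", entry, 116)
    return {
        "name": name,
        "type": obj_type,
        "start_sector": start_sector,
        # Only meaningful for streams held in regular sectors; small streams
        # may live in the mini stream, which this sketch does not handle.
        "offset": (start_sector + 1) * sector_size,
    }


# Synthetic entry for illustration: a stream named 'Workbook' starting in sector 3.
raw_name = u"Workbook\x00".encode("utf-16-le")
demo_entry = bytearray(ENTRY_SIZE)
demo_entry[:len(raw_name)] = raw_name
struct.pack_into("<H", demo_entry, 64, len(raw_name))
demo_entry[66] = 2  # assumed type code for a stream object
struct.pack_into("<I", demo_entry, 116, 3)
parsed = parse_dir_entry(bytes(demo_entry))
print(parsed["name"], parsed["offset"])
```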

OLE2 parser

While there are existing parsers available, extensive changes would have been necessary to track specification violations in the required manner. Additionally, a high percentage of the parsing development cost was encoding the specification, and this had already been achieved for significant portions of the standard. However, as this approach used the Sophos in-house language, VDL, along with the OLE2 parsing capabilities of the scanning engine, it was not accessible to the wider community.

To allow reproduction and further experimentation, the decision was made to port to Python. Porting first-party code was judged to be cheaper than adapting code from a third party. Python was chosen because it is highly accessible, easily extensible, and lends itself to clean code.

Figure 1: OLE2 compound file format.

OLE2 type | Unicode string
Word | WordDocument
Excel | WorkBook
PowerPoint | PowerPoint Document

Table 1: Unicode string associated with OLE2 file types.

The OLE2 parser has some limitations that, due to time constraints, were not fixed for this paper. The main limitation is that the directory chain is not fully parsed; it is assumed that the relevant stream will be in the first four entries. Further, the parser invariably uses three file loads to locate the interesting stream. One solution for both these problems would be to load a single larger chunk from disk and restrict all parsing to this block.
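The single-chunk idea can be sketched as follows. This is an illustration of the suggestion, not the parser used for this paper. The Chunk class, the find_flavour function and the 8,192-byte default are all hypothetical. Names are compared case-insensitively, which is a design choice of this sketch.

```python
import io

FLAVOURS = ("WordDocument", "WorkBook", "PowerPoint Document")  # names as given in Table 1


class Chunk(object):
    """A prefix of a file held in memory; reads outside it fail loudly."""

    def __init__(self, fileobj, size=8192):
        self.data = fileobj.read(size)  # one load from disk

    def read(self, offset, length):
        if offset < 0 or offset + length > len(self.data):
            raise IndexError("read outside the loaded chunk")
        return self.data[offset:offset + length]


def find_flavour(chunk, dir_offset, max_entries=4):
    """Look for a known stream name in the first few 128-byte directory entries."""
    for i in range(max_entries):
        entry = chunk.read(dir_offset + i * 128, 128)
        name = entry[:64].decode("utf-16-le", "replace").split(u"\x00")[0]
        for flavour in FLAVOURS:
            if name.lower() == flavour.lower():
                return flavour, i
    return None, None


def _entry(name):
    # Build a synthetic directory entry holding only a name.
    return (name + u"\x00").encode("utf-16-le").ljust(128, b"\x00")


# Synthetic file image: the directory is assumed to start at offset 1024.
image = b"\x00" * 1024 + _entry(u"Root Entry") + _entry(u"Workbook") + b"\x00" * 256
chunk = Chunk(io.BytesIO(image))
print(find_flavour(chunk, 1024))
```

Raising an error on out-of-range reads means a file that needs deeper parsing is identified explicitly, without a second disk access happening silently.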

The source code for the OLE2 parser and header validator is given in Appendix I and II respectively. Abridged extracts are used to illustrate the core functionality.

After the file has been determined to be a specific subtype, the relevant parser is invoked:

    if flavour == "Workbook" and options.types["xls"]:
        xls = ole2_excel.parser()
        if xls.parse(fd, flav_offset) != 0:
            # parsing failed: report the file and discard the partial result
            print("Couldn't complete parsing " + str(filename))
            xls = None

flav_offset holds the file offset to the stream that ole2_excel.parse() requires. This function parses the stream linearly, recording violations in a dictionary of dictionaries:

    # page 213 [MS-XLS].pdf
    # rupYear: "The value MUST be 0x07cc or 0x07cd"
    # "Excel 97 writes 0x07CC for rupYear."
    rupYear = bof.litEndH(6, 8)
    logging.debug("bof_rupYear: " + rupYear)
    vio = 0 if rupYear in ('07cc', '07cd') else 1
    store(index, self.must_dict, 'bof_rupYear',
          ("violates", vio),
          ("uint", int(rupYear, 16)),
          ("name", "rupYear"),
          ("page", 213))

Storing in this manner allows for easy searching via names from the specification documents. These structure names are often used in CVE descriptions. Note that the bits of the Excel BOF structure have been wrapped in a utility class which enables easy access in different formats, in this case as little-endian hex.
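The extract above relies on a store() helper and a bit-access utility class whose source is not shown in this section. One possible shape for them is sketched below. The class name BitView, the bit ordering (least-significant bit of each byte first) and the body of store() are assumptions made for illustration; the actual implementations in the appendix may differ.

```python
import struct


class BitView(object):
    """Hypothetical stand-in for the BOF utility class used in the extracts."""

    def __init__(self, data):
        self.data = bytes(data)
        # Assumed ordering: least-significant bit of each byte comes first.
        self.bits = "".join(format(b, "08b")[::-1] for b in bytearray(self.data))

    def litEndH(self, start, end):
        """Hex string of bytes [start:end] read as a little-endian integer."""
        chunk = bytearray(self.data[start:end])
        return "".join(format(b, "02x") for b in reversed(chunk))


def store(index, target, key, *fields):
    """Record one observation as target[index][key] = {field name: value}."""
    target.setdefault(index, {})[key] = dict(fields)


# Synthetic 16-byte BOF body: rupYear = 0x07cc, and bit 9 of the flags set.
bof = BitView(struct.pack("<HHHHII", 0x0600, 0x0005, 0x0DBB, 0x07CC, 0x00000200, 0))
must_dict = {}
rupYear = bof.litEndH(6, 8)
vio = 0 if rupYear in ("07cc", "07cd") else 1
store(0, must_dict, "bof_rupYear", ("violates", vio), ("uint", int(rupYear, 16)))
print(rupYear, bof.bits[64 + 9], must_dict)
```

Keying the inner dictionary by the specification’s field name is what makes later lookups by structure name straightforward.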

There are numerous reserved bitfields which may cause violations; these are also recorded:

    reserved1 = bof.bits[64+19:64+19+13]  # "MUST be zero, and MUST be ignored"
    vio = 0 if reserved1 == "0000000000000" else 1  # 13 zeros
    store(index, self.must_dict, "bof_reserved1",
          ("violates", vio),
          ("bits", reserved1),
          ("name", "reserved1"),
          ("page", 214))

In addition to actual breaches of the specification, there are certain fields which seemed inherently interesting, e.g.:

    fOOM = bof.bits[64+9]  # I - fOOM "this file had an out-of-memory error"
    if fOOM == "1":
        store(index, self.interesting_dict, "bof_bitfield01_fOOM",
              ("flag", 1),
              ("name", "fOOM"),
              ("page", 213))

Once parsed, the modules allow for the combined results to be stored to a Python Pickle (an external file holding a serialized representation of the data), which can later be updated with additional results. A set of data processing scripts was developed to produce statistics, cluster the results, and translate the clusters of data into groups of actual files for analysis. The open source libraries matplotlib, scipy and hcluster were key to this processing. Hcluster is a specialized set of modules for cluster generation and visualization.

Violations and other interesting traits are recorded as boolean values. These are presented as a string of zeroes and ones, often termed a ‘bitstring’. A less generic term that is also used in this paper is the ‘extended bitstring’. This is a bitstring composed of all bits for features which may violate the specifications, as well as all recorded bits from bitfields. These bitfields are almost exclusively parts listed in the specifications as reserved bits or unused bits, and thus were thought likely to be distinguishing features.
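Turning the recorded values into a bitstring might look like the sketch below. The function to_bitstring and the feature ordering are hypothetical. The sketch covers only the boolean ‘violates’ and ‘flag’ values and does not expand raw bitfields as the extended bitstring does.

```python
def to_bitstring(record, feature_order):
    """Build a bitstring from one file's recorded fields.

    record maps a feature key to its stored dictionary. A feature that was
    never recorded is treated as '0'. The fixed feature_order keeps bit
    positions comparable between files.
    """
    bits = []
    for key in feature_order:
        field = record.get(key, {})
        flagged = field.get("violates", 0) or field.get("flag", 0)
        bits.append("1" if flagged else "0")
    return "".join(bits)


# Hypothetical ordering and record, using key names from the extracts above.
FEATURE_ORDER = ["bof_rupYear", "bof_fRisc", "bof_reserved1", "bof_bitfield01_fOOM"]
record = {
    "bof_rupYear": {"violates": 1, "uint": 0x0700},
    "bof_reserved1": {"violates": 0, "bits": "0" * 13},
    "bof_bitfield01_fOOM": {"flag": 1},
}
print(to_bitstring(record, FEATURE_ORDER))
```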

RESULTS

Samples

Collecting a representative set of OLE2 files proved a large task, so it was decided to limit data collection to Excel files. These were obtained from several sources and grouped:

• Sophos Master Collection (MASTER) – files that SophosLabs passes on to other malware researchers. Includes macro and formula viruses, as well as exploited files. Known to cover many versions of Excel.

• False positive set (FPS) – files that are used to test the Sophos products. These are clean files, though some are corrupt or malformed and are known to have caused problems for AV products in the past.

• OLE2SC – files detected by Mal/OLE2SC-A, a Sophos identity that looks for known shellcode within OLE2 files.

• MSExcel – files that match competitors’ detections of ‘Exploit[\.-]MSExcel’ within the SophosLabs files database.

• CVE – files that match ‘Exploit[\.-]CVE’.

Many of these groups contain non-Excel files; however, the scripts were instructed to process only the required type.

Bitstrings

Initially, we scanned the individual groups and graphed the frequency of occurrence of each possible violation in the extended XLS bitstring.

Figure 2: Violations in Sophos MASTER set.


The Sophos MASTER set (see Figure 2) shows that many files contain violations of the specifications. The figure plots the index into the bitstring on the x-axis against the log of the number of violations on the y-axis.

The OLE2SC data (see Figure 3) shows fewer violations, though several are shared with the Sophos collection.

The FPS data (see Figure 4) again has many violations and is closer to the natural files within the Sophos collection. Note that a great many clean files violate certain portions of the specification: the high peaks here represent ~80% of the set population. Having only 20% of the population conform to a part of the specification suggests that the published information may be incorrect or incomplete.

Additionally, there appear to be some bits which never violate in the known malicious OLE2SC set, but do in the clean set. This may indicate that certain kinds of clean files are never used as the base for exploited files, but the OLE2SC set is significantly smaller than the FPS set, so this may just be noise. More detailed investigation is required here.

The MSEXCEL data (see Figure 5) has fewer violations over the whole bitstring than all but the OLE2SC data.

The CVE data (see Figure 6) has slightly more violations, but fewer than the clean set.

WorkBook

Within the OLE2 specification there is no set position for any part of the file, excepting the header, which must be at the beginning. In practice many files have the WorkBook stream at a fixed point (see Table 2).

Taking the sets as a whole, less than 4% have the WorkBook stream past the 8 KB boundary (see Figure 7). Parsing the file up to and including this stream is likely to be cheap enough for most applications.
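The ‘less than 4%’ figure can be reproduced from the counts in Table 2. The sketch below assumes that the ‘Other’ row of that table corresponds to streams starting beyond the listed offsets, and it simply sums the five sets without removing any files that appear in more than one set.

```python
# Counts copied from Table 2; columns are Sophos, FPS, OLE2SC, CVE, MSEXCEL.
TABLE2 = {
    512: (1172, 3358, 131, 38, 567),
    1024: (0, 2, 0, 0, 0),
    1536: (775, 524, 27, 30, 342),
    2048: (12, 6, 12, 2, 6),
    2560: (709, 96, 1, 1, 0),
    3072: (6, 11, 0, 0, 0),
    3584: (7, 4, 0, 0, 0),
    4096: (1699, 262, 5, 1, 3),
    4608: (0, 5, 0, 0, 1),
    5120: (87, 50, 0, 0, 0),
    5632: (2, 23, 0, 0, 0),
    6144: (1, 164, 0, 0, 0),
    6656: (2, 6, 0, 1, 1),
    7168: (2, 3, 0, 0, 0),
    "Other": (44, 207, 24, 3, 80),
}

total = sum(sum(row) for row in TABLE2.values())
other = sum(TABLE2["Other"])
percent_other = 100.0 * other / total
print(total, other, round(percent_other, 1))
```

With these counts, 358 of 10,515 entries fall in the ‘Other’ row, or about 3.4%.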

Clustering

All sets were scanned for specification violations and the results combined. The hcluster implementation of Ward’s Algorithm was used over the extended Excel bitstring to cluster the files hierarchically.

A dendrogram of the relationships between items was produced. This is a hierarchical structure, with each node representing a cluster. The highest node represents the cluster which contains all items. Moving down the branches splits the items into larger numbers of increasingly smaller and tighter groupings. In this type of visualization, the y-axis denotes distance between nodes/clusters, and each point on the x-axis is a single sample. The vertical distance between two nodes is analogous to the distance between two clusters (see Figure 8).

Figure 3: Violations in OLE2SC set.

Figure 4: Violations in FPS set.

Figure 5: Violations in MSEXCEL set.

Figure 6: Violations in CVE set.

The set to be clustered was sufficiently large (9,457 files) that displaying all leaf nodes was impractical. It was decided to contract leaf nodes of non-singleton clusters past a certain depth. This makes higher-level clusters more distinct, but does obscure the size of groups with many similar members. For example, Group 3 appears small, but in fact contains ~20% of the total files, with 99.4% of the 1,831 members having an identical bitstring.

For dendrograms, the threshold for indicating a significant cluster is a matter of choice. In Figure 8 a value of 12 was used. This groups the files into 18 groups, some better defined than others. Groups 0, 3, 5, 6, 8, 10 and 15 are tight; groups 2, 4, and 14 are quite diverse. Group 14 is a good example of a group that should probably be subdivided, but the hcluster library only permits a single cut-off value.
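To make the role of the cut-off concrete, the sketch below implements a small, dependency-free version of Ward-style agglomerative clustering over bitstrings and stops merging once the cheapest merge exceeds a threshold. It is not the hcluster code used for this paper. The function name, the toy bitstrings and the threshold values are hypothetical, and the distance scale is not claimed to match the value of 12 used in Figure 8.

```python
import math


def ward_flat_clusters(bitstrings, threshold):
    """Greedy Ward-style merging of bitstrings, cut at a single threshold.

    Each bitstring is treated as a 0/1 vector. The cost of merging clusters
    A and B is sqrt(2*|A|*|B| / (|A|+|B|)) times the distance between their
    centroids, so two single samples cost their plain Euclidean distance.
    """
    clusters = [([i], [float(c) for c in s]) for i, s in enumerate(bitstrings)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = math.sqrt(2.0 * na * nb / (na + nb) * d2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        if cost > threshold:
            break  # every remaining merge is at least this expensive
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((ma + mb, centroid))
    return sorted(sorted(members) for members, _ in clusters)


toy = ["0000", "0001", "1110", "1111"]
print(ward_flat_clusters(toy, 1.5))
```

A single threshold applies to the whole tree, so a diverse group can only be split by lowering the threshold everywhere, which also splits the tight groups.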

Clustering algorithms will always produce sets. While Figure 8 shows an uneven spread of results that is expected from non-random data, further analysis is required to determine if the discovered clusters are meaningful.

Groups 0, 5, 6, 10 and 15 are all mainly infected with macro or formula viruses. In Group 15, 78.5% of the files are XM97/Hidemod-A [27]. Example bitstring violations for those groups are shown in Table 3.

Group 8 – Troj/DocDrop-S

Over 50% of the files in Group 8 are detected by the identity Troj/DocDrop-S [28]. Troj/DocDrop-S detects the majority of files exploiting CVE-2009-3129 [29].

In fact, a grep for the Unicode string ‘LaserJet’ hits 124 files (see Figure 9).
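A search of this kind can be approximated in Python as below. The sketch assumes that ‘Unicode string’ means UTF-16 little-endian, as elsewhere in OLE2 files. The helper name contains_utf16 and the sample data are hypothetical.

```python
def contains_utf16(data, text):
    """Return True if text, encoded as UTF-16LE, occurs anywhere in data."""
    return text.encode("utf-16-le") in data


# Synthetic buffers for illustration only.
wide = b"\x00\x01" + u"HP LaserJet".encode("utf-16-le") + b"\xff"
narrow = b"HP LaserJet"  # plain ASCII, so the UTF-16 needle should not match

print(contains_utf16(wide, u"LaserJet"), contains_utf16(narrow, u"LaserJet"))
```

For a file on disk, the same check could be applied to the bytes returned by open(path, "rb").read().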

Table 4 demonstrates that multiple vendors believe that this group is significant. Not all the Sophos detections are Troj/DocDrop-S, but of those that are not, most are closely related detections. By the time this paper is published the intention is to fold the minor detections into the main Troj/DocDrop-S.

Troj/DocDrop-S detections are in Groups 1 (2 out of 173), 8 (92 out of 165), and 9 (186 out of 2,147); approximately 1/5 of the files exploit some vulnerability in this group.

Offset (decimal) | Sophos (MASTER) | FPS | OLE2SC | CVE | MSEXCEL
512 | 1172 | 3358 | 131 | 38 | 567
1024 | 0 | 2 | 0 | 0 | 0
1536 | 775 | 524 | 27 | 30 | 342
2048 | 12 | 6 | 12 | 2 | 6
2560 | 709 | 96 | 1 | 1 | 0
3072 | 6 | 11 | 0 | 0 | 0
3584 | 7 | 4 | 0 | 0 | 0
4096 | 1699 | 262 | 5 | 1 | 3
4608 | 0 | 5 | 0 | 0 | 1
5120 | 87 | 50 | 0 | 0 | 0
5632 | 2 | 23 | 0 | 0 | 0
6144 | 1 | 164 | 0 | 0 | 0
6656 | 2 | 6 | 0 | 1 | 1
7168 | 2 | 3 | 0 | 0 | 0
Other | 44 | 207 | 24 | 3 | 80

Table 2: Offset to WorkBook stream, different sets.

Figure 7: Offset to WorkBook stream, percentage, different sets.

Figure 8: Dendrogram showing clustered samples.

Group | Bitstring
0 | 0000110100010000000000000000000000000000000000000000000000
5 | 0000010110011000000000000000000000000100100000000000000000
6 | 0000010100011000000000000000000000000100110000000000000000
10 | 0000010110001000000000000000000000000100010000000000000000
10 | 0000110110011000000000000000000000000100010000000000000000
10 | 0000010110011000000000000000000000000100010000000000000000
15 | 0000110110010000000000000000000000000100010000000000000000
15 | 0000110110010010000000000000000000000100010000000000000000

Table 3: Example bitstrings from clusters.


The Group 8 bitstring is 0001010110011000000000000000000000000100110000000000000000. The bits that are set are detailed in order (see Table 5).

Interpreting these bits with a little guesswork suggests the original file was only ever edited on a Mac, probably with a version of Excel older than Excel 97. Mac Excel products have existed since 1985. The spec lists the lowest valid identifier as 0x07cc for Excel 97, while in this sample we have 0x0700. Perhaps the spec doesn’t fully cover older versions. Notably, Excel 97 was version 8, so having the previous version set rupYear to 0x0700 could make sense.

CONCLUSION

Tracking sufficient significant features of OLE2 documents can fingerprint a file as being a type more commonly used by malware authors. The tracked features need not be related to an exploit. This allows for higher cost scanning on those classes of file more likely to be malicious.

While the initial clustering step is slow, classifying an individual file into an established hierarchy is fast. The clustered files need more analysis to determine the significance of groups, but once meaningful groups are established, the scanning process for a new file can be optimized.
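One simple way to place a new file into established groups is nearest-representative matching on Hamming distance, sketched below. This is an illustration and not the method evaluated in this paper. The representatives are the example bitstrings from Table 3 plus the Group 8 bitstring, and treating them as sufficient to represent each group is an assumption. The names hamming and classify are hypothetical.

```python
# Example bitstrings copied from Table 3 and from the Group 8 discussion.
REPRESENTATIVES = {
    0: ["0000110100010000000000000000000000000000000000000000000000"],
    5: ["0000010110011000000000000000000000000100100000000000000000"],
    6: ["0000010100011000000000000000000000000100110000000000000000"],
    8: ["0001010110011000000000000000000000000100110000000000000000"],
    10: ["0000010110001000000000000000000000000100010000000000000000",
         "0000110110011000000000000000000000000100010000000000000000",
         "0000010110011000000000000000000000000100010000000000000000"],
    15: ["0000110110010000000000000000000000000100010000000000000000",
         "0000110110010010000000000000000000000100010000000000000000"],
}


def hamming(a, b):
    """Number of differing positions; a length mismatch counts as extra differences."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def classify(bitstring):
    """Return (group, distance) for the closest representative bitstring."""
    best = None
    for group in sorted(REPRESENTATIVES):
        for rep in REPRESENTATIVES[group]:
            d = hamming(bitstring, rep)
            if best is None or d < best[1]:
                best = (group, d)
    return best


group8 = REPRESENTATIVES[8][0]
base = REPRESENTATIVES[0][0]
# Flip the last bit of the Group 0 example to simulate a near miss.
near_group0 = base[:-1] + ("1" if base[-1] == "0" else "0")
print(classify(group8), classify(near_group0))
```

The cost of a lookup is one pass over a short list of strings, which is the kind of cheap per-file step the paragraph above has in mind. A distance cut-off for ‘no suitable group’ would be a natural addition.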

Consequently, looking at specification violations in Excel files has shown promise as a low cost heuristic for filtering. There are some strong groupings of exploited files. This is most likely due to malware authors making extensive use of template files from exploit kits and modifying proven exploit files. Changing the payload without changing the document structure will leave the set of specification violations for a document intact.

For Excel, an analysis of the first 8 KB of the file would probably be sufficient to determine whether more exhaustive analysis needs to be done. It was determined that for >96% of files processed, all required information could be obtained within 8 KB. The default cluster size for common disk/file system combinations is between 4 KB and 16 KB. Importantly, most AV scanning engines will already parse OLE2 files deeper than this in order to check for macros, etc. For many engines, the (expensive) disk reads needed for the cheap tracking of specification violations are effectively free.

Figure 9: Example commonality in Group 8 sample.

Vendor | Detected | Suspicious | Total
Avira | 48 | 2 | 165
Kaspersky | 65 | - | 165
McAfee | 141 | - | 165
Microsoft | 60 | 94 | 165
Sophos | 137 | - | 165
Symantec | 104 | - | 165
Trend | 85 | - | 165

Table 4: Comparison of scans of Group 8.

Bit index | Property name | Spec notes | Probable meaning
3 | rupYear | ‘The value MUST be 0x07cc or 0x07cd’; ‘Excel 97 writes 0x07CC for rupYear.’ | Maps to Excel version. Is set to 0x0700 in this instance.
5 | fRisc | ‘MUST be 0’; ‘last edited on a RISC platform’ | Unknown. This bit is so commonly set that the given meaning in the spec seems unlikely.
7 | fWinAny | ‘SHOULD be 1’ | Flag denoting if file has ever been edited on Windows.
8 | fMacAny | ‘MUST be 0’ | Flag denoting if file has ever been edited on Mac. Again, this violation is very common, and it would make little sense to have this bit, have Mac products, yet never set it!
11 | 2nd bit of unused1 field | ‘Undefined and MUST be ignored’ | Unknown
35 | reserved2 | 20 bits, ‘MUST be zero’ | Unknown. The 20 following bits often have a consistent pattern within a group so it is likely these bits are in use and have an undocumented meaning.
38-39 | bits 2–3 of the reserved2 bits | As above | As above

Table 5: Meanings of set bits for Group 8.

Parsing deeper within the OLE2 structure, while remaining close to the file start, would lead to a longer bitstring. More significant bits should lead to better clustering, albeit at a higher cost in generating the initial clusters. An additional problem here is that only the very early structures within the WorkBook stream are non-optional. Different techniques for clustering would be needed to handle optional traits.

Further work which we hope to present will consist of:

• Expanded sample sets.

• Extending the violation parser in the WorkBook stream.

• Adding other violation parsers for other Excel streams as well as Word and PowerPoint.

• Implementing a more robust directory chain parser.

• Determining if Ward’s Algorithm is appropriate. Jarvis-Patrick clustering has been studied and stated to be superior in similar problem domains. A further similar extension would be to weight the distance function by inverse frequency of each bit being set.

• Storing clustering results in such a way that a single sample can quickly be added to the most appropriate group, i.e. utilizing the clustering as a classifier.

REFERENCES[1] http://word.mvps.org/faqs/numbering/

numberingexplained/fi leformats.htm.

[2] http://sc.openoffi ce.org/excelfi leformat.pdf.

[3] Gordon, S. Concept cross platform. Proceedings of the Virus Bulletin International Conference 1995.

[4] Comparative Review. Virus Bulletin, May 1996, p10. http://www.virusbtn.com/pdf/magazine/1996/199605.pdf.

[5] Krukov, A. ‘In the Beginning was the Word ...’ Virus Bulletin November 1996, p.14. http://www.virusbtn.com/pdf/magazine/1996/199611.pdf.

[6] Bontchev, V. Possible Macro Virus Attacks and how to prevent them. Proceedings of the Virus Bulletin International Conference 1996.

[7] Bontchev, V. Solving the VBA Upconversion problem. Proceedings of the Virus Bulletin International Conference 2000.

[8] FitzGerald, N. A Word in your ear. Virus Bulletin August 1997, p.2. http://www.virusbtn.com/pdf/magazine/1997/199708.pdf.

[9] Raiu, C. Looping the Kloop. Virus Bulletin, November 2000, p.8. http://www.virusbtn.com/pdf/magazine/2000/200011.pdf.

[10] Raiu, C. The Little Fixed Variable Constant. Virus Bulletin, October 1999, p.8. http://www.virusbtn.com/pdf/magazine/1999/199910.pdf.

[11] Muttik, I. A Portrait of Jini. Virus Bulletin, October 2000, p.7. http://www.virusbtn.com/pdf/magazine/2000/200010.pdf.

[12] Melissa Macs a Comeback? Virus Bulletin, February 2001, p.2. http://www.virusbtn.com/pdf/magazine/2000/200010.pdf.

[13] Baccas, P. Apple Blight. Virus Bulletin, September 2001, p.10. http://www.virusbtn.com/pdf/magazine/2001/200109.pdf.

[14] Bontchev, V. Three faces of VBA: Part 2. Virus Bulletin, February 2005, p.4. http://www.virusbtn.com/pdf/magazine/2005/200502.pdf.

[15] Daniloff, I. Anarchy in the USSR. Virus Bulletin, October 1997, p.6. http://www.virusbtn.com/pdf/magazine/1997/199710.pdf.

[16] Kaspersky, E. The Windows Virus Drama; Act ii, scene iv. Virus Bulletin, November 1997, p.15. http://www.virusbtn.com/pdf/magazine/1997/199711.pdf.

[17] Ször, P. Beast Regards. Virus Bulletin, June 1999, p.6. http://www.virusbtn.com/pdf/magazine/1999/199906.pdf.

[18] Lines to Ponder. Virus Bulletin, July 1999, p.2. http://www.virusbtn.com/pdf/magazine/1999/199907.pdf.

[19] van Oers, M. Introducing the Infidel. Virus Bulletin, August 1999, p.9. http://www.virusbtn.com/pdf/magazine/1999/199908.pdf.

[20] http://www.microsoft.com/interop/cp/default.mspx.

[21] http://msdn.microsoft.com/en-us/library/cc313118.aspx.

[22] http://msdn.microsoft.com/en-us/library/dd942138%28PROT.13%29.aspx.

[23] http://download.microsoft.com/download/2/4/8/24862317-78F0-4C4B-B355-C7B2C1D997DB/OfficeFileFormatsProtocols.zip.

[24] http://www.ietf.org/rfc/rfc2119.txt.

[25] Muttik, I. Macro Viruses Part 1. Virus Bulletin, September 1999, p.13. http://www.virusbtn.com/pdf/magazine/1999/199909.pdf.

[26] http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-10-red-black-trees-rotations-insertions-deletions/.

[27] http://www.sophos.com/en-us//threat-center/threat-analyses/viruses-and-spyware/XM97~Hidemod-A.aspx.

[28] http://www.sophos.com/en-us/threat-center/threat-analyses/viruses-and-spyware/Troj~DocDrop-S.aspx.

[29] CVE-2009-3129. http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2009-3129.


APPENDIX I

#!/usr/bin/env python
#
# main script to run
# python 2.6.x
#
import os, sys, hashlib, re, getopt, logging, config
import cPickle as pickle
from utils import *
from glob import glob
import ole2_header, ole2_dir_chain, ole2_excel

class Options:
    types = {}
    def __init__(self):
        self.types = {"xls":False, "doc":False, "ppt":False, "header":False, "root_dir_chain":False, "type_dir_chain":False}

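def restoreFromPickle(path, options, results):
    # Update the caller's objects in place. Rebinding the local names
    # would silently discard the restored data when the function returns.
    saved_options = pickle.load(config.pickle) # options are pickled first
    options.types.update(saved_options.types)
    results.update(pickle.load(config.pickle)) # then the results dict
    return True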

def usage():
    print "ole2.py [options] [files]"
    print "\t-h, --help\t\t\tdisplay this usage screen"
    print "\t-t, --types xls,doc,ppt,header,root_dir_chain,type_dir_chain\t\tenable filetype specific info"
    print "\t all\t\tenable all of the above"
    print "\t-p, --picklefile <filename>\t\tname of file to use for Pickling, data from this run will be added/updated."
    print "\t-l, --loglevel [debug|info|warning|error|critical]"

def main(argv):
    options = Options() # defined in utils.py
    results = {} # for storing combined results and pickling
    xls = None
    ppt = None
    doc = None
    headerDict = {}
    rootDirChainDict = {}
    xlsDict = {}
    pptDict = {}
    docDict = {}
    print "starting",
    try:
        opts, args = getopt.getopt(argv[1:], "ht:p:l:", ["help","types=","picklefile=","loglevel="])
    except getopt.GetoptError, e:
        # print e
        usage()
        sys.exit(2)
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            usage()
            sys.exit()
        elif opt in ("-t", "--types"):
            for a in arg.split(","):
                if a == "xls":
                    options.types["xls"] = True
                elif a == "ppt":


                    options.types["ppt"] = True
                elif a == "doc":
                    options.types["doc"] = True
                elif a == "header":
                    options.types["header"] = True
                elif a == "root_dir_chain":
                    options.types["root_dir_chain"] = True
                elif a == "type_dir_chain":
                    options.types["type_dir_chain"] = True
                elif a == "all":
                    for ftype in options.types:
                        options.types[ftype] = True
                else:
                    print "Unrecognised type in -t/--types option: " + str(a)
                    usage()
                    sys.exit()

        elif opt in ("-p", "--picklefile"):
            if os.path.isfile(arg):
                print "Pickle file " + str(arg) + " existed"
                config.pickle = open(arg, 'rb')
                # import Pickled data. This will lead to the running job
                # updating shas that already exist in the Pickle.
                try:
                    restoreFromPickle(config.pickle, options, results)
                    config.pickle.close()
                    config.pickle = open(arg,'wb') # trashes the pickle on disk, so try not to crash.
                    # FIXME be sensible and keep a temp copy, overwrite at the end, not here.
                except Exception, e:
                    logging.error("Validation of Pickle failed")
                    print e
                    raise
            else:
                print "Pickle file didn't exist, creating: " + str(arg)
                # initialise the Pickle for later storage.
                # No earlier data to restore however.
                try:
                    config.pickle = open(arg, 'wb')
                except:
                    logging.error("Error creating Pickle file")
                    raise
            config.usePickle = True

        elif opt in ("-l", "--loglevel"):
            if arg == "debug":
                logging.basicConfig(level=logging.DEBUG)
            elif arg == "info":
                logging.basicConfig(level=logging.INFO)
            elif arg == "warning":
                logging.basicConfig(level=logging.WARNING)
            elif arg == "error":
                logging.basicConfig(level=logging.ERROR)
            elif arg == "critical":
                logging.basicConfig(level=logging.CRITICAL)
            else:
                print "Unrecognised loglevel in -l/--loglevel option: " + str(arg)
                usage()
                sys.exit()
        else:
            print "Unrecognised option: " + str(opt)
            usage()
            sys.exit()
    if config.usePickle:
        # pickle the options so that when we unpickle
        # we know what to expect for parsing.
        pickle.dump(options, config.pickle)
    # get all the file names
    deglobbed = []
    filelist = []


    for arg in args:
        deglobbed.append( glob(arg) )
    #print deglobbed
    for flist in deglobbed:
        filelist.extend(flist)
    print filelist
    for filename in filelist:
        xls = None
        ppt = None
        doc = None
        headerDict = {}
        rootDirChainDict = {}
        xlsDict = {}
        pptDict = {}
        docDict = {}
        if os.path.isdir(filename) == True:
            # we don't do recursive parsing yet...
            continue
        # print filename,
        fd = open(filename, "rb")
        data = fd.read()
        filehash = hashlib.sha1(data).hexdigest()
        print filehash,
        header = magicstring(fd,0,4).bigEndH(0,4)
        magic = 'd0cf11e0'
        if header == magic:
            print "OLE",
        else:
            print "NOLE",
            continue

        # sector size can vary, get exponent used to calc
        power = lseek_read(fd,30,2)
        # get sector offset of dir_chain (relative to first sector end)
        dir_chain = lseek_read(fd,48,4)
        # get root object of dir_chain
        # sector_size * dir_chain offset
        root = 2 ** little_endian(power) * (little_endian(dir_chain) + 1)
        # get dir_chain name
        root_table = lseek_read(fd,root,64)
        i = 0
        flavour = "unknown"
        flav_rel_offset = 0
        while 1:
            i += 1
            # walk dir entry streams.
            # next = root + (dir entry size * i)
            # name is
            dirEntryName = lseek_read(fd,root + 128*i, 64)
            ### maybe we shouldn't break in the below loop?
            ### Haven't checked but it's probably against spec
            ### to have both a WordDocument and Workbook stream etc.
            ### So could loop looking for all.
            if re.match(b'W\00o\00r\00d', dirEntryName):
                flavour ="WordDocument"
                flav_rel_offset = i
                break
            if re.match(b'P\00o\00w\00e', dirEntryName):


                flavour ="PowerPoint Document"
                flav_rel_offset = i
                break
            if re.match(b'W\00o\00r\00k', dirEntryName):
                flavour ="Workbook"
                # page 56 [MS-XLS].pdf
                # "A file MUST contain exactly one Workbook Stream"
                flav_rel_offset = i
                break
            # only the first four entries after the root are examined
            if i == 4:
                break
        print flavour,
        # the starting sector of a stream is at offset 116 of its directory entry
        offset = lseek_read(fd, root + 128 * flav_rel_offset + 116, 4)
        flav_offset = 2 ** little_endian(power) * (little_endian(offset) + 1)
        flav_stream = lseek_read(fd, flav_offset, 2)
        #print "0x%02x%02x" % (flav_stream[0], flav_stream[1])
        print flav_offset

        if options.types["header"] == True:
            header = ole2_header.parser()
            if header.parse(fd,0) !=0:
                print "Couldn't complete parsing " + str(filename)
                header = None
            if config.usePickle:
                pass
            if header != None:
                buildValueListAll(header.must_dict, ("index","violates") )
                print 'extended header bitstring: ',
                print vioBitsPlus(header.must_dict,
                                  header.should_dict,
                                  header.interesting_dict)
        if options.types["root_dir_chain"] == True:
            dir_chain = ole2_dir_chain.parser()
            if dir_chain.parse(fd,root) !=0:
                print "Couldn't complete parsing " + str(filename)
                dir_chain = None
            if dir_chain != None:
                buildValueListAll(dir_chain.must_dict, ("index","violates") )
                print 'extended bitstring root_dir_chain: ',
                print vioBitsPlus(dir_chain.must_dict,
                                  dir_chain.should_dict,
                                  dir_chain.interesting_dict)
        # processing now becomes ole2file flavour dependent.
        if flavour == "Workbook" and options.types["xls"] == True:
            xls = ole2_excel.parser()
            if xls.parse(fd,flav_offset) != 0:
                print "Couldn't complete parsing " + str(filename)
                xls = None
            #print '"unused" bitstring: '
            #unusedBits =
            if xls != None:
                buildValueListAll(xls.must_dict, ("index","violates","uint") )
                print 'extended xls bitstring: ',
                print vioBitsPlus(xls.must_dict,
                                  xls.should_dict,
                                  xls.interesting_dict)


        #elif flavour == "PowerPoint Document":
        # if we're going to be dumping to file, we need
        # to track each doc for later output
        if config.usePickle:
            if options.types["header"] == True:
                headerDict["header"] = header
            if options.types["root_dir_chain"] == True:
                rootDirChainDict["root_dir_chain"] = dir_chain
            if options.types["xls"] == True: # some of these three may be None
                xlsDict["xls"] = xls
            if options.types["ppt"] == True:
                pptDict["ppt"] = ppt
            if options.types["doc"] == True:
                docDict["doc"] = doc
            results[filehash] = (headerDict, rootDirChainDict, xlsDict, pptDict, docDict)
    if config.usePickle:
        # pickle the results
        print("pickling results")
        pickle.dump(results, config.pickle)
        config.pickle.close()

if __name__ == "__main__":
    main(sys.argv[0:])
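The offset arithmetic used in the listing above (sector size taken from the header's sector shift, with sector n starting at (n + 1) times the sector size because the header occupies the first slot) can be shown in isolation. The following is a minimal, self-contained Python 3 sketch using only the standard struct module; it does not use the utils helpers, which are not reproduced in this paper, and the function names are hypothetical. Like the listing, it reads directory entries linearly from the first directory sector and does not follow the FAT chain.

```python
import struct


def sector_offset(sector_shift, sector_number):
    """Byte offset of a sector; the header fills the first sector-sized slot."""
    return (1 << sector_shift) * (sector_number + 1)


def directory_entry_names(data, max_entries=4):
    """Return the names of the first few 128-byte directory entries."""
    sector_shift, = struct.unpack_from("<H", data, 30)      # header offset 30
    first_dir_sector, = struct.unpack_from("<I", data, 48)  # header offset 48
    root = sector_offset(sector_shift, first_dir_sector)
    names = []
    for i in range(max_entries):
        entry = data[root + 128 * i: root + 128 * (i + 1)]
        if len(entry) < 128:
            break  # ran off the end of the buffer
        # Name length in bytes, including the two-byte terminator.
        name_len, = struct.unpack_from("<H", entry, 64)
        name_len = min(name_len, 64)
        raw = entry[:max(name_len - 2, 0)]
        names.append(raw.decode("utf-16-le", "replace"))
    return names
```

Decoding the UTF-16 name once and comparing whole strings such as "Workbook" or "WordDocument" avoids matching on the first four characters only, at the cost of trusting the name-length field, which an exploited file may of course set incorrectly.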

APPENDIX II

import os, sys, hashlib, re, logging
from utils import *

title = "[MS-CFB]- v20101230 Compound File Binary File Format"
logging.basicConfig(level=logging.INFO)

class parser:
    def __init__(self):
        self.must_dict = {}
        self.should_dict = {}
        self.interesting_dict = {}

    def parse(self,fd,file_offset):
        '''
        file_offset should be the start of the file.
        fd is file object
        We will walk linearly through, tracking traits.
        '''
        index = [0] # indexing
        logging.info("Parsing the OLE2 header")
        ba = lseek_read(fd,0,512)
        head_sig = magicstring(fd,0,8).bigEndH(0,8)
        vio = 0 if head_sig == 'd0cf11e0a1b11ae1' else 1
        store( index,self.must_dict, 'header_signature',
            ("violates",vio),
            ("value", head_sig),
            ("name","Header Signature"),
            ("page", 16)


            )
        head_clsid = little_endian(ba[8:24])
        vio = 0 if head_clsid == 0 else 1
        store( index,self.must_dict, 'header_clsid',
            ("violates",vio),
            ("value", head_clsid),
            ("name","Header CLSID"),
            ("page", 16)
            )
        head_minor_ver = little_endian(ba[24:26])
        head_major_ver = little_endian(ba[26:28])
        vio = 0 if head_minor_ver == 62 and (head_major_ver == 3 or head_major_ver == 4) else 1
        store ( index,self.should_dict, 'header_minor_version',
            ("violates",vio),
            ("value", head_minor_ver),
            ("name","Header Minor Version"),
            ("page", 16)
            )
        vio = 0 if head_minor_ver == 62 else 1
        store ( index,self.interesting_dict, 'header_minor_version',
            ("violates",vio),
            ("value", head_minor_ver),
            ("name","Header Minor Version"),
            ("page", 16)
            )
        vio = 0 if head_major_ver == 3 or head_major_ver == 4 else 1
        store ( index,self.must_dict, 'header_major_version',
            ("violates",vio),
            ("value", head_major_ver),
            ("name","Header Major Version"),
            ("page", 16)
            )
        head_byte_order = little_endian(ba[28:30]) # Should play with hex
        vio = 0 if head_byte_order == 65534 else 1
        store ( index,self.must_dict, 'header_byte_order',
            ("violates",vio),
            ("value", head_byte_order),
            ("name","Header Byte Order"),
            ("page", 17)
            )
        head_sect_shift = little_endian(ba[30:32])
        vio = 0 if (head_sect_shift == 9 and head_major_ver == 3) or (head_sect_shift == 12 and head_major_ver == 4) else 1
        store ( index,self.must_dict, 'header_sector_shift',
            ("violates",vio),
            ("value", head_sect_shift),
            ("name","Header Sector Shift"),
            ("page", 17)
            )

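        # Flag any sector shift other than the two permitted values.
        # (The test "!= 9 or != 12" is always true, so it could never flag anything.)
        vio = 0 if head_sect_shift in (9, 12) else 1
        store ( index,self.interesting_dict, 'header_sector_shift',
            ("violates",vio),
            ("value", head_sect_shift),
            ("name","Header Sector Shift"),
            ("page", 17)
            )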


        head_mini_sector_shift = little_endian(ba[32:34])
        vio = 0 if head_mini_sector_shift == 6 else 1
        store ( index,self.must_dict, 'header_mini_sector_shift',
            ("violates",vio),
            ("value", head_mini_sector_shift),
            ("name","Header Mini Sector Shift"),
            ("page", 17)
            )
        head_reserved_0 = little_endian(ba[34:40])
        vio = 0 if head_reserved_0 == 0 else 1
        store ( index,self.must_dict, 'header_reserved_0',
            ("violates",vio),
            ("value", head_reserved_0),
            ("name","Header Reserved 0"),
            ("page", 17)
            )

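        head_num_dir_sect = little_endian(ba[40:44])
        # [MS-CFB]: if Major Version is 3, the Number of Directory Sectors MUST be zero.
        # The test therefore uses the major version, not the minor version.
        vio = 1 if head_major_ver == 3 and head_num_dir_sect != 0 else 0
        store ( index,self.must_dict, 'header_number_directory_sectors',
            ("violates",vio),
            ("value", head_num_dir_sect),
            ("name","Header Number of Directory Sectors"),
            ("page", 17)
            )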

        # Transaction Signature Number (4 bytes):
        # This integer field MAY contain a sequence number that
        # is incremented every time the compound file is saved
        # by an implementation that supports file transactions.
        # This is the field that MUST be set to all zeroes if file
        # transactions are not implemented.
        head_trans_sign = little_endian(ba[52:56])
        vio = 0 if head_trans_sign == 0 else 1
        store ( index,self.interesting_dict, 'header_transaction_signature_number',
            ("violates",vio),
            ("value", head_trans_sign),
            ("name","Header Transaction Signature Number"),
            ("page", 17)
            )
        head_mini_str_cutoff_sz = little_endian(ba[56:60])
        vio = 0 if head_mini_str_cutoff_sz == 4096 else 1
        # Stored under its own key so that it does not overwrite the
        # 'header_number_directory_sectors' entry recorded earlier.
        store ( index,self.must_dict, 'header_mini_stream_cutoff_size',
            ("violates",vio),
            ("value", head_mini_str_cutoff_sz),
            ("name","Header Mini Stream Cutoff Size"),
            ("page", 17)
            )
        return 0
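For readers without access to the utils module, the MUST checks in the listing above can be condensed into a self-contained Python 3 sketch built on the standard struct module. This is an illustrative rewrite under stated assumptions, not the parser used to produce the results in this paper: the function names, the ordering of the checks and the plain "0"/"1" bitstring format are all hypothetical, and only the MUST-level header fields are covered.

```python
import struct

CFB_SIGNATURE = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def header_violations(header):
    """Return (field, violates) pairs for the first 512 bytes of a file."""
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", header, 24)
    num_dir_sectors, = struct.unpack_from("<I", header, 40)
    mini_cutoff, = struct.unpack_from("<I", header, 56)
    checks = [
        ("header_signature", header[0:8] == CFB_SIGNATURE),
        ("header_clsid", header[8:24] == b"\x00" * 16),
        ("header_major_version", major in (3, 4)),
        ("header_byte_order", byte_order == 0xFFFE),
        ("header_sector_shift", (major, sector_shift) in ((3, 9), (4, 12))),
        ("header_mini_sector_shift", mini_shift == 6),
        ("header_reserved_0", header[34:40] == b"\x00" * 6),
        # Only version 3 files are required to store zero here.
        ("header_number_directory_sectors",
         major != 3 or num_dir_sectors == 0),
        ("header_mini_stream_cutoff_size", mini_cutoff == 0x1000),
    ]
    return [(name, 0 if ok else 1) for name, ok in checks]


def bitstring(violations):
    """Concatenate the violation flags in check order."""
    return "".join(str(flag) for _, flag in violations)
```

Keeping the checks in a single ordered list fixes the position of each bit, so fingerprints from different files remain directly comparable as long as the list itself is not reordered.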