Spinal Tap: High Level Analysis for Heavy Metal Systems
Nieraj Singh, University of Victoria
Email: [email protected]
Celina Gibbs, University of Victoria
Email: [email protected]
Dean Pucsek, University of Victoria
Email: [email protected]
Martin Salois, DRDC Valcartier
Email: [email protected]
Jonah Wall, University of Victoria
Email: [email protected]
Yvonne Coady, University of Victoria
Email: [email protected]
Abstract—Program comprehension tools targeting specific high-level languages do not currently scale to the complexities of many of today's low-level systems. At the lowest level, the wide variety of architectures and platforms results in a widening spectrum of instruction sets and assembly languages. Slightly above this level, C-based systems targeting multiple architectures and platforms are riddled with compiler directives to accommodate the demands of configurable systems.
This paper proposes a generalized and extensible framework for the purpose of program navigation and analysis, leveraging an intermediate representation of source code to separate low-level domain detail from tool support. A prototype of this framework is provided with two case studies evaluating its efficacy within multiple domains. This study demonstrates the feasibility of an extensible framework as a common core for low-level program comprehension tools.
I. INTRODUCTION
Analysis of low-level code bases written in C or assembly is currently challenging due to a lack of tool support. Developers performing program analysis of such systems are faced with large code bases, a plethora of computer architectures (x86, ARM, PowerPC, HLASM, etc.), and binary file formats (ELF, Mach-O, PE, etc.). More specifically, in the case of
malware analysis, an ever growing landscape of obfuscation
techniques employed by malware developers has intensified
the need for low-level analysis tools.
Given the future trajectory of multi-core applications that
are bound to amplify these problems, we believe it is
paramount to consider extensibility as a primary requirement
for program comprehension tools specifically targeting low-
level code bases. Though many inroads have been made for
tools targeting specific high-level languages, often enhancing
an already impressive tool-set within an Integrated Develop-
ment Environment (IDE), these tools typically do not extend to low-level code bases. This is at least in part due to
the fact that these tools align well with high-level language
mechanisms, such as classes, methods and even functions, but
were not designed to scale to include complexities such as
less structured pre-processor directives in C (CPP) and even
transfer of control and data through jumps and registers in
assembly.
This paper proposes a generalized and extensible framework
for the purpose of program navigation and analysis, leveraging
an intermediate representation of source code to separate low-
level domain detail from tool support. A prototype of this
framework is provided with a detailed case study evaluating
its efficacy in terms of pre-processor directives in C-based
systems. Further investigation in terms of the assembly lan-
guage domain reveals the possible synergy of leveraging an
intermediate language within further program comprehension
tools. These studies demonstrate the feasibility of an extensible
framework as a common core for low-level source comprehen-
sion tools.
Extensibility is crucial because it allows a framework of
comprehension tools to more easily adapt to both current and
future technologies. Moreover, a focus on extensibility allows
us to form a community of third-party developers who are able
to contribute ideas and code to better the system. In an attempt
to address this issue of comprehension of complex software
in various forms, we propose a generalized framework built
on the ability to define various models that interface with
the tools through a formalized intermediate representation.
This framework, which we call Spinal Tap (ST), is also intended
to be inter-operable with tools designed for a wide variety of
areas including program verification, program comprehension,
malware analysis, vulnerability detection and education.
In this paper we begin by discussing related work in
Section II and motivate the need for extensibility in Section III,
suggesting an extensible framework capable of performing
arbitrary analysis on various representations of low-level code
bases. In Section IV we describe the design and implemen-
tation of a prototype framework followed by an evaluation
of the proposed framework with case studies targeting C
preprocessor directives, program documentation techniques,
and assembly languages in Section V. Finally, we present our
plans for future work and a summary of our conclusions in
Section VI.
II. RELATED WORK
One way of implementing a software analysis tool is to define modeling meta-data that leverages
tooling functionality and UI with higher-level views of the underlying software system, using existing frameworks like IBM's RAD [19]. Another is to develop a tool that works on intermediate object representations of code, like Eisenbarth's C concept analysis tool [10]. Our ST approach focuses on an API-based mechanism for information transfer between a specialized tool dedicated to the analysis of a particular code base and the underlying common tooling framework that performs much of the tooling functionality.
Prior research with tools like BEAGLE [24] has shown the need for source-based analysis. However, some software
concerns like fine grained source level configuration may
be highly scattered across many files, and concerns may
not be easily conceptualized by viewing source code alone.
Consequently, source analysis tools may opt for higher level
views of a software system that take into account complex
structures, dependencies and flows that are not immediately
discerned from just viewing the source code. This level of abstraction may allow developers to not only better understand functionality and identify accurate structures and
structural dependencies in a code base, but possibly query and
analyze the source code using these abstractions.
In particular, program structures and call graphs cannot
always be directly parsed from language definitions of a
complex, configurable code base without further input about
the software system during the parsing process. For example,
research has shown that parsing C code with heavy CPP
usage often results in incomplete or inaccurate syntax trees
if the code is parsed before pre-processing [3], [21]. Instead,
the research showed that using an Application Programming
Interface (API) to obtain additional information during source
code processing can result in more accurate modeling of com-
plex configurable source code as opposed to simply parsing
the source using C and CPP language rules which may even
result in a syntactically incorrect representation. Our proposed
ST framework uses a similar principle to allow specialized
analysis tools built on top of ST to leverage the API during
the source processing and modeling phases to fine tune a
formalized intermediate representation, and therefore better
account for source structures, dependencies and flows that
may not be correctly modeled from language rules alone.
This API introduces modeling flexibility, where a specialized
analysis tool using our framework can generate source models
that take into consideration variations and complexities not
readily captured by just the syntax and semantic rules of an
implementation language, in particular if the implementation
language is coupled with pre-processing.
Most people are familiar with the telephone game, in which
a player whispers a starting sentence to a second player, who whispers it to a third player, and so on until the last player
speaks it out loud. More often than not, the original sentence
is utterly deformed, to the general amusement of the crowd.
Few people know that the same phenomenon is almost a given
when translating from one computer language to another. This
characteristic is due to the presence of heterogeneities between
data structures. When a data type does not exist in the target
structure, the human translator will still represent it somehow.
This will inevitably result in information loss or, even worse,
false information. Euzenat and Shvaiko [11] provide a general
introduction to the subject for inter-operability, and Dorion et al. [8] provide a detailed characterization of the possible hetero-
geneity types, leading to a measure of how much information
is lost.
There are multiple strategies and projects dedicated to the
comprehension of application programs and their translation
to an intermediate form or language. Formalized Intermediate
Languages (ILs) have been developed for both the translation
of source to a specified IL and the translation of binaries
or executables to an IL. In both cases, the IL facilitates
program analysis through integration with tools that leverage
the structure of the given IL.
For example, SAIL [7] is an open-source front end that, given a program's source code, generates two levels of ILs, creates a graphical representation of the program's control flow, and is conducive to static analysis. On the other hand, REIL
(Reverse Engineering Intermediate Language) [9] provides
an IL abstracted from native assembly code supporting the
development of analysis tools and algorithms that are platform
independent. BinNavi [1] is an example of a commercial plug-in tool implemented using REIL as its intermediate language.
REIL’s limited instruction set introduces a one-to-many
relation between native instructions and REIL instructions
and therefore leaves it unable to translate certain classes of
instructions. Unusual instructions are essentially ignored by
translating them into a variant of the NOP instruction, like
the UNKN instruction in REIL that the translator uses to
signal an unrecognized instruction [1]. However, if such an
instruction exists in an IL, dynamic binary analysis becomes
infeasible [17]. If dynamic binary analysis is required, it must
then be accomplished by means of an emulator or a full system
simulator. HLASM [14], for example, is significantly different
than X86 assembly and currently has no translation to the IL
in REIL.
IDA Pro [12] is a popular interactive disassembler that
allows for inspection of software binaries through analysis
both built-in and external to IDA Pro. IDA Pro follows a
traditional plug-in architecture and provides an API that allows
third-party developers to produce more complex analysis [4].
There are several other frameworks in place, each with their
own IL, and each with their own advantages and drawbacks.
CodeSurfer/x86 [16] is one of the more well-known frameworks, but its proprietary nature leaves it closed to inspection and extension.
HERO (Hybrid sEcurity extension of binaRy transla-
tiOn) [13] is a promising framework that claims to support
an efficient combination of static and dynamic binary analysis
methods. One key advantage to HERO is that it appears to
be entirely self-contained; that is to say, it does not seem to
rely on any third-party components. A formal specification
language is used for its IL, but no implementation details are revealed to verify the claim that it is a faithful representation of the assembly code.
BitBlaze [20] is a framework that was developed at UC
Berkeley and aims to provide a better understanding of dis-
assembled software through a fusion of static and dynamic
analysis. It consists of three main components: VINE, TEMU,
and Rudder. A recent study of various emulators, including
QEMU, Valgrind, PIN and BOCHS, revealed some unfaith-
fully emulated instructions, resulting in system misrepresenta-
tion, emulator malfunction or failure [15].
BAP (Binary Analysis Platform) [5] was developed by two
of the contributors to the BitBlaze project one year after
the BitBlaze paper was published. It contains essentially the
same features, except it cannot tie IL instructions to the
original assembly the way BitBlaze and many frameworks can.
Both BitBlaze and BAP are distributed as VMware virtual machines; however, the errors revealed by Martignoni et al. [15] are still a relevant concern.
BinNavi [1] is a commercial plug-in tool that was created by
Zynamics and is implemented using REIL as an intermediate
language. It provides extensibility support through MonoR-
EIL, an add-on framework. REIL consists of 17 instructions,
including NOP and UNKN, which do not actually correspond
to any executable instructions. UNKN is just like NOP except
it tells the analyst that an instruction was encountered that
could not be translated, such as a floating-point instruction.
In addition to floating point instructions, MMX, SSE, and
other CPU extensions are also not supported at this time.
The presence of the NOP and UNKN instructions in the REIL
instruction set precludes the possibility of dynamic binary
translation, so BinNavi currently supports only static analysis.
Dynamic analysis is possible with REIL, but would have to
be implemented using a whole-system simulator or emulator.
III. A NEED FOR EXTENSIBILITY
As established by the survey in Section II, the primary inadequacy of current support for program comprehension is the inability of these tools to keep pace with modern code bases and to scale or adapt to other languages. The reality
is that unless analysis tools are designed to fluidly adapt to
constantly changing circumstances they will not be able to
provide adequate support and will quickly become obsolete.
In [18], we proposed the concept of an extensible frame-
work, specifically targeting binary analysis. The layered design
of this framework, shown in Figure 1, is intended to meet these requirements: 1) supporting new instruction set architectures, 2) maintaining platform independence, and 3) supporting inter-operability of new analysis modules. In this paper we look to
generalize the problem space by proposing the ST analysis and
program comprehension framework to scale across multiple
forms of complex software including binary.
With respect to instruction set architectures, more and more
software systems are breaking out of the confines of the
desktop PC and forging paths in new areas such as mobile
devices, computing on GPUs and embedded systems. As a
result, it is no longer a question of if a new architecture will be encountered, but of when. ST must adapt to
new instruction sets, accurately and faithfully modeling the
intricacies of the different architectures.
Fig. 1. Proposed Spinal Tap framework design.
In terms of software platforms, rather than dictating which
platform (e.g. Microsoft Windows, Mac OS X, Linux) users
of ST must be on, users should be able to choose. As a result
of loosening the platform requirement we allow ST to fit into
existing work flows and we increase the number of potential
users. Decoupling the ST tooling framework from the runtime
environment is one of the primary reasons why our portable
ST framework is built on top of an existing IDE like Eclipse,
where platform specifics are largely hidden at the ST level.
Tool users may have to obtain a base Eclipse IDE that is
specific to a platform, but the installation of the ST tooling
framework would be the same regardless of which platform-
specific Eclipse is used. Furthermore, it more readily allows
the integration of new and unanticipated analysis techniques
through well defined, Eclipse-based ST extension points, and
specifically allows for inter-operability with other ST and
Eclipse-based tools.
Section II mentioned some of the various ways of implementing a software analysis tool. We proposed an API-based tooling framework; two important reasons to consider an API-based approach are listed below (a minimal interface sketch follows the list):
1) An API may allow tools built on top of a tooling framework to identify abstract concepts in the code that may aid software comprehension, and furthermore serve as selectable context for analysis. This is explored further in Section IV.
2) An API cleanly separates common tooling functionality from specialized analysis tool logic. In particular, the API delegates responsibility for fine-tuning an intermediate model representation to the specialized analysis tool. This may allow source analysis tools to model the underlying source code more accurately during the parsing and modeling process, capturing properties that implementation-language specifics alone cannot, in particular when the source language is coupled with pre-processing.
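To make this separation concrete, the following minimal sketch (in Java, matching ST's Eclipse setting) shows what such provider-facing interfaces could look like. All names are hypothetical illustrations, not ST's actual API:

    import java.util.List;

    // Hypothetical provider-side interfaces: the framework owns tokenizing,
    // model management, querying and UI; the tool supplies domain knowledge.
    interface IModelElement {                       // node in the intermediate model
        String label();
        List<IModelElement> children();
    }

    interface IQuery {                              // an analysis the tool contributes
        String name();
        List<String> run(IModelElement modelRoot);  // rows for a generic results view
    }

    interface IModelProvider {                      // implemented by the analysis tool
        List<String> tokensToExtract();             // tokens the framework should produce
        IModelElement buildModel(List<String> tokenStream);  // model-building rules
    }

    interface IQueryProvider {                      // implemented by the analysis tool
        List<IQuery> queries();
    }

Later sketches in this paper build on these illustrative interfaces.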
The layered design shown in Figure 1 provides a struc-
ture that applies to framework construction in general. This
technique is similar to that used in the Eclipse IDE [2] to
achieve increased portability between platforms. The isolation
of platform-specific details at the lowest level is tailored to the
core API that integrates with an intermediate representation.
Specific analysis and visualization tools leverage the abstract
representation of the source code provided by this intermediate
representation.
This use of an intermediary allows analysis tools to integrate
in a uniform way with various representations of source code.
Our investigation into related work has demonstrated the
limitations of a strictly intermediate language based approach,
and as a result we investigate a more general and inclusive
approach for ST based on the addition of protocols and APIs
for inter-operability.
IV. A PROTOTYPE OF THE SPINAL TAP FRAMEWORK
The ST framework design is motivated by four key goals:
1) automatically integrate a software analysis tool set into an
existing IDE, 2) establish a clear separation between the inter-
mediate representation and the underlying source code details,
3) present a common abstraction mechanism that may aid a
user in software comprehension and analysis, and 4) define
a common, extensible querying mechanism that allows for
source code analysis on both the intermediate representation
as well as the underlying source linked to structures in the
intermediate model.
The first goal is addressed with the integration of ST as
a set of Eclipse plug-ins, where ST leverages the Eclipse
platform and allows specific source analysis tools contributed
into the ST framework to automatically communicate with
other existing components of Eclipse, like source editors,
source outline views, and file management.
The second goal is addressed with the introduction of a
common intermediate representation and API that decouples
general framework functionality, like parsing, modeling, visualization and querying of extracted source facts, from specifications and properties of the language used in the source.
The third goal is achieved through the common API defined
by the ST framework and implemented by the source analysis
tool. This interface allows the specialized tool to mark facts
about the modeled data, referred to as constructs, that represent
meaningful, selectable abstractions to a tool user. This abstraction can both facilitate an understanding of a code base and serve as selectable contexts for the query capabilities of software analysis techniques.
Finally, the fourth goal is once again achieved by the API
in the separation of the execution and visualization of query
results in a common ST view from the actual specialized
source analysis.
All four features work seamlessly together, and allow soft-
ware analysis tool developers to focus solely on specific source
properties like: 1) defining source language model building
rules that allow the transformation into the intermediate repre-
sentation or model, 2) establishing queries that analyze specific
qualities of the code base by querying this model of the
underlying source and 3) optionally marking constructs in the
parsed and modeled data that are displayed to a tool user and
serve as querying contexts.
The ability to query both the underlying source and the
intermediate representation provides flexibility to the analysis
tools by providing access to a code base at a higher level
of abstraction as well as direct access to lines of code.
The intermediate representation is mapped to structures in
the source code that may facilitate further analysis of those
blocks. The case study to follow in Section V demonstrates
these capabilities with an intermediate representation used
to model configuration logic within source code. The model
structures are automatically linked to relevant source code
during the modeling process using common source location
properties and allow the queries to focus on the structures
of the intermediate model when looking for potential logic
patterns.
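As a minimal sketch of this linking, assuming hypothetical names rather than ST's actual classes, a model element could simply carry the common source-location properties recorded during modeling:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical: the location ties a model structure back to its lines of code.
    record SourceLocation(String file, int startLine, int endLine) {}

    class LocatedModelElement {
        final String label;                        // e.g. a CPP condition
        final SourceLocation location;             // recorded during the modeling phase
        final List<LocatedModelElement> children = new ArrayList<>();

        LocatedModelElement(String label, SourceLocation location) {
            this.label = label;
            this.location = location;
        }
    }

A query that matches such an element can then report both the abstract structure and the exact lines it maps to.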
The following paragraphs present the ST framework design
in terms of the three layers depicted by shading in Figure 2.
Further, we identify the ways in which the goals are addressed
within the details of this design.
The top User Interface (UI) layer contains common ST
mechanisms that allow for the launching of a source analysis
tool within an integrated environment, where the analysis
tool leverages existing ST and Eclipse UI. A reusable UI for
running the tool from within Eclipse provides visualization of
analysis results and navigation from results to existing source
editors.
This UI layer has three main sections: 1) a project selection component for source project/directory selection and invocation of ST through context menu actions that integrate into existing Eclipse project views, namely the Project Explorer, 2) a selection view that displays selectable constructs according to goal three and launches wizards to select and execute source analysis queries, and 3) a query view which presents navigable query results that automatically integrate with existing
gable query results that automatically integrate with existing
Eclipse source editors. To support the integration goal, the two
ST views are set within the Eclipse main workbench where
development, source control and building are also performed
according to the requirement that a source analysis tool be
integrated into an IDE [23].
The middle layer, in general, deals with common ST framework functionality, like generation of an intermediate representation and query capabilities, within the source processing and query processing components, respectively. These two
processing components work together to provide information
to the UI layer for the purpose of visualization of the ab-
stractions marked by an analysis tool during the ST source
parsing and modeling phase, as well as displaying query
results. The generic intermediate model generator provides the
commonly understood model or intermediate representation
that the query processing and visualization layers of the
framework understand regardless of the source code specifics.
The middle layer contains common tooling functionality code
that is independent of the source analysis tool extending ST,
and includes parsing, tokenizing and modeling of a source
base into an intermediate model. The optional abstraction
of constructs to represent the source code can be used 1)
to support code comprehension through the UI and 2) and
for high-level analysis by the query processing component.
The query manager within the query processing component
Fig. 2. Design of the ST Framework.
contains the common query execution and query results logic
independent of the source analysis tools at the UI level.
The bottom layer provides the source analysis specific
implementation of the API offered up by the middle layer
of the ST framework for the generation of the model, optional
marking of constructs and query definitions. Specifically, a domain contributor provides implementations of a model provider and a query provider, which form the core components of a source analysis tool that leverages ST's extensible and reusable
mechanisms. The model provider supplies an implementation
of the model building rules for tokenizing a source code and
building an intermediate model, as well as construct marking
that serves as query context abstractions. The query provider
understands the structure of the intermediate model built by
the model provider and performs analysis on the model and
underlying source code.
The initial implementation of ST presents a flexible interme-
diate model structure that follows a syntax tree style as seen in
compiler implementations. The evaluation section presents an
intermediate model for CPP conditional compilation analysis, using ST's generic model definition but built by a specific CPP analysis tool in its CPP model provider. Future versions of ST intend to expand the syntax-tree-based intermediate model so that it can also accommodate call graphs.
The interaction between these layers is prescribed through
well defined interfaces. The intermediate model generator and
query manager offer up an API for instances of the bottom
source analysis contributor layer. While the model is being
generated, the ST framework relies on the model provider
to optionally identify and mark elements, or constructs in
the model. Specifically, these marked constructs are later
displayed by ST at the UI layer and visually aid a tool user in
querying the source code. The evaluation section will present
examples of these user-selectable abstractions. Both model and
query providers that form part of a source analysis tool are
contributed into ST using Eclipse extension points defined by
ST.
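A minimal sketch of such a contribution, building on the hypothetical interfaces from Section III, might look as follows; the class names are illustrative, and the actual registration would live in the contributing plug-in's plugin.xml against the ST-defined extension points:

    import java.util.List;

    // Hypothetical domain contributor handed to ST via an extension point.
    class SketchDomainContributor {
        IModelProvider modelProvider() {
            return new IModelProvider() {
                public List<String> tokensToExtract() { return List.of("#", "if", "define"); }
                public IModelElement buildModel(List<String> tokenStream) {
                    return null;  // the domain's model-building rules would go here
                }
            };
        }

        IQueryProvider queryProvider() {
            return new IQueryProvider() {
                public List<IQuery> queries() { return List.of(); }
            };
        }
    }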
The goal here is to determine whether this intermediate
model in the form of a syntax tree is adaptable to sources
implemented in various languages and can present meaningful
representations of the source code to be queried and analyzed
through a common UI.
V. CASE STUDIES
ST's viability as a general framework is examined through three case studies. The framework contains common and reusable UI and visualization functionality, as well as automatic integration with the underlying Eclipse platform, including inter-operability with existing Eclipse components like source editors, all of which can be leveraged by a source analysis tool. The first case study focuses on analysis of syntactically similar code patterns used in C pre-processor (CPP) controlled conditional compilation, the second is a simple proof-of-concept Comment analysis tool that finds duplicate comments in source code, and the third is a proposed prototype for assembly code analysis. All three case studies are built upon the ST framework.
A. Case Study: CPP Conditional Compilation Analysis
Our first evaluation of the ST framework focuses on imple-
menting an analysis tool to investigate potential duplicate code
patterns controlled by CPP flags in an open source C-based
system, namely the CVS 1.11.17 source code [6].
The motivation behind implementing such an analysis tool is
driven by the need to study code patterns that may potentially
be found across different build configurations of a C-based
software system relying on CPP for configuration. Eclipse's built-in search functionality may not readily assist in such source analysis, as 1) it does not automatically discover all possible CPP flags that control configuration, and 2) the tool user does not know ahead of time what patterns to search for. Rather, the tool user would like to discover these patterns
through analysis of all conditionally compiled sections of code
using a specific query that compares conditionally compiled
sections against each other. A specialized tool to perform
such an operation is therefore needed, and ST provides the
framework and API to do so.
In order to build an intermediate model that can be used by ST's existing common UI and querying mechanisms, ST's extension points require a CPP analysis tool to contribute two types of providers implementing various APIs: a model provider and a query provider.
In general, an analysis tool using ST must implement model and query providers that are responsible for the following (a simplified sketch follows the list):
1) defining tokens that ST should tokenize, like language
keywords (for instance #, if, define). ST provides a
clear API to define tokens which an analysis tool must
implement.
2) intermediate model generation rules that build an in-
termediate model based on the tokens parsed by ST.
This means that although ST manages the generation
of tokens from an input source, and subsequent man-
agement of the intermediate model when used by other
components of ST like the UI and Query layers, ST
delegates the actual building of the model to the analysis
tool model provider.
3) marking constructs of interest to the analysis tool, which
are displayed by ST in a common ST UI, and serve
as analysis context for performing queries. These con-
structs are marked by the analysis tool through API
provided by the framework. In the case of the CPP
analysis tool, these constructs are CPP flags controlling
conditionally compiled sections of code, and may rep-
resent possible build configurations.
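A minimal sketch of such a provider is shown below, simplified in two ways that the real tool would not be: it scans source lines rather than ST's token stream, and it ignores nesting and #else/#elif branches. All names are illustrative:

    import java.util.*;

    // Hypothetical, simplified CPP model provider: one model element per
    // conditional block, with the controlling flags collected as constructs.
    class CppModelSketch {
        record Block(String condition, List<String> body) {}

        final List<Block> blocks = new ArrayList<>();
        final Set<String> constructs = new TreeSet<>();   // CPP flags for the Constructs View

        void model(List<String> sourceLines) {
            String condition = null;
            List<String> body = new ArrayList<>();
            for (String line : sourceLines) {
                String t = line.trim();
                if (t.startsWith("#if")) {            // also matches #ifdef/#ifndef here
                    condition = t.substring(t.indexOf(' ') + 1);
                    for (String flag : condition.split("\\W+"))
                        if (!flag.isEmpty()) constructs.add(flag);   // mark as construct
                } else if (t.startsWith("#endif")) {  // close the block: emit an element
                    blocks.add(new Block(condition, List.copyOf(body)));
                    condition = null;
                    body.clear();
                } else if (condition != null) {
                    body.add(t);                      // conditionally compiled code
                }
            }
        }
    }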
Shown in Figure 3 is part of an intermediate model built by
the CPP model provider, representing conditionally compiled blocks of code as elements in the CPP-specific intermediate representation. The actual source code, int i = 0;, within the conditionally compiled blocks can be further modeled, but for the purposes of this evaluation it is shown as one child model element. This intermediate representation is formalized and
generic enough that it can be understood by other components
of the ST framework, particularly the UI and query processing,
but the structure is specific to CPP analysis.
The CPP model provider creates two types of model elements, one representing a conditionally controlled block for #if win || linux, and the other representing the actual
conditionally compiled code. Nested conditionally compiled
blocks are also possible (although not shown for brevity) as
child elements. The structure of the model is determined by
the CPP analysis tool, while the management of the model
and the execution of any query falls in the realm of the ST
framework.
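Feeding the sketch above the example of Figure 3 illustrates the resulting structure (hypothetical code, as before):

    CppModelSketch sketch = new CppModelSketch();
    sketch.model(java.util.List.of("#if win || linux", "int i = 0;", "#endif"));
    System.out.println(sketch.blocks);      // one block: condition "win || linux", child "int i = 0;"
    System.out.println(sketch.constructs);  // [linux, win] -- the flags marked as constructs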
Fig. 3. ST CPP syntax tree.
The proposal section mentioned that one of the features
of the ST framework is the ability for an analysis tool to
generically mark constructs in the intermediate representation
of the source code as potential abstractions that are displayed
to a tool user using ST's common, generic UI selection views.
These are then used by the ST framework as user-selectable
context to execute analysis queries. In the case of the CPP
model provider, CPP flags are marked as constructs which
are then displayed by ST as selectable values through a
common, generic ST Constructs View shown in Figure 4.
As the intention of the CPP analysis tool is to analyze con-
ditionally compiled code, this construct marking mechanism
allows an analysis tool to discover querying contexts using the
ST framework. In the example of the CPP analysis tool, the
visualized constructs present an automatically discovered list
of different possible configurations allowing a user to analyze
the source according to a given configuration of CPP flags.
ST delegates semantic knowledge of the marked constructs to the analysis tool, so the framework is not aware that these are CPP flags. Yet when they are displayed to a user in the Constructs View and a query is performed on them, ST's generic views will display the conditionally compiled code controlled by these flags, even though the framework functionality that allows the CPP flags to be marked, managed and displayed to a user is fully decoupled from the analysis tool extending the framework.
The Query Selection button in the view launches a wizard
where various queries for CPP analysis can be selected, all of which were contributed by the CPP analysis tool through a
CPP Query Provider. Shown in Figure 4 is the query result for
a CPP query that discovers potentially equivalent conditionally
compiled code controlled by SIGABRT, SIGHUP, SIGINT,
SIGPIPE, SIGQUIT, and SIGTERM configuration flags. The
results on the bottom right side show the query results in the first column, where the root result item is the result itself: (void)SIG_register(SIGABRT, patch_cleanup); and the child elements show similar patterns across different
locations. The lowest child is the line number locations of
the results within a file. The constructs column shows any constructs that served as context for the query results, which
in this case are the selected CPP flags controlling the inclusion
of the conditionally compiled C code. The last two columns
show query results location information that is generic, for
example number of hits per result, and number of affected
files containing the results.
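A hypothetical sketch of such a duplicate-detection query, operating on the blocks produced by the CppModelSketch above, could compare block bodies pairwise with a string-similarity ratio (a Levenshtein-based comparator is sketched in the Comment case study below); the threshold and all names are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    class CppQuerySketch {
        // Report pairs of conditional blocks whose bodies are nearly identical.
        static List<String> findSimilarBlocks(List<CppModelSketch.Block> blocks) {
            List<String> results = new ArrayList<>();
            for (int i = 0; i < blocks.size(); i++)
                for (int j = i + 1; j < blocks.size(); j++) {
                    String a = String.join("\n", blocks.get(i).body());
                    String b = String.join("\n", blocks.get(j).body());
                    if (StringSimilarity.similarity(a, b) > 0.9)
                        results.add(blocks.get(i).condition() + " ~ " + blocks.get(j).condition());
                }
            return results;
        }
    }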
Each results entry is fully navigable, as seen in Figure 4,
where a user has navigated from the result at line number 8359
in the rsc.c file pertaining to the flag SIGHUP, which ST then
detects and highlights in the editor. All the UI and automatic
integration with Eclipse source editors is fully leveraged by the
CPP analysis tool, which only needed to contribute model and query providers to ST, and did not require any UI or IDE integration implementation.
This evaluation shows that, at least for some CPP source analysis, ST's separation of common tooling functionality and specialized analysis tooling allows the analysis tooling to leverage all the common, generic ST functionality in the middle and upper layers. Specifically, the CPP analysis tool was able to use ST's generic construct abstraction mechanism to mark CPP flags as selectable context for invoking certain CPP queries. In addition, ST's generic Query Results view was able to present the CPP results in a way that made them meaningful for analysis of potentially duplicate conditionally compiled code, without the ST framework being aware of or having any dependencies on the CPP analysis tool. Finally, the CPP analysis tool was able to automatically use integration with the rest of Eclipse, as seen with the navigation from the Query Results View to an Eclipse editor, again without the analysis tool having to implement any of the integration.
B. Case Study: Source Code Comments
As ST is intended to be a general, extensible framework for software analysis tool development, this section further evaluates the reusability of the framework for domains other than CPP, presenting a simpler evaluation geared towards finding syntactically similar comment blocks in a high-level language. This proof-
of-concept study will generally describe the steps needed to
implement an analysis tool that identifies similar comments
within a code base leveraging the capabilities of the ST
framework.
There are three types of information that the Comment
analyzing domain would have to provide to ST:
1) A list of tokens to extract from the source.
2) Model rules for building the intermediate representation
of the comment blocks within a code base.
3) A query that performs analysis on the intermediate model elements representing comment blocks.
The C language defines comments as blocks of text in the
source preceded by either //, or starting with a /* and ending
with a */, where the content within the block starts with a
*. Therefore the Comment model provider forming part of
the Comment analysis tool contributed into the ST framework
would define the following four tokens: //, /*, * and */.
The second step is to define a model rule that builds the
intermediate model. It is the job of the Comment model
provider to build the model based on the stream of tokens
processed by the underlying ST framework. As the pattern
for comments is fairly straight forward, the provider would
detect within the token stream provided by ST the first two
tokens: the comment symbol and the comment line following.
It therefore builds a model element representing a comment
block by detecting the starting and ending comment token for
/* and */ case, or in the case of //, by detecting a token that
is not preceded by a // token.
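A minimal sketch of this rule, assuming a regular-expression scan over raw source rather than ST's token stream (names illustrative), is:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class CommentRuleSketch {
        // Emit one "model element" (here, just the text) per comment block;
        // nesting and string-literal handling are deliberately omitted.
        static List<String> extractComments(String source) {
            List<String> comments = new ArrayList<>();
            Matcher m = Pattern.compile("/\\*.*?\\*/|//[^\\n]*", Pattern.DOTALL)
                               .matcher(source);
            while (m.find()) comments.add(m.group());
            return comments;
        }
    }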
The provider also has the option of marking constructs to
display in the Constructs View that can facilitate querying and
analysis of comments in the source. In this case, the Comment
provider creates construct objects representing files containing
comment blocks, and marks them as selectable values using ST API; ST then displays these in the Constructs View.
The Comment analysis tool would also contribute a query
into ST that would identify a comment model element as
ST visits the intermediate model during a query operation.
If the comment is identified, the Comment analysis tool query can use one of ST's many general, built-in analyzers available through a factory to detect syntactically similar code blocks. Both the CPP and Comment query providers use an ST Levenshtein comparator that is applicable to both cases, as the comparator only requires a string representation of the model element values (either the raw conditionally compiled code or the comment body text).
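As a sketch of the kind of comparator this implies (ST's own comparator API is not shown here), a standard Levenshtein distance and a similarity ratio derived from it:

    class StringSimilarity {
        // Classic two-row Levenshtein edit distance.
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Similarity in [0,1]: identical strings score 1.0.
        static double similarity(String a, String b) {
            int max = Math.max(a.length(), b.length());
            return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
        }
    }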
As an extension to this implementation of a comment-specific analysis tool contributed into the ST framework, comments for other programming languages can be taken into account.
For example, to include Javadocs [22] the Comment model
provider would have to define a fifth comment token type /**.
The language rule would remain the same, with the exception
of adding a model element indicating a Javadoc block.
Additionally, the identification of constructs used by the Java Comment query provider can be extended: with two types of comment blocks, Javadocs and inline comments, the enhanced Comment analysis tool can mark the block type as a construct itself. This introduces a level of abstraction that
can be leveraged at the ST UI layer allowing a user to narrow
a search within one of the two types: NON_JAVA_DOC and
JAVA_DOC.
Fig. 4. ST views with navigation and query capabilities demonstrated.
C. Future Work: Assembly Code
Tracks, a proof-of-concept tool [4], connects to a debugger, allowing for dynamic analysis and navigation in an Eclipse
plug-in. As shown in Figure 5, the protocol between Tracks
and the debugger, currently provided by IDA Pro, establishes
communication for the analysis which includes system calls.
This essentially extends the static capabilities of existing
support in IDA Pro to incorporate dynamic analysis, external
modules, and even multi-threaded code bases. This protocol
and API could also be used interchangeably to provide the
same navigation support for other debuggers, for example,
those associated with HLASM code bases.
As a third case study, ST can be used to implement an
analysis tool for low level languages. As with the CPP and
Comment case studies mentioned above, an Assembly analysis
tool built on top of ST would have to contribute 1) a set of tokens to parse (for example, instructions and registers), 2) modeling rules to build an intermediate representation of the assembly source, 3) marked facts about the parsed assembly source, serving as constructs that can be used as query contexts, and 4) queries that can perform analysis on the intermediate model.
The Assembly analysis tool would have to contribute two types of providers: model and query providers. These implement ST-defined API which allows the analysis tool to interact with the underlying common framework, much like in the case of the CPP and Comment use cases.
Fig. 5. Tracks' debugger interaction protocol.
Although the set of tokens may be straightforward to
implement for any assembly language, an area of current
research is to determine what modeling structure can be built
to represent assembly code, or specific portions of assembly
code that require analysis. One research area is to determine
if assembly-specific models contain enough common structure
such that a general Assembly analysis tool built on top of ST
can be implemented, and further extended to accommodate
specific variations. If so, it would also increase the possibility
that the general, ST-based Assembly analysis tool would better
adapt to changes in the low level languages, which was cited
earlier as one of the weak points in current low level analysis
tooling.
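Whatever structure that research settles on, a plausible starting point, sketched here with illustrative names, is to split each listing line into address, mnemonic, and operands as the raw material for the model:

    // Hypothetical first step of an Assembly model provider.
    record AsmInstruction(String address, String mnemonic, String operands) {}

    class AsmSketch {
        static AsmInstruction parseLine(String line) {
            String[] parts = line.trim().split("\\s+", 3);
            return new AsmInstruction(parts[0],
                                      parts.length > 1 ? parts[1] : "",
                                      parts.length > 2 ? parts[2] : "");
        }
    }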
The proof-of-concept case studies for CPP and comment
analysis tools built using the ST framework indicate that our
framework is a viable starting point to further study and
implement general Assembly analysis tools that are extensible
and adaptable to future changes.
VI. CONCLUSION
Section III proposes ST as a framework that is extensible
in terms of accommodating new ISAs, platforms and tools. A
prototype of this framework illustrates the layered architecture
intended to provide this extensibility and inter-operability with
existing tools. Two case studies and a discussion of how the ST approach could apply to assembly code serve to evaluate this proposed approach.
While this proposed framework focuses on static analysis techniques, we intend to investigate the possibility of leveraging virtualization technology in order to support dynamic
analysis. We are particularly interested in understanding how
ST should interact with malware, especially when malware
tries to detect our system, and how to accurately monitor the
virtual environment without introducing inaccuracies into the
results. We would like to further investigate the application of an intermediate representation as posed in this work, but will also investigate how an intermediate representation could leverage the structure of an intermediate language to
provide a clean method of performing dynamic analysis in a
virtual environment.
REFERENCES
[1] BinNavi. http://www.zynamics.com/binnavi.html, 2011.
[2] Eclipse platform technical overview. http://www.eclipse.org/articles/Whitepaper-Platform-3.1/eclipse-platform-whitepaper.html, January 2011.
[3] G. J. Badros and D. Notkin. A framework for preprocessor-aware C source code analyses. Softw. Pract. Exper., 30:907–924, July 2000.
[4] J. Baldwin, P. Sinha, M. Salois, and Y. Coady. Progressive user interfaces for regressive analysis: Making tracks with large, low-level systems. AUIC 2011, pages 1–10, Nov 2010.
[5] D. Brumley and I. Jager. The BAP Handbook. 2009.
[6] CVS. Concurrent versions system. http://www.nongnu.org/cvs/, April 2011.
[7] I. Dillig, T. Dillig, and A. Aiken. SAIL: Static Analysis Intermediate Language with a Two-Level Representation. Technical report, 2009.
[8] E. Dorion, D. Grenier, J. Brodeur, and D. O'Brien. A Taxonomy of Heterogeneity Types for Ontological Alignment. In preparation, Dec. 2010.
[9] T. Dullien and S. Porst. REIL: A platform-independent intermediate representation of disassembled code for static code analysis. In CanSecWest Applied Security Conference, 2009.
[10] T. Eisenbarth, R. Koschke, and D. Simon. Derivation of feature component maps by means of concept analysis. In Proceedings of the Fifth European Conference on Software Maintenance and Reengineering, pages 176–179. Society Press, 2001.
[11] J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, Heidelberg (DE), 2007.
[12] I. Guilfanov. Decompilers and beyond. BlackHat USA 2008, pages 1–12, Jul 2008.
[13] H. Guo, J. Pang, Y. Zhang, F. Yue, and R. Zhao. HERO: A novel malware detection framework based on binary translation. Intelligent Computing and Intelligent Systems (ICIS), 2010 IEEE International Conference on, 1:411–415, 2010.
[14] IBM. High Level Assembler (HLASM) and Toolkit Feature. http://www-01.ibm.com/software/awdtools/hlasm/library.html, 2011.
[15] L. Martignoni, R. Paleari, G. Fresi Roglia, and D. Bruschi. Testing CPU emulators. In Proceedings of the 2009 International Conference on Software Testing and Analysis (ISSTA), pages 261–272, Chicago, Illinois, USA. ACM.
[16] D. Melski, T. Teitelbaum, and T. Reps. Static analysis of software executables. Conference For Homeland Security, 2009. CATCH '09. Cybersecurity Applications & Technology, pages 97–102, 2009.
[17] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI 2007), pages 89–100, San Diego, California, USA, June 2007.
[18] D. Pucsek, J. Wall, C. Gibbs, J. Baldwin, M. Salois, and Y. Coady. ICE: Circumventing meltdown with an advanced binary analysis framework. In Proceedings of the First International Workshop on Developing Tools as Plug-ins (TOPI), colocated with the International Conference on Software Engineering (ICSE), 2011.
[19] J. Rodriguez, C. Cesario, K. Galvan, B. Gonzalez, G. Kroner, G. Rutigliano, and R. Wilson. IBM Rational Application Developer V6 portlet application development and portal tools. IBM Corp., Riverton, NJ, USA, 2005.
[20] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena. BitBlaze: A new approach to computer security via binary analysis. Information Systems Security, pages 1–25, 2008.
[21] D. Spinellis. Global analysis and transformations in preprocessed languages. IEEE Transactions on Software Engineering, 29:1019–1030, 2003.
[22] Sun Microsystems. Javadoc Tool Home Page. http://www.oracle.com/technetwork/java/javase/documentation/index-jsp-135444.html, April 2011.
[23] A. Telea, O. Ersoy, and L. Voinea. Visual analytics in software maintenance: Challenges and opportunities. In Proceedings of the 5th International Symposium on Software Visualization, New York, NY, USA, 2010. ACM.
[24] Q. Tu and M. W. Godfrey. An integrated approach for studying architectural evolution. In 10th International Workshop on Program Comprehension (IWPC 2002), pages 127–136. IEEE Computer Society Press, 2002.
PRESENTATION SLIDES
Spinal Tap: High Level Analysis for Heavy Metal Systems. 2011 IEEE Pacific Rim Conference on Computers, Communications, and Signal Processing. Nieraj Singh, Dean Pucsek, Jonah Wall, Celina Gibbs, Martin Salois, Yvonne Coady.

Slide 2, Problem:
- Difficult to understand large assembly code bases
- Lack of tools to analyze assembly code
- Source not always available

Slide 3, Example: Text. A disassembled listing of _main:

    _main:
    100000ec4  pushq  %rbp
    100000ec5  movq   %rsp, %rbp
    100000ec8  subq   $0x10, %rsp
    100000ecc  callq  0x100000f18   ; getchar
    100000ed1  movl   %eax, 0xfc(%rbp)
    100000ed4  cmpl   $0x61, 0xfc(%rbp)
    100000ed8  je     0x100000ee2
    100000eda  cmpl   $0x62, 0xfc(%rbp)
    100000ede  je     0x100000ef0
    100000ee0  jmp    0x100000efe
    100000ee2  leaq   0x0000003b(%rip), %rdi
    100000ee9  callq  0x100000f1e   ; puts
    100000eee  jmp    0x100000f0a
    100000ef0  leaq   0x0000002f(%rip), %rdi
    100000ef7  callq  0x100000f1e   ; puts
    100000efc  jmp    0x100000f0a
    100000efe  leaq   0x00000023(%rip), %rdi
    100000f05  callq  0x100000f1e   ; puts
    100000f0a  movl   $0x00000000, %eax
    100000f0f  leave
    100000f10  ret

shown alongside the C source it was compiled from:

    int main() {
        switch (getchar()) {
            case 'a': printf("a\n"); break;
            case 'b': printf("b\n"); break;
            default:  printf("def\n"); break;
        }
        return 0;
    }

Slide 4, Example: CFG. The same listing rendered as a control-flow graph: the _main entry block falls through to the compare blocks at 100000eda and 100000ee0, which branch to the three puts blocks at 100000ee2, 100000ef0, and 100000efe, all of which rejoin at the exit block at 100000f0a (movl, leave, ret).

Slide 5, Goals:
- Enable static and dynamic analyses
- OS independence
- Visual representation of assembly
- Extensibility: instruction set architectures (ISAs), analysis techniques, object file formats

Slide 6, IDA Pro.

Slide 7, How Does IDA Pro Do? A checklist over the goals (static analysis, dynamic analysis, OS independence, visual representation, adding additional ISAs, creating custom analyses, inserting file loaders), with checkmarks marking those IDA Pro covers.

Slide 8, Tool Foundation. Three possible approaches:
- Pure framework
- Pure intermediate language
- Hybrid

Slides 9-11, diagrams of the three approaches: Framework (Analysis over Binary), Intermediate Language (Analysis over an Intermediate Language over the ISA), and Hybrid (combining both).

Slides 12-13, ICE: a layered design with ICE UI, ICE Bridge, and ICE Core; ISA support for x86, ARM, and PowerPC; and analyses such as Binary Comparison, Code Coverage, and Dynamic Trace. The ICE Bridge comprises a Binary API, the Intermediate Language, and an Analysis API.

Slide 14, Summary: the same checklist as Slide 7.

Slide 15, Future Work:
- Binary API
- Environment and functionality of dynamic analysis
- Visualizations of assembly code
- Further implementation of the prototype in conjunction with stakeholders

Slide 16, Questions? Dean Pucsek, Email: [email protected]