Spinal Tap: High Level Analysis for Heavy Metal Systems
Nieraj Singh, University of Victoria
Email: [email protected]
Celina Gibbs, University of Victoria
Email: [email protected]
Dean Pucsek, University of Victoria
Email: [email protected]
Martin Salois, DRDC Valcartier
Email: [email protected]
Jonah Wall, University of Victoria
Email: [email protected]
Yvonne Coady, University of Victoria
Email: [email protected]
Abstract—Program comprehension tools targeting specific high-level languages do not currently scale to the complexities of many of today's low-level systems. At the lowest level, the wide variety of architectures and platforms results in a widening spectrum of instruction sets and assembly languages. Slightly above this level, C-based systems targeting multiple architectures and platforms are riddled with compiler directives to accommodate the demands of configurable systems.
This paper proposes a generalized and extensible framework for the purpose of program navigation and analysis, leveraging an intermediate representation of source code to separate low-level domain detail from tool support. A prototype of this framework is provided with two case studies evaluating its efficacy within multiple domains. This study demonstrates the feasibility of an extensible framework as a common core for low-level program comprehension tools.
I. INTRODUCTION
Analysis of low-level code bases written in C or assembly is currently challenging due to a lack of tool support. Developers performing program analysis of such systems are faced with large code bases, a plethora of computer architectures (x86, ARM, PowerPC, HLASM, etc.), and binary file formats (ELF, Mach-O, PE, etc.). More specifically, in the case of
malware analysis, an ever growing landscape of obfuscation
techniques employed by malware developers has intensified
the need for low-level analysis tools.
Given the future trajectory of multi-core applications that
are bound to amplify these problems, we believe it is
paramount to consider extensibility as a primary requirement
for program comprehension tools specifically targeting low-
level code bases. Though many inroads have been made for
tools targeting specific high-level languages, often enhancing
an already impressive tool-set within an Integrated Develop-
ment Environment (IDE), these tools typically do not extend to low-level code bases. This is at least in part due to
the fact that these tools align well with high-level language
mechanisms, such as classes, methods and even functions, but
were not designed to scale to include complexities such as
less structured pre-processor directives in C (CPP) and even
transfer of control and data through jumps and registers in
assembly.
This paper proposes a generalized and extensible framework
for the purpose of program navigation and analysis, leveraging
an intermediate representation of source code to separate low-
level domain detail from tool support. A prototype of this
framework is provided with a detailed case study evaluating
its efficacy in terms of pre-processor directives in C-based
systems. Further investigation in terms of the assembly lan-
guage domain reveals the possible synergy of leveraging an
intermediate language within further program comprehension
tools. These studies demonstrate the feasibility of an extensible
framework as a common core for low-level source comprehen-
sion tools.
Extensibility is crucial because it allows a framework of
comprehension tools to more easily adapt to both current and
future technologies. Moreover, a focus on extensibility allows
us to form a community of third-party developers who are able
to contribute ideas and code to better the system. In an attempt
to address this issue of comprehension of complex software
in various forms, we propose a generalized framework built
on the ability to define various models that interface with
the tools through a formalized intermediate representation.
This framework, which we call Spinal Tap (ST), is also intended
to be inter-operable with tools designed for a wide variety of
areas including program verification, program comprehension,
malware analysis, vulnerability detection and education.
In this paper we begin by discussing related work in
Section II and motivate the need for extensibility in Section III,
suggesting an extensible framework capable of performing
arbitrary analysis on various representations of low-level code
bases. In Section IV we describe the design and implemen-
tation of a prototype framework followed by an evaluation
of the proposed framework with case studies targeting C
preprocessor directives, program documentation techniques,
and assembly languages in Section V. Finally, we present our
plans for future work and a summary of our conclusions in
Section VI.
II. RELATED WORK
One way of implementing a software analysis tool is to define modeling meta-data that leverages
tooling functionality and UI with higher-level views of the underlying software system, using existing frameworks like IBM's RAD [19]. Another is to develop a tool that works on intermediate object representations of code, like Eisenbarth's C concept analysis tool [10]. Our ST approach focuses on an API-based mechanism for information transfer between a specialized tool dedicated to the analysis of a particular code base and the underlying common tooling framework that performs much of the tooling functionality.
Prior research with tools like BEAGLE [24] has shown the need for source-based analysis. However, some software
concerns like fine grained source level configuration may
be highly scattered across many files, and concerns may
not be easily conceptualized by viewing source code alone.
Consequently, source analysis tools may opt for higher level
views of a software system that take into account complex
structures, dependencies and flows that are not immediately
discerned from just viewing the source code. This level of abstraction may allow developers to not only better understand functionality and identify accurate structures and
structural dependencies in a code base, but possibly query and
analyze the source code using these abstractions.
In particular, program structures and call graphs cannot
always be directly parsed from language definitions of a
complex, configurable code base without further input about
the software system during the parsing process. For example,
research has shown that parsing C code with heavy CPP
usage often results in incomplete or inaccurate syntax trees
if the code is parsed before pre-processing [3], [21]. Instead,
the research showed that using an Application Programming
Interface (API) to obtain additional information during source
code processing can result in more accurate modeling of com-
plex configurable source code as opposed to simply parsing
the source using C and CPP language rules which may even
result in a syntactically incorrect representation. Our proposed
ST framework uses a similar principle to allow specialized
analysis tools built on top of ST to leverage the API during
the source processing and modeling phases to fine tune a
formalized intermediate representation, and therefore better
account for source structures, dependencies and flows that
may not be correctly modeled from language rules alone.
This API introduces modeling flexibility, where a specialized
analysis tool using our framework can generate source models
that take into consideration variations and complexities not
readily captured by just the syntax and semantic rules of an
implementation language, in particular if the implementation
language is coupled with pre-processing.
Most people are familiar with the telephone game, in which
a player whispers a starting sentence to a second player, who whispers it to a third player, and so on until the last player
speaks it out loud. More often than not, the original sentence
is utterly deformed, to the general amusement of the crowd.
Few people know that the same phenomenon is almost a given
when translating from one computer language to another. This
characteristic is due to the presence of heterogeneities between
data structures. When a data type does not exist in the target
structure, the human translator will still represent it somehow.
This will inevitably result in information loss or, even worse,
false information. Euzenat and Shvaiko [11] provide a general
introduction to the subject for inter-operability, and Dorion et al. [8] provide a detailed characterization of the possible hetero-
geneity types, leading to a measure of how much information
is lost.
There are multiple strategies and projects dedicated to the
comprehension of application programs and their translation
to an intermediate form or language. Formalized Intermediate
Languages (ILs) have been developed for both the translation
of source to a specified IL and the translation of binaries
or executables to an IL. In both cases, the IL facilitates
program analysis through integration with tools that leverage
the structure of the given IL.
For example, SAIL [7] is an open-source front end that, given a program's source code, generates two levels of ILs, creates a graphical representation of the program's control flow, and is conducive to static analysis. On the other hand, REIL
(Reverse Engineering Intermediate Language) [9] provides
an IL abstracted from native assembly code supporting the
development of analysis tools and algorithms that are platform
independent. BinNavi [1] is an example of a commercial plug-in tool implemented using REIL as its intermediate language.
REIL’s limited instruction set introduces a one-to-many
relation between native instructions and REIL instructions
and therefore leaves it unable to translate certain classes of
instructions. Unusual instructions are essentially ignored by
translating them into a variant of the NOP instruction, like
the UNKN instruction in REIL that the translator uses to
signal an unrecognized instruction [1]. However, if such an
instruction exists in an IL, dynamic binary analysis becomes
infeasible [17]. If dynamic binary analysis is required, it must
then be accomplished by means of an emulator or a full system
simulator. HLASM [14], for example, is significantly different
than X86 assembly and currently has no translation to the IL
in REIL.
IDA Pro [12] is a popular interactive disassembler that
allows for inspection of software binaries through analysis
both built-in and external to IDA Pro. IDA Pro follows a
traditional plug-in architecture and provides an API that allows
third-party developers to produce more complex analysis [4].
There are several other frameworks in place, each with their
own IL, and each with their own advantages and drawbacks.
CodeSurfer/x86 [16] is one of the more well-known frameworks, but its proprietary nature leaves it closed to inspection and extension.
HERO (Hybrid sEcurity extension of binaRy transla-
tiOn) [13] is a promising framework that claims to support
an efficient combination of static and dynamic binary analysis
methods. One key advantage to HERO is that it appears to
be entirely self-contained; that is to say, it does not seem to
rely on any third-party components. A formal specification
language is used for its IL, but no implementation details are revealed to verify the claim that it is a faithful representation of the assembly code.
BitBlaze [20] is a framework that was developed at UC
Berkeley and aims to provide a better understanding of dis-
assembled software through a fusion of static and dynamic
analysis. It consists of three main components: VINE, TEMU,
and Rudder. A recent study of various emulators, including
QEMU, Valgrind, PIN and BOCHS, revealed some unfaith-
fully emulated instructions, resulting in system misrepresenta-
tion, emulator malfunction or failure [15].
BAP (Binary Analysis Platform) [5] was developed by two
of the contributors to the BitBlaze project one year after
the BitBlaze paper was published. It contains essentially the
same features, except it cannot tie IL instructions to the
original assembly the way BitBlaze and many frameworks can.
Both BitBlaze and BAP are distributed as VMware virtual machines; however, the errors revealed by Martignoni et al. [15] are still a relevant concern.
BinNavi [1] is a commercial plug-in tool that was created by
Zynamics and is implemented using REIL as an intermediate
language. It provides extensibility support through MonoR-
EIL, an add-on framework. REIL consists of 17 instructions,
including NOP and UNKN, which do not actually correspond
to any executable instructions. UNKN is just like NOP except
it tells the analyst that an instruction was encountered that
could not be translated, such as a floating-point instruction.
In addition to floating point instructions, MMX, SSE, and
other CPU extensions are also not supported at this time.
The presence of the NOP and UNKN instructions in the REIL
instruction set precludes the possibility of dynamic binary
translation, so BinNavi currently supports only static analysis.
Dynamic analysis is possible with REIL, but would have to
be implemented using a whole-system simulator or emulator.
III. A NEED FOR EXTENSIBILITY
As established by the survey in Section II, the primary inadequacy of current support for program comprehension is the inability of these tools to keep pace with modern code bases and to scale or adapt to other languages. The reality
is that unless analysis tools are designed to fluidly adapt to
constantly changing circumstances they will not be able to
provide adequate support and will quickly become obsolete.
In [18], we proposed the concept of an extensible frame-
work, specifically targeting binary analysis. The layered design
of this framework, shown in Figure 1, is intended to meet these requirements: 1) supporting new instruction set architectures, 2) maintaining platform independence, and 3) supporting inter-operability of new analysis modules. In this paper we look to
generalize the problem space by proposing the ST analysis and
program comprehension framework to scale across multiple
forms of complex software including binary.
With respect to instruction set architectures, more and more
software systems are breaking out of the confines of the
desktop PC and forging paths in new areas such as mobile
devices, computing on GPUs and embedded systems. As a
result, it is no longer a question of if a new architecture will be encountered, but of when. ST must adapt to
new instruction sets, accurately and faithfully modeling the
intricacies of the different architectures.
Fig. 1. Proposed Spinal Tap framework design.
In terms of software platforms, rather than dictating which
platform (e.g. Microsoft Windows, Mac OS X, Linux) users
of ST must be on, users should be able to choose. As a result
of loosening the platform requirement we allow ST to fit into
existing work flows and we increase the number of potential
users. Decoupling the ST tooling framework from the runtime
environment is one of the primary reasons why our portable
ST framework is built on top of an existing IDE like Eclipse,
where platform specifics are largely hidden at the ST level.
Tool users may have to obtain a base Eclipse IDE that is
specific to a platform, but the installation of the ST tooling
framework would be the same regardless of which platform-
specific Eclipse is used. Furthermore, it more readily allows
the integration of new and unanticipated analysis techniques
through well defined, Eclipse-based ST extension points, and
specifically allows for inter-operability with other ST and
Eclipse-based tools.
Section II mentioned some of the various ways of implementing a software analysis tool. We proposed an API-based tooling framework; two important reasons to consider an API-based approach are listed below (a minimal interface sketch follows the list):
1) An API may allow tools built on top of a tooling framework to identify abstract concepts in the code that may aid software comprehension, and furthermore serve as selectable context for analysis. This is explored further in Section IV.
2) An API cleanly separates common tooling functionality from specialized analysis tool logic. In particular, the API delegates responsibility for fine-tuning an intermediate model representation to the specialized analysis tool. This may allow source analysis tools to model the underlying source code more accurately during the parsing and modeling process, capturing properties that implementation-language specifics alone cannot, in particular when the source language is coupled with pre-processing.
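To make this separation concrete, the following minimal sketch (in Java, matching ST's Eclipse setting) shows what such provider-facing interfaces could look like. All names are hypothetical illustrations, not ST's actual API:

    import java.util.List;

    // Hypothetical provider-side interfaces: the framework owns tokenizing,
    // model management, querying and UI; the tool supplies domain knowledge.
    interface IModelElement {                       // node in the intermediate model
        String label();
        List<IModelElement> children();
    }

    interface IQuery {                              // an analysis the tool contributes
        String name();
        List<String> run(IModelElement modelRoot);  // rows for a generic results view
    }

    interface IModelProvider {                      // implemented by the analysis tool
        List<String> tokensToExtract();             // tokens the framework should produce
        IModelElement buildModel(List<String> tokenStream);  // model-building rules
    }

    interface IQueryProvider {                      // implemented by the analysis tool
        List<IQuery> queries();
    }

Later sketches in this paper build on these illustrative interfaces.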
The layered design shown in Figure 1 provides a struc-
ture that applies to framework construction in general. This
technique is similar to that used in the Eclipse IDE [2] to
achieve increased portability between platforms. The isolation
of platform-specific details at the lowest level is tailored to the
core API that integrates with an intermediate representation.
Specific analysis and visualization tools leverage the abstract
representation of the source code provided by this intermediate
representation.
This use of an intermediary allows analysis tools to integrate
in a uniform way with various representations of source code.
Our investigation into related work has demonstrated the
limitations of a strictly intermediate language based approach,
and as a result we investigate a more general and inclusive
approach for ST based on the addition of protocols and APIs
for inter-operability.
IV. A PROTOTYPE OF THE SPINAL TAP FRAMEWORK
The ST framework design is motivated by four key goals:
1) automatically integrate a software analysis tool set into an
existing IDE, 2) establish a clear separation between the inter-
mediate representation and the underlying source code details,
3) present a common abstraction mechanism that may aid a
user in software comprehension and analysis, and 4) define
a common, extensible querying mechanism that allows for
source code analysis on both the intermediate representation
as well as the underlying source linked to structures in the
intermediate model.
The first goal is addressed with the integration of ST as
a set of Eclipse plug-ins, where ST leverages the Eclipse
platform and allows specific source analysis tools contributed
into the ST framework to automatically communicate with
other existing components of Eclipse, like source editors,
source outline views, and file management.
The second goal is addressed with the introduction of a
common intermediate representation and API that decouples
general framework functionality, like parsing, modeling, visualization and querying of extracted source facts, from specifications and properties of the language used in the source.
The third goal is achieved through the common API defined
by the ST framework and implemented by the source analysis
tool. This interface allows the specialized tool to mark facts
about the modeled data, referred to as constructs, that represent
meaningful, selectable abstractions to a tool user. This abstraction can both facilitate an understanding of a code base and serve as selectable contexts for the query capabilities of software analysis techniques.
Finally, the fourth goal is once again achieved by the API
in the separation of the execution and visualization of query
results in a common ST view from the actual specialized
source analysis.
All four features work seamlessly together, and allow soft-
ware analysis tool developers to focus solely on specific source
properties like: 1) defining source language model building
rules that allow the transformation into the intermediate repre-
sentation or model, 2) establishing queries that analyze specific
qualities of the code base by querying this model of the
underlying source and 3) optionally marking constructs in the
parsed and modeled data that are displayed to a tool user and
serve as querying contexts.
The ability to query both the underlying source and the
intermediate representation provides flexibility to the analysis
tools by providing access to a code base at a higher level
of abstraction as well as direct access to lines of code.
The intermediate representation is mapped to structures in
the source code that may facilitate further analysis of those
blocks. The case study to follow in Section V demonstrates
these capabilities with an intermediate representation used
to model configuration logic within source code. The model
structures are automatically linked to relevant source code
during the modeling process using common source location
properties and allow the queries to focus on the structures
of the intermediate model when looking for potential logic
patterns.
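As a minimal sketch of this linking, assuming hypothetical names rather than ST's actual classes, a model element could simply carry the common source-location properties recorded during modeling:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical: the location ties a model structure back to its lines of code.
    record SourceLocation(String file, int startLine, int endLine) {}

    class LocatedModelElement {
        final String label;                        // e.g. a CPP condition
        final SourceLocation location;             // recorded during the modeling phase
        final List<LocatedModelElement> children = new ArrayList<>();

        LocatedModelElement(String label, SourceLocation location) {
            this.label = label;
            this.location = location;
        }
    }

A query that matches such an element can then report both the abstract structure and the exact lines it maps to.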
The following paragraphs present the ST framework design
in terms of the three layers depicted by shading in Figure 2.
Further, we identify the ways in which the goals are addressed
within the details of this design.
The top User Interface (UI) layer contains common ST
mechanisms that allow for the launching of a source analysis
tool within an integrated environment, where the analysis
tool leverages existing ST and Eclipse UI. A reusable UI for
running the tool from within Eclipse provides visualization of
analysis results and navigation from results to existing source
editors.
This UI layer has three main sections: 1) a project selection component for source project/directory selection and invocation of ST through context menu actions that integrate into existing Eclipse project views, namely the Project Explorer, 2) a selection view that displays selectable constructs according to goal three and launches wizards to select and execute source analysis queries, and 3) a query view which presents navigable query results that automatically integrate with existing
gable query results that automatically integrate with existing
Eclipse source editors. To support the integration goal, the two
ST views are set within the Eclipse main workbench where
development, source control and building are also performed
according to the requirement that a source analysis tool be
integrated into an IDE [23].
The middle layer, in general, deals with common ST framework functionality, like generation of an intermediate representation and query capabilities, within the source processing and query processing components, respectively. These two
processing components work together to provide information
to the UI layer for the purpose of visualization of the ab-
stractions marked by an analysis tool during the ST source
parsing and modeling phase, as well as displaying query
results. The generic intermediate model generator provides the
commonly understood model or intermediate representation
that the query processing and visualization layers of the
framework understand regardless of the source code specifics.
The middle layer contains common tooling functionality code
that is independent of the source analysis tool extending ST,
and includes parsing, tokenizing and modeling of a source
base into an intermediate model. The optional abstraction
of constructs to represent the source code can be used 1)
to support code comprehension through the UI and 2) and
for high-level analysis by the query processing component.
The query manager within the query processing component
Fig. 2. Design of the ST Framework.
contains the common query execution and query results logic
independent of the source analysis tools at the UI level.
The bottom layer provides the source analysis specific
implementation of the API offered up by the middle layer
of the ST framework for the generation of the model, optional
marking of constructs and query definitions. Specifically, a domain contributor provides implementations of a model provider and a query provider, which form the core components of a source analysis tool that leverages ST's extensible and reusable
mechanisms. The model provider supplies an implementation
of the model building rules for tokenizing a source code and
building an intermediate model, as well as construct marking
that serves as query context abstractions. The query provider
understands the structure of the intermediate model built by
the model provider and performs analysis on the model and
underlying source code.
The initial implementation of ST presents a flexible interme-
diate model structure that follows a syntax tree style as seen in
compiler implementations. The evaluation section presents an
intermediate model for CPP conditional compilation analysis, using ST's generic model definition but built by a specific CPP analysis tool in its CPP model provider. Future versions of ST intend to expand the syntax-tree-based intermediate model so that it can also accommodate call graphs.
The interaction between these layers is prescribed through
well defined interfaces. The intermediate model generator and
query manager offer up an API for instances of the bottom
source analysis contributor layer. While the model is being
generated, the ST framework relies on the model provider
to optionally identify and mark elements, or constructs in
the model. Specifically, these marked constructs are later
displayed by ST at the UI layer and visually aid a tool user in
querying the source code. The evaluation section will present
examples of these user-selectable abstractions. Both model and
query providers that form part of a source analysis tool are
contributed into ST using Eclipse extension points defined by
ST.
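A minimal sketch of such a contribution, building on the hypothetical interfaces from Section III, might look as follows; the class names are illustrative, and the actual registration would live in the contributing plug-in's plugin.xml against the ST-defined extension points:

    import java.util.List;

    // Hypothetical domain contributor handed to ST via an extension point.
    class SketchDomainContributor {
        IModelProvider modelProvider() {
            return new IModelProvider() {
                public List<String> tokensToExtract() { return List.of("#", "if", "define"); }
                public IModelElement buildModel(List<String> tokenStream) {
                    return null;  // the domain's model-building rules would go here
                }
            };
        }

        IQueryProvider queryProvider() {
            return new IQueryProvider() {
                public List<IQuery> queries() { return List.of(); }
            };
        }
    }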
The goal here is to determine whether this intermediate
model in the form of a syntax tree is adaptable to sources
implemented in various languages and can present meaningful
representations of the source code to be queried and analyzed
through a common UI.
V. CASE STUDIES
ST's viability as a general framework is examined through three case studies. The framework contains common and reusable UI and visualization functionality, as well as automatic integration with the underlying Eclipse platform, including inter-operability with existing Eclipse components like source editors, all of which can be leveraged by a source analysis tool. The first case study focuses on analysis of syntactically similar code patterns used in C pre-processor (CPP) controlled conditional compilation, the second is a simple proof-of-concept Comment analysis tool that finds duplicate comments in source code, and the third is a proposed prototype for assembly code analysis. All three case studies are built upon the ST framework.
A. Case Study: CPP Conditional Compilation Analysis
Our first evaluation of the ST framework focuses on imple-
menting an analysis tool to investigate potential duplicate code
patterns controlled by CPP flags in an open source C-based
system, namely the CVS 1.11.17 source code [6].
The motivation behind implementing such an analysis tool is
driven by the need to study code patterns that may potentially
be found across different build configurations of a C-based
software system relying on CPP for configuration. Eclipse's built-in search functionality may not readily assist in such source analysis, as 1) it does not automatically discover all possible CPP flags that control configuration, and 2) the tool user does not know ahead of time what patterns to search for. Rather, the tool user would like to discover these patterns
through analysis of all conditionally compiled sections of code
using a specific query that compares conditionally compiled
sections against each other. A specialized tool to perform
such an operation is therefore needed, and ST provides the
framework and API to do so.
In order to build an intermediate model that can be used by ST's existing common UI and querying mechanisms, ST's extension points require a CPP analysis tool to contribute two types of providers implementing various APIs: a model provider and a query provider.
In general, an analysis tool using ST must implement model and query providers that are responsible for the following (a simplified sketch follows the list):
1) defining tokens that ST should tokenize, like language
keywords (for instance #, if, define). ST provides a
clear API to define tokens which an analysis tool must
implement.
2) intermediate model generation rules that build an in-
termediate model based on the tokens parsed by ST.
This means that although ST manages the generation
of tokens from an input source, and subsequent man-
agement of the intermediate model when used by other
components of ST like the UI and Query layers, ST
delegates the actual building of the model to the analysis
tool model provider.
3) marking constructs of interest to the analysis tool, which
are displayed by ST in a common ST UI, and serve
as analysis context for performing queries. These con-
structs are marked by the analysis tool through API
provided by the framework. In the case of the CPP
analysis tool, these constructs are CPP flags controlling
conditionally compiled sections of code, and may rep-
resent possible build configurations.
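A minimal sketch of such a provider is shown below, simplified in two ways that the real tool would not be: it scans source lines rather than ST's token stream, and it ignores nesting and #else/#elif branches. All names are illustrative:

    import java.util.*;

    // Hypothetical, simplified CPP model provider: one model element per
    // conditional block, with the controlling flags collected as constructs.
    class CppModelSketch {
        record Block(String condition, List<String> body) {}

        final List<Block> blocks = new ArrayList<>();
        final Set<String> constructs = new TreeSet<>();   // CPP flags for the Constructs View

        void model(List<String> sourceLines) {
            String condition = null;
            List<String> body = new ArrayList<>();
            for (String line : sourceLines) {
                String t = line.trim();
                if (t.startsWith("#if")) {            // also matches #ifdef/#ifndef here
                    condition = t.substring(t.indexOf(' ') + 1);
                    for (String flag : condition.split("\\W+"))
                        if (!flag.isEmpty()) constructs.add(flag);   // mark as construct
                } else if (t.startsWith("#endif")) {  // close the block: emit an element
                    blocks.add(new Block(condition, List.copyOf(body)));
                    condition = null;
                    body.clear();
                } else if (condition != null) {
                    body.add(t);                      // conditionally compiled code
                }
            }
        }
    }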
Shown in Figure 3 is part of an intermediate model built by
the CPP model provider, representing conditionally compiled blocks of code as elements in the CPP-specific intermediate representation. The actual source code, int i = 0;, within the conditionally compiled blocks can be further modeled, but for the purposes of this evaluation it is shown as one child model element. This intermediate representation is formalized and
generic enough that it can be understood by other components
of the ST framework, particularly the UI and query processing,
but the structure is specific to CPP analysis.
The CPP model provider creates two types of model elements, one representing a conditionally controlled block for #if win || linux, and the other representing the actual
conditionally compiled code. Nested conditionally compiled
blocks are also possible (although not shown for brevity) as
child elements. The structure of the model is determined by
the CPP analysis tool, while the management of the model
and the execution of any query falls in the realm of the ST
framework.
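Feeding the sketch above the example of Figure 3 illustrates the resulting structure (hypothetical code, as before):

    CppModelSketch sketch = new CppModelSketch();
    sketch.model(java.util.List.of("#if win || linux", "int i = 0;", "#endif"));
    System.out.println(sketch.blocks);      // one block: condition "win || linux", child "int i = 0;"
    System.out.println(sketch.constructs);  // [linux, win] -- the flags marked as constructs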
Fig. 3. ST CPP syntax tree.
The proposal section mentioned that one of the features
of the ST framework is the ability for an analysis tool to
generically mark constructs in the intermediate representation
of the source code as potential abstractions that are displayed
to a tool user using ST's common, generic UI selection views.
These are then used by the ST framework as user-selectable
context to execute analysis queries. In the case of the CPP
model provider, CPP flags are marked as constructs which
are then displayed by ST as selectable values through a
common, generic ST Constructs View shown in Figure 4.
As the intention of the CPP analysis tool is to analyze con-
ditionally compiled code, this construct marking mechanism
allows an analysis tool to discover querying contexts using the
ST framework. In the example of the CPP analysis tool, the
visualized constructs present an automatically discovered list
of different possible configurations allowing a user to analyze
the source according to a given configuration of CPP flags.
ST delegates semantic knowledge of the marked constructs to the analysis tool, so the framework is not aware that these are CPP flags. Yet when they are displayed to a user in the Constructs View and a query is performed on them, ST's generic views will display the conditionally compiled code controlled by these flags, even though the framework functionality that allows the CPP flags to be marked, managed and displayed to a user is fully decoupled from the analysis tool extending the framework.
The Query Selection button in the view launches a wizard
where various queries for CPP analysis can be selected, all of which were contributed by the CPP analysis tool through a
CPP Query Provider. Shown in Figure 4 is the query result for
a CPP query that discovers potentially equivalent conditionally
compiled code controlled by SIGABRT, SIGHUP, SIGINT,
SIGPIPE, SIGQUIT, and SIGTERM configuration flags. The
results on the bottom right side show the query results in the first column, where the root result item is the result itself: (void)SIG_register(SIGABRT, patch_cleanup); and the child elements show similar patterns across different
locations. The lowest child is the line number locations of
the results within a file. The constructs column shows any constructs that served as context for the query results, which
in this case are the selected CPP flags controlling the inclusion
of the conditionally compiled C code. The last two columns
show query results location information that is generic, for
example number of hits per result, and number of affected
files containing the results.
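A hypothetical sketch of such a duplicate-detection query, operating on the blocks produced by the CppModelSketch above, could compare block bodies pairwise with a string-similarity ratio (a Levenshtein-based comparator is sketched in the Comment case study below); the threshold and all names are illustrative:

    import java.util.ArrayList;
    import java.util.List;

    class CppQuerySketch {
        // Report pairs of conditional blocks whose bodies are nearly identical.
        static List<String> findSimilarBlocks(List<CppModelSketch.Block> blocks) {
            List<String> results = new ArrayList<>();
            for (int i = 0; i < blocks.size(); i++)
                for (int j = i + 1; j < blocks.size(); j++) {
                    String a = String.join("\n", blocks.get(i).body());
                    String b = String.join("\n", blocks.get(j).body());
                    if (StringSimilarity.similarity(a, b) > 0.9)
                        results.add(blocks.get(i).condition() + " ~ " + blocks.get(j).condition());
                }
            return results;
        }
    }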
Each results entry is fully navigable, as seen in Figure 4,
where a user has navigated from the result at line number 8359
in the rsc.c file pertaining to the flag SIGHUP, which ST then
detects and highlights in the editor. All the UI and automatic
integration with Eclipse source editors is fully leveraged by the
CPP analysis tool, which only needed to contribute model and query providers to ST, and did not require any UI or IDE integration implementation.
This evaluation shows that, at least for some CPP source analysis, ST's separation of common tooling functionality and specialized analysis tooling allows the analysis tooling to leverage all the common, generic ST functionality in the middle and upper layers. Specifically, the CPP analysis tool was able to use ST's generic construct abstraction mechanism to mark CPP flags as selectable context for invoking certain CPP queries. In addition, ST's generic Query Results view was able to present the CPP results in a way that made them meaningful for analysis of potentially duplicate conditionally compiled code, without the ST framework being aware of or having any dependencies on the CPP analysis tool. Finally, the CPP analysis tool was able to automatically use integration with the rest of Eclipse, as seen with the navigation from the Query Results View to an Eclipse editor, again without the analysis tool having to implement any of the integration.
B. Case Study: Source Code Comments
As ST is intended to be a general, extensible framework for software analysis tool development, this section further evaluates the reusability of the framework for domains other than CPP, presenting a simpler evaluation geared towards finding syntactically similar comment blocks in a high-level language. This proof-
of-concept study will generally describe the steps needed to
implement an analysis tool that identifies similar comments
within a code base leveraging the capabilities of the ST
framework.
There are three types of information that the Comment
analyzing domain would have to provide to ST:
1) A list of tokens to extract from the source.
2) Model rules for building the intermediate representation
of the comment blocks within a code base.
3) A query that performs analysis on the intermediate model elements representing comment blocks.
The C language defines comments as blocks of text in the
source preceded by either //, or starting with a /* and ending
with a */, where the content within the block starts with a
*. Therefore the Comment model provider forming part of
the Comment analysis tool contributed into the ST framework
would define the following four tokens: //, /*, * and */.
The second step is to define a model rule that builds the
intermediate model. It is the job of the Comment model
provider to build the model based on the stream of tokens
processed by the underlying ST framework. As the pattern
for comments is fairly straight forward, the provider would
detect within the token stream provided by ST the first two
tokens: the comment symbol and the comment line following.
It therefore builds a model element representing a comment
block by detecting the starting and ending comment token for
/* and */ case, or in the case of //, by detecting a token that
is not preceded by a // token.
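A minimal sketch of this rule, assuming a regular-expression scan over raw source rather than ST's token stream (names illustrative), is:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class CommentRuleSketch {
        // Emit one "model element" (here, just the text) per comment block;
        // nesting and string-literal handling are deliberately omitted.
        static List<String> extractComments(String source) {
            List<String> comments = new ArrayList<>();
            Matcher m = Pattern.compile("/\\*.*?\\*/|//[^\\n]*", Pattern.DOTALL)
                               .matcher(source);
            while (m.find()) comments.add(m.group());
            return comments;
        }
    }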
The provider also has the option of marking constructs to
display in the Constructs View that can facilitate querying and
analysis of comments in the source. In this case, the Comment
provider creates construct objects representing files containing
comment blocks, and marks them as selectable values using ST API; ST then displays these in the Constructs View.
The Comment analysis tool would also contribute a query
into ST that would identify a comment model element as
ST visits the intermediate model during a query operation.
If the comment is identified, the Comment analysis tool query can use one of ST's many general, built-in analyzers available through a factory to detect syntactically similar code blocks. Both the CPP and Comment query providers use an ST Levenshtein comparator that is applicable to both cases, as the comparator only requires a string representation of the model element values (either the raw conditionally compiled code or the comment body text).
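As a sketch of the kind of comparator this implies (ST's own comparator API is not shown here), a standard Levenshtein distance and a similarity ratio derived from it:

    class StringSimilarity {
        // Classic two-row Levenshtein edit distance.
        static int levenshtein(String a, String b) {
            int[] prev = new int[b.length() + 1];
            int[] curr = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) prev[j] = j;
            for (int i = 1; i <= a.length(); i++) {
                curr[0] = i;
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                       prev[j - 1] + cost);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
            }
            return prev[b.length()];
        }

        // Similarity in [0,1]: identical strings score 1.0.
        static double similarity(String a, String b) {
            int max = Math.max(a.length(), b.length());
            return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
        }
    }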
As an extension to this implementation of a comment-specific analysis tool contributed into the ST framework, comments for other programming languages can be taken into account.
For example, to include Javadocs [22] the Comment model
provider would have to define a fifth comment token type /**.
The language rule would remain the same, with the exception
of adding a model element indicating a Javadoc block.
Additionally, the identification of constructs used by the Java Comment query provider can be extended: with two types of comment blocks, Javadocs and inline comments, the enhanced Comment analysis tool can mark the block type as a construct itself. This introduces a level of abstraction that
can be leveraged at the ST UI layer allowing a user to narrow
a search within one of the two types: NON_JAVA_DOC and
JAVA_DOC.
Fig. 4. ST views with navigation and query capabilities demonstrated.
C. Future Work: Assembly Code
Tracks, a proof-of-concept tool [4], connects to a debugger, allowing for dynamic analysis and navigation in an Eclipse
plug-in. As shown in Figure 5, the protocol between Tracks
and the debugger, currently provided by IDA Pro, establishes
communication for the analysis which includes system calls.
This essentially extends the static capabilities of existing
support in IDA Pro to incorporate dynamic analysis, external
modules, and even multi-threaded code bases. This protocol
and API could also be used interchangeably to provide the
same navigation support for other debuggers, for example,
those associated with HLASM code bases.
As a third case study, ST can be used to implement an
analysis tool for low level languages. As with the CPP and
Comment case studies mentioned above, an Assembly analysis
tool built on top of ST would have to contribute 1) a set of tokens to parse (for example, instructions and registers), 2) modeling rules to build an intermediate representation of the assembly source, 3) marked facts about the parsed assembly source, serving as constructs that can be used as query contexts, and 4) queries that can perform analysis on the intermediate model.
The Assembly analysis tool would have to contribute two types of providers: model and query providers. These implement ST-defined API which allows the analysis tool to interact with the underlying common framework, much like in the case of the CPP and Comment use cases.
Fig. 5. Tracks' debugger interaction protocol.
Although the set of tokens may be straightforward to
implement for any assembly language, an area of current
research is to determine what modeling structure can be built
to represent assembly code, or specific portions of assembly
code that require analysis. One research area is to determine
if assembly-specific models contain enough common structure
such that a general Assembly analysis tool built on top of ST
can be implemented, and further extended to accommodate
specific variations. If so, it would also increase the possibility
that the general, ST-based Assembly analysis tool would better
adapt to changes in the low level languages, which was cited
earlier as one of the weak points in current low level analysis
tooling.
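Whatever structure that research settles on, a plausible starting point, sketched here with illustrative names, is to split each listing line into address, mnemonic, and operands as the raw material for the model:

    // Hypothetical first step of an Assembly model provider.
    record AsmInstruction(String address, String mnemonic, String operands) {}

    class AsmSketch {
        static AsmInstruction parseLine(String line) {
            String[] parts = line.trim().split("\\s+", 3);
            return new AsmInstruction(parts[0],
                                      parts.length > 1 ? parts[1] : "",
                                      parts.length > 2 ? parts[2] : "");
        }
    }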
The proof-of-concept case studies for CPP and comment
analysis tools built using the ST framework indicate that our
framework is a viable starting point to further study and
implement general Assembly analysis tools that are extensible
and adaptable to future changes.
VI. CONCLUSION
Section III proposes ST as a framework that is extensible
in terms of accommodating new ISAs, platforms and tools. A
prototype of this framework illustrates the layered architecture
intended to provide this extensibility and inter-operability with
existing tools. Two case studies and a discussion of how the ST approach could apply to assembly code serve to evaluate this proposed approach.
While this proposed framework focuses on static analysis techniques, we intend to investigate the possibility of leveraging virtualization technology in order to support dynamic
analysis. We are particularly interested in understanding how
ST should interact with malware, especially when malware
tries to detect our system, and how to accurately monitor the
virtual environment without introducing inaccuracies into the
results. We would like to further investigate the application of an intermediate representation as posed in this work, but will also investigate how an intermediate representation could leverage the structure of an intermediate language to
provide a clean method of performing dynamic analysis in a
virtual environment.
REFERENCES
[1] BinNavi. http://www.zynamics.com/binnavi.html, 2011.
[2] Eclipse platform technical overview. http://www.eclipse.org/articles/Whitepaper-Platform-3.1/eclipse-platform-whitepaper.html, January 2011.
[3] G. J. Badros and D. Notkin. A framework for preprocessor-aware C source code analyses. Softw. Pract. Exper., 30:907–924, July 2000.
[4] J. Baldwin, P. Sinha, M. Salois, and Y. Coady. Progressive user interfaces for regressive analysis: Making tracks with large, low-level systems. AUIC 2011, pages 1–10, Nov 2010.
[5] D. Brumley and I. Jager. The BAP Handbook. 2009.
[6] CVS. Concurrent versions system. http://www.nongnu.org/cvs/, April 2011.
[7] I. Dillig, T. Dillig, and A. Aiken. SAIL: Static Analysis Intermediate Language with a Two-Level Representation. Technical report, 2009.
[8] E. Dorion, D. Grenier, J. Brodeur, and D. O'Brien. A Taxonomy of Heterogeneity Types for Ontological Alignment. In preparation, Dec. 2010.
[9] T. Dullien and S. Porst. REIL: A platform-independent intermediate representation of disassembled code for static code analysis. In CanSecWest Applied Security Conference, 2009.
[10] T. Eisenbarth, R. Koschke, and D. Simon. Derivation of feature component maps by means of concept analysis. In Proceedings of the Fifth European Conference on Software Maintenance and Reengineering, pages 176–179. Society Press, 2001.
[11] J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, Heidelberg (DE), 2007.
[12] I. Guilfanov. Decompilers and beyond. BlackHat USA 2008, pages 1–12, Jul 2008.
[13] H. Guo, J. Pang, Y. Zhang, F. Yue, and R. Zhao. HERO: A novel malware detection framework based on binary translation. Intelligent Computing and Intelligent Systems (ICIS), 2010 IEEE International Conference on, 1:411–415, 2010.
[14] IBM. High Level Assembler (HLASM) and Toolkit Feature. http://www-01.ibm.com/software/awdtools/hlasm/library.html, 2011.
[15] L. Martignoni, R. Paleari, G. Fresi Roglia, and D. Bruschi. Testing CPU emulators. In Proceedings of the 2009 International Conference on Software Testing and Analysis (ISSTA), pages 261–272, Chicago, Illinois, USA. ACM.
[16] D. Melski, T. Teitelbaum, and T. Reps. Static analysis of software executables. Conference For Homeland Security, 2009. CATCH '09. Cybersecurity Applications & Technology, pages 97–102, 2009.
[17] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI 2007), pages 89–100, San Diego, California, USA, June 2007.
[18] D. Pucsek, J. Wall, C. Gibbs, J. Baldwin, M. Salois, and Y. Coady. ICE: Circumventing meltdown with an advanced binary analysis framework. In Proceedings of the First International Workshop on Developing Tools as Plug-ins (TOPI), colocated with the International Conference on Software Engineering (ICSE), 2011.
[19] J. Rodriguez, C. Cesario, K. Galvan, B. Gonzalez, G. Kroner, G. Rutigliano, and R. Wilson. IBM Rational Application Developer V6 portlet application development and portal tools. IBM Corp., Riverton, NJ, USA, 2005.
[20] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena. BitBlaze: A new approach to computer security via binary analysis. Information Systems Security, pages 1–25, 2008.
[21] D. Spinellis. Global analysis and transformations in preprocessed languages. IEEE Transactions on Software Engineering, 29:1019–1030, 2003.
[22] Sun Microsystems. Javadoc Tool Home Page. http://www.oracle.com/technetwork/java/javase/documentation/index-jsp-135444.html, April 2011.
[23] A. Telea, O. Ersoy, and L. Voinea. Visual analytics in software maintenance: Challenges and opportunities. In Proceedings of the 5th International Symposium on Software Visualization, New York, NY, USA, 2010. ACM.
[24] Q. Tu and M. W. Godfrey. An integrated approach for studying architectural evolution. In 10th International Workshop on Program Comprehension (IWPC 2002), pages 127–136. IEEE Computer Society Press, 2002.
PRESENTATION SLIDES
Spinal Tap: High Level Analysis for Heavy Metal Systems. 2011 IEEE Pacific Rim Conference on Computers, Communications, and Signal Processing. Nieraj Singh, Dean Pucsek, Jonah Wall, Celina Gibbs, Martin Salois, Yvonne Coady.

Slide 2, Problem:
- Difficult to understand large assembly code bases
- Lack of tools to analyze assembly code
- Source not always available

Slide 3, Example: Text. A disassembled listing of _main:

    _main:
    100000ec4  pushq  %rbp
    100000ec5  movq   %rsp, %rbp
    100000ec8  subq   $0x10, %rsp
    100000ecc  callq  0x100000f18   ; getchar
    100000ed1  movl   %eax, 0xfc(%rbp)
    100000ed4  cmpl   $0x61, 0xfc(%rbp)
    100000ed8  je     0x100000ee2
    100000eda  cmpl   $0x62, 0xfc(%rbp)
    100000ede  je     0x100000ef0
    100000ee0  jmp    0x100000efe
    100000ee2  leaq   0x0000003b(%rip), %rdi
    100000ee9  callq  0x100000f1e   ; puts
    100000eee  jmp    0x100000f0a
    100000ef0  leaq   0x0000002f(%rip), %rdi
    100000ef7  callq  0x100000f1e   ; puts
    100000efc  jmp    0x100000f0a
    100000efe  leaq   0x00000023(%rip), %rdi
    100000f05  callq  0x100000f1e   ; puts
    100000f0a  movl   $0x00000000, %eax
    100000f0f  leave
    100000f10  ret

shown alongside the C source it was compiled from:

    int main() {
        switch (getchar()) {
            case 'a': printf("a\n"); break;
            case 'b': printf("b\n"); break;
            default:  printf("def\n"); break;
        }
        return 0;
    }

Slide 4, Example: CFG. The same listing rendered as a control-flow graph: the _main entry block falls through to the compare blocks at 100000eda and 100000ee0, which branch to the three puts blocks at 100000ee2, 100000ef0, and 100000efe, all of which rejoin at the exit block at 100000f0a (movl, leave, ret).

Slide 5, Goals:
- Enable static and dynamic analyses
- OS independence
- Visual representation of assembly
- Extensibility: instruction set architectures (ISAs), analysis techniques, object file formats

Slide 6, IDA Pro.

Slide 7, How Does IDA Pro Do? A checklist over the goals (static analysis, dynamic analysis, OS independence, visual representation, adding additional ISAs, creating custom analyses, inserting file loaders), with checkmarks marking those IDA Pro covers.

Slide 8, Tool Foundation. Three possible approaches:
- Pure framework
- Pure intermediate language
- Hybrid

Slides 9-11, diagrams of the three approaches: Framework (Analysis over Binary), Intermediate Language (Analysis over an Intermediate Language over the ISA), and Hybrid (combining both).

Slides 12-13, ICE: a layered design with ICE UI, ICE Bridge, and ICE Core; ISA support for x86, ARM, and PowerPC; and analyses such as Binary Comparison, Code Coverage, and Dynamic Trace. The ICE Bridge comprises a Binary API, the Intermediate Language, and an Analysis API.

Slide 14, Summary: the same checklist as Slide 7.

Slide 15, Future Work:
- Binary API
- Environment and functionality of dynamic analysis
- Visualizations of assembly code
- Further implementation of the prototype in conjunction with stakeholders

Slide 16, Questions? Dean Pucsek, Email: [email protected]