
bitblaze.cs.berkeley.edu/papers/malware-cav11.pdf

Malware Analysis with Tree Automata Inference ⋆

Domagoj Babić, Daniel Reynaud, and Dawn Song

University of California, Berkeley
{babic, reynaud, dawnsong}@cs.berkeley.edu

Abstract. The underground malware-based economy is flourishing and it is evident that the classical ad-hoc signature detection methods are becoming insufficient. Malware authors seem to share some source code and malware samples often feature similar behaviors, but such commonalities are difficult to detect with signature-based methods because of an increasing use of numerous freely-available randomized obfuscation tools. To address this problem, the security community is actively researching behavioral detection methods that commonly attempt to understand and differentiate how malware behaves, as opposed to just detecting syntactic patterns. We continue that line of research in this paper and explore how formal methods and tools of the verification trade could be used for malware detection and analysis. We propose a new approach to learning and generalizing from observed malware behaviors based on tree automata inference. In particular, we develop an algorithm for inferring k-testable tree automata from system call dataflow dependency graphs and discuss the use of inferred automata in malware recognition and classification.

1 Introduction

Over the last several decades, the IT industry advanced almost every aspect of our lives (including health care, banking, traveling, . . . ) and industrial manufacturing. The tools and techniques developed in the computer-aided verification community played an important role in that advance, changing the way we design systems and improving the reliability of industrial hardware, software, and protocols.

In parallel, another community made a lot of progress exploiting software flaws for various nefarious purposes, especially for illegal financial gain. Their inventions are often ingenious botnets, worms, and viruses, commonly known as malware. Malware source code is rarely available and malware is regularly designed so as to thwart static analysis through the use of obfuscation, packing, and encryption [36].

⋆ This material is based upon work partially supported by the National Science Foundation under Grants No. 0832943, 0842694, 0842695, 0831501, 0424422, by the Air Force Research Laboratory under Grant No. P010071555, by the Office of Naval Research under MURI Grant No. N000140911081, and by the MURI program under AFOSR Grants No. FA9550-08-1-0352 and FA9550-09-1-0539. The work of the first author is also supported by the Natural Sciences and Engineering Research Council of Canada PDF fellowship.

0 This is an extended full version of the CAV 2011 paper.


[Plot omitted: yearly signature counts on a scale from 0 to 3,000,000, over the years 2002–2009.]

Fig. 1: New Malicious Signatures added to the Symantec Antivirus Tool per Year [33].

For the above mentioned reasons, detection, analysis, and classification of malware are difficult to formalize, explaining why the verification community has mostly avoided, with some notable exceptions (e.g., [8, 18]), the problem. However, the area is in dire need of new approaches based on strong formal underpinnings, as less principled techniques, like signature-based detection, are becoming insufficient. Recently, we have been experiencing a flood of malware [33], while the recent example of Stuxnet (e.g., [28]) shows that industrial systems are as vulnerable as our every-day computers.

In this paper, we show how formal methods, more precisely tree automata inference, can be used for capturing the essence of malicious behaviors, and how such automata can be used to detect behaviors similar to those observed during the training phase. First, we execute malware in a controlled environment to extract dataflow dependencies among executed system calls (syscalls) using dynamic taint analysis [5, 30]. The main way for programs to interact with their environment is through syscalls, which are broadly used in the security community as a high-level abstraction of software behavior [13, 24, 34]. The dataflow dependencies among syscalls can be represented by an acyclic graph, in which nodes represent executed syscalls, and there is an edge between two nodes, say s1 and s2, when the result computed by s1 (or a value derived from it) is used as a parameter of s2. Second, we use tree automata inference to learn an automaton recognizing a set of graphs. The entire process is completely automated.
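The dependency-graph representation can be sketched in a few lines. The following is our illustration, not the paper's tool; the syscall names and helper names are invented for the example: each node is one executed syscall, and an edge records that a (taint-derived) value produced by one call reached a parameter of another.

```python
# Illustrative sketch (not the paper's implementation): a syscall dataflow
# dependency graph. An edge from s2 back to s1 records that a value
# produced by s1, or derived from it, was used as a parameter of s2.

class SyscallNode:
    def __init__(self, name):
        self.name = name       # e.g. "NtCreateFile" (hypothetical trace)
        self.deps = []         # producers whose results this call consumed

def add_dependency(consumer, producer):
    """Record the dataflow edge producer -> consumer."""
    consumer.deps.append(producer)

# Toy trace: NtWriteFile uses the handle returned by NtCreateFile.
create = SyscallNode("NtCreateFile")
write = SyscallNode("NtWriteFile")
add_dependency(write, create)
```

Since different syscalls can consume the same producer's result, the structure is a DAG rather than a tree, which is exactly why the inference algorithm later avoids expanding it into trees.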

The inferred automaton captures the essence of different malicious behaviors. We show that we can adjust the level of generalization with a single tunable factor and how the inferred automaton can be used to detect likely malicious behaviors, as well as for malware classification. We summarize the contributions of our paper as follows:

– Expansion of dependency graphs into trees causes exponential blowup in the size of the graph, similarly as eager inlining of functions during static analysis. We found that a class of tree languages, namely k-testable tree languages [37], can be inferred directly from dependency graphs, avoiding the expansion to trees.


– We improve upon the prior work on inference of k-testable tree languages by providing an O(kN) algorithm, where k is the size of the pattern and N is the size of the graph used for inference.

– We show how inferred automata can be used for detecting likely malicious behaviors and for malware classification. To our knowledge, this is the first work applying the theory of tree automata inference to malware analysis. We provide experimental evidence that our approach is both feasible and useful in practice.

– While previous work (e.g., [7, 13]) often approximated dependencies by syntactic matching of syscall parameters, we implemented a tool for tracking dependencies via taint analysis [5, 30] and we made the generated dependency graphs, as well as the tree automata inference engine, publicly available to encourage further research.

2 Related Work

2.1 Tree Automata Inference

Gold [17] showed that no super-finite class of languages (one that contains all finite languages and at least one infinite language) is identifiable in the limit from positive examples1 only. For instance, regular and regular tree languages [9] are super-finite. We have two options to circumvent this negative result; either use both positive and negative examples, or focus on less expressive languages that are identifiable in the limit from positive examples only. Inference of minimal finite state automata from both positive and negative examples is known to be NP-complete [17], because minimization of incomplete automata is NP-complete [31]. The security community is discovering millions of new malware samples each year and inferring a single minimal classifier for all the samples might be infeasible. Inferring a non-minimal classifier is feasible, but the classifier could be too large to be useful in practice. Thus, we focus on a set of languages identifiable in the limit from positive examples in this paper.

A subclass of regular tree languages — k-testable tree languages [37] — is identifiable in the limit from positive examples only. These languages are defined in terms of a finite set of k-level-deep tree patterns. The k factor effectively determines the level of abstraction, which can be used as a knob to regulate the ratio of false positives (goodware detected as malware) and false negatives (undetected malware). The patterns partition dependency graphs into a finite number of equivalence classes, inducing a state-minimal automaton. The automata inferred from positive (malware) examples could be further refined using negative (goodware) examples. Such a refinement is conceptually simple, and does not increase the inference complexity, because of the properties of k-testable tree languages. We leave such a refinement for future work.

A number of papers focused on k-testable tree automata inference. Garcia and Vidal [15] proposed an O(kPN) inference algorithm, where k is the size of the pattern, P the total number of possible patterns, and N the size of the input used for inference. Many patterns might not be present among the training samples, so rather than enumerating all patterns, [14] and [23] propose very similar algorithms that use only the patterns present

1 Positive examples are examples belonging to the language to be inferred, while negative examples are those not in the language.


in the training set. Their algorithms are somewhat complex to implement as they require computation of three different sets (called roots, forks, and leaves). Their algorithms are O(MkN log(N)), where M is the maximal arity of any alphabet symbol in the tree language. We derive a simpler algorithm, so that computing forks and leaves becomes unnecessary. The complexity of our algorithm is O(kN), thanks to an indexing trick that after performing k iterations over the training sample builds an index for finding patterns in the training set. Patterns in the test set can be located in the index table in amortized time linear in the size of the pattern. In our application — malware analysis — the k factor tends to be small (≤ 5), so our algorithm can be considered linear-time.

2.2 Malware Analysis

From the security perspective, several types of malware analysis are interesting: malware detection (i.e., distinguishing malware from goodware), classification (i.e., determining the family of malware to which a particular sample belongs), and phylogeny (i.e., forensic analysis of evolution of malware and common/distinctive features among samples). All three types of analyses are needed in practice: detection for preventing further infections and damage to the infected computers, and the other two analyses are crucial in development of new forms of protection, forensics, and attribution. In this paper, we focus on detection and classification.

The origins of the idea to use syscalls to analyze software can be traced to Forrest et al. [12], who used fixed-length sequences of syscalls for intrusion detection. Christodorescu et al. [7] note that malware authors could easily reorder data-flow-independent syscalls, circumventing sequence-detection schemes, but if we analyze data-flow dependencies among syscalls and use such dependency graphs for detection, circumvention becomes harder. Data-flow-dependent syscalls cannot be (easily) reordered without changing the semantics of the program. They compute a difference between malware and goodware dependency graphs, and show how resulting graphs can be used to detect malicious behaviors. Such graph matching can detect only the exact behavioral patterns already seen in some sample, but does not automatically generalize from training samples, i.e., does not attempt to overapproximate the training set in order to detect similar, but not exactly the same behaviors.

Fredrikson et al. [13] propose an approach that focuses on distinguishing features, rather than similarities among dependency graphs. First, they compute dependency graphs at runtime, declaring two syscalls, say s1 and s2, dependent, if the type and value of the value returned by s1 are equal to the type and value of some parameter of s2 and s2 was executed after s1. They extract significant behaviors from such graphs using structural leap mining, and then choose behaviors that can be combined together using concept analysis. In spite of a very coarse unsound approximation of the dependency graph and lack of automatic generalization, they report 86% detection rate on around 500 malware samples used in their experiments. We see their approach as complementary to ours: the tree automata we infer from real dependency graphs obtained through taint analysis could be combined with leap mining and concept analysis, to improve their classification power.

Bonfante et al. [3] propose to unroll control-flow graphs obtained through dynamic analysis of binaries into trees. The obtained trees are more fine-grained than the syscall dependency graphs. The finer level of granularity could, in practice, be less susceptible to mimicry attacks (e.g., [35]), but is also easier to defeat through control-flow graph manipulations. The computed trees are then declared to be tree automata and the recognizer is built by a union of such trees. Unlike inference, the union does not generalize from the training samples. The reported experiments include a large set of malware samples (over 10,000), but the entire set was used for training, and the authors report only false positives on a set of goodware (2653 samples). Thus, it is difficult to estimate how well their approach would work for malware detection and classification.

2.3 Taint Analysis

Dynamic taint analysis (DTA) [30] is a technique used to follow data flows in programs or whole systems at runtime. DTA can be seen as a single-path symbolic execution [22] over a very simple domain (set of taints). Its premises are simple: taint is a variable annotation introduced through taint sources, it is propagated through program execution according to some propagation rules until it reaches a taint sink. In our case, for instance, taint sources are the syscalls' output parameters, and taint sinks are the input parameters.
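A toy model conveys the source/propagation/sink flavor of the idea. This is our illustration at the level of named variables with invented taint marks; the paper's tool operates on binaries at the instruction level with Pin, not on source variables.

```python
# Toy flavor of dynamic taint analysis (illustrative only): taint enters
# at sources, flows through operations, and is observed at sinks.

taint = {}                                   # variable name -> set of taint marks

def source(var, mark):
    taint[var] = {mark}                      # e.g. a syscall output parameter

def assign(dst, *srcs):
    # Propagation rule: the result carries the union of the operand taints.
    taint[dst] = set().union(*(taint.get(s, set()) for s in srcs))

def sink(var):
    return taint.get(var, set())             # e.g. a syscall input parameter

source("out_handle", "syscall_1")            # invented mark for syscall 1's output
assign("tmp", "out_handle", "const")         # tmp is derived from tainted data
assign("arg", "tmp")                         # arg later reaches a syscall input
```

Here sink("arg") yields the mark of syscall 1, i.e., a dataflow edge from syscall 1 to the syscall consuming arg.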

As will be discussed in detail later, our implementation is based on the binary rewriting framework Pin [26] and uses the taint propagation rules from Newsome and Song [30]. Since DTA must operate at the instruction-level granularity, it poses a significant runtime overhead. Our DTA implementation executes applications several thousand times slower than the native execution. Our position is that the speed of the taint analysis is less important than the speed of inference and recognition. The taint analysis can be run independently for each sample in parallel, the dependency graph extraction is linear with the length of each execution trace, and hardware-based information flow tracking has been proposed (e.g., [32, 11]) as a potential solution for improving performance. In contrast, inference techniques have to process all the samples in order to construct a single (or a small number of) recognizer(s). An average anti-virus vendor receives millions of new samples annually and the number of captured samples has been steadily growing over the recent years. Thus, we believe that scalability of inference is a more critical issue than the performance of the taint analysis.

3 Notation and Terminology

In this section, we introduce the notation and terminology used throughout the paper. First, we build up the basic formal machinery that allows us to define tree automata. Second, we introduce some notions that will help us define k-roots, which can be intuitively seen as the top k levels of a tree. Later, we will show how k-roots induce an equivalence relation used in our inference algorithm. Towards the end of this section, we introduce k-testable languages, less expressive than regular tree languages, but suitable for designing fast inference algorithms.

Let N be the set of natural numbers and N∗ the free monoid generated by N with concatenation (·) as the operation and the empty string ε as the identity. The prefix order ≤ is defined as: u ≤ v for u, v ∈ N∗ iff there exists w ∈ N∗ such that v = u · w. For u ∈ N∗, n ∈ N, the length |u| is defined inductively: |ε| = 0, |u · n| = |u| + 1. We say that a set S is prefix-closed if u ≤ v ∧ v ∈ S ⇒ u ∈ S. A tree domain is a finite non-empty prefix-closed set D ⊂ N∗ satisfying the following property: if u · n ∈ D then ∀1 ≤ j ≤ n . u · j ∈ D.

A ranked alphabet is a finite set F associated with a finite ranking relation arity ⊆ F × N. Define Fn as the set { f ∈ F | (f, n) ∈ arity}. The set T(F) of terms over the ranked alphabet F is the smallest set defined by:

1. F0 ⊆ T(F)
2. if n ≥ 1, f ∈ Fn, t1, . . . , tn ∈ T(F) then f(t1, . . . , tn) ∈ T(F)

Each term can be represented as a finite ordered tree t : D → F, which is a mapping from a tree domain into the ranked alphabet such that ∀u ∈ D:

1. if t(u) ∈ Fn, n ≥ 1 then { j | u · j ∈ D} = {1, . . . , n}
2. if t(u) ∈ F0 then { j | u · j ∈ D} = ∅

[Tree drawing omitted.]

Fig. 2: An Example of a Tree t and its Tree Domain. dom(t) = {1, 11, 111, 112, 12, 13, 131}, F = { f, g, h, a, b}, ‖t‖ = 3, t(1) = f, t/131 = b.

As usual in the tree automata literature (e.g., [9]), we use the letter t (possibly with various indices) both to represent a tree as a mathematical object and to name a relation that maps an element of a tree domain to the corresponding alphabet symbol. An example of a tree with its tree domain is given in Figure 2.

The set of all positions in a particular tree t, i.e., its domain, will be denoted dom(t). A subtree of t rooted at position u, denoted t/u, is defined as (t/u)(v) = t(u · v) and dom(t/u) = {v | u · v ∈ dom(t)}. We generalize the dom operator to sets as: dom(S) = {dom(u) | u ∈ S}. The height of a tree t, denoted ‖t‖, is defined as:

‖t‖ = max({|u| such that u ∈ dom(t)})
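These definitions are directly executable. The sketch below is ours: it encodes the tree of Figure 2 as a mapping from positions (strings of child indices, root at "1") to symbols, assuming single-digit indices, which suffices for this example.

```python
# The tree of Fig. 2 as a mapping from its tree domain to alphabet symbols.
# Positions are strings of child indices with root "1" (single-digit
# indices assumed).
t = {"1": "f", "11": "g", "111": "a", "112": "b",
     "12": "a", "13": "h", "131": "b"}

def height(tree):
    # ||t|| = max |u| over u in dom(t); |u| is the number of indices in u
    return max(len(u) for u in tree)

def subtree(tree, u):
    # (t/u)(v) = t(u.v): keep positions below u, re-rooted at "1"
    return {"1" + v[len(u):]: s for v, s in tree.items() if v.startswith(u)}
```

On this encoding, height(t) is 3 and subtree(t, "131") is the single-node tree b, matching the caption of Figure 2.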

Let Ξ = {ξf | f ∈ ⋃i>0 Fi} be a set of new nullary symbols such that Ξ ∩ F = ∅. The Ξ set will be used as a set of placeholders, such that ξf can be substituted only with a tree t whose position one (i.e., the head) is labelled with f, i.e., t(1) = f. Let T(Ξ ∪ F) denote the set of trees over the ranked alphabet and placeholders. For t, t′ ∈ T(Ξ ∪ F), we define the link operation t♯t′ by:

(t♯t′)(n) = t(n)  if t(n) ∉ Ξ ∨ (t(n) = ξf ∧ f ≠ t′(1))
(t♯t′)(n) = t′(z) if n = y · z, t(y) = ξt′(1), y ∈ dom(t), z ∈ dom(t′)

For any two trees, t, t′ ∈ T(F), the tree quotient t^−1 t′ is defined by:

t^−1 t′ = {t′′ ∈ T(Ξ ∪ F) | t′ = t′′♯t}

The tree quotient operation can be extended to sets, as usual: t^−1 S = {t^−1 t′ | t′ ∈ S}. For any k ≥ 0, define the k-root of a tree t as:

rootk(t) = t  if t(1) ∈ F0
rootk(t) = ξf  if f = t(1), f ∈ ⋃i>0 Fi, k = 0
rootk(t) = f(rootk−1(t1), . . . , rootk−1(tn))  if t = f(t1, . . . , tn), ‖t‖ > k > 0
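The k-root is easy to render executable. The sketch below is ours, reusing the position-map encoding (root at "1", single-digit child indices) and spelling the placeholder ξf as the string "xi_" + f: leaves are kept, an internal node reached with k = 0 becomes a placeholder, and otherwise the recursion descends with k − 1.

```python
def root_k(tree, k):
    """k-root of a tree given as {position-string: symbol}, root at "1".
    Leaves are kept; an internal node at cut depth becomes placeholder xi_f."""
    label = tree["1"]
    children = sorted(v for v in tree if len(v) == 2)  # direct children "1j"
    if not children:                     # t(1) in F0: a leaf, keep it
        return {"1": label}
    if k == 0:                           # cut here: placeholder for head f
        return {"1": "xi_" + label}
    out = {"1": label}
    for c in children:
        sub = {"1" + v[len(c):]: s for v, s in tree.items() if v.startswith(c)}
        r = root_k(sub, k - 1)
        for v, s in r.items():
            out[c + v[1:]] = s           # graft child's (k-1)-root back in
    return out

# The tree of Fig. 2.
t = {"1": "f", "11": "g", "111": "a", "112": "b",
     "12": "a", "13": "h", "131": "b"}
```

For the tree of Figure 2, root_k(t, 1) keeps the head f and the leaf a, but cuts the internal children g and h down to placeholders, while root_k(t, 3) returns the whole tree.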

A finite deterministic bottom-up tree automaton (FDTA) is defined as a tuple (Q, F, δ, F), where Q is a finite set of states, F is a ranked alphabet, F ⊆ Q is the set of final states, and δ = ⋃i δi is a set of transition relations defined as follows: δ0 : F0 → Q and for n > 0, δn : (Fn × Q^n) → Q.

The k-testable in the strict sense (k-TSS) languages [23] are intuitively defined by a set of tree patterns allowed to appear as the elements of the language. The following theorem is due to Lopez et al. [25]:

Theorem 1. Let L ⊆ T(F). L is a k-TSS iff for any trees t1, t2 ∈ T(F) such that rootk(t1) = rootk(t2), when t1^−1 L ≠ ∅ ∧ t2^−1 L ≠ ∅ then it follows that t1^−1 L = t2^−1 L.

We choose Lopez et al.'s theorem as a definition of k-TSS languages. Other definitions in the literature [14, 23] define k-TSS languages in terms of three sets: leaves, roots, and forks. Forks are roots that have at least one placeholder as a leaf. Theorem 1 shows that such more complex definitions are unnecessary. Intuitively, the theorem says that within the language, any two subtrees that agree on the top k levels are interchangeable, meaning that a bottom-up tree automaton has to remember only a finite amount of history. In the next section, we show that we can define an equivalence relation inducing an automaton accepting a k-TSS language using only our definition of the k-root, as expected from Theorem 1.

4 k-Testable Tree Automata Inference

4.1 Congruence Relation

We begin with our definition of the equivalence relation that is used to induce a state-minimal automaton from a set of trees. The equivalence relation, intuitively, compares trees up to k levels deep, i.e., compares k-roots.

Definition 1 (Root Equivalence Relation ∼k). For some k ≥ 0, two trees t1, t2 ∈ T(F) are root-equivalent with degree k, denoted t1 ∼k t2, if rootk(t1) = rootk(t2).


Lemma 1. The ∼k relation is a congruence (monotonic equivalence) relation of finite index.

Proof (Sketch). It is obvious that ∼k is an equivalence relation (reflexive, symmetric, and transitive), and here we show that it is also monotonic, and therefore a congruence. Suppose t1 = f(t11, . . . , t1n) and t2 = f(t21, . . . , t2n), such that rootk(t1/i) = rootk(t2/i) for all 1 ≤ i ≤ n. First, note that if k > 0 and rootk(t) = rootk(t′), then rootk−1(t) = rootk−1(t′). According to the definition of rootk, for k > 0 we obtain:

rootk(t1) = f(rootk−1(t11), . . . , rootk−1(t1n))  [by definition of rootk]
          = f(rootk−1(t21), . . . , rootk−1(t2n))  [by inductive hypothesis]
          = rootk(t2)                              [by definition of rootk]

The k = 0 case is trivial, as root0(t1) = ξf = root0(t2). The size of a k-root is bounded by M^k, where M = max({n | Fn ≠ ∅}). Each position u in the k-root's domain can be labelled with at most |Farity(t(u))| symbols. Thus, rootk generates a finite number of equivalence classes, i.e., is of finite index.

As a consequence of Lemma 1, inference algorithms based on the root equivalence relation need not propagate congruences using union-find [10] algorithms, as the root equivalence relation is a congruence itself.

Definition 2 (∼k-induced Automaton). Let T′ ⊆ T(F) be a finite set of finite trees. The A∼k(T′) = (Q, F, δ, F) automaton induced by the root equivalence relation ∼k is defined as:

Q = {rootk(t′) | ∃t ∈ T′ . ∃u ∈ dom(t) . t′ = t/u}
F = {rootk(t) | t ∈ T′}
δ0(f) = f  for f ∈ F0
δn(f, rootk(t1), . . . , rootk(tn)) = rootk(f(t1, . . . , tn))  for n ≥ 1, f ∈ Fn

Corollary 1 (Containment). From the definition it follows that ∀k ≥ 0 . T′ ⊆ L(A∼k(T′)). In other words, the ∼k-induced automaton abstracts the set of trees T′.
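A tiny executable instance of ours makes the containment concrete. Trees are nested tuples (symbol, child, . . .), the placeholder ξf is spelled "xi_f", and δ is written out by hand for k = 1 and the single training tree f(a, b), exactly as Definition 2 prescribes.

```python
# Toy instance of Definition 2 with k = 1 and one training tree t = f(a, b).
# States are 1-roots of subtrees of t; delta follows
# delta(f, root_k(t1), ..., root_k(tn)) = root_k(f(t1, ..., tn)).

t = ("f", ("a",), ("b",))
qa, qb = ("a",), ("b",)
qf = ("f", ("a",), ("b",))                  # root_1(t): leaves survive at k = 1
delta = {("a",): qa, ("b",): qb, ("f", qa, qb): qf}
final = {qf}

def accepts(tree):
    def run(s):
        kids = tuple(run(c) for c in s[1:])
        return delta.get((s[0],) + kids)    # None when no transition exists
    return run(tree) in final
```

The training tree is accepted, illustrating T′ ⊆ L(A∼k(T′)), while the reordered tree f(b, a) has no matching transition and is rejected.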

Theorem 2. L(A∼k) is a k-TSS language.

Proof. We need to prove that ∀t1, t2 ∈ T(F), k ≥ 0 . rootk(t1) = rootk(t2) ∧ t1^−1 L(A∼k) ≠ ∅ ∧ t2^−1 L(A∼k) ≠ ∅ ⇒ t1^−1 L(A∼k) = t2^−1 L(A∼k). Suppose the antecedent is true, but the consequent is false, i.e., t1^−1 L(A∼k) ≠ t2^−1 L(A∼k). Then there must exist t such that t♯t1 ∈ L(A∼k) and t♯t2 ∉ L(A∼k). Let u be the position of ξt2(1), i.e., (t♯t2)/u = t2. Without loss of generality, let t be the tree with minimal |u|. Necessarily, |u| > 1, as otherwise t1^−1 L(A∼k) = ∅. Let u = w · i, i ∈ N. We prove that t♯t2 must be in L(A∼k), contradicting the initial assumption, by induction on the length of w.

Base case (|w| = 1): Let (t(w))(1) = f, f ∈ Fn. There are two subcases: n = 1 and n > 1. For n = 1, the contradiction immediately follows, as δ(f, rootk(t1)) = δ(f, rootk(t2)). For the n > 1 case, observe that for all positions w · j such that 1 ≤ j ≤ n and j ≠ i, (t♯t1)/w · j = (t♯t2)/w · j = t/w · j. From that observation and rootk(t1) = rootk(t2), it follows that

δ((t♯t1/w)(1), rootk(t♯t1/w · 1), . . . , rootk(t♯t1/w · n))
= δ((t♯t2/w)(1), rootk(t♯t2/w · 1), . . . , rootk(t♯t2/w · n))

Induction step (|w| > 1): Let w = w′ · m, m ∈ N. From the induction hypothesis, we know that for all m, rootk(t♯t1/w) = rootk(t♯t2/w), thus it follows:

δ((t♯t1/w′)(1), rootk(t♯t1/w′ · 1), . . . , rootk(t♯t1/w′ · n))
= δ((t♯t2/w′)(1), rootk(t♯t2/w′ · 1), . . . , rootk(t♯t2/w′ · n))

Theorem 3 (Minimality). A∼k is state-minimal.

Proof. Follows from Myhill-Nerode Theorem [20, pg. 72] and Lemma 1.

Minimality is not absolutely crucial for malware analysis in a laboratory setting, but it is important in practice, where antivirus tools can't impose a significant system overhead and have to react promptly to infections.

Theorem 4 (Garcia [14]). L(A∼k+1) ⊆ L(A∼k)

An important consequence of Garcia's theorem is that the k factor can be used as an abstraction knob — the smaller the k factor, the more abstract the inferred automaton. This tunability is particularly important in malware detection. One can't hope to design a classifier capable of perfect malware and goodware distinction. Thus, tunability of the false positive (goodware detected as malware) and false negative (undetected malware) ratios is crucial. More abstract automata will result in more false positives and fewer false negatives.

4.2 Inference Algorithm

In this section, we present our inference algorithm, but before proceeding with the algorithm, we discuss some practical aspects of inference from data-flow dependency graphs. As discussed in Section 2, we use taint analysis to compute data-flow dependencies among executed syscalls at runtime. The result of that computation is not a tree, but an acyclic directed graph, i.e., a partial order of syscalls ordered by the data-flow dependency relation, and expansion of such a graph into a tree could cause exponential blowup. Thus, it would be more convenient to have an inference algorithm that operates directly on graphs, without expanding them into trees.

Fortunately, such an algorithm is only slightly more complicated than the one that operates on trees. In the first step, our implementation performs common subexpression elimination [1] on the dependency graph to eliminate syntactic redundancies. The result is a maximally-shared graph [2], i.e., an acyclic directed graph with shared common subgraphs. Figure 3 illustrates how a tree can be folded into a maximally-shared graph. In the second step, we compute a hash for each k-root in the training set. The hash is later used as a hash table key. Collisions are handled via chaining [10], as usual, but chaining is not described in the provided algorithms. The last step of the inference


[Graph drawing omitted.]

Fig. 3: Folding a Tree into a Maximally-Shared Graph.

algorithm traverses the graph and folds it into a tree automaton, using the key computed in the second phase to identify equivalent k-roots, which are mapped to the same state.

To simplify the exposition, we shall use the formal machinery developed in Section 3 and present indexing and inference algorithms that work on trees. The extension to maximally-shared graphs is trivial and explained briefly later.
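The folding step of Figure 3 amounts to hash consing. The sketch below is ours, not the paper's implementation, and it assumes a term of the shape suggested by the figure, f(h(g(a, b)), g(a, b)): building the term bottom-up while creating each structurally distinct subterm only once yields the maximally-shared graph.

```python
# Hash consing (illustrative sketch): structurally identical subterms are
# created only once, so the result is a maximally-shared DAG as in Fig. 3.

class Dag:
    def __init__(self):
        self.table = {}        # (symbol, child ids) -> node id
        self.nodes = []        # node id -> (symbol, child ids)

    def mk(self, symbol, children=()):
        key = (symbol, tuple(children))
        if key not in self.table:          # create each distinct subterm once
            self.table[key] = len(self.nodes)
            self.nodes.append(key)
        return self.table[key]

d = Dag()
# A term of the shape suggested by Fig. 3: f(h(g(a, b)), g(a, b)).
a, b = d.mk("a"), d.mk("b")
g1 = d.mk("g", (a, b))
root = d.mk("f", (d.mk("h", (g1,)), d.mk("g", (a, b))))
```

The second request for g(a, b) returns the already-created node, so the DAG has only five nodes instead of the seven a tree expansion would need.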

input : Tree t, factor k
result : Key computed for every subtree of t

tmp ← hash(t(1))
foreach 1 ≤ i ≤ arity(t(1)) do
    ts ← t/i
    tmp ← tmp ⊕ hash(ts.key)
    ComputeKey(ts, k)
t.key ← tmp

Algorithm 1: ComputeKey — Computing k-Root Keys (Hashes). The ⊕ operator can be any operator used to combine hashes, like bitwise exclusive OR. The hash : F → N function can be implemented as a string hash, returning an integral hash of the alphabet symbols.

Algorithm 1 traverses tree t in postorder (children before the parent). Every subtree has a field key associated with its head, and the field is assumed to be initially zero. If the algorithm is called once, for tree t, the key of the head of each subtree ts will consist only of the hash of the alphabet symbol labeling ts, i.e., hash(ts(1)). If the algorithm is called twice (on the same tree), the key of the head of each subtree will include the hash of its own label and the labels of its children, and so on. Thus, after k calls to ComputeKey, the key of each node will be equal to its k-root key. Note that the temporary key, stored in the tmp variable, has to be combined with the children's (k−1)-root keys. The algorithm can be easily extended to operate on maximally-shared graphs, but has to track visited nodes and visit each node only once in postorder. The complexity of the algorithm is O(k · N), where N is the size of the tree (or maximally-shared graph). For multi-rooted graphs (or when processing multiple trees), all roots can be connected by creating a synthetic super-root of all roots, and the algorithm is then called k times with the super-root as the first operand.
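Our Python rendering of this iterated keying idea follows; exact tuples stand in for integer hashes, an illustrative simplification that also sidesteps collisions. The important detail, as in Algorithm 1, is that a node reads each child's key before recursing into it, so it combines the child's key from the previous pass, i.e., the child's (k−1)-root key.

```python
# Iterated keying sketch: after k passes over the tree, node.key
# identifies the node's k-root (tuples play the role of hashes here).

class Node:
    def __init__(self, symbol, children=()):
        self.symbol, self.children, self.key = symbol, list(children), 0

def compute_key(t):
    """One pass: combine the node's label with each child's key from the
    previous pass (read before recursing), then recurse into the child."""
    tmp = (t.symbol,)
    for ts in t.children:
        tmp += (ts.key,)       # child's key from the previous pass
        compute_key(ts)
    t.key = tmp

t = Node("f", [Node("g", [Node("a"), Node("b")]), Node("h", [Node("b")])])
for _ in range(2):             # k = 2 passes
    compute_key(t)
```

After two passes the root's key combines f's label with the children's one-pass keys, so two nodes receive equal keys exactly when their (approximately) 2-level-deep contexts agree.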

input : Tree t, factor k, alphabet F
output: A∼k = (Q, F, δ, F)

foreach subtree ts in {t/u | u ∈ dom(t)} traversed in postorder do
    if rep[ts.key] = ∅ then
        q ← rootk(ts)
        rep[ts.key] ← q
        Q ← Q ∪ {q}
    n ← arity(ts(1))
    δ ← δ ∪ {((ts(1), rep[(ts/1).key], . . . , rep[(ts/n).key]), rep[ts.key])}
F ← F ∪ {rep[t.key]}
return (Q, F, δ, F)

Algorithm 2: k-Testable Tree Automaton Inference. The rep : hash(rootk(T(F))) → rootk(T(F)) hash map contains representatives of equivalence classes induced by ∼k. Collisions are handled via chaining (not shown).

Algorithm 2 constructs the A∼k automaton. The tree (alternatively, maximally-shared graph) used for training is traversed in postorder, and the k-root of each subtree is used to retrieve the representative for each ∼k-induced equivalence class. Multi-rooted graphs can be handled by introducing super-roots (as described before). Amortized complexity is O(kN), where N is the size of the tree (or maximally-shared graph).
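For intuition, the inference step can be sketched in Python. The sketch below is our own simplified formulation, not the paper's implementation: it computes k-root keys directly as bounded-depth structural hashes, uses them as states, records one transition per subtree, and checks acceptance bottom-up. Note how training on a three-deep chain generalizes to longer chains — the loop-folding behavior discussed in the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def k_root(n: Node, k: int) -> int:
    """Structural hash of the top k levels of the subtree rooted at n."""
    if k <= 1:
        return hash((n.label,))
    return hash((n.label,) + tuple(k_root(c, k - 1) for c in n.children))

def subtrees_postorder(n: Node):
    for c in n.children:
        yield from subtrees_postorder(c)
    yield n

def infer(root: Node, k: int):
    """Infer (states, transitions, final states) of the k-testable automaton."""
    Q, delta = set(), {}
    for ts in subtrees_postorder(root):
        state = k_root(ts, k)                    # representative of the ~k class
        Q.add(state)
        lhs = (ts.label, tuple(k_root(c, k) for c in ts.children))
        delta[lhs] = state                       # transition for this fork
    return Q, delta, {k_root(root, k)}

def accepts(t: Node, delta, final) -> bool:
    """Run the tree bottom-up through the transition map."""
    def run(n: Node):
        return delta.get((n.label, tuple(run(c) for c in n.children)))
    return run(t) in final
```

Training on g(g(g(a))) with k = 2 yields an automaton that accepts g applied any larger number of times, because the k-roots of the intermediate nodes coincide.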

5 Implementation

5.1 Taint Analysis

We use Pin [26] to perform instruction-level tracing and analysis. Pin is a dynamic binary instrumentation framework that allows program monitoring and rewriting only in user space, which prevents us from propagating taints through syscalls in the kernel space. One possible solution would be to declare all syscalls' input parameters to be taint sinks, and all output parameters to be taint sources. Unfortunately, the kernel interface for the Windows XP operating system is only partially documented. To work around this problem, we use the libwst library by Martignoni and Paleari [27] to automatically extract and parse parameters of Windows syscalls. With libwst, we find out the number, type, and directionality (in/out) of parameters. The reverse-engineered parameters are then used as an input-output specification of syscalls. After each return from a syscall, we walk the stack and mark any location pointed to by an out parameter as tainted with a new taint mark. At syscall entry (i.e., just before our tool loses control), we walk the stack and check if taint has reached any of its in parameters. Since each taint mark can be traced back to a unique out parameter, the set of dependencies for an in parameter corresponds exactly to the set of its taint marks. We approximate the leaves of the dependency graph (i.e., input parameters not returned by any syscall) with their types. A more precise approach, left for future work, would be to use the actual values.
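The mark-and-check discipline above can be illustrated with a much-simplified sketch. This is our own simplification, not the actual implementation: taint marks collapse to the index of the producing syscall, and parameters to raw addresses, whereas the real tool walks the stack and tracks per-mark provenance through the DTA engine.

```python
def extract_dependency_edges(trace):
    """trace: list of (syscall_name, in_addrs, out_addrs) events.

    Each out address receives a fresh taint mark (here simply the index of
    the producing syscall); marks found on an in address become dependency
    edges (producer_index, consumer_index)."""
    taint = {}                       # address -> producing syscall index
    edges = set()
    for idx, (_name, in_addrs, out_addrs) in enumerate(trace):
        for a in in_addrs:           # check taint on in parameters
            if a in taint:
                edges.add((taint[a], idx))
        for a in out_addrs:          # fresh mark per out parameter
            taint[a] = idx
    return edges
```

For example, a create-write-close sequence over the same handle yields the two expected edges create→write and write→close.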

Ideally, each malware sample would run unencumbered in the environment targeted by its authors. According to conventional wisdom, most malware samples target Windows XP, so we set it up with the latest service pack in a VirtualBox virtual machine with no network connection and only one user with administrative rights. Although the lack of a network connection might prevent some samples from executing their payload, such a precaution is necessary to avoid spreading the infection. We infect the virtual machine via a shared folder. The physical machine used to run the dependency graph extraction experiments has a 2.66GHz Intel Core i7 CPU and 8GB RAM. After each run, we revert the virtual machine to a clean snapshot so that malware samples cannot interfere with each other.

5.2 Inference Algorithm

The inference algorithm is a relatively straightforward implementation of the algorithms in Section 4.2, written in about 3200 lines of C++ code. As explained before, after reading the dependency graphs, the implementation performs common subexpression elimination (CSE), computes k-root hashes (Algorithm 1), infers a k-testable tree automaton (Algorithm 2), and then runs the dependency graphs from the test set against that automaton. Both CSE and inference are done directly on dependency graphs, avoiding an expansion into trees.
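The CSE step amounts to hash-consing: structurally identical subtrees get a single shared node, so a forest of dependency trees becomes a maximally-shared DAG. A minimal sketch of the idea (our own illustration, with made-up names):

```python
def intern(label, child_ids, table):
    """Return a canonical node id for `label` applied to already-interned
    children; identical (label, children) pairs share one id."""
    sig = (label, tuple(child_ids))
    if sig not in table:
        table[sig] = len(table)      # fresh node id for a new shape
    return table[sig]

# Interning the tree f(a, b) twice yields one shared representation:
table = {}
n1 = intern("f", [intern("a", [], table), intern("b", [], table)], table)
n2 = intern("f", [intern("a", [], table), intern("b", [], table)], table)
# n1 == n2, and the table holds 3 nodes rather than 6
```

Because keys and transitions depend only on node shape, Algorithms 1 and 2 can run directly over such a DAG, visiting each shared node once.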

6 Experimental Results

6.1 Benchmarks

For the experiments, we use two sets of benchmarks: the malware and the goodware set. The malware set comprises 2631 samples pre-classified into 48 families. Each family contains 5–317 samples. We rely upon the classification of Christodorescu et al. [6] and Fredrikson et al. [13].2 The classification was based on the reports from antivirus tools. For a small subset of samples, we confirmed the quality of classification using virustotal.com, a free malware classification service. However, without knowing the internals of those antivirus tools and their classification heuristics, we cannot evaluate the quality of the classification provided to us. Our classification experiments indicate that the classification done by antivirus tools might be somewhat ad hoc. Table 1 shows the statistics for every family, while Table 2 shows goodware statistics. Table 3 gives some idea of how antivirus tools classify one randomly chosen sample.

2 The full set of malware contains 3136 samples, but we eliminated samples that were not executable, executable but not analyzable with Pin (i.e., MS-DOS executables), broken executables, and those that were incompatible with the version of Windows (XP) that we used for experiments.


ID  Family Name         Samples  Avg.   Nodes  Trees  Max.
 1  ABU.Banload             16    7.71    544    303    21
 2  Agent                   42    8.86    965    593    27
 3  Agent.Small             15    8.88    950    588    27
 4  Allaple.RAHack         201    8.78   1225    761    44
 5  Ardamax                 25    6.21    144     69    16
 6  Bactera.VB              28    7.09    333    177    28
 7  Banbra.Banker           52   13.97   1218    686    37
 8  Bancos.Banker           46   14.05    742    417    45
 9  Banker                 317   17.70   2952   1705    43
10  Banker.Delf             20   14.78    939    521    50
11  Banload.Banker         138   19.38   2370   1332   152
12  BDH.Small                5    5.82    348    199    21
13  BGM.Delf                17    7.04    339    199    25
14  Bifrose.CEP             35   11.17   1190    698    50
15  Bobax.Bobic             15    8.98    859    526    30
16  DKI.PoisonIvy           15    9.22    413    227    40
17  DNSChanger              22   12.62    874    483    36
18  Downloader.Agent        13   12.89   1104    613    49
19  Downloader.Delf         22   10.76   1486    906    32
20  Downloader.VB           17   10.80    516    266    29
21  Gaobot.Agobot           20   17.54   1812   1052    45
22  Gobot.Gbot              58    7.01    249    134    22
23  Horst.CMQ               48   16.86   1030    541    42
24  Hupigon.ARR             33   23.58   2388   1244    55
25  Hupigon.AWQ            219   24.63   7225   3758    62
26  IRCBot.Sdbot            66   16.51   3358   1852    47
27  LdPinch                 16   16.88   1765   1012    66
28  Lmir.LegMir             23    9.00   1112    667    28
29  Mydoom                  15    5.78    484    305    20
30  Nilage.Lineage          24    9.64   1288    657    83
31  Games.Delf              11    8.44    971    632    22
32  Games.LegMir            76   17.18  11892   8184    59
33  Games.Mmorpg            19    7.00    654    478    25
34  OnLineGames             23    7.30    718    687    16
35  Parite.Pate             71   14.31   1420    816    36
36  Plemood.Pupil           32    6.29    330    189    24
37  PolyCrypt.Swizzor       43   10.32    415    213    30
38  Prorat.AVW              40   23.47   1031    572    58
39  Rbot.Sdbot             302   14.23   4484   2442    47
40  SdBot                   75   14.13   2361   1319    40
41  Small.Downloader        29   11.93   2192   1216    34
42  Stration.Warezov        19    9.76   1682   1058    34
43  Swizzor.Obfuscated      27   21.75   1405    770    49
44  Viking.HLLP             32    7.84    512    315    24
45  Virut                  115   11.76   3149   1953    40
46  VS.INService            17   11.42    307    178    37
47  Zhelatin.ASH            53   12.14   1919   1146    39
48  Zlob.Puper              64   15.16   2788   1647    90

Table 1: Malware Statistics per Family. All dependency graphs were obtained by running each sample for 120sec in a controlled environment. The identifier that will be used in later graphs is given in the first column. The third column shows the number of samples per family. The Avg. column shows the average height of the dependency graphs across all the samples in the family. The Nodes column shows the total number of nodes in the dependency graph (after CSE). The Trees column shows the total number of different trees (i.e., roots of the dependency graph) across all the samples. The Max column gives the maximal height of any tree in the family.

The goodware set comprises 33 commonly used applications: Adobe Reader, Apple SW Update, Autoruns, Battle for Wesnoth, Chrome, Chrome Setup, Firefox, Freecell, Freeciv, Freeciv server, GIMP, Google Earth, Internet Explorer, iTunes, Minesweeper, MSN Messenger, Netcat port listen and scan, NetHack, Notepad, OpenOffice Writer, Outlook Express, Ping, 7-zip archive, Skype, Solitaire, Sys info, Task manager, TuxRacer, uTorrent, VLC, Win. Media Player, and WordPad. We deemed these applications to be representative of software commonly found on the average user's computer, from a number of different vendors and with a diverse set of behaviors. Also, we used two micro benchmarks: a HelloWorld program written in C and a file copy program. Micro-benchmarks produce few small dependency graphs and therefore might be more susceptible to being misidentified as malware.

In behavioral malware detection, there is always a contention between the amount of time the behavior is observed and the precision of the analysis. For malware samples, which are typically small pieces of software, we set the timeout to 120sec of running in our environment. For goodware, we wanted to study the impact of the runtime on the height and complexity of generated dependency graphs, and the impact of these differences on the false positive rates. Thus, we ran goodware samples for both 120 and 800sec. To give some intuition of how that corresponds to the actual native runtime, it


                             800sec Trace              120sec Trace
ID  Application           Avg.  Nodes Trees Max.    Avg.  Nodes Trees Max.
 1  AdobeReader           8.09   340   191   22     8.57   271   147   22
 2  Apple SW Update      13.74   561   317   51    20.87   293   151   51
 3  Autoruns             12.29   330   181   43    12.45   304   160   43
 4  Battle for Wesnoth   41.01   602   355   76    34.73   380   187   76
 5  Chrome               13.85   436   240   43    11.11   273   143   31
 6  Chrome Setup          5.19   148    74   17     5.19   148    74   17
 7  Copy                 77.14   913   426  244    64.99   880   412  215
 8  Firefox              30.43   785   464   94    44.02   356   175   89
 9  Freecell             11.65   308   167   33    11.49   316   172   33
10  Freeciv              28.48   472   241   75    37.14   300   137   72
11  Freeciv server       11.46   300   177   30    11.62   297   174   30
12  GIMP                 30.97   681   359   86    36.33   299   134   69
13  Google Earth         33.08   321   155   76     4.63    88    37   13
14  HelloWorld            1.62    35    15    4     1.53    34    14    4
15  Internet Explorer    10.58   572   319   49    13.08   279   139   45
16  iTunes               48.81   852   457  120    32.05   404   217   75
17  Minesweeper          10.85   304   167   30    10.72   305   167   30
18  MSN Messenger        17.75   809   477   59    23.28   308   158   58
19  Netcat port listen   65.08   997   494  241    67.05   873   413  225
20  Netcat port scan     54.67  1123   597  241    65.69   882   420  225
21  NetHack               4.94   124    63   15     4.94   124    63   15
22  Notepad               9.68   350   198   30    10.69   298   165   30
23  OpenOffice Writer     6.55   271   156   19     6.60   271   156   19
24  Outlook Express      20.45   490   279   51    20.64   360   201   49
25  Ping                 11.82   535   317   34    12.19   360   197   34
26  7-zip archive        12.96   269   149   26    12.97   267   144   30
27  Skype                 1.38    31    12    3     1.38    31    12    3
28  Solitaire            11.63   303   165   31    11.30   311   170   31
29  Sys. Info             6.48   613   382   26     7.01   305   171   26
30  Task Manager         11.28   513   307   35    11.94   343   196   35
31  TuxRacer             14.11   441   261   44    15.53   279   157   39
32  uTorrent              9.31   267   151   28    10.49   214   114   28
33  VLC                  12.92   325   178   38    12.76   295   159   38
34  Win. Media Player     9.50   448   255   36    10.23   315   174   36
35  WordPad               8.33   420   235   28     8.52   262   147   27
    Average              19.06   426   235   51    17.10   295   153   46

Table 2: Goodware Statistics. For the description of other columns, see Table 1.


Antivirus       Classification
AVG             Downloader.Generic4.GAF
AhnLab-V3       Win-Trojan/Xema.variant
AntiVir         TR/Agent.8192.123
Antiy-AVL       Trojan/Win32.Agent.gen
Avast           Win32:Agent-GQA
Avast5          Win32:Agent-GQA
BitDefender     Trojan.Downloader.Agent.AAH
CAT-QuickHeal   TrojanDownloader.Agent.aah
ClamAV          Trojan.Downloader-6542
Command         W32/Downldr2.COH
Comodo          TrojWare.Win32.TrojanDownloader.Agent.AAH
nProtect        Trojan-Downloader/W32.Agent.8192.K
Emsisoft        Trojan-Dropper.Agent!IK
F-Prot          W32/Downldr2.COH
F-Secure        Trojan.Downloader.Agent.AAH
GData           Trojan.Downloader.Agent.AAH
Ikarus          Trojan-Dropper.Agent
Jiangmin        Trojan/PSW.GamePass.gir
K7AntiVirus     Trojan-Downloader
Kaspersky       Trojan-Downloader.Win32.Agent.aah
McAfee          Suspect-AB!3BC816C45FD4
DrWeb           Adware.DealHelper
Microsoft       TrojanDownloader:Win32/Agent
NOD32           Win32/TrojanDownloader.Agent.AAH
Norman          W32/Agent.ECGQ
PCTools         Trojan-Downloader.Agent.AAH
Panda           Trj/Downloader.OEW
Prevx           Med. Risk Malware Downloader
Rising          Trojan.Clicker.Win32.Small.nh
Sophos          Mal/Generic-L
Symantec        Downloader
TheHacker       Trojan/Downloader.Agent.aah
TrendMicro      TROJ_Generic
VirusBuster     Trojan.DL.Agent.TOR

Table 3: Sample 3BC816C45FD461377E13A775AE8768A3 Classification. Data obtained from Virustotal.com.

takes approximately 800s in our DTA analysis environment for Acrobat Reader to open a document and display a window.

We noticed a general tendency that detection and classification tend to correlate positively with the average height of trees in samples used for training and testing. We provide the average heights in Tables 1 and 2, and heat maps providing a deeper insight into the distribution of the heights in Figures 4, 6, and 5.

6.2 Malware and Goodware Recognition

For our malware recognition experiments, we chose at random 50% of the entire malware set for training, and used the rest and the entire goodware set as test sets. Training with k = 4 took around 10sec for the entire set of 1315 training samples, and the time required for analyzing each test sample was less than the timing jitter (sub-second range). All the experiments were performed in Ubuntu 10.04, running in a VMware 7.1.3 workstation, running on Win XP Pro and a dual-core 2.5GHz Intel machine with 4GB of RAM. In Figure 7 (resp. 8), we show the results, using the goodware dependency graphs produced with an 800sec (resp. 120sec) timeout.

The detection works as follows. We run all the trees (i.e., roots of the dependency graph) in each test sample against the inferred automaton. First, we sort the trees by height, and then compute how many trees for each height are accepted by the automaton. Second, we score the sample according to the following function:

score = ( Σ_i (accepted_i / total_i) · i ) / ( Σ_i i )        (1)

where i ranges from 1 to the maximal height of any tree in the test sample (the last column of Table 1), accepted_i is the number of trees with height i accepted by the automaton, and total_i is the total number of trees with height i. The test samples that produce no syscall dependency graphs are assumed to have score zero.
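Eq. 1 translates directly into code; a small sketch (the names are ours):

```python
def score(accepted, total, max_height):
    """accepted[i] / total[i]: number of trees of height i accepted / seen.
    max_height: maximal height of any tree in the test sample."""
    if max_height == 0:
        return 0.0                           # no dependency graphs -> zero
    num = sum(accepted.get(i, 0) / n * i for i, n in total.items() if n)
    den = sum(range(1, max_height + 1))      # sum of i from 1 to max_height
    return num / den
```

Weighting each height's acceptance ratio by i is what discounts the shallow, low-information trees.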


Fig. 4: Malware Tree Height Heat Map. The x axis represents the tree height, while the y axis lists malware families. The legend on the right is a color code for the number of trees observed with a particular height.

The score can range from 0 to 1. A higher score signifies a higher likelihood that the sample is malicious. The ratio in the numerator of Eq. 1 is multiplied by the depth of the tree to filter out the noise from shallow trees, often generated by standard library functions, which have very low classification power.

The results turned out to be slightly better with an 800sec timeout than with the 120sec timeout, as the average height of dependency graphs was slightly larger. As expected, we found that with a rising k factor (and therefore decreasing level of abstraction), the capability of the inferred tree automaton to detect malware decreases, which indicates the value of generalization achieved through tree automata inference. On the other hand, with a rising k factor, the detection becomes more precise and therefore the false positive rate drops. Thus, it is important to find the right level of abstraction. In our experiments, we determined that k = 4 was the optimal abstraction level. The desired ratio between false positives and negatives can be adjusted by selecting the score threshold. All samples scoring above (resp. below) the threshold are declared malware (resp. goodware). For example, for k = 4, a timeout of 800sec, and a score threshold of 0.6, our approach reports two false positives (5%), Chrome Setup and NetHack, and 270 false negatives (20%), which corresponds to an 80% detection rate. For k = 4, a timeout of 120sec, and a score threshold of 0.6, our approach reports one additional false positive (System info) and the same number of false negatives, although a few malware samples are somewhat closer to the threshold. Obviously, the longer the behavior is observed, the better the classification.
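Selecting the threshold can be sketched as a simple sweep over candidate scores; the sample scores below are made up for illustration.

```python
def rates(malware_scores, goodware_scores, threshold):
    """False-negative rate: malware scoring below the threshold.
    False-positive rate: goodware scoring at or above it."""
    fn = sum(s < threshold for s in malware_scores) / len(malware_scores)
    fp = sum(s >= threshold for s in goodware_scores) / len(goodware_scores)
    return fp, fn

malware = [0.9, 0.8, 0.4, 0.7]       # made-up sample scores
goodware = [0.1, 0.2, 0.65, 0.3]
# pick the threshold minimizing the sum of the two error rates
best = min((sum(rates(malware, goodware, t)), t)
           for t in (0.3, 0.5, 0.6, 0.7))
```

In practice one would weight the two error rates by their relative cost instead of summing them.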


Fig. 5: Goodware (800sec Trace) Tree Height Heat Map. The x axis represents the goodware samples.

It is interesting to notice that increasing the value of k above 4 does not make a significant difference in (mis)detection rates. We ran the experiments with k up to 10, but do not show the results as they are essentially the same as for k = 4. From our preliminary analysis, it seems that generalization is effective when a sequence of dependent syscalls is executed within a loop. If two samples execute the same loop body a different number of times, our approach will be able to detect that. Changing k effectively changes the window with which such loop bodies are detected. During the inference, it seems like one size (of k) does not fit all cases. We believe that by analyzing the repetitiveness of patterns in dependency graphs, we could detect the sizes of loop bodies much more accurately, and adjust the k factor according to the size of the body, which should in turn improve the generalization capabilities of the inference algorithm. Many other improvements of our work are possible, as discussed later.

6.3 Malware Classification

We wondered what the classification power of inferred automata is, so we performed the following experiment. We divided each family at random into training and test sets of equal size. For each training set, we inferred a family-specific tree automaton. For each test set, we read the dependency graphs for all the samples in the set and computed a single dependency graph, which was then analyzed with the inferred tree automaton. The scores are computed according to Equation 1, with k = 3. The only difference from the experiment done in the previous section is that the score is computed for the entire test set, not individual samples in the set. Results are shown in Figure 9.


Fig. 6: Goodware (120sec Trace) Tree Height Heat Map. The x axis represents the goodware samples.

The pronounced diagonal in Figure 9 shows that our inferred automata clearly have significant classification power and could be used to classify malware into families. There is some noise as well. The noise could be attributed to many factors: over-generalization, over- and under-tainting of our DTA [5, 19], insufficiently large dependency graphs, frequently used dynamic libraries that are shared by many applications and malware, and a somewhat ad-hoc pre-classification by the antivirus tools.

7 Limitations

There are several inherent limitations of our approach. An attacker could try to mask syscall dependencies so that they are similar (or identical) to those of benign applications. This class of attacks is known as mimicry attacks [35]. All intrusion and behavioral malware detection approaches are susceptible to mimicry attacks. One way to make this harder for the attacker is to make the analysis more precise, as will be discussed in the following section.

Triggering interesting malware behavior is another challenge. Some behaviors could be triggered only under certain conditions (date, web site visited, choice of the default language, users' actions, . . . ). Moser et al. [29, 4] proposed DART [16] as a plausible approach for detecting rarely exhibited behaviors.

As discussed earlier, our DTA environment slows the execution several thousand times, which is obviously too expensive for real-time detection. A lot of work on malware analysis is done in the lab setting, where this is not a significant constraint, but


Fig. 7: Malware and Goodware Recognition. Timeouts for generating the dependency graphs were 120sec for the malware test and training sets and 800sec for the goodware test set. The training set consists of 50% of the entire malware set, chosen at random. The test set consists of the remaining malware samples (curves rising from left to right) and the goodware set (curves falling from left to right). The rising curves represent the percentage of malware samples for which the computed score was less than the corresponding value on the x axis. The falling curves represent the percentage of goodware samples for which the score was greater than the corresponding value on the x axis. The figure shows curves for four different values of k; there is essentially no difference between the cases when k = 4 and k = 5. For the rising curves, the lowest curve is for k = 2, the next higher one for k = 3, and the two highest ones for the remaining cases. For the falling curves, the ordering is reversed. The optimal score for distinguishing malware from goodware is the lowest intersection of the rising and falling curves for the same k.


Fig. 8: Malware and Goodware Recognition. Timeouts for generating dependency graphs were 120 seconds for the malware test and training sets, as well as for the goodware test set.

efficiency obviously has to be improved if taint-analysis based approaches are ever to be broadly used for malware detection. Hardware taint-analysis accelerators are a viable option [32, 11], but we also expect we could probably achieve an order of magnitude speedup of our DTA environment with a very careful optimization.

8 Conclusions and Future Work

In this paper, we presented a novel approach to detecting likely malicious behaviors and malware classification based on tree automata inference. We showed that inference, unlike simple matching of dependency graphs, does generalize from the learned patterns and therefore improves detection of yet unseen polymorphic malware samples. We proposed an improved k-testable tree automata inference algorithm and showed how the k factor can be used as a knob to tune the abstraction level. In our experiments, our approach detects 80% of the previously unseen polymorphic malware samples, with a 5% false positive rate, measured on a diverse set of benign applications.

Our technique is, at the current stage of development, intended to be used in a laboratory setting. Currently, detection and classification of malware require significant amounts of manual work (see [21] for discussion and references). The goal of our approach is to automate these processes in the lab setting. Currently, tracing targeted applications and tracking their syscall dependencies incurs a significant slowdown, which is, in our view, the most significant obstacle to adopting our approach in a real-time, real-world setting. We expect that further research and recent progress in hardware-assisted taint analysis [32, 11] could bridge the performance gap.

Fig. 9: Malware Classification Results. The x (y) axis represents the training (test) sets. The size of the shaded circle corresponds to the score computed by Eq. 1.

There are many directions for further improvements. The classification power of our approach could be improved by a more precise analysis of syscall parameters (e.g., using their actual values in the analysis), by dynamically detecting the best value of the k factor in order to match the size of loop bodies that produce patterns in the dependency graphs, by using goodware dependency graphs as negative examples during training, and by combining our approach with the leap mining approach [13].

Also, in the current dependency graph analysis, we do not distinguish how syscalls return values. For example, if a syscall returns two values, one through the first out parameter and another one through the second, we consider these two values to be the same during the inference, even though our taint analysis distinguishes them. In other words, the inference merges all outputs into a single output and all dependencies are analyzed with respect to that single merged output.

Another interesting direction is inference of more expressive tree languages. Inference of more expressive languages might handle repeated patterns more precisely, generalizing only as much as needed to fold a repeatable pattern into a loop in the tree automaton. Further development of similar methods could have a broad impact in security, forensics, detection of code theft, and perhaps even testing and verification, as the inferred automata can be seen as high-level abstractions of a program's behavior.


Acknowledgments

We are grateful to Matt Fredrikson and Somesh Jha for sharing their library of classified malware [13] with us. We thank Emiliano Martinez Contreras for giving us an account at virustotal.com that we used to double-check the classification, and for giving us support in using the API through which our experimentation scripts communicated with virustotal.com. We would especially like to thank Lorenzo Martignoni, who wrote the libwst library [27] for extracting and parsing arguments of Windows's system calls. We also thank the reviewers for their insightful and constructive comments.

References

1. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, Massachusetts, USA (1986)

2. Babic, D.: Exploiting Structure for Scalable Software Verification. Ph.D. thesis, University of British Columbia, Vancouver, Canada (2008)

3. Bonfante, G., Kaczmarek, M., Marion, J.Y.: Architecture of a morphological malware detector. Journal in Computer Virology 5, 263–270 (2009)

4. Brumley, D., Hartwig, C., Liang, Z., Newsome, J., Song, D., Yin, H.: Botnet Detection: Countering the Largest Security Threat, Advances in Information Security, vol. 36, chap. Automatically Identifying Trigger-based Behavior in Malware, pp. 65–88. Springer (2008)

5. Chow, J., Pfaff, B., Garfinkel, T., Christopher, K., Rosenblum, M.: Understanding data lifetime via whole system simulation. In: Proc. of the 13th USENIX Security Symposium (2004)

6. Christodorescu, M., Jha, S.: Testing malware detectors. In: ISSTA'04: Proc. of the 2004 ACM SIGSOFT Int. Symp. on Software Testing and Analysis. pp. 34–44. ACM (2004)

7. Christodorescu, M., Jha, S., Kruegel, C.: Mining specifications of malicious behavior. In: Proc. of the 6th joint meeting of the European Software Engineering Conf. and the ACM SIGSOFT Symp. on the Foundations of Software Engineering. pp. 5–14. ACM (2007)

8. Christodorescu, M., Jha, S., Seshia, S.A., Song, D., Bryant, R.E.: Semantics-aware malware detection. In: SP'05: Proc. of the 2005 IEEE Symp. on Security and Privacy. pp. 32–46. IEEE Comp. Soc. (2005)

9. Comon, H., Dauchet, M., Gilleron, R., Löding, C., Jacquemard, F., Lugiez, D., Tison, S., Tommasi, M.: Tree automata techniques and applications (2007)

10. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. The MIT Press, 2nd edn. (2001)

11. Crandall, J., Chong, F.: Minos: Control data attack prevention orthogonal to memory model. In: Proc. of the 37th Int. Symp. on Microarchitecture. pp. 221–232. IEEE (2005)

12. Forrest, S., Hofmeyr, S.A., Somayaji, A., Longstaff, T.A.: A sense of self for Unix processes. In: Proc. of the 1996 IEEE Symp. on Security and Privacy. pp. 120–129. IEEE Comp. Soc. (1996)

13. Fredrikson, M., Jha, S., Christodorescu, M., Sailer, R., Yan, X.: Synthesizing near-optimal malware specifications from suspicious behaviors. In: Proc. of the 2010 IEEE Symposium on Security and Privacy. pp. 45–60. IEEE Comp. Soc. (2010)

14. García, P.: Learning k-testable tree sets from positive data. Tech. rep., Dept. Syst. Inform. Comput., Univ. Politecnica Valencia, Valencia, Spain (1993)

15. García, P., Vidal, E.: Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12, 920–925 (1990)

16. Godefroid, P., Klarlund, N., Sen, K.: DART: directed automated random testing. In: PLDI'05: Proc. of the ACM SIGPLAN Conf. on Prog. Lang. Design and Implementation. pp. 213–223. ACM (2005)

17. Gold, E.M.: Complexity of automaton identification from given data. Information and Control 37(3), 302–320 (1978)

18. Holzer, A., Kinder, J., Veith, H.: Using verification technology to specify and detect malware. LNCS, vol. 4739, pp. 497–504. Springer (2007)

19. Kang, M.G., McCamant, S., Poosankam, P., Song, D.: DTA++: Dynamic taint analysis with targeted control-flow propagation. In: Proceedings of the 18th Annual Network and Distributed System Security Symposium. San Diego, CA (2011)

20. Khoussainov, B., Nerode, A.: Automata Theory and Its Applications. Birkhäuser (2001)

21. Kinder, J., Katzenbeisser, S., Schallhart, C., Veith, H.: Detecting malicious code by model checking. In: Julisch, K., Krügel, C. (eds.) GI SIG SIDAR Conference on Detection of Intrusions and Malware and Vulnerability Assessment. LNCS, vol. 3548, pp. 174–187. Springer (2005)

22. King, J.C.: Symbolic execution and program testing. Comm. of the ACM 19(7), 385–394 (1976)

23. Knuutila, T.: Inference of k-testable tree languages. In: Bunke, H. (ed.) Advances in Structural and Syntactic Pattern Recognition: Proc. of the Int. Workshop, pp. 109–120. World Scientific (1993)

24. Kolbitsch, C., Milani, P., Kruegel, C., Kirda, E., Zhou, X., Wang, X.: Effective and efficient malware detection at the end host. In: The 18th USENIX Security Symposium (2009)

25. Lopez, D., Sempere, J.M., García, P.: Inference of reversible tree languages. IEEE Transactions on Systems, Man, and Cybernetics, Part B 34(4), 1658–1665 (2004)

26. Luk, C., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V., Hazelwood, K.: Pin: building customized program analysis tools with dynamic instrumentation. In: PLDI'05: Proc. of the 2005 ACM SIGPLAN Conf. on Prog. Lang. Design and Impl. pp. 190–200. ACM (2005)

27. Martignoni, L., Paleari, R.: The libwst library (a part of WUSSTrace) (2010), http://code.google.com/p/wusstrace/

28. Matrosov, A., Rodionov, E., Harley, D., Malcho, J.: Stuxnet under the microscope. Tech. rep., Eset (2010)

29. Moser, A., Kruegel, C., Kirda, E.: Exploring multiple execution paths for malware analysis. In: SP'07: Proc. of the 2007 IEEE Symposium on Security and Privacy. pp. 231–245. IEEE Computer Society, Washington, DC, USA (2007)

30. Newsome, J., Song, D.: Dynamic taint analysis: Automatic detection, analysis, and signature generation of exploit attacks on commodity software. In: Proc. of the Network and Distributed Systems Security Symposium (2005)

31. Pfleeger, C.P.: State reduction in incompletely specified finite-state machines. IEEE Transactions on Computers 22(12), 1099–1102 (1973)

32. Suh, G., Lee, J., Zhang, D., Devadas, S.: Secure program execution via dynamic information flow tracking. ACM SIGOPS Operating Systems Review 38(5), 85–96 (2004)

33. Symantec: Symantec global internet security threat report: Trends for 2009, volume XV. Tech. rep., Symantec (April 2010)

34. Wagner, D., Dean, D.: Intrusion detection via static analysis. In: Proc. of the 2001 IEEE Symposium on Security and Privacy. p. 156. IEEE Computer Society (2001)

35. Wagner, D., Soto, P.: Mimicry attacks on host-based intrusion detection systems. In: CCS'02: Proc. of the 9th ACM Conf. on Comp. and Comm. Security. pp. 255–264. ACM (2002)

36. You, I., Yim, K.: Malware obfuscation techniques: A brief survey. In: Int. Conf. on Broadband, Wireless Computing, Communication and Applications. pp. 297–300 (2010)

37. Zalcstein, Y.: Locally testable languages. J. Comput. Syst. Sci. 6(2), 151–167 (1972)