Impeding Automated Malware Analysis with Environment-sensitive Malware

Chengyu Song, Georgia Institute of Technology

Paul Royal, Georgia Institute of Technology

Wenke Lee, Georgia Institute of Technology

Abstract

To solve the scalability problem introduced by the exponential growth of malware, numerous automated malware analysis techniques have been developed. Unfortunately, all of these approaches make previously unaddressed assumptions that manifest as weaknesses to the tenability of the automated malware analysis process. To highlight this concern, we developed two obfuscation techniques that make the successful execution of a malware sample dependent on the unique properties of the original host it infects. To reinforce the potential for malware authors to leverage this type of analysis resistance, we discuss the Flashback botnet’s use of a similar technique to prevent the automated analysis of its samples.

1 Introduction

Malware analysis is the process of understanding the behavior of malicious programs. Information collected from malware analysis can be used to detect similar malware, repair damaged systems, and dismantle malicious infrastructure (e.g., C&C servers) [15, 2, 7, 16, 8]. As these activities endanger the cyber criminal’s profit model, attackers have continuously developed techniques to prevent malware from being analyzed. In response, defenders have created new techniques [14, 10, 1, 5, 13] to address malware analysis resistance. This arms race has lasted for decades and, in general, neither side has yet claimed a definitive advantage.

In this paper we demonstrate that through novel reuse of existing techniques traditionally reserved for digital rights management (DRM), automated malware analysis can be made ineffective and unscalable. Specifically, we introduce the concepts of host identity-based encryption (HIE) and instruction set localization (ISL), which interrelate the successful execution of a malware sample with unique properties of the original host it infects. Note that these techniques are not intended to prevent a human analyst from employing manual efforts to understand the behavior of a particular piece of malware (e.g., Stuxnet).

1.1 Background

Many early software obfuscation techniques (such as packing) were designed to complicate static analysis and omitted concerns about what an analyzer could learn by running a program. Later, virtual machine-based obfuscation was used to prevent static analysis on dumped memory images (e.g., which can include unpacked code). Thus, while these techniques have successfully rendered static analysis infeasible, they do not pose a significant threat to dynamic analysis. As part of evolutions in the threat landscape, malware authors realized this flaw and began integrating dynamic analysis detections. As an example, some malware samples will attempt to detect whether they are being debugged [11], or whether execution is occurring in a virtual environment [9]. If detection is successful, a given malware instance will not exhibit malicious behavior. Although these techniques have met with some degree of success, they remain brittle to mitigation, as researchers can devise new techniques (e.g., [14, 5]) to make their analysis environment appear normal. Based on this observation, our obfuscation techniques are designed to be effective against two key conceptual challenges faced by malware authors:

• (A1): It is difficult to reliably distinguish an analysis environment from a production environment;

• (A2): It is impossible to hide high-level (e.g., system call and network) behaviors from dynamic analysis.

The motivation behind overcoming these challenges is that the effectiveness of our obfuscation techniques should not be limited to existing analysis environments or mechanisms. Concisely, they should be able to resist all potential analysis techniques in the foreseeable future.

The Goal. Instead of preventing a human analyst from understanding a piece of malicious software, the goal of our obfuscation techniques is to prevent human analysts from analyzing the malware efficiently with what have become ubiquitous automated means. Since manually analyzing the entirety of new malware samples created each day is known to be untenable, we will focus on defeating automated malware analysis.

1.2 Defeating Automated Malware Analysis

The intuition behind our obfuscation techniques arises from the decades-old operational model of the anti-malware industry. This model consists of four phases: capture, analysis, signature generation, and runtime detection. In the first phase, malware samples are collected in several ways: honeypots, mail filters, web crawlers, client submissions, and malware exchanges. Samples collected are then put into an isolated environment for analysis. Any sample that is believed to be malicious will be sent to the signature generation phase, where invariant characteristics will be extracted from the binary. Finally, the generated signature will be dispatched to individual antivirus clients to detect the malware instance, or better, other variants of the same family.

The problem with this model is that malware samples are captured and analyzed in two different environments. Therefore, instead of trying to detect a particular analysis environment or popular virtualization container used to create automated malware analysis systems, we assume that any environment other than the original is an analysis environment. In other words, malware that possesses the protections we propose will behave maliciously only on the original host that was infected; any attempt to execute a given sample in another environment will result in incorrect execution. This goal is achieved through two techniques.

Host Identity-based Encryption (HIE). Before deploying a malware instance on a given system, information (based on system hardware and software) that can uniquely identify this system is collected. This information is then used to derive a key (or host ID) that will be used to encrypt certain portions of the malware instance. At runtime, the malware instance will gather the same set of information again and use it to derive a decryption key. Thus, if the instance is put into a different execution environment, decryption will fail and the sample will not exhibit malicious behavior.

A straightforward implementation of HIE involves encrypting the entire malware binary (e.g., encryption key-based packing). While such an approach could provide ideal protection for a program, this method may actually benefit the defender. As an example, once an analysis system is made aware that the sample leverages HIE, execution failure could be used to determine whether the host ID is correct, which can ease the process of brute-forcing the decryption key. A more appropriate use of this technique may involve encryption of only mechanisms critical to the malware. As an example, portions of code or data associated with the sample’s domain name generation algorithm (used to contact C&C servers) could be encrypted. Then, if decryption fails, the sample does not visibly fail; it instead attempts to resolve or connect to the wrong C&C server. The malware analysis system would in turn treat this information as real.

HIE has two major advantages. First, it uses modern cryptography, which means that knowledge of how a key is derived does not affect the integrity of the protection; unless the defender can guess the same decryption key, they cannot unlock the sample. Second, any two instances of malware will possess different decryption keys; intelligence gathered from successfully analyzing one malware instance provides no advantage in analyzing the second. As an example, some malware instances attempt to detect whether their execution is inside a virtual environment (e.g., by detecting the emulated hard drive). Once a defender discovers this particular method, they can modify their environment to mitigate attempted detections by all samples that use it. However, with HIE, even if the analyst knows sample a expects environment α, this result cannot be used to run a sample b that expects environment β.

Although it shares similarities with conventional DRM techniques (i.e., locking software to a specific environment), HIE is different in several ways. First, to prevent protection bypass, DRM systems often assume the highest privilege level on a system or utilize special hardware. Without precluding itself from portions of its target audience, HIE cannot (and does not) make the same assumptions. Second, the goal of DRM is to prevent the protections of even a single instance of copyrighted material from being broken (because the unprotected copy could then be widely distributed). In contrast, to be effective, HIE only needs to prevent the successful processing of large volumes of malware in an acceptably small time period. Finally, although it is possible to enumerate all possible host configurations and brute force a decryption key, use of a sufficiently large key space will make this process unacceptably inefficient.

Despite its advantages, host identity-based encryption is not considered sufficiently resistant to forgery. Thus, we also propose a network-based identifier that is derived at the C&C server. The combination of host and network-based keys is used by instruction set localization (ISL), a second technique that provides a malware instance running on an infected system with its actual malicious behavior.

Instruction Set Localization. Instruction set virtualization (e.g., as used in VMProtect and Code Virtualizer) is an obfuscation technique that protects software by transforming the source code, intermediate representation, or native machine code of a program into bytecode for an arbitrarily chosen instruction set architecture. At runtime, the execution semantics of the original program are fulfilled by a native interpreter bundled with the bytecode. Malware obfuscated with this technique is thus not vulnerable to memory dump or unpacking strategies, as the executable code present in the binary is bytecode representing an unknown machine language. This property makes instruction set virtualization a successor to more traditional transformations that encrypt, compress, or otherwise reversibly transform natively executable code. However, as mentioned above, this technique does not pose a significant challenge to dynamic analysis, as researchers have already developed techniques [13, 4] capable of automatically reverse engineering bytecode execution semantics.

To fix the weaknesses associated with instruction set virtualization, we propose a technique similar to HIE, whereby a virtualized instruction set is bound (or localized) to a specific environment. In this scenario, a malware instance deployed on a given system represents only an interpreter of bytecode for a virtualized instruction set. All malicious tasks, which will be requested from and provided by the C&C server, represent bytecode to be interpreted.

The interpreter’s request for a task includes the host identifier of the infected system. The C&C server combines the host identifier with a network identifier and uses this information as part of virtualizing the native code representing the malicious task. The bytecode given to the infected host will thus only run on that specific host, as determined by forgery-resistant host and network-based identifiers.

Similar to HIE, this obfuscation technique offers several guarantees. First, unless the interpreter deployment-time (or infection-time) signature matches the runtime signature, the task cannot be executed correctly due to incorrect bytecode interpretation. Second, the only way to understand the task is to correctly determine its interpretation, such as by brute-forcing the combination of host and network identifiers.

Malware instances that use ISL hold several advantages. First, they will (in general) be more extensible, in a manner similar to a Platform-as-a-Service (PaaS) model. In addition, a given instance will contain no information about its actual malicious tasks, which can complicate the identification of behaviors. Finally, botnets assembled using these instances will offer better resistance to analysis and tracking, as without access to the original infected host, researchers and security practitioners will need to forge both host and network-level identifiers.

2 Prototype Design

In this section, we present additional details on the design of host identity-based encryption (HIE) and instruction set localization (ISL).

2.1 Host Identity-based Encryption

The operational particulars of host identity-based encryption present two major challenges. First, the specifics of generating a sufficiently unique ID for the host must be decided. Second, deployment logistics must be worked out, which will include the use of intermediate code agents that determine the host ID to which the delivered malware instance will be bound.

Host ID Generation. As mentioned previously, identifier generation is a critical part of a DRM system. If a DRM scheme fails to generate an appropriately unique ID, the protections otherwise afforded are easily bypassed. For this reason, many schemes require execution at the highest privilege levels (e.g., the OS kernel or hypervisor), or special hardware capable of providing trusted information (e.g., TPM).

Identifier generation is likewise a critical part of HIE and ISL, but requires additional considerations. First, a malware author cannot assume availability of the highest privilege level because, in contrast to legitimate DRM systems, malware may execute as an unprivileged user. For similar reasons, attackers also cannot assume there will be hardware support. Thus, the information used must be retrievable without any privilege.

Second, information for identifier generation cannot be collected from a single source because, unlike in a legitimate DRM system, this information may not be trusted. Thus, to prevent brute force attacks, host identifier generation must use multiple sources to increase key space. However, the information must also be as stable as possible to minimize the introduction of false positives. As the stability of information usually decreases with size, a trade-off must be made.
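The trade-off between key space and stability can be made concrete with a rough estimate of search cost. The sketch below is illustrative only: it assumes the identifier sources are independent (so their entropy adds), and the per-source bit counts and guess rate are hypothetical placeholders, not measurements from this paper.

```python
def combined_bits(bits_per_source):
    """Total entropy in bits, assuming the sources are independent."""
    return sum(bits_per_source.values())


def expected_bruteforce_seconds(bits, guesses_per_second):
    """Expected search time: on average half of the key space is tried."""
    return 2 ** (bits - 1) / guesses_per_second


# Hypothetical entropy estimates per identifier source (placeholders only).
example_sources = {
    "username": 12,
    "computer_name": 16,
    "mac_address": 24,
    "gpu_description": 6,
}

total = combined_bits(example_sources)
# Assumed analysis capacity of one million key guesses per second.
seconds = expected_bruteforce_seconds(total, 1_000_000)
```

Each additional independent source adds its bits to the exponent, which is why combining several modest sources can outweigh any single one; the same arithmetic lets a defender judge whether enumeration is realistic for a given design.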

To the benefit of the proposed system, on current commodity systems, there exists sufficient information that is unique, invariant, and requires no privilege to obtain. As an example, on Windows systems, the following could be used to create the host-based identifier:

• The Environment Block. When a process is created, Windows stores environment information in the process’ address space. In our design we use the process owner’s username, computer name, and CPU identifier. As the environment block is directly accessible by code that executes inside a given process, this information can be obtained without API or system calls.

• MAC address. The MAC address of the NIC can be obtained from the GetAdaptersInfo API.

• GPU info. GPU information can be obtained from the GetAdapterIdentifier method of the IDirect3D9Ex interface. In our design, we use the device description.

• SID of the user. Using the token of a process, the GetTokenInformation API can be used to obtain the SID of the process’ owner. This identifier is unique across a Windows domain.

Once collected, this information can be concatenated and used as an input to a cryptographic hash function (e.g., bcrypt) to generate the host identifier, which will then be used as the encryption/decryption key. Due to time constraints, we did not examine the viability of file system or registry information to supplement the above design, but believe they are likewise a rich source of stable, uniquely identifiable information.
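The concatenate-then-hash step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it substitutes SHA-256 from the Python standard library for bcrypt, takes the attribute values as already-collected strings (no collection is performed), and uses a hypothetical function name and made-up placeholder values. Length-prefixing each field is an added assumption that keeps the concatenation unambiguous.

```python
import hashlib
from typing import Mapping


def derive_host_id(attributes: Mapping[str, str]) -> bytes:
    """Hash a set of named attributes into a fixed-size identifier.

    Fields are processed in sorted order and length-prefixed so that
    {"a": "bc"} and {"ab": "c"} cannot produce the same input stream.
    """
    digest = hashlib.sha256()
    for name in sorted(attributes):
        for field in (name, attributes[name]):
            data = field.encode("utf-8")
            digest.update(len(data).to_bytes(4, "big"))
            digest.update(data)
    return digest.digest()


# Placeholder values for illustration only.
host_a = {"username": "alice", "computer_name": "WS-01", "mac": "00:11:22:33:44:55"}
host_b = {"username": "alice", "computer_name": "WS-02", "mac": "00:11:22:33:44:55"}

id_a = derive_host_id(host_a)
id_b = derive_host_id(host_b)
```

A single differing attribute yields an unrelated identifier, which is the property the design relies on and also why instability in any one source causes failures on the legitimate host.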

Malware Deployment. As mentioned previously, an additional challenge is introduced by the need to obtain a host identifier prior to deploying a malware instance bound to that identifier. Solutions to this problem are complicated by exploit reliability concerns that mandate shellcode be as small as possible. Thus, gathering information used to produce the host identifier must instead be deferred to an intermediate downloader agent.

The use of multiple, intermediate deployment agents creates an additional hurdle. That is, if security practitioners capture exploit shellcode or the intermediate downloader, they could use these agents to obtain a malware instance bound to their analysis environment. To solve this problem, we propose using one-time URLs similar to those offered in password reset procedures. More specifically, before the shellcode or downloader is sent to the infected host, the server will assign it a unique path to download the next stage. Although the operational specifics will vary based on the attack vector (e.g., drive-by download versus email attachment), in all cases the one-time URL will also be short-lived.
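The one-time, short-lived URL semantics are the same as those of a password reset link, and can be sketched generically. The class and parameter names below are hypothetical, and the 60-second lifetime is an assumed value; the paper does not specify one.

```python
import secrets
import time


class OneTimeTokens:
    """In-memory store of single-use, expiring path tokens."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}  # token -> absolute expiry time

    def issue(self) -> str:
        """Create an unguessable token that is valid for ttl_seconds."""
        token = secrets.token_urlsafe(32)
        self._expiry[token] = self._clock() + self._ttl
        return token

    def redeem(self, token: str) -> bool:
        """Return True only on the first use of an unexpired token."""
        expires_at = self._expiry.pop(token, None)
        return expires_at is not None and self._clock() <= expires_at
```

For an analyst this means a URL recovered from captured shellcode has usually already been consumed or has expired, so replaying it from an analysis environment returns nothing useful.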

2.2 Instruction Set Localization

The design of ISL requires overcoming three major challenges: network identifier generation, transformation rule generation, and fault tolerance.

Network Identifier Generation. Network identifiers are already used by companies like Google and Facebook to prevent session hijacking. Similar to their selections, the following could be used as the network identifier:

• Geo-location. The IP address is the most straightforward candidate for the network identifier, but is not sufficiently stable. For this reason the geo-location of the IP address should be used instead at the granularity of state or province. This choice of granularity permits a certain level of mobility while maintaining the objectives of ISL.

• Autonomous System Number (ASN). In general, geo-location alone comprises a sufficient network identifier. However, as the publication of this information is not mandatory, geo-location databases can contain outdated or incorrect data. For this reason, the ASN should be used as well.

As it is collected at the C&C server, we consider information used to create the network identifier unforgeable.

Transformation Rule Generation. Ideally, an execution environment’s unique ID (the combination of the host and network identifiers) would be mapped to a unique virtual instruction set. However, as designing such a system exceeds the scope of this paper, we instead propose an alternative but equally effective approach. That is, only one virtual instruction set is used, but each malware instance receives a task decryption key derived from the unique ID when deployed. When responding to a task request, the C&C server will encrypt the task using a key derived from the malware instance’s unique ID. If there is a mismatch (e.g., in the network identifier used to create the unique ID), the decryption routine will generate invalid or incorrect bytecode that does not reveal the malicious task.

As the proposed design replaces per-host unique virtual instruction sets with task decryption keys, generation of this component merits additional discussion. Although straightforward, creating a task decryption key by simply hashing the unique ID could compromise the protections provided by ISL. In this scenario, security practitioners could leverage knowledge of the features used in unique ID generation to reproduce a key for a given execution environment, or worse, search a malware analysis database to identify samples that use the same key and successfully decrypt tasks sent to them. To prevent this potential attack, a keyed hash (e.g., HMAC) should be used instead; the corresponding key would be kept on the C&C server and known only to the botnet operator.
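The difference between the plain and keyed constructions can be shown with the Python standard library. This is a sketch under assumptions: SHA-256 and HMAC-SHA-256 are chosen for illustration (the paper names HMAC only as an example), and the function names and byte strings are hypothetical.

```python
import hashlib
import hmac


def plain_key(unique_id: bytes) -> bytes:
    """Unkeyed derivation: anyone who knows unique_id can recompute it."""
    return hashlib.sha256(unique_id).digest()


def keyed_key(server_secret: bytes, unique_id: bytes) -> bytes:
    """Keyed derivation: recomputation also requires server_secret."""
    return hmac.new(server_secret, unique_id, hashlib.sha256).digest()


uid = b"host-id|network-id"  # placeholder unique ID
analyst_guess = plain_key(uid)  # reproducible from known features alone
operator_key = keyed_key(b"secret-held-server-side", uid)
wrong_secret = keyed_key(b"some-other-secret", uid)
```

With the unkeyed form, reproducing the features reproduces the key; with the keyed form, the same features under a different secret give an unrelated key, so the derivation cannot be repeated offline.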

3 Discussion

Operational Security. Both HIE and ISL are implemented using modern cryptography and thus are immune to knowledge of how keys are generated; the only way to break their protections is to derive the correct keys. As security organizations automatically analyze malware in environments separate from those originally infected, derivation of the correct keys requires searching through the entire key space, which is of non-trivial size. Moreover, some configuration information (i.e., that used to derive the network identifier) may be impossible to duplicate.

Another advantage of HIE and ISL is that they are insensitive to analysis techniques. That is, regardless of the employed analysis granularity (e.g., fine-grained dataflow analysis used in [16] or high-level, blackbox network intelligence collection), the resistance offered by HIE and ISL can be broken only if the configuration parameters of the original execution environment are successfully matched.

Potential Countermeasures. One straightforward idea for bypassing the protections provided by HIE and ISL is to analyze samples in the original environments they infected. While such an approach may work for samples collected by high-interaction honeypots, for a variety of practical reasons the use of this method is not feasible for other sources. Challenges include monitoring system capability limitations (e.g., of low-interaction honeypots), legal and privacy considerations, and impact on business operations and continuity (e.g., for client submissions). As samples collected by high-interaction honeypots represent only a small portion of all collected samples, the effectiveness of this approach is limited.

An alternative to analyzing malware on the systems originally infected is the collection and duplication of host and network-level environment information. However, for similar (though perhaps less significant) policy and privacy reasons, the implementation of this idea would face significant hurdles. Moreover, even if the host identifier can be successfully forged, duplication of the correct network identifier would require analysis system deployment on an unprecedented, globally cooperative scale.

Another potential countermeasure is to record and collect the network activity between an infected host and the C&C server, then replay that communication during analysis. Without one additional protection, this approach would bypass ISL and could be combined with attacks or cooperative efforts to forge the host identifier. However, the use of SSL/TLS for C&C communication hinders this response, as the recorded traffic is encrypted and cannot simply be replayed.

Finally, the very manner by which HIE and ISL protect a malware instance could be leveraged by the security community to create instability in a set of host or network identifiers and thus prevent successful or correct execution (i.e., the allergy attack [3]). However, this countermeasure could also make legitimate software systems that use the same information equally unreliable. As such, the success of this response may depend on the willingness of users to accept security over usability.

4 Concept Integration by Malware

At the inception of this paper, concerns were raised about the extent to which malware authors would adopt techniques like HIE and ISL; the rise of the Flashback botnet supports the practicability of their use in real-world malware.

The Flashback Botnet. In September 2011, Flashback emerged as malware that targeted Mac OS X. By April 2012, the botnet representing Flashback-infected systems had grown to over 600,000 Macintosh systems. This size makes Flashback the most considerable threat to target Mac OS X thus far.

While its size and propagation activities are interesting, the operational properties of Flashback malware reinforce the concerns expressed in this paper. To elaborate, the initial Flashback agent connects to its C&C server and downloads one or more additional payloads (e.g., those that can illegally monetize the victim’s use of search engines). When requesting a payload, the agent submits the hardware UUID of the infected system. This value is then hashed (via MD5) to create a key that, in combination with the RC4 stream cipher, binds the payload to that system. Like HIE, unless the hardware UUID of the system matches the one used to create the payload, it will not execute successfully.
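From an analyst's perspective, this binding means a captured payload can be recovered only when the originating system's hardware UUID is known. The sketch below illustrates the MD5-then-RC4 construction in general terms; how Flashback encodes the UUID before hashing and which payload regions are encrypted are not stated here, so the ASCII encoding, function names, and sample values are assumptions.

```python
import hashlib


def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4; encryption and decryption are the same operation."""
    state = list(range(256))
    j = 0
    for i in range(256):  # key-scheduling algorithm
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    out = bytearray()
    i = j = 0
    for byte in data:  # pseudo-random generation, XORed with the data
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


def uuid_bound_key(hardware_uuid: str) -> bytes:
    """MD5 of the UUID string (ASCII encoding is an assumption)."""
    return hashlib.md5(hardware_uuid.encode("ascii")).digest()


# Placeholder UUIDs and data for illustration only.
original = "00000000-0000-1000-8000-000000000001"
other = "00000000-0000-1000-8000-000000000002"
blob = rc4(uuid_bound_key(original), b"example payload bytes")
recovered = rc4(uuid_bound_key(original), blob)
garbage = rc4(uuid_bound_key(other), blob)
```

Because the key depends on a single, fixed-format value, an analyst who obtains the UUID alongside the sample can reproduce the key directly, which is one reason this scheme is weaker than the multi-source host identifier proposed for HIE.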

While Flashback’s use of a system’s hardware UUID bears some similarities to HIE, other aspects of the malware’s operation are less sophisticated. For example, Flashback’s use of encryption does not optimally protect its domain name generation algorithm, which is thus easier to reverse engineer. In addition, communication between a Flashback-infected system and the C&C server is in plaintext, which permits impersonation of the C&C.

Despite its other shortcomings, Flashback demonstrates that malware authors have already begun using protections like those described in this paper. Security researchers must therefore prepare for a future of malicious software that will only run on the systems it originally infects.

5 Related Work

Previous security research has included the design of analysis-resistant malware. As an example, Sharif et al. introduced the design and implementation of a technique called conditional code obfuscation [12], which encrypts trigger-based code with a key derived from an input that would activate the trigger. In a manner somewhat conceptually similar to HIE and ISL, code protected by this mechanism cannot be decrypted unless the threat analyst can derive an input that would activate the trigger.

More recently, Dunn et al. proposed using TPM to cloak the execution of malware [6]. Like HIE, this work also proposes the idea of encrypting a payload with a key specific to a particular host. In contrast to our work (which ensures that malware is not executed in an analysis environment), the goal of Dunn’s research is to make host-level monitoring of malware infeasible.

6 Conclusion

In this paper we proposed two obfuscation techniques, host identity-based encryption (HIE) and instruction set localization (ISL), that make the successful execution of a malware sample dependent on the unique properties of the original host it infects. Going forward, researchers must include ways to mitigate these protections or examine alternatives to threat detection and analysis. To highlight the current and future importance of the associated concerns, we discussed the Flashback botnet’s use of a similar technique to prevent the automated analysis of its samples.

7 Acknowledgements

The authors would like to thank the anonymous reviewers for their feedback. This material is based upon work supported in part by the National Science Foundation grant 0831300, the Department of Homeland Security contract FA8750-08-2-0141, and the Office of Naval Research grants N000140710907 and N000140911042. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, the Department of Homeland Security, or the Office of Naval Research.

References

[1] BRUMLEY, D., HARTWIG, C., LIANG, Z., NEWSOME, J., SONG, D., AND YIN, H. Automatically identifying trigger-based behavior in malware. Botnet Detection (2008), 65–88.

[2] CABALLERO, J., POOSANKAM, P., KREIBICH, C., AND SONG, D. Dispatcher: Enabling Active Botnet Infiltration using Automatic Protocol Reverse-Engineering. In Proceedings of the 16th ACM Conference on Computer and Communication Security (2009), CCS ’09.

[3] CHUNG, S., AND MOK, A. Allergy attack against automatic signature generation. In Proceedings of the 8th International Symposium on Recent Advances in Intrusion Detection (2006).

[4] COOGAN, K., LU, G., AND DEBRAY, S. Deobfuscation of virtualization-obfuscated software: a semantics-based approach. In Proceedings of the 18th ACM conference on Computer and Communications Security (2011).

[5] DINABURG, A., ROYAL, P., SHARIF, M., AND LEE, W. Ether: malware analysis via hardware virtualization extensions. In Proceedings of the 15th ACM conference on Computer and Communications Security (2008).

[6] DUNN, A. M., HOFMANN, O. S., WATERS, B., AND WITCHEL, E. Cloaking malware with the trusted platform module. In Proceedings of the 20th USENIX conference on Security (2011).

[7] KIRDA, E., KRUEGEL, C., BANKS, G., VIGNA, G., AND KEMMERER, R. Behavior-based spyware detection. In Proceedings of the USENIX Security Symposium (2006).

[8] KOLBITSCH, C., COMPARETTI, P., KRUEGEL, C., KIRDA, E., ZHOU, X., AND WANG, X. Effective and efficient malware detection at the end host. In Proceedings of the 18th conference on USENIX security symposium (2009).

[9] PALEARI, R., MARTIGNONI, L., ROGLIA, G., AND BRUSCHI, D. A fistful of red-pills: How to automatically generate procedures to detect cpu emulators. In Proceedings of the 3rd USENIX conference on Offensive technologies (2009).

[10] ROYAL, P., HALPIN, M., DAGON, D., EDMONDS, R., AND LEE, W. Polyunpack: Automating the hidden-code extraction of unpack-executing malware. In Proceedings of the 22nd Computer Security Applications Conference (ACSAC) (2006).

[11] SAITO, Y. Jockey: a user-space library for record-replay debugging. In Proceedings of the sixth international symposium on Automated analysis-driven debugging (2005).

[12] SHARIF, M., LANZI, A., GIFFIN, J., AND LEE, W. Impeding malware analysis using conditional code obfuscation. In Network and Distributed System Security (NDSS) (2008).

[13] SHARIF, M., LANZI, A., GIFFIN, J., AND LEE, W. Automatic reverse engineering of malware emulators. In Proceedings of the 30th IEEE Symposium on Security and Privacy (2009).

[14] VASUDEVAN, A., AND YERRABALLI, R. Cobra: Fine-grained malware analysis using stealth localized-executions. In Security and Privacy, 2006 IEEE Symposium on (2006).

[15] YIN, H., LIANG, Z., AND SONG, D. HookFinder: Identifying and understanding malware hooking behaviors. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08) (2008).

[16] YIN, H., SONG, D., EGELE, M., KRUEGEL, C., AND KIRDA, E. Panorama: Capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM conference on Computer and Communications Security (2007), CCS ’07.
