June 2014
Payload Security UG (haftungsbeschränkt) [email protected], www.payload-security.com
Hybrid Analysis - NextGen Technology for Advanced Malware
As malware evolves, the era of pure dynamic analysis systems is coming to an end.
What potential does Hybrid Analysis have?
by Jan Miller ([email protected])
What you will learn…
About automated malware analysis challenges
What Hybrid Analysis is about
Why Hybrid Analysis is part of a successful strategy

What you should know…
Basic knowledge of x86 Assembly
Basic knowledge of Malware Analysis Systems
Table of Contents
Introduction
Terminology
  Static Analysis
  Dynamic Analysis
  Dormant Code
  Hybrid Analysis
Hybrid Analysis in Action
  Tools
    VirtualBox
    StaticStream
    Dynamic Analysis Tools
  Hybrid Analysis vs. Matsnu Trojan
  Conclusion
Summary
  About the Tools
  On the Web
  About the author
Table of Figures
Bibliography
Introduction
The Internet connects a wide range of personal computers used for private and business purposes. Many
of them run Microsoft Windows on x86-compatible architectures, with Windows holding about 90% of the
desktop market (NetMarketShare, 2014). Such monocultures are an extremely attractive
environment for malware attacks. Today, malware often appears in the form of highly
complex Trojan systems that come with exploit kits and very sophisticated anti-detection measures. Both
the number of infections and the industry's awareness of them are larger than ever: there are about 4
million new infections per month (SecureList, 2014). The worm MyDoom.X alone caused damages of
about $38.5 billion, and that was in 2006 (Borglund, 2014). Lately, partly due to the NSA scandal,
awareness of IT security has grown considerably and IT security is attracting heavy investment.
Classical malware detection methods were based on pure static code analysis, such as finding a specific
byte pattern and matching it against a known database of “malicious signatures”. Static analysis can be
described (in the most general sense) as code analysis without execution of the target payload. In turn,
malware authors started releasing packed/encrypted or even polymorphic software that rendered
classical methods worthless. Consequently, anti-virus (AV) vendors, CERTs/CIRTs and malware
researchers started developing and using dynamic analysis systems. Dynamic analysis can be described
(in the most general sense) as code analysis during execution or emulation of the target payload. This
was a huge step in evolution, because when the execution environment is instrumented appropriately, it
allows the observer to see the target software behavior after the malware unpacks its security layers.
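The classical signature matching described above, and why packing defeats it, can be illustrated with a minimal sketch. The signature names, byte patterns and the single-byte XOR "packer" are invented for illustration and do not correspond to any real AV database or packer.

```python
from typing import Dict, List

# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES: Dict[str, bytes] = {
    "demo.dropper": bytes.fromhex("deadbeef"),
    "demo.keylogger": b"\x90\x90\xcc\xcc",
}


def match_signatures(payload: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all signatures whose byte pattern occurs in payload."""
    return sorted(name for name, pattern in signatures.items() if pattern in payload)


def xor_pack(payload: bytes, key: int) -> bytes:
    """Toy single-byte XOR transformation standing in for real packing or encryption."""
    return bytes(b ^ key for b in payload)


sample = b"MZ" + b"\x00" * 8 + bytes.fromhex("deadbeef") + b"\x00" * 4
print(match_signatures(sample, SIGNATURES))                  # pattern is visible
print(match_signatures(xor_pack(sample, 0x5A), SIGNATURES))  # pattern is hidden
```

The same bytes are still present after the transformation is reversed at run time, which is exactly the point at which a dynamic analysis system gets to see them.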
Today, dynamic analysis systems run the target software in virtual environments with hardware
acceleration support (such as VMware or VirtualBox) in order to observe the malware behavior at
runtime. These often automated systems are called “Sandbox” analysis systems, as they represent an
isolated execution environment for malware that simulates a real victim’s machine¹. Using systems such
as VirtualBox, the virtual machine (VM) can be restored to a clean state by loading predefined
snapshot files, which allows numerous malware samples to be executed in sequence without manually
cleaning up the infected machine. Of course, malware authors have adapted to the growth of Sandbox
systems and introduced a variety of VM detection methods. If a VM environment can be detected, the
malware may behave differently than it would in the wild and not show its true behavior. The
unobserved malicious functionality is what we call dormant code. These evasion techniques range from
delayed execution (so-called “time bombs”) to complex system/hardware state detection methods.
For example, if the real payload is not executed within a reasonable amount of time, the analysis
system will give up on the analysis and potentially miss valuable information. Thus, dormant code
detection is a vital capability for Sandbox systems. Analysis results get even better when dormant code
is analyzed in depth using runtime context information.
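The snapshot-based reset cycle mentioned above could be scripted roughly as follows. This is a sketch under assumptions: the VM name and snapshot name are hypothetical, the VBoxManage command-line tool is assumed to be on the PATH, and the actual sample submission and monitoring step is only indicated by a comment.

```python
import subprocess
import time
from typing import List


def build_cycle_commands(vm_name: str, snapshot: str) -> List[List[str]]:
    """Commands for one cycle: restore the clean snapshot, boot the VM, power it off."""
    return [
        ["VBoxManage", "snapshot", vm_name, "restore", snapshot],
        ["VBoxManage", "startvm", vm_name, "--type", "headless"],
        ["VBoxManage", "controlvm", vm_name, "poweroff"],
    ]


def run_cycle(vm_name: str, snapshot: str, analysis_seconds: float,
              dry_run: bool = True) -> List[List[str]]:
    """Run (or, with dry_run, only return) the commands of one analysis cycle."""
    restore, start, poweroff = build_cycle_commands(vm_name, snapshot)
    if not dry_run:
        subprocess.run(restore, check=True)
        subprocess.run(start, check=True)
        time.sleep(analysis_seconds)  # sample execution and monitoring would happen here
        subprocess.run(poweroff, check=True)
    return [restore, start, poweroff]


# Hypothetical VM and snapshot names; dry_run only prints what would be executed.
for command in run_cycle("analysis-xp", "clean-state", 120):
    print(" ".join(command))
```

Keeping the dry-run mode as the default makes it possible to review the commands before anything touches a real VM.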
Combining static and dynamic analysis (commonly called Hybrid Analysis) in a fully automated,
scalable and performant analysis environment is the next generation of malware forensics and detection
algorithms. In this article, we will look at which dynamic analysis data is necessary to understand
dormant code and how we can combine it with static analysis to extract in-depth behavior information.
1 Executing malware on a prepared physical machine is possible as well, of course.
Terminology
This chapter outlines the most important terms so that all readers share the same vocabulary when the
terms are used later in the article.
Static Analysis
Static analysis can be described in the most general sense as code analysis without execution of the
target payload. The target code (the analysis input data) may be a compiled binary file or a
human-readable format, such as program source code, scripting language files or any other type of machine
code representation. N. Ayewah et al. define static analysis as a method that “(…) examines code in the
absence of input data and without running the code, and can detect potential security violations (…),
runtime errors (…) and logical inconsistencies (…).” (Nathaniel Ayewah, David Hovemeyer, J. David
Morgenthaler, John Penix and William Pugh, 2008).
Dynamic Analysis
Dynamic analysis can be described in the most general sense as code analysis during execution or
emulation of the target payload. The techniques involved are usually implemented by tools such as
execution visualizers, system observation tools (e.g. malicious behavior detection, intrusion detection,
performance observation, etc.), profilers or other types of behavior analysis tools (e.g. sandbox
systems). Dynamic analysis is typically performed by instrumenting the
target code or its host (i.e. instrumenting the operating system to enable system-level profiling of the
suspect application) in order to profile the target code’s behavior (Kendall, 2007). Instrumentation
refers to techniques that insert additional code for analysis purposes (instrumentation code) into the
target code in order to measure performance, detect bugs or intercept code flow and
analyze certain behavior patterns. In malware analysis, behavior patterns are often the most interesting.
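The idea of instrumentation can be shown with a small, language-level analogy: a function is replaced by a wrapper that records each call and then forwards it to the original. This is only a sketch of the principle; real monitors hook native API functions inside the target process, and the helper name install_hook is our own invention.

```python
import functools
import os
from typing import Any, Callable, List, Tuple

# Every intercepted call is appended here as (function name, positional arguments).
CALL_LOG: List[Tuple[str, tuple]] = []


def install_hook(module: Any, name: str) -> Callable[[], None]:
    """Replace module.<name> with a logging wrapper; return a function that removes the hook."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        CALL_LOG.append((name, args))      # instrumentation code: record the call
        return original(*args, **kwargs)   # then continue with the original behavior

    setattr(module, name, wrapper)
    return lambda: setattr(module, name, original)


unhook = install_hook(os, "getcwd")
os.getcwd()   # this call is observed
unhook()
os.getcwd()   # this call is no longer observed
print(CALL_LOG)
```

The wrapper is transparent to the caller, which is the property that makes hooks useful for behavior logging and also the reason malware tries to detect them.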
Dormant Code
Dormant code or dormant functionality in malicious programs is payload/code that is not observed
during dynamic analysis. In the context of malware, dormant code (not to be confused with “software
rot”) may hide very interesting behavior that was not executed during analysis for some reason
(e.g. due to virtual machine detection, a command and control server not being available, a long initial
sleep delay, etc.). We can say that every pure dynamic analysis reporting “no malicious behavior”
always leaves some dormant code unexamined (as the executed code coverage is never 100%), and
sometimes that dormant code is malicious. As the “false negative” case (i.e. classifying something as
clean that is not) is to be avoided at all costs, it makes sense to invest resources into detecting dormant
code. This can be achieved, for example, by adding a static analysis layer that operates on memory snapshots.
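One simple way to think about dormant code is as the difference between the code a static pass finds in a memory snapshot and the code that was actually seen executing. The following sketch assumes both views are available as basic-block start addresses; the addresses and sizes are invented, and real engines work with far richer data.

```python
from typing import Dict, Set


def find_dormant_code(static_blocks: Dict[int, int], executed: Set[int]) -> Dict[int, int]:
    """Return the statically discovered blocks (start address -> size) that never executed."""
    return {addr: size for addr, size in static_blocks.items() if addr not in executed}


def coverage(static_blocks: Dict[int, int], executed: Set[int]) -> float:
    """Fraction of statically discovered code bytes that were executed."""
    total = sum(static_blocks.values())
    hit = sum(size for addr, size in static_blocks.items() if addr in executed)
    return hit / total if total else 0.0


# Hypothetical block list from a static pass and hypothetical execution trace.
blocks = {0x401000: 16, 0x401010: 32, 0x401030: 16}
trace = {0x401000}

dormant = find_dormant_code(blocks, trace)
print([hex(addr) for addr in sorted(dormant)])
print(coverage(blocks, trace))
```

A coverage value well below 1.0 on a sample that showed no malicious behavior is a hint that the verdict rests on incomplete evidence.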
On a side note: process memory contents change constantly. Thus, it is necessary to take memory
snapshots at an intelligently chosen point in time or with a high frequency to “catch” e.g. unpacked code or
injected shellcode. In a “perfect” world with quantum processors, an analysis system would be able
to observe any memory change and instantly analyze the entire process address space for all potentially
executable code locations without impacting performance. Unfortunately, we do not have
quantum computers and therefore need to rely on heuristics and shortcuts, which leaves room for mistakes.
For example, analysis systems that process thousands of files per day have to abide by an analysis time
limit. If nothing happens within the first ~5-10 minutes, the system moves on to the next file and
heuristics have to do the job. Thus, the better the underlying algorithms and the overall
performance of the system are, the more files can be analyzed completely and with fewer
errors. Of course, scalable systems and a lot of hardware can compensate for poor implementations to
some degree, but real-world hardware always has limits and other bottlenecks surface
on large parallel systems. In other words, quality starts at the lowest level, while keeping a flexible
architecture in mind.
Hybrid Analysis
Hybrid Analysis (HA) is what we call the intelligent combination of static and dynamic analysis. It is a
technology or method that integrates run-time data extracted from dynamic analysis into a static
analysis algorithm in order to detect behavior or malicious functionality that would otherwise be hard
to find. Often, the dynamic “helper data” consists of memory snapshots and runtime API symbol data
(memory reference address values), which are fed as input to a sophisticated static analysis engine
(possibly including data flow analysis). For example, if a dormant code sequence executes an indirect
call, it is not possible to resolve the called function address without knowing the value read from a
memory location at the point in time of execution². Even if we knew the value, it would not be possible
to associate the called function address with a system call if a mapping of memory references to symbol
information were not available for the specific execution environment³.
2 Using a memory snapshot from a later point in time is possible as well, if the value remains unchanged.
3 The “specific execution environment” qualifier is important, because techniques such as ASLR (address space layout
randomization) make system API function addresses unpredictable. As such, we always need to understand detected dormant code in the process context of a specific execution environment.
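The indirect call example can be made concrete with a short sketch. It assumes a 32-bit memory snapshot with a known base address and a symbol map captured at run time; the snapshot contents, the pointer location and the API address are invented for illustration.

```python
import struct
from typing import Dict, Optional


def read_dword(snapshot: bytes, base: int, address: int) -> Optional[int]:
    """Read a little-endian 32-bit value at a virtual address from a memory snapshot."""
    offset = address - base
    if offset < 0 or offset + 4 > len(snapshot):
        return None
    return struct.unpack_from("<I", snapshot, offset)[0]


def resolve_indirect_call(snapshot: bytes, base: int, pointer_address: int,
                          symbols: Dict[int, str]) -> Optional[str]:
    """Resolve `call dword ptr [pointer_address]` to an API name, if both lookups succeed."""
    target = read_dword(snapshot, base, pointer_address)
    if target is None:
        return None                # the pointer lies outside the snapshot
    return symbols.get(target)     # None if no symbol is known for the target address


# Hypothetical data: a 32-byte snapshot at base 0x3730000 holding one function pointer,
# and a symbol map as it might be logged for one specific execution environment.
BASE = 0x3730000
snap = bytearray(0x20)
struct.pack_into("<I", snap, 0x10, 0x77DD1234)
SYMBOLS = {0x77DD1234: "ADVAPI32.dll!RegCreateKeyExW"}

print(resolve_indirect_call(bytes(snap), BASE, BASE + 0x10, SYMBOLS))
```

Both inputs are required: without the snapshot the pointer value is unknown, and without the symbol map of that particular run the value is just a number.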
Hybrid Analysis in Action
The previous chapter briefly outlined Hybrid Analysis and its associated terms. In this chapter we
apply Hybrid Analysis techniques to an example malware sample and evaluate the results in order to
look at the practical side of the topic.
Tools
Before we get to the experimental results, the tools involved are outlined briefly.
VirtualBox
For our example malware analysis, we will be using VirtualBox as our preferred virtual machine
environment. On its main page, Oracle states that “VirtualBox is a powerful x86 and AMD64/Intel64
virtualization product for enterprise as well as home use. Not only is VirtualBox an extremely feature
rich, high performance product for enterprise customers, it is also the only professional solution that is
freely available as Open Source Software under the terms of the GNU General Public License (GPL)
version 2.” (VirtualBox) Sounds good? It is good. Definitely good enough to show what HA is about.
StaticStream
StaticStream is our preferred static analysis engine, as it can take dynamic data (such as memory
snapshots and symbol data) and combine it using HA technology. The product webpage describes it as
follows: “StaticStream is a high-performance static analysis engine that is written in C++ and can
analyze x86 PE files, memory dumps or shellcode. It uses a novel approach of combining dynamic data
with state of the art static analysis techniques in order to detect and understand dormant code. It offers
a wide range of configuration options and regular updates.” (Payload Security)
Dynamic Analysis Tools
For run-time data capturing we are going to use the AREE (Automatic Reverse Engineering Engine)
Manager and Monitor binaries. These are two in-house tools used at Payload Security to generate
dynamic data when running malware. They work similarly to the Cuckoo Sandbox monitor library
“CuckooMon” in the sense that they detour calls at the application level. The Manager is used
to load configuration data and start the analysis. The Monitor is a DLL file that is injected into the initial
malware process, where it applies user-level hooks to catch system API calls. Also, whenever the malware
tries to inject itself into another process (e.g. using a remote thread or other techniques), the Monitor is
applied to the new target process. In order for our experiment to be successful, injected shellcode,
memory dumps, process context (loaded modules, registry accesses, mutants, etc.) and symbol
information (module exports) are logged before the malware is able to modify/taint the data. Why did
we use our own tools? Simply because the generated dynamic data has
a format that StaticStream understands, which makes it easier to show how HA works.
If you want to replicate our experiment and try out the tools, feel free to contact us.
Hybrid Analysis vs. Matsnu Trojan
Now that we know about the tools involved, let us take a look at real malware and see HA in
action. For our “experiment”, we decided to use a Trojan called Matsnu⁴ that encrypts files on the target
drive in order to hold the unencrypted data for ransom. These are the steps we will be taking:
1. Install a VirtualBox instance with a typical OS, such as Windows XP
2. Load the Matsnu sample on the virtual machine drive
3. Run the Matsnu sample using AREEv2Mgr and inject the AREEv2Mon monitor library
4. Let the analysis run for a couple of seconds (it is enough) and grab the generated run-time data
5. Take the grabbed run-time data and use it to analyze memory snapshots using HA technology
6. Evaluate the results and draw a conclusion
First, let us install Windows XP and load Matsnu on the main drive. The following screenshot shows the
system after setup shortly before an analysis.
Figure 1: Start Screen after Installing Windows XP and loading “matsnu” on the main drive
4 SHA-256 e008e161cce090242262fc977b6fe707d3058cdaa3b5d5c3bab24c8c6b05ce9e
As we can see, there is a “shared folder” (release) open with the Manager ready to start the Matsnu
application. We also notice that Matsnu uses a PDF icon in order to mislead the Windows user into
thinking it is a document and not an executable. As file extensions are hidden by default, we
cannot tell at first sight that it is an executable.
In the next screenshot we see the Manager open, using the command “.run C:/Matsnu” to start the
analysis manually. There is also a command-line interface, but that is not outlined here.
Figure 2: Running “matsnu” from the Manager using the interactive mode
At this point we can already observe an output folder “AREE” that has been created on the C: drive. It
will contain all the dynamic analysis information. Also, the Matsnu file is missing. Checking the captured
files in the “AREE” folder, we find that this is implemented using a dynamically created batch file, which
deletes itself after deleting the original file “Matsnu.exe” on the C: drive. The batch file is executed
from a duplicated process so that the original file is not in use by the OS. This is the batch file content:
:l
if not exist "C:\Matsnu.exe" goto e
del /Q /F "C:\Matsnu.exe"
goto l
:e
del /Q /F "C:\DOCUME~1\mjkdmjmj\APPLIC~1\5176313.bat"
All in all, the malicious process duplicates itself upon startup and deletes the original file, but
continues to run. From the user’s point of view the PDF file has simply disappeared, and the malware
authors probably assume that the user will continue with daily business without giving further thought
to what happened.
After running the sample for a couple of seconds, we abort the analysis, quit the VM and take a look at
the captured dynamic data. This is what the dynamic data folder looks like.
Figure 3: Dynamic Data Folder
The “api” folder contains system calls and parameters, the “bin” folder contains captured files (e.g. the
*.bat file mentioned above), the “ctx” folder contains environment data (such as loaded modules, their
symbols, registry accesses, etc.), the “dmp” folder contains memory snapshots of multiple frames and
the “shc” folder contains extracted shellcodes. The “monprocs.csv” file contains an overview of all
monitored processes. In this case, the contents are similar to the following (reduced version):
15539444-00013192,"INJECT_NEW","c:\Matsnu.exe","\Device\HarddiskVolume1\Matsnu.exe","<date>"
15540015-00013280,"INJECT_EXISTING","C:\WINDOWS\system32\cmd.exe","\Device\HarddiskVolume1\WINDOWS\system32\cmd.exe","<date>"
15540115-00001528,"INJECT_EXISTING","C:\WINDOWS\Explorer.EXE","\Device\HarddiskVolume1\WINDOWS\explorer.exe","<date>"
We quickly see that Matsnu first runs the batch file and then injects itself into “explorer.exe” where it
remains to execute most of its payload. This makes manual debugging with e.g. OllyDbg more difficult.
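The process overview shown above lends itself to simple post-processing. The sketch below parses the three sample lines with Python's csv module; the column names (identifier, action, image path, device path, start date) are our interpretation of the fields and not a documented format.

```python
import csv
import io
from typing import List, NamedTuple


class MonitoredProcess(NamedTuple):
    # Field names are assumptions derived from the sample lines.
    ident: str
    action: str
    image_path: str
    device_path: str
    started: str


def parse_monprocs(text: str) -> List[MonitoredProcess]:
    """Parse monprocs.csv-style text into records, skipping malformed rows."""
    reader = csv.reader(io.StringIO(text))
    return [MonitoredProcess(*row) for row in reader if len(row) == 5]


SAMPLE = r'''15539444-00013192,"INJECT_NEW","c:\Matsnu.exe","\Device\HarddiskVolume1\Matsnu.exe","<date>"
15540015-00013280,"INJECT_EXISTING","C:\WINDOWS\system32\cmd.exe","\Device\HarddiskVolume1\WINDOWS\system32\cmd.exe","<date>"
15540115-00001528,"INJECT_EXISTING","C:\WINDOWS\Explorer.EXE","\Device\HarddiskVolume1\WINDOWS\explorer.exe","<date>"
'''

procs = parse_monprocs(SAMPLE)
injected = [p.image_path for p in procs if p.action == "INJECT_EXISTING"]
print(injected)
```

Listing the processes marked INJECT_EXISTING gives a quick view of where the sample migrated to, which in this case includes the explorer.exe process.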
Consequently, we first analyze the memory dump files (ignoring all system files) from the
explorer.exe process using symbol memory references and module information as “context
information”, which is one of the ideas of Hybrid Analysis. Specifically, we let StaticStream
analyze the last frame of the process (i.e. the last “dump” we logged before quitting the VM), because it
often contains already unpacked code sequences. The following shortened StaticStream output shows
the run, which passes about 1.66 million instructions including data flow analysis in about 3.3 seconds:
Welcome to AREE v2.1
Starting analysis ...
Adding undefined memory file 15540115-00001528.00000002.15561486.2B90000.00000040.mdmp (POI: 0, Executable: 1) for later analysis
…
Found a hidden PE file in memory file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp at 3730000
…
Analyzing in-memory binary file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp
Analyzing 1 exports
1 of 1 exports accepted
No packed files could be detected
…
Running heuristic scan on binary file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp
…
Generating final analysis report
Number of passed instructions: 1660669
Finished analysis in 3276 ms with a throughput of 445 KB/s
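The “hidden PE file” message in the output above refers to an executable image found inside a memory dump. How StaticStream detects such images is not described here; a common generic approach, sketched below, is to search for an “MZ” header whose e_lfanew field (at offset 0x3C) points to a “PE\0\0” signature. The test dump is synthetic.

```python
import struct
from typing import List


def find_pe_headers(dump: bytes) -> List[int]:
    """Return dump offsets where an MZ header points to a valid PE signature."""
    hits = []
    pos = dump.find(b"MZ")
    while pos != -1:
        if pos + 0x40 <= len(dump):                      # room for a full DOS header
            e_lfanew = struct.unpack_from("<I", dump, pos + 0x3C)[0]
            pe = pos + e_lfanew
            if dump[pe:pe + 4] == b"PE\x00\x00":
                hits.append(pos)
        pos = dump.find(b"MZ", pos + 1)
    return hits


# Synthetic dump: filler bytes, a minimal MZ/PE header pair, then a stray "MZ" string.
header = bytearray(0x80)
header[0:2] = b"MZ"
struct.pack_into("<I", header, 0x3C, 0x40)
header[0x40:0x44] = b"PE\x00\x00"
dump = b"\xcc" * 0x100 + bytes(header) + b"MZ junk"

print([hex(offset) for offset in find_pe_headers(dump)])
```

Adding the dump's base address to a hit offset yields the virtual address of the image, which is how a location such as the one reported in the output above can be expressed.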
This is an excerpt of what one output folder with stream files containing disassembly listings looks like
(human-readable output is the default):
Figure 4: Streams Folder File Listing
Browsing some of the stream files by hand quickly reveals that one portion of the streams contains
encrypted payload and another portion contains unencrypted payload. Here are some of the more
interesting functions that could be post-processed to generate behavior signatures or used as
an entry point for additional manual analysis:
Figure 5: Persistence using RegCreateKeyEx
The above “code sequence” (or “Stream”) shows the call to RegCreateKeyExW in ADVAPI32.dll that
would otherwise not be detected using pure static analysis, as the indirect call memory reference could
not be resolved. In this case, a registry key was created and a registry value was set during
execution, as indicated by the dynamic analysis registry logfile (i.e. the associated code sequence is not
dormant code):
Figure 6: Persistence using Registry
Converting the hex values to ASCII reveals the following path:
C:\Documents and Settings\mjkdmjmj\Application Data\Microsoft\qfpvideo.exe
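The hex-to-ASCII conversion is a one-liner in most languages. In the following sketch the hex string is constructed by us from the first characters of the path for illustration; it is not copied from the registry log in Figure 6, and single-byte ASCII encoding is an assumption.

```python
def hex_to_ascii(hex_dump: str) -> str:
    """Decode a hex dump (whitespace allowed) into text, dropping trailing NUL bytes."""
    raw = bytes.fromhex("".join(hex_dump.split()))
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")


# Illustrative value: the bytes of "C:\Documents" followed by a terminating NUL.
logged_value = "43 3A 5C 44 6F 63 75 6D 65 6E 74 73 00"
print(hex_to_ascii(logged_value))
```

If a logged value turned out to be stored as UTF-16, every second byte would be zero and the decoding step would need to use "utf-16-le" instead.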
Matsnu evidently tries to survive a reboot by adding itself to an auto-start registry key, which is a very
common technique. Checking more streams, we quickly found another interesting entry point: the
function that encrypts the Command & Control (C&C) server requests before sending the data over an
alternate HTTP connection.
Figure 7: Encrypting Payload before C&C request
The code location above is a good starting point to check cross-references and intercept the creation of
the encryption key (of course, this requires a flexible monitor system). Also, please note that a run-time
capturing mechanism located at the kernel level would not be able to capture the
unencrypted data without hooking into user mode and becoming detectable again.
Today, more and more malware is using encrypted traffic (not only HTTPS, but the payload itself being
encrypted as well), making it necessary to move closer to the malware code itself, as
encryption/decryption of important system data happens at the application level.
On a side note, the HA technology also revealed the following C&C server IP addresses using the
alternate HTTP port 8080:
50.31.146.134:8080 204.197.254.94:8080 78.129.181.191:8080 27.124.127.10:8080 173.203.112.215:8080
50.97.99.2:8080 103.25.59.120:8080 5.135.208.53:8080 50.31.146.109:8080 204.93.183.196:8080
… and a lot more interesting dormant code sequences, which are not outlined here.
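Endpoint lists like the one above are typical input for generating network indicators. As a minimal sketch, the following code extracts and validates IP:port pairs from free text using a regular expression and the standard ipaddress module; the function name and the text variable are our own.

```python
import ipaddress
import re
from typing import List, Tuple

# Dotted-quad address followed by a colon and a port number.
ENDPOINT_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def extract_endpoints(text: str) -> List[Tuple[str, int]]:
    """Return all syntactically valid (IPv4 address, port) pairs found in text."""
    endpoints = []
    for host, port in ENDPOINT_RE.findall(text):
        try:
            ipaddress.IPv4Address(host)   # rejects values such as 999.1.1.1
        except ValueError:
            continue
        if 0 < int(port) < 65536:
            endpoints.append((host, int(port)))
    return endpoints


C2_TEXT = """
50.31.146.134:8080 204.197.254.94:8080 78.129.181.191:8080 27.124.127.10:8080 173.203.112.215:8080
50.97.99.2:8080 103.25.59.120:8080 5.135.208.53:8080 50.31.146.109:8080 204.93.183.196:8080
"""

endpoints = extract_endpoints(C2_TEXT)
print(len(endpoints), sorted({port for _, port in endpoints}))
```

The resulting list can be handed to whatever indicator format the receiving system expects.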
Conclusion
Although the Matsnu Trojan is not the most sophisticated malware available today, it is a good example
because it reflects typical, state-of-the-art techniques: its network communication uses encrypted
payloads, it hides by injecting itself into a variety of processes, and it decrypts its payload
inside explorer.exe, which makes manual debugging difficult. Using run-time data capturing
tools we were able to extract a lot of information, including dormant code and complete symbol
information. Of course, the dynamic analysis tool was required to follow the malware into explorer.exe
and remain undetected. As a next step, the static analysis engine StaticStream associated the run-time
data and quickly generated code sequences for post-processing, allowing us to find valuable analysis
entry points and behavior data that a pure dynamic analysis engine would not have seen.
In general we can say that static analysis works well if the data to be analyzed is not encrypted, not
obfuscated and available in a more or less complete form. Sadly, this is often not the case with
malware today. Dynamic analysis works well too, but it misses dormant
code and with it potentially malicious functionality. As we cannot make any qualified statements about the
unknown, it is impossible for a pure dynamic analysis system to safely declare a file
benign/clean, because the real payload may never have been executed. Thus, new Hybrid Analysis
(HA) technologies are not only a necessity, but part of a future solution in the battle against malware.
Due to the additional overhead imposed by hybrid technologies, very efficient and performance-oriented
algorithms are necessary, especially when operating at large scale.
Summary
In this article we outlined that today’s malware development is creating new challenges for malware
analysis systems. In the early days, simple static byte-pattern analysis was enough to detect and
classify malware. Then, as malware became more sophisticated, dynamic analysis systems that observe
run-time behavior surfaced. These systems have evolved and are a powerful tool today,
but their impact is becoming more and more limited. Today, neither static nor dynamic analysis alone is
an effective weapon against modern malware. Dynamic analysis environments are being detected,
and malicious dormant code is not being analyzed due to time constraints or unpredictable code
flow behavior. Using intelligent algorithms and Hybrid Analysis (HA) technologies, the best of both
worlds can be combined: first-pass checks, analyzing/logging run-time behavior, as well as detecting
and understanding dormant code functionality. In this article we showed that Hybrid Analysis is an
answer, provided the captured run-time data is of sufficient quality and the static analysis engine is
flexible enough to produce usable analysis results that can be post-processed to generate signatures or
indicators.
About the Tools
In this article we focused on a static analysis engine called StaticStream. It is a product of Payload
Security and makes automatic and efficient Hybrid Analysis available to dynamic analysis systems and
analysts. Its easy interface, high configurability and flexible data stream processing architecture make it
an interesting option for upgrading any dynamic analysis system to meet the challenges of today and tomorrow.
On the Web
More information on StaticStream is available on the web at www.payload-security.com.
About the author
Jan Miller is a specialist in static binary analysis algorithms, reverse engineering and malware
signatures. He is the CEO and founder of Payload Security UG (haftungsbeschränkt). In the past two
years, he has focused on Android-based malware, as well as on implementing Hybrid Analysis
technologies for a leading dynamic analysis system.
Table of Figures
Figure 1: Start Screen after Installing Windows XP and loading “matsnu” on the main drive
Figure 2: Running “matsnu” from the Manager using the interactive mode
Figure 3: Dynamic Data Folder
Figure 4: Streams Folder File Listing
Figure 5: Persistence using RegCreateKeyEx
Figure 6: Persistence using Registry
Figure 7: Encrypting Payload before C&C request
Bibliography
Borglund, J. (2014, April). Top 5 Most Costly Viruses of All Time. Retrieved April 2014, from TopTen Reviews:
http://anti-virus-software-review.toptenreviews.com/top-5-most-costly-viruses-of-all-time-pg5.html
Cuckoo Sandbox. (n.d.). Malwr - Malware Analysis by Cuckoo Sandbox. Retrieved June 24, 2014, from
https://malwr.com/analysis/YjQzNzExNjcwMDQyNDBhMmJmOTFhN2Y4ODk5ZmQ0NGM/
Kendall, K. (2007). Practical Malware Analysis. Mandiant, Intelligent Information Security.
Nathaniel Ayewah, David Hovemeyer, J. David Morgenthaler, John Penix and William Pugh. (2008).
Experiences Using Static Analysis to Find Bugs.
NetMarketShare. (2014, April). Desktop Operating System Market Share. Retrieved April 2014, from
http://www.netmarketshare.com/
Payload Security. (n.d.). Payload-Security.com - Combining Static and Dynamic Analysis Intelligently.
Retrieved June 24, 2014, from http://www.payload-security.com/
SecureList. (2014, April). Internet threats statistics. Retrieved April 2014, from SecureList:
http://www.securelist.com/en/statistics#/en/map/oas/month
VirtualBox. (n.d.). Oracle VM VirtualBox. Retrieved June 24, 2014, from https://www.virtualbox.org/