analyze'15 - bulk malware analysis at scale

Extracting Malware Configurations at Scale

ANALYZE 2015

John Bambenek, Fidelis Cybersecurity

Sharing Restrictions

• All the content on the slides can be considered TLP:GREEN.

• Anything that I say that’s more restrictive, I will tell you.

• Slides will eventually be posted to SlideShare.

• Questions to [email protected]

Introduction

• Sr. Threat Researcher with Fidelis Cybersecurity

• Faculty at the University of Illinois at Urbana-Champaign

• Producer of open-source intelligence feeds

• Run several takedown-oriented groups for various malware families

Problem Statement

• We are on the losing end of an arms race

• The adversaries produce more malware than we can possibly analyze.

• We have to operate in the open while they operate in secret.

• Their core business is exploitation; security for us is a cost center.

• We operate in a global economy without an effective means of global law enforcement.

TL;DR

Bad News: We’re Doomed

Good News: Unlimited Job Security

The Problem… Illustrated

VirusTotal statistics taken 20 Apr 2015 14:24 PDT

Another way to look at it…

• How long does it take to reverse engineer a malware sample?

• How long does it take to create a signature/rule/defense?

• How long does it take to create all the IOCs?

• Now… how long does it take that actor to change?

Is it really that many?

• Even though hundreds of thousands of unique files are seen daily, the number of malware families is much lower.

• Key is to develop the tooling to take a sample and rip out the pieces we need that are interesting.

• Single-stage malware is easy: the entire configuration is in one place.

• What about multi-stage malware?

  • It still has some place it calls to for the next stage.

The problem of “sufficiency”

• Once we “detect” a threat, work occurs until some “defense” is developed.

• Once a threat is “blocked”, the work tends to stop.

• Often multiple actor sets use the same piece of malware, but detection tends to be generic to the tool level.

The missing pieces…

• What about ongoing surveillance?

• What about tracking and identifying all the unique endpoints used by a specific piece of malware?

• e.g., if you could know every C2 that was ever an njRat server, would that be of interest to you?

• What about the unique attributes (mutex, campaign ID) that may be used?

Making RE more efficient

• Full RE is the most expensive but most thorough.

• Dynamic analysis is good, but the binary may not run correctly.

• Static analysis can be very fast… if you know how to pull the information out.

• Key is to automate such that you can do as much static analysis as possible, dynamic for much of the rest and RE only for the items where there is no other alternative.

BULK RAT CONFIG EXTRACTION

Why RATs?

• Single-stage malware will generally have the full configuration in the binary itself.

• Used not just by skiddies but by advanced attackers also.

• Large sample set to deal with as proof of concept.

• Dozens of well-known RAT types to deal with.

• Gotta walk before you can run.

What can you do with RAT configs?

Maybe I’m being a little too harsh

• RAT operators tend to be the black hat farm team.

• It may be “simple”, but the fact we haven’t eradicated it suggests it’s not so simple.

• Takedowns are an art form in progress; this provides lower-stakes targets to develop the tradecraft.

• Lack of enforcement breeds the feeling of invulnerability of cyber criminals.

• Don’t forget, “APT” use RATs too.

Also, there is this magic sauce…

• https://github.com/kevthehermit/RATDecoders

• Python scripts that will statically rip configurations out of 32 different flavors of RATs.

• Actively developed, and you can see them in action at malwareconfig.com

• Disclaimer: I had nothing to do with the development of these tools; they just fit my need and Kevin Breen deserves mad props.


The next piece of the puzzle

• In order to determine which decoder to use, you need to know which RAT it is.

• Yara is used for this piece, using rules from:

  • https://github.com/kevthehermit/YaraRules

  • Yara Exchange

  • In-House Rules

• Yara results used as “authoritative” for purposes of selecting the decoder.
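
A minimal Python sketch of that selection step, assuming the yara command-line tool's default output of one "RULE_NAME FILE_PATH" pair per line. The rule names, the decoder file names, and the DECODERS mapping are hypothetical placeholders, not necessarily the names used in RATDecoders or YaraRules.

```python
# Hypothetical mapping from Yara rule name to decoder script name.
DECODERS = {
    "DarkComet": "DarkComet.py",
    "njRat": "njRat.py",
}


def select_decoders(yara_output, decoders=DECODERS):
    """Return {sample_path: decoder} for every hit whose rule has a decoder.

    Expects lines shaped like "RULE_NAME FILE_PATH". The first matching
    rule with a known decoder wins; hits for other rules are ignored.
    """
    selected = {}
    for line in yara_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank or malformed line
        rule, path = parts
        if rule in decoders and path not in selected:
            selected[path] = decoders[rule]
    return selected


if __name__ == "__main__":
    sample = "DarkComet samples/aaa\nUPX samples/aaa\nnjRat samples/bbb\n"
    for path, decoder in select_decoders(sample).items():
        print(path, "->", decoder)
```

Treating the Yara hit as authoritative keeps the dispatch a plain dictionary lookup; samples that match no known rule simply never reach a decoder.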


Malware Sources

• VirusTotal

• MSFT VIA Program

• Others I haven’t had a chance to ask whether they want recognition

• RAT Traps

• In total, upwards of 0.25 TB a day (not all RATs)

• In short, every piece of malware I can find.

RAT Traps

• Some RAT operators tend to have some targeting information in mind when they are seeking infections…

  • Celebrities

  • Corporate executives

  • Young girls

• Create faux personas that mimic some of these characteristics with an available email address and let nature take its course.

  • Or leak them to pastebin if you’re in a hurry.

DESIGNING A SYSTEM TO HANDLE IT ALL

Process

• Intake of Malware

• Normalize into one directory with MD5 as filename

• Process and Unpack Samples

• Scan all samples with Yara

• Use Yara output to run selected samples with correct decoder

• Normalize output

• Process into CSV feed for daily summary of configuration info

• Profit
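
The normalization step can be sketched in Python as below. This is an illustration under assumptions, not the pipeline's actual code: the function names and directory layout are hypothetical, and the store directory is assumed to live outside the intake directory.

```python
import hashlib
import shutil
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large samples do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def normalize(intake_dir, store_dir):
    """Copy every file under intake_dir into store_dir, named by its MD5.

    Returns {md5: original_path} so source information survives the rename.
    Duplicate samples collapse onto a single file.
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    sources = {}
    for path in sorted(Path(intake_dir).rglob("*")):
        if not path.is_file():
            continue
        md5 = md5_of(path)
        target = store / md5
        if not target.exists():
            shutil.copyfile(path, target)
        sources.setdefault(md5, str(path))
    return sources
```

Naming files by hash gives deduplication for free, and the returned mapping keeps the link back to where each sample came from.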

First Bottleneck… Bandwidth

• Running on a hi1.4xlarge instance, all of this could run in about 90 minutes.

• It also costs $1000/mo for on-demand.

  • Oh, and there is no capacity for spot instances.

• Running in the corporate datacenter, it took about 9-10 hours, which is still acceptable for current data.

  • Insufficient to do this retroactively.

• There was one issue with running it in corporate datacenter though…

When datacenter gangsters attack…

• Apparently they get mad when you take up the whole pipe during business hours…

Next bottleneck… Disk

• All of this is disk I/O intensive:

  • Writing to disk

  • Processing file magic

  • Yara scanning

  • Python scripts pulling configurations out of files.

• SSD or Bust…

• Discard binaries when done processing

  • But keep source information

Last bottleneck… time

• Downloading files one at a time (I don’t control packaging)

• Yara scanning one file at a time

• Lots of wasted CPU cycles sitting idle.

• Solution: parallel

find . -type f -exec basename {} \; | parallel --max-lines 1 -j 160 yara ~/yara/all_trojans.yar 2> /dev/null >> ../yarascan.$prettystamp
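
The same fan-out idea can be sketched in Python for pipelines that are not shell-driven. This is an illustration, not the command used above: scan_file assumes the basic "yara RULES_FILE TARGET" invocation, and the rules path, function names, and worker count are placeholders.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def scan_file(path, rules="all_trojans.yar"):
    """Run the yara CLI on one file and return its stdout (empty on no match)."""
    result = subprocess.run(
        ["yara", rules, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    return result.stdout


def scan_all(paths, scanner=scan_file, workers=16):
    """Scan many files concurrently and return the non-empty outputs.

    Threads are enough here because each worker mostly waits on a child
    process; results come back in the same order as `paths`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(scanner, paths)
        return [out for out in outputs if out.strip()]
```

The scanner argument is injectable so the fan-out logic can be exercised without yara installed, for example with a stub that returns canned output.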

Malware Configs

• Every RAT has different configurable items.

• Not every configuration item is necessarily valuable for intelligence purposes.

• Some items may have default values.

• Free-form text fields provide interesting data that may be useful for correlation.

• Mutex can be useful for correlating binaries to the same actor.

Sample DarkComet config

Key: CampaignID Value: Guest16
Key: Domains Value: 06059600929.ddns.net:1234
Key: FTPHost Value:
Key: FTPKeyLogs Value:
Key: FTPPassword Value:
Key: FTPPort Value:
Key: FTPRoot Value:
Key: FTPSize Value:
Key: FTPUserName Value:
Key: FireWallBypass Value: 0
Key: Gencode Value: 3yHVnheK6eDm
Key: Mutex Value: DC_MUTEX-W45NCJ6
Key: OfflineKeylogger Value: 1
Key: Password Value:
Key: Version Value: #KCMDDC51#

Sample njRat config

Key: Campaign ID Value: 1111111111111111111
Key: Domain Value: apolo47.ddns.net
Key: Install Dir Value: UserProfile
Key: Install Flag Value: False
Key: Install Name Value: svchost.exe
Key: Network Separator Value: |'|'|
Key: Port Value: 1177
Key: Registry Value Value: 5d5e3c1b562e3a75dc95740a35744ad0
Key: version Value: 0.6.4

Processing DNS/IP Info

• Config takes FQDN or IP in free-form field.

• This is the only configuration item on which any processing is done.

• If RFC 1918 IP, then drop config.

• If FQDN resolves to RFC 1918 IP, keep it.

• If it doesn’t resolve, keep it.
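
The rules above can be sketched with Python's standard ipaddress module. The function names are hypothetical. Because an FQDN is kept whether or not it resolves, and whatever it resolves to, this sketch never performs a DNS lookup.

```python
import ipaddress

# The three private ranges defined by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(value):
    """True only if value is a literal IPv4 address inside an RFC 1918 range."""
    try:
        address = ipaddress.ip_address(value)
    except ValueError:
        return False  # not an IP literal, so treat it as an FQDN
    return any(address in network for network in RFC1918_NETWORKS)


def keep_config(c2_field):
    """Apply the rules above to the free-form C2 field.

    Literal RFC 1918 address: drop. Anything else, including an FQDN that
    resolves to a private address or does not resolve at all: keep.
    """
    return not is_rfc1918(c2_field.strip())
```

The ranges are listed explicitly rather than relying on the module's is_private attribute, which covers more than the RFC 1918 blocks.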

Sample Output

0739b6a1bc018a842b87dcb95a73248d3842c5de,150213,Dark Comet Config,Guest16,lolikhebjegehackt.ddns.net,,1604,,,,o1o5GgYr8yBB,DC_MUTEX-4E844NR

0745a4278793542d15bbdbe3e1f9eb8691e8b4fb,150213,Dark Comet Config,Guest16,ayhan313.noip.me,,1604,,,,aWUZabkXJRte,DC_MUTEX-TX61KQS

07540d2b4d8bd83e9ba43b2e5d9a2578677cba20,150213,Dark Comet Config,FUDDDDD,bilalsidd43.no-ip.biz,204.95.99.66,1604,,,,qZYsyVu0kMpS,DC_MUTEX-8VK1Q5N

07560860bc1d58822db871492ea1aa56f120191a,150213,Dark Comet Config,Victim,cutedna.no-ip.biz,,1604,,,,sfAEjh4m1lQ7,DC_MUTEX-F2T2XKC

07998ff3d00d232b6f35db69ee5a549da11e96d1,150213,Dark Comet Config,test1,,192.116.50.238,90,,,,4A2xbJmSqvuc,DC_MUTEX-F54S21D

07ac914bdb5b4cda59715df8421ec1adfaa79cc7,150213,Dark Comet Config,Guest16,alkozor.ddns.net,31.132.106.94,1604,1.ekspert60.z8.ru,######60,######2012,zwd8tEC0F0tA,DC_MUTEX-W3VUKQN

NOTE – Redacted entries are username and password for FTP drop for keylogs.
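
A short Python sketch of consuming rows like these. The column names are an assumption inferred from the sample rows, not a documented schema, and the tally by campaign ID is just one example of what the feed makes easy.

```python
import csv
from collections import Counter

# Assumed column order, inferred from the sample rows; not a documented schema.
COLUMNS = [
    "hash", "date", "rat_type", "campaign_id", "domain", "ip",
    "port", "ftp_host", "ftp_user", "ftp_password", "gencode", "mutex",
]

SAMPLE = """\
0739b6a1bc018a842b87dcb95a73248d3842c5de,150213,Dark Comet Config,Guest16,lolikhebjegehackt.ddns.net,,1604,,,,o1o5GgYr8yBB,DC_MUTEX-4E844NR
07998ff3d00d232b6f35db69ee5a549da11e96d1,150213,Dark Comet Config,test1,,192.116.50.238,90,,,,4A2xbJmSqvuc,DC_MUTEX-F54S21D
0745a4278793542d15bbdbe3e1f9eb8691e8b4fb,150213,Dark Comet Config,Guest16,ayhan313.noip.me,,1604,,,,aWUZabkXJRte,DC_MUTEX-TX61KQS
"""


def parse_feed(text):
    """Turn feed text into a list of dicts keyed by the assumed column names."""
    rows = csv.reader(text.splitlines())
    return [dict(zip(COLUMNS, row)) for row in rows if row]


records = parse_feed(SAMPLE)
campaigns = Counter(record["campaign_id"] for record in records)
for campaign, count in campaigns.most_common():
    print(count, campaign)
```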

Pump it all into a database… profit

• CSV is all fine and good, but not great for historical searching…

• Main table with Hash, C2 info, description, source and date.

  • Also pumped into CIF

• RAT-specific table with Hash and RAT specific config info.
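
A minimal sketch of that two-table layout using Python's built-in sqlite3 module. The talk does not say which database or schema is used, so the table names, column names, and the choice of SQLite are all assumptions for illustration.

```python
import sqlite3

# Hypothetical schema: one main table plus one RAT-specific table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    hash        TEXT PRIMARY KEY,
    c2          TEXT,
    description TEXT,
    source      TEXT,
    seen_date   TEXT
);
CREATE TABLE IF NOT EXISTS darkcomet_configs (
    hash        TEXT PRIMARY KEY REFERENCES samples(hash),
    campaign_id TEXT,
    mutex       TEXT
);
"""


def open_db(path=":memory:"):
    """Create (or open) the database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def hashes_for_mutex(conn, mutex):
    """Historical search: every sample hash and C2 seen with a given mutex."""
    query = """
        SELECT s.hash, s.c2 FROM samples AS s
        JOIN darkcomet_configs AS d ON d.hash = s.hash
        WHERE d.mutex = ? ORDER BY s.hash
    """
    return conn.execute(query, (mutex,)).fetchall()
```

Keeping the RAT-specific fields in their own table lets each family carry different columns while the main table stays uniform for cross-family searches.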

Artifact Mining

• Often (but not always) the operators of a given piece of malware are distinct and separate from the author of the malware.

• Correlating related pieces of code may not be worthwhile.

  • Cryptolocker example

• At least for RATs, the interesting artifacts are the configuration, not the code.

  • Malware actors may change tools but may continue to use some of the configuration elements.

Why in the world would you ever do this?

1524 Guest16
145 Guest16_min
50 Anonymous
43
29 Hacked
28 Victim
28 HF
27 TestGuest
27 Test1
26 Guest162
25 Slave
23 B--L--A--Y
22 Guest1
20 Test
17 Guest
17 1
16 DOS
15 Eb0la
14 Kurban
13
12 HACKIADO MUAHAHAHAHA
11 test
11 Bot
10 VoltandoAHackear
10 Hack
10 AVA

More examples

2652 HacKed
119
109
72
50 Hacked
37 hacked
18
14 google
13 Victim
11 isLam
10 victim
10 system
9 test
9
8 xXxVICTIMxXx
8 vitima
8 4kurdistan.no-ip.biz…
7 HacKed By Amr Nasr
6 HacKed By Mohamed Ashraf
5 HacKed_by_Hammouda-Hacker
4 Ahmed Najar
4 ahMed-haKerS

RAT Creed

This is my RAT. There are many like it, but this one is mine. My RAT is my best friend. It is my life. I must master it as I must master my life.

My RAT, without me, is useless. Without my RAT, I am useless. I must fire my RAT true. I must shoot straighter than my enemy who is trying to kill me. I must shoot him before he shoots me. I will...

Top Global ASNs for RAT C2s

294 36947 DZ ALGTEL-AS,DZ
131 8452 EG TE-AS TE-AS,EG
115 42708 SE PORTLANE Portlane Networks AB,SE
113 36903 MA MT-MPLS,MA
98 50710 IQ EARTHLINK-AS EarthLink Ltd. Communications&Internet Services,IQ
69 9121 TR TTNET Turk Telekomunikasyon Anonim Sirketi,TR
69 25019 SA SAUDINETSTC-AS Saudi Telecom Company JSC,SA
52 NA NA
39 47869 SE NETROUTING-AS Netrouting,NL
35 37705 TN TOPNET,TN
31 24863 EG LINKdotNET-AS,EG
30 45595 PK PKTELECOM-AS-PK Pakistan Telecom Company Limited,PK
25 7738 BR Telemar Norte Leste S.A.,BR
25 3215 FR AS3215 Orange S.A.,FR
25 2609 TN TN-BB-AS Tunisia BackBone AS,TN
24 8376 JO Jordan Data Communications Company LLC,JO
23 4565 MEGAPATH2-US - MegaPath Networks Inc.,US
22 8075 US MICROSOFT-CORP-MSN-AS-BLOCK - Microsoft Corporation,US

Top Countries for RAT C2s

294 DZ
261 US
225 RU
186 EG
168 SE
152 IQ
145 MA
114 BR
103 SA
99 TR
99 TN
89 FR
81 UA

Top US Cities for RAT C2s

22 Redmond, Washington
12 Dallas, Texas
7 Phoenix, Arizona
6 Providence, Utah
6 New York, New York
6 Los Angeles, California
3 Wilmington, Delaware
3 San Antonio, Texas
3 Philadelphia, Pennsylvania
3 Houston, Texas
2 Willoughby, Ohio

Eventually fully-retroactive

• All that malware in VirusTotal? You can still use that.

• Think of the intelligence possibilities of having a “master” database of RAT configurations for “all time”…

• If nothing else, Amazon’s stock price will go up from the AWS fees

• Why?

  • Because often we don’t know what is important until after the fact, and the ability to go back and have information readily available can shorten the response time.

OPERATIONALIZING THE DATA

What to do with this data?

• Giving it to LE for action is the obvious option

• Give to CERTs for them to take action

• Or you can burn all the RATs #OpTrollHackforums

• Creating alerts on this data is probably ok.

• Taking automated blocking action based on this data is probably not.

#OpSoapbox

• This is a wealth of very useful information… but it is just information.

• Intelligence is the process of thinking critically about the information you have…

  • What is it telling you

  • What are all the possible conclusions

  • Where can the adversary deceive you

  • What harm could be caused if you acted on it

Don’t be that guy

Adapted from Brandon Levene* (I think)

Counterintelligence

• DNS resolution is under the control of the adversary.

• The adversary has motive to deceive.

• The adversary has motive to cause harm.

• DGA feeds anecdote

  • Shameless plug: http://osint.bambenekconsulting.com/feeds



What’s the worst that can happen…

• If I were evil and knew you were taking automated blocking action based on something I controlled resolution for, here is what I would use for IPs:

198.41.0.4
192.228.79.201
192.33.4.12
199.7.91.13
192.203.230.10
192.5.5.241
192.112.36.4
128.63.2.53
192.36.148.17
192.58.128.30
193.0.14.129
199.7.83.42
202.12.27.33
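
Those addresses appear to be the DNS root servers, which is the point: block them and you break your own name resolution. If any automated action is taken at all, one hedge is a never-block list consulted first. The sketch below is illustrative only; the list contents and function names are hypothetical, and a real list would need to cover your own critical infrastructure.

```python
import ipaddress

# Illustrative only: networks that must never be blocked automatically.
PROTECTED_NETWORKS = [
    ipaddress.ip_network("198.41.0.4/32"),      # first address listed above
    ipaddress.ip_network("192.228.79.201/32"),  # second address listed above
]


def safe_to_block(candidate, protected=PROTECTED_NETWORKS):
    """Return True only for a valid IP that falls outside every protected network."""
    try:
        address = ipaddress.ip_address(candidate)
    except ValueError:
        return False  # refuse to act on anything that is not a clean IP
    return not any(address in network for network in protected)


def triage(resolved_ips):
    """Split resolved C2 addresses into (blockable, needs_human_review)."""
    blockable = [ip for ip in resolved_ips if safe_to_block(ip)]
    review = [ip for ip in resolved_ips if not safe_to_block(ip)]
    return blockable, review
```

A never-block list only covers the damage you thought of in advance, so alerting on this data remains the safer default.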

Analyzing data at scale

• How can you possibly analyze thousands of configurations to determine confidence in each individual record?

  • You can’t.

• Ultimately need something to correlate it with.

  • Wiretap if LE

  • Correlation with other malicious activity at same IP

But the data changes…

• If the adversary uses DNS, they can change information at will.

• Long-term goal is to feed “live” data into another application that handles surveillance called PSS – Permanent Surveillance System.

  • Maybe I’ll open-source it, don’t know yet.

• Beyond that, there are some interesting fields to pivot off of to correlate campaigns

  • Campaign ID

  • Mutex

  • Registry Keys
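
Pivoting on one of those fields can be sketched as a simple grouping in Python. The record layout, field names, and hash values here are hypothetical stand-ins for whatever the normalized configs look like.

```python
from collections import defaultdict


def pivot(records, field):
    """Group sample hashes by the value of one config field.

    Only values shared by two or more samples are returned, since a value
    seen once offers nothing to correlate. Empty values are skipped.
    """
    groups = defaultdict(set)
    for record in records:
        value = record.get(field)
        if value:
            groups[value].add(record["hash"])
    return {value: sorted(hashes) for value, hashes in groups.items() if len(hashes) > 1}


records = [
    {"hash": "aaa", "campaign_id": "Guest16", "mutex": "DC_MUTEX-W45NCJ6"},
    {"hash": "bbb", "campaign_id": "Guest16", "mutex": "DC_MUTEX-W45NCJ6"},
    {"hash": "ccc", "campaign_id": "test1", "mutex": "DC_MUTEX-F54S21D"},
]
print(pivot(records, "mutex"))
```

Default values such as Guest16 will produce large groups that say little about the actor, so a pivot is only as good as the field's uniqueness.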

Long-Term

• Identifying a threat point-in-time has value.

• Surveilling a threat as it moves and changes proactively reduces the window of opportunity for an adversary.

• RATs are just the start

  • They are relatively easy

  • Still useful to improve the tradecraft

  • And they are still used by adversaries

QUESTIONS?

THANK YOU

[email protected] / 217 493 0760

@bambenek

