Labeling the VirusShare Dataset: Lessons Learned John Seymour [email protected] , @_delta_zero 2016-04-23


Page 1: Labeling the virus share malware dataset  lessons learned

Labeling the VirusShare Dataset: Lessons Learned

John Seymour, [email protected], @_delta_zero

2016-04-23

Page 2: Labeling the virus share malware dataset  lessons learned

whoami

• Ph.D. student at the University of Maryland, Baltimore County (UMBC)

• Also a data scientist at ZeroFOX, Inc.

Page 3: Labeling the virus share malware dataset  lessons learned

Outline

• Introduction to Malware Classification
• Labeling the VirusShare Corpus
• Building a Malware Index using PySpark
• Pretty graphs, words of caution, and useful extensions

Page 4: Labeling the virus share malware dataset  lessons learned

Malware meets Machine Learning

Main Problem: more malware variants created than we can possibly ever analyze

Page 5: Labeling the virus share malware dataset  lessons learned

Crash Course in Machine Learning

• Machine Learning: finding patterns in data
• Features: these potential patterns
• Models: methods for making sense of features
• Libraries exist for creating models from features (Python, R, WEKA)
• Places to get started:
  • http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
  • https://www.dataquest.io/mission/74/getting-started-with-kaggle/

Page 6: Labeling the virus share malware dataset  lessons learned

Where to find malware

• 600 samples
  • Lots of exploit kits
  • Includes analyses
• 10,868 samples (about 500 GB): the Kaggle dataset
  • 9 families of malware
  • Hexdumps/Assembly files (from IDA)
  • Neutered: PE headers removed
• 271,092 samples
  • Labeled by KAV
  • Last update: 2007
  • Most-used academic dataset
• 24,783,626 samples: VirusShare
  • Split into chunks of 65,536 samples
  • Available by Torrent
  • Unlabeled (until now!)
• As many as you want
  • VirusTotal: Needs Private API
  • Research Requests
  • Licensing issues

Page 7: Labeling the virus share malware dataset  lessons learned

Features commonly used

PE-File metadata

Python pefile library is amazing

Image courtesy of trustwave.com
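
As a rough illustration of what "PE-file metadata" features look like, here is a minimal sketch that reads a few COFF header fields using only the Python standard library. It is not the pefile library the slide recommends, which exposes far more (sections, imports, resources); the function name and the dictionary keys are assumptions made for this example.

```python
import struct

def coff_header_fields(data: bytes) -> dict:
    """Read a few COFF header fields from the raw bytes of a PE file."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte PE signature.
    (machine, num_sections, timestamp,
     _symtab_ptr, _num_symbols,
     opt_header_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "num_sections": num_sections,
        "timestamp": timestamp,
        "optional_header_size": opt_header_size,
        "characteristics": characteristics,
    }
```

Each returned value can be used directly as a numeric feature. For a file on disk, pass in the bytes read in binary mode.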

Page 8: Labeling the virus share malware dataset  lessons learned

N-Grams on hexdumps/assembly files

• Sliding window over text

• Features:
  • DEAD: 1
  • ADBE: 1
  • BEEF: 1
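
The sliding window can be sketched in a few lines of Python. This assumes the window is two bytes wide and advances one byte at a time, which reproduces the DEAD / ADBE / BEEF example for the input DEADBEEF; the function name and the default `n` are choices made for this sketch.

```python
from collections import Counter

def byte_ngrams(hex_text: str, n: int = 2) -> Counter:
    """Count n-byte n-grams in a hexdump string using a sliding window."""
    # Drop whitespace so "DE AD BE EF" and "DEADBEEF" behave the same.
    digits = "".join(hex_text.split()).upper()
    width = 2 * n  # two hex digits per byte
    # Advance two hex digits (one byte) per step.
    return Counter(
        digits[i:i + width]
        for i in range(0, len(digits) - width + 1, 2)
    )

counts = byte_ngrams("DEADBEEF", n=2)
print(counts)  # Counter({'DEAD': 1, 'ADBE': 1, 'BEEF': 1})
```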

Page 9: Labeling the virus share malware dataset  lessons learned

Other features commonly used

• Opcodes, imports, etc
• Assembly instructions

Page 10: Labeling the virus share malware dataset  lessons learned

Top performing models

Again, use libraries!

SVMs xgboost Deep Learning

Page 11: Labeling the virus share malware dataset  lessons learned

Determining if Malware Classifiers Overfit

A use case for labeling the VirusShare corpus

Page 12: Labeling the virus share malware dataset  lessons learned

Main question: do these models overfit?

Page 13: Labeling the virus share malware dataset  lessons learned

2-Step Method to Show Overfitting

1. Download another dataset
2. Run that dataset through models
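
One way to picture step 2: score the same trained model on its original held-out data and on the second dataset, then compare. The sketch below assumes a model is just a callable from a feature vector to a label; `accuracy`, `generalization_gap`, and the toy threshold classifier are hypothetical names for illustration, not anything from the talk.

```python
from typing import Callable, Iterable, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, true label)

def accuracy(predict: Callable[[Sequence[float]], str],
             samples: Iterable[Sample]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    samples = list(samples)
    if not samples:
        raise ValueError("no samples to score")
    correct = sum(1 for features, label in samples
                  if predict(features) == label)
    return correct / len(samples)

def generalization_gap(predict, original_test, external) -> float:
    """Accuracy on the original test split minus accuracy on new data."""
    return accuracy(predict, original_test) - accuracy(predict, external)

# Toy classifier standing in for a trained model (assumption).
def toy_predict(features):
    return "malware" if features[0] > 0.5 else "benign"

original_test = [([0.9], "malware"), ([0.1], "benign")]
external = [([0.9], "malware"), ([0.6], "benign"),
            ([0.2], "malware"), ([0.1], "benign")]
print(generalization_gap(toy_predict, original_test, external))  # 0.5
```

A large positive gap suggests the model learned quirks of its original dataset rather than something that carries over to new samples.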

Page 14: Labeling the virus share malware dataset  lessons learned

Why VirusShare?

• Size (27 million samples: almost 2500x Kaggle dataset size, not neutered)

• Consistently updated
• Make future research more reproducible
• Scrape your own => you’ll probably overfit
• VirusTotal: can't release any raw data from the platform
• Short descriptions of dataset: “Chunks 25, 60, 90”

Page 15: Labeling the virus share malware dataset  lessons learned

Labeling the VirusShare Corpus

Page 16: Labeling the virus share malware dataset  lessons learned

Overview

• VirusTotal has an awesome API
  • Two versions: Private/Research and Public
  • Public is rate-limited
  • Private/Research has licensing agreements
    • Meaning we wouldn’t be able to distribute results
• Took 30 people + 6 months to label
  • Due to rate-limiting
  • Mostly undergraduates wanting extra credit
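
A back-of-the-envelope estimate shows why rate-limiting dominates the schedule. The slides do not state the limit; the sketch below assumes 4 requests per minute per public API key, one sample per request, and 30 keys querying around the clock. Those three numbers are assumptions, not figures from the talk.

```python
def labeling_days(num_samples: int,
                  requests_per_minute: float,
                  num_keys: int) -> float:
    """Days needed to query every sample once under a per-key rate limit."""
    minutes = num_samples / (requests_per_minute * num_keys)
    return minutes / (60 * 24)

# 24,783,626 is the corpus size quoted earlier in the deck.
days = labeling_days(24_783_626, requests_per_minute=4, num_keys=30)
print(f"{days:.0f} days")  # 143 days
```

Under these assumptions the ideal case is a little under five months, so six months with downtime and retries is plausible. A single key would need about 30 times as long.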

Page 17: Labeling the virus share malware dataset  lessons learned
Page 18: Labeling the virus share malware dataset  lessons learned
Page 19: Labeling the virus share malware dataset  lessons learned

Labels available here: https://drive.google.com/file/d/0B_IN6RzP69b2TkNrYVdOMnQ4LVE/view

Page 20: Labeling the virus share malware dataset  lessons learned

Building a Malware Index

Page 21: Labeling the virus share malware dataset  lessons learned

What is an index and why do I need one?

• An index will list each malware label with how many occur in each VirusShare chunk

• Why? We want to minimize download time + size on hard drive

• Not all VirusShare chunks have all types of malware

• Very useful for when we want to download a large number of particular type of malware
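
To make the download-minimizing idea concrete, here is a minimal sketch of how such an index could be queried. It assumes the index is a nested mapping of label to chunk number to sample count; the labels, chunk numbers, and counts below are made up, and `chunks_for_label` is a hypothetical helper.

```python
def chunks_for_label(index: dict, label: str, wanted: int) -> list:
    """Pick the fewest chunks that together hold at least `wanted` samples."""
    per_chunk = index.get(label, {})
    # Taking the richest chunks first minimizes how many we must download.
    ranked = sorted(per_chunk.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0
    for chunk, count in ranked:
        if total >= wanted:
            break
        chosen.append(chunk)
        total += count
    if total < wanted:
        raise ValueError(f"only {total} samples of {label!r} in the index")
    return chosen

# Made-up index: label -> {chunk number: sample count}
example_index = {
    "zbot": {25: 1200, 60: 300, 90: 4100},
    "sality": {25: 50, 60: 2200},
}
print(chunks_for_label(example_index, "zbot", 5000))  # [90, 25]
```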

Page 22: Labeling the virus share malware dataset  lessons learned

Building an index

• The MapReduce framework is awesome for counting things
• See PySpark tutorials
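
The counting pattern itself is small enough to show without a cluster. The sketch below is plain Python rather than PySpark and is not the script from the talk: it maps each record to a count of one under a (label, chunk) key, then reduces by adding counts, which is the same shape as a PySpark `map` followed by `reduceByKey`. The record format, (chunk number, label) pairs, is an assumption.

```python
from collections import Counter
from functools import reduce

def map_record(record):
    """Map step: emit a count of 1 under the key (label, chunk)."""
    chunk, label = record
    return Counter({(label, chunk): 1})

def build_index(records):
    """Reduce step: sum the per-record counts, then nest them by label."""
    totals = reduce(lambda a, b: a + b, map(map_record, records), Counter())
    index = {}
    for (label, chunk), count in totals.items():
        index.setdefault(label, {})[chunk] = count
    return index

# Made-up records: (chunk number, label)
records = [(25, "zbot"), (25, "zbot"), (60, "zbot"), (25, "sality")]
print(build_index(records))
# {'zbot': {25: 2, 60: 1}, 'sality': {25: 1}}
```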

Page 23: Labeling the virus share malware dataset  lessons learned

Entire PySpark Script – everything else is just formatting

Page 24: Labeling the virus share malware dataset  lessons learned

What that index looks like

Page 25: Labeling the virus share malware dataset  lessons learned

It’s really easy to use!

Page 26: Labeling the virus share malware dataset  lessons learned

Pretty Graphs

Page 27: Labeling the virus share malware dataset  lessons learned

Pretty Graphs

Page 28: Labeling the virus share malware dataset  lessons learned

Words of warning

• Don’t use this to compare AVs
  • VT has a big disclaimer on their site
  • Main reason: not all AVs integrate their full software into VT
• Ground truth might be noisy
  • AV labels sometimes different even though specimens are similar
  • AV labels sometimes similar even when specimens are different

Page 29: Labeling the virus share malware dataset  lessons learned

These might be very different malware specimens!

Page 30: Labeling the virus share malware dataset  lessons learned

Useful Extensions

• Adding when the specimen was first seen
  • Allows for better forecasting and time-based analyses
  • VirusTotal only has this field on its private/research API
• Some sort of stemming
  • This might help with the “AV labels different even though specimens are similar” problem
  • AV labeling is inconsistent, so this could help us get higher granularity on our ground truth labels
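
One possible reading of "some sort of stemming" is to reduce each AV label to its distinctive tokens, so that differently formatted labels for similar specimens can be matched. The sketch below is an assumption about how that might look, not a method from the talk: the generic-token list, the minimum token length, and the two example labels are all illustrative.

```python
import re

# Tokens assumed to carry no family information (illustrative, incomplete).
GENERIC_TOKENS = frozenset({
    "trojan", "win32", "generic", "malware", "virus", "worm",
    "backdoor", "variant", "heur",
})

def label_tokens(av_label: str, min_length: int = 4) -> set:
    """Reduce an AV label to its distinctive lowercase tokens."""
    tokens = re.split(r"[^a-z0-9]+", av_label.lower())
    # Short tokens are usually platform tags or variant suffixes.
    return {t for t in tokens
            if len(t) >= min_length and t not in GENERIC_TOKENS}

# Two made-up labels in the style of different AV vendors.
print(label_tokens("Trojan-Spy.Win32.Zbot.aa"))  # {'zbot'}
print(label_tokens("PWS:Win32/Zbot"))            # {'zbot'}
```

Labels whose token sets overlap could then be grouped under one family name, though the generic-token list would need tuning against real labels.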

Page 31: Labeling the virus share malware dataset  lessons learned

Thank you!

Questions?

[email protected], @_delta_zero