an introduction to malware classification

An Introduction to Malware Classification John Seymour [email protected] , @jjseymour3 2016-04-23

Upload: john-seymour

Post on 17-Feb-2017

TRANSCRIPT

Page 1: An Introduction to Malware Classification

An Introduction to Malware Classification

John Seymour, [email protected], @jjseymour3

2016-04-23

Page 2: An Introduction to Malware Classification

whoami

• Ph.D. student at the University of Maryland, Baltimore County (UMBC)

• Also a data scientist at ZeroFOX, Inc.

Page 3: An Introduction to Malware Classification

Outline

• Malware meets Machine Learning
• Where to find lots of malware
• Features commonly used
• Top performing models
• Where we currently need effort

Page 4: An Introduction to Malware Classification

Malware meets Machine Learning

Problem: more malware variants are created than we can ever possibly analyze

Page 5: An Introduction to Malware Classification

Crash Course in Machine Learning

• Machine Learning: finding patterns in data
• Potential patterns: “Features”
• Making sense of those patterns: “Models”
• Libraries exist for creating models from features (Python, R, WEKA)
• Places to get started:
  • http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
  • https://www.dataquest.io/mission/74/getting-started-with-kaggle/
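To make the feature/model split concrete, here is a minimal sketch that assumes a nearest-centroid classifier written with the standard library only. In practice one of the libraries named above would take its place. The feature values (section count, import count), the labels, and the names `train_centroids` and `predict` are all made up for illustration.

```python
import math

def train_centroids(samples, labels):
    """'Model': the average feature vector of each class."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

# Hypothetical toy data: [number of sections, number of imports] per file.
samples = [[4, 120], [5, 150], [2, 5], [3, 10]]
labels = ["benign", "benign", "malicious", "malicious"]

centroids = train_centroids(samples, labels)
print(predict(centroids, [3, 8]))
print(predict(centroids, [5, 140]))
```

The features are the numbers describing each file; the model is whatever turns those numbers into a label.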

Page 6: An Introduction to Malware Classification

Where to find malware

malware-traffic-analysis: 600 samples
• Lots of exploit kits
• Includes analyses

Kaggle: 10,868 samples (about 500GB)
• 9 families of malware
• Hexdumps/assembly files (from IDA)
• Neutered: PE headers removed

Vx Heaven: 271,092 samples
• Labeled by KAV
• Last update: 2007
• Most-used academic dataset

VirusShare: 24,783,626 samples
• Split into chunks of 65,536 samples
• Available by torrent
• Unlabeled (so far…)

As many as you want
• VirusTotal: needs private API
• Research requests
• Licensing issues

Page 7: An Introduction to Malware Classification

No presentation complete without a graph (note logarithmic scale)

[Figure: bar chart of number of samples per source on a logarithmic axis (ticks from 256 to about 67 million): malware-traffic-analysis, Kaggle, Vx Heaven, VirusShare]

Page 8: An Introduction to Malware Classification

Features commonly used

PE-File metadata

Python pefile library is amazing

Image courtesy of trustwave.com
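As a rough illustration of what "PE-file metadata" means as features, the sketch below reads a few COFF header fields with the standard library `struct` module. It is not the pefile API; pefile exposes these fields and far more. The function names and the hand-built demo header are assumptions made for this example.

```python
import struct

def pe_header_features(data: bytes) -> dict:
    """Extract a few COFF file-header fields from raw PE bytes."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew at offset 0x3C points to the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # COFF file header: 20 bytes right after the 4-byte signature.
    (machine, n_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "size_of_optional_header": opt_size,
        "is_dll": bool(characteristics & 0x2000),  # IMAGE_FILE_DLL flag
    }

def make_demo_pe() -> bytes:
    """Hypothetical header-only stand-in for a real file (not a valid executable)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE header directly follows
    # i386 machine, 3 sections, arbitrary timestamp, EXE characteristics
    coff = struct.pack("<HHIIIHH", 0x014C, 3, 1461369600, 0, 0, 0xE0, 0x0102)
    return bytes(dos) + b"PE\x00\x00" + coff

features = pe_header_features(make_demo_pe())
print(features)
```

Each value in the returned dictionary can be used directly as one numeric feature for a model.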

Page 9: An Introduction to Malware Classification

N-Grams on hexdumps/assembly files

• Sliding window over text

• Features:
  • DEAD: 1
  • ADBE: 1
  • BEEF: 1
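The feature counts above are consistent with a two-byte window sliding one byte at a time over the hex string DEADBEEF. A minimal sketch under that assumption follows; the function name `byte_ngrams` and the default window size are choices made for this example, not something the slides specify.

```python
from collections import Counter

def byte_ngrams(hex_text: str, n: int = 2) -> Counter:
    """Count n-grams of bytes in a hexdump string (whitespace ignored)."""
    cleaned = "".join(hex_text.split()).upper()
    # Split into two-character byte tokens; a trailing odd character is dropped.
    tokens = [cleaned[i:i + 2] for i in range(0, len(cleaned) - 1, 2)]
    # Slide a window of n bytes, one byte at a time.
    return Counter("".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

counts = byte_ngrams("DE AD BE EF")
print(dict(counts))  # {'DEAD': 1, 'ADBE': 1, 'BEEF': 1}
```

On real files the number of distinct n-grams grows very quickly with n, so the counts are usually filtered or hashed before being handed to a model.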

Page 10: An Introduction to Malware Classification

Other features commonly used

• Opcodes, imports, etc.
• Assembly instructions
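Opcode features can be as simple as a histogram of mnemonics. The sketch below assumes a simplified listing format (one instruction per line, `;` comments, labels ending in `:`), which is not the exact layout of IDA output; the function name and the sample listing are made up for illustration.

```python
from collections import Counter

def opcode_counts(asm_text: str) -> Counter:
    """Count instruction mnemonics in a simplified assembly listing."""
    counts = Counter()
    for line in asm_text.splitlines():
        code = line.split(";", 1)[0].strip()  # drop comments
        if not code or code.endswith(":"):    # skip blanks and labels
            continue
        counts[code.split()[0].lower()] += 1  # first token is the mnemonic
    return counts

listing = """
start:
    push ebp
    mov  ebp, esp   ; set up frame
    mov  eax, 1
    pop  ebp
    ret
"""
histogram = opcode_counts(listing)
print(histogram.most_common())
```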

Page 11: An Introduction to Malware Classification

Top performing models

Again, use libraries!

• SVMs
• xgboost
• Deep Learning

Page 12: An Introduction to Malware Classification

Where we need more effort

• Stopping the “overfitting” problem
  • Models don’t generalize to unseen data/new networks
  • Happens even in high-tier conferences: see 2012 “Selecting Features to Classify Malware”
• More collaboration with malware analysts
• “Pentesting” machine learning models

Page 13: An Introduction to Malware Classification

Thank you! Questions?

[email protected], @jjseymour3