44CON 2014: Using Hadoop for malware, network, forensics and log analysis

Upload: michael-boman

Post on 16-Apr-2017

TRANSCRIPT

Using Hadoop for Malware, Network, Forensics and Log analysis

Michael Boman

[email protected]

http://blog.michaelboman.org

@mboman

Background

44CON 2012 Malware analysis as a hobby

DEEPSEC 2012 Malware analysis on a shoe-string budget

DEEPSEC 2013 - Malware Datamining and Attribution

VirusShare Malware Collection

2012-01-01

2014-07-21

5.8 TByte

VirusShare Latest Releases

What is Hadoop?

Distributed processing of large data sets (Big Data)

Runs on off-the-shelf hardware

Runs from a single node to thousands of machines

High failure tolerance: hardware is crappy and will fail

Hadoop components

Operating system (Red Hat, Ubuntu, Windows etc.)

Java Virtual Machine

Data Access

Data Storage

Interaction, Visualization, Execution, Development

Data Serialization

Data Intelligence

Data Integration

Sqoop

Flume

Chukwa

HDFS (distributed storage)

MapReduce (distributed processing)

YARN (distributed scheduling)

Pig

Hive

HBase

Cassandra

HCatalog

Lucene

Hama

Crunch

Avro

Thrift

Drill

Mahout

Management, Monitoring, Orchestration

Ambari

Zookeeper

Oozie

Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.

MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The Map function divides a query into multiple parts and processes data at the node level. The Reduce function aggregates the results of the Map function to determine the answer to the query.
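
To make the Map and Reduce split concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names and the sample input are hypothetical, and the grouping step that Hadoop performs between the two phases is imitated with a dictionary.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, 1) for one input record (here: a file-type label)."""
    yield (record.strip().lower(), 1)

def shuffle(pairs):
    """Group emitted values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: aggregate all values that share one key."""
    return key, sum(values)

def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input: one file-type label per malware sample.
SAMPLE_TYPES = ["PE32", "pdf", "PE32", "ELF", "PDF", "pe32"]
counts = run_job(SAMPLE_TYPES)
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and only the grouped intermediate pairs travel over the network to the reducers.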

Hive: Hive is a Hadoop-based data-warehousing framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL).

HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.

Flume: Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers and mobile devices, for example) to collect data and integrate it into Hadoop.

Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig and Hive, and then link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after the previous jobs on which it relies for data have completed.

Ambari: Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. Its development is being led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.

Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.

Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression and statistical modeling and implements them using the MapReduce model.

Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores such as relational databases and data warehouses into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.

HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.

BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.

How to obtain your Hadoop infrastructure (examples)

Pre-packaged distributions: Cloudera, Hortonworks

Rent: Amazon Web Services

Roll your own: compile from source

Malware Analysis - BinaryPig

Creates large archives of individual samples on HDFS as key/value sets (samples are small, HDFS likes them big)

Static analysis is done in batch

Results are stored in ElasticSearch for easy access/further analysis
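
The idea of bundling many small samples into one large key/value archive can be sketched as follows. This is an illustration only, not BinaryPig's code and not the Hadoop SequenceFile format: the length-prefixed record layout, the choice of the SHA-256 digest as key and the placeholder sample bytes are all assumptions.

```python
import hashlib
import io
import struct

HEADER = ">II"  # key length and value length, both big-endian 32-bit

def pack_samples(samples, out):
    """Append each sample as one length-prefixed (key, value) record.

    Key: SHA-256 hex digest of the content (an assumption).
    Value: the raw sample bytes.
    """
    for data in samples:
        key = hashlib.sha256(data).hexdigest().encode("ascii")
        out.write(struct.pack(HEADER, len(key), len(data)))
        out.write(key)
        out.write(data)

def unpack_samples(inp):
    """Yield (key, value) pairs back out of the container."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = inp.read(header_size)
        if len(header) < header_size:
            break
        key_len, data_len = struct.unpack(HEADER, header)
        key = inp.read(key_len).decode("ascii")
        yield key, inp.read(data_len)

# Placeholder bytes standing in for real samples.
SAMPLES = [b"MZ\x90\x00fake-pe", b"%PDF-1.4 fake-pdf", b"\x7fELFfake-elf"]

container = io.BytesIO()
pack_samples(SAMPLES, container)
container.seek(0)
records = list(unpack_samples(container))
print(len(records), "records,", container.getbuffer().nbytes, "bytes")
```

One big container file costs HDFS a single set of metadata entries instead of one per sample, which is why small files are packed before upload.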

Malware Analysis - BinaryPig

Extracting resource information

AV-(re)scanning

Scanning samples with new/updated Yara signatures

How does it work?

ZIP archive / local dir

BinaryPig

Sequence file

How does it work?

Sequence files stored in HDFS

How does it work?

Pig scripts for: hashes, ClamAV, Yara, strings
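
Two of these per-sample steps, hashing and string extraction, are easy to sketch in plain Python. This only illustrates what such a step computes for one sample; it is not BinaryPig's implementation, and the function names, the 4-character minimum and the sample bytes are assumptions.

```python
import hashlib
import re

# Runs of at least 4 printable ASCII characters, similar in spirit to strings(1).
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

def sample_hashes(data):
    """Return the usual identifying digests for one sample."""
    return {name: hashlib.new(name, data).hexdigest()
            for name in ("md5", "sha1", "sha256")}

def printable_strings(data):
    """Return every printable ASCII run found in the sample."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]

# Placeholder bytes standing in for a real sample.
SAMPLE = (b"\x00\x01MZ\x90\x00http://example.invalid/payload"
          b"\x00\xff\xfekernel32.dll\x00ab\x00")

hashes = sample_hashes(SAMPLE)
strings = printable_strings(SAMPLE)
print(hashes["sha256"])
print(strings)
```

In the batch setting, a function like this would be applied to the value of every key/value record in the sequence files.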

Network Analysis - PacketPig

PCAP in HDFS

Detecting anomalies and intrusion signatures

Learn time frame and identity of attacker

Triage incidents

"Show me packet captures I've never seen before."

How does it work?

PCAPs are created locally and uploaded to HDFS

How does it work?

PCAP uploaded to HDFS

How does it work?

Pig scripts for: Snort signatures, p0f, User-Agent extraction, whatever you want
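
As an illustration of the User-Agent extraction step, the sketch below pulls the header out of already reassembled HTTP request payloads. It is plain Python rather than a PacketPig script, it does not parse PCAP files, and the function name and example payloads are hypothetical.

```python
import re
from collections import Counter

# Match a User-Agent header line, case-insensitively, up to the line ending.
USER_AGENT = re.compile(rb"^User-Agent:[ \t]*(.*?)[ \t]*\r?$",
                        re.IGNORECASE | re.MULTILINE)

def extract_user_agents(payloads):
    """Return the User-Agent value of every payload that carries one."""
    agents = []
    for payload in payloads:
        match = USER_AGENT.search(payload)
        if match:
            agents.append(match.group(1).decode("latin-1"))
    return agents

# Hypothetical TCP payloads: two HTTP requests and one non-HTTP record.
PAYLOADS = [
    b"GET / HTTP/1.1\r\nHost: example.invalid\r\nUser-Agent: curl/7.35.0\r\n\r\n",
    b"\x16\x03\x01\x00\xa5",
    b"GET /a HTTP/1.1\r\nuser-agent: Mozilla/5.0 (X11)\r\nHost: example.invalid\r\n\r\n",
]

agents = extract_user_agents(PAYLOADS)
agent_counts = Counter(agents)
print(agent_counts.most_common())
```

Counting the extracted values across a large capture archive makes rare agents stand out, which is often where the interesting traffic is.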

Computer Forensics - Sleuth Kit Hadoop Framework

Uses both HDFS and HBase to store file information

Ingest

Analysis

Reporting

How does it work?

Fsrip dumps information about the image and about the files in the image

How does it work?

RAW disk image file is uploaded to HDFS

How does it work?

Populates HBase entries table with information from the hard drive files.

How does it work?

Extract raw file data

Keyword search

Extract text

Tokenize

Cluster similar objects

Compare with other images
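
The tokenize and keyword-search steps can be sketched with a small inverted index. This is a single-machine illustration in plain Python, not the Sleuth Kit Hadoop Framework's code; the file paths and contents are made up, and the tokenizer (lowercase alphanumeric runs) is an assumption.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Split extracted text into lowercase alphanumeric tokens."""
    return TOKEN.findall(text.lower())

def build_index(documents):
    """Map each token to the set of file paths that contain it."""
    index = defaultdict(set)
    for path, text in documents.items():
        for token in tokenize(text):
            index[token].add(path)
    return index

def search(index, keyword):
    """Return the sorted paths of all files containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))

# Hypothetical text extracted from files in a disk image.
DOCS = {
    "/home/user/notes.txt": "Transfer the password list tonight.",
    "/home/user/todo.txt": "Buy milk, call the bank.",
    "/tmp/cache.dat": "PASSWORD=hunter2",
}

index = build_index(DOCS)
hits = search(index, "password")
print(hits)
```

The same token sets can feed the later steps: files with largely overlapping tokens are candidates for clustering or for comparison against another image.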

How does it work?

Build a report from previous steps

Log Analysis

FLUME-agents push local logs to HDFS.

Pig scripts process data on schedule. Results from Pig are stored in HDFS / HBase.

HBase will have the data processed by Pig ready for reporting or further analysis.

Data interaction/extraction using REST services.
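
The kind of aggregation a scheduled Pig script might perform on collected logs can be sketched in plain Python. The example counts failed SSH logins per source address; the log lines, the regular expression and the function name are assumptions made for illustration, not part of the talk.

```python
import re
from collections import Counter

# OpenSSH-style failure message; captures the user name and the source IPv4 address.
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?(\S+) from (\d{1,3}(?:\.\d{1,3}){3})"
)

def failed_logins_by_source(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts

# Hypothetical auth log lines using documentation-only IP ranges.
LOG_LINES = [
    "Sep 10 10:00:01 host sshd[101]: Failed password for root from 192.0.2.10 port 4242 ssh2",
    "Sep 10 10:00:03 host sshd[102]: Failed password for invalid user admin from 192.0.2.10 port 4243 ssh2",
    "Sep 10 10:00:05 host sshd[103]: Accepted password for alice from 198.51.100.7 port 5000 ssh2",
    "Sep 10 10:00:09 host sshd[104]: Failed password for bob from 203.0.113.5 port 6000 ssh2",
]

failed = failed_logins_by_source(LOG_LINES)
print(failed.most_common())
```

In the pipeline described above, the per-address counts would be the rows written to HBase for reporting.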

How does it work?

FLUME-agents push local logs to HDFS

How does it work?

Pig scripts extract data and put it into HBase

How does it work?

Pig-scripts can perform additional analysis on HBase data

How do I do it?

Store malware samples locally

Upload samples to analyze to S3

Run EMR on samples on S3

Download the results from S3 to local

Saving money

Samples stored locally and backed up on Amazon Glacier.

Use Reduced Redundancy Storage on S3: 99.99% durability instead of 99.999999999%

Spot-bid on EC2 instances for EMR: ~$0.011 instead of $0.052

My AWS cost is expected to be about $20/month
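
The two trade-offs above can be put into numbers with a little arithmetic. The prices are the ones on the slide (presumably per instance-hour, which the slide does not state), the durability figures are treated as annual per-object probabilities, and the 10,000-object count is a hypothetical example.

```python
ON_DEMAND_PRICE = 0.052  # USD, from the slide
SPOT_PRICE = 0.011       # USD, from the slide

# Fraction of the on-demand price saved by bidding on spot instances (about 0.79).
spot_saving = 1 - SPOT_PRICE / ON_DEMAND_PRICE

def expected_annual_loss(object_count, durability):
    """Expected number of objects lost per year at a given durability."""
    return object_count * (1 - durability)

OBJECTS = 10_000  # hypothetical number of stored objects

rrs_loss = expected_annual_loss(OBJECTS, 0.9999)
standard_loss = expected_annual_loss(OBJECTS, 0.99999999999)

print(f"spot saving: {spot_saving:.1%}")
print(f"expected objects lost per year, reduced redundancy: {rrs_loss:.2f}")
print(f"expected objects lost per year, standard: {standard_loss:.7f}")
```

Losing about one object in 10,000 per year is acceptable here because the samples are also stored locally and backed up on Glacier, so S3 only holds a working copy.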

Conclusions

Malware Analysis

Network Analysis

Computer Forensics

Log Analysis

Questions?

[email protected]

@mboman

http://blog.michaelboman.org

[Chart data: VirusShare total size (GB) by date, rising from 13.56 GB at the first data point to 5785.09 GB at the last]
