44CON 2014: Using Hadoop for Malware, Network, Forensics and Log Analysis
Using Hadoop for Malware, Network, Forensics and Log analysis
Michael Boman
http://blog.michaelboman.org
@mboman
Background
44CON 2012 Malware analysis as a hobby
DEEPSEC 2012 Malware analysis on a shoe-string budget
DEEPSEC 2013 - Malware Datamining and Attribution
VirusShare Malware Collection
2012-01-01
2014-07-21
5.8 TByte
VirusShare Latest Releases
What is Hadoop?
Distributed processing of large data sets (Big Data)
Runs on off-the-shelf hardware
Runs from a single node to thousands of machines
High failure tolerance: hardware is crappy and will fail
Hadoop components
Operating system (Redhat, Ubuntu, Windows etc.)
Java Virtual Machine
Data Access
Data Storage
Interaction, Visualization, Execution, Development
Data Serialization
Data Intelligence
Data Integration
Sqoop
Flume
Chukwa
HDFS (distributed storage)
MapReduce (distributed processing)
YARN (distributed scheduling)
Pig
Hive
HBase
Cassandra
HCatalog
Lucene
Hama
Crunch
Avro
Thrift
Drill
Mahout
Management, Monitoring, Orchestration
Ambari
Zookeeper
Oozie
Hadoop Distributed File System: HDFS, the storage layer of Hadoop, is a distributed, scalable, Java-based file system adept at storing large volumes of unstructured data.
MapReduce: MapReduce is a software framework that serves as the compute layer of Hadoop. MapReduce jobs are divided into two (obviously named) parts. The Map function divides a query into multiple parts and processes data at the node level. The Reduce function aggregates the results of the Map function to determine the answer to the query.
Hive: Hive is a Hadoop-based data warehousing-like framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.
Pig: Pig Latin is a Hadoop-based language developed by Yahoo. It is relatively easy to learn and is adept at very deep, very long data pipelines (a limitation of SQL.)
HBase: HBase is a non-relational database that allows for low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes. eBay and Facebook use HBase heavily.
Flume: Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure (inside web servers, application servers and mobile devices, for example) to collect data and integrate it into Hadoop.
Oozie: Oozie is a workflow processing system that lets users define a series of jobs written in multiple languages such as Map Reduce, Pig and Hive -- then intelligently link them to one another. Oozie allows users to specify, for example, that a particular query is only to be initiated after specified previous jobs on which it relies for data are completed.
Ambari: Ambari is a web-based set of tools for deploying, administering and monitoring Apache Hadoop clusters. Its development is being led by engineers from Hortonworks, which includes Ambari in its Hortonworks Data Platform.
Avro: Avro is a data serialization system that allows for encoding the schema of Hadoop files. It is adept at parsing data and performing remote procedure calls.
Mahout: Mahout is a data mining library. It takes the most popular data mining algorithms for performing clustering, regression testing and statistical modeling and implements them using the Map Reduce model.
Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores such as relational databases and data warehouses into Hadoop. It allows users to specify the target location inside of Hadoop and instruct Sqoop to move data from Oracle, Teradata or other relational databases to the target.
HCatalog: HCatalog is a centralized metadata management and sharing service for Apache Hadoop. It allows for a unified view of all data in Hadoop clusters and allows diverse tools, including Pig and Hive, to process any data elements without needing to know physically where in the cluster the data is stored.
BigTop: BigTop is an effort to create a more formal process or framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.
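To make the MapReduce description above more concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are made up for illustration. It counts how many samples of each file type appear in a small list of records.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (key, 1) pair per input record."""
    _sample_id, file_type = record
    yield file_type, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: (sample id, file type) records.
RECORDS = [("s1", "PE32"), ("s2", "PDF"), ("s3", "PE32"), ("s4", "ELF")]

if __name__ == "__main__":
    print(run_job(RECORDS))
```

On a real cluster the map calls would run in parallel on the nodes that hold the data, and the shuffle step would move the intermediate pairs across the network to the reducers.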
How to obtain your Hadoop infrastructure (examples)
Pre-packaged distributions: Cloudera, Hortonworks
Rent: Amazon Web Services
Roll your own: compile from source
Malware Analysis - BinaryPig
Creates large archives of individual samples on HDFS as key/value sets (samples are small, HDFS likes them big)
Static analysis is done in batch
Results are stored in ElasticSearch for easy access/further analysis
Malware Analysis - BinaryPig
Extracting resource information
AV-(re)scanning
Scanning samples with new/updated Yara signatures
How does it work?
ZIP archive or local directory -> BinaryPig -> sequence file
How does it work?
Sequence files stored in HDFS
How does it work?
Pig scripts for: hashes, ClamAV, Yara, strings
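The hashes and strings steps listed above can be sketched in plain Python. This is only an illustration of the per-sample work, not BinaryPig's actual Pig scripts: the `(name, bytes)` record layout stands in for the key/value pairs of a sequence file, and `extract_strings`, `analyze` and the sample data are hypothetical names introduced here.

```python
import hashlib
import re


def extract_strings(data, min_len=4):
    """Return printable ASCII runs of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{" + str(min_len).encode() + rb",}")
    return [m.decode("ascii") for m in pattern.findall(data)]


def analyze(records):
    """Yield one result dict per (name, data) key/value record."""
    for name, data in records:
        yield {
            "name": name,
            "md5": hashlib.md5(data).hexdigest(),
            "sha256": hashlib.sha256(data).hexdigest(),
            "strings": extract_strings(data),
        }


# Hypothetical sample: a few bytes with two embedded printable strings.
SAMPLES = [
    ("sample-a",
     b"\x00\x01MZ\x90\x00hello world\x00\xffab\x00http://example.test/x\x00"),
]

if __name__ == "__main__":
    for result in analyze(SAMPLES):
        print(result)
```

Each result is a flat dictionary, which is a convenient shape for indexing into a search store such as ElasticSearch afterwards.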
Network Analysis - PacketPig
PCAP in HDFS
Detecting anomalies and intrusion signatures
Learn time frame and identity of attacker
Triage incidents
Show me packet captures I've never seen before.
How does it work?
PCAPs are created locally and uploaded to HDFS
How does it work?
PCAP uploaded to HDFS
How does it work?
Pig scripts for: Snort signatures, p0f, User-Agent extraction, whatever you want
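As a rough idea of what the User-Agent extraction step above does, the following Python sketch pulls the `User-Agent` header out of raw HTTP request payloads and counts the values. It assumes the payloads have already been reassembled from the PCAP; the names `UA_RE`, `extract_user_agents` and `PAYLOADS` are hypothetical, and this is not PacketPig's own code.

```python
import re
from collections import Counter

# Match a User-Agent header line; header names are case-insensitive.
UA_RE = re.compile(
    rb"^User-Agent:[ \t]*(.*?)[ \t]*\r?$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_user_agents(payloads):
    """Count User-Agent values across raw HTTP request payloads."""
    counts = Counter()
    for payload in payloads:
        match = UA_RE.search(payload)
        if match:
            counts[match.group(1).decode("latin-1")] += 1
    return counts


# Hypothetical payloads: three HTTP requests and one non-HTTP blob.
PAYLOADS = [
    b"GET / HTTP/1.1\r\nHost: example.test\r\nUser-Agent: curl/7.30.0\r\n\r\n",
    b"GET /a HTTP/1.1\r\nHost: example.test\r\nuser-agent: Mozilla/5.0\r\n\r\n",
    b"GET /b HTTP/1.1\r\nHost: example.test\r\nUser-Agent: curl/7.30.0\r\n\r\n",
    b"\x16\x03\x01\x00\x2e",
]

if __name__ == "__main__":
    for agent, count in extract_user_agents(PAYLOADS).most_common():
        print(count, agent)
```

Rare or never-before-seen values in such a count are a simple starting point for spotting unusual clients.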
Computer Forensics - Sleuth Kit Hadoop Framework
Uses both HDFS and HBase to store file information
Ingest
Analysis
Reporting
How does it work?
Fsrip dumps information about the image and information about the files in the image
How does it work?
RAW disk image file is uploaded to HDFS
How does it work?
Populates HBase entries table with information from the hard drive files.
How does it work?
Extract raw file data
Keyword search
Extract text
Tokenize
Cluster similar objects
Compare with other images
How does it work?
Build a report from previous steps
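Three of the analysis steps above (tokenize, keyword search, compare similar objects) can be illustrated with a small Python sketch. This is not the Sleuth Kit Hadoop Framework's implementation; Jaccard similarity over token sets is swapped in here as a simple stand-in for its clustering and comparison, and the file names and contents are invented.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_hits(files, keywords):
    """Return {filename: sorted keywords found} for files with any hit."""
    wanted = {k.lower() for k in keywords}
    hits = {}
    for name, text in files.items():
        found = tokenize(text) & wanted
        if found:
            hits[name] = sorted(found)
    return hits


def jaccard(text_a, text_b):
    """Similarity of two texts as overlap of their token sets (0.0 to 1.0)."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Hypothetical text extracted from files in a disk image.
FILES = {
    "notes.txt": "Transfer the password list tonight",
    "todo.txt": "Buy milk and bread",
    "notes_copy.txt": "transfer the password list tonight!",
}

if __name__ == "__main__":
    print(keyword_hits(FILES, ["password", "invoice"]))
    print(jaccard(FILES["notes.txt"], FILES["notes_copy.txt"]))
```

Files with a high pairwise similarity score are candidates for the same cluster, and the keyword hits feed directly into the report.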
Log Analysis
Flume agents push local logs to HDFS.
Pig scripts process data on schedule. Results from Pig are stored in HDFS / HBase.
HBase will have the data processed by Pig ready for reporting or further analysis.
Data interaction/extraction using REST services.
How does it work?
Flume agents push local logs to HDFS
How does it work?
Pig scripts extract data and put it into HBase
How does it work?
Pig-scripts can perform additional analysis on HBase data
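The kind of extraction a scheduled Pig script might perform on the collected logs can be sketched in Python. The talk does not say which logs or fields are processed, so this example assumes sshd-style authentication log lines and counts failed logins per source address; `FAILED_RE`, `failed_logins_by_source` and the log lines are illustrative only.

```python
import re
from collections import Counter

# sshd logs failed attempts as "Failed password for [invalid user] NAME from ADDR ...".
FAILED_RE = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")


def failed_logins_by_source(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Hypothetical log lines using documentation-only IP addresses.
LOG_LINES = [
    "Sep 10 10:00:01 host sshd[101]: Failed password for root from 192.0.2.10 port 4242 ssh2",
    "Sep 10 10:00:03 host sshd[101]: Failed password for invalid user admin from 192.0.2.10 port 4243 ssh2",
    "Sep 10 10:00:09 host sshd[102]: Accepted password for alice from 198.51.100.7 port 5000 ssh2",
    "Sep 10 10:01:00 host sshd[103]: Failed password for bob from 203.0.113.5 port 6000 ssh2",
]

if __name__ == "__main__":
    for source, count in failed_logins_by_source(LOG_LINES).most_common():
        print(source, count)
```

In the pipeline described above, the per-source counts would be written to HBase, where they are available for reporting or further analysis.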
How do I do it?
Store malware samples locally
Upload samples to analyze to S3
Run EMR on samples on S3
Download the results from S3 to local
Saving money
Samples stored locally and backed up on Amazon Glacier.
Use Reduced Redundancy Storage on S3: 99.99% durability instead of 99.999999999%
Spot-bid on EC2 instances for EMR: ~$0.011 instead of $0.052
My AWS cost is expected to be about $20/month
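The spot-price saving quoted above works out to roughly 79%. The short Python sketch below shows the arithmetic. The slide gives the two prices without units, so treating them as prices per instance-hour is an assumption, and the cluster size and run time in the usage example are hypothetical.

```python
ON_DEMAND_PRICE = 0.052  # from the slide; assumed to be USD per instance-hour
SPOT_PRICE = 0.011       # from the slide; assumed to be USD per instance-hour


def spot_savings(on_demand, spot):
    """Fraction of the on-demand price saved by paying the spot price."""
    return (on_demand - spot) / on_demand


def job_cost(price, instances, hours):
    """Cost of running a number of instances for a number of hours."""
    return price * instances * hours


if __name__ == "__main__":
    print(f"saving: {spot_savings(ON_DEMAND_PRICE, SPOT_PRICE):.1%}")
    # Hypothetical job: 10 instances for 5 hours.
    print(f"on-demand: ${job_cost(ON_DEMAND_PRICE, 10, 5):.2f}")
    print(f"spot:      ${job_cost(SPOT_PRICE, 10, 5):.2f}")
```

Spot prices fluctuate and spot instances can be reclaimed, so the figures are an estimate rather than a guaranteed bill.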
Conclusions
Malware Analysis
Network Analysis
Computer Forensics
Log Analysis
Questions?
@michael
http://blog.michaelboman.org
[Chart: VirusShare total size (GB) by date]