Source: simson.net/ref/2009/2009-11-17 adomex.pdf
UNCLASSIFIED
Automated Document and Media Exploitation at NPS
Simson L. Garfinkel, Ph.D.
Associate Professor
Department of Computer Science
Naval Postgraduate School
http://www.simson.net/
1
Media: USB memory sticks, digital cameras, cyber cafes, websites(*)
Sources: searches, border crossings, web searches
US forces encounter large numbers of digital documents and electronic media.
2
[Figure: image of a sample text document (body text garbled in extraction) and a June 2007 calendar]
Most of this data is analyzed using trained personnel and off-the-shelf software.
DOMEX in Iraq
3
Software is mostly COTS GUI tools designed for law enforcement.
EnCase by Guidance Software
4
Designed for convictions, not exploitation. Does not scale to 100s or 1000s of drives. Most vendor "research" is reverse-engineering.
Manual analysis misses opportunities.
Different analysts see different hard drives. Keyword searches don't connect the dots.
5
email from [email protected] to [email protected]
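The slide does not say how the dots would be connected automatically. As one illustrative possibility only, the sketch below extracts email addresses from the raw bytes of several drives and reports the addresses that appear on more than one. The function name, the regular expression, and the demo data are all assumptions made for illustration, not a tool described in the talk.

```python
import re
from collections import defaultdict

# Simple (deliberately loose) pattern for ASCII email addresses in raw bytes.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def shared_addresses(drives, min_drives=2):
    """Return {address: sorted drive ids} for addresses seen on >= min_drives drives.

    `drives` maps a drive identifier to that drive's raw bytes (hypothetical input;
    a real tool would stream the image instead of holding it in memory).
    """
    seen = defaultdict(set)
    for drive_id, data in drives.items():
        for match in EMAIL_RE.findall(data):
            seen[match.decode("ascii").lower()].add(drive_id)
    return {addr: sorted(ids) for addr, ids in seen.items() if len(ids) >= min_drives}


# Made-up demo data: two drives mention the same correspondent.
DEMO_DRIVES = {
    "drive-A": b"email from alice@example.com to bob@example.com\n",
    "drive-B": b"\x00\x01contact: BOB@example.com\x00",
    "drive-C": b"nothing of interest here",
}

print(shared_addresses(DEMO_DRIVES))
```

A per-drive keyword search would find each address in isolation; grouping hits by identifier across drives is what surfaces the link between drive-A and drive-B.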
Tools designed for Law Enforcement do not use the data once the investigation is over.
The bad guy goes to prison. The hard drive goes to storage.
6
New techniques & algorithms:
- Highly automated for minimal training.
- Designed to exploit the "data rich environment."
- Scientific breakthroughs that can be incorporated into COTS and GOTS tools.
Standardized Research Corpora:
Real Data:
- 2000+ hard drives & USB sticks from around the world.
- 1 million publicly releasable documents.
Realistic Data:
- Purpose-built disk images for tool testing & training.
We are building tools for performing Automated DOMEX.
7
Statistical Drive Sampling: Analyzing a 1TB drive in 2 minutes.
8
Identifiable: blank sectors, JPEGs, encrypted data, HTML
Report:
- Audio data reported by iTunes: 2.42GB
- MP3 files reported by file system: 2.39GB
- Estimated MP3 usage:
  - 2.71GB (1.70%) with 5,000 random samples
  - 2.49GB (1.56%) with 10,000 random samples
- Total time: 118 seconds
9
We identify the content of a 160GB iPod in 118 seconds.
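The estimate above follows from simple proportion arithmetic: classify a random sample of sectors, then scale each class's share of the sample up to the size of the whole device. The sketch below shows that arithmetic under stated assumptions. The function names, the synthetic "drive", and the toy classifier are hypothetical; recognizing MP3 or encrypted data in a real sector is much harder than the two-byte check used here.

```python
import random


def estimate_usage(read_sector, total_sectors, sector_size, classify, n_samples, seed=0):
    """Estimate bytes per content type from randomly sampled sectors.

    Sectors are drawn with replacement, which is fine when n_samples is
    small relative to total_sectors.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        index = rng.randrange(total_sectors)
        label = classify(read_sector(index))
        counts[label] = counts.get(label, 0) + 1
    total_bytes = total_sectors * sector_size
    # Scale each label's share of the sample to the full device size.
    return {label: total_bytes * count / n_samples for label, count in counts.items()}


# Synthetic device (assumption): 100,000 sectors, the first 10% hold "MP3-like" data.
TOTAL_SECTORS = 100_000
SECTOR_SIZE = 512


def toy_read_sector(index):
    if index < TOTAL_SECTORS // 10:
        return b"\xff\xfb" + bytes(SECTOR_SIZE - 2)
    return bytes(SECTOR_SIZE)


def toy_classify(data):
    # Toy stand-in for a real sector discriminator.
    if not any(data):
        return "blank"
    if data.startswith(b"\xff\xfb"):
        return "mp3"
    return "other"


estimate = estimate_usage(toy_read_sector, TOTAL_SECTORS, SECTOR_SIZE, toy_classify, 5000)
for label, size in sorted(estimate.items()):
    print(f"{label}: {size / 1e6:.2f} MB")
```

Because only the sampled sectors are read, the cost depends on the number of samples rather than on the size of the device. This is consistent with the slide's figures, where doubling the samples from 5,000 to 10,000 moves the estimate closer to the file system's reported value.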
Carved data ascription: automatically determining the owner of data in a multi-user system.
… Or finding the match for information discovered on portable media.
10
Science requires the scientific process.
Hallmarks of Science:
- Controlled and repeatable experiments.
- No privileged observers.
Why repeat some other scientist's experiment?
- Validate that an algorithm is properly implemented.
- Determine if your new algorithm is better than someone else's old one.
We can't do this today.
- Bob's tool can identify 70% of the data in the Windows Registry.
  - He publishes a paper.
- Alice writes her own tool and can only identify 60%.
  - She writes to Bob and asks for his data.
  - Bob can't share the data because of copyright & privacy issues.
To address this problem, we are creating releasable corpora.
11
1 million documents downloaded from US Government web servers
- Specifically for file identification, data & metadata extraction.
- Found by random word searches on Google & Yahoo.
- DOC, DOCX, HTML, ASCII, SWF, etc.
Free to use; free to redistribute
- No copyright issues: US Government work is not copyrightable.
- Other files have simply been moved from one USG webserver to another.
- No PII issues: these files were already released.
Distribution format: ZIP files
- 1000 ZIP files with 1000 files each.
- 10 "threads" of 1000 randomly chosen files for student projects.
- Full provenance for every file (how found; when downloaded; SHA1; etc.)
http://domex.nps.edu/corp/files/
12
NPS-govdocs1: 1 Million files available now
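Since the corpus ships as ZIP files with a SHA1 recorded for every file, a researcher can confirm that a downloaded archive matches the published hashes before running experiments. The sketch below is one minimal way to do that. The function names and member file names are hypothetical, and the format of the corpus provenance records is not given in the slides, so the expected hashes are passed in as a plain dictionary.

```python
import hashlib
import io
import zipfile


def sha1_of_zip_members(zip_source):
    """Return {member name: SHA1 hex digest} for every file in a ZIP archive."""
    digests = {}
    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            hasher = hashlib.sha1()
            with archive.open(info) as member:
                # Read in chunks so large members do not need to fit in memory.
                for chunk in iter(lambda: member.read(65536), b""):
                    hasher.update(chunk)
            digests[info.filename] = hasher.hexdigest()
    return digests


def find_mismatches(digests, expected):
    """List names whose computed SHA1 is missing or differs from the expected one."""
    return sorted(name for name, sha1 in expected.items() if digests.get(name) != sha1)


def build_demo_zip():
    """Build a tiny in-memory archive with made-up member names for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("000001.txt", b"hello")
        archive.writestr("000002.html", b"<html></html>")
    buffer.seek(0)
    return buffer


demo_digests = sha1_of_zip_members(build_demo_zip())
for name, digest in sorted(demo_digests.items()):
    print(name, digest)
```

Two researchers who both verify their copies this way know they are testing on identical files, which is the precondition for the repeatable experiments the previous slide calls for.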
Test Images:
- nps-2009-hfstest1 (HFS+)
- nps-2009-ntfs1 (NTFS)
Realistic Images:
- nps-2009-canon2 (FAT32)
- nps-2009-UBNIST1 (FAT32)
- nps-2009-casper-rw (embedded EXT3)
- nps-2009-domexusers (NTFS)
Each image has:
- Narrative of how the image was created and expected uses.
- Image file in RAW/SPLITRAW, AFF and E01 formats.
- SHA1 of raw image.
- "Ground truth" report.
http://digitalcorpora.org/
13
We have created six disk images.
The Real Data Corpus: "Real Data from Real People."
Most forensic work is based on “realistic” data created in a lab.
We get real data from CN, IN, IL, MX, and other countries.
Real data provides:
- Real-world experience with data management problems.
- Unpredictable OS, software, & content.
- Unanticipated faults.
We have multiple corpora:
- Non-US Persons Corpus
- US Persons Corpus (@Harvard)
- Releasable Real Corpus
- Realistic Corpus
14
Real Data Corpus: Current Status
15
Corpus   HDs   Flash   CDs     GB
US*     1258                 2939
BA         7                   38
CA        46       1          420
CN        26     568    98    999
DE        37       1          765
GR        10                    6
IL       152       4          964
IN                66           29
MX       156                  571
NZ         1                    4
TH         1       3           13
Total   1694     643    98   6748
Note: IRB Approval is Mandatory!
* Not available to USG
Forensic XML standard enables data sharing between researchers and interchange between tools.
Supports: NTFS, FAT, Ext2/3, UFS/2, HFS, ISO 9660
16
AFF
fiwalk
ARFF
XML
fiwalk XML with metadata...
<fileobject>
<Orientation>right - top</Orientation>
<x-Resolution>0.01</x-Resolution>
<y-Resolution>1.42</y-Resolution>
<Resolution-Unit>Inch</Resolution-Unit>
<YCbCr-Positioning>centered</YCbCr-Positioning>
<Compression>JPEG compression</Compression>
<x-Resolution>0.00</x-Resolution>
<y-Resolution>0.00</y-Resolution>
<Resolution-Unit>Inch</Resolution-Unit>
<Exif-Version>Exif Version 2.1</Exif-Version>
<ComponentsConfiguration>Y Cb Cr -</ComponentsConfiguration>
<FlashPixVersion>Unknown FlashPix Version</FlashPixVersion>
<Color-Space>sRGB</Color-Space>
...
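Because the metadata is plain XML, it can be consumed with a standard XML parser. The sketch below reads `<fileobject>` elements and collects their child fields using Python's standard library. The element names are copied from the excerpt above, while the enclosing `<fiwalk>` root element and the function name are assumptions made so the sample is well-formed. Values are kept in lists because a tag such as `x-Resolution` can occur more than once in a single file object.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample; the <fiwalk> wrapper is an assumption for well-formedness.
SAMPLE_XML = """<fiwalk>
  <fileobject>
    <Orientation>right - top</Orientation>
    <x-Resolution>0.01</x-Resolution>
    <y-Resolution>1.42</y-Resolution>
    <Compression>JPEG compression</Compression>
    <x-Resolution>0.00</x-Resolution>
    <Color-Space>sRGB</Color-Space>
  </fileobject>
</fiwalk>"""


def fileobject_metadata(xml_text):
    """Return one {tag: [values...]} dict per <fileobject> element."""
    root = ET.fromstring(xml_text)
    records = []
    for fileobject in root.iter("fileobject"):
        fields = {}
        for child in fileobject:
            # Repeated tags are preserved in document order.
            fields.setdefault(child.tag, []).append((child.text or "").strip())
        records.append(fields)
    return records


for record in fileobject_metadata(SAMPLE_XML):
    for tag, values in record.items():
        print(f"{tag}: {values}")
```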
17
Automated Metadata Extraction, Maj. James Migletz, NPS Master's Thesis, June 2008
Rear Admiral Grace Murray Hopper Computer Science Award
Recent Publications
Publications:
- Garfinkel, Farrell, Roussev and Dinolt, "Bringing Science to Digital Forensics with Standardized Forensic Corpora," Best Paper, DFRWS, August 2009.
- Roussev & Garfinkel, "File Fragment Classification - The Case for Specialized Approaches," Systematic Approaches to Digital Forensics Engineering (IEEE/SADFE 2009), Oakland, CA, June 2009.
- Garfinkel, "Automating Disk Forensic Processing with SleuthKit, XML and Python," IEEE/SADFE, Oakland, CA, June 2009.
- Garfinkel & Migletz, "The new XML Office Document Files," IEEE Security & Privacy Magazine, March/April 2009.
- Farrell, Garfinkel & White, "Practical Applications of Bloom filters to the NIST RDS and hard drive triage," Annual Computer Security Applications Conference 2008, Anaheim, CA, December 2008.
18
Student Theses
- Cpt. Daniel Huynh, "Exploring and Validating Data Mining Algorithms for use in Data Ascription," Master's Thesis, June 2008
- Maj. James Migletz, "Automated Metadata Extraction," Master's Thesis, June 2008
- Mr. Steven D. Bassi Jr., "An Automated Acquisition System for Media Exploitation," June 2008
- LT Paul Farrell, "A Framework for Automated Digital Forensic Reporting," March 2009
- Capt. James Regan, "Recovery of Deleted Data from Flash Memory Device," June 2009
19
Sponsorship to Date
DIA Automated Media Exploitation (SEP 09 - SEP 10)
- Basic and applied research.
USMC STRIKE Development (JUN 09 - SEP 10)
- Goal: Provide assistance to STRIKE mobile forensics platform.
DARPA Sector Discrimination Project (JAN 09 - SEP 09)
- Goal: Dramatically improve sector discrimination.
- Application: 5-minute hard drive analysis using statistical sampling.
NIST Computer Forensic Tool Testing Support (OCT 08 - JAN 09)
- Produce forensic-quality disk images for testing forensic software.
- Deliver Linux kernel modifications to support testing of forensic hardware.
20