Automated Document and Media Exploitation at NPS
Source: simson.net/ref/2009/2009-11-17 ADOMEX.pdf

UNCLASSIFIED

Simson L. Garfinkel, Ph.D.
Associate Professor
Department of Computer Science
Naval Postgraduate School
http://www.simson.net/




US forces encounter large numbers of digital documents and electronic media.

Media:
- USB Memory Sticks
- Digital Cameras
- Cyber Cafes
- Websites(*)

Sources:
- Searches
- Border crossings
- Web searches


Most of this data is analyzed by trained personnel using off-the-shelf software.

DOMEX in Iraq


Software is mostly COTS GUI tools designed for law enforcement.

EnCase by Guidance Software


Designed for convictions, not exploitation. Does not scale to 100s or 1000s of drives. Most vendor "research" is reverse-engineering.


Manual analysis misses opportunities.

Different analysts see different hard drives. Keyword searches don't connect the dots.


email from [email protected] to [email protected]

[email protected]


Tools designed for Law Enforcement do not use the data once the investigation is over.

The bad guy goes to prison. The hard drive goes to storage.


We are building tools for performing Automated DOMEX.

New techniques & algorithms:
- Highly automated for minimal training.
- Designed to exploit the "data rich environment."
- Scientific breakthroughs that can be incorporated into COTS and GOTS tools.

Standardized Research Corpora:
- Real Data:
  - 2000+ hard drives & USB sticks from around the world.
  - 1 million publicly releasable documents
- Realistic Data:
  - Purpose-built disk images for tool testing & training


Statistical Drive Sampling: Analyzing a 1TB drive in 2 minutes.


We identify the content of a 160GB iPod in 118 seconds.

Identifiable:
- Blank sectors
- JPEGs
- Encrypted data
- HTML

Report:
- Audio Data Reported by iTunes: 2.42GB
- MP3 files reported by file system: 2.39GB
- Estimated MP3 usage:
  - 2.71GB (1.70%) with 5,000 random samples
  - 2.49GB (1.56%) with 10,000 random samples

Total time: 118 seconds
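The estimate on this slide can be read as a sampled proportion: each randomly chosen sector is classified, and the fraction of samples in a class approximates the fraction of the drive in that class. The following is a minimal sketch of that idea, not the NPS implementation. The names `classify_sector` and `estimate_usage`, the 512-byte sector size, and the toy classifier (which only separates blank sectors from everything else) are all assumptions made for illustration.

```python
import io
import math
import random

SECTOR_SIZE = 512  # bytes; assumed sector size


def classify_sector(data: bytes) -> str:
    """Toy classifier (an assumption): label a sector 'blank' or 'data'."""
    return "blank" if data.count(0) == len(data) else "data"


def estimate_usage(image, image_size, n_samples, seed=0):
    """Estimate the fraction of sectors in each class from random samples.

    image is a seekable binary file object and image_size its size in bytes.
    Returns {label: (fraction, standard_error)}.
    """
    rng = random.Random(seed)
    n_sectors = image_size // SECTOR_SIZE
    counts = {}
    for _ in range(n_samples):
        sector = rng.randrange(n_sectors)
        image.seek(sector * SECTOR_SIZE)
        label = classify_sector(image.read(SECTOR_SIZE))
        counts[label] = counts.get(label, 0) + 1
    result = {}
    for label, count in counts.items():
        p = count / n_samples
        # Standard error of a sampled proportion (binomial approximation).
        result[label] = (p, math.sqrt(p * (1 - p) / n_samples))
    return result


# Synthetic 1000-sector image: the first 250 sectors hold data, the rest are blank.
image_bytes = b"\xff" * (250 * SECTOR_SIZE) + b"\x00" * (750 * SECTOR_SIZE)
estimates = estimate_usage(io.BytesIO(image_bytes), len(image_bytes), n_samples=5000)
for label, (fraction, stderr) in sorted(estimates.items()):
    print(f"{label}: {fraction:.3f} +/- {stderr:.3f}")
```

The standard error shrinks with the square root of the number of samples and does not depend on the size of the drive, which is why a fixed number of random reads can characterize a very large device.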


Carved data ascription: automatically determining the owner of data in a multi-user system.


... or finding the match for information discovered on portable media.


Science requires the scientific process.

Hallmarks of Science:
- Controlled and repeatable experiments.
- No privileged observers.

Why repeat some other scientist's experiment?
- Validate that an algorithm is properly implemented.
- Determine if your new algorithm is better than someone else's old one.

We can't do this today.
- Bob's tool can identify 70% of the data in the Windows registry.
  - He publishes a paper.
- Alice writes her own tool and can only identify 60%.
  - She writes to Bob and asks for his data.
  - Bob can't share the data because of copyright & privacy issues.

To address this problem, we are creating releasable corpora.


NPS-govdocs1: 1 Million files available now

1 million documents downloaded from US Government web servers:
- Specifically for file identification, data & metadata extraction.
- Found by random word searches on Google & Yahoo.
- DOC, DOCX, HTML, ASCII, SWF, etc.

Free to use; free to redistribute:
- No copyright issues: US Government work is not copyrightable.
- Other files have simply been moved from one USG webserver to another.
- No PII issues: these files were already released.

Distribution format: ZIP files
- 1000 ZIP files with 1000 files each.
- 10 "threads" of 1000 randomly chosen files for student projects.
- Full provenance for every file (how found; when downloaded; SHA1; etc.)

http://domex.nps.edu/corp/files/
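Because the provenance record for each file includes a SHA1, a researcher can confirm that a local copy matches the distributed file before running an experiment on it. The sketch below shows one way to do that with Python's standard `hashlib`. The helper names `sha1_of_file` and `verify` are hypothetical, and the demonstration uses a temporary stand-in file rather than a real corpus file, since the corpus's provenance format is not shown here.

```python
import hashlib
import os
import tempfile


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_sha1):
    """True if the file's SHA1 matches the recorded provenance value."""
    return sha1_of_file(path) == expected_sha1.lower()


# Demonstration with a temporary stand-in file (not a real corpus file).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name
try:
    # SHA1("abc") is a standard published test vector.
    ok = verify(demo_path, "a9993e364706816aba3e25717850c26c9cd0d89d")
    bad = verify(demo_path, "0" * 40)
finally:
    os.remove(demo_path)
print(ok, bad)
```

Reading in chunks keeps memory use constant, which matters when the same check is applied to disk images rather than small documents.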


We have created six disk images.

Test Images:
- nps-2009-hfstest1 (HFS+)
- nps-2009-ntfs1 (NTFS)

Realistic Images:
- nps-2009-canon2 (FAT32)
- nps-2009-UBNIST1 (FAT32)
- nps-2009-casper-rw (embedded EXT3)
- nps-2009-domexusers (NTFS)

Each image has:
- Narrative of how the image was created and expected uses.
- Image file in RAW/SPLITRAW, AFF and E01 formats
- SHA1 of raw image
- "Ground truth" report

http://digitalcorpora.org/


The Real Data Corpus: "Real Data from Real People."

Most forensic work is based on “realistic” data created in a lab.

We get real data from CN, IN, IL, MX, and other countries.

Real data provides:
- Real-world experience with data management problems.
- Unpredictable OS, software, & content
- Unanticipated faults

We have multiple corpora:
- Non-US Persons Corpus
- US Persons Corpus (@Harvard)
- Releasable Real Corpus
- Realistic Corpus


Real Data Corpus: Current Status

Corpus   HDs   Flash   CDs     GB
US*     1258                 2939
BA         7                   38
CA        46       1          420
CN        26     568    98    999
DE        37       1          765
GR        10                    6
IL       152       4          964
IN                66           29
MX       156                  571
NZ         1                    4
TH         1       3           13
Total   1694     643    98   6748

Note: IRB Approval is Mandatory!

* Not available to USG


Forensic XML standard enables data sharing between researchers and interchange between tools.

Supports: NTFS, FAT, Ext2/3, UFS/2, HFS, ISO 9660

Diagram: AFF -> fiwalk -> ARFF / XML


fiwalk XML with metadata...

<fileobject>

<Orientation>right - top</Orientation>

<x-Resolution>0.01</x-Resolution>

<y-Resolution>1.42</y-Resolution>

<Resolution-Unit>Inch</Resolution-Unit>

<YCbCr-Positioning>centered</YCbCr-Positioning>

<Compression>JPEG compression</Compression>

<x-Resolution>0.00</x-Resolution>

<y-Resolution>0.00</y-Resolution>

<Resolution-Unit>Inch</Resolution-Unit>

<Exif-Version>Exif Version 2.1</Exif-Version>

<ComponentsConfiguration>Y Cb Cr -</ComponentsConfiguration>

<FlashPixVersion>Unknown FlashPix Version</FlashPixVersion>

<Color-Space>sRGB</Color-Space>

...


Automated Metadata Extraction, Maj. James Migletz, NPS Master's Thesis, June 2008

Rear Admiral Grace Murray Hopper Computer Science Award
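Metadata in this form can be consumed with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` to collect the child elements of one `<fileobject>` into a dictionary. The sample is a shortened, hand-closed excerpt of the fragment shown on the slide (the closing tag is an assumption, since the slide truncates the element), and `metadata_of` is a hypothetical helper, not part of fiwalk. Values are kept as lists because a tag such as `x-Resolution` can appear more than once in the same file object.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-closed excerpt of the slide's fragment; the closing tag is an assumption.
SAMPLE = """
<fileobject>
  <Orientation>right - top</Orientation>
  <x-Resolution>0.01</x-Resolution>
  <Compression>JPEG compression</Compression>
  <x-Resolution>0.00</x-Resolution>
  <Exif-Version>Exif Version 2.1</Exif-Version>
  <Color-Space>sRGB</Color-Space>
</fileobject>
"""


def metadata_of(fileobject):
    """Map each child tag to the list of its text values, in document order."""
    values = defaultdict(list)
    for child in fileobject:
        values[child.tag].append((child.text or "").strip())
    return dict(values)


meta = metadata_of(ET.fromstring(SAMPLE))
print(meta["x-Resolution"])
print(meta["Color-Space"])
```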


Recent Publications

Publications:
- Garfinkel, Farrell, Roussev and Dinolt, "Bringing Science to Digital Forensics with Standardized Forensic Corpora," Best Paper, DFRWS, August 2009
- Roussev and Garfinkel, "File Fragment Classification - The Case for Specialized Approaches," Systematic Approaches to Digital Forensics Engineering (IEEE/SADFE 2009), Oakland, California, June 2009
- Garfinkel, "Automating Disk Forensic Processing with SleuthKit, XML and Python," IEEE/SADFE, Oakland, CA, June 2009
- Garfinkel & Migletz, "The new XML Office Document Files," IEEE Security & Privacy Magazine, March/April 2009
- Farrell, Garfinkel and White, "Practical Applications of Bloom filters to the NIST RDS and hard drive triage," Annual Computer Security Applications Conference 2008, Anaheim, California, December 2008


Student Theses

- Cpt. Daniel Huynh, "Exploring and Validating Data Mining Algorithms for use in Data Ascription," Master's Thesis, June 2008
- Maj. James Migletz, "Automated Metadata Extraction," Master's Thesis, June 2008
- Mr. Steven D. Bassi Jr., "An Automated Acquisition System for Media Exploitation," June 2008
- LT Paul Farrell, "A Framework for Automated Digital Forensic Reporting," March 2009
- Capt. James Regan, "Recovery of Deleted Data from Flash Memory Device," June 2009


Sponsorship to Date

DIA Automated Media Exploitation (SEP 09 - SEP 10)
- Basic and applied research.

USMC STRIKE Development (JUN 09 - SEP 10)
- Goal: Provide assistance to STRIKE mobile forensics platform.

DARPA Sector Discrimination Project (JAN 09 - SEP 09)
- Goal: Dramatically Improve Sector Discrimination
- Application: 5-minute Hard Drive Analysis using statistical sampling

NIST Computer Forensic Tool Testing Support (OCT 08 - JAN 09)
- Produce forensic-quality disk images for testing forensic software.
- Deliver Linux Kernel modifications to support testing of forensic hardware.
