
PDF Mirage: Content Masking Attack Against Information-Based Online Services

Ian Markwood*, Dakun Shen*, Yao Liu, and Zhuo Lu
University of South Florida

*Co-first authors

Presented by Ian Markwood

Outline

• Motivation
• Background Information
• Content Masking Attack
  – Against Conference Reviewer Assignment Systems
  – Against Plagiarism Detection
  – Against Document Indexing
• Content Masking Defense
• Conclusion

Motivation

• The Adobe Portable Document Format (PDF) is the standard for consistent cross-computer document rendering
• PDF documents cannot be edited with commonly accessible tools (MS Word, Adobe Reader, etc.)
• This confers a sense of integrity to the document for the end user
• There is a disconnect between the content of a PDF and what is actually displayed
• A computer and a human see two different things
• Within this disconnect we can perform a content masking attack, which compromises the content integrity of PDF files
• Three information-based online systems rely on the integrity of PDF documents:
  – Automatic reviewer assignment systems for academic papers
  – Plagiarism detection systems
  – Search engines


Background Information

• What do these services have in common?
  – They support PDF submission
  – They scrape the text out of submitted PDF files to perform their function, rather than using Optical Character Recognition (OCR)
  – Text scraping copies the plain text out of all strings within the PDF file
  – It ignores the font associated with the text
• Automatic conference reviewer assignment systems
  – Use topic matching to assign reviewers to submitted papers
  – Compare frequent words appearing in reviewers' published papers to frequent words appearing in submitted papers
  – INFOCOM uses Latent Semantic Indexing (LSI)
• Plagiarism detection systems
  – Measure similarity between strings within the subject document and all other documents submitted thus far
• Document indexing
  – Search engines return documents based on the similarity of their content to the search string
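The frequent-word comparison behind reviewer assignment can be illustrated with a minimal sketch. It assumes cosine similarity over raw word counts as a simplified stand-in for LSI; the stopword list and the function names (`word_counts`, `cosine_similarity`, `best_reviewer`) are hypothetical and not taken from the INFOCOM system.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "we", "with"}

def word_counts(text):
    """Count lowercase alphabetic words, skipping common stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_reviewer(paper_text, reviewer_corpora):
    """Return the reviewer whose corpus is most similar to the scraped paper text."""
    paper = word_counts(paper_text)
    return max(
        reviewer_corpora,
        key=lambda name: cosine_similarity(paper, word_counts(reviewer_corpora[name])),
    )
```

Because the score depends only on the scraped words, whoever controls the scraped text controls the match.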


Content Masking Attack

[Figure: plaintext → cipher → ciphertext]

• "Masking font": a custom font with some rearrangement of the character/glyph relationship
• Open source tools such as FontForge allow copy/paste of character glyphs within fonts
• Custom fonts may be imported into LaTeX
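A toy model may help make the character/glyph rearrangement concrete. In this sketch a font is just a Python dict from stored character code to displayed glyph, and a document is a list of (text, font) runs; these structures and the names `make_masking_font`, `extract_text`, and `render_text` are illustrative assumptions, not a real font or PDF API.

```python
# Toy model: a "font" maps stored character codes to the glyphs drawn on screen.
NORMAL_FONT = {c: c for c in "abcdefghijklmnopqrstuvwxyz "}

def make_masking_font(stored_word, displayed_word):
    """Build a font that draws `displayed_word` when the file stores `stored_word`.

    Simplifying assumptions: both words have equal length, and no stored
    character needs two different glyphs.
    """
    if len(stored_word) != len(displayed_word):
        raise ValueError("toy model needs words of equal length")
    remapped = {}
    for code, glyph in zip(stored_word, displayed_word):
        if remapped.setdefault(code, glyph) != glyph:
            raise ValueError(f"{code!r} would need two different glyphs")
    return {**NORMAL_FONT, **remapped}

def extract_text(runs):
    """What a text scraper sees: the stored codes, ignoring the font."""
    return "".join(text for text, _font in runs)

def render_text(runs):
    """What a human sees: each stored code drawn with the font of its run."""
    return "".join(font[c] for text, font in runs for c in text)
```

For example, with `font = make_masking_font("crypto", "kernel")`, the runs `[("a ", NORMAL_FONT), ("crypto", font), (" paper", NORMAL_FONT)]` extract as "a crypto paper" but render as "a kernel paper".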


Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• An author can target a specific reviewer by replacing enough keywords in the paper with keywords from the reviewer's papers
• Keywords: uncommon words that appear most frequently
• Algorithm:
  – Order keywords in the subject paper and in the target reviewer's corpus by descending frequency
  – Construct a "word mapping" between these two lists
  – Create a "character mapping" between the letters of each pair of words
• Challenges:
  – One-to-many character mapping
  – Word length disparity
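The algorithm steps and the one-to-many challenge can be sketched as follows. This is a minimal illustration under stated assumptions: fonts are modeled as dicts from stored character to displayed glyph, a new font is opened greedily whenever a stored character would need a second glyph, and word pairs of unequal length are simply skipped rather than handled. The function names are hypothetical.

```python
from collections import Counter

def top_keywords(words, common_words, k):
    """Return the k most frequent words that are not in `common_words`."""
    counts = Counter(w for w in words if w not in common_words)
    return [w for w, _count in counts.most_common(k)]

def build_word_mapping(paper_keywords, reviewer_keywords):
    """Pair the i-th paper keyword (displayed) with the i-th reviewer keyword (stored)."""
    return list(zip(paper_keywords, reviewer_keywords))

def assign_fonts(word_mapping):
    """Greedily pack character mappings into as few masking fonts as possible.

    Each font maps stored char -> displayed glyph. A stored char that would
    need a second glyph in a font (one-to-many) forces the word into another font.
    """
    fonts = []
    placement = {}  # displayed word -> index of the font that masks it
    for displayed, stored in word_mapping:
        if len(displayed) != len(stored):
            continue  # word length disparity: not handled in this sketch
        needed = {}
        if any(needed.setdefault(s, d) != d for s, d in zip(stored, displayed)):
            continue  # conflict inside a single word: not handled in this sketch
        for index, font in enumerate(fonts):
            if all(font.get(s, d) == d for s, d in needed.items()):
                font.update(needed)
                placement[displayed] = index
                break
        else:
            fonts.append(dict(needed))
            placement[displayed] = len(fonts) - 1
    return fonts, placement
```

With the mapping `[("cat", "dog"), ("cow", "dig")]`, the stored letter g must display as t in one word and as w in the other, so the sketch needs two fonts.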

• Experiment:
  – We have reproduced the INFOCOM automatic reviewer assignment system
  – This includes 114 TPC members from a well-known security conference and 2094 of their recently published papers for training
  – 100 additional papers used as testing data
• Experiment: matching a paper to one reviewer

[Figure: Similarity scores relative to the number of words masked. Blue stars show the desired matching.]

[Figure: Word masking requirements for all 100 testing papers]

[Figure: Masking font requirements for all 100 testing papers]

• Experiment: matching a paper to multiple reviewers

[Figure: Similarity scores relative to the number of words masked, between a paper and three reviewers. Blue stars, black circles, and green triangles show the desired matchings.]


Content Masking Attack Against Plagiarism Detection

• A cheating student can evade a plagiarism detector by replacing the underlying text with gibberish
• Use a "scrambling font" to render the gibberish as legible (plagiarized) text
• Results in zero similarity with existing work
• Zero similarity is unrealistic due to common phrases in language
• We evaluate three methods to target a specific similarity score
• Each method chooses what text to scramble and what text to leave unaltered
• By letter
  – Use a scrambling font which scrambles all characters
  – Remove characters from being scrambled in order of their frequency of appearance in the language
  – Continue removing characters until a target similarity score is reached
• By word, in frequency of appearance
  – Use a scrambling font which scrambles all characters
  – Order distinct words by frequency of appearance
  – Apply the scrambling font to all words
  – Remove the scrambling font from distinct words until a target similarity score is reached
• By word, at random
  – Use a scrambling font which scrambles all characters
  – Iterate over the document, applying the scrambling font at random according to a chosen probability
  – Modify the probability until a target similarity score is reached
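The "by letter" method can be sketched as a simple loop. Everything specific here is an assumption made for illustration: ROT13 stands in for an arbitrary scrambling permutation, `ENGLISH_ORDER` is an approximate letter-frequency ordering, and `toy_similarity` (the fraction of words left intact) is a hypothetical stand-in for a real detector's score.

```python
ENGLISH_ORDER = "etaoinshrdlcumwfgypbvkjxqz"  # approximate, most to least frequent

# Hypothetical scrambling: ROT13 stands in for an arbitrary letter permutation.
SCRAMBLE = {c: chr((ord(c) - ord("a") + 13) % 26 + ord("a")) for c in ENGLISH_ORDER}

def stored_text(original, scrambled):
    """Text a scraper would extract when the letters in `scrambled` are masked."""
    return "".join(SCRAMBLE[c] if c in scrambled else c for c in original)

def toy_similarity(original, stored):
    """Fraction of words left intact (stand-in for a plagiarism detector's score)."""
    pairs = list(zip(original.split(), stored.split()))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

def scramble_by_letter(original, target):
    """Scramble every letter, then unscramble the most frequent letters first
    until the similarity score reaches `target`. Returns the letters still scrambled."""
    scrambled = set(ENGLISH_ORDER)
    for letter in ENGLISH_ORDER:
        if toy_similarity(original, stored_text(original, scrambled)) >= target:
            break
        scrambled.discard(letter)
    return scrambled
```

The two by-word methods follow the same pattern, with the unit of scrambling changed from letters to distinct words or to randomly chosen word occurrences.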

• Experiment:
  – Apply scrambling fonts to 10 published papers and target a 5-15% similarity score as measured by Turnitin


Content Masking Attack Against Document Indexing

• An attacker can place spam or illicit content in PDF documents indexed by search engines
• These PDFs can show ads instead of the legitimate content that users search for
• This can be considered a special case of the reviewer assignment system subversion method
• Instead of masking particular words, we are masking the entire document
• Not constrained by spaces, however
• The larger number of masked characters requires more masking fonts
• Instead of generating fonts ad hoc, we make one font for each glyph
• ~84 fonts
• Allows for easy automated generation of masked documents
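One way to read "one font for each glyph" is that each font draws a single glyph no matter which character code is stored, so masking a document reduces to picking a font per character. The sketch below follows that reading as an assumption; the font naming scheme and the equal-length requirement are hypothetical simplifications.

```python
def font_name_for(glyph):
    """Hypothetical naming scheme: one font per displayed glyph."""
    return f"mask-{ord(glyph):04x}"

def mask_document(displayed, stored):
    """Pair each stored character with the font that draws the desired glyph.

    Assumes the font for glyph g draws every character code as g, and that
    both texts have the same length (a simplification for this sketch).
    """
    if len(displayed) != len(stored):
        raise ValueError("sketch assumes texts of equal length")
    return [(code, font_name_for(glyph)) for code, glyph in zip(stored, displayed)]

def extract(runs):
    """What the search engine indexes: the stored character codes."""
    return "".join(code for code, _font in runs)

def render(runs):
    """What the user sees: the single glyph each font draws."""
    return "".join(chr(int(font.split("-")[1], 16)) for _code, font in runs)
```

Since the font set is fixed in advance, generating a masked document needs no per-document font design, only a font choice per character.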

• Experiment
  – Used 5 well-known published papers
  – Masked each as gibberish
  – Submitted them to leading search engines for indexing (Google, Bing, Yahoo!, DuckDuckGo)
  – Results were the same for all test documents

Search Engine   Indexed Papers   Attack Successful   Evades Spam Detection   Not Later Removed
Google          ✔                ✘                   ✘                       ✘
Bing            ✔                ✔                   ✔                       ✔
Yahoo!          ✔                ✔                   ✘ → ✔                   ✔
DuckDuckGo      ✔                ✔                   ✔                       ✔


Content Masking Defense

• One feasible defense: perform Optical Character Recognition (OCR) on the document to check the integrity of each character
• Problem:
  – High computational overhead
  – High false positive rate

[Figure label: 50,000-75,000 characters]

Content Masking Defense: Our Proposal

• Render each character in the fonts embedded in the subject PDF file and perform OCR on those rendered characters, rather than on the rendered PDF file itself
• Saves processing time

[Figure labels: 100-2000 characters vs. 50,000-75,000 characters]

Challenges and Technical Details

• Challenge 1: The whole font file is embedded
  – Contains at most $2^{16}$ = 65,536 characters
  – Causes high computational overhead
• Solution: Scan the document to extract the characters used, and perform OCR on the series of characters used in each font
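The scan-then-verify idea can be sketched as below. The document is modeled as a list of (text, font name) runs, and `recognize(font_name, code)` is a hypothetical callable standing in for "render this character in this font and run OCR on it"; neither is a real PDF or OCR API.

```python
from collections import defaultdict

def used_characters(runs):
    """Collect the distinct character codes used with each font in the document."""
    used = defaultdict(set)
    for text, font_name in runs:
        used[font_name].update(text)
    return used

def verify_fonts(runs, recognize):
    """Return (font_name, code, seen) for every used character whose rendered
    glyph is recognized as something other than its stored code.

    `recognize(font_name, code)` is an assumed stand-in for rendering one
    glyph and performing OCR on it.
    """
    mismatches = []
    for font_name, codes in used_characters(runs).items():
        for code in sorted(codes):
            seen = recognize(font_name, code)
            if seen != code:
                mismatches.append((font_name, code, seen))
    return mismatches
```

The number of OCR calls is the number of distinct (font, character) pairs in use, not the length of the document, which is where the time saving comes from.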

• Challenge 2: Special characters
  – þ (Unicode 0xfe) looks like p (Unicode 0x70)
  – OCR reads the glyph as p, producing a Unicode mismatch and a false alarm
• Solution: Font Training
  1. Perform OCR on the font and list all similar characters.
  2. If the detected glyph is in the similar character list, replace the character's Unicode with that of the normal letter it looks like.

Font Training

[Figure: Unicode 0xfe → in the list → change Unicode → Unicode 0x70]

Whitelist

ã    0xe3
ɧ    0x267
Ѡ    0x460
……   ……
þ    0xfe
……   ……
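The whitelist check can be sketched as a normalization step applied before comparing code points. The only mapping included is the one the slides state (0xfe is treated as 0x70); the dict name `SIMILAR` and the functions are hypothetical, and a trained list would contain many more entries.

```python
# Lookalike list produced by font training. Only the pair given on the
# slides is included; the remaining whitelist targets are not specified.
SIMILAR = {0xFE: 0x70}  # þ looks like p

def normalize(code_point, similar=SIMILAR):
    """Replace a whitelisted lookalike with the normal letter it resembles."""
    return similar.get(code_point, code_point)

def is_masked(stored_code_point, ocr_code_point, similar=SIMILAR):
    """Flag a character only if the stored and recognized code points still
    differ after lookalikes are normalized."""
    return normalize(stored_code_point, similar) != normalize(ocr_code_point, similar)
```

With this check, a stored þ that OCR reads as p no longer raises a false alarm, while a stored d that renders as c is still flagged.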

Font Verification Performance

• Experiment 1
  – To analyze the accuracy of our Font Verification method and the Whole Document OCR method
  – Generated 10 PDF files with masked characters varying from 5-20% in frequency of appearance

[Figure: Performance, Experiment 1]

• Experiment 2
  – To analyze the effect of document length on the detection rate for each method
  – Generated 10 PDF files ranging from 1-10 pages in length and having an even 30% distribution of masked characters
• Experiment 3
  – To analyze the effect of document length on the detection time for each method
  – Generated 20 PDF files ranging from 1-20 pages in length and having a 30% distribution of masked characters


Conclusion

• We describe a new content masking attack against the Adobe PDF standard
• We create and evaluate algorithms for effectively performing attacks against:
  – Automatic reviewer assignment systems
  – Plagiarism detection
  – Document indexing
• We create and evaluate a font verification algorithm that is more accurate and lightweight than OCR

Thank you!

• Questions?

PDF file image from http://iconbug.com/detail/icon/5940/file-format-pdf/
TrueType font file image from https://typography.guru/journal/opentype-myths-explained-r24/
