PDF Mirage: Content Masking Attack Against Information-Based Online Services

Ian Markwood*, Dakun Shen*, Yao Liu, and Zhuo Lu
University of South Florida

*Co-first authors

Presented by Ian Markwood

Outline

• Motivation
• Background Information
• Content Masking Attack
  – Against Conference Reviewer Assignment Systems
  – Against Plagiarism Detection
  – Against Document Indexing
• Content Masking Defense
• Conclusion

Motivation

• The Adobe Portable Document Format (PDF) is the standard for consistent cross-computer document rendering
• PDF documents cannot be edited with commonly accessible tools (MS Word, Adobe Reader, etc.)
• This confers a sense of integrity to the document for the end user

Motivation

• There is a disconnect between the content of a PDF and what is actually displayed
• A computer and a human see two different things

Motivation

• Within this disconnect we can perform a content masking attack, which compromises the content integrity of PDF files
• Three information-based online systems rely on the integrity of PDF documents:
  – Automatic reviewer assignment systems for academic papers
  – Plagiarism detection systems
  – Search engines

Background Information

• What do these services have in common?
  – They support PDF submission
  – They scrape the text out of submitted PDF files to perform their function, rather than using Optical Character Recognition (OCR)
  – Text scraping copies the plain text out of all strings within the PDF file
  – Text scraping ignores the font associated with the text

Background Information

• Automatic conference reviewer assignment systems
  – Use topic matching to assign reviewers to submitted papers
  – Compare frequent words appearing in reviewers’ published papers to frequent words appearing in submitted papers
  – INFOCOM uses Latent Semantic Indexing (LSI)
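The frequent-word comparison can be illustrated with a minimal sketch. It scores a submission against each reviewer using cosine similarity over raw word counts. This is a simplification chosen for brevity, not LSI itself, which would additionally project the count vectors into a low-rank space using a singular value decomposition. The function names, the stopword list, and the sample texts are all hypothetical.

```python
import math
from collections import Counter

# Illustrative stopword list only; a real system would use a much larger one.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "for", "on", "we"}

def word_counts(text):
    """Count the non-stopword tokens of a lowercased text."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(c1, c2):
    """Cosine similarity between two word-count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def best_reviewer(paper_text, reviewer_corpora):
    """Return the reviewer whose corpus is most similar to the paper."""
    paper = word_counts(paper_text)
    return max(reviewer_corpora,
               key=lambda r: cosine(paper, word_counts(reviewer_corpora[r])))

# Hypothetical reviewers and corpora.
reviewers = {
    "alice": "wireless jamming wireless channel spectrum",
    "bob": "malware kernel exploit kernel sandbox",
}
assigned = best_reviewer("a kernel exploit for the sandbox", reviewers)
```

Because the score depends only on the scraped words, anything that changes the scraped words changes the assignment, regardless of what a human reader sees on the page.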

Background Information

• Plagiarism detection systems
  – Measure similarity between strings within the subject document and all other documents submitted thus far
• Document indexing
  – Search engines return documents based on the similarity of their content to the search string

Content Masking Attack

[Figure labels: plaintext, cipher, ciphertext]

Content Masking Attack

• “Masking font”: a custom font with some rearrangement of the character/glyph relationship
• Open source tools such as FontForge allow copy/paste of character glyphs within fonts
• Custom fonts may be imported into LaTeX
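The effect of a masking font can be modeled in a few lines. In this sketch a "font" is just a dictionary from character code (what a text scraper reads) to glyph (what a human sees). It is a toy model under that assumption, not a real font file, and the names `render`, `scrape`, and `masking_font` are hypothetical.

```python
def render(text, font):
    """What a human sees: each character code drawn with the font's glyph."""
    return "".join(font.get(ch, ch) for ch in text)

def scrape(text):
    """What a text scraper reads: the character codes, with the font ignored."""
    return text

# A masking font that draws 'c' as "d", 'a' as "o", and 't' as "g".
masking_font = {"c": "d", "a": "o", "t": "g"}

underlying = "cat"
seen_by_human = render(underlying, masking_font)   # "dog"
seen_by_machine = scrape(underlying)               # "cat"
```

The same string therefore yields two different readings, which is the disconnect the attack relies on.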

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• An author can target a specific reviewer by replacing enough keywords in the paper with keywords from the reviewer’s papers
• Keywords: uncommon words that appear most frequently

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Algorithm:
  – Order keywords in the subject paper and the target reviewer’s corpus by descending frequency
  – Construct a “word mapping” between these two lists
  – Create a “character mapping” between the letters of each pair of words
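The three steps can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, a caller-supplied list of common words, and word pairs of equal length (unequal lengths are rejected rather than handled). All function names and sample texts are hypothetical.

```python
from collections import Counter

def keywords_by_frequency(text, common_words, n):
    """Return the n most frequent words that are not in common_words."""
    counts = Counter(w for w in text.lower().split() if w not in common_words)
    return [w for w, _ in counts.most_common(n)]

def word_mapping(paper_keywords, reviewer_keywords):
    """Pair the i-th paper keyword (displayed) with the i-th reviewer keyword (underlying)."""
    return list(zip(paper_keywords, reviewer_keywords))

def character_mapping(displayed, underlying):
    """List (underlying_char, displayed_glyph) pairs for one word pair."""
    if len(displayed) != len(underlying):
        raise ValueError("this sketch only handles equal-length word pairs")
    return list(zip(underlying, displayed))

common = {"the", "a", "of", "and"}
paper = "the radio and the radio of a laser"
reviewer = "the virus of the virus and a patch"

pairs = word_mapping(keywords_by_frequency(paper, common, 2),
                     keywords_by_frequency(reviewer, common, 2))
char_pairs = [character_mapping(shown, hidden) for shown, hidden in pairs]
```

Here the reader would still see "radio" while the scraped text contains "virus"; each `(underlying_char, displayed_glyph)` pair is one entry that some masking font has to provide.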

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Challenges:
  – One-to-many character mapping
  – Word length disparity
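The one-to-many challenge arises when the same underlying character must be displayed as different glyphs in different places, which a single font cannot do. One possible way to handle it, sketched below, is to spread the mappings over several fonts greedily. The greedy strategy and the assumption that the font may change from one character to the next are illustrative choices, not necessarily the authors' method, and word length disparity is not addressed here.

```python
def assign_fonts(char_pairs):
    """Greedily place (underlying_char, displayed_glyph) pairs into fonts.

    Each font is a dict from underlying character to glyph. A pair goes into
    the first font that does not already map that character to another glyph.
    Returns the fonts and, for each pair, the index of the font used.
    """
    fonts = []
    placement = []
    for code, glyph in char_pairs:
        for i, font in enumerate(fonts):
            if font.get(code, glyph) == glyph:
                font[code] = glyph
                placement.append(i)
                break
        else:
            # Every existing font maps this character to a different glyph.
            fonts.append({code: glyph})
            placement.append(len(fonts) - 1)
    return fonts, placement

# 'a' must appear as both "x" and "y", so two fonts are needed.
fonts, placement = assign_fonts([("a", "x"), ("a", "y"), ("b", "z"), ("a", "x")])
```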

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – We have reproduced the INFOCOM automatic reviewer assignment system
  – This includes 114 TPC members from a well-known security conference and 2094 of their recently published papers for training
  – 100 additional papers used as testing data

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – Matching a paper to one reviewer

[Figure: Similarity scores relative to amount of words masked. Blue stars show the desired matching.]

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – Matching a paper to one reviewer

[Figure: Word masking requirements for all 100 testing papers]

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – Matching a paper to one reviewer

[Figure: Masking font requirements for all 100 testing papers]

Content Masking Attack Against Automatic Conference Reviewer Assignment Systems

• Experiment:
  – Matching a paper to multiple reviewers

[Figure: Similarity scores relative to amount of words masked, between a paper and three reviewers. Blue stars, black circles, and green triangles show the desired matchings.]

Content Masking Attack Against Plagiarism Detection

• A cheating student can evade a plagiarism detector by replacing the underlying text with gibberish
• Use a “scrambling font” to render the gibberish as legible (plagiarized) text
• Results in zero similarity with existing work

Content Masking Attack Against Plagiarism Detection

• Zero similarity is unrealistic due to common phrases in language
• We evaluate three methods to target a specific similarity score
• Each method chooses what text to scramble and what text to leave unaltered

Content Masking Attack Against Plagiarism Detection

• By letter
  – Use a scrambling font which scrambles all characters
  – Remove characters from being scrambled by order of their frequency of appearance in the language
  – Continue removing characters until a target similarity score is reached
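A minimal sketch of the by-letter method follows. Several parts are assumptions made for illustration: ROT13 stands in for the scrambling font's gibberish, the fraction of unchanged words stands in for the detector's similarity score (a real detector is a black box), letters are released most-frequent-first, and the letter order used is only an approximate English frequency ranking. The input is assumed to be lowercase.

```python
import codecs
import string

# Approximate English letter-frequency order, most frequent first (assumption).
FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def underlying_text(visible, scrambled_letters):
    """Text a scraper would read when the given letters are scrambled (ROT13 stand-in)."""
    table = {ord(c): codecs.encode(c, "rot13") for c in scrambled_letters}
    return visible.translate(table)

def word_overlap(visible, underlying):
    """Stand-in similarity score: fraction of words left unchanged."""
    v, u = visible.split(), underlying.split()
    return sum(a == b for a, b in zip(v, u)) / len(v)

def scramble_by_letter(visible, target, similarity=word_overlap):
    """Start with every letter scrambled; release letters until the target is met."""
    scrambled = set(string.ascii_lowercase)
    for letter in FREQ_ORDER:
        if similarity(visible, underlying_text(visible, scrambled)) >= target:
            break
        scrambled.discard(letter)
    return scrambled

visible = "the cat sat on the mat"
still_scrambled = scramble_by_letter(visible, target=0.3)
```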

Content Masking Attack Against Plagiarism Detection

• By word, in frequency of appearance
  – Use a scrambling font which scrambles all characters
  – Order distinct words by frequency of appearance
  – Apply the scrambling font to all words
  – Remove the scrambling font from distinct words until a target similarity score is reached
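The by-word, frequency-ordered method can be sketched in the same style. As before, ROT13 stands in for the scrambling font and the fraction of unchanged words stands in for the similarity score; releasing the most frequent words first is an assumption, since the direction of the ordering is not stated. All names are hypothetical.

```python
import codecs
from collections import Counter

def overlap(visible_words, underlying_words):
    """Stand-in similarity score: fraction of words left unchanged."""
    same = sum(a == b for a, b in zip(visible_words, underlying_words))
    return same / len(visible_words)

def scramble_by_word_frequency(visible, target, similarity=overlap):
    """Scramble every word, then release distinct words until the target is met."""
    words = visible.split()
    clear = set()  # distinct words left unscrambled

    def underlying():
        return [w if w in clear else codecs.encode(w, "rot13") for w in words]

    for word, _ in Counter(words).most_common():
        if similarity(words, underlying()) >= target:
            break
        clear.add(word)
    return underlying()

visible = "the cat and the dog and the bird"
scraped = scramble_by_word_frequency(visible, target=0.5)
```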

Content Masking Attack Against Plagiarism Detection

• By word, at random
  – Use a scrambling font which scrambles all characters
  – Iterate over the document, applying the scrambling font at random according to a chosen probability
  – Modify the probability until a target similarity score is reached
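A sketch of the random method is given below. How the probability is modified is not specified, so this version simply lowers it in fixed steps from 1.0, reusing one random seed so that runs are repeatable. ROT13 and the unchanged-word fraction are again stand-ins for the scrambling font and the similarity score, and every name here is hypothetical.

```python
import codecs
import random

def overlap(visible_words, underlying_words):
    """Stand-in similarity score: fraction of words left unchanged."""
    same = sum(a == b for a, b in zip(visible_words, underlying_words))
    return same / len(visible_words)

def mask_at_random(words, p, rng):
    """Scramble each word independently with probability p."""
    return [codecs.encode(w, "rot13") if rng.random() < p else w for w in words]

def scramble_at_random(visible, target, step=0.05, seed=0, similarity=overlap):
    """Lower the scrambling probability until the similarity target is met."""
    words = visible.split()
    p = 1.0
    while p > 0:
        underlying = mask_at_random(words, p, random.Random(seed))
        if similarity(words, underlying) >= target:
            return p, underlying
        p = round(p - step, 10)  # avoid floating-point drift
    return 0.0, list(words)

visible = "alpha beta gamma delta " * 25
probability, scraped = scramble_at_random(visible, target=0.1)
```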

Content Masking Attack Against Plagiarism Detection

• Experiment:
  – Apply scrambling fonts to 10 published papers and target a 5-15% similarity score as measured by Turnitin

Content Masking Attack Against Document Indexing

• An attacker can place spam or illicit content in PDF documents indexed by search engines
• These PDFs can show ads instead of the legitimate content that users search for

Content Masking Attack Against Document Indexing

• This can be considered a special case of the reviewer assignment system subversion method
• Instead of masking particular words, we are masking the entire document
• Not constrained by spaces, however

Content Masking Attack Against Document Indexing

• The larger number of masked characters requires more masking fonts
• Instead of generating fonts ad hoc, we make one font for each glyph
• ~84 fonts
• Allows for easy automated generation of masked documents
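One way to read "one font for each glyph" is that each font draws every character code as the same single glyph, so any underlying character can be shown as any desired glyph just by choosing the font. That reading is an assumption, and the sketch below only illustrates it: the font naming scheme and the function `mask_document` are hypothetical, and the output is a list of font runs rather than a real PDF.

```python
from itertools import groupby

def mask_document(visible, underlying):
    """Return (font_name, underlying_run) pairs that display `visible`.

    Assumes one font per glyph: the font named after a glyph draws that
    glyph for every character code, so the font choice alone decides what
    the reader sees.
    """
    if len(visible) != len(underlying):
        raise ValueError("this sketch assumes texts of equal length")
    tagged = [(f"font_{ord(glyph):04x}", code)
              for glyph, code in zip(visible, underlying)]
    # Merge consecutive characters that share a font into a single run.
    return [(font, "".join(code for _, code in group))
            for font, group in groupby(tagged, key=lambda item: item[0])]

runs = mask_document("aab", "xyz")
```

Because the fonts are fixed in advance, generating a masked document reduces to a per-character lookup, which is what makes automation straightforward under this reading.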

Content Masking Attack Against Document Indexing

• Experiment
  – Used 5 well-known published papers
  – Masked each as gibberish

Content Masking Attack Against Document Indexing

• Experiment
  – Submitted them to leading search engines for indexing (Google, Bing, Yahoo!, DuckDuckGo)
  – Results were the same for all test documents

Content Masking Attack Against Document Indexing

• Experiment

Search Engine   Indexed Papers   Attack Successful   Evades Spam Detection   Not Later Removed
Google          ✔                ✘                   ✘                       ✘
Bing            ✔                ✔                   ✔                       ✔
Yahoo!          ✔                ✔                   ✘ → ✔                   ✔
DuckDuckGo      ✔                ✔                   ✔                       ✔

Content Masking Defense

• One feasible defense: perform Optical Character Recognition (OCR) on the document to check the integrity of each character.
• Problem:
  – High computational overhead
  – High false positive rate

[Figure annotation: 50,000-75,000 characters]

Content Masking Defense – Our proposal

• Render each character in the fonts embedded in the subject PDF file and perform OCR on those rendered characters, rather than on the rendered PDF file itself.
• Saves processing time

[Figure annotations: 100-2,000 characters; 50,000-75,000 characters]

Challenges and Technical Details

• Challenge 1: The whole font file is embedded
  – Can contain up to 2^16 = 65,536 characters
  – Causes high computational overhead
• Solution: Scan the document to extract the characters used, and perform OCR on the series of characters used in each font.
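The extraction step can be sketched as a simple grouping operation. The sketch assumes that some PDF parser has already produced the document's text as `(font_name, string)` runs; that input format and the function name `used_characters` are hypothetical, and the parsing itself is out of scope here.

```python
from collections import defaultdict

def used_characters(text_runs):
    """Collect the distinct characters that each font is actually used for.

    text_runs: iterable of (font_name, string) pairs taken from the document.
    Only these characters need to be rendered and passed to OCR, instead of
    every character the embedded font file could contain.
    """
    used = defaultdict(set)
    for font, text in text_runs:
        used[font].update(text)
    return {font: sorted(chars) for font, chars in used.items()}

# Hypothetical runs extracted from a document.
runs = [("F1", "hello"), ("F2", "abc"), ("F1", "help")]
per_font = used_characters(runs)
```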

Challenges and Technical Details

• Challenge 2: Special characters

[Figure: the character þ (Unicode 0xfe) is read by OCR as p (Unicode 0x70); the Unicode mismatch causes a false alarm]

Challenges and Technical Details

• Solution: Font Training
  1. Perform OCR on the font and list all similar characters.
  2. If the detected glyph is in the similar character list, replace the character’s Unicode value with that of the normal letter it looks like.
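The verification check with a look-alike list can be sketched as follows. The OCR engine is represented by a caller-supplied function `ocr(font, char)` that returns the character it recognizes in the rendered glyph; that interface, the stub `fake_ocr`, and the function names are hypothetical. The four whitelist entries are the look-alike pairs shown in the slides, and normalizing both the character code and the OCR result is one possible interpretation of step 2.

```python
# Look-alike characters mapped to the normal letter they resemble.
WHITELIST = {
    "\u00e3": "a",  # ã
    "\u0267": "h",  # ɧ
    "\u0460": "W",  # Ѡ
    "\u00fe": "p",  # þ
}

def normalize(ch):
    """Replace a look-alike character with the normal letter it resembles."""
    return WHITELIST.get(ch, ch)

def find_masked(font, chars, ocr):
    """Return the characters whose rendered glyph does not match their code."""
    return [c for c in chars if normalize(ocr(font, c)) != normalize(c)]

def fake_ocr(font, ch):
    """Stub standing in for a real OCR engine (illustration only)."""
    glyphs = {
        "honest": {"\u00fe": "p"},  # þ is legitimately read as "p"
        "masked": {"a": "x"},       # code 'a' is drawn as the glyph "x"
    }
    return glyphs[font].get(ch, ch)

clean = find_masked("honest", ["p", "\u00fe", "a"], fake_ocr)
flagged = find_masked("masked", ["a", "b"], fake_ocr)
```

In this sketch the honest font no longer raises a false alarm for þ, while the remapped character in the masked font is still reported.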

Font Training

[Figure: the character þ (Unicode 0xfe) is in the list, so its Unicode value is changed to 0x70]

Whitelist

ã  0xe3     a  0x61
ɧ  0x267    h  0x68
Ѡ  0x460    W  0x57
…           …
þ  0xfe     p  0x70
…           …

Font Verification Performance

• Experiment 1
  – To analyze the accuracy of our Font Verification method and the Whole Document OCR method
  – Generated 10 PDF files with masked characters varying from 5-20% in frequency of appearance

Performance – Experiment 1

Font Verification Performance

• Experiment 2
  – To analyze the effects of document length on the detection rate for each method
  – Generated 10 PDF files ranging from 1-10 pages in length and having an even 30% distribution of masked characters

Performance – Experiment 2

Font Verification Performance

• Experiment 3
  – To analyze the effect of document length on the detection time for each method
  – Generated 20 PDF files ranging from 1-20 pages in length and having a 30% distribution of masked characters

Performance – Experiment 3

Conclusion

• We describe a new content masking attack against the Adobe PDF standard
• We create and evaluate algorithms for effectively performing attacks against:
  – Automatic reviewer assignment systems
  – Plagiarism detection
  – Document indexing
• We create and evaluate a font verification algorithm that is more accurate and lightweight than OCR

Thank you!

• Questions?

PDF file image from http://iconbug.com/detail/icon/5940/file-format-pdf/
TrueType font file image from https://typography.guru/journal/opentype-myths-explained-r24/
