Diving into the Portable Document Format
Toulouse Hacking Convention 2017
Guillaume Endignoux (@gendignoux)
Friday 3rd March, 2017
Portable Document Format?
PDF timeline:
- 1991-1993: inception and first release by Adobe [1].
- 2008: ISO specification released (PDF 1.7) ⇒ alternative readers: Evince, PDF.js, Chrome...
- Soon? ISO specification for PDF 2.0.
Many features (not all portable):
- interactive forms
- encryption
- scripting: JavaScript, Flash
- multimedia: video, sound, 3D artwork...
[1] https://acrobat.adobe.com/us/en/why-adobe/about-adobe-pdf.html
Portable Document Format?
A commonly used format, but with many security issues:
- 500+ reported vulnerabilities in Adobe Reader [2] (since 1999).
- Variations between implementations.
- Syntax facilitates polymorphism, e.g. PoC||GTFO (PDF+ZIP, PDF+JPEG...).
- SHA-1 collisions...
I worked on PDF validation: the Caradoc [3] project started in 2015 (at ANSSI), with a paper & presentation at the LangSec Workshop 2016 [4].
[2] http://www.cvedetails.com
[3] https://github.com/ANSSI-FR/caradoc
[4] http://spw16.langsec.org/
Table of contents
1 Introduction to PDF syntax
2 Security problems: case studies
3 Caradoc: 2 years of PDF validation
PDF syntax 101
A PDF document is made of objects. The format is textual, similar to JSON but with a different syntax:
- null
- booleans: true, false
- numbers: 123, -4.56
- strings: (foo)
- names: /bar
- arrays: [1 2 3], [(foo) /bar]
- dictionaries: << /key (value) /foo 123 >>
- references: 1 0 obj ... endobj and 1 0 R
- streams: << ... >> stream ... endstream
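To make the object syntax above more concrete, here is a minimal Python sketch of a parser for the basic objects. It is an illustration under simplifying assumptions, not a conforming parser: strings may not contain parentheses or escapes, hex strings, comments, streams and `obj`/`endobj` are not handled, and the names `Name`, `Ref`, `tokenize` and `parse` are made up for this example.

```python
import re
from typing import NamedTuple

class Name(NamedTuple):      # hypothetical wrapper for /names
    value: str

class Ref(NamedTuple):       # hypothetical wrapper for "1 0 R"
    number: int
    generation: int

# One alternative per token kind (simplified: no nested or escaped strings).
TOKEN = re.compile(r"""
    \s+
  | (?P<dict_open><<) | (?P<dict_close>>>)
  | (?P<array_open>\[) | (?P<array_close>\])
  | \((?P<string>[^()\\]*)\)
  | /(?P<name>[^\s()<>\[\]{}/%]*)
  | (?P<word>[^\s()<>\[\]{}/%]+)
""", re.VERBOSE)

def tokenize(text):
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unsupported syntax at offset {pos}")
        pos = match.end()
        if match.lastgroup:  # whitespace has no named group and is skipped
            yield match.lastgroup, match.group(match.lastgroup)

def convert_word(word):
    keywords = {"null": None, "true": True, "false": False}
    if word in keywords:
        return keywords[word]
    if re.fullmatch(r"[+-]?\d+", word):
        return int(word)
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", word):
        return float(word)
    raise ValueError(f"unsupported keyword: {word}")

def parse(text):
    """Return the list of top-level objects found in text."""
    stack = [[]]             # stack[0] collects top-level objects
    closers = []             # closing delimiters we still expect
    for kind, value in tokenize(text):
        if kind in ("array_open", "dict_open"):
            stack.append([])
            closers.append(kind.replace("open", "close"))
        elif kind in ("array_close", "dict_close"):
            if not closers or closers.pop() != kind:
                raise ValueError("unbalanced delimiter")
            items = stack.pop()
            if kind == "dict_close":
                keys, values = items[0::2], items[1::2]
                if len(keys) != len(values) or not all(isinstance(k, Name) for k in keys):
                    raise ValueError("malformed dictionary")
                items = dict(zip(keys, values))
            stack[-1].append(items)
        elif kind == "string":
            stack[-1].append(value)
        elif kind == "name":
            stack[-1].append(Name(value))
        elif value == "R":   # fold the two preceding integers into a reference
            top = stack[-1]
            if len(top) < 2 or type(top[-1]) is not int or type(top[-2]) is not int:
                raise ValueError("malformed reference")
            generation, number = top.pop(), top.pop()
            top.append(Ref(number, generation))
        else:
            stack[-1].append(convert_word(value))
    if closers:
        raise ValueError("unclosed array or dictionary")
    return stack[0]
```

For instance, `parse("<< /Type /Catalog /Pages 2 0 R >>")` yields a single dictionary whose `/Pages` entry is `Ref(2, 0)`. Note that a reference is only recognized once the `R` keyword is seen, which is one reason the syntax is less straightforward to parse than JSON.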
Structure of a PDF file
Organization of a simple PDF file:
- Header
- Object
- Object
- ...
- Reference table
- Trailer
- End-of-file

    %PDF-1.7

    1 0 obj
    << /Type /Catalog /Pages 2 0 R >>
    endobj

    2 0 obj
    << /Type /Pages /Count 1 /Kids [3 0 R] >>
    endobj

    xref
    0 6
    0000000000 65535 f
    0000000009 00000 n
    0000000060 00000 n
    ...

    trailer
    << /Size 6 /Root 1 0 R >>

    startxref
    428
    %%EOF
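The layout above (header, objects, reference table, trailer, end-of-file) can be sketched in a few lines of Python. This is a minimal illustration, assuming every object has generation 0 and that object 1 is the catalog; `build_pdf` is a hypothetical helper written for this example and is not the script from the pdf-corpus repository.

```python
def build_pdf(objects):
    """Serialize object bodies (bytes) as objects 1..n and append a matching xref table.

    Assumptions of this sketch: generation 0 everywhere, object 1 is the catalog.
    """
    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    size = len(objects) + 1                 # object 0 is the head of the free list
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"         # each entry is exactly 20 bytes long
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)

pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Count 0 /Kids [] >>",   # empty page tree, for brevity
])
```

Computing the offsets from the serialized bytes, rather than by hand, avoids the most common mistake when writing such files manually: a table that no longer matches the objects after a small edit.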
Structure of a PDF file
More complex structures:
- incremental updates,
- object streams,
- linearization.

Layout of a file with an incremental update:
- Header
- Objects ...
- Table + trailer #1
- End-of-file #1
- Objects ...
- Table + trailer #2
- End-of-file #2

Original file:

    %PDF-1.7
    xref
    0 6
    0000000000 65535 f
    0000000009 00000 n
    0000000060 00000 n
    ...
    trailer
    << /Size 6 /Root 1 0 R >>
    startxref
    428
    %%EOF

Incremental update:

    xref
    0 3
    0000000002 65535 f
    0000000567 00001 n
    0000000000 00001 f
    6 1
    0000001234 00000 n
    trailer
    << /Size 7 /Root 1 1 R /Prev 428 >>
    startxref
    1347
    %%EOF
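A reader has to combine the tables of the original file and of each update, with the most recent entry winning. The sketch below illustrates that rule in Python. `Entry`, `parse_xref_section` and `merge_sections` are hypothetical names, the original table is shortened to the three entries visible on the slide, and following the `/Prev` chain in a real file is left out.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    offset: int        # byte offset if in use, next free object number if free
    generation: int
    in_use: bool

def parse_xref_section(text):
    """Parse the lines between 'xref' and 'trailer' into {object number: Entry}."""
    entries = {}
    lines = [line.split() for line in text.strip().splitlines()]
    i = 0
    while i < len(lines):
        first, count = (int(field) for field in lines[i])   # subsection header
        for k in range(count):
            offset, generation, kind = lines[i + 1 + k]
            entries[first + k] = Entry(int(offset), int(generation), kind == "n")
        i += 1 + count
    return entries

def merge_sections(sections):
    """Merge sections given in file order (oldest first); later entries override."""
    table = {}
    for section in sections:
        table.update(section)
    return table

# Shortened original table (only the entries shown on the slide).
original = parse_xref_section("""
0 3
0000000000 65535 f
0000000009 00000 n
0000000060 00000 n
""")
# Table of the incremental update, as on the slide.
update = parse_xref_section("""
0 3
0000000002 65535 f
0000000567 00001 n
0000000000 00001 f
6 1
0000001234 00000 n
""")
table = merge_sections([original, update])
```

After merging, object 1 points to offset 567 with generation 1, object 2 is marked free, and object 6 is new. The older copies remain in the file, which is one source of disagreement between parsers.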
Logical structure of a PDF file
Document of 17 pages (about 1000 objects).
Graphical instructions
Vector graphics = low-level instructions, stored in a stream. Some examples:
- set font ABC in size 10: /ABC 10 Tf
- set blue color (RGB): 0 0 1 rg
- draw text: (Hello world) Tj
- move to (x, y) = (5, 10): 5 10 m
- draw line to (15, 20): 15 20 l
- ...
I made a cheat sheet: https://github.com/gendx/pdf-cheat-sheets
Draw your own PDF!
Creating reference tables/streams is error-prone and boring...
Python script to automate the process: https://github.com/gendx/pdf-corpus
Source:

    template = contentstream
    ---
    BT
    0 700 Td
    /F1 100 Tf
    (Hello world !) Tj
    ET

Resulting PDF: [figure]
Security problems: case studies
Security problems arise from:
- unclear or ambiguous specification,
- complex or flawed designs in the standard,
- improper input checking by PDF readers.
Some case studies:
- malicious graph structures,
- graphics instructions,
- home-made encryption.
Graph organization
The graph of objects is organized into sub-structures, especially trees.
Figure: page tree. Nodes: Catalog, Root of the page tree, Node, Page 1, Page 2, Page 3, Page 4.
Graph organization
The table of contents uses doubly-linked lists.
Figure: table of contents. Nodes: Catalog, Outline root, three Chapter items, three Section items.
Problematic structure
Some PDF readers loop forever with an invalid structure...
Figure: invalid table of contents. Nodes: Catalog, Outline root, three Chapter items, three Section items.
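A reader can avoid looping forever by remembering which outline items it has already visited. The Python sketch below shows the idea on a simplified model where each object is a dictionary with optional `First` (first child) and `Next` (next sibling) entries holding object numbers. The data model and the name `walk_outline` are assumptions made for this example, not Caradoc's implementation.

```python
def walk_outline(objects, root):
    """Return outline items in document order; raise ValueError on a cycle.

    objects maps an object number to a dict with optional "First" and "Next" keys.
    """
    visited = set()
    order = []
    pending = [objects[root].get("First")]
    while pending:
        current = pending.pop()
        if current is None:
            continue
        if current in visited:
            raise ValueError(f"outline item {current} is reachable twice")
        visited.add(current)
        order.append(current)
        item = objects[current]
        pending.append(item.get("Next"))    # visited after the children
        pending.append(item.get("First"))   # children come first
    return order

valid = {1: {"First": 2}, 2: {"First": 4, "Next": 3}, 3: {}, 4: {"Next": 5}, 5: {}}
# Same outline, except that the last chapter points back to the first one.
cyclic = dict(valid)
cyclic[3] = {"Next": 2}
```

With these inputs, `walk_outline(valid, 1)` returns `[2, 4, 5, 3]`, while `walk_outline(cyclic, 1)` raises `ValueError` instead of iterating indefinitely.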
Problematic structure
This is a design flaw:
- Complex structures everywhere, but PDF readers do not check them...
- Simpler design: an array of references to store pages?
Graphics instructions
Graphics instructions = core of the format ⇒ potential for many bugs!
Graphics instructions
I tried to write a PDF optimizer, and found more weird bugs...
Graphics instructions
What is in the graphics interpreter?
A simple example:
- Graphics state = font, colors, translations, etc. (e.g. the font is modified by setfont and used by drawtext).
- Graphics state stack: push and pop operators to save & restore the graphics state.
What if we pop too much (stack underflow)?
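In real content streams, the save and restore operators are `q` and `Q`. A validator can detect the underflow before any rendering happens by tracking the stack depth. The Python sketch below assumes the content stream has already been split into a list of operator names; `check_graphics_stack` is a hypothetical helper for illustration.

```python
def check_graphics_stack(operators):
    """Return the final stack depth; raise ValueError on underflow.

    operators is a list of operator names, e.g. ["q", "Tf", "Tj", "Q"].
    """
    depth = 0
    for index, operator in enumerate(operators):
        if operator == "q":          # save the graphics state (push)
            depth += 1
        elif operator == "Q":        # restore the graphics state (pop)
            if depth == 0:
                raise ValueError(f"graphics stack underflow at operator #{index}")
            depth -= 1
    return depth
```

A non-zero return value means that some saved states were never restored, which a strict checker may also want to report.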
Graphics instructions
Example [5] for Evince: an unbalanced pop seems to stop the interpreter.
Pseudo-code, pop before:

    pop
    setfont
    drawtext (Hello world !)

Pseudo-code, pop after:

    setfont
    drawtext (Hello world !)
    pop

[Figure: resulting PDF for each snippet.]
[5] https://github.com/gendx/pdf-corpus/tree/master/corpus/contentstream/graphic-stack
Demonstration
- Loop in the outline structure: https://github.com/ANSSI-FR/caradoc/blob/master/test_files/negative/outlines/cycle.pdf
- Polymorphic file: https://github.com/ANSSI-FR/caradoc/blob/master/test_files/negative/polymorph/polymorph.pdf
- PoC||GTFO 0x13: https://www.alchemistowl.org/pocorgtfo/pocorgtfo13.pdf
Demonstration
These problems may lead to several attacks:
- Attacks against the parser: denial of service, crash (or worse).
- Evasion techniques: variations between the PDF reader and the malware detector.
Encryption
PDF encryption has been supported since v1.1.
Based on 2 passwords:
- User password Pu: decrypt and view content.
- Owner password Po: unlock permissions (print, modify...) ⇒ enforced only by compliant software (Pu is enough to decrypt).
Security issues:
- Partial encryption: only strings and streams are encrypted, the general document structure is leaked...
- Ad-hoc key derivation from passwords & checksums (based on MD5+RC4).
Home-made encryption
Complex derivation of keys from passwords.
Figure: key-derivation diagram linking Po, Pu, P, ID, a and b to Ko, O, Ku, U and Ka,b through steps A to E, where:
- A, C, E ≈ MD5
- B ≈ RC4
- D ≈ MD5+RC4
Legend: password, checksum (in file), salt (in file), object key.
Main problem: checksum O is a deterministic function of the passwords, with no salt! ⇒ 33% collisions for 478 files crawled from the Internet...
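The determinism of O can be illustrated with a short Python sketch. It follows the oldest variant of the scheme (revision 2 of the standard security handler, 40-bit RC4 key) as described in ISO 32000-1, to the best of my reading of that algorithm; later revisions add more MD5 and RC4 rounds but still use no salt. The function names are made up for this example, and RC4 is implemented inline because the Python standard library does not provide it.

```python
import hashlib

# 32-byte padding string defined by the PDF specification.
PADDING = bytes.fromhex(
    "28bf4e5e4e758a4164004e56fffa0108"
    "2e2e00b6d0683e802f0ca9fe6453697a"
)

def rc4(key, data):
    """Plain RC4 (encryption and decryption are the same operation)."""
    state = list(range(256))
    j = 0
    for i in range(256):                     # key scheduling
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    i = j = 0
    out = bytearray()
    for byte in data:                        # keystream generation
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)

def pad(password):
    """Truncate or pad a password to exactly 32 bytes."""
    return (password + PADDING)[:32]

def owner_checksum(owner_password, user_password):
    """Sketch of the O value for revision 2: O = RC4(MD5(pad(Po))[:5], pad(Pu))."""
    key = hashlib.md5(pad(owner_password)).digest()[:5]   # Ko, 40 bits
    return rc4(key, pad(user_password))

first = owner_checksum(b"owner", b"user")
second = owner_checksum(b"owner", b"user")   # same passwords, same O
```

Since nothing file-specific enters the computation, two unrelated files protected with the same pair of passwords carry the same O, which is what makes collisions across a corpus observable.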
Caradoc validation
I worked on Caradoc, a PDF validator, implemented in OCaml from the PDF specification [6].
Caradoc verifies the following:
- File syntax.
- Object consistency (type checking).
- Graph (page tree...).
- Vector graphics instructions (syntax).
Figure: validation workflow. Labels: strict parser, relaxed parser, objects, graph of references, extraction of specific objects, type checking, list of types, graph checking, graphics instructions, future work, no error detected, normalization.
[6] https://www.adobe.com/devnet/pdf/pdf_reference.html
Syntax restriction
At syntax level, guarantee extraction of objects without ambiguity:
- Grammar formalization [7] (BNF).
- Structure restrictions (no updates, no linearization, etc.).
- Systematic rejection of “corrupted” files.
“When a conforming reader reads a PDF file with a damaged or missing cross-reference table, it may attempt to rebuild the table by scanning all the objects in the file.” (ISO 32000-1:2008, annex C.2)
[7] https://github.com/ANSSI-FR/caradoc/tree/master/doc/grammar
Type checking
Figure: types of a 17-page document. Legend: action, page, destination, annotation, resource, outline, content stream, font, name tree, other.
Real-world files: lessons learned
Real-world evaluation: 10K files collected from random queries on a web search engine.
The strict parser rejects common features:

    Feature               % of files
    incremental updates   65%
    object streams        37%
    free objects          28%
    encryption            5%

⇒ Workaround: normalize with the relaxed parser first!
Real-world files: lessons learned
Figure: validation after normalization.
- normalized: 9829 files
- type checking: 2105 files type-checked, 1575 files with a type error
- graph checking and instructions checking: 1891 files with no error found
The type checker detected typos:
- /Blackls1 instead of /BlackIs1,
- /XObjcect instead of /XObject.
We identified incorrect tree structures in the wild.
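Typos such as /Blackls1 surface as soon as dictionary keys are checked against the set of keys allowed for a given type. The Python sketch below illustrates that idea and adds a spelling suggestion with `difflib`. It is not how Caradoc's type checker is implemented; `check_keys` is a hypothetical helper, and the key set is the list of CCITTFaxDecode parameters as I recall it from the PDF specification.

```python
import difflib

# Parameters of the CCITTFaxDecode filter (assumed complete for this sketch).
CCITT_KEYS = {
    "K", "EndOfLine", "EncodedByteAlign", "Columns", "Rows",
    "EndOfBlock", "BlackIs1", "DamagedRowsBeforeError",
}

def check_keys(dictionary, known_keys):
    """Return (unknown key, closest known key or None) for each unexpected key."""
    problems = []
    for key in dictionary:
        if key not in known_keys:
            close = difflib.get_close_matches(key, sorted(known_keys), n=1)
            problems.append((key, close[0] if close else None))
    return problems

# Lowercase "l" instead of uppercase "I", as in the typo found in the wild.
problems = check_keys({"Columns": 1728, "Blackls1": True}, CCITT_KEYS)
```

Here `problems` contains the single pair `("Blackls1", "BlackIs1")`. A permissive reader would silently ignore the misspelled key and fall back to the default value, so the file may render differently from what the producer intended.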
Caradoc: main commands
Some useful caradoc commands:
Get stats:

    $ caradoc stats file.pdf

Validate:

    $ caradoc stats --strict file.pdf

Normalize:

    $ caradoc cleanup file.pdf --out output.pdf

Interactive console UI (explore objects, decode streams, search...):

    $ caradoc ui file.pdf

More on GitHub: https://github.com/ANSSI-FR/caradoc
Conclusion
- PDF is an old format (25+ years), not designed for simple parsing ⇒ error-prone.
- Producers make mistakes, readers make a best effort ⇒ compatibility bugs, security holes...
- We need cleaner, simpler and more robust file formats! ⇒ e.g. Protocol Buffers [8].
[8] https://developers.google.com/protocol-buffers/
Conclusion
My PDF projects:
- Caradoc: github.com/ANSSI-FR/caradoc
- Cheat sheet: github.com/gendx/pdf-cheat-sheets
- PDF corpus: github.com/gendx/pdf-corpus
Some blog posts about PDF: https://gendignoux.com/blog/
Twitter: @gendignoux
GitHub: @gendx