dissecting pdf documents - mark s....
![Page 2: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/2.jpg)
What Is This Session NOT About?
• Creating PDFs
• How to use Acrobat
• Transparency flattening options in InDesign
• So what is it about?
– PDF documents
– Tooling
– Extracting data
![Page 3: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/3.jpg)
The PDF Format
• 1.0 released in 1993
• Open standard as of July 1st 2008
• Reference publicly available
– http://www.adobe.com/devnet/pdf/pdf_reference_archive.html
[Chart comparing PDF 1.3, PDF 1.4, PDF 1.5, PDF 1.6, PDF 1.7 and OOXML 1.0; axis ticks from 0 to 1500]
![Page 4: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/4.jpg)
PDF Structure
• Header
– %PDF-1.4
– %âãÏÓ (optional but common)
• Body
– Objects
• Xref table
– Index table containing pointers to objects
• Trailer
– Pointers to Xref table, key objects
– %%EOF
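The four parts listed above can be located with very little code. The following is a minimal sketch, not a full parser: it assembles a tiny synthetic file in memory (the helper `build_minimal_pdf` and the one-object document are assumptions made for illustration, and the result is not a renderable PDF), then reads the version from the header and follows the `startxref` pointer at the end of the file to the xref table.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny synthetic file: header, one object, xref table, trailer."""
    header = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"  # version line + binary comment
    obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    body = header + obj1
    xref_offset = len(body)  # the xref table starts right after the body
    xref = (
        b"xref\n0 2\n"
        b"0000000000 65535 f \n"                  # entry 0: head of the free list
        + b"%010d 00000 n \n" % len(header)       # entry 1: byte offset of object 1
    )
    trailer = (
        b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
        b"startxref\n%d\n%%%%EOF\n" % xref_offset
    )
    return body + xref + trailer


def read_header_version(data: bytes) -> str:
    """Return the version declared on the first line, e.g. '1.4'."""
    match = re.match(rb"%PDF-(\d\.\d)", data)
    if match is None:
        raise ValueError("not a PDF header")
    return match.group(1).decode("ascii")


def find_startxref(data: bytes) -> int:
    """Read the xref offset stored just before the final %%EOF marker."""
    tail = data[-1024:]  # the pointer lives at the very end of the file
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", tail)
    if match is None:
        raise ValueError("startxref not found")
    return int(match.group(1))


pdf = build_minimal_pdf()
offset = find_startxref(pdf)
print(read_header_version(pdf), pdf[offset:offset + 4])
```

Reading from the end of the file first is what makes random access possible: a reader only needs the trailer and the xref table before it can jump to any object.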
![Page 5: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/5.jpg)
PDF Objects
• Boolean, Number, String, Name, Array, Dictionary, Stream, Null
• Indirect & direct objects
• Random access
“A PDF file should be thought of as a flattened representation of a data structure consisting of a collection of objects that can refer to each other in any arbitrary way.”
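One way to picture that "collection of objects that can refer to each other" is to model it directly. The sketch below is an illustration only: the `Name` and `Ref` classes, the `objects` table and the `resolve` helper are hypothetical names, and the mapping of PDF types onto Python types is just one possible choice. Direct objects are plain values; indirect objects live in a table keyed by object and generation number and are reached through references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """A PDF name object such as /Type (hypothetical helper class)."""
    value: str


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as '2 0 R' (hypothetical helper class)."""
    num: int
    gen: int = 0


# Indirect objects, keyed by (object number, generation number).
# Booleans, numbers, arrays, dictionaries and null map onto
# bool, int/float, list, dict and None.
objects = {
    (1, 0): {Name("Type"): Name("Catalog"), Name("Pages"): Ref(2)},
    (2, 0): {Name("Type"): Name("Pages"), Name("Kids"): [Ref(3)], Name("Count"): 1},
    (3, 0): {Name("Type"): Name("Page"), Name("Parent"): Ref(2)},
}


def resolve(value, table):
    """Follow indirect references until a direct object is reached."""
    seen = set()
    while isinstance(value, Ref):
        key = (value.num, value.gen)
        if key in seen:
            raise ValueError("reference cycle")
        seen.add(key)
        # A reference to an undefined object is treated as null (None).
        value = table.get(key)
    return value


catalog = resolve(Ref(1), objects)
pages = resolve(catalog[Name("Pages")], objects)
first_page = resolve(pages[Name("Kids")][0], objects)
print(first_page[Name("Type")].value)
```

Note that the page points back at its parent, so the structure is a graph rather than a tree; any traversal has to be prepared for cycles.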
![Page 6: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/6.jpg)
Reading A PDF – The Ninja Way!
![Page 7: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/7.jpg)
Incremental Changes
• Fast saves, but not for free
• Undo & history
• Save vs Save As
• Single-pass writing
• Linearization
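Incremental saves append a new body, xref section and trailer to the end of the file and leave the earlier bytes untouched, which is where the history comes from. As a rough sketch (the function name `revision_ends` and the sample bytes are assumptions for illustration), earlier revisions can be found by looking for `%%EOF` markers. This is a heuristic only: the marker can also occur inside stream data, and linearized files contain an early `%%EOF` that does not mark a separate revision.

```python
def revision_ends(data: bytes) -> list:
    """Return the byte offset just past each %%EOF marker (and its line ending)."""
    marker = b"%%EOF"
    ends = []
    pos = 0
    while True:
        idx = data.find(marker, pos)
        if idx == -1:
            break
        end = idx + len(marker)
        # Include the end-of-line characters that follow the marker.
        while end < len(data) and data[end:end + 1] in (b"\r", b"\n"):
            end += 1
        ends.append(end)
        pos = end
    return ends


# Synthetic stand-ins for a file saved once and then updated incrementally.
original = b"%PDF-1.4\n1 0 obj\n(first draft)\nendobj\n%%EOF\n"
updated = original + b"1 0 obj\n(second draft)\nendobj\n%%EOF\n"

ends = revision_ends(updated)
previous_revision = updated[:ends[0]]  # truncating recovers the earlier save
print(len(ends), previous_revision == original)
```

This is also why a plain Save can leak content the author believed was deleted, while Save As typically rewrites the file from scratch.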
![Page 8: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/8.jpg)
Linearization & Xref Chaining
![Page 9: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/9.jpg)
PDF Objects: Image
• Stream object with dictionary header
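To make "stream object with dictionary header" concrete, here is a small sketch that writes and reads back a synthetic 2x2 grayscale image object. The helper names and the object number 5 are made up for the example. It assumes a direct `/Length` value and a single `/FlateDecode` filter; real files may use an indirect `/Length`, other filters such as `/DCTDecode`, or filter chains, none of which this handles.

```python
import re
import zlib


def make_image_object(width: int, height: int, pixels: bytes) -> bytes:
    """Serialize raw 8-bit grayscale pixels as a Flate-compressed image XObject."""
    compressed = zlib.compress(pixels)
    dictionary = (
        b"<< /Type /XObject /Subtype /Image /Width %d /Height %d "
        b"/ColorSpace /DeviceGray /BitsPerComponent 8 "
        b"/Filter /FlateDecode /Length %d >>" % (width, height, len(compressed))
    )
    return (
        b"5 0 obj\n" + dictionary
        + b"\nstream\n" + compressed + b"\nendstream\nendobj\n"
    )


def read_image_object(obj: bytes):
    """Return (width, height, raw pixels) from an object made by the helper above."""
    # Everything before the first 'stream' keyword is the dictionary header.
    head, sep, rest = obj.partition(b"stream\n")
    if not sep:
        raise ValueError("no stream keyword found")
    length = int(re.search(rb"/Length (\d+)", head).group(1))
    width = int(re.search(rb"/Width (\d+)", head).group(1))
    height = int(re.search(rb"/Height (\d+)", head).group(1))
    # /Length tells the reader how many bytes of stream data to take.
    return width, height, zlib.decompress(rest[:length])


pixels = bytes([0, 255, 255, 0])  # a 2x2 checkerboard
print(read_image_object(make_image_object(2, 2, pixels)))
```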
![Page 10: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/10.jpg)
ABCpdf
• Commercial
• Excellent .NET API
• ObjectSoup is a valuable friend
• Good image rendering
• Useless SWF rendering
• Unstable rendering
• Decent support
• http://www.websupergoo.com/secret.htm
![Page 11: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/11.jpg)
Acrobat
• Commercial (tricky license)
• No COM libraries after 7.x
• Surprisingly stable and fast
• Ugly API
![Page 12: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/12.jpg)
Rendering Using Acrobat
![Page 13: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/13.jpg)
Xpdf
• Open source (GPL)
• pdffonts, pdfimages, pdfinfo, pdftops, pdftotext
• Basis for many other libraries & tools
• Commercial license & COM library available at www.glyphandcog.com
• http://www.foolabs.com/xpdf/
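The Xpdf command-line tools are easy to drive from a script. The sketch below wraps `pdftotext`; it assumes the tool is installed and on the PATH, and the wrapper function names are made up for this example. Writing to `-` sends the text to standard output, `-enc UTF-8` selects the output encoding, and `-layout` asks the tool to preserve the physical layout of the page instead of its default reading order.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path: str, layout: bool = False) -> list:
    """Build the argument list for a pdftotext run that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    cmd += [pdf_path, "-"]     # "-" as the output file means stdout
    return cmd


def extract_text(pdf_path: str, layout: bool = False) -> str:
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


print(build_pdftotext_command("document.pdf", layout=True))
```

Calling `extract_text("document.pdf")` and `extract_text("document.pdf", layout=True)` on the same file is a quick way to compare the two extraction modes.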
![Page 14: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/14.jpg)
PDF Font Management
• Client must have fonts used in PDF document
• However…
– Complete font can be embedded
– Or a subset
– 14 standard fonts (Courier, Helvetica and Times in four styles each, plus Symbol and ITC Zapf Dingbats)
– Font replacement
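A font's `/BaseFont` name already hints at which of these cases applies. The sketch below (the function name and the return labels are assumptions for illustration) recognizes the 14 standard font names and the six-uppercase-letter tag followed by `+` that marks an embedded subset. The name alone cannot tell whether any other font is fully embedded or will need replacement; that requires inspecting the font descriptor for an embedded font file.

```python
import re

# PostScript names of the 14 standard fonts.
STANDARD_14 = frozenset({
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Symbol", "ZapfDingbats",
})

# Subset fonts carry a tag of six uppercase letters, e.g. "ABCDEF+Arial".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+(.+)$")


def describe_base_font(base_font: str) -> tuple:
    """Classify a /BaseFont name as ('subset' | 'standard-14' | 'other', name)."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return ("subset", match.group(1))
    if base_font in STANDARD_14:
        return ("standard-14", base_font)
    return ("other", base_font)


for name in ("ABCDEF+Arial", "Helvetica-Bold", "Garamond"):
    print(name, describe_base_font(name))
```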
![Page 15: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/15.jpg)
Text In PDF
• No concept of text, just characters
• Flow order not guaranteed
• Requires guesstimation to extract text
• Extraction may require embedded fonts
• Lots of tools, some better than others
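The "guesstimation" can be sketched as follows. This is a deliberately naive heuristic and not the algorithm used by any of the tools in this talk; the input format `(x, y, width, text)` and the two tolerance parameters are assumptions for the example. Characters are grouped into lines by similar baseline, sorted left to right, and a space is inserted wherever the horizontal gap is large enough.

```python
def chars_to_lines(chars, line_tol=2.0, space_gap=3.0):
    """Rebuild text lines from positioned characters given as (x, y, width, text)."""
    lines = []  # each entry: [baseline y of the first character, [characters]]
    # PDF y coordinates grow upward, so the highest y is the top of the page.
    for ch in sorted(chars, key=lambda c: -c[1]):
        if lines and abs(lines[-1][0] - ch[1]) <= line_tol:
            lines[-1][1].append(ch)
        else:
            lines.append([ch[1], [ch]])

    result = []
    for _, members in lines:
        members.sort(key=lambda c: c[0])  # left to right within a line
        text = ""
        prev_end = None
        for x, _, width, glyph in members:
            # A gap wider than space_gap is guessed to be a word break.
            if prev_end is not None and x - prev_end > space_gap:
                text += " "
            text += glyph
            prev_end = x + width
        result.append(text)
    return result


# Characters in the arbitrary order they might appear in a content stream.
scrambled = [
    (22, 88, 6, "F"), (30, 100, 5, "o"), (10, 100.5, 6, "H"),
    (16, 88, 6, "D"), (25, 100, 5, "y"), (16, 100, 3, "i"), (10, 88, 6, "P"),
]
print(chars_to_lines(scrambled))
```

Multi-column pages, rotated text and tables all defeat a heuristic this simple, which is one reason different tools produce different results for the same page.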
![Page 16: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/16.jpg)
Text According To ABCpdf
[Figure: sample page with labels 1 to 6]
![Page 17: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/17.jpg)
Text According To Xpdf
[Figure: sample page with labels 1 to 6]
![Page 18: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/18.jpg)
Physical Text According To Xpdf
[Figure: sample page with labels 1 to 6]
![Page 19: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/19.jpg)
SWFTools
• Open source (GPL)
• PDF2SWF converts PDF files to SWF format
– Based on Xpdf
– Active mailing list
– Author actively working on project
– Use dev snapshots / git repo
– Stable, but some kinks
• http://www.swftools.org
![Page 20: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/20.jpg)
iTextSharp
• Open source (5.0 – AGPL(!), 4.1 - LGPL)
• Commercial license available
• .NET port of iText
• Very stable
• Excellent for creating & modifying PDFs
• No rendering capabilities
• http://itextsharp.sourceforge.net/
• http://itextpdf.com/
![Page 21: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/21.jpg)
Extracting Bookmarks
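The slide itself is a screenshot, so the code shown in the talk is not reproduced here. As a rough, library-independent illustration of what bookmark extraction involves: bookmarks are stored as an outline tree in which each item has a `/Title` and is linked to its first child through `/First` and to its next sibling through `/Next`. The sketch below walks such a tree in a hypothetical in-memory table where plain integers stand in for indirect references; the function name and sample data are made up.

```python
def walk_outline(objects, root_num):
    """Yield (depth, title) for every outline item, depth-first."""

    def visit(num, depth):
        seen = set()  # guards against a broken /Next chain that loops
        while num is not None and num not in seen:
            seen.add(num)
            item = objects[num]
            yield depth, item["Title"]
            if "First" in item:
                yield from visit(item["First"], depth + 1)
            num = item.get("Next")

    root = objects[root_num]
    if "First" in root:
        yield from visit(root["First"], 0)


# Hypothetical object table: object 1 is the outline root.
outline_objects = {
    1: {"First": 2, "Last": 4},
    2: {"Title": "Introduction", "Next": 3},
    3: {"Title": "The PDF Format", "First": 5, "Last": 5, "Next": 4},
    4: {"Title": "Tooling"},
    5: {"Title": "PDF Objects"},
}

for depth, title in walk_outline(outline_objects, 1):
    print("  " * depth + title)
```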
![Page 22: Dissecting PDF Documents - Mark S. Rasmussenimprove.dk/miracle-openworld-2010-slides/Dissecting_PDF_Documen… · PDF Objects •Boolean, Number, String, Name, Array, Dictionary,](https://reader035.vdocuments.us/reader035/viewer/2022062604/5f7b24d219f08b75240d2961/html5/thumbnails/22.jpg)
Extracting Links