“semantic pdf processing & document representation”

Future of Cognitive Computing and AI

Semantic PDF Processing and Knowledge Representation

Sridhar Iyengar

Distinguished Engineer

Cognitive Computing Research

IBM T.J. Watson Research Center

[email protected]

© 2017, IBM Corporation

Financial Services: Query from Unstructured Data

Financial documents (.pdf, .html, .docx, …) → Ingest

Query: “Show me revenues for Citibank between 2009 and 2015”

[Chart: x-axis years 2009 to 2015; y-axis scale 0 to 150]


Summary: PDF understanding is hard and requires significant research breakthroughs and product innovations

▪ PDF documents are optimized for display and often do not include the metadata and structure needed to facilitate cognitive post-processing

– Existing technologies and solutions are optimized for printing and viewing, not cognitive post-processing

▪ Need to handle programmatically created documents (via MS Word, PPT, …) as well as legacy and scanned documents (forms, handwritten notes, …)

▪ Approach: define a Semantic Document Structure Model (DSM) as a consistent internal representation of document structures, to be used in future WDC services and products

▪ Currently focused on table and diagram understanding from PDF

– Healthcare, Financial Services, Compliance, Legal, …


Research Focus : IBM AI Platform for Business

• Best platform for building applications that incorporate enterprise and industry knowledge

• Time to Value at every step of cognitive application development

• Tools & Methodology to support development, deployment & intuitive usage


Data:

‒ Structured & Unstructured data sources

‒ Multimodal (text, visual, speech, etc.) data sources

‒ Public & private data sources

Training

‒ Create Domain Models and Specialize them

§ For conversations aligned with business process

§ For discovery of insights

‒ Fast adaptation to new domains

‒ Scale from small to large amounts of training data

‒ Tuned model creation for accuracy vs. training time.

‒ Incremental & Automated Knowledge Evolution

Conversation

‒ Tools for SMEs to train from business processes

‒ Inference engines for specific content structure

Discovery

‒ Tools for SMEs to train from business knowledge

‒ Reason about domain knowledge (vs. Lexical/Syntactic)

Tools & Methodology

‒ Cognitive application lifecycle (code/data/model)

‒ Resilient deployment of cognitive models


Cognitive Computing (AI) Technologies Research

[Diagram labels, in extraction order; asterisks as in the original]

• Decision Support

• People Insights

• Cognitive Software and Data Life Cycle*

• Reasoning and Planning

• Human Computer Interaction

• Conversation

• Query and Retrieval*

• Knowledge Extraction and Representation*

• Learning*

• Natural Language & Text Understanding*

• Visual Comprehension*

• Speech and Audio

• Embodied Cognition

• Cognitive Computing Platform Infrastructure

• Signal Comprehension

• Reasoning About Domains

• Interaction Systems

• Trust and Security

• Semantic PDF Processing*


Goal: From Raw Data to Business Artifacts

Example 1: Semantic PDF Processing

[Diagram nodes: .pdf; Section; fragment (×3); Line Plot; Bulleted List; Obligation]

• Create document fragments by parsing out chunks

• Document structure models

• Reason about document chunks

• Create a representation for a fragment

• Document fragment models

• Reason about fragment constituents

• Create a representation for an obligation

• Models for “obligation language”

• Reason about list or data that refines the obligation

• Hierarchical processing

• Machine-learned models and reasoning at all levels

• Learnability of artifacts, models

• Learn how to specify reasoners

From Raw Data to Business Artifacts

Example 2: Semantic MPEG Processing

[Diagram nodes: .mp4; Scene (×2); Boy; Girl; Night; Soft Music; Candles; Romantic Scene]


Why is PDF Processing hard?

Complexity akin to “Natural Language Understanding”

▪ Thousands of PDF generators (drivers), each with its own rules for placing marks on paper.

▪ Incredible variety in content: complex tables, images, diagrams, formulas, and varying resolution in scanned content.

▪ No closed-form or purely algorithmic solution is feasible; must resort to machine learning.


Why is it hard? Variety of tables: 20-25 major table types in discussion with just one major customer

[Figure: sample tables, with the following labels]

• Complex tables: graphical lines can be misleading. Is this 1, 2 or 3 tables?

• Table with visual clues only

• Multi-row, multi-column column headers

• Nested row headers

• Tables with textual content

• Table with graphic lines

• Table interleaved with text and charts

• Complex multi-row, multi-column column headers identifiable using graphical lines and visual clues
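To make the multi-row, multi-column header case concrete, here is a minimal sketch of how spanning header cells could be flattened into one label per column. The cell format `(text, row_span, col_span)`, the function names, and the sample header are all hypothetical illustrations, not part of the system described in these slides.

```python
from typing import List, Optional, Tuple

# Hypothetical header cell: (text, row_span, col_span)
HeaderCell = Tuple[str, int, int]


def expand_header_rows(rows: List[List[HeaderCell]]) -> List[List[Optional[str]]]:
    """Expand spanning header cells into a dense grid of strings.

    Assumes the first header row covers every column of the table.
    """
    n_cols = sum(col_span for _, _, col_span in rows[0])
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in rows]
    for r, row in enumerate(rows):
        c = 0
        for text, row_span, col_span in row:
            # Skip positions already filled by a cell spanning down from above.
            while c < n_cols and grid[r][c] is not None:
                c += 1
            for dr in range(row_span):
                for dc in range(col_span):
                    grid[r + dr][c + dc] = text
            c += col_span
    return grid


def flatten_headers(rows: List[List[HeaderCell]]) -> List[str]:
    """Join the header texts stacked above each column into one label."""
    labels = []
    for column in zip(*expand_header_rows(rows)):
        parts: List[str] = []
        for text in column:
            # Drop repeats created by a cell that spans several rows.
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        labels.append(" / ".join(parts))
    return labels


# Made-up two-row header: "Name" spans two rows, "Revenue" spans two columns.
sample_header = [
    [("Name", 2, 1), ("Revenue", 1, 2)],
    [("2014", 1, 1), ("2015", 1, 1)],
]
print(flatten_headers(sample_header))
```

The sketch only covers the bookkeeping once spans are known; recovering the spans themselves from graphical lines and visual clues is the hard part the slide refers to.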


Why is it hard? Variety in Image, Diagram Types

[Sample page shown on the slide: L. Lin et al., Pattern Recognition 42 (2009) 1297-1307, p. 1305. The page combines a multi-panel figure (Fig. 8, ROC curves of the detection results for bicycle parts, comparing bottom-up information alone with bottom-up plus top-down information) with its caption and dense body text on hypothesis verification and template matching.]

PDF rendering

• .doc and .ppt rendering to .pdf keeps minimal structure formatting.

Geared towards visual fidelity

• Often .pdf is created by “screen scraping”, scanning, or hybrid ways that do not keep structure information.

Multi-modality: extremely rich information

• Images, text, and tables both co-exist and form nested hierarchies, possibly with several levels

[Figure labels]

• Nested table (numeric and non-numeric + image)

• Tabular representation of images with pictorial cross reference

• Images + captions + cross references and text that comments on the image


Two major approaches to tackling PDF Processing

▪ Unsupervised learning and out-of-the-box PDF processing

– Works well for a large class of domains with some compromise in quality

▪ Supervised learning with a graphical labelling tool

– Potential for improved quality when many similar documents are available

Both approaches can be used together

PDF Understanding: High Level Overview

[Diagram components]

• PDF Parser

• Text analytics (Programmatic PDF)

• TU: Table understanding (Programmatic PDF)

• Image classification

• DU: Line plots (LP)

• DU: Flowcharts (FC)

• DU: Bubble plots (BP)

• DU: Scanned tables (ST)

• …

• Data integration: linking text to diagrams, tables, serialization, …


Learned Semantic Document Representation


PDF Processing Overview in WDC

[Diagram: PDF Docs → WDC DCS Service → HTML, JSON, Plain Text (text and simple table structure)]

https://www.ibm.com/watson/developercloud/document-conversion.html

The current implementation of DCS has limited table processing capability and no support for scanned documents, diagrams, graphs, etc.


PDF Processing.Next Overview (2017/2018)

[Diagram labels]

• WDC DCS.Next Service

• PDF, HTML, Word Docs

• PDF, HTML, Word Docs (Training)

• New PDF Tools: PDF2HTML, PDF2JSON, PDF2-DSMXML

• WDC DCS Service …

• DSM-XML, JSON, Plain Text, HTML

• Text, Tables, Diagrams, Graphs, …

• SME, Data Scientist (domain adaptation using ML)

• Developer using DCS API


PDF Conversion Architecture

[Diagram labels; the order of stages is not fully recoverable from the extraction]

Programmatic PDF Extraction

• Programmatic PDF

• PDFBox API: Parse PDF Document

• Cleanse Raw PDF Data

Scanned & Hybrid PDF Extraction

• Scanned PDF, Hybrid PDF

• OCR Engine API: Scan PDF Document

• Cleanse Raw OCR Output

ML-based PDF Conversion Pipeline

• Composite Unit / Region Identification

• Layout + Reading Order Inference

• Metadata Identification

• Table Identification

• Chart Identification

• Table Structure Population

• HTML Generation

• HTML

Legend: Open Source or Commercial Software; Research Solution

• The ML-based PDF conversion pipeline is source-independent

• The SAME ML-based algorithms can be applied directly to data extracted from either scanned or programmatic PDF

• PDF conversion ML algorithms are unsupervised; they thus achieve stated performance out of the box with NO training / tuning data required

• Deployable in WDC for document-at-a-time processing (through the Document Conversion Service) and in the deep enrichment service

• Scanned PDF processing is available now using the Datacap OCR engine

• Extension using the Tesseract engine
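The source-independence claim can be illustrated with a small structural sketch: two front ends normalize their output into one shared element type, and the same stage functions then run on either. Everything here is hypothetical, including the `RawElement` type, the input shapes, and the two stand-in stages; no PDFBox or OCR engine call is made, and the naive sort is not the reading-order inference the slides describe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class RawElement:
    """Hypothetical common record produced by both extraction front ends."""
    text: str
    x: float
    y: float
    source: str  # "programmatic" or "scanned"


def from_programmatic(items: List[Tuple[str, float, float]]) -> List[RawElement]:
    # Stand-in for output of a PDF parser: (text, x, y) tuples.
    return [RawElement(t, x, y, "programmatic") for t, x, y in items]


def from_ocr(words: List[Dict]) -> List[RawElement]:
    # Stand-in for output of an OCR engine: dicts with text/left/top keys.
    return [RawElement(w["text"], w["left"], w["top"], "scanned") for w in words]


Stage = Callable[[List[RawElement]], List[RawElement]]


def cleanse(elements: List[RawElement]) -> List[RawElement]:
    """Drop elements whose text is empty or whitespace only."""
    return [e for e in elements if e.text.strip()]


def reading_order(elements: List[RawElement]) -> List[RawElement]:
    """Naive stand-in: top-to-bottom, then left-to-right."""
    return sorted(elements, key=lambda e: (round(e.y), e.x))


# The same stage list is applied whatever the source was.
PIPELINE: List[Stage] = [cleanse, reading_order]


def convert(elements: List[RawElement]) -> List[RawElement]:
    for stage in PIPELINE:
        elements = stage(elements)
    return elements


programmatic = from_programmatic([("b", 50, 10), (" ", 0, 0), ("a", 10, 10)])
scanned = from_ocr([{"text": "b", "left": 50, "top": 10},
                    {"text": "a", "left": 10, "top": 10}])
print([e.text for e in convert(programmatic)])
print([e.text for e in convert(scanned)])
```

The design point is that only the two adapter functions know about the source; every later stage sees the shared element type.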


PDF Conversion Downstream Analytics Example

• HTML output from WCS PDF Conversion is directly consumable by downstream analytics

• WCS PDF Conversion table processing example (Norbury sample table):

[Diagram labels: PDF; WCS PDF Conversion Table Processing; HTML; Structured Facts from Table; Watson Knowledge Graph; NLQ Answering; Answer]

Original scanned PDF table / HTML generated from current PDF Conversion Web service:

Bridge          Designer              Length
Brooklyn        J. A. Roebling        1595
Manhattan       G. Lindenthal         1470
Queensborough   Palmer & Hornbostel   1182

Structured facts from existing Table Processing Libraries (with appropriate customization)

NLQ Answering: “Who designed Brooklyn Bridge?” → J. A. Roebling …
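As a rough illustration of the "structured facts from table" step, the sketch below turns the bridge table into (subject, attribute, value) triples and answers the sample question by lookup. The triple format and function names are assumptions made for this example; mapping the natural-language question to a subject and attribute is left out, and this is not the Watson Knowledge Graph or any IBM table-processing library.

```python
from typing import List, Optional, Tuple

Fact = Tuple[str, str, str]  # (subject, attribute, value); hypothetical format


def table_to_facts(header: List[str], rows: List[List[str]]) -> List[Fact]:
    """Treat the first column as the subject and every other column as an attribute."""
    facts: List[Fact] = []
    for row in rows:
        subject = row[0]
        for attribute, value in zip(header[1:], row[1:]):
            facts.append((subject, attribute, value))
    return facts


def lookup(facts: List[Fact], subject: str, attribute: str) -> Optional[str]:
    """Return the value for a subject/attribute pair, ignoring case."""
    for s, a, v in facts:
        if s.lower() == subject.lower() and a.lower() == attribute.lower():
            return v
    return None


HEADER = ["Bridge", "Designer", "Length"]
ROWS = [
    ["Brooklyn", "J. A. Roebling", "1595"],
    ["Manhattan", "G. Lindenthal", "1470"],
    ["Queensborough", "Palmer & Hornbostel", "1182"],
]

facts = table_to_facts(HEADER, ROWS)
# "Who designed Brooklyn Bridge?" reduced by hand to (subject, attribute).
print(lookup(facts, "Brooklyn", "Designer"))
```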


Document Structure Model (Document Representation)

• Define a common document structure ideal for subsequent semantic analysis

• Defined per feature: Section, Bulleted Lists, Headers, Footers, Footnotes, Tables, ...

• Define how section information such as title, number and nesting should be represented

• Define how list information such as list items and list type should be represented
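One possible shape for the section and list information mentioned above, sketched with Python dataclasses. The class names, field names, list-type values, and section numbers are all hypothetical; the actual DSM schema is not shown in these slides.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BulletedList:
    list_type: str  # e.g. "bulleted" or "numbered" (hypothetical values)
    items: List[str] = field(default_factory=list)


@dataclass
class Section:
    number: str  # section number as printed, e.g. "1.1"
    title: str
    lists: List[BulletedList] = field(default_factory=list)
    subsections: List["Section"] = field(default_factory=list)  # nesting


def depth(section: Section) -> int:
    """Nesting depth of a section tree; a section without children has depth 1."""
    return 1 + max((depth(s) for s in section.subsections), default=0)


# Made-up numbering applied to two slide titles from this deck.
doc = Section(
    number="1",
    title="Why is PDF Processing hard?",
    lists=[BulletedList("bulleted", ["Thousands of PDF generators",
                                     "Incredible variety in content"])],
    subsections=[Section(number="1.1", title="Variety of tables")],
)

print(json.dumps(asdict(doc), indent=2))
```

Keeping title, number, and nesting as explicit fields means a downstream consumer does not have to re-derive them from font size or indentation.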


Document Structure Model (DSM) - Draft

[Diagram: Logical Data Model / Ontology Representation]

• Sources: ScanPDF, ProgPDF

• Objects: Page, PageColumn, Paragraph, TextLine, Phrase, Token, Character, Table, TableCell, GraphicalLine, PageChart, EmbeddedImage

• Attributes shown: rowSpan, colSpan; Color, displayOrder; BoundBoxCoords, contents, displayOrder

• [1…n] means ordered list

• All objects have a BoundingBox attribute

• Goal: define a common document structure ideal for subsequent semantic processing

• Captures both raw extracted information (text, vector graphics) and inferred artifacts (tables, charts, paragraphs)

• Start with PDF documents and extend to other formats such as Word and Excel

• DSM schema in OWL; serializations to HTML, JSON, ...
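A minimal sketch of the two rules stated for the draft model: every object carries a bounding box, and children are held in ordered lists. The containment chosen here (Paragraph holds TextLines, TextLine holds Tokens) and all field names are assumptions read from the object names in the draft diagram, not the published schema.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float

    def union(self, other: "BoundingBox") -> "BoundingBox":
        """Smallest box that contains both boxes."""
        return BoundingBox(min(self.x0, other.x0), min(self.y0, other.y0),
                           max(self.x1, other.x1), max(self.y1, other.y1))


def union_all(boxes: List[BoundingBox]) -> BoundingBox:
    return reduce(BoundingBox.union, boxes)


@dataclass
class Token:
    text: str
    bbox: BoundingBox


@dataclass
class TextLine:
    tokens: List[Token]  # ordered list, mirroring the [1…n] notation

    @property
    def bbox(self) -> BoundingBox:
        # Derived from the children so parent and child boxes stay consistent.
        return union_all([t.bbox for t in self.tokens])

    @property
    def text(self) -> str:
        return " ".join(t.text for t in self.tokens)


@dataclass
class Paragraph:
    lines: List[TextLine]  # ordered list

    @property
    def bbox(self) -> BoundingBox:
        return union_all([line.bbox for line in self.lines])


@dataclass
class TableCell:
    text: str
    bbox: BoundingBox
    row_span: int = 1  # rowSpan in the diagram
    col_span: int = 1  # colSpan in the diagram


line = TextLine([Token("Brooklyn", BoundingBox(10, 10, 60, 20)),
                 Token("Bridge", BoundingBox(65, 9, 100, 21))])
print(line.text, line.bbox)
```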



Thank You
