Prénom Nom Document Analysis: Structure Recognition Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008

Post on 19-Dec-2015

Document Analysis: Structure Recognition

Prof. Rolf Ingold, University of Fribourg

Master course, spring semester 2008

© Prof. Rolf Ingold

Outline

Objectives
Physical and logical structures
Examples of applications
Methodologies for structure recognition
Microstructures vs. macrostructures
Model driven approaches
Interactive systems

Importance of document structures

Document = Content + Structures

Structures convey abstract high level information

Structures are revealed by styles

Applications of document structure recognition

Information extraction
  form analysis (check readers, ...)
  business applications: mail distribution, invoice processing, ...
  analysis of museum & library notices
  analysis of bibliographical references

Document mining, content analysis
  business reports
  legal documents
  scientific publications

Intelligent indexing
  laws
  magazines & newspapers

Document restyling
  teaching material
  ...

Extended Processing Chain

[Diagram: processing chain with the stages Preprocessing, Segmentation, OCR, OFR, Layout analysis, Logical labeling and Postanalysis, and the data items Image, Blocks, Simple Text, Fonts and Struct. Document]

Physical document structures

Reveal the publisher's view
Composed of a hierarchy of physical entities
  text blocks, text lines and tokens
  graphical primitives
Universal, i.e. independent of the document class

[Diagram: example tree with nodes labeled document, region, block, hr and frm]

Illustration of physical document structure

from A. Belaïd

Illustration of logical document structure

Logical structures

Reflect the author's mind
Independent of presentation
  can be mapped on various physical structures
Composed of application dependent logical entities
Specific to the application and document class

[Diagram: example tree with nodes labeled document, article, link, title, hdln, author and p]

Relation between logical and physical structure

There is no 1:1 relation between physical and logical structure
There are some correspondences between the two [figure not included in the transcript]

Role of style sheets

[Diagram: Logical Structure, Stylesheet and Physical Structure, with the labels formatting, analysis, edit, print and display]

Document formatting is straightforward ...
But document analysis is a non-trivial task that generally cannot be fully automated

Methodologies

Document structural analysis can be
  data-driven: the recognition task is based on image analysis
  model-driven: the recognition task is ...

Methods of structural document analysis can be classified into
  geometrical approaches
  syntactic approaches based on formal grammars
  structural approaches based on graphs
  rule based approaches
  expert systems (artificial intelligence)
  machine learning

Syntactic Document Recognition [Ingold89]

Fully model driven approach
Formal document description language
  attributed grammar translated into an analysis graph
Top down matching algorithm with backtracking
  for macro-structure as well as micro-structure recognition
Very generic approach
Sensitive to noise (no error recovery)
Theoretically exponential complexity

Document Description Language [Ingold89]

Document class specific formal description composed of
  composition rules (context-free grammar)
  typographical rules (attributes)

Act:DOC => ActNumber ActContent FootNotes Headings ;
ActNumber:FRG => {Number $ Period} ;
ActContent:PRT => ActTitle ActDate Otgan {Provis} Formul {Chapter} [Validity] ;
...
Chapter:PRT => ChTitle ({Section} | {Article}) ;
ChTitle.zone = Inherited
ChTitle.alignment = (Allowed, Centered, 0pt, 0pt, Undefined) ;
ChTitle.lineHeight = 11pt ;
ChTitle.spaceBefore = (Allowed, [6pt, 60pt]) ;
ChTitle.interSpace = (Forbidden, [2pt, 3pt]) ;
ChTitle.font = (Times, 11pt, Bold, Roman) ;
Article.spaceBefore = <FST: (Forbidden, [6pt, 30pt]), NXT: (Allowed, [6pt, 30pt])> ;
...
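
To make the role of the typographical rules more concrete, here is a minimal Python sketch of how a few of the ChTitle attributes above could be checked against a measured text block. The Block class, the CH_TITLE dictionary layout and the tolerance value are assumptions made for illustration; they are not part of the description language of [Ingold89].

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A physical text block with a few measured attributes (hypothetical)."""
    font: tuple          # (family, size in pt, weight, slant)
    line_height: float   # in points
    space_before: float  # in points


# Values transcribed from the ChTitle rules; the dictionary layout is an
# assumption of this sketch, not the original formalism.
CH_TITLE = {
    "font": ("Times", 11, "Bold", "Roman"),
    "line_height": 11.0,
    "space_before": (6.0, 60.0),  # allowed interval in points
}


def matches(block: Block, rule: dict, tol: float = 0.5) -> bool:
    """Return True if the block satisfies every constraint of the rule."""
    low, high = rule["space_before"]
    return (
        block.font == rule["font"]
        and abs(block.line_height - rule["line_height"]) <= tol
        and low <= block.space_before <= high
    )


candidate = Block(("Times", 11, "Bold", "Roman"), 11.0, 12.0)
print(matches(candidate, CH_TITLE))
```

A strict test like this is what makes a purely syntactic approach sensitive to noise: a single measurement outside its interval rejects the whole candidate.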

Analysis graph [Ingold89]

Analysis graph for syntactic analysis where each node has two links
  successor (in case of successful match)
  alternative (in case of unsuccessful match)
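
The two links can be sketched in Python as follows. This is a simplified illustration, not the implementation of [Ingold89]: the Node class and the parse function are hypothetical, and matching is reduced to comparing label strings, whereas a real system would also test typographical attributes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    label: str                             # entity expected at this node
    successor: Optional["Node"] = None     # followed after a successful match
    alternative: Optional["Node"] = None   # tried after an unsuccessful match


def parse(node: Optional[Node], tokens: Sequence[str], pos: int = 0) -> bool:
    """Return True if the graph starting at node accepts tokens[pos:]."""
    if node is None:
        # End of the graph: accept only if every token has been consumed.
        return pos == len(tokens)
    if pos < len(tokens) and tokens[pos] == node.label:
        if parse(node.successor, tokens, pos + 1):
            return True
    # The match failed, or the successor path failed: backtrack.
    return node.alternative is not None and parse(node.alternative, tokens, pos)


# A chapter title followed by either a section or an article.
article = Node("Article")
section = Node("Section", alternative=article)
title = Node("ChTitle", successor=section)

print(parse(title, ["ChTitle", "Article"]))
print(parse(title, ["ChTitle", "FootNote"]))
```

Because every failed path is retried through the alternative links, the worst-case running time grows exponentially with the input length, which is the complexity concern raised for this approach.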

Fuzzy document structure recognition [Hu94]

The previous approach has been adapted to be less sensitive to matching errors
  matching uses fuzzy logic

Fuzzy document structure recognition [Hu94]

Pattern matching uses fuzzy logic
Parsing is expressed as a cost function to be optimized
  finding the shortest path in a graph (solved by linear programming)
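
The idea of fuzzy matching combined with a minimum-cost path can be sketched as below. Everything in this example is an assumption made for illustration: the trapezoidal membership functions, the two labels, their font-size ranges and the allowed transitions are invented, and the search uses simple dynamic programming over the label sequence rather than the linear programming mentioned above.

```python
def trapezoid(x, a, b, c, d):
    """Fuzzy membership: 1 on [b, c], falling linearly to 0 at a and d."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Hypothetical fuzzy descriptions of two labels by font size (in points).
FUZZY = {
    "title": (11, 13, 20, 24),
    "para": (8, 9, 11, 12),
}
# Hypothetical composition rule: a title is followed by paragraphs.
ALLOWED = {("title", "para"), ("para", "para")}


def cost(font_size, label):
    """Cost of assigning a label: 0 for a perfect match, 1 for none."""
    return 1.0 - trapezoid(font_size, *FUZZY[label])


def best_labeling(sizes):
    """Return (total cost, labels) of the cheapest allowed label sequence."""
    best = {label: (cost(sizes[0], label), [label]) for label in FUZZY}
    for size in sizes[1:]:
        step = {}
        for label in FUZZY:
            candidates = [
                (total + cost(size, label), path + [label])
                for prev, (total, path) in best.items()
                if (prev, label) in ALLOWED
            ]
            if candidates:
                step[label] = min(candidates, key=lambda t: t[0])
        best = step
    return min(best.values(), key=lambda t: t[0])


print(best_labeling([14, 10, 11.5]))
```

In this sketch a block whose font size falls slightly outside the ideal range only raises the path cost instead of rejecting the parse, which is the gain in robustness over strict matching.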

Graphein: Blackboard approach [Chenevoy92]

Model of Graphein [Chenevoy92]

Complex Layout Analysis [Azokly95]

Modeling of Scientific Journals [Azokly95]

Model for a Scientific Journal

<volume name="article" width="160" height="240">
  <page name="first"> ... </page>
  <page name="even">
    <hsep name="hs1" bloc="4 3 LEFT RIGHT" type="BLANK"/>
    <layer name="principle">
      <vsep name="vs1" bloc="40 65 TOP hs1" type="BLANK"/>
      <vsep name="vs2" bloc="[50,60] 4 hs1 BOTTOM" type="BLANK"/>
      <region name="center" bloc="vs2 RIGHT hs1 BOTTOM" content="ANY"/>
      <region name="margin" bloc="LEFT vs2 hs1 BOTTOM" content="TEXT"/>
      ...
    </layer>
    <layer name="secondary">
      <hsep name="hs2" bloc="[10,220] 2 LEFT RIGHT" type="BLANK">
        <subst value="hs1"/>
      </hsep>
      <hsep name="hs3" bloc="[10,220] 2 LEFT RIGHT" type="BLANK">
        <subst value="BOTTOM"/>
      </hsep>
      <region name="figure" bloc="LEFT RIGHT hs2 hs3" content="FIGS"/>
    </layer>
  </page>
  ...
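
As a rough illustration of how such a model could be read by a program, the sketch below parses a small, well-formed fragment of the page model with Python's standard xml.etree.ElementTree module. The fragment is a shortened copy of the "even" page above; the helper functions are hypothetical, and no meaning is assigned to the individual values of the bloc attribute, since the slide does not define them.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed fragment of the model shown above.
FRAGMENT = """
<page name="even">
  <hsep name="hs1" bloc="4 3 LEFT RIGHT" type="BLANK"/>
  <layer name="principle">
    <vsep name="vs2" bloc="[50,60] 4 hs1 BOTTOM" type="BLANK"/>
    <region name="center" bloc="vs2 RIGHT hs1 BOTTOM" content="ANY"/>
    <region name="margin" bloc="LEFT vs2 hs1 BOTTOM" content="TEXT"/>
  </layer>
</page>
"""


def separators(xml_text):
    """Names of all horizontal and vertical separators, in document order."""
    root = ET.fromstring(xml_text)
    return [e.get("name") for e in root.iter() if e.tag in ("hsep", "vsep")]


def region_bounds(xml_text):
    """Map each region name to the tokens of its bloc attribute."""
    root = ET.fromstring(xml_text)
    return {r.get("name"): r.get("bloc").split() for r in root.iter("region")}


print(separators(FRAGMENT))
print(region_bounds(FRAGMENT))
```

The output shows that regions are delimited by references to named separators (such as vs2 and hs1) or to page edges, rather than by fixed coordinates.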

Use of Document Recognition Models

There is no universal approach!

Document recognition systems must be tuned
  for specific applications
  for specific document classes

Contextual information is required
Models provide information like
  generic document structures (DTD or XML-schema)
  geometrical and typographical attributes (style information)
  semantic information (keywords, dictionaries, databases, ...)
  statistical information

Content of document models

Generic structure
  Document Type Definition (DTD) or XML-schema

Style information
  Absolute or relative positioning
  Typographical attributes & formatting rules

Semantics (if available)
  Linguistic information, keywords
  Application specific ontology

Probabilistic information
  Frequencies of items or sequences, co-occurrences

Trouble with document models

Document models are hard to produce and to maintain
  implicit models (hard coded in the application) => hard to modify, adapt, extend
  explicit models, written in a formal language => cumbersome to produce, need high expertise
  abstract models, learned automatically => need a lot of training data (with ground-truth!)

Need for more flexible tools:
  assisted environments with friendly user interfaces
  recognition improving with use
  models are learned incrementally
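
One very simple way to picture "recognition improving with use" is a labeler that stores every manual correction and reuses it for later predictions. The Python sketch below uses a nearest-neighbor rule for this. The class, the feature vectors and the labels are hypothetical, and this is not the method of any system cited in these slides.

```python
import math


class IncrementalLabeler:
    """Nearest-neighbor labeler that learns from user corrections."""

    def __init__(self):
        self.examples = []  # list of (feature tuple, label)

    def predict(self, features):
        """Return the label of the closest stored example, or None."""
        if not self.examples:
            return None
        nearest = min(self.examples, key=lambda ex: math.dist(ex[0], features))
        return nearest[1]

    def correct(self, features, label):
        """Record a label supplied by the user for these features."""
        self.examples.append((tuple(features), label))


labeler = IncrementalLabeler()
# Hypothetical features: (font size in pt, 1 if bold else 0).
labeler.correct((14, 1), "title")
labeler.correct((10, 0), "paragraph")
print(labeler.predict((13, 1)))
```

No training set is needed up front: the model starts empty and each correction immediately influences the following predictions.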

Pattern Based Document Understanding [Robadey 03]

Configurations consist of
  Set of vertices
    Labeled (type)
    Attributed (pos, typo, ...)
  Edges between vertices
    Labeled (neighborhood relation)
    Attributed (geom, ...)

Model consists of
  Extraction rules
  For each class
    Attribute selector
    List of patterns

[Diagram with the labels: document image, extraction, configuration, classification, model, rules, selector, patt., id]
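
A configuration as described above is an attributed, labeled graph. The following Python sketch shows one possible representation. The class and method names, the example labels ("text-line", "below") and the attribute keys are assumptions made for illustration and are not taken from the cited work.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    label: str                                 # type of the entity
    attrs: dict = field(default_factory=dict)  # position, typography, ...


@dataclass
class Edge:
    source: int                                # index of the source vertex
    target: int                                # index of the target vertex
    label: str                                 # neighborhood relation
    attrs: dict = field(default_factory=dict)  # geometry, ...


@dataclass
class Configuration:
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_vertex(self, label, **attrs):
        """Add a vertex and return its index."""
        self.vertices.append(Vertex(label, attrs))
        return len(self.vertices) - 1

    def add_edge(self, source, target, label, **attrs):
        self.edges.append(Edge(source, target, label, attrs))

    def neighbors(self, index, relation):
        """Indices of vertices reached from index through the given relation."""
        return [e.target for e in self.edges
                if e.source == index and e.label == relation]


config = Configuration()
head = config.add_vertex("text-line", font_size=14, bold=True)
body = config.add_vertex("text-line", font_size=10, bold=False)
config.add_edge(head, body, "below", distance=12)
print(config.neighbors(head, "below"))
```

A pattern in the model could then be expressed as a small graph of the same kind and compared with the neighborhood of each vertex to be classified.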

Evolution of 2-CREM performance

[Chart: improvement of correct labeling as a function of clicks used for correcting labels manually]

Conclusion

Structure recognition of documents is still an open issue
  Solutions exist for specialized applications

Generic approaches are not mature
  models are hard to establish
  training data is missing

As an alternative: interactive systems with incremental model adaptation