1 document image matching based on component blocks fuhui long, hanchuan peng, zheru chi, and wanchi...

1

Document Image Matching Based on Component Blocks

Fuhui Long, Hanchuan Peng, Zheru Chi, and Wanchi Siu

Center for Multimedia Signal Processing,Department of Electronic & Information Eng.,The Hong Kong Polytechnic Univ.,

Email: {fhlong, phc, enzheru}@eie.polyu.edu.hkhttp://pandora.rad.jhu.edu/~phc/publications.html

2

Outline

IntroductionComponent Block & Data StructureMatching AlgorithmExperimentsDiscussion & Conclusion

3

Document Image Matching

Key technique for document image registration & retrievalCan be applied widely for office automation, digital library, video-conferencing, etc.

4

Current TechniquesExisting methods are mainly based on local features of document page image Cesarini’s form-reader system (attributed

relational graphs) Shimotsuji’s cell struct based 2-dimensional

hash table Watanabe’s blank form structure of repetitions

and positions of cells Tseng and Chen’s line segments based method Fan and Chang’s line crossing relationship

matrix Watanabe and Huang’s predefined logical

structure for business cards Safari’s projective geometry method etc

5

Our Approach

Decompose a document page image into local component blocks Propose measurements to combine ------local block information ------global page layout informationIt is closely related to our e-Doc technique, which is developed for document databases

6

e-DocDocuments on Papers

Image Acquiring

Optical Images

Pre-Processing

Component Block List

e-Doc

Page Block Organization

Functions for Applications

The Block-oriented e-Doc technique can be very useful for document databases related applications, including the document page image retrieval, etc.

7

Preprocessing

Noise removingRegion based binarization and foreground extractionCorrelation based skew correctionImage Blocking: Scan from bottom to top and from left to right Use the simplest region growing method (not

pixel-by-pixel, but line-by-line, i.e. if there is a pixel on the out boundary of current block, then grow out one more line.)

8

Component Block List

……

order=1bound={(250,429),(45,86)}lang_index = 0.4309type=English

We only make use of the block location & size information for matching.

9

Matching Algorithm

Procedure CBL-MA;{Input: a CBL for the input document image, a handle to a template image database of K TBLs }{Output: the TBL with the minimum distance D to CBL}{Preprocessing: for k=1,…,K, do begin sort the kth TBL by block size (from

small to large); end.}{Note: the preprocessing is not a part of this CBL-MA and needs to be done

only once beforehand}begin sort the CBL by block size (from small to large); for k=1,…,K, do begin compute Dk, which is the distance between CBL and the kth TBL;

end; select the TBL with the minimum Dk as output;

end.

(Notation: TBL – Template Block List, CBL – Component Block List)

10

Distance Definitions

)},(min{arg TBLi

CBLjS

CBLRjS SSdI

Size Matching:

TBLi

CBLj

TBLi

CBLjS

TBLi

CBLjS AAAAdSSd ),(),(

Location Matching: )},({minarg],[

TBLi

CBLjC

CBLRcTsIcTsIjC CCdI

TBLi

CBLj

TBLi

CBLjC CCCCd ),(

Total Distance:

TBLRi

TBLi

CBL

cIC CCdD ),(

11

Illustration of the Algorithm

B A

BT

…… ……

…… …… Sequencing with size

Component block list

Sequenced block list

Template block list

1. Matching with size2. Matching with location

12

Experimental Data

A large document template data set of 1350 templates. Define 5 subsets with sizes: 50 templates 100 templates 200 templates 500 templates 1350 templates

Use computer to generate all test data (deformation images according to these templates)

13

Deformation TypesDetection Error Block misdetection rate Pm

Block misaddition rate Pa

Block Size Variation Block size variation rate Ps

Block size variation scale Ss

Block Location Displacement Block displacement rate Pd

Block displacement scale Sd

Block Rotation Block rotation rate Pr

Block rotation angle Dr

14

Data Examples

Template image Deformation image

15

Results for Detection Errors

(1) CBL-MA can perform well (rc > 85%) even when 50% blocks in the block list are

lost or wrongly added (see the column of Pm = 0.5 and Pa = 0.5). Even when 80% blocks

are lost or added, this algorithm can still produce matching accuracy nearly 60% (see the column of Pm = 0.8 and Pa = 0.8).

(2) CBL-MA is more insensitive to block misaddition than to block misdetection. This is reasonable because when additional blocks are wrongly put into CBL, the original blocks still play, although weaker, roles. On the contrary, the lost information due to block misdetection is non-recoverable.

16

Results for Detection Errors

Pm: Block misdetection rate

Pa: Block misaddition rate

17

Results for Block Size Variation

For block size variation, the influence of parameters Ps and Ss (here we set the same

scale factor for both block width and height) on rc is given in Table 2. When blocks

expand or shrink greatly, CBL-MA can keep rc above 90% (the fourth row of Table 2).

At the same time, even when all blocks have size variations (Ps = 1.0), our algorithm

can produce a very high matching accuracy of 95%. Note that the latter corresponds to many office automation applications, where Ss is not very large, however, most blocks

are subject to some degree of size variation, i.e. Ps is close to 1.

18

Results for Block Size Variation

Ps :Block size variation rate

Ss :Block size variation scale

19

Results for Block Displacement

For block location displacement, the influence of parameters Pd and Sd on rc is given in Table 3.

Evidently CBL-MA is robust to block location variation (rc always larger than 95%).

20

Results for Block Displacement

Pd :Block displacement rate Sd: Block displacement scale

21

Results for Block Rotation

For block rotation, the influence of parameters Pr and Dr on rc is given in Table 4. For both cases

{Dr=15, Pr varies from 0.2 to 1.0} and { Pr=0.5, Dr varies from 5 to 45}, CBL-MA produces

satisfying classification, even when the test images contain strong deformation, e.g. 50% component blocks have at most 45 rotation, or all component blocks have at most 15 rotation. Notice that block rotation will directly lead to the significant change of block sizes.

22

Results for Block Rotation

Pr :Block rotation rate

Dr : Block rotation angle

23

Results for Template Set Size

Here under a general setting of parameters {Pa =0.2, Pm=0.2, Ps =0.2, Ss=0.2, Pd =0.5, Sd =0.5,

Pr =0.5, Dr=15}, we examine the influence of the template image set size on the matching

accuracy. All the five template sets are used. For each template set, we independently generate at least 2000 images for testing. The results are listed in Table 5. It is clear that even when the template set size grows to 500, the matching accuracy is satisfying (>80%). For the 1350-template set (Set-E), rc is still around 70%.

24

Results for Template Set Size

25

Comparison to Other Algorithms

It is noticed that the failure in detecting local features (e.g. line-segments) usually immediately results in bad performance of several other algorithms. However, our experiments demonstrate that even when the block information is partially lost or inaccurate, there is no significant performance reduction of CBL-MA.

26

Computational ComplexityDenote n as the number of blocks in a CBL, m as the number of blocks in a TBL, K as the number of template images in a document image database. When we use quicksort and binary search algorithms, the typical computational complexity is O(nlogn) for CBL sorting O(Kmlogn) for CBL/TBL size matching O(Km(2TC+1)) for CBL/TBL location matching O(K) for CBL/TBL distance calculation O(logK) for finding the minimum distance Totally O((Km+n)logn)

27

Application: Column Table Data Extraction

This page matching algorithm is then refined and applied to automatic data extraction of column forms.

---All fields in input image differ from each other a lot and the local-feature based approach can not work well.---Our algorithm is a powerful tool to find out the correct image template, which is used to annotate the image data fields to accomplish the data extraction successfully.

28

Conclusions

Based on block list (and tree), our algorithm can effectively make use of the local information of each page block and the global information of page layout.We present a method for effective document image matching. The algorithm gives satisfying performance for various image deformations. The algorithm is robust to image distortion, filled-in text, and noises.

We report an successful application of our algorithm

in column table data auto-reading.