1
Document Image Matching Based on Component Blocks
Fuhui Long, Hanchuan Peng, Zheru Chi, and Wanchi Siu
Center for Multimedia Signal Processing, Department of Electronic & Information Eng., The Hong Kong Polytechnic Univ.
Email: {fhlong, phc, enzheru}@eie.polyu.edu.hk
http://pandora.rad.jhu.edu/~phc/publications.html
2
Outline
Introduction
Component Block & Data Structure
Matching Algorithm
Experiments
Discussion & Conclusion
3
Document Image Matching
Key technique for document image registration & retrieval
Can be applied widely for office automation, digital library, video-conferencing, etc.
4
Current Techniques
Existing methods are mainly based on local features of the document page image:
  Cesarini’s form-reader system (attributed relational graphs)
  Shimotsuji’s cell structure based 2-dimensional hash table
  Watanabe’s blank form structure of repetitions and positions of cells
  Tseng and Chen’s line segments based method
  Fan and Chang’s line crossing relationship matrix
  Watanabe and Huang’s predefined logical structure for business cards
  Safari’s projective geometry method
  etc.
5
Our Approach
Decompose a document page image into local component blocks
Propose measurements to combine
  ---local block information
  ---global page layout information
It is closely related to our e-Doc technique, which is developed for document databases
6
e-Doc
Documents on Papers
Image Acquiring
Optical Images
Pre-Processing
Component Block List
e-Doc
Page Block Organization
Functions for Applications
The block-oriented e-Doc technique can be very useful for document-database applications, including document page image retrieval.
7
Preprocessing
Noise removal
Region based binarization and foreground extraction
Correlation based skew correction
Image Blocking:
  Scan from bottom to top and from left to right
  Use the simplest region growing method (not pixel-by-pixel, but line-by-line, i.e. if there is a pixel on the outer boundary of the current block, then grow out one more line.)
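The line-by-line region growing described above can be sketched as follows. This is only an illustrative reading of the slide, not the authors' implementation: the names `grow_block` and `find_blocks`, the list-of-lists binary image, and the inclusive `(top, left, bottom, right)` box layout are all assumptions made for this example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive (assumed layout)


def grow_block(img: List[List[int]], row: int, col: int) -> Box:
    """Grow a rectangle from a seed pixel, one whole line (row or column) at a time."""
    h, w = len(img), len(img[0])
    top, left, bottom, right = row, col, row, col
    grew = True
    while grew:
        grew = False
        # Outer boundary rows span one extra column on each side (clipped to the image).
        c0, c1 = max(left - 1, 0), min(right + 1, w - 1)
        if top > 0 and any(img[top - 1][c0:c1 + 1]):
            top -= 1
            grew = True
        if bottom < h - 1 and any(img[bottom + 1][c0:c1 + 1]):
            bottom += 1
            grew = True
        # Outer boundary columns span one extra row on each side.
        r0, r1 = max(top - 1, 0), min(bottom + 1, h - 1)
        if left > 0 and any(img[r][left - 1] for r in range(r0, r1 + 1)):
            left -= 1
            grew = True
        if right < w - 1 and any(img[r][right + 1] for r in range(r0, r1 + 1)):
            right += 1
            grew = True
    return top, left, bottom, right


def find_blocks(img: List[List[int]]) -> List[Box]:
    """Scan bottom-to-top, left-to-right; seed a block at each uncovered foreground pixel."""
    blocks: List[Box] = []
    for r in range(len(img) - 1, -1, -1):
        for c in range(len(img[0])):
            covered = any(t <= r <= b and l <= c <= rt for t, l, b, rt in blocks)
            if img[r][c] and not covered:
                blocks.append(grow_block(img, r, c))
    return blocks


image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
blocks = find_blocks(image)
```

Because growth happens a full line at a time, rectangles found this way may overlap for interlocking shapes; the sketch does not attempt to merge them.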
8
Component Block List
……
order=1
bound={(250,429),(45,86)}
lang_index = 0.4309
type=English
We only make use of the block location & size information for matching.
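A block record like the one above could be held in a small data class from which size and location are derived. The sketch below is hypothetical: the class name `ComponentBlock`, the field name `block_type`, and especially the reading of `bound` as ((x_min, x_max), (y_min, y_max)) are assumptions; the slide does not state how the two pairs are to be interpreted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ComponentBlock:
    order: int
    # ASSUMPTION: ((x_min, x_max), (y_min, y_max)); the slide does not define this layout.
    bound: Tuple[Tuple[int, int], Tuple[int, int]]
    lang_index: float
    block_type: str

    @property
    def width(self) -> int:
        (x_min, x_max), _ = self.bound
        return x_max - x_min

    @property
    def height(self) -> int:
        _, (y_min, y_max) = self.bound
        return y_max - y_min

    @property
    def area(self) -> int:
        # Block size, taken here as the bounding-box area.
        return self.width * self.height

    @property
    def center(self) -> Tuple[float, float]:
        # Block location, taken here as the bounding-box center.
        (x_min, x_max), (y_min, y_max) = self.bound
        return ((x_min + x_max) / 2, (y_min + y_max) / 2)


block = ComponentBlock(order=1, bound=((250, 429), (45, 86)),
                       lang_index=0.4309, block_type="English")
```

Only `area` and `center` would feed the matching step; `lang_index` and `block_type` are carried along but unused there.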
9
Matching Algorithm
Procedure CBL-MA;
{Input: a CBL for the input document image, a handle to a template image database of K TBLs}
{Output: the TBL with the minimum distance D to CBL}
{Preprocessing: for k=1,…,K, do begin sort the kth TBL by block size (from small to large); end.}
{Note: the preprocessing is not a part of this CBL-MA and needs to be done only once beforehand}
begin
  sort the CBL by block size (from small to large);
  for k=1,…,K, do
  begin
    compute Dk, which is the distance between CBL and the kth TBL;
  end;
  select the TBL with the minimum Dk as output;
end.
(Notation: TBL – Template Block List, CBL – Component Block List)
10
Distance Definitions
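Size Matching:
\[ I_S = \arg\min_{j \in R^{\mathrm{CBL}}} \{ d_S(S_j^{\mathrm{CBL}}, S_i^{\mathrm{TBL}}) \} \]
\[ d_S(S_j^{\mathrm{CBL}}, S_i^{\mathrm{TBL}}) = d_S(A_j^{\mathrm{CBL}}, A_i^{\mathrm{TBL}}) = \left| A_j^{\mathrm{CBL}} - A_i^{\mathrm{TBL}} \right| \]
Location Matching:
\[ I_C = \arg\min_{j \in R^{\mathrm{CBL}},\; j \in [I_S - T_C,\, I_S + T_C]} \{ d_C(C_j^{\mathrm{CBL}}, C_i^{\mathrm{TBL}}) \} \]
\[ d_C(C_j^{\mathrm{CBL}}, C_i^{\mathrm{TBL}}) = \left\| C_j^{\mathrm{CBL}} - C_i^{\mathrm{TBL}} \right\| \]
Total Distance:
\[ D = \sum_{i \in R^{\mathrm{TBL}}} d_C(C_{I_C}^{\mathrm{CBL}}, C_i^{\mathrm{TBL}}) \]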
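The CBL-MA procedure and its distance definitions might be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' code: block size is taken as bounding-box area with an absolute-difference size distance, block location as the box center with Euclidean distance, and the search window half-width `t_c` defaults to 2. The names `Block`, `nearest_size_index`, `list_distance`, and `match` are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass
from math import hypot, inf
from typing import List, Sequence


@dataclass(frozen=True)
class Block:
    cx: float      # center x (assumed location measure)
    cy: float      # center y
    width: float
    height: float

    @property
    def area(self) -> float:  # assumed size measure
        return self.width * self.height


def sort_by_size(blocks: Sequence[Block]) -> List[Block]:
    return sorted(blocks, key=lambda b: b.area)


def nearest_size_index(areas: Sequence[float], target: float) -> int:
    """Size matching: binary search for the index whose area is closest to target."""
    pos = bisect_left(areas, target)
    if pos == 0:
        return 0
    if pos == len(areas):
        return len(areas) - 1
    return pos if areas[pos] - target < target - areas[pos - 1] else pos - 1


def list_distance(cbl_sorted: Sequence[Block], tbl_sorted: Sequence[Block],
                  t_c: int = 2) -> float:
    """Total distance D between a size-sorted CBL and a size-sorted TBL."""
    if not cbl_sorted:
        return inf
    areas = [b.area for b in cbl_sorted]
    total = 0.0
    for t in tbl_sorted:
        i_s = nearest_size_index(areas, t.area)
        lo, hi = max(0, i_s - t_c), min(len(cbl_sorted) - 1, i_s + t_c)
        # Location matching: best center distance inside the window around i_s.
        total += min(hypot(c.cx - t.cx, c.cy - t.cy) for c in cbl_sorted[lo:hi + 1])
    return total


def match(cbl: Sequence[Block], templates: Sequence[Sequence[Block]],
          t_c: int = 2) -> int:
    """Return the index of the (pre-sorted) template list with minimum distance."""
    cbl_sorted = sort_by_size(cbl)
    distances = [list_distance(cbl_sorted, tbl, t_c) for tbl in templates]
    return min(range(len(templates)), key=distances.__getitem__)


template_a = sort_by_size([Block(10, 10, 20, 5), Block(50, 50, 40, 40)])
template_b = sort_by_size([Block(100, 100, 20, 5), Block(200, 10, 40, 40)])
page = [Block(51, 49, 41, 39), Block(11, 10, 19, 5)]
best = match(page, [template_a, template_b])
```

As in the procedure's note, the templates are sorted once beforehand; only the input CBL is sorted per query.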
11
Illustration of the Algorithm
[Figure: the component block list is sequenced by size into a sequenced block list, which is then compared with the template block list in two steps: 1. matching with size; 2. matching with location.]
12
Experimental Data
A large document template data set of 1350 templates.
Define 5 subsets with sizes:
  50 templates
  100 templates
  200 templates
  500 templates
  1350 templates
Use computer to generate all test data (deformation images according to these templates)
13
Deformation Types
Detection Error
  Block misdetection rate Pm
  Block misaddition rate Pa
Block Size Variation
  Block size variation rate Ps
  Block size variation scale Ss
Block Location Displacement
  Block displacement rate Pd
  Block displacement scale Sd
Block Rotation
  Block rotation rate Pr
  Block rotation angle Dr
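One way such deformed test data could be generated from a template block list is sketched below. The slides name the parameters but not their exact semantics, so everything here is an assumption made for illustration: rates are per-block probabilities, the size factor is uniform in [1 - Ss, 1 + Ss] and shared by width and height, displacement is a fraction of the block's own dimensions, rotation replaces the block by the bounding box of the rotated rectangle, and spurious blocks are drawn uniformly over the page. The names `Block` and `deform` are hypothetical.

```python
import math
import random
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Block:
    cx: float
    cy: float
    width: float
    height: float


def deform(blocks: Sequence[Block], rng: random.Random, page_w: float, page_h: float,
           pm: float = 0.0, pa: float = 0.0, ps: float = 0.0, ss: float = 0.0,
           pd: float = 0.0, sd: float = 0.0, pr: float = 0.0, dr: float = 0.0) -> List[Block]:
    out: List[Block] = []
    for b in blocks:
        if rng.random() < pm:                      # misdetection: drop the block
            continue
        if rng.random() < ps:                      # size variation, same factor for w and h
            f = 1.0 + rng.uniform(-ss, ss)
            b = replace(b, width=b.width * f, height=b.height * f)
        if rng.random() < pd:                      # displacement relative to block size
            b = replace(b, cx=b.cx + rng.uniform(-sd, sd) * b.width,
                        cy=b.cy + rng.uniform(-sd, sd) * b.height)
        if rng.random() < pr:                      # rotation by at most dr degrees
            a = math.radians(rng.uniform(-dr, dr))
            w = b.width * abs(math.cos(a)) + b.height * abs(math.sin(a))
            h = b.width * abs(math.sin(a)) + b.height * abs(math.cos(a))
            b = replace(b, width=w, height=h)
        out.append(b)
    for _ in blocks:                               # misaddition: spurious random blocks
        if rng.random() < pa:
            out.append(Block(rng.uniform(0, page_w), rng.uniform(0, page_h),
                             rng.uniform(1, page_w / 4), rng.uniform(1, page_h / 4)))
    return out


template = [Block(10, 10, 20, 5), Block(50, 50, 40, 40)]
unchanged = deform(template, random.Random(0), 100, 100)
all_lost = deform(template, random.Random(0), 100, 100, pm=1.0)
```

With every rate at zero the template is returned unchanged, which gives a convenient sanity check before sweeping a single parameter.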
14
Data Examples
[Figure: a template image and a deformation image]
15
Results for Detection Errors
(1) CBL-MA performs well (rc > 85%) even when 50% of the blocks in the block list are lost or wrongly added (see the column of Pm = 0.5 and Pa = 0.5). Even when 80% of the blocks are lost or added, the algorithm still produces a matching accuracy of nearly 60% (see the column of Pm = 0.8 and Pa = 0.8).
(2) CBL-MA is less sensitive to block misaddition than to block misdetection. This is reasonable because when additional blocks are wrongly put into the CBL, the original blocks still contribute, although more weakly. In contrast, the information lost due to block misdetection is non-recoverable.
16
Results for Detection Errors
Pm: Block misdetection rate
Pa: Block misaddition rate
17
Results for Block Size Variation
For block size variation, the influence of parameters Ps and Ss (here we set the same scale factor for both block width and height) on rc is given in Table 2. When blocks expand or shrink greatly, CBL-MA can keep rc above 90% (the fourth row of Table 2). At the same time, even when all blocks have size variations (Ps = 1.0), our algorithm produces a very high matching accuracy of 95%. Note that the latter case corresponds to many office automation applications, where Ss is not very large but most blocks are subject to some degree of size variation, i.e. Ps is close to 1.
18
Results for Block Size Variation
Ps: Block size variation rate
Ss: Block size variation scale
19
Results for Block Displacement
For block location displacement, the influence of parameters Pd and Sd on rc is given in Table 3.
Evidently CBL-MA is robust to block location variation (rc always larger than 95%).
20
Results for Block Displacement
Pd: Block displacement rate
Sd: Block displacement scale
21
Results for Block Rotation
For block rotation, the influence of parameters Pr and Dr on rc is given in Table 4. For both cases {Dr = 15°, Pr varies from 0.2 to 1.0} and {Pr = 0.5, Dr varies from 5° to 45°}, CBL-MA produces satisfactory classification, even when the test images contain strong deformation, e.g. 50% of the component blocks are rotated by at most 45°, or all component blocks are rotated by at most 15°. Notice that block rotation directly leads to a significant change of block sizes.
22
Results for Block Rotation
Pr: Block rotation rate
Dr: Block rotation angle
23
Results for Template Set Size
Here, under a general setting of parameters {Pa = 0.2, Pm = 0.2, Ps = 0.2, Ss = 0.2, Pd = 0.5, Sd = 0.5, Pr = 0.5, Dr = 15}, we examine the influence of the template image set size on the matching accuracy. All five template sets are used. For each template set, we independently generate at least 2000 images for testing. The results are listed in Table 5. Even when the template set size grows to 500, the matching accuracy remains satisfactory (> 80%). For the 1350-template set (Set-E), rc is still around 70%.
24
Results for Template Set Size
25
Comparison to Other Algorithms
Failure to detect local features (e.g. line segments) usually leads immediately to poor performance in several other algorithms. In contrast, our experiments demonstrate that even when the block information is partially lost or inaccurate, CBL-MA shows no significant reduction in performance.
26
Computational Complexity
Denote n as the number of blocks in a CBL, m as the number of blocks in a TBL, K as the number of template images in a document image database. When we use quicksort and binary search algorithms, the typical computational complexity is
  O(n log n) for CBL sorting
  O(Km log n) for CBL/TBL size matching
  O(Km(2T_C+1)) for CBL/TBL location matching
  O(K) for CBL/TBL distance calculation
  O(log K) for finding the minimum distance
In total: O((Km+n) log n)
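As a rough check, the listed terms can be summed to recover the stated total. The step below relies on an assumption that the slide does not state explicitly: the location window is small relative to the list length, i.e. 2T_C + 1 = O(log n), with m ≥ 1 and n ≥ 2.

```latex
\begin{align*}
T(n, m, K)
  &= O(n \log n) + O(Km \log n) + O\bigl(Km(2T_C + 1)\bigr) + O(K) + O(\log K) \\
  &= O(n \log n) + O(Km \log n)
     && \text{since } Km(2T_C+1) = O(Km \log n),\ K \le Km \log n,\ \log K \le K \\
  &= O\bigl((Km + n) \log n\bigr).
\end{align*}
```

If T_C instead grew with n, the location-matching term O(Km(2T_C+1)) would have to be kept separately in the total.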
27
Application: Column Table Data Extraction
This page matching algorithm is then refined and applied to automatic data extraction of column forms.
---All fields in the input image differ greatly from each other, so the local-feature based approach cannot work well.
---Our algorithm is a powerful tool for finding the correct image template, which is used to annotate the image data fields and thereby accomplish the data extraction.
28
Conclusions
Based on block list (and tree), our algorithm can effectively make use of the local information of each page block and the global information of page layout.
We present a method for effective document image matching.
The algorithm gives satisfactory performance for various image deformations.
The algorithm is robust to image distortion, filled-in text, and noise.
We report a successful application of our algorithm in column table data auto-reading.