Discrete Point Based Signatures and Applications to Document Matching
Nemanja Spasojevic, Guillaume Poncin, and Dan Bloomberg
September 15th 2011, Ravenna, Italy
Overview
● Background
● Algorithm
● Applications
  ○ duplicate page detection
  ○ text image lookup
● Conclusion
Background (Duplicate Page Detection)
● Find duplicate pages within a set of scans of the same physical book. Assumptions:
  ○ has to handle a corpus of ~10k text pages at a time
  ○ pages rich in text
  ○ < 4° of rotation from image to image
  ○ some translation
  ○ needs to be quick
  ○ simple to use (discrete signatures, for easy indexing / lookup)
Background (Image Lookup)
● See how well the algorithm performs in image lookup mode, testing how robust it is for a task it was not designed for:
  ○ index clean images
  ○ lookup by image taken with cell phone camera
  ○ skew
  ○ rotation
  ○ blur
  ○ part of original page
Other Approaches
● Image matching is a well-studied problem
  ○ SURF, SIFT, FIT work well at point matching across images and image lookup
  ○ do not work as well for repetitive patterns such as text documents
● Document page matching
  ○ Locally Likely Arrangement Hashing (LLAH), Nakai et al.
    ■ affine invariant
    ■ produces thousands of signatures per page
    ■ precise
    ■ handles 10k image corpus
Algorithm
Possible inputs:
● raw image (operate on word centroids)
● OCR-ed text with word bounding boxes (operate on word bounding box center)
● PDF with word bounding box info (operate on word bounding box center)
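For the OCR and PDF inputs, the point cloud is simply the set of word bounding box centers. A minimal sketch of that step follows; the (left, top, right, bottom) box layout and the name box_centers are assumptions made for illustration, not something the slides specify.

```python
from typing import List, Tuple

# Assumed box layout: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def box_centers(boxes: List[Box]) -> List[Point]:
    """Reduce each word bounding box to its center point."""
    return [((left + right) / 2.0, (top + bottom) / 2.0)
            for left, top, right, bottom in boxes]


# Two words on the same text line.
centers = box_centers([(10, 20, 50, 40), (60, 20, 100, 40)])
print(centers)
```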
Signature Instability
Signature is composed of N sub-signatures: S = [s(0)][s(1)]...[s(N-1)]
Instability of signatures comes from:
● small shifts, which may change a discretized angle value (e.g. s(0) flipping from 13 to 14 due to small word position shifts)
● changes in the order of sub-signatures (e.g. s(0) and s(1) swap because they had almost the same radial distance)
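Both failure modes can be reproduced with a toy model. The sketch below assumes each sub-signature is the angle from an anchor point to a neighbor, quantized into a fixed number of buckets, with neighbors ordered by radial distance. The bucket count of 16 and the names sub_signatures and polar are illustrative assumptions; the slides do not give the exact construction.

```python
import math


def polar(radius, degrees):
    """Helper for the demo: point at a given distance and angle from the origin."""
    rad = math.radians(degrees)
    return (radius * math.cos(rad), radius * math.sin(rad))


def sub_signatures(anchor, neighbors, buckets=16):
    """Quantized angles to neighbors, ordered by distance from the anchor."""
    ax, ay = anchor
    ordered = sorted(neighbors, key=lambda p: math.hypot(p[0] - ax, p[1] - ay))
    sigs = []
    for x, y in ordered:
        angle = math.atan2(y - ay, x - ax) % (2 * math.pi)
        sigs.append(int(angle / (2 * math.pi) * buckets) % buckets)
    return sigs


origin = (0.0, 0.0)
# Baseline: neighbors at 22 and 100 degrees; buckets are 22.5 degrees wide.
stable = sub_signatures(origin, [polar(10.0, 22.0), polar(10.1, 100.0)])
# A one-degree shift pushes the first neighbor across a bucket boundary.
angle_flip = sub_signatures(origin, [polar(10.0, 23.0), polar(10.1, 100.0)])
# A tiny change in distance swaps the order of the two sub-signatures.
order_swap = sub_signatures(origin, [polar(10.2, 22.0), polar(10.1, 100.0)])
print(stable, angle_flip, order_swap)
```

Neighbors that sit close to a bucket boundary, or at nearly equal distances, are the ones that make a signature unstable.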
Superposition of Ambiguous Signatures
[s1][{s2,s'2}][s3][s4] => { [s1][s2][s3][s4], [s1][s'2][s3][s4] }
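The expansion above is a Cartesian product over the alternatives at each position. A minimal sketch, assuming a superposed signature is represented as a list of per-position alternative lists (the representation and the name expand are illustrative assumptions):

```python
from itertools import product


def expand(superposed):
    """Expand a superposed signature into every concrete signature it covers.

    Each position holds a list of alternatives; unambiguous positions
    hold a single-element list.
    """
    return [list(combo) for combo in product(*superposed)]


expanded = expand([["s1"], ["s2", "s'2"], ["s3"], ["s4"]])
for signature in expanded:
    print(signature)
```

The number of concrete signatures is the product of the alternative counts, so it doubles with every two-way ambiguous position.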
Duplicate Page Detection (metrics)
● The similarity based on signature sets is calculated as:
● The similarity based on matched (aligned) word bounding boxes:
Image Lookup Result
1M pages index (32-bit signatures) stats:
● 386M signatures total
● stored as sorted array (<signature, book_id, page_pid, x, y>), fits in ~4GB of memory
● 0.8% of all signatures filtered (those repeated on 1k or more pages)
● each query on average returns 2000 candidates
Index Size [pages]    Accuracy    Signature Size [bits]
4.1k                  0.966       16
4.1k                  0.949       32
1M                    0.871       32
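A sorted-array index of this kind can be sketched with binary search. The code below is a toy illustration, not the authors' implementation: records are (signature, book_id, page_id, x, y) tuples, signatures seen on max_pages or more distinct pages are dropped at build time, and candidates are ranked by how many query signatures hit them. The vote-count ranking and the names build_index and lookup are assumptions; the slides only state that a query returns about 2000 candidates on average.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict


def build_index(records, max_pages=1000):
    """Sort records by signature, dropping signatures seen on too many pages."""
    pages_per_sig = defaultdict(set)
    for sig, book, page, _x, _y in records:
        pages_per_sig[sig].add((book, page))
    kept = [r for r in records if len(pages_per_sig[r[0]]) < max_pages]
    kept.sort()
    return kept


def lookup(index, query_sigs):
    """Rank candidate pages by number of matching query signatures (assumed scoring)."""
    keys = [r[0] for r in index]  # in practice this would be built once
    votes = Counter()
    for sig in query_sigs:
        lo, hi = bisect_left(keys, sig), bisect_right(keys, sig)
        for _sig, book, page, _x, _y in index[lo:hi]:
            votes[(book, page)] += 1
    return votes.most_common()


records = [
    (7, 1, 1, 0, 0), (7, 1, 2, 5, 5), (9, 1, 1, 3, 4),
    (5, 1, 1, 1, 1), (5, 1, 2, 2, 2), (5, 1, 3, 3, 3),  # signature 5 is on 3 pages
]
index = build_index(records, max_pages=3)  # tiny threshold for the demo
ranked = lookup(index, [7, 9, 5])
print(ranked)
```

Keeping the array sorted by signature makes each lookup two binary searches plus a contiguous scan, which suits discrete signatures well.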
Conclusion
● Simple schema for point cloud based discrete signature generation
● Filtering based on signature stability estimate
● Superposing signatures
● Duplicate page detection
● Image lookup by cell phone camera image query (87.1% accuracy for 1M pages indexed)