
Discrete Point Based Signatures and Applications to Document Matching

Nemanja Spasojevic, Guillaume Poncin, and Dan Bloomberg

September 15th 2011, Ravenna, Italy

Overview

● Background
● Algorithm
● Applications
  ○ duplicate page detection
  ○ text image lookup
● Conclusion

Background (Duplicate Page Detection)

● Find duplicate pages within a set of scans of the same physical book. Assumptions:
  ○ has to handle a corpus of ~10k text pages at a time
  ○ pages are rich in text
  ○ < 4° of rotation from image to image
  ○ some translation
  ○ needs to be quick
  ○ simple to use (discrete signatures, for easy indexing / lookup)

Background (example)

Background (Image Lookup)

● See how well we perform in image lookup mode. Test how robust the algorithm is for something it was not designed for:
  ○ index clean images
  ○ look up by image taken with a cell phone camera
  ○ skew
  ○ rotation
  ○ blur
  ○ part of original page

Other Approaches

● Image matching is a well-studied problem
  ○ SURF, SIFT, FIT work well at point matching across images and image lookup
  ○ do not work as well for repetitive patterns such as text documents
● Document page matching
  ○ Locally Likely Arrangement Hashing (LLAH), Nakai et al.
    ■ affine invariant
    ■ produces thousands of signatures per page
    ■ precise
    ■ handles 10k image corpus

Algorithm

Possible inputs:
● raw image (operate on word centroids)
● OCR-ed text with word bounding boxes (operate on word bounding box center)
● PDF with word bounding box info (operate on word bounding box center)
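For the OCR and PDF inputs, the point cloud is just the set of word bounding box centers. A minimal sketch, assuming boxes are given as (left, top, right, bottom) tuples (the box layout and the function name are illustrative, not taken from the slides):

```python
from typing import List, Tuple

# Assumed box layout: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def box_centers(boxes: List[Box]) -> List[Point]:
    """Return the center point of each word bounding box."""
    return [((left + right) / 2.0, (top + bottom) / 2.0)
            for left, top, right, bottom in boxes]


words = [(0, 0, 10, 4), (12, 0, 30, 4)]
centers = box_centers(words)  # one (x, y) point per word
```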

Algorithm (Image Processing)

Signature Generation Algorithm

Signature Instability

Signature is composed of N sub-signatures: S = [s(0)][s(1)]...[s(N-1)]

Instability of signatures comes from:
● small shifts may lead to changes in the discretized angle value (e.g. s(0) flipping from 13 to 14 due to small word position shifts)
● order of sub-signatures may change (s(0) and s(1) swap as they had almost the same radial distance)
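Both effects can be reproduced with a toy model. The sketch below is not the generation algorithm from the slides (that slide is a figure and is not reproduced in this text); it only assumes what the bullets above state, namely that sub-signatures are discretized angles ordered by radial distance. The 16 angle bins, the 4-bit packing, and the function names are illustrative assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def sub_signatures(anchor: Point, neighbors: List[Point], bins: int = 16) -> List[int]:
    """Discretized neighbor angles, ordered by radial distance from the anchor."""
    ax, ay = anchor
    ordered = sorted(neighbors, key=lambda p: math.hypot(p[0] - ax, p[1] - ay))
    subs = []
    for x, y in ordered:
        angle = math.atan2(y - ay, x - ax) % (2 * math.pi)  # map to [0, 2*pi)
        subs.append(int(angle / (2 * math.pi) * bins) % bins)
    return subs


def pack(subs: List[int], bits_per_sub: int = 4) -> int:
    """Concatenate sub-signatures into one integer: S = [s(0)][s(1)]..."""
    sig = 0
    for s in subs:
        sig = (sig << bits_per_sub) | s
    return sig


anchor = (0.0, 0.0)
# A neighbor near a bin boundary (22.5 degrees): a 0.1 shift in y changes its bin.
before = sub_signatures(anchor, [(1.0, 5.0), (10.0, 4.1)])
after = sub_signatures(anchor, [(1.0, 5.0), (10.0, 4.2)])
# Two neighbors at almost the same radius: a small shift swaps their order.
order_a = sub_signatures(anchor, [(10.0, 0.5), (0.5, 10.1)])
order_b = sub_signatures(anchor, [(10.1, 0.5), (0.5, 10.0)])
```

In both pairs the underlying layout barely moves, yet the packed signature differs, which is why the later slides filter risky signatures or superpose the ambiguous alternatives.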

Signature Filtering Based on Estimated Risk

Superposition of Ambiguous Signature

[s1][{s2,s'2}][s3][s4] => { [s1][s2][s3][s4], [s1][s'2][s3][s4] }
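The expansion above is a Cartesian product over the alternatives at each position. A minimal sketch, assuming each position is represented as a list of candidate sub-signature values (the representation and the name `expand` are illustrative):

```python
from itertools import product
from typing import List, Sequence, Tuple


def expand(ambiguous: Sequence[Sequence[int]]) -> List[Tuple[int, ...]]:
    """Enumerate every concrete signature encoded by per-position alternatives."""
    return list(product(*ambiguous))


# [s1][{s2, s'2}][s3][s4] with s1=1, s2=2, s'2=20, s3=3, s4=4
signatures = expand([[1], [2, 20], [3], [4]])
# -> [(1, 2, 3, 4), (1, 20, 3, 4)]
```

The number of emitted signatures is the product of the alternative counts, so k ambiguous positions with two candidates each yield 2^k signatures.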

Duplicate Page Detection (metrics)

● The similarity based on signature sets is calculated as:

● The similarity based on matched (aligned) word bounding boxes:
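The formulas for these two metrics are not reproduced in this text. Assuming the J in Js stands for a Jaccard index (an assumption based only on the notation), the signature-set similarity could be sketched as follows; Jb would apply the same ratio to matched word bounding boxes, which requires an alignment step not shown here.

```python
from typing import Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


page_a = {101, 102, 103}  # hypothetical signature sets
page_b = {102, 103, 104}
js = jaccard(page_a, page_b)  # 2 shared out of 4 distinct -> 0.5
```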

Duplicate Page Detection (example)

Js = 19% Jb = 93%

Duplicate Page Detection (example)

Js = 5% Jb = 37%

Image Lookup

Image Lookup Examples

Image Lookup Examples

Image Lookup Result

1M pages index (32-bit signatures) stats:
● 386M signatures total
● stored as sorted array (<signature, book_id, page_pid, x, y>), fits in ~4GB of memory
● 0.8% of all signatures filtered (those repeated on 1k or more pages)
● each query on average returns 2000 candidates
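A sorted array supports lookup by binary search without a separate hash table. The sketch below keeps the record layout from the slide but uses plain Python tuples; ranking candidate pages by vote count is an assumption for illustration, and the function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from typing import List, Tuple

# Record layout from the slide: (signature, book_id, page_pid, x, y)
Entry = Tuple[int, int, int, int, int]


def build_index(entries: List[Entry]) -> List[Entry]:
    """Sort records so that equal signatures form one contiguous run."""
    return sorted(entries)


def lookup(index: List[Entry], signature: int) -> List[Entry]:
    """Return all records with the given signature via two binary searches."""
    lo = bisect_left(index, (signature,))      # first record >= signature
    hi = bisect_left(index, (signature + 1,))  # first record >= signature + 1
    return index[lo:hi]


def vote(index: List[Entry], query_signatures: List[int]):
    """Count signature hits per (book_id, page_pid); assumed ranking scheme."""
    votes = Counter()
    for sig in query_signatures:
        for _, book_id, page_pid, _, _ in lookup(index, sig):
            votes[(book_id, page_pid)] += 1
    return votes.most_common()


index = build_index([(7, 1, 1, 0, 0), (7, 1, 2, 5, 5), (9, 1, 2, 1, 1), (3, 2, 1, 0, 0)])
ranked = vote(index, [7, 9])  # page (1, 2) matches both query signatures
```

At 386M records in ~4GB, each record takes roughly 10 to 11 bytes, so a real index would use a packed binary layout rather than Python tuples.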

Index Size [pages]   Accuracy   Signature Size [bits]
4.1k                 0.966      16
4.1k                 0.949      32
1M                   0.871      32

Conclusion

● Simple scheme for point cloud based discrete signature generation
● Filtering based on signature stability estimate
● Superposing signatures
● Duplicate page detection
● Image lookup by cell phone camera image query (87.1% accuracy for 1M pages indexed)

Q & A

Thank You!

Synthetic Data Evaluation

Synthetic Data Evaluation
