mining historical documents for - phd alumni from the...
TRANSCRIPT
![Page 1: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/1.jpg)
Mining Historical Documents forMining Historical Documents for Near‐Duplicate FiguresNear Duplicate Figures
h ( ) k hThanawin (Art) Rakthanmanon
Qiang ZhuQiang Zhu
Eamonn Keogh
![Page 2: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/2.jpg)
What is a near‐duplicate pattern?What is a near duplicate pattern?Same Species of diatoms in different books
Biddulphia alternans
A Hi t f I f i i l di D idi d Di t 1861 A S i f th B iti h Di t 1853A History of Infusoria, including Desmidiaceae and Diatomaceae, 1861. A Synopsis of the British Diatomaceae, 1853.
2
![Page 3: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/3.jpg)
MotivationMotivation
• There are about 130 million books in the worldThere are about 130 million books in the world (according to Google 2010).
• Many are now digitized• Many are now digitized.
• Finding repeated patterns can ..– allow us to trace the evolution of cultural ideas
– allow us to discover plagiarism
– allow us to combine information
from two different sources
3
![Page 4: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/4.jpg)
Problem StatementProblem Statement
• Given 2 books and user defined parameters (i.e. sizeGiven 2 books and user defined parameters (i.e. size of motifs), find similar pattern/figures between these books in reasonable amount of time.books in reasonable amount of time.
Wh t i “ bl t f ti ”?What is a “reasonable amount of time”?
• It can take minutes to hours for scanning a book.
• We would like to be able to discover similar figures in minutes or tens of minutes.
• This could be done offline (a ‘screensaver’ could work on you personal library at night).work on you personal library at night).
4
![Page 5: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/5.jpg)
ObjectivesObjectives
• We propose an algorithm to discover similar patternsWe propose an algorithm to discover similar patterns inside a manuscript or across 2 books.
• Our scalable method consider only shape so input• Our scalable method consider only shape so input documents can be b/w or color documents.
O th d ill t i t l t d• Our method will return approximately repeated shape patterns in small amount of time.
5
![Page 6: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/6.jpg)
Example Results (1)Example Results (1)
• Two Petroglyph BooksTwo Petroglyph Books
1st book 2nd book
Similar Figures
[1] [2]
[1] Indian Rock Art of Southern California with Selected Petroglyph Catalog, 1975.
( )[2] Südamerikanische Felszeichnungen (South American Petroglyphs), Berlin, 1907.
6
![Page 7: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/7.jpg)
Example Results (2)Example Results (2)• 25 seconds to find motifs across 2 books of size 478 d 252478 pages and 252 pages.
7
![Page 8: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/8.jpg)
Example Results (3)Example Results (3)
• Similar figures from 4 different books are discovered.Similar figures from 4 different books are discovered.Book1 [3]: Scottish Heraldry (243 pages)
Book2 [4]: Peeps at Heraldry (110 pages)
Book3 [5]: British Heraldry (252 pages)
Book4 [6]: English Heraldry (487page)g y ( p g )
8
![Page 9: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/9.jpg)
Example Results (4)Example Results (4)
• Also works well for handwritten documents.
IAM dataset from Research Group on Computer Vision and Artificial Intelligence, University of Bern
9
![Page 10: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/10.jpg)
Overview of Our AlgorithmOverview of Our Algorithm1500x1000 pixel2 300x200 pixel2 15 potential windows
DownSampling
LocatePotentialWindows
HashingHash Signature
Compute all pair distances
10
(GHT‐based distance)
![Page 11: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/11.jpg)
Locating Potential WindowsLocating Potential Windows
• Humans easily locate the figures. How?Humans easily locate the figures. How?
Document contain black
Our observations ...
Document contain black and white pixels.
Count the number of black pixels around a fix pixel.
Locate figures by the peaks.Remove noise by threshold.
50,000 windows reduce to 20 potential windows
11
20 potential windows.
![Page 12: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/12.jpg)
Down SamplingDown Sampling
Overlap only 16%
Overlap 82%Overlap 82%
• Reduce search space• Reduce search space. • Increase the quality of matching.
12
![Page 13: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/13.jpg)
Random ProjectionRandom Projection
• Hashing is an efficient way to reduce the number ofHashing is an efficient way to reduce the number of expensive real distance calculations.
Mask templateMask template
Remove Enough
S SiMask template
Same Signature
Remove Enough
13
![Page 14: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/14.jpg)
GHT‐based Distance CalculationGHT based Distance CalculationGHT = Generalized Hough Transform
Atlatls(1) GHT‐based distance measure
correctly groups all seven pairs.
Anthropomorphs(2) The higher level structure of
the dendrogram also correctly groups similar petroglyphs.
Bighorn Sheep
Figure and Equation from [14] Q Zhu X Wang E Keogh and S H Lee “Augmenting the
14
Figure and Equation from [14] Q. Zhu, X. Wang, E. Keogh and S.H. Lee, Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs,” SIGKDD, 2009
![Page 15: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/15.jpg)
Overview of Our AlgorithmOverview of Our Algorithm1500x1000 pixel2 300x200 pixel2 15 potential windows
DownSampling
LocatePotentialWindows
HashingHash Signature
Compute all pair distances
15
(GHT‐based distance)
![Page 16: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/16.jpg)
Experimental ResultsExperimental Results
• The performance of our algorithm depends on dataset.p g p
• We created artificial “books” to test on.
• Each page of book contains 100 random characters.Each page of book contains 100 random characters.
• Each characters contains 14 segments.
Polynomial Distortion
GaussianNoise
16
![Page 17: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/17.jpg)
Experimental ResultsExperimental Results
• Our algorithm can find similar figures (motifs) g g ( )from 100‐page book in less than a minute.
2.5
3.0x 104
c)
6.9 hours
1.5
2.0
ion
Tim
e (s
ec
Best known algorithm to findexact motif (All Windows)
0.5
1.0
Exec
uti
Exact Motif (Potential Windows)
0
Number of Pages
DocMotif
1 2 4 8 16 32 64 128 256 512
5.5 minutes
17
Number of Pages
![Page 18: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/18.jpg)
ScalabilityScalability
Scalability2000
Effect of DistortionScalability
1000
1500
tion
Tim
e (s
ec) Polynomial
distortion
No distortion
Effect of Distortion
0
500
Exec
ut
1 2 4 8 16 32 64 128 256 512 1024 2048
Sample Motifs
sec) 100
250Effect of Gaussian Noise
Var = 0.20
1 2 4 8 16 32 64 128 256 512 1024 2048
Number of Pages
utio
n T
ime
(
1
10
Var = 0 01Var = 0.05Var = 0.10Var = 0.15
Exe
cu
Number of Pages1 2 4 8 16 32 64 128 256
0No noiseVar = 0.01
18
Number of Pages
![Page 19: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/19.jpg)
How Good of the Results?How Good of the Results?• The average distances from top 20 motifs are not much
d ff d ff h
202530
Mask 50%Mask 40%e
Dis
tanc
eMasking Ratio
different among different parameter choices.
2 4 8 16 32 64 128 256 51205
1015
Mask 40%Mask 30%Mask 20%BruteForceA
vera
ge
A
nce
1015202530
Ave
rage
Dis
tan
HDS=2 (4:1)HDS=3 (9:1)BruteForce
Hash Downsampling
2530
Dis
tanc
e
iteration=5iteration=9
Number of Iterations2 4 8 16 32 64 128 256 512
05
A B
05
101520
Ave
rage
iteration=9iteration=10iteration=11iteration=20BruteForce
C
19
2 4 8 16 32 64 128 256 512Number of pages
![Page 20: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/20.jpg)
Parameter EffectsParameter Effects
400 Number of Iterations600
700
c) Masking Ratio
200
300
400
me
(sec
)
Number of Iterations
11 iterationsC400
500
600
tion
Tim
e (s
ec Masking Ratio
40%
50%A
10 iterations
9 iterations0
100
200
Exe
cutio
n Ti
100
200
300
Exec
ut
20%30%
1 2 4 8 16 32 64 128 256 512
Number of Pages
0
400c) H h D li
01 2 4 8 16 32 64 128 256 512
Number of Pages
200
300
on T
ime
(sec
HDS=3 (9:1)
HDS=2 (4:1)
Hash DownsamplingB
1 2 4 8 16 32 64 128 256 5120
100
Exe
cuti
HDS=1 (1:1)
20
Number of Pages
![Page 21: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/21.jpg)
ConclusionConclusion
• An algorithm to find similar figures across two manuscripts.g g p– Approximation algorithm
– Work pretty well on both figures and text
– Practical: very fast and very similar
• Key Ideas– Locating potential windows
– Down Sampling
– Random Projection– Random Projection
– GHT‐based Distance
• Drawbacks– Not support rotation invariance
– Many parameters but not much sensitive
21
![Page 22: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/22.jpg)
ReferencesReferences1. G. Ramponi, F. Stanco, W. D. Russo, S. Pelusi, and P. Mauro, “Digital automated restoration of manuscripts and
antique printed books,” EVA ‐ Electronic Imaging and the Visual Arts, 2005.q p , g g ,2. J. V. Richardson Jr., “Bookworms: The Most Common Insect Pests of Paper in Archives, Libraries, and Museums.” 3. A. Pritchard, “A history of Infusoria, including Desmidiaceae and Diatomaceae,” British and foreign. Ed. IV. 968.
London, 1861.4. W. Smith, “A synopsis of the British Diatomaceae; with remarks on their structure, function and distribution; and
instructions for collecting and preserving specimens ” vol 1 pp [V]‐XXXIII pp 1‐89 31 pls London: John van Voorstinstructions for collecting and preserving specimens, vol. 1 pp. [V] XXXIII, pp. 1 89, 31 pls. London: John van Voorst, 1853.
5. W. West, G S.. West, “A Monograph of the British Desmidiaceae,” Vols. I–V. Ray Society, London, 1904–1922.6. C. R. Dod, R. P. Dod, “Dod’s Peerage, Baronetage and Knightage of Great Britain and Ireland for 1915,” London:
Simpkin, Marshall, Hamilton, Kent and co. ltd, 1915.7 J B B k “B k f O d f K i hth d d D ti f H f ll N ti ” L d H t d Bl k tt7. J. B. Burke, “Book of Orders of Knighthood and Decorations of Honour of all Nations,” London: Hurst and Blackett,
pp. 46‐47, 1858.8. B. Gatos, I. Pratikakis, and S. J. Perantonis, “An adaptive binarisation technique for low quality historical documents,”
Proc. of Int. Work. on Document Analysis Sys., pp. 102–13.9. E. Kavallieratou and E. Stamatatos, “Adaptive binarization of historical document images,” Proc. 18th International
C f f P R i i 742 745Conf. of Pattern Recognition, pp. 742–745. 10.H. J. Wolfson and I. Rigoutsos, “Geometric Hashing: An Overview,” IEEE Comp’ Science and Engineering, 4(4), pp. 10‐
21, 1997.11.X. Bai, X. Yang, L. J. Latecki, W. Liu, and Z. Tu, “Learning context sensitive shape similarity by graph transduction,”
IEEE TPAMI, 2009.12.E. J. Keogh, L. Wei, X. Xi, M. Vlachos, S. Lee, and P. Protopapas, “Supporting exact indexing of arbitrarily rotated
shapes and periodic time series under Euclidean and warping distance measures,” VLDB J. 18(3), 611‐630, 2009.13.P. V. C. Hough, “Method and mean for recognizing complex pattern,” USA patent 3069654, 1966.
22
![Page 23: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/23.jpg)
ReferencesReferences14. Q. Zhu, X. Wang, E. Keogh, and S. H. Lee, “Augmenting the Generalized Hough Transform to Enable the Mining of
Petroglyphs,” SIGKDD, 2009.15. R. O. Duda and P. E. Hart, “Use of the Hough transform to detect lines and curves in pictures,” Comm. ACM 15(1), pp.11‐15,
1972.16. D. H. Ballard, “Generalizing the Hough transform to detect arbitrary shapes,” Pattern Recognition 13, 1981, pp. 111‐122. 17. M. Tompa and J. Buhler, “Finding motifs using random projections,” In proceedings of the 5th Int. Conference on
Computational Molecular Biology. pp 67‐74, 2001.18. T. Koch‐Grunberg, “Südamerikanische Felszeichnungen” (South American petroglyphs), Berlin, E. Wasmuth A.‐G, 1907.19. A. Fornés, J. Lladós, and G. Sanchez, “Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based
Method. in Graphics Recognition: Recent Advances and New Opportunities,” Lecture Notes in Computer Science, vol. 5046, pp. 51‐60, 2008.
20. G. Sanchez, E. Valveny, J. Llados, J. M. Romeu, and N. Lozano, “A platform to extract knowledge from graphic documents. li ti t hit t l k t h d t di i ” D t A l i S t VI V l 3163 389 400 2004application to an architectural sketch understanding scenario,” Document Analysis Systems VI, Vol. 3163, pp. 389 ‐400, 2004.
21. J. Mas, G. Sanchez, and J. Llados, “An Incremental Parser to Recognize Diagram Symbols and Gestures represented by Adjacency Grammars,” Graphics Recognition: Ten Year Review. Lecture Notes in Computer Science, vol. 3926, pp. 252‐263, 2006.
22. K. B. Schroeder et al., “Haplotypic Background of a Private Allele at High Frequency,” the Americas, Molecular Biology and Evolution 26 (5) pp 995‐1016 2009Evolution, 26 (5), pp. 995 1016, 2009.
23. G. A. Smith, and W. G. Turner, “Indian Rock Art of Southern California with Selected Petroglyph Catalog,” San Bernardino County, Museum Association, 1975.
24. C. Davenport, “British Heraldry,” London Methuen, 1912.25. C. Davenport, “English heraldic book‐stamps, figured and described,” London : Archibald Constable and co. ltd, 1909.26 X Xi E J Keogh L Wei and A Mafra‐Neto “Finding Motifs in a Database of Shapes ” Prof of Siam Conf Data Mining 200726. X. Xi, E. J. Keogh, L. Wei, and A. Mafra‐Neto, Finding Motifs in a Database of Shapes, Prof. of Siam Conf. Data Mining, 2007.
23
![Page 24: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/24.jpg)
24
![Page 25: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten](https://reader033.vdocuments.us/reader033/viewer/2022042214/5eb8fc580677c465d65b3cc1/html5/thumbnails/25.jpg)
25