mining historical documents for - phd alumni from the...

25
Mining Historical Documents for Mining Historical Documents for NearDuplicate Figures Near Duplicate Figures h ( ) kh Thanawin (Art) Rakthanmanon Qiang Zhu Qiang Zhu Eamonn Keogh

Upload: others

Post on 11-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Mining Historical Documents forMining Historical Documents for Near‐Duplicate FiguresNear Duplicate Figures

h ( ) k hThanawin (Art) Rakthanmanon

Qiang ZhuQiang Zhu

Eamonn Keogh

Page 2: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

What is a near‐duplicate pattern?What is a near duplicate pattern?Same Species of diatoms in different books

Biddulphia alternans

A Hi t f I f i i l di D idi d Di t 1861 A S i f th B iti h Di t 1853A History of Infusoria, including Desmidiaceae and Diatomaceae, 1861. A Synopsis of the British Diatomaceae, 1853.

2

Page 3: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

MotivationMotivation

• There are about 130 million books in the worldThere are about 130 million books in the world (according to Google 2010).

• Many are now digitized• Many are now digitized.

• Finding repeated patterns can ..– allow us to trace the evolution of cultural ideas

– allow us to discover plagiarism

– allow us to combine information

from two different sources

3

Page 4: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Problem StatementProblem Statement

• Given 2 books and user defined parameters (i.e. sizeGiven 2 books and user defined parameters (i.e. size of motifs), find similar pattern/figures between these books in reasonable amount of time.books in reasonable amount of time.

Wh t i “ bl t f ti ”?What is a “reasonable amount of time”?

• It can take minutes to hours for scanning a book. 

• We would like to be able to discover similar figures in minutes or tens of minutes.

• This could be done offline (a ‘screensaver’ could work on you personal library at night).work on you personal library at night).

4

Page 5: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

ObjectivesObjectives

• We propose an algorithm to discover similar patternsWe propose an algorithm to discover similar patterns inside a manuscript or across 2 books.

• Our scalable method consider only shape so input• Our scalable method consider only shape so input documents can be b/w or color documents.

O th d ill t i t l t d• Our method will return approximately repeated shape patterns in small amount of time.

5

Page 6: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Example Results (1)Example Results (1)

• Two Petroglyph BooksTwo Petroglyph Books

1st book 2nd book

Similar Figures

[1] [2]

[1] Indian Rock Art of Southern California with Selected Petroglyph Catalog, 1975.

( )[2] Südamerikanische Felszeichnungen (South American Petroglyphs), Berlin, 1907.

6

Page 7: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Example Results (2)Example Results (2)• 25 seconds to find motifs across 2 books of  size 478 d 252478 pages and 252 pages.

7

Page 8: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Example Results (3)Example Results (3)

• Similar figures from 4 different books are discovered.Similar figures from 4 different books are discovered.Book1 [3]: Scottish Heraldry (243 pages)

Book2 [4]: Peeps at Heraldry (110 pages)

Book3 [5]: British Heraldry (252 pages)

Book4 [6]: English Heraldry (487page)g y ( p g )

8

Page 9: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Example Results (4)Example Results (4)

• Also works well for handwritten documents. 

IAM dataset from Research Group on Computer Vision and Artificial Intelligence, University of Bern

9

Page 10: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Overview of Our AlgorithmOverview of Our Algorithm1500x1000 pixel2 300x200 pixel2 15 potential windows

DownSampling

LocatePotentialWindows

HashingHash Signature

Compute all pair distances

10

(GHT‐based distance)

Page 11: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Locating Potential WindowsLocating Potential Windows

• Humans easily locate the figures. How?Humans easily locate the figures. How? 

Document contain black

Our observations ...

Document contain black and white pixels.

Count the number of black pixels around a fix pixel.

Locate figures by the peaks.Remove noise by threshold.

50,000 windows reduce to 20 potential windows

11

20 potential windows.

Page 12: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Down SamplingDown Sampling

Overlap only 16%

Overlap 82%Overlap 82%

• Reduce search space• Reduce search space. • Increase the quality of matching.

12

Page 13: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Random ProjectionRandom Projection

• Hashing is an efficient way to reduce the number ofHashing is an efficient way to reduce the number of expensive real distance calculations.

Mask templateMask template

Remove Enough

S SiMask template

Same Signature

Remove Enough

13

Page 14: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

GHT‐based Distance CalculationGHT based Distance CalculationGHT = Generalized Hough Transform 

Atlatls(1) GHT‐based distance measure 

correctly groups all seven pairs.

Anthropomorphs(2) The higher level structure of 

the dendrogram also correctly groups similar petroglyphs.

Bighorn Sheep

Figure and Equation from [14] Q Zhu X Wang E Keogh and S H Lee “Augmenting the

14

Figure and Equation from [14] Q. Zhu, X. Wang, E. Keogh and S.H. Lee, Augmenting the Generalized Hough Transform to Enable the Mining of Petroglyphs,” SIGKDD, 2009

Page 15: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Overview of Our AlgorithmOverview of Our Algorithm1500x1000 pixel2 300x200 pixel2 15 potential windows

DownSampling

LocatePotentialWindows

HashingHash Signature

Compute all pair distances

15

(GHT‐based distance)

Page 16: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Experimental ResultsExperimental Results

• The performance of our algorithm depends on dataset.p g p

• We created artificial “books” to test on.

• Each page of book contains 100 random characters.Each page of book contains 100 random characters.

• Each characters contains 14 segments.

Polynomial Distortion 

GaussianNoise

16

Page 17: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Experimental ResultsExperimental Results

• Our algorithm can find similar figures (motifs) g g ( )from 100‐page book in less than a minute.

2.5

3.0x 104

c)

6.9 hours

1.5

2.0

ion

Tim

e (s

ec

Best known algorithm to findexact motif (All Windows)

0.5

1.0

Exec

uti

Exact Motif (Potential Windows)

0

Number of Pages

DocMotif

1 2 4 8 16 32 64 128 256 512

5.5 minutes

17

Number of Pages

Page 18: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

ScalabilityScalability

Scalability2000

Effect of DistortionScalability

1000

1500

tion

Tim

e (s

ec) Polynomial

distortion

No distortion

Effect of Distortion

0

500

Exec

ut

1 2 4 8 16 32 64 128 256 512 1024 2048

Sample Motifs

sec) 100

250Effect of Gaussian Noise

Var = 0.20

1 2 4 8 16 32 64 128 256 512 1024 2048

Number of Pages

utio

n T

ime

(

1

10

Var = 0 01Var = 0.05Var = 0.10Var = 0.15

Exe

cu

Number of Pages1 2 4 8 16 32 64 128 256

0No noiseVar = 0.01

18

Number of Pages

Page 19: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

How Good of the Results?How Good of the Results?• The average distances from top 20 motifs are not much 

d ff d ff h

202530

Mask 50%Mask 40%e

Dis

tanc

eMasking Ratio

different among different parameter choices.

2 4 8 16 32 64 128 256 51205

1015

Mask 40%Mask 30%Mask 20%BruteForceA

vera

ge

A

nce

1015202530

Ave

rage

Dis

tan

HDS=2 (4:1)HDS=3 (9:1)BruteForce

Hash Downsampling

2530

Dis

tanc

e

iteration=5iteration=9

Number of Iterations2 4 8 16 32 64 128 256 512

05

A B

05

101520

Ave

rage

iteration=9iteration=10iteration=11iteration=20BruteForce

C

19

2 4 8 16 32 64 128 256 512Number of pages

Page 20: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

Parameter EffectsParameter Effects

400 Number of Iterations600

700

c) Masking Ratio

200

300

400

me

(sec

)

Number of Iterations

11 iterationsC400

500

600

tion

Tim

e (s

ec Masking Ratio

40%

50%A

10 iterations

9 iterations0

100

200

Exe

cutio

n Ti

100

200

300

Exec

ut

20%30%

1 2 4 8 16 32 64 128 256 512

Number of Pages

0

400c) H h D li

01 2 4 8 16 32 64 128 256 512

Number of Pages

200

300

on T

ime

(sec

HDS=3 (9:1)

HDS=2 (4:1)

Hash DownsamplingB

1 2 4 8 16 32 64 128 256 5120

100

Exe

cuti

HDS=1 (1:1)

20

Number of Pages

Page 21: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

ConclusionConclusion

• An algorithm to find similar figures across two manuscripts.g g p– Approximation algorithm

– Work pretty well on both figures and text

– Practical: very fast and very similar

• Key Ideas– Locating potential windows

– Down Sampling

– Random Projection– Random Projection

– GHT‐based Distance

• Drawbacks– Not support rotation invariance

– Many parameters but not much sensitive

21

Page 22: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

ReferencesReferences1. G. Ramponi, F. Stanco, W. D. Russo, S. Pelusi, and P. Mauro, “Digital automated restoration of manuscripts and 

antique printed books,” EVA ‐ Electronic Imaging and the Visual Arts, 2005.q p , g g ,2. J. V. Richardson Jr., “Bookworms: The Most Common Insect Pests of Paper in Archives, Libraries, and Museums.” 3. A. Pritchard, “A history of Infusoria, including Desmidiaceae and Diatomaceae,” British and foreign. Ed. IV. 968. 

London, 1861.4. W. Smith, “A synopsis of the British Diatomaceae; with remarks on their structure, function and distribution; and 

instructions for collecting and preserving specimens ” vol 1 pp [V]‐XXXIII pp 1‐89 31 pls London: John van Voorstinstructions for collecting and preserving specimens,  vol. 1 pp. [V] XXXIII, pp. 1 89, 31 pls. London: John van Voorst, 1853.

5. W. West, G S.. West, “A Monograph of the British Desmidiaceae,” Vols. I–V. Ray Society, London, 1904–1922.6. C. R. Dod, R. P. Dod, “Dod’s Peerage, Baronetage and Knightage of Great Britain and Ireland for 1915,” London: 

Simpkin, Marshall, Hamilton, Kent and co. ltd, 1915.7 J B B k “B k f O d f K i hth d d D ti f H f ll N ti ” L d H t d Bl k tt7. J. B. Burke, “Book of Orders of Knighthood and Decorations of Honour of all Nations,” London: Hurst and Blackett, 

pp. 46‐47, 1858.8. B. Gatos, I. Pratikakis, and S. J. Perantonis, “An adaptive binarisation technique for low quality historical documents,” 

Proc. of Int. Work. on Document Analysis Sys., pp. 102–13.9. E. Kavallieratou and E. Stamatatos, “Adaptive binarization of historical document images,” Proc. 18th International 

C f f P R i i 742 745Conf. of Pattern Recognition, pp. 742–745. 10.H. J. Wolfson and I. Rigoutsos, “Geometric Hashing: An Overview,” IEEE Comp’ Science and Engineering, 4(4), pp. 10‐

21, 1997.11.X. Bai, X. Yang, L. J. Latecki, W. Liu, and Z. Tu, “Learning context sensitive shape similarity by graph transduction,” 

IEEE TPAMI, 2009.12.E. J. Keogh, L. Wei, X. Xi, M. Vlachos, S. Lee, and P.  Protopapas, “Supporting exact indexing of arbitrarily rotated 

shapes and periodic time series under Euclidean and warping distance measures,” VLDB J. 18(3), 611‐630, 2009.13.P. V. C. Hough, “Method and mean for recognizing complex pattern,” USA patent 3069654, 1966.

22

Page 23: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

ReferencesReferences14. Q. Zhu, X. Wang, E. Keogh, and S. H. Lee, “Augmenting the Generalized Hough Transform to Enable the Mining of 

Petroglyphs,” SIGKDD, 2009.15. R. O. Duda and P. E. Hart, “Use of the Hough transform to detect lines and curves in pictures,” Comm. ACM 15(1), pp.11‐15, 

1972.16. D. H. Ballard, “Generalizing the Hough transform to detect arbitrary shapes,” Pattern Recognition 13, 1981, pp. 111‐122. 17. M. Tompa and J. Buhler, “Finding motifs using random projections,” In proceedings of the 5th Int. Conference on 

Computational Molecular Biology. pp 67‐74, 2001.18. T. Koch‐Grunberg, “Südamerikanische Felszeichnungen” (South American petroglyphs), Berlin, E. Wasmuth A.‐G, 1907.19. A. Fornés, J. Lladós, and G. Sanchez, “Old Handwritten Musical Symbol Classification by a Dynamic Time Warping Based 

Method. in Graphics Recognition: Recent Advances and New Opportunities,” Lecture Notes in Computer Science, vol. 5046, pp. 51‐60, 2008.

20. G. Sanchez, E. Valveny, J. Llados, J. M. Romeu, and  N. Lozano, “A platform to extract knowledge from graphic documents. li ti t hit t l k t h d t di i ” D t A l i S t VI V l 3163 389 400 2004application to an architectural sketch understanding scenario,” Document Analysis Systems VI, Vol. 3163, pp. 389 ‐400, 2004.

21. J. Mas, G. Sanchez, and J. Llados, “An Incremental Parser to Recognize Diagram Symbols and Gestures represented by Adjacency Grammars,” Graphics Recognition: Ten Year Review. Lecture Notes in Computer Science, vol. 3926, pp. 252‐263, 2006.

22. K. B. Schroeder et al., “Haplotypic Background of a Private Allele at High Frequency,” the Americas, Molecular Biology and Evolution 26 (5) pp 995‐1016 2009Evolution, 26 (5), pp. 995 1016, 2009.

23. G. A. Smith, and W. G. Turner, “Indian Rock Art of Southern California with Selected Petroglyph Catalog,” San Bernardino County, Museum Association, 1975. 

24. C.  Davenport, “British Heraldry,” London Methuen, 1912.25. C. Davenport, “English heraldic book‐stamps, figured and described,” London : Archibald Constable and co. ltd, 1909.26 X Xi E J Keogh L Wei and A Mafra‐Neto “Finding Motifs in a Database of Shapes ” Prof of Siam Conf Data Mining 200726. X. Xi, E. J. Keogh, L. Wei, and A. Mafra‐Neto,  Finding Motifs in a Database of Shapes,  Prof. of Siam Conf. Data Mining, 2007.

23

Page 24: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

24

Page 25: Mining Historical Documents for - PhD Alumni from The ...alumni.cs.ucr.edu/~Rakthant/DocMotif/Papers/ICDM11_DocMotif_slide.pdfFornés, J. Lladós, and G. Sanchez, “Old Handwritten

25