math spotting in technical documents using handwritten …rlaz/ms_seminar/math spotting.pdfmath...

17
Math Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi Document and Pattern Recognition Lab Rochester Institute of Technology, NY, USA [email protected], [email protected]

Upload: others

Post on 12-Apr-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Math Spotting in Technical DocumentsUsing Handwritten Queries

Li Yu and Richard Zanibbi

Document and Pattern Recognition Lab

Rochester Institute of Technology, NY, USA

[email protected], [email protected]

Page 2: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Math spotting

• OCR (optical character recognition) avoided

• Structure feature & Visual feature

Page 3: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Document image and query image

Page 4: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

X-Y cutting

Original image

X-Y cuttingA vertical cutting

G. Nagy and S. Seth, “Hierarchical representation of optically scanned documents,” Proc. of ICPR, (1984) 347-349.

Page 5: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

X-Y cut and X-Y tree

Page 6: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Sub-tree matching

• What if we can find a matched sub-tree in the page tree?

• What we want?

Speed & Accuracy

• Problems?

Inexact matching

Page 7: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Noise and “Bad Division”

Noise Bad Division

• Avoid noise

• Control the way in which regions are cut

• Rectangles whose size smaller than thresholds will be ignored

Cutting in Query

Cutting in Page

Page 8: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Thresholds

Horizontal Projection

One Line In Document

Width of Peaks

• Dominant height/width of characters

• Ch = Mode(h1,h2,…hn), where hn represents the heights of lines in one page

• Wh = Mode(W1,W2,…Wn ), where Wn represents the widths of blank spaces in one line

• Scaled linearly based on the current region’s height and width

Page 9: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Equivalency Class

• Two trees have same code (equivalence class number) if and only if they are isomorphic

• Bottom-up algorithm with linear time in the size of the trees

A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The design and analysis of computer algorithms. Addison Wesley, Reading, Mass, 1974.

Page 10: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Ranking by Equivalency Class

E_NUM:N i

E_NUM:N i-1

E_NUM:N i-2

E_NUM:N i-3

E_NUM:N i-4

1

Page 11: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Ranking by Equivalency Class

Rank2

Rank3 Rank4

• The query are included in the page

11

Page 12: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Ranking by Equivalency Class

Rank2

Rank3

Rank5

• The query are not included in the page

Rank5 Rank3

8

Rank5

Page 13: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Other Rankings

• Ranking by Number of Nodes:

– Divide the page nodes into bins based on their size.

– Start with the size of the query root.

– Search for the page nodes in decreasing size order.

• Ranking by both equivalence class number and number of nodes:

– Generate the equivalence class number for both query and page.

– Start with the query root and by decreasing order.

– Find all the exact sub-matches in the page tree.

Page 14: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Visual Feature

9

1

2)(tan

i i

ii

P

PQceDis

Q1 Q2 Q3

Q4 Q6Q5

Q7 Q9Q8

• Dividing the region into nine sub-regions and computing sum of pixel intensity respectively

• Ranking the candidates by decreasing visual similarity

Where Qi and Pi represents the sum

of pixel intensity in the sub-region in

query and candidate respectively

Page 15: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Problems and Future work

• The situation where the target is “scattered” in the page.

• q03vp03.htm

Page 16: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Problems and Future work

• Different Rankings

• More visual features && comparison

• Document image indexing

Page 17: Math Spotting in Technical Documents Using Handwritten …rlaz/ms_seminar/math spotting.pdfMath Spotting in Technical Documents Using Handwritten Queries Li Yu and Richard Zanibbi

Thanks

Question?