d01 choueka dershowitz_word_spotting_algorithm

Querying a Large Corpus of Historical Handwritten Manuscripts Using Word-Spotting Algorithms

Yaacov Choueka, Adiel ben-Shalom
The Friedberg Genizah Project

Nachum Dershowitz, Lior Wolf, Adi Silberfenig
School of Computer Science, Tel Aviv University

Minerva 2015, Jerusalem

The Problem: find all occurrences of a given query word in all the manuscripts of the corpus (arbitrary language, arbitrary script)

Example: The Cairo Genizah Corpus

360,000 fragments
Hebrew characters, Hebrew and Arabic languages
The query: בראשית (Hebrew: "In the beginning")

Simple Solution: full-text search

KWIC Output
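A KWIC (keyword in context) listing can be sketched in a few lines once the text exists in electronic form. The function name `kwic`, the whitespace tokenization, and the English sample sentence are assumptions made for illustration; they are not taken from the Genizah software.

```python
def kwic(text, query, width=3):
    """Return one (left context, keyword, right context) tuple per exact match."""
    tokens = text.split()  # naive whitespace tokenization (an assumption)
    hits = []
    for i, token in enumerate(tokens):
        if token == query:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


sample = "in the beginning God created the heaven and the earth"
for left, keyword, right in kwic(sample, "the", width=2):
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>15} [{keyword}] {right}")
```

The sketch only works on transcribed text, which is exactly the limitation the next slide raises.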

The catch:

The software can search only manuscripts that have been transcribed into electronic form!
Usually, however, most of the manuscripts are never transcribed!

In the Genizah case: 480,000 images are available, but only 40,000 (8%) have been transcribed!

OCR does not work well for handwritten historical documents

אהבתי כי ישמע יהוה את

קולי תחנוני כי הטה אוזנו לי ובימי אקרא אפפוני חבלי

מות ומצרי שאול מצאוני צרה

ויגון אמצא ובשם יהוה אקרא אנה יהוה מלטה

נפשי חנון יהוה וצדיק ואלוהינו

מרחם שומר פתאים יהוה דלותי ולי יהושיע שובי נפשי למנוחיכי כי יהוה גמל עליכי

כי חלצת נפשי ממות את עיני

מדמעה את רגלי מדחי אתהלך לפני יהוה בארצות החיים האמנתי כי אדבר אני

OCR transcription:

אדזבעיכישעידודארוליעחנוניכידסראזנויוביסיארראאוניחבלישתומצרישאולצאוניצדוגוןאמצאובשםידוארראאנאידודלטכשינוןידודוצדידואדינוסרחסשוערתאיסיזוזדלייייליידושיעשובינשילסנוחיכיכיידודגמלעיכיכיחלצתנשיממועאעעיניסדסעדאערגליאעדלךלניידודבארדחייפדאסנעיכיאדבראני

Search for the image of the query word

(and not for its text)

The word-spotting approach:

Given one (or more) image(s) of a query word, find all occurrences of similar images in the corpus of manuscript images

Word-spotting

Query: [image of the query word]

1. Binarization
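The slides do not say which binarization method is used. As a minimal sketch, assuming a grayscale page stored as rows of 0-255 integers and assuming Otsu's global threshold (a common choice, swapped in here for illustration), ink pixels can be separated from the background like this. The names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is at or below t
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map dark (ink) pixels to 1 and light (background) pixels to 0."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in image]


page = [
    [230, 228, 40, 235],
    [225, 35, 30, 232],
    [231, 229, 45, 227],
]
binary = binarize(page)
```

Degraded historical pages often have uneven backgrounds, so a local (adaptive) threshold may behave better than a single global one; that choice is outside what the slides state.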

2. Extracting Word-Candidates (“Patches”) From a Manuscript’s Image
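How candidates are extracted is not described on the slide. One simple way to propose candidate regions, shown here purely as an assumption, is to take the bounding boxes of connected ink regions in the binarized image; a real system would likely merge or filter such boxes into word-sized patches. The function name `connected_boxes` is hypothetical.

```python
from collections import deque


def connected_boxes(binary):
    """Return (top, left, bottom, right) boxes of 4-connected ink regions."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from the first unseen ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


ink = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]
print(connected_boxes(ink))  # [(0, 0, 1, 1), (1, 3, 2, 4)]
```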

3. Patch Normalization

Normalizing every patch into a standard grid of 8960 pixels (20*7 cells of 8*8 pixels each)
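The grid arithmetic can be checked with a small resizing sketch. It assumes the grid is 20 cells wide and 7 cells high (the slide gives only "20*7"), which yields a 160 x 56 pixel patch, and it uses nearest-neighbour sampling as a stand-in for whatever interpolation the authors use. `normalize_patch` is a hypothetical name.

```python
CELL = 8                          # each cell is 8 x 8 pixels (from the slide)
CELLS_WIDE, CELLS_HIGH = 20, 7    # orientation of the 20*7 grid is an assumption
WIDTH = CELLS_WIDE * CELL         # 160 pixels
HEIGHT = CELLS_HIGH * CELL        # 56 pixels


def normalize_patch(patch, width=WIDTH, height=HEIGHT):
    """Resize a patch (list of pixel rows) to height x width, nearest neighbour."""
    src_h, src_w = len(patch), len(patch[0])
    return [
        [patch[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


tiny = [[0, 1],
        [1, 0]]
normalized = normalize_patch(tiny)
print(len(normalized) * len(normalized[0]))  # 8960
```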

4. Image descriptors for every patch

Constructing, for every patch, an image-descriptor vector of 12,460 real numbers:

140 cells * (31 + 58) = 12,460

(31 features of the HOG vector, 58 features of the LBP vector, per cell)
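The descriptor length follows directly from the counts on the slide: 140 cells, each contributing 31 HOG values and 58 LBP values. The sketch below only illustrates the concatenation and the resulting length; the per-cell features are zero-filled placeholders, and the cell-by-cell ordering is an assumption. `patch_descriptor` is a hypothetical name.

```python
CELLS = 20 * 7    # 140 cells per normalized patch
HOG_DIMS = 31     # HOG features per cell (from the slide)
LBP_DIMS = 58     # LBP features per cell (from the slide)


def patch_descriptor(hog_cells, lbp_cells):
    """Concatenate per-cell HOG and LBP features into one flat vector."""
    if len(hog_cells) != CELLS or len(lbp_cells) != CELLS:
        raise ValueError("expected one feature list per cell")
    descriptor = []
    for hog, lbp in zip(hog_cells, lbp_cells):
        if len(hog) != HOG_DIMS or len(lbp) != LBP_DIMS:
            raise ValueError("unexpected per-cell feature length")
        descriptor.extend(hog)
        descriptor.extend(lbp)
    return descriptor


# Placeholder features; a real system would compute HOG and LBP per cell.
hog_cells = [[0.0] * HOG_DIMS for _ in range(CELLS)]
lbp_cells = [[0.0] * LBP_DIMS for _ in range(CELLS)]
print(len(patch_descriptor(hog_cells, lbp_cells)))  # 12460
```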

5. Dimension Reduction

PCA (Principal Component Analysis)

[Figure: the 12,460-value descriptors of Patch 1, Patch 2, Patch 3, ..., Patch M are reduced by PCA to shorter vectors, one per patch]

M = total number of patches in all images of the corpus
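To make the PCA step concrete, here is a small dependency-free sketch. It finds principal components by power iteration with deflation, which is a technique chosen here for brevity and not something the slides specify; for 12,460-dimensional data a numerical library would be the practical choice. All function names are hypothetical, and the fixed starting vector would fail in the unlikely case that it is orthogonal to the leading component.

```python
import math


def center(rows):
    """Subtract the per-dimension mean from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(d)]
    return mean, [[row[j] - mean[j] for j in range(d)] for row in rows]


def top_component(centered, iterations=200):
    """Leading principal direction of centered data, by power iteration."""
    d = len(centered[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # w = X^T (X v), computed without forming the d x d covariance matrix.
        proj = [sum(a * b for a, b in zip(row, v)) for row in centered]
        w = [sum(p * row[j] for p, row in zip(proj, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left in the data
        v = [x / norm for x in w]
    return v


def pca(rows, k):
    """Return the mean and the first k principal components."""
    mean, centered = center(rows)
    components = []
    for _ in range(k):
        v = top_component(centered)
        components.append(v)
        # Deflation: remove the part of each row that lies along v.
        deflated = []
        for row in centered:
            p = sum(a * b for a, b in zip(row, v))
            deflated.append([x - p * vj for x, vj in zip(row, v)])
        centered = deflated
    return mean, components


def project(vector, mean, components):
    """Reduce one descriptor to len(components) coordinates."""
    shifted = [x - m for x, m in zip(vector, mean)]
    return [sum(c * x for c, x in zip(comp, shifted)) for comp in components]


data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # points on y = 2x
mean, components = pca(data, k=1)
reduced = project([4.0, 8.0], mean, components)
```

The same `mean` and `components` must be applied to the query descriptor, so that the query and the corpus patches live in the same reduced space.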

6. Similarity Computation

Computing an efficient similarity measure between the reduced query vector and the reduced vectors of all patches of all images in the corpus

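[Figure: the query patch vector (axis labeled 1 to 1000) is compared against the dataset (Patch 1, Patch 2, Patch 3, ..., Patch M). The result is a vector of M values, where entry i is the similarity of the query patch to patch number i]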
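The slides do not name the similarity measure. As a sketch, assuming cosine similarity between reduced vectors, the result vector with one score per patch can be computed as follows. `cosine_similarity` and `similarities` are hypothetical names, and the two-dimensional vectors are toy data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarities(query, patches):
    """Entry i is the similarity of the query to patch number i."""
    return [cosine_similarity(query, patch) for patch in patches]


patches = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
scores = similarities([1.0, 0.0], patches)
```

If all vectors are normalized to unit length ahead of time, the whole computation becomes a single matrix-vector product, which is one common way to make it efficient.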

7. Result

Sort the results by decreasing similarity and display the patches with the best similarity to the query
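Ranking is a sort on the similarity scores that keeps track of which patch each score belongs to. The sketch below uses made-up scores, and `top_matches` is a hypothetical name.

```python
def top_matches(scores, k):
    """Return the k best (patch index, similarity) pairs, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


example_scores = [0.12, 0.91, 0.47, 0.88]
print(top_matches(example_scores, 2))  # [(1, 0.91), (3, 0.88)]
```

When M is very large and only a handful of results are displayed, `heapq.nlargest` from the standard library avoids sorting the full list.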

Two Tests

1. George Washington – Handwritten
2. Lord Byron – Printed

20 pages, about 5000 words each

                          George Washington    Lord Byron
                          (handwritten)        (printed)
Precision                 50%                  91%
Single query              0.08 sec             0.03 sec
Pre-processing per page   46 sec               3 sec
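The slides report precision without defining it. Assuming the standard retrieval definition (the fraction of returned patches that really are occurrences of the query word), it can be computed as below. The patch identifiers are invented for the example and do not come from the tests above.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant; 0.0 if nothing retrieved."""
    if not retrieved:
        return 0.0
    hits = sum(1 for item in retrieved if item in relevant)
    return hits / len(retrieved)


returned = ["p1", "p2", "p3", "p4"]   # hypothetical patches shown to the user
true_matches = {"p1", "p3", "p9"}     # hypothetical ground truth
print(precision(returned, true_matches))  # 0.5
```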

Current Problems

1. Efficiently building (off-line, in terms of space and time) compact image-descriptors for all patches from all (half-a-million) images.

2. Building an efficient (on-line) system for comparing the query vector to all (100 million?) patches’ vectors
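A rough storage estimate shows why compact descriptors matter. The sketch assumes 4 bytes per value (32-bit floats) and takes the slide's tentative figure of 100 million patches at face value; the reduced length of 1,000 values is a hypothetical number used only for comparison.

```python
def storage_bytes(num_patches, dims, bytes_per_value=4):
    """Bytes needed to store one dense vector of `dims` values per patch."""
    return num_patches * dims * bytes_per_value


PATCHES = 100_000_000   # "100 million?" on the slide
FULL_DIMS = 12_460      # descriptor length before reduction
REDUCED_DIMS = 1_000    # hypothetical reduced length, for illustration only

full = storage_bytes(PATCHES, FULL_DIMS)
reduced_size = storage_bytes(PATCHES, REDUCED_DIMS)
print(f"full descriptors:    {full / 1e12:.2f} TB")          # 4.98 TB
print(f"reduced descriptors: {reduced_size / 1e9:.0f} GB")   # 400 GB
```

Under these assumptions the unreduced descriptors alone would need about 5 TB, which is why both the off-line build and the on-line comparison are listed as open problems.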

When solved and implemented, it will offer new horizons to the study of large corpora of historical documents

Thank You
