Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection Bruce Robertson, Mount Allison University

Post on 23-Feb-2016


Page 1: Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection

Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection

Bruce Robertson, Mount Allison University

Page 2:

ἀλήθεια: truth

Ἀλήθεια

• ‘Breathing’ marks on vowels at beginning of a word

• Accents possible on all vowels
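
The two points above mean that a Greek OCR system has to deal with far more glyph shapes than the bare alphabet. As a small illustration (not part of the original slides), the following Python sketch uses the standard `unicodedata` module to split each letter of ἀλήθεια into its base letter and its combining marks. The function name `describe` is made up for this example, and the sketch assumes every accented letter in the input has a precomposed Unicode form, which holds for this word.

```python
import unicodedata

def describe(word):
    """List each letter of `word` as (base letter, [names of its combining marks])."""
    out = []
    # NFC first, so that each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", word):
        # NFD then splits that code point into base letter + combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        marks = [unicodedata.name(m) for m in decomposed[1:]]
        out.append((decomposed[0], marks))
    return out

# "ἀλήθεια" written with escapes so the code points are unambiguous.
word = "\u1f00\u03bb\u03ae\u03b8\u03b5\u03b9\u03b1"
for base, marks in describe(word):
    print(base, marks)
```

The initial alpha carries a smooth breathing (Unicode calls it COMBINING COMMA ABOVE) and the eta carries an acute accent; the remaining letters have no marks.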

Page 3:

Diversity of Greek Fonts in 19th C.

Page 4:

Other Examples

Page 5:

Greek OCR With Gamera

• Dalitz and Brandt provide an experimental framework
– I added splitting, grouping, SQL output, etc.
• Teams of undergraduates making multiple classifiers
– Based on families of fonts
– Comparing strategies of composite characters, splitting, etc.
– Must also train for Latin scripts used
• Not yet working on post-processing
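
One possible "strategy of composite characters" is to classify base letters and diacritics as separate glyphs and join them afterwards, as opposed to training one class per accented letter. The sketch below is an assumption about what the joining step could look like, not code from the project or from Gamera: the label names in `MARKS` and the function `compose` are hypothetical, and only the standard `unicodedata` module is used.

```python
import unicodedata

# Hypothetical class labels a classifier might emit for diacritic glyphs,
# mapped to the corresponding Unicode combining characters.
MARKS = {
    "diaeresis": "\u0308",
    "smooth_breathing": "\u0313",
    "rough_breathing": "\u0314",
    "grave": "\u0300",
    "acute": "\u0301",
    "circumflex": "\u0342",
    "iota_subscript": "\u0345",
}

# Breathings and diaeresis must precede accents for composition to succeed.
ORDER = ["diaeresis", "smooth_breathing", "rough_breathing",
         "grave", "acute", "circumflex", "iota_subscript"]

def compose(base, mark_labels):
    """Join a base letter with separately recognized diacritics into NFC text."""
    ordered = sorted(mark_labels, key=ORDER.index)
    sequence = base + "".join(MARKS[label] for label in ordered)
    return unicodedata.normalize("NFC", sequence)

# Alpha recognized together with an acute accent and a smooth breathing.
print(compose("\u03b1", ["acute", "smooth_breathing"]))
```

Sorting the marks before normalizing matters because an alpha followed by acute and then breathing does not compose into a single precomposed character, whereas breathing followed by acute does.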

Page 6:

Good Results

Page 7:
Page 8:

Systematic Approach to Automated Greek OCR

• Remove the curator from the loop
– Especially important for journals, monographs, etc.
– Assign classifier by computational means
• Using:
– Federico Boschetti’s ground-truth-less Greek text evaluator
– Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network
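
The slides do not describe how Boschetti's evaluator works internally. Purely to illustrate what a score computed without ground truth can look like, here is a generic stand-in that measures the fraction of OCR tokens found in a word list. The function `lexicon_score`, the word list, and the punctuation set are all assumptions made for this example and should not be read as the evaluator's actual method.

```python
import unicodedata

def nfc(text):
    """Normalize to NFC so visually identical Greek strings compare equal."""
    return unicodedata.normalize("NFC", text)

def lexicon_score(ocr_text, lexicon):
    """Fraction of whitespace-separated tokens found in `lexicon` (0.0 to 1.0).

    No reference transcription is needed: garbled output tends to produce
    tokens that are not real words, which lowers the score.
    """
    known = {nfc(word) for word in lexicon}
    tokens = [nfc(t).strip(".,;\u00b7") for t in ocr_text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in known)
    return hits / len(tokens)

toy_lexicon = {"ἀλήθεια", "καὶ"}
print(lexicon_score("ἀλήθεια καὶ xq7", toy_lexicon))
```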

Page 9:

Process

• 160 Greek-heavy texts chosen
• Of these, random samples of 10 pages were taken
• Each was processed with each of the 20 classifiers made this summer
• The results were evaluated and given a ‘Boschetti score’ from 0 to 1
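
The steps above can be sketched as a small selection loop. This is an illustration under stated assumptions, not the project's code: `pick_classifier` is a made-up name, the classifiers and the scoring function are passed in as plain Python callables, and averaging the per-page scores is one plausible way to combine them that the slides do not confirm.

```python
import random

def pick_classifier(pages, classifiers, score, sample_size=10, seed=0):
    """Return (best_name, mean_scores) for one text.

    pages:       non-empty list of page images (any objects the classifiers accept)
    classifiers: dict mapping a classifier name to a function page -> OCR text
    score:       function text -> float in [0, 1]; a hypothetical stand-in
                 for the ground-truth-less evaluator
    """
    rng = random.Random(seed)
    # Random sample of pages, as on the slide (10 by default).
    sample = rng.sample(pages, min(sample_size, len(pages)))
    mean_scores = {}
    for name, ocr in classifiers.items():
        scores = [score(ocr(page)) for page in sample]
        mean_scores[name] = sum(scores) / len(scores)  # assumption: plain mean
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores

# Toy usage with dummy classifiers and a dummy scorer.
pages = ["page %d" % i for i in range(50)]
classifiers = {
    "A": lambda page: "good",
    "B": lambda page: "bad",
}
toy_score = lambda text: 1.0 if text == "good" else 0.25
best, scores = pick_classifier(pages, classifiers, toy_score)
print(best, scores)
```

Because every text and every classifier is scored independently, the work splits naturally across many processors, which fits the use of a parallel computing network mentioned earlier.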

Page 10:

[Chart: classifier results per volume; y-axis from 0 to 0.6]

Volumes (x-axis): 0EQOAAAAYAAJ, 0OkuAAAAQAAJ, 0qBEAAAAMAAJ, 0w8ANA2-puEC, 0xcOAAAAYAAJ, 0zABAAAAMAAJ, 14lfAAAAMAAJ, 190NAAAAYAAJ, 1DUrAAAAYAAJ

Classifiers (legend): 16thcentAlpha_Font, Aristides_Dindorf, Aristides_Dindorf_1, Bekker, Bude, Cambridge, Early_Teubner, Etymologicum, gamera-greekocr-training-loeb-separatistic-2011-04-25, Kurke, Lexicon, Littre, Loeb_Wholistic, New_Teubner, Oribase_Font, Oribase_Font_1, Oribase_Font_2, Oribase_Test, Oxford, Smyth, Super_Swirly, Super_Swirly2, Teubner_Latin, Teubner_SansSerif, Teubner_Similar, Teubner_Similar2, Teubner_Slim

Page 11:
Page 12:
Page 13:
Page 14:

Google/ABBYY Line Splitting

Page 15:

Gamera’s Text Line Finding (bbox_merging)
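
As a rough idea of what line finding by bounding-box merging involves, the sketch below repeatedly merges glyph boxes whose padded rectangles overlap. It is a simplified illustration written for this transcript, not Gamera's `bbox_merging` implementation; the function names and the `pad_x` / `pad_y` parameters are assumptions.

```python
def overlaps(a, b):
    """True if boxes a and b, given as (x0, y0, x1, y1), intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def merge_boxes(boxes, pad_x=0, pad_y=0):
    """Merge boxes until no padded box overlaps another; returns the merged boxes."""
    merged = [tuple(b) for b in boxes]
    changed = True
    while changed:
        changed = False
        result = []
        for box in merged:
            # Grow the box so that near neighbours count as overlapping.
            grown = (box[0] - pad_x, box[1] - pad_y, box[2] + pad_x, box[3] + pad_y)
            for i, other in enumerate(result):
                if overlaps(grown, other):
                    result[i] = (min(box[0], other[0]), min(box[1], other[1]),
                                 max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
            else:
                result.append(box)
        merged = result
    return merged

# Two glyphs on one line and one glyph on a lower line.
glyphs = [(0, 0, 4, 10), (6, 1, 10, 9), (0, 30, 5, 40)]
print(merge_boxes(glyphs, pad_x=2))
```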

Page 16:

Replaced with runlength_smearing
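
Run-length smearing joins nearby ink into solid blocks by filling short white gaps, so that the letters of a line fuse into one region. The following is a minimal pure-Python sketch of the horizontal pass only, on a binary image stored as lists of 0 and 1. It is not Gamera's `runlength_smearing` code; the names and the `max_gap` threshold are assumptions, and the classic form of the method also smears vertically and combines the two results.

```python
def smear_row(row, max_gap):
    """Fill runs of 0s no longer than max_gap that lie between two 1s."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel
    for x, pixel in enumerate(row):
        if pixel:
            gap = x - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= max_gap:
                for i in range(last_ink + 1, x):
                    out[i] = 1
            last_ink = x
    return out

def smear_horizontal(image, max_gap):
    """Apply smear_row to every row of a binary image (list of lists)."""
    return [smear_row(row, max_gap) for row in image]

# The gap of 2 is filled; the gap of 4 is left alone.
print(smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1], max_gap=2))
```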

Page 17:

Two-step processing

Page 18:

Future Work

• Combining and re-optimizing classifiers?
• Assign classifier based on Latin text
– Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output?
• Align with Google’s output, and provide Google with corrected Greek
• Implement line-splitting from other OCR engines
• Discover badly OCR’d Greek in others’ output
• Implement OCR correction frameworks described here
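
The idea of assigning a classifier from the Latin text could start as a simple keyword lookup over the OCR output of the first pages. The sketch below is hypothetical: the three keywords come from the slide, but the mapping to a classifier named "Oxford" and the function `guess_classifier` are assumptions for illustration.

```python
# Hypothetical mapping from imprint keywords to classifier names.
KEYWORD_TO_CLASSIFIER = {
    "oxford": "Oxford",
    "clarendon": "Oxford",
    "oxonii": "Oxford",
}

def guess_classifier(first_pages_text, default=None):
    """Return a classifier name if an imprint keyword occurs in the text."""
    lowered = first_pages_text.lower()
    for keyword, classifier in KEYWORD_TO_CLASSIFIER.items():
        if keyword in lowered:
            return classifier
    return default

print(guess_classifier("OXONII e typographeo Clarendoniano"))
```

Texts with no matching keyword return the default, so they could still fall back to the score-based selection.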

Page 19:

Common Problems

• Assessments of pre-processing strategies and tools
• Schemas for page description

Page 20:

Thanks

• Colleagues in Dynamic Variorum Editions:
– Greg Crane at Perseus / Tufts
– Brian Fuchs at Imperial College
• Federico Boschetti
• AceNet, especially tech. support of Sergiy Khan