Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin Collection

Bruce Robertson, Mount Allison University

ἀλήθεια (truth)

Ἀλήθεια

• ‘Breathing’ marks on vowels at beginning of a word

• Accents possible on all vowels
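
The two points above mean that one vowel can carry more than one mark, so the set of distinct glyphs grows quickly. As a small illustration (a sketch using only Python's standard `unicodedata` module; the function name `diacritics` is invented here), a word can be decomposed to show which marks sit on which base letter:

```python
import unicodedata

def diacritics(word):
    """Return a list of (base letter, [names of combining marks]) pairs."""
    result = []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and result:
            # A combining mark belongs to the most recent base letter.
            result[-1][1].append(unicodedata.name(ch))
        else:
            result.append((ch, []))
    return result

for base, marks in diacritics("ἀλήθεια"):
    print(base, marks)
```

In the decomposed form the breathing on the initial alpha and the accent on the eta appear as separate combining characters, while the other five letters carry no marks.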

Diversity of Greek Fonts in 19th C.

Other Examples

Greek OCR With Gamera

• Dalitz and Brandt provide an experimental framework
– I added splitting, grouping, SQL output, etc.
• Teams of undergraduates making multiple classifiers
– Based on families of fonts
– Comparing strategies of composite characters, splitting, etc.
– Must also train for Latin scripts used
• Not yet working on post-processing
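
One way to handle composite characters is to recognise the base letter and its marks separately and recombine them afterwards, instead of training one class per accented glyph. The following is a minimal sketch of that recombination step only. It is not Gamera's API: the mark labels in `MARKS` and the function `compose` are hypothetical, and the only library used is Python's standard `unicodedata`.

```python
import unicodedata

# Hypothetical mark labels mapped to Unicode combining characters.
# The dictionary order is the order in which marks are applied, which
# matters for composing a single precomposed character.
MARKS = {
    "smooth": "\u0313",          # smooth breathing
    "rough": "\u0314",           # rough breathing
    "diaeresis": "\u0308",
    "acute": "\u0301",
    "grave": "\u0300",
    "circumflex": "\u0342",      # Greek circumflex (perispomeni)
    "iota_subscript": "\u0345",
}
ORDER = list(MARKS)

def compose(base, marks):
    """Combine a base letter with recognised marks into one NFC string."""
    ordered = sorted(marks, key=ORDER.index)
    combined = base + "".join(MARKS[m] for m in ordered)
    return unicodedata.normalize("NFC", combined)

# The marks may arrive in any order from the recogniser.
print(compose("α", ["acute", "smooth"]))
print(compose("ω", ["iota_subscript", "circumflex", "rough"]))
```

Sorting the marks before normalising is the design choice worth noting: applying the accent before the breathing would leave a stray combining character instead of a single precomposed letter.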

Good Results

Systematic Approach to Automated Greek OCR

• Remove the curator from the loop
– Especially important for journals, monographs, etc.
– Assign classifier by computational means
• Using:
– Federico Boschetti’s ground-truth-less Greek text evaluator
– Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network

Process

• 160 Greek-heavy texts chosen
• Of these, random samples of 10 pages were taken
• Each was processed with each of the 20 classifiers made this summer
• The results were evaluated and given a ‘Boschetti score’ from 0 to 1
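
The selection loop described above can be sketched as follows. This is an illustration under stated assumptions, not the code that was run: `run_ocr` and `score` are hypothetical callables standing in for the OCR engine and for a ground-truth-less evaluator, and picking the classifier with the highest mean score is one plausible way to turn the scores into a choice.

```python
import random
from statistics import mean

def choose_classifier(pages, classifiers, run_ocr, score,
                      sample_size=10, seed=0):
    """Pick the classifier with the highest mean score on a page sample.

    run_ocr(page, classifier_name) -> text   (hypothetical)
    score(text) -> float between 0 and 1     (hypothetical)
    """
    rng = random.Random(seed)
    # Sample pages once so every classifier is judged on the same pages.
    sample = rng.sample(pages, min(sample_size, len(pages)))
    means = {
        name: mean(score(run_ocr(page, name)) for page in sample)
        for name in classifiers
    }
    best = max(means, key=means.get)
    return best, means
```

Because each (page, classifier) pair is independent, the work divides naturally across the nodes of a parallel computing network.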

[Figure: score (vertical axis, 0 to 0.6) of each classifier on each sample volume. Volumes shown: 0EQOAAAAYAAJ, 0OkuAAAAQAAJ, 0qBEAAAAMAAJ, 0w8ANA2-puEC, 0xcOAAAAYAAJ, 0zABAAAAMAAJ, 14lfAAAAMAAJ, 190NAAAAYAAJ, 1DUrAAAAYAAJ. Legend: 16thcent, Alpha_Font, Aristides_Dindorf, Aristides_Dindorf_1, Bekker, Bude, Cambridge, Early_Teubner, Etymologicum, gamera-greekocr-training-loeb-separatistic-2011-04-25, Kurke, Lexicon, Littre, Loeb_Wholistic, New_Teubner, Oribase_Font, Oribase_Font_1, Oribase_Font_2, Oribase_Test, Oxford, Smyth, Super_Swirly, Super_Swirly2, Teubner_Latin, Teubner_SansSerif, Teubner_Similar, Teubner_Similar2, Teubner_Slim]

Google/ABBYY Line Splitting

Gamera’s Text Line Finding (bbox_merging)

Replaced with runlength_smearing
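
In general, run-length smearing closes short white gaps in a binarised page so that neighbouring glyphs fuse into larger blobs, which can then be grouped into lines. The sketch below shows only the horizontal smearing step on a list-of-lists image (1 for black, 0 for white). It is a generic illustration, not Gamera's implementation, and the names `smear_row`, `smear_rows` and the parameter `max_gap` are assumptions made for this example.

```python
def smear_row(row, max_gap):
    """Fill white runs of at most max_gap pixels that lie between black pixels."""
    out = list(row)
    last_black = None
    for x, pixel in enumerate(row):
        if pixel:
            gap = x - last_black - 1 if last_black is not None else 0
            if 0 < gap <= max_gap:
                # Short gap between two black pixels: paint it black.
                for i in range(last_black + 1, x):
                    out[i] = 1
            last_black = x
    return out

def smear_rows(image, max_gap):
    """Apply horizontal smearing to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]

page = [
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0, 0],
]
for row in smear_rows(page, max_gap=2):
    print(row)
```

White runs that touch the left or right edge are left alone here; other formulations of the algorithm treat the edges differently.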

Two-step processing

Future Work

• Combining and re-optimizing classifiers?
• Assign classifier based on Latin text
– Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output?
• Align with Google’s output, and provide Google with corrected Greek
• Implement line-splitting from other OCR engines
• Discover badly OCR’d Greek in others’ output
• Implement OCR correction frameworks described here
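
The imprint check proposed in the list above could look something like this. It is a sketch only: the three search words come from the slide, while the function name `mentions_oxford` and the case-insensitive substring matching are assumptions.

```python
PUBLISHER_WORDS = ("Oxford", "Clarendon", "Oxonii")

def mentions_oxford(first_pages, words=PUBLISHER_WORDS):
    """True if any imprint word occurs in the OCR text of the first pages."""
    text = " ".join(first_pages).casefold()
    return any(word.casefold() in text for word in words)

title_pages = ["HOMERI OPERA", "OXONII E TYPOGRAPHEO CLARENDONIANO"]
print(mentions_oxford(title_pages))
```

A positive match could then be used to prefer a classifier trained on that publisher's fonts; which classifier to map each imprint to is left open here.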

Common Problems

• Assessments of pre-processing strategies and tools
• Schemas for page description

Thanks

• Colleagues in Dynamic Variorum Editions:
– Greg Crane at Perseus / Tufts
– Brian Fuchs at Imperial College
• Federico Boschetti
• AceNet, especially technical support of Sergiy Khan
