Ancient Greek OCR with Gamera and the Google/Perseus Greek and Latin
Collection
Bruce Robertson, Mount Allison University
ἀλήθεια ‘truth’
Ἀλήθεια
• ‘Breathing’ marks on vowels at beginning of a word
• Accents possible on all vowels
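The two points above can be made concrete with Unicode decomposition. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe_marks` is made up for illustration and is not part of Gamera or of the project described here. It splits each letter into its base character and the combining marks (breathing, accent) attached to it.

```python
import unicodedata

def describe_marks(word):
    """Return a list of (base letter, [combining mark names]) for a word."""
    result = []
    # NFD separates each precomposed letter into base letter + diacritics.
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and result:
            # Attach the diacritic to the most recent base letter.
            result[-1][1].append(unicodedata.name(ch))
        else:
            result.append((ch, []))
    return result

# The word from the slide: smooth breathing on the initial alpha,
# acute accent on the eta.
for base, marks in describe_marks("\u1f00\u03bb\u03ae\u03b8\u03b5\u03b9\u03b1"):
    print(base, marks)
```

In Unicode the smooth breathing is encoded as COMBINING COMMA ABOVE and the acute as COMBINING ACUTE ACCENT, so an OCR classifier can in principle treat such marks either as part of a composite glyph or as separate symbols to be regrouped later.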
Diversity of Greek Fonts in 19th C.
Other Examples
Greek OCR With Gamera
• Dalitz and Brandt provide an experimental framework
– I added splitting, grouping, SQL output, etc.
• Teams of undergraduates making multiple classifiers
– Based on families of fonts
– Comparing strategies of composite characters, splitting, etc.
– Must also train for Latin scripts used
• Not yet working on post-processing
Good Results
Systematic Approach to Automated Greek OCR
• Remove the curator from the loop
– Especially important for journals, monographs, etc.
– Assign classifier by computational means
• Using:
– Federico Boschetti’s ground-truth-less Greek text evaluator
– Atlantic Computational Excellence Network, Atlantic Canada’s parallel computing network
Process
• 160 Greek-heavy texts chosen
• Of these, random samples of 10 pages were taken
• Each sample was processed with each of the 20 classifiers made this summer
• The results were evaluated and given a ‘Boschetti score’ from 0 to 1
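The process above can be sketched as a small selection loop. This is only an illustration under stated assumptions: `run_ocr` and `score_text` are hypothetical callables standing in for the Gamera OCR run and for Boschetti's evaluator, and averaging the per-page scores to pick a winner is an assumption, not something the slides specify.

```python
import random

def sample_pages(page_ids, k=10, seed=0):
    """Pick k random pages from a volume (all of them if it has fewer than k)."""
    rng = random.Random(seed)
    return rng.sample(page_ids, min(k, len(page_ids)))

def best_classifier(pages, classifiers, run_ocr, score_text):
    """Run every classifier over the sampled pages; return (name, mean score) of the best."""
    if not pages:
        raise ValueError("no pages to evaluate")
    means = {}
    for name in classifiers:
        # score_text is assumed to return a value between 0 and 1.
        scores = [score_text(run_ocr(name, page)) for page in pages]
        means[name] = sum(scores) / len(scores)
    best = max(means, key=means.get)
    return best, means[best]

# Toy stand-ins so the sketch runs without an OCR engine (hypothetical).
def fake_ocr(name, page):
    return f"{name}:{page}"

def fake_score(text):
    return 0.5 if text.startswith("Oxford") else 0.2

pages = sample_pages(list(range(1, 301)))
print(best_classifier(pages, ["Bekker", "Oxford"], fake_ocr, fake_score))
```

Because every (volume, classifier) pair is independent, a loop of this shape parallelizes naturally across a computing cluster.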
[Chart: Boschetti score (axis ticks 0 to 0.6) of each classifier on sample volumes 0EQOAAAAYAAJ, 0OkuAAAAQAAJ, 0qBEAAAAMAAJ, 0w8ANA2-puEC, 0xcOAAAAYAAJ, 0zABAAAAMAAJ, 14lfAAAAMAAJ, 190NAAAAYAAJ, 1DUrAAAAYAAJ. Legend: 16thcent, Alpha_Font, Aristides_Dindorf, Aristides_Dindorf_1, Bekker, Bude, Cambridge, Early_Teubner, Etymologicum, gamera-greekocr-training-loeb-separatistic-2011-04-25, Kurke, Lexicon, Littre, Loeb_Wholistic, New_Teubner, Oribase_Font, Oribase_Font_1, Oribase_Font_2, Oribase_Test, Oxford, Smyth, Super_Swirly, Super_Swirly2, Teubner_Latin, Teubner_SansSerif, Teubner_Similar, Teubner_Similar2, Teubner_Slim]
Google/ABBYY Line Splitting
Gamera’s Text Line Finding (bbox_merging)
Replaced with runlength_smearing
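For readers unfamiliar with run-length smearing, the core idea can be sketched in a few lines. This is a plain-Python illustration of the horizontal smearing step only, not Gamera's implementation; the function names and the `max_gap` parameter are made up for the example. Short white gaps between black pixels are filled in, so that the characters of a text line merge into one connected block that can then be found as a line.

```python
def smear_row(row, max_gap):
    """Fill white runs (0) of length <= max_gap lying between two black pixels (1)."""
    out = list(row)
    last_black = None
    for i, px in enumerate(row):
        if px:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= max_gap:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out

def smear_horizontal(image, max_gap):
    """Apply the row-wise smearing to every row of a binary image (list of lists)."""
    return [smear_row(row, max_gap) for row in image]

# Two 'characters' separated by a small gap merge; the wide gap survives.
print(smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1], max_gap=2))
```

The full technique usually also smears vertically and combines the two results; that part is omitted here for brevity.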
Two-step processing
Future Work
• Combining and re-optimizing classifiers?
• Assign classifier based on Latin text
– Is ‘Oxford’, ‘Clarendon’ or ‘Oxonii’ in the first pages of output?
• Align with Google’s output, and provide Google with corrected Greek
• Implement line-splitting from other OCR engines
• Discover badly OCR’d Greek in others’ output
• Implement OCR correction frameworks described here
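The idea of assigning a classifier from the Latin-script text of the first pages could look roughly like the sketch below. The keyword table and the function name `guess_classifier` are hypothetical; only the three imprint words and the existence of a classifier named ‘Oxford’ come from the slides, and mapping one to the other is an assumption.

```python
# Hypothetical table: classifier name -> imprint words that suggest it.
IMPRINT_HINTS = {
    "Oxford": ("oxford", "clarendon", "oxonii"),
}

def guess_classifier(first_pages_text, hints=IMPRINT_HINTS, default=None):
    """Return the first classifier whose imprint keywords occur in the text."""
    text = first_pages_text.lower()
    for classifier, keywords in hints.items():
        if any(keyword in text for keyword in keywords):
            return classifier
    return default

print(guess_classifier("OXONII e typographeo Clarendoniano"))
```

A lookup like this would only narrow the choice; a score-based comparison would still be needed for volumes whose title pages match no known imprint.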
Common Problems
• Assessments of pre-processing strategies and tools
• Schemas for page description
Thanks
• Colleagues in Dynamic Variorum Editions:
– Greg Crane at Perseus / Tufts
– Brian Fuchs at Imperial College
• Federico Boschetti
• AceNet, especially tech. support of Sergiy Khan