078. kestemont-script identification in medieval latin ...was 91.17%, using a network architecture...

Script Identification in Medieval Latin Manuscripts Using Convolutional Neural Networks [email protected],BelgiumDominiqueStutzmanndominique.stutzmann@irht.cnrs.frCentreNationaldelaRechercheScientifique,France

Introduction Inpaleography,scholarsstudythehistoryofhand-

writing, a crucial aspect of book history and manu-script studies. Paleography has traditionally beendominatedbyexpert-basedapproaches,drivenbytheopinionsofasmallgroupofhighlytrainedindividuals.These have acquired a set of expert skills throughyear-longtraining,e.g.theabilitytodateahandwrit-ing or attribute it to specific individuals. Thisknowledgeremainsveryhardtorenderexplicit,inor-dertoshareitwithothers.Therefore,paleographersare increasingly interested indigitalmodelling tech-niques toenhance thecreationanddisseminationofpaleographic knowledge (Stutzmann, 2015). An im-portant task in paleography is the classification ofscript types, especially now that digital libraries(BVMM, Gallica, e-Codices,ManuscriptaMediaevalia,etc.) are amassing reproductions ofmedievalmanu-scripts,oftenwithscarcemetadata.Beingabletorec-ognizethescripttypeofsuchhistoricartefactsiscru-cialtodate,localizeor(semi-)automaticallytranscribethem. This paper focuses on script identification formedievalLatinmanuscripts(ca.500ADto1600AD)and demonstrates the feasibility of a fairly accurate,meaningfulautomatedclassification.

Medievalscriptclassificationwasthe focusof therecentCLaMM(ClassificationofLatinMedievalManu-scripts)competition.Forthissharedtask,theorganiz-ersreleasedatrainingdatasetof2,000photographic(greyscale,300dpi)reproductionsofpagesextractedfromLatinmanuscripts,whichwereclassified intoa

12scripttypeclasses,includinguncial,caroline,textu-alisandhumanisticscript,butalsomoredifficulttode-lineate classes such as the cursiva or (semi)hybrida.The participating teams had to submit a standaloneapplicationwhichwasabletoclassifyunseenimagesandestimatethedistancebetweenthem.Theorganiz-erswould thenapply thesubmissions to1,000resp.2,000 test images (Stutzmann, 2016) and evaluatetheirperformanceusingvariousevaluationschemes.Here, we discuss the DeepScript submission to theCLaMM competition. The competition’s results havebeen officially been released on 26Oct. 2016.Deep-Scriptwasrankedfirstontask2,i.e.the‘crisp’classifi-cationofmixedscriptimages(Cloppetetal.,2016).Asthe ground truth and results were released too re-cently,welimitthisabstracttoageneraldiscussionoftheapproach;thefinalversionandpresentationofthispaper will be supplemented with additional infor-mationandtestresults.

TheDeepScriptsubmissionbuildsuponrecentad-vancesinComputerVision,wheretheuseofso-called‘deep’ neural networks has recently led to dramaticbreakthroughsinthestateoftheartofimageclassifi-cation (LeCun et al., 2015). The kind of neural net-worksappliedinComputerVisionaretypicallyconvo-lutional: they slide small ‘filters’ (feature detectors)across images tomake the network robust to smalltranslations of objects. The networks make use ofmany‘layers’ofsuchfeaturedetectors,wheretheout-putofonefeaturedetectoralwaysfeedsintothenextone.Theuseofsuchastackoflayersisbeneficial,be-cause this ‘deep architecture’ allows algorithms tomodelfeaturesofanincreasingcomplexity(Bengioetal.,2013):inthefirstlayersofthenetwork,veryrawandprimitiveshapesaredetected(‘edges’);itisonlyatthehigherlayersinthenetworksthattheseprimi-tive features are combined into more complex, ab-stractvisualpatterns(e.g.entirefaces).Theseneuralnetworkslieatthebasisofe.g.modernfaceverifica-tionalgorithmsonsocialmediawebsitessuchasFace-book.

Neural networks are composedofmillions of pa-rameterswhichhavebeoptimized.Forthis,theavail-abletrainingdataissplitoutinasetoftrainingimagesandasmallersetofdevelopmentimages(respectivelyca.1,800and200images):theformerisusedtoopti-mize theparametersof thenetworkduring training,the latter is used tomonitor theperformanceof thenetwork.Theuseofdevelopmentdataisnecessarytoavoid‘overfitting’:itispossibleforanetworktostart‘memorizing’thetrainingimages,sothatitproducesperfectpredictionsforthetrainingdata,butisnotable

anymore to generalizeproperly tonew,unseen im-ages.Byusingadevelopmentset,wecanstopoptimiz-ingthenetwork,ifitspredictionsforthedevelopmentdatadonot increaseinqualityanymore.Onlyatthisstage,thealgorithmisevaluatedontheactualtestim-ages.

Modern neural networks are typically trained onhundredthousandsoftrainingimages.InthefieldofCultural Heritage data, a common challenge is thatmostdatasetsaremuchsmaller,andCLaMMisnoex-ception, so that the danger of overfitting is muchlarger.Wethereforeproceededasfollows:thegener-ousresolutionforeachtrainingimagewasdownsizedby one half. Next, we would select random squarecropsorpatchesfromtheimage(150x150pixels)andtrainthealgorithmsonbatchesofthesecrops.Thisap-proachisblunt,yetinnovative,sincewemakenoeffortto extractmore specific regions of interest from theimages,suchasindividuallines,wordsorcharacters.To avoid overfitting, we also applied augmentation:each training cropwouldbe ‘distorted’ through ran-domly varying the zoom level, rotation and transla-tion.Introducingsuchnoiseintheinputisacommonstrategytocombatoverfitting.Belowgoesanexampleofsuchasetofaugmentedpatchesforasinglemanu-scriptpage(Fig.1).

Fig. 1: Example of augmented crops for a single manuscript

page.

Aftereachepoch,weevaluatedtheperformanceof

the current state of the network through inspectingthe classification accuracy on the development im-ages:werandomlyselected30cropsfromeachimage(without augmentation), and calculated the averageprobability foreachoutputclass.The full imagewasassignedtotheclasswiththehighestaverageproba-bility.Thebestvalidationaccuracywhichweachievedwas91.17%,usinganetworkarchitectureof14layers,

inspiredbythefamousOxfordVGGnet(Simonyanetal.,2015).ThemanualclassificationofCLaMMimageswas based on morphological differences and allo-graphs,asdefinedinstandardworksonLatinscriptssuchas(Bischoff,1986;Derolez,2003).Theconfusionmatrices in Fig. 2-4 for the actual and the predictedclassesinthedevelopmentandtestdataillustratethatthe confusions generally make sense from a paleo-graphicpointofview(thenormaltextualisletterisforinstanceoftenmistakenfortheSouthernorsemi-tex-tualisvariant).

Fig. 2: Confusion matrix for the development data.

Fig. 3: Classifications for the test data as a confusion matrix (task 1). Horizontal lines: ground-truth; Vertical columns: predictions. Order: 1-Uncial; 2-Half-uncial; 3-Caroline; 4-

Humanistic; 5-Humanistic; Cursive; 6-Praegothica; 7-Southern Textualis; 8-Semitextualis 9-Textualis; 10-Hybrida

11-Semihybrida 12-Cursiva.

Fig. 4: Membership Degree matrices for task 2, on the 999 test images where only one label is attributed to the image.

Thereexistinterestingmethodstovisualizewhichpatternsthetrainednetworkissensitiveto.Usingtheprincipleofgradientascent,westart fromarandomnoiseimageandfeedittooneofthefiltersonthelastconvolutionallayer:during3,000iterationswechangethe image so that itmaximizes the activationof thisparticular filter. InFig.3,weshowthe25artificiallygeneratedimageswhichyieldedthestrongestresults;clearly, the network picks up relevant patterns. Thethirdimagefromtheleftinthefirstline,forinstance,clearlycapturesthepresenceofloopsintheascendersofindividualcharacters(e.g.inthe‘b’or‘h’,whichiscrucial todifferentiatebetweene.g.a textualisandacursiveletter).Thesevisualizationsdirectlytackletheissueofthecomputational‘blackbox’intheDigitalHu-manities, and espsecially Digital Palaeography(Hassneretal.,2013;Stutzmannetal.,2014). Inourpaper,wewillofferfurtherinterpretationsandvisual-izationsofourmodelandconfrontthesewiththere-sults fromotherparticipants intheCLaMMcompeti-tion tooffernewperspectiveson thegraphicdefini-tionofscriptclassesintraditionalpaleography.

Fig. 5: Artificially generated images that maximally activate

filters in the final convolutional layer.

Bibliography

Bengio,Y.,Courville,A.andVincent,P.(2013).Repre-sentationLearning:AReviewandNewPerspectives.TPAMI35:8,1798–1828.

Bischoff,B.,PaläographiedesrömischenAltertumsund

desabendländischenMittelalters,Berlin,1986.Cloppet, F., Eglin, V., Kieu, V. C., Stutzmann, D. and

Vincent,N.(2016),ICFHR2016CompetitionontheClassification of Medieval Handwritings in LatinScript,ProceedingsoftheICFHR2016,590-595.

Derolez, A., The Palaeography of Gothic Manuscript

BooksfromtheTwelfthtotheEarlySixteenthCentury,Cambridge,2003.

Hassner,T.,Rehbein,M.,Stokes,P.andWolf,L.2013.

Computationandpalaeography:Potentialsandlim-its.DagstuhlManifestos2:14–35.

LeCun,Y.,Bengio,Y.andHinton,G.(2015).Deeplearn-

ing.Nature521(7553):436–44.Simonyan, K. and Zisserman, A. (2015). Very Deep

Convolutional Networks for Large-Scale ImageRecognition. Proceedings of ICLR 2015.https://arxiv.org/abs/1409.1556/.

Stutzmann,D.andTarte,S.(2014).Digitalpalaeogra-

phy: Newmachines and old texts : Executive sum-mary.DagstuhlReports4.7:112–34(112–14,132).

Stutzmann, D. (2015). Clustering of medieval scripts

throughcomputerimageanalysis:Towardsaneval-uation protocol.DigitalMedievalist10, http://digi-talmedievalist.org/journal/10/stutzmann/.

Stutzmann, D. (2016). ICFHR2016 Competition on the

Classification of Medieval Handwritings in LatinScript.http://icfhr2016-clamm.irht.cnrs.fr/.

078. kestemont-script identification in medieval latin ...was 91.17%, using a network architecture...

Documents