078. kestemont-script identification in medieval latin ...was 91.17%, using a network architecture...

4
Script Identification in Medieval Latin Manuscripts Using Convolutional Neural Networks Mike Kestemont [email protected] University of Antwerp, Belgium Dominique Stutzmann [email protected] Centre National de la Recherche Scientifique, France Introduction In paleography, scholars study the history of hand- writing, a crucial aspect of book history and manu- script studies. Paleography has traditionally been dominated by expert-based approaches, driven by the opinions of a small group of highly trained individuals. These have acquired a set of expert skills through year-long training, e.g. the ability to date a handwrit- ing or attribute it to specific individuals. This knowledge remains very hard to render explicit, in or- der to share it with others. Therefore, paleographers are increasingly interested in digital modelling tech- niques to enhance the creation and dissemination of paleographic knowledge (Stutzmann, 2015). An im- portant task in paleography is the classification of script types, especially now that digital libraries (BVMM, Gallica, e-Codices, Manuscripta Mediaevalia, etc.) are amassing reproductions of medieval manu- scripts, often with scarce metadata. Being able to rec- ognize the script type of such historic artefacts is cru- cial to date, localize or (semi-)automatically transcribe them. This paper focuses on script identification for medieval Latin manuscripts (ca. 500 AD to 1600 AD) and demonstrates the feasibility of a fairly accurate, meaningful automated classification. Medieval script classification was the focus of the recent CLaMM (Classification of Latin Medieval Manu- scripts) competition. For this shared task, the organiz- ers released a training data set of 2,000 photographic (greyscale, 300 dpi) reproductions of pages extracted from Latin manuscripts, which were classified into a 12 script type classes, including uncial, caroline, textu- alis and humanistic script, but also more difficult to de- lineate classes such as the cursiva or (semi)hybrida. The participating teams had to submit a standalone application which was able to classify unseen images and estimate the distance between them. The organiz- ers would then apply the submissions to 1,000 resp. 2,000 test images (Stutzmann, 2016) and evaluate their performance using various evaluation schemes. Here, we discuss the DeepScript submission to the CLaMM competition. The competition’s results have been officially been released on 26 Oct. 2016. Deep- Script was ranked first on task 2, i.e. the ‘crisp’ classifi- cation of mixed script images (Cloppet et al., 2016). As the ground truth and results were released too re- cently, we limit this abstract to a general discussion of the approach; the final version and presentation of this paper will be supplemented with additional infor- mation and test results. The DeepScript submission builds upon recent ad- vances in Computer Vision, where the use of so-called ‘deep’ neural networks has recently led to dramatic breakthroughs in the state of the art of image classifi- cation (LeCun et al., 2015). The kind of neural net- works applied in Computer Vision are typically convo- lutional: they slide small ‘filters’ (feature detectors) across images to make the network robust to small translations of objects. The networks make use of many ‘layers’ of such feature detectors, where the out- put of one feature detector always feeds into the next one. The use of such a stack of layers is beneficial, be- cause this ‘deep architecture’ allows algorithms to model features of an increasing complexity (Bengio et al., 2013): in the first layers of the network, very raw and primitive shapes are detected (‘edges’); it is only at the higher layers in the networks that these primi- tive features are combined into more complex, ab- stract visual patterns (e.g. entire faces). These neural networks lie at the basis of e.g. modern face verifica- tion algorithms on social media websites such as Face- book. Neural networks are composed of millions of pa- rameters which have be optimized. For this, the avail- able training data is split out in a set of training images and a smaller set of development images (respectively ca. 1,800 and 200 images): the former is used to opti- mize the parameters of the network during training, the latter is used to monitor the performance of the network. The use of development data is necessary to avoid ‘overfitting’: it is possible for a network to start ‘memorizing’ the training images, so that it produces perfect predictions for the training data, but is not able

Upload: others

Post on 25-May-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 078. Kestemont-Script Identification in Medieval Latin ...was 91.17%, using a network architecture of 14 layers, inspired by the famous Oxford VGG net (Simonyan et al., 2015). The

Script Identification in Medieval Latin Manuscripts Using Convolutional Neural Networks [email protected],BelgiumDominiqueStutzmanndominique.stutzmann@irht.cnrs.frCentreNationaldelaRechercheScientifique,France

Introduction Inpaleography,scholarsstudythehistoryofhand-

writing, a crucial aspect of book history and manu-script studies. Paleography has traditionally beendominatedbyexpert-basedapproaches,drivenbytheopinionsofasmallgroupofhighlytrainedindividuals.These have acquired a set of expert skills throughyear-longtraining,e.g.theabilitytodateahandwrit-ing or attribute it to specific individuals. Thisknowledgeremainsveryhardtorenderexplicit,inor-dertoshareitwithothers.Therefore,paleographersare increasingly interested indigitalmodelling tech-niques toenhance thecreationanddisseminationofpaleographic knowledge (Stutzmann, 2015). An im-portant task in paleography is the classification ofscript types, especially now that digital libraries(BVMM, Gallica, e-Codices,ManuscriptaMediaevalia,etc.) are amassing reproductions ofmedievalmanu-scripts,oftenwithscarcemetadata.Beingabletorec-ognizethescripttypeofsuchhistoricartefactsiscru-cialtodate,localizeor(semi-)automaticallytranscribethem. This paper focuses on script identification formedievalLatinmanuscripts(ca.500ADto1600AD)and demonstrates the feasibility of a fairly accurate,meaningfulautomatedclassification.

Medievalscriptclassificationwasthe focusof therecentCLaMM(ClassificationofLatinMedievalManu-scripts)competition.Forthissharedtask,theorganiz-ersreleasedatrainingdatasetof2,000photographic(greyscale,300dpi)reproductionsofpagesextractedfromLatinmanuscripts,whichwereclassified intoa

12scripttypeclasses,includinguncial,caroline,textu-alisandhumanisticscript,butalsomoredifficulttode-lineate classes such as the cursiva or (semi)hybrida.The participating teams had to submit a standaloneapplicationwhichwasabletoclassifyunseenimagesandestimatethedistancebetweenthem.Theorganiz-erswould thenapply thesubmissions to1,000resp.2,000 test images (Stutzmann, 2016) and evaluatetheirperformanceusingvariousevaluationschemes.Here, we discuss the DeepScript submission to theCLaMM competition. The competition’s results havebeen officially been released on 26Oct. 2016.Deep-Scriptwasrankedfirstontask2,i.e.the‘crisp’classifi-cationofmixedscriptimages(Cloppetetal.,2016).Asthe ground truth and results were released too re-cently,welimitthisabstracttoageneraldiscussionoftheapproach;thefinalversionandpresentationofthispaper will be supplemented with additional infor-mationandtestresults.

TheDeepScriptsubmissionbuildsuponrecentad-vancesinComputerVision,wheretheuseofso-called‘deep’ neural networks has recently led to dramaticbreakthroughsinthestateoftheartofimageclassifi-cation (LeCun et al., 2015). The kind of neural net-worksappliedinComputerVisionaretypicallyconvo-lutional: they slide small ‘filters’ (feature detectors)across images tomake the network robust to smalltranslations of objects. The networks make use ofmany‘layers’ofsuchfeaturedetectors,wheretheout-putofonefeaturedetectoralwaysfeedsintothenextone.Theuseofsuchastackoflayersisbeneficial,be-cause this ‘deep architecture’ allows algorithms tomodelfeaturesofanincreasingcomplexity(Bengioetal.,2013):inthefirstlayersofthenetwork,veryrawandprimitiveshapesaredetected(‘edges’);itisonlyatthehigherlayersinthenetworksthattheseprimi-tive features are combined into more complex, ab-stractvisualpatterns(e.g.entirefaces).Theseneuralnetworkslieatthebasisofe.g.modernfaceverifica-tionalgorithmsonsocialmediawebsitessuchasFace-book.

Neural networks are composedofmillions of pa-rameterswhichhavebeoptimized.Forthis,theavail-abletrainingdataissplitoutinasetoftrainingimagesandasmallersetofdevelopmentimages(respectivelyca.1,800and200images):theformerisusedtoopti-mize theparametersof thenetworkduring training,the latter is used tomonitor theperformanceof thenetwork.Theuseofdevelopmentdataisnecessarytoavoid‘overfitting’:itispossibleforanetworktostart‘memorizing’thetrainingimages,sothatitproducesperfectpredictionsforthetrainingdata,butisnotable

Page 2: 078. Kestemont-Script Identification in Medieval Latin ...was 91.17%, using a network architecture of 14 layers, inspired by the famous Oxford VGG net (Simonyan et al., 2015). The

anymore to generalizeproperly tonew,unseen im-ages.Byusingadevelopmentset,wecanstopoptimiz-ingthenetwork,ifitspredictionsforthedevelopmentdatadonot increaseinqualityanymore.Onlyatthisstage,thealgorithmisevaluatedontheactualtestim-ages.

Modern neural networks are typically trained onhundredthousandsoftrainingimages.InthefieldofCultural Heritage data, a common challenge is thatmostdatasetsaremuchsmaller,andCLaMMisnoex-ception, so that the danger of overfitting is muchlarger.Wethereforeproceededasfollows:thegener-ousresolutionforeachtrainingimagewasdownsizedby one half. Next, we would select random squarecropsorpatchesfromtheimage(150x150pixels)andtrainthealgorithmsonbatchesofthesecrops.Thisap-proachisblunt,yetinnovative,sincewemakenoeffortto extractmore specific regions of interest from theimages,suchasindividuallines,wordsorcharacters.To avoid overfitting, we also applied augmentation:each training cropwouldbe ‘distorted’ through ran-domly varying the zoom level, rotation and transla-tion.Introducingsuchnoiseintheinputisacommonstrategytocombatoverfitting.Belowgoesanexampleofsuchasetofaugmentedpatchesforasinglemanu-scriptpage(Fig.1).

Fig. 1: Example of augmented crops for a single manuscript

page.

Aftereachepoch,weevaluatedtheperformanceof

the current state of the network through inspectingthe classification accuracy on the development im-ages:werandomlyselected30cropsfromeachimage(without augmentation), and calculated the averageprobability foreachoutputclass.The full imagewasassignedtotheclasswiththehighestaverageproba-bility.Thebestvalidationaccuracywhichweachievedwas91.17%,usinganetworkarchitectureof14layers,

inspiredbythefamousOxfordVGGnet(Simonyanetal.,2015).ThemanualclassificationofCLaMMimageswas based on morphological differences and allo-graphs,asdefinedinstandardworksonLatinscriptssuchas(Bischoff,1986;Derolez,2003).Theconfusionmatrices in Fig. 2-4 for the actual and the predictedclassesinthedevelopmentandtestdataillustratethatthe confusions generally make sense from a paleo-graphicpointofview(thenormaltextualisletterisforinstanceoftenmistakenfortheSouthernorsemi-tex-tualisvariant).

Fig. 2: Confusion matrix for the development data.

Page 3: 078. Kestemont-Script Identification in Medieval Latin ...was 91.17%, using a network architecture of 14 layers, inspired by the famous Oxford VGG net (Simonyan et al., 2015). The

Fig. 3: Classifications for the test data as a confusion matrix (task 1). Horizontal lines: ground-truth; Vertical columns: predictions. Order: 1-Uncial; 2-Half-uncial; 3-Caroline; 4-

Humanistic; 5-Humanistic; Cursive; 6-Praegothica; 7-Southern Textualis; 8-Semitextualis 9-Textualis; 10-Hybrida

11-Semihybrida 12-Cursiva.

Fig. 4: Membership Degree matrices for task 2, on the 999 test images where only one label is attributed to the image.

Thereexistinterestingmethodstovisualizewhichpatternsthetrainednetworkissensitiveto.Usingtheprincipleofgradientascent,westart fromarandomnoiseimageandfeedittooneofthefiltersonthelastconvolutionallayer:during3,000iterationswechangethe image so that itmaximizes the activationof thisparticular filter. InFig.3,weshowthe25artificiallygeneratedimageswhichyieldedthestrongestresults;clearly, the network picks up relevant patterns. Thethirdimagefromtheleftinthefirstline,forinstance,clearlycapturesthepresenceofloopsintheascendersofindividualcharacters(e.g.inthe‘b’or‘h’,whichiscrucial todifferentiatebetweene.g.a textualisandacursiveletter).Thesevisualizationsdirectlytackletheissueofthecomputational‘blackbox’intheDigitalHu-manities, and espsecially Digital Palaeography(Hassneretal.,2013;Stutzmannetal.,2014). Inourpaper,wewillofferfurtherinterpretationsandvisual-izationsofourmodelandconfrontthesewiththere-sults fromotherparticipants intheCLaMMcompeti-tion tooffernewperspectiveson thegraphicdefini-tionofscriptclassesintraditionalpaleography.

Fig. 5: Artificially generated images that maximally activate

filters in the final convolutional layer.

Bibliography

Bengio,Y.,Courville,A.andVincent,P.(2013).Repre-sentationLearning:AReviewandNewPerspectives.TPAMI35:8,1798–1828.

Bischoff,B.,PaläographiedesrömischenAltertumsund

desabendländischenMittelalters,Berlin,1986.Cloppet, F., Eglin, V., Kieu, V. C., Stutzmann, D. and

Vincent,N.(2016),ICFHR2016CompetitionontheClassification of Medieval Handwritings in LatinScript,ProceedingsoftheICFHR2016,590-595.

Derolez, A., The Palaeography of Gothic Manuscript

BooksfromtheTwelfthtotheEarlySixteenthCentury,Cambridge,2003.

Hassner,T.,Rehbein,M.,Stokes,P.andWolf,L.2013.

Computationandpalaeography:Potentialsandlim-its.DagstuhlManifestos2:14–35.

LeCun,Y.,Bengio,Y.andHinton,G.(2015).Deeplearn-

ing.Nature521(7553):436–44.Simonyan, K. and Zisserman, A. (2015). Very Deep

Convolutional Networks for Large-Scale ImageRecognition. Proceedings of ICLR 2015.https://arxiv.org/abs/1409.1556/.

Stutzmann,D.andTarte,S.(2014).Digitalpalaeogra-

Page 4: 078. Kestemont-Script Identification in Medieval Latin ...was 91.17%, using a network architecture of 14 layers, inspired by the famous Oxford VGG net (Simonyan et al., 2015). The

phy: Newmachines and old texts : Executive sum-mary.DagstuhlReports4.7:112–34(112–14,132).

Stutzmann, D. (2015). Clustering of medieval scripts

throughcomputerimageanalysis:Towardsaneval-uation protocol.DigitalMedievalist10, http://digi-talmedievalist.org/journal/10/stutzmann/.

Stutzmann, D. (2016). ICFHR2016 Competition on the

Classification of Medieval Handwritings in LatinScript.http://icfhr2016-clamm.irht.cnrs.fr/.