
Post on 12-Aug-2020


• Test this workflow on additional properties such as creator.
• Scale this workflow for larger collections (300,000+ items in the H. John Heinz Collection).
• Explore topic modelling as a method for extracting additional subject headings and/or keywords.
• Investigate improved OCR technologies for scientific and mathematical formulas.
• Explore crowdsourcing solutions for normalizing and enhancing resulting metadata values.
• Share all resulting research and tools via GitHub repository.
• Continue to pursue accessible, practical solutions that we can share with the broader community.

• Excluding any scientific and mathematical equations, the OCR files for the case study are roughly 70% accurate.
• The case study uncovered more than 14 genre or form types. The largest (research reports) contains 1,041 items, while the smallest (drafts) contains 9.
• If there are fewer than 100 records in a “category” (i.e., genre/form), traditional methods are more practical and efficient (e.g., 10 minutes per record across 50 items equals roughly 8 hours of work).
• This method is most useful for metadata values that cannot be generated or modified in large batches (title and date).
• Unlikely to achieve majority accuracy (more than 50%) using this method without further refinement and normalization.
• Fixed resources and limited opportunities for training.
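The manual-effort figure in the list above is simple arithmetic, and a short sketch makes it easy to rerun for other category sizes. The function name `manual_hours` is hypothetical. Applying the 10-minutes-per-record rate to the largest category is an assumption made here for illustration only, since the poster quotes that rate for a 50-item example.

```python
def manual_hours(record_count, minutes_per_record=10):
    """Estimate hours of hand cataloguing for one category of records.

    The 10-minute default is the per-record rate quoted in the example;
    real rates will vary by genre and by cataloguer.
    """
    return record_count * minutes_per_record / 60


# The 50-item example: 500 minutes, or roughly 8 hours.
print(round(manual_hours(50), 1))    # 8.3

# Assumption: the same rate applied to the 1,041 research reports.
print(round(manual_hours(1041), 1))  # 173.5
```

At that assumed rate the largest category alone would take several working weeks by hand, which is consistent with reserving traditional methods for small categories.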

Investigation

This project aims to develop an automated, scalable workflow for extracting item-level metadata from digital records using tools and technologies employed by the community (archivists, digital humanists, etc.). In 2016, the Carnegie Mellon University Archives began research in this area as part of a large repository migration. Our digital collections have 3 million+ pages of items, and a repository-wide assessment found the metadata to be widely inaccurate and inconsistent. Due to the size of the digital collections, traditional refinement methods proved impractical.

Can we efficiently integrate this workflow into our current practices? How do we scale from pilot to program?

Case Study

Workflow

Test workflow on the William W. Cooper Collection (2,884 items) by:
• Evaluating existing OCR files and cleaning resulting text when necessary.
• “Categorizing” records based on genre, form, and other characteristics (e.g., correspondence).
• Using scripting tools (Python, RegEx) to highlight and extract key metadata values (title, date, creator, etc.).
• Employing Natural Language Processing (NLP) tools to identify potential subject headings and/or keywords.
• Using OpenRefine, Data Wrangler, etc. to clean and normalize resulting metadata values.
• Comparing research workflow with existing local practices.
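The scripting step in the workflow above can be sketched with Python's standard `re` module. This is a minimal illustration, not the project's actual scripts: the date pattern, the "first non-empty line is the title" heuristic, the function names, and the sample text are all assumptions invented for this example.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_NUMBER = {name.lower(): i for i, name in enumerate(MONTHS, start=1)}

# Assumed pattern: long-form dates such as "March 3, 1975".
DATE_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),?\s+(\d{4})\b",
    re.IGNORECASE,
)


def extract_date(ocr_text):
    """Return the first long-form date as an ISO 8601 string, or None."""
    match = DATE_RE.search(ocr_text)
    if match is None:
        return None
    month, day, year = match.groups()
    try:
        found = date(int(year), MONTH_NUMBER[month.lower()], int(day))
    except ValueError:  # impossible dates, e.g. OCR misreads like "February 30"
        return None
    return found.isoformat()


def extract_title(ocr_text):
    """Return the first non-empty line as a candidate title (naive heuristic)."""
    for line in ocr_text.splitlines():
        cleaned = " ".join(line.split())  # collapse stray OCR whitespace
        if cleaned:
            return cleaned
    return None


# Invented sample text, not taken from the collection.
sample = "\n  A Sample   Research Report\n\nPittsburgh, Pennsylvania\nMarch 3, 1975\n"
print(extract_title(sample))  # A Sample Research Report
print(extract_date(sample))   # 1975-03-03
```

Returning dates in ISO 8601 form does part of the normalization at extraction time. Heuristics this simple would still need review in a tool such as OpenRefine, given the OCR quality reported for the case study.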

Challenges

Future Research & Goals

Extracting Metadata from Digital Records Using Computational Methods

Kate Barbera
kbarbera@andrew.cmu.edu

@brightarchives

Ann Marie Mesco
mesco@andrew.cmu.edu

@amarieannm

https://github.com/cmuarchives/metadata.git
http://digitalcollections.library.cmu.edu
Special thanks to Dr. Jessica Ottis
