Kate Barbera: Extracting Metadata from Digital Records...
TRANSCRIPT
• Test this workflow on additional properties such as creator.
• Scale this workflow for larger collections (300,000+ items in the H. John Heinz Collection).
• Explore topic modelling as a method for extracting additional subject headings and/or keywords.
• Investigate improved OCR technologies for scientific and mathematical formulas.
• Explore crowdsourcing solutions for normalizing and enhancing resulting metadata values.
• Share all resulting research and tools via GitHub repository.
• Continue to pursue accessible, practical solutions that we can share with the broader community.
• Excluding any scientific and mathematical equations, the OCR files for the case study are roughly 70% accurate.
• The case study uncovered more than 14 genre or form types. The largest (research reports) contains 1,041 items, while the smallest (drafts) contains 9.
• If there are fewer than 100 records in a “category” (i.e., genre/form), traditional methods are more practical and efficient (e.g., 10 minutes per record across 50 items equals roughly 8 hours of work).
• This method is most useful for metadata values that cannot be generated or modified in large batches (title and date).
• Unlikely to achieve majority accuracy (more than 50%) using this method without further refinement and normalization.
• Fixed resources and limited opportunities for training.
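The manual-effort figure quoted above (10 minutes per record across 50 items) can be worked through as a small calculation. This is only a sketch: the 10-minute rate and the 100-record cutoff come from the bullets above, while the function names are hypothetical, and applying the same rate to other categories is an assumption rather than a measured result.

```python
def manual_hours(record_count, minutes_per_record=10):
    """Estimate the hours needed to describe one category of records by hand.

    The 10-minute default is the example rate quoted in the findings; a real
    rate would have to be measured locally.
    """
    return record_count * minutes_per_record / 60


def prefer_traditional_methods(record_count, threshold=100):
    """Apply the rule of thumb from the findings: below about 100 records in
    a category, traditional (manual) methods are the more practical choice."""
    return record_count < threshold


if __name__ == "__main__":
    # 50 items is the worked example from the findings; 9 and 1,041 are the
    # sizes of the smallest and largest categories in the case study.
    for count in (50, 9, 1041):
        hours = manual_hours(count)
        print(f"{count} records: about {hours:.1f} hours by hand, "
              f"traditional methods preferred: {prefer_traditional_methods(count)}")
```

For 50 items this gives 500 minutes, or a little over 8 hours, which matches the "roughly 8 hours" quoted above.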
Investigation
This project aims to develop an automated, scalable workflow for extracting item-level metadata from digital records using tools and technologies employed by the community (archivists, digital humanists, etc.). In 2016, the Carnegie Mellon University Archives began research in this area as part of a large repository migration. Our digital collections have 3 million+ pages of items, and a repository-wide assessment found the metadata to be widely inaccurate and inconsistent. Due to the size of the digital collections, traditional refinement methods proved impractical.
Can we efficiently integrate this workflow into our current practices? How do we scale from pilot to program?
Case Study
Workflow
Test workflow on the William W. Cooper Collection (2,884 items) by:
• Evaluating existing OCR files and cleaning resulting text when necessary.
• “Categorizing” records based on genre, form, and other characteristics (e.g., correspondence).
• Using scripting tools (Python, RegEx) to highlight and extract key metadata values (title, date, creator, etc.).
• Employing Natural Language Processing (NLP) tools to identify potential subject headings and/or keywords.
• Using OpenRefine, Data Wrangler, etc. to clean and normalize resulting metadata values.
• Comparing research workflow with existing local practices.
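The scripting step above (Python and RegEx for title and date) might look something like the following. This is a minimal sketch, not the project's actual code: the date pattern, the title heuristic, the function names, and the sample text are all assumptions made for illustration, and real OCR output at roughly 70% accuracy would need more tolerant patterns.

```python
import re
from datetime import date

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
MONTHS = "|".join(MONTH_NAMES)

# Assumed pattern: written-out dates such as "March 3, 1965" or "3 March 1965".
DATE_RE = re.compile(
    rf"\b(?:(?P<m1>{MONTHS})\s+(?P<d1>\d{{1,2}}),?\s+(?P<y1>\d{{4}})"
    rf"|(?P<d2>\d{{1,2}})\s+(?P<m2>{MONTHS})\s+(?P<y2>\d{{4}}))\b"
)


def extract_date(text):
    """Return the first written-out date in the text as YYYY-MM-DD, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    month = match.group("m1") or match.group("m2")
    day = match.group("d1") or match.group("d2")
    year = match.group("y1") or match.group("y2")
    try:
        parsed = date(int(year), MONTH_NAMES.index(month) + 1, int(day))
    except ValueError:
        # Impossible dates (e.g. "February 31") are likely OCR errors.
        return None
    return parsed.isoformat()


def extract_title(text, min_length=10, min_letter_ratio=0.7):
    """Guess a title: the first line that is long enough and mostly letters.

    This is a crude heuristic; letterheads or running headers can fool it,
    so results would still need review and normalization.
    """
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) < min_length:
            continue
        letters = sum(ch.isalpha() for ch in stripped)
        if letters / len(stripped) > min_letter_ratio:
            return stripped
    return None


if __name__ == "__main__":
    # Hypothetical OCR text, invented for this example.
    sample = "A Study of Linear Programming Methods\nMarch 3, 1965\nW. W. Cooper\n"
    print(extract_title(sample))
    print(extract_date(sample))
```

Normalizing dates to a single format at extraction time is one way to reduce later cleanup in a tool such as OpenRefine, though values pulled this way would still need spot-checking.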
Challenges Future Research & Goals
Extracting Metadata from Digital Records Using Computational Methods
Kate Barbera [email protected]
@brightarchives
Ann Marie [email protected]
@amarieannm
https://github.com/cmuarchives/metadata.git
http://digitalcollections.library.cmu.edu
Special thanks to Dr. Jessica Ottis