ECHO DEPository Technical Architecture Phase 1 Final Report
Report of project activities from Fall 2004 through 2007
[AuthorName]
University of Illinois at Urbana-Champaign in partnership with OCLC
Contributors: Matt Cordial, David Dubin, Janet Eke, Joseph Futrelle, Thomas Habing, Leah Houser, Patricia Hswe, William Ingram, Joanne Kaczmarek, Robert Manaster, Joel Plutchak, Beth Sandore, John Unsworth
December 2008
Revised July 2009
ReDraft 2.3 – not finalized
ECHO DEPository Technical Architecture Phase 1 Final Report
Narrative Report
National Digital Information Infrastructure Preservation Program
University of Illinois at Urbana-Champaign with OCLC
Table of Contents
1. Preface
1.1. About the ECHO DEPository Project (Phase 1)
1.2. About This Document
1.3. Review of Project Objectives and Deliverables
1.3.1. Models and Tools to Support Web Archiving
1.3.2. Repository Evaluation and Interoperability
1.3.3. Long-term Semantic Preservation Research
2. Archiving the Web: the Web Archives Workbench
2.1. Overview
2.2. The Web Archiving Problem
2.2.1. The Ubiquitous Web
2.2.2. Volume and Selection of Web Content
2.2.3. The Importance of Context
2.3. The Arizona Model: An Archival Approach to Web Archiving
2.3.1. Background
2.3.2. An Archival Approach
2.3.3. Arizona Model Summary
2.4. The Web Archives Workbench: Implementing the Arizona Model
2.4.1. Development Considerations
2.4.2. Overview of the Web Archives Workbench
2.4.3. A Tour of the Web Archives Workbench
2.4.4. Web Archives Workbench Tools Summary
2.4.5. Behind the Scenes: OCLC’s Technical Implementation of the Web Archives Workbench
2.5. Findings: User Feedback
2.5.1. Limited Resources and Limited Time
2.5.2. Complexity of the Tools
2.5.3. Web Content Delivery
2.6. Conclusions and Next Steps
3. Repository Evaluation and Interoperability
3.1. Repository Evaluation
3.1.1. Building an Evaluation Framework: Applying the Trusted Digital Repository Checklist to Repository Evaluation
3.1.2. Repository Testing: Ingest and Export Tests On Four Key Open-source Repositories
3.1.3. Testing Approach and Methodology
3.1.4. Repository Testing Findings: Narrative Reports, and Annotated Audit Checklist Commentary
3.1.5. Conclusion and Next Steps
3.2. Hub and Spoke Architecture (HandS): Supporting Repository Interoperability and Emerging Preservation Standards
3.2.1. HandS Overview
3.2.2. The Need for Interoperability and Preservation Support
3.2.3. Hub and Spoke Key Principles
3.2.4. METS Profile
3.2.5. HandS Workflow Cycle
3.2.6. HandS Technical Implementation
3.2.7. Lessons Learned
3.2.8. Next Steps: the Hub and Spoke
3.2.9. Conclusion
4. Preserving Meaning, Not Just Objects: Semantics and Digital Preservation
4.1. Introduction: The Need for a Semantics of Preservation Approach
4.1.1. The Preservation Semantics Problem
4.1.2. Our Goal
4.2. The Problems: Understanding Semantic Preservation
4.2.1. Problems Posed by Descriptive Practice and Structures
4.2.2. Understanding the Semantic Preservation Problem: Summary
4.3. Toward More Capable Archives and Repositories
4.3.1. Recap: The need for automated inference capability
4.3.2. BECHAMEL and Building a Metadata Ontology
4.3.3. Overcoming Semantic Problems in Metadata Encoding: A Resource and Description Vocabulary
4.3.4. Resolving Semantic Ambiguity: an Inference Example
4.3.5. Automated Inference as a Preservation Service
4.4. System Architecture
4.4.1. Architecture: Overview
4.5. Lessons Learned and Next Steps
4.6. Conclusion
5. A Note from the PIs
6. References
6.1. Archiving the Web: the Web Archives Workbench
6.1.1. Resources
6.2. Repository Evaluation and Interoperability
6.2.1. Repository Evaluation
6.2.2. HandS Tools Suite
6.3. Preserving Meaning, Not Just Objects: Semantics and Digital Preservation
7. Appendices
7.1. Web Archives User Guide
7.2. Web Archives Workbench Implementation Guide
7.3. Annotated Trusted Digital Repository Checklist
7.4. Using the Audit Checklist for the Certification of a Trusted Digital Repository as a Framework for Evaluating Repository Software Applications (DLib article)
7.5. Repository Testing Findings: Narrative
7.6. Repository Findings Commentary Using the Annotated Trusted Digital Repository Checklist
7.7. Resource Description Vocabulary: An Ontology of Metadata Descriptions
7.8. Sustained Access to Ejournals: Context Value, and Future Prospectus
1. Preface
1.1. About the ECHO DEPository Project (Phase 1)
The ECHO DEPository (Phase 1) is an NDIIPP-partner research and development project at the University of Illinois at Urbana-Champaign (UIUC) in partnership with OCLC; the National Center for Supercomputing Applications (NCSA); the Michigan State University Library; and an alliance of state libraries from Arizona, Connecticut, Illinois, North Carolina and Wisconsin. Our aim is to support the digital preservation efforts of the Library of Congress by addressing issues of how we collect, manage, preserve, and make useful the enormous amount of digital information our culture is now producing.
Phase 1 project activities (Fall 2004 through 2007) included developing web archiving tools, evaluating existing repository software, developing an architecture to enhance existing repositories’ interoperability and preservation features, and modeling next-generation repositories for supporting long-term preservation.
1.2. About This Document
This narrative report provides a detailed overview of each of the areas of work described below. The attached appendices provide specific additional project deliverables. A collected archive of all project deliverables, including posters, presentations and publications, is forthcoming. Several sections include material contributed by the same authors who wrote articles on ECHO DEP projects for the Winter 2009 issue of Library Trends (Volume 57, Number 3), guest-edited by Patricia Cruse and Beth Sandore.
1.3. Review of Project Objectives and Deliverables
1.3.1. Models and Tools to Support Web Archiving
Goals:
• Articulate a methodology for selecting digital materials at the aggregate level based on archival principles, and use provenance, functional analysis, and context analysis to facilitate meta-tagging for retrieval.
• Build a suite of open source software tools that support identification, capture, and description of websites.
Deliverables:
• The Arizona Model, an archival approach to web archiving (developed by the Arizona State Library and Archives)
• The Web Archives Workbench suite of tools (developed by OCLC)
Overview
Traditionally, web archiving methods have focused on either manual or automated capture approaches, both problematic. Manual item-level selection fails due to the enormous number of resources on the web, while fully automated web-capture approaches risk burying substantive materials under a mountain of irrelevant information. To address this fundamental problem, OCLC built a suite of open-source web archiving tools that bridge the gap between manual selection and automated capture. Based on the Arizona Model, which provides for integration of both human and machine processes, the Web Archives Workbench (WAW) comprises five tools to help archivists identify, describe, select and harvest web-based content for storage in any repository.
Details
See Section 2 of this report, and Appendix items 7.1 and 7.2.
1.3.2. Repository Evaluation and Interoperability
Goals:
• Install, configure, test, and evaluate existing open-source digital repository systems, with particular regard to support for interoperability and emerging preservation standards.
Deliverables:
• A repository evaluation framework based on the 2005 RLG/NARA Audit Checklist for the Certification of a Trusted Digital Repository, Draft for Public Comment, with mapping to current version
• Repository testing findings
• The Hub and Spoke (HandS) tools suite supporting repository interoperability and preservation metadata
• PREMIS-based METS profiles
Overview
To help understand how well existing repository systems support today’s digital preservation efforts, we evaluated four existing open-source repository systems (DSpace, ePrints, Fedora and Greenstone). Evaluation activities included the ingestion and manipulation of half a terabyte of heterogeneous content in each repository system, and the development of a preservation-focused repository evaluation framework based on emerging standards for trusted digital repositories.
Early evaluation findings led to the development of the Hub and Spoke (HandS) Architecture, a proof-of-concept suite of tools to enhance the interoperability and preservation features of the systems tested. The HandS suite supports today’s libraries’ efforts to manage content in multiple repository systems and to retain valuable preservation data. It includes the development of a common standards-based method (a PREMIS-based METS profile) for packaging content that allows digital objects to be moved in and out of more repositories more easily while supporting the collecting of technical and provenance information crucial for long-term preservation. This model has potentially wide applicability, and is already in use in several real-world archiving projects.
Details
See Section 3 of this report, and Appendix items 7.3, 7.4, 7.5 and 7.6.
1.3.3. Long-term Semantic Preservation Research
Goals:
• Research techniques to migrate the semantic content of documents and document structures across generations of encoding schemes.
Deliverables:
• Articulation of semantic preservation problems posed by current metadata practice
• Development of a formal metadata description ontology
• Demonstration of automated inference using reasoning software
Overview
Current first-generation repository systems preserve the structure of information, not its meaning or semantics. When we move content from one system to another, this structure may be subtly or unsubtly transformed. To meaningfully preserve our digital content over time, we therefore have to infer meaning or semantics from structures that change over time. Because of the volume of information to be preserved, we need to be able to do this with automated tools.
The UIUC Graduate School of Library and Information Science (GSLIS) collaborated with the National Center for Supercomputing Applications (NCSA) to contribute to the development of next-generation archives with semantic analysis capabilities to reduce long-term preservation risks. Using repository technology developed at NCSA and automated reasoning tools developed at GSLIS, we model how semantic inference capability may help next-generation archives head off long-term preservation risks. This work includes articulating semantic preservation problems posed by current practice, and analyzing real-world data migration examples to develop a formal understanding of how descriptive information about archived digital resources is structured. This understanding is presented in a formal metadata description ontology.
Details
See Section 4 of this report, and Appendix item 7.7.
2. Archiving the Web: the Web Archives Workbench
2.1. Overview
A core deliverable of the ECHO DEPository Project's first phase was OCLC's development of the Web Archives Workbench (WAW), an open-source suite of Web archiving tools for identifying, describing, and harvesting Web-based content for ingest into an external digital repository. Released in October 2007, the suite is designed to bridge the gap between manual selection and automated capture based on the "Arizona Model," which applies a traditional aggregate-based archival approach to Web archiving. (By “aggregate-based archiving,” we mean archiving items by group or in series, rather than individually.) Core functionality of the suite includes the ability to identify Web content of potential interest through crawls of "seed" URLs and the domains they link to; tools for creating and managing metadata for association with harvested objects; website structural analysis and visualization to aid human content selection decisions; and packaging using a PREMIS-based METS profile developed by the ECHO DEPository to support easier ingestion into multiple repositories.
The next sections provide an overview of the Web archiving problem; background on the Arizona Model; an overview of how the tools work and their technical implementation; and a brief summary of user feedback from testing and implementing the tools. Appendix items include the Web Archives Workbench User Guide (7.1), which provides detailed screen-by-screen documentation of the tool suite’s functionality. The WAW Implementation Guide is provided in Appendix 7.2.
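The idea of discovering content through "seed" URLs and the domains they link to can be illustrated with a minimal Python sketch. This is not the Workbench's implementation; the function name linked_domains, the class LinkCollector, and the .example host names are hypothetical, and the sketch parses an in-memory page rather than fetching one from the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def linked_domains(seed_url, html):
    """Return the sorted host names that a seed page links to (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(html)
    domains = set()
    for href in collector.hrefs:
        absolute = urljoin(seed_url, href)  # resolve relative links against the seed
        parsed = urlparse(absolute)
        # Keep only Web links; skip mailto:, javascript:, and similar schemes.
        if parsed.scheme in ("http", "https") and parsed.hostname:
            domains.add(parsed.hostname)
    return sorted(domains)


# Made-up sample page standing in for a fetched seed URL.
SAMPLE_HTML = """
<a href="/reports/2006.pdf">Annual report</a>
<a href="http://www.agency-b.example/news">Partner agency</a>
<a href="mailto:info@agency-a.example">Contact</a>
"""

found = linked_domains("http://www.agency-a.example/index.html", SAMPLE_HTML)
print(found)
```

A list of domains like this is only a starting point: in the approach described here, a person still decides which of the discovered domains are in scope.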
2.2. The Web Archiving Problem
2.2.1. The Ubiquitous Web
For a broad range of organizations, Web sites are now the delivery mechanism of choice for nearly any type of information content. Much of this content is created and disseminated in electronic formats only, with printed (hard) copies considered just a courtesy or convenience. The electronic format environment, while expedient for current access purposes, presents challenges for anyone charged with preserving the information over time. These challenges include the sheer volume of Web-published information, traditional issues of selection and description, as well as the technical challenges associated with long-term preservation of digital objects.
2.2.2. Volume and Selection of Web Content
An immediate challenge of Web archiving is assuring that all content of long-term relevance delivered through the Web is identified and collected (i.e., harvested). Difficulties arise first from the task of selecting pertinent content for preservation from the enormous volume of information streaming from Web servers at any given point in time. Selection decisions will be influenced by the charge of the individual responsible for capturing specific content types (such as a librarian or archivist) based on appraisal, or collection development; on policies created in concert with the mission of the institution or organization; and on the audience, or user community, being served. The sheer volume of content published on the Web makes a fully manual perusal of online resources infeasible. Volume is still a factor even when Web crawlers (as explained below) are engaged.
The dynamic nature of the Web also creates problems for selection and harvesting of content. URLs can change overnight; resources can be taken off-line with little or no notice; and new, related content can be added in new or different directories than those visited previously by a Web crawler harvesting an organization's website. Although Web crawling automates archiving of a website, it is quite possible for Web crawlers simply to miss content because of a “robots exclusion protocol” (activated by the site creator to make parts of a site “uncrawlable”) or because of the impenetrable character of the Deep Web (where content, such as a results page to a Web form, is inaccessible to a Web crawler or Web spider).[1] In addition, the sheer size of the Web renders scalable Web crawling an almost intractable technical challenge. Knowing where to find all content eligible for harvesting according to collection development and appraisal policies becomes nearly impossible without intentional coordination or without Web crawling tools and resources that are designed for, and take account of, the fluid nature of website content and the massive scale of the Web.
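To make the robots exclusion point concrete, the following Python sketch shows how a crawler that honors robots.txt ends up skipping part of a site. It uses the standard library's urllib.robotparser; the robots.txt content, the user-agent string "example-harvester", and the URLs are made up for illustration and do not describe any particular harvester.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that closes one directory to all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the subset of urls that the given robots.txt permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidates = [
    "http://www.agency.example/reports/2006.html",
    "http://www.agency.example/private/draft.html",
]
crawlable = allowed_urls(ROBOTS_TXT, "example-harvester", candidates)
print(crawlable)
```

Anything under the disallowed directory never reaches the harvest, which is one way content of long-term relevance can be missed without anyone noticing.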
2.2.3. The Importance of Context
Context is about understanding relationships between different and discrete pieces of information. It is about understanding why the information was created, by which individual or organization, and at what point in time. Contextual information can help define the boundaries and the scope of harvested content. As with analog objects, much of the usefulness of digital objects which make up our cultural record depends on our having descriptive and contextual information about them. Once content is identified and harvested, it is necessary to provide access to the digital object. Such content access means that attention should be paid to capturing accurate metadata along with the content itself. This contextual metadata will help describe the origin (or "provenance") of the resource, as well as why and when it was created. (For example, is the discovered resource one in a series of annual reports from a particular state agency? Is it a single publication summarizing research findings? Or does it encompass results from a specific survey taken as part of a larger effort to revamp community services?) In the case of a digital object, metadata not only supports human interpretation of content, it is also needed to provide crucial technical information for maintaining long-term viability of the object itself.
[1] Web archiving. (2009, March 2). In Wikipedia, The Free Encyclopedia. Retrieved 21:56, March 2, 2009, from http://en.wikipedia.org/w/index.php?title=Web_archiving&oldid=274363238
2.3. The Arizona Model: An Archival Approach to Web Archiving
The Web Archives Workbench tool suite is premised on the principles of the “Arizona Model,” an aggregate-based approach to Web archiving designed to bridge the gap between human selection and automated capture. “Aggregate-based” means that rather than being archived singly, or individually, items are organized (grouped) in series, or in aggregates. The Arizona Model was developed in 2003 by Richard Pearce-Moses of the Arizona State Library and Archives.
2.3.1. Background
Most state libraries and archives have mandates to collect state agency publications and make them available to the public. To this end, there are well-established depository systems that have worked with paper publications for many years. In a Web environment the nuances of determining what a publication is, or who is responsible for selection and collection of particular information resources, become less clear. Nonetheless, to meet these mandates librarians and archivists must still identify, select, acquire, describe, and provide access to state agency information "published" on websites.
In early attempts to develop a collection of state agency electronic publications, two approaches came about. According to Pearce-Moses, Cobb, and Surface (2005), the first approach has its premise in “traditional library processes of selecting documents one by one, identifying appropriate documents for acquisition; electronically downloading the document to a server or printing it to paper; then cataloging, processing, and distributing it like any other paper publication.” (175) While this approach ensures that valuable documents will be gathered, its dependence on manual selection limits archiving to only a very few items. Scaling this process in accordance with the vastness of Web-based documents would necessitate an expansion in personnel that few state libraries have the funding to support. (Pearce-Moses, Cobb & Surface, 2005)
Alternatively, in the other approach, software tools that automate regularly occurring Web crawls are engaged. As Pearce-Moses, Cobb, and Surface assert, this model “trades human selection of significant documents for the hope that full-text indexing and search engines will be able to find documents of lasting value among the clutter of other, ephemeral Web content captured in the process.” (176) Yet, while this model relieves librarians and archivists of the upfront onus of selection and organization, at the same time it may unduly burden future searchers, if full-text indexing and search capabilities do not evolve as anticipated. The Arizona Model, explained in detail below, constitutes a third approach to Web archiving, incorporating both human assessment and automated tools.
2.3.2. An Archival Approach
The Arizona Model applies an archival perspective to curating collections of Web publications. It exploits certain telling parallels between websites and archives: namely, the concept of provenance (i.e., documents classed together stem from the same source) and the organizational structure inherent in both these kinds of collections (directories and subdirectories for websites, and series and subseries for archives). (Pearce-Moses, Cobb & Surface, 2005) In theory, if websites organize Web publications using common file directory structures, information about individual documents within sub-directories could be inherited from parent directories.
In the Arizona Model, which draws on basic archival practice, websites are handled as hierarchical aggregates rather than as individual items, and the original order of the documents (the order in which the creating agency oversaw them) is maintained. Provenance and original order are considered important contextual pieces of information. Retaining documents in the order in which they were originally managed and keeping them clustered together based on the originating agency enhance one’s knowledge of the creation and original use of the documents. Provenance and original order also allow for "inheritance" of higher-level metadata meant to describe the home agency from which the documents came and the way the documents were originally arranged. Finally, an archival approach to curating a collection of Web documents, focusing first on aggregates (collections and series) rather than on individual documents, trims the number of items that need to be appraised by a human down to a more manageable number.
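The "inheritance" of higher-level metadata along a directory hierarchy can be sketched in a few lines of Python. This is an illustration of the principle only, under the assumption that series-level descriptions are keyed by directory path; the SERIES_METADATA table, the field names, and the function inherited_metadata are hypothetical and are not taken from the Workbench.

```python
from urllib.parse import urlparse

# Hypothetical series-level descriptions, keyed by website directory.
SERIES_METADATA = {
    "/": {"creator": "Example State Agency"},
    "/reports/": {"series": "Reports"},
    "/reports/annual/": {"series": "Annual Reports", "frequency": "annual"},
}


def inherited_metadata(url, series_metadata):
    """Merge metadata from the site root down to the document's own directory.

    More specific directories override values set by their parents, so a
    document only needs describing at the level where something changes.
    """
    path = urlparse(url).path
    directory = path.rsplit("/", 1)[0] + "/"  # drop the file name, keep the directory
    merged = dict(series_metadata.get("/", {}))
    prefix = "/"
    for segment in [s for s in directory.split("/") if s]:
        prefix += segment + "/"
        merged.update(series_metadata.get(prefix, {}))
    return merged


doc = inherited_metadata(
    "http://www.agency.example/reports/annual/2006.pdf", SERIES_METADATA
)
print(doc)
```

In this sketch the individual PDF is never described by hand: it picks up its creator from the site root and its series and frequency from the directories above it, which is the labor-saving effect the aggregate approach aims for.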
2.3.3. Arizona Model Summary
The Arizona methodology is based on an archival approach to the Web that incorporates both human selection and automated capture. In this approach, Web materials are managed in a way similar to the organization of materials in paper-based archives: as a hierarchy of aggregates rather than as individual items. This approach reduces the sheer volume problem of preserving Web materials to a more practical size, while maintaining a scalable degree of human involvement. It is the guiding model for OCLC’s Web Archives Workbench.
2.4. The Web Archives Workbench: Implementing the Arizona Model
The Arizona Model is particularly instructive in its evocation of where, in the practice of archival management, automation can be considered most useful. That is, while technology may be applied for information processing activities such as data searching and tracking, and list construction and classification, tasks for distinguishing whether content is in-scope or is valuable are best reserved for humans. The overall goal behind developing a suite of tools based on the Arizona Model is to achieve a productive complement between automated processing and human decision-making, all the while adhering to established archival principles.
The software that OCLC created, the Web Archives Workbench, comprises five tools to identify, select, describe, and harvest Web-based materials, as well as to keep track of, or log, these activities and to generate reports about them. In doing so, the tools serve as a conduit between human involvement (via manual selection) and computerized capture of Web content: they convert the archivist's policies for collecting content created on the Web to software-centered rules and configurations. They also assist information professionals by providing the means to add metadata to harvested objects as aggregates. In addition, the tools implement the PREMIS-based METS profiles developed by ECHO DEP (at the University of Illinois) for packaging content; by design these profiles facilitate ingestion into multiple external repositories and support long-term preservation.[2] Packaging is the last step in the WAW workflow, after which the objects are ready for ingest into an external digital repository.
[2] Two METS profiles developed by ECHO DEP are at work here: the ECHO Dep Generic METS Profile for Preservation and Digital Repository Interoperability (accessible at http://www.loc.gov/standards/mets/profiles/00000015.html) and the ECHO Dep METS Profile for Web Site Captures (accessible at http://www.loc.gov/standards/mets/profiles/00000016.html). The former is the 'top level' format-generic profile, which focuses on implementing PREMIS. The latter, a web capture profile, is an example of a 'sub-profile,' which is used with the first one to provide a structure for more format-specific information.
2.4.1. Development Considerations
OCLC led the development of the tool suite. Prior to tool design and development, OCLC carefully considered the user community, which it identified as a blend of librarians and archivists. Significant to its consideration was the issue of terminology: how should tools and features in the Web Archives Workbench be named if a mixed community of librarians and archivists was to serve as its user base? The word “series,” for example, might carry meaning and usage for an archivist that are different from, or even unfamiliar to, a librarian. Thus, in exploring the user community, OCLC had archivists look at new types of metadata and asked librarians to think about principles of archiving, such as archival series and the curation of documents in aggregate rather than as individual items. Eventually, OCLC elected not to devise new terminology for the concepts at issue; not only did the team conclude that terminology was, in essence, a training matter, it also saw that the work of librarians and archivists often overlaps, i.e., each is frequently engaged in the milieu of the other.
In doing high-level analysis for the user interface, OCLC arrived at several working assumptions that had some bearing on the design of the tool suite. One assumption was that because the tools in the Web Archives Workbench might change over time, they needed to be “aware” of each other and enable the sharing of data, but, just as important, the user should have the ability to opt not to use a tool in the Workbench. Through interviews with librarians and archivists, OCLC also learned that harvesting responsibilities often were shared among individuals; as a consequence, data generated by a tool had to be rendered shareable by multiple users, and simultaneously so. This feature would allow a user to view the work of another. In addition, rather than trying to integrate the Workbench into an institution’s many authentication schemes, OCLC incorporated a simple scheme, allowing the Workbench to run with just basic administration.
In terms of harvesting, OCLC designed more than one harvesting workflow, so that a user could select the appropriate level of analysis and sophistication for a task. For instance, the Quick Harvest feature is a single-screen launch point that runs a harvest immediately. The Analysis tool, which is part of an extended harvesting workflow, requires more set-up, but it results in a bigger “pay-off” in terms of the website change observations it handles automatically for the user (this is explained below more formally in the description of the Analysis tool).
Finally, where the deposit of harvested information is concerned, OCLC knew that ingest to a variety of repositories, including its own Digital Archive as well as DSpace repositories, would need to be accommodated. A clean, simple interface was created between the point where the Workbench ends and a repository software application would begin; that is, the Workbench generates harvested packages of content in a file system that the repository then picks up and processes. (This is the point in the workflow at which the above-mentioned PREMIS-based METS profiles developed by ECHO DEP are implemented.)
2.4.2. Overview of the Web Archives Workbench
The Web Archives Workbench is a suite of web archiving tools for identifying, selecting, describing and harvesting web-based content based on library and archival practice. It bridges the gap between manual selection and automated capture by transforming collection policies into software-based rules and configurations. It accommodates a variety of web harvesting approaches, including mass harvesting, selective harvesting, and individual document harvesting. Content is packaged using the ECHO DEP METS profile, which is designed to support the collection of PREMIS preservation metadata, and to facilitate ingestion into a variety of external repositories. The five tools in the Workbench are the Discovery, Properties, Analysis, Harvest, and System tools. Below (Fig. 1) is an overview of the Workbench Workflow, followed by a more detailed tour of the functionality of each tool.
Figure 1: Overview of WAW Workflow
2.4.3. A Tour of the Web Archives Workbench
The screenshot in Figure 2 below displays the main WAW tools screen after the user has logged on. The five tools are exemplified by the topmost row of tabs. (Though the Alerts tab sits in this row, it is less a tool than a feature of the Workbench. It enables users to access a collection of reports and alerts for the Discovery, Properties, Analysis, and Harvest Tools.) In the interface for the WAW tools, a tab is colored in to signify which tool is open, or active, at that particular moment. In Figure 2, for example, the Discovery tab is shaded, because the Discovery tool is currently active. Similarly, the Entry Points tab is shaded, because it is active as a component of the Discovery tool.
Figure 2: Screenshot of WAW interface home screen
A key advantage to the Workbench tools is that harvesting of Web content may be scheduled so that it occurs on a regular basis. However, the Workbench tools also offer users the alternative of running a one-time harvest. This is known as the Quick Harvest, accessible via the Harvest tab. Quick Harvest is addressed briefly in the discussion below of the Harvest tool.
2.4.3.1. The Discovery Tool: Finding Web Content of Interest
The first step in constructing an archive of Web-based resources is to determine which parts of the Web hold desirable, and thus collection-worthy, content. This step lies at the crux of the Discovery Tool. The Discovery Tool aids in identifying potentially relevant websites by crawling relevant "seed" entry points to generate a list of domains to which the "seed" sites link. (Note: An entry point is a specific website URL where the Discovery Tool will begin to search for domains or collect Web content. A domain is a server on the Internet that may contain Web content and is identified by a high-level address. For example, http://www.illinois.gov/news/ is a website, and its domain is "Illinois.gov". Domains do NOT include “http://”.)3
In an approach that effectively borrows from citation analysis, the Discovery Tool is designed on the idea that on-topic sites likely point to other sites addressing a similar topic. The domains in the generated list are then manually evaluated as in-scope or out-of-scope, based on subject interest and collecting policies. (See Figure 3, which shows a list of domains returned after entry points have been crawled, as well as radio buttons that note the scope for each domain.) This process results in a list of domains defining a sub-set of the Web that is relevant for the user's archiving purposes. Domains marked as in-scope can be associated with an Entity (i.e., creator, or agency, or organization responsible for the Web content). Later, in the Properties and Analysis Tools, metadata associated with entities (creators such as agencies or organizations) can be inherited by content harvested from a particular website.
Figure 3: Screenshot of the interface for the Domains feature of the Discovery Tool
In sum, the Discovery Tool is used to:
• Generate a list of potentially relevant domains by crawling seed sites.
• Assign domains as in-scope or out-of-scope.
• Add domains manually to the Domains list.
• Associate domains with entities (creating agencies or organizations).
3 A note about capitalization in this section that provides a tour of the software: here, entry points and domains are not capitalized, because we are speaking of them in the general use of the Discovery Tool. However, they are also features of the Discovery Tool. When we discuss them as such, they will be capitalized.
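The link-following idea behind the Discovery Tool can be illustrated with a short sketch. This is not the Workbench's own code: the names `LinkCollector`, `domain_of`, and `linked_domains` are hypothetical, the rule of dropping a leading "www." is a simplifying assumption, and the sample page (with made-up example domains) is passed in as a string rather than fetched from the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_of(url):
    """Return the host of a URL without the scheme.

    Assumption for this sketch: a leading 'www.' is dropped.
    """
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def linked_domains(entry_point, html):
    """List the distinct domains a seed page links to, excluding its own."""
    collector = LinkCollector()
    collector.feed(html)
    own = domain_of(entry_point)
    found = set()
    for href in collector.hrefs:
        # Resolve relative links against the entry point before parsing.
        domain = domain_of(urljoin(entry_point, href))
        if domain and domain != own:
            found.add(domain)
    return sorted(found)


seed = "http://www.illinois.gov/news/"
page = """
<a href="/about/">About</a>
<a href="http://www.example.org/">A linked site</a>
<a href="http://agency.example.gov/reports/">An agency site</a>
"""
candidates = linked_domains(seed, page)
print(candidates)
```

Each domain in `candidates` would then be reviewed by a person and marked in-scope or out-of-scope, as described above.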
2.4.3.2. The Properties Tool: Entering Metadata to Describe Content Creators (Entities)
Another premise of the Arizona Model is that, as much as possible, metadata should be entered only once and be inherited by associated harvested objects. After the Entry Points and Domain features of the Discovery Tool are run, and entities (i.e., content creators) have been associated with domains, metadata about the resulting entities may be entered via the Properties Tool. Besides enabling the management of information about entities, the Properties Tool also allows the user to describe the relationships (e.g., parent/child) of entities with one another, as well as enter other information such as contact information. Importantly, the Properties Tool can also be easily engaged to create analyses and series from entities' websites. The purpose of enabling analysis of a website is to examine its structure, i.e., the directories that make up the website. (For more on the Analysis Tool, see below.)
In sum, the Properties Tool is used to:
• Create and manage a list of content creators (entities).
• Assign metadata and other properties to entities.
• Specify websites that entities are responsible for, and create analyses and series (explained below) based on those websites.
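The "enter once, inherit everywhere" premise can be sketched as a small data model. The class and field names below (`Entity`, `HarvestedObject`, `inherited_metadata`) are hypothetical and do not come from the Workbench. The rule that a child's values override its parent's is also an assumption, since the report does not say how conflicts are resolved.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Entity:
    """A content creator; all field names here are illustrative."""
    name: str
    metadata: dict = field(default_factory=dict)
    parent: Optional["Entity"] = None

    def inherited_metadata(self):
        """Merge metadata from the root ancestor down, so that values set
        on a child override those set on its parent (an assumption)."""
        merged = self.parent.inherited_metadata() if self.parent else {}
        merged.update(self.metadata)
        return merged


@dataclass
class HarvestedObject:
    url: str
    entity: Entity
    metadata: dict = field(default_factory=dict)

    def effective_metadata(self):
        """Entity-level metadata entered once, plus item-level additions."""
        merged = self.entity.inherited_metadata()
        merged.update(self.metadata)
        return merged


# Made-up entities and URL, for illustration only.
state = Entity("State of Illinois", {"publisher": "State of Illinois", "language": "en"})
agency = Entity("Example Agency", {"creator": "Example Agency"}, parent=state)
report = HarvestedObject("http://agency.example.gov/annual-report.pdf", agency,
                         {"title": "Annual Report"})
print(report.effective_metadata())
```

In this sketch the publisher and language are typed once on the parent entity, and every object harvested from any child agency's website picks them up without further data entry.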
2.4.3.3. The Analysis Tool: Visualizing the Structure of a Website
Through the Analysis Tool it is possible to discern whether there is valuable content in the directories that comprise a website and, if so, to identify those chunks of content. "Series" refers to flexible aggregates of content that are analogous to archival series, which may be a whole website or a portion of it (e.g., only PDFs of annual reports), or even one individual page or document from websites. Loosely defined, a series is any collection of Web material that a user chooses to collect in one "bucket." In addition, series are used in order to drive the Workbench harvest operations. While series may be established within the Properties Tool, they can also be established and managed using the Analysis Tool, then harvested and packaged in the Harvest Tool.
The Analysis Tool has two functional areas:
• the Analysis screen, which provides visualization tools to aid in content selection decision-making and in series structure decisions. Here, too, a baseline analysis can be created against which to measure future website analyses;
• the Series screen, where series are created, edited, and managed; Series objects are kept; and Series harvests are regulated.
The Analysis Tool is used to:
• Analyze the structure of a website.
• Enter associated entities.
• Set a baseline analysis for comparison with future analyses.
• Adjust settings, such as spider settings and change notification threshold settings.
• Define a "series" for harvesting (e.g., harvest as an individual object), with option to associate it with an entity.
• Hold series objects prior to harvest.
• Schedule harvests of series.
In addition, operations for holding series objects and harvesting them may be accessed via the Properties Tool.
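The report does not say how the Analysis Tool measures change against a baseline, so the following is only one plausible reading, sketched under stated assumptions: an analysis is reduced to a set of paths, change is the share of baseline paths added or removed, and an alert fires when that share reaches a user-set threshold. The function names and sample paths are hypothetical.

```python
def change_ratio(baseline, current):
    """Fraction of paths added or removed, relative to the baseline size.

    Both arguments are sets of directory or file paths found by an analysis.
    """
    if not baseline:
        return 1.0 if current else 0.0
    changed = baseline.symmetric_difference(current)
    return len(changed) / len(baseline)


def needs_alert(baseline, current, threshold):
    """True when observed change meets or exceeds the notification threshold."""
    return change_ratio(baseline, current) >= threshold


# Made-up site structures: one path removed and two added since the baseline.
baseline = {"/news/", "/reports/", "/reports/2005/", "/about/"}
later = {"/news/", "/reports/", "/reports/2005/", "/reports/2006/", "/contact/"}

print(change_ratio(baseline, later))
print(needs_alert(baseline, later, threshold=0.5))
```

A lower threshold would notify the user of smaller structural changes, at the cost of more frequent alerts.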
2.4.3.4. The Harvest Tool: Reviewing, Packaging, and Ingesting Harvested Content
All the harvests in the Workbench, including series harvests (via the Analysis Tool) and quick harvests, are listed in the Harvest Tool. The Harvest Tool is used to monitor the status of harvests and to provide an opportunity to review and modify the harvest before packaging it up and ingesting it into a repository. There may be single-object harvests or multiple-object harvests, depending on whether the option to harvest content as individual objects was selected in the Series details screen of an Analysis-based Series (i.e., in the Analysis Tool). The Quick Harvest feature schedules one-time harvests of content based on a URL inputted directly into the Harvest Tool.
After harvests are complete they may be reviewed, at which time additional metadata may be assigned. The user can render, or display, the harvested content within the WAW tool, off the Harvest Results page. The user can actually "step into" the harvested content at both the harvest starting point and at any other point in the website (via the website file structure display), and the software will render the website appropriately. The purpose of the display feature in the Web Archives Workbench is to allow the user to verify the correctness of what was harvested, “correctness” meaning that all the information expected to have been collected is collected. Once the harvested content is confirmed as correct, it then can be ingested into the user's local repository.
In sum, the Harvest Tool is used to:
• Monitor the status of harvests scheduled in the Analysis Tool.
• Delete completed harvests.
• Review completed harvest content, whether single-object or multi-object, prior to ingest.
• Review completed harvests; if desired, edit metadata and/or include/exclude content.
• Ingest harvested content into a repository.
• Launch a one-time quick harvest using the Quick Harvest Tool.
2.4.3.5. The Alerts Tab: Workbench Notifications
As mentioned above, the “Alerts” tab is not a tool but, rather, a feature for notifying the user of a variety of systems information. This information includes notification about errors, incomplete processes, completed processes, and new information (such as the discovery of a new domain, or a new folder encountered during analysis). In short, the Alerts Tab is used to review reports and alerts about Workbench functions.
2.4.3.6. The System Tools: Monitoring and Managing Workbench Activities
The System Tools tab contains a number of behind-the-scenes functions that affect and report on activities of the five main tools of the Workbench. The System Tools are divided into four functional areas:
• the Audit Log page, which displays recent Workbench activities and events;
• the Spider Settings page, where the user can configure default Domain, Analysis, and Harvest spider settings, as well as create additional Domain, Analysis, and Harvest spiders with custom settings. Specifically, types of spider settings include, but are not limited to, depth (how deeply a website should be crawled, or spidered) and parameters of time (when, how frequently, and for how long);
• the Import/Export page, through which the user can import or export a variety of metadata commonly used in the Workbench. These include entities, domains, and subject headings;
• the Reports page, which generates printable reports on activities of the main five Workbench tools. It offers a view of in-development entity and series reports.
2.4.4. Web Archives Workbench Tools Summary
The Web Archives Workbench implements an archival approach to the selection and preservation of digital (Web-based) content. The Workbench automates much of the methodology embraced by the Arizona Model, particularly beyond the initial selection decisions made by the archivist (e.g., deciding at the start of the archiving process which website, or which part of a website, to capture and preserve). After selection parameters are set, the Workbench facilitates the capture and management of the digital materials in hierarchical aggregates, not unlike the archiving of print-based materials.
OVERVIEW OF WEB ARCHIVES WORKBENCH TOOLS
Discovery Tool
• discover domains
• group and prioritize domains
Comprising the Entry Points and Domains tabs, the Discovery Tool helps to identify potentially relevant web sites by crawling relevant “seed” Entry Points to generate a list of domains that they link to. At the end of this process the users have a list of domains that defines the sub-set of the web relevant for their archiving purposes. From here, the Properties and Analysis Tools are used to manage creator information about domains, and associate this information with harvests of content.
Properties Tool
• organize collection space
• create metadata
Comprising the Entities tab, the Properties Tool is used to maintain information about content creators or ‘Entities’ (e.g., government agencies), and associate them with the domains and web sites they are responsible for. The Properties Tool also allows users to describe the relationships (e.g., parent/child) of Entities with one another, as well as enter high-level metadata about them that may be inherited by content harvested from their web sites. Importantly, the Properties Tool can also be used to create and associate Series with Entities’ web sites. Series and harvests are then further managed using the Analysis and Harvest/Package Tool.
Analysis Tool
• visualize site structure
• associate metadata
• schedule harvests
Comprising the Analysis tab and the Series tab, the Analysis Tool provides website structure visualization tools to aid content selection decisions, and allows users to define archival Series, associate metadata with these series, and schedule recurring harvests of Web content. Harvesting activities are then monitored and managed in the Harvest Tool.
Harvest Tool
• review content
• package for ingest in external repository
Comprising the Harvester and Quick Harvest tabs, the Harvest Tool lists all harvests within the Workbench, including Series harvests scheduled using the Analysis Tool as well as Quick Harvests. It is used to monitor their status and to initiate the final harvesting and ingest steps for completed harvests, including reviewing harvest contents and metadata before ingest. This is the final step in the Web Archives Workbench workflow. It also offers a separate Quick Harvest feature.
System Tools
• reports and settings
The System Tools manage and monitor Workbench activities, reporting on operations undertaken in the four other tools. They have four functional sections: an Audit Log page (shows recent Workbench activities); a Spider Settings page (parameters for spidering may be set here); an Import/Export page (for moving metadata); and a Reports page (for producing printable reports about activities performed by the other tools).
2.4.5. Behind the Scenes: OCLC’s Technical Implementation of the Web Archives Workbench
An ISO 9001 company, OCLC has an externally audited Quality System based on the requirements of ISO 9001 as an aid for ensuring that products meet user expectations and specified requirements. OCLC's project development lifecycle is a process that specifies how OCLC services are marketed and developed. This process includes lifecycle documents such as project plans, requirements, design, test plans, operations support plans and post-project reviews. The Web Archives Workbench program followed this lifecycle.
The WAW program was divided into three main projects and many smaller releases in order to reduce risk and to create a feedback loop allowing refinement of the requirements based on previous releases. There were three major software releases, plus approximately 20 additional releases over the course of the three-year program. The three main development projects were based on the main areas of functionality of the tool suite: (1) Domain and Entity, (2) Analysis and Packager, and (3) Site Analysis and Change Management. Though the Domain and Entity features in WAW were somewhat functionally simple, the Domain and Entity project carried a significant amount of risk because it built the technical foundation on which the rest of the project would rest. The Site Analysis and Change Management tools were risky due to the usability issues involved in clearly representing to the user the process of harvesting and evaluating changes to websites.
Throughout the project one of our main concerns was how to represent the Arizona Model in a clear and usable way in software. (This concern is addressed in the section “The Web Archives Workbench Workflow.”) Based on early discussions, the system began to be seen as a “workbench,” into which components and systems would be incorporated and dropped over time, perhaps because users would prefer to apply some of their local tools or perhaps because they would have multiple tools for a given task. Additionally, each component would grow its data quality over time, therefore forcing the rest of the system to adapt easily to evolving specifications and data versions. Therefore, the architecture is designed for location, interface, and data-exchange transparencies, which means that changes in those three main areas are expected to drive all other system characteristics.
The high level technical architecture of the system was specified using the Reference Model - Open Distributed Processing (RM-ODP).4 This framework uses various views of a system, including a domain model view, an information view, an application view, and a technology and deployment view.5 Using this framework, OCLC created the following early domain model of the system. (See Fig. 4 on next page.) Some of the boxes in this domain model were later removed from the requirements, as our understanding of the system to be built changed over time.
The architecture consisted of several layers: client, integration, service, and persistence. The client layer consisted of a user interface implemented using the Struts framework as a model-view-controller to structure the code. The second layer is a Web services layer that provides the hooks for a client to talk to the application (although the code was not used in this way). This layer also provides integration between tools and translation between the internal and external representations of the data. Each developing WAW tool (Entity, Analysis, Domain, etc.) implemented a consistent Helper API to allow the user interface layer to Add/Update/Delete/Search single or multiple objects. The Oracle database provided a persistence layer.
Once the high level design was produced, a detailed design was produced for each tool. OCLC created use cases for all main activities in each of the tools.
4 For the specification, see “The ISO Reference Model for Open Distributed Processing – An Introduction,” at http://www.enterprise-architecture.info/Images/Documents/RM-ODP2.pdf.
5 In RM-ODP the architecture of a system is described by 5 views (essentially 5 different points of view) reflecting the separation of responsibilities between business sponsors, developers, and support staff. Those views are:
1. Enterprise - community, enterprise objects (domain model), objectives (requirements/use cases), roles
2. Information - schemas, object attributes, data boundaries, constraints, semantics
3. Computational - components, interfaces, interactions, contracts
4. Engineering - transparencies (location, access, failure, persistence), nodes, channels
5. Technology - technologies & products (the only dependence on specific products and implementation packages)
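The consistent Helper API mentioned in this section (Add/Update/Delete/Search over single or multiple objects) might be pictured roughly as follows. This is a sketch under assumptions, not OCLC's interface: the Workbench was built on Struts and an Oracle database, whereas this illustration uses Python and an in-memory dictionary. Every class, method, and field name here is hypothetical.

```python
from abc import ABC, abstractmethod


class Helper(ABC):
    """Uniform operations a tool might expose to the user interface layer."""

    @abstractmethod
    def add(self, objects): ...

    @abstractmethod
    def update(self, objects): ...

    @abstractmethod
    def delete(self, ids): ...

    @abstractmethod
    def search(self, **criteria): ...


class InMemoryHelper(Helper):
    """Stand-in for a tool-specific helper; a dict replaces the database."""

    def __init__(self):
        self._rows = {}
        self._next_id = 1

    def add(self, objects):
        """Store one or more objects and return their new ids."""
        ids = []
        for obj in objects:
            self._rows[self._next_id] = dict(obj)
            ids.append(self._next_id)
            self._next_id += 1
        return ids

    def update(self, objects):
        """`objects` maps an existing id to the fields that should change."""
        for obj_id, changes in objects.items():
            self._rows[obj_id].update(changes)

    def delete(self, ids):
        for obj_id in ids:
            self._rows.pop(obj_id, None)

    def search(self, **criteria):
        """Return ids of rows whose fields equal every given criterion."""
        return [
            obj_id
            for obj_id, row in sorted(self._rows.items())
            if all(row.get(key) == value for key, value in criteria.items())
        ]


# Made-up domain records, for illustration only.
domains = InMemoryHelper()
first, second = domains.add([
    {"name": "illinois.gov", "scope": "in"},
    {"name": "example.com", "scope": "unreviewed"},
])
domains.update({second: {"scope": "out"}})
in_scope = domains.search(scope="in")
print(in_scope)
```

Because every tool would offer the same four operations, the user interface layer can treat entities, domains, and analyses alike, which fits the stated goal of letting components be added to or dropped from the workbench over time.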
Figure 4: Diagram showing OCLC's early domain model of the system that eventually developed into the WAW suite of tools.
Each developer worked in his own “sandbox,” where a WAW interface was set up for his exclusive use. The work of multiple developers was integrated into a development test environment called “Baseline.” This way, product managers and test analysts could review work in progress in Baseline. When Baseline was ready it was migrated into a Quality Assurance environment, where formalized testing was done against a test plan. For major installs, Baseline was also installed at UIUC for additional testing. The final step of the development process was to deploy the software into a production environment.
The Web Archives Workbench was released as an open-source package on SourceForge in October 2007. Release documentation includes detailed installation instructions and a detailed User Guide for understanding and using the tools.
• WAW Release home page: https://sourceforge.net/projects/webarchivwkbnch/
• Administration Guide: https://sourceforge.net/project/showfiles.php?group_id=205495 (also provided as an Appendix item in this report)
• User Guide: https://sourceforge.net/project/showfiles.php?group_id (also provided as an Appendix item in this report)
• WAW software package: http://webarchivwkbnch.cvs.sourceforge.net/webarchivwkbnch/webarchivwkbnch/
The Administration Guide has runtime environment requirements for WAW. It also has a list of all 3rd-party software used by WAW in the Incorporated Code section of the document. The third-party software is included in the WAW distribution (refer to link for WAW software package). An OCLC subscription is not required to use WAW or to use this third-party software. Please see the HOWTO-build-install-locally.txt file in the WAW deployment for additional information.
The WAW tools, as developed by this project, will continue to be made publicly available indefinitely through SourceForge. In addition, in 2008 OCLC released a new array of services incorporating components of the WAW tools into a workflow with CONTENTdm, WorldCat, and the OCLC Digital Archive.
2.5. Findings - User Feedback
Testing of the WAW tools was undertaken in varying degrees by the original project content partners, as well as by several volunteer organizations. Feedback about their experiences working with the tools was gathered during large-group project meetings at OCLC, as well as through phone conversations and e-mail exchanges.
The overall response indicates that the Web archiving approach of the WAW tools was “elegant” and worth consideration, but in practice content partners generally did not implement the full functionality of the tools. Thus, the potential benefits of applying an archival approach to the Web were not realized completely. Reasons for this partial implementation have to do with inadequate resources and time toward training for the use of the tools, which also points up their complexity (explained in further detail below). The Web Archives Workbench is powerful and extensive in terms of web harvesting and content, or series, analysis, but, according to the feedback from our content partners, at a cost of heuristics and usability. Not surprisingly, the Quick Harvest functionality (which, because series analysis is not an option in it, involved fewer steps and thus less management than the regular Harvest tool) was engaged most often; for some, the Quick Harvest feature became a much-valued component of their daily workflows. Changes in content delivery approaches, such as from static Web pages to database-driven pages, constituted another reason for not applying the full functionality of the tools.
2.5.1. Limited Resources and Limited Time
During their participation in the ECHO DEPository project, state library and archives partners remained under continual operational pressures to respond to the need for capturing content from agency websites. Some partners tested the WAW tools while continuing to use other Web content capture approaches in order to meet their immediate obligations, leaving fewer resources to focus on the WAW tools. Because the tools were still under development, testing of the various phased releases may also have been difficult to incorporate into daily workflows. Support from the project (in the form of interns) had been planned but was geared to the early releases of the Workbench, before the full functionality of the tools was implemented. In hindsight, putting project resources toward direct work with content partners, as originally intended, might have resulted in more use of the full functionality of the tools, especially if timed more specifically to coincide with later, more fully functional, software releases.
2.5.2. Complexity of the Tools
According to user feedback, the Quick Harvest and Discovery tools were easiest to use, because they could be set up quickly and incorporated into existing workflows without increasing the need for new resources. The full functionality of the tools involves understanding a process with a greater level of complexity than that presented by the Quick Harvest option; partners reported that it was easier to use the Quick Harvest and Discovery tools, rather than expend time and resources for learning, or testing, the tools suite as a whole. Further, some content partners reported that the complicated interface of the tools was a barrier to using them to their fullest potential.
2.5.3. Web Content Delivery
The assumption proposed by the archival model (that a website and its directories are similar to an archival record collection and set of record series) does not apply today as easily as it did when the model was first proposed in 2003. An increasing amount of content is now delivered through database-driven websites rather than through static Web pages. The relationships between content items that may have been obvious when stored in a file directory are not always apparent when stored in a database. Therefore, crawling domains to find potential content to harvest and applying inherited metadata according to a directory structure are now less useful approaches than they were just a few years ago. Nonetheless, despite this shift in how information is delivered via websites, the concept of content inheriting metadata from previously harvested content, and then associating that content with an existing aggregate collection, continues to be useful for making automated harvest processes more effective.
2.6. Conclusions and Next Steps
State librarians and archivists continue to search for the best methods for capturing Web content based on their specific mandates and the resources they have available to them. Recent developments in Web archiving services and tools provide new opportunities for partnering with others and for exploring new workflows. The Web Archives Workbench tools are one option among many. They automate the methodology prescribed by the Arizona Model, which is premised on key archival practices, such as observation of provenance and adherence to original order. The four main tools (Discovery, Properties, Analysis, and Harvest) enable the identification, selection, description, and packaging of digital content. In addition, the WAW suite includes functionalities for error notification, as well as System tools for overseeing and reviewing Workbench activities (in the form of audit logs, spider settings, metadata import/export options, and reports on the activities launched by other tools in the suite).
The lessons learned from developing the Workbench, and the underlying archival model used to direct its development, underscore the merging roles and responsibilities of archivists and librarians in the digital environment and the need to re-evaluate and re-envision workflows. Moreover, the continuing mission and significance of this work have been affirmed in the second phase of NDIIPP. For example, the University of Illinois, OCLC, and the University of Maryland have partnered to develop a stand-alone, open-source metadata extraction tool intended to provide access to archived content, a kind of next step for the Web Archives Workbench. In addition, in the State Initiatives component of NDIIPP, a selection of state libraries across the nation are collaborating to develop tools and service models for the management and preservation of state government digital materials. These projects address digital preservation in a variety of contexts, including disaster readiness and the recovery of data. Through the State Initiatives work, NDIIPP is addressing the fundamental issue of keeping at-risk state government resources viable as part of our national heritage and record.
3. Repository Evaluation and Interoperability
Another component of the ECHO DEPository project is the evaluation of various open source repository software applications, with a focus on how these applications support activities of an institution or organization interested in providing services associated with a trustworthy digital repository. This section describes the development of an evaluation framework based on the first draft of the 2005 RLG/NARA Audit Checklist for the Certification of a Trusted Digital Repository, Draft for Public Comment (Audit Checklist), our repository testing and findings, and how these activities led to the development of a tool suite (the Hub and Spoke) for supporting repository interoperability and the collection of metadata important for preservation.
3.1. Repository Evaluation
3.1.1. Building an Evaluation Framework: Applying the Trusted Digital Repository Checklist to Repository Evaluation
Our goal is to provide an evaluation framework that reflects current thinking on digital preservation standards to help curators of digital collections (librarians and archivists) assess digital repository systems, with a focus on their ability to support long-term preservation. The 2005 RLG/NARA Audit Checklist for the Certification of a Trusted Digital Repository, Draft for Public Comment provided a vital starting point.
The Audit Checklist was developed by a joint task force from RLG and the National Archives and Records Administration (NARA). It provides a means by which an institution can perform a self-evaluation to determine how well it is positioned at an organizational level to provide an expected level of trustworthiness as a digital repository. We considered it to be a state-of-the-art articulation of what it means, at an organizational level, to be a successful curator of digital resources. We therefore decided to use this document as a starting point, and provide support for using those portions that are relevant to software as a ‘lens’ for considering repository software systems.
Our project team reviewed each Audit Checklist item with the question in mind, “How might a repository software system support an organization in meeting this criterion?” Some Checklist items are applicable to software applications; others are not. We isolated the relevant items, and, through much discussion, testing and review, developed a system of annotations to describe how each particular relevant Checklist item might be applied to assessment of repository software systems. This adapted Checklist is our Annotated Audit Checklist.
Our Annotated Audit Checklist evaluation framework is provided in this report in Appendix 6.3. Because the version we used, the 2005 RLG/NARA Audit Checklist for the Certification of a Trusted Digital Repository, Draft for Public Comment, has now been succeeded by the Trustworthy Repositories Audit and Certification: Criteria and Checklist (TRAC), we have included a mapping to equivalent sections of the new version. Additional details and examples of annotations are provided in Appendix 6.4. The following overview is extracted from this source.
3.1.1.1. Findings: the Annotated Audit Checklist As a Repository Evaluation Framework (from Kaczmarek et al., 2006)
We found the process itself of adapting the Audit Checklist as a framework for our repository software application evaluation to be a useful learning experience. Situating our evaluation within the original Audit Checklist provided a framework to discuss repository software applications without losing sight of the larger organizational context.
As we used it to document our repository installation and experimentation experiences, we found the annotated Audit Checklist provided a good framework for looking at repository software applications within the context of digital preservation. However, information about other aspects of software not directly related to preservation (e.g., ease of installation, ease of maintenance, programming language used) does not fit well into this framework.
Importantly, we have also found that the process itself of annotating the original Audit Checklist provided a forum for the project team members to begin discussions that have opened up opportunities to explore our individual assumptions about various checklist items and our interpretations of terminology. Through these discussions we established directions to take our evaluation activities further.
3.1.2. Repository Testing: Ingest and Export Tests On Four Key Open-source Repositories
The four open-source repository software applications that were tested were DSpace, Eprints, Fedora, and Greenstone. The collection items used as test data are described in detail below. Our testing approach and methodology are explained in Section 3.1.3, also below.
3.1.2.1. Test Data: a Heterogeneous Canonical Set
In order to test the repositories, a number of heterogeneous collections of digital items were identified. Each of these collections had varying structures, formats, and existing metadata. An overview of each collection follows below.
3.1.2.1.1. Aerial Photos
This is a relatively small collection of scanned aerial photographs from a couple of Illinois counties. It consists of 1,021 distinct scanned photographs and their accompanying metadata, which accounts for 2,042 distinct files for a total of 235 megabytes. The metadata are based on the Federal Geographic Data Committee (FGDC) Geospatial Metadata Standard. The scanned images were JPEGs of screen resolution quality; the archival quality images were not available for this project.
3.1.2.1.2. DLI (Digital Library Initiative) Journal Articles
The DLI (Digital Library Initiative) collection consists of 85,650 distinct journal articles on the subjects of science, technology, and engineering from five different publishers. This collection was created as part of the Grainger Library's earlier NSF- and CNRI-funded Digital Library testbed. This was by far the largest collection by number of files, with 2,247,455 files for a total of 76,148 megabytes. Each journal article typically consisted of several instantiations, usually including XML and SGML conforming to one of several different DTDs plus a PDF version, but in some cases also Postscript or TeX versions, plus all of the associated files such as images and metadata, which also occurred in several different formats.
3.1.2.1.3. WILL Public Radio Broadcasts
WILL is the local public broadcasting station, and this collection consists of the audio recordings for a selection of its Focus 580 daily talk radio programs. A total of 310 programs are included. Each program has a WAV audio file plus two XML metadata files for a total of 930 files and 82,456 megabytes. The metadata was originally in a Microsoft Access database.
3.1.2.1.4. Vincent Voice Audio Collection
This collection is a selection of audio recordings from the Vincent Voice Library at Michigan State University. It consists of 209 recordings, many of which are composed of several audio WAV files. There is an Encoded Archival Description (EAD) file associated with each recording for a total of 3,515 files and 110,186 megabytes.
3.1.2.1.5. DOQ (Digital Orthophoto Quadrangles) Data
This is a collection of 1,073 high resolution Digital Orthophoto Quadrangles of the Chicago area. Each DOQ consists of six files: the image TIFF file, the 'world file' used for geo-referencing the TIFF, an FGDC XML metadata file, a text-only version of the metadata, the DTD for the metadata file, and an XSLT stylesheet for the metadata.
3.1.2.1.6. “The Canonical Set”
The first repository tested was DSpace. For each collection, specialized scripts and XSLTs were written to arrange the items and metadata in such a way that they could be ingested into DSpace using the ItemImport utility. This process included moving the files associated with each “item” into a single directory, creating descriptive metadata in DSpace’s idiosyncratic Qualified Dublin Core format, and creating a content manifest. The DSpace bulk ingest package format was accepted as the baseline configuration from which all other processing would occur. The collection of all the digital packages in this format became our “canonical set.” See the figures below for various breakdowns of the files in this set:
Canonical Test Set: Number of Packages by Collection
Canonical Test Set: Total Megabytes by Collection
Canonical Test Set: Number of Files by Collection
Canonical Test Set: Number of Files by Filename Extension
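The packaging step described for the canonical set (one directory per item, Dublin Core metadata, and a content manifest) can be sketched roughly as follows. This is a minimal illustration, not the project's actual scripts: the function name `build_item_package` is hypothetical, and the file names `dublin_core.xml` and `contents` reflect our understanding of the DSpace simple archive layout, which should be checked against the documentation for the DSpace version in use.

```python
import shutil
from pathlib import Path
from xml.etree import ElementTree as ET


def build_item_package(item_dir, files, dc_values):
    """Write one item directory: content files, dublin_core.xml, and a manifest.

    dc_values is a list of (element, qualifier, text) triples.
    The layout is an assumption modeled on DSpace's simple archive format.
    """
    item_dir = Path(item_dir)
    item_dir.mkdir(parents=True, exist_ok=True)

    # Move (here: copy) every file belonging to the item into one directory.
    names = []
    for src in files:
        src = Path(src)
        shutil.copy2(src, item_dir / src.name)
        names.append(src.name)

    # Descriptive metadata: one <dcvalue> per qualified Dublin Core statement.
    root = ET.Element("dublin_core")
    for element, qualifier, text in dc_values:
        node = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        node.text = text
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Content manifest: one file name per line.
    (item_dir / "contents").write_text("\n".join(names) + "\n", encoding="utf-8")
    return item_dir
```

Calling the function once per item, with numbered item directories under a common parent, would yield a tree of the general shape that a batch import tool expects.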
3.1.3. Testing Approach and Methodology
In a nutshell, our process was to install a particular repository, do whatever was necessary to ingest our canonical dataset into the repository, do whatever was necessary to export our canonical dataset back out of the repository, and record our findings in a narrative fashion, especially in the context of digital preservation and of our Annotated Trusted Digital Repository Checklist. It should be noted that the Annotated Trusted Digital Repository Checklist and our concept of what was actually required for long-term digital preservation were being continually revised in parallel with the repository evaluation process. More details of the testing approach and methodology are provided below.
First, different project staff members, consisting primarily of graduate research assistants, were assigned to each of the different repositories to be evaluated. Cooperation between evaluators was encouraged, especially when different skill sets might be needed in performing an evaluation. Some of the evaluators had library backgrounds while others had technical computer backgrounds.
The approach was fairly freeform and the evaluators had some leeway in the details, but in general they followed this rough outline:
• The first step was to become familiar with the repository to be evaluated. This could involve review of any previous evaluations or other written material about the repository, including the documentation provided by the repository itself. This step culminated in the installation of the repository on our test server.
• The next step was to develop a process for ingesting our canonical dataset into the repository. This required the evaluator to gain a good understanding of the details of the repository’s supported metadata formats and supported file structures, as well as the repository’s programming interfaces or batch processing tools that could be used to facilitate the ingest. The evaluator would also need to become familiar with our canonical dataset at this point, if they were not already familiar with it. The ingest process generally consisted of these steps:
  o Develop mappings between the various metadata formats represented in the canonical dataset and the metadata schemas required by the repository. These mappings could be implemented using XSLT transformations or in some cases by writing customized computer programs.
  o Package the metadata and files in a format that is digestible by the repository. This could be as simple as creating a text file manifest, or naming files according to some standard and putting them all in a certain directory structure, or as complex as creating METS or FoxML XML packages. Similar to the metadata, these packages are usually implemented using some combination of XSLT and customized computer programs.
  o Finally, the actual ingest needed to occur. Once again, this could be as simple as running or activating the repository’s native ingest tool, or as complex as writing a custom ingest program that uses the repository’s low-level programming interfaces. If the repository supported a native batch ingest mechanism, every attempt was made to use it as is before resorting to the creation of any customized ingest programs.
• Often development of the ingest process was iterative, consisting of developing and implementing a process, testing it, and refining it until the entire canonical set could be reliably ingested.
• After the canonical dataset had been ingested, the process was reversed, and the data was exported or disseminated back out of the repository. Similar to ingest, the dissemination could be as simple as invoking native repository functions, or as complex as writing a custom program, although we preferred to use native batch export capabilities if the repository had support for any. This process was also often iterative.
• Evaluators were encouraged to record their processes and findings about the repositories throughout this process.
• Finally, reuse of transformations and computer programs between different repositories was encouraged.
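The metadata mapping step in the outline above was implemented with XSLT or customized programs. As a rough illustration of the "customized program" route, a field-level crosswalk might look like the sketch below. The source field names and the `CROSSWALK` table are hypothetical and are not taken from the project's actual mappings; the point is only that unmapped fields are returned rather than silently dropped, so an evaluator can see what a mapping loses.

```python
# Hypothetical crosswalk: source field name -> (Dublin Core element, qualifier).
CROSSWALK = {
    "title": ("title", "none"),
    "originator": ("contributor", "author"),
    "pubdate": ("date", "issued"),
    "abstract": ("description", "abstract"),
}


def crosswalk(record, mapping=CROSSWALK):
    """Map a flat source record to (element, qualifier, value) triples.

    Returns (mapped, unmapped): the triples that could be produced, and the
    source fields for which the mapping has no target.
    """
    mapped, unmapped = [], {}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            unmapped[field] = value
            continue
        # Repeatable source fields arrive as lists; emit one triple per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            mapped.append((target[0], target[1], v))
    return mapped, unmapped
```

The triples produced this way are in the same shape that a packaging step could write out as repository-specific metadata, which is one way the reuse of transformations between repositories could be encouraged.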
In parallel with the repository evaluation processes described above, team members also participated in the review of the Trusted Digital Repository Checklist, so that each of these tasks informed the other in an iterative fashion. The simultaneous Trusted Digital Repository Checklist review and the repository evaluations culminated in the evaluators being asked to apply our Annotated Trusted Digital Repository Checklist to their repository evaluation findings, which are provided in the appendices of this report.
Unfortunately, one of the pitfalls of employing graduate assistants on a long-term project like this is that they eventually graduate or move on to other assistantships as their educational goals progress or change. While we strongly encouraged our graduate assistants to document their work, we found in some cases that their notes were not always detailed enough for us to accurately reflect their findings as we applied their evaluations to our Annotated Trusted Digital Repository Checklist. This sometimes required us to revisit, or recreate, a test for a given repository in order to address one or more of the checklist items.
Another outcome of the repository evaluations is that as we progressed with ingest and export testing of the different repositories, one of our goals became to be able to reliably move a collection of digital objects between any two of the repositories that were being evaluated and back again (round tripping). This was the genesis of the Hub and Spoke repository interoperability architecture that we are currently developing.
3.1.4. Repository Testing Findings: Narrative Reports, and Annotated Audit Checklist Commentary
The following open source repositories were evaluated:
• DSpace: Version 1.2.2 with later upgrade to 1.3.1
• EPrints: Version 2.3.13
• Fedora: Version 2.0, with later upgrades to 2.1, 2.1.1 and 2.2
• Greenstone: Version 2.6, with upgrade to 7.7
Overview reports were produced for each repository, containing the following sections. These are provided in Appendix 6.5, Repository Testing Findings: Narrative.
• Repository Overview
• Testing and Technical Environment
• Methodology
• Findings
• Conclusion
We also documented our experiences using our Annotated Checklist evaluation framework. This documentation is provided in Appendix 6.6.
3.1.5. Conclusion and Next Steps
Two key overall findings were typically low out-of-the-box support for interoperability and low support for emerging preservation standards. During the development of our testbed we found ourselves developing a number of different though similar customized scripts and programs for exporting digital packages from one repository system and importing those digital packages into another repository system. The repository systems themselves had very little in common that would facilitate this task. They typically supported different descriptive metadata formats, had no support for provenance metadata, offered little or no support for technical metadata, and employed different means of identifying the files constituting a package. The development of an in-house tool to facilitate data interoperability between multiple repositories, without the need to develop customized mechanisms for each repository combination, therefore soon emerged as a key task to support our repository evaluation activities.
At the same time, we were also coming to a more structured understanding of emerging digital preservation standards, specifically early drafts of An Audit Checklist for the Certification of Trusted Digital Repositories (RLG, 2005; Kaczmarek, Hswe, Eke, & Habing, 2006; Kaczmarek, Habing, & Eke, 2006) and the PREMIS Data Dictionary for Preservation Metadata (PREMIS Working Group, 2005). We began to see that a formally-developed interoperability architecture designed with a focus on providing additional support for retention of provenance and technical metadata could be a valuable and practical project deliverable, and one with immediate application in our own libraries and in other institutions that commonly implement multiple repository systems to manage and preserve digital collections. These findings led to our proposing the Hub and Spoke tool suite as an additional project deliverable. This work is described in detail in the next section (Section 3.2).
3.2. Hub and Spoke Architecture (HandS): Supporting Repository Interoperability and Emerging Preservation Standards
This section describes the development of the Hub and Spoke (HandS) tool suite, built to help curators of digital objects manage content in multiple repository systems while retaining valuable preservation metadata. Implementing METS and PREMIS, HandS provides a standards-based method for packaging content that allows digital objects to be moved between repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. (Note that related project work, investigating the more fundamental semantic issues underlying the preservation of the meaning of digital objects over time, is profiled in Section 4.)
3.2.1. HandS Overview
HandS is a suite of tools built to support moving content between repositories while generating and maintaining PREMIS-based technical and preservation metadata. It emerged out of project activities to evaluate open-source repositories (see Section 3.1), in which we found typically low out-of-the-box support for interoperability and low support for emerging preservation standards. The next section describes the impetus and rationale behind the HandS development in more detail.
3.2.2. The Need for Interoperability and Preservation Support
3.2.2.1. Institutions Commonly Rely on Multiple Repositories
There are currently many different digital repositories in widespread use, including DSpace, Greenstone, Fedora, EPrints, and CONTENTdm, along with digital archive services like those from OCLC and CDL. There are also many different sources of input into these systems, such as web crawlers like Heritrix or packaged content from OCLC's Web Archives Workbench, as well as numerous digitization and scanning services. It is also not uncommon for several of these systems to be in use within a single institution. If curators wish to share data internally, or with other institutions or consortia, then multiple repository systems very likely will come into play. Repository interoperability issues also emerge as institutions update or replace their repository systems, and must migrate content from an existing repository system to its replacement.
3.2.2.2. Out-of-the-box Repository Interoperability is Low
Our repository evaluation experiments and our experiences with repositories in production at our own institutions show that the native ability for repositories to interoperate is typically very basic. Almost none of the systems we tested were able to operate with one another beyond a rudimentary level, usually restricted to the OAI Protocol for Metadata Harvesting (OAI-PMH) for Dublin Core. If any OAIS concepts are implemented (and few are), such as the use of submission or dissemination information packages (SIPs and DIPs), these implementations vary greatly. In an ideal OAIS-compliant world, a DIP from one repository should be a SIP to another. However, in reality, a dissemination package produced by DSpace cannot be used for submission into EPrints. Because of these inconsistencies, achieving any real interoperability between repository systems usually entails some level of custom software development. Further, any time a new repository is added to the mix, new software will need to be developed in order to accommodate the added repository.
3.2.2.3. Support for Emerging Preservation Standards is Low
Few of the current repositories have any explicit support for preservation, such as for collecting preservation metadata as articulated by PREMIS, or activities to support preservation such as format migrations or checksum validations as outlined in the Trusted Digital Repository Checklist. For an institution that deploys several repository systems, a task as simple as performing consistent backups to off-line storage can become complicated by the fact that the systems store their underlying data differently. There may be data stored in relational databases, XML databases, RDF triple stores, and various file systems, all of which must be backed up, and may require different backup techniques.
In summary, the general lack of repository support for interoperability and for emerging preservation standards, at a time when libraries and other institutions commonly rely on multiple repository systems to manage, share and preserve content, is the fundamental impetus behind the development of the HandS tool suite. The key principles of interoperability and preservation, and the approaches implemented in HandS to support them, are examined more closely in the next section, followed by a functional and technical overview of the HandS tools.
3.2.3. Hub and Spoke Key Principles
The Hub and Spoke approach is based on the two key principles of interoperability and preservation, with the understanding that interoperability is not only an end unto itself, but it is also critical for preservation.
3.2.3.1. Interoperability
To reduce the complexity of interoperability, the Hub and Spoke uses a common packaging format which is used for interchange of digital resources between different repositories. Digital packages coming from a repository are transformed into this common format before any further processing, and digital packages are transformed from this common format into the native repository format when being placed into a repository. The idea is to reduce an N² problem into a 2N problem as shown in Figure 5 below.
Figure 5: Interoperability Standards: a Simple Idea
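The arithmetic behind the N² to 2N reduction can be made concrete with a small sketch. It assumes one-way converters: moving packages directly between every ordered pair of N repositories needs N(N-1) converters (on the order of N²), while a hub needs one to-hub and one from-hub converter per repository. The function names are illustrative only.

```python
def direct_converters(n):
    """One-way converters needed when every repository exports to every other."""
    return n * (n - 1)


def hub_converters(n):
    """One to-hub and one from-hub converter per repository."""
    return 2 * n


if __name__ == "__main__":
    # Print the two counts side by side for a few repository counts.
    for n in (2, 3, 4, 5, 10):
        print(n, direct_converters(n), hub_converters(n))
```

With the four repositories tested in this project, that is 12 direct converters against 8 hub converters, and the gap widens with each repository added, since a new repository costs two hub converters rather than two per existing repository.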
3.2.3.2. Preservation
The second key principle is that the common packaging format, as well as the processes that act on that packaging format, should not only support interoperability, but should also promote preservation. This principle treats the common packaging format as an archival information package (AIP) in the OAIS model. The assumption is that one reason packages are being moved between repositories is for preservation.
There are several features of the Hub and Spoke architecture that promote preservation. One key preservation feature is the reliance on current best practices regarding preservation metadata, primarily informed by PREMIS. The Hub and Spoke is especially concerned with technical metadata about the files and bitstreams which comprise a digital package, and also with provenance metadata about the events that occur during the package's lifetime, including events pertaining not only to the files and bitstreams, but also to the metadata itself. The technical metadata is used to validate the files and bitstreams throughout a digital object's lifetime and is updated as required, for example when a format transformation occurs. The provenance metadata is also updated throughout an object's lifetime. The tools that implement the HandS architecture perform these actions automatically as required during processing, but the data are always present in the packages so that other systems can also perform these actions as needed.
Another key preservation feature is the treatment of the packages themselves. For purposes of repository interoperability and also to support preservation, the Hub and Spoke framework treats the instantiations of the packages as first class digital objects. This means that when a HandS package is transformed for ingestion into a specific digital repository, not only are the metadata, files, and bitstreams that comprise the package decomposed as appropriate for the repository and uploaded, but the package itself (in our case a suite of METS files) is also treated as a digital object to be uploaded to the repository. Later, when the digital package needs to be disseminated from the repository, not only are the metadata, files, and bitstreams available for download, but also the original HandS package. This allows the HandS system to compare the package as it was originally ingested to how it now appears as disseminated from the repository. This process, we feel, is critical for preservation in an environment of heterogeneous and changing repositories.
Another aspect of this treatment of the Hub and Spoke packages as first class digital objects is that we can create snapshots of individual packages at points in time and also record preservation metadata about the package snapshots. The HandS tool suite currently implements this concept as a master package which references time-stamped snapshots of the main package. The master package also records preservation metadata about the snapshots. This approach is explained in more detail in the next section, which describes the concrete implementation of the HandS packages using METS.
3.2.4. METS Profile
To realize the above principles, we wanted to utilize the prevailing digital library standards as much as possible. To that end, we adopted METS as the packaging standard, PREMIS as the preservation metadata standard, and MODS as the descriptive metadata standard. We also optionally utilize several format-specific technical metadata standards such as MIX and textMD for image and text objects respectively, among others. Our METS profile, the ECHO DEP Generic METS Profile for Preservation and Digital Repository Interoperability (Habing, 2005), is currently registered with the Library of Congress.
As already described, the primary focus of the HandS METS profile is to enable repository interoperability and to support preservation of repository content. Because of the strong focus on preservation rather than access, the HandS profile is relatively noncommittal regarding file formats or structures; instead, special attention is given to administrative and technical metadata, particularly to integrating the PREMIS data model and schema into METS. We anticipate that our file format-agnostic HandS profile may be overlaid on top of, or inherited by, other profiles that better define a particular file format or structure, providing them with added support for preservation or interoperability. An example of this arrangement, where a format-specific METS profile is implemented as a subclass of the PREMIS-focused HandS profile, is the ECHO DEP METS Profile for Web Site Captures (Habing, 2006), also registered with the Library of Congress.
Thus the ECHO DEP Generic METS Profile for Preservation and Digital Repository Interoperability is generally not concerned with rendering or making accessible any particular representation of an object, but it is concerned with preserving the object and its representations, including the history of how those have changed over periods of time. In this context, preservation refers to short-term interoperability, preserving the representations and metadata as a digital package is moved between two different repositories. It also refers to the long-term preservation of the package and its history as it exists in various repositories for long periods of time and undergoes various "preservation actions" such as fixity checks, normalizations, or format migrations.
Note that though the profile is generally agnostic about almost all aspects of a digital object's representation, such as structure or file formats, we have made some pragmatic concessions, such as mandating at least MODS for the primary descriptive metadata (dmdSec) while at the same time allowing multiple alternative descriptive metadata sections. The alternative descriptive metadata sections are used as a means to record various versions of these metadata as they have existed in different repositories or at different points in time. A potential usage scenario can be illustrated in the following migration example using our METS profile:
1. We start with a digital object whose original source descriptive metadata is in the MARCXML format. Because our profile requires MODS as the primary descriptive metadata, the MARCXML will be transformed into MODS, and the MODS will be stored in the METS document along with a provenance statement with some details about the transformation, especially identifying the source metadata format. However, because descriptive metadata are considered to be a significant part of the representation of an entity and because transformations between metadata formats are often imperfect, the original MARCXML format is also stored in the METS document as an alternate metadata format.
2. Now suppose that the digital object is to be ingested into DSpace. DSpace, however, does not have native support for the MODS or MARCXML metadata formats; therefore, as part of the ingest process, the MODS must be transformed into the idiosyncratic Dublin Core (DC) metadata format that is supported by DSpace. This metadata format is also added as another alternate descriptive metadata format to the METS document, along with a provenance statement describing how this new DC format was derived from the primary MODS format.
3. Next imagine that the object exists in DSpace for some period of time during which the descriptive metadata undergoes some revision, such as the addition of new subject terms or the addition of an abstract. Now the object is to be disseminated from DSpace for ingest into some new repository. This could trigger the addition of another alternate descriptive metadata section to the METS document. This alternate format would conform to the idiosyncratic DSpace Dublin Core format, but the provenance statement would specify that this DC format represents a newer version of the descriptive metadata than was originally ingested into DSpace.
The above scenario would produce a chain of descriptive metadata formats, such as MARCXML (original) → MODS (primary) → DC (version 1) → DC (version 2), with PREMIS provenance event statements adequate to determine the sequence of events that led to this chain. As part of this profile we also envision future processes that might reconcile later metadata revisions and merge those revisions back into a new primary MODS descriptive metadata section. The preservation of semantics during these types of migrations is one of the concerns of semantic preservation described in the final section of this report (Section 4).
Because we feel that administrative metadata are important for preservation, this profile is fairly prescriptive when it comes to the administrative metadata, which can be associated with almost all of the sections that make up a representation: structures, files and bitstreams, and descriptive metadata. Particular attention is paid to the technical and provenance metadata associated with these METS sections.
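The chain of descriptive metadata versions in the migration scenario above can be modeled as a set of records, each naming the version it was derived from. The sketch below is only an illustration of that idea, assuming a strictly linear chain: the class `MetadataVersion`, its field names, and the event wording are hypothetical and do not reproduce the METS or PREMIS element names used by the profile.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetadataVersion:
    label: str                          # e.g. "MODS (primary)"
    role: str                           # "primary" or "alternate"
    derived_from: Optional[str] = None  # label of the source version, if any
    event: str = ""                     # free-text stand-in for a provenance event


def derivation_chain(versions: List[MetadataVersion]) -> List[str]:
    """Order version labels from the original source to the newest revision.

    Assumes a linear chain: exactly one version has no source, and each
    version is the source of at most one other.
    """
    by_parent = {v.derived_from: v for v in versions}
    chain: List[str] = []
    current = by_parent.get(None)
    # Stop at the end of the chain, or if a malformed record would loop.
    while current is not None and current.label not in chain:
        chain.append(current.label)
        current = by_parent.get(current.label)
    return chain


SCENARIO = [
    MetadataVersion("MARCXML (original)", "alternate"),
    MetadataVersion("MODS (primary)", "primary", "MARCXML (original)",
                    "transformed from source MARCXML"),
    MetadataVersion("DC (version 1)", "alternate", "MODS (primary)",
                    "transformed for DSpace ingest"),
    MetadataVersion("DC (version 2)", "alternate", "DC (version 1)",
                    "revised while held in DSpace"),
]
```

Because the order is recovered from the derivation links rather than from list position, the records could be stored in any order in a package and the sequence of events would still be recoverable.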
3.2.4.1. Master METS Profile
Another key idea behind our METS profile is the idea of a Master METS document. Each package in the HandS architecture consists of a single Master METS document, one or more METS Snapshot documents, plus all the files and bitstreams that are referenced from the METS Snapshots, as shown below in Figure 6.
Figure 6: Master METS Showing Multiple Snapshots and Associated Files
Each Snapshot represents a version of the digital package at a point in time, usually when the package is either retrieved from or placed into a given repository. Nearly any aspect of a digital object's representation may change with time, including descriptive metadata and structure; as illustrated above, even the files referenced from a package may change, perhaps as format migrations occur. These changes are recorded as provenance statements in the METS Snapshot in which the change is manifest. For example, METS Snapshot 2 in the above diagram would have a PREMIS event describing that File 1 was deleted from the package and File 4 was added to the package.
In most cases, the HandS system can automatically detect when these changes occur and will automatically add the appropriate provenance statements or embellish the technical metadata as required. However, it may not be able to determine why the changes occurred without some sort of intelligent intervention. HandS is able to detect the changes because it has access to the previous Snapshots and can compare the Snapshot of the package as it went into a repository to the package that is retrieved from the repository. This is one of the primary reasons that the METS documents themselves are also placed into a repository along with the other files that are actually part of the package.
The METS profile implementations described above are an integral piece of the HandS architecture, used as a framework for generating and maintaining PREMIS-based metadata over time to support long term preservation. The next section looks in more detail at other mechanisms of the HandS tool suite, and illustrates its overall workflow cycle.
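The Snapshot comparison described above (File 1 deleted, File 4 added) can be sketched as a set difference over file inventories. This is a simplified illustration rather than the HandS implementation: each snapshot is reduced to a hypothetical mapping of file name to checksum, and the event type strings are placeholders, not PREMIS vocabulary.

```python
def diff_snapshots(old, new):
    """Compare two snapshots given as {file name: checksum} mappings.

    Returns a list of event records describing what changed between them.
    The event type strings are illustrative placeholders.
    """
    events = []
    # Files present before but missing now.
    for name in sorted(old.keys() - new.keys()):
        events.append({"eventType": "deletion", "file": name})
    # Files that appear only in the newer snapshot.
    for name in sorted(new.keys() - old.keys()):
        events.append({"eventType": "addition", "file": name})
    # Files present in both whose checksums no longer match.
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            events.append({"eventType": "modification", "file": name})
    return events
```

A comparison like this can report what changed, but, as noted above, not why it changed; recording the reason still calls for some form of human or external input.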
3.2.5. HandS Workflow Cycle
Figure 7: HandS Workflow Cycle
As described in the preceding sections, the HandS Tool Suite provides a framework for sustaining and enriching preservation metadata for digital objects as they are moved into, out of, and between digital repository systems. Digital objects or preservation packages typically refer to a set of files that represents a single intellectual entity, including metadata about the entity or about the files themselves. In the Hub and Spoke workflow cycle (see Figure 7), digital objects are retrieved, converted to a common profile, validated, enriched with metadata, transformed into a repository-compatible form, and ingested into a digital repository.
3.2.5.1. Workflow Overview: GET, PROCESS, PUT
Preservation packages may enter the HandS workflow in various ways: some may come from third-party applications like the OCLC Web Archives Workbench, others may be disseminated from a digital repository like DSpace or EPrints, and some will originate simply as directories of files on a computer file system. In any case, the set of files that make up the preservation package must first be gathered and organized for processing. Objects entering the workflow from a digital repository system must first be fetched from the repository by interacting with its native dissemination routine, which will vary from repository to repository. This interaction with the repository system is facilitated by our “Lightweight Repository Create, Retrieve, Update, and Delete” Service, affectionately named LRCRUD.
LRCRUD is made up of two modules: the LRCRUD Client, which runs on the same machine as the other HandS tools, and the LRCRUD Service, which runs alongside a digital repository system. To retrieve a package from the repository, the LRCRUD Client makes a request to the LRCRUD Service. The LRCRUD Service, in turn, communicates directly with the repository system and retrieves the package via the repository’s native dissemination routine. The LRCRUD Service zips up the package and sends it over the network to the Client.
Once the package has been received by the LRCRUD Client and verified, its contents are unzipped onto the local file system. From there, the To-Hub Packager tool converts the digital object into what we call a Hub Package. A Hub Package is made up of the content files that constitute the digital object; METS documents containing descriptive, administrative, and structural metadata about the object at various points in time; and a single Master METS document that contains chronological and structural information about the other METS documents. The Master METS file will contain a pointer to at least one, but potentially several, other ECHO DEP METS documents, each of which serves as a snapshot of the Hub Package at some point in its lifecycle.
The ECHO DEP METS document is the heart of a Hub Package; it holds together all the files and various metadata that make up the package. When a Hub Package is created, a new ECHO DEP METS document is generated for the package. If the package already contains an older ECHO DEP METS document (generated prior to ingestion into the repository), the new METS document is compared to the older one to discover any changes or damage to the package that might have occurred while in the custody of the repository.
The ECHO DEP METS document is then enriched with technical metadata and validated against the ECHO DEP METS Profile registered with the Library of Congress (Habing, 2005). The TechMD Augmentor tool enriches the METS document with format-specific technical metadata found by analyzing each of the package's content files and converting the result into PREMIS Object metadata. Once the package has been analyzed and enriched, the Profile Validator closely inspects the constituent files that make up the package, both data and metadata, and verifies there are no errors or inconsistencies.
At this point, the Hub Package is ready to be sent on to another repository. But first, it has to be converted into a form compatible with ingestion into the target repository, which again will vary from repository to repository. This final conversion is carried out by a From-Hub Packager tool, built specifically for the target repository. From there, the package is handed off from the LRCRUD Client to the LRCRUD Service for the target repository and ingested.
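The verification that the LRCRUD Client performs on a received package can be sketched as follows. The workflow example in Section 3.2.5.2 states that the size and checksum travel as Content-Length and Content-MD5 HTTP header fields; the sketch assumes the conventional Content-MD5 encoding (the base64 form of the binary MD5 digest), which the report does not spell out. The function names are hypothetical and are not part of LRCRUD.

```python
import base64
import hashlib


def transfer_headers(payload: bytes) -> dict:
    """Compute the size and checksum headers a sender would attach to a package."""
    digest = hashlib.md5(payload).digest()
    return {
        "Content-Length": str(len(payload)),
        # Assumed encoding: base64 of the raw 16-byte MD5 digest.
        "Content-MD5": base64.b64encode(digest).decode("ascii"),
    }


def verify_transfer(payload: bytes, headers: dict) -> bool:
    """Recompute both values on the receiving side and compare with the headers."""
    expected = transfer_headers(payload)
    return (
        headers.get("Content-Length") == expected["Content-Length"]
        and headers.get("Content-MD5") == expected["Content-MD5"]
    )
```

In a client, the package would be unzipped only when `verify_transfer` returns true; a mismatch would suggest the transfer should be retried rather than the package processed.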
3.2.5.2. Workflow Example
The workflow cycle might be more easily understood by following an example preservation package as it makes its way through the process. For this example, we will use three small files that make up a single web page: an HTML file, a Cascading Style Sheet (CSS), and a JPEG image. These three files compose a single preservation package, or item, which has been submitted to a DSpace digital repository. Using the HandS tool suite, we will transfer the item from DSpace to an EPrints repository, while generating preservation and technical metadata along the way.
1. Retrieve Repository X dissemination package via LRCRUD
In our example, suppose we have an LRCRUD Service running alongside a DSpace repository on a remote server. The LRCRUD Client application sends a request to the LRCRUD Service to retrieve (GET) a package from the repository. The LRCRUD Service relays the retrieval request to DSpace using the repository’s native dissemination method.
The output of a repository's dissemination will typically be made up of any number of metadata streams and other supporting artifacts in addition to the item’s content files. In the case of DSpace, the package will include a DSpace METS file that encompasses MODS descriptive metadata about the package and PREMIS technical metadata pertaining to each of the constituent bitstreams. In our example, the package returned by DSpace now contains four files: the HTML, CSS, and JPEG files we began with and a DSpace METS file.
The LRCRUD Service receives the DSpace dissemination and packages its contents into a zip archive, which will be transmitted over HTTP to the LRCRUD Client. The LRCRUD Service also calculates a checksum value and the file size for the zip file before sending it, and transmits these values as Content-MD5 and Content-Length HTTP header fields along with the package zip file. As the LRCRUD Client receives the package zip file, it too calculates file size and checksum values, which are validated against the HTTP header fields to ensure the package was unharmed during the file transfer. Assuming the values agree, the package is unzipped and saved to disk.
2. Create Hub Package from repository dissemination files
To create a Hub Package from the repository dissemination package, the To-Hub Packager needs to produce a new ECHO DEP METS document for the package. The packager begins by searching the retrieved files for any metadata included by the repository. In our example, the packager locates the DSpace METS document and retrieves its MODS descriptive metadata stream. This DSpace MODS metadata will be transformed into Aquifer MODS and inserted into the new ECHO DEP METS document’s descriptive metadata section. Other repositories export metadata in different formats (e.g., Dublin Core), but in all cases the package metadata are ultimately transformed to Aquifer MODS by the To-Hub Packager.
The To-Hub Packager then creates an entry in the file section of the ECHO DEP METS document for each of the package's constituent files. In our example, the new ECHO DEP METS document will contain a file element for each of our three content files. The To-Hub Packager will also create in the new ECHO DEP METS document three PREMIS technical metadata objects to correspond to the three file elements. Each will contain basic technical metadata about one of the files, including checksum values, file size, and MIME type. Finally, any leftover descriptive and technical metadata elements from the DSpace METS document are inserted into the ECHO DEP METS document as alternate metadata so that they are never lost.
If the package contains older ECHO DEP METS documents (because it had been packaged by HandS before entering the repository), the most recent ECHO DEP METS document is compared to the just-generated ECHO DEP METS document to expose any changes the package may have undergone since it was last analyzed. These data are recorded in the new ECHO DEP METS document's provenance metadata as PREMIS events.
If the package contains a Master METS document, a pointer to the new METS document is created in it and designated as the most-current ECHO DEP METS document for the package. If no Master METS document can be found, the To-Hub Packager creates one from scratch.
Once the new METS document has been created and the Master METS document is updated, Hub Package creation is complete. Our example Hub Package now consists of a Master METS document, which points to a single ECHO DEP METS document. This ECHO DEP METS document contains descriptive metadata about the package; technical metadata about each of the three content files, along with pointers to those files and the technical metadata left by DSpace; and provenance metadata documenting the package's export from the repository and its Hub Package transformation.
3. Generate technical metadata; augment Hub Package METS Document
Using tools from the JSTOR/Harvard Object Validation Environment (JHOVE), the HandS TechMD Augmenter module analyzes each of the Hub Package's content files and generates format-specific technical metadata for each. The JHOVE-generated metadata is transformed using format-specific XSLT stylesheets and inserted into the technical metadata section of the ECHO DEP METS document. Any inconsistencies between the technical metadata currently held in the METS document and those generated by JHOVE are recorded in the provenance section of the ECHO DEP METS document as PREMIS validation events.
The technical metadata stored in the ECHO DEP METS document is formatted in compliance with the following metadata preservation standards: AudioMD for audio files; TextMD for text, XML, and HTML; and MIX for images. In our example the HTML, CSS, and JPEG files will each be analyzed by JHOVE. The JHOVE output for both the HTML and CSS files will be formatted as TextMD, and the output for the JPEG image will be formatted as MIX. Each will be inserted into the
ECHO DEP METS document in a technical metadata element corresponding to the appropriate file element. The JHOVE analysis itself is also documented and recorded in the ECHO DEP METS document as a validation event.
One of the limitations of JHOVE is its small number of supported media types. In its current release JHOVE offers no support for closed formats such as Microsoft Office files. Another drawback of using JHOVE is that it only reports the MIME type correctly for HTML or XML files if they are well formed; otherwise it reports them as plain text, causing discrepancies within the ECHO DEP METS document and validation warnings. Nevertheless, we found JHOVE to be a useful tool for analyzing files and generating technical metadata. For more on JHOVE visit http://hul.harvard.edu/jhove/.
4. Validate Hub Package METS Document against METS Profile
The Profile Validator examines the current ECHO DEP METS document for the Hub Package against the requirements of our METS profiles currently registered with the Library of Congress (Habing 2005, 2006). Key validation points include checking that the primary descriptive metadata element contains a MODS object that conforms to the Aquifer MODS profile; that every file referenced by the file section has associated technical metadata PREMIS objects; and that all provenance metadata associated with a file contain valid PREMIS event elements. The Profile Validator also checks that the package content files referenced by the ECHO DEP METS document are accounted for, and that their checksum, file-size, and MIME-type values are correct.
Our example ECHO DEP METS document passes validation for the following reasons: it contains valid Aquifer MODS in its primary descriptive metadata element; each of its file elements references technical metadata elements containing valid and complete PREMIS object metadata; and it conforms structurally to our METS profile requirements. Once the validation has completed, the validation event itself is documented and recorded in the ECHO DEP METS document as a PREMIS validation event.
5. Create Repository Package from Hub Package
Before a repository can accept a package for submission, it must first receive a description of the package's contents. The From-Hub Packager module uses descriptive metadata extracted from the Hub Package ECHO DEP METS document to generate the repository-specific metadata needed for package submission. This process usually involves transforming the Aquifer MODS metadata found in the ECHO DEP METS document into a metadata format required for repository submission, and will vary from repository to repository. In our example, we are sending the package to an EPrints repository, which means the packager will generate an EPrints-specific metadata file from the Aquifer MODS stream. The transformation event is recorded in the ECHO DEP METS document as a
PREMIS metadata-transformation event, and the newly generated metadata is added to the METS document as alternate descriptive metadata. A Repository Package zip file is then created, consisting of the Master METS document, all the subordinate METS snapshot documents and the content data files, as well as any repository-specific metadata files.
6. Send Repository Package to Repository Y via LRCRUD
At this last step in our example, we have an EPrints-specific LRCRUD Service running on a remote server with an EPrints repository. The LRCRUD Client sends a request to the LRCRUD Service to create (POST) a new package. The LRCRUD Service relays the create request to the repository and, using the repository's native methods, creates an empty record. The LRCRUD Service receives a new location identifier, or handle, corresponding to the newly created location in the repository, which it sends back to the LRCRUD Client. This location identifier is inserted into the package's ECHO DEP METS document as the primary ID for the METS document.
The LRCRUD Client then sends a request to the LRCRUD Service to update (PUT) the new package at that location. The LRCRUD Client calculates file size and checksum values for the package zip file before sending it to the Service, and it transmits these values as Content-MD5 and Content-Length HTTP header fields along with the package. As the LRCRUD Service receives the package zip file from the Client, it calculates its own file size and checksum values and validates them against the HTTP header fields to ensure the package was unharmed during the file transfer.
Once the LRCRUD Service has validated the file transfer, it unzips the package and ingests each of its contents, including the package METS files, into the repository using the repository's native submission routine. The repository-specific descriptive metadata that was generated in Step 5 above is submitted to the repository as well. Once the package has been fully ingested, the LRCRUD Service returns an update response message to the LRCRUD Client confirming the successful submission, or an error if the submission failed.
Some repositories allow for certain bitstreams to be given privileged status. In such cases the Master METS and ECHO DEP METS files may receive special status; but in all cases the METS files are preserved along with the other package content files and are treated as first-class objects with regard to the repository. That way, when the package is later retrieved from the repository, none of the metadata pertaining to the state of the package before it was submitted is lost.
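
The transfer check used in Steps 1 and 6 can be sketched in a few lines. This is an illustrative Python sketch, not the HandS code (which is written in Java); the function names are hypothetical, and it assumes the Content-MD5 value is the base64-encoded MD5 digest, as RFC 1864 defines that header, since the report does not state the encoding.

```python
import base64
import hashlib


def transfer_headers(payload: bytes) -> dict:
    """Build the two integrity headers the sender attaches to a package zip."""
    digest = hashlib.md5(payload).digest()
    return {
        # Assumption: base64 of the raw 128-bit digest, per RFC 1864.
        "Content-MD5": base64.b64encode(digest).decode("ascii"),
        "Content-Length": str(len(payload)),
    }


def verify_transfer(payload: bytes, headers: dict) -> bool:
    """Recompute size and checksum on the receiving side and compare them."""
    expected = transfer_headers(payload)
    return (
        headers.get("Content-Length") == expected["Content-Length"]
        and headers.get("Content-MD5") == expected["Content-MD5"]
    )
```

The sender would call `transfer_headers` on the zip bytes before transmission, and the receiver would call `verify_transfer` on the bytes it actually received; only if the check succeeds is the package unzipped.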
3.2.5.3. Workflow Recap
Through the workflow process described above, HandS provides tools to facilitate moving digital objects between multiple repositories while generating and maintaining important PREMIS-based technical and provenance preservation metadata. Digital objects are retrieved, converted to a common profile, validated,
enriched with metadata, transformed into a repository-compatible form, and ingested into the target repository.
3.2.6. HandS Technical Implementation
The key technical components of the HandS implementation are the Hub and Spoke METS profile Java classes, providing an extensible Java API to our METS profile with Apache XMLBeans; the To- and From-Hub Packager modules, facilitating interoperability through pluggable interfaces; and the Lightweight Repository CRUD Service (LRCRUD), supporting the dissemination and submission of objects by defining a protocol for transmitting digital objects to and from repository systems over HTTP.
3.2.6.1. Hub and Spoke METS Profile API
The core of the HandS Tool Suite is our METS Profile API, a Java code representation of a METS XML document compiled from our METS profile. The bulk of our METS classes were created with Apache XMLBeans (http://xmlbeans.apache.org/), a tool for generating Java classes from XML schema files (XSD files). With XMLBeans, we are able to compile XML schema documents to produce a Java code structure, allowing us to work with XML data through our own Java classes and methods. To create our METS profile API, we combine methods from XMLBeans-generated classes from the METS, MODS, and PREMIS schemas, along with format-specific preservation metadata schemas like MIX, TextMD, and AudioMD. We also layer custom-built convenience methods on top of the XMLBeans-generated methods to facilitate additional manipulation of the METS document in a fashion unique to our METS profile.
A new HandS Profile Java object can be created from scratch given a set of content files and accompanying metadata, or by instantiating an existing XML METS document that conforms to our profile. Once instantiated, the underlying METS document object can be operated upon programmatically through API calls. In working with the API, we are assured that any manipulation of the METS document will always be consistent with our METS profile. For instance, to add a new file to the preservation package, a call is made to the addFile() method, which in turn triggers calls to other methods that ensure the METS object remains consistent with our profile, such as adding a new PREMIS Object techMD section associated with the new file, and generating checksum, MIME-type, and file size values. At any time the METS object can be validated against our profile, or re-serialized as XML and saved to the file system.
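
The idea that a single call keeps the document consistent can be illustrated with a small Python analogue. It is only a sketch: the real API is Java built on XMLBeans, and every class, field, and method name below (`ProfileDocument`, `add_file`, `is_consistent`, and so on) is hypothetical rather than taken from the HandS code base.

```python
import hashlib
import mimetypes
from dataclasses import dataclass, field


@dataclass
class TechMD:
    """Stand-in for a PREMIS object techMD section (hypothetical shape)."""
    tech_md_id: str
    checksum_md5: str
    size_bytes: int
    mime_type: str


@dataclass
class FileEntry:
    """Stand-in for a file element that points at its techMD section."""
    file_id: str
    path: str
    tech_md_id: str


@dataclass
class ProfileDocument:
    files: list = field(default_factory=list)
    tech_md: dict = field(default_factory=dict)

    def add_file(self, path: str) -> FileEntry:
        """Add a file and, in the same call, the technical metadata the profile requires."""
        with open(path, "rb") as handle:
            data = handle.read()
        number = len(self.files) + 1
        guessed, _ = mimetypes.guess_type(path)
        md = TechMD(
            tech_md_id=f"TECHMD{number}",
            checksum_md5=hashlib.md5(data).hexdigest(),
            size_bytes=len(data),
            mime_type=guessed or "application/octet-stream",
        )
        entry = FileEntry(file_id=f"FILE{number}", path=path, tech_md_id=md.tech_md_id)
        self.tech_md[md.tech_md_id] = md
        self.files.append(entry)
        return entry

    def is_consistent(self) -> bool:
        """Every file must reference an existing techMD section, one per file."""
        linked = {entry.tech_md_id for entry in self.files}
        return linked == set(self.tech_md) and len(linked) == len(self.files)
```

Because callers can only add files through `add_file`, they cannot create a file entry without its matching technical metadata, which is the property the paragraph above attributes to the profile API.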
3.2.6.2. To- and From-Hub Packagers
To facilitate repository interoperability, the HandS Tool Suite includes a set of packager classes for transforming a collection of preservation items into a Hub Package, and for transforming a Hub Package into a form required for submission
into a given repository. For items entering the Hub and Spoke from a digital repository, the repository-specific To-Hub packager takes the native repository dissemination, unpacks it, and instantiates a new HandS METS Profile object from its contents. Going the other way, a repository-specific From-Hub packager prepares a Hub Package for submission into the particular repository. Currently, we have To-Hub packagers for processing items coming from DSpace, EPrints, OCLC's Web Archives Workbench, or from a directory of files. Our current list of From-Hub packager targets includes DSpace, EPrints, and the Library of Congress archive standard BagIt. To- and From-Hub packager modules for the Fedora repository are currently in development.
We have employed a pluggable architecture for creating packager modules. Base To- and From-Hub classes are implemented in Java as abstract classes with the intention that they will be overridden and extended by other programmers needing to tailor the HandS Tools to their specific repository or archiving standard. This modular architecture allows other developers to create packager plug-ins for their own repository systems without having to recompile or refactor the existing HandS code base.
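
The plug-in pattern described above might look roughly like the following. This is a Python sketch of the general abstract-base-class technique, not the project's Java classes; the class names, the `mets.xml` filename, and the `PACKAGERS` registry are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class ToHubPackager(ABC):
    """Base class: turns a repository dissemination into a simple hub package."""

    def package(self, dissemination: dict) -> dict:
        """Shared workflow; subclasses supply only the repository-specific parts."""
        descriptive = self.extract_descriptive_metadata(dissemination)
        skip = self.metadata_filenames()
        content = {name: data for name, data in dissemination.items() if name not in skip}
        return {"descriptive": descriptive, "content": content}

    @abstractmethod
    def metadata_filenames(self) -> set:
        """Names of files that are repository metadata rather than content."""

    @abstractmethod
    def extract_descriptive_metadata(self, dissemination: dict) -> dict:
        """Pull descriptive metadata out of the dissemination."""


class DirectoryToHubPackager(ToHubPackager):
    """Hypothetical plug-in for a plain directory of files with no metadata."""

    def metadata_filenames(self) -> set:
        return set()

    def extract_descriptive_metadata(self, dissemination: dict) -> dict:
        return {"title": "untitled"}


class ExampleRepositoryToHubPackager(ToHubPackager):
    """Hypothetical plug-in for a repository that ships a 'mets.xml' file."""

    def metadata_filenames(self) -> set:
        return {"mets.xml"}

    def extract_descriptive_metadata(self, dissemination: dict) -> dict:
        return {"source_metadata": dissemination["mets.xml"].decode("utf-8")}


# New plug-ins register here; the base class itself is never modified.
PACKAGERS = {
    "directory": DirectoryToHubPackager,
    "example-repository": ExampleRepositoryToHubPackager,
}
```

Supporting another repository then means writing one more subclass and adding it to the registry, which mirrors the report's goal of extension without changes to the existing code base.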
3.2.6.3. Lightweight Repository CRUD Service (LRCRUD)
The Lightweight Repository CRUD service specification defines dissemination and submission web-service interfaces to digital repository systems for use with the ECHO DEP Hub and Spoke Tool Suite. The LRCRUD specification defines a protocol for transmitting digital objects to and from repository systems over HTTP. It enables users to obtain objects in a format expected by the HandS processing scripts and supplies digital objects to repositories in a format expected by their native ingestion mechanisms. The specification is implementation-agnostic: it simply defines the parameters and responses required to enable a service implementation to communicate with the LRCRUD client application. This allows LRCRUD implementers to choose the most appropriate environment and programming languages for interacting with their chosen repository. The HandS Tool Suite currently has LRCRUD implementations for DSpace, EPrints, and Fedora.
The LRCRUD Service follows Representational State Transfer (REST) conventions. It exposes CRUD actions on repository content over the HTTP protocol. As mentioned above, CRUD is an acronym for Create, Retrieve, Update, and Delete: the basic operations that applications should implement when acting upon persistent storage such as relational database management systems and file systems. The LRCRUD client communicates with the LRCRUD service via HTTP methods, status codes, and headers. The list below shows how the CRUD actions are mapped to the HTTP methods:
• Create == POST
• Retrieve == GET
• Update == PUT
• Delete == DELETE
In most cases the LRCRUD service will reside on the same host as the repository it serves so that it has access to the repository's API. LRCRUD is essentially a "dumb" packager; it is simply a way to supply files to the remote repository in any format or configuration that it can natively ingest. In this it is similar to protocols like the Simple Web Service Offering Repository Deposit, or SWORD (Allinson, François, & Lewis, 2008), which are being adopted by repositories, and which may make the submission function of LRCRUD ultimately unnecessary.
The step-by-step examples that follow clarify the functions of the LRCRUD components within the Hub and Spoke Tool Suite by describing in detail the interactions between the client and the service.
3.2.6.4. LRCRUD Functions: Examples
3.2.6.4.1. Dissemination
Figure 8: LRCRUD Dissemination
Dissemination (see Figure 8) is the act of retrieving an item from a repository, whereby the item is defined as an intellectual entity comprising any number of content streams, metadata streams, and other supporting artifacts. Items disseminated from a repository using the LRCRUD service are most likely bound for processing and transformation by the HandS Tool Suite To-Hub packager. The packager creates METS files conformant to the HandS profile, extracts and augments technical metadata, and records provenance information. Described below are the four major steps in negotiating dissemination:
1. The LRCRUD client submits an HTTP GET request to the LRCRUD service. The GET request provides the ID of the item desired via the LRCRUD service URL syntax.
2. The service calls the repository's native dissemination routine for the ID indicated.
3. The service receives the output from the dissemination and adds the entire content into a zip file.
4. The service returns the zip file containing the "package" to the client.
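
Steps 3 and 4 amount to bundling whatever the repository emits into a single archive that the client can later unpack. A minimal Python sketch of that bundling, using only the standard library, is shown below; the function names are hypothetical and the in-memory dictionary stands in for the files a real dissemination routine would produce.

```python
import io
import zipfile


def zip_dissemination(files: dict) -> bytes:
    """Service side: pack every disseminated file into one zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, data in sorted(files.items()):
            archive.writestr(name, data)
    return buffer.getvalue()


def unzip_package(payload: bytes) -> dict:
    """Client side: recover the individual files from the received archive."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

For the running example, the dictionary passed to `zip_dissemination` would hold four entries: the HTML, CSS, and JPEG files plus the METS file supplied by the repository.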
3.2.6.4.2. Submission
Figure 9: LRCRUD Submission
Submission (see Figure 9) is the act of either 1) adding an item to a repository for the first time; or 2) updating an item already in the repository. Described below is the process of adding an item to a repository for the first time. This is a two-stage process; the first stage reserves an identifier in the system, while the second actually places content in the repository.
Stage 1: Create stub record to reserve an identifier
It is critical to note that the package itself is not uploaded as part of the POST request; rather, the POST request creates only a stub or placeholder record. The reason that the actual package is not uploaded as part of the POST is that the identifier assigned to the package by the repository needs to be embedded in the METS file which is part of the package. The typical sequence of operations to ingest a new package is to use POST to create a new placeholder record and get the identifier for that record. That identifier is then used to update provenance and other metadata that is part of the package, and then the placeholder record is updated or overwritten with the actual package using the PUT action. The major steps in this process are:
1. The LRCRUD client issues a POST request to the LRCRUD service specifying the ID of "where" to create the record (e.g., in a specific collection) if needed.
2. The service calls the repository's native item or ID creation routine.
3. The repository supplies the service with the ID for the newly created record.
4. The service responds to the client with an HTTP 201 "Created" message and returns the ID in the Location: header.
Stage 2: Upload and ingest the item
In this stage, the item is uploaded and placed in the repository. This is exactly the same process used to update an existing item:
1. The LRCRUD client issues a PUT request to the LRCRUD service to replace the package identified by the supplied URI. The entity body of the request will contain a zip file containing the "package" to be ingested.
2. The service unpacks the files and calls the repository's native ingestion routine.
3. The service responds to the client with an HTTP 204 "No Content" message indicating that the request was successful.
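
The two stages can be traced end to end with a small in-memory stand-in for the service. The sketch below is Python and purely illustrative: `InMemoryLrcrudService` and `submit_new_package` are hypothetical names, the identifier format is invented, and the 404 response for an unknown identifier is an assumption not stated in the report. Only the 201 with a Location header and the 204 come from the steps above.

```python
import itertools


class InMemoryLrcrudService:
    """Stand-in for an LRCRUD service; a dict plays the role of the repository."""

    def __init__(self):
        self._records = {}
        self._ids = itertools.count(1)

    def post(self, collection: str):
        """Stage 1: create a stub record and report its identifier."""
        record_id = f"{collection}/{next(self._ids)}"
        self._records[record_id] = None  # placeholder, no content yet
        return 201, {"Location": record_id}

    def put(self, record_id: str, package: bytes):
        """Stage 2: replace the stub (or an existing package) with content."""
        if record_id not in self._records:
            return 404, {}  # assumed behaviour for unknown identifiers
        self._records[record_id] = package
        return 204, {}

    def get(self, record_id: str):
        if self._records.get(record_id) is None:
            return 404, None
        return 200, self._records[record_id]


def submit_new_package(service, collection: str, build_package) -> str:
    """Reserve an ID, embed it in the package, then upload the package."""
    status, headers = service.post(collection)
    if status != 201:
        raise RuntimeError(f"stub creation failed with status {status}")
    record_id = headers["Location"]
    package = build_package(record_id)  # the ID must be known before packaging
    status, _ = service.put(record_id, package)
    if status != 204:
        raise RuntimeError(f"upload failed with status {status}")
    return record_id
```

Passing the package as a callable (`build_package`) makes the ordering constraint explicit: the package cannot be finalized until the repository has assigned the identifier that belongs inside its METS file.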
3.2.7. Lessons Learned
Below, in no particular order, are several key lessons learned during the course of developing the HandS architecture.
3.2.7.1. Merging metadata is potentially risky.
After expending much effort exploring how we might merge different versions of metadata files so as to maintain a single master metadata file, we reached the conclusion that this was a very difficult problem, and potentially dangerous in terms of data loss. This realization led us to our current architecture, which sidesteps the issue of merging metadata into a single file by maintaining multiple Snapshot METS files all referenced from a common Master METS file.
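
The append-only design can be sketched as a master index that only ever gains pointers, with differences between snapshots computed by comparison rather than by merging. This is an illustrative Python sketch; `MasterMets`, `changed_files`, and the checksum dictionaries are hypothetical simplifications of the actual METS documents.

```python
from dataclasses import dataclass, field


@dataclass
class MasterMets:
    """Index of snapshot documents, oldest first; snapshots are never rewritten."""
    snapshots: list = field(default_factory=list)

    def add_snapshot(self, filename: str) -> None:
        self.snapshots.append(filename)

    @property
    def current(self):
        """The most recently added snapshot, or None if there are none."""
        return self.snapshots[-1] if self.snapshots else None


def changed_files(previous: dict, latest: dict) -> dict:
    """Compare {filename: checksum} maps taken from two snapshots."""
    shared = set(previous) & set(latest)
    return {
        "added": sorted(set(latest) - set(previous)),
        "removed": sorted(set(previous) - set(latest)),
        "modified": sorted(name for name in shared if previous[name] != latest[name]),
    }
```

Since older snapshots are left untouched, a faulty comparison can at worst produce a misleading change report; it cannot destroy metadata, which is the risk that merging carried.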
3.2.7.2. METS supports multiple metadata formats well.
Combining PREMIS, MODS, and other XML-based technical metadata formats into a single METS document worked well for this particular project. The general structure of METS seemed to lend itself to constructing preservation packages. Our conceptual model, which was directly influenced by the METS and PREMIS structures, consisted at a high level of the intellectual entity having one or more representations. These representations and all their component parts were the primary focus of the preservation efforts. The METS file itself is treated as the abstract parent representation of the intellectual entity. However, there are also one or more concrete representations consisting of each structMap within the METS file. These representations consist of the relationships embodied in the structMap (and possibly the related structLink sections); the files and bitstreams referenced from the structMap; and the associated descriptive metadata (dmdSec), which could be referenced via the structMap or via individual files or bitstreams. All remaining parts of the METS document, primarily the header (metsHdr) and administrative metadata (amdSec) sections, are not considered part of the intellectual entity's representations but are, instead, metadata about these representations, mostly concerned with preservation and thus having a strong focus on technical and provenance metadata.
There were pragmatic challenges in getting these disparate metadata standards to work together, however, and the next subsection describes one such example.
3.2.7.3. Implementing PREMIS in METS requires high-level structural decisions.
Embedding PREMIS metadata within a METS package was not an intuitive undertaking. There were several reasons for this. Among these were various overlaps in the metadata fields supported by each standard. When faced with these overlaps our general approach was to provide the metadata in both places. Although this approach introduced duplication and the opportunity for inconsistencies into the metadata, we feel the added flexibility in processing
compensated for these shortcomings. Moreover, the HandS tool suite validation steps ensured that these types of inconsistencies were not present.
Decisions were also required as to where to embed the PREMIS entities within the METS file. While these entities are clearly all administrative metadata, they do not always fit neatly within one of the four subgroups, techMD, digiprovMD, sourceMD, or rightsMD, provided by METS. Refer to the ECHO DEP METS profiles (Habing, 2005) for details. Project staff also participated in a working group chaired by Rebecca Guenther at the Library of Congress to address this issue. The working group produced a report, Guidelines for using PREMIS with METS for exchange (Guenther, 2008).
3.2.8. Next Steps: the Hub and Spoke
Development of the Hub and Spoke tool suite is ongoing. The latest versions of the source code can be downloaded from the project's SourceForge website: http://sourceforge.net/projects/echodep/. Recent developments include the addition of a From-Spoke for the BagIt specification (Boyko, Kunze, Littman, & Madden, 2008) and modifications to support version 1.5 of DSpace. Work is continuing apace on both From- and To-Spokes for the Fedora repository, with particular attention being paid to how our METS profile might be accurately mapped to a Fedora content model, reducing the need for potentially lossy mappings as have thus far been required for other repository systems. The project is also looking at other potential repositories, such as LOCKSS or CONTENTdm, for Spoke development.
In addition to developing new Spokes, we are also monitoring developments with the next version of JHOVE, as well as with the Global Digital Format Registry (GDFR), to explore how these tools might be used to enhance the format-specific technical metadata we are currently generating for different file types.
3.2.8.1. Supporting Preservation Now and in the Future
The Hub and Spoke (HandS) framework enhances the interoperability and preservation features of existing open-source repository systems. It provides a suite of tools to facilitate moving digital objects between repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. It is intended to support curators' efforts today to manage content in multiple repository systems and to preserve valuable preservation data in accordance with emerging digital preservation standards.
In the long term, however, we see the need for the next generation of digital repositories to do more in order to support our ability to preserve the meaning of the digital objects maintained in repositories. Current repository systems preserve the structures of digital objects, from which meaning or semantics must be inferred. Learning from real-world data migration examples from the HandS efforts, GSLIS and NCSA researchers are working to model how semantic inference capability may help next-generation archives preserve the meaning (not just the structures) of
digital objects and head off longer-term preservation risks. Specifically, we are developing automated reasoning techniques targeted at identifying, and eventually correcting, problematic metadata descriptions. This work is profiled separately in the next section (Section 4).
3.2.9. Conclusion
With digital preservation still in its infancy, many changes to emerging standards, strategies, and methodologies can be expected in the coming years. The Hub and Spoke framework provides a model that attempts to incorporate current technologies and best practices from the field to support digital preservation in current repository environments. It implements METS and PREMIS to provide a standards-based method for packaging content that allows digital objects to be moved between repositories more easily while supporting the collection of technical and provenance information crucial for long-term preservation. HandS is intended to help curators of digital objects today by providing improved support for preservation and interoperability to existing repository systems. Ultimately, in order to meaningfully preserve our digital content over time, we will need the next generation of preservation tools to support automatic inference of meaning, or semantics, from changed (and thus potentially ambiguous) information structures.
4. Preserving Meaning, Not Just Objects: Semantics and Digital Preservation
A key goal of the ECHO DEPository project is to investigate both practical solutions for supporting digital preservation activities today, and the more fundamental research questions underlying the development of the next generation of digital preservation systems. Earlier in this report, we reviewed two areas of activity that aim to support on-the-ground preservation efforts in existing technical and organizational environments: the Web Archives Workbench, a suite of tools to help curators collect and manage web-based digital resources; and the HandS tool suite, which aims to enhance existing repositories' support for interoperability and emerging preservation standards. In the longer term, however, we recognize that successful digital preservation activities will require a more precise and complete account of the meaning of relationships within and among digital objects. This section describes project efforts to identify the core underlying semantic issues affecting long-term digital preservation, and to model how semantic inference may help next-generation archives head off long-term preservation risks.
4.1. Introduction: The Need for a Semantics of Preservation Approach
4.1.1. The Preservation Semantics Problem
Like any information management activity, digital preservation efforts are guided by human understanding. Decisions about documenting a file format, emulating an environment, or migrating from one system to another are made with an understanding of how levels of digital expression cascade and interrelate: voltage, bit, octet, pointer, integer, grapheme, pixel, polygon, color, pitch, text string, tree, image, tuple, file, and so on. The complexity of these relationships poses few serious problems for human beings; in fact, the problems lie precisely in the ease with which our minds interpret those relationships. Long-term preservation is distributed not only over time but also across the responsibilities of many different people. It is directed at collections much too large to allow thoughtful attention to individual resources. We must therefore build into our tools a more careful and precise encoding of the knowledge that guides our effortless mental deductions. The preservation hazards that result from current descriptive practice and our experiments with automated tools to ameliorate those risks are described in the sections that follow.
4.1.2. Our Goal
Our goal is to understand better the semantic problems arising in digital preservation, and how we might apply that understanding to the development of resources and tools. Specifically, we are experimenting with automated inferences about entities, their properties, the relationships within and between them, and how these facts are expressed in metadata descriptions. Enriching that metadata with new deduced assertions is one step in heading off digital preservation risks. We are working toward a deductive system for reasoning about anomalous or incomplete metadata. The aim is not to automatically deduce all missing information or to correct malformed records, but to call human attention to descriptions that are problematic or suspicious.
Our work begins with an analysis of the kinds of semantic problems posed by current descriptive practice and metadata schemas, informed by analyses of real-world data migration examples. We have applied the understanding gained in this analysis to the development of a draft metadata ontology (discussed in Section C), which moves us toward a more formal understanding of how descriptive information about archived digital resources is structured. This metadata ontology is key to a proof-of-concept experimental system composed of the RDF repository Tupelo and the BECHAMEL reasoning software.
4.2. The Problems: Understanding Semantic Preservation
4.2.1. Problems Posed by Descriptive Practice and Structures
In many preservation efforts metadata description may seem straightforward, but crucial information, including facts that seem obvious at first glance, is left unstated, and must be inferred by human readers. (An example is provided next.) As discussed above, this situation may not be risky when people are available to reason about individual records, but a human-based manual approach does not scale over large collection sizes or over time. The sheer volume of digital information means we increasingly rely on automated machine processing of records. But software tools execute transactions using only knowledge that has been explicitly represented for them. Our aim therefore is to make those unstated facts available in a form that software can use.
This work begins with an investigation of the kinds of semantic problems posed by current information structures and implementations. These problems break down into three basic categories:
• Semantic problems relating to descriptive practice
• Semantic problems relating to encoding standards
• Semantic problems relating to metadata schema design
4.2.1.1. Semantic problems relating to descriptive practice.
Some of the problems we face are a result of how resources are described using metadata, while other problems arise by way of how those descriptions are expressed, and what happens to them over time as they are migrated from one system to another. One semantic problem of particular interest to us is what Renear et al. (2002) describe as "ontological variation in reference." Essentially, metadata can fail to make critical distinctions in what, precisely, it is describing. The problem is illustrated in the metadata example below, which shows properties asserted at a number of different levels of abstraction.
Figure 10: Example of Multiple Levels of Abstraction in Metadata Description
We see in this example properties of the image itself (like its title and subject matter in lines 8 and 23) described alongside properties of the file which encodes the image (its MIME classification in lines 2 and 12), properties of the metadata description (its creation date in line 28), and properties of the repository software object that
expresses the metadata description (e.g., that it disseminates resources, and has particular datastreams associated with it; lines 1, 3, 7, 11, 18, and 22).
The main preservation risk proceeding from this mixing of levels is the inability to distinguish, without semantic information absent in the description, the level at which a particular property applies. For example, what is it exactly that has a MIME classification image/jpeg? Is it the Fedora record or is it one or both of the datastreams? A human reader can easily resolve that kind of ambiguity without conscious effort, but preservation transactions (such as migration), which are typically executed through software, cannot. The preservation aim here is presumably to preserve access to the image. That aim may or may not depend on preserving the JPEG file expressing the image, and preserving the Fedora object that expresses the metadata is almost certainly not a requirement.
This example therefore illustrates the problem of mixed levels of description. We need to clarify and enrich metadata descriptions by linking their assertions explicitly to the appropriate entities, or else draw the attention of human analysts to records that cannot be disambiguated automatically.
4.2.1.2. Semantic problems relating to encoding standards.
In addition to problems of descriptive practice, we face semantic problems stemming from limitations of the encoding technologies in which metadata descriptions are expressed. These problems generally fall into one of the following two categories:
4.2.1.2.1. Syntactic overloading in conventional markup.
Familiar encoding technologies for metadata descriptions, such as those based on XML, work well most of the time, but they have certain fundamental problems, such as those stemming from the use of multiple competing semantic relationships and of unstructured data. Specifically:
• Competing semantic relationships: Preservation metadata formats typically overload a simple syntax with multiple competing semantic interpretations. Typical examples include XML applications where a small number of syntactic relationships (e.g., the parent/child relationship between elements) represent any number of semantic relationships (whole/part, property name/value, etc.) that are context dependent. Often a precise interpretation of these semantics can be found only in the execution of application software that consumes the file and, presumably, in the mind of the programmer who wrote the application.
• Unstructured data: The information in resource descriptions may only be incompletely available for machine processing and verification. Crucial
ECHODEPositoryTechnicalArchitecturePhase1FinalReportNarrativeReportNationalDigitalInformationInfrastructurePreservationProgramUniversityofIllinoisatUrbana‐ChampaignwithOCLC
62
contextualdatamayexistonlyasnaturallanguageannotationsorasunstructuredinformationinthecontentofmetadatafields.
The metadata example presented in Figure 10 does not exhibit problems of syntactic overloading, because it conforms to a standard serialization of the RDF abstract model in which properties and relationships are explicitly identified. But the second problem is evident in how much information in this description is expressed in natural language text and annotations. (Note, for example, the dc:date element (line 28), in which the documented event (scanning and processing) is prepended to the date string.)
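As a minimal sketch of the first problem (syntactic overloading), the following Python fragment parses a small, hypothetical XML record that is not drawn from Figure 10. The element names and the INTERPRETATION table are assumptions made for illustration. The point is only that a generic parser reports parent/child pairs, while the semantic relationship each pair carries has to be supplied from outside the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical record (not from the report): the same parent/child syntax
# is used for two different semantic relationships.
RECORD = """
<book>
  <title>Field Notes</title>
  <chapter>
    <title>Introduction</title>
  </chapter>
</book>
"""

# Assumed, hand-written interpretation table: (parent tag, child tag) -> relationship.
# Nothing in the XML itself states this; it normally lives in application code.
INTERPRETATION = {
    ("book", "title"): "property name/value",
    ("book", "chapter"): "whole/part",
    ("chapter", "title"): "property name/value",
}


def semantic_relationships(xml_text):
    """List (parent, child, relationship) for every parent/child pair."""
    root = ET.fromstring(xml_text)
    found = []
    for parent in root.iter():          # every element, in document order
        for child in parent:            # its direct children only
            relation = INTERPRETATION.get((parent.tag, child.tag), "unknown")
            found.append((parent.tag, child.tag, relation))
    return found


relationships = semantic_relationships(RECORD)
for parent, child, relation in relationships:
    print(f"{parent} -> {child}: {relation}")
```

Removing an entry from the table makes the corresponding pair come back as "unknown", which is roughly the position of migration software that has the file but not the consuming application.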
4.2.1.2.2. Problems with object models.
Other potential semantic problems stemming from limitations of metadata encoding technologies concern the object models of repository systems themselves. Modeling decisions in repository design can create descriptive artifacts that leave their mark even after record migration. For example, a repository may mingle information about repository objects with the information that the repository objects are meant to preserve, creating problems when those records are further processed and contextual information is no longer available to help interpret the records and make further preservation decisions.

A good illustration of this issue can be seen by revisiting the metadata description in Figure 10, which was serialized from triples that were extracted from the RDF database backing a Fedora repository installation. In Figure 10, notice that in RDF terms this entire metadata description is "about" an object identified as info:fedora/changeme:97 (line 1). This repository software object is the only resource identified by an rdf:type arc, and is therefore the only entity with an object class identification. Barring any explicit type identification in a resource description, Fedora objects seem to be the only kind of thing that the Fedora repository knows about. With descriptions expressed in that form, we cannot preserve any information except Fedora records, and those records assert no explicit preservation targets. A system like Fedora can preserve objects within the context of its own transactions, but the implicit knowledge directing such operations depends on the interpretation of programmers, with all the problems discussed so far.

On the other hand, it is not a design flaw of Fedora that its metadata record is centered internally on the Fedora digital object. Preservation ontology is properly a matter of descriptive practice, not software engineering. In fairness, our metadata example comes from a migration scenario in which RDF triples are extracted from Fedora's RDF store directly, rather than through a conventional export process. But this example serves to remind us that object modeling in a system such as Fedora plays the same role, to the same ends, as with other kinds of software: efficient source code management by and for the system developers. Object modeling decisions are not intended, and cannot be expected, to address the weaknesses of resource analysis and description.

For long-term preservation, therefore, it is important to reduce ambiguous or implicit semantics in repository object models. That can mean either modifying those models or, as we have attempted, providing tools and techniques for migrating from repository object models to models that include better representations of preservation targets.
4.2.1.3. Semantic problems with published metadata schemas.
Finally, in addition to semantic problems relating to descriptive practice and to encoding standards, we see semantic problems stemming from limitations of published metadata schemas themselves. Published schemas formalize element sets on which the property/value ascriptions are based. Each of these metadata schemes not only expresses its unique view of the universe but is itself grounded in basic ontological assumptions. A variety of ambiguities can still arise, as illustrated below, drawing again on our running example from Figure 10.

We need to begin by understanding the logical parts of the metadata record and their relationships to one another. A metadata record describes some entity: an instance of a class like the class of books, images, or audio recordings. Metadata descriptions list properties of that entity, each of which has a value. For instance, the “author” property of the book might take as its value the name of the author. Membership in a class requires that the instance respect defined class constraints (movies, for example, have running times, but books do not).

Consider the metadata statement <dc:type>image</dc:type> (line 9) from our original example in Figure 10. We easily recognize that the word "image" points us to an instance of a class, just as the name of an author points us to a particular person. A human reader would never conclude that a book was authored by a name or by a string expressing a name. Similarly, the word "image" licenses our inference that the property value for dc:type is a class of entities in the world rather than, for example, a quantity (such as 14 centimeters) or a quality (like monochrome). In this case we are cued to the existence of not just any entity, but to the very target of our preservation efforts, something much more important to us in the long run than the digital file or the bit sequence that only expresses this image contingently.

Computer software cannot make those kinds of meaningful distinctions without help. One kind of help would be a constraint on the range of allowable property values, but the Dublin Core element schema enforces no such constraint: dc:type can take any value that indicates the "nature or genre of the resource" (DCMI Namespace for the Dublin Core Metadata Element Set, Version 1.1, 2008). A second kind of help would be a value string that, through its machine-readable structure or notation, identifies a class. In an RDF expression this would be a URI linked by an rdf:type property to some class declaration. The DCMI Type Vocabulary has this structure (http://dublincore.org/2008/01/14/dctype.rdf), and interestingly, the scope note for dc:type in the DC Elements RDF schema recommends the use of that vocabulary (http://dublincore.org/2008/01/14/dcelements.rdf).

Had the author of our metadata description used the URI dcmitype:Image instead of the word "image," we would be one step closer to identifying the abstract image as an entity. The word "image," although it contains the same sequence of letters, is not linked in a standardized way to the declaration of a class. Assigning a DCMI type resource to the dc:type element simplifies the inference that an image exists, and that one or more of the metadata statements in that description are ascribing properties of an image: semantically, a significant step. But as our running example stands, the schema's flexibility invites ambiguity, and additional information is necessary to connect the literal value "image" with a formalized class such as dcmitype:Image.
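The difference between the literal "image" and the URI dcmitype:Image can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: the Literal and URIRef types and the LITERAL_TO_DCMITYPE lookup table are hypothetical, and only the two namespace URIs are taken from the published Dublin Core vocabularies.

```python
from typing import NamedTuple

DC_TYPE = "http://purl.org/dc/elements/1.1/type"
DCMITYPE = "http://purl.org/dc/dcmitype/"   # DCMI Type Vocabulary namespace


class Literal(NamedTuple):
    """A plain string value, such as the word "image"."""
    text: str


class URIRef(NamedTuple):
    """A URI reference, such as dcmitype:Image written out in full."""
    uri: str


# Hypothetical lookup table for illustration; a real mapping would need to be
# curated and its results reviewed by an analyst.
LITERAL_TO_DCMITYPE = {"image": "Image", "text": "Text", "sound": "Sound"}


def class_named_by(value):
    """Return the DCMI Type class a dc:type value identifies, or None."""
    if isinstance(value, URIRef) and value.uri.startswith(DCMITYPE):
        return value.uri
    return None   # a literal names nothing in a machine-checkable way


def suggest_class(value):
    """Guess a class URI for a literal; the guess is the 'additional information'."""
    if isinstance(value, Literal):
        term = LITERAL_TO_DCMITYPE.get(value.text.strip().lower())
        if term is not None:
            return DCMITYPE + term
    return None


as_written = Literal("image")                  # what the running example contains
as_recommended = URIRef(DCMITYPE + "Image")    # what the scope note recommends

print(class_named_by(as_written))
print(class_named_by(as_recommended))
print(suggest_class(as_written))
```

The lookup in suggest_class stands in for the background knowledge a human reader applies without noticing; in the sketch it has to be written down explicitly before software can use it.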
4.2.2. Understanding the Semantic Preservation Problem: Summary
We have seen in this section that descriptive practice, encoding standards, and published specifications may all complicate digital object preservation. Imprecise resource descriptions can make it impossible to determine the level at which a particular property applies. The flexibility offered by encoding standards brings risks as well as benefits. We’ve also seen how object modeling decisions and semantically underspecified metadata schemas can lead to incorrect or ambiguous usage.

In the next section, we move from understanding the core semantic problems associated with descriptive practice and structures to looking at the resources and tools being developed by the ECHO DEPository project to identify semantic ambiguity in real-world metadata descriptions and highlight potential preservation risks.
4.3. Toward More Capable Archives and Repositories
4.3.1. Recap: The need for automated inference capability
Digital resource preservation efforts are distributed not only over time but typically across the responsibilities of people who may never consult with one another. Transactions like migration between systems are executed over large collections where close attention to individual records is too expensive, but where correct treatment of a resource often depends on knowledge that is incompletely or imprecisely represented in preservation metadata. Such ambiguities present few problems for human beings: our flexible minds make correct inferences without conscious effort. But the data to support those inferences are not expressed in a form that can guide the execution of our programs and utilities. We therefore need tools and methods that support the discovery and correction of preservation risks.
The next section describes our experiments in developing these methods and tools. Specifically, we look at our development of an ontology of metadata descriptions, the Resource and Description Vocabulary, and how this may be applied to identify semantic ambiguity in metadata descriptions using the reasoning tool BECHAMEL.
4.3.2. BECHAMEL and Building a Metadata Ontology
BECHAMEL is a tool for expressing and testing semantic models of digital resources (Dubin et al., 2003). It has been developed by researchers at the University of Illinois, the World Wide Web Consortium, and the University of Bergen. A BECHAMEL application can, for example, translate the bibliographic metadata for a journal article from one standard format into another by constructing a model of the author's affiliation with an institution (Renear & Dubin, 2003). In our recent and current experiments the inputs to BECHAMEL are metadata descriptions retrieved from an RDF repository (Tupelo), together with schemas defined in the OWL Web Ontology Language (OWL, 2004). New facts deduced from those inputs are added back to the repository as annotations to the description. A technical overview of this approach is presented in the following sections.
4.3.3. Overcoming Semantic Problems in Metadata Encoding: A Resource and Description Vocabulary
Our aim is to enrich metadata with new assertions inferred from existing resource descriptions. Toward that aim we have identified classes, properties, and relationships for overcoming encoding problems, and we have expressed these in a schema. This vocabulary does not represent classes or properties for specific types of resources. Instead, it offers an ontology of metadata descriptions themselves. Simply stated, the vocabulary includes terms that can be used to describe records, metadata descriptions, and relationships between them and preservation targets. (The Resource and Description Vocabulary is provided in Appendix 6.7.) More specifically, the vocabulary is divided into the following sections:
• W3C standard classes and properties
    o These include classes and properties such as rdfs:Resource, rdf:Statement, and owl:ObjectProperty.
• Alternate reification classes
    o Conventional use of the RDF reification vocabulary is based on an understanding that triples stand in a type/instance relationship with "tokens" appearing in RDF documents (RDF Semantics, 2004). But this interpretation, intended to support provenance documentation, presents puzzles for understanding how a serialized expression can stand in direct relationships with resources referred to by an abstract triple. (For those analysts who may be concerned with abusing the official account of RDF reification, the vocabulary includes separate classes for generalized statements, RDF statements, and abstract triples.)
• Indication relationships
    o This section includes a group of hierarchically organized relationships, based on recommendations in Piotr Kaminiski's 2002 thesis. The relationships include indication, representation, denotation, identification, description, depiction, ascription, expression, and encoding.
• Classes based on the DCMI Abstract Model
    o In the Dublin Core Abstract Model, the term "metadata element" is used synonymously with the term "property." But our classes, though based on that model, represent metadata elements as specialized names of properties, rather than as properties themselves. Classes in this section include metadata element, metadata element set, metadata statement, and metadata description.
• Markup structures
    o A third alternative to reifying RDF statements under the official W3C interpretation, or through use of alternate classes, is to reify the notation expressing the RDF. This section of the vocabulary includes classes for XML elements, XML documents, XML schemas, XML attributes, and URIs.
In summary, the Resource and Description Vocabulary is an ontology of metadata descriptions themselves. Its aim is to provide a semantically sound framework for overcoming the encoding problems described in Section 4.2 of this report. The next section walks us through a demonstration of how this ontology, as used by BECHAMEL, can help to highlight potential preservation risks.
4.3.4. Resolving Semantic Ambiguity: an Inference Example
In Section 4.2 of this report we discussed problems of descriptive practice, encoding standards, and schema design. Now we present an illustration of how our inferencing software responds to those problems. In the example below, an ambiguous metadata statement from the record shown in Figure 10 is identified and associated with the implied preservation target it describes. Figure 11 below shows one RDF statement extracted from Figure 10, our running example:
Figure 11: A fragment of the record shown in Figure 10.
This RDF statement view shows the string value "image" assigned to the Dublin Core type property for the Fedora Object identified as info:fedora/changeme:97, as we discussed earlier when viewing the original markup record (Figure 10). The main issue is one of clearly identifying the target of our preservation efforts: an image in this case. Summarizing the concerns discussed earlier:
o The Fedora object is an amorphous resource, which seems to share properties of the image itself, the image content, and the bitstream encoding the image. The Fedora object cannot, therefore, be our preservation target.
o According to the formal schema definition, the Dublin Core Type property indicates the "nature or genre of the resource," but need not identify the existence of any particular concrete object or abstract entity. As already seen, this vagueness in the formal schema opens the door to the use of values (such as the literal string “image”) that are clear to human readers but which pose problems for machine processing.
o Although the word "image" invites a human reader to infer that our preservation target is an image, that information is not explicit enough to support automated processing. The inference depends not only on word meaning but also on the tacit background knowledge that the property value must in this case be a class (rather than, for example, a quality, quantity, or name).
To recap then, this image (Figure 11) illustrates the relationships between the Fedora object, the DC element “type”, and the value “image” ambiguously expressed in the original record (Figure 10). In the next step, we begin to clarify these relationships. Figure 12 below shows the first inference stage:
Figure 12: BECHAMEL has identified the fragment as a metadata statement.
This RDF statement shows that the original dc:type arc has been identified as a metadata statement, and a new TAG URI has been generated to denote that statement. Although this stage of the processing began with conventional RDF reification, our assignment of rdf:subject, rdf:predicate, and rdf:object properties to our new Metadata_Statement instance is a departure from orthodox RDF semantics. This first stage of inference processing has identified the metadata statement. In the next stage we take this a step further to identify the preservation target.
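The first inference stage can be sketched as follows, with triples modeled as plain Python tuples. This is an illustration, not BECHAMEL's implementation: the RDV namespace URI, the tag authority and date, and the hash-based naming scheme are all assumptions. Only the RDF namespace, the dc:type URI, and the Metadata_Statement class name come from the standards and the text above.

```python
import hashlib

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_TYPE = "http://purl.org/dc/elements/1.1/type"
RDV = "http://example.org/rdv#"   # hypothetical namespace for the vocabulary


def mint_tag_uri(triple, authority="example.org", date="2009"):
    """Mint a repeatable tag URI for a triple (authority and date are assumed)."""
    digest = hashlib.sha1("\n".join(triple).encode("utf-8")).hexdigest()[:12]
    return f"tag:{authority},{date}:statement-{digest}"


def identify_metadata_statement(triple):
    """Describe one triple as a Metadata_Statement with subject/predicate/object arcs."""
    subject, predicate, obj = triple
    statement = mint_tag_uri(triple)
    annotations = [
        (statement, RDF + "type", RDV + "Metadata_Statement"),
        (statement, RDF + "subject", subject),
        (statement, RDF + "predicate", predicate),
        (statement, RDF + "object", obj),
    ]
    return statement, annotations


# The fragment discussed above; the literal is written with its quotes
# so that it cannot be mistaken for a URI.
original = ("info:fedora/changeme:97", DC_TYPE, '"image"')
statement, annotations = identify_metadata_statement(original)

print(statement)
for annotation in annotations:
    print(annotation)
```

Deriving the identifier from the triple's content means that running the step twice yields the same tag URI, so repeated inference passes would not pile up duplicate statement nodes. That is a design choice of this sketch rather than something the report specifies.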
Figure 13: BECHAMEL has identified the metadata statement as a description of an image.
Figure 13 shows the identification of the preservation target. The system infers that this metadata statement must be describing an abstract image that has both a class identity and an object identity distinct from the JPEG file, the bitstream encoding that file, the geography depicted in the image, and the Fedora object that serves as the locus for property attributions at all those levels. In addition, the metadata statement is identified as belonging to a metadata description. Identifying the preservation target should simplify the validation of later preservation transactions, making it easier to verify that essential properties persist across migrations and through translations from one format to another.
4.3.5. Automated Inference as a Preservation Service
The ontology and inferences that it supports allow us, even in cases where metadata records are terse and incomplete, to recover important distinctions, such as the distinction between a person and the metadata record describing that person. This knowledge is expressed in a portable syntax (RDF/OWL) with explicitly-defined semantics, so it can be maintained without having to modify the original record or transform it into another syntax (either of which could introduce further preservation risks). Indeed, BECHAMEL’s ability to read from and write to RDF databases (using Tupelo) means that it can read metadata records, apply rules and infer new assertions, and write those assertions back to the RDF database without altering the original records in any way.

The “open world” of RDF/OWL means that automated inference can become a part of the preservation process without requiring that we redesign and reimplement institutional repositories to accommodate it. Instead, inference is a kind of service that can be used alongside those tools to head off preservation risks and fill gaps in representation. The next section looks more closely at the architecture and proof-of-concept implementation of an archive that augments an institutional repository with inference capabilities and services.
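The idea of inference as a service that adds to, but never rewrites, the original records can be sketched as below. The rule, the "#abstract-image" identifier suffix, and the rdv:describes property are hypothetical stand-ins chosen for the example, and the single toy rule is far simpler than the OWL-based reasoning described above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC_TYPE = "http://purl.org/dc/elements/1.1/type"
DCMITYPE_IMAGE = "http://purl.org/dc/dcmitype/Image"
RDV_DESCRIBES = "http://example.org/rdv#describes"   # hypothetical property


def infer_image_targets(triples):
    """Toy rule: a dc:type of "image" implies a distinct abstract image."""
    inferred = set()
    for subject, predicate, obj in triples:
        if predicate == DC_TYPE and obj == '"image"':
            target = subject + "#abstract-image"     # hypothetical identifier
            inferred.add((target, RDF_TYPE, DCMITYPE_IMAGE))
            inferred.add((subject, RDV_DESCRIBES, target))
    return inferred


def run_inference_service(records, rules):
    """Return only the new assertions; `records` is read but never modified."""
    annotations = set()
    for rule in rules:
        annotations |= rule(records)
    return frozenset(annotations - records)


# The original records are held in a frozenset, so they cannot be altered.
records = frozenset({("info:fedora/changeme:97", DC_TYPE, '"image"')})
annotations = run_inference_service(records, [infer_image_targets])

for triple in sorted(annotations):
    print(triple)
```

Because the annotations are returned as a separate set, they can be stored alongside the records, reviewed, or discarded without touching the originals.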
4.4. System Architecture
We respond to the practice, standardization, and technical problems previously outlined in two ways:
o First, we design our systems for a world where metadata will vary greatly in their completeness, expressivity, and consistency. Preservation risks will arise, and we build tools with the aim of ameliorating those problems.
o Second, we propose an architecture for repositories that we hope will support more effective resource description and encoding: one that includes capabilities and services that will be needed in the next generation of digital content management systems.
4.4.1. Architecture: Overview
The proposed architecture augments typical institutional repository architectures with two new capabilities:
o The ability to manage not just bitstreams and associated metadata, but also associated semantics, expressed in standard RDF and OWL syntax.
o Automated services for detecting and/or correcting semantic ambiguity in metadata descriptions.
4.4.1.1. Architecture: the Tupelo model
Tupelo is a middleware component providing semantic content management for distributed, heterogeneous applications. By middleware, we mean that Tupelo provides abstractions (known as contexts) that encapsulate different storage and retrieval technologies for data and metadata, including file systems, web services, relational databases, and RDF stores. By way of these contexts, applications can exchange RDF statements and access raw octet streams associated with them.

Tupelo can therefore serve the same role as a content management system (CMS) or institutional repository. But Tupelo differs from these systems in making only minimal assumptions about the structure of the information it manages, allowing applications to encode that structure as explicit RDF statements. RDF's open-world assumption and use of Uniform Resource Identifiers mean that Tupelo can assemble descriptions from multiple, independent sources, even if those sources are not otherwise coordinated.

Tupelo was originally designed to support science applications where data is produced, processed, and transformed by multiple people and software components. Such applications require preservation of workflow traces and the tracking of relationships between raw input and output results across distributed systems. These same challenges arise in digital preservation, where critical transformations may occur outside of the control of a repository system, or within metadata whose semantics are known at one stage of the process and unknown at another. Such transformations are distributed, heterogeneous processes, and tying a digital artifact to the process in which it participated requires portable, globally-scoped identifiers that can be managed independently of the process itself. RDF usage enforces the global scope of identifiers by using URIs to identify nodes.
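The notion of a context can be illustrated with a small Python sketch. It is not Tupelo's actual API: the Context interface, its method names, the in-memory implementation, and the example URIs and values are all hypothetical. The sketch shows only the two ideas named above, namely a uniform interface over different stores for triples and octet streams, and the assembly of one description from independent sources that share nothing but a URI.

```python
from abc import ABC, abstractmethod


class Context(ABC):
    """Hypothetical storage abstraction: RDF-style triples plus raw octet streams."""

    @abstractmethod
    def add_triples(self, triples): ...

    @abstractmethod
    def match(self, subject=None, predicate=None, obj=None): ...

    @abstractmethod
    def put_blob(self, uri, data): ...

    @abstractmethod
    def get_blob(self, uri): ...


class MemoryContext(Context):
    """In-memory stand-in for a file system, database, or web service."""

    def __init__(self):
        self._triples = set()
        self._blobs = {}

    def add_triples(self, triples):
        self._triples.update(triples)

    def match(self, subject=None, predicate=None, obj=None):
        # None acts as a wildcard in any position of the pattern.
        pattern = (subject, predicate, obj)
        return {t for t in self._triples
                if all(want is None or want == got for want, got in zip(pattern, t))}

    def put_blob(self, uri, data):
        self._blobs[uri] = bytes(data)

    def get_blob(self, uri):
        return self._blobs[uri]


def describe(uri, contexts):
    """Assemble one description of `uri` from several uncoordinated contexts."""
    description = set()
    for context in contexts:
        description |= context.match(subject=uri)
    return description


DC = "http://purl.org/dc/elements/1.1/"
IMAGE = "http://example.org/aerial/0001"   # hypothetical resource URI

catalog, scanner = MemoryContext(), MemoryContext()
catalog.add_triples([(IMAGE, DC + "title", '"Aerial photograph"')])
scanner.add_triples([(IMAGE, DC + "format", '"image/jpeg"')])
scanner.put_blob(IMAGE, b"\xff\xd8\xff")   # placeholder bytes, not a real image

description = describe(IMAGE, [catalog, scanner])
for triple in sorted(description):
    print(triple)
```

The two contexts never refer to each other; the shared, globally scoped URI is what lets their statements be merged.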
4.4.1.2. Connecting BECHAMEL to Tupelo
Our BECHAMEL client application retrieves an XML-serialized subgraph of the repository contents from Tupelo via Tupelo’s HTTP-based client/server protocol, which is based on extending Nokia’s proposed URIQA protocol (http://sw.nokia.com/uriqa/URIQA.html). The subgraph is submitted to BECHAMEL, together with supporting OWL ontologies and standardized RDFS vocabularies (e.g., Dublin Core). New RDF statements and annotations emerging from BECHAMEL's execution (see the inference example in Figures 11-13) are then delivered back to the Tupelo server.
4.4.1.3. Observations on Implementation
We predict that inferential capabilities (such as those illustrated earlier in this section), like the characteristics of the Tupelo architecture, will be basic services provided by and for future digital repositories. But the functional components of those repositories will be loosely coupled and distributed. Interpretive services are, furthermore, needed right away for systems based on current Content Management System technologies, and to aid in reforming descriptive practice as it stands today. For all these reasons, we have sought in our implementation to make the interpretation component a structurally distinct layer, communicating with the Tupelo middleware via general-purpose client/server protocols such as HTTP. While we assume the resource descriptions and inferred knowledge will conform to the RDF abstract model, we have chosen to deliver them in conventional serialized forms, such as RDF/XML.

As with any similar project, a variety of engineering challenges require further experimentation and improvement. For example, at the BECHAMEL application layer, all RDF statements are expressed as if part of a single global graph, whether retrieved from Tupelo, parsed from an RDFS vocabulary, inferred by BECHAMEL itself, or drawn from any other source. But obviously, only a finite amount of input knowledge can be efficiently shared over the network between client and server. Our interpretive rules are themselves separate from the strategies for selecting, retrieving, and storing RDF statements, but pragmatically they cannot be totally independent of each other.
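One way to picture the selection problem is a depth-limited walk outward from a resource, capped at a fixed number of statements. The sketch below is an assumed strategy shown for illustration only; the report does not say that BECHAMEL or Tupelo selects subgraphs this way, and the "ex:" identifiers in the sample graph are hypothetical.

```python
from collections import deque


def bounded_subgraph(triples, start, max_depth=2, max_triples=1000):
    """Collect triples reachable from `start`, limited in depth and in size."""
    by_subject = {}
    for triple in triples:
        by_subject.setdefault(triple[0], []).append(triple)

    selected = []
    seen = {start}
    queue = deque([(start, 0)])            # breadth-first: (node, hops from start)
    while queue and len(selected) < max_triples:
        node, depth = queue.popleft()
        for triple in by_subject.get(node, []):
            if len(selected) >= max_triples:
                break
            selected.append(triple)
            obj = triple[2]
            # Follow the object only while the depth budget allows it.
            if depth + 1 < max_depth and obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return selected


# Hypothetical graph: a record, one datastream, and increasingly remote context.
GRAPH = [
    ("ex:record", "ex:hasDatastream", "ex:ds1"),
    ("ex:record", "dc:type", '"image"'),
    ("ex:ds1", "ex:mimeType", '"image/jpeg"'),
    ("ex:ds1", "ex:storedAt", "ex:server"),
    ("ex:server", "ex:locatedIn", "ex:urbana"),
]

for depth in (1, 2, 3):
    print(depth, len(bounded_subgraph(GRAPH, "ex:record", max_depth=depth)))
```

The tension noted above shows up directly in the parameters: a rule that needs the datastream's MIME type to interpret the record will find nothing to work with if the selection stops at depth 1.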
4.5. Lessons Learned and Next Steps
Our research contribution can be seen from one perspective as the technical groundwork for a future generation of improved automated digital preservation systems and methods. But one can also understand our findings as opportunities to apply human intelligence more effectively with existing tools and standards. It might never occur to a digital librarian that their preservation methods are being executed without clearly identifiable targets, or that a simple change (such as dcmitype:Image instead of "image") could dramatically reduce the work required to correct that problem. The exercise of encoding semantic knowledge with enough clarity and precision for a computer reveals complexities that our remarkable human minds would otherwise allow us to ignore. With the aid of that insight, much progress could be made in reforming the practices that prompt our development and research.
4.6. Conclusion
Institutional repositories and other current efforts for preserving digital artifacts face challenges resulting from underspecified metadata schemas, ambiguous usage, and metadata models that relate more to repository implementation than to issues of meaning. These entail very real risks to the integrity and usefulness of preserved digital artifacts as they are stored, managed, and retrieved. Descriptive practices that seem correct may introduce inconsistencies that are undetectable without manual inspection of each record, an unreasonable requirement for collections of even moderate size.

Improved metadata standards and repository metadata models are part of the solution to these problems, but we also see a role for automation in detecting and mitigating preservation risks. Our experimental archiving technologies, BECHAMEL and Tupelo, demonstrate that we can locate and correct ambiguous metadata expressions in the context of transactions such as import and export. As best practices evolve for digital preservation, we see reasoning capabilities like those demonstrated by BECHAMEL becoming an integral component of digital preservation systems, allowing curators to transform large collections with greater confidence that records will faithfully represent the information they are intended to preserve. Complementing interoperability models like ECHO DEPository's Hub and Spoke tool suite, we believe the techniques described here point to a new generation of preservation tools, and reveal ways to use existing tools with more success.
5. A Note from the PIs
When the University of Illinois Library and the Graduate School of Library and Information Science submitted the proposal for the ECHO DEP project to the Library of Congress in 2003, the digital preservation landscape was radically different. The number of web archiving tools was still small; far fewer institutions than now had instances of repository software applications in their libraries; the problem space of interoperability between repository platforms was just gaining ground; and techniques for migrating the semantic content of documents over time and through various encoding schemes were still on the horizon.

The accomplishments of ECHO DEP Phase 1 projects, in the form of technical frameworks and software applications, as well as of published research and enduring partnerships, have helped reshape this landscape into a richer and more sustainable one. For example, our efforts at enabling repository interoperability have resulted in the registration of the ECHO DEP Generic METS Profile for Preservation and Digital Repository Interoperability with the Library of Congress. Because of our work in this area, institutions such as Harvard University, the Arizona State Library, and the Georgia Institute of Technology have contacted us to learn more about the technical architecture issues involved in our framework. These contacts bespeak knowledge sharing and community building toward a public good, interactions that are integral to the development of a networked approach to digital content stewardship.

Another beneficial outcome has been the partnerships themselves, initially established during Phase 1, such as with OCLC; Illinois and OCLC are collaborating again in ECHO DEP Phase 2, this time on a named-entity extraction and recognition tool development project that seeks to automate creation and extraction of metadata for preservation purposes and contexts. Indeed, the work of starting and sustaining cross-organizational collaboration for a project’s period of performance should not be overlooked. As our teams have learned in the process, effective collaboration entails (but is not limited to) laying a foundation for a communication infrastructure that draws on an array of tools, such as wikis and virtual meeting applications; nurturing a healthy balance between encouragement of new directions in research and development and meeting the deliverables to which the project is committed; and understanding from the start that the outcome of our efforts will only be as meaningful and successful as the collaborations themselves are rich and productive.

The University of Illinois is grateful to the Library of Congress for funding its digital preservation research activities under NDIIPP. The work achieved during Phase 1 has afforded us a greater understanding of the challenges surrounding preservation strategies, which we hope the NDIIPP community at large will continue to learn from and draw upon in future stewardship endeavors.
6. References
6.1. Archiving the Web: the Web Archives Workbench
6.1.1. Resources
Cobb, J., Pearce-Moses, R. & Surface, T. (2005). ECHO DEPository Project. In Archiving 2005: final program and proceedings, April 26, 2005, Washington, D.C., (175-178). Springfield, VA: The Society for Imaging Science and Technology, 2005. Retrieved July 5, 2008, from http://www.ndiipp.uiuc.edu/pdfs/IST2005paper_final.pdf/.

ECHO Dep Generic METS Profile for Preservation and Digital Repository Interoperability. (2005). Retrieved August 27, 2008, from http://www.loc.gov/standards/mets/profiles/00000015.html.

ECHO Dep METS Profile for Web Site Captures. (2006). Retrieved August 27, 2008, from http://www.loc.gov/standards/mets/profiles/00000016.html.

The ECHO DEPository: An NDIIPP-Partner Project of the University of Illinois at Urbana-Champaign with OCLC and the Library of Congress. (n.d.). Retrieved July 5, 2008, from http://www.ndiipp.uiuc.edu/.

“The ISO Reference Model for Open Distributed Processing – An Introduction.” (1996). Retrieved August 27, 2008, from http://www.enterprise-architecture.info/Images/Documents/RM-ODP2.pdf.

OCLC Digital Management Services. (2008). Retrieved July 5, 2008, from http://www.oclc.org/us/en/services/collection/default.htm.

The National Digital Information and Infrastructure Preservation Program. (n.d.). Retrieved July 5, 2008, from http://www.digitalpreservation.gov/.

Pearce-Moses, R. & Kaczmarek, J. (2005). An Arizona Model for Preservation and Access of Web Documents. DttP: Documents to the People. 33(1), 17-24. Retrieved July 5, 2008, from http://www.ndiipp.uiuc.edu/pdfs/azmodel.pdf/.
Rani, S., Goodkind, J., Cobb, J., Habing, T., Eke, J., Urban, R. & Pearce-Moses, R. (2006). Technical architecture overview: tools for acquisition, packaging, and ingest of web objects into multiple repositories (poster). Opening information horizons: 6th ACM/IEEE-CS Joint Conference on Digital Libraries: June 11-15, 2006, Chapel Hill, NC, USA: JCDL 2006 / sponsored by ACM SIG on Information Retrieval, ACM SIG on Hypertext, Hypermedia and the Web, IEEE Technical Committee for Digital Libraries, (360-360). New York: ACM, 2006.

Web Archives Workbench. (2008). Retrieved July 5, 2008, from http://sourceforge.net/projects/webarchivwkbnch/.

Web archiving. (2008, August 21). In Wikipedia, the free encyclopedia. Retrieved August 27, 2008, from http://en.wikipedia.org/wiki/Web_archiving.
6.2. Repository Evaluation and Interoperability
6.2.1. Repository Evaluation
DLI Full-Text Journal Collection. Retrieved April 7, 2009, from http://forseti.grainger.uiuc.edu/pubs/tocdli.asp.

Historical Aerial Photo Image Database. Retrieved April 7, 2009, from http://images.library.uiuc.edu/projects/aerial_photos/.

Illinois Digital Orthophoto Quarter Quadrangle Data. Retrieved April 7, 2009, from http://www.isgs.illinois.edu/nsdihome/webdocs/doq05/.

RLG. (2005). An audit checklist for the certification of trusted digital repositories. Mountain View, CA: RLG. Retrieved September 10, 2008, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf.

Vincent Voice Audio Library at Michigan State University Libraries. Retrieved April 7, 2009, from http://vvl.lib.msu.edu/showfindingaid.cfm?findaidid=CoolidgeC.
6.2.2. HandS Tools Suite
Allinson, F., François, S., & Lewis, S. (January, 2008). SWORD: Simple Web-service Offering Repository Deposit. Ariadne, (54). Retrieved September 11, 2008, from http://www.ariadne.ac.uk/issue54/allinson-et-al/.

Boyko, A., Kunze, J., Littman, J., & Madden, L. (2008). The BagIt File Package Format (V0.95). Retrieved September 15, 2008, from http://www.cdlib.org/inside/diglib/bagit/bagitspec.html.
Consultative Committee for Space Data Standards. (2002). Reference Model for an Open Archival Information System (OAIS). CCSDS 650.0-B-1. Blue Book. Retrieved September 12, 2008, from http://public.ccsds.org/publications/archive/650x0b1.pdf.

DLF Aquifer Metadata Working Group. (2006). Digital Library Federation / Aquifer Implementation Guidelines for Shareable MODS Records. Retrieved September 10, 2008, from http://wiki.dlib.indiana.edu/confluence/download/attachments/24288/DLFMODS_ImplementationGuidelines_Version1-2.pdf?version=1.

Global Digital Format Registry (GDFR) Information Site. (n.d.). Retrieved September 15, 2008, from http://www.gdfr.info/.

Guenther, R. (2008). Guidelines for using PREMIS with METS for exchange. Retrieved September 11, 2008, from http://www.loc.gov/standards/premis/guidelines-premismets.pdf.

Guenther, R. (2008). Battle of the Buzzwords: Flexibility vs. Interoperability When Implementing PREMIS in METS. D-Lib Magazine, 14(7/8). Retrieved September 12, 2008, from http://www.dlib.org/dlib/july08/guenther/07guenther.html.

Habing, T. G. (2005). ECHO Dep generic METS profile for preservation and digital repository interoperability. Retrieved September 10, 2008, from http://www.loc.gov/standards/mets/profiles/00000015.xml.

Habing, T. G. (2006). ECHO Dep METS profile for web site captures. Retrieved September 10, 2008, from http://www.loc.gov/standards/mets/profiles/00000016.xml.

Habing, T. G. (2007). Lightweight repository CRUD Service (LRCRUDS). Retrieved September 10, 2008, from http://dli.grainger.uiuc.edu/echodep/hands/LRCRUDS.htm.

JHOVE: JSTOR/Harvard Object Validation Environment. (2007). Retrieved September 11, 2008, from http://hul.harvard.edu/jhove/.

Kaczmarek, J., Habing, T. G., & Eke, J. (2006). Repository software evaluation using the audit checklist for certification of trusted digital repositories. In Proceedings of the 6th ACM/IEEE-CS joint conference on digital libraries 2006, Chapel Hill, NC, USA, June 11-15, 2006. New York: Association for Computing Machinery. Retrieved September 10, 2008, from http://doi.acm.org/10.1145/1141753.1141774.
Kaczmarek, J., Hswe, P., Eke, J., & Habing, T. G. (2006). Using the ‘Audit checklist for the certification of a trusted digital repository’ as a framework for evaluating repository software applications. D‐Lib Magazine, 12(12). Retrieved September 10, 2008, from http://www.dlib.org/dlib/december06/kaczmarek/12kaczmarek.html.
METS: Metadata encoding & transmission standard, official web site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/mets/.
MIX: NISO metadata for images in XML schema, technical metadata for digital still images standard, official web site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/mix/.
MODS: Metadata object description schema, official web site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/mods/.
OAI‐PMH: Open Archives Initiative – Protocol for Metadata Harvesting. (2008). Retrieved September 12, 2008, from http://www.openarchives.org/pmh/.
PREMIS: Preservation metadata maintenance activity, official web site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/premis/.
PREMIS Working Group. (2005). Data dictionary for preservation metadata. Dublin, OH: OCLC and RLG. Retrieved September 10, 2005, from http://www.oclc.org/research/projects/pmwg/premis-final.pdf.
RLG. (2005). An audit checklist for the certification of trusted digital repositories. Mountain View, CA: RLG. Retrieved September 10, 2008, from http://worldcat.org/arcviewer/1/OCC/2007/08/08/0000070511/viewer/file2416.pdf.
JISC. (2008). SWORD. Retrieved September 11, 2008, from http://www.ukoln.ac.uk/repositories/digirep/index/SWORD.
textMD: Technical Metadata for Text, Official Web Site. (2008). Retrieved September 10, 2008, from http://www.loc.gov/standards/textMD/.
The ECHO DEPository project. (n.d.). Retrieved September 9, 2008, from http://ndiipp.uiuc.edu/.
The Apache Software Foundation. (2008). Welcome to XMLBeans. Retrieved September 11, 2008, from http://xmlbeans.apache.org/.
UIUC Echodep Hub and Spoke Framework Tool Suite. (n.d.). Retrieved September 12, 2008, from http://dli.grainger.uiuc.edu/echodep/hands/
6.3. Preserving Meaning, Not Just Objects: Semantics and Digital Preservation
DCMI Namespace for the Dublin Core Metadata Element Set, Version 1.1. (2008). Retrieved October 2, 2008, from http://dublincore.org/2008/01/14/dcelements.rdf
DCMI Type Schema. (2008). Retrieved October 3, 2008, from http://dublincore.org/2008/01/14/dctype.rdf
Dubin, D., Sperberg‐McQueen, C. M., Renear, A., and Huitfeldt, C. (2003). A logic programming environment for document semantics and inference. Literary and Linguistic Computing, 18(2): 225–233.
Habing, T., Ingram, W., Cordial, M., Manaster, R., and Eke, J. (2008). Developments in digital preservation at the University of Illinois: the Hub and Spoke architecture for supporting repository interoperability and emerging preservation standards. Library Trends, 57(4), [page nos.].
Kaczmarek, J., Hswe, P., Hauser, L., and Eke, J. (2008). The Web Archives Workbench: taking an archival approach to the preservation of Web content. Library Trends, 57(4), [page nos].
Kaminski, P. (2002). Integrating information on the semantic web using partially ordered multi hypersets. Unpublished master’s thesis. University of Waterloo. Retrieved September 12, 2008, from http://www.ideanest.com/braque/Thesis-web.pdf.
OWL Web Ontology Language. (2004). Retrieved October 2, 2008, from http://www.w3.org/TR/owl-features/
RDF Semantics. (2004). Retrieved October 2, 2008, from http://www.w3.org/TR/rdf-mt/
Renear, A., Dubin, D., Sperberg‐McQueen, C. M., and Huitfeldt, C. (2002). Towards a semantics for XML markup. In E. Munson, R. Furuta, and J. I. Maletic (eds.) Proceedings of the 2002 ACM Symposium on Document Engineering (119‐126). New York: ACM.
Renear, A. and Dubin, D. (2003). Towards identity conditions for digital documents. In S. Sutton, editor, Proceedings of the 2003 Dublin Core Conference. University of Washington, Seattle, WA.
Tupelo. (2008). Retrieved October 2, 2008, from http://dlt.ncsa.uiuc.edu/wiki/index.php/Main_Page.
URIQA. The URI Query Agent Model: A Semantic Web Enabler. (2003‐2008). Retrieved October 2, 2008, from http://sw.nokia.com/uriqa/URIQA.html.
7. Appendices
7.1. Web Archives User Guide
7.2. Web Archives Workbench Implementation Guide
7.3. Annotated Trusted Digital Repository Checklist
7.4. Using the Audit Checklist for the Certification of a Trusted Digital Repository as a Framework for Evaluating Repository Software Applications (D‐Lib article)
7.5. Repository Testing Findings: Narrative
7.6. Repository Findings Commentary Using the Annotated Trusted Digital Repository Checklist
7.7. Resource Description Vocabulary: An Ontology of Metadata Descriptions
7.8. Sustained Access to E‐journals: Context, Value, and Future Prospectus