Migrating Repository Metadata & Users:
The Harvard DRS2 Project
Andrea Goethals, Harvard Library
IS&T Archiving 2014, May 15 2014
Library Digital Initiative Funds (1998-)
• Build technical infrastructure - Digital Repository Service (DRS)
• Hire specialists
• Build digital collections via 49 internal grants, to be preserved in the DRS
DRS Users Grew to 55 Organizational Units at Harvard
[Chart: number of organizational units using the DRS, by year, October 2000 to October 2013; vertical axis 0 to 60]
DRS is Central to User Workflows
• DRS
• Access (discovery, search, delivery platforms)
• Ingest (deposit tools)
• Manage (cataloging & management tools)
• reformatting labs; automated system deposits; library, archives and museum staff
• reformatting labs; library, archives and museum staff; repository managers
• researchers, teachers, learners
Why a New DRS?
• Upgrade to best-in-breed technologies
• Adopt digital preservation best practices and standards
• Preserve metadata better
• Improve collection management
• Support preservation planning & activities
• Improve access to content & metadata
• Support more formats & genres
Evolution of the DRS
[Timeline, 2000 to 2015]
• DRS in production
• DRS enhancements
• New DRS infrastructure development
• New DRS in production
• New DRS metadata migration & user adoption
New DRS - Completed
[Timeline, 2009 to 2015]
• Tracks: Infrastructure Development; Metadata Migration & User Adoption
• convened DRS Advisory Group
• Fedora assessment
• DuraCloud pilot test
• early release, beta 1, beta 2, beta 3
• hardware in production
• migrated content to new hardware
• software in production
• first object deposited to the new DRS
• users trained, phase 1
New DRS – In Progress
[Timeline, 2009 to 2015]
• Tracks: Infrastructure Development; Metadata Migration & User Adoption
• metadata migration tools created & tested
• migrating metadata
• moving users
Why “Metadata” Migration?
Why not “content” migration?
Pre-migration
• DRS Content
• DRS Database
Post-migration
• DRS Content
• DRS Database
• New DRS Database
• New DRS Index
• New DRS Object Descriptors
New DRS Data Model
• Not a simple metadata conversion
• A new DRS object is a logical intellectual entity that unifies multiple DRS files, for example:
– Still image objects - archival and production masters, and deliverables including thumbnails
– Audio objects - archival and production masters and deliverables
– PDS objects - page image and text files
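The object-as-container idea can be pictured with a small Python sketch. The class names, field names and role strings (`DRSFile`, `DRSObject`, `role`, `content_model`) are illustrative assumptions, not the actual DRS2 schema; only the kinds of files grouped together come from the slide.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DRSFile:
    file_id: str  # hypothetical identifier for one content file
    role: str     # assumed labels, e.g. "archival_master", "deliverable"


@dataclass
class DRSObject:
    object_id: str
    content_model: str  # assumed labels, e.g. "still_image", "audio"
    files: list[DRSFile] = field(default_factory=list)

    def files_with_role(self, role: str) -> list[DRSFile]:
        """Return the files in this object that play the given role."""
        return [f for f in self.files if f.role == role]


# One still image object unifying four related files.
still_image = DRSObject(
    object_id="obj-1",
    content_model="still_image",
    files=[
        DRSFile("file-1", "archival_master"),
        DRSFile("file-2", "production_master"),
        DRSFile("file-3", "deliverable"),
        DRSFile("file-4", "thumbnail"),
    ],
)
```

Management actions can then address `still_image` as a whole, while `files_with_role("deliverable")` still reaches an individual file.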
Object Descriptors
• METS files generated for each object
– Standards-based schemas (PREMIS, MODS, MIX, etc.)
• Metadata gathered from multiple sources
– Current DRS database
– Every content file parsed using FITS
– In some cases catalog records, finding aids, legacy METS files
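To give a feel for what a METS-based descriptor contains, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It emits only a file section and a structural map; the PREMIS, MODS and MIX sections named on the slide are omitted, and the function name, input dictionary keys and example values are assumptions rather than the DRS2 descriptor profile.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_descriptor(object_id: str, files: list[dict]) -> ET.Element:
    """Build a bare-bones METS element tree listing an object's files."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    root_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div")
    for f in files:
        # One file group per file, labelled with the file's (assumed) role.
        grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": f["role"]})
        file_el = ET.SubElement(
            grp, f"{{{METS_NS}}}file", {"ID": f["id"], "MIMETYPE": f["mimetype"]}
        )
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": f["href"]},
        )
        # The structural map points back at each file by its ID.
        ET.SubElement(root_div, f"{{{METS_NS}}}fptr", {"FILEID": f["id"]})
    return mets


# Hypothetical input for a single-file object.
descriptor = build_descriptor(
    "obj-1",
    [{"id": "F1", "role": "archival_master", "mimetype": "image/tiff",
      "href": "file:///example/master.tif"}],
)
xml_text = ET.tostring(descriptor, encoding="unicode")
```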
Technical Challenges
• Many formats
• Unique migration rules per format
• Preserving all identifiers
• Uninterrupted access for end users
• Large (>5000 file) page-turned documents
• 46+ million DRS files - at 1 sec/file, migration would take 530+ days!
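The 530+ day figure follows directly from the file count. A quick check in Python, assuming strictly sequential processing (the function name is illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def migration_days(file_count: int, seconds_per_file: float) -> float:
    """Days needed to process file_count files one after another."""
    return file_count * seconds_per_file / SECONDS_PER_DAY


days = migration_days(46_000_000, 1.0)
print(f"{days:.0f} days")  # prints "532 days"
```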
Formulating a Migration Plan
• Technical analysis
– DRS content
– Possible metadata sources
• User analysis
– Management activity via system logs
– Preparation via training and testing registration lists
– Perceived preparation & concerns via survey of highest volume, active users
Migration Plan
• Combines needs of users with technical requirements
– Respects all technical requirements
– Minimizes the time users need to work in two systems at the same time
Migrating Content in 5 Stages
• Migrate 1st: Tier 1 content
• Migrate 2nd: Tier 2 content
• Migrate 3rd: Tier 3 content
• Migrate 4th: Tier 4 content
• Migrate 5th: Tier 5 content
• Tiers are ordered from simpler objects to more complex objects
• Dependencies between tiers
• Dependencies within tiers
Tier  Content
1     Text (Methodology, ESRI World File), Document, Color Profile, Target Image
2     PDS Document, Still Image
3     Audio, Text (SMIL)
4     Web Harvest, Opaque Container
5     Biomedical Image; Google Document Container 1, 2, 3
• Tiers 1, 3, 4, 5: Migrate across all DRS owner codes at one time
• Tier 2: Migrate one DRS owner code at a time
* Migrating Tier 2 one owner code at a time minimizes how long the content users manage the most is split across 2 different systems
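The tier and owner-code rules can be sketched as a function that expands the tiers into an ordered list of migration sets. The function name, the `"ALL"` marker and the owner codes `OWNER.A` and `OWNER.B` are hypothetical placeholders; the tier contents and the per-owner rule for Tier 2 come from the slides.

```python
TIER_CONTENT = {
    1: ["Text (Methodology, ESRI World File)", "Document", "Color Profile", "Target Image"],
    2: ["PDS Document", "Still Image"],
    3: ["Audio", "Text (SMIL)"],
    4: ["Web Harvest", "Opaque Container"],
    5: ["Biomedical Image", "Google Document Container 1, 2, 3"],
}


def plan_migration_sets(tier_content, owner_codes, per_owner_tiers=frozenset({2})):
    """Return (tier, owner_scope, content_types) tuples in migration order."""
    sets = []
    for tier in sorted(tier_content):
        if tier in per_owner_tiers:
            # Tier 2: one migration set per owner code.
            for code in owner_codes:
                sets.append((tier, code, tier_content[tier]))
        else:
            # Other tiers: a single set spanning all owner codes.
            sets.append((tier, "ALL", tier_content[tier]))
    return sets


migration_sets = plan_migration_sets(TIER_CONTENT, ["OWNER.A", "OWNER.B"])
for tier, scope, content in migration_sets:
    print(tier, scope, ", ".join(content))
```

With two owner codes this yields six sets: one for Tier 1, two for Tier 2, and one each for Tiers 3 to 5.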
Technical Strategies
• Modular, parallelizable migration design
• Delivery services made migration-aware
• Test, test, test
• Design for migration failures – make do-overs possible
Technical Strategy – Modular, Parallelizable
• START
• 1) Group files into objects
• Objects queue
• 2) Run FITS, combine with metadata to generate object descriptors
• Descriptors ready queue
• 3) Ingest into new DRS
• END
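The three stages and two queues can be sketched with Python's standard `queue` and `threading` modules. This is a toy model with one thread per stage, not the project's migration code: `describe_file_stub` stands in for running FITS, the record keys (`object_key`, `path`) are assumptions, and "ingest" only records the object key.

```python
import queue
import threading

SENTINEL = None  # marks the end of a queue


def group_files(file_records, objects_q):
    """Stage 1: group file records by object key, then enqueue each object."""
    groups = {}
    for rec in file_records:
        groups.setdefault(rec["object_key"], []).append(rec["path"])
    for key, paths in groups.items():
        objects_q.put({"object_key": key, "paths": paths})
    objects_q.put(SENTINEL)


def describe_file_stub(path):
    """Stand-in for a FITS run; returns fake technical metadata."""
    return {"path": path, "format": path.rsplit(".", 1)[-1]}


def build_descriptors(objects_q, ready_q):
    """Stage 2: describe every file, then enqueue the object with its descriptor."""
    while (obj := objects_q.get()) is not SENTINEL:
        obj["descriptor"] = [describe_file_stub(p) for p in obj["paths"]]
        ready_q.put(obj)
    ready_q.put(SENTINEL)


def ingest(ready_q, ingested):
    """Stage 3: record the object key; a real loader would go here."""
    while (obj := ready_q.get()) is not SENTINEL:
        ingested.append(obj["object_key"])


def run_pipeline(file_records):
    objects_q, ready_q, ingested = queue.Queue(), queue.Queue(), []
    stages = [
        threading.Thread(target=group_files, args=(file_records, objects_q)),
        threading.Thread(target=build_descriptors, args=(objects_q, ready_q)),
        threading.Thread(target=ingest, args=(ready_q, ingested)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return ingested


records = [
    {"object_key": "img-1", "path": "img-1_archival.tif"},
    {"object_key": "img-1", "path": "img-1_deliverable.jpg"},
    {"object_key": "doc-1", "path": "doc-1.pdf"},
]
ingested = run_pipeline(records)
```

Because the stages only communicate through queues, any one of them could be given more workers without changing the others; that would need one sentinel per worker, which this sketch leaves out.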
Tuning Experiments
• Single powerful computer
– Dell R720 server using Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz CPUs with 16 cores, 64 GB of memory and 1 TB of internal disk
– Various thread counts
– 4-35 files processed per second
• Next:
– RAM disk
– Multiple computers
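A thread-count experiment of this kind can be sketched as a small timing harness. This is not the project's benchmark: `fake_process` is a placeholder that sleeps briefly to imitate I/O-bound work, and the thread counts and file names are arbitrary. In Python, threads help mainly with I/O-bound work; CPU-bound processing would more likely call for separate processes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(process_file, paths, thread_counts):
    """Return {thread_count: files per second} for the given workload."""
    results = {}
    for n in thread_counts:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(process_file, paths))  # wait for every file
        elapsed = time.perf_counter() - start
        results[n] = len(paths) / elapsed
    return results


def fake_process(path):
    """Placeholder workload: a short sleep instead of real file processing."""
    time.sleep(0.001)
    return path


rates = measure_throughput(fake_process, [f"file-{i}" for i in range(40)], [1, 4, 8])
for n, rate in rates.items():
    print(f"{n} threads: {rate:.0f} files/sec")
```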
User Strategies
• Advisors - DRS Advisory Group
• Minimize disruption
– Tier 2 migration - one owner at a time
– Close partners - Imaging Services
• Tapping help of experts – “pioneer” depositors, beta testers, trainers
• Regular communications monthly via HL Update
Migration State Diagram
Migration Set Checklist
• Description of the affected content
• List of steps needing human intervention, who will do them, date of completion
– includes communication, migration kickoff and post-migration verification tasks
• Final step – manager signs off on completion
• Checklist is preserved
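The checklist can be modelled as a small record. The class and field names, the example steps, and the rule that sign-off is refused while any step lacks a completion date are assumptions read into the "final step" bullet, not a description of the actual form.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Step:
    description: str
    assignee: str
    completed_on: Optional[date] = None


@dataclass
class MigrationSetChecklist:
    content_description: str
    steps: list = field(default_factory=list)
    signed_off_by: Optional[str] = None

    def sign_off(self, manager: str) -> None:
        """Final step: allowed only once every step has a completion date."""
        pending = [s.description for s in self.steps if s.completed_on is None]
        if pending:
            raise ValueError(f"steps still open: {pending}")
        self.signed_off_by = manager


# Hypothetical checklist for one migration set.
checklist = MigrationSetChecklist(
    "Tier 2 content for one owner code",
    [
        Step("Send migration kickoff notice", "migration lead"),
        Step("Verify migrated objects", "collection manager"),
    ],
)
```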
Learned So Far
• Can migrate in sub-second/file time
• User-contributed metadata varies in quality
– Should automate more and/or put more validation checks in place
– Useful exercise to analyze metadata values and elements periodically
  • errors in metadata values
  • value vs. effort of metadata elements
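A periodic analysis of this kind can start from simple counts: how often each element is filled in, and which values it takes. The sketch below assumes records are dictionaries of string values; the element names and values are invented for illustration.

```python
from collections import Counter, defaultdict


def profile_metadata(records):
    """Count how often each element is filled in and which values it takes."""
    usage = Counter()
    values = defaultdict(Counter)
    for rec in records:
        for element, value in rec.items():
            if value not in (None, ""):
                usage[element] += 1
                # Normalize case and whitespace so variants are counted together.
                values[element][value.strip().lower()] += 1
    return usage, values


# Hypothetical user-contributed records.
records = [
    {"role": "archival master", "note": "scanned from print"},
    {"role": "Archival Master", "note": ""},
    {"role": "archival mastr", "note": None},
]
usage, values = profile_metadata(records)
print(usage["role"], usage["note"])
print(values["role"].most_common())
```

Rare values such as `archival mastr` are candidates for error review, and rarely used elements such as `note` feed the value-versus-effort question.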
Preservation Capability Before and After the DRS2 Project
[Table based on the NDSA Levels of Digital Preservation]
• Columns: Level One, Level Two, Level Three, Level Four
• Rows: Storage & Geographic Location; File Fixity and Data Integrity; Information Security; Metadata; File Formats
• Cell legend: already compliant, or will be compliant after the DRS2 project
Q & A
Thanks!
DRS Advisory Group, DRS beta testers, DCSWG, HUIT Security, LTS Operations
Bobbi Fox, Franziska Frey, Andrea Goethals, Wendy Gogel, Chip Goines, Jonathan Kennedy, Spencer McEwen, Grainne Reilly, Tracey Robinson, Randy Stern, Janet Taylor, Chris Vicary, Robin Wendler, Julie Wetherill, Vitaly Zakuta