Digital archival storage for the University of Michigan Library collections

Upload: deshawn-biringer

Post on 16-Dec-2015

TRANSCRIPT

Page 1: Digital archival storage for the University of Michigan Library collections

digital archival storage

for the University of Michigan Library collections

Page 2: Digital archival storage for the University of Michigan Library collections

Project Overview

Project partnership with Google publicly announced in December 2004.

Bound print collection, about 7 million volumes, to be scanned over estimated four to six years.

Direct scanning costs are borne by Google.

Page 3: Digital archival storage for the University of Michigan Library collections

Project Overview

UM receives a copy of all digital files, including OCR and metadata, which we may use to build services.

UM may share files with other research libraries under formal agreements.

UM may not redistribute content en masse to other commercial services or the public.

All uses are subject to copyright.

Page 4: Digital archival storage for the University of Michigan Library collections

Project Scale

At about 320 pages per volume and 2.01 files per page, we’ll have 2.2 billion files.

At about 6000 pages per GB or 54.6 MB per volume, we’ll have 380 TB of data.

Production at full volume can scan about 35K volumes (1867 GB) per week, which averages to a sustained 3.16 MB per second for four years.
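The scale figures above can be reproduced with a few lines of arithmetic. The sketch below uses only the slide's inputs and assumes binary units (1 GB = 1024 MB), which is what makes the 54.6 MB, 1867 GB, and 3.16 MB/s figures line up; the variable names are illustrative. The slide's 2.2 billion figure matches the total page count computed here.

```python
# Back-of-the-envelope check of the project-scale figures.
# Inputs come from the slides; binary units (1 GB = 1024 MB) are an assumption.
VOLUMES = 7_000_000
PAGES_PER_VOLUME = 320
PAGES_PER_GB = 6000
VOLUMES_PER_WEEK = 35_000
MB_PER_GB = 1024
SECONDS_PER_WEEK = 7 * 24 * 3600

total_pages = VOLUMES * PAGES_PER_VOLUME
mb_per_volume = PAGES_PER_VOLUME / PAGES_PER_GB * MB_PER_GB
total_gb = total_pages / PAGES_PER_GB
gb_per_week = VOLUMES_PER_WEEK * mb_per_volume / MB_PER_GB
mb_per_second = gb_per_week * MB_PER_GB / SECONDS_PER_WEEK
weeks_to_finish = VOLUMES / VOLUMES_PER_WEEK

print(f"total pages:      {total_pages:,}")
print(f"MB per volume:    {mb_per_volume:.1f}")
print(f"total data (GB):  {total_gb:,.0f}")
print(f"GB per week:      {gb_per_week:.0f}")
print(f"sustained MB/s:   {mb_per_second:.2f}")
print(f"weeks to finish:  {weeks_to_finish:.0f} (about {weeks_to_finish / 52:.1f} years)")
```

The total comes out in the high 300,000s of GB, in the same range as the 380 TB quoted above; the exact value depends on the rounding and unit conventions used.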

Page 5: Digital archival storage for the University of Michigan Library collections

Not too many libraries do this!

Page 6: Digital archival storage for the University of Michigan Library collections

Characteristics of the Data

Extremely well-defined data conventions: image files are TIFF or JPEG 2000, OCR files and metadata are UTF-8 text.

A true archival system; indefinite retention requires its own set of best practices.

Files are largely static.

Much material is in-copyright (security is paramount).

Page 7: Digital archival storage for the University of Michigan Library collections

Application Requirements

MBooks (web server farm/NAS)

Periodic fixity check (checksum validation)

Full-text search? (how?!)

Textual analysis or other research?

Anything beyond MBooks is likely to be either compute- or IO-intensive, or both.

This is how you annoy storage vendors!
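To make the periodic fixity check concrete, here is a minimal sketch of checksum validation. It is not the project's actual tooling: the SHA-256 algorithm, the manifest format (a mapping from file path to recorded digest), and the function names are all assumptions chosen for illustration.

```python
import hashlib
from pathlib import Path

# Read 1 MiB at a time so large image files are never loaded whole into memory.
CHUNK_SIZE = 1024 * 1024


def file_digest(path, algorithm="sha256"):
    """Return the hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(manifest, algorithm="sha256"):
    """Compare each file against its recorded digest.

    `manifest` maps file path -> expected hex digest (hypothetical format).
    Returns a dict of path -> problem description for files that fail.
    """
    failures = {}
    for path, expected in manifest.items():
        if not Path(path).is_file():
            failures[path] = "missing"
        elif file_digest(path, algorithm) != expected:
            failures[path] = "checksum mismatch"
    return failures
```

A check like this reads every byte of every file, so at the scale described earlier it is itself an IO-intensive workload and would likely need to be spread over time.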

Page 8: Digital archival storage for the University of Michigan Library collections

Overall Approach

Engagement with Office of the Provost from the beginning; a University project housed in the Library

Our Library IT environment has unusual depth due to our mature digital library.

Consulting relationship with academic computing and campus storage experts

RFI provided vendor landscape

RFP (very few Yes/No questions!)

Page 9: Digital archival storage for the University of Michigan Library collections

Cost Model from RFI Responses

Model includes various ramp-up patterns, hardware replacement periods, starting cost, and rate of cost decrease.

Cost per GB from selected RFI responses: average = median = $7

Too fast means initial investment is huge, no benefit from Moore’s Law.

Too slow means simultaneous growth and replacement, costs peak at replacement interval.

Four years is plenty fast, thank you!
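A model with these four inputs can be sketched in a few lines. This is an illustration under stated assumptions, not the model actually used: it assumes linear ramp-up, replacement of hardware exactly `replace_every` years after purchase, and a constant yearly price decline. The 30% decline rate and the 380,000 GB total in the usage lines are placeholders; only the $7 per GB starting cost comes from the slide.

```python
def yearly_spend(total_gb, ramp_years, replace_every, start_cost_per_gb,
                 annual_decline, horizon_years):
    """Return a list of storage spending per year (hypothetical model).

    Capacity grows linearly over `ramp_years`; anything bought in a given
    year is bought again `replace_every` years later at that year's price.
    """
    purchases_gb = []  # capacity bought in each year (growth + replacement)
    spend = []
    for year in range(horizon_years):
        growth = total_gb / ramp_years if year < ramp_years else 0.0
        replacement = purchases_gb[year - replace_every] if year >= replace_every else 0.0
        bought = growth + replacement
        purchases_gb.append(bought)
        price = start_cost_per_gb * (1 - annual_decline) ** year
        spend.append(bought * price)
    return spend


# Compare ramp-up patterns; the decline rate and total size are assumed values.
for ramp in (1, 4, 8):
    costs = yearly_spend(total_gb=380_000, ramp_years=ramp, replace_every=4,
                         start_cost_per_gb=7.0, annual_decline=0.30,
                         horizon_years=12)
    print(f"ramp {ramp} yr:", [round(c) for c in costs])
```

With a one-year ramp the whole purchase lands in year zero at the starting price. With a flat price and a ramp longer than the replacement period, yearly spending doubles once growth and replacement overlap.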

Page 10: Digital archival storage for the University of Michigan Library collections

Potential Funding Sources

Development of CIC shared digital repository: multiple redundant sites and some staff funded by pay-to-play model

Again, engagement with Office of the Provost from the beginning

Page 11: Digital archival storage for the University of Michigan Library collections

Considerations

“Future-proof” higher-cost investment with proven vendor and incremental upgrades?

“Throwaway” lower-cost solution with cutting-edge vendor and forklift upgrade?

Temporary solution (Linux NAS server and commodity SCSI/SATA arrays) has allowed project to proceed and further inform us on the decisions we’ll make.

Page 12: Digital archival storage for the University of Michigan Library collections

Best Architecture?

Must have simultaneous access from potentially many front-end servers (cluster), so almost certainly a NAS component.

NAS? NAS gateway to SAN? NAS/SAN hybrid?

Probably most promising in the flexibility department are the clustered NAS systems with SAS or SATA back ends.

Keep our options open; the right vendor could make all the difference.

Page 13: Digital archival storage for the University of Michigan Library collections

Highlights of the RFP

Does not ask about compliance with exact specifications, but asks for detailed explanations of system architecture: all of the usual, and…

Recommended upgrade path given our estimated growth pattern and project timeline

Description of how load balancing and service are impacted as system is scaled and maintained

How virtualization is implemented

Security provisions

Contact me if you’d like to have a copy.

Page 14: Digital archival storage for the University of Michigan Library collections

Proposal Evaluation Criteria

Scalability of capacity, performance, and interconnect fabric

Proven models/methods for growth

Flexibility in application

Maintenance ease

Page 15: Digital archival storage for the University of Michigan Library collections

Near-term Work

RFP responses due (Monday!)

Space, support, backup

Work in CIC on governance and funding model for shared digital repository

Continued development of MBooks functionality and integration with existing digital library resources

Page 16: Digital archival storage for the University of Michigan Library collections

Access

MBooks

http://www.lib.umich.edu/mdp/

Cory Snavely

[email protected]@umich.edu