agenda, day 2 · stanford university libraries may 2011. agenda • stanford university ... •some...


Page 1: Agenda, Day 2 · Stanford University Libraries May 2011. Agenda • Stanford University ... •Some 27,000 “at risk” geospatial objects •TIFFs, GeoTIFFs, Shapefiles, Digital

Agenda, Day 2

08:30 – 08:35 Review of objectives and agenda

08:35 – 09:30 Infrastructure and tools

09:30 – 10:30 Case study: preservation activities at CDL

10:30 – 11:00 Morning break

11:00 – 12:00 Case study: preservation activities at Portico

12:00 – 12:30 Preservation initiatives and organizations: DataNet, DCC, DPC, IIPC, NDSA, OPF

12:30 – 14:00 Lunch

14:00 – Case study: preservation activities at Stanford

– 15:00 Other preservation resources

15:00 – 15:30 Afternoon break

15:30 – 16:00 Format characterization

16:00 – 16:30 Characterization in preservation workflows

16:30 – 17:00 Questions and discussion


Digital Preservation at Stanford University

Tom Cramer
Chief Technology Strategist
Stanford University Libraries
May 2011


Agenda

• Stanford University

• First & Second Generation Digital Library

• Digitization Efforts

• The Stanford Digital Repository

– Preservation Core

– Management

– Access


Stanford University

“The University of Stanford”?

Leland Stanford Junior Universityx


Stanford University

• 15,000 students
  – 8,000 graduate
  – 7,000 undergraduate
• 2,000 faculty
• 35,000 total university community
• $3.4 billion annual operating budget
• $17.2 billion endowment
• Roots of Silicon Valley
• One of the world’s leading research universities


Stanford’s Digital Library c. 2007

Typical of all first generation digital libraries?


1st Generation Digital Libraries

• Small scale digitization, largely focused on text & images

• Purpose built systems for specific content types – application focus

• Highly theoretical approach to digital preservation

• Anemic UI’s


2nd Generation Digital Libraries

• Large scale digitization
• With more content types
• Multi-pathway workflows
• Content use & reuse in an integrated environment
• Pragmatic approach to digital preservation & full lifecycle of objects
• Infrastructure & service focus


Digitization Trends -- Drivers

• Boutique → Large scale
• Text & image → text, image, audio, video, software and more
• Refresh of 1st generation delivery systems with contemporary UI’s


Digitization Trends -- Responses

Replacing individual, handwrought schemes with workflow-based, largely automated systems, with QA, exception handling and reporting, that work for multiple content streams.

Management of the full lifecycle of an object, from physical object management through capture, preservation & access


Digitization at SULAIR

1. Robotic Book Scanning Lab
2. Rare Book Scanning Lab
3. Map Scanning Lab
4. High End Imaging Lab
5. Multipurpose (Sheet Feed, et al) Lab
6. Media Preservation Lab
7. Digital Forensics Lab


Stanford’s Legacy Media Counts

More than 20,000 handheld media objects in Special Collections alone


Legacy Media & Digital Forensics

• Files, operating systems & software
• mss, correspondence, images, records, data, etc.
• Steps:
  – Extraction
  – Forensic analysis
  – Archival processing & description
  – Access & emulation
• Paradigm shift for archivists, donors


Lifecycle Management = Integration


Lifecycle Management = Integration

Digitization & file processing are the easiest parts of any digitization initiative. Description, file management, collection management, access, and a holistic workflow uniting all the pieces are the real challenge.


Preservation at Stanford

• SDR has been in production since Dec 2006
• Now a second generation preservation system
• One component in a larger ecosystem of digital library infrastructure

[Timeline: 1997 need identified; “Dark Cave” concept; ’02–’07 NDIIPP, prototype, redesign, 1.0 in prod; ’08–’09 2.0 conceived; ’10 2.0 in prod]


Three Major Areas of Preservation Needs

• Digital Library
  – Legacy collections
  – Digitized collections
  – Licensed, locally loaded content
  – Born digital collections
• Institutional Repository
  – Research data
  – Publications, dissertations
  – Learning objects, university assets
• External Depositors
  – Publishers
  – Discipline-specific repositories
  – Reciprocal deposits with peer institutions

Google Books (’00s of TB)
Manuscripts (75 TB)
Media (50 TB)
Geospatial Data (10 TB)
~30 other digi projects (15 TB)
Purchased collections (25 TB)


E.g., Google-Scanned Books

Download, process and preserve 8 million volumes in SDR for...
• local indexing,
• text mining,
• selective delivery, and
• long-term access.


E.g., Monterey Jazz Festival

•Festival founded in 1958: longest running jazz festival in the world.

•Rich collection of recordings from inception, spanning over 50 years, in varying states of condition & decay.

•Archives held at Stanford’s Archive of Recorded Sound

•~800 audio recordings, 1.6 TB audio files in SDR

•~250 video recordings, 22 TB video files in SDR

Access:
- Complete database of digital recordings online at collections.stanford.edu/mjf
- Access via on-site visit to ARS
- New commercial releases on MJF Records


E.g., National Geospatial Digital Archive

• Some 27,000 “at risk” geospatial objects

• TIFFs, GeoTIFFs, Shapefiles, Digital Elevation Models, Digital Orthophoto Quadrangle files


E.g., Preserving Virtual Worlds

Stanford University Libraries Second Life Open House, 31 July 2009


E.g., Forensically Extracted Born Digital Files

•Digital Forensics lab extracting original computer files from legacy media

•Actively building pipeline from extraction to preservation store

•Support for both immediate and deferred archival processing & description


E.g., Electronic Theses and Dissertations


NSF Policy Position on Data Archiving 1

“NSF's policy position on data is straightforward:

1 National Science Foundation, Cyberinfrastructure Council. Cyberinfrastructure Vision for 21st Century Discovery. March 2007.


NSF and NIH Grants to Stanford


SDR 2.0 today

• 100+ TB of unique content
• 300+ TB of managed data
• 200,000+ objects
• 62,000,000 files
• 7 content types: books, images, audio, video, manuscripts, GIS data, software
• Integrated component of larger environment


2008: SDR 1.0 In Production & Working, BUT…

• Custom code, maintained by evolving & smaller team
  – No reuse of code within Stanford, or larger community
• Bottlenecks
  – Needed to be quicker to add new content types
  – Needed to be quicker to add new collections
  – Needed to decompose code into more granular components
• Largely a stand-alone system
  – Lacked flexible Management services for streamlined, continuous content deposit workflows
  – “Dark Archive” – No access services for rich, self-service patron access


SDR 1.0 Architecture: Strongly Rooted in OAIS


SDR 2.0: New Technical Architecture

• Adopt Fedora as a metadata management system
  – Clean mapping of new data model to Fedora content models
  – Reuse same design pattern, core technology as in DOR
• Support for parallelized & asynchronous operations
  – Multiple ingest streams to increase throughput
  – Decompose one process (e.g., “ingest”) into discrete, loosely coupled operations (“checksum”, “package”, “transfer”)
• Adopt a RESTful architecture & common workflow service
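
The decomposition described above can be illustrated with a minimal Python sketch. It is not SDR code: the step functions, the dictionary used to represent an object, and the choice of a thread pool to stand in for "multiple ingest streams" are all assumptions made for illustration. Only the step names ("checksum", "package", "transfer") come from the slide.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical step functions; only their names come from the slide.
def checksum(obj):
    # Record a SHA-256 digest of the object's payload.
    obj["sha256"] = hashlib.sha256(obj["payload"]).hexdigest()
    return obj

def package(obj):
    # Stand-in for packaging: list the files the package would contain.
    obj["package"] = ["payload.bin", "manifest-sha256.txt"]
    return obj

def transfer(obj):
    # Stand-in for transfer to storage: just mark the object as transferred.
    obj["transferred"] = True
    return obj

# "Ingest" is nothing more than the discrete steps applied in order.
STEPS = [checksum, package, transfer]

def ingest(obj):
    for step in STEPS:
        obj = step(obj)
    return obj

def ingest_many(objects, streams=4):
    # Each worker thread plays the role of one ingest stream.
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return list(pool.map(ingest, objects))

objects = [{"id": f"obj{i}", "payload": f"content {i}".encode()} for i in range(8)]
results = ingest_many(objects)
```

Because each step takes and returns the same kind of object, a step can be replaced, retried or moved to another machine without touching the others, which is the point of loose coupling.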


SDR 2.0: New Technical Architecture


SDR 2.0: Robots & “WorkDo” Service


Complex Systems from Atomic Pieces


SDR 2.0: Revised Data Model

SDR 1.x’s METS-based SIP, AIP and DIP had many issues:
  – Each Transfer Manifest was content & collection specific: doesn’t scale
  – Transfer manifests require too much interpretation and analysis to change, augment
  – Too complex: Stanford METS structure breaks apart related data across the object
  – Wraps (somewhat dynamic) metadata with (mostly static) data files in same envelope
  – Recursive nature of transfer manifest makes versioning self-referential, complex
  – No one speaks METS natively: depositors, SDR & clients all forced to perform translation at handshakes


Content Structures and Flavors of Metadata

• Flexible data model can take any type of data, packaged in “bags”
  – A “bag” is a directory with standardized top-level structure and syntax
• Minimizes analysis & processing required on ingest
• Preserves options for future processing & transformations based on future needs

Each object has seven discrete metadata files:
  – Identity metadata
  – Descriptive metadata
  – Content metadata (aka structural metadata)
  – Technical metadata
  – Rights metadata
  – Source metadata
  – Provenance metadata


SDR Deposits: Content Transfer via BagIt

druid/
    bagit-info.txt
        :
        Stanford-Content-Metadata: data/metadata/contentMetadata
        Stanford-Identity-Metadata: data/metadata/identityMetadata
        Stanford-Provenance-Metadata: data/metadata/provenanceMetadata
    /data
        /metadata
            /contentMetadata
            /descMetadata
            /identityMetadata
            /provenanceMetadata
            /rightsMetadata
            /sourceMetadata
            /technicalMetadata
        /content
            /file1
            /file2
            :
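
As a rough illustration of the deposit layout above, the following Python sketch builds the directory structure in a temporary folder and checks that all seven metadata files are present. The function names, the placeholder identifier `example-druid`, and the empty placeholder files are assumptions for illustration. Only the directory names, the `bagit-info.txt` file name and the three tag lines are taken from the slide, and the sketch does not attempt to produce a complete, spec-conformant BagIt bag.

```python
import tempfile
from pathlib import Path

# The seven metadata file names shown in the slide's listing.
METADATA_FILES = [
    "contentMetadata", "descMetadata", "identityMetadata",
    "provenanceMetadata", "rightsMetadata", "sourceMetadata",
    "technicalMetadata",
]

def build_bag(root: Path, druid: str, content: dict) -> Path:
    """Create <druid>/data/{metadata,content} plus a bagit-info.txt (hypothetical helper)."""
    bag = root / druid
    metadata_dir = bag / "data" / "metadata"
    content_dir = bag / "data" / "content"
    metadata_dir.mkdir(parents=True)
    content_dir.mkdir(parents=True)
    for name in METADATA_FILES:
        (metadata_dir / name).write_text("")  # empty placeholder file
    for name, data in content.items():
        (content_dir / name).write_bytes(data)
    # Only the three tags visible on the slide are written here.
    tags = {
        "Stanford-Content-Metadata": "data/metadata/contentMetadata",
        "Stanford-Identity-Metadata": "data/metadata/identityMetadata",
        "Stanford-Provenance-Metadata": "data/metadata/provenanceMetadata",
    }
    (bag / "bagit-info.txt").write_text(
        "".join(f"{key}: {value}\n" for key, value in tags.items())
    )
    return bag

def missing_metadata(bag: Path) -> list:
    """Return the expected metadata file names that are absent from the bag."""
    return [n for n in METADATA_FILES if not (bag / "data" / "metadata" / n).is_file()]

with tempfile.TemporaryDirectory() as tmp:
    bag = build_bag(Path(tmp), "example-druid", {"file1": b"a", "file2": b"b"})
    tag_lines = (bag / "bagit-info.txt").read_text().splitlines()
    problems = missing_metadata(bag)           # nothing missing yet
    (bag / "data" / "metadata" / "rightsMetadata").unlink()
    after_removal = missing_metadata(bag)      # now reports the deleted file
```

A check like `missing_metadata` is cheap to run at the deposit handshake, which fits the stated goal of minimizing analysis and processing on ingest.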


Lessons Learned Over 5 Years

• Custom code, maintained by evolving & smaller team, was inefficient & unsustainable
  – Adopted Fedora for metadata management, Hydra for application framework
  – Shared technology & design patterns with rest of digital library ecosystem
  – API’s for management, ingest, retrieval, reporting
• Bottlenecks
  – Need to be quicker to add new content types & collections: simplify the data model, support “Zip & SIP”
  – Need to increase the throughput to the storage layer led to parallelization of processes
• Need to refine & hone the SDR service model
  – Complement Preservation with robust Management & Access services


Preservation Is One Leg of a Stool

• Preservation without Access is pointless
  – Further, all signs indicate that it is not economically viable

• Access without Preservation is myopic

• Robust Management services are prerequisite for accessioning, archiving and providing access to content
  – The “pre-ingest” phenomenon

Can one system handle it all? or


Stanford’s Digital Library Ecosystem


Three Spheres: Management, Preservation and Access

Digitization, Deposit & Management

Preservation

Discovery & Delivery


SDR in Stanford’s DL Ecosystem

Stanford Digital Repository (SDR): content agnostic, preservation repository

Specialty applications provide context-specific, user-facing deposit, and access services tailored to content types and disciplines:

• Library Management Applications: EEMs (acquiring born digital content), digitization workflow, etc.
• Institutional Repository: ETDs, open access articles, faculty “papers”, research data, web sites, etc.
• SULAIR Digital Stacks: Delivery for text, images, mss, media, data, & curated collections
• National Geospatial Digital Archive (NGDA): Geospatial data

…and SDR provides “back-office” preservation services: replication, auditing, migration, and retrieval in a secure, sustainable, scalable stewardship environment


E.g., Parker Manuscripts

•559 Anglo-Saxon manuscripts, 200,000 pages

•For each page:

22 MB JPEG2000 delivery surrogate
22 MB JPEG2000 delivery surrogate
110 MB submaster TIFF
220 MB master TIFF

SDR – Preservation Core

Parker.stanford.edu: Rich web application, tailored for general public, medievalists
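
The per-page figures above lend themselves to a quick storage estimate. The short Python calculation below is a sketch under stated assumptions: it takes the four per-page sizes exactly as listed (including the 22 MB surrogate that appears twice), multiplies by the 200,000 pages quoted, and uses decimal units (1 TB = 1,000,000 MB). The slides do not state how the sizes should be combined, so this is a back-of-envelope check rather than a reported figure.

```python
# Per-page file sizes in MB, as listed on the slide
# (the 22 MB delivery surrogate appears twice in the listing).
PER_PAGE_MB = [22, 22, 110, 220]
PAGES = 200_000

per_page_total_mb = sum(PER_PAGE_MB)               # 374 MB per page
total_tb = per_page_total_mb * PAGES / 1_000_000   # decimal TB

print(f"{per_page_total_mb} MB per page, about {total_tb:.1f} TB in total")
# prints "374 MB per page, about 74.8 TB in total"
```

Counting the surrogate only once would give 352 MB per page and about 70.4 TB. Either way the estimate lands in the same range as the "Manuscripts (75 TB)" figure quoted earlier in the deck, though the slides do not say that the two numbers describe the same collection.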


Separation of Concerns

• Scoped repository: differentiation between preservation (provided by SDR) and
  …content management (provided by DOR)
  …access (provided by the Digital Stacks apps)
• Implications:
  – Reduces pressure on SDR to be all things to all depositors, for all content
  – Reinforces need to provide managed & secure storage at scale
  – Reinforces requirement to focus on fixity and integrity services
  – Emphasizes need to integrate SDR to management & access services through stable API’s
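
To make "fixity and integrity services" concrete, here is a minimal Python sketch of a checksum audit. It is an illustration only, not SDR's implementation: the function names, the choice of SHA-256, and the in-memory manifest are assumptions. The idea is simply to record a digest per file at deposit time and later compare the stored files against that record.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def record_fixity(root: Path) -> dict:
    """Map each file's relative path to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*")) if p.is_file()
    }

def audit(root: Path, manifest: dict) -> dict:
    """Compare the files under root with a previously recorded manifest."""
    current = record_fixity(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            k for k in manifest if k in current and current[k] != manifest[k]
        ),
    }

with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    (store / "master.tif").write_bytes(b"master image bytes")
    (store / "surrogate.jp2").write_bytes(b"surrogate bytes")
    manifest = record_fixity(store)
    clean_report = audit(store, manifest)      # nothing to report
    # Simulate silent corruption of one file.
    (store / "master.tif").write_bytes(b"master image bytez")
    damaged_report = audit(store, manifest)    # flags master.tif as changed
```

In a real service the manifest would be stored separately from the content it describes, so that damage to one does not silently hide damage to the other.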


Management: Hydra-based Applications

Under Development…
• SDR’s Front End – Institutional Repository for Stanford
• Hypatia – Archival Arrangement, Description & Access
• SDR Preservation Core Administrative Application

ETDs – Electronic Theses & Dissertations

SALT – Self-Archiving Legacy Toolkit

EEMs – Everyday Electronic Materials


Hydra

• Joint development project among Stanford, University of Virginia, University of Hull and Fedora Commons

• Based on Fedora, ActiveFedora and Ruby on Rails

• Reuse Blacklight & Solr for search & browse within a Hydra application


Fundamental Assumption #1

No single system can provide the full range of repository-based solutions for a given institution’s needs,

…yet sustainable solutions require a common repository infrastructure.


For instance…

An ETD solution…
- Single PDF
- With auxiliary data files
- Simple, prescribed workflow
- Integrated with student administration system
- Streamlined UI for depositors, reviewers & readers

A digitization workflow system…
- Potentially hundreds of files type per object
- Complex, branching workflow
- Sophisticated operator (back office) interfaces

A general purpose institutional repository
- Heterogeneous file types
- Simple to complex objects
- General purpose user interfaces


Distinct Application Needs

More than one dozen distinct repository application needs across three institutions.

• Electronic theses & dissertations
• Open access articles
• Data curation application(s)
• General purpose institutional repository
• Manuscript & archival collection delivery
• Library materials accessioning tools
• Digitization workflow system
• And more...


Shared, Primitive Functions

• Deposit – uploading simple or multipart objects, singly or in bulk
• Manage – editing an object’s content, metadata and permissions
• Search – full text and fielded search supporting both user discovery and administration
• Browse – sequential viewing of objects by collection, attribute or ad hoc filtering
• Deliver – viewing, downloading & disseminating objects through user and machine interfaces


Hydra Philosophy -- Technical

• Tailored applications and workflows for different content types, contexts and user interactions
• A common repository infrastructure
• Flexible, atomistic data models
• Modular, “Lego brick” services
• Library of user interaction widgets
• Easily skinned UI

One body, many heads


Fundamental Assumption #2

No single institution can resource the development of a full range of solutions on its own,

…yet each needs the flexibility to tailor solutions to local demands and workflows.


Hydra Philosophy -- Community

• An open architecture, with many contributors to a common core
• Collaboratively built “solution bundles” that can be adapted and modified to suit local needs
• A community of developers and adopters extending and enhancing the core
• “If you want to go fast, go alone. If you want to go far, go together.”

One body, many heads


Electronic Theses and Dissertations


Electronic Theses & Dissertations (ETD)

• Automatic deposit to library as part of degree conferral
• Built in digital collection building
• Better access for patrons
• Reduced expenses for students, University, library processing
• Increased visibility of and access to Stanford research via catalog & Google
• Built in preservation through Stanford Digital Repository


EEMs: Accessioning Born Digital Materials

Browser widget enables selector to capture the PDF, plus URL, title, author, copyright status, payment information, and comments, and route to Acquisitions.


EEMs: Accessioning Born Digital Materials

Dashboard enables item processing, ultimately leading to preservation in SDR and access via the catalog.


SALT: Digital Archives


SALT: Digital Archives

• Archiving unstructured and semi-structured data

• Allow access to semi-processed information,
  - with strong access & visibility controls
  - leveraging full text & entity extraction

• Ongoing enrichment of the archive
  - through self-annotation by the donor
  - through crowd-sourcing description and organization


Component Based Architecture

• Fedora as a metadata store
• Well structured file system as data store
• Solr index for rapid data access
• Blacklight & Hydra: app logic & presentation
• Atomic Services
  – “Robots”: simple, autonomous scripts, providing small units of work in reusable packages
  – “Services” provide common operations that support workflows across the environment
• “WorkDo”: lightweight workflow to orchestrate cascade of services
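
The robot and workflow idea above can be sketched in a few lines of Python. This is an illustrative toy, not the actual "WorkDo" service or Stanford's robots: the `Workflow` and `Robot` classes, the state names ("waiting", "completed", "error") and the polling loop are all assumptions. The sketch shows a shared status table that each robot consults to find objects ready for its own step.

```python
class Workflow:
    """Hypothetical status table: per object, the state of each step."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.status = {}  # object id -> {step name: state}

    def register(self, object_id):
        self.status[object_id] = {step: "waiting" for step in self.steps}

    def ready_for(self, step):
        """Objects waiting on this step whose earlier steps are all completed."""
        earlier = self.steps[: self.steps.index(step)]
        return [
            oid for oid, states in self.status.items()
            if states[step] == "waiting"
            and all(states[e] == "completed" for e in earlier)
        ]

    def update(self, object_id, step, state):
        self.status[object_id][step] = state


class Robot:
    """A small worker that knows how to do exactly one step."""

    def __init__(self, step, work):
        self.step = step
        self.work = work

    def run_once(self, workflow):
        handled = 0
        for oid in workflow.ready_for(self.step):
            try:
                self.work(oid)
                workflow.update(oid, self.step, "completed")
            except Exception:
                # Exception handling: park the object for an operator.
                workflow.update(oid, self.step, "error")
            handled += 1
        return handled


log = []

def package_work(oid):
    if oid == "obj2":
        raise ValueError("simulated packaging failure")
    log.append(("package", oid))

workflow = Workflow(["checksum", "package", "transfer"])
for oid in ["obj1", "obj2"]:
    workflow.register(oid)

robots = [
    Robot("checksum", lambda oid: log.append(("checksum", oid))),
    Robot("package", package_work),
    Robot("transfer", lambda oid: log.append(("transfer", oid))),
]

# Keep sweeping until no robot finds any work to do.
while sum(robot.run_once(workflow) for robot in robots) > 0:
    pass
```

In this toy run `obj1` completes all three steps, while `obj2` stops at "package" with an error and never reaches "transfer". The robots never call each other; the status table is the only coupling between them.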


DOR & Digital Stacks Architecture


Digital Library Ecosystem


Growth in Disk and Computing at SULAIR


Stanford’s Digital Library, 2011

The next generation of digital libraries will be complex ecosystems made up of simple components.

Separate systems for digitization, management, preservation and access will enable pieces to be mixed and matched, supporting content streams from a variety of sources, and access by a variety of communities, services and tools.

Photo by Alun Salt. Used under CC Attribution-ShareAlike 2.0 Generic license.


LOCKSS

• Lots of Copies Keeps Stuff Safe
• Originated at Stanford University
• Peer-to-peer, decentralized digital preservation system
• Focus is on scholarly articles
  – 7100 e-journal titles, 470 publishers
  – Collects web-based content
  – Preserves it locally
  – Provides 100% post-cancellation access
  – Done with publisher permission


LOCKSS

Capture & Replication


LOCKSS

Audit & Healing
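
The general idea of auditing copies against each other and healing the odd one out can be shown with a deliberately simplified Python sketch. It is not the LOCKSS polling protocol, which is considerably more involved. The function name, the plain majority vote over SHA-256 digests, and the in-memory dictionary of peers are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def audit_and_heal(peers: dict) -> list:
    """Compare every peer's copy, then overwrite minority copies with the majority copy.

    `peers` maps a peer name to the bytes of its copy and is repaired in place.
    Returns the names of the peers that were repaired.
    """
    votes = Counter(digest(content) for content in peers.values())
    winning_digest, count = votes.most_common(1)[0]
    if count * 2 <= len(peers):
        # Without a strict majority there is no safe basis for repair.
        raise RuntimeError("no majority agreement; cannot repair safely")
    good_copy = next(c for c in peers.values() if digest(c) == winning_digest)
    repaired = []
    for name, content in list(peers.items()):
        if digest(content) != winning_digest:
            peers[name] = good_copy
            repaired.append(name)
    return repaired

# Hypothetical peer names; one copy has been damaged.
peers = {
    "library-a": b"article v1",
    "library-b": b"article v1",
    "library-c": b"article vX",
    "library-d": b"article v1",
}
repaired = audit_and_heal(peers)
```

After the call, `repaired` names the one damaged peer and every peer holds the same bytes again. The more independent copies take part, the harder it is for damage at a single site to win the vote, which is the intuition behind "Lots of Copies".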


LOCKSS

• Commodity Hardware & Open source software & Appliance = very low cost
• Follows traditional model of library-based distribution and preservation
  – Lots of Copies
  – Locally Managed Copies
• Publisher permissions ensure legal coverage
• Extensible to other collections


LOCKSS

• CLOCKSS: Controlled LOCKSS
  – Not-for-profit archive for ensuring access to orphaned scholarly content
  – One dozen major publishers + libraries
• Private LOCKSS Networks
  – Alabama Digital Preservation Network
  – Arizona State Library, Archive & Public Records
  – Council of Prairie & Pacific University Libraries Consortium
  – Data Preservation Alliance for the Social Sciences
  – Digital Commons – Berkeley Electronic Press
  – MetaArchive Cooperative Project
  – Digital Federal Depository Library Program