University Library Experience CDL Case Study 30 June 2005 John Kunze, California Digital Library

Page 1: University Library Experience CDL Case Study 30 June 2005 John Kunze, California Digital Library

University Library Experience CDL Case Study

30 June 2005

John Kunze, California Digital Library

California Digital Library

• A university library with no books, students, or faculty

• Central services for 10 campus libraries

• Content hosting: electronic texts, web-based material, datasets, finding aids

• Linked: California museums & archives

• Plus a Digital Preservation Program

What’s digital preservation?

• Safeguarding electronic information
– Viability (intact bit streams)
– Renderability (by machines)
– Understandability (by humans)

• There’s no preservation if we don’t know what it’s called
– CDL core need for persistent identifiers

What’s a persistent identifier?

• An identifier that is valid for long enough
– “Valid” and “long enough” are service- and user-dependent

• What’s an identifier? It’s an association between a string and a thing. It follows that:
– An id is not a string of data (good)
– An id is a matter of opinion, not fact; there will be at least one other provider, serial if not in parallel, or your objects die with you (inconvenient)

• Same thing, two strings; or same string, two things

• Often: same string, different metadata

• Often: same string, parallel things diverging over time due to different preservation practices (e.g., migrations)

Accepting some disorder

• Long-term preservation won’t happen unless objects can change residence and diverge
– Campus snapshot to CDL; subsequent snapshots
– Publisher to dim CDL archive; later CDL to SS?
– Better if object lives in several places at once

• Eventually, Producer loses control of copies
– Multiple opinions and practices will flourish
– Static, id-based persistence claims soon irrelevant

• “urn:…”, “hdl:…”, etc. reflect hopes of people long gone

• Not pretty, but the alternative (loss) is worse

Agreeing to disagree

• What we say, but shouldn’t (not loudly):
– Don’t re-assign a persistent id to something else
– Or don’t replace a persistent object with another

• What we do:
– Knowingly replace our persistent objects (typos, drafts, format conversions, home page redesign)
– Honestly provide a real kind of persistence, but with very different replacement policies
– Won’t have one way within CDL, let alone without

Diverse persistence practice

• How dissimilar must two objects be before they get different ids?

• CDL’s home-grown Digital Preservation Repository (open source) is self-service:
– Lets the Submitter decide
– Makes preservation a joint responsibility

• Requirement: need to be able to tell users what flavor of permanence is in effect

CDL Persistent Ids Must…

• Identify, whether or not the object is at hand
– It may not be convenient, helpful, or permitted for you to inspect the object itself; metadata needed

• Convey different flavors of permanence

• Lead to access (if authorized)
– Not strictly an “identification” problem, but it is the “404 not found” that we need to fix

• Be valid for some longish period

• Be carried on, in, or with the object

How to choose an id scheme

• All CDL requirements are purely about service
– Candidate schemes: URL, PURL, URN, ARK, Handle, DOI, MD5, GUID, ISxx, …
– CDL gets no direct service help from any scheme; no scheme or syntax confers persistence of any kind

• We then ask: which schemes are lowest cost and lowest risk?

Myths to fight against

• Harmful Fallacy 1. A URL is a location, and is therefore inherently unstable. (ridiculous)

• Harmful Fallacy 2. Explicit server/resolver names make URLs inherently unstable.
– So “loc.gov” is less stable than “handle.net” and the implicit global resolvers that it depends on?

• Harmful Fallacy 3. HTTP-based resolvers will not scale for persistent access. (google)

• Harmful Fallacy 4. URLs are the problem.
– “Cool URLs don’t break” -- Tim Berners-Lee

Impersistence - big factors

• Bankruptcy - no successor found

• Loss of funding - no successor found

• Loss of political support

• War, social upheaval, natural disaster

• Scheme impact: zero

Impersistence - lesser factors

Deliberately or accidentally, objects are

• Removed

• Replaced

• Moved without setting up a redirect
– Everyone has an indirection mechanism, though most don’t use it

• Scheme impact: zero

Impersistence - small factors

Your org likes persistent ids in principle, but

• It lacks knowledge that vanilla web servers trivially support 500,000 redirect directives

• It lacks the expertise or staff to maintain a web server, a two-column database table, and a nightly server config file report writer

• Scheme impact: zero
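
As a rough illustration of the "two-column table plus nightly report writer" idea above, here is a minimal sketch in Python. It assumes an Apache-style `Redirect` directive (from mod_alias) as the output format; the table contents, host names, and the function name `redirect_directives` are hypothetical and not taken from the talk.

```python
import csv
import io

def redirect_directives(table_rows):
    """Turn (identifier path, current URL) rows into Apache-style Redirect lines."""
    lines = []
    for row in table_rows:
        if len(row) != 2:
            continue  # skip malformed rows rather than emit a broken directive
        id_path, target = row[0].strip(), row[1].strip()
        if not id_path or not target:
            continue  # skip rows with an empty column
        # "temp" because the target may move again while the id stays the same
        lines.append(f"Redirect temp {id_path} {target}")
    return lines

# Hypothetical two-column table: identifier path, current location
SAMPLE_TABLE = """/ark:/99999/fk4abc,https://repo.example.org/objects/1
/ark:/99999/fk4def,https://repo.example.org/objects/2
"""

rows = list(csv.reader(io.StringIO(SAMPLE_TABLE)))
config = redirect_directives(rows)
print("\n".join(config))
```

A nightly job could write these lines to a file that the web server includes; when an object moves, only the second column of the table changes.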

Scheme costs and risks

• Every modern service must indefinitely support, and find or be given replacements for, at least
– Web server, web browser, and DNS

• In addition, URN, Handle, and DOI resolution need a global proxy or a plugin for every access

• ARK could use a plugin, but doesn’t need it

• Handle and DOI also require
– You to maintain an extra local server
– The community to maintain a set of global servers

• For the CDL
– Handle and DOI come with highest risk
– ARK comes with lowest risk

Persistence - indirect factors

CDL’s persistence requirements call for an id scheme (not service) connecting users to

• metadata

• whether and what kind of persistence

• sub-object and variant inferences

• core ids on proxy failure (gracefully)

• Scheme impact: ARK provides these
– A scheme is not a service (DOI is not CrossRef)
– When choosing a scheme, we wanted to remain independent of extra external service providers
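
To make the first, second, and fourth bullets concrete, here is a small sketch in Python. It assumes the convention, as described in the ARK specification drafts of the time, that appending `?` to an ARK requests metadata and `??` requests a persistence statement; treat that convention, the host names, and the function names as assumptions for illustration only.

```python
def core_ark(url_or_ark):
    """Return the hostname-independent core id, e.g. 'ark:/99999/fk4abc'."""
    start = url_or_ark.find("ark:/")
    if start < 0:
        raise ValueError("no ARK found in: " + url_or_ark)
    # Drop any trailing '?' or '??' request suffix
    return url_or_ark[start:].rstrip("?")

def service_urls(resolver_base, url_or_ark):
    """Build object, metadata, and persistence-statement URLs on a given resolver."""
    base = resolver_base.rstrip("/") + "/" + core_ark(url_or_ark)
    return {
        "object": base,
        "metadata": base + "?",      # assumed: brief metadata record
        "persistence": base + "??",  # assumed: statement of persistence policy
    }

# If the original proxy is gone, the core id can be retried on another host
# (both hosts below are hypothetical).
old_url = "http://ark.example.org/ark:/99999/fk4abc"
print(service_urls("http://mirror.example.net/", old_url))
```

The point of the sketch is that the core id survives the loss of any one host name, so a user holding the old URL can still recover something worth asking another provider about.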

Our Stuff vs Their Stuff

• Persistence can be split into
– the Our Stuff Problem
– the Their Stuff Problem

• It makes no sense for CDL to assign persistent ids to Their Stuff
– Their Stuff can be hugely important to our users, but we don’t control it and cannot vouch for it
– Where we can afford it, we track it with PURLs

• CDL does assign persistent ids to Our Stuff

Distribution of Id Assignment

• Objects are ingested in flows from other libraries, per submission agreements

• Each object has an ARK after ingest
– Either it has it already
– Or we give it one upon entry

• Campuses can mint their own ARKs or rely on our minting service

• Their own campus ARK namespace is theirs to divide up as they wish
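
The ingest rule above (keep an existing ARK, otherwise assign one) can be sketched as follows. This is not CDL's minting service; the vowel-free alphabet, the name length, the dictionary-based object, and the function names are all assumptions chosen for illustration.

```python
import secrets

# Digits plus consonants; leaving out vowels avoids minting accidental words
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"

def mint_ark(naan, existing, length=8):
    """Mint a random opaque ARK under the given namespace, avoiding collisions."""
    while True:
        name = "".join(secrets.choice(ALPHABET) for _ in range(length))
        ark = f"ark:/{naan}/{name}"
        if ark not in existing:
            existing.add(ark)
            return ark

def ingest(obj, naan, existing):
    """Keep the submitter's ARK if the object has one, otherwise mint one."""
    ark = obj.get("ark")
    if ark:
        existing.add(ark)
    else:
        ark = mint_ark(naan, existing)
    return {**obj, "ark": ark}

known = set()
from_campus = ingest({"title": "Finding aid", "ark": "ark:/99999/fk4abc"}, "99999", known)
new_object = ingest({"title": "Dataset"}, "99999", known)
print(from_campus["ark"], new_object["ark"])
```

A campus that mints its own ARKs would simply arrive with the `ark` field already filled in, as in the first call.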

Opaque ids with semantic extensions

• CDL dilemma:
– Opaque ids are needed for names that age and travel well
– Semantically laden ids are helpful in providing many id services

• Hybrid:
– Opaque ids are used to name abstract preservation objects
– Semantic and sometimes transient extensions address components inside of objects (the set of components evolves over time anyway)
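
One way to picture the hybrid is to split an identifier into its opaque object name and its semantic extension. The sketch below assumes the ARK layout `ark:/NAAN/Name` followed by optional `/component` and `.variant` parts; the sample identifier and the function name `split_ark` are hypothetical.

```python
import re

# NAAN, then an opaque name (no '/', '.', or '?'), then an optional extension
ARK_RE = re.compile(r"ark:/(?P<naan>[^/]+)/(?P<name>[^/.?]+)(?P<ext>[^?]*)")

def split_ark(ark):
    """Separate the opaque object id from its semantic extension."""
    match = ARK_RE.search(ark)
    if match is None:
        raise ValueError("not an ARK: " + ark)
    return {
        "naan": match.group("naan"),
        # The opaque part names the abstract preservation object
        "object": f"ark:/{match.group('naan')}/{match.group('name')}",
        # The extension addresses a component or variant and may change over time
        "extension": match.group("ext"),
    }

parts = split_ark("ark:/99999/fk4abc/chapter2/fig3.tiff")
print(parts["object"], parts["extension"])
```

Only the `object` part needs to age and travel well; the extension can be reinterpreted as the set of components inside the object evolves.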