University Library Experience CDL Case Study
30 June 2005
John Kunze, California Digital Library
California Digital Library
• A university library with no books, students, or faculty
• Central services for 10 campus libraries
• Content hosting: electronic texts, web-based material, datasets, finding aids
• Linked: California museums & archives
• Plus a Digital Preservation Program
What’s digital preservation?
• Safeguarding electronic information
– Viability (intact bit streams)
– Renderability (by machines)
– Understandability (by humans)
• There’s no preservation if we don’t know what it’s called
– CDL core need for persistent identifiers
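One common way to check viability (intact bit streams) is to compare a file's current checksum against a digest recorded earlier. The sketch below is illustrative only: the function names and the choice of SHA-256 are assumptions, not something the talk describes.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path, recorded_digest: str) -> bool:
    """True if the file's current digest matches the digest recorded earlier."""
    return sha256_of(path) == recorded_digest.lower()
```

A check like this covers only viability; renderability and understandability need format and metadata knowledge that a checksum cannot supply.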
What’s a persistent identifier?
• An identifier that is valid for long enough
– “valid” and “enough”: these are service/user dependent
• What’s an identifier? It’s an association between a string and a thing. It follows that:
– An id is not a string of data (good)
– An id is a matter of opinion, not fact; there will be at least one other provider, serial if not in parallel, or your objects die with you (inconvenient)
• Same thing, two strings; or same string, two things
• Often: same string, different metadata
• Often: same string, parallel things diverging over time due to different preservation practices (e.g., migrations)
Accepting some disorder
• Long-term preservation won’t happen unless objects can change residence and diverge
– Campus snapshot to CDL; subsequent snapshots
– Publisher to dim CDL archive; later CDL to SS?
– Better if object lives in several places at once
• Eventually, Producer loses control of copies
– Multiple opinions and practices will flourish
– Static, id-based persistence claims soon irrelevant
• “urn:…”, “hdl:…”, etc. reflect hopes of people long gone
• Not pretty, but the alternative (loss) is worse
Agreeing to disagree
• What we say, but shouldn’t (not loudly):
– Don’t re-assign a persistent id to something else
– Or don’t replace a persistent object with another
• What we do:
– Knowingly replace our persistent objects (typos, drafts, format conversions, home page redesign)
– Honestly provide a real kind of persistence, but with very different replacement policies
– Won’t have one way within CDL, let alone without
Diverse persistence practice
• How dissimilar must two objects be before they get different ids?
• CDL’s home-grown Digital Preservation Repository (open source) is self-service:
– Lets the Submitter decide
– Makes preservation a joint responsibility
• Requirement: need to be able to tell users what flavor of permanence is in effect
CDL Persistent Ids Must…
• Identify, whether or not the object is at hand
– It may not be convenient, helpful, or permitted for you to inspect the object itself -- metadata needed
• Convey different flavors of permanence
• Lead to access (if authorized)
– Not strictly an “identification” problem, but it is the “404 not found” that we need to fix
• Be valid for some longish period
• Be carried on, in, or with the object
How to choose an id scheme
• All CDL requirements are purely about service
– Candidate schemes: URL, PURL, URN, ARK, Handle, DOI, MD5, GUID, ISxx, …
– CDL gets no direct service help from any scheme; no scheme or syntax confers persistence of any kind
• We then ask: which schemes are lowest cost and lowest risk?
Myths to fight against
• Harmful Fallacy 1. A URL is a location, and is therefore inherently unstable. (ridiculous)
• Harmful Fallacy 2. Explicit server/resolver names make URLs inherently unstable.
– So “loc.gov” is less stable than “handle.net” and the implicit global resolvers that it depends on?
• Harmful Fallacy 3. HTTP-based resolvers will not scale for persistent access. (Google)
• Harmful Fallacy 4. URLs are the problem.
– “Cool URLs don’t break” -- Tim Berners-Lee
Impersistence - big factors
• Bankruptcy - no successor found
• Loss of funding - no successor found
• Loss of political support
• War, social upheaval, natural disaster
• Scheme impact: zero
Impersistence - lesser factors
Deliberately or accidentally, objects are
• Removed
• Replaced
• Moved without setting up a redirect
– Everyone has an indirection mechanism, though most don’t use it
• Scheme impact: zero
Impersistence - small factors
Your org likes persistent ids in principle, but
• It lacks knowledge that vanilla web servers trivially support 500,000 redirect directives
• It lacks the expertise or staff to maintain a web server, a two-column database table, and a nightly server config file report writer
• Scheme impact: zero
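The "two-column table plus nightly report writer" idea can be sketched in a few lines. This is a minimal illustration, assuming an Apache-style server that reads mod_alias `Redirect` directives; the table contents, hostnames, and the function name `redirect_directives` are all hypothetical.

```python
import csv
import io


def redirect_directives(table_rows):
    """Turn (id_path, current_url) rows into Apache mod_alias Redirect lines."""
    lines = []
    for id_path, current_url in table_rows:
        id_path, current_url = id_path.strip(), current_url.strip()
        if not id_path.startswith("/"):
            raise ValueError(f"id path must start with '/': {id_path!r}")
        lines.append(f"Redirect permanent {id_path} {current_url}")
    return "\n".join(lines) + "\n"


# Hypothetical two-column table: persistent id path, current location.
SAMPLE_TABLE = """\
/ark:/12345/x1,http://repo.example.org/objects/0001
/ark:/12345/x2,http://repo.example.org/objects/0002
"""


def main():
    # A nightly job would read the real table and write the server config file;
    # here the generated directives are simply printed.
    rows = csv.reader(io.StringIO(SAMPLE_TABLE))
    print(redirect_directives(rows), end="")


if __name__ == "__main__":
    main()
```

When an object moves, only the second column of its row changes; the persistent id path in the first column stays the same.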
Scheme costs and risks
• Every modern service needs to support indefinitely, and find or be given replacements for, at least
– Web server, web browser, and DNS
• In addition, URN, Handle, and DOI resolution need a global proxy or a plugin for every access
• ARK could use a plugin, but doesn’t need it
• Handle and DOI also require
– You to maintain an extra local server
– The community to maintain a set of global servers
• For the CDL
– Handle and DOI come with highest risk
– ARK comes with lowest risk
Persistence - indirect factors
CDL’s persistence requirements call for an id scheme (not service) connecting users to
• metadata
• whether and what kind of persistence
• sub-object and variant inferences
• core ids on proxy failure (gracefully)
• Scheme impact: ARK provides these
– A scheme is not a service (DOI is not CrossRef)
– When choosing a scheme, we wanted to remain independent of extra external service providers
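To illustrate "core ids on proxy failure": an ARK carried in a URL contains an `ark:/NAAN/name` core that does not depend on the hostname in front of it, so the core can be recovered and tried against another host. The sketch below assumes that layout; the regular expression, function names, and hostnames are illustrative, not part of any CDL tool.

```python
import re

# Matches the "ark:/NAAN/name" core anywhere in a string; whatever hostname
# precedes it is ignored. Query and fragment parts are excluded.
ARK_CORE = re.compile(r"ark:/(\d+)/([^\s?#]+)")


def core_id(identifier: str):
    """Return the host-independent core of an ARK, or None if none is found."""
    match = ARK_CORE.search(identifier)
    if match is None:
        return None
    return f"ark:/{match.group(1)}/{match.group(2)}"


def rehost(identifier: str, new_host: str) -> str:
    """Rebuild an actionable URL for the same core id at a different host."""
    core = core_id(identifier)
    if core is None:
        raise ValueError(f"no ARK core found in {identifier!r}")
    return f"http://{new_host}/{core}"
```

For example, `rehost("http://old.example.org/ark:/12345/x1", "mirror.example.net")` yields a URL for the same core id at the hypothetical mirror.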
Our Stuff vs Their Stuff
• Persistence can be split into
– the Our Stuff Problem
– the Their Stuff Problem
• It makes no sense for CDL to assign persistent ids to Their Stuff
– Their Stuff can be hugely important to our users, but we don’t control it and cannot vouch for it
– Where we can afford it, we track them with PURLs
• CDL does assign persistent ids to Our Stuff
Distribution of Id Assignment
• Objects ingested in flows from other libraries per submission agreements
• Each object has an ARK after ingest
– Either it has it already
– Or we give it one upon entry
• Campuses can mint their own ARKs or rely on our minting service
• Their own campus ARK namespace is theirs to divide up as they wish
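The ingest rule above (keep an ARK the object already has, otherwise mint one) might look like the following sketch. It is not CDL's minting service: the alphabet, the name length, the dictionary-based object, and the example namespace numbers are all assumptions made for illustration.

```python
import secrets

# Hypothetical alphabet: digits plus consonants, so minted names are opaque
# and unlikely to spell words.
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"


def mint_ark(naan: str, taken: set, length: int = 8) -> str:
    """Mint a random, previously unused ARK in the given namespace."""
    while True:
        name = "".join(secrets.choice(ALPHABET) for _ in range(length))
        ark = f"ark:/{naan}/{name}"
        if ark not in taken:
            taken.add(ark)
            return ark


def ark_after_ingest(obj: dict, naan: str, taken: set) -> str:
    """Keep the object's existing ARK, or give it one upon entry."""
    existing = obj.get("ark")
    if existing:
        taken.add(existing)  # remember it so it is never minted again
        return existing
    obj["ark"] = mint_ark(naan, taken)
    return obj["ark"]
```

A campus that mints its own ARKs would submit objects with the `ark` field already filled in; everything else falls through to the minter.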
Opaque ids with semantic extensions
• CDL dilemma:
– Opaque ids are needed for names that age and travel well
– Semantically laden ids are helpful in providing many id services
• Hybrid:
– Opaque ids are used to name abstract preservation objects
– Semantic and sometimes transient extensions address components inside of objects (the set of components evolves over time anyway)
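The hybrid can be pictured as an opaque object name followed by an optional, possibly transient, component path. The sketch below assumes that the extension is whatever follows the name after a further "/"; that convention, the function name, and the sample id are assumptions for illustration.

```python
def split_hybrid(ark: str):
    """Split 'ark:/NAAN/name/ext...' into (opaque object id, extension or '')."""
    prefix = "ark:/"
    if not ark.startswith(prefix):
        raise ValueError(f"not an ARK: {ark!r}")
    # At most three parts: namespace number, opaque name, and the rest.
    parts = ark[len(prefix):].split("/", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"missing namespace or name: {ark!r}")
    naan, name = parts[0], parts[1]
    extension = parts[2] if len(parts) == 3 else ""
    return f"{prefix}{naan}/{name}", extension
```

With this split, `ark:/12345/x5wf6789/chapter3/figure2.tiff` names the object `ark:/12345/x5wf6789` and the component `chapter3/figure2.tiff`; only the first part needs to age and travel well.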