Bodleian Library's DAMS System
TRANSCRIPT
'An institutional repository needs to be a service with continuity behind it... Institutions need to recognize that they are making commitments for the long term.'
Clifford Lynch, 2004
Continuity and Services
We anticipate that the content our systems hold now, or will hold, will outlive any software, hardware or even people involved in its creation.
What content do we anticipate storing for the long term?
Bookshelves for the 21st Century
Legal Deposit
Voluntary Deposit
Digitisation
Extended Remit - Research Data
Administrative data: researchers, projects, places, dates
And anything else deemed worthy.
What core principles do you design a system with when you know the content will outlive you?
Oxford DAMS (Digital Asset Management System)
Not a single piece of software; more a set of guiding principles and aims applied to software and hardware.
The DAMS is the hardware environment (in my opinion)
#1 Keep it simple
Simple is maintainable and easy to understand.
Think about how much we have achieved with the simple concept of using a single command on a simple protocol (HTTP GET) to get an HTML document. (HTML itself being a loose, ill-constrained set of elements.)
GET / HTTP/1.1
Simple protocols such as HTTP, simple formats such as JSON, and API patterns such as REST are heavily favoured.
Remember, at some point, administration and development of the system will need to be handed over (as I am doing now!).
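To make the principle concrete, here is a minimal sketch of the kind of plain-HTTP-plus-JSON interaction being favoured. The endpoint layout and field names are my own illustrations, not the actual DAMS API:

```python
import json
from urllib.request import Request

# Everything below is illustrative: the endpoint layout and field names
# are assumptions, not the real DAMS API.
BASE = "http://dams.example.org"

def object_request(object_id):
    """A plain HTTP GET for one object's record: no sessions, no envelopes."""
    return Request(f"{BASE}/objects/{object_id}", method="GET")

# What a simple JSON response for that GET might look like:
canned_response = '{"id": "name:1", "title": "A letter", "formats": ["image/tiff"]}'

req = object_request("name:1")
record = json.loads(canned_response)
print(req.get_method(), req.full_url)  # GET http://dams.example.org/objects/name:1
print(record["title"])                 # A letter
```

One verb, one URL, one widely understood text format: a future maintainer can debug this with nothing more than a browser or curl.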
#2 Everything but the content is replaceable
The content being held by the system at any one time is what is important, not the surrounding infrastructure.
The content should survive and be reusable in the event that services like databases, search indexes, etc. crash or become corrupted.
1st logical outcome:
Do not use complex packages to hold your content!
Minimise the amount of software and work that you need to maintain to understand how your content is stored.
The basic, low-level digital object in our system is a bag.
Each bag has a manifest, currently in either FOXML or RDF, which serves to identify the component parts and their inter-relationships.
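The bag idea can be sketched in a few lines. The manifest below is a flat checksum list purely for illustration; as stated above, the real manifests are FOXML or RDF, and the file layout here is an assumption:

```python
import hashlib
import os
import tempfile

def write_bag(root, files):
    """Lay out a minimal bag: payload files under data/ plus a manifest
    naming each component part and its checksum. (The DAMS manifests are
    FOXML or RDF; a flat checksum list is used here only to show the idea
    of a self-describing bundle on plain disk.)"""
    data_dir = os.path.join(root, "data")
    os.makedirs(data_dir)
    lines = []
    for name, payload in files.items():
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(payload)
        lines.append(f"{hashlib.md5(payload).hexdigest()}  data/{name}")
    with open(os.path.join(root, "manifest-md5.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

with tempfile.TemporaryDirectory() as root:
    manifest = write_bag(root, {"page1.tiff": b"TIFF bytes", "mods.xml": b"<mods/>"})
    for line in manifest:
        print(line)
```

The point is that a bag needs nothing but a filesystem and a checksum tool to be verified, long after any middleware is gone.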
#3 The services in the system are replaceable and can be rebuilt from the content.
Services such as:
Search indexes (e.g. via Lucene/Solr)
Databases and RDF triplestores (MySQL, Mulgara, etc.)
XML databases (eXist-db, etc.)
Object management middleware (Fedora)
Object registries
2nd logical outcome:
A content store must be able to tell services when items have changed, been created, or deleted, and what items an individual store holds.
We need slightly smarter storage to do this over a network. Local filesystems have had this paradigm for a long time now, but we need to migrate it to the network level.
Each store and set of services needs a way to pass messages in a transactional way, via a messaging system.
This can be as simple as an HTTP call or as robust as an auditable message-queue transaction.
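A sketch of that store-to-services contract, using an in-process queue to stand in for the messaging system (the message shape is my own assumption; a real deployment would use an HTTP callback or an auditable message queue, as above):

```python
import queue

# A sketch of the store-to-services contract: the store emits a message for
# every create/change/delete, and a service replays the stream to stay in
# sync. (In production this would be an HTTP call or a proper message queue;
# the message shape here is an assumption.)
events = queue.Queue()

def store_put(object_id, action):
    """The content store announces what happened to which item."""
    events.put({"id": object_id, "action": action})

# A toy search index rebuilding itself purely from the message stream:
index = set()
store_put("uuid:aaa", "created")
store_put("uuid:bbb", "created")
store_put("uuid:aaa", "deleted")

while not events.empty():
    msg = events.get()
    if msg["action"] in ("created", "changed"):
        index.add(msg["id"])
    else:  # deleted
        index.discard(msg["id"])

print(sorted(index))  # ['uuid:bbb']
```

Because the service is driven entirely by the stream, it can be thrown away and rebuilt from the content at any time, which is exactly principle #3.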
3rd logical outcome:
The storage must store the information and documentation for the standards and conventions its content follows.
The storage must be able to describe itself.
#4 We must lower our dependence on any one node in the system be it hardware, software or even a person.
This is another reason for #1, keeping things simple.
Hardware: lowers the cost of replacement and upgrade; simple hardware tends to be cheaper and more readily available, and lowering the number of crucial features broadens what can be used.
Software: Oxford has been around for 1,000 years. Vendors will not be.
People: a simpler system is more readily understood by new or replacement engineers.
Challenges to maintaining Continuity
Flexibility
Scalability
Longevity
Availability
Sustainability
Interoperability
Flexibility
Open standards and open APIs allow us to provide the tools people need to create the experience with the content they want.
'Text'-based metadata: XML, RDF, RDFa, JSON
Componentised web services providing open APIs: flexibility through dynamically selecting which services run at any time.
Open Source
Scalability
We need to anticipate exponential growth in demand, with the knowledge that storage will be long-term.
A rough estimate of all the data produced in the world every second is 7 km of CD-ROMs: broadcast video, big science, CCTV, cameras, research, etc. This figure is only ever going to go up.
To scale like the web, we have to be like the web: not one single black box and workflow, but a distributed net of storage and services, simply interlinked.
The billion-file [inode] problem must be avoided.
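One common mitigation for the billion-file problem is to shard identifiers across nested directories, in the spirit of pairtree, so that no single directory accumulates millions of entries. The shard width and depth below are arbitrary illustrative choices:

```python
def shard_path(object_id, width=2, depth=3):
    """Spread object ids across nested directories (in the spirit of
    pairtree) so no single directory holds millions of files. Width and
    depth here are arbitrary choices for illustration."""
    # Strip a namespace prefix like "uuid:" and any hyphens before sharding.
    clean = object_id.split(":")[-1].replace("-", "")
    parts = [clean[i:i + width] for i in range(0, width * depth, width)]
    return "/".join(parts + [clean])

print(shard_path("uuid:3f2b8d1ecafe"))  # 3f/2b/8d/3f2b8d1ecafe
```

With two hex characters per level, each directory holds at most 256 children, so even a billion objects stay within comfortable filesystem limits.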
Longevity
Live Replicas (Backup Problems)
MAID (Power)
Self-healing Systems (Resilience)
Simplicity of Interfaces
Avoid 3rd Party dependence
Support Heterogeneity
Resolvers/Abstraction Layers
Availability
Basic IT availability
Enhanced long-term availability
Archival recoverability
Digital Preservation
Conversion
Emulation
Archive Preservation
Sustainability
Budget and cost as a conventional library
Factor archival costs into projects
Leverage content to generate income
Migrate skills
Interoperability
Interoperability is an ongoing process
Support for emerging and established standards
Persistent, stable, well-defined interfaces
Ideally implement interfaces bidirectionally
Avoid low-level interfaces: abstract as much as feasible
Embrace the web: if you do it right, people will reuse your content in ways you never thought of and never thought possible.
DAMS overview
The following schematics may not reflect the absolute current state, but they give the right idea of the DAMS's direction of growth.
[Schematic: Phase 1 current hardware. Components: VMWare ESX (public), SATA VM ImageStore (2TB), HC 1, HC 2, MDICS. Successive slides highlight the content layer, the service storage layer, and the service execution layer.]
[Schematic: Phase 2 (expecting hardware delivery this month). Paired VMWare ESX hosts on public and private networks, VM ImageStores (2TB) over NFS, archive storage on ZFS with ZFS replication between Thumper1 and Thumper2 (split writes and crosschecks), HC 1, HC 2, Sun Rays, FutureArch, MDICS, RSL, Osney? Aim: multisite (add sites as required to scale up).]
No Monolith? Chaos?
The side-effect is that there is no single point of entry into the DAMS.
Beneficial for the reasons stated before.
It moves the problem from continual programmatic work just to 'tread water' to adapting to needs and actual use cases.
Big issue: needs and uses become much more important, and these are hard to tackle!
Commonalities
What conventions and standards are common between the services and archives?
Pro: The more commonalities we adopt, the more time we can spend addressing the real problems.
Con: Too much of a prescribed system can easily lead to paralysis (as we have seen in other big institutional projects outside of Oxford).
Global Identifiers
When source content enters a system on the DAMS, it is given an identifier that is understandable outside of the system,
e.g. not just an internal table ID in a database, but an ID that has meaning outside of the database too.
A URI is a current, widely adopted means of having globally understood identifiers, so this is a nice commonality to adopt.
Consider:
You have an auto-incrementing primary key in a database table.
You adopt a namespace that is unique to your system (but preferably not the project name),
e.g. name: rather than brii:, or letter: rather than cok:.
It is helpful if the namespace implies the source of the content.
Then you would have name:1, name:2, etc.
BUT!
There will be an awful lot of systems which deal with similar types of things.
Is name:1 from one of Su's systems the same as name:1 from one of Anusha's systems?
Or is name:1 in an OUCS system, one that we don't even know about yet, the same?
UUIDs (or GUIDs)
Fundamentally a very big number
So big, that if it is created from a random enough source, you are extremely unlikely to ever create the same one twice.
Ever.
I really mean it.
The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
From Wikipedia:
'The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. Thus, anyone can create a UUID and use it to identify something with reasonable confidence that the identifier will never be unintentionally used by anyone for anything else. Information labeled with UUIDs can therefore be later combined into a single database without needing to resolve name conflicts.'
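Minting one is trivial; Python's standard library does it in one call:

```python
import uuid

# Two independently minted random (version 4) UUIDs: no central
# coordination needed, and a collision is so improbable that in
# practice it can be treated as impossible.
a = uuid.uuid4()
b = uuid.uuid4()

print(a)  # e.g. 110e8400-e29b-41d4-a716-446655440000
assert a != b
assert a.version == 4
assert len(str(a)) == 36  # 32 hex digits plus four hyphens
```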
What's a GUID?
Microsoft has issues about adopting anything that already exists without owning it. So they adopted everything about the UUID, changed the first letter, and called it their own invention.
'Globally' rather than 'Universally'.
Everything else about it is the same.
Yeah, I think it's pretty pointless too, but it's a recurring theme with MS.
UUID vs local ID
One of the problems is that some people don't like to use big, long hexadecimal numbers.
We call these people 'normal'
One of my biggest mistakes with ORA was giving this number visual priority over a local, friendly ID.
Whilst this makes the item citable for much longer, it makes it far more unfriendly.
I didn't do enough URL masking.
External Identifiers
It's rare to get any content that doesn't have at least one identifier of some kind:
from a controlled, centralised system of IDs like DOI,
to a local ID like 'letter 275'.
All of these are important to capture too! They represent the 'labels' that various user groups use to refer to the content.
Identifiers - Summary
Every 'thing' in your system will have more than one identifier or label. This is a Good Thing.
Give every 'thing' in your system identifiers which you, your users and your system will use.
A local one is handy ('name:1'), but
A global one is essential (uuid:...)
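As a sketch of that summary, a stored 'thing' might carry all three kinds of identifier at once. The field names and the example external IDs below are my own illustrations, not the DAMS's actual record structure:

```python
import uuid

# A sketch of the identifier summary: one 'thing', many identifiers.
# Field names and the example external IDs are illustrative only.
def new_thing(local_id, external_ids):
    return {
        "global_id": f"uuid:{uuid.uuid4()}",  # essential; mintable anywhere
        "local_id": local_id,                 # handy and human-friendly
        "external_ids": list(external_ids),   # labels user groups already use
    }

letter = new_thing("letter:275", ["doi:10.1234/example", "letter 275"])
print(letter["local_id"], "->", letter["global_id"])
```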
I glossed over 'Thing', didn't I?
Instead of considering metadata as a set of records, the aim is to have metadata that models or represents that which we are archiving.
If we have a book, X, written solely by an author, Y, then we should have an author object Y linked as the author to a book object X.
[Things get more complicated when we start considering versions, dates, and Work-Manifestations, but more on that later.]
The archives should be object-orientated.
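The book-and-author example above can be sketched as typed links, with each statement a (subject, predicate, object) triple hanging off the global IDs. The identifiers and predicate names here are illustrative:

```python
# The book/author example above as typed links: each statement is a
# (subject, predicate, object) triple, with the global ids as the nodes.
# Identifiers and predicate names are illustrative only.
triples = {
    ("book:X", "title", "An Example Book"),
    ("book:X", "authored-by", "author:Y"),
    ("author:Y", "name", "Y"),
}

def links_from(node):
    """Everything we can say about a node, by following its typed links."""
    return {(p, o) for (s, p, o) in triples if s == node}

print(sorted(links_from("book:X")))
# [('authored-by', 'author:Y'), ('title', 'An Example Book')]
```

This is the whole trick of object-orientated archives: the author is a node in its own right, not a string repeated inside every book record.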
Global Ids? Why?
A global ID provides a 'node': a hook on which we can hang other information, and from and to which we can make links.
[Typed links: links which have more meaning than 'A is linked to B'; more like 'A is authored by B and C'.]
A common label for this way of modelling information is 'Linked Data'.
Guidelines for Linked Data
Use URIs as names for things
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information [using the standards (RDF, SPARQL)].
Include links to other URIs, so that they can discover more things.
By Tim Berners-Lee (the brackets are mine, tho')
http://www.w3.org/DesignIssues/LinkedData.html
RDF and SPARQL
This is the point that I don't quite agree with as strongly as Sir Tim. Let's break it down:
URIs must supply widely understood metadata formats? Agreed.
Services that work in an understood manner? Also agreed.
URIs must supply RDF? Amongst other standards, yes, agreed.
Services must supply SPARQL? Hmmm... I don't agree. Services should be useful and wanted, not necessarily SPARQL-based.
Linked Data... Semantic Web...?
They are slightly different, but you'll never get a straight answer on the differences.
By and large, those that talk about creating Linked Data are more pragmatic than those talking about linking in with the Semantic Web.
But the opposite might be true in some cases ;)
Playing well with others
The key is to think about your content in terms of it being discovered in your absence.
How might others understand the terms?
The meanings of nodes and links? [Nouns and verbs?]
RDF provides ways to share this information. Model your terms with a view that this information will be shared... LINKED DATA!
Curation - Microservices
i.e. what do we do with the content we have, or the content we will get?
Digital Curation
Activities focused on maintaining and adding value to trusted digital content
Encompasses preservation and access, which are complementary, not disparate functions
Preservation ensures access over time
Access depends on preservation up to a point in time
How can we make the Save button really mean save?
Although the title of our talk is couched in terms of preservation, I'm going to switch terms and instead speak about what we now consider the more expansive and appropriate word: curation. Digital curation is the set of activities focused on maintaining and adding value to a body of trusted digital content. There are two key points here. First, the value of a curated digital asset is not fixed at the time of its creation. With the burgeoning phenomenon of social networking, value can be added to an asset over its lifetime. Second, curation encompasses both preservation and access. Although these are often considered disparate activities, we see them instead as complementary: preservation ensures access over time, while access depends on preservation up to a point in time.