Bodleian Library's DAMS System
TRANSCRIPT
'An institutional repository needs to be a service with continuity behind it... Institutions need to recognize that they are making commitments for the long term.'
Clifford Lynch, 2004
Continuity and Services
We anticipate that the content our systems hold now, or will hold, will outlive any software, hardware or even people involved in its creation.
What content do we anticipate storing for the long term?
Bookshelves for the 21st Century
Legal Deposit
Voluntary Deposit
Digitisation
Extended Remit - Research Data
Administrative data: researchers, projects, places, dates
And anything else deemed worthy.
What core principles do you design a system with when you know the content will outlive you?
Oxford DAMS (Digital Asset Management System)
Not a single piece of software; more a set of guiding principles and aims applied to software and hardware.
The DAMS is the hardware environment (in my opinion)
#1 Keep it simple
Simple is maintainable and easy to understand.
Think about how much we have achieved with the simple concept of using a single command on a simple protocol (HTTP GET) to get an HTML document. (HTML itself being a loose, ill-constrained set of elements.)
GET / HTTP/1.1
Simple protocols such as HTTP, simple formats such as JSON, and API patterns such as REST are heavily favoured.
Remember, at some point, administration and development of the system will need to be handed over (as I am doing now!).
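To make the principle concrete, here is a minimal sketch of the kind of plain-HTTP-plus-JSON interaction being favoured. The endpoint layout and field names are my own illustrations, not the actual DAMS API:

```python
import json
from urllib.request import Request

# Everything below is illustrative: the endpoint layout and field names
# are assumptions, not the real DAMS API.
BASE = "http://dams.example.org"

def object_request(object_id):
    """A plain HTTP GET for one object's record: no sessions, no envelopes."""
    return Request(f"{BASE}/objects/{object_id}", method="GET")

# What a simple JSON response for that GET might look like:
canned_response = '{"id": "name:1", "title": "A letter", "formats": ["image/tiff"]}'

req = object_request("name:1")
record = json.loads(canned_response)
print(req.get_method(), req.full_url)  # GET http://dams.example.org/objects/name:1
print(record["title"])                 # A letter
```

One verb, one URL, one widely understood text format: a future maintainer can debug this with nothing more than a browser or curl.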
#2 Everything but the content is replaceable
The content being held by the system at any one time is what is important, not the surrounding infrastructure.
The content should survive and be reusable in the event that services like databases, search indexes, etc. crash or become corrupted.
1st logical outcome:
Do not use complex packages to hold your content!
Minimise the amount of software and work that you need to maintain to understand how your content is stored.
The basic, low-level digital object in our system is a bag.
Each bag has a manifest, currently in either FOXML or RDF, which serves to identify the component parts and their inter-relationships.
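The bag idea can be sketched in a few lines. The manifest below is a flat checksum list purely for illustration; as stated above, the real manifests are FOXML or RDF, and the file layout here is an assumption:

```python
import hashlib
import os
import tempfile

def write_bag(root, files):
    """Lay out a minimal bag: payload files under data/ plus a manifest
    naming each component part and its checksum. (The DAMS manifests are
    FOXML or RDF; a flat checksum list is used here only to show the idea
    of a self-describing bundle on plain disk.)"""
    data_dir = os.path.join(root, "data")
    os.makedirs(data_dir)
    lines = []
    for name, payload in files.items():
        with open(os.path.join(data_dir, name), "wb") as f:
            f.write(payload)
        lines.append(f"{hashlib.md5(payload).hexdigest()}  data/{name}")
    with open(os.path.join(root, "manifest-md5.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

with tempfile.TemporaryDirectory() as root:
    manifest = write_bag(root, {"page1.tiff": b"TIFF bytes", "mods.xml": b"<mods/>"})
    for line in manifest:
        print(line)
```

The point is that a bag needs nothing but a filesystem and a checksum tool to be verified, long after any middleware is gone.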
#3 The services in the system are replaceable and can be rebuilt from the content.
Services such as:
Search indexes (e.g. via Lucene/Solr)
Databases and RDF triplestores (MySQL, Mulgara, etc.)
XML databases (eXist-db, etc.)
Object management middleware (Fedora)
Object registries
2nd logical outcome:
A content store must be able to tell services when items have changed, been created, or deleted, and what items an individual store holds.
We need slightly smarter storage to do this over a network. Local filesystems have had this paradigm for a long time now, but we need to migrate it to the network level.
Each store and set of services needs a way to pass messages in a transactional way, via a messaging system.
This can be as simple as an HTTP call or as robust as an auditable message-queue transaction.
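A sketch of that store-to-services contract, using an in-process queue to stand in for the messaging system (the message shape is my own assumption; a real deployment would use an HTTP callback or an auditable message queue, as above):

```python
import queue

# A sketch of the store-to-services contract: the store emits a message for
# every create/change/delete, and a service replays the stream to stay in
# sync. (In production this would be an HTTP call or a proper message queue;
# the message shape here is an assumption.)
events = queue.Queue()

def store_put(object_id, action):
    """The content store announces what happened to which item."""
    events.put({"id": object_id, "action": action})

# A toy search index rebuilding itself purely from the message stream:
index = set()
store_put("uuid:aaa", "created")
store_put("uuid:bbb", "created")
store_put("uuid:aaa", "deleted")

while not events.empty():
    msg = events.get()
    if msg["action"] in ("created", "changed"):
        index.add(msg["id"])
    else:  # deleted
        index.discard(msg["id"])

print(sorted(index))  # ['uuid:bbb']
```

Because the service is driven entirely by the stream, it can be thrown away and rebuilt from the content at any time, which is exactly principle #3.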
3rd logical outcome:
The storage must store the information and documentation for the standards and conventions its content follows.
The storage must be able to describe itself.
#4 We must lower our dependence on any one node in the system be it hardware, software or even a person.
This is another reason for #1, keeping things simple.
Hardware: lowers the cost of replacement and upgrade; simple hardware tends to be cheaper and more readily available, and lowering the number of crucial features broadens what can be used.
Software: Oxford has been around for 1,000 years. Vendors will not be.
People: a simpler system is more readily understood by new or replacement engineers.
Challenges to maintaining Continuity
Flexibility
Scalability
Longevity
Availability
Sustainability
Interoperability
Flexibility
Open standards and open APIs allow us to provide the tools people need to create the experience with the content they want.
'Text'-based metadata: XML, RDF, RDFa, JSON
Componentised web services providing open APIs: flexibility through dynamically selecting which services run at any time.
Open Source
Scalability
We need to anticipate exponential growth in demand, with the knowledge that storage will be long-term.
A rough estimate of all the data produced in the world every second is 7 km of CD-ROMs: broadcast video, big science, CCTV, cameras, research, etc. This figure is only ever going to go up.
To scale like the web, we have to be like the web: not one single black box and workflow, but a distributed net of storage and services, simply interlinked.
The billion-file [inode] problem must be avoided.
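One common mitigation for the billion-file problem is to shard identifiers across nested directories, in the spirit of pairtree, so that no single directory accumulates millions of entries. The shard width and depth below are arbitrary illustrative choices:

```python
def shard_path(object_id, width=2, depth=3):
    """Spread object ids across nested directories (in the spirit of
    pairtree) so no single directory holds millions of files. Width and
    depth here are arbitrary choices for illustration."""
    # Strip a namespace prefix like "uuid:" and any hyphens before sharding.
    clean = object_id.split(":")[-1].replace("-", "")
    parts = [clean[i:i + width] for i in range(0, width * depth, width)]
    return "/".join(parts + [clean])

print(shard_path("uuid:3f2b8d1ecafe"))  # 3f/2b/8d/3f2b8d1ecafe
```

With two hex characters per level, each directory holds at most 256 children, so even a billion objects stay within comfortable filesystem limits.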
Longevity
Live Replicas (Backup Problems)
MAID (Power)
Self-healing Systems (Resilience)
Simplicity of Interfaces
Avoid 3rd Party dependence
Support Heterogeneity
Resolvers/Abstraction Layers
Availability
Basic IT availability
Enhanced long-term availability
Archival recoverability
Digital Preservation
Conversion
Emulation
Archive Preservation
Sustainability
Budget and cost as a conventional library
Factor archival costs into projects
Leverage content to generate income
Migrate skills
Interoperability
Interoperability is an ongoing process
Support for emerging and established standards
Persistent, stable, well-defined interfaces
Ideally implement interfaces bidirectionally
Avoid low-level interfaces: abstract as much as feasible
Embrace the web: if you do it right, people will reuse your content in ways you never thought of and never thought possible.
DAMS overview
The following schematics may not reflect the absolute current state, but they give the right idea of the DAMS's direction of growth.
[Schematic: Phase 1 current hardware. Components: VMWare ESX (public), SATA VM ImageStore (2TB), HC 1, HC 2, MDICS. Successive slides highlight the content layer, the service storage layer, and the service execution layer.]
[Schematic: Phase 2 (expecting hardware delivery this month). Paired VMWare ESX hosts on public and private networks, VM ImageStores (2TB) over NFS, archive storage on ZFS with ZFS replication between Thumper1 and Thumper2 (split writes and crosschecks), HC 1, HC 2, Sun Rays, FutureArch, MDICS, RSL, Osney? Aim: multisite (add sites as required to scale up).]
No Monolith? Chaos?
The side-effect is that there is no single point of entry into the DAMS.
Beneficial for the reasons stated before.
It moves the problem from continual programmatic work just to 'tread water' to adapting to needs and actual use cases.
Big issue: needs and uses become much more important, and these are hard to tackle!
Commonalities
What conventions and standards are common between the services and archives?
Pro: The more commonalities we adopt, the more time we can spend addressing the real problems.
Con: Too much of a prescribed system can easily lead to paralysis (as we have seen in other big institutional projects outside of Oxford).
Global Identifiers
When source content enters a system on the DAMS, it is given an identifier that is understandable outside of the system,
e.g. not just an internal table ID in a database, but an ID that has meaning outside of the database too.
A URI is a current, widely adopted means of having globally understood identifiers, so this is a nice commonality to adopt.
Consider:
You have an auto-incrementing primary key in a database table.
You adopt a namespace that is unique to your system (but preferably not the project name),
e.g. name: rather than brii:, or letter: rather than cok:.
It is helpful if the namespace implies the source of the content.
Then you would have name:1, name:2, etc.
BUT!
There will be an awful lot of systems which deal with similar types of things.
Is name:1 from one of Su's systems the same as name:1 from one of Anusha's systems?
Or is name:1 in an OUCS system, one that we don't even know about yet, the same?
UUIDs (or GUIDs)
Fundamentally a very big number
So big, that if it is created from a random enough source, you are extremely unlikely to ever create the same one twice.
Ever.
I really mean it.
The probability of one duplicate would be about 50% if every person on earth owns 600 million UUIDs.
From Wikipedia:
'The intent of UUIDs is to enable distributed systems to uniquely identify information without significant central coordination. Thus, anyone can create a UUID and use it to identify something with reasonable confidence that the identifier will never be unintentionally used by anyone for anything else. Information labeled with UUIDs can therefore be later combined into a single database without needing to resolve name conflicts.'
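Minting one is trivial; Python's standard library does it in one call:

```python
import uuid

# Two independently minted random (version 4) UUIDs: no central
# coordination needed, and a collision is so improbable that in
# practice it can be treated as impossible.
a = uuid.uuid4()
b = uuid.uuid4()

print(a)  # e.g. 110e8400-e29b-41d4-a716-446655440000
assert a != b
assert a.version == 4
assert len(str(a)) == 36  # 32 hex digits plus four hyphens
```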
What's a GUID?
Microsoft has issues about adopting anything that already exists without owning it. So they adopted everything about the UUID, changed the first letter, and called it their own invention.
'Globally' rather than 'Universally'.
Everything else about it is the same.
Yeah, I think it's pretty pointless too, but it's a recurring theme with MS.
UUID vs local ID
One of the problems is that some people don't like to use big, long hexadecimal numbers.
We call these people 'normal'
One of my biggest mistakes with ORA was giving this number visual priority over a local, friendly ID.
Whilst this makes the item citable for much longer, it makes it far more unfriendly.
I didn't do enough URL masking.
External Identifiers
It's rare to get any content that doesn't have at least one identifier of some kind:
from a controlled, centralised system of IDs like DOI,
to a local ID like 'letter 275'.
All of these are important to capture too! They represent the 'labels' that various user groups use to refer to the content.
Identifiers - Summary
Every 'thing' in your system will have more than one identifier or label. This is a Good Thing.
Give every 'thing' in your system identifiers which you, your users and your system will use.
A local one is handy ('name:1'), but
A global one is essential (uuid:...)
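As a sketch of that summary, a stored 'thing' might carry all three kinds of identifier at once. The field names and the example external IDs below are my own illustrations, not the DAMS's actual record structure:

```python
import uuid

# A sketch of the identifier summary: one 'thing', many identifiers.
# Field names and the example external IDs are illustrative only.
def new_thing(local_id, external_ids):
    return {
        "global_id": f"uuid:{uuid.uuid4()}",  # essential; mintable anywhere
        "local_id": local_id,                 # handy and human-friendly
        "external_ids": list(external_ids),   # labels user groups already use
    }

letter = new_thing("letter:275", ["doi:10.1234/example", "letter 275"])
print(letter["local_id"], "->", letter["global_id"])
```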
I glossed over 'Thing', didn't I?
Instead of considering metadata as a set of records, the aim is to have metadata that models or represents that which we are archiving.
If we have a book, X, written solely by an author, Y, then we should have an author object Y linked as the author to a book object X.
[Things get more complicated when we start considering versions, dates, and Work-Manifestations, but more on that later.]
The archives should be object-orientated.
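The book-and-author example above can be sketched as typed links, with each statement a (subject, predicate, object) triple hanging off the global IDs. The identifiers and predicate names here are illustrative:

```python
# The book/author example above as typed links: each statement is a
# (subject, predicate, object) triple, with the global ids as the nodes.
# Identifiers and predicate names are illustrative only.
triples = {
    ("book:X", "title", "An Example Book"),
    ("book:X", "authored-by", "author:Y"),
    ("author:Y", "name", "Y"),
}

def links_from(node):
    """Everything we can say about a node, by following its typed links."""
    return {(p, o) for (s, p, o) in triples if s == node}

print(sorted(links_from("book:X")))
# [('authored-by', 'author:Y'), ('title', 'An Example Book')]
```

This is the whole trick of object-orientated archives: the author is a node in its own right, not a string repeated inside every book record.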
Global Ids? Why?
A global ID provides a 'node': a hook on which we can hang other information, and from and to which we can make links.
[Typed links: links which have more meaning than 'A is linked to B'; more like 'A is authored by B and C'.]
A common label for this way of modelling information is 'Linked Data'.
Guidelines for Linked Data
Use URIs as names for things
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information [using the standards (RDF, SPARQL)].
Include links to other URIs, so that they can discover more things.
By Tim Berners-Lee (the brackets are mine, tho')
http://www.w3.org/DesignIssues/LinkedData.html
RDF and SPARQL
This is the point that I don't quite agree with as strongly as Sir Tim. Let's break it down:
URIs must supply widely understood metadata formats? Agreed.
Services that work in an understood manner? Also agreed.
URIs must supply RDF? Amongst other standards, yes, agreed.
Services must supply SPARQL? Hmmm... I don't agree. Services should be useful and wanted, not necessarily SPARQL-based.
Linked Data... Semantic Web...?
They are slightly different, but you'll never get a straight answer on the differences.
By and large, those that talk about creating Linked Data are more pragmatic than those talking about linking in with the Semantic Web.
But the opposite might be true in some cases ;)
Playing well with others
The key is to think about your content in terms of it being discovered in your absence.
How might others understand the terms?
The meanings of nodes and links? [Nouns and verbs?]
RDF provides ways to share this information. Model your terms with a view that this information will be shared... LINKED DATA!
Curation - Microservices
i.e. what do we do with the content we have, or the content we will get?
Digital Curation
Activities focused on maintaining and adding value to trusted digital content
Encompasses preservation and access, which are complementary, not disparate functions
Preservation ensures access over time
Access depends on preservation up to a point in time
How can we make the Save button really mean save?
Although the title of our talk is couched in terms of preservation, I'm going to switch terms and instead speak about what we now consider the more expansive and appropriate word: curation. Digital curation is the set of activities focused on maintaining and adding value to a body of trusted digital content. There are two key points here. First, the value of a curated digital asset is not fixed at the time of its creation. With the burgeoning phenomenon of social networking, value can be added to an asset over its lifetime. Second, curation encompasses both preservation and access. Although these are often considered disparate activities, we see them instead as complementary: preservation ensures access over time, while access depends on preservation up to a point in time.