UC3 Curation Micro-Services
Simplified Repository Ingest
UC Curation Center
California Digital Library
May 20, 2010
Agenda
Introduction
– Welcome and review of objectives
– UC3 and digital curation
– Landscape, assumptions, and imperatives
Curation micro-services
– The Merritt project
– Design goals
– The future of the DPR
Simplified repository ingest
– Concepts
– Implementation
– Demonstration
Discussion
Objectives
By the end of this discussion we hope that you will understand
– Digital curation and the UC3 mission
– The emergent, micro-services approach to curation infrastructure
– The Merritt curation environment and the future of the DPR
– The Merritt Ingest service and its interactions with the Identity, Storage, and Inventory services
– How to incorporate the Ingest service into your workflows
University of California Curation Center (UC3)
We’ve changed our name, but not our commitment
– Ensuring that the information resources supporting, and resulting from, the University’s research, teaching, and learning mission remain authentic, available, and usable
UC3 is a Center of Excellence
– A creative partnership bringing together the expertise and resources of the CDL, the ten UC campuses, and the broader international curation community
Digital curation
The set of policies and practices focused on managing and adding value to a body of trusted digital content
– Preservation ensures access over time
– Access depends upon preservation up to a point in time
It can also be seen as facilitating the alignment of the scholarly and information lifecycles
[Figure: the information lifecycle and the scholarly lifecycle. Labels: Create, Gather, Collect, Manage, Publish, Preserve, Discover, Access, Share; Research, Teaching, Learning]
Landscape
Ever increasing number, size, and diversity of content
– More stuff, fewer resources
Ever increasing diversity of partners, stakeholders, and expectations
– Producers / consumers → prosumers / conducers
Inevitability of disruptive change
– Technology
– User expectation
– Institutional mission and resources
Problem or opportunity?
[Figure: chart labeled $, Work, and Time]
Assumptions
Curated content gains
– Safety through redundancy: “Lots of copies keeps stuff safe”
– Meaning through context: “Lots of description keeps stuff meaningful”
– Utility through service: “Lots of services keeps stuff useful”
– Value through use: “Lots of uses keeps stuff valuable”
Curation is an outcome, not a place
– Decentralized curation can be as effective as centralized
Curation stewardship is a relay
Imperatives
Provide innovative, effective, and efficient services
Plan for change
– Focus on content, not the systems in which that content is managed
   Systems come and go (but not our system ;-)
– Occam’s Razor and Murphy’s Law suggest
   Favor the small and simple over the large and complex
   Favor the minimally sufficient over the feature laden
   Favor the configurable over the prescribed
   Favor the proven over the (merely) novel
Enable curation at the point of use
Do more with less
Curation micro-services
Devolve curation function into a granular set of independent, but interoperable micro-services
– Since each is small and self-contained, they are collectively easier to develop, maintain, and enhance
– Since the level of investment in, and therefore commitment to, any given service is small, they are easier to replace when they have outlived their usefulness
– The scope of each service is limited, but complex behavior emerges from the strategic composition of individual atomistic services
Merritt curation micro-services
[Figure: UC3 Merritt curation micro-services, grouped by layer; the figure also carries the labels Curation and Preservation]
Value
– Annotation of content by consumers
– Notification of new content availability
– Transformation to create derivatives
Utility
– Search of content and metadata
– Index to enable fast search
Context
– Ingest of content for curation
– Characterization to extract content properties
– Inventory of curated content
State
– Replication for safety
– Fixity to verify bit-level integrity
– Storage for long-term retention
– Identity for long-term reference
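To make the Fixity service's role (verifying bit-level integrity) concrete, here is a minimal Python sketch. The function names and the choice of SHA-256 are assumptions made for illustration; the slides do not say how Merritt computes or records digests.

```python
import hashlib
from pathlib import Path


def compute_digest(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded_digest: str) -> bool:
    """True when the file's current digest still matches the recorded one."""
    return compute_digest(path) == recorded_digest
```

A digest recorded at ingest time can be re-checked on a schedule; a mismatch signals that the stored bit stream has changed.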
What is the future of the DPR?
The DPR will continue to be operated as a core UC3 service
– However, the components of the underlying system will be gradually replaced with their new Merritt-based equivalents
– All content currently managed in the DPR will be automatically migrated to the new environment
Micro-services also can be used to deploy locally-hosted repositories to meet specialized local needs
What is the future of the DPR?
Continuing stewardship commitment by UC3 regarding managed content
– Safety, persistence, efficiency, economy
Streamlined workflows for submission, access, and collection management
– Easy in, easy out
Minimal technical requirements for contribution
Great flexibility in deploying customized repository solutions
Design goals
Policy neutral, protocol and platform independent
– We know we can’t foresee all of the contexts in which these services can be usefully deployed
Principle of least surprise
– Extensive options, but meaningful default behavior
Linked data
– All entities exist within a web of semantic relations
   http://linkeddata.org/
The file system is the database
– All content and metadata are expressed in the file system
– Some subset of this information may be replicated in databases as an optimization for fast query
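The "file system is the database" goal can be sketched in Python under stated assumptions: each object is a directory holding a small metadata file, and a SQLite table is derived from those files purely as a query optimization. The directory layout, the file name `metadata.json`, and the function `build_index` are hypothetical and are not taken from the Merritt specifications.

```python
import json
import sqlite3
from pathlib import Path


def build_index(root: Path) -> sqlite3.Connection:
    """Walk the file system (the authoritative store) and derive a query index."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE objects (id TEXT PRIMARY KEY, title TEXT)")
    # Hypothetical layout: <root>/<object-id>/metadata.json
    for meta_path in sorted(root.glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        db.execute(
            "INSERT INTO objects (id, title) VALUES (?, ?)",
            (meta_path.parent.name, meta.get("title", "")),
        )
    db.commit()
    return db
```

Because the index is derived, it can be discarded and rebuilt at any time by walking the file system again; nothing is lost if the database disappears.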
Design goals
Code to interfaces
– Underlying implementations should and will evolve over time without invalidating the public interface “contract”
Exploit agile methods
– Early prototyping, frequent refactoring
– Stakeholder engagement
The appropriate benchmark for submission user experience is Flickr
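The "code to interfaces" goal above can be illustrated with a small Python sketch. The `IdentifierMinter` interface and both implementations are hypothetical; they only show how a caller written against a contract is unaffected when the implementation behind it changes.

```python
import itertools
import uuid
from abc import ABC, abstractmethod
from typing import List


class IdentifierMinter(ABC):
    """Hypothetical public contract: callers depend only on mint()."""

    @abstractmethod
    def mint(self) -> str:
        """Return a new, unique identifier."""


class SequentialMinter(IdentifierMinter):
    """One implementation: a prefix plus a running counter."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._counter = itertools.count(1)

    def mint(self) -> str:
        return f"{self._prefix}{next(self._counter)}"


class UuidMinter(IdentifierMinter):
    """A replacement implementation; callers need no changes."""

    def mint(self) -> str:
        return f"urn:uuid:{uuid.uuid4()}"


def register_objects(minter: IdentifierMinter, count: int) -> List[str]:
    """Caller code written against the interface, not an implementation."""
    return [minter.mint() for _ in range(count)]
```

Swapping `SequentialMinter` for `UuidMinter` changes the identifiers produced but not the calling code, which is the property the interface "contract" is meant to protect.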
Storage concepts
Node
– A sub-domain of the Storage service established to meet specific policy, administrative, or technical needs
Object
– Encapsulation in digital form of an abstract intellectual or aesthetic work
Version
– A set of files representing a discrete state of the object
– Any change to object state constitutes a new version
File
– A formatted bit stream
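The four storage concepts can be modeled as a minimal Python sketch. The class and method names are assumptions for illustration, not the Storage service's actual data model; the point is the rule that any change to object state yields a new version while earlier versions stay intact.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class File:
    name: str
    content: bytes  # a formatted bit stream


@dataclass(frozen=True)
class Version:
    number: int
    files: Dict[str, File]  # the complete state of the object at this version


@dataclass
class StoredObject:
    identifier: str
    versions: List[Version] = field(default_factory=list)

    def update(self, changes: Dict[str, bytes]) -> Version:
        """Apply changes on top of the current state as a new version."""
        current = dict(self.versions[-1].files) if self.versions else {}
        for name, content in changes.items():
            current[name] = File(name, content)
        version = Version(len(self.versions) + 1, current)
        self.versions.append(version)
        return version


@dataclass
class Node:
    name: str
    objects: Dict[str, StoredObject] = field(default_factory=dict)
```

Each call to `update` appends a version rather than modifying an existing one, so every earlier state of the object remains retrievable.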
Storage concepts
Stable reference
– All objects (and their versions, and their files) managed in the Storage service have stable URLs that can be used to retrieve entities or metadata about entities, subject to appropriate access control
http://example-store.edu/content/abc/1234
http://example-store.edu/content/abc/1234/3
http://example-store.edu/state/abc/1234/3/xyz
[Figure: components of a stable URL, labeled Storage service, Request type, Storage node, Object, Version, File]
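The example URLs above can be assembled with a short Python sketch. It assumes the path segments appear in the order suggested by the figure labels (request type, storage node, object, version, file); the function name `stable_url` is hypothetical, and `example-store.edu` is the placeholder host used on the slide.

```python
from typing import Optional
from urllib.parse import quote

BASE_URL = "http://example-store.edu"  # placeholder host from the slide


def stable_url(
    request_type: str,
    node: str,
    object_id: str,
    version: Optional[int] = None,
    file_name: Optional[str] = None,
) -> str:
    """Build a URL for an object, one of its versions, or one of its files."""
    if file_name is not None and version is None:
        raise ValueError("a file reference requires a version")
    parts = [request_type, node, object_id]
    if version is not None:
        parts.append(str(version))
    if file_name is not None:
        parts.append(file_name)
    # Percent-encode each segment so identifiers cannot alter the path shape.
    return BASE_URL + "/" + "/".join(quote(part, safe="") for part in parts)
```

For instance, `stable_url("state", "abc", "1234", 3, "xyz")` reproduces the third example URL on the slide.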
Ingest concepts
Queue
– Asynchronous processing of submitted material
Batch
– A set of digital objects submitted together
– The unit of notification and reporting
Job
– The processing of a single digital object
Handler
– A specific processing stage
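The relationship between batches, jobs, and handlers can be sketched in Python as follows. All names here (`Job`, `run_job`, `run_batch`, and the two sample handlers) are hypothetical, and the asynchronous queue is deliberately left out to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A handler is one processing stage that acts on a job's working state.
Handler = Callable[[Dict[str, object]], None]


@dataclass
class Job:
    """The processing of a single digital object."""
    object_name: str
    state: Dict[str, object] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


def run_job(job: Job, handlers: List[Handler]) -> Job:
    """Pass one object through each handler in order."""
    for handler in handlers:
        handler(job.state)
        job.completed.append(handler.__name__)
    return job


def run_batch(object_names: List[str], handlers: List[Handler]) -> Dict[str, object]:
    """A batch is a set of objects submitted together and reported on as a unit."""
    jobs = [run_job(Job(name), handlers) for name in object_names]
    return {"batch_size": len(jobs), "jobs": jobs}


def characterize(state: Dict[str, object]) -> None:
    state["characterized"] = True  # stand-in for real characterization


def assign_title(state: Dict[str, object]) -> None:
    state.setdefault("title", "untitled")
```

Keeping handlers as plain callables means a processing profile can be expressed as an ordered list of stages.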
Ingest concepts
Profile
– A user-specific set of processing choices
– Negotiated as part of the submission agreement
Notification
– At the time of ingest submission and completion
– Our stewardship obligation begins at the time of ingest completion
Submit by-value (a file) or by-reference (a URL)
Ingest process flow
[Figure: sequence diagram. Participants: Submitting library, Ingest, Identity, Storage (with three Nodes), Inventory. Messages: Submit; Create identifier / Identifier; Add version; Get version metadata / Version metadata; Notification]
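The message names in the process flow can be strung together in a small Python sketch with stub services. The exact ordering, and which service calls which, cannot be recovered from the flattened diagram, so the sequence below is one plausible reading; the class names, method signatures, and identifier scheme are all assumptions.

```python
from typing import Callable, Dict, List


class IdentityService:
    def __init__(self) -> None:
        self._count = 0

    def create_identifier(self) -> str:
        self._count += 1
        return f"example:{self._count}"  # hypothetical identifier scheme


class StorageService:
    def __init__(self) -> None:
        self._objects: Dict[str, List[Dict[str, bytes]]] = {}

    def add_version(self, identifier: str, files: Dict[str, bytes]) -> int:
        versions = self._objects.setdefault(identifier, [])
        versions.append(dict(files))
        return len(versions)  # version numbers start at 1

    def get_version_metadata(self, identifier: str, version: int) -> Dict[str, object]:
        files = self._objects[identifier][version - 1]
        return {"identifier": identifier, "version": version, "files": sorted(files)}


class InventoryService:
    def __init__(self) -> None:
        self.records: List[Dict[str, object]] = []

    def add_version(self, metadata: Dict[str, object]) -> None:
        self.records.append(metadata)


def ingest(files: Dict[str, bytes], identity: IdentityService,
           storage: StorageService, inventory: InventoryService,
           notify: Callable[[str], None]) -> Dict[str, object]:
    """Submit, create identifier, add version, record metadata, then notify."""
    identifier = identity.create_identifier()
    version = storage.add_version(identifier, files)
    metadata = storage.get_version_metadata(identifier, version)
    inventory.add_version(metadata)
    notify(f"ingest complete: {identifier} version {version}")
    return metadata
```

The Ingest function is the only place that knows about all three services, which keeps Identity, Storage, and Inventory independent of one another.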
Ingest implementation
[Figure: implementation diagram. Components: Submitting library, Submitter, Queue, Consumer, Ingester, Storage. Annotations: HTML form; Servlet, implicitly multi-threaded (on two components); Dæmon, explicitly multi-threaded; ZooKeeper dæmon; Job metadata; Job payload; Submission notification; Ingest notification; Batch or single object]
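The submitter / queue / consumer arrangement can be approximated in Python as below. This is a sketch only: the standard library's in-process `queue.Queue` stands in for the ZooKeeper-backed queue named on the slide, no ZooKeeper API is used, and the function names are hypothetical.

```python
import queue
import threading
from typing import List


def consumer(jobs: "queue.Queue", results: List[str], lock: threading.Lock) -> None:
    """Explicitly threaded worker: take jobs until a sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work for this worker
            break
        with lock:
            results.append(f"ingested {job}")  # stand-in for the real ingest


def run(object_names: List[str], worker_count: int = 3) -> List[str]:
    jobs: "queue.Queue" = queue.Queue()
    results: List[str] = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=consumer, args=(jobs, results, lock))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for name in object_names:  # submitter side: enqueue and return immediately
        jobs.put(name)
    for _ in workers:  # one sentinel per worker
        jobs.put(None)
    for worker in workers:
        worker.join()
    return sorted(results)
```

Decoupling submission from processing through a queue lets the submitter acknowledge a batch right away while the consumers work through it at their own pace.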
Demonstration
A few caveats…
– Still a work in progress!
– The final interface style sheets are not yet applied
– Inventory and authentication/authorization services still under development
– Full error reporting is not complete
Development roadmap
First wave: Identity, Storage
Second wave: Inventory, Ingest
Third wave: Index, Fixity
Fourth wave: Search, Replication
Fifth wave: Notification, Characterization
Sixth wave: Annotation, Transformation
Object / collection modeling
Metadata standards
Authentication / authorization
Semantic interoperability
Policy / business model development
Early community reaction
Collaborative development and integration projects with UC3 partners
Independent implementation of key Merritt specifications
Presentation/BOF at Open Repositories 2010
Digital curation group and Barcamp
http://groups.google.com/group/digital-curation
http://groups.google.com/group/digital-curation/web/curation-technology-sig
Discussion
Will existing workflows continue to work?
– Yes, we have a crosswalk from the existing METS-based feeder submission
What are the minimal requirements for an acceptable digital object?
– A per-object METS file is no longer required
– The DPR will accept any content in any form
   However, the long-term curation service level may vary depending on the object’s formal characteristics, the presence (or absence) of accompanying metadata, the general state of curation understanding, and the availability of appropriate tools
Discussion
How do I include metadata in my submission?
– The Ingest submission form provides an opportunity to specify descriptive Dublin Kernel metadata
– Administrative metadata is implied by the user’s profile
   Name, affiliation, contact information, collection, …
– Technical (and, potentially, descriptive) metadata is automatically extracted by the characterization handler
– Additional metadata can be expressed in recognized schemas and stored in files with well-known names
   mrt-dublin-core.txt
   mrt-mods.xml
   mrt-creative-commons.rdf
   …
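The well-known file names above suggest a simple way to separate metadata from content in a submission, sketched here in Python. The function `split_submission` and the schema labels are assumptions; only the three file names come from the slide, whose trailing "…" indicates that the real list is longer.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# File names listed on the slide; the schema labels are inferred from the names.
WELL_KNOWN_METADATA = {
    "mrt-dublin-core.txt": "Dublin Core",
    "mrt-mods.xml": "MODS",
    "mrt-creative-commons.rdf": "Creative Commons",
}


def split_submission(directory: Path) -> Tuple[Dict[str, str], List[str]]:
    """Return (schema label -> metadata file name, list of content file names)."""
    metadata: Dict[str, str] = {}
    content: List[str] = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        if path.name in WELL_KNOWN_METADATA:
            metadata[WELL_KNOWN_METADATA[path.name]] = path.name
        else:
            content.append(path.name)
    return metadata, content
```

A submitter needs no special packaging under this scheme: dropping a recognized file next to the content is enough for it to be picked up as metadata.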
Discussion
Isn’t an enterprise storage solution or RDBMS (e.g. Oracle) better than just relying on the file system?
– No, we believe that there are a number of important advantages to directly exploiting the file system
   No vendor lock-in; proprietary systems are difficult to debug
   Modern file systems have excellent scaling characteristics
   The ability to re-instantiate the system by walking the file system is significant
Discussion
Why is there a separate Ingest service? Why can’t I just submit directly to the Storage service?
– Merritt embraces the “separation of concerns” principle
   http://en.wikipedia.org/wiki/Separation_of_concerns
   The Storage service only “knows” about storage and has strict requirements for the allowable form of submissions
   The Ingest service was explicitly designed for user-facing operation and imposes minimal constraints on submission forms
Discussion (questions for you)
What constitutes a “collection”?
– Does it have hierarchically-arranged sub-components?
What tools do you need to manage your collections effectively?
How do you expect to retrieve content from the repository?
– Following a saved link?
– Search query? If so, what would be the query terms?
Discussion (questions for you)
What level of access control is necessary?
– Bright vs. dark policy
– Embargo periods
– Redaction
Who are the subject populations?
– UC affiliates
– Non-UC
How fine-grained must this control be?
– Collection or object
– Campus, research group, user
Discussion (questions for you)
Are there other repository tools or protocols that we should investigate?
Please respond to the DPR survey at
http://vovici.com/wsb.dll/s/aaeg44ec2
For more information
UC Curation Center
http://www.cdlib.org/services/uc3
Curation micro-services
https://confluence.ucop.edu/display/Curation
DPR survey
http://vovici.com/wsb.dll/s/aaeg44ec2
Digital curation group and Barcamp
http://groups.google.com/group/digital-curation
http://groups.google.com/group/digital-curation/web/curation-technology-sig
UC3: Stephen Abrams, Erik Hetzner, Margaret Low, Mark Reyes, Perry Willett, Patricia Cruse, Greg Janée, John Kunze, Tracy Seneca, Scott Fisher, David Loy, Isaac Rabinovitch, Marisa Strong