e-IRG Open Workshop on e-Infrastructures 4-5 Oct 2006
CASPAR Project
Digital Preservation and Digital Interoperability
Outline
• Unfamiliar Data
• Usability
• Link to Preservation
• OAIS Reference Model
• OAIS Information Model
• Representation Information
• Preservation and Virtualisation
• CASPAR project
Unfamiliar Data
• E-Research/e-Infrastructures allow users to find and try to use data from many sources
• Some familiar sources
• Most available sources will be unfamiliar
• How can one be sure that the unfamiliar data is used correctly?
• Garbage in, garbage out principle
• Various horror stories
Usability
• Ability for the user to “do something” with the bits
• Preferably using software
– Even better if software does not have to be specially written
• Better still if the user does not have to guess what to do or trawl around looking for documentation
• Could use existing software to display and process, but how do we prevent nonsense being produced accidentally?
Link to Preservation
• An archive is just another remote source of digitally encoded information
– Preserved digital data was created some time ago, possibly a considerable time ago (decades)
• Digital Preservation can mean many things
• Simplest type is just keeping the “bits” and making sure they are available
• A more useful definition comes from OAIS
OAIS Reference Model
• ISO 14721: Reference Model for an Open Archival Information System (OAIS).
• An OAIS is an archive, consisting of an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.
• Long Term Preservation: The act of maintaining information, in a correct and Independently Understandable form, over the Long Term.
• Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community.
• Designated Community: An identified group of potential Consumers who should be able to understand a particular set of information. The Designated Community may be composed of multiple user communities.
• Independently Understandable: Has sufficient documentation to allow the information to be understood and used by the Designated Community without having to resort to special resources not widely available, including named individuals.
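Information Objects
[Figure: OAIS Information Object diagram. An Information Object is a Data Object “interpreted using” 1+ Representation Information. Representation Information is itself “interpreted using” further Representation Information. A Data Object is either a Physical Object or a Digital Object; a Digital Object is made up of 1+ Bit Sequences.]
• Recursion ends at KNOWLEDGE BASE (of whom?) (tacit knowledge)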
Representation Information
• The Data Object is “interpreted using” the Representation Information (RepInfo)
• The Reference Model is designed to ensure that an OAIS is not set the impossible task of having to provide all possible RepInfo immediately
• Hence:
– Take account of the Designated Community and its associated Knowledge Base
• Note that RepInfo may itself need further RepInfo
• NB very important for CERTIFICATION
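The idea that RepInfo recursion stops at the Designated Community's Knowledge Base can be sketched in a few lines of Python. This is an illustration only, not part of OAIS or CASPAR: the `RepInfo` class, the `missing_repinfo` function and the FITS-flavoured example net are all assumptions made up for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class RepInfo:
    """One node in a Representation Net (illustrative, not an OAIS API)."""
    name: str
    needs: list["RepInfo"] = field(default_factory=list)


def missing_repinfo(item: RepInfo, knowledge_base: set[str]) -> list[str]:
    """List the RepInfo an archive would have to supply, depth-first.

    Recursion stops at anything already in the community's Knowledge Base.
    """
    supplied: list[str] = []
    seen: set[str] = set()

    def visit(node: RepInfo) -> None:
        if node.name in knowledge_base or node.name in seen:
            return
        seen.add(node.name)
        supplied.append(node.name)
        for child in node.needs:
            visit(child)

    visit(item)
    return supplied


# Hypothetical net: names and dependencies are invented for the example.
english = RepInfo("English")
ascii_text = RepInfo("ASCII")
fits_standard = RepInfo("FITS standard", [ascii_text, english])
fits_dictionary = RepInfo("FITS keyword dictionary", [english])
fits_file_info = RepInfo("FITS file description", [fits_standard, fits_dictionary])

astronomers = {"FITS standard", "ASCII", "English"}
general_public = {"ASCII", "English"}

print(missing_repinfo(fits_file_info, astronomers))
print(missing_repinfo(fits_file_info, general_public))
```

The same data object needs less RepInfo for the astronomers than for the general public, which is why the choice of Designated Community matters when an archive is assessed.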
Representation Information
• The Representation Information accompanying a physical object, like a moon rock, may give additional meaning
– It typically is a result of some analysis of the physically observable attributes of the rock
• The Representation Information accompanying a digital object, or sequence of bits, is used to provide additional meaning
– It typically maps the bits into commonly recognized data types such as character, integer, and real and into groups of these data types
– It associates these with higher level meanings which can have complex inter-relationships that are also described
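As a rough illustration of "mapping the bits into data types", the sketch below uses Python's standard `struct` module. The record layout, field names and units are hypothetical; in practice they would come from the structure RepInfo for the format in question.

```python
import struct

# Hypothetical structure description: (field name, struct code, meaning).
# The layout is assumed to be little-endian with no padding ("<").
RECORD_LAYOUT = [
    ("station", "4s", "station code (ASCII characters)"),
    ("count", "i", "number of samples (32-bit integer)"),
    ("mean_temp", "d", "mean temperature in kelvin (64-bit real)"),
]


def interpret(bits: bytes) -> dict:
    """Map a raw byte sequence to named, typed values using RECORD_LAYOUT."""
    fmt = "<" + "".join(code for _, code, _ in RECORD_LAYOUT)
    values = struct.unpack(fmt, bits)
    record = {}
    for (name, code, _), value in zip(RECORD_LAYOUT, values):
        # Character fields come back as bytes; decode them to text.
        record[name] = value.decode("ascii") if code.endswith("s") else value
    return record


# Build a sample "bag of bits" and interpret it.
raw = struct.pack("<4sid", b"RAL1", 12, 281.5)
print(interpret(raw))
```

Without the layout table the 16 bytes are uninterpretable; the third column of the table stands in for the higher level meaning that semantic RepInfo would carry.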
Designated Community
• General English reading public educated to High School and above, with access to a Web Browser (HTML 4.0 capable)
• GIS data: GIS researchers, undergraduates and above, having an understanding of the concepts of Geographic data; having access to current (2005, USA) GIS tools/computer software e.g. ArcInfo (2005)
• Astronomer (undergraduate and above) with access to FITS software such as FITSIO, familiar with astronomical spectrographic instruments
• Student of Middle English with an understanding of TEI encoding and access to an XML rendering environment
– Variant 1: Cannot understand TEI
– Variant 2: Cannot understand TEI and no access to XML rendering environment
– Variant 3: No understanding of Middle English but does understand TEI and XML
Rep.Info. Classification
Structure
• Distinguish:
– formats which are used mainly for rendering, to be followed by human inspection, and
– formats used for automated processing, particularly important for science data
• Distinguish:
– Things with unknown structure: needs software
  • proprietary software e.g. MS Word
  • Open Source software e.g. CDF
– Things with known/well described structure
  • ASCII file, FITS file, telemetry etc
– Document the format
– Use description language if possible e.g. EAST, DFDL
– The EAST tools are themselves Representation Information which in due course will have to be fully defined; the closure of their Representation Nets will be the EAST standard
• Higher level definitions should include useful scientific objects and humanities objects
Layered Model from OAIS
Semantics
– Meaning/Relationships
• Data Dictionaries
• Thesauri
• Ontologies
• Semantic interoperability
Time Dependent Information
– Many, perhaps most, datasets change over time and the state at each particular moment in time may be important. It may be useful to break the issue into separate parts.
• At each moment in time we could, in principle, take a snapshot and store it. That snapshot has its associated Representation Net.
• Efficient storage of a series of snapshots may lead one to store differences or include time tags in the data
– Additional Representation Information would be needed which describes how to get to a particular time's snapshot from the efficiently encoded version.
– Also applies to ANNOTATION: who said what about which and when did they say it
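A minimal sketch of "getting to a particular time's snapshot from the efficiently encoded version", assuming the dataset is a flat key/value mapping stored as a base state plus time-tagged differences. The encoding (including the use of `None` to mark a deletion) is an assumption made for the example; the rule for replaying the differences is exactly the kind of thing the additional RepInfo would need to document.

```python
def snapshot_at(base: dict, deltas: list, when: int) -> dict:
    """Rebuild the dataset state at time `when`.

    `base` is the state at time 0. `deltas` is a time-ordered list of
    (timestamp, changes) pairs; a value of None in `changes` deletes the key.
    """
    state = dict(base)
    for timestamp, changes in deltas:
        if timestamp > when:
            break  # later differences do not affect this snapshot
        for key, value in changes.items():
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
    return state


# Hypothetical dataset history.
base = {"a": 1, "b": 2}
deltas = [(10, {"a": 5}), (20, {"b": None, "c": 7})]

print(snapshot_at(base, deltas, 0))
print(snapshot_at(base, deltas, 15))
print(snapshot_at(base, deltas, 25))
```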
Actions and Processes (Behaviour)
• Some information has, as an integral part of its content, an implicit or explicit process associated with it
– An example of this is a database or other time dependent or reactive system such as a Neural Net.
• Emulations
– Limited, but may be adequate for rendered document-type data
Sharing RepInfo
• RepInfo is needed
• RepInfo is extensive
• May need to “extend” RepInfo as the Designated Community and/or its Knowledge Base changes
• How can we avoid every Repository repeating the work?
– Need to control costs
• Need to share the effort
Requirements
• Data users: need to be able to obtain pre-identified RepInfo
• Curators: need to be able to find suitable pre-existing RepInfo to re-use
Or
• Create RepInfo
Registry for Representation Info
• The Digital Object could have RepInfo packed with it, as well as CPID
• Support automated access & processing
[Figure: a User, an Archive and a network of Rep. Info. Registries/Repositories. Digital Objects and Representation Information each carry a CPID.]
• 1: User gets data from archive. Data has associated Curation Persistent Identifier (CPID)
• 2: User unfamiliar with data so requests Rep.Info. using CPID
• 3: User receives Rep.Info., which has its own CPID in case it is not immediately usable
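The three steps can be sketched as a lookup loop: start from the CPID attached to the data, fetch the RepInfo it points to, and keep following CPIDs until the user reaches something already understood. The registry contents, the `cpid:` identifier syntax and the function name below are hypothetical; a real registry would be a network service rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryEntry:
    description: str
    repinfo_cpid: Optional[str]  # CPID of the RepInfo needed to use this entry


# Hypothetical in-memory registry standing in for the registry network.
REGISTRY = {
    "cpid:data-42": RegistryEntry("spectrum file", "cpid:fits-format"),
    "cpid:fits-format": RegistryEntry("FITS format description", "cpid:pdf-format"),
    "cpid:pdf-format": RegistryEntry("PDF format description", None),
}


def resolve_chain(cpid: str, understood: set) -> list:
    """Follow RepInfo CPIDs from a data object until the user can proceed.

    Returns the CPIDs of the RepInfo the user has to fetch, in order.
    """
    chain = []
    current = REGISTRY[cpid].repinfo_cpid
    # Stop at the end of the chain, at known RepInfo, or if a cycle is detected.
    while current is not None and current not in understood and current not in chain:
        chain.append(current)
        current = REGISTRY[current].repinfo_cpid
    return chain


print(resolve_chain("cpid:data-42", understood=set()))
print(resolve_chain("cpid:data-42", understood={"cpid:pdf-format"}))
```

A user who can already read the second-level RepInfo stops after one request; a user who cannot keeps following CPIDs, which is the case step 3 allows for.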
Use of RepInfo
[Figure: a Digital Object carries a CPID that points to a Label held in an External Registry; a copy of the Label also appears in the figure. The Label lists Structure = CPID, Semantics = CPID, Rendering s/w = CPID.]
• Each “bag of bits” has an associated pointer (CPID) to a Label
• DCC Label: points to other RepInfo
CASPAR – EU FP6
Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval
• Closely follows DCC Development ideas
• Approx 16 M Euro, 8.8 M from EU
• 17 Partners
• Led by CCLRC
– Co-ordinator: David Giaretta
See http://www.casparpreserves.eu
CASPAR Consortium
See http://www.casparpreserves.eu
CASPAR information flow architecture
[Figure: information flow diagram; the only surviving label is “Rep Info”.]
CASPAR Integrated architecture
See http://www.casparpreserves.eu
Possible Infrastructure Build-up
[Figure labels: European Preservation Infrastructure; Task Force on Permanent Access; Alliance; Other Alliance Members; CCLRC Curation Activities; CASPAR; Other CCLRC projects; FP7 projects.]
http://tfpa.kb.nl