Metadata for digital long-term preservation

Michael Day, Digital Curation Centre, UKOLN, University of Bath

MPG eScience Seminar 2008: Aspects of long-term archiving, GWDG Göttingen, 19-20 June 2008

http://www.ukoln.ac.uk/

Presentation outline:

• Some definitions

• An abstract approach: OAIS

• A framework for practical implementation: the PREMIS Data Dictionary

• Some open questions for e-research data

Definitions (1)

• Metadata:
  – A relatively new term that is used to describe a very old concept
  – We primarily need to think about the different functions it enables, e.g. discovery and access management, the management of resources, long-term preservation, etc.

Definitions (2)

• Preservation metadata:
  – “... the information a repository uses to support the digital preservation process” (PREMIS Data Dictionary)
  – Potentially very wide scope:
    • Technical information on data structures or formats
    • Information to help better understand the content
    • Information on contexts and provenance
    • Information on preservation processes

Definitions (3)

• Metadata for research data:
  – Metadata are fundamentally important to the continued understanding and exploitation of research data
    • It is “impossible to conduct a correct analysis of a data set without knowing how the data was cleaned, calibrated, what parameters were used in the process” (Deelman et al., 2004)
    • In some cases, extremely detailed documentation will be required
    • Metadata need to be captured at various stages of the data lifecycle

The OAIS Information Model (1)

• General OAIS background:
  – An ISO standard (ISO 14721:2003)
  – Development led by the Consultative Committee for Space Data Systems
  – Provides standard terminology and defines two interrelated models (functional model, information model)

The OAIS Information Model (2)

• Some general principles:
  – OAIS entities (Data Objects and Content Information) are conceptually bound together with information that provides additional meaning
  – There are two main classes of this:
    • Representation Information
    • Preservation Description Information

The OAIS Information Model (3)

• Representation Information:
  – Is tightly bound with the Data Object
  – Provides a bridge between the bit-level information being stored in an OAIS and something that can be understood
  – Describes data structure concepts or formats (Structure Information)
  – Provides additional information on semantics (Semantic Information)
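
The bridging role of Representation Information can be illustrated with a small Python sketch. The byte layout, field names and units below are invented for illustration only; they are not taken from OAIS or from any real data format.

```python
import struct

# Hypothetical Data Object: raw bytes as they might be stored in an archive.
data_object = struct.pack("<Hf", 42, 21.5)

# Structure Information (assumed): how the bits map onto data types.
structure_info = {
    "byte_order": "<",  # little-endian, no padding
    "fields": [("station_id", "H"), ("temperature", "f")],
}

# Semantic Information (assumed): what the decoded values mean.
semantic_info = {
    "station_id": "identifier of the recording station",
    "temperature": "air temperature in degrees Celsius",
}


def interpret(raw, structure, semantics):
    """Decode raw bytes with Structure Information, then attach meaning."""
    fmt = structure["byte_order"] + "".join(code for _, code in structure["fields"])
    values = struct.unpack(fmt, raw)
    return {
        name: {"value": value, "meaning": semantics[name]}
        for (name, _), value in zip(structure["fields"], values)
    }


record = interpret(data_object, structure_info, semantic_info)
```

Without `structure_info` the six stored bytes cannot be decoded, and without `semantic_info` the decoded numbers cannot be understood, which is the sense in which both are bound to the Data Object.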

The OAIS Information Model (4)

• Preservation Description Information:
  – The additional information “needed to make the Content Information meaningful for the indefinite long-term” (p. 4-33)
    • For example, the information “needed to preserve the Content Information, to ensure that it is clearly identified, and to understand the environment in which the Content Information was created” (p. 2-6)
    • Reference, Context, Provenance, Fixity
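
As a rough illustration of the four categories, the following Python sketch groups them in one record and uses a checksum for Fixity. The class, its field values and the choice of SHA-256 are assumptions made for this example; OAIS does not prescribe a particular data structure or digest algorithm.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a hex string."""
    return hashlib.sha256(content).hexdigest()


@dataclass
class PreservationDescriptionInformation:
    reference: str    # how the content is identified
    context: str      # how the content relates to its environment
    provenance: list  # history of the content
    fixity: str       # digest recorded when the content was ingested


def fixity_intact(content: bytes, pdi: PreservationDescriptionInformation) -> bool:
    """Check that the content still matches the digest recorded in the PDI."""
    return sha256_hex(content) == pdi.fixity


# Hypothetical content and description, for illustration only.
content = b"observation,value\n1,0.73\n"
pdi = PreservationDescriptionInformation(
    reference="example:dataset-001",
    context="part of a hypothetical field campaign",
    provenance=["exported from instrument", "ingested into archive"],
    fixity=sha256_hex(content),
)
```

A later call to `fixity_intact(content, pdi)` returns `False` if even a single byte of the stored content has changed since ingest.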

The OAIS Information Model (5)

• Lessons from OAIS (1):
  – Data objects (and content) need to be closely coupled with additional layers of information (metadata) that will help provide meaning and context, etc.
  – These layers broadly reflect the main characteristics of digital information (physical, logical, intellectual)
  – Produces self-documenting objects

The OAIS Information Model (6)

• Lessons from OAIS (2):
  – It highlights the importance of preserving context and provenance (but these are quite vaguely defined)
  – OAIS works on an abstract level, but there is a need to think about what needs to be done in practical terms to develop preservation metadata schemata ...

PREMIS Data Dictionary (1)

• Background (1):
  – PREMIS Working Group (2003-2005)
  – An attempt to develop something that would be implementable
  – Development informed by OAIS model
  – Built upon several initiatives that had been developing preservation metadata schemas and frameworks prior to 2003
  – Data Dictionary first published in May 2005; v. 2.0 in March 2008

PREMIS Data Dictionary (2)

• Background (2):
  – PREMIS Maintenance Activity set up by Library of Congress
  – PREMIS Implementers Group (open discussion list)
  – Recent revision of PREMIS takes account of the experiences of implementers

PREMIS Data Dictionary (3)

• What PREMIS aims to do:
  – The Data Dictionary is specifically focused on defining the core metadata needed for long-term preservation
  – “... the information a repository uses to support the digital preservation process”
  – Related to a series of verbs:
    • “... functions to maintain viability, renderability, understandability, authenticity, and identity in a preservation context”
  – Based on a data model

PREMIS Data Dictionary (4)

• PREMIS Data Model:
  – Recognises that digital preservation is as much about describing processes as about describing objects
  – Five entities
    • Intellectual Entities
    • Objects
    • Events
    • Agents
    • Rights
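
One way to picture how the five entities relate is the Python sketch below. It is a deliberately simplified illustration: the class and field names are invented for this example and are not the semantic units defined in the PREMIS Data Dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    identifier: str
    name: str


@dataclass
class Event:
    identifier: str
    event_type: str
    date_time: str
    agent: Agent  # who or what carried out the event


@dataclass
class Rights:
    identifier: str
    statement: str


@dataclass
class DigitalObject:
    identifier: str
    format_name: str
    events: list = field(default_factory=list)
    rights: list = field(default_factory=list)


@dataclass
class IntellectualEntity:
    identifier: str
    title: str
    objects: list = field(default_factory=list)


def record_event(obj, event_type, date_time, agent):
    """Attach a new event to an object, recording the process as well as the thing."""
    identifier = f"{obj.identifier}/event-{len(obj.events) + 1}"
    event = Event(identifier, event_type, date_time, agent)
    obj.events.append(event)
    return event


# Hypothetical usage: one object with two recorded events.
tool = Agent("agent-1", "hypothetical migration tool")
image = DigitalObject("obj-1", "TIFF")
report = IntellectualEntity("ie-1", "Survey report", [image])
record_event(image, "ingestion", "2008-06-19T10:00:00", tool)
record_event(image, "format migration", "2008-06-20T09:30:00", tool)
```

The point of the sketch is that the history of `image` lives in its linked events and agents rather than in the object description alone.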

PREMIS Data Dictionary (5)

[Figure: PREMIS 2.0 Data Model, showing the five entities: Intellectual Entities, Objects, Events, Rights, Agents]

PREMIS Data Dictionary (6)

• PREMIS usage (1):
  – Survey undertaken for PREMIS Maintenance Activity (2007)
    • 16 repositories and projects surveyed (mostly dealing with documents rather than data)
    • Survey noted much diversity in the way PREMIS had been implemented
    • Tools were being used to capture technical metadata automatically
    • Formats could be identified using tools like JHOVE and PRONOM DROID
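
The general idea behind automatic format identification can be sketched in a few lines of Python. This is a toy illustration of matching file signatures ("magic numbers") and is not how JHOVE or DROID are invoked; the function name and the short signature table are assumptions made for the example.

```python
# A few well-known leading byte signatures and the formats they indicate.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]


def identify_format(header: bytes) -> str:
    """Return a format name based on the leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


# Hypothetical usage with the first bytes of three files.
examples = {
    "report": identify_format(b"%PDF-1.4\n"),
    "figure": identify_format(b"\x89PNG\r\n\x1a\n\x00\x00"),
    "notes": identify_format(b"plain text"),
}
```

Real identification tools go considerably further, for example by validating the internal structure of a file, but the captured result (a format name recorded as technical metadata) is of the same kind.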

PREMIS Data Dictionary (7)

• PREMIS usage (2):
  – No major eScience input into PREMIS
  – PREMIS is occasionally used to help inform the preservation of research data:
    • The National Snow and Ice Data Center has used PREMIS as a way of evaluating its own OAIS-inspired metadata schema
    • The Stanford Digital Repository has experimented with using PREMIS for geospatial resources
    • Experiments with the Yale Social Science Data Archive

PREMIS Data Dictionary (8)

• Lessons from PREMIS:
  – The Data Model demonstrates the importance of recording the contexts of preservation (events, agents), not just metadata on the objects
  – Currently little used in the e-research domain, but it has some potential where structured metadata already exists in some form (e.g., CSDGM, DDI)

Implications for e-research (1)

• The role of standards
  – The development of standards (e.g. PREMIS) assumes that there is some level of commonality between domains
  – However, generic solutions are not really feasible for e-research data because of the diversity and complexity of:
    • Research data (content)
    • Research contexts
    • Stakeholders

Diversity and complexity (1)

• Diversity of content (1):
  – Research data is “... any information that can be stored in digital form, including text, numbers, images, video or movies, audio, software, algorithms, equations, animations, models, simulations, etc.” (National Science Board, Long-lived digital data collections, 2005)

Diversity and complexity (2)

• Diversity of content (2):
  – Research data is extremely diverse - not really a single category of material
    • tabular data, images, GIS, etc.
    • raw machine output vs. derived data
    • varying levels of structure (XML, legacy formats, etc.)
    • many different standards
  – Research data is not homogeneous
  – No one-size-fits-all approach possible

Diversity and complexity (3)

• There is an even wider range of social contexts in which data is used (and shared)
  – DCC SCARP project has been exploring disciplinary factors in curation practice
    • Practice even within single disciplines is very fragmented
    • Case studies ongoing
      – Big-science archives, medical and social sciences, architecture and engineering, biological images

Diversity and complexity (4)

• Major disciplinary differences:
  – Attitudes towards data sharing
    • Some are very open, some cannot see the point
  – Existence of data centre infrastructures
    • The UK has some centrally funded data centres, but coverage is not universal
    • Where do institutions fit?
  – The existence of standards
    • Already present in social sciences (DDI), the geospatial domain (FGDC), and many others

Diversity and complexity (5)

• Diversity of stakeholders:
  – The many different actors with an interest in data curation mean that metadata requirements may differ
  – Dealing with data (2007): Scientist, Institution, Data centre, User, Funder, Publisher
  – Long-lived data collections (2005): Data authors, Data managers, Data scientists, Data users, Funding agencies

Implications for e-research (2)

• Metadata for digital curation or for long-term preservation?
  – The concept of digital curation focuses on reuse and adding value - long-term preservation is not always the aim
  – PREMIS metadata is focused on particular things (viability, renderability, understandability, authenticity and integrity)
  – What metadata do we need for digital curation? Could this ever be generic?

Implications for e-research (3)

• Metadata can be difficult to identify
  – It is sometimes difficult to work out where data ends and metadata begins
  – Depends on the point of view of the researcher

Implications for e-research (4)

• Lifecycle view
  – Metadata has to be captured at multiple places in the scientific workflow
  – Need to capture:
    • Processes (can be driven by instrumentation)
    • Provenance
    • Context
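
Capturing process metadata at the point where each workflow step runs might look something like the following Python sketch. The decorator, the step functions and the in-memory log are all hypothetical names chosen for illustration; a real workflow system would record far more (timestamps, software versions, instrument settings).

```python
import functools

# In-memory provenance log; a real system would persist this with the data.
provenance_log = []


def record_step(func):
    """Log the name and parameters of each processing step as it runs."""
    @functools.wraps(func)
    def wrapper(data, **params):
        result = func(data, **params)
        provenance_log.append({"step": func.__name__, "parameters": dict(params)})
        return result
    return wrapper


@record_step
def clean(data, *, max_value):
    # Drop readings above a plausibility threshold.
    return [x for x in data if x <= max_value]


@record_step
def calibrate(data, *, offset):
    # Apply a constant calibration offset.
    return [x + offset for x in data]


raw = [1.0, 2.0, 99.0]
processed = calibrate(clean(raw, max_value=10.0), offset=0.5)
```

After this run, `provenance_log` records that the data was cleaned with `max_value=10.0` and then calibrated with `offset=0.5`, which is the kind of information a later analyst needs in order to interpret `processed` correctly.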

Implications for e-research (5)

• Big science, little science:
  – Big science is by its nature data driven, and will often develop appropriate frameworks for its management and reuse (data centres, data grids)
  – Other scientific domains (e.g., ecology, biodiversity, chemistry) are moving in the same direction, but data retain a high level of diversity and complexity

Summing-up

• The OAIS Information Model provides an abstract framework for thinking about preservation metadata

• PREMIS provides an implementation framework that is beginning to be adopted in some domains

• There are still many unresolved questions when it comes to defining metadata for research data

Acknowledgements

The Digital Curation Centre is funded by the JISC and the UK Research Councils' e-Science Core Programme.

http://www.dcc.ac.uk/

UKOLN is funded by the Museums, Libraries and Archives Council, the Joint Information Systems Committee (JISC) of the UK higher and further education funding councils, as well as by project funding from the JISC, the European Union, and other sources. UKOLN also receives support from the University of Bath, where it is based.

http://www.ukoln.ac.uk/