Metadata for Digital Libraries: A Functional Approach
Sandra Payette
Digital Library Research Group
Cornell University
Cornell Digital Imaging Workshop
October 21, 1998
Metadata
CREATOR: Plato
TITLE: The Republic
Image 1 cdrom 1
Image 2 cdrom 1
Image 3 cdrom 2
Image File Storage
Metadata is structured data about data that imposes order on a disordered information universe.
Access Control List
Many Types of Metadata
• Descriptive
• Structural
• Terms and conditions
• Administrative
• Content ratings
• Provenance
• Relationship
Basic Functions We Must Support
• Resource Discovery
• Access and Use
• Preservation and Administration
Resource Discovery:
Focus on Descriptive Metadata
Metadata for Resource Discovery
• Catalogs
– OPAC / MARC Records
• Indexes
– Structured descriptive records (e.g., Dublin Core)
– Abstracts
– Full-text surrogates (e.g., via OCR)
Challenges
• Impracticality of large-scale traditional cataloging
– time consuming, labor intensive, special skills
– limited coverage - only “selected” items
• Problems with resource discovery
– full-text indexing ineffective (false hits, irrelevancy, overload)
– full-text approaches not useful for non-textual data (e.g., audio, video, executable programs)
One Solution: Simple Descriptive Surrogates
• Easy to create
• Applicable across domains
• Applicable to different genres of objects
• Allows interoperability among robots, indexers, and search clients
Dublin Core Element Set
• Good baseline descriptive record
• Can exist alongside other specialized metadata
• Common ground for discovery across disparate resources
• No specialized skills required
• Flexibility through qualifiers
Source: http://www.purl.org/Metadata/dublin_core/
Dublin Core : 15 Elements
• Title: name given to the work by the author
• Author or Creator: person(s) responsible for the intellectual content
• Subject and Keywords: the topic of the work, keywords, or formal classification schemes
• Description: textual description of the content (abstract, prose describing an image, etc.)
• Publisher: the organization making the work available in its present form
• Other Contributor: person(s) other than the author who have made significant contributions to the intellectual content
• Date: the date the work was made available
• Resource Type: category of the resource
• Format: data representation of the resource
• Resource Identifier: unique identification string (e.g., URL, URN, ISBN...)
• Source: object from which this object is derived (if applicable)
• Language: language of the intellectual content of the object
• Relation: relationship of the object to other objects or collections
• Coverage: spatial locations and temporal duration characteristics
• Rights Management: a pointer to a copyright notice, a rights management statement, or a rights server
Dublin Core in HTML META Tags
<html>
<head>
<title>Cornell Digital Library Research Group</title>
<META name="DC.subject" content="digital library research">
<META name="DC.subject" content="networked object description">
<META name="DC.publisher" content="Cornell University">
<META name="DC.creator" content="Lagoze, Carl, [email protected].">
<META name="DC.creator" content="Payette, Sandra, [email protected].">
<META name="DC.title" content="Cornell Digital Library Research Group">
<META name="DC.date" content="1998-05-15">
<META name="DC.form" scheme="IMT" content="text/html">
<META name="DC.language" scheme="ISO639" content="en">
<META name="DC.identifier" scheme="URL" content="http://www2.cs.cornell.edu/NCSTRL/CDLRG/cdlrg.htm">
</head>
<IMG SRC="/mydir/mysubdir/mypicture.gif" WIDTH=208 HEIGHT=216>
</html>
Source: http://www.w3.org/TR/REC-html40/
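A harvester or indexer could read embedded tags of this kind with very little machinery. The following is a minimal sketch in Python, not part of the original slides: the class name `DublinCoreMetaParser` and the function `extract_dublin_core` are hypothetical, and the sample page is a shortened copy of the markup above. It uses only the standard-library `html.parser` module.

```python
from html.parser import HTMLParser


class DublinCoreMetaParser(HTMLParser):
    """Collect <META name="DC.*" content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        # Maps element name (without the "DC." prefix) to a list of values,
        # because elements such as DC.subject may repeat.
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.upper().startswith("DC.") and content is not None:
            element = name[3:].lower()
            self.elements.setdefault(element, []).append(content)


def extract_dublin_core(html_text):
    """Return a dict of Dublin Core elements found in META tags."""
    parser = DublinCoreMetaParser()
    parser.feed(html_text)
    return parser.elements


sample = """<html><head>
<title>Cornell Digital Library Research Group</title>
<META name="DC.subject" content="digital library research">
<META name="DC.subject" content="networked object description">
<META name="DC.publisher" content="Cornell University">
<META name="DC.date" content="1998-05-15">
</head></html>"""

record = extract_dublin_core(sample)
print(record["subject"])
```

Repeated elements are kept as lists so that both DC.subject values survive; the scheme attribute is ignored here for brevity.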
Warwick Framework
• Developed by Dublin Core community
• Broader framework to accommodate diverse metadata schemes
• Encourages community-specific definition and administration of metadata
• Modularity supports interoperability among:
– content providers
– catalogers and indexers
– automated resource discovery systems
Warwick Framework Container
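[Figure: a Container holding three Packages: Dublin Core, Other Descriptive, and a Reference to MARC. A simple Package is a typed metadata set; the reference Package holds a URI that points to a separate Package containing the MARC Record.]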
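The container and package idea can be pictured as a small data structure. The sketch below is illustrative only and is not taken from the Warwick Framework documents: the class names `MetadataSet`, `IndirectPackage`, and `Container` are hypothetical, the Plato record echoes the earlier slide, and the MARC URI is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class MetadataSet:
    """Simple package: a typed metadata set held directly in the container."""
    metadata_type: str   # e.g. "Dublin Core"
    record: dict


@dataclass
class IndirectPackage:
    """Package that only references metadata stored elsewhere, by URI."""
    metadata_type: str   # e.g. "MARC"
    uri: str


Package = Union[MetadataSet, IndirectPackage]


@dataclass
class Container:
    """Holds any number of packages, each tagged with its metadata type."""
    packages: List[Package] = field(default_factory=list)

    def types(self):
        return [p.metadata_type for p in self.packages]

    def find(self, metadata_type):
        return [p for p in self.packages if p.metadata_type == metadata_type]


container = Container([
    MetadataSet("Dublin Core", {"creator": "Plato", "title": "The Republic"}),
    MetadataSet("Other Descriptive", {"note": "hypothetical local description"}),
    # Hypothetical URI; the slide only shows that a URI leads to the MARC record.
    IndirectPackage("MARC", "http://example.org/marc/plato-republic"),
])

print(container.types())
```

A client that understands only Dublin Core can call `container.find("Dublin Core")` and ignore the other packages, which is the modularity the framework is after.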
WWW Infrastructure Evolving in this Direction
• Dublin Core submitted to IETF as RFC
– ftp://ftp.isi.edu/in-notes/rfc2413.txt
• Resource Description Framework (RDF)
– http://www.w3.org/RDF/
• Extensible Markup Language (XML)
– http://www.w3.org/XML/
Resource Description Framework (RDF)
• Influenced by the Warwick Framework, among others
• Enables interoperability between applications that exchange metadata
• Mix and match of metadata elements from different schemas
• An application of XML (transfer syntax)
A Simple RDF Model
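[Figure: RDF graph. The resource www2.cs.cornell.edu/CDLRG/doc1 has arcs labeled DC:Creator, DC:Publisher, and QCSchema:Rating. The QCSchema:Rating arc points to www.xxx.org/rate, which carries the properties MyRating and YourRating with values A and B.]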
RDF Expressed in XML
Dublin Core Element Set
<?xml:namespace name="http://www.purl.org/Metadata/dublin_core/" as="DC"?>
<?xml:namespace name="http://www.w3.org/Schemas/RDF/" as="RDF"?>
<RDF:Serialization>
<RDF:Assertions href="http://www2.cs.cornell.edu/CDLRG/doc1">
<DC:Creator>Sandy Payette</DC:Creator>
<DC:Publisher>Cornell DLRG</DC:Publisher>
</RDF:Assertions>
</RDF:Serialization>
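Underneath the XML syntax, each assertion is a statement of the form resource, property, value. The sketch below shows that model as plain Python tuples; it is an assumption-laden illustration rather than an RDF library, and the names `Triple`, `properties_of`, and `vocabularies` are hypothetical. The rating statement reuses the labels from the model figure.

```python
from collections import namedtuple

# One RDF statement: a resource, a named property, and a value.
Triple = namedtuple("Triple", "subject predicate obj")

DOC = "http://www2.cs.cornell.edu/CDLRG/doc1"

triples = [
    Triple(DOC, "DC:Creator", "Sandy Payette"),
    Triple(DOC, "DC:Publisher", "Cornell DLRG"),
    # Property from a second, non-Dublin-Core schema, as in the model figure.
    Triple(DOC, "QCSchema:Rating", "www.xxx.org/rate"),
]


def properties_of(subject, statements):
    """Return {property: value} for every statement about one resource."""
    return {t.predicate: t.obj for t in statements if t.subject == subject}


def vocabularies(statements):
    """Return the sorted namespace prefixes used by the statements."""
    return sorted({t.predicate.split(":", 1)[0] for t in statements})


print(properties_of(DOC, triples))
print(vocabularies(triples))
```

The second function makes the "mix and match" point concrete: one resource description draws on two independently maintained vocabularies.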
RDF: Why is it important?
• Market demand for metadata deployment
• Software infrastructure will be ubiquitous (e.g., free in browsers, servers, proxies, editors, etc.)
• RDF is a general-purpose framework that provides structured, human-readable and machine-understandable metadata for the web
• Allows stakeholder communities to independently develop, maintain, and reuse vocabularies
Access and Use
Focus on Structural Metadata
Structural Metadata
• What is it? Data that...
– Defines structure within documents
– Aggregates images into meaningful entities
– Correlates document components to image files
– Organizes a collection of objects
• Where is it?
– ASCII text files in directories
– Relational databases
– Embedded in documents or surrogates (e.g., SGML)
First... A Data Model
Data models mirror natural attributes and relationships of real-world objects
[Figure: data model for a book, with the entities Front, Table Contents, Chapter, Page, and Index connected by relationships of cardinality 0:1 and 1:N.]
“Binding” Document Images with SGML
<!DOCTYPE EBIND PUBLIC "-//UC Berkeley//DTD ebind.dtd (Electronic Binding (Ebind))//EN" [
<!ENTITY % birch PUBLIC "-//UC Berkeley//ENTITIES Birch-tree fairy book (Page Images)//EN">
%birch;
]>
<ebind type="book">
<front>
<page><image entityref="birch001" seqno="1" nativeno="i"></page>
<page><image entityref="birch002" seqno="2" nativeno="ii"></page>
<page><image entityref="birch003" seqno="3" nativeno="iii"></page>
<page><image entityref="birch004" seqno="4" nativeno="iv"></page>
<div0 type="titlepage">
<page><image entityref="birch005" seqno="5" nativeno="v"></page>
<page><image entityref="birch006" seqno="6" nativeno="vi"></page>
</div0>
<div0 type="introduction">
<head>Introductory note</head>
<page><image entityref="birch007" seqno="7" nativeno="vii"></page>
</div0>
Source: http://sunsite.berkeley.edu/Ebind/
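Once a binding like this has been parsed, a viewer needs two lookups: from a printed page number to an image, and from a division to its pages. The sketch below assumes the seven pages shown above have already been loaded into memory; the `Page` class and both helper functions are hypothetical names, and no SGML parsing is attempted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    seqno: int                      # position in the scanned sequence
    nativeno: str                   # page number printed on the page
    entityref: str                  # entity name that resolves to an image file
    division: Optional[str] = None  # enclosing div0 type, if any


# Values copied from the Ebind excerpt above.
pages = [
    Page(1, "i", "birch001"),
    Page(2, "ii", "birch002"),
    Page(3, "iii", "birch003"),
    Page(4, "iv", "birch004"),
    Page(5, "v", "birch005", "titlepage"),
    Page(6, "vi", "birch006", "titlepage"),
    Page(7, "vii", "birch007", "introduction"),
]


def image_for_native_page(nativeno, book=pages):
    """Return the image entity for a printed page number, or None."""
    for page in book:
        if page.nativeno == nativeno:
            return page.entityref
    return None


def pages_in_division(division, book=pages):
    """Return the sequence numbers of all pages inside one division."""
    return [p.seqno for p in book if p.division == division]


print(image_for_native_page("vii"))
print(pages_in_division("titlepage"))
```

Keeping `seqno` and `nativeno` separate is what lets a reader ask for "page vii" while the system retrieves the seventh image.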
Finding Aids in SGML
• Encoded Archival Description (EAD)
– SGML markup of descriptive access tools (inventories, registers, indexes, and guides)
– provides more detail about a collection than a typical catalog record
– facilitates access - “drill down” into collection
– potential international standard
– maintained jointly by Library of Congress and Society of American Archivists (SAA)
Source: http://www.loc.gov/rr/ead/eadhome.html
Preservation and Administration
Focus on Administrative Metadata
and Persistent Identifiers
Administrative Metadata
• Information for managing images... over time
– relocation
– migration (new formats)
– copyright tracking
– archiving of objects and services
• Where is it?
– File headers (to help prevent orphaned images)
– External databases (e.g., relational db)
– Separate files stored with images
Create a Preservation Audit Trail
Image File Attributes:
• formats
• versions
• compression
Image Attributes:
• resolution
• bit depth
• orientation
Process Data:
• creation date/time
• equipment used
Rights Management Data:
• expiration dates
• copyright info
• source statements
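The four groups of attributes above can be gathered into one record per image, with each migration appended to a history rather than overwriting the past. The following sketch is one possible shape for such a record, not a published schema: every field name, the `migrate` method, and all sample values (formats, dates, equipment) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AuditEvent:
    timestamp: str   # ISO 8601 date/time string
    action: str      # e.g. "migration", "relocation"
    detail: str


@dataclass
class ImageAdminRecord:
    # Image file attributes
    file_format: str
    format_version: str
    compression: str
    # Image attributes
    resolution_dpi: int
    bit_depth: int
    orientation: str
    # Process data
    created: str
    equipment: str
    # Rights management data
    copyright_info: str = ""
    source_statement: str = ""
    rights_expire: Optional[str] = None
    # Audit trail: one entry per change, oldest first
    history: List[AuditEvent] = field(default_factory=list)

    def migrate(self, new_format, new_version, compression, when):
        """Record a format migration, then update the current attributes."""
        detail = (f"{self.file_format} {self.format_version} -> "
                  f"{new_format} {new_version}")
        self.history.append(AuditEvent(when, "migration", detail))
        self.file_format = new_format
        self.format_version = new_version
        self.compression = compression


# All values below are invented for illustration.
record = ImageAdminRecord(
    file_format="TIFF", format_version="6.0", compression="Group 4",
    resolution_dpi=600, bit_depth=1, orientation="portrait",
    created="1998-10-21T09:30:00", equipment="hypothetical flatbed scanner",
    copyright_info="public domain",
)
record.migrate("PNG", "1.0", "deflate", when="2005-01-15T12:00:00")
print(record.history[0].detail)
```

Because the old format is written to the history before the fields change, the record can always answer what the file used to be as well as what it is now.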
Persistent Identifiers
• Globally unique names
• Persistent … names are permanent, lasting
• Used in resolution services to locate the object (locations change over time).
Unique Identifier: cnri.dlib/april97-payette
– Naming Authority: cnri.dlib
– Item Name: april97-payette
URL: http://www.somewebserver.org/somedirectory/somefile
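The separation between a permanent name and a changeable location can be sketched as a lookup table that is updated when an object moves. This is a toy illustration under stated assumptions, not the Handle System, PURL, or any real resolver API: `split_identifier` and `Resolver` are hypothetical names, and the second URL is invented to show a relocation.

```python
def split_identifier(identifier):
    """Split 'naming-authority/item-name' into its two parts."""
    naming_authority, separator, item_name = identifier.partition("/")
    if not separator or not naming_authority or not item_name:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    return naming_authority, item_name


class Resolver:
    """Maps persistent identifiers to their current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Bind an identifier to a URL, replacing any earlier location."""
        split_identifier(identifier)  # validates the syntax
        self._locations[identifier] = url

    def resolve(self, identifier):
        """Return the current URL; raises KeyError for unknown names."""
        return self._locations[identifier]


resolver = Resolver()
name = "cnri.dlib/april97-payette"
resolver.register(name, "http://www.somewebserver.org/somedirectory/somefile")
first_location = resolver.resolve(name)

# The object moves (hypothetical new URL); the name stays the same.
resolver.register(name, "http://www.otherwebserver.org/newdirectory/somefile")
second_location = resolver.resolve(name)

print(split_identifier(name))
print(first_location, second_location)
```

Citations that store only the name keep working after the move, because only the resolver's table has to change.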
Identifiers: Current Initiatives
• IETF Uniform Resource Names (URN)
– specification of URN framework
– requirements for resolution systems
– syntax definition
• Existing Systems
– CNRI’s Handle System
– OCLC PURLs
– DOI Initiative
Further reading
• IFLA: A Good List - http://www.nlc-bnc.ca/ifla/II/metadata.htm
• Lynch, et al.: CNI Resource Discovery White Paper - http://www.cni.org/projects/nidr/nidr.html
• Lagoze: Resource Discovery in the Digital Age - http://www.dlib.org/dlib/june97/06lagoze.html
• Payette: Persistent Identifiers, RLG DigiNews - http://www.rlg.org/preserv/diginews/diginews22.html
• W3C: Metadata Overview - http://www.w3.org/Metadata