Metadata for DL

Metadata Architecture for Digital Libraries:

Conceptual Framework for Indian Digital Libraries

Madhusudana Rao CR
C-DAC, Bangalore.

Agenda

• Introduction
• Metadata
• Digital Library Architecture
  – SODA
  – STARTS
• Indian Digital Library
  – Background
  – Proposed Architecture
  – SODA & STARTS
• Conclusion

Exclude

• Search Engines - General
• Digital Library - General

Introduction

• Information Processing & Retrieval
  – Typical Library Environment
  – Library Automation
  – Networking of Libraries
  – Digital Library
  – Digital Library initiatives

Introduction

• Digital Library Scene
  – Search Engines
    • Heterogeneous
    • Vertical Information Retrieval
    • Unique User Interface
    • Search engines are different
    • Protocols are different
    • Querying & Ranking
    • Incompatible across the sources

Introduction

– Possible solutions
  • Identifying the User Group
  • Identifying the Information Sources
  • Negotiating with different Information Sources
  • Resource Description Format
  • Choose best Information Source to evaluate Query
  • Evaluate the query at these sources
  • Merge the Query Results from these sources

New Protocol

• User
• User Query
• Information Source
• Networked Environment
• RDF Metadata
• User Interface
• Search & Retrieval

Issues..

• Metadata
• Network Protocols
• Possible Solutions for typical environment

Metadata…definition

Structured data about data...

Metadata…definition

• Metadata is data that helps to design, create, describe, preserve and use information systems and resources.

• Metadata can play a role in the development of effective, authoritative, interoperable, scalable, and preservable information and record-keeping systems.

Metadata…means

• Information Resource
• Library Catalogue
  – Index, Abstracts, Catalog Records, etc > MARC, AACR, LCSH etc.
• Human Generated Textual description
• Machine generated data

Metadata…features

• Content
  – Intrinsic
    • What does it contain?
    • What is it about?
• Context
  – Extrinsic
    • Who, What, Why, Where, How etc.
• Structure
  – Formal Set

Metadata…Attributes

• Intrinsic
  – Subject, Title, Author, Publisher, Publication place, Other agent, Date, Object type, Form - Identifier, Relation, Source, Language, Coverage, Abstract, Version, Notes, Signature, Classification, Keyword

Metadata…Attributes

• Extrinsic
  – System Requirement, Mode of access, Availability, Cost, Control, Extent, Encoding description, Revision description

Metadata…for two communities

• Information Generators
• Librarians / Cataloguers

Metadata… can be

• Information Objects
  – Physical
  – Intellectual Form

Metadata…similar

• Typical Physical Library:
  – Catalogue
  – Book Racks
  – Books

Metadata…currently

• Electronic Information Environment
  – Users search Metadata
  – Pointers
  – Primary Information available on computer display
• Distinction
  – Electronic Environment

Metadata…process

• Two Communities
  – Generators of information
  – Libraries & Cataloguers
• Users
• Metadata

Metadata…can be

• Need not be digital
• More than the description of an object
• Can come from a variety of sources
• Continues to accrue
• One object's metadata can be another information object's metadata

Metadata…can be

• Intermediate steps to retrieve content
• Surrogates of objects

Metadata… need

• The Internet & WWW have witnessed exponential growth

• The need of the hour on the Internet is catalogs of some kind

• The Internet/WWW was not designed to catalog its contents

Metadata…need

• Resource Description is a challenge
• Tools are available
• These are only directory listings of network resources and search engines
• Metadata is one of the solutions
• Again, standards are yet to make their impact

Metadata…issues

• Increased accessibility
  – Searching > existence of rich and consistent metadata
  – Search across multiple collections
  – Distributed across several repositories

Metadata…issues

• Retention of Text
  – Collection of objects
  – Complex interrelationships with people, places, movements & events
  – Documenting and maintaining those relationships
  – Authenticity, structural and procedural integrity

Metadata…issues

• Expanding use
  – Disseminating digital versions
  – Geography
  – Economics
  – Infinite ways to search information
  – Retrieve to wider community

Metadata…issues

• Multi-versioning
  – Variant versions
  – High resolution copy for preservation
  – Low resolution copy for thumbnail image for quick reference and network transfers

Metadata…issues

• Legal Issues
  – Track many layers of rights and reproduction information
  – Privacy
  – Proprietary interests

Metadata…issues

• Preservation
  – Generations - H/W & S/W
  – Technical, Descriptive and Preservation data
  – Information objects to remain accessible and intelligible over time

Metadata…issues

• System improvement and economics
  – Benchmarking
  – Planning new systems

Metadata..life cycle

• Creation & Multi-Versioning
• Organization
• Searching & Retrieval
• Utilization
• Preservation & Disposition

Metadata…standards

• For metadata to be useful & cost-effective, it is essential that
  – Structure, semantics and syntax conform to standards
  – It captures the essence of the sources
  – It follows a distributed metadata model

Metadata…standards

• There is no single international standard for metadata

• Different levels exist, from complex, rich formats to simple ones

• Several metadata schemes have been proposed for different levels of requirements

Metadata…standards

• IAFA templates
• WWW semantic header
• URC (Uniform Resource Citation)
• OCLC InterCat project
• TEI (Text Encoding Initiative)
• Search engine meta tags
• Resource Description Framework
• EAD (Encoded Archival Description)
• GILS (Government Information Locator Service)
• Federal Geographic Data Committee
• Museum Educational Site Licensing Project
• Dublin Core

Dublin Core

Because it is simple…….. Yet effective ….

Dublin Core..means

• Dublin, Ohio
• International consensus meetings, workshops, etc
• Emerging Infrastructure for Internet
• Support Resource Discovery
• Elements represent a broad interdisciplinary consensus
• Core set of elements

Dublin Core..standard

• Comprises 15 core elements
• Consensus by an international, cross-disciplinary group representing
  – Library & Information
  – Computer Science
  – Text Encoding
  – Museum
  – Related fields of scholarship

Dublin Core..standard

• Each of the 15 elements is optional and repeatable

• Each element has a limited set of qualifiers and attributes

• Simple DC
• Qualified DC

Dublin Core..goals

• Simplicity of creation & maintenance
  – Non-specialists can create descriptive records for effective retrieval in a networked environment
• Commonly understood semantics
  – Digital tourist for non-specialist searcher
  – Convergence of common, more generic elements
  – Increasing visibility and accessibility

Dublin Core..goals

• International scope
  – 20 languages
  – Coordinating efforts
  – RDF - WWW
• Technical challenges of Internationalization
  – Multilingual & Multicultural nature of electronic information universe

Dublin Core..goals

• Extensibility
  – Additional resource discovery needs

Dublin Core..elements

• Content
  – Coverage, Description, Type, Relation, Source, Subject and Title
• Intellectual property
  – Contributor, Creator, Publisher & Rights
• Instantiation
  – Date, Format, Identifier & Language
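
To make the grouping of the 15 elements concrete, here is a minimal Python sketch of a Simple DC record. The dictionary layout, the `validate_record` helper and the sample values are assumptions made for illustration; Dublin Core itself does not prescribe this representation. Because every element is optional and repeatable, each value is a list and an empty record is valid.

```python
# The 15 Dublin Core elements, grouped into the three categories listed above.
DC_GROUPS = {
    "content": ["coverage", "description", "type", "relation",
                "source", "subject", "title"],
    "intellectual_property": ["contributor", "creator", "publisher", "rights"],
    "instantiation": ["date", "format", "identifier", "language"],
}
DC_ELEMENTS = {name for group in DC_GROUPS.values() for name in group}


def validate_record(record):
    """Return the keys that are not DC elements or whose value is not a list.

    Hypothetical helper: every element is optional and repeatable, so a
    record maps element names to lists of values and may omit any element.
    """
    problems = []
    for key, values in record.items():
        if key not in DC_ELEMENTS or not isinstance(values, list):
            problems.append(key)
    return problems


# Illustrative record only; the values are invented for this sketch.
record = {
    "title": ["Sample palm-leaf manuscript"],
    "creator": ["Unknown"],
    "language": ["sa", "kn"],  # a repeated element
    "format": ["image/tiff"],
    "type": ["Image"],
}

print(validate_record(record))
```

A Qualified DC record would need extra structure for qualifiers, which this sketch leaves out.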

Dublin Core..implementation

• Dublin Core web site lists 15 North America and Mexico in Europe and 12 Asia and Australia

Digital Library Architecture

• SODA (Smart Objects Dumb Archives)
• STARTS (Stanford Protocol proposal for Internet Retrieval and Search)

Digital Library

• Digital Library Services
  – User
    • Functionality & Interface
      – Searching
      – Browsing
• Archive
  – Managed sets of objects

Digital Library

• Digital Object
  – Stored and trafficked digital content
    • Simple files
    • Sophisticated objects

Digital Library

Figure labels:
• Publishers
• Digital Objects in Archives
• Archive 1, Archive 2, ..., Archive N
• Digital Library Service Providers
• Digital Library Services
• Digital Objects out of Archives
• Library Users

Digital Library.. builds

• Identifying a user group
• Identifying archives holding information of interest
• Negotiating terms and conditions with publishers
• Creating Indices
• Services such as Search & Browse

Digital Library.. builds

• Creating User interaction services
  – Terms & Conditions
  – Authentication
  – Billing
  – Display

Digital Library.. hindered

• Interoperability
• Object mobility
• Complex archives

Digital Library..cons

• Digital Libraries are partitioned
  – Discipline - Computer Science, Aeronautics, Physics, etc.
  – Format - Technical reports, video, software, etc.
• Interdisciplinary search difficult
• Resource Description includes manuscripts, software, data sets etc.

Digital Library..cons

• Manuscripts Vs Other objects - Reintegration

• All digital storage and transmission, tight integration

SODA…background

• Information is generated in several forms
• Differentiated by semantic types (report, software, video, data sets etc.)
• A given semantic type is further differentiated by syntactic representation (PS, PDF, Word)
• Media boundaries exist

SODA…addresses

• Archive-independent container construct
• All semantic and syntactic data types
• Objects that are logically grouped together
• Archived & manipulated as a single object
• Several objects can communicate with each other
• Arbitrary network services

SODA..addresses

• Traditional functionality associated with archives has been pushed down into objects

• Making objects smarter/increase the responsibility

• Archives dumber/decrease the responsibility

SODA

• Archives exist to assist the user in locating the objects

• Once the object is found, the user interacts directly with the object

Smart Objects.. illustration

• SOSA: Smart Objects, Smart Archives. Ex: none
• SODA: Smart Objects, Dumb Archives. Ex: NCSTRL+
• DOSA: Dumb Objects, Smart Archives. Ex: NCSTRL
• DODA: Dumb Objects, Dumb Archives. Ex: FTP server

SODA Model…implementation

Buckets..containers

• Object oriented containers
• Logically grouped items are
  – Collected
  – Stored
  – Transported as a single unit
• Many forms of same data
• Related & non traditional data (Supportive material)

Buckets.. containers

• Multiple packages
• Packages can correspond to semantic types
  – manuscript, software etc.
  – metadata
  – terms and conditions
  – pointers
• A single package can have several items

Bucket..architecture

• Handle (unique ID)
• Access Methods
• Packages inside the bucket, elements inside the package
  – Terms and Conditions
  – Metadata (RFC 1807, Dublin Core)
  – Manuscript: .ps, .pdf, .tex, .doc
  – Software: .tar, .c, .java, .asp
  – Images: .gif, .jpg
  – Data sets: .xls, .tar

Bucket…requirements

• Unique ID - handle
• Either standalone or multiple repositories
• Standalone - WWW through TCP/IP
• Moderation of number of buckets through intelligence and functionality
• Individual buckets may have custom terms and conditions

Buckets..characteristics

• Is of arbitrary size
• Globally unique ID
• 0 or more components called packages
• Package contains 1 or more components - elements
• Element can be a file or pointer
• Packages and elements can be other buckets

Buckets..characteristics

• A package can be a pointer to a remote bucket, another package or an element
• Buckets can keep internal logs of actions
• Interactions or communication between buckets are made only through defined methods
• Buckets can initiate actions; they do not have to wait to be acted on
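
The characteristics listed on the two slides above can be sketched as a small Python data structure. This is only an illustration under stated assumptions: the class names, method names and the `hdl:example/1` handle are hypothetical and are not taken from the NCSTRL+ bucket implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Element:
    """A file (bytes) or a pointer (for example a URL string)."""
    name: str
    payload: Union[bytes, str]
    is_pointer: bool = False


@dataclass
class Package:
    """A named group of one or more elements."""
    name: str
    elements: List[Element]


@dataclass
class Bucket:
    handle: str                                   # globally unique ID
    packages: Dict[str, Package] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)  # internal log of actions

    # All interaction goes through defined methods, never direct field access.
    def add_package(self, name, elements):
        if not elements:
            raise ValueError("a package contains one or more elements")
        self.packages[name] = Package(name, list(elements))
        self.log.append(f"add_package {name}")

    def add_element(self, package, element):
        self.packages[package].elements.append(element)
        self.log.append(f"add_element {package}/{element.name}")

    def list_contents(self):
        self.log.append("list_contents")
        return {p.name: [e.name for e in p.elements]
                for p in self.packages.values()}


bucket = Bucket(handle="hdl:example/1")  # hypothetical handle
bucket.add_package("manuscript", [Element("report.pdf", b"%PDF")])
bucket.add_package("metadata", [Element("dc.xml", b"<dc/>")])
print(bucket.list_contents())
```

A bucket starts with zero packages, and each package must be created with at least one element, which mirrors the "0 or more packages, 1 or more elements" rule. Nested buckets and bucket-initiated actions are not modelled here.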

Traditional Vs Bucket repository

• Traditional repository: User, Repository Interface, intelligence, Archived objects
• Bucket repository: User, Repository Interface, optional intelligence, Archived Buckets
• Bucket extraction procedure

Buckets..protocol

• Index holdings
• Search/retrieve holdings
• Display holdings
• Bucket
• Archive
• User

Bucket..Tools

• Author Tool
  – Metadata
  – Adds packages
  – Adds elements to package
  – Selects applicable clusters
  – Terms and conditions

Bucket..Tools

• Management Tool
  – Interface
  – Query and update buckets
• Bucket Matching System
  – SDI
  – Find similar works by different authors
  – Arbitrary SDI
  – Metadata scrubbing

Buckets..implementation

• NCSTRL
• NCSTRL+

STARTS

• Stanford Digital Library Project
• Search Engine Vendors

STARTS

• Document Sources
  – Internal networks
  – Internet
• Source Contents
  – Hidden behind search interfaces
• Algorithms/Protocols are different

STARTS..Architecture

• Large number of resources
• Each resource consists of one or more sources
• A source is a collection of files
• Accepts queries from clients and produces results
• Sources may be small or large
• Extract the source list from resources periodically

STARTS..Architecture

• Extract metadata and content summaries from sources periodically
• Send the query to a source at a resource
• Communicate with promising resources
• Results come from multiple sources; merge them & return them to the user

STARTS..Query language

• Filter expression
  – Boolean nature
  – Defines documents
• Ranking expression
  – Associates score with documents

STARTS..Query language

• L-strings
  – language-country
  – string behavior
• Atomic Terms
  – Fields
  – Modifiers
• Complex filter expression
  – and, or, and-not, prox etc

STARTS..Query language

• Complex ranking expressions
• Global settings
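
The split between a Boolean filter expression and a scoring ranking expression can be illustrated with a short Python sketch. The tuple encoding, the function names and the weighted term-frequency score are assumptions chosen for this example; they are not the STARTS wire syntax, and `prox`, modifiers and L-strings are left out.

```python
def tokens(doc, field):
    """Lower-cased words of one field of a document (a dict of field -> text)."""
    return doc.get(field, "").lower().split()


def matches(doc, expr):
    """Evaluate a Boolean filter expression against one document.

    Hypothetical encoding: ("term", field, word) is an atomic term;
    ("and" | "or" | "and-not", left, right) combines two expressions.
    """
    op = expr[0]
    if op == "term":
        _, field, word = expr
        return word.lower() in tokens(doc, field)
    if op == "and":
        return matches(doc, expr[1]) and matches(doc, expr[2])
    if op == "or":
        return matches(doc, expr[1]) or matches(doc, expr[2])
    if op == "and-not":
        return matches(doc, expr[1]) and not matches(doc, expr[2])
    raise ValueError(f"unknown operator: {op}")


def score(doc, ranking):
    """Ranking expression as a list of (field, word, weight) triples."""
    return sum(weight * tokens(doc, field).count(word.lower())
               for field, word, weight in ranking)


def search(docs, filter_expr, ranking):
    """The filter defines which documents qualify; the ranking orders them."""
    hits = [d for d in docs if matches(d, filter_expr)]
    return sorted(hits, key=lambda d: score(d, ranking), reverse=True)


docs = [
    {"id": "1", "title": "digital library metadata",
     "body": "metadata metadata for a library"},
    {"id": "2", "title": "digital archive", "body": "metadata"},
    {"id": "3", "title": "cooking", "body": "recipes"},
]
query_filter = ("and", ("term", "title", "digital"),
                ("and-not", ("term", "body", "metadata"),
                 ("term", "title", "archive")))
print([d["id"] for d in search(docs, query_filter, [("body", "metadata", 1.0)])])
```

Keeping the two expressions separate lets a source that supports only Boolean retrieval still answer the filter part of a query.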

STARTS..Merging ranks

• Unnormalized score of the document for each query
• ID of the sources where document appears
• Statistics
  – Term-frequency, Term-weight, Document-frequency, Document-size, Document-count
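
Unnormalized scores from different sources are not directly comparable, which is why the statistics above travel with each result. The sketch below shows one possible merging strategy, assumed here purely for illustration: it ignores each source's raw score and recomputes a uniform TF-IDF style score from the returned term frequency, document size, document frequency and document count. The field names and the formula are hypothetical and are not prescribed by STARTS.

```python
import math


def merge_results(results_by_source, source_stats, query_terms):
    """Re-score hits from several sources on one common scale.

    results_by_source: {source_id: [{"id", "raw_score", "tf", "size"}, ...]}
    source_stats:      {source_id: {"doc_count": int, "df": {term: int}}}
    Each hit's "size" (document size in words) must be greater than zero.
    """
    total_docs = sum(stats["doc_count"] for stats in source_stats.values())
    idf = {}
    for term in query_terms:
        df = sum(stats["df"].get(term, 0) for stats in source_stats.values())
        idf[term] = math.log(1 + total_docs / df) if df else 0.0

    merged = []
    for source_id, hits in results_by_source.items():
        for hit in hits:
            new_score = sum((hit["tf"].get(t, 0) / hit["size"]) * idf[t]
                            for t in query_terms)
            merged.append((new_score, source_id, hit["id"]))
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged


# Invented example: source A reports scores on a 0-1000 scale, B on 0-1.
results = {
    "A": [{"id": "a1", "raw_score": 900.0, "tf": {"veda": 1}, "size": 100}],
    "B": [{"id": "b1", "raw_score": 0.4, "tf": {"veda": 5}, "size": 100}],
}
stats = {
    "A": {"doc_count": 10, "df": {"veda": 2}},
    "B": {"doc_count": 20, "df": {"veda": 4}},
}
for new_score, source_id, doc_id in merge_results(results, stats, ["veda"]):
    print(source_id, doc_id, round(new_score, 4))
```

In this invented data, sorting on the raw scores alone would place a1 first only because source A uses a larger score range.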

STARTS..Source metadata

• Properties of the source
  – Fields supported, score range, linkage etc.
• Content Summary of the source
  – List of words that appear in the source
  – Statistics of each word listed
  – Total documents in the list etc.

STARTS..in the end

• General Search Engines
  – Gathers all documents on the network
• STARTS
  – Gathers metadata about collections
  – Selects small set of collections
  – Search & retrieve
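
The idea of gathering content summaries and then querying only a small set of promising collections can be sketched as follows. The summary layout (document count plus per-word document frequency) follows the source metadata slide; the `promise` heuristic and all names are assumptions for illustration, not the selection method defined by STARTS.

```python
from collections import Counter


def content_summary(docs):
    """Build a content summary: document count and per-word document frequency."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))
    return {"doc_count": len(docs), "df": dict(df)}


def select_sources(summaries, query_terms, k=2):
    """Return up to k source names ranked by a simple promise heuristic.

    Hypothetical heuristic: sum, over the query terms, of the fraction of a
    source's documents that contain the term. Sources scoring zero are dropped.
    """
    def promise(summary):
        n = summary["doc_count"] or 1
        return sum(summary["df"].get(t, 0) / n for t in query_terms)

    ranked = sorted(summaries, key=lambda name: promise(summaries[name]),
                    reverse=True)
    return [name for name in ranked[:k] if promise(summaries[name]) > 0]


# Invented collections for the example.
summaries = {
    "heritage": content_summary(["veda manuscript", "palm leaf manuscript"]),
    "physics": content_summary(["quantum field", "particle physics"]),
}
print(select_sources(summaries, ["manuscript"], k=1))
```

Only the selected sources then receive the actual query, in contrast to a general search engine that gathers every document itself.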

STARTS..implementation

• Alexandria Digital Library

STARTS..limitation

• Text only

Indian Digital Library..

• Ancient & diverse culture
• 5000-year-old culture
• Largest democracy
• Seventh largest country
• High population
• Illiteracy
• Important part of the world economy

Indian Digital Library..

• World’s largest middle class
• Poverty
• Highly skilled manpower
• Generates research-oriented information
• Global interest
• Major player in IT in the world
• The world is looking for ancient Indian culture

Indian Scene..IT

• Content is lacking
• Control of Indian literature (both bibliographic and full text) is sketchy in almost all fields
• NII
• DL on Indian Heritage
• World Wide accord for Indian Heritage
• Internet Religion is the hot attraction

Indian Scene.. IT

• Western research has been done on the Vedas, Upanishads, Shastras, philosophy etc., but the soul is missing

• Protection, Preservation, Study, Research, Propagation for posterity

• NLP
• Knowledge Presentation

Indian Scene.. IT

• Speech recognition
• OCR
• Machine translation
• NL interfaces
• Text Processing through Index, Concordance, Thesauri, Dictionaries

Indian Scene.. IT

• National Integration, Guide Humanity, Conflicts, Aberrations, Intolerance etc
• Value based system
• Historic priceless manuscripts

Indian Heritage

• Indian Art
• Indian Paintings
• Indian Sculpture
• Religion

Proposed Architecture….

• Background
  – User Group
    • Skilled & Illiterates
    • Oral tradition still exists
    • Multilingual
  – Information Sources
    • Content is lacking
    • Literature Control both Bibliographic and Text is very weak

Proposed Architecture….

• Media
  – Computer Generated files to Palm leaf manuscripts
• Language
• Lack of standards for communication
• Geographical boundaries
• Accessibility
• Reaching rural population
– Publishing
  • Restricted to regional and local

Proposed Architecture….

• National initiatives are yet to take off
• Cooperative publishing is lacking
• Unicode/Universal protocol is yet to make its impact
– Network Resources
  • Communication infrastructure exists but is not stable
  • Individuals, organizations, local and regional bodies are generators of sources
  • Loose networks - manpower & infrastructure
  • Lack of communication standards
  • Duplicate works

Proposed Architecture….

– Need of Networked Information Sources
  • Much priceless knowledge has been lost or is being lost
  • Future generations are missing the value of life told by ancestors
  • Protection, Preservation, Study, Research, Propagation for posterity
– Looking for future
  • NII
  • Better CCC: Computer, Communication, Content

Hybrid Architecture….

• Combination of SODA & STARTS Architecture
  – From SODA - Bucket Architecture
  – From STARTS - Search and Retrieval protocol
• Metadata - Dublin Core
  – For its simplicity and popularity

Bucket Architecture….

• Buckets are logically grouped
  – Language, Region, Content, Media, Images, etc. (any combination or together as intelligent)
• Large archives have buckets with many different functionalities
• Bucket may contain resources, packages, elements, metadata, pointers, etc.

Bucket Architecture….

• A bucket may be a unique entity, or many buckets may form an entity
• A bucket may be standalone with the content
• Many buckets may become a resource
• Each bucket is built with some degree of intelligence and functionality
• Includes author tool and management tool

Bucket Architecture….

• Similarly user’s buckets are also created
• Bucket matching may take place
• Interactions with packages or elements are made only through defined methods on a bucket
• Bucket can initiate actions
• Buckets can exist inside or out of a repository

STARTS Architecture….

• Search, Retrieval and Browse within Bucket
• Resources, Sources, Elements, Packages, Pointers, etc. based on the Bucket definition
• Search query is made within the source defined in the Bucket
• Query may be within a bucket or across buckets, based on the definition and functionality

STARTS Architecture….

• Ranking is done within the source
• Matching is done with User’s Bucket definition
• Results displayed based on Ranking and user’s requirements
• Although STARTS uses Z39.50 for metadata & transfer protocol, we propose to use Dublin Core for metadata

New Protocol..

• Need to create a standard for communication
• Information processing and retrieval
• Feeling of a universal information source
• Many sources converge as one resource
• Global information resource
• Universal accessibility by unified protocol
• Global access

New Protocol..

• The framework is just beginning