Next Generation Digital Libraries: Supporting Interoperability,
Semantics, and Quality
Biblioteca Central, Universidad Nacional del Sur
Bahia Blanca, Argentina, May 17-18, 2004
Edward A. Fox
[email protected] http://fox.cs.vt.edu
Acknowledgements (Selected)
• Sponsors: ACM, Adobe, AOL, IBM, Microsoft, NASA, NLM, NSF, OCLC, SUN, US Dept. of Ed.
• VT Faculty/Staff: Debra Dudley, Weiguo Fan, Gail McMillan, Manuel Perez, Naren Ramakrishnan, Layne Watson, …
• VT Students: Yuxin Chen, Shahrooz Feizabadi, Marcos Gonçalves, Nithiwat Kampanya, S.H. Kim, Bing Liu, Paul Mather, Fernando Das Neves, Unni. Ravindranathan, Ryan Richardson, Rao Shen, Ricardo Torres, Wensi Xi, Baoping Zhang, Qinwei Zhu, …
ACKNOWLEDGEMENTS (NDLTD)
• NDLTD Board of Directors, previous Steering Committee + other NDLTD committees; those running Electronic Thesis & Dissertation (ETD) initiatives in universities, regions, countries
• Helpful sponsorship by many organizations, especially Adobe (new initiative!), CONACyT, DFG, FIPSE (US Dept. Education), IBM, Microsoft, NSF (IIS-9986089, 0086227, 0080748, 0325579; DUE-0121679, 0136690, 0121741, 0333601), OCLC, SOLINET, SUN, SURA, UNESCO, VTLS, many governments (Australia, Germany, India, …), …
• Colleagues at Virginia Tech (faculty, staff, students), and collaborators at many universities
• Slides included from: Vinod Chachra, Thom Hickey, Joan Lippincott, Gail McMillan, Axel Plathe, Hussein Suleman, …
Other Collaborators (Selected)
• Brazil: FUA, UFMG, UNICAMP
• Case Western Reserve University
• Emory, Notre Dame, Oregon State
• Germany: Univ. Oldenburg
• Mexico: UDLA (Puebla), Monterrey
• College of NJ, Hofstra, Penn State, Villanova
• University of Arizona
• University of Florida, Univ. of Illinois
• University of Virginia
• Endowment: VTLS
UNESCO
• Cláudio Menezes [[email protected]]
• Purpose:
  • Reinforce local solutions, commitments
• Emphasize:
  • ETD does not need many resources.
  • Open source and free software is available.
  • International cooperation can help.
  • Local training is crucial.
• => Inclusion of ETD in practices, processes
• => Schedule for ETD projects
Part 4
Next Generation Digital Libraries: Supporting Interoperability, Semantics, and Quality
Digital Libraries in Education
• Analytical Survey, ed. Leonid Kalinichenko
• © 2003, www.iite-unesco.org, [email protected]
• Transforming the Way to Learn
• DLs of Educational Resources & Services
• Integrated/Virtual Learning Environment
• Educational Metadata
• Current DLEs: US (NSDL, DLESE, CITIDEL, NDLTD), Europe (Scholnet, Cyclades), UK (Distributed National Electronic Resource)
Digital Libraries in Education - 2
• Advanced Frameworks & Methodologies
• Instructional course development with learning module repositories, Learning Object reuse
• Community organization around DLEs
• Other content for science and research
• Cyberinfrastructure, data grids
• Curriculum-based interfaces (see Krowne et al.)
• Concept-based organization of learning materials and courses (CMs, ontologies)
DLEs: Future Vision (p. 6)
• Global learning environment of the future:
• Student-centered
• Interactive and dynamic
• Enabling group work on real world problems
• Enabling students to determine their own learning routes (styles, personalization)
• Supporting lifelong learning
DLEs: Objectives (p. 11)
• Long-range: lifelong/distance/anytime-anywhere
• Intermediate goals
  • Support for students, teachers, parents
  • Enhanced student performance
  • More students excited about science
  • More Internet-based science educational resources
    • with increased quality and comprehensiveness,
    • easy to discover and retrieve,
    • preserved and universally available
DLEs: Guiding Principles (p. 12)
• Driven by educational and science needs
• Facilitating educational innovation
• Stable, reliable, permanent
• Accessible to all
• Leveraging prior research: DL, courseware, …
• Adaptable to new technologies
• Supporting decentralized services
• Resource integration thru tools/organization
CITIDEL Technology Features
• Component architecture (Open Digital Library)
  • Re-use and compose re-deployable digital library components.
• Built using open standards and technologies
  • OAI: used to collect DL resources and to support DL interoperability
  • XSL and XML: interface rendering with multilingual, community-based translation of screens and content (Spanish, …)
  • Perl: component integration
  • ESSEX: search engine functionality
    • Very fast, utilizing in-memory processing
    • Includes snapshots for persistence
• Multi-scheming
  • Integrates multiple classifications / views through maps, closure
Multi-dimensional Categorization
[Figure: resources categorized along three dimensions. Language: English, Spanish. Topic: Java, Multimedia, Algorithms. Quality: Nominated, Editor reviewed, Identified by crawl, Peer reviewed.]
PIPE: Personalization by Partial Evaluation
• Interactions at existing web sites are predefined by the site designer
• Personalization is achieved by the designer’s anticipation of users’ expectations
• PIPE allows automatic personalization of a web site without designer anticipation
• Recognized with the 2001 New Century Technology Council Innovation award
CITIDEL + PIPE
• Adds Interaction Personalization to CITIDEL
• Automatically handles multi-modal conversion to cell phone, PDA, etc.
• Can be adapted to any digital data set; it only requires an XML file of the content with its hierarchy maintained.
PIPE provides Mixed-Initiative Interaction
• Involves an extra specification window (e.g., a toolbar)
• system-initiated + user-initiated modes of interaction
Traditional browser: the user merely clicks on available hyperlinks.
PIPE window: the user can type in any information out-of-turn
Features of PIPE
• Applicable to many information system technologies
• web sites (even third-party)
• Digital libraries (currently working on CITIDEL integration)
• voice-activated systems (e.g., pizza ordering, movie information, and flight reservation services)
• PIPE is available for licensing and is ready for commercialization, through VTIP
• PIPE has been featured in IEEE Internet Computing, IEEE IT Professional, and the Appian Web Personalization Report.
OAI, ODL, DL-in-a-box
• Open Archives Initiative
  • since 1999, www.openarchives.org
• Open Digital Libraries
  • since 2001, from www.dlib.vt.edu
  • with Hussein Suleman (now U. Cape Town)
• DL-in-a-box
  • NSDL support since 2001
  • Aimed to help new collections / services projects
  • http://dlbox.nudl.org
Open Archives Initiative (OAI)
• Advocacy for interoperability
• Standard for transferring metadata among digital libraries
• Protocol for Metadata Harvesting (PMH)
  • Simplicity
  • Generality
  • Extensibility
• Support for PMH => Open Archive (OA)
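A harvesting request in OAI-PMH is an ordinary HTTP request whose query string carries a verb and its arguments. The following is a minimal sketch of building such a request URL; the base URL is a hypothetical placeholder and the helper name `build_pmh_request` is an assumption made for illustration. `ListRecords` and the `oai_dc` metadata prefix are defined by the protocol.

```python
from urllib.parse import urlencode


def build_pmh_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL: base URL plus the verb and its arguments."""
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository address, used only for illustration.
BASE_URL = "http://example.org/oai"

# Ask the data provider for all records in unqualified Dublin Core.
list_records_url = build_pmh_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(list_records_url)
```

A service provider would fetch this URL and parse the XML response; that step is omitted here.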
OAI = Technical Umbrella for Practical Interoperability…
[Figure: Reference Libraries, Publishers, E-Print Archives, and Museums under the OAI umbrella]
…that can be exploited by different communities
OAI – Repository Perspective
Required: Protocol
[Figure: a repository containing items labeled DO and MDO]
OAI – Black Box Perspective
[Figure: seven open archives, OA 1 through OA 7, shown as black boxes]
Tiered Model of Interoperability
[Figure: tiers labeled Mediator services, Metadata harvesting, and Document models. Service Providers (Discovery, Current Awareness, Preservation) sit above Data Providers, connected by metadata harvesting.]
The World According to OAI
[Figure sequence: (1) users and digital objects (programs, documents, images, videos); (2) a digital library placed between users and digital objects; (3) the digital library as a monolithic and/or custom-built web-based application; (4) a componentized digital library; (5) an open digital library whose components are open archives (OA) connected by PMH and XPMH links.]
Legend:
• XPMH: Open Digital Library Protocol = Extended OAI-PMH
• PMH: Protocol for Metadata Harvesting
• Open Digital Library Component = Extended Open Archive
• OA: Open Archive
Open Digital Library Deployments
• NDLTD (www.ndltd.org)
• Computer Science Teaching Center (www.cstc.org)
• Computing and Information Technology Interactive Digital Educational Library (www.citidel.org)
• Open Archives Distributed (NSF, DFG) – enhancements to PhysNet
• OCKHAM
• Open to others through DL-in-a-box
Open Digital Library
• Network of Extended Open Archives where each node acts as either a provider of data, services or both.
• Component = Node
• Protocol = Arc
Open Digital Library Components
• Running now
  • XML-File (data provider from file system)
  • Search: simple or in-memory (Essex) or generalized
  • Union, browse, recent, filter
  • E-journal/review, Submit, Edit, Annotation
  • Recommender, Rating; Mirroring (see JCDL’02)
  • Working with NCSA: from DB, unstructured text
• Others in process
  • Classification/categorization
  • Registry (and other connections with web services)
ETD DL for the Networked Digital Library of Theses and Dissertations (www.ndltd.org)
[Figure: ETD collections (ETD-1 to ETD-4) are harvested over PMH by ODL components (ODLUnion, Filter, ODLSearch, ODLBrowse, ODLRecent), which provide Search, Browse, and Recent services to a user interface for students and researchers.]
Example Open Digital Library
Harvest from data providers
DBUnion Archive Merger Component
DBBrowse Browse Engine
IRDB-1 Search Engine
As Metadata Search Service Provider
As Metadata Browse Service Provider
XML File Coll. & Data Provider 1
XML File Coll. & Data Provider 2
XML File Coll. & Data Provider 3
Open Digital Library: Extended
What’s New Engine
As What’s New Service Provider
OAI-PMH Data Provider
Submit Archive
OAIB (NCSA: from RDBMS)
Filter
Recommend
Rate Engine
Annotation Engine
IRDB-2 Search Engine
As Annotation Search Service Provider
As Recommend & Rate Service Provider
New ODL Component: Generalized Search Platform
CS6604 Client: Patrick Fan, Wensi Xi
Group Members: Ming Luo, Rui Yang, Xiaoyan Yu
Introduction
• Background
  • The importance of search service in a digital library
• Problems of search engines in DLRL
  • IRDB: low search effectiveness, insufficient parsing component
  • ESSEX: limited scalability due to in-memory index
  • MARIAN: low search efficiency
Algorithms
• Phrase Searching Algorithms
  • Adjacency of terms
• Ranking Functions
  • Okapi (baseline)
  • GP-based ranking function
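Phrase searching by term adjacency can be illustrated with a positional index: a phrase matches when its terms occur at consecutive positions in the same document. This is a minimal sketch under that assumption, not the platform's actual implementation; the function names and the whitespace tokenizer are hypothetical simplifications.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} using naive whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return ids of documents in which the phrase terms appear adjacently, in order."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    hits = set()
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            # Term k of the phrase must sit exactly k positions after the first term.
            if all(start + k in index[t].get(doc_id, ())
                   for k, t in enumerate(terms[1:], 1)):
                hits.add(doc_id)
                break
    return hits


docs = {
    "d1": "open digital library components",
    "d2": "digital open library",
}
index = build_positional_index(docs)
print(phrase_search(index, "digital library"))
```

Only `d1` contains the two words next to each other; `d2` contains both words but not adjacently.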
Genetic Programming (GP)
• A problem solving system designed based on principles of evolution and heredity
[Figure: ranking function discovery. Training data with relevance feedback is the input; the discovered ranking function f produces the re-ranked output.]
Input (training data with feedback):
Order | Doc. | Rele.
1 | A | 1
2 | B | 0
3 | C | 0
4 | D | 1
5 | E | 0
6 | F | 1
7 | G | 1
Output (ranking produced by function f):
Order | Doc. | Rele.
1 | A | 1
2 | D | 1
3 | F | 1
4 | G | 1
5 | B | 0
6 | C | 0
7 | E | 0
An Example of GP-based RF
(log (+ (* df (log (log (* (* (/ n df) (* (* (/ n df) (* (* df_max_Col tf) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col)))) (* (/ (* (* (/ n df) (* (* df_max_Col tf) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col))) (+ (* length df) tf_avg_Col)) (log tf_avg_Col)))))) (+ (* (* df_max_Col tf) (/ (* (* (/ (/ (* tf 6.720) (/ df N)) (* df_max_Col tf)) (* (* tf N) (+ df_max_Col tf_avg))) (* (/ tf tf_max) (log tf_avg_Col))) (+ (* length df) (* (* (/ tf tf_max) (+ (* length df) (* 2.812 1))) tf_avg)))) (+ (/ df tf_avg) tf))))
Variable | Meaning | Type
tf | Query term frequency in the document | vector
tf_query | Query term frequency in the query | vector
tf_max | The maximum term frequency in a document | scalar
Length | Document length in the number of words | scalar
Length_avg | Average document length in the number of words | scalar
N | Number of documents in the collection | scalar
tf_avg | Average term frequency in the current document | scalar
tf_avg_Col | Average term frequency for all the documents in the collection | scalar
df_max_Col | Maximum document frequency for a word in the collection | scalar
df | Document frequency for the query words | vector
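A GP-discovered ranking function is an expression tree over the statistics listed above, so scoring a document amounts to evaluating that tree. The sketch below parses and evaluates a small prefix-notation fragment taken from the example. It is an illustration only: the evaluator, the protected division and logarithm (returning 0 on invalid input, a common GP convention assumed here), the scalar treatment of `tf`, and the feature values are all hypothetical.

```python
import math


def tokenize(source):
    """Split a prefix-notation expression into parentheses and atoms."""
    return source.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Turn a token list into nested Python lists (consumes the list)."""
    token = tokens.pop(0)
    if token == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return expr
    return token


def evaluate(expr, features):
    """Evaluate an expression tree against a dict of feature values."""
    if isinstance(expr, list):
        op, *args = expr
        vals = [evaluate(a, features) for a in args]
        if op == "log":  # protected log (assumption)
            return math.log(vals[0]) if vals[0] > 0 else 0.0
        if op == "+":
            return vals[0] + vals[1]
        if op == "*":
            return vals[0] * vals[1]
        if op == "/":  # protected division (assumption)
            return vals[0] / vals[1] if vals[1] != 0 else 0.0
        raise ValueError(f"unknown operator: {op}")
    return features[expr] if expr in features else float(expr)


# Hypothetical feature values for one query term in one document.
features = {"tf": 2.0, "tf_max": 4.0, "tf_avg_Col": math.e}
fragment = "(* (/ tf tf_max) (log tf_avg_Col))"
score = evaluate(parse(tokenize(fragment)), features)
print(score)
```

With these values the fragment evaluates to (2/4) · ln(e) = 0.5. A full ranking function would be evaluated the same way, once per query term, and the per-term values combined.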
Parser
• Flexibility
  • TREC Style SGML/HTML
  • Configurable tagging
• Abbreviation and number detection
• Case sensitive
• Phrase parsing
Interface –(I)
1. Receive user query
2. Send query to search engine
3. Get ranked list
4. Search database
5. Get document information
6. Return results to user
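[Figure: the user interacts with a Servlet (steps 1 and 6); the Servlet reaches the Search Engine over a Socket (steps 2 and 3) and the Database over JDBC (steps 4 and 5).]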
Interface –(II)
1. Receive user query through ODL’s XOAI searching protocol
2. Send query to search engine
3. Get ranked list
4. Request metadata
5. Get metadata
6. Return results in a format complying with ODL’s searching protocol
[Figure: as an ODL component, a Perl Adaptor handles incoming requests (steps 1 and 6), reaches the Search Engine over a Socket (steps 2 and 3), and requests metadata from an OAI data provider (steps 4 and 5).]
OCKHAM Initiative, Contact Info
• Supported by DL Federation, Mellon, NSF, …
• P2P University Network involving:
  • Emory, Notre Dame, U. Arizona, Virginia Tech, …
• PI: Martin Halbert
  • Phone: 404-727-2204
  • Email: [email protected]
• OCKHAM URL: http://ockham.library.emory.edu
The Problem
• Digital library development is complex and expensive.
• Various DL development communities (in the USA at least) are not working together well.
• Results exhibit much incompatibility, little common practice, slow progress, and no leverage on investment.
• If this continues, we are just going to languish and fester.
Lightweight Protocols
• “Lightweight”, or relatively small and simple protocols seem to have clear advantages over “Full” protocols that attempt to be comprehensive.
• The success of protocols considered lightweight is illuminating.
• Examples: TCP/IP, HTTP, LDAP, and the OAI PMH
Reference Models
• Reference Model: a common vocabulary and description of components, services, and inter-relationships that comprise a system under consideration
• Useful as a tool to foster consensus and common understanding in a time of rapid change and/or disagreement
• Explored in CS6604 class project with 2 focus groups: librarians, education experts
Current Focus: Peer-to-Peer (P2P) Lightweight (Protocol) Reference Models
• Builds on successful example of the OAI PMH, clearly understood minimalist concept of metadata distribution, implemented in simple protocols (e.g., ODL)
• Leads to developing simple reference models of specific subsystems, with associated simple protocols and standards
• Testing in NSDL, connecting university libraries to support teaching & learning
OCKHAM Proposed Services
• Alerting
• Browsing
• Cataloging
• Conversion
  • OAI – Z39.50
• Pathfinding
• Registry – prototype in CS6604 now
• (plus others such as from adapted ODL)
DL Student Research: Gonçalves
• 5S as a basis for developing digital libraries
• Theory
• Syntax, Semantics; Definitions, Relationships
• Specification of requirements
• Generation of systems
• Quality
Motivation for 5S
• DLs are not benefiting from formal theories as have other CS fields: DB, IR, PL, etc.
• DL construction: difficult, ad-hoc, lacking support for tailoring/customization
• Conceptual modeling, requirements analysis, and methodological approaches are rarely supported in DL development.
  • Lack of specific DL models, formalisms, languages
5S Layers
Societies
Scenarios
Spaces
Structures
Streams
5S Model: Examples, Objectives
Model | Examples | Objectives
Streams | Text; video; audio; image | Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data
Structures | Collection; catalog; hypertext; document; metadata; organization tools | Specifies organizational aspects of the DL content
Spaces | Measure; measurable, topological, vector, probabilistic | Defines logical and presentational views of several DL components
Scenarios | Searching, browsing, recommending | Details the behavior of DL services
Societies | Service managers, learners, teachers, etc. | Defines managers, responsible for running DL services; actors, that use those services; and relationships among them
Intra-Model Relationships: Streams
• Participant concepts: {text, image, video, audio}• Relations:
• $\mathit{contains} \subseteq (\mathit{video} \times \mathit{image}) \cup (\mathit{video} \times \mathit{audio})$
• Streams define the basic content types over which digital objects are built; the latter being the ultimate carriers of the information in the DL.
• However some complex types of streams (e.g., video) may themselves be associated with simpler types of streams (e.g., images, audio).
• This relation indicates that a video contains an image as one of its frames, or a specific audio recording.
[Figure: 5S ontology diagram. Concepts are grouped under Streams (text, audio, image, video), Structures, Spaces (Measure, Measurable, Metric, Probabilistic, Topological, Vector), Scenarios, and Societies. Relations shown include describes, stores, is_version_of, extends, reuses, executes, participates_in, recipient, runs, inherits_from/includes, association, uses, employs, produces, belongs_to, contains, is_a, precedes, happens_before, redefines, and invokes.]
DL Services/Activities Taxonomy (Gonçalves)
• Information Satisfaction Services: Browsing, Collaborating, Customizing, Filtering, Providing access, Recommending, Requesting, Searching, Visualizing
• Infrastructure Services
  • Add Value: Annotating, Classifying, Clustering, Evaluating, Extracting, Indexing, Measuring, Publicizing, Rating, Reviewing (peer), Surveying, Translating (language)
  • Repository-Building
    • Preservational: Conserving, Converting, Copying/Replicating, Emulating, Renewing, Translating (format)
    • Creational: Acquiring, Cataloging, Crawling (focused), Describing, Digitizing, Federating, Harvesting, Purchasing, Submitting
Services, Definitions, Parameters
• In the table each service is characterized by
  • parameters (input, output)
  • of the initial and final events
  • of the scenarios that compose those services and
  • respective pre- and post-conditions, which are represented in terms of rules on DL relations.
• All other previous definitions and keys apply here.
• That set is complemented with the following definitions:
Services Related Definitions
• A query q is the representation of user interest or information need.
• Hyptxt is a hypertext, in which anchor is a node.
• A log_entry is a descriptive metadata specification about an event of a scenario.
• Let {doi} = {doi1, doi2, …, doin} be a set of digital objects and Ct = {c1, c2, …, cn} be a set of labels for categories. A classifier classCt: {doi} → 2^Ct is a function that maps a digital object to a set of categories.
• A cluster cluk = {do1k, do2k, …, donk} is a subset of a set of digital objects.
Service | User input | Other service input | Output
Acquiring | {doi} | Ci | Cj
Browsing | anchor | Hyptxt | {doi}
Cataloging | doi, msi_k | (hi, mssi_m) | (hi, mssi_(m+k))
Classifying | {doi} | classCt, Ct | {(doi, {ck_i})}
Clustering | {doi} | X | {cluk_i}
Expanding (query) | {doi} | IC_i, qi | qj
Indexing | Ci | X | IC_i
Linking | Ci | X | Hyptxtik
Logging | X | ei({pi}) | log_entryj
Rating | doi, acj | X | {(doi, acj, rk)}
Searching | q, Ci | IC_i | {dok}
Visualizing | {doi} | tfrk | spik
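[Figure: fundamental services and their inputs and outputs. Infrastructure services (Acquiring, Authoring, Submitting, Describing, Cataloguing, Indexing, Linking) build the collection Ci, its catalog DMCi, its index Ic, and the hypertext from the universal collection. Information satisfaction services (Searching, Browsing) take a query or anchor, plus criteria and sortOrder, from an actor in a Society expressing user interests/needs, and return {doi}. Other labels: dok, mskj.]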
DL Services I/O Behavior
• Regarding the prior figure, which shows:
  • Instantiations of the “Services Definition” model
  • Inputs and outputs of examples of infrastructure and information satisfaction DL services
• Key:
  • CDL = Collection
  • ICDL = index for collection CDL
  • {doi} = set of digital objects
  • Soc = Society
[Figure: composite information satisfaction services (Recommending, Filtering, Binding, Visualizing, Expanding query) built on the fundamental services Searching and Browsing, together with Add_Value infrastructure services (Rating/Reviewing (peer), Training). An actor in a Society supplies a query or anchor with criteria and sortOrder; fundamental services return Ck, {doi}. Inputs and outputs shown: user model/expr, classifier/expr, {doj}, {doR}, {doF}, bi, spV, query’.]
Defining Quality in Digital Libraries
DL Concept | Dimensions of Quality
Digital object | Accessibility, Pertinence (*), Preservability (*), Relevance, Similarity, Significance, Timeliness (*)
Metadata specification | Accuracy, Completeness, Conformance
Collection | Completeness, Impact Factor
Catalog | Completeness, Consistency
Repository | Completeness, Consistency
Structures for Navigation | Navigability (*)
Services | Composability, Efficiency, Effectiveness, Extensibility, Reusability, Reliability
[Figure: quality dimensions placed on the information life cycle. Phase and state labels: Creation, Distribution, Discovery, Utilization, Usage, Active, Semi-Active, Inactive, Retention, Mining, Discard. Activities: Authoring, Modifying, Organizing, Indexing, Storing, Archiving, Networking, Accessing, Filtering, Searching, Browsing, Recommending. Quality dimensions shown: Reputation, Similarity, Desirability, Accuracy, Completeness, Conformance, Relevance, Timeliness, Accessibility, Preservability.]
Completeness of Metadata (1)
• Degree of completeness of a metadata specification msx
• Completeness(msx) = 1 - (no. of missing attributes in msx/ total attributes of the schema to which msx conforms)
• According to 5S definition of conformance
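The completeness measure above can be sketched as a short function. This is a hedged illustration: the record is a made-up example, an attribute is assumed to count as missing when it is absent or empty, and the schema is the 15-element Dublin Core set.

```python
# The 15 elements of unqualified Dublin Core.
DUBLIN_CORE = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def completeness(record, schema=DUBLIN_CORE):
    """1 - (missing attributes / total attributes in the schema)."""
    missing = sum(1 for attr in schema if not record.get(attr))
    return 1 - missing / len(schema)


# Hypothetical ETD record with 12 of the 15 elements filled in.
record = {attr: "some value" for attr in DUBLIN_CORE}
for attr in ("source", "relation", "coverage"):
    del record[attr]

print(round(completeness(record), 3))
```

Three of fifteen attributes are missing, so the record scores 1 - 3/15 = 0.8. Averaging this value over all records of a collection gives a per-collection figure.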
Completeness of Metadata (2)
• Example of application:
  • OCLC NDLTD Union
  • Average completeness of all metadata specifications (records)
    • of the NDLTD Union Archive
    • administered by OCLC
    • as of February 23, 2004
    • with respect to the Dublin Core metadata standard (15 attributes)
Completeness of Metadata (3)
[Figure: bar chart of average metadata completeness, on a scale from 0 to 1, for each contributing collection in the NDLTD Union Archive (e.g., LSU, VTETD, MIT, UBC, USASK, HKU, ETSU, USF, NSYSU, LAVAL).]
Collection Completeness (1)
• Defn: A complete DL collection Cx is one which contains all the pertinent existing digital objects.
• completeness(Cx) = |Cx| / |ideal collection|
  • i.e., the ratio between the size of Cx and the size of the ideal real-world collection
Collection Completeness (2)
• Example of use: computing collections
  • The ACM Guide is a collection of bibliographic references and abstracts of works published by ACM and other publishers.
  • The Guide can be considered a good approximation of an ideal computing collection: it contains most of the different types of computing-related literature (about 735K works)
Collection | Degree of Completeness
ACM Guide | 1
DBLP | 0.652
CITIDEL (DBLP + ACM + NCSTRL + NDLTD-CS) | 0.467
IEEE-DL | 0.168
ACM-DL | 0.146
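The ratio behind this table can be sketched as follows. The function name is hypothetical, and the implied collection size is only a rough back-calculation from the approximate figure of 735K works for the ACM Guide; it is not a number reported on the slides.

```python
def collection_completeness(collection_size, ideal_size):
    """Ratio between the size of a collection and the size of the ideal collection."""
    if ideal_size <= 0:
        raise ValueError("ideal collection must be non-empty")
    return collection_size / ideal_size


IDEAL_SIZE = 735_000  # approximate size of the ACM Guide, taken as the ideal collection

# The ideal collection is complete with respect to itself.
print(collection_completeness(IDEAL_SIZE, IDEAL_SIZE))

# Rough size implied by a completeness of 0.652 (the DBLP row); an estimate only.
implied_dblp_size = round(0.652 * IDEAL_SIZE)
print(implied_dblp_size)
```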
Reliability (1)
• Scope: operations of DL
• Defn: the probability that the service will not fail during a given period of time [Hansen83]
• Example of use: CITIDEL services
• Example details: using log analysis April 1
Reliability (2)
CITIDEL service | No. of failures / no. of accesses | Reliability
searching | 73/14370 | 0.994
browsing | 4130/153369 | 0.973
requesting (getobject) | 1569/318036 | 0.995
structured search | 214/752 | 0.66
contributing | 0/980 | 1
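One plausible way to estimate reliability from log counts is 1 minus the failure ratio. The sketch below assumes that reading; the slides do not state the exact computation, so the printed values in the table may reflect a different rounding or method. The counts come from the searching and contributing rows.

```python
def reliability(failures, accesses):
    """Estimate reliability as 1 - failures/accesses (an assumed reading of the table)."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return 1 - failures / accesses


searching = reliability(73, 14370)
contributing = reliability(0, 980)
print(f"searching: {searching:.4f}")
print(f"contributing: {contributing:.4f}")
```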
Extensibility, Reusability (1)
• Scope: Design and Implementation of DL services
• Two main classes
  1. Composability of services:
    • Extensibility
    • Reusability
  2. Quality aspects of models and implementations:
    • completeness, consistency, correctness, soundness
Extensibility, Reusability (2)
• Micro-Reusability(Serv) = $\dfrac{\sum_{sm_x \in SM,\ se_i \in Serv,\ sm_x\ \text{runs}\ se_i} LOC(sm_x) \cdot reused(se_i)}{\left|\sum_{sm \in SM} LOC(sm)\right|}$
  • where LOC corresponds to the number of lines of code of a service manager
• Macro-Reusability(Serv) = $\dfrac{\sum_{se_i \in Serv} reused(se_i)}{|Serv|}$, where reused is an indicator function defined as:
  • 1, if $\exists\, se_j$: $se_j$ reuses $se_i$;
  • 0, otherwise
Extensibility, Reusability (3)
• Example: ETANA-DL
• Consider:
  • Services
  • Use of existing ODL component
  • Lines of Code (LOC)
    • Reused from component
    • Added for implementation
Service | Component Based | LOC for implementing service | LOC reused from component | Total LOC
Searching – Back-end | Yes | - | 1650 | 1650
Search Wrapping | No | 100 | - | 100
Recommending | Yes | - | 700 | 700
Recommend Wrapping | No | 200 | - | 200
Annotating – Back-end | Yes | 50 | 600 | 600
Annotate Wrapping | No | 50 | - | 50
Union Catalog | Yes | - | 680 | 680
User Interface Service | No | 1800 | - | 1600
Browsing | No | 1390 | - | 1390
Comparing (objects) | No | 650 | - | 650
Marking Items | No | 550 | - | 550
Items of Interest | No | 480 | - | 480
Recent Searches/Discussions | No | 230 | - | 230
Collections Description | No | 250 | - | 250
User Management | No | 600 | - | 600
Framework Code | No | 2000 | - | 2000
Total | | 8280 | 3630 | 11910
Extensibility, Reusability (5)
• Macro-Reusability(ETANA DL Services)
  • = 3/13 = 0.23
  • only a few important services are componentized
• Micro-Reusability
  • = 3630/11910 = 0.304
  • we can re-use a very significant percentage of DL code by implementing common DL services as components
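The two ratios can be reproduced with a short sketch. It uses only the counts quoted on this slide (3 reused services out of 13; 3630 reused lines out of 11910) and the per-component reused LOC values from the ETANA-DL table; the function names are hypothetical.

```python
def macro_reusability(reused_services, total_services):
    """Fraction of services that are reused."""
    return reused_services / total_services


def micro_reusability(reused_loc, total_loc):
    """Fraction of the code base that comes from reused components."""
    return reused_loc / total_loc


# Reused LOC per component-based service, as listed in the ETANA-DL table.
reused_loc_per_component = [1650, 700, 600, 680]
reused_loc = sum(reused_loc_per_component)

macro = macro_reusability(3, 13)
micro = micro_reusability(reused_loc, 11910)
print(f"macro = {macro:.2f}, micro = {micro:.2f}")
```

The reused-LOC column sums to the 3630 lines used in the micro-level ratio.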
Review of Gonçalves Achievements in Past Year
• Book Chapters
1. Fox, E. A., Gonçalves, M. A., Luo, M., Chen, Y., Krowne, A., Zhang, B., McDevitt, K., Pérez-Quiñones, M., Cassel, L. N. Harvesting: Broadening the Field of Distributed Information Retrieval. In Multimedia Distributed Information Retrieval, eds. Fabio Crestani, Mark Sanderson, and Jamie Callan, 2003.
2. Fox, E., McMillan, G., Suleman, H., Gonçalves, M. Networked Digital Library of Theses and Dissertations. Invited chapter for “Digital Libraries: Policy, Planning, and Practice”, eds. Judith Andrews and Derek Law, Ashgate Publishing, 2003.
• Journal papers
1. 5S TOIS paper (April 2004 issue)
2. S. Perugini, M. A. Gonçalves, and E. A. Fox. A Connection-Centric Survey of Recommender Systems Research. Journal of Intelligent Information Systems, Jun, 2004.
3. Zhu, Q., Gonçalves, M. A., Fox, E. A. 5SGraph: A Domain-Specific Visual Modeling Tool for Digital Libraries. Journal of the American Society for Information Science and Technology, submitted 2003, in revision.
4. Baoping Zhang, Marcos Andre Goncalves, Yuxin Chen, Edward A. Fox, and Pavel Calado. Combining Support Vector Machines and Structural Rules for Effective Filtering of OAI-Based Repositories. Submitted to Journal of Digital Libraries (Springer Verlag) Special Issue on Asian Digital Libraries, 2004.
• Conference papers
1. Pável P. Calado, Marcos André Gonçalves, Edward A. Fox, Berthier Ribeiro-Neto, Alberto H. F. Laender, Altigran S. da Silva, Davi C. Reis, Pablo A. Roberto, Monique V. Vieira, and Juliano P. Lage. The Web-DL Environment for Building Digital Libraries from the Web. JCDL'2003, Third Joint ACM / IEEE-CS Joint Conference on Digital Libraries, May 27-31, 2003, Houston.
2. Marcos André Gonçalves, Ganesh Panchanathan, Unnikrishnan Ravindranathan, Aaron Krowne, Edward A. Fox, Filip Jagodzinski, and Lillian Cassel. The XML Log Standard for Digital Libraries: Analysis, Evolution, and Deployment. Proc. JCDL'2003, Third Joint ACM / IEEE-CS Joint Conference on Digital Libraries, May 27-31, 2003, Houston.
3. Qinwei Zhu, Marcos André Gonçalves, Rao Shen, Lillian Cassel, Edward A. Fox. Visual Semantic Modeling of Digital Libraries. ECDL'2003, 7th European Conference on Research and Advanced Technology for Digital Libraries, 17-22 August, 2003, Trondheim, Norway.
4. Rohit Kelapure, Marcos André Gonçalves, Edward A. Fox. Scenario-Based Generation of Digital Library Services. ECDL'2003, 7th European Conference on Research and Advanced Technology for Digital Libraries, 17-22 August, Trondheim, Norway.
5. Marco Cristo, Pavel Calado, Edleno Moura, Nivio Ziviani, Berthier Ribeiro-Neto, and Marcos André Gonçalves. Combining Link-Based and Content-Based Methods for Web Document Classification. CIKM 2003, 3-8 November, New Orleans, Louisiana, USA, 2003.
6. Baoping Zhang, Marcos Andre Goncalves, and Edward A. Fox. An OAI-based Filtering Service for CITIDEL from NDLTD. ICADL 2003, 6th International Conference of Asian Digital Libraries, 8-11 December, Kuala Lumpur, Malaysia, 2003.
7. U. Ravindranathan, R. Shen, M. A. Goncalves, W. Fan, E. A. Fox, and J. W. Flanagan. ETANA-DL: A Digital Library for Integrated Handling of Heterogeneous Archaeological Data. To be presented at ACM-IEEE Joint Conference on Digital Libraries (JCDL 2004), Tucson, AZ, June 7-11, 2004.
8. M. A. Goncalves, E. A. Fox, A. Krowne, P. Calado, A. H. F. Laender, A. S. da Silva, and B. Ribeiro-Neto. The Effectiveness of Automatically Structured Queries in Digital Libraries. To be presented at ACM-IEEE Joint Conference on Digital Libraries (JCDL 2004), Tucson, AZ, June 7-11, 2004.
9. Alberto H. F. Laender, M. A. Goncalves, Pablo A. Roberto. BDBComp: Building a Digital Library for the Brazilian Computer Science Community. To be presented at ACM-IEEE Joint Conference on Digital Libraries (JCDL 2004), Tucson, AZ, June 7-11, 2004.
10. U. Ravindranathan, R. Shen, M. A. Goncalves, W. Fan, E. A. Fox, and J. W. Flanagan. Prototyping Digital Libraries Handling Heterogeneous Data Sources - The ETANA-DL Case Study. European Conference on Digital Libraries (ECDL 2004), Bath, UK, September 12-17, 2004. (submitted)
• Other publications
1. R. da S. Torres, C. B. Medeiros, M. A. Goncalves, and E. A. Fox. An OAI-based Digital Library Framework for Biodiversity Information Systems. Department of Computer Science, Virginia Tech, Technical Report No. TR-04-01, 2004.
2. R. da S. Torres, C. B. Medeiros, M. A. Goncalves, and E. A. Fox. An OAI Compliant Content-Based Image Search Component. Demo to be presented at ACM-IEEE Joint Conference on Digital Libraries (JCDL 2004), Tucson, AZ, June 7-11, 2004.
3. R. da S. Torres, C. B. Medeiros, Renata Q. Dividino, Mauricio A. Figueiredo, M. A. Goncalves, E. A. Fox, and R. Richardson. Using Digital Library Components for Biodiversity Systems. Poster to be presented at ACM-IEEE Joint Conference on Digital Libraries (JCDL 2004), Tucson, AZ, June 7-11, 2004.
4. U. Ravindranathan, R. Shen, M. A. Goncalves, W. Fan, E. A. Fox, and J. W. Flanagan. ETANA-DL: Managing Complex Information Applications – An Archaeology Digital Library. Demo to be presented at ACM-IEEE Joint Conference on Digital Libraries (JCDL 2004), Tucson, AZ, June 7-11, 2004.
5. Qinwei Zhu, Marcos André Gonçalves, E. Fox. 5SGraph Demo: A Graphical Modeling Tool for Digital Libraries. Proc. JCDL'2003, Third Joint ACM / IEEE-CS Joint Conference on Digital Libraries, May 27-31, 2003, Houston.
Proposed Outline of Dissertation(Marcos André Gonçalves)
• Chapter 1 – Introduction and Motivation
• Chapter 2 – Background and Related Work
• Chapter 3 – Streams, Structures, Spaces, Scenarios and Societies: the 5S Formal Model for Digital Libraries
• Chapter 4 – Towards a Digital Library Theory: A Formal Digital Library Ontology based on 5S
• Chapter 5 – Applications of the 5S Model/Ontology
  • 5.1 Declarative Specification of DLs: the 5S Language
  • 5.2 Semantic Visual Modeling of DLs: the 5SGraph Tool
  • 5.3 (Semi-) Automatic Generation of Componentized DLs: The 5SGen Tool
  • 5.4 Evaluating DLs: The XML Log Standard for DLs
  • 5.5 Formally comparing Architectures: Fedora and Buckets (time permitting)
• Chapter 6 – Defining Quality in Digital Libraries
• Chapter 7 – Conclusions and Future Work
• Appendix 1 – Mathematical Preliminaries
Questions/Discussion?