Page 1: Radioactive Metadata Records An Interoperability Testing Approach Based on Metadata Utilization William E. Moen School of Library and Information Sciences

Radioactive Metadata Records

An Interoperability Testing Approach Based on Metadata Utilization

William E. Moen <[email protected]>

School of Library and Information Sciences
Texas Center for Digital Knowledge

University of North Texas
Denton, TX 72603

Access 2005, October 18, 2005, Edmonton, Alberta

Moen Access 2005 -- Edmonton, Alberta -- October 18, 2005 2

IMLS funded projects
U.S. Federal Institute of Museum and Library Services
  Z39.50 Interoperability Testbed, Phases 1 & 2
    Improve Z39.50 semantic interoperability among libraries for information access and resource sharing
    Establish and operate a testbed for interop testing of Z39.50 clients and servers with library catalogs (Phase 1: 2000-2003)
    Explore alternative approach using radioactive MARC records (Phase 2: 2004-2005)
  MARC Content Designation Utilization (2005-2006)
    Provide empirical evidence of MARC content designation use
    Explore the evolution of MARC content designation
    Develop methodological approach to understand the factors contributing to current levels of MARC content designation use

Factors affecting interoperability
  Multiple and disparate systems
    Information retrieval systems, search functionality, etc.
  Multiple protocols
    Z39.50, HTTP, SOAP, SRW/U, etc.
  Multiple data formats, syntax, metadata schemes
    MARC 21, UNIMARC, XML, ISBD/AACR2-based, Dublin Core
  Multiple vocabularies, ontologies, disciplines
    LCSH, MESH, AAT
  Multiple languages, multiple character sets
  Indexing, word normalization, and word extraction policies

Z-Interop Phase 1
  Test dataset: 400,000 MARC 21 records from OCLC
  Z39.50 reference implementations
    Z-client, Z-server, information retrieval system
    Configured to the profile specifications
  Test scenarios & searches
    Searches with known result records from dataset
  Benchmarks
    Results of test searches against reference implementations
  Finding: Interoperability improved dramatically using profile specs and common indexing policies
  Issue: Approach not suitable to interop testing for individual, local library systems

Phase 1 interop testing
[Diagram. Labels: Reference Z39.50 Client; Vendor Z39.50 Server; Configured to Support Profile Specifications; Configured by Vendor for Conformance to Profile; Indexed by Vendor According to Vendor’s Specifications; Test Dataset Loaded by Vendor or Library; Test Searches; Retrieval Results Compared to Retrieval Benchmarks]
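The comparison of retrieval results with retrieval benchmarks can be pictured as a set comparison. The following is a minimal Python sketch, assuming both are available as collections of record identifiers. The function name and the sample identifiers are hypothetical and are not part of the Z-Interop tools.

```python
def compare_to_benchmark(retrieved, benchmark):
    """Compare one test search's result set with its benchmark result set."""
    retrieved, benchmark = set(retrieved), set(benchmark)
    return {
        # Records the benchmark expects but the tested server did not return
        "missing": sorted(benchmark - retrieved),
        # Records the tested server returned but the benchmark does not list
        "unexpected": sorted(retrieved - benchmark),
        "match": retrieved == benchmark,
    }


# Made-up identifiers, for illustration only
report = compare_to_benchmark(
    retrieved=["rec-001", "rec-002", "rec-004"],
    benchmark=["rec-001", "rec-002", "rec-003"],
)
print(report)
```

Keeping "missing" and "unexpected" separate shows whether a server is under-retrieving or over-retrieving, which a single pass/fail flag would hide.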

Z-Interop Phase 2
  Specially designed MARC records, Radioactive MARC Records
    Concept coined by Sebastian Hammer, Index Data
  A set of test searches and automatic testing script that issues searches, retrieves records, and develops reports on the search and retrieval results
    Developed by Index Data
    Will be released under GPL
  A database of MARC documentation that enables the automatic identification of types of searches to issue
    Developed by UNT

Radioactive MARC records
  Specially designed diagnostic records
  Legitimate instance of MARC record structure
  Fields/subfields contain content-rich tokens
    A token is a string of characters that has a specific structure and semantics that will serve as “words” or other data values in specific fields/subfields.
  Multiple sets of RadMARC records, distinguished by the amount of content designation populated

Structure of RadMARC tokens
  A single alpha character for left-hand padding
    Value = r
  A single alpha character to indicate the format of the material being described or type of record
    Value = Selected values as defined in MARC Leader/06 – Type of Record or the Leader/07 – Bibliographic Level
  Three numbers indicating the Field Tag
    Value = Defined in MARC 21 specifications
  A single integer to indicate the number of the occurrence of the Field Tag
    Value = Sequential number starting with 1
  A single alpha character to indicate the Subfield Code
    Value = Defined in MARC 21 specifications
  A single integer indicating the offset within subfield
    Value = Use the following scheme: 1 = first token in subfield, 2 = second token in subfield, 3 = third token in subfield, etc.
  A single alpha character for right-hand padding
    Value = r
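The seven-part layout above can be assembled mechanically. The following is a minimal Python sketch of that assembly, not the project's own generator. The function name, parameter names, and validation rules are assumptions drawn only from the description above (single letters, a three-digit tag, single-digit counters).

```python
def build_token(record_type: str, field_tag: str, occurrence: int,
                subfield_code: str, offset: int) -> str:
    """Assemble a RadMARC-style token from its parts (illustrative only)."""
    if not (len(record_type) == 1 and record_type.isascii() and record_type.isalpha()):
        raise ValueError("record_type must be a single letter")
    if not (len(field_tag) == 3 and field_tag.isascii() and field_tag.isdigit()):
        raise ValueError("field_tag must be three digits")
    if not (len(subfield_code) == 1 and subfield_code.isascii() and subfield_code.isalpha()):
        raise ValueError("subfield_code must be a single letter")
    # The description allots one integer each to occurrence and offset
    for name, value in (("occurrence", occurrence), ("offset", offset)):
        if not 1 <= value <= 9:
            raise ValueError(f"{name} must be a single digit from 1 to 9")
    # Left padding, record type, tag, occurrence, subfield, offset, right padding
    return f"r{record_type}{field_tag}{occurrence}{subfield_code}{offset}r"


print(build_token("a", "245", 1, "a", 1))  # ra2451a1r
```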

Token example
  ra2451a1r
    r - Left-hand padding
    a - Type of record (this is a books type record)
    245 - Field code
    1 - First occurrence of field in record
    a - Subfield code
    1 - Offset within subfield, where 1 = first token in subfield
    r - Right-hand padding
  RadMARC example record
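Because each token describes its own location, a token that comes back from a search can be decoded to show which field and subfield the target system indexed. The following is a minimal Python sketch of such a decoder. The regular expression and the function name are assumptions based on the token breakdown above, not code from the project.

```python
import re

# One group per part of the token: padding "r", record type, field tag,
# occurrence, subfield code, offset, padding "r".
TOKEN_PATTERN = re.compile(
    r"r(?P<record_type>[a-z])"
    r"(?P<field_tag>\d{3})"
    r"(?P<occurrence>\d)"
    r"(?P<subfield_code>[a-z])"
    r"(?P<offset>\d)r"
)


def parse_token(token):
    """Return the parts of a RadMARC-style token, or None if it does not fit."""
    match = TOKEN_PATTERN.fullmatch(token)
    if match is None:
        return None
    parts = match.groupdict()
    parts["occurrence"] = int(parts["occurrence"])
    parts["offset"] = int(parts["offset"])
    return parts


print(parse_token("ra2451a1r"))
```

For the example token this yields record type a, field tag 245, occurrence 1, subfield code a, and offset 1, matching the breakdown on the slide.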

Test scripts
  Automate interoperability testing and reporting
  Test searches defined by Bath Profile and US National Z39.50 Profile
  RadioMARC Perl module
    Automatically generates Z39.50 queries with tokens as search terms
    Sends searches to target servers known to contain copies of specific records
    Generates reports dependent on whether or not the expected records are present in the result set
    Available for use: Net-Z3950-RadioMARC-0.06
  Sample output of testing
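The query-generation and reporting steps can be illustrated without a network connection. The sketch below is written in Python and is not the API of the RadioMARC Perl module. It formats a token as a query in PQF, the prefix query notation used by Index Data's YAZ toolkit, and sets only the Bib-1 Use attribute, whereas the profiles also specify other attribute types. The function names, the record identifier, and the report layout are hypothetical.

```python
# Bib-1 Use attribute values for a few common access points
BIB1_USE = {"author": 1003, "title": 4, "subject": 21, "any": 1016}


def pqf_query(access_point, token):
    """Format a PQF query that searches one access point for one token."""
    return f"@attr 1={BIB1_USE[access_point]} {token}"


def report_line(query, expected_id, result_ids):
    """Report whether the record known to hold the token was retrieved."""
    status = "PASS" if expected_id in result_ids else "FAIL"
    return f"{status}  {query}  expected={expected_id}"


query = pqf_query("title", "ra2451a1r")
print(query)
# Result identifiers would come from the target server; these are made up.
print(report_line(query, "radmarc-0001", {"radmarc-0001"}))
print(report_line(query, "radmarc-0001", set()))
```

A failed title search on a 245 $a token suggests that the target either does not index that subfield for title searches or normalizes the token in a way that prevents a match.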

MARCdocs database
  Pilot effort aimed at structuring MARC 21 documentation into a relational database
  Stores information about all content designation available in the MARC 21 Format for Bibliographic Data specifications
  Stores additional information about profile-defined searches necessary to the automatic test scripts
  Implementation uses MySQL and PHP
  Example display from MARCdocs
  MARCdocs Database
  Special data in RadioMARCdocs
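How a relational store of content designation could tell a test script which fields a search should reach can be sketched as follows. This is not the MARCdocs schema: the table names, column names, and rows are hypothetical, and Python's built-in sqlite3 stands in for MySQL and PHP so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE content_designation (
    tag            TEXT NOT NULL,
    subfield_code  TEXT NOT NULL,
    label          TEXT NOT NULL,
    PRIMARY KEY (tag, subfield_code)
);
CREATE TABLE search_mapping (
    search_type    TEXT NOT NULL,
    tag            TEXT NOT NULL,
    subfield_code  TEXT NOT NULL,
    FOREIGN KEY (tag, subfield_code)
        REFERENCES content_designation (tag, subfield_code)
);
""")

# A few illustrative rows; a full database would hold every field/subfield.
conn.executemany(
    "INSERT INTO content_designation VALUES (?, ?, ?)",
    [("100", "a", "Personal name"),
     ("245", "a", "Title"),
     ("650", "a", "Topical term or geographic name entry element")],
)
conn.executemany(
    "INSERT INTO search_mapping VALUES (?, ?, ?)",
    [("author", "100", "a"), ("title", "245", "a"), ("subject", "650", "a")],
)


def fields_for_search(search_type):
    """List the fields/subfields a given search type is expected to cover."""
    rows = conn.execute(
        "SELECT tag, subfield_code FROM search_mapping "
        "WHERE search_type = ? ORDER BY tag, subfield_code",
        (search_type,),
    )
    return [f"{tag} ${code}" for tag, code in rows]


print(fields_for_search("title"))
```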

Question space for Z-Interop2
  Profile conformance level: Addresses the interoperability between the Z-client and Z-server
  Information retrieval (IR) system level: Addresses the capability of the IR system underlying the online catalog application (e.g., types of searching)
  Metadata record level: Concerned with how the IR system indexes fields in the metadata record
  Data content level: Addresses normalization of data, hyphenated words, special characters and diacritics, etc.
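The data content level is where two systems holding the same record can still disagree. The following Python sketch shows one possible normalization policy, with a switch for hyphen handling to show how a small policy difference changes the index terms. The function and its parameter are hypothetical and do not describe any particular library system.

```python
import unicodedata


def normalize_term(text, hyphen_replacement=" "):
    """Lowercase, strip diacritics, and split a string into index terms."""
    # Decompose accented characters, then drop the combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # A hyphen either separates two words or is removed to join them
    return stripped.lower().replace("-", hyphen_replacement).split()


print(normalize_term("Éducation-Based"))                 # ['education', 'based']
print(normalize_term("on-line", hyphen_replacement=""))  # ['online']
```

A system that splits on hyphens will not match a search term from a system that joins them, which is the kind of behavior a purpose-built diagnostic record can expose.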

So far, so good…
  Verified procedures and test scripts with the Z-Interop reference implementation server
  Completed testing with local library
    Loaded RadMARC records successfully
    Used the test script and procedures to issue searches
  Created two sets of RadMARC records

RadMARC record sets
  What content designation should be populated in RadMARC records to support interoperability testing?
    MARC 21 defines approximately 2,000 structures for holding data
  Z-Interop2 approach
    Develop multiple RadMARC record sets
    Increasing amount of content designation populated
    Informed by MARC content designation analysis

Analysis of the Z-Interop test dataset
  Approximately 1% sample of MARC records from OCLC’s WorldCat database
  419,657 total MARC records
  89% of records “full level” cataloging
  Formats represented in test dataset
    Books: 91%
    Cartographic Materials: < 1%
    Electronic resources: < 1%
    Archival/Mixed Materials: < 1%
    Sound recordings: 4%
    Visual Materials: 1%
    Serials: 3%

MARC 21 content designation

MARC 21 Field Groups   Currently Defined   Obsolete   Total   MARC 1972 (Books Format Only)
00x                                    6          1       7                               3
0xx                                  238          7     245                              28
1xx                                   66          1      67                              40
2xx                                  137         32     169                              15
3xx                                  109         32     141                               4
4xx                                   69          0      69                              37
5xx                                  323         38     361                               8
6xx                                  184          5     189                              66
7xx                                  452         47     499                              41
8xx                                  141         20     161                              36
TOTAL                               1725        183    1908                             278

Fields populated in Z-Interop dataset

MARC 21 Field Groups   Currently Defined   Obsolete   Unlikely Used   Total
00x                                    6          0               0       6
0xx                                   96          1              33     130
1xx                                   49          0               2      51
2xx                                   81          0              19     100
3xx                                   23          6               0      29
4xx                                   10          0              30      40
5xx                                  128          1               3     132
6xx                                  104          1               7     112
7xx                                  205          0               5     210
8xx                                  105          3               8     116
TOTAL                                807         12             107     926

Occurrence summary
  Total number of fields/subfields instances in dataset = 13,849,499
  Only 4% of all fields/subfields account for 80% of all occurrences
  or
  96% of all fields/subfields account for 20% of all occurrences

Frequency             # of Fields/Subfields   % of All Occurrences
> 600,000                                 1                   4.4%
500,000 to 599,999                        0                     0%
400,000 to 499,999                       13                  39.9%
300,000 to 399,999                        6                  14.3%
200,000 to 299,999                        6                  10.6%
100,000 to 199,999                       10                  10.3%
TOTAL                                    36                  79.5%
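The concentration shown in the table comes from ranking fields/subfields by occurrence and accumulating until a target share is reached. The following is a minimal Python sketch of that calculation. The function name is hypothetical and the counts are made-up illustration values, not figures from the Z-Interop dataset.

```python
def smallest_covering_set(counts, target=0.80):
    """Pick the most frequent keys until they cover the target share."""
    total = sum(counts.values())
    covered = 0
    selected = []
    # Walk the keys from most to least frequent
    for key, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target * total:
            break
        selected.append(key)
        covered += n
    return selected, covered / total


# Made-up occurrence counts, for illustration only
counts = {"650 $a": 500, "245 $a": 300, "260 $a": 100, "500 $a": 60, "020 $a": 40}
selected, share = smallest_covering_set(counts)
print(selected, share)
```

Here two of the five keys reach the 80% target. Run over real occurrence counts, the same procedure would yield a short list of heavily used fields/subfields like the 36 summarized above.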

Characteristics of top 36
  Most frequently occurring: 650 $a [Subject data]
  2nd most frequently occurring: 040 $d [Cataloging source]
  3rd & 4th most frequently occurring: 260 $a & $b [Publication information]
  5th most frequently occurring: 245 $a [Title]
  List of top 36 fields

Indexing & MARC
  Indexing Guidelines to Support Z39.50 Profile Searches (available on Z-Interop website)
  Identified all MARC 21 fields/subfields that can contain author, title, or subject data
    Author-related fields/subfields: 119
    AuthorTitle-related fields/subfields: 21
    Title-related fields/subfields: 253
    Subject-related fields/subfields: 144

Occurrences in test dataset
  537 fields/subfields can contain author, title, subject data
  381 of these actually occur in Z-Interop dataset
  Total occurrences of the 381 = 4,397,712
  19 of the 381 (5%) account for 80% of all occurrences
    9 of 19 are subject-related
    5 of 19 are author-related
    5 of 19 are title-related
  Preliminary testing using only 19 indexed fields: 95% - 100% of correct records retrieved!
  The 19 fields/subfields

Initial RadMARC sets
  Set 1
    10 records
    Populate 19 most frequently occurring Author, Title, Subject fields
    Distinguished by types of materials cataloged
  Set 2
    4 records (100, 110, 111, 130 main entry fields)
    Populate the Author, Title, Subject fields occurring 1000 or more times (approximately 50 fields/subfields populated)
  Sample Set 2 RadMARC Record

Extensibility of RadMARC
  Records can be as simple or as complex as needed
  Custom records for a library that wants specific assessment of indexing or other policies to interrogate system behavior
  Assess normalization of characters
  Testing transformation from one metadata scheme to another
    MARC Record
    MARCXML Transformation
    MODS Transformation
    DC Transformation
  Other metadata environments?
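Transformation testing can use the same tokens: any token present in the source record but absent from the transformed record marks data the crosswalk dropped. The following Python sketch shows that check. The two XML snippets are made-up examples, the Dublin Core output stands in for whatever a transformation under test would produce, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up MARCXML source carrying tokens in 245 $a and 650 $a
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="0" ind2="0">
    <subfield code="a">ra2451a1r ra2451a2r</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2="0">
    <subfield code="a">ra6501a1r</subfield>
  </datafield>
</record>"""

# Hypothetical transformation output that kept the title but lost the subject
DC_OUTPUT = """<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>ra2451a1r ra2451a2r</dc:title>
</metadata>"""


def tokens_in(xml_text):
    """Collect every whitespace-separated string found in the element text."""
    root = ET.fromstring(xml_text)
    return {tok for text in root.itertext() for tok in text.split()}


def lost_tokens(source_xml, target_xml):
    """Tokens present in the source record but missing after transformation."""
    return sorted(tokens_in(source_xml) - tokens_in(target_xml))


print(lost_tokens(MARCXML, DC_OUTPUT))  # ['ra6501a1r']
```

Since the lost token names field 650, subfield a, the report points directly at the mapping rule that needs attention.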

MCDU Project
  Systematically analyze WorldCat records
  Provide empirical evidence of catalogers’ use of MARC content designation
  Contribute to community discussion about core elements in MARC bibliographic records – based on empirical evidence of actual use
  Inform future sets of RadMARC records
  Initial analysis

FOR MORE INFORMATION, VISIT THE PROJECT WEBSITE…
  http://www.mcdu.unt.edu

Concluding thoughts
  Exploring an innovative conceptual and technical approach for interoperability testing
  Conducting a proof-of-concept for a radioactive record approach for diagnosing interoperability factors in an identified question space
  Extensible in terms of the current focus
    Creating different sets of RadMARC records to diagnose general or specific system and interoperability issues
  Extensible to other application environments, metadata schemes, and protocols

References
  Z39.50 Interoperability Testbed
    http://www.unt.edu/zinterop/
  MARC Content Designation Utilization Project
    http://www.mcdu.unt.edu/
  Assessing Metadata Utilization: An Analysis of MARC Content Designation Use
    http://www.unt.edu/wmoen/publications/MARCPaper_Final2003pdf.pdf
  Indexing Guidelines to Support Z39.50 Profile Searches
    http://www.unt.edu/zinterop/Documents/IndexingGuidelines1Feb2002.pdf
  MARCdocs Database (public interface)
    http://meta.lis.unt.edu/MARCdocs2