MARC Content Designation Use: Implications for indexing & interoperability
William E. Moen <[email protected]>
School of Library and Information Sciences
Texas Center for Digital Knowledge
University of North Texas
Denton, TX 76203
South Central Unicorn Users Group Annual Conference, October 17, 2003 Austin, Texas
Moen South Central Unicorn Users Group Annual Conference -- Austin, Texas -- October 17, 2003 2
Overview
  Context for the analysis -- interoperability
  Findings from the analysis
  Indexing and MARC
  Discussion
Context for the analysis
  Interoperability across library online catalogs
  Indexing of MARC records to support searching
  Richness of MARC content designation available
  Indexing guidelines prepared for the Z39.50 Interoperability Testbed (Z-Interop)
  Implications for indexing guidelines and policies
Interoperability
Systems and organizations will interoperate!
  "One should actively be engaged in the ongoing process of ensuring that the systems, procedures and culture of an organisation are managed in such a way as to maximise opportunities for exchange and re-use of information, whether internally or externally."
    Paul Miller, 2000
Factors affecting interoperability
  Multiple and disparate systems
    operating systems, information retrieval systems, etc.
  Multiple protocols
    Z39.50, HTTP, SOAP, etc.
  Multiple data formats, syntax, metadata schemes
    MARC 21, UNIMARC, XML, ISBD/AACR2-based, Dublin Core
  Multiple vocabularies, ontologies, disciplines
    LCSH, MeSH, AAT
  Multiple languages and character sets
  Indexing, word normalization, and word extraction policies
Libraries as Focal Community
  Information communities
    Community agreements exist (e.g., standards, rules, etc.)
    Interoperability factors reduced
    Interoperability more easily achieved
  Relative homogeneity of data and systems
    Standards-based MARC records
    Content and structure prescribed by AACR
    Commonly understood access points
    Use of controlled vocabularies
  Do we need additional agreements regarding indexing policies to improve interoperability?
Interoperability testbed project
  Realizing the Vision of Networked Access to Library Resources: An Applied Research and Demonstration Project to Establish and Operate a Z39.50 Interoperability Testbed
  An Institute of Museum and Library Services National Leadership Grant
  Goal: Improve Z39.50 semantic interoperability among libraries for information access and resource sharing
  For more information, visit the project website: http://www.unt.edu/zinterop/
Threats to Z39.50 interoperability
  Differences in implementation of the standard
  Differences in local information retrieval systems
    Search functionality
    Indexing policies
  These threats can be addressed by
    Z39.50 specifications and configuration (i.e., profiles)
    Enhancing local information retrieval systems
    Recommendations for local indexing decisions
Components of the testbed
  Test dataset
    400,000+ MARC 21 records from OCLC’s WorldCat
  Z39.50 reference implementations
    Z-client (Bookwhere), Z-server & information retrieval system (Sirsi Unicorn)
  Test scenarios & searches
    Searches with known result records from dataset
  Benchmarks
    Results of test searches using reference implementations
MARC
  Record structure for encoding data for machine processing
  Standard structure (ANSI/NISO Z39.2/ISO 2709)
    Leader
    Directory map
    3-digit tag to identify a field
    2 indicator values to provide additional processing information
    1 or more delimiters/codes to identify subfields
  Content designation: Semantics
    MARC 21
    245 00 $a [title] $h [format] : $b [subtitle]
  Rules
    Anglo-American Cataloguing Rules and others
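The structure listed above (leader, directory, tagged fields, indicators, subfield codes) can be illustrated with a minimal Python sketch. It assumes the usual ISO 2709 layout: a 24-character leader whose positions 12-16 hold the base address of data, and 12-character directory entries (3-character tag, 4-digit field length, 5-digit starting position). The functions `build_record` and `parse_record` are hypothetical helpers written for this illustration, the sample record is abbreviated, and only ASCII data is handled so that character counts equal byte counts.

```python
FT, RT, SF = "\x1e", "\x1d", "\x1f"  # field terminator, record terminator, subfield delimiter

def build_record(fields):
    """Serialize (tag, data) pairs into a tiny ISO 2709-style record (ASCII only)."""
    directory, body = "", ""
    for tag, data in fields:
        chunk = data + FT
        # Directory entry: tag (3) + field length (4) + start position (5)
        directory += f"{tag}{len(chunk):04d}{len(body):05d}"
        body += chunk
    base = 24 + len(directory) + 1      # leader + directory + directory terminator
    total = base + len(body) + 1        # plus the record terminator
    leader = f"{total:05d}cam  22{base:05d}   4500"
    return leader + directory + FT + body + RT

def parse_record(raw):
    """Walk the directory and return the leader plus a list of parsed fields."""
    leader = raw[:24]
    base = int(leader[12:17])           # base address of data
    directory = raw[24:base - 1]
    fields = []
    for i in range(0, len(directory), 12):
        tag = directory[i:i + 3]
        length = int(directory[i + 3:i + 7])
        start = int(directory[i + 7:i + 12])
        data = raw[base + start:base + start + length - 1]  # drop field terminator
        if tag < "010":                 # control fields: no indicators or subfields
            fields.append((tag, None, data))
        else:
            indicators, *subs = data.split(SF)
            fields.append((tag, indicators, [(s[0], s[1:]) for s in subs]))
    return leader, fields

raw = build_record([
    ("001", "ocm00000003"),
    ("245", "10" + SF + "aIllegitimacy and adoption in Maine :" + SF + "breport of a study."),
])
leader, fields = parse_record(raw)
for field in fields:
    print(field)
```

Production systems would normally use an established MARC library and handle byte-level lengths for non-ASCII data; the sketch only shows how the directory maps tags to field data.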
MARC 21 content designation
  MARC 21         Currently                        MARC 1972
  Field Groups    Defined     Obsolete    Total    (Books Format Only)
  00x                   6            1        7        3
  0xx                 238            7      245       28
  1xx                  66            1       67       40
  2xx                 137           32      169       15
  3xx                 109           32      141        4
  4xx                  69            0       69       37
  5xx                 323           38      361        8
  6xx                 184            5      189       66
  7xx                 452           47      499       41
  8xx                 141           20      161       36
  TOTAL              1725          183     1908      278
Z-Interop test dataset
  Approximately 1% sample of MARC records from OCLC’s WorldCat database
  Weighted sampling based on number of libraries “holding” the object represented by the record
  419,657 total MARC records
  89% of records “full level” cataloging
  Formats represented in test dataset
    Books: 91%
    Cartographic Materials: < 1%
    Electronic resources: < 1%
    Archival/Mixed Materials: < 1%
    Sound recordings: 4%
    Visual Materials: 1%
    Serials: 3%
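The slide does not say how the weighted sample was drawn. As one possible illustration only, the sketch below draws a sample without replacement in which records held by more libraries are more likely to be selected, using the Efraimidis-Spirakis key method (each record gets the key u^(1/w) for a uniform random u and weight w, and the k largest keys win). The function name `weighted_sample`, the record identifiers, and the holdings counts are all made up for the example.

```python
import heapq
import random

def weighted_sample(holdings, k, seed=0):
    """Pick k distinct record ids, favoring records with more holding libraries.

    holdings maps a record id to its number of holding libraries (the weight).
    This is an illustrative technique, not the project's documented procedure.
    """
    rng = random.Random(seed)
    keyed = (
        (rng.random() ** (1.0 / weight), rec_id)
        for rec_id, weight in holdings.items()
        if weight > 0
    )
    return [rec_id for _, rec_id in heapq.nlargest(k, keyed)]

# Hypothetical holdings counts, not taken from WorldCat
holdings = {"rec-a": 120, "rec-b": 3, "rec-c": 45, "rec-d": 1, "rec-e": 300}
sample = weighted_sample(holdings, 3, seed=42)
print(sample)
```

A weighting of this kind makes the test dataset resemble what libraries commonly hold, which is consistent with the high share of books and full-level records reported on the slide.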
MARC record
  LDR 01019cam 2200265 4500^
  001 ocm00000003^
  003 OCoLC^
  005 20010925133908.0^
  008 690414s1963 nyu b 000 0 eng ^
  010 $a63064323 ^
  040 $aDLC $cDLC ^
  050 04 $aHV700.5 $b.N37 ^
  082 0 $a362.7/3 ^
  110 2 $aNational Study Service. ^
  245 10 $aIllegitimacy and adoption in Maine : $breport of a study made for the Maine Committee on Children and Youth. ^
  260 $a[New York], $c1963. ^
  300 $a24 p. ; $c28 cm. ^
  500 $aCover title. ^
  504 $aBibliographical footnotes. ^
  650 0 $aIllegitimacy $zMaine. ^
  710 1 $aMaine. $bCommittee on Children and Youth. ^
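The display form of the record above uses "^" to end each field and "$" followed by a one-character code to start each subfield. A minimal Python sketch of reading that display form follows. The helper `parse_display_field` is hypothetical, and it assumes "$" never appears inside the data itself. Because the display collapses blank indicator positions, the sketch keeps the indicator text as printed rather than guessing which position a lone value occupies.

```python
# A few fields copied from the sample record, in the slide's display notation
RECORD = (
    "001 ocm00000003^003 OCoLC^050 04 $aHV700.5 $b.N37 ^"
    "245 10 $aIllegitimacy and adoption in Maine : $breport of a study made "
    "for the Maine Committee on Children and Youth. ^"
    "650 0 $aIllegitimacy $zMaine. ^"
)

def parse_display_field(line):
    """Split one displayed field into tag, indicator text, and subfields."""
    tag, rest = line[:3], line[3:].strip()
    if tag < "010":  # control fields carry plain data only
        return {"tag": tag, "indicators": "", "subfields": [], "data": rest}
    head, *parts = rest.split("$")
    subfields = [(p[0], p[1:].strip()) for p in parts if p]
    return {"tag": tag, "indicators": head.strip(), "subfields": subfields, "data": ""}

parsed = [parse_display_field(f.strip()) for f in RECORD.split("^") if f.strip()]
for field in parsed:
    print(field["tag"], field["indicators"], field["subfields"] or field["data"])
```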
Decomposing MARC Records
  OCLC #   Tag   1st Ind   2nd Ind   SubFld   Fld Pos   SubFld Pos   Word Pos   Word
  3        1                                  1         1            1          Ocm00000003
  3        3                                  2         1            1          OCoLC
  3        110   2                   a        11        1            1          National
  3        110   2                   a        11        1            2          Study
  3        110   2                   a        11        1            3          Service
  3        245   1         0         a        12        1            1          Illegitimacy
  3        245   1         0         a        12        1            2          and
  3        245   1         0         a        12        1            3          Adoption
  3        245   1         0         b        12        2            1          Report
  3        650             0         a        17        1            1          Illegitimacy
  3        650             0         z        17        2            1          Maine
  400,000 MARC 21 records = 33 million decomposed records
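The decomposition above turns each word of each subfield into its own row, keyed by record number, tag, indicators, subfield code, and positions. A minimal Python sketch of that idea follows. The `WordRow` type and `decompose` function are hypothetical names, the input structure is an assumption, and the tokenizer (runs of word characters) is a stand-in since the slides do not state the word extraction rules actually used.

```python
import re
from typing import NamedTuple

class WordRow(NamedTuple):
    oclc: int
    tag: str
    ind1: str
    ind2: str
    subfield: str
    field_pos: int
    subfield_pos: int
    word_pos: int
    word: str

def decompose(oclc, fields):
    """Emit one row per word.

    fields is a list of (tag, ind1, ind2, [(subfield_code, text), ...]);
    control fields use empty indicators and an empty subfield code.
    """
    rows = []
    for field_pos, (tag, ind1, ind2, subfields) in enumerate(fields, start=1):
        for sub_pos, (code, text) in enumerate(subfields, start=1):
            for word_pos, word in enumerate(re.findall(r"\w+", text), start=1):
                rows.append(WordRow(oclc, tag, ind1, ind2, code,
                                    field_pos, sub_pos, word_pos, word))
    return rows

rows = decompose(3, [
    ("001", "", "", [("", "ocm00000003")]),
    ("245", "1", "0", [("a", "Illegitimacy and adoption in Maine :"),
                       ("b", "report of a study")]),
])
for row in rows:
    print(row)
```

Field positions here count only the two fields passed in, so they differ from the positions shown in the table, which reflect the full record.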
Content designation in dataset
  MARC 21         Currently                  Unlikely
  Field Groups    Defined     Obsolete      Used        Total
  00x                   6            0         0            6
  0xx                  96            1        33          130
  1xx                  49            0         2           51
  2xx                  81            0        19          100
  3xx                  23            6         0           29
  4xx                  10            0        30           40
  5xx                 128            1         3          132
  6xx                 104            1         7          112
  7xx                 205            0         5          210
  8xx                 105            3         8          116
  TOTAL               807           12       107          926
Summary frequency results
  Frequency               # of Fields/Subfields    % of All Occurrences
  > 600,000                         1                      4.4%
  500,000 to 599,999                0                      0%
  400,000 to 499,999               13                     39.9%
  300,000 to 399,999                6                     14.3%
  200,000 to 299,999                6                     10.6%
  100,000 to 199,999               10                     10.3%
  TOTAL                            36                     79.5%
  Total number of fields/subfields occurring in dataset = 13,849,499
  Only 4% of all fields/subfields account for 80% of all occurrences
  or
  96% of all fields/subfields account for 20% of all occurrences
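The arithmetic behind the summary can be checked with a short Python sketch. The bin counts and percentages are copied from the frequency table. Reading the "4%" figure as 36 of the 926 distinct fields/subfields found in the dataset is an assumption based on the preceding content designation table.

```python
# (frequency bin, number of fields/subfields, % of all occurrences)
BINS = [
    ("> 600,000", 1, 4.4),
    ("500,000 to 599,999", 0, 0.0),
    ("400,000 to 499,999", 13, 39.9),
    ("300,000 to 399,999", 6, 14.3),
    ("200,000 to 299,999", 6, 10.6),
    ("100,000 to 199,999", 10, 10.3),
]
DISTINCT_IN_DATASET = 926  # assumed denominator: total from the dataset table

top_fields = sum(count for _, count, _ in BINS)
top_share = round(sum(share for _, _, share in BINS), 1)
field_fraction = top_fields / DISTINCT_IN_DATASET

print(f"{top_fields} fields/subfields carry {top_share}% of occurrences")
print(f"that is {field_fraction:.1%} of the {DISTINCT_IN_DATASET} found in the dataset")
```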
Characteristics of top 36
  Most frequently occurring: 650 $a [Subject data]
  2nd most frequently occurring: 040 $d [Cataloging source]
  3rd & 4th most frequently occurring: 260 $a & $b [Publication information]
  5th most frequently occurring: 245 $a [Title]
  Contain data useful to end users: 28
  Contain control numbers, etc.: 5
  Contain data useful to catalogers: 3
Indexing & MARC
  Indexing Guidelines to Support Z39.50 Profile Searches
    Identified all MARC 21 fields/subfields that may contain author, title, or subject data
      Author-related fields/subfields: 119
      AuthorTitle-related fields/subfields: 21
      Title-related fields/subfields: 253
      Subject-related fields/subfields: 144
  537 fields/subfields contain author, title, subject data
  Usefulness of indexing all possible fields?
Occurrences in test dataset
  381 of the 537 author, title, or subject fields/subfields occur one or more times in the Z-Interop dataset
    Author-related fields/subfields: 86
    AuthorTitle-related fields/subfields: 16
    Title-related fields/subfields: 178
    Subject-related fields/subfields: 101
  19 of the 381 (5%) account for 80% of all occurrences
    9 of 19 are subject-related
    5 of 19 are author-related
    5 of 19 are title-related
  The 19 fields/subfields
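One way to arrive at a short list such as the 19 fields/subfields is to rank fields/subfields by occurrence count and keep adding them until a target share of all occurrences is covered. The Python sketch below shows that selection. The function `smallest_covering_set` is a hypothetical name, and the counts are invented toy values rather than figures from the Z-Interop dataset.

```python
def smallest_covering_set(counts, target=0.80):
    """Return the most frequent keys whose combined share first reaches target."""
    total = sum(counts.values())
    chosen, running = [], 0
    for key, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if total == 0 or running / total >= target:
            break
        chosen.append(key)
        running += count
    return chosen

# Toy occurrence counts per field/subfield (made up for illustration)
toy_counts = {"650$a": 50, "245$a": 35, "260$b": 10, "300$a": 3, "500$a": 2}
selected = smallest_covering_set(toy_counts)
print(selected)
```

Running the same selection per material format would be one way to explore the slide's question about format-specific indexing choices.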
Implications for indexing
  What difference do indexing decisions make?
    Preliminary testing using the 19 fields/subfields: 95% - 100% of correct records retrieved!
  How much time would be saved in setting up indexing policies?
  Is there a systematic method to identify the “best” fields/subfields to index?
    Per format of materials?
    Per user (librarians and end users) needs?
    Good enough search results?
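The "correct records retrieved" figure is a recall measure: the share of benchmark result records that a search still finds when only a subset of fields/subfields is indexed. A minimal Python sketch of that comparison follows. The toy records, the chosen subset, and the helpers `build_index` and `recall` are all assumptions for illustration and do not reproduce the testbed's searches or its Sirsi Unicorn configuration.

```python
import re

def build_index(records, indexed=None):
    """Map each lowercased word to the ids of records containing it.

    records: {record_id: [((tag, subfield_code), text), ...]}
    indexed: set of (tag, subfield_code) pairs to index, or None for all.
    """
    index = {}
    for rec_id, fields in records.items():
        for key, text in fields:
            if indexed is None or key in indexed:
                for word in re.findall(r"\w+", text.lower()):
                    index.setdefault(word, set()).add(rec_id)
    return index

def recall(retrieved, expected):
    """Share of expected records that were retrieved."""
    return len(retrieved & expected) / len(expected) if expected else 1.0

# Toy records (hypothetical)
records = {
    1: [(("245", "a"), "Adoption in Maine"), (("650", "a"), "Adoption")],
    2: [(("500", "a"), "Includes notes on adoption")],
    3: [(("650", "a"), "Illegitimacy")],
}
benchmark = build_index(records).get("adoption", set())
subset = build_index(records, {("245", "a"), ("650", "a")}).get("adoption", set())
score = recall(subset, benchmark)
print(f"recall with the reduced index: {score:.0%}")
```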
References
  Z39.50 Interoperability Testbed
    http://www.unt.edu/zinterop/
  Indexing Guidelines to Support Z39.50 Profile Searches
    http://www.unt.edu/zinterop/Documents/IndexingGuidelines1Feb2002.pdf