Page 1

Batch-conversion of Non-standard Multiscript Records by XSLT

Lucas Mak
Metadata and Catalog Librarian

Michigan State University

Catalog Management Interest Group, ALA Midwinter 2011, Jan. 8, 2011, San Diego CA

Page 2

Agenda

• Background
  – Structure of multiscript records
    • Model A vs. Model B
  – Using Z39.50 for cataloging
    • Multiscript records retrieved through Z39.50
  – Coding issues
  – Problems caused by non-standard multiscript records
• Solutions
  – Design of XSLT
    • Processing logic
    • Factors affecting the design
    • Limitations & unintended consequences

Page 3

Structure of Multiscript Records

• Multiscript records
  – For recording data in multiple scripts in MARC records
  – One script may be considered the primary script of the data content of the record, even though other scripts are also used for data content
  – Two models
    • Model A: Vernacular & Transliteration
    • Model B: Simple Multiscript Records

Page 4

Structure of Multiscript Records

• Model A: Vernacular & Transliteration
  – The regular fields may contain data in different scripts and in the vernacular or transliteration of the data. Fields 880 are used when data needs to be duplicated to express it in both the original vernacular script and transliterated into one or more scripts
  – Model A data in the regular fields is linked to the data in 880 fields by a subfield $6 that occurs in both of the associated fields
    • $6 [linking tag]-[occurrence number]/[script identification code]/[field orientation code]

* MARC21 Bibliographic Appx. D
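The $6 layout quoted above can be made concrete with a small parser. This is a minimal sketch in Python (not the XSLT used in the talk); the names `Linkage` and `parse_linkage` are hypothetical, and the pattern assumes a three-digit linking tag, a two-digit occurrence number, and `r` as the only field orientation code.

```python
import re
from typing import NamedTuple, Optional

# $6 [linking tag]-[occurrence number]/[script identification code]/[field orientation code]
# Assumption: 3-digit tag, 2-digit occurrence number, "r" as the only orientation code.
LINK_RE = re.compile(
    r"^(?P<tag>\d{3})-(?P<occ>\d{2})"
    r"(?:/(?P<script>[^/]*))?"
    r"(?:/(?P<orient>r))?$"
)


class Linkage(NamedTuple):
    linking_tag: str
    occurrence: str
    script_code: Optional[str]
    orientation: Optional[str]


def parse_linkage(value: str) -> Linkage:
    """Split a subfield $6 value into its four parts (hypothetical helper)."""
    m = LINK_RE.match(value)
    if m is None:
        raise ValueError(f"not a well-formed $6 value: {value!r}")
    # An empty script position (as in "245-01//r") is reported as None.
    script = m.group("script") or None
    return Linkage(m.group("tag"), m.group("occ"), script, m.group("orient"))
```

A value such as `245-01/(2/r` yields linking tag `245`, occurrence number `01`, script identification code `(2`, and orientation `r`; a bare `880-01` leaves the last two parts empty.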

Page 5

Structure of Multiscript Records

• Model A: Vernacular & Transliteration

[Figure: sample Model A fields with callouts for Linking Tag, Occurrence Number, Script Identification Code, and Field Orientation Code]

Page 6

Structure of Multiscript Records

• Model A: Vernacular & Transliteration

[Figure: sample Model A fields with callouts for Linking Tag and Occurrence Number]

Page 7

CJK Record according to Model A Specifications

Page 8

Structure of Multiscript Records

• Model B: Simple Multiscript Records
  – All data is contained in regular fields and script varies depending on the requirements of the data
  – Repeatability specifications of all fields should be followed
  – Although the Model B record may contain transliterated data, Model A is preferred if the same data is recorded in both the original vernacular script and transliteration
  – Field 880 is not used

* MARC21 Bibliographic Appx. D

Page 9

CJK Record according to Model B Specifications

Item in Chinese. Cataloging language in English

Page 10

Structure of Multiscript Records

• Field 066 (Character Sets Present)
  – To indicate the MARC-8 character sets other than the default sets that are invoked in the record
• MARC-8 vs. Unicode Environment

                              MARC-8      Unicode
MARC Field 066                Required    N/A
Script Identification Code    Required    N/A
Field Orientation Code        Required    Required

Page 11

Z39.50 for Cataloging

• SkyRiver
  – MSU switched to SkyRiver in Oct 2009
  – Ways to expand the pool of re-usable bibliographic records
    • Z39.50 function in Innovative Millennium (day-to-day cataloging)
    • MarcEdit Z39.50 client (HathiTrust record load)

Page 12

Z39.50 search in Millennium

Page 13

Z39.50 search in Millennium (Record retrieved for Editing)

Page 14

HathiTrust Data Availability

Page 15

MarcEdit Z39.50 Client (HathiTrust)

Batch search against Univ. of Michigan Catalog using UM record identifier

Page 16

HathiTrust Record Load Workflow

[Diagram labels: U of M Catalog, MSU Catalog, Record Dump, Request, Retrieve]

Page 17

Non-standard Multiscript Records from Z39.50

Sample Non-standard CJK Record Retrieved by MSU Millennium Z39.50 Client

Page 18

Same Record in Source Library Catalog (Staff View)

Page 19

Non-standard Multiscript Records from Z39.50

HathiTrust Record Retrieved by MarcEdit Z39.50 Client*

* As of Dec. 10, 2010, Univ. of Michigan has rebuilt 880 fields on their Z39.50 serving records

Page 20

Same HathiTrust Record in Univ. of Michigan Catalog (Staff View)

Page 21

Coding Issues

Non-standard Coding

• Field-pairing
  – Vernacular data in regular field
• No linking tag in subfield $6
• No script identification code in subfield $6 (may be due to Unicode environment)

Standard Model A Coding

• Field-pairing
  – Transliteration in regular field
  – Vernacular data in 880 field
• Linking tag
  – Tag number of an associated field
• Script identification code*
  – $1 => CJK script

* Applicable to MARC-8 encoded records

Page 22

Coding Issues

Non-standard Coding

• No field orientation code in subfield $6

Standard Model A Coding

• Field orientation code
  – /r

Page 23

Coding Issues

Non-standard Coding Practice

• Repeat non-repeatable fields (245, 250)
• Duplication of data in both vernacular and transliteration

Model B Guidelines

• Repeatability specifications of all fields should be followed
• Model A is preferred if the same data is recorded in both the original vernacular script and transliteration

Page 24

Problems Caused by Non-standard Multiscript Records

• Irregular/incorrect field orientation in Arabic and Hebrew records in OPAC display
  – Left-to-right display of subfields in “Title” due to the lack of “Field Orientation code”, while scripts within subfields run from right to left

“Field Orientation code” added back

Page 25

Problems Caused by Non-standard Multiscript Records

• Irregularity in result display
  – Inconsistent sequencing of vernacular and transliteration fields

Page 26

Problems Caused by Non-standard Multiscript Records

• Database maintenance
  – Data structure inconsistency
    • Same kind of data resides in two different places
    • Extra steps needed to accommodate inconsistencies
  – Heading validation issues
    • NACO records with headings in vernacular in 4xx since mid 2008
    • Vernacular headings (4xx) in regular fields

Page 27

Problems Caused by Non-standard Multiscript Records

• Expectation in retrieval of vernacular data
  – MSU only indexes CJK and Cyrillic data in 880 fields
  – Arabic, Hebrew, Greek, and other vernacular data in regular fields of non-standard multiscript records are indexed and searchable
    • Creates a false impression that patrons can search in scripts other than CJK and Cyrillic

Page 28

Solutions

• MSU uses Model A for multiscript records
• Tasks
  – To change field tag of vernacular data to 880
  – Subfield $6 in both regular & 880 fields
    • To insert linking tag
  – Subfield $6 in 880 fields
    • To insert script identification code*
    • To insert field orientation code for Arabic & Hebrew records
  – To insert 066 field if it does not already exist*

* No longer applicable since MSU has moved to Unicode environment

Page 29

Solutions

• Necessary steps
  – Determine which fields contain vernacular data
    • Replace regular field tag with 880
  – Determine which script(s) is contained in a record
    • Insert field 066*
    • Insert “Script Identification code*” and “Field Orientation code” when appropriate

* No longer applicable since MSU has moved to Unicode environment

Page 30

Solutions

• XSLT (Extensible Stylesheet Language Transformations)
  – Within the family of XML
    • Current version: 2.0
    • Case sensitive
  – “Transformation” means:
    • Manipulation of XML documents by creating a new document based on the original document
  – Common usages in library context
    • Web display
      – e.g. converting EAD into HTML for display
    • Metadata crosswalking
  – Data selection and manipulation
  – Conditional processing
    • Specify matching criteria and corresponding action(s)

Page 31

Database Maintenance Workflow

[Diagram: MSU Catalog → Uncorrected MARC File → Format Conversion → Uncorrected MARCXML → XSLT Processor → Corrected MARCXML → Format Conversion → Corrected MARC File]

Page 32

Alternative HathiTrust Pre-load Data Cleanup Workflow

[Diagram labels: U of M Catalog, MSU Catalog, Request, Retrieve, Uncorrected records, XSLT Processor, Corrected records]

Page 33

Design of XSLT

• Processing logic
  – Regular field to 880 and insert linking tag
    • Remove all roman data from a field
    • Determine length of a field
      – 0 => no vernacular data
      – ≠0 => contains vernacular data
  – Field 066, Script identification & Field orientation codes
    • Match vernacular data field against vernacular characters

Page 34

Design of XSLT

• Remove all roman data
  – Roman data (ASCII, special characters & diacritics used in transliteration)
  – replace() and translate() functions
    • Find “pattern A” and replace it with “pattern B”
  – Replace roman data with nothing

<xsl:value-of select="replace(replace(replace(translate(translate(translate(translate(normalize-space(.),$ascii,$spaces),$specialCharacters,' '),$diacritics,' '),$extendedLatin,' '),$apos,' '),'[A-Za-z]',' '),' ','')"/>
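The same strip-everything-roman idea can be tried outside an XSLT 2.0 processor. The sketch below is Python, not the presenter's stylesheet, and it swaps the hand-built character lists (`$ascii`, `$diacritics`, and so on) for Unicode character properties; the function name `strip_roman` and the exact ranges treated as "roman" are assumptions made for illustration.

```python
import unicodedata


def strip_roman(text: str) -> str:
    """Drop ASCII, punctuation, Latin letters and transliteration diacritics.

    Whatever survives is treated as vernacular data (assumed definition).
    """
    kept = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            continue  # ASCII letters, digits, spaces, punctuation
        category = unicodedata.category(ch)
        if category[0] in "PZSC":
            continue  # non-ASCII punctuation, separators, symbols, control/format
        if category == "Mn" and 0x0300 <= code <= 0x036F:
            continue  # combining diacritical marks used in romanization
        if 0x02B0 <= code <= 0x02FF:
            continue  # spacing modifier letters such as U+02BB and U+02BC
        if "LATIN" in unicodedata.name(ch, ""):
            continue  # precomposed Latin letters with diacritics
        kept.append(ch)
    return "".join(kept)
```

A field holding only transliteration comes back as an empty string, while a field such as `Běijīng = 北京` comes back as `北京`, which is the property the length test on the next slide relies on.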

Page 35

Design of XSLT

• Test the length of the field after removing all non-vernacular data
  – XSLT elements: <xsl:choose> in combination with <xsl:when> & <xsl:otherwise>
  – XSLT functions: string-length()

<xsl:choose>
  <xsl:when test="string-length($subfieldString)=0">
    …… [series of actions when string-length equals 0]
  </xsl:when>
  <xsl:otherwise>
    …… [series of actions when string-length does not equal 0]
  </xsl:otherwise>
</xsl:choose>

Page 36

Design of XSLT

• Field with no vernacular data

<!-- Test length of the field -->
<xsl:when test="string-length($subfieldString)=0">
  <xsl:element name="marc:datafield">
    <!-- Insert original values -->
    <xsl:attribute name="tag"><xsl:value-of select="$tag"/></xsl:attribute>
    <xsl:attribute name="ind1"><xsl:value-of select="$ind1"/></xsl:attribute>
    <xsl:attribute name="ind2"><xsl:value-of select="$ind2"/></xsl:attribute>
    <!-- Insert linking tag (880) and original occurrence number -->
    <xsl:element name="marc:subfield">
      <xsl:attribute name="code"><xsl:text>6</xsl:text></xsl:attribute>
      <xsl:text>880-</xsl:text>
      <xsl:value-of select="$subfield6"/>
    </xsl:element>
    <!-- Copy subfields other than $6 -->
    <xsl:copy-of select="*[not(self::marc:subfield[@code='6'])]"/>
  </xsl:element>
</xsl:when>

Page 37

Design of XSLT

• Field with vernacular data

<xsl:otherwise>
  <xsl:element name="marc:datafield">
    <!-- Insert "880" as tag no. -->
    <xsl:attribute name="tag"><xsl:text>880</xsl:text></xsl:attribute>
    <!-- Insert original values -->
    <xsl:attribute name="ind1"><xsl:value-of select="$ind1"/></xsl:attribute>
    <xsl:attribute name="ind2"><xsl:value-of select="$ind2"/></xsl:attribute>
    <xsl:element name="marc:subfield">
      <xsl:attribute name="code"><xsl:text>6</xsl:text></xsl:attribute>
      <!-- Insert original tag no. as linking tag -->
      <xsl:value-of select="$tag"/>
      <xsl:text>-</xsl:text>
      <!-- Insert original occurrence number -->
      <xsl:value-of select="$subfield6"/>
      …… [Insert “Script Identification Code” & “Field Orientation Code”]
    </xsl:element>
    <xsl:copy-of select="*[not(self::marc:subfield[@code='6'])]"/>
  </xsl:element>
</xsl:otherwise>
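The two branches (field without vernacular data, field with vernacular data) can be exercised end to end on a single MARCXML datafield. This is a Python sketch with the standard-library ElementTree, not the stylesheet from the talk. It assumes that subfield $6 of the incoming non-standard field holds only the occurrence number, and it uses a simplified code point test (`has_vernacular`, a hypothetical helper) in place of the strip-and-measure step.

```python
import copy
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)
SUBFIELD = f"{{{MARC_NS}}}subfield"
DATAFIELD = f"{{{MARC_NS}}}datafield"


def has_vernacular(text: str) -> bool:
    """Simplified stand-in: anything outside Latin-range blocks counts as vernacular."""
    for ch in text:
        code = ord(ch)
        if code < 0x0370:
            continue  # Latin blocks, modifier letters, combining diacritics
        if 0x1E00 <= code <= 0x1EFF or 0x2000 <= code <= 0x206F:
            continue  # Latin Extended Additional, general punctuation
        return True
    return False


def relink(field: ET.Element) -> ET.Element:
    """Return a recoded copy of one datafield, following the two branches above."""
    occurrence = ""
    others = []
    for child in field:
        if child.tag == SUBFIELD and child.get("code") == "6":
            occurrence = (child.text or "").strip()  # assumed: occurrence number only
        else:
            others.append(child)
    original_tag = field.get("tag", "")
    vernacular = has_vernacular("".join(c.text or "" for c in others))
    new = ET.Element(DATAFIELD, {
        "tag": "880" if vernacular else original_tag,
        "ind1": field.get("ind1", " "),
        "ind2": field.get("ind2", " "),
    })
    link = ET.SubElement(new, SUBFIELD, {"code": "6"})
    # Vernacular field: link back to the original tag. Regular field: link to 880.
    link.text = f"{original_tag}-{occurrence}" if vernacular else f"880-{occurrence}"
    for child in others:
        new.append(copy.deepcopy(child))  # copy subfields other than $6
    return new
```

Script identification and field orientation codes are left out here on purpose; they are separate steps in the slides that follow.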

Page 38

Design of XSLT

• Insert “Script Identification Code” (MARC-8 environment)

<xsl:choose>
  <!-- Insert code for Arabic -->
  <xsl:when test="matches($basicArabic,substring($subfieldString,1,1)) or matches($extendedArabic,substring($subfieldString,1,1))">
    <xsl:text>/(3</xsl:text>
  </xsl:when>
  <!-- Insert code for Greek -->
  <xsl:when test="matches($greek,substring($subfieldString,1,1))">
    <xsl:text>/(S</xsl:text>
  </xsl:when>
  <!-- Insert code for Hebrew -->
  <xsl:when test="matches($basicHebrew,substring($subfieldString,1,1))">
    <xsl:text>/(2</xsl:text>
  </xsl:when>
  <!-- Insert code for Cyrillic -->
  <xsl:when test="matches($basicCyrillic,substring($subfieldString,1,1)) or matches($extendedCyrillic,substring($subfieldString,1,1))">
    <xsl:text>/(N</xsl:text>
  </xsl:when>
  <xsl:when test="matches($bengali,substring($subfieldString,1,1)) or matches($tamil,substring($subfieldString,1,1)) or matches($thai,substring($subfieldString,1,1)) or matches($devanagar,substring($subfieldString,1,1))"/>
  <!-- Insert code for CJK -->
  <xsl:otherwise>
    <xsl:text>/$1</xsl:text>
  </xsl:otherwise>
</xsl:choose>
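The first-character test above can be mirrored in a few lines of Python. This sketch is not the presenter's code: it identifies the script from the Unicode character name instead of matching against per-script character lists, and the function name `script_code` is hypothetical. The code values and the fall-through to `$1` follow the slide.

```python
import unicodedata

# Script identification codes taken from the slide (MARC-8 environment).
SCRIPT_CODES = {"ARABIC": "(3", "GREEK": "(S", "HEBREW": "(2", "CYRILLIC": "(N"}
# Scripts for which the slide's stylesheet inserts no code.
NO_CODE = {"BENGALI", "TAMIL", "THAI", "DEVANAGARI"}


def script_code(vernacular: str) -> str:
    """Return the '/code' piece for $6, judged from the first vernacular character."""
    first = vernacular[:1]
    if not first:
        return ""  # assumption: nothing to classify, so nothing is inserted
    # The first word of the Unicode character name is used as the script label.
    script = unicodedata.name(first, "").split(" ", 1)[0]
    if script in SCRIPT_CODES:
        return "/" + SCRIPT_CODES[script]
    if script in NO_CODE:
        return ""
    return "/$1"  # everything else is treated as CJK, as in the slide
```

As in the stylesheet, only the first character is inspected, so a field that mixes scripts is coded by whichever script comes first.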

Page 39

Design of XSLT

• Insert “Field Orientation Code”

<xsl:choose>
  <!-- Test if the subfield contains Arabic or Hebrew script -->
  <xsl:when test="contains($subfieldString,'[Arabic script]') or contains($subfieldString,'[Hebrew script]')">
    <!-- Insert Field Orientation Code -->
    <xsl:text>//r</xsl:text>
  </xsl:when>
</xsl:choose>
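A Python sketch of the same decision is shown here, using Unicode bidirectional classes rather than a test for Arabic or Hebrew characters specifically (that substitution is an assumption, as are the names `is_right_to_left` and `linkage_suffix`). With an empty script identification code the suffix comes out as `//r`, matching the slide; with a code it comes out in the `/(3/r` form.

```python
import unicodedata


def is_right_to_left(text: str) -> bool:
    """True if any character has bidi class R (e.g. Hebrew) or AL (Arabic letters)."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def linkage_suffix(vernacular: str, script_code: str = "") -> str:
    """Build the part of $6 that follows the occurrence number."""
    if is_right_to_left(vernacular):
        # The script position is kept even when empty, so the result is "//r".
        return f"/{script_code}/r"
    return f"/{script_code}" if script_code else ""
```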

Page 40

Design of XSLT

• Field 066 (MARC-8 environment)
  – Insert character set code in subfield $c
  – A single record may have more than one vernacular script => multiple subfield $c
    • XSLT element: <xsl:if>
      – Allows multiple matches
    • XSLT function: matches()
  – Processing logic
    • Turn the whole record into a text string
    • Remove all Latin data
    • Match vernacular script against normalized text string

Page 41

Design of XSLT

• After removing all Latin data from the record

<!-- Replace Arabic characters with "3" -->
<xsl:value-of select="translate(translate(translate(translate(translate(translate(translate(translate(translate(translate(.,$basicArabic,'3'),$extendedArabic,'4'),$basicCyrillic,'N'),$extendedCyrillic,'Q'),$Greek,'S'),$basicHebrew,'2'),$bengali,'b'),$tamil,'ta'),$thai,'th'),$devanagar,'d')"/>
…
<!-- Test if the normalized data contains "3" -->
<xsl:if test="matches($normalizedWholeRecord,'3')">
  <!-- Insert "(3" as the character set code in $c -->
  <xsl:element name="marc:subfield">
    <xsl:attribute name="code">c</xsl:attribute>
    <xsl:text>(3</xsl:text>
  </xsl:element>
</xsl:if>
……
<!-- Test if any non-alpha-numeral characters exist -->
<xsl:if test="matches($normalizedWholeRecord,'[^A-Za-z0-9]')">
  <!-- Insert code for CJK -->
  <xsl:element name="marc:subfield">
    <xsl:attribute name="code">c</xsl:attribute>
    <xsl:text>$1</xsl:text>
  </xsl:element>
</xsl:if>
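The "several independent tests, several $c subfields" pattern can be sketched in Python as follows. This is a simplification and not the stylesheet itself: scripts are recognized by Unicode character name, only the basic sets `(3`, `(N`, `(S`, `(2` and `$1` are covered (the extended Arabic and Cyrillic sets and the Indic and Thai scripts handled in the slide are omitted), and the CJK test looks for CJK, kana, or Hangul names rather than for any leftover non-alphanumeric character. The name `charset_codes` is hypothetical.

```python
import unicodedata

# (script label from the Unicode name, 066 $c code), basic sets only.
CHARSET_CODES = [("ARABIC", "(3"), ("CYRILLIC", "(N"), ("GREEK", "(S"), ("HEBREW", "(2")]
CJK_LABELS = {"CJK", "HIRAGANA", "KATAKANA", "HANGUL"}


def charset_codes(record_text: str) -> list:
    """Return one 066 $c value per script found anywhere in the record text."""
    scripts = {unicodedata.name(ch, "").split(" ", 1)[0] for ch in record_text}
    # Each test is independent, like a run of <xsl:if> elements,
    # so a record with several scripts gets several codes.
    codes = [code for label, code in CHARSET_CODES if label in scripts]
    if scripts & CJK_LABELS:
        codes.append("$1")
    return codes
```

A record whose text contains both Arabic and Chinese characters returns two codes, one per subfield $c; a record in roman script only returns an empty list, in which case no 066 field would be needed.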

Page 42

Design of XSLT

• Factors affecting the design
  – Pre-load vs. post-load data clean up (HathiTrust workflow)
    • Mechanism to filter out non-multiscript records needed for pre-load data clean up
    • Construction of 949 overlay command*
  – MARC-8 vs. Unicode
    • Field 066 and Script identification code not allowed in Unicode environment
      – 2 separate XSLTs made
  – OCLC vs. MARC21 Standard
    • Representation of Bengali, Devanagari, Tamil, and Thai in field 066

* Innovative Millennium specific

Page 43

Limitations & Unintended Consequences

• Processing of data represented by UTF-8 character number
  – \U+0e33\\U+0e43\\U+0e2b\\U+0e49\
• Vernacular scripts processed (MARC-8 environment)
• Handling of unlinked vernacular data
  – Implications on OPAC display
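For the first limitation, data written as character numbers would not be matched by tests that look for the vernacular characters themselves. One possible pre-processing step, sketched in Python and not part of the talk, is to decode the notation before any script test runs. The pattern assumes each character appears as a backslash, `U+`, four to six hexadecimal digits, and a closing backslash, as in the sample above; `decode_code_points` is a hypothetical name.

```python
import re

# Assumed notation: \U+0e33\ for one character, repeated back to back.
ESCAPED_CHAR = re.compile(r"\\U\+([0-9A-Fa-f]{4,6})\\")


def decode_code_points(text: str) -> str:
    """Replace each \\U+XXXX\\ sequence with the character it names."""
    return ESCAPED_CHAR.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Applied to the sample on the slide, this yields four Thai characters, which a script test could then recognize.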

Page 44

Questions?

Lucas Mak
[email protected]

Michigan State University Libraries