
Batch-conversion of Non-standard Multiscript Records by XSLT

Lucas Mak

Metadata and Catalog Librarian

Michigan State University

Catalog Management Interest Group, ALA Midwinter 2011, Jan. 8, 2011, San Diego CA

Agenda

• Background
  – Structure of multiscript records
    • Model A vs. Model B
  – Using z39.50 for cataloging
    • Multiscript records retrieved through z39.50
  – Coding issues
  – Problems caused by non-standard multiscript records
• Solutions
  – Design of XSLT
    • Processing logic
    • Factors affecting the design
    • Limitations & unintended consequences

Structure of Multiscript Records

• Multiscript records
  – For recording data in multiple scripts in MARC records
  – One script may be considered the primary script of the data content of the record, even though other scripts are also used for data content
  – Two models
    • Model A: Vernacular & Transliteration
    • Model B: Simple Multiscript Records


• Model A: Vernacular & Transliteration
  – The regular fields may contain data in different scripts and in the vernacular or transliteration of the data. Fields 880 are used when data needs to be duplicated to express it in both the original vernacular script and transliterated into one or more scripts
  – Model A data in the regular fields is linked to the data in 880 fields by a subfield $6 that occurs in both of the associated fields
    • $6 [linking tag]-[occurrence number]/[script identification code]/[field orientation code]

* MARC21 Bibliographic Appx. D
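The $6 pattern above can be illustrated with a short parsing sketch. It is written in Python rather than XSLT purely for illustration; the names `Linkage` and `parse_subfield_6` are hypothetical and do not come from the slides. It assumes the script identification code and field orientation code are both optional, and that an empty script segment (as in `245-01//r`) simply means no script code is present.

```python
from typing import NamedTuple, Optional


class Linkage(NamedTuple):
    """Parts of a subfield $6 value (hypothetical helper type)."""
    linking_tag: str
    occurrence: str
    script_code: Optional[str]
    orientation: Optional[str]


def parse_subfield_6(value: str) -> Linkage:
    """Split '[linking tag]-[occurrence]/[script code]/[orientation]'."""
    head, *rest = value.split("/")
    linking_tag, _, occurrence = head.partition("-")
    # Segments after the first slash are optional; empty segments count as absent.
    script_code = rest[0] if len(rest) > 0 and rest[0] else None
    orientation = rest[1] if len(rest) > 1 and rest[1] else None
    return Linkage(linking_tag, occurrence, script_code, orientation)


if __name__ == "__main__":
    print(parse_subfield_6("245-01/$1"))
    print(parse_subfield_6("880-01"))
```

A regular field carrying `880-01` and an 880 field carrying `245-01/$1` share occurrence number `01`, which is what pairs them.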


• Model A: Vernacular & Transliteration

[Figure: subfield $6 examples with callouts for the linking tag, occurrence number, script identification code, and field orientation code]


• Model A: Vernacular & Transliteration

[Figure: CJK record according to Model A specifications, with callouts for the linking tag and occurrence number]

* MARC21 Bibliographic Appx. D


• Model B: Simple Multiscript Records
  – All data is contained in regular fields and script varies depending on the requirements of the data
  – Repeatability specifications of all fields should be followed
  – Although the Model B record may contain transliterated data, Model A is preferred if the same data is recorded in both the original vernacular script and transliteration
  – Field 880 is not used

[Figure: CJK record according to Model B specifications. Item in Chinese; cataloging language in English]


• Field 066 (Character Sets Present)
  – To indicate the MARC-8 character sets other than the default sets that are invoked in the record
• MARC-8 vs. Unicode Environment

                              MARC-8      Unicode
MARC Field 066                Required    N/A
Script Identification Code    Required    N/A
Field Orientation Code        Required    Required

z39.50 for Cataloging

• SkyRiver
  – MSU switched to SkyRiver in Oct 2009
  – Ways to expand the pool of re-usable bibliographic records
    • z39.50 function in Innovative Millennium (day-to-day cataloging)
    • MarcEdit z39.50 client (HathiTrust record load)

z39.50 search in Millennium

z39.50 search in Millennium (Record retrieved for Editing)

HathiTrust Data Availability

MarcEdit z39.50 Client (HathiTrust)

Batch search against Univ. of Michigan Catalog using UM record identifier

HathiTrust Record Load Workflow

[Workflow diagram; labels: U of M Catalog, MSU Catalog, Record Dump, Request, Retrieve]

Non-standard Multiscript Records from z39.50

Sample Non-standard CJK Record Retrieved by MSU Millennium z39.50 Client

Same Record in Source Library Catalog (Staff View)

HathiTrust Record Retrieved by MarcEdit z39.50 Client*

* As of Dec. 10, 2010, Univ. of Michigan has rebuilt 880 fields on their z39.50 serving records

Non-standard Multiscript Records from z39.50

Same HathiTrust Record in Univ. of Michigan Catalog (Staff View)

Coding Issues

Non-standard Coding

• Field-pairing
  – Vernacular data in regular field
• No linking tag in subfield $6
• No script identification code in subfield $6 (may be due to Unicode environment)

Standard Model A Coding

• Field-pairing
  – Transliteration in regular field
  – Vernacular data in 880 field
• Linking tag
  – Tag number of an associated field
• Script identification code*
  – $1 => CJK script

* Applicable to MARC-8 encoded records

Coding Issues

Non-standard Coding

• No field orientation code in subfield $6

Standard Model A Coding

• Field orientation code
  – /r

Coding Issues

Non-standard Coding Practice

• Repeat non-repeatable fields (245, 250)
• Duplication of data in both vernacular and transliteration

Model B Guidelines

• Repeatability specifications of all fields should be followed
• Model A is preferred if the same data is recorded in both the original vernacular script and transliteration

Problems Caused by Non-standard Multiscript Records

• Irregular/incorrect field orientation in Arabic and Hebrew records in OPAC display
  – Left-to-right display of subfields in “Title” due to the lack of “Field Orientation code” while scripts within subfields are from right to left

[Figure: the same display with “Field Orientation code” added back]


• Irregularity in result display
  – Inconsistent sequencing of vernacular and transliteration fields


• Database maintenance
  – Data structure inconsistency
    • Same kind of data resides in two different places
    • Extra steps needed to accommodate inconsistencies
  – Heading validation issues
    • NACO records with headings in vernacular in 4xx since mid 2008
    • Vernacular headings (4xx) in regular fields


• Expectation in retrieval of vernacular data
  – MSU only indexes CJK and Cyrillic data in 880 fields
  – Arabic, Hebrew, Greek, and other vernacular data in regular fields of non-standard multiscript records are indexed and searchable
    • Creates a false impression that patrons can search in scripts other than CJK and Cyrillic

Solutions

• MSU uses Model A for multiscript records
• Tasks
  – To change field tag of vernacular data to 880
  – Subfield $6 in both regular & 880 fields
    • To insert linking tag
  – Subfield $6 in 880 fields
    • To insert script identification code*
    • To insert field orientation code for Arabic & Hebrew records
  – To insert 066 field if it does not already exist*

* No longer applicable since MSU has moved to Unicode environment

Solutions

• Necessary steps
  – Determine which fields contain vernacular data
    • Replace regular field tag with 880
  – Determine which script(s) are contained in a record
    • Insert field 066*
    • Insert “Script Identification code*” and “Field Orientation code” when appropriate

*No longer applicable since MSU has moved to Unicode environment

Solutions

• XSLT (Extensible Stylesheet Language Transformations)
  – Within the family of XML
    • Current version: 2.0
    • Case sensitive
  – “Transformation” means:
    • Manipulation of XML documents by creating a new document based on the original document
  – Common usages in library context
    • Web display
      – e.g. converting EAD into HTML for display
    • Metadata crosswalking
  – Data selection and manipulation
  – Conditional processing
    • Specify matching criteria and corresponding action(s)

Database Maintenance Workflow

[Workflow diagram: MSU Catalog → Uncorrected MARC File → Format Conversion → Uncorrected MARCXML → XSLT Processor → Corrected MARCXML → Format Conversion → Corrected MARC File]

Alternative HathiTrust Pre-load Data Cleanup Workflow

[Workflow diagram; labels: U of M Catalog, MSU Catalog, Request, Retrieve, Uncorrected records, XSLT Processor, Corrected records]

Design of XSLT

• Processing logic
  – Regular field to 880 and insert linking tag
    • Remove all roman data from a field
    • Determine length of a field
      – 0 => no vernacular data
      – ≠0 => contains vernacular data
  – Field 066, Script identification & Field orientation codes
    • Match vernacular data field against vernacular characters

Design of XSLT

• Remove all roman data
  – Roman data (ASCII, special characters & diacritics used in transliteration)
  – replace() and translate() functions
    • Find “pattern A” and replace it with “pattern B”
  – Replace roman data with nothing

<xsl:value-of select="replace(
    replace(
      replace(
        translate(
          translate(
            translate(
              translate(normalize-space(.), $ascii, $spaces),
              $specialCharacters, ' '),
            $diacritics, ' '),
          $extendedLatin, ' '),
        $apos, ' '),
      '[A-Za-z]', ' '),
    ' ', '')"/>

Design of XSLT

• Test the length of the field after removing all non-vernacular data
  – XSLT elements: <xsl:choose> in combination with <xsl:when> & <xsl:otherwise>
  – XSLT functions: string-length()

<xsl:choose>
  <xsl:when test="string-length($subfieldString)=0">
    …… [series of actions when string-length equals 0]
  </xsl:when>
  <xsl:otherwise>
    …… [series of actions when string-length does not equal 0]
  </xsl:otherwise>
</xsl:choose>
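The two steps above (strip the roman data, then test the length of what remains) can also be sketched outside XSLT. The following is a minimal Python illustration, not the stylesheet itself. The function names `strip_roman` and `contains_vernacular` are hypothetical, and Unicode character properties stand in for the `$ascii`, `$specialCharacters`, `$diacritics`, and `$extendedLatin` variables, whose contents are not shown in the slides.

```python
import unicodedata


def strip_roman(text: str) -> str:
    """Drop ASCII, Latin letters, diacritics, punctuation, and spaces."""
    kept = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 128:
            continue  # plain ASCII
        if 0x02B0 <= code_point <= 0x02FF:
            continue  # spacing modifier letters used in transliteration
        if unicodedata.category(ch)[0] in ("M", "P", "S", "Z"):
            continue  # combining marks, punctuation, symbols, separators
        if "LATIN" in unicodedata.name(ch, ""):
            continue  # Latin letters with diacritics, e.g. "ĭ"
        kept.append(ch)
    return "".join(kept)


def contains_vernacular(text: str) -> bool:
    """Mirror the string-length test: non-zero length means vernacular data."""
    return len(strip_roman(text)) != 0


if __name__ == "__main__":
    for sample in ("Hong lou meng", "Voĭna i mir", "紅樓夢 / 曹雪芹"):
        print(sample, contains_vernacular(sample))
```

As in the XSLT, the decision rests only on whether anything is left after stripping, not on which script remains.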

Design of XSLT

• Field with no vernacular data

<!-- Test length of the field -->
<xsl:when test="string-length($subfieldString)=0">
  <xsl:element name="marc:datafield">
    <!-- Insert original values -->
    <xsl:attribute name="tag">
      <xsl:value-of select="$tag"/>
    </xsl:attribute>
    <xsl:attribute name="ind1">
      <xsl:value-of select="$ind1"/>
    </xsl:attribute>
    <xsl:attribute name="ind2">
      <xsl:value-of select="$ind2"/>
    </xsl:attribute>
    <!-- Insert linking tag (880) and original occurrence number -->
    <xsl:element name="marc:subfield">
      <xsl:attribute name="code">
        <xsl:text>6</xsl:text>
      </xsl:attribute>
      <xsl:text>880-</xsl:text>
      <xsl:value-of select="$subfield6"/>
    </xsl:element>
    <!-- Copy subfields other than $6 -->
    <xsl:copy-of select="*[not(self::marc:subfield[@code='6'])]"/>
  </xsl:element>
</xsl:when>

Design of XSLT

• Field with vernacular data

<xsl:otherwise>
  <xsl:element name="marc:datafield">
    <!-- Insert “880” as tag no. -->
    <xsl:attribute name="tag">
      <xsl:text>880</xsl:text>
    </xsl:attribute>
    <!-- Insert original values -->
    <xsl:attribute name="ind1">
      <xsl:value-of select="$ind1"/>
    </xsl:attribute>
    <xsl:attribute name="ind2">
      <xsl:value-of select="$ind2"/>
    </xsl:attribute>
    <xsl:element name="marc:subfield">
      <xsl:attribute name="code">
        <xsl:text>6</xsl:text>
      </xsl:attribute>
      <!-- Insert original tag no. as linking tag -->
      <xsl:value-of select="$tag"/>
      <xsl:text>-</xsl:text>
      <!-- Insert original occurrence number -->
      <xsl:value-of select="$subfield6"/>
      …… [Insert “Script Identification Code” & “Field Orientation Code”]
    </xsl:element>
    <xsl:copy-of select="*[not(self::marc:subfield[@code='6'])]"/>
  </xsl:element>
</xsl:otherwise>
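The two branches can be summarized in one small function. This is a hedged Python sketch rather than the author's stylesheet: the dictionary layout for a field and the name `relink` are assumptions made for illustration, and the incoming $6 is assumed to hold only the occurrence number, as in the non-standard records described earlier.

```python
def relink(field: dict, has_vernacular: bool) -> dict:
    """Rebuild $6 and, for vernacular data, retag the field as 880.

    `field` is a hypothetical structure:
    {"tag": "245", "ind1": "1", "ind2": "0", "subfields": [("6", "01"), ...]}
    """
    occurrence = next(value for code, value in field["subfields"] if code == "6")
    others = [(code, value) for code, value in field["subfields"] if code != "6"]
    if has_vernacular:
        # Vernacular data moves to 880 and links back to the original tag.
        tag, link = "880", f'{field["tag"]}-{occurrence}'
    else:
        # Transliteration keeps its tag and links forward to the 880 field.
        tag, link = field["tag"], f"880-{occurrence}"
    return {**field, "tag": tag, "subfields": [("6", link)] + others}


if __name__ == "__main__":
    roman = {"tag": "245", "ind1": "1", "ind2": "0",
             "subfields": [("6", "01"), ("a", "Hong lou meng")]}
    vernacular = {"tag": "245", "ind1": "1", "ind2": "0",
                  "subfields": [("6", "01"), ("a", "紅樓夢")]}
    print(relink(roman, False))
    print(relink(vernacular, True))
```

The script identification and field orientation codes are left out here; in the stylesheet they are appended to the 880 field's $6 after the occurrence number.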

Design of XSLT

• Insert “Script Identification Code” (MARC-8 environment)

<xsl:choose>
  <!-- Insert code for Arabic -->
  <xsl:when test="matches($basicArabic,substring($subfieldString,1,1)) or matches($extendedArabic,substring($subfieldString,1,1))">
    <xsl:text>/(3</xsl:text>
  </xsl:when>
  <!-- Insert code for Greek -->
  <xsl:when test="matches($greek,substring($subfieldString,1,1))">
    <xsl:text>/(S</xsl:text>
  </xsl:when>
  <!-- Insert code for Hebrew -->
  <xsl:when test="matches($basicHebrew,substring($subfieldString,1,1))">
    <xsl:text>/(2</xsl:text>
  </xsl:when>
  <!-- Insert code for Cyrillic -->
  <xsl:when test="matches($basicCyrillic,substring($subfieldString,1,1)) or matches($extendedCyrillic,substring($subfieldString,1,1))">
    <xsl:text>/(N</xsl:text>
  </xsl:when>
  <!-- Bengali, Tamil, Thai, Devanagari: this branch outputs nothing -->
  <xsl:when test="matches($bengali,substring($subfieldString,1,1)) or matches($tamil,substring($subfieldString,1,1)) or matches($thai,substring($subfieldString,1,1)) or matches($devanagar,substring($subfieldString,1,1))"/>
  <!-- Insert code for CJK -->
  <xsl:otherwise>
    <xsl:text>/$1</xsl:text>
  </xsl:otherwise>
</xsl:choose>
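The same first-character lookup can be sketched in Python. This is an illustration under stated assumptions, not the stylesheet's logic verbatim: whole Unicode blocks stand in for the `$basicArabic`, `$greek`, `$basicHebrew`, and similar variables (their actual contents are not shown in the slides), and the names `SCRIPT_RANGES`, `NO_CODE_RANGES`, and `script_identification_code` are hypothetical.

```python
# Unicode block ranges used as stand-ins for the stylesheet's character lists.
SCRIPT_RANGES = [
    ((0x0600, 0x06FF), "(3"),  # Arabic
    ((0x0370, 0x03FF), "(S"),  # Greek
    ((0x0590, 0x05FF), "(2"),  # Hebrew
    ((0x0400, 0x04FF), "(N"),  # Cyrillic
]

# Scripts for which the stylesheet inserts no script identification code.
NO_CODE_RANGES = [
    (0x0900, 0x097F),  # Devanagari
    (0x0980, 0x09FF),  # Bengali
    (0x0B80, 0x0BFF),  # Tamil
    (0x0E00, 0x0E7F),  # Thai
]


def script_identification_code(vernacular: str) -> str:
    """Return the '/code' suffix for $6, decided by the first character."""
    code_point = ord(vernacular[0])
    for (low, high), code in SCRIPT_RANGES:
        if low <= code_point <= high:
            return "/" + code
    if any(low <= code_point <= high for low, high in NO_CODE_RANGES):
        return ""
    return "/$1"  # everything else is treated as CJK, as in <xsl:otherwise>


if __name__ == "__main__":
    for sample in ("Война", "紅樓夢"):
        print(sample, script_identification_code(sample))
```

Like the stylesheet, the sketch looks only at the first vernacular character and falls back to the CJK code for anything it does not recognize.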

Design of XSLT

• Insert “Field Orientation Code”

<xsl:choose>
  <!-- Test if the subfield contains Arabic or Hebrew script -->
  <xsl:when test="contains($subfieldString,'[Arabic script]') or contains($subfieldString,'[Hebrew script]')">
    <!-- Insert Field Orientation Code -->
    <xsl:text>//r</xsl:text>
  </xsl:when>
</xsl:choose>

Design of XSLT

• Field 066 (MARC-8 environment)

  – Insert character set code in subfield $c
  – A single record may have more than one vernacular script => multiple subfield $c
    • XSLT element: <xsl:if>
      – Allows multiple matches
    • XSLT function: matches()
  – Processing logic
    • Turn the whole record into a text string
    • Remove all Latin data
    • Match vernacular script against normalized text string

Design of XSLT

• After removing all Latin data from the record

<!-- Replace Arabic characters with “3”, and likewise for the other scripts -->
<xsl:value-of select="
  translate(translate(translate(translate(translate(
  translate(translate(translate(translate(translate(.,
    $basicArabic, '3'),
    $extendedArabic, '4'),
    $basicCyrillic, 'N'),
    $extendedCyrillic, 'Q'),
    $Greek, 'S'),
    $basicHebrew, '2'),
    $bengali, 'b'),
    $tamil, 'ta'),
    $thai, 'th'),
    $devanagar, 'd')"/>

…

<!-- Test if the normalized data contains “3” -->
<xsl:if test="matches($normalizedWholeRecord,'3')">
  <!-- Insert “(3” as the character set code in $c -->
  <xsl:element name="marc:subfield">
    <xsl:attribute name="code">c</xsl:attribute>
    <xsl:text>(3</xsl:text>
  </xsl:element>
</xsl:if>

……

<!-- Test if any non-alpha-numeral characters exist -->
<xsl:if test="matches($normalizedWholeRecord,'[^A-Za-z0-9]')">
  <!-- Insert code for CJK -->
  <xsl:element name="marc:subfield">
    <xsl:attribute name="code">c</xsl:attribute>
    <xsl:text>$1</xsl:text>
  </xsl:element>
</xsl:if>
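The idea of independent tests, one per script, each able to add its own subfield $c, can be sketched in Python as follows. This is a simplified illustration, not the stylesheet: it checks Unicode block ranges directly instead of normalizing the record text with translate(), it ignores the basic/extended split for Arabic and Cyrillic, and it detects CJK by explicit ranges rather than by "anything left over". The names `CHARSET_RANGES` and `field_066_codes` are hypothetical.

```python
# One entry per character set code; ranges are Unicode blocks chosen for illustration.
CHARSET_RANGES = {
    "(3": [(0x0600, 0x06FF)],  # Arabic
    "(N": [(0x0400, 0x04FF)],  # Cyrillic
    "(S": [(0x0370, 0x03FF)],  # Greek
    "(2": [(0x0590, 0x05FF)],  # Hebrew
    "$1": [(0x3040, 0x30FF), (0x4E00, 0x9FFF), (0xAC00, 0xD7AF)],  # kana, CJK ideographs, Hangul
}


def field_066_codes(record_text: str) -> list:
    """Return one $c value for each script found anywhere in the record text."""
    codes = []
    for code, ranges in CHARSET_RANGES.items():
        # Each script is tested on its own, like a series of <xsl:if> elements,
        # so a record with several scripts yields several codes.
        if any(low <= ord(ch) <= high for ch in record_text for low, high in ranges):
            codes.append(code)
    return codes


if __name__ == "__main__":
    print(field_066_codes("Война и мир 紅樓夢"))
    print(field_066_codes("Hong lou meng"))
```

An empty result would mean no 066 field is needed for the record.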

Design of XSLT

• Factors affecting the design
  – Pre-load vs. post-load data clean up (HathiTrust workflow)
    • Mechanism to filter out non-multiscript records needed for pre-load data clean up
    • Construction of 949 overlay command*
  – MARC-8 vs. Unicode
    • Field 066 and Script identification code not allowed in Unicode environment
      – 2 separate XSLTs made
  – OCLC vs. MARC21 Standard
    • Representation of Bengali, Devanagari, Tamil, and Thai in field 066

* Innovative Millennium specific

Limitations & Unintended Consequences

• Processing of data represented by UTF-8 character number
  – \U+0e33\\U+0e43\\U+0e2b\\U+0e49\
• Vernacular scripts processed (MARC-8 environment)
• Handling of unlinked vernacular data
  – Implications on OPAC display

Questions?

Lucas Mak

Michigan State University Libraries
