Intro to XML in libraries
Kyle Banerjee
Explanation of XML, how it is processed, and common examples of its application in libraries
Why do libraries use XML?
• Easy to share information
• Strict syntax and human readability make it easy to work with
• Create any structure you need
• Many tools for all operating systems
• Schema support
• Namespace support
Disadvantages
• Requires an external application
• Verbose
• Inefficient
• Picky – everything stops when data is not well formed
• No intrinsic data types
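Two of the points above, the strictness about well-formedness and the lack of intrinsic data types, can be seen with Python's standard library parser. This is only a minimal sketch: the sample strings, the `copies` element, and the `try_parse` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from any real system.
well_formed = "<book><title>Cats &amp; Dogs</title><copies>3</copies></book>"
not_well_formed = "<book><title>Cats & Dogs</title><copies>3</copies></book>"


def try_parse(xml_text):
    """Return the parsed root element, or None if the text is not well formed."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError:
        # A single unescaped "&" is enough to make the parser give up.
        return None


good = try_parse(well_formed)
bad = try_parse(not_well_formed)

# Element text always comes back as a string; the application has to
# decide that "3" is meant to be a number.
copies_text = good.findtext("copies")
copies = int(copies_text)
```

The parser does not try to recover from the bad input; it rejects the whole document, which is why a single stray character can halt a batch load.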
Encoded Archival Description (EAD)
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
NISO Circulation Interchange Protocol (NCIP)
<!DOCTYPE NCIPMessage PUBLIC "-//NISO//NCIP DTD Version 1.0//EN" "http://www.niso.org/ncip/v1_0/imp1/dtd/ncip_v1_0.dtd">
<NCIPMessage version="http://www.niso.org/ncip/v1_0/imp1/dtd/ncip_v1_0.dtd">
  <LookupUserResponse>
    <ResponseHeader>
      <FromAgencyId>
        <UniqueAgencyId>
          <Scheme>http://136.181.125.166:6601/IRCIRCD?target=get_scheme_values&amp;scheme=UniqueAgencyId</Scheme>
          <Value>zv229</Value>
        </UniqueAgencyId>
      </FromAgencyId>
      <ToAgencyId>
        <UniqueAgencyId>
          <Scheme>http://136.181.125.166:6601/IRCIRCD?target=get_scheme_values&amp;scheme=UniqueAgencyId</Scheme>
          <Value>melir</Value>
        </UniqueAgencyId>
      </ToAgencyId>
    </ResponseHeader>
… [rest of entry deleted]
MARCXML
<record xmlns="http://www.loc.gov/MARC21/slim">
<leader>00000cas a2200000 4500</leader>
<controlfield tag="001">1798471</controlfield>
<controlfield tag="008">750909d19722001sw qx p ob 0 a0eng</controlfield>
<datafield ind1=" " ind2=" " tag="010">
<subfield code="a">75640778</subfield>
</datafield>
<datafield ind1=" " ind2=" " tag="022">
<subfield code="a">0105-0397</subfield>
<subfield code="l">0105-0397</subfield>
<subfield code="2">1</subfield>
</datafield>
… [rest of record deleted]
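Because the record declares a default namespace, a parser sees every element as belonging to `http://www.loc.gov/MARC21/slim`, and lookups have to say so. The following is a sketch using Python's standard library; the sample is trimmed from the record above and closed so that it parses, and the `marc` prefix and the `subfield_values` helper are illustrative names, not part of MARCXML.

```python
import xml.etree.ElementTree as ET

# Namespace URI from the record above; the "marc" prefix is our own choice.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Sample trimmed from the record shown above and closed so it parses.
MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000cas a2200000 4500</leader>
  <controlfield tag="001">1798471</controlfield>
  <datafield ind1=" " ind2=" " tag="010">
    <subfield code="a">75640778</subfield>
  </datafield>
  <datafield ind1=" " ind2=" " tag="022">
    <subfield code="a">0105-0397</subfield>
    <subfield code="l">0105-0397</subfield>
  </datafield>
</record>"""


def subfield_values(record, tag, code):
    """Collect the text of every subfield `code` in datafields with `tag`."""
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [sf.text for sf in record.findall(path, NS)]


record = ET.fromstring(MARCXML)
control_number = record.findtext("marc:controlfield[@tag='001']", namespaces=NS)
issns = subfield_values(record, "022", "a")

# Without the namespace the same lookup finds nothing.
unqualified = record.find("datafield")
```

Forgetting the namespace is a common stumbling block: the unqualified lookup does not raise an error, it simply returns nothing.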
Dublin Core (DC)
<qdc:qualifieddc xmlns:qdc="http://epubs.cclrc.ac.uk/xmlns/qdc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://epubs.cclrc.ac.uk/xmlns/qdc/ http://epubs.cclrc.ac.uk/xsd/qdc.xsd">
<dc:creator>Huntington, C. L.</dc:creator>
<dc:title>Horseshoe Bend near Wolf Creek, Southern Pacific Railroad, Shasta Route</dc:title>
<dc:date>1908-00-00</dc:date>
<dc:date>1900-1909</dc:date>
<dc:subject>Railroad tracks; Forests; Railroad locomotives</dc:subject>
<dc:coverage>Josephine County (Ore.)</dc:coverage>
<dc:type>Image</dc:type>
<dc:source>Postcards</dc:source>
<dc:source>Gerald W. Williams Collection</dc:source>
<dc:title>Umpqua Album</dc:title>
<dcterms:isPartOf>WilliamsG:Horseshoe Bend</dcterms:isPartOf>
… [rest of record deleted]
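Dublin Core elements are repeatable (the record above has two dates and two titles), so code that reads them should expect lists rather than single values. The sketch below uses a few fields from the record above, with the wrapper closed and `xsi:schemaLocation` left out so the sample stays short; `dc_fields` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_NS = "http://purl.org/dc/elements/1.1/"

# A few fields from the record above; the wrapper is closed so it parses.
QDC = """<qdc:qualifieddc xmlns:qdc="http://epubs.cclrc.ac.uk/xmlns/qdc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:creator>Huntington, C. L.</dc:creator>
  <dc:title>Horseshoe Bend near Wolf Creek, Southern Pacific Railroad, Shasta Route</dc:title>
  <dc:date>1908-00-00</dc:date>
  <dc:date>1900-1909</dc:date>
  <dc:title>Umpqua Album</dc:title>
</qdc:qualifieddc>"""


def dc_fields(root):
    """Group the text of dc: child elements by element name, keeping repeats."""
    fields = defaultdict(list)
    prefix = "{" + DC_NS + "}"
    for child in root:
        # ElementTree stores tags as "{namespace-uri}localname".
        if child.tag.startswith(prefix):
            fields[child.tag[len(prefix):]].append(child.text)
    return dict(fields)


fields = dc_fields(ET.fromstring(QDC))
```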
Search / Retrieve via URL (SRU)
And enough other stuff to blow your mind
• RDF
• Darwin Core
• VRA Core
• MODS
• MADS
• PBCore
• Webapps and other cool stuff
XML is not a language
• It’s a grammar that specifies a structure for exchanging information
• XML cannot do anything by itself
• When most people talk about XML, they are actually referring to a family of related technologies
• Don’t confuse XML (a data structure standard) with content standards such as AACR2R/RDA, DACS, LCNAF, LCSH, MeSH, and AAT
Interpreting XML
• Common methods are Document Object Model (DOM) and Simple API for XML (SAX)
• DOM is more common and far more powerful. Best for smaller files and documents
• SAX is much faster and requires much less memory. Best for large files
DOM vs. SAX
XML Document
<?xml version="1.0"?>
<inventory>
  <book>
    <title>My Dog</title>
  </book>
  <book>
    <title>My Cat</title>
  </book>
</inventory>
DOM (tree structure)
inventory
  book
    title
      My Dog
  book
    title
      My Cat
SAX (linear events)
Start document
Start element: inventory
Start element: book
Start element: title
Characters: My Dog
End element: title
End element: book
Start element: book
Start element: title
Characters: My Cat
End element: title
End element: book
End element: inventory
End document
DOM basics
• Platform independent way to represent and interact with XML documents
• All nodes and relationships are accessible
• Great for generating and displaying documents (e.g. EAD), interpreting messages (e.g. NCIP, OAI-PMH)
• Must load entire document into memory – terrible for transferring millions of records
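As a rough illustration of these points, the sketch below loads the small inventory document with Python's `xml.dom.minidom`, walks the tree in both directions, and adds a node. The "My Fish" entry is invented for the example.

```python
from xml.dom import minidom

INVENTORY = """<?xml version="1.0"?>
<inventory>
  <book><title>My Dog</title></book>
  <book><title>My Cat</title></book>
</inventory>"""

# parseString builds the whole tree in memory before returning.
doc = minidom.parseString(INVENTORY)
root = doc.documentElement

# Any node can be reached from any other: walk down to the titles...
titles = [t.firstChild.data for t in root.getElementsByTagName("title")]

# ...and back up from a title to its parent and the document root.
first_title = root.getElementsByTagName("title")[0]
parent_name = first_title.parentNode.tagName
grandparent_name = first_title.parentNode.parentNode.tagName

# The tree can also be modified in place (hypothetical new entry).
new_book = doc.createElement("book")
new_title = doc.createElement("title")
new_title.appendChild(doc.createTextNode("My Fish"))
new_book.appendChild(new_title)
root.appendChild(new_book)
book_count = len(root.getElementsByTagName("book"))
```

Everything here happens on an in-memory tree, which is convenient for a single document or message and is also the reason the approach scales poorly to very large files.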
SAX (Simple API for XML)
• A de facto standard rather than a formally specified one (unlike the W3C DOM)
• Relies on events – detects beginnings/ends of elements, attributes, etc.
• Does not require loading file into memory
• Great for extracting info from large files but awkward for interpreting documents
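The event-driven style can be sketched with Python's `xml.sax` module. The handler below pulls titles out of the small inventory document; `TitleHandler` is an illustrative name, and a real job would pass a file to `xml.sax.parse` instead of parsing an in-memory string.

```python
import xml.sax

INVENTORY = b"""<?xml version="1.0"?>
<inventory>
  <book><title>My Dog</title></book>
  <book><title>My Cat</title></book>
</inventory>"""


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element as events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._chunks))
            self._in_title = False


handler = TitleHandler()
xml.sax.parseString(INVENTORY, handler)
```

The handler has to track its own state (here, whether it is inside a title), which is what makes SAX awkward for anything that depends on the relationships between distant parts of a document.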
Common Alternatives to XML
XML Document
<?xml version="1.0"?>
<inventory>
  <book>
    <title>My Dog</title>
  </book>
  <book>
    <title>My Cat</title>
  </book>
</inventory>
JSON
{"inventory": {
  "book": [
    { "title": "My Dog" },
    { "title": "My Cat" }
  ]
}}
Delimited
Inventory
Item type    Title
book         My Dog
book         My Cat
Why Delimited or JSON?
• Delimited
  – Easiest to parse
  – Works great with tabular data
  – Not good for arbitrary and nested structures
• JSON
  – Much simpler and easier to use
  – Bad for situations where markup languages are appropriate (e.g. documents)
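To show how little code these formats need, the sketch below reads the same inventory from delimited text and from JSON with Python's standard library. Two assumptions are made here: the delimited data is tab-separated, and the JSON holds the books in an array, since repeating the `"book"` key in one object causes Python's `json` module to keep only the last value.

```python
import csv
import io
import json

# Assumed tab-separated version of the inventory table.
DELIMITED = "Item type\tTitle\nbook\tMy Dog\nbook\tMy Cat\n"

# Assumed JSON version, with the books held in an array.
JSON_TEXT = '{"inventory": {"book": [{"title": "My Dog"}, {"title": "My Cat"}]}}'

# Delimited: one row per record, columns named by the header line.
rows = list(csv.DictReader(io.StringIO(DELIMITED), delimiter="\t"))
titles_from_rows = [row["Title"] for row in rows]

# JSON: parsed straight into dictionaries and lists.
data = json.loads(JSON_TEXT)
titles_from_json = [book["title"] for book in data["inventory"]["book"]]

# Repeating a key in one JSON object silently drops the earlier value.
collapsed = json.loads('{"book": {"title": "My Dog"}, "book": {"title": "My Cat"}}')
```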
XML = Data Duct Tape
• Very useful and is here to stay
• Best uses are documents, messaging, and data transport
• Can be used for almost anything, but is sometimes not a good choice
XML and Life after MARC
• Use of XML will expand as the role of the traditional catalog wanes
• Expect growth as libraries need to provide access to a greater variety of resources
• XML will be critical as linked data becomes more common
What You Should Do Now
• Be aware of what XML is
• Know what it is good for
• Learn specifics on an as-needed basis
Thank You!
Kyle Banerjee