Transcript
Page 1: Exposing MARC 21 Format for Bibliographic Data As Linked Data With Provenance

This article was downloaded by: [University of Alberta] on: 22 October 2014, at: 17:35. Publisher: Routledge. Informa Ltd, registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.

Journal of Library Metadata. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/wjlm20

Exposing MARC 21 Format for Bibliographic Data As Linked Data With Provenance
Sharma Kumar (a), Marjit Ujjal (b) & Biswas Utpal (a)
(a) Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, India
(b) Center for Internet Resource Management (CIRM), University of Kalyani, Kalyani, West Bengal, India
Published online: 20 Sep 2013.

To cite this article: Sharma Kumar, Marjit Ujjal & Biswas Utpal (2013) Exposing MARC 21 Format for Bibliographic Data As Linked Data With Provenance, Journal of Library Metadata, 13:2-3, 212-229, DOI: 10.1080/19386389.2013.826076

To link to this article: http://dx.doi.org/10.1080/19386389.2013.826076

PLEASE SCROLL DOWN FOR ARTICLE

Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to, or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions


Journal of Library Metadata, 13:212–229, 2013. Published with license by Taylor & Francis. ISSN: 1938-6389 print / 1937-5034 online. DOI: 10.1080/19386389.2013.826076

Exposing MARC 21 Format for Bibliographic Data As Linked Data With Provenance

SHARMA KUMAR
Department of Computer Science and Engineering, University of Kalyani, Kalyani, (West Bengal), India

MARJIT UJJAL
Center for Internet Resource Management (CIRM), University of Kalyani, Kalyani, (West Bengal), India

BISWAS UTPAL
Department of Computer Science and Engineering, University of Kalyani, Kalyani, (West Bengal), India

The library community has been using the MARC 21 standard to exchange library data for decades, and it has served this purpose well. However, with the proliferation of tools and technologies, people have come to expect data to be easily available for their own use. This is only possible by standardizing data representation formats and sharing data on the Web. The MARC 21 standard offers no way of distributing library metadata outside the library community. Moreover, these standards focus only on data representation and storage, which leaves the semantics of the data hidden from machines. This paper presents an approach to transitioning the MARC 21 Format for Bibliographic Data into RDF triple representation based on the linked data principles. The linked data principles proposed by Sir Tim Berners-Lee state how data alone can be shared and interlinked regardless of the documents that enclose them, forming a Web of Data. Further, automatic generation of provenance information for the library metadata is considered.

KEYWORDS linked data, MARC 21, provenance, Semantic Web

© Sharma Kumar, Marjit Ujjal, and Biswas Utpal
Address correspondence to Sharma Kumar, Department of Computer Science & Engineering, University of Kalyani, Kalyani, Nadia, West Bengal-741235, India. E-mail: [email protected]


Early in the 1960s, the Library of Congress (LOC) developed the first machine-readable cataloging project, called MARC (MAchine-Readable Cataloging). The Library of Congress, along with the British National Bibliography, later modified the MARC project and subsequently developed MARC 21.1 It is an international format that serves as the basis for the presentation and exchange of bibliographic data and related information. This standard facilitates the creation and dissemination of library resources. With the advent of online and machine-readable catalogs and client-server protocols such as Z39.50, SRW, and DIENST, the search-and-retrieval process has improved dramatically. An increasing number of online bibliographic databases are being created to distribute bibliographic data among libraries. Despite the various standards developed to facilitate data exchange and search and retrieval, these standards have certain limitations, such as providing data only for human consumption. Machines can read and serve up the data, but they do not know the semantics behind them; a machine is bound to the syntax and layout of the data. These standards also serve the data only within the library community. At present, much library legacy data is held in traditional file formats. In order to bring it into the open Web of Data and expose it outside the library community, the data need to be translated into Semantic Web standards as linked data. For this, the legacy library data need to be transformed into RDF (Berners-Lee, Hendler, & Lassila, 2001), integrated with linked data (Heath & Bizer, 2011), and their provenance information exposed. It is imperative to publish provenance metadata with the library data, as it reveals the trust, quality, and usability of the data on the Web.

The Semantic Web provides standards to make data available for both machine and human consumption. The Semantic Web is “an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation” (Berners-Lee et al., 2001). The Semantic Web gives importance to the data being processed. In order to model and represent the expressiveness of data, it provides technologies such as the Resource Description Framework (RDF), RDF Schema (RDFS), and the Web Ontology Language (OWL).

The objective of this paper is to discuss in detail the representation of MARC 21 Format for Bibliographic Data and how to make it available on the Web along with its provenance metadata. The first part of the paper discusses the process of converting MARC 21 Format for Bibliographic Data into linked data; the second part focuses on generating and storing provenance information.

INTRODUCTION TO LINKED DATA

Although it is possible to publish data on the Web using documents such as HTML or a description in a page, the data becomes obscure to machines


with these approaches. There is no interlinking among the data, leading to static data silos. In 2006, Sir Tim Berners-Lee coined the concept of linked data, an extension of the existing Semantic Web. Linked data is an approach for publishing, reusing, and sharing data on the Web. Data on the Web are typically not linked to each other; linked data allows interlinking and knowledge sharing among the data. Linked data represents data by using RDF, URIs, ontologies, and the HTTP access mechanism. A unique URI is assigned to each piece of data, and each of these URIs delivers information about the data's accessibility, purpose, and usability. Ontologies also play a pivotal role in linked data: they provide the knowledge needed to define the semantics of the data. In this way, the data presented on the Web can be consumed by both machines and humans. Sir Tim Berners-Lee, in Bizer, Heath, and Berners-Lee (2009), proposed four principles for publishing structured data on the Web:

1. Use URIs for naming things.
2. Use HTTP URIs, so people can look up those names.
3. When someone looks up a URI, provide useful information using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.

The use of URIs for naming things assists in uniquely identifying resources such as people, places, or other real-world entities. If library resources, such as a book, its author, place of publication, and name of the publisher, are represented using the above-mentioned rules, then each of them will have a unique HTTP URI. They can then be published on the Web as a Web of Data rather than a Web of Documents. In this way the library resources can be linked, reused, and integrated with data from other sources. Furthermore, ontologies facilitate the task of categorizing library resources. Bizer, Heath, and Cyganiak (2007) have discussed in detail the publishing of linked data on the Web, the design architecture, approaches to choosing URIs, and the setting of RDF links to other data sources.
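As a minimal illustration of the four principles, the following standard-library Python sketch mints HTTP URIs for a book and its author and emits N-Triples linking them to an outside source. The base namespace, resource names, and the DBpedia target are illustrative assumptions; only the DCTerms and RDFS property URIs are real vocabulary terms.

```python
# Sketch: minting dereferenceable HTTP URIs (principles 1-2) and emitting
# RDF triples with an outgoing link (principles 3-4). The base namespace
# and resource names are hypothetical.
from urllib.parse import quote

BASE = "http://example.org/library/"  # assumed namespace, not a real service

def mint_uri(kind: str, name: str) -> str:
    """Use HTTP URIs as names for real-world things."""
    return f"{BASE}{kind}/{quote(name.lower().replace(' ', '-'))}"

book = mint_uri("book", "Semantic Web Primer")
author = mint_uri("person", "Jane Doe")

# Describe the resource in RDF and include links to other URIs.
triples = [
    (book, "http://purl.org/dc/terms/title", '"Semantic Web Primer"'),
    (book, "http://purl.org/dc/terms/creator", f"<{author}>"),
    (author, "http://www.w3.org/2000/01/rdf-schema#seeAlso",
     "<http://dbpedia.org/resource/Jane_Doe>"),  # outgoing RDF link
]
for s, p, o in triples:
    print(f"<{s}> <{p}> {o} .")
```

Dereferencing each minted URI would then return the triples describing that resource, which is what makes the names "look-up-able" in the sense of principle 2.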

DATA PROVENANCE

Data provenance is a method of collecting relevant metadata about data. It provides information about an object regarding its origin, the method by which it was recorded, its format, and the language of the data representation. Buneman, Khanna, and Tan (2001) define data provenance, in the context of database systems, as follows: “Data provenance—sometimes called ‘lineage’ or ‘pedigree’—is the description of the origins of a piece of data and the process by which it arrived in a database.” Data provenance, which is


also known as tracing data back, means knowing data about data by tracing its origin. Data provenance assists in supplying adequate metadata, and it underwrites the quality, trustworthiness, and adequacy of the data. It also helps ensure that the information available from any source is valid, usable, authorized, and legal. In an open data environment such as the Web, people often find data lacking in provenance or metadata. The amount of bibliographic data is increasing; hence, in the future it may be difficult to trust the data and judge their usefulness. People may also want to know about the information available on the Web, the processes, and the entities involved in producing it. The amount of provenance information grows every day, so it may become hard for users to review it and make decisions. For this reason, it is essential to make provenance information easily processable and analyzable by machines, which requires attention to provenance representation and storage techniques. At present there are some standards, such as VoID (Vocabulary of Interlinked Datasets), to represent and store provenance. For concrete provenance information, a generalized provenance model is required. A provenance model provides terms and a core data model for representing the different types of entities, data, and processes involved in producing the provenance information. In this work the PROV2 provenance model has been used to define the different metadata terms and data in detail. The following sections discuss the representation and storage techniques and give a brief overview of the PROV model.

PROVENANCE REPRESENTATION AND STORAGE

Provenance representation and storage techniques have been reviewed by Marjit, Sharma, and Biswas (2012). Basically, the provenance of any data is represented by one of two methods: the annotation method and the inversion method. In the annotation method the metadata are precomputed and stored separately in a document. In the inversion method the metadata are collected on the fly based on user-defined queries. Provenance information may be stored within the same data storage system as the original data, or in a different storage location with other metadata such as VoID. In this paper, the annotation method has been followed; that is, the provenance of the RDF dataset is stored in a separate document. The representation of provenance and its storage are the two major challenges in providing provenance for linked data. Storing the provenance information in the data file or in the same data storage system may create management issues; in particular, scalability becomes difficult as the amount of provenance metadata grows. However, if the provenance information is stored in a separate document and the original record changes, then that change or update must also be reflected in the provenance information. The data may always be updated or


introduce new sorts of information; hence, the versioning of the provenance information should be kept in mind.

One way to implement the annotation method, in the context of the Semantic Web, is by using VoID. Detailed implementation of VoID is discussed by Alexander, Cyganiak, Hausenblas, and Zhao (2009a). VoID allows publishers to define the metadata of their dataset and publish it separately. VoID is also known as an ontology, or vocabulary, which provides a collection of classes and properties to define metadata about RDF datasets. VoID discloses everything about the nature and features of a dataset: its access mechanisms, statistical information, publishers, and the interlinking of data sources. The information it provides is categorized into general metadata, access metadata, structural metadata, and information about the interlinking between datasets. Two main concepts, or classes, are found in VoID:

Dataset. A dataset is a collection of RDF statements whose design is fully based on the linked data principles and that is published and maintained by a single data provider. It provides meaningful information on the Web and is hosted on a particular server. The publisher of the dataset should provide all the relevant information, such as SPARQL endpoints, URIs of RDF dumps, information about the vocabularies used, and other general metadata. The void:Dataset class is used to model a dataset instance.

Linkset. A linked dataset presents many outgoing links, called RDF links. These links interlink the source and target datasets so that consumers of the dataset can find more information; this is mandated by the fourth principle of linked data. A linkset is a collection of such RDF links. The void:Linkset class is used to model a linkset instance.
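The two VoID classes can be illustrated with a small description emitted as Turtle text. The dataset URIs, endpoint, and dump locations below are illustrative assumptions; the void: class and property names are the real vocabulary terms.

```python
# Sketch of a VoID description covering both void:Dataset and void:Linkset.
# All example.org URIs are hypothetical; void:/dcterms: terms are real.
void_desc = """\
@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.org/void#bibdata> a void:Dataset ;
    dcterms:title "Bibliographic dataset" ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:dataDump <http://example.org/dumps/bibdata.nt> .

<http://example.org/void#bibdata-dbpedia> a void:Linkset ;
    void:subjectsTarget <http://example.org/void#bibdata> ;
    void:objectsTarget <http://dbpedia.org/void#dbpedia> ;
    void:linkPredicate <http://www.w3.org/2002/07/owl#sameAs> .
"""
print(void_desc)
```

Publishing such a file alongside the dataset is exactly the "separate document" annotation method described above.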

PROV Data Model

The PROV2 data model states how things are created and come into existence. This model is applied to convey different provenance metadata terms, entities, and activities. PROV defines key concepts such as entities, activities, generation, agents, roles, derivations, plans, time, and alternate entities. These key concepts take part in different roles while performing various activities. Specifying the different types of entities clarifies which of them are involved in an activity, and the time concept records both when the activity occurred and when the provenance document was last modified. Using all these concepts helps users to find the exact dataset publishers, their tools, and the process by which the dataset came onto the Web. Figure 1 shows a provenance graph for a bibliographic dataset. The graph states that the origin of the dataset, “openmetadata.lib.harvard.edu/bibdata,” has been generated


FIGURE 1 Provenance graph. (Color figure available online.)

by a “Data Conversion” activity using the “dataset1” and “dataset2” entities. The “Data Conversion” activity was associated with the “Centre for Information Resource Management (CIRM),” which is an agent and the publisher of the dataset. In this way users come to know where the dataset was created, and by whom, before making any decisions. Apart from this, PROV can also be combined with other ontologies such as DCTERMS and FOAF. It also allows instances of the PROV data model to be serialized in XML by using the PROV-XML schema, so that provenance information can be exchanged among different communities using the XML format.
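The relationships in the provenance graph just described can be written down as PROV-O triples in Turtle. The fragment URIs for the activity, source datasets, and agent are illustrative assumptions; only the prov: terms are real vocabulary.

```python
# Sketch of the Figure 1 provenance graph as PROV-O Turtle. The #-fragment
# identifiers are hypothetical; prov: classes and properties are real.
prov = """\
@prefix prov: <http://www.w3.org/ns/prov#> .

<http://openmetadata.lib.harvard.edu/bibdata> a prov:Entity ;
    prov:wasGeneratedBy <#dataConversion> ;
    prov:wasAttributedTo <#CIRM> .

<#dataConversion> a prov:Activity ;
    prov:used <#dataset1>, <#dataset2> ;
    prov:wasAssociatedWith <#CIRM> .

<#CIRM> a prov:Agent .
"""
print(prov)
```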

MARC 21 FORMAT FOR BIBLIOGRAPHIC DATA

An excerpt from a MARC 21 record (the MARC 21 data are taken from the Harvard Library Bibliographic Dataset3) is shown in Figure 2. A MARC 21 record mainly comprises three elements, namely, the Record Structure, the Content Designation, and the Data Contents. Each MARC 21 record structure is in turn composed of three main components, namely, the Leader, the Record Directory, and the Variable Fields. The Leader consists of data elements that provide the information needed to process the record. The Record Directory contains the tags, starting locations, and lengths of each field within a record; it is an index of the locations of the fields within a record. The Variable Fields are identified by three-character numeric tags and contain the data. In the MARC 21 format the Variable Fields may be of two types: Variable Control Fields and Variable Data Fields. Control Fields begin with two


leading zeros (00X) and provide control information such as a Library of Congress (LOC) control number. Control Fields do not contain indicators or subfields. The tags from 01X-8XX that do not have two leading zeros identify the Variable Data Fields. Variable Data Fields give information related to the data content, such as ISBN, title, and author. The Variable Data Fields contain indicator positions, subfield codes, data elements, and field terminators. Each of these Control Fields, subfields, and indicator positions is important and has a specific meaning, so special attention to this information is required while parsing each record and its Variable Fields.

A SURVEY OF RELATED WORK

A number of prominent researchers and organizations have contributed work related to controlled vocabularies, classification schemes, library data, and the Semantic Web. Harper and Tillett (2007) present the use of various controlled vocabularies (such as the Dewey Decimal Classification and the Library of Congress Classification) in the Semantic Web; use of these vocabularies improves Semantic Web development. The Simple Knowledge Organization System (SKOS)4 also provides standards to represent knowledge organization systems using RDF. Converting MARC 21 records to their RDF equivalent by applying mapping rules between MARC 21 and RDF has been discussed by Styles, Ayers, and Shabir (2008). They show how the resulting RDF library resources, such as people, books, and places, can be linked to other sources such as Library of Congress Linked Data, DBpedia, and Geonames. A framework called eXtensible Catalog (XC), which facilitates the conversion of library legacy data to linked data, has been presented by Bowen (2010). Malmsten (2008, 2009) has illustrated the implementation of linked data for library resources available in the Swedish Union Catalog (LIBRIS). LIBRIS is based on a server acting as an RDF wrapper around the

FIGURE 2 Bibliographic data in MARC 21 format. (Color figure available online.)


Integrated Library Systems (ILS), which, upon request, extracts the MARC 21 records in MARC-XML. The resulting XML records are transformed into the desired format using the EXtensible Stylesheet Language (XSL).

Several communities and agencies have also released their library structured metadata as open RDF datasets, value vocabularies, and metadata element sets. A number of links are presented by the Library Linked Data Incubator Group.5 A few of them are the British National Bibliography (BNB),6 Europeana Linked Open Data,7 the Cambridge University Library dataset,8 the Hungarian National Library,9 the Library of Congress Subject Headings (LCSH),10 and the Biblioteca Nacional de España (BNE, Spanish National Library).

THE PROPOSED APPROACH

Figure 3 elucidates the workflow involved in the conversion of MARC 21 Format for Bibliographic Data into linked data. In step 1, the MARC 21 file is taken as the initial input. Subsequently, the MARC 21 file is parsed by the MARC 21 File Parser, which produces individual MARC 21 records. Each MARC 21 record is parsed by the record parser, which yields the Leader, the Control Fields, and the Variable Data Fields. The Leader, individual

FIGURE 3 Workflow of the conversion of MARC 21 format for bibliographic data into linked data.


Control Fields, and the Data Fields must be interpreted carefully. Our main intent is to automate the transition process and to minimize as much as possible the need to specify mapping files. Our approach comprises the following steps:

1. Given the MARC 21 file, verify that the file is valid and contains at least one record.
2. Parse each of the MARC 21 records.
3. Extract the processing information from the Leader and convert it into equivalent RDF properties.
4. Process the Variable Control Fields, extract the control information, and convert it to equivalent RDF terms.
5. Process the Variable Data Fields; extract control information, numbers, and coded general information; and convert them to RDF terms.
6. Process the Variable Data Fields, extract the bibliographic content from the relative subfield positions, and convert it to equivalent RDF terms.
7. Perform link generation, store the converted RDF data in the RDF store, and reveal the provenance information.
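The mapping steps above can be sketched as a small table-driven converter from parsed MARC fields to RDF triples. The mapping entries, record URI, and field values below are illustrative assumptions (a real converter covers far more tags and draws on the MarcOnt and RDA vocabularies discussed later); the DCTerms and BIBO property URIs themselves are real vocabulary terms.

```python
# Sketch of steps 3-6: map (tag, subfield) pairs from the record parser to
# RDF properties. The mapping and sample values are illustrative only.
MAPPING = {
    ("020", "a"): "http://purl.org/ontology/bibo/isbn",
    ("100", "a"): "http://purl.org/dc/terms/creator",
    ("245", "a"): "http://purl.org/dc/terms/title",
    ("260", "b"): "http://purl.org/dc/terms/publisher",
}

def to_triples(record_uri, fields):
    """fields: {(tag, subfield): value} as produced by the record parser.
    (A full converter would handle repeated tags; a dict is enough here.)"""
    for key, value in fields.items():
        prop = MAPPING.get(key)
        if prop:                    # unmapped fields are skipped in this sketch
            yield f'<{record_uri}> <{prop}> "{value}" .'

parsed = {("245", "a"): "Some title", ("100", "a"): "Doe, Jane"}
for t in to_triples("http://example.org/record/1", parsed):
    print(t)
```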

We begin with the MARC 21 file. While parsing, the individual records are encountered, and these records are in turn parsed in order to obtain the record information, such as the Leader, the Control Fields, the Variable Control Fields, and the Variable Data Fields. If the short title is missing at the very beginning of the record, the short title (246a) or the key title (222a) is extracted to obtain the unique identifier for the equivalent RDF resource of the record. A URI lookup is performed to get similar (seeAlso) resources from DBpedia and VIAF.

The Leader is processed and the positional data elements are extracted. The Marcont11 ontology is used to denote data elements such as record length, record status, encoding scheme, and bibliographic level. The DCTerms ontology is applied to denote the record-type data element. Figure 4 shows an excerpt of the record identifier and the processing information for the above MARC 21 record.

FIGURE 4 Record identifier and processing information. (Color figure available online.)


Variable Control Fields (00X) are a special part of the MARC 21 standard. Variable Control Fields have no subfields or indicators; each tag contains data, which in turn contain other data at relative character positions. The following control information is recorded using the DCTerms ontology: the control numbers assigned by the organization, the control number identifier, and the date and time of the latest transaction. The tags 006, 007, and 008 have not yet been implemented, since they need some extra steps to extract the data from their relative character positions.

The Variable Data Fields (01X-09X) are treated similarly to the Variable Control Fields. These fields provide control information, numbers (such as ISBN and ISSN), and other general information. They are directly mapped to equivalent RDF terms using the Marcont and DCTerms ontologies.

The Variable Data Fields (1XX, 2XX, 3XX, 4XX, 5XX, 6XX, 7XX, 8XX) are parsed at each of the subfield codes, converting them into RDF properties and subproperties. The mappings are made based on the semantics of the tags and the data elements. The Mapping Manager is responsible for mapping this information to the equivalent RDF terms. Figure 5 depicts an excerpt of MARC 21 data converted into an RDF resource. The Ontology Manager holds the ontologies, or controlled vocabularies, and supplies knowledge about the mapping between MARC 21 data and RDF terms. Ontologies are needed to choose the correct RDF terms in a particular domain so that the tasks of data representation and URI generation become easier. In the library domain there exist a number of ontologies, taxonomies, thesauri, and

FIGURE 5 Bibliographic resource in RDF format. (Color figure available online.)


TABLE 1 Mapping Between MARC 21 Data and RDF Terms

MARC 21 Elements | Comments | RDF Terms
Leader 00-04 | Record length | marcont:hasRecordLength
Leader 05 | Record status | marcont:hasRecordStatus
Leader 06 | Type of record | DCTerms:type
Leader 09 | Character coding scheme | marcont:hasEncodingScheme
CF 001 | LC Control Number | marcont:hasNumber
CF 005 | Date & time of latest transaction | marcont:hasDate
CF 008 | Fixed-length data elements | marcont:hasCoverage
020 $a | International Standard Book Number | marcont:hasISBN
100 $a | Personal name | marcont:hasAuthor
245 $a | Title | rdagroup1elements:keyTitle
245 $b | Remainder of title | rdagroup1elements:otherTitleInformation
260 $a | Place of publication | rdagroup1elements:placeOfPublication
260 $b | Name of publisher | rdagroup1elements:publishersName
300 $a | Extent | DCTerms:extent
300 $c | Dimension | rdagroup1elements:dimensions
500 $a | General note | marcont:hasNote
533 $a | Type of reproduction | rdagroup1elements:productionMethod
533 $b | Place of reproduction | rdagroup1elements:placeOfProduction

controlled vocabularies. In this approach, MarcOnt,12 the RDA Group 1 Element Vocabulary,13 DCTerms,14 FOAF,15 and BIBO16 have been used. Ontologies are selected based on the semantics of the MARC 21 Data Fields and their subfields. A MARC 21 record is treated as an RDF resource, and each data unit within a MARC 21 record is treated as a property, subproperty, or relationship. Table 1 shows a few RDF terms and their equivalent MARC 21 mappings. The resulting RDF terms are then stored in the RDF store. The challenging task in this process is the unique assignment of URIs to each of the resources and their relationship attributes, so that each resource and relationship attribute is uniquely identified throughout the system. A pictorial view of the RDF resource is sketched in Figure 6, where a bibliographic resource has an attribute that is itself another resource, which may be in the same system or in a different source. For example, a bibliographic record may have its author as a relationship attribute, its title as a literal attribute, and the name of the publisher as a relationship attribute. We arrive at step 3 only when we achieve the simplified form of the RDF resource. The task of link generation is performed in step 3. Whenever any author, location, or book-related data are encountered, a URI lookup is performed to link with other data sources such as DBpedia and VIAF. If a resource is already defined in another source, then that resource is dropped from the local version and linked with the bibliographic resource. Once we achieve the full linked data form of the bibliographic data, adhering to the four principles of


FIGURE 6 Bibliographic resource as RDF graph.

linked data as discussed before, the data are ready to be published and canbe shared on the Web.

RDF Link Generation

Once the RDF dataset is ready to be published into the RDF store, it needs to be properly linked with other sources before being published on the Web. These links are called RDF links (examples are rdfs:seeAlso, owl:sameAs, and foaf:knows). Setting RDF links ensures that consumers of the dataset can discover related information by following the outgoing links. Generally, RDF links are generated either manually or automatically (Heath & Bizer, 2011). Automatic links are generated using approaches such as SILK, a link discovery framework for the Web of Data proposed by Volz, Bizer, Gaedke, and Kobilarov (2009). In our approach, RDF links are set up during the transition process. Authors, as well as other resources such as place of publication or locality, are linked with the DBpedia and VIAF datasets. Figure 7 shows an excerpt of a generated linked dataset containing such links. Before generating the URI for a locality or author, the RDF store and the URI cache are queried. The URI cache is a local key-value store, which holds the URIs (values) of the resources (keys) that were downloaded from an outside source such as DBpedia or VIAF. If the resource exists in the


FIGURE 7 Bibliographic linked dataset. (Color figure available online.)

cache or the RDF store, then the URI is simply fetched and linked; otherwise the download manager downloads the URI from the desired source and stores it in the local store and the cache. Resources are defined only locally if the download manager fails to download the desired URI.
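The cache-then-download logic just described might look like the following sketch, where the external lookup is a stub standing in for real DBpedia/VIAF queries and all URIs except the DBpedia one are illustrative assumptions.

```python
# Sketch of the URI-cache lookup: check the cache, then an external source,
# and fall back to a locally minted URI. lookup_external is a stub, not a
# real DBpedia/VIAF client.
uri_cache = {}  # resource name -> URI previously fetched from outside

def lookup_external(name):
    """Stub for a DBpedia/VIAF lookup; a real one would query their APIs."""
    known = {"London": "http://dbpedia.org/resource/London"}
    return known.get(name)

def resolve_uri(name):
    if name in uri_cache:                 # 1. cache hit: reuse the URI
        return uri_cache[name]
    uri = lookup_external(name)           # 2. try the external source
    if uri is None:                       # 3. define the resource locally
        uri = "http://example.org/place/" + name.replace(" ", "_")
    uri_cache[name] = uri
    return uri

print(resolve_uri("London"))   # found externally: DBpedia URI is reused
print(resolve_uri("Kalyani"))  # not found in the stub: local URI is minted
```

Caching the downloaded URIs means repeated occurrences of the same author or place across records resolve without further network requests, which matters when converting a large catalog.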

Provenance Generation

Once the linked data version of the dataset is ready, we move to the step of generating the provenance information. Our approach begins with the general metadata, such as the dataset title, description, name of the publisher, and source of origin, which are entered manually. The Provenance Handler also takes the linked dataset as input from the LD Publisher. Based on the general metadata and the linked dataset, it generates the provenance information: it automatically processes the linked dataset and produces the statistical information, the names and number of vocabularies used, and the information about the interlinked datasets. To represent the provenance information, the PROV2 data


FIGURE 8 Provenance information. (Color figure available online.)

model along with VoID, DCTerms, and FOAF have been used. The VoID backlink17 approach is used for interlinking the linked dataset and the VoID file using the property void:inDataset. Such provenance information is shown in Figure 8.
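A minimal sketch of the kind of VoID/PROV/DCTerms description the Provenance Handler emits, including the void:inDataset backlink from each published resource to the dataset description. The function name, example URIs, and triple values are hypothetical; this builds the Turtle by hand rather than using the authors' tooling.

```python
def provenance_turtle(title, publisher, triple_count, dataset_uri, resource_uris):
    """Emit a small VoID/PROV/DCTerms dataset description in Turtle,
    plus one void:inDataset backlink per published resource."""
    lines = [
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "",
        # General metadata (entered manually in the authors' workflow).
        f"<{dataset_uri}> a void:Dataset, prov:Entity ;",
        f'    dcterms:title "{title}" ;',
        f'    dcterms:publisher "{publisher}" ;',
        # Statistics computed automatically from the linked dataset.
        f"    void:triples {triple_count} .",
        "",
    ]
    # VoID backlinks: each resource points back at its dataset description.
    for uri in resource_uris:
        lines.append(f"<{uri}> void:inDataset <{dataset_uri}> .")
    return "\n".join(lines)
```

The backlink is what lets a consumer who dereferences a single bibliographic resource discover the VoID file, and from there the provenance of the whole dataset.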

Content Negotiation

When a particular resource's URI is keyed into a browser address field, the browser should be able to access the human-readable representation (HTML) as well as the machine-readable representation (RDF/XML) of the resource, depending upon the request's HTTP Accept header. This process is normally called content negotiation. The RDF dataset as shown in Figure 7 is difficult for humans to interpret. Though these data are highly structured and semantically rich, they still need to be presented in a well-formed display so that humans can easily view and interpret them. The RDF representation of a resource is suitable for machines, whereas the HTML representation is suitable for humans, such as is shown in


FIGURE 9 HTML representation of bibliographic resource. (Color figure available online.)

Figure 9. Content negotiation is implemented by dereferencing HTTP URIs using the 303-redirect strategy. The necessary steps for handling URIs of real-world entities are discussed by Heath and Bizer (2011). Content negotiation has been implemented whereby, upon receiving a request for a bibliographic resource, the Content Manager inspects the request's HTTP Accept header. If the Accept header contains text/html, it returns the human-readable information in HTML format; if it contains application/rdf+xml, the resource is returned in RDF format.
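The Accept-header dispatch described above can be sketched as follows. The function name, redirect paths, and return shape are illustrative assumptions, not the authors' implementation.

```python
def negotiate(accept_header):
    """Pick a representation based on the HTTP Accept header and
    303-redirect to the corresponding document."""
    accept = (accept_header or "").lower()
    if "application/rdf+xml" in accept:
        # Machine-readable: redirect to the RDF/XML document.
        return ("303 See Other", "/data/resource.rdf")
    if "text/html" in accept or "*/*" in accept:
        # Human-readable: redirect to the HTML page (cf. Figure 9).
        return ("303 See Other", "/page/resource.html")
    return ("406 Not Acceptable", None)
```

The 303 status matters here: the URI identifies a real-world entity (a book, an author), so the server redirects to a separate document about it rather than serving the entity's URI directly.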

EVALUATION

We have evaluated our approach on various MARC 21 datasets, some of which are smaller in size. We used the Harvard Library Bibliographic Dataset3 provided by Harvard Library. The datasets are publicly available and contain more than 12 million bibliographic records of different categories such as books, journals, electronic resources, audio, video, and other materials. The experiments were performed on a 64-bit 2 GHz Intel Core i7 processor with 4 GB of RAM running Mac OS X 10.8.3. We assessed


TABLE 2 Evaluation Results

Dataset     No. of MARC 21   Time to convert into RDF    Time to convert into linked    No. of RDF
            records          (without links) (in sec)    data (with links) (in sec)     resources

Dataset 1   100000           2591                        61882                          96300
Dataset 2   100000           2754.5                      44838.5                        100000
Dataset 3   100000           2477.9                      42044.9                        98800
Dataset 4   100000           2622                        43303.3                        98500
Dataset 5   100000           2523.9                      52062.2                        98500
Dataset 6   100000           2651                        41386.9                        99600

the first 100,000 MARC 21 records from each of the first six datasets. Table 2 shows the evaluation results. It has been observed that some records are lost in the converted RDF/linked datasets because of duplicate titles in the records: we have used the record title (246a or 222a) as the unique identifier throughout the system. This situation can be overcome by detecting duplicate titles and appending some sort of unique character to them. It has also been observed that the time to convert MARC 21 Format for Bibliographic Data into RDF without RDF links is always much less than the time to convert it into RDF with links, because in the former case no links are fetched from outside sources. However, a complete framework that also performs link generation needs more conversion time. As shown in Table 2, the time to convert MARC 21 Format for Bibliographic Data into linked data is roughly 16 to 24 times greater than the time needed to convert it into RDF without links. This is mainly because, for each MARC 21 record, there are five to six network fetch operations to fetch links from outside sources such as DBpedia and VIAF. This situation could be mitigated by choosing a separate link generation framework, such as Silk (Volz et al., 2009); with a separate framework, however, the dataset would be detached from the main framework and the task of provenance generation would become difficult. Our work is at a very basic stage, and further research is required to optimize both the results and the process.
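The suggested fix for the record loss, detecting duplicate titles and appending a unique suffix so each record keeps a distinct identifier, might look like this. The helper name and the numeric-suffix scheme are assumptions for illustration.

```python
def deduplicate_titles(titles):
    """Return identifiers that stay unique even when titles repeat.
    The first occurrence keeps the title; repeats get a numeric suffix."""
    seen = {}
    unique = []
    for title in titles:
        count = seen.get(title, 0)
        seen[title] = count + 1
        unique.append(title if count == 0 else f"{title}_{count}")
    return unique
```

Applied before URI minting, this would let all 100,000 records in each dataset survive conversion instead of the 96,300 to 99,600 reported in Table 2.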

CONCLUSION

In this paper we have presented an approach to transforming the MARC 21 Format for Bibliographic Data into linked data. RDF resources are generated at the minimum level by providing a built-in mapping between RDF and MARC 21 data. We conclude that a completely automated process is possible with a strong knowledge of the mapping. In the current approach we endeavored to automate RDF resource generation and link generation at the same time.


In addition, the provenance information, or metadata, reveals the trustworthiness and quality of the data. Further research is needed to make this process robust, so that it eliminates the data loss and creates more links to other data sources. Eventually, linked data connects library data with the Web, and the data is smoothly exposed outside the library domain. We believe that the efficient use of SPARQL queries and SPARQL endpoints will bring many benefits for searching and retrieving library data. Our research is still in progress. In the future we will implement the SPARQL endpoint and support other formats such as MARC 21 Format for Holdings Data, Authority Data, Classification Data, and Community Information.

NOTES

1. http://www.loc.gov/marc/bibliographic/lite/genintro.html
2. http://www.w3.org/TR/prov-primer/
3. http://openmetadata.lib.harvard.edu/bibdata
4. http://www.w3.org/TR/skos-primer/
5. http://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset-20111025/
6. http://datahub.io/dataset/bluk-bnb
7. http://datahub.io/dataset/europeana-lod
8. http://datahub.io/dataset/culds_1
9. http://datahub.io/dataset/hungarian-national-library-catalog
10. http://id.loc.gov/authorities/subjects.html
11. http://www.marcont.org/ontology/2.1#
12. http://semdl.info/books/2/appendices/G
13. http://metadataregistry.org/schema/show/id/1.html
14. http://dublincore.org/documents/dcmi-terms/
15. http://xmlns.com/foaf/spec/
16. http://bibliontology.com/specification
17. http://www.w3.org/TR/void/#backlinks

REFERENCES

Alexander, K., Cyganiak, R., Hausenblas, M., & Zhao, J. (2009a). voiD guide: Using the vocabulary of interlinked datasets. Community Draft, voiD working group. Retrieved from http://rdfs.org/ns/void-guide/

Alexander, K., Cyganiak, R., Hausenblas, M., & Zhao, J. (2009b). voiD, the vocabulary of interlinked datasets. Community Draft, voiD working group. Retrieved from http://rdfs.org/ns/void/

Berners-Lee, T., Hendler, J., & Lassila, O. (2001, May). The Semantic Web. Scientific American. Retrieved from http://www-sop.inria.fr/acacia/cours/essi2006/Scientific%20American_%20Feature%20Article_%20The%20Semantic%20Web_%20May%202001.pdf

Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked Data: The story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3), 1–22.

Bizer, C., Heath, T., & Cyganiak, R. (2007). How to publish linked data on the Web. Retrieved from http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/


Bowen, J. (2010, October 20–22). Moving library metadata toward Linked Data: Opportunities provided by the eXtensible Catalog. In DCMI '10: Proceedings of the 2010 International Conference on Dublin Core and Metadata Applications. Retrieved from http://dcpapers.dublincore.org/pubs/article/download/1010/979

Buneman, P., Khanna, S., & Tan, W. C. (2001, January). Why and where: A characterization of data provenance. In Proceedings of the 8th International Conference on Database Theory (ICDT), London, UK (pp. 316–330). London: Springer.

Harper, C. A., & Tillett, B. B. (2007). Library of Congress controlled vocabularies and their application to the Semantic Web. Cataloging & Classification Quarterly, 43(3/4), 8–9.

Heath, T., & Bizer, C. (2011). Linked Data: Evolving the Web into a global data space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers.

Malmsten, M. (2008, September 22–26). Making a library catalogue part of the Semantic Web. In Proceedings of the International Conference on Dublin Core and Metadata Applications. Retrieved from http://www.kb.se/dokument/Libris/artiklar/Project%20report-final.pdf

Malmsten, M. (2009, August). Exposing library data as linked data. Presented at the IFLA satellite preconference sponsored by the Information Technology Section, "Emerging trends in technology: Libraries between Web 2.0, the Semantic Web and search technology." Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.181.860&rep=rep1&type=pdf

Marjit, U., Sharma, K., & Biswas, U. (2012). Provenance representation and storage techniques in Linked Data: A state-of-the-art survey. International Journal of Computer Applications, 38(9), 0975–8887.

Styles, R., Ayers, D., & Shabir, N. (2008, January 27). Semantic MARC, MARC21 and the Semantic Web. In Proceedings of WWW. Retrieved from http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-369/paper02.pdf

Volz, J., Bizer, C., Gaedke, M., & Kobilarov, G. (2009, April 20). Silk: A link discovery framework for the Web of data. In Proceedings of the 2nd Linked Data on the Web Workshop. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.184.2701&rep=rep1&type=pdf
