linked bibliographic data - society.library.sh.cn

Linked Bibliographic Data: Publishing, Querying, Consuming and Mash-up



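Content and objectives

• Using linked bibliographic data as a worked example, this talk introduces the basic patterns for publishing, querying and consuming linked data, with the aim of giving an intuitive understanding of how the technology works.

• It focuses on five topics:

1. An overview of linked data

2. From cards to linked bibliographic data: a historical survey of the semantic architecture of bibliographic data

3. Semantic expression of linked bibliographic data: models and schemas

4. Querying linked data: SPARQL

5. Programming with linked bibliographic data: publishing, consuming and mash-up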

Developments?

Developments: linked bibliographic data projects in libraries, taking national libraries as examples

National Agricultural Library Thesaurus (USA) http://agclass.nal.usda.gov/agt.shtml

Web NDL Authorities - National Diet Library of Japan (Japan) http://id.ndl.go.jp/auth/ndla

British National Bibliography (BNB) (UK) http://bnb.data.bl.uk

Polythematic Structured Subject Heading System (Czech Republic) http://psh.ntkcz.cz/skos/home/html/en

Library of Congress Subject Headings (USA) http://id.loc.gov/authorities/

B3Kat - Library Union Catalogues of Bavaria, Berlin and Brandenburg (Germany) http://lod.b3kat.de

Deutsche Nationalbibliografie (DNB) (Germany) http://www.dnb.de/EN/datendienste/linkedData

datos.bne.es (Spain) http://datos.bne.es

data.bnf.fr - Bibliothèque nationale de France (France) http://data.bnf.fr

LIBRIS (Sweden) http://libris.kb.se

Hungarian National Library (NSZL) catalog (Hungary) http://nektar.oszk.hu/wiki/Semantic_web

Library of Congress Name Authority File (NAF) http://id.loc.gov/download/

Rådata nå! Norwegian personal name authorities as linked data (Norway, with the participation of the National Library of Norway). BIBSYS is a key supplier of products and services for higher educational institutions, other research institutions in Norway, public administrative institutions and the National Library of Norway. http://data.bibsys.no/data

Triples (number of triples per dataset)

National Agricultural Library Thesaurus 364996

Web NDL Authorities - National Diet Library of Japan 15000000

British National Bibliography (BNB) 84961180

Polythematic Structured Subject Heading System 100000

Library of Congress Subject Headings 4151586

B3Kat - Library Union Catalogues of Bavaria, Berlin and Brandenburg 570000000

Deutsche Nationalbibliografie (DNB) 12786555

datos.bne.es 58053215

data.bnf.fr - Bibliothèque nationale de France 6330000

LIBRIS 50000000

Hungarian National Library (NSZL) catalog 19300000

Rådata nå! 9370074

Total 830417606

SPARQL Endpoint

Web NDL Authorities - National Diet Library of Japan http://id.ndl.go.jp/auth/ndla/

British National Bibliography (BNB) http://bnb.data.bl.uk/sparql

B3Kat - Library Union Catalogues of Bavaria, Berlin and Brandenburg http://lod.b3kat.de/sparql

LIBRIS http://lab3.libris.kb.se/sparql

Hungarian National Library (NSZL) catalog http://setaria.oszk.hu/sparql

Rådata nå! http://data.bibsys.no/data/query_authority.html

Library of Congress Subject Headings http://api.talis.com/stores/lcsh-info/services/sparql

Outbound links

datos.bne.es links:dbpedia 36431

datos.bne.es links:dnb-gemeinsame-normdatei 76413

datos.bne.es links:lexvo 3112900

datos.bne.es links:libris 10884

datos.bne.es links:sudocfr 9725

datos.bne.es links:viaf 454068

Deutsche Nationalbibliografie (DNB) links:gnd 16734298

Deutsche Nationalbibliografie (DNB) links:iso639-2 3263366

Hungarian National Library (NSZL) catalog links:dbpedia 6285

Hungarian National Library (NSZL) catalog links:viaf 33709

Library of Congress Subject Headings links:stitch-rameau 55281

LIBRIS links:dbpedia 4669

LIBRIS links:lcsh 12586

LIBRIS links:viaf 248228

Polythematic Structured Subject Heading System links:dbpedia 3000

Polythematic Structured Subject Heading System links:lcsh 3000

Rådata nå! links:dbpedia 30346

Rådata nå! links:dnb-gemeinsame-normdatei 209681

Rådata nå! links:viaf 311154

Web NDL Authorities - National Diet Library of Japan links:lcsh 4545

Web NDL Authorities - National Diet Library of Japan links:viaf 2673

Outbound link totals (link target, number of source datasets, number of links)

links:gnd Total 1 16734298

links:iso639-2 Total 1 3263366

links:lexvo Total 1 3112900

links:libris Total 1 10884

links:stitch-rameau Total 1 55281

links:sudocfr Total 1 9725

links:dnb-gemeinsame-normdatei Total 2 286094

links:lcsh Total 3 20131

links:dbpedia Total 5 80731

links:viaf Total 5 1049832

Grand Total 24623242

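How should these figures be read?

Characteristics

What is linked bibliographic data?

• Linked bibliographic data is bibliographic data that is organised, generated and published with Semantic Web technologies, following the linked data principles.

• Its defining features:

– URIs are used to name the things that bibliographic data describes

– The URIs are HTTP URIs, so they can be referred to and dereferenced

– When a user (a person or a machine) dereferences a bibliographic object, useful information comes back, published in common standard forms such as RDF/XML and SPARQL

– The data includes URIs that link to external data, which strengthens discovery across the web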

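Characteristics of linked bibliographic data

• Characteristic 1: from records to triples

Card catalogue: cards

Machine-readable catalogue (MARC): records

Linked data: statements

Characteristic 2

• From describing information to describing relationships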

One of the key concepts of Linked Data is to represent data as a set of interlinked things. These things are referred to as objects of interest. They are things about which we can make statements.

-- Tim Hodson

Hodson, T. (2011, July 22nd). British Library Data Model: Overview. Retrieved June 24th, 2012, from Talis Systems: http://talis-systems.com/2011/07/british-library-data-model-overview/

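Characteristic 3

• Through the HTTP-based URI mechanism, bibliographic data is integrated into the web and becomes part of the infrastructure of the Web of Data, rather than a separate application built on top of the web

Characteristic 4

• Openness and reuse

Openness means using common protocols such as HTTP on the technical side and open licensing on the rights side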

Linked bibliographic data vs. traditional bibliographic data

Layer: new generation / traditional catalogue

Concept model: FRBR / Paris Principles

Definition layer: RDA / AACR2

Semantic layer: vocabularies (DC, SKOS, OWL) / MARC

Syntax: RDF

Coding layer: XML, JSON, etc.

Access layer: RESTful / Z39.50

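Semantics

A historical survey of the semantic architecture of bibliographic data

Type of catalogue: how meaning is expressed

Catalogue cards: punctuation (slash, comma, colon, full stop, dash, parentheses, space)

Electronic bibliographic data: MARC field tags; a single field cannot express a complete meaning on its own

Linked bibliographic data: RDF triples; subject, predicate and object form a complete sentence that expresses a complete meaning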

000 00808cam a22002658a 450
001 399195
005 20011123074558.0
008 890403s1990 nyuabcf b 00110 eng
020 __ |a 0393027082
035 __ |a 89009241
035 __ |9 BAA7243GL
040 __ |d CU
043 __ |a a-cc---
050 00 |a DS754 |b .S65 1990
082 00 |a 951/.03 |2 20
100 10 |a Spence, Jonathan D.
245 14 |a The search for modern China / |c by Jonathan D. Spence.
250 __ |a 1st ed.
260 0_ |a New York : |b Norton, |c c1990.
300 __ |a xxv, 876 p., [130] p. of plates : |b ill. (some col.), facsims., maps, ports. ; |c 24 cm.
500 __ |a Maps on lining pages.
504 __ |a Includes bibliographies and index.
651 _0 |a China |x History |y Qing dynasty, 1644-1912.
651 _0 |a China |x History |y 20th century.
984 __ |a 3839 |c 951 S74
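To make the contrast with triples concrete, here is a minimal Python sketch that splits one line of the record above into tag, indicators and subfields. It assumes the display convention used in this record (`|a` introduces a subfield); the function name `parse_marc_line` is purely illustrative and not part of any MARC library.

```python
def parse_marc_line(line):
    """Split a display-form MARC line into (tag, indicators, subfields)."""
    tag, rest = line[:3], line[4:]
    head, *parts = rest.split("|")
    indicators = head.strip()
    # Each part starts with the one-character subfield code.
    subfields = [(p[0], p[1:].strip()) for p in parts if p]
    return tag, indicators, subfields


field = parse_marc_line(
    "245 14 |a The search for modern China / |c by Jonathan D. Spence."
)
print(field)
```

The parsed field carries a title and a statement of responsibility, but nothing inside the field says which resource it describes. That only follows from the record the field sits in, which is the sense in which a single field does not make a complete statement.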

A statement in MARC data

A statement in linked bibliographic data

http://bnb.data.bl.uk/id/resource/011954620

http://purl.org/dc/terms/creator

http://bnb.data.bl.uk/id/person/SpenceJonathanD1936-
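The three URIs above are the subject, predicate and object of one statement. As a small sketch, the Python below writes that statement out as an N-Triples line. The helper name `to_ntriple` is an illustration only, and its literal escaping is deliberately minimal (backslash and double quote).

```python
def to_ntriple(subject, predicate, obj, literal=False):
    """Serialise one statement as an N-Triples line."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        o = '"%s"' % escaped
    else:
        o = "<%s>" % obj
    return "<%s> <%s> %s ." % (subject, predicate, o)


statement = to_ntriple(
    "http://bnb.data.bl.uk/id/resource/011954620",
    "http://purl.org/dc/terms/creator",
    "http://bnb.data.bl.uk/id/person/SpenceJonathanD1936-",
)
print(statement)
```

Unlike a MARC field, the line is self-contained: it names the resource, the relationship and the related thing, and each of the three can be dereferenced.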

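Semantic expression and publishing of linked bibliographic data

• Functional requirements analysis and modelling

• Defining URIs

• Semantic encoding

• Data storage

• Data publishing

Functional requirements analysis and data modelling

• Four basic modelling principles:

– Functional requirements principle: functional requirements are the starting point of data modelling

– Thing principle: Things are the core of the model

– Reuse principle: define reusable resources as Things, and reuse external resources wherever possible

– Extension principle: model by extending common data models, to improve interoperability

Use these four principles to analyse the following data models:

– The FRBR model

– Europeana Data Model v. 5.2.3

– The British Library data model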

FRBR or Not FRBR

Two approaches to data modelling: FRBR and non-FRBR

The non-FRBR view

Nobody talks about works, expressions and manifestations, so why describe our data that way?

Styles, Rob. Bringing FRBR Down to Earth…. 11 November 2009. http://dynamicorange.com/2009/11/11/bringing-frbr-down-to-earth/ (accessed July 1, 2012).

FRBR types

Koster, L. (2012, January 5th). Local library data in the new global framework. Retrieved June 26th, 2012, from COMMONPLACE.NET: http://commonplace.net/tag/frbr/


Non-FRBR modelling: the British Library Data Model

Image source: http://www.bl.uk/bibliographic/datafree.html/pdfs/bldatamodelbook.pdf

Europeana Data Model v. 5.2.3

The EDM class hierarchy. The classes introduced by EDM are shown in light blue rectangles. The classes in the white rectangles are re-used from other schemas; the schema is indicated before the colon.

Image source: http://pro.europeana.eu/edm-documentation

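Vocabulary structure

• Vocabularies are the main carrier of semantics in linked bibliographic data

• Compared with traditional bibliographic data, the vocabulary system has evolved from a single vocabulary to a system of multiple vocabularies

• Logical consistency is what allows different vocabularies to work together in one system

• Vocabulary reuse and semantic interoperability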

URI naming

• Clean URLs are purely structural URLs that contain only a path and a resource name, with no query string.

Ordinary URL: http://example.com/index.php?page=foo

Clean URL: http://example.com/foo

• Use clean URLs to name resources:

– http://bnb.data.bl.uk/id/resource/011954620

– http://bnb.data.bl.uk/id/concept/lcsh/ChinaHistoryQingdynasty1644-1912

– http://dbpedia.org/resource/Shanghai_Library

• An example of naming resources with URLs in RDF:

<rdf:Description rdf:about="http://bnb.data.bl.uk/id/resource/011954620">

<dct:creator rdf:resource="http://bnb.data.bl.uk/id/person/SpenceJonathanD1936-"/>

</rdf:Description>

• Permanent URLs (PURLs) and short URLs

• http://purl.oclc.org

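"数字图书馆的语义描述和服务升级" (Semantic description and service upgrade of digital libraries); author: 刘炜 (Liu Wei); http://www.kevenlw.name/kevenfoaf.rdf

Two ways of describing this with URIs and RDF

Author: 刘炜 (Liu Wei)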

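Storing linked bibliographic data

• Traditional database approach

RDF documents are generated dynamically, while the data is still managed in a relational database on the back end.

• RDF triple store

Bibliographic data is converted to RDF in advance and stored as triples. The British Library data, for example, is converted to RDF documents and published through the Talis platform.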

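Querying

Querying linked bibliographic data (SPARQL)

• SPARQL is a language for querying RDF data

• SPARQL is based on the RDF data model (triples)

• SPARQL works by pattern matching and can query any part of the RDF data model: subject, predicate or object

• A SPARQL endpoint is a RESTful web service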
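Pattern matching over triples can be sketched in a few lines of Python. This is a toy illustration, not how a SPARQL engine is implemented: the data is made up, the prefixed names are abbreviations chosen for readability, and `None` plays the role of a variable such as `?a`.

```python
# Toy data: abbreviated identifiers and values are assumptions for illustration.
TRIPLES = [
    ("bnb:011954620", "dct:creator", "bnb:SpenceJonathanD1936-"),
    ("bnb:011954620", "dct:title", "The search for modern China"),
    ("bnb:SpenceJonathanD1936-", "rdfs:label", "Spence, Jonathan D."),
]


def match(pattern, triples=TRIPLES):
    """Return the triples that fit an (s, p, o) pattern; None is a wildcard."""
    return [
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


print(match((None, "dct:creator", None)))      # fix the predicate
print(match(("bnb:011954620", None, None)))    # fix the subject
```

Because any of the three positions can be fixed or left open, the same mechanism answers "what do we know about this book?" and "which things are connected by this relationship?".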

SPARQL query template

PREFIX

SELECT(DESCRIBE,CONSTRUCT)

FROM

WHERE { OPTIONAL UNION}

ORDER BY

LIMIT

A simple SPARQL query

SELECT ?a ?b ?c

WHERE {?a ?b ?c} # the pattern must be a triple

LIMIT 10

Querying the British Library data

http://bnb.data.bl.uk/sparql

Querying DBpedia

http://dbpedia.org/sparql
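Since an endpoint is an ordinary HTTP service, a query can be sent as a GET request with the query text in the `query` parameter, as defined by the SPARQL Protocol. The sketch below only builds the URL and does not contact the server; the helper name `sparql_get_url` is an illustration.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint, query):
    """Build the GET URL for a SPARQL query (the 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(
    "http://dbpedia.org/sparql",
    "SELECT ?a ?b ?c WHERE {?a ?b ?c} LIMIT 10",
)
print(url)
```

Pasting the printed URL into a browser is equivalent to typing the query into the endpoint's web form, provided the endpoint is still available at that address.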

Querying subject, predicate and object

• Things are identified by URIs

• URIs are written in angle brackets

• Subject, predicate and object can all be queried, so relationships can be queried too

SELECT ?b ?c
WHERE {<http://bnb.data.bl.uk/id/resource/011954620> ?b ?c}
LIMIT 10

SELECT ?a ?c
WHERE {?a <http://purl.org/dc/terms/title> ?c}
LIMIT 10

SELECT ?c
WHERE {<http://bnb.data.bl.uk/id/resource/011954620> <http://purl.org/dc/terms/title> ?c}
LIMIT 10

SELECT ?b
WHERE {<http://bnb.data.bl.uk/id/resource/011954620> ?b <http://bnb.data.bl.uk/id/person/SpenceJonathanD1936->}
LIMIT 10

Multiple conditions and PREFIX

SPARQL supports queries with several conditions. The conditions are joined with " . ", which acts like AND.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?b ?c

WHERE {
<http://bnb.data.bl.uk/id/resource/011954620> ?b <http://bnb.data.bl.uk/id/person/SpenceJonathanD1936-> .
<http://bnb.data.bl.uk/id/person/SpenceJonathanD1936-> rdfs:label ?c .
}

LIMIT 10

OPTIONAL

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX dct:<http://purl.org/dc/terms/>

SELECT ?b ?v ?c

WHERE {<http://bnb.data.bl.uk/id/resource/011954620> ?b ?v.

OPTIONAL {?c dct:creator ?v}}

LIMIT 10
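The effect of OPTIONAL in the query above can be pictured as a left join. The Python sketch below uses made-up toy data (the `ex:` names are placeholders) and contrasts a required second pattern with an optional one. It is a simplified model of the semantics, not an engine.

```python
# Toy data: property/value pairs of one resource, and works listed by creator.
properties = [("dct:title", "Some title"), ("dct:creator", "ex:person1")]
created_by = {"ex:person1": ["ex:book1", "ex:book2"]}

# Second pattern required: only values that are somebody's dct:creator survive.
required = [(b, v, c) for b, v in properties for c in created_by.get(v, [])]

# Second pattern OPTIONAL: every (b, v) row is kept, and ?c stays unbound
# (None here) when nothing matches.
optional = [(b, v, c) for b, v in properties for c in created_by.get(v, [None])]

print(required)
print(optional)
```

The title row disappears from `required` but stays in `optional` with an empty third column, which is the behaviour that makes OPTIONAL useful for sparse data.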

OPTIONAL

First query:

select distinct ?a ?b where {?a ?b <http://dbpedia.org/class/yago/LibrariesInChina>} LIMIT 100

Then query:

select distinct ?a ?b ?c where {?a ?b <http://dbpedia.org/class/yago/LibrariesInChina> . ?a dbpprop:director ?c} LIMIT 100

Query with OPTIONAL:

select distinct ?a ?b ?c where {?a ?b <http://dbpedia.org/class/yago/LibrariesInChina> . OPTIONAL {?a dbpprop:director ?c}} LIMIT 100

UNION

• UNION joins two independent query conditions into a single query, much like a Boolean OR (set union).

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?b ?v ?n ?m

WHERE {{<http://bnb.data.bl.uk/id/resource/011954620> ?b ?v} UNION {<http://bnb.data.bl.uk/id/resource/GB9917627> ?n ?m}}

UNION

Compare:

select distinct ?a ?b ?c where {{?a ?b <http://dbpedia.org/class/yago/LibrariesInChina>} OPTIONAL {?a dbpprop:director ?c . ?a ?b <http://dbpedia.org/class/yago/LibrariesInChina>}} LIMIT 100

with:

select distinct ?a ?b ?c where {{?a ?b <http://dbpedia.org/class/yago/LibrariesInChina>} UNION {?a dbpprop:director ?c . ?a ?b <http://dbpedia.org/class/yago/LibrariesInChina>}} LIMIT 100

In the results of the second query, Shanghai Library and the National Library appear twice.
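One plausible reading of the duplicates is that UNION simply concatenates the solutions of its two branches. A library that has a director matches both branches, once with `?c` unbound and once with `?c` bound, and DISTINCT treats those as different rows. The Python sketch below models this with placeholder data (the `ex:` names are not real DBpedia resources).

```python
libraries = ["ex:LibraryA", "ex:LibraryB"]
directors = {"ex:LibraryA": "ex:Person1"}   # only LibraryA has a director

# First branch: every library, with ?c unbound.
left = [(a, None) for a in libraries]
# Second branch: only libraries that have a director, with ?c bound.
right = [(a, directors[a]) for a in libraries if a in directors]

# UNION concatenates the branches; DISTINCT removes identical rows only.
union_rows = sorted(set(left + right), key=str)
print(union_rows)
```

`ex:LibraryA` shows up in two rows that differ only in the last column, which matches the duplication seen for libraries that have a director recorded.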

FILTER

• FILTER sets a condition on the query. It is followed by a test, which can be an expression or a function call.

select distinct ?a ?b ?c where {?a ?b <http://dbpedia.org/class/yago/LibrariesInChina> ; geo:lat ?c filter(?c>30)} LIMIT 100

select distinct ?a ?b ?c where {?a ?b <http://dbpedia.org/class/yago/LibrariesInChina> OPTIONAL {?a dbpprop:established ?c filter (?c<1949)}} LIMIT 100

FILTER and functions

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

PREFIX dct:<http://purl.org/dc/terms/>

SELECT ?a ?b

WHERE {?a dct:subject ?b filter(regex(str(?b),"China"))}

limit 10
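What `filter(regex(str(?b),"China"))` does to a set of solutions can be sketched as follows. The bindings are invented for illustration. Python's `re.search` stands in for SPARQL's `regex`, which likewise matches anywhere in the string, although the two regular expression dialects are not identical.

```python
import re

# Toy solutions, as a SPARQL engine might hold them before filtering.
bindings = [
    {"a": "ex:book1", "b": "http://example.org/concept/ChinaHistory"},
    {"a": "ex:book2", "b": "http://example.org/concept/JapanHistory"},
]


def filter_regex(rows, var, pattern):
    """Keep the rows where pattern is found anywhere in str(row[var])."""
    return [row for row in rows if re.search(pattern, str(row[var]))]


kept = filter_regex(bindings, "b", "China")
print(kept)
```

The `str()` call matters in the query because `?b` is a URI, and `regex` operates on strings.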

DESCRIBE and CONSTRUCT

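• DESCRIBE and CONSTRUCT both return a complete set of RDF data (an RDF graph)

• DESCRIBE returns the full description of a resource, while CONSTRUCT builds a new RDF description from a template supplied by the user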

SPARQL and knowledge discovery?

Example 1: what Ba Jin and Mao Dun have in common

construct {dbpedia:Ba_Jin_and_Mao_Dun ?a ?x}
where {<http://dbpedia.org/resource/Ba_Jin> ?a ?x . <http://dbpedia.org/resource/Mao_Dun> ?a ?x}

Example 2: the relationship between Xiao Hong and Nikolai Gogol

construct {<http://dbpedia.org/resource/Xiao_Hong> ?x <http://dbpedia.org/resource/Nikolai_Gogol>}
where {<http://dbpedia.org/resource/Xiao_Hong> ?a ?x . {<http://dbpedia.org/resource/Nikolai_Gogol> ?a ?x} union {?x ?a <http://dbpedia.org/resource/Nikolai_Gogol>}}

The relationship between Xiao Hong and Nikolai Gogol: Lu Xun as the intermediary
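The first example (what Ba Jin and Mao Dun have in common) amounts to intersecting the predicate/object pairs of two subjects and attaching the result to a new subject. A rough Python equivalent over toy triples is shown below; the `ex:` data is invented and says nothing about what DBpedia actually holds.

```python
def common_statements(triples, s1, s2, new_subject):
    """Predicate/object pairs shared by s1 and s2, re-attached to new_subject."""
    po1 = {(p, o) for s, p, o in triples if s == s1}
    po2 = {(p, o) for s, p, o in triples if s == s2}
    return sorted((new_subject, p, o) for p, o in po1 & po2)


# Toy data for illustration only.
toy = [
    ("ex:Ba_Jin", "rdf:type", "ex:Writer"),
    ("ex:Ba_Jin", "ex:birthPlace", "ex:PlaceA"),
    ("ex:Mao_Dun", "rdf:type", "ex:Writer"),
    ("ex:Mao_Dun", "ex:birthPlace", "ex:PlaceB"),
]
shared = common_statements(toy, "ex:Ba_Jin", "ex:Mao_Dun", "ex:Ba_Jin_and_Mao_Dun")
print(shared)
```

Using the same variable names in both patterns of the WHERE clause is what forces the intersection; the CONSTRUCT template then decides what the resulting graph looks like.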

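Programming

System architecture for linked bibliographic data: RESTful

• RESTful is a style of web service

• The main principles of REST:

– Use HTTP methods as intended

• POST: create a resource (create)

• GET: retrieve a resource (retrieve)

• PUT: modify a resource (change or update)

• DELETE: delete a resource (remove or delete)

– Stateless interaction: each request is self-contained and the server returns the complete resource; no state information has to be passed along while a resource is being transferred

– Structured URIs (clean URIs)

– Return different formats of the same resource on demand, such as HTML, XML or JSON

Architecture of a RESTful publishing system for linked bibliographic data

Structured URIs

Configuring clean URLs (Apache .htaccess as an example)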

# Rewrite URLs of the form 'x' to the form 'index.php?q=x'.

RewriteCond %{REQUEST_FILENAME} !-f

RewriteCond %{REQUEST_FILENAME} !-d

RewriteCond %{REQUEST_URI} !=/favicon.ico

RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
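The effect of these rules can be imitated in a few lines of Python. This is a simplified model under stated assumptions: the set `existing` stands in for the file and directory checks (`!-f`, `!-d`), and the `QSA` flag, which appends any original query string, is not modelled. The function name `rewrite` is illustrative.

```python
def rewrite(path, existing=()):
    """Mimic the rules: a request that is not an existing file or directory,
    and is not /favicon.ico, is routed to index.php with the path in q."""
    if path in existing or path == "/favicon.ico":
        return path                       # served as-is
    return "index.php?q=" + path.lstrip("/")


print(rewrite("/id/resource/011954620"))
print(rewrite("/style.css", existing={"/style.css"}))
```

The clean URL is what clients see and link to, while a single script behind it decides what to return for each path.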

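Sample program for RESTful publishing of linked bibliographic data

<?php
// Read the media type requested by the client.
// Note: the switch only matches when Accept carries exactly one media type.
$accept = $_SERVER['HTTP_ACCEPT'];
header("Content-type:" . $accept);
switch ($accept) {
    case 'text/html':
        echo '<html xmlns="http://www.w3.org/1999/xhtml">';
        …. ….
        break;
    case 'application/rdf+xml':
        echo '<?xml version="1.0" encoding="UTF-8"?>';
        echo '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"……..</rdf:RDF>';
        break;
}
?>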

http://cloudlibrary.info/lodbib/lodbib_restful.php

http://www.w3.org/RDF/Validator/ARPServlet?URI=http%3A%2F%2Fcloudlibrary.info%2Flodbib%2Flodbib_restful.php&PARSE=Parse+URI%3A+&TRIPLES_AND_GRAPH=PRINT_TRIPLES&FORMAT=PNG_EMBED
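The sample program compares the whole Accept header with a single media type. Browsers usually send a list with quality values, so a slightly more tolerant selection step may be useful. The Python sketch below is one simple way to do it, not the method used by the sample site: it ignores wildcards such as `*/*` and falls back to a default, and the names `SUPPORTED` and `negotiate` are illustrative.

```python
SUPPORTED = ["text/html", "application/rdf+xml"]


def negotiate(accept_header, supported=SUPPORTED, default="text/html"):
    """Pick the supported media type with the highest q-value in Accept."""
    best, best_q = None, 0.0
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media = parts[0].lower()
        q = 1.0                                  # default quality
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        if media in supported and q > best_q:
            best, best_q = media, q
    return best or default


print(negotiate("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
print(negotiate("application/rdf+xml"))
```

The chosen value, rather than the raw header, would then drive both the Content-Type response header and the branch that generates the document.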

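Requesting an RDF document: an example

<?php
$type = $_GET["RadioGroup1"];   // media type chosen in the form, e.g. application/rdf+xml
$url = $_GET["URL"];            // URL of the resource to request
echo 'Resource URL: ' . htmlspecialchars($url) . '<br>';
echo 'ACCEPT: ' . htmlspecialchars($type) . '<br>';

// Send the request with the chosen media type in the Accept header.
$ch = curl_init($url);
$headers = array('Accept: ' . $type);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
?>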

http://cloudlibrary.info/lodbib/lodrequest.html
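The same request can be sketched with Python's standard library. `build_rdf_request` only prepares the request, so it can be inspected without network access, and `fetch` actually sends it. Both function names are illustrative, and whether a given server still answers at its URI is not something this sketch assumes.

```python
from urllib.request import Request, urlopen


def build_rdf_request(url, media_type="application/rdf+xml"):
    """Prepare a GET request that asks for a specific representation."""
    return Request(url, headers={"Accept": media_type})


def fetch(url, media_type="application/rdf+xml", timeout=10):
    """Send the request and return the response body as text."""
    with urlopen(build_rdf_request(url, media_type), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)


req = build_rdf_request("http://bnb.data.bl.uk/id/resource/011954620")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Changing only the `media_type` argument, for example to `text/html`, asks the same URI for a different representation of the same resource.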

Accessing a SPARQL endpoint (an example)

// Look up the English label and abstract of the resource whose homepage is $mashterm.
// $mashterm is inserted into the query as-is, so it is presumably passed in as a
// complete SPARQL term, e.g. a URI already wrapped in angle brackets.
public function checkdbpedia($mashterm) {
    $searchterm = 'select ?ab ?la where {{?x foaf:homepage ' . $mashterm . '} optional {?x rdfs:label ?la filter(lang(?la)="en")}.'
        . ' optional { ?x dbpedia-owl:abstract ?ab filter(lang(?ab)="en" )}}';
    $librarylist = $this->dbpediasearch($searchterm);
    if (!($librarylist)) return false;
    else {
        $libraryarray = json_decode($librarylist, true);
        return $libraryarray;
    }
}

// Send the query to the DBpedia endpoint; return the raw JSON response, or false on failure.
public function dbpediasearch($term) {
    $format = 'json';
    $query = "PREFIX dbp: <http://dbpedia.org/resource/> " . $term;
    $searchUrl = 'http://dbpedia.org/sparql?'
        . 'query=' . urlencode($query)
        . '&format=' . $format;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $searchUrl);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if (empty($response)) return false;
    else {
        if ($httpcode >= 200 && $httpcode < 300) return $response;
        else return false;
    }
}
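Once the endpoint has answered, the JSON has to be turned into rows. The Python sketch below reads the standard SPARQL JSON results layout (`head.vars` and `results.bindings`). The sample document is hand-written for illustration and is not an actual DBpedia response.

```python
import json


def rows_from_sparql_json(text):
    """Turn a SPARQL JSON result document into a list of {variable: value} dicts."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    # Variables left unbound by OPTIONAL are simply absent from a binding.
    return [
        {v: b[v]["value"] for v in variables if v in b}
        for b in doc["results"]["bindings"]
    ]


sample = """{"head": {"vars": ["ab", "la"]},
 "results": {"bindings": [
   {"la": {"type": "literal", "xml:lang": "en", "value": "Example Library"}}
 ]}}"""
rows = rows_from_sparql_json(sample)
print(rows)
```

In the sample, `?ab` comes from an OPTIONAL pattern that did not match, so the row contains only `la`; code that consumes the rows should expect missing keys.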

Parsing an RDF/XML document

function xml_to_array($file) {
    /* Request an RDF document */
    if (!($lcsh_str = $this->get_RDF_File($file, 'application/RDF+XML'))) {
        $this->is_success = false;
        return;
    }
    else
        $this->is_success = true;

    /* Create a new XML parser */
    $parser = xml_parser_create();
    xml_parse_into_struct($parser, $lcsh_str, $tags);
    xml_parser_free($parser); /* Free the XML parser */
    return $tags;
}

Parse result

0 /* namespace declarations */

Array ( [tag] => rdf:RDF [type] => open [level] => 1 [attributes] => Array ( [xmlns:rdf] => http://www.w3.org/1999/02/22-rdf-syntax-ns# [xmlns:rdfs] => http://www.w3.org/2000/01/rdf-schema# [xmlns:owl] => http://www.w3.org/2002/07/owl# [xmlns:skos] => http://www.w3.org/2004/02/skos/core# [xmlns:xl] => http://www.w3.org/2008/05/skos-xl# [xmlns:rda] => http://RDVocab.info/ElementsGr2/ [xmlns:frbrent] => http://RDVocab.info/uri/schema/FRBRentitiesRDA/ [xmlns:foaf] => http://xmlns.com/foaf/0.1/ [xmlns:ndl] => http://ndl.go.jp/dcndl/terms/ [xmlns:dct] => http://purl.org/dc/terms/ ) )

1 /* RDF tree structure */

Array ( [tag] => rdf:Description [type] => open [level] => 2 [attributes] => Array ( [rdf:about] => http://id.ndl.go.jp/auth/ndlna/00288347 ) )

2

Array ( [tag] => foaf:primaryTopic [type] => open [level] => 3 )

3

Array ( [tag] => foaf:Organization [type] => open [level] => 4 [attributes] => Array ( [rdf:about] => http://id.ndl.go.jp/auth/entity/00288347 ) )

4

Array ( [tag] => foaf:name [type] => complete [level] => 5 [value] => 国立国会図書館 )

5

Array ( [tag] => rda:dateOfEstablishment [type] => complete [level] => 5 [value] => 1948 )

6

Array ( [tag] => foaf:Organization [type] => close [level] => 4 )

7

Array ( [tag] => foaf:primaryTopic [type] => close [level] => 3 )

8

Array ( [tag] => dct:modified [type] => complete [level] => 3 [value] => 2012-02-03T14:08:15 )

RDF triples

• Flatten the tree structure of the RDF/XML data into triples

• [rdf:Description] + [rdf:about]: subject (S)

• [Tag]: predicate (P)

• [Value] or [rdf:resource]: object (O)

Five triples in total

S P O

http://id.ndl.go.jp/auth/ndlna/00288347 foaf:primaryTopic http://id.ndl.go.jp/auth/entity/00288347

http://id.ndl.go.jp/auth/ndlna/00288347 dct:modified 2012-02-03T14:08:15

http://id.ndl.go.jp/auth/entity/00288347 is a foaf:Organization

http://id.ndl.go.jp/auth/entity/00288347 foaf:name 国立国会図書館

http://id.ndl.go.jp/auth/entity/00288347 rda:dateOfEstablishment 1948
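The mapping from tree to triples can be sketched in Python with the standard XML parser. The sample document is a reconstruction from the parse result shown earlier, not the file served by NDL, and the walker only handles the constructs that appear in it (`rdf:about`, `rdf:resource`, nested node elements and plain text values). A real application would normally use an RDF library instead.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def uri(tag):
    """Turn ElementTree's '{namespace}local' form into a plain URI."""
    return tag[1:].replace("}", "", 1)


def node_triples(node, out):
    """Append the triples of one node element to out; return its subject."""
    subject = node.get("{%s}about" % RDF_NS)
    if uri(node.tag) != RDF_NS + "Description":      # typed node: the "is a" triple
        out.append((subject, RDF_NS + "type", uri(node.tag)))
    for prop in node:                                # children are property elements
        resource = prop.get("{%s}resource" % RDF_NS)
        if resource is not None:
            obj = resource
        elif len(prop):                              # nested node element
            obj = node_triples(prop[0], out)
        else:
            obj = (prop.text or "").strip()
        out.append((subject, uri(prop.tag), obj))
    return subject


def rdfxml_to_triples(text):
    out = []
    for node in ET.fromstring(text):                 # node elements under rdf:RDF
        node_triples(node, out)
    return out


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rda="http://RDVocab.info/ElementsGr2/"
         xmlns:dct="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://id.ndl.go.jp/auth/ndlna/00288347">
    <foaf:primaryTopic>
      <foaf:Organization rdf:about="http://id.ndl.go.jp/auth/entity/00288347">
        <foaf:name>国立国会図書館</foaf:name>
        <rda:dateOfEstablishment>1948</rda:dateOfEstablishment>
      </foaf:Organization>
    </foaf:primaryTopic>
    <dct:modified>2012-02-03T14:08:15</dct:modified>
  </rdf:Description>
</rdf:RDF>"""

triples = rdfxml_to_triples(SAMPLE)
for t in triples:
    print(t)
```

The nested `foaf:Organization` element contributes both its own triples and the object of `foaf:primaryTopic`, which is how one tree yields five flat statements.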

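Bringing the data together

• Use external resources to enrich bibliographic data

for example by integrating DBpedia data

• Integrate bibliographic data into other systems

Challenges

• Data standardisation

• Semantic interoperability

Anything else?

Thank you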