datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

FOSDEM 5/02/2011 1 With the help of the Datalift team And the support of the French National Research Agency Datalift: A Catalyser for the Web of Data François Scharffe LIRMM/CNRS/University of Montpellier [email protected] @lechatpito


Page 1: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

FOSDEM 5/02/2011 1

With the help of the Datalift team
And the support of the French National Research Agency

Datalift: A Catalyser for the Web of Data

François Scharffe
LIRMM/CNRS/University of Montpellier

[email protected] @lechatpito

Page 2: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

The data revolution is on its way!

As Open Data meets the Semantic Web

Page 3: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

The promises of linked-data

Page 4: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Richer Applications

Linked Data Lite | the Web on Steroids 1.0 (iPhone)

Page 5: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Richer applications

BBC Programmes

Page 6: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

More precise search and QA

Page 7: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Making your data 5 stars

http://www.w3.org/DesignIssues/LinkedData.html

Page 8: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

So, how to lift data?

How to publish data on the Web as linked-data?

● Basic principles, Tim Berners-Lee [2006] (Design Issues)

– Use URIs to identify things (not only documents)
– Use HTTP URIs
– When dereferencing URIs, return a description of the resource
– Include links to other resources on the Web

Page 9: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Welcome aboard the data lift

Published and interlinked data on the Web

Applications

Interconnection

Publication infrastructure

Data conversion

Vocabulary selection

Raw data

Page 10: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Datalift

Datasets publication

R&D to automate the publication process

Tool suite to help publish data

Training, tutorials, data publication camps

Page 11: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

SemWebPro 18/01/2011 11

1st floor - Selection

Page 12: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

My friends' vocabularies …

Ø What is a (good) vocabulary for linked data?

§ Usability criteria

Simplicity, visibility, sustainability, integration, coherence …

Ø Different types of vocabularies

§ metadata, reference, domain, generalist …

§ The pillars of Linked Data: Dublin Core, FOAF, SKOS

Ø Good and less good practices

§ E.g. BBC Programmes vs legislation.gov.uk

§ Vocabulary of a Friend: networked vocabularies

Ø Linguistic problems

§ 99% of existing vocabularies are in English

§ Terminological approach: which vocabularies for « Event », « Organization »?

Page 13: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011


Did you say « vocabulary »?

… And why not « ontology »?

§ Or « schema » or « metadata schema »?

§ Or « model » (of data? of the world?)

Ø All these terms are used and justifiable

They are all « vocabularies »

§ They define types of objects (or classes) and the properties (or attributes) attached to these objects.

§ Types and attributes are logically defined and named using natural language

§ A (semantic) vocabulary is an explicit formalization of concepts existing in natural language

Page 14: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Vocabularies for linked data

Ø Are meant to describe resources in RDF

Ø Are based on one of the standard W3C languages

§ RDF Schema (RDFS)

• For vocabularies without too much logical complexity

§ OWL

• For more complex ontological constructs

§ These two languages are compatible (almost)

Ø They can be composed « ad libitum »

§ One can reuse a few elements of a vocabulary

§ The original semantics have to be followed
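To make the « ad libitum » composition concrete, here is a minimal sketch that describes one document with terms taken from two vocabularies, Dublin Core Terms and FOAF, and serializes the result as N-Triples. The two namespace URIs are the public ones; the `example.org` resources, the title and the name are hypothetical and only serve the illustration.

```python
# Namespaces of two widely used vocabularies (public namespace URIs).
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical resources, used only for illustration.
DOC = "http://example.org/doc/slides"
AUTHOR = "http://example.org/person/alice"


def uri(value):
    """Format a URI as an N-Triples term."""
    return "<" + value + ">"


def literal(value):
    """Format a plain string literal, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def triple(s, p, o):
    """Serialize one triple as an N-Triples line."""
    return f"{s} {p} {o} ."


# Only a few elements of each vocabulary are reused, each with its original meaning.
triples = [
    triple(uri(DOC), uri(DCTERMS + "title"), literal("A sample document")),
    triple(uri(DOC), uri(DCTERMS + "creator"), uri(AUTHOR)),
    triple(uri(AUTHOR), uri(FOAF + "name"), literal("Alice")),
]

for line in triples:
    print(line)
```

The document is typed with neither vocabulary as a whole: it simply borrows `title` and `creator` from one and `name` from the other.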

Page 15: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

What makes a good vocabulary?

Ø A good vocabulary is a used vocabulary

§ Data published on CKAN give an idea of vocabulary usage

§ Example: list of datasets using FOAF http://xmlns.com/foaf/0.1/

Ø Other usability criteria

§ Simplicity and readability in natural language

§ Elements documentation (definition in natural language)

§ Visibility and sustainability of the publication

§ Flexibility and extensibility

§ Semantic integration (with other vocabularies)

§ Social integration (with the user community)

Page 16: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

A vocabulary is also a community

Ø Bad (but common) practice

● Build a lonely vocabulary

– For example as a research project
– Without basing it on any existing vocabulary

§ Publish it (or not) and then forget about it

§ Not care about its users

Ø A good vocabulary has an organic life

§ Users and use cases

§ Revisions and extensions

§ Like a « natural » vocabulary

Page 17: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Types of vocabularies

Ø Metadata vocabularies

§ Used to annotate other vocabularies

• Dublin Core, Vann, cc REL, Status

Ø Reference vocabularies

§ Provide « common » classes and properties

• FOAF, Event, Time, Org Ontology

Ø Domain vocabularies

§ Specific to a domain of knowledge

• Geonames, Music Ontology, WildLife Ontology

Ø « general » vocabularies

§ Describe « everything » at an arbitrary detail level

• DBpedia Ontology, Cyc Ontology, SUMO

Page 18: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Vocabulary of a Friend

Ø http://www.mondeca.com/foaf/voaf

Ø A simple vocabulary...

Ø To represent interconnections between vocabularies

Ø A single entry point to the vocabularies and datasets of the Linked Data Cloud

Ø Ongoing work in Datalift

Page 19: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011


2nd floor - Conversion

Page 20: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

URL Design and URL Pattern

Ø Good practices for linked-data

§ Resource: http://dbpedia.org/resource/Paris

§ Document: http://dbpedia.org/page/Paris

§ Data: http://dbpedia.org/data/Paris

Ø … served using content negotiation
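The three URL patterns above can be tied together by a small content negotiation routine. The sketch below is a simplification under stated assumptions: it ignores q-values, recognises only a short illustrative list of RDF media types, and is not a description of how the DBpedia server itself is implemented. The function name `negotiate` is hypothetical.

```python
RESOURCE_PREFIX = "http://dbpedia.org/resource/"
PAGE_PREFIX = "http://dbpedia.org/page/"
DATA_PREFIX = "http://dbpedia.org/data/"

# Media types treated as a request for data (an illustrative, incomplete list).
RDF_TYPES = ("application/rdf+xml", "text/turtle", "application/n-triples")


def negotiate(resource_uri, accept_header):
    """Return the URL a client would be redirected to for a resource URI.

    Simplification: q-values are ignored and the first recognised
    media type in the Accept header wins.
    """
    if not resource_uri.startswith(RESOURCE_PREFIX):
        raise ValueError("not a resource URI: " + resource_uri)
    name = resource_uri[len(RESOURCE_PREFIX):]
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in RDF_TYPES:
            return DATA_PREFIX + name
        if media_type == "text/html":
            return PAGE_PREFIX + name
    # Default to the human-readable document.
    return PAGE_PREFIX + name


print(negotiate("http://dbpedia.org/resource/Paris", "text/html"))
print(negotiate("http://dbpedia.org/resource/Paris", "application/rdf+xml"))
```

The resource URI identifies the thing itself (the city), while the two other URLs identify documents about it, one for people and one for machines.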

Page 21: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

URI Pattern in REST

Ø REST (Representational State Transfer) services manipulate resources, and URLs are mainly used to address these resources

Ø A base URI:

§ http://www.example.com/bookstore/

Ø A resource has a unique URL: (retrieve, update, create, delete)

§ http://www.example.com/bookstore/books/ISBN123

Ø Notion of collection: (list, replace, create, delete)

§ http://www.example.com/bookstore/books
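A minimal sketch of how the bookstore URLs above might be mapped to operations. The slide lists the operations but not the HTTP methods, so the method-to-operation tables below follow the common REST convention and are an assumption; the function name `classify` is hypothetical.

```python
from urllib.parse import urlparse

BASE_PATH = "/bookstore/"

# Assumed conventional mapping of HTTP methods to the operations on the slide.
ITEM_OPS = {"GET": "retrieve", "PUT": "update or create", "DELETE": "delete"}
COLLECTION_OPS = {"GET": "list", "PUT": "replace", "POST": "create", "DELETE": "delete"}


def classify(method, url):
    """Return (kind, operation) for a request under the base URI."""
    path = urlparse(url).path
    if not path.startswith(BASE_PATH):
        raise ValueError("outside the base URI: " + url)
    segments = [s for s in path[len(BASE_PATH):].split("/") if s]
    if len(segments) == 1:
        kind, ops = "collection", COLLECTION_OPS   # e.g. /bookstore/books
    elif len(segments) == 2:
        kind, ops = "item", ITEM_OPS               # e.g. /bookstore/books/ISBN123
    else:
        raise ValueError("unrecognised path: " + path)
    if method not in ops:
        raise ValueError(method + " is not mapped for a " + kind)
    return kind, ops[method]


print(classify("GET", "http://www.example.com/bookstore/books/ISBN123"))
print(classify("POST", "http://www.example.com/bookstore/books"))
```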

Page 22: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Tools for conversion to RDF

Ø How is the raw data to be converted?

§ Relational database?

§ (Semi-)structured formats?

§ Programmatic access (API)?

Ø There are solutions for all cases

Page 23: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

D2RQ Map

Page 24: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Triplify: Relational data to JSON/RDF

Ø Extract a folder into your web app: http://sourceforge.net/projects/triplify/

Ø Modify a config file:

§ SQL query … URI pattern

§ PHP lover!
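The idea of pairing an SQL query with a URI pattern can be illustrated without Triplify itself. The sketch below is Python rather than PHP and does not reproduce Triplify's configuration format: the `books` table, its columns, the `example.org` base URI and the vocabulary namespace are all hypothetical.

```python
import sqlite3

BASE = "http://example.org/"          # hypothetical base URI
VOCAB = "http://example.org/vocab#"   # hypothetical vocabulary namespace

# One mapping entry: a URI pattern plus the SQL query whose first column fills it.
MAPPING = {
    "uri_pattern": BASE + "book/{id}",
    "query": "SELECT id, title, author FROM books ORDER BY id",
}


def rows_to_triples(conn, mapping):
    """Run the mapped query and emit one N-Triples line per non-key column."""
    cursor = conn.execute(mapping["query"])
    columns = [d[0] for d in cursor.description]
    lines = []
    for row in cursor:
        subject = mapping["uri_pattern"].format(id=row[0])
        for column, value in zip(columns[1:], row[1:]):
            if value is None:
                continue  # no triple for NULL values
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'<{subject}> <{VOCAB}{column}> "{escaped}" .')
    return lines


# A throwaway in-memory database standing in for the web app's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT)")
conn.execute("INSERT INTO books VALUES (1, 'Linked Data', 'Example Author')")
triples = rows_to_triples(conn, MAPPING)
for line in triples:
    print(line)
```

Each row becomes a subject URI minted from the pattern, and each remaining column becomes a property of that subject.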

Page 25: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Working on spreadsheets

Page 26: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Google acquired Freebase

http://code.google.com/p/google-refine/

Page 27: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

RDF extension for Google Refine

Ø A graphical extension for Google Refine that exports the cleaned data as RDF

http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/

| Name | Job Title | Grade | Organization | Annual pay rate - including taxable benefits and allowances | Notes |
|---|---|---|---|---|---|
| Stephan Wilcke | Chief Executive Officer | | Asset Protection Agency | £150,000 - £154,999 | |
| Jens Bech | Chief Risk Officer | | Asset Protection Agency | £165,000 - £169,999 | No pension |
| Ion Dagtoglou | Chief Investment Officer | | Asset Protection Agency | £165,000 - £169,999 | No pension |
| Brian Scammell | Chief Credit Officer | | Asset Protection Agency | £130,000 - £134,999 | 4 days per week |

Page 28: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Google Refine and RDF

Page 29: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011


3rd floor - Publication

Page 30: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Publication components

SPARQL endpoint

REST

RDF storage

Data feeds

Inference engine

Querying

Browsing

A few products: Virtuoso, Sesame, Mulgara, 4store, OWLIM, AllegroGraph, Big Data, Jena

Page 31: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Named graphs

[Figure: graph with nodes numbered 1 to 16]

Ø Delete on a graph

Ø SPARQL queries define graphs

Ø RDF graphs are bags of triples, everything is mixed
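A toy illustration of why named graphs help: when every triple is stored under a graph name, one graph can be deleted without touching the rest, whereas a single bag of triples keeps no record of where each triple came from. The `QuadStore` class and the `example.org` graph names are hypothetical and do not correspond to the API of any product listed earlier.

```python
from collections import defaultdict


class QuadStore:
    """A toy store that keeps each triple inside a named graph."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph, s, p, o):
        self._graphs[graph].add((s, p, o))

    def drop(self, graph):
        """Delete a whole named graph without touching the others."""
        self._graphs.pop(graph, None)

    def match(self, graph=None):
        """Return triples from one graph, or from all graphs merged if graph is None."""
        if graph is not None:
            return set(self._graphs.get(graph, set()))
        merged = set()
        for triples in self._graphs.values():
            merged |= triples
        return merged


store = QuadStore()
store.add("http://example.org/graph/a", "ex:paris", "rdf:type", "ex:City")
store.add("http://example.org/graph/b", "ex:paris", "rdfs:label", "Paris")
store.drop("http://example.org/graph/b")   # only graph b disappears
print(store.match())
```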

Page 32: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Inference

Ø Generating triples from other triples

Ø Deduction mechanism

§ Men are mortal, Socrates is a man, so Socrates is mortal

Ø Avoids having to state every fact explicitly, and gives meaning to defining hierarchies

Ø Constraints: cardinality, NFPs, ...
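The Socrates deduction can be sketched as a tiny forward-chaining loop over triples. Only two subclass rules in the style of RDFS are applied here, which is an assumption made for brevity; a real inference engine supports many more rules, and the `ex:` names are hypothetical.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def infer(triples):
    """Apply two RDFS-style rules until no new triple appears.

    Rule 1: (x type C), (C subClassOf D)        => (x type D)
    Rule 2: (C subClassOf D), (D subClassOf E)  => (C subClassOf E)
    """
    closure = set(triples)
    while True:
        new = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                # Both rules follow a subClassOf edge leaving the object o.
                if p2 == SUBCLASS and o == s2 and p in (TYPE, SUBCLASS):
                    new.add((s, p, o2))
        if new <= closure:
            return closure
        closure |= new


facts = {
    ("ex:Socrates", TYPE, "ex:Man"),
    ("ex:Man", SUBCLASS, "ex:Mortal"),
}
derived = infer(facts) - facts
print(derived)
```

Only the hierarchy and one fact are stated; the triple saying that Socrates is mortal is generated rather than stored by hand.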

[Figure: graph with nodes numbered 1 to 16]

Page 33: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Analysis of RDF stores: the QSOS method

Ø Qualification and Selection of Open Source Software

§ An open source project about open source solutions

§ http://www.qsos.org

Ø Goals of QSOS

§ Qualify software

§ Compare solutions after defining requirements and weighting the criteria

§ Select the product best suited to a given need

Ø QSOS provides

§ An objective and formalized method

§ A repository of available studies

§ Tools that facilitate applying the method
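The "compare after weighting the criteria" step can be sketched as a weighted average. Everything specific below is an assumption made for illustration: the 0 to 2 score scale, the criteria names, the weights and the two anonymous candidates. It is not the official QSOS tooling and says nothing about any real RDF store.

```python
def weighted_score(scores, weights):
    """Weighted average of per-criterion scores, using the given weights."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one criterion needs a non-zero weight")
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


def select(candidates, weights):
    """Return candidate names sorted from best to worst weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Hypothetical requirements: the weights express how much each criterion matters.
weights = {"sparql_support": 3, "inference": 1, "documentation": 2}

# Hypothetical evaluations on an assumed 0-2 scale.
candidates = {
    "store_a": {"sparql_support": 2, "inference": 0, "documentation": 1},
    "store_b": {"sparql_support": 1, "inference": 2, "documentation": 2},
}

ranking = select(candidates, weights)
print(ranking)
```

Changing the weights changes the ranking, which is the point of defining requirements before comparing.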

Page 34: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011


4th floor - Interconnection

Page 35: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Linked data and interconnections

Ø Without links there is no Web, only data silos

Ø Links can be part of the dataset's design (reference datasets)

Ø Links can be found after publication: equivalence links between resources

Page 36: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

How to interlink your data?

Page 37: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Tools

Ø RKB-CRS A coreference resolution service for the RKB knowledge base

Ø LD-mapper A linkage tool for datasets described using the Music Ontology

Ø ODD Linker A linkage tool based on SQL

Ø RDF-AI Multi purpose data linkage and fusion

Ø Silk and Silk LSL Linkage tool and linkage specification language

Ø Knofuss architecture Datasets linkage and fusion

Page 38: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Example Silk specification

<Silk>
  <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
  <Prefix id="dbpedia" namespace="http://dbpedia.org/ontology/" />
  <Prefix id="gn" namespace="http://www.geonames.org/ontology#" />

  <DataSource id="dbpedia">
    <EndpointURI>http://demo_sparql_server1/sparql</EndpointURI>
    <Graph>http://dbpedia.org</Graph>
  </DataSource>

  <DataSource id="geonames">
    <EndpointURI>http://demo_sparql_server2/sparql</EndpointURI>
    <Graph>http://sws.geonames.org/</Graph>
  </DataSource>

  <Thresholds accept="0.9" verify="0.7" />
  <Output acceptedLinks="accepted_links.n3" verifyLinks="verify_links.n3" mode="truncate" />

  <Interlink id="cities">
    <LinkType>owl:sameAs</LinkType>
    <SourceDataset dataSource="dbpedia" var="a">
      <RestrictTo>?a rdf:type dbpedia:City</RestrictTo>
    </SourceDataset>
    <TargetDataset dataSource="geonames" var="b">
      <RestrictTo>?b rdf:type gn:P</RestrictTo>
    </TargetDataset>
    <LinkCondition>
      <AVG>
        <Compare metric="jaroSimilarity">
          <Param name="str1" path="?a/rdfs:label" />
          <Param name="str2" path="?b/gn:name" />
        </Compare>
        <Compare metric="numSimilarity">
          <Param name="num1" path="?a/dbpedia:populationTotal" />
          <Param name="num2" path="?b/gn:population" />
        </Compare>
      </AVG>
    </LinkCondition>
  </Interlink>
</Silk>
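The link condition in this specification averages a label comparison and a population comparison, then applies the accept (0.9) and verify (0.7) thresholds. The sketch below mimics that logic in plain Python under stated assumptions: `difflib.SequenceMatcher` stands in for Silk's `jaroSimilarity`, the numeric metric is an assumed min/max ratio rather than Silk's `numSimilarity`, and the records and population values are hypothetical.

```python
from difflib import SequenceMatcher

ACCEPT = 0.9  # from <Thresholds accept="0.9" ... />
VERIFY = 0.7  # from <Thresholds ... verify="0.7" />


def string_similarity(a, b):
    """Stand-in for jaroSimilarity: difflib's ratio, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def number_similarity(x, y):
    """Assumed numeric metric: ratio of the smaller to the larger value."""
    if x == y:
        return 1.0
    if x < 0 or y < 0:
        return 0.0
    return min(x, y) / max(x, y)


def link_decision(source, target):
    """Average the two comparisons (the <AVG> element) and apply the thresholds."""
    score = (
        string_similarity(source["label"], target["name"])
        + number_similarity(source["population"], target["population"])
    ) / 2
    if score >= ACCEPT:
        return "accept", score
    if score >= VERIFY:
        return "verify", score
    return "reject", score


# Hypothetical records standing in for a DBpedia city and a Geonames place.
print(link_decision({"label": "Paris", "population": 100},
                    {"name": "Paris", "population": 100}))
```

Pairs that are accepted would be written out as `owl:sameAs` links, and pairs in the verify band would be left for a human to check.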

Page 39: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Where to find links?

Page 40: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Towards automated interconnection services

Ø The linkage specification could be simplified

§ Using alignments between vocabularies

§ Detection of discriminating properties

§ Indicating comparison methods by attaching metadata to ontologies

Ø Work in progress in Datalift

Page 41: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011


5th floor - Applications

Page 42: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Data visualization

Tabulator (CSAIL, MIT)

Page 43: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

VisiNav

Page 44: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Sig.ma

Page 45: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011
Page 46: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

NosDéputés.fr

Page 47: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

A few examples from the US

http://data-gov.tw.rpi.edu/demo/USForeignAid/demo-1554.html

Page 48: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

Mashups … Mashups … Mashups …

Page 49: Datalift a-catalyser-for-the-web-of-data-fosdem-05-02-2011

That's it!

● Datalift.org

● We're looking for a Datageek!