Semantic Representation of Provenance in Wikipedia


DESCRIPTION

Presented at ISWC 2010, SWPM (Semantic Web Provenance Management) workshop.

TRANSCRIPT

Page 1: Semantic Representation of Provenance in Wikipedia

Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Semantic Representation of Provenance in Wikipedia

Fabrizio Orlandi¹, Pierre-Antoine Champin², Alexandre Passant¹

SWPM 2010, Shanghai – 7th Nov 2010

¹ Digital Enterprise Research Institute – National University of Ireland, Galway

² LIRIS, Université de Lyon, CNRS, UMR5205, Lyon

Page 2: Semantic Representation of Provenance in Wikipedia

Motivation

Wikipedia is one of the best-known knowledge bases available on the Web.

Everyone can contribute: trust and quality concerns!

Use of provenance information to identify trust and quality values for pages.

Data provenance is the history, the origins and the evolution of data.

Ability to answer the following questions about data:

Who created/modified it? When?

What is the content? Where is it located?

How and why was it created?

Which tools and processes were used?

Page 3: Semantic Representation of Provenance in Wikipedia

Semantic provenance in Wikipedia

• By representing Wikipedia provenance information with Semantic Web technologies we enable:

– Transparency

– Reusability

– Integration with the Web of Data

• Our contribution:

– A semantic model to represent provenance information in wikis

– A software architecture to extract provenance from Wikipedia

– An application that uses and exposes provenance data to compute measures and statistics on Wikipedia articles

Page 4: Semantic Representation of Provenance in Wikipedia

The SIOC Core ontology: http://rdfs.org/sioc/spec

SIOC: Semantically-Interlinked Online Communities. Describes the content and structure of community sites.

• Wiki and WikiArticle classes with the SIOC Types module.

Advantages of using SIOC:

• Widely used on the Web.

• Integration with existing SIOC data and other popular lightweight ontologies like FOAF, DC, etc.

• Same queries to find items on a Wiki or a Blog, Forum, etc.

Page 5: Semantic Representation of Provenance in Wikipedia

The SIOC Actions module

• From a document-centric (SIOC) to an action-centric (SIOC Actions) view of online communities. [Champin, Passant – 2010]

• It represents the dynamics of online communities, how they evolve:

– A set of actions, performed by a user at some time, impacting one or more objects.

– In Wikipedia actions are edits made by users on the articles.

Relies on the Event Ontology [Raimond et al. - 2007] http://motools.sourceforge.net/event/event.html

Page 6: Semantic Representation of Provenance in Wikipedia

The W7 Model

• Ontological model created to describe the semantics of data provenance [Ram, Liu - 2007]

– Based on Bunge's ontology (1977).

– Tracks the history of the events affecting the status of things during their lifecycle.

– Extensible and generic, it can be used in different domains.

– 7 interrogative words: What, How, When, Where, Who, Which, Why.

– Not implemented in RDFS/OWL.

Page 7: Semantic Representation of Provenance in Wikipedia

Our modelling solution

1 – What

An event (i.e. change of state) that happens to data during its lifetime.

In Wikipedia every type of event (creation, modification, deletion) leads to the creation of a new article revision.

Just using SIOC Core we can model versioning and history of wiki articles.

<http://example.com/action?title=Linked_Data#38010613>
    sioca:creates <http://en.wikipedia.org/w/index.php?title=Linked_Data&oldid=38010613> ;
    sioca:modifies <http://en.wikipedia.org/wiki/Linked_Data> ;
    a sioca:Action .
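
As a rough illustration of how such a description could be produced for any revision, the following Python sketch builds the same three statements from an article title and a revision id. The function name `action_turtle` and the `http://example.com/action` base are illustrative assumptions, and the title is assumed to be already in its underscored, URL-safe form; this is not the authors' implementation.

```python
def action_turtle(title: str, rev_id: int,
                  base: str = "http://example.com/action") -> str:
    """Serialize one edit as a sioca:Action, mirroring the slide's example.

    Hypothetical helper: `base` is a placeholder namespace for action URIs,
    and `title` is assumed to be URL-safe already (e.g. "Linked_Data").
    """
    action = f"<{base}?title={title}#{rev_id}>"
    revision = f"<http://en.wikipedia.org/w/index.php?title={title}&oldid={rev_id}>"
    article = f"<http://en.wikipedia.org/wiki/{title}>"
    return (
        f"{action}\n"
        f"    sioca:creates {revision} ;\n"   # the new revision produced by the edit
        f"    sioca:modifies {article} ;\n"   # the article the edit applies to
        "    a sioca:Action .\n"
    )


if __name__ == "__main__":
    print(action_turtle("Linked_Data", 38010613))
```

Each edit yields one action resource, so the full history of an article becomes a sequence of such actions, each pointing at the revision it created.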

Page 8: Semantic Representation of Provenance in Wikipedia

Our modelling solution

2 – How

The action leading to an event.

• In Wikipedia the actions are the edits applied to the articles.

• By analyzing diffs between revisions we identify the type of action involved in the creation of the newer revision:

( Insertion | Update | Deletion ) ( Sentence | Reference )

• To model the differences between revisions we created a lightweight Diff ontology that aims at describing changes to plain text documents (http://vocab.deri.ie/diff#).
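
The slides do not say how the diff is computed, so the following is only a sketch of one way to derive the six action types, using Python's standard `difflib`. It assumes a naive segmentation (text inside `<ref>…</ref>` is a Reference, everything else is split into Sentences at `.`, `!` or `?`) and a simple pairing heuristic for replaced spans; the function names are hypothetical.

```python
import difflib
import re


def split_units(text: str) -> list[str]:
    """Split wikitext into units: <ref>...</ref> spans and plain sentences."""
    units = []
    for part in re.split(r"(<ref[^>]*>.*?</ref>)", text, flags=re.S):
        if part.startswith("<ref"):
            units.append(part)
        else:
            # Naive sentence boundary: whitespace after ., ! or ?
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", part.strip()) if s)
    return units


def kind(unit: str) -> str:
    return "Reference" if unit.startswith("<ref") else "Sentence"


def classify(old: str, new: str) -> list[tuple[str, str]]:
    """Return (action, unit type) pairs describing how `old` became `new`."""
    a, b = split_units(old), split_units(new)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    actions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed, added = a[i1:i2], b[j1:j2]
        paired = min(len(removed), len(added))
        # Heuristic: units replaced one-for-one count as updates,
        # any surplus on either side as deletions or insertions.
        actions += [("Update", kind(u)) for u in added[:paired]]
        actions += [("Deletion", kind(u)) for u in removed[paired:]]
        actions += [("Insertion", kind(u)) for u in added[paired:]]
    return actions


if __name__ == "__main__":
    old = "Linked Data is a method. It uses URIs."
    new = "Linked Data is a method. It uses HTTP URIs.<ref>Berners-Lee 2006</ref>"
    print(classify(old, new))
```

Each resulting pair corresponds to one combination from the pattern above, for example an updated sentence or an inserted reference.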

Page 9: Semantic Representation of Provenance in Wikipedia

Our modelling solution

3 – When

The time an event occurs.

• In Wikipedia every edit has a timestamp recorded, and edits are considered instantaneous.

• Use of dc:created or event:time

<http://example.com/action?title=Linked_Data#380106133>
    dc:created "2010-08-21T06:36:17Z" ;
    event:time [ a time:Instant ; time:inXSDDateTime "2010-08-21T06:36:17Z" ] ;
    a sioca:Action .
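
A small Python sketch of how a recorded edit timestamp could be validated and turned into the two statements above. It assumes timestamps arrive as UTC strings of the form `2010-08-21T06:36:17Z`; the helper name `when_turtle` and the explicit `xsd:dateTime` datatype are additions for illustration, not part of the slides.

```python
from datetime import datetime, timezone


def when_turtle(timestamp: str) -> str:
    """Render the 'When' statements for one edit.

    Raises ValueError if `timestamp` is not of the form YYYY-MM-DDTHH:MM:SSZ.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    instant = datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc)
    value = instant.strftime(fmt)
    literal = f'"{value}"^^xsd:dateTime'  # typed literal (assumption)
    return (
        f"    dc:created {literal} ;\n"
        f"    event:time [ a time:Instant ; time:inXSDDateTime {literal} ] ;\n"
    )


if __name__ == "__main__":
    print(when_turtle("2010-08-21T06:36:17Z"))
```

Because edits are treated as instantaneous, a single `time:Instant` is enough; no interval needs to be modelled.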

Page 10: Semantic Representation of Provenance in Wikipedia

Our modelling solution

4 – Where

The online space or the location associated with an event.

In Wikipedia the information about the location of the user editing the page is not provided.

This information cannot be modelled.

Page 11: Semantic Representation of Provenance in Wikipedia

Our modelling solution

5 – Who

An agent involved in an event.

In Wikipedia it is represented by the editor of a page.

We use the sioc:UserAccount class to identify the account of the agent.

<http://example.com/action?title=Linked_Data#36243686>
    sioc:has_creator <http://en.wikipedia.org/wiki/User:Timbl> ;
    a sioca:Action .

Page 12: Semantic Representation of Provenance in Wikipedia

Our modelling solution

6 – Which

The programs or instruments used in the event.

• In Wikipedia it is represented by the MediaWiki software used to edit the articles.

• Different in case the editor is a “bot”.

Page 13: Semantic Representation of Provenance in Wikipedia

Our modelling solution

7 – Why

The reasons behind the event occurrence.

• In Wikipedia it is defined by the justification for a change that a user enters in the “comment” field.

• Property diff:comment with the diff:Diff class as domain.

Page 14: Semantic Representation of Provenance in Wikipedia

Our modelling solution

Page 15: Semantic Representation of Provenance in Wikipedia

Application using Wikipedia provenance data

The application consists of three main parts:

• Data Collection

– Extracts and generates provenance data from Wikipedia using our model.

• Firefox plug-in

– From the provenance data collected, it computes and shows statistical information directly on Wikipedia pages.

• Exposing the data to the Web of data

– The statistical information and the provenance data are provided as Linked Open Data.

Page 16: Semantic Representation of Provenance in Wikipedia

Data Collection

A PHP script has been developed to extract all the articles belonging to a category and all its subcategories, and for each article, its entire revision history.

Then the program extracts provenance information from the articles collected at the previous step: it calculates the diff function between versions and retrieves other information from the Wikipedia API.

We ran our experiment with the “Semantic Web” category and all its 166 Wikipedia articles. All the data has been loaded into an RDF store.
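
The original collector is a PHP script whose code is not shown in the slides. As a hedged sketch of the two kinds of request such a collector needs, the following Python builds MediaWiki Action API URLs for listing the members of a category and for fetching an article's revision history. The endpoint constant and function names are illustrative; result paging and the recursion into subcategories are deliberately left out.

```python
from urllib.parse import urlencode

# English Wikipedia endpoint of the MediaWiki Action API.
API = "https://en.wikipedia.org/w/api.php"


def category_members_url(category: str, limit: int = 500) -> str:
    """URL listing the pages (and subcategories) of one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


def revisions_url(title: str, limit: int = 500) -> str:
    """URL fetching revision ids, timestamps, users and comments of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
    }
    return f"{API}?{urlencode(params)}"


if __name__ == "__main__":
    print(category_members_url("Semantic Web"))
    print(revisions_url("Linked Data"))
```

The fields requested through `rvprop` line up with the W7 dimensions used in the model: revision ids (What), timestamps (When), users (Who) and comments (Why).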

Page 17: Semantic Representation of Provenance in Wikipedia

Data Collection

Page 18: Semantic Representation of Provenance in Wikipedia

A Firefox plug-in

• This application displays a table directly on top of Wikipedia articles, exposing information about the most active users and their edits.

• It is composed of:

– 1) The triplestore, exposing a SPARQL endpoint;

– 2) A PHP script, which queries the triplestore and sends the results to the Greasemonkey script;

– 3) A Greasemonkey script, which retrieves the URL of the loaded Wikipedia page, sends the request to the PHP script and then displays the returned HTML data on the Wikipedia page.
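
The query sent to the triplestore is not shown in the slides. The sketch below, in Python for illustration, builds one plausible SPARQL query for the "most active users" table from the properties used in the model (`sioca:modifies`, `sioc:has_creator`). The function name, the namespace URIs bound to the prefixes and the use of SPARQL 1.1 aggregates are assumptions.

```python
def top_editors_query(article_url: str, limit: int = 5) -> str:
    """Build a SPARQL query counting edits per user for one article.

    Assumes the store holds one sioca:Action per edit, linked to the
    article with sioca:modifies and to the editor with sioc:has_creator.
    """
    return f"""PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX sioca: <http://rdfs.org/sioc/actions#>
SELECT ?user (COUNT(?action) AS ?edits)
WHERE {{
  ?action a sioca:Action ;
          sioca:modifies <{article_url}> ;
          sioc:has_creator ?user .
}}
GROUP BY ?user
ORDER BY DESC(?edits)
LIMIT {limit}"""


if __name__ == "__main__":
    print(top_editors_query("http://en.wikipedia.org/wiki/SIOC"))
```

In the architecture above, a query of this kind would be issued by the server-side script, with the article URL supplied by the browser script for the page currently loaded.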

Page 19: Semantic Representation of Provenance in Wikipedia

A Firefox plug-in

Page 20: Semantic Representation of Provenance in Wikipedia

To the Web of data

• The application is currently available at http://vmuss06.deri.ie/WikiProvenance/index.php.

• Using this web service it is possible to obtain RDF for the provenance data generated with our model.

• It is also possible to obtain, in RDF, the statistical information displayed by the Firefox plug-in.

• To represent the statistics we use SCOVO, the Statistical Core Vocabulary (http://vocab.deri.ie/scovo).

Page 21: Semantic Representation of Provenance in Wikipedia

To the Web of data

• As an example, the following triples state that the user “KingsleyIdehen” made 11 edits on the SIOC page:

@prefix WikiStats: <http://vmuss06.deri.ie/WikipediaStats.owl#> .
@prefix scovo: <http://purl.org/NET/scovo#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<WikiStats:title=SIOC&user=KingsleyIdehen&edits>
    a scovo:Item ;
    rdf:value 11 ;
    scovo:dimension WikiStats:Edits ;
    scovo:dimension <http://wikipedia.org/wiki/SIOC> ;
    scovo:dimension <http://wikipedia.org/wiki/User:KingsleyIdehen> .
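
To make the step from raw edits to such statistics concrete, here is a minimal Python sketch that counts edits per article and user and emits one `scovo:Item` per pair in the shape shown above. The function name `scovo_items` and the input format (a list of title and user pairs) are assumptions for illustration; prefix declarations are omitted from the output.

```python
from collections import Counter


def scovo_items(edits) -> str:
    """Emit one scovo:Item per (article title, user) pair with its edit count.

    `edits` is an iterable of (title, user) tuples, one per edit (assumed format).
    """
    counts = Counter(edits)
    blocks = []
    for (title, user), total in sorted(counts.items()):
        blocks.append(
            f"<WikiStats:title={title}&user={user}&edits>\n"
            "    a scovo:Item ;\n"
            f"    rdf:value {total} ;\n"
            "    scovo:dimension WikiStats:Edits ;\n"
            f"    scovo:dimension <http://wikipedia.org/wiki/{title}> ;\n"
            f"    scovo:dimension <http://wikipedia.org/wiki/User:{user}> .\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = [("SIOC", "KingsleyIdehen")] * 11 + [("SIOC", "ExampleUser")]
    print(scovo_items(sample))
```

Each item carries three dimensions: the measure being reported (edits), the article and the user.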

Page 22: Semantic Representation of Provenance in Wikipedia

Conclusions and Future Work

Our contribution:

• A specific lightweight ontology for provenance in wikis, based on the W7 model and SIOC.

• A framework for the extraction of provenance data from Wikipedia.

• An application to access the generated data in a meaningful way and to expose it to the Web of data.

Future work:

• A refinement of the proposed model and an alignment with other general-purpose ontologies for provenance representation.

• To improve the performance and extend the features of the application.

• To model statistics using the SDMX vocabulary (Statistical Data and Metadata eXchange).

Comment:

• Very large amount of data generated for the “Semantic Web” category and its 166 articles: almost 1.5 million triples for a total of 8,656 revisions.

Page 23: Semantic Representation of Provenance in Wikipedia

Applications and source code:

http://vmuss06.deri.ie/WikiProvenance/index.php

The Diff ontology:

http://vocab.deri.ie/diff#

Contacts:

[email protected]@deri.org

@BadmotorF

http://www.slideshare.net/badmotorfinger

Questions?