bibbase linked data triplification challenge 2010 presentation

BibBase Triplified http://data.bibbase.org/ Presented by: Reynold S. Xin UC Berkeley Joint work with: Oktie Hassanzadeh, Yang Yang, Jiang Du, Minghua Zhao, Renee J. Miller University of Toronto Christian Fritz University of Southern California

Upload: reynold-xin

Post on 25-Jun-2015


DESCRIPTION

This was a short talk given on BibBase (http://data.bibbase.org) in LDTC 2010, Graz, Austria.

TRANSCRIPT

Page 1: BibBase Linked Data Triplification Challenge 2010 Presentation

BibBase Triplified http://data.bibbase.org/

Presented by:

Reynold S. Xin UC Berkeley

Joint work with:

Oktie Hassanzadeh, Yang Yang, Jiang Du, Minghua Zhao,Renee J. Miller University of Toronto

Christian Fritz University of Southern California

Page 2: BibBase Linked Data Triplification Challenge 2010 Presentation

Outline

Goals and Status

Duplicate detection

Interlinking of data sources

Additional features

Conclusions and future work

Page 3: BibBase Linked Data Triplification Challenge 2010 Presentation
Page 4: BibBase Linked Data Triplification Challenge 2010 Presentation
Page 5: BibBase Linked Data Triplification Challenge 2010 Presentation

Goals http://www.bibbase.org

Makes it easy for scientists to maintain publications pages

Scientists maintain a bibtex file; BibBase does the rest

Publishes them in HTML

Page 6: BibBase Linked Data Triplification Challenge 2010 Presentation

Goals http://data.bibbase.org

Makes it easy for scientists to maintain publications pages

Scientists maintain a bibtex file; BibBase does the rest

Publishes them in HTML

Publishes them in RDF

Links entries to the open linked data cloud

With incentive, scientists are helping us build a bibliographic database (think DBLP but automated)

Invaluable data set for benchmarking duplicate detection and semantic link discovery systems

Page 7: BibBase Linked Data Triplification Challenge 2010 Presentation
Page 8: BibBase Linked Data Triplification Challenge 2010 Presentation

Some statistics

“Beta” went online in June 2010

As of yesterday (September 1, 2010)

~100 active users

4520 publications, 4883 authors, 502 journals, 1881 proceedings, 88 keywords

39201 author links, 2768 publication links, 30 keyword links

Note that this is before we do any form of “marketing”

Page 9: BibBase Linked Data Triplification Challenge 2010 Presentation

Duplicate Detection

Examples

Authors: “Renee J. Miller” or “R. J. Miller” or “RJ Miller”

Publication entries

Journal & conferences: “VLDB” or “Very Large Data Base”

Solutions

Local detection (within a single bibtex file)

Global detection (across multiple files)

Page 10: BibBase Linked Data Triplification Challenge 2010 Presentation

Local Detection

A set of predefined rules to identify duplicates. E.g. within a single file, it is highly likely that “Renee J Miller” is the same as “RJ Miller”.

Users can specify a suffix to the name to differentiate them (DBLP approach). E.g. “Min Wang” vs “Min Wang2”
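
The slides do not list the actual rules, so the following is only a minimal sketch of what one such local rule could look like: two names within the same file are treated as the same author when the last name and the initials of the given names agree. The function names `name_key` and `same_author_locally`, and the heuristic that a short all-caps token such as “RJ” is a run of initials, are assumptions made for illustration.

```python
import re


def name_key(name: str) -> tuple:
    """Reduce a name to (last name, initials of the given names), lowercased."""
    tokens = re.findall(r"\w+", name)
    if not tokens:
        return ("", "")
    *given, last = tokens
    initials = ""
    for token in given:
        # Assumption: a short all-caps token such as "RJ" is a run of initials.
        if token.isupper() and len(token) <= 3:
            initials += token
        else:
            initials += token[0]
    return (last.lower(), initials.lower())


def same_author_locally(a: str, b: str) -> bool:
    """Hypothetical local rule: same last name and same initials."""
    return name_key(a) == name_key(b)


print(same_author_locally("Renee J. Miller", "RJ Miller"))
print(same_author_locally("Min Wang", "Min Wang2"))
```

Because the suffix is kept as part of the last-name token, “Min Wang” and “Min Wang2” produce different keys, which matches the suffix convention described on this slide.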

Page 11: BibBase Linked Data Triplification Challenge 2010 Presentation

Global Detection

Duplicate detection, also known as entity resolution, record linkage, or reference reconciliation, is a well-studied problem and an active research area. [Tutorial-VLDB’05, Tutorial-SIGMOD’06]

We use existing declarative techniques [D.App.σ-SIGMOD’07] to detect duplicates across multiple files.

Display disambiguation page on HTML interface and rdfs:seeAlso attribute on RDF interface.

Also enables users to provide feedback by @string{vldb = Very Large Data Base}
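
As a rough illustration of the RDF side, the sketch below emits rdfs:seeAlso links between entities that global detection flagged as possible duplicates, serialized as N-Triples. The function name `see_also_triples` and the example.org URIs are placeholders, not BibBase's actual URI scheme, and linking every candidate to every other one is an assumption about how the links might be laid out.

```python
from itertools import permutations

# The rdfs:seeAlso property from the RDF Schema vocabulary.
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def see_also_triples(candidate_uris):
    """Return N-Triples lines linking each candidate duplicate to the others."""
    uris = sorted(set(candidate_uris))
    return [
        f"<{subject}> <{RDFS_SEE_ALSO}> <{obj}> ."
        for subject, obj in permutations(uris, 2)
    ]


# Placeholder URIs for two author records suspected to be the same person.
candidates = [
    "http://example.org/author/renee-j-miller",
    "http://example.org/author/rj-miller",
]
for line in see_also_triples(candidates):
    print(line)
```

Using rdfs:seeAlso rather than a stronger identity assertion leaves the decision to the consumer, which fits a setting where the duplicates are only suspected.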

Page 12: BibBase Linked Data Triplification Challenge 2010 Presentation

Interlinking of Data Sources

Leverages both offline dictionaries and online real-time URL verifications.

Some external data sources: DBLP, DBpedia, RKBExplorer, Semantic Web Dogfood, LOD foaf

Page 13: BibBase Linked Data Triplification Challenge 2010 Presentation

Additional Features

Storage and publication of provenance information

Dynamic grouping of entities (by year, keyword, etc)

RSS feed for notification

DBLP scraper to generate bibtex files from DBLP records

Statistics on usage

Enhancement to existing MIT bibtex ontology file

Page 14: BibBase Linked Data Triplification Challenge 2010 Presentation

Conclusion and Future Work

BibBase

Light-weight publication of bibliographic data

Semantic web technologies as a result of complex triplification performed inside the system

Invaluable data set

Future Work

More comprehensive duplicate detection

Links to more external data sources

Better engineering and service level agreement (99.99%?)

Broader user base

Page 15: BibBase Linked Data Triplification Challenge 2010 Presentation

Questions?