Slides for the Open Data Hackathon at Saint Petersburg
Data on the Semantic Web
Peter Mika
Senior Research Scientist
Yahoo! Research
- 2 -
Vague, but exciting… Berners-Lee and the dawn of the Web
- 3 -
Semantic Web
• Publish information in a way that is easier to process for machines
• Web of Data instead of Web of Documents
• Two main architectural challenges
– A common format for sharing data
– Sharing the meaning of data
• Through social means (shared schemas)
• By using powerful schema languages
• Semantic Web standards from W3C
– Languages (RDF, OWL, RIF)
– Serializations (RDF/XML, RDFa)
– Protocols (SPARQL, HTTP)
• Semantic Web research into knowledge representation and reasoning, data integration, data quality and many other topics
• Community efforts to publish data and develop schemas
- 4 -
Resource Description Framework (RDF)
• Each resource (thing, entity) is identified by a URI
– Globally unique identifiers
• RDF represents knowledge as a set of triples
– Each triple is a single fact about the entity (an attribute or a relationship)
• A set of triples forms an RDF graph
[Figure: RDF document. example:roi has name “Roi Blanco” and type foaf:Person]
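A minimal sketch of the triple model on this slide, written in plain Python rather than with an RDF library. The RDF and FOAF namespace URIs are the published ones; expanding the slide's "example:" prefix to http://example.org/ and reading the "name" edge as foaf:name are assumptions made for illustration.

```python
# RDF and FOAF are the published namespace URIs.
# EX is a hypothetical expansion of the slide's "example:" prefix.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
EX = "http://example.org/"

# An RDF graph is a set of (subject, predicate, object) triples.
# Here URIs and literals are both plain strings; a real RDF library
# keeps them apart as distinct term types.
graph = {
    (EX + "roi", FOAF + "name", "Roi Blanco"),    # assumes "name" means foaf:name
    (EX + "roi", RDF + "type", FOAF + "Person"),
}


def describe(graph, subject):
    """Return a {predicate: [objects]} dict of all facts about one resource."""
    facts = {}
    for s, p, o in graph:
        if s == subject:
            facts.setdefault(p, []).append(o)
    return facts


roi = describe(graph, EX + "roi")
print(roi[FOAF + "name"])
```

Because the graph is a set, merging two RDF documents amounts to a set union of their triples.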
- 5 -
Linking across the Web
[Figure: an RDF graph linked across three sources (Roi’s homepage, Yahoo!’s website, the Friend-of-a-Friend ontology). Nodes: example:roi, #roi2, #peter, “Roi Blanco”, foaf:Person. Edges: name, type, sameAs, worksWith, knows]
- 7 -
Vocabularies (ontologies)
• Ontologies are collections of classes and properties used to describe objects in a particular domain
– OWL (the Web Ontology Language) is the standard ontology language
– OWL has an RDF serialization: ontologies are part of the Semantic Web
• Classes can be described by sub- and superclasses, required properties
– Class membership in RDF is expressed using the rdf:type property
– An instance can have multiple classes (types)
– A class can have multiple superclasses
• Properties can be described by their domain, range, cardinalities, etc.
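The class bullets above can be illustrated with a small sketch of subclass reasoning: an instance typed with one class is also a member of every superclass of that class. The toy vocabulary (Researcher, Employee, Person, Agent) and the instance name are hypothetical, and this shows only the simplest kind of inference a schema language supports, not OWL reasoning in general.

```python
# Hypothetical toy vocabulary: class -> set of direct superclasses.
SUBCLASS_OF = {
    "Researcher": {"Person", "Employee"},  # a class can have multiple superclasses
    "Employee": {"Agent"},
    "Person": {"Agent"},
    "Agent": set(),
}

# rdf:type assertions: instance -> set of directly asserted classes.
TYPES = {"roi": {"Researcher"}}


def superclasses(cls):
    """Return all direct and indirect superclasses of cls."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(instance):
    """Return asserted types plus every type implied by the class hierarchy."""
    direct = TYPES.get(instance, set())
    result = set(direct)
    for cls in direct:
        result |= superclasses(cls)
    return result


print(sorted(inferred_types("roi")))
```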
- 8 -
Example: schema.org
• Agreement on a shared set of schemas for common types of web content
– Bing, Google, and Yahoo! as initial supporters
– Similar in intent to sitemaps.org (2006)
• Use a single format to communicate the same information to all three search engines
• Support for microdata
• schema.org covers areas of interest to all search engines
– Business listings (local), creative works (video), recipes, reviews
– User defined extensions
• Each search engine continues to develop its products
- 9 -
Documentation and OWL ontology
Sources of data
- 11 -
Data on the Web
• Most web pages on the Web are generated from structured data
– Data is stored in relational databases (typically)
– Queried through web forms
– Presented as tables or simply as unstructured text
• The structure and semantics (meaning) of the data are not directly accessible to search engines
• Two solutions
– Extraction using Information Extraction (IE) techniques (implicit metadata)
• Supervised vs. unsupervised methods
– Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)
• Particularly interesting for long tail content
- 12 -
Information Extraction methods
• Named Entity Recognition (NER) and Disambiguation (NED)
– OpenCalais, Zemanta API, DBpedia Spotlight
– Yahoo! Placemaker
• Extraction of structured data from text
– Yago system (demo)
• Exploiting patterns in web page structure
– Dapper
– ScraperWiki
• Extraction from HTML tables
– Google Squared (deprecated)
- 13 -
Publishing and consuming data on the Semantic Web
• Publishing data involves
– Deciding in which format to publish your data
– Deciding which schema (ontology, vocabulary) to use
• OR you can create a new schema and publish it as well
• Multiple ways of publishing RDF data:
1. Linked Data
2. Metadata in HTML
3. SPARQL endpoints
4. Feeds, e.g. OData
Note: you may implement more than one
- 14 -
Option 1: Linked Data
• A web of RDF documents in parallel to the current Web
– Most often implemented as wrappers around databases or APIs
• The four rules of Linked Data:
– Use URIs to identify things.
– Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
– Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.
– Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
[Figure: RDF graph. #PeterM label “Peter Mika”; #PeterM born #Bud; #Bud label “Budapest”; #Bud capital-of #Hun; #Bud population “2,000,000”]
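Dereferencing, as described in the second and third rules, usually relies on HTTP content negotiation: the client asks for an RDF representation through the Accept header. The sketch below only prepares such a request and does not send it. The DBpedia URI is used as an illustrative target, and whether a given server honours the header is an assumption about that server.

```python
import urllib.request


def build_rdf_request(uri):
    """Prepare (but do not send) an HTTP GET asking for an RDF/XML representation."""
    return urllib.request.Request(
        uri,
        headers={"Accept": "application/rdf+xml"},
    )


# Illustrative Linked Data URI identifying a thing (the city), not a document.
req = build_rdf_request("http://dbpedia.org/resource/Budapest")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would perform the lookup. A Linked Data server typically answers a URI that identifies a thing by redirecting to a document that describes it.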
- 15 -
Option 1: Linked Data
• Advantages:
– No change to the publishing of the HTML documents
– Data can be published by a third party (e.g. DBpedia)
• Disadvantages:
– Web servers need to be configured to properly handle URIs that identify concepts instead of documents
– Not favored by search engines
• Lack of use cases
• Crawling needs to be changed
• Authority is difficult to determine
• Tools
– Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)
– RDB-to-RDF mappers (e.g. D2RQ, Triplify)
– Validators (Vapour)
– Linked Data browsers (many)
- 16 -
Growth of Linked Data
• Community effort to (re)publish open datasets as Linked Data
– In particular, scientific and government datasets
– see linkeddata.org, the Data Hub
- 17 -
Option 2: Metadata in HTML
• Using microformats, RDFa, Microdata (more later)
• Advantages:
– Data and document are always in sync
– Browser plug-in friendly
– Search engine friendly
– Copy-paste friendly
• Tools:
– Any23 (Anything to Triples)
– RDFaCE
– RDFa Distiller
[Figure: the HTML sentence “Peter Mika was born in Budapest.” annotated with the RDF graph: #PeterM label “Peter Mika”; #PeterM born #Bud; #Bud label “Budapest”; #Bud capital-of #Hun; #Bud population “2,000,000”]
- 18 -
Example: Facebook’s Open Graph Protocol
• RDF vocabulary to be used in conjunction with RDFa
– Simplify the work of developers by restricting the freedom in RDFa
• Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment
• Only HTML <head> accepted
• http://opengraphprotocol.org/
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
…
</head>
...
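On the consuming side, markup like the example above can be read with a few lines of standard-library Python. This is a minimal sketch that collects only `<meta property="og:…">` pairs; it is not a general RDFa parser, and the class name `OpenGraphParser` is made up for this example. The sample page reuses three of the tags from the slide.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and "content" in attrs:
            self.properties[prop] = attrs["content"]


SAMPLE = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>"""

parser = OpenGraphParser()
parser.feed(SAMPLE)
og = parser.properties
print(og)
```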
- 19 -
Current state of metadata on the Web
• 31% of webpages, 5% of domains contain some metadata
– Analysis of the Bing Crawl (US crawl, January, 2012)
– RDFa is most common format
• By URL: 25% RDFa, 7% microdata, 9% microformat
• By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
– Adoption is stronger among large publishers
• Especially for RDFa and microdata
• See also
– P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
– H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012
- 20 -
Exponential growth in RDFa data
Percentage of URLs with embedded metadata in various formats
Five-fold increase between March, 2009 and October, 2010
Another five-fold increase between October 2010 and January, 2012
- 21 -
Option 3: SPARQL endpoints
• An API for accessing RDF databases on the Web
– A query language and an HTTP protocol
• Advantages:
– Flexible access: make any query you want
– Also possible to expose a traditional RDBMS via a wrapper
• Disadvantages:
– For the publisher: cost of supporting arbitrary queries
– For the search engine: discovery of SPARQL servers is unsolved
• Tools:
– Triple stores
• Sesame, Jena, OWLIM, Redland, Oracle, Virtuoso, Stardog etc.
– RDB-to-RDF mappers such as D2RQ and Triplify
[Figure: RDF graph. #PeterM label “Peter Mika”; #PeterM born #Bud; #Bud label “Budapest”; #Bud capital-of #Hun; #Bud population “2,000,000”]
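The "query language and an HTTP protocol" pairing can be sketched as follows: in the SPARQL Protocol, a query sent via GET travels URL-encoded in a `query` parameter. The endpoint URL and the query text are illustrative assumptions, and the sketch only builds the request URL without contacting any server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed public endpoint, used here for illustration only.
ENDPOINT = "http://dbpedia.org/sparql"

# Illustrative query: labels of one resource.
QUERY = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Budapest>
      <http://www.w3.org/2000/01/rdf-schema#label> ?label .
}
LIMIT 5
"""


def sparql_get_url(endpoint, query):
    """Build a 'query via GET' URL: the query text goes in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(url)
```

Since the client decides what goes into the query string, the publisher cannot know the cost of a request in advance, which is the disadvantage noted on the slide.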
- 22 -
Example: DBpedia
• demo
- 24 -
Crawling the Semantic Web
• Linked Data
– Similar to HTML crawling, but the crawler needs to parse RDF/XML (and other serializations) to extract URIs to be crawled
– Semantic Sitemap/VOID descriptions
• RDFa
– Same as HTML crawling, but data is extracted after crawling
– Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.
• SPARQL endpoints
– Endpoints are not linked, need to be discovered by other means
– Semantic Sitemap/VOID descriptions
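The Linked Data crawling step can be sketched with the standard-library XML parser: fetch an RDF/XML document, then collect the URIs it mentions as the next crawl candidates. This is a simplified illustration that looks only at rdf:about and rdf:resource attributes and ignores relative URIs, xml:base and other serializations. The sample document and its example.org URIs are made up.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical RDF/XML document, as a crawler might have fetched it.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="http://example.org/roi">
    <foaf:name>Roi Blanco</foaf:name>
    <foaf:knows rdf:resource="http://example.org/peter"/>
  </foaf:Person>
</rdf:RDF>"""


def extract_uris(rdf_xml):
    """Return the sorted HTTP(S) URIs found in rdf:about / rdf:resource attributes."""
    root = ET.fromstring(rdf_xml)
    uris = set()
    for element in root.iter():
        for attr in ("about", "resource"):
            value = element.get("{%s}%s" % (RDF_NS, attr))
            if value and value.startswith(("http://", "https://")):
                uris.add(value)
    return sorted(uris)


frontier = extract_uris(SAMPLE)
print(frontier)
```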
- 25 -
Data fusion
• Ontology (schema) matching
– Widely studied in Semantic Web research
• ontologymatching.org
• Entity resolution
– Finding links between datasets
– Tools: SILK, LIMES
• Blending
– Merging objects that represent the same real world entity and reconciling information from multiple sources
• Cleaning
– Google Refine
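Blending can be sketched as two steps: group identifiers that entity resolution has linked (for example through sameAs), then merge the facts of each group. The sketch uses a small union-find structure and keeps every conflicting value instead of choosing one, which is only one possible reconciliation policy. The records, including the name “R. Blanco”, are invented for illustration.

```python
def find(parent, x):
    """Union-find lookup with path compression."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def blend(records, same_as):
    """Merge records whose identifiers are connected by sameAs links.

    records: {identifier: {property: value}}
    same_as: iterable of (identifier, identifier) pairs
    Conflicting values are all kept, as a set per property.
    """
    parent = {}
    for a, b in same_as:
        parent[find(parent, a)] = find(parent, b)
    merged = {}
    for ident, props in records.items():
        target = merged.setdefault(find(parent, ident), {})
        for prop, value in props.items():
            target.setdefault(prop, set()).add(value)
    return list(merged.values())


# Hypothetical input: two descriptions of the same person, plus one other entity.
records = {
    "example:roi": {"name": "Roi Blanco"},
    "#roi2": {"name": "R. Blanco", "worksWith": "#peter"},
    "#peter": {"name": "Peter Mika"},
}
entities = blend(records, [("example:roi", "#roi2")])
print(len(entities))
```

A production pipeline would replace the "keep everything" policy with source-specific trust or recency rules.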
- 26 -
More info
• Ideas for hacks
– http://challenge.semanticweb.org/
– http://iswc2011.semanticweb.org/calls/linked-data-a-thon/
• Book
– Segaran, Evans and Taylor. Programming the Semantic Web. O’Reilly, 2009.
• More tools
– Exhibit: faceted browsing and other visualizations
– http://www.dajobe.org/talks/200906-semtech-open/
– LOD2 stack (stack.lod2.eu)