
Data on the Semantic Web

Peter Mika

Senior Research Scientist

Yahoo! Research

- 2 -

Vague, but exciting… Berners-Lee and the dawn of the Web

- 3 -

Semantic Web

• Publish information in a way that is easier to process for machines

• Web of Data instead of Web of Documents

• Two main architectural challenges

– A common format for sharing data

– Sharing the meaning of data

• Through social means (shared schemas)

• By using powerful schema languages

• Semantic Web standards from W3C

– Languages (RDF, OWL, RIF)

– Serializations (RDF/XML, RDFa)

– Protocols (SPARQL, HTTP)

• Semantic Web research into knowledge representation and reasoning, data integration, data quality and many other topics

• Community efforts to publish data and develop schemas

- 4 -

Resource Description Framework (RDF)

• Each resource (thing, entity) is identified by a URI

– Globally unique identifiers

• RDF represents knowledge as a set of triples

– Each triple is a single fact about the entity (an attribute or a relationship)

• A set of triples forms an RDF graph

[Figure: RDF document containing two triples: example:roi name “Roi Blanco”; example:roi type foaf:Person]
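
The triple model above can be sketched in a few lines of Python. This is only an illustration, assuming plain tuples stand in for triples and a set stands in for the graph; `http://example.org/` is a placeholder expansion of the slide's `example:` prefix, and `facts_about` is a hypothetical helper.

```python
# Each triple is (subject, predicate, object); an RDF graph is a set of triples.
EX = "http://example.org/"  # assumed expansion of the "example:" prefix
FOAF = "http://xmlns.com/foaf/0.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

graph = {
    (EX + "roi", FOAF + "name", "Roi Blanco"),    # attribute: a literal value
    (EX + "roi", RDF + "type", FOAF + "Person"),  # relationship: another resource
}


def facts_about(graph, subject):
    """Return every (predicate, object) pair stated about one resource."""
    return {(p, o) for s, p, o in graph if s == subject}
```

Because the graph is a set, adding the same fact twice has no effect, which matches the idea that a graph is just the collection of facts it states.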

- 5 -

Linking across the Web

[Figure: RDF graph linking Roi’s homepage, Yahoo!’s website and the Friend-of-a-Friend ontology. Nodes: example:roi, “Roi Blanco”, foaf:Person, #roi2, #peter, “[email protected]”. Edge labels: name, type, sameAs, worksWith, email, knows]

- 7 -

Vocabularies (ontologies)

• Ontologies are collections of classes and properties used to describe objects in a particular domain

– OWL (the Web Ontology Language) is the standard ontology language

– OWL has an RDF serialization: ontologies are part of the Semantic Web

• Classes can be described by sub- and superclasses, required properties

– Class membership in RDF is expressed using the rdf:type property

– An instance can have multiple classes (types)

– A class can have multiple superclasses

• Properties can be described by their domain, range, cardinalities, etc.
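
As a rough sketch of what "multiple types" and "multiple superclasses" mean in practice, the following Python computes the types implied by a class hierarchy. The hierarchy, the `ex:` class names and both function names are made up for illustration; a real OWL reasoner does considerably more than this.

```python
# Hypothetical hierarchy: each class maps to its direct superclasses.
SUBCLASS_OF = {
    "ex:Researcher": {"foaf:Person", "ex:Employee"},  # two superclasses
    "ex:Employee": {"foaf:Agent"},
    "foaf:Person": {"foaf:Agent"},
}


def superclasses(cls, hierarchy):
    """All direct and indirect superclasses of cls (transitive closure)."""
    seen, stack = set(), [cls]
    while stack:
        for parent in hierarchy.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(asserted, hierarchy):
    """Asserted rdf:type values plus every class implied by subclassing."""
    result = set(asserted)
    for cls in asserted:
        result |= superclasses(cls, hierarchy)
    return result
```

An instance asserted to be an `ex:Researcher` therefore also counts as a `foaf:Person`, an `ex:Employee` and a `foaf:Agent`.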

- 8 -

Example: schema.org

• Agreement on a shared set of schemas for common types of web content

– Bing, Google, and Yahoo! as initial supporters

– Similar in intent to sitemaps.org (2006)

• Use a single format to communicate the same information to all three search engines

• Support for microdata

• schema.org covers areas of interest to all search engines

– Business listings (local), creative works (video), recipes, reviews

– User defined extensions

• Each search engine continues to develop its products

- 9 -

Documentation and OWL ontology

Sources of data

- 11 -

Data on the Web

• Most web pages on the Web are generated from structured data

– Data is stored in relational databases (typically)

– Queried through web forms

– Presented as tables or simply as unstructured text

• The structure and semantics (meaning) of the data is not directly accessible to search engines

• Two solutions

– Extraction using Information Extraction (IE) techniques (implicit metadata)

• Supervised vs. unsupervised methods

– Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)

• Particularly interesting for long tail content

- 12 -

Information Extraction methods

• Named Entity Recognition (NER) and Disambiguation (NED)

– OpenCalais, Zemanta API, DBpedia Spotlight

– Yahoo! Placemaker

• Extraction of structured data from text

– Yago system (demo)

• Exploiting patterns in web page structure

– Dapper

– ScraperWiki

• Extraction from HTML tables

– Google Squared (deprecated)
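
To make the last item concrete, here is a minimal sketch of extraction from HTML tables using only Python's standard library. It assumes a simple table whose first row is the header; `TableExtractor`, `table_to_records` and the sample table are illustrative names and data, not part of any tool listed above.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:  # ignore text outside of cells
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Treat the first row as the header and return one dict per data row."""
    parser = TableExtractor()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


SAMPLE = """<table>
<tr><th>city</th><th>population</th></tr>
<tr><td>Budapest</td><td>2,000,000</td></tr>
</table>"""
```

Calling `table_to_records(SAMPLE)` yields one record per data row, keyed by the header cells. Real pages need more care (nested tables, row spans, tables used only for layout).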

- 13 -

Publishing and consuming data on the Semantic Web

• Publishing data involves

– Deciding in which format to publish your data

– Deciding which schema (ontology, vocabulary) to use

• OR you can create a new schema and publish it as well

• Multiple ways of publishing RDF data:

1. Linked Data

2. Metadata in HTML

3. SPARQL endpoints

4. Feeds, e.g. OData

Note: you may implement more than one

- 14 -

Option 1: Linked Data

• A web of RDF documents in parallel to the current Web

– Most often implemented as wrappers around databases or APIs

• The four rules of Linked Data:

– Use URIs to identify things.

– Use HTTP URIs so that these things can be referred to and looked up ("dereference") by people and user agents.

– Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.

– Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

[Figure: RDF graph published as Linked Data. #PeterM (label “Peter Mika”) born #Bud; #Bud (label “Budapest”, population “2,000,000”) capital-of #Hun]
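
Dereferencing, as described in the second and third rules, can be sketched as an HTTP request that asks for RDF rather than HTML. The snippet below only builds the request and does not send it; the URI `http://example.org/people#PeterM` and the function name `rdf_request` are assumptions for illustration.

```python
import urllib.request


def rdf_request(uri, media_type="application/rdf+xml"):
    """Build (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    # A URI with a fragment (#PeterM) is looked up by fetching the document
    # part only; the fragment is never sent to the server.
    document_uri = uri.split("#", 1)[0]
    return urllib.request.Request(document_uri, headers={"Accept": media_type})


req = rdf_request("http://example.org/people#PeterM")  # hypothetical URI
# urllib.request.urlopen(req) would perform the actual lookup.
```

The `Accept` header is what lets one URI serve HTML to browsers and RDF to data clients, provided the server is configured for content negotiation.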

- 15 -

Option 1: Linked Data

• Advantages:

– No change to the publishing of the HTML documents

– Data can be published by third party (e.g. DBpedia)

• Disadvantages:

– Web servers need to be configured to properly handle URIs that identify concepts instead of documents

– Not favored by search engines

• Lack of use cases

• Crawling needs to be changed

• Authority is difficult to determine

• Tools

– Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)

– RDB-to-RDF mappers (e.g. D2RQ, Triplify)

– Validators (Vapour)

– Linked Data browsers (many)

- 16 -

Growth of Linked Data

• Community effort to (re)publish open datasets as Linked Data

– In particular, scientific and government datasets

– see linkeddata.org, the Data Hub

- 17 -

Option 2: Metadata in HTML

• Using microformats, RDFa, Microdata (more later)

• Advantages:

– Data and document are always in sync

– Browser plug-in friendly

– Search engine friendly

– Copy-paste friendly

• Tools:

– Any23 (Anything to Triples)

– RDFaCE

– RDFa Distiller

Peter Mika was born in Budapest.


[Figure: RDF graph embedded in the HTML sentence. #PeterM (label “Peter Mika”) born #Bud; #Bud (label “Budapest”, population “2,000,000”) capital-of #Hun]

- 18 -

Example: Facebook’s Open Graph Protocol

• RDF vocabulary to be used in conjunction with RDFa

– Simplify the work of developers by restricting the freedom in RDFa

• Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment

• Only HTML <head> accepted

• http://opengraphprotocol.org/

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
  …
</head>
...
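
A consumer of this markup could read the properties roughly as follows. This is a sketch using Python's standard `html.parser`, not Facebook's implementation; the class name `OpenGraphParser` and the shortened sample page are illustrative. It honours the "only HTML <head> accepted" restriction by ignoring `meta` elements elsewhere.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs found in <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attrs = dict(attrs)
            prop = attrs.get("property") or ""
            if prop.startswith("og:") and attrs.get("content") is not None:
                self.properties[prop] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


PAGE = """<html xmlns:og="http://opengraphprotocol.org/schema/"><head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
</head><body><meta property="og:title" content="ignored" /></body></html>"""

parser = OpenGraphParser()
parser.feed(PAGE)
```

After feeding the page, `parser.properties` holds the `og:title` and `og:type` values from the head, and the stray `meta` in the body is skipped.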

- 19 -

Current state of metadata on the Web

• 31% of webpages, 5% of domains contain some metadata

– Analysis of the Bing Crawl (US crawl, January, 2012)

– RDFa is most common format

• By URL: 25% RDFa, 7% microdata, 9% microformat

• By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat

– Adoption is stronger among large publishers

• Especially for RDFa and microdata

• See also

– P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012

– H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012

- 20 -

Exponential growth in RDFa data

Percentage of URLs with embedded metadata in various formats

Five-fold increase between March, 2009 and October, 2010

Another five-fold increase between October, 2010 and January, 2012

- 21 -

Option 3: SPARQL endpoints

• An API for accessing RDF databases on the Web

– A query language and an HTTP protocol

• Advantages:

– Flexible access: make any query you want

– Also possible to expose a traditional RDBMS via a wrapper

• Disadvantages:

– For the publisher: cost of supporting arbitrary queries

– For the search engine: discovery of SPARQL servers is unsolved

• Tools:

– Triple stores

• Sesame, Jena, OWLIM, Redland, Oracle, Virtuoso, Stardog etc.

– RDB-to-RDF mappers such as D2RQ and Triplify

[Figure: RDF graph. #PeterM (label “Peter Mika”) born #Bud; #Bud (label “Budapest”, population “2,000,000”) capital-of #Hun]
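
The "query language plus HTTP protocol" combination can be illustrated by building the URL of a SPARQL query sent via GET, where the query text travels in the `query` parameter. The endpoint address, the vocabulary URIs and both helper names are hypothetical; nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint URL

# Hypothetical query: where was #PeterM born?
QUERY = """SELECT ?city WHERE {
  <http://example.org/people#PeterM> <http://example.org/vocab#born> ?city .
}"""


def sparql_get_url(endpoint, query):
    """Encode a query for the GET form of the SPARQL protocol."""
    return endpoint + "?" + urlencode({"query": query})


def query_from_url(url):
    """Decode the query again, as the receiving server would."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = sparql_get_url(ENDPOINT, QUERY)
```

Since any query text can be placed in that parameter, the publisher cannot know in advance how expensive a request will be, which is the cost noted under the disadvantages.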

- 22 -

Example: DBpedia

• demo

- 24 -

Crawling the Semantic Web

• Linked Data

– Similar to HTML crawling, but the crawler needs to parse RDF/XML (and other serializations) to extract URIs to be crawled

– Semantic Sitemap/VoID descriptions

• RDFa

– Same as HTML crawling, but data is extracted after crawling

– Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.

• SPARQL endpoints

– Endpoints are not linked, need to be discovered by other means

– Semantic Sitemap/VoID descriptions
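
The Linked Data case can be sketched as an ordinary breadth-first crawl in which links come from the URIs in the data rather than from HTML anchors. The example below assumes the data is already available as N-Triples text and uses a simple regular expression instead of a real RDF parser; `FAKE_WEB`, `uris_to_crawl` and `crawl` are illustrative names, and the fetch function is injected so that no network access is needed.

```python
import re
from collections import deque

URI = re.compile(r"<([^<>\s]+)>")  # crude: would also match "<...>" inside literals


def uris_to_crawl(ntriples):
    """Document URIs mentioned in N-Triples text, with fragments removed."""
    found = []
    for uri in URI.findall(ntriples):
        doc = uri.split("#", 1)[0]
        if doc.startswith("http") and doc not in found:
            found.append(doc)
    return found


def crawl(seed, fetch, limit=10):
    """Breadth-first crawl; fetch maps a document URI to N-Triples text."""
    queue, visited = deque([seed]), []
    while queue and len(visited) < limit:
        uri = queue.popleft()
        if uri in visited:
            continue
        visited.append(uri)
        for link in uris_to_crawl(fetch(uri)):
            if link not in visited:
                queue.append(link)
    return visited


# Hypothetical two-document web used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.org/a": '<http://example.org/a#me> <http://example.org/vocab#knows> <http://example.org/b#you> .',
    "http://example.org/b": '<http://example.org/b#you> <http://example.org/vocab#name> "B" .',
}
order = crawl("http://example.org/a", lambda uri: FAKE_WEB.get(uri, ""))
```

Note that this sketch also follows predicate URIs, so the vocabulary document ends up in the crawl; whether a crawler should do that is a design choice.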

- 25 -

Data fusion

• Ontology (schema) matching

– Widely studied in Semantic Web research

• ontologymatching.org

• Entity resolution

– Finding links between datasets

– Tools: SILK, LIMES

• Blending

– Merging objects that represent the same real world entity and reconciling information from multiple sources

• Cleaning

– Google Refine
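
Entity resolution and blending can be sketched together: once links between identifiers are known, equivalent identifiers are grouped and their descriptions merged. The code below is a minimal illustration using union-find, not how SILK or LIMES work internally; the function names and the sample value “R. Blanco” are invented for the example.

```python
from collections import defaultdict


def resolve(same_as):
    """Map every identifier to one canonical representative of its cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps chains short
            x = parent[x]
        return x

    for a, b in same_as:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest id becomes canonical
    return {x: find(x) for x in parent}


def blend(triples, same_as):
    """Merge descriptions of equivalent entities, keeping conflicting values."""
    canonical = resolve(same_as)
    merged = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        merged[canonical.get(s, s)][p].add(o)
    return merged


triples = [
    ("example:roi", "name", "Roi Blanco"),
    ("#roi2", "name", "R. Blanco"),  # invented value from a second source
    ("#roi2", "worksWith", "#peter"),
]
merged = blend(triples, same_as=[("example:roi", "#roi2")])
```

The sketch keeps both name values side by side; deciding which value to trust is the reconciliation step and is left open here.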

- 26 -

More info

• Ideas for hacks

– http://challenge.semanticweb.org/

– http://iswc2011.semanticweb.org/calls/linked-data-a-thon/

• Book

– Segaran, Evans and Taylor. Programming the Semantic Web. O’Reilly, 2009.

• More tools

– Exhibit: faceted browsing and other visualizations

– http://www.dajobe.org/talks/200906-semtech-open/

– LOD2 stack (stack.lod2.eu)