Publishing Data on the Semantic Web

Peter Mika

Researcher, Data Architect

Yahoo! Research

Intro to the Semantic Web

- 3 -

Vague, but exciting… Berners-Lee and the dawn of the Web

- 4 -

Semantic Web

• Publish information in a way that is easier to process for machines

• Web of Data instead of Web of Documents

• Two main architectural challenges

– A common format for sharing data

– Sharing the meaning of data

• Through social means (shared schemas)

• By using powerful schema languages

• Semantic Web standards from W3C

– Languages (RDF, OWL, RIF)

– Serializations (RDF/XML, RDFa)

– Protocols (SPARQL, HTTP)

• Semantic Web research into knowledge representation and reasoning, data integration, data quality and many other topics

• Community efforts to publish data and develop schemas

- 5 -

RDF (Resource Description Framework)

• The basic data model of the Semantic Web

– A universal model to capture all sorts of data: networks, relational, object-oriented…

• Basic unit of information is a triple

– A tuple of (subject, predicate, object)

– Example: (Joe, loves, Mary)

– Each triple gives the value of a property for a given resource or relates two objects to one another

• Object is either a resource or a literal

• An RDF model is a set of triples

– Ordering of statements in an RDF document is irrelevant (unlike XML)
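The triple model above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: triples are plain tuples of strings, and the `ex:` names are made-up placeholders rather than terms from a real vocabulary.

```python
# A toy RDF model: a set of (subject, predicate, object) tuples.
# All ex: names are hypothetical placeholders.
Triple = tuple[str, str, str]

doc_a: list[Triple] = [
    ("ex:Joe", "ex:loves", "ex:Mary"),
    ("ex:Joe", "ex:name", '"Joe"'),
]

# The same statements in a different order, with one statement repeated.
doc_b: list[Triple] = list(reversed(doc_a)) + [("ex:Joe", "ex:loves", "ex:Mary")]

# Turning each document into a set discards ordering and duplicates.
model_a = set(doc_a)
model_b = set(doc_b)


def objects(model: set[Triple], subject: str, predicate: str) -> set[str]:
    """Return every object stated for the given subject and predicate."""
    return {o for s, p, o in model if s == subject and p == predicate}
```

Because both documents yield the same set, `model_a == model_b` holds even though the statements were written in a different order.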

- 6 -

Resources vs. literals

• Resources are identified by a URI; a resource without a URI is called a blank node

– URIs are a generalization of URLs

– Notation: <http://www.example.org/Person> or ex:Person

• Literals have an optional language and datatype (string, integer etc.)

– Literals cannot be subjects of statements

– Datatypes are identified by URIs, e.g. XML Schema datatypes

– Two literals are the same if their components are the same

– Notation: “Joe B.”, “Joe”@en or “Joe”^^<http://…#string>
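The rule that two literals are the same when their components are the same can be sketched with a small value class. This is a simplified illustration: the class name `Literal` and its field names are assumptions made for the example, and details such as case-insensitive language tags are ignored.

```python
from dataclasses import dataclass
from typing import Optional

XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class Literal:
    """A literal compared purely by its components (no identity of its own)."""
    lexical: str
    language: Optional[str] = None   # e.g. "en"
    datatype: Optional[str] = None   # a datatype URI, e.g. XSD + "integer"


joe_en = Literal("Joe", language="en")
joe_plain = Literal("Joe")
forty_two_int = Literal("42", datatype=XSD + "integer")
forty_two_str = Literal("42", datatype=XSD + "string")
```

Two separately created `Literal("Joe", language="en")` values compare equal, while the same text with a different language or datatype does not.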

- 7 -

Advanced topic: Resources vs Literals

• Resources are objects, Literals are strings

• Resources are instances of classes, Literals have datatypes

• Whether something is a resource or literal sometimes depends on the detail of modeling

<meta property="myvocab:knows">Paris Hilton</meta>

<item rel="foaf:knows">
  <meta property="foaf:name">Paris Hilton</meta>
</item>

• You cannot make statements about literals (literals are always the object in a triple)

• Resources can carry a globally unique identifier, literals have no identity

• Web resources such as documents and images are resources

– <item rel="rdfs:seeAlso" resource="http://www.some.related.page.com/"/>

– <item rel="foaf:img" resource="http://photosite.example.org/photo.jpg"/>

• When in doubt: it’s a resource

- 8 -

Graphical and textual notation

• A number of ways to serialize an RDF model into an RDF document

– RDF/XML, Turtle, N3, N-Triples

– Example: http://www.cs.vu.nl/~pmika/foaf.rdf

[Figure: node my:Joe with a “name” edge to the literal “Joe A.” and a “type” edge to foaf:Person]

- 9 -

Informational versus non-informational resources

• Informational resource: an HTML document, image, any other file on the Web

– Retrievable in its entirety from the Web

– Retrieving it can return a 200 OK

• Conceptual (non-informational) resource: a person, an event, a place, etc.

– A description of it may be retrievable from the Web

– When identified by a URL, retrieving it should return a 303 See Other redirect to a document that describes it

• Never confuse a webpage with what it describes!

– You are not your Facebook profile: one is a document, the other is a person. A document has properties such as byte-size, media-type etc, a person has name, age, etc.

– Make sure you don’t use the URL of an existing webpage as the URI of a resource
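The difference between the two kinds of resources can be sketched as a tiny routing function. The URL layout is an assumption made for this example (`/id/…` for the thing itself, `/doc/…` for the page describing it); real sites choose their own scheme.

```python
def respond(path: str) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a request path.

    Hypothetical layout: /id/<name> names a person, place, etc.;
    /doc/<name> is the document that describes it.
    """
    if path.startswith("/id/"):
        # A conceptual resource cannot be sent over the wire:
        # redirect to the document that describes it.
        name = path[len("/id/"):]
        return 303, {"Location": "/doc/" + name}
    if path.startswith("/doc/"):
        # An informational resource is served directly.
        return 200, {"Content-Type": "text/html"}
    return 404, {}
```

With this layout the person and the page about the person never share a URI, which is the point of the last bullet above.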

- 10 -

Vocabularies (ontologies)

• Ontologies are collections of classes and properties used to describe objects in a particular domain

– OWL (the Web Ontology Language) is the standard ontology language

– OWL has an RDF serialization: ontologies are part of the Semantic Web

• Classes can be described by sub- and superclasses, required properties

– Class membership in RDF is expressed using the rdf:type property

– An instance can have multiple classes (types)

– A class can have multiple superclasses

• Properties can be described by their domain, range, cardinalities, etc.
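As a rough illustration of multiple types and multiple superclasses, the sketch below derives all classes of an instance by following subclass links. It is a hand-written toy, not an OWL reasoner; the `ex:` and `my:` names are hypothetical, and prefixed names are used as plain strings.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical example data; only foaf:Person and foaf:Agent are real FOAF terms.
triples = {
    ("ex:Student", SUBCLASS, "foaf:Person"),   # a class with two superclasses
    ("ex:Student", SUBCLASS, "ex:Learner"),
    ("foaf:Person", SUBCLASS, "foaf:Agent"),
    ("my:Joe", RDF_TYPE, "ex:Student"),        # an instance with two types
    ("my:Joe", RDF_TYPE, "ex:Musician"),
}


def superclasses(cls, graph):
    """All direct and indirect superclasses of cls."""
    found = set()
    todo = [cls]
    while todo:
        current = todo.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS and o not in found:
                found.add(o)
                todo.append(o)
    return found


def types_of(resource, graph):
    """Stated types plus the types implied by the subclass hierarchy."""
    direct = {o for s, p, o in graph if s == resource and p == RDF_TYPE}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, graph)
    return inferred
```

Here `types_of("my:Joe", triples)` contains the two stated classes as well as every class reachable through the subclass links.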

- 11 -

RDF is designed for distributed systems

• URIs provide web-wide global identification across documents

– A resource may be described by multiple documents

– We know it’s the same resource because the same URI is used or through reasoning (advanced topic…)

– URIs are intended to be reused

– Unique, but not single identifiers: two URIs may denote the same thing

• URIs are dereferenceable (can be retrieved)

– A well-behaved URI returns a description of the resource

– Provides authority: the definition of foaf:Person lives at that URI

• Ontologies can be looked up as well

– Typically at the root of the URIs, also known as the namespace

– Example: http://xmlns.com/foaf/0.1/Person redirects to the specification

- 12 -

URIs implicitly link data together

Joe’s homepage: (#joe, #name, “Joe A.”), (#joe, #email, mailto:[email protected])

Mary’s homepage: (#mary, #name, “Mary B.”), (#mary, #gender, “female”)

A dating site: (#joe, #loves, #mary)

Schema doc: (#name, #type, #Property), (#name, #domain, #Person)

- 13 -

Put together, triples form a single ‘global’ graph

[Figure: one merged graph in which #joe has #name “Joe A.” and #email “[email protected]” and #loves #mary, while #mary has #name “Mary B.” and #gender “female”]
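How shared URIs link separately published triples can be sketched with plain set union. The URIs below are invented stand-ins for the #joe, #mary and property identifiers on the slides, and each Python set stands for the triples found in one document.

```python
# Hypothetical full URIs standing in for #joe, #mary and the schema terms.
JOE = "http://joe.example.org/#joe"
MARY = "http://mary.example.org/#mary"
V = "http://vocab.example.org/#"

# Triples as they might be published by three independent sources.
joes_homepage = {(JOE, V + "name", '"Joe A."')}
marys_homepage = {(MARY, V + "name", '"Mary B."'), (MARY, V + "gender", '"female"')}
dating_site = {(JOE, V + "loves", MARY)}

# Merging is just set union: identical URIs become the same node.
global_graph = joes_homepage | marys_homepage | dating_site


def loved_names(person, graph):
    """Names of everyone the given person loves (a two-hop query)."""
    loved = {o for s, p, o in graph if s == person and p == V + "loves"}
    return {o for s, p, o in graph if s in loved and p == V + "name"}
```

The query only succeeds on the merged graph: the dating site knows who Joe loves, but the name comes from Mary's own page.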

Publishing for the Semantic Web

- 15 -

Motivation

• Why publish data on the (Semantic) Web?

– In a business context

• Increase the potential for linking, reuse and aggregation

– Drive traffic back from other sites on the Web

– Pre-competitive data integration (e.g. drug discovery)

• Make your data more easily findable

– Drive traffic from search engines

– In a non-profit context

• Increase industry or government transparency, accountability

• Support research and education by making data accessible

- 16 -

Publishing and consuming data on the Semantic Web

• Publishing data involves

– Deciding in which format to publish your data

– Deciding which schema (ontology, vocabulary) to use

• OR you can create a new schema and publish it as well

• Multiple ways of publishing RDF data:

1. Linked Data

2. Metadata in HTML

3. SPARQL endpoints

4. Feeds

5. GRDDL

6. Automated tools

Note: you may implement more than one

- 17 -

Option 1: Linked Data

• A web of RDF documents in parallel to the current Web

– Most often implemented as wrappers around databases or APIs

• The four rules of Linked Data:

– Use URIs to identify things.

– Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.

– Provide useful information about the thing when its URI is dereferenced, using standard formats such as RDF/XML.

– Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
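From the client side, dereferencing a URI and asking for RDF can be sketched as an HTTP request with an Accept header. The sketch only builds the request; the helper name `rdf_request` is made up for this example, and whether a given server honours the header is up to that server.

```python
from urllib.request import Request


def rdf_request(uri: str) -> Request:
    """Build a GET request that asks for an RDF/XML description of the URI."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})


# The FOAF term used as an example elsewhere in these slides.
req = rdf_request("http://xmlns.com/foaf/0.1/Person")

# Passing req to urllib.request.urlopen would perform the actual lookup
# (following any redirect); it is left out so the sketch runs offline.
```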

[Figure: the same example graph shown in several RDF documents: #PeterM (label “Peter Mika”) born in #Bud (label “Budapest”, population “2,000,000”), which is capital-of #Hun]

- 18 -

Option 1: Linked Data

• Advantages:

– No change to the publishing of the HTML documents

– Data can be published by a third party (e.g. DBpedia)

• Disadvantages:

– Web servers need to be configured to properly handle URIs that identify concepts instead of documents

– Not favored by search engines

• Lack of use cases

• Crawling needs to be changed

• Authority is difficult to determine

• Tools

– Triple stores (Virtuoso, Oracle etc.) and front-ends (Pubby)

– RDB-to-RDF mappers (e.g. D2RQ, Triplify)

– Validators (Vapour)

– Linked Data browsers (many)

- 19 -

Linked Data as a movement

• Rapidly growing community effort to (re)publish open datasets as Linked Data

– In particular, scientific and government datasets

– see linkeddata.org

- 20 -

Option 2: Metadata in HTML

• Using microformats, RDFa, Microdata (more later)

• Advantages:

– Data and document are always in sync

– Browser plug-in friendly

– Search engine friendly

– Copy-paste friendly

• Tools:

– XML editors (e.g. Oxygen)

– Triplr

– RDFa Distiller

– RDFa bookmarklet

– Ubiquity RDFa plugin

– Optimus microformat parser

• Examples: many, including SlideShare, YouTube, LinkedIn, Digg, Myspace, Facebook…

[Figure: an HTML page with the sentence “Peter Mika was born in Budapest.” carrying the embedded graph: #PeterM (label “Peter Mika”) born in #Bud (label “Budapest”, population “2,000,000”), which is capital-of #Hun]

- 21 -

Option 3: SPARQL endpoints

• An API for accessing RDF databases on the Web

– A query language and an HTTP protocol

• Advantages:

– Flexible access: make any query you want

– Also possible to expose a traditional RDBMS via a wrapper

• Disadvantages:

– For the publisher: cost of supporting arbitrary queries

– For the search engine: discovery of SPARQL servers is unsolved

• Tools:

– Triple stores (Oracle, Virtuoso, Sesame, Jena, OWLIM etc.)

– RDB-to-RDF mappers such as D2RQ and Triplify
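A query sent over the SPARQL protocol is an ordinary HTTP request with the query text in a `query` parameter. The sketch below only builds such a URL; the endpoint address is hypothetical and no request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint address, used for illustration only.
ENDPOINT = "http://example.org/sparql"

QUERY = """
SELECT ?name WHERE {
  ?person <http://xmlns.com/foaf/0.1/name> ?name .
} LIMIT 10
""".strip()


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode the query text as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Since any query text can be sent this way, the publisher carries the cost of evaluating whatever clients ask for, which is the disadvantage listed above.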

[Figure: the example graph held in an RDF database behind the endpoint: #PeterM (label “Peter Mika”) born in #Bud (label “Budapest”, population “2,000,000”), which is capital-of #Hun]

- 22 -

Option 4: Feeds

• Disadvantages:

– No standard feed format for RDF: data needs to be formatted and often manually submitted for each search engine

• Advantages

– Submit your data without making it public

• Competing and incompatible formats

– DataRSS (Yahoo!)

– Google Data Protocol

– Open Data Protocol (Microsoft)

[Figure: the example graph submitted as a feed to several consumers: #PeterM (label “Peter Mika”) born in #Bud (label “Budapest”, population “2,000,000”), which is capital-of #Hun]

- 23 -

Option 5: Publishing a transformation of the data

• Publish the rule to transform the HTML to structured data

• GRDDL is a standard for linking an HTML page to a transformation that produces RDF data

• Advantages

– No change to the page

• Disadvantages

– Transformation needs to be executed to get to the data

– Not much support by search engines

• Tools

– Intel MashMaker

– Dapper

– Glue API from AdaptiveBlue

[Figure: page content (xx yy / 1 2) linked to an XSLT transformation]
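The idea of publishing a transformation can be sketched as follows. GRDDL itself links a page to an XSLT stylesheet; here a small Python function stands in for that stylesheet. The sample page treats the “xx yy / 1 2” content as a two-column table, which is an assumption, and the `#row…` and `#xx` identifiers are invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumed page content: a header row (xx, yy) and one data row (1, 2).
PAGE = """
<table id="data">
  <tr><th>xx</th><th>yy</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
"""


def table_to_triples(xhtml: str, base: str = "#row"):
    """Stand-in for the published transformation: one triple per table cell."""
    table = ET.fromstring(xhtml)
    rows = table.findall("tr")
    headers = [cell.text for cell in rows[0].findall("th")]
    triples = []
    for index, row in enumerate(rows[1:], start=1):
        for header, cell in zip(headers, row.findall("td")):
            # subject: the row, predicate: the column name, object: the cell value
            triples.append((f"{base}{index}", "#" + header, cell.text))
    return triples


triples = table_to_triples(PAGE)
```

The page stays untouched, but a consumer has to run the transformation before any data is available.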

- 24 -

Option 6: Automatic markup

• Web services that annotate HTML automatically

• Advantages

– No manual effort

• Disadvantages

– Limited to finding relevant entities in text

• Tools

– OpenCalais

– Zemanta API

[Figure: the sentence “Peter Mika was born in Budapest.” annotated automatically as: <person>Peter Mika</person> was born in <location>Budapest</location>.]

- 25 -

Example: Zemanta

• A personal writing assistant for bloggers

– Plugin for popular blogging platforms and web mail clients

• Analyzes text as you type and suggests hyperlinks, tags, categories, images and related articles

• API available with the same functionality

- 26 -

Choosing a vocabulary

• No vocabularies in many domains

– Books, movies, stuff people care about…

• Too many competing proposals in other domains

– Often versions of the same proposal

– Example: vocabularies for microformats

• Not maintained

– I cannot maintain your vocabulary for you

• Limited tool support

– Too many expert tools until now

• Many vocabularies are not designed for annotation

• Missing meeting point and social process

– An ontology is a shared, formal representation of a domain

- 27 -

Choosing a vocabulary

• Search the Web or ask for advice on mailing lists

– [email protected]

– [email protected]

• Wikis

– semanticweb.org

– vocamp.org

• Beware of people who claim to have the vocabulary of everything

– Preferably you want something small and targeted

• A vocabulary is never a 100% fit: you will need to introduce vocabulary terms (classes and properties) of your own

– Do not introduce new classes/properties in existing namespaces

– Example: the namespace http://xmlns.com/foaf/0.1/ is used by the FOAF project. Try not to introduce a new term without contacting the owner, i.e. the membership of the FOAF mailing list.

- 28 -

Advanced topic: creating a vocabulary

1. Get advice on methodology

– vocamp.org and semanticweb.org

2. Choose a namespace and a prefix

– Give sensible names, e.g. name it after your site, but don’t call it searchmonkey

– Namespace ends either with a slash or a hash

3. Create an RDF or OWL document describing your classes and properties

– Use an ontology editor such as Protégé 4.0

– Follow naming conventions

4. Publish your vocabulary

– Make sure the URIs of your properties and classes are resolvable

– E.g. myvocab:digicam should resolve to a document containing the definition of myvocab:digicam

5. Convince others to adopt your vocabulary

– If you are in fishing, convince other fishing businesses

- 29 -

How do we build communities? www.vocamp.org

Metadata in HTML

- 31 -

Brief history of the Annotated Web

• 1995: HTML meta tags

• 1996: Simple HTML Ontology Extensions (SHOE)

• 1998: RDF/XML

– RDF/XML in HTML

– RDF linked from HTML

• 2003: Web 2.0

– Tagging

– Microformats

– Metadata in Wikipedia

– Machine tags in Flickr

• 2005: eRDF

• 2008: RDFa 1.0

• 2011: RDFa 1.1

• 2012: Microdata?

- 32 -

HTML meta tags

<HTML>
<HEAD profile="http://dublincore.org/documents/dcq-html/">
  <META name="DC.author" content="Peter Mika">
  <LINK rel="DC.rights copyright" href="http://www.example.org/rights.html" />
  <LINK rel="meta" type="application/rdf+xml" title="FOAF" href="http://www.cs.vu.nl/~pmika/foaf.rdf">
</HEAD>
…
</HTML>

- 33 -

SHOE example (Heflin & Hendler, 1996)

<ONTOLOGY "our-ontology" VERSION="1.0">
  <ONTOLOGY-EXTENDS "organization-ontology" VERSION="2.1" PREFIX="org" URL="http://www.ont.org/orgont.html">
  <ONTDEF CATEGORY="Person" ISA="org.Thing">
  <ONTDEF RELATION="lastName" ARGS="Person STRING">
  <ONTDEF RELATION="firstName" ARGS="Person STRING">
  <ONTDEF RELATION="marriedTo" ARGS="Person Person">
  <ONTDEF RELATION="employee" ARGS="org.Organization Person">
</ONTOLOGY>

<HEAD>
  <META HTTP-EQUIV="Instance-Key" CONTENT="http://www.cs.umd.edu/~george">
  <USE-ONTOLOGY "our-ontology" VERSION="1.0" PREFIX="our" URL="http://ont.org/our-ont.html">
</HEAD>
<BODY>
  <CATEGORY "our.Person">
  <RELATION "our.marriedTo" TO="http://www.cs.umd.edu/~helena">
  <RELATION "our.employee" FROM="http://www.cs.umd.edu">
  My name is
  <ATTRIBUTE "our.firstName"> George </ATTRIBUTE>
  <ATTRIBUTE "our.lastName"> Cook </ATTRIBUTE> and I live at...

- 34 -

SHOE system

- 35 -

SHOE Text-based query interface

- 36 -

SHOE Graphical Query Interface

- 37 -

Example: Creative Commons

Embedding CC license in HTML (now deprecated):

<HTML><HEAD>… </HEAD><BODY>…

<!--
<rdf:RDF xmlns="http://creativecommons.org/ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="http://www.yergler.net/averages/">
    <dc:title>The Law of Averages</dc:title>
    <dc:description>...because eventually i'll be right...</dc:description>
    <license rdf:resource="http://creativecommons.org/licenses/by-nc/1.0/" />
  </Work>
  <License rdf:about="http://creativecommons.org/licenses/by-nc/1.0/">
    <requires rdf:resource="http://web.resource.org/cc/Notice" />
    <permits rdf:resource="http://web.resource.org/cc/Reproduction" />
    <permits rdf:resource="http://web.resource.org/cc/Distribution" />
    <prohibits rdf:resource="http://web.resource.org/cc/CommercialUse" />
  </License>
</rdf:RDF>
-->

- 38 -

Example: Creative Commons

• Current: rel attribute (HTML4)

This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.

• Use of the “rel” attribute for semantic annotation is the birth of the microformat…

- 39 -

Microformats (μf)

• Agreements on the way to encode certain kinds of metadata in HTML

– Reuse of semantic-bearing HTML elements

– Based on existing standards

– Minimality

• Microformats exist for a limited set of objects

– hCard (persons and organizations)

– hCalendar (events)

– hResume

– hProduct

– hRecipe

• Varying degrees of support and stability

– hCard and rel-tag are widely supported

• Community centered around microformats.org

– Specifications and discussions are hosted there

- 40 -

Microformats: limitations

• No shared syntax

– Each microformat has a separate syntax tailored to the vocabulary

• No formal schemas

– Limited reuse, extensibility of schemas

– Unclear which combinations are allowed

• No datatypes

• No namespaces, unique identifiers (URIs)

– no interlinking

– mapping between instances is required

• Always appears in the HTML <body>

- 41 -

Example: the hCard microformat

<cite class="vcard">
  <a class="fn url" rel="friend colleague met" href="http://meyerweb.com/">Eric Meyer</a>
</cite>
wrote a post (<cite><a href="http://meyerweb.com/eric/thoughts/2005/12/16/tax-relief/">Tax Relief</a></cite>)
about an unintentionally humorous letter he received from the
<span class="vcard">
  <a class="fn org url" href="http://irs.gov/">Internal Revenue Service</a>
</span>.

<div class="vcard">
  <a class="email fn" href="mailto:[email protected]">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
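To show what a consumer does with such markup, here is a minimal extraction sketch using only the Python standard library. It reads a handful of hCard class names from elements inside class="vcard" and is far from a complete microformat parser; the class `HCardParser`, the property list and the simplified sample (without the e-mail link) are assumptions made for the example.

```python
from html.parser import HTMLParser

# The few hCard properties this sketch understands.
PROPS = {"fn", "tel", "title", "org"}


class HCardParser(HTMLParser):
    """Collect the text of hCard properties found inside class="vcard" elements."""

    def __init__(self):
        super().__init__()
        self.cards = []     # one dict per vcard
        self._stack = []    # per open element: (is_vcard, property names)

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        is_card = "vcard" in classes
        if is_card:
            self.cards.append({})
        inside = is_card or any(entry[0] for entry in self._stack)
        props = [c for c in classes if c in PROPS] if inside else []
        # Void elements such as <img> are never popped; the sample has none.
        self._stack.append((is_card, props))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not any(entry[0] for entry in self._stack):
            return
        card = self.cards[-1]
        for _, props in self._stack:
            for prop in props:
                card[prop] = (card.get(prop, "") + " " + text).strip()


SAMPLE = """
<div class="vcard">
  <a class="fn" href="http://example.org/joe">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>
"""

parser = HCardParser()
parser.feed(SAMPLE)
cards = parser.cards
```

Note that the property names are hard-coded in the parser: a new kind of attribute means changing the code, which reflects the fixed-schema nature of microformats.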

- 42 -

RDFa

• W3C standard for embedding RDF data in HTML documents

– A set of new HTML attributes to be used in head or body

– A specification of how to extract the data from these attributes

• RDFa is just a syntax, you have to choose a vocabulary separately

• RDFa 1.0 has been a W3C Recommendation since October 2008

– RDFa Primer

• RDFa 1.1 is a small update on RDFa to make it easier to use

– Currently Working Draft (March 31, 2011)

– Updated version of the RDFa Primer (April 19, 2011)

• RDFa API for accessing RDFa data in a webpage in the browser from JavaScript

– Currently Working Draft (April 19, 2011)

- 43 -

RDFa 1.1

• Changes

– New vocab attribute to define the default namespace for the document or subtree

– Profile documents to define multiple namespace prefixes

– The prefix attribute as a recommended replacement of xmlns

– You can use URIs even where only CURIEs were allowed before

• RDFa 1.1 is backward compatible with RDFa 1.0

– RDFa 1.1 is recommended if you want to use HTML5
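The way prefixes, a default vocabulary and full URIs resolve to property URIs can be sketched as below. This is a simplified reading of the idea, not the full RDFa 1.1 processing rules; the function names `parse_prefixes` and `expand` are made up for the example.

```python
from typing import Optional


def parse_prefixes(prefix_attr: str) -> dict[str, str]:
    """Parse a prefix attribute value such as 'og: http://ogp.me/ns#'."""
    tokens = prefix_attr.split()
    return {name.rstrip(":"): uri for name, uri in zip(tokens[0::2], tokens[1::2])}


def expand(term: str, prefixes: dict[str, str], vocab: Optional[str] = None) -> str:
    """Turn a CURIE, a bare term or a full URI into a full URI."""
    if "://" in term:
        return term                      # already a full URI
    if ":" in term:
        prefix, reference = term.split(":", 1)
        if prefix in prefixes:
            return prefixes[prefix] + reference
        return term                      # unknown prefix: leave untouched
    if vocab:
        return vocab + term              # bare term resolved against vocab
    return term


prefixes = parse_prefixes("og: http://ogp.me/ns# foaf: http://xmlns.com/foaf/0.1/")
```

For instance, `expand("og:title", prefixes)` gives the full property URI, and a bare `name` resolves against whatever default vocabulary is passed in.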

- 44 -

When to use RDFa

• Choose microformats when you find a microformat that fits your needs and is supported by your consumers

– Microformats are the first option because they are simple

– Yahoo supports all major microformats, see the documentation

– It’s a common misconception that RDFa requires XHTML or that it’s incompatible with HTML5

• It’s compatible with HTML4, HTML5, XHTML

• If you find none that perfectly fits your needs then you need RDFa

– Microformats have a fixed schema: you cannot add your own attributes

• Example: a social networking site with user profiles

– VCard is a good candidate, but for example it doesn’t have a way to express the user’s social connections

– You either live without this, or go with RDFa

- 45 -

RDFa intro: metadata in the header

• More info in the

<html prefix="og: http://ogp.me/ns#">
<head>
  <title>The Trouble with Bob</title>
  <meta property="og:title" content="The Trouble with Bob" />
  <meta property="og:type" content="text" />
  <meta property="og:image" content="http://example.com/alice/bob-ugly.jpg" />
  ...
</head>

- 46 -

RDFa intro: links with a flavor

• More info in the

All content on this site is licensed under
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/"> a Creative Commons License </a>.

- 47 -

RDFa links: talking about subjects other than the page

• More info in the

The trouble with Bob is that he takes much better photos than me:
<div about="http://example.com/bob/photos/sunset.jpg">
  <img src="http://example.com/bob/photos/sunset.jpg" />
  <span property="og:title">Beautiful Sunset</span>
  by <span property="dc:creator">Bob</span>.
</div>

- 48 -

RDFa links: talking about subjects other than the page

• More info in the

<div typeof="foaf:Person">
  <p property="foaf:name"> Alice Birpemswick </p>
  <p> Email: <a rel="foaf:mbox" href="mailto:[email protected]"> [email protected] </a> </p>
  <p> Phone: <a rel="foaf:phone" href="tel:+1-617-555-7332">+1 617.555.7332</a> </p>
</div>

- 49 -

The process of annotating with RDFa

• Find a vocabulary that fits your needs and is supported by your consumers

– A vocabulary describes a set of types and attributes within a given domain

– If you don’t find a good candidate, extend an existing one or create a new one

• Annotate your page.

– Before you start, you might want to validate your page for (X)HTML conformance using the W3C’s (X)HTML Validator to reduce the chance of errors. Choose Document Type XHTML + RDFa.

– No specific tool support. If you have an HTML or XML editor that supports DTDs, you will have syntax checking and highlighting.

– Use the RDFa Distiller to validate which data can be extracted from your page.

– If you like, use the RDF Validator to visualize the resulting RDF graph.

• Put the annotated page online

– The data will be extracted by Google/Bing/Yahoo the next time your page is crawled and indexed

– The data will be available to browser extensions, bookmarklets etc.

• See http://rdfa.info/rdfa-implementations for new tools and APIs

- 50 -

RDFa can be hard to get right…

• Validation problems can stop us from extracting data

– Use the W3C validator

– Use the right DOCTYPE declaration if using XHTML

– Set the encoding of your page properly (using HTTP headers or XML declaration)

• Prefixes need to be defined using the xmlns attribute

• Unless you are making statements about the document, set the subject using the about attribute

• Do not include HTML elements in literal values

– Incorrect: <div property="foaf:name"><b>Peter Mika</b></div>

• Use absolute URIs as the value of the resource attribute

– Or make sure you specify HTML base

- 51 -

RDFa can be hard to get right… II.

• Be careful when using rel and typeof in combination because of the precedence rules

• BAD example:

<div about="#id">
  <span property="foaf:name">Peter Mika</span>
  <span rel="foaf:img" typeof="foaf:Image">
    <span property="dc:format">jpg</span>
    …
  </span>
</div>

• To correct, you need to put the typeof on an element nested inside the <span> node with rel="foaf:img"

- 52 -

RDFa can be hard to get right… III.

• Typeof does two things at once: it creates a new subject resource and assigns the type to it

• BAD example:

<div about="#id">
  <span property="foaf:name">Peter Mika</span>
  <span rel="foaf:img" resource="http://www.example.org/photo.jpg">
    <span typeof="foaf:Image">
      <span property="dc:format">jpg</span>
    </span>
  </span>
</div>

• To correct, you have to repeat the resource attribute on the span node with the typeof

- 53 -

RDFa can be hard to get right… IV.

• Marking up <h1>:

– <h1 property="dc:title">My homepage</h1>

– NOT: <h1><div property="dc:title">My homepage</div></h1>

• Marking up an image:

<span rel="foaf:img"> <img alt="Alex" src="http://example.org/alex.jpg"/> </span>

NOT:

<img rel="foaf:img" src="photo.jpg"/>

• Header

– <meta property="…" content="…">

NOT

– <meta name="…" content="…">

- 54 -

RDFa can be hard to get right… V.

• You cannot break up a description like this:

<span rel="foaf:knows"> <span property="foaf:name">Peter Mika</span></span>….

<span rel="foaf:knows"> <a rel="foaf:email" href="mailto:[email protected]" /></span>

• This is not the same as:

<span rel="foaf:knows">
  <span property="foaf:name">Peter Mika</span>
  <a rel="foaf:email" href="mailto:[email protected]" />
</span>

• In the first case there are two related resources, with one attribute each, in the second case there is a single related resource with two attributes.

- 55 -

Tips

• Hiding information from being displayed

– Links without content will not be rendered

– Use <span property="foaf:name" content="Peter Mika"/>

• Use datatypes to provide the expected type of a literal.

– This helps validation because any tool can check whether the literal is indeed of that type.
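The kind of check a tool can run once a datatype is given can be sketched as below. Only three XML Schema datatypes are covered and the rules are simplified (for example, time zones on dates are not accepted); the function name `is_valid` is an assumption for this example.

```python
import re
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"


def is_valid(lexical: str, datatype: str) -> bool:
    """Check a literal's text against a (simplified) datatype rule."""
    if datatype == XSD + "integer":
        return re.fullmatch(r"[+-]?[0-9]+", lexical) is not None
    if datatype == XSD + "date":
        # Require the YYYY-MM-DD shape first, then check it is a real date.
        if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", lexical) is None:
            return False
        try:
            date.fromisoformat(lexical)
        except ValueError:
            return False
        return True
    if datatype == XSD + "string":
        return True
    raise ValueError("datatype not covered by this sketch: " + datatype)
```

Without the datatype, a value such as "forty-two" could not be flagged, because a plain literal is just text.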

- 56 -

Example: Facebook’s Like and the Open Graph Protocol

• The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities

– Shows up in profiles and news feed

– Site owners can later reach users who have liked an object

– Facebook Graph API allows 3rd party developers to access the data

• Open Graph Protocol is an RDFa-based format that lets publishers describe the object that the user ‘Likes’

- 57 -

Example: Facebook’s Open Graph Protocol

• RDF vocabulary to be used in conjunction with RDFa

– Simplify the work of developers by restricting the freedom in RDFa

• Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment

• Only HTML <head> accepted

• http://opengraphprotocol.org/

<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
  …
</head>
...
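Because the Open Graph Protocol restricts itself to meta elements in the head, a consumer can be very small. The following sketch collects the og: properties with Python's standard HTML parser; the class name `OpenGraphParser` is made up, and it matches the literal "og:" prefix instead of resolving namespaces, which is a simplification.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        # Simplification: match the literal "og:" prefix.
        if prop.startswith("og:") and "content" in attr:
            self.properties[prop] = attr["content"]


PAGE = """
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
</html>
"""

og_parser = OpenGraphParser()
og_parser.feed(PAGE)
og = og_parser.properties
```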

- 58 -

Example: Yahoo! Enhanced Results (was: SearchMonkey)

• Guide for publishers to mark-up their pages for common types of objects

– Product, Local, News, Video, Events, Documents, Discussion, Games

• Using popular microformats and RDF vocabularies

– Copy-paste code

– Validator

• Yahoo as a consumer

– See later

- 59 -

Example: Google’s Rich Snippets

• Google accepts popular microformats and its own RDFa vocabulary

– Similar approach to RDFa as Facebook

• Validator to check if the markup is correct

• Google displays enhanced results based on this metadata

– Rich Snippets

- 60 -

Microdata example

<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me">
  </p>
</div>
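Reading the properties out of this Microdata example can be sketched as follows. The parser handles a single, non-nested item and skips steps a real implementation performs, such as resolving the image URL against the page address; the class name `MicrodataParser` is an assumption made for the example.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Extract itemprop values from one flat (non-nested) item."""

    def __init__(self):
        super().__init__()
        self.itemid = None
        self.item = {}
        self._current = None    # property whose text content is being read

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "itemscope" in attr:
            self.itemid = attr.get("itemid")
        prop = attr.get("itemprop")
        if not prop:
            return
        if tag == "time" and attr.get("datetime"):
            self.item[prop] = attr["datetime"]   # machine-readable value
        elif tag == "img":
            self.item[prop] = attr.get("src")    # value is the image address
        else:
            self._current = prop
            self.item[prop] = ""

    def handle_data(self, data):
        if self._current:
            self.item[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


SAMPLE = """
<div itemscope itemid="http://www.yahoo.com/resource/person">
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.
  I was born on <time itemprop="birthday" datetime="2009-05-10">May 10th 2009</time>.
  <img itemprop="image" src="me.png" alt="me">
  </p>
</div>
"""

md_parser = MicrodataParser()
md_parser.feed(SAMPLE)
item = md_parser.item
```

The birthday comes from the datetime attribute, not from the visible text, which is what makes the <time> element useful for machines.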

- 61 -

Microdata

• Currently under standardization at the W3C

– Originally part of the HTML5 spec, but now a separate document

• Similar to microformats, but with the extensibility of RDFa

– Introduce new terms using reverse domain names or full URIs

• HTML5 also has a number of “semantic” elements such as <time>, <video>, <article>…

- 62 -

RDFa on the rise

Percentage of URLs with embedded metadata in various formats

510% increase between March, 2009 and October, 2010

- 63 -

The state of metadata in HTML

• 5-10% of webpages contain some explicit metadata

– Depending on how you count…

• Too many competing approaches

– Too many formats: microformats vs RDFa vs Microdata

– When using RDFa, publishers may need to use multiple different vocabularies to satisfy everyone
