a first course in rdf and rdfs (resource description framework and resource description framework...

Introduction to RDF and RDFS

Mark Birbeck, http://webBackplane.com/mark-birbeck

RDF...a four letter word

RDF is probably the least understood of all the W3C standards. Many people say it's too complicated and that authors will never use it. In fact the entire Microformats 'movement' is based on the idea that people will never use RDF.

Yet RDF is probably the most important of all of the W3C's standards; without it there will be no semantic web.

RDF/XML is the real villain

In this tutorial I'd like to show that the bad press that RDF has received is due to RDF/XML. This is a specific XML format for conveying RDF, but RDF should be seen as independent of any particular way of marking it up.

So in this session we're going to concentrate on the 'abstract' aspects of RDF.

The promise of XML

Happy birthday XML!

It's 10 years since XML became a standard, and it's been incredible to see it find its way into all sorts of scenarios.

When it first took off the promise was that it would give us better search results, we'd be able to interchange documents more easily, and so on.

Despite the obvious success of XML, this potential is largely unrealised, since although we have a good interchange format, we still don't know what the documents mean.

RDF Documents

Primer

Concepts

Syntax

Semantics

Vocabulary

Test cases

The W3C's RDF specs comprise the following:

a primer, which introduces many of the ideas of RDF and RDF Schema;

Concepts, which is what we're going to focus on, discussing the underlying ideas of RDF;

Syntax, which is RDF/XML;

Semantics, which is very theoretical, and the place to go if you are the kind of person who likes to think about the meaning of meaning, and whether there are parallel universes to ours;

Vocabulary, which we'll also look at;

Test cases, for developers building RDF tools and services.

This tutorial

Concepts

Transporting RDF

Storing

Some sources of RDF data

Querying

Defining vocabularies

In this tutorial we're going to focus on:

the RDF concepts that are crucial to understanding all of the other clever things that you can do with RDF;

how we might transport RDF, i.e., the languages that can be used to write RDF;

how to store RDF;

some examples of RDF on the web;

how RDF can be queried;

how to create your own RDF vocabularies.

Metadata: data about data

You've probably heard this expression many times; that metadata is data about data.

This is important to understand for a number of reasons.

The first is that this statement is saying that there is nothing mysterious about metadata; it's just plain, ordinary data, and so can be manipulated and stored, just like data. It just so happens that this particular data has the property of being about other data...but so what.

The second point to draw out of this statement is that there is no point in getting hung up on meta-metadata; one level of indirection is fine!

Getting data from the web is easy...getting metadata is hard

The web is mature, and retrieving web pages and doing clever stuff with them is also pretty mature.

The problem we're trying to solve is getting metadata.

The web is full of useful information like this.

XML was supposed to give us this information

XML was supposed to help us to dig out this information.

The Well-tempered Clavier

J. S. Bach

At first sight this seems way ahead of our Amazon web-page:

we should be able to search for music by genre, such as 'classical', 'rock', 'country';

we could search for music by composer, or key, or tempo.

This should be possible because we're no longer dealing with web-pages, but structured documents.

Why has XML not delivered?

To understand why XML hasn't delivered on this promise, let's imagine we have another XML document type.


J. S. Bach

This XML format is dealing with documents of any type, not just pieces of music, which means that a document has an 'author' rather than a 'composer'.

It's still better than our HTML page, but it doesn't deliver on the promise, since it's difficult to search for all pieces by J. S. Bach; now we need to search across 'composer' and 'author'.

We want to say: "give me everything created by J. S. Bach"

What we really want to be able to do is query for anything that was 'created by' J. S. Bach.

To achieve that we need to agree how to say 'created by' in any document

To be able to search for anything created by Bach, across any document, we need to agree on a way to identify the concept of 'created by'.

Of course we could just try to map 'composer' to 'author', and that would work for the two document types that we've just made up. But then the moment we add a third document type we have to do it again, and this time we have to map three ways.

Imagine the problem when anyone can create document types as they like.

Knowing when we are talking about the same thing is a crucial part of the semantic web.

Dublin, Ohio

Photo by Sleestak66: http://www.flickr.com/people/kesselring/

A workshop in 1995, in Dublin, Ohio, agreed on a core set of properties that were seen to be common to many different types of document.

The result of the workshop is the Dublin Core Metadata Initiative.

DC.Creator = "J. S. Bach"

Dublin Core provides us with a standard way of indicating 'created by' for the document.

DC.Title = "The Well Tempered Clavier"

Dublin Core also provides us with a standard way of indicating a document's title.

abstract

audience

contributor

creator

description

publisher

rightsHolder

...and many more

Often called a vocabulary or taxonomy

Over the years the list of terms in Dublin Core has grown. This is just an example of a handful, but the full list is at:

This is often called a taxonomy or vocabulary, and we'll look in the final part of this tutorial at how to create such vocabularies.

Bach: The Well Tempered Clavier

...

Dublin Core is quite widely used in HTML mark-up, by putting the properties into the meta element.

But this is largely because the metadata features of HTML are quite limited.

An introduction to RDF

...

For example, if I was planning to add metadata to my document about the title of the document and its author, I might initially think to do something like this.

But since HTML offers a limited set of metadata features it's very easy to change this to...

An introduction to RDF

...

...this.

In other words, it's easy to make use of common properties in HTML, because there is only one place to put them.

But this is not so easy in XML.


J. S. Bach


J. S. Bach

Recall that we had two document types.


J. S. Bach


J. S. Bach

We could just use an XMLised version of the Dublin Core property in our XML.

But actually there is no easy way to do that in XML, at least not if you want your documents to validate. You'd need to define a schema module that contains the 'dc:creator' element, and then you'd need to incorporate that into your own XML schemas.

And even if we did that, we actually wouldn't know what it is that Bach is the creator of. We don't get any context from this arrangement.

However, this concept is on the right track.

Recap

XML doesn't help

Having unique identifiers does

Despite the promise of XML it doesn't actually solve our problem of providing semantics.

However, as illustrated by Dublin Core, the ability to express unique identifiers starts to address the issue.

This is what RDF makes use of, so let's dive into that.

Unique identifiers are a key concept in RDF

We'll now start to look at some of the core concepts of RDF.

Remember I said that we're concerned all of the time with the idea that RDF is essentially metadata, which is independent of any particular way of expressing that metadata.

One of the key concepts in RDF is the idea of a unique identifier.

DC.Creator

DC.Title

Recall that we had a couple of terms from Dublin Core.

http://purl.org/dc/terms/creator

http://purl.org/dc/terms/title

The way to ensure that each identifier is unique is to use a URI for it.

We've now represented a concept or an idea, with a simple identifier which can be used anywhere.

Resource Description Framework

This is fundamentally what RDF is all about: describing resources.

The term 'resource' was used originally to define things that could be retrieved over the web. And RDF was initially a means of describing properties about those resources, such as whether a web-page was appropriate for children, or what the topic of the page was.

But since we're essentially creating metadata about URIs, then all we have to do is give a URI to anything we want to, and we can then create metadata about that, too.

But that's getting a little ahead of ourselves; let's continue with our Dublin Core properties.



By using a well-known, globally unique identifier to name our properties, we've taken an important step along the road towards "give me everything created by J. S. Bach".

If you're familiar with databases then what we've essentially done is agreed on a common name for our column headings in our database tables, which is a pretty powerful concept.

In these two database tables, one owned by you, and one owned by me, you can see that by having different column names it becomes difficult to move data from one store to another.

But by agreeing on common names, we can share the data, something that would otherwise only be possible by mapping one table to another.

This turns the internet into a giant relational database.

Again if you are familiar with databases, you'll know that it's not just the column names that need to be uniquely identified, but also the primary keys.

And continuing the database analogy again, we know that having unique identifiers is also important in the foreign keys.
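
To make the database analogy concrete, here is a rough sketch in JavaScript. The two 'tables', their column names, and the renameColumns helper are all made up for illustration; only the two Dublin Core URIs come from the slides. Once both sets of rows use the same URIs as column names, a single filter answers the question for both sources.

```javascript
// Shared property names: the Dublin Core URIs shown earlier.
const DC_CREATOR = "http://purl.org/dc/terms/creator";
const DC_TITLE = "http://purl.org/dc/terms/title";

// Two hypothetical tables with different local column names.
const yourTable = [{ composer: "J. S. Bach", name: "The Well-Tempered Clavier" }];
const myTable = [{ author: "J. S. Bach", title: "Cello Suites" }];

// Rename each column to its agreed URI; columns without a mapping are kept as they are.
function renameColumns(rows, columnToUri) {
  return rows.map((row) => {
    const renamed = {};
    for (const [column, value] of Object.entries(row)) {
      renamed[columnToUri[column] || column] = value;
    }
    return renamed;
  });
}

const merged = [
  ...renameColumns(yourTable, { composer: DC_CREATOR, name: DC_TITLE }),
  ...renameColumns(myTable, { author: DC_CREATOR, title: DC_TITLE }),
];

// One query now covers both sources.
const titlesByBach = merged
  .filter((row) => row[DC_CREATOR] === "J. S. Bach")
  .map((row) => row[DC_TITLE]);
```

Note that each table needs only one mapping to the shared names, rather than one mapping per pair of tables.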

Recap

Resources and unique identifiers are fundamental for RDF:



http://dbpedia.org/resource/Johann_Sebastian_Bach

http://dbpedia.org/resource/Well-Tempered_Clavier

So we're using globally unique identifiers to name our properties, as well as to identify 'primary' and 'foreign' keys.

Triples

Another fundamental concept of RDF is the triple.

It's called a triple because we have:

something that we are talking about;

a property of the thing we are talking about;

a value for that property.

You will have been dealing with triples all the time when dealing with computers and programming languages, even if you didn't realise it.

Bach: The Well Tempered Clavier

...

For example, using the meta element in HTML gives you triples.

It may not look like it at first; since meta is only giving you name/value pairs, we appear to only have two thirds of a triple.

But if you consider that the first part of the triple (the thing we're talking about) is the document itself, then we do indeed have a triple.

Similarly, when you've been dealing with databases you've also been dealing with triples; each row represents the thing that you are describing, the column name is the property of that thing, and the value in the cell is the value of the property.

var piece = {
  title: "The Well-tempered Clavier",
  creator: "J. S. Bach"
};

You can even deconstruct objects in a programming language to a set of triples.
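
As a minimal sketch of that deconstruction, the hypothetical helper below turns each property of an object into a (subject, property, value) triple. A bare object has no name of its own, so the subject label "piece" is supplied by the caller; that label is an assumption for the example, not an RDF identifier.

```javascript
// Turn a plain object into [subject, property, value] triples.
function objectToTriples(subject, obj) {
  return Object.entries(obj).map(([property, value]) => [subject, property, value]);
}

const piece = {
  title: "The Well-tempered Clavier",
  creator: "J. S. Bach"
};

const triples = objectToTriples("piece", piece);
// triples holds one entry per property:
//   ["piece", "title", "The Well-tempered Clavier"]
//   ["piece", "creator", "J. S. Bach"]
```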

Triples are sometimes called statements:

J. S. Bach composed "The Well Tempered Clavier".

J. S. Bach was born on 21st March, 1685.

J. S. Bach died on 28th July, 1750.

Triples mirror simple spoken statements.

Just about anything you can think of can be broken down into simple statements like this, or triples.

Triples are the 'atoms' of RDF.

The parts of a triple

We said that a triple is:

something that we are talking about;

a property of the thing we are talking about;

a value for that property.

But we need to get a little more precise now, as we drill into RDF more.

Subject

A URI:

http://dbpedia.org/resource/Johann_Sebastian_Bach

The 'something' that we are talking about is called the 'subject' and is a URI (or, as we'll see later, a blank node).

Predicate

Also a URI:


The property of the 'something' (the subject) is called the 'predicate' and is also always a URI.

Object

Literal or URI:

"Johann Sebastian Bach"

http://dbpedia.org/resource/Well-Tempered_Clavier

The value of the predicate is called the 'object' and can be either a URI, or a literal.

Plain literals

Essentially strings:

"Johann Sebastian Bach"

"The Well-Tempered Clavier"

The most basic type of literal is a plain literal.

Plain literals

But can contain language information:

"Johann Sebastian Bach"@en
"Бах, Иоганн Себастьян"@ru
"The Well-Tempered Clavier"@en
"Wohltemperiertes Klavier"@de

Plain literals can also include language information.

Note the convention of using "@" to tack the language onto the end of the string, which we'll see again later.

Typed literals

And a datatype:

"1685-03-21"^^xsd:date

"1750-07-28"^^xsd:date

A literal with a datatype is called a typed literal.

Note the convention of using "^^" to tack the datatype onto the end of the string, which we'll see again later.

These datatype values could come from anywhere, although generally they will be chosen from XML Schema datatypes.

RDF provides one built-in type though, which is rdf:XMLLiteral.

XML literals

"H<sub>2</sub>O"^^rdf:XMLLiteral

This built-in datatype allows XML to be embedded into literals.
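
To pull the kinds of object together, here is a small sketch that models a URI, a plain literal (with or without a language), and a typed literal, and prints each one using the "@" and "^^" conventions. The helper names (uri, plain, typed, formatTerm) are my own invention for this example, and the string escaping is deliberately simplistic.

```javascript
// Three kinds of term, modelled as small tagged objects (names are illustrative).
const uri = (value) => ({ kind: "uri", value });
const plain = (value, lang) => ({ kind: "literal", value, lang });
const typed = (value, datatype) => ({ kind: "literal", value, datatype });

// Format a term: <...> for URIs, "..." for literals,
// with "@lang" or "^^datatype" tacked on where present.
function formatTerm(term) {
  if (term.kind === "uri") return "<" + term.value + ">";
  // Escape backslashes and double quotes only; enough for this sketch.
  const quoted = '"' + term.value.replace(/\\/g, "\\\\").replace(/"/g, '\\"') + '"';
  if (term.lang) return quoted + "@" + term.lang;
  if (term.datatype) return quoted + "^^" + term.datatype;
  return quoted;
}

const examples = [
  uri("http://dbpedia.org/resource/Johann_Sebastian_Bach"),
  plain("Johann Sebastian Bach"),
  plain("The Well-Tempered Clavier", "en"),
  typed("1685-03-21", "xsd:date"),
].map((term) => formatTerm(term));
```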

Recap

J. S. Bach was born on 21st March, 1685

We had these statements, so let's look at how they convert to triples.

Recap


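<http://dbpedia.org/resource/Johann_Sebastian_Bach> dc:creator <http://dbpedia.org/resource/Well-Tempered_Clavier> .

<http://dbpedia.org/resource/Well-Tempered_Clavier> dc:title "The Well Tempered Clavier"@en .

<http://dbpedia.org/resource/Well-Tempered_Clavier> dc:title "Wohltemperiertes Klavier"@de .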

I'm using Turtle here to represent the triples, a language for serialising RDF which we'll come back to shortly.

Recap


<http://dbpedia.org/resource/Johann_Sebastian_Bach> p:dateOfBirth "1685-03-21"^^xsd:date .

Recap


<http://dbpedia.org/resource/Johann_Sebastian_Bach> p:dateOfDeath "1750-07-28"^^xsd:date .

Graphs

A collection of triples is called a graph.

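[Figure: RDF graph. The node for J. S. Bach has a dc:creator arc to the node for the piece, and p:dateOfBirth and p:dateOfDeath arcs to the literals "1685-03-21" and "1750-07-28". The node for the piece has two dc:title arcs, to "The Well Tempered Clavier" and "Wohltemperiertes Klavier".]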

Note how the object of the first triple (the piece of music) is the subject of another triple; this is a powerful aspect of RDF, and note too, that 'anyone can say anything about anything'.

Nodes are used to mark subjects and objects, with arcs connecting them; hence the name 'graph'.

(Normally these graphs would have long URIs where the ellipses are, but I'm using pictures to represent the items.)
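
Here is a rough sketch of how a set of triples becomes a graph. It assumes triples are plain three-element arrays and that literals are just strings, which is a simplification; the prefixed names stand in for the long URIs, and the direction of the dc:creator arc follows the graph as drawn in these slides. The code indexes the arcs leaving each node, and then picks out the nodes that are the object of one triple and the subject of another.

```javascript
const triples = [
  // Arc direction follows the slides' graph (Bach dc:creator piece).
  ["dbpedia:Johann_Sebastian_Bach", "dc:creator", "dbpedia:Well-Tempered_Clavier"],
  ["dbpedia:Johann_Sebastian_Bach", "p:dateOfBirth", "1685-03-21"],
  ["dbpedia:Johann_Sebastian_Bach", "p:dateOfDeath", "1750-07-28"],
  ["dbpedia:Well-Tempered_Clavier", "dc:title", "The Well Tempered Clavier"],
];

// Index the outgoing arcs of every subject node.
function arcsBySubject(allTriples) {
  const arcs = new Map();
  for (const [s, p, o] of allTriples) {
    if (!arcs.has(s)) arcs.set(s, []);
    arcs.get(s).push({ predicate: p, object: o });
  }
  return arcs;
}

const arcs = arcsBySubject(triples);

// Nodes that are the object of one triple and the subject of another
// are what join separate statements into a single graph.
const linkingNodes = [...new Set(
  triples.map(([, , o]) => o).filter((o) => arcs.has(o))
)];
```

In this data the only linking node is the piece of music, which is exactly the node the two sets of statements share.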

rdf:type

RDF comes with some built-in predicates, of which rdf:type is probably the most important.

We've been using two identifiers in our examples, one for J. S. Bach, and the other for The Well Tempered Clavier. But although we've said that one 'created' the other, we don't actually know anything about what 'type' of 'thing' these two items are; rdf:type is one way to address that.

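<http://dbpedia.org/resource/Johann_Sebastian_Bach> rdf:type yago:Composer .

<http://dbpedia.org/resource/Johann_Sebastian_Bach> rdf:type foaf:Person .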

The type of something is also said to be its class, which we'll go into in more detail when we look at how to create your own vocabularies with RDFS.

Lists

We saw there how we used two different types from two different vocabularies, but RDF also comes with a handful of built-in types or classes.

Some of these are used to create lists of items, and there are three different types of list:

an ordered, or sequential list, which is a list in which the order of the items matters;

an unordered list, or bag, which is a list in which the order doesn't matter;

and a list of mutually exclusive options, or alternates, where any one of the items in the list could be used in place of any other.

Note that there is nothing to enforce this behaviour, and it is merely a set of semantics to guide a processor.

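[Figure: graphs using 'Performed by' and dc:creator arcs. In the first, a piece has a separate 'Performed by' arc to each performer; in the last, a single 'Performed by' arc leads to a node with rdf:type rdf:Bag, to which both performers are attached.]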

To illustrate when lists are useful, imagine we have one piece, such as Bach's Cello Suites, performed by many different people, e.g., Pablo Casals and Yo-Yo Ma. We know how to represent this already, simply by using triples.

But imagine that we have another piece, such as Beethoven's Sonatas for piano and cello; here there will be two performers in one performance. The representation we used for the Cello Suites would not be appropriate, so instead we create a 'bag' and attach our two performers (Richter and Rostropovich) to that bag.

Another way to look at it is that the Cello Suite scenario resulted in two CDs, because two cellists each gave one performance. But the Sonata scenario gave only one CD, because two artists gave one performance.

Blank nodes

Notice in the last graph that we had a node that served no purpose other than to allow us to attach a list of items.

Since it is merely a connector between the more important concepts of the piece of music and the two performers, we're not bothered about naming it with its own unique URI, so it remains a 'blank node', or more usually, a 'bnode'.
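
As a sketch of what the bag looks like as triples: the connector node gets a local blank-node label rather than a URI, it is given the type rdf:Bag, and its members hang off it using RDF's numbered membership properties rdf:_1, rdf:_2, and so on. The ex: names and the helper functions are hypothetical, made up for this example.

```javascript
let blankNodeCounter = 0;

// Blank nodes get a local label ("_:b1", "_:b2", ...) instead of a global URI.
function newBlankNode() {
  blankNodeCounter += 1;
  return "_:b" + blankNodeCounter;
}

// Describe a bag: a node of type rdf:Bag with one rdf:_n arc per member.
function bagTriples(members) {
  const bag = newBlankNode();
  const triples = [[bag, "rdf:type", "rdf:Bag"]];
  members.forEach((member, index) => {
    triples.push([bag, "rdf:_" + (index + 1), member]);
  });
  return { bag, triples };
}

// One performance by two performers: the piece points at the bag,
// and the bag points at each performer.
const { bag, triples: bagPart } = bagTriples(["ex:Richter", "ex:Rostropovich"]);
const graph = [["ex:Cello_Sonatas", "ex:performedBy", bag], ...bagPart];
```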

Representing RDF

The key thing is to understand the concepts that underpin RDF, rather than any particular way of representing RDF.

We call the different ways of representing RDF serialisations.

These serialisations are only really relevant when RDF is 'in transit', since it will usually be stored in a special kind of database (often called a triple store) and after being transported it will usually end up in another triple store.

However, it is obviously worth being familiar with these languages, and in particular it's worth getting up to speed with Turtle, since it's often used in discussions of RDF...like this one.

RDF/XML

1750-07-28

The problem with RDF/XML is that it's quite complicated, and there are many ways to represent the same thing.

The key features you can see here are that there is an attribute for expressing the subject, and that predicates are expressed using XML element names.

Note the different ways that objects are expressed; either as inline text if they are literals, or using the 'resource' attribute if they are a resource.

Turtle

dbpedia:Johann_Sebastian_Bach
  p:dateOfBirth "1685-03-21"^^xsd:date;
  p:dateOfDeath "1750-07-28"^^xsd:date;
  p:death "1750-07-28"^^xsd:date;
  p:deathPlace dbpedia:Leipzig;
  a yago:Composer, foaf:Person
.

We've already seen some Turtle earlier on, and it's worth spending a little time on, since it will appear in many of the discussions you read about RDF, and probably many of the presentations you'll go to this week.

It's intended to be a 'terse' language, i.e., compact, which is why it gets used in presentations and documents.

Note also that when we come to look at querying, Turtle pops up in the W3C's query language.

In this example we see Turtle using namespaces to shorten the URIs; semi-colons to allow us to have lots of predicates for the same subject; and commas to have many values for the same predicate. We've also abbreviated rdf:type to 'a'.
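
To show where the semi-colons and commas come from, here is a minimal sketch of a serialiser. It assumes each term is already written in its Turtle form (prefixed names, quoted literals) and leaves out the prefix declarations a real Turtle document would need; toTurtle is a made-up helper, not part of any library.

```javascript
// Group triples by subject, then by predicate, and print them using
// ";" between predicates and "," between values of the same predicate.
function toTurtle(triples) {
  const bySubject = new Map();
  for (const [s, p, o] of triples) {
    if (!bySubject.has(s)) bySubject.set(s, new Map());
    const byPredicate = bySubject.get(s);
    if (!byPredicate.has(p)) byPredicate.set(p, []);
    byPredicate.get(p).push(o);
  }
  const blocks = [];
  for (const [s, byPredicate] of bySubject) {
    const lines = [];
    for (const [p, objects] of byPredicate) {
      // rdf:type is abbreviated to "a".
      const predicate = p === "rdf:type" ? "a" : p;
      lines.push("  " + predicate + " " + objects.join(", "));
    }
    blocks.push(s + "\n" + lines.join(" ;\n") + " .");
  }
  return blocks.join("\n");
}

const turtle = toTurtle([
  ["dbpedia:Johann_Sebastian_Bach", "p:dateOfBirth", '"1685-03-21"^^xsd:date'],
  ["dbpedia:Johann_Sebastian_Bach", "rdf:type", "yago:Composer"],
  ["dbpedia:Johann_Sebastian_Bach", "rdf:type", "foaf:Person"],
]);
// turtle is:
// dbpedia:Johann_Sebastian_Bach
//   p:dateOfBirth "1685-03-21"^^xsd:date ;
//   a yago:Composer, foaf:Person .
```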

RDFa

Johann Sebastian Bach

...

RDFa is a way to put RDF into HTML and XHTML documents.

As a way of simply carrying RDF it's not that important to discuss; as you can see here just using lots of link and meta elements is probably just as verbose as RDF/XML.

However, RDFa comes into its own if you have data already in your HTML page, which can then be reused as RDF:

RDFa

...

Johann Sebastian Bach was born in 1685, in Eisenach.

As you can see in this second example, much of the data is also being used inline in the document, so RDFa starts to become quite compact.

Storing RDF

RDF is normally stored in a triple store. There is a wide range of them available:

Jena

Sesame

Virtuoso

ARC2

There are many more, so Google is your friend, here.

Also look at Drupal's initiative to make triple-stores pluggable.

Sources of RDF

There are a growing number of online triple stores that can be accessed, and the data then used in your applications.

I've been using some URLs from DBPedia, which is worth a mention, since it's an RDF representation of Wikipedia.

Querying RDF

The W3C has also defined a language for querying RDF, called SPARQL.

SELECT ?composer
WHERE { ?composer rdf:type yago:Composer }

SPARQL is a kind of SQL for RDF.

Essentially we're providing a graph template that needs to be matched. In this example I've said find every subject that is of type 'composer'. I've done this by creating a 'template' that has the predicate and object filled in, leaving a gap for the subject.

The result of the query would be a list of composers, i.e., the subject from every triple that matches our template.
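
The template idea can be sketched in a few lines. This is not how a SPARQL engine is actually built, just an illustration of matching one triple pattern: any term beginning with "?" is treated as a gap to be filled, and everything else must match exactly. The data and the matchPattern helper are assumptions for the example.

```javascript
const triples = [
  ["dbpedia:Johann_Sebastian_Bach", "rdf:type", "yago:Composer"],
  ["dbpedia:Ludwig_van_Beethoven", "rdf:type", "yago:Composer"],
  ["dbpedia:Johann_Sebastian_Bach", "rdf:type", "foaf:Person"],
];

const isVariable = (term) => term.startsWith("?");

// Match one triple pattern against every triple.
// Returns one object of variable bindings per matching triple.
function matchPattern(allTriples, pattern) {
  const results = [];
  for (const triple of allTriples) {
    const binding = {};
    let matches = true;
    pattern.forEach((term, i) => {
      if (isVariable(term)) {
        // A variable used twice in the pattern must take the same value both times.
        if (term in binding && binding[term] !== triple[i]) matches = false;
        binding[term] = triple[i];
      } else if (term !== triple[i]) {
        matches = false;
      }
    });
    if (matches) results.push(binding);
  }
  return results;
}

// The template from the query: predicate and object filled in, subject left as a gap.
const composers = matchPattern(triples, ["?composer", "rdf:type", "yago:Composer"])
  .map((binding) => binding["?composer"]);
```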

SELECT ?composer
WHERE { ?composer rdf:type yago:Composer }

Hopefully you can now also see how we have got to a point where we might be able to ask the question we had at the beginning, "give me everything created by J. S. Bach":

we have a unique identifier for Bach;

we have a unique identifier for 'created by'.

We can therefore run a query like the following:

SELECT ?piece
WHERE
{ dbpr:Johann_Sebastian_Bach dc:creator ?piece }

Creating vocabularies

To illustrate our points we've been using terms from various vocabularies. For example, we've said that Bach was a 'person' from the Friend-of-a-friend vocabulary, and we said that he was a 'composer' from the Yago ontology. We also said that Bach 'created' The Well Tempered Clavier, and The Cello Suites.

RDF Schema is the language used to define these vocabularies or ontologies.

Classes

I'm going to begin with classes, because if you come from a programming background, that is where you would assume that one would start.

However, RDFS is not usually used in a programming-like way, as we'll see...but to get us started let's imagine it is.

Composer is a type of Musician

Musician is a type of Artist

Artist is a type of Creator

Classes in our RDF vocabularies resemble classes in programming languages in that they can be hierarchical. As you know, in a programming language like C++ or Java you can say that one class inherits from another, so we might say that a composer is also a musician, which is also an artist, which is also a creator, which is, in turn, a person.

yago:Composer a rdfs:Class .

To define something as being a class, or a type, we need do nothing more than give it a type of rdfs:Class.


yago:Composer rdfs:subClassOf yago:Musician .

yago:Musician rdfs:subClassOf yago:Artist .

yago:Artist rdfs:subClassOf yago:Creator .

To define the hierarchy, we use the RDFS-provided predicate 'subClassOf'; here we're saying that the class 'composer' is a sub-class of 'musician', which is in turn a sub-class of the class 'artist', and so on.


yago:Composer rdfs:subClassOf yago:Musician .

yago:Musician rdfs:subClassOf yago:Artist .

yago:Artist rdfs:subClassOf yago:Creator .

<http://dbpedia.org/resource/Johann_Sebastian_Bach>
a yago:Composer, foaf:Person
.

Let's add our vocabulary and our information about Bach to the same graph, and then try running some queries against the graph.

SELECT ?composer
WHERE { ?composer rdf:type yago:Composer }

Recall that we queried for all composers before, like this.

But with the graph that we have just defined, we could also query for all artists, like this:

SELECT ?s
WHERE { ?s rdf:type yago:Artist }

Remember that nowhere have we said that Bach or Beethoven are artists, but since the 'subClassOf' predicate has been used in the vocabulary, then they are both automatically artists, simply by being composers.
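
Here is a rough sketch of how a store might arrive at that answer. It assumes the store applies the sub-class rule itself before answering the query (plain pattern matching on the stated triples alone would not find any artists), and the function names are made up for the example.

```javascript
// The vocabulary: [sub-class, super-class] pairs.
const subClassOf = [
  ["yago:Composer", "yago:Musician"],
  ["yago:Musician", "yago:Artist"],
  ["yago:Artist", "yago:Creator"],
];

// The only thing we have actually stated about Bach's type here.
const asserted = [
  ["dbpedia:Johann_Sebastian_Bach", "rdf:type", "yago:Composer"],
];

// A class together with everything above it in the hierarchy.
function superClasses(cls) {
  const seen = new Set([cls]);
  const queue = [cls];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const [sub, sup] of subClassOf) {
      if (sub === current && !seen.has(sup)) {
        seen.add(sup);
        queue.push(sup);
      }
    }
  }
  return [...seen];
}

// For every stated type, also emit a type triple for each super-class.
function inferTypes(triples) {
  const result = [];
  for (const [s, p, o] of triples) {
    if (p !== "rdf:type") continue;
    for (const cls of superClasses(o)) result.push([s, "rdf:type", cls]);
  }
  return result;
}

const allTypes = inferTypes(asserted);
const artists = allTypes
  .filter(([, , o]) => o === "yago:Artist")
  .map(([s]) => s);
```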

Properties

RDFS also allows us to indicate properties of classes. As with rdfs:Class, this is also achieved with a built-in type, rdf:Property.

foaf:surname a rdf:Property .

In this example we've created a property in the Friend-of-a-friend vocabulary, called 'surname'.

If you're not familiar with this vocabulary, it's used for defining people, the organisations they work for, their home-pages, their relationships with other people, and so on.

As this stands, all we've said is that foaf:surname is a predicate that can be used; but since there was nothing to stop us using that as a predicate before, we've not gained a great deal!

However, RDFS adds a few more built-in predicates to allow us to say more about this property.


foaf:surname rdfs:range rdfs:Literal .

The first of these predicates is rdfs:range, which gives an indication of the type of values that the surname property can take. In this example we're simply saying that surname can take a literal, i.e., a string.



foaf:surname rdfs:domain foaf:Person .

The second RDFS predicate is rdfs:domain. This tells us what classes this property might apply to.

Validation v. inference

Now, I said "might apply to", but of course we could say "can only apply to".

In other words, we could use RDF Schema to validate RDF, and ensure such things as cars don't have surnames, composers don't have available seats, and so on.

There is nothing to stop an application doing that.

However, it's far more common to use these aspects of RDFS to infer information about our data, i.e., to fill in or deduce missing pieces.



foaf:surname rdfs:domain foaf:Person .

<http://dbpedia.org/resource/Johann_Sebastian_Bach> foaf:surname "Bach" .

dbpedia:Johann_Sebastian_Bach a foaf:Person .

If we place the small vocabulary we've just created into a graph, and then add the statement that the URI identifying J. S. Bach has a surname property with the value "Bach", it is straightforward to then say that J. S. Bach is a person.

This is why the class nature of RDFS is unlike the way classes work in programming languages. In a standard OO programming language we need to define in advance what a composer is, we need to create properties for their name, the pieces they have composed, their date of birth, and so on.

But in RDFS, we can simply assign a surname to a resource, and that makes them a person. Similarly, we can say that anyone who has composed something is a composer.

xyz:composedBy a rdf:Property .

xyz:composedBy rdfs:range yago:Composer .

xyz:composedBy rdfs:domain yago:Piece .

dbp:Cello_Suites_%28Bach%29
xyz:composedBy dbp:Johann_Sebastian_Bach .

dbp:Johann_Sebastian_Bach a yago:Composer .

Previously we saw the use of rdfs:range to indicate that some property would contain a literal, but we can also use it to indicate that some property will contain other types of value.

For example, rather than indicating that Bach is of type composer, we could simply indicate in a piece of music that its composer was Bach, and the very act of producing a composition makes Bach a composer.

This is far more flexible, and far more natural than having to define types for everything. When you start dealing with hundreds of thousands, or even millions of triples, this kind of convenience becomes crucial.
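
The two inference rules can be sketched as follows: the domain of a property tells us the type of the subject of any triple that uses it, and the range tells us the type of the object. The schema table and function name are my own shorthand for the example, and I've left the literal range of foaf:surname out of the table so that we never try to give a string a type.

```javascript
// Domain and range per property, as declared in the vocabulary.
const schema = new Map([
  ["xyz:composedBy", { domain: "yago:Piece", range: "yago:Composer" }],
  ["foaf:surname", { domain: "foaf:Person" }],
]);

const data = [
  ["dbp:Cello_Suites_%28Bach%29", "xyz:composedBy", "dbp:Johann_Sebastian_Bach"],
];

// The domain types the subject of a triple; the range types its object.
function inferFromDomainAndRange(triples) {
  const inferred = [];
  for (const [s, p, o] of triples) {
    const rule = schema.get(p);
    if (!rule) continue;
    if (rule.domain) inferred.push([s, "rdf:type", rule.domain]);
    if (rule.range) inferred.push([o, "rdf:type", rule.range]);
  }
  return inferred;
}

const newTypes = inferFromDomainAndRange(data);
// newTypes says that the Cello Suites are a yago:Piece
// and that Bach is a yago:Composer, although neither was stated.
```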

xyz:composedBy a rdf:Property .

xyz:composedBy rdfs:range yago:Composer .

xyz:composedBy rdfs:domain yago:Piece .

xyz:composedBy
rdfs:subPropertyOf dc:creator .

And just in case you're now worried that we might have lost sight of the very thing that I was emphasising at the beginning (asking for "everything created by Bach"), that is taken care of by another built-in predicate, rdfs:subPropertyOf.
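
As a sketch of that last step: every statement made with a sub-property can be repeated with its super-property, after which a query on dc:creator also finds pieces that were only ever described with xyz:composedBy. The helper name is an assumption, and here the triples run from the piece to its composer, as in the xyz:composedBy example.

```javascript
// Sub-property to super-property, as declared in the vocabulary.
const subPropertyOf = new Map([["xyz:composedBy", "dc:creator"]]);

const data = [
  ["dbp:Cello_Suites_%28Bach%29", "xyz:composedBy", "dbp:Johann_Sebastian_Bach"],
];

// Keep every triple, and add a copy for each super-property of its predicate.
function withSuperProperties(triples) {
  const result = [...triples];
  for (const [s, p, o] of triples) {
    const seen = new Set([p]);
    let current = subPropertyOf.get(p);
    // Walk up the chain; the seen set guards against accidental cycles.
    while (current !== undefined && !seen.has(current)) {
      result.push([s, current, o]);
      seen.add(current);
      current = subPropertyOf.get(current);
    }
  }
  return result;
}

const expanded = withSuperProperties(data);

// "Everything created by Bach", asked purely in terms of dc:creator.
const createdByBach = expanded
  .filter(([, p, o]) => p === "dc:creator" && o === "dbp:Johann_Sebastian_Bach")
  .map(([s]) => s);
```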

OWL

If you're particularly interested in this aspect of RDFS then you'll want to take a look at the Web Ontology Language, or OWL.

OWL adds further rules to enhance the simple ones that RDFS allows. For example, it enables additional predicates to be inferred (not just class types), allows us to indicate that two properties are the inverse of each other, and so on.

Conclusion

RDF is often presented as being too complicated to use, but it is crucial not only for the future of the web, but also as a means for you to get control of information within your organisation.

Hopefully we've seen that it solves real problems, and also that it needn't be that complicated...in fact you will probably already be familiar with its fundamental ideas.

References

RDF Primer: http://www.w3.org/TR/rdf-primer/

RDF Schema: http://www.w3.org/TR/rdf-schema/

OWL: http://www.w3.org/2004/OWL/
