foundations of semantic web databases


Foundations of Semantic Web Databases
Claudio Gutierrez, Carlos Hurtado, Alberto O. Mendelzon

Recall: Semantic Web

The Web is a huge collection of varied, interconnected data that lacks semantics and is therefore understandable only by humans.

To allow anyone to say anything about anything

The Semantic Web is based on the idea of adding machine-understandable semantics to web information via annotations, so that machines can perform more of the tedious work involved in finding, sharing and combining information on the web.

Recall: The Relational Model

The rows represent the things you are storing information about.

The columns represent the properties of those things.

The intersection gives the value of that property for that thing.

Recall: RDF

[Figure: the triple (book, title, JavaScript), with its parts labeled subject, property and value.]

Recall: RDF

Resource Description Framework (RDF).

The RDF model was designed with the following goals: simple data model, formal semantics and provable inference, extensible URI-based vocabulary, allowing anyone to make statements about any resource.

An RDF statement describes a resource (anything that can have a URI) through one of its properties, using a binary predicate that relates it to another resource.

Recall: RDF

RDF statement - (Subject, Predicate, Object)

(http://en.wikipedia.org/wiki/Dan_Brown, http://purl.org/dc/elements/1.1/publisher, "Wikipedia")

Or in XML format: [RDF/XML markup not recoverable; only the literal "Wikipedia" remains]

Recall: Ontology and RDFS

RDF lacks the ability to express relations between objects (e.g. Cat is an Animal, Book has an Author).

RDF Schema (also called RDFS vocabulary) provides additional information about properties, e.g. adds information about the classes and properties of resources and the relations between them.

Recall: RDF Schema

RDFS main constructs: Class, subClassOf, Property, subPropertyOf, Object, Predicate, Subject, Range, Domain, Type, etc.

A: (John, Class, Man)
B: (Man, subClassOf, Person)
C: (A, Subject, John)

Enables Duck Typing.

Reification

Recall: RDF Query Languages

Given data represented in RDF, a query language (e.g. SPARQL) makes it possible to retrieve and manipulate the data.

As in other query languages, we would like to filter and reorganize the data. Although the data can be part of different DBs and represented in different formats, its semantics is represented with RDFS and ontologies that are common to all of the data.

The Problem

[Figure: several RDF DBs marked with "?" and "!", with the symbols "=" and "U" between them.]

The Problems

Different representations of the data (no normal form) and redundancy elimination.

Equivalence (of DBs, queries and answers).

Entailment and containment of queries.

The impact of predefined semantics (RDFS vocabulary), blank nodes, reification and premises on queries.

Complexity issues.

Blank Nodes\Resource

A blank node (or blank resource) is a resource in an RDF DB (or graph) that is not identified by a URI (Uniform Resource Identifier).

(John, knows, _:p1)
(_:p1, birthDate, 04-21)

There exists some _:p1 who is known by John and whose date of birth is the 21st of April.

Enables partial understanding when information is missing.

We will use the letters N, X, Y to denote blank nodes.

RDF Graphs

UBL (resources) is the union of U (URIs), B (blank nodes) and L (literals).

A triple has the form (Subj, Pred, Obj).

[Figure: a triple drawn as an edge labeled Pred from node Subj to node Obj.]

An RDF graph G is a set of triples.

RDF Graphs

The universe of a graph G, universe(G), is the set of elements of UBL that occur in the triples of G.

The vocabulary of a graph G is the set of elements of U ∪ L that occur in the triples of G.

A graph is ground if it has no blank nodes.

The union of G1 and G2, denoted G1 ∪ G2, is the union of their sets of triples.

The merge of G1 and G2, denoted G1 + G2, is the union of their sets of triples after making their sets of blank nodes disjoint (merge is safe).
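The difference between union and merge can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a graph is a set of 3-tuples of strings, blank nodes are the strings starting with `_:`, and the fresh names `_:m0`, `_:m1`, ... are a hypothetical renaming scheme, not something taken from the slides.

```python
from itertools import count

def is_blank(term):
    # Assumed convention: blank nodes are strings starting with "_:".
    return isinstance(term, str) and term.startswith("_:")

def blank_nodes(graph):
    return {t for triple in graph for t in triple if is_blank(t)}

def union(g1, g2):
    # Plain set union: blank nodes with the same label are identified.
    return set(g1) | set(g2)

def merge(g1, g2):
    # Rename the blank nodes of g2 that clash with g1, then take the union.
    taken = blank_nodes(g1) | blank_nodes(g2)
    fresh = (f"_:m{i}" for i in count())
    renaming = {}
    for b in sorted(blank_nodes(g1) & blank_nodes(g2)):
        new = next(n for n in fresh if n not in taken)
        taken.add(new)
        renaming[b] = new
    g2_renamed = {tuple(renaming.get(t, t) for t in triple) for triple in g2}
    return set(g1) | g2_renamed

g1 = {("_:X", "sc", "a")}
g2 = {("_:X", "sc", "c")}
print(union(g1, g2))  # one shared blank node
print(merge(g1, g2))  # two distinct blank nodes
```

In the union, `_:X` acts as a bridge between the two graphs. In the merge, the two sources cannot accidentally share a blank node, which is why merge is the safe choice when combining independent sources.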

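RDF Graphs

[Figure: example graphs G1 and G2 (sc edges over nodes a, c and blank nodes X, Y), together with their union G1 ∪ G2 and their merge G1 + G2.]

RDFS Vocabulary

Describes properties such as attributes of resources and relationships between them. It also makes it possible to make statements about statements (reification).

For a given triple N: (a, b, c) that occurs in http://...

[Figure: reification of the triple through node N, with edges labeled type (to stat), subj (to a), pred (to b), obj (to c) and occurs (to http://...).]

Maps

A map is a function μ: UBL → UBL that preserves URIs and literals.

A map μ is consistent with a graph G if μ(G) is an RDF graph. In that case μ(G) is an instance of G.

An instance of G is proper if it has fewer blank nodes than G.

Overloading the meaning of map, we write μ: G1 → G2 if there is a map μ such that μ(G1) is a subgraph of G2.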
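As a rough illustration of maps and instances, the sketch below applies a map to the blank-node example from the earlier slide. It assumes, as in the underlying paper, that a map leaves URIs and literals fixed. The tuple representation, the `_:` prefix convention, the helper names and the URI `Mary` are all hypothetical choices made for the example.

```python
def is_blank(term):
    # Assumed convention (not from the slides): blank nodes start with "_:".
    return isinstance(term, str) and term.startswith("_:")

def blank_nodes(graph):
    return {t for triple in graph for t in triple if is_blank(t)}

def apply_map(mu, graph):
    # Return mu(graph). Terms missing from mu are left unchanged,
    # and only blank nodes may be remapped.
    if not all(is_blank(k) for k in mu):
        raise ValueError("a map must leave URIs and literals fixed")
    return {tuple(mu.get(t, t) for t in triple) for triple in graph}

def is_proper_instance(instance, graph):
    # Proper: the instance has fewer blank nodes than the original graph.
    return len(blank_nodes(instance)) < len(blank_nodes(graph))

def maps_into(mu, g1, g2):
    # mu: G1 -> G2 holds when mu(G1) is a subgraph of G2.
    return apply_map(mu, g1) <= set(g2)

g = {("John", "knows", "_:p1"), ("_:p1", "birthDate", "04-21")}
mu = {"_:p1": "Mary"}  # "Mary" is a hypothetical URI
instance = apply_map(mu, g)
print(instance)
print(is_proper_instance(instance, g))
```

The sketch does not check consistency (for example, that no literal ends up in subject position); a fuller version would need to distinguish URIs from literals.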

Graph Isomorphism

Two graphs G1 and G2 are isomorphic, denoted G1 ≅ G2, if there are maps μ1 and μ2 such that μ1(G1) = G2 and μ2(G2) = G1.

Graph Isomorphism

[Figure: two isomorphic graphs, one with nodes a, b, c, d, g, h, i, j and one with nodes 1 to 8.]

μ(a) = 1, μ(b) = 6, μ(c) = 8, μ(d) = 3, μ(g) = 5, μ(h) = 2, μ(i) = 4, μ(j) = 7

Lean Graphs

A graph G is lean if there is no map μ such that μ(G) is a proper subgraph of G.

[Figure: example graphs G1 and G2 with labels a, b, X, Y, p, q, r.]

Core

Theorem: Each RDF graph G contains a unique (up to isomorphism) lean subgraph which is an instance of G. We will denote this unique subgraph by core(G).
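The definition of leanness can be tested directly, if inefficiently, by trying every assignment of blank nodes to terms of the graph. The sketch below is a brute-force check under the same assumed representation as before (triples as tuples, blank nodes prefixed with `_:`). It is exponential in the number of blank nodes and is not an algorithm from the slides.

```python
from itertools import product

def is_blank(term):
    # Assumed convention: blank nodes are strings starting with "_:".
    return isinstance(term, str) and term.startswith("_:")

def is_lean(graph):
    # G is lean if no map sends G onto a proper subgraph of itself.
    graph = set(graph)
    terms = sorted({t for triple in graph for t in triple})
    blanks = [t for t in terms if is_blank(t)]
    for images in product(terms, repeat=len(blanks)):
        mu = dict(zip(blanks, images))
        image = {tuple(mu.get(t, t) for t in triple) for triple in graph}
        if image < graph:  # strict subset: a proper subgraph
            return False
    return True

redundant = {("a", "p", "_:X"), ("a", "p", "b")}
chain = {("a", "p", "_:X"), ("_:X", "q", "b")}
print(is_lean(redundant))  # _:X can be mapped to b, so the graph shrinks
print(is_lean(chain))
```

In `redundant`, the triple with the blank node says nothing beyond what (a, p, b) already states, so its core is the single ground triple.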

Theorem:

Deciding if G is lean is coNP-complete (reduction to tautology).

Deciding if G' ≅ core(G) is DP-complete.

Graph Interpretation

An interpretation I of an RDF graph G consists of:

A non-empty set of resources Res.

The literals, a subset Lit ⊆ Res.

A set of binary properties Prop ⊆ Res × Res.

A mapping from the vocabulary of G: U → Res ∪ Prop and L → Lit.

Entailment & Equivalence

An RDF graph G1 entails G2, denoted G1 |= G2, iff every interpretation over the vocabulary of G1 ∪ G2 that satisfies G1 also satisfies G2.

We say that two graphs are equivalent, denoted G1 ≡ G2, if G1 |= G2 and G2 |= G1.

Semantics of Simple RDF Graphs

A simple RDF graph is a graph that does not use vocabulary with a predefined semantics.

Theorem: A simple RDF graph G1 entails G2, denoted G1 |= G2, if and only if there is a map μ: G2 → G1.

A graph entails any of its subgraphs.
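The map characterization suggests a direct, exponential test for simple entailment: try every assignment of the blank nodes of G2 to terms of G1 and check whether the image lands inside G1. The following is a minimal sketch, assuming graphs are sets of string triples and blank nodes carry a `_:` prefix. The function name `entails_simple` and the example graphs are hypothetical.

```python
from itertools import product

def is_blank(term):
    # Assumed convention: blank nodes are strings starting with "_:".
    return isinstance(term, str) and term.startswith("_:")

def entails_simple(g1, g2):
    # G1 |= G2 for simple graphs iff some map sends G2 into G1.
    g1 = set(g1)
    blanks = sorted({t for tr in g2 for t in tr if is_blank(t)})
    targets = sorted({t for tr in g1 for t in tr})
    for images in product(targets, repeat=len(blanks)):
        mu = dict(zip(blanks, images))
        if {tuple(mu.get(t, t) for t in tr) for tr in g2} <= g1:
            return True
    return False

g1 = {("a", "p", "b"), ("b", "q", "c")}
g2 = {("_:X", "p", "b")}
print(entails_simple(g1, g2))  # map _:X to a
print(entails_simple(g2, g1))  # g2 says less than g1
```

When G2 is ground the loop runs once with the empty map, and the test reduces to checking that G2 is a subgraph of G1.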

[Figure: a graph over nodes a, b, c with edges p, q entails (|=) a graph over X, b, c with edges p, q.]

Semantics of Simple RDF Graphs

Theorem:

Deciding entailment of simple RDF graphs is NP-complete.

Deciding equivalence of simple RDF graphs is isomorphism-complete.

Both depend heavily on the number of blank nodes. This can be done in O(v^n), where v is the number of nodes and n is the number of blank nodes.

Theorem: If G is simple, then core(G) is the unique minimal graph equivalent to G.

Semantics of RDF Graphs with RDFS Vocabulary

The following deductive system is sound and complete (each rule is written premises / conclusion):

Group A (simple graphs)
1) From a map μ: G' → G: G / G'

Group B (sp)
2) (a, type, prop) / (a, sp, a)
3) (a, sp, b) (b, sp, c) / (a, sp, c)
4) (a, sp, b) (x, a, y) / (x, b, y)

Group C (sc)
5) (a, type, class) / (a, sc, a)
6) (a, sc, b) (b, sc, c) / (a, sc, c)
7) (a, sc, b) (x, type, a) / (x, type, b)

Group D (typing)
8) (a, dom, c) (x, a, y) / (x, type, c)
9) (a, range, d) (x, a, y) / (y, type, d)

Semantics of RDF Graphs with RDFS Vocabulary

Theorem: G1 |= G2 if and only if there is a sequence of operations (rule applications) that starts from G1 and ends with G2. Deciding this is NP-complete.

There is no map μ: G2 → G1 although G1 |= G2.

The idea is to close the graph with all possible triples.

[Figure: example graphs G1 and G2 built from sc edges over nodes a, b, c, d and blank node X.]

Closure

A closure of a graph G is a maximal set of triples G' over universe(G) plus the RDFS vocabulary such that G' contains G and is equivalent to it.

There could be more than one closure for a graph.

The closure may contain redundancies.
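One way to get a feel for "closing the graph with all possible triples" is to apply the deductive rules 2) to 9) until nothing new appears. The sketch below does only that. It skips rule 1) and treats blank nodes as constants, so it is a simplification rather than the closure as defined above. Rule 7) is read with sc, matching its group, and the function name `rdfs_fixpoint` is hypothetical.

```python
def rdfs_fixpoint(graph):
    # Saturate a set of (s, p, o) string triples under rules 2) to 9).
    closed = set(graph)
    while True:
        new = set()
        for (s, p, o) in closed:
            if p == "type" and o == "prop":
                new.add((s, "sp", s))             # rule 2
            if p == "type" and o == "class":
                new.add((s, "sc", s))             # rule 5
            for (s2, p2, o2) in closed:
                if p == "sp" and p2 == "sp" and o == s2:
                    new.add((s, "sp", o2))        # rule 3
                if p == "sp" and p2 == s:
                    new.add((s2, o, o2))          # rule 4
                if p == "sc" and p2 == "sc" and o == s2:
                    new.add((s, "sc", o2))        # rule 6
                if p == "sc" and p2 == "type" and o2 == s:
                    new.add((s2, "type", o))      # rule 7
                if p == "dom" and p2 == s:
                    new.add((s2, "type", o))      # rule 8
                if p == "range" and p2 == s:
                    new.add((o2, "type", o))      # rule 9
        if new <= closed:
            return closed
        closed |= new

g = {("a", "sc", "b"), ("b", "sc", "c"), ("x", "type", "a")}
for triple in sorted(rdfs_fixpoint(g)):
    print(triple)
```

The loop terminates because the rules only combine terms already present in the graph, so there are finitely many triples that can ever be added.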

The problem of deciding if G' is the closure of G is DP-complete.

[Figure: example graph with labels a, b, c, d, X, p, q, r.]

Normal Form

A normal form of a graph G, denoted nf(G), is core(G') for a closure G' of G.

Theorem: Let G be an RDF graph:

The normal form nf(G) is unique.

G1 |= G2 if and only if there is a map nf(G2) → nf(G1).

G1 ≡ G2 if and only if nf(G1) ≅ nf(G2).

The problem of deciding if G' is the normal form of G is DP-complete.

Normal Form

[Figure: the graphs G1 and G2 (sc edges over a, b, c, d and blank node X) and the normal form nf(Gi).]

nf is not the most compact representation.

Query Language

The RDF database will be an RDF graph.

Let V be the set of variables, denoted ?X, ?Y.

The query form is Datalog-like, H ← B, where H and B contain variables. For example:

(?X, ancestor, ?Y) ← (?X, ancestor, ?Z), (?Z, ancestor, ?Y)

The condition var(H) ⊆ var(B) avoids the presence of free variables in the head of the query.

Blank nodes in the body play the same role as variables and are therefore unnecessary.

Query Language

A query can have a set of premises P and constraints C. A query is a tuple (H, B, P, C).

The set of constraints C gives the user the possibility to discriminate between blank and ground nodes in the answer.

The premise P represents information the user supplies to the database to be queried in order to answer the query, e.g. the ability to query incomplete information by supplying information not in the DB, or adding semantic information like (son, sp, relative).

Answer to a Query

Let q = (H, B, P, C) be a query, D a database and V a set of variables.

A valuation v is a function v: V → UBL defined on all variables x in B. It satisfies C, written v |= C, if v(x) is not a blank node for all variables x in C.

The pre-answer to q over D is the set of single answers v(H):

preans(q,D) = {v(H) : v(B) ⊆ nf(D+P) and v |= C}

Answer to a Query

Composing a complex query from simpler ones.

ansu(q,D) is the union of all single answers (blank nodes play the role of bridges between two single answers).

ans+(q,D) is the merge of all single answers (renaming blank nodes to avoid name clashes). Useful when querying several sources.

Let q be a query:

If D' |= D then ans(q,D') |= ans(q,D).

For all D, ansu(q,D) |= ans+(q,D) (the converse is not true).
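The pre-answer definition can be sketched as a small backtracking matcher. The code below assumes that the set `data` already stands for nf(D+P), that variables are strings starting with `?`, that blank nodes start with `_:`, and that C is given as a collection of variable names that must not be bound to blank nodes. All function names are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def is_blank(term):
    return isinstance(term, str) and term.startswith("_:")

def valuations(body, data, v=None):
    # Yield every valuation v such that v(body) is a subset of data.
    v = v or {}
    if not body:
        yield dict(v)
        return
    first, rest = body[0], body[1:]
    for triple in data:
        candidate = dict(v)
        # A variable must keep its earlier binding; a constant must match.
        if all((candidate.setdefault(pat, val) == val) if is_var(pat)
               else pat == val
               for pat, val in zip(first, triple)):
            yield from valuations(rest, data, candidate)

def pre_answers(head, body, data, constrained=()):
    # Set of single answers v(H), each one a frozenset of triples.
    answers = set()
    for v in valuations(list(body), data):
        if any(is_blank(v.get(x)) for x in constrained):
            continue  # v does not satisfy C
        answers.add(frozenset(tuple(v.get(t, t) for t in tr) for tr in head))
    return answers

data = {("a", "ancestor", "b"), ("b", "ancestor", "c")}
head = [("?X", "ancestor", "?Y")]
body = [("?X", "ancestor", "?Z"), ("?Z", "ancestor", "?Y")]
single = pre_answers(head, body, data)
ans_union = set().union(*single)  # union of all single answers
print(ans_union)
```

The merge-based answer would additionally rename blank nodes apart in each single answer before taking the union, which is omitted here for brevity.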

Reification

The ability to identify RDF statements.

By having a blank node in the head of the query, one can identify a statement.

(N, value, true), (N, type, stat), (N, subj, ?X), (N, pred, ?Y), (N, obj, ?Z) ← (?X, ?Y, ?Z)

This can cause an infinite DB: if statement i1: (a, b, c) is valid, then so is statement i2: (i1, subj, a), and then the statement (i2, subj, i1), and so on.

Query Containment

Exploring different notions of query containment.

In relational databases, set-theoretical inclusion of tuples captures this requirement.

Let q and q' be queries. For all databases D:

q ⊑p q' iff preans(q,D) ⊆ preans(q',D) up to isomorphism.

q ⊑m q' iff ans(q',D) |= ans(q,D).

Let q and q' be queries. q ⊑p q' implies q ⊑m q'. The converse is not true.

Theorem: Deciding each one of them is NP-complete.

Query Containment

For example:

q: H = B = (X, sc, Y), (Y, sc, Z)
q': H' = B' = (X, sc, Y), (Y, sc, Z), (X, sc, Z)

q ⊑m q' and q' ⊑m q both hold, but neither q ⊑p q' nor q' ⊑p q.

Query Containment

Consider the queries q = (H, B, P, C) and q' = (H', B', P', C'), and assume H, H', B, B', P, P' are simple graphs.

Theorem: q ⊑p q' if and only if for each map on the variables of B, there is a substitution (of variables and blank nodes) such that:

(B)P+(B(B,P)), where (B,P) is the set of triples t of B such that (t)P.

(H)=H.

(C)C.

Query Containment

Consider the queries q = (H, B, P, C) and q' = (H', B', P', C'), and assume H, H', B, B', P, P' are simple graphs.

Theorem: q ⊑m q' if and only if there are n substitutions (of variables), indexed by j = 1, ..., n, such that:

j(B)nf(B).

jj(H)|=H.

j(C)C.

Complexity of Query Answering

The complexity of the evaluation problem (testing emptiness of the query answer set) comes in two versions:

Query complexity version: For a fixed database D, given a query q, is q(D) non-empty? NP-complete.

Data complexity version: For a fixed query q, given a database D, is q(D) non-empty? Polynomial.

The size of the answer set is bounded by |D|^|q|.

Redundancy Elimination In Graphs

A reduction of a graph G is a minimal graph Gr equivalent to G and contained in G.

Algorithm computing the reduction of a graph G:

G ← nf(G)

Apply reverse rules 7), 8), 9), 4), and 3) and 6), in this order, until no longer applicable.

Apply any reverse rule in any order until no longer applicable.

Theorem: The problem of deciding if G' is the reduction of G is DP-complete.

Redundancy Elimination In Queries

Redundancy in query answers can be avoided with lean query heads.

A lean query body is not always possible, and may cause answers to be missed.

Even having lean databases and queries with lean heads and bodies does not avoid redundancies. For example:

G1 is the answer to the query (?Z, p, ?U) ← (?Z, p, ?U) on G2.

[Figure: example graphs G1 and G2 with labels a, b, X, Y, p, q, r.]

Redundancy Elimination In Queries

The naive approach to eliminating redundancy in answers is to compute ans(q,D) and then a lean equivalent of ans(q,D).

Theorem: Given a lean database D and a query q, to decide whether ans(q,D) is lean is coNP-complete (in the size of D).

Theorem: Given a lean database D and a query q, deciding whether ans+(q,D) is lean can be done in polynomial time in the size of D.

Contributions

Normal form.

A formal definition of query language for RDF and its main features.

Query containment and processing.

Redundancy elimination.

From entailment to mapping between graphs.

Complexity issues.

References

Foundations of Semantic Web Databases. Claudio Gutierrez, Carlos Hurtado, Alberto O. Mendelzon (2004)

RDF Semantics W3C Working Draft (2003)

Composing Web Services on the Semantic Web Vadim Eisenberg

Special thanks to Google and Wikipedia.

Thank you!
