Experiments with evolving RDF

Sławek Staworko (joint work with Peter Buneman)

University of Edinburgh

Preservation of evolving data

Figure: three versions of a small RDF graph
• Version 1: (Tom, has, cat), (cat, eats, tuna)
• Version 2: (Tom, has, cat), (cat, dies, Apr 1)
• Version 3: (Tom, has, dog), (dog, eats, dog food)
• …

Archive
• Version retrieval
• Timeline queries
• Storage space efficiency

Approaches to data preservation

• Store all versions
• Store the original database and log the changes
• Hybrid approach of the above two
  • store the initial and every 10th version
  • store logged changes for the intermediate versions
• Annotation-based approach
  • never delete data, but annotate its validity with time intervals

Annotation of RDF

Figure: Version 1, Version 2, and Version 3 of the Tom graph are merged into a single archive in which each triple carries its validity interval

Archive
• (Tom, has, cat) [1–2]
• (cat, eats, tuna) [1–1]
• (cat, dies, Apr 1) [2–2]
• (Tom, has, dog) [3–]
• (dog, eats, dog food) [3–]
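A minimal sketch of how version retrieval and a timeline query could work over such an annotated archive. The dictionary layout, the use of `None` for an open interval, and the function names `retrieve_version` and `timeline` are assumptions made for illustration, not the representation used in the experiments.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# (first version, last version); None as last version means "still valid".
Interval = Tuple[int, Optional[int]]

# The archive from the slide, one list of intervals per triple.
ARCHIVE: Dict[Triple, List[Interval]] = {
    ("Tom", "has", "cat"): [(1, 2)],
    ("cat", "eats", "tuna"): [(1, 1)],
    ("cat", "dies", "Apr 1"): [(2, 2)],
    ("Tom", "has", "dog"): [(3, None)],
    ("dog", "eats", "dog food"): [(3, None)],
}


def valid_at(intervals: List[Interval], version: int) -> bool:
    """True if some interval covers the given version number."""
    return any(
        start <= version and (end is None or version <= end)
        for start, end in intervals
    )


def retrieve_version(archive: Dict[Triple, List[Interval]], version: int) -> Set[Triple]:
    """Version retrieval: all triples valid in the given version."""
    return {t for t, ivs in archive.items() if valid_at(ivs, version)}


def timeline(archive: Dict[Triple, List[Interval]], subject: str, predicate: str):
    """Timeline query: each object of (subject, predicate, ?) with its intervals."""
    return {
        o: ivs for (s, p, o), ivs in archive.items()
        if s == subject and p == predicate
    }
```

For example, `retrieve_version(ARCHIVE, 2)` yields the two triples of Version 2, and `timeline(ARCHIVE, "Tom", "has")` lists cat for versions 1 to 2 and dog from version 3 on.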

What exactly is the input?

Snapshots = complete database instances
• Version 1: (Tom, has, cat), (cat, eats, tuna)
• Version 2: (Tom, has, cat), (cat, dies, Apr 1)
• Version 3: (Tom, has, dog), (dog, eats, dog food)

Delta = difference between two databases, expressed with two atomic operations: inserting a triple and deleting a triple

Deltas
• Version 1 to Version 2: delete (cat, eats, tuna); insert (cat, dies, Apr 1)
• Version 2 to Version 3: delete (Tom, has, cat); insert (Tom, has, dog); insert (dog, eats, dog food); delete (cat, dies, Apr 1)
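When every node has a stable name, the relation between snapshots and deltas reduces to set difference. The sketch below illustrates this on the Tom example; the names `compute_delta` and `apply_delta` and the pair layout `(deleted, inserted)` are assumptions for illustration.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]
Delta = Tuple[Set[Triple], Set[Triple]]  # (deleted triples, inserted triples)

V1 = {("Tom", "has", "cat"), ("cat", "eats", "tuna")}
V2 = {("Tom", "has", "cat"), ("cat", "dies", "Apr 1")}
V3 = {("Tom", "has", "dog"), ("dog", "eats", "dog food")}


def compute_delta(old: Set[Triple], new: Set[Triple]) -> Delta:
    """Delta between two snapshots as (deleted, inserted) sets of triples."""
    return old - new, new - old


def apply_delta(snapshot: Set[Triple], delta: Delta) -> Set[Triple]:
    """Replay a delta on a snapshot to obtain the next snapshot."""
    deleted, inserted = delta
    return (snapshot - deleted) | inserted
```

This only works as written because Tom, cat, and dog keep the same names in every version; with blank nodes the set difference is no longer meaningful without first matching nodes across versions.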

Challenges in preserving evolving data with annotations

1. The task is relatively simple if deltas are known:
   • deleting a triple closes its interval
   • adding a triple opens a new interval

2. It gets complicated when only snapshots are given
   • it boils down to computing deltas
   • main challenge: identify objects that are the same across versions of the database
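The simple case, where deltas are known, can be sketched as follows: a delete closes the triple's open interval and an insert opens a new one. The function name `build_archive`, the `(deleted, inserted)` layout of a delta, and the use of `None` for an open interval are assumptions for illustration; the sketch also assumes well-formed deltas (only existing triples are deleted, and no triple is both deleted and inserted in the same delta).

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Interval = Tuple[int, Optional[int]]  # None as end means the interval is open
Delta = Tuple[Set[Triple], Set[Triple]]  # (deleted triples, inserted triples)


def build_archive(
    initial: Set[Triple], deltas: Iterable[Delta]
) -> Dict[Triple, List[Interval]]:
    """Annotate triples with validity intervals, given version 1 and its deltas."""
    archive: Dict[Triple, List[Interval]] = {t: [(1, None)] for t in initial}
    version = 1
    for deleted, inserted in deltas:
        version += 1
        for t in deleted:
            # Deleting a triple closes its interval at the previous version.
            start, _ = archive[t][-1]
            archive[t][-1] = (start, version - 1)
        for t in inserted:
            # Adding a triple opens a new interval at the current version.
            archive.setdefault(t, []).append((version, None))
    return archive
```

Replaying the two deltas of the Tom example on Version 1 reproduces the intervals [1–2], [1–1], [2–2], [3–], and [3–] of the annotated archive.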

Entity resolution problem
Which data objects represent the same entity across different versions?

A well-studied database problem in various settings (from duplicate elimination to record matching)

Entity resolution and RDF

URI (Uniform Resource Identifier)

URIs are supposed to make things easy, but…
• RDF also has blank nodes
• URIs don't exactly solve the problem in the context of evolving/merged ontologies…

Two different RDF nodes need not represent different objects

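Blank nodes
• LOD initiative frowns upon them
• Blank nodes are commonplace (and misused?)

Figure: two example uses of a blank node _b
• Reification: Peter believes the statement (Tom, has, cat), where _b carries the subject, pred, and object edges
• Complex number: _b groups the components 2.4 and -0.4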

Blank nodes (cont.)
1. Reification (Peter believes that Tom has a cat)
2. Data structures (complex types)
3. Anonymization (Tom has a pet)

Assumptions on reasonable use of blank nodes:
1. They represent concrete objects
2. The objects can be identified from the context

Deblanking

LISP-style encoding of the list of numbers [5,3,7], using blank nodes _b3, _b2, _b1 with head and tail edges:
• _b3: head 5, tail _b2
• _b2: head 3, tail _b1
• _b1: head 7, tail end

Blank nodes are replaced step by step, starting from those whose contents contain no blank nodes:
1. _b1 becomes #(7,end)
2. _b2 becomes #(3,7,end)
3. _b3 becomes #(5,3,7,end)

Assumption: graph has no cycles consisting of blanks only

Assumption: identity of a blank node is determined by its contents
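Under these two assumptions, deblanking can be sketched as a round-based renaming: in each round, every blank node whose outgoing edges lead only to non-blank nodes gets a name computed from its contents. The sketch below is an illustration, not the implementation used in the experiments. It assumes that blank node labels start with `_`, and it builds nested names that include the predicates, such as `#(head=7,tail=end)`, rather than the flattened labels like `#(7,end)` shown on the slide.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]


def is_blank(node: str) -> bool:
    # Assumption: blank node labels start with "_", as in _b1, _b2, _b3.
    return node.startswith("_")


def deblank_round(triples: Set[Triple]) -> Set[Triple]:
    """One round: name each blank subject whose objects are all non-blank."""
    contents: Dict[str, List[Tuple[str, str]]] = {}
    for s, p, o in triples:
        if is_blank(s):
            contents.setdefault(s, []).append((p, o))
    # The name depends only on the sorted (predicate, object) pairs, so two
    # blank nodes with the same contents receive the same name and merge.
    names = {
        b: "#(" + ",".join(f"{p}={o}" for p, o in sorted(edges)) + ")"
        for b, edges in contents.items()
        if not any(is_blank(o) for _, o in edges)
    }
    return {(names.get(s, s), p, names.get(o, o)) for s, p, o in triples}


def deblank(triples: Set[Triple]) -> Tuple[Set[Triple], int]:
    """Repeat rounds until no blank node is left; return triples and round count."""
    rounds = 0
    while any(is_blank(s) or is_blank(o) for s, _, o in triples):
        renamed = deblank_round(triples)
        if renamed == triples:
            # No progress: a cycle of blanks, or a blank node with no contents.
            raise ValueError("cannot deblank: blank-only cycle or empty blank node")
        triples = renamed
        rounds += 1
    return triples, rounds


LIST_573 = {
    ("_b3", "head", "5"), ("_b3", "tail", "_b2"),
    ("_b2", "head", "3"), ("_b2", "tail", "_b1"),
    ("_b1", "head", "7"), ("_b1", "tail", "end"),
}
```

On `LIST_573` the loop needs three rounds, one per list cell. Because names depend only on contents, the same list stored under different blank labels in another version produces identical triples, which is what lets intervals span versions.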

Experiments

• 10 versions of Experimental Factor Ontology (EFO) data expressed in OWL
• 200k triples in the 1st version, 290k in the last
• On average 20k blank nodes in each version
• 920k triples overall (blank nodes are independent)
• Many triples do not last more than 1 version

Experiment

Deblanking and life expectancy of an object

Round   Triples   Blanks   Life expect.
0       921896    165935   2.55
1       358857    33253    6.39
2       348356    28150    6.57
3       339695    23502    6.88
4       330564    18862    7.10
5       318761    14763    7.24
6       311562    11021    7.39
7       304628    7299     7.54
8       297744    3622     7.83
9       285484    58       7.83
10      285334    2        7.83
11      285334    1        7.83
12      285334    0        7.83

Improving space efficiency

Lift common intervals to subject

Figure: Peter has two triples, lives Edinburgh [1–10] and phone +44 712 4567 [1–10]. After lifting, the interval [1–10] is attached once to Peter and dropped from both triples. A second panel shows dog with has [1–5].

• Intervals moved from all but 33.7k triples (of 285k in total)
• Number of subjects with histories is 34.3k
• Total number of intervals is reduced from 285k to 60k
• The size of the index is reduced by almost 80%
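One way lifting could work is sketched below: for each subject, the interval shared by most of its triples is stored once on the subject, and only triples with a different interval keep their own. This is an assumption about the method, as are the function name `lift_intervals`, the simplification to one interval per triple, and the hypothetical triple (Peter, has, dog) used to show a triple that keeps its interval.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional, Tuple

Triple = Tuple[str, str, str]
Interval = Tuple[int, Optional[int]]


def lift_intervals(
    archive: Dict[Triple, Interval]
) -> Tuple[Dict[str, Interval], Dict[Triple, Interval]]:
    """Return (interval per subject, intervals of triples that differ from it)."""
    by_subject = defaultdict(list)
    for triple, interval in archive.items():
        by_subject[triple[0]].append((triple, interval))

    subject_intervals: Dict[str, Interval] = {}
    residual: Dict[Triple, Interval] = {}
    for subject, entries in by_subject.items():
        # The most frequent interval among the subject's triples is lifted.
        common, _count = Counter(iv for _, iv in entries).most_common(1)[0]
        subject_intervals[subject] = common
        for triple, interval in entries:
            if interval != common:
                residual[triple] = interval  # keeps its own annotation
    return subject_intervals, residual


PETER = {
    ("Peter", "lives", "Edinburgh"): (1, 10),
    ("Peter", "phone", "+44 712 4567"): (1, 10),
    ("Peter", "has", "dog"): (1, 5),  # hypothetical triple, for illustration
}
```

For `PETER`, three triple-level intervals become one subject-level interval [1–10] plus one remaining triple-level interval [1–5].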

Future:
• Bisimulation
• Nested RDF

Conclusions

• Annotation offers an attractive way of representing an evolving RDF dataset (need for nested RDF?)

• Evolution of data may require more complex atomic operations. For instance, vocabulary evolution: adding, splitting, merging classes. (can bisimulation help here?)
