rdf-3x : a risc-style engine for rdf thomas neumann, gerhard weikum max-planck-institute fur...

30
RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08 May 25 2011 Presented by Somin Kim

Upload: vernon-dickerson

Post on 14-Jan-2016

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

RDF-3X : a RISC-style Engine for RDF

Thomas Neumann, Gerhard Weikum

Max-Planck-Institute fur Informatik, Max-Planck-Institute fur InformatikPVLDB ‘08

May 25 2011Presented by Somin Kim

Page 2: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion

2/30

Page 3: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Introduction (1/3)

Motivation and Problem RDF (Resource Description Framework)

– A flexible representation of schema-free information for the semantic web

– (subject, predicate, object) or (subject, property, value)– All RDF triples together can be viewed as a large graph– The notion of RDF triples fits well with “pay as you go” phi-

losophy

3/30

Page 4: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Introduction (2/3)

Motivation and Problem Technical challenges for managing large-scale RDF

data– Physical database design– Prediction of join attributes– Suitable granularity of statistics gathering – RDF triples form a graph rather than a collection of trees

4/30

Page 5: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Introduction (3/3)

Contribution and Outline RDF-3X (RDF Triple eXpress)

– A novel architecture for RDF indexing and querying, eliminat-ing the need for physical database design

Key principles of RDF-3X– Physical design is workload-independent

By creating appropriate indexes over a single, giant “triple ta-ble”

– The query processor is RISC-style By relying mostly on merge joins over sorted index lists

– The query optimizer employs dynamic programming for plan enumeration

5/30

Page 6: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Outline Introduction Background and State of the Art

– SPARQL– Related Work

Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion

6/30

Page 7: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Background and State of the Art (1/4)

SPARQL The official standard for searching over RDF storages Each pattern consists of S, P, O and each of these is

either a variable or a literal

Two query modifiers of SPARQL– distinct keyword : duplicates must be eliminated– reduced keyword : duplicates may but need not be elimi-

nated

SELECT ?var1 ?var2…WHERE {

pattern1. pattern2. … }

SELECT ?titleWHERE {

?m <hasTitle> ?title; <hasCasting> ?c. ?c <Actor> ?a. ?a <hasName> “Johnny Depp” }

7/30

Page 8: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Background and State of the Art (2/4)

Related Work Triple table

– All triples are stored in a single table

SELECT ?titleWHERE {

?book <title> ?title.?book <author> <Fox, Joe>.?book <copyright> <2001>

}

8/30 Based on JS Myoung’s presentation slide

Page 9: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Background and State of the Art (3/4)

Related Work Property table

– Triples are grouped by their predicate name

subject

property

object

9/30

Page 10: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Background and State of the Art (4/4)

Related Work Cluster-property table

– Triples are clustered by properties that tend to be defined to-gether

10/30

Page 11: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Outline Introduction Background and State of the Art Storage and Indexing

– Triple Store and Dictionary– Compressed Indexes– Aggregated Indexes

Query Processing and Optimization Selectivity Estimates Evaluation Conclusion

11/30

Page 12: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Storage and Indexing (1/7)

Triples Store and Dictionary RDF-3X is based on a single, giant “triples table”

Mapping Dictionary– Replacing all literals by ids using a mapping dictionary– It compresses the triple store by containing only id triples

S P O

ob-ject214

hasColor blue

ob-ject214

be-longsTo

ob-ject352

… … …S P O

0 1 2

0 3 4

… … …

ID Value

0 ob-ject214

1 hasColor

… …12/30

Page 13: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Storage and Indexing (2/7)

Triples Store and Dictionary Store all triples in a clustered B+-tree

– Triples are sorted lexicographically– It allows the conversion of SPARQL patterns into range scans

002 …

000 001 002 003

ID Value

0 ob-ject214

1 hasColor

… …

<Mapping Dictionary>

S P O

0 1 2

0 3 4

… … …

Actually, we don’t need this table!

( literal1, literal2, ?x )

13/30

Page 14: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Storage and Indexing (3/7)

Compressed Indexes We relied on the fact that the variables are a suffix

– <S>-<P>- ?var or <S>-?var1 -?var2

To guarantee that we can answer every possible pat-tern with variables in any position by merely perform-ing a single index scan, we maintain all six permuta-tions of S, P and O in six separate indexes– (SPO, SOP, OSP, OPS, PSO, POS)– We can afford this level of redundancy

<POS>

?var - <P> - <O>

14/30

Page 15: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Storage and Indexing (4/7)

Compressed Indexes Instead of storing full triples, we only store the

changes between triples– The collation order causes neighboring triples to be very sim-

ilar

We use a byte-level compression scheme– The algorithm computes the delta to the previous tuple– If delta is small, it is directly encoded in the header byte– Otherwise, it computes the delta value, write the header

byte with the size information and write the non-zero tail of the delta

15/30

Page 16: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Storage and Indexing (5/7)

Compressed Indexes Comparison of byte-wise compression vs. bit-wise

compression for the Barton dataset

Each leaf page is compressed individually– It allows us to seek to any leaf page and directly start read-

ing triples– The compressed index behaves just like a normal B+-tree

16/30

Page 17: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Storage and Indexing (6/7)

Aggregated Indices For many SPARQL patterns, indexing partial triples

rather than full triples would be sufficient

Aggregated indexes– Each aggregated indexes store only two out of the three col-

umns of a triple (value1, value2, count ) This is done for (SP, PS, SO, OS, PO, OP)

– All three one-value indexes (value1, count) This is done for (S, P, O)

select ?a ?cwhere { ?a ?b ?c }

17/30

Page 18: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Storage and Indexing (7/7)

SPO

SPO

SOP

SO

P

PSO

PS

O

POS

PO

S

OSP

OS

P

OPS

OPS

Triple Index

Count

SP

Count

SO

Count

PS

Count

PO

Count

OP

Count

OS

Count

S

Count

P

Count

O

Aggregate Index

18/30 Based on KS Kim’s presentation slide

Page 19: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization

– Translating SPARQL Queries– Optimizing Join Ordering

Selectivity Estimates Evaluation Conclusion

19/30

Page 20: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Query Processing and Optimization (1/2)

Translating SPARQL Queries Each query can be parsed and expanded into a set

of triple patterns The parser performs dictionary lookups, so the literals

are mapped into ids When a query consists of

– a single pattern Use index structures and answer the query with a single range

scan

– multiple triple pattern Join the results of the individual patterns

When a query includes the distinct option , we elimi-nates duplicates in the result

Finally, a dictionary lookup operator converts the re-sulting ids back in to strings

20/30

Page 21: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Query Processing and Optimization (2/2)

Optimizing Join Ordering Demanding properties

– Bushy join trees (rather than left-deep or right-deep trees)– Fast plan enumeration and cost estimation – Extensive use of merge joins

DP framework– To find best plan, consider all possible plans of subsets– Recursively compute costs for joining subsets to find the

cost of each plan– When plan for any subset is computed, store it and reuse it– Larger plans are created by joining optimal solutions of

smaller problems

21/30

Page 22: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates

– Selectivity Histograms Evaluation Conclusion

22/30

Page 23: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Selectivity Estimates Estimated cardinalities and selectivities have a huge

impact on plan generation

Selectivity Histograms– The cardinality of a single

triple pattern Using aggregated in-

dexes– The numbers of the join

partners

Frequent join path

23/30

Page 24: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation

– General Setup– Query Run-times

Conclusion

24/30

Page 25: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Evaluation (1/3)

General Setup Setup

– 2GHz dual core, 2GB RAM, 30MB/s disk, Linux

Competitors– MonetDB

Column-store-based approach Presented in VLDB07, by Abadi et al.

– PostgreSQL Triple store with SPO, POS, PSO indexes, similar to Sesame

– Other approaches performed much worse Jena2, Yars2(DERI)

25/30

Page 26: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Evaluation (2/3)

General Setup Datasets

– Barton, library data, 51M triples (4.1GB)– Yago, Wikipedia-based ontology, 40M triples (3.1GB)– LibraryThing(partial crawl), tags that users have assigned to

the books, 30M triples (1.8GB)

DB load time & DB size

26/30

Page 27: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Evaluation (3/3)

Query Run-times Average run-times for cold caches (sec)

Average run-time for warm caches (sec)

Barton Yago LibraryThing

RDF-3X 5.9 0.7 0.89

MonetDB 26.4 78.2 8.16

PostgreSQL 167.8 10.6 93.90

Barton Yago LibraryThing

RDF-3X 0.4 0.04 0.13

MonetDB 4.8 54.60 4.39

PostgreSQL 64.3 0.56 30.40

27/30

Page 28: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion

28/30

Page 29: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08

Conclusion RDF-3X is a fast and flexible RDF/SPARQL engine

– Exhaustive but very space-efficient triple indexes– Avoids physical design tuning, generic storage– Fast runtime system, query optimization has a huge impact

29/30

Page 30: RDF-3X : a RISC-style Engine for RDF Thomas Neumann, Gerhard Weikum Max-Planck-Institute fur Informatik, Max-Planck-Institute fur Informatik PVLDB ‘08