rdf-3x : a risc-style engine for rdf thomas neumann, gerhard weikum max-planck-institute fur...

RDF-3X : a RISC-style Engine for RDF

Thomas Neumann, Gerhard Weikum

Max-Planck-Institute fur Informatik, Max-Planck-Institute fur InformatikPVLDB ‘08

May 25 2011Presented by Somin Kim

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion

2/30

Introduction (1/3)

Motivation and Problem RDF (Resource Description Framework)

– A flexible representation of schema-free information for the semantic web

– (subject, predicate, object) or (subject, property, value)– All RDF triples together can be viewed as a large graph– The notion of RDF triples fits well with “pay as you go” phi-

losophy

3/30

Introduction (2/3)

Motivation and Problem Technical challenges for managing large-scale RDF

data– Physical database design– Prediction of join attributes– Suitable granularity of statistics gathering – RDF triples form a graph rather than a collection of trees

4/30

Introduction (3/3)

Contribution and Outline RDF-3X (RDF Triple eXpress)

– A novel architecture for RDF indexing and querying, eliminat-ing the need for physical database design

Key principles of RDF-3X– Physical design is workload-independent

By creating appropriate indexes over a single, giant “triple ta-ble”

– The query processor is RISC-style By relying mostly on merge joins over sorted index lists

– The query optimizer employs dynamic programming for plan enumeration

5/30

Outline Introduction Background and State of the Art

– SPARQL– Related Work

Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion

6/30

Background and State of the Art (1/4)

SPARQL The official standard for searching over RDF storages Each pattern consists of S, P, O and each of these is

either a variable or a literal

Two query modifiers of SPARQL– distinct keyword : duplicates must be eliminated– reduced keyword : duplicates may but need not be elimi-

nated

SELECT ?var1 ?var2…WHERE {

pattern1. pattern2. … }

SELECT ?titleWHERE {

?m <hasTitle> ?title; <hasCasting> ?c. ?c <Actor> ?a. ?a <hasName> “Johnny Depp” }

7/30


Related Work Triple table

– All triples are stored in a single table

SELECT ?titleWHERE {

?book <title> ?title.?book <author> <Fox, Joe>.?book <copyright> <2001>

}

8/30 Based on JS Myoung’s presentation slide


Related Work Property table

– Triples are grouped by their predicate name

subject

property

object

9/30


Related Work Cluster-property table

– Triples are clustered by properties that tend to be defined to-gether

10/30

Outline Introduction Background and State of the Art Storage and Indexing

– Triple Store and Dictionary– Compressed Indexes– Aggregated Indexes

Query Processing and Optimization Selectivity Estimates Evaluation Conclusion

11/30

Storage and Indexing (1/7)

Triples Store and Dictionary RDF-3X is based on a single, giant “triples table”

Mapping Dictionary– Replacing all literals by ids using a mapping dictionary– It compresses the triple store by containing only id triples

S P O

ob-ject214

hasColor blue

ob-ject214

be-longsTo

ob-ject352

… … …S P O

0 1 2

0 3 4

… … …

ID Value

0 ob-ject214

1 hasColor

… …12/30


Triples Store and Dictionary Store all triples in a clustered B+-tree

– Triples are sorted lexicographically– It allows the conversion of SPARQL patterns into range scans

002 …

000 001 002 003

ID Value

0 ob-ject214

1 hasColor

… …

<Mapping Dictionary>

S P O

0 1 2

0 3 4

… … …

Actually, we don’t need this table!

( literal1, literal2, ?x )

13/30


Compressed Indexes We relied on the fact that the variables are a suffix

– <S>-<P>- ?var or <S>-?var1 -?var2

To guarantee that we can answer every possible pat-tern with variables in any position by merely perform-ing a single index scan, we maintain all six permuta-tions of S, P and O in six separate indexes– (SPO, SOP, OSP, OPS, PSO, POS)– We can afford this level of redundancy

<POS>

?var - <P> - <O>

14/30


Compressed Indexes Instead of storing full triples, we only store the

changes between triples– The collation order causes neighboring triples to be very sim-

ilar

We use a byte-level compression scheme– The algorithm computes the delta to the previous tuple– If delta is small, it is directly encoded in the header byte– Otherwise, it computes the delta value, write the header

byte with the size information and write the non-zero tail of the delta

15/30


Compressed Indexes Comparison of byte-wise compression vs. bit-wise

compression for the Barton dataset

Each leaf page is compressed individually– It allows us to seek to any leaf page and directly start read-

ing triples– The compressed index behaves just like a normal B+-tree

16/30


Aggregated Indices For many SPARQL patterns, indexing partial triples

rather than full triples would be sufficient

Aggregated indexes– Each aggregated indexes store only two out of the three col-

umns of a triple (value1, value2, count ) This is done for (SP, PS, SO, OS, PO, OP)

– All three one-value indexes (value1, count) This is done for (S, P, O)

select ?a ?cwhere { ?a ?b ?c }

17/30


SPO

SPO

SOP

SO

P

PSO

PS

O

POS

PO

S

OSP

OS

P

OPS

OPS

Triple Index

Count

SP

Count

SO

Count

PS

Count

PO

Count

OP

Count

OS

Count

S

Count

P

Count

O

Aggregate Index

18/30 Based on KS Kim’s presentation slide

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization

– Translating SPARQL Queries– Optimizing Join Ordering

Selectivity Estimates Evaluation Conclusion

19/30

Query Processing and Optimization (1/2)

Translating SPARQL Queries Each query can be parsed and expanded into a set

of triple patterns The parser performs dictionary lookups, so the literals

are mapped into ids When a query consists of

– a single pattern Use index structures and answer the query with a single range

scan

– multiple triple pattern Join the results of the individual patterns

When a query includes the distinct option , we elimi-nates duplicates in the result

Finally, a dictionary lookup operator converts the re-sulting ids back in to strings

20/30

Query Processing and Optimization (2/2)

Optimizing Join Ordering Demanding properties

– Bushy join trees (rather than left-deep or right-deep trees)– Fast plan enumeration and cost estimation – Extensive use of merge joins

DP framework– To find best plan, consider all possible plans of subsets– Recursively compute costs for joining subsets to find the

cost of each plan– When plan for any subset is computed, store it and reuse it– Larger plans are created by joining optimal solutions of

smaller problems

21/30

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates

– Selectivity Histograms Evaluation Conclusion

22/30

Selectivity Estimates Estimated cardinalities and selectivities have a huge

impact on plan generation

Selectivity Histograms– The cardinality of a single

triple pattern Using aggregated in-

dexes– The numbers of the join

partners

Frequent join path

23/30

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation

– General Setup– Query Run-times

Conclusion

24/30

Evaluation (1/3)

General Setup Setup

– 2GHz dual core, 2GB RAM, 30MB/s disk, Linux

Competitors– MonetDB

Column-store-based approach Presented in VLDB07, by Abadi et al.

– PostgreSQL Triple store with SPO, POS, PSO indexes, similar to Sesame

– Other approaches performed much worse Jena2, Yars2(DERI)

25/30

Evaluation (2/3)

General Setup Datasets

– Barton, library data, 51M triples (4.1GB)– Yago, Wikipedia-based ontology, 40M triples (3.1GB)– LibraryThing(partial crawl), tags that users have assigned to

the books, 30M triples (1.8GB)

DB load time & DB size

26/30

Evaluation (3/3)

Query Run-times Average run-times for cold caches (sec)

Average run-time for warm caches (sec)

Barton Yago LibraryThing

RDF-3X 5.9 0.7 0.89

MonetDB 26.4 78.2 8.16

PostgreSQL 167.8 10.6 93.90

Barton Yago LibraryThing

RDF-3X 0.4 0.04 0.13

MonetDB 4.8 54.60 4.39

PostgreSQL 64.3 0.56 30.40

27/30

Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion

28/30

Conclusion RDF-3X is a fast and flexible RDF/SPARQL engine

– Exhaustive but very space-efficient triple indexes– Avoids physical design tuning, generic storage– Fast runtime system, query optimization has a huge impact

29/30

rdf-3x : a risc-style engine for rdf thomas neumann, gerhard weikum max-planck-institute fur...

Documents