rdf-3x : a risc-style engine for rdf thomas neumann, gerhard weikum max-planck-institute fur...
TRANSCRIPT
RDF-3X : a RISC-style Engine for RDF
Thomas Neumann, Gerhard Weikum
Max-Planck-Institute fur Informatik, Max-Planck-Institute fur InformatikPVLDB ‘08
May 25 2011Presented by Somin Kim
Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion
2/30
Introduction (1/3)
Motivation and Problem RDF (Resource Description Framework)
– A flexible representation of schema-free information for the semantic web
– (subject, predicate, object) or (subject, property, value)– All RDF triples together can be viewed as a large graph– The notion of RDF triples fits well with “pay as you go” phi-
losophy
3/30
Introduction (2/3)
Motivation and Problem Technical challenges for managing large-scale RDF
data– Physical database design– Prediction of join attributes– Suitable granularity of statistics gathering – RDF triples form a graph rather than a collection of trees
4/30
Introduction (3/3)
Contribution and Outline RDF-3X (RDF Triple eXpress)
– A novel architecture for RDF indexing and querying, eliminat-ing the need for physical database design
Key principles of RDF-3X– Physical design is workload-independent
By creating appropriate indexes over a single, giant “triple ta-ble”
– The query processor is RISC-style By relying mostly on merge joins over sorted index lists
– The query optimizer employs dynamic programming for plan enumeration
5/30
Outline Introduction Background and State of the Art
– SPARQL– Related Work
Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion
6/30
Background and State of the Art (1/4)
SPARQL The official standard for searching over RDF storages Each pattern consists of S, P, O and each of these is
either a variable or a literal
Two query modifiers of SPARQL– distinct keyword : duplicates must be eliminated– reduced keyword : duplicates may but need not be elimi-
nated
SELECT ?var1 ?var2…WHERE {
pattern1. pattern2. … }
SELECT ?titleWHERE {
?m <hasTitle> ?title; <hasCasting> ?c. ?c <Actor> ?a. ?a <hasName> “Johnny Depp” }
7/30
Background and State of the Art (2/4)
Related Work Triple table
– All triples are stored in a single table
SELECT ?titleWHERE {
?book <title> ?title.?book <author> <Fox, Joe>.?book <copyright> <2001>
}
8/30 Based on JS Myoung’s presentation slide
Background and State of the Art (3/4)
Related Work Property table
– Triples are grouped by their predicate name
subject
property
object
9/30
Background and State of the Art (4/4)
Related Work Cluster-property table
– Triples are clustered by properties that tend to be defined to-gether
10/30
Outline Introduction Background and State of the Art Storage and Indexing
– Triple Store and Dictionary– Compressed Indexes– Aggregated Indexes
Query Processing and Optimization Selectivity Estimates Evaluation Conclusion
11/30
Storage and Indexing (1/7)
Triples Store and Dictionary RDF-3X is based on a single, giant “triples table”
Mapping Dictionary– Replacing all literals by ids using a mapping dictionary– It compresses the triple store by containing only id triples
S P O
ob-ject214
hasColor blue
ob-ject214
be-longsTo
ob-ject352
… … …S P O
0 1 2
0 3 4
… … …
ID Value
0 ob-ject214
1 hasColor
… …12/30
Storage and Indexing (2/7)
Triples Store and Dictionary Store all triples in a clustered B+-tree
– Triples are sorted lexicographically– It allows the conversion of SPARQL patterns into range scans
002 …
000 001 002 003
ID Value
0 ob-ject214
1 hasColor
… …
<Mapping Dictionary>
S P O
0 1 2
0 3 4
… … …
Actually, we don’t need this table!
( literal1, literal2, ?x )
13/30
Storage and Indexing (3/7)
Compressed Indexes We relied on the fact that the variables are a suffix
– <S>-<P>- ?var or <S>-?var1 -?var2
To guarantee that we can answer every possible pat-tern with variables in any position by merely perform-ing a single index scan, we maintain all six permuta-tions of S, P and O in six separate indexes– (SPO, SOP, OSP, OPS, PSO, POS)– We can afford this level of redundancy
<POS>
?var - <P> - <O>
14/30
Storage and Indexing (4/7)
Compressed Indexes Instead of storing full triples, we only store the
changes between triples– The collation order causes neighboring triples to be very sim-
ilar
We use a byte-level compression scheme– The algorithm computes the delta to the previous tuple– If delta is small, it is directly encoded in the header byte– Otherwise, it computes the delta value, write the header
byte with the size information and write the non-zero tail of the delta
15/30
Storage and Indexing (5/7)
Compressed Indexes Comparison of byte-wise compression vs. bit-wise
compression for the Barton dataset
Each leaf page is compressed individually– It allows us to seek to any leaf page and directly start read-
ing triples– The compressed index behaves just like a normal B+-tree
16/30
Storage and Indexing (6/7)
Aggregated Indices For many SPARQL patterns, indexing partial triples
rather than full triples would be sufficient
Aggregated indexes– Each aggregated indexes store only two out of the three col-
umns of a triple (value1, value2, count ) This is done for (SP, PS, SO, OS, PO, OP)
– All three one-value indexes (value1, count) This is done for (S, P, O)
select ?a ?cwhere { ?a ?b ?c }
17/30
Storage and Indexing (7/7)
SPO
SPO
SOP
SO
P
PSO
PS
O
POS
PO
S
OSP
OS
P
OPS
OPS
Triple Index
Count
SP
Count
SO
Count
PS
Count
PO
Count
OP
Count
OS
Count
S
Count
P
Count
O
Aggregate Index
18/30 Based on KS Kim’s presentation slide
Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization
– Translating SPARQL Queries– Optimizing Join Ordering
Selectivity Estimates Evaluation Conclusion
19/30
Query Processing and Optimization (1/2)
Translating SPARQL Queries Each query can be parsed and expanded into a set
of triple patterns The parser performs dictionary lookups, so the literals
are mapped into ids When a query consists of
– a single pattern Use index structures and answer the query with a single range
scan
– multiple triple pattern Join the results of the individual patterns
When a query includes the distinct option , we elimi-nates duplicates in the result
Finally, a dictionary lookup operator converts the re-sulting ids back in to strings
20/30
Query Processing and Optimization (2/2)
Optimizing Join Ordering Demanding properties
– Bushy join trees (rather than left-deep or right-deep trees)– Fast plan enumeration and cost estimation – Extensive use of merge joins
DP framework– To find best plan, consider all possible plans of subsets– Recursively compute costs for joining subsets to find the
cost of each plan– When plan for any subset is computed, store it and reuse it– Larger plans are created by joining optimal solutions of
smaller problems
21/30
Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates
– Selectivity Histograms Evaluation Conclusion
22/30
Selectivity Estimates Estimated cardinalities and selectivities have a huge
impact on plan generation
Selectivity Histograms– The cardinality of a single
triple pattern Using aggregated in-
dexes– The numbers of the join
partners
Frequent join path
23/30
Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation
– General Setup– Query Run-times
Conclusion
24/30
Evaluation (1/3)
General Setup Setup
– 2GHz dual core, 2GB RAM, 30MB/s disk, Linux
Competitors– MonetDB
Column-store-based approach Presented in VLDB07, by Abadi et al.
– PostgreSQL Triple store with SPO, POS, PSO indexes, similar to Sesame
– Other approaches performed much worse Jena2, Yars2(DERI)
25/30
Evaluation (2/3)
General Setup Datasets
– Barton, library data, 51M triples (4.1GB)– Yago, Wikipedia-based ontology, 40M triples (3.1GB)– LibraryThing(partial crawl), tags that users have assigned to
the books, 30M triples (1.8GB)
DB load time & DB size
26/30
Evaluation (3/3)
Query Run-times Average run-times for cold caches (sec)
Average run-time for warm caches (sec)
Barton Yago LibraryThing
RDF-3X 5.9 0.7 0.89
MonetDB 26.4 78.2 8.16
PostgreSQL 167.8 10.6 93.90
Barton Yago LibraryThing
RDF-3X 0.4 0.04 0.13
MonetDB 4.8 54.60 4.39
PostgreSQL 64.3 0.56 30.40
27/30
Outline Introduction Background and State of the Art Storage and Indexing Query Processing and Optimization Selectivity Estimates Evaluation Conclusion
28/30
Conclusion RDF-3X is a fast and flexible RDF/SPARQL engine
– Exhaustive but very space-efficient triple indexes– Avoids physical design tuning, generic storage– Fast runtime system, query optimization has a huge impact
29/30