Post on 01-Jan-2016
Towards a Top-K SPARQL Query Benchmark Generator
Shima Zahmatkesh (1), Emanuele Della Valle (1), Daniele Dell’Aglio (1), and Alessandro Bozzon (2)
(1) Politecnico di Milano, (2) TU Delft
Agenda
• Rankings, rankings everywhere
• What are top-k SPARQL queries
• Jim Gray's Benchmarking Principles
• The problem
• Some definitions
• Research hypothesis
• Background work: DBpedia SPARQL Benchmark
• Our proposal: Top-k DBPSB
• Preliminary evaluation
• Conclusions
Rankings, rankings everywhere
Why do we need to optimize them?
A very intuitive and simplified example:
• Top 3 largest countries (by both area and population)
The standard way: materialize-then-sort scheme
• Start from all 242 countries
• Compute the scoring function that accounts for area and population
• Sort all the 242 countries
• Fetch the 3 best results
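A minimal Python sketch of the materialize-then-sort scheme may make its cost concrete: every item is scored and sorted even though only k results are returned. The function name, the country records and their values are hypothetical, already-normalized illustrations, not real statistics.

```python
def top_k_materialize_then_sort(items, score, k):
    """Score every item, sort the full list, then keep the k best."""
    scored = [(score(item), item) for item in items]     # materialize all scores
    scored.sort(key=lambda pair: pair[0], reverse=True)  # sort everything
    return [item for _, item in scored[:k]]              # slice the top k


# Hypothetical values, assumed to be normalized to [0, 1] already.
countries = [
    {"name": "A", "area": 0.9, "population": 0.2},
    {"name": "B", "area": 0.6, "population": 0.7},
    {"name": "C", "area": 0.1, "population": 0.9},
    {"name": "D", "area": 0.3, "population": 0.3},
]


def area_and_population(country):
    """Equal-weight scoring function over the two criteria."""
    return 0.5 * country["area"] + 0.5 * country["population"]


top3 = top_k_materialize_then_sort(countries, area_and_population, 3)
print([c["name"] for c in top3])
```

The work grows with the size of the whole input, not with k, which is what the optimized scheme tries to avoid.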
Innovative optimization: split-and-interleave scheme
• Sorted access to the 242 countries ordered by population
• Incrementally order partial results by area
• Fetch the 3 best results
State of the art: databases
• Method
– Split the evaluation of the scoring function into single criteria
– Interleave them with other operators
– Use partial orders to incrementally construct the final order
• Standard assumptions:
– Monotone scoring function
– Each criterion is evaluated as a number in [0,1] (normalization)
• Optimized for the case of fast sorted access for each criterion
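To illustrate why a monotone scoring function and sorted access per criterion allow early termination, here is a small Python sketch of a Fagin-style threshold algorithm. It is a stand-in chosen for brevity, not the operator-level split-and-interleave scheme from the slides; the function name and the data are hypothetical.

```python
def threshold_top_k(sorted_lists, weights, k):
    """Return (top-k list of (item, score), rows read from each list).

    sorted_lists: one list per criterion of (item, value) pairs, sorted by
    value in descending order, with values normalized to [0, 1].
    Assumes every item appears in every list and that the score is a
    weighted sum with non-negative weights (hence monotone).
    """
    tables = [dict(pairs) for pairs in sorted_lists]  # random access by item

    def score(item):
        return sum(w * table[item] for w, table in zip(weights, tables))

    seen = {}  # item -> full score
    top = []
    depth = 0
    for depth, row in enumerate(zip(*sorted_lists), start=1):
        for item, _value in row:
            if item not in seen:
                seen[item] = score(item)
        # Upper bound on the score of any item not seen so far.
        threshold = sum(w * value for w, (_item, value) in zip(weights, row))
        top = sorted(seen.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(top) == k and top[-1][1] >= threshold:
            break  # no unseen item can still enter the top k
    return top, depth


# Hypothetical normalized values for five items and two criteria.
by_area = [("A", 1.0), ("B", 0.8), ("C", 0.5), ("D", 0.2), ("E", 0.1)]
by_population = [("B", 0.9), ("A", 0.6), ("D", 0.5), ("C", 0.3), ("E", 0.0)]

result, rows_read = threshold_top_k([by_area, by_population], [0.5, 0.5], 2)
print([item for item, _ in result], rows_read)
```

In this toy input the top 2 are found after reading two rows from each list instead of all five, because the threshold drops below the score of the current second-best item.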
Top-k SPARQL queries
E.g., the 10 most recent books written by the youngest authors
SELECT ?book ?author
(0.5*norm(?releaseDate) +
0.5*norm(?dateOfBirth) AS ?s )
WHERE {
?book dbp:isbn ?v .
?book dbp:author ?author .
?book dbp:releaseDate ?releaseDate .
?author dbp:dateOfBirth ?dateOfBirth .
}
ORDER BY DESC(?s)
LIMIT 10
• Scoring function as a SELECT expression
• Normalization casts the value into [0,1]: $\mathrm{norm}(x) = \frac{x - \min_x}{\max_x - \min_x}$
• Order, then slice the first 10 results
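The min-max normalization and the weighted scoring function can be sketched in Python as follows. The helper names, the fallback for a constant criterion, and the sample bounds are assumptions made for illustration; norm is slide notation, not a built-in SPARQL function.

```python
from datetime import datetime


def norm(x, min_x, max_x):
    """Min-max normalization of x into [0, 1].

    Works for numbers and for datetime values, since dividing one
    timedelta by another yields a float.
    """
    if max_x == min_x:
        return 0.0  # assumption: a constant criterion contributes nothing
    return (x - min_x) / (max_x - min_x)


def weighted_score(values, bounds, weights):
    """Weighted sum of normalized criteria; weights are expected to sum to 1."""
    return sum(
        w * norm(x, lo, hi) for x, (lo, hi), w in zip(values, bounds, weights)
    )


# Hypothetical bounds: release year in [1900, 2010], birth year in [1900, 2000].
s = weighted_score([2010, 1950], [(1900, 2010), (1900, 2000)], [0.5, 0.5])
print(s)  # 0.75
```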
The Problem
Set up a benchmark for top-k SPARQL queries that
• Resembles reality
• Stresses the features of top-k queries
– Syntax: SELECT expression + ORDER BY + LIMIT
– Performance: hit the SPARQL engine where it hurts
Jim Gray on Benchmarking: Principles
• Relevant: Measures performance and price/performance of systems when performing typical operations within the problem domain
• Portable: Easy to implement on many different systems
• Scalable: Applies to small and large computer systems
• Simple: understandable
Results
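Definitions
E.g., the 10 most recent books written by the youngest authors
• Rankable Variables: ?book, ?author
• Scoring Variables: ?releaseDate, ?birthDate
• Rankable Data Properties: releaseDate, dateOfBirth
• Rankable Triple Patterns: ?book releaseDate ?releaseDate and ?author dateOfBirth ?birthDate
• Scoring Function: 0.5*norm(?releaseDate) + 0.5*norm(?birthDate)
[Figure: query graph in which ?book is linked to ?author by author and to ?releaseDate by releaseDate, and ?author is linked to ?birthDate by dateOfBirth]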
Research Hypothesis
• H.0: top-k SPARQL queries that resemble reality can be obtained by extending the DBpedia SPARQL Benchmark
– H.1: ++ rankable variables ⇒ ++ execution time
– H.2: ++ scoring variables ⇒ ++ execution time
– H.3: +/- LIMIT ⇒ = execution time
DBpedia SPARQL Benchmark
• A method to generate a SPARQL benchmark from DBpedia and its query logs
• It can be applied to other datasets and other query logs
• Characteristics
– Resembles reality
– Stresses SPARQL features
[Figure: DBPSB process, with the labels Query Logs, Query Analysis and Clustering, Dataset Generation, Auxiliary Queries, Query Templates, Query Instances]
Proposed Solution: Top-k DBPSB
• An extension of DBPSB
– Auxiliary queries with top-k clauses, using the DBPSB datasets as a source of meaningful rankable variables
• It is also a method
– Can be applied to other benchmarks obtained using the DBPSB method
[Figure: Top-k DBPSB process. Input: Auxiliary Queries. Steps: Find Rankable Variables, Compute Max and Min Values, Generate Scoring Function, Generate Top-k Queries. Output: Top-k Queries]
A DBPSB Auxiliary Query
SELECT DISTINCT ?v
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
}
Top-k DBPSB step 1a
To generate queries with 1 rankable variable
SELECT ?p (COUNT(?p) AS ?n)
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
?v6 ?p ?o .
FILTER(isNumeric(?o) || datatype(?o)=xsd:dateTime)
}
GROUP BY ?p
ORDER BY DESC(?n)
Top-k DBPSB step 1b
Results: not all sortable properties resemble reality
• Pages
• ISBN
• NumberOfPages
• Year
• Volume
• wikiPageID
• releaseDate
• …
NOTE: it requires manual selection
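The counting in step 1 can be mimicked offline. The Python sketch below tallies, per property, how many objects pass the same test as the FILTER (numeric or xsd:dateTime) and lists properties by decreasing frequency. The triple layout, the function name and the sample data are hypothetical, and the set of numeric datatypes is only a subset of what isNumeric accepts.

```python
from collections import Counter

XSD = "http://www.w3.org/2001/XMLSchema#"

# Subset of the datatypes accepted by isNumeric, plus xsd:dateTime.
RANKABLE_DATATYPES = {
    XSD + name
    for name in ("integer", "decimal", "float", "double", "int", "long", "dateTime")
}


def rankable_property_counts(triples):
    """Count properties whose object literal has a rankable datatype.

    triples: iterable of (subject, property, value, datatype_iri) tuples.
    Returns (property, count) pairs in decreasing order of count.
    """
    counts = Counter(
        prop for _s, prop, _value, datatype in triples
        if datatype in RANKABLE_DATATYPES
    )
    return counts.most_common()


# Hypothetical sample data.
sample = [
    ("book1", "dbp:pages", "320", XSD + "integer"),
    ("book2", "dbp:pages", "128", XSD + "integer"),
    ("book1", "dbp:releaseDate", "2001-05-01T00:00:00", XSD + "dateTime"),
    ("book1", "dbp:name", "Some title", XSD + "string"),
]
candidates = rankable_property_counts(sample)
print(candidates)
```

The output is only a candidate list: as the note says, someone still has to discard properties such as wikiPageID that are sortable but meaningless as ranking criteria.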
Top-k DBPSB step 1c
To generate queries with 2 rankable variables
SELECT ?p ?p1 (COUNT(?p1) AS ?n)
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
?v6 ?p ?o .
?o ?p1 ?o1 .
FILTER(isNumeric(?o1) ||
datatype(?o1)=xsd:dateTime) }
GROUP BY ?p ?p1
ORDER BY DESC(?n)
NOTE: in practice we loop through all properties of ?v6 whose object is an IRI in decreasing frequency
Top-k DBPSB step 1d
Results
• author, wikiPageID
• author, wikiPageRevisionID
• …
• author, dateOfBirth
• …
• publisher, wikiPageID
• publisher, wikiPageRevisionID
• …
• publisher, founded
• …
NOTE: it requires manual selection
Top-k DBPSB step 2
SELECT (max(?o) as ?max) (min(?o) as ?min)
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
?v6 dbp:pages ?o .
FILTER(isNumeric(?o) ||
datatype(?o)=xsd:dateTime)
}
NOTE: the filter clause should not be necessary, but DBpedia is very dirty …
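The same defensive idea can be shown outside SPARQL. This Python sketch computes the bounds needed for normalization while skipping values that do not parse as finite numbers; the function name and the sample values are hypothetical, and dates would need their own parsing.

```python
import math


def min_max(raw_values):
    """Return (min, max) over the values that parse as finite numbers."""
    clean = []
    for value in raw_values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # skip dirty entries such as "n/a" or None
        if math.isfinite(number):
            clean.append(number)
    if not clean:
        raise ValueError("no numeric values to compute bounds from")
    return min(clean), max(clean)


# Hypothetical dirty values for a pages-like property.
bounds = min_max(["320", "n/a", 128, None, "1024 pages", 96.0])
print(bounds)  # (96.0, 320.0)
```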
Top-k DBPSB step 3
• Choose the number of rankable variables
– Max three
– E.g., books and authors
• Choose the number of scoring variables per rankable variable
– Max three
– E.g., releaseDate for books and dateOfBirth for authors
• Look up the min and the max of each scoring variable to normalise it
• Choose the weights
– The sum of the weights should be 1
• Assemble the scoring function
– E.g., 0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth)
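Assembling the scoring expression from the chosen scoring variables and weights is mechanical, as this Python sketch suggests. The function name and the equal-weight default are assumptions; the slides only require that the weights sum to 1.

```python
def scoring_function(variables, weights=None):
    """Build a weighted-sum scoring expression in the slides' norm() notation."""
    if not variables:
        raise ValueError("at least one scoring variable is required")
    if weights is None:
        weights = [1 / len(variables)] * len(variables)  # assumed default
    if len(weights) != len(variables):
        raise ValueError("one weight per scoring variable is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    return " + ".join(
        f"{weight:g}*norm(?{name})" for weight, name in zip(weights, variables)
    )


expr = scoring_function(["releaseDate", "dateOfBirth"])
print(expr)  # 0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth)
```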
Top-k DBPSB step 4
SELECT ?v6 ?v3
(0.5*norm(?o1) + 0.5*norm(?o2) AS ?s )
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
?v6 dbp:releaseDate ?o1 .
?v3 dbp:dateOfBirth ?o2 .
FILTER(isNumeric(?o1) || datatype(?o1)=xsd:dateTime)
FILTER(isNumeric(?o2) || datatype(?o2)=xsd:dateTime)
}
ORDER BY DESC(?s)
LIMIT 10
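The final step amounts to string assembly over the auxiliary query. A possible Python sketch follows; the function name and parameter layout are hypothetical, norm() is kept as slide notation rather than a real SPARQL function, and the results are ordered by descending score as in the example query earlier in the deck.

```python
def build_top_k_query(select_vars, base_patterns, scoring_patterns, scoring_expr, k):
    """Assemble a top-k query from an auxiliary query's triple patterns.

    scoring_patterns: (triple_pattern, scoring_variable) pairs; each one gets
    the same numeric-or-dateTime FILTER used in the slides.
    """
    lines = [
        "SELECT " + " ".join(select_vars),
        f"({scoring_expr} AS ?s)",
        "WHERE {",
    ]
    lines += [f"{pattern} ." for pattern in base_patterns]
    for pattern, var in scoring_patterns:
        lines.append(f"{pattern} .")
        lines.append(f"FILTER(isNumeric({var}) || datatype({var})=xsd:dateTime)")
    lines += ["}", "ORDER BY DESC(?s)", f"LIMIT {k}"]
    return "\n".join(lines)


query = build_top_k_query(
    select_vars=["?v6", "?v3"],
    base_patterns=["?v6 rdf:type ?v", "?v6 dbp:author ?v3"],
    scoring_patterns=[
        ("?v6 dbp:releaseDate ?o1", "?o1"),
        ("?v3 dbp:dateOfBirth ?o2", "?o2"),
    ],
    scoring_expr="0.5*norm(?o1) + 0.5*norm(?o2)",
    k=10,
)
print(query)
```

Varying k, the number of scoring patterns, and the number of subjects they attach to yields the query families needed to probe H.1 to H.3.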
Preliminary Results 1/2
• We tested our hypotheses using
– Virtuoso Open-Source Edition version 6.1.6
– Jena-TDB version 2.10.1
– DBpedia 10%
• In this setting, Top-k DBPSB generates queries
– adequate to test
• H.2: ++ scoring variables ⇒ ++ execution time
• H.3: +/- LIMIT ⇒ = execution time
– only partially adequate to test
• H.1: ++ rankable variables ⇒ ++ execution time
Preliminary Results 2/2
• H.1: ++ rankable variables ⇒ ++ execution time
– confirmed in some cases
– not confirmed when aggregating by query across engines
– confirmed when aggregating by engine across queries
• H.2: ++ scoring variables ⇒ ++ execution time
– confirmed for Jena TDB
– confirmed in most of the cases for Virtuoso
• H.3: +/- LIMIT ⇒ = execution time
– confirmed for Jena TDB
– confirmed in most of the cases for Virtuoso
Conclusions
• Top-k DBPSB is a successful first attempt to automatically generate top-k SPARQL queries that
– Resemble reality
– Hit SPARQL engines where it hurts
• More investigation is required
– Better understand the relationship between the number of rankable variables and the execution time
• E.g., cardinalities, selectivity and joins
– Include other known features of top-k queries that impact execution time
• E.g., correlation of the orders induced on the result set by the different scoring variables in the scoring function
• E.g., distribution of the values matched by the scoring variables
Thank you! Any questions?
Preliminary Results - details