![Page 1: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/1.jpg)
Towards a Top-K SPARQL Query Benchmark Generator
Shima Zahmatkesh¹, Emanuele Della Valle¹, Daniele Dell’Aglio¹, and Alessandro Bozzon²
¹Politecnico di Milano, ²TU Delft
![Page 2: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/2.jpg)
Agenda
• Rankings, rankings everywhere
• What are top-k SPARQL queries
• Jim Gray's Benchmarking Principles
• The problem
• Some definitions
• Research hypothesis
• Background work: DBpedia SPARQL Benchmark
• Our proposal: Top-k DBPSB
• Preliminary evaluation
• Conclusions
![Page 3: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/3.jpg)
Rankings, rankings everywhere
![Page 4: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/4.jpg)
Rankings, rankings everywhere
![Page 5: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/5.jpg)
Rankings, rankings everywhere
![Page 6: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/6.jpg)
A very intuitive and simplified example:
• Top 3 largest countries (by both area and population)
Why do we need to optimize them?
![Page 7: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/7.jpg)
The standard way: materialize-then-sort scheme
• Countries: 242
• Compute the scoring function that accounts for area and population
• Sort all the 242 countries
• Fetch the 3 best results
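The materialize-then-sort scheme can be sketched in a few lines of Python. This is only an illustration: the function name `materialize_then_sort`, the toy country names and the already normalised (area, population) pairs are assumptions, not data from the slides.

```python
from typing import Callable, Dict, List, Tuple

def materialize_then_sort(
    items: Dict[str, Tuple[float, float]],
    score: Callable[[float, float], float],
    k: int,
) -> List[str]:
    """Score every item, sort the whole collection, then keep the first k."""
    # Materialize: the scoring function is evaluated for every single item.
    scored = {name: score(area, population) for name, (area, population) in items.items()}
    # Sort: all items are ordered, even though only k of them are returned.
    ranking = sorted(scored, key=scored.get, reverse=True)
    # Slice: fetch the k best results.
    return ranking[:k]

# Hypothetical toy data: (area, population), both already normalised to [0, 1].
toy_countries = {"A": (0.9, 0.3), "B": (0.5, 0.5), "C": (0.2, 0.2), "D": (1.0, 1.0)}
top2 = materialize_then_sort(toy_countries, lambda a, p: 0.5 * a + 0.5 * p, k=2)
```

The cost is dominated by scoring and sorting every item, which is why the scheme scales poorly when k is much smaller than the number of candidates.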
![Page 8: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/8.jpg)
Innovative optimization: Split-and-Interleave scheme
• Countries: 242
• Sorted access to countries ordered by population
• Incrementally order partial results by area
• Fetch the 3 best results
[Figure counts: 242, 9, 3]
![Page 9: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/9.jpg)
State of the art
• Database method
– Split the evaluation of the scoring function into single criteria
– Interleave them with other operators
– Use partial orders to incrementally construct the final order
• Standard assumptions:
– Monotone scoring function
– Each criterion is evaluated as a number in [0,1] (normalization)
• Optimized for the case of fast sorted access for each criterion
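To make the role of the assumptions concrete, here is a minimal sketch of a classic algorithm from this family, Fagin's Threshold Algorithm. It is used here as a stand-in and is not necessarily the exact operator the slides refer to. It assumes a monotone weighted sum, criteria normalised to [0,1], and that every item appears in every criterion; the function name and the toy values are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

def threshold_top_k(
    criteria: List[Dict[str, float]], weights: List[float], k: int
) -> Tuple[List[str], int]:
    """Return the k best items and the number of sorted-access rounds used."""
    if k < 1 or not criteria:
        return [], 0
    # Sorted access is simulated by pre-sorting; an engine would read an index instead.
    ranked = [sorted(c.items(), key=lambda kv: kv[1], reverse=True) for c in criteria]

    def score(name: str) -> float:
        # Random access: look the item up in every criterion.
        return sum(w * c[name] for w, c in zip(weights, criteria))

    seen: Dict[str, float] = {}
    rounds = 0
    for depth in range(len(ranked[0])):
        rounds = depth + 1
        frontier = []
        for lst in ranked:
            name, value = lst[depth]
            frontier.append(value)
            if name not in seen:
                seen[name] = score(name)
        # Monotonicity: no unseen item can score above the score of the frontier.
        threshold = sum(w * v for w, v in zip(weights, frontier))
        best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(best) == k and best[-1][1] >= threshold:
            break
    best = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
    return [name for name, _ in best], rounds

# Hypothetical normalised values for five items.
by_area = {"a": 1.0, "b": 0.8, "c": 0.6, "d": 0.1, "e": 0.05}
by_population = {"a": 0.9, "b": 1.0, "c": 0.2, "d": 0.3, "e": 0.1}
result = threshold_top_k([by_area, by_population], [0.5, 0.5], k=2)
```

On this toy input the loop stops after two rounds of sorted access, without ever reading the tail of either list. The early stop is only sound because the scoring function is monotone.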
![Page 10: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/10.jpg)
Top-k SPARQL queries
E.g., the 10 most recent books written by the youngest authors
SELECT ?book ?author
(0.5*norm(?releaseDate) +
0.5*norm(?dateOfBirth) AS ?s )
WHERE {
?book dbp:isbn ?v .
?book dbp:author ?author .
?book dbp:releaseDate ?releaseDate .
?author dbp:dateOfBirth ?dateOfBirth .
}
ORDER BY DESC(?s)
LIMIT 10
• Scoring function as a SELECT expression
• Normalization casts each value into [0..1]: $\mathrm{norm}(x) = \frac{x - \min_x}{\max_x - \min_x}$
• Order and slice 10 (ORDER BY + LIMIT)
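The normalization and the weighted sum can be written out as a small Python sketch. The helper names `norm` and `weighted_score`, the zero fallback for a degenerate range, and the year bounds in the example are assumptions for illustration.

```python
from typing import Sequence, Tuple

def norm(x: float, min_x: float, max_x: float) -> float:
    """Min-max normalization: map x from [min_x, max_x] into [0, 1]."""
    if max_x == min_x:
        # Degenerate range: every value is identical, so return 0 by convention.
        return 0.0
    return (x - min_x) / (max_x - min_x)

def weighted_score(
    values: Sequence[float],
    bounds: Sequence[Tuple[float, float]],
    weights: Sequence[float],
) -> float:
    """Weighted sum of the normalised criteria, e.g. 0.5*norm(a) + 0.5*norm(b)."""
    return sum(w * norm(v, lo, hi) for v, (lo, hi), w in zip(values, bounds, weights))

# Hypothetical example: release year 2000 and birth year 1980, both bounded by 1900..2000.
example_score = weighted_score([2000, 1980], [(1900, 2000), (1900, 2000)], [0.5, 0.5])
```

Because each normalised term lies in [0,1] and the weights sum to 1, the score itself stays in [0,1].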
![Page 11: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/11.jpg)
The Problem
Set up a benchmark for top-k SPARQL queries that
• Resembles reality
• Stresses the features of top-k queries
– Syntax: SELECT expression + ORDER BY + LIMIT
– Performance: hit SPARQL engines where it hurts
![Page 12: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/12.jpg)
Jim Gray on Benchmarking Principles
• Relevant: Measures performance and price/performance of systems when performing typical operations within the problem domain
• Portable: Easy to implement on many different systems
• Scalable: Applies to small and large computer systems
• Simple: understandable
Results
![Page 13: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/13.jpg)
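Definitions
E.g., the 10 most recent books written by the youngest authors
[Figure: query graph with edges ?book author ?author, ?book releaseDate ?releaseDate, ?author dateOfBirth ?birthDate]
• Rankable Variables: ?book, ?author
• Scoring Variables: ?releaseDate, ?birthDate
• Rankable Data Properties: releaseDate, dateOfBirth
• Rankable Triple Patterns: ?book releaseDate ?releaseDate; ?author dateOfBirth ?birthDate
• Scoring Function: 0.5*norm(?releaseDate) + 0.5*norm(?birthDate)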
![Page 14: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/14.jpg)
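Research Hypothesis
• H.0 top-k SPARQL queries that resemble reality can be obtained by extending the DBpedia SPARQL Benchmark
– H.1 ++ Rankable variable ++ execution time (the more rankable variables, the longer the execution time)
– H.2 ++ Scoring variable ++ execution time (the more scoring variables, the longer the execution time)
– H.3 +/- LIMIT = execution time (changing the LIMIT value does not change the execution time)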
![Page 15: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/15.jpg)
DBpedia SPARQL Benchmark
• A method to generate a SPARQL benchmark from DBpedia and its query logs
• It can be applied to other datasets and other query logs
• Characteristics
– Resemble reality
– Stress SPARQL features
[Figure: DBPSB pipeline with the stages Query Logs, Query Analysis and Clustering, Dataset generation, Auxiliary Queries, Query Templates, Query Instances]
![Page 16: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/16.jpg)
Proposed Solution: Top-k DBPSB
• An extension of DBPSB auxiliary queries with top-k clauses, using the DBPSB datasets as a source of meaningful rankable variables
• It is also a method
– Can be applied to other benchmarks obtained using the DBPSB method
[Figure: Top-k DBPSB pipeline: Auxiliary Queries → Find Rankable Variables → Compute Max and Min value → Generate Scoring Function → Generate Top-k queries → Top-k Queries]
![Page 17: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/17.jpg)
A DBPSB Auxiliary Query
SELECT DISTINCT ?v
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
}
![Page 18: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/18.jpg)
Top-k DBPSB step 1a
To generate queries with 1 rankable variable
SELECT ?p (COUNT(?p) AS ?n)
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
?v6 ?p ?o .
FILTER(isNumeric(?o) ||
datatype(?o)=xsd:dateTime) }
GROUP BY ?p
ORDER BY DESC(?n)
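Step 1a lends itself to automation: the triple patterns of an auxiliary query are reused and one open pattern plus a datatype filter are appended. The following Python sketch shows one way to build that query string. The function name `rankable_property_query` and its parameters are hypothetical, and the `GROUP BY ?p` clause is included because SPARQL 1.1 requires it when `?p` is projected next to an aggregate.

```python
from typing import List

def rankable_property_query(patterns: List[str], subject: str = "?v6") -> str:
    """Build a query that counts numeric or dateTime properties of `subject`."""
    lines = ["SELECT ?p (COUNT(?p) AS ?n)", "WHERE {"]
    # Reuse the triple patterns of the auxiliary query unchanged.
    lines += [f"  {pattern} ." for pattern in patterns]
    # Open pattern: any property of the subject, kept only if its value is sortable.
    lines.append(f"  {subject} ?p ?o .")
    lines.append("  FILTER(isNumeric(?o) || datatype(?o) = xsd:dateTime)")
    lines += ["}", "GROUP BY ?p", "ORDER BY DESC(?n)"]
    return "\n".join(lines)

# Triple patterns taken from the auxiliary query shown on the earlier slide.
aux_patterns = [
    "?v6 rdf:type ?v",
    "?v6 dbp:name ?v0",
    "?v6 dbp:pages ?v1",
    "?v6 dbp:isbn ?v2",
    "?v6 dbp:author ?v3",
]
discovery_query = rankable_property_query(aux_patterns)
```

The generated string still needs the usual PREFIX declarations for `rdf:`, `dbp:` and `xsd:` before it can be sent to an endpoint.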
![Page 19: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/19.jpg)
Top-k DBPSB step 1b
Results – not all sortable properties resemble reality
• Pages
• ISBN
• NumberOfPages
• Year
• Volume
• wikiPageID
• releaseDate
• …
NOTE: it requires manual selection
![Page 20: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/20.jpg)
Top-k DBPSB step 1c
To generate queries with 2 rankable variables
SELECT ?p ?p1 (COUNT(?p1) AS ?n)
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
?v6 ?p ?o .
?o ?p1 ?o1 .
FILTER(isNumeric(?o1) ||
datatype(?o1)=xsd:dateTime) }
GROUP BY ?p ?p1
ORDER BY DESC(?n)
NOTE: in practice we loop through all properties of ?v6 whose object is an IRI in decreasing frequency
![Page 21: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/21.jpg)
Top-k DBPSB step 1d
Results
• author, wikiPageID
• author, wikiPageRevisionID
• …
• author, dateOfBirth
• …
• publisher, wikiPageID
• publisher, wikiPageRevisionID
• …
• publisher, founded
• …
NOTE: it requires manual selection
![Page 22: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/22.jpg)
Top-k DBPSB step 2
SELECT (max(?o) as ?max) (min(?o) as ?min)
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
?v6 dbp:pages ?o .
FILTER(isNumeric(?o) ||
datatype(?o)=xsd:dateTime)
}
NOTE: the filter clause should not be necessary, but DBpedia is very dirty …
![Page 23: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/23.jpg)
Top-k DBPSB step 3
• Choose the number of rankable variables
– Max three
– E.g., books and authors
• Choose the number of scoring variables per rankable variable
– Max three
– E.g., releaseDate for books and dateOfBirth for authors
• Look up the min and the max of each scoring variable to normalise it
• Choose the weights
– The weights should sum to 1
• Assemble the scoring function
– E.g., 0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth)
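Since `norm` is not among the SPARQL 1.1 built-in functions, a generator has to expand it using the min and max values found in step 2. The Python sketch below assembles such an expression as a string. The function names, the weight check and the numeric bounds in the example are assumptions, and the sketch only covers numeric values such as years, because arithmetic on xsd:dateTime values is engine-specific.

```python
from typing import Sequence, Tuple

def norm_expr(var: str, min_v: float, max_v: float) -> str:
    """Expand norm(var) into plain arithmetic using the looked-up min and max."""
    if max_v == min_v:
        raise ValueError(f"cannot normalise {var}: min equals max")
    return f"(({var} - {min_v}) / ({max_v} - {min_v}))"

def scoring_function(
    variables: Sequence[str],
    bounds: Sequence[Tuple[float, float]],
    weights: Sequence[float],
) -> str:
    """Assemble a weighted sum of normalised scoring variables."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("the weights should sum to 1")
    terms = [
        f"{w}*{norm_expr(var, lo, hi)}"
        for var, (lo, hi), w in zip(variables, bounds, weights)
    ]
    return " + ".join(terms)

# Hypothetical bounds; in the pipeline they come from the step 2 max/min query.
expr = scoring_function(["?o1", "?o2"], [(0, 10), (1900, 2000)], [0.5, 0.5])
```

Inlining the bounds keeps the generated query portable, since it relies only on arithmetic operators that every SPARQL 1.1 engine supports.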
![Page 24: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/24.jpg)
Top-k DBPSB step 4
SELECT ?v6 ?v3
(0.5*norm(?o1) + 0.5*norm(?o2) AS ?s )
WHERE {
?v6 rdf:type ?v .
?v6 dbp:name ?v0 .
?v6 dbp:pages ?v1 .
?v6 dbp:isbn ?v2 .
?v6 dbp:author ?v3 .
?v6 dbp:releaseDate ?o1 .
?v3 dbp:dateOfBirth ?o2 .
FILTER(isNumeric(?o1) || datatype(?o1)=xsd:dateTime)
FILTER(isNumeric(?o2) || datatype(?o2)=xsd:dateTime)
}
ORDER BY DESC(?s)
LIMIT 10
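The final assembly step can be sketched as a small Python template function. It is a hypothetical helper rather than the authors' generator: the name `top_k_query`, its parameters and the LIMIT values used in the example are assumptions. It orders by descending score, as in the introductory books example.

```python
from typing import List

def top_k_query(
    select_vars: List[str],
    patterns: List[str],
    scoring_vars: List[str],
    score_expr: str,
    k: int,
) -> str:
    """Assemble a top-k query: SELECT expression + ORDER BY + LIMIT."""
    lines = ["SELECT " + " ".join(select_vars), f"  ({score_expr} AS ?s)", "WHERE {"]
    lines += [f"  {pattern} ." for pattern in patterns]
    # One filter per scoring variable guards against dirty, non-sortable values.
    lines += [
        f"  FILTER(isNumeric({var}) || datatype({var}) = xsd:dateTime)"
        for var in scoring_vars
    ]
    lines += ["}", "ORDER BY DESC(?s)", f"LIMIT {k}"]
    return "\n".join(lines)

patterns = [
    "?v6 rdf:type ?v",
    "?v6 dbp:author ?v3",
    "?v6 dbp:releaseDate ?o1",
    "?v3 dbp:dateOfBirth ?o2",
]
# Hypothetical LIMIT values, e.g. to vary k when looking at hypothesis H.3.
queries = [
    top_k_query(["?v6", "?v3"], patterns, ["?o1", "?o2"],
                "0.5*norm(?o1) + 0.5*norm(?o2)", k)
    for k in (1, 10, 100)
]
```

Generating the same query with several LIMIT values keeps everything else fixed, so any difference in execution time can be attributed to k alone.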
![Page 25: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/25.jpg)
Preliminary Results 1/2
• We tested our hypotheses using
– Virtuoso Open-Source Edition version 6.1.6
– Jena-TDB version 2.10.1
– DBpedia 10%
• In this setting, Top-k DBPSB generates queries
– adequate to test
• H.2 ++ Scoring variable ++ execution time
• H.3 +/- LIMIT = execution time
– only partially adequate to test
• H.1 ++ Rankable variable ++ execution time
![Page 26: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/26.jpg)
Preliminary Results 2/2
• H.1 ++ Rankable variable ++ execution time
– confirmed in some cases
– not confirmed when aggregating by query across engines
– confirmed when aggregating by engine across queries
• H.2 ++ Scoring variable ++ execution time
– confirmed for Jena TDB
– confirmed in most of the cases for Virtuoso
• H.3 +/- LIMIT = execution time
– confirmed for Jena TDB
– confirmed in most of the cases for Virtuoso
![Page 27: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/27.jpg)
Conclusions
• Top-k DBPSB is a successful first attempt to automatically generate top-k SPARQL queries that
– Resemble reality
– Hit SPARQL engines where it hurts
• More investigation is required
– Better understand the relationship between the number of rankable variables and the execution time
• E.g., cardinalities, selectivity and joins
– Include other known features of top-k queries that impact execution time
• E.g., correlation of the orders induced on the result set by the different scoring variables in the scoring function
• E.g., distribution of the values matched by the scoring variables
![Page 28: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/28.jpg)
Thank you! Any questions?
Shima Zahmatkesh¹, Emanuele Della Valle¹, Daniele Dell’Aglio¹, and Alessandro Bozzon²
¹Politecnico di Milano, ²TU Delft
![Page 29: Towards a Top-K SPARQL Query Benchmark Generator](https://reader030.vdocuments.us/reader030/viewer/2022032805/5681316f550346895d97ea2c/html5/thumbnails/29.jpg)
Preliminary Results - details