Towards a Top-K SPARQL Query Benchmark Generator

Shima Zahmatkesh1, Emanuele Della Valle1, Daniele Dell’Aglio1, and Alessandro Bozzon2

1Politecnico di Milano 2TU Delft

Agenda

• Rankings, rankings everywhere
• What are top-k SPARQL queries
• Jim Gray's Benchmarking Principles
• The problem
• Some Definitions
• Research Hypothesis
• Background work: DBpedia SPARQL Benchmark
• Our proposal: Top-k DBPSB
• Preliminary Evaluation
• Conclusions

Rankings, rankings everywhere

A very intuitive and simplified example:

• Top 3 largest countries (by both area and population)

Why do we need to optimize them?

The standard way: materialize-then-sort scheme

• Start from all 242 countries
• Compute the scoring function that accounts for area and population
• Sort all the 242 countries
• Fetch the 3 best results

Innovative optimization: split-and-interleave scheme

• Start from the 242 countries, with sorted access to countries ordered by population
• Incrementally order partial results by area
• Fetch the 3 best results

State-of-the-art database method

– Split the evaluation of the scoring function into single criteria
– Interleave them with other operators
– Use partial orders to incrementally construct the final order

• Standard assumptions:
– Monotone scoring function
– Each criterion is evaluated as a number in [0,1] (normalization)

• Optimized for the case of fast sorted access for each criterion
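The contrast between the two schemes can be sketched in Python. This is only an illustration, not the engine-level implementation the talk refers to: the interleaved variant is the classic threshold algorithm over per-criterion sorted lists, used here as a stand-in, and the item names and scores are made-up values assumed to be already normalised to [0,1].

```python
import heapq

# Hypothetical, already-normalised scores in [0, 1]; not real statistics.
AREA = {"A": 1.0, "B": 0.6, "C": 0.5, "D": 0.2, "E": 0.1}
POPULATION = {"A": 0.3, "B": 1.0, "C": 0.2, "D": 0.9, "E": 0.1}
WEIGHTS = (0.5, 0.5)


def score(area, population):
    """Monotone scoring function: a weighted sum of the two criteria."""
    return WEIGHTS[0] * area + WEIGHTS[1] * population


def materialize_then_sort(k):
    """Score every item, sort everything, then keep the best k."""
    scored = [(score(AREA[c], POPULATION[c]), c) for c in AREA]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


def threshold_top_k(k):
    """Interleave sorted accesses on each criterion and stop early."""
    by_area = sorted(AREA, key=lambda c: -AREA[c])
    by_pop = sorted(POPULATION, key=lambda c: -POPULATION[c])
    seen = {}
    depth = 0
    for depth, (a, p) in enumerate(zip(by_area, by_pop), start=1):
        for c in (a, p):
            # Random access: fetch the other criterion and score the item.
            seen[c] = score(AREA[c], POPULATION[c])
        # Upper bound on the score of any item not seen yet.
        threshold = score(AREA[a], POPULATION[p])
        best = heapq.nlargest(k, seen.values())
        if len(best) == k and best[-1] >= threshold:
            break
    top = sorted(((s, c) for c, s in seen.items()),
                 key=lambda pair: (-pair[0], pair[1]))
    return top[:k], depth
```

Both functions return the same top-k items, but the second one stops as soon as the k-th best score seen so far reaches the threshold, so it may never touch the tail of the lists. The early stop is only sound because the scoring function is monotone.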

Top-k SPARQL queries

E.g., the 10 most recent books written by the youngest authors

SELECT ?book ?author
       (0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth) AS ?s)
WHERE {
  ?book dbp:isbn ?v .
  ?book dbp:author ?author .
  ?book dbp:releaseDate ?releaseDate .
  ?author dbp:dateOfBirth ?dateOfBirth .
}
ORDER BY DESC(?s)
LIMIT 10

• Scoring function as a SELECT expression
• Normalization casts each value into [0..1]: $\mathrm{norm}(x) = \frac{x - \min_x}{\max_x - \min_x}$
• Order and slice the first 10 results
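A minimal Python sketch of the normalisation and of the example scoring function follows. It assumes that dates are compared through their ordinal day number and that a criterion with a single distinct value contributes 0; both choices, and the function names, are illustrative and are not stated in the slides.

```python
from datetime import date


def norm(x, min_x, max_x):
    """Min-max normalisation: maps x from [min_x, max_x] into [0, 1]."""
    if max_x == min_x:
        return 0.0  # assumption: a constant criterion contributes nothing
    return (x - min_x) / (max_x - min_x)


def date_norm(d, min_d, max_d):
    """Normalise a date through its ordinal day number (an assumption)."""
    return norm(d.toordinal(), min_d.toordinal(), max_d.toordinal())


def score(release, birth, release_range, birth_range):
    """0.5*norm(releaseDate) + 0.5*norm(dateOfBirth), as in the example query."""
    return (0.5 * date_norm(release, *release_range)
            + 0.5 * date_norm(birth, *birth_range))


# Hypothetical ranges, standing in for the min and max found in the dataset.
RELEASE_RANGE = (date(1900, 1, 1), date(2013, 1, 1))
BIRTH_RANGE = (date(1800, 1, 1), date(1990, 1, 1))
```

A book released on the latest date by the author with the latest birth date gets the maximum score of 1, which is why the query orders by descending score.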

The Problem

Set up a benchmark for top-k SPARQL queries that
• Resembles reality
• Stresses the features of top-k queries
– Syntax: SELECT expression + ORDER BY + LIMIT
– Performance: hit SPARQL engines where it hurts

Jim Gray on Benchmarking Principles

• Relevant: Measures performance and price/performance of systems when performing typical operations within the problem domain

• Portable: Easy to implement on many different systems

• Scalable: Applies to small and large computer systems

• Simple: understandable

Results

Definitions

E.g., the 10 most recent books written by the youngest authors

[Figure: the example query annotated with the terms below]

• Rankable Variables
• Scoring Variables
• Rankable Data Properties
• Rankable Triple Patterns
• Scoring Function: 0.5*norm(?releaseDate) + 0.5*norm(?birthDate)

Labels shown in the figure: releaseDate, dateOfBirth, ?author, ?releaseDate, ?birthDate

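Research Hypothesis

• H.0: top-k SPARQL queries that resemble reality can be obtained by extending the DBpedia SPARQL Benchmark
– H.1 ++ Rankable variable ++ execution time (more rankable variables, longer execution time)
– H.2 ++ Scoring variable ++ execution time (more scoring variables, longer execution time)
– H.3 +/- LIMIT = execution time (changing the LIMIT leaves the execution time unchanged)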

DBpedia SPARQL Benchmark

• A method to generate a SPARQL benchmark from DBpedia and its query logs
• It can be applied to other datasets and other query logs
• Characteristics
– Resembles reality
– Stresses SPARQL features

[Figure: the DBPSB workflow, with boxes labelled Query Logs, Query Analysis and Clustering, Dataset Generation, Auxiliary Queries, Query Templates, and Query Instances]

Proposed Solution: Top-k DBPSB

• An extension of DBPSB
– Auxiliary queries with top-k clauses, using the DBPSB datasets as a source of meaningful rankable variables
• It is also a method
– It can be applied to other benchmarks obtained using the DBPSB method

[Figure: the Top-k DBPSB workflow, with boxes labelled Find Rankable Variables, Auxiliary Queries, Compute Max and Min Value, Generate Scoring Function, Generate Top-k Queries, and Top-k Queries]

A DBPSB Auxiliary Query

SELECT DISTINCT ?v
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
}

Top-k DBPSB step 1a

To generate queries with 1 rankable variable

SELECT ?p (COUNT(?p) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 ?p ?o .
  FILTER(isNumeric(?o) || datatype(?o)=xsd:dateTime)
}
GROUP BY ?p
ORDER BY DESC(?n)
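What this query computes can be mimicked in plain Python over a handful of in-memory triples. The sketch below is an illustration under assumptions: the triples are invented, objects are represented as Python values rather than RDF literals, and `is_rankable` and `rankable_properties` are hypothetical names.

```python
from collections import Counter
from datetime import datetime
from numbers import Number

# Hypothetical triples standing in for the matches of the auxiliary query.
TRIPLES = [
    ("book1", "dbp:pages", 320),
    ("book1", "dbp:releaseDate", datetime(2001, 5, 1)),
    ("book1", "dbp:name", "A Title"),
    ("book2", "dbp:pages", 150),
    ("book2", "dbp:releaseDate", "unknown"),  # dirty value, filtered out
    ("book2", "dbp:author", "author7"),
]


def is_rankable(value):
    """Rough analogue of FILTER(isNumeric(?o) || datatype(?o)=xsd:dateTime)."""
    if isinstance(value, bool):
        return False  # bool is a Number in Python but not numeric in SPARQL
    return isinstance(value, (Number, datetime))


def rankable_properties(triples):
    """Count the orderable objects per property, most frequent first."""
    counts = Counter(p for _, p, o in triples if is_rankable(o))
    return counts.most_common()
```

On the sample data, `rankable_properties(TRIPLES)` lists `dbp:pages` before `dbp:releaseDate`, which corresponds to the grouping by property and the descending order on the count.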

Top-k DBPSB step 1b

Results: not all sortable properties resemble reality
• Pages
• ISBN
• NumberOfPages
• Year
• Volume
• wikiPageID
• releaseDate
• …

NOTE: it requires manual selection

Top-k DBPSB step 1c

To generate queries with 2 rankable variables

SELECT ?p ?p1 (COUNT(?p1) AS ?n)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 ?p ?o .
  ?o ?p1 ?o1 .
  FILTER(isNumeric(?o1) || datatype(?o1)=xsd:dateTime)
}
GROUP BY ?p ?p1
ORDER BY DESC(?n)

NOTE: in practice we loop through all properties of ?v6 whose object is an IRI, in decreasing order of frequency

Top-k DBPSB step 1d

Results
• author, wikiPageID
• author, wikiPageRevisionID
• …
• author, dateOfBirth
• …
• publisher, wikiPageID
• publisher, wikiPageRevisionID
• …
• publisher, founded
• …

NOTE: it requires manual selection

Top-k DBPSB step 2

SELECT (MAX(?o) AS ?max) (MIN(?o) AS ?min)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:pages ?o .
  FILTER(isNumeric(?o) || datatype(?o)=xsd:dateTime)
}

NOTE: the filter clause should not be necessary, but DBpedia is very dirty …
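The role of the filter can be illustrated with a small Python sketch that drops values of the wrong type before taking the minimum and maximum. The sample values and the helper name `min_max` are invented for the example.

```python
from datetime import datetime

# Hypothetical raw values of one property, including dirty entries.
PAGES = [320, "n/a", 150, None, 980]
RELEASE_DATES = [datetime(2001, 5, 1), "unknown", datetime(1999, 1, 1)]


def min_max(values, kind):
    """Return (min, max) over the values of the expected type, or None."""
    clean = [v for v in values
             if isinstance(v, kind) and not isinstance(v, bool)]
    if not clean:
        return None  # nothing orderable survived the filter
    return min(clean), max(clean)
```

Without the type check, comparing a number with a string would raise an error in Python, much as a dirty literal can spoil the aggregate in a query engine.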

Top-k DBPSB step 3

• Choose the number of rankable variables
– Max three
– E.g., books and authors
• Choose the number of scoring variables per rankable variable
– Max three
– E.g., releaseDate for books and dateOfBirth for authors
• Look up the min and the max of each scoring variable to normalise it
• Choose the weights
– The weights should sum to 1
• Assemble the scoring function
– E.g., 0.5*norm(?releaseDate) + 0.5*norm(?dateOfBirth)
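The assembly step can be sketched as a small Python helper that turns scoring variables and weights into the text of the scoring function. The function name and its checks are assumptions made for illustration; the slides do not show the generator's code.

```python
import math


def scoring_function(variables, weights):
    """Build the text of a weighted-sum scoring function over SPARQL variables."""
    if len(variables) != len(weights):
        raise ValueError("one weight per scoring variable is required")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("the weights should sum to 1")
    terms = [f"{w}*norm(?{v})" for v, w in zip(variables, weights)]
    return " + ".join(terms)
```

For example, `scoring_function(["releaseDate", "dateOfBirth"], [0.5, 0.5])` yields the expression used in the books-and-authors example.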

Top-k DBPSB step 4

SELECT ?v6 ?v3
       (0.5*norm(?o1) + 0.5*norm(?o2) AS ?s)
WHERE {
  ?v6 rdf:type ?v .
  ?v6 dbp:name ?v0 .
  ?v6 dbp:pages ?v1 .
  ?v6 dbp:isbn ?v2 .
  ?v6 dbp:author ?v3 .
  ?v6 dbp:releaseDate ?o1 .
  ?v3 dbp:dateOfBirth ?o2 .
  FILTER(isNumeric(?o1) || datatype(?o1)=xsd:dateTime)
  FILTER(isNumeric(?o2) || datatype(?o2)=xsd:dateTime)
}
ORDER BY DESC(?s)
LIMIT 10
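How such a query could be produced from its parts is sketched below in Python. This is a hypothetical template, not the generator used by the authors: `top_k_query` and its parameters are invented names, and the result is ordered by descending score, following the example query shown earlier in the deck.

```python
def top_k_query(patterns, select_vars, scoring, limit):
    """Assemble a top-k SPARQL query string from its building blocks."""
    body = "\n".join(f"  {p} ." for p in patterns)
    projection = " ".join(f"?{v}" for v in select_vars)
    return (
        f"SELECT {projection} ({scoring} AS ?s)\n"
        f"WHERE {{\n{body}\n}}\n"
        f"ORDER BY DESC(?s)\n"
        f"LIMIT {limit}"
    )


# Triple patterns taken from the auxiliary query plus the rankable ones.
PATTERNS = [
    "?v6 rdf:type ?v",
    "?v6 dbp:author ?v3",
    "?v6 dbp:releaseDate ?o1",
    "?v3 dbp:dateOfBirth ?o2",
]
QUERY = top_k_query(PATTERNS, ["v6", "v3"],
                    "0.5*norm(?o1) + 0.5*norm(?o2)", 10)
```

Varying only the `limit` argument, or the number of patterns and scoring terms, gives families of queries along the dimensions named in the hypotheses.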

Preliminary Results 1/2

• We tested our hypotheses using
– Virtuoso Open-Source Edition version 6.1.6
– Jena-TDB version 2.10.1
– DBpedia 10%
• In this setting, Top-k DBPSB generates queries that are
– adequate to test
  • H.2 ++ Scoring variable ++ execution time
  • H.3 +/- LIMIT = execution time
– only partially adequate to test
  • H.1 ++ Rankable variable ++ execution time

Preliminary Results 2/2

• H.1 ++ Rankable variable ++ execution time
– confirmed in some cases
– not confirmed aggregating by query across engines
– confirmed aggregating by engine across queries
• H.2 ++ Scoring variable ++ execution time
– confirmed for Jena TDB
– confirmed in most of the cases for Virtuoso
• H.3 +/- LIMIT = execution time
– confirmed for Jena TDB
– confirmed in most of the cases for Virtuoso

Conclusions

• Top-k DBPSB is a successful first attempt to automatically generate top-k SPARQL queries that
– Resemble reality
– Hit SPARQL engines where it hurts
• More investigation is required
– Better understand the relationship between the number of rankable variables and the execution time
  • E.g., cardinalities, selectivity and joins
– Include other known features of top-k queries that impact execution time
  • E.g., correlation of the orders induced on the result set by the different scoring variables in the scoring function
  • E.g., distribution of the values matched by the scoring variables

Thank you! Any questions?

Preliminary Results - details
