TRANSCRIPT

SPARQL Querying Benchmarks
Muhammad Saleem, Ivan Ermilov, Axel-Cyrille Ngonga Ngomo, Ricardo Usbeck, Michael Röder
https://sites.google.com/site/sqbenchmarks/
Tutorial at ISWC 2016, Kobe, Japan, 17/10/2016
Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, Germany
05/03/2023 1
Agenda
• Why benchmarks?
• Components and design principles
• Key features and choke points
• Centralized SPARQL benchmarks
• Federated SPARQL benchmarks
• Hands-on
• HOBBIT introduction

Session times: 9:00 – 10:00, 10:00 – 10:30, 10:30 – 12:00
Why Benchmarks?
• What tools can I use for my use case?
• Which tool best suits my use case, and why?
• Which measures are relevant?
• How do the existing engines behave?
• What are the limitations of the existing engines?
• How can existing engines be improved?
Benchmark Categories
• Micro benchmarks
  – Specialized, detailed, very focused, and easy to run
  – Neglect the larger picture; results are difficult to generalize
  – Do not use standardized metrics
  – Example: a join evaluation benchmark
• Standard benchmarks
  – Generalized and well defined, with standard metrics
  – Complicated to run; systems are often optimized for the benchmark
  – Example: Transaction Processing Performance Council (TPC) benchmarks
• Real-life applications
SPARQL Querying Benchmarks
• Centralized benchmarks
  – Centralized repositories
  – Queries span a single dataset
  – Real or synthetic
  – Examples: LUBM, SP2Bench, BSBM, WatDiv, DBPSB, FEASIBLE
• Federated benchmarks
  – Multiple interlinked datasets
  – Queries span multiple datasets
  – Real or synthetic
  – Examples: FedBench, LargeRDFBench
Querying Benchmark Components
• Datasets (real or synthetic)
• Queries (real or synthetic)
• Performance metrics
• Execution rules
Design Principles [L97]
• Relevant
• Understandable
• Good metrics
• Scalable
• Coverage
• Acceptance
• Repeatable
• Verifiable
Choke Points: Technological Challenges [BNE14]
• CP1: Aggregation performance
• CP2: Join performance
• CP3: Data access locality (materialized views)
• CP4: Expression calculation
• CP5: Correlated sub-queries
• CP6: Parallelism and concurrency
RDF Querying Benchmarks Choke Points [FK16]
• CP1: Join ordering
• CP2: Aggregation
• CP3: OPTIONAL and nested OPTIONAL clauses
• CP4: Reasoning
• CP5: Parallel execution of UNIONs
• CP6: FILTERs
• CP7: Ordering
• CP8: Geo-spatial predicates
• CP9: Full-text search
• CP10: Duplicate elimination
• CP11: Complex FILTER conditions
SPARQL Queries as Directed Labelled Hypergraphs (DLH) [SNM15]

SELECT ?president ?party ?page
WHERE {
  ?president rdf:type dbpedia:President .
  ?president dbpedia:nationality dbpedia:United_States .
  ?president dbpedia:party ?party .
  ?x nyt:topicPage ?page .
  ?x owl:sameAs ?president .
}

[Figure, built up over several slides: each triple pattern of the query becomes a hyperedge connecting its subject, predicate, and object vertices; ?president and ?x emerge as join vertices. The slides label vertices as star, simple, or hybrid, and mark the tail of each hyperedge.]
Key SPARQL Query Characteristics
FEASIBLE [SNM15], WatDiv [AHO+14], and LUBM [GPH05] identified:
• Query forms: SELECT, DESCRIBE, ASK, CONSTRUCT
• Constructs: UNION, DISTINCT, ORDER BY, REGEX, LIMIT, FILTER, OPTIONAL, GROUP BY, negation
• Features: result size, number of BGPs, number of triple patterns, number of join vertices, mean join vertex degree, mean triple pattern selectivity, join selectivity, query runtime, unbound predicates
Centralized SPARQL Querying Benchmarks
Lehigh University Benchmark (LUBM) [GPH05]
• Synthetic RDF benchmark
• Tests the reasoning capabilities of triple stores
• Synthetic university-domain data generator
• 15 SPARQL 1.0 queries
• Query design criteria: input size, selectivity, complexity, logical inferencing
• Performance metrics: load time, repository size, query runtime, query completeness and soundness, combined metric (runtime + completeness + soundness)
LUBM Queries Choke Points [FK16]
[Table omitted in transcript: per-query (Q1–Q14) coverage of choke points CP1–CP11. The LUBM queries mainly exercise join ordering (CP1) and reasoning (CP4).]
LUBM Queries Characteristics [SNM15]
Queries: 15
Query forms: SELECT 100.00%, ASK 0.00%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL constructs: UNION 0.00%, DISTINCT 0.00%, ORDER BY 0.00%, REGEX 0.00%, LIMIT 0.00%, OFFSET 0.00%, OPTIONAL 0.00%, FILTER 0.00%, GROUP BY 0.00%
Result size: min 3, max 1.39E+04, mean 4.96E+03, S.D. 1.14E+04
BGPs: min 1, max 1, mean 1, S.D. 0
Triple patterns: min 1, max 6, mean 3, S.D. 1.8126539
Join vertices: min 0, max 4, mean 1.6, S.D. 1.4040757
Mean join vertex degree: min 0, max 5, mean 2.0222222, S.D. 1.2999796
Mean triple pattern selectivity: min 0.0003212, max 0.432, mean 0.01, S.D. 0.0745
Query runtime (ms): min 2, max 3200, mean 437.675, S.D. 320.34
SP2Bench [SHM+09]
• Synthetic RDF triple store benchmark
• DBLP-based bibliographic synthetic data generator
• 12 SPARQL 1.0 queries
• Query design criteria: SELECT and ASK query forms; covers the majority of SPARQL constructs
• Performance metrics: load time, per-query runtime, arithmetic and geometric mean of overall query runtimes, memory consumption
SP2Bench Queries Choke Points [FK16]
[Table omitted in transcript: per-query (Q1–Q12) coverage of choke points CP1–CP11. The SP2Bench queries mainly exercise join ordering (CP1), FILTERs (CP6), and duplicate elimination (CP10).]
SP2Bench Queries Characteristics [SNM15]
Queries: 12
Query forms: SELECT 91.67%, ASK 8.33%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL constructs: UNION 16.67%, DISTINCT 41.67%, ORDER BY 16.67%, REGEX 0.00%, LIMIT 8.33%, OFFSET 8.33%, OPTIONAL 25.00%, FILTER 58.33%, GROUP BY 0.00%
Result size: min 1, max 4.34E+07, mean 4.55E+06, S.D. 1.37E+07
BGPs: min 1, max 3, mean 1.5, S.D. 0.67419986
Triple patterns: min 1, max 13, mean 5.91666667, S.D. 3.82475985
Join vertices: min 0, max 10, mean 4.25, S.D. 3.79293602
Mean join vertex degree: min 0, max 9, mean 2.41342593, S.D. 2.26080826
Mean triple pattern selectivity: min 6.5597E-05, max 0.53980613, mean 0.22180428, S.D. 0.20831387
Query runtime (ms): min 7, max 7.13E+05, mean 2.83E+05, S.D. 5.26E+05
Berlin SPARQL Benchmark (BSBM) [BS09]
• Synthetic RDF triple store benchmark
• E-commerce use case synthetic data generator
• 20 queries
  – 12 SPARQL 1.0 queries for the explore and explore-and-update use cases
  – 8 SPARQL 1.1 analytical queries for the business intelligence use case
• Query design criteria: SELECT, DESCRIBE, and CONSTRUCT query forms; covers the majority of SPARQL constructs
• Performance metrics: load time, Query Mixes per Hour (QMpH), Queries per Second (QpS)
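QMpH and QpS are throughput metrics: how many complete query mixes, or individual queries, a store answers in a fixed wall-clock interval. A minimal sketch of the arithmetic (the function names and scaling convention are mine, not from the BSBM specification):

```python
def queries_per_second(num_queries, elapsed_seconds):
    # QpS: how many executions of a query (or query type) completed per second.
    return num_queries / elapsed_seconds

def query_mixes_per_hour(num_mixes, elapsed_seconds):
    # QMpH: complete query mixes executed, scaled up to one hour.
    return num_mixes * 3600.0 / elapsed_seconds
```

For example, 50 mixes finished in 30 minutes correspond to a QMpH of 100.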
BSBM Queries Choke Points [FK16]
[Table omitted in transcript: per-query (Q1–Q12) coverage of choke points CP1–CP11. The BSBM queries mainly exercise join ordering (CP1), FILTERs (CP6), and result ordering (CP7).]
BSBM Queries Characteristics [SNM15]
Queries: 20
Query forms: SELECT 80.00%, ASK 0.00%, CONSTRUCT 4.00%, DESCRIBE 16.00%
Important SPARQL constructs: UNION 8.00%, DISTINCT 24.00%, ORDER BY 36.00%, REGEX 0.00%, LIMIT 36.00%, OFFSET 4.00%, OPTIONAL 52.00%, FILTER 52.00%, GROUP BY 0.00%
Result size: min 0, max 31, mean 8.312, S.D. 9.0308
BGPs: min 1, max 5, mean 2.8, S.D. 1.7039
Triple patterns: min 1, max 15, mean 9.32, S.D. 5.18
Join vertices: min 0, max 6, mean 2.88, S.D. 1.8032
Mean join vertex degree: min 0, max 4.5, mean 3.05, S.D. 1.6375
Mean triple pattern selectivity: min 9E-08, max 0.0453, mean 0.0105, S.D. 0.0142
Query runtime (ms): min 5, max 99, mean 9.1, S.D. 14.564
DBpedia SPARQL Benchmark (DBPSB) [MLA+14]
• Real benchmark generation framework based on
  – the DBpedia dataset in different sizes
  – DBpedia query log mining
• Clustering of the log queries:
  1. Name the variables in the triple patterns
  2. Select the frequently executed queries
  3. Remove SPARQL keywords and prefixes
  4. Compute query similarity using Levenshtein string matching
  5. Compute query clusters using a soft graph clustering algorithm [NS09]
  6. Derive query templates (most frequently asked, covering the most SPARQL constructs) from the clusters with > 5 queries
  7. Generate any number of queries from the query templates
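The similarity step above relies on Levenshtein string matching. A self-contained sketch of the classic dynamic-programming edit distance, plus one common way to turn it into a similarity in [0, 1] (the normalization is my assumption, not necessarily the one DBPSB uses):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Normalize by the longer string: identical queries score 1.0.
    longest = max(len(a), len(b), 1)
    return 1.0 - levenshtein(a, b) / longest
```

With keywords and prefixes stripped, two log queries differing only in a literal or IRI end up with a high similarity and land in the same cluster.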
DBPSB Queries Features
• Number of triple patterns: tests the efficiency of join operations (CP1)
• SPARQL UNION and OPTIONAL constructs: handling of the parallel execution of UNIONs (CP5)
• Solution sequences and modifiers (DISTINCT): efficiency of duplicate elimination (CP10)
• Filter conditions and operators (FILTER, LANG, REGEX, STR): efficiency of executing filters as early as possible (CP6)
DBPSB Limitations
• Queries are based on only 25 templates
• Features such as number of join vertices, join vertex degree, triple pattern selectivities, or query execution times are not considered
• Only SPARQL SELECT queries are considered
• Not customizable for given use cases or the needs of an application
Recall: Key SPARQL Query Characteristics
FEASIBLE [SNM15], WatDiv [AHO+14], and LUBM [GPH05] identified:
• Query forms: SELECT, DESCRIBE, ASK, CONSTRUCT
• Constructs: UNION, DISTINCT, ORDER BY, REGEX, LIMIT, FILTER, OPTIONAL, GROUP BY, negation
• Features: result size, number of BGPs, number of triple patterns, number of join vertices, mean join vertex degree, mean triple pattern selectivity, join selectivity, query runtime, unbound predicates
DBPSB Queries Characteristics [SNM15]
Queries (from 25 templates): 125
Query forms: SELECT 100%, ASK 0%, CONSTRUCT 0%, DESCRIBE 0%
Important SPARQL constructs: UNION 36%, DISTINCT 100%, ORDER BY 0%, REGEX 4%, LIMIT 0%, OFFSET 0%, OPTIONAL 32%, FILTER 48%, GROUP BY 0%
Result size: min 197, max 4.62E+06, mean 3.24E+05, S.D. 9.56E+05
BGPs: min 1, max 9, mean 2.695652, S.D. 2.438979
Triple patterns: min 1, max 12, mean 4.521739, S.D. 2.79398
Join vertices: min 0, max 3, mean 1.217391, S.D. 1.126399
Mean join vertex degree: min 0, max 5, mean 1.826087, S.D. 1.435022
Mean triple pattern selectivity: min 1.19E-05, max 1, mean 0.119288, S.D. 0.226966
Query runtime (ms): min 11, max 5.40E+04, mean 1.07E+04, S.D. 1.73E+04
Waterloo SPARQL Diversity Test Suite (WatDiv) [AHO+14]
• Synthetic benchmark
  – Synthetic data generator
  – Synthetic query generator
• User-controlled data generator
  – Entities to include
  – Structuredness [DKS+11] of the dataset
  – Probability of entity associations
  – Cardinality of property associations
• Query design criteria
  – Structural query features
  – Data-driven query features
WatDiv Query Design Criteria
• Structural features
  – Number of triple patterns
  – Join vertex count
  – Join vertex degree
• Data-driven features
  – Result size
  – (Filtered) Triple Pattern (f-TP) selectivity
  – BGP-restricted f-TP selectivity
  – Join-restricted f-TP selectivity
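The simplest of these data-driven features, plain triple pattern selectivity, is the fraction of dataset triples matching a pattern. A toy sketch (the four-triple dataset is hypothetical, purely for illustration; the restricted f-TP variants additionally condition on the enclosing BGP or join):

```python
def tp_selectivity(pattern, triples):
    """Fraction of triples matching a pattern; '?'-prefixed terms are variables."""
    def matches(term, value):
        return term.startswith("?") or term == value
    hits = sum(all(matches(t, v) for t, v in zip(pattern, triple))
               for triple in triples)
    return hits / len(triples)

# Hypothetical toy dataset of four triples.
triples = [
    ("ex:a", "rdf:type", "ex:Product"),
    ("ex:b", "rdf:type", "ex:Product"),
    ("ex:a", "ex:price", "10"),
    ("ex:b", "ex:label", "widget"),
]
```

Here the pattern `?s rdf:type ex:Product` matches two of the four triples, giving a selectivity of 0.5.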
WatDiv Queries Generation
• Query template generator
  – User-specified number of templates
  – User-specified template characteristics
• Query generator
  – Instantiates the query templates with terms (IRIs, literals, etc.) from the RDF dataset
  – User-specified number of queries produced
WatDiv Queries Characteristics [SNM15]
Query templates: 125
Query forms: SELECT 100.00%, ASK 0.00%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL constructs: UNION 0.00%, DISTINCT 0.00%, ORDER BY 0.00%, REGEX 0.00%, LIMIT 0.00%, OFFSET 0.00%, OPTIONAL 0.00%, FILTER 0.00%, GROUP BY 0.00%
Result size: min 0, max 4.17E+09, mean 3.49E+07, S.D. 3.73E+08
BGPs: min 1, max 1, mean 1, S.D. 0
Triple patterns: min 1, max 12, mean 5.328, S.D. 2.60823
Join vertices: min 0, max 5, mean 1.776, S.D. 0.9989
Mean join vertex degree: min 0, max 7, mean 3.62427, S.D. 1.40647
Mean triple pattern selectivity: min 0, max 0.01176, mean 0.00494, S.D. 0.00239
Query runtime (ms): min 3, max 8.82E+08, mean 4.41E+08, S.D. 2.77E+07
FEASIBLE: Benchmark Generation Framework [SNM15]
• Customizable benchmark generation framework
• Generates real benchmarks from query logs
• Can be applied to any SPARQL query log
• Customizable for given use cases or the needs of an application
FEASIBLE Queries Selection Criteria
• Query forms: SELECT, DESCRIBE, ASK, CONSTRUCT
• Constructs: UNION, DISTINCT, ORDER BY, REGEX, LIMIT, FILTER, OPTIONAL, GROUP BY, negation
• Features: result size, number of BGPs, number of triple patterns, number of join vertices, mean join vertex degree, mean triple pattern selectivity, join selectivity, query runtime, unbound predicates
FEASIBLE: Benchmark Generation Framework
• Dataset cleaning
• Feature vectors and normalization
• Selection of exemplars
• Selection of benchmark queries
Feature Vectors and Normalization

SELECT DISTINCT ?entita ?nome
WHERE {
  ?entita rdf:type dbo:VideoGame .
  ?entita rdfs:label ?nome
  FILTER regex(?nome, "konami", "i")
}
LIMIT 100

Query type: SELECT; result size: 13; basic graph patterns (BGPs): 1; triple patterns: 2; join vertices: 1; mean join vertex degree: 2.0; mean triple pattern selectivity: 0.01709761619798973; UNION: no; DISTINCT: yes; ORDER BY: no; REGEX: yes; LIMIT: yes; OFFSET: no; OPTIONAL: no; FILTER: yes; GROUP BY: no; runtime (ms): 65

Feature vector: 13 1 2 1 2 0.017 0 1 0 1 1 0 0 1 0 65
Normalized feature vector: 0.11 0.53 0.67 0.14 0.08 0.017 0 1 0 1 1 0 0 1 0 0.14
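Normalization scales each raw feature into [0, 1] relative to the values observed across the whole query log, so that large-range features such as result size do not dominate the distance computations that follow. A minimal min-max sketch (the runtime range 2–452 ms is a hypothetical example, not taken from the actual DBpedia log):

```python
def min_max(value, lo, hi):
    """Scale a raw feature value into [0, 1] given the min/max seen in the log."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)

# Hypothetical example: a runtime of 65 ms when the log's runtimes span 2-452 ms
# normalizes to 0.14, matching the last entry of the vector above.
normalized_runtime = min_max(65, 2, 452)
```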
FEASIBLE: Exemplar Selection Walkthrough

Plot the feature vectors in a multidimensional space (simplified here to two features):

Query  F1    F2
Q1     0.2   0.2
Q2     0.5   0.3
Q3     0.8   0.3
Q4     0.9   0.1
Q5     0.5   0.5
Q6     0.2   0.7
Q7     0.1   0.8
Q8     0.13  0.65
Q9     0.9   0.5
Q10    0.1   0.5

Suppose we need a benchmark of 3 queries.
1. Calculate the average point of all queries.
2. Select the point with minimum Euclidean distance to the average point; this is our first exemplar.
3. Select the point that is farthest from the exemplars chosen so far; repeat until there are 3 exemplars.
4. Assign each remaining query to its nearest exemplar. For example, Q1's distances to the three exemplars are 0.60, 0.42, and 0.70, so Q1 is assigned to the exemplar at distance 0.42. Repeat the process for Q2, Q3, Q6, Q8, Q9, and Q10.
5. Calculate the average point of each cluster, and the distance of each point in the cluster to that average.
6. From each cluster, select the query with minimum distance to the cluster average as the final benchmark query: Q2 from the first cluster, Q3 from the second, and Q8 from the third.

Our benchmark queries are Q2, Q3, and Q8.

[The accompanying slides animate these steps on a 2-D scatter plot; only the captions are reproduced here.]
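The walkthrough above can be sketched end to end in a few lines. This is an illustrative reconstruction of the selection procedure as presented on the slides, not the actual FEASIBLE code; it uses squared Euclidean distances (rounded to sidestep floating-point noise in ties) and the two-feature points from the example.

```python
# Feature vectors (two features, F1 and F2) from the walkthrough.
points = {
    "Q1": (0.2, 0.2), "Q2": (0.5, 0.3), "Q3": (0.8, 0.3), "Q4": (0.9, 0.1),
    "Q5": (0.5, 0.5), "Q6": (0.2, 0.7), "Q7": (0.1, 0.8), "Q8": (0.13, 0.65),
    "Q9": (0.9, 0.5), "Q10": (0.1, 0.5),
}

def dist2(a, b):
    # Squared Euclidean distance; rounding avoids float noise when breaking ties.
    return round((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2, 9)

def select_benchmark(points, k):
    names = list(points)
    # 1. First exemplar: the point closest to the average of all points.
    avg = tuple(sum(p[i] for p in points.values()) / len(points) for i in (0, 1))
    exemplars = [min(names, key=lambda q: dist2(points[q], avg))]
    # 2. Remaining exemplars: repeatedly take the point farthest from the
    #    exemplars chosen so far (maximin distance).
    while len(exemplars) < k:
        rest = [q for q in names if q not in exemplars]
        exemplars.append(max(rest, key=lambda q: min(dist2(points[q], points[e])
                                                     for e in exemplars)))
    # 3. Assign every query to its nearest exemplar's cluster.
    clusters = {e: [] for e in exemplars}
    for q in names:
        clusters[min(exemplars, key=lambda e: dist2(points[q], points[e]))].append(q)
    # 4. From each cluster, keep the query closest to the cluster average.
    selected = []
    for members in clusters.values():
        c_avg = tuple(sum(points[q][i] for q in members) / len(members) for i in (0, 1))
        selected.append(min(members, key=lambda q: dist2(points[q], c_avg)))
    return selected
```

On this data the sketch reproduces the slides' outcome: the selected benchmark queries are Q2, Q3, and Q8.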
Comparison of Composite Error
FEASIBLE's composite error is 54.9% less than that of DBPSB.
Rank-wise Ranking of Triple Stores
(All values are percentages.)
• No system is the sole winner or loser for a particular rank
• Virtuoso mostly lies in the higher ranks, i.e., ranks 1 and 2 (68.29%)
• Fuseki lies mostly in the middle ranks, i.e., ranks 2 and 3 (65.14%)
• OWLIM-SE is usually on the slower side, i.e., ranks 3 and 4 (60.86%)
• Sesame is either fast or slow: rank 1 (31.71% of the queries) and rank 4 (23.14%)
FEASIBLE (DBpedia) Queries Characteristics [SNM15]
Queries: 125
Query forms: SELECT 95.20%, ASK 0.00%, CONSTRUCT 4.00%, DESCRIBE 0.80%
Important SPARQL constructs: UNION 40.80%, DISTINCT 52.80%, ORDER BY 28.80%, REGEX 14.40%, LIMIT 38.40%, OFFSET 18.40%, OPTIONAL 30.40%, FILTER 58.40%, GROUP BY 0.80%
Result size: min 1, max 1.41E+06, mean 52183, S.D. 1.97E+05
BGPs: min 1, max 14, mean 3.176, S.D. 3.55841574
Triple patterns: min 1, max 18, mean 4.88, S.D. 4.396846377
Join vertices: min 0, max 11, mean 1.296, S.D. 2.39294662
Mean join vertex degree: min 0, max 11, mean 1.44906666, S.D. 2.13246612
Mean triple pattern selectivity: min 2.86693E-09, max 1, mean 0.140214337, S.D. 0.31899488
Query runtime (ms): min 2, max 3.22E+04, mean 2242.6, S.D. 6961.99191
FEASIBLE (SWDF) Queries Characteristics [SNM15]
Queries: 125
Query forms: SELECT 92.80%, ASK 2.40%, CONSTRUCT 3.20%, DESCRIBE 1.60%
Important SPARQL constructs: UNION 32.80%, DISTINCT 50.40%, ORDER BY 25.60%, REGEX 16.00%, LIMIT 45.60%, OFFSET 20.80%, OPTIONAL 32.00%, FILTER 29.60%, GROUP BY 19.20%
Result size: min 1, max 3.01E+05, mean 9091.512, S.D. 4.70E+04
BGPs: min 0, max 14, mean 2.688, S.D. 2.812460752
Triple patterns: min 0, max 14, mean 3.232, S.D. 2.76246734
Join vertices: min 0, max 3, mean 0.52, S.D. 0.65500554
Mean join vertex degree: min 0, max 4, mean 0.968, S.D. 1.09202386
Mean triple pattern selectivity: min 1.06097E-05, max 1, mean 0.29192835, S.D. 0.325138601
Query runtime (ms): min 4, max 4.13E+04, mean 1308.832, S.D. 5335.44123
Other Useful Benchmarks
• Semantic Publishing Benchmark (SPB)
• UniProt [RU09][UniProtKB]
• YAGO (Yet Another Great Ontology) [SKW07]
• Barton Library [Barton]
• Linked Sensor Dataset [PHS10]
• WordNet [WordNet]
• Publishing TPC-H as RDF [TPC-H]
• Apples and Oranges [DKS+11]
Summary of the centralized SPARQL querying benchmarks
Centralized SPARQL Querying Benchmarks Summary [SNM15]

            LUBM     BSBM    SP2Bench  WatDiv   DBPSB  FEASIBLE(DBpedia)  DBpediaLog  FEASIBLE(SWDF)  SWDFLog
Queries     15       125     12        125      125    125                130466      125             64030
SELECT      100.00%  80.00%  91.67%    100.00%  100%   95.20%             97.96%      92.80%          58.71%
ASK         0.00%    0.00%   8.33%     0.00%    0%     0.00%              1.93%       2.40%           0.09%
CONSTRUCT   0.00%    4.00%   0.00%     0.00%    0%     4.00%              0.09%       3.20%           0.04%
DESCRIBE    0.00%    16.00%  0.00%     0.00%    0%     0.80%              0.02%       1.60%           41.17%
Centralized SPARQL Querying Benchmarks Summary [SNM15] (continued)

           LUBM   BSBM    SP2Bench  WatDiv  DBPSB  FEASIBLE(DBpedia)  DBpediaLog  FEASIBLE(SWDF)  SWDFLog
UNION      0.00%  8.00%   16.67%    0.00%   36%    40.80%             7.97%       32.80%          29.32%
DISTINCT   0.00%  24.00%  41.67%    0.00%   100%   52.80%             4.16%       50.40%          34.18%
ORDER BY   0.00%  36.00%  16.67%    0.00%   0%     28.80%             0.30%       25.60%          10.67%
REGEX      0.00%  0.00%   0.00%     0.00%   4%     14.40%             0.21%       16.00%          0.03%
LIMIT      0.00%  36.00%  8.33%     0.00%   0%     38.40%             0.40%       45.60%          1.79%
OFFSET     0.00%  4.00%   8.33%     0.00%   0%     18.40%             0.03%       20.80%          0.14%
OPTIONAL   0.00%  52.00%  25.00%    0.00%   32%    30.40%             20.11%      32.00%          29.52%
FILTER     0.00%  52.00%  58.33%    0.00%   48%    58.40%             93.38%      29.60%          0.72%
GROUP BY   0.00%  0.00%   0.00%     0.00%   0%     0.80%              7.66E-06    19.20%          1.34%
Centralized SPARQL Querying Benchmarks Summary [SNM15] (continued)

                 LUBM      BSBM    SP2Bench    WatDiv    DBPSB     FEASIBLE(DBpedia)  DBpediaLog  FEASIBLE(SWDF)  SWDFLog
Result size
  Min            3         0       1           0         197       1                  1           1               1
  Max            1.39E+04  31      4.34E+07    4.17E+09  4.62E+06  1.41E+06           1.41E+06    3.01E+05        3.01E+05
  Mean           4.96E+03  8.312   4.55E+06    3.49E+07  3.24E+05  52183              404.000307  9091.512        39.5068
  S.D.           1.14E+04  9.0308  1.37E+07    3.73E+08  9.56E+05  1.97E+05           12932.2472  4.70E+04        2208.7
BGPs
  Min            1         1       1           1         1         1                  0           0               0
  Max            1         5       3           1         9         14                 14          14              14
  Mean           1         2.8     1.5         1         2.695652  3.176              1.67629114  2.688           2.28603
  S.D.           0         1.7039  0.67419986  0         2.438979  3.55841574         1.66075812  2.81246075      2.94057
Triple patterns
  Min            1         1       1           1         1         1                  0           0               0
  Max            6         15      13          12        12        18                 18          14              14
  Mean           3         9.32    5.91666667  5.328     4.521739  4.88               1.7062683   3.232           2.50928
  S.D.           1.812653  5.18    3.82475985  2.60823   2.79398   4.396846377        1.68639622  2.76246734      3.21393
Join vertices
  Min            0         0       0           0         0         0                  0           0               0
  Max            4         6       10          5         3         11                 11          3               3
  Mean           1.6       2.88    4.25        1.776     1.217391  1.296              0.02279521  0.52            0.18076
  S.D.           1.40407   1.8032  3.79293602  0.9989    1.126399  2.392946625        0.23381101  0.65500554      0.45669
Centralized SPARQL Querying Benchmarks Summary [SNM15] (continued)

                 LUBM     BSBM    SP2Bench   WatDiv    DBPSB     FEASIBLE(DBpedia)  DBpediaLog  FEASIBLE(SWDF)  SWDFLog
Mean join vertex degree
  Min            0        0       0          0         0         0                  0           0               0
  Max            5        4.5     9          7         5         11                 11          4               5
  Mean           2.02222  3.05    2.4134259  3.62427   1.826087  1.449066667        0.04159183  0.968           0.37006
  S.D.           1.29997  1.6375  2.2608082  1.40647   1.435022  2.132466121        0.33443107  1.092023868     0.87378
Mean triple pattern selectivity
  Min            0.00032  9E-08   6.559E-05  0         1.19E-05  2.86693E-09        1.261E-05   1.06097E-05     1.1E-05
  Max            0.432    0.0453  0.5398061  0.01176   1         1                  1           1               1
  Mean           0.01     0.0105  0.2218042  0.00494   0.119288  0.140214337        0.00578652  0.29192835      0.02381
  S.D.           0.0745   0.0142  0.2083138  0.00239   0.226966  0.318994887        0.03669906  0.325138601     0.07857
Query runtime (ms)
  Min            2        5       7          3         11        2                  1           4               3
  Max            3200     99      7.13E+05   8.82E+08  5.40E+04  3.22E+04           5.60E+04    4.13E+04        4.13E+04
  Mean           437.675  9.1     2.83E+05   4.41E+08  1.07E+04  2242.6             30.4185995  1308.832        16.1632
  S.D.           320.34   14.564  5.26E+05   2.77E+07  1.73E+04  6961.991912        702.518249  5335.441231     249.674
Federated SPARQL Querying Benchmarks
Federated Query
• Return the party membership and news pages about all US presidents.
• [Figure: the party memberships and the presidents come from one dataset; the news pages about those presidents come from another.]
• Computation of the results requires data from both sources.
Federated SPARQL Query Processing

[Figure: a federation engine sitting in front of several RDF sources (S1–S4). Its federator answers a query in four steps:]
1. Parsing/Rewriting: rewrite the query and extract the individual triple patterns
2. Source selection: identify the capable/relevant sources for each triple pattern
3. Optimizer: generate an optimized query execution plan
4. Integrator: execute the sub-queries against the selected sources and integrate the sub-query results
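The source selection step can be sketched as follows. This is an illustrative mock, not a real federation engine: the `ask` callback stands in for an HTTP SPARQL ASK request, and the per-source predicate capabilities are hypothetical.

```python
def select_sources(triple_patterns, sources, ask):
    """Triple-pattern-wise source selection: for each pattern, keep only the
    sources whose SPARQL ASK request for that pattern returns true.
    `ask(source, pattern)` stands in for a real remote ASK request."""
    return {tp: [s for s in sources if ask(s, tp)] for tp in triple_patterns}

# Hypothetical setup: each source is known to answer certain predicates.
capabilities = {
    "dbpedia": {"rdf:type", "dbpedia:nationality", "dbpedia:party"},
    "nytimes": {"nyt:topicPage", "owl:sameAs"},
}

def mock_ask(source, pattern):
    return pattern[1] in capabilities[source]

patterns = [
    ("?president", "rdf:type", "dbpedia:President"),
    ("?x", "nyt:topicPage", "?page"),
]
selection = select_sources(patterns, ["dbpedia", "nytimes"], mock_ask)
```

The number of ASK requests issued and the total number of triple-pattern-wise sources selected are exactly the source selection metrics listed later for federated benchmarks.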
SPARQL Query Federation Approaches
• SPARQL Endpoint Federation (SEF)
• Linked Data Federation (LDF)
• Hybrid of SEF + LDF
SPLODGE [SP+12]
• Federated benchmark generation tool
• Query design criteria
  – Query form
  – Join type
  – Result modifiers: DISTINCT, LIMIT, OFFSET, ORDER BY
  – Variable triple patterns
  – Triple pattern joins
  – Cross-product triple patterns
  – Number of sources
  – Number of join vertices
  – Query selectivity
• Non-conjunctive queries that use SPARQL UNION or OPTIONAL are not considered
FedBench [FB+11]
• Based on 9 real interconnected datasets
  – KEGG, DrugBank, ChEBI from the life sciences domain
  – DBpedia, GeoNames, Jamendo, SWDF, NYT, LMDB from the cross domain
  – Vary in structuredness and size
• Four sets of queries
  – 7 life science queries
  – 7 cross domain queries
  – 11 Linked Data queries
  – 14 queries from SP2Bench
FedBench Queries Characteristics
Queries: 25
Query forms: SELECT 100.00%, ASK 0.00%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL constructs: UNION 12%, DISTINCT 0.00%, ORDER BY 0.00%, REGEX 0.00%, LIMIT 0.00%, OFFSET 0.00%, OPTIONAL 4%, FILTER 4%, GROUP BY 0.00%
Result size: min 1, max 9054, mean 529, S.D. 1764
BGPs: min 1, max 2, mean 1.16, S.D. 0.37
Triple patterns: min 2, max 7, mean 4, S.D. 1.25
Join vertices: min 0, max 5, mean 2.52, S.D. 1.26
Mean join vertex degree: min 0, max 3, mean 2.14, S.D. 0.56
Mean triple pattern selectivity: min 0.001, max 1, mean 0.05, S.D. 0.092
Query runtime (ms): min 50, max 1.2E+4, mean 1987, S.D. 3950
LargeRDFBench [LB+16]
• 32 queries
  – 14 simple
  – 10 complex
  – 8 large data
• 13 interlinked datasets

[Figure: the 13 datasets – DBpedia, GeoNames, LinkedMDB, New York Times, SW Dog Food, Jamendo (cross domain); KEGG, DrugBank, ChEBI, Affymetrix (life sciences); Linked TCGA-A, Linked TCGA-E, Linked TCGA-M (large data) – connected by links such as owl:sameAs, based_near, keggCompoundId, x-geneid, and bcr_patient_barcode, with link counts ranging from 1.3k to 251.3k.]
LargeRDFBench Datasets Statistics
[Table omitted in transcript.]
LargeRDFBench Queries Properties
• 14 simple queries
  – 2–7 triple patterns
  – Subset of SPARQL clauses
  – Query execution time around 2 seconds on average
• 10 complex queries
  – 8–13 triple patterns
  – Use more SPARQL clauses
  – Query execution time up to 10 minutes
• 8 large data queries
  – Minimum 80459 results
  – Large intermediate results
  – Query execution time in hours
LargeRDFBench Queries Characteristics
Queries: 32
Query forms: SELECT 100.00%, ASK 0.00%, CONSTRUCT 0.00%, DESCRIBE 0.00%
Important SPARQL constructs: UNION 18.75%, DISTINCT 28.21%, ORDER BY 9.37%, REGEX 3.12%, LIMIT 12.5%, OFFSET 0.00%, OPTIONAL 25%, FILTER 31.25%, GROUP BY 0.00%
Result size: min 1, max 3.0E+5, mean 5.9E+4, S.D. 1.1E+5
BGPs: min 1, max 2, mean 1.43, S.D. 0.5
Triple patterns: min 2, max 12, mean 6.6, S.D. 2.6
Join vertices: min 0, max 6, mean 3.43, S.D. 1.36
Mean join vertex degree: min 0, max 6, mean 2.56, S.D. 0.76
Mean triple pattern selectivity: min 0.001, max 1, mean 0.10, S.D. 0.14
Query runtime (ms): min 159, max > 1 hr, mean undefined, S.D. undefined
FedBench vs. LargeRDFBench
[Comparison figure omitted in transcript.]
Performance Metrics
• Efficient source selection, in terms of:
  – Total triple pattern-wise sources selected
  – Total number of SPARQL ASK requests used during source selection
  – Source selection time
• Query execution time
• Result completeness and correctness
• Number of remote requests during query execution
• Index compression ratio (1 − index size / data dump size)
• Number of intermediate results
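The index compression ratio is a one-line computation; a minimal sketch (function and parameter names are mine):

```python
def index_compression_ratio(index_size_bytes, dump_size_bytes):
    """1 - index size / data dump size: higher means a more compact index.
    An index a quarter the size of the dump scores 0.75."""
    return 1.0 - index_size_bytes / dump_size_bytes
```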
Future Directions
• Micro benchmarking
• Synthetic benchmark generation
  – Synthetic data that is like real data
  – Synthetic queries that are like real queries
• Customizable and flexible benchmark generation
  – Fits user needs
  – Fits the current use case
• What are the most important choke points for SPARQL querying benchmarks? How are they related to query performance?
References
• [L97] C. Levine. TPC-C: The OLTP Benchmark. In SIGMOD – Industrial Session, 1997.
• [GPH05] Y. Guo, Z. Pan, and J. Heflin. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics, 3(2–3):158–182, 2005.
• [SHM+09] M. Schmidt, T. Hornung, M. Meier, C. Pinkel, and G. Lausen. SP2Bench: A SPARQL Performance Benchmark. In Semantic Web Information Management, 2009.
• [BS09] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web and Inf. Sys., 5(2), 2009.
• [BSBM] Berlin SPARQL Benchmark (BSBM) Specification – V3.1. http://wifo5-3.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/index.html
• [RU09] N. Redaschi and the UniProt Consortium. UniProt in RDF: Tackling Data Integration and Distributed Annotation with the Semantic Web. In Biocuration Conference, 2009.
References
• [UniProtKB] UniProtKB Queries. http://www.uniprot.org/help/query-fields
• [SKW07] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In WWW, 2007.
• [Barton] The MIT Barton Library dataset. http://simile.mit.edu/rdf-test-data/
• [PHS10] H. Patni, C. Henson, and A. Sheth. Linked Sensor Data. 2010.
• [TPC-H] The TPC-H Homepage. http://www.tpc.org/tpch/
• [WordNet] WordNet: A Lexical Database for English. http://wordnet.princeton.edu/
• [MLA+14] M. Morsey, J. Lehmann, S. Auer, and A.-C. Ngonga Ngomo. DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data.
• [SP+12] O. Görlitz, M. Thimm, and S. Staab. SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data. In ISWC, Springer, 2012.
• [BNE14] P. Boncz, T. Neumann, and O. Erling. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. In TPCTC 2013, Revised Selected Papers.
References
• [NS09] A.-C. Ngonga Ngomo and D. Schumacher. BorderFlow: A Local Graph Clustering Algorithm for Natural Language Processing. In CICLing, 2009.
• [AHO+14] G. Aluç, O. Hartig, M. T. Özsu, and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In ISWC, 2014.
• [SNM15] M. Saleem, Q. Mehmood, and A.-C. Ngonga Ngomo. FEASIBLE: A Feature-Based SPARQL Benchmark Generation Framework. In ISWC, 2015.
• [DKS+11] S. Duan, A. Kementsietsidis, K. Srinivas, and O. Udrea. Apples and Oranges: A Comparison of RDF Benchmarks and Real RDF Datasets. In SIGMOD, 2011.
• [FK16] I. Fundulaki and A. Kementsietsidis. Assessing the Performance of RDF Engines: Discussing RDF Benchmarks. Tutorial at ESWC 2016.
• [FB+11] M. Schmidt et al. FedBench: A Benchmark Suite for Federated Semantic Data Query Processing. In ISWC, Springer, 2011.
• [LB+16] M. Saleem, A. Hasnain, and A.-C. Ngonga Ngomo. LargeRDFBench: A Billion Triples Benchmark for SPARQL Query Federation. Submitted to the Journal of Web Semantics.
Thanks
{lastname}@informatik.uni-leipzig.de
AKSW, University of Leipzig, Germany

This work was supported by grants from the BMWi project SAKE and the EU H2020 Framework Programme project HOBBIT (GA no. 688227).