S2RDF: RDF Querying with SPARQL on Spark

Proceedings of the 2016 VLDB Endowment (PVLDB) 171026 shu

Page 1: S2RDF: RDF Querying with SPARQL on Spark

S2RDF: RDF Querying with SPARQL on Spark

Proceedings of the 2016 VLDB Endowment (PVLDB)

171026 shu

Page 2: S2RDF: RDF Querying with SPARQL on Spark

Main Author: Alexander Schätzle
  ◦ University of Freiburg

Research Interests
  ◦ Semantic Web
  ◦ Social Networks
  ◦ MapReduce
  ◦ NoSQL

Page 3: S2RDF: RDF Querying with SPARQL on Spark

Introduction

• RDF: W3C standard for semantic data modeling.
  ◦ Very flexible graph-like data model.
• RDF data collections with billions of triples are not unusual.
  ◦ → Hadoop: distributed Big Data processing.
• But Hadoop is not designed for RDF data management.
• → It is necessary to achieve performance in the same order of magnitude as specialized systems built from the ground up for RDF.

Page 4: S2RDF: RDF Querying with SPARQL on Spark

S2RDF (SPARQL on Spark for RDF)

• A SPARQL processor based on the in-memory cluster computing framework Spark.
• ExtVP: Extended Vertical Partitioning
  ◦ based on semi-join reductions
  ◦ can largely reduce the input size of a query
  ◦ allows defining a selectivity threshold
  ◦ applicable to all query shapes (in contrast to approaches for only specific shapes, like HBase, Impala)

Page 5: S2RDF: RDF Querying with SPARQL on Spark

RDF & SPARQL

• IRI: identifies a resource (e.g. http://bio2rdf.org/drugbank_vocabulary:)
• triple t = (s, p, o)
  ◦ s: subject, p: predicate, o: object
  ◦ e.g. (A, follows, G), (A, likes, I1)
• graph G = {t1, …, tn}

Page 6: S2RDF: RDF Querying with SPARQL on Spark

SPARQL

• Query language for RDF
• triple pattern tp = (sʹ, pʹ, oʹ) with sʹ ∈ {s, ?s}, pʹ ∈ {p, ?p} and oʹ ∈ {o, ?o}
  ◦ with ?: unbound; without ?: bound
• BGP (basic graph pattern): a set of triple patterns
• The single result for Q1 is
  ◦ (?x→A, ?y→B, ?z→C, ?w→I2)

(Figure: SPARQL query Q1)

Page 7: S2RDF: RDF Querying with SPARQL on Spark

Variable Definition

• The result of a BGP is a bag of solution mappings.
• V: set of query variables (?x, ?y, ?z …)
• T: set of valid RDF terms (A, B, C, D, I1 …)
• (solution) mapping μ: partial function μ: V → T
• vars(tp): set of variables contained in tp (?x, ?w for the 1st tp)
• μ(tp): triple that is obtained by substituting the variables in tp according to μ
  ◦ → (A, likes, I1)
• dom(μ): subset of V where μ is defined (e.g. ?x)

Page 8: S2RDF: RDF Querying with SPARQL on Spark

SPARQL Answer

• The answer to a triple pattern tp is a bag of mappings
  ◦ Ω_tp = {μ | dom(μ) = vars(tp), μ(tp) ∈ G}
• The result of a basic graph pattern bgp = {tp1, ..., tpm} is
  ◦ Ω_bgp = Ω_tp1 ⋈ ... ⋈ Ω_tpm
• Ex) Ω(?x likes ?y) = {(?x→A, ?y→I1), (?x→A, ?y→I2), (?x→C, ?y→I2)}

graph G:

s  p        o
A  follows  B
B  follows  C
B  follows  D
C  follows  D
A  likes    I1
A  likes    I2
C  likes    I2
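The definitions of Ω_tp and Ω_bgp can be made concrete with a small in-memory sketch over graph G. This is only an illustration, not the S2RDF implementation: the helper names `match_tp`, `join` and `eval_bgp` are made up here, and the pattern list `q1` is an assumed reconstruction of Q1 (the query figure is not part of this transcript), chosen so that it is consistent with the single result stated for Q1.

```python
# Graph G from the slide, as a list of (s, p, o) triples.
G = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def is_var(term):
    return term.startswith("?")

def match_tp(tp, graph):
    """Omega_tp: every mapping mu with dom(mu) = vars(tp) and mu(tp) in G."""
    result = []
    for triple in graph:
        mu = {}
        matches = True
        for pattern_term, value in zip(tp, triple):
            if is_var(pattern_term):
                # A variable that occurs twice must bind to the same value.
                if mu.get(pattern_term, value) != value:
                    matches = False
                    break
                mu[pattern_term] = value
            elif pattern_term != value:
                matches = False
                break
        if matches:
            result.append(mu)
    return result

def join(left, right):
    """Join two bags of mappings: merge every compatible pair."""
    return [
        {**m1, **m2}
        for m1 in left
        for m2 in right
        if all(m1[v] == m2[v] for v in m1.keys() & m2.keys())
    ]

def eval_bgp(bgp, graph):
    """Omega_bgp: join of the answers of all triple patterns."""
    result = [{}]  # neutral element of the join
    for tp in bgp:
        result = join(result, match_tp(tp, graph))
    return result

# Assumed shape of Q1 (hypothetical reconstruction, see lead-in).
q1 = [("?x", "likes", "?w"), ("?x", "follows", "?y"),
      ("?y", "follows", "?z"), ("?z", "likes", "?w")]

print(match_tp(("?x", "likes", "?y"), G))
print(eval_bgp(q1, G))
```

The nested-loop join is quadratic and only meant for tiny examples; a distributed engine would use partitioned hash or broadcast joins instead.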

Page 9: S2RDF: RDF Querying with SPARQL on Spark

Related Work (RDF systems)

• Centralized
  ◦ Virtuoso: centralized RDF system using a relational back-end to materialize RDF data.
• Distributed
  ◦ SHARD: groups RDF data by subject and uses a Clause-Iteration approach for query processing.
  ◦ PigSPARQL: uses a vertical partitioning schema for data representation.
  ◦ H2RDF+: sorted and column-oriented NoSQL key-value store on top of HDFS, using six tables for all possible triple permutations (of s, p, o).
  ◦ Sempala: SPARQL-over-SQL approach based on Hadoop, especially for star-shaped queries.

Page 10: S2RDF: RDF Querying with SPARQL on Spark

Extended Vertical Partitioning

• Direct representation of RDF (a single triples table): needs several, or even six, indexes.
• → Vertical Partitioning (VP): easy to manage in distributed HDFS.

Triples table:

s  p        o
A  follows  B
B  follows  C
B  follows  D
C  follows  D
A  likes    I1
A  likes    I2
C  likes    I2

VP_follows:

s  o
A  B
B  C
B  D
C  D

VP_likes:

s  o
A  I1
A  I2
C  I2
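The step from the triples table to the two VP tables can be sketched in a few lines. This is a minimal in-memory illustration; the function name `vertical_partition` is made up for this example, and a real deployment would write one table per predicate to distributed storage rather than build a dictionary.

```python
from collections import defaultdict

# Triples table from the slide.
TRIPLES = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def vertical_partition(triples):
    """Split a triples table into one two-column (s, o) table per predicate."""
    vp = defaultdict(list)
    for s, p, o in triples:
        vp[p].append((s, o))
    return dict(vp)

VP = vertical_partition(TRIPLES)
for predicate, rows in VP.items():
    print(f"VP_{predicate}: {rows}")
```

A triple pattern with a bound predicate then only has to read the matching VP table instead of the whole triples table.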

Page 11: S2RDF: RDF Querying with SPARQL on Spark

ExtVP Definition

• Star-shaped queries only → all query shapes (which are often neglected).
• VP tables lead to a large reduction of the input size.
• But size differences between VP tables cause a lot of dangling tuples that are never used.
  ◦ → unnecessary I/O and increased memory consumption.
  ◦ → ExtVP

Page 12: S2RDF: RDF Querying with SPARQL on Spark

ExtVP Definition (correlation)

• Precompute a number of semi-join reductions of VP_p1.
• Correlation: co-occurrence of a variable in two triple patterns.

Join variables in predicate position are rarely used in SPARQL queries.

Page 13: S2RDF: RDF Querying with SPARQL on Spark

ExtVP Precomputation

• Precompute semi-join reductions for SS, OS and SO correlations.

(Figure legend: green: stored as ExtVP; red: not stored because equal to VP)
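The precomputation can be sketched on the two VP tables of the running example. The code below is an illustrative assumption, not the S2RDF code: `semi_join` and `build_extvp` are hypothetical names, tables are plain Python lists, and the storage rule (skip a reduction that is empty or equal to its VP table) is taken from the figure legend and from the later remark that most ExtVP tables are empty or equal to VP.

```python
# VP tables of the running example: rows are (s, o) tuples.
VP = {
    "follows": [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")],
    "likes": [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

S, O = 0, 1  # column positions of subject and object
CORRELATIONS = {"SS": (S, S), "OS": (O, S), "SO": (S, O)}

def semi_join(left, right, left_col, right_col):
    """Keep the rows of `left` that find a join partner in `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]

def build_extvp(vp):
    """Precompute semi-join reductions of every VP table against every VP table."""
    extvp = {}
    for p1, table1 in vp.items():
        for p2, table2 in vp.items():
            for name, (col1, col2) in CORRELATIONS.items():
                if name == "SS" and p1 == p2:
                    continue  # SS reduction of a table with itself is the table
                reduced = semi_join(table1, table2, col1, col2)
                # Assumed rule: store only non-empty, proper reductions.
                if reduced and len(reduced) < len(table1):
                    extvp[(name, p1, p2)] = reduced
    return extvp

EXTVP = build_extvp(VP)
for key, rows in EXTVP.items():
    print(key, rows)
```

On this toy graph only 5 of the 10 candidate reductions end up stored, which already hints at why the real storage overhead is far below the worst case.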

Page 14: S2RDF: RDF Querying with SPARQL on Spark

ExtVP Updatability

• Insertions: not critical; ExtVP can easily be adapted.
• Deletions: important.
  ◦ To remove a triple (s, p, o), we have to delete the corresponding tuples from all ExtVP tables.
  ◦ Trade-off.
  ◦ Deletions are currently not implemented.
• Updates: combination of delete and insert.

Page 15: S2RDF: RDF Querying with SPARQL on Spark

ExtVP and Database Query Optimization

• Spark joins: executed in parallel on all cluster nodes on portions of the data.
  ◦ → Applying semi-joins on the fly during query processing is less effective.
• S2RDF precomputes semi-join reductions of VP tables for all possible correlations.
  ◦ → They do not have to be computed on the fly, but only once.

Page 16: S2RDF: RDF Querying with SPARQL on Spark

ExtVP Selectivity Threshold

• Higher selectivity (a smaller SF value) makes the additional storage overhead smaller.
• SF(ExtVP_p1|p2) = |ExtVP_p1|p2| / |VP_p1|
  ◦ Ex) for the OS correlation, SF(ExtVP_follows|likes) = 1/4 = 0.25
• k: the number of predicates in an RDF graph G (= 2: follows, likes)
• n = |G|: the number of triples in G (= 7)
• Assumption:
  ◦ All VP tables have equal size n/k.
  ◦ SF = 0.5 for all ExtVP tables.

graph G:

s  p        o
A  follows  B
B  follows  C
B  follows  D
C  follows  D
A  likes    I1
A  likes    I2
C  likes    I2
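The selectivity factor of the example can be recomputed directly from the graph. The sketch below is illustrative; `selectivity_factor` is a hypothetical helper, and the column indices 0 and 1 stand for subject and object. It suggests that the value 1/4 = 0.25 on the slide corresponds to the OS correlation (object of follows joined with subject of likes), while the SS and SO correlations give different values.

```python
# VP tables of graph G: rows are (s, o) tuples.
VP_FOLLOWS = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]
VP_LIKES = [("A", "I1"), ("A", "I2"), ("C", "I2")]

S, O = 0, 1  # column positions of subject and object

def selectivity_factor(vp_p1, vp_p2, col1, col2):
    """SF = |ExtVP_p1|p2| / |VP_p1| for the given pair of join columns."""
    keys = {row[col2] for row in vp_p2}
    reduced = [row for row in vp_p1 if row[col1] in keys]
    return len(reduced) / len(vp_p1)

sf_os = selectivity_factor(VP_FOLLOWS, VP_LIKES, O, S)  # only (B, C) survives
sf_ss = selectivity_factor(VP_FOLLOWS, VP_LIKES, S, S)  # (A, B) and (C, D) survive
sf_so = selectivity_factor(VP_FOLLOWS, VP_LIKES, S, O)  # nothing survives

print(sf_os, sf_ss, sf_so)
```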

Page 17: S2RDF: RDF Querying with SPARQL on Spark

ExtVP Selectivity Threshold

• But the assumption is overgeneralized.
  ◦ Most of the ExtVP tables will be empty.
• Ex) n = 10^9, k = 86
  ◦ expected: (3·86 − 1) · 10^9 / 2 ≈ 1.3 · 10^11 tuples (≈ 128n)
  ◦ in practice: ≈ 11 · 10^9 tuples (≈ 11n)
  ◦ More than 90% of all ExtVP tables were either empty or equal to VP.
• SF ≈ 1: a large overhead while contributing only a negligible performance benefit.
• SF < 0.5: the best performance benefit while causing only little overhead.
• A threshold of 0.25 reduces the size of ExtVP from ∼11n to ∼2n tuples and at the same time provides 95% of the performance benefit.
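The "expected" figure follows from the two simplifying assumptions (all VP tables have size n/k, SF = 0.5 everywhere). A short sketch of that arithmetic, with hypothetical helper names: each VP table has k OS partners, k SO partners and k − 1 SS partners (the SS reduction with itself is the table itself), giving 3k − 1 reductions per table, which matches the (3·86 − 1) term on the slide.

```python
def extvp_table_count(k):
    """Number of candidate ExtVP tables for k predicates: k * (3k - 1)."""
    return k * (3 * k - 1)

def expected_overhead_factor(k, sf=0.5):
    """Expected ExtVP size in multiples of n, the number of triples.

    k * (3k - 1) tables, each of size sf * (n / k), sum to (3k - 1) * sf * n.
    """
    return (3 * k - 1) * sf

print(extvp_table_count(86))         # candidate tables for k = 86
print(expected_overhead_factor(86))  # expected size as a multiple of n
print(extvp_table_count(2))          # toy graph with follows and likes
```

For k = 86 the factor is 128.5, in line with the ≈ 128n estimate, whereas the measured size is reported as only about 11n.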

Page 18: S2RDF: RDF Querying with SPARQL on Spark

S2RDF Query Processing

• Spark: in-memory cluster computing system that runs on Hadoop.
• Spark SQL: relational interface of Spark.
• Jena ARQ: SPARQL query parser.
• SPARQL query → S2RDF (including Jena ARQ) → equivalent Spark SQL query
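The translation idea (a BGP becomes one SQL query that joins per-predicate tables) can be sketched without a cluster. Everything below is an assumption made for illustration: SQLite from the Python standard library stands in for Spark SQL, the table names `vp_follows` and `vp_likes` and the function `bgp_to_sql` are hypothetical, the BGP is an already-parsed list of triples rather than Jena ARQ output, and the query `q1` is an assumed shape of Q1. Predicates are assumed to be bound, and constants are inlined without escaping, so this is not safe for untrusted input.

```python
import sqlite3

def bgp_to_sql(bgp):
    """Translate a BGP with bound predicates into one SQL join over VP tables."""
    first_ref = {}  # variable -> first column that binds it
    tables, conditions = [], []
    for i, (s, p, o) in enumerate(bgp):
        alias = f"t{i}"
        tables.append(f"vp_{p} AS {alias}")
        for term, column in ((s, "s"), (o, "o")):
            ref = f"{alias}.{column}"
            if term.startswith("?"):
                if term in first_ref:
                    # Repeated variable: becomes a join condition.
                    conditions.append(f"{first_ref[term]} = {ref}")
                else:
                    first_ref[term] = ref
            else:
                # Bound subject or object: becomes a selection.
                conditions.append(f"{ref} = '{term}'")
    columns = ", ".join(f"{ref} AS {var[1:]}" for var, ref in first_ref.items())
    sql = f"SELECT {columns} FROM {', '.join(tables)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

# Load the VP tables of the running example into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vp_follows (s TEXT, o TEXT)")
conn.execute("CREATE TABLE vp_likes (s TEXT, o TEXT)")
conn.executemany("INSERT INTO vp_follows VALUES (?, ?)",
                 [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")])
conn.executemany("INSERT INTO vp_likes VALUES (?, ?)",
                 [("A", "I1"), ("A", "I2"), ("C", "I2")])

# Assumed shape of Q1 (hypothetical reconstruction).
q1 = [("?x", "likes", "?w"), ("?x", "follows", "?y"),
      ("?y", "follows", "?z"), ("?z", "likes", "?w")]

sql = bgp_to_sql(q1)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

With ExtVP in place, the same scheme would reference a smaller precomputed table instead of `vp_follows` or `vp_likes` wherever one exists for the correlation at hand.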

Page 19: S2RDF: RDF Querying with SPARQL on Spark

Triple Pattern Mapping

(Figure: Algorithm 1; Algorithms 2, 3)

Page 20: S2RDF: RDF Querying with SPARQL on Spark

Triple Pattern Mapping

Page 21: S2RDF: RDF Querying with SPARQL on Spark

Triple Pattern Mapping

Page 22: S2RDF: RDF Querying with SPARQL on Spark

Query Composition

• The order of the resulting subqueries has a severe impact on performance.
• Reducing the amount of intermediate results is very important.
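The effect of join order on intermediate results can be observed even on the small example graph. The sketch below is an illustration under assumptions: the three-pattern chain query and the helper names are made up for this example, and the nested-loop join is a stand-in for a distributed join. Both orders give the same final answer, but joining two patterns that share no variable first produces a cross product.

```python
G = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def match(tp, graph):
    """Mappings for one triple pattern (assumes no variable repeats within it)."""
    return [
        {t: v for t, v in zip(tp, triple) if t.startswith("?")}
        for triple in graph
        if all(t.startswith("?") or t == v for t, v in zip(tp, triple))
    ]

def join(left, right):
    """Merge every pair of mappings that agrees on the shared variables."""
    return [
        {**a, **b}
        for a in left
        for b in right
        if all(a[v] == b[v] for v in a.keys() & b.keys())
    ]

def intermediate_sizes(bgp, graph):
    """Evaluate patterns left to right and record the result size after each join."""
    result, sizes = [{}], []
    for tp in bgp:
        result = join(result, match(tp, graph))
        sizes.append(len(result))
    return sizes, result

tp1 = ("?a", "follows", "?b")
tp2 = ("?b", "follows", "?c")
tp3 = ("?c", "likes", "?d")

sizes_chain, result_chain = intermediate_sizes([tp1, tp2, tp3], G)
sizes_cross, result_cross = intermediate_sizes([tp1, tp3, tp2], G)
print(sizes_chain, sizes_cross)
```

Following the chain keeps every intermediate result at or below the size of the first table, while the second order temporarily materializes all 4 × 3 combinations.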

Page 23: S2RDF: RDF Querying with SPARQL on Spark


Page 24: S2RDF: RDF Querying with SPARQL on Spark

Evaluation

• 10 machines (1 master and 9 workers)
  ◦ [email protected], 2 x 2 TB disks, 32 GB RAM, running Ubuntu 14.04 LTS
  ◦ Hadoop distribution: Cloudera CDH 5.4
  ◦ Spark 1.3
• Compared with Hadoop SPARQL processors:
  ◦ SHARD
  ◦ PigSPARQL
  ◦ Sempala
  ◦ H2RDF+
  ◦ Virtuoso

Page 25: S2RDF: RDF Querying with SPARQL on Spark

Evaluation

• Dataset: WatDiv Data Generator
  ◦ http://dsg.uwaterloo.ca/watdiv/basic-testing.shtml
  ◦ 5 linear, 7 star, 5 snowflake and 3 complex query graphs.
  ◦ 100 million and 1 billion RDF triples.
• Cache time is excluded (because caching is a one-time operation).

Page 26: S2RDF: RDF Querying with SPARQL on Spark

WatDiv Basic Testing

(Figure annotation: specialized for star queries)

Page 27: S2RDF: RDF Querying with SPARQL on Spark

WatDiv Basic Testing

Page 28: S2RDF: RDF Querying with SPARQL on Spark

WatDiv Incremental Linear Testing (IL)

• In Basic Testing, only two queries have a diameter larger than 3 (C1, C2).
• → Incremental Linear Testing
  ◦ IL-1, IL-2, IL-3
  ◦ diameter: 5 to 10
  ◦ IL-1: (%u% p1 ?v1 . ?v1 p2 ?v2 . ?v2 p3 ?v3 . ...)

(Figure labels: F3: diameter = 3; C2: diameter = 5)
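The IL-1 template is a linear chain in which each triple pattern continues from the object of the previous one, so the diameter equals the number of patterns. A small sketch that builds such a chain for a chosen diameter; the function name `incremental_linear_bgp` is hypothetical, and `p1`, `p2`, ... as well as `%u%` are kept as the unfilled placeholders from the slide rather than real WatDiv predicates.

```python
def incremental_linear_bgp(diameter, start="%u%"):
    """Build a linear chain of `diameter` triple patterns in the IL-1 style."""
    terms = [start] + [f"?v{i}" for i in range(1, diameter + 1)]
    patterns = [
        f"{terms[i]} p{i + 1} {terms[i + 1]} ."
        for i in range(diameter)
    ]
    return " ".join(patterns)

# Chains for the diameters 5 to 10 used in Incremental Linear Testing.
for d in range(5, 11):
    print(d, incremental_linear_bgp(d))
```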

Page 29: S2RDF: RDF Querying with SPARQL on Spark

WatDiv Incremental Linear Testing (IL)

Page 30: S2RDF: RDF Querying with SPARQL on Spark

WatDiv Incremental Linear Testing (IL)

Page 31: S2RDF: RDF Querying with SPARQL on Spark

SF Threshold

• Additional dataset: YAGO, 245 million triples.
  ◦ Huge semantic knowledge base derived from Wikipedia, WordNet and GeoNames (used by IBM Watson).
• Change the threshold from 0.00 (equal to VP only) to 1.00 (all ExtVP tables, as in the previous experiments).

Page 32: S2RDF: RDF Querying with SPARQL on Spark

SF Threshold

Page 33: S2RDF: RDF Querying with SPARQL on Spark

Conclusion

• S2RDF: SPARQL query processor for large-scale RDF data.
• ExtVP: extension of Vertical Partitioning.
  ◦ Inspired by semi-join reductions, similar to Join Indices.
  ◦ Precomputes the reductions of VP tables for possible join correlations.
  ◦ Reduces the input size.
  ◦ Works for all query shapes.
  ◦ Does not depend on the query diameter.