S2RDF: RDF Querying with SPARQL on Spark

Proceedings of the 2016 VLDB Endowment (PVLDB)
171026 shu

Main Author
• Alexander Schätzle
  ◦ University of Freiburg
• Research Interests
  ◦ Semantic Web
  ◦ Social Networks
  ◦ MapReduce
  ◦ NoSQL

Introduction
• RDF: W3C standard for semantic data modeling.
  ◦ Very flexible graph-like data model.
• RDF data collections with billions of triples are not unusual.
  ◦ → Hadoop: distributed Big Data processing.
• But Hadoop is not designed for RDF data management.
  ◦ → It is necessary to achieve performance in the same order of magnitude as specialized systems built from the ground up for RDF.

S2RDF (SPARQL on Spark for RDF)
• A SPARQL processor based on the in-memory cluster computing framework Spark.
• ExtVP: Extended Vertical Partitioning
  ◦ based on semi-join reductions
  ◦ can largely reduce the input size of a query
  ◦ allows defining a selectivity threshold
  ◦ applicable to all query shapes (in contrast to approaches that target only specific shapes, e.g. on HBase or Impala)

RDF & SPARQL
• IRI: identifies a resource (e.g. http://bio2rdf.org/drugbank_vocabulary:)
• triple t = (s, p, o)
  ◦ s: subject, p: predicate, o: object
  ◦ e.g. (A, follows, G), (A, likes, I1)
• graph G = {t1, …, tn}

SPARQL
• Query language for RDF
• triple pattern tp = (s′, p′, o′) with s′ ∈ {s, ?s}, p′ ∈ {p, ?p} and o′ ∈ {o, ?o}
  ◦ with ?: unbound (a variable); without ?: bound
• BGP (basic graph pattern): a set of triple patterns
• The single result for Q1 is
  ◦ (?x → A, ?y → B, ?z → C, ?w → I2)
(Figure: SPARQL query Q1)

Variable Definition
• The result of a BGP is a bag of solution mappings.
• V: set of query variables (?x, ?y, ?z, …)
• T: set of valid RDF terms (A, B, C, D, I1, …)
• (solution) mapping μ: partial function μ: V → T
• vars(tp): set of variables contained in tp (?x, ?w for the 1st tp)
• μ(tp): triple that is obtained by substituting the variables in tp according to μ
  ◦ → (A, likes, I1)
• dom(μ): subset of V where μ is defined (?x)

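SPARQL Answer
• The answer to a triple pattern tp is a bag of mappings:
  ◦ Ω_tp = {μ | dom(μ) = vars(tp), μ(tp) ∈ G}
• The result of a basic graph pattern bgp = {tp1, ..., tpm} is the join of the answers to its triple patterns:
  ◦ Ω_bgp = Ω_tp1 ⋈ ... ⋈ Ω_tpm
• Ex) Ω(?x likes ?y) = {(?x → A, ?y → I1), (?x → A, ?y → I2), (?x → C, ?y → I2)}, obtained from the triples (A, likes, I1), (A, likes, I2), (C, likes, I2)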

Graph G
s  p        o
A  follows  B
B  follows  C
B  follows  D
C  follows  D
A  likes    I1
A  likes    I2
C  likes    I2
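The definitions of Ω_tp and Ω_bgp can be made concrete with a small Python sketch over graph G. This is an illustration only, not how S2RDF evaluates queries: the helper names (`match_pattern`, `join`, `eval_bgp`) are hypothetical, and the four triple patterns used for Q1 are an assumption, chosen to be consistent with the first pattern (?x likes ?w) and the single result stated on the slides.

```python
# Graph G from the running example, as a list of (s, p, o) triples.
G = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def is_var(term):
    """Variables are written with a leading '?'."""
    return term.startswith("?")

def match_pattern(tp, graph):
    """Omega_tp: every mapping mu with dom(mu) = vars(tp) and mu(tp) in G."""
    result = []
    for triple in graph:
        mu, ok = {}, True
        for pat, val in zip(tp, triple):
            if is_var(pat):
                # A variable that occurs twice must be bound to the same term.
                if mu.setdefault(pat, val) != val:
                    ok = False
                    break
            elif pat != val:
                ok = False
                break
        if ok:
            result.append(mu)
    return result

def join(left, right):
    """Join two bags of mappings: merge every pair that agrees on shared variables."""
    out = []
    for m1 in left:
        for m2 in right:
            if all(m1[v] == m2[v] for v in m1.keys() & m2.keys()):
                out.append({**m1, **m2})
    return out

def eval_bgp(bgp, graph):
    """Omega_bgp: join of the answers to all triple patterns."""
    result = [{}]  # the empty mapping is compatible with everything
    for tp in bgp:
        result = join(result, match_pattern(tp, graph))
    return result

# Assumed shape of Q1 (hypothetical reconstruction, see the note above).
Q1 = [("?x", "likes", "?w"), ("?x", "follows", "?y"),
      ("?y", "follows", "?z"), ("?z", "likes", "?w")]

print(match_pattern(("?x", "likes", "?y"), G))
print(eval_bgp(Q1, G))
```

The nested-loop join is quadratic and only meant to mirror the definition; a real engine would use hash or sort-merge joins.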

RELATED WORK (RDF systems)
• Centralized
  ◦ Virtuoso: centralized RDF system using a relational back-end to materialize RDF data.
• Distributed
  ◦ SHARD: groups RDF data by subject and uses a Clause-Iteration approach for query processing.
  ◦ PigSPARQL: uses a vertical partitioning schema for data representation.
  ◦ H2RDF+: sorted and column-oriented NoSQL key-value store on top of HDFS. Uses six tables for all possible triple permutations (of s, p, o).
  ◦ Sempala: SPARQL-over-SQL approach based on Hadoop. Especially suited to star-shaped queries.

EXTENDED VERTICAL PARTITIONING
• Direct representation of RDF (triples table) → Vertical Partitioning (VP)
  ◦ Triples table: needs several (up to six) indexes
  ◦ VP: easy to manage in distributed HDFS

Triples table
s  p        o
A  follows  B
B  follows  C
B  follows  D
C  follows  D
A  likes    I1
A  likes    I2
C  likes    I2

VP_follows
s  o
A  B
B  C
B  D
C  D

VP_likes
s  o
A  I1
A  I2
C  I2
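The step from the triples table to the VP tables can be sketched in Python as a simple grouping by predicate. The function name `vertical_partition` is hypothetical; S2RDF itself builds these tables with Spark, which is not shown here.

```python
from collections import defaultdict

# Triples table of the running example.
G = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def vertical_partition(graph):
    """Split a triples table into one two-column (s, o) table per predicate."""
    vp = defaultdict(list)
    for s, p, o in graph:
        vp[p].append((s, o))
    return dict(vp)

VP = vertical_partition(G)
for predicate, table in VP.items():
    print(predicate, table)
```

A query with a bound predicate then only has to read the table of that predicate instead of the whole triples table.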

ExtVP Definition
• From star shapes only → to all query shapes (which are often neglected)
• VP tables lead to a large reduction of the input size.
• But when VP tables differ in size, joins involve a lot of dangling tuples that are never used.
  ◦ → unnecessary I/O and increased memory consumption.
  ◦ → ExtVP

ExtVP Definition (correlation)
• Precompute a number of semi-join reductions of VP_p1
• correlation: co-occurrence of a variable in two triple patterns
• Join variables in predicate position are rarely used in SPARQL queries.

ExtVP Precomputation
• Precompute semi-join reductions for SS, OS and SO correlations
(Figure legend: green = stored as ExtVP; red = not stored because equal to VP)
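A minimal Python sketch of the precomputation on the running example follows. The join conditions are an assumption based on the correlation names: SS matches subject with subject, OS matches the object of the left table with the subject of the right table, and SO matches subject with object. The names `semi_join` and `precompute_extvp` are hypothetical, and plain lists stand in for Spark tables.

```python
VP = {
    "follows": [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")],
    "likes": [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

# Column index (0 = s, 1 = o) compared on the left and right table.
# Assumed meaning of the correlation names, see the note above.
CORRELATIONS = {"SS": (0, 0), "OS": (1, 0), "SO": (0, 1)}

def semi_join(left, right, left_col, right_col):
    """Keep the tuples of `left` that have at least one join partner in `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]

def precompute_extvp(vp):
    extvp = {}
    for p1, t1 in vp.items():
        for p2, t2 in vp.items():
            for kind, (lc, rc) in CORRELATIONS.items():
                if kind == "SS" and p1 == p2:
                    continue  # an SS self-reduction always equals VP_p1
                reduced = semi_join(t1, t2, lc, rc)
                # Store only non-empty tables that are strictly smaller than VP_p1.
                if reduced and len(reduced) < len(t1):
                    extvp[(kind, p1, p2)] = reduced
    return extvp

EXTVP = precompute_extvp(VP)
for key, table in EXTVP.items():
    print(key, table)
```

On this tiny graph the OS reduction of follows by likes keeps only (B, C): C is the only object of follows that also appears as a subject of likes.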

ExtVP updatability
• Insertions: not critical, easy to adapt.
• Deletions: important.
  ◦ To remove a triple (s, p, o), we have to delete the corresponding tuples from all ExtVP tables.
  ◦ Tradeoff
  ◦ Deletions are not implemented at present.
• Updates: combination of delete and insert.

ExtVP and Database Query Optimization
• Spark joins: executed in parallel on all cluster nodes on portions of the data.
  ◦ → applying semi-joins on the fly during query processing is less effective.
⇅
• S2RDF precomputes semi-join reductions of VP tables for all possible correlations.
  ◦ → they do not have to be computed on the fly, but only once.

ExtVP Selectivity Threshold
• Higher selectivity (i.e. a smaller SF) makes the additional storage overhead smaller.
• SF(ExtVP_p1|p2) = |ExtVP_p1|p2| / |VP_p1|.
• Ex) SF(ExtVP_follows|likes) = 1/4 = 0.25.
• k: the number of predicates in an RDF graph G (= 2: follows, likes)
• n = |G|: the number of triples in G (= 7)
• Assumptions
  ◦ All VP tables have equal size n/k
  ◦ SF = 0.5 for all ExtVP tables
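The selectivity factor is a plain ratio of table sizes, which the following Python sketch computes for the example. The table `EXTVP_OS_FOLLOWS_LIKES` assumes the OS reading of the correlation (object of follows matched against subject of likes), which is the one that yields 1/4 on graph G. The function `worth_storing` and its handling of the boundary case SF = threshold are assumptions for illustration, not S2RDF's documented rule.

```python
VP_FOLLOWS = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]
# Assumed OS reduction of follows by likes on graph G.
EXTVP_OS_FOLLOWS_LIKES = [("B", "C")]

def selectivity_factor(extvp_table, vp_table):
    """SF = |ExtVP_p1|p2| / |VP_p1|; smaller values mean higher selectivity."""
    return len(extvp_table) / len(vp_table)

def worth_storing(sf, threshold):
    """Assumed rule: skip empty tables (SF = 0) and tables equal to VP (SF = 1),
    and keep the rest only if SF does not exceed the threshold."""
    return 0 < sf < 1 and sf <= threshold

sf = selectivity_factor(EXTVP_OS_FOLLOWS_LIKES, VP_FOLLOWS)
print(sf, worth_storing(sf, threshold=0.25))
```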


ExtVP Selectivity Threshold
• But the assumption is overgeneralized.
  ◦ Most of the ExtVP tables will be empty.
• Ex) n = 109, k = 86
  ◦ expected: (3·86 − 1)·109/2 ≈ 14,000 (≈ 128n)
  ◦ practical: 1199 (≈ 11n)
  ◦ more than 90% of all ExtVP tables were either empty or equal to VP.
• SF ≈ 1: a large overhead while contributing only a negligible performance benefit.
• SF < 0.5: the best performance benefit while causing only little overhead.
• A threshold of 0.25 reduces the size of ExtVP from ∼11n to ∼2n tuples and at the same time provides 95% of the performance benefit.
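The expected size can be reproduced with a few lines of arithmetic. The derivation in the comments is an interpretation of the formula, not something the slides spell out: each of the k VP tables gets three reductions (SS, OS, SO) against each of the k predicates, minus the SS reduction against itself, and each reduction holds SF·n/k tuples under the two uniform assumptions. The function name `expected_extvp_tuples` is hypothetical.

```python
def expected_extvp_tuples(n, k, sf=0.5):
    """Estimated total number of ExtVP tuples under the uniform assumptions."""
    tables_per_predicate = 3 * k - 1   # SS, OS, SO vs. k predicates, minus the SS self-case
    tuples_per_table = sf * n / k      # every VP table has n/k tuples, reduced by SF
    return k * tables_per_predicate * tuples_per_table  # = (3k - 1) * n * sf

expected = expected_extvp_tuples(n=109, k=86)
print("expected tuples:", expected, "as a multiple of n:", expected / 109)
print("practical multiple of n:", 1199 / 109)
print("toy graph G (n=7, k=2):", expected_extvp_tuples(n=7, k=2))
```

The gap between the estimate and the measured value is what motivates dropping empty tables and tables equal to VP before applying any threshold.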

S2RDF QUERY PROCESSING
• Spark: in-memory cluster computing system that runs on Hadoop.
• Spark SQL: relational interface of Spark.
• Jena ARQ: SPARQL query parser.
• SPARQL query → S2RDF (including Jena ARQ) → equivalent Spark SQL query
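To give a feel for the SPARQL-to-SQL step, here is a deliberately simplified Python sketch that turns a basic graph pattern with bound predicates into one SQL string over VP tables. It is not S2RDF's translation: the table naming scheme `vp_<predicate>` and the function `bgp_to_sql` are hypothetical, parsing is skipped (the BGP is given as tuples), and only VP tables are used, whereas S2RDF would pick a smaller ExtVP table where one exists.

```python
def bgp_to_sql(bgp):
    """Translate a BGP (bound predicates only) into a single SQL join query."""
    first_seen = {}            # variable -> "alias.column" of its first occurrence
    from_parts, conditions = [], []
    for i, (s, p, o) in enumerate(bgp):
        alias = f"t{i}"
        from_parts.append(f"vp_{p} {alias}")   # hypothetical table naming
        for term, col in ((s, "s"), (o, "o")):
            ref = f"{alias}.{col}"
            if term.startswith("?"):
                if term in first_seen:
                    # Repeated variable: becomes a join condition.
                    conditions.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                # Bound subject or object: becomes a selection.
                conditions.append(f"{ref} = '{term}'")
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in first_seen.items())
    sql = f"SELECT {select} FROM {', '.join(from_parts)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

print(bgp_to_sql([("?x", "follows", "?y"), ("?y", "likes", "?z")]))
```

In this example the shared variable ?y sits in object position of the first pattern and subject position of the second, so it is an OS correlation between follows and likes.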

Triple Pattern Mapping
(Figures: Algorithm 1; Algorithms 2 and 3)

Query Composition
• The order of the resulting subqueries has a severe impact on performance.
• Reducing the amount of intermediate results is very important.
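One simple way to keep intermediate results small is to join the most restrictive subqueries first. The Python sketch below orders triple patterns by the number of bound values (more first) and then by the size of the table selected for each pattern (smaller first). This heuristic, the function name `order_patterns` and the table sizes in the example are assumptions for illustration; the sketch also ignores whether consecutive patterns share a variable, which a real planner would need to check to avoid cross products.

```python
def order_patterns(patterns, table_sizes):
    """Sort triple patterns: more bound subject/object values first, then smaller tables."""
    def key(tp):
        s, _, o = tp
        bound = sum(1 for term in (s, o) if not term.startswith("?"))
        return (-bound, table_sizes[tp])
    return sorted(patterns, key=key)

p1 = ("?x", "follows", "?y")
p2 = ("?y", "likes", "?z")
p3 = ("?x", "likes", "I2")

# Hypothetical sizes of the tables chosen for each pattern.
sizes = {p1: 4, p2: 3, p3: 3}

for tp in order_patterns([p1, p2, p3], sizes):
    print(tp)
```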

Evaluation
• 10 machines (1 master and 9 workers)
  ◦ [email protected], 2x2 TB disks, 32 GB RAM, running Ubuntu 14.04 LTS
  ◦ Hadoop distribution: Cloudera CDH 5.4
  ◦ Spark 1.3
• Compared with:
  ◦ SHARD
  ◦ PigSPARQL
  ◦ Sempala
  ◦ H2RDF+
  ◦ Virtuoso (centralized)

Evaluation
• Dataset: WatDiv Data Generator
  ◦ http://dsg.uwaterloo.ca/watdiv/basic-testing.shtml
  ◦ 5 linear, 7 star, 5 snowflake and 3 complex query graphs.
  ◦ 100 million and 1 billion RDF triples.
• Caching time is excluded (because it is a one-time operation).

WatDiv Basic Testing
(Figures: Basic Testing results; annotation: specialized to star queries)

WatDiv Incremental Linear Testing (IL)
• In Basic Testing, only two queries have a diameter larger than 3 (C1, C2).
  ◦ → Incremental Linear Testing
• IL-1, IL-2, IL-3
  ◦ diameter: 5 to 10
  ◦ IL-1: (%u% p1 ?v1 . ?v1 p2 ?v2 . ?v2 p3 ?v3 . ...)
(Figure annotations: F3: diameter = 3; C2: diameter = 5)


SF Threshold
• + YAGO dataset: 245 million triples.
  ◦ Huge semantic knowledge base derived from Wikipedia, WordNet and GeoNames (used by IBM Watson).
• Vary the threshold from 0.00 (equivalent to VP only) to 1.00 (all ExtVP tables, as in the previous experiments).


Conclusion
• S2RDF: SPARQL query processor for large-scale RDF data.
• ExtVP: extension of Vertical Partitioning.
  ◦ Inspired by semi-join reductions, similar to Join Indices.
  ◦ Precomputes the reductions of VP tables for possible join correlations.
  ◦ Reduces the input size.
  ◦ Works for all query shapes.
  ◦ Does not depend on the query diameter.
