S2RDF: RDF Querying with SPARQL on Spark
Proceedings of the 2016 VLDB Endowment (PVLDB) 171026 shu
1
Main Author: Alexander Schätzle ◦ University of Freiburg
Research Interests ◦ Semantic Web ◦ Social Networks ◦ MapReduce ◦ NoSQL
2
Introduction
• RDF: W3C standard for semantic data modeling.
  ◦ very flexible graph-like data model
• RDF data collections with billions of triples are not unusual.
• → Hadoop: distributed Big Data processing
• But Hadoop is not designed for RDF data management.
• → It is necessary to achieve performance in the same order of magnitude as specialized systems built from the ground up for RDF.
3
S2RDF (SPARQL on Spark for RDF)
  ◦ a SPARQL processor based on the in-memory cluster computing framework Spark.
• ExtVP: Extended Vertical Partitioning
  ◦ based on semi-join reductions
  ◦ can largely reduce the input size of a query
  ◦ allows defining a selectivity threshold
  ◦ applicable to all query shapes (⇄ only specific shapes, as with HBase, Impala)
4
RDF & SPARQL
• IRI: identifies a resource (e.g. http://bio2rdf.org/drugbank_vocabulary:)
• triple t = (s, p, o)
  ◦ s: subject, p: predicate, o: object
  ◦ e.g. (A, follows, G), (A, likes, I1)
• graph G = {t1, …, tn}
5
SPARQL
• Query language for RDF
• triple pattern tp = (sʹ, pʹ, oʹ) with sʹ ∈ {s, ?s}, pʹ ∈ {p, ?p} and oʹ ∈ {o, ?o}
  ◦ with ?: unbound (variable); without ?: bound
• BGP (basic graph pattern): a set of triple patterns
• The single result for Q1 is
  ◦ (?x → A, ?y → B, ?z → C, ?w → I2)
6
SPARQL query Q1
Variable Definition
• The result of a BGP is a bag of solution mappings
• V: set of query variables (?x, ?y, ?z, …)
• T: set of valid RDF terms (A, B, C, D, I1, …)
• (solution) mapping μ: partial function μ: V → T
• vars(tp): set of variables contained in tp (?x, ?w for the 1st tp)
• μ(tp): triple that is obtained by substituting the variables in tp according to μ
  ◦ → (A, likes, I1)
• dom(μ): subset of V where μ is defined (?x)
7
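SPARQL Answer
• The answer to a triple pattern tp is a bag of mappings
• Ω_tp = {μ | dom(μ) = vars(tp), μ(tp) ∈ G}
• The result of a basic graph pattern bgp = {tp1, ..., tpm} is
• Ω_bgp = Ω_tp1 ⋈ ... ⋈ Ω_tpm
• Ex) Ω(?x likes ?y) = {(?x → A, ?y → I1), (?x → A, ?y → I2), (?x → C, ?y → I2)}, i.e. the matched triples are (A, likes, I1), (A, likes, I2), (C, likes, I2)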
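The definitions of Ω_tp and Ω_bgp can be sketched in a few lines of Python. This is only an illustration, not S2RDF code: the text of Q1 is not part of this transcript, so the pattern list `Q1` below is an assumption chosen to be consistent with the stated single result, and the names `match`, `join` and `evaluate` are made up for the sketch.

```python
# Toy graph G from the slides, as (s, p, o) triples.
G = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def is_var(term):
    # Unbound terms (variables) start with '?'.
    return term.startswith("?")

def match(tp, graph):
    """Omega_tp: every mapping mu with dom(mu) = vars(tp) and mu(tp) in G."""
    mappings = []
    for triple in graph:
        mu = {}
        for term, value in zip(tp, triple):
            if is_var(term):
                # A variable used twice in tp must bind to the same value.
                if mu.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            mappings.append(mu)
    return mappings

def join(left, right):
    """Join two bags of mappings on their shared variables."""
    return [
        {**m1, **m2}
        for m1 in left
        for m2 in right
        if all(m1[v] == m2[v] for v in m1.keys() & m2.keys())
    ]

def evaluate(bgp, graph):
    """Omega_bgp: the join of Omega_tp over all triple patterns in bgp."""
    result = [{}]
    for tp in bgp:
        result = join(result, match(tp, graph))
    return result

# Assumed shape of Q1 (hypothetical, see the note above).
Q1 = [
    ("?x", "likes", "?w"),
    ("?x", "follows", "?y"),
    ("?y", "follows", "?z"),
    ("?z", "likes", "?w"),
]

print(evaluate(Q1, G))
```

With this assumed Q1 the sketch yields exactly one mapping, which binds ?x to A, ?y to B, ?z to C and ?w to I2.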
8
s p o
A follows B
B follows C
B follows D
C follows D
A likes I1
A likes I2
C likes I2
graph G
RELATED WORK (RDF systems)
• Centralized
  ◦ Virtuoso: centralized RDF system using a relational back-end to materialize RDF data.
• Distributed
  ◦ SHARD: groups RDF data by subject and uses a Clause-Iteration approach for query processing.
  ◦ PigSPARQL: uses a vertical partitioning schema for data representation.
  ◦ H2RDF+: sorted and column-oriented NoSQL key-value store on top of HDFS. Uses six tables for all possible triple permutations (of s, p, o).
  ◦ Sempala: SPARQL-over-SQL approach based on Hadoop. Especially suited to star-shaped queries.
9
EXTENDED VERTICAL PARTITIONING
• Direct representation of RDF → Vertical Partitioning (VP)
  ◦ needs some (or six) indexes
  ◦ easy to manage in distributed HDFS
10
s p o
A follows B
B follows C
B follows D
C follows D
A likes I1
A likes I2
C likes I2

s o
A B
B C
B D
C D
VP_follows

s o
A I1
A I2
C I2
VP_likes
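The split of the triples table into one two-column table per predicate can be sketched as follows. The function name `vertical_partition` is made up for illustration; S2RDF itself stores these tables through Spark, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict

# Toy graph G from the slides, as (s, p, o) triples.
G = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def vertical_partition(graph):
    """Group triples by predicate; each VP table keeps only (s, o) rows."""
    vp = defaultdict(list)
    for s, p, o in graph:
        vp[p].append((s, o))
    return dict(vp)

VP = vertical_partition(G)
for predicate, rows in VP.items():
    print(predicate, rows)
```

A triple pattern with a bound predicate then only has to read the matching VP table instead of the whole graph.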
ExtVP Definition
• star-shaped only → all shapes (which are often neglected)
• VP tables lead to a large reduction of the input size.
• But differences in VP table sizes cause a lot of dangling tuples that go unused.
• → unnecessary I/O and increased memory consumption.
• → ExtVP
11
ExtVP Definition (correlation)
• precompute a number of semi-join reductions of VP_p1
• correlation: co-occurrence of a variable in two triple patterns
Join variables on predicate position are rarely used in SPARQL queries.
12
ExtVP Precomputation
• precompute semi-join reductions for SS, OS and SO correlations
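A minimal sketch of the three reductions on the toy graph follows. It assumes the usual reading of the correlation names: for ExtVP of p1 given p2, the first letter is the column of VP_p1 and the second the column of VP_p2 that share the join variable. The helper names `semi_join` and `ext_vp` are hypothetical.

```python
# VP tables of the toy graph G, as (s, o) rows.
VP = {
    "follows": [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")],
    "likes": [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

S, O = 0, 1  # column positions inside a VP row

def semi_join(left, right, left_col, right_col):
    """Keep the rows of `left` that have at least one join partner in `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]

def ext_vp(vp, p1, p2):
    """Semi-join reductions of VP_p1 with respect to VP_p2 (assumed naming)."""
    return {
        "SS": semi_join(vp[p1], vp[p2], S, S),  # same subject
        "OS": semi_join(vp[p1], vp[p2], O, S),  # object of p1 = subject of p2
        "SO": semi_join(vp[p1], vp[p2], S, O),  # subject of p1 = object of p2
    }

for name, table in ext_vp(VP, "follows", "likes").items():
    print(name, table)
```

On this graph the OS reduction of follows with respect to likes keeps a single row, the SS reduction keeps two, and the SO reduction is empty, so a query that joins the object of a follows pattern with the subject of a likes pattern would read one row instead of four.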
13
green: stored as ExtVP; red: not stored, because equal to VP
ExtVP updatability
• Insertions: not critical, easy to adapt
• Deletions: important.
  ◦ To remove a triple (s, p, o) we have to delete the corresponding tuples from all ExtVP tables.
  ◦ Tradeoff
  ◦ Deletions are not implemented yet.
• Updates: combination of delete and insert.
14
ExtVP and Database Query Optimization
• Spark joins: executed in parallel on all cluster nodes on portions of the data.
• → applying semi-joins on the fly during query processing is less effective.
⇅
• S2RDF precomputes semi-join reductions of VP tables for all possible correlations.
• → they do not have to be computed on the fly, but only once.
15
ExtVP Selectivity Threshold
• Higher selectivity makes the additional storage overhead smaller.
• SF(ExtVP_p1|p2) = |ExtVP_p1|p2| / |VP_p1|.
• Ex) SF(ExtVP_follows|likes) = 1/4 = 0.25.
• k: the number of predicates in an RDF graph G (= 2: follows, likes)
• n = |G|: the number of triples in G (= 7)
• Assumption)
  ◦ All VP tables have equal size n/k
  ◦ SF = 0.5 for all ExtVP tables
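The 1/4 in the example can be reproduced on the toy graph. The sketch assumes the example refers to the OS reduction (object of follows joined with subject of likes), since that is the one that keeps one of the four follows rows. The storage rule in `should_store` is also an assumption: the slides only say that tables equal to VP are not stored and that a threshold can be set, not whether the comparison is strict.

```python
# VP tables of the toy graph G, as (s, o) rows.
VP = {
    "follows": [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")],
    "likes": [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

def ext_vp_os(vp, p1, p2):
    """OS reduction: rows of VP_p1 whose object occurs as a subject in VP_p2."""
    subjects_p2 = {s for s, _ in vp[p2]}
    return [(s, o) for s, o in vp[p1] if o in subjects_p2]

def selectivity_factor(ext_table, vp_table):
    """SF = |ExtVP_p1|p2| / |VP_p1|."""
    return len(ext_table) / len(vp_table)

def should_store(sf, threshold):
    # Assumed rule: skip empty tables (SF = 0) and tables equal to VP (SF = 1),
    # and keep the rest only if SF does not exceed the threshold.
    return 0 < sf < 1 and sf <= threshold

sf = selectivity_factor(ext_vp_os(VP, "follows", "likes"), VP["follows"])
print(sf)                        # 0.25
print(should_store(sf, 0.25))    # True
print(should_store(0.5, 0.25))   # False
```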
16
s p o
A follows B
B follows C
B follows D
C follows D
A likes I1
A likes I2
C likes I2
ExtVP Selectivity Threshold
• But the assumption is overgeneralized.
  ◦ Most of the ExtVP tables will be empty.
• ex) n = 109, k = 86 (sizes in millions of tuples)
  ◦ expected: (3*86 − 1) * 109 / 2 ≒ 14,000 (≒ 128n)
  ◦ in practice: 1199 (≒ 11n)
  ◦ more than 90% of all ExtVP tables were either empty or equal to VP.
• SF ≒ 1: a large overhead while contributing only a negligible performance benefit.
• SF < 0.5: the best performance benefit while causing only little overhead.
• A threshold of 0.25 reduces the size of ExtVP from ∼11n to ∼2n tuples and at the same time provides 95% of the performance benefit.
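The expected-size arithmetic can be checked with a short calculation. The sketch uses 109 for n, as in the slide's formula, and reads the factor (3k − 1) as the number of ExtVP tables per predicate: SS, OS and SO against each of the k predicates, minus the SS table of a predicate with itself. That reading is an assumption; the slide only gives the formula.

```python
def expected_extvp_size(n, k, sf=0.5):
    """Expected ExtVP size if all VP tables have size n/k and every ExtVP table has SF = sf."""
    tables_per_predicate = 3 * k - 1
    # k predicates, each with (3k - 1) tables of size sf * n / k,
    # which simplifies to (3k - 1) * sf * n.
    return tables_per_predicate * sf * n

n, k = 109, 86
expected = expected_extvp_size(n, k)
print(expected)       # 14006.5
print(expected / n)   # 128.5, i.e. about 128n
print(1199 / n)       # 11.0, the observed size of about 11n
```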
17
S2RDF QUERY PROCESSING
• Spark: in-memory cluster computing system that runs on Hadoop.
• Spark SQL: relational interface of Spark.
• Jena ARQ: SPARQL query parser.
• SPARQL query → S2RDF (including Jena ARQ) → equivalent Spark SQL query
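To give a feel for the SPARQL-to-SQL step, here is a rough sketch that turns a basic graph pattern into one SQL query over VP tables. It is not S2RDF's translation: SQLite stands in for Spark SQL so the example runs without a cluster, the table naming `vp_<predicate>` and the function `bgp_to_sql` are invented for the sketch, and the pattern list `Q1` is an assumed query consistent with the result given earlier in the slides.

```python
import sqlite3

# VP tables of the toy graph G, as (s, o) rows.
VP = {
    "follows": [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")],
    "likes": [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

def bgp_to_sql(bgp):
    """Translate triple patterns with bound predicates into one SQL join query.

    Toy code: assumes at least one variable and trusted constant values.
    """
    first_use = {}    # variable -> column where it first occurs
    conditions = []
    tables = []
    for i, (s, p, o) in enumerate(bgp):
        alias = f"t{i}"
        tables.append(f"vp_{p} AS {alias}")
        for term, col in ((s, "s"), (o, "o")):
            ref = f"{alias}.{col}"
            if term.startswith("?"):
                if term in first_use:
                    # Repeated variable: becomes a join condition.
                    conditions.append(f"{ref} = {first_use[term]}")
                else:
                    first_use[term] = ref
            else:
                # Bound term: becomes a selection.
                conditions.append(f"{ref} = '{term}'")
    select = ", ".join(f"{ref} AS {var[1:]}" for var, ref in first_use.items())
    sql = f"SELECT {select} FROM {', '.join(tables)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

# Assumed shape of Q1 (hypothetical).
Q1 = [
    ("?x", "likes", "?w"),
    ("?x", "follows", "?y"),
    ("?y", "follows", "?z"),
    ("?z", "likes", "?w"),
]

conn = sqlite3.connect(":memory:")
for predicate, table_rows in VP.items():
    conn.execute(f"CREATE TABLE vp_{predicate} (s TEXT, o TEXT)")
    conn.executemany(f"INSERT INTO vp_{predicate} VALUES (?, ?)", table_rows)

sql = bgp_to_sql(Q1)
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

With ExtVP, the same query could read a precomputed reduction in place of a full VP table wherever two patterns are correlated; the sketch leaves that table choice out.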
18
Triple Pattern Mapping
19
Algorithm 1; Algorithm 2, 3
Triple Pattern Mapping
20
Triple Pattern Mapping
21
Query Composition
• The order of the resulting subqueries has a severe impact on performance.
• Reducing the amount of intermediate results is very important.
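The effect of the join order on intermediate results can be seen even on the toy graph. The sketch below evaluates the same four patterns in two different orders and records the size of the intermediate result after each join. The patterns are an assumed form of Q1, and the helper names are made up; it does not show how S2RDF actually picks its order.

```python
# Toy graph G from the slides, as (s, p, o) triples.
G = [
    ("A", "follows", "B"), ("B", "follows", "C"),
    ("B", "follows", "D"), ("C", "follows", "D"),
    ("A", "likes", "I1"), ("A", "likes", "I2"), ("C", "likes", "I2"),
]

def matches(tp, graph):
    """Mappings for one triple pattern (variables start with '?')."""
    out = []
    for triple in graph:
        mu = {}
        if all(
            (mu.setdefault(t, v) == v) if t.startswith("?") else (t == v)
            for t, v in zip(tp, triple)
        ):
            out.append(mu)
    return out

def intermediate_sizes(order, graph):
    """Join the patterns left to right and record each intermediate result size."""
    result, sizes = [{}], []
    for tp in order:
        result = [
            {**m1, **m2}
            for m1 in result
            for m2 in matches(tp, graph)
            if all(m1[v] == m2[v] for v in m1.keys() & m2.keys())
        ]
        sizes.append(len(result))
    return sizes

tp1 = ("?x", "likes", "?w")
tp2 = ("?x", "follows", "?y")
tp3 = ("?y", "follows", "?z")
tp4 = ("?z", "likes", "?w")

connected = intermediate_sizes([tp1, tp2, tp3, tp4], G)
with_cross_product = intermediate_sizes([tp1, tp3, tp2, tp4], G)
print(connected)           # [3, 3, 4, 1]
print(with_cross_product)  # [3, 12, 4, 1]
```

Both orders end with the same single result, but the second one joins two patterns that share no variable and therefore builds a cross product of 12 mappings along the way.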
22
23
Evaluation
• 10 machines (1 master and 9 workers)
• [email protected], 2x 2TB disks, 32GB RAM running Ubuntu 14.04 LTS
• Hadoop distribution of Cloudera CDH 5.4
• Spark 1.3
• Compared with Hadoop SPARQL processors:
  ◦ SHARD
  ◦ PigSPARQL
  ◦ Sempala
  ◦ H2RDF+
• and with Virtuoso (centralized)
24
Evaluation
• Dataset: WatDiv Data Generator
  ◦ http://dsg.uwaterloo.ca/watdiv/basic-testing.shtml
  ◦ 5 linear, 7 star, 5 snowflake, 3 complex query graphs.
  ◦ 100 million and 1 billion RDF triples.
• Cache time is excluded (because it is a one-time operation)
25
WatDiv Basic Testing
26
specialized for star queries
WatDiv Basic Testing
27
WatDiv Incremental Linear Testing (IL)
• In Basic Testing, only two queries have a diameter larger than 3 (C1, C2).
• → Incremental Linear Testing
  ◦ IL-1, IL-2, IL-3
  ◦ diameter: 5〜10
  ◦ IL-1: (%u% p1 ?v1 . ?v1 p2 ?v2 . ?v2 p3 ?v3 . ...)
28
F3: diameter = 3
C2: diameter = 5
WatDiv Incremental Linear Testing (IL)
29
WatDiv Incremental Linear Testing (IL)
30
SF Threshold
• + YAGO dataset: 245 million triples.
  ◦ huge semantic knowledge base derived from Wikipedia, WordNet, GeoNames (used by IBM Watson)
• Change the threshold from 0.00 (equal to VP only) to 1.00 (all ExtVP tables, equal to the last experiment)
31
SF Threshold
32
Conclusion
• S2RDF: SPARQL query processor for large-scale RDF data.
• ExtVP: extension of Vertical Partitioning.
  ◦ inspired by semi-join reductions, similar to Join Indices.
  ◦ precomputes the reductions of VP tables for possible join correlations.
  ◦ reduces the input size.
  ◦ works for all query shapes.
  ◦ does not depend on the query diameter.
33