s2rdf: rdf querying with sparql on spark

S2RDF: RDF Querying with SPARQL on Spark

Proceedings of the 2016 VLDB Endowment (PVLDB) 171026 shu

Main Author: Alexander Schätzle
◦ University of Freiburg

Research Interest
◦ Semantic Web
◦ Social Networks
◦ MapReduce
◦ NoSQL

Introduction
• RDF: W3C standard for semantic data modeling.
  • very flexible graph-like data model
• RDF data collections with billions of triples are not unusual.
• → Hadoop: distributed Big Data processing
• But Hadoop is not designed for RDF data management.
• → It is necessary to achieve performance in the same order of magnitude as specialized systems built from the ground up for RDF.

S2RDF (SPARQL on Spark for RDF)
• a SPARQL processor based on the in-memory cluster computing framework Spark.
• ExtVP: Extended Vertical Partitioning
  • based on semi-join reductions
  • can largely reduce the input size of a query
  • allows defining a selectivity threshold
  • applicable to all query shapes (⇄ only for specific shapes, as with HBase, Impala)

RDF & SPARQL
• IRI: identifies a resource (e.g. http://bio2rdf.org/drugbank_vocabulary:)
• triple t = (s, p, o)
  • s: subject, p: predicate, o: object
  • e.g. (A, follows, G), (A, likes, I1)
• graph G = {t1, …, tn}

SPARQL
• Query language for RDF
• triple pattern tp = (sʹ, pʹ, oʹ) with sʹ ∈ {s, ?s}, pʹ ∈ {p, ?p} and oʹ ∈ {o, ?o}
  • with ?: unbound, without ?: bound
• BGP (basic graph pattern): a set of triple patterns
• The single result for Q1 is
  • (?x→A, ?y→B, ?z→C, ?w→I2)

(Figure: SPARQL query Q1)

Variable Definition
• The result of a BGP is a bag of solution mappings.
• V: set of query variables (?x, ?y, ?z, …)
• T: set of valid RDF terms (A, B, C, D, I1, …)
• (solution) mapping μ: partial function μ: V → T
• vars(tp): set of variables contained in tp (?x, ?w for the 1st tp)
• μ(tp): triple that is obtained by substituting the variables in tp according to μ
  • → (A, likes, I1)
• dom(μ): subset of V where μ is defined (?x)
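The definitions above can be made concrete with a small sketch. It assumes triple patterns are plain tuples, variables are strings starting with "?", and a mapping μ is a Python dict; the helper names `vars_of` and `apply_mapping` are illustrative and do not come from the slides.

```python
def is_var(term):
    """A term is a variable if it starts with '?'."""
    return term.startswith("?")

def vars_of(tp):
    """vars(tp): the set of variables contained in a triple pattern."""
    return {term for term in tp if is_var(term)}

def apply_mapping(mu, tp):
    """mu(tp): replace each variable of tp that mu is defined on by its value."""
    return tuple(mu.get(term, term) if is_var(term) else term for term in tp)

tp1 = ("?x", "likes", "?w")        # assumed first triple pattern of Q1
mu = {"?x": "A", "?w": "I1"}       # a partial function V -> T, stored as a dict
dom_mu = set(mu)                   # dom(mu): the variables mu is defined on

print(vars_of(tp1))
print(apply_mapping(mu, tp1))
```

Because μ is partial, applying a mapping that only binds ?x leaves ?w in place.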

SPARQL Answer
• The answer to a triple pattern tp is a bag of mappings
• Ωtp = {μ | dom(μ) = vars(tp), μ(tp) ∈ G}
• The result of a basic graph pattern bgp = {tp1, ..., tpm} is
• Ωbgp = Ωtp1 ⋈ ... ⋈ Ωtpm
• Ex) Ω(?x likes ?y) = {(?x→A, ?y→I1), (?x→A, ?y→I2), (?x→C, ?y→I2)}, obtained from the triples (A, likes, I1), (A, likes, I2), (C, likes, I2)

A follows B

B follows C

B follows D

C follows D

A likes I1

A likes I2

C likes I2

Graph G
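As an illustration of Ωtp and Ωbgp on graph G, here is a minimal sketch. The query text of Q1 did not survive on the slide, so the four triple patterns below are an assumption, chosen to be consistent with the stated single result (?x→A, ?y→B, ?z→C, ?w→I2). The function names are illustrative.

```python
G = {
    ("A", "follows", "B"), ("B", "follows", "C"), ("B", "follows", "D"),
    ("C", "follows", "D"), ("A", "likes", "I1"), ("A", "likes", "I2"),
    ("C", "likes", "I2"),
}

def is_var(term):
    return term.startswith("?")

def match(tp, triple):
    """Return mu with dom(mu) = vars(tp) and mu(tp) == triple, or None."""
    mu = {}
    for pattern_term, value in zip(tp, triple):
        if is_var(pattern_term):
            # A repeated variable must bind to the same value each time.
            if mu.setdefault(pattern_term, value) != value:
                return None
        elif pattern_term != value:
            return None
    return mu

def omega(tp, graph):
    """Omega_tp: the bag of mappings that send tp to a triple of the graph."""
    return [mu for t in sorted(graph) if (mu := match(tp, t)) is not None]

def join(left, right):
    """Join two bags of mappings: merge every compatible pair."""
    return [
        {**m1, **m2}
        for m1 in left
        for m2 in right
        if all(m1[v] == m2[v] for v in m1.keys() & m2.keys())
    ]

def eval_bgp(bgp, graph):
    """Omega_bgp: join the answers of all triple patterns."""
    result = [{}]
    for tp in bgp:
        result = join(result, omega(tp, graph))
    return result

# Assumed form of Q1 (not shown on the slide).
Q1 = [("?x", "likes", "?w"), ("?x", "follows", "?y"),
      ("?y", "follows", "?z"), ("?z", "likes", "?w")]

print(omega(("?x", "likes", "?y"), G))
print(eval_bgp(Q1, G))
```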

RELATED WORK (RDF systems)
• Centralized
  • Virtuoso: centralized RDF system using a relational back-end to materialize RDF data.
• Distributed
  • SHARD: groups RDF data by subject and uses a Clause-Iteration approach for query processing.
  • PigSPARQL: uses a vertical partitioning schema for data representation.
  • H2RDF+: sorted and column-oriented NoSQL key-value store on top of HDFS. Uses six tables for all possible triple permutations (of s, p, o).
  • Sempala: SPARQL-over-SQL approach based on Hadoop. Especially for star-shaped queries.

EXTENDED VERTICAL PARTITIONING
• Direct representation of RDF → Vertical Partitioning (VP)
  • need some or six indexes; easy to manage in distributed HDFS

A follows B

B follows C

B follows D

C follows D

A likes I1

A likes I2

C likes I2

VPfollows (s, o): (A, B), (B, C), (B, D), (C, D)

VPlikes (s, o): (A, I1), (A, I2), (C, I2)
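A minimal sketch of vertical partitioning on the example graph: one two-column (s, o) table per predicate. The function name `vertical_partition` and the dict-of-lists representation are assumptions for illustration only; S2RDF itself stores these tables in Spark.

```python
from collections import defaultdict

TRIPLES = [
    ("A", "follows", "B"), ("B", "follows", "C"), ("B", "follows", "D"),
    ("C", "follows", "D"), ("A", "likes", "I1"), ("A", "likes", "I2"),
    ("C", "likes", "I2"),
]

def vertical_partition(triples):
    """Group triples by predicate into (subject, object) tables."""
    vp = defaultdict(list)
    for s, p, o in triples:
        vp[p].append((s, o))
    return dict(vp)

VP = vertical_partition(TRIPLES)
for predicate, rows in VP.items():
    print(f"VP_{predicate}: {rows}")
```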

ExtVP Definition
• star-shape only → all shapes (other shapes are often neglected)
• VP tables lead to a large reduction of the input size.
• But size differences between VP tables leave a lot of dangling tuples unused.
• → unnecessary I/O and increased memory consumption.
• → ExtVP

ExtVP Definition (correlation)
• precompute a number of semi-join reductions of VPp1
• correlation: co-occurrence of a variable in two triple patterns

Join variables in predicate position are rarely used in SPARQL queries.

ExtVP Precomputation
• precompute semi-join reductions for SS, OS and SO correlations

green: stored as ExtVP; red: not stored, because equal to VP
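The precomputation can be sketched on the example graph as follows. The reading of the correlation names is an assumption here: SS joins subject with subject, OS joins the object of p1 with the subject of p2, and SO joins the subject of p1 with the object of p2. The SS table of a predicate with itself is skipped because it is always identical to the VP table.

```python
VP = {
    "follows": [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")],
    "likes": [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

S, O = 0, 1  # column positions in a VP table
CORRELATIONS = {"SS": (S, S), "OS": (O, S), "SO": (S, O)}

def semi_join(left, right, left_col, right_col):
    """Keep the rows of `left` that have a join partner in `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]

def ext_vp(vp):
    """Semi-join reduction of every VP table against every VP table."""
    tables = {}
    for p1 in vp:
        for p2 in vp:
            for name, (c1, c2) in CORRELATIONS.items():
                if name == "SS" and p1 == p2:
                    continue  # identical to VP_p1, nothing to gain
                tables[(name, p1, p2)] = semi_join(vp[p1], vp[p2], c1, c2)
    return tables

EXT = ext_vp(VP)
for key, rows in EXT.items():
    print(key, rows)
```

With two predicates this yields 3·2² − 2 = 10 tables, several of which are empty or equal to the VP table they were reduced from.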

ExtVP updatability
• Insertions: not critical, easy to adapt
• Deletions: important.
  • To remove a triple (s, p, o) we have to delete the corresponding tuples from all ExtVP tables.
  • Tradeoff
  • Deletions are not implemented yet.
• Updates: combination of delete and insert.

ExtVP and Database Query Optimization
• Spark joins: executed in parallel on all cluster nodes on portions of the data.
• → applying semi-joins on the fly during query processing is less effective.
• S2RDF precomputes semi-join reductions of VP tables for all possible correlations.
• → They do not have to be computed on the fly, but only once.

ExtVP Selectivity Threshold
• Higher selectivity (i.e. a smaller SF) makes the additional storage overhead smaller.
• SF(ExtVPp1|p2) = |ExtVPp1|p2| / |VPp1|.
• Ex) SF(ExtVPfollows|likes) = 1/4 = 0.25.
• k: the number of predicates in an RDF graph G (= 2 (follows, likes))
• n = |G|: the number of triples in G (= 7)
• Assumption)
  • All VP tables have equal size n/k
  • SF = 0.5 for all ExtVP tables

A follows B

B follows C

B follows D

C follows D

A likes I1

A likes I2

C likes I2
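A small sketch of the selectivity factor and of a threshold-based storage decision, using a few ExtVP tables of the example graph written out by hand. The value 0.25 from the slide is reproduced by the OS correlation (object of follows joined with subject of likes). The storage rule in `tables_to_store` is an assumption: empty tables and tables equal to VP are never stored, and the rest are kept only if SF does not exceed the threshold.

```python
VP = {
    "follows": [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")],
    "likes": [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

# A hand-written subset of ExtVP tables, keyed by (correlation, p1, p2).
EXT = {
    ("OS", "follows", "likes"): [("B", "C")],
    ("SS", "follows", "likes"): [("A", "B"), ("C", "D")],
    ("SO", "follows", "likes"): [],
    ("SS", "likes", "follows"): [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

def selectivity(ext_table, vp_table):
    """SF = |ExtVP_p1|p2| / |VP_p1|."""
    return len(ext_table) / len(vp_table)

def tables_to_store(ext_tables, vp, threshold):
    """Assumed rule: keep tables with 0 < SF < 1 and SF <= threshold."""
    keep = set()
    for (corr, p1, p2), rows in ext_tables.items():
        sf = selectivity(rows, vp[p1])
        if 0 < sf < 1 and sf <= threshold:
            keep.add((corr, p1, p2))
    return keep

print(selectivity(EXT[("OS", "follows", "likes")], VP["follows"]))
print(tables_to_store(EXT, VP, threshold=0.25))
```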

ExtVP Selectivity Threshold
• But the assumption is overgeneralized.
  • Most of the ExtVP tables will be empty.
• ex) n = 109, k = 86
  • expected: (3*86-1)*109/2 ≒ 14,000 (≒ 128n)
  • practical: 1199 (≒ 11n)
  • more than 90% of all ExtVP tables were either empty or equal to VP.
• SF ≒ 1: a large overhead while contributing only a negligible performance benefit.
• SF < 0.5: the best performance benefit while causing only little overhead.
• a threshold of 0.25 reduces the size of ExtVP from ∼11n to ∼2n tuples and at the same time provides 95% of the performance benefit
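The expected-size figure can be checked with a few lines of arithmetic. The counting behind the formula is my reading rather than something the slide spells out: with k predicates there are 3k² − k ExtVP tables (SS, OS and SO for every predicate pair, minus the k SS tables of a predicate with itself), each holding SF · n/k tuples, which simplifies to (3k − 1) · n · SF. The value n = 109 is taken from the slide's own formula.

```python
def expected_extvp_tuples(n, k, sf=0.5):
    """(3k^2 - k) tables of size sf * n / k, simplified to (3k - 1) * n * sf."""
    return (3 * k - 1) * n * sf

n, k = 109, 86
expected = expected_extvp_tuples(n, k)
observed = 1199  # the "practical" value reported on the slide

print(expected, expected / n)   # total tuples and multiple of n under the assumptions
print(observed / n)             # multiple of n actually observed
```

Under the uniform assumptions ExtVP would be about 128 times the size of the input, while the measured value is about 11 times.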

S2RDF QUERY PROCESSING
• Spark: in-memory cluster computing system that runs on Hadoop.
• Spark SQL: relational interface of Spark.
• Jena ARQ: SPARQL query parser.
• SPARQL query → S2RDF (including Jena ARQ) → equivalent Spark SQL query
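To give a feel for the SPARQL-to-SQL step, here is a rough sketch that turns a BGP into one SQL query over VP tables. It is not the SQL that S2RDF actually emits: the slides do not show it, so the subquery-per-pattern layout, the `NATURAL JOIN` chaining and the function names are all assumptions. Python's built-in `sqlite3` stands in for Spark SQL so the example runs on its own, and the form of Q1 is assumed as well. The sketch only handles patterns with a bound predicate, at least one variable, and no repeated variable.

```python
import sqlite3

VP = {
    "follows": [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")],
    "likes": [("A", "I1"), ("A", "I2"), ("C", "I2")],
}

def tp_to_sql(tp, alias):
    """One triple pattern -> subquery over its VP table, variables as column names."""
    s, p, o = tp
    cols, conds = [], []
    for col, term in (("s", s), ("o", o)):
        if term.startswith("?"):
            cols.append(f"{col} AS {term[1:]}")
        else:
            conds.append(f"{col} = '{term}'")
    where = " WHERE " + " AND ".join(conds) if conds else ""
    return f"(SELECT {', '.join(cols)} FROM VP_{p}{where}) AS {alias}"

def bgp_to_sql(bgp):
    """Chain the subqueries; shared variable names become join conditions."""
    subqueries = [tp_to_sql(tp, f"t{i}") for i, tp in enumerate(bgp)]
    return "SELECT * FROM " + " NATURAL JOIN ".join(subqueries)

# Assumed form of Q1 (not shown on the slides).
Q1 = [("?x", "likes", "?w"), ("?x", "follows", "?y"),
      ("?y", "follows", "?z"), ("?z", "likes", "?w")]

conn = sqlite3.connect(":memory:")  # stand-in for a Spark SQL session
for table, rows in VP.items():
    conn.execute(f"CREATE TABLE VP_{table} (s TEXT, o TEXT)")
    conn.executemany(f"INSERT INTO VP_{table} VALUES (?, ?)", rows)

sql = bgp_to_sql(Q1)
cur = conn.execute(sql)
names = [d[0] for d in cur.description]
results = [dict(zip(names, row)) for row in cur.fetchall()]
print(sql)
print(results)
```

Swapping a VP table name for a smaller ExtVP table in `tp_to_sql` would be the place where the input-size reduction takes effect.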

Triple Pattern Mapping

(Figure: Algorithm 1; Algorithms 2, 3)

Query Composition
• The order of the result subqueries has a severe impact on performance.
• Reducing the amount of intermediate results is very important.
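The effect of join order on intermediate results can be seen even on the seven-triple example graph. The sketch below joins the (assumed) triple patterns of Q1 in two different orders and records the size of every intermediate result. It illustrates the problem only and is not the ordering heuristic used by S2RDF.

```python
G = [
    ("A", "follows", "B"), ("B", "follows", "C"), ("B", "follows", "D"),
    ("C", "follows", "D"), ("A", "likes", "I1"), ("A", "likes", "I2"),
    ("C", "likes", "I2"),
]

def omega(tp):
    """Bag of mappings for one triple pattern (no repeated variables assumed)."""
    out = []
    for triple in G:
        mu = {}
        for term, value in zip(tp, triple):
            if term.startswith("?"):
                mu[term] = value
            elif term != value:
                break
        else:  # no bound term mismatched
            out.append(mu)
    return out

def join(left, right):
    """Merge every compatible pair of mappings."""
    return [
        {**m1, **m2}
        for m1 in left
        for m2 in right
        if all(m1[v] == m2[v] for v in m1.keys() & m2.keys())
    ]

def intermediate_sizes(order):
    """Size of the running result after each triple pattern is joined in."""
    result, sizes = [{}], []
    for tp in order:
        result = join(result, omega(tp))
        sizes.append(len(result))
    return sizes

tp1, tp2 = ("?x", "likes", "?w"), ("?x", "follows", "?y")
tp3, tp4 = ("?y", "follows", "?z"), ("?z", "likes", "?w")

connected = intermediate_sizes([tp1, tp2, tp3, tp4])  # each step shares a variable
crossing = intermediate_sizes([tp2, tp4, tp3, tp1])   # second step is a cross product
print(connected, crossing)
```

Both orders end with the same single result, but the second one builds a 12-row cross product on the way.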

Evaluation
• 10 machines (1 master and 9 workers)
• Xeon E5-2420 CPU @ 1.90GHz, 2x2TB disks, 32GB RAM, running Ubuntu 14.04 LTS
• Hadoop distribution of Cloudera CDH 5.4
• Spark 1.3
• Compared with Hadoop SPARQL processors:
  • SHARD
  • PigSPARQL
  • Sempala
  • H2RDF+
  • Virtuoso

Evaluation
• Dataset: WatDiv Data Generator
  • http://dsg.uwaterloo.ca/watdiv/basic-testing.shtml
  • 5 linear, 7 star, 5 snowflake, 3 complex query graphs.
  • 100 million and 1 billion RDF triples.
• Exclude cache time (because it is a one-time operation)

WatDiv Basic Testing

(Figure annotation: specialized to star queries)

WatDiv Incremental Linear Testing (IL)
• In Basic Testing, only two queries have a diameter larger than 3 (C1, C2).
• → Incremental Linear Testing
  • IL-1, IL-2, IL-3
  • diameter: 5 to 10
  • IL-1: (%u% p1 ?v1 . ?v1 p2 ?v2 . ?v2 p3 ?v3 . ...)

F3: diameter = 3

C2: diameter = 5


SF Threshold
• + YAGO Dataset: 245 million triples.
  • huge semantic knowledge base derived from Wikipedia, WordNet, GeoNames (used by IBM Watson)
• Change the threshold from 0.00 (equal to VP only) to 1.00 (all ExtVP tables, equal to the last experiment)


Conclusion
• S2RDF: SPARQL query processor for large-scale RDF data.
• ExtVP: Extension of Vertical Partitioning.
  • inspired by semi-join reductions, similar to Join Indices.
  • Precomputes the reductions of the tables in VP for possible join correlations.
  • Reduces the input size.
  • for all query shapes.
  • Does not depend on the query diameter.
