![Page 1: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/1.jpg)
An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce
Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu
COUL – Semantic COmpUting research Lab
![Page 2: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/2.jpg)
Outline
- Introduction
- Background
  - MapReduce, Pig and Join Processing
  - RDF Graph Pattern Matching in Pig
- Approach
  - TripleGroup data model and Nested TripleGroup Algebra (NTGA)
  - Comparing NTGA-based plans and Pig Latin plans for graph pattern matching queries
- Evaluation
- Related Work
- Conclusion and Future Work
2/30
![Page 3: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/3.jpg)
Basics: MapReduce
- Large-scale processing of data on a cluster of commodity-grade machines
- Users encode a task as map / reduce functions, which are executed in parallel across the cluster
- Apache Hadoop* – open-source implementation
- Key Terms
  - Hadoop Distributed File System (HDFS)
  - Slave nodes / Task Tracker – Mappers (Reducers) execute the map (reduce) function
  - Master node / Job Tracker – manages and assigns tasks to Mappers / Reducers

* http://hadoop.apache.org/
3/30
![Page 4: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/4.jpg)
Data Processing on Hadoop
- Supports partition parallelism
- Each MR cycle incurs I/O and communication costs

[Figure: the Job Tracker coordinates Mapper1 … MapperN (map()) and Reducer1 … ReducerM (reduce()). Data flow: Input → HDFS reads → map exec → local writes to disk → sort / shuffle (remote reads) → reduce exec → HDFS writes → Output. Example: the pairs (k1, v1), (k1, v2), (k1, v3) are grouped into (k1, {v1, v2, v3}) and reduced to (k1, val).]
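The map, sort / shuffle, and reduce steps in the figure can be sketched in a few lines of Python. This is only an in-memory illustration, not Hadoop code: `run_mapreduce` and the sample pairs are hypothetical names and data chosen for the example, and the sum used as the reduce function is an arbitrary choice.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce cycle in memory (illustrative only)."""
    # Map phase: every input record yields zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            # Shuffle: values sharing a key are routed to the same reducer.
            intermediate[key].append(value)
    # Reduce phase: each key is processed together with all of its values.
    return {key: reduce_fn(key, values)
            for key, values in sorted(intermediate.items())}


# (k1, v1), (k1, v2), (k1, v3) -> (k1, {v1, v2, v3}) -> (k1, val)
pairs = [("k1", 1), ("k1", 2), ("k1", 3), ("k2", 10)]
result = run_mapreduce(
    pairs,
    map_fn=lambda kv: [kv],                   # identity map
    reduce_fn=lambda key, values: sum(values)  # hypothetical aggregate
)
```

On a real cluster the intermediate pairs are written to local disk and fetched remotely by the reducers, which is where the per-cycle I/O and communication costs come from.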
4/30
![Page 5: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/5.jpg)
Joins in MapReduce
- Map phase – scan input records
  - map function annotates each record based on the join column, e.g. (joinKey, Record)
- Reduce phase – records with the same joinKey are collected by the same reduce task
  - reduce function joins the tuples
  - Output is written into HDFS

[Figure: Single Join Workload]
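A minimal sketch of the join described above, assuming an in-memory stand-in for the map and reduce phases. The function name `reduce_side_join` and the sample relations are hypothetical; the column indices are passed in explicitly.

```python
from collections import defaultdict


def reduce_side_join(left, right, left_col, right_col):
    """Equi-join two lists of tuples the way a reduce-side join would."""
    # Map phase: annotate each record with its join key and its source side.
    buckets = defaultdict(lambda: ([], []))
    for rec in left:
        buckets[rec[left_col]][0].append(rec)
    for rec in right:
        buckets[rec[right_col]][1].append(rec)

    # Reduce phase: records with the same join key meet in one reduce call,
    # which emits every left/right combination.
    joined = []
    for key in sorted(buckets):
        lefts, rights = buckets[key]
        for l in lefts:
            for r in rights:
                joined.append(l + r)
    return joined


# Hypothetical sample data: (srcIP, destURL) and (pageRank, pageURL).
user_visits = [("158.112.27.3", "url1"),
               ("150.121.18.6", "url1"),
               ("158.112.27.3", "url5")]
ranking = [(11, "url1"), (23, "url2")]
joined = reduce_side_join(user_visits, ranking, left_col=1, right_col=1)
```

Keys that appear on only one side (url2 and url5 here) produce no output, which is the usual inner-join behaviour.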
5/30
![Page 6: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/6.jpg)
Data Processing in Pig
- Express data flow using high-level query primitives
  - usability, code reuse, automatic optimization
- Pig Latin
  - Data model: atom, tuple, bag (nesting)
  - Operators: LOAD, STORE, JOIN, GROUP BY, COGROUP, FOREACH, SPLIT, aggregate functions
  - Example: equijoin on REL A (column 0) and REL B (column 1)
    JOIN A by $0, B by $1;
- Extensibility support via UDFs
- Dataflow is compiled into a workflow of MapReduce jobs
6/30
![Page 7: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/7.jpg)
Example Pig Query Plan

SELECT ?vlabel ?hpage ?price ?prod
WHERE { ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
        ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod . }

A = LOAD Input.rdf FILTER(homepage)
B = LOAD Input.rdf FILTER(label)
T1 = JOIN A ON Sub, B ON Sub;    (MR1)
C = LOAD Input.rdf FILTER(country)
T2 = JOIN C ON Sub, T1 ON Sub;    (MR2)
…….
H = LOAD Input.rdf FILTER(product)
T3 = JOIN H ON Sub, T7 ON Sub;    (MR6)
STORE

- #MR cycles = #Joins = 6
- Total cost = (I/O & communication costs) * 6
- Loads have I/Os as well: expensive!!! (SPLIT Operator)*
7/30
![Page 8: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/8.jpg)
Possible Optimizations: m-way Join

SELECT ?vlabel ?hpage ?price ?prod
WHERE { ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
        ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod . }

- The query is a join between two stars (one rooted at ?v, one rooted at ?o)

[Figure: three MapReduce cycles (MR1, MR2, MR3) starting from the HDFS input compute the star-joins SJ1 and SJ2 and the join J1 between the stars; each cycle has a map phase, a reduce phase, and a write to disk.]

- #MR cycles reduced from 6 to 3
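The cycle counts on this slide can be reproduced with a small piece of arithmetic. The sketch below assumes one MR cycle per binary join in the pairwise plan, one MR cycle per star in the m-way plan, and one further cycle for each join that connects two stars; the function names are hypothetical.

```python
def mr_cycles_pairwise(star_sizes):
    """One MR cycle per binary join: (#triple patterns - 1) joins in total."""
    return sum(star_sizes) - 1


def mr_cycles_mway(star_sizes):
    """One MR cycle per star-join plus one per join between stars.

    Assumes the stars are connected in a chain, so n stars need n - 1
    joins between them.
    """
    return len(star_sizes) + (len(star_sizes) - 1)


# The example query: a 3-edge star on ?v and a 4-edge star on ?o.
stars = [3, 4]
pairwise = mr_cycles_pairwise(stars)
mway = mr_cycles_mway(stars)
```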
8/30
![Page 9: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/9.jpg)
BUT? Still expensive!
- 1 MR cycle per star-join
- Many pattern matching queries involve multiple star-join subpatterns
  - 50% of BSBM* benchmark queries have two or more star patterns
- Our proposal: coalesce the computation of ALL star-join subpatterns into a single MR cycle
- How? Don't think of them as a set of joins! Think of them as a GROUP BY operation
9/30
![Page 10: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/10.jpg)
| Sub | Prop | Obj |
|---|---|---|
| &V1 | type | Vendor |
| &V1 | label | Vendor1 |
| &V1 | country | US |
| &V1 | homepage | www.ven... |
| &Offer1 | type | Offer |
| &Offer1 | vendor | &V1 |
| &Offer1 | product | &P1 |
| &Offer1 | price | 108 |
| &Offer1 | delDays | 2 |
| &Offer1 | validToDate | 01/01/2011 |
| &Offer1 | validFromDate | 08/01/2011 |
| &Rev1 | type | Review |
| &Rev1 | reviewFor | &P1 |
| &Rev1 | rating1 | 9 |
| &Rev1 | reviewer | &R1 |

WHERE { ?v homepage ?hpage . ?v label ?vlabel . ?v country ?vcountry .
        ?o vendor ?v . ?o price ?price . ?o delDays ?delDays . ?o product ?prod . }

GROUP BY Subject: 1 MapReduce cycle!!!
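The idea of computing every star in one grouping pass can be sketched as follows. This is an in-memory illustration under stated assumptions, not the RAPID+ implementation: `match_stars`, the star names, and the rule that a subject matches a star when it has at least the required properties are all assumptions made for the example.

```python
from collections import defaultdict

triples = [
    ("&V1", "type", "Vendor"), ("&V1", "label", "Vendor1"),
    ("&V1", "country", "US"), ("&V1", "homepage", "www.ven..."),
    ("&Offer1", "type", "Offer"), ("&Offer1", "vendor", "&V1"),
    ("&Offer1", "product", "&P1"), ("&Offer1", "price", "108"),
    ("&Offer1", "delDays", "2"),
    ("&Rev1", "type", "Review"), ("&Rev1", "reviewFor", "&P1"),
]

# Properties required by each star subpattern of the example query.
STARS = {
    "vendor_star": {"homepage", "label", "country"},
    "offer_star": {"vendor", "price", "delDays", "product"},
}


def match_stars(triples, stars):
    """Group triples by Subject once, then test each group against every star."""
    wanted = set().union(*stars.values())
    groups = defaultdict(list)
    for s, p, o in triples:
        if p in wanted:                 # drop triples no star asks for
            groups[s].append((s, p, o))
    matches = defaultdict(list)
    for subject, group in groups.items():
        props = {p for _, p, _ in group}
        for name, required in stars.items():
            if required <= props:       # all required edges are present
                matches[name].append(subject)
    return dict(matches)


matches = match_stars(triples, STARS)
```

Both stars are resolved by the same grouping pass, which corresponds to the single MR cycle claimed on the slide.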
10/30
![Page 11: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/11.jpg)
What are we proposing?
A new data model (TripleGroup) and algebra (Nested TripleGroup Algebra - NTGA) for more efficient graph pattern matching on MapReduce platforms
11/30
![Page 12: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/12.jpg)
12/30
![Page 13: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/13.jpg)
Our Approach: RAPID+
- Goal: minimize I/O and communication costs by reducing MR cycles
  - Reinterpret and refactor operations into a more suitable (coalesced) set of operators – NTGA
- Foundation:
  - Reinterpreting multiple star-joins as a grouping operation leads to "groups of triples" (TripleGroups) instead of n-tuples
  - Different structure BUT "content equivalent"
  - NTGA – algebra on TripleGroups
13/30
![Page 14: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/14.jpg)
NTGA – Data Model
- Data model based on nested TripleGroups
  - More naturally captures graphs
- TripleGroup – group of triples sharing the Subject / Object component
  - Can be nested at the Object component

TripleGroup:
{(&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2)}

Nested TripleGroup:
{(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.vendors….)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
14/30
![Page 15: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/15.jpg)
NTGA Operators…(1)
- TG_Unnest – unnest a nested TripleGroup
  Input: {(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
  Output: {(&Offer1, price, 108), (&Offer1, vendor, &V1), (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
- TG_Flatten – generate the equivalent n-tuple ("content equivalence")
  Input (triples t1, t2, t3): {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...)}
  Output: (&V1, label, vendor1, &V1, country, US, &V1, homepage, www.ven...)
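One way to picture the data model and these two operators is with plain Python lists. The representation below is an assumption made for illustration: a TripleGroup is a list of (subject, property, object) tuples, and a nested TripleGroup stores another such list in the object position. `tg_unnest` and `tg_flatten` are hypothetical helpers named after the operators, not the RAPID+ code.

```python
def tg_unnest(group):
    """Replace each nested object by its root subject and inline its triples."""
    flat = []
    for s, p, o in group:
        if isinstance(o, list):               # nested TripleGroup as Object
            inner = tg_unnest(o)
            flat.append((s, p, inner[0][0]))  # subject shared by the inner group
            flat.extend(inner)
        else:
            flat.append((s, p, o))
    return flat


def tg_flatten(group):
    """Concatenate the triples of an un-nested TripleGroup into one n-tuple."""
    return tuple(component for triple in group for component in triple)


vendor_group = [("&V1", "label", "vendor1"),
                ("&V1", "country", "US"),
                ("&V1", "homepage", "www.ven..")]
nested = [("&Offer1", "price", "108"),
          ("&Offer1", "vendor", vendor_group),
          ("&Offer1", "product", "&P1"),
          ("&Offer1", "delDays", "2")]

unnested = tg_unnest(nested)
flattened = tg_flatten(vendor_group)
```

The flattened tuple carries the same values as the TripleGroup, only in a different structure, which is the sense of "content equivalence" used on the slide.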
15/30
![Page 16: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/16.jpg)
NTGA Operators…(2)
- TG_GroupFilter – retain only TripleGroups that satisfy the required query substructure
  - Structure-based filtering
  - Eliminates TripleGroups with missing triples (edges)

Example: TG_GroupFilter(TG, {price, vendor, delDays, product})

Input TG:
{ { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..) },
  { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) },
  { (&Offer2, vendor, &V2), (&Offer2, product, &P3), (&Offer2, delDays, 1) } }

Output TG{price, vendor, delDays, product}:
{ (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }
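Structure-based filtering can be sketched as a subset test on the properties present in each group. The helper `tg_group_filter` is hypothetical, and treating a group as matching when it contains at least the required properties is an assumption; the slide only states that groups with missing edges are eliminated.

```python
def tg_group_filter(groups, required_props):
    """Keep only the TripleGroups that contain every required property."""
    required = set(required_props)
    return [g for g in groups if required <= {p for _, p, _ in g}]


tg = [
    [("&V1", "label", "vendor1"), ("&V1", "country", "US"),
     ("&V1", "homepage", "www.ven..")],
    [("&Offer1", "price", "108"), ("&Offer1", "vendor", "&V1"),
     ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", "2")],
    # &Offer2 has no price edge, so it cannot match the offer star.
    [("&Offer2", "vendor", "&V2"), ("&Offer2", "product", "&P3"),
     ("&Offer2", "delDays", "1")],
]

offers = tg_group_filter(tg, {"price", "vendor", "delDays", "product"})
```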
16/30
![Page 17: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/17.jpg)
NTGA Operators…(3)
- TG_Filter – filter out triples that do not satisfy the filter condition (FILTER clause)
  - Value-based filtering
  - Eliminates TripleGroups with triples that do not satisfy the filter condition

Example: TG_Filter price<200 (TG)

Input TG{price, vendor, delDays, product}:
{ { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) },
  { (&Offer3, vendor, &V2), (&Offer3, product, &P3), (&Offer3, price, 306), (&Offer3, delDays, 1) } }

Output TG{price, vendor, delDays, product}:
{ (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }
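Value-based filtering can be sketched in the same style. `tg_filter` is a hypothetical helper: it drops a whole TripleGroup when any of its triples for the filtered property fails the condition, as in the example above. Dropping groups that lack the property altogether is an additional assumption.

```python
def tg_filter(groups, prop, predicate):
    """Keep TripleGroups whose `prop` values all satisfy `predicate`."""
    kept = []
    for group in groups:
        values = [o for _, p, o in group if p == prop]
        # Assumption: a group without the property does not satisfy the filter.
        if values and all(predicate(v) for v in values):
            kept.append(group)
    return kept


tg = [
    [("&Offer1", "price", 108), ("&Offer1", "vendor", "&V1"),
     ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", 2)],
    [("&Offer3", "vendor", "&V2"), ("&Offer3", "product", "&P3"),
     ("&Offer3", "price", 306), ("&Offer3", "delDays", 1)],
]

cheap_offers = tg_filter(tg, "price", lambda price: price < 200)
```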
17/30
![Page 18: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/18.jpg)
NTGA Operators…(4)
- TG_Join – join between TripleGroups of different structure, based on join triple patterns

Example: join on the triple patterns ?o vendor ?v and ?v country ?vcountry

Input TG{price, vendor, delDays, product}:
{ (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) }

Input TG{label, country, homepage}:
{ (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven...) }

Output of TG_Join:
{(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
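A sketch of the join, assuming the list-of-tuples representation of TripleGroups and an Object-to-Subject join as in the example (?o vendor ?v joined with the star rooted at ?v). `tg_join` is a hypothetical helper, and dropping left groups that find no partner is an assumption about inner-join behaviour.

```python
def tg_join(left_groups, right_groups, join_prop):
    """Nest a right TripleGroup wherever a left triple's object is its subject."""
    right_by_subject = {g[0][0]: g for g in right_groups if g}
    result = []
    for group in left_groups:
        nested, matched = [], False
        for s, p, o in group:
            if p == join_prop and isinstance(o, str) and o in right_by_subject:
                nested.append((s, p, right_by_subject[o]))  # nest at the Object
                matched = True
            else:
                nested.append((s, p, o))
        if matched:                 # assumption: unmatched groups are dropped
            result.append(nested)
    return result


offers = [[("&Offer1", "price", "108"), ("&Offer1", "vendor", "&V1"),
           ("&Offer1", "product", "&P1"), ("&Offer1", "delDays", "2")]]
vendors = [[("&V1", "label", "vendor1"), ("&V1", "country", "US"),
            ("&V1", "homepage", "www.ven..")]]

joined = tg_join(offers, vendors, "vendor")
```

The result keeps the offer triples intact and nests the vendor TripleGroup in the object position of the vendor edge, matching the nested TripleGroup shown on the slide.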
18/30
![Page 19: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/19.jpg)
Pattern Matching using NTGA in Pig

Pipeline: LoadFilter (load + TG_Filter) → StarGroupFilter (TG_GroupBy + TG_GroupFilter) → RDFJoin (TG_Join)

Input triples:

| Subject | Property | Object |
|---|---|---|
| &V1 | type | VENDOR |
| &V1 | label | Vendor1 |
| &V1 | country | US |
| &V1 | homepage | www.ven... |
| &Offer1 | type | OFFER |
| &Offer1 | vendor | &V1 |
| &Offer1 | product | &P1 |
| &Offer1 | price | 108 |
| &Offer1 | delDays | 2 |
| &Offer1 | validToDate | 01/01/2011 |
| &Offer1 | validFromDate | 08/01/2011 |

After LoadFilter:

| Subject | Property | Object |
|---|---|---|
| &V1 | label | Vendor1 |
| &V1 | country | US |
| &V1 | homepage | www.ven.. |
| &Offer1 | vendor | &V1 |
| &Offer1 | product | &P1 |
| &Offer1 | price | 108 |
| &Offer1 | delDays | 2 |

After StarGroupFilter:
{ { (&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..) },
  { (&Offer1, price, 108), (&Offer1, vendor, &V1), (&Offer1, product, &P1), (&Offer1, delDays, 2) } }

After RDFJoin:
{(&Offer1, price, 108), (&Offer1, vendor, {(&V1, label, vendor1), (&V1, country, US), (&V1, homepage, www.ven..)}), (&Offer1, product, &P1), (&Offer1, delDays, 2)}
![Page 20: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/20.jpg)
Mapping to Pig Latin/Relational Algebra
20/30
![Page 21: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/21.jpg)
RDFMap: Efficient Data Representation
Compact representation of intermediate results during TripleGroup based processing
Efficient look-up of triples matching a given Property type via property-based indexing scheme
Ability to represent structure-label information for groups of triples.
21/30
![Page 22: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/22.jpg)
22/30
![Page 23: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/23.jpg)
Evaluation
- Setup: 5-node to 25-node Hadoop clusters on NCSU's Virtual Computing Lab*
- Dataset: synthetic benchmark dataset generated using the BSBM** tool (max. 40GB data – approx. 175 million triples)
- Evaluation of Pig (Pig_opt) vs. RAPID+
  - Task 1 – Scalability with size of RDF graphs
  - Task 2 – Scalability with denser star patterns
  - Task 3 – Scalability with increasing cluster sizes

*https://vcl.ncsu.edu
**http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/
23/30
![Page 24: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/24.jpg)
Experimental Results…(1)
Cost Analysis across Increasing Size of RDF Graphs (5-node)

Key Observations:
- Benefit of TripleGroup-based processing seen across data sizes – up to 60% in some cases (RAPID+ << Pig_opt < Pig)
- Pig approaches did not complete for large data size
24/30
![Page 25: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/25.jpg)
Experimental Results…(2)
Cost Analysis across Increasing Star Density

%gain of RAPID+ over Pig (10-node / 32GB) (5-node / 20GB)

| Query | #Triple Patterns | #Edges in Stars | %gain |
|---|---|---|---|
| Q1 | 3 | 1:2 | 56.8 |
| Q2 | 4 | 2:2 | 46.7 |
| Q3 | 5 | 2:3 | 47.8 |
| Q4 | 6 | 3:3 | 51.6 |
| Q5 | 7 | 3:4 | 57.4 |
| Q6 | 8 | 4:4 | 58.4 |
| Q7 | 9 | 5:4 | 58.6 |
| Q8 | 10 | 6:4 | 57.3 |
| Q9* | 6 | 2:4 | 65.4 |
| Q10* | 10 | 2:4:4 | 61.5 |

Key Observations:
- RAPID+ maintains a consistent %gain of about 50% across the varying density (46.7% to 58.6% for Q1 to Q8)
- Cost savings by eliminating redundant Subject values and join triples
25/30
![Page 26: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/26.jpg)
Experimental Results…(3)
Cost Analysis across Increasing Cluster Sizes

Query pattern with three star-joins and two chain-joins (32GB)

Key Observations:
- RAPID+ has 56% gain for the 10-node cluster over Pig approaches
- Pig approaches catch up with increasing cluster size
  - Increasing nodes decreases the probability of disk spills with the SPLIT approach
- RAPID+ still maintains 45% gain across the experiments
26/30
![Page 27: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/27.jpg)
And some Updates…
- Additional evaluation
  - Up to 65% performance gain on another synthetic benchmark dataset* for three/two star-join queries
  - Experiments extended to 1 billion 3-ary triples (43GB) – 31% (10-node) to 41% (30-node) performance gain
- RAPID+ now includes a SPARQL interface
- In future: cost-based optimizations to select Pig vs. NTGA execution plans
- Join us for a demo of RAPID+ @ VLDB 2011**

*Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A Comparison of Approaches to Large-scale Data Analysis. In: Proc. of the 35th SIGMOD International Conference on Management of Data (2009)
**Kim, H., Ravindra, P., Anyanwu, K.: From SPARQL to MapReduce: The Journey using a Nested TripleGroup Algebra. To appear in: Proc. International Conference on Very Large Data Bases (VLDB 2011)
27/30
![Page 28: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/28.jpg)
Related Work

MapReduce-based Processing
- Indexing, Partitioning Schemes, Rule-based Optimizations: [Newman08]*, [Hunter08]*, [Afrati10], [Husain10]*, Hadoop++ [Dittrich10], HadoopDB [Abadi09]
- Reasoning: [Urbani07]*
- High-level Dataflow Languages: Pig Latin [Olston08], [HiveQL], [JAQL]
- Other Extensions: Map-Reduce-Merge [Yang07]
28/30
![Page 29: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/29.jpg)
Conclusion
- TripleGroup-based processing for evaluating pattern matching queries on MapReduce platforms
- NTGA operators refactored to
  - minimize #MR cycles, and thereby minimize costs
  - reduce costs of repeated data handling via operator coalescing
- Efficient data representation (RDFMap)
29/30
![Page 30: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/30.jpg)
References

[Dean04] Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51 (2008) 107–113

[Olston08] Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig latin: a not-so-foreign language for data processing. In: Proc. International Conference on Management of data. (2008)

[Abadi09] Abouzied, A., Bajda-Pawlikowski, K., Huang, J., Abadi, D.J., Silberschatz, A.: Hadoopdb in action: building real world applications. In: Proc. International Conference on Management of data. (2010)

[Sridhar09] Sridhar, R., Ravindra, P., Anyanwu, K.: RAPID: Enabling scalable ad-hoc analytics on the semantic web. ISWC 2009

[Yu08] Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, U., Gunda, P.K., Currey, J.: DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI 2008

[Newman08] Newman, A., Li, Y.F., Hunter, J.: Scalable semantics: The silver lining of cloud computing. In: eScience. IEEE International Conference on. (2008)

[Hunter08] Newman, A., Hunter, J., Li, Y., Bouton, C., Davis, M.: A scale-out rdf molecule store for distributed processing of biomedical data. In: Semantic Web for Health Care and Life Sciences Workshop. (2008)

[Urbani07] Urbani, J., Kotoulas, S., Oren, E., Harmelen, F.: Scalable distributed reasoning using mapreduce. In: Proc. International Semantic Web Conference. (2009)

[Abadi07] Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 2007

[Dittrich10] Dittrich, J., Quiane-Ruiz, J., Jindal, A., Kargin, Y., Setty, V., Schad, J.: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). VLDB 2010/PVLDB

[Yang07] Yang, H., Dasdan, A., Hsiao, R., Parker Jr., D.S.: Map-reduce-merge: simplified relational data processing on large clusters. SIGMOD 2007

[Afrati10] Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proc. International Conference on Extending Database Technology. (2010)

[Husain10] Husain, M., Khan, L., Kantarcioglu, M., Thuraisingham, B.: Data intensive query processing for large rdf graphs using cloud computing tools. In: Cloud Computing (CLOUD), IEEE International Conference on. (2010)

[HiveQL] http://hadoop.apache.org/hive/

[JAQL] http://code.google.com/p/jaql
30/30
![Page 31: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/31.jpg)
Thank You!
![Page 32: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/32.jpg)
Environment
- Node Specifications
  - Single / duo core Intel X86
  - 2.33 GHz processor speed
  - 4G memory
  - Red Hat Linux
- Pig 0.5.0
- Hadoop 0.20
- Block size 256MB
![Page 33: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/33.jpg)
Benchmark Data*
- Log files of HTTP server traffic
- Column-delimited text file
- Rankings: pageRank | PageURL | avgDuration
- UserVisits: sourceIPAddr | destinationURL | visitDate | adRevenue | UserAgent | cCode | lCode | sKeyword | avgTimeOnSite

*Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A Comparison of Approaches to Large-scale Data Analysis. In: Proc. of the 35th SIGMOD International Conference on Management of Data (2009)
![Page 34: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/34.jpg)
Scripts (Q1)

A = load '/data/' using PigStorage(' ');
A1 = filter A by $1 eq 'pageRank' or $1 eq 'pageURL' or $1 eq 'destURL' or $1 eq 'srcIP' or $1 eq 'adRevenue' or ($1 eq 'type' and ($2 eq 'Ranking' or $2 eq 'UserVisits'));
B = group A1 by $0 PARALLEL 5;
C = foreach B generate flatten(ReassembleRDF($1,'pageURL|destURL','1'));
D = group C by $0 PARALLEL 5;
E = foreach D generate flatten(ReassembleRDF($1,'srcIP|','2')) as (srcIP:chararray, vals:bytearray);
store E into '/q1_app1';
![Page 35: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/35.jpg)
Scripts (Q1)

A1 = load '/data/' using PigStorage(' ');
split A1 into pageRank IF $1 eq 'pageRank', srcIP IF $1 eq 'srcIP', pageURL IF $1 eq 'pageURL', destURL IF $1 eq 'destURL', adRevenue IF $1 eq 'adRevenue', typeRanking IF $1 eq 'type' and $2 eq 'Ranking', typeUV IF $1 eq 'type' and $2 eq 'UserVisits';
Ranking = join pageURL by $0, pageRank by $0, typeRanking by $0 PARALLEL 5;
UserVisits = join srcIP by $0, destURL by $0, adRevenue by $0, typeUV by $0 PARALLEL 5;
C1 = join Ranking by $2, UserVisits by $5 PARALLEL 5;
D1 = foreach C1 generate $11, $17, $5;
store D1 into '/q1_app2';
![Page 36: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/36.jpg)
Experiment Results

$$\text{Percentage Performance Gain} = \frac{(\text{exec time 1}) - (\text{exec time 2})}{(\text{exec time 1})}$$
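The gain formula can be written as a small helper. The function name `percentage_gain` and the sample execution times are hypothetical, and the multiplication by 100 to express the ratio as a percentage is an assumption that is not spelled out on the slide.

```python
def percentage_gain(exec_time_1, exec_time_2):
    """Relative reduction of exec_time_2 with respect to exec_time_1, in percent."""
    if exec_time_1 <= 0:
        raise ValueError("exec_time_1 must be positive")
    # Assumption: the ratio is scaled by 100 to report a percentage.
    return 100.0 * (exec_time_1 - exec_time_2) / exec_time_1


# Hypothetical timings: a baseline of 200 s and an improved run of 100 s.
gain = percentage_gain(200.0, 100.0)
```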
![Page 37: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/37.jpg)
Possible Optimizations (2)
- Coalesce join operations into as few MR cycles as possible
- Compute star patterns via m-way JOIN
  - Star-join using m-way JOIN = 1 MR cycle
  - Reduced #MR cycles, hence reduced I/O + communication costs
![Page 38: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/38.jpg)
Structured Data Processing in Pig

UserVisits:

| srcIP | destURL | visitDate | adRevenue | … |
|---|---|---|---|---|
| 158.112.27.3 | url1 | 1979/12/12 | 339.08142 | …. |
| 158.112.27.3 | url5 | 1979/12/15 | 180.334 | …. |
| 150.121.18.6 | url1 | 1979/12/28 | 550.7889 | …. |
| … | … | … | … | … |

Ranking:

| pageRank | pageURL | avgDur |
|---|---|---|
| 11 | url1 | 96 |
| 23 | url2 | 3 |
| 18 | url3 | 87 |
| … | … | … |

Query: Retrieve the pageRank and adRevenue of pages visited by particular users between "1979/12/01" and "1979/12/30"

Plan:
- LOAD UserVisits, then FILTER(visitDate)
- LOAD Ranking
- JOIN UserVisits ON destURL, Ranking ON pageURL;
- STORE
![Page 39: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/39.jpg)
JOIN: Pig Latin → MapReduce

JOIN UserVisits ON destURL, Ranking ON pageURL;

[Figure: in the map phase, each UserVisits tuple and each Ranking tuple is annotated with its join key (destURL or pageURL). Tuples are then packaged by key: the url1 tuples go to Reducer 1 and the url2 tuples go to Reducer 2. In the reduce phase, each reducer joins the UserVisits and Ranking tuples that share its key and emits joined tuples containing srcIP, destURL, pageURL, pageRank, and the remaining columns.]
![Page 40: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/40.jpg)
RDF Data Model (Resource Description Framework)
- Statements (triples): Subject Prop Object, e.g. (&UV1 srcIP 158.112.27.3)
- Graph representation

Ranking:

| Sub | Prop | Obj |
|---|---|---|
| &R1 | type | Ranking |
| &R1 | pageRank | 11 |
| &R1 | pageURL | url1 |
| &R1 | avgDuration | 97 |

UserVisits:

| Sub | Prop | Obj |
|---|---|---|
| &UV1 | type | UserVisits |
| &UV1 | srcIP | 158.112.27.3 |
| &UV1 | destURL | url1 |
| &UV1 | adRevenue | 339.08142 |
| &UV1 | visitDate | 1979/12/12 |
| &UV1 | userAgent | SCOPE |
| &UV1 | cCode | VNM |
| &UV1 | iCode | VNM-KH |
| &UV1 | sKeyword | comets |
| &UV1 | avgTime | 3 |
![Page 41: An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Padmashree Ravindra, HyeongSik Kim, Kemafor Anyanwu COUL – Semantic COmpUting](https://reader034.vdocuments.us/reader034/viewer/2022051315/56649f005503460f94c15921/html5/thumbnails/41.jpg)
Example SPARQL Query

Data: Description of Vendors, their product Offers, and Reviews of products (BSBM* dataset)

| Sub | Prop | Obj |
|---|---|---|
| &V1 | type | Vendor |
| &V1 | label | Vendor1 |
| &V1 | country | US |
| &V1 | homepage | www.ven... |
| &Offer1 | type | Offer |
| &Offer1 | vendor | &V1 |
| &Offer1 | product | &P1 |
| &Offer1 | price | 108 |
| &Offer1 | delDays | 2 |
| &Offer1 | validToDate | 01/01/2011 |
| &Offer1 | validFromDate | 08/01/2011 |
| &Rev1 | type | Review |
| &Rev1 | reviewFor | &P1 |
| &Rev1 | rating1 | 9 |
| &Rev1 | reviewer | &R1 |

Query: Retrieve the details of US-based Vendors

SELECT ?vlabel ?hpage
WHERE { ?v type Vendor . ?v country ?vcountry . ?v label ?vlabel . ?v homepage ?hpage .
        FILTER (?vcountry = "US") }

*http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/