graphx : graph processing in a distributed dataflow...
TRANSCRIPT
![Page 1: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/1.jpg)
GraphX : Graph Processing in a Distributed Dataflow Framework
OSDI 2014
BidyutHota
![Page 2: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/2.jpg)
Agenda
• Analytics space background
• Motivation
• Goal
• Approach
• Optimizations
• Results
• Flaws/Limitations
• Questions
![Page 3: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/3.jpg)
Real life Analytics Pipeline
Link Table Page Rank Desired results Raw data
Eg. Google Knowledge graph :570MVertices, 18B Edges ( as in Mid 2017)
![Page 4: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/4.jpg)
Real life Analytics Pipeline
Link Table Page Rank Desired results Raw data
Tables
![Page 5: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/5.jpg)
Real life Analytics Pipeline
Link Table Page Rank Desired results Raw data
Graphs
![Page 6: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/6.jpg)
Systems landscape
![Page 7: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/7.jpg)
Motivation
• Currently separate systems exist to compute on these data representation. • Ability to combine data
sources. • Enhance dataflow
frameworks to leverage inherent positives.
![Page 8: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/8.jpg)
Current drawbacks of dataflow frameworks
• Implementing iterative algorithms -> requires multiple stages of complex joins. • Do not cover common patterns in
graph algorithms -> Room for optimization. • Unlike Spark, no fine grained control
of data partitioning.
![Page 9: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/9.jpg)
Current drawbacks of specialized systems
• Lacking ability for combining graphs with unstructured or tabular data • Systems favoring snapshot recovery
rather than fault tolerance like in Spark
![Page 10: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/10.jpg)
What can we leverage?
• Immutability of RDD’s • Reusing indices across graph and
collection views over iterations. • Increase in performance
![Page 11: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/11.jpg)
Goal
• General purpose distributed frameworks for graph computations • Comparable performances to
specialized graph processing systems
![Page 12: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/12.jpg)
Approach
• Unifies Tabular view and Graph view
• Imbibe the best of specialized systems
• Graph representation on dataflow frameworks
• Optimizations • Develop GraphX API on top of Spark
![Page 13: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/13.jpg)
Graph approach: Page Rank example
• Eg. Page Rank algorithm • Graph parallel abstraction • Define a vertex program • Terminate when vertex programs vote to halt
Figure : PageRank in Pregel
![Page 14: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/14.jpg)
Approach
• GAS (Gather Apply Scatter)
How to apply this in dataflow frameworks? • Map, group-by, join dataflow operators
![Page 15: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/15.jpg)
Representing Property graphs as Tables
Never transfer edges
![Page 16: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/16.jpg)
GraphX API
![Page 17: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/17.jpg)
Using the dataflow operators
Logical representation Join of vertices table on edges table
![Page 18: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/18.jpg)
Using the dataflow operators on vertex program
Userdefined
![Page 19: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/19.jpg)
Optimizations
SpecializedDataStructure Vertex-cutPartitioning Remotecaching
ActiveSetTracking
![Page 20: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/20.jpg)
Implementing Optimizations
• Reusable Hash index • Sequential scan or clustered scan based on active set (Dynamic) • Incremental updates • Automatic Join elimination
Additional optimizations: • Memory based shuffle • Batching and columnar structure • Variable Integer encoding
![Page 21: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/21.jpg)
Results
![Page 22: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/22.jpg)
Results
Scaling for PageRank on Twitter dataset
Effect of partitioning on communication
![Page 23: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/23.jpg)
Current Flaws • Is not optimized for dynamic graphs. • Requires incremental updates to
routing table. • Is not designed for streaming
applications.
• Asynchronous graph computation not available. This is where Naiad will outperform.
![Page 24: GraphX : Graph Processing in a Distributed Dataflow Frameworkpages.cs.wisc.edu/~shivaram/cs744-slides/cs744-graphx-bidyut.pdf · Google Knowledge graph :570MVertices, 18B Edges (](https://reader034.vdocuments.us/reader034/viewer/2022051807/6006221dcd9fef792864f729/html5/thumbnails/24.jpg)
Questions