Page 1: Dato vs GraphX

DATO VS. SPARK GRAPHX

KEIRA ZHOU

OCT, 2015

Details: https://github.com/keiraqz/dato-vs-graphx

Page 2: Dato vs GraphX

SETTINGS

• 1 master node and 3 worker nodes on AWS

• m4.large instances, each with 2 cores and 8 GB of RAM

Page 3: Dato vs GraphX

DATO

• A graph-based, asynchronous, high-performance, distributed computation framework written in C++

• 30-day free trial, then a service fee

• Install GraphLab Create on the local machine and Dato Distributed on the cluster

Page 4: Dato vs GraphX

SPARK GRAPHX

• Comes with Spark

import org.apache.spark._
import org.apache.spark.graphx._

Page 5: Dato vs GraphX

EXPERIMENTS

• Graph algorithms
    • Triangle counting
    • PageRank
    • Connected components

• Datasets: Stanford Large Network Dataset Collection (SNAP)
    • Facebook: Nodes: 4039 | Edges: 88234 | Number of triangles: 1612010
    • YouTube: Nodes: 1134890 | Edges: 2987624 | Number of triangles: 3056386
    • Pokec: Nodes: 1632803 | Edges: 30622564 | Number of triangles: 32557458
    • LiveJournal: Nodes: 3997962 | Edges: 34681189 | Number of triangles: 177820130
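SNAP distributes these graphs as plain-text edge lists: one whitespace-separated pair of node IDs per line, with header lines starting with "#". A minimal sketch of reading that format, assuming integer node IDs; the function name parse_edge_list is hypothetical and is not part of GraphLab Create or GraphX:

```python
def parse_edge_list(lines):
    """Parse SNAP-style edge-list lines into (src, dst) integer pairs.

    Lines starting with '#' are treated as comments and blank lines are
    skipped. Integer node IDs are an assumption of this sketch.
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


# Illustrative input only; these lines are not taken from a real dataset.
sample = ["# FromNodeId\tToNodeId", "0\t1", "0\t2", "", "1\t2"]
edges = parse_edge_list(sample)
```

For a file on disk, the same function can be called on an open file handle, since it only iterates over lines.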

Page 6: Dato vs GraphX

EXPERIMENTS (CONT’D)

• Default settings
    • Dato: GRAPHLAB_FILEIO_MAXIMUM_CACHE_CAPACITY = 4G
    • GraphX: started with executor memory = 1G, later changed to 2G

Page 7: Dato vs GraphX

RESULTS

• Triangle Counting: both Dato and GraphX (when it finishes the job) return the correct answer as listed on the SNAP website.

• For the Pokec and LiveJournal data, GraphX has trouble finishing the computation.
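To make concrete what "the correct answer" means here, the following is a minimal single-machine sketch of triangle counting on an undirected graph. It is not how Dato or GraphX implement it; the function name count_triangles is hypothetical, and node IDs are assumed to be comparable (for example, integers):

```python
from itertools import combinations


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    # Build an undirected adjacency map, ignoring self-loops and
    # collapsing duplicate or reversed edges via sets.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # For every undirected edge (u, v), each common neighbor closes a triangle.
    total = 0
    for u, neighbors in adj.items():
        for v in neighbors:
            if u < v:  # visit each undirected edge once
                total += len(adj[u] & adj[v])

    # Every triangle was counted once per edge, i.e. three times.
    return total // 3


# A complete graph on 4 nodes contains 4 triangles.
k4 = list(combinations(range(4), 2))
k4_triangles = count_triangles(k4)
```

Run on a full SNAP edge list, the result would be compared against the "Number of triangles" figure that SNAP publishes for that dataset.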

Page 8: Dato vs GraphX

TAKE-AWAY FOR GRAPHX

• What I observed was that certain stages within the job kept failing

• Each task in a Spark stage operates on one partition of the RDD at a time (and loads the data in that partition into memory)

• Potential solutions
    • Increase the executor memory
    • Increase the number of partitions of the RDD so that each task processes a smaller amount of data
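The effect of the second suggestion can be estimated with simple arithmetic. The sketch below assumes edges are spread evenly across partitions, which real partitioning does not guarantee, and the partition counts are hypothetical values chosen only for illustration. The edge count is the LiveJournal figure from the dataset list:

```python
def edges_per_partition(num_edges, num_partitions):
    """Upper bound on edges per partition, assuming an even split."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Integer ceiling division.
    return (num_edges + num_partitions - 1) // num_partitions


LIVEJOURNAL_EDGES = 34681189  # from the SNAP dataset list

# Hypothetical partition counts, not the values used in the experiments.
estimates = {p: edges_per_partition(LIVEJOURNAL_EDGES, p) for p in (6, 60, 600)}
for partitions, per_partition in estimates.items():
    print(f"{partitions:>4} partitions -> about {per_partition} edges each")
```

More partitions mean each task holds fewer edges in memory at once, at the cost of more tasks to schedule.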

Page 9: Dato vs GraphX

RESULTS (CONT’D)

• PageRank: The threshold for PageRank is set to 0.001
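The slide does not say how each tool applies the threshold. The sketch below assumes it is a convergence tolerance: iteration stops once the total change in rank between two iterations falls below it. The damping factor of 0.85, the max_iter cap, and the function name pagerank are assumptions of this sketch, and it is a plain power iteration rather than either tool's implementation:

```python
def pagerank(edges, damping=0.85, threshold=0.001, max_iter=100):
    """Power-iteration PageRank over directed (src, dst) pairs.

    Stops when the summed absolute change in rank drops below `threshold`.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}

        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share

        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < threshold:
            break

    return rank


# Three nodes all pointing at node 0: node 0 should rank highest.
star_ranks = pagerank([(1, 0), (2, 0), (3, 0)])
```

A smaller threshold generally means more iterations before stopping, so the threshold value directly affects the measured execution time.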

Page 10: Dato vs GraphX

RESULTS (CONT’D)

• Connected Components
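For reference, connected components assigns every node a label shared by all nodes reachable from it. A minimal single-machine sketch using union-find follows; it labels each component by its smallest node ID, treats edges as undirected, and is not the distributed algorithm either tool runs. The function name connected_components is hypothetical:

```python
def connected_components(edges):
    """Map each node to the smallest node ID in its connected component."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk to the root
        # with path halving to keep the trees shallow.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # Attach the larger root under the smaller one, so every root
            # is always the minimum ID of its component.
            parent[max(root_u, root_v)] = min(root_u, root_v)

    return {node: find(node) for node in list(parent)}


labels = connected_components([(1, 2), (2, 3), (4, 5)])
num_components = len(set(labels.values()))
```

The number of distinct labels gives the number of components, which is a convenient value to compare across tools.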

Page 11: Dato vs GraphX

CONCLUSIONS

• Both tools were set up quickly, without fine-tuning runtime parameters, but
    • Dato has a clear advantage over GraphX in execution time when processing large-scale graph data
    • However, GraphX is free, while Dato charges a service fee after the free trial.

• The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one system with a single composable API.

• Further experiments could compare overall performance on a specific task that combines graph algorithms with other data-parallel computation

Page 12: Dato vs GraphX

MORE DETAILS

• https://github.com/keiraqz/dato-vs-graphx

Page 13: Dato vs GraphX

REFERENCES

• Dato: https://dato.com/

• Spark GraphX: https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html

• Stanford Large Network Dataset Collection (SNAP): https://snap.stanford.edu/data/