experimental analysis of streaming algorithms for graph ...sariyuce.com/sem/sigmod19.pdfdealing with...

18
Experimental Analysis of Streaming Algorithms for Graph Partitioning Anil Pacaci University of Waterloo [email protected] M. Tamer Özsu University of Waterloo [email protected] ABSTRACT We report a systematic performance study of streaming graph partitioning algorithms. Graph partitioning plays a crucial role in overall system performance as it has a sig- nificant impact on both load balancing and inter-machine communication. The streaming model for graph partitioning has recently gained attention due to its ability to scale to very large graphs with limited resources. The main objective of this study is to understand how the choice of graph partitioning algorithm affects system per- formance, resource usage and scalability. We focus on both offline graph analytics and online graph query workloads. The study considers both edge-cut and vertex-cut approaches. Our results show that the no partitioning algorithms per- forms best in all cases, and the choice of graph partitioning algorithm depends on: (i) type and degree distribution of the graph, (ii) characteristics of the workloads, and (iii) specific application requirements. CCS CONCEPTS Information systems Database performance eval- uation; Graph-based database models; Computing method- ologies Vector / streaming algorithms. KEYWORDS Graph Partitioning; Streaming Algorithms; Graph Processing ACM Reference Format: Anil Pacaci and M. Tamer Özsu. 2019. Experimental Analysis of Streaming Algorithms for Graph Partitioning. In 2019 International Conference on Management of Data (SIGMOD ’19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3299869.3300076 Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. SIGMOD ’19, June 30-July 5, 2019, Amsterdam, Netherlands © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-5643-5/19/06. . . $15.00 https://doi.org/10.1145/3299869.3300076 1 INTRODUCTION In a broad range of domains from online social networks to protein interaction networks, it is natural to model and rep- resent data as graphs. There has been extensive research into techniques to efficiently store and process such graph data. In particular, scale-out solutions such as Pregel [30], Power- Graph [20] and GPS [37] have been popular. Partitioning of a graph over a cluster of machines plays a crucial role in such systems, affecting both their design and their performance. Choice of a partitioning model has complex interactions with the computation models of the distributed graph systems – e.g., vertex-cuts enabling a single vertex program to span multiple worker machines in PowerGraph – and affects the communication overhead. Minimizing the network commu- nication and balancing the computational workload across a cluster of machines is crucial to the performance of any scale-out system, but these two objectives often conflict for graph processing applications. A random partitioning of the graph is well balanced, however it is known to incur large number of cuts [10]. Irregular structure and inherent inter- dependencies within a graph make it difficult to obtain high quality partitions that are also balanced. A recent development in graph partitioning is the stream- ing model, where individual vertices or edges of a graph arrive at the processing point one-at-a-time, sometimes at high speed. In some cases, data are streamed from disk for bulk loading across a cluster, while in other cases data are streamed from sources. We discuss this separation in Sec- tion 4, but we do not make that distinction in this paper, because our focus is on the partitioning algorithms and how the resulting partitioning impacts workload execution. In both of the above cases, the full graph is not wholly available to the partitioning algorithm, and each element needs to be processed on-the-fly. Streaming algorithms for graph parti- tioning (SGP) do not maintain the entire graph in memory and make placement decisions as the graph becomes avail- able. Even though streaming algorithms are able to scale to very large graphs, only accessing a limited portion of the graph and not being able to store and access the entire graph in memory during partitioning makes SGP a more challenging problem. Although a significant number of SGP techniques have Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands 1375

Upload: others

Post on 25-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

Experimental Analysis of Streaming Algorithms forGraph Partitioning

Anil PacaciUniversity of [email protected]

M. Tamer ÖzsuUniversity of Waterloo

[email protected]

ABSTRACTWe report a systematic performance study of streaminggraph partitioning algorithms. Graph partitioning plays acrucial role in overall system performance as it has a sig-nificant impact on both load balancing and inter-machinecommunication. The streaming model for graph partitioninghas recently gained attention due to its ability to scale tovery large graphs with limited resources.

The main objective of this study is to understand how thechoice of graph partitioning algorithm affects system per-formance, resource usage and scalability. We focus on bothoffline graph analytics and online graph query workloads.The study considers both edge-cut and vertex-cut approaches.Our results show that the no partitioning algorithms per-forms best in all cases, and the choice of graph partitioningalgorithm depends on: (i) type and degree distribution of thegraph, (ii) characteristics of the workloads, and (iii) specificapplication requirements.

CCS CONCEPTS• Information systems→ Database performance eval-uation;Graph-based databasemodels; •Computingmethod-ologies→ Vector / streaming algorithms.

KEYWORDSGraph Partitioning; StreamingAlgorithms; Graph ProcessingACM Reference Format:Anil Pacaci and M. Tamer Özsu. 2019. Experimental Analysis ofStreaming Algorithms for Graph Partitioning. In 2019 InternationalConference on Management of Data (SIGMOD ’19), June 30-July 5,2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 18 pages.https://doi.org/10.1145/3299869.3300076

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for componentsof this work owned by others than ACMmust be honored. Abstracting withcredit is permitted. To copy otherwise, or republish, to post on servers or toredistribute to lists, requires prior specific permission and/or a fee. Requestpermissions from [email protected] ’19, June 30-July 5, 2019, Amsterdam, Netherlands© 2019 Association for Computing Machinery.ACM ISBN 978-1-4503-5643-5/19/06. . . $15.00https://doi.org/10.1145/3299869.3300076

1 INTRODUCTIONIn a broad range of domains from online social networks toprotein interaction networks, it is natural to model and rep-resent data as graphs. There has been extensive research intotechniques to efficiently store and process such graph data.In particular, scale-out solutions such as Pregel [30], Power-Graph [20] and GPS [37] have been popular. Partitioning of agraph over a cluster of machines plays a crucial role in suchsystems, affecting both their design and their performance.Choice of a partitioning model has complex interactions withthe computation models of the distributed graph systems –e.g., vertex-cuts enabling a single vertex program to spanmultiple worker machines in PowerGraph – and affects thecommunication overhead. Minimizing the network commu-nication and balancing the computational workload acrossa cluster of machines is crucial to the performance of anyscale-out system, but these two objectives often conflict forgraph processing applications. A random partitioning of thegraph is well balanced, however it is known to incur largenumber of cuts [10]. Irregular structure and inherent inter-dependencies within a graph make it difficult to obtain highquality partitions that are also balanced.

A recent development in graph partitioning is the stream-ing model, where individual vertices or edges of a grapharrive at the processing point one-at-a-time, sometimes athigh speed. In some cases, data are streamed from disk forbulk loading across a cluster, while in other cases data arestreamed from sources. We discuss this separation in Sec-tion 4, but we do not make that distinction in this paper,because our focus is on the partitioning algorithms and howthe resulting partitioning impacts workload execution. Inboth of the above cases, the full graph is not wholly availableto the partitioning algorithm, and each element needs to beprocessed on-the-fly. Streaming algorithms for graph parti-tioning (SGP) do not maintain the entire graph in memoryand make placement decisions as the graph becomes avail-able. Even though streaming algorithms are able to scaleto very large graphs, only accessing a limited portion ofthe graph and not being able to store and access the entiregraph in memory during partitioning makes SGP a morechallenging problem.Although a significant number of SGP techniques have

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1375

Page 2: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

been proposed [10, 24, 34, 39, 40], there is limited under-standing of their performance in practice. Existing perfor-mance studies do not cover the entire design space of SGPalgorithms and their implications on runtime performance[1, 21, 41]. Performance analyses accompanying the algo-rithm proposals are usually conducted in a limited context.Specifically, the evaluation is on a single workload (almostalways PageRank) and algorithms are integrated in eithergeneralized large-scale data processing frameworks [39] orad hoc implementations [40], which do not represent theperformance characteristics of scale-out graph processingsystems in practice. Furthermore, formulation of these al-gorithms is based on theoretical metrics and they aim tominimize the cut size (edge-cut ratio or replication factor,to be defined precisely in Section 4) while balancing thepartition sizes. It is not clear how these objective functionscorrelate with the workload performance in practice. In thispaper, we conduct a systematic experimental analysis ofSGP algorithms using a diverse set of workloads and graphdatasets to better understand the implications of such formu-lation on performance and their impact on graph processingapplications in practice. Based on our analysis, we presentour findings in a decision tree format to help readers to picksuitable SGP algorithm based on the type and degree distri-bution of the graph, workload characteristics and applicationrequirements.

We study the existing algorithms along two main dimen-sions: the cut model and workloads they support. Alongthe first dimension, we identify edge-cut and vertex-cut ap-proaches. Edge-cut methods create vertex-disjoint partition-ing of the original graph where the endpoints of a particularedge might be in different partitions, whereas vertex-cutmethods create edge-disjoint partitioning where individualvertices may span multiple machines. Along the second di-mension we follow the commonly used classification: offlinegraph analytics and online graph queries. Graph analyticworkloads such as PageRank and Weakly Connected Com-ponent are usually implemented as iterative algorithms thataccess the entire graph in each iteration (optimizations arepossible and frequently deployed), whereas online graphqueries are not iterative and typically access a portion ofthe graph. To the best of our knowledge, our study is thefirst to consider both of these classes of workloads for SGPin addition to providing the first systematic comparativeanalysis of both the edge-cut and the vertex-cut models ongraph processing systems. Experimental framework that wedeveloped and used in this study is publicly available1.

The highlights of our findings are the following:• We observe that the formulation of existing SGP algo-rithms that focus primarily on minimizing the cut size

1https://github.com/anilpacaci/streaming-graph-partitioning

(while maintaining a balanced partitioning) does not repre-sent the workload performance in practice. This is becausethe formulation of these algorithms fail to capture a widerange of workload characteristics and graph structures. Ex-isting algorithms suffer from unbalanced load distributioneven though they can produce balanced partitions.

• Edge-cut SGP algorithms incur less network communica-tion for the same cut size for offline graph analytic work-loads when communication is uni-directional (as in PageR-ank). This holds regardless of the graph characteristics (i.e.,it holds for power-law graphs as well as regular graphs).The literature, by and large, have neglected comparisonsacross these two approaches (except for PowerLyra [13]),and claim particular approaches are better for specifictypes of graphs.

• Existing SGP algorithms are not as effective for onlinegraph queries as they are for offline graph analytics due tothe workload skew. Workload-aware strategies need to beconsidered to gain significant performance improvements.

• Hash-based partitioning provides a better trade-off be-tween throughput and latency objectives compared toedge-cut SGP algorithms for online graph queries, mak-ing it a simple but effective strategy for scale-out graphdatabases.

• Cut size is a reliable indicator of the network communi-cation for all the workloads we study, but this does notalways translate into improvements in workload perfor-mance.

2 RELATEDWORKScale-Out Graph Processing Systems: Scale-out graphprocessing systems are typically divided into graph analyticssystems and graph database systems based on the workloadthey process. Systems in the former category (e.g., Pregel[30], Giraph [5], PowerGraph [20] and PowerLyra [13]) focuson running offline analytical workloads such as PageRank,weakly connected component, clustering, while the latter(e.g., Neo4j2 and JanusGraph3) focus on real-time querying(reachability queries and subgraph matching) and manip-ulation of graph structured data. Due to lack of space, wedo not discuss the details of these systems and assume thatthe reader is familiar with the fundamentals. Surveys [32]and performance studies [3, 22] of graph processing systemsexist. One issue that is important to highlight for this pa-per is the computational models of analytics systems. Thetwo popular ones are vertex-centric block synchronous wherecomputation is performed iteratively through user-definedvertex functions and vertices push their state along the edgesof the graph at the end of each iteration, and vertex-centric

2https://neo4j.com/3http://janusgraph.org/

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1376

Page 3: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

Gather, Apply, Scatter (GAS) where the state is pulled (ratherthan pushed) by vertices at the beginning of each iteration.This distinction is important for the discussions in this paper.

Balanced Graph Partitioning: The objective of the k-way balanced graph partitioning is to generate k disjointpartitions of the graph with minimum cut size. This problemis proven to be NP-Hard [4], and due to the importance of theproblem, there is a significant body of research on approx-imation algorithms and heuristic solutions. Among others,the multi-level partitioning scheme has been successful andcommonly adopted due to the quality of its solution and abil-ity to process graphs of non-trivial size [27]. However, thesemethods have been shown to not scale well with graph size,particularly when they have skewed degree distribution, andthey cannot easily handle streaming graphs [2]. The scalingand skew issues have been addressed (e.g., [26] and [2]) butdealing with streaming graphs is difficult. Re-partitioningalgorithms such as Hermes [33] and Leopard [23] focus ondynamic refinement of an initial partitioning instead of re-partitioning the whole graph in case of updates. We referinterested readers to surveys on graph partitioning [7, 12].

Streaming Algorithms for Graph Partitioning: Stan-ton and Kliot [39] formulate the streaming model for graphpartitioning in the context of graph loading and proposevarious streaming heuristics for graph partitioning. Thesealgorithms exhibit common characteristics: (i) they maintaina synopsis in memory rather than the entire graph, whichenables them to scale to very large graphs with limited re-sources, (ii) they perform single-pass over the graph streamand make partitioning decisions on the fly to significantlyreduce the partitioning time, (iii) they can provide significantimprovements over random partitioning, even comparableto traditional offline heuristics for certain scenarios.

Even though graph partitioning is a well studied problemand it has a significant role in the design and the performanceof scale-out graph processing systems, there is limited un-derstanding of its implications in practice. Existing literaturemostly focus on theoretical metrics and neglect the impactof graph partitioning on workload execution performance.In contrast, we study the performance of SGP algorithmson scale-out graph processing systems to better understandthe design space and its implications on workload perfor-mance. Here we highlight the major findings of previousperformance studies of SGP algorithms and their differenceswith our results.

Guo et. al. present a performance study of edge-cut SGPalgorithms that are available in Oracle PGX.D [21]. Perfor-mance evaluation and the analysis are tightly coupled withOracle PGX.D graph analytics system and it is limited to theoffline graph analytic workloads.Verma et. al. study vertex-cut SGP algorithms that are

present in three scale-out graph processing systems, namely

PowerGraph, PowerLyra and GraphX [41]. Authors studythe vertex-cut model and focus on the replication factor asthe indicator of workload performance. We show that cutsize is indeed an accurate predictor of the network communi-cation, however it does not always correlate with workloadperformance due to workload imbalance (see Section 6.2.2).We observe that load distribution is heavily influenced byworkload and graph characteristics and exhibits differenttrends based on the cut model (see Figure 4).

Abbas et. al. conduct a performance study of existing SGPalgorithms on Apache Flink, a general purpose data process-ing framework [1]. Even though the study covers a widerrange of SGP algorithms and workloads compared to previ-ous studies [21, 41], it fails to capture the effect of workloadand graph characteristics on the workload performance.

To the best of our knowledge, ours is the first study to con-sider both offline graph analytics and online graph queriesfor performance analysis of SGP algorithms. In addition, wesurvey the formulations of existing algorithms in literaturein a common framework, making it easier to understand thedesign space of SGP algorithms and its implications.

3 PRELIMINARIESLet G(V ,E) be a graph where V is the set of vertices and Eis the set of edges with |V | = n and |E | = m, and let P ={P1, ..., Pk } be a partitioning of G. The partitioning problemis characterized by communication costC(P) to beminimizedunder load balance constraints. Letw(Pi ) represent the loadof a single partition Pi . The (k, β) balanced graph partitioningproblem can be formally defined as

minimize C(P) subject to:

w(Pi ) ≤ β ∗

∑kj=1w(Pj )

k,∀i ∈ {1...k}

(1)

Intuitively, the objective is to generatek disjoint but completepartitions of the graph such that the overall communicationcost is minimized while the workload is evenly distributedamong partitions. Finding a partitioningwith strict balance israrely necessary; therefore, algorithms allow deviation fromexact balance controlled by slackness parameter β . β = 1represents the case where exact balance is required whereasβ > 1 relaxes balanced constraint on partition loads.

The streaming graph model assumes that elements of theinput graphG(V ,E) arrive in an arbitrary order. A streamingalgorithm is sequentially presented stream S =

⟨a1,a2, ...

⟩of graphG where ai is either an edge e = (u,v) or a vertex uand its neighbors N (u). P t = {P t1 , ..., P

tk } refers to the set of

partitions at time t and P t (ai ) denotes the set of partitionsthat containai .∪k

i=1Pti is equal to the part of the graph stream

consumed at time t . The SGP algorithm places each ai to apartition P ti that maximizes the algorithm-specific objective

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1377

Page 4: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

function h(ai , P t ). Formally;

∀aj ∈ S assign aj to the partition P ti

where i = arg maxi ∈{1...k }

(h(aj , P

t )) (2)

4 STREAMING ALGORITHMS FORGRAPH PARTITIONING

In this study, we characterize partitioning algorithms basedon the cut model and identify edge-cut and vertex-cut ap-proaches. We describe edge-cut and vertex-cut models anddiscuss the most notable examples in Sections 4.1 and 4.2,respectively. Section 4.3 introduces the hybrid-cut model andits corresponding SGP algorithms that are proposed in Pow-erLyra. There exists some work in literature that considersdifferent objective functions, workload and cluster character-istics. Due to space constrains, we discuss these algorithmsin Appendix A. Table 1 lists the partitioning algorithms andtheir characteristics that we study in this paper.

4.1 Edge-Cut ModelAlgorithms in the edge-cut class aim to create a balanced dis-tribution of vertices amongst the partitions while minimizingthe edge-cuts. Communication cost C(P t ) and partition loadw(P ti ) in Equation 1 are formally defined as;

C(P t ) =∑

(u,v)∈E

�� P t (u) , P t (v)��

|E |, andw(P ti ) =

�� P ti �� (3)

The important metrics for algorithms in this class are edge-cut ratio and load imbalance. Edge-cut ratio is the ratio ofthe edges that are cut across partitions to the total numberof edges and is used to model the communication cost invertex-centric systems. Each cut edge contributes to the inter-machine communication as communication typically occursalong the edges of the graph. Load imbalance, on the otherhand, is defined as the ratio of the number of vertices in thelargest partition to the average and is used to describe thecomputational imbalance of the cluster.

4.1.1 Edge-Cut SGP on Vertex Streams. Adjacency lists, whereeach vertex u is presented with its neighbouring verticesN (u), are a common format to represent graphs and areadopted by many graph processing systems. This represen-tation is suitable for static graphs as it requires the completeneighbourhood information to be available. During graphloading, input graph is streamed from storage and each ver-tex is assigned to a partition on-the-fly. Therefore, manyedge-cut SGP methods adopt the vertex stream input model.Hash-based (over vertex keys) graph partitioning is one

of the simplest forms of SGP and is widely used in manysystems. It achieves a well-balanced distribution; howeverit completely ignores the graph topology and the history of

previous assignments. Under a uniform random partitioninginto k machines, the expected edge-cut ratio is (1 − 1

k ).Stanton and Kliot [39] formulate the traditional (k, β) bal-

anced graph partitioning problem in the streaming settingsand propose a greedy heuristic algorithm, Linear Determin-istic Greedy (LDG) that assigns vertex u to partition P ti withthe most number of u’s neighbors while penalizing for ex-cessive partition size to force balance. Formally;

argmaxi ∈{1...k }

(�� P ti ∩ N (u)�� ∗ (1 − �� P ti ��

C)

)(4)

where C = β ∗|V |

k is the maximum size for any partition.Multiplicative weights of LDG strictly enforce exact balance.

FENNEL [40] formulates the problem as modularity max-imization in streaming settings and it relaxes the hard car-dinality constraint in Equation (1) by introducing the loadtermw(Pi ) as an additive term that takes its minimum valuewhen all partition sizes are equal. Formally:

argmaxi ∈{1...k }

(�� P ti ∩ N (u)�� − (α ∗ γ ∗ |P ti |

γ−1))

(5)

γ indicates the how much imbalance of the cluster size isaccounted in the objective function whereas α controls thescaling of partition size and the original paper analyzes theoptimal value of α as a function onm and k [40].

Both LDG and FENNEL significantly improve the edge-cutratio compared to hash partitioning and provide comparablecut size to METIS, using significantly less resources (bothtime and memory). Experiments [40] on the Twitter graphshow that both of these are approximately ten times fasterthan their offline counterpart, METIS, and only use a fractionof memory, as also confirmed in our experiments. However,both methods require each worker to continuously commu-nicate and synchronize the history of previous assignments.Hash-based methods do not require any synchronization andcan be parallelized without communication overhead.

We study the hash-based partitioning, LDG and FENNELalgorithms in this category as they are commonly adoptedin practice. However, there are other algorithms in litera-ture that are in this category. Re-streaming versions of LDG(re-LDG) and FENNEL (re-FENNEL) [34] are iterative algo-rithms that utilize partitioning results of previous iterationsto improve partitioning quality for static graphs.For systems that support the adjacency list input format,

the edge-cut SGP model is a lightweight, scalable alterna-tive to hash partitioning with significantly better partition-ing quality. Indeed, these methods can even achieve METIS-comparable quality partitions under the re-streaming model.On the other hand, vertex streams with adjacency lists re-quire complete adjacency information in advance, makingsuch methods inappropriate for dynamic graphs and otherinput formats like triple list serializations of RDF graphs.

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1378

Page 5: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

Table 1: Characteristics of Streaming Graph Partitioning AlgorithmsAlgorithm Stream Cost Metric Parallelization Update Support Method

Edge-cut

LDG[39] Vertex Edge-cut Ratio Inter-Stream Comm. No GreedyFENNEL[40] Vertex Edge-cut Ratio Inter-Stream Comm. No Greedy

Restreaming LDG [34] Vertex Edge-cut Ratio Intra-Stream Comm. Yes GreedyRe-FENNEL [34] Vertex Edge-cut Ratio Intra-Stream Comm. No GreedyTAPER [19] Vertex Inter-Partition Traversal Yes Yes GreedyBMI [44] Vertex Expected JobTime No No Greedy

Leopard [23] Edge Edge-cut Ratio No Yes DynamicIOGP [15] Edge Edge-cut Ratio No Yes Greedy

Vertex-cut

DBH [43] Edge Replication Factor Yes Yes HashGrid [24] Edge Replication Factor Yes Yes Constrained

PowerGraph [20] Edge Replication Factor Inter-Stream Comm. Yes GreedyHDRF [36] Edge Replication Factor Inter-Stream Comm. Yes Greedy

Hybrid-cut Hybrid Random [13] Edge Replication Factor Yes No HashGinger [13] Edge Replication Factor Inter-Stream Comm. No Greedy

4.1.2 Edge-Cut SGP on Edge Streams. Edge streams do notnecessarily have locality and algorithms in this class can-not maintain complete adjacency information N (u) until allincident edges of vertex u arrive. Therefore, they producepartitionings of lower quality than their vertex stream coun-terparts and need to revisit their initial assignments (e.g.,Condensed Spanning Tree (CST) [18] and IOGP [15]). There-fore, they are not generally deployed in real systems and wedo not focus on algorithms from this class.

4.2 Vertex-Cut ModelVertex-cut graph partitioning algorithms distribute edgesacross the cluster and produce edge-disjoint partitioning ofthe original graph. They produce balanced distribution ofedges but have to replicate vertices that have incident edgesin multiple partitions. The definition of P ti in this case is theset of edges assigned to partition i at time t . If A(u) ⊆ P t

represents the set of partitions that vertex u spans, the com-munication cost C(P t ) and partition loadw(P ti ) in Equation1 are formally defined as:

C(P t ) =

∑u ∈V | A(u) |

|V |, andw(P ti ) =

�� P ti �� (6)

Important metrics for algorithms in this class are replica-tion factor and load imbalance. Replication factor is definedas the average number of partitions that vertices span andrepresents the communication cost. Similar to the edge-cutstreaming model, load imbalance is the ratio of the largestpartition to the average partition size and is used to indi-cate computational load imbalance. The vertex-cut model isadopted in GAS systems where computation is factorizedover the edges, and therefore partition size in number ofedges is used as an indicator of computational load. Under auniform random partitioning, the expected communicationcosts of edge-cut and vertex-cut models are identical [10].

4.2.1 Vertex-Cut SGP on Vertex Streams. Vertex-cut SGPalgorithms on vertex streams are not popular as they aredifficult to design for reasons that are duals of those for edge-cut SGP on edge streams. Nevertheless, there exists someworks that use particular variations of stream models: [42]discusses a hybrid stream model where a vertex might arrivewith subset of its incident edges; [31] constructs vertex-cutpartitionings over degree-ordered vertex streams. However,these are not streaming algorithms, since they process theentire graph stream before any partitioning can be produced.Therefore we do not focus on these algorithms.

4.2.2 Vertex-Cut SGP on Edge Streams. The simplest solu-tion in this category is to partition the edges using a hashfunction on some attributes of the endpoints, e.g. concate-nation of the vertex ids. Although this produces perfectlybalanced partitions, it is known to have high communica-tion cost [10]. Considering that the main challenge posed bypower-law graphs are those few vertices with very high de-gree, one can focus on replicating only those. Degree-BasedHashing (DBH) [43] is built on this idea and assigns an edgeto a partition by hashing the vertex of smaller degree to pre-serve the locality of vertices of lower degree. Formally, givena hash function h(·), an edge (u,v) is assigned to partitionh(u) if d(u) < d(v) or partition h(v) otherwise.

DBH reduces the expected replication factor as skew in-creases, however it relies on a priori knowledge of degreeinformation. Another approach is to limit replication fac-tor for each individual vertex. Jain et. al. [24] propose Gridpartitioning and virtually place the set of partitions on atwo-dimensional grid. All partitions in the same row or col-umn as partition Pi constitute Pi ’s constrained set. When anew edge (u,v) arrives, both endpoints are hashed to somepartitions Pi and Pj , and it is guaranteed that constrainedsets of Pi and Pj have at least two common partitions andthe one with the smallest size is selected for edge (u,v). Sucha scheme upper-bounds the replication factor by 2 ∗

√n − 1.

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1379

Page 6: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

Petroni et. al. [36] exploit the skew in degree distributionof real-world graphs and propose a degree-aware greedyheuristic for vertex-cut SGP, called HDRF. Similar to DBH,HDRF replicates the few high degree vertices and aims topreserve locality of low degree vertices. HDRF’s objectivefunction does so by utilizing the partial degree information,avoiding a pre-processing step to calculate the exact vertexdegrees. Partial vertex degrees are normalized to θ (u) andθ (v) for each incoming edge (u,v), where θ (u) = 1 − θ (v) =

d (u)d (u)+d (v) . The resulting objective function is:

arg maxi ∈{1...k }

(д(v, Pi ) + д(u, Pi ) + λ ∗ (1 −

|e(P ti )|

C))

where д(v, Pi ) = (1 + (1 − θ (v))) ∗ 1A(v)(Pi )(7)

HDRF favors the partitions of the lower-degree verticesover those of the high-degree vertices by д(v, Pi ). Moreover,the original PowerGraph formulation of greedy vertex-cutis sensitive to stream orders and might result in a singlepartition in case of breadth-first traversal order. HDRF avoidsthis problem by setting the importance of balance with λ > 1.

Both hash-based and constrained methods are “embarrass-ingly parallel” meaning that they can be parallelized withoutany communication or dependencies. On the other hand,greedy methods need the history of previous assignments tobe shared among workers, similar to greedy edge-cut SGPheuristics. In particular, a distributed table with the valuesof A(u), ∀u ∈ V needs to be maintained.In this study, we use DBH, Grid and HDRF as well as

hash-based random partitioning as the representatives ofvertex-cut SGP algorithms.

Vertex-cut methods achieve a more balanced workloaddistribution by evenly distributing edges, especially in realworld graphs with a power-law degree distribution. Thestraggler effect of high degree vertices can be avoided bydistributing the computation of those high degree verticesacrossmultiple workers. Furthermore, it is shown that vertex-cut methods can achieve lower communication cost com-pared to edge-cut partitioning under message aggregation, acommon optimization technique for reducing network over-head [10]. Combined with the edge stream model, thesemethods are favorable for modern applications that processlarge, growing real-world graphs. On the other hand, vertex-cut models require the underlying system to allow vertexcomputation to span multiple machines. Although a few re-cent systems such as PowerGraph, PowerLyra and GraphXprovide such support, it complicates the systems’ design.Therefore, most existing vertex-centric graph processingsystems rely on the traditional edge-cut model.

4.3 Hybrid-cut ModelOnly representative of this category is PowerLyra’s hybrid

SGP method called Ginger [13]. PowerLyra differentiates be-tween high-degree and low-degree vertices; it uses edge-cutpartitioning for low-degree vertices while in-edges of high-degree vertices are partitioned via vertex-cut. Ginger is ableto process both edge and vertex streams. This hybrid modelneeds to identify high degree vertices for differentiated pro-cessing, and therefore it works in two phases when the inputgraph is presented as an edge stream, which is difficult forstreaming data. Ginger employs FENNEL-like heuristic tominimize the replication factor of low-degree vertices in thefirst phase by assigning a vertex v and its in-edges to thepartition Pi that minimizes the expected replication factor.Unlike FENNEL, Ginger incorporates both the number ofvertices and the number of edges into its objective function.Formally:

arg maxi ∈{1...k }

(�� P ti ∩ N (u)�� − c(

12(�� P ti �� + |V |

|E |

�� P ti ��) )) (8)

Given a user-specified threshold, high-degree vertices areidentified in the first phase. The in-edges of those vertices arere-assigned by hashing on the source vertex so that Gingerpreserves the locality of low-degree vertices while distribut-ing high-degree vertices across the cluster.

We study both Ginger and hash-based hybrid partitioningof PowerLyra as representatives of this category for our ex-perimentation since hybrid-cut model is reported to providegood performance for power-law graphs [13, 41].

5 EXPERIMENTAL DESIGNThe main objective of the experiments is to understand theimpacts of theoretical graph partitioning metrics proposedin literature on the performance of distributed graph process-ing systems. Specifically, we seek answers to the followingquestions:Q1: What are the impacts of cut size on the performance ofscale-out graph processing systems? Specifically, in whichscenarios minimizing the cut size reduces the network com-munication and improves system performance?Q2: What are the impacts of workload and graph charac-teristics on the computational load balance? In particular,under what circumstances partitioning that is balanced insize yields computational load balance?Q3: How does graph partitioning impact the runtime per-formance of offline graph analytic workloads, i.e. overallexecution time?Q4: How does graph partitioning affect latency distributionand throughput of online graph query workloads?Q5: What is the impact of workload skew on online queryworkloads and what are the benefits, if any, of utilizing work-load information for graph partitioning?

In this section we describe the systems, datasets and work-loads used in the experiments as well as the metrics we

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1380

Page 7: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

Table 2: A summary of experiment dimensionsWorkload Parameters Values

OfflineAnalytics

System PowerLyraAlgorithms Vertex-cut Hash (VCR), Grid, DBH,

HDRF, Hybrid-cut Hash (HCR), Gin-ger (HG), Edge-cut Hash(ECR), LDG,FENNEL (FNL), METIS (MTS)

Workloads PageRank, WCC, SSSPCluster Size 8, 16, 32, 64, 128

Datasets Twitter, UK2007-05, USA-Road

OnlineQueries

System JanusGraphAlgorithms Hash (ECR), LDG, FENNEL (FNL),

METIS (MTS)Workloads 1-hop, 2-hop, SSSPCluster Size 4, 8, 16, 32

Datasets Twitter, UK2007-05, USA-Road,LDBC-SNB-SF1000

measure to answer the above questions. We study the per-formance of graph partitioning algorithms for a collection ofpopular workloads under two scenarios: (i) graph processingengines for offline graph analytics and (ii) graph databasesfor online graph queries. The parameters of the experimentsin this paper are summarized in Table 2.

5.1 Offline Graph Analytics5.1.1 Infrastructure. We choose PowerLyra as the graphprocessing engine to study the effects of SGP algorithmson performance of offline graph analytics. As discussed ear-lier, PowerLyra’s graph analytics engine aims to optimizegraph processing on real-world graphs with skewed degreedistributions by using a hybrid partitioning model. Tradi-tionally, scale-out graph processing engines are designedwith consideration of a particular partitioning model as itaffects both computation and communication behavior ofthe system (refer to [32] for a survey of graph processingsystems). PowerLyra’s hybrid model is the only system thatenables us to implement both edge-cut and vertex-cut parti-tioning algorithms on a single system, providing a fair andsystematic comparison between the two models. Unlike Gi-raph and GraphX, PowerLyra does not incur the additionaloverhead of resource management, scheduling etc. as it isnot built on a generalized parallel data processing framework[3]. Hence, the impact of the SGP algorithm can be isolatedand studied rigorously. We provide a detailed analysis of theedge-cut model on PowerLyra in Appendix B and prove thatits communication cost, represented in terms of replicationfactor, is identical to that of edge-cut systems.

All experiments on PowerLyra are run on dedicated Ama-zon EC2 m5.2xlarge machines, each of which has 2.5 GHzIntel Xeon processors with 8 cores and 32 GB memory. Weconfigure PowerLyra to use all available cores as recom-mended and use a synchronous engine as asynchronous

Table 3: Graph datasets used in experiments

Dataset Edges Vertices Avg. / Max.Degree Type

Twitter 1.46B 41M 35 / 2.9M Heavy TailedUK2007-05 3.73B 105M 35.5 / 975K Power-lawUS-Road 58.3M 23M 2.5 / 9 Low-degreeLDBC SNB 3.6M 447M 124 / 3682 Heavy Tailed

engines are reported to perform slower [3, 41]. We test scala-bility of the algorithms using 8, 16, 32, 64 and 128 machines.

5.1.2 Algorithms. We choose Hash (VCR), DBH, Grid andHDRF as representatives of vertex-cut SGP algorithms. Hy-brid Random (HCR) and Ginger (HG) from PowerLyra arechosen as hybrid-cut methods whereas Hash (ECR), LDGand FENNEL (FNL) are chosen from the edge-cut SGP algo-rithms class. Lastly, we use METIS (MTS), an offline multi-level graph partitioning algorithm, as it is the de facto stan-dard for large-scale graph partitioning. We perform METISpartitioning as a pre-processing step prior to data loading,and load these partitions into the system manually as METISis an offline method. This pre-processing step is performedon a separate dedicated machine since it took ∼8 hours and320GB memory to partition the Twitter graph using METIS.

5.1.3 Datasets and Workloads. We use a number of real-world graphs that cover a wide range of structural charac-teristics. Twitter [28] is the largest publicly available socialnetwork dataset with heavily skewed degree distribution andUK2007-05 [8, 9] is one of the largest web graphs with power-law degree distribution. High degree vertices of Twitter andUK2007-05 networks cause communication and computationbottlenecks, especially for edge-cut methods as explainedin Section 6. US Road Network, on the other hand, is a low-degree graph with regular grid-like structure and a longdiameter. Its lower density causes graph algorithms to havelower communication-to-computation ratio, in addition tothe higher number of iterations for iterative graph applica-tions. Table 3 presents the graph characteristics.We consider 3 graph workloads in this study: PageRank

(PR), Weakly Connected Components (WCC) and Single-source Shortest Path (SSSP). These workloads are chosenfor their prominence in existing studies and because theyare representative of graph applications in terms of com-putation and communication patterns. We use the defaultimplementations provided by PowerGraph and PowerLyra.

PageRank is the single most popular algorithm for eval-uating the performance of graph partitioning algorithmsand graph processing systems. In a nutshell, PageRank as-signs a weight to each vertex in the graph based on its im-portance [11]. In PowerLyra implementation of PageRank,vertex weights are iteratively updated based on each ver-tex’s incoming links for a fixed number of iterations (20 in

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1381

Page 8: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

our experiments). As every vertex is active at each itera-tion and must propagate information to all its neighbors,PageRank demonstrates uniform and stable computationand communication costs across each iteration. Therefore, itclosely matches the structural metrics employed by graphpartitioning algorithms.

Weakly Connected Components finds all the maximaldisjoint subgraphs where every vertex is reachable regard-less of the edge direction. The distributed implementationof WCC [25] in a vertex-centric model starts out by consid-ering each vertex as its own component. Each vertex thenupdates its component id by retrieving those of each of itsneighbors and selecting the minimum. This is repeated untilconvergence. Similar to PageRank, all vertices are active atthe beginning of the computation. Unlike PageRank, verticesare only activated with incoming messages and thereforenetwork communication shrinks and workload per machinevaries at each iteration. This variable computation and com-munication behavior makes WCC a suitable candidate to testgraph partitioning algorithms under non-stable conditions.

Single Source Shortest Path computes the shortest pathfrom a source vertex to every vertex in the graph. We ran-domly select the source vertex for each dataset and consis-tently use the same initial source throughout our experi-ments as other studies in the literature. Initially, only thesource vertex is active and other vertices are activated uponreceiving a message in BFS traversal order. Network commu-nication initially grows and then shrinks with each iteration.The ordered activation of vertices and variable communica-tion patterns present a challenging test for SGP algorithmsas it does not fit into the uniform workload assumption.

5.1.4 Metrics. We use following metrics to systematicallyevaluate the impact of SGP algorithms on partitioning qual-ity, system performance and the workload performance:Structural Metrics: We report the replication factor, theaverage number of mirror vertices per vertex, as an indicatorof the quality of the partitioning solution. We do not reportthe load balance in terms of partition size, as all SGP algo-rithms achieve good balance either by setting the maximumallowable imbalance [27, 39] or by adjusting the importanceof load imbalance in their formulation [36, 40].Runtime Metrics:We monitor the distribution of computa-tional load across themachines in the cluster and the networkcommunication within the cluster.PerformanceMetrics:Wemeasure the latency of the graphanalytics applications excluding the partitioning time.

5.2 Online Graph Queries5.2.1 Infrastructure. We use JanusGraph to study the impactof graph partitioning on online graph queries as it is theonly open-source scale-out solution that can process online

graph query workloads. Unlike graph processing engines,JanusGraph and graph database systems in general needpersistent storage in addition to in-memory query processing.JanusGraph adopts an adjacency list representation to storethe graph and uses hash-based random edge-cut partitioning.We use the Cassandra-based storage backend as its ByteOrdered Partitioner enables us to implement custom graphpartitioning algorithms by providing fine grained controlover data placement. Appendix C describes the architectureof the JanusGraph cluster in detail.

All of these experiments are run on a private cluster of 32machines, each of which has 2.1 GHz Intel Xeon processorwith 12 cores and 32 GB of memory. We test scalability ofthe algorithms using 4, 8, 16, and 32 machines. We do notgo beyond 32 machines since the communication overheadsignificantly increases and becomes the dominant factor ofperformance as the number of partitions increases. Figure12 in Appendix D shows the aggregated performance ofthe 1-hop traversal workload on the LDBC-SNB SF-1000graph using 4 to 32 machines. The performance significantlydegrades even on 32 partitions for all algorithms. This is dueto a significant increase in communication overhead.

5.2.2 Algorithms. JanusGraph does not provide support forvertex-cut partitioning due to its adjacency list-based repre-sentation in the storage layer. Therefore, we only use edge-cut SGP algorithms for this scenario, i.e. Hash (ECR), LDGand FENNEL (FNL). In addition, we compare performanceof these algorithms with METIS (MTS). Similar to Power-Lyra experiments, METIS partitioning is implemented as apre-processing step prior to data loading.

5.2.3 Datasets and Workloads. In addition to the real-worldgraphs we use for offline graph analytics, we use the LDBCSocial Network graph in these experiments. The LDBC So-cial Network Benchmark (LDBC SNB) [17] models a socialnetwork graph and introduces an OLTP-like workload thatis designed to simulate real-world interactions in social net-working applications. We extract the friendship graph, whichcomprise of users and knows relationships between users,similar to the Twitter graph. Characteristics of the LDBC-SNB SF-1000 friendship graph as well as real-world graphsare presented in Table 3.

We consider 3 classes of online graph queries in this study:one-hop, two-hop and single-pair shortest path queries.One-hop queries retrieve all adjacent vertices of a given startvertex and constitute more than 50% of the Facebook’s data-base benchmark LinkBench [6]. Similarly, Twitter’s graphsystem for real-time recommendations is optimized for one-hop queries [38]. Combined with two-hop queries, we con-sider neighbourhood retrievals as the most important classof queries for graph databases. The LDBC SNB graph datagenerator produces parameter bindings for the LDBC SNB

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1382

Page 9: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

graph. For real-world datasets, we randomly select the queryvertices that we consistently use across all experiments. Wegenerate 1000 bindings for each type of query and report themeasurements after caches are warmed up.

5.2.4 Metrics. The primary metrics that we use to evaluatethe performance of SGP algorithms on online graph queryworkloads are:Structural Metrics: We report the edge-cut ratio, the ratioof edges whose endpoints lie in different partitions, as anindicator of the quality of the partitioning solution.Runtime Metrics: We monitor the distribution of the re-quests made to the storage layer and the network communi-cation during the execution of a particular workload.Performance Metrics: We report the overall throughputof the JanusGraph cluster as well as latency distribution.

6 EXPERIMENTAL RESULTS & ANALYSISWe present a brief summary of our experimental results inSection 6.1. In Section 6.2, we compare the performance ofall SGP algorithms with respect to all offline graph analyticworkloads, datasets and cluster sizes to answer Q1, Q2 andQ3 specified in Section 5.We depict the detailed performanceresults for the Twitter graph on all workloads and clustersizes in Figure 3. Due to space limitations, we focus in thissection on representative results and provide the full set ofresults for all workloads and datasets in Figure 13 in Appen-dix D. Note that the UK2007-05 dataset does not fit into acluster of 8 machines in any configuration and fits into 16machines only in some configurations (VCR, DBH, HCR, HG,MTS). The experiments detailed in Section 6.3 compare theperformance of all edge-cut SGP algorithms with respect toall online graph query workloads, cluster sizes and datasetsWe only consider up to 32 machines for these experimentsas increasing the number of worker machines beyond 32negatively affects the performance as described in Section5.2.1 and Appendix D. Finally, we discuss the limitations ofexisting work and highlight open challenges in Section 6.4.

6.1 Key Findings• Our experimental results show that existing SGP algo-rithms fail to accommodate a wide range of workloadand graph characteristics, causing a mismatch betweenthe theoretical metrics and the workload performance inpractice.

• We observe that the network communication grows lin-early with respect to the cut size for all the workloadswe study (Section 6.2.1 and 6.3.1), and the slope of theincrease is determined by the workload and the cut model(Q1). Notably, regardless of graph characteristics, edge-cut SGP methods incur less network communication thanvertex-cut methods for the same cut size for offline graph

analytics with uni-directional communication - a compar-ison neglected in the literature.

• We find that (Sections 6.2.1 and 6.3.1) existing algorithmssuffer from imbalance in computational load (Q2) andimprovements in the cut size do not translate to improve-ments in system performance (Section 6.2.2 and 6.3.2) un-less balanced load distribution is achieved (Q3).

• A surprising finding (Section 6.3.2) is that existing SGPalgorithms are not effective for online graph query work-loads due to workload skew, and that simple hash-basedpartitioning achieves a good trade-off between high through-put and low latency objectives (Q4).

• Workload patterns can be used to improve the perfor-mance of online graph queries (Section 6.3.3), specificallyby achieving a balanced load distribution (Q5).

6.2 Offline Graph Analytics6.2.1 Impact of Partitioning Quality. Figure 2 shows thereplication factors for all SGP algorithms on all graphs andcluster sizes. Replication factor accurately quantifies the com-munication overhead of edge-cut based partitioning in Pow-erLyra, as explained in Appendix B, which is why we reportthe replication factor for all SGP algorithms in this section.There is no single algorithm that provides the best replica-tion factor in all cases. Edge-cut SGP algorithms FNL andLDG outperform their vertex-cut counterparts on USA-Roadnetwork as shown in Figure 2. This is because on regulargraphs with low degree vertices, vertex-cut SGP algorithmsunnecessarily replicate these low degree vertices. Edge-cutSGP algorithms, on the other hand, can preserve the localityof these low degree vertices and thus provide lower repli-cation factors. Vertex-cut and hybrid-cut SGP algorithmsare more effective on the Twitter graph as it has a heavilyskewed degree distribution. In particular, HG, HDRF andDBH deliver a lower replication factor than that of MTS, anoffline graph partitioning algorithm. This is expected as allthree methods are explicitly designed to target graphs withheavy tailed degree distributions. The replication factor ofHDRF is the lowest among vertex-cut SGP algorithms onUK2007-05 web-graph, demonstrating that its greedy heuris-tic is effective for graphs with power-law degree distribution.Therefore we conclude that vertex-cut methods and in par-ticular HDRF are better for power-law graphs in terms ofreplication factor.We evaluate the effectiveness, as commonly claimed, of

the cut size as an indicator of the communication overheadto answer Q1 (Section 5). Figure 1 depicts the total networkcommunication with respect to the replication factor for PR,WCC and SSSP workloads on the Twitter graph for eachcut model. Each individual point represents the total net-work communication against the replication factor during

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1383

Page 10: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

��

���

���

���

� � � � � �� �� ��

���������

�� ���������

��� ��� ����� ��� ��� ���

(a) PageRank

��

���

���

���

� � � � � �� �� ��

�������������

� ��������

��� ��� ����� ��� ��� ���

(b) Weakly Connected Components

��

���

���

���

� � � � � �� �� ��

�������������

� ��������

��� ��� ����� ��� ��� ���

(c) Single Source Shortest Path

Figure 1: Replication factor vs total network I/O during the execution of PageRank, Weakly Connected Compo-nent and Single-Source Shortest Path Algorithms on Twitter Graph

1

1.5

2

2.5

3

3.5

8 16 32 64 128

ReplicationFactor

Number of Partitions

USA-Road

VCR Grid DBH HDRF HCR HG ECR LDG FNL MTS

2

4

6

8

10

12

14

16

18

20

22

24

8 16 32 64 128

Number of Partitions

Twitter

0

5

10

15

20

25

30

16 32 64 128

Number of Partitions

UK2007-05

Figure 2: Replication factors of USA-Road, Twitter and UK2007-05 graphs over 8 to 128 partitions

the execution of the workload for a particular configura-tion of partition size and SGP algorithm. This representsthe relationship between network communication and thereplication factor independent of the partitioning algorithmand partition size. The network communication is a linearfunction of the replication factor, however the cut model andworkload affect the slope of the line. Note that we do notdistinguish between algorithms within a cut model, as allalgorithms in a particular model follow a similar trend.The correlation between the replication factor and net-

work communication is explained by synchronization be-tween the master and mirror vertices before and after theapply phase. After the gather phase, each mirror sends itspartial results to the master to compute the aggregated ver-tex data. The master vertex then updates the mirrors withthe final result before the scatter phase. Since the edge-cutapproaches group all out-edges of a vertex together, the scat-ter phase is performed locally and the master vertex doesnot have to update the mirrors with updated vertex data.In the vertex-cut and hybrid-cut approaches, on the otherhand, master vertices are required to update the mirrorswith the updated vertex data, causing them to incur largernetwork communication compared to the edge-cut model.Therefore, edge-cut partitioning has less network communi-cation for the same replication factor compared to hybrid-cutand vertex-cut for PageRank whereas all cut models exhibitsimilar trends for other workloads, as shown in Figure 1(a).

Among the workloads, PageRank incurs the most networkcommunication for a given replication factor because it isan all active algorithm, i.e. all vertices are active at eachiteration. On the other hand, WCC and SSSP incur muchsmaller communication overhead as the set of active verticesvaries in each iteration.

CPU and memory consumption are not the key runtimemetrics as we use the same system across all experiments.Replication factor, by definition, is an indicator of the mem-ory usage and its effect is the same on each machine as allSGP algorithms produce balanced partitions. Similarly, CPUconsumption is correlated with the particular system andworkload characteristics, and therefore it is not affected bypartitioning, as reported by previous studies [41]. Distribu-tion of CPU consumption, on the other hand, is correlatedwith computational load balance as all experiments are con-ducted in memory, which we discuss next.In Figure 4 we present the distribution of computation

time of individual machines on a 64 machine cluster duringthe execution of PageRank to answer Q2. We observe thatbalance in partition sizes does not necessarily translate tocomputational load balance. In particular, edge-cut methodsperform poorly in skewed graphs as all edges of high-degreevertices are grouped together, causing a subset of machinesto be overloaded. On the other hand, uniform degree distri-butions of low-degree graphs enable edge-cut methods toachieve balanced load distribution as shown in 4(a), even

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1384

Page 11: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

better than vertex-cut methods.These results show that the replication factor is indeed

an accurate indicator to model the communication overheadfor offline graph analytics. Algorithms in the edge-cut classincur less network I/O compared to vertex-cut algorithmswith the same replication factor if all communication is uni-directional, i.e. along the direction of either the in- or out-edges. On the other hand, computational load balance isheavily influenced by the graph characteristics. Edge-cutSGP algorithms cause load imbalance on heavy tailed graphsas they group all edges of high degree vertices together.

6.2.2 Workload Performance. In Figure 3 we present theexecution time for all workloads on all cluster sizes on theTwitter graph. As noted earlier, the UK2007-05 web graphdoes not fit in a cluster of 8 machines. Therefore, we onlyfocus on 16 to 64 partitions for the scalability analysis. Inline with the findings in Section 6.2.1, edge-cut SGP algo-rithms exhibit poor performance on the Twitter graph, evenfor offline algorithms such as MTS. ECR yields lower execu-tion times compared to VCR when the number of partitionsis low due to lower replication factor. However, the differ-ence diminishes as the number of partitions increases dueto higher load imbalance of edge-cut methods on skewedgraphs. Edge-cut SGP algorithms cause significant load im-balance compared to vertex-cut ones for the heavily skewedTwitter (Figure 4(b)) and UK2007-05 (Figure 4(c)) datasets.This explains why LDG and FNL result in longer executiontimes on UK2007-05 (Figure 13) compared to HDRF and HCRdespite similar cut sizes.

On graphs with a skewed degree distribution (Twitter andUK207-05), all vertex-cut and hybrid-cut algorithms havesimilar load balance properties and therefore HDRF providesthe best workload performance among vertex-cut algorithmsas it provides the lowest replication factor. On the otherhand, hybrid-cut algorithms exhibit similar workload per-formance compared to HDRF despite the higher replicationfactor. The hybrid-cut model does so by lowering the syn-chronization cost for low degree vertices for algorithms withuni-directional communication such as PageRank, since itgroups together all edges of a low-degree vertex (Figure 1(a)).Figure 13 shows that edge-cut SGP algorithms LDG and

FNL incur the lowest execution times on regular, low-degreegraphs like the USA-Road network. Given that these algo-rithms achieve computation load balance on low-degreegraphs, reductions in the replication factor translates tolower execution times compared to vertex-cut methods.

We look at the performance of different workloads on theTwitter graph in order to better understand the impact ofpartitioning for offline graph analytics. PageRank is an allactive algorithm with heavy communication, and thereforethe impact of the replication factor on the performance is

Table 4: Edge-cut ratio for LDBC SNB SF-1000 GraphPartitions ECR LDG FNL MTS

4 0.75 0.74 0.47 0.318 0.87 0.81 0.58 0.4116 0.94 0.82 0.63 0.4632 0.97 0.84 0.66 0.51

the most notable for PageRank, particularly for vertex-cutand hybrid-cut algorithms due to better load balance. On theother hand, the performance difference between partitioningalgorithms diminishes for WCC and SSSP due to irregularcommunication, as depicted in Figure 3.

One trend we observe for all workloads is that increasingthe number of partitions does not improve the performanceafter 64 partitions as communication overhead dominatesthe execution time due to the decrease in partition sizes.A similar trend is observed on other datasets during theexecution of PageRank as shown in Figure 13. This suggeststhat it is important to find the right scale-out factor thatoptimizes the communication-to-computation ratio.The upshot of the runtime experiments is that the cut

size is not always a good predictor of workload executionperformance. In some workloads, such as PageRank, wherethere is heavy communication among all vertices at eachiteration, there is a correlation, but this does not hold for allworkloads. This partially explains why previous works havefocused on cut size minimization as an objective, since alltheir experiments focused on PageRank.

6.3 Online Graph Queries6.3.1 Impact of Partitioning Quality. JanusGraph only sup-ports the edge-cut model and therefore we use the edge-cutratio as the main structural metric to evaluate the partition-ing quality. The edge-cut ratio of edge-cut SGP algorithmson the LDBC SNB SF-1000 dataset is presented in Table 4. Itshows that FNL is able to achieve edge-cut ratios comparableto MTS, confirming the results in the original study [40].We investigate the effectiveness of cut size as an indicatorof the communication overhead for online graph queriesin order to answer Q1. Figure 5 presents the total networkcommunication plotted against the edge-cut ratio during theexecution of the 1-hop query workload on the LDBC SNBSF-1000 graph. We do not distinguish between individualalgorithms since all edge-cut SGP algorithms show a similartrend for online graph queries and the total network commu-nication is a linear function of the edge-cut ratio. Therefore,we conclude that the edge-cut ratio is a reliable indicator ofthe communication cost for online graph query workloadson distributed graph databases.Figure 7 and Figure 15 depict the distribution of vertex

access counts over 16 machines during the execution of the1-hop traversal workload on all datasets. Unlike analytical

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1385

Page 12: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

���������������������

����� � ������ � ����� �

��������

����������

�����

� �������

���������������������

����� ������� � ����� �

�� �������

��� ���� � ��� �� � ��� �� �� ���

���������������������

����� ������� � ����� �

�� �������

���������������������

����� ������� � ����� �

�� �������

���������������������

����� ������� � ����� �

��� �������

��

��

��

��

��

����� � ������ � ����� �

����������

�����

��

��

��

��

��

����� � ������ � ����� ��

��

��

��

��

��

����� � ������ � ����� ��

��

��

��

��

��

����� � ������ � ����� ��

��

��

��

��

��

����� � ������ � ����� �

����������������

����� � ������ � ����� �

����������

�����

����������������

����� � ������ � ����� �����������������

����� � ������ � ����� �����������������

����� � ������ � ����� �����������������

����� � ������ � ����� �

Figure 3: Performance of offline graph analytic workloads on Twitter Graph

��

��

��

��

�� �� ��� ��� �� � �� �� ��� ���

���

����������

���� �

�� ���������

!"#���$% �$ �"#

(a) USA-Road Network

������������������������

�� �� ��� ��� �� � �� �� ��� ���

���

����������

���� �

�� ���������

!"#���$% �$ �"#

(b) Twitter

���������������������������

�� �� ��� ��� �� � �� �� ��� ���

���

����������

���� �

�� ���������

!"#���$% �$ �"#

(c) UK2007-05 Web Graph

Figure 4: Distribution of computation time of individual workers in 64 machine cluster during the execution ofPageRank. Lines from bottom to top represent the minimum, 25th percentile, median, 75th percentile and themaximum of the distribution, respectively.

���������������������

��� ��� �� ��� �� ��� ��� �

���������

�� ���������

� ������ �����

� ������ ����������

Figure 5: Edge-cut ratio vs network I/O during the ex-ecution 1-hop query workload on LDBC SNB SF-1000.

workloads, both FNL and LDG suffer from computationalload imbalance regardless of the graph characteristics. Onlineworkloads pose significant challenges to graph systems dueto the workload skew in addition to the degree skew inthe data graph. Dynamic changes in these workloads makethe structural metric-based formulation of SGP even less

effective in modeling the workload performance comparedto offline graph analytics. Workload skew causes hotspotson a subset of partitions and results in system performancedegradation. We defer the discussion of how the workloadskew affects the performance to the next section.

6.3.2 Workload Performance. Figure 6 presents the aggre-gated throughput of the 1-hop and 2-hop traversal workloadson the LDBC-SNB SF-1000 graph. We observe that partition-ing has a smaller impact on performance compared to theoffline graph analytics, where the choice of partitioning algo-rithm might speed up processing by up to 5×. As an offlinealgorithm, MTS provides the best performance in terms ofaggregated throughput and it provides 25% and 18% improve-ment over hash partitioning for 1-hop and 2-hop traversalworkloads, respectively. Similar performance results havebeen reported [33]. We do not observe any significant dif-ference in throughput for single-pair shortest path queries,

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1386

Page 13: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

Table 5: Mean and tail (99th percentile) latencies (inms) for the execution of 1-hop traversal workload ona 16 machine cluster.

Algorithm Medium Load High LoadMean 99th Mean 99th

ECR 30 64 46 95LDG 30 65 47 155FNL 29 81 56 323MTS 25 60 42 96

therefore we do not include the results for this workload.We measure the throughput under two different scenarios:

(i)medium load describes the case where there are 12 concur-rent clients per worker and the system is at high utilization,and (ii) high load describes the case where the number ofconcurrent clients is doubled and system is overloaded withincoming requests. Figure 6 presents the aggregated through-put for both the 1-hop and 2-hop traversal workloads onthe LDBC-SNB SF-1000 graph and Figure 14 in Appendix Dpresents the aggregated throughput for the 1-hop traversalworkload on real-world graph datasets on a 16 machine clus-ter. We observe that better edge-cut ratio (Table 4) leads tohigher throughput only under medium load. Next, we look atthe computational load distribution to better understand thesystem behaviour under high load. We observe that higherload imbalance of FNL and LDG on all datasets cause per-formance degradation for these methods under high loadscenario due to subset of machines being overloaded.

We look at the latency distribution of the 1-hop traversal,specifically the mean and tail latency, to better understandthe impact of load balance on system performance for onlinegraph queries. Table 5 presents the mean and tail latencies ofthe 1-hop traversal queries on the LDBC-SNB SF1000 graph.Mean latency under both scenarios is a function of through-put as expected. However, the tail (99th percentile) latencyis significantly affected by load balance under high load.Edge-cut SGP algorithms, such as LDG and FNL, cause taillatency to be significantly higher than that of ECR – up to3.5× in case of FNL. Combined with the decrease in through-put, edge-cut SGP methods are not a favorable alternative tohash partitioning for a scale-out graph system that processesonline graph query workloads under high load.

All the results we present in this section show that existingSGP algorithms are not effective for online graph queriesand simple hash-based random partitioning provides a betterthroughput vs latency trade-off. We observe that the mainreason is the load imbalance caused by the workload skew,which is not captured by the formulation of existing SGPalgorithms as discussed next.

6.3.3 Impact of the Workload Skew on Performance. Unlikeanalytic workloads that access the entire graph iteratively,online graph queries typically access a portion of the graph

and workload skew creates hotspots on a subset of parti-tions. These hotspots cause imbalances in load distributionas shown in Figure 7 and 15. The key to achieving a bal-anced load distribution in online graph queries is to utilizethe workload information and dynamically adapt the parti-tioning to reflect the changing workload. Existing SGP algo-rithms completely ignore the workload skew (except [19])and only focus on structural partitioning metrics.

We record vertex and edge accesses during the executionof the 1-hop query workload to compute a weighted graphwhere weights represent the access ratio. Then, we computea 16-way balanced partitioning of this weighted graph usingMTS. Figure 8 compares the total throughput and the loaddistribution for the execution of the same 1-hop query work-load to those of original unweighted graph. It shows thatworkload patterns can be utilized to improve performance;partitioning of the graph using complete workload informa-tion provides 13% to 35% improvement in throughput andleads to a balanced load distribution.

6.4 DiscussionWe summarize the main findings of our analysis in a decisiontree to guide readers through our results and to aid usersin selecting a suitable SGP algorithm (Figure 9). First, werecognize the limitations of the literature on online graphquery workloads and recommend hash-based partitioning asa simple but effective solution, especially for latency criticalapplications. On the other hand, FENNEL can improve theaggregated throughput at the expense of higher tail latencyfor systems under medium load. For graph analytics, graphtype and degree distribution play the most important rolein picking the right SGP algorithm. Edge-cut methods, FEN-NEL in particular, are effective for low-degree graphs likeroad networks. Hybrid model is most effective on heavy-tailed graphs like online social networks, particularly due tolower communication cost for uni-directional algorithms asPageRank. For graphs with power-law degree distribution,we recommend HDRF as its lower cut size and well-balancedload distribution provide the best performance.A common theme in our results is that the partitioning

quality on its own does not translate to improvements inworkload performance, contrary to common motivation be-hind most algorithms in literature. Cut size itself is not areliable predictor of workload performance unless computa-tional load balance is achieved, an aspect largely neglectedby existing algorithms. For online graph query workloads,we observe that a simple, hash-based partitioning provides abetter trade-off between latency and throughput objectivescompared to existing SGP methods as this class of work-loads additionally exhibit workload execution skew. Theseresults draw parallels with earlier work on query processing

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1387

Page 14: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

�����������������������������������

�� ��� ��� ���

������������

������������

� ����������

�����������������������������������

�� ��� ��� ���

���������������� ��� ��� ���

�����������������������������������

�� ��� ��� ���

�� ����������

���

���

���

��

����

�� ��� ��� ��� �����������

������������

��� �� ����!"

���

���

���

��

����

�� ��� ��� ������ �� ����!"

���

���

���

��

����

�� ��� ��� ������ �� ����!"

Figure 6: Aggregate throughput on LDBC-SNB SF-1000 graph under medium and high load.

������

�����

�������

�����

�������

�� �� � ���

�������

������ �� ����������

������ �� ��!���

Figure 7: Distribution of the number of vertices readfrom each worker on a 16 machine cluster during theexecution of 1-hop traversal workload on LDBC-SNBSF-1000. Lines from bottom to top represent the mini-mum, 25th percentile, median, 75th percentile and themaximum of the distribution, respectively.

����

����

����

����

�����

�� �� � ��� ��

���

���

���

���

��������������� � �

����

������������

���������

���������� �� !�"���#$%

Figure 8: Throughput and relative standard deviationof load distribution on LDBC SNB SF-1000 graph for1-hop query workload. W is the weighted graph.

on distributed relational databases, where hash-based parti-tioning is known to be resilient to both data and executionskew [16]. Existing literature suggest and our results confirmthat workload-aware fragmentation strategies are effectiveto improve performance of OLTP applications [14, 35] andtherefore it is necessary to design and develop such strate-gies for scale-out graph databases for improved performanceon online graph query workloads.

Yes

Low

Online Queries

Tail LatencyYes

No

Workload Query Complexity

Hashing

Optimization Objective

FENNEL

Low-degree

Power-law

HDRFHybrid

Throughput

High

Analytics

No

Figure 9: Decision tree for picking a SGP algorithm

7 CONCLUSIONIn this paper we present results from extensive experimentalanalysis of existing classes of streaming graph partitioning al-gorithms across both offline graph analytic and online graphquery workloads over three very large datasets with differentcharacteristics (USA Road Network, Twitter and UK2007-05Web Graph). Our results highlight important factors andinteractions that impact system performance that have notbeen discussed in previous studies [1, 21, 41]. To summarize,we find that none of the existing partitioning algorithmsperform uniformly the best in all cases in terms of workloadperformance. Graph structure, workload characteristics andapplication requirements should be considered when pickingthe right partitioning algorithm.This work highlights a number of directions for future

work. One direction is to develop algorithms that considerthe factors indicated above: (i) impacts of workload andgraph characteristics on computational load balance, and(ii) impacts of workload execution skew on the workloadperformance. Another direction is to study the appropriatescale-out factor given a particular graph and workload char-acteristics. This is a general problem but is important in thiscontext since some of the algorithms are sensitive to thecommunication-to-computation ratio.

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1388

Page 15: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

REFERENCES[1] Zainab Abbas, Vasiliki Kalavri, Paris Carbone, and Vladimir Vlassov.

2018. Streaming graph partitioning: an experimental study. Proc. VLDBEndowment 11, 11 (2018), 1590–1603.

[2] Amine Abou-Rjeili and George Karypis. 2006. Multilevel Algorithmsfor Partitioning Power-law Graphs. In Proc. 20th IEEE Int. Parallel &Distributed Processing Symp. 124–124. https://doi.org/10.1109/IPDPS.2006.1639360

[3] Khaled Ammar and M. Tamer Özsu. 2018. Experimental Analysis ofDistributed Graph Systems. Proc. VLDB Endowment 11 (2018). Forth-coming.

[4] Konstantin Andreev and Harald Racke. 2006. Balanced Graph Par-titioning. Theor. Comp. Sci. 39, 6 (2006), 929–939. https://doi.org/10.1007/s00224-006-1350-7

[5] Apache 2016. Apache Giraph. http://giraph.apache.org. (2016).[6] Timothy G. Armstrong, Vamsi Ponnekanti, Dhruba Borthakur, and

Mark Callaghan. 2013. LinkBench: a database benchmark based on theFacebook social graph. In Proc. ACM SIGMOD Int. Conf. on Managementof Data. 1185–1196. https://doi.org/10.1145/2463676.2465296

[7] Charles-Edmond Bichot and Patrick Siarry. 2013. Graph partitioning.John Wiley & Sons.

[8] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. 2011.Layered Label Propagation: A MultiResolution Coordinate-Free Or-dering for Compressing Social Networks. In Proceedings of the 20thinternational conference on World Wide Web, Sadagopan Srinivasan,Krithi Ramamritham, Arun Kumar, M. P. Ravindra, Elisa Bertino, andRavi Kumar (Eds.). ACM Press, 587–596.

[9] Paolo Boldi and Sebastiano Vigna. 2004. The WebGraph FrameworkI: Compression Techniques. In Proc. of the Thirteenth InternationalWorld Wide Web Conference (WWW 2004). ACM Press, Manhattan,USA, 595–601.

[10] Florian Bourse, Marc Lelarge, and Milan Vojnovic. 2014. Balancedgraph edge partition. In Proc. 20th ACMSIGKDD Int. Conf. on KnowledgeDiscovery and Data Mining. ACM, 1456–1465.

[11] S. Brin and L. Page. 1998. The Anatomy of a Large-Scale HypertextualWeb Search Engine. Comp. Netw. 30, 1-7 (1998), 107 – 117.

[12] Aydın Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, andChristian Schulz. 2016. Recent Advances in Graph Partitioning. InAlgorithm Engineering - Selected Results and Surveys. Lecture Notesin Computer Science, Vol. 9220. Springer, 117–158. https://doi.org/10.1007/978-3-319-49487-6_4

[13] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. 2015. Power-Lyra: Differentiated Graph Computation and Partitioning on SkewedGraphs. In Proc. 10th ACM SIGOPS/EuroSys European Conf. on Comp.Syst. Article 1, 15 pages. https://doi.org/10.1145/2741948.2741970

[14] Carlo Curino, Evan Jones, Yang Zhang, and SamMadden. 2010. Schism:a workload-driven approach to database replication and partitioning.Proc. VLDB Endowment 3, 1 (2010), 48–57. Issue 1-2. http://dl.acm.org/citation.cfm?id=1920841.1920853

[15] Dong Dai, Wei Zhang, and Yong Chen. 2017. IOGP: An IncrementalOnline Graph Partitioning Algorithm for Distributed Graph Databases.ACM, 219–230.

[16] D. J. DeWitt and J. Gray. 1992. Parallel Database Systems: The Futureof High Performance Database Systems. Commun. ACM 35, 6 (1992),85–98.

[17] Orri Erling, Alex Averbuch, Josep Larriba-Pey, Hassan Chafi, AndreyGubichev, Arnau Prat, Minh-Duc Pham, and Peter Boncz. 2015. TheLDBC social network benchmark: Interactive workload. In Proc. ACMSIGMOD Int. Conf. on Management of Data. ACM, 619–630.

[18] Ioanna Filippidou and Yannis Kotidis. 2015. Online and on-demandpartitioning of streaming graphs. In Proc. 2015 IEEE Int. Conf. on Big

Data. IEEE, 4–13.[19] Hugo Firth and Paolo Missier. 2017. TAPER: query-aware, partition-

enhancement for large, heterogenous graphs. dapd 35, 2 (2017), 85–115.

[20] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, andCarlos Guestrin. 2012. PowerGraph: Distributed Graph-parallel Com-putation on Natural Graphs. In Proc. 10th USENIX Symp. on OperatingSystem Design and Implementation. 17–30. http://dl.acm.org/citation.cfm?id=2387880.2387883

[21] Yong Guo, Sungpack Hong, Hassan Chafi, Alexandru Iosup, and DickEpema. 2017. Modeling, analysis, and experimental comparison ofstreaming graph-partitioning policies. J. Parallel and Distrib. Comput.108 (2017), 106 – 121. https://doi.org/10.1016/j.jpdc.2016.02.003

[22] Minyang Han, Khuzaima Daudjee, Khaled Ammar, M. Tamer Özsu,Xingfang Wang, and Tianqi Jin. 2014. An Experimental Comparisonof Pregel-like Graph Processing Systems. Proc. VLDB Endowment 7,12 (2014), 1047–1058. http://www.vldb.org/pvldb/vol7/p1047-han.pdf

[23] Jiewen Huang and Daniel J Abadi. 2016. Leopard: lightweight edge-oriented partitioning and replication for dynamic graphs. Proc. VLDBEndowment 9, 7 (2016), 540–551.

[24] Nilesh Jain, Guangdeng Liao, and Theodore L Willke. 2013. Graph-builder: scalable graph etl framework. In Proc. 1st Int. Workshop onGraph Data Management Experiences and Systems. ACM, 4.

[25] U. Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. 2009.PEGASUS: A Peta-Scale Graph Mining System Implementation andObservations. In Proceedings of the 2009 Ninth IEEE International Con-ference on DataMining. 229–238. https://doi.org/10.1109/ICDM.2009.14

[26] George Karypis and Vipin Kumar. 1996. Parallel Multilevel K-wayPartitioning Scheme for Irregular Graphs. In Proc. 1996 ACM/IEEE Conf.on Supercomputing. Article 35. https://doi.org/10.1145/369028.369103

[27] George Karypis and Vipin Kumar. 1998. A fast and high quality multi-level scheme for partitioning irregular graphs. SIAM J. on ScientificComput. 20, 1 (1998), 359–392.

[28] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010.What is Twitter, a social network or a news media?. In Proc. 19th Int.World Wide Web Conf. ACM, 591–600.

[29] Michael LeBeane, Shuang Song, Reena Panda, Jee Ho Ryoo, and Lizy KJohn. 2015. Data partitioning strategies for graph workloads on hetero-geneous clusters. In Proc. 2015 ACM/IEEE Conf. on High PerformanceComputing, Networking, Storage and Analysis. IEEE, 1–12.

[30] Grzegorz Malewicz, MatthewH. Austern, Aart J. C. Bik, James C. Dehn-ert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel:a system for large-scale graph processing. In Proc. ACM SIGMOD Int.Conf. on Management of Data. 135–146.

[31] Daniel W. Margo and Margo I. Seltzer. 2015. A Scalable DistributedGraph Partitioner. Proc. VLDB Endowment 8, 12 (2015), 1478–1489.http://www.vldb.org/pvldb/vol8/p1478-margo.pdf

[32] Robert Ryan McCune, TimWeninger, and Greg Madey. 2015. ThinkingLike a Vertex: A Survey of Vertex-Centric Frameworks for Large-ScaleDistributed Graph Processing. ACM Comput. Surv. 48, 2 (2015), 25:1–25:39. https://doi.org/10.1145/2818185

[33] Daniel Nicoara, Shahin Kamali, Khuzaima Daudjee, and Lei Chen. 2015.Hermes: Dynamic Partitioning for Distributed Social Network GraphDatabases. In Proc. 18th Int. Conf. on Extending Database Technology.25–36. https://doi.org/10.5441/002/edbt.2015.04

[34] Joel Nishimura and Johan Ugander. 2013. Restreaming graph parti-tioning: simple versatile algorithms for advanced balancing. In Proc.19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining.ACM, 1106–1114.

[35] M. Tamer Özsu and Patrick Valduriez. 2011. Principles of DistributedDatabase Systems (3rd ed.). Springer. Previous two editions of thebook were published by Prentice-Hall in 1991 and 1999, respectively.

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1389

Page 16: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

[36] Fabio Petroni, Leonardo Querzoni, Khuzaima Daudjee, Shahin Kamali,and Giorgio Iacoboni. 2015. Hdrf: Stream-based partitioning for power-law graphs. In Proc. 24th ACM Int. Conf. on Information and KnowledgeManagement. ACM, 243–252.

[37] Semih Salihoglu and Jennifer Widom. 2013. GPS: a graph processingsystem. In Proc. 25th Int. Conf. on Scientific and Statistical DatabaseManagement. 22:1–22:12. https://doi.org/10.1145/2484838.2484843

[38] Aneesh Sharma, Jerry Jiang, Praveen Bommannavar, Brian Larson,and Jimmy Lin. 2016. GraphJet: real-time content recommendations attwitter. Proceedings of the VLDB Endowment 9, 13 (2016), 1281–1292.

[39] Isabelle Stanton and Gabriel Kliot. 2012. Streaming Graph Partitioningfor Large Distributed Graphs. In Proc. 18th ACM SIGKDD Int. Conf. onKnowledge Discovery and Data Mining. 1222–1230. https://doi.org/10.1145/2339530.2339722

[40] Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic,and Milan Vojnovic. 2014. FENNEL: Streaming Graph Partitioning forMassive Scale Graphs. In Proc. 7th ACM Int. Conf. Web Search and DataMining. 333–342. https://doi.org/10.1145/2556195.2556213

[41] Shiv Verma, Luke M. Leslie, Yosub Shin, and Indranil Gupta. 2017. AnExperimental Comparison of Partitioning Strategies in DistributedGraph Processing. Proc. VLDB Endowment 10, 5 (2017), 493–504.

[42] Cong Xie, Wu-Jun Li, and Zhihua Zhang. 2015. S-powergraph: Stream-ing graph partitioning for natural graphs by vertex-cut. arXiv preprintarXiv:1511.02586 (2015).

[43] Cong Xie, Ling Yan, Wu-Jun Li, and Zhihua Zhang. 2014. Distributedpower-law graph computing: Theoretical and empirical analysis. InAdvances in Neural Information Proc. Systems 27, Proc. 28th AnnualConf. on Neural Information Proc. Systems. 1673–1681.

[44] Ning Xu, Bin Cui, Lei Chen, Zi Huang, and Yingxia Shao. 2015. Het-erogeneous environment aware streaming graph partitioning. IEEETrans. Knowl. and Data Eng. 27, 6 (2015), 1560–1572.

A SGPWITH GENERALIZED COSTMODELS

Section 4 presents a classification of existing literature onSGP based on the cut and the input stream models and de-scribe notable examples in each class. One common charac-teristic is the focus on optimizing structural metrics such asthe number of edges that lie in partition boundaries (edge-cut methods) and the replication factor (vertex-cut methods).Optimizing such measures might not necessarily minimizenetwork communication or achieve a balanced workloaddistribution if the graph is not uniformly accessed or if thecluster does not have a homogeneous structure. We providea brief overview of some work on streaming graph partition-ing that considers different objective functions, workloadand cluster characteristics although we do not include themin our experiments as they are not commonly adopted andtheir scope is limited to specific applications.

Re-streaming versions of LDG and FENNEL [34] can gen-erate a balanced partitioning on any vertex attribute a(u) bysubstituting

�� P ti �� with x ti =∑u ∈P ti

a(u) in Equation (4) and(5). Similarly, the communication cost C(P t ) of streamingalgorithms can be modified to optimize the edge-cut ratioand replication factor under message aggregation. Bourse et.al. [10] propose a class of streaming policies that optimizes

the edge-cut ratio and replication factor with aggregation.TAPER [19] is a workload-aware edge-cut SGP algorithm

for subgraph matching workloads. It continuously monitorsincoming subgraph matching queries to discover frequentpatterns and uses an LDG-like heuristic that reduces thepossibility of inter-partition traversals.The algorithms discussed so far assume a homogeneous

cluster where each machine has identical resources. LeBeaneet. al. [29] propose an extension to the vertex-cut SGP al-gorithms Grid, PowerGraph and Ginger, that takes clusterheterogeneity into consideration. Similarly, Xu et. al. [44]propose Balanced Min-Increased as an edge-cut SGP algo-rithm that assigns each arriving vertex u to a partition P tithat minimizes the marginal cost under balance constraints.

B EDGE-CUT MODEL IN POWERLYRAEdge-cut algorithms produce vertex-disjoint partitioning andtherefore the edge-cut ratio is used to quantify the cut size.However, PowerLyra uses edge-disjoint placement scheme.In order to accurately evaluate edge-cut algorithms on Pow-erLyra, we need to: (i) produce an equivalent edge-disjoint(vertex-cut) partitioning from a given vertex-disjoint (edge-cut) partitioning, (ii) and show that the replication factoraccurately captures the communication cost of edge-cut par-titioning on PowerLyra.For a given vertex-to-partition mapping of an edge-cut

SGP algorithm that assign vertex u to partition Pi , we cre-ate an equivalent edge-disjoint (vertex-cut) partitioning byassigning all out-edges of vertex u to partition Pi . A similarmethod for undirected graphs is described in [10].

In edge-cut model, a network message is incurred for anyedge whose endpoints lie in different machines, as repre-sented in Figure 10(a). However, edge-cut systems employ acommon optimization technique called sender-side aggrega-tion to bound the number of messages incurred by a vertexby the number of partitions p [32]. In a nutshell, each workercombines the set of messages that are destined to same ver-tex u and sends a single message to the machine containingu as depicted in Figure 10(b).

Mirrors are created only for target vertices when edge-disjoint partitioning is derived from a vertex-disjoint parti-tioning as shown in Figure 10(c). Each mirror computes apartial sum and results are accumulated in the master dur-ing the gather phase, requiring network communication foreach mirror. Both the apply and scatter phases can be donewithout any synchronization since all out-edges required forthe scatter phase are placed locally. Figure 10(b) and 10(c)show that the communication patterns of edge-cut partition-ing with sender-side aggregation are identical to vertex-cutpartitioning where the edges are placed together with their

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1390

Page 17: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

source vertex. Below, we formally define the communica-tion cost of edge-cut based partitioning in PowerLyra andprove that the replication factor can accurately quantifiesthe communication cost;

Definition B.1. For any graph G(V ,E), we assign all out-edges of a vertex u to a partition i independently and uni-formly. Given that the degree sequence of the graph is d =(d(v)

),∀v ∈ V , the expected replication factor E

[C(P)

]of

such an assignment is;

= E[ ∑v ∈V

(# o f partitions with v ′s neiдhbours

)]− n

=∑v ∈V

E[(# o f partitions withv ′s neiдhbours

)]− n

=∑v ∈V

k∑i=1

(Pr ( ∃ u ∈ N (v) s .t . u ∈ Pi )

)− n

=∑v ∈V

k∑i=1

(1 − Pr ( � u ∈ N (v) s .t . u ∈ Pi )

)− n

=∑v ∈V

k∑i=1

(1 − (1 − 1

k )d (v)

)− n

=∑v ∈V

k(1 − (1 − 1

k )d (v)

)− n

= k∑v ∈V

(1 − (1 − 1

k )d (v)

)− n

= kn - k∑v ∈V

(1 − 1k )

d (v) − n

= n(k - 1) - k∑v ∈V

(1 − 1k )

d (v)

Following the notation in [10], we donate the momentgenerating function of the degree sequence d evaluated atvalue loд(1 − 1

k ) as

ψ (d,k) =1n

∑v inV

((1 −

1k)d (v)

)Then the replication factor of edge-cut partitioning in

PowerLyra is E[C(P)

]= n(k − 1)(1 −ψ (d,k)), which is iden-

tical to the cost of edge-cut partitioning with sender-sideaggregation in Proposition 2 of [10].

C JANUSGRAPH ARCHITECTUREFigure 11 depicts the architecture of JanusGraph deployed onour clusters. Each worker machine in the JanusGraph clusterconsists of one JanusGraph instance for query execution anda Cassandra instance for persistent storage. Even though thearchitecture logically separates the JanusGraph and Cassan-dra instances, co-location of execution and storage on thesame worker machine reduces the network overhead andenables JanusGraph instances to process their partition ofdata locally without additional network communication. Weconfigure the Cassandra instances to fit the entire workingset into memory to eliminate the overhead of disk access. In

3 6 P1

14

P2

25

P3

(a) Edge-cut Partition-ing

3 6 P1

14

P2

25

1 P3

(b) Edge-cut Partition-ing with Sender-side Ag-gregation

3 61

4

P1

14 6

P2

25

1

6

P3

(c) Vertex-cut Partition-ing where edges aregrouped by source

Figure 10: Different partitioning models and theirinter-machine communication on PowerLyra. Dashedlines represent networkmessages and colored verticesrepresent mirrors. Figures are adopted from [13]

C1 C2 CN…

S1 S2 SN…

Worker 1 Worker 2 Worker N

ComputeLayer

StorageLayer

Clients

Queries Results

Figure 11: Architecture of the JanusGraph

addition, we implement a partitioning-aware query routerin JanusGraph so that client queries are forwarded to thepartition that holds the starting vertex of the query.

D FULL SET OF RESULTS

����

����

����

����

����

����

����

���

� �� ��

��������������� � ������

�� �� �� ����������

��� ��� �� � !

Figure 12: Aggregate throughput of 192 concurrentclients running 1-hop traversals onLDBCSNBSF-1000graph, showing that increasing number of machinesnegatively impacts performance beyond 16 workers.

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1391

Page 18: Experimental Analysis of Streaming Algorithms for Graph ...sariyuce.com/sem/SIGMOD19.pdfdealing with streaming graphs is difficult. Re-partitioning algorithms such as Hermes [33] and

���������

������������

� �� �� �� ���

�������

������������ ���

��� �� �� �������

��� � �� ��� ���� ��� �� ��� �� �� !"#

$���$���$���$���$$�

� �� �� �� ������ �� �� �������

$���$���$���$��

� �� �� �� ������ �� �� �������

���������

���������

�� �� �� ���

����

������������ ���

������$���%���&�

���

�� �� �� ������$$�$$���$%�%$

�� �� �� ���

��

����������������

� �� �� �� ���

���

�����

������������ ���

��'����(

��������������������������������

� �� �� �� ���

)��

&�����������������������$������

� �� �� �� ���

###�

Figure 13: Performance of all offline graph analytic workloads on all graphs on all cluster sizes

����

�����

�����

�����

��� � ��� ����������������� � ������ ��������

���������������������������

��� � ��� ��

������� ����� ���� �!" ����

��������������������������������

��� � ��� ��

�#���$���

Figure 14: Aggregate throughput of 1-hop traversal workload on 16 machine cluster. Medium load represents 12concurrent clients per machine where system is at high utilization without overloading. High load represents 24concurrent clients per machine where system is overloaded.

�������

��������

�������

��������

�������

��������

�������

� � � ��� ���

�������

������ �� ���������

!�����"� ��#���

(a) USA-Road Network

������������

������������

������������

������������

�������

� � � ��� ���

�������

������ �� ���������

!�����"� ��#���

(b) Twitter

������

������

������

�����

�������

�������

�������

�������

� � � ��� ���

�������

������ �� ���������

!�����"� ��#���

(c) UK2007-05 Web Graph

Figure 15: Distribution of the number of vertices read from each individual worker machine in 16 machine clus-ter during the execution of 1-hop traversal workload. Lines from bottom to top represent the minimum, 25thpercentile, median, 75th percentile and maximum of the distribution, respectively.

Research 14: Graphs 2 SIGMOD ’19, June 30–July 5, 2019, Amsterdam, Netherlands

1392