
Trinity: A Distributed Graph Engine on a

Memory Cloud

Report by: Animesh Trivedi, [email protected]

May 2014

Abstract

Trinity [1] is an in-memory graph storage and processing system. It stores complete graph data (vertices, edges, and contextual data) in the main memory of a cluster of servers. Storing data in main memory eliminates the high disk access overheads in graph processing. Trinity supports fast graph exploration and efficient parallel computation by leveraging a highly optimized memory management and network communication stack. In this report we explain the motivation, design, and implementation of Trinity.

1 Introduction

Large graphs are ubiquitous. Examples include knowledge graphs, transportation networks, collaboration relationships among scientific works, communication networks, social networks (Facebook/Twitter), etc. Many real-world problems involve processing large amounts of data in a graph structure to uncover trends, perform searches, and derive associations and insights. Based on their processing needs, graph applications fall into two categories: online and offline. Online data processing needs low latency to be feasible for real-time use. An example of online processing is a connectivity query between two persons in a social graph. Since such queries are expected to finish quickly (1-2 seconds), they require low-latency graph exploration capability. Offline graph processing does not have such strict bounds on the execution time. It usually executes in a batch mode where a large portion of a graph structure is processed in parallel. This parallel processing requires high bandwidth for repeated and efficient access to the graph data. Examples of offline processing include calculating PageRank on the WWW graph, machine learning algorithms, etc.

However, both modes of graph processing share some common characteristics. First, they have a high data-access-to-computation ratio. This means they are I/O intensive and the processing time is governed by the I/O access time.


            Disk           PCIe Flash      DRAM
Latency     2-10 msec      25-40 µsec      50-100 nsec
Bandwidth   60-100 MB/s    600-800 MB/s    10-20 GB/s

Table 1: Bandwidths and access latencies of storage devices.

Here I/O time includes both network and storage time. Second, due to the nature of graph processing, both graph processing modes require random vertex data accesses. This is particularly true for graph exploration queries such as shortest-path or BFS. Unfortunately, modern storage technologies (disk or flash) still cannot provide the level of data bandwidth and random access latency stipulated by graph processing applications. Table 1 summarizes the performance gap between storing data on disk, flash, and main memory (DRAM). Keeping data in main memory has compelling advantages for graph processing applications because DRAM performance is orders of magnitude better than that of disk or flash.

2 Trinity System Design

Trinity is a distributed graph engine on a memory cloud. It stores the entire graph data in the DRAM of well-connected machines and accesses it over high-speed networks. This decision is motivated by the fact that the prices of DRAM and high-speed network cards are falling, and the total cost of ownership of DRAM-based systems is similar to or better than that of traditional disk-based systems. However, the design of Trinity does not have any specific hardware or software dependencies.

Trinity does not come with comprehensive built-in graph computation modules. However, it can be morphed easily to support any kind of computation. The key insight in the design is that, instead of optimizing for a certain type of graph processing (e.g., Bulk Synchronous Parallel (BSP) in Pregel [2]), Trinity is designed to express any algorithm using fast graph exploration. With this fast graph exploration capability, the system easily supports both online graph queries and offline graph analytics.

2.1 System Components

A Trinity system consists of three main components: slaves, proxies, and clients. These components communicate through network messages. Their specific roles are:

1. Slaves: A Trinity slave stores and processes a portion of the graph data in memory. The processing is triggered in response to a message from other slaves, proxies, or clients.


Figure 1: Trinity system components and layers. (a) Trinity components; (b) system layers.

2. Proxies: Trinity proxies only handle messages, without storing any data. As they usually sit between clients and slaves, one possible use for them is to serve as information aggregators (similar to the Aggregators in Pregel). Hence, for certain graph computations, they can act as performance optimizers. They are optional, and a Trinity system does not always need a proxy.

3. Clients: Trinity client-side logic is implemented as a shared library. It enables applications to store, process, and view the results of graph processing done on Trinity slaves. Clients communicate with Trinity slaves using the client-side library and its API calls.

Figure 1a shows the overall architecture with all components.

2.2 Distributed Memory Cloud Abstraction

Trinity organizes the memory of multiple well-connected machines into a globally addressable, distributed memory address space (called the memory cloud) to store graph data. The distributed memory abstraction is essentially a distributed key-value (KV) store. This distributed key-value store is built upon two components (as shown in Figure 1b): first, a Distributed Memory Storage layer, which is responsible for managing memory and providing concurrency control on every slave; second, a Message Passing Framework, which provides an efficient, one-sided, machine-to-machine message passing infrastructure.

The distributed memory of the slaves is divided into, and managed in, multiples of memory trunks: 2^p memory trunks are stored on m slave machines. Usually 2^p > m, hence each machine stores more than one memory trunk. Trinity divides memory into trunks instead of managing the memory of a single slave as one unit. This division has several advantages:


Figure 2: Key-value partitioning, addressing, and lookup.

• Locking is done at trunk granularity. Hence multiple data access requests for different trunks can proceed in parallel.

• A value lookup in the KV store is done using a trunk-specific hash table. A single hash table for the complete memory would increase the probability of hash collisions (and consequently give poor performance).

• Individual memory trunks can be backed up and restored in parallel for fault tolerance.

The Trinity key-value store has 64-bit globally unique keys and associated values which can be of arbitrary length. Accessing the value associated with a particular key requires two levels of lookup. As a first step, for a given key the responsible slave machine is located. This is done by hashing the given 64-bit key k to a p-bit value i, where i ∈ [0, 2^p − 1]. This translation denotes that the key k is stored in memory trunk i. To find out which slave machine stores memory trunk i, an addressing table is maintained. This table contains 2^p slots, where each slot stores the responsible slave machine ID. Each machine keeps a replica of the addressing table.

As a second step, for the given key k and the trunk ID i, Trinity looks up the offset and size of the associated value within the trunk. This lookup is done on a trunk-specific hash table: the 64-bit key k is hashed again to find the right slot in the trunk-specific hash table. The hash table entries store the size and trunk offset of the values stored in that particular memory trunk. Knowing the offset and the size, we retrieve the value associated with the key k. Figure 2 shows these steps in detail.
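The two-level lookup can be summarized with a short Python sketch. This is a minimal, illustrative model only, not Trinity's actual .NET implementation; the hash function, the in-memory dictionaries, and all names (trunk_of, addressing_table, etc.) are assumptions made for illustration.

# Minimal sketch of the two-level key lookup (illustrative only).
# Assumptions: 2^P memory trunks, a replicated addressing table, and a
# per-trunk hash table mapping keys to (offset, size) inside the trunk.
P = 8                                                 # 2^P trunks in the cloud
addressing_table = [i % 4 for i in range(2 ** P)]     # trunk ID -> machine ID (4 machines)
trunk_hash_tables = [dict() for _ in range(2 ** P)]   # trunk ID -> {key: (offset, size)}
trunk_memory = [bytearray() for _ in range(2 ** P)]   # trunk ID -> memory blob

def trunk_of(key):
    # Step 1a: hash the 64-bit key down to a p-bit trunk ID.
    return hash(key) & ((1 << P) - 1)

def machine_of(key):
    # Step 1b: the addressing table maps the trunk ID to a slave machine.
    return addressing_table[trunk_of(key)]

def put_value(key, value):
    # Append the value to the trunk and record its offset and size.
    i = trunk_of(key)
    offset = len(trunk_memory[i])
    trunk_memory[i].extend(value)
    trunk_hash_tables[i][key] = (offset, len(value))

def read_value(key):
    # Step 2: the trunk-specific hash table gives offset and size,
    # which locate the value bytes inside the trunk.
    i = trunk_of(key)
    offset, size = trunk_hash_tables[i][key]
    return bytes(trunk_memory[i][offset:offset + size])

put_value(12345, b"hello")
print(machine_of(12345), read_value(12345))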

Each key-value pair in the memory cloud can carry some metadata for various purposes. For concurrency control, each key-value pair is associated with a spinlock. The spinlock ensures that a key-value pair stays in a consistent state under concurrent reads and writes. It also ensures that a key-value pair is not moved during an access by the memory defragmentation thread, which runs to maximize memory utilization.
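As a rough illustration of per-pair concurrency control, the Python sketch below guards each cellID with its own lock so that concurrent readers and writers never observe a half-updated value. Python's threading.Lock blocks rather than spins, so it only approximates a spinlock, and all names here are assumptions for illustration.

# Per-cell locking sketch: one lock per key-value pair (cellID).
import threading

cell_locks = {}
cell_store = {}
_registry_lock = threading.Lock()

def lock_for(cell_id):
    # Lazily create one lock per cellID.
    with _registry_lock:
        return cell_locks.setdefault(cell_id, threading.Lock())

def write_cell(cell_id, value: bytes):
    with lock_for(cell_id):            # writers exclude concurrent readers
        cell_store[cell_id] = value

def read_cell(cell_id) -> bytes:
    with lock_for(cell_id):            # readers always see a consistent value
        return cell_store[cell_id]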

3 Data Model

Modern graph processing differs significantly from traditional graph problems such as shortest path or depth-first traversal. Traditional graph processing only involves processing vertex values and the graph topology. Modern real-world graphs tend to be information rich, with lots of contextual information associated with edges and vertices. Furthermore, parts of a graph can be restricted (e.g., Facebook's social graph). Traditional ways of storing graph data with values, for example in a relational database (RDBMS) with multiple tables, have high performance overheads. Though such a storage schema is intuitive to build, it involves expensive multi-way joins of relational tables for simple graph traversal queries. Hence storing and processing all contextual information with the graph data is a challenging task.

3.1 Graph Model

Trinity handles the graph modeling problem by storing all data (graph or contextual) in a flat distributed memory cloud. As previously described, in the key-value store a 64-bit key serves as a global, system-wide identifier and the value holds user-defined application data. An application data structure, or schema, is defined using TSL (see the next section). When a value is associated with a schema, it is called a cell, and a key-value pair becomes a (cellID, cell) pair. A cell is used to implement a node in a graph: it contains information about the vertex, its incoming and outgoing edges, and contextual information from various data sources such as RDBMSs, XML, plain files, etc. Edges can simply be saved within cells, where each cell contains the cellIDs of the connected vertices. An edge can also be modeled using a separate cell when it carries additional data of its own. In that case, a node cell contains a list of edge cellIDs (instead of node cellIDs).

3.2 Trinity Specification Language (TSL)

The Trinity Specification Language (TSL) is a high-level language that is used to model data and network communication in Trinity. It defines a data model, or schema, which is used to interpret the blob values stored in the key-value store.


[CellType: NodeCell]
cell struct Movie
{
    string MovieName;
    [EdgeType: SimpleEdge, ReferencedCell: Actor]
    List<uint64_t> Actors_CellIDs;
}

[CellType: NodeCell]
cell struct Actor
{
    string ActorName;
    [EdgeType: SimpleEdge, ReferencedCell: Movie]
    List<uint64_t> Movies_CellIDs;
}

Figure 3: Modeling a Movie and Actor graph.

For example, Figure 3 defines two types of graph nodes, namely Movie and Actor, using two cell structs. Like a normal C# struct, a cell structure contains a list of arbitrary data types. The sum of the individual storage sizes dictates the total storage size of a cell. In the example, the Actor and Movie cell structs contain the data elements List<uint64_t> Actors_CellIDs and List<uint64_t> Movies_CellIDs. These lists represent outgoing edges from the nodes.

In this example, edges are stored implicitly (as marked by SimpleEdge) by storing the 64-bit key identifiers (or cellIDs) of Actor or Movie NodeCells. EdgeCells can be used to model edges separately using the StructEdge or HyperEdge cell types. In that model, a node cell stores the cellIDs of EdgeCells rather than NodeCells.

TSL can also be used to model network communication. Modeling network communication has several advantages. First, it frees users to focus on graph computation instead of writing complex network messaging logic between slave machines. Second, by knowing the communication requirements and data types, Trinity can perform message-protocol-specific optimizations. Third, Trinity can pack multiple small messages together in order to reduce per-packet/per-message processing overheads in the networking stack.

Figure 4 shows how Trinity models a simple Echo protocol message exchange. In this example, a request and a response message (both of type struct MyMessage) are exchanged between a server and a client. The messaging type is synchronous. When this example is compiled, Trinity generates empty message handlers (similar to RPC stubs) for both the server and the client. The user only needs to implement the logic inside the message handlers; Trinity internally takes care of message dispatching, packing, etc. for the user.


struct MyMessage
{
    string Text;
}
protocol Echo
{
    Type: Syn;
    Request: MyMessage;
    Response: MyMessage;
}

Figure 4: Modeling message passing.

In essence, the advantages of TSL are:

• TSL provides object-oriented data manipulation capability (see the next section).

• TSL facilitates data integration by defining interfaces between a graph and external data sources such as a DBMS. This transparent interfacing allows transparent query processing and data access across many sources.

• TSL allows system extensions by capturing data schemas and communication protocols. TSL can be used to further develop more sophisticated graph querying systems using declarative languages.

3.3 Object-Oriented Cell Manipulation

To get an object-oriented interface to a graph, one could implement system structures (nodes, edges, messages, etc.) as run-time objects. However, objects in managed run-time languages (such as C# or Java) also carry language-associated metadata. The authors report that an empty run-time object in C# requires 24 and 12 bytes of memory on 64- and 32-bit systems, respectively. Moreover, since such an object might not be stored contiguously, a network transfer also necessitates expensive serialization and deserialization operations.

To eliminate the serialization overheads and optimize for memory utilization, Trinity stores graph data as binary blobs. A TSL specification, similar to the one in Figure 3, tells Trinity how to interpret and manipulate the binary data blobs. To facilitate user-friendly data manipulation, compiling a TSL specification generates a CellAccessor and field get/set functions. Using the CellAccessor, which takes a cellID, a user gets a reference to the cell structure. Subsequently, the user can use the get/set functions to manipulate values inside the cell structure. A cell accessor object is not a data container but a data mapper; hence this method does not involve any data copy overhead.


However, various other object-oriented principles, such as inheritance, might not be supported on such objects.
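To make the accessor idea concrete, the following Python sketch mimics what a generated cell accessor does: it maps field get/set operations directly onto a binary blob instead of copying data into a run-time object. The field layout, the struct packing, and the class and method names are assumptions for illustration; Trinity's generated accessors are .NET code driven by the TSL compiler.

# Illustrative cell accessor mapping get/set onto a blob (no data copy).
# Assumed layout for a Movie-like cell: 4-byte name length, the UTF-8
# name bytes, then 8-byte cellIDs until the end of the blob.
import struct

class MovieCellAccessor:
    def __init__(self, blob: bytearray):
        self.blob = blob                      # reference, not a copy

    def get_movie_name(self) -> str:
        (n,) = struct.unpack_from("<I", self.blob, 0)
        return self.blob[4:4 + n].decode("utf-8")

    def get_actor_cell_ids(self) -> list:
        (n,) = struct.unpack_from("<I", self.blob, 0)
        ids_bytes = self.blob[4 + n:]
        return list(struct.unpack("<%dQ" % (len(ids_bytes) // 8), ids_bytes))

    def append_actor_cell_id(self, cell_id: int) -> None:
        self.blob.extend(struct.pack("<Q", cell_id))   # in-place update

# Build a blob for a movie with one actor reference, then read it back.
name = "Inception".encode("utf-8")
blob = bytearray(struct.pack("<I", len(name)) + name + struct.pack("<Q", 42))
acc = MovieCellAccessor(blob)
print(acc.get_movie_name(), acc.get_actor_cell_ids())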

3.4 Consistency

During concurrent accesses, Trinity provides atomicity of operations (read/write) on a single cell using its associated spinlock. Trinity does not provide ACID transaction support. However, previous work has demonstrated how to build a transactional system on top of a system like Trinity [3].

4 Graph Computation Paradigms

Trinity supports a wide variety of graph computation paradigms by leveraging its high-bandwidth, low-latency graph data access capability.

4.1 Traversal Based Online Queries

Traversal-based online queries are the most common form of social graph queries. For example, people search on a social graph requires very fast graph exploration to dig out relationships and associations. Finding a friend of a friend who has a particular attribute (like a name or a university affiliation) is another common query type. These queries are hard to predict and cannot be pre-computed due to the dynamic nature and the size of social graphs.

Trinity supports very efficient traversal-based online queries by leveraging its memory-based graph exploration capability. On a synthetic power-law graph stored in an 8-machine cluster (with 800 million nodes, 104 billion edges, and an average node degree of 130), Trinity explored the entire 3-hop neighborhood of any node in the graph in about 100 milliseconds. This requires accessing roughly 130 + 130^2 + 130^3 ≈ 2.2 million nodes within those 100 milliseconds.

4.2 A New Paradigm for Online Queries

By storing web-scale graph data in memory, Trinity is able to support more sophisticated queries that were not possible before. For example, sub-graph matching on web-scale graphs is a hard problem. Existing systems rely on precomputed indexes, which require super-linear space and/or super-linear construction time, and perform matching against them. Trinity solves this problem by providing fast random data access and efficient parallel computation: it uses the power of fast random graph access to traverse the graph and match sub-graphs directly. As an example, Trinity can perform sub-graph matching on a graph of 128 million nodes with an average node degree of 16, for an average query size of 10 nodes, in under a second.

Further, fast exploration enables a global view of a graph and hence better algorithms. In a BSP graph processing system, each vertex has only a local view of the graph.


Figure 5: Execution of single-source shortest path in a BSP system. Subfigure (a) shows the initial graph with weighted edges, the source, and the destination.

Consequently, a vertex cannot possibly know about any global computation state. For example, the single-source shortest path (SSSP) problem in a BSP system converges only after the computation has propagated through the complete graph and there are no more messages in transit (see Figure 5). However, a better algorithm (e.g., Dijkstra's algorithm) executing at the source vertex could stop after the step shown in Figure 5(c). The global view of the computation at the source vertex enables it to stop the computation because the destination has been reached with a path cost of 2 and all other possible paths are costlier than 2.
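To illustrate the benefit of a global view, the Python sketch below runs a textbook Dijkstra search that stops as soon as the destination is settled, something a vertex-centric BSP program cannot do because no single vertex sees the whole frontier. The graph encoding and the function name are illustrative assumptions, not Trinity APIs.

# Dijkstra with early termination: possible when one worker has a
# global view of the exploration frontier (illustrative sketch).
import heapq

def shortest_path_cost(graph, source, destination):
    # graph: {vertex: [(neighbor, edge_weight), ...]}
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == destination:
            return d                       # stop early: destination settled
        if d > dist.get(v, float("inf")):
            continue                       # stale heap entry
        for u, w in graph.get(v, []):
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return float("inf")

# Tiny example: the destination is reached with cost 2, so the search
# stops without propagating over the rest of the graph.
g = {"s": [("a", 1), ("b", 5)], "a": [("t", 1)], "b": [("t", 1)], "t": []}
print(shortest_path_cost(g, "s", "t"))     # 2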

4.3 Vertex Centric Offline Analytics

Trinity supports the traditional vertex-centric computation model for offline graph analysis, as introduced by Pregel. Unlike Pregel, where any node can send messages to any other node, Trinity supports a restricted messaging model in which nodes can send messages only to their neighbors. Though restrictive, this model already supports many popular graph workloads such as PageRank, shortest path, etc. Additionally, sending messages to a known set of neighbors gives an opportunity to optimize message delivery, as we describe in Section 5.1. This results in good performance: on an 8-machine cluster, Trinity can perform one BSP iteration on a synthetic power-law graph of 1 billion nodes and 13 billion edges in under a minute.
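As a concrete illustration of neighbor-only, vertex-centric computation, the following Python sketch runs synchronous PageRank supersteps in which each vertex sends messages only along its outgoing edges. It is a single-machine toy model of the BSP style described above, not Trinity code; the damping factor, the superstep count, and the function name are assumptions.

# Toy vertex-centric PageRank: each superstep, a vertex sends rank/out_degree
# only to its neighbors, then every vertex sums its incoming messages.
def pagerank(out_edges, supersteps=20, d=0.85):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    for _ in range(supersteps):
        inbox = {v: 0.0 for v in out_edges}
        for v, neighbors in out_edges.items():        # "send" phase
            if neighbors:
                share = rank[v] / len(neighbors)
                for u in neighbors:
                    inbox[u] += share
        rank = {v: (1 - d) / n + d * inbox[v] for v in out_edges}   # "compute" phase
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(graph))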

4.4 A New Paradigm for Offline Queries

To reduce network communication overhead, Trinity also supports probabilistic inference, deriving an answer for the entire graph from the answer computed on a single machine or partition. An experiment that used landmark nodes to estimate shortest distances on Trinity demonstrated accuracy very close to that of the best approach.

This method can also be used by other graph processing systems that rely on graph partitioning, such as Pregel.

5 Implementation Details and Optimizations

Trinity is implemented on the Microsoft .NET framework.

5.1 Message Passing Optimization

In vertex-centric BSP graph processing (e.g., Pregel), the number of messages sent or received by a vertex depends upon its degree. For densely connected graphs, sending, buffering, and delivering these messages before the local computation starts can incur significant overhead. There are two possible ways to handle the large number of incoming messages. In one implementation, a vertex can be scheduled to run its local computation once all messages intended for it have been received. This approach significantly increases the space required to buffer all incoming messages; in the absence of enough buffer space, these messages must be stored on disk, and the random disk accesses needed to retrieve them incur significant performance penalties. A second approach is to run a vertex, fetch its messages on demand, and then discard them after the run. However, this approach increases network traffic, as the same set of messages might be needed by another vertex in the same partition on the same machine.

Trinity handles the message problem by (a) restricting communication to a fixed set of vertices, namely the immediate neighbors, and (b) dividing the neighboring vertices into normal and hub vertices. A neighbor vertex can belong to a local or a remote graph partition; all vertices in a local partition reside on a single machine. To manage messages efficiently, Trinity creates a bipartite partitioning of the local graph stored on the machine. As illustrated in Figure 6, the local graph stored on the machine then has three partitions. If a good bipartite partitioning is achieved, vertices in a partition need messages only from vertices in the same partition. So, as long as the machine can hold the messages required for the vertices in one partition in memory, it can schedule the computation on the local vertices belonging to that partition without waiting. Once the computation is over, it can discard the messages and move on to the next partition.

However, bipartite partitioning of a densely connected graph is itself a hard problem. In order to reduce the complexity of the problem, remote vertices are classified into two categories: hub and normal vertices. Hub vertices have a large degree and are connected to a large fraction of the local vertices. These hub vertices (and their associated edges) do not participate in the graph partitioning process, which significantly reduces the complexity of the problem.
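The degree-based split can be sketched as follows in Python. The threshold value, the data layout, and the function name are assumptions for illustration; the paper does not specify how Trinity picks the hub cutoff.

# Illustrative split of remote neighbor vertices into hub and normal sets.
# Hub vertices (very high degree) are excluded from bipartite partitioning
# and their messages are kept for the whole superstep instead.
def split_hub_and_normal(remote_degree, hub_threshold=1000):
    # remote_degree: {remote_vertex_id: number of local vertices it reaches}
    hubs = {v for v, deg in remote_degree.items() if deg >= hub_threshold}
    normal = set(remote_degree) - hubs
    return hubs, normal

hubs, normal = split_hub_and_normal({"u1": 5, "u2": 12000, "u3": 40})
print(hubs, normal)      # {'u2'} {'u1', 'u3'}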

Messages from the hub vertices are stored for the whole duration of one iteration. Messages from normal vertices are fetched on demand as we schedule a partition for computation, and they are discarded after use.


Figure 6: Bipartite view of local and remote vertices.

This approach is a hybrid of the two possible solutions discussed earlier. In order to pre-fetch remote messages before executing a vertex, one needs to know the execution order. The execution order depends upon the bipartite partitioning, because the partitions are executed in order. To receive messages promptly and reduce the waiting time, a local machine sends each remote machine an action script, which specifies the exact order of the messages it needs. Each machine merges the action scripts it receives from the other machines. After the execution starts, each machine sends messages based on the merged script, iteration after iteration.

Another advantage of this scheme is that the vertex execution and message delivery model becomes deterministic and predictable. This means that the entire graph does not have to be memory resident all the time: at any moment, Trinity only has to load the vertices of the partition that is scheduled to run, while the other vertices only have their messages stored in memory. This arrangement results in significant memory savings.

5.2 Circular Memory Management

Trinity stores all its data in the main memory of the slave machines. As in local memory management, internal fragmentation is an issue for memory utilization. Additionally, values in the key-value store can grow or shrink. Since DRAM space is expensive (both in terms of power and cost, compared to disk), Trinity cannot afford to under-utilize DRAM space.

Trinity uses a simple circular, log-structured approach for storing key-value pairs in memory. Unlike local memory management libraries such as malloc, Trinity is free to move and consolidate values in memory.


Figure 7: Circular log-structured memory management.

In the log-structured approach there are three pointers: the committed head, the committed tail, and the append head. The committed head and tail pointers mark the area where allocation can be done. The append head pointer points to the next free memory area in this region. Memory allocation is done by moving the append head pointer forward. Adjusting the length of a value or deleting it leaves a hole in the memory log. When the committed head and the append head reach the end of the trunk, Trinity activates a defragmentation daemon which scans the committed memory region and moves key-value pairs toward the append head pointer. After a pass, the freed memory space at the tail of the committed memory is released and the committed tail is moved forward, giving more free space for the committed head to advance. Figure 7 shows these movements.
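A minimal Python sketch of such a log-structured trunk is shown below, assuming a fixed-size trunk and a single compaction pass that copies live values to the front; it is illustrative and simplified (for example, the committed head/tail bookkeeping is collapsed into a live-object index, and the circular wrap-around is omitted).

# Illustrative log-structured trunk: append-only allocation plus a
# compaction pass that reclaims holes left by deleted or resized values.
class LogTrunk:
    def __init__(self, size):
        self.buf = bytearray(size)
        self.append_head = 0          # next free byte in the trunk
        self.index = {}               # key -> (offset, length) of live values

    def put(self, key, value: bytes):
        if self.append_head + len(value) > len(self.buf):
            self.compact()            # out of room: defragment first
            # (a full implementation would also re-check capacity here)
        off = self.append_head
        self.buf[off:off + len(value)] = value
        self.index[key] = (off, len(value))
        self.append_head = off + len(value)

    def delete(self, key):
        del self.index[key]           # leaves a hole until compaction

    def get(self, key) -> bytes:
        off, ln = self.index[key]
        return bytes(self.buf[off:off + ln])

    def compact(self):
        # Copy live values toward the front, freeing the holes behind them.
        new_buf, new_index, head = bytearray(len(self.buf)), {}, 0
        for key, (off, ln) in self.index.items():
            new_buf[head:head + ln] = self.buf[off:off + ln]
            new_index[key] = (head, ln)
            head += ln
        self.buf, self.index, self.append_head = new_buf, new_index, head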

5.3 Fault Tolerance

When a slave machine fails, its memory trunk contents are recovered from shared persistent storage (e.g., HDFS) onto different slave machines. The recovery also requires an update to the shared addressing table, because the locations of the memory trunks have changed. To keep the addressing table up to date, Trinity stores an authoritative copy of the table on a leader machine and a persistent replica in a distributed filesystem. When a client fails to access data on a slave machine, it informs the leader about that machine. Trinity also uses heartbeat messages to detect slave machine failures. Upon detecting a machine failure,


the leader (1) instructs data recovery; (2) updates the persistent copy of the addressing table on the distributed file system; (3) updates its own copy of the addressing table; and (4) broadcasts the new addressing table to all machines.

For different computation models, Trinity uses different fault recovery methods. BSP-based synchronous computation has an explicit synchronization step at the end of each superstep; Trinity makes checkpoints every few supersteps, and these checkpoints are written to the shared distributed file system. For asynchronous computation, a simple periodic interruption mechanism is used to create snapshots. For read-only queries, Trinity restarts the failed machine and reloads the data from the persistent disk storage. For online update queries, the updates are logged in an update log stored in remote memory before being committed to the local memory. If the local update fails, the data can be recovered and the update log from the remote memory re-applied to it.

6 Evaluation

Key evaluation highlights are:

• Online Queries: Trinity can perform people search (for 2-3 hops) on a social network graph (with an average node degree of 130) in under 100 milliseconds. This shows very good data-intensive, traversal-based online query performance.

• Offline Analysis: Calculating PageRank on a graph with one billion nodes and an average degree of 13 is done in under a minute. This demonstrates good offline performance.

• Traditional Algorithms: Trinity can perform breadth-first search on a billion-node graph with an average edge count of 13 in about 1028 seconds.

• vs. Parallel Boost Graph Library (PBGL): In comparison to PBGL, a generic C++ library for high-performance parallel and distributed graph computation, Trinity is 10x faster with a 10x smaller memory footprint for a similar breadth-first search workload.

• vs. Giraph: Giraph is an open-source implementation of the Pregel computation model. For PageRank calculations, Trinity is two orders of magnitude faster than Giraph.

• Scalability: The performance and memory requirements of Trinity scale gracefully with the number of nodes. This is achieved by reducing the run-time object memory footprint through plain binary blobs (Section 3.3) and by always keeping the active data in memory.


7 Conclusion

Trinity is an in-memory graph storage and query processing system. In this report we have discussed the design, implementation, and optimizations of Trinity. The key advantage of the system is performance: it is orders of magnitude faster than other systems. It is a real system that has been tested in a production environment at Microsoft. However, the source code is not open source, and the paper does not give a complete example of what a TSL program for graph processing looks like.

References

[1] Bin Shao, Haixun Wang, and Yatao Li. Trinity: A distributed graph engine on a memory cloud. In Proceedings of the 2013 ACM SIGMOD, pages 505–516, 2013.

[2] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD, pages 135–146, 2010.

[3] Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, and Christos Karamanolis. Sinfonia: A new paradigm for building scalable distributed systems. In Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 159–174, New York, NY, USA, 2007. ACM.
