Problem-Solving using Graph Traversals

Searching, Scoring, Ranking, and Recommendation

Marko A. Rodriguez
Graph Systems Architect
http://markorodriguez.com

http://twitter.com/twarko

AT&Ti Technical Talk - Glendale, California – July 27, 2010

July 26, 2010


Abstract

A graph is a data structure that links a set of vertices by a set of edges. Modern graph databases support multi-relational graph structures, where there exist different types of vertices (e.g. people, places, items) and different types of edges (e.g. friend, lives at, purchased). By means of index-free adjacency, graph databases are optimized for graph traversals and are interacted with through a graph traversal engine. A graph traversal is defined as an abstract path whose instance is realized on a graph dataset. Graph databases and traversals can be used for searching, scoring, ranking, and in concert, recommendation. This presentation will explore graph structures, algorithms, traversal algebras, graph-related software suites, and a host of examples demonstrating how to solve real-world problems, in real-time, with graphs. This is a whirlwind tour of the theory and application of graphs.

Outline

• Graph Structures, Algorithms, and Algebras

• Graph Databases and the Property Graph

• TinkerPop Open-Source Graph Product Suite

• Real-Time, Real-World Use Cases for Graphs

Difficulty Chart

[Figure: difficulty plotted over time across the sections of the talk: algebra, graphs, databases, indices, data models, software, algorithms, real-world, conclusion.]


G = (V,E)

A Vertex

There once was a vertex i ∈ V named tenderlove.

Two Vertices

And then came along another vertex j ∈ V named sixwing. Thus, i, j ∈ V.

A Directed Edge

Our tenderlove extended a relationship to sixwing. Thus, (i, j) ∈ E.

The Single-Relational, Directed Graph

More vertices join, create edges and, in turn, the graph grows...

The Single-Relational, Directed Graph as a Matrix

A single-relational graph defined as

$$G = (V, E \subseteq (V \times V))$$

can be represented as the adjacency matrix $A \in \{0,1\}^{n \times n}$, where

$$A_{i,j} = \begin{cases} 1 & \text{if } (i,j) \in E \\ 0 & \text{otherwise.} \end{cases}$$
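As a minimal sketch of this definition, the following plain-Python snippet builds the adjacency matrix from a vertex list and an edge list. The function name `adjacency_matrix` and the example edges are assumptions made for illustration; only the vertex names come from the slides.

```python
def adjacency_matrix(vertices, edges):
    """Return A as a list of rows, with A[i][j] = 1 iff (i, j) is in E."""
    index = {v: i for i, v in enumerate(vertices)}  # vertex -> row/column
    n = len(vertices)
    A = [[0] * n for _ in range(n)]
    for tail, head in edges:
        A[index[tail]][index[head]] = 1
    return A

# Hypothetical edges, chosen only to illustrate the construction.
V = ["tenderlove", "sixwing", "marko"]
E = [("tenderlove", "sixwing"), ("sixwing", "marko"), ("marko", "tenderlove")]
A = adjacency_matrix(V, E)
```

Row i lists the outgoing edges of vertex i, so the matrix of a directed graph is not symmetric in general.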

[Figure: the directed graph G shown next to its adjacency matrix A.]

The Single-Relational, Directed Graph

• All vertices are homogeneous in meaning: all vertices denote the same type of object (e.g. people, webpages, etc.). [1]

• All edges are homogeneous in meaning: all edges denote the same type of relationship (e.g. friendship, works with, etc.). [2]

[1] This is not completely true. All n-partite single-relational graphs allow for the division of the vertex set into n subsets, where $V = \bigcup_{i}^{n} A_i : A_i \cap A_j = \emptyset$. Thus, it is possible to implicitly type the vertices.

[2] This is not completely true. There exists an injective, information-preserving function that maps any multi-relational graph to a single-relational graph, where edge types are denoted by topological structures. Thus, at a “higher-level,” it is possible to create a heterogeneous set of relationships. Rodriguez, M.A., “Mapping Semantic Networks to Undirected Networks,” International Journal of Applied Mathematics and Computer Sciences, 5(1), pp. 39–42, 2009. [http://arxiv.org/abs/0804.0277]

Applications of Single-Relational Graphs

• Social: define how people interact (collaborators, friends, kin).

• Biological: define how biological components interact (protein, food chains, gene regulation).

• Transportation: define how cities are joined by air and road routes.

• Dependency: define how software modules, data sets, functions depend on each other.

• Technology: define the connectivity of Internet routers, web pages, etc.

• Language: define the relationships between words.

The Limitations of Single-Relational Graph Modeling

[Figure: three separate graphs: Friendship Graph, Favorite Graph, Works-For Graph.]

Unfortunately, single-relational graphs are independent of each other. This is because G = (V, E): there is only a single edge set E (i.e. a single type of relation).

Numerous Algorithms for Single-Relational Graphs

We would like a more flexible graph modeling construct, but unfortunately, most of our graph algorithms were designed for single-relational graphs. [3]

• Geodesic: diameter, radius, eccentricity, closeness, betweenness, etc.

• Spectral: random walks, PageRank, eigenvector centrality, spreading activation, etc.

• Assortativity: scalar, categorical, hierarchical, etc.

• Others: ... [4]

We can solve this with multi-relational graphs and a path algebra.

[3] For a fine book on graph analysis algorithms, please see: Brandes, U., Erlebach, T., “Network Analysis: Methodological Foundations,” edited book, Springer, 2005.

[4] One of the purposes of this presentation is to advocate for local graph analysis algorithms (i.e. priors-based, relative) vs. global graph analysis algorithms. Most popular graph analysis algorithms are global in that they require an analysis of the whole graph (or a large portion of a graph) to yield results. Local analysis algorithms are dependent on sub-graphs of the whole and, in effect, can boast faster running times.

G = (V,E)

A Directed Edge

A Directed, Labeled Edge

[Figure: the edge from tenderlove to sixwing, now labeled friend.]

Let's specify the type of relationship that exists between tenderlove and sixwing. Thus, $(i, j) \in E_{\text{friend}}$.

Growing a Multi-Relational Graph

[Figure: two friend edges, one in each direction, between tenderlove and sixwing.]

Let's make the friendship relationship symmetric. Thus, $(j, i) \in E_{\text{friend}}$.


[Figure: the graph with marko added and four friend edges.]

Let's add marko to the mix: k ∈ V. This graph is still single-relational. There is only one type of relation.


[Figure: the graph with four friend edges and one favorite edge.]

Let's add an $(i, l) \in E_{\text{favorite}}$. Now there are multiple types of relationships: $E_{\text{friend}}$ and $E_{\text{favorite}}$ (2 edge sets).

The Multi-Relational, Directed Graph

• At this point, there is a multi-relational, directed graph: $G = (V, E)$, where $E = \{E_0, E_1, \ldots, E_m \subseteq (V \times V)\}$. [5]

• Vertices can denote different types of objects (e.g. people, places). [6]

• Edges can denote different types of relationships (e.g. friend, favorite). [7]

• This is the data model of the Web of Data, the RDF data model. [8]

[5] Another representation is $G \subseteq (V \times \Omega \times V)$, where $\Omega \subseteq \Sigma^{*}$ is the set of legal edge labels.

[6] Vertex types can be determined by the domain and range specification of the respective edge relation/label/predicate. Or, another way, by means of an explicit typing relation such as 〈a, type, b〉.

[7] Edge types are determined by the label that accompanies the edge.

[8] This is not completely true. The vertex set is split into URIs (U), literals (L), and blank/anonymous nodes (B), such that $G \subseteq ((U \cup B) \times U \times (U \cup B \cup L))$. [http://www.w3.org/RDF/]

The Multi-Relational, Directed Graph as a Tensor

A three-way tensor can be used to represent a multi-relational graph. If

$$G = (V, E = \{E_0, E_1, \ldots, E_m \subseteq (V \times V)\})$$

is a multi-relational graph, then $A \in \{0,1\}^{n \times n \times m}$ and

$$A^{k}_{i,j} = \begin{cases} 1 & \text{if } (i,j) \in E_k : 1 \leq k \leq m \\ 0 & \text{otherwise.} \end{cases}$$

Thus, each edge set in E represents an adjacency matrix and the combination of m adjacency matrices forms a 3-way tensor.
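One way to picture the tensor is as one adjacency-matrix slice per edge label. The sketch below keeps the slices in a dictionary keyed by label rather than in a true three-dimensional array; the function name `adjacency_tensor`, the `bookstore` vertex, and the example edges are assumptions made for illustration.

```python
def adjacency_tensor(vertices, labeled_edges):
    """Map each edge label k to its n x n 0/1 slice A^k."""
    index = {v: i for i, v in enumerate(vertices)}
    n = len(vertices)
    tensor = {}
    for tail, label, head in labeled_edges:
        # Create the slice for this label the first time the label is seen.
        slice_k = tensor.setdefault(label, [[0] * n for _ in range(n)])
        slice_k[index[tail]][index[head]] = 1
    return tensor

# Hypothetical labeled edges in the spirit of the running example.
V = ["tenderlove", "sixwing", "marko", "bookstore"]
E = [
    ("tenderlove", "friend", "sixwing"),
    ("sixwing", "friend", "tenderlove"),
    ("tenderlove", "friend", "marko"),
    ("marko", "friend", "tenderlove"),
    ("tenderlove", "favorite", "bookstore"),
]
A = adjacency_tensor(V, E)
```

Here `A["friend"]` and `A["favorite"]` are ordinary adjacency matrices over the same vertex set.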

[Figure: the multi-relational graph G (friend and favorite edges) shown next to its tensor A, with one adjacency-matrix slice per edge label (friend, favorite, answers).]

Multi-Relational Graph Algorithms

“Can we evaluate single-relational graph analysis algorithms on a multi-relational graph?”

The Meaning of Edge Meanings

[Figure: tenderlove and marko each have the same number of incoming edges; one set of edges is labeled loves, the other hates.]

• Multi-relationally: tenderlove is more liked than marko.

• Single-relationally: tenderlove and marko simply have the same in-degree.

  ◦ Given, let's say, degree-centrality, tenderlove and marko are equal as they have the same number of relationships. The edge labels do not affect the output of the degree-centrality algorithm.

What Do You Mean By “Central?”

[Figure: a multi-relational graph with friend, favorite, question_by, answer_for, answer_by, and answer edges around the question “What is your favorite bookstore?”]

Let's focus specifically on centrality. What is the most central vertex in a multi-relational graph? Who is the most central friend in the graph: by friendship, by question answering, by favorites, etc.?

Primary Eigenvector

“What does the primary eigenvector of a multi-relational graph mean?” [9] [10] [11]

[9] We will use the primary eigenvector for the following argument. Note that the same argument applies for all known single-relational graph algorithms (i.e. geodesic, spectral, community detection, etc.).

[10] Technical details are left aside such as outgoing edge probability distributions and the irreducibility of the graph.

[11] The popular PageRank vector is defined as the primary eigenvector of a low-probability fully connected graph combined with the original graph (i.e. both graphs maintain the same V).

Primary Eigenvector: Ignoring Edge Labels

• If π = Bπ, where $B \in \mathbb{N}_{+}^{|V| \times |V|}$ is the adjacency matrix formed by merging the edge sets in E, then edge labels are ignored: all edges are treated equally.

• In this “ignoring labels”-model, there is only one primary eigenvector for the graph: one definition of centrality.

• With a heterogeneous set of vertices connected by a heterogeneous set of edges, what does this type of centrality mean?

Primary Eigenvector: Isolating Subgraphs

• Are there other primary eigenvectors in the multi-relational graph?

• You can ignore certain edge sets and calculate the primary eigenvector (e.g. pull out the single-relational “friend”-graph).

  ◦ $\pi = A^{\text{friend}}\pi$, where $A^{\text{friend}} \in \{0,1\}^{|V| \times |V|}$ is the adjacency matrix formed by the edge set $E_{\text{friend}}$.

• Thus, you can isolate subgraphs (i.e. adjacency matrices) of the multi-relational graph and calculate the primary eigenvector for those subgraphs.

• In this “isolation”-model, there are m definitions of centrality, one for each isolated subgraph. [12]

[12] Remember, $A \in \{0,1\}^{n \times n \times m}$.
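The two models above can be sketched with a plain power iteration: repeatedly apply the matrix to π and renormalize. This is an illustrative approximation only; it sets aside the technical details the slides also set aside (edge probability distributions, irreducibility), and the function names and the tiny example matrices are assumptions.

```python
def primary_eigenvector(A, iterations=200):
    """Approximate pi = A pi (up to scaling) by power iteration.

    Assumes the iteration converges (e.g. a connected, non-bipartite,
    symmetric graph) and that pi never becomes all zeros.
    """
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [sum(A[i][j] * pi[j] for j in range(n)) for i in range(n)]
        total = sum(nxt)
        pi = [x / total for x in nxt]  # keep the entries summing to 1
    return pi

def merge(slices):
    """'Ignoring labels': add the slices entry-wise into one matrix B."""
    n = len(slices[0])
    return [[sum(s[i][j] for s in slices) for j in range(n)] for i in range(n)]

# Hypothetical slices over four vertices (0..3).
friend = [[0, 1, 1, 0],
          [1, 0, 1, 0],
          [1, 1, 0, 1],
          [0, 0, 1, 0]]
favorite = [[0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [1, 0, 0, 0]]

pi_friend = primary_eigenvector(friend)   # "isolation" model
B = merge([friend, favorite])             # "ignoring labels" model
pi_merged = primary_eigenvector(B)
```

In the isolated friend slice, vertex 2 has the most friends and receives the largest score; the merged matrix B gives a different vector because the favorite edge is treated like any other edge.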

Ultimately what we want is...

Primary Eigenvector: Turing Completeness

• What about using paths through the graph, not simply explicit one-step edges?

• What about determining centrality for a relation that isn't explicit in E (i.e. $A^{k} \in A$)? In general, what about π = Xπ, where X is a derived adjacency matrix of the multi-relational graph?

  ◦ For example, if I know who everyone's friends are, then I know (i.e. can infer, derive, compute) who everyone's friends-of-a-friend (FOAF) are. What about the primary eigenvector of the derived FOAF graph?

• In the end, you want a Turing-complete framework: you want complete control (universal computability) over how π moves through the multi-relational graph structure. [13]

[13] These ideas are expounded upon at great length throughout this presentation.

A Path Algebra for Evaluating Single-Relational Algorithms on Multi-Relational Graphs

• There exists a multi-relational graph algebra for mapping single-relational graph analysis algorithms to the multi-relational domain. [14]

• The algebra works on a tensor representation of a multi-relational graph.

• In this framework and given the running example, there are as many primary eigenvectors as there are abstract path definitions.

[14] Rodriguez, M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29–41, doi:10.1016/j.joi.2009.06.004, 2009. [http://arxiv.org/abs/0806.2274]
Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks,” Knowledge-Based Systems, 21(7), pp. 727–739, doi:10.1016/j.knosys.2008.03.030, 2008. [http://arxiv.org/abs/0803.4355]
Rodriguez, M.A., Watkins, J., “Grammar-Based Geodesics in Semantic Networks,” Knowledge-Based Systems, in press, doi:10.1016/j.knosys.2010.05.009, 2010.



The Operations of the Multi-Relational Path Algebra

• $A \cdot B$: ordinary matrix multiplication determines the number of (A, B)-paths between vertices.

• $A^{\top}$: matrix transpose inverts path directionality.

• $A \circ B$: Hadamard, entry-wise multiplication applies a filter to selectively exclude paths.

• $n(A)$: not generates the complement of a $\{0,1\}^{n \times n}$ matrix.

• $c(A)$: clip generates a $\{0,1\}^{n \times n}$ matrix from a $\mathbb{R}_{+}^{n \times n}$ matrix.

• $v^{\pm}(A)$: vertex generates a $\{0,1\}^{n \times n}$ matrix from a $\mathbb{R}_{+}^{n \times n}$ matrix, where only certain rows or columns contain non-zero values.

• $xA$: scalar multiplication weights the entries of a matrix.

• $A + B$: matrix addition merges paths.
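To make the operations concrete, here is a small plain-Python sketch of most of them on list-of-lists matrices (the vertex filter $v^{\pm}$ is omitted). The function names are assumptions chosen for readability, and the three-vertex friend matrix is a made-up example.

```python
def matmul(A, B):
    """A . B: entry (i, j) counts the (A, B)-paths from i to j."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """A^T: inverts path directionality."""
    return [list(col) for col in zip(*A)]

def hadamard(A, B):
    """A o B: entry-wise product, used as a filter."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def complement(A):
    """n(A): swaps 0 and 1 in a 0/1 matrix."""
    return [[1 - a for a in row] for row in A]

def clip(A):
    """c(A): 1 wherever the entry is positive, else 0."""
    return [[1 if a > 0 else 0 for a in row] for row in A]

def scale(x, A):
    """xA: weights every entry by x."""
    return [[x * a for a in row] for row in A]

def add(A, B):
    """A + B: merges paths."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

# Hypothetical symmetric friend slice: vertex 0 is friends with 1 and 2.
friend = [[0, 1, 1],
          [1, 0, 0],
          [1, 0, 0]]

foaf = matmul(friend, friend)                            # friend-of-a-friend
foaf_no_self = hadamard(foaf, complement(identity(3)))   # drop the diagonal
```

In this example vertices 1 and 2 are not friends, but each is a friend-of-a-friend of the other through vertex 0, which is what `foaf_no_self` records.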

Primary Eigenvectors in a Multi-Relational Graph

• Friend: $\left(A^{\text{friend}}\right)\pi$

• FOAF: $\left(A^{\text{friend}} \cdot A^{\text{friend}}\right)\pi \equiv \left({A^{\text{friend}}}^{2}\right)\pi$

• FOAF (no self): $\left({A^{\text{friend}}}^{2} \circ n(I)\right)\pi$ [15]

• FOAF (no friends nor self): $\left({A^{\text{friend}}}^{2} \circ n\left(A^{\text{friend}}\right) \circ n(I)\right)\pi$

• Co-Worker: $\left(\left(A^{\text{works at}} \cdot {A^{\text{works at}}}^{\top}\right) \circ n(I)\right)\pi$

• Friend-or-CoWorker: $\left(0.65\,A^{\text{friend}} + 0.35\left(\left(A^{\text{works at}} \cdot {A^{\text{works at}}}^{\top}\right) \circ n(I)\right)\right)\pi$

• ...and more. [16]

[15] $I \in \{0,1\}^{|V| \times |V|} : I_{i,i} = 1$, the identity matrix.

[16] Note, again, that the examples are with respect to determining the primary eigenvector of the derived adjacency matrix. The same argument holds for all other single-relational graph analysis algorithms. In general, the path algebra provides a means of creating “higher-order” (i.e. semantically-rich) single-relational graphs from a single multi-relational graph. Thus, these derived matrices can be subjected to standard single-relational graph analysis algorithms.

Deriving “Semantically Rich” Adjacency Matrices

[Figure: the product $A^{\text{friend}} \cdot A^{\text{friend}}$, filtered by $n(I)$, yields the derived matrix ${A^{\text{friend}}}^{2} \circ n(I)$, labeled "friend-of-a-friend (no self)". Its union with the tensor A (slices friend, favorite, answers) gives a tensor with an additional friend-of-friend (no self) slice.]

Use the multi-relational graph to generate explicit edges that were implicitly defined as paths. Those new explicit edges can then be memoized [17] and re-used (a time vs. space tradeoff), also known as path reuse.

[17] Memoization Wikipedia entry: http://en.wikipedia.org/wiki/Memoization.

Benefits, Drawbacks, and Future of the Path Algebra

• Benefit: Provides a set of theorems for deriving equivalences and thus, provides the foundation for graph traversal engine optimizers. [18] Serves a similar purpose as the relational algebra for relational databases. [19]

• Drawback: The algebra is represented in matrix form and thus, operationally, works globally over the graph. [20]

• Future: A non-matrix-based, ring theoretic model of graph traversal that supports +, −, and · on individual vertices and edges. The Gremlin [http://gremlin.tinkerpop.com] graph traversal engine presented later provides the implementation before a fully-developed theory.

[18] Rodriguez, M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), pp. 29–41, 2009. [http://arxiv.org/abs/0806.2274]

[19] Codd, E.F., “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, 13(6), pp. 377–387, doi:10.1145/362384.362685, 1970.

[20] It is possible to represent local traversals using vertex filters at the expense of clumsy notation.


The Simplicity of a Graph

• A graph is a simple data structure.

• A graph states that something is related to something else (the foundation of any other data structure). [21]

• It is possible to model a graph in various types of databases. [22]

  ◦ Relational database: MySQL, Oracle, PostgreSQL

  ◦ JSON document database: MongoDB, CouchDB

  ◦ XML document database: MarkLogic, eXist-db

  ◦ etc.

[21] A graph can be used to represent other data structures. This point becomes convenient when looking beyond using graphs for typical, real-world domain models (e.g. friends, favorites, etc.), and seeing their applicability in other areas such as modeling code (e.g. http://arxiv.org/abs/0802.3492), indices, etc.

[22] For the sake of diagram clarity, the examples to follow are with respect to a single-relational, directed graph. Note that it is possible to model multi-relational graphs in these types of databases as well.


Representing a Graph in a Relational Database

outV | inV
-----+-----
A    | B
A    | C
C    | D
D    | A

[Figure: the directed graph with edges A→B, A→C, C→D, D→A.]
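As a rough illustration of what traversing this table involves, the sketch below loads the four rows into an in-memory SQLite database and takes a two-step traversal with a self-join. The table name `edges` and the helper `two_steps_from` are assumptions; the column names `outV` and `inV` are taken from the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (outV TEXT, inV TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("A", "B"), ("A", "C"), ("C", "D"), ("D", "A")],
)

def two_steps_from(vertex):
    """Vertices reachable in exactly two steps: one join per extra step."""
    rows = conn.execute(
        "SELECT e2.inV FROM edges AS e1 "
        "JOIN edges AS e2 ON e1.inV = e2.outV "
        "WHERE e1.outV = ? ORDER BY e2.inV",
        (vertex,),
    )
    return [row[0] for row in rows]

reachable = two_steps_from("A")
```

Each additional step in the path adds another join against the same table, which is the relational counterpart of following one more edge.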

Representing a Graph in a JSON Database

{
  "A": { "outE": ["B", "C"] },
  "B": { "outE": [] },
  "C": { "outE": ["D"] },
  "D": { "outE": ["A"] }
}

[Figure: the directed graph with edges A→B, A→C, C→D, D→A.]

Representing a Graph in an XML Database

<graphml>
  <graph>
    <node id="A" />
    <node id="B" />
    <node id="C" />
    <node id="D" />
    <edge source="A" target="B" />
    <edge source="A" target="C" />
    <edge source="C" target="D" />
    <edge source="D" target="A" />
  </graph>
</graphml>

[Figure: the directed graph with edges A→B, A→C, C→D, D→A.]

Defining a Graph Database

“If any database can represent a graph, then what is a graph database?”

Defining a Graph Database

A graph database is any storage system that provides index-free adjacency. [23] [24]

[23] There is no “official” definition of what makes a database a graph database. The one provided is my definition (respective of the influence of my collaborators in this area). However, hopefully the following argument will convince you that this is a necessary definition. Given that any database can model a graph, defining a graph database as any database that can model a graph would not provide strict enough bounds to yield a formal concept (i.e. ⊤).

[24] There is adjacency between the elements of an index, but if the index is not the primary data structure of concern (to the developer), then there is indirect/implicit adjacency, not direct/explicit adjacency. A graph database exposes the graph as an explicit data structure (not an implicit data structure).

Defining a Graph Database by Example

[Figure: a toy graph with vertices A, B, C, D, E, and a gremlin (stuntman).]

Graph Databases and Index-Free Adjacency

[Figure: the toy graph with the gremlin at vertex A.]

• Our gremlin is at vertex A.

• In a graph database, vertex A has direct references to its adjacent vertices.

• The cost to move from A to B and C is constant with respect to the size of the graph. It depends only on the number of edges emanating from vertex A (local).

[Figure: the toy graph, captioned “The Graph (explicit)”.]

Non-Graph Databases and Index-Based Adjacency

[Figure: the toy graph next to an index keyed by vertex (A, B, C, D, E) whose entries list adjacent vertices (e.g. A: B, C). Captions: “The Index (explicit)” and “The Graph (implicit)”.]

• Our gremlin is at vertex A.

• In a non-graph database, the gremlin needs to look at an index to determine what is adjacent to A.

• log₂(n) time cost to move to B and C. It is dependent upon the total number of vertices and edges in the database (global).

Index-Free Adjacency

• While any database can implicitly represent a graph, only a graph database makes the graph structure explicit. [25]

• In a graph database, each vertex serves as a “mini index” of its adjacent elements. [26]

• Thus, as the graph grows in size, the cost of a local step remains the same. [27]

[25] Please see http://markorodriguez.com/Blarko/Entries/2010/3/29_MySQL_vs._Neo4j_on_a_Large-Scale_Graph_Traversal.html for some performance characteristics of graph traversals in a relational database (MySQL) and a graph database (Neo4j).

[26] Each vertex can be interpreted as a “parent node” in an index with its children being its adjacent elements. In this sense, traversing a graph is analogous in many ways to traversing an index, albeit the graph is not an acyclic connected graph (tree). (a vision espoused by Craig Taverner)

[27] A graph, in many ways, is like a distributed index.
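The contrast between the two kinds of adjacency can be sketched in a few lines. This is an analogy in plain Python, not how any particular database is implemented: the `Vertex` class, the sorted `rows` list standing in for a global index, and every edge other than A→B and A→C are assumptions made for illustration.

```python
import bisect

class Vertex:
    """Index-free adjacency: a vertex holds direct references to neighbors."""
    def __init__(self, name):
        self.name = name
        self.out = []

a, b, c = Vertex("A"), Vertex("B"), Vertex("C")
a.out.extend([b, c])

def neighbors_index_free(vertex):
    # Cost depends only on the number of edges leaving this vertex.
    return [w.name for w in vertex.out]

# Index-based adjacency: one global, sorted list of (outV, inV) rows.
rows = sorted([("A", "B"), ("A", "C"), ("B", "E"), ("C", "D"), ("C", "E")])

def neighbors_index_based(name):
    # Two binary searches over ALL rows locate the block for this vertex,
    # so the lookup cost grows with log2 of the total number of edges.
    lo = bisect.bisect_left(rows, (name, ""))
    hi = bisect.bisect_left(rows, (name, "\uffff"))
    return [in_v for _, in_v in rows[lo:hi]]
```

Both functions return the same neighbors for A; the difference is that the second one must first search a structure whose size tracks the whole graph.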


Graph Databases Do Make Use of Indices

[Figure: the graph (vertices A to E) next to an index of vertices by id.]

• There is more to the graph than the explicit graph structure.

• Indices index the vertices by their properties (e.g. ids, name, latitude). [28]

[28] Graph databases can be used to create index structures. In fact, in the early days of Neo4j, Neo4j used its own graph structure to index the properties of its vertices: a graph indexing a graph. A thought iterated many times over by Craig Taverner, who is interested in graph databases for geo-spatial indexing/analysis.

The Patterns of Relational and Graph Databases

• In a relational database, operations are conceptualized set-theoretically with the joining of tuple structures being the means by which normalized/separated data is associated.

• In a graph database, operations are conceptualized graph-theoretically with paths over edges being the means by which non-adjacent/separated vertices are associated. [29]

In theory and ignoring performance, both models have the same expressivity and allow for the same manipulations. But such theory does not determine intention and the mental ruts that any approach engrains. The graph database provides a novel perspective on the ancient necessity to manipulate information.

[29] Rodriguez, M.A., Neubauer, P., “The Graph Traversal Pattern,” AT&Ti and NeoTechnology Technical Report, currently in review, 2010. [http://arxiv.org/abs/1004.1001]


Property Graphs and Graph Databases

• Most graph databases support a graph data model known as a property graph.

• A property graph is a directed, attributed, multi-relational graph. In other words, vertices and edges are equipped with a collection of key/value pairs. [30]

[30] Rodriguez, M.A., Neubauer, P., “Constructions from Dots and Lines,” Bulletin of the American Society for Information Science and Technology, American Society for Information Science and Technology, 2010. [http://arxiv.org/abs/1006.2361]
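A minimal in-memory sketch of this data model follows. The class and field names are assumptions for illustration and are not the Blueprints API; the property values echo the slides, but which edge carries which `created_at` value is made up.

```python
from dataclasses import dataclass, field

@dataclass(eq=False, repr=False)
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)  # key/value pairs
    out_edges: list = field(default_factory=list)
    in_edges: list = field(default_factory=list)

@dataclass(eq=False, repr=False)
class Edge:
    out_vertex: Vertex          # tail: the edge is directed
    label: str                  # multi-relational: the edge type
    in_vertex: Vertex           # head
    properties: dict = field(default_factory=dict)  # attributed

def add_edge(out_vertex, label, in_vertex, **properties):
    """Create an edge and register it on both of its endpoints."""
    edge = Edge(out_vertex, label, in_vertex, dict(properties))
    out_vertex.out_edges.append(edge)
    in_vertex.in_edges.append(edge)
    return edge

marko = Vertex(1, {"name": "marko", "location": "Santa Fe"})
sixwing = Vertex(2, {"name": "sixwing", "location": "West Hollywood"})
add_edge(marko, "friend", sixwing, created_at=123456)

friend_names = [e.in_vertex.properties["name"]
                for e in marko.out_edges if e.label == "friend"]
```

Because each vertex keeps its own edge lists, stepping from marko to his friends does not consult any global structure.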


From a Multi-Relational Graph...

[Figure: the multi-relational graph with four friend edges and one favorite edge.]

...to a Property Graph

[Figure: the same graph with key/value pairs attached to vertices and edges. Properties shown: name=marko, location=Santa Fe, gender=male, lat=11111, long=22222; name=sixwing, location=West Hollywood, gender=male; created_at=123456, created_at=234567, created_at=234567.]

Why the Property Graph Model?

• Standard single-relational graphs do not provide enough modeling flexibility for use in real-world situations. [31]

• Multi-relational graphs do and the Web of Data (RDF) world demonstrates this to be the case in practice.

• Property graphs are perhaps more practical because not every datum needs to be “related” (e.g. age, name, etc.). Thus, the edge and key/value model is a convenient dichotomy. [32]

• Property graphs provide finer granularity on the meaning of an edge as the key/values of an edge add extra information beyond the edge label.

[31] This is not completely true: researchers use the single-relational graph all the time. However, in most data-rich applications, it is limiting to work with a single edge type and a homogeneous population of vertices.

[32] RDF has a similar argument in that literals can only be the object of a triple. However, in practice, when represented in a graph database, there is a single literal vertex denoting that literal and thus, it is traversable like any other vertex.

Graph Type Morphisms

[Figure: graph type morphisms. Graph types shown: property graph, weighted graph, semantic graph, labeled graph, rdf graph, multi-graph, directed graph, undirected graph, simple graph. Arrows between them are labeled with the transformation applied: add weight attribute; remove attributes; remove edge labels; make labels URIs; remove directionality; remove loops, directionality, and multiple edges; no op.]
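A few of the transformations named in the figure can be sketched as plain functions over an edge list. The tuple representation, the function names, and the example edges are all assumptions; the exact arrangement of arrows in the original figure is not reproduced here.

```python
# A hypothetical property graph as (tail, label, head, properties) tuples.
property_edges = [
    ("i", "friend", "j", {"created_at": 123456}),
    ("j", "friend", "i", {}),
    ("i", "favorite", "l", {}),
    ("i", "friend", "i", {}),   # a loop, included only to show its removal
]

def remove_attributes(edges):
    """Drop key/value pairs, leaving directed, labeled edges."""
    return [(tail, label, head) for tail, label, head, _ in edges]

def remove_edge_labels(labeled_edges):
    """Drop labels, leaving directed (possibly repeated) edges."""
    return [(tail, head) for tail, _, head in labeled_edges]

def to_simple_graph(directed_edges):
    """Remove loops, directionality, and multiple edges."""
    return {frozenset((tail, head))
            for tail, head in directed_edges if tail != head}

labeled = remove_attributes(property_edges)
directed = remove_edge_labels(labeled)
simple = to_simple_graph(directed)
```

Each step discards information, so the chain only runs in one direction without extra input.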


TinkerPop: Making Stuff for the Fun of It

• Open source software group started in 2008 focusing on graph data structures, graph query engines, graph-based programming languages, and, in general, tools and techniques for working with graphs. [http://tinkerpop.com] [http://github.com/tinkerpop]

  ◦ Current members: Marko A. Rodriguez (AT&Ti), Peter Neubauer (NeoTechnology), Joshua Shinavier (Rensselaer Polytechnic Institute), and Pavel Yaskevich (“I am no one from nowhere”).

TinkerPop Productions

• Blueprints: Data Models and their Implementations

[http://blueprints.tinkerpop.com]

• Pipes: A Data Flow Framework using Process Graphs

[http://pipes.tinkerpop.com]

• Gremlin: A Graph-Based Programming Language

[http://gremlin.tinkerpop.com]

• Rexster: A RESTful Graph Shell

[http://rexster.tinkerpop.com]

  ◦ Wreckster: A Ruby API for Rexster

[http://github.com/tenderlove/wreckster]

There are other TinkerPop products (e.g. Ripple, LoPSideD, TwitLogic, etc.), but for the purpose of this presentation, only the above will be discussed.

Blueprints: Data Models and their Implementations

• Blueprints is like the JDBC of the graph database community.

• Provides a Java-based interface API for the property graph data model.

  ◦ Graph, Vertex, Edge, Index.

• Provides implementations of the interfaces for TinkerGraph, Neo4j, Sails (e.g. AllegroGraph, HyperSail, etc.), and soon (hopefully) others such as InfiniteGraph, InfoGrid, Sones, DEX, and HyperGraphDB. [33]

[33] HyperGraphDB makes use of an n-ary graph structure known as a hypergraph. Blueprints, in its current form, only supports the more common binary graph.

Pipes: A Data Flow Framework using Process Graphs

• A dataflow framework with support for Blueprints-based graph processing.

• Provides a collection of “pipes” (implement Iterable and Iterator) that are connected together to form processing pipelines.

  ◦ Filters: ComparisonFilterPipe, RandomFilterPipe, etc.

  ◦ Traversal: VertexEdgePipe, EdgeVertexPipe, PropertyPipe, etc.

  ◦ Splitting/Merging: CopySplitPipe, RobinMergePipe, etc.

  ◦ Logic: OrPipe, AndPipe, etc.
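The idea of chaining lazy iterators into a pipeline can be imitated with Python generators. This is only an analogy to the description above, not the Pipes API: the function names, the flat `EDGES` list, and the example data are assumptions.

```python
# Hypothetical edge store: (out vertex, label, in vertex) tuples.
EDGES = [
    ("tenderlove", "friend", "sixwing"),
    ("tenderlove", "friend", "marko"),
    ("tenderlove", "favorite", "bookstore"),
]

def vertex_to_out_edges(vertices):
    """Traversal step: emit the outgoing edges of each incoming vertex."""
    for v in vertices:
        for edge in EDGES:  # a real store would follow direct references
            if edge[0] == v:
                yield edge

def label_filter(edges, label):
    """Filter step: let through only edges with the given label."""
    for edge in edges:
        if edge[1] == label:
            yield edge

def edge_to_in_vertex(edges):
    """Traversal step: emit the head vertex of each incoming edge."""
    for edge in edges:
        yield edge[2]

# Compose the steps; nothing is computed until the pipeline is consumed.
pipeline = edge_to_in_vertex(
    label_filter(vertex_to_out_edges(iter(["tenderlove"])), "friend"))
friends = list(pipeline)
```

Each stage pulls one element at a time from the stage before it, so intermediate results are never materialized as full lists.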

Gremlin: A Graph-Based Programming Language

• A Turing-complete, graph-based programming language that compiles Gremlin syntax down to Pipes (implements JSR 223).

• Supports various language constructs: :=, foreach, while, repeat, if/else, function and path definitions, etc.

  ◦ ./outE[@label='friend']/inV

  ◦ ./outE[@label='friend']/inV/outE[@label='friend']/inV[g:except($ , .)]

  ◦ g:key('name','Aaron Patterson')[0]/outE[@label='favorite']/inV/@name

Rexster: A RESTful Graph Shell

• Allows Blueprints graphs to be exposed through a RESTful API (HTTP).

• Supports stored traversals written in raw Pipes or Gremlin.

• Supports ad hoc traversals represented in Gremlin.

• Provides “helper classes” for performing search-, score-, and rank-based traversal algorithms and, in concert, support for recommendation.

• Aaron Patterson (AT&Ti) maintains the Ruby connector Wreckster.

Typical TinkerPop Graph Stack

[Figure: a typical TinkerPop graph stack. Labels shown: NativeStore, TinkerGraph, Neo4j, and the request GET http://host/resource.]


Using Graphs in Real-Time Systems

• Most popular graph algorithms require global graph analysis.

  ◦ Such algorithms compute a score, a vector, etc. given the structure of the whole graph. Moreover, many of these algorithms have large running times: $O(|V| + |E|)$, $O(|V| \log |V|)$, $O(|V|^{2})$, etc.

• Many real-world situations can make use of local graph analysis. [34]

  ◦ Search for x starting from y.

  ◦ Score x given its local neighborhood.

  ◦ Rank x relative to y.

  ◦ Recommend vertices to user x.

[34] Many web applications are “ego-centric” in that they are with respect to a particular user (the user logged in). In such scenarios, local graph analysis algorithms are not only prudent to use, but also beneficial in that they are faster than global graph analysis algorithms. Many of the local analysis algorithms discussed run in the sub-second range (for graphs with “natural” statistics).

Applications of Graph Databases and Traversal Engines: Searching, Scoring, and Ranking

• Searching: given a power multi-set of vertices (P(V)) and a path description (Ψ), return the vertices at the end of that path. [35]

  ◦ P(V) × Ψ → P(V)

• Scoring: given some vertices and a path description, return a score.

  ◦ P(V) × Ψ → ℝ

• Ranking: given some vertices and a path description, return a map of scored vertices.

  ◦ P(V) × Ψ → (V × ℝ)

[35] Use cases need not be with respect to vertices only. Edges can be searched, scored, and ranked as well. However, in order to express the ideas as simply as possible, all discussion is with respect to vertices.
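The three signatures can be sketched as functions that share one traversal helper. Here a path description Ψ is modeled as a list of step functions, each mapping a vertex to its adjacent vertices; that encoding, the function names, the `FRIEND` data, and the choice of "number of paths" as the score are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical friend adjacency used by the example step function.
FRIEND = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": []}

def friend(vertex):
    """One step of a path description: follow outgoing friend edges."""
    return FRIEND.get(vertex, [])

def traverse(starts, path):
    """Apply each step in order, keeping duplicates (a multiset)."""
    current = list(starts)
    for step in path:
        current = [w for v in current for w in step(v)]
    return current

def search(starts, path):
    """P(V) x Psi -> P(V): the vertices at the end of the path."""
    return traverse(starts, path)

def score(starts, path):
    """P(V) x Psi -> R: here, simply the number of paths found."""
    return float(len(traverse(starts, path)))

def rank(starts, path):
    """P(V) x Psi -> (V x R): end vertices ordered by path count."""
    return Counter(traverse(starts, path)).most_common()

foaf = search(["a"], [friend, friend])
```

Keeping duplicates in `traverse` matters: a vertex reached by several paths appears several times, and that multiplicity is what `rank` counts.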

Applications of Graph Databases and Traversal Engines: Recommendation

• Recommendation: searching, scoring, and ranking can all be used as components of a recommendation. Thus, recommendation is founded on these more basic ideas.

  ◦ Recommendation aids the user by allowing them to make “jumps” through the data. Items that are not explicitly connected are connected implicitly through recommendation (through some abstract path Ψ).

• The act of recommending can be seen as an attempt to increase the density of the graph around a user's vertex. For example, recommending user i ∈ V places to visit U ⊂ V will hopefully lead to edges of the form 〈i, visited, j〉 : ∀j ∈ U. [36]

[36] A standard metric for recommendation quality is how well it predicts the user's future behavior. That is, does it predict an edge?
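One possible abstract path for the places example is "places visited by users who visited the same places as me". The sketch below counts such paths and ranks unvisited places by that count; the data, the function name `recommend_places`, and the scoring rule are assumptions, and other paths would give other recommendations.

```python
from collections import Counter

# Hypothetical visited edges: user -> set of places.
VISITED = {
    "i": {"cafe", "museum"},
    "u1": {"cafe", "park"},
    "u2": {"cafe", "museum", "park", "zoo"},
}

def recommend_places(user, visited, top=3):
    """Rank places the user has not visited by the number of
    user -> place <- other -> candidate paths that reach them."""
    mine = visited[user]
    counts = Counter()
    for other, places in visited.items():
        if other == user:
            continue
        shared = len(mine & places)   # paths from user to this other user
        if shared == 0:
            continue                  # no path through this user
        for place in places - mine:   # only places not yet visited
            counts[place] += shared
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [place for place, _ in ordered][:top]

suggestions = recommend_places("i", VISITED)
```

If the user later visits a suggested place, a new visited edge appears, which is the increase in local density described above.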

There Is More Than “People Who Like X Also Like Y .”

• A system need not be limited to one type of recommendation. With graph-based methods, there are as many recommendations as there are abstract paths.

• Use recommendation to aid the user in solving problems (i.e. computationally derive solutions for which your data set is primed). Examples below are with respect to problem-solving in the scholarly community. [37]

  ◦ Recommend articles to read. (articles)

  ◦ Recommend collaborators to work on an idea/article with. (people)

  ◦ Recommend a venue to submit the article to. (venues)

  ◦ Recommend referees to an editor for reviewing the article. (people) [38]

  ◦ Recommend scholars to talk to and concepts to talk to them about at the venue. (people and tags)

[37] Rodriguez, M.A., Allen, D.W., Shinavier, J., Ebersole, G., “A Recommender System to Support the Scholarly Communication Process,” KRS-2009-02, 2009. [http://arxiv.org/abs/0905.1594]

[38] Rodriguez, M.A., Bollen, J., “An Algorithm to Determine Peer-Reviewers,” Conference on Information and Knowledge Management (CIKM), pp. 319–328, doi:10.1145/1458082.1458127, 2008. [http://arxiv.org/abs/cs/0605112]


Real-Time, Domain-Specific, Graph-Based, Problem-Solving Engine

[Figure: Graph Data Set + Library of Path/Traversal Expressions (Ψ1, Ψ2, Ψ3, Ψ4, Ψ5, ..., Ψn) = Real-Time, Domain-Specific, Graph-Based, Problem-Solving Engine.]

Your domain model (i.e. graph dataset) determines what traversals you can design, develop, and deploy. Together, these determine which types of problems you can solve automatically/computationally for yourself and your users.

Applicable in Various, Seemingly Diverse Areas

• Applications to a techno-social government (i.e. collective decision making systems). [39]

[Figure: two plots against the percentage of active citizens (100 down to 0), each comparing dynamically distributed democracy with direct democracy. One plots error (axis 0.00 to 0.20); the other plots the proportion of correct decisions (axis 0.50 to 0.95).]

Fig. 5. The relationship between k and $e^{\text{vote}}_{k}$ for direct democracy (gray line) and dynamically distributed democracy (black line). The plot provides the proportion of identical, correct decisions over a simulation that was run with 1000 artificially generated networks composed of 100 citizens each.

As previously stated, let $x \in [0,1]^{n}$ denote the political tendency of each citizen in this population, where $x_i$ is the tendency of citizen i and, for the purpose of simulation, is determined from a uniform distribution. Assume that every citizen in a population of n citizens uses some social network-based system to create links to those individuals that they believe reflect their tendency the best. In practice, these links may point to a close friend, a relative, or some public figure whose political tendencies resonate with the individual. In other words, representatives are any citizens, not political candidates that serve in public office. Let $A \in [0,1]^{n \times n}$ denote the link matrix representing the network, where the weight of an edge, for the purpose of simulation, is denoted

$$A_{i,j} = \begin{cases} 1 - |x_i - x_j| & \text{if link exists} \\ 0 & \text{otherwise.} \end{cases}$$

In words, if two linked citizens are identical in their political tendency, then the strength of the link is 1.0. If their tendencies are completely opposing, then their trust (and the strength of the link) is 0.0. Note that a preferential attachment network growth algorithm is used to generate a degree distribution that is reflective of typical social networks “in the wild” (i.e. scale-free properties). Moreover, an assortativity parameter is used to bias the connections in the network towards citizens with similar tendencies. The assumption here is that given a system of this nature, it is more likely for citizens to create links to similar-minded individuals than to those whose opinions are quite different. The resultant link matrix A is then normalized to be row stochastic in order to generate a probability distribution over the weights of the outgoing edges of a citizen. Figure 6 presents an example of an n = 100 artificially generated trust-based social network, where red denotes a tendency of 0.0, purple a tendency of 0.5, and blue a tendency of 1.0.

Fig. 6. A visualization of a network of trust links between citizens. Each citizen’s color denotes their “political tendency”, where full red is 0, full blue is 1, and purple is 0.5. The layout algorithm chosen is the Fruchterman-Reingold layout.

Given this social network infrastructure, it is possible to better ensure that the collective tendency and vote is appropriately represented through a weighting of the active, participating population. Every citizen, active or not, is initially provided with $\frac{1}{n}$ “vote power” and this is represented in the vector $\pi \in \mathbb{R}^n_+$, such that the total amount of vote power in the population is 1. Let $y \in \mathbb{R}^n_+$ denote the total amount of vote power that has flowed to each citizen over the course of the algorithm. Finally, $a \in \{0, 1\}^n$ denotes whether citizen $i$ is participating ($a_i = 1$) in the current decision making process or not ($a_i = 0$). The values of $a$ are biased by an unfair coin that has probability $k$ of making the citizen an active participant and $1 - k$ of making the citizen inactive. The iterative algorithm is presented below, where $\circ$ denotes entry-wise multiplication and $\epsilon \approx 1$.

$y \leftarrow 0$
while $\sum_{i=1}^{i \le n} y_i < \epsilon$ do
    $y \leftarrow y + (\pi \circ a)$
    $\pi \leftarrow \pi \circ (1 - a)$
    $\pi \leftarrow A\pi$
end
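The iterative algorithm above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the `max_iter` guard, and the three-citizen example are hypothetical, and the propagation step moves vote power along each inactive citizen's outgoing links (the transpose reading of the update), an assumption chosen so that total vote power is conserved.

```python
def distribute_vote_power(A, a, eps=1 - 1e-9, max_iter=10_000):
    """Flow vote power from inactive citizens to active ones.

    A[i][j] is the weight of citizen i's outgoing link to citizen j
    (rows of inactive citizens are assumed to sum to 1); a[i] is 1 if
    citizen i is active and 0 otherwise.
    """
    n = len(a)
    pi = [1.0 / n] * n   # every citizen starts with 1/n vote power
    y = [0.0] * n        # vote power accumulated by each citizen
    for _ in range(max_iter):  # guard (assumption): stop if power can never reach an active citizen
        if sum(y) >= eps:
            break
        # active citizens are sinks: they absorb what they currently hold
        y = [y[i] + pi[i] * a[i] for i in range(n)]
        # only inactive citizens keep power to pass on
        pi = [pi[i] * (1 - a[i]) for i in range(n)]
        # inactive citizens push their power along their outgoing links
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return y


# Hypothetical example: citizen 0 is inactive and fully trusts citizen 1.
A = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
a = [0, 1, 1]
x = [0.2, 0.4, 0.9]                              # political tendencies
y = distribute_vote_power(A, a)
tendency = sum(xi * yi for xi, yi in zip(x, y))  # collective tendency x . y
```

In this example citizen 1 ends up holding its own vote power plus that of citizen 0, so the inactive citizen is still represented in the collective tendency.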

In words, active citizens serve as vote power “sinks” in that once they receive vote power, from themselves or from a neighbor in the network, they do not pass it on. Inactive citizens serve as vote power “sources” in that they propagate their vote power over the network links to their neighbors iteratively until all (or $\epsilon$) vote power has reached active citizens. At this point, the tendency in the active population is defined as $\delta^{tend} = x \cdot y$. Figure 4 plots the error incurred using dynamically distributed democracy (black line), where the error is defined as

$$e^{tend}_k = |d^{tend}_{100} - \delta^{tend}_k|.$$

Next, the collective vote $\delta^{vote}_k$ is determined by a weighted majority as dictated by the vote power accumulated by active participants. Figure 5 plots the proportion of votes that are different from what a fully participating population would

39* Rodriguez, M.A., Watkins, J.H., “Revisiting the Age of Enlightenment from a Collective Decision Making Systems Perspective,” First Monday, 14(8), 2009. [http://arxiv.org/abs/0901.3929]

* Rodriguez, M.A., “Social Decision Making with Multi-Relational Networks and Grammar-Based Particle Swarms,” Hawaii International Conference on Systems Science (HICSS), pp. 39–49, 2007. [http://arxiv.org/abs/cs/0609034]

* Rodriguez, M.A., Steinbock, D.J., “A Social Network for Societal-Scale Decision-Making Systems,” Proceedings of the North American Association for Computational Social and Organizational Science Conference, 2004. [http://arxiv.org/abs/cs/0412047]





Toy Graph Dataset

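[Figure: toy property graph with six vertices (1 to 6). Person vertices carry name, location, and gender properties (e.g. name=marko, location=Santa Fe, gender=male; name=sixwing, location=West Hollywood, gender=male; name=charlie). Place vertices carry name, lat, and long properties (e.g. name=Bryce Canyon; lat=11111, long=22222). Edges are labeled friend or favorite, and favorite edges carry a created_at property (e.g. created_at=123456, created_at=234567).]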

We will use the toy-graph above to demonstrate Gremlin (to introduce the syntax). However, in parallel, we

will also use a large graph of the same schema to demonstrate how SQL/MySQL compares relative to

Gremlin/Neo4j on traversal-based queries (i.e. for relational databases, queries with table joins).

Dataset Schema in Neo4j

Neo4j [http://neo4j.org] is a “schema-less” database. However, ultimately, data is represented according to some schema whether that schema be explicit in the database, in the code interacting with the database, or in the developer’s head.40 Please note the schema diagrammed below is a non-standard convention.41

Person: name=<string>, location=<string>, gender=<string>, type=Person
Place: name=<string>, lat=<double>, long=<double>, type=Place
Edges: Person -friend-> Person, Person -favorite-> Place

40 A better term for “schema-less” might have been “dynamic schema.”

41 For expressive, standardized graph-based schema languages, refer to RDFS [http://www.w3.org/TR/rdf-schema/] and OWL [http://www.w3.org/TR/owl-features/] of the Web of Data community.


Dataset Schema in MySQL

CREATE TABLE friend (

outV INT NOT NULL,

inV INT NOT NULL);

CREATE INDEX friend_outV_index USING BTREE ON friend (outV);

CREATE INDEX friend_inV_index USING BTREE ON friend (inV);

CREATE TABLE favorite (

outV INT NOT NULL,

inV INT NOT NULL);

CREATE INDEX favorite_outV_index USING BTREE ON favorite (outV);

CREATE INDEX favorite_inV_index USING BTREE ON favorite (inV);

CREATE TABLE metadata (

vertex INT NOT NULL,

_key VARCHAR(100) NOT NULL,

_value VARCHAR(100),

PRIMARY KEY (vertex, _key));

CREATE INDEX metadata_vertex_index USING BTREE ON metadata (vertex);

CREATE INDEX metadata_key_index USING BTREE ON metadata (_key);

CREATE INDEX metadata_value_index USING BTREE ON metadata (_value);

Experiment Discussion

• First, for each experiment, no cache is used. For each query (or run of queries), caches are reset/flushed and the query is performed.42

• Second, for each experiment, a “stable point” (i.e. performance with full caching) is found through the repeated evaluation of the same query.

• Evaluations are done on my laptop using SQL/MySQL (5.1.45) and Gremlin (0.5-alpha)/Neo4j (1.1).43

• I am not an expert in relational databases. Be aware of all of my choices (table design, indexes used, query representation, etc.).44

42 I believe, from looking at the behavior of MySQL, MySQL caches maintain joined structure in main memory for subsequent queries. Neo4j caches by maintaining active portions of the graph in main memory.

43 Note that Gremlin 0.5-alpha is much more performant than Gremlin 0.2.2. Also, running times presented are likely to change with optimizations (discussed later); consider all times in passing only.

44 For the more interested, please do experiments yourself with your particular domain models and queries.

Loading Identical Data into MySQL and Neo4j

For the first half of the examples, we will use a small data set. Later we will increase this data set by 10,000,000 edges and compare again. The reason is to test how indices affect the performance of standard queries. As indices grow, each log_2(n) index lookup becomes costly.

mysql> (SELECT * FROM friend) UNION (SELECT * FROM favorite)

71100 rows in set (0.47 sec)

gremlin> g:count($_g/E)

==>71100 results returned in 145.427ms (0.145 sec)

First thing to note: graph databases don’t have a notion of “tables”; the entire graph is one atomic entity.

Basic Gremlin

gremlin> (1 + 2) * 4 div 5

==>2.4

gremlin> "marko" + " a. " + "rodriguez"

==>marko a. rodriguez

gremlin> func ex:add-one($x)

$x + 1

end

gremlin> foreach $y in g:list(1,2,3,4)

g:print(ex:add-one($y))

end

2

3

4

5

Searching Example: Friends


gremlin> $_g := neo4j:open('/data/mygraph')

gremlin> $_ := g:id(1)
==>v[1]

gremlin> .
==>v[1]

gremlin> ./outE
==>e[10][1-friend->2]
==>e[11][1-friend->3]
==>e[12][1-favorite->4]

gremlin> ./outE[@label='friend']/inV/@name
==>sixwing
==>marko

gremlin> ./outE[@label='friend']/inV/@gender
==>male
==>male

gremlin> ./outE[@label='friend']
/inV[@location='Santa Fe']/@name
==>marko
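The friend traversal above can be mirrored outside Gremlin to show what each path step does. The following is a minimal Python sketch, not Gremlin's or Neo4j's implementation: the `EDGES` and `NAMES` tables are an assumed reading of the toy graph (in particular, which of vertices 2 and 3 links to vertex 5 is a guess), and `out` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed reading of the toy graph: (tail vertex, edge label, head vertex)
EDGES = [
    (1, "friend", 2), (1, "friend", 3), (1, "favorite", 4),
    (2, "friend", 1), (3, "friend", 1), (3, "friend", 5),
    (2, "favorite", 6), (3, "favorite", 6), (5, "favorite", 6),
]
NAMES = {1: "tenderlove", 2: "sixwing", 3: "marko", 5: "charlie", 6: "Bryce Canyon"}

# Stand-in for index-free adjacency: each vertex points directly at its outgoing edges
OUT = defaultdict(list)
for tail, label, head in EDGES:
    OUT[tail].append((label, head))


def out(vertices, label):
    """One traversal step: follow outgoing edges carrying the given label."""
    return [head for v in vertices for (lbl, head) in OUT[v] if lbl == label]


# Equivalent of ./outE[@label='friend']/inV/@name starting at v[1]
friend_names = [NAMES[v] for v in out([1], "friend")]
```

Each call to `out` touches only the edges of the vertices currently being traversed, which is the property that lets traversal cost depend on the local neighborhood rather than on the size of a global index.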

Searching Friends: SQL/MySQL vs. Gremlin/Neo4j

What are the names of Rand Fitzpatrick’s friends?45

mysql> SELECT friend.inV, b._value FROM friend, metadata as a,
metadata as b WHERE a._key='name' AND
a._value='Rand Fitzpatrick' AND a.vertex=friend.outV AND
b.vertex=friend.inV AND b._key='name';

97 rows in set (0.32 sec -- 320.0 ms)

gremlin> g:key('name','Rand Fitzpatrick')/outE[@label='friend']/inV/@name

97 results returned (0.00258 sec -- 25.88 ms)

45 When in cache (through repeated, identical querying), SQL/MySQL evaluates in ∼0.005 seconds (5 ms) and Gremlin/Neo4j evaluates in ∼0.0002 seconds (0.2 ms).
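To experiment with the relational side of this comparison without a MySQL server, the same join can be run against an in-memory SQLite database from Python. This is a sketch under stated assumptions: SQLite stands in for MySQL, the rows are a hypothetical toy data set (not the benchmark data), and `friend_names` is an illustrative helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE friend (outV INT NOT NULL, inV INT NOT NULL);
CREATE TABLE metadata (
    vertex INT NOT NULL,
    _key VARCHAR(100) NOT NULL,
    _value VARCHAR(100),
    PRIMARY KEY (vertex, _key));
""")
# Hypothetical toy rows
conn.executemany("INSERT INTO friend VALUES (?, ?)", [(1, 2), (1, 3), (2, 1)])
conn.executemany(
    "INSERT INTO metadata VALUES (?, 'name', ?)",
    [(1, "tenderlove"), (2, "sixwing"), (3, "marko")],
)


def friend_names(name):
    """Names of the friends of the person with the given name (two metadata joins)."""
    rows = conn.execute(
        """SELECT b._value FROM friend, metadata AS a, metadata AS b
           WHERE a._key = 'name' AND a._value = ?
             AND a.vertex = friend.outV
             AND b.vertex = friend.inV AND b._key = 'name'""",
        (name,),
    ).fetchall()
    return sorted(row[0] for row in rows)
```

Calling `friend_names("tenderlove")` resolves the name to a vertex through `metadata`, joins through `friend`, and joins back to `metadata` for the friends' names, which is the three-table pattern the SQL query above uses.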

Searching Example: FOAF (No Friends, No Self)


gremlin> .
==>v[1]

gremlin> ./outE[@label='friend']/inV
/outE[@label='friend']/inV
==>v[1]
==>v[1]
==>v[5]

gremlin> (./outE[@label='friend']
/inV)[g:assign('$x')]
/outE[@label='friend']
/inV[g:except(.,$_)][g:except(.,$x)]
/@name
==>charlie
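The friend-of-a-friend filter above (no friends, no self) can be sketched in Python as a two-step walk followed by an exclusion set. The adjacency table below is an assumed reading of the toy graph and `foaf` is a hypothetical helper, so treat this as an illustration rather than a reference implementation.

```python
# Assumed friend edges of the toy graph: vertex -> list of friends
FRIEND = {1: [2, 3], 2: [1], 3: [1, 5], 5: []}
NAMES = {1: "tenderlove", 2: "sixwing", 3: "marko", 5: "charlie"}


def foaf(v):
    """Friends of friends of v, excluding v itself and v's direct friends."""
    friends = FRIEND.get(v, [])            # plays the role of $x
    excluded = set(friends) | {v}          # g:except(.,$x) and g:except(.,$_)
    two_hops = [u for f in friends for u in FRIEND.get(f, [])]
    return [u for u in two_hops if u not in excluded]


foaf_names = [NAMES[u] for u in foaf(1)]
```

Note that the unfiltered two-hop walk from vertex 1 returns 1, 1, 5, matching the Gremlin output above; only the exclusion step reduces it to charlie.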

Searching FOAF (Not Self): SQL/MySQL vs. Gremlin/Neo4j

What are the names of Rand Fitzpatrick’s friends’ friends who are not Rand (note: this may include Rand’s friends)?46

mysql> SELECT mb._value FROM friend as a, friend as b, metadata as ma,
metadata as mb WHERE ma._key='name' AND ma._value='Rand Fitzpatrick'
AND ma.vertex=a.outV AND a.inV=b.outV AND b.inV != ma.vertex AND
b.inV = mb.vertex AND mb._key='name';

8985 rows in set (0.47 sec -- 470.00 ms)

gremlin> g:key('name','Rand Fitzpatrick')/outE[@label='friend']
/inV/outE[@label='friend']/inV[g:except(.,$_)]/@name

46 When in cache (through repeated, identical querying), SQL/MySQL evaluates in ∼0.03 seconds (30 ms) and Gremlin/Neo4j evaluates in ∼0.015 seconds (15 ms).

Searching Example: Friend’s Favorites


gremlin> .
==>v[1]

gremlin> ./outE[@label='friend']/inV
/outE[@label='favorite']/inV
==>v[6]
==>v[6]

gremlin> ./outE[@label='friend']/inV
/outE[@label='favorite' and @created_at>234500]
/inV/@name
==>Bryce Canyon

Searching FOAF (No Self) Favorites: SQL/MySQL vs. Gremlin/Neo4j

What do Rand’s friends’ friends (who are not Rand) favorite?47

mysql> SELECT mb._value FROM friend as fa, friend as fb, favorite,
metadata as ma, metadata as mb WHERE ma._key='name' AND
ma._value='Rand Fitzpatrick' AND ma.vertex=fa.outV AND fa.inV=fb.outV
AND fb.inV != ma.vertex AND fb.inV=favorite.outV AND
mb.vertex=favorite.inV AND mb._key='name';

364905 rows in set (11.17 sec -- 11170.0 ms)

gremlin> g:key('name','Rand Fitzpatrick')/outE[@label='friend']
/inV/outE[@label='friend']/inV[g:except(.,$_)]
/outE[@label='favorite']/inV/@name

47 When in cache (through repeated, identical querying), SQL/MySQL evaluates in ∼6.25 seconds (6250 ms) and Gremlin/Neo4j evaluates in ∼1.0 second (1000 ms).

A Traversal Detour Through the Web of Data

[Figure: Linking Open Data cloud diagram, as of July 2009. Each node is a dataset on the Web of Data (e.g. DBpedia, Freebase, Geonames, Musicbrainz, UniProt, PubMed, DBLP, US Census Data) and each edge is a link between datasets.]

Image produced by Richard Cyganiak and Anja Jentzsch. [http://linkeddata.org/]


Defining the Web of Data

• The Web of Data is similar to the Web of Documents (of common knowledge), but instead of referencing documents (e.g. HTML, images, etc.) with the URI address space, individual data are referenced.48 49

* 〈http://markorodriguez.com, foaf:fundedBy, http://atti.com〉
* 〈http://markorodriguez.com, foaf:name, "Marko Rodriguez"〉
* 〈http://markorodriguez.com, foaf:age, "30"〉
* 〈http://markorodriguez.com, foaf:knows, http://tenderlovemaking.com〉

• In graph theoretic terms, the Web of Data is a multi-relational graph defined as G ⊆ (U ∪ B) × U × (U ∪ B ∪ L), where U is the set of all URIs, B is the set of all blank/anonymous nodes, and L is the set of all literals.

48 The Web of Data is also known as the Linked Data Web, the Giant Global Graph, the Semantic Web, the RDF graph, etc.

49* Rodriguez, M.A., “Interpretations of the Web of Data,” Data Management in the Semantic Web, eds. H. Jin and Z. Lv, Nova Publishing, in press, 2010. [http://arxiv.org/abs/0905.3378]
* Rodriguez, M.A., “A Graph Analysis of the Linked Data Cloud,” Technical Report, KRS-2009-01, 2009. [http://arxiv.org/abs/0903.0194]
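The subject-predicate-object statements above can be treated as plain tuples, which makes the "multi-relational graph" reading concrete: the predicate is the edge label. The sketch below is illustrative only; the `match` helper is hypothetical, and literals are kept as quoted strings to distinguish them from URIs.

```python
# The four example statements as (subject, predicate, object) tuples
TRIPLES = [
    ("http://markorodriguez.com", "foaf:fundedBy", "http://atti.com"),
    ("http://markorodriguez.com", "foaf:name", '"Marko Rodriguez"'),
    ("http://markorodriguez.com", "foaf:age", '"30"'),
    ("http://markorodriguez.com", "foaf:knows", "http://tenderlovemaking.com"),
]


def match(s=None, p=None, o=None):
    """Return every triple matching the given pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Who does Marko know? Follow the foaf:knows edge and keep the object.
known = [o for (_, _, o) in match(s="http://markorodriguez.com", p="foaf:knows")]
```

A pattern with wildcards in different positions corresponds to a different traversal direction: fixing the subject walks outgoing edges, fixing the object walks incoming ones.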



Some of the Datasets on the Web of Data

data set          domain      data set          domain      data set            domain
audioscrobbler    music       govtrack          government  pubguide            books
bbclatertotp      music       homologene        biology     qdos                social
bbcplaycountdata  music       ibm               computer    rae2001             computer
bbcprogrammes     media       ieee              computer    rdfbookmashup       books
budapestbme       computer    interpro          biology     rdfohloh            social
chebi             biology     jamendo           music       resex               computer
crunchbase        business    laascnrs          computer    riese               government
dailymed          medical     libris            books       semanticweborg      computer
dblpberlin        computer    lingvoj           reference   semwebcentral       social
dblphannover      computer    linkedct          medical     siocsites           social
dblprkbexplorer   computer    linkedmdb         movie       surgeradio          music
dbpedia           general     magnatune         music       swconferencecorpus  computer
doapspace         social      musicbrainz       music       taxonomy            reference
drugbank          medical     myspacewrapper    social      umbel               general
eurecom           computer    opencalais        reference   uniref              biology
eurostat          government  opencyc           general     unists              biology
flickrexporter    images      openguides        reference   uscensusdata        government
flickrwrappr      images      pdb               biology     virtuososponger     reference
foafprofiles      social      pfam              biology     w3cwordnet          reference
freebase          general     pisa              computer    wikicompany         business
geneid            biology     prodom            biology     worldfactbook       government
geneontology      biology     projectgutenberg  books       yago                general
geonames          geographic  prosite           biology     . . .

Web of Data Dataset Dependencies

[Figure: dependency graph of the Web of Data datasets. Each vertex is a dataset (e.g. dbpedia, freebase, geonames, uniprot, pubmed, musicbrainz, dblprkbexplorer, foafprofiles) and the edges show which datasets link to which.]

Web of Data Transforms Development Paradigm

A new application development paradigm emerges. No longer do data and application providers need to be the same entity (left). With the Web of Data, it’s possible for developers to write applications that utilize data that they do not maintain (right).50

[Figure: two panels. Left: Applications 1, 2, and 3 at 127.0.0.1, 127.0.0.2, and 127.0.0.3, each with its own structures and processes. Right: the same three applications, with their structures and processes, connected through the Web of Data.]

50 Rodriguez, M.A., “A Reflection on the Structure and Process of the Web of Data,” Bulletin of the American Society for Information Science and Technology, 35(6), pp. 38–43, doi:10.1002/bult.2009.1720350611, 2009. [http://arxiv.org/abs/0908.0373]


Extending our Knowledge of Bryce Canyon National Park

gremlin> $h := lds:open()

gremlin> $_ := g:add-v($h, 'http://dbpedia.org/resource/Bryce_Canyon_National_Park')

==>v[http://dbpedia.org/resource/Bryce_Canyon_National_Park]

gremlin> ./outE

==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:reference -> http://www.nps.gov/brca/]

==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:iucnCategory -> "II"@en]

==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:numberOfVisitors -> "1012563"^^<xsd:integer>]

==>e[dbpedia:Bryce_Canyon_National_Park - skos:subject -> dbpedia:Category:Colorado_Plateau]

==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:visitationNum -> "1012563"^^<xsd:int>]

==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:abstract -> "Bryce Canyon National Park is a national

park located in southwestern Utah in the United States..."@en]

==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:area -> "35835.0"^^<http://dbpedia.org/datatype/acre>]

==>e[dbpedia:Bryce_Canyon_National_Park - rdf:type -> dbpedia-owl:ProtectedArea]

==>e[dbpedia:Bryce_Canyon_National_Park - dbpedia-owl:location -> dbpedia:Garfield_County%2C_Utah]

==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:nearestCity -> dbpedia:Panguitch%2C_Utah]

==>e[dbpedia:Bryce_Canyon_National_Park - dbpprop:established -> "1928-09-15"^^<xsd:date>]

...

51

51 Linked Data Sail (LDS) was developed by Joshua Shinavier (RPI and TinkerPop) and connects to Gremlin through Gremlin’s native support for Sail (i.e. for RDF graphs). LDS caches the traversed aspects of the Web of Data into any quad-store (e.g. MemoryStore, AllegroGraph, HyperGraphSail, Neo4jSail, etc.).

Augmenting Traversals with the Web of Data

Let’s extend our query over the Web of Data and incorporate the result into our searching, scoring, ranking, and recommendation.

gremlin> $visits := ./outE[@label='dbpprop:visitationNum']/inV/@value
==>1012563

gremlin> $acreage := ./outE[@label='dbpprop:area']/inV/@value
==>35835.0

### imagine wrapping traversals in Gremlin functions:
### func lds:acreage($h, $v) and func lds:visitors($h, $v)

gremlin> ./outE[@label='friend']/inV/outE[@label='favorite']
/inV[lds:acreage($h, .) < 1000000 and lds:visitors($h, .) < 2000000]/@name
==>Bryce Canyon

Thus, what do tenderlove’s friends favorite that are small in acreage and visitation?52

52 In Gremlin, it’s possible to have multiple graphs open in parallel and thus mix and match data from each graph as desired. Hence, as demonstrated by the example above, it’s possible to mix Web of Data RDF graph data and Blueprints property graph data.

Using the Web of Data for Music Recommendation

Yet another aside: using only data from the Web of Data to recommend musicians/bands with a simplistic, edge-boolean spreading activation algorithm.53

gremlin> $_ :=
g:id('http://dbpedia.../Grateful_Dead')
==>v[http://dbpedia.../Grateful_Dead]

gremlin> lds:spreading-activation(.)

==>Jerry Garcia Acoustic Band

==>BK3

==>Phil Lesh and Friends

==>Old and In the Way

==>RatDog

==>The Dead

==>Heart of Gold Band

==>Legion of Mary

==>The Tubes

==>Bob Dylan

==>New Riders of the Purple Sage

==>Bruce Hornsby

==>Donna Jean Godchaux

==>Kingfish

==>Jerry Garcia Band

==>Donna Jean Godchaux Band

==>The Other Ones

==>Bobby and the Midnites

==>Furthur

==>Rhythm Devils

53 Please read the following for interesting, deeper ideas in this space: Clark, A., “Associative Engines: Connectionism, Concepts, and Representational Change,” MIT Press, 1993.

Another View of the TinkerPop Stack

[Figure: a Local Dataset connected to the Web of Data through owl:sameAs links, with Web of Data resources retrieved via GET http://host/resource.]

Scoring Example: How Many of My Friends Favorite X?


gremlin> .
==>v[1]

gremlin> ./outE[@label='friend']/inV
==>v[3]
==>v[2]

gremlin> g:count(./outE[@label='friend']/inV
/outE[@label='favorite']
/inV[@id=6])
==>2
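Scoring by counting paths, as in the query above, amounts to walking two steps and counting how many walks end at the target vertex. A minimal Python sketch follows; the adjacency tables are an assumed reading of the toy graph and `friends_who_favorite` is a hypothetical helper name.

```python
# Assumed reading of the toy graph
FRIEND = {1: [2, 3], 2: [1], 3: [1, 5], 5: []}
FAVORITE = {1: [4], 2: [6], 3: [6], 5: [6]}


def friends_who_favorite(v, place):
    """Count the friend -> favorite paths from v that end at the given place."""
    return sum(
        1
        for friend in FRIEND.get(v, [])
        for favorite in FAVORITE.get(friend, [])
        if favorite == place
    )


score = friends_who_favorite(1, 6)
```

The score is a path count, not a distinct-friend count: if a friend could reach the place along two different edges, both would contribute.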

Scoring Example: How Many of My FOAFs Favorite X?


gremlin> .
==>v[1]

gremlin> g:count(
(./outE[@label='friend']/inV)[g:assign('$x')]
/outE[@label='friend']
/inV[g:except(.,$_)][g:except(.,$x)]
/outE[@label='favorite']/inV[@id=6])
==>1

Loading Identical Data into MySQL and Neo4j

Now we will use a larger data set. 10,000,000 edges are created between 100,000 vertices. Random assignment with 50% favorite-edges and 50% friend-edges. This is a dense, relatively unnatural graph: everyone is heavily connected.54

mysql> (SELECT * FROM favorite) UNION (SELECT * FROM friend)

10071100 rows in set (4 min 28.10 sec)

gremlin> g:count($_g/E)

10071100 edges in return (5 min 35 sec)

54 The largest Neo4j instance that I know of contained 100,030,002 (100 million) vertices, 3,041,030,000 (3 billion) edges, and 140,120,000 (140 million) properties. This was deployed on Amazon EC2 and was yielding FOAF traversals, on average, in ∼50 ms (again, index-free traversal). Figures provided by Todd Stavish (Stav.ish Consulting [http://blog.stavi.sh/]).

Querying Random Vertices with Repeats

mysql> SELECT count(favorite.inV) FROM friend as fa, friend as fb, favorite

WHERE fa.outV=XXX AND fa.inV=fb.outV AND fb.inV=favorite.outV;

29.72 sec -- vertex 110752

0.330 sec -- vertex 110752 REPEAT

10.10 sec -- vertex 145893

11.64 sec -- vertex 126993


14.37 sec -- vertex 136442

6.990 sec -- vertex 154837


gremlin> g:count(g:id(XXX)/outE[@label='friend']/inV

/outE[@label='friend']/inV/outE[@label='favorite']/inV)

3.646 sec -- vertex 110752


0.756 sec -- vertex 145893

3.251 sec -- vertex 126993


1.462 sec -- vertex 136442

1.875 sec -- vertex 154837


Recommendation

Extending the Schema for Some Richer Examples

For the last part of this presentation on recommendation, we will extend the data schema to include tags (a place can be tagged with a tag). This will allow for some richer examples.55 56

Person: name=<string>, location=<string>, gender=<string>, type=Person
Place: name=<string>, lat=<double>, long=<double>, type=Place
Tag: name=<string>, type=Tag
Edges: Person -friend-> Person, Person -favorite-> Place, Place -tagged-> Tag

55 Please note that 1.) “place” can be item/thing/book/music/etc. 2.) “favorite” can be likes/purchased/visited/etc. 3.) “tag” can be category/etc. A particular use case is presented, but with little imagination, application to other schemas is, of course, plausible.

56 Following examples have experimental syntax that may differ slightly from official Gremlin 0.5 release.

Recommendation Example: Friend Finder

• Open Friendship Triangles: (V × Ψ) → (V × N+)57 (people)

1. Create return map (i.e. V × N+).
2. Determine who my friends are.
3. Determine who my friends’ friends are...
4. ...that are not already my friends or me. (weighted by the number of overlapping friends: more overlaps, more traversers at that user vertex)
5. Sort return map by number of traversers at those user/people vertices.

$m := g:map()
(./outE[@label='friend']/inV)[g:assign('$x')]
/outE[@label='friend']/inV
/.[g:except(.,$x)][g:except(.,$_)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

57 $\left(\left(R_x \circ A^{friend}\right) \cdot A^{friend}\right) \circ n\left(A^{friend}\right) \circ n(I)$, where $x$ is the user/person vertex. The in-degree centrality vector of the derived adjacency matrix determines the resultant V rank.
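The friend finder steps can be sketched in Python with a counter standing in for the return map `$m`. This is a minimal sketch on a hypothetical friendship graph (the vertex names and the `friend_finder` helper are made up for illustration), not the Gremlin engine's evaluation strategy.

```python
from collections import Counter

# Hypothetical friendship graph: person -> list of friends
FRIEND = {
    "a": ["b", "c"],
    "b": ["a", "d", "e"],
    "c": ["a", "d"],
    "d": [],
    "e": [],
}


def friend_finder(me):
    """Rank friends of friends by how many paths (shared friends) lead to them."""
    friends = FRIEND.get(me, [])
    excluded = set(friends) | {me}       # not already my friends, and not me
    scores = Counter(
        candidate
        for friend in friends
        for candidate in FRIEND.get(friend, [])
        if candidate not in excluded
    )
    return scores.most_common()          # sorted by number of traversers


ranking = friend_finder("a")
```

Here "d" is reachable through both "b" and "c", so two traversers arrive there and it outranks "e", which closes only one open triangle.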

Recommendation Example: Follower Finder

• People Similarity based on Favorites: (V × Ψ) → (V × N+)58 (people)

1. Create return map (i.e. V × N+).
2. Determine what I favorite/like/prefer/purchased/etc.
3. Of those things I favorite, who else favorites them that are not me? (weighted user similarity based on taste: the more I share in common, the more traversers are at that user vertex).
4. Filter out those people that are my friends.
5. Sort return map by number of traversers at those people vertices.

$m := g:map()
(./outE[@label='favorite']/inV)[g:assign('$x')]
/inE[@label='favorite']/outV[g:except(.,$_)]
/outE[@label='friend']/inV[g:except(.,$x)]/../..[g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

58 $\left(\left(R_x \circ A^{favorite}\right) \cdot A^{favorite\top}\right) \circ n(I) \circ n\left(A^{friend}\right)$. The in-degree centrality vector of the derived adjacency matrix determines the resultant V rank.
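The follower finder steps (shared favorites, minus myself and my existing friends) can be sketched in Python as follows. The data, the inverted index `FAVORITED_BY`, and the `follower_finder` helper are hypothetical; the sketch follows the numbered steps rather than the experimental Gremlin syntax.

```python
from collections import Counter, defaultdict

# Hypothetical data: person -> favorited places, person -> friends
FAVORITE = {"a": ["p1", "p2"], "b": ["p1", "p2"], "c": ["p1"], "d": ["p2"]}
FRIEND = {"a": ["d"]}

# Incoming favorite edges: place -> people who favorite it
FAVORITED_BY = defaultdict(list)
for person, places in FAVORITE.items():
    for place in places:
        FAVORITED_BY[place].append(person)


def follower_finder(me):
    """Rank non-friends by the number of favorites they share with me."""
    excluded = set(FRIEND.get(me, [])) | {me}
    scores = Counter(
        other
        for place in FAVORITE.get(me, [])
        for other in FAVORITED_BY[place]
        if other not in excluded
    )
    return scores.most_common()


ranking = follower_finder("a")
```

Person "d" shares a favorite with "a" but is already a friend, so only "b" (two shared favorites) and "c" (one) are returned.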

Recommendation Example: Follower Finder 2

• People Similarity based on Tags: (V × Ψ) → (V × N+)59 60 (people)

1. Create return map (i.e. V × N+).
2. Determine the tags associated with what I favorite.
3. What else is tagged with those tags?
4. Who favorites those tagged items that are not me.61
5. Sort return map by number of traversers at those people vertices.

$m := g:map()
./outE[@label='favorite']/inV/outE[@label='tagged']/inV
/inE[@label='tagged']/outV
/inE[@label='favorite']/outV[g:except(.,$_)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

59 $\left(\left(R_x \circ A^{favorite}\right) \cdot A^{tagged} \cdot A^{tagged\top} \cdot A^{favorite\top}\right) \circ n(I)$. The in-degree centrality vector of the derived adjacency matrix determines the resultant V rank.

60 Variations on this theme can be used for expertise identification.

61 A user’s friends could be recommended. This filter was ignored for the sake of brevity.

Recommendation Example: “Users Who Like x Also Like y”

• Co-Favorited Places: (V × Ψ) → (V × N+)62 63 (places)

1. Create return map (i.e. V × N+).
2. Determine who has favorited (i.e. liked) place x.
3. What else have they favorited that is not place x.
4. Sort return map by number of traversers at those place vertices.

$m := g:map()
$x/inE[@label='favorite']/outV
/outE[@label='favorite']/inV[g:except(.,$x)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

62 $\left(\left(R_x \circ A^{favorite\top}\right) \cdot A^{favorite}\right) \circ n(C_x)$. In-degree centrality of derived matrix determines rank.

63 This type of recommendation may be considered content-based recommendation. When two vertices share content (relations to other vertices), they are deemed similar. Co-relation, in general, is a pattern for content-based recommendation. Look back at the first three recommendation examples: “friend finder” (co-friend), “follower finder” (co-favorites), “follower finder 2” (co-tagged-favorites).
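The co-favorited places pattern walks an incoming favorite edge and then an outgoing one. A minimal Python sketch on hypothetical data (the names and the `also_like` helper are illustrative assumptions):

```python
from collections import Counter, defaultdict

# Hypothetical data: person -> favorited places
FAVORITE = {"a": ["p1", "p2", "p3"], "b": ["p1", "p2"], "c": ["p2"]}

# Incoming favorite edges: place -> people who favorite it
FAVORITED_BY = defaultdict(list)
for person, places in FAVORITE.items():
    for place in places:
        FAVORITED_BY[place].append(person)


def also_like(x):
    """Rank places by how many people who favorited x also favorited them."""
    scores = Counter(
        place
        for person in FAVORITED_BY[x]     # who has favorited x
        for place in FAVORITE[person]     # what else they favorited
        if place != x                     # that is not x itself
    )
    return scores.most_common()


ranking = also_like("p1")
```

Both "a" and "b" favorited "p1" and "p2", so "p2" collects two traversers, while "p3" collects one from "a" alone.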

Recommendation Example: Places Related through Tags

• Co-Tagged Places: (V × Ψ) → (V × N+)64 65 (places)

1. Create return map (i.e. V × N+).
2. Determine the tags for place x.
3. What else is tagged the same as x that is not x.
4. Sort return map by number of traversers at those place vertices.

$m := g:map()
$x/outE[@label='tagged']/inV
/inE[@label='tagged']/outV[g:except(.,$x)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

64 $\left(\left(R_x \circ A^{tagged}\right) \cdot A^{tagged\top}\right) \circ n(I)$. In-degree centrality of derived matrix determines rank.

65 Yet another type of content-based recommendation, but items are similar to each other not because of co-favoriting, but because of co-tagging. Think about mixing and matching different similarities. How do you weight the different “co”-graphs (i.e. aAα + bAβ)? Statistical techniques can surface the significant factors.
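The co-tagged places pattern is the same out-then-in walk over tagged edges. The sketch below uses hypothetical places and tags, and `co_tagged` is an illustrative helper rather than part of any library.

```python
from collections import Counter, defaultdict

# Hypothetical data: place -> tags
TAGGED = {
    "p1": ["park", "hiking"],
    "p2": ["park", "hiking"],
    "p3": ["park"],
    "p4": ["bar"],
}

# Incoming tagged edges: tag -> places carrying that tag
PLACES_WITH = defaultdict(list)
for place, tags in TAGGED.items():
    for tag in tags:
        PLACES_WITH[tag].append(place)


def co_tagged(x):
    """Rank places by the number of tags they share with place x."""
    scores = Counter(
        other
        for tag in TAGGED.get(x, [])
        for other in PLACES_WITH[tag]
        if other != x
    )
    return scores.most_common()


ranking = co_tagged("p1")
```

Swapping the roles of the two tables (start from a tag, walk to places, walk back to tags) gives the co-placed tags variant with no other changes.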

Recommendation Example: Tags Related through Places

• Co-Placed Tags: (V × Ψ) → (V × N+)66 67 (tags)

1. Create return map (i.e. V × N+).
2. Determine what has been tagged x.
3. What other tags do those items have that are not x.
4. Sort return map by number of traversers at those tag vertices.

$m := g:map()
$x/inE[@label='tagged']/outV
/outE[@label='tagged']/inV[g:except(.,$x)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

66 $\left(\left(R_x \circ A^{tagged\top}\right) \cdot A^{tagged}\right) \circ n(I)$. In-degree centrality of derived matrix determines rank.

67 In the previous example, items were related if they shared the same tags. In this example, tags are related if they are used to tag the same items. Anything can be deemed similar to anything else if there exist paths between such items, inferred or explicit. The path taken (Ψ) determines the meaning/type of similarity. Cognitive philosophers/psychologists see this as associativity through spreading activation.

Recommendation Example: Collaborative Filtering 1

• Basic Collaborative Filtering: (V × Ψ) → (V × N+)68 (places)

1. Create return map (i.e. V × N+).
2. Determine what I favorite/like/prefer/purchased/etc.
3. Of those things I favorite, who else favorites them? (weighted user similarity based on taste: the more I share in common, the more traversers are at that person vertex).
4. Of those similar users, what do they favorite that I don’t already favorite?
5. Sort return map by number of traversers at those favorited places.

$m := g:map()
(./outE[@label='favorite']/inV)[g:assign('$x')]
/inE[@label='favorite']/outV
/outE[@label='favorite']/inV[g:except(.,$x)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

68 Related to “follower finder” from previous. However, it takes the traversal one step further. Instead of simply finding who is similar to me with respect to favoriting, you then compute what those similar users also favorite. This is a classic case for path-reuse as an optimization.
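Basic collaborative filtering extends the follower finder walk by one more step. The Python sketch below follows the numbered steps on hypothetical data; the user names, places, and the `collaborative_filter` helper are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical data: person -> favorited places
FAVORITE = {
    "me": ["p1", "p2"],
    "u1": ["p1", "p2", "p3"],
    "u2": ["p1", "p4"],
    "u3": ["p5"],
}

# Incoming favorite edges: place -> people who favorite it
FAVORITED_BY = defaultdict(list)
for person, places in FAVORITE.items():
    for place in places:
        FAVORITED_BY[place].append(person)


def collaborative_filter(me):
    """Rank places favorited by users who share favorites with me."""
    mine = FAVORITE.get(me, [])
    scores = Counter()
    for place in mine:                          # what I favorite
        for other in FAVORITED_BY[place]:       # who else favorites it
            if other == me:
                continue
            for candidate in FAVORITE[other]:   # what they favorite...
                if candidate not in mine:       # ...that I don't already
                    scores[candidate] += 1
    return scores.most_common()


ranking = collaborative_filter("me")
```

User "u1" is reached twice (once per shared favorite), so its extra favorite "p3" receives two traversers; "u3" shares nothing with "me" and contributes no recommendations.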

Recommendation Example: Collaborative Filtering 2

• Collaborative “Category” Filtering: (V × Ψ × V) → (V × N+) (places)

1. Create return map (i.e. V × N+).
2. Determine what I favorite...
3. ...in category/tag x.
4. Of those things I favorite, who else favorites them?
5. Of those similar users, what do they favorite categorized/tagged x ...
6. ...that I don’t already favorite?
7. Sort return map by number of traversers at those favorited places.

$m := g:map()
(./outE[@label='favorite']/inV
/outE[@label='tagged']/inV[@name='bar']/../..)[g:assign('$x')]
/inE[@label='favorite']/outV
/outE[@label='favorite']/inV/outE[@label='tagged']/inV[@name='bar']
/../..[g:except(.,$x)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

• Collaborative “Location” Filtering: (V × Ψ × R4) → (V × N+)69 (places)

1. Create return map (i.e. V × N+).
2. Determine what I favorite.
3. Of those things I favorite, who else favorites them?
4. Of those similar users, what do they favorite in bounding box x1, x2, y1, y2...
5. ...that I don’t already favorite?
6. Sort return map by number of traversers at those places.

$m := g:map()
(./outE[@label='favorite']/inV)[g:assign('$x')]
/inE[@label='favorite']/outV
/outE[@label='favorite']/inV[@lat > $x1 and @lat < $x2 ...]
/.[g:except(.,$x)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

69 Location-filtering idea adapted from the Bonobo recommender engine by Nate Murray (AT&Ti).


• Collaborative “State of Mind” Filtering: (V × Ψ × N+) → (V × N+) (places)

1. Create return map (i.e. V × N+).
2. Determine what I have favorited in the last x minutes.
3. Of those things I recently favorited, who else favorites them?
4. Of those similar users, what do they favorite that I don’t?
5. Sort return map by number of traversers at those favorited places.

$m := g:map()
(./outE[@label='favorite' and @created_at > 1234567]/inV)[g:assign('$x')]
/inE[@label='favorite']/outV
/outE[@label='favorite']/inV[g:except(.,$x)][g:op-value('+',$m,.,1)]
g:sort($m,'value',true)

• Collaborative “Zietgeist” Filtering: (V ×Ψ× N+)→ (V × N+) (places)


2. Determine what I have favorited.

3. Of those things I favorited, who else favorites them?

4. Of those similar users, what have they favorited in the last x minutes...



$m := g:map()



/outE[@label=‘favorite’ and @created_at > 1234567]

/inV[g:except(.,$x)][g:op-value(‘+’,$m,.,1)]
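A time restriction on the final `favorite` edge can be sketched in Python as follows. This is an assumption-laden illustration: the edge tuples, the integer timestamps, and the function name `recommend_recent` are hypothetical, and the timestamp lives on the edge as `@created_at` does in the expression above.

```python
from collections import Counter

# Hypothetical 'favorite' edges as (user, place, created_at) tuples.
favorite_edges = [
    ("marko", "cafe", 100),
    ("marko", "bar", 200),
    ("josh", "cafe", 150),
    ("josh", "museum", 900),
    ("josh", "park", 300),
    ("peter", "bar", 250),
    ("peter", "museum", 950),
    ("peter", "pier", 400),
]

def recommend_recent(user, since, edges):
    """Recommend places that similar users favorited after the `since` timestamp."""
    mine = {p for u, p, _ in edges if u == user}         # what I have favorited
    counts = Counter()
    for place in mine:
        # Who else favorites this place?
        similar = {u for u, p, _ in edges if p == place and u != user}
        for other in similar:
            for u, candidate, created_at in edges:
                # Follow only their recent favorite edges, excluding my own favorites.
                if u == other and created_at > since and candidate not in mine:
                    counts[candidate] += 1
    return counts.most_common()

print(recommend_recent("marko", 800, favorite_edges))
```

Moving the timestamp test from the last step to the first step (my own recent favorites) would give the “State of Mind” variant instead.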


...keep going all day long.

A Cornucopia of Recommendations – Part 1

• It’s possible to use offline statistical methods to determine which factors of a vertex contribute to user interest (e.g. PCA+KMeans to determine metadata contributing to shared interests). (slow)

• Then, use online, real-time graph methods to incorporate those features into the traversal (i.e. to define Ψ). (fast)

* Mix various traversals together: aAα + bAβ + . . . + zAζ (or other, perhaps non-linear combinations).70
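One way to picture the linear mix is to combine the count maps returned by several traversals for the same starting vertex, each scaled by its own coefficient. The sketch below is an assumption: the function `mix`, the example count maps, and the weights are hypothetical, and it operates on per-vertex results rather than on the full path matrices.

```python
from collections import Counter

def mix(weighted_results):
    """Linearly combine traversal results: sum of weight * count for each vertex."""
    combined = Counter()
    for weight, counts in weighted_results:
        for vertex, count in counts.items():
            combined[vertex] += weight * count
    # Highest combined score first; ties keep their first-seen order.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical outputs of two different recommendation traversals.
category_counts = {"lounge": 2, "tavern": 1}
location_counts = {"tavern": 3, "museum": 1}

ranked = mix([(0.5, category_counts), (1.0, location_counts)])
print(ranked)
```

The coefficients could be fixed by hand or derived from the offline analysis mentioned above; a non-linear combination would replace the weighted sum inside `mix`.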

70 Though not discussed in this presentation, sampling techniques can be used to increase the speed of a traversal. For example, ./outE[g:rand-real() > 0.5] only traverses, on average, 50% of the edges. Moreover, if edges have weights, those weights can be used to create probability distributions and thus, biased sampling can be implemented (i.e. random walks).
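Both sampling ideas in the footnote above can be sketched with Python's standard `random` module. The function names, the `keep_probability` parameter, and the `(vertex, weight)` pair representation are assumptions made for illustration.

```python
import random

def sample_edges(edges, keep_probability, rng):
    """Keep each edge independently with the given probability (uniform sampling)."""
    return [edge for edge in edges if rng.random() < keep_probability]

def weighted_step(weighted_neighbors, rng):
    """Choose the next vertex with probability proportional to its edge weight."""
    vertices = [vertex for vertex, _ in weighted_neighbors]
    weights = [weight for _, weight in weighted_neighbors]
    return rng.choices(vertices, weights=weights, k=1)[0]

rng = random.Random(42)                        # seeded so runs are repeatable
kept = sample_edges(list(range(1000)), 0.5, rng)
print(len(kept))                               # roughly half of the 1000 edges

# One step of a biased random walk over hypothetical weighted out-edges.
print(weighted_step([("a", 0.9), ("b", 0.1)], rng))
```

Repeating `weighted_step` from each chosen vertex yields a biased random walk; the counts of visited vertices then approximate the full traversal at a fraction of the cost.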

A Cornucopia of Recommendations – Part 2

• ...also, be creative. Develop numerous recommendation traversals for numerous problem-solving situations.71

• Make use of user click-behavior to determine usefulness.

• ...Or, allow users to select which algorithms they want to apply (give them the option to select how they want to solve their problems).

71 For a fine review of graph-based techniques and ideas regarding recommendation, please see:

* Mirza, B.J., Keller, B., Ramakrishnan, N., “Studying Recommendation Algorithms by Graph Analysis,” Journal of Intelligent Information Systems, 20(2), pp. 131–160, doi:10.1023/a:1021819901281, 2003.

* Huang, Z., Zeng, D., Chen, H., “A Link Analysis Approach to Recommendation Under Sparse Data,” Proceedings of the Tenth Americas Conference on Information Systems, 2004.

* Perugini, S., Goncalves, M.A., Fox, E., “Recommender System Research: A Connection-Centric Survey,” Journal of Intelligent Information Systems, 23(2), pp. 107–143, 2004.

* Rodriguez, M.A., Bollen, J., Van de Sompel, H., “Automatic Metadata Generation using Associative Networks,” ACM Transactions on Information Systems, 27(2), pp. 1–20, doi:10.1145/1462198.1462199, 2009. [http://arxiv.org/abs/0807.0023]


Traversal Algorithms Simulate User Behavior

• A traversal is like a simulation of the user(s).

• If all the user had were direct links (i.e. a basic user-interface over the dataset), what path would they take to solve their problem?

• Operationalize as a traversal and you have simulated (and sped up) their problem-solving behavior.72,73

72 Rodriguez, M.A., Watkins, J., “Faith in the Algorithm, Part 2: Computational Eudaemonics,” Proceedings of the International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, Lecture Notes in Artificial Intelligence, 5712, pp. 813–820, doi:10.1007/978-3-642-04592-9_101, Springer-Verlag, 2009. [http://arxiv.org/abs/0904.0027] See also Faith in the Algorithm, in general: http://faithinthealgorithm.net.

73 Think of the graph data set as a conceptual graph, “things” and their relationships to each other: the world as index. Think how your mind composes, manipulates, and makes use of such structures to solve problems: to think, to infer, to creatively combine (i.e. join, traverse) ideas. Automate that process... automate the process that generates that process. [http://arxiv.org/abs/0704.3395]




Graph Traversal Model: Benefits and Drawbacks

• Benefits:

* The solution is explainable (i.e. the factors/paths are known).

* Evaluations can happen in real-time and on live data.74,75

* Can easily develop/deploy new traversals for different problems.76

• Drawbacks:

* If intuition fails, derive factors with offline statistical techniques.77,78

74 A user can add an edge and then recalculate a traversal.

75 It is noted that this depends on the complexity of the traversal and density of the graph.

76 For very rich data models, this is a promising proposition.

77 In the past, my method has been to use intuition to develop traversals, and then with sample data, validate/tweak the traversal [http://arxiv.org/abs/cs/0605112, http://arxiv.org/abs/0807.0023]. Also, for live systems with active users, using click-behavior is possible.

78 Think about deriving Ψ from the paths that the users take through the data. “Ruts,” given the law of large numbers, can expose the collective’s problem-solving behavior. In short, study your users to derive Ψ.



The Future of Gremlin – Part 1

• Pavel Yaskevich and I are currently rewriting Gremlin from the ground up with a new compiler and virtual machine. Orders of magnitude faster and more memory efficient. (now)79

• Make use of equivalences in the path algebra to do run-time optimizations of path traversals. Extend the algebra. (future)

• Make use of path caching to do (V ×Ψ)→ P(V ) lookups. (future)80

* For example, x · ./outE/inV/outE/inV → a, b, c, d, . . .

79 This new implementation is Gremlin 0.5 and can currently be git pulled; note that it’s unstable until the official release date.

80 A simple, intelligent memoization technique introduced by Joshua Shinavier in the Ripple programming language [http://ripple.fortytwo.net/].
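The (V × Ψ) → P(V) lookup can be pictured as memoization keyed on a start vertex and a path description. The Python sketch below is an assumption and does not reflect how Gremlin or Ripple implement caching: the `PathCache` class, the labeled-edge dictionary, and the representation of a path as a sequence of edge labels are all hypothetical.

```python
class PathCache:
    """Memoize the set of vertices reached from a start vertex along a labeled path."""

    def __init__(self, out_edges):
        self.out_edges = out_edges    # vertex -> list of (edge_label, head_vertex)
        self.cache = {}               # (start, path) -> frozenset of reached vertices
        self.misses = 0               # number of lookups that had to traverse

    def lookup(self, start, path):
        key = (start, tuple(path))
        if key not in self.cache:
            self.misses += 1
            frontier = {start}
            for label in path:        # follow one outgoing edge per path step
                frontier = {
                    head
                    for tail in frontier
                    for edge_label, head in self.out_edges.get(tail, ())
                    if edge_label == label
                }
            self.cache[key] = frozenset(frontier)
        return self.cache[key]

# Hypothetical graph: x knows a and b; a knows c; b knows c and d.
out_edges = {
    "x": [("knows", "a"), ("knows", "b")],
    "a": [("knows", "c")],
    "b": [("knows", "c"), ("knows", "d")],
}
paths = PathCache(out_edges)
print(sorted(paths.lookup("x", ["knows", "knows"])))   # traverses the graph
print(sorted(paths.lookup("x", ["knows", "knows"])))   # answered from the cache
```

This sketch never invalidates entries, so a cached result goes stale as soon as an edge along the path is added or removed; a real cache would need an invalidation policy.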


The Future of Gremlin – Part 2

• Get more community involvement on the optimization of the compiler/virtual machine. (now)

* http://groups.google.com/group/gremlin-users/

• Support splitting/branching of path descriptions. Currently supported in Pipes, but no syntactic mapping yet available in Gremlin. (future)

* ./outE/inVsplit() [1]| ./outE/inV [2]| .[@name=‘atti’]/outE/@name

• Support for threading. Pipes, due to its data flow nature, is easily parallelized. Support concurrency through to the Gremlin language. (future)81

81 Kahn, G., “The Semantics of a Simple Language for Parallel Processing,” Proceedings of the Information Processing Congress, pp. 471–475, 1974.


Acknowledgements

• The ideas presented have been developed over the course of my time with the following institutions: University of California at Santa Cruz, Vrije Universiteit Brussel, Los Alamos National Laboratory, and AT&T Interactive.

• My core collaborators: Alberto Pepe (Harvard), Johan Bollen (Indiana University), Herbert Van de Sompel (LANL), Jennifer H. Watkins (LANL), Peter Neubauer (NeoTechnology), Joshua Shinavier (Rensselaer Polytechnic Institute), and Pavel Yaskevich (“No one, from no where.”)

• The Neo4j team [http://neo4j.org] have been instrumental in influencing my thoughts with respect to the database considerations of graph processing. These people include Peter Neubauer, Emil Eifrem, Tobias Ivarsson, Johan Svensson, Mattias Persson...

• My current institution of AT&Ti has provided me with ideas and support: Aaron Patterson, Rand Fitzpatrick, Nate Murray, Gene Chuang, and Charlie Hornberger.

• The greater TinkerPop [http://tinkerpop.com] community for their discussions, code submissions, and general excitement in the space.


Conclusion

• Model real-world structures with multi-relational/property graphs.

• Augment local data with the Web of Data.

• Store in a graph database to make traversing efficient.

• Traverse to search, score, rank, and recommend.

• Execute using TinkerPop productions.

• Relish in the glory that is the graph.

• “I must rest now. I’m tired from battle.” – Maximus.
