gremlin: a graph-based programming language
Gremlin
G = (V,E)
A Graph-Based Programming Language
Marko A. Rodriguez
T-5, Center for Nonlinear Studies
Los Alamos National Laboratory
http://markorodriguez.com
http://gremlin.tinkerpop.com
February 25, 2010
Abstract
Gremlin is a Turing-complete, graph-based programming language developed for key/value-pair multi-relational graphs called property graphs. Gremlin makes extensive use of XPath 1.0 to support complex graph traversals. Connectors exist to various graph databases and frameworks. This language has application in the areas of graph query, analysis, and manipulation.
Center for Nonlinear Studies PostDoc Seminar – Los Alamos National Laboratory – February 25, 2010
Acknowledgements
• Marko A. Rodriguez [http://markorodriguez.com]
designed, developed, tested, and documented Gremlin.
• Peter Neubauer [http://www.linkedin.com/in/neubauer]
aided in the design and the evangelizing of Gremlin.
• Pavel Yaskevich [http://github.com/xedin]
aided in the development of user defined functions in Gremlin.
• Joshua Shinavier [http://fortytwo.net]
provided initial conceptual support for Gremlin.
• Ketrina Yim [http://csillustrated.berkeley.edu]
designed the logo for Gremlin.
• Gremlin-Users Group [http://groups.google.com/group/gremlin-users]
provided much direction in the design and implementation of Gremlin.
Outline
• Introduction to Graphs and Graph Software
• Basic Gremlin Concepts
• Gremlin Language Description
• Advanced Gremlin Concepts
• Conclusions
What is a Graph?
• A graph (network) is composed of a collection of vertices (dots) and edges (lines).
There are many types of graphs: directed/undirected, weighted, attributed, etc.
[Figure: graph types shown: multi, weighted, directed, undirected, edge-labeled, vertex-labeled, vertex-attributed, edge-attributed, hyper, pseudo, regular, half-edge, semantic, resource description framework.]
Why Use a Graph?
• A graph is a very general data structure that can be used to model various systems.
  - A graph can model the structure of transportation, technological, bibliographic, etc. systems.
  - A graph can model a list, a map, a tree, etc.
• There are numerous graph algorithms that are defined independent of the domain of the graph model.
• There are numerous graph databases, frameworks, packages, etc. that aid in the creation, manipulation, and analysis of graphs.
Graph Databases, Frameworks, and Packages
• Neo4j Graph Database [http://neo4j.org]
• AllegroGraph Quad Store [http://www.franz.com/agraph]
• HyperGraphDB [http://www.kobrix.com/hgdb.jsp]
• Java Universal Network/Graph Framework [http://jung.sourceforge.net]
• OpenRDF Sesame Framework [http://www.openrdf.org]
• InfoGrid Graph Database [http://infogrid.org]
• Filament Graph Toolkit [http://filament.sourceforge.net]
• OWLim Semantic Repository [http://www.ontotext.com/owlim]
• Sones Graph Database [http://www.sones.com]
• NetworkX Graph Toolkit [http://networkx.lanl.gov]
• iGraph Toolkit [http://igraph.sourceforge.net]
• Blueprints Graph API [http://blueprints.tinkerpop.com]
• ... and many more.
What Makes Gremlin Different?
• Gremlin is a domain specific language for working with graphs.
• Gremlin is not an application programming interface (API).
• Gremlin makes use of various graph databases, frameworks, packages.
• Gremlin is a language that currently has a virtual machine implementation written in Java.
• What can be succinctly expressed in Gremlin is verbose/clumsy to express in general purpose languages such as Java, Python, Ruby, etc.
• Gremlin allows one to map single-relational graph analysis algorithms over to the multi-relational domain.
Single-Relational Graphs
• In single-relational graphs, all edges have the same meaning (e.g. all edges are either friendship, kinship, worksWith, knows, etc.).
  - G = (V, E ⊆ (V × V))
• Most graph algorithms are defined for single-relational graphs (e.g. centrality/ranking, clustering/community detection, etc.).
[Figure: a single-relational graph over the vertices person-a, person-b, and person-c.]
NOTE: These types of graphs are also known as directed, vertex-labeled graphs.
Multi-Relational Graphs
• In multi-relational graphs, edges can have different meanings.
  - G = (V, E ⊂ (V × V), ω : E → Σ*)
• Most graph software is designed for multi-relational graphs (e.g. arbitrary objects as vertices and edges, knowledge-based reasoning systems, etc.).
[Figure: a multi-relational graph over the vertices person-a, book-b, and book-c with edges labeled read, cites, and authored.]
NOTE: These types of graphs are also known as directed, vertex/edge-labeled graphs.
Gremlin and Multi-Relational Graphs
• Gremlin provides a means to elegantly map single-relational graph analysis algorithms over to the multi-relational graph domain.
• Gremlin provides an elegant way to do automated reasoning in multi-relational graphs using path expressions.
These two points form the primary thesis of this presentation.
Rodriguez, M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), 29–41, doi:10.1016/j.joi.2009.06.004, LA-UR-08-03931, http://arxiv.org/abs/0806.2274, December 2009.
Property Graphs
• Gremlin works with a type of multi-relational graph called a property graph.
  - Vertices and edges are labeled with unique identifiers.
  - Edges are directed, labeled, and can form loops.
  - Multiple edges of the same label can exist for the same vertex pair.
  - Vertices and edges can have any number of key/value pair properties/attributes.
Property graphs are a relatively general graph structure that can be constrained to model other graph
structures — though, a property-based hypergraph would be the most general (see HyperGraphDB and the
JUNG API).
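The property graph constraints listed above can be sketched as a small data structure. This is only an illustration in Python, not Gremlin or Blueprints code: the class and method names (PropertyGraph, add_vertex, add_edge, out_edges, in_edges) are hypothetical, and edge ids 100 and 101 are invented to show a parallel edge and a loop.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: int
    out_v: int  # id of the tail vertex (where the edge starts)
    label: str
    in_v: int   # id of the head vertex (where the edge ends)
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical in-memory property graph keyed by unique element ids."""

    def __init__(self):
        self.vertices = {}
        self.edges = {}

    def add_vertex(self, vid, **properties):
        if vid in self.vertices:
            raise ValueError(f"duplicate vertex id {vid}")
        self.vertices[vid] = Vertex(vid, properties)

    def add_edge(self, eid, out_v, label, in_v, **properties):
        if eid in self.edges:
            raise ValueError(f"duplicate edge id {eid}")
        if out_v not in self.vertices or in_v not in self.vertices:
            raise KeyError("both endpoints must already exist")
        # Parallel edges and loops are allowed: only the edge id must be unique.
        self.edges[eid] = Edge(eid, out_v, label, in_v, properties)

    def out_edges(self, vid, label=None):
        return [e for e in self.edges.values()
                if e.out_v == vid and label in (None, e.label)]

    def in_edges(self, vid, label=None):
        return [e for e in self.edges.values()
                if e.in_v == vid and label in (None, e.label)]


g = PropertyGraph()
g.add_vertex(1, name="marko", age=29)
g.add_vertex(2, name="vadas", age=27)
g.add_edge(7, 1, "knows", 2, weight=0.5)
g.add_edge(100, 1, "knows", 2, weight=0.1)  # hypothetical parallel edge
g.add_edge(101, 1, "knows", 1)              # hypothetical loop
```

Storing edges by id rather than by vertex pair is what lets several edges with the same label connect the same two vertices.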
Property Graphs
[Figure: the example property graph used on the following slides.]
Vertices:
  1: name = "marko", age = 29
  2: name = "vadas", age = 27
  3: name = "lop", lang = "java"
  4: name = "josh", age = 32
  5: name = "ripple", lang = "java"
  6: name = "peter", age = 35
Edges (ids 7 to 12) are labeled knows or created and carry a weight property (weights shown in the figure: 1.0, 0.5, 0.4, 0.4, 1.0, 0.2).
Gremlin System Architecture
[Figure: architecture stack showing the Gremlin Console, the Gremlin ScriptEngine, and the graph back ends NativeStore, TinkerGraph, and Neo4j.]
• The Gremlin console is a scripting environment which allows for the dynamic evaluation of Gremlin code.
• Gremlin implements JSR 223, which allows Gremlin to also be used within the Java language and thus as a virtual machine directly accessible to Java applications. Popular JSR 223 implementations include Jython, JRuby, and Groovy. For a fine list of implementations see https://scripting.dev.java.net.
• Blueprints is a set of interfaces for abstract data structures such as graphs and documents. Implementations of these interfaces exist for various data management systems.
• There exist many graph data management systems that span various graph data models (e.g. edge labeled graphs, RDF graphs, hypergraphs, etc.).
“Hello World” in the Gremlin Console
marko$ ./gremlin.sh
         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin>
gremlin> concat('goodbye', ' ', 'self')
==>goodbye self
Simple Traversals in Gremlin
[Figure: the example property graph, with the weights of the outgoing edges of vertex 1 shown (1.0, 0.5, 0.4).]
gremlin> $_ := g:key('name','marko')
==>v[1]
gremlin> .
==>v[1]
gremlin> ./outE
==>e[7][1-knows->2]
==>e[9][1-created->3]
==>e[8][1-knows->4]
gremlin> ./outE/@weight
==>0.5
==>0.4
==>1.0
./outE/@weight: “Get the current object(s). Then get the outgoing edges of those objects. Then get the weights of those edges.”
$_ is a reserved variable meaning the root list of objects.
Simple Traversals in Gremlin
[Figure: the example property graph, showing the properties of vertex 1 (marko) and vertex 3 (lop).]
gremlin> .
==>v[1]
gremlin> ./outE[@label='created']/inV
==>v[3]
gremlin> $_ := $_last
==>v[3]
gremlin> ./@name
==>lop
gremlin> g:map(.)
==>name=lop
==>lang=java
./outE[@label='created']/inV: “Get the current object(s). Then get the outgoing edges of those objects, where their labels equal 'created'. Then get the incoming vertices of those 'created' edges.”
$_last is a reserved variable meaning the last value evaluated.
Simple Traversals in Gremlin
[Figure: the example property graph, showing the properties of the vertices marko, josh, vadas, and lop.]
gremlin> ./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name
==>vadas
Simple Traversals in Gremlin
./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name
1. .: Get the current object(s).
2. outE[@label='knows']: Get the outgoing edges of the current object(s), where their labels equal 'knows'.
3. inV[matches(@name,'va.{3}') and @age > 21]: Get the incoming vertices of those 'knows' edges, where the names of those vertices are 5 characters long and start with 'va', and whose age is greater than 21.
4. @name: Get the name of those particular incoming vertices.
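The four steps above can be mirrored in plain Python to make the order of operations concrete. This is a sketch rather than Gremlin itself: the tuple layout and the function name friends_matching are assumptions, and re.fullmatch is used on the assumption that the pattern must cover the whole name, as the explanation of step 3 implies.

```python
import re

# Only the part of the example graph needed here: vertex 1 and the two
# people it knows.
VERTICES = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    4: {"name": "josh", "age": 32},
}
# Each edge is (edge id, tail vertex, label, head vertex).
EDGES = [(7, 1, "knows", 2), (8, 1, "knows", 4)]


def friends_matching(start, pattern, min_age):
    # Step 2: outgoing edges of the start vertex labeled 'knows'.
    out_edges = [e for e in EDGES if e[1] == start and e[2] == "knows"]
    # Step 3: head vertices of those edges, filtered by name pattern and age.
    heads = [e[3] for e in out_edges]
    kept = [v for v in heads
            if re.fullmatch(pattern, VERTICES[v]["name"])
            and VERTICES[v]["age"] > min_age]
    # Step 4: project the name property.
    return [VERTICES[v]["name"] for v in kept]


result = friends_matching(1, r"va.{3}", 21)
print(result)
```

The Gremlin expression is a single line; the Python version needs one statement per path step.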
Knowledge-Based Reasoning
• Blueprints implements the Sesame SAIL interfaces and thus, Gremlin can be used over the many Resource Description Framework (RDF) triple/quad stores. In such cases, RDF is modeled as a property graph where the named graph component is the @ng edge property.
• Gremlin makes use of the Sesame SAIL SPARQL engine to allow for queries based on graph-pattern matching.
gremlin> sail:sparql('SELECT ?x ?y WHERE { ?x foaf:knows ?y }')
==>{y=v[http://ex.com#2], x=v[http://ex.com#1]}
==>{y=v[http://ex.com#4], x=v[http://ex.com#1]}
• Gremlin is useful for knowledge-based reasoning using path expressions.
Reasoning as Defining New Types of Adjacency
[Figure: the example graph over marko, josh, vadas, lop, ripple, and peter with its knows and created edges, plus inferred co-developer edges.]
For these “co-developer” examples, we will use vertex 1 (marko) as the source of the reasoning process.
• Graph-based reasoning is the process of making explicit what is implicit in the graph.
• A reasoner takes a graph G and a collection of graph-patterns (i.e. transformation/rewrite rules) and creates a new graph G′ (usually, G ⊂ G′). G′ has new relationships/edges and thus, new definitions of vertex adjacency.
• Example: The co-developers of person A are those people who have created the same software as person A and who are themselves not person A (as person A has created the same software as him or herself).
The Co-Developers of Marko A. Rodriguez in SPARQL
[Figure: the example property graph overlaid with the query variables ?x, ?y, and ?z, anchored at marko.]
SELECT ?x WHERE {
  marko created ?y .
  ?z created ?y .
  ?z != marko .
  ?z name ?x
}
This query would return: josh and peter.
The Co-Developers of Marko A. Rodriguez in Gremlin
[Figure: the example graph over marko, josh, vadas, lop, ripple, and peter with its knows and created edges, plus inferred co-developer edges.]
gremlin> ./@name
==>marko
gremlin> ./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]/@name
==>josh
==>peter
The Co-Developers of Marko A. Rodriguez in Gremlin
./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]/@name
1. .: Get the current object(s) (i.e. vertex 1, denoting Marko).
2. outE[@label='created']: Get the outgoing edges of the Marko vertex, where their labels equal 'created'.
3. inV: Get the incoming (i.e. head) vertices of those 'created' edges.
4. inE[@label='created']: Get the incoming edges of those vertices, where their labels equal 'created'.
5. outV[g:except($_)]: Get the outgoing (i.e. tail) vertices of those 'created' edges, where those vertices are not the Marko vertex.
6. @name: Get the name of those non-Marko vertices.
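As a cross-check of the six steps, the same out-edge, in-vertex, in-edge, out-vertex walk can be written over a plain edge list in Python. This is an illustrative sketch only: the function name co_developers is hypothetical, and the endpoints assigned to edges 10 to 12 are an assumption chosen to agree with the results reported on these slides (the console output only spells out edges 7 to 9).

```python
VERTICES = {1: "marko", 2: "vadas", 3: "lop", 4: "josh", 5: "ripple", 6: "peter"}
# Each edge is (edge id, tail vertex, label, head vertex).
EDGES = [
    (7, 1, "knows", 2),
    (8, 1, "knows", 4),
    (9, 1, "created", 3),
    (10, 4, "created", 5),   # assumed endpoints
    (11, 4, "created", 3),   # assumed endpoints
    (12, 6, "created", 3),   # assumed endpoints
]


def co_developers(root):
    # outE[@label='created']: 'created' edges leaving the root vertex.
    created = [e for e in EDGES if e[1] == root and e[2] == "created"]
    # inV: the head vertices of those edges (the projects).
    projects = [e[3] for e in created]
    # inE[@label='created']: 'created' edges arriving at those projects.
    incoming = [e for p in projects for e in EDGES
                if e[3] == p and e[2] == "created"]
    # outV[g:except($_)]: tail vertices of those edges, excluding the root.
    return [e[1] for e in incoming if e[1] != root]


names = [VERTICES[v] for v in co_developers(1)]
print(names)
```

Running the same function from vertex 4 (josh) shows that the derived adjacency is symmetric for shared projects: marko and peter come back as josh's co-developers.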
Defining Co-Developers in Gremlin
path co-developer
  ./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]
end
Once defined, you can use it like any other path segment.
gremlin> ./co-developer
==>v[4]
==>v[6]
gremlin> ./co-developer/@name
==>josh
==>peter
Defining Co-Developers in Java
import java.util.ArrayList;
import java.util.List;

// Path, Vertex, and Edge are the Gremlin/Blueprints types used on this slide.
public class CoDeveloperPath implements Path {
  public List invoke(Object root) {
    if (root instanceof Vertex) {
      // Collect the projects that the root vertex has created.
      List<Vertex> projects = new ArrayList<Vertex>();
      for (Edge edge : ((Vertex) root).getOutEdges()) {
        if (edge.getLabel().equals("created")) {
          projects.add(edge.getInVertex());
        }
      }
      // Collect every other creator of those projects.
      List<Vertex> coDevelopers = new ArrayList<Vertex>();
      for (Vertex project : projects) {
        for (Edge edge : project.getInEdges()) {
          if (edge.getLabel().equals("created") && edge.getOutVertex() != root) {
            coDevelopers.add(edge.getOutVertex());
          }
        }
      }
      return coDevelopers;
    } else {
      return null;
    }
  }
}
Gremlin Type System
[Figure: type hierarchy rooted at object. Types shown: graph, element (with the subtypes vertex and edge), boolean, number, string, list, map.]
Predefined Paths and Properties
[Figure: a fragment of the example graph (vertices 1, 3, and 4; edges 8, 9, and 11) annotated with: vertex 4 id, vertex 4 properties (name = "josh", age = 32), vertex 1 out edges, vertex 3 in edges, edge 9 id, edge 9 label, edge 9 in vertex, edge 9 out vertex.]
object       property  description                            example
graph        V         the vertex iterator of the graph       $g/V
graph        E         the edge iterator of the graph         $g/E
vertex/edge  @id       the identifier of the element          $v/@id
vertex       outE      the outgoing edges of the vertex       $v/outE
vertex       inE       the incoming edges of the vertex       $v/inE
vertex       bothE     both in and out edges of the vertex    $v/bothE
edge         outV      the outgoing tail vertex of the edge   $e/outV
edge         inV       the incoming head vertex of the edge   $e/inV
edge         bothV     both in and out vertices of the edge   $e/bothV
edge         @label    the label of the edge                  $e/@label
Predefined Functions
g:assign() g:assign() g:unassign() g:id() g:key() g:add-v() g:add-e() g:remove-ve() g:idx-all() g:add-idx()
g:remove-idx() g:load() g:save() g:clear() g:close() g:keys() g:values() g:map() g:get() g:op-value()
g:list() g:dedup() g:union() g:intersect() g:difference() g:retain() g:except() g:remove() g:get() g:op-value()
g:sort() g:map() g:keys() g:values() g:rand-nat() g:rand-real() g:prob() g:cont() g:halt() g:type()
g:print() g:time() g:p() g:to-json() g:from-json() ...
There are over 70 predefined functions. See the following for a description of each.
http://wiki.github.com/tinkerpop/gremlin/core-function-library
http://wiki.github.com/tinkerpop/gremlin/gremlin-function-library
Working With Non-Graph Types
gremlin> 1.2 + 6
==>7.2
gremlin> 'this is a string'
==>this is a string
gremlin> true() or false()
==>true
gremlin> g:map('marko','lanl','peter','neotech','josh','rpi')
==>marko=lanl
==>peter=neotech
==>josh=rpi
gremlin> g:list('graphs','hockey','motorcycles',6)
==>graphs
==>hockey
==>motorcycles
==>6.0
Working With Non-Graph Types
gremlin> $m := g:map('hobbies', g:list('hockey','graphs'), 'location', g:map('state','new mexico', 'city', 'santa fe', 'zipcode', 87501), 'age', 30)
==>location={zipcode=87501.0, state=new mexico, city=santa fe}
==>age=30.0
==>hobbies=[hockey, graphs]
gremlin> $m/@age
==>30.0
gremlin> $m/@hobbies[2]
==>graphs
gremlin> $m/@location/@city
==>santa fe
Variables
• Variables in Gremlin are prefixed with a $ character.
• There is a collection of reserved variables that all begin with $_.
  - $_ is the root list of objects.
  - $_last is the last result evaluated by the evaluator.
  - $_g is the “working graph” to reduce typing with graph functions.
gremlin> $x := 1
==>1.0
gremlin> $y := 2
==>2.0
gremlin> $x + $y
==>3.0
Language Statements
Variable Assignment
gremlin> $i := 1 + 5
==>6.0
gremlin> $i
==>6.0
If/Else
gremlin> if true()
           $i := 1
         else
           $i := 2
         end
==>1.0
Repeat
gremlin> $i := 0
==>0.0
gremlin> repeat 10
           $i := $i + 1
         end
==>10.0
While
gremlin> $i := 'g'
==>g
gremlin> while not(matches($i, 'ggg'))
           $i := concat($i,'g')
         end
==>ggg
Language Statements
Foreach
gremlin> $i := 0
==>0.0
gremlin> foreach $j in 1 | 2 | 3
           $i := $i + $j
         end
==>6.0
Function
gremlin> func ex:hello($name)
           concat('hello ', $name)
         end
gremlin> ex:hello('pavel')
==>hello pavel
Path
gremlin> path friend_name
           ./outE[@label='knows']/inV/@name
         end
gremlin> ./friend_name
==>vadas
==>josh
You can define functions and paths in native Gremlin (as demonstrated above) or in Java.
XPath Filters
• Use [ ] filters to filter objects in a path expression (i.e. “such that” or “where”).
• The evaluated result of [ ] must be a number or boolean.
  - If it is a number, it is treated as the position within an array (i.e. list).
  - If it is a boolean, it is treated as whether to include or exclude the object from the next path in the sequence.
gremlin> ./outE[@label='knows']
==>e[7][1-knows->2]
==>e[8][1-knows->4]
gremlin> ./outE[@label='knows' and @weight>0.5]/inV[@age<21 or @name='josh'][true()][1]
==>v[4]
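The two filter behaviors (number as position, boolean as include/exclude) can be imitated in a few lines of Python. This is a simplified sketch, not the XPath evaluation model: apply_filter is a hypothetical helper, a constant stands in for an already-evaluated filter result, and a callable stands in for a per-object boolean test.

```python
VERTICES = {2: {"name": "vadas", "age": 27}, 4: {"name": "josh", "age": 32}}
# Each edge is (edge id, tail vertex, label, head vertex, weight).
EDGES = [(7, 1, "knows", 2, 0.5), (8, 1, "knows", 4, 1.0)]


def apply_filter(items, condition):
    # A constant boolean keeps everything or nothing (like [true()]).
    if isinstance(condition, bool):
        return list(items) if condition else []
    # A number is a 1-based position within the current list (like [1]).
    if isinstance(condition, (int, float)):
        position = int(condition)
        return [items[position - 1]] if 1 <= position <= len(items) else []
    # Otherwise the condition is a per-object boolean test.
    return [item for item in items if condition(item)]


# ./outE[@label='knows' and @weight>0.5]
edges = apply_filter(EDGES, lambda e: e[2] == "knows" and e[4] > 0.5)
# /inV[@age<21 or @name='josh']
heads = apply_filter(
    [e[3] for e in edges],
    lambda v: VERTICES[v]["age"] < 21 or VERTICES[v]["name"] == "josh",
)
# [true()][1]
heads = apply_filter(heads, True)
result = apply_filter(heads, 1)
print(result)
```

The boolean check comes before the numeric check because Python treats True and False as integers; without that ordering, [true()] would be read as position 1.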
A Grateful Dead Dataset
2,500 concerts
35,000 songs played
600 songs
30 years
11 members
1 band
... the Grateful Dead.
A Grateful Dead Dataset
• vertices denote songs and artists
  - type: “song” or “artist”.
  - name: name of song or artist.
  - performances: number of times song was played in concert.
  - song_type: whether the song was a “cover” or “original”.
• edges denote followed_by, sung_by, written_by
  - weight: number of times a song was followed by another song over all concerts played.
Rodriguez, M.A., Gintautas, V., Pepe, A., “A Grateful Dead Analysis: The Relationship Between Concert and Listening Behavior,” First Monday, 14(1), University of Illinois at Chicago Library, http://arxiv.org/abs/0807.2466, January 2009.
NOTE: A portion of the raw dataset courtesy of Mark Leone http://www.cs.cmu.edu/~mleone/gdead/setlists.html
A Grateful Dead Dataset
Stanley Theater
Pittsburgh, PA (11/30/79)
2nd Set
-------------------
Scarlet Begonias
Fire on the Mountain
Passenger
Terrapin Station
...
[Figure: the set list modeled as a graph. Song vertices 1 (name="Scarlet.."), 2 (name="Fire on.."), 3 (name="Pass.."), and 4 (name="Terrap..") are linked by followed_by edges (weights shown: 239, 1, 2); artist vertices 5 (name="Garcia"), 6 (name="Lesh"), and 7 (name="Hunter") are reached by sung_by and written_by edges.]
A Grateful Dead Dataset – Load Data/Basic Stats
gremlin> g:load('data/graph-example-2.xml')
==>true
gremlin> count($_g/V)
==>809.0
gremlin> count($_g/E)
==>8049.0
A Grateful Dead Dataset – Out-Degree of Each Vertex
gremlin> $degrees := g:map()
gremlin> foreach $v in $_g/V
           $degrees[@name=$v/@name] := count($v/outE)
         end
A Grateful Dead Dataset – Out-Degree of Each Vertex
gremlin> g:sort($degrees, 'value', true())
==>PLAYING IN THE BAND=96.0
==>SUGAR MAGNOLIA=92.0
==>PROMISED LAND=89.0
==>GOOD LOVING=87.0
==>NOT FADE AWAY=86.0
==>I KNOW YOU RIDER=85.0
==>CASSIDY=83.0
==>DEAL=82.0
==>JACK STRAW=81.0
==>ONE MORE SATURDAY NIGHT=81.0
==>EL PASO=80.0
==>MEXICALI BLUES=79.0
...
A Grateful Dead Dataset – Inspecting Single Vertex
gremlin> $v := g:key('name','CHINA DOLL')[1]
==>v[129]
gremlin> g:map($v)
==>name=CHINA DOLL
==>song_type=original
==>performances=114
==>type=song
gremlin> $v/outE[@label='sung_by']/inV/@name
==>Garcia
A Grateful Dead Dataset – Inspecting Single Vertex
gremlin> $v/outE[@label='followed_by']/inV/@name
==>BIG RIVER
==>THROWING STONES
==>SAMSON AND DELILAH
==>TRUCKING
==>CASEY JONES
==>HIGH TIME
...
gremlin> $v/outE[@label='followed_by']/@weight
==>2
==>8
==>1
==>2
==>1
==>1
...
Introduction to PageRank
• The remainder of this section will discuss the PageRank algorithm and its application to multi-relational graphs.
• The arguments made and the examples presented generalize to all other single-relational graph algorithms. However, for the sake of brevity and consistency, only PageRank will be discussed.
Introduction to Matrix-Based PageRank
• PageRank is a centrality measure based on the primary eigenvector of a modified version of a graph. Let \(A \in \mathbb{R}_+^{|V| \times |V|}\) denote the adjacency matrix representing the graph.
• In order to ensure positive real values in the eigenvector, the graph must be strongly connected. PageRank induces strong connectivity by overlaying a low probability (defined by \(\alpha \in [0, 1]\), usually 0.15) “teleportation” graph over the original graph. Let \(B \in \frac{1}{|V|}^{|V| \times |V|}\) (every entry equals \(1/|V|\)) denote a teleportation adjacency matrix where every vertex is connected to every vertex with equal probability.
  - \(C = (1 - \alpha) A + \alpha B\), where \(C \in \mathbb{R}_+^{|V| \times |V|}\)
  - \(\lambda = \lambda C\), where \(\lambda \in \mathbb{R}_+^{|V|}\) is the PageRank vector over V.
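The fixed point λ = λC can be approximated by repeatedly multiplying a row vector by C (power iteration). The following is a minimal pure-Python sketch under two assumptions that the slide does not state: the rows of A are normalized to sum to 1, and a vertex with no outgoing edges is treated as linking to every vertex. The function name pagerank and the three-vertex example are hypothetical.

```python
def pagerank(adjacency, alpha=0.15, iterations=100):
    n = len(adjacency)
    # Assumption: normalize each row of A so that it is a probability
    # distribution; rows with no out-edges become uniform.
    a = []
    for row in adjacency:
        total = sum(row)
        a.append([x / total for x in row] if total > 0 else [1.0 / n] * n)
    # C = (1 - alpha) A + alpha B, where every entry of B is 1/|V|.
    c = [[(1 - alpha) * a[i][j] + alpha / n for j in range(n)]
         for i in range(n)]
    # Start from the uniform vector and iterate lambda <- lambda C.
    lam = [1.0 / n] * n
    for _ in range(iterations):
        lam = [sum(lam[i] * c[i][j] for i in range(n)) for j in range(n)]
    return lam


# Hypothetical graph: 0 -> 2, 1 -> 2, 2 -> 0.
ranks = pagerank([[0, 0, 1],
                  [0, 0, 1],
                  [1, 0, 0]])
print(ranks)
```

Vertex 2 receives edges from both other vertices and ends up with the largest score, while vertex 1, which nothing points to, keeps only the share that teleportation gives it.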
Introduction to Random Walk-Based PageRank
• PageRank can be implemented by a random walk.
• Create a vertex counter map, m : V → N+.
• Place a walker on a random vertex in V. Denote the walker's current vertex i ∈ V.
  1. increment the vertex counter by 1 (i.e. m(i) ← m(i) + 1).
  2. the walker chooses a random adjacent vertex with probability 1 − α.
  3. the walker chooses a random vertex in V with probability α (the teleportation probability, usually 0.15).
  4. rinse and repeat until m reaches a stationary probability distribution (continually normalize m if you want a probability distribution).
• We will use this random walk model in the Gremlin examples to follow.
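The walker procedure can be sketched in Python as follows. It assumes that α is the teleportation probability (as in the matrix formulation, where α is usually 0.15) and that a walker on a vertex with no outgoing edges always teleports; the function name, the fixed seed, and the three-vertex graph are hypothetical choices made for illustration.

```python
import random


def random_walk_pagerank(out_neighbors, steps=100000, alpha=0.15, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    vertices = list(out_neighbors)
    counts = {v: 0 for v in vertices}
    current = rng.choice(vertices)
    for _ in range(steps):
        # Step 1: increment the counter of the current vertex.
        counts[current] += 1
        neighbors = out_neighbors[current]
        if neighbors and rng.random() >= alpha:
            # Step 2: follow a random outgoing edge.
            current = rng.choice(neighbors)
        else:
            # Step 3: teleport to a random vertex (also used at dead ends).
            current = rng.choice(vertices)
    # Step 4: normalize the counters into a probability distribution.
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


# Hypothetical graph: a -> c, b -> c, c -> a.
ranks = random_walk_pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(ranks)
```

With enough steps the normalized counters approach the eigenvector obtained from the matrix formulation on the same graph.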
PageRank over Multi-Relational Graphs
• PageRank was designed for single-relational graphs (i.e. where all edges have the same meaning).
• In a multi-relational graph, what does it mean to find the centrality of a vertex when vertices can be related by various types of edges? For example, if there exists “socializes with” and “met once”, then the person who “met once” many people could be the most centrally located in the graph. Also, what if your graph has more than just “person”-type vertices (e.g. cars, pets, buildings, articles, etc.) and “person”-type edges (e.g. owns, walks, livesAt, cites, etc.)?
PageRank over Multi-Relational Graphs
• Calculating single-relational PageRank would yield Person as the most central vertex.
• You can boolean filter certain edge labels (e.g. ignore type edges; in such cases, you would have the centrality scores over the knows social graph).
• However, what if you wanted to traverse knows edges if and only if the adjacent vertex knows more than 10 other people?
• In the end, you want complete control (universal computability) over the paths that the traverser/walker can take through a graph.
[Figure: a Person vertex receiving type edges from the vertices Herbert, Johan, Marko, Josh, Jen, ...; those person vertices are connected to one another by knows edges.]
PageRank over Multi-Relational Graphs
• In multi-relational graphs, the meaning of your graph algorithm's results is defined by your definition of adjacency.
• With respect to random walk-based PageRank, define the path that the walker should take. That path is the definition of adjacency.
• The stationary probability distribution created from this walk yields a path-dependent centrality.
• Thus, in a multi-relational graph, there are many types of PageRanks that can be calculated, one for each type of path defined for a walker.
Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks”, Knowledge-Based Systems, 21(7), 727–739, http://arxiv.org/abs/0803.4355, October 2008.
PageRank over “Garcia Followed By” SubGraph
• Define a path that will go from song-to-song by “followed by” edges and only traverse songs that are “sung by” Jerry Garcia.
(./outE[@label='followed_by']/inV/outE[@label='sung_by']/inV[@name='Garcia']/../..)[g:rand-nat()]
[Figure: the path evaluated step by step. From the current song (.), followed_by edges lead to the next songs; sung_by edges lead to their singers (name="Garcia", name="Garcia", name="Weir"); /../.. steps back to the songs sung by Garcia; g:rand-nat() selects one of them.]
PageRank over “Garcia Followed By” SubGraph
path garcia-followed_by
  (./outE[@label='followed_by']/inV/outE[@label='sung_by']/inV[@name='Garcia']/../..)[g:rand-nat()]
end
$m := g:map()
$alpha := 0.15
$_ := g:key('type', 'song')[g:rand-nat()]
repeat 2500
  $_ := ./garcia-followed_by
  if count($_) > 0
    g:op-value('+',$m,$_[1]/@name, 1.0)
  end
  if g:rand-real() < $alpha or count($_) = 0
    $_ := g:key('type', 'song')[g:rand-nat()]
  end
end
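For readers who do not have a Gremlin console at hand, the loop above can be mirrored in Python with the path supplied as an ordinary function. This is a sketch over an invented four-song graph: the song names A to D, their singers, the followed_by lists, and the function names are all hypothetical, and only the control flow follows the script.

```python
import random

# Hypothetical data: song -> singer, and song -> songs that followed it.
SUNG_BY = {"A": "Garcia", "B": "Garcia", "C": "Weir", "D": "Garcia"}
FOLLOWED_BY = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["A"]}


def garcia_followed_by(song):
    # Songs that followed the given song and are sung by Garcia.
    return [s for s in FOLLOWED_BY.get(song, []) if SUNG_BY[s] == "Garcia"]


def path_pagerank(path, vertices, steps=2500, alpha=0.15, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = {}
    current = rng.choice(vertices)
    for _ in range(steps):
        candidates = path(current)
        if candidates:
            # Take one random result of the path and count the visit.
            current = rng.choice(candidates)
            counts[current] = counts.get(current, 0) + 1
        if not candidates or rng.random() < alpha:
            # Teleport to a random song (not counted, as in the script).
            current = rng.choice(vertices)
    return counts


counts = path_pagerank(garcia_followed_by, list(SUNG_BY))
print(sorted(counts.items(), key=lambda kv: kv[1], reverse=True))
```

Swapping in a different path function changes the definition of adjacency, and with it the ranking, without touching the walker.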
PageRank over “Garcia Followed By” SubGraph
gremlin> g:sort($m,'value',true())
==>CRAZY FINGERS=98.0
==>HES GONE=85.0
==>CHINA CAT SUNFLOWER=79.0
==>BERTHA=76.0
==>UNCLE JOHNS BAND=74.0
==>TERRAPIN STATION=72.0
==>GOING DOWN THE ROAD FEELING BAD=71.0
==>WHARF RAT=71.0
==>EYES OF THE WORLD=65.0
==>COLD RAIN AND SNOW=62.0
==>SHIP OF FOOLS=58.0
==>RAMBLE ON ROSE=53.0
==>CASEY JONES=51.0
==>DARK STAR=47.0
==>DEAL=46.0
...
Universal Computation in Paths
path path-name
  # any arbitrary computation can occur here
end
• A path definition can be used to define adjacencies.
  - adjacency can be expressed as anything that can be computed by a Turing machine.
  - path definitions are used to create “semantically meaningful” results from single-relational graph algorithms applied to multi-relational graphs.
  - path definitions make explicit what is implicit in the structure of the graph. This has applications to knowledge-based reasoning.
• A path definition can perform any arbitrary computation.
  - path definitions can check/set vertex/edge properties.
  - path definitions can create new vertices and edges.
  - path definitions can call/define functions.
This allows fine-grained control over how your traverser/walker moves through a graph.
The Current Gremlin EcoSystems
• Webling: Web console for Gremlin (developed by Pavel Yaskevich w/ funding from Neo Technology)
• Project Gargamel: Distributed Graph Computing (uses Linked Process and Gremlin)
• ReXster: A Graph-Based Recommender Engine
Thank You
Please enjoy Gremlin at http://gremlin.tinkerpop.com ...
My homepage is http://markorodriguez.com.
Please feel free to contact me with any questions or comments.