Graph Databases in Python (PyCon Canada 2012)
DESCRIPTION
Since the NoSQL concept burst onto the market, graph databases have traditionally been designed to be used with Java or C. With some honorable exceptions, there isn't an easy way to manage graph databases from Python. In this talk, I will introduce you to some of the tools that you can use today to work with these challenging new databases from our favorite language, Python.
TRANSCRIPT
GRAPH DATABASES IN PYTHON
Javier de la Rosa @versae
The CulturePlex Lab
Western University, London, ON
PyCon Canada 2012
Graph Databases in Python, Javier de la Rosa, PyCon Canada, 2012 2
WHO I AM
● Javier de la Rosa
● versae
● Computer Scientist and Humanist
● CulturePlex Lab
FIRST OF ALL
“You do not really understand something unless you can explain it to your
grandmother”
– (Frequently attributed to) Richard Feynman
DATABASES (in the last 30 years)
● Data in tables, rows and columns
● Pretty basic mechanism to make connections:
– Primary keys, Foreign keys, and... that's all
● Relational, ahem, really?
DATABASES (in the last 30 years)
● Rigid data schemas
– Have you ever tried to make a schema migration?
● Relational Algebra and SQL
– Terrible for highly interconnected data
– JOINs can take a lifetime to finish (a bit overdramatized)
NoSQL, Not Only SQL
● Document
– MongoDB, CouchDB, etc.
● Key-value stores
– Redis, Riak, Voldemort, Dynamo, etc.
● Big Tables
– Cassandra, HBase, etc.
● Analytic
– Hadoop
● Graph
– Neo4j, OrientDB, HyperGraphDB, Titan, etc.
● Other
– Objectivity/DB, ZODB, etc.
DATABASES LANDSCAPE
Source: 451Research, https://451research.com/report-long?icid=2289
WHO IS USING GRAPHS?
● Mozilla with Pancake and Pacer
– https://wiki.mozilla.org/Pancake & http://pangloss.github.com/pacer/
● Twitter with FlockDB
– https://github.com/twitter/flockdb
● Facebook with Open Graph
– https://developers.facebook.com/docs/opengraph/
● Google with Knowledge Graph
– http://www.google.ca/insidesearch/.../knowledge.html
WHY GRAPHS?
● Data is getting more and more connected
– From text documents, to wikis, to ontologies, to folksonomies, etc
● And more semi-structured
– Think about the decentralization of content generation
● And more complex
– Social networks, semantic trending, etc.
Source: Neo Technology, http://www.slideshare.net/emileifrem/neo4j-the-benefits-of-graph-databases-oscon-2009
A FEW OF THE CURRENT USES
● Social Networking and Recommendations
● Network and Cloud Management
● Master Data Management
● Geospatial
● Bioinformatics
● Content Management and Security and Access Control
Source: Mashable, http://mashable.com/2012/09/26/graph-databases/
AND WHY ELSE?
● Because graphs are cool!
Leonhard Euler
WHAT IS A GRAPH?
● G = (V, E)
Where
– G is a graph
– V is a set of vertices
– E is a set of edges
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_(mathematics)
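As a rough illustration of the definition above, a graph can be written down in Python as a set of vertices plus a set of edges. The vertex names and the `neighbors` helper are made up for this sketch and do not come from any library.

```python
# A graph G = (V, E): a set of vertices and a set of edges.
V = {"a", "b", "c", "d"}
# Each undirected edge is a frozenset holding its two endpoints.
E = {frozenset(("a", "b")), frozenset(("b", "c")), frozenset(("c", "a"))}
G = (V, E)


def neighbors(graph, vertex):
    """Return the vertices that share an edge with `vertex`."""
    _, edges = graph
    return {other for edge in edges if vertex in edge for other in edge - {vertex}}


print(sorted(neighbors(G, "a")))  # ['b', 'c']
print(sorted(neighbors(G, "d")))  # [] because "d" is an isolated vertex
```

Using `frozenset` for edges makes the pair unordered, which is one simple way to model an undirected graph; a directed graph could use tuples instead.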
WHAT IS A GRAPH?
● G = (V, E)
– Graph, aka network, diagram, etc.
– Vertex, aka point, dot, node, element, etc.
– Edge, aka relationship, arc, line, link, etc.
● Basically, “a graph states that something is related to something else”
– Svetlana Sicular, Research Director at Gartner
Source: Gartner, http://blogs.gartner.com/svetlana-sicular/think-graph/
TYPES OF GRAPH
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_(mathematics)
[Figure: Undirected graph, Digraph]
TYPES OF GRAPH
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_(mathematics)
[Figure: Multigraph, Hypergraph]
SOME GRAPHS EVEN HAVE A NAME
● Complete graphs
Source: Wikipedia, http://en.wikipedia.org/wiki/Gallery_of_named_graphs
[Figure: the complete graphs K3, K5 and K8]
SOME GRAPHS EVEN HAVE A NAME
● Stars
Source: Wikipedia, http://en.wikipedia.org/wiki/Gallery_of_named_graphs
The star graphs S3, S4, S5 and S6
SOME GRAPHS EVEN HAVE A NAME
● Snarks
Source: Wikipedia, http://en.wikipedia.org/wiki/Gallery_of_named_graphs
[Figure: the Blanuša (second), Double star and Szekeres snarks]
THINGS CAN COMPLICATE...
Source: Wikipedia, http://en.wikipedia.org/wiki/Gallery_of_named_graphs
Local McLaughlin graph
WAIT A SEC,
DON'T WORRY
● Just one more type: the Property Graph
[Figure: example graph with numbered elements]
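THE PROPERTY GRAPH
● Directed, attributed and multi-relational
[Figure: property graph with numbered elements. Node properties: "Name: John", "Name: Javi", "Name: David", and "Title: The Art of Computer Programming, Price: $135". Edge labels: "Knows" (Since: 2009), "Knows" (Since: 1990), "Likes", "Likes"]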
THE PROPERTY GRAPH
● A set of nodes, and each node has:
– A unique identifier.
– A set of outgoing edges.
– A set of incoming edges.
– A collection of properties defined by a map from key to value.
● A set of relationships, and each relationship has:
– A unique identifier.
– An outgoing tail vertex.
– An incoming head vertex.
– And a collection of properties defined by a map from key to value.
Source: TinkerPop, https://github.com/tinkerpop/gremlin/wiki/Defining-a-Property-Graph
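The definition above maps fairly directly onto two small Python classes. This is only a sketch: the class names, the `connect` helper and the sample data (who knows whom, and since when) are assumptions made for illustration, not the TinkerPop API.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int                                          # unique identifier
    properties: dict = field(default_factory=dict)   # map from key to value
    out_edges: list = field(default_factory=list)    # outgoing edges
    in_edges: list = field(default_factory=list)     # incoming edges


@dataclass
class Edge:
    id: int            # unique identifier
    label: str         # type of the relationship
    tail: Vertex       # vertex the edge leaves from
    head: Vertex       # vertex the edge points to
    properties: dict = field(default_factory=dict)


def connect(edge_id, tail, label, head, **properties):
    """Create an edge and register it on both of its endpoints."""
    edge = Edge(edge_id, label, tail, head, properties)
    tail.out_edges.append(edge)
    head.in_edges.append(edge)
    return edge


javi = Vertex(1, {"name": "Javi"})
john = Vertex(2, {"name": "John"})
knows = connect(1, javi, "knows", john, since=2009)

print(knows.label, knows.properties)                          # knows {'since': 2009}
print([e.head.properties["name"] for e in javi.out_edges])    # ['John']
```

Because vertices and edges reference each other, the example prints selected fields rather than whole objects.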
IN SHORT
● A Property Graph is composed of:
– A set of nodes
– A set of relationships
– Properties and ids on both
● Sometimes, nodes and relationships can be typed
– In Blueprints and Neo4j, a label denotes the type of relationship between its two nodes.
GRAPH DATABASES
● A graph database uses graph structures with nodes, edges, and properties to represent and store data
– ...but there is not an easy way to visualize this
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_database
HOW IT LOOKS IN PYTHON?
# Let's create a graph
>>> silvester = g.nodes.create(name="Silvester")
>>> arnold = g.nodes.create(name="Arnold")
>>> punch = arnold.punches(silvester)
>>> chuck = g.nodes.create(name="Chuck")
>>> chuck.dropkicks(silvester)
>>> chuck.dropkicks(arnold)
[Figure: nodes "Name: Silvester", "Name: Arnold" and "Name: Chuck"; a "punches" edge from Arnold to Silvester, and "dropkicks" edges from Chuck to Silvester and from Chuck to Arnold]
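The session above is idealized, so here is one way such an API could be made to run entirely in memory. Everything in this sketch (`Graph`, `Node`, `NodeManager`, and the trick of turning unknown attributes into relationship creators) is a hypothetical toy, not the interface of any real client library.

```python
import itertools


class Node:
    """A node whose unknown attributes act as relationship creators."""

    def __init__(self, graph, node_id, properties):
        self.graph = graph
        self.id = node_id
        self.properties = properties

    def __getattr__(self, label):
        # Called only when normal attribute lookup fails, so
        # arnold.punches(silvester) creates a "punches" edge.
        if label.startswith("_"):
            raise AttributeError(label)

        def create(target, **properties):
            return self.graph.add_edge(self, label, target, properties)

        return create


class NodeManager:
    """Provides the g.nodes.create(...) entry point."""

    def __init__(self, graph):
        self._graph = graph

    def create(self, **properties):
        return self._graph.add_node(properties)


class Graph:
    def __init__(self):
        self._ids = itertools.count(1)
        self.node_list = []
        self.edges = []          # (tail id, label, head id, properties)
        self.nodes = NodeManager(self)

    def add_node(self, properties):
        node = Node(self, next(self._ids), properties)
        self.node_list.append(node)
        return node

    def add_edge(self, tail, label, head, properties):
        edge = (tail.id, label, head.id, properties)
        self.edges.append(edge)
        return edge


g = Graph()
silvester = g.nodes.create(name="Silvester")
arnold = g.nodes.create(name="Arnold")
chuck = g.nodes.create(name="Chuck")
punch = arnold.punches(silvester)
chuck.dropkicks(silvester)
chuck.dropkicks(arnold)

labels = [(tail, label, head) for tail, label, head, _ in g.edges]
print(labels)  # [(2, 'punches', 1), (3, 'dropkicks', 1), (3, 'dropkicks', 2)]
```

Real clients generally ask for the relationship type explicitly; the attribute trick is used here only to mirror the slides.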
GRAPH DATABASES LANDSCAPE
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_database
Database       Data Model        Query Method                License           Python Binding
Neo4j          Property Graph    Cypher, Gremlin, Traversal  GPL, AGPL         Native, Blueprints, REST
OrientDB       Property Graph    Gremlin, Traversal          Apache 2          Blueprints
HyperGraphDB   Typed Hypergraph  HGQuery, Traversal          LGPL              Nope
DEX            Property Graph    Traversal                   Commercial        Blueprints
Titan          Property Graph    Gremlin                     Apache 2          Blueprints
InfoGrid       Property Graph    Traversal                   AGPL, Commercial  Nope
InfiniteGraph  Property Graph    Gremlin                     Commercial        Nope
GRAPH DATABASES LANDSCAPE
And more:
– AffinityDB
– YarcData uRiKA
– Apache Giraph
– Cassovary
– StigDB
– NuvolaBase
– Pegasus
– Microsoft Trinity
– Sherlock
– And so on
GREMLIN, BLUEPRINTS, WAT?
Let me introduce you to the TinkerPop stack
Source: TinkerPop, http://www.tinkerpop.com/
BLUEPRINTS AND REXSTER
● Blueprints is a property graph model interface
● Rexster is a server that exposes any Blueprints graph through REST
Source: TinkerPop, http://www.tinkerpop.com/
AND WHAT ABOUT PYTHON?
● Options to connect to a Blueprints Graph Database
[Diagram: Python clients bulbflow, python-blueprints and pyblueprints; Rexster (REST) and the Blueprints API; databases Neo4j, OrientDB, Titan and DEX]
BULBFLOW
● Create
>>> alice = g.vertices.create(name="Alice")
>>> bob = g.vertices.create(name="Bob")
>>> g.edges.create(alice, "knows", bob)
● Get
>>> alice = g.vertices.get(1)
>>> bob = g.vertices.get(2)
● Update
>>> alice.age = 21
>>> alice.save()
● Delete
>>> alice.delete()
Source: Bulbflow, http://bulbflow.com/docs/
PYBLUEPRINTS
● Create
>>> alice = g.addVertex()
>>> alice.setProperty("name", "Alice")
>>> bob = g.addVertex()
>>> bob.setProperty("name", "Bob")
>>> g.addEdge(alice, bob, "knows")
● Get
>>> alice = g.getVertex(1)
>>> bob = g.getVertex(2)
● Update
>>> alice.setProperty("age", 21)
● Delete
>>> g.removeVertex(alice.getId())
Source: PyBlueprints, https://github.com/escalant3/pyblueprints
BUT NEO4J HAS ITS OWN CLIENTS!
● REST Clients for Neo4j
[Diagram: Python clients bulbflow, python-blueprints and pyblueprints; Rexster (REST) and the Blueprints API; databases Neo4j, OrientDB, Titan and DEX; plus the Neo4j REST clients neo4j-rest-client and py2neo]
HOW CAN I LOOK UP?
● An index is a data structure that supports the fast lookup of elements by some key/value pair
Source: TinkerPop, https://github.com/tinkerpop/blueprints/wiki/Graph-Indices
INDICES
● In Python bindings, they are similar to dict
– bulbflow
# bulbflow creates auto indices to make basic lookups easier
>>> nodes = g.vertices.index.lookup(name="Alice")
>>> for node in nodes:
...     print(node)
– PyBlueprints
>>> index = g.getIndex("names", "vertex")
>>> index.put("name", alice.getProperty("name"), alice)
>>> nodes = index.get("name", "Alice")
>>> for node in nodes:
...     print(node)
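The dict analogy can be made concrete with a toy in-memory index. The `Index` class below only imitates the `put`/`get` shape of the PyBlueprints calls shown above; it is an assumption-laden sketch, not how any binding is implemented.

```python
from collections import defaultdict


class Index:
    """Toy index: maps a (key, value) pair to the elements stored under it."""

    def __init__(self, name):
        self.name = name
        self._entries = defaultdict(list)

    def put(self, key, value, element):
        self._entries[(key, value)].append(element)

    def get(self, key, value):
        # Return a copy so callers cannot mutate the index by accident.
        return list(self._entries.get((key, value), []))


alice = {"id": 1, "name": "Alice"}
bob = {"id": 2, "name": "Bob"}

names = Index("names")
for person in (alice, bob):
    names.put("name", person["name"], person)

print(names.get("name", "Alice"))  # [{'id': 1, 'name': 'Alice'}]
print(names.get("name", "Carol"))  # []
```

A lookup is a single dictionary access, which is why an index is so much cheaper than scanning every node.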
INDICES
● Some Graph Databases provide full-text queries
– bulbflow
>>> nodes = g.vertices.index.query(name="ali*")
>>> for node in nodes:
...     print(node)
– PyBlueprints
>>> index = g.getIndex("names", "vertex")
>>> nodes = index.query("name", "ali*")
>>> for node in nodes:
...     print(node)
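To get a feel for what a wildcard query such as "ali*" returns, here is a rough in-memory analogue built on the standard library's `fnmatch`. Real full-text indices have their own query syntax and text analysis rules, so the case-insensitive matching and the `query` helper here are assumptions made for the sketch.

```python
from fnmatch import fnmatchcase

people = [{"name": "Alice"}, {"name": "Alicia"}, {"name": "Bob"}]


def query(elements, key, pattern):
    """Return elements whose `key` property matches a shell-style wildcard."""
    pattern = pattern.lower()
    return [
        element
        for element in elements
        if fnmatchcase(str(element.get(key, "")).lower(), pattern)
    ]


print([p["name"] for p in query(people, "name", "ali*")])  # ['Alice', 'Alicia']
```

This version scans every element, so it only illustrates the result of such a query, not the speed a real index gives you.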
...MORE COMPLEX SEARCHES?
“Without traversals [FlockDB] is only a persisted graph. But not a graph database.”
– Alex Popescu
Source: myNoSQL, http://nosql.mypopescu.com/
LET'S TRAVERSE THE GRAPH!
● “A graph traversal is the problem of visiting all the nodes in a graph in a particular manner”
– A* search
– Alpha-beta pruning
– Breadth-First Search (BFS)
– Depth-First Search (DFS)
– Dijkstra's algorithm
– Floyd-Warshall's algorithm
– Etc.
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_traversal
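Two of the traversals listed above, BFS and DFS, fit in a few lines of Python. The adjacency dict and the names in it are invented for this sketch; a graph database would supply the neighbors instead.

```python
from collections import deque

# Adjacency lists: each node maps to the nodes it points to.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}


def bfs(graph, start):
    """Breadth-first: visit nodes level by level, using a queue."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited


def dfs(graph, start, visited=None):
    """Depth-first: follow each branch to its end before backtracking."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbor in graph[start]:
        if neighbor not in visited:
            dfs(graph, neighbor, visited)
    return visited


print(bfs(graph, "alice"))  # ['alice', 'bob', 'carol', 'dave']
print(dfs(graph, "alice"))  # ['alice', 'bob', 'dave', 'carol']
```

Both visit the same nodes; only the order differs, which is the "particular manner" in the quote.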
NEO4J TRAVERSAL API
● Python-embedded (native Neo4j Python binding)
>>> traverser = gdb.traversal()\
...     .relationships('knows').traverse(alice)
# The graph is traversed as you loop through the result
>>> for node in traverser.nodes:
...     print(node)
● neo4j-rest-client
>>> traverser = alice.traverse(types=[client.All.knows])
# The graph is traversed as you loop through the result
>>> for node in traverser:
...     print(node)
BLUEPRINTS GREMLIN
● Gremlin is a domain specific language for traversing property graphs
– Defines how to do a query based on the graph structure
>>> gremlin = g.extensions.GremlinPlugin.execute_script
>>> params = {'alice_id': alice.id}
>>> script = "g.V(alice_id).out('knows')"
>>> node = gremlin(script=script, params=params)
>>> node == bob
Source: TinkerPop Gremlin, https://github.com/tinkerpop/gremlin/wiki
Source: Marko Rodríguez, The Graph Traversal Programming Pattern, http://www.slideshare.net/slidarko/graph-windycitydb2010
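To see what a step like `out('knows')` does conceptually, here is a plain-Python imitation over an edge list. The `out` function and the sample ids are hypothetical; this is not Gremlin and says nothing about how Gremlin is implemented.

```python
# Directed edges as (tail id, label, head id).
edges = [
    (1, "knows", 2),  # alice -> bob
    (1, "likes", 3),
    (2, "knows", 3),
]


def out(edges, vertex_id, label):
    """Follow outgoing edges carrying `label`, starting from `vertex_id`."""
    return [
        head
        for tail, edge_label, head in edges
        if tail == vertex_id and edge_label == label
    ]


alice_id = 1
print(out(edges, alice_id, "knows"))  # [2]

# Steps compose: who do the people alice knows know?
friends_of_friends = [
    fof
    for friend in out(edges, alice_id, "knows")
    for fof in out(edges, friend, "knows")
]
print(friends_of_friends)  # [3]
```

Chaining steps like this is the "how" that the slide refers to: the query describes a path through the structure of the graph.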
NEO4J CYPHER QUERY LANGUAGE
● Declarative graph query language
– Expressive and efficient querying
– Focused on expressing what to retrieve from a graph
– Inspired by SQL
– Pattern matching expressions from SPARQL
Source: Wikipedia, https://en.wikipedia.org/wiki/Graph_database
[Figure: node 1 and node 2 joined by an edge named "label"]
(1) -[:label]- (2)
START n=(1), m=(2)
MATCH n-[r:label]-m
RETURN r
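Read declaratively, the query above asks for every relationship of type "label" between node 1 and node 2, in either direction. A plain-Python reading of that pattern, with made-up sample relationships and a hypothetical `match` helper, might look like this:

```python
relationships = [
    {"id": 10, "start": 1, "type": "label", "end": 2},
    {"id": 11, "start": 2, "type": "label", "end": 1},
    {"id": 12, "start": 1, "type": "other", "end": 2},
    {"id": 13, "start": 1, "type": "label", "end": 3},
]


def match(relationships, n, m, rel_type):
    """Relationships of `rel_type` between n and m, ignoring direction."""
    return [
        rel
        for rel in relationships
        if rel["type"] == rel_type and {rel["start"], rel["end"]} == {n, m}
    ]


print([rel["id"] for rel in match(relationships, 1, 2, "label")])  # [10, 11]
```

The point of a declarative language is that you write only the pattern and leave the looping to the database.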
PY2NEO CYPHER HELPERS
● Get or create elements
>>> g.get_or_create_relationships(
...     (bob, "WORKS WITH", carol, {"since": 2004}),
...     (alice, "DISLIKES!", carol, {"reason": "youth"}),
...     (bob, "WORKS WITH", dave, {"since": 2009}),
... )
● Get counts
>>> nodes_count = g.get_node_count()
>>> rels_count = g.get_relationship_count()
● Delete
>>> g.delete()
Source: py2neo, http://py2neo.org/
NEO4J-REST-CLIENT CYPHER HELPERS
● Query casting
>>> q = """start n=node(*) match n-[r:punchs]-() """ \
...     """return n, n.name, r, r.since"""
>>> results = g.query(q, returns=(Node, unicode, Relationship, int))
● Complex filtering
>>> lookups = (
...     Q("name", exact="Arnold")
...     & (Q("surname", istartswith="swar")
...        & ~Q("surname", iendswith="chenegger"))
... )
>>> arnolds = g.nodes.filter(lookups)
Source: neo4j-rest-client, https://github.com/versae/neo4j-rest-client
LET'S PLAY!
● Deploy Neo4j in Heroku or Amazon
● Use one of the available clients
NEO4J HEROKU ADD-ON
● Create a Heroku app and add the Neo4j add-on
$ heroku apps:create pyconca
$ heroku addons:add neo4j --app pyconca
$ xdg-open `heroku config:get NEO4J_URL --app pyconca`
$ export NEO4J_URL=`heroku config:get NEO4J_URL --app pyconca`
● Create a virtualenv with neo4j-rest-client
$ mkvirtualenv --no-site-packages pyconca
$ workon pyconca
$ pip install ipython neo4jrestclient
$ ipython
NEO4J HEROKU ADD-ON
● Run IPython and that's it!
>>> import os
>>> NEO4J_URL = os.environ["NEO4J_URL"]
>>> from neo4jrestclient import client
>>> gdb = client.GraphDatabase(NEO4J_URL + "/db/data")
>>> gdb.url
THANKS!
Questions?
Javier de la Rosa @versae
The CulturePlex Lab
Western University, London, ON
PyCon Canada 2012
APPENDIX: DATA MODELS
● neo4django
– https://github.com/scholrly/neo4django
● neomodel
– https://github.com/robinedwards/neomodel
● bulbflow models
– http://bulbflow.com/quickstart/#models
APPENDIX: VISUALIZE YOUR GRAPH
● Export somehow to .gexf for Gephi
– http://gephi.org/
● Use D3.js
– http://d3js.org/
● Use sigma.js
– http://sigmajs.org/
● Take a look at Max De Marzi's work
– http://maxdemarzi.com/category/visualization/
● Use Sylva (for newbies)
– http://www.sylvadb.com/
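For the "export somehow to .gexf" option, one possible route is to write the XML by hand with the standard library. This sketch assumes the basic GEXF 1.2 draft layout (a `gexf` root, one `graph`, then `nodes` and `edges`); the `to_gexf` helper and the sample data are made up, and the output should be checked against the Gephi documentation before relying on it.

```python
import xml.etree.ElementTree as ET


def to_gexf(nodes, edges):
    """Build a minimal GEXF document as a string.

    nodes: dict mapping node id -> label
    edges: list of (source id, target id) pairs
    """
    root = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(root, "graph", defaultedgetype="directed")

    nodes_el = ET.SubElement(graph, "nodes")
    for node_id, label in nodes.items():
        ET.SubElement(nodes_el, "node", id=str(node_id), label=label)

    edges_el = ET.SubElement(graph, "edges")
    for edge_id, (source, target) in enumerate(edges):
        ET.SubElement(
            edges_el, "edge", id=str(edge_id), source=str(source), target=str(target)
        )

    return ET.tostring(root, encoding="unicode")


xml_text = to_gexf({1: "Arnold", 2: "Silvester"}, [(1, 2)])
print(xml_text)
```

Writing `xml_text` to a file with a `.gexf` extension should give something Gephi can open, under the assumptions above.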