a quick review of python and graph databases
Post on 16-Aug-2015
266 Views
Preview:
TRANSCRIPT
Who am I?◦ Consultant at Deloitte Melbourne
in Enterprise Information Management
◦ Recent graduate of Flinders University in Adelaide
◦ Casual/Enthusiast reviewer of Graph Databases
What is a graph?“A set of objects connected by links” – Wikipedia
Objects: Vertices, nodes, points
Links: Edges, arcs, lines, relationships
Prior Work on Graphs in PythonGraph Database Patterns in Python – Elizabeth Ramirez, PyCon US 2015
Practical Graph/Network Analysis Made Simple – Eric Ma, PyCon US 2015
Graphs, Networks and Python: The Power of Interconnection – Lachlan Blackhall, PyCon AU 2014
An introduction to Python and graph databases with Neo4j - Holger Spill, PyCon NZ 2014
Mogwai: Graph Databases in your App – Cody Lee, PyTexas 2014
Today: Pythonic GraphsAn exploration of graph storage in Python:
◦ API must be Pythonic◦ execute(“<Not Python>”) doesn’t count.
◦ As little configuration as possible
Caveats:
◦ No configuration means no tuning
◦ Can’t compare distributed performance on a single node
◦ Limited to rough comparisons of performance – not a lab environment!
The Simple 1) Set up a dictionary of nodes
2) Each node keeps a list of relationships (or two, if you want a directed graph)
3) Set up add and get convenience methods
Pros:• Sometimes the simplest ways are the best• Very quick
Cons:• Not consistent• Probably going to need to be
maintained• Not persistent
The (slightly less) Simple1) Set up a Shelf of nodes
2) Each node keeps a list of relationships (or two, if you want a directed graph)
3) Set up add and get convenience methods
Pros:• Still reasonably quick
Cons:• Not consistent• Probably going to need to be
maintained
Off-topic: NetworkXAll the advantages of using a dictionary with none of the custom code.
◦ Comes with graph generators
◦ BSD Licenced
◦ Loads of standard analysis algorithms
◦ 90% test coverage
◦ … no persistence (except Pickle).
The Popularity Test
DBMS
Score
Jul2015
Neo4j 31.34
OrientDB 4.46
Titan 3.89
ArangoDB 1.29
Giraph 1.03
The Incumbent: Neo4jReleased in 2007
Written in Java
GPLv3/AGPLv3 or a commercial license
Runs as a server that exposes a REST Interface
Natively uses Cypher – an in-house developed graph query language
Best established, most popular graph-database
Easy to install – unzip and run a script
High Availability, but a little difficult to scale
Neo4j from PythonPy2Neo:
◦ Built by Nigel Small from Neo4j
◦ Actively maintained
Neo4j-rest-client
◦ Javier de la Rosa from University of Western Ontario
◦ Maintained through 9 months ago
neo4jdb-python
◦ Jacob Hansson of Neo4j
◦ Maintained through 8 months ago
◦ Mostly just wrappers around Cypher
Bulbflow:◦ Built by James Thornton of Pipem/Espeed
◦ Maintained to 8 months ago
◦ Connects to multiple backends
Py2Neo: SyntaxSet up a connection:
◦ graph=Graph("http://neo4j:password@localhost:7474/db/data/")
Create a node:◦ graph.create(Node("node_label", name=node_name))
◦ Node labels are like classes
Find a node:◦ graph.find_one("node_label", property_key="name",property_value=node_name)
Create a relationship:◦ graph.create(Relationship(node1, relationship, node2))
Find a relationship:
o graph.match_one(node1, relationship, node2, bidirectional=False)
Py2Neo: Good and BadThe good:
Simple API
Well documented
Easy to connect and get started.
Cool (if preliminary) spatial support
Not so much:◦ Skinny API
◦ No transaction support for Pythonic calls
◦ Performance struggles on large inputs
◦ No ORM (kinda)
neo4j-rest-client SyntaxSet up a connection:
◦ graph=GraphDatabase("http://localhost:7474/db/data/", username="username", password="password")
Create a node:◦ node=graph.nodes.create(name=node_name)
◦ Node labels are like classes
Find a node:◦ graph.nodes.filter(Q("name", iexact=node)).elements[0]
Create a relationship:◦ relationship=node1.is_related_to(node2)
neo4j-rest-client: Good and BadTransaction support with a context manager*
Strong filtering syntax
Very strong labelling syntax – searchable tags for nodes
Lazy evaluation of queries
Still REST based – still difficult to make it perform
*Seemingly. Somewhat difficult to make it work.
Py2Neo vs Neo4j-Rest-Client: Performance100 nodes with 20% connection:
Loading:Py2Neo: ~8 secondsNeo4j-rest-client: ~5 secondsPostgres: 4s
Retrieving:Py2Neo: ~6 secondsNeo4j-rest-client: ~5 secondsPostgres: 4s
1000 nodes with 20% connection:
Loading:Py2Neo: ~7 minutesNeo4j-rest-client: ~50 minutesPostgres: 6 minutes
Retrieving:Py2Neo: ~7 minutesNeo4j-rest-client: ~50 minutesPostgres: 6 minutes
Machine:AWS Memory Optimised xLarge node (30GB RAM) on Ubuntu Server using iPython2 3.0.0
Important noteCompletely unoptimised! No indexes, no attempt to chunk, only a couple OS optimisations.
OrientDB
PyOrient:◦ Official OrientDB Driver for Python
◦ Binary Driver
◦ Not Pythonic
Released in 2011
More NoSQL than Neo and Titan (Documents as well as graphs)
Scalable across multiple servers
Supports SQL
TitanFirst released in 2012
Written in Java
Licenced under Apache Licence
Many storage backends, including Cassandra, HBase and BerkeleyDB
Hadoop integration
Large amount of search back-ends
Built for scalability
Commercially supported by DataStax (formerly Aurelius)
Titan and PythonMogwai:
◦ Written by Cody Lee of wellaware
◦ Binary Driver for RexPro Server
◦ Very pythonic!
Bulbflow:◦ Built by James Thornton of Pipem/Espeed
◦ REST-based interface
◦ Maintained to 8 months ago
◦ Connects to multiple backends
RexPro and theTinkerpop StackApache Incubator Open Source Graph Framework
◦ Built around Gremlin
◦ Written in Java
◦ Extensively documented
Mogwai Performance100 nodes with 20% connection:
Loading:14 seconds
Retrieving:18 seconds
1000 nodes with 20% connection:
Loading:~9 minutes
Retrieving:~25 minutes
So, what should I use?*
Neo4j:◦ Good, relatively quick
bindings
◦ Well supported
◦ Could be expensive
◦ May not scale
*The full title of this slide is “What should I research further to ensure it meets my specific needs and then consider using?” In any case, the answer is still “It depends”
It depends.
Titan:◦ Good bindings
◦ Support in doubt
◦ Should be cheaper
◦ Proven scalability
Orient:◦ Poor bindings
◦ Well supported
◦ Open pricing structure
◦ Should scale well
What about Python Graph Databases?Not just Python bindings –pure(ish) Python.
GrapheekDB: https://bitbucket.org/nidusfr/grapheekdb◦ Uses local memory, Kyoto Cabinet or Symas LMDB as backend
◦ Under active development
◦ Exposes client/server interface
◦ Code is Beta quality at best
◦ Documentation is very spotty
Ajgu: https://bitbucket.org/amirouche/ajgu-graphdb/
◦ Uses Berkeley Database backend
◦ Under active development
◦ “This program is alpha becarful”
◦ Python 3 only
AjguSet up a connection:
◦ graph = GraphDatabase(Storage('./BSDDB/graph'))
Create a node:◦ transaction = self.graph.transaction(sync=True) ◦ node = transaction.vertex.create(node)
Find a node:◦ transaction.vertex.label(start)
Create a relationship:◦ relationship=transaction.edge.create(node1,node2)
Take-awaysGraphs match plenty of data sets
The big three Graph Databases are Neo4j, Titan and Orient
All three have upsides and downsides – depending on the usecase.
If you want to have a bit more fun, try Ajgu or Grapheek!
Py2Neo: Performance andTransactional SupportLarge imports should be done in one transaction to decrease overhead:
Graph.create(long_list_of_nodes_and_relationships)
This kills the client (essentially hangs in string processing).
So:for chunk in izip_longest(*[iter(iterator)]*size, fillvalue=''):
try:chunk = chunk[0:chunk.index('')]
except ValueError:pass
try:self.graph.create(*chunk)
except Exception as ex:pass #chunk dividing goes here
We lose ACID at this point.
What if this fails? Have to chunk it up again to find what failed.
top related