neo4j theory and practice - tareq abedrabbo @ graphconnect london 2013
DESCRIPTION
In this talk Tareq will discuss graph solutions based on his experiences building a varied mix of graph-based systems. He will be sharing techniques and approaches that he has learned and will focus on a number of concepts that may be applied to a wider context.
TRANSCRIPT
Neo4j Theory and Practice
Tareq Abedrabbo Graph Connect - 19/11/2013
About me
• CTO/Principal Consultant at OpenCredo
• Working with Neo4j for (almost) 3 years on a number of different projects
• Co-author of Neo4j in Action (Manning)
What is this talk about?
It’s for developers designing and building applications with Neo4j
It’s not a collection of war stories, but I will refer to real-world examples
It is about sharing thoughts and lessons learnt in a useful way
“If I'm to believe Twitter, half of the earth's population are importing
Wikipedia into Neo4j, for very obscure reasons.”
Agenda
• What is Neo4j?
• Approaching graph-based applications
• Design
• Implementation
• Test
• Use cases
• Lessons learnt
What really is Neo4j?
A graph model
A query engine
A database
Neo4j is a solid foundation on which to build graph-based applications
How should I approach graph-based applications?
Is there a useful way to categorise graph-based
applications?
Domain-centric applications
Data-centric applications
Domain-Centric
• Well-defined data model
• Data changes through user interactions
• Flexible but predictable data structure(s)
• Recommendation engines, social networks, etc…
• Top-down design
Data-Centric
• Complex connected data that typically models real-world networks
• Integrated from a variety of different sources
• Data can be unpredictable
• Telco networks, utility networks, etc…
• Bottom-up design
Typically, applications fall somewhere between these two types
How can I use the information available in
my graph?
• Search and pattern-matching
• Find a recommendation based on behaviour
• Graph algorithms
• Shortest path, disconnected components
• Optimisation
• Maximise oil flow while minimising water
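As a rough illustration of the two graph algorithms named above, here is a minimal sketch in plain Python. It is not Neo4j code: the adjacency dictionary `GRAPH` and the function names are assumptions made up for this example, and the graph is treated as undirected by listing every edge in both directions.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return one shortest path as a node list, or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def connected_components(graph):
    """Group nodes into sets that are mutually reachable."""
    seen = set()
    components = []
    for node in graph:
        if node in seen:
            continue
        component = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph.get(current, ()))
        seen |= component
        components.append(component)
    return components


# Hypothetical toy graph: one square (a, b, c, d) and one separate pair (x, y).
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c"],
    "x": ["y"],
    "y": ["x"],
}

print(shortest_path(GRAPH, "a", "d"))
print(connected_components(GRAPH))
```

More than one component in the result means the graph has disconnected parts, which is often the interesting finding in a network model.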
Graphs are naturally data-driven
Use case 1: Network Impact Analysis
Requirement: Identify the impact of failing
components
Requirement: Identify interesting patterns, such as single points of failure
Labelled property graph is a natural fit for the
model
Additional “dimensions” can be added to capture abstract concepts: network redundancy, load-balancing
Cypher queries are a natural solution to delivering
the different requirements
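To make the two requirements concrete, the following is a small sketch in plain Python rather than Cypher. Everything in it is an assumption for illustration: the `NETWORK` layout, the idea of a "feeds" direction from a source to its consumers, and the function names. A component counts as a single point of failure here if removing it cuts off at least one other component.

```python
def reachable(feeds, sources, failed=frozenset()):
    """Return every node that can still be reached from the sources."""
    seen = set()
    stack = [s for s in sources if s not in failed]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(n for n in feeds.get(node, ()) if n not in failed)
    return seen


def impact_of_failure(feeds, sources, component):
    """Nodes that lose their connection when `component` fails."""
    healthy = reachable(feeds, sources)
    degraded = reachable(feeds, sources, failed={component})
    return healthy - degraded - {component}


def single_points_of_failure(feeds, sources):
    """Components whose failure cuts off at least one other component."""
    nodes = set(feeds) | {n for targets in feeds.values() for n in targets}
    return {n for n in nodes if impact_of_failure(feeds, sources, n)}


# Hypothetical network: two redundant switches, then a single router.
NETWORK = {
    "core": ["switch1", "switch2"],
    "switch1": ["router"],
    "switch2": ["router"],
    "router": ["server1", "server2"],
}

print(impact_of_failure(NETWORK, ["core"], "switch1"))  # redundant, no impact
print(impact_of_failure(NETWORK, ["core"], "router"))
print(single_points_of_failure(NETWORK, ["core"]))
```

The redundant switches show why the extra "dimensions" matter: a failing switch has no impact because the other one still feeds the router.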
Use case 2: Oil flow optimisation
Requirement: Identify candidate configurations
to maximise flow
Requirement: Identify the most practical and valuable adjustments to the network
Simply connected graph with complex components
Interlude: Genetic Algorithms
• Start from an initial population of candidate solutions (individuals or phenotypes), ideally random
• Attribute a score to each solution using a fitness function
• The only place with specific business knowledge
• Apply genetic operators to create a new generation
• Cross-breeding to retain best characteristics from each parent
• Mutation to maintain diversity and to avoid converging to a local optimum too quickly
• Stop when you want!
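The steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the algorithm used in the oil flow project: candidates are bit lists, the fitness function is passed in by the caller, the best half of each generation survives unchanged, and the run stops after a fixed number of generations. All names and default values are made up for the example.

```python
import random


def evolve(fitness, genome_length, population_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Return the fittest bit list found; genome_length must be at least 2."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    # Initial population of random candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Score every candidate and keep the best half as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Cross-breeding: head of one parent, tail of the other.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy fitness function (an assumption): count the genes that are switched on.
best = evolve(fitness=sum, genome_length=12)
print(best, sum(best))
```

Only the `fitness` argument knows anything about the problem, which matches the point that the fitness function is the only place with specific business knowledge.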
Is this even a use case for Neo4j?
Persist and share calculated solutions
Inspect intermediary steps
Use Cypher queries to interrogate solutions
Lessons learnt
Understand your domain
• Don’t follow “best practices” blindly
• For domain-centric applications you can use a mapping framework, such as Spring Data Neo4j
• For data-centric applications, you should stay as close as possible to the graph model
• In any case, don’t try to hide the graph!
Use Cypher!
• Expressive
• Readable
• Maintainable
• Performant
• Cypher + the web console is the quickest way to experiment and to prototype solutions
Manage complexity with domain knowledge
• Graph algorithms are typically complex
• Knowledge of the domain can simplify queries and traversals
• Make Cypher queries as specific as possible
• Take “shortcuts” when you know the domain
Write robust and flexible code
• Break down problems into small queries. Return graph resources (or ids) to chain queries.
• Robustness principle: “Be conservative in what you do, be liberal in what you accept from others”
• Use assertions as preconditions
• Assertions document intent
• Fail fast if data doesn’t match
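A minimal sketch of these points, using an in-memory stand-in instead of a real Neo4j database. The `NODES` and `EDGES` data, the `HOSTS` relationship name and both function names are assumptions for illustration. Each step is a small lookup that returns ids, so the steps can be chained, and each step asserts what it expects before doing any work.

```python
# Hypothetical stand-in data: node id -> properties, plus typed edges.
NODES = {
    1: {"kind": "site", "name": "London"},
    2: {"kind": "router", "name": "r1"},
    3: {"kind": "router", "name": "r2"},
}
EDGES = [(1, "HOSTS", 2), (1, "HOSTS", 3)]


def find_site(name):
    """First small query: resolve a site name to exactly one node id."""
    matches = [node_id for node_id, props in NODES.items()
               if props["kind"] == "site" and props["name"] == name]
    # The assertion documents the intent: site names are expected to be unique.
    assert len(matches) == 1, (
        f"expected exactly one site named {name!r}, found {len(matches)}")
    return matches[0]


def hosted_components(site_id):
    """Second small query: takes the id returned by the first one."""
    # Precondition: fail fast if the caller passes something that is not a site.
    assert NODES[site_id]["kind"] == "site", f"node {site_id} is not a site"
    return sorted(target for source, rel, target in EDGES
                  if source == site_id and rel == "HOSTS")


print(hosted_components(find_site("London")))
```

If the data does not match the expectation, the failing assertion names the broken assumption instead of letting a later step return a misleading result.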
Start with a representative dataset
• Create small datasets to capture the initial use cases
• Write simple unit tests using these datasets to support design and implementation
• These tests tend to become less useful when requirements are better understood
• Throw them away!
Move to a realistic dataset as soon as
possible
• A realistic dataset
• Should capture the complexity of the real data
• Should be sufficiently large
• Ideally based on production data
• Write functional and integration tests against this dataset
Test non-functional aspects
• Graph data is inherently flexible and evolving
• Queries need to be correct and sufficiently performant
• Existing queries’ performance can degrade as the underlying model changes
• Assertions on timeouts should be part of the test suite to detect loops and poor performance
• JUnit’s @Test(timeout=5)
• Spring’s @Timed(millis=5)
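The same idea can be sketched outside JUnit and Spring. The helper below is a plain Python illustration, not either framework's API: `run_with_timeout`, `count_reachable`, `slow_query` and the time limits are assumptions for this example. The call runs in a daemon thread, and the helper raises if the thread has not finished in time, which catches both slow traversals and traversals that never terminate.

```python
import threading
import time


def run_with_timeout(func, timeout_seconds, *args):
    """Run func(*args); raise TimeoutError if it takes longer than allowed."""
    result = {}

    def target():
        try:
            result["value"] = func(*args)
        except BaseException as exc:  # hand any failure back to the caller
            result["error"] = exc

    # A daemon thread does not keep the interpreter alive if it never returns.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_seconds)
    if worker.is_alive():
        raise TimeoutError(f"{func.__name__} exceeded {timeout_seconds}s")
    if "error" in result:
        raise result["error"]
    return result["value"]


def count_reachable(graph, start):
    """Traversal that terminates on cyclic data thanks to the visited set."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return len(seen)


def slow_query(seconds):
    """Stand-in for a query whose performance has degraded."""
    time.sleep(seconds)
    return "done"


# Hypothetical cyclic graph: a -> b -> c -> a.
CYCLE = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(run_with_timeout(count_reachable, 1.0, CYCLE, "a"))
```

Calling `run_with_timeout(slow_query, 0.05, 0.3)` raises `TimeoutError`, which is the failure a test suite would report when a query exceeds its time limit.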
Links
• Twitter: @tareq_abedrabbo
• Blog: http://www.terminalstate.net
• OpenCredo: http://www.opencredo.com
Thank you!