neo4j theory and practice - tareq abedrabbo @ graphconnect london 2013

Neo4j Theory and Practice

Tareq Abedrabbo Graph Connect - 19/11/2013

About me

• CTO/Principal Consultant at OpenCredo

• Working with Neo4j for (almost) 3 years on a number of different projects

• Co-author of Neo4j in Action (Manning)

What is this talk about?

It’s for developers designing and building applications with Neo4j

It’s not a collection of war stories, but I will refer to real-world examples

It is about sharing thoughts and lessons learnt in a useful way

“If I'm to believe Twitter, half of the earth's population are importing Wikipedia into Neo4j, for very obscure reasons.”

Agenda

• What is Neo4j?

• Approaching graph-based applications

• Design

• Implementation

• Test

• Use cases

• Lessons learnt

What really is Neo4j?

A graph model

A query engine

A database

Neo4j is a solid foundation on which to build graph-based applications

How should I approach graph-based applications?

Is there a useful way to categorise graph-based applications?

Domain-centric applications

Data-centric applications

Domain-Centric

• Well-defined data model

• Data changes through user interactions

• Flexible but predictable data structure(s)

• Recommendation engines, social networks, etc…

• Top-down design

Data-Centric

• Complex connected data that typically models real-world networks

• Integrated from a variety of different sources

• Data can be unpredictable

• Telco networks, utility networks, etc…

• Bottom-up design

Typically, applications fall somewhere between these two types

How can I use the information available in my graph?

• Search and pattern-matching

• Find a recommendation based on behaviour

• Graph algorithms

• Shortest path, disconnected components

• Optimisation

• Maximise oil flow while minimising water
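The two graph algorithms named above can be illustrated without a database. The following is a minimal sketch in plain Python, assuming an undirected graph held as an adjacency dictionary. The function names and the sample graph are hypothetical; in a Neo4j application these questions would normally be put to the database itself.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest path as a list of nodes, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def components(graph):
    """Group the nodes into connected components (a list of sets)."""
    seen = set()
    result = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        result.append(component)
    return result


# Hypothetical sample data: two islands that are not connected to each other.
graph = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b"],
    "d": ["e"],
    "e": ["d"],
}
print(shortest_path(graph, "a", "c"))
print(components(graph))
```

More than one component in the result means the graph contains disconnected parts, which is often the interesting finding in itself.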

Graphs are naturally data-driven

Use case 1: Network Impact Analysis

Requirement: Identify the impact of failing components

Requirement: Identify interesting patterns, such as single points of failure

A labelled property graph is a natural fit for the model

Additional “dimensions” can be added to capture abstract concepts: network redundancy, load-balancing

Cypher queries are a natural way to deliver the different requirements
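The two requirements of this use case can be sketched in a few lines. This is an illustration in plain Python, not the Cypher used in the project: it assumes a directed network in which everything is fed from a single source node, and all names and data are hypothetical. A component is treated as impacted when it can no longer be reached from the source once the failed component is removed.

```python
def reachable(graph, source, removed=None):
    """Return the set of nodes reachable from source, ignoring the removed node."""
    if source == removed:
        return set()
    seen = {source}
    stack = [source]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, ()):
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def impact(graph, source, failed):
    """Nodes that lose their connection to the source when `failed` goes down."""
    return reachable(graph, source) - reachable(graph, source, failed) - {failed}


def single_points_of_failure(graph, source):
    """Components whose failure alone cuts off at least one other component."""
    return sorted(
        node for node in graph
        if node != source and impact(graph, source, node)
    )


# Hypothetical network: two redundant routers feeding a single switch.
network = {
    "source": ["r1", "r2"],
    "r1": ["switch"],
    "r2": ["switch"],
    "switch": ["host1", "host2"],
    "host1": [],
    "host2": [],
}
print(impact(network, "source", "r1"))
print(impact(network, "source", "switch"))
print(single_points_of_failure(network, "source"))
```

In this sample, losing one router has no impact because the other provides redundancy, while the switch is the only single point of failure.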

Use case 2: Oil flow optimisation

Requirement: Identify candidate configurations to maximise flow

Requirement: Identify the most practical and valuable adjustments to the network

Simply connected graph with complex components

Interlude: Genetic Algorithms

• Start from an initial population of candidate solutions (individuals or phenotypes), ideally random

• Assign a score to each solution using a fitness function

• The fitness function is the only place that needs specific business knowledge

• Apply genetic operators to create a new generation

• Cross-breeding to retain best characteristics from each parent

• Mutation to maintain diversity and to avoid converging to a local optimum too quickly

• Stop when you want!
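The steps above can be sketched as a short loop. This is a minimal illustration in Python, not the implementation used for the oil flow project. It makes several assumptions: solutions are encoded as lists of bits, the stopping rule is a fixed number of generations, and the best individual is carried over unchanged (elitism, an addition not mentioned on the slide). The `evolve` function and its parameters are hypothetical names.

```python
import random


def evolve(fitness, genome_length, population_size=20, generations=50,
           mutation_rate=0.05, seed=0):
    """Return (best individual, best score recorded at each generation)."""
    rng = random.Random(seed)
    # Initial population of random candidate solutions.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Score every solution; the fitness function holds the business knowledge.
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[:population_size // 2]
        next_generation = [ranked[0]]  # elitism: keep the current best as-is
        while len(next_generation) < population_size:
            mother, father = rng.sample(parents, 2)
            # Cross-breeding: take the head of one parent and the tail of the other.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability to keep diversity.
            child = [1 - gene if rng.random() < mutation_rate else gene
                     for gene in child]
            next_generation.append(child)
        population = next_generation
    return max(population, key=fitness), history


# Toy fitness function (an assumption for the demo): count the bits set to 1.
best, history = evolve(sum, genome_length=16)
print(best, history[0], history[-1])
```

Swapping `sum` for a function that scores a network configuration is the only change the loop itself would need; everything else is independent of the domain.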

Is this even a use case for Neo4j?

Persist and share calculated solutions

Inspect intermediary steps

Use Cypher queries to interrogate solutions

Lessons learnt

Understand your domain

• Don’t follow “best practices” blindly

• For domain-centric applications you can use a mapping framework, such as Spring Data Neo4j

• For data-centric applications, you should stay as close as possible to the graph model

• In any case, don’t try to hide the graph!

Use Cypher!

• Expressive

• Readable

• Maintainable

• Performant

• Cypher + the web console is the quickest way to experiment and to prototype solutions

Manage complexity with domain knowledge

• Graph algorithms are typically complex

• Knowledge of the domain can simplify queries and traversals

• Make Cypher queries as specific as possible

• Take “shortcuts” when you know the domain

Write robust and flexible code

• Break down problems into small queries. Return graph resources (or ids) to chain queries.

• Robustness principle: “Be conservative in what you do, be liberal in what you accept from others”

• Use assertions as preconditions

• Assertions document intent

• Fail fast if data doesn’t match
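A small sketch of these points, using in-memory data in place of real Cypher queries. Both functions stand in for small queries; the first returns an id that the second takes as input, and each states its assumptions as assertions. All names and data are hypothetical.

```python
def find_node_id(nodes, node_type, name):
    """First small query: return the id of the single node matching type and name."""
    matches = [node_id for node_id, props in nodes.items()
               if props.get("type") == node_type and props.get("name") == name]
    # Precondition for everything downstream: the lookup must be unambiguous.
    assert len(matches) == 1, (
        f"expected exactly one {node_type} named {name!r}, found {len(matches)}")
    return matches[0]


def connected_ids(nodes, edges, node_id):
    """Second small query: ids of the nodes directly connected from node_id."""
    assert node_id in nodes, f"unknown node id {node_id}"
    return sorted(target for source, target in edges if source == node_id)


# Hypothetical sample data.
nodes = {
    1: {"type": "router", "name": "r1"},
    2: {"type": "host", "name": "h1"},
    3: {"type": "host", "name": "h2"},
}
edges = [(1, 2), (1, 3)]

router_id = find_node_id(nodes, "router", "r1")  # the id is passed on to the next query
print(connected_ids(nodes, edges, router_id))
```

If the data does not match the expectation, for example two routers with the same name, the chain stops at the first assertion with a message that states the intent. Python skips `assert` statements when run with `-O`, so production code may prefer explicit exceptions.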

Start with a representative dataset

• Create small datasets to capture the initial use cases

• Write simple unit tests using these datasets to support design and implementation

• These tests tend to become less useful when requirements are better understood

• Throw them away!

Move to a realistic dataset as soon as possible

• A realistic dataset

• Should capture the complexity of the real data

• Should be sufficiently large

• Ideally based on production data

• Write functional and integration tests against this dataset

Test non-functional aspects

• Graph data is inherently flexible and evolving

• Queries need to be correct and sufficiently performant

• Existing queries’ performance can degrade as the underlying model changes

• Assertions on timeouts should be part of the test suite to detect loops and poor performance

• JUnit’s @Test(timeout=5) (JUnit 4, value in milliseconds)

• Spring’s @Timed(millis=5)
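The same idea can be expressed outside JUnit. The following is a rough Python analogue, assuming the query or traversal under test can be called as a function; `assert_completes_within` is a hypothetical helper, not part of any test framework. It runs the call in a background thread and fails the assertion if the call has not finished within the budget.

```python
import threading
import time


def assert_completes_within(seconds, func, *args, **kwargs):
    """Run func in a daemon thread and fail if it is still running after `seconds`."""
    outcome = {}

    def run():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as error:  # re-raised in the calling thread below
            outcome["error"] = error

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(seconds)
    assert not worker.is_alive(), f"call did not complete within {seconds}s"
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# A fast call passes and its result is returned.
print(assert_completes_within(1.0, sum, [1, 2, 3]))

# A slow call (simulated here with sleep) trips the assertion.
try:
    assert_completes_within(0.05, time.sleep, 0.3)
except AssertionError as error:
    print("failed as expected:", error)
```

The helper does not stop a runaway call, it only stops waiting for it; the daemon flag keeps a stuck thread from blocking interpreter exit.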

Links

• Twitter: @tareq_abedrabbo

• Blog: http://www.terminalstate.net

• OpenCredo: http://www.opencredo.com

Thank you!

