introduction to graph databases @ sai


about me

who am i ...

Davy Suvee (@DSUVEE)

➡ big data architect @ datablend - continuum

• provide big data and nosql consultancy

• share practical knowledge and big data use cases via blog

Big Data

2-3 years ago ...

Nowadays ...

Big Data

What is big data ...

... large and complex data sets that are difficult to process with traditional database management tools ...


➡ store (nosql)

➡ enrich (data mining, ml, nlp, ... )

➡ visualize (d3, gephi, mapbox, tableau, ... )

➡ process/analyze (map/reduce, cep, storm, ... )

Volume: data exceeds the limits of vertically scalable tools, requiring novel storage solutions

Variety: data takes different formats that make integration complex and expensive

Velocity: data analysis time windows are small compared to the speed of data acquisition

The world has changed ...

Tackling the volume problem ...

What we are currently doing ...

➡ Throwing our data away :-(

➡ Storing preprocessed data :-/

➡ Try to store it anyway ;-(

But why?

Vertical Scaling

Your database


Horizontal Scaling

€ x #nodes

Your database

Tackling the variety problem ...

Social streams

Log files

Massive, unstructured

Tackling the variety problem ...

One schema-structured model

Best-fit, schema-less model

Your database

Key-Value Databases

Document-Based Databases

Graph Databases

Wide-column Databases

AS IS ...

Tackling the velocity problem ...

We want to ...

➡ Collect

➡ Process

➡ Query

➡ Analyze

MASSIVE amounts of unstructured data in real time

Tackling the velocity problem ...

Slow and outdated information

Fast and real-time

Your stack

NoSQL & Big Data

APP

Map-Reduce (+ ANALYTICS)

graphs are everywhere ...

a little bit of graph theory ...

node/vertex: Davy (age = 33)

node/vertex: Kim (age = 26, gender = F)

node/vertex: Datablend (btw = 123...)

node/vertex: Janssen (sector = pharma)

edge: founded (in: 2011)

edge: worked_for (from: 2008, to: 2013)

edge: knows (since: 2013)

Advantages ... ?

➡ whiteboard friendly

➡ schema-less

➡ index-free adjacency (no joins!)

➡ queries as traversals

➡ queries as pattern matching
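
Index-free adjacency means that every node keeps direct references to its own relationships, so following an edge needs no index lookup and no join. A minimal sketch in plain Java, assuming a toy in-memory model: the TinyGraph, Node and Edge classes below are hypothetical illustrations, not the API of neo4j or any other product.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory property graph; all names here are illustrative only. */
final class TinyGraph {

    static final class Node {
        final Map<String, Object> properties = new HashMap<>();
        // Index-free adjacency: the node itself holds its outgoing edges.
        final List<Edge> outgoing = new ArrayList<>();

        Node set(String key, Object value) {
            properties.put(key, value);
            return this;
        }

        Edge connectTo(Node target, String label) {
            Edge edge = new Edge(this, target, label);
            outgoing.add(edge);
            return edge;
        }

        /** Follows outgoing edges with the given label; no global lookup involved. */
        List<Node> neighbours(String label) {
            List<Node> result = new ArrayList<>();
            for (Edge edge : outgoing) {
                if (edge.label.equals(label)) {
                    result.add(edge.target);
                }
            }
            return result;
        }
    }

    static final class Edge {
        final Node source;
        final Node target;
        final String label;
        final Map<String, Object> properties = new HashMap<>();

        Edge(Node source, Node target, String label) {
            this.source = source;
            this.target = target;
            this.label = label;
        }
    }

    public static void main(String[] args) {
        Node davy = new Node().set("name", "Davy").set("age", 33);
        Node kim = new Node().set("name", "Kim").set("age", 26);
        davy.connectTo(kim, "knows").properties.put("since", 2013);

        for (Node friend : davy.neighbours("knows")) {
            System.out.println(friend.properties.get("name"));
        }
    }
}
```

The cost of finding a node's neighbours depends only on that node's own edge list, not on the total size of the graph.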

Products/projects ... ?

➡ databases: neo4j, orientdb, allegrograph, dex, ...

➡ processing: pregel, giraph, hama, goldenorb, ...

➡ APIs: blueprints

➡ query languages: gremlin, cypher, sparql

Graph database 101 (neo4j)

GraphDatabaseService graph = ...

// Node, Relationship and RelationshipType come from org.neo4j.graphdb
Node davy = graph.createNode();
davy.setProperty("name", "Davy");

Node kim = graph.createNode();
kim.setProperty("name", "Kim");

enum RelTypes implements RelationshipType { KNOWS, WORKED_FOR, FOUNDED }

Relationship davy_kim = davy.createRelationshipTo(kim, RelTypes.KNOWS);
davy_kim.setProperty("since", 2013);

Relationship davy_datablend = davy.createRelationshipTo(datablend, RelTypes.FOUNDED);
davy_datablend.setProperty("in", 2011);


➡ how to access the datablend node?

Index<Node> nodeIndex = graph.index().forNodes("nodes");

Node datablend = graph.createNode();
datablend.setProperty("name", "Datablend");

// index the node under its name so it can be looked up later
nodeIndex.add(datablend, "name", "Datablend");

Node found = nodeIndex.get("name", "Datablend").getSingle();

➡ find friends of my friends ...

TraversalDescription td = Traversal.description()
    .breadthFirst()
    .relationships(RelTypes.KNOWS, Direction.OUTGOING)
    // toDepth(2) also yields davy (depth 0) and his direct friends (depth 1)
    .evaluator(Evaluators.toDepth(2));

Traverser traverser = td.traverse(davy);

for (Path path : traverser) { ... }

➡ find friends of my friends ...

START davy=node:node_auto_index(name = "Davy")
MATCH davy-[:KNOWS]->()-[:KNOWS]->fof
RETURN davy, fof

ExecutionEngine engine = new ExecutionEngine(graph);

ExecutionResult result = engine.execute(query);
for (Map<String,Object> row : result) { ... }
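
Both the traversal and the Cypher query above express the same idea: follow a KNOWS edge twice. As a rough sketch of what that pattern does, here is the two-hop expansion in plain Java over an adjacency map. The FriendsOfFriends class is hypothetical, and the "Kim knows Peter" edge is an assumption made only for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Two-hop expansion over a directed "knows" adjacency map (illustrative only). */
final class FriendsOfFriends {

    /** Returns every name reachable by exactly two outgoing hops from start. */
    static Set<String> twoHops(Map<String, List<String>> knows, String start) {
        Set<String> result = new LinkedHashSet<>();
        for (String friend : knows.getOrDefault(start, Collections.emptyList())) {
            for (String fof : knows.getOrDefault(friend, Collections.emptyList())) {
                result.add(fof); // mirrors start -[:KNOWS]-> () -[:KNOWS]-> fof
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> knows = new HashMap<>();
        knows.put("Davy", Arrays.asList("Kim"));
        knows.put("Kim", Arrays.asList("Peter")); // assumed edge, not on the slides
        System.out.println(twoHops(knows, "Davy")); // prints [Peter]
    }
}
```

Like the pattern match, this can return someone who is also a direct friend, or the start node itself when a friend points back to it.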

Use cases ... ?

➡ recommendations

➡ access control

➡ routing

➡ social computing/networks

➡ genealogy

insights in big data

➡ typical approach through warehousing

★ star schema with fact tables and dimension tables

★ real-time visualization

★ filtering

★ metrics

★ layouting

★ modular [1, 2]

1. http://gephi.org/plugins/neo4j-graph-database-support/

2. http://github.com/datablend/gephi-blueprints-plugin

gene expression clustering

➡ oncology data set:

★ 4,800 samples

★ 27,000 genes

➡ Question:

★ for a particular subset of samples, which genes are co-expressed?

mongodb for storing gene expressions

{
  "_id" : { "$oid" : "4f1fb64a1695629dd9d916e3" },
  "sample_name" : "122551hp133a21.cel",
  "genomics_id" : 122551,
  "sample_id" : 343981,
  "donor_id" : 143981,
  "sample_type" : "Tissue",
  "sample_site" : "Ascending colon",
  "pathology_category" : "MALIGNANT",
  "pathology_morphology" : "Adenocarcinoma",
  "pathology_type" : "Primary malignant neoplasm of colon",
  "primary_site" : "Colon",
  "expressions" : [
    { "gene" : "X1_at", "expression" : 5.54217719084415 },
    { "gene" : "X10_at", "expression" : 3.92335121981739 },
    { "gene" : "X100_at", "expression" : 7.81638155662255 },
    { "gene" : "X1000_at", "expression" : 5.44318512260619 },
    …
  ]
}

pearson correlation through map-reduce

pearson correlation
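
For reference, the Pearson correlation of two expression vectors is their covariance divided by the product of their standard deviations. The slides compute it with map-reduce; the sketch below is a plain in-memory version under that definition, with a hypothetical Pearson class and made-up input values, and is not the map-reduce job itself.

```java
/** In-memory Pearson correlation of two equally long vectors (illustrative only). */
final class Pearson {

    /**
     * Returns a value in [-1, 1]. The result is NaN when either vector
     * is constant, because its variance is then zero.
     */
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two vectors of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        // The 1/n factors cancel, so the raw sums can be used directly.
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        // Made-up expression levels of two genes across four samples.
        double[] geneA = {5.5, 3.9, 7.8, 5.4};
        double[] geneB = {5.1, 4.2, 7.5, 5.9};
        System.out.println(correlation(geneA, geneB));
    }
}
```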

co-expression graph

➡ create a node for each gene

➡ if correlation between two genes >= 0.8, draw an edge between both nodes
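
The two rules above can be sketched as a threshold over a precomputed correlation matrix. The CoExpressionGraph class is hypothetical and the correlation values in the example are invented for illustration; only the gene names and the 0.8 threshold come from the slides.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Builds an undirected co-expression graph from a correlation matrix (illustrative only). */
final class CoExpressionGraph {

    /**
     * genes[i] labels row and column i of the symmetric matrix corr.
     * Returns an adjacency map with one entry per gene.
     */
    static Map<String, Set<String>> build(String[] genes, double[][] corr, double threshold) {
        Map<String, Set<String>> adjacency = new LinkedHashMap<>();
        for (String gene : genes) {
            adjacency.put(gene, new LinkedHashSet<String>()); // one node per gene
        }
        for (int i = 0; i < genes.length; i++) {
            for (int j = i + 1; j < genes.length; j++) {
                if (corr[i][j] >= threshold) {
                    // undirected edge: record it on both endpoints
                    adjacency.get(genes[i]).add(genes[j]);
                    adjacency.get(genes[j]).add(genes[i]);
                }
            }
        }
        return adjacency;
    }

    public static void main(String[] args) {
        String[] genes = {"X1_at", "X10_at", "X100_at"};
        double[][] corr = { // invented values
            {1.00, 0.92, 0.10},
            {0.92, 1.00, 0.35},
            {0.10, 0.35, 1.00}
        };
        System.out.println(build(genes, corr, 0.8));
    }
}
```

Genes without any sufficiently correlated partner stay in the graph as isolated nodes.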

co-expression graph

mutation prevalence

analyzing running data

analyzing running data through neo4j

➡ using neo4j spatial extension

➡ create a node for each tracked point

List<GeoPipeFlow> closests = GeoPipeline
    .startNearestNeighborLatLonSearch(runningLayer, to, 0.02)
    .sort("OrthodromicDistance")
    .getMin("OrthodromicDistance")
    .toList();

➡ connect succeeding tracking nodes in a graph

analyzing running data

analyzing google analytics data

➡ source url -> target url

graphs and time ...

➡ fluxgraph: a blueprints-compatible graph on top of Datomic

➡ make FluxGraph fully time-aware

★ travel your graph through time

★ time-scoped iteration of vertices and edges

★ temporal graph comparison

➡ towards a time-aware graph ...

➡ reproducible graph state

travel through time

FluxGraph fg = new FluxGraph();

Vertex davy = fg.addVertex();
davy.setProperty("name", "Davy");

Vertex kim = ...

Vertex peter = ...

Edge e1 = fg.addEdge(davy, kim, "knows");

Date checkpoint = new Date();

davy.setProperty("name", "David");

Edge e2 = fg.addEdge(davy, peter, "knows");

travel through time

[figure: timeline with Davy at the checkpoint and David at the current time; by default the graph reflects the current time]

fg.setCheckpointTime(checkpoint);

[figure: same timeline; after this call the graph reflects the checkpoint]
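
To make the checkpoint idea concrete, here is a minimal sketch of a single time-aware property in plain Java: every write is kept with its timestamp, and a read at a checkpoint returns the latest value written at or before that time. The VersionedProperty class and the integer timestamps are assumptions for illustration; this is not how FluxGraph or Datomic store data.

```java
import java.util.Map;
import java.util.TreeMap;

/** One property whose full write history is kept, keyed by time (illustrative only). */
final class VersionedProperty {

    private final TreeMap<Long, String> history = new TreeMap<>();

    /** Records a new value without overwriting older ones. */
    void set(long time, String value) {
        history.put(time, value);
    }

    /** Value as of the given time: the latest write at or before it, or null if none. */
    String getAt(long time) {
        Map.Entry<Long, String> entry = history.floorEntry(time);
        return entry == null ? null : entry.getValue();
    }

    /** Most recent value, or null if nothing was ever written. */
    String getCurrent() {
        return history.isEmpty() ? null : history.lastEntry().getValue();
    }

    public static void main(String[] args) {
        VersionedProperty name = new VersionedProperty();
        name.set(1L, "Davy");
        long checkpoint = 2L;
        name.set(3L, "David");

        System.out.println(name.getCurrent());      // prints David
        System.out.println(name.getAt(checkpoint)); // prints Davy
    }
}
```

Because old values are never overwritten, any past state can be reproduced by choosing the right read time.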

time-scoped iteration

[figure: timeline t2, t3, tcurrent with a change at each step, producing versions Davy', Davy'', Davy''' linked by next/previous]

➡ how to find the version of the vertex you are interested in?

Vertex previousDavy = davy.getPreviousVersion();

Iterable<Vertex> allDavy = davy.getNextVersions();

Iterable<Vertex> selDavy = davy.getPreviousVersions(filter);

Interval valid = davy.getTimerInterval();

temporal graph comparison

[figure: current graph (David, Peter) next to checkpoint graph (Davy, Peter)]

what changed?

➡ difference (A , B) = union (A , B) - B

➡ ... as an (immutable) graph!
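
The difference formula can be sketched with plain set operations, assuming each graph is reduced to a set of edge descriptions. The GraphDiff class and the string encoding of edges are hypothetical; FluxGraph returns a graph rather than a set of strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** difference(A, B) = union(A, B) - B, over sets of edge descriptions (illustrative only). */
final class GraphDiff {

    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.addAll(b);    // union(A, B)
        result.removeAll(b); // ... minus B
        // Read-only view, echoing the "immutable graph" on the slide.
        return Collections.unmodifiableSet(result);
    }

    public static void main(String[] args) {
        Set<String> current = new LinkedHashSet<>(
            Arrays.asList("Davy-knows->Kim", "Davy-knows->Peter"));
        Set<String> checkpoint = new LinkedHashSet<>(
            Arrays.asList("Davy-knows->Kim"));

        // What exists now that did not exist at the checkpoint?
        System.out.println(difference(current, checkpoint)); // prints [Davy-knows->Peter]
    }
}
```

For plain sets this is the same as A minus B; the union form is kept to match the slide.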


use case: longitudinal patient data

[figure: a patient vertex at t1, t2 and t3, connected over time to smoking and cancer vertices]

➡ historical data for 15,000 patients over a period of 10 years (2001-2010)

➡ example analysis:

★ if a male patient is no longer smoking in 2005

★ what are the chances of getting lung cancer in 2010, comparing

patients that smoked before 2005

patients that never smoked

FluxGraph

➡ available on github: http://github.com/datablend/fluxgraph

Open Innovation Networking Tool

➡ Many different projects, many different partners, many different domains ...

★ how do we keep track?

★ how can we learn from the data?

➡ Store the data in its most natural form, a graph

➡ use graph algorithms to identify the importance of each node and their related ones
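
The slides do not say which graph algorithm is used to rank nodes. As one possible illustration, the sketch below runs a basic PageRank by power iteration on a small directed graph. The PageRankSketch class, the damping factor of 0.85 and the partner and project names are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Basic PageRank by power iteration (illustrative only). */
final class PageRankSketch {

    static Map<String, Double> pageRank(Map<String, List<String>> out,
                                        double damping, int iterations) {
        // Collect every node that appears as a source or as a target.
        Set<String> nodes = new LinkedHashSet<>(out.keySet());
        for (List<String> targets : out.values()) {
            nodes.addAll(targets);
        }
        int n = nodes.size();

        Map<String, Double> rank = new LinkedHashMap<>();
        for (String node : nodes) {
            rank.put(node, 1.0 / n); // start from a uniform distribution
        }

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new LinkedHashMap<>();
            double dangling = 0.0; // rank held by nodes without outgoing edges
            for (String node : nodes) {
                next.put(node, (1.0 - damping) / n);
                List<String> targets = out.get(node);
                if (targets == null || targets.isEmpty()) {
                    dangling += rank.get(node);
                }
            }
            for (String node : nodes) {
                List<String> targets = out.get(node);
                if (targets == null || targets.isEmpty()) {
                    continue;
                }
                double share = damping * rank.get(node) / targets.size();
                for (String target : targets) {
                    next.put(target, next.get(target) + share);
                }
            }
            // Spread dangling rank evenly so the total stays at 1.
            double danglingShare = damping * dangling / n;
            for (String node : nodes) {
                next.put(node, next.get(node) + danglingShare);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> worksOn = new LinkedHashMap<>();
        worksOn.put("partnerA", Arrays.asList("projectX"));
        worksOn.put("partnerB", Arrays.asList("projectX", "projectY"));
        worksOn.put("partnerC", Arrays.asList("projectX"));
        System.out.println(pageRank(worksOn, 0.85, 50));
    }
}
```

Simpler measures such as node degree, or other centrality scores, could serve the same purpose.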

Open Innovation Networking Tool

More graphs ...

➡ pharma

➡ geospatial

➡ dependency analysis

➡ ontology

➡ ...

Questions?

E-MAIL: info@datablend.be

twitter.com/data_blend

www.datablend.be

0499/05.00.89

datablend - continuum
