Introduction to Graph Databases @ SAI


Upload: datablend

Post on 26-Jan-2015

TRANSCRIPT

Page 1: Introduction to Graph Databases @ SAI

Thursday 23.5 graph databases

Page 2: Introduction to Graph Databases @ SAI

about me

who am i ...

Davy Suvee (@DSUVEE)

➡ big data architect @ datablend - continuum

• provide big data and nosql consultancy

• share practical knowledge and big data use cases via blog

Page 3: Introduction to Graph Databases @ SAI

Big Data

2-3 years ago ...

Page 4: Introduction to Graph Databases @ SAI

Nowadays ...

Big Data

Page 5: Introduction to Graph Databases @ SAI

What is big data ...

... large and complex data sets that are difficult to process with traditional database management tools ...

Page 6: Introduction to Graph Databases @ SAI

What is big data ...

Big Data

... large and complex data sets that are difficult to process with traditional database management tools ...

➡ store (nosql)

➡ enrich (data mining, ml, nlp, ... )

➡ visualize (d3, gephi, mapbox, tableau, ... )

➡ process/analyze (map/reduce, cep, storm, ... )

Page 7: Introduction to Graph Databases @ SAI

The world has changed ...

➡ Volume: data exceeds the limits of vertically scalable tools, requiring novel storage solutions

➡ Variety: data takes different formats that make integration complex and expensive

➡ Velocity: data analysis time windows are small compared to the speed of data acquisition

Page 8: Introduction to Graph Databases @ SAI

Tackling the volume problem ...

What we are currently doing ...

➡ Throwing our data away :-(

➡ Storing preprocessed data :-/

➡ Trying to store it anyway ;-(

But why?

Page 9: Introduction to Graph Databases @ SAI

Tackling the volume problem ...

Vertical Scaling

Your database

Page 10: Introduction to Graph Databases @ SAI

Tackling the volume problem ...

Vertical Scaling

€ 2

Your database

Page 11: Introduction to Graph Databases @ SAI

Tackling the volume problem ...

Vertical Scaling

€ 3

Your database

Page 12: Introduction to Graph Databases @ SAI

Tackling the volume problem ...

Vertical Scaling

€ 4

Your database

Page 13: Introduction to Graph Databases @ SAI

Tackling the volume problem ...

Vertical Scaling (your database): € 4

Horizontal Scaling (NoSQL): € x #nodes

Page 14: Introduction to Graph Databases @ SAI

Tackling the variety problem ...

Video

Audio

Social streams

Log files

Text

Massive, unstructured

Page 15: Introduction to Graph Databases @ SAI

Tackling the variety problem ...

AS IS ...

Your database: one, schema-structured model

NoSQL: best-fit, schema-less model

➡ Key-Value Databases

➡ Document-Based Databases

➡ Graph Databases

➡ Wide-column Databases

Page 16: Introduction to Graph Databases @ SAI

Tackling the velocity problem ...

We want to ...

➡ Collect

➡ Process

➡ Query

➡ Analyze

MASSIVE amounts of Unstructured data in Real-Time

Page 17: Introduction to Graph Databases @ SAI

Tackling the velocity problem ...

Your stack: slow and outdated information

APP, SYNC, ETL, BI

NoSQL & Big Data: fast and realtime

APP, SYNC, Map-Reduce, BI (+ ANALYTICS)

Page 18: Introduction to Graph Databases @ SAI

graphs are everywhere ...

Page 19: Introduction to Graph Databases @ SAI

a little bit of graph theory ...

node/vertex: Davy (age = 33)

node/vertex: Kim (age = 26, gender = F)

node/vertex: Datablend (btw = 123...)

node/vertex: Janssen (sector = pharma)

edge: founded (in: 2011)

edge: worked_for (from: 2008, to: 2013)

edge: knows (since: 2013)
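The node/edge/property model above can be sketched in a few lines of plain Java. This is only an illustration of the data structure, not how any graph database stores it: the class and method names (`PropertyGraphSketch`, `Node`, `Edge`, `connect`, `buildExample`) are made up for this sketch, and only the Davy, Kim and Datablend part of the example is built.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal in-memory property graph; all class and method names are illustrative. */
final class PropertyGraphSketch {

    /** A node (vertex) carries its own key/value properties and its outgoing edges. */
    static final class Node {
        final Map<String, Object> properties = new LinkedHashMap<>();
        final List<Edge> outgoing = new ArrayList<>();

        Node set(String key, Object value) {
            properties.put(key, value);
            return this;
        }

        /** Creates a directed, labelled edge from this node to the target node. */
        Edge connect(String label, Node target) {
            Edge edge = new Edge(label, this, target);
            outgoing.add(edge);
            return edge;
        }
    }

    /** An edge has a label, a direction (from -> to) and properties of its own. */
    static final class Edge {
        final String label;
        final Node from;
        final Node to;
        final Map<String, Object> properties = new LinkedHashMap<>();

        Edge(String label, Node from, Node to) {
            this.label = label;
            this.from = from;
            this.to = to;
        }

        Edge set(String key, Object value) {
            properties.put(key, value);
            return this;
        }
    }

    /** Builds the Davy/Kim/Datablend part of the example and returns the Davy node. */
    static Node buildExample() {
        Node davy = new Node().set("name", "Davy").set("age", 33);
        Node kim = new Node().set("name", "Kim").set("age", 26).set("gender", "F");
        Node datablend = new Node().set("name", "Datablend");
        davy.connect("knows", kim).set("since", 2013);
        davy.connect("founded", datablend).set("in", 2011);
        return davy;
    }

    public static void main(String[] args) {
        // Following an edge is a direct reference lookup, no join involved.
        for (Edge edge : buildExample().outgoing) {
            System.out.println(edge.from.properties.get("name") + " -[" + edge.label + " "
                    + edge.properties + "]-> " + edge.to.properties.get("name"));
        }
    }
}
```

Because each node holds references to its own edges, moving from a node to its neighbours needs no lookup in a global table.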

Page 20: Introduction to Graph Databases @ SAI

Advantages ... ?

Graph Database

➡ whiteboard friendly

➡ schema-less

➡ index-free adjacency (no joins!)

➡ queries as traversals

➡ queries as pattern matching

Page 21: Introduction to Graph Databases @ SAI

Advantages ... ?

Page 22: Introduction to Graph Databases @ SAI

Products/projects ... ?

Graph Database

➡ databases: neo4j, orientdb, allegrograph, dex, ...

➡ processing: pregel, giraph, hama, goldenorb, ...

➡ APIs: blueprints

➡ query languages: gremlin, cypher, sparql

Page 23: Introduction to Graph Databases @ SAI

Graph database 101 (neo4j)

GraphDatabaseService graph = ...

Node davy = graph.createNode();
davy.setProperty("name", "Davy");

Node kim = graph.createNode();
kim.setProperty("name", "Kim");

Davy

Kim

Page 24: Introduction to Graph Databases @ SAI

Graph database 101 (neo4j)

enum RelTypes implements RelationshipType { KNOWS, WORKED_FOR, FOUNDED }

Relationship davy_kim = davy.createRelationshipTo(kim, RelTypes.KNOWS);
davy_kim.setProperty("since", 2013);

Davy --knows--> Kim

Page 25: Introduction to Graph Databases @ SAI

Graph database 101 (neo4j)

Relationship davy_datablend = davy.createRelationshipTo(datablend, RelTypes.FOUNDED);
davy_datablend.setProperty("in", 2011);

Davy --founded--> Datablend

➡ how to access the datablend node?

Page 26: Introduction to Graph Databases @ SAI

Graph database 101 (neo4j)

Index<Node> nodeIndex = graph.index().forNodes("nodes");

Node datablend = graph.createNode();
datablend.setProperty("name", "Datablend");

nodeIndex.add(datablend, "name", "Datablend");

Node found = nodeIndex.get("name", "Datablend").getSingle();

Page 27: Introduction to Graph Databases @ SAI

Graph database 101 (neo4j)

➡ find friends of my friends ...

TraversalDescription td = Traversal.description()
    .breadthFirst()
    .relationships(RelTypes.KNOWS, Direction.OUTGOING)
    .evaluator(Evaluators.toDepth(2));

Traverser traverser = td.traverse(davy);

for (Path path : traverser) { ... }
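To make explicit what such a depth-limited, breadth-first traversal does, here is a hedged sketch in plain Java with no Neo4j dependency. The class name, the adjacency-map representation and the sample people (Peter and Ann are added purely for illustration) are assumptions of this sketch. Unlike the Neo4j traverser, it leaves the start node out of the result.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Breadth-first traversal over outgoing "knows" edges, limited to a maximum depth. */
final class FriendsOfFriendsSketch {

    /** Returns every person reachable within maxDepth hops, mapped to the hop count. */
    static Map<String, Integer> reachable(Map<String, List<String>> knows, String start, int maxDepth) {
        Map<String, Integer> depthByPerson = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthByPerson.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int depth = depthByPerson.get(current);
            if (depth == maxDepth) {
                continue; // do not expand beyond the depth limit
            }
            for (String friend : knows.getOrDefault(current, Collections.<String>emptyList())) {
                if (!depthByPerson.containsKey(friend)) { // visit each person only once
                    depthByPerson.put(friend, depth + 1);
                    queue.add(friend);
                }
            }
        }
        depthByPerson.remove(start); // the start node is not its own friend
        return depthByPerson;
    }

    /** Hypothetical sample data: Davy knows Kim, Kim knows Peter, Peter knows Ann. */
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> knows = new HashMap<>();
        knows.put("Davy", Arrays.asList("Kim"));
        knows.put("Kim", Arrays.asList("Peter"));
        knows.put("Peter", Arrays.asList("Ann"));
        return knows;
    }

    public static void main(String[] args) {
        System.out.println(reachable(sampleGraph(), "Davy", 2));
    }
}
```

With a depth limit of 2, Ann (three hops away) is never reached, which is the effect of the depth evaluator in the traversal description.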

Page 28: Introduction to Graph Databases @ SAI

Graph database 101 (neo4j)

➡ find friends of my friends ...

String query =
    "START davy=node:node_auto_index(name = \"Davy\") " +
    "MATCH davy-[:KNOWS]->()-[:KNOWS]->fof " +
    "RETURN davy, fof";

ExecutionEngine engine = new ExecutionEngine(graph);

ExecutionResult result = engine.execute(query);
for (Map<String, Object> row : result) { ... }

Page 29: Introduction to Graph Databases @ SAI

Use cases ... ?

Graph Database

➡ recommendations

➡ access control

➡ routing

➡ social computing/networks

➡ genealogy

Page 30: Introduction to Graph Databases @ SAI

insights in big data

➡ typical approach through warehousing

★ star schema with fact tables and dimension tables

Page 33: Introduction to Graph Databases @ SAI

insights in big data

★ real-time visualization

★ filtering

★ metrics

★ layouting

★ modular (1, 2)

1. http://gephi.org/plugins/neo4j-graph-database-support/

2. http://github.com/datablend/gephi-blueprints-plugin

Page 34: Introduction to Graph Databases @ SAI

gene expression clustering

➡ oncology data set:

★ 4,800 samples

★ 27,000 genes

➡ Question:

★ for a particular subset of samples, which genes are co-expressed?

Page 35: Introduction to Graph Databases @ SAI

mongodb for storing gene expressions

{ "_id" : { "$oid" : "4f1fb64a1695629dd9d916e3" },
  "sample_name" : "122551hp133a21.cel",
  "genomics_id" : 122551,
  "sample_id" : 343981,
  "donor_id" : 143981,
  "sample_type" : "Tissue",
  "sample_site" : "Ascending colon",
  "pathology_category" : "MALIGNANT",
  "pathology_morphology" : "Adenocarcinoma",
  "pathology_type" : "Primary malignant neoplasm of colon",
  "primary_site" : "Colon",
  "expressions" : [ { "gene" : "X1_at", "expression" : 5.54217719084415 },
                    { "gene" : "X10_at", "expression" : 3.92335121981739 },
                    { "gene" : "X100_at", "expression" : 7.81638155662255 },
                    { "gene" : "X1000_at", "expression" : 5.44318512260619 },
                    … ] }

Page 36: Introduction to Graph Databases @ SAI

pearson correlation through map-reduce

pearson correlation

 x    y
43   99
21   65
25   79
42   75
57   87
59   81

r ≈ 0.53
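One way to phrase Pearson correlation as a map-reduce job is to map every (x, y) observation to a small vector of partial sums, add those vectors in the reduce step, and compute r from the totals at the end. The sketch below does this in plain Java with streams rather than with MongoDB's map-reduce; the class and method names are assumptions made for illustration, and the deck does not show the author's actual job.

```java
import java.util.stream.IntStream;

/** Pearson correlation expressed as map (partial sums), reduce (addition) and finalize. */
final class PearsonMapReduceSketch {

    /** Map step: one observation becomes {n, sum(x), sum(y), sum(xy), sum(x^2), sum(y^2)}. */
    static double[] map(double x, double y) {
        return new double[] {1, x, y, x * y, x * x, y * y};
    }

    /** Reduce step: partial sums are combined by element-wise addition. */
    static double[] reduce(double[] a, double[] b) {
        double[] sum = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
        }
        return sum;
    }

    /** Finalize step: turns the combined sums into the correlation coefficient. */
    static double finalizeSums(double[] s) {
        double n = s[0];
        double numerator = n * s[3] - s[1] * s[2];
        double denominator = Math.sqrt((n * s[4] - s[1] * s[1]) * (n * s[5] - s[2] * s[2]));
        return numerator / denominator;
    }

    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        double[] sums = IntStream.range(0, x.length)
                .mapToObj(i -> map(x[i], y[i]))
                .reduce(new double[6], PearsonMapReduceSketch::reduce);
        return finalizeSums(sums);
    }

    public static void main(String[] args) {
        // The six (x, y) pairs from the table on this slide.
        double[] x = {43, 21, 25, 42, 57, 59};
        double[] y = {99, 65, 79, 75, 87, 81};
        System.out.println(pearson(x, y));
    }
}
```

Since the reduce step is plain addition, partial sums can be combined in any order, which is what lets the computation be spread over many workers.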

Page 37: Introduction to Graph Databases @ SAI

co-expression graph

➡ create a node for each gene

➡ if correlation between two genes >= 0.8, draw an edge between both nodes
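The two rules above amount to a pairwise comparison with a threshold. Here is a minimal sketch in plain Java; the gene names and expression values are invented for illustration and are not taken from the oncology data set, and the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds co-expression edges: one edge per gene pair with correlation >= threshold. */
final class CoExpressionGraphSketch {

    /** Pearson correlation of two equally long expression profiles. */
    static double pearson(double[] x, double[] y) {
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i] / x.length;
            meanY += y[i] / y.length;
        }
        double covariance = 0.0;
        double varianceX = 0.0;
        double varianceY = 0.0;
        for (int i = 0; i < x.length; i++) {
            covariance += (x[i] - meanX) * (y[i] - meanY);
            varianceX += (x[i] - meanX) * (x[i] - meanX);
            varianceY += (y[i] - meanY) * (y[i] - meanY);
        }
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    /** Every gene is a node; an edge "a -- b" is added when the pair is co-expressed. */
    static List<String> edges(String[] genes, double[][] expressions, double threshold) {
        List<String> edges = new ArrayList<>();
        for (int a = 0; a < genes.length; a++) {
            for (int b = a + 1; b < genes.length; b++) { // each unordered pair once
                if (pearson(expressions[a], expressions[b]) >= threshold) {
                    edges.add(genes[a] + " -- " + genes[b]);
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        // Hypothetical expression profiles over four samples.
        String[] genes = {"geneA", "geneB", "geneC"};
        double[][] expressions = {
            {1.0, 2.0, 3.0, 4.0},
            {2.0, 4.0, 6.0, 8.5},
            {4.0, 1.0, 3.0, 2.0}
        };
        System.out.println(edges(genes, expressions, 0.8));
    }
}
```

The pairwise loop is quadratic in the number of genes, so at 27,000 genes the correlations would realistically be computed in a distributed job and only the resulting edges loaded into the graph.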

Page 38: Introduction to Graph Databases @ SAI

co-expression graph

Page 39: Introduction to Graph Databases @ SAI
Page 40: Introduction to Graph Databases @ SAI

mutation prevalence

Page 44: Introduction to Graph Databases @ SAI

analyzing running data

<trkpt lon="4.723870977759361" lat="51.075748661533">
    <ele>29.799999237060547</ele>
    <time>2011-11-08T19:18:39.000Z</time>
</trkpt>
<trkpt lon="4.724105251953006" lat="51.075623352080584">
    <ele>29.799999237060547</ele>
    <time>2011-11-08T19:18:45.000Z</time>
</trkpt>
<trkpt lon="4.724143054336309" lat="51.07560558244586">
    <ele>29.799999237060547</ele>
    <time>2011-11-08T19:18:46.000Z</time>
</trkpt>

Page 45: Introduction to Graph Databases @ SAI

analyzing running data through neo4j

➡ using neo4j spatial extension

➡ create a node for each tracked point

List<GeoPipeFlow> closests = GeoPipeline
    .startNearestNeighborLatLonSearch(runningLayer, to, 0.02)
    .sort("OrthodromicDistance")
    .getMin("OrthodromicDistance")
    .toList();

➡ connect succeeding tracking nodes in a graph
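The orthodromic (great-circle) distance used for sorting above can be approximated with the haversine formula. The sketch below computes it between succeeding track points in plain Java, without the Neo4j Spatial extension. The class and method names and the mean Earth radius constant are assumptions of this sketch; the three sample points are the track points shown on the previous slide.

```java
/** Great-circle distances between succeeding GPS track points (haversine formula). */
final class TrackDistanceSketch {

    static final double EARTH_RADIUS_METERS = 6_371_000.0; // mean Earth radius (assumption)

    /** Haversine distance in meters between two latitude/longitude pairs given in degrees. */
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
    }

    /** Sums the distances of the edges between succeeding points; each row is {lat, lon}. */
    static double trackLengthMeters(double[][] points) {
        double total = 0.0;
        for (int i = 1; i < points.length; i++) {
            total += haversineMeters(points[i - 1][0], points[i - 1][1],
                    points[i][0], points[i][1]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {
            {51.075748661533, 4.723870977759361},
            {51.075623352080584, 4.724105251953006},
            {51.07560558244586, 4.724143054336309}
        };
        System.out.println(trackLengthMeters(points) + " m");
    }
}
```

In a graph model, each of these distances would naturally become a property on the edge connecting two succeeding tracking nodes.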

Page 46: Introduction to Graph Databases @ SAI

analyzing running data

Page 47: Introduction to Graph Databases @ SAI

analyzing google analytics data

➡ source url -> target url

Page 48: Introduction to Graph Databases @ SAI

graphs and time ...

➡ towards a time-aware graph ...

➡ fluxgraph: a blueprints-compatible graph on top of Datomic

➡ reproducible graph state

➡ make FluxGraph fully time-aware

★ travel your graph through time

★ time-scoped iteration of vertices and edges

★ temporal graph comparison

Page 53: Introduction to Graph Databases @ SAI

travel through time

FluxGraph fg = new FluxGraph();

Vertex davy = fg.addVertex();
davy.setProperty("name", "Davy");

Vertex kim = ...

Vertex peter = ...

Edge e1 = fg.addEdge(davy, kim, "knows");

Davy --knows--> Kim

Peter

Page 58: Introduction to Graph Databases @ SAI

travel through time

Date checkpoint = new Date();

davy.setProperty("name", "David");

Edge e2 = fg.addEdge(davy, peter, "knows");

David --knows--> Kim

David --knows--> Peter

Page 59: Introduction to Graph Databases @ SAI

travel through time

checkpoint: Davy --knows--> Kim; Peter (unconnected)

current (by default): David --knows--> Kim; David --knows--> Peter

Page 60: Introduction to Graph Databases @ SAI

travel through time

checkpoint: Davy --knows--> Kim; Peter (unconnected)

current: David --knows--> Kim; David --knows--> Peter

fg.setCheckpointTime(checkpoint);
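The idea behind reading a graph "as of" a checkpoint can be illustrated with a single property that keeps every value it ever had, keyed by write time. This is a sketch of the concept only, not how FluxGraph or Datomic implement it; the class name, the method names and the use of plain `long` timestamps are assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

/** A property whose values are kept per write time, so reads can go back in time. */
final class VersionedPropertySketch {

    private final TreeMap<Long, String> valuesByTime = new TreeMap<>();

    /** Records a new value; earlier values stay available. */
    void set(long timestamp, String value) {
        valuesByTime.put(timestamp, value);
    }

    /** Returns the value that was current at the given time, or null if none existed yet. */
    String getAsOf(long timestamp) {
        Map.Entry<Long, String> entry = valuesByTime.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    /** Returns the most recent value, the default view when no checkpoint is set. */
    String getCurrent() {
        return valuesByTime.isEmpty() ? null : valuesByTime.lastEntry().getValue();
    }

    public static void main(String[] args) {
        VersionedPropertySketch name = new VersionedPropertySketch();
        name.set(100L, "Davy");
        long checkpoint = 150L;          // taken before the rename
        name.set(200L, "David");
        System.out.println(name.getAsOf(checkpoint)); // value at the checkpoint
        System.out.println(name.getCurrent());        // value now
    }
}
```

Nothing is overwritten: a rename only appends a newer value, so any earlier state can be reproduced by choosing the read time.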

Page 62: Introduction to Graph Databases @ SAI

time-scoped iteration

Davy (t1) --change--> Davy’ (t2) --change--> Davy’’ (t3) --change--> Davy’’’ (tcurrent)

➡ how to find the version of the vertex you are interested in?

Page 68: Introduction to Graph Databases @ SAI

time-scoped iteration

Davy (t1) <--previous/next--> Davy’ (t2) <--previous/next--> Davy’’ (t3) <--previous/next--> Davy’’’ (tcurrent)

Vertex previousDavy = davy.getPreviousVersion();

Iterable<Vertex> allDavy = davy.getNextVersions();

Iterable<Vertex> selDavy = davy.getPreviousVersions(filter);

Interval valid = davy.getTimerInterval();
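A version chain like the one pictured can be modelled as a doubly linked list in which every version knows its predecessor, its successor and the interval during which it was valid. The following plain-Java sketch is an illustration under that assumption and says nothing about FluxGraph's internals; all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** One version of a vertex, linked to the versions before and after it. */
final class VertexVersionSketch {

    final String name;
    final long validFrom;
    long validTo = Long.MAX_VALUE; // open-ended until a newer version exists
    VertexVersionSketch previous;
    VertexVersionSketch next;

    VertexVersionSketch(String name, long validFrom) {
        this.name = name;
        this.validFrom = validFrom;
    }

    /** Appends a newer version and closes this version's validity interval. */
    VertexVersionSketch change(String newName, long timestamp) {
        VertexVersionSketch newer = new VertexVersionSketch(newName, timestamp);
        newer.previous = this;
        this.next = newer;
        this.validTo = timestamp;
        return newer;
    }

    /** All earlier versions, nearest first. */
    List<VertexVersionSketch> previousVersions() {
        List<VertexVersionSketch> versions = new ArrayList<>();
        for (VertexVersionSketch v = previous; v != null; v = v.previous) {
            versions.add(v);
        }
        return versions;
    }

    /** All later versions, nearest first. */
    List<VertexVersionSketch> nextVersions() {
        List<VertexVersionSketch> versions = new ArrayList<>();
        for (VertexVersionSketch v = next; v != null; v = v.next) {
            versions.add(v);
        }
        return versions;
    }

    public static void main(String[] args) {
        VertexVersionSketch t1 = new VertexVersionSketch("Davy", 1L);
        VertexVersionSketch current = t1.change("Davy'", 2L).change("Davy''", 3L).change("Davy'''", 4L);
        for (VertexVersionSketch v : current.previousVersions()) {
            System.out.println(v.name + " valid from " + v.validFrom + " to " + v.validTo);
        }
    }
}
```

A filter such as the one passed to `getPreviousVersions(filter)` would then simply be a predicate applied while walking the chain.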

Page 69: Introduction to Graph Databases @ SAI

temporal graph comparison

current: David --knows--> Kim; David --knows--> Peter

checkpoint: Davy --knows--> Kim; Peter (unconnected)

what changed?

Page 70: Introduction to Graph Databases @ SAI

temporal graph comparison

➡ difference(A, B) = union(A, B) - B

➡ ... as an (immutable) graph!

difference(current, checkpoint) = David, knows
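The set formula can be tried out on a graph flattened into a set of facts. In the sketch below each vertex property and each edge is encoded as a string such as `v1.name=David` or `v1-knows->v3`; this encoding, the vertex ids and all class and method names are assumptions chosen for illustration, not FluxGraph's representation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

/** Temporal graph comparison on graphs flattened into sets of facts. */
final class GraphDifferenceSketch {

    /** difference(A, B) = union(A, B) - B; the returned set cannot be modified. */
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.addAll(b);    // union(A, B)
        result.removeAll(b); // ... - B
        return Collections.unmodifiableSet(result);
    }

    /** Current state: v1 has been renamed to David and also knows v3 (Peter). */
    static Set<String> current() {
        return new LinkedHashSet<>(Arrays.asList(
                "v1.name=David", "v2.name=Kim", "v3.name=Peter",
                "v1-knows->v2", "v1-knows->v3"));
    }

    /** Checkpoint state: v1 is still called Davy and only knows v2 (Kim). */
    static Set<String> checkpoint() {
        return new LinkedHashSet<>(Arrays.asList(
                "v1.name=Davy", "v2.name=Kim", "v3.name=Peter",
                "v1-knows->v2"));
    }

    public static void main(String[] args) {
        // Facts that hold now but did not hold at the checkpoint.
        System.out.println(difference(current(), checkpoint()));
    }
}
```

Swapping the arguments gives the facts that held at the checkpoint but no longer hold now.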

Page 71: Introduction to Graph Databases @ SAI

use case: longitudinal patient data

t1: patient

t2: patient, smoking

t3: patient, smoking

t4: patient, cancer

t5: patient, cancer, death

Page 72: Introduction to Graph Databases @ SAI

use case: longitudinal patient data

➡ historical data for 15,000 patients over a period of 10 years (2001-2010)

➡ example analysis:

★ if a male patient is no longer smoking in 2005

★ what are the chances of getting lung cancer in 2010, comparing

• patients that smoked before 2005

• patients that never smoked

Page 73: Introduction to Graph Databases @ SAI

FluxGraph

➡ available on github

http://github.com/datablend/fluxgraph

Page 74: Introduction to Graph Databases @ SAI

Open Innovation Networking Tool

➡ Many different projects, many different partners, many different domains ...

★ how do we keep track?

★ how can we learn from the data?

➡ Store the data in its most natural form, a graph

➡ Use graph algorithms to identify the importance of each node and its related nodes
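The slides do not say which graph algorithms were used. PageRank is one common way to score node importance and is used here purely as an example. The sketch below is plain Java; the project and partner names are hypothetical, and it assumes that every link target also appears as a key in the adjacency map.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Iterative PageRank over a directed adjacency list; names and data are illustrative. */
final class PageRankSketch {

    /** Assumes every link target is also a key of the map (possibly with no links). */
    static Map<String, Double> pageRank(Map<String, List<String>> links, double damping, int iterations) {
        int n = links.size();
        Map<String, Double> rank = new LinkedHashMap<>();
        for (String node : links.keySet()) {
            rank.put(node, 1.0 / n); // start with a uniform distribution
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new LinkedHashMap<>();
            for (String node : links.keySet()) {
                next.put(node, (1.0 - damping) / n);
            }
            for (Map.Entry<String, List<String>> entry : links.entrySet()) {
                List<String> targets = entry.getValue();
                double share = rank.get(entry.getKey());
                if (targets.isEmpty()) {
                    // a node without outgoing links spreads its rank over all nodes
                    for (String node : links.keySet()) {
                        next.merge(node, damping * share / n, Double::sum);
                    }
                } else {
                    for (String target : targets) {
                        next.merge(target, damping * share / targets.size(), Double::sum);
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    /** Hypothetical network: two projects pointing at the partners involved in them. */
    static Map<String, List<String>> sampleNetwork() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("projectA", Arrays.asList("partnerX"));
        links.put("projectB", Arrays.asList("partnerX", "partnerY"));
        links.put("partnerX", Collections.<String>emptyList());
        links.put("partnerY", Collections.<String>emptyList());
        return links;
    }

    public static void main(String[] args) {
        System.out.println(pageRank(sampleNetwork(), 0.85, 50));
    }
}
```

In this toy network the partner involved in both projects ends up with the highest score, which is the kind of signal such a tool could surface.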

Page 77: Introduction to Graph Databases @ SAI

More graphs ...

➡ pharma

➡ geospatial

➡ dependency analysis

➡ ontology

➡ ...

Page 78: Introduction to Graph Databases @ SAI

Questions?