Introduction to Graph Databases @ SAI

Post on 26-Jan-2015


Thursday 23.5

graph databases

about me

who am i ...

Davy Suvee (@DSUVEE)

➡ big data architect @ datablend - continuum
• provide big data and nosql consultancy
• share practical knowledge and big data use cases via blog

Big Data

2-3 years ago ...

Nowadays ...

Big Data

What is big data ...

... large and complex data sets that are difficult to process with traditional database management tools ...


➡ store (nosql)

➡ enrich (data mining, ml, nlp, ... )

➡ visualize (d3, gephi, mapbox, tableau, ... )

➡ process/analyze (map/reduce, cep, storm, ... )

➡ Volume: data exceeds the limits of vertically scalable tools, requiring novel storage solutions

➡ Variety: data takes different formats that make integration complex and expensive

➡ Velocity: data analysis time windows are small compared to the speed of data acquisition

The world has changed ...

Tackling the volume problem ...

What we are currently doing ...

➡ Throwing our data away :-(

➡ Storing preprocessed data :-/

➡ Trying to store it anyway ;-(

But why?

Vertical Scaling (your database): € 2, € 3, € 4

Horizontal Scaling (NoSQL): € x #nodes

Tackling the variety problem ...

Video

Audio

Social streams

Log files

Text

Massive, Unstructured

Your database: one, schema-structured model

NoSQL: best-fit, schema-less model

Key-Value Databases

Document-Based Databases

Graph Databases

Wide-column Databases

AS IS ...

Tackling the velocity problem ...

We want to ...

➡ Collect

➡ Process

➡ Query

➡ Analyze

MASSIVE amounts of Unstructured data in Real-Time

Your stack (slow and outdated information): APP, ETL, BI

NoSQL & Big Data (fast and realtime): APP, SYNC, Map-Reduce, BI (+ ANALYTICS)

graphs are everywhere ...

a little bit of graph theory ...

node/vertex: Davy (age = 33)
node/vertex: Datablend (btw = 123...)
node/vertex: Janssen (sector = pharma)
node/vertex: Kim (age = 26, gender = F)

edge: founded (in: 2011)
edge: worked_for (from: 2008, to: 2013)
edge: knows (since: 2013)
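The model above (nodes and edges that both carry key/value properties) can be sketched in a few lines of plain Java. This is only an illustration, not how any graph database stores its data; the class and method names (`PropertyGraphSketch`, `connect`, `buildExample`) are made up for this sketch. It uses the two edges whose endpoints the later slides spell out (Davy knows Kim, Davy founded Datablend).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal in-memory property graph; all names here are illustrative.
class PropertyGraphSketch {

    static class Node {
        final Map<String, Object> properties = new LinkedHashMap<>();
        final List<Edge> outgoing = new ArrayList<>();

        Node set(String key, Object value) {
            properties.put(key, value);
            return this;
        }
    }

    static class Edge {
        final Node from;
        final Node to;
        final String label;
        final Map<String, Object> properties = new LinkedHashMap<>();

        Edge(Node from, Node to, String label) {
            this.from = from;
            this.to = to;
            this.label = label;
        }

        Edge set(String key, Object value) {
            properties.put(key, value);
            return this;
        }
    }

    // Creates a directed, labeled edge and registers it on the source node.
    static Edge connect(Node from, Node to, String label) {
        Edge edge = new Edge(from, to, label);
        from.outgoing.add(edge);
        return edge;
    }

    // Builds part of the example graph and returns the "Davy" node.
    static Node buildExample() {
        Node davy = new Node().set("name", "Davy").set("age", 33);
        Node kim = new Node().set("name", "Kim").set("age", 26).set("gender", "F");
        Node datablend = new Node().set("name", "Datablend");
        connect(davy, kim, "knows").set("since", 2013);
        connect(davy, datablend, "founded").set("in", 2011);
        return davy;
    }

    public static void main(String[] args) {
        Node davy = buildExample();
        for (Edge edge : davy.outgoing) {
            System.out.println(davy.properties.get("name") + " -[" + edge.label + " "
                    + edge.properties + "]-> " + edge.to.properties.get("name"));
        }
    }
}
```

Because each node holds direct references to its outgoing edges, following a relationship is a pointer hop rather than a lookup in a separate table.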

Advantages ... ?

➡ whiteboard friendly
➡ schema-less
➡ index-free adjacency (no joins!)
➡ queries as traversals
➡ queries as pattern matching

Products/projects ... ?

➡ databases: neo4j, orientdb, allegrograph, dex, ...
➡ processing: pregel, giraph, hama, goldenorb, ...
➡ APIs: blueprints
➡ query languages: gremlin, cypher, sparql

Graph database 101 (neo4j)

GraphDatabaseService graph = ...

// note: embedded Neo4j requires writes to run inside a transaction (omitted on these slides)
Node davy = graph.createNode();
davy.setProperty("name", "Davy");

Node kim = graph.createNode();
kim.setProperty("name", "Kim");

Graph database 101 (neo4j)

enum RelTypes implements RelationshipType { KNOWS, WORKED_FOR, FOUNDED }

// Davy -[knows]-> Kim
Relationship davy_kim = davy.createRelationshipTo(kim, RelTypes.KNOWS);
davy_kim.setProperty("since", 2013);

Graph database 101 (neo4j)

// Davy -[founded]-> Datablend
Relationship davy_datablend = davy.createRelationshipTo(datablend, RelTypes.FOUNDED);
davy_datablend.setProperty("in", 2011);

➡ how to access the datablend node?

Graph database 101 (neo4j)

Index<Node> nodeIndex = graph.index().forNodes("nodes");

Node datablend = graph.createNode();
datablend.setProperty("name", "Datablend");
nodeIndex.add(datablend, "name", "Datablend");

Node found = nodeIndex.get("name", "Datablend").getSingle();

Graph database 101 (neo4j)

➡ find friends of my friends ...

TraversalDescription td = Traversal.description()
    .breadthFirst()
    .relationships(RelTypes.KNOWS, Direction.OUTGOING)
    .evaluator(Evaluators.toDepth(2));

Traverser traverser = td.traverse(davy);

for (Path path : traverser) { ... }
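To make the idea behind this traversal concrete, here is a rough plain-Java sketch of a breadth-first walk over outgoing "knows" edges that stops at depth 2. It does not use the Neo4j API and is not a description of how Neo4j implements traversals; the adjacency map, the class name `FriendsOfFriendsSketch` and the extra person "Anna" are assumptions made for illustration. Unlike a Neo4j traverser, this sketch leaves the start node out of the result.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class FriendsOfFriendsSketch {

    // Hypothetical data: Davy knows Kim, Kim knows Peter, Peter knows Anna.
    static Map<String, List<String>> exampleGraph() {
        Map<String, List<String>> knows = new HashMap<>();
        knows.put("Davy", Arrays.asList("Kim"));
        knows.put("Kim", Arrays.asList("Peter"));
        knows.put("Peter", Arrays.asList("Anna"));
        return knows;
    }

    // Returns everyone reachable from start within maxDepth hops,
    // in breadth-first order, visiting each person at most once.
    static List<String> reachable(Map<String, List<String>> knows, String start, int maxDepth) {
        List<String> result = new ArrayList<>();
        Map<String, Integer> depth = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        depth.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int currentDepth = depth.get(current);
            if (currentDepth == maxDepth) {
                continue; // do not expand beyond the depth limit
            }
            for (String next : knows.getOrDefault(current, Collections.<String>emptyList())) {
                if (!depth.containsKey(next)) {
                    depth.put(next, currentDepth + 1);
                    result.add(next);
                    queue.add(next);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(reachable(exampleGraph(), "Davy", 2));
    }
}
```

With the depth limit set to 2, Kim (one hop) and Peter (two hops) are returned, while Anna at three hops is never visited.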

Graph database 101 (neo4j)

➡ find friends of my friends ...

START davy=node:node_auto_index(name = "Davy")
MATCH davy-[:KNOWS]->()-[:KNOWS]->fof
RETURN davy, fof

ExecutionEngine engine = new ExecutionEngine(graph);
ExecutionResult result = engine.execute(query);
for (Map<String,Object> row : result) { ... }

Use cases ... ?

➡ recommendations
➡ access control
➡ routing
➡ social computing/networks
➡ genealogy

insights in big data

➡ typical approach through warehousing
★ star schema with fact tables and dimension tables


insights in big data

★ real-time visualization
★ filtering
★ metrics
★ layouting
★ modular [1, 2]

1. http://gephi.org/plugins/neo4j-graph-database-support/
2. http://github.com/datablend/gephi-blueprints-plugin

gene expression clustering

➡ oncology data set:
★ 4,800 samples
★ 27,000 genes

➡ Question:
★ for a particular subset of samples, which genes are co-expressed?

mongodb for storing gene expressions

{ "_id" : { "$oid" : "4f1fb64a1695629dd9d916e3" },
  "sample_name" : "122551hp133a21.cel",
  "genomics_id" : 122551,
  "sample_id" : 343981,
  "donor_id" : 143981,
  "sample_type" : "Tissue",
  "sample_site" : "Ascending colon",
  "pathology_category" : "MALIGNANT",
  "pathology_morphology" : "Adenocarcinoma",
  "pathology_type" : "Primary malignant neoplasm of colon",
  "primary_site" : "Colon",
  "expressions" : [ { "gene" : "X1_at", "expression" : 5.54217719084415 },
                    { "gene" : "X10_at", "expression" : 3.92335121981739 },
                    { "gene" : "X100_at", "expression" : 7.81638155662255 },
                    { "gene" : "X1000_at", "expression" : 5.44318512260619 },
                    … ] }

pearson correlation through map-reduce

pearson correlation

x    y
43   99
21   65
25   79
42   75
57   87
59   81

correlation: 0.52
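As a sanity check on the example values, the Pearson coefficient can be computed directly from running sums. The sketch below is a single-process illustration in plain Java, not the map-reduce job from the talk; the class name `PearsonSketch` is made up. The same five sums (x, y, xy, x², y²) are the kind of partial results a distributed job could accumulate per chunk and then combine.

```java
class PearsonSketch {

    // Pearson correlation from running sums over paired observations.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length == 0) {
            throw new IllegalArgumentException("x and y must be non-empty and of equal length");
        }
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double numerator = n * sumXY - sumX * sumY;
        double denominator = Math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY));
        return numerator / denominator;
    }

    public static void main(String[] args) {
        double[] x = { 43, 21, 25, 42, 57, 59 };
        double[] y = { 99, 65, 79, 75, 87, 81 };
        System.out.println(pearson(x, y));
    }
}
```

For the six pairs from the slide this evaluates to roughly 0.5298, which the slide shows truncated as 0,52.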

co-expression graph

➡ create a node for each gene
➡ if correlation between two genes >= 0.8, draw an edge between both nodes
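The thresholding rule can be sketched as a loop over a symmetric correlation matrix. This is a plain-Java illustration under stated assumptions: the class name `CoExpressionGraphSketch` is hypothetical, and the correlation values below are invented for the example (they are not results from the oncology data set). Only the gene identifiers are borrowed from the sample document.

```java
import java.util.ArrayList;
import java.util.List;

class CoExpressionGraphSketch {

    // Returns one undirected edge {geneA, geneB} for every pair whose
    // correlation reaches the threshold. Only the upper triangle is scanned,
    // so each pair appears once and self-correlations are skipped.
    static List<String[]> edges(String[] genes, double[][] correlation, double threshold) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < genes.length; i++) {
            for (int j = i + 1; j < genes.length; j++) {
                if (correlation[i][j] >= threshold) {
                    result.add(new String[] { genes[i], genes[j] });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] genes = { "X1_at", "X10_at", "X100_at" };
        // Made-up correlation values, for illustration only.
        double[][] correlation = {
            { 1.00, 0.91, 0.35 },
            { 0.91, 1.00, 0.80 },
            { 0.35, 0.80, 1.00 }
        };
        for (String[] edge : edges(genes, correlation, 0.8)) {
            System.out.println(edge[0] + " -- " + edge[1]);
        }
    }
}
```

Scanning all pairs is quadratic in the number of genes, which is one reason the correlation step is a natural candidate for distributed processing.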

mutation prevalence

analyzing running data

<trkpt lon="4.723870977759361" lat="51.075748661533">
    <ele>29.799999237060547</ele>
    <time>2011-11-08T19:18:39.000Z</time>
</trkpt>
<trkpt lon="4.724105251953006" lat="51.075623352080584">
    <ele>29.799999237060547</ele>
    <time>2011-11-08T19:18:45.000Z</time>
</trkpt>
<trkpt lon="4.724143054336309" lat="51.07560558244586">
    <ele>29.799999237060547</ele>
    <time>2011-11-08T19:18:46.000Z</time>
</trkpt>

analyzing running data through neo4j

➡ using neo4j spatial extension

➡ create a node for each tracked point

List<GeoPipeFlow> closests = GeoPipeline
    .startNearestNeighborLatLonSearch(runningLayer, to, 0.02)
    .sort("OrthodromicDistance")
    .getMin("OrthodromicDistance")
    .toList();

➡ connect succeeding tracking nodes in a graph
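Connecting succeeding track points amounts to walking the list in order and treating each consecutive pair as one edge. The sketch below does this in plain Java and attaches a great-circle (orthodromic) length to every segment using the haversine formula. It does not use Neo4j Spatial; the class name `TrackSegmentsSketch`, the spherical Earth radius of 6,371 km and the choice of haversine are assumptions made for this illustration.

```java
class TrackSegmentsSketch {

    // Mean Earth radius in meters (spherical approximation).
    static final double EARTH_RADIUS_METERS = 6371000.0;

    // Great-circle distance between two lat/lon points, haversine formula.
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
    }

    // Each consecutive pair of points {lat, lon} becomes one segment (edge);
    // the result holds the length of every segment in meters.
    static double[] segmentLengths(double[][] points) {
        double[] lengths = new double[Math.max(0, points.length - 1)];
        for (int i = 0; i + 1 < points.length; i++) {
            lengths[i] = haversineMeters(points[i][0], points[i][1],
                    points[i + 1][0], points[i + 1][1]);
        }
        return lengths;
    }

    public static void main(String[] args) {
        // The three track points from the GPX fragment, as {lat, lon}.
        double[][] points = {
            { 51.075748661533, 4.723870977759361 },
            { 51.075623352080584, 4.724105251953006 },
            { 51.07560558244586, 4.724143054336309 }
        };
        for (double length : segmentLengths(points)) {
            System.out.println(length + " m");
        }
    }
}
```

Dividing a segment length by the time difference between its two points gives the pace for that stretch, which is the sort of property that could be stored on the connecting edge.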


analyzing google analytics data

➡ source url -> target url

graphs and time ...

➡ fluxgraph: a blueprints-compatible graph on top of Datomic

➡ reproducible graph state

➡ towards a time-aware graph ...

➡ make FluxGraph fully time-aware
★ travel your graph through time
★ time-scoped iteration of vertices and edges
★ temporal graph comparison

travel through time

FluxGraph fg = new FluxGraph();

Vertex davy = fg.addVertex();
davy.setProperty("name", "Davy");
Vertex kim = ...
Vertex peter = ...

Edge e1 = fg.addEdge(davy, kim, "knows");

Date checkpoint = new Date();

davy.setProperty("name", "David");
Edge e2 = fg.addEdge(davy, peter, "knows");

➡ checkpoint: Davy knows Kim; Peter has no edges yet
➡ current: David knows Kim, David knows Peter
➡ by default, the graph is read at the current time
➡ fg.setCheckpointTime(checkpoint); reads the graph as it was at the checkpoint
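The effect of reading at a checkpoint can be illustrated with a tiny versioned property: every write is stored under its timestamp, and a read returns the latest value at or before the requested time. This is a plain-Java sketch of the general idea only; it says nothing about how FluxGraph or Datomic implement it, and the class `VersionedPropertySketch` and its methods are hypothetical. Logical timestamps (1, 2, 3) stand in for real dates.

```java
import java.util.Map;
import java.util.TreeMap;

class VersionedPropertySketch {

    // Every write is kept, keyed by the (logical) time it happened.
    private final TreeMap<Long, String> history = new TreeMap<>();

    void set(long time, String value) {
        history.put(time, value);
    }

    // Value as it was at the given time, or null if nothing was set yet.
    String valueAt(long time) {
        Map.Entry<Long, String> entry = history.floorEntry(time);
        return entry == null ? null : entry.getValue();
    }

    // Latest value, i.e. a read at the current time.
    String current() {
        return history.isEmpty() ? null : history.lastEntry().getValue();
    }

    public static void main(String[] args) {
        VersionedPropertySketch name = new VersionedPropertySketch();
        name.set(1L, "Davy");
        long checkpoint = 2L;      // taken after the first write
        name.set(3L, "David");     // later change

        System.out.println("current:    " + name.current());
        System.out.println("checkpoint: " + name.valueAt(checkpoint));
    }
}
```

Nothing is overwritten, so a read at the checkpoint still sees "Davy" while a read at the current time sees "David".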

time-scoped iteration

t1: Davy → t2: Davy' → t3: Davy'' → tcurrent: Davy''' (each change creates a new version; versions are linked by next / previous)

➡ how to find the version of the vertex you are interested in?

Vertex previousDavy = davy.getPreviousVersion();
Iterable<Vertex> allDavy = davy.getNextVersions();
Iterable<Vertex> selDavy = davy.getPreviousVersions(filter);
Interval valid = davy.getTimerInterval();

temporal graph comparison

current: David knows Kim, David knows Peter
checkpoint: Davy knows Kim; Peter has no edges yet

what changed?

➡ difference (A , B) = union (A , B) - B

➡ ... as a (immutable) graph!

difference (current, checkpoint) = David, plus the new knows edge
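The difference formula can be tried out on plain sets. In the sketch below each graph state is flattened into a set of string facts (one per property value or edge), which is an encoding invented for this illustration and not FluxGraph's representation; the class name `GraphDifferenceSketch` is hypothetical as well. The result is wrapped as an unmodifiable set to mirror the "immutable" remark.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

class GraphDifferenceSketch {

    // difference(A, B) = union(A, B) - B, returned as a read-only set.
    static Set<String> difference(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.addAll(b);      // union(A, B)
        result.removeAll(b);   // ... minus B
        return Collections.unmodifiableSet(result);
    }

    public static void main(String[] args) {
        // Graph states encoded as facts; the encoding is an assumption.
        Set<String> current = new LinkedHashSet<>(Arrays.asList(
                "davy.name=David", "davy-knows->kim", "davy-knows->peter"));
        Set<String> checkpoint = new LinkedHashSet<>(Arrays.asList(
                "davy.name=Davy", "davy-knows->kim"));

        System.out.println(difference(current, checkpoint));
    }
}
```

What remains is exactly what is present now but was not present at the checkpoint: the renamed vertex and the new knows edge to Peter.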

use case: longitudinal patient data

t1: patient
t2: patient, smoking
t3: patient, smoking
t4: patient, cancer
t5: patient, cancer, death

➡ historical data for 15,000 patients over a period of 10 years (2001-2010)

➡ example analysis:
★ if a male patient is no longer smoking in 2005
★ what are the chances of getting lung cancer in 2010, comparing patients that smoked before 2005 with patients that never smoked

FluxGraph

http://github.com/datablend/fluxgraph

➡ available on github

Open Innovation Networking Tool

➡ Many different projects, many different partners, many different domains ...
★ how do we keep track?
★ how can we learn from the data?

➡ Store the data in its most natural form, a graph

➡ use graph algorithms to identify the importance of each node and their related ones
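The slides do not say which graph algorithms were used. As one simple possibility, degree centrality scores a node by the number of edges touching it. The plain-Java sketch below assumes this measure purely for illustration; the class name `DegreeCentralitySketch` and the project and partner names are invented.

```java
import java.util.Map;
import java.util.TreeMap;

class DegreeCentralitySketch {

    // Counts, for every node, how many undirected edges touch it.
    static Map<String, Integer> degrees(String[][] edges) {
        Map<String, Integer> degree = new TreeMap<>();
        for (String[] edge : edges) {
            degree.merge(edge[0], 1, Integer::sum);
            degree.merge(edge[1], 1, Integer::sum);
        }
        return degree;
    }

    public static void main(String[] args) {
        // Hypothetical project/partner links, for illustration only.
        String[][] edges = {
            { "projectA", "partner1" },
            { "projectA", "partner2" },
            { "projectB", "partner1" }
        };
        System.out.println(degrees(edges));
    }
}
```

Here partner1 and projectA each take part in two links and would rank highest. Measures such as PageRank or betweenness refine this by also weighing how well connected a node's neighbours are.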


More graphs ...

➡ pharma
➡ geospatial
➡ dependency analysis
➡ ontology
➡ ...

Questions?

E-MAIL

info@datablend.be

Follow us

twitter.com/data_blend

www.datablend.be

0499/05.00.89

datablend - continuum
