Real Time Graph Computations in Storm, Neo4j, Python - PyCon India 2013


Upload: sonal-raj

Post on 01-Nov-2014


DESCRIPTION

This talk briefly outlines the Storm framework and the Neo4j graph database, and shows how to combine them to perform computations on complex graphs in Python using the Petrel and Py2neo packages. This talk was given at PyCon India 2013.

TRANSCRIPT

Page 1: Real Time Graph Computations in Storm, Neo4J, Python - PyCon India 2013

Real-Time Stream Computation on Graphs Using Storm, Neo4j and Python

Sonal Raj

http://www.sonalraj.com

Presented at PyCon India 2013

Bangalore, India

Copyrights © 2013, Sonal Raj, http://www.sonalraj.com

Introduction

• With data multiplying each day, storage and knowledge extraction are major concerns.
• Social data analysis, business intelligence
• Constraints of real-time and fault-tolerant processing

In this Talk

• A look at Storm as a distributed computation framework
• Neo4j as a NoSQL graph database
• Some cool pictures
• What are we trying to achieve?

Disclaimer!

• This talk presents an overview of Storm and Neo4j, with fewer of the dirty details.
• I’m going to go pretty fast... please hang on.

Part 1

Storm – The Hadoop of Real Time

Don’t we have Hadoop?

Storm vs. Hadoop

Storm and Hadoop both offer:
• Distributed processing
• Fault tolerance

Storm vs. Hadoop

Hadoop:
• Large but finite jobs
• Processes a lot of data at once
• High latency

Storm:
• Infinite computations called topologies
• Processes infinite streams of data one tuple at a time
• Low latency

So, what Storm gives us...

• Real-time computations
• Guaranteed data processing
• Horizontal scalability and fault tolerance
• No intermediate message brokers
• A higher abstraction than message passing, so it makes sense!

A little deeper... Concepts

Streams

[Figure: a stream drawn as a row of tuples]

An unbounded sequence of tuples

So, what kind of a tuple is this?

Spouts

A source of streams

But what is the source for the spouts?

Bolts

Computational units that process input streams and produce new streams

Just one stream?

Topologies

A network of spouts and bolts
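The three concepts above (streams, spouts, bolts) and the way a topology wires them together can be modelled in a few lines of plain Python. This is only a single-process sketch under stated assumptions, not Storm's or Petrel's API: the names `number_spout`, `square_bolt`, `even_bolt` and `run_topology` are hypothetical, and a finite range stands in for an unbounded stream.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# A tuple here is a plain Python tuple of values; in Storm the fields are named.
StormTuple = Tuple[int, ...]
Bolt = Callable[[Iterator[StormTuple]], Iterator[StormTuple]]


def number_spout(values: Iterable[int]) -> Iterator[StormTuple]:
    """Hypothetical spout: turns an external source into a stream of tuples."""
    for value in values:
        yield (value,)


def square_bolt(stream: Iterator[StormTuple]) -> Iterator[StormTuple]:
    """Hypothetical bolt: consumes one stream and produces a new stream."""
    for (value,) in stream:
        yield (value, value * value)


def even_bolt(stream: Iterator[StormTuple]) -> Iterator[StormTuple]:
    """Hypothetical bolt: keeps only tuples whose squared value is even."""
    for value, squared in stream:
        if squared % 2 == 0:
            yield (value, squared)


def run_topology(spout: Iterator[StormTuple], bolts: List[Bolt]) -> List[StormTuple]:
    """Chain a spout through a list of bolts and collect the final stream."""
    stream = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)


result = run_topology(number_spout(range(1, 6)), [square_bolt, even_bolt])
print(result)
```

A real topology is a graph rather than a chain, and a bolt may subscribe to several streams; the linear chain is used here only to keep the sketch short.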

Is that it...?

Tasks and Parallelism

A spout or bolt can execute as multiple tasks across the cluster.

Mr. Tuple: “Oh shoot, where do I go now?”

Groupings... to the rescue of Mr. Tuple!

• Shuffle Grouping # picks a random task
• Fields Grouping # mod hashing on a subset of tuple fields
• All Grouping # sends to all tasks
• Global Grouping # picks the task with the lowest task id
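The four groupings listed above can be sketched as small routing functions that each return the list of task indices a tuple should be sent to. This is an illustration only: the function names are hypothetical, and `zlib.crc32` is an assumed stand-in for whatever hash Storm uses internally (Python's built-in `hash()` of a string is salted per process, so it would not give stable routing across workers).

```python
import random
import zlib
from typing import Dict, List, Sequence


def shuffle_grouping(num_tasks: int, rng: random.Random) -> List[int]:
    """Pick one task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup: Dict[str, str], fields: Sequence[str], num_tasks: int) -> List[int]:
    """Hash the chosen fields and take the result modulo the task count."""
    key = "|".join(str(tup[name]) for name in fields).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(num_tasks: int) -> List[int]:
    """Send the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(task_ids: Sequence[int]) -> List[int]:
    """Send the tuple to the task with the lowest id."""
    return [min(task_ids)]


rng = random.Random(0)
event = {"user": "alice", "action": "add_friend"}
same_user = {"user": "alice", "action": "unfriend"}

print(shuffle_grouping(4, rng))
# Tuples that agree on the grouped field always land on the same task.
print(fields_grouping(event, ["user"], 4) == fields_grouping(same_user, ["user"], 4))
print(all_grouping(4))
print(global_grouping([7, 3, 5]))
```

The fields grouping is the one that matters for stateful bolts: every tuple for the same user reaches the same task, so per-user state can live in that task's memory.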

A Storm Cluster

[Figure: a Storm cluster with one Nimbus node, three ZooKeeper nodes and five Supervisor nodes]

If this were Hadoop, Nimbus would be the JobTracker and the Supervisors would be the TaskTrackers.

But it’s NOT Hadoop! (diagram label: “Coordinates everything”)

Salient Features...

• Storm > 0.7 supports transactional topologies
  Processes small batches of tuples
  If a failure occurs during commit, both the batch and the commit are retried
• Storm guarantees message processing using acknowledgements
• Petrel by AirSage is a Python wrapper for Storm; you can write and submit topologies in Python.
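The acknowledgement idea can be illustrated with a toy spout that keeps every emitted tuple in a pending table until it is acked, and re-queues it when it fails. This is a simplified sketch, not Storm's mechanism in full: real Storm tracks tuple trees with acker tasks and fails tuples on a timeout, neither of which is modelled here, and `ReplayingSpout` and `process` are hypothetical names.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


class ReplayingSpout:
    """Hypothetical spout that remembers every tuple until it is acknowledged."""

    def __init__(self, messages: List[str]) -> None:
        self.queue: Deque[Tuple[int, str]] = deque(enumerate(messages))
        self.pending: Dict[int, str] = {}

    def next_tuple(self) -> Optional[Tuple[int, str]]:
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload  # held until ack() or fail()
        return msg_id, payload

    def ack(self, msg_id: int) -> None:
        del self.pending[msg_id]  # fully processed, safe to forget

    def fail(self, msg_id: int) -> None:
        # Put the tuple back at the end of the queue so it is replayed later.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def process(spout: ReplayingSpout, fail_once: int) -> List[str]:
    """Drain the spout; the tuple with id `fail_once` fails on its first attempt."""
    processed: List[str] = []
    already_failed = False
    while True:
        item = spout.next_tuple()
        if item is None:
            break
        msg_id, payload = item
        if msg_id == fail_once and not already_failed:
            already_failed = True
            spout.fail(msg_id)
            continue
        processed.append(payload)
        spout.ack(msg_id)
    return processed


spout = ReplayingSpout(["signup", "add_friend", "unfriend"])
processed = process(spout, fail_once=1)
print(processed)
```

Note that the replayed tuple arrives out of order. Plain acknowledgements give at-least-once processing without ordering, which is one reason transactional topologies process tuples in batches and retry the batch together with its commit.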

Part 2

Neo4j – “Get Graphed”

This is how graph data was represented in an RDBMS.

Enter NoSQL Databases

Types of NoSQL Databases

[Figure: graph databases, document databases, column-family stores and key-value stores plotted by data size against data complexity]

Why NoSQL Databases

• Easily horizontally scalable
• Dynamic schemas; handle unstructured data really well
• Excel in speed and volume
• Trade off consistency for efficiency (except in graph databases... we’ll see why)
• A pleasure to code
• Free to use any query language (even SQL!)
• Downtime? What downtime?

The Property Graph Model of Graph Databases

• Core abstractions
  Nodes
  Relationships between nodes
  Properties on both
• Traversal framework
  High-performance queries on connected datasets
• Bindings
  REST, Gremlin, etc.
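The core abstractions above can be sketched with a few dataclasses: nodes and relationships each carry a property dictionary, and a traversal step follows relationships of a given type. This is a plain in-memory illustration, not Neo4j's storage model or the Py2neo API; `PropertyGraph`, `relate` and `neighbours` are hypothetical names, and the sample people are made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    node_id: int
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    start: int
    end: int
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PropertyGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_node(self, node_id: int, **properties: Any) -> Node:
        node = Node(node_id, dict(properties))
        self.nodes[node_id] = node
        return node

    def relate(self, start: int, rel_type: str, end: int, **properties: Any) -> Relationship:
        rel = Relationship(start, end, rel_type, dict(properties))
        self.relationships.append(rel)
        return rel

    def neighbours(self, node_id: int, rel_type: str) -> List[Node]:
        """One traversal step: follow outgoing relationships of a given type."""
        return [
            self.nodes[rel.end]
            for rel in self.relationships
            if rel.start == node_id and rel.rel_type == rel_type
        ]


graph = PropertyGraph()
graph.add_node(1, name="Asha")
graph.add_node(2, name="Bala")
graph.add_node(3, name="Chitra")
graph.relate(1, "FRIEND", 2, since=2012)
graph.relate(2, "FRIEND", 3, since=2013)

names = [node.properties["name"] for node in graph.neighbours(1, "FRIEND")]
print(names)
```

Scanning a flat relationship list is linear in the number of relationships; a graph database avoids that by storing each node's relationships with the node, which is what makes deep traversals fast.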

Neo4j

• Fully ACID, with rollback support (unbelievable!)
• Schema-less, with efficient storage of semi-structured data
• Fast deep traversals instead of slow SQL queries that span many table joins
• Whiteboard friendly
• Very natural to express graph-related problems with traversals (recommendation engines, shortest path, etc.)

Neo4j Pythonized!

• Py2neo is an excellent binding for Neo4j
• Accesses Neo4j using its RESTful API
• Still under development... features like labels are yet to be included!

So, Will Relational Databases Be Extinct?

OOPS!

Categories of Graph Data

• Social networks
• Citations
• Product co-purchasing
• Internet peer-to-peer
• Road network and map data
• Web graphs

An excellent source of sample graph data: http://snap.Stanford.edu/data/

Part 3

Get your hands dirty!

A demo...

• Sample social network data set
• Data includes people’s sign-up info, adding friends, unfriending, etc., for a month’s activity
• Neo4j
  Stores and updates the social data
• Storm
  Calculates the “friendship-index”

A demo...

• “friendship-index”
  n = the number of people through whom person “A” is connected to person “B”
  Gives an idea of how close two people are!
  Useful while searching for friends on social networks (something like the friends-of-friends concept in Facebook’s Graph Search)
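One way to read the definition above is as a shortest-path problem: n is the number of intermediate people on the shortest friend path between A and B. The sketch below computes that with a breadth-first search over an in-memory adjacency map. It is an assumption about how the index could be computed, not the code from the talk; `friendship_index` and the sample network are hypothetical, and the choice to return 0 for direct friends and `None` for unconnected people is mine.

```python
from collections import deque
from typing import Dict, Optional, Set


def friendship_index(friends: Dict[str, Set[str]], a: str, b: str) -> Optional[int]:
    """Number of intermediate people on the shortest friend path from a to b.

    Returns 0 for direct friends and None when no path exists (or when a == b,
    which this sketch treats as undefined).
    """
    if a == b or a not in friends or b not in friends:
        return None
    visited = {a}
    queue = deque([(a, 0)])  # (person, number of hops from a)
    while queue:
        person, hops = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == b:
                # The path a -> ... -> person -> b passes through `hops` people.
                return hops
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, hops + 1))
    return None


network = {
    "A": {"C"},
    "C": {"A", "D"},
    "D": {"C", "B"},
    "B": {"D"},
    "E": set(),
}

print(friendship_index(network, "A", "B"))
print(friendship_index(network, "A", "C"))
print(friendship_index(network, "A", "E"))
```

Breadth-first search visits people in order of distance, so the first time B is reached the path is guaranteed to be a shortest one.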

The Topology...

[Figure: topology diagram with two sources, an UpdateSpout feeding an UpdateBolt, and a QuerySpout feeding a QueryBolt]

Update Spout

• Defines what kind of tuples are emitted
• Gets and emits tuple streams

Update Bolt

• Objects for database access and the indexing service
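The update side of the demo can be sketched as a function that applies one event tuple to the stored graph. In the talk the store is Neo4j, presumably reached through Py2neo; here a plain dictionary of friend sets stands in for the database so the example runs on its own. The event layout `(action, person, other)` and the action names `signup`, `add_friend` and `unfriend` are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical event tuple: (action, person, other person or "" when unused).
Event = Tuple[str, str, str]


def apply_update(friends: Dict[str, Set[str]], event: Event) -> None:
    """Apply one update tuple to the in-memory stand-in for the graph store."""
    action, person, other = event
    if action == "signup":
        friends.setdefault(person, set())
    elif action == "add_friend":
        # Friendship is stored in both directions.
        friends.setdefault(person, set()).add(other)
        friends.setdefault(other, set()).add(person)
    elif action == "unfriend":
        friends.get(person, set()).discard(other)
        friends.get(other, set()).discard(person)
    else:
        raise ValueError(f"unknown action: {action}")


store: Dict[str, Set[str]] = {}
events = [
    ("signup", "asha", ""),
    ("signup", "bala", ""),
    ("add_friend", "asha", "bala"),
    ("add_friend", "bala", "chitra"),
    ("unfriend", "asha", "bala"),
]
for event in events:
    apply_update(store, event)

print(sorted(store))
print(sorted(store["bala"]))
```

Each event touches only the two people it names, which is why the updates suit one-tuple-at-a-time stream processing.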

Query Spout

• The tuple to be emitted can contain multiple entities.

Query Bolt

• Retrieves the caller friend and requested friend ids as per the database

Create Topology

• Imports all spout and bolt files
• Unfortunately, there was no option in Petrel to turn off console debug output, so the console view is really messy.

Topology.yaml

Configurations for the topology are specified in this file.

A little more...

[Figure: the topology diagram again, with two sources, UpdateSpout and UpdateBolt, QuerySpout and QueryBolt]

Final Thoughts

• A Storm-Neo4j framework is a boon for real-time graph computations.
• It is quite flexible in Java; the Python bindings and implementations still have a long way to go.
• If you are an admin or developer, analyse your data and computing requirements before narrowing down on a framework.

...to play with Storm and Neo4j

• My PyCon talk repo (slides, code skeletons, etc.): http://www.sonalraj.com/neo-storm.html
• Storm documentation (official): http://github.com/nathanmarz/storm
• Storm book: http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010
• Deployment of Storm on AWS: http://github.com/nathanmarz/storm-deploy
• Neo4j documentation: http://www.neo4j.org

Ex-terminated...

• That’s it
• Thanks for listening!
• Questions