Real-Time Graph Computations in Storm, Neo4j, Python - PyCon India 2013
DESCRIPTION
This talk briefly outlines the Storm framework and the Neo4J graph database, and shows how to combine them to perform computations on complex graphs in Python using the Petrel and Py2neo packages. This talk was given at PyCon India 2013.
TRANSCRIPT
Real-Time stream computation on graphs using Storm, Neo4j and Python
Sonal Raj
http://www.sonalraj.com
Presented at PyCon India 2013
Bangalore, India
Copyrights © 2013, Sonal Raj, http://www.sonalraj.com
Introduction
• With data multiplying each day, storage and knowledge extraction are major concerns.
• Social Data Analysis, Business Intelligence
• Constraints of Real Time and Fault-Tolerant Processing
In this Talk
• A look at Storm as a distributed computation framework
• Neo4J as a NoSQL graph database
• Some Cool Pictures
• What are we trying to achieve?
Disclaimer!
• This talk presents an overview of Storm and Neo4J, with fewer of the dirty details.
• I’m going to go pretty fast. Please hang on.
Part 1
Storm – The Hadoop of Real Time
Don’t we have Hadoop?
Storm v/s Hadoop
STORM and HADOOP both offer:
• Distributed Processing
• Fault Tolerance
Storm v/s Hadoop
HADOOP
• Large but Finite Jobs
• Processes a Lot of Data at Once
• High Latency
STORM
• Infinite Computations called Topologies
• Processes Infinite Streams of data one tuple at a time
• Low Latency
So, what does Storm give us?
• Real-Time Computations
• Guaranteed data Processing
• Horizontal Scalability and Fault-Tolerance
• No intermediate message Brokers
• Higher Abstraction than Message Passing, so it makes sense!
A little deeper: Concepts
Streams
Tuple Tuple Tuple Tuple Tuple
An unbounded sequence of Tuples
So, what kind of a tuple is this?
Spouts
A source of Streams
But, what is the source FOR the spouts?
Bolts
Computational units processing input streams and producing new streams
Just 1 stream?
Topologies
A network of spouts and bolts
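The four concepts above (stream, spout, bolt, topology) can be mimicked in a few lines of plain Python. This is only an in-process sketch built on generators, not Storm or Petrel code; the names `number_spout`, `square_bolt`, `even_filter_bolt` and `run_topology` are made up for illustration.

```python
import itertools


def number_spout():
    """Spout stand-in: an unbounded source of one-field tuples."""
    for n in itertools.count(1):
        yield (n,)


def square_bolt(stream):
    """Bolt stand-in: consumes one stream and emits a new stream."""
    for (n,) in stream:
        yield (n, n * n)


def even_filter_bolt(stream):
    """Bolt stand-in: passes on only the tuples whose square is even."""
    for n, sq in stream:
        if sq % 2 == 0:
            yield (n, sq)


def run_topology(limit):
    """Topology stand-in: wire spout -> bolt -> bolt and take `limit` tuples."""
    stream = even_filter_bolt(square_bolt(number_spout()))
    # The stream never ends by itself, so the caller has to cut it off.
    return list(itertools.islice(stream, limit))


if __name__ == "__main__":
    print(run_topology(3))
```

Because every stage is lazy, tuples flow through the chain one at a time, which is loosely the same processing model as a topology, minus the distribution across a cluster.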
Is that it?
Tasks and Parallelism
A spout or bolt can execute as multiple tasks across the cluster
Mr. Tuple: Oh shoot, where do I go now?
Groupings: To the rescue of Mr. Tuple!
• Shuffle Grouping #pick a random task
• Fields Grouping #mod hashing on a subset of tuple fields
• All Grouping #sends to all tasks
• Global Grouping #picks task with lowest task id
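The four grouping rules listed above can be written down as small task-selection functions. This is a sketch that follows the one-line descriptions on the slide, not Storm's actual implementation; the function names, the `task_ids` list and the use of CRC32 as the hash are assumptions made for the example.

```python
import random
import zlib


def shuffle_grouping(tup, task_ids, rng=random):
    """Pick one task at random."""
    return [rng.choice(task_ids)]


def fields_grouping(tup, task_ids, field_indexes):
    """Hash a subset of the tuple's fields, then take it modulo the task count.

    Tuples with equal values in those fields always reach the same task.
    """
    key = "|".join(str(tup[i]) for i in field_indexes)
    # CRC32 is used because Python's built-in hash() of str is salted per process.
    slot = zlib.crc32(key.encode("utf-8")) % len(task_ids)
    return [task_ids[slot]]


def all_grouping(tup, task_ids):
    """Send the tuple to every task."""
    return list(task_ids)


def global_grouping(tup, task_ids):
    """Send the tuple to the single task with the lowest id."""
    return [min(task_ids)]


if __name__ == "__main__":
    tasks = [3, 4, 5, 6]
    print(fields_grouping(("alice", "add", "bob"), tasks, [0]))
    print(global_grouping(("alice", "add", "bob"), tasks))
```

Fields grouping is the one that matters for stateful bolts: routing every tuple about the same person to the same task lets that task keep per-person state locally.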
A Storm Cluster
[Diagram: one NIMBUS node, three ZOOKEEPER nodes, five SUPERVISOR nodes]
If this were Hadoop: Nimbus would be the Job Tracker and a Supervisor would be a Task Tracker
But it’s NOT Hadoop! (diagram label: “Co-ordinates Everything”)
Salient Features
• Storm 0.7 and later supports Transactional Topologies: tuples are processed in small batches, and if there is a failure during commit, both the batch and the commit are retried
• Storm guarantees message processing using acknowledgements
• Petrel by AirSage is a Python wrapper for Storm; you can write and submit topologies in Python.
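On the acknowledgement bullet: Storm's documentation describes an "acker" that keeps one running XOR value per spout tuple. Every tuple id is XORed in once when the tuple is emitted and once more when it is acked, so the value returns to zero exactly when everything emitted has also been acked. The sketch below shows only that bookkeeping idea; the class `AckTracker` and its method names are hypothetical, the ids are small fixed numbers instead of random 64-bit values, and timeouts and replays are left out.

```python
class AckTracker:
    """Toy XOR bookkeeping for the tuple tree of a single spout tuple."""

    def __init__(self):
        self.ack_val = 0

    def emitted(self, tuple_id):
        """A tuple with this id was emitted somewhere in the tree."""
        self.ack_val ^= tuple_id

    def acked(self, tuple_id):
        """The tuple with this id was fully processed by a bolt."""
        self.ack_val ^= tuple_id

    def is_complete(self):
        """True once every emitted id has been XORed in exactly twice."""
        return self.ack_val == 0


if __name__ == "__main__":
    tracker = AckTracker()
    for tuple_id in (1, 2, 4):
        tracker.emitted(tuple_id)
    tracker.acked(1)
    tracker.acked(2)
    print(tracker.is_complete())  # one tuple is still outstanding
    tracker.acked(4)
    print(tracker.is_complete())
```

The appeal of this design is constant memory per spout tuple: the tracker never stores the tree itself, only one integer.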
Part 2
Neo4J – “Get Graphed”
This is how graph data was represented in an RDBMS.
ENTER, NOSQL DATABASES
Types of NoSQL Databases
[Figure: Data Size plotted against Data Complexity for four families of stores]
• Graph databases
• Document databases
• Column-Family
• Key-Value Stores
Why NOSQL Databases
• Easily horizontally scalable
• Dynamic schemas; handle unstructured data really well
• Excel in speed and volume
• Trade off consistency for efficiency (except in graph databases; we’ll see why)
• Pleasure to code
• Free to use any query language (even SQL!)
• Downtime? What Downtime?
The Property Graph Model of Graph Databases
• Core Abstractions
Nodes
Relationship between Nodes
Properties of both
• Traversal Framework
High Performance Queries on connected datasets
• Bindings
REST, Gremlin, etc.
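The core abstractions of the property graph model (nodes, relationships between nodes, properties on both) are small enough to sketch in plain Python. This is an in-memory illustration only, with no Neo4J or Py2neo calls; the class names, the `KNOWS` relationship type and the sample people are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    start: int          # id of the node the relationship leaves
    end: int            # id of the node the relationship enters
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, **properties):
        node = Node(len(self.nodes), properties)
        self.nodes[node.node_id] = node
        return node

    def relate(self, start, rel_type, end, **properties):
        rel = Relationship(start.node_id, end.node_id, rel_type, properties)
        self.relationships.append(rel)
        return rel

    def neighbours(self, node, rel_type):
        """One traversal step: follow outgoing relationships of one type."""
        return [self.nodes[r.end] for r in self.relationships
                if r.start == node.node_id and r.rel_type == rel_type]


if __name__ == "__main__":
    graph = PropertyGraph()
    alice = graph.add_node(name="Alice")
    bob = graph.add_node(name="Bob")
    graph.relate(alice, "KNOWS", bob, since=2012)
    print([n.properties["name"] for n in graph.neighbours(alice, "KNOWS")])
```

A real graph database stores relationships so that a traversal step does not scan every relationship as `neighbours` does here; the linear scan just keeps the sketch short.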
Neo4J
• Fully ACID with rollback support (unbelievable!)
• Schema-less, with efficient storage of semi-structured data
• Fast deep traversal instead of slow SQL queries that span many table joins
• Whiteboard Friendly
• Very natural to express graph-related problems with traversals (recommendation engine, shortest path, etc.)
Neo4J Pythonized!
• Py2Neo is an excellent binding for Neo4J
• Accesses Neo4J using its RESTful API
• Still under development. Features like labels are yet to be included!
So, will relational databases be extinct?
OOPS!
Categories of Graphical Data
• Social Networks
• Citations
• Product Co-Purchasing
• Internet peer-to-peer
• Road Network and Map Data
• Web Graphs
Excellent source of sample graphical data: http://snap.Stanford.edu/data/
Part 3
Get your hands dirty!
A demo
• Sample Social Network data set
• Data includes people's sign-up info, adding friends, unfriending, etc. for a month’s activity
• Neo4J: Store and update the social data
• Storm: Calculate “friendship-index”
• “friendship-index”
n = the number of people through whom person “A” is connected to person “B”
Gives an idea of how close two people are!
Useful while searching for friends on social networks (something like the friends-of-friends concept in Facebook’s Graph Search)
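One way to read the definition above is "the number of intermediate people on the shortest friend path from A to B", which a breadth-first search computes directly. That reading is an assumption, as are the function name, the adjacency-dict input and the choice to return `None` when no path exists; the demo's own code ran against Neo4J and is not reproduced here.

```python
from collections import deque


def friendship_index(friends, a, b):
    """Number of intermediate people on the shortest friend path from a to b.

    `friends` maps each person to the set of their direct friends.
    Direct friends give 0, friends of friends give 1, and so on.
    Returns None if the two people are not connected at all.
    """
    if a == b:
        return 0  # assumption: a person is trivially connected to themselves
    seen = {a}
    queue = deque([(a, 0)])  # (person, number of hops from a)
    while queue:
        person, hops = queue.popleft()
        for friend in friends.get(person, ()):
            if friend == b:
                return hops  # `hops` people lie strictly between a and b
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None


if __name__ == "__main__":
    network = {
        "A": {"X"},
        "X": {"A", "Y"},
        "Y": {"X", "B"},
        "B": {"Y"},
    }
    print(friendship_index(network, "A", "B"))
```

Breadth-first search visits people in order of distance, so the first time B is reached the path is guaranteed to be a shortest one.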
The Topology
[Diagram components: Source, UpdateSpout, UpdateBolt; Source, QuerySpout, QueryBolt]
Update Spout
• Defines what kind of tuples are emitted
• Gets and emits tuple streams
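The two responsibilities named on the slide (declaring the tuple fields, then fetching and emitting tuples) can be sketched as a plain Python class. This is a stand-in that does not import Storm or Petrel and is not the code shown in the talk; the class name, the method names, the field names `action`, `person`, `friend` and the whitespace-separated event format are all assumptions.

```python
class UpdateSpoutSketch:
    """Plain-Python stand-in for a spout that turns raw events into tuples."""

    def __init__(self, source):
        # `source` is any iterable of raw event lines, for example an open file.
        self._source = iter(source)

    @staticmethod
    def declare_output_fields():
        """Names of the fields carried by every emitted tuple."""
        return ["action", "person", "friend"]

    def next_tuple(self):
        """Read one raw event and return it as a tuple, or None when idle."""
        line = next(self._source, None)
        if line is None:
            return None
        parts = line.split()
        # Pad short events so that "signup alice" still yields three fields.
        parts += [None] * (3 - len(parts))
        return tuple(parts[:3])


if __name__ == "__main__":
    spout = UpdateSpoutSketch(["signup alice", "add alice bob"])
    print(spout.next_tuple())
    print(spout.next_tuple())
```

Returning `None` when the source is exhausted mirrors the idea that a spout is polled repeatedly and may simply have nothing to emit on a given call.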
Update Bolt
Objects for database access and indexing service
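To illustrate what "objects for database access and indexing" are used for, here is a hedged sketch of a bolt that applies update tuples to a store. A dict of adjacency sets stands in for the database and a second dict stands in for the index, so no Neo4J or Py2neo calls are made; the class name, the `process` method and the three-field tuple layout (action, person, friend) are assumptions.

```python
class UpdateBoltSketch:
    """Stand-in for a bolt that applies update tuples to a graph store."""

    ACTIONS = ("signup", "add", "remove")

    def __init__(self):
        self.index = {}    # person name -> node id (plays the indexing service)
        self.friends = {}  # node id -> set of friend node ids (plays the database)

    def _node_for(self, name):
        """Look the person up in the index, creating a node on first sight."""
        if name not in self.index:
            node_id = len(self.index)
            self.index[name] = node_id
            self.friends[node_id] = set()
        return self.index[name]

    def process(self, tup):
        action, person, friend = tup
        if action not in self.ACTIONS:
            raise ValueError("unknown action: %r" % (action,))
        a = self._node_for(person)
        if action == "signup":
            return
        b = self._node_for(friend)
        if action == "add":
            self.friends[a].add(b)
            self.friends[b].add(a)
        else:  # "remove"
            self.friends[a].discard(b)
            self.friends[b].discard(a)


if __name__ == "__main__":
    bolt = UpdateBoltSketch()
    bolt.process(("signup", "alice", None))
    bolt.process(("add", "alice", "bob"))
    print(bolt.index, bolt.friends)
```

Keeping a name-to-node index avoids scanning the whole graph to find a person each time an update arrives, which is the role an index plays in front of a graph database as well.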
Query Spout
The tuple to be emitted can contain multiple entities.
Query Bolt
• Retrieve the caller friend and requested friend ids
• Retrieve the caller friend and requested friend ids as per the database
Create Topology
• Import all spout and bolt files
• Unfortunately, there was no option in Petrel to turn off console debug, so the console view is really messy.
Topology.yaml
Configuration for the topology is specified in this file
A little More
[Diagram: the same topology components as before: Source, UpdateSpout, UpdateBolt; Source, QuerySpout, QueryBolt]
Final Thoughts
• A Storm-Neo4j framework is a boon for real-time graph computations
• Quite flexible in Java; Python bindings and implementations still have a long way to go.
• If you are an admin or developer, analyse your data and computing requirements before settling on a framework.
…to play with Storm and Neo4J
• My PyCon Talk Repo – slides, code skeletons, etc.: http://www.sonalraj.com/neo-storm.html
• Storm documentation (official): http://github.com/nathanmarz/storm
• Storm Book: http://www.amazon.com/Getting-Started-Storm-Jonathan-Leibiusky/dp/1449324010
• Deployment of Storm on AWS: http://github.com/nathanmarz/storm-deploy
• Neo4J Documentation: http://www.neo4j.org
Ex-terminated . . .
- That’s it
- Thanks for Listening!
- Questions