
A presentation based on the paper [ARIDHI 2013].


Distributed Graph Mining

Presented By

Sayeed Mahmud

Motivation

• The reason Big Data is here

– To make it feasible to process data that is impossible or overwhelming to handle with existing single-machine tools.

• Some graph databases may be too big for a single machine

– A distributed system handles them more easily by sharing the load.

• A graph database may itself be scattered around the globe

– For example, Google search records.

Distributed Graph Mining

• Partition-based

• Divide the problem into independent sub-problems

– Each node of the system can process its part independently

– Parallel processing

– Speeds up computation

– Enhances the scalability of solutions

Techniques

• MRPF

• MapReduce

– We are mainly interested in this

MapReduce

• A programming model for distributed platforms

• Proposed by Google

• Abundant open-source implementations

– Hadoop

• Divides the problem into sub-problems to be processed on the nodes

– Mapping

• Combines the processing results

– Reduce

MapReduce Example

• Problem: Find the frequency of a word in documents available on a system.

• Map: each node of the distributed system counts the word in its own document and emits a <word, count> pair:

<word, 2> <word, 1> <word, 2>

• Reduce: the emitted counts are summed into the final result:

<word, 2 + 1 + 2 = 5>
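
As a rough, single-process sketch of this job (no Hadoop involved; map_phase and reduce_phase are illustrative names, not a real API):

```python
from collections import defaultdict

def map_phase(documents, word):
    # Each "mapper" counts the word in one document and emits <word, count>.
    return [(word, doc.count(word)) for doc in documents]

def reduce_phase(pairs):
    # The "reducer" sums all counts emitted for the same key.
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

docs = ["... word ... word ...", "... word ...", "... word ... word ..."]
print(reduce_phase(map_phase(docs, "word")))  # {'word': 5}  <- 2 + 1 + 2
```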

Graph Mining using MapReduce

• Problem: Find the frequent subgraphs of a graph database in a MapReduce programming model (local support 2).

• Map: the graph dataset is partitioned across the distributed system, and each node runs gSpan on its own partition, producing local frequencies (here 3 and 2).

• Reduce: the local frequencies are summed into the global frequency (3 + 2 = 5).
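
A hedged sketch of how such a job could be wired together, assuming a hypothetical run_gspan helper that mines one partition and returns subgraph frequencies (a real job would invoke the gSpan miner on each mapper):

```python
from collections import defaultdict

def run_gspan(partition, local_support):
    # Hypothetical stand-in for running gSpan on one partition;
    # a real job would call the gSpan binary and parse its output.
    # Returns {canonical_subgraph_code: local_frequency}.
    raise NotImplementedError

def map_phase(partitions, local_support):
    # Each mapper mines its own partition and emits <subgraph, local_frequency>.
    for part in partitions:
        for subgraph, freq in run_gspan(part, local_support).items():
            yield subgraph, freq

def reduce_phase(pairs, global_support):
    # The reducer sums local frequencies per subgraph and keeps those
    # reaching the global support threshold (e.g. 3 + 2 = 5 above).
    totals = defaultdict(int)
    for subgraph, freq in pairs:
        totals[subgraph] += freq
    return {sg: n for sg, n in totals.items() if n >= global_support}
```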

Data Partitioning

• Performance and load balancing depend on the mapping portion

– Termed “partitioning”

– Decides which portion of the graph dataset goes to which node

– Loss of data and load balancing depend directly on the partitioning.

• Two approaches

– MRGP (MapReduce Graph Partitioning)

– DGP (Density-Based Graph Partitioning)

MRGP

• Followed in common MapReduce problems.

• Graphs are assigned to partitions sequentially, in input order.

• Simple.

Graph | Size (KB) | Density
G1    | 1         | 0.25
G2    | 2         | 0.5
G3    | 2         | 0.6
G4    | 1         | 0.25
G5    | 2         | 0.5
G6    | 2         | 0.5
G7    | 2         | 0.5
G8    | 2         | 0.6
G9    | 2         | 0.6
G10   | 2         | 0.7
G11   | 3         | 0.7
G12   | 3         | 0.8

4 partitions, 6 KB each:

– G1, G2, G3, G4
– G5, G6, G7
– G8, G9, G10
– G11, G12
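
A minimal sketch of this sequential chunking, assuming the per-graph sizes are known up front (mrgp_partition and its parameters are illustrative names):

```python
def mrgp_partition(graphs, sizes_kb, chunk_kb):
    # Assign graphs to partitions in input order, starting a new
    # partition once the current one would exceed the chunk size.
    partitions, current, used = [], [], 0
    for g in graphs:
        if current and used + sizes_kb[g] > chunk_kb:
            partitions.append(current)
            current, used = [], 0
        current.append(g)
        used += sizes_kb[g]
    if current:
        partitions.append(current)
    return partitions

sizes = {"G1": 1, "G2": 2, "G3": 2, "G4": 1, "G5": 2, "G6": 2,
         "G7": 2, "G8": 2, "G9": 2, "G10": 2, "G11": 3, "G12": 3}
print(mrgp_partition(list(sizes), sizes, chunk_kb=6))
# [['G1','G2','G3','G4'], ['G5','G6','G7'], ['G8','G9','G10'], ['G11','G12']]
```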

DGP

• Aims for a balanced distribution

• Uses intermediary buckets

• First, the graphs are sorted by density (same dataset as above):

G1 (0.25), G4 (0.25), G2 (0.5), G5 (0.5), G6 (0.5), G7 (0.5), G3 (0.6), G8 (0.6), G9 (0.6), G10 (0.7), G11 (0.7), G12 (0.8)

DGP cont.

• Let's say the bucket count for this demo is 2.

• Next, we distribute the sorted list equally between the two buckets:

Bucket 1: G1, G4, G2, G5, G6, G7
Bucket 2: G3, G8, G9, G10, G11, G12

• To make 4 partitions in total, divide each bucket into 4 non-empty sub-buckets.

DGP cont.

• Now take one sub-bucket from each bucket and form the final partitions:

– G1, G2, G3, G8
– G4, G5, G9, G10
– G6, G11
– G7, G12
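
A sketch of the whole DGP procedure under the same assumptions (dgp_partition and split_even are illustrative names); note that graphs with equal density can be ordered arbitrarily by the sort, so the exact partition membership can differ from the slide within a density level:

```python
def split_even(items, k):
    # Split a list into k contiguous, as-even-as-possible, non-empty
    # chunks (assumes len(items) >= k).
    q, r = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + q + (1 if i < r else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def dgp_partition(graphs, density, n_buckets, n_partitions):
    # 1. Sort the graphs by density.
    ordered = sorted(graphs, key=lambda g: density[g])
    # 2. Distribute the sorted list equally into the buckets.
    buckets = split_even(ordered, n_buckets)
    # 3. Divide each bucket into non-empty sub-buckets and build each
    #    final partition from one sub-bucket of every bucket.
    partitions = [[] for _ in range(n_partitions)]
    for bucket in buckets:
        for i, sub in enumerate(split_even(bucket, n_partitions)):
            partitions[i].extend(sub)
    return partitions

density = {"G1": 0.25, "G2": 0.5, "G3": 0.6, "G4": 0.25, "G5": 0.5,
           "G6": 0.5, "G7": 0.5, "G8": 0.6, "G9": 0.6, "G10": 0.7,
           "G11": 0.7, "G12": 0.8}
# Matches the slide's 4 partitions up to tie ordering among equal densities.
print(dgp_partition(list(density), density, n_buckets=2, n_partitions=4))
```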

Support Count

• There are two types of support count to consider in distributed graph mining

– Global support count

– Local support count

• The global support is the same as in ordinary graph mining

• When each mapper runs its individual job, it uses the local support count.

Local Support Count

• Each individual node holds only a partial graph dataset.

• The support count therefore needs to be adjusted relative to the original dataset.

• This adjusted support count is the local support count.

• Local Support = Tolerance Rate × Global Support [the tolerance rate is between 0 and 1]
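
A one-line sketch of this formula (the example values below are illustrative, not from the experiments):

```python
def local_support(global_support, tolerance_rate):
    # Slide formula: local support = tolerance rate * global support,
    # with the tolerance rate between 0 and 1.
    assert 0.0 <= tolerance_rate <= 1.0
    return tolerance_rate * global_support

# Illustrative values only: a global support of 5 with tolerance 0.8
# gives each mapper a local threshold of 4.
print(local_support(5, 0.8))  # 4.0
```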

Loss of Data

• Some frequent subgraphs are lost.

• The loss can be mitigated by choosing an optimal tolerance rate.

– Theoretically, a tolerance rate of 1 means there will be no loss of data,

– but it usually means a higher runtime.

Experiment Environment

• Language: Perl

• MapReduce framework: Hadoop (0.20.1)

• Cluster size: 5 nodes

• Node specification:

– Processor: AMD Opteron quad-core, 2.4 GHz

– 4 GB main memory

Data Sets

• Synthetic (sizes ranging from 18 MB to 69 GB)

• Real

– Chemical compound dataset from the National Cancer Institute.

[Chart: Loss rate for gSpan, support 30%]

[Chart: Loss rate for Gaston and FSG, support 30%]

[Chart: Runtime]

Thank You
