
A presentation based on the paper [ARIDHI 2013].


Distributed Graph Mining

Presented By

Sayeed Mahmud

Motivation

• The reason Big Data is here

– To make it feasible to process data that is impossible or overwhelming to handle with existing single-machine tools.

• Some graph databases may be too big for a single machine

– A distributed system handles them more easily by sharing the load.

• A graph database may itself be scattered around the globe

– For example, Google search records.

Distributed Graph Mining

• Partition-based

• Divide the problem into independent sub-problems

– Each node of the system can process its part independently

– Parallel processing

– Speeds up computation

– Enhances the scalability of solutions

Techniques

• MRPF

• MapReduce

– We are mainly interested in this

MapReduce

• A programming model for distributed platforms

• Proposed by Google

• Abundant open-source implementations

– Hadoop

• Divides the problem into sub-problems to be processed on the nodes

– Mapping

• Combines the processing results

– Reduce

MapReduce Example

• Problem: Find the frequency of a word in documents available on a system.

• Map: each node of the distributed system counts the word in its own document and emits a <word, count> pair:

<word, 2> <word, 1> <word, 2>

• Reduce: the emitted counts are summed into the final result:

<word, 2 + 1 + 2 = 5>
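
As a rough, single-process sketch of this job (no Hadoop involved; map_phase and reduce_phase are illustrative names, not a real API):

```python
from collections import defaultdict

def map_phase(documents, word):
    # Each "mapper" counts the word in one document and emits <word, count>.
    return [(word, doc.count(word)) for doc in documents]

def reduce_phase(pairs):
    # The "reducer" sums all counts emitted for the same key.
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

docs = ["... word ... word ...", "... word ...", "... word ... word ..."]
print(reduce_phase(map_phase(docs, "word")))  # {'word': 5}  <- 2 + 1 + 2
```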

Graph Mining using MapReduce

• Problem: Find the frequent subgraphs of a graph database in a MapReduce programming model (local support 2).

• Map: the graph dataset is partitioned across the distributed system, and each node runs gSpan on its own partition, producing local frequencies (here 3 and 2).

• Reduce: the local frequencies are summed into the global frequency (3 + 2 = 5).
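
A hedged sketch of how such a job could be wired together, assuming a hypothetical run_gspan helper that mines one partition and returns subgraph frequencies (a real job would invoke the gSpan miner on each mapper):

```python
from collections import defaultdict

def run_gspan(partition, local_support):
    # Hypothetical stand-in for running gSpan on one partition;
    # a real job would call the gSpan binary and parse its output.
    # Returns {canonical_subgraph_code: local_frequency}.
    raise NotImplementedError

def map_phase(partitions, local_support):
    # Each mapper mines its own partition and emits <subgraph, local_frequency>.
    for part in partitions:
        for subgraph, freq in run_gspan(part, local_support).items():
            yield subgraph, freq

def reduce_phase(pairs, global_support):
    # The reducer sums local frequencies per subgraph and keeps those
    # reaching the global support threshold (e.g. 3 + 2 = 5 above).
    totals = defaultdict(int)
    for subgraph, freq in pairs:
        totals[subgraph] += freq
    return {sg: n for sg, n in totals.items() if n >= global_support}
```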

Data Partitioning

• Performance and load balancing depend on the mapping portion

– Termed “partitioning”

– Decides which portion of the graph dataset goes to which node

– Loss of data and load balancing depend directly on the partitioning.

• Two approaches

– MRGP (MapReduce Graph Partitioning)

– DGP (Density-Based Graph Partitioning)

MRGP

• Followed in common MapReduce problems.

• Graphs are assigned to partitions sequentially, in input order.

• Simple.

Graph | Size (KB) | Density
G1    | 1         | 0.25
G2    | 2         | 0.5
G3    | 2         | 0.6
G4    | 1         | 0.25
G5    | 2         | 0.5
G6    | 2         | 0.5
G7    | 2         | 0.5
G8    | 2         | 0.6
G9    | 2         | 0.6
G10   | 2         | 0.7
G11   | 3         | 0.7
G12   | 3         | 0.8

4 partitions, 6 KB each:

– G1, G2, G3, G4
– G5, G6, G7
– G8, G9, G10
– G11, G12
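
A minimal sketch of this sequential chunking, assuming the per-graph sizes are known up front (mrgp_partition and its parameters are illustrative names):

```python
def mrgp_partition(graphs, sizes_kb, chunk_kb):
    # Assign graphs to partitions in input order, starting a new
    # partition once the current one would exceed the chunk size.
    partitions, current, used = [], [], 0
    for g in graphs:
        if current and used + sizes_kb[g] > chunk_kb:
            partitions.append(current)
            current, used = [], 0
        current.append(g)
        used += sizes_kb[g]
    if current:
        partitions.append(current)
    return partitions

sizes = {"G1": 1, "G2": 2, "G3": 2, "G4": 1, "G5": 2, "G6": 2,
         "G7": 2, "G8": 2, "G9": 2, "G10": 2, "G11": 3, "G12": 3}
print(mrgp_partition(list(sizes), sizes, chunk_kb=6))
# [['G1','G2','G3','G4'], ['G5','G6','G7'], ['G8','G9','G10'], ['G11','G12']]
```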

DGP

• Aims for a balanced distribution

• Uses intermediary buckets

• First, the graphs are sorted by density (same dataset as above):

G1 (0.25), G4 (0.25), G2 (0.5), G5 (0.5), G6 (0.5), G7 (0.5), G3 (0.6), G8 (0.6), G9 (0.6), G10 (0.7), G11 (0.7), G12 (0.8)

DGP cont.

• Let's say the bucket count for this demo is 2.

• Next, we distribute the sorted list equally between the two buckets:

Bucket 1: G1, G4, G2, G5, G6, G7
Bucket 2: G3, G8, G9, G10, G11, G12

• To make 4 partitions in total, divide each bucket into 4 non-empty sub-buckets.

DGP cont.

• Now take one sub-bucket from each bucket and form the final partitions:

– G1, G2, G3, G8
– G4, G5, G9, G10
– G6, G11
– G7, G12
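
A sketch of the whole DGP procedure under the same assumptions (dgp_partition and split_even are illustrative names); note that graphs with equal density can be ordered arbitrarily by the sort, so the exact partition membership can differ from the slide within a density level:

```python
def split_even(items, k):
    # Split a list into k contiguous, as-even-as-possible, non-empty
    # chunks (assumes len(items) >= k).
    q, r = divmod(len(items), k)
    chunks, start = [], 0
    for i in range(k):
        end = start + q + (1 if i < r else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def dgp_partition(graphs, density, n_buckets, n_partitions):
    # 1. Sort the graphs by density.
    ordered = sorted(graphs, key=lambda g: density[g])
    # 2. Distribute the sorted list equally into the buckets.
    buckets = split_even(ordered, n_buckets)
    # 3. Divide each bucket into non-empty sub-buckets and build each
    #    final partition from one sub-bucket of every bucket.
    partitions = [[] for _ in range(n_partitions)]
    for bucket in buckets:
        for i, sub in enumerate(split_even(bucket, n_partitions)):
            partitions[i].extend(sub)
    return partitions

density = {"G1": 0.25, "G2": 0.5, "G3": 0.6, "G4": 0.25, "G5": 0.5,
           "G6": 0.5, "G7": 0.5, "G8": 0.6, "G9": 0.6, "G10": 0.7,
           "G11": 0.7, "G12": 0.8}
# Matches the slide's 4 partitions up to tie ordering among equal densities.
print(dgp_partition(list(density), density, n_buckets=2, n_partitions=4))
```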

Support Count

• There are two types of support count to consider in distributed graph mining

– Global support count

– Local support count

• The global support is the same as in ordinary graph mining

• When each mapper runs its individual job, it uses the local support count.

Local Support Count

• Each individual node holds only a partial graph dataset.

• The support count therefore needs to be adjusted relative to the original dataset.

• This adjusted support count is the local support count.

• Local Support = Tolerance Rate × Global Support [the tolerance rate is between 0 and 1]
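
A one-line sketch of this formula (the example values below are illustrative, not from the experiments):

```python
def local_support(global_support, tolerance_rate):
    # Slide formula: local support = tolerance rate * global support,
    # with the tolerance rate between 0 and 1.
    assert 0.0 <= tolerance_rate <= 1.0
    return tolerance_rate * global_support

# Illustrative values only: a global support of 5 with tolerance 0.8
# gives each mapper a local threshold of 4.
print(local_support(5, 0.8))  # 4.0
```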

Loss of Data

• Some frequent subgraphs are lost.

• The loss can be mitigated by choosing an optimal tolerance rate.

– Theoretically, a tolerance rate of 1 means there will be no loss of data,

– but it usually means a higher runtime.

Experiment Environment

• Language: Perl

• MapReduce framework: Hadoop (0.20.1)

• Cluster size: 5 nodes

• Node specification:

– Processor: AMD Opteron quad-core, 2.4 GHz

– 4 GB main memory

Data Sets

• Synthetic (sizes ranging from 18 MB to 69 GB)

• Real

– Chemical compound dataset from the National Cancer Institute.

[Chart: Loss rate for gSpan, support 30%]

[Chart: Loss rate for Gaston and FSG, support 30%]

[Chart: Runtime]

Thank You
