Overview of Graph Similarity Technique in Web Content Mining


Upload: anbhute3484

Post on 27-Apr-2015

OVERVIEW OF GRAPH SIMILARITY TECHNIQUES FOR WEB CONTENT MINING

Avinash N. Bhute
Research Scholar, VJTI, Mumbai, Maharashtra State, India

email:[email protected]

Harsha Bhute
Pursuing ME (Computer Network), Sinhgad College of Engineering, Pune, Maharashtra State, India

email:[email protected]

Dr. B. B. Meshram
Professor & Head, VJTI, Mumbai, Maharashtra State, India

email:[email protected]

Keywords: Graph edit distance, Graph isomorphism, Graph similarity, probabilistic approach, relaxation approach

Abstract: Web mining is the application of machine learning (data mining) techniques to web-based data for the purpose of learning or extracting knowledge. Web mining encompasses a wide variety of techniques, including soft computing. Web mining methodologies are generally classified into three distinct categories: web structure mining, web usage mining, and web content mining. In web content mining we examine the actual content of web pages and perform some knowledge discovery procedure. In this paper we discuss the concepts of graph similarity, graph distance, and graph matching, as they form a basis for novel graph-based approaches. The purpose of the paper is to give a literature survey of the various methods that are used to determine similarity, distance, and matching between graphs.

1 INTRODUCTION

In this paper we discuss the concepts of graph similarity, graph distance, and graph matching, as they form a basis for novel graph-based approaches. The purpose of the paper is to give a literature survey of the various methods that are used to determine similarity, distance, and matching between graphs. These topics are closely related to inexact graph matching, and several practical applications that use graph similarity or graph matching have been reported, many of them in the field of image processing. Haris et al. performed labeling of coronary angiograms with graph matching [K. Haris, 1999]. A method for allocating tasks in multiprocessor systems using graphs and graph matching is described in [C. Siva Ram Murthy, 1999]. Huet and Hancock [B. Huet, 1999] describe a graph matching method for shape recognition in image databases.

In this paper we are specifically interested in using graph techniques for dealing with web document content. Traditional learning methods applied to the tasks of text or document classification and categorization, such as rule induction [C. Apte, 1994] and Bayesian methods [A. McCallum, 1998], are based on a vector model of document representation or an even simpler Boolean model. Similarity of graphs in domains outside of information retrieval has largely been studied under the topic of graph matching. In many applications the input graph is not expected to be an exact match to any database graph, since the input graph is either previously unseen or assumed to be corrupted with some amount of noise. Thus we sometimes refer to this area as error-tolerant or inexact graph matching. As mentioned above, a number of graph matching applications have been reported in the literature. For a recent survey, see [H. Bunke, 2000]. We are not aware, however, of any graph matching applications that deal with content-based categorization and classification of web or text documents.


2 GRAPH AND SUBGRAPH ISOMORPHISM

Here we describe graph and subgraph isomorphism. We first give definitions for graph and subgraph. A graph G [H. Bunke, 2000][Wang, 1995] is a 4-tuple G = (V, E, α, β), where V is a set of nodes (also called vertices), E ⊆ V × V is a set of edges connecting the nodes, α: V → ΣV is a function labeling the nodes, and β: V × V → ΣE is a function labeling the edges (ΣV and ΣE being the sets of labels that can appear on the nodes and edges, respectively). For brevity, we may abbreviate G as G = (V, E) by omitting the labeling functions. A graph G1 = (V1, E1, α1, β1) is a subgraph [H. Bunke, 1997] of a graph G2 = (V2, E2, α2, β2), denoted G1 ⊆ G2, if V1 ⊆ V2, E1 ⊆ E2 ∩ (V1 × V1), α1(x) = α2(x) for all x ∈ V1, and β1((x, y)) = β2((x, y)) for all (x, y) ∈ E1. Conversely, graph G2 is also called a supergraph of G1.

When we say that two graphs are isomorphic, we mean that the graphs contain the same number of nodes and that there is a direct 1-to-1 correspondence between the nodes of the two graphs such that the edges between nodes and all labels are preserved. Formally, a graph G1 = (V1, E1, α1, β1) and a graph G2 = (V2, E2, α2, β2) are said to be isomorphic [H. Bunke, 1997], denoted G1 ≅ G2, if there exists a bijective function f: V1 → V2 such that α1(x) = α2(f(x)) for all x ∈ V1 and β1((x, y)) = β2((f(x), f(y))) for all (x, y) ∈ V1 × V1. Such a function f is also called a graph isomorphism between G1 and G2.
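As a concrete illustration of the isomorphism definition above, the following minimal Python sketch tries every bijection f between the two node sets and checks that node and edge labels are preserved. The graph representation (a pair of dictionaries, one for node labels and one for directed edge labels) and the name is_isomorphic are assumptions made for this example only; the exhaustive search is practical only for very small graphs.

```python
from itertools import permutations

def is_isomorphic(g1, g2):
    """Brute-force test of G1 ≅ G2.

    Each graph is (nodes, edges): nodes maps node -> label,
    edges maps a directed pair (u, v) -> label. This layout is an
    assumption of the sketch, not something fixed by the paper.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    if len(nodes1) != len(nodes2) or len(edges1) != len(edges2):
        return False
    v1 = list(nodes1)
    for perm in permutations(nodes2):
        f = dict(zip(v1, perm))  # candidate bijection f: V1 -> V2
        # Node labels must be preserved: alpha1(x) == alpha2(f(x)).
        if any(nodes1[x] != nodes2[f[x]] for x in v1):
            continue
        # Every edge of G1 must map onto an edge of G2 with the same label.
        # Equal edge counts plus a bijective f make this an exact match.
        if all((f[x], f[y]) in edges2 and edges2[(f[x], f[y])] == label
               for (x, y), label in edges1.items()):
            return True
    return False

# g2 is g1 with renamed nodes; g3 differs from g2 in one edge label.
g1 = ({1: "a", 2: "b", 3: "c"}, {(1, 2): "x", (2, 3): "y"})
g2 = ({"p": "b", "q": "c", "r": "a"}, {("r", "p"): "x", ("p", "q"): "y"})
g3 = ({"p": "b", "q": "c", "r": "a"}, {("r", "p"): "x", ("p", "q"): "z"})
```

Here is_isomorphic(g1, g2) holds while is_isomorphic(g1, g3) does not, because the only label-preserving bijection maps edge (2, 3) onto an edge with a different label.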

There is also the notion of subgraph isomorphism, meaning a graph is isomorphic to a part of (i.e. a subgraph of) another graph. Given a graph isomorphism f between graphs G1 and G2 as defined above and another graph G3, if G2 ⊆ G3 then f is a subgraph isomorphism [H. Bunke, 2000] between G1 and G3. Subgraph isomorphism tells us if one graph appears as part of another graph. Formally, the similarity between two graphs G1 and G2, denoted s(G1, G2), is a function that has the following properties:
(1) 0 ≤ s(G1, G2) ≤ 1
(2) s(G1, G2) = 1 if and only if G1 ≅ G2
(3) s(G1, G2) = s(G2, G1)
(4) if G1 is more similar to G2 than to G3, then s(G1, G2) ≥ s(G1, G3)

One problem with defining similarity in this way is that it is not clear what case causes s(G1, G2) = 0. This comes from the fact that we have no concept of an exact “opposite” of a graph. We do, however, have the idea of complements of graphs. A complement [T. H. Cormen, 1997] of a graph G, denoted GC, is the fully connected graph on the nodes of G with the edges of G removed, EC = {(u, v) | (u, v) ∉ E}.

Fig. 1. A graph G and its complement GC.
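As a small check of the complement definition, the sketch below builds the complement of a path on four nodes and tests, by exhaustive search, whether the two graphs are isomorphic. The representation (undirected simple graphs without self-loops, edges stored as frozenset pairs) and the helper names complement and isomorphic are assumptions for this illustration; the four-node path is our own choice of example and is not necessarily the graph drawn in Fig. 1.

```python
from itertools import combinations, permutations

def complement(nodes, edges):
    """Edges of the complement: all node pairs that are not edges of G."""
    return {frozenset(pair) for pair in combinations(nodes, 2)} - edges

def isomorphic(nodes, e1, e2):
    """Brute-force isomorphism test for two unlabeled graphs on the same nodes."""
    if len(e1) != len(e2):
        return False
    for perm in permutations(nodes):
        f = dict(zip(nodes, perm))
        if {frozenset(f[u] for u in e) for e in e1} == e2:
            return True
    return False

nodes = [0, 1, 2, 3]
path = {frozenset(p) for p in [(0, 1), (1, 2), (2, 3)]}   # 0 - 1 - 2 - 3
co_path = complement(nodes, path)                          # 2 - 0 - 3 - 1
self_complementary = isomorphic(nodes, path, co_path)
```

The complement of the path 0-1-2-3 is the path 2-0-3-1, so the graph and its complement are isomorphic even though they share no edge.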

However, a graph may be isomorphic to its complement (Fig. 1), so it does not necessarily hold that s(G, GC) = 0. Given this limitation, the usual method of determining numeric similarity between graphs is to use a distance measure. A distance metric [H. Bunke, 2000][H. Bunke, 1998][M.-L. Fernández, 2001] between two graphs, denoted d(G1, G2), is a function that has the following properties:
(1) boundary condition: d(G1, G2) ≥ 0
(2) identical graphs have zero distance: d(G1, G2) = 0 if and only if G1 ≅ G2
(3) symmetry: d(G1, G2) = d(G2, G1)
(4) triangle inequality: d(G1, G3) ≤ d(G1, G2) + d(G2, G3)
We note that it is possible to transform a similarity measure into a distance measure, for example by:

$$d(G_1, G_2) = 1 - s(G_1, G_2) \qquad (1)$$

Under this transformation the boundary, identity, and symmetry properties of the similarity measure carry over directly to the distance; the triangle inequality does not follow automatically and has to be checked for the specific measure. Other equations are also possible for changing distance into similarity. Throughout the rest of this paper we will see several proposed distance measures, some of which have been created from a similarity measure.

3 GRAPH EDIT DISTANCE

Edit distance is a method that is used to measure the difference between symbolic data structures such as trees [K.-C. Tai, 1979] and strings [R. A. Wagner, 1974]. It is also known as the Levenshtein distance, from early work in error correcting/detecting codes that allowed insertion and deletion of symbols [G. Levi, 1972]. The concept is straightforward. Various operations are defined on the structures, such as deletion, insertion, and renaming of elements. A cost function is associated with each operation, and the minimum cost needed to transform one structure into the other using the operations is the distance between them. Edit distance has also been applied to graphs, as graph edit distance [A. Sanfeliu, 1983]. The operations in graph edit distance are insertion, deletion, and re-labeling of nodes and edges. Formally, an editing matching function (or an error correcting graph matching, ecgm [H. Bunke, 1997]) between two graphs G1 and G2 is defined as a bijective mapping function M: Gx → Gy, where Gx ⊆ G1 and Gy ⊆ G2. The following six edit operations on the graphs, which are implied by the mapping M, are also defined:
(1) If a node v ∈ V1 but v ∉ Vx, then we delete node v with cost cnd.
(2) If a node v ∈ V2 but v ∉ Vy, then we insert node v with cost cni.
(3) If M(vi) = vj for vi ∈ Vx and vj ∈ Vy, and α1(vi) ≠ α2(vj), then we substitute node vi with node vj with cost cns.
(4) If an edge e ∈ E1 but e ∉ Ex, then we delete edge e with cost ced.
(5) If an edge e ∈ E2 but e ∉ Ey, then we insert edge e with cost cei.
(6) If M(ei) = ej for ei ∈ Ex and ej ∈ Ey, and β1(ei) ≠ β2(ej), then we substitute edge ei with edge ej with cost ces.

Usually the cost coefficients c are application dependent. In the error correcting graph matching sense, they can be related to the probability of the operations (errors) occurring. We assume that the cost coefficients are non-negative and are independent of the node or edge to which they are applied (i.e. the costs are constant for each operation). Writing γ(M) for the total cost of the edit operations implied by a mapping M, the edit distance between two graphs [H. Bunke, 1997], denoted d(G1, G2), is defined as the cost of the mapping M that results in the lowest γ(M). More formally:

$$d(G_1, G_2) = \min_{M} \gamma(M) \qquad (2)$$

Thus the distance between two graphs is the cost of an editing function which transforms one graph into the other via edit operations and which has the lowest cost among all such editing functions. The advantage of the graph edit distance approach is that it is easy to understand and straightforward to apply. The disadvantage is that the costs for the edit operations (6 parameter values) need to be determined for each application. In [H. Bunke, 1999], Bunke gives an examination of cost functions for graph edit distance.
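The definition above can be made concrete with a brute-force sketch that enumerates every partial node mapping M and keeps the cheapest one. This is only an illustration under stated assumptions: graphs are undirected with labeled nodes and unlabeled edges (so the edge substitution cost ces never applies), there are no self-loops, and the names mappings, edit_cost, edit_distance, and the cost keys "nd", "ni", "ns", "ed", "ei" are hypothetical. The search is exponential and intended for toy graphs only.

```python
def mappings(v1, v2):
    """Yield every injective partial mapping from the nodes v1 into the nodes v2."""
    if not v1:
        yield {}
        return
    head, rest = v1[0], v1[1:]
    yield from mappings(rest, v2)                 # head is left unmapped (deleted)
    for target in v2:                             # head is matched to target
        remaining = [x for x in v2 if x != target]
        for m in mappings(rest, remaining):
            yield {head: target, **m}

def edit_cost(g1, g2, m, c):
    """Cost of the edit operations implied by the node mapping m."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    cost = c["nd"] * (len(labels1) - len(m))      # unmapped G1 nodes are deleted
    cost += c["ni"] * (len(labels2) - len(m))     # unmapped G2 nodes are inserted
    cost += c["ns"] * sum(labels1[u] != labels2[v] for u, v in m.items())
    preserved = set()
    for e in edges1:
        u, v = tuple(e)
        image = frozenset((m[u], m[v])) if u in m and v in m else None
        if image in edges2:
            preserved.add(image)                  # edge survives unchanged
        else:
            cost += c["ed"]                       # edge of G1 is deleted
    cost += c["ei"] * (len(edges2) - len(preserved))  # remaining G2 edges inserted
    return cost

UNIT = {"nd": 1, "ni": 1, "ns": 1, "ed": 1, "ei": 1}

def edit_distance(g1, g2, c=UNIT):
    """Minimum cost over all mappings, i.e. the minimum of the mapping cost."""
    return min(edit_cost(g1, g2, m, c)
               for m in mappings(list(g1[0]), list(g2[0])))

# G1 is the labeled path A - B - C; G2 is the single edge A - B.
g1 = ({"a": "A", "b": "B", "c": "C"}, {frozenset("ab"), frozenset("bc")})
g2 = ({"x": "A", "y": "B"}, {frozenset("xy")})
ged = edit_distance(g1, g2)
```

With unit costs the cheapest edit deletes node c and the edge b-c, so the distance is 2. Changing the entries of the cost dictionary shows how strongly the result depends on the six coefficients.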

4 MAXIMUM COMMON SUBGRAPH / MINIMUM COMMON SUPERGRAPH APPROACH

Bunke has shown [H. Bunke, 1997] that there is a direct relationship between graph edit distance and the maximum common subgraph between two graphs. Specifically, the two are equivalent under certain restrictions on the cost functions. A graph g is a maximum common subgraph (mcs) [H. Bunke, 1997] of graphs G1 and G2, denoted mcs(G1, G2), if: (1) g ⊆ G1, (2) g ⊆ G2, and (3) there is no other subgraph g’ (g’ ⊆ G1, g’ ⊆ G2) such that |g’| > |g|. (Here |g| is usually taken to mean |V|, i.e. the number of nodes in the graph; it is used to indicate the “size” of a graph.) Similarly, there is the complementary idea of minimum common supergraph. A graph g is a minimum common supergraph (MCS) [H. Bunke, 2000] of graphs G1 and G2, denoted MCS(G1, G2), if: (1) G1 ⊆ g, (2) G2 ⊆ g, and (3) there is no other supergraph g’ (G1 ⊆ g’, G2 ⊆ g’) such that |g’| < |g|. Methods for determining the mcs are given in [G. Levi, 1972][J. J. McGregor, 1982].

The general approach is to create a compatibility graph for the two given graphs, and then find the largest clique within it. What Bunke has shown is that when computing the editing matching function based on graph edit distance, the function with the lowest cost is equivalent to the maximum common subgraph between the two graphs under certain conditions on the cost coefficients. This is intuitively appealing, since the maximum common subgraph is the part of both graphs that is unchanged by deleting or inserting nodes and edges. To edit graph G1 into graph G2, one only needs to perform the following steps:
(1) Delete nodes and edges from G1 that do not appear in mcs(G1, G2).
(2) Perform any node or edge substitutions.
(3) Add the nodes and edges from G2 that do not appear in mcs(G1, G2).

Following this observation that the size of the maximum common subgraph is related to the similarity between two graphs, Bunke and Shearer [H. Bunke, 1998] introduced a distance measure based on mcs. They defined the following distance measure:

$$d_{MCS}(G_1, G_2) = 1 - \frac{|mcs(G_1, G_2)|}{\max(|G_1|, |G_2|)} \qquad (3)$$

where max(x, y) is the usual maximum of two numbers x and y, and |...| indicates the size of a graph (usually taken to be the number of nodes in a graph). This distance measure has four important properties [H. Bunke, 2000]. First, it is restricted to producing a number in the interval [0, 1]. Second, the distance is 0 only when the two graphs are identical. Third, the distance between two graphs is symmetric. Fourth, it obeys the triangle inequality, which ensures the distance measure behaves in an intuitive way. For example, if we have two dissimilar objects (i.e. there is a large distance between them), the triangle inequality implies that a third object which is similar (i.e. has a small distance) to one of those objects must be dissimilar to the other. The advantage of this approach over the graph edit distance method is that it does not require the determination of any cost coefficients or other parameters.
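The distance in equation (3) can be illustrated with a small exhaustive sketch. It assumes undirected graphs with labeled nodes, measures size by node count, and, as a simplifying assumption that is stricter than the subgraph definition of Section 2, only considers induced common subgraphs (the edges between the chosen nodes must agree in both graphs). The names mcs_size and d_mcs are hypothetical, both graphs are assumed non-empty, and the search is exponential.

```python
from itertools import combinations, permutations

def mcs_size(g1, g2):
    """Number of nodes in a maximum common induced subgraph (brute force)."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    n1, n2 = list(labels1), list(labels2)
    for k in range(min(len(n1), len(n2)), 0, -1):
        for sub1 in combinations(n1, k):
            for sub2 in permutations(n2, k):
                # Node labels must agree under the pairing sub1[i] -> sub2[i].
                if any(labels1[a] != labels2[b] for a, b in zip(sub1, sub2)):
                    continue
                f = dict(zip(sub1, sub2))
                # Adjacency must agree for every pair of chosen nodes.
                if all((frozenset((a, b)) in edges1)
                       == (frozenset((f[a], f[b])) in edges2)
                       for a, b in combinations(sub1, 2)):
                    return k
    return 0

def d_mcs(g1, g2):
    """Distance 1 - |mcs| / max(|G1|, |G2|), with size = number of nodes."""
    return 1 - mcs_size(g1, g2) / max(len(g1[0]), len(g2[0]))

def graph(names, pairs):
    """Helper: all nodes share the label "n"; edges are unordered pairs."""
    return ({v: "n" for v in names}, {frozenset(p) for p in pairs})

triangle = graph(["t0", "t1", "t2"], [("t0", "t1"), ("t1", "t2"), ("t0", "t2")])
path4 = graph(["p0", "p1", "p2", "p3"], [("p0", "p1"), ("p1", "p2"), ("p2", "p3")])
distance = d_mcs(triangle, path4)
```

A triangle and a four-node path share at most a single edge as an induced subgraph, so the common part has two nodes and the distance is 1 - 2/4 = 0.5. No cost coefficients are involved, which is the advantage noted above.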

5 STATE SPACE SEARCH APPROACH

Depending on the size of the graphs and the costs associated with the edit operations, finding the lowest cost mapping may require an exhaustive examination of all possible matchings. If we allow the possibility of not having to determine the exact distance between graphs, we can perform other types of sub-optimal search. These searches may not find the global minimum cost function, but they can be performed more quickly (since we do not need to find all of the possible matching functions) and still yield acceptable results. Each matching function we consider becomes a state in a search space. The cost γ(M) for a state M becomes the value we attempt to minimize through the search. M is actually a graph isomorphism between subgraphs of the two graphs being matched; it specifies the operations needed to edit one graph into the other graph. Neighbors of a state M can be determined by adding/deleting nodes and edges to/from these subgraphs along with their corresponding isomorphic matching; these neighbor states indicate the creation (or removal) of a single matching between a node or edge in the two graphs (i.e. a change in the edit operations). Once the matching is represented in such a manner, many techniques become available for performing the search, including hill climbing, genetic algorithms, simulated annealing, and so forth. These searches may not find the optimal solution, but for some applications (such as graph matching for retrieval of images or documents) this may not be a concern. These techniques are also sensitive to initialization and parameter selection, so there can be a wide variety in performance. For a more detailed description of this technique, as well as experimental results comparing different search and initialization strategies, we refer the reader to [Wang, 1995].
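The search described above can be sketched with steepest-descent hill climbing. To keep the example short it makes several simplifying assumptions that are ours, not the cited authors': both graphs are unlabeled, undirected, and have the same number of nodes, a state is a full bijection between the node sets, the cost is the number of edges that would have to be deleted or inserted, and a neighbor state swaps the images of two nodes. The names mapping_cost and hill_climb are hypothetical.

```python
from itertools import combinations

def mapping_cost(e1, e2, f):
    """Edges of G1 lost under f plus edges of G2 not covered by f."""
    image = {frozenset((f[u], f[v])) for u, v in map(tuple, e1)}
    return len(image - e2) + len(e2 - image)

def hill_climb(nodes1, nodes2, e1, e2):
    """Move to the best neighboring state until no neighbor is cheaper."""
    f = dict(zip(nodes1, nodes2))        # initial state: pair nodes in list order
    cost = mapping_cost(e1, e2, f)
    while True:
        best = None
        for a, b in combinations(nodes1, 2):
            g = dict(f)
            g[a], g[b] = f[b], f[a]      # neighbor: swap the images of a and b
            c = mapping_cost(e1, e2, g)
            if c < cost and (best is None or c < best[0]):
                best = (c, g)
        if best is None:                 # local minimum reached
            return f, cost
        cost, f = best

# Two three-node paths whose middle nodes are "b" and "z" respectively.
nodes1, nodes2 = ["a", "b", "c"], ["x", "y", "z"]
e1 = {frozenset(("a", "b")), frozenset(("b", "c"))}
e2 = {frozenset(("x", "z")), frozenset(("z", "y"))}
best_map, best_cost = hill_climb(nodes1, nodes2, e1, e2)
```

Starting from the pairing a-x, b-y, c-z (cost 2), a single swap of the images of b and c reaches cost 0. On larger graphs the same procedure can stop in a local minimum, which is the sensitivity to initialization mentioned above.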

6 PROBABILISTIC APPROACHES

In this section we give a summary of the approach proposed by [R. C. Wilson, 1997], which is based on probability theory. In the probabilistic method, we attempt to match a data graph GD and a stored model graph GM. These graphs are attributed graphs. An attributed graph [36] is a graph Gy = (V, E, A), where A is a set of attributes associated with each node, A = {x_v^y, v ∈ V}. The attributes in the data graph are to be matched to those in the model graph, such that the matched nodes have the same or similar attributes. Edges may also have associated attributes in this model, but they are not considered in this approach. Next, we have the concept of the super-clique of a node. A super-clique [R. C. Wilson, 1997] of a node i in graph G = (V, E) is defined as Ci = {i} ∪ {j | (j, i) ∈ E}. In other words, the super-clique of a node i is the set of nodes which contains i and all nodes connected to it by edges. We attempt to match all super-cliques in the data graph with super-cliques in the model graph.
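The super-clique definition above translates directly into code. The sketch below assumes an undirected graph given as a list of node pairs, so a neighbor may appear on either side of an edge; the function name super_clique is hypothetical.

```python
def super_clique(i, edges):
    """Node i together with every node joined to i by an edge."""
    neighbours = {u for (u, v) in edges if v == i} | {v for (u, v) in edges if u == i}
    return {i} | neighbours

# A small path 1 - 2 - 3 - 4.
edges = [(1, 2), (2, 3), (3, 4)]
cliques = {node: super_clique(node, edges) for node in (1, 2, 3, 4)}
```

For this path, the super-clique of node 2 is {1, 2, 3} and that of the end node 1 is {1, 2}; these are the units that the probabilistic method tries to match between the data graph and the model graph.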

The set of all possible matches between super-clique Cj in the data graph GD and super-cliques Si in the model graph GM is called a dictionary [R. C. Wilson, 1997] and is denoted Θj. To cope with size differences between the data and model super-cliques, we allow dummy (or null) nodes φ to be inserted into Si so that both super-cliques have equal numbers of nodes. The function matching a node in Cj to a node in Si is f: VD → VM.

The probability of matching errors (a node in the data graph is matched to the wrong node in the model graph) is denoted Pe, and the probability of structural errors (a node in the data graph is matched to a dummy node in the model graph) is denoted Pφ. Given these definitions, some assumptions, and through application of Bayes’ rule and other probability-theoretic constructions, Wilson and Hancock arrive at a mathematical description for the probability of a super-clique matching between two graphs (denoted Γj for super-clique Cj):

$$P(\Gamma_j) = \frac{K_{C_j}}{|\Theta_j|} \sum_{S_i \in \Theta_j} \exp\{-(k_e H(\Gamma_j, S_i) + k_\phi [\Psi(\Gamma_j, S_i) - \Phi(S_i)])\} \qquad (4)$$

where

$$K_{C_j} = [(1 - P_e)(1 - P_\phi)]^{|C_j|} \qquad (5)$$

H(Γj, Si) is the Hamming distance between the super-clique of the data graph under the mapping f and the super-clique of the model graph, Φ(Si) = |Cj| − |Si| (i.e. the number of null nodes inserted into Si), and Ψ(Γj, Si) is the number of nodes in Cj which are mapped onto null nodes in Si.

7 DISTANCE PRESERVATION APPROACH

Chartrand et al. [G. Chartrand, 1998] describe an approach for graph distance calculation based on preserving the distance between nodes. The idea comes from the fact that when two graphs are isomorphic, the distances (meaning in this context the number of edges traversed) between every pair of nodes are identical in both graphs. Given a graph G = (V, E), the distance between two nodes x, y ∈ V, denoted dG(x, y), is defined as the minimum number of edges that need to be traversed when traveling from x to y [G. Chartrand, 1998]. Further, the φ-distance [G. Chartrand, 1998] between two graphs G1 and G2, denoted dφ(G1, G2), is defined as

$$d_\phi(G_1, G_2) = \sum_{x, y \in V_1} |d_{G_1}(x, y) - d_{G_2}(\phi(x), \phi(y))| \qquad (6)$$

where φ is a 1-to-1 mapping (but not necessarily an isomorphism) between G1 and G2. Here |...| is the standard absolute value operation. If φ is an isomorphism (i.e. G1 ≅ G2), then dφ(G1, G2) = 0; if G1 and G2 are not isomorphic, then dφ(G1, G2) > 0. This leads to a definition of distance between two graphs, denoted d(G1, G2):

$$d(G_1, G_2) = \min_{\phi} d_\phi(G_1, G_2) \qquad (7)$$

Here again we see the idea of examining all the possible matching functions (φ, in the notation of the current method; M in the notation of graph edit distance) between two graphs in order to determine the distance between them. The authors also go on to show that if the graphs meet certain requirements then we can make some other, less expensive calculations. For example, if G1 and G2 are connected graphs with equal numbers of nodes, then a lower bound on their distance is |td(G1) − td(G2)|, where td(G) is the total distance of G, or, in other words, the sum of distances between all pairs of nodes in the graph. This bound follows from applying the triangle inequality to the sum in equation (6). Further theoretical contributions related to this approach can be found in [G. Chartrand, G. Kubicki, and M. Schultz, 1998].
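The distance-preservation idea can be tried out on small graphs with the sketch below. It assumes connected, undirected, unweighted graphs with equal numbers of nodes, counts each unordered node pair once in the sum, and minimizes over all bijections by brute force. The helper names shortest_paths, phi_distance, graph_distance, and total_distance are hypothetical; total_distance is the sum of distances between all pairs of nodes, whose difference gives a lower bound by the triangle inequality.

```python
from collections import deque
from itertools import combinations, permutations

def shortest_paths(nodes, edges):
    """All-pairs hop counts via breadth-first search from every node."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    dist = {}
    for s in nodes:
        d, queue = {s: 0}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in d:
                    d[w] = d[u] + 1
                    queue.append(w)
        dist[s] = d
    return dist

def phi_distance(nodes1, d1, d2, phi):
    """Sum over node pairs of |d_G1(x, y) - d_G2(phi(x), phi(y))|."""
    return sum(abs(d1[x][y] - d2[phi[x]][phi[y]])
               for x, y in combinations(nodes1, 2))

def graph_distance(nodes1, edges1, nodes2, edges2):
    """Minimum phi-distance over all 1-to-1 mappings phi."""
    d1, d2 = shortest_paths(nodes1, edges1), shortest_paths(nodes2, edges2)
    return min(phi_distance(nodes1, d1, d2, dict(zip(nodes1, perm)))
               for perm in permutations(nodes2))

def total_distance(nodes, edges):
    """Sum of distances between all pairs of nodes."""
    d = shortest_paths(nodes, edges)
    return sum(d[x][y] for x, y in combinations(nodes, 2))

path_nodes, path_edges = [0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)]
star_nodes = ["c", "l1", "l2", "l3"]
star_edges = [("c", "l1"), ("c", "l2"), ("c", "l3")]
dist = graph_distance(path_nodes, path_edges, star_nodes, star_edges)
bound = abs(total_distance(path_nodes, path_edges)
            - total_distance(star_nodes, star_edges))
```

For the four-node path and the four-node star the minimum is 3, reached when an inner node of the path is mapped to the center of the star, while the cheap lower bound is |10 - 9| = 1.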

8 RELAXATION APPROACHES

Some early algorithms for determining exact graph matching (isomorphism) use a matching matrix M which indicates the compatibility of nodes in the two graphs being matched. If the element in the ith row and jth column of M, denoted Mij, is 1, then node i in graph G1 is matched with node j in graph G2; otherwise there is no match and Mij = 0. There are constraints on the matrix M so that each row has exactly one 1 and no column has more than one 1. Such a representation and the algorithms applied to it for determining graph matching are straightforward; however, they can require generating all the permutations of possible node matchings over the matrix.

In order to improve time complexity, we can instead approximate the optimal solution by finding good sub-optimal solutions. A method that is sometimes used to do this for graph matching problems is called relaxation (or more specifically, discrete relaxation). Put simply, discrete relaxation is a method of transforming a discrete representation (such as the matrix M used for graph matching) into a continuous representation. Thus we can transform a discrete optimization problem (exact graph matching using discrete matrix M) into a continuous optimization problem. Compared to the state space search approach, relaxation is a non-linear optimization approach. Gold and Rangarajan [S. Gold, 1996] applied relaxation to the graph matching problem. They pose attributed graph matching as an optimization problem whose goal is to minimize an objective function E. The authors use the graduated assignment algorithm to find an M which minimizes E. The general procedure of the algorithm is as follows:
(1) Start with some valid initial matrix M0.
(2) Determine a first-order Taylor expansion of E about M0, yielding:

$$Q_{ai} = -\left.\frac{\partial E}{\partial M_{ai}}\right|_{M = M^0} = \sum_{b=1}^{|V_1|} \sum_{j=1}^{|V_2|} M^0_{bj} C_{aibj} \qquad (8)$$

where C_aibj is the compatibility between edge (a, b) of G1 and edge (i, j) of G2.
(3) Use relaxation to create a continuous representation of M0:

$$M^0_{ai} = e^{\beta Q_{ai}} \qquad (9)$$

where β is a control parameter that is slowly increased as the procedure runs.
(4) Update the matrix M by a normalization procedure over both rows and columns.
(5) Repeat until convergence or until an iteration limit is reached.
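The five steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: it assumes two unweighted graphs of equal size given as adjacency matrices, takes the edge compatibility to be C_aibj = A1[a][b] · A2[i][j], performs a single update per value of β, omits the slack rows and columns used for graphs of different sizes, and uses alternating row and column normalization for step (4). The function name and all parameter values are assumptions.

```python
import math

def graduated_assignment(a1, a2, beta0=0.5, beta_max=10.0, rate=1.075,
                         norm_iters=30):
    """Soft match matrix between two equal-size graphs (simplified sketch)."""
    n = len(a1)
    m = [[1.0 / n] * n for _ in range(n)]          # step 1: uniform initial M0
    beta = beta0
    while beta < beta_max:
        # Step 2: Q_ai = sum_b sum_j M_bj * C_aibj, with C_aibj = A1[a][b] * A2[i][j].
        q = [[sum(a1[a][b] * a2[i][j] * m[b][j]
                  for b in range(n) for j in range(n))
              for i in range(n)] for a in range(n)]
        # Step 3: continuous (soft) assignment M_ai = exp(beta * Q_ai).
        m = [[math.exp(beta * q[a][i]) for i in range(n)] for a in range(n)]
        # Step 4: alternate row and column normalization.
        for _ in range(norm_iters):
            m = [[x / sum(row) for x in row] for row in m]
            col = [sum(m[a][i] for a in range(n)) for i in range(n)]
            m = [[m[a][i] / col[i] for i in range(n)] for a in range(n)]
        beta *= rate                                # step 5: raise beta and repeat
    return m

# Path 0 - 1 - 2 (middle node 1) and the same path with its middle node renamed 0.
a1 = [[0, 1, 0],
      [1, 0, 1],
      [0, 1, 0]]
a2 = [[0, 1, 1],
      [1, 0, 0],
      [1, 0, 0]]
match = graduated_assignment(a1, a2)
middle_match = max(range(3), key=lambda i: match[1][i])
```

In this toy case the middle node of the first path is assigned to node 0, the middle node of the second path. The two end nodes are interchangeable by symmetry, so their rows stay split evenly between the remaining columns.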

Medasani et al. [S. Medasani, 2001] gave a procedure based on fuzzy assignments and relaxation similar to the method just described. The objective function for this approach is

$$J(M, C) = \sum_{i=1}^{|V_1|+1} \sum_{j=1}^{|V_2|+1} M_{ij}^2 f(C_{ij}) + \eta \sum_{i=1}^{|V_1|+1} \sum_{j=1}^{|V_2|+1} M_{ij} (1 - M_{ij}) \qquad (10)$$

where M is now a fuzzy membership matrix (0 ≤ Mij ≤ 1) that relates the degree of match between nodes, C is a compatibility matrix between nodes (rather than edges as above), and η is a control parameter.

The summations in Eq. 10 are under the constraint that (i, j) ≠ (|V1|+1, |V2|+1); the extra nodes in the graphs are dummy nodes similar to slack variables. The authors then go on to derive the necessary update equations for M and C in order to minimize J(M, C) and propose an algorithm which updates these matrices in an alternating fashion.

9 MEAN AND MEDIAN OF GRAPHS

In addition to the graph matching approaches we have described, we should also mention the concepts of mean and median of a set of graphs [S. Günter, 2002]. These do not explicitly give us an indication of graph similarity, but are useful in summarizing a group of graphs. This is useful in applications such as clustering, where we need to represent a group of graphs by some exemplar graph that represents the cluster. The mean of two graphs [S. Günter, 2002] G1 and G2 is a graph g such that:

d(G1, g) = d(G2, g)
and
d(G1, G2) = d(G1, g) + d(g, G2)

In other words, a mean of two graphs G1 and G2 is a graph g that is equidistant from G1 and G2 and whose distance from G1 or G2 is equal to half the distance between G1 and G2. Clearly the mean will depend on the distance function chosen, and there may be more than one graph satisfying these conditions; it is also possible that no mean exists for a given pair of graphs. The weighted mean of two graphs [H. Bunke, 2001] G1 and G2 is a graph g such that:

d(G1, g) = α d(G1, G2)
and
d(G1, G2) = α d(G1, G2) + d(g, G2)

where the weight α satisfies 0 < α < 1. If α = 0.5, the weighted mean reduces to the mean defined above. An algorithm for finding the weighted mean of two graphs is given in [H. Bunke, 2001]. The method involves finding a subset of editing operations (given the lowest cost editing function between the graphs) for the given α in order to determine the mean graph. In [H. Bunke and A. Kandel, 2000], a theoretical proof is given that any graph g such that mcs(G1, G2) ⊆ g ⊆ G1 or mcs(G1, G2) ⊆ g ⊆ G2 is a mean of G1 and G2. Thus the problem becomes finding a graph that is a supergraph of the maximum common subgraph, but a subgraph of one of the original graphs. Finally, we have the concept of the median of a set of graphs, which acts like a representative of the set. The median of a set of graphs S [H. Bunke, 2001] is a graph g ∈ S such that g has the lowest average distance to all elements in S:

$$g = \arg\min_{s \in S} \frac{1}{|S|} \sum_{G_i \in S} d(s, G_i)$$

Since g ∈ S, it is straightforward (and relatively inexpensive) to simply compute the average distance to all graphs for each graph in S. Further, the median of a set of graphs always exists; it may or may not also be a mean.
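Because the median is chosen from the set itself, it can be found by a direct scan, as the following sketch shows. The function set_median works with any distance function; the toy distance used here (the number of edges present in exactly one of two graphs, for graphs that share uniquely labeled nodes) is an assumption made only to have something concrete to run.

```python
def set_median(graphs, dist):
    """Element of the set with the lowest average distance to all elements."""
    return min(graphs, key=lambda g: sum(dist(g, h) for h in graphs) / len(graphs))

def edge_difference(g, h):
    """Toy distance: edges that appear in exactly one of the two graphs."""
    return len(g ^ h)

# Three graphs over the same labeled nodes, given by their edge sets.
ga = frozenset({("a", "b")})
gb = frozenset({("a", "b"), ("b", "c")})
gc = frozenset({("a", "b"), ("b", "c"), ("c", "d")})
median = set_median([ga, gb, gc], edge_difference)
```

The pairwise distances are 1, 2, and 1, so gb has the lowest average distance and is returned as the median. The scan costs one distance evaluation per ordered pair of graphs in the set.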

10 CONCLUSIONS

In this paper we have given a survey of the most popular methods for determining graph similarity. Graph isomorphism finds an exact 1-to-1 matching between identical graphs and was the earliest approach to graph matching. Unfortunately, it cannot handle inexact graph matching. Graph edit distance is a popular approach that can deal with inexact matching. It determines the cost of a sequence of edit operations needed to transform one graph into another. Methods such as state space search and relaxation have also been applied to the problem of determining graph similarity. These techniques are often used to provide a sub-optimal approximation when the original problem is NP-complete or has a high potential for combinatorial explosion. For example, state space search can be used if we represent the matchings or edit sequences between graphs as states, and then execute a search strategy for the state with the lowest cost.

As we have seen, the approaches often have similarities with one another. For example, probability appears not just in the probabilistic approach, but also in the cost functions of graph edit distance and some state space search approaches.

REFERENCES

A. McCallum and K. Nigam, 1998, “A comparison of event models for Naive Bayes text classification”, AAAI–98 Workshop on Learning for Text Categorization, 1998.

A. Sanfeliu and K. S. Fu, 1983, “A distance measure between attributed relational graphs for pattern recognition”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 13, 1983, pp. 353–363.

B. Huet and E. R. Hancock, 1999, “Shape recognition from large image libraries by inexact graph matching”, Pattern Recognition Letters, Vol. 20, 1999, pp. 1259–1269.

C. Apte, F. Damerau, and S. M. Weiss, 1994, “Automated Learning of Decision Rules for Text Categorization”, ACM Transactions on Information Systems, Vol. 12, 1994, pp. 233–251.

G. Chartrand, G. Kubicki, and M. Schultz, 1998, “Graph similarity and distance in graphs”, Aequationes Mathematicae, Vol. 55, 1998, pp. 129–145.

G. Levi, 1972, “A note on the derivation of maximal common subgraphs of two directed or undirected graphs”, Calcolo, Vol. 9, 1972, pp. 341–354.

H. Bunke and A. Kandel, 2000, “Mean and maximum common subgraph of two graphs”, Pattern Recognition Letters, Vol. 21, 2000, pp. 163–168.

H. Bunke and K. Shearer, 1998, “A graph distance metric based on the maximal common subgraph”, Pattern Recognition Letters, Vol. 19, 1998, pp. 255–259.

H. Bunke, 1999, “Error Correcting Graph Matching: On the Influence of the Underlying Cost Function”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 21, No. 9, September 1999, pp. 917–922.

H. Bunke, X. Jiang, and A. Kandel, 2000, “On the Minimum Common Supergraph of Two Graphs”, Computing, Vol. 65, 2000, pp. 13–25.

H. Bunke, S. Günter, and X. Jiang, 2001, “Towards Bridging the Gap between Statistical and Structural Pattern Recognition: Two New Concepts in Graph Matching”, in Advances in Pattern Recognition - ICAPR 2001, S. Singh, N. Murshed, and W. Kropatsch (Eds.), Springer Verlag, LNCS 2013, 2001, pp. 1–11.

H. Bunke, 1997, “On a relation between graph edit distance and maximum common subgraph”, Pattern Recognition Letters, Vol. 18, 1997, pp. 689–694.

H. Bunke, 2000, “Recent developments in graph matching”, Proceedings of the 15th International Conference on Pattern Recognition, Vol. 2, Barcelona, 2000, pp. 117–124.

J. J. McGregor, 1982, “Backtrack search algorithms and the maximal common subgraph problem”, Software Practice and Experience, Vol. 12, 1982, pp. 23–34.

J. T. L. Wang, K. Zhang, and G.-W. Chirn, 1995, “Algorithms for Approximate Graph Matching”, Information Sciences, Vol. 82, 1995, pp. 45–74.

K. Haris, S. N. Efstratiadis, N. Maglaveras, C. Pappas, J. Gourassas, and G. Louridas, 1999, “Model-Based Morphological Segmentation and Labeling of Coronary Angiograms”, IEEE Transactions on Medical Imaging, Vol. 18, No. 10, October 1999, pp. 1003–1015.

K.-C. Tai, 1979, “The tree-to-tree correction problem”, Journal of the Association for Computing Machinery, Vol. 26, No. 3, 1979, pp. 422–433.

M.-L. Fernández and G. Valiente, 2001, “A graph distance metric combining maximum common subgraph and minimum common supergraph”, Pattern Recognition Letters, Vol. 22, 2001, pp. 753–758.

R. A. Wagner and M. J. Fischer, 1974, “The String-to-String Correction Problem”, Journal of the Association for Computing Machinery, Vol. 21, 1974, pp. 168–173.

R. C. Wilson and E. R. Hancock, 1997, “Structural Matching by Discrete Relaxation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 6, June 1997, pp. 634–

S. Gold and A. Rangarajan, 1996, “A Graduated Assignment Algorithm for Graph Matching”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 4, April 1996, pp. 377–388.

S. Günter and H. Bunke, 2002, “Self-organizing map for clustering in the graph domain”, Pattern Recognition Letters, Vol. 23, 2002, pp. 405–417.

S. Medasani, R. Krishnapuram, and Y. S. Choi, 2001, “Graph Matching by Relaxation of Fuzzy Assignments”, IEEE Transactions on Fuzzy Systems, Vol. 9, No. 1, February 2001, pp. 173–182.

S. Wei, S. Jun, and Z. Huicheng, 2001, “A fingerprint recognition system by use of graph matching”, Proceedings of SPIE, Vol. 4554, 2001, pp. 141–146.

T. H. Cormen, C. E. Leiserson, and R. L. Rivest, 1997, Introduction to Algorithms, The MIT Press: Cambridge, Massachusetts, 1997.

T. P. and C. Siva Ram Murthy, 1999, “Optimal task allocation in distributed systems by graph matching and state space search”, The Journal of Systems and Software, Vol. 46, 1999, pp. 59–75.

Avinash N Bhute (ACM M’09, CSIM ’09, ISTE LM ’05) is Assistant Professor at Sinhgad College of Engineering, Pune. He received his bachelor degree in Computer Science and Engineering from Amravati University in 1999 and his M.Tech. from Bharati Vidhyapith, Pune in 2005. He is an active member of ACM, CSI, ISTE, and the Indian Science Congress Association. He recently reviewed the book “Software Engineering”, 7th Edition, by Stephen Schach, McGraw Hill Publication. He has published seven international papers and 12 national papers. His current research interests include knowledge discovery in databases, mining the web, ontology, artificial intelligence, and software engineering.

Harsha Bhute (CSI M’03, ISTE M’08) is a student pursuing a Master of Engineering in Computer Network at S.C.O.E., Pune. She received her bachelor degree in Computer Science and Engineering from Amravati University, Maharashtra, India in 1999. She was a lecturer at Govt. Polytechnic for 6 years. She has been an active member of CSI since 2003. She has published two international and seven national papers. Her areas of interest include mobile communication, wireless networks, and system programming.

Dr. B. B. Meshram (CSI LM’95, IE ’95) is Professor and Head of the Computer Technology Department at VJTI, Matunga, Mumbai, Maharashtra State, India. He received bachelor, master, and doctoral degrees in computer engineering. He has participated in more than 16 refresher courses to meet the needs of current technology. He has chaired more than 10 AICTE STTP programs. He has received appreciation for lectures at Manchester and Cardiff University, UK. He has contributed more than 50 research papers to national and international journals. He is a life member of the Computer Society of India and the Institute of Engineers. His current research interests are in databases, data warehousing, data mining, intelligent systems, web engineering, and network security.