data mining: concepts and techniques — chapter 9 — graph mining and social network analysis

April 21, 2023

Data Mining: Concepts and Techniques

— Chapter 9 —Graph mining and Social Network Analysis

Li Xiong

Slides credits: Jiawei Han and Micheline Kamber

Graph Mining and Social Network Analysis

Graph mining
  Frequent subgraph mining

Social network analysis
  Social network
  Social network analysis at different levels
  Link analysis

Mining and Searching Graphs in Graph Databases

Graph Mining

Methods for Mining Frequent Subgraphs

Applications:

Graph Indexing

Similarity Search

Classification and Clustering

Summary



Why Graph Mining?

Graphs are ubiquitous
  Chemical compounds (cheminformatics)
  Protein structures, biological pathways/networks (bioinformatics)
  Program control flow, traffic flow, and workflow analysis
  XML databases, Web, and social network analysis
Graph is a general model
  Trees, lattices, sequences, and items are degenerate graphs
Diversity of graphs
  Directed vs. undirected, labeled vs. unlabeled (edges and vertices), weighted, with angles and geometry (topological vs. 2-D/3-D)
Complexity of algorithms: many problems are of high complexity



Graph, Graph, Everywhere

(Figures: aspirin molecule; yeast protein interaction network, from H. Jeong et al., Nature 411, 41 (2001); the Internet; co-author network)



Graph Pattern Mining

Frequent subgraph mining
  Finding frequent subgraphs within a single graph
  Finding frequent (sub)graphs in a set of graphs
  A (sub)graph is frequent if its support (occurrence frequency) is no less than a minimum support threshold
Applications of graph pattern mining
  Mining biochemical structures, program control flow analysis, XML structures, or Web communities
  Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
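
The support definition above can be illustrated on the simplest possible patterns, single labeled edges. The sketch below is a minimal illustration, not a full subgraph miner: it assumes each graph is given as a list of undirected edges written as pairs of vertex labels, and the function name and toy data are hypothetical.

```python
from collections import Counter

def frequent_edge_patterns(graphs, min_support):
    """Return 1-edge patterns whose support is at least min_support.

    Each graph is a list of undirected edges given as (label_u, label_v).
    Support is the number of graphs that contain the pattern at least once.
    """
    support = Counter()
    for edges in graphs:
        # Sort the labels so (C, O) and (O, C) are the same pattern, and use
        # a set so each graph contributes at most once per pattern.
        patterns = {tuple(sorted(edge)) for edge in edges}
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}

# Hypothetical toy dataset: three small "molecules" as atom-label pairs.
toy_graphs = [
    [("C", "C"), ("C", "O"), ("C", "N")],
    [("C", "C"), ("O", "C")],
    [("C", "N"), ("N", "O")],
]
frequent = frequent_edge_patterns(toy_graphs, min_support=2)
print(frequent)
```

Larger patterns need a subgraph isomorphism test to count occurrences, which is what makes the general problem expensive.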



Example: Frequent Subgraph Mining in Chemical Compounds

(Figure: a graph dataset of chemical compounds (A), (B), (C) and the frequent patterns (1), (2) found with minimum support 2)



Graph Mining Algorithms

Finding interesting and frequent substructures in a single graph
  SUBDUE
Finding frequent patterns in a set of independent graphs
  Apriori-based approach
  Pattern-growth approach


SUBDUE (Holder et al. KDD’94)

Problem
  Finding “interesting” and repetitive substructures (connected subgraphs) in data represented as a graph
Basic idea
  Minimum description length (MDL) principle
  Beam search algorithm
    Start with best single vertices
    Expand best substructures with a new edge
    Substructures are evaluated based on their ability to compress input graphs

Minimum Description Length (MDL)

Minimum description length (MDL) principle
  A formalization of Occam’s Razor
  Best hypothesis minimizes description length of the data (largest compression)
Graph substructure discovery based on MDL
  Description length (DL): represent vertices and adjacency matrix
  Graph compression: replace substructure instances with pointers
  Find best substructure S in G that minimizes: DL(S) + DL(G|S)

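(Figure, from Holder et al.: input database G with rectangle R1, circle C1, triangles T1 to T4, and squares S1 to S4; substructure S1, a triangle on a square; compressed database G|S1, in which each triangle-on-square instance is replaced by a pointer to S1)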
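
To make the objective DL(S) + DL(G|S) concrete, here is a rough sketch that uses a deliberately crude stand-in for description length: one unit per vertex plus one unit per edge. This proxy is an assumption made only for illustration; SUBDUE's actual encoding is based on the bit cost of vertex labels and the adjacency matrix. The function names and the counts in the example are hypothetical.

```python
def description_length(num_vertices, num_edges):
    """Crude proxy for DL: one unit per vertex plus one unit per edge."""
    return num_vertices + num_edges

def compressed_length(g_vertices, g_edges, s_vertices, s_edges, instances):
    """DL(S) + DL(G|S) when each non-overlapping instance of S becomes one node."""
    dl_s = description_length(s_vertices, s_edges)
    # Each instance loses its internal vertices and edges and is replaced
    # by a single pointer node in the compressed graph.
    v_after = g_vertices - instances * s_vertices + instances
    e_after = g_edges - instances * s_edges
    return dl_s + description_length(v_after, e_after)

# Hypothetical counts: a graph with 10 vertices and 9 edges that contains
# 4 non-overlapping instances of a substructure with 2 vertices and 1 edge.
before = description_length(10, 9)
after = compressed_length(10, 9, 2, 1, 4)
print(before, after)  # 19 14
```

Under this proxy, a substructure is worth keeping only if the total after compression is smaller than the description length of the original graph.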

Beam Search Algorithm

Beam search
  An optimization of best-first search
  Breadth-first search with a predetermined number of paths kept as candidates (beam width)
Subgraph discovery based on beam search
  Start with best single vertices
  Expand best substructures with a new edge
  Substructures are evaluated based on their ability to compress input graphs (minimize description length)
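
The generic beam search loop can be sketched as follows. This is a minimal, domain-independent illustration rather than SUBDUE itself: `expand` and `score` are caller-supplied placeholders (in SUBDUE they would add an edge to a substructure and measure compression), states are assumed to be hashable, and the toy search problem is hypothetical.

```python
import heapq

def beam_search(initial, expand, score, beam_width, max_steps):
    """Keep only the beam_width best states per level; return the best state seen."""
    beam = list(initial)
    best = max(beam, key=score)
    for _ in range(max_steps):
        # Expand every state in the beam; a set removes duplicate candidates.
        candidates = {child for state in beam for child in expand(state)}
        if not candidates:
            break
        # nlargest returns the candidates sorted from best to worst.
        beam = heapq.nlargest(beam_width, candidates, key=score)
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

# Toy problem: reach the number 10 from 1 using the moves "+1" and "*2".
target = 10
result = beam_search(
    initial=[1],
    expand=lambda x: [x + 1, x * 2],
    score=lambda x: -abs(target - x),
    beam_width=2,
    max_steps=6,
)
print(result)  # 10
```

A small beam width trades completeness for speed: promising candidates that score poorly early on are pruned and never revisited.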


Algorithm

1. Create substructure for each unique vertex label

Substructures (S): triangle (4), square (4), circle (1), rectangle (1)

(Figure: input database G and its graph form, in which the circle, rectangle, triangle, and square vertices are connected by "on" edges)


Algorithm (cont.)

2. Expand best substructures by an edge or edge + neighboring vertex

Substructures (S): two-vertex candidates joined by an "on" edge, pairing triangle with square, rectangle with square, rectangle with triangle, and rectangle with circle

(Figure: the candidate substructures shown next to the graph form of G)

Algorithm (cont.)

3. Keep best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph with hierarchical description

Source: Holder et al., SRL Workshop



Frequent Subgraph Mining Approaches

Problem: finding frequent subgraphs in a set of graphs
Apriori-based approach
  AGM: Inokuchi, et al. (PKDD’00)
  FSG: Kuramochi and Karypis (ICDM’01)
  PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
  FFSM: Huan, et al. (ICDM’03)
Pattern-growth approach
  MoFa: Borgelt and Berthold (ICDM’02)
  gSpan: Yan and Han (ICDM’02)
  Gaston: Nijssen and Kok (KDD’04)
Closed pattern mining
  CloseGraph: Yan & Han (KDD’03)


Apriori-Based Approach

(Figure: graph dataset G = {G1, G2, …, Gn}; frequent subgraphs G’ and G’’ are combined by JOIN into candidate subgraphs with an extra vertex or edge)

Level-wise algorithm: building candidate subgraphs from small frequent subgraphs



Apriori-Based Search

AGM (Apriori-based Graph Mining), Inokuchi, et al. PKDD’00
  generates new graphs with one more node
FSG (Frequent SubGraph mining), Kuramochi and Karypis, ICDM’01
  generates new graphs with one more edge



Pattern Growth Method

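(Figure: starting from dataset G = {G1, G2, …, Gn}, a k-edge pattern is grown into (k+1)-edge and then (k+2)-edge patterns; the same graph can be reached along different growth paths, producing duplicate graphs)

Depth-first search and right-most extension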


gSpan (Yan and Han ICDM’02)

Graph Mining

Methods for Mining Frequent Subgraphs

Applications:

Classification and Clustering

Graph Indexing

Similarity Search



Using Graph Patterns

Similarity measures based on graph patterns
  Feature-based similarity measure
    Each graph is represented as a feature vector
    Frequent subgraphs can be used as features
    Vector distance
  Structure-based similarity measure
    Maximal common subgraph
    Graph edit distance: insertion, deletion, and relabel
Frequent and discriminative subgraphs are high-quality indexing features
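
The feature-based measure can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions: each graph is summarized by the set of patterns it contains (the pattern names below are hypothetical strings, and detecting containment would require subgraph matching that is not shown), features are binary, and the distance is Euclidean.

```python
import math

def feature_vector(graph_patterns, feature_patterns):
    """Binary vector: 1 if the graph contains the pattern, else 0."""
    return [1 if p in graph_patterns else 0 for p in feature_patterns]

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical frequent patterns used as features.
features = ["C-C", "C-O", "C-N", "N-O"]
g1 = feature_vector({"C-C", "C-O"}, features)
g2 = feature_vector({"C-C", "C-N", "N-O"}, features)
distance = euclidean_distance(g1, g2)
print(g1, g2, distance)
```

Two graphs that contain the same selected patterns get distance 0 even if they differ elsewhere, which is why the choice of discriminative features matters.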

Social Network Analysis

Social network
Different levels of social network analysis
Common measures and methods for social network analysis
Link analysis



Social Network

Social network: a social structure consisting of nodes and ties
Nodes are the individual actors within the networks
  May be of different kinds
  May have attributes, labels, or classes
Ties are the relationships between the actors
  May be of different kinds
  Links may have attributes and may be directed or undirected
Homogeneous networks
  Single object type and single link type
  Single model social networks (e.g., friends)
  WWW: a collection of linked Web pages
Heterogeneous networks
  Multiple object and link types
  Medical network: patients, doctors, diseases, contacts, treatments
  Bibliographic network: publications, authors, venues



http://en.wikipedia.org/wiki/Node_%28computer_science%29

Small World Phenomenon

Number of degrees of separation in actual social networks?
Six degrees of separation: everyone is, on average, six "steps" away from any other person on Earth
Empirical studies
  Michael Gurevich, 1961: US population linked by 2 intermediaries
  Duncan Watts, 2001: email delivery on the Internet, average number of intermediaries is 6
  Leskovec and Horvitz, 2007: instant messages, average path length is 6.6



Six Degrees of Kevin Bacon

Vertices: actors and actresses
Edge between u and v if they appeared in a film together

Is Kevin Bacon the most connected actor? NO!

Rank  Name               Average distance  # of movies  # of links
1     Rod Steiger        2.537527          112          2562
2     Donald Pleasence   2.542376          180          2874
3     Martin Sheen       2.551210          136          3501
4     Christopher Lee    2.552497          201          2993
5     Robert Mitchum     2.557181          136          2905
6     Charlton Heston    2.566284          104          2552
7     Eddie Albert       2.567036          112          3333
8     Robert Vaughn      2.570193          126          2761
9     Donald Sutherland  2.577880          107          2865
10    John Gielgud       2.578980          122          2942
11    Anthony Quinn      2.579750          146          2978
12    James Earl Jones   2.584440          112          3787
…
876   Kevin Bacon        2.786981          46           1811

Kevin Bacon
  No. of movies: 46
  No. of actors: 1811
  Average separation: 2.79

Social Network Analysis

Actor level: centrality, prestige, and roles such as isolates, liaisons, bridges, etc.
Dyadic level: distance and reachability, structural and other notions of equivalence, and tendencies toward reciprocity
Triadic level: balance and transitivity
Subset level: cliques, cohesive subgroups, components
Network level: connectedness, diameter, centralization, density, prestige, etc.

Source: Social network analysis: methods and applications

Measures in Social Network Analysis – Actor level

Non-directional graphs
  Degree centrality
    The number of direct connections a node has
    A node with high degree is a 'connector' or 'hub' in the network
  Betweenness centrality
    The degree to which an individual lies between other individuals in the network
    An intermediary, liaison, or bridge
  Closeness centrality
    The degree to which an individual is near all other individuals in a network (directly or indirectly)
  Eigenvector centrality
    A measure of the relative importance of a node
    Based on the principle that connections to nodes having a high score contribute more to the score of the current node
Directional graphs
  Prestige: measures the degree of incoming ties
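
Two of the actor-level measures are easy to compute directly. The sketch below assumes an undirected, connected graph with at least two nodes, stored as a dictionary from node to neighbor set, and uses one common normalization of closeness, (n - 1) divided by the sum of shortest-path distances; other normalizations exist. All function names and the toy star graph are illustrative.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adj):
    """Number of direct connections of each node."""
    return {u: len(neighbors) for u, neighbors in adj.items()}

def closeness_centrality(adj):
    """(n - 1) / (sum of distances to all other nodes); assumes a connected graph."""
    n = len(adj)
    return {u: (n - 1) / sum(bfs_distances(adj, u).values()) for u in adj}

# Toy star graph: hub "a" connected to "b", "c", and "d".
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
degree = degree_centrality(star)
closeness = closeness_centrality(star)
print(degree, closeness)
```

In the star graph the hub has both the highest degree and the highest closeness, since it reaches every other node in one hop.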



Actor Centrality Example

Source: OrgNet.com

Measures in Social Network Analysis – Dyadic, Triadic and Subset Level

Path length
  The distances between pairs of nodes in the network
Structural equivalence
  Extent to which actors have a common set of linkages to other actors in the system
Clustering coefficient
  A measure of the likelihood that two associates of a node are associates themselves
  Cliquishness of u’s neighborhood
Cohesion
  The degree to which actors are connected directly to each other by cohesive bonds
  Cliques
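
The clustering coefficient of a node u can be computed as the fraction of pairs of u's neighbors that are themselves linked. The sketch below assumes an undirected graph stored as a dictionary from node to neighbor set; returning 0.0 for nodes with fewer than two neighbors is a convention chosen here, and the toy graph is hypothetical.

```python
from itertools import combinations

def clustering_coefficient(adj, u):
    """Fraction of pairs of u's neighbors that are directly connected."""
    neighbors = adj[u]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention: no neighbor pairs to check
    links = sum(1 for v, w in combinations(neighbors, 2) if w in adj[v])
    return links / (k * (k - 1) / 2)

# Toy graph: triangle a-b-c plus a pendant node d attached to a.
g = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cc_a = clustering_coefficient(g, "a")
cc_b = clustering_coefficient(g, "b")
print(cc_a, cc_b)
```

Node a has three neighbor pairs of which only (b, c) is linked, so its coefficient is 1/3, while b's two neighbors are linked and its coefficient is 1.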



Measures in Social Network Analysis – Network Level

Network centralization
  The difference between the number of links for each node
  Centralized vs. decentralized networks
Network density
  Proportion of ties in a network relative to the total number possible
  Sparse vs. dense networks
Average path length
  Average of distances between all pairs of nodes
Reach
  The degree to which any member of a network can reach other members of the network
Structural cohesion
  The minimum number of members who, if removed from a group, would disconnect the group
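
Density and average path length follow directly from their definitions. This sketch assumes an undirected simple graph with at least two nodes, stored as a dictionary from node to neighbor set; the average is taken over reachable ordered pairs only, which coincides with the all-pairs average when the graph is connected. Names and the toy path graph are illustrative.

```python
from collections import deque

def density(adj):
    """Number of ties divided by the number of possible ties, n(n-1)/2."""
    n = len(adj)
    edges = sum(len(neighbors) for neighbors in adj.values()) / 2
    return edges / (n * (n - 1) / 2)

def average_path_length(adj):
    """Mean shortest-path distance over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Toy path graph 1 - 2 - 3 - 4.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
d = density(path)
apl = average_path_length(path)
print(d, apl)
```

For the four-node path, 3 of the 6 possible ties exist, and the six pairwise distances sum to 10, giving an average path length of 10/6.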




Another Taxonomy of Link Mining Tasks

Object-related tasks
  Link-based object ranking
  Link-based object classification
  Object clustering (group detection)
  Object identification (entity resolution)
Link-related tasks
  Link prediction
Graph-related tasks
  Subgraph discovery
  Graph classification
  Generative model for graphs

Social Network Applications

Link-based object ranking for WWW (actor-level analysis)
  PageRank
  HITS

Influence and diffusion




Link-Based Object Ranking (LBR)

Exploit the link structure of a graph to order or prioritize the set of objects within the graph
Focused on graphs with single object type and single link type
Focus of link analysis community
Algorithms
  PageRank
  HITS

PageRank: Ranking web pages (Brin & Page’98)

Intuition
  Web pages are not equally “important”
    www.joe-schmoe.com vs. www.stanford.edu
  Links as citations: a page cited often is more important
    www.stanford.edu has 23,400 inlinks
    www.joe-schmoe.com has 1 inlink
  Are all links equal?
    Recursive model: being cited by a highly cited paper counts a lot…
    Eigenvector prestige measure

Each link’s vote is proportional to the importance of its source page

If page P with importance x has n outlinks, each link gets x/n votes

Page P’s own importance is the sum of the votes on its inlinks

Simple Recursive Flow Model

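(Figure: three pages, Yahoo (y), Amazon (a), and M’soft (m); Yahoo sends y/2 to itself and y/2 to Amazon; Amazon sends a/2 to Yahoo and a/2 to M’soft; M’soft sends m to Amazon)

y = y/2 + a/2
a = y/2 + m
m = a/2

Solving the equations with the constraint y + a + m = 1:
y = 2/5, a = 2/5, m = 1/5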
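
The solution can be checked by substitution. The worked derivation below uses only the three flow equations and the normalization constraint:

```latex
\begin{align*}
y = \tfrac{y}{2} + \tfrac{a}{2} &\;\Rightarrow\; y = a \\
m &= \tfrac{a}{2} \\
y + a + m = 1 &\;\Rightarrow\; a + a + \tfrac{a}{2} = 1 \;\Rightarrow\; a = \tfrac{2}{5} \\
\text{hence}\quad y = \tfrac{2}{5},\quad m &= \tfrac{1}{5},
\qquad\text{and}\qquad \tfrac{y}{2} + m = \tfrac{1}{5} + \tfrac{1}{5} = \tfrac{2}{5} = a .
\end{align*}
```

The last check confirms that the second flow equation, which was not needed to find the solution, is also satisfied.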

Matrix formulation

Web link matrix M: one row and one column per web page
  Suppose page j has n outlinks; if j → i, then M_ij = 1/n, else M_ij = 0
  M is a column stochastic matrix: columns sum to 1
Rank vector r: one entry per web page
  r_i is the importance score of page i
  |r| = 1
Flow equation: r = Mr
  Rank vector is an eigenvector of the web matrix

Matrix formulation Example

(Figure: the Yahoo, Amazon, M’soft graph)

        y    a    m
   y   1/2  1/2   0
   a   1/2   0    1
   m    0   1/2   0

y = y/2 + a/2
a = y/2 + m
m = a/2

r = Mr

   [y]   [1/2  1/2   0 ] [y]
   [a] = [1/2   0    1 ] [a]
   [m]   [ 0   1/2   0 ] [m]

Power Iteration method

Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: r_0 = [1/N, …, 1/N]^T
Iterate: r_{k+1} = M r_k
Stop when |r_{k+1} - r_k|_1 < ε
  |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm
  Can use any other vector norm, e.g., Euclidean

Power Iteration Example

(Figure: the Yahoo, Amazon, M’soft graph)

        y    a    m
   y   1/2  1/2   0
   a   1/2   0    1
   m    0   1/2   0

       r_0   r_1   r_2    r_3    ...  limit
  y    1/3   1/3   5/12   3/8    ...  2/5
  a    1/3   1/2   1/3    11/24  ...  2/5
  m    1/3   1/6   1/4    1/6    ...  1/5
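
The iteration in this example can be reproduced with a short script. This is a minimal sketch using plain Python lists; the function name, the tolerance, and the iteration cap are choices made here, and the stopping rule is the L1 difference between successive rank vectors.

```python
def power_iteration(M, tol=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below tol."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < tol:
            return r_next
        r = r_next
    return r

# Column-stochastic link matrix, rows and columns ordered y, a, m.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
first_step = power_iteration(M, max_iter=1)
ranks = power_iteration(M)
print(first_step, ranks)
```

After one step the vector is (1/3, 1/2, 1/6), and the iteration settles near (2/5, 2/5, 1/5), matching the solution of the flow equations.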

Random Walk Interpretation

Imagine a random web surfer
  At any time t, surfer is on some page P
  At time t+1, the surfer follows an outlink from P uniformly at random
  Ends up on some page Q linked from P
  Process repeats indefinitely
p(t) is the probability distribution whose ith component is the probability that the surfer is at page i at time t

The stationary distribution

Where is the surfer at time t+1?
  p(t+1) = Mp(t)
Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
  Then p(t) is a stationary distribution for the random walk
Our rank vector r satisfies r = Mr
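
The random surfer can also be simulated directly. The sketch below is a Monte Carlo illustration, not part of the original slides: the `links` dictionary encodes the Yahoo (y), Amazon (a), M’soft (m) example graph, the step count and seed are arbitrary choices, and the visit frequencies only approximate the stationary distribution.

```python
import random

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow uniformly random outlinks and return the visit frequency of each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

# Outlinks of the example graph: y -> {y, a}, a -> {y, m}, m -> {a}.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(links, start="y", steps=200_000)
print(freq)
```

With enough steps the frequencies come out close to 0.4, 0.4, and 0.2, the same values obtained from r = Mr.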

Existence and Uniqueness of the Solution

Theory of random walks (aka Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.



Spider traps

A group of pages is a spider trap if there are no links from within the group to outside the group
Spider traps violate the conditions needed for the random walk theorem

(Figure: the Yahoo, Amazon, M’soft graph in which M’soft links only to itself)

        y    a    m
   y   1/2  1/2   0
   a   1/2   0    0
   m    0   1/2   1

       r_0   r_1   r_2   r_3   ...  limit
  y    1     1     3/4   5/8   ...  0
  a    1     1/2   1/2   3/8   ...  0
  m    1     3/2   7/4   2     ...  3

Random teleports

At each time step, the random surfer has two options:
  With probability β, follow a link at random
  With probability 1-β, jump to some page uniformly at random
  Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps

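Random teleports Example (β = 0.8)

(Figure: the Yahoo, Amazon, M’soft graph with the spider trap at M’soft)

        [1/2  1/2   0 ]         [1/3  1/3  1/3]     [7/15  7/15   1/15]
  0.8 × [1/2   0    0 ] + 0.2 × [1/3  1/3  1/3]  =  [7/15  1/15   1/15]
        [ 0   1/2   1 ]         [1/3  1/3  1/3]     [1/15  7/15  13/15]

       r_0   r_1    r_2    r_3     ...  limit
  y    1     1.00   0.84   0.776   ...  7/11
  a    1     0.60   0.60   0.536   ...  5/11
  m    1     1.40   1.56   1.688   ...  21/11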

Matrix formulation

Matrix A
  A_ij = βM_ij + (1-β)/N
  M_ij = 1/|O(j)| when j → i and M_ij = 0 otherwise
  Verify that A is a stochastic matrix
The page rank vector r is the principal eigenvector of this matrix, satisfying r = Ar
Equivalently, r is the stationary distribution of the random walk with teleports
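
PageRank with teleports can be sketched by adding the teleport term to the power iteration. In this sketch the parameter `beta` stands for the probability of following a link (so 1 - beta is the teleport probability); the function name, tolerance, and iteration cap are choices made here. The matrix is the spider-trap example, and the rank vector is normalized to sum to 1, so the limit (7/11, 5/11, 21/11) of the slide example appears divided by 3.

```python
def pagerank(M, beta=0.8, tol=1e-12, max_iter=1000):
    """Power iteration on A = beta * M + (1 - beta) / N without building A explicitly."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        # Every page receives the same teleport share of the total rank.
        teleport = (1.0 - beta) * sum(r) / n
        r_next = [
            beta * sum(M[i][j] * r[j] for j in range(n)) + teleport
            for i in range(n)
        ]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < tol:
            return r_next
        r = r_next
    return r

# Spider-trap link matrix (rows and columns ordered y, a, m): m links only to itself.
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
ranks = pagerank(M_trap, beta=0.8)
print(ranks)
```

Without teleports the trap page would absorb all of the rank; with beta = 0.8 the other two pages keep a positive share.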


HITS: Capturing Authorities & Hubs (Kleinberg’98)

Intuitions
  Pages that are widely cited are good authorities
  Pages that cite many other pages are good hubs
HITS (Hypertext-Induced Topic Selection)
  1. Authorities are pages containing useful information and linked to by hubs
    course home pages
    home pages of auto manufacturers
  2. Hubs are pages that link to authorities
    course bulletin
    list of US auto manufacturers
Iterative reinforcement …

(Figure: hubs pointing to authorities)

Matrix Formulation

Transition (adjacency) matrix A
  A[i, j] = 1 if page i links to page j, 0 if not
The hub score vector h: score is proportional to the sum of the authority scores of the pages it links to
  h = λAa
  Constant λ is a scale factor
The authority score vector a: score is proportional to the sum of the hub scores of the pages it is linked from
  a = μA^T h
  Constant μ is a scale factor

Transition Matrix Example

(Figure: the Yahoo, Amazon, M’soft graph)

            y  a  m
       y  [ 1  1  1 ]
  A =  a  [ 1  0  1 ]
       m  [ 0  1  0 ]

Iterative algorithm

Initialize h, a to all 1’s
h = Aa
Scale h so that its max entry is 1.0
a = A^T h
Scale a so that its max entry is 1.0
Continue until h, a converge

Iterative Algorithm Example

      [1 1 1]          [1 1 0]
  A = [1 0 1]    A^T = [1 0 1]
      [0 1 0]          [1 1 0]

a(yahoo)   =  1  1  1    1     ...  1
a(amazon)  =  1  1  4/5  0.75  ...  0.732
a(m’soft)  =  1  1  1    1     ...  1

h(yahoo)   =  1  1    1     1     ...  1.000
h(amazon)  =  1  2/3  0.71  0.73  ...  0.732
h(m’soft)  =  1  1/3  0.29  0.27  ...  0.268
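
The HITS iteration for this example can be reproduced with the following sketch. It assumes the adjacency convention A[i][j] = 1 when page i links to page j, pages ordered Yahoo, Amazon, M’soft, and at least one link in the graph so that the scaling never divides by zero. The function name and stopping rule are choices made here.

```python
def hits(A, tol=1e-10, max_iter=1000):
    """Alternate h = A a and a = A^T h, scaling each so that its max entry is 1."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(max_iter):
        # Hub score: sum of authority scores of the pages i links to.
        h_new = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        h_max = max(h_new)
        h_new = [x / h_max for x in h_new]
        # Authority score: sum of hub scores of the pages linking to j.
        a_new = [sum(A[i][j] * h_new[i] for i in range(n)) for j in range(n)]
        a_max = max(a_new)
        a_new = [x / a_max for x in a_new]
        change = sum(abs(x - y) for x, y in zip(h_new, h))
        change += sum(abs(x - y) for x, y in zip(a_new, a))
        h, a = h_new, a_new
        if change < tol:
            break
    return h, a

A = [
    [1, 1, 1],  # Yahoo links to Yahoo, Amazon, M'soft
    [1, 0, 1],  # Amazon links to Yahoo, M'soft
    [0, 1, 0],  # M'soft links to Amazon
]
h1, a1 = hits(A, max_iter=1)
h, a = hits(A)
print(h, a)
```

The first pass gives h = (1, 2/3, 1/3) and a = (1, 4/5, 1), and the scores converge to roughly h = (1, 0.732, 0.268) and a = (1, 0.732, 1).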

Existence and Uniqueness of the Solution

h = λAa
a = μA^T h
h = λμAA^T h
a = λμA^T A a

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
• h* is the principal eigenvector of the matrix AA^T
• a* is the principal eigenvector of the matrix A^T A

Page Rank and HITS

Similarities
  Iterative algorithm based on the linkage of the documents on the web
  Same problem: what is the value of a link from S to D?
Different models
  PageRank: depends on the links into S
  HITS: depends on the value of the other links out of S
The destinies of PageRank and HITS post-1998
  PageRank: trademark of Google
  HITS: not commonly used by search engines (Ask.com?)

http://en.wikipedia.org/wiki/Ask.com

Social Network Analysis Applications

Link-based object ranking for WWW (actor-level analysis)
  PageRank
  HITS

Influence and diffusion



Influence and Diffusion

(Figures: OrgNet.com; CDC: Spread of Airborne Disease)

Coming Up

Paper presentations:
  Knowledge discovery from transportation network data
  Maximizing the spread of influence through a social network
  Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography



References (1)

T. Asai, et al. “Efficient substructure discovery from large semi-structured data”, SDM'02

C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant

substructures of molecules”, ICDM'02

D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational

Networks”, PKDD'05.

M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based

Approaches for Classifying Chemical Compounds”, ICDM 2003

M. Deshpande, M. Kuramochi, and G. Karypis. “Automated approaches for classifying

structures”, BIOKDD'02

L. Dehaspe, H. Toivonen, and R. King. “Finding frequent substructures in chemical

compounds”, KDD'98

C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of 'Connection Subgraphs'”, KDD'04

H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed

Molecular Graphs”, ICML’05

T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient

Alternatives”, COLT/Kernel’03



References (2)

L. Holder, D. Cook, and S. Djoko. “Substructure discovery in the subdue

system”, KDD'94

J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha.

“Mining spatial motifs from protein structure graphs”, RECOMB’04

J. Huan, W. Wang, and J. Prins. “Efficient mining of frequent subgraph in the

presence of isomorphism”, ICDM'03

H. Hu, X. Yan, Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs

across Massive Biological Networks for Functional Discovery”, ISMB'05

A. Inokuchi, T. Washio, and H. Motoda. “An apriori-based algorithm for mining

frequent substructures from graph data”, PKDD'00

C. James, D. Weininger, and J. Delany. “Daylight Theory Manual Daylight

Version 4.82”. Daylight Chemical Information Systems, Inc., 2003.

G. Jeh, and J. Widom, “Mining the Space of Graph Properties”, KDD'04

H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between

Labeled Graphs”, ICML’03



References (3)

M. Koyuturk, A. Grama, and W. Szpankowski. “An efficient algorithm for detecting

frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207, 2004.

T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph

Classification”, NIPS’04

M. Kuramochi and G. Karypis. “Frequent subgraph discovery”, ICDM'01

M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery

Algorithm”, ICDM’04

C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for 'Backtrace' of Noncrashing Bugs”, SDM'05

P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph

Kernels”, ICML’04

B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45--87, 1981.

S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference.

KDD'04

J. Prins, J. Yang, J. Huan, and W. Wang. “Spin: Mining maximal frequent subgraphs

from graph databases”. KDD'04



References (4)

D. Shasha, J. T.-L. Wang, and R. Giugno. “Algorithmics and applications of tree and graph searching”, PODS'02

J. R. Ullmann. “An algorithm for subgraph isomorphism”, J. ACM, 23:31--42, 1976.

N. Vanetik, E. Gudes, and S. E. Shimony. “Computing frequent graph patterns from semistructured data”, ICDM'02

C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. “Scalable mining of large disk-based graph databases”, KDD'04

T. Washio and H. Motoda, “State of the art of graph-based data mining”, SIGKDD Explorations, 5:59-68, 2003

X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, ICDM'02

X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03

X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-based Approach”, SIGMOD'04

X. Yan, X. J. Zhou, and J. Han, “Mining Closed Relational Graphs with Connectivity Constraints”, KDD'05

X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05

X. Yan, F. Zhu, J. Han, and P. S. Yu, “Searching Substructures with Superimposed Distance”, ICDE'06

M. J. Zaki. “Efficiently mining frequent trees in a forest”, KDD'02


Ref: Mining on Social Networks

D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. CIKM’03

P. Domingos and M. Richardson, Mining the Network Value of Customers. KDD’01

M. Richardson and P. Domingos, Mining Knowledge-Sharing Sites for Viral Marketing. KDD’02

D. Kempe, J. Kleinberg, and E. Tardos, Maximizing the Spread of Influence through a Social Network. KDD’03.

P. Domingos, Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1), 80-82, 2005.

S. Brin and L. Page, The anatomy of a large scale hypertextual Web search engine. WWW7.

S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Mining the link structure of the World Wide Web. IEEE Computer’99

D. Cai, X. He, J. Wen, and W. Ma, Block-level Link Analysis. SIGIR'2004.
