
TopK Interesting Subgraph Discovery in Information Networks

Manish Gupta, Jing Gao, Xifeng Yan, Hasan Cam, Jiawei Han

[email protected]/19/2023


Real World Problems

• Computer Networks: Network Bottlenecks Discovery
  – Interestingness = Lowest Bandwidth
• Organization Networks: Team Selection
  – Interestingness = Highest Historical Compatibility
• Battlefield Networks: Resource Allocation
  – Interestingness = Lowest Distance between Entities
• Social Networks: Suspicious Relationships Discovery
  – Interestingness = Highest Negative Association Strength of Attribute Values

The Basic Underlying Problem

• Network Bottlenecks Discovery: Interestingness = Lowest Bandwidth
• Team Selection: Interestingness = Highest Historical Compatibility
• Suspicious Relationships Discovery: Interestingness = Highest Negative Association Strength
• Resource Allocation: Interestingness = Lowest Distance

• Given
  – Edge-weighted Typed Network G
  – Typed Subgraph Query Q
  – Edge Interestingness measure
• Find
  – TopK matching subgraphs

Naïve Solution: Ranking After Matching

[Figure: Network G (nodes 1 to 13 of types A, B, C, with weighted edges), Query Q (nodes 1, 2, 3 of type A and node 4 of type B), and the nine matches M1 to M9 produced by the Matching step]

• Two stages: Matching, then Ranking
• Match Scores after Ranking: 2.2, 2.2, 2.1, 2.0, 1.8, 1.8, 1.7, 1.6, 1.4
• Why compute all matches? We need only top-2!
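The baseline can be sketched in a few lines. The block below is an illustrative sketch, not the authors' implementation: it assumes Query Q is the path 1-2-3-4 with node types A, A, A, B (a reading that reproduces the nine match scores listed on the slide), it takes the edge weights from the sorted edge lists shown later in the deck, and the names `all_matches` and `rank_after_matching` are hypothetical.

```python
# Example network: node types and weighted, undirected edges.
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}
EDGES = {(5, 9): 0.9, (3, 4): 0.8, (4, 5): 0.8, (2, 3): 0.7, (8, 9): 0.6,
         (9, 10): 0.3, (12, 13): 0.2, (2, 7): 0.7, (5, 6): 0.6, (8, 7): 0.5,
         (2, 1): 0.2, (4, 7): 0.1, (3, 12): 0.5, (4, 12): 0.4, (3, 13): 0.4,
         (2, 13): 0.3, (7, 11): 0.2, (1, 11): 0.1}

# Assumed query: path over positions 0-1-2-3 with types A, A, A, B.
QUERY_TYPES = ["A", "A", "A", "B"]
QUERY_EDGES = [(0, 1), (1, 2), (2, 3)]


def build_adjacency(edges):
    """Undirected adjacency map: node -> {neighbor: weight}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def all_matches(node_type, edges, q_types, q_edges):
    """Matching step: enumerate every match by backtracking, with its score."""
    adj = build_adjacency(edges)
    matches = []

    def extend(assign):
        i = len(assign)
        if i == len(q_types):
            score = sum(adj[assign[a]][assign[b]] for a, b in q_edges)
            matches.append((round(score, 6), tuple(assign)))
            return
        # Query edges that connect position i to an already assigned position.
        needed = [min(a, b) for a, b in q_edges if max(a, b) == i]
        for node, ntype in node_type.items():
            if ntype != q_types[i] or node in assign:
                continue
            if all(assign[p] in adj.get(node, {}) for p in needed):
                extend(assign + [node])

    extend([])
    return matches


def rank_after_matching(k):
    """Ranking step: sort all matches by score and keep the best k."""
    found = all_matches(NODE_TYPE, EDGES, QUERY_TYPES, QUERY_EDGES)
    return sorted(found, reverse=True)[:k]


if __name__ == "__main__":
    for score, match in rank_after_matching(2):
        print(match, score)
```

All nine matches are enumerated and scored even though only two are returned, which is the cost the rest of the deck sets out to avoid.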


Our Contributions

• New notion: TopK interesting subgraph detection in information networks

• Three new low-cost indexes
  – Graph topology index
  – Sorted edge lists
  – Graph maximum metapath weight index

• Novel top-K algorithm to answer interestingness queries on large graphs

• Detailed effectiveness and efficiency validation on several synthetic and real datasets


Relationship with Previous Work

• Subgraph matching
  – Approximate: fuzzy node/edge similarity
  – Exact: Matching without ranking
  – RDF graphs, probabilistic graphs, temporal graphs

• TopK querying on graphs
  – H-hop aggregate queries
  – Keyword queries on RDF graphs
  – K most frequent patterns
  – Twig queries


System Overview

• Offline Index Construction
  – Inputs: Network G, Distance D
  – Breadth First Traversal from each Node up to Distance D: builds the Graph Topology Index and the Graph Maximum MetaPath Weight Index
  – Sort Edges: builds the Sorted Edge Lists
• Online Query Processing
  – Find Candidate Nodes: Query Q and the Graph Topology Index give the Candidate Nodes
  – Top-K Computation: produces the Top-K Subgraphs

Index Structures
G=(V,E), B=avg #neighbors, T=#types

[Figure: Network G]

Sorted edge lists (one list per edge type, sorted by decreasing weight):
AA: (5,9):0.9, (3,4):0.8, (4,5):0.8, (2,3):0.7, (8,9):0.6, (9,10):0.3
BB: (empty)
CC: (12,13):0.2
AB: (2,7):0.7, (5,6):0.6, (8,7):0.5, (2,1):0.2, (4,7):0.1
AC: (3,12):0.5, (4,12):0.4, (3,13):0.4, (2,13):0.3
BC: (7,11):0.2, (1,11):0.1
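A minimal sketch of how such lists could be built is shown below. It is illustrative only: the function name `build_sorted_edge_lists` and the choice to key each list by the two endpoint types in alphabetical order (so "AB" covers both A-B and B-A edges) are assumptions, not details given on the slide.

```python
from collections import defaultdict

NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}
EDGES = {(5, 9): 0.9, (3, 4): 0.8, (4, 5): 0.8, (2, 3): 0.7, (8, 9): 0.6,
         (9, 10): 0.3, (12, 13): 0.2, (2, 7): 0.7, (5, 6): 0.6, (8, 7): 0.5,
         (2, 1): 0.2, (4, 7): 0.1, (3, 12): 0.5, (4, 12): 0.4, (3, 13): 0.4,
         (2, 13): 0.3, (7, 11): 0.2, (1, 11): 0.1}


def build_sorted_edge_lists(node_type, edges):
    """Group edges by endpoint-type pair and sort each group by weight."""
    lists = defaultdict(list)
    for (u, v), w in edges.items():
        # Assumed convention: the key is the two types in alphabetical order.
        key = "".join(sorted((node_type[u], node_type[v])))
        lists[key].append(((u, v), w))
    for key in lists:
        # Stable sort: edges with equal weight keep their input order.
        lists[key].sort(key=lambda item: item[1], reverse=True)
    return dict(lists)


if __name__ == "__main__":
    for edge_type, entries in sorted(build_sorted_edge_lists(NODE_TYPE, EDGES).items()):
        print(edge_type, entries)
```

Edge types with no edges (BB in this network) simply have no entry in the result.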

Index complexity table (columns: Index, Time Complexity, Space Complexity)
• Sorted edge lists
• Graph topology index
• Graph max metapath weight index

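Graph maximum metapath weight (MMW) index, D=2
(column XY: first hop reaches a type-X node, second hop a type-Y node; "-" marks an empty cell)

      d=1            d=2
Node  A    B    C    AA   BA   CA   AB   BB   CB   AC   BC   CC
1     0.2  -    0.1  0.9  -    -    0.9  -    0.3  0.5  -    -
2     0.7  0.7  0.3  1.5  1.2  0.7  -    -    -    1.2  0.9  0.5
3     0.8  -    0.5  1.6  -    0.9  1.4  -    -    1.2  -    0.7
4     0.8  0.1  0.4  1.7  0.8  0.9  1.4  -    -    1.3  0.3  0.6
5     0.9  0.6  -    1.6  -    -    0.9  -    -    1.2  -    -
6     0.6  -    -    1.5  -    -    -    -    -    -    -    -
7     0.7  -    0.2  1.4  -    -    0.9  -    0.3  1    -    -
8     0.6  0.5  -    1.5  1.2  -    -    -    -    -    0.7  -
9     0.9  -    -    1.7  -    -    1.5  -    -    -    -    -
10    0.3  -    -    1.2  -    -    -    -    -    -    -    -
11    -    0.2  -    -    0.9  -    -    -    -    -    -    -
12    0.5  -    0.2  1.3  -    0.6  0.5  -    -    0.9  -    -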

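Graph topology index, D=2
(column XY: first hop reaches a type-X node, second hop a type-Y node; "-" marks an empty cell)

      d=1        d=2
Node  A  B  C    AA  BA  CA  AB  BB  CB  AC  BC  CC
1     1  -  1    1   -   -   1   -   1   1   -   -
2     1  2  1    1   2   1   -   -   -   2   1   1
3     2  -  2    1   -   2   2   -   -   2   -   2
4     2  1  1    2   2   1   1   -   -   2   1   1
5     2  1  -    3   -   -   1   -   -   1   -   -
6     1  -  -    2   -   -   -   -   -   -   -   -
7     3  -  1    3   -   -   1   -   1   2   -   -
8     1  1  -    2   2   -   -   -   -   -   1   -
9     3  -  -    1   -   -   2   -   -   -   -   -
10    1  -  -    2   -   -   -   -   -   -   -   -
11    -  2  -    -   3   -   -   -   -   -   -   -
12    2  -  1    4   2   -   1   -   -   1   -   -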
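The two traversal-based indexes can be sketched together. The block below is a hedged illustration rather than the authors' code: it assumes that paths are simple (no node repeated), that the topology index counts the distinct nodes reachable along each metapath, and that the MMW index stores the heaviest path weight per metapath. These assumptions reproduce the example values in the deck (for instance the path-based bound UB(5-A-B) = 0.9 used later), and the name `build_indexes` is hypothetical.

```python
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}
EDGES = {(5, 9): 0.9, (3, 4): 0.8, (4, 5): 0.8, (2, 3): 0.7, (8, 9): 0.6,
         (9, 10): 0.3, (12, 13): 0.2, (2, 7): 0.7, (5, 6): 0.6, (8, 7): 0.5,
         (2, 1): 0.2, (4, 7): 0.1, (3, 12): 0.5, (4, 12): 0.4, (3, 13): 0.4,
         (2, 13): 0.3, (7, 11): 0.2, (1, 11): 0.1}


def build_indexes(node_type, edges, max_d=2):
    """Return (topology, mmw): per node, metapath -> count / max path weight."""
    adj = {n: {} for n in node_type}
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    topology, mmw = {}, {}
    for start in node_type:
        ends = {}  # metapath -> set of distinct end nodes
        best = {}  # metapath -> heaviest path weight seen so far
        stack = [((start,), "", 0.0)]
        while stack:
            path, metapath, weight = stack.pop()
            if metapath:
                ends.setdefault(metapath, set()).add(path[-1])
                best[metapath] = max(best.get(metapath, 0.0), weight)
            # Types are single letters, so len(metapath) is the hop count.
            if len(metapath) < max_d:
                for nxt, w in adj[path[-1]].items():
                    if nxt not in path:  # simple paths only
                        stack.append((path + (nxt,),
                                      metapath + node_type[nxt],
                                      weight + w))
        topology[start] = {m: len(s) for m, s in ends.items()}
        mmw[start] = {m: round(b, 6) for m, b in best.items()}
    return topology, mmw


if __name__ == "__main__":
    topology, mmw = build_indexes(NODE_TYPE, EDGES)
    print("topology[7]:", topology[7])
    print("mmw[5]:", mmw[5])
```

Both indexes come out of one traversal per node, which matches the single "Breadth First Traversal from each Node up to Distance D" box in the system overview; a depth-first stack is used here only for brevity.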


Find Candidate Nodes

[Figure: the Graph Topology Index is compared with the Query Topology of Query Q (nodes 1, 2, 3 of type A and node 4 of type B) to find candidate nodes]

Query Topology (D=2):
Query node  d=1        d=2
1           A:1        AA:1
2           A:2        AB:1
3           A:1, B:1   AA:1
4           A:1        AA:1

Candidate nodes by type:
• Query nodes 1, 2, 3 (type A): 2, 3, 4, 5, 8, 9, 10
• Query node 4 (type B): 1, 6, 7
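Candidate filtering with the topology index can be sketched as follows. This is an illustrative sketch under stated assumptions: Query Q is taken to be the path 1-2-3-4, a graph node is kept for a query node when it has the same type and at least as many reachable nodes for every metapath in the query node's topology, and the helper names `topology_index` and `find_candidates` are hypothetical.

```python
G_TYPES = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
           8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}
G_EDGES = [(5, 9), (3, 4), (4, 5), (2, 3), (8, 9), (9, 10), (12, 13),
           (2, 7), (5, 6), (8, 7), (2, 1), (4, 7), (3, 12), (4, 12),
           (3, 13), (2, 13), (7, 11), (1, 11)]

# Assumed query: path 1-2-3-4 with node types A, A, A, B.
Q_TYPES = {1: "A", 2: "A", 3: "A", 4: "B"}
Q_EDGES = [(1, 2), (2, 3), (3, 4)]


def topology_index(node_type, edge_pairs, max_d=2):
    """Per node: metapath -> number of distinct nodes reachable by simple paths."""
    adj = {n: set() for n in node_type}
    for u, v in edge_pairs:
        adj[u].add(v)
        adj[v].add(u)
    index = {}
    for start in node_type:
        ends = {}
        stack = [((start,), "")]
        while stack:
            path, metapath = stack.pop()
            if metapath:
                ends.setdefault(metapath, set()).add(path[-1])
            if len(metapath) < max_d:
                for nxt in adj[path[-1]]:
                    if nxt not in path:
                        stack.append((path + (nxt,), metapath + node_type[nxt]))
        index[start] = {m: len(s) for m, s in ends.items()}
    return index


def find_candidates(g_types, g_edges, q_types, q_edges, max_d=2):
    """Keep graph nodes whose topology dominates the query node's topology."""
    g_topo = topology_index(g_types, g_edges, max_d)
    q_topo = topology_index(q_types, q_edges, max_d)
    result = {}
    for q, qtype in q_types.items():
        result[q] = [n for n in sorted(g_types)
                     if g_types[n] == qtype
                     and all(g_topo[n].get(m, 0) >= c
                             for m, c in q_topo[q].items())]
    return result


if __name__ == "__main__":
    for q, nodes in find_candidates(G_TYPES, G_EDGES, Q_TYPES, Q_EDGES).items():
        print("query node", q, "->", nodes)
```

Under these assumptions the type-only lists shrink for query nodes 2 and 3, while every match of the query still survives the filter.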


Finding and Scoring Matches: Key Idea

[Flowchart: Top-K Computation, with Y/N branches between the boxes below]
• Start
• More valid edges?
• Generate a Size-1 Candidate
• Compute Actual and UB Score
• TopK Quit?
• Grow Candidates
• Candidate Size==|Q|?
• Update Heap
• Compute Max UB Score
• Done!

[Figure: Top-K Heap holding matches M1 to M5; Query Q (nodes 1, 2, 3 of type A and node 4 of type B)]

Sorted edge lists used by Query Q:
AA: (5,9):0.9, (3,4):0.8, (4,5):0.8, (2,3):0.7, (8,9):0.6, (9,10):0.3
BA: (2,7):0.7, (5,6):0.6, (8,7):0.5, (2,1):0.2, (4,7):0.1

Finding and Scoring Matches: Generating Size-1 Candidates

• Edges from the AA and BA lists are taken in the order (5,9), (3,4), (4,5), (2,3), (2,7), …
• Size-1 Candidates: a single edge such as (5,9) can yield several candidates for Query Q
  – Multiple query edges of the same type
  – Query Edge with both endpoints of same type

Candidate Growth
• Partially grown candidate: Prune? Grow?
• Fully grown candidate: Heapify? Discard?
• Example: the candidate holding nodes 5 and 9 grows by adding node 8 and then node 6, or by adding node 10 and then node 6

Finding and Scoring Matches: Actual Score and Upper Bound Score

Candidate Growth with Useful Edge Lists (size-1 candidate holding nodes 5 and 9):
• Actual Score = 0.9
• UB Score = 0.9 + UB(NonConsidered Edges) = 0.9 + (0.6 + 0.6) = 2.1

• Partially grown candidate
  – Prune if UBScore < min(heap)
  – Grow otherwise
• Fully grown candidate
  – Discard if UBScore < min(heap)
  – Update heap otherwise
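The scoring rule above can be sketched in code. The block below is one reading that reproduces the slide's numbers, not a confirmed description of the authors' method: it assumes Query Q is the path 1-2-3-4, that a non-considered query edge touching an instantiated node is bounded by the heaviest unused edge of the right type at that node, and that a query edge with no instantiated endpoint is bounded by the top of its sorted edge list. The name `score_candidate` is hypothetical.

```python
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A"}
# Only the AA and AB edges are useful for this query.
EDGES = {(5, 9): 0.9, (3, 4): 0.8, (4, 5): 0.8, (2, 3): 0.7, (8, 9): 0.6,
         (9, 10): 0.3, (2, 7): 0.7, (5, 6): 0.6, (8, 7): 0.5, (2, 1): 0.2,
         (4, 7): 0.1}

# Assumed query: path 1-2-3-4 with node types A, A, A, B.
Q_TYPES = {1: "A", 2: "A", 3: "A", 4: "B"}
Q_EDGES = [(1, 2), (2, 3), (3, 4)]

NO_EDGE = float("-inf")  # bound when a query edge cannot be satisfied


def score_candidate(assign):
    """Return (actual, ub) for a partial candidate {query node: graph node}."""
    adj = {n: {} for n in NODE_TYPE}
    for (u, v), w in EDGES.items():
        adj[u][v] = w
        adj[v][u] = w
    used = set(assign.values())
    actual, bound = 0.0, 0.0
    for a, b in Q_EDGES:
        if a in assign and b in assign:
            # Considered edge: contributes its real weight.
            actual += adj[assign[a]][assign[b]]
        elif a in assign or b in assign:
            # One endpoint fixed: best unused neighbor of the required type.
            known, free = (a, b) if a in assign else (b, a)
            options = [w for n, w in adj[assign[known]].items()
                       if NODE_TYPE[n] == Q_TYPES[free] and n not in used]
            bound += max(options, default=NO_EDGE)
        else:
            # No endpoint fixed: heaviest edge of this type pair anywhere.
            pair = {Q_TYPES[a], Q_TYPES[b]}
            bound += max((w for (u, v), w in EDGES.items()
                          if {NODE_TYPE[u], NODE_TYPE[v]} == pair),
                         default=NO_EDGE)
    return actual, actual + bound


if __name__ == "__main__":
    actual, ub = score_candidate({2: 9, 3: 5})
    print("actual:", actual, "upper bound:", round(ub, 6))
```

With query node 2 mapped to graph node 9 and query node 3 mapped to graph node 5, the two non-considered edges are bounded by (8,9):0.6 and (5,6):0.6, giving the 0.9 + (0.6 + 0.6) = 2.1 shown on the slide.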


Finding and Scoring Matches: Global Top-K Quit

• K=2
• TopK Heap: (4,3,2,7): 2.2, (3,4,5,6): 2.2
• 0.7 + 0.6 + 0.7 = 2 < 2.2: Stop

[Figure: Network G, Query Q, and the AA and BA sorted edge lists]
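The stopping test can be sketched as a bound on any match that has not been generated yet. This is a hedged reading that reproduces 0.7 + 0.6 + 0.7 = 2 from the slide: it assumes edges (5,9), (3,4) and (4,5) have already been processed, that an unseen match can only use unprocessed edges, and that Query Q has two AA edges and one BA edge. The names `unseen_match_bound` and `should_stop` are hypothetical.

```python
from collections import Counter

SORTED_LISTS = {
    "AA": [((5, 9), 0.9), ((3, 4), 0.8), ((4, 5), 0.8), ((2, 3), 0.7),
           ((8, 9), 0.6), ((9, 10), 0.3)],
    "BA": [((2, 7), 0.7), ((5, 6), 0.6), ((8, 7), 0.5), ((2, 1), 0.2),
           ((4, 7), 0.1)],
}
QUERY_EDGE_TYPES = ["AA", "AA", "BA"]  # assumed: query path A-A-A-B


def unseen_match_bound(sorted_lists, processed, query_edge_types):
    """Upper bound on the score of any match built only from unprocessed edges."""
    bound = 0.0
    for edge_type, count in Counter(query_edge_types).items():
        remaining = [w for edge, w in sorted_lists.get(edge_type, [])
                     if edge not in processed]
        if len(remaining) < count:
            return float("-inf")  # not enough edges left to form a match
        # A match needs `count` distinct edges of this type: take the heaviest.
        bound += sum(remaining[:count])
    return bound


def should_stop(heap_scores, k, bound):
    """Quit once the heap is full and no unseen match can beat its minimum."""
    return len(heap_scores) >= k and bound < min(heap_scores)


if __name__ == "__main__":
    processed = {(5, 9), (3, 4), (4, 5)}
    bound = unseen_match_bound(SORTED_LISTS, processed, QUERY_EDGE_TYPES)
    print(round(bound, 6), should_stop([2.2, 2.2], 2, bound))
```

The remaining AA edges contribute 0.7 and 0.6 and the remaining BA list contributes 0.7, so the bound of 2.0 falls below the heap minimum of 2.2 and processing can stop.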


Faster Query Processing using Graph Maximum MetaPath Weight Index

[Figure: Query (nodes 1 to 5), Partial Candidate with edge 1-2 instantiated, and Paths to cover Non-Considered Edges]
• UB Score = Actual Score(1-2) + UB(1-3) + UB(2-3) + UB(3-4) + UB(4-5)
• Using MMW Index: UB Score = Actual Score(1-2) + UB(1-3-4-5) + UB(2-3)

Slight complication
[Figure: Query (nodes 1 to 7), Partial Instantiation, Paths to cover Non-Considered Edges, and Edges to Consider Separately]
• UB Score = Actual Score(1-2) + UB(1-3-4-5-7) + UB(2-3) + UB(4-6) + UB(6-7)

Faster Query Processing using Graph Maximum MetaPath Weight Index

• K=2
• TopK Heap: (8,9,5,6): 2.1, (5,9,8,7): 2.0
• Partial candidate holding nodes 9 and 5: Prune? Grow?
  – Edge-based UBScore: 0.9 + 0.8 + 0.7 = 2.4 > 2.0, so Grow
  – Path-based UBScore (MMW Index): 0.9 + UB(5-A-B) = 0.9 + 0.9 = 1.8 < 2.0, so Prune
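The path-based bound in this example can be checked with a short sketch. It is illustrative only: the maximum metapath weight is computed on the fly by a small search standing in for a lookup in the precomputed MMW index, simple paths are assumed, and the names `max_metapath_weight` and `path_based_ub` are hypothetical.

```python
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A"}
EDGES = {(5, 9): 0.9, (3, 4): 0.8, (4, 5): 0.8, (2, 3): 0.7, (8, 9): 0.6,
         (9, 10): 0.3, (2, 7): 0.7, (5, 6): 0.6, (8, 7): 0.5, (2, 1): 0.2,
         (4, 7): 0.1}


def build_adjacency(edges):
    """Undirected adjacency map: node -> {neighbor: weight}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def max_metapath_weight(adj, node_type, start, metapath):
    """Heaviest simple path from `start` whose hop types spell `metapath`."""
    best = float("-inf")
    stack = [((start,), 0.0)]
    while stack:
        path, weight = stack.pop()
        depth = len(path) - 1
        if depth == len(metapath):
            best = max(best, weight)
            continue
        for nxt, w in adj[path[-1]].items():
            if nxt not in path and node_type[nxt] == metapath[depth]:
                stack.append((path + (nxt,), weight + w))
    return best


def path_based_ub(actual, anchor, metapath):
    """Actual score plus the MMW bound for the path leaving `anchor`."""
    adj = build_adjacency(EDGES)
    return actual + max_metapath_weight(adj, NODE_TYPE, anchor, metapath)


if __name__ == "__main__":
    heap_min = 2.0            # smallest score in the top-2 heap on the slide
    ub = path_based_ub(0.9, 5, "AB")
    print(round(ub, 6), "prune" if ub < heap_min else "grow")
```

The only A-B path of length two leaving node 5 is 5-4-7 with weight 0.8 + 0.1 = 0.9, so the bound is 1.8 and the candidate is pruned, whereas the edge-based bound of 2.4 would have let it grow.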


Discussions

• Queries with multiple edge semantics
• Directed graphs
• Homogeneous networks
• Weighted query edges
  – Weights signify expected amount of interestingness
  – Weights signify importance of query edge
• Faster computations versus index size

Low-cost Index Structures

[Plot: Time (sec) vs |V| from 1000 to 1000000, log scale; series: Topology+MMW (D=2), SPath (D=2), Sorted Edge Lists]

[Plot: Index Size (KBs) vs Graph Size |V| from 1000 to 1000000, log scale; series: Edge Lists, Topology (D=2), Topology (D=3), MMW (D=2), MMW (D=3), SPath (D=2), SPath (D=3)]

Faster Query Execution

        |Q|=2  |Q|=3  |Q|=4  |Q|=5
RAM     158    3186   39294  469962
RWM0    10     165    824    4660
RWM1    12     195    1022   5891
RWM2    12     212    3135   27363
RWM3    111    1486   3978   9972
RWM4    12     165    791    4518

        |Q|=2  |Q|=3  |Q|=4  |Q|=5
RAM     144    8698   34639  174992

        |Q|=2  |Q|=3  |Q|=4  |Q|=5
RAM     245    2004   14628  169328

Query Execution Time (msec) for Path Queries (Graph G2 and indexes with D=2)

Query Execution Time (msec) for Clique Queries (Graph G2 and indexes with D=2)

Query Execution Time (msec) for Subgraph Queries (Graph G2 and indexes with D=2)

• RAM: Ranking After Matching baseline
• RWM0: without using the candidate node filtering
• RWM1: without using the MMW index
• RWM2: same as RWM1 without pruning any partially grown candidates
• RWM3: same as RWM1 without the global top-K quit check
• RWM4: same as RWM1 with the MMW index

Good Scalability

Running time (msec) for different Query Sizes and Graph Sizes (D=2)

Graph     |Q|=2  |Q|=3  |Q|=4  |Q|=5   |Q|=6    |Q|=7
|V|=1e+3  5      18     77     382     1870     7656
|V|=1e+4  10     90     407    2267    12366    87657
|V|=1e+5  52     396    2794   18412   131256   1006773
|V|=1e+6  362    4907   28600  184523  1216893  9786327

Good Scalability thanks to Effective Pruning

Number of Candidates as Percentage of Total Matches for Different Query Sizes and Candidate Sizes

                       |Q|=2  |Q|=3  |Q|=4  |Q|=5
#Candidates of Size 2  9.54   7.86   4.38   1.63
#Candidates of Size 3         28.28  18.31  7.94
#Candidates of Size 4                24.42  25.5
#Candidates of Size 5                       13.61

[Plot: Query Execution Time for Different Values of K; x-axis: Size of the Query (|Q|=2 to |Q|=5); y-axis: Average Query Execution Time (msec), log scale from 10 to 10000; series: K=10, K=20, K=50, K=100]

Real Dataset Case Studies

Queries:
• Q1 (nodes 1 to 3): Author, Author, Conf
• Q2 (nodes 1 to 4): Author, Author, Conf, Keyword
• Q3 (nodes 1 to 3): Person, Person, Film
• Q4 (nodes 1 to 4): Person, Person, Company, Settlement

Dataset                         DBLP         Wikipedia
#Nodes                          138K         670K
#Edges                          1.6M         4.1M
#Types                          3            10
Edge List Index Size            50 MB        261 MB
Topology Index Size             5.8 MB       148 MB
MMW Index Size                  11.4 MB      249 MB
SPath Index Size                4.3 GB       13.7 GB
Topology+MMW Construction Time  513 minutes  1203 minutes
Avg Query Time                  100 sec      42 sec


• DBLP
  – 1: Rohit Gupta, 2: BICoB, 3: Vipin Kumar
    • Rohit Gupta -- computer networking
    • Vipin Kumar -- Data and Information Systems
    • BICoB -- International Conference on Bioinformatics and Computational Biology
  – 1: Jimeng Sun, 2: Operating Systems Review (SIGOPS), 3: Christos Faloutsos, 4: mining
    • Jimeng Sun and Christos Faloutsos -- Data and Information Systems, Artificial intelligence, and Computational biology
    • "mining" -- Data and Information Systems
    • "Operating Systems Review (SIGOPS)" -- Operating systems, Computer architecture, Computer networking


• Wikipedia
  – 1: Stacy Keach, 2: The Biggest Battle, 3: John Huston
    • Stacy Keach and John Huston starred in the movie “The Biggest Battle”
    • Stacy Keach (American), John Huston (American), movie is Italian
    • Stacy (narration, comedy, music), John (drama, documentary, adventure), movie (war)
  – 1: Medha Patkar, 2: BBC, 3: Felix D’Alviella, 4: Mogilino
    • Medha Patkar -- Indian social activist -- won Best International Political Campaigner by BBC
    • Felix D’Alviella -- Belgian actor in the BBC soap opera Doctors
    • Mogilino -- village in Bulgaria -- BBC showed the popular film "Bulgaria’s Abandoned Children" in 2007
    • British company rewarding an Indian woman, covering a place in Bulgaria or linked to a person from Belgium is rare

Related Work (1)

• Theory literature on subgraph isomorphism [Cordella et al., 2004; McKay, 1981; Ullmann, 1976]

• Exact subgraph matching [Cheng et al., 2008; He and Singh, 2008; Sun et al., 2012; Zhang et al., 2007; Zhang et al., 2009; Zhao and Han, 2010; Zou et al., 2009]

• Approximate subgraph matching [Zou et al., 2007; Zeng et al., 2012; Tian et al., 2007; Zhang et al., 2010]


Related Work (2)

• Matching in graph databases [Ranu and Singh, 2009; Yan et al., 2005; Zhu et al., 2012]

• Matching for RDF graphs [Liu et al., 2012], probabilistic graphs [Yuan et al., 2012] and temporal graphs [Bogdanov et al., 2011]

• Top-K queries
  – h-hop aggregate queries [Yan et al., 2010]
  – K most frequent patterns [Yang et al., 2012; Zhu et al., 2011]
  – Top-K keyword queries on RDF graphs [Tran et al., 2009]
  – Top-K similarity queries [Zou et al., 2007]
  – Twig queries [Gou and Chirkova, 2008]

Conclusion

• Given
  – Typed unweighted query
  – A heterogeneous edge-weighted information network
  – Edge interestingness measure
• Find
  – Top-K interesting subgraphs
• Investigated ranking after matching baseline
• Proposed three new graph indexes and exploited them for building a top-K solution
• Showed efficiency, scalability and effectiveness on multiple synthetic and real datasets

Thanks!
