TopK Interesting Subgraph Discovery in Information Networks
Manish Gupta Jing Gao Xifeng Yan Hasan Cam Jiawei Han
Real World Problems
• Network Bottlenecks Discovery (Computer Networks)
– Interestingness = Lowest Bandwidth
• Suspicious Relationships Discovery (Social Networks)
– Interestingness = Highest Negative Association Strength of Attribute Values
• Team Selection (Organization Networks)
– Interestingness = Highest Historical Compatibility
• Resource Allocation (Battlefield Networks)
– Interestingness = Lowest Distance between Entities
The Basic Underlying Problem
• Network Bottlenecks Discovery: Interestingness = Lowest Bandwidth
• Team Selection: Interestingness = Highest Historical Compatibility
• Suspicious Relationships Discovery: Interestingness = Highest Negative Association Strength
• Resource Allocation: Interestingness = Lowest Distance
• Given
– Edge-weighted Typed Network G
– Typed Subgraph Query Q
– Edge Interestingness measure
• Find
– TopK matching subgraphs
Naïve Solution: Ranking After Matching
[Figure: Network G (nodes 1 to 13 of types A, B, C with weighted edges), Query Q (path A-A-A-B over query nodes 1 to 4), and the nine matches M1 to M9 of Q in G]
• Matching: enumerate all matches of Q in G (M1 to M9)
• Ranking: sort the matches by Match Score
– Match Scores in ranked order: 2.2, 2.2, 2.1, 2.0, 1.8, 1.8, 1.7, 1.6, 1.4
• Why compute all matches? We need only top-2!
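The ranking-after-matching baseline can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes Query Q is the path A-A-A-B, that a match's score is the sum of its edge weights, and it keeps only the A-A and A-B edges of Network G because the query has no C node. The names `EDGES`, `all_path_matches`, and `rank_after_matching` are hypothetical.

```python
# Node types and weighted edges transcribed from the example Network G
# (only the A and B nodes, since the query contains no C node).
NODE_TYPE = {1: "B", 6: "B", 7: "B",
             2: "A", 3: "A", 4: "A", 5: "A", 8: "A", 9: "A", 10: "A"}
EDGES = {
    (5, 9): 0.9, (3, 4): 0.8, (4, 5): 0.8,
    (2, 3): 0.7, (8, 9): 0.6, (9, 10): 0.3,   # A-A edges
    (2, 7): 0.7, (5, 6): 0.6, (8, 7): 0.5,
    (2, 1): 0.2, (4, 7): 0.1,                 # A-B edges
}


def build_adjacency(edges):
    """Turn the undirected edge dict into node -> {neighbor: weight}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def all_path_matches(adj, node_type, query_types):
    """Enumerate every simple path whose node types spell out query_types."""
    matches = []

    def extend(path, score):
        if len(path) == len(query_types):
            matches.append((tuple(path), round(score, 10)))
            return
        wanted = query_types[len(path)]
        for nxt, w in adj[path[-1]].items():
            if nxt not in path and node_type[nxt] == wanted:
                extend(path + [nxt], score + w)

    for start in adj:
        if node_type[start] == query_types[0]:
            extend([start], 0.0)
    return matches


def rank_after_matching(adj, node_type, query_types, k):
    """Baseline: compute ALL matches first, then sort and keep the top k."""
    matches = all_path_matches(adj, node_type, query_types)
    matches.sort(key=lambda m: -m[1])
    return matches[:k], len(matches)


ADJ = build_adjacency(EDGES)
top2, total = rank_after_matching(ADJ, NODE_TYPE, ["A", "A", "A", "B"], k=2)
```

On the example, the sketch enumerates all nine matches just to return the two with score 2.2, which is the inefficiency the slide points out.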
Our Contributions
• New notion: TopK interesting subgraph detection in information networks
• Three new low-cost indexes
– Graph topology index
– Sorted edge lists
– Graph maximum metapath weight index
• Novel top-K algorithm to answer interestingness queries on large graphs
• Detailed effectiveness and efficiency validation on several synthetic and real datasets
Relationship with Previous Work
• Subgraph matching
– Approximate: fuzzy node/edge similarity
– Exact: Matching without ranking
– RDF graphs, probabilistic graphs, temporal graphs
• TopK querying on graphs
– H-hop aggregate queries
– Keyword queries on RDF graphs
– K most frequent patterns
– Twig queries
System Overview
• Offline Index Construction (inputs: Network G, Distance D)
– Breadth First Traversal from each Node up to Distance D: builds the Graph Topology Index and the Graph Maximum MetaPath Weight Index
– Sort Edges: builds the Sorted Edge Lists
• Online Query Processing (input: Query Q)
– Find Candidate Nodes: produces the Candidate Nodes
– Top-K Computation: produces the Top-K Subgraphs
Index Structures
G=(V,E), B=avg #neighbors, T=#types
[Figure: Network G]
Sorted edge lists, one list per pair of node types ("-" marks an empty cell):
AA           BB   CC            AB           AC            BC
(5,9):0.9    -    (12,13):0.2   (2,7):0.7    (3,12):0.5    (7,11):0.2
(3,4):0.8    -    -             (5,6):0.6    (4,12):0.4    (1,11):0.1
(4,5):0.8    -    -             (8,7):0.5    (3,13):0.4    -
(2,3):0.7    -    -             (2,1):0.2    (2,13):0.3    -
(8,9):0.6    -    -             (4,7):0.1    -             -
(9,10):0.3   -    -             -            -             -
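A minimal Python sketch of how the sorted edge lists above could be built. It assumes the node types A = {2, 3, 4, 5, 8, 9, 10}, B = {1, 6, 7}, C = {11, 12, 13} as read off the example, and the names `NODE_TYPE`, `EDGES`, and `build_sorted_edge_lists` are hypothetical.

```python
from collections import defaultdict

# Node types of the example Network G (assumed from the edge-list columns).
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}

# The 18 weighted edges of the example, deliberately in no particular order.
EDGES = [
    (2, 3, 0.7), (5, 9, 0.9), (9, 10, 0.3), (3, 4, 0.8), (4, 5, 0.8),
    (8, 9, 0.6), (12, 13, 0.2), (4, 7, 0.1), (2, 7, 0.7), (2, 1, 0.2),
    (5, 6, 0.6), (8, 7, 0.5), (3, 13, 0.4), (3, 12, 0.5), (2, 13, 0.3),
    (4, 12, 0.4), (1, 11, 0.1), (7, 11, 0.2),
]


def build_sorted_edge_lists(edges, node_type):
    """Group edges by the (unordered) pair of endpoint types and sort each
    group by decreasing weight."""
    lists = defaultdict(list)
    for u, v, w in edges:
        pair = "".join(sorted(node_type[u] + node_type[v]))  # e.g. "AB"
        lists[pair].append(((u, v), w))
    for pair in lists:
        # Python's sort is stable, so ties keep their input order.
        lists[pair].sort(key=lambda item: item[1], reverse=True)
    return dict(lists)


SORTED_LISTS = build_sorted_edge_lists(EDGES, NODE_TYPE)
```

Type pairs with no edges (BB in the example) simply get no list.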
[Table: Time Complexity and Space Complexity of each index (Sorted edge lists, Graph topology index, Graph max metapath weight index)]
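Graph maximum metapath weight (MMW) index, D=2 ("-" marks an empty cell):
         d=1             d=2
Node Id  A    B    C     AA   BA   CA   AB   BB   CB   AC   BC   CC
1        0.2  -    0.1   0.9  -    -    0.9  -    0.3  0.5  -    -
2        0.7  0.7  0.3   1.5  1.2  0.7  -    -    -    1.2  0.9  0.5
3        0.8  -    0.5   1.6  -    0.9  1.4  -    -    1.2  -    0.7
4        0.8  0.1  0.4   1.7  0.8  0.9  1.4  -    -    1.3  0.3  0.6
5        0.9  0.6  -     1.6  -    -    0.9  -    -    1.2  -    -
6        0.6  -    -     1.5  -    -    -    -    -    -    -    -
7        0.7  -    0.2   1.4  -    -    0.9  -    0.3  1    -    -
8        0.6  0.5  -     1.5  1.2  -    -    -    -    -    0.7  -
9        0.9  -    -     1.7  -    -    1.5  -    -    -    -    -
10       0.3  -    -     1.2  -    -    -    -    -    -    -    -
11       -    0.2  -     -    0.9  -    -    -    -    -    -    -
12       0.5  -    0.2   1.3  -    0.6  0.5  -    -    0.9  -    -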
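Graph topology index, D=2 ("-" marks an empty cell):
         d=1          d=2
Node Id  A   B   C    AA  BA  CA  AB  BB  CB  AC  BC  CC
1        1   -   1    1   -   -   1   -   1   1   -   -
2        1   2   1    1   2   1   -   -   -   2   1   1
3        2   -   2    1   -   2   2   -   -   2   -   2
4        2   1   1    2   2   1   1   -   -   2   1   1
5        2   1   -    3   -   -   1   -   -   1   -   -
6        1   -   -    2   -   -   -   -   -   -   -   -
7        3   -   1    3   -   -   1   -   1   2   -   -
8        1   1   -    2   2   -   -   -   -   -   1   -
9        3   -   -    1   -   -   2   -   -   -   -   -
10       1   -   -    2   -   -   -   -   -   -   -   -
11       -   2   -    -   3   -   -   -   -   -   -   -
12       2   -   1    4   -   2   1   -   -   1   -   -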
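The two per-node indexes can be sketched together in Python. This is an illustrative reading of the slide, not the authors' construction: it assumes each index entry is keyed by the sequence of node types along a simple path of length at most D starting at the node, that the topology index counts the distinct end nodes per type sequence, and that the MMW index keeps the largest path weight per type sequence. The function name `build_indexes` and all variable names are hypothetical.

```python
from collections import defaultdict

NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A", 6: "B", 7: "B",
             8: "A", 9: "A", 10: "A", 11: "C", 12: "C", 13: "C"}
EDGES = [
    (5, 9, 0.9), (3, 4, 0.8), (4, 5, 0.8), (2, 3, 0.7), (8, 9, 0.6),
    (9, 10, 0.3), (12, 13, 0.2), (2, 7, 0.7), (5, 6, 0.6), (8, 7, 0.5),
    (2, 1, 0.2), (4, 7, 0.1), (3, 12, 0.5), (4, 12, 0.4), (3, 13, 0.4),
    (2, 13, 0.3), (7, 11, 0.2), (1, 11, 0.1),
]


def build_adjacency(edges):
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def build_indexes(adj, node_type, max_dist=2):
    """Return (topology, mmw): for every node, a dict keyed by metapath
    (string of node types along the path, source excluded)."""
    topology, mmw = {}, {}
    for src in adj:
        ends = defaultdict(set)     # metapath -> distinct end nodes
        best = defaultdict(float)   # metapath -> maximum path weight
        frontier = [((src,), "", 0.0)]
        for _ in range(max_dist):   # level-by-level traversal up to D hops
            next_frontier = []
            for path, meta, weight in frontier:
                for nbr, w in adj[path[-1]].items():
                    if nbr in path:          # keep paths simple
                        continue
                    m = meta + node_type[nbr]
                    ends[m].add(nbr)
                    best[m] = max(best[m], weight + w)
                    next_frontier.append((path + (nbr,), m, weight + w))
            frontier = next_frontier
        topology[src] = {m: len(nodes) for m, nodes in ends.items()}
        mmw[src] = {m: round(x, 10) for m, x in best.items()}
    return topology, mmw


TOPOLOGY, MMW = build_indexes(build_adjacency(EDGES), NODE_TYPE, max_dist=2)
```

Under these assumptions the sketch gives node 5 the MMW entries A = 0.9, B = 0.6, AA = 1.6, AB = 0.9, AC = 1.2, which agrees with the values listed for node 5 on the slide.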
Find Candidate Nodes
• Uses the Graph Topology Index and the Query Topology of Query Q
[Figure: Query Q (path over query nodes 1:A, 2:A, 3:A, 4:B)]
Query Topology ("-" marks an empty cell):
         d=1          d=2
Node Id  A   B   C    AA  BA  CA  AB  BB  CB  AC  BC  CC
1        1   -   -    1   -   -   -   -   -   -   -   -
2        2   -   -    -   -   -   1   -   -   -   -   -
3        1   1   -    1   -   -   -   -   -   -   -   -
4        1   -   -    1   -   -   -   -   -   -   -   -
Candidate nodes listed for each query node:
– Query node 1: 2, 3, 4, 5, 8, 9, 10
– Query node 2: 2, 3, 4, 5, 8, 9, 10
– Query node 3: 2, 3, 4, 5, 8, 9, 10
– Query node 4: 1, 6, 7
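A hedged Python sketch of candidate node filtering with the topology index. The containment test used here (a graph node stays a candidate for a query node if it has the same type and at least as many nodes as the query node for every type sequence) is an assumption about how the two topologies are compared, and the filtered lists it produces are outputs of this sketch rather than values from the slide. `GRAPH_TOPOLOGY` transcribes the index rows for nodes 1 to 10 with empty cells omitted; all names are hypothetical.

```python
NODE_TYPE = {1: "B", 2: "A", 3: "A", 4: "A", 5: "A",
             6: "B", 7: "B", 8: "A", 9: "A", 10: "A"}

# Graph topology index rows (D=2) for the A and B nodes of the example.
GRAPH_TOPOLOGY = {
    1: {"A": 1, "C": 1, "AA": 1, "AB": 1, "CB": 1, "AC": 1},
    2: {"A": 1, "B": 2, "C": 1, "AA": 1, "BA": 2, "CA": 1,
        "AC": 2, "BC": 1, "CC": 1},
    3: {"A": 2, "C": 2, "AA": 1, "CA": 2, "AB": 2, "AC": 2, "CC": 2},
    4: {"A": 2, "B": 1, "C": 1, "AA": 2, "BA": 2, "CA": 1, "AB": 1,
        "AC": 2, "BC": 1, "CC": 1},
    5: {"A": 2, "B": 1, "AA": 3, "AB": 1, "AC": 1},
    6: {"A": 1, "AA": 2},
    7: {"A": 3, "C": 1, "AA": 3, "AB": 1, "CB": 1, "AC": 2},
    8: {"A": 1, "B": 1, "AA": 2, "BA": 2, "BC": 1},
    9: {"A": 3, "AA": 1, "AB": 2},
    10: {"A": 1, "AA": 2},
}

# Query topology of the path query 1(A) - 2(A) - 3(A) - 4(B).
QUERY_TYPE = {1: "A", 2: "A", 3: "A", 4: "B"}
QUERY_TOPOLOGY = {
    1: {"A": 1, "AA": 1},
    2: {"A": 2, "AB": 1},
    3: {"A": 1, "B": 1, "AA": 1},
    4: {"A": 1, "AA": 1},
}


def find_candidates(graph_topology, node_type, query_topology, query_type):
    """Keep node n for query node q if the types match and n's topology
    dominates q's topology entry by entry (assumed filtering rule)."""
    result = {}
    for q, needed in query_topology.items():
        result[q] = [
            n for n in sorted(graph_topology)
            if node_type[n] == query_type[q]
            and all(graph_topology[n].get(m, 0) >= c for m, c in needed.items())
        ]
    return result


CANDIDATES = find_candidates(GRAPH_TOPOLOGY, NODE_TYPE, QUERY_TOPOLOGY, QUERY_TYPE)
```

With this rule, query node 2 (which needs two A neighbors and a B node two hops away) keeps only nodes 3, 4, 5, and 9, so fewer partial candidates have to be grown later.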
Finding and Scoring Matches: Key Idea
[Flowchart: Top-K Computation]
• Steps: Start; Generate a Size-1 Candidate; Compute Actual and UB Score; Grow Candidates; Update Heap; Compute Max UB Score; Done!
• Decision points (Y/N): More valid edges?; Candidate Size==|Q|?; TopK Quit?
• Top-K Heap (holding matches M1 to M5 in the figure)
[Figure: Query Q]
Sorted edge lists used by Query Q:
AA           BA
(5,9):0.9    (2,7):0.7
(3,4):0.8    (5,6):0.6
(4,5):0.8    (8,7):0.5
(2,3):0.7    (2,1):0.2
(8,9):0.6    (4,7):0.1
(9,10):0.3   -
Finding and Scoring Matches: Generating Size-1 Candidates
[Figure: Query Q, sorted edge lists AA and BA, and the Size-1 Candidates obtained by placing edge (5,9) on Query Q]
• One edge can yield several Size-1 Candidates
– Multiple query edges of the same type
– Query Edge with both endpoints of same type
• Order: (5,9), (3,4), (4,5), (2,3), (2,7), …
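The processing order and the size-1 candidates can be sketched in Python. This is a sketch under assumptions: edges are taken from the AA and BA lists in decreasing weight (ties resolved in favor of the list given first), Query Q is the path 1(A) - 2(A) - 3(A) - 4(B), and each list entry is written as (A node, other node). The names `edges_in_processing_order` and `size1_candidates` are hypothetical.

```python
import heapq

AA = [((5, 9), 0.9), ((3, 4), 0.8), ((4, 5), 0.8),
      ((2, 3), 0.7), ((8, 9), 0.6), ((9, 10), 0.3)]
BA = [((2, 7), 0.7), ((5, 6), 0.6), ((8, 7), 0.5),
      ((2, 1), 0.2), ((4, 7), 0.1)]

# Query edges with the types of their endpoints, in endpoint order.
QUERY_EDGES = {(1, 2): ("A", "A"), (2, 3): ("A", "A"), (3, 4): ("A", "B")}


def edges_in_processing_order(*sorted_lists):
    """Merge already-sorted lists into one stream of decreasing weight."""
    return heapq.merge(*sorted_lists, key=lambda item: -item[1])


def size1_candidates(edge, edge_types, query_edges):
    """Map one graph edge onto every query edge of the same type.
    A query edge whose endpoints share a type admits both orientations."""
    u, v = edge
    candidates = []
    for (q1, q2), q_types in query_edges.items():
        if q_types != edge_types:
            continue
        candidates.append({q1: u, q2: v})
        if q_types[0] == q_types[1]:
            candidates.append({q1: v, q2: u})
    return candidates


ORDER = [edge for edge, _ in edges_in_processing_order(AA, BA)]
FIRST = size1_candidates((5, 9), ("A", "A"), QUERY_EDGES)
```

Edge (5,9) produces four size-1 candidates here: two AA query edges times two orientations.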
Candidate Growth
[Figure: Size-1 candidate (5,9) grown one edge at a time, first by adding node 8 and then node 6, or first by adding node 10 and then node 6]
• Partially grown candidate: Prune? Grow?
• Fully grown candidate: Heapify? Discard?
Finding and Scoring Matches: Actual Score and Upper Bound Score
[Figure: Candidate Growth of candidate (5,9), with the Useful Edge Lists AA and BA]
• Actual Score = 0.9
• UB Score = 0.9 + UB(NonConsidered Edges) = 0.9 + (0.6 + 0.6) = 2.1
• Partially grown candidate
– Prune if UBScore < min(heap)
– Grow otherwise
• Fully grown candidate
– Discard if UBScore < min(heap)
– Update heap otherwise
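The prune/grow and discard/update rules above can be sketched in Python with a min-heap of the K best scores. This is an illustrative sketch: the behavior while the heap holds fewer than K matches (always grow or insert) is an assumption, and the names `ub_score` and `handle_candidate` are hypothetical.

```python
import heapq


def ub_score(actual_score, remaining_edge_bounds):
    """Actual score of the instantiated edges plus an upper bound for each
    query edge that has not been considered yet."""
    return actual_score + sum(remaining_edge_bounds)


def handle_candidate(ub, actual, fully_grown, heap, k):
    """Apply the slide's rules; `heap` is a min-heap of the best scores.
    For a fully grown candidate, ub equals its actual score."""
    heap_full = len(heap) >= k
    if heap_full and ub < heap[0]:           # heap[0] is min(heap)
        return "discard" if fully_grown else "prune"
    if not fully_grown:
        return "grow"
    if heap_full:
        heapq.heapreplace(heap, actual)      # evict the current minimum
    else:
        heapq.heappush(heap, actual)         # assumed: fill the heap first
    return "update heap"


# Candidate (5,9) from the slide: actual score 0.9, two edges left with
# upper bounds 0.6 and 0.6.
UB = ub_score(0.9, [0.6, 0.6])
```

With a top-2 heap already holding two matches of score 2.2, the candidate's bound of 2.1 is below min(heap), so `handle_candidate(UB, 0.9, False, [2.2, 2.2], 2)` reports that it can be pruned.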
Finding and Scoring Matches: Global Top-K Quit
[Figure: Network G, Query Q, and sorted edge lists AA and BA]
• K=2, TopK Heap
– (4,3,2,7): 2.2
– (3,4,5,6): 2.2
• 0.7 + 0.6 + 0.7 = 2.0 < 2.2: Stop
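One way to read the stopping arithmetic above is sketched below in Python. The assumptions are mine, chosen because they reproduce 0.7 + 0.6 + 0.7: the first three AA edges have already been processed, no BA edge has, and each query edge is bounded by the best still-unprocessed weight of its type (two AA query edges take the top two remaining AA weights). The names `global_upper_bound` and `can_quit` are hypothetical.

```python
from collections import Counter

# Weights of the sorted edge lists from the example.
SORTED_WEIGHTS = {
    "AA": [0.9, 0.8, 0.8, 0.7, 0.6, 0.3],
    "BA": [0.7, 0.6, 0.5, 0.2, 0.1],
}


def global_upper_bound(sorted_weights, processed, query_edge_types):
    """Best score any not-yet-generated candidate could reach: for each edge
    type, sum the largest unprocessed weights, one per query edge of that type."""
    total = 0.0
    for edge_type, count in Counter(query_edge_types).items():
        start = processed.get(edge_type, 0)
        remaining = sorted_weights[edge_type][start:]
        total += sum(remaining[:count])
    return total


def can_quit(heap_scores, k, bound):
    """Stop once K matches are known and nothing unseen can beat the worst."""
    return len(heap_scores) >= k and bound < min(heap_scores)


BOUND = global_upper_bound(SORTED_WEIGHTS, {"AA": 3, "BA": 0}, ["AA", "AA", "BA"])
QUIT = can_quit([2.2, 2.2], 2, BOUND)
```

The bound of 2.0 is below the smallest heap score of 2.2, so the remaining edges never need to be turned into candidates.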
Faster Query Processing using Graph Maximum MetaPath Weight Index
[Figure: Query with nodes 1 to 5, a Partial Candidate covering edge 1-2, and the Paths to cover Non-Considered Edges]
• Edge-based: UB Score = Actual Score(1-2) + UB(1-3) + UB(2-3) + UB(3-4) + UB(4-5)
• Using MMW Index! UB Score = Actual Score(1-2) + UB(1-3-4-5) + UB(2-3)
• Slight complication
[Figure: Query with additional nodes 6 and 7, a Partial Instantiation, the Paths to cover Non-Considered Edges, and the Edges to Consider Separately]
– UB Score = Actual Score(1-2) + UB(1-3-4-5-7) + UB(2-3) + UB(4-6) + UB(6-7)
Faster Query Processing using Graph Maximum MetaPath Weight Index
[Figure: partial candidate on Query Q with A nodes 9 and 5 instantiated, sorted edge lists AA and BA, and the MMW Index]
• K=2, TopK Heap
– (8,9,5,6): 2.1
– (5,9,8,7): 2.0
• Prune? Grow?
– Edge-based UBScore: 0.9 + 0.8 + 0.7 = 2.4 > 2.0: Grow
– Path-based UBScore (from the MMW Index): 0.9 + UB(5-A-B) = 0.9 + 0.9 = 1.8 < 2.0: Prune
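The difference between the two bounds on this slide can be reproduced with a few lines of Python. This is a sketch with hypothetical names (`edge_based_ub`, `path_based_ub`, `decide`); `MMW_ROW_NODE_5` holds the MMW entries for node 5 keyed by the type sequence of the path, which is an assumed encoding of the index.

```python
# MMW index entries for node 5 (D=2), keyed by the types along the path.
MMW_ROW_NODE_5 = {"A": 0.9, "B": 0.6, "AA": 1.6, "AB": 0.9, "AC": 1.2}


def edge_based_ub(actual, best_edge_weights):
    """Bound every remaining query edge independently by the best weight
    available for its type."""
    return actual + sum(best_edge_weights)


def path_based_ub(actual, mmw_row, metapath):
    """Bound a whole remaining path at once with the MMW index entry of the
    node it starts from. A missing entry means no such path exists."""
    return actual + mmw_row.get(metapath, float("-inf"))


def decide(ub, heap_min):
    return "prune" if ub < heap_min else "grow"


ACTUAL = 0.9                                                # edge (5,9)
EDGE_UB = edge_based_ub(ACTUAL, [0.8, 0.7])                 # best AA, best BA
PATH_UB = path_based_ub(ACTUAL, MMW_ROW_NODE_5, "AB")       # UB(5-A-B)
```

Against a heap minimum of 2.0, the edge-based bound (2.4) would keep growing the candidate, while the tighter path-based bound (1.8) lets it be pruned immediately.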
Discussions
• Queries with multiple edge semantics
• Directed graphs
• Homogeneous networks
• Weighted query edges
– Weights signify expected amount of interestingness
– Weights signify importance of query edge
• Faster computations versus index size
Low-cost Index Structures
[Plot: Time (sec) vs. |V| (1000 to 1000000), log scale; series: Topology+MMW (D=2), SPath (D=2), Sorted Edge Lists]
[Plot: Index Size (KBs) vs. Graph Size |V| (1000 to 1000000), log scale; series: Edge Lists, Topology (D=2), Topology (D=3), MMW (D=2), MMW (D=3), SPath (D=2), SPath (D=3)]
Faster Query Execution
Query Execution Time (msec) for Path Queries (Graph G2 and indexes with D=2)
        |Q|=2   |Q|=3   |Q|=4   |Q|=5
RAM     158     3186    39294   469962
RWM0    10      165     824     4660
RWM1    12      195     1022    5891
RWM2    12      212     3135    27363
RWM3    111     1486    3978    9972
RWM4    12      165     791     4518
Query Execution Time (msec) for Clique Queries (Graph G2 and indexes with D=2)
        |Q|=2   |Q|=3   |Q|=4   |Q|=5
RAM     144     8698    34639   174992
RWM0    10      375     14689   229136
RWM1    13      446     16754   200065
RWM2    12      562     19088   201708
RWM3    156     2277    17182   161533
RWM4    11      346     13547   199617
Query Execution Time (msec) for Subgraph Queries (Graph G2 and indexes with D=2)
        |Q|=2   |Q|=3   |Q|=4   |Q|=5
RAM     245     2004    14628   169328
RWM0    15      32      43      122
RWM1    19      36      98      178
RWM2    20      40      442     6887
RWM3    218     1733    2337    3933
RWM4    18      34      42      118
• RAM: Ranking After Matching baseline
• RWM0: without using the candidate node filtering
• RWM1: without using the MMW index
• RWM2: same as RWM1 without pruning any partially grown candidates
• RWM3: same as RWM1 without the global top-K quit check
• RWM4: same as RWM1 with the MMW index
Good Scalability
Running time (msec) for different Query Sizes and Graph Sizes (D=2)
Graph      |Q|=2   |Q|=3   |Q|=4   |Q|=5    |Q|=6     |Q|=7
|V|=1e+3   5       18      77      382      1870      7656
|V|=1e+4   10      90      407     2267     12366     87657
|V|=1e+5   52      396     2794    18412    131256    1006773
|V|=1e+6   362     4907    28600   184523   1216893   9786327
Good Scalability thanks to Effective Pruning
Number of Candidates as Percentage of Total Matches for Different Query Sizes and Candidate Sizes ("-" marks an empty cell)
                         |Q|=2   |Q|=3   |Q|=4   |Q|=5
#Candidates of Size 2    9.54    7.86    4.38    1.63
#Candidates of Size 3    -       28.28   18.31   7.94
#Candidates of Size 4    -       -       24.42   25.5
#Candidates of Size 5    -       -       -       13.61
[Plot: Query Execution Time for Different Values of K; x-axis: Size of the Query (|Q|=2 to |Q|=5); y-axis: Average Query Execution Time (msec), log scale from 10 to 10000; series: K=10, K=20, K=50, K=100]
Real Dataset Case Studies
[Figure: query graphs Q1 to Q4]
• Q1 (3 nodes): Author, Author, Conf
• Q2 (4 nodes): Author, Author, Conf, Keyword
• Q3 (3 nodes): Person, Person, Film
• Q4 (4 nodes): Person, Person, Company, Settlement
Dataset                          DBLP          Wikipedia
#Nodes                           138K          670K
#Edges                           1.6M          4.1M
#Types                           3             10
Edge List Index Size             50 MB         261 MB
Topology Index Size              5.8 MB        148 MB
MMW Index Size                   11.4 MB       249 MB
SPath Index Size                 4.3 GB        13.7 GB
Topology+MMW Construction Time   513 minutes   1203 minutes
Avg Query Time                   100 sec       42 sec
Real Dataset Case Studies
• DBLP
– 1: Rohit Gupta, 2: BICoB, 3: Vipin Kumar
  • Rohit Gupta -- computer networking
  • Vipin Kumar -- Data and Information Systems
  • BICoB -- International Conference on Bioinformatics and Computational Biology
– 1: Jimeng Sun, 2: Operating Systems Review (SIGOPS), 3: Christos Faloutsos, 4: mining
  • Jimeng Sun and Christos Faloutsos -- Data and Information Systems, Artificial intelligence, and Computational biology
  • "mining" -- Data and Information Systems
  • "Operating Systems Review (SIGOPS)" -- Operating systems, Computer architecture, Computer networking
Real Dataset Case Studies
• Wikipedia
– 1: Stacy Keach, 2: The Biggest Battle, 3: John Huston
  • Stacy Keach and John Huston starred in the movie “The Biggest Battle”
  • Stacy Keach (American), John Huston (American), movie is Italian
  • Stacy (narration, comedy, music), John (drama, documentary, adventure), movie (war)
– 1: Medha Patkar, 2: BBC, 3: Felix D’Alviella, 4: Mogilino
  • Medha Patkar -- Indian social activist -- won Best International Political Campaigner by BBC
  • Felix D’Alviella -- Belgian actor in the BBC soap opera Doctors
  • Mogilino -- village in Bulgaria -- BBC showed the popular film "Bulgaria’s Abandoned Children" in 2007
  • British company rewarding an Indian woman, covering a place in Bulgaria or linked to a person from Belgium is rare
Related Work (1)
• Theory literature on subgraph isomorphism [Cordella et al., 2004; McKay, 1981; Ullmann, 1976]
• Exact subgraph matching [Cheng et al., 2008; He and Singh, 2008; Sun et al., 2012; Zhang et al., 2007; Zhang et al., 2009; Zhao and Han, 2010; Zou et al., 2009]
• Approximate subgraph matching [Zou et al., 2007; Zeng et al., 2012; Tian et al., 2007; Zhang et al., 2010]
Related Work (2)
• Matching in graph databases [Ranu and Singh, 2009; Yan et al., 2005; Zhu et al., 2012]
• Matching for RDF graphs [Liu et al., 2012], probabilistic graphs [Yuan et al., 2012] and temporal graphs [Bogdanov et al., 2011]
• Top-K queries
– h-hop aggregate queries [Yan et al., 2010]
– K most frequent patterns [Yang et al., 2012; Zhu et al., 2011]
– Top-K keyword queries on RDF graphs [Tran et al., 2009]
– Top-K similarity queries [Zou et al., 2007]
– Twig queries [Gou and Chirkova, 2008]
Conclusion
• Given
– Typed unweighted query
– A heterogeneous edge-weighted information network
– Edge interestingness measure
• Find
– Top-K interesting subgraphs
• Investigated ranking after matching baseline
• Proposed three new graph indexes and exploited them for building a top-K solution
• Showed efficiency, scalability and effectiveness on multiple synthetic and real datasets
Thanks!