8/13/2007 KDD 2007, San Jose
Graph X-Ray: Fast Best-Effort Pattern Matching
in Large Attributed Graphs
Hanghang Tong, Brian Gallagher,
Christos Faloutsos, Tina Eliassi-Rad
LLNL
2
Input: Attributed Data Graph, Query Graph
Output: Matching Subgraph
[Figure: nodes labeled Accountant, CEO, Manager, SEC]
3
Terminology: "Conform"
[Figure: query graph and matching subgraph; the matching subgraph conforms to the query graph]
4
Terminology: "Interception"
[Figure: query graph and matching subgraph, with the matching nodes and an intermediate node marked]
Path 12-13-4 is an interception
5
Terminology: "Instantiate"
[Figure: query graph Hq and matching subgraph Ht]
• Node 11 instantiates the SEC node
• Ht instantiates Hq
6
Roadmap
• Introduction
  – Problem Definition
  – Motivations
• How to: Graph X-Ray
• Experimental Results
• Conclusion
7
Motivation: Why Not SQL?
• Case 1: Exact match does not exist
  – Q: How to find an approximate answer?
• Case 2: Too many exact matches
  – Q: How to rank them?
8
Motivation: Why Not SQL? (Cont.)
• Case 3: Exact match might not be the best answer
  – "Find CEO who has heavy contact with Accountant"
• Q: How to find the right one?
[Figure: exact match with 1 direct connection vs. inexact match with many indirect connections]
9
Motivation: Efficiency
• Why Not Subgraph Isomorphism?
  – Polynomial for a fixed number of pattern-query nodes
• Q1: How to scale up linearly?
• Q2: … and with a small slope?
10
Wish List
• Effectiveness
  – Both exact match & inexact match
  – Ranking among multiple results
  – "Best" answer (proximity-based)
• Efficiency
  – Scale linearly
  – Scale with small slope
G-Ray meets all!
11
Roadmap
• Introduction
  – Problem Definition
  – Motivations
• How to: Graph X-Ray
• Experimental Results
• Conclusion
12
Preliminary: Center-Piece Subgraph [Tong+]
[Figure: original graph with query nodes A, B, C (black), and the extracted subgraph = CePS(A, B, C)]
CePS is the meta-operation in G-Ray!
13
Preliminary: Augmented Graph
• Data nodes
  – 1, …, 13
• Attribute nodes
  – a
[Figure: augmented graph over data nodes 1-13]
The augmented graph is crucial for computation!
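A minimal sketch of how an augmented graph could be assembled, assuming (the slide does not spell this out) that each attribute value becomes one extra node linked to every data node carrying it. The function name `augment`, the tuple encoding of attribute nodes, the edge list, and the attribute assignments are all hypothetical.

```python
def augment(edges, attrs):
    """Return an undirected adjacency dict over data nodes and attribute nodes.

    edges: iterable of (u, v) pairs between data nodes.
    attrs: dict mapping a data node to its attribute value.
    Attribute nodes are encoded as ("attr", value) tuples (an assumption).
    """
    adj = {}

    def link(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for u, v in edges:
        link(u, v)
    # One attribute node per distinct value, shared by all nodes that carry it.
    for node, value in attrs.items():
        link(node, ("attr", value))
    return adj


# Hypothetical toy input; the real edges of the slide's graph are not given.
edges = [(11, 12), (11, 7), (7, 4), (12, 13), (13, 4)]
attrs = {11: "SEC", 12: "CEO", 7: "Manager", 4: "Accountant"}
aug = augment(edges, attrs)
```

With this encoding, proximity between a data node and an attribute can be computed on one graph, the same way as proximity between two data nodes.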
14
G-Ray: quick overview (for a loop query)
• Step 1: SF
• Step 2: NE
• Step 3: BR
• Step 4: NE
• Step 5: BR
• Step 6: NE
• Step 7: BR
• Step 8: BR
SF: Seed-Finder; NE: Neighborhood-Expander; BR: Bridge
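The step sequence above can be reproduced by a simple schedule: seed one query node, then expand to each neighboring query node and bridge it to every already-matched neighbor. The sketch below is one ordering consistent with the slide, not necessarily the authors' scheduling rule; `plan_steps` and the loop's edge list (SEC-CEO-Accountant-Manager-SEC) are assumptions.

```python
from collections import deque


def plan_steps(query_edges, seed):
    """List (kind, target) steps: SF once, NE per new query node, BR per query edge."""
    adj = {}
    for u, v in query_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    steps = [("SF", seed)]
    matched = [seed]
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in matched:
                continue
            # Instantiate the new query node, then connect it to every
            # already-matched query node it shares a query edge with.
            steps.append(("NE", v))
            steps.extend(("BR", (m, v)) for m in matched if m in adj[v])
            matched.append(v)
            queue.append(v)
    return steps


# Hypothetical loop query; the exact edges on the slide are not recoverable.
loop = [("SEC", "CEO"), ("CEO", "Accountant"),
        ("Accountant", "Manager"), ("Manager", "SEC")]
steps = plan_steps(loop, "SEC")
kinds = [kind for kind, _ in steps]
```

For a 4-node loop this gives one SF, three NE, and four BR steps, with the last two bridges closing the loop.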
15
Seed-Finder
• Q: How to instantiate the SEC node?
• A: 11 = CePS( )
• Footnote: node 11 is close to some as-yet-unknown data nodes for "CEO", "Accountant", and "Manager"
[Figure: augmented graph over data nodes 1-13]
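A rough sketch of the seed-finding idea, under stated assumptions: candidates are the data nodes carrying the seed attribute, and each candidate is scored by the product of its random-walk-with-restart proximities to the attribute nodes of the other query labels. The scoring rule, the restart value, the function names, and the toy graph are hypothetical; the slide only says the seed should be close to yet-unknown nodes with the other attributes.

```python
import numpy as np


def build_augmented(n_data, edges, attrs):
    """Column-normalized matrix of the augmented graph plus the attribute-node index."""
    labels = sorted(set(attrs.values()))
    attr_index = {a: n_data + k for k, a in enumerate(labels)}
    n = n_data + len(labels)
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    for node, a in attrs.items():
        A[node, attr_index[a]] = A[attr_index[a], node] = 1.0
    return A / A.sum(axis=0, keepdims=True), attr_index


def rwr(W, start, restart=0.15):
    """Closed-form random walk with restart from `start`."""
    n = W.shape[0]
    e = np.zeros(n)
    e[start] = 1.0
    return np.linalg.solve(np.eye(n) - (1 - restart) * W, restart * e)


def find_seed(W, attr_index, attrs, seed_attr, other_attrs):
    """Pick the seed-attribute node with the largest product of proximities."""
    best, best_score = None, -1.0
    for node, a in attrs.items():
        if a != seed_attr:
            continue
        r = rwr(W, node)
        score = float(np.prod([r[attr_index[b]] for b in other_attrs]))
        if score > best_score:
            best, best_score = node, score
    return best


# Hypothetical toy graph: node 0 touches all three other roles; 4 and 5 do not.
edges = [(0, 1), (0, 2), (0, 3), (0, 5), (5, 4)]
attrs = {0: "SEC", 1: "CEO", 2: "Accountant", 3: "Manager", 4: "SEC", 5: "SEC"}
W, attr_index = build_augmented(6, edges, attrs)
seed = find_seed(W, attr_index, attrs, "SEC", ["CEO", "Accountant", "Manager"])
```

In the toy graph every walk from nodes 4 or 5 to the other roles has to pass through node 0, so node 0 scores highest.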
16
Neighborhood-Expander
• Q: How to instantiate the CEO node?
  – Step 1 → Step 2?
• A: 11 = CePS( ) 12
• Footnote:
  – Step 3 → Step 4? 11 = CePS( ) 7
  – Step 5 → Step 6? = CePS( ) 4 7 12
[Figure: augmented graph over data nodes 1-13]
17
Bridge
• Q: Step 6 (NE) → Step 7 (BR)?
• A: Prim-like algorithm
  – To maximize
  – Should block nodes 11 and 7
• Footnote
  – Connection subgraph, or one single path?
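To illustrate why already-matched nodes are blocked when bridging, here is a stand-in sketch that finds a single best path between two matched nodes. It is a Dijkstra-style search for the path with the largest product of edge scores, not the Prim-like algorithm from the talk, and the objective, the function name `best_path`, and all edge scores are hypothetical.

```python
import heapq
import math


def best_path(adj, src, dst, blocked=()):
    """Max-product path from src to dst that avoids blocked nodes.

    adj: dict node -> {neighbor: score}, scores in (0, 1].
    Maximizing a product equals minimizing the sum of -log(score).
    """
    blocked = set(blocked) - {src, dst}
    dist = {src: 0.0}
    prev = {}
    done = set()
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, w in adj.get(u, {}).items():
            if v in blocked or v in done:
                continue
            nd = d - math.log(w)
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical scores on a toy graph around matched nodes 12, 11, 7, 4.
adj = {
    12: {11: 0.9, 13: 0.5},
    11: {12: 0.9, 7: 0.9},
    7: {11: 0.9, 4: 0.9},
    13: {12: 0.5, 4: 0.5},
    4: {7: 0.9, 13: 0.5},
}
unblocked = best_path(adj, 12, 4)
bridged = best_path(adj, 12, 4, blocked=[11, 7])
```

Without blocking, the strongest route from 12 to 4 runs back through the matched nodes 11 and 7; blocking them forces the bridge through the intermediate node 13.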
18
Roadmap
• Introduction
  – Problem Definition
  – Motivation
• How to: Graph X-Ray
• Experimental Results
• Conclusion
19
Experimental Results
• Datasets
  – DBLP
  – Node: author (315k)
  – Edge: co-authorship (1,800k)
  – Attribute: conference & year (13k)
    • KDD-2001, SIGMOD…
20
Effectiveness: star-query
[Figure: star query and its result]
21
Effectiveness: line-query
[Figure: line query and its result]
22
Effectiveness: loop-query
[Figure: loop query and its result]
23
Efficiency
[Figure: average response time (seconds, 0 to 80) vs. # of edges (0 to 2 x 10^6); series: Fast FSGM, Iterative method]
• Scale linearly
• Small slope
• 3-5 seconds on ~2M edges
24
Roadmap
• Introduction
  – Problem Definition
  – Motivation
• How to: Graph X-Ray
• Experimental Results
• Conclusion
25
Conclusion
• Graph X-Ray (G-Ray)
  – Best-effort pattern match
    • in large attributed graphs
  – Scales linearly
    • with small slope
• More details in Poster Session
  – Monday (tonight)
  – board number 8
26
www.cs.cmu.edu/~htong
[Figure: X-Ray and G-Ray side by side; data graph with nodes 1-13 and matching subgraph with nodes 12, 11, 4, 7, 13]
Thank you!
27
Backup-slides
28
Proximity on Graph
a.k.a. relevance, closeness
• Multi-faceted
• Punish long paths
• Edge weight
How to: random walk with restart
[Figure: example graph with nodes 1-12]
29
Random walk with restart
Ranking vector r_4 (query node: Node 4)
Node      Score
Node 1    0.13
Node 2    0.10
Node 3    0.13
Node 4    0.22
Node 5    0.13
Node 6    0.05
Node 7    0.05
Node 8    0.08
Node 9    0.04
Node 10   0.03
Node 11   0.04
Node 12   0.02
[Figure: example graph with nodes 1-12, colored by score]
• More red, more relevant
• Nearby nodes, higher scores
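A minimal sketch of random walk with restart by power iteration, assuming the common formulation r = (1 - c) W r + c e with a column-normalized adjacency matrix W and restart probability c. The restart value, the function name `rwr`, and the toy path graph are assumptions; the edges of the slide's 12-node graph are not recoverable, so its scores are not reproduced here.

```python
import numpy as np


def rwr(adjacency, query, restart=0.5, tol=1e-12, max_iter=1000):
    """Ranking vector of all nodes with respect to `query`."""
    A = np.asarray(adjacency, dtype=float)
    W = A / A.sum(axis=0, keepdims=True)  # column-stochastic transitions
    e = np.zeros(A.shape[0])
    e[query] = 1.0
    r = e.copy()
    for _ in range(max_iter):
        # Walk one step with probability 1 - restart, else jump back to query.
        r_next = (1 - restart) * W @ r + restart * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r


# Hypothetical toy graph: a path 0 - 1 - 2 - 3, queried from node 0.
path_graph = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
r = rwr(path_graph, query=0)
```

On the toy path the scores fall off with distance from the query node, matching the "nearby nodes, higher scores" reading of the ranking vector.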
30
How to rank the results
• Our goodness function
  – Measure the proximity between any two matching nodes if they are required to be connected (two-way)
  – Multiply them together
• In G-Ray, we approximately optimize this goodness function
• If we have multiple matching subgraphs, we can rank them according to this goodness function
31
How to rank the results
[Figure: matching subgraph with matching nodes 12, 4, 7, 11]
Goodness = Prox(12, 4) × Prox(4, 12) × Prox(7, 4) × Prox(4, 7) × Prox(11, 7) × Prox(7, 11) × Prox(12, 11) × Prox(11, 12)
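The goodness formula above can be written as a small function: for every pair of matching nodes that the query requires to be connected, multiply the proximity in both directions. The function name `goodness` and the uniform placeholder proximity are assumptions; in practice `prox` would come from a proximity measure such as random walk with restart.

```python
import math


def goodness(matched_edges, prox):
    """Product of two-way proximities over the required connections.

    matched_edges: (u, v) pairs of matching nodes that must be connected.
    prox: callable returning the proximity from its first to its second argument.
    """
    score = 1.0
    for u, v in matched_edges:
        score *= prox(u, v) * prox(v, u)
    return score


# Connections from the slide's example; the proximity values are placeholders.
matched_edges = [(12, 4), (7, 4), (11, 7), (12, 11)]
g = goodness(matched_edges, lambda u, v: 0.5)
```

Because the score is a plain product, competing matching subgraphs can be ranked by sorting on it.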