Requests to Tsong-Li
• 1. Related work at end of each section
• 2. Screen dumps of treebase at end of treesearch section (you’ll see where)
• 3. Web addresses at the very end.
Searching for and Comparing Trees and Graphs
Dennis Shasha, [email protected]
Courant Institute, NYU
Joint work with
Kaizhong Zhang and Jason Wang
Philosophy
• Trees and graphs represent data in many domains: linguistics, chemistry, and perhaps even the web.
• Question: why can’t I search for trees or graphs at the speed of keyword searches?
• Why can’t I compare trees (or graphs) as easily as I can compare strings?
Tree Searching
• Given a small tree t, is it present in a bigger tree T?
What does “present” mean?
• Preserving sibling order or not
• Preserving ancestor order
• Preserving distance
• Mismatches
Sibling Order
• Order of children of a node:
A(B, C)  ?=  A(C, B)
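A minimal sketch of the two possible answers to the question above. The (label, children) encoding and the function names are assumptions for illustration, not part of any of the tools discussed here.

```python
# A tree is a (label, children) pair; this encoding is an assumption.

def ordered_equal(t1, t2):
    """Equal only if labels and children agree in the same left-to-right order."""
    label1, kids1 = t1
    label2, kids2 = t2
    return (label1 == label2
            and len(kids1) == len(kids2)
            and all(ordered_equal(a, b) for a, b in zip(kids1, kids2)))


def canonical(t):
    """Canonical form that ignores sibling order: children are sorted recursively."""
    label, kids = t
    return (label, tuple(sorted(canonical(k) for k in kids)))


def unordered_equal(t1, t2):
    """Equal if the trees agree once sibling order is ignored."""
    return canonical(t1) == canonical(t2)


left = ("A", [("B", []), ("C", [])])
right = ("A", [("C", []), ("B", [])])
print(ordered_equal(left, right))    # False: sibling order is preserved
print(unordered_equal(left, right))  # True: sibling order is ignored
```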
Ancestor Order
• Order between children and parent.
A(B, C)  ?=  A(C(B))
Ancestor Distance
• Can children become grandchildren:
A(B, C)  ?=  A(B, X(C))
Mismatches
• Can there be relabellings, inserts, and deletes (Tolstoy problem):
A(B, C)  how far?  A(X, C)
Bottom Line
• There is no one definition of mismatch or subtree (Tolstoy problem). You must choose the package that suits you.
• I will tell you about three.
TreeSearch Query Language
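• The query language is simply a tree decorated with single-length don’t-cares (?) and variable-length don’t-cares (*).
[Figure: query tree with nodes A, *, B, C, ?, D. Labels on the figure: * matches >= 0 nodes, on each side; ? matches exactly 1 node.]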
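To make the two don't-cares concrete, here is a small sketch that matches a single path pattern against a single path of node labels. It is an illustration only: the function name `path_matches` is hypothetical, and it assumes the whole path must be matched from first label to last.

```python
def path_matches(pattern, path):
    """True if the label sequence `path` matches `pattern`.

    '?' stands for exactly one node, '*' for zero or more nodes.
    """
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        # '*' either absorbs nothing, or absorbs one node and tries again.
        return path_matches(rest, path) or (
            bool(path) and path_matches(pattern, path[1:]))
    if not path:
        return False
    if head == "?" or head == path[0]:
        return path_matches(rest, path[1:])
    return False


print(path_matches(["A", "*", "B"], ["A", "B"]))            # True
print(path_matches(["A", "*", "B"], ["A", "X", "Y", "B"]))  # True
print(path_matches(["C", "?", "D"], ["C", "Q", "D"]))       # True
print(path_matches(["C", "?", "D"], ["C", "D"]))            # False
```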
Exact Match
• Query matches exactly if it is contained in the data tree, regardless of sibling order or other nodes.
[Figure: the query tree (A, *, B, C, ?, D) = a data tree with nodes X, Y, A, W, Z, C, B, X, Q, D, U.]
Inexact Match
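• Inexact match if node labels are missing or differ. Differences higher in the tree cost more.
[Figure: the same query tree differs by 1 from a data tree with nodes X, Y, A, W, Z, C, B, X, Q, E, U (E appears where the query has D).]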
Treesearch Conceptual Algorithm
• Take all paths in query tree.
• Find out where each path is in the data tree.
• So the notion of distance is the number of paths that differ. Higher nodes are more important.
• Implementation: suffix array. A few seconds on several thousand trees.
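The conceptual algorithm can be sketched as follows. This is a simplification under stated assumptions: it uses a plain Python set instead of a suffix array, ignores don't-cares, and counts every differing path equally rather than weighting higher nodes more. The (label, children) encoding and all function names are hypothetical.

```python
def root_to_leaf_paths(tree):
    """All label sequences from the root of `tree` down to each leaf."""
    label, kids = tree
    if not kids:
        return [(label,)]
    return [(label,) + p for k in kids for p in root_to_leaf_paths(k)]


def downward_paths(tree):
    """All label sequences that start at any node and walk downward."""
    found = set()

    def walk(node, open_paths):
        label, kids = node
        # Extend every path ending at the parent, and start a new one here.
        extended = [p + (label,) for p in open_paths] + [(label,)]
        found.update(extended)
        for k in kids:
            walk(k, extended)

    walk(tree, [])
    return found


def path_distance(query, data):
    """Number of query root-to-leaf paths that do not occur in the data tree."""
    index = downward_paths(data)
    return sum(1 for p in root_to_leaf_paths(query) if p not in index)


query = ("A", [("B", []), ("C", [("D", [])])])
data1 = ("X", [("Y", []), ("A", [("C", [("D", [])]), ("B", [])])])
data2 = ("X", [("Y", []), ("A", [("C", [("E", [])]), ("B", [])])])
print(path_distance(query, data1))  # 0: both query paths are present
print(path_distance(query, data2))  # 1: the path A, C, D is missing
```

Because paths are compared rather than child lists, sibling order is ignored while ancestor order is preserved.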
Treesearch Review
• Ancestor order matters.
• Sibling order doesn’t.
• Don’t cares: * and ?
• Distance metric is based on numbers of path differences.
• Sister system built by Divesh and Sihem at Bell Labs that allows terms to be “generalized”
Tsong-Li: screen dumps of treebase then related work
Tree Edit
• Order of children matters
A(B, C)  ->  A’(C, B):  A->A’, del(B), ins(B)
Tree Edit in General
• Operations are relabel A->A’, delete (X), insert (B).
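[Figure: a tree rooted at A (nodes X, C) edited into a tree rooted at A’ (nodes C, B); two further C nodes appear at the bottom of the figure. Slide annotation: A->A’, del(B), ins(B).]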
Review of Tree Edit
• Generalizes string edit distance to trees; computed by a dynamic programming algorithm.
• O(|T1| |T2| depth(T1) depth(T2))
• The basis for XMLdiff.
• Also supports * and best removal of subtrees.
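A minimal sketch of ordered tree edit distance with unit costs for relabel, delete, and insert. It uses the textbook recursion on rightmost roots with memoization, not the optimized algorithm whose bound is quoted above, and it leaves out * and subtree removal. The (label, children-tuple) encoding and the function names are assumptions.

```python
from functools import lru_cache

# A tree is (label, children) with children stored as a tuple, so forests are hashable.

def size(forest):
    """Number of nodes in a forest (a tuple of trees)."""
    return sum(1 + size(kids) for _, kids in forest)


@lru_cache(maxsize=None)
def forest_dist(f1, f2):
    """Edit distance between two ordered forests."""
    if not f1 and not f2:
        return 0
    if not f1:
        return size(f2)  # insert every node of f2
    if not f2:
        return size(f1)  # delete every node of f1
    (l1, k1), (l2, k2) = f1[-1], f2[-1]
    return min(
        # delete the rightmost root of f1; its children move up
        forest_dist(f1[:-1] + k1, f2) + 1,
        # insert the rightmost root of f2
        forest_dist(f1, f2[:-1] + k2) + 1,
        # match the two rightmost roots, relabeling if the labels differ
        forest_dist(k1, k2) + forest_dist(f1[:-1], f2[:-1]) + (l1 != l2),
    )


def tree_edit_distance(t1, t2):
    return forest_dist((t1,), (t2,))


t1 = ("A", (("B", ()), ("C", ())))
t2 = ("A'", (("C", ()), ("B", ())))
print(tree_edit_distance(t1, t2))  # 3, e.g. A->A', del(B), ins(B)
print(tree_edit_distance(t1, t1))  # 0
```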
Tsong-Li: related work here
Graph Edit
• Thesis work of Rosalba Giugno.
• Find a small graph (with * and ?) in a big graph.
• Doesn’t work fast if the query graph is big, because subgraph isomorphism takes exponential time in the worst case.
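To show where the exponential cost comes from, here is a naive backtracking sketch that looks for a small labeled graph inside a bigger one. It is not GraphGrep's method: it assumes undirected graphs, exact label matches, and no don't-cares, and the function name `find_subgraph` is hypothetical.

```python
def find_subgraph(q_labels, q_edges, g_labels, g_edges):
    """Return one mapping {query node: data node}, or None if there is none.

    Labels are dicts node -> label; edges are lists of undirected node pairs.
    """
    g_adj = {v: set() for v in g_labels}
    for u, v in g_edges:
        g_adj[u].add(v)
        g_adj[v].add(u)
    q_adj = {v: set() for v in q_labels}
    for u, v in q_edges:
        q_adj[u].add(v)
        q_adj[v].add(u)
    order = list(q_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]
        # Try every unused data node with the right label: this is the
        # branching that blows up as the query graph grows.
        for g in g_labels:
            if g in mapping.values() or g_labels[g] != q_labels[q]:
                continue
            # Every already-mapped query neighbor must be a data neighbor too.
            if all(mapping[n] in g_adj[g] for n in q_adj[q] if n in mapping):
                mapping[q] = g
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]
        return None

    return extend({})


query_labels = {1: "A", 2: "B"}
query_edges = [(1, 2)]
data_labels = {"a": "A", "b": "C", "c": "B", "d": "A"}
data_edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(find_subgraph(query_labels, query_edges, data_labels, data_edges))
# {1: 'd', 2: 'c'}
```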
Example of GraphGrep
• Query graph has nodes and don’t cares
[Figure: query graph with nodes A, B, C, D and a * don’t-care.]
Summary of Tools
• Why can’t tree and graph search be like keyword search?
• We are getting there and will provide software if you are interested.
• About 50 downloads so far.