Searching for and Comparing Trees and Graphs - PowerPoint PPT Presentation

Provided by: dennis47
Learn more at: https://cs.nyu.edu
Transcript and Presenter's Notes

1
Searching for and Comparing Trees and Graphs
  • Dennis Shasha, shasha_at_cs.nyu.edu
  • Courant Institute, NYU
  • Joint work with
  • Kaizhong Zhang and Jason Wang

2
Philosophy
  • Trees and graphs represent data in many domains
    in linguistics, chemistry, and even maybe the
    web.
  • Question: why can't I search for trees or graphs
    at the speed of keyword searches?
  • Why can't I compare trees (or graphs) as easily
    as I can compare strings?

3
Tree Searching
  • Given a small tree t, is it present in a bigger
    tree T?

[Figure: a small tree t and a bigger tree T]
4
What does "present" mean?
  • Preserving sibling order or not
  • Preserving ancestor order
  • Preserving distance
  • Mismatches

5
Sibling Order
  • Order of children of a node

[Figure: two trees over the nodes A, B, C, with a "?" between them]
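
The sibling-order question can be made concrete with a minimal sketch, assuming a tree is a nested (label, children) pair. The function names are hypothetical and this is not part of any tool named in these slides.

```python
def canonical(tree):
    """Return a form of the tree in which sibling order no longer matters."""
    label, children = tree
    # Sorting the canonical forms of the children erases their original order.
    return (label, tuple(sorted(canonical(child) for child in children)))

def same_ordered(t1, t2):
    """Equal only if labels, shape, and sibling order all agree."""
    return t1 == t2

def same_unordered(t1, t2):
    """Equal if labels and shape agree, whatever the sibling order."""
    return canonical(t1) == canonical(t2)

left = ("A", [("B", []), ("C", [])])
right = ("A", [("C", []), ("B", [])])
print(same_ordered(left, right), same_unordered(left, right))
```

Whether the two trees count as the same depends entirely on which comparison is chosen.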
6
Ancestor Order
  • Order between children and parent.

[Figure: two trees over the nodes A, B, C, with a "?" between them]
7
Ancestor Distance
  • Can children become grandchildren?

[Figure: two trees over the nodes A, B, C, one of them with an extra node X,
with a "?" between them]
8
Mismatches
  • Can there be relabellings, inserts, and deletes
    (Tolstoy problem)?

[Figure: two trees over the nodes A, B, C, X, with the question "how far?"
between them]
9
Bottom Line
  • There is no one definition of mismatch or subtree
    (Tolstoy problem). You must choose the package
    that suits you.
  • I will tell you about three.

10
TreeSearch Query Language
  • The query language is simply a tree decorated with
    single-length don't cares (?) and variable-length
    don't cares ().

[Figure: example query tree with nodes A, B, C, D and a "?" node;
annotations: "0, on each side" and "1"]
11
Exact Match
  • A query matches exactly if it is contained in the
    data tree, regardless of sibling order or other nodes.

[Figure: a query tree with a "?" node and a data tree that contains it]
12
Inexact Match
  • A match is inexact if node labels are missing or
    differ. Larger differences cost more.

[Figure: a query tree with a "?" node and a data tree; annotation: "Differ by 1"]
13
TreeSearch Conceptual Algorithm
  • Take all paths in the query tree.
  • Find out where each path is in the data tree.
  • So the notion of distance is the number of paths
    that differ. Higher nodes are more important.
  • Implementation: suffix array. A few seconds on
    several thousand trees.
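
The path idea above can be sketched in a few lines, assuming trees are nested (label, children) pairs. The function names are hypothetical. This sketch uses a plain set of paths rather than a suffix array, gives every path the same weight instead of favoring higher nodes, and ignores don't cares.

```python
def root_to_leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path of the tree as a tuple of labels."""
    label, children = tree
    path = prefix + (label,)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)

def all_downward_paths(tree):
    """Return the label sequences of every downward path in the tree."""
    found = set()

    def walk(node, open_paths):
        label, children = node
        # Extend each path ending at the parent, and start a new path here.
        extended = [p + (label,) for p in open_paths] + [(label,)]
        found.update(extended)
        for child in children:
            walk(child, extended)

    walk(tree, [])
    return found

def path_distance(query, data):
    """Count the query paths that occur nowhere in the data tree."""
    data_paths = all_downward_paths(data)
    return sum(1 for p in root_to_leaf_paths(query) if p not in data_paths)

query = ("A", [("B", []), ("C", [])])
data = ("X", [("A", [("C", []), ("B", []), ("D", [])])])
print(path_distance(query, data))
```

Because each path is looked up independently, sibling order has no effect. The sketch also does not check that all matched paths start at the same data node, which a real search would need to do.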

14
TreeSearch Review
  • Ancestor order matters.
  • Sibling order doesn't.
  • Don't cares: variable-length and single-length (?).
  • The distance metric is based on the number of path
    differences.
  • Sister system built by Divesh and Sihem at Bell
    Labs that allows terms to be generalized.

15
Screenshots of TreeSearch on TreeBASE
16
Query Screen
17
Query Tree Format
18
Search Results
19
Query Tree
20
One of the Result Trees
21
Related Work
  • S. Amer-Yahia, S. Cho, L.V.S. Lakshmanan, and D.
    Srivastava. Minimization of tree pattern queries.
    SIGMOD, 2001.
  • Z. Chen, H. V. Jagadish, F. Korn, N. Koudas, S.
    Muthukrishnan, R. T. Ng, and D. Srivastava.
    Counting twig matches in a tree. ICDE, 2001.
  • J. Cracraft and M. Donoghue. Assembling the tree
    of life: Research needs in phylogenetics and
    phyloinformatics. NSF Workshop Report, Yale
    University, 2000.

22
Tree Edit
  • Order of children matters

[Figure: two trees over the nodes A, B, C; edit script: A-A del(B) ins(B)]
23
Tree Edit in General
  • Operations are relabel A-A, delete (X), insert
    (B).

[Figure: two trees over the nodes A, B, C, X; edit script: A-A del(X) ins(B)]
24
Review of Tree Edit
  • Generalizes string edit distance to trees; computed
    by a dynamic programming algorithm.
  • O(|T1| |T2| depth(T1) depth(T2))
  • The basis for XMLdiff.
  • Also supports variable-length don't cares and best
    removal of subtrees.
25
Related Work
  • IBM XML Diff and Merge Tool.
    http://www.alphaworks.ibm.com/aw.nsf/textResearchers/CB2EF938D7532F338825671B0068244F
  • K. Zhang and D. Shasha. Editing distance between
    trees. SIAM J. Comp., 1989.
  • K. Zhang, D. Shasha and J. T. L. Wang.
    Approximate tree matching in the presence of
    variable length don't cares. Journal of
    Algorithms, 1994.

26
Graph Edit
  • Thesis work of Rosalba Giugno.
  • Find a small graph (with variable-length don't cares
    and ?) in a big graph.
  • Doesn't work fast if the query graph is big, because
    subgraph isomorphism takes exponential time in the
    worst case.
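
To see why big query graphs are slow, consider a naive backtracking search for a labeled query graph inside a data graph. This is a generic sketch, not the GraphGrep method. The function name and the graph representation (a dict of node labels plus a list of undirected edges) are assumptions, and only the single-node don't care "?" is handled.

```python
def find_subgraph(query_labels, query_edges, data_labels, data_edges):
    """Return one mapping of query nodes to data nodes, or None."""
    q_adj = {n: set() for n in query_labels}
    for u, v in query_edges:
        q_adj[u].add(v)
        q_adj[v].add(u)
    d_adj = {n: set() for n in data_labels}
    for u, v in data_edges:
        d_adj[u].add(v)
        d_adj[v].add(u)
    order = list(query_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return dict(mapping)
        q = order[len(mapping)]
        for d in data_labels:
            if d in mapping.values():
                continue  # each data node is used at most once
            if query_labels[q] != "?" and query_labels[q] != data_labels[d]:
                continue  # labels must agree unless the query node is "?"
            # Every already-mapped query neighbor must land on a neighbor of d.
            if all(mapping[n] in d_adj[d] for n in q_adj[q] if n in mapping):
                mapping[q] = d
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[q]  # backtrack and try the next candidate
        return None

    return extend({})

data_labels = {"a": "A", "b": "B", "c": "C", "d": "D"}
data_edges = [("a", "b"), ("b", "c"), ("c", "d")]
match = find_subgraph({0: "A", 1: "B", 2: "?"}, [(0, 1), (1, 2)],
                      data_labels, data_edges)
print(match)
```

Each query node may be tried against every data node, so the number of partial mappings can grow exponentially with the size of the query.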

27
Example of GraphGrep
  • The query graph has nodes and don't cares.

[Figure: example query graph with nodes A, B, C, D]
28
Related Work
  • P. Buneman, M. F. Fernandez, and D. Suciu. UnQL:
    a query language and algebra for semistructured
    data based on structural recursion. VLDB Journal,
    2000.
  • A. O. Mendelzon and P. T. Wood. Finding regular
    simple paths in graph databases. VLDB, 1989. 
  • Daylight Chemical Information Systems.
    http://www.daylight.com/
  • Protein Structure Search. http://sss.berkeley.edu/
  • Web Structure Search.
    http://www.almaden.ibm.com/cs/k53/clever.html

29
Summary of Tools
  • Why can't tree and graph search be like keyword
    search?
  • We are getting there and will provide software if
    you are interested.
  • About 50 downloads so far.

30
URLs for Tools
  • http://www.cs.nyu.edu/shasha/papers/graphgrep
  • http://cs.nyu.edu/cs/faculty/shasha/papers/treesearch.html
  • http://web.njit.edu/wangj/sigmod.html